Server session scalability: admission control, retry-after signaling, and held-Publish decoupling #3946

Open
marcschier wants to merge 44 commits into master from copilot/server-client-rate-limiting

marcschier commented Jul 3, 2026

Collaborator

Description

Server session scalability work: the reference server degrades gracefully under connect storms, holds many thousands of steady-state long-polls on a small worker pool, and gives clients diagnostics-independent backpressure signals so they ramp instead of hammering. This combines three related efforts — and merges in #3950 — all follow-ups to the scalability analysis in #3941 (Docs/ServerScalability.md).

Server admission control / rate limiting is enabled by default with conservative limits (it sheds excess load with a fast BadServerTooBusy rather than altering steady-state behavior), and the held-Publish decoupling is on by default; the HTTPS rate limiter and the client-side connect gate are opt-in. Nothing here changes the shipped MaxSessionCount (100).

1. Admission control / rate limiting

  • TCP listener admission: a configurable token-bucket connection gate sheds over-limit connections cheaply instead of accepting and starving them; the socket backlog is now configurable as well (default 512, raised from a hard-coded 10).
  • Session-establishment admission: CreateSession / ActivateSession acquire a permit from a concurrency limiter; over-limit requests fault fast with BadServerTooBusy before the CPU-bound certificate validation / signing, instead of queueing behind the address-space lock.
  • CreateSession RSA out of the lock: the server-nonce signature runs outside the session semaphore, so a slow signature no longer serializes unrelated session establishment.
  • HTTPS/Kestrel: AddHttpsRateLimiter(...) attaches an ASP.NET Core rate limiter to the HTTPS binding via dependency injection.
  • Built on System.Threading.RateLimiting, DI-injectable (IServerRateLimiterProvider, ConfigureRateLimits(...)) with a direct-construct fallback.
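
As a rough illustration of the token-bucket idea behind the connection gate, the sketch below admits a burst up to the bucket size and then refills at a steady rate. It is a hand-rolled, single-threaded stand-in: `TokenBucketSketch`, its parameters, and the injected clock value are hypothetical names chosen for this example, and it is neither the PR's `TokenBucketConnectionRateLimiter` nor the `System.Threading.RateLimiting` implementation.

```csharp
using System;

// Hypothetical illustration type; not thread-safe and not the PR's limiter.
public sealed class TokenBucketSketch
{
    private readonly double _burst;            // maximum tokens the bucket can hold
    private readonly double _tokensPerSecond;  // steady-state refill rate
    private double _tokens;
    private TimeSpan _lastRefill;

    // "now" is an injected monotonic clock reading so the behavior is deterministic.
    public TokenBucketSketch(double burst, double tokensPerSecond, TimeSpan now)
    {
        _burst = burst;
        _tokensPerSecond = tokensPerSecond;
        _tokens = burst;        // start full so an initial burst is admitted
        _lastRefill = now;
    }

    // Returns true to admit the connection, false to shed it without further work.
    public bool TryAdmit(TimeSpan now)
    {
        double elapsedSeconds = (now - _lastRefill).TotalSeconds;
        if (elapsedSeconds > 0)
        {
            // Refill proportionally to elapsed time, never beyond the burst size.
            _tokens = Math.Min(_burst, _tokens + (elapsedSeconds * _tokensPerSecond));
            _lastRefill = now;
        }

        if (_tokens < 1.0)
        {
            return false;       // over the limit: reject cheaply
        }

        _tokens -= 1.0;
        return true;
    }
}
```

With a burst of 2 and a rate of 1 token per second, two connections arriving at the same instant are admitted, a third is shed, and one more is admitted a second later.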

2. Retry-after signaling (diagnostics-independent)

A BadServerTooBusy fault's diagnostic AdditionalInfo only reaches the client when it requests diagnostics, and is dropped on client exception re-wrapping. This delivers the retry-after through carriers that do not depend on diagnostics, all feeding the client's adaptive reconnect policy:

  • ResponseHeader.additionalHeader (all UA transports) — a structured RetryAfterMs (Int64) in the standard AdditionalParametersType, attached to the fault by ServerBusyException + EndpointBase.CreateFault (RetryAfterHeader) and read by ClientBase.ValidateResponse.
  • HTTP Retry-After (HTTPS) — the rate limiter sets the header on its 429; the client maps a 429/503 with Retry-After to BadServerTooBusy.
  • UA-TCP Error reason — the client honors a RetryAfterMs=N token in a transient server-busy ERR message as a lower bound on channel-reconnect backoff (RetryAfterHint).
  • Client foundation — the hint survives the diagnostics gate and exception re-wrapping and reaches IReconnectPolicy.TryGetNextDelay (server-signal-aware adaptive backoff); a client-wide connect admission gate (WithConnectRateLimiter) ramps bulk connects.
  • Load-based Server.ServiceLevel — computed from session-establishment headroom (255 at low load, scaling toward a floor as sessions approach MaxSessionCount, with hysteresis) as a proactive capacity signal clients can read/subscribe.
  • Implementation summary folded into Docs/Sessions.md (Server retry-after backpressure section) + a submittable OPC UA spec-change proposal: Docs/proposals/RetryAfter.md.

3. Held long-poll Publish decoupling

A parked Publish (waiting for the next subscription notification) no longer occupies a MaxRequestThreadCount worker slot for the whole wait, so a small fixed worker pool can hold many thousands of outstanding Publishes and MaxRequestThreadCount no longer has to scale with session count.

  • IRequestParkSink + a one-shot RequestLifetime.ParkSink; ServerBase.RequestQueue awaits Task.WhenAny(processing, ParkedTask) and releases the active-worker slot at the park point rather than at completion. Only Publish carries a sink — every other request keeps the byte-for-byte legacy fast path (no extra allocation, no WhenAny).
  • DecoupleHeldPublishRequests (default true; set false to restore the legacy inline-await worker path).
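
The park-or-complete pattern described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code in `ServerBase.RequestQueue`: `ParkOrCompleteSketch`, `RunAsync`, and the use of a `SemaphoreSlim` as the worker-slot budget are hypothetical, and a null `parked` task stands in for a request that carries no park sink.

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical illustration; not the PR's request queue.
public static class ParkOrCompleteSketch
{
    // workerSlots models the MaxRequestThreadCount budget.
    // parked is null for requests that cannot park (the inline path).
    public static async Task RunAsync(SemaphoreSlim workerSlots, Task processing, Task parked)
    {
        await workerSlots.WaitAsync().ConfigureAwait(false);
        try
        {
            if (parked == null)
            {
                // Non-parkable request: hold the slot until completion, no WhenAny.
                await processing.ConfigureAwait(false);
                return;
            }

            // Parkable request: resume on completion or on the park signal,
            // whichever happens first.
            await Task.WhenAny(processing, parked).ConfigureAwait(false);
        }
        finally
        {
            // The slot is freed at the park point (or at completion if that came first).
            workerSlots.Release();
        }

        // The slot is already released; still observe the request's outcome.
        await processing.ConfigureAwait(false);
    }
}
```

In this model a single slot can serve a second request while the first one is still parked, which is the property the PR's RequestQueueTests are described as checking.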

Docs

Docs/ServerScalability.md (rewritten as user-facing scalability documentation), Docs/RateLimiting.md, Docs/Sessions.md, Docs/proposals/RetryAfter.md.

Tests

Server admission (SessionAdmissionRateLimitTests); retry-after carriers (RetryAfterHeaderTests, RetryAfterHintTests, EndpointBaseTests, ClientBaseTests, HttpsTransportChannelTests, AdaptiveReconnectPolicyTests, ConnectionStateMachineTests); load-based ServiceLevel (ServerServiceLevelCalculatorTests); held-Publish decoupling (RequestQueueTests, SessionPublishQueueTests); AOT coverage (RateLimitingAotTests). Changed libraries build clean on net48 and net10.0.

Related Issues

Merges #3950 ("decouple held long-poll Publishes"); follow-up to the scalability analysis in #3941.

Checklist

  • I have signed the CLA and read the CONTRIBUTING doc.
  • I have added tests that prove my fix is effective or that my feature works and increased code coverage.
  • I have added all necessary documentation.
  • I have verified that my changes do not introduce (new) build or analyzer warnings.
  • I ran all tests locally using the UA.slnx solution against at least .net framework and .net 10, and all passed.
  • I fixed all failing and flaky tests in the CI pipelines and all CodeQL warnings.
  • I have addressed all PR feedback received.

agent added 21 commits July 2, 2026 11:11
Adds an additive, non-breaking asynchronous change-notification API to
NodeState and routes the MonitoredNode2 push path through it, so a node
that exposes an asynchronous value read handler (OnReadValueAsync) is
honored without blocking a thread and the per-node channel applies
back-pressure by awaiting rather than blocking.

NodeState:
- ClearChangeMasksAsync / OnStateChangedAsync / StateChangedAsync
- ReportEventAsync / OnReportEventAsync (+ BaseInstanceState override)
- The synchronous ClearChangeMasks / ReportEvent now also drive the async
  sinks, completing inline for a synchronously-readable node with channel
  capacity and blocking only when a genuinely async read is in flight or
  the channel is full (only synchronous callers pay that cost).

MonitoredNode2:
- OnMonitoredNodeChangedAsync / OnReportEventAsync producers read each
  attribute at enqueue time via ReadAttributeAsync (strict enqueue-time
  snapshot, honoring async read handlers) and enqueue by awaiting the
  bounded channel. Synchronous wrappers are retained.

Internal async flush sites migrated to await ClearChangeMasksAsync
(AsyncCustomNodeManager.WriteAsync value flush, node removal,
ConfigurationNodeManager, KeyCredentialPushSubject). FIFO/LIFO discard and
the Overflow bit remain a per-item queue concern; the shared node channel
stays a lossless conduit.

Tests: MonitoredNode2 async-read-honored, no-block-on-slow-async-read, and
async event delivery; NodeState async/sync dispatch. Docs updated in
AsyncServerSupport.md.

B (event source-report async):
- Add IServerInternal.ReportEventAsync + ServerInternalData.ReportEventAsync
  (over the NodeState.ReportEventAsync added earlier).
- AsyncCustomNodeManager: async root-notifier OnReportEventAsync handler that
  awaits Server.ReportEventAsync; rewire the root-notifier slots to the async
  handler (no OnReportEvent overrides exist, so this is safe). The rare
  model-change Raise* helpers stay synchronous (they drive the async producer
  via the sync wrapper; events do not async-read, so a full async cascade
  through model-change/audit is near-zero value).

C (remaining flush migration):
- Migrate KeyCredentialPushSubject.BindAsync to await ClearChangeMasksAsync.
  All async-context ClearChangeMasks sites are now migrated; genuinely
  synchronous sites (builders, diagnostics timers, sync CustomNodeManager,
  supervision) remain on the sync wrapper (correct, not hot paths).

Tests: update the 3 root-notifier tests (parameterized across the async and
sync managers) to accept either the sync OnReportEvent or async
OnReportEventAsync slot, and to capture the routed event via either
Server.ReportEvent or Server.ReportEventAsync.

Validation: 1900 Server + 144 History AlarmsAndConditions pass; net10/net48
build 0-warning. (D micro-benchmark skipped per user; A load test blocked on
the load-test branch merging to master.)
…change-notifications

# Conflicts:
#	Libraries/Opc.Ua.Server/NodeManager/MonitoredItem/MonitoredNode.cs
…Node2 tests

Review feedback:
- NodeState: sync ReportEvent/ClearChangeMasks wrappers now offload the
  async sink to the thread pool when a SynchronizationContext is present,
  so a context-capturing sink cannot deadlock the blocking wait (the
  no-context server fast path is unchanged).
- MonitoredNode: propagate OperationCanceledException from the async
  producer paths instead of swallowing it (kept ChannelClosedException).
- MonitoredNode2 tests: assert delivery Wait returns true instead of
  ignoring the result.
- Docs: US spelling.

Test fix (root cause of the 2 async-read test timeouts): IMonitoredItem
.QueueValue takes an `in DataValue` parameter. Moq's It.IsAny<DataValue>()
does not match `in`/`ref` parameters, so the .Callback signal never fired
and the delivery Wait timed out at 30 s even though QueueValue was invoked
correctly (Invocations still recorded it, which is why the old assertions
passed). Switched all six QueueValue setups/verifications to
It.Ref<DataValue>.IsAny. MonitoredNode2Tests now run in ~0.35 s instead of
~1 min of masked 30 s timeouts. Production delivery was always correct.

Add Docs/ServerScalability.md documenting why a single node tops out at
~2000 concurrent sessions and how to move beyond it, grounded in code
references:

- Framing: the 2000 ceiling is an establishment (connect-storm) ceiling,
  not steady-state (500 establishes cleanly and delivers 100% of
  notifications; 2000 collapses via a retry -> duplicate-session feedback
  loop, ~4432 creates for a 2000 target).
- Establishment boundaries B1-B5: socket backlog hard-coded to 10; every
  transient handshake error -> BadTcpInternalError -> client retry
  amplifier; O(N^2) session-diagnostics rescan under the global
  address-space semaphore; CreateSession RSA signing inside a global
  semaphore; RSA CPU saturation -> mass subscription abandonment.
- Steady-state boundaries S1-S4: held Publish is async-parked but the
  request-queue accounting couples MaxRequestThreadCount / ThreadPool to
  session count; per-session publish cap; O(N) sweep; ABANDONED feedback.
- Where to linearize: admission control / rate limiting at connection
  accept and session establishment (fast BadServerTooBusy / Retry-After
  instead of mid-handshake aborts), plus a prioritized A/B/C roadmap.

Link the new doc from Docs/README.md and cross-link it with the Server
session scalability section of Docs/Benchmarks.md. Docs-only; no code
changes.
…ishment

Server-side admission control based on System.Threading.RateLimiting, ON by
default with conservative limits (validated: all 1901 Server.Tests pass).

Foundation (Opc.Ua.Server/RateLimiting):
- ServerRateLimitOptions: deterministic, configurable limits (backlog,
  connection token-bucket rate/burst, session-establishment concurrency).
- IServerRateLimiterProvider + DefaultServerRateLimiterProvider: builds the
  limiters from options; DI-replaceable, direct-construct fallback.
- TokenBucketConnectionRateLimiter: IConnectionRateLimiter over a
  TokenBucketRateLimiter.

Transport (Opc.Ua.Core, P1 / B1+B2):
- IConnectionRateLimiter seam + TransportListenerSettings.ListenBacklog and
  .ConnectionRateLimiter.
- TcpTransportListener: configurable backlog (default raised 10 -> 512) and a
  connection admission check in OnAccept that sheds a storm cheaply.
- ServerBase.ConfigureTransportListenerSettings virtual hook.

Session establishment (StandardServer, P2):
- CreateSession/ActivateSession acquire a concurrency permit before the
  CPU-bound crypto; at capacity they return BadServerTooBusy with a
  retry-after hint in the fault message. Provider wired via OnServerStarting
  (default) or the settable RateLimiterProvider/RateLimitOptions (DI/direct).

Deferred: B4 (move CreateSession RSA signing out of m_semaphoreSlim) needs
careful cert-rotation lifetime analysis; tracked separately. HTTPS/Kestrel
(P3), client adaptive backoff (P4), DI hosting wiring, tests + docs (P5)
follow.
…ts + IServerRateLimiterProvider resolution)

Add IAdaptiveReconnectPolicy (non-breaking sibling of IReconnectPolicy;
avoids default-interface-methods for net472/net48). ReconnectPolicy now
implements it: on a server-busy signal (BadServerTooBusy/BadTcpServerTooBusy/
BadTooManySessions/BadTooManyOperations/BadTooManyPublishRequests/timeouts)
it backs off 4x (capped at MaxDelay) and honors a server retry-after hint as
a lower bound. ConnectionStateMachine.HandleReconnectingAsync tracks the last
attempt's status and uses the adaptive overload when available (initial
connect funnels through the same loop). 8 unit tests.

Follow-up: client-wide connect admission limiter (ramp bulk connects) and
structured retry-after extraction from the fault remain.
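
The backoff rule in this commit message (4x on a server-busy signal, capped at MaxDelay, with the server hint as a lower bound) might look roughly like the sketch below. It is an illustration, not `ReconnectPolicy`: the names `AdaptiveBackoffSketch` and `NextDelay` are hypothetical, and two details are assumptions the message does not settle, namely that the 4x factor multiplies the base delay and that a retry-after hint larger than MaxDelay still wins.

```csharp
using System;

// Hypothetical illustration; not the PR's ReconnectPolicy.
public static class AdaptiveBackoffSketch
{
    public static TimeSpan NextDelay(
        TimeSpan baseDelay,
        TimeSpan maxDelay,
        bool serverBusy,
        TimeSpan? serverRetryAfter)
    {
        TimeSpan delay = baseDelay;

        if (serverBusy)
        {
            // Back off 4x, capped at maxDelay (comparison written to avoid overflow).
            delay = baseDelay.Ticks > maxDelay.Ticks / 4
                ? maxDelay
                : TimeSpan.FromTicks(baseDelay.Ticks * 4);
        }

        // Assumption: the server hint is a floor and is applied after the cap.
        if (serverRetryAfter.HasValue && serverRetryAfter.Value > delay)
        {
            delay = serverRetryAfter.Value;
        }

        return delay;
    }
}
```

Under these assumptions a 1 s base delay becomes 4 s on a busy signal, a 10 s base delay is capped at a 30 s MaxDelay, and a 10 s server hint overrides a computed 4 s delay.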
…Slim

The instance certificate is acquired as a using-scoped, ref-counted owning
handle (CertificateManager.AcquireApplicationCertificateBySecurityPolicy ->
CertificateEntry.AddRef) held for the entire CreateSession call, so the
X509Certificate2 core stays pinned through the signature even outside the
server lock -- certificate rotation cannot dispose it. The lock now only
guards the certificate-blob and endpoint snapshot, so concurrent creates no
longer serialize behind the RSA signing. Verified net10 + net48 build and the
session tests.
… via DI

AddHttpsRateLimiter (default + Action<RateLimiterOptions> overloads) registers
a HttpsRateLimiterStartupContributor onto the HTTPS/WSS listener factories'
StartupContributors, bridging services.AddRateLimiter / app.UseRateLimiter into
the listener's isolated Kestrel host (net8+; rejects with 429). Verified
net10 + net48 builds; 20 DI tests pass.

Add IClientConnectGate + RateLimiterClientConnectGate (ConcurrencyLimiter with
an unbounded queue so excess initial connects WAIT/ramp rather than burst).
ManagedSession acquires a permit around the initial connect only and releases
it once the session is wired; a shared gate lets many concurrent
ManagedSession.CreateAsync calls to one server self-throttle. Surfaced via
ManagedSessionBuilder.WithConnectRateLimiter(int|RateLimiter|IClientConnectGate)
and the AddClient DI option ConnectRateLimiterMaxConcurrency. Default OFF
(null) to preserve behavior. net10 + net48 clean; 10 client RateLimiting tests.
…r param

The P4 connect gate added an IClientConnectGate parameter to the ManagedSession
constructor; two reflection-based test helpers (ManagedSessionComplianceTests,
ManagedSessionTests) locate that ctor by an exact type array and now include
the new parameter (+ a trailing null arg). Resolves 52 client-test failures
(NRE in [SetUp] from a null ConstructorInfo). Full Client.Tests: 1570 pass.

The reconnect lock (SemaphoreSlim) WaitAsync at Session.cs:3215 runs after
ThrowIfDisposed but can still race disposal during teardown: closing a session
mid-reconnect disposes m_reconnectLock while the WaitAsync is pending, throwing
an unobserved ObjectDisposedException that crashed the load-test host (Test Run
Failed despite Passed: 1). Guard the acquire with the file's established
'catch (ObjectDisposedException) when (Disposed)' pattern (see line ~4152),
returning gracefully since a reconnect on a disposed session is moot. The lock
is not acquired when WaitAsync throws, so the early return is safe. 179
reconnect/ManagedSession tests pass. A deterministic unit test is impractical
(requires disposal to interleave precisely across the ThrowIfDisposed/WaitAsync
window).

Exercise the System.Threading.RateLimiting dependency under the AOT test
project (TUnit): server token-bucket connection limiter (admit/reject),
session-establishment concurrency limiter (acquire/reject/release), client
connect gate (AcquireAsync/release), and the adaptive reconnect policy
(IsServerBusySignal + busy backoff). AOT project builds with zero IL/trim
warnings and the 4 tests pass; native AOT compilation is CI-gated via
-p:AotTest=true.

Server: CreateServerTooBusyException now encodes the retry-after as a
machine-readable 'RetryAfterMs=N' token in the fault's AdditionalInfo (via a
ServiceResult) in addition to the human-readable message.

Client: ReconnectPolicy.ParseServerRetryAfter extracts the token;
ConnectionStateMachine tracks the last fault's AdditionalInfo and passes the
parsed retry-after to IAdaptiveReconnectPolicy.GetNextDelay (previously null),
so a cooperating client honors the server's precise hint as a backoff floor.

Note (documented): OPC UA faults carry AdditionalInfo to the client only when
diagnostics are requested, and concurrency limiters do not estimate a
retry-after, so the hint is best-effort; the reliable path remains the
server-busy signal backoff. Parser is cross-TFM (manual digit scan, capped at
one day). 5 new parse tests; 15 client rate-limiting tests pass.
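
A manual digit scan for the `RetryAfterMs=N` token, capped at one day, could look roughly like the sketch below. It is not `ReconnectPolicy.ParseServerRetryAfter`: the names `RetryAfterTokenSketch` and `Parse` are hypothetical, and treating a zero or missing value as "no hint" is an assumption made for this example.

```csharp
using System;

// Hypothetical illustration; not the PR's parser.
public static class RetryAfterTokenSketch
{
    private const string Token = "RetryAfterMs=";
    private const long MaxMilliseconds = 24L * 60 * 60 * 1000; // one-day cap

    // Returns the hint, or null when the text carries no usable token.
    public static TimeSpan? Parse(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return null;
        }

        int start = text.IndexOf(Token, StringComparison.Ordinal);
        if (start < 0)
        {
            return null;
        }

        long milliseconds = 0;
        int digits = 0;
        for (int i = start + Token.Length; i < text.Length; i++)
        {
            char c = text[i];
            if (c < '0' || c > '9')
            {
                break;
            }

            // Clamp on every step so arbitrarily long digit runs cannot overflow.
            milliseconds = Math.Min(MaxMilliseconds, (milliseconds * 10) + (c - '0'));
            digits++;
        }

        // Assumption: zero or absent digits mean "no hint".
        if (digits == 0 || milliseconds == 0)
        {
            return null;
        }

        return TimeSpan.FromMilliseconds(milliseconds);
    }
}
```

The scan uses only `string.IndexOf` and character comparisons, which keeps it usable across target frameworks, and the per-step clamp is what enforces the one-day ceiling.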
Deterministic in-process tests (ServerFixture<StandardServer>): an
always-rejecting IServerRateLimiterProvider makes CreateSession and
ActivateSession fail fast with BadServerTooBusy, and the retry-after hint is
encoded in the fault AdditionalInfo (RetryAfterMs=2000), validating the
server-encode -> client-parse round-trip. A control test confirms creation
succeeds when admission is permissive. 3 tests pass.

CLAassistant commented Jul 3, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ marcschier
❌ agent

agent does not appear to be a GitHub user. You need a GitHub account to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.

Comment thread Docs/ServerScalability.md Outdated
Comment thread Libraries/Opc.Ua.Client/Session/IAdaptiveReconnectPolicy.cs Outdated
Comment thread Libraries/Opc.Ua.Client/Session/ManagedSession.cs Outdated
marcschier marked this pull request as ready for review July 3, 2026 11:57

- Merge IAdaptiveReconnectPolicy into IReconnectPolicy as
  bool TryGetNextDelay(attempt, lastStatus, serverRetryAfter, out delay, ct);
  false => no adaptive behavior and the caller falls back to GetNextDelay.
  Delete IAdaptiveReconnectPolicy (new in 2.0, no obsolete/migration).
  Update ReconnectPolicy, ConnectionStateMachine, tests, and docs.
- ManagedSession.HandleConnectAsync: use the local session instead of
  InnerSession for the just-created session (hoist declaration to method scope).
- ServerScalability.md: rewrite the admission-control note to reflect the
  post-merge state and drop phase/task identifiers.
- Fix pre-existing CS8632 in RateLimitingAotTests (nullable-disabled project
  used out IDisposable? annotations).
Base automatically changed from copilot/async-node-change-notifications to master July 3, 2026 12:42
agent added 4 commits July 3, 2026 15:31
…r budget

A parked Publish (waiting for notifications) now releases its processing worker at
the park point instead of occupying one of MaxRequestThreadCount worker slots for
the whole wait, so the worker/thread budget no longer scales with session count.

- Opc.Ua.Types: new IRequestParkSink; RequestLifetime.ParkSink carrier.
- Opc.Ua.Core: internal RequestParkSink + IParkableIncomingRequest; EndpointIncomingRequest
  creates the sink and flows it onto the RequestLifetime; ServerBase.RequestQueue awaits
  park-or-complete (Task.WhenAny) and releases the active-worker slot at park, observing
  the detached parked task fault-safely. New ServerConfiguration.DecoupleHeldPublishRequests
  (default true) escape hatch, plumbed via InitializeRequestQueue.
- Opc.Ua.Server: StandardServer.PublishAsync flows the sink; SubscriptionManager.PublishAsync
  and SessionPublishQueue.PublishAsync gain IRequestParkSink? overloads; NotifyParked() is
  raised at the single park site (queuing the incomplete Tcs.Task).

Response delivery, ordering, timeout and cancellation are unchanged. net10 + net48 clean.

- Opc.Ua.Core.Tests RequestQueueTests: a parked request releases its worker so a
  single worker services the next request while the first is still parked; the
  DecoupleHeldPublishRequests=false path keeps the worker blocked; 200 parked
  requests are served by a 2-worker pool.
- Opc.Ua.Server.Tests SessionPublishQueueTests: NotifyParked is raised exactly once
  when a Publish parks and never when it returns/faults immediately.
…h hot path free

Only requests that can actually park (Publish long-polls) allocate a park sink and take
the park-or-complete worker path; every other request (Read/Write/Browse/...) keeps the
legacy inline path with no extra per-request allocation or Task.WhenAny. This makes the
dominant non-Publish request path byte-for-byte equivalent to before the change, so there
is no fast-path regression, while held Publishes still release their worker at the park
point. ParkSink is now nullable (null = cannot park).
agent and others added 4 commits July 3, 2026 15:59
…nt-rate-limiting

# Conflicts:
#	Docs/README.md
#	Docs/ServerScalability.md
…tion

P0 (docs + spec proposal):
- Docs/RetryAfterSignaling.md: survey of diagnostics-independent retry-after
  carriers (ResponseHeader.additionalHeader, HTTP Retry-After, UA-TCP ERR reason,
  Server.ServiceLevel) with stack integration points and recommendation tiers.
- Docs/proposals/RetryAfter.md: formal OPC UA spec-change proposal.
- Docs/README.md, Docs/RateLimiting.md: link the new docs.

P1 (client foundation): make a retry-after hint reach the adaptive reconnect
policy regardless of returnDiagnostics and exception re-wrapping.
- ManagedSession.ToAttemptFailure preserves a ServiceResultException's Result
  (new ServiceResult(ex) would overwrite AdditionalInfo with e.Message).
- ConnectionStateMachine.GetAdaptiveDelay parses the retry-after from the fault
  AdditionalInfo or, for transport-level signals, the localized message.
- Tests: ReconnectHonorsRetryAfterFrom{AdditionalInfo,LocalizedMessage}.
- ServerScalability.md: rewrite as user-facing documentation — drop tracking IDs
  (B1-B5/S1-S4) and status/temporal wording (delivered/now/roadmap/options), and
  describe the server's scalability characteristics, admission controls, and the
  held-Publish decoupling as current functionality.
- Remove the delegating PublishAsync compat overloads (2.0-only API, no migration
  needed) in favor of the IRequestParkSink signature throughout:
  ISubscriptionManager, SubscriptionManager, SessionPublishQueue now expose only the
  park-sink method; test/benchmark callers updated to pass the sink (null when none).

Server (AddHttpsRateLimiter default limiter): OnRejected now sets the standard
HTTP Retry-After response header from the rejected lease's RetryAfter metadata,
so a 429 tells a cooperating client when to retry without OPC UA diagnostics.

Client (HttpsTransportChannel.SendRequestAsync): an HTTP 429/503 response is now
translated to BadServerTooBusy before EnsureSuccessStatusCode, honoring the
Retry-After header (delta-seconds or HTTP-date) as a machine-readable
RetryAfterMs=N hint the adaptive reconnect policy consumes (via P1).

Tests: GetRetryAfter (delta/date/absent) and CreateServerTooBusyException
(with/without hint) in HttpsTransportChannelTests.
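
Reading an HTTP Retry-After value in either of its two forms (delta-seconds or HTTP-date) can be sketched with the BCL's `RetryConditionHeaderValue`. This is an illustration, not the code in `HttpsTransportChannel`: `HttpRetryAfterSketch`, `FromHeader`, and the injected `now` parameter are hypothetical, and clamping a past date to zero is an assumption.

```csharp
using System;
using System.Net.Http.Headers;

// Hypothetical illustration; not the PR's client code.
public static class HttpRetryAfterSketch
{
    // Returns the delay to wait, or null when the header is absent or malformed.
    public static TimeSpan? FromHeader(string headerValue, DateTimeOffset now)
    {
        if (!RetryConditionHeaderValue.TryParse(headerValue, out var parsed))
        {
            return null;
        }

        if (parsed.Delta.HasValue)
        {
            // delta-seconds form, for example "120"
            return parsed.Delta.Value;
        }

        if (parsed.Date.HasValue)
        {
            // HTTP-date form: convert to a delay relative to "now".
            TimeSpan delay = parsed.Date.Value - now;
            return delay > TimeSpan.Zero ? delay : TimeSpan.Zero; // assumption: past dates mean "retry now"
        }

        return null;
    }
}
```

Passing the current time in as a parameter keeps the date branch deterministic, which makes the delta, date, and absent cases straightforward to cover in unit tests.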
codecov Bot commented Jul 3, 2026

Codecov Report

❌ Patch coverage is 76.25786% with 151 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.53%. Comparing base (26ef554) to head (7baf42f).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ....Https/Https/HttpsRateLimiterStartupContributor.cs | 14.28% | 30 Missing ⚠️ |
| ...tack/Opc.Ua.Core/Stack/Tcp/TcpTransportListener.cs | 52.00% | 11 Missing and 1 partial ⚠️ |
| ...c.Ua.Client/Fluent/OpcUaClientBuilderExtensions.cs | 26.66% | 10 Missing and 1 partial ⚠️ |
| ...pc.Ua.Core/Stack/Server/ServerBase.RequestQueue.cs | 65.51% | 10 Missing ⚠️ |
| ...pc.Ua.Core/Stack/Tcp/UaSCBinaryTransportChannel.cs | 54.54% | 7 Missing and 3 partials ⚠️ |
| ....Ua.Client/Session/RateLimiterClientConnectGate.cs | 70.00% | 6 Missing and 3 partials ⚠️ |
| ...ries/Opc.Ua.Client/Fluent/ManagedSessionBuilder.cs | 11.11% | 8 Missing ⚠️ |
| Libraries/Opc.Ua.Client/Session/ManagedSession.cs | 80.48% | 4 Missing and 4 partials ⚠️ |
| .../Opc.Ua.Server/Hosting/OpcUaServerHostedService.cs | 0.00% | 8 Missing ⚠️ |
| Libraries/Opc.Ua.Server/Server/StandardServer.cs | 85.96% | 2 Missing and 6 partials ⚠️ |

... and 12 more
Additional details and impacted files

@@           Coverage Diff            @@
##           master    #3946    +/-   ##
========================================
  Coverage   73.52%   73.53%            
========================================
  Files        1170     1180    +10     
  Lines      170163   170741   +578     
  Branches    29361    29470   +109     
========================================
+ Hits       125117   125555   +438     
- Misses      34043    34140    +97     
- Partials    11003    11046    +43

| Files with missing lines | Coverage Δ |
| --- | --- |
| ...es/Opc.Ua.Client/Session/ConnectionStateMachine.cs | 79.15% <100.00%> (+0.99%) ⬆️ |
| ...ies/Opc.Ua.Client/Session/ManagedSessionOptions.cs | 100.00% <ø> (ø) |
| ...c.Ua.Server/RateLimiting/ServerRateLimitOptions.cs | 100.00% <100.00%> (ø) |
| ...c.Ua.Server/Server/ServerServiceLevelCalculator.cs | 100.00% <100.00%> (ø) |
| .../Opc.Ua.Server/Subscription/SessionPublishQueue.cs | 85.15% <100.00%> (+0.05%) ⬆️ |
| .../Opc.Ua.Server/Subscription/SubscriptionManager.cs | 82.72% <100.00%> (+0.01%) ⬆️ |
| ...DependencyInjection/OpcUaHttpsBuilderExtensions.cs | 100.00% <100.00%> (ø) |
| ...e/Stack/Client/Channels/IChannelReconnectPolicy.cs | 81.25% <ø> (+9.37%) ⬆️ |
| ...ack/Server/EndpointBase.EndpointIncomingRequest.cs | 81.25% <100.00%> (+1.25%) ⬆️ |
| Stack/Opc.Ua.Core/Stack/Server/RequestParkSink.cs | 100.00% <100.00%> (ø) |

... and 25 more

... and 12 files with indirect coverage changes

marcschier and others added 9 commits July 3, 2026 17:13
Adds a diagnostics-independent structured retry-after carrier reusing the
standard AdditionalParametersType in ResponseHeader.AdditionalHeader (no new
DataType). RetryAfterHeader.AttachTo/Read encode/decode a whole-millisecond
Int64 under the AdditionalParameterNames.RetryAfterMs key, merging with any
existing additional parameters. Unit tests cover round-trip, absence,
non-positive delay, rounding, and merge.

This is the reusable core for emitting the retry-after on a BadServerTooBusy
ServiceFault (delivered regardless of RequestHeader.ReturnDiagnostics); the
CreateFault/server/client wiring follows.
…r on faults

- ServerBusyException (Core): a ServiceResultException carrying an explicit
  TimeSpan? RetryAfter, thrown by StandardServer.CreateServerTooBusyException.
- EndpointBase.CreateFault attaches the retry-after to the fault's
  ResponseHeader.AdditionalHeader (via RetryAfterHeader) when the exception is a
  ServerBusyException, so it is delivered independently of ReturnDiagnostics.
  The CallAsync catch passes the typed exception straight to CreateFault.
- Tests: CreateFault attaches/omits the header (EndpointBaseTests); the admission
  path throws ServerBusyException with RetryAfter=2000ms and keeps the legacy
  AdditionalInfo token (SessionAdmissionRateLimitTests, now CatchAsync to allow
  the subclass).

The diagnostics-independent server emit is complete; client honoring of the
fault AdditionalHeader is the remaining step (see plan.md P3 blueprint).
…Header

ClientBase.ValidateResponse(ResponseHeader) now reads a RetryAfterHeader from a
bad response and re-emits it as a machine-readable RetryAfterMs=N AdditionalInfo
token (only when diagnostics did not already provide one), so the adaptive
reconnect policy honors it via the P1 plumbing. The source-generated clients
call ValidateResponse(genericResponse.ResponseHeader) for every service, so this
covers CreateSession/ActivateSession and all other calls.

The change is additive: behavior is unchanged for responses without a
RetryAfter header. Tests: ValidateResponse surfaces the hint on a bad response
and ignores it on a good one (ClientBaseTests).

P3 (structured retry-after via ResponseHeader.additionalHeader) is now complete
end-to-end: server emit + client honor, delivered independently of diagnostics.

Update RetryAfterSignaling.md with an implementation-status section and
RateLimiting.md follow-ups to reflect that the diagnostics-independent
retry-after carriers (ResponseHeader.additionalHeader and HTTP Retry-After) are
implemented; UA-TCP Error reason and dynamic ServiceLevel remain planned.

Update Server.ServiceLevel from session capacity headroom and add calculator coverage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Parse RetryAfterMs hints from transient UA-TCP ERR failures and apply them as a lower bound for managed channel reconnect backoff.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- ServerBusyException: add a justified CA1032 suppression (the type must carry a
  ServiceResult; the standard parameterless/message-only ctors are intentionally
  omitted).
- Docs: RetryAfterSignaling.md + RateLimiting.md now record the UA-TCP client ERR
  retry-after honoring (P4) and the load-based Server.ServiceLevel (P5) as
  delivered, with remaining items (standardized field, server-emitted UA-TCP ERR
  on connection reject, client proactive ServiceLevel reselection) as planned.

Validation this pass: Core Stack tests 2298 pass (incl. 11 new RetryAfterHint
tests); ServiceLevel calculator + session-admission + session-churn server tests
pass; Core and Server build clean on net48 and net10.0.
…ing' into copilot/server-client-rate-limiting
Copilot AI review requested due to automatic review settings July 3, 2026 18:29

Copilot AI left a comment

Contributor

Pull request overview

Introduces configurable admission control/rate limiting across server and client to mitigate connect/session storms. On the server side, this adds transport-level connection admission limits, session-establishment concurrency limits, structured retry-after signaling, and a dynamic Server.ServiceLevel. On the client side, this adds server-signal-aware adaptive reconnect backoff and an optional connect gate to throttle bulk connects.

Changes:

  • Server: configurable listener backlog + connection/session admission limiters (System.Threading.RateLimiting), BadServerTooBusy signaling with retry-after, and Server.ServiceLevel scaling.
  • Client: adaptive reconnect backoff (IReconnectPolicy.TryGetNextDelay) honoring retry-after hints + optional connect admission gate (IClientConnectGate).
  • Tests/docs: new unit/integration/AOT tests and documentation describing rate limiting + retry-after signaling.

Reviewed changes

Copilot reviewed 60 out of 60 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| Tests/Opc.Ua.Server.Tests/SessionAdmissionRateLimitTests.cs | Integration tests for session-admission rejection and retry-after hint. |
| Tests/Opc.Ua.Server.Tests/ServerServiceLevelCalculatorTests.cs | Unit tests for load-based Server.ServiceLevel calculation/hysteresis. |
| Tests/Opc.Ua.Server.Tests/RateLimiting/ServerRateLimiterTests.cs | Unit tests for server connection/session limiters and provider behavior. |
| Tests/Opc.Ua.Core.Tests/Stack/Transport/HttpsTransportChannelTests.cs | Tests mapping HTTP 429/503 + Retry-After to BadServerTooBusy. |
| Tests/Opc.Ua.Core.Tests/Stack/Server/RetryAfterHeaderTests.cs | Tests for ResponseHeader.AdditionalHeader retry-after carrier. |
| Tests/Opc.Ua.Core.Tests/Stack/Server/EndpointBaseTests.cs | Tests that faults attach/omit retry-after header appropriately. |
| Tests/Opc.Ua.Core.Tests/Stack/Client/RetryAfterHintTests.cs | Tests parsing/applying retry-after hints for reconnect backoff. |
| Tests/Opc.Ua.Core.Tests/Stack/Client/ClientBaseTests.cs | Tests ValidateResponse surfacing retry-after from AdditionalHeader. |
| Tests/Opc.Ua.Client.Tests/Session/ManagedSessionTests.cs | Updates reflection-based ctor tests for new connect-gate parameter. |
| Tests/Opc.Ua.Client.Tests/Session/ManagedSessionComplianceTests.cs | Updates compliance tests for new connect-gate parameter. |
| Tests/Opc.Ua.Client.Tests/Session/ConnectionStateMachineTests.cs | Tests reconnect honoring retry-after parsed from service results. |
| Tests/Opc.Ua.Client.Tests/ClientConnectRateLimiterTests.cs | Tests client connect gate concurrency behavior. |
| Tests/Opc.Ua.Client.Tests/AdaptiveReconnectPolicyTests.cs | Tests adaptive reconnect policy backoff classification + hint handling. |
| Tests/Opc.Ua.Bindings.Https.WebApi.Tests/DependencyInjection/OpcUaHttpsBuilderExtensionsTests.cs | Tests DI extensions for Kestrel HTTPS rate limiter contributor. |
| Tests/Opc.Ua.Aot.Tests/RateLimitingAotTests.cs | AOT coverage for new rate-limiting paths. |
| Stack/Opc.Ua.Core/Stack/Transport/TransportListenerSettings.cs | Adds listener backlog + connection limiter settings. |
| Stack/Opc.Ua.Core/Stack/Transport/IConnectionRateLimiter.cs | New interface for transport connection admission control. |
| Stack/Opc.Ua.Core/Stack/Tcp/UaSCBinaryTransportChannel.cs | Captures server retry-after hints from UA-TCP failures for reconnect. |
| Stack/Opc.Ua.Core/Stack/Tcp/UaSCBinaryChannel.cs | Preserves UA-TCP ERR reason in ServiceResult for hint parsing. |
| Stack/Opc.Ua.Core/Stack/Tcp/TcpTransportListener.cs | Uses configured backlog + applies connection admission limiter on accept. |
| Stack/Opc.Ua.Core/Stack/Server/ServerBusyException.cs | New exception carrying retry-after for fault-building. |
| Stack/Opc.Ua.Core/Stack/Server/ServerBase.cs | Hook for derived servers to configure transport listener settings. |
| Stack/Opc.Ua.Core/Stack/Server/RetryAfterHeader.cs | Adds retry-after to ResponseHeader.AdditionalHeader parameters. |
| Stack/Opc.Ua.Core/Stack/Server/EndpointBase.cs | Attaches retry-after header on ServerBusyException faults. |
| Stack/Opc.Ua.Core/Stack/Https/HttpsTransportChannel.cs | Maps HTTP throttling + Retry-After into BadServerTooBusy with token. |
| Stack/Opc.Ua.Core/Stack/Client/ClientBase.cs | Surfaces retry-after from AdditionalHeader into AdditionalInfo token. |
| Stack/Opc.Ua.Core/Stack/Client/Channels/RetryAfterHint.cs | Parses retry-after tokens and applies as reconnect-delay lower bound. |
| Stack/Opc.Ua.Core/Stack/Client/Channels/Internal/ChannelEntry.cs | Applies consumed server retry-after hint to channel reconnect delay. |
| Stack/Opc.Ua.Core/Stack/Client/Channels/IChannelReconnectPolicy.cs | Adds internal IServerRetryAfterHintProvider interface for channels. |
| Stack/Opc.Ua.Core/Security/Constants/AdditionalParameterNames.cs | Adds RetryAfterMs additional-parameter name constant. |
| Stack/Opc.Ua.Core/Opc.Ua.Core.csproj | Adds System.Threading.RateLimiting package reference. |
| Stack/Opc.Ua.Bindings.Https/Https/HttpsRateLimiterStartupContributor.cs | Adds ASP.NET Core rate limiting + emits HTTP Retry-After on rejection. |
| Stack/Opc.Ua.Bindings.Https/DependencyInjection/OpcUaHttpsBuilderExtensions.cs | Adds AddHttpsRateLimiter DI/fluent integration. |
| Libraries/Opc.Ua.Server/Server/StandardServer.cs | Wires server admission control into Create/ActivateSession + transport settings. |
| Libraries/Opc.Ua.Server/Server/ServerServiceLevelCalculator.cs | Implements load-based Server.ServiceLevel calculation. |
| Libraries/Opc.Ua.Server/Server/ServerInternalData.cs | Updates Server.ServiceLevel on session create/close with hysteresis. |
| Libraries/Opc.Ua.Server/RateLimiting/TokenBucketConnectionRateLimiter.cs | Implements connection admission limiter using token bucket. |
| Libraries/Opc.Ua.Server/RateLimiting/ServerRateLimitOptions.cs | Defines server rate limiting configuration options. |
| Libraries/Opc.Ua.Server/RateLimiting/IServerRateLimiterProvider.cs | Provider interface for server admission control limiters. |
| Libraries/Opc.Ua.Server/RateLimiting/DefaultServerRateLimiterProvider.cs | Default provider implementation using System.Threading.RateLimiting. |
| Libraries/Opc.Ua.Server/Opc.Ua.Server.csproj | Adds System.Threading.RateLimiting package reference. |
| Libraries/Opc.Ua.Server/Hosting/OpcUaServerOptions.cs | Adds hosted-service option callback to configure rate limits. |
| Libraries/Opc.Ua.Server/Hosting/OpcUaServerHostedService.cs | Applies DI/provider or options callback to server rate limiting at startup. |
| Libraries/Opc.Ua.Client/Session/Session.cs | Teardown fix: swallow benign ObjectDisposedException on reconnect lock wait. |
| Libraries/Opc.Ua.Client/Session/ReconnectPolicy.cs | Implements adaptive delay API and retry-after parsing. |
| Libraries/Opc.Ua.Client/Session/RateLimiterClientConnectGate.cs | Implements connect admission gate backed by RateLimiter. |
| Libraries/Opc.Ua.Client/Session/ManagedSessionOptions.cs | Adds connect-gate options for client connect throttling. |
| Libraries/Opc.Ua.Client/Session/ManagedSession.cs | Uses connect gate; preserves ServiceResultException.Result for retry-after. |
| Libraries/Opc.Ua.Client/Session/IReconnectPolicy.cs | Extends reconnect policy API with adaptive delay method. |
| Libraries/Opc.Ua.Client/Session/IClientConnectGate.cs | New interface for client connect admission gating. |
| Libraries/Opc.Ua.Client/Session/ConnectionStateMachine.cs | Uses adaptive policy API + parses retry-after hints from last attempt. |
| Libraries/Opc.Ua.Client/Opc.Ua.Client.csproj | Adds System.Threading.RateLimiting package reference. |
| Libraries/Opc.Ua.Client/Fluent/OpcUaClientBuilderExtensions.cs | Plumbs connect gate from options/DI into fluent builder. |
| Libraries/Opc.Ua.Client/Fluent/ManagedSessionBuilder.cs | Adds fluent connect-rate-limiter configuration overloads. |
| Docs/ServerScalability.md | Updates scalability analysis with delivered admission control/backpressure. |
| Docs/RetryAfterSignaling.md | New design survey for diagnostics-independent retry-after carriers. |
| Docs/README.md | Links new rate limiting + retry-after signaling documentation. |
| Docs/RateLimiting.md | New documentation for server/client rate limiting, DI, and hints. |
| Docs/proposals/RetryAfter.md | New specification proposal draft for structured retry-after signaling. |
| Directory.Packages.props | Adds centralized package version for System.Threading.RateLimiting. |

Comment thread Libraries/Opc.Ua.Server/Server/StandardServer.cs
Comment thread Libraries/Opc.Ua.Server/Server/StandardServer.cs
Comment thread Tests/Opc.Ua.Server.Tests/SessionAdmissionRateLimitTests.cs
Comment thread Libraries/Opc.Ua.Server/RateLimiting/ServerRateLimitOptions.cs
Comment thread Stack/Opc.Ua.Core/Stack/Client/ClientBase.cs
Comment thread Docs/ServerScalability.md Outdated
Comment thread Docs/RetryAfterSignaling.md Outdated
…anch

Integrates PR #3950 "Server: decouple held long-poll Publishes from the
request-queue worker budget" (the S1 scalability item) into this branch.

Conflict resolution: Docs/ServerScalability.md - took #3950's concise rewrite
(its "Admission control and rate limiting" section already summarizes the
rate-limiting + retry-after work, and it references RateLimiting.md for the full
surface) and added a See-also link to RetryAfterSignaling.md. All code files
auto-merged cleanly (StandardServer.cs regions are disjoint).
@marcschier changed the title from "Server + client rate limiting (configurable admission, adaptive backoff)" to "Server session scalability: admission control, retry-after signaling, and held-Publish decoupling" Jul 4, 2026
…y numbers

With DecoupleHeldPublishRequests default-on (S1), a held Publish releases its
request-processing worker at the park point, so the worker pool no longer has to
scale with the session count. An A/B load-test campaign (ServerManySessionsLoadTest,
Xeon W-2235 6c/12t, quiet machine) shows that a ~200-worker pool cleanly establishes and
serves ~4000 sessions (8-17x the ~140-350 that the path with decoupling off manages at
the same budget), while a session-count-sized pool (10500) is ~2x slower to establish and
reaches a LOWER ceiling (due to thread oversubscription during the RSA-handshake burst).

- LoadTest fixture: MaxRequestThreadCount 10500->200, MinRequestThreadCount 200->50,
  with the rationale comment rewritten for the S1 small-pool model.
- Benchmarks.md: refresh the server-session scalability table (2000/2500/4000 now
  establish cleanly; 10000 remains the establishment wall) and the
  MaxRequestThreadCount guidance (size to active concurrency, not session count).

ServerScalability.md prose was rewritten upstream (8f7a95d); measured numbers now
live in Benchmarks.md, which that doc references.
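
The park-point idea in this commit message can be sketched as follows, assuming a semaphore stands in for the request-processing worker budget. `workers`, `HeldPublishAsync`, and `notification` are illustrative names only; the PR's actual mechanism is `RequestParkSink.NotifyParked()`, whose implementation is not shown in this conversation.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Assumption: a semaphore models the worker budget (2 workers here).
var workers = new SemaphoreSlim(2, 2);
// Completed when a (simulated) notification becomes available.
var notification = new TaskCompletionSource<bool>(
    TaskCreationOptions.RunContinuationsAsynchronously);
int parked = 0;

async Task HeldPublishAsync()
{
    await workers.WaitAsync();          // take a worker to process the request
    Interlocked.Increment(ref parked);  // nothing to send yet: the request parks
    workers.Release();                  // park point: hand the worker back
    await notification.Task;            // wait without holding a worker
}

// Five held Publishes share a budget of two workers.
Task[] publishes = new Task[5];
for (int i = 0; i < publishes.Length; i++)
{
    publishes[i] = HeldPublishAsync();
}

int parkedCount = Volatile.Read(ref parked);
int freeWorkers = workers.CurrentCount;
Console.WriteLine($"parked={parkedCount}, free workers={freeWorkers}");

// Deliver the notification so every parked request completes.
notification.SetResult(true);
await Task.WhenAll(publishes);
```

Because each request returns its worker before waiting, the number of parked requests is bounded by memory rather than by the pool size, which is why the pool can be sized to active concurrency instead of session count.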
Comment thread Docs/Benchmarks.md
| 2000 | No (*) | — | — |
| 2000 | Yes | 14 | Yes |
| 2500 | Yes | 13 | Yes |
| 4000 | Yes | 11 | Yes |

Collaborator Author

Retest: it would be great to see the ceiling (i.e., is it 8000, 5000, or 4000?).

@marcschier marcschier Jul 4, 2026

Collaborator Author

The exact ceiling needs the [Explicit] ServerManySessionsLoadTestAsync macro run on dedicated hardware (client and server on separate machines). I left this thread open for a follow-up run.

Collaborator Author

Rerun the other values as well to update the entire table. The goal is to find the ceiling on the machine mentioned.

Comment thread Docs/Benchmarks.md Outdated
Comment thread Docs/RateLimiting.md Outdated
Comment thread Docs/RateLimiting.md Outdated
Comment thread Docs/RetryAfterSignaling.md Outdated
Comment thread Docs/ServerScalability.md Outdated
- StandardServer.RateLimiterProvider setter now disposes a server-owned provider
  before replacing it, so its RateLimiter timers are not leaked.
- SessionAdmissionRateLimitTests.TearDown disposes the provider being replaced
  (the tests assign caller-owned providers the server does not dispose).
- ClientBase.ValidateResponse surfaces the AdditionalHeader retry-after by
  checking for token presence (not empty AdditionalInfo) and merges the token
  into any existing AdditionalInfo instead of overwriting it; added merge and
  keep-existing-token tests.
- ServerRateLimitOptions: connection-limit XML docs corrected to describe the
  server-wide (single-bucket) behavior instead of per-remote.
- RetryAfterSignaling.md: candidate-mechanism table + prose updated to reflect
  the implemented additionalHeader / HTTP Retry-After / UA-TCP ERR / ServiceLevel
  carriers (was internally inconsistent with the Implementation status section).
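
The merge rule in the ClientBase.ValidateResponse bullet can be illustrated with a small sketch. The token format `RetryAfterMs=<milliseconds>` and the `"; "` separator are assumptions made for this example (only the `RetryAfterMs` parameter name appears in the file summary), and `MergeRetryAfter` is a hypothetical helper, not the PR's code.

```csharp
using System;

// Hypothetical helper: merges a retry-after token into AdditionalInfo
// without overwriting existing text or an already-present token.
static string MergeRetryAfter(string additionalInfo, int retryAfterMs)
{
    // Assumed token format; the real one is not shown in this conversation.
    const string tokenPrefix = "RetryAfterMs=";
    string token = tokenPrefix + retryAfterMs;

    if (string.IsNullOrEmpty(additionalInfo))
    {
        return token;
    }
    // Decide by token presence, not by whether AdditionalInfo is empty.
    if (additionalInfo.IndexOf(tokenPrefix, StringComparison.Ordinal) >= 0)
    {
        return additionalInfo; // keep the existing token
    }
    return additionalInfo + "; " + token; // merge instead of overwrite
}

Console.WriteLine(MergeRetryAfter("session limit reached", 5000));
```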
… docs

- Benchmarks: drop internal comparisons to earlier master states in the
  server-scalability footnote (keep only measured numbers and the
  configuration A/B sizing insight).
- RateLimiting: remove the 'Extending the transport' subsection header;
  drop the speculative 'Planned' follow-ups bullet and retitle the
  section 'Server backpressure signals'.
- ServerScalability: reword the held-Publish sentence to acknowledge a
  client may keep several Publish requests outstanding (the classic
  engine does); each parked Publish gets its own RequestParkSink and
  releases a worker independently.
- Remove the standalone RetryAfterSignaling.md design survey; fold a
  concise implementation summary into Sessions.md (reconnect section) as
  'Server retry-after backpressure' and reference it from RateLimiting.md,
  README.md, ServerScalability.md, and proposals/RetryAfter.md.

Collaborator Author

Open a Mantis issue.


  /// <inheritdoc/>
- public ValueTask<IServiceResponse> SendRequestAsync(
+ public async ValueTask<IServiceResponse> SendRequestAsync(

Collaborator Author

This is a perf-critical path, and this change adds another async state machine. Can the exception not be intercepted in a leaner way?


// The request is now parked waiting for a notification: release the
// processing worker so it does not remain blocked for the whole wait.
parkSink?.NotifyParked();

Contributor

Will a client-initiated cancel of the request still work? The spec recommends that clients cancel outstanding Publish requests when closing a session.

4 participants