Server session scalability: admission control, retry-after signaling, and held-Publish decoupling#3946
marcschier wants to merge 44 commits into
Adds an additive, non-breaking asynchronous change-notification API to NodeState and routes the MonitoredNode2 push path through it, so a node that exposes an asynchronous value read handler (OnReadValueAsync) is honored without blocking a thread and the per-node channel applies back-pressure by awaiting rather than blocking.
NodeState:
- ClearChangeMasksAsync / OnStateChangedAsync / StateChangedAsync
- ReportEventAsync / OnReportEventAsync (+ BaseInstanceState override)
- The synchronous ClearChangeMasks / ReportEvent now also drive the async sinks, completing inline for a synchronously-readable node with channel capacity and blocking only when a genuinely async read is in flight or the channel is full (only synchronous callers pay that cost).
MonitoredNode2:
- OnMonitoredNodeChangedAsync / OnReportEventAsync producers read each attribute at enqueue time via ReadAttributeAsync (strict enqueue-time snapshot, honoring async read handlers) and enqueue by awaiting the bounded channel. Synchronous wrappers are retained.
Internal async flush sites migrated to await ClearChangeMasksAsync (AsyncCustomNodeManager.WriteAsync value flush, node removal, ConfigurationNodeManager, KeyCredentialPushSubject). FIFO/LIFO discard and the Overflow bit remain a per-item queue concern; the shared node channel stays a lossless conduit.
Tests: MonitoredNode2 async-read-honored, no-block-on-slow-async-read, and async event delivery; NodeState async/sync dispatch.
Docs updated in AsyncServerSupport.md.
B (event source-report async):
- Add IServerInternal.ReportEventAsync + ServerInternalData.ReportEventAsync (over the NodeState.ReportEventAsync added earlier).
- AsyncCustomNodeManager: async root-notifier OnReportEventAsync handler that awaits Server.ReportEventAsync; rewire the root-notifier slots to the async handler (no OnReportEvent overrides exist, so this is safe). The rare model-change Raise* helpers stay synchronous (they drive the async producer via the sync wrapper; events do not async-read, so a full async cascade through model-change/audit is near-zero value).
C (remaining flush migration):
- Migrate KeyCredentialPushSubject.BindAsync to await ClearChangeMasksAsync. All async-context ClearChangeMasks sites are now migrated; genuinely synchronous sites (builders, diagnostics timers, sync CustomNodeManager, supervision) remain on the sync wrapper (correct, not hot paths).
Tests: update the 3 root-notifier tests (parameterized across the async and sync managers) to accept either the sync OnReportEvent or async OnReportEventAsync slot, and to capture the routed event via either Server.ReportEvent or Server.ReportEventAsync.
Validation: 1900 Server + 144 History AlarmsAndConditions pass; net10/net48 build 0-warning. (D micro-benchmark skipped per user; A load test blocked on the load-test branch merging to master.)
…change-notifications # Conflicts: # Libraries/Opc.Ua.Server/NodeManager/MonitoredItem/MonitoredNode.cs
…Node2 tests
Review feedback:
- NodeState: sync ReportEvent/ClearChangeMasks wrappers now offload the async sink to the thread pool when a SynchronizationContext is present, so a context-capturing sink cannot deadlock the blocking wait (the no-context server fast path is unchanged).
- MonitoredNode: propagate OperationCanceledException from the async producer paths instead of swallowing it (kept ChannelClosedException).
- MonitoredNode2 tests: assert delivery Wait returns true instead of ignoring the result.
- Docs: US spelling.
Test fix (root cause of the 2 async-read test timeouts): IMonitoredItem.QueueValue takes an `in DataValue` parameter. Moq's It.IsAny<DataValue>() does not match `in`/`ref` parameters, so the .Callback signal never fired and the delivery Wait timed out at 30 s even though QueueValue was invoked correctly (Invocations still recorded it, which is why the old assertions passed). Switched all six QueueValue setups/verifications to It.Ref<DataValue>.IsAny. MonitoredNode2Tests now run in ~0.35 s instead of ~1 min of masked 30 s timeouts. Production delivery was always correct.
…change-notifications
Add Docs/ServerScalability.md documenting why a single node tops out at ~2000 concurrent sessions and how to move beyond it, grounded in code references:
- Framing: the 2000 ceiling is an establishment (connect-storm) ceiling, not steady-state (500 establishes cleanly and delivers 100% of notifications; 2000 collapses via a retry -> duplicate-session feedback loop, ~4432 creates for a 2000 target).
- Establishment boundaries B1-B5: socket backlog hard-coded to 10; every transient handshake error -> BadTcpInternalError -> client retry amplifier; O(N^2) session-diagnostics rescan under the global address-space semaphore; CreateSession RSA signing inside a global semaphore; RSA CPU saturation -> mass subscription abandonment.
- Steady-state boundaries S1-S4: held Publish is async-parked but the request-queue accounting couples MaxRequestThreadCount / ThreadPool to session count; per-session publish cap; O(N) sweep; ABANDONED feedback.
- Where to linearize: admission control / rate limiting at connection accept and session establishment (fast BadServerTooBusy / Retry-After instead of mid-handshake aborts), plus a prioritized A/B/C roadmap.
Link the new doc from Docs/README.md and cross-link it with the Server session scalability section of Docs/Benchmarks.md. Docs-only; no code changes.
…ishment
Server-side admission control based on System.Threading.RateLimiting, ON by default with conservative limits (validated: all 1901 Server.Tests pass).
Foundation (Opc.Ua.Server/RateLimiting):
- ServerRateLimitOptions: deterministic, configurable limits (backlog, connection token-bucket rate/burst, session-establishment concurrency).
- IServerRateLimiterProvider + DefaultServerRateLimiterProvider: builds the limiters from options; DI-replaceable, direct-construct fallback.
- TokenBucketConnectionRateLimiter: IConnectionRateLimiter over a TokenBucketRateLimiter.
Transport (Opc.Ua.Core, P1 / B1+B2):
- IConnectionRateLimiter seam + TransportListenerSettings.ListenBacklog and .ConnectionRateLimiter.
- TcpTransportListener: configurable backlog (default raised 10 -> 512) and a connection admission check in OnAccept that sheds a storm cheaply.
- ServerBase.ConfigureTransportListenerSettings virtual hook.
Session establishment (StandardServer, P2):
- CreateSession/ActivateSession acquire a concurrency permit before the CPU-bound crypto; at capacity they return BadServerTooBusy with a retry-after hint in the fault message. Provider wired via OnServerStarting (default) or the settable RateLimiterProvider/RateLimitOptions (DI/direct).
Deferred: B4 (move CreateSession RSA signing out of m_semaphoreSlim) needs careful cert-rotation lifetime analysis; tracked separately. HTTPS/Kestrel (P3), client adaptive backoff (P4), DI hosting wiring, tests + docs (P5) follow.
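A minimal sketch of the token-bucket admission check described above, using `TokenBucketRateLimiter` from System.Threading.RateLimiting directly. The limits (burst of 3, one token per second) are hypothetical values chosen for illustration, not the PR's defaults, and the loop stands in for the listener's accept path rather than reproducing TokenBucketConnectionRateLimiter.

```csharp
using System;
using System.Threading.RateLimiting;

// Hypothetical limits for illustration only.
var options = new TokenBucketRateLimiterOptions
{
    TokenLimit = 3,                                 // burst: connections admitted back-to-back
    TokensPerPeriod = 1,                            // sustained rate: one connection ...
    ReplenishmentPeriod = TimeSpan.FromSeconds(1),  // ... per second
    QueueLimit = 0,                                 // never queue: reject at once when empty
    QueueProcessingOrder = QueueProcessingOrder.OldestFirst,
    AutoReplenishment = false                       // no background timer, keeps the sketch deterministic
};

using var limiter = new TokenBucketRateLimiter(options);

int admitted = 0, rejected = 0;
for (int i = 0; i < 5; i++)
{
    // AttemptAcquire never blocks, so an accept loop can shed a storm cheaply.
    using RateLimitLease lease = limiter.AttemptAcquire(1);
    if (lease.IsAcquired)
    {
        admitted++;   // a real listener would continue with the handshake here
    }
    else
    {
        rejected++;   // a real listener would close the socket here
    }
}

Console.WriteLine($"admitted={admitted} rejected={rejected}");
```

Because token-bucket leases do not return tokens when disposed, the burst is consumed by the first three attempts and the remaining two are rejected until the bucket is replenished.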
…ts + IServerRateLimiterProvider resolution)
Add IAdaptiveReconnectPolicy (non-breaking sibling of IReconnectPolicy; avoids default-interface-methods for net472/net48). ReconnectPolicy now implements it: on a server-busy signal (BadServerTooBusy/BadTcpServerTooBusy/ BadTooManySessions/BadTooManyOperations/BadTooManyPublishRequests/timeouts) it backs off 4x (capped at MaxDelay) and honors a server retry-after hint as a lower bound. ConnectionStateMachine.HandleReconnectingAsync tracks the last attempt's status and uses the adaptive overload when available (initial connect funnels through the same loop). 8 unit tests. Follow-up: client-wide connect admission limiter (ramp bulk connects) and structured retry-after extraction from the fault remain.
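The backoff rule in this commit can be sketched as a small pure function. `NextBusyDelay` is a hypothetical helper, not the PR's ReconnectPolicy code, and it assumes that a server hint larger than the cap wins over the cap, which the commit message does not state explicitly.

```csharp
using System;

// Hypothetical helper: on a server-busy signal, multiply the previous delay by 4,
// cap it at maxDelay, then treat the server's retry-after hint as a lower bound.
static TimeSpan NextBusyDelay(TimeSpan previous, TimeSpan maxDelay, TimeSpan? serverRetryAfter)
{
    // Guard the multiplication so very large delays cannot overflow.
    long ticks = previous.Ticks > maxDelay.Ticks / 4 ? maxDelay.Ticks : previous.Ticks * 4;
    TimeSpan delay = TimeSpan.FromTicks(Math.Min(ticks, maxDelay.Ticks));

    // Assumption: the hint is a floor even when it exceeds the cap.
    if (serverRetryAfter is TimeSpan hint && hint > delay)
    {
        delay = hint;
    }
    return delay;
}

TimeSpan max = TimeSpan.FromSeconds(30);
TimeSpan plain = NextBusyDelay(TimeSpan.FromSeconds(1), max, null);                       // 4 s
TimeSpan capped = NextBusyDelay(TimeSpan.FromSeconds(10), max, null);                     // 30 s
TimeSpan floored = NextBusyDelay(TimeSpan.FromSeconds(1), max, TimeSpan.FromSeconds(10)); // 10 s
Console.WriteLine($"{plain} {capped} {floored}");
```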
…ServerScalability.md
…Slim The instance certificate is acquired as a using-scoped, ref-counted owning handle (CertificateManager.AcquireApplicationCertificateBySecurityPolicy -> CertificateEntry.AddRef) held for the entire CreateSession call, so the X509Certificate2 core stays pinned through the signature even outside the server lock -- certificate rotation cannot dispose it. The lock now only guards the certificate-blob and endpoint snapshot, so concurrent creates no longer serialize behind the RSA signing. Verified net10 + net48 build and the session tests.
… via DI AddHttpsRateLimiter (default + Action<RateLimiterOptions> overloads) registers a HttpsRateLimiterStartupContributor onto the HTTPS/WSS listener factories' StartupContributors, bridging services.AddRateLimiter / app.UseRateLimiter into the listener's isolated Kestrel host (net8+; rejects with 429). Verified net10 + net48 builds; 20 DI tests pass.
Add IClientConnectGate + RateLimiterClientConnectGate (ConcurrencyLimiter with an unbounded queue so excess initial connects WAIT/ramp rather than burst). ManagedSession acquires a permit around the initial connect only and releases it once the session is wired; a shared gate lets many concurrent ManagedSession.CreateAsync calls to one server self-throttle. Surfaced via ManagedSessionBuilder.WithConnectRateLimiter(int|RateLimiter|IClientConnectGate) and the AddClient DI option ConnectRateLimiterMaxConcurrency. Default OFF (null) to preserve behavior. net10 + net48 clean; 10 client RateLimiting tests.
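The gating idea can be sketched with a plain `ConcurrencyLimiter`: callers beyond the permit limit wait in the queue instead of failing, so bulk connects ramp rather than burst. This is an illustration under assumptions, not RateLimiterClientConnectGate itself: `ConnectAsync` is a hypothetical stand-in for the initial connect, and the permit limit of 2 is arbitrary.

```csharp
using System;
using System.Linq;
using System.Threading.RateLimiting;
using System.Threading.Tasks;

using var gate = new ConcurrencyLimiter(new ConcurrencyLimiterOptions
{
    PermitLimit = 2,                 // hypothetical: at most two connects in flight
    QueueLimit = int.MaxValue,       // effectively unbounded: excess callers wait
    QueueProcessingOrder = QueueProcessingOrder.OldestFirst
});

object sync = new object();
int inFlight = 0, peak = 0;

// Hypothetical stand-in for the initial connect of one session.
async Task ConnectAsync(int id)
{
    // Waits for a permit; the lease is released when the method exits.
    using RateLimitLease lease = await gate.AcquireAsync(1);
    lock (sync) { inFlight++; peak = Math.Max(peak, inFlight); }
    await Task.Delay(20);            // stands in for the session handshake
    lock (sync) { inFlight--; }
}

await Task.WhenAll(Enumerable.Range(0, 6).Select(ConnectAsync));
Console.WriteLine($"peak concurrent connects: {peak}");
```

All six calls complete, but the observed peak never exceeds the permit limit.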
…r param The P4 connect gate added an IClientConnectGate parameter to the ManagedSession constructor; two reflection-based test helpers (ManagedSessionComplianceTests, ManagedSessionTests) locate that ctor by an exact type array and now include the new parameter (+ a trailing null arg). Resolves 52 client-test failures (NRE in [SetUp] from a null ConstructorInfo). Full Client.Tests: 1570 pass.
The reconnect lock (SemaphoreSlim) WaitAsync at Session.cs:3215 runs after ThrowIfDisposed but can still race disposal during teardown: closing a session mid-reconnect disposes m_reconnectLock while the WaitAsync is pending, throwing an unobserved ObjectDisposedException that crashed the load-test host (Test Run Failed despite Passed: 1). Guard the acquire with the file's established 'catch (ObjectDisposedException) when (Disposed)' pattern (see line ~4152), returning gracefully since a reconnect on a disposed session is moot. The lock is not acquired when WaitAsync throws, so the early return is safe. 179 reconnect/ManagedSession tests pass. A deterministic unit test is impractical (requires disposal to interleave precisely across the ThrowIfDisposed/WaitAsync window).
Exercise the System.Threading.RateLimiting dependency under the AOT test project (TUnit): server token-bucket connection limiter (admit/reject), session-establishment concurrency limiter (acquire/reject/release), client connect gate (AcquireAsync/release), and the adaptive reconnect policy (IsServerBusySignal + busy backoff). AOT project builds with zero IL/trim warnings and the 4 tests pass; native AOT compilation is CI-gated via -p:AotTest=true.
Server: CreateServerTooBusyException now encodes the retry-after as a machine-readable 'RetryAfterMs=N' token in the fault's AdditionalInfo (via a ServiceResult) in addition to the human-readable message. Client: ReconnectPolicy.ParseServerRetryAfter extracts the token; ConnectionStateMachine tracks the last fault's AdditionalInfo and passes the parsed retry-after to IAdaptiveReconnectPolicy.GetNextDelay (previously null), so a cooperating client honors the server's precise hint as a backoff floor. Note (documented): OPC UA faults carry AdditionalInfo to the client only when diagnostics are requested, and concurrency limiters do not estimate a retry-after, so the hint is best-effort; the reliable path remains the server-busy signal backoff. Parser is cross-TFM (manual digit scan, capped at one day). 5 new parse tests; 15 client rate-limiting tests pass.
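A sketch of the kind of cross-TFM parsing this commit describes: locate the `RetryAfterMs=` token, scan digits manually, and cap the result at one day. `TryReadRetryAfterMs` is a hypothetical name, not ReconnectPolicy.ParseServerRetryAfter, and treating a zero value as "no hint" is an assumption.

```csharp
using System;

// Hypothetical parser: returns null when no usable token is present.
static TimeSpan? TryReadRetryAfterMs(string text)
{
    const string Token = "RetryAfterMs=";
    const long MaxMs = 24L * 60 * 60 * 1000;   // cap at one day

    if (string.IsNullOrEmpty(text))
    {
        return null;
    }
    int start = text.IndexOf(Token, StringComparison.Ordinal);
    if (start < 0)
    {
        return null;
    }

    long ms = 0;
    int digits = 0;
    for (int i = start + Token.Length; i < text.Length && text[i] >= '0' && text[i] <= '9'; i++)
    {
        // Clamp on every step so an arbitrarily long digit run cannot overflow.
        ms = Math.Min(ms * 10 + (text[i] - '0'), MaxMs);
        digits++;
    }

    // Assumption: a missing or zero value carries no hint.
    if (digits == 0 || ms == 0)
    {
        return null;
    }
    return TimeSpan.FromMilliseconds(ms);
}

TimeSpan? hint = TryReadRetryAfterMs("Server too busy. RetryAfterMs=2000");
TimeSpan? none = TryReadRetryAfterMs("Server too busy.");
TimeSpan? huge = TryReadRetryAfterMs("RetryAfterMs=99999999999999999999");
Console.WriteLine($"{hint} | {none.HasValue} | {huge}");
```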
Deterministic in-process tests (ServerFixture<StandardServer>): an always-rejecting IServerRateLimiterProvider makes CreateSession and ActivateSession fail fast with BadServerTooBusy, and the retry-after hint is encoded in the fault AdditionalInfo (RetryAfterMs=2000), validating the server-encode -> client-parse round-trip. A control test confirms creation succeeds when admission is permissive. 3 tests pass.
- Merge IAdaptiveReconnectPolicy into IReconnectPolicy as bool TryGetNextDelay(attempt, lastStatus, serverRetryAfter, out delay, ct); false => no adaptive behavior and the caller falls back to GetNextDelay. Delete IAdaptiveReconnectPolicy (new in 2.0, no obsolete/migration). Update ReconnectPolicy, ConnectionStateMachine, tests, and docs.
- ManagedSession.HandleConnectAsync: use the local session instead of InnerSession for the just-created session (hoist declaration to method scope).
- ServerScalability.md: rewrite the admission-control note to reflect the post-merge state and drop phase/task identifiers.
- Fix pre-existing CS8632 in RateLimitingAotTests (nullable-disabled project used out IDisposable? annotations).
…r budget
A parked Publish (waiting for notifications) now releases its processing worker at the park point instead of occupying one of MaxRequestThreadCount worker slots for the whole wait, so the worker/thread budget no longer scales with session count.
- Opc.Ua.Types: new IRequestParkSink; RequestLifetime.ParkSink carrier.
- Opc.Ua.Core: internal RequestParkSink + IParkableIncomingRequest; EndpointIncomingRequest creates the sink and flows it onto the RequestLifetime; ServerBase.RequestQueue awaits park-or-complete (Task.WhenAny) and releases the active-worker slot at park, observing the detached parked task fault-safely. New ServerConfiguration.DecoupleHeldPublishRequests (default true) escape hatch, plumbed via InitializeRequestQueue.
- Opc.Ua.Server: StandardServer.PublishAsync flows the sink; SubscriptionManager.PublishAsync and SessionPublishQueue.PublishAsync gain IRequestParkSink? overloads; NotifyParked() is raised at the single park site (queuing the incomplete Tcs.Task). Response delivery, ordering, timeout and cancellation are unchanged.
net10 + net48 clean.
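The park-or-complete pattern can be sketched with a semaphore as the worker budget and a `TaskCompletionSource` as the park signal. Every name here (`DispatchAsync`, `HeldPublishAsync`, `ReadAsync`, `parkSignal`) is hypothetical and only illustrates the Task.WhenAny idea; it is not the stack's RequestQueue or IRequestParkSink code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

var workerSlots = new SemaphoreSlim(1, 1);   // a pool with a single worker
var notification = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);

// A held Publish: reports that it parked, then waits for a notification.
async Task HeldPublishAsync(TaskCompletionSource<bool> parkSignal)
{
    parkSignal.TrySetResult(true);
    await notification.Task;
}

// Any other request completes without ever parking.
Task ReadAsync(TaskCompletionSource<bool> parkSignal) => Task.CompletedTask;

// Holds a worker slot only until the request completes or parks, whichever comes first.
async Task<Task> DispatchAsync(Func<TaskCompletionSource<bool>, Task> handler)
{
    await workerSlots.WaitAsync();
    Task work;
    try
    {
        var parkSignal = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
        work = handler(parkSignal);
        await Task.WhenAny(work, parkSignal.Task);
    }
    finally
    {
        workerSlots.Release();   // slot is free again at the park point
    }
    return work;                 // the caller still observes the detached task
}

Task held = await DispatchAsync(HeldPublishAsync);   // returns once the Publish has parked
Task read = await DispatchAsync(ReadAsync);          // the single worker serves the next request
bool heldWasParked = !held.IsCompleted;
Console.WriteLine($"held parked: {heldWasParked}, read completed: {read.IsCompleted}");

notification.TrySetResult(true);   // a notification arrives and the held Publish finishes
await held;
```

If the slot were held until the Publish completed, the second dispatch would wait forever on the one-worker pool; releasing at the park point is what lets it run.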
- Opc.Ua.Core.Tests RequestQueueTests: a parked request releases its worker so a single worker services the next request while the first is still parked; the DecoupleHeldPublishRequests=false path keeps the worker blocked; 200 parked requests are served by a 2-worker pool.
- Opc.Ua.Server.Tests SessionPublishQueueTests: NotifyParked is raised exactly once when a Publish parks and never when it returns/faults immediately.
…h hot path free Only requests that can actually park (Publish long-polls) allocate a park sink and take the park-or-complete worker path; every other request (Read/Write/Browse/...) keeps the legacy inline path with no extra per-request allocation or Task.WhenAny. This makes the dominant non-Publish request path byte-for-byte equivalent to before the change, so there is no fast-path regression, while held Publishes still release their worker at the park point. ParkSink is now nullable (null = cannot park).
…nt-rate-limiting # Conflicts: # Docs/README.md # Docs/ServerScalability.md
…tion
P0 (docs + spec proposal):
- Docs/RetryAfterSignaling.md: survey of diagnostics-independent retry-after
carriers (ResponseHeader.additionalHeader, HTTP Retry-After, UA-TCP ERR reason,
Server.ServiceLevel) with stack integration points and recommendation tiers.
- Docs/proposals/RetryAfter.md: formal OPC UA spec-change proposal.
- Docs/README.md, Docs/RateLimiting.md: link the new docs.
P1 (client foundation): make a retry-after hint reach the adaptive reconnect
policy regardless of returnDiagnostics and exception re-wrapping.
- ManagedSession.ToAttemptFailure preserves a ServiceResultException's Result
(new ServiceResult(ex) would overwrite AdditionalInfo with e.Message).
- ConnectionStateMachine.GetAdaptiveDelay parses the retry-after from the fault
AdditionalInfo or, for transport-level signals, the localized message.
- Tests: ReconnectHonorsRetryAfterFrom{AdditionalInfo,LocalizedMessage}.
- ServerScalability.md: rewrite as user-facing documentation: drop tracking IDs (B1-B5/S1-S4) and status/temporal wording (delivered/now/roadmap/options), and describe the server's scalability characteristics, admission controls, and the held-Publish decoupling as current functionality.
- Remove the delegating PublishAsync compat overloads (2.0-only API, no migration needed) in favor of the IRequestParkSink signature throughout: ISubscriptionManager, SubscriptionManager, SessionPublishQueue now expose only the park-sink method; test/benchmark callers updated to pass the sink (null when none).
Server (AddHttpsRateLimiter default limiter): OnRejected now sets the standard HTTP Retry-After response header from the rejected lease's RetryAfter metadata, so a 429 tells a cooperating client when to retry without OPC UA diagnostics. Client (HttpsTransportChannel.SendRequestAsync): an HTTP 429/503 response is now translated to BadServerTooBusy before EnsureSuccessStatusCode, honoring the Retry-After header (delta-seconds or HTTP-date) as a machine-readable RetryAfterMs=N hint the adaptive reconnect policy consumes (via P1). Tests: GetRetryAfter (delta/date/absent) and CreateServerTooBusyException (with/without hint) in HttpsTransportChannelTests.
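The two Retry-After forms (delta-seconds and HTTP-date) can both be read with `RetryConditionHeaderValue` from System.Net.Http.Headers. The helper below is a hypothetical sketch, not the HttpsTransportChannel implementation; taking the reference time as a parameter and ignoring non-positive or past values are assumptions made to keep it deterministic.

```csharp
using System;
using System.Net.Http.Headers;

// Hypothetical helper: converts a Retry-After header value into a delay relative to utcNow.
static TimeSpan? ParseRetryAfterHeader(string headerValue, DateTimeOffset utcNow)
{
    if (!RetryConditionHeaderValue.TryParse(headerValue, out var parsed) || parsed is null)
    {
        return null;   // absent or malformed header
    }
    if (parsed.Delta is TimeSpan delta)
    {
        // delta-seconds form, e.g. "120"
        return delta > TimeSpan.Zero ? delta : (TimeSpan?)null;
    }
    if (parsed.Date is DateTimeOffset date && date > utcNow)
    {
        // HTTP-date form: the delay is the distance from the reference time
        return date - utcNow;
    }
    return null;
}

var reference = new DateTimeOffset(2015, 10, 21, 7, 27, 0, TimeSpan.Zero);
TimeSpan? fromDelta = ParseRetryAfterHeader("120", reference);
TimeSpan? fromDate = ParseRetryAfterHeader("Wed, 21 Oct 2015 07:28:00 GMT", reference);
TimeSpan? absent = ParseRetryAfterHeader(null, reference);
Console.WriteLine($"{fromDelta} {fromDate} {absent.HasValue}");
```

The resulting delay could then be encoded as a `RetryAfterMs=N` token, as the commit describes, so the same reconnect path consumes hints from either transport.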
Adds a diagnostics-independent structured retry-after carrier reusing the standard AdditionalParametersType in ResponseHeader.AdditionalHeader (no new DataType). RetryAfterHeader.AttachTo/Read encode/decode a whole-millisecond Int64 under the AdditionalParameterNames.RetryAfterMs key, merging with any existing additional parameters. Unit tests cover round-trip, absence, non-positive delay, rounding, and merge. This is the reusable core for emitting the retry-after on a BadServerTooBusy ServiceFault (delivered regardless of RequestHeader.ReturnDiagnostics); the CreateFault/server/client wiring follows.
…r on faults
- ServerBusyException (Core): a ServiceResultException carrying an explicit TimeSpan? RetryAfter, thrown by StandardServer.CreateServerTooBusyException.
- EndpointBase.CreateFault attaches the retry-after to the fault's ResponseHeader.AdditionalHeader (via RetryAfterHeader) when the exception is a ServerBusyException, so it is delivered independently of ReturnDiagnostics. The CallAsync catch passes the typed exception straight to CreateFault.
- Tests: CreateFault attaches/omits the header (EndpointBaseTests); the admission path throws ServerBusyException with RetryAfter=2000ms and keeps the legacy AdditionalInfo token (SessionAdmissionRateLimitTests, now CatchAsync to allow the subclass).
The diagnostics-independent server emit is complete; client honoring of the fault AdditionalHeader is the remaining step (see plan.md P3 blueprint).
…Header ClientBase.ValidateResponse(ResponseHeader) now reads a RetryAfterHeader from a bad response and re-emits it as a machine-readable RetryAfterMs=N AdditionalInfo token (only when diagnostics did not already provide one), so the adaptive reconnect policy honors it via the P1 plumbing. The source-generated clients call ValidateResponse(genericResponse.ResponseHeader) for every service, so this covers CreateSession/ActivateSession and all other calls. The change is additive: behavior is unchanged for responses without a RetryAfter header. Tests: ValidateResponse surfaces the hint on a bad response and ignores it on a good one (ClientBaseTests). P3 (structured retry-after via ResponseHeader.additionalHeader) is now complete end-to-end: server emit + client honor, delivered independently of diagnostics.
Update RetryAfterSignaling.md with an implementation-status section and RateLimiting.md follow-ups to reflect that the diagnostics-independent retry-after carriers (ResponseHeader.additionalHeader and HTTP Retry-After) are implemented; UA-TCP Error reason and dynamic ServiceLevel remain planned.
…owIfNull in RetryAfterHeader
Update Server.ServiceLevel from session capacity headroom and add calculator coverage. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Parse RetryAfterMs hints from transient UA-TCP ERR failures and apply them as a lower bound for managed channel reconnect backoff. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ServerBusyException: add a justified CA1032 suppression (the type must carry a ServiceResult; the standard parameterless/message-only ctors are intentionally omitted). - Docs: RetryAfterSignaling.md + RateLimiting.md now record the UA-TCP client ERR retry-after honoring (P4) and the load-based Server.ServiceLevel (P5) as delivered, with remaining items (standardized field, server-emitted UA-TCP ERR on connection reject, client proactive ServiceLevel reselection) as planned. Validation this pass: Core Stack tests 2298 pass (incl. 11 new RetryAfterHint tests); ServiceLevel calculator + session-admission + session-churn server tests pass; Core and Server build clean on net48 and net10.0.
…ing' into copilot/server-client-rate-limiting
Pull request overview
Introduces configurable admission control/rate limiting across server and client to mitigate connect/session storms. On the server side, this adds transport-level connection admission limits, session-establishment concurrency limits, structured retry-after signaling, and a dynamic Server.ServiceLevel. On the client side, this adds server-signal-aware adaptive reconnect backoff and an optional connect gate to throttle bulk connects.
Changes:
- Server: configurable listener backlog + connection/session admission limiters (`System.Threading.RateLimiting`), `BadServerTooBusy` signaling with retry-after, and `Server.ServiceLevel` scaling.
- Client: adaptive reconnect backoff (`IReconnectPolicy.TryGetNextDelay`) honoring retry-after hints + optional connect admission gate (`IClientConnectGate`).
- Tests/docs: new unit/integration/AOT tests and documentation describing rate limiting + retry-after signaling.
Reviewed changes
Copilot reviewed 60 out of 60 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| Tests/Opc.Ua.Server.Tests/SessionAdmissionRateLimitTests.cs | Integration tests for session-admission rejection and retry-after hint. |
| Tests/Opc.Ua.Server.Tests/ServerServiceLevelCalculatorTests.cs | Unit tests for load-based Server.ServiceLevel calculation/hysteresis. |
| Tests/Opc.Ua.Server.Tests/RateLimiting/ServerRateLimiterTests.cs | Unit tests for server connection/session limiters and provider behavior. |
| Tests/Opc.Ua.Core.Tests/Stack/Transport/HttpsTransportChannelTests.cs | Tests mapping HTTP 429/503 + Retry-After to BadServerTooBusy. |
| Tests/Opc.Ua.Core.Tests/Stack/Server/RetryAfterHeaderTests.cs | Tests for ResponseHeader.AdditionalHeader retry-after carrier. |
| Tests/Opc.Ua.Core.Tests/Stack/Server/EndpointBaseTests.cs | Tests that faults attach/omit retry-after header appropriately. |
| Tests/Opc.Ua.Core.Tests/Stack/Client/RetryAfterHintTests.cs | Tests parsing/applying retry-after hints for reconnect backoff. |
| Tests/Opc.Ua.Core.Tests/Stack/Client/ClientBaseTests.cs | Tests ValidateResponse surfacing retry-after from AdditionalHeader. |
| Tests/Opc.Ua.Client.Tests/Session/ManagedSessionTests.cs | Updates reflection-based ctor tests for new connect-gate parameter. |
| Tests/Opc.Ua.Client.Tests/Session/ManagedSessionComplianceTests.cs | Updates compliance tests for new connect-gate parameter. |
| Tests/Opc.Ua.Client.Tests/Session/ConnectionStateMachineTests.cs | Tests reconnect honoring retry-after parsed from service results. |
| Tests/Opc.Ua.Client.Tests/ClientConnectRateLimiterTests.cs | Tests client connect gate concurrency behavior. |
| Tests/Opc.Ua.Client.Tests/AdaptiveReconnectPolicyTests.cs | Tests adaptive reconnect policy backoff classification + hint handling. |
| Tests/Opc.Ua.Bindings.Https.WebApi.Tests/DependencyInjection/OpcUaHttpsBuilderExtensionsTests.cs | Tests DI extensions for Kestrel HTTPS rate limiter contributor. |
| Tests/Opc.Ua.Aot.Tests/RateLimitingAotTests.cs | AOT coverage for new rate-limiting paths. |
| Stack/Opc.Ua.Core/Stack/Transport/TransportListenerSettings.cs | Adds listener backlog + connection limiter settings. |
| Stack/Opc.Ua.Core/Stack/Transport/IConnectionRateLimiter.cs | New interface for transport connection admission control. |
| Stack/Opc.Ua.Core/Stack/Tcp/UaSCBinaryTransportChannel.cs | Captures server retry-after hints from UA-TCP failures for reconnect. |
| Stack/Opc.Ua.Core/Stack/Tcp/UaSCBinaryChannel.cs | Preserves UA-TCP ERR reason in ServiceResult for hint parsing. |
| Stack/Opc.Ua.Core/Stack/Tcp/TcpTransportListener.cs | Uses configured backlog + applies connection admission limiter on accept. |
| Stack/Opc.Ua.Core/Stack/Server/ServerBusyException.cs | New exception carrying retry-after for fault-building. |
| Stack/Opc.Ua.Core/Stack/Server/ServerBase.cs | Hook for derived servers to configure transport listener settings. |
| Stack/Opc.Ua.Core/Stack/Server/RetryAfterHeader.cs | Adds retry-after to ResponseHeader.AdditionalHeader parameters. |
| Stack/Opc.Ua.Core/Stack/Server/EndpointBase.cs | Attaches retry-after header on ServerBusyException faults. |
| Stack/Opc.Ua.Core/Stack/Https/HttpsTransportChannel.cs | Maps HTTP throttling + Retry-After into BadServerTooBusy with token. |
| Stack/Opc.Ua.Core/Stack/Client/ClientBase.cs | Surfaces retry-after from AdditionalHeader into AdditionalInfo token. |
| Stack/Opc.Ua.Core/Stack/Client/Channels/RetryAfterHint.cs | Parses retry-after tokens and applies as reconnect-delay lower bound. |
| Stack/Opc.Ua.Core/Stack/Client/Channels/Internal/ChannelEntry.cs | Applies consumed server retry-after hint to channel reconnect delay. |
| Stack/Opc.Ua.Core/Stack/Client/Channels/IChannelReconnectPolicy.cs | Adds internal IServerRetryAfterHintProvider interface for channels. |
| Stack/Opc.Ua.Core/Security/Constants/AdditionalParameterNames.cs | Adds RetryAfterMs additional-parameter name constant. |
| Stack/Opc.Ua.Core/Opc.Ua.Core.csproj | Adds System.Threading.RateLimiting package reference. |
| Stack/Opc.Ua.Bindings.Https/Https/HttpsRateLimiterStartupContributor.cs | Adds ASP.NET Core rate limiting + emits HTTP Retry-After on rejection. |
| Stack/Opc.Ua.Bindings.Https/DependencyInjection/OpcUaHttpsBuilderExtensions.cs | Adds AddHttpsRateLimiter DI/fluent integration. |
| Libraries/Opc.Ua.Server/Server/StandardServer.cs | Wires server admission control into Create/ActivateSession + transport settings. |
| Libraries/Opc.Ua.Server/Server/ServerServiceLevelCalculator.cs | Implements load-based Server.ServiceLevel calculation. |
| Libraries/Opc.Ua.Server/Server/ServerInternalData.cs | Updates Server.ServiceLevel on session create/close with hysteresis. |
| Libraries/Opc.Ua.Server/RateLimiting/TokenBucketConnectionRateLimiter.cs | Implements connection admission limiter using token bucket. |
| Libraries/Opc.Ua.Server/RateLimiting/ServerRateLimitOptions.cs | Defines server rate limiting configuration options. |
| Libraries/Opc.Ua.Server/RateLimiting/IServerRateLimiterProvider.cs | Provider interface for server admission control limiters. |
| Libraries/Opc.Ua.Server/RateLimiting/DefaultServerRateLimiterProvider.cs | Default provider implementation using System.Threading.RateLimiting. |
| Libraries/Opc.Ua.Server/Opc.Ua.Server.csproj | Adds System.Threading.RateLimiting package reference. |
| Libraries/Opc.Ua.Server/Hosting/OpcUaServerOptions.cs | Adds hosted-service option callback to configure rate limits. |
| Libraries/Opc.Ua.Server/Hosting/OpcUaServerHostedService.cs | Applies DI/provider or options callback to server rate limiting at startup. |
| Libraries/Opc.Ua.Client/Session/Session.cs | Teardown fix: swallow benign ObjectDisposedException on reconnect lock wait. |
| Libraries/Opc.Ua.Client/Session/ReconnectPolicy.cs | Implements adaptive delay API and retry-after parsing. |
| Libraries/Opc.Ua.Client/Session/RateLimiterClientConnectGate.cs | Implements connect admission gate backed by RateLimiter. |
| Libraries/Opc.Ua.Client/Session/ManagedSessionOptions.cs | Adds connect-gate options for client connect throttling. |
| Libraries/Opc.Ua.Client/Session/ManagedSession.cs | Uses connect gate; preserves ServiceResultException.Result for retry-after. |
| Libraries/Opc.Ua.Client/Session/IReconnectPolicy.cs | Extends reconnect policy API with adaptive delay method. |
| Libraries/Opc.Ua.Client/Session/IClientConnectGate.cs | New interface for client connect admission gating. |
| Libraries/Opc.Ua.Client/Session/ConnectionStateMachine.cs | Uses adaptive policy API + parses retry-after hints from last attempt. |
| Libraries/Opc.Ua.Client/Opc.Ua.Client.csproj | Adds System.Threading.RateLimiting package reference. |
| Libraries/Opc.Ua.Client/Fluent/OpcUaClientBuilderExtensions.cs | Plumbs connect gate from options/DI into fluent builder. |
| Libraries/Opc.Ua.Client/Fluent/ManagedSessionBuilder.cs | Adds fluent connect-rate-limiter configuration overloads. |
| Docs/ServerScalability.md | Updates scalability analysis with delivered admission control/backpressure. |
| Docs/RetryAfterSignaling.md | New design survey for diagnostics-independent retry-after carriers. |
| Docs/README.md | Links new rate limiting + retry-after signaling documentation. |
| Docs/RateLimiting.md | New documentation for server/client rate limiting, DI, and hints. |
| Docs/proposals/RetryAfter.md | New specification proposal draft for structured retry-after signaling. |
| Directory.Packages.props | Adds centralized package version for System.Threading.RateLimiting. |
…anch Integrates PR #3950 "Server: decouple held long-poll Publishes from the request-queue worker budget" (the S1 scalability item) into this branch. Conflict resolution: Docs/ServerScalability.md - took #3950's concise rewrite (its "Admission control and rate limiting" section already summarizes the rate-limiting + retry-after work, and it references RateLimiting.md for the full surface) and added a See-also link to RetryAfterSignaling.md. All code files auto-merged cleanly (StandardServer.cs regions are disjoint).
…y numbers
With DecoupleHeldPublishRequests default-on (S1), a held Publish releases its request-processing worker at the park point, so the worker pool no longer has to scale with the session count. An A/B load-test campaign (ServerManySessionsLoadTest, Xeon W-2235 6c/12t, quiet machine) shows a ~200-worker pool cleanly establishes and serves ~4000 sessions (8-17x the ~140-350 the coupled-off path manages at the same budget), while a session-count-sized pool (10500) is ~2x slower to establish and reaches a LOWER ceiling (thread oversubscription during the RSA-handshake burst).
- LoadTest fixture: MaxRequestThreadCount 10500->200, MinRequestThreadCount 200->50, with the rationale comment rewritten for the S1 small-pool model.
- Benchmarks.md: refresh the server-session scalability table (2000/2500/4000 now establish cleanly; 10000 remains the establishment wall) and the MaxRequestThreadCount guidance (size to active concurrency, not session count).
ServerScalability.md prose was rewritten upstream (8f7a95d); measured numbers now live in Benchmarks.md, which that doc references.
… into copilot/server-client-rate-limiting
| 2000 | No (*) | — | — |
| 2000 | Yes | 14 | Yes |
| 2500 | Yes | 13 | Yes |
| 4000 | Yes | 11 | Yes |
Retest - it would be great to see the ceiling (i.e. is it 8000 or 5000 or 4000)?
Finding the exact ceiling requires running the [Explicit] ServerManySessionsLoadTestAsync macro on dedicated hardware (client and server on separate machines); I left this thread open for a follow-up run.
Rerun the other values as well to update the entire table. The goal is to find the ceiling on the machine mentioned.
- StandardServer.RateLimiterProvider setter now disposes a server-owned provider before replacing it, so its RateLimiter timers are not leaked.
- SessionAdmissionRateLimitTests.TearDown disposes the provider being replaced (the tests assign caller-owned providers the server does not dispose).
- ClientBase.ValidateResponse surfaces the AdditionalHeader retry-after by checking for token presence (not empty AdditionalInfo) and merges the token into any existing AdditionalInfo instead of overwriting it; added merge and keep-existing-token tests.
- ServerRateLimitOptions: connection-limit XML docs corrected to describe the server-wide (single-bucket) behavior instead of per-remote.
- RetryAfterSignaling.md: candidate-mechanism table + prose updated to reflect the implemented additionalHeader / HTTP Retry-After / UA-TCP ERR / ServiceLevel carriers (was internally inconsistent with the Implementation status section).
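The merge-instead-of-overwrite behavior described for ClientBase.ValidateResponse could look roughly like the sketch below. It is an illustration only: MergeRetryAfter is a hypothetical helper, the "; " separator is an assumption, and the only detail taken from this PR is the RetryAfterMs=N token shape. It assumes .NET 6 or later with top-level statements.

```csharp
#nullable enable
using System;
using System.Globalization;

// Hypothetical helper, not the PR's code: merge a retry-after token into an
// existing AdditionalInfo string instead of overwriting it.
static string MergeRetryAfter(string? additionalInfo, long retryAfterMs)
{
    const string key = "RetryAfterMs=";
    string token = key + retryAfterMs.ToString(CultureInfo.InvariantCulture);

    // Nothing to preserve: the token becomes the whole AdditionalInfo.
    if (string.IsNullOrEmpty(additionalInfo))
    {
        return token;
    }

    // Decide on token presence, not on whether AdditionalInfo is empty:
    // an existing token is kept as-is rather than duplicated.
    if (additionalInfo.IndexOf(key, StringComparison.Ordinal) >= 0)
    {
        return additionalInfo;
    }

    // Assumed separator; the real format may differ.
    return additionalInfo + "; " + token;
}

Console.WriteLine(MergeRetryAfter(null, 1500));               // RetryAfterMs=1500
Console.WriteLine(MergeRetryAfter("queue full", 1500));       // queue full; RetryAfterMs=1500
Console.WriteLine(MergeRetryAfter("RetryAfterMs=200", 1500)); // RetryAfterMs=200
```

The three calls mirror the cases the commit mentions: no existing text, a merge into existing text, and keeping an existing token.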
… docs

- Benchmarks: drop internal comparisons to earlier master states in the server-scalability footnote (keep only measured numbers and the configuration A/B sizing insight).
- RateLimiting: remove the 'Extending the transport' subsection header; drop the speculative 'Planned' follow-ups bullet and retitle the section 'Server backpressure signals'.
- ServerScalability: reword the held-Publish sentence to acknowledge a client may keep several Publish requests outstanding (the classic engine does); each parked Publish gets its own RequestParkSink and releases a worker independently.
- Remove the standalone RetryAfterSignaling.md design survey; fold a concise implementation summary into Sessions.md (reconnect section) as 'Server retry-after backpressure' and reference it from RateLimiting.md, README.md, ServerScalability.md, and proposals/RetryAfter.md.
Open a Mantis issue.
  /// <inheritdoc/>
- public ValueTask<IServiceResponse> SendRequestAsync(
+ public async ValueTask<IServiceResponse> SendRequestAsync(
This is a perf critical path and adds another state machine. Can the exception not be intercepted in a leaner way?
// The request is now parked waiting for a notification: release the
// processing worker so it does not remain blocked for the whole wait.
parkSink?.NotifyParked();
Will a client-initiated cancel of the request still work? The spec recommends that clients cancel outstanding Publish requests when closing a session.
Description
Server session scalability work: the reference server degrades gracefully under connect storms, holds many thousands of steady-state long-polls on a small worker pool, and gives clients diagnostics-independent backpressure signals so they ramp instead of hammering. This combines three related efforts, and merges in #3950, all follow-ups to the scalability analysis in #3941 (Docs/ServerScalability.md).

Server admission control / rate limiting is enabled by default with conservative limits (it sheds excess load with a fast BadServerTooBusy rather than altering steady-state behavior), and the held-Publish decoupling is on by default; the HTTPS rate limiter and the client-side connect gate are opt-in. Nothing here changes the shipped MaxSessionCount (100).

1. Admission control / rate limiting

- CreateSession/ActivateSession acquire from a concurrency limiter; over-limit requests fault fast with BadServerTooBusy before the CPU-bound certificate validation / signing, instead of queueing behind the address-space lock.
- AddHttpsRateLimiter(...) attaches an ASP.NET Core rate limiter to the HTTPS binding via dependency injection.
- System.Threading.RateLimiting, DI-injectable (IServerRateLimiterProvider, ConfigureRateLimits(...)) with a direct-construct fallback.

2. Retry-after signaling (diagnostics-independent)

A BadServerTooBusy fault's diagnostic AdditionalInfo only reaches the client when it requests diagnostics, and is dropped on client exception re-wrapping. This delivers the retry-after through carriers that do not depend on diagnostics, all feeding the client's adaptive reconnect policy:

- ResponseHeader.additionalHeader (all UA transports): a structured RetryAfterMs (Int64) in the standard AdditionalParametersType, attached to the fault by ServerBusyException + EndpointBase.CreateFault (RetryAfterHeader) and read by ClientBase.ValidateResponse.
- Retry-After (HTTPS): the rate limiter sets the header on its 429; the client maps a 429/503 with Retry-After to BadServerTooBusy.
- Error reason: the client honors a RetryAfterMs=N token in a transient server-busy ERR message as a lower bound on channel-reconnect backoff (RetryAfterHint).
- IReconnectPolicy.TryGetNextDelay (server-signal-aware adaptive backoff); a client-wide connect admission gate (WithConnectRateLimiter) ramps bulk connects.
- Server.ServiceLevel: computed from session-establishment headroom (255 at low load, scaling toward a floor as sessions approach MaxSessionCount, with hysteresis) as a proactive capacity signal clients can read/subscribe.
- Docs/Sessions.md (Server retry-after backpressure section) + a submittable OPC UA spec-change proposal: Docs/proposals/RetryAfter.md.

3. Held long-poll Publish decoupling

A parked Publish (waiting for the next subscription notification) no longer occupies a MaxRequestThreadCount worker slot for the whole wait, so a small fixed worker pool can hold many thousands of outstanding Publishes and MaxRequestThreadCount no longer has to scale with session count.

- IRequestParkSink + a one-shot RequestLifetime.ParkSink; ServerBase.RequestQueue awaits Task.WhenAny(processing, ParkedTask) and releases the active-worker slot at the park point rather than at completion. Only Publish carries a sink; every other request keeps the byte-for-byte legacy fast path (no extra allocation, no WhenAny).
- DecoupleHeldPublishRequests (default true; set false to restore the legacy inline-await worker path).

Docs

Docs/ServerScalability.md (rewritten as user-facing scalability documentation), Docs/RateLimiting.md, Docs/Sessions.md, Docs/proposals/RetryAfter.md.

Tests

Server admission (SessionAdmissionRateLimitTests); retry-after carriers (RetryAfterHeaderTests, RetryAfterHintTests, EndpointBaseTests, ClientBaseTests, HttpsTransportChannelTests, AdaptiveReconnectPolicyTests, ConnectionStateMachineTests); load-based ServiceLevel (ServerServiceLevelCalculatorTests); held-Publish decoupling (RequestQueueTests, SessionPublishQueueTests); AOT coverage (RateLimitingAotTests). Changed libraries build clean on net48 and net10.0.

Related Issues
Merges #3950 ("decouple held long-poll Publishes"); follow-up to the scalability analysis in #3941.
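For illustration, the park-point release described under "3. Held long-poll Publish decoupling" might be sketched as below. ParkSignal and WorkerBudget are hypothetical stand-ins invented for this example, not the PR's IRequestParkSink or ServerBase.RequestQueue; only the Task.WhenAny(processing, ParkedTask) shape and the NotifyParked call come from the description. The sketch assumes .NET 6 or later with top-level statements.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

var budget = new WorkerBudget(maxWorkers: 1);
var notification = new TaskCompletionSource();

// A "held Publish": signals the park point, then waits for a notification.
Task held = budget.RunAsync(async sink =>
{
    sink?.NotifyParked();
    await notification.Task;
}, canPark: true);

// With a single worker slot, this ordinary request can only run because the
// held request gave its slot back at the park point.
await budget.RunAsync(_ => Task.CompletedTask, canPark: false);
Console.WriteLine($"held Publish still pending: {!held.IsCompleted}");

notification.SetResult();
await held;
Console.WriteLine($"free worker slots: {budget.FreeSlots}");

// Hypothetical one-shot park signal raised by a request when it starts waiting.
sealed class ParkSignal
{
    private readonly TaskCompletionSource _parked =
        new(TaskCreationOptions.RunContinuationsAsynchronously);

    public Task ParkedTask => _parked.Task;

    // Only the first call has an effect.
    public void NotifyParked() => _parked.TrySetResult();
}

// Hypothetical worker budget: at most maxWorkers requests hold a slot at once.
sealed class WorkerBudget
{
    private readonly SemaphoreSlim _slots;

    public WorkerBudget(int maxWorkers) =>
        _slots = new SemaphoreSlim(maxWorkers, maxWorkers);

    public int FreeSlots => _slots.CurrentCount;

    public async Task RunAsync(Func<ParkSignal?, Task> handler, bool canPark)
    {
        await _slots.WaitAsync();

        if (!canPark)
        {
            // Legacy path: no signal, no WhenAny; the slot is held to completion.
            try { await handler(null); }
            finally { _slots.Release(); }
            return;
        }

        var signal = new ParkSignal();
        Task processing;
        try { processing = handler(signal); }
        catch (Exception ex) { processing = Task.FromException(ex); }

        // Resume at whichever comes first: completion or the park point.
        await Task.WhenAny(processing, signal.ParkedTask);
        _slots.Release();

        // Observe the result or exception without holding a worker slot.
        await processing;
    }
}
```

In this sketch the second request completes while the first is still parked, which is the property the description claims for a small fixed worker pool. Cancellation and error reporting are deliberately left out.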
Checklist