perf(init): issue telemetry subscribe concurrently with first /next (H18 experiment) #1285
…experiment) The Lambda platform ends the extension's INIT phase at the first /next call, which then long-polls until the first INVOKE. Previously the Telemetry API subscribe (~25 ms, the dominant init phase we measured) was awaited to its 200 OK before /next, serializing it on the cold-start critical path. Keep binding the telemetry listener socket synchronously (so it is serving before we proceed and no early telemetry is lost to a closed port) and keep the cancel token wired exactly as before, but issue the subscribe in a detached tokio::spawn instead of awaiting it. Init then proceeds straight to /next and the subscribe round-trip overlaps the /next long-poll wait. Both On-Demand and Managed-Instance paths are preserved. Experimental: does not address whether the platform still delivers platform.initStart/initReport/early logs if the subscription registers slightly after /next. That is the open risk this change exists to benchmark.
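The reordering described in this commit message can be illustrated with a small, dependency-free sketch. It is not the bottlecap code: it swaps `tokio::spawn` for `std::thread::spawn` so it runs without crates, and `fake_bind`, `fake_subscribe`, and `time_to_next` are hypothetical stand-ins for the listener bind, the Telemetry API subscribe round-trip, and the path up to the first /next.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical stand-in for binding the telemetry listener socket.
// In both variants this completes before init moves on.
fn fake_bind() {}

// Hypothetical stand-in for the subscribe HTTP round-trip.
fn fake_subscribe(latency: Duration) {
    thread::sleep(latency);
}

// Runs a toy init sequence and returns the time taken to reach the point
// where the extension would issue its first /next call.
fn time_to_next(detach_subscribe: bool, subscribe_latency: Duration) -> Duration {
    let start = Instant::now();

    // The bind stays synchronous, so the listener is serving before we proceed.
    fake_bind();

    if detach_subscribe {
        // H18-style: start the subscribe in the background and move on.
        // Dropping the JoinHandle detaches the thread.
        let _ = thread::spawn(move || fake_subscribe(subscribe_latency));
    } else {
        // Previous behaviour: wait for the subscribe to finish first.
        fake_subscribe(subscribe_latency);
    }

    // /next would be issued here.
    start.elapsed()
}

fn main() {
    let latency = Duration::from_millis(25);
    let awaited = time_to_next(false, latency);
    let detached = time_to_next(true, latency);
    println!("awaited: {:?}, detached: {:?}", awaited, detached);
}
```

In this toy model the detached variant reaches the /next point without paying the subscribe latency. It says nothing about what the platform reports as Init Duration, nor about whether early telemetry is still delivered.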
Tested: recommend not pursuing H18
Built this branch into an arm64 layer and benchmarked it against the integration baseline (#1284) and prod v98, across python, go, and rust ( Finding: H18 makes the extension's own init faster, but does not reduce total Rust, n=30 (the decisive case):
Same pattern on python (184 → 187, flat) and go (flat/noisy at n=17). So H18 does do what it was designed to — Why: the Plus an unresolved correctness risk: this change fires Recommendation: close / don't pursue. No Init-Duration payoff on any runtime tested, plus a telemetry-correctness risk. The remaining cold-start lever for minimal runtimes is the extension binary's load time (build/link: eager symbol binding, opt-level/size, etc.) — not further init-sequence reordering. (Benchmark detail: identical functions per runtime, arm64 / 1024 MB / |
Jira: none yet — add before marking ready.
Warning
EXPERIMENT. This is an intentionally experimental cold-start change (hypothesis H18) to be benchmarked later. It is stacked on the cold-start integration umbrella PR #1284; the base branch is jordan.gonzalez/cold-start-integration/feature, not main. Do not merge to main directly; review/merge into the integration branch.

Overview
Take the Telemetry API subscribe round-trip off the cold-start critical path by issuing it concurrently with the first /next, instead of awaiting its 200 OK before the extension reaches extension::next_event (/next).

Mechanism. The Lambda Extensions API ends the extension's INIT phase at the first /next call. That call then long-polls until the first INVOKE arrives (≫ the ~25 ms subscribe round-trip, which our per-phase init checkpoints measured as the single dominant init phase). Previously the subscribe was awaited to its 200 OK before /next, serializing it on the critical path. This change issues telemetry::subscribe(...) in a detached tokio::spawn and lets init proceed straight to /next, so the subscribe HTTP round-trip overlaps the /next long-poll wait rather than running ahead of it. Target: shave the dominant ~25 ms init phase off Init Duration / ext-ready-in.

What changed (
bottlecap/src/bin/bottlecap/main.rs, setup_telemetry_client):
- listener.start().await still awaits the bind before we proceed, so the socket is already accepting connections (no early telemetry lost to a closed port).
- The cancel token is still taken from listener.cancel_token() and returned unchanged, so graceful shutdown of the spawned serve task and both shutdown paths are untouched.
- telemetry::subscribe(...) is moved into a detached tokio::spawn (owned client / runtime_api / extension_id), with its Result handled inside the task via error!. It is no longer awaited before /next.
- tokio::join!(telemetry_setup, build_services) (from H2) and all downstream wiring are identical. Both On-Demand and Managed-Instance paths are preserved; they share the same telemetry_listener_cancel_token.

This is a behavior change limited to when the subscribe completes relative to /next. Everything else (logs channel, all services, cancel-token shutdown wiring) is unchanged.

Relationship to H2
H2 (already on the integration branch) overlaps the subscribe with service construction via tokio::join!. H18 goes further: it removes the subscribe from the critical path entirely by not awaiting it before /next. The two compose cleanly: the bind still runs concurrently with construction; only the subscribe await is removed.

OPEN RISK
We deliberately do not solve the correctness question here. When the subscription registers slightly after /next, the platform may already have emitted platform.initStart / platform.initReport and early function logs the moment INIT ended; those could be delivered before the subscription exists and be lost. The listener socket is bound before we return, so nothing is dropped for a closed-port reason; the only exposure is the platform-side registration race. Closing (or proving harmless) that race is exactly what the benchmark below is for.

Testing
Validation plan (to run before this is marked ready), on a minimal provided.al2023 / Go function:
- Confirm aws.lambda.enhanced.init_duration is still emitted (i.e. platform.initReport is still received after the reordered subscribe).
- Confirm no platform.logsDropped telemetry event is observed (no early/init logs lost to the registration race).
- Compare the INIT | phase=... checkpoints against the integration branch baseline to quantify the savings from overlapping the subscribe with the /next long-poll.

If init telemetry is lost or platform.logsDropped appears, this hypothesis is rejected (or needs a guard) and should not advance.
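The rejection rule above could be expressed as a small check over the telemetry event types captured during one cold start. This is a sketch under assumptions: `Verdict` and `check_cold_start` are hypothetical names, and the input is assumed to be the `type` string of each event the listener received, collected by whatever harness runs the benchmark.

```rust
// Outcome of the hypothetical validation check for one cold start.
#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    MissingInitReport,
    LogsDropped,
}

// Applies the rejection rule: any platform.logsDropped event, or a missing
// platform.initReport, means the hypothesis should not advance.
fn check_cold_start(event_types: &[&str]) -> Verdict {
    if event_types.iter().any(|t| *t == "platform.logsDropped") {
        return Verdict::LogsDropped;
    }
    if !event_types.iter().any(|t| *t == "platform.initReport") {
        return Verdict::MissingInitReport;
    }
    Verdict::Pass
}

fn main() {
    // Illustrative inputs only, not captured data.
    let healthy = ["platform.initStart", "platform.initReport", "platform.start"];
    let raced = ["platform.start", "platform.report"];
    println!("{:?}", check_cold_start(&healthy));
    println!("{:?}", check_cold_start(&raced));
}
```

The check covers only the two correctness conditions; comparing the INIT | phase=... checkpoints against the baseline remains a separate timing measurement.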