perf(init): issue telemetry subscribe concurrently with first /next (H18 experiment) #1285

Draft
duncanista wants to merge 1 commit into jordan.gonzalez/cold-start-integration/feature from jordan.gonzalez/telemetry-subscribe-async/feature

@duncanista

Contributor

Jira: none yet — add before marking ready.

Warning

EXPERIMENT. This is an intentionally experimental cold-start change (hypothesis H18) to be benchmarked later. It is stacked on the cold-start integration umbrella PR #1284 — base branch is jordan.gonzalez/cold-start-integration/feature, not main. Do not merge to main directly; review/merge into the integration branch.

Overview

Take the Telemetry API subscribe round-trip off the cold-start critical path by issuing it concurrently with the first /next, instead of awaiting its 200 OK before the extension reaches extension::next_event (/next).

Mechanism. The Lambda Extensions API ends the extension's INIT phase at the first /next call. That call then long-polls until the first INVOKE arrives, a wait far longer than the ~25 ms subscribe round-trip, which our per-phase init checkpoints measured as the single dominant init phase. Previously the subscribe was awaited to its 200 OK before /next, serializing it on the critical path. This change issues telemetry::subscribe(...) in a detached tokio::spawn and lets init proceed straight to /next, so the subscribe HTTP round-trip overlaps the /next long-poll wait rather than running ahead of it. Target: shave the dominant ~25 ms init phase off Init Duration / ext-ready-in.

What changed (bottlecap/src/bin/bottlecap/main.rs, setup_telemetry_client):

  • Kept binding the telemetry listener socket synchronously — listener.start().await still awaits the bind before we proceed, so the socket is already accepting connections (no early telemetry lost to a closed port).
  • Kept the listener cancel token wired exactly as before: still obtained via listener.cancel_token() and returned unchanged, so graceful shutdown of the spawned serve task and both shutdown paths are untouched.
  • Moved telemetry::subscribe(...) into a detached tokio::spawn (owned client/runtime_api/extension_id), with its Result handled inside the task via error!. It is no longer awaited before /next.
  • The function signature is unchanged, so the existing tokio::join!(telemetry_setup, build_services) (from H2) and all downstream wiring are identical. Both On-Demand and Managed-Instance paths are preserved — they share the same telemetry_listener_cancel_token.
  • Updated the call-site doc comment to describe the reorder and reference the open risk.

This is a behavior change limited to when the subscribe completes relative to /next. Everything else (logs channel, all services, cancel-token shutdown wiring) is unchanged.
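
As a rough illustration of the reorder described above, here is a minimal std-only sketch. It is not the code in main.rs: it uses std::thread::spawn as a stand-in for tokio::spawn so it runs without dependencies, and subscribe, ready_in_awaited, and ready_in_detached are hypothetical names. The 25 ms sleep only mimics the measured round-trip.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical stand-in for the Telemetry API subscribe round-trip.
// The sleep mimics the ~25 ms the PR reports; no HTTP call is made.
fn subscribe() -> Result<(), String> {
    thread::sleep(Duration::from_millis(25));
    Ok(())
}

// Before: the subscribe result is waited on, so its latency is paid
// before init reaches the point where it would call /next.
fn ready_in_awaited() -> Duration {
    let start = Instant::now();
    if let Err(e) = subscribe() {
        eprintln!("telemetry subscribe failed: {e}");
    }
    start.elapsed()
}

// After: the subscribe runs in a detached task and its Result is
// handled inside that task. Init reaches /next without waiting.
fn ready_in_detached() -> (Duration, thread::JoinHandle<()>) {
    let start = Instant::now();
    let handle = thread::spawn(|| {
        if let Err(e) = subscribe() {
            eprintln!("telemetry subscribe failed: {e}");
        }
    });
    (start.elapsed(), handle)
}

fn main() {
    let awaited = ready_in_awaited();
    let (detached, handle) = ready_in_detached();
    println!("awaited: {awaited:?}, detached: {detached:?}");
    // Joined here only so the demo exits cleanly; the change in this PR
    // does not wait on the spawned task.
    handle.join().unwrap();
}
```

The sketch only models when the subscribe completes relative to the /next call. It does not model the listener bind or the cancel token, which this PR leaves unchanged.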

Relationship to H2

H2 (already on the integration branch) overlaps the subscribe with service construction via tokio::join!. H18 goes further: it removes the subscribe from the critical path entirely by not awaiting it before /next. The two compose cleanly — the bind still runs concurrently with construction; only the subscribe await is removed.

OPEN RISK

We deliberately do not solve the correctness question here. When the subscription registers slightly after /next, the platform may already have emitted platform.initStart / platform.initReport and early function logs the moment INIT ended — those could be delivered before the subscription exists and be lost. The listener socket is bound before we return, so nothing is dropped for a closed-port reason; the only exposure is the platform-side registration race. Closing (or proving harmless) that race is exactly what the benchmark below is for.

Testing

Validation plan (to run before this is marked ready) — on a minimal provided.al2023 / Go function:

  • Confirm aws.lambda.enhanced.init_duration is still emitted (i.e. platform.initReport is still received after the reordered subscribe).
  • Confirm no platform.logsDropped telemetry event is observed (no early/init logs lost to the registration race).
  • Measure the cold-start delta: compare reported Init Duration and the "Datadog Next-Gen Extension ready in …ms" / INIT | phase=... checkpoints against the integration branch baseline to quantify the savings from overlapping the subscribe with the /next long-poll.

If init telemetry is lost or platform.logsDropped appears, this hypothesis is rejected (or needs a guard) and should not advance.
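
The rejection rule above could be expressed as a small check over the Telemetry API event types captured during a cold start. This is a sketch under assumptions: Verdict and check_events are hypothetical names, the input is assumed to be the list of event `type` strings already extracted from the received payloads, and it does not check the aws.lambda.enhanced.init_duration metric itself, only the platform.initReport event it depends on.

```rust
// Outcome of the validation check for one cold start.
#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    MissingInitReport,
    LogsDropped,
}

// Hypothetical helper: inspects the event types received by the
// telemetry listener during a single cold start.
fn check_events(event_types: &[&str]) -> Verdict {
    // Any platform.logsDropped event rejects the hypothesis outright.
    if event_types.iter().any(|t| *t == "platform.logsDropped") {
        return Verdict::LogsDropped;
    }
    // platform.initReport must still arrive after the reordered subscribe.
    if !event_types.iter().any(|t| *t == "platform.initReport") {
        return Verdict::MissingInitReport;
    }
    Verdict::Pass
}

fn main() {
    let received = ["platform.initStart", "platform.initReport", "function"];
    println!("{:?}", check_events(&received));
}
```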

…experiment)

The Lambda platform ends the extension's INIT phase at the first /next
call, which then long-polls until the first INVOKE. Previously the
Telemetry API subscribe (~25 ms, the dominant init phase we measured)
was awaited to its 200 OK before /next, serializing it on the
cold-start critical path.

Keep binding the telemetry listener socket synchronously (so it is
serving before we proceed and no early telemetry is lost to a closed
port) and keep the cancel token wired exactly as before, but issue the
subscribe in a detached tokio::spawn instead of awaiting it. Init then
proceeds straight to /next and the subscribe round-trip overlaps the
/next long-poll wait. Both On-Demand and Managed-Instance paths are
preserved.

Experimental: does not address whether the platform still delivers
platform.initStart/initReport/early logs if the subscription registers
slightly after /next. That is the open risk this change exists to
benchmark.
@datadog-datadog-prod-us1-2

datadog-datadog-prod-us1-2 Bot commented Jun 24, 2026

Pipelines

⚠️ Warnings

🚦 5 Pipeline jobs failed

DataDog/datadog-lambda-extension | integration-suite: [lmi]

DataDog/datadog-lambda-extension | e2e-test-status (amd64)

DataDog/datadog-lambda-extension | e2e-test-status (amd64, fips)

View all 5 failed jobs.

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 418d443

@duncanista

Contributor Author

Tested — recommend not pursuing H18

Built this branch into an arm64 layer and benchmarked it against the integration baseline (#1284) and prod v98, across python, go, and rust (provided.al2023) runtimes, forcing cold starts via the env-var bump trick.

Finding: H18 makes the extension's own init faster, but does not reduce total Init Duration on any runtime — including the most minimal one (rust), where the extension genuinely is the cold-start long pole.

Rust, n=30 (the decisive case):

| layer | Init Duration p50 (ms) | ext ready_in p50 (ms) |
|---|---|---|
| integration (#1284) | 100.4 | 19 |
| +H18 | 97.8 (−2.6, within noise) | 12 |

Same pattern on python (184 → 187, flat) and go (flat/noisy at n=17). So H18 does do what it was designed to — ready_in drops as the subscribe moves off the path to /next — but it buys ~nothing in the customer-visible cold start.
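
For reference, a p50 comparison like the one in the table can be computed as below. This is only a sketch: p50 and p50_delta are hypothetical helpers, and the sample values in main are made up for illustration, not the raw benchmark data.

```rust
// Median (p50) of a sample. For an even number of samples it averages
// the two middle values. Returns None for an empty sample.
fn p50(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

// Candidate p50 minus baseline p50; negative means the candidate is faster.
fn p50_delta(baseline: &[f64], candidate: &[f64]) -> Option<f64> {
    Some(p50(candidate)? - p50(baseline)?)
}

fn main() {
    // Hypothetical Init Duration samples in ms.
    let baseline = [101.0, 99.5, 100.5, 104.0];
    let candidate = [97.0, 98.5, 99.0, 96.0];
    println!("p50 delta (ms): {:?}", p50_delta(&baseline, &candidate));
}
```

A delta on its own says nothing about noise; judging that requires looking at the spread of the samples as well.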

Why: the Init Duration floor (min ~80 ms on rust) is sandbox provisioning + loading/linking the ~15 MB extension binary, not the extension's async init and not the runtime (rust runtime is ~16 ms). Once the integration changes already drop ready_in to ~19 ms, the extension sits at/below that floor — so taking the subscribe off the path has nothing left to give.

Plus an unresolved correctness risk: this change fires telemetry::subscribe detached (not awaited before /next), so the subscription can register after the platform emits platform.initStart/initReport/early logs. That could mean losing the aws.lambda.enhanced.init_duration metric or seeing platform.logsDropped events. I never verified this, because the perf result already makes it not worth shipping.

Recommendation: close / don't pursue. No Init-Duration payoff on any runtime tested, plus a telemetry-correctness risk. The remaining cold-start lever for minimal runtimes is the extension binary's load time (build/link: eager symbol binding, opt-level/size, etc.) — not further init-sequence reordering.

(Benchmark detail: identical functions per runtime, arm64 / 1024 MB / DD_LOG_LEVEL=debug, cold starts forced by bumping a dummy env var. Companion results on the integration PR #1284.)
