perf(init): issue telemetry subscribe concurrently with first /next (H18 experiment) by duncanista · Pull Request #1285 · DataDog/datadog-lambda-extension

duncanista · 2026-06-24T06:07:07Z

Jira: none yet — add before marking ready.

Warning

EXPERIMENT. This is an intentionally experimental cold-start change (hypothesis H18) to be benchmarked later. It is stacked on the cold-start integration umbrella PR #1284 — base branch is jordan.gonzalez/cold-start-integration/feature, not main. Do not merge to main directly; review/merge into the integration branch.

Overview

Take the Telemetry API subscribe round-trip off the cold-start critical path by issuing it concurrently with the first /next, instead of awaiting its 200 OK before the extension reaches extension::next_event (/next).

Mechanism. The Lambda Extensions API ends the extension's INIT phase at the first /next call. That call then long-polls until the first INVOKE arrives, a wait far longer than the ~25 ms subscribe round-trip (which our per-phase init checkpoints measured as the single dominant init phase). Previously the subscribe was awaited to its 200 OK before /next, which serialized it on the critical path. This change issues telemetry::subscribe(...) in a detached tokio::spawn and lets init proceed straight to /next, so the subscribe HTTP round-trip overlaps the /next long-poll wait rather than running ahead of it. Target: shave the dominant ~25 ms init phase off Init Duration / ext-ready-in.

What changed (bottlecap/src/bin/bottlecap/main.rs, setup_telemetry_client):

- Kept binding the telemetry listener socket synchronously: listener.start().await still awaits the bind before we proceed, so the socket is already accepting connections (no early telemetry lost to a closed port).
- Kept the listener cancel token wired exactly as before: still obtained via listener.cancel_token() and returned unchanged, so graceful shutdown of the spawned serve task and both shutdown paths are untouched.
- Moved telemetry::subscribe(...) into a detached tokio::spawn (owned client/runtime_api/extension_id), with its Result handled inside the task via error!. It is no longer awaited before /next.
- The function signature is unchanged, so the existing tokio::join!(telemetry_setup, build_services) (from H2) and all downstream wiring are identical. Both On-Demand and Managed-Instance paths are preserved; they share the same telemetry_listener_cancel_token.
- Updated the call-site doc comment to describe the reorder and reference the open risk.
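
For readers who want the ordering spelled out, here is a minimal sketch of the reorder described in the list above. It is not the bottlecap code: `std::thread::spawn` stands in for `tokio::spawn`, and `EventLog`, `bind_listener`, `setup_telemetry`, and the `subscribe_reply` channel (which models the platform's answer to the subscribe request) are all hypothetical names invented for illustration. The task handle is returned only so the example can wait on it; the PR detaches the task.

```rust
use std::sync::mpsc::Receiver;
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Shared, ordered record of what happened during the simulated init.
pub type EventLog = Arc<Mutex<Vec<&'static str>>>;

/// Hypothetical stand-in for binding the telemetry listener socket.
fn bind_listener(log: &EventLog) {
    log.lock().unwrap().push("listener_bound");
}

/// Sketch of the reordered setup: bind synchronously, subscribe in the background.
/// `subscribe_reply` models the platform's answer to the subscribe request.
pub fn setup_telemetry(
    log: &EventLog,
    subscribe_reply: Receiver<Result<(), String>>,
) -> JoinHandle<()> {
    // Step 1: the bind completes before this function returns.
    bind_listener(log);

    // Step 2: the subscribe runs off the critical path and handles its own Result.
    let task_log = Arc::clone(log);
    let task = thread::spawn(move || match subscribe_reply.recv() {
        Ok(Ok(())) => task_log.lock().unwrap().push("subscribe_ok"),
        Ok(Err(e)) => eprintln!("telemetry subscribe failed: {e}"),
        Err(_) => eprintln!("telemetry subscribe: reply channel closed"),
    });

    // Step 3: return without waiting, so the caller can go straight to /next.
    log.lock().unwrap().push("setup_returned");
    task
}
```

Calling `setup_telemetry` and only afterwards sending `Ok(())` on the channel leaves the log as `listener_bound`, `setup_returned`, `subscribe_ok`: the bind is on the critical path, the subscribe reply is not.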

This is a behavior change limited to when the subscribe completes relative to /next. Everything else (logs channel, all services, cancel-token shutdown wiring) is unchanged.

Relationship to H2

H2 (already on the integration branch) overlaps the subscribe with service construction via tokio::join!. H18 goes further: it removes the subscribe from the critical path entirely by not awaiting it before /next. The two compose cleanly — the bind still runs concurrently with construction; only the subscribe await is removed.
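
The difference between the two hypotheses can be sketched as a small timing model of what gates /next. This is an illustration under simplifying assumptions (steps have fixed durations and overlap perfectly); `InitSteps` and the three functions are hypothetical names, and only the ~25 ms subscribe figure comes from this PR.

```rust
/// Durations in milliseconds for the three init steps discussed above.
#[derive(Clone, Copy)]
pub struct InitSteps {
    pub bind_ms: u32,
    pub subscribe_ms: u32,
    pub build_services_ms: u32,
}

/// No overlap: every step is awaited one after another before /next.
pub fn serial_ms(s: InitSteps) -> u32 {
    s.bind_ms + s.subscribe_ms + s.build_services_ms
}

/// H2: (bind + subscribe) runs concurrently with service construction.
pub fn h2_ms(s: InitSteps) -> u32 {
    (s.bind_ms + s.subscribe_ms).max(s.build_services_ms)
}

/// H18: only the bind is awaited; the subscribe no longer gates /next.
pub fn h18_ms(s: InitSteps) -> u32 {
    s.bind_ms.max(s.build_services_ms)
}
```

With a hypothetical 1 ms bind and 10 ms of service construction next to the 25 ms subscribe, the model gives 36 ms serial, 26 ms under H2, and 10 ms under H18. It describes the extension's path to /next only, and says nothing about total Init Duration.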

OPEN RISK

We deliberately do not solve the correctness question here. When the subscription registers slightly after /next, the platform may already have emitted platform.initStart / platform.initReport and early function logs the moment INIT ended — those could be delivered before the subscription exists and be lost. The listener socket is bound before we return, so nothing is dropped for a closed-port reason; the only exposure is the platform-side registration race. Closing (or proving harmless) that race is exactly what the benchmark below is for.
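
To make the exposure concrete, the race can be modelled as below. The model takes the worst case as an assumption: any event the platform emits before the subscription registers is lost. Whether the platform actually behaves this way is exactly the open question, so treat this as a sketch of the risk, not a description of the Telemetry API. `Emitted` and `split_by_registration` are hypothetical names, and timestamps are milliseconds relative to the first /next call.

```rust
/// A platform telemetry event and when it was emitted (ms after the first /next).
#[derive(Debug, Clone, PartialEq)]
pub struct Emitted {
    pub kind: &'static str,
    pub at_ms: u32,
}

/// Worst-case model (an assumption): anything emitted before the subscription
/// registers is lost. Returns (delivered, lost).
pub fn split_by_registration(
    events: &[Emitted],
    registered_at_ms: u32,
) -> (Vec<Emitted>, Vec<Emitted>) {
    events
        .iter()
        .cloned()
        .partition(|e| e.at_ms >= registered_at_ms)
}
```

Under this model, a subscription that registers 25 ms after /next would miss a platform.initReport emitted at 1 ms, while one that registers at 0 ms misses nothing.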

Testing

Validation plan (to run before this is marked ready) — on a minimal provided.al2023 / Go function:

- Confirm aws.lambda.enhanced.init_duration is still emitted (i.e. platform.initReport is still received after the reordered subscribe).
- Confirm no platform.logsDropped telemetry event is observed (no early/init logs lost to the registration race).
- Measure the cold-start delta: compare reported Init Duration and the "Datadog Next-Gen Extension ready in …ms" / INIT | phase=... checkpoints against the integration branch baseline to quantify the savings from overlapping the subscribe with the /next long-poll.

If init telemetry is lost or platform.logsDropped appears, this hypothesis is rejected (or needs a guard) and should not advance.
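
The pass/fail part of the plan above could be encoded roughly as follows. This is a sketch: `Verdict` and `judge` are hypothetical names, and the inputs assume someone has already collected whether the init_duration metric was seen and which telemetry event types were observed for the cold start.

```rust
/// Outcome of the correctness checks in the validation plan.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Advance,
    Rejected(&'static str),
}

/// Applies the two correctness checks; the cold-start delta is measured separately.
pub fn judge(init_duration_metric_seen: bool, telemetry_event_types: &[&str]) -> Verdict {
    // Check 1: aws.lambda.enhanced.init_duration must still be emitted.
    if !init_duration_metric_seen {
        return Verdict::Rejected("init telemetry lost");
    }
    // Check 2: no platform.logsDropped event may be observed.
    if telemetry_event_types.contains(&"platform.logsDropped") {
        return Verdict::Rejected("platform.logsDropped observed");
    }
    Verdict::Advance
}
```

A `Rejected` result corresponds to "rejected (or needs a guard)" above; `Advance` only means the correctness checks passed, not that the change is worth shipping.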

…experiment) The Lambda platform ends the extension's INIT phase at the first /next call, which then long-polls until the first INVOKE. Previously the Telemetry API subscribe (~25 ms, the dominant init phase we measured) was awaited to its 200 OK before /next, serializing it on the cold-start critical path. Keep binding the telemetry listener socket synchronously (so it is serving before we proceed and no early telemetry is lost to a closed port) and keep the cancel token wired exactly as before, but issue the subscribe in a detached tokio::spawn instead of awaiting it. Init then proceeds straight to /next and the subscribe round-trip overlaps the /next long-poll wait. Both On-Demand and Managed-Instance paths are preserved. Experimental: does not address whether the platform still delivers platform.initStart/initReport/early logs if the subscription registers slightly after /next. That is the open risk this change exists to benchmark.

datadog-datadog-prod-us1-2 · 2026-06-24T06:25:32Z

⚠️ Warnings

🚦 5 Pipeline jobs failed

DataDog/datadog-lambda-extension | integration-suite: [lmi]

DataDog/datadog-lambda-extension | e2e-test-status (amd64)

DataDog/datadog-lambda-extension | e2e-test-status (amd64, fips)

View all 5 failed jobs.

This comment will be updated automatically if new data arrives.

Commit SHA: 418d443

duncanista · 2026-06-24T18:17:49Z

Tested — recommend not pursuing H18

Built this branch into an arm64 layer and benchmarked it against the integration baseline (#1284) and prod v98, across python, go, and rust (provided.al2023) runtimes, forcing cold starts via the env-var bump trick.

Finding: H18 makes the extension's own init faster, but does not reduce total Init Duration on any runtime — including the most minimal one (rust), where the extension genuinely is the cold-start long pole.

Rust, n=30 (the decisive case):

| layer | Init Duration p50 (ms) | ext `ready_in` p50 (ms) |
| --- | --- | --- |
| integration (#1284) | 100.4 | 19 |
| +H18 | 97.8 (−2.6, within noise) | 12 |

Same pattern on python (184 → 187, flat) and go (flat/noisy at n=17). So H18 does do what it was designed to — ready_in drops as the subscribe moves off the path to /next — but it buys ~nothing in the customer-visible cold start.
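
For anyone re-running the comparison, the p50 and delta arithmetic can be sketched as below. The median definition (mean of the two middle samples for an even count) is an assumption; the comment above does not say how p50 was computed. `p50` and `delta_ms` are hypothetical helper names.

```rust
/// Median of the samples, or None if there are none.
/// For an even count, averages the two middle values (an assumed convention).
pub fn p50(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    } else {
        Some(sorted[mid])
    }
}

/// Candidate minus baseline; negative means the candidate is faster.
pub fn delta_ms(baseline: f64, candidate: f64) -> f64 {
    candidate - baseline
}
```

Applied to the reported medians, `delta_ms(100.4, 97.8)` is about −2.6 ms for Init Duration and `delta_ms(19.0, 12.0)` is −7 ms for `ready_in`.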

Why: the Init Duration floor (min ~80 ms on rust) is sandbox provisioning + loading/linking the ~15 MB extension binary, not the extension's async init and not the runtime (rust runtime is ~16 ms). Once the integration changes already drop ready_in to ~19 ms, the extension sits at/below that floor — so taking the subscribe off the path has nothing left to give.

Plus an unresolved correctness risk: this change fires telemetry::subscribe detached (not awaited before /next), so the subscription can register after the platform emits platform.initStart/initReport/early logs. That could cost us the aws.lambda.enhanced.init_duration metric or surface platform.logsDropped events. I never verified this, because the perf result already makes it not worth shipping.

Recommendation: close / don't pursue. No Init-Duration payoff on any runtime tested, plus a telemetry-correctness risk. The remaining cold-start lever for minimal runtimes is the extension binary's load time (build/link: eager symbol binding, opt-level/size, etc.) — not further init-sequence reordering.

(Benchmark detail: identical functions per runtime, arm64 / 1024 MB / DD_LOG_LEVEL=debug, cold starts forced by bumping a dummy env var. Companion results on the integration PR #1284.)

duncanista wants to merge 1 commit into jordan.gonzalez/cold-start-integration/feature from jordan.gonzalez/telemetry-subscribe-async/feature
