perf(init): issue telemetry subscribe earlier, overlapping service construction #1281

Draft
duncanista wants to merge 1 commit into jordan.gonzalez/cold-start-instrumentation/feature from jordan.gonzalez/init-reorder/feature

@duncanista

Contributor

Jira: none yet — add before marking ready.

Draft, stacked on #1271 (jordan.gonzalez/cold-start-instrumentation/feature). This PR targets that branch, not main. Review/merge #1271 first. Depends on the H0 init-timing instrumentation in #1271 to validate the cold-start win.

Overview

Cold-start optimization (Confluence H2: reorder init to call /next sooner).

The Lambda Extensions API ends the INIT phase at the first extension::next_event (/next) call, so any work serialized before that call inflates measured cold-start time. In extension_loop_active, the telemetry subscription (setup_telemetry_client → telemetry::subscribe) previously ran last, serialized behind InvocationProcessorService::new, AppSecProcessor::new, start_trace_agent, start_api_runtime_proxy, and the lifecycle listener, even though the subscribe depends only on logs_agent_channel (available right after start_dogstatsd).

What this PR does (the safe subset):

  1. Hoist + overlap the telemetry subscribe. The setup_telemetry_client future is now created as soon as logs_agent_channel is available and run concurrently with the remaining service construction via tokio::join!. The subscribe future is the first join! argument, so it is polled first: it binds the listener socket and issues the subscribe HTTP PUT, then yields on the network round-trip while the construction branch (invocation processor, AppSec, trace agent, API-runtime proxy, lifecycle listener) runs. The subscribe round-trip is therefore in flight during construction instead of being serialized after it. The construction branch is otherwise synchronous (it only spawns background tasks and returns handles), so behavior is unchanged — only the timing of the subscribe moves earlier.

  2. Make telemetry-listener readiness deterministic. TelemetryListener::start now binds its TCP socket synchronously (with .await?, before returning) instead of binding inside a spawned task. The serve loop still runs in a background task. This matches the pattern already used by the lifecycle, trace, and OTLP listeners.
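
The overlap in item 1 can be illustrated with a minimal std-only sketch. It is not the bottlecap code and not tokio's implementation of join!: join_in_order is a hand-rolled stand-in that polls its first argument before its second on every pass, and subscribe, build_services, and RoundTrip are hypothetical names for the subscribe future, the synchronous construction branch, and the network wait.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; enough for a busy-polling demo.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Returns `Pending` exactly once, standing in for a network round-trip.
struct RoundTrip {
    yielded: bool,
}
impl Future for RoundTrip {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            Poll::Pending
        }
    }
}

type Log = Arc<Mutex<Vec<&'static str>>>;

/// Hypothetical stand-in for the subscribe future: send, wait, receive.
async fn subscribe(log: Log) {
    log.lock().unwrap().push("subscribe: request sent");
    RoundTrip { yielded: false }.await;
    log.lock().unwrap().push("subscribe: response received");
}

/// Hypothetical stand-in for the construction branch: no await points.
async fn build_services(log: Log) {
    log.lock().unwrap().push("construction: done");
}

/// Polls `first` before `second` on every pass until both have completed.
fn join_in_order(first: impl Future<Output = ()>, second: impl Future<Output = ()>) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut first = pin!(first);
    let mut second = pin!(second);
    let (mut first_done, mut second_done) = (false, false);
    while !(first_done && second_done) {
        if !first_done {
            first_done = first.as_mut().poll(&mut cx).is_ready();
        }
        if !second_done {
            second_done = second.as_mut().poll(&mut cx).is_ready();
        }
    }
}

/// Runs both branches and returns the order in which their steps happened.
fn run_overlap_demo() -> Vec<&'static str> {
    let log: Log = Arc::new(Mutex::new(Vec::new()));
    join_in_order(subscribe(log.clone()), build_services(log.clone()));
    let steps = log.lock().unwrap().clone();
    steps
}
```

Calling run_overlap_demo() records the request being sent, then the construction step, then the response: the construction branch runs while the simulated round-trip is pending, which is the overlap this PR relies on.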

What was deferred (intentionally not done): the fuller refactor that would call /next before constructing the trace agent / OTLP / AppSec. That reordering is more invasive and risks dropping or mis-ordering early invocation lifecycle handling, so it is out of scope here. This PR keeps every service started and the entire subsequent event loop (both On-Demand and Managed Instance paths) byte-for-byte intact — it changes only when the subscribe is issued, not what gets built.

Correctness argument

  • No telemetry dropped. Issuing the subscribe earlier shortens the time between the listener bind and the first POST from the Telemetry API, which would raise the risk of dropping early platform.initStart / initReport / logs if the bind still happened inside a spawned task. To remove that race, TelemetryListener::start now binds the socket synchronously before subscribe is called, so the listener is already accepting connections by the time the Telemetry API is told to deliver to http://sandbox:8999/. Early platform events and logs therefore cannot be lost to a bind race.
  • Order-independence. The build_services branch performs no I/O that the subscribe depends on, and the subscribe performs no work the construction depends on — the two are independent, so running them under tokio::join! cannot reorder any observable side effect. All produced handles (invocation_processor_handle, trace bundle, appsec_processor, proxy/lifecycle shutdown tokens, etc.) are returned from the join and consumed exactly as before.
  • Cancellation / shutdown unchanged. telemetry_listener_cancel_token, lifecycle_listener_shutdown_token, trace_agent_shutdown_token, etc. are all still wired into cancel_background_services and the tombstone path identically.
  • H0 instrumentation preserved. All log_init_checkpoint(...) calls from #1271 are retained (dogstatsd_started, trace_agent_started, telemetry_subscribed, ready, …); trace_agent_started now fires inside the join branch but still records.
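
The bind-before-return pattern behind the first bullet can be sketched as follows. This is an illustration under stated assumptions, not TelemetryListener::start itself: it uses std threads and blocking sockets instead of tokio tasks, binds an ephemeral port on 127.0.0.1 instead of sandbox:8999, serves a single connection, and start_listener, deliver_event, and run_bind_demo are hypothetical names.

```rust
use std::io::{self, Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread::{self, JoinHandle};

/// Hypothetical listener start: binds before returning, serves in the background.
fn start_listener() -> io::Result<(SocketAddr, JoinHandle<io::Result<Vec<u8>>>)> {
    // Bind on the caller's thread, so a bind failure surfaces here and the
    // socket is already queueing connections when this function returns.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    // Only the serve loop moves to the background (one connection, for brevity).
    let handle = thread::spawn(move || -> io::Result<Vec<u8>> {
        let (mut stream, _peer) = listener.accept()?;
        let mut body = Vec::new();
        stream.read_to_end(&mut body)?;
        Ok(body)
    });

    Ok((addr, handle))
}

/// Stand-in for the sender that starts delivering once it has been subscribed.
fn deliver_event(addr: SocketAddr, payload: &[u8]) -> io::Result<()> {
    let mut stream = TcpStream::connect(addr)?;
    // The stream is dropped on return, which signals end-of-body to the reader.
    stream.write_all(payload)
}

/// Starts the listener, delivers one event immediately, returns what was received.
fn run_bind_demo() -> io::Result<Vec<u8>> {
    let (addr, handle) = start_listener()?;
    // No sleep or readiness check is needed: the socket was bound above.
    deliver_event(addr, b"platform.initStart")?;
    handle.join().expect("serve thread panicked")
}
```

Because the bind happens before start_listener returns, the sender can connect right away even if the background thread has not yet reached accept; the kernel holds the connection in the listen backlog. If the bind were inside the spawned thread, that immediate connect could be refused.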

Testing

  • cargo fmt clean.
  • cargo clippy --bin bottlecap --no-deps clean under clippy::all + pedantic + unwrap_used denied (only the pre-existing buf_redux/multipart future-incompat warning remains).
  • Not yet validated at runtime. This needs the H0 init-timing instrumentation from #1271 plus integration testing on a real Lambda to confirm the cold-start reduction and verify no early telemetry events are dropped. Recommended: compare the INIT | phase=telemetry_subscribed cumulative=… and phase=ready checkpoints before/after, and confirm platform.initReport is still received.

…nstruction

The Lambda Extensions API ends the INIT phase at the first /next call, so the
serialized work before it directly inflates cold start. The telemetry subscribe
round-trip previously ran last, behind trace-agent/AppSec/API-proxy/lifecycle
construction.

Hoist the telemetry subscribe so it runs as soon as logs_agent_channel is
available, and overlap its HTTP round-trip with the remaining (synchronous)
service construction via tokio::join! (subscribe polled first, so its network
call is in flight during construction).

To keep this correct, TelemetryListener::start now binds its socket
synchronously (before subscribe returns) instead of inside a spawned task, so
the listener is already accepting connections when the Telemetry API begins
delivering events. No early platform.initStart/initReport or logs are dropped.
What gets built is unchanged; only when the subscribe is issued.

datadog-prod-us1-3 Bot commented Jun 24, 2026

Pipelines

⚠️ Warnings

🚦 6 Pipeline jobs failed

DataDog/datadog-lambda-extension | integration-suite: [lmi]

DataDog/datadog-lambda-extension | integration-suite: [on-demand]

DataDog/datadog-lambda-extension | e2e-test-status (amd64)

View all 6 failed jobs.

🔗 Commit SHA: 1144e14
