perf(init): issue telemetry subscribe earlier, overlapping service construction #1281
Draft
duncanista wants to merge 1 commit into
…nstruction The Lambda Extensions API ends the INIT phase at the first /next call, so the serialized work before it directly inflates cold start. The telemetry subscribe round-trip previously ran last, behind trace-agent/AppSec/API-proxy/lifecycle construction. Hoist the telemetry subscribe so it runs as soon as logs_agent_channel is available, and overlap its HTTP round-trip with the remaining (synchronous) service construction via tokio::join! (subscribe polled first, so its network call is in flight during construction). To keep this correct, TelemetryListener::start now binds its socket synchronously (before subscribe returns) instead of inside a spawned task, so the listener is already accepting connections when the Telemetry API begins delivering events. No early platform.initStart/initReport or logs are dropped. What gets built is unchanged; only when the subscribe is issued.
Jira: none yet — add before marking ready.
Overview
Cold-start optimization (Confluence H2: reorder init to call `/next` sooner).
The Lambda Extensions API ends the INIT phase at the first `extension::next_event` (`/next`) call, so any work serialized before that call inflates measured cold-start time. In `extension_loop_active`, the telemetry subscription (`setup_telemetry_client` → `telemetry::subscribe`) previously ran last, serialized behind `InvocationProcessorService::new`, `AppSecProcessor::new`, `start_trace_agent`, `start_api_runtime_proxy`, and the lifecycle listener, even though the subscribe depends only on `logs_agent_channel` (available right after `start_dogstatsd`).
What this PR does (the safe subset):
- Hoist + overlap the telemetry subscribe. The `setup_telemetry_client` future is now created as soon as `logs_agent_channel` is available and run concurrently with the remaining service construction via `tokio::join!`. The subscribe future is the first `join!` argument, so it is polled first: it binds the listener socket and issues the `subscribe` HTTP PUT, then yields on the network round-trip while the construction branch (invocation processor, AppSec, trace agent, API-runtime proxy, lifecycle listener) runs. The subscribe round-trip is therefore in flight during construction instead of being serialized after it. The construction branch is otherwise synchronous (it only spawns background tasks and returns handles), so behavior is unchanged; only the timing of the subscribe moves earlier.
- Make telemetry-listener readiness deterministic. `TelemetryListener::start` now binds its TCP socket synchronously (with `.await?`, before returning) instead of binding inside a spawned task. The serve loop still runs in a background task. This matches the pattern already used by the lifecycle, trace, and OTLP listeners.
What was deferred (intentionally not done): the fuller refactor that would call `/next` before constructing the trace agent / OTLP / AppSec. That reordering is more invasive and risks dropping or mis-ordering early invocation lifecycle handling, so it is out of scope here. This PR keeps every service started and the entire subsequent event loop (both On-Demand and Managed Instance paths) byte-for-byte intact: it changes only when the subscribe is issued, not what gets built.
Correctness argument
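The interleaving this change relies on (subscribe is polled first and parks on its round-trip, construction then runs to completion, and the subscribe finishes afterwards) can be sketched without tokio. The block below is a minimal std-only illustration, not the PR's code: `join_in_order` is a hand-rolled stand-in for `tokio::join!`, and `Subscribe`, `run_init_overlap`, and the log strings are hypothetical names invented for the example.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; the demo join loop below simply re-polls.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical stand-in for the subscribe future: the first poll "sends"
/// the request and parks, the second poll observes the "response".
struct Subscribe<'a> {
    log: &'a RefCell<Vec<&'static str>>,
    sent: bool,
}

impl Future for Subscribe<'_> {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.sent {
            self.log.borrow_mut().push("subscribe response received");
            Poll::Ready(())
        } else {
            self.log.borrow_mut().push("listener bound, subscribe sent");
            self.sent = true;
            // A real future would register the waker with the I/O driver here.
            Poll::Pending
        }
    }
}

/// Simplified join (an assumption for illustration, not tokio's
/// implementation): polls `a` before `b` on every pass until both finish.
fn join_in_order<A, B>(mut a: A, mut b: B) -> (A::Output, B::Output)
where
    A: Future + Unpin,
    B: Future + Unpin,
{
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut out_a = None;
    let mut out_b = None;
    loop {
        if out_a.is_none() {
            if let Poll::Ready(v) = Pin::new(&mut a).poll(&mut cx) {
                out_a = Some(v);
            }
        }
        if out_b.is_none() {
            if let Poll::Ready(v) = Pin::new(&mut b).poll(&mut cx) {
                out_b = Some(v);
            }
        }
        if out_a.is_some() && out_b.is_some() {
            return (out_a.take().unwrap(), out_b.take().unwrap());
        }
    }
}

/// Runs the two branches and returns the order in which their side effects
/// were recorded.
fn run_init_overlap() -> Vec<&'static str> {
    let log = RefCell::new(Vec::new());
    let subscribe = Subscribe { log: &log, sent: false };
    // Synchronous construction branch: completes on its first poll.
    let build = Box::pin(async {
        log.borrow_mut().push("services constructed");
    });
    join_in_order(subscribe, build);
    log.into_inner()
}
```

Calling `run_init_overlap()` returns the recorded events in the order subscribe sent, services constructed, subscribe response received, which is the overlap described above: the round-trip is pending while construction runs.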
- … `platform.initStart` / `initReport` / logs if the bind were still racy. To remove that race entirely, `TelemetryListener::start` now binds the socket synchronously before `subscribe` is called, so the listener is guaranteed to be accepting connections by the time the Telemetry API is told to deliver to `http://sandbox:8999/`. No early platform events or logs can be dropped.
- `build_services` branch performs no I/O that the subscribe depends on, and the subscribe performs no work the construction depends on. The two are independent, so running them under `tokio::join!` cannot reorder any observable side effect. All produced handles (`invocation_processor_handle`, trace bundle, `appsec_processor`, proxy/lifecycle shutdown tokens, etc.) are returned from the join and consumed exactly as before.
- `telemetry_listener_cancel_token`, `lifecycle_listener_shutdown_token`, `trace_agent_shutdown_token`, etc. are all still wired into `cancel_background_services` and the tombstone path identically.
- `log_init_checkpoint(...)` calls from #1271 (feat(init): add cold-start init-phase timing instrumentation) are retained (`dogstatsd_started`, `trace_agent_started`, `telemetry_subscribed`, `ready`, …); `trace_agent_started` now fires inside the join branch but still records.
Testing
- `cargo fmt` clean.
- `cargo clippy --bin bottlecap --no-deps` clean under `clippy::all` + `pedantic` + `unwrap_used` denied (only the pre-existing `buf_redux`/`multipart` future-incompat warning remains).
- … `INIT | phase=telemetry_subscribed cumulative=…` and `phase=ready` checkpoints before/after, and confirm `platform.initReport` is still received.
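The bind-before-return property can also be exercised locally, independent of Lambda. The sketch below is a std-only analogue under stated assumptions, not the extension's listener: `start_listener` and `deliver_event` are hypothetical helpers, it binds an ephemeral loopback port rather than `sandbox:8999`, and the echo exchange is invented purely so the client has something to check.

```rust
use std::io::{Read, Write};
use std::net::{Shutdown, SocketAddr, TcpListener, TcpStream};
use std::thread;

/// Hypothetical analogue of a bind-then-spawn `start`: the socket is bound
/// on the caller's thread, so it is already listening when this returns.
/// Only the serve loop runs in the background.
fn start_listener() -> std::io::Result<SocketAddr> {
    // Port 0 asks the OS for any free port (an assumption for the demo).
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    thread::spawn(move || {
        for stream in listener.incoming() {
            let mut stream = match stream {
                Ok(s) => s,
                Err(_) => break,
            };
            // Echo the payload back so the sender can confirm delivery.
            let mut payload = Vec::new();
            if stream.read_to_end(&mut payload).is_ok() {
                let _ = stream.write_all(&payload);
            }
        }
    });
    Ok(addr)
}

/// Hypothetical sender standing in for an early event delivery. It connects
/// immediately after `start_listener` returns, with no retry or sleep.
fn deliver_event(addr: SocketAddr, payload: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut stream = TcpStream::connect(addr)?;
    stream.write_all(payload)?;
    // Close the write half so the server's `read_to_end` sees EOF.
    stream.shutdown(Shutdown::Write)?;
    let mut echoed = Vec::new();
    stream.read_to_end(&mut echoed)?;
    Ok(echoed)
}
```

Because the bind happens before `start_listener` returns, `deliver_event(start_listener()?, b"platform.initStart")` connects without a retry loop even if the background thread has not yet reached `accept`: the kernel queues the connection on the already-listening socket. Binding inside the spawned thread instead would make that first connect race with the bind.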