Cold-start improvements — integration (combined testing)#1284
duncanista wants to merge 24 commits into
Add debug-gated checkpoints at the key cold-start init boundaries (crypto provider, TLS client build, config parse, shared client, register, dogstatsd, trace agent, telemetry subscribe, ready), plus a one-time available_parallelism() log. Each checkpoint logs delta (time since the previous checkpoint = that phase's own cost) and cumulative (time since process start), in milliseconds to 6 decimal places (nanosecond resolution) so sub-millisecond phases are visible. Init time is attributed per phase directly, with no manual subtraction. The per-phase bookkeeping is guarded behind a DEBUG-level check, so it stays effectively free at the default info level. This is the measurement prerequisite (H0) for the cold-start improvements.
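The delta/cumulative bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `InitCheckpoints`, `format_ms`, and the `enabled` flag (standing in for the DEBUG-level check) are hypothetical names, and the clock here starts when the tracker is constructed rather than at true process start.

```rust
use std::time::{Duration, Instant};

/// Format a duration as milliseconds with 6 decimal places (nanosecond
/// resolution), using integer math only.
fn format_ms(d: Duration) -> String {
    let nanos = d.as_nanos();
    format!("{}.{:06}", nanos / 1_000_000, nanos % 1_000_000)
}

/// Hypothetical checkpoint tracker. `enabled` stands in for the
/// DEBUG-level check that gates the bookkeeping.
struct InitCheckpoints {
    enabled: bool,
    start: Instant,
    previous: Instant,
}

impl InitCheckpoints {
    fn new(enabled: bool) -> Self {
        let now = Instant::now();
        Self { enabled, start: now, previous: now }
    }

    /// Returns the log line for this checkpoint, or `None` when disabled
    /// (in which case no clock is read and nothing is formatted).
    fn checkpoint(&mut self, name: &str) -> Option<String> {
        if !self.enabled {
            return None;
        }
        let now = Instant::now();
        // delta: this phase's own cost; cumulative: time since the tracker started.
        let delta = now.duration_since(self.previous);
        let cumulative = now.duration_since(self.start);
        self.previous = now;
        Some(format!(
            "init checkpoint {name}: delta={}ms cumulative={}ms",
            format_ms(delta),
            format_ms(cumulative)
        ))
    }
}
```

Because each checkpoint resets `previous`, the deltas of consecutive checkpoints sum to the cumulative value, which is why no manual subtraction is needed when reading the logs.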
Append -Clink-arg=-Wl,-z,now -Clink-arg=-Wl,-z,relro to the clang-linker RUSTFLAGS in both compile Dockerfiles. Eager (now) binding resolves all dynamic symbols at load time instead of lazily via the PLT, moving resolution stalls off the Lambda INIT path; relro hardens the GOT. This only affects the dynamically-linked glibc layers; it is a no-op on the static musl build.
tikv-jemallocator links jemalloc with the _rjem_ symbol prefix, so a runtime MALLOC_CONF env var is never read. Set the compile-time JEMALLOC_SYS_WITH_MALLOC_CONF instead, in both the GNU and Alpine compile Dockerfiles. A single arena reduces the metadata jemalloc maps at init and lowers RSS; the extension is not allocation-throughput-bound, so arena contention is not a concern. Dockerfile-only change; not docker-built locally and pending benchmarking.
Lambda CPUs are known at build time: arm64 is Graviton2 (neoverse-n1) and x86_64 is targeted at the universally-safe x86-64-v2 baseline. Pin -Ctarget-cpu per PLATFORM in both compile Dockerfiles so codegen can use the available ISA extensions (helps crypto/compression during init). x86-64-v3 is deliberately avoided: it is not guaranteed across all Lambda x86 hosts and a wrong ISA surfaces as SIGILL at runtime.
The compile Dockerfiles built with 'cargo +stable', overriding the channel = "1.93.1" pin in rust-toolchain.toml. Drop the '+stable' override and install rustup with --default-toolchain none so rust-toolchain.toml auto-installs and drives the toolchain, making builds reproducible against the pinned version. Also remove the dead UPX install from Dockerfile.build_layer: the binary ships uncompressed, so nothing invokes upx anymore.
Switch the non-FIPS default feature from reqwest/rustls-tls-native-roots to reqwest/rustls-tls-webpki-roots so the two init-time reqwest clients (the register client in bin/bottlecap/main.rs and the shared flush client in src/http.rs) no longer call rustls_native_certs::load_native_certs() on every reqwest::Client::build(). webpki-roots uses a compiled-in Mozilla CA bundle, eliminating the per-build filesystem cert scan during cold start. Custom-cert (tls_cert_file -> add_root_certificate), proxy, and skip-ssl-validation paths are unchanged. The FIPS feature still uses native roots and is untouched.
Replace #[tokio::main] with an explicit multi-thread runtime whose worker count is derived from AWS_LAMBDA_FUNCTION_MEMORY_SIZE. AWS grants ~1 vCPU per 1769 MB, so workers = round(mem_mb / 1769) clamped to 1..=4 (integer math, no float casts; defaults to 2 when the env var is missing or unparseable). The init body moves verbatim into run(); all H0 cold-start instrumentation is preserved.
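The worker-count rule above (round to nearest, clamp to 1..=4, default 2) can be written with integer math alone. The sketch below is an assumption about shape, not the PR's helper: the signature taking the raw env value as `Option<&str>` and the constant names are hypothetical, chosen so the logic is testable without touching the environment.

```rust
/// Approximate memory per vCPU on Lambda, per the description above.
const MB_PER_VCPU: u64 = 1769;
/// Fallback when the env var is missing or unparseable.
const DEFAULT_WORKERS: usize = 2;

/// Hypothetical signature: takes the raw value of
/// AWS_LAMBDA_FUNCTION_MEMORY_SIZE instead of reading the env directly.
fn tokio_worker_threads(memory_size: Option<&str>) -> usize {
    let Some(mem_mb) = memory_size.and_then(|v| v.trim().parse::<u64>().ok()) else {
        return DEFAULT_WORKERS;
    };
    // Round-to-nearest integer division: add half the divisor before dividing.
    let rounded = mem_mb.saturating_add(MB_PER_VCPU / 2) / MB_PER_VCPU;
    // The clamped value is in 1..=4, so the integer cast cannot truncate.
    rounded.clamp(1, 4) as usize
}
```

A caller would pass `std::env::var("AWS_LAMBDA_FUNCTION_MEMORY_SIZE").ok().as_deref()` and hand the result to `tokio::runtime::Builder::new_multi_thread().worker_threads(n)`. With this rule a 128 MB function gets 1 worker, 3008 MB gets 2, and 10240 MB is capped at 4.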
Compute the Lambda tag vec/string/function-tags-map once in Lambda::new_from_config and return the cached values from the getters, so repeated init- and per-trace-time calls to get_tags_vec/get_tags_string/get_function_tags_map are O(1) reads instead of re-iterating the tag map and re-running format!/join on every call. Hoist the two static limits-file regexes in proc/mod.rs (Max open files, Max processes) to LazyLock<Regex> so they compile once instead of on every fd/threads metrics sample. The trace_processor span_matches_tag_regex pattern is left as-is: its value comes from per-call user config (apm_filter_tags_regex_reject), not a static literal, so it cannot be hoisted to a LazyLock without changing behavior. Output (tag set, format, values) is unchanged.
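As a rough illustration of the compute-once pattern for the tag getters, the sketch below builds every derived form in the constructor and returns borrowed views afterwards. The `CachedTags` type, the `key:value` rendering, and the comma separator are assumptions for the example; the real struct and tag format live in the repository.

```rust
use std::collections::BTreeMap;

/// Hypothetical holder for tags: all derived forms are built once in `new`.
struct CachedTags {
    tags_map: BTreeMap<String, String>,
    tags_vec: Vec<String>,
    tags_string: String,
}

impl CachedTags {
    fn new(tags_map: BTreeMap<String, String>) -> Self {
        // Assumed rendering: "key:value" entries joined with commas.
        let tags_vec: Vec<String> = tags_map
            .iter()
            .map(|(k, v)| format!("{k}:{v}"))
            .collect();
        let tags_string = tags_vec.join(",");
        Self { tags_map, tags_vec, tags_string }
    }

    /// O(1): returns the cached slice, no iteration or formatting.
    fn get_tags_vec(&self) -> &[String] {
        &self.tags_vec
    }

    /// O(1): returns the cached joined string.
    fn get_tags_string(&self) -> &str {
        &self.tags_string
    }

    /// O(1): returns the map the cached forms were derived from.
    fn get_tags_map(&self) -> &BTreeMap<String, String> {
        &self.tags_map
    }
}
```

The regex hoist follows the same idea at static scope: a `std::sync::LazyLock` wraps the compiled pattern so the first sample pays the compile cost and later samples reuse it.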
Bump direct deps to match the transitive graph and collapse duplicate compiled crate versions:
- nix 0.26 -> 0.29 (also removes the duplicate bitflags 1.x)
- thiserror 1 -> 2 (drop-in; no source changes)
- opentelemetry-semantic-conventions 0.30 -> 0.31 (no source changes)
- rand 0.8 -> 0.9 (thread_rng->rng, gen->random, OsRng now TryRngCore)

nix/bitflags and semconv duplicates fully collapse. The rand 0.8 and thiserror 1.x copies that remain are pulled only by upstream Datadog git crates (dd-trace-rs, serverless-components, libdatadog) and cannot be removed from this repo.
…nstruction
The Lambda Extensions API ends the INIT phase at the first /next call, so the serialized work before it directly inflates cold start. The telemetry subscribe round-trip previously ran last, behind trace-agent/AppSec/API-proxy/lifecycle construction. Hoist the telemetry subscribe so it runs as soon as logs_agent_channel is available, and overlap its HTTP round-trip with the remaining (synchronous) service construction via tokio::join! (subscribe polled first, so its network call is in flight during construction). To keep this correct, TelemetryListener::start now binds its socket synchronously (before subscribe returns) instead of inside a spawned task, so the listener is already accepting connections when the Telemetry API begins delivering events. No early platform.initStart/initReport events or logs are dropped. What gets built is unchanged; only the point at which the subscribe is issued changes.
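The bind-before-subscribe ordering can be illustrated with a std-only analogue that uses threads and blocking sockets in place of tokio. None of this is the PR's code: `start_listener` and `deliver_event` are hypothetical names, and the loopback address is only there to keep the example self-contained.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread;

/// Bind synchronously, then hand the already-bound socket to a background
/// thread. By the time this function returns, the OS is queueing incoming
/// connections even if the thread has not reached `accept` yet.
fn start_listener() -> std::io::Result<(SocketAddr, thread::JoinHandle<std::io::Result<String>>)> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    let handle = thread::spawn(move || -> std::io::Result<String> {
        let (mut stream, _peer) = listener.accept()?;
        let mut body = String::new();
        // Reads until the sender closes its end of the connection.
        stream.read_to_string(&mut body)?;
        Ok(body)
    });
    Ok((addr, handle))
}

/// Stand-in for the platform delivering an event right after subscribe.
fn deliver_event(addr: SocketAddr, event: &str) -> std::io::Result<()> {
    let mut stream = TcpStream::connect(addr)?;
    stream.write_all(event.as_bytes())?;
    Ok(()) // dropping the stream closes it, which ends the reader's loop
}
```

If the bind instead happened inside the spawned thread, a sender that connected immediately after `start_listener` returned could be refused, which is the race the synchronous bind is meant to rule out.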
Build the register/`/next` reqwest client on a blocking thread inside a spawned task so its native-cert-loading TLS build (and the register network round-trip) overlaps with config parsing and the shared flushing client build, instead of running serially during cold start. The register/`/next` client and the shared flushing client are kept separate on purpose and not collapsed: the Extension API register + `/next` long-poll must use `.no_proxy()` and carry no `flush_timeout` (which would abort the long-poll), while the shared client requires proxy support, a flush_timeout, and pool_max_idle_per_host(0). Those needs conflict, so their construction is overlapped rather than merged. All existing client settings and the cold-start init checkpoints are preserved.
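The overlap described above has roughly the following shape, shown here with a plain thread rather than a tokio task plus `spawn_blocking`. All types and functions are hypothetical stand-ins (`RegisterClient`, `RegisterResponse`, `Config`, `build_register_client`, `register`, `parse_config`, `init`); only the ordering is the point.

```rust
use std::thread;

/// Hypothetical stand-ins for the real client, response, and config types.
struct RegisterClient {
    no_proxy: bool,
}
struct RegisterResponse {
    extension_id: String,
}
struct Config {
    flush_timeout_secs: u64,
}

/// Placeholder for the expensive TLS client build.
fn build_register_client() -> RegisterClient {
    RegisterClient { no_proxy: true }
}

/// Placeholder for the register network round-trip.
fn register(_client: &RegisterClient) -> RegisterResponse {
    RegisterResponse { extension_id: String::from("example-id") }
}

/// Placeholder for config parsing; falls back to 5 on bad input.
fn parse_config(raw: &str) -> Config {
    Config { flush_timeout_secs: raw.trim().parse().unwrap_or(5) }
}

fn init(raw_config: &str) -> (RegisterClient, RegisterResponse, Config) {
    // Start the client build and register call in the background...
    let register_task = thread::spawn(|| {
        let client = build_register_client();
        let response = register(&client);
        (client, response)
    });
    // ...while config parsing proceeds on the current thread.
    let config = parse_config(raw_config);
    // Join only when the register result is actually needed.
    let (client, response) = register_task.join().expect("register task panicked");
    (client, response, config)
}
```

The wall-clock cost of `init` then approaches the longer of the two branches instead of their sum, while the two clients keep their separate, conflicting settings.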
AppSecProcessor::new zstd-decompresses a ~29KB->322KB ruleset, JSON-parses it, and compiles the libddwaf WAF (tens of ms) synchronously during init. The WAF is only needed once the first request payload is evaluated, which is strictly after the first /next, so this work does not belong on the init critical path. Replace the eager Option<Arc<Mutex<Processor>>> with a deferred, awaitable handle (Arc<OnceCell<Option<Arc<Mutex<Processor>>>>>). When AppSec is enabled, the build runs on the blocking pool (spawn_blocking) from a background task; consumers (trace processor and the runtime API proxy) resolve the handle where they actually use the WAF, awaiting the in-flight build if a request somehow arrives before it finishes. The disabled-by-default path stays cheap: the feature flag is checked synchronously and yields no handle and no build.
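A std-only analogue of the deferred handle is sketched below, using `std::sync::OnceLock` and a thread where the PR uses an async `OnceCell` and `spawn_blocking`. The `Processor` body, `build_processor`, the `BUILDS` counter, and the exact signatures of `defer_processor` and `resolve` are assumptions for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, OnceLock};
use std::thread;

/// Counts how many times the (expensive) build actually ran.
static BUILDS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for the WAF-backed processor.
struct Processor {
    rule_count: usize,
}

type SharedProcessor = Arc<Mutex<Processor>>;
type DeferredProcessor = Arc<OnceLock<Option<SharedProcessor>>>;

/// Placeholder for decompress + parse + WAF compile.
fn build_processor() -> Option<SharedProcessor> {
    BUILDS.fetch_add(1, Ordering::SeqCst);
    Some(Arc::new(Mutex::new(Processor { rule_count: 3 })))
}

/// Disabled path: checked synchronously, no handle and no build.
/// Enabled path: start the build in the background and return at once.
fn defer_processor(appsec_enabled: bool) -> Option<DeferredProcessor> {
    if !appsec_enabled {
        return None;
    }
    let handle: DeferredProcessor = Arc::new(OnceLock::new());
    let background = Arc::clone(&handle);
    thread::spawn(move || {
        background.get_or_init(build_processor);
    });
    Some(handle)
}

/// Called at the use site. If the build is still in flight this blocks
/// until it finishes; the initializer runs at most once either way.
fn resolve(handle: &DeferredProcessor) -> Option<SharedProcessor> {
    handle.get_or_init(build_processor).clone()
}
```

The design keeps the init path down to one synchronous flag check plus a spawn, and moves the wait (if any) to the first consumer that needs the processor.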
With --default-toolchain none, rust-src had nothing to attach to; the toml-pinned toolchain installs only rustfmt/clippy and nothing consumes rust-src.
…ature' into jordan.gonzalez/cold-start-integration/feature
…re' into jordan.gonzalez/cold-start-integration/feature # Conflicts: # images/Dockerfile.bottlecap.alpine.compile # images/Dockerfile.bottlecap.compile
…ature' into jordan.gonzalez/cold-start-integration/feature # Conflicts: # images/Dockerfile.bottlecap.compile
…/feature' into jordan.gonzalez/cold-start-integration/feature
…ture' into jordan.gonzalez/cold-start-integration/feature
…ature' into jordan.gonzalez/cold-start-integration/feature
…e/feature' into jordan.gonzalez/cold-start-integration/feature
…ture' into jordan.gonzalez/cold-start-integration/feature
…ture' into jordan.gonzalez/cold-start-integration/feature # Conflicts: # bottlecap/src/bin/bottlecap/main.rs
…p/feature' into jordan.gonzalez/cold-start-integration/feature
…call site
H3 (appsec-defer) added Arc::clone(config) to the interceptor::start call inside start_api_runtime_proxy's body (signature unchanged at 5 params). While resolving the H2/H3 conflict I mistakenly also added that arg to the start_api_runtime_proxy call site inside H2's build_services block, which the 5-param signature rejects. Remove the stray arg; the inner interceptor::start call already carries config.
Cold-start benchmark: integration build vs prod v98
Method: 25 forced cold starts per function (bump a dummy env var before each invoke → fresh sandbox). Identical config:
Extension's own init:
| | min | p50 | p90 | p99 | max | mean |
|---|---|---|---|---|---|---|
| Integration (#1284) | 10 | 59 | 78 | 106 | 106 | 54.5 |
| Baseline (v98) | 29 | 74 | 95 | 116 | 116 | 69.0 |
| Δ | −19 | −15 (~20%) | −17 | −10 | — | −14.5 (~21%) |
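For readers checking the Δ row, the arithmetic is the integration value minus the baseline, with the percentage taken relative to the baseline. The helper below is a hypothetical illustration of that calculation, not part of the benchmark tooling.

```rust
/// Returns (absolute delta, percent change relative to baseline).
/// Hypothetical helper; negative values mean the integration build is faster.
fn delta(integration: f64, baseline: f64) -> (f64, f64) {
    let abs = integration - baseline;
    (abs, abs / baseline * 100.0)
}
```

Applied to the p50 column, `delta(59.0, 74.0)` gives an absolute change of -15 and roughly -20%; for the mean, `delta(54.5, 69.0)` gives -14.5 and roughly -21%.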
Platform Init Duration (runtime + extension, run in parallel), in ms
| | min | p50 | p90 | p99 | max | mean |
|---|---|---|---|---|---|---|
| Integration (#1284) | 166.8 | 208.9 | 231.1 | 258.5 | 258.5 | 207.5 |
| Baseline (v98) | 166.8 | 219.8 | 243.0 | 251.1 | 251.1 | 212.7 |
| Δ | 0 | −11 | −12 | +7 | — | −5 |
Takeaways
- Extension init is ~20% faster at p50 (consistent across min/p50/p90/mean) — and the integration build is also carrying the extra H0 init-timing debug logging the baseline lacks, yet still wins.
- It propagates only partly to total `Init Duration` (−11 ms p50): on `python3.12` the runtime/platform dominate (identical 166.8 ms floor for both), so the extension win is muted. On a minimal `provided.al2023`/Go function (extension = long pole) the delta would show through more directly.
Per-phase breakdown of a representative cold start (from the new init instrumentation):
crypto_provider_ready 9.34ms · config_parse 1.86ms · shared_client_ready 0.15ms · register_ready 2.49ms · dogstatsd 0.51ms · trace_agent 0.31ms · telemetry_subscribed 24.97ms → ready ~40ms. The dominant single phase is telemetry_subscribed (a network round-trip to the Telemetry API).
Caveats: N=25–26 (p50/mean reliable; p99≈max and noisy). Loops ran sequentially, not interleaved. Both at debug log level (inflates absolutes equally). This run also confirms the integration's Dockerfile changes (eager-binding / target-cpu / jemalloc / toolchain) build cleanly in release/arm64.
Jira: none yet — add before marking ready.
Overview
Integration branch (`jordan.gonzalez/cold-start-integration/feature`) built on `origin/main`, merging the full cold-start improvement stack. Conflicts were resolved by taking the union of every PR's intent (no change was dropped). Merge order was chosen to minimize conflict pain (build PRs first, then deps, then the `main.rs` init-path PRs last):
Included PRs:
- … `log_init_checkpoint` + checkpoints in `main.rs`); the base all others were stacked on
- … `narenas:1` (`ENV JEMALLOC_SYS_WITH_MALLOC_CONF`) in both compile Dockerfiles
- … `-Clink-arg=-Wl,-z,now -Clink-arg=-Wl,-z,relro`)
- … `-Ctarget-cpu` per platform (`neoverse-n1` on arm64, `x86-64-v2` on x86_64)
- … `--default-toolchain none`, drop `+stable`/`--component rust-src`, remove dead UPX install in `build_layer`)
- … `reqwest/rustls-tls-webpki-roots` (FIPS still uses native-roots)
- … `lifecycle/invocation/mod.rs`)
- … `Builder` from the Lambda memory tier; body moved into `async fn run()`; `tokio_worker_threads()` helper
- … `/next` client build + `register()` moved into a background task returning `(client, RegisterResponse)`
- … `subscribe` earlier, overlapping service construction via `tokio::join!`; `TelemetryListener::start()` is now `async` with a synchronous bind
- … `appsec::defer_processor`/`appsec::resolve`, `DeferredProcessor` handle threaded through trace agent + runtime proxy)
- … `LazyLock` (`tags/`, `proc/`)
Excluded (intentionally):
Non-trivial conflict reconciliations
- … `Dockerfile.bottlecap.compile`, `Dockerfile.bottlecap.alpine.compile`): all four build PRs touch the same `export RUSTFLAGS` region. The exported `RUSTFLAGS` now contains the union: base `${RUSTFLAGS:-}`, the `-Clinker=clang -L…builtins…` flags, H7's `-z,now -z,relro`, and H8's `-Ctarget-cpu=${TARGET_CPU}` (with H8's `aarch64→neoverse-n1 / else→x86-64-v2` conditional). In the alpine file, H8 introduced a non-x86_64 `else` branch; that branch was also given H7's `-z,now -z,relro` so eager binding applies on arm64 too. H16's `ENV JEMALLOC_SYS_WITH_MALLOC_CONF="narenas:1"` and H10's toolchain changes (`--default-toolchain none`, `cargo build` without `+stable`) + UPX removal all coexist.
- `bottlecap/Cargo.toml`: H1 (feature line) and H12 (dependency lines) edit different sections; unioned cleanly. `Cargo.lock` came in via H12 and was unchanged by the clippy/build pass.
- `bottlecap/src/bin/bottlecap/main.rs`: the four init-path PRs were combined so every behavior survives:
  - … `fn main()` builds the runtime and calls `block_on(run())`; the former `main` body lives in `async fn run()`.
  - … `spawn_blocking` build + `register()`, returning `(client, RegisterResponse)`) lives inside `run()`.
  - … `setup_telemetry_client` hoisted into a `telemetry_setup` future; remaining service construction wrapped in a `build_services` async block; both driven by `tokio::join!`) is preserved in `extension_loop_active`.
  - … `build_services` block: the old eager `match AppSecProcessor::new(...)` became `appsec::defer_processor(config)`, and the `DeferredProcessor` (`Arc<OnceCell<Option<SharedProcessor>>>`) flows out through the `join!` tuple to the trace agent and runtime proxy, which resolve it via `appsec::resolve` at use sites.
  - … `Arc::clone(config)` argument to the wrong place: the `start_api_runtime_proxy` call site inside `build_services` rather than the inner `interceptor::start` call (which already carries it). The stray argument was removed; `start_api_runtime_proxy`'s signature is unchanged at 5 params.
Testing
- `cargo fmt --manifest-path bottlecap/Cargo.toml`: no changes.
- `cargo clippy --manifest-path bottlecap/Cargo.toml --bin bottlecap --no-deps`: clean (`clippy::all` + `pedantic` + `unwrap_used` denied). Only remaining note is the pre-existing `buf_redux`/`multipart` future-incompat warning, which is on `main` already.
- `cargo build --manifest-path bottlecap/Cargo.toml --bin bottlecap`: builds successfully.
- `cargo test --manifest-path bottlecap/Cargo.toml --lib lifecycle::invocation`: 219 passed / 0 failed (covers H12's rand-0.9 `generate_span_id` migration).
Reviewer notes / risks
- … `join!`ed `build_services` block: worth a careful look that the deferred handle's lifecycle (resolve-on-first-use) behaves under the reordered init.
- … `-Ctarget-cpu=neoverse-n1` + H7's eager binding now both apply to the alpine non-x86_64 build path; confirm the arm64 musl build is happy with the combined flags in CI.