Cold-start improvements — integration (combined testing) by duncanista · Pull Request #1284 · DataDog/datadog-lambda-extension

duncanista · 2026-06-24T04:43:12Z

Jira: none yet — add before marking ready.

DRAFT / DO NOT MERGE AS-IS. This is an umbrella PR that merges every individual cold-start PR onto one branch so the set can be built and benchmarked together in CI as a single unit. Each change lands (and is reviewed) in its own PR; this branch exists only for combined testing. It is not intended to be merged; once the constituent PRs land on main, this branch is discarded.

Overview

Integration branch (jordan.gonzalez/cold-start-integration/feature) built on origin/main, merging the full cold-start improvement stack. Conflicts were resolved by taking the union of every PR's intent (no change was dropped). Merge order was chosen to minimize conflict pain (build PRs first, then deps, then the main.rs init-path PRs last):

Included PRs:

Excluded (intentionally):

H13 refactor(build): feature-gate OTLP and AppSec subsystems (default-on) #1280 — feature-gate OTLP & AppSec — closed (incompatible with the single-binary shipping model); not merged.
H11 — remove duplicate ring crypto backend — upstream-only change, no PR in this repo; nothing to merge here.

Non-trivial conflict reconciliations

Dockerfiles (Dockerfile.bottlecap.compile, Dockerfile.bottlecap.alpine.compile) — all four build PRs touch the same export RUSTFLAGS region. The exported RUSTFLAGS now contains the union: base ${RUSTFLAGS:-}, the -Clinker=clang -L…builtins… flags, H7's -z,now -z,relro, and H8's -Ctarget-cpu=${TARGET_CPU} (with H8's aarch64→neoverse-n1 / else→x86-64-v2 conditional). In the alpine file, H8 introduced a non-x86_64 else branch — that branch was also given H7's -z,now -z,relro so eager binding applies on arm64 too. H16's ENV JEMALLOC_SYS_WITH_MALLOC_CONF="narenas:1" and H10's toolchain changes (--default-toolchain none, cargo build without +stable) + UPX removal all coexist.
bottlecap/Cargo.toml — H1 (feature line) and H12 (dependency lines) edit different sections; unioned cleanly. Cargo.lock came in via H12 and was unchanged by the clippy/build pass.
bottlecap/src/bin/bottlecap/main.rs — the four init-path PRs were combined so every behavior survives:
- H15's fn main() builds the runtime and calls block_on(run()); the former main body lives in async fn run().
- H4's parallel register-client construction (spawn_blocking build + register(), returning (client, RegisterResponse)) lives inside run().
- H2's reorder (setup_telemetry_client hoisted into a telemetry_setup future; remaining service construction wrapped in a build_services async block; both driven by tokio::join!) is preserved in extension_loop_active.
- H3's deferred AppSec handle is applied inside H2's build_services block: the old eager match AppSecProcessor::new(...) became appsec::defer_processor(config), and the DeferredProcessor (Arc<OnceCell<Option<SharedProcessor>>>) flows out through the join! tuple to the trace agent and runtime proxy, which resolve it via appsec::resolve at use sites.
One integration-only fixup commit: while resolving the H2/H3 overlap I briefly added H3's Arc::clone(config) argument to the wrong place — the start_api_runtime_proxy call site inside build_services rather than the inner interceptor::start call (which already carries it). The stray argument was removed; start_api_runtime_proxy's signature is unchanged at 5 params.

Testing

cargo fmt --manifest-path bottlecap/Cargo.toml — no changes.
cargo clippy --manifest-path bottlecap/Cargo.toml --bin bottlecap --no-deps — clean (clippy::all + pedantic + unwrap_used denied). Only remaining note is the pre-existing buf_redux / multipart future-incompat warning, which is on main already.
cargo build --manifest-path bottlecap/Cargo.toml --bin bottlecap — builds successfully.
cargo test --manifest-path bottlecap/Cargo.toml --lib lifecycle::invocation — 219 passed / 0 failed (covers H12's rand-0.9 generate_span_id migration).
Combined CI / cold-start benchmarking across the full stack: to be run on this branch (the reason it exists).

Reviewer notes / risks

This branch is for measurement, not merge; review each behavior in its own PR.
The H2 × H3 interaction is the only place in main.rs where two PRs edited the same region. AppSec is now built lazily on a background task, and its construction sits inside H2's build_services block, which is driven concurrently by tokio::join!. It is worth checking carefully that the deferred handle's lifecycle (resolve-on-first-use) behaves correctly under the reordered init.
H8's arm64 -Ctarget-cpu=neoverse-n1 + H7's eager binding now both apply to the alpine non-x86_64 build path; confirm the arm64 musl build is happy with the combined flags in CI.

Add debug-gated checkpoints at the key cold-start init boundaries (crypto provider, TLS client build, config parse, shared client, register, dogstatsd, trace agent, telemetry subscribe, ready), plus a one-time available_parallelism() log. Each checkpoint logs delta (time since the previous checkpoint = that phase's own cost) and cumulative (time since process start), in milliseconds to 6 decimal places (nanosecond resolution) so sub-millisecond phases are visible. Init time is attributed per phase directly, with no manual subtraction. The per-phase bookkeeping is guarded behind a DEBUG-level check, so it stays effectively free at the default info level. This is the measurement prerequisite (H0) for the cold-start improvements.
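The delta/cumulative bookkeeping described above can be sketched as follows. This is a minimal, std-only illustration, not the extension's actual instrumentation: `InitTimer`, `format_ms`, and the `enabled` flag (a stand-in for the DEBUG-level check) are hypothetical names.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper; the real instrumentation may be structured differently.
struct InitTimer {
    enabled: bool, // stand-in for a DEBUG-level check
    start: Instant,
    last: Instant,
}

impl InitTimer {
    fn new(enabled: bool) -> Self {
        let now = Instant::now();
        Self { enabled, start: now, last: now }
    }

    /// Returns a log line with delta and cumulative times, or None when disabled.
    fn checkpoint(&mut self, name: &str) -> Option<String> {
        if !self.enabled {
            return None; // skip all bookkeeping at the default log level
        }
        let now = Instant::now();
        let delta = now.duration_since(self.last); // this phase's own cost
        let cumulative = now.duration_since(self.start); // time since start
        self.last = now;
        Some(format!(
            "{name}: delta={}ms cumulative={}ms",
            format_ms(delta),
            format_ms(cumulative)
        ))
    }
}

/// Milliseconds with 6 decimal places, computed with integer math on nanoseconds.
fn format_ms(d: Duration) -> String {
    let nanos = d.as_nanos();
    format!("{}.{:06}", nanos / 1_000_000, nanos % 1_000_000)
}

fn main() {
    let mut timer = InitTimer::new(true);
    for phase in ["config_parse", "register", "ready"] {
        if let Some(line) = timer.checkpoint(phase) {
            println!("{line}");
        }
    }
}
```

Because each checkpoint resets `last`, every line reports that phase's own cost directly, which is what removes the need for manual subtraction.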

Append -Clink-arg=-Wl,-z,now -Clink-arg=-Wl,-z,relro to the clang-linker RUSTFLAGS in both compile Dockerfiles. Eager (now) binding resolves all dynamic symbols at load time instead of lazily via the PLT, moving resolution stalls off the Lambda INIT path; relro hardens the GOT. This only affects the dynamically-linked glibc layers; it is a no-op on the static musl build.

tikv-jemallocator links jemalloc with the _rjem_ symbol prefix, so a runtime MALLOC_CONF env var is never read. Set the compile-time JEMALLOC_SYS_WITH_MALLOC_CONF instead, in both the GNU and Alpine compile Dockerfiles. A single arena reduces the metadata jemalloc maps at init and lowers RSS; the extension is not allocation-throughput-bound, so arena contention is not a concern. Dockerfile-only change; not docker-built locally and pending benchmarking.

Lambda CPUs are known at build time: arm64 is Graviton2 (neoverse-n1) and x86_64 is targeted at the universally-safe x86-64-v2 baseline. Pin -Ctarget-cpu per PLATFORM in both compile Dockerfiles so codegen can use the available ISA extensions (helps crypto/compression during init). x86-64-v3 is deliberately avoided: it is not guaranteed across all Lambda x86 hosts and a wrong ISA surfaces as SIGILL at runtime.

The compile Dockerfiles built with 'cargo +stable', overriding the channel = "1.93.1" pin in rust-toolchain.toml. Drop the '+stable' override and install rustup with --default-toolchain none so rust-toolchain.toml auto-installs and drives the toolchain, making builds reproducible against the pinned version. Also remove the dead UPX install from Dockerfile.build_layer: the binary ships uncompressed, so nothing invokes upx anymore.

Switch the non-FIPS default feature from reqwest/rustls-tls-native-roots to reqwest/rustls-tls-webpki-roots so the two init-time reqwest clients (the register client in bin/bottlecap/main.rs and the shared flush client in src/http.rs) no longer call rustls_native_certs::load_native_certs() on every reqwest::Client::build(). webpki-roots uses a compiled-in Mozilla CA bundle, eliminating the per-build filesystem cert scan during cold start. Custom-cert (tls_cert_file -> add_root_certificate), proxy, and skip-ssl-validation paths are unchanged. The FIPS feature still uses native roots and is untouched.

Replace #[tokio::main] with an explicit multi-thread runtime whose worker count is derived from AWS_LAMBDA_FUNCTION_MEMORY_SIZE. AWS grants ~1 vCPU per 1769 MB, so workers = round(mem_mb / 1769) clamped to 1..=4 (integer math, no float casts; defaults to 2 when the env var is missing or unparseable). The init body moves verbatim into run(); all H0 cold-start instrumentation is preserved.
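The worker-count rule above can be sketched as a small pure function. This is an illustration under stated assumptions, not the code in the PR: `worker_count` and the constants are hypothetical names, and round-to-nearest is implemented by adding half the divisor before the integer division.

```rust
const MB_PER_VCPU: usize = 1769; // ~1 vCPU per 1769 MB, per the description above
const DEFAULT_WORKERS: usize = 2; // used when the env var is missing or unparseable

/// Hypothetical helper: derives a worker count from the raw env var value.
fn worker_count(mem_env: Option<&str>) -> usize {
    let Some(mem_mb) = mem_env.and_then(|v| v.trim().parse::<usize>().ok()) else {
        return DEFAULT_WORKERS;
    };
    // Integer rounding to nearest: add half the divisor, then divide.
    let rounded = mem_mb.saturating_add(MB_PER_VCPU / 2) / MB_PER_VCPU;
    rounded.clamp(1, 4)
}

fn main() {
    let raw = std::env::var("AWS_LAMBDA_FUNCTION_MEMORY_SIZE").ok();
    println!("workers = {}", worker_count(raw.as_deref()));
}
```

The result would then be passed to something like `tokio::runtime::Builder::new_multi_thread().worker_threads(n)` before calling `block_on(run())`; that wiring is omitted here to keep the sketch dependency-free.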

Compute the Lambda tag vec/string/function-tags-map once in Lambda::new_from_config and return the cached values from the getters, so repeated init- and per-trace-time calls to get_tags_vec/get_tags_string/get_function_tags_map are O(1) reads instead of re-iterating the tag map and re-running format!/join on every call. Hoist the two static limits-file regexes in proc/mod.rs (Max open files, Max processes) to LazyLock<Regex> so they compile once instead of on every fd/threads metrics sample. The trace_processor span_matches_tag_regex pattern is left as-is: its value comes from per-call user config (apm_filter_tags_regex_reject), not a static literal, so it cannot be hoisted to a LazyLock without changing behavior. Output (tag set, format, values) is unchanged.
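A minimal sketch of the compute-once idea for the tag getters, assuming a simplified tag provider. `LambdaTags`, its fields, and the `key:value` / comma-joined format are illustrative assumptions; the real `Lambda::new_from_config` and getter signatures may differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the Lambda tag provider.
struct LambdaTags {
    tags_vec: Vec<String>,
    tags_string: String,
}

impl LambdaTags {
    /// All formatting and joining happens exactly once, at construction.
    fn new(tags: &BTreeMap<String, String>) -> Self {
        let tags_vec: Vec<String> = tags.iter().map(|(k, v)| format!("{k}:{v}")).collect();
        let tags_string = tags_vec.join(",");
        Self { tags_vec, tags_string }
    }

    /// Getters only hand out the cached values; no re-iteration, no format!/join.
    fn get_tags_vec(&self) -> &[String] {
        &self.tags_vec
    }

    fn get_tags_string(&self) -> &str {
        &self.tags_string
    }
}

fn main() {
    let mut tags = BTreeMap::new();
    tags.insert("region".to_string(), "us-east-1".to_string());
    tags.insert("runtime".to_string(), "python3.12".to_string());
    let cached = LambdaTags::new(&tags);
    println!("{} tags: {}", cached.get_tags_vec().len(), cached.get_tags_string());
}
```

The same reasoning applies to the hoisted regexes: a pattern that is a static literal can be compiled once, whereas a pattern that comes from per-call user config cannot be cached this way without changing behavior.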

Bump direct deps to match the transitive graph and collapse duplicate compiled crate versions:
- nix 0.26 -> 0.29 (also removes the duplicate bitflags 1.x)
- thiserror 1 -> 2 (drop-in; no source changes)
- opentelemetry-semantic-conventions 0.30 -> 0.31 (no source changes)
- rand 0.8 -> 0.9 (thread_rng->rng, gen->random, OsRng now TryRngCore)

nix/bitflags and semconv duplicates fully collapse. The rand 0.8 and thiserror 1.x copies that remain are pulled only by upstream Datadog git crates (dd-trace-rs, serverless-components, libdatadog) and cannot be removed from this repo.

…nstruction

The Lambda Extensions API ends the INIT phase at the first /next call, so the serialized work before it directly inflates cold start. The telemetry subscribe round-trip previously ran last, behind trace-agent/AppSec/API-proxy/lifecycle construction. Hoist the telemetry subscribe so it runs as soon as logs_agent_channel is available, and overlap its HTTP round-trip with the remaining (synchronous) service construction via tokio::join! (subscribe polled first, so its network call is in flight during construction). To keep this correct, TelemetryListener::start now binds its socket synchronously (before subscribe returns) instead of inside a spawned task, so the listener is already accepting connections when the Telemetry API begins delivering events. No early platform.initStart/initReport events or logs are dropped. What gets built is unchanged; only the point at which the subscribe is issued changes.
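The "bind synchronously, then hand off to a background task" ordering can be illustrated with a std-only analogue. This sketch uses OS threads and `std::net` in place of tokio, and `start_listener` is a hypothetical name; it is not the TelemetryListener implementation, only the same ordering guarantee in miniature.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread::{self, JoinHandle};

/// Hypothetical helper: binds on the caller's thread, accepts in the background.
fn start_listener() -> std::io::Result<(SocketAddr, JoinHandle<std::io::Result<Vec<u8>>>)> {
    // Bind before returning: once the caller has the address, the socket
    // is already listening and early connections are queued by the OS.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    // Only the accept/read work moves to the background.
    let handle = thread::spawn(move || -> std::io::Result<Vec<u8>> {
        let (mut stream, _) = listener.accept()?;
        let mut buf = Vec::new();
        stream.read_to_end(&mut buf)?;
        Ok(buf)
    });
    Ok((addr, handle))
}

fn main() -> std::io::Result<()> {
    let (addr, handle) = start_listener()?;
    // A sender that connects immediately cannot race the bind.
    let mut client = TcpStream::connect(addr)?;
    client.write_all(b"platform.initStart")?;
    drop(client); // close the connection so read_to_end finishes
    let received = handle.join().expect("listener thread panicked")?;
    assert_eq!(received, b"platform.initStart");
    Ok(())
}
```

If the bind instead happened inside the spawned work, a sender could connect before the socket existed, which is the race the synchronous bind is described as closing.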

Build the register/`/next` reqwest client on a blocking thread inside a spawned task so its native-cert-loading TLS build (and the register network round-trip) overlaps with config parsing and the shared flushing client build, instead of running serially during cold start. The register/`/next` client and the shared flushing client are kept separate on purpose and not collapsed: the Extension API register + `/next` long-poll must use `.no_proxy()` and carry no `flush_timeout` (which would abort the long-poll), while the shared client requires proxy support, a flush_timeout, and pool_max_idle_per_host(0). Those needs conflict, so their construction is overlapped rather than merged. All existing client settings and the cold-start init checkpoints are preserved.

AppSecProcessor::new zstd-decompresses a ~29KB->322KB ruleset, JSON-parses it, and compiles the libddwaf WAF (tens of ms) synchronously during init. The WAF is only needed once the first request payload is evaluated, which is strictly after the first /next, so this work does not belong on the init critical path. Replace the eager Option<Arc<Mutex<Processor>>> with a deferred, awaitable handle (Arc<OnceCell<Option<Arc<Mutex<Processor>>>>>). When AppSec is enabled, the build runs on the blocking pool (spawn_blocking) from a background task; consumers (trace processor and the runtime API proxy) resolve the handle where they actually use the WAF, awaiting the in-flight build if a request somehow arrives before it finishes. The disabled-by-default path stays cheap: the feature flag is checked synchronously and yields no handle and no build.
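A std-only sketch of the deferred-handle pattern follows. It substitutes `std::sync::OnceLock` and an OS thread for the async `OnceCell` and `spawn_blocking` named above, so it is an analogue rather than the PR's code: the signatures of `defer_processor` and `resolve`, the `Processor` fields, and `build_processor` are all assumptions made for illustration.

```rust
use std::sync::{Arc, Mutex, OnceLock};
use std::thread;

/// Placeholder for the real processor; `rules` is an illustrative field.
struct Processor {
    rules: usize,
}

type SharedProcessor = Arc<Mutex<Processor>>;
type DeferredProcessor = Arc<OnceLock<Option<SharedProcessor>>>;

/// Stand-in for the expensive build (decompress, parse, compile).
fn build_processor() -> Option<SharedProcessor> {
    Some(Arc::new(Mutex::new(Processor { rules: 3 })))
}

/// Checks the flag synchronously; when disabled, no handle and no build.
fn defer_processor(enabled: bool) -> Option<DeferredProcessor> {
    if !enabled {
        return None;
    }
    let cell: DeferredProcessor = Arc::new(OnceLock::new());
    let background = Arc::clone(&cell);
    // Start the build off the init path; the thread is detached.
    thread::spawn(move || {
        background.get_or_init(build_processor);
    });
    Some(cell)
}

/// Resolves at the use site. If the background build is still in flight,
/// get_or_init blocks until it completes; the cell is initialized only once.
fn resolve(handle: Option<&DeferredProcessor>) -> Option<SharedProcessor> {
    handle?.get_or_init(build_processor).clone()
}

fn main() {
    let handle = defer_processor(true);
    if let Some(processor) = resolve(handle.as_ref()) {
        println!("rules loaded: {}", processor.lock().unwrap().rules);
    }
}
```

One difference from the description above: in this analogue a consumer that arrives before the background thread has started would run the build itself, whereas the PR describes consumers awaiting the in-flight build.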

…call site H3 (appsec-defer) added Arc::clone(config) to the interceptor::start call inside start_api_runtime_proxy's body (signature unchanged at 5 params). While resolving the H2/H3 conflict I mistakenly also added that arg to the start_api_runtime_proxy call site inside H2's build_services block, which the 5-param signature rejects. Remove the stray arg; the inner interceptor::start call already carries config.

datadog-datadog-prod-us1-2 · 2026-06-24T04:50:28Z

⚠️ Warnings

🚦 5 Pipeline jobs failed

DataDog/datadog-lambda-extension | integration-suite: [lmi]

DataDog/datadog-lambda-extension | integration-suite: [on-demand]

DataDog/datadog-lambda-extension | publish layer e2e sandbox (amd64)

Commit SHA: 8b565fe

duncanista · 2026-06-24T05:28:45Z

Cold-start benchmark — integration build vs prod v98

Method: 25 forced cold starts per function (bump a dummy env var before each invoke → fresh sandbox). Identical config: python3.12, arm64, 1024 MB, DD_LOG_LEVEL=debug, same account/region (us-east-1). Integration = this branch built into a release/arm64 layer; baseline = Datadog-Extension-ARM:98.

Extension's own init — `Datadog Next-Gen Extension ready in …ms` (the part these PRs change)

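| Build (ms) | min | p50 | p90 | p99 | max | mean |
|---|---|---|---|---|---|---|
| Integration (#1284) | 10 | 59 | 78 | 106 | 106 | 54.5 |
| Baseline (v98) | 29 | 74 | 95 | 116 | 116 | 69.0 |
| Δ | −19 | −15 (~20%) | −17 | −10 | — | −14.5 (~21%) |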
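For readers who want to re-derive the percentages in the Δ row, here is a small sketch of the arithmetic. `relative_change_pct` is a hypothetical helper; the inputs are the p50 and mean values from the table above.

```rust
/// Relative change of `candidate` versus `baseline`, in percent (negative = faster).
fn relative_change_pct(candidate: f64, baseline: f64) -> f64 {
    (candidate - baseline) / baseline * 100.0
}

fn main() {
    // (label, integration ms, baseline ms) taken from the table above.
    let rows = [("p50", 59.0, 74.0), ("mean", 54.5, 69.0)];
    for (label, integration, baseline) in rows {
        println!("{label}: {:+.1}%", relative_change_pct(integration, baseline));
    }
}
```

Both results land within rounding of the ~20% and ~21% figures quoted in the table.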

Platform `Init Duration` (runtime + extension, run in parallel)

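| Build (ms) | min | p50 | p90 | p99 | max | mean |
|---|---|---|---|---|---|---|
| Integration (#1284) | 166.8 | 208.9 | 231.1 | 258.5 | 258.5 | 207.5 |
| Baseline (v98) | 166.8 | 219.8 | 243.0 | 251.1 | 251.1 | 212.7 |
| Δ | 0 | −11 | −12 | +7 | — | −5 |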

Takeaways

Extension init is ~20% faster at p50 (consistent across min/p50/p90/mean) — and the integration build is also carrying the extra H0 init-timing debug logging the baseline lacks, yet still wins.
It propagates only partly to total Init Duration (−11 ms p50): on python3.12 the runtime/platform dominate (identical 166.8 ms floor for both), so the extension win is muted. On a minimal provided.al2023/Go function (extension = long pole) the delta would show through more directly.

Per-phase breakdown of a representative cold start (from the new init instrumentation):
crypto_provider_ready 9.34ms · config_parse 1.86ms · shared_client_ready 0.15ms · register_ready 2.49ms · dogstatsd 0.51ms · trace_agent 0.31ms · telemetry_subscribed 24.97ms → ready ~40ms. The dominant single phase is telemetry_subscribed (a network round-trip to the Telemetry API).

Caveats: N=25–26 (p50/mean reliable; p99≈max and noisy). Loops ran sequentially, not interleaved. Both at debug log level (inflates absolutes equally). This run also confirms the integration's Dockerfile changes (eager-binding / target-cpu / jemalloc / toolchain) build cleanly in release/arm64.

duncanista added 24 commits June 23, 2026 22:13

chore(build): drop inert --component rust-src flag

ee1bd39

With --default-toolchain none, rust-src had nothing to attach to; the toml-pinned toolchain installs only rustfmt/clippy and nothing consumes rust-src.

Merge remote-tracking branch 'origin/jordan.gonzalez/eager-binding/fe…

5bb8141

…ature' into jordan.gonzalez/cold-start-integration/feature

Merge remote-tracking branch 'origin/jordan.gonzalez/target-cpu/featu…

f99a3d5

…re' into jordan.gonzalez/cold-start-integration/feature # Conflicts: # images/Dockerfile.bottlecap.alpine.compile # images/Dockerfile.bottlecap.compile

Merge remote-tracking branch 'origin/jordan.gonzalez/build-hygiene/fe…

7aa1984

…ature' into jordan.gonzalez/cold-start-integration/feature # Conflicts: # images/Dockerfile.bottlecap.compile

Merge remote-tracking branch 'origin/jordan.gonzalez/tls-shared-roots…

a58a2c6

…/feature' into jordan.gonzalez/cold-start-integration/feature

Merge remote-tracking branch 'origin/jordan.gonzalez/dedup-crates/fea…

2cbe08a

…ture' into jordan.gonzalez/cold-start-integration/feature

Merge remote-tracking branch 'origin/jordan.gonzalez/tokio-runtime/fe…

f8b447a

…ature' into jordan.gonzalez/cold-start-integration/feature

Merge remote-tracking branch 'origin/jordan.gonzalez/http-client-reus…

f78d053

…e/feature' into jordan.gonzalez/cold-start-integration/feature

Merge remote-tracking branch 'origin/jordan.gonzalez/init-reorder/fea…

22e6a64

…ture' into jordan.gonzalez/cold-start-integration/feature

Merge remote-tracking branch 'origin/jordan.gonzalez/appsec-defer/fea…

cc0e1aa

…ture' into jordan.gonzalez/cold-start-integration/feature # Conflicts: # bottlecap/src/bin/bottlecap/main.rs

Merge remote-tracking branch 'origin/jordan.gonzalez/tag-regex-cleanu…

9dc28e9

…p/feature' into jordan.gonzalez/cold-start-integration/feature

duncanista mentioned this pull request Jun 24, 2026

perf(init): issue telemetry subscribe concurrently with first /next (H18 experiment) #1285

Draft
