
Cold-start improvements — integration (combined testing)#1284

Draft
duncanista wants to merge 24 commits into main from jordan.gonzalez/cold-start-integration/feature

@duncanista

Contributor

Jira: none yet — add before marking ready.

DRAFT / DO NOT MERGE AS-IS. This is an umbrella PR that merges every individual cold-start PR onto one branch so the set can be built and benchmarked together in CI as a single unit. Each change lands (and is reviewed) in its own PR; this branch exists only for combined testing. It is not intended to merge — once the constituent PRs land on main this branch is discarded.

Overview

Integration branch (jordan.gonzalez/cold-start-integration/feature) built on origin/main, merging the full cold-start improvement stack. Conflicts were resolved by taking the union of every PR's intent (no change was dropped). Merge order was chosen to minimize conflict pain (build PRs first, then deps, then the main.rs init-path PRs last):

Included PRs:

Excluded (intentionally):

Non-trivial conflict reconciliations

  • Dockerfiles (Dockerfile.bottlecap.compile, Dockerfile.bottlecap.alpine.compile) — all four build PRs touch the same export RUSTFLAGS region. The exported RUSTFLAGS now contains the union: base ${RUSTFLAGS:-}, the -Clinker=clang -L…builtins… flags, H7's -z,now -z,relro, and H8's -Ctarget-cpu=${TARGET_CPU} (with H8's aarch64→neoverse-n1 / else→x86-64-v2 conditional). In the alpine file, H8 introduced a non-x86_64 else branch — that branch was also given H7's -z,now -z,relro so eager binding applies on arm64 too. H16's ENV JEMALLOC_SYS_WITH_MALLOC_CONF="narenas:1" and H10's toolchain changes (--default-toolchain none, cargo build without +stable) + UPX removal all coexist.
  • bottlecap/Cargo.toml — H1 (feature line) and H12 (dependency lines) edit different sections; unioned cleanly. Cargo.lock came in via H12 and was unchanged by the clippy/build pass.
  • bottlecap/src/bin/bottlecap/main.rs — the four init-path PRs were combined so every behavior survives:
    • H15's fn main() builds the runtime and calls block_on(run()); the former main body lives in async fn run().
    • H4's parallel register-client construction (spawn_blocking build + register(), returning (client, RegisterResponse)) lives inside run().
    • H2's reorder (setup_telemetry_client hoisted into a telemetry_setup future; remaining service construction wrapped in a build_services async block; both driven by tokio::join!) is preserved in extension_loop_active.
    • H3's deferred AppSec handle is applied inside H2's build_services block: the old eager match AppSecProcessor::new(...) became appsec::defer_processor(config), and the DeferredProcessor (Arc<OnceCell<Option<SharedProcessor>>>) flows out through the join! tuple to the trace agent and runtime proxy, which resolve it via appsec::resolve at use sites.
  • One integration-only fixup commit: while resolving the H2/H3 overlap I briefly added H3's Arc::clone(config) argument to the wrong place — the start_api_runtime_proxy call site inside build_services rather than the inner interceptor::start call (which already carries it). The stray argument was removed; start_api_runtime_proxy's signature is unchanged at 5 params.

Testing

  • cargo fmt --manifest-path bottlecap/Cargo.toml — no changes.
  • cargo clippy --manifest-path bottlecap/Cargo.toml --bin bottlecap --no-deps: clean (clippy::all + pedantic + unwrap_used denied). The only remaining note is the pre-existing buf_redux / multipart future-incompat warning, which is already on main.
  • cargo build --manifest-path bottlecap/Cargo.toml --bin bottlecap — builds successfully.
  • cargo test --manifest-path bottlecap/Cargo.toml --lib lifecycle::invocation — 219 passed / 0 failed (covers H12's rand-0.9 generate_span_id migration).
  • Combined CI / cold-start benchmarking across the full stack: to be run on this branch (the reason it exists).

Reviewer notes / risks

  • This branch is for measurement, not merge; review each behavior in its own PR.
  • The H2 × H3 interaction is the only place two PRs edited the same region. AppSec is now built lazily on a background task and its construction now sits inside H2's concurrently-join!ed build_services block — worth a careful look that the deferred handle's lifecycle (resolve-on-first-use) behaves under the reordered init.
  • H8's arm64 -Ctarget-cpu=neoverse-n1 + H7's eager binding now both apply to the alpine non-x86_64 build path; confirm the arm64 musl build is happy with the combined flags in CI.

Add debug-gated checkpoints at the key cold-start init boundaries (crypto
provider, TLS client build, config parse, shared client, register,
dogstatsd, trace agent, telemetry subscribe, ready), plus a one-time
available_parallelism() log.

Each checkpoint logs delta (time since the previous checkpoint = that
phase's own cost) and cumulative (time since process start), in
milliseconds to 6 decimal places (nanosecond resolution) so sub-millisecond
phases are visible. Init time is attributed per phase directly, with no
manual subtraction. The per-phase bookkeeping is guarded behind a
DEBUG-level check, so it stays effectively free at the default info level.

This is the measurement prerequisite (H0) for the cold-start improvements.
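
The delta/cumulative bookkeeping described above can be sketched with std::time::Instant. This is a minimal illustration under stated assumptions, not the extension's code: the `Checkpoints` type, the `enabled` flag (a stand-in for the DEBUG-level check), and the log-line format are all hypothetical.

```rust
use std::time::{Duration, Instant};

/// Formats a duration as milliseconds with 6 decimal places,
/// i.e. down to nanosecond resolution.
fn fmt_ms(d: Duration) -> String {
    format!("{:.6}", d.as_secs_f64() * 1000.0)
}

/// Hypothetical checkpoint tracker (assumed name and shape).
struct Checkpoints {
    enabled: bool, // stand-in for "is DEBUG logging enabled?"
    start: Instant, // process/init start
    last: Instant,  // previous checkpoint
}

impl Checkpoints {
    fn new(enabled: bool) -> Self {
        let now = Instant::now();
        Self { enabled, start: now, last: now }
    }

    /// Returns the line that would be logged for `phase`,
    /// or None when the gate is off (no clock reads, no formatting).
    fn mark(&mut self, phase: &str) -> Option<String> {
        if !self.enabled {
            return None;
        }
        let now = Instant::now();
        let delta = now.duration_since(self.last); // this phase's own cost
        let cumulative = now.duration_since(self.start); // time since start
        self.last = now;
        Some(format!(
            "{phase} delta_ms={} cumulative_ms={}",
            fmt_ms(delta),
            fmt_ms(cumulative)
        ))
    }
}
```

Calling `mark("config_parse")` right after config parsing would attribute the elapsed time since the previous checkpoint to that phase, so no manual subtraction between log lines is needed.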

Append -Clink-arg=-Wl,-z,now -Clink-arg=-Wl,-z,relro to the clang-linker RUSTFLAGS in both compile Dockerfiles. Eager (now) binding resolves all dynamic symbols at load time instead of lazily via the PLT, moving resolution stalls off the Lambda INIT path; relro hardens the GOT. This only affects the dynamically-linked glibc layers; it is a no-op on the static musl build.

tikv-jemallocator links jemalloc with the _rjem_ symbol prefix, so a runtime MALLOC_CONF env var is never read. Set the compile-time JEMALLOC_SYS_WITH_MALLOC_CONF instead, in both the GNU and Alpine compile Dockerfiles. A single arena reduces the metadata jemalloc maps at init and lowers RSS; the extension is not allocation-throughput-bound, so arena contention is not a concern.

Dockerfile-only change; not docker-built locally and pending benchmarking.

Lambda CPUs are known at build time: arm64 is Graviton2 (neoverse-n1)
and x86_64 is targeted at the universally-safe x86-64-v2 baseline.
Pin -Ctarget-cpu per PLATFORM in both compile Dockerfiles so codegen
can use the available ISA extensions (helps crypto/compression during
init). x86-64-v3 is deliberately avoided: it is not guaranteed across
all Lambda x86 hosts and a wrong ISA surfaces as SIGILL at runtime.

The compile Dockerfiles previously built with 'cargo +stable', overriding the
channel = "1.93.1" pin in rust-toolchain.toml. Drop the '+stable'
override and install rustup with --default-toolchain none so
rust-toolchain.toml auto-installs and drives the toolchain, making
builds reproducible against the pinned version.

Also remove the dead UPX install from Dockerfile.build_layer: the
binary ships uncompressed, so nothing invokes upx anymore.

Switch the non-FIPS default feature from reqwest/rustls-tls-native-roots
to reqwest/rustls-tls-webpki-roots so the two init-time reqwest clients
(the register client in bin/bottlecap/main.rs and the shared flush client
in src/http.rs) no longer call rustls_native_certs::load_native_certs() on
every reqwest::Client::build(). webpki-roots uses a compiled-in Mozilla CA
bundle, eliminating the per-build filesystem cert scan during cold start.

Custom-cert (tls_cert_file -> add_root_certificate), proxy, and
skip-ssl-validation paths are unchanged. The FIPS feature still uses
native roots and is untouched.

Replace #[tokio::main] with an explicit multi-thread runtime whose worker
count is derived from AWS_LAMBDA_FUNCTION_MEMORY_SIZE. AWS grants ~1 vCPU
per 1769 MB, so workers = round(mem_mb / 1769) clamped to 1..=4 (integer
math, no float casts; defaults to 2 when the env var is missing or
unparseable). The init body moves verbatim into run(); all H0 cold-start
instrumentation is preserved.
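
The worker-count rule above (round to nearest, clamp to 1..=4, default 2) can be sketched in pure integer math. The function name `worker_count` and its signature are assumptions made for illustration; only the 1769 MB ratio, the clamp range, and the default come from the description.

```rust
/// Approximate memory per vCPU, in MB, as stated in the description.
const MB_PER_VCPU: usize = 1769;

/// Hypothetical helper: maps the raw AWS_LAMBDA_FUNCTION_MEMORY_SIZE value
/// to a runtime worker count.
fn worker_count(mem_size: Option<&str>) -> usize {
    let Some(mem_mb) = mem_size.and_then(|s| s.trim().parse::<usize>().ok()) else {
        return 2; // env var missing or unparseable
    };
    // Round-to-nearest integer division: (n + d / 2) / d, no floats involved.
    let rounded = mem_mb.saturating_add(MB_PER_VCPU / 2) / MB_PER_VCPU;
    rounded.clamp(1, 4)
}
```

A caller would pass something like `std::env::var("AWS_LAMBDA_FUNCTION_MEMORY_SIZE").ok().as_deref()` and hand the result to the runtime builder's worker-thread setting. Under this sketch a 1024 MB function gets 1 worker and a 10240 MB function is capped at 4.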
Compute the Lambda tag vec/string/function-tags-map once in Lambda::new_from_config and return the cached values from the getters, so repeated init- and per-trace-time calls to get_tags_vec/get_tags_string/get_function_tags_map are O(1) reads instead of re-iterating the tag map and re-running format!/join on every call.

Hoist the two static limits-file regexes in proc/mod.rs (Max open files, Max processes) to LazyLock<Regex> so they compile once instead of on every fd/threads metrics sample.

The trace_processor span_matches_tag_regex pattern is left as-is: its value comes from per-call user config (apm_filter_tags_regex_reject), not a static literal, so it cannot be hoisted to a LazyLock without changing behavior.

Output (tag set, format, values) is unchanged.
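
The compute-once pattern described above can be sketched as follows. `LambdaTags`, its constructor, and the `key:value` comma-joined format are illustrative assumptions; the real type is built in Lambda::new_from_config and its getter signatures may differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the tag provider: every derived form is
/// computed once in the constructor and only borrowed afterwards.
struct LambdaTags {
    tags_map: BTreeMap<String, String>,
    tags_vec: Vec<String>,
    tags_string: String,
}

impl LambdaTags {
    fn new(tags_map: BTreeMap<String, String>) -> Self {
        // The only place that iterates the map and runs format!/join.
        let tags_vec: Vec<String> =
            tags_map.iter().map(|(k, v)| format!("{k}:{v}")).collect();
        let tags_string = tags_vec.join(",");
        Self { tags_map, tags_vec, tags_string }
    }

    // Getters return cached values: O(1), no allocation per call.
    fn get_tags_vec(&self) -> &[String] {
        &self.tags_vec
    }

    fn get_tags_string(&self) -> &str {
        &self.tags_string
    }

    fn get_function_tags_map(&self) -> &BTreeMap<String, String> {
        &self.tags_map
    }
}
```

Because the cached values are derived from the same map at construction, the output stays identical to recomputing on every call, as long as the tags do not change after construction.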
Bump direct deps to match the transitive graph and collapse duplicate
compiled crate versions:

- nix 0.26 -> 0.29 (also removes the duplicate bitflags 1.x)
- thiserror 1 -> 2 (drop-in; no source changes)
- opentelemetry-semantic-conventions 0.30 -> 0.31 (no source changes)
- rand 0.8 -> 0.9 (thread_rng->rng, gen->random, OsRng now TryRngCore)

nix/bitflags and semconv duplicates fully collapse. The rand 0.8 and
thiserror 1.x copies that remain are pulled only by upstream Datadog git
crates (dd-trace-rs, serverless-components, libdatadog) and cannot be
removed from this repo.

…nstruction

The Lambda Extensions API ends the INIT phase at the first /next call, so the
serialized work before it directly inflates cold start. The telemetry subscribe
round-trip previously ran last, behind trace-agent/AppSec/API-proxy/lifecycle
construction.

Hoist the telemetry subscribe so it runs as soon as logs_agent_channel is
available, and overlap its HTTP round-trip with the remaining (synchronous)
service construction via tokio::join! (subscribe polled first, so its network
call is in flight during construction).

To keep this correct, TelemetryListener::start now binds its socket
synchronously (before subscribe returns) instead of inside a spawned task, so
the listener is already accepting connections when the Telemetry API begins
delivering events. No early platform.initStart/initReport or logs are dropped.
What gets built is unchanged; only when the subscribe is issued.
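
The bind-before-return ordering can be illustrated with std networking and threads standing in for tokio tasks. This is a sketch of the ordering guarantee only; `start_listener` and `deliver_event` are hypothetical names and the real TelemetryListener is async.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread::{self, JoinHandle};

/// Binds synchronously, then moves only the accept loop to another thread.
/// When this function returns, the socket is already listening.
fn start_listener() -> std::io::Result<(SocketAddr, JoinHandle<std::io::Result<String>>)> {
    let listener = TcpListener::bind("127.0.0.1:0")?; // bound before we return
    let addr = listener.local_addr()?;
    let handle = thread::spawn(move || -> std::io::Result<String> {
        let (mut stream, _) = listener.accept()?;
        let mut body = String::new();
        stream.read_to_string(&mut body)?;
        Ok(body)
    });
    Ok((addr, handle))
}

/// Stand-in for the platform delivering an event right after subscribe.
fn deliver_event(addr: SocketAddr, event: &str) -> std::io::Result<()> {
    let mut stream = TcpStream::connect(addr)?;
    stream.write_all(event.as_bytes())
    // `stream` is dropped here, which signals EOF to the reader.
}
```

If the bind instead happened inside the spawned thread, an event sent immediately after `start_listener` returned could race the bind and be refused; binding first removes that window.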
Build the register/`/next` reqwest client on a blocking thread inside a
spawned task so its native-cert-loading TLS build (and the register network
round-trip) overlaps with config parsing and the shared flushing client
build, instead of running serially during cold start.

The register/`/next` client and the shared flushing client are kept
separate on purpose and not collapsed: the Extension API register + `/next`
long-poll must use `.no_proxy()` and carry no `flush_timeout` (which would
abort the long-poll), while the shared client requires proxy support, a
flush_timeout, and pool_max_idle_per_host(0). Those needs conflict, so their
construction is overlapped rather than merged. All existing client settings
and the cold-start init checkpoints are preserved.
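
A rough sketch of why the two clients stay separate and how their construction can overlap. std::thread stands in for a tokio blocking task, and `ClientSettings` is a hypothetical struct that only models the settings named above; no real reqwest client is built.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical model of the settings that differ between the two clients.
#[derive(Debug, Clone, PartialEq)]
struct ClientSettings {
    use_proxy: bool,
    timeout: Option<Duration>,
    pool_max_idle_per_host: Option<usize>,
}

/// Register + `/next` long-poll: no proxy, and no timeout that could
/// abort the long-poll.
fn register_client_settings() -> ClientSettings {
    ClientSettings { use_proxy: false, timeout: None, pool_max_idle_per_host: None }
}

/// Shared flushing client: proxy support, flush timeout, no idle pooling.
fn shared_client_settings(flush_timeout: Duration) -> ClientSettings {
    ClientSettings {
        use_proxy: true,
        timeout: Some(flush_timeout),
        pool_max_idle_per_host: Some(0),
    }
}

/// Starts the register-side build on another thread, does the shared-client
/// work on the current thread, then joins.
fn init_clients(flush_timeout: Duration) -> (ClientSettings, ClientSettings) {
    let register = thread::spawn(register_client_settings);
    let shared = shared_client_settings(flush_timeout); // overlaps with the thread above
    let register = register.join().expect("register build thread panicked");
    (register, shared)
}
```

The two settings sets disagree on proxy and timeout, which is why a single merged client cannot satisfy both; overlapping the builds recovers most of the time without merging them.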
AppSecProcessor::new zstd-decompresses a ~29KB->322KB ruleset, JSON-parses
it, and compiles the libddwaf WAF (tens of ms) synchronously during init.
The WAF is only needed once the first request payload is evaluated, which is
strictly after the first /next, so this work does not belong on the init
critical path.

Replace the eager Option<Arc<Mutex<Processor>>> with a deferred, awaitable
handle (Arc<OnceCell<Option<Arc<Mutex<Processor>>>>>). When AppSec is enabled,
the build runs on the blocking pool (spawn_blocking) from a background task;
consumers (trace processor and the runtime API proxy) resolve the handle where
they actually use the WAF, awaiting the in-flight build if a request somehow
arrives before it finishes. The disabled-by-default path stays cheap: the
feature flag is checked synchronously and yields no handle and no build.
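
The deferred-handle idea can be sketched with std primitives. This is an analogy under stated assumptions, not the PR's implementation: std::sync::OnceLock and a plain thread replace the async OnceCell and spawn_blocking, `Processor` is a dummy type, and the signatures of `defer_processor` and `resolve` here are invented for the sketch.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, OnceLock};
use std::thread;

/// Counts how many times the expensive build actually runs.
static BUILDS: AtomicUsize = AtomicUsize::new(0);

/// Dummy stand-in for the WAF-backed processor.
struct Processor {
    rules: usize,
}

type Deferred = Arc<OnceLock<Option<Arc<Mutex<Processor>>>>>;

/// Stand-in for the expensive decompress + parse + compile step.
fn build_processor() -> Option<Arc<Mutex<Processor>>> {
    BUILDS.fetch_add(1, Ordering::SeqCst);
    Some(Arc::new(Mutex::new(Processor { rules: 42 })))
}

/// Disabled: no handle and no build. Enabled: start the build in the
/// background and return a handle immediately.
fn defer_processor(enabled: bool) -> Option<Deferred> {
    if !enabled {
        return None;
    }
    let cell: Deferred = Arc::new(OnceLock::new());
    let background = Arc::clone(&cell);
    thread::spawn(move || {
        background.get_or_init(build_processor);
    });
    Some(cell)
}

/// Resolve at the use site: blocks while a build is in flight,
/// and the build runs at most once overall.
fn resolve(handle: &Option<Deferred>) -> Option<Arc<Mutex<Processor>>> {
    handle
        .as_ref()
        .and_then(|cell| cell.get_or_init(build_processor).clone())
}
```

In this sketch a consumer that arrives before the background build finishes simply waits inside `get_or_init`, which mirrors the "await the in-flight build" behavior described above; init itself never waits on the build.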
With --default-toolchain none, rust-src had nothing to attach to; the toml-pinned toolchain installs only rustfmt/clippy and nothing consumes rust-src.
…ature' into jordan.gonzalez/cold-start-integration/feature
…re' into jordan.gonzalez/cold-start-integration/feature

# Conflicts:
#	images/Dockerfile.bottlecap.alpine.compile
#	images/Dockerfile.bottlecap.compile
…ature' into jordan.gonzalez/cold-start-integration/feature

# Conflicts:
#	images/Dockerfile.bottlecap.compile
…/feature' into jordan.gonzalez/cold-start-integration/feature
…ture' into jordan.gonzalez/cold-start-integration/feature
…ature' into jordan.gonzalez/cold-start-integration/feature
…e/feature' into jordan.gonzalez/cold-start-integration/feature
…ture' into jordan.gonzalez/cold-start-integration/feature
…ture' into jordan.gonzalez/cold-start-integration/feature

# Conflicts:
#	bottlecap/src/bin/bottlecap/main.rs
…p/feature' into jordan.gonzalez/cold-start-integration/feature
…call site

H3 (appsec-defer) added Arc::clone(config) to the interceptor::start call inside
start_api_runtime_proxy's body (signature unchanged at 5 params). While resolving
the H2/H3 conflict I mistakenly also added that arg to the start_api_runtime_proxy
call site inside H2's build_services block, which the 5-param signature rejects.
Remove the stray arg; the inner interceptor::start call already carries config.
@datadog-datadog-prod-us1-2

datadog-datadog-prod-us1-2 Bot commented Jun 24, 2026

Pipelines

⚠️ Warnings

🚦 5 Pipeline jobs failed (3 listed here)

  • DataDog/datadog-lambda-extension | integration-suite: [lmi]
  • DataDog/datadog-lambda-extension | integration-suite: [on-demand]
  • DataDog/datadog-lambda-extension | publish layer e2e sandbox (amd64)

Commit SHA: 8b565fe

@duncanista

Contributor Author

Cold-start benchmark — integration build vs prod v98

Method: 25 forced cold starts per function (bump a dummy env var before each invoke → fresh sandbox). Identical config: python3.12, arm64, 1024 MB, DD_LOG_LEVEL=debug, same account/region (us-east-1). Integration = this branch built into a release/arm64 layer; baseline = Datadog-Extension-ARM:98.

Extension's own init — Datadog Next-Gen Extension ready in …ms (the part these PRs change)

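| Build (values in ms) | min | p50 | p90 | p99 | max | mean |
|---|---|---|---|---|---|---|
| Integration (#1284) | 10 | 59 | 78 | 106 | 106 | 54.5 |
| Baseline (v98) | 29 | 74 | 95 | 116 | 116 | 69.0 |
| Δ | −19 | −15 (~20%) | −17 | −10 | | −14.5 (~21%) |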
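
The percentages in the Δ row follow directly from the table values. As a quick arithmetic check (a sketch; `pct_improvement` is a hypothetical helper name, not part of the benchmark tooling):

```rust
/// Relative improvement of `candidate_ms` over `baseline_ms`, in percent.
/// Positive means the candidate is faster.
fn pct_improvement(baseline_ms: f64, candidate_ms: f64) -> f64 {
    (baseline_ms - candidate_ms) / baseline_ms * 100.0
}
```

With the figures above, `pct_improvement(74.0, 59.0)` is about 20.3 (p50) and `pct_improvement(69.0, 54.5)` is about 21.0 (mean), matching the rounded ~20% and ~21%.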

Platform Init Duration (runtime + extension, run in parallel)

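| Build (values in ms) | min | p50 | p90 | p99 | max | mean |
|---|---|---|---|---|---|---|
| Integration (#1284) | 166.8 | 208.9 | 231.1 | 258.5 | 258.5 | 207.5 |
| Baseline (v98) | 166.8 | 219.8 | 243.0 | 251.1 | 251.1 | 212.7 |
| Δ | 0 | −11 | −12 | +7 | | −5 |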

Takeaways

  • Extension init is ~20% faster at p50 (consistent across min/p50/p90/mean) — and the integration build is also carrying the extra H0 init-timing debug logging the baseline lacks, yet still wins.
  • It propagates only partly to total Init Duration (−11 ms p50): on python3.12 the runtime/platform dominate (identical 166.8 ms floor for both), so the extension win is muted. On a minimal provided.al2023/Go function (extension = long pole) the delta would show through more directly.

Per-phase breakdown of a representative cold start (from the new init instrumentation):
crypto_provider_ready 9.34ms · config_parse 1.86ms · shared_client_ready 0.15ms · register_ready 2.49ms · dogstatsd 0.51ms · trace_agent 0.31ms · telemetry_subscribed 24.97ms → ready ~40ms. The dominant single phase is telemetry_subscribed (a network round-trip to the Telemetry API).

Caveats: N=25–26 (p50/mean reliable; p99≈max and noisy). Loops ran sequentially, not interleaved. Both at debug log level (inflates absolutes equally). This run also confirms the integration's Dockerfile changes (eager-binding / target-cpu / jemalloc / toolchain) build cleanly in release/arm64.
