[parametric] reuse the test-agent per xdist worker to cut container churn (POC) by bm1549 · Pull Request #7235 · DataDog/system-tests

bm1549 · 2026-06-29T19:55:01Z

Motivation

Parametric tests create a fresh Docker network + test-agent + client container per test. Across 16 xdist workers × ~440 tests that's heavy, bursty container churn on the shared docker-in-docker microVM runners, which intermittently stalls container/network setup so a test-agent misses the tracer's ~3s trace-receipt poll (Number (1) of traces not available, got 0). It's a recurring dd-trace-php flake. POC requested by R&P (Slack thread) to cut that churn by reusing the test-agent.

Net result (measured in CI) — one parametric shard (php dev, 8 xdist workers, ~395 test instances):

Metric	Without pooling	With pooling	Reduction
Test-agent containers created	~395 (1/test)	44	~89%
Docker networks created	~395 (1/test)	44	~89%
…agents serving the pooled tests (359 of 395)	359	8 (1 per worker)	~98%

The 44 = 8 pooled agents (one per worker) + 36 fresh agents for non-poolable tests. Single-worker benchmark (test_otel_span_methods.py, all-poolable): 59 tests on 1 agent, 118 → 2 create/destroy cycles (~98%).

Runtime impact. Against two non-pooling PRs on the same dev/2-shard layout with identical test counts (#7232, #7234), pooled parametric shards ran ~16–47% faster (php/python/ruby) — a roughly fixed ~100s/shard saved, so the percentage scales inversely with how heavy a language's tests are. The two baselines agreed within ~2%, so it's not a low-load fluke. Caveat: shard 1 of 2, one run per PR.

Found and fixed while bringing the POC green. Three isolation issues surfaced in CI, each root-caused:

Custom-OTLP-port tests were pooled by mistake. Tests that parametrize a custom container OTLP port (e.g. ...OTLP_ENDPOINT-4320) need an agent listening on that port; the fixed-port pooled agent can't serve them, so their data never arrived (Number 1 of logs not available, got 0). Now excluded from pooling.
A bool skipif marker crashed collection for ~169 jobs. The Docker smoke test used a module-level pytest.mark.skipif(<bool>), and the root conftest's _item_must_pass does all(marker.args[0]), which raises on a bool. Replaced with an in-body pytest.skip().
clear() leaked remote-config between pooled tests. It reset traces but not the agent's RC responses; non-snapshot tests share the default RC token, so a prior test's RC corrupted the next test's no-RC baseline (broke TestDynamicConfigV1 sampling tests). clear() now resets RC to the fresh-agent {} state; RC is the only agent state outside _requests.

A latent issue also turned up, left for a standalone fix: root conftest.py's all(marker.args[0]) crashes on any plain bool skipif anywhere in the collected suite.

Changes

Reuse one test-agent + one Docker network per xdist worker, reset with /test/session/clear between tests, instead of recreating per test. Scoped to non-snapshot tests with the default agent_env (~94% of parametric tests); snapshot and non-default-agent_env tests keep the existing fresh-per-test path. The library client is unchanged. Pooled agents use a distinct host-port band (4900/5000/5100) so they don't collide with fresh-path agents (4600/4701/4802) on the same worker.

Measured on tests/parametric/test_otel_span_methods.py (single worker): 59 tests drove 1 pooled agent; agent+network create/destroy cycles dropped 118 → 2 (~98% fewer). Per-test container ops drop 3 → 1 for default-env tests (the client still churns). 3 consecutive runs were stable with no cross-test leakage. See the POC design doc (internal, AppGate) for scope, risks, and the hardening path (extend clear() leakage checks per data type, session tokens for in-flight data, per-(worker,env) ports to pool non-default-env tests).

Workflow

⚠️ Create your PR as draft ⚠️
Work on you PR until the CI passes
Mark it as ready for review
- Test logic is modified? -> Get a review from RFC owner.
- Framework is modified, or non obvious usage of it -> get a review from R&P team

🚀 Once your PR is reviewed and the CI green, you can merge it!

🛟 #apm-shared-testing 🛟

Reviewer checklist

Anything but tests/ or manifests/ is modified ? I have the approval from R&P team — this is the POC R&P (Charles) asked for in the Slack thread; utils/ is modified intentionally.
A docker base image is modified?
- the relevant build-XXX-image label is present
A scenario is added, removed or renamed?
- Get a review from R&P team

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…nt_api - Fix container leak on readiness timeout: the for...else branch in start_agent now calls cm.__exit__() in a try/finally before pytest.fail, ensuring the container is stopped even when it never becomes ready. - Extract _agent_log_path() module-level helper to deduplicate the log path construction that previously existed independently in start_agent and get_test_agent_api's finally block. - Fix ruff F541 lint: f"8126/tcp" -> "8126/tcp" (f-string with no placeholder). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ation, import order Three review fixes on the get_agent_pool/_creator code (Task 3): 1. Type `_pool_seed_request` as `pytest.FixtureRequest | None` (was bare None). 2. Add assert guard at top of `_creator` so a missing `_pool_seed_request` raises a clear message instead of an AttributeError deep in start_agent. 3. Reorder imports: move `from docker.errors import DockerException` (third-party) before the first-party `utils.*` imports (stdlib → third-party → first-party). 4. Add comment in get_agent_pool explaining the POC single-agent-per-worker invariant and why worker-keyed ports are safe. Production fresh-path behavior is unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…n metrics

…ol_seed_request global The pooled test_agent fixture was mutating scenarios.parametric._pool_seed_request (a process-wide singleton) so the WorkerAgentPool creator closure could read the current pytest request. This relied on synchronous timing and was a fragile global. Fix by threading `request` through the creator call: - WorkerAgentPool.acquire() passes request to creator on cache miss - Creator type hint updated to Callable[[Any, dict[str, str]], AgentLease] - _docker_fixtures.py _creator signature changed to (request, agent_env) - _pool_seed_request attribute and assertion guard removed from DockerFixturesScenario - conftest.py drops the _pool_seed_request mutation and redundant return after yield Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…events fresh-path fall-through/port collision)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…collision with fresh-path agents When a pooled test-agent (persistent for the whole worker session) and a fresh-path test-agent ran on the same xdist worker, they both computed host ports from get_host_port(worker_id, 4600/4701/4802), causing docker run to fail with "port is already allocated". Fix: add agent_port_base/otlp_http_port_base/otlp_grpc_port_base keyword params to TestAgentFactory.start_agent (defaulting to 4600/4701/4802 so the fresh-path callers are unchanged). Pooled-agent _creator passes agent_port_base=4900, otlp_http_port_base=5000, otlp_grpc_port_base=5100 so pooled agents never occupy the same host ports as fresh-path agents on the same worker. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ectness - FIX 1: clear() raises on HTTP errors via raise_for_status() on both GETs - FIX 2: acquire() is now the single clear point (clears on every acquire, rebind only on reuse); remove redundant clear() from conftest pooled branch; update test assertion to clear_calls==2, rebind_calls==["req2"] - FIX 3: shutdown() isolates per-lease stop() failures with try/except so one bad lease doesn't abort cleanup of remaining leases - FIX 4a: wrap cm.__enter__() to close log_file on failure (no leak) - FIX 4b: wrap start_agent() in _creator to remove network on failure - FIX 5: add return type annotation to test_agent_pool fixture; annotate _creator request param as pytest.FixtureRequest - FIX 6: reword port-band comment to state real ~97-worker bound explicitly Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…by design); enforce only the trace clear

github-actions · 2026-06-29T19:55:26Z

CODEOWNERS have been resolved as:

tests/test_the_test/test_start_stop_agent.py                            @DataDog/system-tests-core
tests/test_the_test/test_test_agent_pool.py                             @DataDog/system-tests-core
utils/docker_fixtures/_test_agent_pool.py                               @DataDog/system-tests-core
tests/parametric/conftest.py                                            @DataDog/system-tests-core @DataDog/apm-sdk-capabilities
utils/_context/_scenarios/_docker_fixtures.py                           @DataDog/system-tests-core
utils/docker_fixtures/_test_agent.py                                    @DataDog/system-tests-core

datadog-official · 2026-06-29T19:56:44Z

Tests

🎉 All green!

🧪 All tests passed
❄️ No new flaky tests detected

_{This comment will be updated automatically if new data arrives.

🔗 Commit SHA: d50bfd7 | Docs | Datadog PR Page | Give us feedback!}

…ath lines)

…og on network cleanup) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… attr-defined on fakes)

Tests in test_otel_logs.py / test_otel_metrics.py parametrize the test_agent_otlp_http_port / test_agent_otlp_grpc_port fixtures to custom values so the agent container listens on a non-default OTLP port. The pooled agent is created once per worker with fixed container OTLP ports (4318/4317), so a pooled custom-port test sends OTLP to a port the agent isn't listening on -> "Number 1 of logs not available, got 0". Extend the poolable predicate to require the resolved OTLP ports equal the pooled defaults; custom-port tests fall back to the fresh-per-test path. Consolidate the default OTLP ports into DEFAULT_OTLP_{HTTP,GRPC}_PORT in _test_agent.py so the pool creator and the poolable check share one source of truth and cannot drift apart. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@scenarios

… marker The module-level pytest.mark.skipif(<bool>, ...) crashed collection for every session that collects tests/test_the_test/: the root conftest's _item_must_pass does all(marker.args[0]) over skipif markers, which raises "TypeError: 'bool' object is not iterable" on a bool condition. system-tests gates via @features/@scenarios rather than raw skipif, so this was the only bool-arg skipif in the suite and broke ~169 end-to-end jobs at collection. Gate the Docker-backed test with an in-body pytest.skip() instead, and use the shared DEFAULT_OTLP_{HTTP,GRPC}_PORT constants. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

/test/session/clear drops recorded requests but not the remote-config the agent serves on /v0.7/config (RemoteConfigServer._responses). Non-snapshot parametric tests all use the default (token-less) RC slot, so on a pooled agent a prior test that sets RC leaks into the next test's no-RC baseline -- e.g. TestDynamicConfigV1::test_trace_sampling_rate_override_* assert the default sample rate on the first trace before applying their own RC, and saw the leaked rate instead. Reset the RC response to empty {} in clear() (the same state a fresh agent serves). RC is the only persistent agent state outside _requests; traces, telemetry, and tested-integrations all flow through _requests and are already cleared by /test/session/clear. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1957508062

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-30T17:47:09Z

+        # dynamic-config sampling tests assert the default sample rate before applying
+        # RC). Posting an empty config restores the same {} state a fresh agent has.
+        # Safety-critical for reuse, so surface a bad status.
+        self._session.post(self._url("/test/session/responses/config"), json={}).raise_for_status()


Keep remote config out of generic clear()

When a test sets remote config and then uses an existing helper with clear=True (for example wait_for_telemetry_event(..., clear=True) or wait_for_rc_apply_state(..., clear=True) in tests/parametric/test_dynamic_configuration.py), this new generic clear() step replaces the served RC payload with {}. That changes the semantics from “clear recorded requests” to “remove the active config”, so the tracer can observe the empty config on its next RC poll before the test asserts behavior that depends on the just-applied config. The pooled-agent reset needs a separate RC reset path rather than changing every mid-test clear=True call.

Useful? React with 👍 / 👎.

The design doc is hosted internally (chonk) and linked from the PR description instead of living in the repo. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

clear() is also called mid-test by the clear=True helpers (traces, telemetry, wait_for_rc_apply_state, ...). Resetting RC inside clear() wiped a test's active config mid-test, so the tracer could poll the empty config before the test asserted RC-dependent behavior (race; passed CI only by timing). Move the RC reset to a dedicated reset_remote_config() called only on the pooled agent's reuse path (WorkerAgentPool.acquire), between tests. clear() goes back to clearing recorded requests only. Pooled tests still get a clean RC baseline; no mid-test clear=True call touches the served config. Addresses Codex review on PR #7235. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

bm1549 and others added 16 commits June 29, 2026 12:51

docs(parametric): add test-agent reuse POC plan

ce3f55a

feat(parametric): add WorkerAgentPool for test-agent reuse

a8c2b82

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

refactor(parametric): extract start_agent/stop_agent + rebind_request

d63db0e

feat(parametric): pooled-agent creator (worker-scoped network + agent)

d329b36

docs(parametric): scope plan to default-env pooling (Option B) + chur…

7ca986d

…n metrics

feat(parametric): use WorkerAgentPool in test_agent fixture

185319c

fix(parametric): restore early return in pooled test_agent branch (pr…

902231d

…events fresh-path fall-through/port collision)

test(parametric): log agent container creations for reuse verification

84c8485

docs(parametric): test-agent reuse POC handoff note

e690585

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore(parametric): drop internal SDD scratch + plan doc from POC branch

d9b4b48

fix(parametric): keep OTLP session-clear best-effort (it returns 400 …

4318912

…by design); enforce only the trace clear

bm1549 added the ai-generated The pull request includes a significant amount of AI-generated code label Jun 29, 2026

bm1549 and others added 6 commits June 29, 2026 15:56

style(parametric): ruff format _test_agent.py (collapse wrapped log_p…

6d50787

…ath lines)

fix(parametric): satisfy ruff check (annotations, drop unused noqa, l…

2d8cb67

…og on network cleanup) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(parametric): use Any+noqa for duck-typed pool acquire (fixes mypy…

f4f2c34

… attr-defined on fakes)

bm1549 marked this pull request as ready for review June 30, 2026 17:44

bm1549 requested review from a team as code owners June 30, 2026 17:44

bm1549 requested review from anna-git and removed request for a team June 30, 2026 17:44

chatgpt-codex-connector Bot reviewed Jun 30, 2026

View reviewed changes

bm1549 and others added 2 commits June 30, 2026 13:48

docs: move parametric agent-reuse POC writeup off the PR

73e0de7

The design doc is hosted internally (chonk) and linked from the PR description instead of living in the repo. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[parametric] reuse the test-agent per xdist worker to cut container churn (POC)#7235

[parametric] reuse the test-agent per xdist worker to cut container churn (POC)#7235
bm1549 wants to merge 24 commits into
mainfrom
brian.marks/parametric-agent-reuse

bm1549 commented Jun 29, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 29, 2026 •

edited

Loading

Uh oh!

datadog-official Bot commented Jun 29, 2026 •

edited by datadog-datadog-prod-us1-2 Bot

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

bm1549 commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Changes

Workflow

Reviewer checklist

Uh oh!

github-actions Bot commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

datadog-official Bot commented Jun 29, 2026 • edited by datadog-datadog-prod-us1-2 Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bm1549 commented Jun 29, 2026 •

edited

Loading

github-actions Bot commented Jun 29, 2026 •

edited

Loading

datadog-official Bot commented Jun 29, 2026 •

edited by datadog-datadog-prod-us1-2 Bot

Loading