Skip to content

[parametric] reuse the test-agent per xdist worker to cut container churn (POC)#7235

Open
bm1549 wants to merge 24 commits into
mainfrom
brian.marks/parametric-agent-reuse
Open

[parametric] reuse the test-agent per xdist worker to cut container churn (POC)#7235
bm1549 wants to merge 24 commits into
mainfrom
brian.marks/parametric-agent-reuse

Conversation

@bm1549

@bm1549 bm1549 commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Motivation

Parametric tests create a fresh Docker network + test-agent + client container per test. Across 16 xdist workers × ~440 tests that's heavy, bursty container churn on the shared docker-in-docker microVM runners, which intermittently stalls container/network setup so a test-agent misses the tracer's ~3s trace-receipt poll (Number (1) of traces not available, got 0). It's a recurring dd-trace-php flake. POC requested by R&P (Slack thread) to cut that churn by reusing the test-agent.

Net result (measured in CI) — one parametric shard (php dev, 8 xdist workers, ~395 test instances):

Metric Without pooling With pooling Reduction
Test-agent containers created ~395 (1/test) 44 ~89%
Docker networks created ~395 (1/test) 44 ~89%
…agents serving the pooled tests (359 of 395) 359 8 (1 per worker) ~98%

The 44 = 8 pooled agents (one per worker) + 36 fresh agents for non-poolable tests. Single-worker benchmark (test_otel_span_methods.py, all-poolable): 59 tests on 1 agent, 118 → 2 create/destroy cycles (~98%).

Runtime impact. Against two non-pooling PRs on the same dev/2-shard layout with identical test counts (#7232, #7234), pooled parametric shards ran ~16–47% faster (php/python/ruby) — a roughly fixed ~100s/shard saved, so the percentage scales inversely with how heavy a language's tests are. The two baselines agreed within ~2%, so it's not a low-load fluke. Caveat: shard 1 of 2, one run per PR.

Found and fixed while bringing the POC green. Three isolation issues surfaced in CI, each root-caused:

  1. Custom-OTLP-port tests were pooled by mistake. Tests that parametrize a custom container OTLP port (e.g. ...OTLP_ENDPOINT-4320) need an agent listening on that port; the fixed-port pooled agent can't serve them, so their data never arrived (Number 1 of logs not available, got 0). Now excluded from pooling.
  2. A bool skipif marker crashed collection for ~169 jobs. The Docker smoke test used a module-level pytest.mark.skipif(<bool>), and the root conftest's _item_must_pass does all(marker.args[0]), which raises on a bool. Replaced with an in-body pytest.skip().
  3. clear() leaked remote-config between pooled tests. It reset traces but not the agent's RC responses; non-snapshot tests share the default RC token, so a prior test's RC corrupted the next test's no-RC baseline (broke TestDynamicConfigV1 sampling tests). clear() now resets RC to the fresh-agent {} state; RC is the only agent state outside _requests.

A latent issue also turned up, left for a standalone fix: root conftest.py's all(marker.args[0]) crashes on any plain bool skipif anywhere in the collected suite.

Changes

Reuse one test-agent + one Docker network per xdist worker, reset with /test/session/clear between tests, instead of recreating per test. Scoped to non-snapshot tests with the default agent_env (~94% of parametric tests); snapshot and non-default-agent_env tests keep the existing fresh-per-test path. The library client is unchanged. Pooled agents use a distinct host-port band (4900/5000/5100) so they don't collide with fresh-path agents (4600/4701/4802) on the same worker.

Measured on tests/parametric/test_otel_span_methods.py (single worker): 59 tests drove 1 pooled agent; agent+network create/destroy cycles dropped 118 → 2 (~98% fewer). Per-test container ops drop 3 → 1 for default-env tests (the client still churns). 3 consecutive runs were stable with no cross-test leakage. See the POC design doc (internal, AppGate) for scope, risks, and the hardening path (extend clear() leakage checks per data type, session tokens for in-flight data, per-(worker,env) ports to pool non-default-env tests).

Workflow

  1. ⚠️ Create your PR as draft ⚠️
  2. Work on you PR until the CI passes
  3. Mark it as ready for review
    • Test logic is modified? -> Get a review from RFC owner.
    • Framework is modified, or non obvious usage of it -> get a review from R&P team

🚀 Once your PR is reviewed and the CI green, you can merge it!

🛟 #apm-shared-testing 🛟

Reviewer checklist

  • Anything but tests/ or manifests/ is modified ? I have the approval from R&P team — this is the POC R&P (Charles) asked for in the Slack thread; utils/ is modified intentionally.
  • A docker base image is modified?
    • the relevant build-XXX-image label is present
  • A scenario is added, removed or renamed?

🤖 Generated with Claude Code

bm1549 and others added 16 commits June 29, 2026 12:51
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nt_api

- Fix container leak on readiness timeout: the for...else branch in
  start_agent now calls cm.__exit__() in a try/finally before pytest.fail,
  ensuring the container is stopped even when it never becomes ready.
- Extract _agent_log_path() module-level helper to deduplicate the log
  path construction that previously existed independently in start_agent
  and get_test_agent_api's finally block.
- Fix ruff F541 lint: f"8126/tcp" -> "8126/tcp" (f-string with no placeholder).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ation, import order

Three review fixes on the get_agent_pool/_creator code (Task 3):

1. Type `_pool_seed_request` as `pytest.FixtureRequest | None` (was bare None).
2. Add assert guard at top of `_creator` so a missing `_pool_seed_request` raises
   a clear message instead of an AttributeError deep in start_agent.
3. Reorder imports: move `from docker.errors import DockerException` (third-party)
   before the first-party `utils.*` imports (stdlib → third-party → first-party).
4. Add comment in get_agent_pool explaining the POC single-agent-per-worker
   invariant and why worker-keyed ports are safe.

Production fresh-path behavior is unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ol_seed_request global

The pooled test_agent fixture was mutating scenarios.parametric._pool_seed_request
(a process-wide singleton) so the WorkerAgentPool creator closure could read the
current pytest request. This relied on synchronous timing and was a fragile global.

Fix by threading `request` through the creator call:
- WorkerAgentPool.acquire() passes request to creator on cache miss
- Creator type hint updated to Callable[[Any, dict[str, str]], AgentLease]
- _docker_fixtures.py _creator signature changed to (request, agent_env)
- _pool_seed_request attribute and assertion guard removed from DockerFixturesScenario
- conftest.py drops the _pool_seed_request mutation and redundant return after yield

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…events fresh-path fall-through/port collision)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…collision with fresh-path agents

When a pooled test-agent (persistent for the whole worker session) and a
fresh-path test-agent ran on the same xdist worker, they both computed host
ports from get_host_port(worker_id, 4600/4701/4802), causing docker run to
fail with "port is already allocated".

Fix: add agent_port_base/otlp_http_port_base/otlp_grpc_port_base keyword
params to TestAgentFactory.start_agent (defaulting to 4600/4701/4802 so the
fresh-path callers are unchanged). Pooled-agent _creator passes
agent_port_base=4900, otlp_http_port_base=5000, otlp_grpc_port_base=5100 so
pooled agents never occupy the same host ports as fresh-path agents on the
same worker.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ectness

- FIX 1: clear() raises on HTTP errors via raise_for_status() on both GETs
- FIX 2: acquire() is now the single clear point (clears on every acquire,
  rebind only on reuse); remove redundant clear() from conftest pooled branch;
  update test assertion to clear_calls==2, rebind_calls==["req2"]
- FIX 3: shutdown() isolates per-lease stop() failures with try/except so one
  bad lease doesn't abort cleanup of remaining leases
- FIX 4a: wrap cm.__enter__() to close log_file on failure (no leak)
- FIX 4b: wrap start_agent() in _creator to remove network on failure
- FIX 5: add return type annotation to test_agent_pool fixture; annotate
  _creator request param as pytest.FixtureRequest
- FIX 6: reword port-band comment to state real ~97-worker bound explicitly

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bm1549 bm1549 added the ai-generated The pull request includes a significant amount of AI-generated code label Jun 29, 2026
@github-actions

github-actions Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

CODEOWNERS have been resolved as:

tests/test_the_test/test_start_stop_agent.py                            @DataDog/system-tests-core
tests/test_the_test/test_test_agent_pool.py                             @DataDog/system-tests-core
utils/docker_fixtures/_test_agent_pool.py                               @DataDog/system-tests-core
tests/parametric/conftest.py                                            @DataDog/system-tests-core @DataDog/apm-sdk-capabilities
utils/_context/_scenarios/_docker_fixtures.py                           @DataDog/system-tests-core
utils/docker_fixtures/_test_agent.py                                    @DataDog/system-tests-core

@datadog-official

datadog-official Bot commented Jun 29, 2026

Copy link
Copy Markdown

Tests

🎉 All green!

🧪 All tests passed
❄️ No new flaky tests detected

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: d50bfd7 | Docs | Datadog PR Page | Give us feedback!

bm1549 and others added 6 commits June 29, 2026 15:56
…og on network cleanup)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests in test_otel_logs.py / test_otel_metrics.py parametrize the
test_agent_otlp_http_port / test_agent_otlp_grpc_port fixtures to custom
values so the agent container listens on a non-default OTLP port. The
pooled agent is created once per worker with fixed container OTLP ports
(4318/4317), so a pooled custom-port test sends OTLP to a port the agent
isn't listening on -> "Number 1 of logs not available, got 0".

Extend the poolable predicate to require the resolved OTLP ports equal the
pooled defaults; custom-port tests fall back to the fresh-per-test path.
Consolidate the default OTLP ports into DEFAULT_OTLP_{HTTP,GRPC}_PORT in
_test_agent.py so the pool creator and the poolable check share one source
of truth and cannot drift apart.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… marker

The module-level pytest.mark.skipif(<bool>, ...) crashed collection for every
session that collects tests/test_the_test/: the root conftest's _item_must_pass
does all(marker.args[0]) over skipif markers, which raises
"TypeError: 'bool' object is not iterable" on a bool condition. system-tests
gates via @features/@scenarios rather than raw skipif, so this was the only
bool-arg skipif in the suite and broke ~169 end-to-end jobs at collection.

Gate the Docker-backed test with an in-body pytest.skip() instead, and use the
shared DEFAULT_OTLP_{HTTP,GRPC}_PORT constants.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
/test/session/clear drops recorded requests but not the remote-config the
agent serves on /v0.7/config (RemoteConfigServer._responses). Non-snapshot
parametric tests all use the default (token-less) RC slot, so on a pooled
agent a prior test that sets RC leaks into the next test's no-RC baseline --
e.g. TestDynamicConfigV1::test_trace_sampling_rate_override_* assert the
default sample rate on the first trace before applying their own RC, and saw
the leaked rate instead.

Reset the RC response to empty {} in clear() (the same state a fresh agent
serves). RC is the only persistent agent state outside _requests; traces,
telemetry, and tested-integrations all flow through _requests and are already
cleared by /test/session/clear.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bm1549 bm1549 marked this pull request as ready for review June 30, 2026 17:44
@bm1549 bm1549 requested review from a team as code owners June 30, 2026 17:44
@bm1549 bm1549 requested review from anna-git and removed request for a team June 30, 2026 17:44

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1957508062

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread utils/docker_fixtures/_test_agent.py Outdated
# dynamic-config sampling tests assert the default sample rate before applying
# RC). Posting an empty config restores the same {} state a fresh agent has.
# Safety-critical for reuse, so surface a bad status.
self._session.post(self._url("/test/session/responses/config"), json={}).raise_for_status()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep remote config out of generic clear()

When a test sets remote config and then uses an existing helper with clear=True (for example wait_for_telemetry_event(..., clear=True) or wait_for_rc_apply_state(..., clear=True) in tests/parametric/test_dynamic_configuration.py), this new generic clear() step replaces the served RC payload with {}. That changes the semantics from “clear recorded requests” to “remove the active config”, so the tracer can observe the empty config on its next RC poll before the test asserts behavior that depends on the just-applied config. The pooled-agent reset needs a separate RC reset path rather than changing every mid-test clear=True call.

Useful? React with 👍 / 👎.

bm1549 and others added 2 commits June 30, 2026 13:48
The design doc is hosted internally (chonk) and linked from the PR
description instead of living in the repo.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
clear() is also called mid-test by the clear=True helpers (traces, telemetry,
wait_for_rc_apply_state, ...). Resetting RC inside clear() wiped a test's active
config mid-test, so the tracer could poll the empty config before the test
asserted RC-dependent behavior (race; passed CI only by timing).

Move the RC reset to a dedicated reset_remote_config() called only on the pooled
agent's reuse path (WorkerAgentPool.acquire), between tests. clear() goes back to
clearing recorded requests only. Pooled tests still get a clean RC baseline; no
mid-test clear=True call touches the served config.

Addresses Codex review on PR #7235.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai-generated The pull request includes a significant amount of AI-generated code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant