[parametric] reuse the test-agent per xdist worker to cut container churn (POC)#7235
[parametric] reuse the test-agent per xdist worker to cut container churn (POC)#7235bm1549 wants to merge 24 commits into
Conversation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nt_api - Fix container leak on readiness timeout: the for...else branch in start_agent now calls cm.__exit__() in a try/finally before pytest.fail, ensuring the container is stopped even when it never becomes ready. - Extract _agent_log_path() module-level helper to deduplicate the log path construction that previously existed independently in start_agent and get_test_agent_api's finally block. - Fix ruff F541 lint: f"8126/tcp" -> "8126/tcp" (f-string with no placeholder). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ation, import order Three review fixes on the get_agent_pool/_creator code (Task 3): 1. Type `_pool_seed_request` as `pytest.FixtureRequest | None` (was bare None). 2. Add assert guard at top of `_creator` so a missing `_pool_seed_request` raises a clear message instead of an AttributeError deep in start_agent. 3. Reorder imports: move `from docker.errors import DockerException` (third-party) before the first-party `utils.*` imports (stdlib → third-party → first-party). 4. Add comment in get_agent_pool explaining the POC single-agent-per-worker invariant and why worker-keyed ports are safe. Production fresh-path behavior is unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ol_seed_request global The pooled test_agent fixture was mutating scenarios.parametric._pool_seed_request (a process-wide singleton) so the WorkerAgentPool creator closure could read the current pytest request. This relied on synchronous timing and was a fragile global. Fix by threading `request` through the creator call: - WorkerAgentPool.acquire() passes request to creator on cache miss - Creator type hint updated to Callable[[Any, dict[str, str]], AgentLease] - _docker_fixtures.py _creator signature changed to (request, agent_env) - _pool_seed_request attribute and assertion guard removed from DockerFixturesScenario - conftest.py drops the _pool_seed_request mutation and redundant return after yield Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…events fresh-path fall-through/port collision)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…collision with fresh-path agents When a pooled test-agent (persistent for the whole worker session) and a fresh-path test-agent ran on the same xdist worker, they both computed host ports from get_host_port(worker_id, 4600/4701/4802), causing docker run to fail with "port is already allocated". Fix: add agent_port_base/otlp_http_port_base/otlp_grpc_port_base keyword params to TestAgentFactory.start_agent (defaulting to 4600/4701/4802 so the fresh-path callers are unchanged). Pooled-agent _creator passes agent_port_base=4900, otlp_http_port_base=5000, otlp_grpc_port_base=5100 so pooled agents never occupy the same host ports as fresh-path agents on the same worker. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ectness - FIX 1: clear() raises on HTTP errors via raise_for_status() on both GETs - FIX 2: acquire() is now the single clear point (clears on every acquire, rebind only on reuse); remove redundant clear() from conftest pooled branch; update test assertion to clear_calls==2, rebind_calls==["req2"] - FIX 3: shutdown() isolates per-lease stop() failures with try/except so one bad lease doesn't abort cleanup of remaining leases - FIX 4a: wrap cm.__enter__() to close log_file on failure (no leak) - FIX 4b: wrap start_agent() in _creator to remove network on failure - FIX 5: add return type annotation to test_agent_pool fixture; annotate _creator request param as pytest.FixtureRequest - FIX 6: reword port-band comment to state real ~97-worker bound explicitly Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…by design); enforce only the trace clear
|
|
🎉 All green!🧪 All tests passed 🔗 Commit SHA: d50bfd7 | Docs | Datadog PR Page | Give us feedback! |
…og on network cleanup) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… attr-defined on fakes)
Tests in test_otel_logs.py / test_otel_metrics.py parametrize the
test_agent_otlp_http_port / test_agent_otlp_grpc_port fixtures to custom
values so the agent container listens on a non-default OTLP port. The
pooled agent is created once per worker with fixed container OTLP ports
(4318/4317), so a pooled custom-port test sends OTLP to a port the agent
isn't listening on -> "Number 1 of logs not available, got 0".
Extend the poolable predicate to require the resolved OTLP ports equal the
pooled defaults; custom-port tests fall back to the fresh-per-test path.
Consolidate the default OTLP ports into DEFAULT_OTLP_{HTTP,GRPC}_PORT in
_test_agent.py so the pool creator and the poolable check share one source
of truth and cannot drift apart.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… marker The module-level pytest.mark.skipif(<bool>, ...) crashed collection for every session that collects tests/test_the_test/: the root conftest's _item_must_pass does all(marker.args[0]) over skipif markers, which raises "TypeError: 'bool' object is not iterable" on a bool condition. system-tests gates via @features/@scenarios rather than raw skipif, so this was the only bool-arg skipif in the suite and broke ~169 end-to-end jobs at collection. Gate the Docker-backed test with an in-body pytest.skip() instead, and use the shared DEFAULT_OTLP_{HTTP,GRPC}_PORT constants. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
/test/session/clear drops recorded requests but not the remote-config the
agent serves on /v0.7/config (RemoteConfigServer._responses). Non-snapshot
parametric tests all use the default (token-less) RC slot, so on a pooled
agent a prior test that sets RC leaks into the next test's no-RC baseline --
e.g. TestDynamicConfigV1::test_trace_sampling_rate_override_* assert the
default sample rate on the first trace before applying their own RC, and saw
the leaked rate instead.
Reset the RC response to empty {} in clear() (the same state a fresh agent
serves). RC is the only persistent agent state outside _requests; traces,
telemetry, and tested-integrations all flow through _requests and are already
cleared by /test/session/clear.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1957508062
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # dynamic-config sampling tests assert the default sample rate before applying | ||
| # RC). Posting an empty config restores the same {} state a fresh agent has. | ||
| # Safety-critical for reuse, so surface a bad status. | ||
| self._session.post(self._url("/test/session/responses/config"), json={}).raise_for_status() |
There was a problem hiding this comment.
Keep remote config out of generic clear()
When a test sets remote config and then uses an existing helper with clear=True (for example wait_for_telemetry_event(..., clear=True) or wait_for_rc_apply_state(..., clear=True) in tests/parametric/test_dynamic_configuration.py), this new generic clear() step replaces the served RC payload with {}. That changes the semantics from “clear recorded requests” to “remove the active config”, so the tracer can observe the empty config on its next RC poll before the test asserts behavior that depends on the just-applied config. The pooled-agent reset needs a separate RC reset path rather than changing every mid-test clear=True call.
Useful? React with 👍 / 👎.
The design doc is hosted internally (chonk) and linked from the PR description instead of living in the repo. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
clear() is also called mid-test by the clear=True helpers (traces, telemetry, wait_for_rc_apply_state, ...). Resetting RC inside clear() wiped a test's active config mid-test, so the tracer could poll the empty config before the test asserted RC-dependent behavior (race; passed CI only by timing). Move the RC reset to a dedicated reset_remote_config() called only on the pooled agent's reuse path (WorkerAgentPool.acquire), between tests. clear() goes back to clearing recorded requests only. Pooled tests still get a clean RC baseline; no mid-test clear=True call touches the served config. Addresses Codex review on PR #7235. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Motivation
Parametric tests create a fresh Docker network + test-agent + client container per test. Across 16 xdist workers × ~440 tests that's heavy, bursty container churn on the shared docker-in-docker microVM runners, which intermittently stalls container/network setup so a test-agent misses the tracer's ~3s trace-receipt poll (
Number (1) of traces not available, got 0). It's a recurring dd-trace-php flake. POC requested by R&P (Slack thread) to cut that churn by reusing the test-agent.Net result (measured in CI) — one parametric shard (php dev, 8 xdist workers, ~395 test instances):
The 44 = 8 pooled agents (one per worker) + 36 fresh agents for non-poolable tests. Single-worker benchmark (
test_otel_span_methods.py, all-poolable): 59 tests on 1 agent, 118 → 2 create/destroy cycles (~98%).Runtime impact. Against two non-pooling PRs on the same dev/2-shard layout with identical test counts (#7232, #7234), pooled parametric shards ran ~16–47% faster (php/python/ruby) — a roughly fixed ~100s/shard saved, so the percentage scales inversely with how heavy a language's tests are. The two baselines agreed within ~2%, so it's not a low-load fluke. Caveat: shard 1 of 2, one run per PR.
Found and fixed while bringing the POC green. Three isolation issues surfaced in CI, each root-caused:
...OTLP_ENDPOINT-4320) need an agent listening on that port; the fixed-port pooled agent can't serve them, so their data never arrived (Number 1 of logs not available, got 0). Now excluded from pooling.skipifmarker crashed collection for ~169 jobs. The Docker smoke test used a module-levelpytest.mark.skipif(<bool>), and the root conftest's_item_must_passdoesall(marker.args[0]), which raises on a bool. Replaced with an in-bodypytest.skip().clear()leaked remote-config between pooled tests. It reset traces but not the agent's RC responses; non-snapshot tests share the default RC token, so a prior test's RC corrupted the next test's no-RC baseline (brokeTestDynamicConfigV1sampling tests).clear()now resets RC to the fresh-agent{}state; RC is the only agent state outside_requests.A latent issue also turned up, left for a standalone fix: root
conftest.py'sall(marker.args[0])crashes on any plain boolskipifanywhere in the collected suite.Changes
Reuse one test-agent + one Docker network per xdist worker, reset with
/test/session/clearbetween tests, instead of recreating per test. Scoped to non-snapshot tests with the defaultagent_env(~94% of parametric tests); snapshot and non-default-agent_envtests keep the existing fresh-per-test path. The library client is unchanged. Pooled agents use a distinct host-port band (4900/5000/5100) so they don't collide with fresh-path agents (4600/4701/4802) on the same worker.Measured on
tests/parametric/test_otel_span_methods.py(single worker): 59 tests drove 1 pooled agent; agent+network create/destroy cycles dropped 118 → 2 (~98% fewer). Per-test container ops drop 3 → 1 for default-env tests (the client still churns). 3 consecutive runs were stable with no cross-test leakage. See the POC design doc (internal, AppGate) for scope, risks, and the hardening path (extendclear()leakage checks per data type, session tokens for in-flight data, per-(worker,env) ports to pool non-default-env tests).Workflow
🚀 Once your PR is reviewed and the CI green, you can merge it!
🛟 #apm-shared-testing 🛟
Reviewer checklist
tests/ormanifests/is modified ? I have the approval from R&P team — this is the POC R&P (Charles) asked for in the Slack thread;utils/is modified intentionally.build-XXX-imagelabel is present🤖 Generated with Claude Code