ci: optimize self-hosted KubeVirt runners and CI pipeline #787
mangelajo wants to merge 29 commits into
Switch E2E and pytest amd64/Linux jobs from ubuntu-24.04 to arc-runner-kubevirt, running on self-hosted KubeVirt VMs on the beast cluster. ARM64 and macOS jobs are unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CodeRabbit walkthrough: This PR updates GitHub Actions workflow runner configurations across the E2E and Python test pipelines.
kirkbrauer left a comment
This looks good to me, hopefully we get quite a speedup here!
I am experimenting, @kirkbrauer. The operator I am trying for GitHub Actions on K8s + KubeVirt seems to be a bit slow to rotate VMs and grab jobs. Unless we can improve that, we will have to look for an alternative option.
Enable stderr capture for the DUT network exporter in e2e tests. The exporter was crashing with exit code 1 but its stderr was discarded, making it impossible to diagnose the failure.

Changes:
- Enable captureStderr for the DUT network exporter in BeforeAll
- Use port-based log names to avoid collisions between exporters
- Add DumpLogs in AfterAll so errors appear in test output

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On Ubuntu, nft, dnsmasq, and sysctl live in /usr/sbin which is not in PATH for non-root users. The runtime commands work fine because they go through sudo, but the shutil.which() startup check fails. Extend the search path to include /usr/sbin and /sbin. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
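A minimal sketch of the extended lookup described in this commit, assuming a helper called `find_tool` and a constant `EXTRA_SBIN_DIRS` (both names are hypothetical; the driver's real identifiers are not shown in this conversation). It relies only on the `path` argument that `shutil.which` already accepts:

```python
import os
import shutil

# Hypothetical constant: directories where Ubuntu keeps admin tools.
EXTRA_SBIN_DIRS = ["/usr/sbin", "/sbin"]


def build_search_path(env_path=None):
    """Return PATH with the sbin directories appended, without duplicates."""
    if env_path is None:
        env_path = os.environ.get("PATH", "")
    parts = [p for p in env_path.split(os.pathsep) if p]
    for extra in EXTRA_SBIN_DIRS:
        if extra not in parts:
            parts.append(extra)
    return os.pathsep.join(parts)


def find_tool(name, env_path=None):
    """Look up `name` on the extended path; returns None when it is missing."""
    return shutil.which(name, path=build_search_path(env_path))
```

Appending rather than prepending keeps any user-provided override earlier in `PATH` in effect.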
The sysctl binary lives in /usr/sbin which isn't in PATH for non-root users on Ubuntu. The read-only sysctl call (without sudo) fails with FileNotFoundError. Resolve the full path at call time using the same /usr/sbin search path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The which side_effect functions need to accept **kwargs since the driver now passes path= to shutil.which(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
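To illustrate why the signature matters, here is a small sketch with a hypothetical `check_tools` startup check and a hypothetical `fake_which` stand-in (neither is the project's actual test code). A `side_effect` that only accepted `name` would raise `TypeError` as soon as the caller passes `path=`:

```python
import shutil
from unittest import mock


def fake_which(name, **kwargs):
    """Stand-in for shutil.which that tolerates keyword args such as path=."""
    known = {"nft": "/usr/sbin/nft", "dnsmasq": "/usr/sbin/dnsmasq"}
    return known.get(name)


def check_tools(names, search_path):
    """Hypothetical startup check: return the tools that cannot be found."""
    return [n for n in names if shutil.which(n, path=search_path) is None]


with mock.patch("shutil.which", side_effect=fake_which):
    missing = check_tools(["nft", "dnsmasq", "sysctl"], "/usr/bin:/usr/sbin")
```

With this stand-in, only `sysctl` is reported as missing, regardless of what is installed on the build host.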
The sysctl calls now resolve to full path via _resolve_tool. Mock it in tests so assertions match the bare command name. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All tests that call get_interface_forwarding or set_interface_forwarding now mock _resolve_tool so they don't depend on whether sysctl exists on the build host. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On kubevirt runners with pre-baked images, Renode is already present. Skip the 200MB download and install when the binary is available. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
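The skip logic can be sketched roughly as follows, assuming the binary is discoverable on `PATH` under the name `renode` (an assumption; the actual check used in the workflow is not shown here). `ensure_renode` and its `install` callback are hypothetical names:

```python
import shutil


def needs_renode_install(which=shutil.which):
    """Return True only when no `renode` binary is found on PATH."""
    return which("renode") is None


def ensure_renode(install, which=shutil.which):
    """Call `install()` only if Renode is missing; report what happened."""
    if needs_renode_install(which):
        install()
        return "installed"
    return "skipped"
```

Passing `which` as a parameter keeps the decision testable without touching the host.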
Add pkg-test-all-parallel target that runs pkg-test-all with -j6 --output-sync so per-package test output is buffered and not interleaved. Use it in the CI workflow. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
macOS ships BSD make which doesn't support --output-sync. Check for GNU Make 4+ before using parallel flags. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
macOS BSD make supports -j but not --output-sync. Use -j6 on both platforms so macOS also benefits from parallel execution. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Build jobs and unit tests use arc-runner-kubevirt-small (3Gi). E2E and compat tests use arc-runner-kubevirt-large (16Gi). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use -p no:logging to suppress all log output in CI, preventing DEBUG/INFO/WARNING/ERROR noise from polluting test logs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
400dc29 to a240d4e
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevents DEBUG/INFO/WARNING/ERROR log output from polluting test output while keeping the logging plugin active for caplog tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Check for each binary/package before installing. On golden images with everything pre-installed, this skips apt-get entirely. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
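As a rough illustration of the per-binary check, the sketch below computes which packages still need installing and builds no command at all when everything is present. The `REQUIRED` mapping is illustrative only and is not the workflow's actual dependency list:

```python
import shutil

# Illustrative mapping of required binary -> apt package providing it.
REQUIRED = {"nft": "nftables", "dnsmasq": "dnsmasq", "cpio": "cpio"}


def missing_packages(required, which=shutil.which):
    """Return the sorted packages whose binary is not found on PATH."""
    return sorted(pkg for binary, pkg in required.items() if which(binary) is None)


def apt_command(packages):
    """Build the install command, or None when nothing is missing."""
    if not packages:
        return None
    return ["sudo", "apt-get", "install", "-y", *packages]
```

On a golden image where every binary resolves, `apt_command` returns `None` and the install step can be skipped entirely.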
On smaller VMs under parallel test load, Renode can take longer than 10s to start up and bind its monitor port, causing DEADLINE_EXCEEDED. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
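One way to make such a startup wait explicit is to poll the monitor port until a deadline. This is a sketch only: `wait_for_port` is a hypothetical helper, and the 30 s default is a placeholder, since the new timeout value is not stated in this commit message.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=30.0, interval_s=0.2):
    """Poll until a TCP connect to (host, port) succeeds or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

Using `time.monotonic()` keeps the deadline unaffected by wall-clock adjustments on the VM.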
- Revert --log-level=CRITICAL from project pyproject.toml (keep local dev defaults clean)
- Add --log-level=CRITICAL and --log-cli-level=CRITICAL to PYTEST_ADDOPTS in CI workflow (suppresses both captured and live log output)
- Reduce make parallelism from -j6 to -j4 to ease resource pressure on smaller VMs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BSD make on macOS doesn't support --output-sync and the GNU Make version check wasn't reliably detecting it. Checking make --help for the flag directly is more portable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
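The feature probe can be sketched as below, expressed in Python rather than in the Makefile itself. The function names are hypothetical; the idea is simply to search the `make --help` output for the flag instead of parsing a version string:

```python
import subprocess


def make_help_text(make="make"):
    """Capture `make --help` output; returns '' if make is not installed."""
    try:
        proc = subprocess.run([make, "--help"], capture_output=True,
                              text=True, check=False)
    except FileNotFoundError:
        return ""
    return proc.stdout + proc.stderr


def supports_output_sync(help_text):
    """True when the given help output mentions --output-sync."""
    return "--output-sync" in help_text


def parallel_flags(help_text, jobs=4):
    """Always parallelize; add --output-sync only where make advertises it."""
    flags = [f"-j{jobs}"]
    if supports_output_sync(help_text):
        flags.append("--output-sync")
    return flags
```

Calling `parallel_flags(make_help_text())` yields the flags for whichever `make` is first on `PATH`.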
The download had no timeout, no progress logging, and raw exceptions on failure — making CI failures impossible to diagnose in interleaved parallel output. Now logs download progress, sets a 120s timeout, and reports a clear pytest.fail message on network errors. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
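A minimal sketch of a download with those three properties, assuming hypothetical names `download` and `DownloadError`. The commit reports failures through `pytest.fail`; this sketch raises a plain exception instead so it stays dependency-free. Note that the `timeout` passed to `urlopen` applies to each blocking socket operation, not to the transfer as a whole.

```python
import urllib.request


class DownloadError(RuntimeError):
    """Raised with a readable message when the download cannot complete."""


def download(url, dest, timeout_s=120, chunk=1 << 20,
             opener=urllib.request.urlopen, log=print):
    """Stream `url` to `dest`, logging progress; wrap I/O errors clearly."""
    log(f"downloading {url} (timeout {timeout_s}s)")
    done = 0
    try:
        with opener(url, timeout=timeout_s) as resp, open(dest, "wb") as out:
            while True:
                block = resp.read(chunk)
                if not block:
                    break
                out.write(block)
                done += len(block)
                log(f"  {done} bytes received")
    except OSError as exc:  # URLError and TimeoutError are OSError subclasses
        raise DownloadError(
            f"download of {url} failed after {done} bytes: {exc}") from exc
    log(f"download complete: {done} bytes")
    return done
```

In a test fixture, the `DownloadError` message could be passed to `pytest.fail` to get the behavior the commit describes.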
With --output-sync, make buffers each target's output and flushes on completion. When a test fails, the failure details (traceback, FAILURES section) get truncated or lost entirely, making CI failures impossible to diagnose. Interleaved output is noisy but at least complete. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When LOGS_DIR is set, each pkg-test-% target writes stdout/stderr to a separate log file and records failures via .failed markers (exit-code based, not grep). The test-report target prints full logs for failed packages and exits non-zero. Local dev behavior is unchanged. CI now sets LOGS_DIR, uploads all logs as artifacts (7-day retention), and uses fail-fast: false so all matrix jobs run to completion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
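The marker scheme can be sketched in Python as follows. This is an illustration of the exit-code-based approach, not the Makefile itself; the `<pkg>.log` and `<pkg>.failed` file names and both function names are assumptions:

```python
import subprocess
import sys
from pathlib import Path


def run_logged(pkg, cmd, logs_dir):
    """Run `cmd`, write combined output to <pkg>.log, mark failure by exit code."""
    logs_dir = Path(logs_dir)
    logs_dir.mkdir(parents=True, exist_ok=True)
    with open(logs_dir / f"{pkg}.log", "wb") as log:
        rc = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT).returncode
    if rc != 0:
        (logs_dir / f"{pkg}.failed").touch()
    return rc


def report_failures(logs_dir, out=sys.stdout):
    """Print full logs of failed packages; return 1 if any failed, else 0."""
    logs_dir = Path(logs_dir)
    failed = sorted(p.stem for p in logs_dir.glob("*.failed"))
    for pkg in failed:
        out.write(f"===== {pkg} FAILED =====\n")
        out.write((logs_dir / f"{pkg}.log").read_text(errors="replace"))
    return 1 if failed else 0
```

Keying on the exit code rather than grepping the log avoids false positives from tests that merely print the word "failed".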
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without this, Python buffers stdout when piped through tee, hiding print() diagnostics until the test finishes or hangs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RPM extraction hangs silently on resource-constrained VMs — add prints before and after to pinpoint where it stalls. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
rpmfile decompresses the entire zstd CPIO payload (572MB) into memory to extract a 1.6MB file. On 4Gi VMs running parallel tests this triggers the OOM killer. Use rpm2cpio + cpio on Linux which streams without buffering; fall back to rpmfile on macOS. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
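A rough sketch of the platform switch and the streaming pipeline, under the assumption that `rpm2cpio` and `cpio` are both on `PATH`. The function names are hypothetical, and the `rpmfile` fallback path is left out because its call site is not shown in this commit message:

```python
import platform
import shutil
import subprocess


def pick_extractor(system=None, which=shutil.which):
    """Prefer the rpm2cpio | cpio pipeline on Linux when both tools exist."""
    system = system or platform.system()
    if system == "Linux" and which("rpm2cpio") and which("cpio"):
        return "rpm2cpio"
    return "rpmfile"


def extract_with_rpm2cpio(rpm_path, member_glob, dest_dir):
    """Pipe the payload into cpio so it is never held fully in memory.

    `rpm_path` should be absolute, because cpio runs with cwd=dest_dir.
    """
    with subprocess.Popen(["rpm2cpio", str(rpm_path)],
                          stdout=subprocess.PIPE) as producer:
        # -i extract, -d create leading directories, -m keep modification times
        subprocess.run(["cpio", "-idm", member_glob],
                       stdin=producer.stdout, cwd=dest_dir, check=True)
    if producer.returncode != 0:
        raise RuntimeError(f"rpm2cpio exited with {producer.returncode}")
```

Only the files matching `member_glob` are written to disk; the rest of the archive is read and discarded as it streams past.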
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…).st_size Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- … `arc-runner-kubevirt-small`, E2E test jobs to `arc-runner-kubevirt-large`
- … `make test -j4 --output-sync` on GNU Make, `-j4` on BSD)
- … `--log-level=CRITICAL --log-cli-level=CRITICAL` via `PYTEST_ADDOPTS` (project defaults unchanged for local dev)
- … `apt-get install` when CI dependencies are already pre-baked in the golden image
- … `--output-sync` support via `make --help` instead of version string (works on macOS BSD make)

Test plan
- … `python-tests` matrix jobs pass (Linux small + macOS, Python 3.11/3.12/3.13)
- … `arc-runner-kubevirt-large` label
- … `-j4` without `--output-sync` (BSD make)

🤖 Generated with Claude Code