ci: optimize self-hosted KubeVirt runners and CI pipeline #787

Draft
mangelajo wants to merge 29 commits into main from kubevirt-runners

@mangelajo mangelajo commented Jun 12, 2026

Member

Summary

  • Split KubeVirt runners into small (4Gi) and large (16Gi) VM flavors for better concurrency
  • Route unit test / build jobs to arc-runner-kubevirt-small, E2E test jobs to arc-runner-kubevirt-large
  • Run package tests in parallel (make test -j4 --output-sync on GNU Make, -j4 on BSD)
  • Suppress log noise in CI with --log-level=CRITICAL --log-cli-level=CRITICAL via PYTEST_ADDOPTS (project defaults unchanged for local dev)
  • Skip apt-get install when CI dependencies are already pre-baked in the golden image
  • Increase Renode monitor connect timeout from 10s to 45s (smaller VMs need more startup time)
  • Add download logging/timeout to u-boot test fixture for CI debuggability
  • Detect --output-sync support via make --help instead of version string (works on macOS BSD make)

Test plan

  • All python-tests matrix jobs pass (Linux small + macOS, Python 3.11/3.12/3.13)
  • E2E tests pick up arc-runner-kubevirt-large label
  • Log output is suppressed in CI but not locally
  • macOS jobs run with -j4 without --output-sync (BSD make)
  • Pre-baked dependencies skip apt-get install (check "Setup Linux dependencies" step)

🤖 Generated with Claude Code

Switch E2E and pytest amd64/Linux jobs from ubuntu-24.04 to
arc-runner-kubevirt, running on self-hosted KubeVirt VMs on the
beast cluster. ARM64 and macOS jobs are unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai Bot commented Jun 12, 2026

Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c31365b6-4454-4fdc-b112-269ddabf32bf

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This PR updates GitHub Actions workflow runner configurations across the E2E and Python test pipelines, migrating from ubuntu-24.04 to arc-runner-kubevirt for job execution while maintaining the existing matrix structures for architecture coverage.

Changes

Runner Infrastructure Migration

| Layer / File(s) | Summary |
|---|---|
| E2E workflow jobs runner migration (.github/workflows/e2e.yaml) | Build controller and operator image jobs, build-python-wheels, e2e-tests, and compatibility jobs switch from ubuntu-24.04 to arc-runner-kubevirt in their job matrices and runner configurations. |
| Python tests workflow runner migration (.github/workflows/python-tests.yaml) | Pytest-matrix job updates its runs-on strategy from ubuntu-24.04 to arc-runner-kubevirt while retaining macos-15. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 In the CI clouds we hop and bound,
New runners found, a better ground,
Arc-runner-kubevirt, swift and true,
Let tests and builds run spry and new!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title 'ci: optimize self-hosted KubeVirt runners and CI pipeline' is partially related to the changeset, which focuses on switching to KubeVirt runners, but overstates the scope by claiming 'optimize CI pipeline' when the changes are primarily runner migrations. |
| Description check | ✅ Passed | The PR description comprehensively aligns with the changeset, detailing infrastructure changes (KubeVirt runner migration), performance improvements (parallel testing), and CI optimizations (log suppression, timeout adjustments). |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@kirkbrauer kirkbrauer left a comment

Member

This looks good to me, hopefully we get quite a speedup here!

@mangelajo mangelajo marked this pull request as draft June 15, 2026 07:31
@mangelajo

Member Author

I am still experimenting, @kirkbrauer. The operator I am trying for GitHub Actions on K8s + KubeVirt seems to be a bit slow to rotate VMs and pick up jobs. Unless we can improve that, we will have to look for an alternative option.

mangelajo and others added 13 commits June 17, 2026 12:28
Enable stderr capture for the DUT network exporter in e2e tests.
The exporter was crashing with exit code 1 but its stderr was
discarded, making it impossible to diagnose the failure.

Changes:
- Enable captureStderr for the DUT network exporter in BeforeAll
- Use port-based log names to avoid collisions between exporters
- Add DumpLogs in AfterAll so errors appear in test output

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On Ubuntu, nft, dnsmasq, and sysctl live in /usr/sbin which is
not in PATH for non-root users. The runtime commands work fine
because they go through sudo, but the shutil.which() startup
check fails. Extend the search path to include /usr/sbin and
/sbin.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
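
The extended search path described in this commit could look roughly like the sketch below. It is not the driver's actual code: the function name `resolve_tool` and the `EXTRA_TOOL_DIRS` constant are hypothetical (a later commit mentions a helper called `_resolve_tool`, whose signature is not shown here). The only library call relied on is `shutil.which(name, path=...)`.

```python
import os
import shutil

# Assumed constant: the commit message names /usr/sbin and /sbin.
EXTRA_TOOL_DIRS = ("/usr/sbin", "/sbin")


def resolve_tool(name, extra_dirs=EXTRA_TOOL_DIRS):
    """Return the full path to `name`, or None if it cannot be found.

    The regular PATH is searched first, then the extra sbin directories,
    so a non-root user can still locate nft, dnsmasq, or sysctl.
    """
    current = os.environ.get("PATH", os.defpath)
    search_path = os.pathsep.join([current, *extra_dirs])
    return shutil.which(name, path=search_path)
```

Appending the extra directories rather than prepending them keeps any tool that is already on PATH as the preferred match.
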
The sysctl binary lives in /usr/sbin which isn't in PATH for
non-root users on Ubuntu. The read-only sysctl call (without
sudo) fails with FileNotFoundError. Resolve the full path at
call time using the same /usr/sbin search path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The which side_effect functions need to accept **kwargs since
the driver now passes path= to shutil.which().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
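
A minimal illustration of why the mock needs `**kwargs`: once the caller passes `path=` to `shutil.which()`, a `side_effect` function that only accepts a name raises `TypeError`. The names `fake_which` and `find_nft` below are hypothetical stand-ins, not the project's test code.

```python
import shutil
from unittest import mock


def fake_which(name, *args, **kwargs):
    """Stand-in for shutil.which that tolerates extra arguments such as path=."""
    known = {"nft": "/usr/sbin/nft", "dnsmasq": "/usr/sbin/dnsmasq"}
    return known.get(name)


def find_nft():
    # Hypothetical caller that mirrors a driver passing path= to shutil.which().
    return shutil.which("nft", path="/usr/bin:/usr/sbin:/sbin")


with mock.patch("shutil.which", side_effect=fake_which) as patched:
    result = find_nft()
    # The mock records the keyword argument, so the test can assert on it.
    patched.assert_called_once_with("nft", path="/usr/bin:/usr/sbin:/sbin")
```
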
The sysctl calls now resolve to full path via _resolve_tool.
Mock it in tests so assertions match the bare command name.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All tests that call get_interface_forwarding or
set_interface_forwarding now mock _resolve_tool so they don't
depend on whether sysctl exists on the build host.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On kubevirt runners with pre-baked images, Renode is already
present. Skip the 200MB download and install when the binary
is available.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add pkg-test-all-parallel target that runs pkg-test-all with
-j6 --output-sync so per-package test output is buffered and
not interleaved. Use it in the CI workflow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
macOS ships BSD make which doesn't support --output-sync.
Check for GNU Make 4+ before using parallel flags.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
macOS BSD make supports -j but not --output-sync. Use -j6 on both
platforms so macOS also benefits from parallel execution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Build jobs and unit tests use arc-runner-kubevirt-small (3Gi).
E2E and compat tests use arc-runner-kubevirt-large (16Gi).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use -p no:logging to suppress all log output in CI, preventing
DEBUG/INFO/WARNING/ERROR noise from polluting test logs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mangelajo and others added 3 commits June 19, 2026 15:03
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevents DEBUG/INFO/WARNING/ERROR log output from polluting test
output while keeping the logging plugin active for caplog tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Check for each binary/package before installing. On golden images
with everything pre-installed, this skips apt-get entirely.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
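
The "check before installing" idea can be sketched as follows. The workflow step itself is presumably shell; this Python version only illustrates the logic, and the binary-to-package mapping is an assumption made for the example.

```python
import shutil

# Hypothetical mapping from a binary name to the apt package providing it.
REQUIRED = {"nft": "nftables", "dnsmasq": "dnsmasq", "socat": "socat"}


def missing_packages(required=REQUIRED, which=shutil.which):
    """Return the packages whose binary is not found on this machine."""
    return [pkg for binary, pkg in required.items() if which(binary) is None]


def install_command(required=REQUIRED, which=shutil.which):
    """Build the apt-get command, or return None when nothing is missing."""
    missing = missing_packages(required, which)
    if not missing:
        return None  # golden image: skip apt-get entirely
    return ["sudo", "apt-get", "install", "-y", *missing]
```

On a golden image where every binary resolves, `install_command()` returns `None` and the install step can be skipped.
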
@mangelajo mangelajo changed the title ci: use self-hosted KubeVirt runners for amd64 jobs ci: optimize self-hosted KubeVirt runners and CI pipeline Jun 19, 2026
mangelajo and others added 7 commits June 19, 2026 15:33
On smaller VMs under parallel test load, Renode can take longer than
10s to start up and bind its monitor port, causing DEADLINE_EXCEEDED.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
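
For readers unfamiliar with this kind of startup race, here is a generic sketch of waiting for a TCP port with a deadline. It is not the Renode driver's implementation (the DEADLINE_EXCEEDED error suggests a different connection mechanism); `wait_for_port` and its parameters are hypothetical, with the 45s default taken from the PR summary.

```python
import socket
import time


def wait_for_port(host, port, timeout=45.0, interval=0.2):
    """Retry a TCP connect until it succeeds or `timeout` seconds elapse.

    Returns True once the port accepts a connection, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: the process is not ready yet.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A longer timeout only costs time in the failure case, since the loop returns as soon as the port is reachable.
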
- Revert --log-level=CRITICAL from project pyproject.toml (keep local
  dev defaults clean)
- Add --log-level=CRITICAL and --log-cli-level=CRITICAL to PYTEST_ADDOPTS
  in CI workflow (suppresses both captured and live log output)
- Reduce make parallelism from -j6 to -j4 to ease resource pressure on
  smaller VMs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
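
The effect of moving the flags into PYTEST_ADDOPTS can be pictured as below. pytest reads extra command-line options from that environment variable, so local runs without it keep the project defaults. The helper `pytest_env` is hypothetical; in the workflow the variable is presumably set directly in YAML.

```python
import os

# Flags named in the commit message above.
CI_LOG_FLAGS = "--log-level=CRITICAL --log-cli-level=CRITICAL"


def pytest_env(base_env, in_ci):
    """Return a copy of `base_env` with the CI-only log flags appended."""
    env = dict(base_env)
    if in_ci:
        existing = env.get("PYTEST_ADDOPTS", "")
        env["PYTEST_ADDOPTS"] = f"{existing} {CI_LOG_FLAGS}".strip()
    return env


# GitHub Actions sets CI=true for every job.
env = pytest_env(os.environ, in_ci=os.environ.get("CI") == "true")
```
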
BSD make on macOS doesn't support --output-sync and the GNU Make
version check wasn't reliably detecting it. Checking make --help
for the flag directly is more portable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
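
The feature probe described here amounts to searching the `make --help` text for the flag instead of parsing a version string. The real check lives in the Makefile; the Python sketch below only illustrates the logic, and the function names are hypothetical.

```python
import subprocess


def read_make_help(make="make"):
    """Return the combined stdout/stderr of `make --help`."""
    proc = subprocess.run(
        [make, "--help"], capture_output=True, text=True, check=False
    )
    return proc.stdout + proc.stderr


def parallel_make_flags(help_text, jobs=4):
    """Always use -jN; add --output-sync only when the help text lists it."""
    flags = [f"-j{jobs}"]
    if "--output-sync" in help_text:
        flags.append("--output-sync")
    return flags
```

Usage would be `parallel_make_flags(read_make_help())`. Probing for the capability itself avoids guessing which make variants and versions support it.
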
The download had no timeout, no progress logging, and raw exceptions
on failure — making CI failures impossible to diagnose in interleaved
parallel output. Now logs download progress, sets a 120s timeout, and
reports a clear pytest.fail message on network errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
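
A download helper with the three properties listed above (progress logging, a timeout, a clear failure message) might look like this sketch. It uses only the standard library; `download` and `DownloadError` are hypothetical names, and in the actual fixture the failure is reported through `pytest.fail` rather than a custom exception.

```python
import shutil
import urllib.request
from pathlib import Path


class DownloadError(RuntimeError):
    """Raised with a readable message when a download fails."""


def download(url, dest, timeout=120.0, log=print):
    """Fetch `url` into `dest`, logging before and after the transfer."""
    dest = Path(dest)
    log(f"downloading {url} -> {dest} (timeout {timeout:.0f}s)")
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response, \
                dest.open("wb") as out:
            shutil.copyfileobj(response, out)
    except OSError as exc:
        # URLError and socket timeouts are both OSError subclasses.
        raise DownloadError(f"could not download {url}: {exc}") from exc
    log(f"downloaded {dest.stat().st_size} bytes to {dest}")
    return dest
```

Note that the `timeout` passed to `urlopen` bounds each blocking socket operation, not the whole transfer, so a slow but steady download can still take longer than 120s.
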
With --output-sync, make buffers each target's output and flushes on
completion. When a test fails, the failure details (traceback, FAILURES
section) get truncated or lost entirely, making CI failures impossible
to diagnose. Interleaved output is noisy but at least complete.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When LOGS_DIR is set, each pkg-test-% target writes stdout/stderr to a
separate log file and records failures via .failed markers (exit-code
based, not grep). The test-report target prints full logs for failed
packages and exits non-zero. Local dev behavior is unchanged.

CI now sets LOGS_DIR, uploads all logs as artifacts (7-day retention),
and uses fail-fast: false so all matrix jobs run to completion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
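
The per-package log and `.failed` marker scheme can be sketched as follows. The real implementation is in the Makefile (the pkg-test-% and test-report targets); this Python version is only an illustration under that assumption, and `run_package` and `report_failures` are hypothetical names.

```python
import subprocess
import sys
from pathlib import Path


def run_package(name, cmd, logs_dir):
    """Run `cmd`, send stdout and stderr to <name>.log, mark failures."""
    logs_dir = Path(logs_dir)
    logs_dir.mkdir(parents=True, exist_ok=True)
    with (logs_dir / f"{name}.log").open("wb") as log:
        code = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT).returncode
    if code != 0:
        # Failure is decided by the exit code, not by grepping the log.
        (logs_dir / f"{name}.failed").write_text(f"{code}\n")
    return code


def report_failures(logs_dir, out=sys.stdout):
    """Print the full log of every failed package; return 1 if any failed."""
    logs_dir = Path(logs_dir)
    failed = sorted(p.stem for p in logs_dir.glob("*.failed"))
    for name in failed:
        out.write(f"==== {name} FAILED ====\n")
        out.write((logs_dir / f"{name}.log").read_text(errors="replace"))
    return 1 if failed else 0
```

Because each package writes to its own file, parallel runs cannot interleave output, and the complete log of a failing package is still available at the end.
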
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mangelajo and others added 5 commits June 19, 2026 17:17
Without this, Python buffers stdout when piped through tee, hiding
print() diagnostics until the test finishes or hangs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RPM extraction hangs silently on resource-constrained VMs — add prints
before and after to pinpoint where it stalls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
rpmfile decompresses the entire zstd CPIO payload (572MB) into memory
to extract a 1.6MB file. On 4Gi VMs running parallel tests this
triggers the OOM killer. Use rpm2cpio + cpio on Linux which streams
without buffering; fall back to rpmfile on macOS.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
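
One way to express the streaming extraction with a macOS fallback is sketched below. The function names and the `which` parameter are hypothetical, and the sketch assumes that `rpm2cpio` and `cpio` are installed on the Linux runners and that member names in the archive are written relative to the root (typically with a leading `./`).

```python
import shutil
import subprocess
import sys
from pathlib import Path


def choose_extractor(platform=sys.platform, which=shutil.which):
    """Prefer the rpm2cpio | cpio pipeline on Linux, else fall back to rpmfile."""
    if platform.startswith("linux") and which("rpm2cpio") and which("cpio"):
        return "rpm2cpio"
    return "rpmfile"


def extract_with_rpm2cpio(rpm_path, member, dest_dir):
    """Extract one archive member by piping rpm2cpio into cpio.

    The payload flows through a pipe, so it is never held in memory whole.
    `member` is a cpio pattern such as "./usr/share/.../u-boot.bin".
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    producer = subprocess.Popen(["rpm2cpio", str(rpm_path)], stdout=subprocess.PIPE)
    try:
        # -i extract, -d create leading directories, -m keep modification times
        subprocess.run(
            ["cpio", "-idm", member],
            stdin=producer.stdout, cwd=dest_dir, check=True,
        )
    finally:
        producer.stdout.close()
        producer.wait()
    return dest_dir / member
```
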
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…).st_size

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2 participants