ci: fix flaky integration tests by distributing images via GHCR #3582

Open: amir-deris wants to merge 5 commits into main from amir/plt-476-CI-integration-test-image-fix

@amir-deris amir-deris commented Jun 12, 2026

Contributor

Problem

The Docker Integration Test workflow packaged the localnode/rpcnode Docker images into a ~1 GB artifact (integration-docker-images.tar.zst) that each of ~40 matrix jobs downloaded concurrently via actions/download-artifact@v4. The action streams and extracts the zip without an end-to-end integrity check, so a prematurely closed connection can leave a truncated file without failing the step. The truncation first surfaced one step later, when zstd -d | docker load failed with Read error (39): premature end / unexpected EOF, and the job then needed a manual rerun. With 40 concurrent 1 GB downloads per run, this flaked regularly.

Fix

Distribute the images via GHCR instead of an artifact. Registry pulls are content-addressed — every layer is sha256-verified and retried automatically by the docker client — so truncation cannot slip through silently.
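To illustrate why content addressing closes the gap, here is a minimal Python sketch of digest verification. It is not the docker client's code; the helper names and the stand-in payload are assumptions for illustration only. It shows that a blob cut short by a closed connection can never hash to the digest the manifest advertises.

```python
import hashlib


def digest(blob: bytes) -> str:
    """Return a registry-style content address for a blob."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def verify(blob: bytes, expected: str) -> bool:
    """True only if the received bytes hash to the advertised digest."""
    return digest(blob) == expected


# Stand-in for a real image layer (hypothetical payload).
layer = b"layer-bytes " * 1000
expected = digest(layer)              # digest a manifest would list
truncated = layer[: len(layer) // 2]  # simulate a connection closed early

print(verify(layer, expected))      # True
print(verify(truncated, expected))  # False
```

A client that performs this check can retry the layer instead of handing a truncated file to the next step.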

  • prepare-cluster pushes both images to ghcr.io/sei-protocol/sei-chain-integration-test-{localnode,rpcnode}:<run_id> using GITHUB_TOKEN (no OIDC or external secrets required). The CI artifact now carries only the small seid tarball.
  • Test jobs log in to GHCR, docker pull the run-tagged images, and retag them to sei-chain/{localnode,rpcnode} — everything downstream (docker-cluster-start-ci etc.) is unchanged.
  • Both builds stamp a sei-chain.ci-run-id label so every run pushes a unique image digest. Labels are config-only: the layer cache is unaffected and a cache-hit run uploads just a new config blob + manifest. This avoids the pitfall of re-tagging a stable digest where in-flight runs could be affected by tag moves.
  • Reruns of failed test jobs keep working: tags are keyed by run_id and persist in GHCR across attempts.
  • Adds ghcr-integration-test-cleanup.yml: a weekly scheduled workflow (Sundays 06:00 UTC) that prunes run-id tags older than 14 days from both GHCR repos, while preserving the :cache tag. Supports workflow_dispatch with a dry-run option.
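
The digest behaviour described for the sei-chain.ci-run-id label can be illustrated with a toy model in Python. This is a simplified sketch, not the real OCI serialisation: the JSON shapes are reduced to the fields that matter here, and the function names, layer contents, and run ids are hypothetical. It shows that changing only a label changes the config blob and therefore the manifest digest, while the layer list stays identical.

```python
import hashlib
import json


def sha256_digest(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()


def build_manifest(run_id: str, layer_digests: list[str]) -> dict:
    # Labels live in the image config, next to the ordered list of layers.
    config = {
        "config": {"Labels": {"sei-chain.ci-run-id": run_id}},
        "rootfs": {"type": "layers", "diff_ids": layer_digests},
    }
    config_bytes = json.dumps(config, sort_keys=True).encode()
    # The manifest references the config blob and each layer by digest.
    return {
        "config": {"digest": sha256_digest(config_bytes)},
        "layers": [{"digest": d} for d in layer_digests],
    }


def manifest_digest(manifest: dict) -> str:
    return sha256_digest(json.dumps(manifest, sort_keys=True).encode())


# Two runs that hit the layer cache and differ only in the run-id label.
layers = [sha256_digest(b"base layer"), sha256_digest(b"seid binary layer")]
run_a = build_manifest("1001", layers)
run_b = build_manifest("1002", layers)

print(run_a["layers"] == run_b["layers"])                # True
print(manifest_digest(run_a) == manifest_digest(run_b))  # False
```

Because the two manifests have different digests, deleting one run's tag cannot remove the image another run is still pulling.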

Advantage over ECR

It avoids the roughly $3,000 monthly egress charge from AWS to GitHub runners. Also, GITHUB_TOKEN is automatically available to all workflows, including fork PRs, which removes the need for OIDC role assumption or AWS credentials for image distribution. No IAM setup is required.

cursor Bot commented Jun 12, 2026

PR Summary

Medium Risk
Changes how CI obtains test images and removes ECR/OIDC for that path; fork PRs lose integration-test runs until images are pushed from an upstream branch, but production chain code is unaffected.

Overview
Fixes flaky Docker Integration Test runs by stopping ~1 GB integration-docker-images.tar.zst artifact fan-out to many matrix jobs. prepare-cluster now pushes localnode/rpcnode to ghcr.io/sei-protocol/sei-chain-integration-test-{localnode,rpcnode}:<run_id>, keeps only the seid tarball in CI artifacts, and moves buildx layer cache from AWS ECR to GHCR :cache tags (OIDC/ECR login removed).

Matrix jobs log in to GHCR, docker pull the run-tagged images, and retag to sei-chain/localnode and sei-chain/rpcnode so downstream cluster steps stay the same. Builds add a sei-chain.ci-run-id label so each run gets a distinct digest for safe tag pruning.
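
As a rough sketch of the pull-and-retag step, the Python below only builds the docker command lines a test job might run and does not execute anything. The image names come from the description above; the function name and the example run id are hypothetical, and the actual workflow steps may differ.

```python
REGISTRY_PREFIX = "ghcr.io/sei-protocol/sei-chain-integration-test-"
IMAGES = ("localnode", "rpcnode")


def pull_and_retag_commands(run_id: str) -> list[list[str]]:
    """Build (but do not run) the pull and tag commands for one workflow run."""
    commands = []
    for name in IMAGES:
        remote = f"{REGISTRY_PREFIX}{name}:{run_id}"
        local = f"sei-chain/{name}"  # name the downstream cluster steps expect
        commands.append(["docker", "pull", remote])
        commands.append(["docker", "tag", remote, local])
    return commands


# Hypothetical run id; in CI this would come from the workflow run.
for command in pull_and_retag_commands("1234567890"):
    print(" ".join(command))
```

Retagging to the local names is what lets the downstream cluster steps stay unchanged.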

Adds ghcr-integration-test-cleanup.yml: weekly (and manual dry-run) deletion of numeric run-id tags older than 14 days, preserving :cache. Docs note fork PRs cannot publish org packages, so integration tests need a branch on this repo.
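
The selection rule for the cleanup can be sketched as a pure function in Python. This is an assumption about the logic, not the contents of ghcr-integration-test-cleanup.yml: the function name, the tag-to-timestamp mapping, and the sample data are all hypothetical. Only purely numeric tags older than the cutoff are selected, so :cache is never a candidate.

```python
import re
from datetime import datetime, timedelta, timezone


def tags_to_prune(
    tags: dict[str, datetime], now: datetime, max_age_days: int = 14
) -> list[str]:
    """Select numeric run-id tags created more than max_age_days before now."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(
        tag
        for tag, created in tags.items()
        if re.fullmatch(r"[0-9]+", tag) and created < cutoff
    )


# Hypothetical tag listing: tag name -> creation time.
now = datetime(2026, 6, 28, 6, 0, tzinfo=timezone.utc)
tags = {
    "100": datetime(2026, 6, 1, tzinfo=timezone.utc),    # 27 days old
    "200": datetime(2026, 6, 20, tzinfo=timezone.utc),   # 8 days old
    "cache": datetime(2026, 1, 1, tzinfo=timezone.utc),  # old, but not numeric
}

print(tags_to_prune(tags, now))  # ['100']
```

A dry run would print this list and stop; a real run would pass it to the deletion step.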

Reviewed by Cursor Bugbot for commit 061d214.

amir-deris changed the title from "modified integration-test yaml to push pull from ecr" to "ci: distribute integration test images via ECR instead of 1GB artifact" on Jun 12, 2026
github-actions Bot commented Jun 12, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
|---|---|---|---|---|
| ✅ passed | ✅ passed | ✅ passed | ✅ passed | Jun 12, 2026, 9:54 PM |

@amir-deris amir-deris requested review from bdchatham and masih June 12, 2026 19:11
codecov Bot commented Jun 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.35%. Comparing base (0a2c388) to head (061d214).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3582      +/-   ##
==========================================
- Coverage   59.22%   58.35%   -0.87%     
==========================================
  Files        2214     2140      -74     
  Lines      183389   174842    -8547     
==========================================
- Hits       108604   102031    -6573     
+ Misses      64994    63720    -1274     
+ Partials     9791     9091     -700     
| Flag | Coverage Δ |
|---|---|
| sei-db | 70.41% <ø> (ø) |
| sei-db-state-db | ? |

Flags with carried forward coverage won't be shown.
74 files have indirect coverage changes.

amir-deris changed the title from "ci: distribute integration test images via ECR instead of 1GB artifact" to "ci: distribute integration test images via GHCR instead of 1GB artifact" on Jun 12, 2026
amir-deris changed the title from "ci: distribute integration test images via GHCR instead of 1GB artifact" to "ci: fix flaky integration tests by distributing images via GHCR" on Jun 12, 2026