feat(ci): merge-gate job that checks failures against a flaky allowlist #4028
Leiyks wants to merge 1 commit into
Add a single `merge-gate` job (new final stage, runs on every branch via
`when: always`) that, after the whole pipeline runs, uses the GitLab API
(short-lived token fetched like `analyze and create pr`) to collect every failed
job across this pipeline and its triggered child pipelines, then matches each
failure against the glob patterns in .gitlab/flaky-jobs.txt.
A failure matching no pattern is a real regression and fails the gate; failures
that match are known-flaky and ignored. The gate is green iff every non-flaky
job passed, so flaky jobs may fail (pipeline red) without blocking the merge.
No `allow_failure` is added to any job.
The gate logic lives in .gitlab/merge-gate.sh (keeps the YAML clean).
.gitlab/flaky-jobs.txt is glob-reduced ("base:*" — any failing version marks all
versions flaky) and regenerated monthly from the "failed on master" export by a
script kept outside the repo.
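
The matching rule described above could be sketched roughly as follows. This is not the contents of `.gitlab/merge-gate.sh`: the function names `is_flaky` and `gate` are hypothetical, and skipping blank lines and `#` comments in the pattern file is an assumption about its format.

```bash
#!/usr/bin/env bash
# Hypothetical sketch, not the real merge-gate.sh.

# is_flaky NAME PATTERN_FILE: succeed if NAME matches any glob pattern in the file.
is_flaky() {
  local name="$1" file="$2" pattern
  while IFS= read -r pattern || [ -n "${pattern}" ]; do
    case "${pattern}" in
      ''|'#'*) continue ;;  # assumption: blank lines and '#' comments are skipped
    esac
    # An unquoted pattern in `case` is matched as a glob,
    # so "base:*" matches "base: [8.1]".
    case "${name}" in
      ${pattern}) return 0 ;;
    esac
  done < "${file}"
  return 1
}

# gate PATTERN_FILE: read failed job names on stdin (one per line),
# report each one, and return non-zero if any failure is not known-flaky.
gate() {
  local file="$1" name blocking=0
  while IFS= read -r name || [ -n "${name}" ]; do
    [ -n "${name}" ] || continue
    if is_flaky "${name}" "${file}"; then
      echo "ignored (known-flaky): ${name}"
    else
      echo "BLOCKING: ${name}"
      blocking=1
    fi
  done
  return "${blocking}"
}
```

With a pattern file containing `base:*`, piping the names `base: [8.2]` and `unit tests` into `gate` would report the first as ignored and the second as blocking, and the function would return non-zero.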
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 696c8a2b6a
    data=$(curl -g -sf -H "${AUTH}" \
      "${GITLAB_API}/projects/${CI_PROJECT_ID}/pipelines/${pid}/jobs?scope[]=failed&per_page=100&page=${page}" || echo "[]")
Fail closed when job lookup fails
In any run where GitLab returns a non-2xx response here (for example an expired/empty BTI token, permission problem, or API outage), `curl ... || echo "[]"` turns the error into an empty page, so the gate can report zero failed jobs and pass even though it did not actually verify the pipeline. Since this job is intended to be the required merge check, API/auth failures need to fail the gate instead of being treated as no failures.
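
One possible shape for a fail-closed lookup is sketched below. The function name `fetch_failed_jobs` is hypothetical; the variables `AUTH`, `GITLAB_API`, and `CI_PROJECT_ID` are taken from the quoted snippet and assumed to be set by the surrounding script.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a fail-closed variant of the quoted lookup.

# fetch_failed_jobs PIPELINE_ID PAGE: print one page of failed jobs as JSON.
# Returns non-zero and prints nothing on stdout if the request fails,
# instead of substituting an empty "[]" page.
fetch_failed_jobs() {
  local pid="$1" page="$2" data
  # -f makes curl exit non-zero on HTTP errors (4xx/5xx);
  # -g stops curl from interpreting the [] in the query string.
  if ! data=$(curl -g -sf -H "${AUTH}" \
      "${GITLAB_API}/projects/${CI_PROJECT_ID}/pipelines/${pid}/jobs?scope[]=failed&per_page=100&page=${page}"); then
    echo "merge-gate: job lookup failed for pipeline ${pid} (page ${page})" >&2
    return 1
  fi
  printf '%s\n' "${data}"
}
```

A caller would then abort the gate on error, for example with `data=$(fetch_failed_jobs "${pid}" "${page}") || exit 1`, so that an auth or API failure can never be read as "no failed jobs".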
      "${GITLAB_API}/projects/${CI_PROJECT_ID}/pipelines/${CI_PIPELINE_ID}/bridges?per_page=100" || echo "[]")
    while read -r child; do
      [ -n "${child}" ] && pipelines+=("${child}")
    done < <(echo "${bridges}" | jq -r '.[] | select(.downstream_pipeline != null) | .downstream_pipeline.id')
Treat failed bridges without children as blocking
When a trigger bridge fails before creating its child pipeline (for example invalid generated YAML or a downstream creation/permission error), GitLab exposes a failed bridge with `downstream_pipeline == null`; this filter drops that bridge, and bridge jobs are not collected by the later `/jobs?scope[]=failed` calls. In that scenario the merge-gate status can pass even though an entire child suite never ran, so failed bridges with no downstream pipeline should be counted as non-flaky failures.
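
A minimal sketch of how such bridges could be detected follows. The function name `failed_bridges_without_child` is hypothetical, and the filter assumes each bridge object in the `/bridges` response carries `name` and `status` fields alongside `downstream_pipeline`.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; requires jq, as the quoted snippet already does.

# failed_bridges_without_child: read the JSON array returned by
# /pipelines/:id/bridges on stdin and print the names of bridges that
# failed before any downstream pipeline was created.
failed_bridges_without_child() {
  jq -r '.[]
         | select(.status == "failed" and .downstream_pipeline == null)
         | .name'
}
```

Any name printed by `echo "${bridges}" | failed_bridges_without_child` would then be added to the blocking failures directly, without being checked against the flaky allowlist.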
What
Adds a single `merge-gate` job that makes a curated subset of CI mandatory to merge, without `allow_failure` on any job and without touching the generated pipelines.
The gate runs last (new `merge-gate` stage, `when: always`, on every branch) and, after the whole pipeline has run:
- Fetches a short-lived GitLab API token the same way as the `analyze and create pr` job (Vault `sdm` JWT → BTI CI API).
- Collects every failed job across this pipeline and its triggered child pipelines (`/pipelines/:id/bridges` → `/pipelines/:id/jobs?scope[]=failed`).
- Matches each failure against the glob patterns in `.gitlab/flaky-jobs.txt`.
A failure matching no pattern is a real regression → the gate fails. Failures that match are known-flaky → ignored. So the gate is green iff every non-flaky job passed; flaky jobs may fail (turning the pipeline red) without blocking the merge.
Files
- `.gitlab-ci.yml`: new `merge-gate` stage + a small job that runs `.gitlab/merge-gate.sh`.
- `.gitlab/merge-gate.sh`: the gate logic (kept out of the YAML for readability).
- `.gitlab/flaky-jobs.txt`: glob patterns of jobs excluded from the gate.
Flaky list
Glob-reduced: if a job failed on any version, all its versions are treated as flaky (`base: [..]` → `base:*`); non-matrix jobs stay exact. The current 815-row "failed on master" export reduces to 150 patterns.
It's regenerated monthly from the export by a script kept outside the repo (`~/git/generate-flaky-jobs.php "<export>.csv"`); only the resulting `.gitlab/flaky-jobs.txt` is committed.
Notes
- The token is fetched the same way as in `analyze and create pr`, so no new secrets.
- [...] `merge-gate` as an individual GitHub status check and that the `sdm` Vault token path is available to this job.
Validation (local)
`.gitlab-ci.yml` parses; `merge-gate.sh` passes `bash -n`; glob matching correctly classifies known-flaky matrix/bare jobs as ignored and unknown jobs as blocking.
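
The glob reduction described under "Flaky list" could look roughly like this. It is only an illustration: the real generator is the PHP script kept outside the repo, the function name `reduce_patterns` is hypothetical, and the sketch assumes its input is already a plain list of failed job names (one per line) with matrix jobs written as `name: [values]`.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the reduction step, not the actual generator.

# reduce_patterns: read failed job names on stdin (one per line) and print
# the deduplicated pattern list. Matrix jobs such as "base: [8.1]" collapse
# to "base:*"; non-matrix names pass through unchanged.
reduce_patterns() {
  sed -E 's/^(.*): \[.*\]$/\1:*/' | LC_ALL=C sort -u
}
```

For the input names `base: [8.1]`, `base: [8.2]`, and `lint`, this yields the two patterns `base:*` and `lint`, which is the kind of reduction that turns many export rows into far fewer patterns.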