
feat(ci): merge-gate job that checks failures against a flaky allowlist #4028

Open
Leiyks wants to merge 1 commit into master from leiyks/ci-merge-gate

Conversation


@Leiyks Leiyks commented Jul 1, 2026


What

Adds a single merge-gate job that makes a curated subset of CI mandatory to merge — without allow_failure on any job, and without touching the generated pipelines.

The gate runs last (new merge-gate stage, when: always, on every branch) and, after the whole pipeline has run:

  1. Fetches a short-lived GitLab API token the same way as the analyze and create pr job (Vault sdm JWT → BTI CI API).
  2. Uses the API to collect every failed job across this pipeline and its triggered child pipelines (/pipelines/:id/bridges → /pipelines/:id/jobs?scope[]=failed).
  3. Matches each failure against the glob patterns in .gitlab/flaky-jobs.txt.

A failure matching no pattern is a real regression → the gate fails. Failures that match are known-flaky → ignored. So the gate is green iff every non-flaky job passed; flaky jobs may fail (turning the pipeline red) without blocking the merge.
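
The classification rule above can be sketched in shell. This is a minimal illustration, not the contents of .gitlab/merge-gate.sh: the function names `is_flaky` and `gate`, and the convention of skipping blank lines and `#` comments in the pattern file, are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: is_flaky and gate are hypothetical names, not taken from merge-gate.sh.

# Return 0 if job name $1 matches any glob pattern listed in file $2.
is_flaky() {
  local job="$1" patterns_file="$2" pattern
  while IFS= read -r pattern || [ -n "${pattern}" ]; do
    # Skip blank lines and '#' comments (assumed file convention).
    case "${pattern}" in
      ''|'#'*) continue ;;
    esac
    # An unquoted expansion in a case arm is matched as a glob, without word splitting.
    case "${job}" in
      ${pattern}) return 0 ;;
    esac
  done < "${patterns_file}"
  return 1
}

# Read failed job names from stdin (one per line) and classify each one.
# Returns 1 if at least one failure matches no pattern, 0 otherwise.
gate() {
  local patterns_file="$1" name blocking=0
  while IFS= read -r name; do
    [ -z "${name}" ] && continue
    if is_flaky "${name}" "${patterns_file}"; then
      echo "ignored (known flaky): ${name}"
    else
      echo "BLOCKING: ${name}"
      blocking=1
    fi
  done
  return "${blocking}"
}
```

A caller could then pipe the collected failure names through it, for example `printf '%s\n' "${failed_jobs}" | gate .gitlab/flaky-jobs.txt || exit 1`, where `failed_jobs` is a hypothetical variable. Note that a literal `[7.0]` inside a pattern would be read as a bracket expression matching a single character, which is one reason a `name:*` form is convenient for matrix jobs.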

Branch-protection setup (manual, after merge): require the single merge-gate status check.

Files

  • .gitlab-ci.yml — new merge-gate stage + a small job that runs .gitlab/merge-gate.sh.
  • .gitlab/merge-gate.sh — the gate logic (kept out of the YAML for readability).
  • .gitlab/flaky-jobs.txt — glob patterns of jobs excluded from the gate.

Flaky list

Glob-reduced: if a job failed on any version, all its versions are treated as flaky (base: [..] → base:*); non-matrix jobs stay exact. The current 815-row "failed on master" export reduces to 150 patterns.

It's regenerated monthly from the export by a script kept outside the repo (~/git/generate-flaky-jobs.php "<export>.csv"); only the resulting .gitlab/flaky-jobs.txt is committed.
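
The glob reduction can be illustrated with a short shell sketch. The real generator is the PHP script mentioned above and is not part of this PR, so everything here is an assumption: the function name `reduce_to_globs` is hypothetical, and the input is assumed to be one failed job name per line rather than the actual CSV layout of the export.

```shell
#!/usr/bin/env bash
# Sketch only: not the real generate-flaky-jobs.php logic. Assumes one job name
# per line on stdin; the export's actual CSV columns are not shown in this PR.

# Turn matrix names like "base: [7.0]" into "base:*"; keep other names exact.
reduce_to_globs() {
  local name base marker=': ['
  while IFS= read -r name; do
    [ -z "${name}" ] && continue
    case "${name}" in
      *': ['*']')
        # Strip the ": [..]" matrix suffix, then append ":*".
        base=${name%"${marker}"*}
        printf '%s:*\n' "${base}"
        ;;
      *)
        printf '%s\n' "${name}"
        ;;
    esac
  done | sort -u   # several versions of one job collapse into a single pattern
}
```

Fed the names `test_extension_ci: [7.0]`, `test_extension_ci: [8.5]` and a non-matrix name, this emits a single `test_extension_ci:*` pattern plus the non-matrix name unchanged.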

Notes

  • Reuses the existing token/API pattern from analyze and create pr, so no new secrets.
  • Trigger (bridge) jobs aren't counted — they only propagate child status; the gate inspects the real leaf jobs in each pipeline.
  • Confirm on the first pipeline that GitLab surfaces merge-gate as an individual GitHub status check and that the sdm Vault token path is available to this job.

Validation (local)

.gitlab-ci.yml parses; merge-gate.sh passes bash -n; glob matching correctly classifies known-flaky matrix/bare jobs as ignored and unknown jobs as blocking.


datadog-datadog-prod-us1 Bot commented Jul 1, 2026


⚠️ Warnings

🚦 4 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-php | ASAN test_c with multiple observers: [8.4]

DataDog/apm-reliability/dd-trace-php | test_extension_ci: [7.0]

DataDog/apm-reliability/dd-trace-php | test_extension_ci: [8.5]

View all 4 failed jobs.

ℹ️ Info

No other issues found (see more)

🧪 All tests passed
❄️ No new flaky tests detected

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 54.08% (-0.04%)

🔗 Commit SHA: 696c8a2

@Leiyks Leiyks changed the title feat(ci): per-pipeline merge-gate jobs for required-status branch protection feat(ci): merge gate from per-pipeline local gates (required-status branch protection) Jul 1, 2026
@Leiyks Leiyks changed the title feat(ci): merge gate from per-pipeline local gates (required-status branch protection) feat(ci): merge gate via committed per-pipeline gate files Jul 1, 2026
@Leiyks Leiyks force-pushed the leiyks/ci-merge-gate branch from 623cddb to eb18681 Compare July 1, 2026 13:26
@Leiyks Leiyks changed the title feat(ci): merge gate via committed per-pipeline gate files feat(ci): merge-gate job that checks failures against a flaky allowlist Jul 1, 2026
@Leiyks Leiyks force-pushed the leiyks/ci-merge-gate branch from eb18681 to 72bcfae Compare July 1, 2026 13:30
Add a single `merge-gate` job (new final stage, runs on every branch via
`when: always`) that, after the whole pipeline runs, uses the GitLab API
(short-lived token fetched like `analyze and create pr`) to collect every failed
job across this pipeline and its triggered child pipelines, then matches each
failure against the glob patterns in .gitlab/flaky-jobs.txt.

A failure matching no pattern is a real regression and fails the gate; failures
that match are known-flaky and ignored. The gate is green iff every non-flaky
job passed, so flaky jobs may fail (pipeline red) without blocking the merge.
No `allow_failure` is added to any job.

The gate logic lives in .gitlab/merge-gate.sh (keeps the YAML clean).
.gitlab/flaky-jobs.txt is glob-reduced ("base:*" — any failing version marks all
versions flaky) and regenerated monthly from the "failed on master" export by a
script kept outside the repo.
@Leiyks Leiyks force-pushed the leiyks/ci-merge-gate branch from 72bcfae to 696c8a2 Compare July 1, 2026 13:32
@Leiyks Leiyks marked this pull request as ready for review July 1, 2026 15:06
@Leiyks Leiyks requested a review from a team as a code owner July 1, 2026 15:06

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 696c8a2b6a


Comment thread .gitlab/merge-gate.sh
Comment on lines +42 to +43
data=$(curl -g -sf -H "${AUTH}" \
"${GITLAB_API}/projects/${CI_PROJECT_ID}/pipelines/${pid}/jobs?scope[]=failed&per_page=100&page=${page}" || echo "[]")


P1: Fail closed when job lookup fails

In any run where GitLab returns a non-2xx response here (for example an expired/empty BTI token, permission problem, or API outage), curl ... || echo "[]" turns the error into an empty page, so the gate can report zero failed jobs and pass even though it did not actually verify the pipeline. Since this job is intended to be the required merge check, API/auth failures need to fail the gate instead of being treated as no failures.
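
One way to fail closed, sketched under assumptions: `fetch_failed_page` is a hypothetical helper name, while `AUTH`, `GITLAB_API` and `CI_PROJECT_ID` are the variables already used in the quoted lines. The only substantive change from the quoted code is that the `|| echo "[]"` fallback is gone, so a curl error propagates to the caller.

```shell
#!/usr/bin/env bash
# Sketch only: fetch_failed_page is a hypothetical helper, not code from this PR.

# Print one page of failed jobs for pipeline $1 (page $2), or return non-zero.
fetch_failed_page() {
  local pid="$1" page="$2" data
  # With -f, curl exits non-zero on HTTP errors (>= 400); let that propagate
  # instead of replacing the error with an empty JSON array.
  if ! data=$(curl -g -sf -H "${AUTH}" \
      "${GITLAB_API}/projects/${CI_PROJECT_ID}/pipelines/${pid}/jobs?scope[]=failed&per_page=100&page=${page}"); then
    echo "merge-gate: job lookup failed for pipeline ${pid} (page ${page})" >&2
    return 1
  fi
  # An empty body is not a valid JSON array, so treat it as an error as well.
  if [ -z "${data}" ]; then
    echo "merge-gate: empty response for pipeline ${pid} (page ${page})" >&2
    return 1
  fi
  printf '%s\n' "${data}"
}
```

The pagination loop would then call it as `data=$(fetch_failed_page "${pid}" "${page}") || exit 1`, so an expired token or an API outage turns the gate red rather than green.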


Comment thread .gitlab/merge-gate.sh
"${GITLAB_API}/projects/${CI_PROJECT_ID}/pipelines/${CI_PIPELINE_ID}/bridges?per_page=100" || echo "[]")
while read -r child; do
[ -n "${child}" ] && pipelines+=("${child}")
done < <(echo "${bridges}" | jq -r '.[] | select(.downstream_pipeline != null) | .downstream_pipeline.id')


P1: Treat failed bridges without children as blocking

When a trigger bridge fails before creating its child pipeline (for example invalid generated YAML or a downstream creation/permission error), GitLab exposes a failed bridge with downstream_pipeline == null; this filter drops that bridge, and bridge jobs are not collected by the later /jobs?scope[]=failed calls. In that scenario the merge-gate status can pass even though an entire child suite never ran, so failed bridges with no downstream pipeline should be counted as non-flaky failures.
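
A possible shape for that check, as a sketch only: `orphaned_failed_bridges` is a hypothetical helper name, and it assumes the bridges response carries `name` and `status` fields alongside `downstream_pipeline`. It uses jq, as the quoted lines already do.

```shell
#!/usr/bin/env bash
# Sketch only: orphaned_failed_bridges is a hypothetical helper, not code from this PR.

# Read the /bridges JSON array on stdin and print the name of every bridge
# that failed without ever creating a downstream pipeline.
orphaned_failed_bridges() {
  jq -r '.[] | select(.status == "failed" and .downstream_pipeline == null) | .name'
}
```

The names it prints could be appended to the list of failed jobs before the flaky-pattern matching, for example `echo "${bridges}" | orphaned_failed_bridges`, so that a bridge which never spawned its child pipeline blocks the gate unless a pattern explicitly allows it.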

