OCPEDGE-2700: fix flaky container-kill recovery test by adding pcs cleanup by lucaconsalvi · Pull Request #31276 · openshift/origin

lucaconsalvi · 2026-06-09T16:26:44Z

Problem

The test should coordinate recovery with peer when local etcd container is killed flakes at ~1.9% (4 failures / 212 CI runs, seen on both PR payload runs and periodic nightlies).

All failures are identical: 600s timeout waiting for etcd-clone to be Started on both nodes. The failure log shows:

Failed Resource Actions:
  * etcd start on master-1 returned 'error' after 2m1.131s
  * etcd 30s-interval monitor on master-0 returned 'error' (master-0 must force a new cluster)

Root cause

When podman kill etcd is run on one node, the podman-etcd resource agent coordinates recovery via the force_new_cluster protocol:

The peer node's monitor detects a leadership loss → sets force_new_cluster → monitor fails
The peer restarts etcd with --force-new-cluster
The killed node's start action retries until it can rejoin as a learner

The race: step 3 (start action) can fail multiple times before step 1–2 completes. If it hits Pacemaker's migration-threshold (default: 3 failures) before the peer is ready, Pacemaker stops scheduling the resource entirely. The longRecoveryTimeout (10 min) then expires waiting for a recovery that Pacemaker won't attempt.

Fix

Restructure the test to capture the "Failed Resource Actions" evidence before running pcs resource cleanup, then let cleanup unblock Pacemaker:

kill etcd
  → wait cluster healthy          (first Eventually)
  → assert Failed Resource Actions on both nodes   ← moved here
  → pcs resource cleanup etcd-clone               ← new step
  → wait pcs Started on both nodes (second Eventually)
  → verifyFinalClusterHealth

By the time the cluster health check passes, the coordinated recovery has already run — both the peer's force_new_cluster monitor failure and the killed node's start failures are recorded in pcs status. We capture that evidence, then reset the failure count so Pacemaker can retry against the now-healthy peer cluster.

In the non-flaky path (98.1% of runs), cleanup is a no-op and the timing of the existing assertion is simply moved earlier.

Verification

Passes locally on TNF cluster (~400s)
Payload CI: first run 2/3 PASS, 1/3 infra flake (504 timeout during cluster setup — test never ran); second run 2/3 PASS, 1/3 FAIL due to unrelated Monitor test failures (missing SCC annotations, kubelet lease backoffs — our test passed in that shard)

Fixes: https://redhat.atlassian.net/browse/OCPEDGE-2700

Summary by CodeRabbit

Release Notes

Tests
- Enhanced test resilience for coordinated etcd recovery scenarios by improving resource management during failure conditions.

When a local etcd container is killed, the peer sets force_new_cluster and etcd recovers. In rare cases (~1.9% of runs) the killed node's repeated start failures hit Pacemaker's migration-threshold before coordination completes, causing Pacemaker to stop scheduling the resource and the 10-minute Eventually to expire. The "Failed Resource Actions" evidence (which the test asserts) is captured immediately after the cluster self-heals — at that point both nodes are guaranteed to appear since the peer's force_new_cluster monitor failure and the killed node's start failures have already been recorded. After capturing the evidence, pcs resource cleanup resets the failure count so Pacemaker can retry the start action against the now-healthy peer cluster. Fixes: https://redhat.atlassian.net/browse/OCPEDGE-2700 Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

openshift-merge-bot · 2026-06-09T16:26:47Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

openshift-ci · 2026-06-09T16:26:56Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2026-06-09T16:27:08Z

Walkthrough

Test adds helpers to read, set, and restore Pacemaker’s migration-threshold for etcd-clone. The coordinated local-etcd-kill test now captures the original threshold, sets it to INFINITY before killing the container, and defers restoring the prior value during cleanup.

Changes

Etcd disruption test robustness improvements

Layer / File(s)	Summary
migration-threshold helpers and test precondition `test/extended/edge_topologies/tnf_etcd_disruption.go`	Adds internal helpers to get, set, and restore Pacemaker `migration-threshold` for `etcd-clone`. The coordinated test captures the original threshold, sets it to `INFINITY` before killing the local etcd container, and defers restoration (deleting the override if originally unset). The existing cleanup, verification that `etcd-clone` is `Started` on both nodes, and "Failed Resource Actions" observation remain.

Sequence Diagram(s)

sequenceDiagram
  participant Test
  participant Pacemaker
  participant NodeA
  participant NodeB
  Test->>Pacemaker: read migration-threshold(etcd-clone)
  Test->>Pacemaker: set migration-threshold=INFINITY
  Test->>NodeA: kill local etcd container
  NodeA->>Test: report etcd recovered
  Test->>Pacemaker: verify etcd-clone Started on NodeA and NodeB
  Test->>Pacemaker: restore migration-threshold(original or delete override)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

openshift/origin#30950: Similar changes to tnf_etcd_disruption.go adding pcs cleanup/wait steps to recover Pacemaker/etcd components.

Suggested labels

jira/valid-reference

Suggested reviewers

fracappa
fonta-rh
suleymanakbas91

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

Ignore

❌ Failed checks (1 error, 1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
No-Sensitive-Data-In-Logs	❌ Error	tnf_etcd_disruption.go logs internal hostnames via framework.Logf with node.Name and dumps pcs status/Failed Resource Actions output that includes those node names.	Redact or remove hostname values from log messages (e.g., don’t print node.Name), and avoid dumping full pcs status/failed-action sections unless hostnames are masked.
Test Structure And Quality	⚠️ Warning	cleanup may delete pacemaker migration-threshold incorrectly: getMigrationThreshold returns "" on read error and restoreMigrationThreshold deletes override when originalValue is empty.	Make getMigrationThreshold return (string,error) and assert baseline read success before calling setMigrationThreshold/restore; only delete the override when you confirmed it was originally unset.
Title check	❓ Inconclusive	The title references fixing a flaky test and mentions 'pcs cleanup', which aligns with the summary's description of adding Pacemaker migration-threshold management to prevent test flakiness. However, the actual implementation (as shown in the summary) uses 'migration-threshold=INFINITY' approach rather than 'pcs cleanup', making the title partially outdated based on the final solution described in PR objectives.	Verify if the title should be updated to reflect the final implementation approach (migration-threshold management) instead of 'pcs cleanup', as the PR objectives indicate the solution was reworked.

✅ Passed checks (12 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	In test/extended/edge_topologies/tnf_etcd_disruption.go, all ginkgo It/Describe titles are static string literals; no fmt.Sprintf/variables/dynamic node names, dates, UUIDs, or IPs in test titles w...
Microshift Test Compatibility	✅ Passed	`should coordinate recovery...` is inside a single Describe tagged `[apigroup:config.openshift.io]`, so MicroShift CI will skip it; code uses etcd-clone/podman/pacemaker which would otherwise be un...
Single Node Openshift (Sno) Test Compatibility	✅ Passed	The test is multi-node (expects 2 nodes), but it’s guarded by utils.SkipIfNotTopology(oc, v1.DualReplicaTopologyMode) in BeforeEach, so SNO won’t run it.
Topology-Aware Scheduling Compatibility	✅ Passed	PR modifies only tnf_etcd_disruption test to set/restore Pacemaker migration-threshold; inspected file has no k8s scheduling constraints (no affinity/topologySpread/nodeSelectors).
Ote Binary Stdout Contract	✅ Passed	Scanned all PR-changed .go files for process-level stdout contract patterns (fmt.Print/Printf/Println, os.Stdout, klog.SetOutput/LogToStderr) and found none.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	In tnf_etcd_disruption.go, the etcd-clone kill/recovery test contains no hardcoded IPv4/localhost literals, no ParseIP/ParseCIDR/To4 logic, and no external URLs/curl/wget/public-host connections (o...
No-Weak-Crypto	✅ Passed	In test/extended/edge_topologies/tnf_etcd_disruption.go (and scanned edge_topologies/*.go), no MD5/SHA1/DES/RC4/3DES/Blowfish/ECB keywords, crypto imports, or custom crypto/secret comparisons found.
Container-Privileges	✅ Passed	Only files changed were .ci-operator.yaml, images/tests/Dockerfile.rhel, and tnf_etcd_disruption.go; the git diff contains no privileged/hostPID/hostNetwork/hostIPC/SYS_ADMIN/allowPrivilegeEscalati...

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-06-09T16:27:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lucaconsalvi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~test/extended/edge_topologies/OWNERS~~ [lucaconsalvi]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

lucaconsalvi · 2026-06-09T16:30:57Z

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery

openshift-ci · 2026-06-09T16:31:01Z

@lucaconsalvi: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-1of3
periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-2of3
periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-3of3

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/98519310-6420-11f1-85a8-bef173c7a909-0

coderabbitai

🧹 Nitpick comments (1)

test/extended/edge_topologies/tnf_etcd_disruption.go (1)

535-543: ⚡ Quick win

Consider logging cleanup output on success for consistency and diagnostics.

The AfterEach cleanup at Line 337-343 logs the pcs resource cleanup output even on success (Line 342). For consistency and better observability when investigating test runs, consider logging the output here too.

📝 Suggested improvement

 g.By("Running pcs resource cleanup to unblock etcd-clone if Pacemaker hit migration-threshold")
 if output, cleanupErr := exutil.DebugNodeRetryWithOptionsAndChroot(
     oc, execNode.Name, "default", "bash", "-c", "sudo pcs resource cleanup etcd-clone"); cleanupErr != nil {
     framework.Logf("Warning: pcs resource cleanup failed: %v\noutput: %s", cleanupErr, output)
+} else {
+    framework.Logf("PCS resource cleanup output: %s", output)
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/edge_topologies/tnf_etcd_disruption.go` around lines 535 - 543,
Update the pcs resource cleanup block that calls
exutil.DebugNodeRetryWithOptionsAndChroot (the call using oc, execNode.Name and
"sudo pcs resource cleanup etcd-clone") to log the command output on success as
well as on error; i.e., after the call, always call framework.Logf with a
concise success message and the captured output when cleanupErr is nil, and keep
the existing warning log when cleanupErr is non-nil so diagnostics are
consistent.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/extended/edge_topologies/tnf_etcd_disruption.go`:
- Around line 535-543: Update the pcs resource cleanup block that calls
exutil.DebugNodeRetryWithOptionsAndChroot (the call using oc, execNode.Name and
"sudo pcs resource cleanup etcd-clone") to log the command output on success as
well as on error; i.e., after the call, always call framework.Logf with a
concise success message and the captured output when cleanupErr is nil, and keep
the existing warning log when cleanupErr is non-nil so diagnostics are
consistent.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4ffeaed2-43be-4d3e-819c-258eef5d5aab

📥 Commits

Reviewing files that changed from the base of the PR and between 6b33871 and a69bbb6.

📒 Files selected for processing (1)

test/extended/edge_topologies/tnf_etcd_disruption.go

lucaconsalvi · 2026-06-10T07:56:16Z

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery

openshift-ci · 2026-06-10T07:56:23Z

@lucaconsalvi: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-1of3
periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-2of3
periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-3of3

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/dbf33f20-64a1-11f1-99be-5db787e1433a-0

openshift-merge-bot · 2026-06-10T13:59:44Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-metal-ovn-two-node-arbiter
/test e2e-metal-ovn-two-node-fencing
/test e2e-metal-ovn-two-node-fencing-recovery

openshift-ci-robot · 2026-06-10T19:28:23Z

@lucaconsalvi: This pull request references OCPEDGE-2700 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Problem

The test should coordinate recovery with peer when local etcd container is killed flakes at ~1.9% (4 failures / 212 CI runs, seen on both PR payload runs and periodic nightlies).

All failures are identical: 600s timeout waiting for etcd-clone to be Started on both nodes. The failure log shows:
Failed Resource Actions:
 * etcd start on master-1 returned 'error' after 2m1.131s
 * etcd 30s-interval monitor on master-0 returned 'error' (master-0 must force a new cluster)
Root cause

When podman kill etcd is run on one node, the podman-etcd resource agent coordinates recovery via the force_new_cluster protocol:

The peer node's monitor detects a leadership loss → sets force_new_cluster → monitor fails

The peer restarts etcd with --force-new-cluster

The killed node's start action retries until it can rejoin as a learner

The race: step 3 (start action) can fail multiple times before step 1–2 completes. If it hits Pacemaker's migration-threshold (default: 3 failures) before the peer is ready, Pacemaker stops scheduling the resource entirely. The longRecoveryTimeout (10 min) then expires waiting for a recovery that Pacemaker won't attempt.

Fix

Restructure the test to capture the "Failed Resource Actions" evidence before running pcs resource cleanup, then let cleanup unblock Pacemaker:
kill etcd
 → wait cluster healthy          (first Eventually)
 → assert Failed Resource Actions on both nodes   ← moved here
 → pcs resource cleanup etcd-clone               ← new step
 → wait pcs Started on both nodes (second Eventually)
 → verifyFinalClusterHealth
By the time the cluster health check passes, the coordinated recovery has already run — both the peer's force_new_cluster monitor failure and the killed node's start failures are recorded in pcs status. We capture that evidence, then reset the failure count so Pacemaker can retry against the now-healthy peer cluster.

In the non-flaky path (98.1% of runs), cleanup is a no-op and the timing of the existing assertion is simply moved earlier.

Verification

Passes locally on TNF cluster (~400s)

Payload CI: first run 2/3 PASS, 1/3 infra flake (504 timeout during cluster setup — test never ran); second run 2/3 PASS, 1/3 FAIL due to unrelated Monitor test failures (missing SCC annotations, kubelet lease backoffs — our test passed in that shard)

Fixes: https://redhat.atlassian.net/browse/OCPEDGE-2700

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

lucaconsalvi · 2026-06-11T09:11:08Z

/retest

openshift-trt · 2026-06-11T14:47:39Z

Job Failure Risk Analysis for sha: ff7fde9

Job Name	Failure Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift	High install should succeed: other This test has passed 99.39% of 3782 runs on release 5.0 [Overall] in the last week.

lucaconsalvi · 2026-06-11T15:09:55Z

/retest-required

dhensel-rh · 2026-06-11T20:31:25Z

@lucaconsalvi We are capturing the failure, and running the necessary command to clean up the cluster so it is not left in a bad state. How do we make sure this does not result in missing failures if a real problem introduced ?

lucaconsalvi · 2026-06-12T09:17:26Z

@dhensel-rh Thanks for the observation! The cleanup could mask a real start-action failure. I've reworked the approach: instead of running pcs resource cleanup mid-test, I now raise migration-threshold to INFINITY on etcd-clone before the container kill, and restore it in DeferCleanup/AfterEach.
This way Pacemaker never stops retrying during the transient coordination window, so the test purely observes recovery without intervening. If a future code change genuinely breaks the start action, etcd-clone will never reach Started and the test times out — no cleanup to hide it.
I'm testing it locally to see if this could be a better solution!

…shold Raise migration-threshold to INFINITY on the etcd resource before killing the container, so Pacemaker never stops retrying during the transient coordination window. This replaces the previous approach of running pcs resource cleanup mid-test, which could mask real start-action failures. The test now purely observes recovery without intervening. If the resource agent has a genuine bug, etcd-clone will never start and the test times out. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/edge_topologies/tnf_etcd_disruption.go`:
- Around line 207-216: getMigrationThreshold currently masks read failures by
returning an empty string, which causes the cleanup code to wrongly delete the
migration-threshold; change getMigrationThreshold to return (string, error)
(preserve the call signature using oc *exutil.CLI and nodeName) and propagate
the actual error instead of returning ""; then update the cleanup logic that
performs the migration-threshold deletion to check the error and either
fail/abort the cleanup or skip deleting migration-threshold when the read failed
so the original cluster config is preserved.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 97b7db45-7e31-4bbf-bad2-708ef0ef063d

📥 Commits

Reviewing files that changed from the base of the PR and between ff7fde9 and 4888122.

📒 Files selected for processing (1)

test/extended/edge_topologies/tnf_etcd_disruption.go

coderabbitai · 2026-06-12T16:31:37Z

+func getMigrationThreshold(oc *exutil.CLI, nodeName string) string {
+	output, err := exutil.DebugNodeRetryWithOptionsAndChroot(
+		oc, nodeName, "default", "bash", "-c",
+		"sudo pcs resource config etcd-clone 2>/dev/null | grep migration-threshold | grep -oP 'migration-threshold=\\K\\S+'")
+	if err != nil {
+		framework.Logf("Warning: could not read migration-threshold: %v (output: %s)", err, output)
+		return ""
+	}
+	return strings.TrimSpace(output)
+}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve original cluster config by failing when baseline threshold read fails.

At Line 207, getMigrationThreshold collapses read failures to "". That feeds Line 620 cleanup logic, which then deletes migration-threshold as if it was originally unset. A transient read failure can therefore mutate persisted Pacemaker config after the test.

Suggested fix

-func getMigrationThreshold(oc *exutil.CLI, nodeName string) string { +func getMigrationThreshold(oc *exutil.CLI, nodeName string) (string, error) { output, err := exutil.DebugNodeRetryWithOptionsAndChroot( oc, nodeName, "default", "bash", "-c", "sudo pcs resource config etcd-clone 2>/dev/null | grep migration-threshold | grep -oP 'migration-threshold=\\K\\S+'") if err != nil { - framework.Logf("Warning: could not read migration-threshold: %v (output: %s)", err, output) - return "" + return "", fmt.Errorf("could not read migration-threshold: %v (output: %s)", err, output) } - return strings.TrimSpace(output) + return strings.TrimSpace(output), nil }

- originalThreshold := getMigrationThreshold(oc, execNode.Name) + originalThreshold, err := getMigrationThreshold(oc, execNode.Name) + o.Expect(err).NotTo(o.HaveOccurred(), "Must read existing migration-threshold before mutating it")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/extended/edge_topologies/tnf_etcd_disruption.go` around lines 207 - 216, getMigrationThreshold currently masks read failures by returning an empty string, which causes the cleanup code to wrongly delete the migration-threshold; change getMigrationThreshold to return (string, error) (preserve the call signature using oc *exutil.CLI and nodeName) and propagate the actual error instead of returning ""; then update the cleanup logic that performs the migration-threshold deletion to check the error and either fail/abort the cleanup or skip deleting migration-threshold when the read failed so the original cluster config is preserved.

openshift-merge-bot · 2026-06-12T16:49:56Z

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-metal-ovn-two-node-arbiter
/test e2e-metal-ovn-two-node-fencing
/test e2e-metal-ovn-two-node-fencing-recovery

openshift-ci · 2026-06-12T21:08:27Z

@lucaconsalvi: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-metal-ovn-two-node-fencing-recovery	`4888122`	link	false	`/test e2e-metal-ovn-two-node-fencing-recovery`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-trt · 2026-06-12T23:11:31Z

Job Failure Risk Analysis for sha: 4888122

Job Name	Failure Risk
pull-ci-openshift-origin-main-e2e-gcp-ovn	Low [Feature:NetworkSegmentation][ovn-kubernetes-ote][sig-network] Network Segmentation: services on a user defined primary network should be reachable through their cluster IP, node port and load balancer L2 primary UDN with custom network, cluster-networked pods, NodePort service [Suite:openshift/conformance/parallel] This test has passed 0.00% of 8 runs on release 5.0 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:standard Network:ovn NetworkStack:ipv4 OS:rhcos9 Owner:eng Platform:gcp Procedure:none SecurityMode:default Topology:ha Upgrade:micro] in the last week.
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-ipv6	Low [Feature:NetworkSegmentation][ovn-kubernetes-ote][sig-network] Network Segmentation: services on a user defined primary network should be reachable through their cluster IP, node port and load balancer L2 primary UDN with custom network, cluster-networked pods, NodePort service [Suite:openshift/conformance/parallel] This test has passed 0.00% of 5 runs on release 5.0 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:standard Network:ovn NetworkStack:ipv6 OS:rhcos9 Owner:eng Platform:metal Procedure:none SecurityMode:default Topology:ha Upgrade:micro] in the last week.
pull-ci-openshift-origin-main-e2e-metal-ovn-two-node-arbiter	Low [Feature:NetworkSegmentation][ovn-kubernetes-ote][sig-network] Network Segmentation: services on a user defined primary network should be reachable through their cluster IP, node port and load balancer L2 primary UDN with custom network, cluster-networked pods, NodePort service [Suite:openshift/conformance/parallel] This test has passed 0.00% of 6 runs on release 5.0 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:candidate Network:ovn NetworkStack:ipv4 OS:rhcos9 Owner:eng Platform:metal Procedure:none SecurityMode:default Topology:two-node-arbiter Upgrade:none] in the last week.
pull-ci-openshift-origin-main-e2e-metal-ovn-two-node-fencing	Low [Feature:NetworkSegmentation][ovn-kubernetes-ote][sig-network] Network Segmentation: services on a user defined primary network should be reachable through their cluster IP, node port and load balancer L2 primary UDN with custom network, cluster-networked pods, NodePort service [Suite:openshift/conformance/parallel] This test has passed 0.00% of 19 runs on release 5.0 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:candidate Network:ovn NetworkStack:ipv4 OS:rhcos9 Owner:eng Platform:metal Procedure:none SecurityMode:default Topology:two-node-fencing Upgrade:none] in the last week.
pull-ci-openshift-origin-main-e2e-vsphere-ovn	Low [Feature:NetworkSegmentation][ovn-kubernetes-ote][sig-network] Network Segmentation: services on a user defined primary network should be reachable through their cluster IP, node port and load balancer L2 primary UDN with custom network, cluster-networked pods, NodePort service [Suite:openshift/conformance/parallel] This test has passed 0.00% of 18 runs on release 5.0 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:standard Network:ovn NetworkStack:ipv4 OS:rhcos9 Owner:eng Platform:vsphere Procedure:none SecurityMode:default Topology:ha Upgrade:none] in the last week.
pull-ci-openshift-origin-main-e2e-vsphere-ovn-upi	Low [Feature:NetworkSegmentation][ovn-kubernetes-ote][sig-network] Network Segmentation: services on a user defined primary network should be reachable through their cluster IP, node port and load balancer L2 primary UDN with custom network, cluster-networked pods, NodePort service [Suite:openshift/conformance/parallel] This test has passed 0.00% of 11 runs on release 5.0 [Architecture:amd64 FeatureSet:default Installer:upi JobTier:standard Network:ovn NetworkStack:ipv4 OS:rhcos9 Owner:eng Platform:vsphere Procedure:none SecurityMode:default Topology:ha Upgrade:none] in the last week.

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2026

openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 9, 2026

openshift-ci Bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 9, 2026

coderabbitai Bot reviewed Jun 9, 2026

View reviewed changes

coderabbitai Bot approved these changes Jun 9, 2026

View reviewed changes

openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 9, 2026

lucaconsalvi marked this pull request as ready for review June 10, 2026 13:36

openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 10, 2026

openshift-ci Bot requested review from eggfoobar and jeff-roche June 10, 2026 13:38

lucaconsalvi changed the title ~~test/etcd: fix flaky container-kill recovery test by adding pcs cleanup~~ OCPEDGE-2700: test/etcd: fix flaky container-kill recovery test by adding pcs cleanup Jun 10, 2026

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 10, 2026

lucaconsalvi changed the title ~~OCPEDGE-2700: test/etcd: fix flaky container-kill recovery test by adding pcs cleanup~~ OCPEDGE-2700: fix flaky container-kill recovery test by adding pcs cleanup Jun 10, 2026

lucaconsalvi marked this pull request as draft June 12, 2026 09:17

openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 12, 2026

lucaconsalvi force-pushed the tnf-etcd-recovery-flake-fix branch from ff7fde9 to 4888122 Compare June 12, 2026 16:24

lucaconsalvi marked this pull request as ready for review June 12, 2026 16:28

openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 12, 2026

coderabbitai Bot requested changes Jun 12, 2026

View reviewed changes

Conversation

lucaconsalvi commented Jun 9, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root cause

Fix

Verification

Summary by CodeRabbit

Release Notes

Uh oh!

openshift-merge-bot Bot commented Jun 9, 2026

Uh oh!

openshift-ci Bot commented Jun 9, 2026

Uh oh!

coderabbitai Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Pre-merge checks failed

❌ Failed checks (1 error, 1 warning, 1 inconclusive)

Uh oh!

openshift-ci Bot commented Jun 9, 2026

Uh oh!

lucaconsalvi commented Jun 9, 2026

Uh oh!

openshift-ci Bot commented Jun 9, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

lucaconsalvi commented Jun 10, 2026

Uh oh!

openshift-ci Bot commented Jun 10, 2026

Uh oh!

openshift-merge-bot Bot commented Jun 10, 2026

Uh oh!

openshift-ci-robot commented Jun 10, 2026 • edited by openshift-ci Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root cause

Fix

Verification

Uh oh!

lucaconsalvi commented Jun 11, 2026

Uh oh!

openshift-trt Bot commented Jun 11, 2026

Uh oh!

lucaconsalvi commented Jun 11, 2026

Uh oh!

dhensel-rh commented Jun 11, 2026

Uh oh!

lucaconsalvi commented Jun 12, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

openshift-merge-bot Bot commented Jun 12, 2026

Uh oh!

openshift-ci Bot commented Jun 12, 2026

Uh oh!

openshift-trt Bot commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lucaconsalvi commented Jun 9, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading

openshift-ci-robot commented Jun 10, 2026 •

edited by openshift-ci Bot

Loading