Skip to content

OCPEDGE-2700: fix flaky container-kill recovery test by adding pcs cleanup#31276

Open
lucaconsalvi wants to merge 2 commits into
openshift:mainfrom
lucaconsalvi:tnf-etcd-recovery-flake-fix
Open

OCPEDGE-2700: fix flaky container-kill recovery test by adding pcs cleanup#31276
lucaconsalvi wants to merge 2 commits into
openshift:mainfrom
lucaconsalvi:tnf-etcd-recovery-flake-fix

Conversation

@lucaconsalvi

@lucaconsalvi lucaconsalvi commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Problem

The test should coordinate recovery with peer when local etcd container is killed flakes at ~1.9% (4 failures / 212 CI runs, seen on both PR payload runs and periodic nightlies).

All failures are identical: 600s timeout waiting for etcd-clone to be Started on both nodes. The failure log shows:

Failed Resource Actions:
  * etcd start on master-1 returned 'error' after 2m1.131s
  * etcd 30s-interval monitor on master-0 returned 'error' (master-0 must force a new cluster)

Root cause

When podman kill etcd is run on one node, the podman-etcd resource agent coordinates recovery via the force_new_cluster protocol:

  1. The peer node's monitor detects a leadership loss → sets force_new_cluster → monitor fails
  2. The peer restarts etcd with --force-new-cluster
  3. The killed node's start action retries until it can rejoin as a learner

The race: step 3 (start action) can fail multiple times before step 1–2 completes. If it hits Pacemaker's migration-threshold (default: 3 failures) before the peer is ready, Pacemaker stops scheduling the resource entirely. The longRecoveryTimeout (10 min) then expires waiting for a recovery that Pacemaker won't attempt.

Fix

Restructure the test to capture the "Failed Resource Actions" evidence before running pcs resource cleanup, then let cleanup unblock Pacemaker:

kill etcd
  → wait cluster healthy          (first Eventually)
  → assert Failed Resource Actions on both nodes   ← moved here
  → pcs resource cleanup etcd-clone               ← new step
  → wait pcs Started on both nodes (second Eventually)
  → verifyFinalClusterHealth

By the time the cluster health check passes, the coordinated recovery has already run — both the peer's force_new_cluster monitor failure and the killed node's start failures are recorded in pcs status. We capture that evidence, then reset the failure count so Pacemaker can retry against the now-healthy peer cluster.

In the non-flaky path (98.1% of runs), cleanup is a no-op and the timing of the existing assertion is simply moved earlier.

Verification

  • Passes locally on TNF cluster (~400s)
  • Payload CI: first run 2/3 PASS, 1/3 infra flake (504 timeout during cluster setup — test never ran); second run 2/3 PASS, 1/3 FAIL due to unrelated Monitor test failures (missing SCC annotations, kubelet lease backoffs — our test passed in that shard)

Fixes: https://redhat.atlassian.net/browse/OCPEDGE-2700

Summary by CodeRabbit

Release Notes

  • Tests
    • Enhanced test resilience for coordinated etcd recovery scenarios by improving resource management during failure conditions.

When a local etcd container is killed, the peer sets force_new_cluster and
etcd recovers. In rare cases (~1.9% of runs) the killed node's repeated
start failures hit Pacemaker's migration-threshold before coordination
completes, causing Pacemaker to stop scheduling the resource and the
10-minute Eventually to expire.

The "Failed Resource Actions" evidence (which the test asserts) is captured
immediately after the cluster self-heals — at that point both nodes are
guaranteed to appear since the peer's force_new_cluster monitor failure and
the killed node's start failures have already been recorded. After capturing
the evidence, pcs resource cleanup resets the failure count so Pacemaker
can retry the start action against the now-healthy peer cluster.

Fixes: https://redhat.atlassian.net/browse/OCPEDGE-2700

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 9, 2026
@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 9, 2026
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Walkthrough

Test adds helpers to read, set, and restore Pacemaker’s migration-threshold for etcd-clone. The coordinated local-etcd-kill test now captures the original threshold, sets it to INFINITY before killing the container, and defers restoring the prior value during cleanup.

Changes

Etcd disruption test robustness improvements

Layer / File(s) Summary
migration-threshold helpers and test precondition
test/extended/edge_topologies/tnf_etcd_disruption.go
Adds internal helpers to get, set, and restore Pacemaker migration-threshold for etcd-clone. The coordinated test captures the original threshold, sets it to INFINITY before killing the local etcd container, and defers restoration (deleting the override if originally unset). The existing cleanup, verification that etcd-clone is Started on both nodes, and "Failed Resource Actions" observation remain.

Sequence Diagram(s)

sequenceDiagram
  participant Test
  participant Pacemaker
  participant NodeA
  participant NodeB
  Test->>Pacemaker: read migration-threshold(etcd-clone)
  Test->>Pacemaker: set migration-threshold=INFINITY
  Test->>NodeA: kill local etcd container
  NodeA->>Test: report etcd recovered
  Test->>Pacemaker: verify etcd-clone Started on NodeA and NodeB
  Test->>Pacemaker: restore migration-threshold(original or delete override)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • openshift/origin#30950: Similar changes to tnf_etcd_disruption.go adding pcs cleanup/wait steps to recover Pacemaker/etcd components.

Suggested labels

jira/valid-reference

Suggested reviewers

  • fracappa
  • fonta-rh
  • suleymanakbas91

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

  • Ignore

❌ Failed checks (1 error, 1 warning, 1 inconclusive)

Check name Status Explanation Resolution
No-Sensitive-Data-In-Logs ❌ Error tnf_etcd_disruption.go logs internal hostnames via framework.Logf with node.Name and dumps pcs status/Failed Resource Actions output that includes those node names. Redact or remove hostname values from log messages (e.g., don’t print node.Name), and avoid dumping full pcs status/failed-action sections unless hostnames are masked.
Test Structure And Quality ⚠️ Warning cleanup may delete pacemaker migration-threshold incorrectly: getMigrationThreshold returns "" on read error and restoreMigrationThreshold deletes override when originalValue is empty. Make getMigrationThreshold return (string,error) and assert baseline read success before calling setMigrationThreshold/restore; only delete the override when you confirmed it was originally unset.
Title check ❓ Inconclusive The title references fixing a flaky test and mentions 'pcs cleanup', which aligns with the summary's description of adding Pacemaker migration-threshold management to prevent test flakiness. However, the actual implementation (as shown in the summary) uses 'migration-threshold=INFINITY' approach rather than 'pcs cleanup', making the title partially outdated based on the final solution described in PR objectives. Verify if the title should be updated to reflect the final implementation approach (migration-threshold management) instead of 'pcs cleanup', as the PR objectives indicate the solution was reworked.
✅ Passed checks (12 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed In test/extended/edge_topologies/tnf_etcd_disruption.go, all ginkgo It/Describe titles are static string literals; no fmt.Sprintf/variables/dynamic node names, dates, UUIDs, or IPs in test titles w...
Microshift Test Compatibility ✅ Passed should coordinate recovery... is inside a single Describe tagged [apigroup:config.openshift.io], so MicroShift CI will skip it; code uses etcd-clone/podman/pacemaker which would otherwise be un...
Single Node Openshift (Sno) Test Compatibility ✅ Passed The test is multi-node (expects 2 nodes), but it’s guarded by utils.SkipIfNotTopology(oc, v1.DualReplicaTopologyMode) in BeforeEach, so SNO won’t run it.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only tnf_etcd_disruption test to set/restore Pacemaker migration-threshold; inspected file has no k8s scheduling constraints (no affinity/topologySpread/nodeSelectors).
Ote Binary Stdout Contract ✅ Passed Scanned all PR-changed .go files for process-level stdout contract patterns (fmt.Print/Printf/Println, os.Stdout, klog.SetOutput/LogToStderr) and found none.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed In tnf_etcd_disruption.go, the etcd-clone kill/recovery test contains no hardcoded IPv4/localhost literals, no ParseIP/ParseCIDR/To4 logic, and no external URLs/curl/wget/public-host connections (o...
No-Weak-Crypto ✅ Passed In test/extended/edge_topologies/tnf_etcd_disruption.go (and scanned edge_topologies/*.go), no MD5/SHA1/DES/RC4/3DES/Blowfish/ECB keywords, crypto imports, or custom crypto/secret comparisons found.
Container-Privileges ✅ Passed Only files changed were .ci-operator.yaml, images/tests/Dockerfile.rhel, and tnf_etcd_disruption.go; the git diff contains no privileged/hostPID/hostNetwork/hostIPC/SYS_ADMIN/allowPrivilegeEscalati...
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lucaconsalvi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 9, 2026
@lucaconsalvi

Copy link
Copy Markdown
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery

@openshift-ci

openshift-ci Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

@lucaconsalvi: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-1of3
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-2of3
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-3of3

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/98519310-6420-11f1-85a8-bef173c7a909-0

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/extended/edge_topologies/tnf_etcd_disruption.go (1)

535-543: ⚡ Quick win

Consider logging cleanup output on success for consistency and diagnostics.

The AfterEach cleanup at Line 337-343 logs the pcs resource cleanup output even on success (Line 342). For consistency and better observability when investigating test runs, consider logging the output here too.

📝 Suggested improvement
 g.By("Running pcs resource cleanup to unblock etcd-clone if Pacemaker hit migration-threshold")
 if output, cleanupErr := exutil.DebugNodeRetryWithOptionsAndChroot(
     oc, execNode.Name, "default", "bash", "-c", "sudo pcs resource cleanup etcd-clone"); cleanupErr != nil {
     framework.Logf("Warning: pcs resource cleanup failed: %v\noutput: %s", cleanupErr, output)
+} else {
+    framework.Logf("PCS resource cleanup output: %s", output)
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/edge_topologies/tnf_etcd_disruption.go` around lines 535 - 543,
Update the pcs resource cleanup block that calls
exutil.DebugNodeRetryWithOptionsAndChroot (the call using oc, execNode.Name and
"sudo pcs resource cleanup etcd-clone") to log the command output on success as
well as on error; i.e., after the call, always call framework.Logf with a
concise success message and the captured output when cleanupErr is nil, and keep
the existing warning log when cleanupErr is non-nil so diagnostics are
consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/extended/edge_topologies/tnf_etcd_disruption.go`:
- Around line 535-543: Update the pcs resource cleanup block that calls
exutil.DebugNodeRetryWithOptionsAndChroot (the call using oc, execNode.Name and
"sudo pcs resource cleanup etcd-clone") to log the command output on success as
well as on error; i.e., after the call, always call framework.Logf with a
concise success message and the captured output when cleanupErr is nil, and keep
the existing warning log when cleanupErr is non-nil so diagnostics are
consistent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4ffeaed2-43be-4d3e-819c-258eef5d5aab

📥 Commits

Reviewing files that changed from the base of the PR and between 6b33871 and a69bbb6.

📒 Files selected for processing (1)
  • test/extended/edge_topologies/tnf_etcd_disruption.go

@openshift-ci openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 9, 2026
@lucaconsalvi

Copy link
Copy Markdown
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery

@openshift-ci

openshift-ci Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

@lucaconsalvi: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-1of3
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-2of3
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-3of3

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/dbf33f20-64a1-11f1-99be-5db787e1433a-0

@lucaconsalvi lucaconsalvi marked this pull request as ready for review June 10, 2026 13:36
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 10, 2026
@openshift-ci openshift-ci Bot requested review from eggfoobar and jeff-roche June 10, 2026 13:38
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-metal-ovn-two-node-arbiter
/test e2e-metal-ovn-two-node-fencing
/test e2e-metal-ovn-two-node-fencing-recovery

@lucaconsalvi lucaconsalvi changed the title test/etcd: fix flaky container-kill recovery test by adding pcs cleanup OCPEDGE-2700: test/etcd: fix flaky container-kill recovery test by adding pcs cleanup Jun 10, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 10, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 10, 2026

Copy link
Copy Markdown

@lucaconsalvi: This pull request references OCPEDGE-2700 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Problem

The test should coordinate recovery with peer when local etcd container is killed flakes at ~1.9% (4 failures / 212 CI runs, seen on both PR payload runs and periodic nightlies).

All failures are identical: 600s timeout waiting for etcd-clone to be Started on both nodes. The failure log shows:

Failed Resource Actions:
 * etcd start on master-1 returned 'error' after 2m1.131s
 * etcd 30s-interval monitor on master-0 returned 'error' (master-0 must force a new cluster)

Root cause

When podman kill etcd is run on one node, the podman-etcd resource agent coordinates recovery via the force_new_cluster protocol:

  1. The peer node's monitor detects a leadership loss → sets force_new_cluster → monitor fails
  2. The peer restarts etcd with --force-new-cluster
  3. The killed node's start action retries until it can rejoin as a learner

The race: step 3 (start action) can fail multiple times before step 1–2 completes. If it hits Pacemaker's migration-threshold (default: 3 failures) before the peer is ready, Pacemaker stops scheduling the resource entirely. The longRecoveryTimeout (10 min) then expires waiting for a recovery that Pacemaker won't attempt.

Fix

Restructure the test to capture the "Failed Resource Actions" evidence before running pcs resource cleanup, then let cleanup unblock Pacemaker:

kill etcd
 → wait cluster healthy          (first Eventually)
 → assert Failed Resource Actions on both nodes   ← moved here
 → pcs resource cleanup etcd-clone               ← new step
 → wait pcs Started on both nodes (second Eventually)
 → verifyFinalClusterHealth

By the time the cluster health check passes, the coordinated recovery has already run — both the peer's force_new_cluster monitor failure and the killed node's start failures are recorded in pcs status. We capture that evidence, then reset the failure count so Pacemaker can retry against the now-healthy peer cluster.

In the non-flaky path (98.1% of runs), cleanup is a no-op and the timing of the existing assertion is simply moved earlier.

Verification

  • Passes locally on TNF cluster (~400s)
  • Payload CI: first run 2/3 PASS, 1/3 infra flake (504 timeout during cluster setup — test never ran); second run 2/3 PASS, 1/3 FAIL due to unrelated Monitor test failures (missing SCC annotations, kubelet lease backoffs — our test passed in that shard)

Fixes: https://redhat.atlassian.net/browse/OCPEDGE-2700

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@lucaconsalvi lucaconsalvi changed the title OCPEDGE-2700: test/etcd: fix flaky container-kill recovery test by adding pcs cleanup OCPEDGE-2700: fix flaky container-kill recovery test by adding pcs cleanup Jun 10, 2026
@lucaconsalvi

Copy link
Copy Markdown
Contributor Author

/retest

@openshift-trt

openshift-trt Bot commented Jun 11, 2026

Copy link
Copy Markdown

Job Failure Risk Analysis for sha: ff7fde9

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift High
install should succeed: other
This test has passed 99.39% of 3782 runs on release 5.0 [Overall] in the last week.

@lucaconsalvi

Copy link
Copy Markdown
Contributor Author

/retest-required

@dhensel-rh

Copy link
Copy Markdown
Contributor

@lucaconsalvi We are capturing the failure, and running the necessary command to clean up the cluster so it is not left in a bad state. How do we make sure this does not result in missing failures if a real problem introduced ?

@lucaconsalvi

Copy link
Copy Markdown
Contributor Author

@dhensel-rh Thanks for the observation! The cleanup could mask a real start-action failure. I've reworked the approach: instead of running pcs resource cleanup mid-test, I now raise migration-threshold to INFINITY on etcd-clone before the container kill, and restore it in DeferCleanup/AfterEach.
This way Pacemaker never stops retrying during the transient coordination window, so the test purely observes recovery without intervening. If a future code change genuinely breaks the start action, etcd-clone will never reach Started and the test times out — no cleanup to hide it.
I'm testing it locally to see if this could be a better solution!

@lucaconsalvi lucaconsalvi marked this pull request as draft June 12, 2026 09:17
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 12, 2026
…shold

Raise migration-threshold to INFINITY on the etcd resource before
killing the container, so Pacemaker never stops retrying during the
transient coordination window. This replaces the previous approach
of running pcs resource cleanup mid-test, which could mask real
start-action failures.

The test now purely observes recovery without intervening. If the
resource agent has a genuine bug, etcd-clone will never start and
the test times out.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lucaconsalvi lucaconsalvi force-pushed the tnf-etcd-recovery-flake-fix branch from ff7fde9 to 4888122 Compare June 12, 2026 16:24
@lucaconsalvi lucaconsalvi marked this pull request as ready for review June 12, 2026 16:28
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 12, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/edge_topologies/tnf_etcd_disruption.go`:
- Around line 207-216: getMigrationThreshold currently masks read failures by
returning an empty string, which causes the cleanup code to wrongly delete the
migration-threshold; change getMigrationThreshold to return (string, error)
(preserve the call signature using oc *exutil.CLI and nodeName) and propagate
the actual error instead of returning ""; then update the cleanup logic that
performs the migration-threshold deletion to check the error and either
fail/abort the cleanup or skip deleting migration-threshold when the read failed
so the original cluster config is preserved.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 97b7db45-7e31-4bbf-bad2-708ef0ef063d

📥 Commits

Reviewing files that changed from the base of the PR and between ff7fde9 and 4888122.

📒 Files selected for processing (1)
  • test/extended/edge_topologies/tnf_etcd_disruption.go

Comment on lines +207 to +216
func getMigrationThreshold(oc *exutil.CLI, nodeName string) string {
output, err := exutil.DebugNodeRetryWithOptionsAndChroot(
oc, nodeName, "default", "bash", "-c",
"sudo pcs resource config etcd-clone 2>/dev/null | grep migration-threshold | grep -oP 'migration-threshold=\\K\\S+'")
if err != nil {
framework.Logf("Warning: could not read migration-threshold: %v (output: %s)", err, output)
return ""
}
return strings.TrimSpace(output)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve original cluster config by failing when baseline threshold read fails.

At Line 207, getMigrationThreshold collapses read failures to "". That feeds Line 620 cleanup logic, which then deletes migration-threshold as if it was originally unset. A transient read failure can therefore mutate persisted Pacemaker config after the test.

Suggested fix
-func getMigrationThreshold(oc *exutil.CLI, nodeName string) string {
+func getMigrationThreshold(oc *exutil.CLI, nodeName string) (string, error) {
 	output, err := exutil.DebugNodeRetryWithOptionsAndChroot(
 		oc, nodeName, "default", "bash", "-c",
 		"sudo pcs resource config etcd-clone 2>/dev/null | grep migration-threshold | grep -oP 'migration-threshold=\\K\\S+'")
 	if err != nil {
-		framework.Logf("Warning: could not read migration-threshold: %v (output: %s)", err, output)
-		return ""
+		return "", fmt.Errorf("could not read migration-threshold: %v (output: %s)", err, output)
 	}
-	return strings.TrimSpace(output)
+	return strings.TrimSpace(output), nil
}
-		originalThreshold := getMigrationThreshold(oc, execNode.Name)
+		originalThreshold, err := getMigrationThreshold(oc, execNode.Name)
+		o.Expect(err).NotTo(o.HaveOccurred(), "Must read existing migration-threshold before mutating it")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/edge_topologies/tnf_etcd_disruption.go` around lines 207 - 216,
getMigrationThreshold currently masks read failures by returning an empty
string, which causes the cleanup code to wrongly delete the migration-threshold;
change getMigrationThreshold to return (string, error) (preserve the call
signature using oc *exutil.CLI and nodeName) and propagate the actual error
instead of returning ""; then update the cleanup logic that performs the
migration-threshold deletion to check the error and either fail/abort the
cleanup or skip deleting migration-threshold when the read failed so the
original cluster config is preserved.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-metal-ovn-two-node-arbiter
/test e2e-metal-ovn-two-node-fencing
/test e2e-metal-ovn-two-node-fencing-recovery

@openshift-ci

openshift-ci Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

@lucaconsalvi: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-metal-ovn-two-node-fencing-recovery 4888122 link false /test e2e-metal-ovn-two-node-fencing-recovery

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-trt

openshift-trt Bot commented Jun 12, 2026

Copy link
Copy Markdown

Job Failure Risk Analysis for sha: 4888122

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-gcp-ovn Low
[Feature:NetworkSegmentation][ovn-kubernetes-ote][sig-network] Network Segmentation: services on a user defined primary network should be reachable through their cluster IP, node port and load balancer L2 primary UDN with custom network, cluster-networked pods, NodePort service [Suite:openshift/conformance/parallel]
This test has passed 0.00% of 8 runs on release 5.0 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:standard Network:ovn NetworkStack:ipv4 OS:rhcos9 Owner:eng Platform:gcp Procedure:none SecurityMode:default Topology:ha Upgrade:micro] in the last week.
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-ipv6 Low
[Feature:NetworkSegmentation][ovn-kubernetes-ote][sig-network] Network Segmentation: services on a user defined primary network should be reachable through their cluster IP, node port and load balancer L2 primary UDN with custom network, cluster-networked pods, NodePort service [Suite:openshift/conformance/parallel]
This test has passed 0.00% of 5 runs on release 5.0 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:standard Network:ovn NetworkStack:ipv6 OS:rhcos9 Owner:eng Platform:metal Procedure:none SecurityMode:default Topology:ha Upgrade:micro] in the last week.
pull-ci-openshift-origin-main-e2e-metal-ovn-two-node-arbiter Low
[Feature:NetworkSegmentation][ovn-kubernetes-ote][sig-network] Network Segmentation: services on a user defined primary network should be reachable through their cluster IP, node port and load balancer L2 primary UDN with custom network, cluster-networked pods, NodePort service [Suite:openshift/conformance/parallel]
This test has passed 0.00% of 6 runs on release 5.0 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:candidate Network:ovn NetworkStack:ipv4 OS:rhcos9 Owner:eng Platform:metal Procedure:none SecurityMode:default Topology:two-node-arbiter Upgrade:none] in the last week.
pull-ci-openshift-origin-main-e2e-metal-ovn-two-node-fencing Low
[Feature:NetworkSegmentation][ovn-kubernetes-ote][sig-network] Network Segmentation: services on a user defined primary network should be reachable through their cluster IP, node port and load balancer L2 primary UDN with custom network, cluster-networked pods, NodePort service [Suite:openshift/conformance/parallel]
This test has passed 0.00% of 19 runs on release 5.0 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:candidate Network:ovn NetworkStack:ipv4 OS:rhcos9 Owner:eng Platform:metal Procedure:none SecurityMode:default Topology:two-node-fencing Upgrade:none] in the last week.
pull-ci-openshift-origin-main-e2e-vsphere-ovn Low
[Feature:NetworkSegmentation][ovn-kubernetes-ote][sig-network] Network Segmentation: services on a user defined primary network should be reachable through their cluster IP, node port and load balancer L2 primary UDN with custom network, cluster-networked pods, NodePort service [Suite:openshift/conformance/parallel]
This test has passed 0.00% of 18 runs on release 5.0 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:standard Network:ovn NetworkStack:ipv4 OS:rhcos9 Owner:eng Platform:vsphere Procedure:none SecurityMode:default Topology:ha Upgrade:none] in the last week.
pull-ci-openshift-origin-main-e2e-vsphere-ovn-upi Low
[Feature:NetworkSegmentation][ovn-kubernetes-ote][sig-network] Network Segmentation: services on a user defined primary network should be reachable through their cluster IP, node port and load balancer L2 primary UDN with custom network, cluster-networked pods, NodePort service [Suite:openshift/conformance/parallel]
This test has passed 0.00% of 11 runs on release 5.0 [Architecture:amd64 FeatureSet:default Installer:upi JobTier:standard Network:ovn NetworkStack:ipv4 OS:rhcos9 Owner:eng Platform:vsphere Procedure:none SecurityMode:default Topology:ha Upgrade:none] in the last week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants