OCPEDGE-2700: fix flaky container-kill recovery test by adding pcs cleanup#31276
OCPEDGE-2700: fix flaky container-kill recovery test by adding pcs cleanup#31276lucaconsalvi wants to merge 2 commits into
Conversation
When a local etcd container is killed, the peer sets force_new_cluster and etcd recovers. In rare cases (~1.9% of runs) the killed node's repeated start failures hit Pacemaker's migration-threshold before coordination completes, causing Pacemaker to stop scheduling the resource and the 10-minute Eventually to expire. The "Failed Resource Actions" evidence (which the test asserts) is captured immediately after the cluster self-heals — at that point both nodes are guaranteed to appear since the peer's force_new_cluster monitor failure and the killed node's start failures have already been recorded. After capturing the evidence, pcs resource cleanup resets the failure count so Pacemaker can retry the start action against the now-healthy peer cluster. Fixes: https://redhat.atlassian.net/browse/OCPEDGE-2700 Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
Skipping CI for Draft Pull Request. |
WalkthroughTest adds helpers to read, set, and restore Pacemaker’s ChangesEtcd disruption test robustness improvements
Sequence Diagram(s)sequenceDiagram
participant Test
participant Pacemaker
participant NodeA
participant NodeB
Test->>Pacemaker: read migration-threshold(etcd-clone)
Test->>Pacemaker: set migration-threshold=INFINITY
Test->>NodeA: kill local etcd container
NodeA->>Test: report etcd recovered
Test->>Pacemaker: verify etcd-clone Started on NodeA and NodeB
Test->>Pacemaker: restore migration-threshold(original or delete override)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error, 1 warning, 1 inconclusive)
✅ Passed checks (12 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: lucaconsalvi The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery |
|
@lucaconsalvi: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/98519310-6420-11f1-85a8-bef173c7a909-0 |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/extended/edge_topologies/tnf_etcd_disruption.go (1)
535-543: ⚡ Quick winConsider logging cleanup output on success for consistency and diagnostics.
The AfterEach cleanup at Line 337-343 logs the
pcs resource cleanupoutput even on success (Line 342). For consistency and better observability when investigating test runs, consider logging the output here too.📝 Suggested improvement
g.By("Running pcs resource cleanup to unblock etcd-clone if Pacemaker hit migration-threshold") if output, cleanupErr := exutil.DebugNodeRetryWithOptionsAndChroot( oc, execNode.Name, "default", "bash", "-c", "sudo pcs resource cleanup etcd-clone"); cleanupErr != nil { framework.Logf("Warning: pcs resource cleanup failed: %v\noutput: %s", cleanupErr, output) +} else { + framework.Logf("PCS resource cleanup output: %s", output) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/extended/edge_topologies/tnf_etcd_disruption.go` around lines 535 - 543, Update the pcs resource cleanup block that calls exutil.DebugNodeRetryWithOptionsAndChroot (the call using oc, execNode.Name and "sudo pcs resource cleanup etcd-clone") to log the command output on success as well as on error; i.e., after the call, always call framework.Logf with a concise success message and the captured output when cleanupErr is nil, and keep the existing warning log when cleanupErr is non-nil so diagnostics are consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/extended/edge_topologies/tnf_etcd_disruption.go`:
- Around line 535-543: Update the pcs resource cleanup block that calls
exutil.DebugNodeRetryWithOptionsAndChroot (the call using oc, execNode.Name and
"sudo pcs resource cleanup etcd-clone") to log the command output on success as
well as on error; i.e., after the call, always call framework.Logf with a
concise success message and the captured output when cleanupErr is nil, and keep
the existing warning log when cleanupErr is non-nil so diagnostics are
consistent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 4ffeaed2-43be-4d3e-819c-258eef5d5aab
📒 Files selected for processing (1)
test/extended/edge_topologies/tnf_etcd_disruption.go
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery |
|
@lucaconsalvi: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/dbf33f20-64a1-11f1-99be-5db787e1433a-0 |
|
Scheduling required tests: Scheduling tests matching the |
|
@lucaconsalvi: This pull request references OCPEDGE-2700 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest |
|
Job Failure Risk Analysis for sha: ff7fde9
|
|
/retest-required |
|
@lucaconsalvi We are capturing the failure, and running the necessary command to clean up the cluster so it is not left in a bad state. How do we make sure this does not result in missing failures if a real problem introduced ? |
|
@dhensel-rh Thanks for the observation! The cleanup could mask a real start-action failure. I've reworked the approach: instead of running pcs resource cleanup mid-test, I now raise migration-threshold to INFINITY on etcd-clone before the container kill, and restore it in DeferCleanup/AfterEach. |
…shold Raise migration-threshold to INFINITY on the etcd resource before killing the container, so Pacemaker never stops retrying during the transient coordination window. This replaces the previous approach of running pcs resource cleanup mid-test, which could mask real start-action failures. The test now purely observes recovery without intervening. If the resource agent has a genuine bug, etcd-clone will never start and the test times out. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ff7fde9 to
4888122
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/extended/edge_topologies/tnf_etcd_disruption.go`:
- Around line 207-216: getMigrationThreshold currently masks read failures by
returning an empty string, which causes the cleanup code to wrongly delete the
migration-threshold; change getMigrationThreshold to return (string, error)
(preserve the call signature using oc *exutil.CLI and nodeName) and propagate
the actual error instead of returning ""; then update the cleanup logic that
performs the migration-threshold deletion to check the error and either
fail/abort the cleanup or skip deleting migration-threshold when the read failed
so the original cluster config is preserved.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 97b7db45-7e31-4bbf-bad2-708ef0ef063d
📒 Files selected for processing (1)
test/extended/edge_topologies/tnf_etcd_disruption.go
| func getMigrationThreshold(oc *exutil.CLI, nodeName string) string { | ||
| output, err := exutil.DebugNodeRetryWithOptionsAndChroot( | ||
| oc, nodeName, "default", "bash", "-c", | ||
| "sudo pcs resource config etcd-clone 2>/dev/null | grep migration-threshold | grep -oP 'migration-threshold=\\K\\S+'") | ||
| if err != nil { | ||
| framework.Logf("Warning: could not read migration-threshold: %v (output: %s)", err, output) | ||
| return "" | ||
| } | ||
| return strings.TrimSpace(output) | ||
| } |
There was a problem hiding this comment.
Preserve original cluster config by failing when baseline threshold read fails.
At Line 207, getMigrationThreshold collapses read failures to "". That feeds Line 620 cleanup logic, which then deletes migration-threshold as if it was originally unset. A transient read failure can therefore mutate persisted Pacemaker config after the test.
Suggested fix
-func getMigrationThreshold(oc *exutil.CLI, nodeName string) string {
+func getMigrationThreshold(oc *exutil.CLI, nodeName string) (string, error) {
output, err := exutil.DebugNodeRetryWithOptionsAndChroot(
oc, nodeName, "default", "bash", "-c",
"sudo pcs resource config etcd-clone 2>/dev/null | grep migration-threshold | grep -oP 'migration-threshold=\\K\\S+'")
if err != nil {
- framework.Logf("Warning: could not read migration-threshold: %v (output: %s)", err, output)
- return ""
+ return "", fmt.Errorf("could not read migration-threshold: %v (output: %s)", err, output)
}
- return strings.TrimSpace(output)
+ return strings.TrimSpace(output), nil
}- originalThreshold := getMigrationThreshold(oc, execNode.Name)
+ originalThreshold, err := getMigrationThreshold(oc, execNode.Name)
+ o.Expect(err).NotTo(o.HaveOccurred(), "Must read existing migration-threshold before mutating it")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/extended/edge_topologies/tnf_etcd_disruption.go` around lines 207 - 216,
getMigrationThreshold currently masks read failures by returning an empty
string, which causes the cleanup code to wrongly delete the migration-threshold;
change getMigrationThreshold to return (string, error) (preserve the call
signature using oc *exutil.CLI and nodeName) and propagate the actual error
instead of returning ""; then update the cleanup logic that performs the
migration-threshold deletion to check the error and either fail/abort the
cleanup or skip deleting migration-threshold when the read failed so the
original cluster config is preserved.
|
Scheduling required tests: Scheduling tests matching the |
|
@lucaconsalvi: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
Job Failure Risk Analysis for sha: 4888122
|
Problem
The test
should coordinate recovery with peer when local etcd container is killedflakes at ~1.9% (4 failures / 212 CI runs, seen on both PR payload runs and periodic nightlies).All failures are identical: 600s timeout waiting for etcd-clone to be Started on both nodes. The failure log shows:
Root cause
When
podman kill etcdis run on one node, thepodman-etcdresource agent coordinates recovery via the force_new_cluster protocol:force_new_cluster→ monitor fails--force-new-clusterThe race: step 3 (start action) can fail multiple times before step 1–2 completes. If it hits Pacemaker's
migration-threshold(default: 3 failures) before the peer is ready, Pacemaker stops scheduling the resource entirely. ThelongRecoveryTimeout(10 min) then expires waiting for a recovery that Pacemaker won't attempt.Fix
Restructure the test to capture the "Failed Resource Actions" evidence before running
pcs resource cleanup, then let cleanup unblock Pacemaker:By the time the cluster health check passes, the coordinated recovery has already run — both the peer's force_new_cluster monitor failure and the killed node's start failures are recorded in pcs status. We capture that evidence, then reset the failure count so Pacemaker can retry against the now-healthy peer cluster.
In the non-flaky path (98.1% of runs), cleanup is a no-op and the timing of the existing assertion is simply moved earlier.
Verification
Fixes: https://redhat.atlassian.net/browse/OCPEDGE-2700
Summary by CodeRabbit
Release Notes