OCPBUGS-85782: Add extra wait after mco and cvo considers upgrade complete#31289
OCPBUGS-85782: Add extra wait after mco and cvo considers upgrade complete#31289xueqzhan wants to merge 1 commit into
Conversation
This is due to the fact that MCO does not wait for worker pool before declaring upgrade complete. So when the remaining tests started running, some daemonset were still not ready.
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
@xueqzhan: This pull request references Jira Issue OCPBUGS-85782, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughE2E upgrade test suite now imports a cluster info utility and replaces a fixed 5-second sleep after upgrade steps with an active wait for cluster stabilization. If stabilization times out or fails, the test logs a warning instead of failing, improving test robustness. ChangesUpgrade Cluster Stabilization
🎯 2 (Simple) | ⏱️ ~8 minutes 🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@xueqzhan: This pull request references Jira Issue OCPBUGS-85782, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: xueqzhan The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/e2e/upgrade/upgrade.go (1)
196-203: ⚡ Quick winPost-upgrade stabilization wait looks good.
The approach of waiting for cluster stability with
WaitForStableClusterafter upgrade completion addresses the issue where MCO reports success before worker nodes finish rebooting and workloads restart. Logging a warning on timeout without failing the test is appropriate and matches the existing pattern shown inpkg/test/ginkgo/cmd_runsuite.go.💡 Optional: Capture detailed failure information for better diagnostics
The discarded JUnit test cases (first return value) contain detailed information about which cluster operators failed to stabilize, recovery status, and timing. While the current "warning only" approach is reasonable, capturing and logging these details would provide better diagnostics for intermittent issues:
- if _, waitErr := clusterinfo.WaitForStableCluster(context.Background(), config); waitErr != nil { - framework.Logf("WARNING: cluster did not stabilize within timeout after upgrade: %v", waitErr) + if testCases, waitErr := clusterinfo.WaitForStableCluster(context.Background(), config); waitErr != nil { + msg := fmt.Sprintf("WARNING: cluster did not stabilize within timeout after upgrade: %v", waitErr) + for _, tc := range testCases { + if tc.FailureOutput != nil { + msg += fmt.Sprintf("\n%s", tc.FailureOutput.Output) + } + } + framework.Logf("%s", msg) }Based on learnings from Context snippet 1, the test cases include summaries of unstable operators that never recovered and those that recovered with timing information.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/upgrade/upgrade.go` around lines 196 - 203, The test currently calls clusterinfo.WaitForStableCluster and discards the first return value (the detailed JUnit test cases); update the call to capture both returns (e.g., tests, waitErr := clusterinfo.WaitForStableCluster(...)) and when waitErr != nil log the detailed test case summaries (tests or their summaries) along with the existing warning so diagnostic info about which cluster operators failed or recovered is preserved; ensure you reference the existing framework.Logf call to emit those details and keep the test behavior (not failing) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/e2e/upgrade/upgrade.go`:
- Around line 196-203: The test currently calls clusterinfo.WaitForStableCluster
and discards the first return value (the detailed JUnit test cases); update the
call to capture both returns (e.g., tests, waitErr :=
clusterinfo.WaitForStableCluster(...)) and when waitErr != nil log the detailed
test case summaries (tests or their summaries) along with the existing warning
so diagnostic info about which cluster operators failed or recovered is
preserved; ensure you reference the existing framework.Logf call to emit those
details and keep the test behavior (not failing) unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 468798ef-3681-45c7-a485-9aa2d1a36c47
📒 Files selected for processing (1)
test/e2e/upgrade/upgrade.go
|
Scheduling required tests: Scheduling tests matching the |
|
/retest-required |
|
@xueqzhan: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
This is due to the fact that MCO does not wait for worker pool before declaring upgrade complete. So when the remaining tests started running, some daemonset were still not ready.
Summary by CodeRabbit