OPNET-603: Fail haproxy test when all kubeapi servers are down simultaneously#31291
OPNET-603: Fail haproxy test when all kubeapi servers are down simultaneously#31291mkowalski wants to merge 1 commit into
Conversation
…aneously The on-prem haproxy monitor test so far only flaked when haproxy detected individual kube-apiserver backends as down, which is expected over the course of installation or upgrade. What really matters is whether at any point in time a single haproxy instance saw all kube-apiserver backends down at the same time, because that means the API was completely unreachable through the on-prem loadbalancer. Add a sweep over the constructed OnPremHaproxyDetectsDown intervals that finds, per haproxy instance, the time windows during which at least three distinct backends were down simultaneously. The first such window for every haproxy instance is tolerated, because when haproxy starts during the installation all kube-apiservers are expected to be down until they come up for the first time. Any subsequent window fails the new junit test. Implements Ben's suggestion from openshift#29149 (comment)
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
@mkowalski: This pull request references OPNET-603 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
WalkthroughThe PR extends on-prem HAProxy network monitoring to detect full kube-apiserver backend outages. It adds interval-window computation logic that groups backend-down events by HAProxy node, sweeps time to find concurrent-backend outage windows, merges overlapping windows, and appends JUnit test results that fail only when multiple backends are simultaneously unavailable after the startup phase. ChangesFull API Outage Detection for HAProxy
🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mkowalski The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Scheduling required tests: |
|
@mkowalski: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
Risk analysis has seen new tests most likely introduced by this PR. New tests seen in this PR at sha: 871e3e8
|
Implements https://issues.redhat.com/browse/OPNET-603 (mirrored at https://redhat.atlassian.net/browse/OPNET-603).
The on-prem haproxy monitor test added in #29149 so far only flaked when haproxy detected individual kube-apiserver backends as down, which is expected over the course of installation or upgrade. As suggested in #29149 (comment), what really matters is whether a single haproxy instance saw all kube-apiserver backends down at the same time, because that means the API was completely unreachable through the on-prem loadbalancer.
Changes:
findFullAPIOutageWindows(): sweeps over the constructedOnPremHaproxyDetectsDownintervals per haproxy instance, counting how many distinct kube-apiserver backends are concurrently down; windows with at least 3 backends down (on-prem HA control plane size) are full API outages. A backend recovering the same second another goes down is not counted as overlap, and touching windows are merged so a same-second flap does not split one outage into two.evaluateFullAPIOutages(): new junit test[Jira: Networking / On-Prem Host Networking] Haproxy must not detect all kubeapi servers down simultaneously. The first outage window for every haproxy instance is tolerated, since when haproxy starts during the installation all kube-apiservers are expected to be down until they come up for the first time. Any subsequent window fails the test with the node, time range and duration in the output.EvaluateTestsFromConstructedIntervals.Verified with
go build,go vetandgo test ./pkg/monitortests/network/onpremhaproxy/...(all tests pass).Summary by CodeRabbit
New Features
Tests