Skip to content

OPNET-602: Fail haproxy monitor test when log collection fails#31292

Open
mkowalski wants to merge 1 commit into
openshift:mainfrom
mkowalski:opnet-602-haproxy-collection-failure
Open

OPNET-602: Fail haproxy monitor test when log collection fails#31292
mkowalski wants to merge 1 commit into
openshift:mainfrom
mkowalski:opnet-602-haproxy-collection-failure

Conversation

@mkowalski

@mkowalski mkowalski commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Implements https://issues.redhat.com/browse/OPNET-602 (mirrored at https://redhat.atlassian.net/browse/OPNET-602).

As suggested in #29149 (comment), the haproxy reachability test itself should flake, but a failed test should mean collection errors. So far the on-prem haproxy monitor test could not distinguish between "no data because the platform has no haproxy" and "no data because collection silently broke": on AWS and other cloud platforms it produced vacuous test passes, and on on-prem clusters a renamed namespace or label would silently disable the test forever. This also answers the follow-up question in that thread about a programmatic platform.isOnPrem() check.

Changes:

  • notSupportedPlatformReason() (new platform.go): the monitor test is supported only where the haproxy api-lb static pods are actually deployed - platform BareMetal/OpenStack/vSphere with an API VIP configured and the OpenShiftManagedDefault loadbalancer. vSphere UPI (no VIP), user-managed LB deployments and all other platforms are not.
  • PrepareCollection reads the Infrastructure config and stores a monitortestframework.NotSupportedError, which all phases return on unsupported platforms - the framework turns this into junit skips instead of today's vacuous passes. Clusters without the infrastructure config (e.g. MicroShift) are skipped via the IsNotFound check.
  • On supported platforms, scanAllOperatorPods now returns an error when it finds no haproxy pods at all, guarding against silent rot of the namespace/label assumptions. Together with the errors already returned when reading pod logs fails, the framework converts those into a failed junit for the collection phase ("failed during collection"), which fails the run - a failed test now always means collection errors.
  • Unit tests for the platform gating.

Verified with go build, go vet and go test ./pkg/monitortests/network/onpremhaproxy/... (all tests pass).

Summary by CodeRabbit

  • Bug Fixes

    • Improved platform detection for on-premises API loadbalancer monitoring to properly identify unsupported platforms and report reasons accordingly.
    • Enhanced error handling to fail collection when HAProxy pods are not found, replacing silent failures.
  • Tests

    • Added comprehensive test coverage for platform compatibility validation across various cluster configurations.

So far the on-prem haproxy monitor test could not distinguish between
"no data because the platform has no haproxy" and "no data because
collection silently broke". On AWS and other cloud platforms it
produced vacuous test passes, and on on-prem clusters a renamed
namespace or label would silently disable the test forever.

Gate the monitor test on the platform: read the Infrastructure config
in PrepareCollection and skip all phases with NotSupportedError unless
the cluster is BareMetal/OpenStack/vSphere with an API VIP and the
OpenShift-managed loadbalancer, i.e. the deployments that run the
haproxy api-lb static pods. Clusters without the infrastructure config
(e.g. MicroShift) are skipped as well.

On supported platforms finding no haproxy pods at all is now reported
as an error. Together with the errors already returned when reading
the logs fails, the framework turns those into a failed junit for the
collection phase, so a failed test now always means collection errors.

Implements Vadim's suggestion from
openshift#29149 (comment)
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 11, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 11, 2026

Copy link
Copy Markdown

@mkowalski: This pull request references OPNET-602 which is a valid jira issue.

Details

In response to this:

Implements https://issues.redhat.com/browse/OPNET-602 (mirrored at https://redhat.atlassian.net/browse/OPNET-602).

As suggested in #29149 (comment), the haproxy reachability test itself should flake, but a failed test should mean collection errors. So far the on-prem haproxy monitor test could not distinguish between "no data because the platform has no haproxy" and "no data because collection silently broke": on AWS and other cloud platforms it produced vacuous test passes, and on on-prem clusters a renamed namespace or label would silently disable the test forever. This also answers the follow-up question in that thread about a programmatic platform.isOnPrem() check.

Changes:

  • notSupportedPlatformReason() (new platform.go): the monitor test is supported only where the haproxy api-lb static pods are actually deployed - platform BareMetal/OpenStack/vSphere with an API VIP configured and the OpenShiftManagedDefault loadbalancer. vSphere UPI (no VIP), user-managed LB deployments and all other platforms are not.
  • PrepareCollection reads the Infrastructure config and stores a monitortestframework.NotSupportedError, which all phases return on unsupported platforms - the framework turns this into junit skips instead of today's vacuous passes. Clusters without the infrastructure config (e.g. MicroShift) are skipped via the IsNotFound check.
  • On supported platforms, scanAllOperatorPods now returns an error when it finds no haproxy pods at all, guarding against silent rot of the namespace/label assumptions. Together with the errors already returned when reading pod logs fails, the framework converts those into a failed junit for the collection phase ("failed during collection"), which fails the run - a failed test now always means collection errors.
  • Unit tests for the platform gating.

Verified with go build, go vet and go test ./pkg/monitortests/network/onpremhaproxy/... (all tests pass).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown

Walkthrough

This PR adds platform detection to skip the on-premises HA proxy monitor test when the cluster infrastructure does not support OpenShift-managed on-prem API loadbalancing. It introduces a helper function to check platform eligibility, adds comprehensive test coverage, and integrates the check into the monitor test collection pipeline to short-circuit early when unsupported.

Changes

On-Premises Monitor Test Platform Gating

Layer / File(s) Summary
Platform detection helper and unit tests
pkg/monitortests/network/onpremhaproxy/platform.go, pkg/monitortests/network/onpremhaproxy/platform_test.go
notSupportedPlatformReason inspects cluster infrastructure platform type and API VIP configuration to determine on-prem HA proxy monitor eligibility. Table-driven test covers BareMetal, OpenStack, vSphere, AWS, Nutanix variants, API VIP fields, and load balancer configurations.
Monitor test pipeline integration
pkg/monitortests/network/onpremhaproxy/monitortest.go
operatorLogAnalyzer fetches cluster Infrastructure config during PrepareCollection and uses platform detection to set notSupportedReason. All downstream phases (StartCollection, CollectData, ConstructComputedIntervals, EvaluateTestsFromConstructedIntervals, WriteContentToStorage) return early when unsupported. Pod scanning now fails explicitly when no haproxy pods are found.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (14 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main change: making the haproxy monitor test fail when log collection fails, which aligns with the PR's primary objective of distinguishing failures from unsupported platforms.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR contains no Ginkgo tests. The only test added is a standard Go table-driven test with static, deterministic names unrelated to Ginkgo framework.
Test Structure And Quality ✅ Passed PR adds standard Go unit tests, not Ginkgo tests. Table-driven test with 14 focused subtests; each tests one platform scenario. Assertion messages are descriptive. No cluster resources to setup/cle...
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR. The only new test file (platform_test.go) uses standard Go testing (testing.T), not Ginkgo framework. The check does not apply.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests (It(), Describe(), Context(), When()) added; PR contains only standard Go unit tests and monitor framework code, so SNO compatibility check is not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed PR contains only monitoring test code with no deployment manifests, scheduling constraints, replica assumptions, or topology-specific configurations that would affect workload scheduling.
Ote Binary Stdout Contract ✅ Passed No stdout writes found in process-level code. Files contain only interface implementations, helper functions, and unit tests with no main(), init(), BeforeSuite(), or similar process-level functions.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR. The new platform_test.go is a standard Go unit test (using testing.T), not a Ginkgo e2e test, so the check does not apply.
No-Weak-Crypto ✅ Passed No weak cryptographic algorithms (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-time secret comparisons detected. The PR modifies haproxy monitoring logi...
Container-Privileges ✅ Passed PR contains only Go source files with no container/Kubernetes manifests or privileged container configurations to flag.
No-Sensitive-Data-In-Logs ✅ Passed All logging in the PR uses safe, non-sensitive data: only node names, platform/LB types, and parsed hostnames/IPs via restricted regex. No passwords, tokens, API keys, PII, or customer data are log...

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from deads2k and sjenning June 11, 2026 16:13
@openshift-ci

openshift-ci Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mkowalski

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 11, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/monitortests/network/onpremhaproxy/platform_test.go (1)

67-142: ⚡ Quick win

Add LB-type coverage for OpenStack and vSphere branches.

Line 81 validates LoadBalancerTypeUserManaged only for BareMetal, but notSupportedPlatformReason applies the same LB-type gating to OpenStack and vSphere. Add equivalent cases so branch behavior can’t regress silently.

Proposed test additions
 		{
 			name: "openstack without API VIP",
 			platformStatus: &configv1.PlatformStatus{
 				Type:      configv1.OpenStackPlatformType,
 				OpenStack: &configv1.OpenStackPlatformStatus{},
 			},
 			expectSupported: false,
 		},
+		{
+			name: "openstack with user-managed loadbalancer",
+			platformStatus: &configv1.PlatformStatus{
+				Type: configv1.OpenStackPlatformType,
+				OpenStack: &configv1.OpenStackPlatformStatus{
+					APIServerInternalIPs: []string{"10.0.0.5"},
+					LoadBalancer: &configv1.OpenStackPlatformLoadBalancer{
+						Type: configv1.LoadBalancerTypeUserManaged,
+					},
+				},
+			},
+			expectSupported: false,
+		},
 		{
 			name: "vsphere IPI with API VIPs",
 			platformStatus: &configv1.PlatformStatus{
 				Type: configv1.VSpherePlatformType,
 				VSphere: &configv1.VSpherePlatformStatus{
 					APIServerInternalIPs: []string{"10.0.0.5"},
 				},
 			},
 			expectSupported: true,
 		},
+		{
+			name: "vsphere with user-managed loadbalancer",
+			platformStatus: &configv1.PlatformStatus{
+				Type: configv1.VSpherePlatformType,
+				VSphere: &configv1.VSpherePlatformStatus{
+					APIServerInternalIPs: []string{"10.0.0.5"},
+					LoadBalancer: &configv1.VSpherePlatformLoadBalancer{
+						Type: configv1.LoadBalancerTypeUserManaged,
+					},
+				},
+			},
+			expectSupported: false,
+		},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/monitortests/network/onpremhaproxy/platform_test.go` around lines 67 -
142, The table-driven tests in platform_test.go add load-balancer type checks
for BareMetal but miss equivalent cases for OpenStack and VSphere even though
notSupportedPlatformReason enforces LB-type gating for those platforms; update
the test vector in platform_test.go to include cases for OpenStack and VSphere
with a LoadBalancer set to LoadBalancerTypeUserManaged (expectSupported: false)
and with LoadBalancerTypeOpenShiftManagedDefault or an APIServerInternalIPs
entry (expectSupported: true) so the test covers LB-type gating the same way as
BareMetal and prevents silent regressions in notSupportedPlatformReason.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/monitortests/network/onpremhaproxy/monitortest.go`:
- Around line 106-111: The check currently only fails when len(infraPods) == 0,
but if pods exist and none include a haproxy container the test still passes;
update the logic after the haproxy container filter (the code that iterates
infraPods and picks containers named "haproxy") to count total haproxy
containers (e.g., totalHaproxyContainers) and return the same fmt.Errorf when
that count == 0; locate the filtering loop that references infraPods and the
haproxy container name and add the zero-container check there so the monitor
fails when no haproxy containers are scanned.

---

Nitpick comments:
In `@pkg/monitortests/network/onpremhaproxy/platform_test.go`:
- Around line 67-142: The table-driven tests in platform_test.go add
load-balancer type checks for BareMetal but miss equivalent cases for OpenStack
and VSphere even though notSupportedPlatformReason enforces LB-type gating for
those platforms; update the test vector in platform_test.go to include cases for
OpenStack and VSphere with a LoadBalancer set to LoadBalancerTypeUserManaged
(expectSupported: false) and with LoadBalancerTypeOpenShiftManagedDefault or an
APIServerInternalIPs entry (expectSupported: true) so the test covers LB-type
gating the same way as BareMetal and prevents silent regressions in
notSupportedPlatformReason.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 7a591e67-ccf4-424e-acd3-8bc551c392eb

📥 Commits

Reviewing files that changed from the base of the PR and between bde16f5 and b4a2d0f.

📒 Files selected for processing (3)
  • pkg/monitortests/network/onpremhaproxy/monitortest.go
  • pkg/monitortests/network/onpremhaproxy/platform.go
  • pkg/monitortests/network/onpremhaproxy/platform_test.go

Comment thread pkg/monitortests/network/onpremhaproxy/monitortest.go
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@openshift-ci openshift-ci Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label Jun 11, 2026
@openshift-ci

openshift-ci Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

@mkowalski: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-ovn-upgrade b4a2d0f link true /test e2e-gcp-ovn-upgrade
ci/prow/e2e-aws-ovn-microshift b4a2d0f link true /test e2e-aws-ovn-microshift
ci/prow/e2e-vsphere-ovn b4a2d0f link true /test e2e-vsphere-ovn
ci/prow/e2e-vsphere-ovn-upi b4a2d0f link true /test e2e-vsphere-ovn-upi
ci/prow/e2e-aws-ovn-serial-2of2 b4a2d0f link true /test e2e-aws-ovn-serial-2of2
ci/prow/e2e-gcp-ovn b4a2d0f link true /test e2e-gcp-ovn
ci/prow/e2e-aws-ovn-serial-1of2 b4a2d0f link true /test e2e-aws-ovn-serial-1of2
ci/prow/e2e-aws-ovn-fips b4a2d0f link true /test e2e-aws-ovn-fips
ci/prow/e2e-aws-ovn-microshift-serial b4a2d0f link true /test e2e-aws-ovn-microshift-serial
ci/prow/e2e-metal-ipi-ovn-ipv6 b4a2d0f link true /test e2e-metal-ipi-ovn-ipv6

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants