OCPBUGS-86798: Bug: Poll for CRD Established status after creation to fix TOCTOU race#1396
Conversation
When CVO creates a new CRD during payload sync, it immediately checks for Established=True in status.conditions. The API server needs time (100-500ms) to validate the schema and populate conditions, but CVO checks within 9-107ms — before conditions are set. This produces spurious error-level log entries: "does not declare an Established status condition: []" On a fresh OCP 4.22.0-rc.4 cluster, enabling CustomNoUpgrade triggers this race for 10 newly-created CRDs. The sync self-heals on retry, but the errors are noisy and could cause issues on slow API servers. Fix: When the CRD has empty status conditions (indicating it was just created), poll the API server for up to 5 seconds (100ms intervals) before returning an error. This gives the API server time to complete CRD registration while preserving the safety of the Established check for CRDs that genuinely fail to establish. The Established check logic is extracted into a testable checkCRDEstablished helper function with corresponding unit tests. Discovered during OCPSTRAT-2485 feature gate testing. Co-authored-by: Cursor <cursoragent@cursor.com>
WalkthroughAdds a reusable checkCRDEstablished helper, polling logic in checkCustomResourceDefinitionHealth to re-fetch CRDs when Status.Conditions is empty, debug logging for polling/timeouts, and unit tests covering established-state and builder modes. ChangesCRD Health Check with Establishment Polling
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: nikhilprajapati-world The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
lib/resourcebuilder/apiext_test.go (1)
112-139: 🏗️ Heavy liftMissing test coverage for the polling path.
Both tests exercise fast paths (InitializingMode skip, already-Established CRD) but don't cover the core TOCTOU fix: polling when
Status.Conditionsis empty. Consider adding tests for:
- Empty conditions → polling succeeds when CRD becomes Established
- Empty conditions → polling times out
- Empty conditions → Get returns error during polling
This would require a mock/fake apiextensions client to control the Get responses.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/resourcebuilder/apiext_test.go` around lines 112 - 139, Add unit tests covering the polling/TOCTOU path of builder.checkCustomResourceDefinitionHealth: create new tests that construct a builder in ReconcilingMode with a CRD whose Status.Conditions is empty and inject a fake/mocked apiextensions client that you can control the Get behavior; write three cases: (1) the fake client initially returns the CRD with empty conditions but on a subsequent Get returns a CRD with an Established/ConditionTrue so the poll loop succeeds, (2) the fake client always returns empty conditions so the poll times out and the test asserts an error, and (3) the fake client returns an error from Get during polling and the test asserts that error is propagated; target the same test helpers and symbols used in existing tests (builder, checkCustomResourceDefinitionHealth, TestCheckCustomResourceDefinitionHealth*), use t.Context() or a cancellable context to bound polling duration, and ensure the tests inject the fake client into the builder so the polling loop exercises the Get path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@lib/resourcebuilder/apiext_test.go`:
- Around line 141-152: The test defines custom functions containsSubstring and
containsSubstringImpl that duplicate standard library behavior; replace all uses
of containsSubstring with strings.Contains and remove both containsSubstring and
containsSubstringImpl definitions, adding an import for the "strings" package if
not already present; ensure any test calls reference strings.Contains(s, substr)
so the duplicates can be deleted.
---
Nitpick comments:
In `@lib/resourcebuilder/apiext_test.go`:
- Around line 112-139: Add unit tests covering the polling/TOCTOU path of
builder.checkCustomResourceDefinitionHealth: create new tests that construct a
builder in ReconcilingMode with a CRD whose Status.Conditions is empty and
inject a fake/mocked apiextensions client that you can control the Get behavior;
write three cases: (1) the fake client initially returns the CRD with empty
conditions but on a subsequent Get returns a CRD with an
Established/ConditionTrue so the poll loop succeeds, (2) the fake client always
returns empty conditions so the poll times out and the test asserts an error,
and (3) the fake client returns an error from Get during polling and the test
asserts that error is propagated; target the same test helpers and symbols used
in existing tests (builder, checkCustomResourceDefinitionHealth,
TestCheckCustomResourceDefinitionHealth*), use t.Context() or a cancellable
context to bound polling duration, and ensure the tests inject the fake client
into the builder so the polling loop exercises the Get path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 0c2ceef8-bfc7-4368-9521-a03037104df9
📒 Files selected for processing (2)
lib/resourcebuilder/apiext.golib/resourcebuilder/apiext_test.go
|
/retest |
Address CodeRabbit review feedback: the custom containsSubstring and containsSubstringImpl helper functions duplicate the standard library strings.Contains function. Replace all uses and remove the custom implementations. Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
🧹 Nitpick comments (2)
lib/resourcebuilder/apiext_test.go (2)
113-139: 💤 Low valueConsider collapsing the two health-check tests into one table-driven test.
TestCheckCustomResourceDefinitionHealthInitializingModeandTestCheckCustomResourceDefinitionHealthEstablisheddiffer only in buildermodeand CRD setup, so they're a natural fit for a single table-driven test.As per coding guidelines: "Prefer table-driven tests over multiple similar test functions. If two test functions differ only in setup values, collapse them into one function with test-case tuples."
♻️ Proposed refactor
-func TestCheckCustomResourceDefinitionHealthInitializingMode(t *testing.T) { - b := &builder{mode: InitializingMode} - crd := &apiextv1.CustomResourceDefinition{ - ObjectMeta: metav1.ObjectMeta{Name: "test.example.io"}, - Status: apiextv1.CustomResourceDefinitionStatus{}, - } - if err := b.checkCustomResourceDefinitionHealth(t.Context(), crd); err != nil { - t.Errorf("InitializingMode should skip health check, got: %v", err) - } -} - -func TestCheckCustomResourceDefinitionHealthEstablished(t *testing.T) { - b := &builder{mode: ReconcilingMode} - crd := &apiextv1.CustomResourceDefinition{ - ObjectMeta: metav1.ObjectMeta{Name: "test.example.io"}, - Status: apiextv1.CustomResourceDefinitionStatus{ - Conditions: []apiextv1.CustomResourceDefinitionCondition{ - { - Type: apiextv1.Established, - Status: apiextv1.ConditionTrue, - }, - }, - }, - } - if err := b.checkCustomResourceDefinitionHealth(t.Context(), crd); err != nil { - t.Errorf("Established=True CRD should pass health check, got: %v", err) - } -} +func TestCheckCustomResourceDefinitionHealth(t *testing.T) { + tests := []struct { + name string + mode Mode + crd *apiextv1.CustomResourceDefinition + }{ + { + name: "InitializingMode skips health check", + mode: InitializingMode, + crd: &apiextv1.CustomResourceDefinition{ + ObjectMeta: metav1.ObjectMeta{Name: "test.example.io"}, + Status: apiextv1.CustomResourceDefinitionStatus{}, + }, + }, + { + name: "ReconcilingMode passes when Established=True", + mode: ReconcilingMode, + crd: &apiextv1.CustomResourceDefinition{ + ObjectMeta: metav1.ObjectMeta{Name: "test.example.io"}, + Status: apiextv1.CustomResourceDefinitionStatus{ + Conditions: []apiextv1.CustomResourceDefinitionCondition{ + {Type: apiextv1.Established, Status: apiextv1.ConditionTrue}, + }, + }, + }, + }, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + b := &builder{mode: tt.mode} + if err := b.checkCustomResourceDefinitionHealth(t.Context(), tt.crd); err != nil { + t.Errorf("%s: unexpected error: %v", tt.name, err) + } + }) + } +}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/resourcebuilder/apiext_test.go` around lines 113 - 139, Collapse the two near-duplicate tests TestCheckCustomResourceDefinitionHealthInitializingMode and TestCheckCustomResourceDefinitionHealthEstablished into a single table-driven test that iterates cases containing builder.mode and the CRD.Status (e.g., InitializingMode with empty Status and ReconcilingMode with Conditions containing Established=True), calls b.checkCustomResourceDefinitionHealth for each case, and asserts expected error/nil; locate the test logic by referencing the existing functions TestCheckCustomResourceDefinitionHealthInitializingMode, TestCheckCustomResourceDefinitionHealthEstablished, the builder struct, and the checkCustomResourceDefinitionHealth method to replace them with a single TestCheckCustomResourceDefinitionHealth table-driven implementation.
113-139: 🏗️ Heavy liftPolling path is untested.
The core fix in this PR — polling
apiextensionsClientv1whenStatus.Conditionsis empty in non-Initializing mode — has no coverage. Both health-check tests bypass the poll branch (one returns early viaInitializingMode, the other already hasEstablished=True). A test using a fake apiextensions clientset whose CRD starts with empty conditions and later reportsEstablished=Truewould exercise the actual TOCTOU fix.Want me to draft a fake-client-based test that covers the empty-conditions polling branch?
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@lib/resourcebuilder/apiext_test.go` around lines 113 - 139, Add a new test for builder.checkCustomResourceDefinitionHealth that exercises the polling branch by using a fake apiextensions client: create a builder in ReconcilingMode with an initial CRD whose Status.Conditions is empty, inject a fake apiextensionsClientv1 (fake clientset) into the builder, then in a background goroutine update the CRD on the fake client to include an Established Condition=True after a short delay so the polling loop in checkCustomResourceDefinitionHealth observes the transition; assert that the call returns no error and thus the TOCTOU/polling branch is covered. Ensure the test references builder, ReconcilingMode, checkCustomResourceDefinitionHealth and the fake apiextensions client so it fails if polling logic is removed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@lib/resourcebuilder/apiext_test.go`:
- Around line 113-139: Collapse the two near-duplicate tests
TestCheckCustomResourceDefinitionHealthInitializingMode and
TestCheckCustomResourceDefinitionHealthEstablished into a single table-driven
test that iterates cases containing builder.mode and the CRD.Status (e.g.,
InitializingMode with empty Status and ReconcilingMode with Conditions
containing Established=True), calls b.checkCustomResourceDefinitionHealth for
each case, and asserts expected error/nil; locate the test logic by referencing
the existing functions TestCheckCustomResourceDefinitionHealthInitializingMode,
TestCheckCustomResourceDefinitionHealthEstablished, the builder struct, and the
checkCustomResourceDefinitionHealth method to replace them with a single
TestCheckCustomResourceDefinitionHealth table-driven implementation.
- Around line 113-139: Add a new test for
builder.checkCustomResourceDefinitionHealth that exercises the polling branch by
using a fake apiextensions client: create a builder in ReconcilingMode with an
initial CRD whose Status.Conditions is empty, inject a fake
apiextensionsClientv1 (fake clientset) into the builder, then in a background
goroutine update the CRD on the fake client to include an Established
Condition=True after a short delay so the polling loop in
checkCustomResourceDefinitionHealth observes the transition; assert that the
call returns no error and thus the TOCTOU/polling branch is covered. Ensure the
test references builder, ReconcilingMode, checkCustomResourceDefinitionHealth
and the fake apiextensions client so it fails if polling logic is removed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: f3b2ff7c-36af-4cd6-9943-1ac12ae5e74b
📒 Files selected for processing (1)
lib/resourcebuilder/apiext_test.go
|
@nikhilprajapati-world: This pull request references Jira Issue OCPBUGS-86798, which is valid. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@nikhilprajapati-world: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
Fixes a TOCTOU (Time-Of-Check-Time-Of-Use) race condition in CVO's CRD apply logic where newly-created CRDs are checked for
Established=Truebefore the API server has had time to populatestatus.conditions.Established=Trueinstatus.conditionsRoot Cause
In
lib/resourcebuilder/apiext.go,checkCustomResourceDefinitionHealth()checks the CRD object returned directly fromCreate(). At that point,status.conditionsis still empty because the API server hasn't finished CRD registration yet.What Changed
lib/resourcebuilder/apiext.go: Extracted Established check into a reusablecheckCRDEstablished()helper. Whenstatus.conditionsis empty (indicating the CRD was just created), poll the API server for up to 5 seconds at 100ms intervals before returning an error.lib/resourcebuilder/apiext_test.go: Added unit tests forcheckCRDEstablished()covering all condition states (Established=True, Established=False, empty conditions, nil conditions, missing Established condition).Evidence — Reproduced on Fresh OCP 4.22.0-rc.4 Cluster
On a freshly installed cluster (no prior
CustomNoUpgrade), enabling a feature gate triggered the race for 10 different CRDs:dnsnameresolvers.network.openshift.ioclusterversionoperators.operator.openshift.ioetcdbackups.operator.openshift.iobackups.config.openshift.iocompatibilityrequirements.apiextensions.openshift.ioclustermonitorings.config.openshift.iointernalreleaseimages.machineconfiguration.openshift.iocriocredentialproviderconfigs.config.openshift.ioosimagestreams.machineconfiguration.openshift.iopkis.config.openshift.ioCVO log showing the race (CRD created, then immediately failed Established check):
All 10 CRDs eventually reached
Established=Trueand the sync succeeded on retry — confirming the CRDs are valid and this is purely a timing issue.Why This Wasn't Caught Before
Established=True, so CVO updates them (no race)Risk Assessment
Low risk. The change:
wait.PollUntilContextTimeoutpattern already used elsewhere in CVOTest Plan
checkCRDEstablished()— all 5 condition statescheckCustomResourceDefinitionHealth()— InitializingMode skip + Established passlib/package tests pass (go test ./lib/...)CustomNoUpgradeon a fresh 4.22 cluster — verify no "does not declare an Established" errors in CVO logsEstablished=Trueand cluster reaches healthy stateRelated
Made with Cursor
Summary by CodeRabbit
Bug Fixes
Tests