Skip to content

OCPBUGS-86798: Bug: Poll for CRD Established status after creation to fix TOCTOU race#1396

Open
nikhilprajapati-world wants to merge 2 commits into
openshift:mainfrom
nikhilprajapati-world:fix/crd-established-toctou-race
Open

OCPBUGS-86798: Bug: Poll for CRD Established status after creation to fix TOCTOU race#1396
nikhilprajapati-world wants to merge 2 commits into
openshift:mainfrom
nikhilprajapati-world:fix/crd-established-toctou-race

Conversation

@nikhilprajapati-world
Copy link
Copy Markdown

@nikhilprajapati-world nikhilprajapati-world commented Jun 1, 2026

Summary

Fixes a TOCTOU (Time-Of-Check-Time-Of-Use) race condition in CVO's CRD apply logic where newly-created CRDs are checked for Established=True before the API server has had time to populate status.conditions.

  • When CVO creates a new CRD, it immediately checks for Established=True in status.conditions
  • The API server needs 100–500ms to validate the schema and set conditions
  • CVO checks within 9–107ms — before conditions are populated
  • This produces spurious error-level log entries that are noisy and confusing

Root Cause

In lib/resourcebuilder/apiext.go, checkCustomResourceDefinitionHealth() checks the CRD object returned directly from Create(). At that point, status.conditions is still empty because the API server hasn't finished CRD registration yet.

What Changed

  • lib/resourcebuilder/apiext.go: Extracted Established check into a reusable checkCRDEstablished() helper. When status.conditions is empty (indicating the CRD was just created), poll the API server for up to 5 seconds at 100ms intervals before returning an error.
  • lib/resourcebuilder/apiext_test.go: Added unit tests for checkCRDEstablished() covering all condition states (Established=True, Established=False, empty conditions, nil conditions, missing Established condition).

Evidence — Reproduced on Fresh OCP 4.22.0-rc.4 Cluster

On a freshly installed cluster (no prior CustomNoUpgrade), enabling a feature gate triggered the race for 10 different CRDs:

CRD Create → Check Gap
dnsnameresolvers.network.openshift.io 107ms
clusterversionoperators.operator.openshift.io 58ms
etcdbackups.operator.openshift.io 9ms
backups.config.openshift.io 105ms
compatibilityrequirements.apiextensions.openshift.io 63ms
clustermonitorings.config.openshift.io 98ms
internalreleaseimages.machineconfiguration.openshift.io 13ms
criocredentialproviderconfigs.config.openshift.io 11ms
osimagestreams.machineconfiguration.openshift.io 14ms
pkis.config.openshift.io 15ms

CVO log showing the race (CRD created, then immediately failed Established check):

I0601 13:20:29.976973  1 apiext.go:21] CRD compatibilityrequirements.apiextensions.openshift.io not found, creating
E0601 13:20:30.039500  1 task.go:128] "Unhandled Error" err="error running apply for
  customresourcedefinition \"compatibilityrequirements.apiextensions.openshift.io\" (95 of 1039):
  CustomResourceDefinition compatibilityrequirements.apiextensions.openshift.io
  does not declare an Established status condition: []"

All 10 CRDs eventually reached Established=True and the sync succeeded on retry — confirming the CRDs are valid and this is purely a timing issue.

Why This Wasn't Caught Before

  • Existing clusters: CRDs already exist with Established=True, so CVO updates them (no race)
  • TechPreviewNoUpgrade at install: CRDs are created during bootstrap, not during live CVO sync
  • CustomNoUpgrade at runtime: This is the only path that triggers CRD creation during a live sync cycle

Risk Assessment

Low risk. The change:

  • Only affects the CRD creation path (not update — existing behavior unchanged)
  • Preserves the Established safety check (CRDs that genuinely fail still error after 5s)
  • Uses the standard wait.PollUntilContextTimeout pattern already used elsewhere in CVO
  • All existing tests pass unchanged

Test Plan

  • Unit tests for checkCRDEstablished() — all 5 condition states
  • Unit tests for checkCustomResourceDefinitionHealth() — InitializingMode skip + Established pass
  • Full lib/ package tests pass (go test ./lib/...)
  • E2E: Enable CustomNoUpgrade on a fresh 4.22 cluster — verify no "does not declare an Established" errors in CVO logs
  • E2E: Verify all CRDs reach Established=True and cluster reaches healthy state

Related

  • CVO PR #771 — Restored the Established check (correct behavior, but races on creation)
  • Discovered during OCPSTRAT-2485 feature gate testing

Made with Cursor

Summary by CodeRabbit

  • Bug Fixes

    • Improved CRD readiness checks by adding timed polling for cases with initially empty status conditions, reducing false negatives and improving reliability; adds additional debug/verbose logging around polling and failures.
  • Tests

    • Added unit tests covering CRD establishment and health-check behavior across modes and varied condition states.

When CVO creates a new CRD during payload sync, it immediately checks
for Established=True in status.conditions. The API server needs time
(100-500ms) to validate the schema and populate conditions, but CVO
checks within 9-107ms — before conditions are set.

This produces spurious error-level log entries:
  "does not declare an Established status condition: []"

On a fresh OCP 4.22.0-rc.4 cluster, enabling CustomNoUpgrade triggers
this race for 10 newly-created CRDs. The sync self-heals on retry, but
the errors are noisy and could cause issues on slow API servers.

Fix: When the CRD has empty status conditions (indicating it was just
created), poll the API server for up to 5 seconds (100ms intervals)
before returning an error. This gives the API server time to complete
CRD registration while preserving the safety of the Established check
for CRDs that genuinely fail to establish.

The Established check logic is extracted into a testable checkCRDEstablished
helper function with corresponding unit tests.

Discovered during OCPSTRAT-2485 feature gate testing.

Co-authored-by: Cursor <cursoragent@cursor.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 1, 2026

Walkthrough

Adds a reusable checkCRDEstablished helper, polling logic in checkCustomResourceDefinitionHealth to re-fetch CRDs when Status.Conditions is empty, debug logging for polling/timeouts, and unit tests covering established-state and builder modes.

Changes

CRD Health Check with Establishment Polling

Layer / File(s) Summary
CRD Established Check Contract and Polling Configuration
lib/resourcebuilder/apiext.go
Polling interval and timeout constants and imports added; checkCRDEstablished helper introduced to validate the CRD Established condition from status.
CRD Health Check with Polling Retry Logic
lib/resourcebuilder/apiext.go
checkCustomResourceDefinitionHealth now calls checkCRDEstablished and performs context-aware polling via wait.PollUntilContextTimeout when crd.Status.Conditions is empty; includes klog messages for polling and timeout/failure and returns the helper-derived error or nil on success.
CRD Health Check and Helper Unit Tests
lib/resourcebuilder/apiext_test.go
Adds TestCheckCRDEstablished (table-driven) and tests for checkCustomResourceDefinitionHealth in InitializingMode and ReconcilingMode; tests use strings.Contains for substring assertions (no custom substring helper).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (14 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names are stable and deterministic with no dynamic content like UUIDs, timestamps, or generated identifiers.
Test Structure And Quality ✅ Passed These are standard Go unit tests using testing.T, not Ginkgo. Custom check does not apply as check specifically targets Ginkgo test code.
Microshift Test Compatibility ✅ Passed PR adds only Go unit tests (testing.T), not Ginkgo e2e tests. Custom check applies only to Ginkgo e2e patterns (It, Describe, Context, When), which are not present.
Single Node Openshift (Sno) Test Compatibility ✅ Passed PR contains only standard Go unit tests for lib/resourcebuilder, not Ginkgo e2e tests. The check applies only to new Ginkgo e2e tests, making it not applicable here.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only internal utility code in lib/resourcebuilder/apiext.go checking CRD health status. No deployment manifests, pod specs, controllers, or scheduling constraints are added or modified.
Ote Binary Stdout Contract ✅ Passed Library code only (resourcebuilder package) with klog.V() calls directed to stderr via main's alsologtostderr=true. No process-level stdout violations found.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR adds standard Go unit tests (not Ginkgo e2e tests). Custom check applies only to Ginkgo e2e tests with IPv4/connectivity assumptions; this PR is not applicable.
No-Weak-Crypto ✅ Passed No weak cryptographic algorithms, custom crypto implementations, or non-constant-time secret comparisons detected. PR focuses on CRD status polling, not cryptography.
Container-Privileges ✅ Passed PR contains only Go source code changes (apiext.go, apiext_test.go) for CRD health checking. No container/Kubernetes manifests or Dockerfiles with privileged settings are modified.
No-Sensitive-Data-In-Logs ✅ Passed The code logs only non-sensitive data: CRD names, condition status/reason/message, and timeout information. No passwords, tokens, API keys, PII, session IDs, or customer data are logged.
Title check ✅ Passed The title clearly and specifically describes the main change: fixing a TOCTOU race condition by polling for CRD Established status after creation, which aligns with the changeset's core purpose.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 1, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nikhilprajapati-world
Once this PR has been reviewed and has the lgtm label, please assign davidhurta for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
lib/resourcebuilder/apiext_test.go (1)

112-139: 🏗️ Heavy lift

Missing test coverage for the polling path.

Both tests exercise fast paths (InitializingMode skip, already-Established CRD) but don't cover the core TOCTOU fix: polling when Status.Conditions is empty. Consider adding tests for:

  • Empty conditions → polling succeeds when CRD becomes Established
  • Empty conditions → polling times out
  • Empty conditions → Get returns error during polling

This would require a mock/fake apiextensions client to control the Get responses.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/resourcebuilder/apiext_test.go` around lines 112 - 139, Add unit tests
covering the polling/TOCTOU path of builder.checkCustomResourceDefinitionHealth:
create new tests that construct a builder in ReconcilingMode with a CRD whose
Status.Conditions is empty and inject a fake/mocked apiextensions client that
you can control the Get behavior; write three cases: (1) the fake client
initially returns the CRD with empty conditions but on a subsequent Get returns
a CRD with an Established/ConditionTrue so the poll loop succeeds, (2) the fake
client always returns empty conditions so the poll times out and the test
asserts an error, and (3) the fake client returns an error from Get during
polling and the test asserts that error is propagated; target the same test
helpers and symbols used in existing tests (builder,
checkCustomResourceDefinitionHealth, TestCheckCustomResourceDefinitionHealth*),
use t.Context() or a cancellable context to bound polling duration, and ensure
the tests inject the fake client into the builder so the polling loop exercises
the Get path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@lib/resourcebuilder/apiext_test.go`:
- Around line 141-152: The test defines custom functions containsSubstring and
containsSubstringImpl that duplicate standard library behavior; replace all uses
of containsSubstring with strings.Contains and remove both containsSubstring and
containsSubstringImpl definitions, adding an import for the "strings" package if
not already present; ensure any test calls reference strings.Contains(s, substr)
so the duplicates can be deleted.

---

Nitpick comments:
In `@lib/resourcebuilder/apiext_test.go`:
- Around line 112-139: Add unit tests covering the polling/TOCTOU path of
builder.checkCustomResourceDefinitionHealth: create new tests that construct a
builder in ReconcilingMode with a CRD whose Status.Conditions is empty and
inject a fake/mocked apiextensions client that you can control the Get behavior;
write three cases: (1) the fake client initially returns the CRD with empty
conditions but on a subsequent Get returns a CRD with an
Established/ConditionTrue so the poll loop succeeds, (2) the fake client always
returns empty conditions so the poll times out and the test asserts an error,
and (3) the fake client returns an error from Get during polling and the test
asserts that error is propagated; target the same test helpers and symbols used
in existing tests (builder, checkCustomResourceDefinitionHealth,
TestCheckCustomResourceDefinitionHealth*), use t.Context() or a cancellable
context to bound polling duration, and ensure the tests inject the fake client
into the builder so the polling loop exercises the Get path.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 0c2ceef8-bfc7-4368-9521-a03037104df9

📥 Commits

Reviewing files that changed from the base of the PR and between 8a13554 and a00a0d3.

📒 Files selected for processing (2)
  • lib/resourcebuilder/apiext.go
  • lib/resourcebuilder/apiext_test.go

Comment thread lib/resourcebuilder/apiext_test.go Outdated
@nikhilprajapati-world
Copy link
Copy Markdown
Author

/retest

Address CodeRabbit review feedback: the custom containsSubstring and
containsSubstringImpl helper functions duplicate the standard library
strings.Contains function. Replace all uses and remove the custom
implementations.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
lib/resourcebuilder/apiext_test.go (2)

113-139: 💤 Low value

Consider collapsing the two health-check tests into one table-driven test.

TestCheckCustomResourceDefinitionHealthInitializingMode and TestCheckCustomResourceDefinitionHealthEstablished differ only in builder mode and CRD setup, so they're a natural fit for a single table-driven test.

As per coding guidelines: "Prefer table-driven tests over multiple similar test functions. If two test functions differ only in setup values, collapse them into one function with test-case tuples."

♻️ Proposed refactor
-func TestCheckCustomResourceDefinitionHealthInitializingMode(t *testing.T) {
-	b := &builder{mode: InitializingMode}
-	crd := &apiextv1.CustomResourceDefinition{
-		ObjectMeta: metav1.ObjectMeta{Name: "test.example.io"},
-		Status:     apiextv1.CustomResourceDefinitionStatus{},
-	}
-	if err := b.checkCustomResourceDefinitionHealth(t.Context(), crd); err != nil {
-		t.Errorf("InitializingMode should skip health check, got: %v", err)
-	}
-}
-
-func TestCheckCustomResourceDefinitionHealthEstablished(t *testing.T) {
-	b := &builder{mode: ReconcilingMode}
-	crd := &apiextv1.CustomResourceDefinition{
-		ObjectMeta: metav1.ObjectMeta{Name: "test.example.io"},
-		Status: apiextv1.CustomResourceDefinitionStatus{
-			Conditions: []apiextv1.CustomResourceDefinitionCondition{
-				{
-					Type:   apiextv1.Established,
-					Status: apiextv1.ConditionTrue,
-				},
-			},
-		},
-	}
-	if err := b.checkCustomResourceDefinitionHealth(t.Context(), crd); err != nil {
-		t.Errorf("Established=True CRD should pass health check, got: %v", err)
-	}
-}
+func TestCheckCustomResourceDefinitionHealth(t *testing.T) {
+	tests := []struct {
+		name string
+		mode Mode
+		crd  *apiextv1.CustomResourceDefinition
+	}{
+		{
+			name: "InitializingMode skips health check",
+			mode: InitializingMode,
+			crd: &apiextv1.CustomResourceDefinition{
+				ObjectMeta: metav1.ObjectMeta{Name: "test.example.io"},
+				Status:     apiextv1.CustomResourceDefinitionStatus{},
+			},
+		},
+		{
+			name: "ReconcilingMode passes when Established=True",
+			mode: ReconcilingMode,
+			crd: &apiextv1.CustomResourceDefinition{
+				ObjectMeta: metav1.ObjectMeta{Name: "test.example.io"},
+				Status: apiextv1.CustomResourceDefinitionStatus{
+					Conditions: []apiextv1.CustomResourceDefinitionCondition{
+						{Type: apiextv1.Established, Status: apiextv1.ConditionTrue},
+					},
+				},
+			},
+		},
+	}
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			b := &builder{mode: tt.mode}
+			if err := b.checkCustomResourceDefinitionHealth(t.Context(), tt.crd); err != nil {
+				t.Errorf("%s: unexpected error: %v", tt.name, err)
+			}
+		})
+	}
+}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/resourcebuilder/apiext_test.go` around lines 113 - 139, Collapse the two
near-duplicate tests TestCheckCustomResourceDefinitionHealthInitializingMode and
TestCheckCustomResourceDefinitionHealthEstablished into a single table-driven
test that iterates cases containing builder.mode and the CRD.Status (e.g.,
InitializingMode with empty Status and ReconcilingMode with Conditions
containing Established=True), calls b.checkCustomResourceDefinitionHealth for
each case, and asserts expected error/nil; locate the test logic by referencing
the existing functions TestCheckCustomResourceDefinitionHealthInitializingMode,
TestCheckCustomResourceDefinitionHealthEstablished, the builder struct, and the
checkCustomResourceDefinitionHealth method to replace them with a single
TestCheckCustomResourceDefinitionHealth table-driven implementation.

113-139: 🏗️ Heavy lift

Polling path is untested.

The core fix in this PR — polling apiextensionsClientv1 when Status.Conditions is empty in non-Initializing mode — has no coverage. Both health-check tests bypass the poll branch (one returns early via InitializingMode, the other already has Established=True). A test using a fake apiextensions clientset whose CRD starts with empty conditions and later reports Established=True would exercise the actual TOCTOU fix.

Want me to draft a fake-client-based test that covers the empty-conditions polling branch?

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/resourcebuilder/apiext_test.go` around lines 113 - 139, Add a new test
for builder.checkCustomResourceDefinitionHealth that exercises the polling
branch by using a fake apiextensions client: create a builder in ReconcilingMode
with an initial CRD whose Status.Conditions is empty, inject a fake
apiextensionsClientv1 (fake clientset) into the builder, then in a background
goroutine update the CRD on the fake client to include an Established
Condition=True after a short delay so the polling loop in
checkCustomResourceDefinitionHealth observes the transition; assert that the
call returns no error and thus the TOCTOU/polling branch is covered. Ensure the
test references builder, ReconcilingMode, checkCustomResourceDefinitionHealth
and the fake apiextensions client so it fails if polling logic is removed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@lib/resourcebuilder/apiext_test.go`:
- Around line 113-139: Collapse the two near-duplicate tests
TestCheckCustomResourceDefinitionHealthInitializingMode and
TestCheckCustomResourceDefinitionHealthEstablished into a single table-driven
test that iterates cases containing builder.mode and the CRD.Status (e.g.,
InitializingMode with empty Status and ReconcilingMode with Conditions
containing Established=True), calls b.checkCustomResourceDefinitionHealth for
each case, and asserts expected error/nil; locate the test logic by referencing
the existing functions TestCheckCustomResourceDefinitionHealthInitializingMode,
TestCheckCustomResourceDefinitionHealthEstablished, the builder struct, and the
checkCustomResourceDefinitionHealth method to replace them with a single
TestCheckCustomResourceDefinitionHealth table-driven implementation.
- Around line 113-139: Add a new test for
builder.checkCustomResourceDefinitionHealth that exercises the polling branch by
using a fake apiextensions client: create a builder in ReconcilingMode with an
initial CRD whose Status.Conditions is empty, inject a fake
apiextensionsClientv1 (fake clientset) into the builder, then in a background
goroutine update the CRD on the fake client to include an Established
Condition=True after a short delay so the polling loop in
checkCustomResourceDefinitionHealth observes the transition; assert that the
call returns no error and thus the TOCTOU/polling branch is covered. Ensure the
test references builder, ReconcilingMode, checkCustomResourceDefinitionHealth
and the fake apiextensions client so it fails if polling logic is removed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: f3b2ff7c-36af-4cd6-9943-1ac12ae5e74b

📥 Commits

Reviewing files that changed from the base of the PR and between a00a0d3 and 2ea6010.

📒 Files selected for processing (1)
  • lib/resourcebuilder/apiext_test.go

@nikhilprajapati-world nikhilprajapati-world changed the title Bug: Poll for CRD Established status after creation to fix TOCTOU race OCPBUGS-86798: Bug: Poll for CRD Established status after creation to fix TOCTOU race Jun 2, 2026
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jun 2, 2026
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@nikhilprajapati-world: This pull request references Jira Issue OCPBUGS-86798, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

Fixes a TOCTOU (Time-Of-Check-Time-Of-Use) race condition in CVO's CRD apply logic where newly-created CRDs are checked for Established=True before the API server has had time to populate status.conditions.

  • When CVO creates a new CRD, it immediately checks for Established=True in status.conditions
  • The API server needs 100–500ms to validate the schema and set conditions
  • CVO checks within 9–107ms — before conditions are populated
  • This produces spurious error-level log entries that are noisy and confusing

Root Cause

In lib/resourcebuilder/apiext.go, checkCustomResourceDefinitionHealth() checks the CRD object returned directly from Create(). At that point, status.conditions is still empty because the API server hasn't finished CRD registration yet.

What Changed

  • lib/resourcebuilder/apiext.go: Extracted Established check into a reusable checkCRDEstablished() helper. When status.conditions is empty (indicating the CRD was just created), poll the API server for up to 5 seconds at 100ms intervals before returning an error.
  • lib/resourcebuilder/apiext_test.go: Added unit tests for checkCRDEstablished() covering all condition states (Established=True, Established=False, empty conditions, nil conditions, missing Established condition).

Evidence — Reproduced on Fresh OCP 4.22.0-rc.4 Cluster

On a freshly installed cluster (no prior CustomNoUpgrade), enabling a feature gate triggered the race for 10 different CRDs:

CRD Create → Check Gap
dnsnameresolvers.network.openshift.io 107ms
clusterversionoperators.operator.openshift.io 58ms
etcdbackups.operator.openshift.io 9ms
backups.config.openshift.io 105ms
compatibilityrequirements.apiextensions.openshift.io 63ms
clustermonitorings.config.openshift.io 98ms
internalreleaseimages.machineconfiguration.openshift.io 13ms
criocredentialproviderconfigs.config.openshift.io 11ms
osimagestreams.machineconfiguration.openshift.io 14ms
pkis.config.openshift.io 15ms

CVO log showing the race (CRD created, then immediately failed Established check):

I0601 13:20:29.976973  1 apiext.go:21] CRD compatibilityrequirements.apiextensions.openshift.io not found, creating
E0601 13:20:30.039500  1 task.go:128] "Unhandled Error" err="error running apply for
 customresourcedefinition \"compatibilityrequirements.apiextensions.openshift.io\" (95 of 1039):
 CustomResourceDefinition compatibilityrequirements.apiextensions.openshift.io
 does not declare an Established status condition: []"

All 10 CRDs eventually reached Established=True and the sync succeeded on retry — confirming the CRDs are valid and this is purely a timing issue.

Why This Wasn't Caught Before

  • Existing clusters: CRDs already exist with Established=True, so CVO updates them (no race)
  • TechPreviewNoUpgrade at install: CRDs are created during bootstrap, not during live CVO sync
  • CustomNoUpgrade at runtime: This is the only path that triggers CRD creation during a live sync cycle

Risk Assessment

Low risk. The change:

  • Only affects the CRD creation path (not update — existing behavior unchanged)
  • Preserves the Established safety check (CRDs that genuinely fail still error after 5s)
  • Uses the standard wait.PollUntilContextTimeout pattern already used elsewhere in CVO
  • All existing tests pass unchanged

Test Plan

  • Unit tests for checkCRDEstablished() — all 5 condition states
  • Unit tests for checkCustomResourceDefinitionHealth() — InitializingMode skip + Established pass
  • Full lib/ package tests pass (go test ./lib/...)
  • E2E: Enable CustomNoUpgrade on a fresh 4.22 cluster — verify no "does not declare an Established" errors in CVO logs
  • E2E: Verify all CRDs reach Established=True and cluster reaches healthy state

Related

  • CVO PR #771 — Restored the Established check (correct behavior, but races on creation)
  • Discovered during OCPSTRAT-2485 feature gate testing

Made with Cursor

Summary by CodeRabbit

  • Bug Fixes

  • Improved CRD readiness checks by adding timed polling for cases with initially empty status conditions, reducing false negatives and improving reliability; adds additional debug/verbose logging around polling and failures.

  • Tests

  • Added unit tests covering CRD establishment and health-check behavior across modes and varied condition states.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 2, 2026

@nikhilprajapati-world: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/verify-update 2ea6010 link true /test verify-update
ci/prow/e2e-aws-ovn-techpreview 2ea6010 link true /test e2e-aws-ovn-techpreview
ci/prow/gofmt 2ea6010 link true /test gofmt

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants