OCPNODE-4443: Add runc upgradeable guard to block upgrades on RHEL 10 streams#5891
OCPNODE-4443: Add runc upgradeable guard to block upgrades on RHEL 10 streams#5891bitoku wants to merge 2 commits into
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@bitoku: This pull request references OCPNODE-4443 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis pull request adds RHEL 10 compatibility enforcement by preventing the ChangesRHEL 10 Runtime Validation
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/hold |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
pkg/operator/status_test.go (1)
1487-1518: Good test coverage for RHEL 9 exemption.This test case properly verifies that pools on RHEL 9 streams are not blocked even when runc is configured. This is critical for ensuring the upgrade guard only affects RHEL 10 / CentOS 10 targets.
Consider adding a similar test case for CentOS 10 to verify that both
osimagestream.StreamNameRHEL10andosimagestream.StreamNameCentOS10trigger the upgrade block.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/status_test.go` around lines 1487 - 1518, Add a test parallel to "runc not blocked on rhel-9 stream" that sets the OSImageStream.Status.DefaultStream to osimagestream.StreamNameCentOS10 (and optionally another for osimagestream.StreamNameRHEL10) and asserts that optr.checkRuncUpgradeableGuard() returns upgradeBlock == true, non-empty message, and no error; reuse the same Operator setup (mcLister with rendered-worker config containing the runc drop-in, osImageStreamLister, fgHandler) and reference the Operator.checkRuncUpgradeableGuard method, the osImageStreamLister/osisIndexer setup, and mockMCPLister/mockMCLister used in the existing test to locate where to add the new test(s).pkg/operator/status.go (1)
506-514: Consider whether TOML parse failures should be non-fatal.Currently, a malformed TOML drop-in in any pool's rendered MachineConfig will cause the entire detection to fail, potentially preventing the operator from evaluating runc status for all pools.
If the TOML is malformed due to a user-provided MachineConfig, consider logging a warning and continuing rather than returning an error. However, if you expect all drop-ins to be well-formed since they're generated by the controller, the current behavior is appropriate.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/status.go` around lines 506 - 514, The TOML decoding failure currently returns an error and aborts detection; change this to log a warning and continue so one malformed CRI-O drop-in doesn't fail the whole routine: in the block that decodes into crioRuntimeConfig using toml.NewDecoder(bytes.NewReader(fileData)).Decode(&cfg), replace the return of fmt.Errorf(...) with a non-fatal log (e.g., klog.Warningf or the package's logger) that includes path and mc.Name and the decode error, then skip using cfg for that drop-in and continue processing to allow effectiveRuntime evaluation from other drop-ins; keep the existing behavior (return error) only if you determine the drop-ins should always be trusted.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/operator/status_test.go`:
- Around line 1487-1518: Add a test parallel to "runc not blocked on rhel-9
stream" that sets the OSImageStream.Status.DefaultStream to
osimagestream.StreamNameCentOS10 (and optionally another for
osimagestream.StreamNameRHEL10) and asserts that
optr.checkRuncUpgradeableGuard() returns upgradeBlock == true, non-empty
message, and no error; reuse the same Operator setup (mcLister with
rendered-worker config containing the runc drop-in, osImageStreamLister,
fgHandler) and reference the Operator.checkRuncUpgradeableGuard method, the
osImageStreamLister/osisIndexer setup, and mockMCPLister/mockMCLister used in
the existing test to locate where to add the new test(s).
In `@pkg/operator/status.go`:
- Around line 506-514: The TOML decoding failure currently returns an error and
aborts detection; change this to log a warning and continue so one malformed
CRI-O drop-in doesn't fail the whole routine: in the block that decodes into
crioRuntimeConfig using toml.NewDecoder(bytes.NewReader(fileData)).Decode(&cfg),
replace the return of fmt.Errorf(...) with a non-fatal log (e.g., klog.Warningf
or the package's logger) that includes path and mc.Name and the decode error,
then skip using cfg for that drop-in and continue processing to allow
effectiveRuntime evaluation from other drop-ins; keep the existing behavior
(return error) only if you determine the drop-ins should always be trusted.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 326ee946-c807-400a-9eb7-82bffd5370e1
📒 Files selected for processing (2)
pkg/operator/status.gopkg/operator/status_test.go
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/operator/status.go (1)
393-412:⚠️ Potential issue | 🔴 CriticalThe claimed expansion of runc detection logic is not implemented.
The AI summary states that
cfeEvalRuncwas expanded to "evaluate rendered MachineConfig Ignition data for CRI-O drop-ins" and consider OS image streams. However, the actual implementation only checksContainerRuntimeConfigresources forDefaultRuntime == runc.While
DetectRuncInMachineConfigexists inpkg/controller/common/helpers.goandvalidateNoRuncOnRHEL10exists inpkg/controller/render/render_controller.go, neither is called bycfeEvalRunc. The current implementation will flag anyContainerRuntimeConfigwith runc regardless of target OS stream.If the intent is to also detect runc from rendered MachineConfig CRI-O drop-ins and only flag runc as an evaluation concern for RHEL 10/CentOS 10 pools, this logic is missing from the status.go implementation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/operator/status.go` around lines 393 - 412, cfeEvalRunc currently only checks ContainerRuntimeConfig.DefaultRuntime for runc but must also inspect rendered MachineConfig CRI-O drop-ins and restrict findings to RHEL10/CentOS10 pools; update cfeEvalRunc to (1) keep the existing crcLister loop for ContainerRuntimeConfig, (2) iterate rendered MachineConfigs (use DetectRuncInMachineConfig from pkg/controller/common/helpers.go) to detect runc in CRI-O drop-ins, and (3) apply the same OS-stream restriction/logic used by validateNoRuncOnRHEL10 (call or mirror that check) so the function only returns true for pools targeting RHEL10/CentOS10; reference cfeEvalRunc, DetectRuncInMachineConfig, validateNoRuncOnRHEL10, ContainerRuntimeConfig and rendered MachineConfig resources when making the changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@pkg/operator/status.go`:
- Around line 393-412: cfeEvalRunc currently only checks
ContainerRuntimeConfig.DefaultRuntime for runc but must also inspect rendered
MachineConfig CRI-O drop-ins and restrict findings to RHEL10/CentOS10 pools;
update cfeEvalRunc to (1) keep the existing crcLister loop for
ContainerRuntimeConfig, (2) iterate rendered MachineConfigs (use
DetectRuncInMachineConfig from pkg/controller/common/helpers.go) to detect runc
in CRI-O drop-ins, and (3) apply the same OS-stream restriction/logic used by
validateNoRuncOnRHEL10 (call or mirror that check) so the function only returns
true for pools targeting RHEL10/CentOS10; reference cfeEvalRunc,
DetectRuncInMachineConfig, validateNoRuncOnRHEL10, ContainerRuntimeConfig and
rendered MachineConfig resources when making the changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d132fa61-d68c-42df-87d4-3cfa7203dbea
📒 Files selected for processing (7)
pkg/controller/common/helpers.gopkg/controller/common/helpers_test.gopkg/controller/render/render_controller.gopkg/controller/render/render_controller_test.gopkg/operator/status.gopkg/operator/status_test.gopkg/osimagestream/streams.go
✅ Files skipped from review due to trivial changes (1)
- pkg/controller/render/render_controller_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/operator/status_test.go
5e4727d to
73f6f26
Compare
|
/verified by @bitoku |
|
@bitoku: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/unhold |
|
@bitoku: This pull request references OCPNODE-4443 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/controller/common/helpers.go`:
- Around line 1538-1559: GetIgnitionFileDataByPath currently returns raw bytes
which may be gzip-compressed; before attempting toml.Decode you must detect and
decompress Ignition gzip payloads so TOML parsing and runc detection won't fail.
In the loop that reads fileData (after calling GetIgnitionFileDataByPath and
before toml.NewDecoder(...).Decode(&cfg)), check for gzip magic bytes
(0x1f,0x8b) and, if present, create a gzip.Reader to decompress fileData into a
new byte slice and use that decompressed data for toml decoding; keep using the
same crioRuntimeConfig and error wrapping semantics if decompression or decoding
fails.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 59d53499-18c5-4cef-983c-09366c3b72d5
📒 Files selected for processing (7)
pkg/controller/common/helpers.gopkg/controller/common/helpers_test.gopkg/controller/render/render_controller.gopkg/controller/render/render_controller_test.gopkg/operator/status.gopkg/operator/status_test.gopkg/osimagestream/streams.go
🚧 Files skipped from review as they are similar to previous changes (5)
- pkg/osimagestream/streams.go
- pkg/controller/render/render_controller_test.go
- pkg/controller/common/helpers_test.go
- pkg/operator/status.go
- pkg/controller/render/render_controller.go
73f6f26 to
0fd053c
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/controller/common/helpers_test.go (1)
1955-1968: ⚡ Quick winAdd a reversed-input ordering case to lock in “lexical last drop-in wins.”
Current override tests pass files in both insertion and lexical order, so an implementation bug that relies on slice order could still pass. Add one case with reversed file order to pin the expected precedence.
Proposed test case
{ name: "crun overridden by runc in later drop-in", mc: helpers.NewMachineConfig("test-runc-override", nil, "", []ign3types.File{ helpers.CreateEncodedIgn3File("/etc/crio/crio.conf.d/00-default", makeCRIODropIn("crun"), 0644), helpers.CreateEncodedIgn3File("/etc/crio/crio.conf.d/01-ctrcfg", makeCRIODropIn("runc"), 0644), }), expectedMC: "test-runc-override", }, + { + name: "lexical order wins even if input slice is reversed", + mc: helpers.NewMachineConfig("test-runc-lexical", nil, "", []ign3types.File{ + helpers.CreateEncodedIgn3File("/etc/crio/crio.conf.d/01-ctrcfg", makeCRIODropIn("runc"), 0644), + helpers.CreateEncodedIgn3File("/etc/crio/crio.conf.d/00-default", makeCRIODropIn("crun"), 0644), + }), + expectedMC: "test-runc-lexical", + },🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/controller/common/helpers_test.go` around lines 1955 - 1968, The test suite in helpers_test.go relies on file insertion order and doesn't assert lexical-last-wins for CRI-O drop-ins; add a new test case that uses helpers.NewMachineConfig and helpers.CreateEncodedIgn3File with the same two drop-in contents from makeCRIODropIn("runc") and makeCRIODropIn("crun") but provide the files in reversed slice order (e.g., put the "01-ctrcfg" entry before "00-default") and assert expectedMC equals the name that should win under lexical ordering (same as the earlier test where lexical last wins); place this new case alongside the existing "runc overridden by crun..." and "crun overridden by runc..." cases so the test verifies precedence is determined by filename, not insertion order.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/controller/common/helpers_test.go`:
- Around line 1955-1968: The test suite in helpers_test.go relies on file
insertion order and doesn't assert lexical-last-wins for CRI-O drop-ins; add a
new test case that uses helpers.NewMachineConfig and
helpers.CreateEncodedIgn3File with the same two drop-in contents from
makeCRIODropIn("runc") and makeCRIODropIn("crun") but provide the files in
reversed slice order (e.g., put the "01-ctrcfg" entry before "00-default") and
assert expectedMC equals the name that should win under lexical ordering (same
as the earlier test where lexical last wins); place this new case alongside the
existing "runc overridden by crun..." and "crun overridden by runc..." cases so
the test verifies precedence is determined by filename, not insertion order.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2d6683e1-1475-4dc9-9684-b7ea885c3b8c
📒 Files selected for processing (7)
pkg/controller/common/helpers.gopkg/controller/common/helpers_test.gopkg/controller/render/render_controller.gopkg/controller/render/render_controller_test.gopkg/operator/status.gopkg/operator/status_test.gopkg/osimagestream/streams.go
🚧 Files skipped from review as they are similar to previous changes (6)
- pkg/controller/render/render_controller.go
- pkg/controller/common/helpers.go
- pkg/controller/render/render_controller_test.go
- pkg/osimagestream/streams.go
- pkg/operator/status.go
- pkg/operator/status_test.go
|
/cc @asahay19 |
|
/pipeline required |
|
Scheduling tests matching the |
|
/retest |
| // effective default container runtime is runc. Drop-ins are processed in | ||
| // lexicographic order (matching CRI-O's layering behavior) so the last file's | ||
| // default_runtime wins. | ||
| func DetectRuncInMachineConfig(mc *mcfgv1.MachineConfig) (string, error) { |
There was a problem hiding this comment.
I don't think we need to do this, but we do have the potential optimization of just checking the paths of the MC. We know what the file name is when we use ctrcfg, as well as what it is when it was an upgraded drop in. technically, they can use MC to drop in a file, but that's unsupported from node team perspective so if they break themselves... 🤷 .
I don't think this needs optimizing per-se, but it does make clear what cases we're supporting and may cut down the code a bit
There was a problem hiding this comment.
That makes complete sense, but I'd still like to keep this check.
Even if it's unsupported, user complaints will ultimately fall on us to handle.
We can just be more explicit in the comments about what is and isn't supported.
There was a problem hiding this comment.
I'm fine with this as is as well, technically, if we want to be absolutely sure, we'd be checking the on-disk crio config per node to see if anything is using runc. Practically speaking that's probably overkill.
Out of curiosity, if the user does bypass this check (let's say, they overrode /etc/crio/crio.conf instead of using a dropin), what would happen to the node post upgrade to R10? Does crio fail upon reboot?
There was a problem hiding this comment.
CRI-O fails upon reboot, yes, so I guess the node stalls NotReady.
It's unsupported operation, and not going to check that though.
|
Are we modifying the Cluster Fleet Evaluations? I'm not seeing new code in this area. By enabling a new check in the CFE code path it will set MCO as not upgradeable. |
Sorry I read it wrong. |
|
Scheduling tests matching the |
|
/payload-job-with-prs periodic-ci-openshift-release-main-nightly-5.0-e2e-gcp-ovn openshift/origin#31266 |
|
@bitoku: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command |
|
/payload-job-with-prs periodic-ci-openshift-release-main-ci-5.0-e2e-gcp-ovn openshift/origin#31266 |
|
@bitoku: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/de265370-633b-11f1-9c47-870e6f6dcba6-0 |
|
/payload-job-with-prs periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning openshift/origin#31266 |
|
@bitoku: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/67ece2a0-634f-11f1-8aab-94564898c66c-0 |
|
/payload-job-with-prs periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning-techpreview-1of2 periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning-techpreview-2of2 openshift/origin#31266 |
|
@bitoku: An error was encountered. No known errors were detected, please see the full error message for details. Full error message.
unable to get additional pr info from string: periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning-techpreview-2of2: string: periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning-techpreview-2of2 doesn't match expected format: org/repo#number
Please contact an administrator to resolve this issue. |
|
/payload-job-with-prs periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning-techpreview-1of2 openshift/origin#31266 |
|
@bitoku: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b9a6bab0-63dd-11f1-828a-621d5ba8e722-0 |
|
/payload-job-with-prs periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning-techpreview-2of2 openshift/origin#31266 |
|
@bitoku: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c1104ff0-63dd-11f1-9f15-5905ea882eb2-0 |
Add render-time validation that rejects generated MachineConfigs when runc is set as the default container runtime and the pool targets a RHEL 10 or CentOS 10 OS image stream, since runc is not available on RHCOS 10. Introduce DetectRuncInMachineConfig() to parse CRI-O drop-in files from Ignition configs and determine the effective default runtime by processing files in lexicographic order. Add IsRHEL10Stream() helper to identify RHEL 10 / CentOS 10 streams. Assisted-by: Claude Code <https://claude.com/claude-code>
Move validateNoRuncOnRHEL10 into generateRenderedMachineConfig so all callers are protected, including RunBootstrap. Co-authored-by: Ryan Phillips <rphillips@redhat.com> Assisted-by: Claude Code <https://claude.com/claude-code>
af088bb to
50a5088
Compare
|
/payload-job-with-prs periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning-techpreview-2of2 openshift/origin#31266 |
|
@bitoku: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8921abf0-6407-11f1-9686-c8cdaefe3b6e-0 |
|
/retest-required |
|
/payload-job-with-prs periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning-techpreview-2of2 openshift/origin#31266 |
|
/payload-job-with-prs periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning-techpreview-2of2 openshift/origin#31266 |
|
@bitoku: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/38c35e00-64b2-11f1-8870-ba2c7e80b8b9-0 |
|
/payload-job-with-prs periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning-techpreview-2of2 openshift/origin#31266 |
|
@bitoku: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1af69240-64d3-11f1-991c-fc8c3045c5fc-0 |
|
@bitoku: This PR was included in a payload test run from openshift/origin#31266
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/289b6a20-657c-11f1-8b2c-d2cc517bfd33-0 |
|
/payload-job-with-prs periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning-techpreview-1of2 openshift/origin#31266 |
|
@bitoku: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/324d1dc0-6644-11f1-9791-c9a6e2bd4d91-0 |
|
/payload-job-with-prs periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning-techpreview-1of2 openshift/origin#31266 |
|
@bitoku: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b8439fe0-665c-11f1-8102-7cdaa8f39c7e-0 |
Assisted-by: Claude Code https://claude.com/claude-code
- What I did
Block cluster upgrades if any MachineConfigPool targeting a RHEL 10 or CentOS 10 OS image stream has runc configured as the default container runtime. Each pool's effective stream is resolved via the OSImageStream API, so upgraded clusters that remain on RHCOS 9 are not affected.
- How to verify it
e2e, runc cluster upgrade to rhcos 10.
- Description for the changelog
Block cluster upgrades if any MachineConfigPool targeting a RHEL 10 or CentOS 10 OS image stream has runc configured as the default container runtime.
Summary by CodeRabbit
New Features
Tests