Feat/aml dev npu operator v1.2.0 kubeos#849
…elease notes

The 1.2.0 release of Alauda Build of NPU Operator changes the delivery model from cluster plugin (`Marketplace > Cluster Plugins`) to OLM operator (`Marketplace > OperatorHub`). Adapt the installation page and add a release notes page mirroring the per-product layout used by hami-docs.

Installation page:
- Rename Downloading/Uploading sections from "Cluster plugin" to "Packages" and enumerate both the operator package (npu-operator) and the cluster plugin packages (NFD required, Volcano optional).
- Split installation into two subsections: NFD as a Cluster Plugin and NPU Operator via OperatorHub (Install dialog, namespace selector, deployment form).
- Bump default Driver Version to 25.5.0 and rename the row to match the form label; add a note that all operator-managed pods land in the operator namespace and that Volcano components are absent.
- Verification step 1 switches from the Cluster plugin page to the OperatorHub details page / Installed Operators view.
- Verification step 2 watches the npu-driver pod in the operator namespace (was kube-system) with a note for non-default namespaces.
- Installing Monitor: drop the obsolete manual ServiceMonitor snippet (which targeted the wrong namespaces); the operator now auto-installs npu-exporter-servicemonitor in its own namespace.

Release notes:
- v1.2.0 mapped to openFuyao npu-operator 1.2.0; headline is the cluster-plugin-to-operator delivery change (no in-place upgrade from v1.1.3) and the MindCluster/Ascend v7.3.0 stack bump.
- Downstream bug fix highlighted is the npu-exporter ServiceMonitor not taking effect; plus the two community 1.1.1 -> 1.2.0 fixes.
- v1.1.3 mapped to openFuyao npu-operator 1.1.1 (MindCluster v7.2.RC1, cluster-plugin delivery).
…-healing
Layers in all the user-facing functional changes that landed on top of
the in-flight `docs/npu-operator-1.2-form-update` form-update work:
- intro.mdx: spell out v1.2.0 headline features (KubeOS, pre-compiled
driver image, CDI, upgrade lifecycle, chip self-healing, Volcano
removal, ServiceMonitor fix).
- installation.mdx:
* Document the new pre-compiled driver image prerequisite (replaces
the v1.1.x `.run` + DKMS path).
* Update the supported OS list: KubeOS 6.6 (new), openEuler 22.03 LTS
SP3; flag that Ubuntu 22.04 is no longer shipped out of the box.
* Drop `runtimeClassName: ascend` from the validation workload —
CDI handles device injection now.
* Add a `npu-smi` host-access FAQ entry (no host PATH symlink on
KubeOS because `/usr` is read-only).
* Replace the deprecated uninstall command with KubeOS-aware
cleanup guidance.
- upgrade.mdx (new): full walk-through of the driver upgrade flow
(state machine, per-node phases, auto vs. `approve-reboot`
annotation, MaxUnavailable / MaxParallelUpgrades / DrainSpec) and
the chip self-healing path (health-watch loop, autoRecover gate,
false-positive suppression). Includes an NPUClusterPolicy YAML
reference at the end.
- release_notes.mdx: expand the v1.2.0 entry with breaking changes
(delivery model, driver image, OS list, no `runtimeClass`, Volcano
unbundled) and the new feature catalogue (KubeOS, upgrade
lifecycle, chip self-healing, CDI, validator DaemonSet,
drain-aware device-plugin, host `npu-smi`).
Walkthrough

This PR comprehensively updates NPU Operator documentation for v1.2.0, covering containerized driver architecture, CDI-based device injection, driver upgrade state machines, and chip self-healing. The installation guide was reorganized for OperatorHub/Marketplace workflow, and new pages document driver lifecycle management and recovery policies.

Changes: NPU Operator v1.2.0 Documentation
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
Actionable comments posted: 2
🧹 Nitpick comments (1)
docs/en/hardware_accelerator/npu/npu_operator/installation.mdx (1)
124-128: ⚡ Quick win

Consider using `kubectl wait` or a label selector for better reliability.

The `-w` (watch) flag piped to `grep` can cause output buffering issues: users may see no output even when the pod is starting. For a verification step, consider:

```bash
kubectl -n npu-operator wait --for=condition=ready pod -l app=npu-driver --timeout=5m
```

or, if you need streaming output limited to the driver pods:

```bash
kubectl -n npu-operator get pod -w -l app=npu-driver
```
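As a follow-up to the suggestion above, the wait-based check could be wrapped in a small helper so that the namespace becomes a parameter, which matters when the operator is installed outside the default namespace. This is only a sketch: the function name `wait_for_driver` is hypothetical, and the `app=npu-driver` label is taken from the review suggestion rather than verified against the driver DaemonSet. Note also that `kubectl wait` typically fails right away when no pod matches the selector yet, so a retry may be needed on a freshly installed cluster.

```bash
# Hypothetical helper: block until the driver pods report Ready.
# Assumptions (not verified against the operator): the default namespace
# is "npu-operator" and the driver pods carry the label "app=npu-driver".
wait_for_driver() (
  ns="${1:-npu-operator}"          # namespace the operator was installed into
  selector="${2:-app=npu-driver}"  # label selector for the driver pods
  timeout="${3:-5m}"               # upper bound handed to kubectl wait
  kubectl -n "$ns" wait --for=condition=Ready pod -l "$selector" --timeout="$timeout"
)
```

For a non-default install, `wait_for_driver my-operator-ns` would run the same check against `my-operator-ns`; the exit status is whatever `kubectl wait` returns.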
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/en/hardware_accelerator/npu/npu_operator/upgrade.mdx`:
- Line 78: The doc uses incorrect CRD field casing `MaxUnavailable` and
`MaxParallelUpgrades`; update the text to use the exact schema field names
`maxUnavailable` and `maxParallelUpgrades` (lower camelCase) so examples and
descriptions match the CRD and avoid copy/paste misconfiguration.
- Line 22: The upgrade example under the "change spec.driver.version" heading
currently shows a downgrade "25.5.0 -> 25.3.RC1"; update that example to an
increasing version direction (e.g., "25.3.RC1 -> 25.5.0") so the walkthrough
correctly reflects an upgrade operation and avoids operational confusion.
---
Nitpick comments:
In `@docs/en/hardware_accelerator/npu/npu_operator/installation.mdx`:
- Around line 124-128: The current verification step uses "kubectl -n
npu-operator get pod -w | grep npu-driver", which can suffer from output
buffering and unreliable results; replace this with a label-aware wait or watch
command instead—use "kubectl -n npu-operator wait --for=condition=ready pod -l
app=npu-driver --timeout=5m" to reliably wait for the npu-driver pod to become
ready, or if you need streaming output use "kubectl -n npu-operator get pod -w
-l app=npu-driver" to watch only the driver pods.
📒 Files selected for processing (4)
- docs/en/hardware_accelerator/npu/npu_operator/installation.mdx
- docs/en/hardware_accelerator/npu/npu_operator/intro.mdx
- docs/en/hardware_accelerator/npu/npu_operator/release_notes.mdx
- docs/en/hardware_accelerator/npu/npu_operator/upgrade.mdx
```bash
kubectl edit npuclusterpolicy cluster
# change spec.driver.version, e.g. 25.5.0 -> 25.3.RC1
```
Upgrade example currently shows a downgrade version direction.
Line 22 shows 25.5.0 -> 25.3.RC1, which reads as a downgrade in an “upgrade” walkthrough. Suggest flipping to an increasing example (for example, 25.3.RC1 -> 25.5.0) to avoid operational confusion.
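A quick way to sanity-check the direction of a version pair before it goes into the walkthrough is a version-aware sort. The sketch below assumes GNU `sort -V` (or a compatible implementation) is available; the function name `version_direction` is hypothetical. Pre-release suffixes such as `RC1` may not order the way the vendor intends, so treat the result as a hint rather than a definitive comparison.

```bash
# Hypothetical helper: report whether going from OLD to NEW is an upgrade.
# Assumes `sort -V` (version sort, as in GNU coreutils) is available.
# Usage: version_direction OLD NEW  -> prints upgrade | downgrade | same
version_direction() (
  old="$1"
  new="$2"
  if [ "$old" = "$new" ]; then
    echo same
  else
    # The first line after a version sort is the lower of the two versions.
    lowest=$(printf '%s\n%s\n' "$old" "$new" | sort -V | head -n 1)
    if [ "$lowest" = "$old" ]; then
      echo upgrade
    else
      echo downgrade
    fi
  fi
)
```

With the pair quoted in this comment, `version_direction 25.5.0 25.3.RC1` reports `downgrade`, and the flipped pair reports `upgrade`.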
```yaml
timeoutSecond: 1200
```

`MaxUnavailable` and `MaxParallelUpgrades` together gate how many nodes leave the available pool at once; the rebooter additionally serializes the actual reboot step via a cluster-wide annotation so only one node reboots at a time even when several are in the unavailable region.
Use exact CRD field names to avoid copy/paste misconfiguration.
Line 78 uses MaxUnavailable and MaxParallelUpgrades, but the documented schema uses maxUnavailable and maxParallelUpgrades (lower camel case). In docs for Kubernetes specs, casing mismatches can cause invalid manifests if users copy terms literally.
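To catch any other occurrences of the PascalCase spellings, a small grep-based check could be run over the docs. This is a sketch only: the function name `find_miscased_fields` is hypothetical, the pattern covers just the two names flagged in this comment, and it relies on the `-w` (whole-word) option that GNU and BSD `grep` both provide.

```bash
# Hypothetical lint helper: print, with line numbers, every line that uses
# the PascalCase spellings flagged in this review. grep is case-sensitive
# by default, so the correct lower camelCase names are not reported.
# Reads the files given as arguments, or stdin when none are given.
find_miscased_fields() {
  grep -nwE 'MaxUnavailable|MaxParallelUpgrades' "$@"
}
```

For example, `find_miscased_fields docs/en/hardware_accelerator/npu/npu_operator/upgrade.mdx` lists the offending lines; a non-zero exit status means nothing matched.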
Summary by CodeRabbit