Feat/aml dev npu operator v1.2.0 kubeos by luohua13 · Pull Request #849 · alauda/acp-docs

luohua13 · 2026-06-02T06:23:38Z

Summary by CodeRabbit

Documentation
- Added comprehensive v1.2.0 release notes and upgrade guides for NPU Operator.
- Updated installation documentation with containerized driver deployment approach and verification procedures.
- Added documentation for driver lifecycle management, including rolling upgrades and chip self-healing recovery.
- Added KubeOS support and CDI-based device injection guidance.

…elease notes

The 1.2.0 release of Alauda Build of NPU Operator changes the delivery model from cluster plugin (`Marketplace > Cluster Plugins`) to OLM operator (`Marketplace > OperatorHub`). Adapt the installation page and add a release notes page mirroring the per-product layout used by hami-docs.

Installation page:
- Rename Downloading/Uploading sections from "Cluster plugin" to "Packages" and enumerate both the operator package (npu-operator) and the cluster plugin packages (NFD required, Volcano optional).
- Split installation into two subsections: NFD as a Cluster Plugin and NPU Operator via OperatorHub (Install dialog, namespace selector, deployment form).
- Bump default Driver Version to 25.5.0 and rename the row to match the form label; add a note that all operator-managed pods land in the operator namespace and that Volcano components are absent.
- Verification step 1 switches from the Cluster plugin page to the OperatorHub details page / Installed Operators view.
- Verification step 2 watches the npu-driver pod in the operator namespace (was kube-system) with a note for non-default namespaces.
- Installing Monitor: drop the obsolete manual ServiceMonitor snippet (which targeted the wrong namespaces); the operator now auto-installs npu-exporter-servicemonitor in its own namespace.

Release notes:
- v1.2.0 mapped to openFuyao npu-operator 1.2.0; headline is the cluster-plugin-to-operator delivery change (no in-place upgrade from v1.1.3) and the MindCluster/Ascend v7.3.0 stack bump.
- Downstream bug fix highlighted is the npu-exporter ServiceMonitor not taking effect; plus the two community 1.1.1 -> 1.2.0 fixes.
- v1.1.3 mapped to openFuyao npu-operator 1.1.1 (MindCluster v7.2.RC1, cluster-plugin delivery).

…-healing

Layers in all the user-facing functional changes that landed on top of the in-flight `docs/npu-operator-1.2-form-update` form-update work:

- intro.mdx: spell out v1.2.0 headline features (KubeOS, pre-compiled driver image, CDI, upgrade lifecycle, chip self-healing, Volcano removal, ServiceMonitor fix).
- installation.mdx:
  * Document the new pre-compiled driver image prerequisite (replaces the v1.1.x `.run` + DKMS path).
  * Update the supported OS list: KubeOS 6.6 (new), openEuler 22.03 LTS SP3; flag that Ubuntu 22.04 is no longer shipped out of the box.
  * Drop `runtimeClassName: ascend` from the validation workload — CDI handles device injection now.
  * Add a `npu-smi` host-access FAQ entry (no host PATH symlink on KubeOS because `/usr` is read-only).
  * Replace the deprecated uninstall command with KubeOS-aware cleanup guidance.
- upgrade.mdx (new): full walk-through of the driver upgrade flow (state machine, per-node phases, auto vs. `approve-reboot` annotation, MaxUnavailable / MaxParallelUpgrades / DrainSpec) and the chip self-healing path (health-watch loop, autoRecover gate, false-positive suppression). Includes an NPUClusterPolicy YAML reference at the end.
- release_notes.mdx: expand the v1.2.0 entry with breaking changes (delivery model, driver image, OS list, no `runtimeClass`, Volcano unbundled) and the new feature catalogue (KubeOS, upgrade lifecycle, chip self-healing, CDI, validator DaemonSet, drain-aware device-plugin, host `npu-smi`).

coderabbitai · 2026-06-02T06:23:50Z

Walkthrough

This PR comprehensively updates NPU Operator documentation for v1.2.0, covering containerized driver architecture, CDI-based device injection, driver upgrade state machines, and chip self-healing. The installation guide was reorganized for OperatorHub/Marketplace workflow, and new pages document driver lifecycle management and recovery policies.

Changes

NPU Operator v1.2.0 Documentation

| Layer / File(s) | Summary |
| --- | --- |
| Overview and Release Information `docs/en/hardware_accelerator/npu/npu_operator/intro.mdx`, `docs/en/hardware_accelerator/npu/npu_operator/release_notes.mdx` | Introduction explains NPUClusterPolicy end-to-end reconciliation scope. Release notes detail v1.2.0 breaking changes (OLM/OperatorHub delivery, containerized driver, CDI-based injection, Volcano removal), new features (KubeOS support, rolling driver upgrades, chip self-healing, validator DaemonSet, per-node npu-smi staging), and bug fixes. |
| Installation Prerequisites and Workflow `docs/en/hardware_accelerator/npu/npu_operator/installation.mdx` (lines 9–119) | Prerequisites consolidated into single section covering ACP version, NFD plugin, supported hardware, Arm64 OS, and optional MindIO SDK. Installation procedure reorganized for OperatorHub flow: install NFD, label nodes, deploy operator via deployment form. Deployment form explanations clarify operator-managed components, namespace isolation, and v1.2.0 driver image selection tied to pre-staged images. |
| Verification, Monitoring, and Post-Installation `docs/en/hardware_accelerator/npu/npu_operator/installation.mdx` (lines 120–255) | Verification updated to check subscription/driver/NPUClusterPolicy readiness and allocatable resources. `npu-smi` guidance revised (binary staged in driver pod with LD_LIBRARY_PATH). Workload validation uses CDI injection without `runtimeClassName: ascend` requirement. ServiceMonitor auto-created for NPU Exporter. FAQ and uninstall updated with node reboot-centric procedure and optional staged file cleanup. |
| Driver Upgrade and Chip Self-Healing Management `docs/en/hardware_accelerator/npu/npu_operator/upgrade.mdx` | New page documenting v1.2.0 driver lifecycle: per-node upgrade phase state machine with autoUpgrade approval gates; upgradePolicy configuration and manual upgrade walk-through with node label/annotation/event tables; chip self-healing health-watch loop detecting runtime wedges; recoveryPolicy.autoRecover switch for automatic vs manual reboot; false-positive suppression when chip recovers before reboot; CDI and legacy runtimeClassName coexistence; quick-reference NPUClusterPolicy YAML example. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

chinameok
jianliao82

🐰 A tale of drivers, v1.2 so fine,
Containers and CDI, in design align,
Upgrades rolled smoothly, self-healing too,
The NPU docs are shiny and new!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ❓ Inconclusive | The title references the main change: NPU Operator v1.2.0 with KubeOS support. However, it uses a branch naming convention (Feat/aml dev) rather than a clear, descriptive PR title format. | Consider using a clearer title format like 'Add NPU Operator v1.2.0 documentation with KubeOS support' to improve readability and clarity for scanning commit history. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

docs/en/hardware_accelerator/npu/npu_operator/installation.mdx (1)

124-128: ⚡ Quick win

Consider using `kubectl wait` or a label selector for better reliability.

The `-w` (watch) flag piped to `grep` can cause output buffering issues: users may see no output even when the pod is starting. For a verification step, consider:

kubectl -n npu-operator wait --for=condition=ready pod -l app=npu-driver --timeout=5m

or, if you need streaming output, watch only the driver pods:

kubectl -n npu-operator get pod -w -l app=npu-driver
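
If the verification step ends up scripted rather than typed by hand, the wait-then-report pattern could look roughly like the sketch below. It is only an illustration: the function name `wait_for_driver` and the `KUBECTL` override are hypothetical, and the `npu-operator` namespace and `app=npu-driver` label are taken from the suggestion above, not checked against what the operator actually sets on its pods.

```bash
#!/usr/bin/env bash
# Sketch: block until the driver pods are Ready, or print their state on failure.
# Assumptions (not verified against the operator):
#   - namespace "npu-operator"
#   - pod label "app=npu-driver"
KUBECTL="${KUBECTL:-kubectl}"   # overridable, e.g. for a dry run with a stub

wait_for_driver() {
  local ns="${1:-npu-operator}"
  local selector="${2:-app=npu-driver}"
  local timeout="${3:-5m}"

  if "$KUBECTL" -n "$ns" wait --for=condition=ready pod -l "$selector" --timeout="$timeout"; then
    echo "driver pods ready"
  else
    echo "driver pods not ready within $timeout" >&2
    # Show the current pod state to help diagnose the failure.
    "$KUBECTL" -n "$ns" get pod -l "$selector" -o wide >&2 || true
    return 1
  fi
}
```

One caveat worth stating in the doc if this route is taken: `kubectl wait` typically fails immediately when no pod matches the selector yet, so it suits a check that runs after the driver DaemonSet has already created its pods.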

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/en/hardware_accelerator/npu/npu_operator/installation.mdx` around lines
124 - 128, The current verification step uses "kubectl -n npu-operator get pod
-w | grep npu-driver", which can suffer from output buffering and unreliable
results; replace this with a label-aware wait or watch command instead—use
"kubectl -n npu-operator wait --for=condition=ready pod -l app=npu-driver
--timeout=5m" to reliably wait for the npu-driver pod to become ready, or if you
need streaming output use "kubectl -n npu-operator get pod -w -l app=npu-driver"
to watch only the driver pods.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/en/hardware_accelerator/npu/npu_operator/upgrade.mdx`:
- Line 78: The doc uses incorrect CRD field casing `MaxUnavailable` and
`MaxParallelUpgrades`; update the text to use the exact schema field names
`maxUnavailable` and `maxParallelUpgrades` (lower camelCase) so examples and
descriptions match the CRD and avoid copy/paste misconfiguration.
- Line 22: The upgrade example under the "change spec.driver.version" heading
currently shows a downgrade "25.5.0 -> 25.3.RC1"; update that example to an
increasing version direction (e.g., "25.3.RC1 -> 25.5.0") so the walkthrough
correctly reflects an upgrade operation and avoids operational confusion.

---

Nitpick comments:
In `@docs/en/hardware_accelerator/npu/npu_operator/installation.mdx`:
- Around line 124-128: The current verification step uses "kubectl -n
npu-operator get pod -w | grep npu-driver", which can suffer from output
buffering and unreliable results; replace this with a label-aware wait or watch
command instead—use "kubectl -n npu-operator wait --for=condition=ready pod -l
app=npu-driver --timeout=5m" to reliably wait for the npu-driver pod to become
ready, or if you need streaming output use "kubectl -n npu-operator get pod -w
-l app=npu-driver" to watch only the driver pods.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 73e7caf1-30d4-449d-bfb4-c233173532f9

📥 Commits

Reviewing files that changed from the base of the PR and between d7f2ff6 and 5139af8.

📒 Files selected for processing (4)

docs/en/hardware_accelerator/npu/npu_operator/installation.mdx
docs/en/hardware_accelerator/npu/npu_operator/intro.mdx
docs/en/hardware_accelerator/npu/npu_operator/release_notes.mdx
docs/en/hardware_accelerator/npu/npu_operator/upgrade.mdx

coderabbitai · 2026-06-02T06:31:00Z

+
+```bash
+kubectl edit npuclusterpolicy cluster
+# change spec.driver.version, e.g. 25.5.0 -> 25.3.RC1


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Upgrade example currently shows a downgrade version direction.

Line 22 shows 25.5.0 -> 25.3.RC1, which reads as a downgrade in an “upgrade” walkthrough. Suggest flipping to an increasing example (for example, 25.3.RC1 -> 25.5.0) to avoid operational confusion.
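
To guard against this kind of slip in examples or scripts, the version direction can be checked mechanically. The sketch below is a hypothetical helper (`is_upgrade` is not part of the operator or the docs) that relies on `sort -V`, which is available in GNU coreutils; it assumes the driver version strings order sensibly under version sort, which holds for the pair quoted here but may not hold for a release candidate compared against the final release of the same minor version.

```bash
#!/usr/bin/env bash
# is_upgrade OLD NEW
# Succeeds only when NEW sorts strictly after OLD under version sort.
# Assumption: `sort -V` (GNU coreutils) orders these version strings correctly.
is_upgrade() {
  local old="$1" new="$2"
  # Identical versions are not an upgrade.
  [ "$old" != "$new" ] || return 1
  # After version sort, the last line is the higher version.
  local highest
  highest="$(printf '%s\n%s\n' "$old" "$new" | sort -V | tail -n 1)"
  [ "$highest" = "$new" ]
}
```

With this helper, `is_upgrade 25.3.RC1 25.5.0` succeeds, while `is_upgrade 25.5.0 25.3.RC1` fails, matching the direction the walkthrough should show.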

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/en/hardware_accelerator/npu/npu_operator/upgrade.mdx` at line 22, The upgrade example under the "change spec.driver.version" heading currently shows a downgrade "25.5.0 -> 25.3.RC1"; update that example to an increasing version direction (e.g., "25.3.RC1 -> 25.5.0") so the walkthrough correctly reflects an upgrade operation and avoids operational confusion.

coderabbitai · 2026-06-02T06:31:00Z

+        timeoutSecond: 1200
+```
+
+`MaxUnavailable` and `MaxParallelUpgrades` together gate how many nodes leave the available pool at once; the rebooter additionally serializes the actual reboot step via a cluster-wide annotation so only one node reboots at a time even when several are in the unavailable region.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use exact CRD field names to avoid copy/paste misconfiguration.

Line 78 uses MaxUnavailable and MaxParallelUpgrades, but the documented schema uses maxUnavailable and maxParallelUpgrades (lower camel case). In docs for Kubernetes specs, casing mismatches can cause invalid manifests if users copy terms literally.
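
A small check along these lines could catch the same casing slip elsewhere in the page. This is a sketch only: `find_miscased_fields` is a hypothetical helper, and the two patterns are just the spellings flagged in this comment, not an exhaustive list of CRD fields.

```bash
#!/usr/bin/env bash
# find_miscased_fields FILE
# Prints "line:text" for every line that uses the PascalCase spellings.
# Exit status follows grep: 0 if any match is found, 1 if the file is clean.
find_miscased_fields() {
  # -w matches whole words only; the search is case-sensitive,
  # so the correct lower camelCase names are not reported.
  grep -nwE 'MaxUnavailable|MaxParallelUpgrades' "$1"
}
```

Run against `upgrade.mdx`, a non-empty result points at lines to fix; in a lint step the exit status would need to be inverted so that a clean file passes.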

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/en/hardware_accelerator/npu/npu_operator/upgrade.mdx` at line 78, The doc uses incorrect CRD field casing `MaxUnavailable` and `MaxParallelUpgrades`; update the text to use the exact schema field names `maxUnavailable` and `maxParallelUpgrades` (lower camelCase) so examples and descriptions match the CRD and avoid copy/paste misconfiguration.

luohua13 added 2 commits June 2, 2026 06:22

coderabbitai Bot reviewed Jun 2, 2026

luohua13 wants to merge 2 commits into master from feat/aml-dev-npu-operator-v1.2.0-kubeos
