Feat/aml dev npu operator v1.2.0 kubeos #849

Open
luohua13 wants to merge 2 commits into master from feat/aml-dev-npu-operator-v1.2.0-kubeos

luohua13 (Contributor) commented Jun 2, 2026

Summary by CodeRabbit

  • Documentation
    • Added comprehensive v1.2.0 release notes and upgrade guides for NPU Operator.
    • Updated installation documentation with containerized driver deployment approach and verification procedures.
    • Added documentation for driver lifecycle management, including rolling upgrades and chip self-healing recovery.
    • Added KubeOS support and CDI-based device injection guidance.

luohua13 added 2 commits June 2, 2026 06:22
…elease notes

The 1.2.0 release of Alauda Build of NPU Operator changes the delivery
model from cluster plugin (`Marketplace > Cluster Plugins`) to OLM
operator (`Marketplace > OperatorHub`). Adapt the installation page
and add a release notes page mirroring the per-product layout used
by hami-docs.

Installation page:
- Rename Downloading/Uploading sections from "Cluster plugin" to
  "Packages" and enumerate both the operator package (npu-operator)
  and the cluster plugin packages (NFD required, Volcano optional).
- Split installation into two subsections: NFD as a Cluster Plugin
  and NPU Operator via OperatorHub (Install dialog, namespace
  selector, deployment form).
- Bump default Driver Version to 25.5.0 and rename the row to match
  the form label; add a note that all operator-managed pods land in
  the operator namespace and that Volcano components are absent.
- Verification step 1 switches from the Cluster plugin page to the
  OperatorHub details page / Installed Operators view.
- Verification step 2 watches the npu-driver pod in the operator
  namespace (was kube-system) with a note for non-default namespaces.
- Installing Monitor: drop the obsolete manual ServiceMonitor snippet
  (which targeted the wrong namespaces); the operator now auto-
  installs npu-exporter-servicemonitor in its own namespace.

Release notes:
- v1.2.0 mapped to openFuyao npu-operator 1.2.0; headline is the
  cluster-plugin-to-operator delivery change (no in-place upgrade
  from v1.1.3) and the MindCluster/Ascend v7.3.0 stack bump.
- Downstream bug fix highlighted is the npu-exporter ServiceMonitor
  not taking effect; plus the two community 1.1.1 -> 1.2.0 fixes.
- v1.1.3 mapped to openFuyao npu-operator 1.1.1 (MindCluster v7.2.RC1,
  cluster-plugin delivery).
…-healing

Layers in all the user-facing functional changes that landed on top of
the in-flight `docs/npu-operator-1.2-form-update` form-update work:

- intro.mdx: spell out v1.2.0 headline features (KubeOS, pre-compiled
  driver image, CDI, upgrade lifecycle, chip self-healing, Volcano
  removal, ServiceMonitor fix).

- installation.mdx:
  * Document the new pre-compiled driver image prerequisite (replaces
    the v1.1.x `.run` + DKMS path).
  * Update the supported OS list: KubeOS 6.6 (new), openEuler 22.03 LTS
    SP3; flag that Ubuntu 22.04 is no longer shipped out of the box.
  * Drop `runtimeClassName: ascend` from the validation workload —
    CDI handles device injection now.
  * Add a `npu-smi` host-access FAQ entry (no host PATH symlink on
    KubeOS because `/usr` is read-only).
  * Replace the deprecated uninstall command with KubeOS-aware
    cleanup guidance.

- upgrade.mdx (new): full walk-through of the driver upgrade flow
  (state machine, per-node phases, auto vs. `approve-reboot`
  annotation, MaxUnavailable / MaxParallelUpgrades / DrainSpec) and
  the chip self-healing path (health-watch loop, autoRecover gate,
  false-positive suppression). Includes an NPUClusterPolicy YAML
  reference at the end.

- release_notes.mdx: expand the v1.2.0 entry with breaking changes
  (delivery model, driver image, OS list, no `runtimeClass`, Volcano
  unbundled) and the new feature catalogue (KubeOS, upgrade
  lifecycle, chip self-healing, CDI, validator DaemonSet,
  drain-aware device-plugin, host `npu-smi`).

coderabbitai Bot (Contributor) commented Jun 2, 2026

Walkthrough

This PR comprehensively updates NPU Operator documentation for v1.2.0, covering containerized driver architecture, CDI-based device injection, driver upgrade state machines, and chip self-healing. The installation guide was reorganized for OperatorHub/Marketplace workflow, and new pages document driver lifecycle management and recovery policies.

Changes

NPU Operator v1.2.0 Documentation

| Layer / File(s) | Summary |
| --- | --- |
| Overview and Release Information: docs/en/hardware_accelerator/npu/npu_operator/intro.mdx, docs/en/hardware_accelerator/npu/npu_operator/release_notes.mdx | Introduction explains NPUClusterPolicy end-to-end reconciliation scope. Release notes detail v1.2.0 breaking changes (OLM/OperatorHub delivery, containerized driver, CDI-based injection, Volcano removal), new features (KubeOS support, rolling driver upgrades, chip self-healing, validator DaemonSet, per-node npu-smi staging), and bug fixes. |
| Installation Prerequisites and Workflow: docs/en/hardware_accelerator/npu/npu_operator/installation.mdx (lines 9–119) | Prerequisites consolidated into single section covering ACP version, NFD plugin, supported hardware, Arm64 OS, and optional MindIO SDK. Installation procedure reorganized for OperatorHub flow: install NFD, label nodes, deploy operator via deployment form. Deployment form explanations clarify operator-managed components, namespace isolation, and v1.2.0 driver image selection tied to pre-staged images. |
| Verification, Monitoring, and Post-Installation: docs/en/hardware_accelerator/npu/npu_operator/installation.mdx (lines 120–255) | Verification updated to check subscription/driver/NPUClusterPolicy readiness and allocatable resources. npu-smi guidance revised (binary staged in driver pod with LD_LIBRARY_PATH). Workload validation uses CDI injection without runtimeClassName: ascend requirement. ServiceMonitor auto-created for NPU Exporter. FAQ and uninstall updated with node reboot-centric procedure and optional staged file cleanup. |
| Driver Upgrade and Chip Self-Healing Management: docs/en/hardware_accelerator/npu/npu_operator/upgrade.mdx | New page documenting v1.2.0 driver lifecycle: per-node upgrade phase state machine with autoUpgrade approval gates; upgradePolicy configuration and manual upgrade walk-through with node label/annotation/event tables; chip self-healing health-watch loop detecting runtime wedges; recoveryPolicy.autoRecover switch for automatic vs manual reboot; false-positive suppression when chip recovers before reboot; CDI and legacy runtimeClassName coexistence; quick-reference NPUClusterPolicy YAML example. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

  • chinameok
  • jianliao82

🐰 A tale of drivers, v1.2 so fine,
Containers and CDI, in design align,
Upgrades rolled smoothly, self-healing too,
The NPU docs are shiny and new!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ❓ Inconclusive | The title references the main change: NPU Operator v1.2.0 with KubeOS support. However, it uses a branch naming convention (Feat/aml dev) rather than a clear, descriptive PR title format. | Consider using a clearer title format like 'Add NPU Operator v1.2.0 documentation with KubeOS support' to improve readability and clarity for scanning commit history. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
docs/en/hardware_accelerator/npu/npu_operator/installation.mdx (1)

124-128: ⚡ Quick win

Consider using kubectl wait or label selector for better reliability.

Piping the output of the -w (watch) flag into grep can cause output buffering issues: users may see no output even when the pod is starting. For a verification step, consider:

kubectl -n npu-operator wait --for=condition=ready pod -l app=npu-driver --timeout=5m

or if you need to watch all pods:

kubectl -n npu-operator get pod -w -l app=npu-driver
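
If this check is meant to run from a script rather than interactively, the wait-based variant could be wrapped so that a failure prints the pod state instead of ending silently. The following is a minimal sketch and not part of the PR: the function name `wait_for_driver` is hypothetical, and the `npu-operator` namespace and `app=npu-driver` label are taken from the suggestion above and are assumed to match what the operator actually sets.

```bash
#!/usr/bin/env bash
# Hypothetical helper: block until the driver pods are Ready, or report why not.
# Assumes the driver pods carry the label app=npu-driver (taken from the
# review suggestion, not verified against the operator).
wait_for_driver() {
  local ns="${1:-npu-operator}"
  local timeout="${2:-5m}"

  if kubectl -n "$ns" wait --for=condition=Ready pod \
       -l app=npu-driver --timeout="$timeout" >/dev/null; then
    echo "npu-driver pods are Ready"
  else
    echo "npu-driver pods did not become Ready (timeout ${timeout} or no matching pods)" >&2
    # Print the current pod state to stderr to help diagnose the failure.
    kubectl -n "$ns" get pod -l app=npu-driver -o wide >&2
    return 1
  fi
}
```

Calling `wait_for_driver npu-operator 5m` returns non-zero on failure, so it can gate later verification steps.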
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/en/hardware_accelerator/npu/npu_operator/upgrade.mdx`:
- Line 78: The doc uses incorrect CRD field casing `MaxUnavailable` and
`MaxParallelUpgrades`; update the text to use the exact schema field names
`maxUnavailable` and `maxParallelUpgrades` (lower camelCase) so examples and
descriptions match the CRD and avoid copy/paste misconfiguration.
- Line 22: The upgrade example under the "change spec.driver.version" heading
currently shows a downgrade "25.5.0 -> 25.3.RC1"; update that example to an
increasing version direction (e.g., "25.3.RC1 -> 25.5.0") so the walkthrough
correctly reflects an upgrade operation and avoids operational confusion.
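
The version-direction finding could also be guarded against mechanically before `spec.driver.version` is edited. This is a minimal sketch, assuming GNU `sort -V` (version sort) is available. The helper name `is_upgrade` is hypothetical, and the comparison is only meaningful across different numeric versions, since `sort -V` has no notion that an `RC` suffix precedes the final release of the same version.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if "$2" sorts strictly after "$1"
# under version ordering (relies on GNU sort -V).
is_upgrade() {
  local from="$1" to="$2"
  # Identical versions are not an upgrade.
  [ "$from" != "$to" ] || return 1
  local lowest
  lowest="$(printf '%s\n%s\n' "$from" "$to" | sort -V | head -n 1)"
  # It is an upgrade when the current version is the lower of the two.
  [ "$lowest" = "$from" ]
}
```

With this helper, `is_upgrade 25.3.RC1 25.5.0` succeeds and `is_upgrade 25.5.0 25.3.RC1` fails, which matches the direction the review asks the example to show.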

---

Nitpick comments:
In `@docs/en/hardware_accelerator/npu/npu_operator/installation.mdx`:
- Around line 124-128: The current verification step uses "kubectl -n
npu-operator get pod -w | grep npu-driver", which can suffer from output
buffering and unreliable results; replace this with a label-aware wait or watch
command instead—use "kubectl -n npu-operator wait --for=condition=ready pod -l
app=npu-driver --timeout=5m" to reliably wait for the npu-driver pod to become
ready, or if you need streaming output use "kubectl -n npu-operator get pod -w
-l app=npu-driver" to watch only the driver pods.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 73e7caf1-30d4-449d-bfb4-c233173532f9

📥 Commits

Reviewing files that changed from the base of the PR and between d7f2ff6 and 5139af8.

📒 Files selected for processing (4)
  • docs/en/hardware_accelerator/npu/npu_operator/installation.mdx
  • docs/en/hardware_accelerator/npu/npu_operator/intro.mdx
  • docs/en/hardware_accelerator/npu/npu_operator/release_notes.mdx
  • docs/en/hardware_accelerator/npu/npu_operator/upgrade.mdx


```bash
kubectl edit npuclusterpolicy cluster
# change spec.driver.version, e.g. 25.5.0 -> 25.3.RC1

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Upgrade example currently shows a downgrade version direction.

Line 22 shows 25.5.0 -> 25.3.RC1, which reads as a downgrade in an “upgrade” walkthrough. Suggest flipping to an increasing example (for example, 25.3.RC1 -> 25.5.0) to avoid operational confusion.


timeoutSecond: 1200
```

`MaxUnavailable` and `MaxParallelUpgrades` together gate how many nodes leave the available pool at once; the rebooter additionally serializes the actual reboot step via a cluster-wide annotation so only one node reboots at a time even when several are in the unavailable region.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use exact CRD field names to avoid copy/paste misconfiguration.

Line 78 uses MaxUnavailable and MaxParallelUpgrades, but the documented schema uses maxUnavailable and maxParallelUpgrades (lower camel case). In docs for Kubernetes specs, casing mismatches can cause invalid manifests if users copy terms literally.
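
A casing slip like this could be caught with a simple text check before merge. The sketch below is an illustration only: the helper name `check_field_casing` is hypothetical, and the two PascalCase spellings it looks for are just the ones named in this comment.

```bash
#!/usr/bin/env bash
# Hypothetical lint: fail if a doc file spells the CRD fields in PascalCase.
# grep -w matches whole words only; -n prints the line number of each hit.
check_field_casing() {
  local file="$1"
  if grep -nwE 'MaxUnavailable|MaxParallelUpgrades' "$file"; then
    echo "found PascalCase field names; use maxUnavailable / maxParallelUpgrades" >&2
    return 1
  fi
  return 0
}
```

Run against docs/en/hardware_accelerator/npu/npu_operator/upgrade.mdx, the check prints each offending line and returns non-zero until the prose uses the lower camelCase names.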

