From 7e62376a002d996284ffe5b2ad976b1b7a2c0738 Mon Sep 17 00:00:00 2001 From: Aleksei Sviridkin Date: Thu, 28 May 2026 21:31:33 +0300 Subject: [PATCH 1/3] docs(gpu): drop manual KubeVirt patch step now that the platform auto-wires permittedHostDevices MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Step 2 of the GPU Passthrough guide instructed operators to `kubectl edit kubevirt -n cozy-kubevirt` and hand-paste a permittedHostDevices.pciHostDevices block. cozystack/cozystack#2768 removes the need for that step: when cozystack.gpu-operator is in bundles.enabledPackages, the platform now mirrors the chosen GPU variant into the KubeVirt CR automatically — appending HostDevices to the feature-gate list and rendering a starter NVIDIA pciHostDevices table covering Hopper, Ada Lovelace, Ampere, Turing and Volta. The new step 2 documents the contract (what the platform auto-injects and why), the verification recipe, the escape hatch via .gpu.permittedHostDevices / .gpu.replaceDefaults, and the manual Package-CR override path used by operators who need overrides the bundle does not expose (driver settings, custom node selectors, validator / dcgmExporter tweaks) — in that flow they also hand-craft the matching cozystack.kubevirt Package CR. Only next/virtualization/gpu.md is updated; v1.4 and earlier describe releases that still require the manual patch and stay as-is. Assisted-By: Claude Signed-off-by: Aleksei Sviridkin --- content/en/docs/next/virtualization/gpu.md | 50 ++++++++++++++-------- 1 file changed, 32 insertions(+), 18 deletions(-) diff --git a/content/en/docs/next/virtualization/gpu.md b/content/en/docs/next/virtualization/gpu.md index bc71d894..745de72d 100644 --- a/content/en/docs/next/virtualization/gpu.md +++ b/content/en/docs/next/virtualization/gpu.md @@ -100,32 +100,46 @@ Allocatable: For example, the database entry for A10 reads `2236 GA102GL [A10]`, which results in a resource name `nvidia.com/GA102GL_A10`. 
{{% /alert %}} -## 2. Update the KubeVirt Custom Resource +## 2. KubeVirt is wired automatically -Next, we will update the KubeVirt Custom Resource, as documented in the -[KubeVirt user guide](https://kubevirt.io/user-guide/virtual_machines/host-devices/#listing-permitted-devices), -so that the passthrough GPUs are permitted and can be requested by a KubeVirt VM. +When `cozystack.gpu-operator` is in `bundles.enabledPackages`, Cozystack mirrors the chosen GPU variant into the `KubeVirt` Custom Resource for you. There is no `kubectl edit kubevirt` step. -Adjust the `pciVendorSelector` and `resourceName` values to match your specific GPU model. -Setting `externalResourceProvider=true` indicates that this resource is provided by an external device plugin, -in this case the `sandbox-device-plugin` which is deployed by the Operator. +Specifically, the platform injects: + +- `HostDevices` into `spec.configuration.developerConfiguration.featureGates` (current KubeVirt splits this from the `GPU` gate; the admission webhook rejects `domain.devices.hostDevices` without it). +- A starter `spec.configuration.permittedHostDevices.pciHostDevices` table covering common NVIDIA datacenter GPUs — Hopper (H100, H200), Ada Lovelace (L4, L40, L40S), Ampere (A100 PCIe/SXM, A40, A30, A10), Turing (T4), Volta (V100, V100S). PCI vendor:device pairs are stable; `resourceName` slugs follow the `__
_` convention `nvidia-sandbox-device-plugin` v25.x emits (e.g. `nvidia.com/GA102GL_A10`). `externalResourceProvider: true` is set on every entry because the resources are advertised by the sandbox plugin, not by KubeVirt's in-tree device plugin. + +Verify the resulting CR: ```bash -kubectl edit kubevirt -n cozy-kubevirt +kubectl -n cozy-kubevirt get kubevirt kubevirt -o yaml \ + | yq '.spec.configuration | {featureGates: .developerConfiguration.featureGates, permittedHostDevices: .permittedHostDevices}' ``` -example config: + +### Extending or replacing the NVIDIA defaults + +If your cluster ships a GPU not in the default table, or your `nvidia-sandbox-device-plugin` version emits a different `resourceName` (check with `kubectl describe node | grep nvidia.com/`), extend the defaults via platform values: + ```yaml - ... - spec: - configuration: - permittedHostDevices: - pciHostDevices: - - externalResourceProvider: true - pciVendorSelector: 10DE:2236 - resourceName: nvidia.com/GA102GL_A10 - ... +# Platform Package values +gpu: + # Append (default) — your entries land alongside the NVIDIA table. + # Set to true to drop the NVIDIA table entirely (useful for non-NVIDIA-only + # clusters or strict allowlists). With replaceDefaults: true and an empty + # list below, the rendered CR carries no permittedHostDevices block at all + # and the admission webhook rejects every GPU VM — supply your own list. + replaceDefaults: false + permittedHostDevices: + pciHostDevices: + - pciVendorSelector: "10DE:2236" + resourceName: nvidia.com/GA102GL_A10 + externalResourceProvider: true ``` +### Manual Package-CR override path + +If you opt out of bundle management and hand-craft a `cozystack.gpu-operator` Package CR directly (to apply overrides the bundle does not expose — driver settings, custom node selectors, validator / dcgmExporter tweaks), the platform does NOT auto-wire `HostDevices` or `permittedHostDevices` into the KubeVirt CR. 
In that flow, mirror the bundle behaviour by also creating a `cozystack.kubevirt` Package CR with `components.kubevirt.values.extraFeatureGates: [HostDevices]` and the appropriate `permittedHostDevices` block. The manual Package-CR override path takes precedence over the bundle render whenever both exist. + ## 3. Create a Virtual Machine We are now ready to create a VM. From e0366205c69b9e9e21047a7d6bf881780a63d57b Mon Sep 17 00:00:00 2001 From: Aleksei Sviridkin Date: Wed, 3 Jun 2026 02:20:07 +0300 Subject: [PATCH 2/3] docs(gpu): add pre-upgrade migration steps for hand-edited permittedHostDevices The bundle now owns spec.configuration.permittedHostDevices, so the first reconcile after upgrade overwrites manual kubectl-edit entries with the NVIDIA default table. Tell operators to move custom entries into .gpu.permittedHostDevices and verify each resourceName against node-advertised names before upgrading, since the default slugs (e.g. TU104GL_T4) differ from legacy names (e.g. TU104GL_TESLA_T4) and a mismatch silently rejects GPU VMs. Assisted-By: Claude Signed-off-by: Aleksei Sviridkin --- content/en/docs/next/virtualization/gpu.md | 23 ++++++++++++++++++++++ 1 file changed, 23 insertions(+) diff --git a/content/en/docs/next/virtualization/gpu.md b/content/en/docs/next/virtualization/gpu.md index 745de72d..3e25b017 100644 --- a/content/en/docs/next/virtualization/gpu.md +++ b/content/en/docs/next/virtualization/gpu.md @@ -136,6 +136,29 @@ gpu: externalResourceProvider: true ``` +### Upgrading from a hand-edited KubeVirt CR + +Earlier Cozystack releases left `spec.configuration.permittedHostDevices` for operators to hand-edit (`kubectl edit kubevirt`). The bundle now **owns** that field: the first reconcile after the upgrade replaces your manual entries with the rendered NVIDIA default table. + +Before upgrading: + +1. 
Dump your current entries: + + ```bash + kubectl get kubevirt -n cozy-kubevirt -o yaml \ + | yq '.items[0].spec.configuration.permittedHostDevices' + ``` + +2. Move any custom entries into the Platform Package values under `.gpu.permittedHostDevices` (set `.gpu.replaceDefaults: true` if you want only your own list instead of appending to the NVIDIA defaults). + +3. Verify every `resourceName` against what your nodes actually advertise — the default table uses `nvidia-sandbox-device-plugin` slugs (e.g. `nvidia.com/TU104GL_T4`) that differ from legacy driver names (e.g. `TU104GL_TESLA_T4`): + + ```bash + kubectl describe node | grep nvidia.com/ + ``` + +A `resourceName` mismatch is silent until a GPU VM restarts or migrates, at which point the admission webhook rejects it. + ### Manual Package-CR override path If you opt out of bundle management and hand-craft a `cozystack.gpu-operator` Package CR directly (to apply overrides the bundle does not expose — driver settings, custom node selectors, validator / dcgmExporter tweaks), the platform does NOT auto-wire `HostDevices` or `permittedHostDevices` into the KubeVirt CR. In that flow, mirror the bundle behaviour by also creating a `cozystack.kubevirt` Package CR with `components.kubevirt.values.extraFeatureGates: [HostDevices]` and the appropriate `permittedHostDevices` block. The manual Package-CR override path takes precedence over the bundle render whenever both exist. From 3e5b50484dd35c5a04cec60cbbb0d58f82d09176 Mon Sep 17 00:00:00 2001 From: Aleksei Sviridkin Date: Fri, 5 Jun 2026 00:42:25 +0300 Subject: [PATCH 3/3] docs(gpu): make the permittedHostDevices escape hatch discoverable and portable MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add a callout that redirects operators looking for the removed `kubectl edit kubevirt` step to the `.gpu.permittedHostDevices` knob, linking the extend/replace and upgrade sections so the persistent manual path stays easy to find. 
Use `kubectl -o json | jq` for the verify and dump commands — matches the convention used across the rest of the docs and avoids the Go-yq vs Python-yq expression-syntax drift. Correct the resourceName slug convention to `_` with optional `__` qualifiers, and note the default table is rendered in the passthrough (vfio-pci) variant. Assisted-By: Claude Signed-off-by: Aleksei Sviridkin --- content/en/docs/next/virtualization/gpu.md | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/content/en/docs/next/virtualization/gpu.md b/content/en/docs/next/virtualization/gpu.md index 3e25b017..e556c630 100644 --- a/content/en/docs/next/virtualization/gpu.md +++ b/content/en/docs/next/virtualization/gpu.md @@ -107,15 +107,21 @@ When `cozystack.gpu-operator` is in `bundles.enabledPackages`, Cozystack mirrors Specifically, the platform injects: - `HostDevices` into `spec.configuration.developerConfiguration.featureGates` (current KubeVirt splits this from the `GPU` gate; the admission webhook rejects `domain.devices.hostDevices` without it). -- A starter `spec.configuration.permittedHostDevices.pciHostDevices` table covering common NVIDIA datacenter GPUs — Hopper (H100, H200), Ada Lovelace (L4, L40, L40S), Ampere (A100 PCIe/SXM, A40, A30, A10), Turing (T4), Volta (V100, V100S). PCI vendor:device pairs are stable; `resourceName` slugs follow the `___` convention `nvidia-sandbox-device-plugin` v25.x emits (e.g. `nvidia.com/GA102GL_A10`). `externalResourceProvider: true` is set on every entry because the resources are advertised by the sandbox plugin, not by KubeVirt's in-tree device plugin. +- A starter `spec.configuration.permittedHostDevices.pciHostDevices` table (rendered when `gpuOperatorVariant` is `default`, the vfio-pci passthrough variant) covering common NVIDIA datacenter GPUs — Hopper (H100, H200), Ada Lovelace (L4, L40, L40S), Ampere (A100 PCIe/SXM, A40, A30, A10), Turing (T4), Volta (V100, V100S). 
PCI vendor:device pairs are stable; `resourceName` slugs follow what `nvidia-sandbox-device-plugin` v25.x emits — `_`, with optional `__` qualifiers appended when a model ships in several memory or form-factor variants (e.g. `nvidia.com/GA102GL_A10` for the single-SKU A10, `nvidia.com/GH100_H200_SXM_141GB` for the H200). `externalResourceProvider: true` is set on every entry because the resources are advertised by the sandbox plugin, not by KubeVirt's in-tree device plugin. Verify the resulting CR: ```bash -kubectl -n cozy-kubevirt get kubevirt kubevirt -o yaml \ - | yq '.spec.configuration | {featureGates: .developerConfiguration.featureGates, permittedHostDevices: .permittedHostDevices}' +kubectl -n cozy-kubevirt get kubevirt kubevirt -o json \ + | jq '.spec.configuration | {featureGates: .developerConfiguration.featureGates, permittedHostDevices: .permittedHostDevices}' ``` +{{% alert color="info" %}} + +**My GPU isn't in the default table — where's the old `kubectl edit kubevirt` step?** It is gone on purpose. `permittedHostDevices` is now owned by the chart template and reconciled from platform values, so any hand edit to the live CR is reverted on the next Flux/Helm reconcile. Add your card through `.gpu.permittedHostDevices` instead — see [Extending or replacing the NVIDIA defaults](#extending-or-replacing-the-nvidia-defaults) below. If you are upgrading from a release where you hand-edited the CR, follow [Upgrading from a hand-edited KubeVirt CR](#upgrading-from-a-hand-edited-kubevirt-cr) first. + +{{% /alert %}} + ### Extending or replacing the NVIDIA defaults If your cluster ships a GPU not in the default table, or your `nvidia-sandbox-device-plugin` version emits a different `resourceName` (check with `kubectl describe node | grep nvidia.com/`), extend the defaults via platform values: @@ -145,8 +151,8 @@ Before upgrading: 1. 
Dump your current entries: ```bash - kubectl get kubevirt -n cozy-kubevirt -o yaml \ - | yq '.items[0].spec.configuration.permittedHostDevices' + kubectl -n cozy-kubevirt get kubevirt kubevirt -o json \ + | jq '.spec.configuration.permittedHostDevices' ``` 2. Move any custom entries into the Platform Package values under `.gpu.permittedHostDevices` (set `.gpu.replaceDefaults: true` if you want only your own list instead of appending to the NVIDIA defaults).
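
The upgrade steps ask operators to compare each permitted `resourceName` with what the nodes advertise. A minimal sketch of that comparison follows. The helper name `unadvertised_resources` and the file names are hypothetical, and the commented usage assumes the CR is named `kubevirt` in `cozy-kubevirt` (as in this guide) and that GPU resources show up under each node's `.status.allocatable`.

```bash
# Hypothetical helper (not part of Cozystack): print every permitted
# resourceName that no node advertises.
#   $1: file of resourceName values from the KubeVirt CR, one per line
#   $2: file of resource names advertised by nodes, one per line
unadvertised_resources() {
  # -F fixed strings, -x whole-line match, -v keep lines with no match in $2.
  # grep exits 1 when it prints nothing, which here means "nothing missing".
  grep -vxFf "$2" "$1" || true
}

# Assumed usage against a live cluster (requires kubectl and jq):
#   kubectl -n cozy-kubevirt get kubevirt kubevirt -o json \
#     | jq -r '.spec.configuration.permittedHostDevices.pciHostDevices[]?.resourceName' \
#     > permitted.txt
#   kubectl get nodes -o json \
#     | jq -r '.items[].status.allocatable | keys[] | select(startswith("nvidia.com/"))' \
#     | sort -u > advertised.txt
#   unadvertised_resources permitted.txt advertised.txt
```

Names from the default table for cards that are not installed will also be listed, which is expected. Only the entries for GPUs actually present in the cluster need to match.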