79 changes: 61 additions & 18 deletions content/en/docs/next/virtualization/gpu.md
@@ -100,32 +100,75 @@ Allocatable:
For example, the database entry for A10 reads `2236 GA102GL [A10]`, which results in a resource name `nvidia.com/GA102GL_A10`.
{{% /alert %}}

## 2. Update the KubeVirt Custom Resource
## 2. KubeVirt is wired automatically

Next, we will update the KubeVirt Custom Resource, as documented in the
[KubeVirt user guide](https://kubevirt.io/user-guide/virtual_machines/host-devices/#listing-permitted-devices),
so that the passthrough GPUs are permitted and can be requested by a KubeVirt VM.
When `cozystack.gpu-operator` is in `bundles.enabledPackages`, Cozystack mirrors the chosen GPU variant into the `KubeVirt` Custom Resource for you. There is no `kubectl edit kubevirt` step.

Adjust the `pciVendorSelector` and `resourceName` values to match your specific GPU model.
Setting `externalResourceProvider=true` indicates that this resource is provided by an external device plugin,
in this case the `sandbox-device-plugin` which is deployed by the Operator.
Specifically, the platform injects:

- `HostDevices` into `spec.configuration.developerConfiguration.featureGates` (current KubeVirt splits this from the `GPU` gate; the admission webhook rejects `domain.devices.hostDevices` without it).
- A starter `spec.configuration.permittedHostDevices.pciHostDevices` table, rendered for the default variant (`gpuOperatorVariant: default`, vfio-pci passthrough). The table covers common NVIDIA datacenter GPUs: Hopper (H100, H200), Ada Lovelace (L4, L40, L40S), Ampere (A100 PCIe/SXM, A40, A30, A10), Turing (T4), and Volta (V100, V100S). PCI vendor:device pairs are stable. The `resourceName` slugs follow what `nvidia-sandbox-device-plugin` v25.x emits: `<arch>_<model>`, with optional `_<form>_<mem>` qualifiers appended when a model ships in several memory or form-factor variants (for example, `nvidia.com/GA102GL_A10` for the single-SKU A10 and `nvidia.com/GH100_H200_SXM_141GB` for the H200). Every entry sets `externalResourceProvider: true` because the resources are advertised by the sandbox plugin, not by KubeVirt's in-tree device plugin.

Verify the resulting CR:

```bash
kubectl edit kubevirt -n cozy-kubevirt
kubectl -n cozy-kubevirt get kubevirt kubevirt -o json \
| jq '.spec.configuration | {featureGates: .developerConfiguration.featureGates, permittedHostDevices: .permittedHostDevices}'
```
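
The feature-gate part of this check can also be scripted. The following is a minimal sketch, not part of Cozystack or KubeVirt: `has_gate` is a hypothetical helper that tests whether a gate name appears in the space-separated list that `kubectl`'s JSONPath output produces for `featureGates[*]`. It assumes the CR is named `kubevirt` in the `cozy-kubevirt` namespace, as in the command above.

```bash
# has_gate GATE "LIST": succeed if GATE appears as a whole word in the
# space-separated LIST. Hypothetical helper, not part of Cozystack or KubeVirt.
has_gate() {
  local gate=$1 list=$2
  case " $list " in
    *" $gate "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage against a live cluster (requires kubectl access):
#   gates=$(kubectl -n cozy-kubevirt get kubevirt kubevirt \
#     -o jsonpath='{.spec.configuration.developerConfiguration.featureGates[*]}')
#   has_gate HostDevices "$gates" && echo "HostDevices gate is set"
```

Padding both the list and the gate name with spaces keeps a gate such as `HostDevices` from matching a longer name that merely contains it.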
example config:

{{% alert color="info" %}}

**My GPU isn't in the default table — where's the old `kubectl edit kubevirt` step?** It is gone on purpose. `permittedHostDevices` is now owned by the chart template and reconciled from platform values, so any hand edit to the live CR is reverted on the next Flux/Helm reconcile. Add your card through `.gpu.permittedHostDevices` instead — see [Extending or replacing the NVIDIA defaults](#extending-or-replacing-the-nvidia-defaults) below. If you are upgrading from a release where you hand-edited the CR, follow [Upgrading from a hand-edited KubeVirt CR](#upgrading-from-a-hand-edited-kubevirt-cr) first.

{{% /alert %}}

### Extending or replacing the NVIDIA defaults

If your cluster ships a GPU not in the default table, or your `nvidia-sandbox-device-plugin` version emits a different `resourceName` (check with `kubectl describe node <node> | grep nvidia.com/`), extend the defaults via platform values:

```yaml
...
spec:
configuration:
permittedHostDevices:
pciHostDevices:
- externalResourceProvider: true
pciVendorSelector: 10DE:2236
resourceName: nvidia.com/GA102GL_A10
...
# Platform Package values
gpu:
# Append (default) — your entries land alongside the NVIDIA table.
# Set to true to drop the NVIDIA table entirely (useful for non-NVIDIA-only
# clusters or strict allowlists). With replaceDefaults: true and an empty
# list below, the rendered CR carries no permittedHostDevices block at all
# and the admission webhook rejects every GPU VM — supply your own list.
replaceDefaults: false
permittedHostDevices:
pciHostDevices:
- pciVendorSelector: "10DE:2236"
resourceName: nvidia.com/GA102GL_A10
externalResourceProvider: true
```

### Upgrading from a hand-edited KubeVirt CR

Earlier Cozystack releases left `spec.configuration.permittedHostDevices` for operators to hand-edit (`kubectl edit kubevirt`). The bundle now **owns** that field: the first reconcile after the upgrade replaces your manual entries with the rendered NVIDIA default table.

Before upgrading:

1. Dump your current entries:

```bash
kubectl -n cozy-kubevirt get kubevirt kubevirt -o json \
| jq '.spec.configuration.permittedHostDevices'
```

2. Move any custom entries into the Platform Package values under `.gpu.permittedHostDevices` (set `.gpu.replaceDefaults: true` if you want only your own list instead of appending to the NVIDIA defaults).

3. Verify every `resourceName` against what your nodes actually advertise — the default table uses `nvidia-sandbox-device-plugin` slugs (e.g. `nvidia.com/TU104GL_T4`) that differ from legacy driver names (e.g. `TU104GL_TESLA_T4`):

```bash
kubectl describe node <node> | grep nvidia.com/
```

A `resourceName` mismatch is silent until a GPU VM restarts or migrates, at which point the admission webhook rejects it.
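
Step 3 can be partly automated. The sketch below is an illustrative helper rather than a Cozystack tool: `unadvertised_resources` (a hypothetical name) reads `kubectl describe node` output on stdin and prints each `resourceName` argument that the node does not advertise. It only does text matching, so it reports missing names, not why they are missing.

```bash
# unadvertised_resources NAME...: read a node description on stdin and print
# every NAME that does not occur in it. Hypothetical helper for illustration.
unadvertised_resources() {
  local node_desc name
  node_desc=$(cat)
  for name in "$@"; do
    # -F: match the name literally; -w: whole-word match, so that
    # nvidia.com/GA102GL_A10 does not match a longer slug such as ..._A100.
    grep -qFw -- "$name" <<<"$node_desc" || printf '%s\n' "$name"
  done
}

# Usage against a live cluster (requires kubectl and jq):
#   names=$(kubectl -n cozy-kubevirt get kubevirt kubevirt -o json \
#     | jq -r '.spec.configuration.permittedHostDevices.pciHostDevices[].resourceName')
#   kubectl describe node <node> | unadvertised_resources $names
```

Empty output means every listed name is advertised by that node. Any printed name is a candidate for the silent mismatch described above.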

### Manual Package-CR override path

If you opt out of bundle management and hand-craft a `cozystack.gpu-operator` Package CR directly (to apply overrides the bundle does not expose — driver settings, custom node selectors, validator / dcgmExporter tweaks), the platform does NOT auto-wire `HostDevices` or `permittedHostDevices` into the KubeVirt CR. In that flow, mirror the bundle behaviour by also creating a `cozystack.kubevirt` Package CR with `components.kubevirt.values.extraFeatureGates: [HostDevices]` and the appropriate `permittedHostDevices` block. The manual Package-CR override path takes precedence over the bundle render whenever both exist.
Contributor
medium

When creating a standalone `cozystack.kubevirt` Package CR directly, the configuration values should be defined under `spec.values` rather than `components.kubevirt.values`. The `components.<name>.values` structure is used when configuring components within the umbrella `cozystack-platform` package.

Updating this path ensures the standalone Package CR is configured correctly.

Suggested change
If you opt out of bundle management and hand-craft a `cozystack.gpu-operator` Package CR directly (to apply overrides the bundle does not expose — driver settings, custom node selectors, validator / dcgmExporter tweaks), the platform does NOT auto-wire `HostDevices` or `permittedHostDevices` into the KubeVirt CR. In that flow, mirror the bundle behaviour by also creating a `cozystack.kubevirt` Package CR with `components.kubevirt.values.extraFeatureGates: [HostDevices]` and the appropriate `permittedHostDevices` block. The manual Package-CR override path takes precedence over the bundle render whenever both exist.
If you opt out of bundle management and hand-craft a `cozystack.gpu-operator` Package CR directly (to apply overrides the bundle does not expose — driver settings, custom node selectors, validator / dcgmExporter tweaks), the platform does NOT auto-wire `HostDevices` or `permittedHostDevices` into the KubeVirt CR. In that flow, mirror the bundle behaviour by also creating a `cozystack.kubevirt` Package CR with `spec.values.extraFeatureGates: [HostDevices]` and the appropriate `permittedHostDevices` block. The manual Package-CR override path takes precedence over the bundle render whenever both exist.


## 3. Create a Virtual Machine

We are now ready to create a VM.