feat(gpu): introduce GPU request spec by elezar · Pull Request #1156 · NVIDIA/OpenShell

elezar · 2026-05-04T20:36:12Z

Summary

Adds compact GPU resource requirements to the public and compute-driver protos, moving the API toward the resource-requirements shape proposed in #1360.

Related Issue

Related to #1444 and #1360

Changes

Replaces SandboxSpec.gpu with SandboxSpec.resource_requirements.gpu.
Supports default GPU requests, GPU count requests, and interim driver-specific device_ids.
Preserves driver behavior: Kubernetes honors counts, VM accepts one GPU, Docker/Podman reject count requests while using CDI device IDs.
Updates CLI mapping, gateway validation/translation, tests, and docs.

Testing

mise run pre-commit

Checklist

Follows Conventional Commits
Docs updated

copy-pr-bot · 2026-05-04T20:36:16Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

github-actions · 2026-05-13T16:45:22Z

🌿 Preview your docs: https://nvidia-preview-pr-1156.docs.buildwithfern.com/openshell

github-actions · 2026-05-18T13:45:05Z

Label test:e2e-gpu applied for cfceac9. Open the existing run and click Re-run all jobs to execute with the label set. The E2E Gate check on this PR will flip green automatically once the run finishes.

drew · 2026-06-03T06:39:59Z

+message GpuRequestSpec {
+  // Optional driver-native device identifiers. Empty means the driver chooses
+  // its default GPU assignment behavior.
+  repeated string device_id = 1;


From the PR description

Replace legacy GPU-specific request fields in the public and driver protos with ResourceRequirements carrying a GPUSpec. This also adds GPU count requests while preserving the existing default GPU request shape used by --gpu and image-name auto-detection.

Where do we support counts in this PR?

Another thought, with the device_config convention we're trying to establish, does it make as much sense to include device_id in this proto? device_id could get represented from the driver_config, and gpu could simply represent the number of gpus to request.

Where do we support counts in this PR?

Looking locally, I think I had tested a version of this branch on a machine with GPUs and then pushed these changes without first pulling the commit including counts or forgot to push the local changes).

does it make as much sense to include device_id in this proto? device_id could get represented from the driver_config, and gpu could simply represent the number of gpus to request.

Using driver_config for GPU ids is an interesting idea and may make sense since the actual IDs available depend on the driver implementation (e.g. index vs BDF vs CDI device name).

This probably means that we're looking at the following shape:

driver_config: docker: device_ids: - nvidia.com/gpu=0 podman: device_ids: - nvidia.com/gpu=0 vm: device_ids: - 0

This would select the specified devices when --gpu-count=1 (assuming that we extend the docker and podman drivers to support this). We would also remove / deprecate the --gpu-device command line flag in this case and have --gpu imply --gpu-count=1. Note that the device_ids would ONLY be processed if a count is actually specified so that we don't allow driver-config to override sandbox-specific resources.

Let's me use this as a concrete driver for an implementaion of the proposal in #1589.

See #1716 as a draft POC for what this would look like. I still need to make another pass or two though.

elezar · 2026-06-03T13:08:42Z

Compatibility note: this revision intentionally keeps the proto compact while the API is still pre-alpha. It reuses the old direct GPU field position for resource_requirements and does not reserve the removed GPU-related tags/names. If reviewers prefer compatibility protection here, we can reserve the old fields in a follow-up revision.

elezar · 2026-06-03T14:08:25Z

@drew I readded the counts (from local history), and then went a step further and added the ResourceRequirements envelope instead of doing that as a follow up.

This is currently a separate commit, so I can roll it back if we're more comfortable with the initial change.

elezar · 2026-06-03T19:14:02Z

                        gpu,
                        gpu_device.as_deref(),
+                        gpu_count,


As a follow-up: Does it make sense to introduce a type for gpu and other resources going forward?

elezar · 2026-06-03T19:15:30Z

    gpu: bool,
    gpu_device: Option<&str>,
+    gpu_count: Option<u32>,
    cpu: Option<&str>,
    memory: Option<&str>,


For follow-up: Let's add a struct that handles these resources together instead of adding new arguments.

elezar · 2026-06-03T19:18:09Z

+pub fn cdi_gpu_device_ids(gpu: Option<&DriverGpuResourceRequirement>) -> Option<Vec<String>> {
+    match gpu {
+        Some(gpu) if gpu.device_ids.is_empty() => Some(vec![CDI_GPU_DEVICE_ALL.to_string()]),
+        Some(gpu) => Some(gpu.device_ids.clone()),
+        None => None,
+    }


One of the core changes of #1675 is to not default to nvidia.com/gpu=all when an empty device_id is specified, but instead select the "next" GPU id. We're currently keeping the behaviour unchanged here as part of this PR, but it may be useful to call this out in a comment.

Another note is that #1675 builds on this PR to also add support for count-based GPU requests.

Done. cdi_gpu_device_ids now documents that an empty explicit device list preserves the current all-GPU CDI default for this PR.

elezar · 2026-06-03T19:21:53Z

+fn driver_gpu_requirement(
+    spec: &openshell_core::proto::compute::v1::DriverSandboxSpec,
+) -> Option<&DriverGpuResourceRequirement> {
+    spec.resource_requirements
+        .as_ref()
+        .and_then(|requirements| requirements.gpu.as_ref())
+}


As a question: Is this a common Rust pattern to extract a reference to a specific field (or sub-field) form a type?

Another question: This specific function does not seem docker-specific. Should it be moved to somewhere like crates/openshell-core/src/gpu.rs if it is also present in podman or vm implementations?

Done. The shared extraction helper now lives in openshell_core::gpu, and Docker/Podman/Kubernetes/VM use the common driver_gpu_requirement path.

elezar · 2026-06-03T19:24:14Z

-const GPU_RESOURCE_QUANTITY: &str = "1";
+const DEFAULT_GPU_COUNT: u32 = 1;
+
+fn driver_gpu_requirement(spec: &SandboxSpec) -> Option<&DriverGpuResourceRequirement> {


This is common to the other drivers.

Done. This now uses the shared openshell_core::gpu::driver_gpu_requirement helper instead of a Kubernetes-local pattern.

elezar · 2026-06-03T19:27:59Z

+        if let Some(gpu) = sandbox.spec.as_ref().and_then(driver_gpu_requirement)
+            && gpu_has_explicit_device_ids(Some(gpu))
+        {
+            return Err(KubernetesDriverError::Precondition(
+                "kubernetes compute driver does not support explicit GPU device IDs".to_string(),
+            ));
+        }


Why are we seemingly duplicating the sandbox validation here?

elezar · 2026-06-03T19:29:08Z

            sandbox_template_to_k8s(
                &SandboxTemplate::default(),
-                spec.is_some_and(|s| s.gpu),
+                spec.and_then(driver_gpu_requirement),


What is the difference between spec.and_then(driver_gpu_requirement) and just calling driver_gpu_requirement(spec). If they're equivalent, can we settle on one of these for consistency?

elezar · 2026-06-03T19:35:48Z

    static ENV_LOCK: std::sync::LazyLock<std::sync::Mutex<()>> =
        std::sync::LazyLock::new(|| std::sync::Mutex::new(()));

+    fn gpu_request(count: Option<u32>) -> DriverGpuResourceRequirement {


Why do we have multiple mechanisms to construct a DriverGpuResourceRequirement in the kubernetes driver? This particular method seems to only be used in testing. If this is the case, is there a way to mark it as such. Furthermore, if we could "normalize" the construction, so that we always construct a DriverGpuResourceRequirement internally with the default count (i.e. the default request), then we don't need to remember to handle the special case when Count is not specified.

Done. The construction helpers are now local to the Kubernetes test module, with default_gpu_request() for the default request and gpu_request(count) only for count-specific cases.

elezar · 2026-06-03T19:36:55Z

+    fn gpu_has_explicit_device_ids_only_when_ids_are_present() {
+        assert!(!gpu_has_explicit_device_ids(None));
+        assert!(!gpu_has_explicit_device_ids(Some(
+            &DriverGpuResourceRequirement {
+                device_ids: vec![],
+                count: None,
+            }
+        )));
+        assert!(gpu_has_explicit_device_ids(Some(
+            &DriverGpuResourceRequirement {
+                device_ids: vec!["nvidia.com/gpu=0".to_string()],
+                count: None,
+            }
+        )));
+    }


Why are we testing this in the kubernetes driver? Explicit device IDs are not supported here.

Done. The Kubernetes test now verifies that explicit GPU device IDs are rejected, since Kubernetes does not support device-specific requests in this PR.

elezar · 2026-06-03T19:37:40Z

            sandbox_template_to_k8s(
                &template,
-                true,
+                Some(&gpu_request(None)),


This should probably just be a "defaultGpuRequest" which maps better to gpu == true before these changes.

Done. The helper is now default_gpu_request(), with gpu_request(count) kept for count-specific Kubernetes tests.

elezar · 2026-06-03T19:38:40Z

-workload.
+When `resource_requirements.gpu` is present, the driver checks node allocatable
+capacity for `nvidia.com/gpu` and sets the workload's `nvidia.com/gpu` resource
+limit. Requests without an explicit count use one GPU. The sandbox image must


Suggested change

limit. Requests without an explicit count use one GPU. The sandbox image must

limit. The default is to request a single GPU. The sandbox image must

Done. The Kubernetes README now states that the default is to request a single GPU.

elezar · 2026-06-03T19:40:18Z

+            resource_requirements: Some(DriverSandboxResourceRequirements {
+                gpu: Some(DriverGpuResourceRequirement {
+                    device_ids: vec![],
+                    count: None,
+                }),
+            }),


Is this a "defaultGPURequest"? (which may be testing-only and make this simpler to read). This may not be the case though if we're only using it once.

Done. The Podman test now uses default_gpu_request() for the default GPU request case.

elezar · 2026-06-03T19:41:45Z

    ) -> Result<(), ComputeDriverError> {
-        let gpu_requested = sandbox.spec.as_ref().is_some_and(|s| s.gpu);
-        Self::validate_gpu_request(gpu_requested)
+        let gpu = sandbox.spec.as_ref().and_then(driver_gpu_requirement);


Another note to be consistent on how the DriverGpuRequirements are constructed.

Done. Podman now uses the shared driver_gpu_requirement extraction helper, and the test construction path uses the default GPU request helper where appropriate.

elezar · 2026-06-03T19:46:33Z

+    gpu: Option<&DriverGpuResourceRequirement>,
+    gpu_enabled: bool,
+) -> Result<(), Status> {
+    if gpu.is_some() && !gpu_enabled {


Is it idiomatic in Rust to perform a quick return if gpu is NOT set?

Yes. This is the usual guard-clause shape in Rust for optional validation: return early for None, then validate the concrete GPU request without repeatedly unwrapping or matching.

Signed-off-by: Evan Lezar <elezar@nvidia.com>

elezar force-pushed the feat/gpu-request-spec branch 2 times, most recently from 8b95e09 to 3d17c4e Compare May 13, 2026 16:00

This was referenced May 13, 2026

feat: add GPU count support for Kubernetes sandboxes #1338

Open

feat(sandbox): add typed resource spec #1340

Closed

elezar force-pushed the feat/gpu-request-spec branch from 3d17c4e to 930c581 Compare May 13, 2026 16:44

elezar force-pushed the feat/gpu-request-spec branch 2 times, most recently from 07b171d to dd18d21 Compare May 15, 2026 15:12

elezar marked this pull request as ready for review May 15, 2026 15:12

elezar requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners May 15, 2026 15:12

elezar force-pushed the feat/gpu-request-spec branch 2 times, most recently from 9ff168a to cfceac9 Compare May 18, 2026 13:28

elezar added the test:e2e-gpu Requires GPU end-to-end coverage label May 18, 2026

elezar force-pushed the feat/gpu-request-spec branch 3 times, most recently from ff9e99d to 76af3a3 Compare May 20, 2026 11:57

elezar mentioned this pull request May 20, 2026

GPU-accelerated Sandbox Support #1444

Open

7 tasks

elezar force-pushed the feat/gpu-request-spec branch 4 times, most recently from 7f09b83 to 3c6a501 Compare June 1, 2026 21:15

drew reviewed Jun 3, 2026

View reviewed changes

elezar force-pushed the feat/gpu-request-spec branch 2 times, most recently from 7bdaa73 to 23bdbb1 Compare June 3, 2026 11:29

elezar mentioned this pull request Jun 3, 2026

fix(gpu): select single CDI GPU defaults #1675

Draft

2 tasks

elezar mentioned this pull request Jun 3, 2026

feat(gpu): route device selection through driver config #1716

Draft

8 tasks

elezar commented Jun 3, 2026

View reviewed changes

Comment thread crates/openshell-cli/src/run.rs Outdated

elezar commented Jun 3, 2026

View reviewed changes

Comment thread architecture/compute-runtimes.md Outdated

elezar commented Jun 3, 2026

View reviewed changes

elezar added 5 commits June 3, 2026 23:09

fix(cli): allow gpu-device to imply GPU requests

2cafd43

Signed-off-by: Evan Lezar <elezar@nvidia.com>

refactor(gpu): centralize driver request validation

6899577

Signed-off-by: Evan Lezar <elezar@nvidia.com>

refactor(vm): derive GPU device request once

f07b1a7

Signed-off-by: Evan Lezar <elezar@nvidia.com>

feat(gpu): introduce GPU request spec

a5fc130

Signed-off-by: Evan Lezar <elezar@nvidia.com>

refactor(gpu): use resource requirements for GPU requests

4c18100

Signed-off-by: Evan Lezar <elezar@nvidia.com>

elezar force-pushed the feat/gpu-request-spec branch from 009a6ee to 4c18100 Compare June 3, 2026 21:21

	limit. Requests without an explicit count use one GPU. The sandbox image must
	limit. The default is to request a single GPU. The sandbox image must

Conversation

elezar commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related Issue

Changes

Testing

Checklist

Uh oh!

copy-pr-bot Bot commented May 4, 2026

Uh oh!

github-actions Bot commented May 13, 2026

Uh oh!

github-actions Bot commented May 18, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

elezar commented Jun 3, 2026

Uh oh!

elezar commented Jun 3, 2026

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

elezar Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

elezar commented May 4, 2026 •

edited

Loading

elezar Jun 3, 2026 •

edited

Loading