feat(gpu): introduce GPU request spec#1156
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
8b95e09 to
3d17c4e
Compare
3d17c4e to
930c581
Compare
|
🌿 Preview your docs: https://nvidia-preview-pr-1156.docs.buildwithfern.com/openshell |
07b171d to
dd18d21
Compare
9ff168a to
cfceac9
Compare
|
Label |
ff9e99d to
76af3a3
Compare
7f09b83 to
3c6a501
Compare
| message GpuRequestSpec { | ||
| // Optional driver-native device identifiers. Empty means the driver chooses | ||
| // its default GPU assignment behavior. | ||
| repeated string device_id = 1; |
There was a problem hiding this comment.
From the PR description
Replace legacy GPU-specific request fields in the public and driver protos with ResourceRequirements carrying a GPUSpec. This also adds GPU count requests while preserving the existing default GPU request shape used by --gpu and image-name auto-detection.
Where do we support counts in this PR?
Another thought, with the device_config convention we're trying to establish, does it make as much sense to include device_id in this proto? device_id could get represented from the driver_config, and gpu could simply represent the number of gpus to request.
There was a problem hiding this comment.
Where do we support counts in this PR?
Looking locally, I think I had tested a version of this branch on a machine with GPUs and then pushed these changes without first pulling the commit including counts or forgot to push the local changes).
does it make as much sense to include device_id in this proto? device_id could get represented from the driver_config, and gpu could simply represent the number of gpus to request.
Using driver_config for GPU ids is an interesting idea and may make sense since the actual IDs available depend on the driver implementation (e.g. index vs BDF vs CDI device name).
This probably means that we're looking at the following shape:
driver_config:
docker:
device_ids:
- nvidia.com/gpu=0
podman:
device_ids:
- nvidia.com/gpu=0
vm:
device_ids:
- 0
This would select the specified devices when --gpu-count=1 (assuming that we extend the docker and podman drivers to support this). We would also remove / deprecate the --gpu-device command line flag in this case and have --gpu imply --gpu-count=1. Note that the device_ids would ONLY be processed if a count is actually specified so that we don't allow driver-config to override sandbox-specific resources.
Let's me use this as a concrete driver for an implementaion of the proposal in #1589.
There was a problem hiding this comment.
See #1716 as a draft POC for what this would look like. I still need to make another pass or two though.
7bdaa73 to
23bdbb1
Compare
|
Compatibility note: this revision intentionally keeps the proto compact while the API is still pre-alpha. It reuses the old direct GPU field position for |
|
@drew I readded the counts (from local history), and then went a step further and added the This is currently a separate commit, so I can roll it back if we're more comfortable with the initial change. |
| gpu, | ||
| gpu_device.as_deref(), | ||
| gpu_count, |
There was a problem hiding this comment.
As a follow-up: Does it make sense to introduce a type for gpu and other resources going forward?
| gpu: bool, | ||
| gpu_device: Option<&str>, | ||
| gpu_count: Option<u32>, | ||
| cpu: Option<&str>, | ||
| memory: Option<&str>, |
There was a problem hiding this comment.
For follow-up: Let's add a struct that handles these resources together instead of adding new arguments.
| pub fn cdi_gpu_device_ids(gpu: Option<&DriverGpuResourceRequirement>) -> Option<Vec<String>> { | ||
| match gpu { | ||
| Some(gpu) if gpu.device_ids.is_empty() => Some(vec![CDI_GPU_DEVICE_ALL.to_string()]), | ||
| Some(gpu) => Some(gpu.device_ids.clone()), | ||
| None => None, | ||
| } |
There was a problem hiding this comment.
One of the core changes of #1675 is to not default to nvidia.com/gpu=all when an empty device_id is specified, but instead select the "next" GPU id. We're currently keeping the behaviour unchanged here as part of this PR, but it may be useful to call this out in a comment.
There was a problem hiding this comment.
Another note is that #1675 builds on this PR to also add support for count-based GPU requests.
There was a problem hiding this comment.
Done. cdi_gpu_device_ids now documents that an empty explicit device list preserves the current all-GPU CDI default for this PR.
| fn driver_gpu_requirement( | ||
| spec: &openshell_core::proto::compute::v1::DriverSandboxSpec, | ||
| ) -> Option<&DriverGpuResourceRequirement> { | ||
| spec.resource_requirements | ||
| .as_ref() | ||
| .and_then(|requirements| requirements.gpu.as_ref()) | ||
| } |
There was a problem hiding this comment.
As a question: Is this a common Rust pattern to extract a reference to a specific field (or sub-field) form a type?
There was a problem hiding this comment.
Another question: This specific function does not seem docker-specific. Should it be moved to somewhere like crates/openshell-core/src/gpu.rs if it is also present in podman or vm implementations?
There was a problem hiding this comment.
Done. The shared extraction helper now lives in openshell_core::gpu, and Docker/Podman/Kubernetes/VM use the common driver_gpu_requirement path.
| const GPU_RESOURCE_QUANTITY: &str = "1"; | ||
| const DEFAULT_GPU_COUNT: u32 = 1; | ||
|
|
||
| fn driver_gpu_requirement(spec: &SandboxSpec) -> Option<&DriverGpuResourceRequirement> { |
There was a problem hiding this comment.
This is common to the other drivers.
There was a problem hiding this comment.
Done. This now uses the shared openshell_core::gpu::driver_gpu_requirement helper instead of a Kubernetes-local pattern.
| if let Some(gpu) = sandbox.spec.as_ref().and_then(driver_gpu_requirement) | ||
| && gpu_has_explicit_device_ids(Some(gpu)) | ||
| { | ||
| return Err(KubernetesDriverError::Precondition( | ||
| "kubernetes compute driver does not support explicit GPU device IDs".to_string(), | ||
| )); | ||
| } |
There was a problem hiding this comment.
Why are we seemingly duplicating the sandbox validation here?
| sandbox_template_to_k8s( | ||
| &SandboxTemplate::default(), | ||
| spec.is_some_and(|s| s.gpu), | ||
| spec.and_then(driver_gpu_requirement), |
There was a problem hiding this comment.
What is the difference between spec.and_then(driver_gpu_requirement) and just calling driver_gpu_requirement(spec). If they're equivalent, can we settle on one of these for consistency?
| static ENV_LOCK: std::sync::LazyLock<std::sync::Mutex<()>> = | ||
| std::sync::LazyLock::new(|| std::sync::Mutex::new(())); | ||
|
|
||
| fn gpu_request(count: Option<u32>) -> DriverGpuResourceRequirement { |
There was a problem hiding this comment.
Why do we have multiple mechanisms to construct a DriverGpuResourceRequirement in the kubernetes driver? This particular method seems to only be used in testing. If this is the case, is there a way to mark it as such. Furthermore, if we could "normalize" the construction, so that we always construct a DriverGpuResourceRequirement internally with the default count (i.e. the default request), then we don't need to remember to handle the special case when Count is not specified.
There was a problem hiding this comment.
Done. The construction helpers are now local to the Kubernetes test module, with default_gpu_request() for the default request and gpu_request(count) only for count-specific cases.
| fn gpu_has_explicit_device_ids_only_when_ids_are_present() { | ||
| assert!(!gpu_has_explicit_device_ids(None)); | ||
| assert!(!gpu_has_explicit_device_ids(Some( | ||
| &DriverGpuResourceRequirement { | ||
| device_ids: vec![], | ||
| count: None, | ||
| } | ||
| ))); | ||
| assert!(gpu_has_explicit_device_ids(Some( | ||
| &DriverGpuResourceRequirement { | ||
| device_ids: vec!["nvidia.com/gpu=0".to_string()], | ||
| count: None, | ||
| } | ||
| ))); | ||
| } |
There was a problem hiding this comment.
Why are we testing this in the kubernetes driver? Explicit device IDs are not supported here.
There was a problem hiding this comment.
Done. The Kubernetes test now verifies that explicit GPU device IDs are rejected, since Kubernetes does not support device-specific requests in this PR.
| sandbox_template_to_k8s( | ||
| &template, | ||
| true, | ||
| Some(&gpu_request(None)), |
There was a problem hiding this comment.
This should probably just be a "defaultGpuRequest" which maps better to gpu == true before these changes.
There was a problem hiding this comment.
Done. The helper is now default_gpu_request(), with gpu_request(count) kept for count-specific Kubernetes tests.
| workload. | ||
| When `resource_requirements.gpu` is present, the driver checks node allocatable | ||
| capacity for `nvidia.com/gpu` and sets the workload's `nvidia.com/gpu` resource | ||
| limit. Requests without an explicit count use one GPU. The sandbox image must |
There was a problem hiding this comment.
| limit. Requests without an explicit count use one GPU. The sandbox image must | |
| limit. The default is to request a single GPU. The sandbox image must |
There was a problem hiding this comment.
Done. The Kubernetes README now states that the default is to request a single GPU.
| resource_requirements: Some(DriverSandboxResourceRequirements { | ||
| gpu: Some(DriverGpuResourceRequirement { | ||
| device_ids: vec![], | ||
| count: None, | ||
| }), | ||
| }), |
There was a problem hiding this comment.
Is this a "defaultGPURequest"? (which may be testing-only and make this simpler to read). This may not be the case though if we're only using it once.
There was a problem hiding this comment.
Done. The Podman test now uses default_gpu_request() for the default GPU request case.
| ) -> Result<(), ComputeDriverError> { | ||
| let gpu_requested = sandbox.spec.as_ref().is_some_and(|s| s.gpu); | ||
| Self::validate_gpu_request(gpu_requested) | ||
| let gpu = sandbox.spec.as_ref().and_then(driver_gpu_requirement); |
There was a problem hiding this comment.
Another note to be consistent on how the DriverGpuRequirements are constructed.
There was a problem hiding this comment.
Done. Podman now uses the shared driver_gpu_requirement extraction helper, and the test construction path uses the default GPU request helper where appropriate.
| gpu: Option<&DriverGpuResourceRequirement>, | ||
| gpu_enabled: bool, | ||
| ) -> Result<(), Status> { | ||
| if gpu.is_some() && !gpu_enabled { |
There was a problem hiding this comment.
Is it idiomatic in Rust to perform a quick return if gpu is NOT set?
There was a problem hiding this comment.
Yes. This is the usual guard-clause shape in Rust for optional validation: return early for None, then validate the concrete GPU request without repeatedly unwrapping or matching.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
009a6ee to
4c18100
Compare
Summary
Adds compact GPU resource requirements to the public and compute-driver protos, moving the API toward the resource-requirements shape proposed in #1360.
Related Issue
Related to #1444 and #1360
Changes
SandboxSpec.gpuwithSandboxSpec.resource_requirements.gpu.device_ids.Testing
mise run pre-commitChecklist