
RFC 0005: Platform-managed Kubernetes sandboxes #1680

Open
rohancmr wants to merge 2 commits into NVIDIA:main from rohancmr:rfc/platform-managed-kubernetes-sandboxes


@rohancmr rohancmr commented Jun 2, 2026

Summary

Adds RFC 0005 for platform-managed Kubernetes sandbox provisioning.

This RFC proposes support for a trusted Kubernetes platform control plane to call OpenShell Gateway with a platform-selected namespace, supplied sandbox policy, approved Kubernetes Secret-backed provider credentials, runtime placement metadata, and optional Agent Sandbox allocation through SandboxClaim and SandboxWarmPool.

Related issue: #1678

Notes

  • RFC state is set to review.
  • This is a design proposal only; no runtime code changes are included.


github-actions Bot commented Jun 2, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@rohancmr rohancmr force-pushed the rfc/platform-managed-kubernetes-sandboxes branch from 7119994 to 94b22ee on June 2, 2026 10:48
Author

rohancmr commented Jun 2, 2026

I have read the DCO document and I hereby sign the DCO.

Author

rohancmr commented Jun 2, 2026

recheck


### Request shape

The exact API can be protobuf-native, driver-specific configuration, or a

Member

Does this mean that we would add support through the proposal in #1589? (Or at least partially).

Author

Yes, I think this should partially align with #1589.

The Kubernetes-specific request fields in this RFC, such as requested namespace, allocation mode, warm-pool reference, RuntimeClass, node selector, tolerations, and possibly service account selection, seem like good candidates for the driver-owned config shape proposed in #1589.

Kubernetes placement fields can use driver_config, but credentials and authorization should be handled by OpenShell’s main credential and authorization systems, because they affect the whole control plane, not just the Kubernetes driver.

I can update this RFC to reference #1589 as the likely mechanism for the Kubernetes-specific configuration surface.

Copy link
Copy Markdown


It sounds like a lot of the k8s-specific implementation fields, like the agent-sandbox configuration, could use driver_config. It would be nice if the 2 proposals were aligned 👍

Collaborator

+1, we should use driver_config detailed in #1589.

Author

Agreed. The Kubernetes-specific implementation fields should align with the driver_config proposal in #1589.

I will update this RFC so requested namespace, Agent Sandbox allocation mode, SandboxClaim/SandboxWarmPool references, RuntimeClass, node selector, tolerations, and other Kubernetes placement/configuration fields are framed as Kubernetes driver configuration rather than new generic OpenShell API fields.

I will keep cross-cutting concerns like authorization, credential resolution, sandbox ownership, and lifecycle events separate from driver_config because those apply to the OpenShell control plane more broadly.
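
As a rough illustration of the split described above, the sketch below keeps Kubernetes placement fields inside an opaque `driver_config` mapping and leaves caller identity, ownership, and credential references as separate control-plane fields. This is not the RFC's or #1589's actual API: `SandboxRequest`, `KUBERNETES_DRIVER_KEYS`, `unknown_driver_keys`, and every key name are hypothetical, chosen only to mirror the concepts named in this thread.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical key names for the Kubernetes driver's config surface.
KUBERNETES_DRIVER_KEYS = frozenset({
    "namespace", "allocation_mode", "sandbox_claim_ref", "warm_pool_ref",
    "runtime_class", "node_selector", "tolerations",
})


@dataclass
class SandboxRequest:
    """Illustrative request shape only; not OpenShell's actual API."""

    # Cross-cutting concerns handled by the control plane, not the driver.
    caller_identity: str
    owner: str
    credential_refs: list[str] = field(default_factory=list)
    # Driver-owned configuration, opaque to the generic API.
    driver_config: dict[str, Any] = field(default_factory=dict)


def unknown_driver_keys(request: SandboxRequest) -> set[str]:
    """Return driver_config keys the (hypothetical) Kubernetes driver rejects."""
    return set(request.driver_config) - KUBERNETES_DRIVER_KEYS


request = SandboxRequest(
    caller_identity="platform-controller",
    owner="tenant-a",
    credential_refs=["provider/github"],
    driver_config={
        "namespace": "tenant-a-sandboxes",
        "runtime_class": "gvisor",
        "node_selector": {"pool": "sandboxes"},
    },
)
print(unknown_driver_keys(request))  # set() when every key is recognized
```

Keeping credential references out of `driver_config` in this sketch reflects the point made earlier in the thread: credentials and authorization affect the whole control plane, so a driver-specific passthrough is the wrong place to validate them.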

its configured default namespace. When the request namespace is omitted, the
driver keeps existing behavior and provisions into the configured namespace.

When the request namespace is present, the driver uses it as the namespace for

Member

This means it is still up to the user to specify the namespace, correct? How is access control to the namespace handled?

Author

Correct, this needs to be clearer.

The intent is not that an end user can choose an arbitrary namespace. In this model, the caller is a trusted Kubernetes platform controller. That controller resolves the tenant, selects the namespace, reconciles the namespace controls, and then calls OpenShell.

OpenShell still needs an authorization check on its side.
The Kubernetes platform controller should authenticate to OpenShell Gateway with a control-plane identity, for example a Kubernetes ServiceAccount/OIDC token, mTLS client identity, or another configured gateway identity.
OpenShell would then authorize that caller identity against allowed operations and resource scopes. For example, the controller identity may be allowed to request only specific namespace patterns, approved Secret namespaces, approved service accounts, approved allocation modes, and approved Kubernetes driver config.
The namespace override is therefore trusted only when it comes from an authenticated and authorized platform-controller identity.

I can update the RFC to make this explicit and describe namespace override as a trusted control-plane field, not a general user-supplied field.
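
A minimal sketch of the gateway-side check described above, assuming the caller has already been authenticated and mapped to a grant. Nothing here is OpenShell's real authorization API: `CallerGrant`, `resolve_namespace`, `denial_reasons`, the glob-style namespace patterns, and the example values are all hypothetical. Authorizing the default namespace the same way as an explicit one is also an assumption of this sketch.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional


@dataclass(frozen=True)
class CallerGrant:
    """Scopes granted to one authenticated controller identity (hypothetical)."""

    namespace_patterns: tuple[str, ...]
    service_accounts: frozenset[str]
    allocation_modes: frozenset[str]


def resolve_namespace(requested: Optional[str], default_namespace: str) -> str:
    """An omitted namespace keeps existing behavior: the configured default."""
    return requested or default_namespace


def denial_reasons(grant: CallerGrant, namespace: str,
                   service_account: str, allocation_mode: str) -> list[str]:
    """Return why the request is denied; an empty list means it is allowed."""
    reasons = []
    if not any(fnmatchcase(namespace, p) for p in grant.namespace_patterns):
        reasons.append(f"namespace {namespace!r} is outside the granted patterns")
    if service_account not in grant.service_accounts:
        reasons.append(f"service account {service_account!r} is not approved")
    if allocation_mode not in grant.allocation_modes:
        reasons.append(f"allocation mode {allocation_mode!r} is not approved")
    return reasons


grant = CallerGrant(
    namespace_patterns=("tenant-*",),
    service_accounts=frozenset({"sandbox-runner"}),
    allocation_modes=frozenset({"direct", "warm-pool"}),
)
namespace = resolve_namespace("tenant-a", default_namespace="openshell-sandboxes")
print(denial_reasons(grant, namespace, "sandbox-runner", "warm-pool"))
print(denial_reasons(grant, "kube-system", "default", "direct"))
```

Returning every denial reason rather than failing on the first one is a design choice for the sketch; it makes it easier for a platform controller to log exactly which scope was exceeded.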

Collaborator

@derekwaynecarr derekwaynecarr left a comment

I think we need to finish authorization semantics on the OpenShell control plane itself before proceeding too far on this. In particular, this design would need to describe how the OpenShell control plane should be secured to support this operational pattern.

This RFC proposes a platform-managed Kubernetes sandbox provisioning model for
OpenShell. In this model, a Kubernetes platform owns tenant onboarding,
namespace creation, quotas, network policy, secret synchronization, policy
compilation and scheduling. OpenShell remains the sandbox execution plane: the

Collaborator

I think we will need to support two types of usage scenarios:

  1. users of OpenShell that have no underlying knowledge of the control surface that runs their sandboxes (sandbox as a service)
  2. users of existing platforms that want to delegate to OpenShell as their sandbox execution plane on their existing platform (it's more of an operator in an existing platform)

I think we need to finish up authorization of the core OpenShell control plane, and firm up its associated data model a bit more, and then map how usage pattern (2) would fit into that authorization model before opening up too many prescriptive knobs. Is the identity/authz model of OpenShell and the k8s control plane common in this proposal, or different? How would we control who can connect to which sandboxes, etc.

Author

Agreed. This RFC is targeting the second usage model: an existing Kubernetes platform wants to delegate sandbox execution to OpenShell.

It is not trying to define the direct sandbox-as-a-service model where end users call OpenShell without knowing the underlying control plane.

In the current proposal, the Kubernetes platform identity/authz model and the OpenShell identity/authz model are different but connected.

The Kubernetes platform remains responsible for tenant onboarding, namespace creation, namespace RBAC, quota, NetworkPolicy, ESO Secret creation, and deciding which tenant/workload is allowed to request a sandbox.

OpenShell still needs its own control-plane identity and authorization model. The platform controller would authenticate to OpenShell as a trusted control-plane caller, for example using a Kubernetes ServiceAccount/OIDC token, mTLS identity, or another configured gateway identity. OpenShell would then authorize that caller to request only specific namespaces or namespace patterns, Secret refs, service accounts, allocation modes, and driver config.

For sandbox access, OpenShell should not rely only on Kubernetes namespace RBAC. OpenShell should have an ownership/access model for sandbox objects. A sandbox should carry owner/tenant/project metadata, and OpenShell should use that metadata to decide who can connect to the sandbox, stream logs, exec, delete it, or attach credentials. In the platform-delegated model, the platform controller may be the only direct OpenShell caller, and end-user access is mediated through the platform. In the sandbox-as-a-service model, OpenShell would need to authorize end users directly.

So the two authz systems are not the same, but they need to be mapped:

  • Kubernetes authz protects Kubernetes resources and namespaces.
  • OpenShell authz protects OpenShell API operations, sandbox ownership, credential use, and sandbox connection rights.

I can update the RFC to describe this explicitly and frame this RFC as usage pattern (2): an existing Kubernetes platform delegates sandbox execution to OpenShell.
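
To make the ownership idea above concrete, here is a small sketch of an access decision driven by sandbox owner/tenant/project metadata. The types (`Operation`, `SandboxOwnership`, `Principal`) and the specific rules are assumptions for illustration only: the thread says such metadata should drive the decision but does not define the rules, so the "owner-only" set and the controller bypass below are invented.

```python
from dataclasses import dataclass
from enum import Enum


class Operation(Enum):
    CONNECT = "connect"
    LOGS = "logs"
    EXEC = "exec"
    DELETE = "delete"
    ATTACH_CREDENTIALS = "attach_credentials"


# Assumed rule: these operations are limited to the sandbox owner.
OWNER_ONLY = frozenset({Operation.DELETE, Operation.ATTACH_CREDENTIALS})


@dataclass(frozen=True)
class SandboxOwnership:
    owner: str
    tenant: str
    project: str


@dataclass(frozen=True)
class Principal:
    name: str
    tenant: str
    is_platform_controller: bool = False


def can_operate(principal: Principal, sandbox: SandboxOwnership,
                operation: Operation) -> bool:
    """Decide access from OpenShell-side metadata, not Kubernetes RBAC."""
    # Platform-delegated model: the controller is the direct caller and
    # mediates end-user access itself.
    if principal.is_platform_controller:
        return True
    # Direct callers must belong to the sandbox's tenant.
    if principal.tenant != sandbox.tenant:
        return False
    if operation in OWNER_ONLY:
        return principal.name == sandbox.owner
    return True


sandbox = SandboxOwnership(owner="alice", tenant="tenant-a", project="demo")
print(can_operate(Principal("bob", "tenant-a"), sandbox, Operation.LOGS))
print(can_operate(Principal("bob", "tenant-a"), sandbox, Operation.DELETE))
```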

Kubernetes Secret-backed credentials, placement metadata, and optional Agent
Sandbox allocation settings.

The proposal adds first-class support for:

Collaborator

I worry about the security implications for this until we can connect it to a proposed authorization model for OpenShell in a bit more detail. Being able to target a specific namespace, reference secrets, and presumably control what service account the pod is running under needs scrutiny.

I agree that we will want to support fast sandbox creation (so claim/warm-pool is useful), but we will also want to support checkpoint/restore and/or scale-to-zero semantics. I think we should figure out the proxy -> sandbox split first and then see where we stand after that point.

Author

Agreed. The RFC should not move forward as an accepted API shape until the OpenShell control-plane authorization model and proxy/sandbox split are clearer.

I can move this back to draft and revise it to focus on the requirements and security boundaries: caller identity, namespace authorization, Secret reference authorization, service account authorization, driver config authorization, sandbox ownership, and who can connect to or operate a sandbox.


Today, the required integration hooks are not all available:

- The Kubernetes driver is configured with one sandbox namespace and creates

Collaborator

I agree that we will want to run sandboxes in separate namespaces, and definitely keep them out of the namespace that may also be running the OpenShell control plane itself. One thing I have been waiting to see shake out is whether we introduce a domain object above Sandbox in the OpenShell domain model.

Author

That makes sense.

The requirement I am trying to capture is that platform-managed sandboxes need namespace separation and platform-owned metadata, but I agree the RFC should not assume all of that belongs directly on SandboxSpec.

A higher-level OpenShell domain object above Sandbox may be the right place to represent tenant/session/workload intent, with the Kubernetes driver mapping that intent to Sandbox, SandboxClaim, or warm-pool-backed allocation.

I can update the RFC to describe the requirement and leave the exact domain object/API shape open.

audit metadata and cleanup lifecycle.
- The sandbox create path does not expose a trusted platform-selected target
namespace.
- OpenShell provider credentials are stored in OpenShell provider records;

Collaborator

I had imagined we would explore this via a Credential proto plugin design.

Author

Agreed. Kubernetes Secrets should probably be treated as one credential source backend, not as a Kubernetes-driver-only feature.

The RFC can be updated to align with a broader Credential proto/plugin model. In that model, OpenShell defines a common credential-source interface, and Kubernetes Secret is one implementation of that interface. The gateway resolves the approved credential source, while the existing provider/supervisor flow remains responsible for controlled credential injection into sandbox traffic.
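
One way to picture the common credential-source interface mentioned above is the sketch below. It is not the Credential proto/plugin design (which has not been written in this thread): `CredentialSource`, `ProviderRecordSource`, `FakeKubernetesSecretSource`, and the `namespace/name/key` reference format are hypothetical. The Kubernetes backend is a dictionary stand-in rather than a cluster client; the only real Kubernetes detail it models is that Secret `data` values are base64-encoded.

```python
import base64
from typing import Protocol


class CredentialSource(Protocol):
    """Hypothetical common interface for credential backends."""

    def resolve(self, ref: str) -> bytes: ...


class ProviderRecordSource:
    """Stand-in for credentials held in existing provider records."""

    def __init__(self, records: dict[str, bytes]) -> None:
        self._records = records

    def resolve(self, ref: str) -> bytes:
        return self._records[ref]


class FakeKubernetesSecretSource:
    """In-memory stand-in for a Kubernetes Secret backend (no cluster calls)."""

    def __init__(self, secrets: dict[tuple[str, str], dict[str, str]],
                 approved_namespaces: set[str]) -> None:
        self._secrets = secrets
        self._approved = approved_namespaces

    def resolve(self, ref: str) -> bytes:
        # Assumed reference format: "namespace/name/key".
        namespace, name, key = ref.split("/")
        if namespace not in self._approved:
            raise PermissionError(f"Secret namespace {namespace!r} is not approved")
        # Secret data values are base64-encoded.
        return base64.b64decode(self._secrets[(namespace, name)][key])


def resolve_credential(source: CredentialSource, ref: str) -> bytes:
    """The gateway resolves material without knowing which backend it uses."""
    return source.resolve(ref)


k8s_source = FakeKubernetesSecretSource(
    secrets={("tenant-a", "api"): {"token": base64.b64encode(b"s3cret").decode()}},
    approved_namespaces={"tenant-a"},
)
print(resolve_credential(k8s_source, "tenant-a/api/token"))
```

Injection of the resolved material into sandbox traffic is deliberately left out, since the comment above keeps that with the existing provider/supervisor flow.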

- OpenShell provider credentials are stored in OpenShell provider records;
provider/provider-v2 attachment does not accept Kubernetes `Secret`
references as the credential source.
- The Kubernetes driver does not expose `SandboxClaim` and `SandboxWarmPool` as

Collaborator

We should determine whether OpenShell itself should write SandboxWarmPool objects, so a user of the OpenShell control plane could define the behavior they desire for particular sandbox templates.

Author

Good point. The RFC currently assumes OpenShell can select an existing SandboxWarmPool, but it does not clearly answer who owns creation and reconciliation of warm pools.

There seem to be two possible models: the platform pre-creates warm pools and OpenShell references them, or OpenShell owns warm-pool creation/reconciliation based on sandbox templates and desired profiles.

I can move this into open questions and avoid prescribing the ownership model until we decide how warm-pool lifecycle should fit into OpenShell.

- The Kubernetes driver does not expose `SandboxClaim` and `SandboxWarmPool` as
the platform allocation path for warm pools or template-backed placement.

Without these hooks, a Kubernetes platform would need workarounds such as one

Collaborator

I agree we need to solve all these issues, but I am not sure this particular workaround is right until authorization on the OpenShell control plane is completed. Maybe we can keep this RFC in a draft state until that is done? In particular, it would be good to explore which gaps in the OpenShell authorization surface must be closed to make this pattern safe. /cc @mrunalp

Author

That sounds reasonable.

The goals still seem valid, but I agree the RFC should not move toward acceptance until the OpenShell control-plane authorization model is clear enough to make namespace selection, Secret references, service account selection, and sandbox access safe.

I can move the RFC back to draft and revise it to focus on requirements, authorization gaps, and integration points rather than presenting the API shape as ready for acceptance.

Contributor

sjenning commented Jun 2, 2026

I agree that we will eventually need to support a single OpenShell gateway running Sandboxes in multiple kube namespaces that are preconfigured by the platform. We will need an authz system before we can do that though.

Authz systems are hard to get right and kube already has an RBAC system. I'm trying to think about how we can leverage it when we are running on kube, or if we just need to roll our own for uniform UX across OpenShell deployment environments.

There are many needs for a sandbox-as-a-service use case that do not apply in single-player use cases and podman/docker on a local workstation where the authn/authz is very basic.

The authz and sandbox namespacing is a big enough topic on its own. Storing creds in Secret and using other resources from the Sandbox API could each be their own RFC. I'm not saying to create those at this time as, IMHO, they would be too forward looking to be actionable right now.

@rohancmr rohancmr force-pushed the rfc/platform-managed-kubernetes-sandboxes branch from 94b22ee to 0d2da99 on June 2, 2026 15:42
- https://github.com/NVIDIA/OpenShell/pull/1680
---

# RFC 0005 - Platform-Managed Kubernetes Sandboxes

Collaborator

@drew drew Jun 2, 2026

If I had to summarize this RFC, would it be correct to say you are asking for

  1. Sandbox configuration hooks so that Sandboxes can be launched with specific Kubernetes properties (eg namespaces, warm pools, etc).
  2. A Kubernetes secret backend for Providers
  3. An event stream that publishes OpenShell events (eg sandbox created, sandbox deleted, etc).

Am I missing anything?

Author

Yes, that is mostly correct.

I would summarize the RFC as asking for:

  1. A platform-managed Kubernetes usage pattern, where a trusted platform controller calls OpenShell as the sandbox execution plane.
  2. Kubernetes driver configuration hooks, likely through driver_config as proposed in #1589 (docs(rfc): add driver config passthrough proposal), so sandboxes can be launched with platform-selected Kubernetes properties such as namespace, RuntimeClass, node placement, SandboxClaim, and SandboxWarmPool.
  3. A Kubernetes Secret-backed credential source for Providers, or more generally a Credential proto/plugin model where Kubernetes Secret is one backend for approved credential material.
  4. Lifecycle/status events so platform schedulers, audit systems, and cleanup controllers can track sandbox provisioning, readiness, failure, and deletion.
  5. An authorization model that makes the above safe: caller identity, namespace authorization, Secret reference authorization, service account / driver config authorization, sandbox ownership, and sandbox connection/operation permissions.
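
For item 4, a lifecycle event stream could look roughly like the sketch below, in which audit and cleanup consumers subscribe to the same events. The RFC does not define an event schema or transport in this thread, so `SandboxPhase`, `SandboxEvent`, `EventBus`, and the in-process callback delivery are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class SandboxPhase(Enum):
    PROVISIONING = "provisioning"
    READY = "ready"
    FAILED = "failed"
    DELETED = "deleted"


@dataclass(frozen=True)
class SandboxEvent:
    sandbox_id: str
    phase: SandboxPhase
    namespace: str
    detail: str = ""


class EventBus:
    """In-process stand-in for whatever transport the gateway would use."""

    def __init__(self) -> None:
        self._subscribers: list[Callable[[SandboxEvent], None]] = []

    def subscribe(self, handler: Callable[[SandboxEvent], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, event: SandboxEvent) -> None:
        # Deliver to every subscriber in registration order.
        for handler in self._subscribers:
            handler(event)


audit_log: list[SandboxEvent] = []
needs_cleanup: set[str] = set()


def track_cleanup(event: SandboxEvent) -> None:
    # A cleanup controller only cares about terminal phases.
    if event.phase in (SandboxPhase.FAILED, SandboxPhase.DELETED):
        needs_cleanup.add(event.sandbox_id)


bus = EventBus()
bus.subscribe(audit_log.append)
bus.subscribe(track_cleanup)
bus.publish(SandboxEvent("sb-1", SandboxPhase.PROVISIONING, "tenant-a"))
bus.publish(SandboxEvent("sb-1", SandboxPhase.READY, "tenant-a"))
bus.publish(SandboxEvent("sb-2", SandboxPhase.FAILED, "tenant-a", "image pull failed"))
print(len(audit_log), sorted(needs_cleanup))
```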

Signed-off-by: Rohan Kumar <rohank@nvidia.com>
@rohancmr rohancmr force-pushed the rfc/platform-managed-kubernetes-sandboxes branch from 0d2da99 to 910a2d4 on June 3, 2026 06:52
Author

rohancmr commented Jun 3, 2026

Updated the draft RFC based on the review discussion.

Main changes:

  • Kept the RFC in draft and reframed it as a platform-managed Kubernetes requirements/authz draft rather than a final API proposal.
  • Made the target usage pattern explicit: an existing Kubernetes platform delegates sandbox execution to OpenShell; this is separate from the direct sandbox-as-a-service model.
  • Added an Authorization and trust model section covering platform-controller identity, namespace authorization, credential-source authorization, service account authorization, driver_config authorization, sandbox ownership, and connect/logs/exec/delete permissions.
  • Aligned Kubernetes-specific fields such as namespace, RuntimeClass, service account, placement, SandboxClaim, and SandboxWarmPool with the driver_config direction from #1589 (docs(rfc): add driver config passthrough proposal).
  • Reframed Kubernetes Secrets as one backend for a broader credential-source / Credential proto/plugin model, rather than as a Kubernetes-driver-only field.
  • Left SandboxWarmPool ownership and lifecycle/event boundaries as open questions.

This revision is intended to address the main concern that the platform-managed pattern needs to be connected to OpenShell control-plane authorization before moving toward an accepted API shape.

Signed-off-by: Rohan Kumar <rohank@nvidia.com>
Author

rohancmr commented Jun 3, 2026

Updated the draft RFC to focus more directly on the OpenShell control-plane authorization prerequisite for platform-managed Kubernetes sandboxes.

Main changes in this revision:

  • Reframed the summary so the central requirement is OpenShell authn/authz for trusted platform controllers, not just accepting more Kubernetes configuration.
  • Added an explicit authorization requirements section covering namespace scope, credential-source scope, service account scope, driver_config scope, policy attachment, sandbox ownership and connect/logs/exec/delete permissions.
  • Clarified that Kubernetes authorization and OpenShell authorization are different but mapped together: Kubernetes protects Kubernetes resources, while OpenShell must protect OpenShell API operations, sandbox ownership, credential use and sandbox operation rights.
  • Added an open question on whether OpenShell should reuse Kubernetes ServiceAccount/OIDC, RBAC or SubjectAccessReview when the gateway runs on Kubernetes, versus an OpenShell-native authorization policy.
  • Kept the concrete API shape intentionally open and kept Kubernetes-specific fields aligned with the driver_config direction from #1589 (docs(rfc): add driver config passthrough proposal).

The intent is to keep this RFC in draft and use it to capture the authorization gaps and platform-managed requirements before proposing concrete API changes.
