feat: emulate GCE metadata server for Google SDK access in sandboxes #1706

@p5

Problem Statement

Users need to use Google Cloud SDKs (Drive, Sheets, BigQuery, etc.) from inside sandboxes. Google SDKs use Application Default Credentials (ADC), which resolves credentials through a fixed chain: GOOGLE_APPLICATION_CREDENTIALS file → well-known gcloud ADC file → GCE metadata server. The first two require raw secrets on disk inside the sandbox, and the metadata server exists only on GCP compute, so none of the three works in an OpenShell sandbox today.

The proposed solution is to emulate the GCE metadata server via a loopback HTTP server inside the sandbox network namespace. The gateway already manages service account keys and generates short-lived tokens for the Vertex AI provider — this generalizes that capability.

Technical Context

ADC Resolution Chain

Google SDKs resolve credentials in this order (docs):

  1. GOOGLE_APPLICATION_CREDENTIALS env var → reads JSON key file (raw private key on disk)
  2. ~/.config/gcloud/application_default_credentials.json → has refresh token (raw secret)
  3. GCE metadata server at http://{GCE_METADATA_HOST}/computeMetadata/v1/... → short-lived tokens (no secrets on disk)
  4. Fail

Options 1 & 2 violate OpenShell's security model (no raw secrets in sandbox). Option 3 only works on GCP compute — unless we emulate it.
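
To make the ordering concrete, the chain can be sketched as a single function that returns the first source that applies. This is a minimal illustration, not the SDKs' actual code: the AdcSource enum, the resolve_adc name, and the two boolean probes are hypothetical stand-ins for the real file and network checks.

```rust
use std::collections::HashMap;

/// Which credential source ADC would pick. Variant names are illustrative.
#[derive(Debug, PartialEq)]
enum AdcSource {
    KeyFile(String),
    GcloudUserFile,
    MetadataServer(String),
    NotFound,
}

/// Walks the chain in the documented order. `gcloud_adc_file_exists` and
/// `metadata_reachable` are assumed probes standing in for real file and
/// network checks.
fn resolve_adc(
    env: &HashMap<String, String>,
    gcloud_adc_file_exists: bool,
    metadata_reachable: bool,
) -> AdcSource {
    // 1. Explicit key file (raw private key on disk).
    if let Some(path) = env.get("GOOGLE_APPLICATION_CREDENTIALS") {
        return AdcSource::KeyFile(path.clone());
    }
    // 2. Well-known gcloud ADC file (refresh token on disk).
    if gcloud_adc_file_exists {
        return AdcSource::GcloudUserFile;
    }
    // 3. Metadata server (short-lived tokens, nothing on disk).
    if metadata_reachable {
        let host = env
            .get("GCE_METADATA_HOST")
            .cloned()
            .unwrap_or_else(|| "metadata.google.internal".to_string());
        return AdcSource::MetadataServer(host);
    }
    // 4. Fail.
    AdcSource::NotFound
}

fn main() {
    let mut env = HashMap::new();
    env.insert("GCE_METADATA_HOST".to_string(), "127.0.0.1:8174".to_string());
    println!("{:?}", resolve_adc(&env, false, true));
}
```

With no key file and no gcloud file in the sandbox, only the third branch can succeed, which is why the emulator targets it.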

Critical Finding: SDKs Bypass HTTP_PROXY for Metadata

Investigation of the SDK source code showed that neither the Go nor the Python SDK reliably honors HTTP_PROXY for metadata requests: Go never does, and Python bypasses it for detection and the initial token fetch. This rules out a proxy-interception-only approach.

| SDK | Detection | Data Fetches | Transport | Proxy? |
| --- | --- | --- | --- | --- |
| Go cloud.google.com/go | GCE_METADATA_HOST set → OnGCE() returns true immediately | Custom http.Transport{Proxy: nil} | Direct TCP | Never |
| Python google-auth | ping() uses GCE_METADATA_IP (NOT GCE_METADATA_HOST) via raw http.client.HTTPConnection | _http_client.Request for init, requests.Session for refresh | Direct TCP (init), proxy-aware (refresh) | Not for detection or init |
| Node.js gcp-metadata | BIOS probe + metadata ping | Standard Node HTTP | Direct TCP | Honors proxy but doesn't need it |

Go: Creates &http.Transport{} with Proxy: nil — explicitly skips ProxyFromEnvironment. Not configurable. OnGCE() returns true immediately when GCE_METADATA_HOST is set ("The user explicitly said they're on GCE, so trust them").

Python: Two separate env vars: GCE_METADATA_IP (detection ping, default 169.254.169.254) and GCE_METADATA_HOST (data fetches, default metadata.google.internal). Detection uses http.client.HTTPConnection with zero proxy support. Not configurable — the source comment says: "This is only acceptable because the metadata server doesn't do SSL and never requires proxies." Both env vars must be set.

Node.js: METADATA_SERVER_DETECTION=assume-present bypasses BIOS detection. GCE_METADATA_HOST directs all requests to the loopback server.

Why Loopback Server, Not Proxy Interception

Since Go and Python hardcode direct TCP with no proxy support, the metadata emulator must be reachable via direct TCP inside the sandbox. A loopback server on 127.0.0.1:8174 handles all three SDKs uniformly.

Affected Components

Existing on main (to build upon)

| Component | Key Files | Relevance |
| --- | --- | --- |
| Vertex AI provider profile | providers/google-vertex-ai.yaml | Pattern for SA key bootstrap + token refresh credentials |
| SA JWT token minting | crates/openshell-server/src/provider_refresh.rs:501 | mint_google_service_account_jwt(): reuse for google-cloud provider |
| Provider env resolution | crates/openshell-server/src/grpc/provider.rs:425-534 | resolve_provider_environment(): extend with GCP env injection |
| Credential state | crates/openshell-sandbox/src/provider_credentials.rs | ProviderCredentialState: extend with child_env_resolved() |
| Policy local handler | crates/openshell-sandbox/src/policy_local.rs | Pattern reference for synthetic HTTP serving |
| SSH netns connect | crates/openshell-sandbox/src/ssh.rs | Pattern reference for setns() on dedicated thread |
| Provider registry | crates/openshell-providers/src/lib.rs | Extend with google-cloud + vertex provider plugins |
| Bypass rules / nftables | crates/openshell-sandbox/src/sandbox/linux/netns.rs | Network namespace setup; loopback server binds inside netns |

New files to create

| Component | File | Purpose |
| --- | --- | --- |
| Metadata server | crates/openshell-sandbox/src/metadata_server.rs | Generic loopback HTTP server with MetadataHandler trait |
| GCE metadata handler | crates/openshell-sandbox/src/gcp_metadata.rs | GCE metadata API implementation, OCSF logging |
| GCP constants | crates/openshell-core/src/gcp.rs | Shared constants: env var aliases, config keys, loopback address |
| GCP provider plugin | crates/openshell-providers/src/providers/gcp.rs | google-cloud provider type env injection |
| Vertex provider plugin | crates/openshell-providers/src/providers/vertex.rs | Extracted Vertex AI provider logic (from inline code in provider.rs) |
| GCP provider profile | providers/google-cloud.yaml | Credential definitions: service_account_token, adc_token |
| Docs | docs/sandboxes/gcp-credentials.mdx | User-facing GCP credentials documentation |

Files to modify

| File | Change |
| --- | --- |
| crates/openshell-sandbox/src/lib.rs | Add mod gcp_metadata, mod metadata_server; spawn loopback server in netns |
| crates/openshell-sandbox/src/provider_credentials.rs | Add child_env_resolved(): triple-layer env var injection |
| crates/openshell-sandbox/src/secrets.rs | Add placeholder_for_env_key() helper |
| crates/openshell-server/src/grpc/provider.rs | Replace inline Vertex AI config injection with registry.inject_env() |
| crates/openshell-providers/src/lib.rs | Register google-cloud and vertex provider plugins |
| crates/openshell-providers/src/profiles.rs | Add google-cloud.yaml to embedded profiles |
| crates/openshell-providers/src/providers/mod.rs | Add gcp and vertex modules |
| crates/openshell-core/src/lib.rs | Add pub mod gcp |

Technical Investigation

Architecture: Loopback Metadata Server

Google SDK in sandbox (Go, Python, Node.js)
  │
  │  GCE_METADATA_HOST=127.0.0.1:8174
  │  GCE_METADATA_IP=127.0.0.1:8174   (Python detection)
  │  METADATA_SERVER_DETECTION=assume-present  (Node.js)
  │
  ├─ Direct TCP to 127.0.0.1:8174
  │
  ▼
Loopback Metadata Server (127.0.0.1:8174)
  bound inside sandbox netns via setns()
  MetadataHandler trait → GCE MetadataContext
  │
  ├─ Validates Metadata-Flavor: Google header
  ├─ Rejects X-Forwarded-For (SSRF defense)
  ├─ Reads token from ProviderCredentialState
  ▼
Returns: {"access_token":"<placeholder>","expires_in":N,"token_type":"Bearer"}
  │
  ▼
SDK uses token in Authorization header on outbound *.googleapis.com requests
  (routed through proxy with normal egress policy, placeholder resolved to real token)

Loopback Server Design

  1. Namespace entry: Dedicated OS thread calls setns(netns_fd, CLONE_NEWNET) then TcpListener::bind("127.0.0.1:8174"). Thread exits after bind — no tokio thread pool contamination. Same pattern as ssh.rs::connect_in_netns.
  2. Accept loop: Runs on tokio runtime. 32 max concurrent connections, 4096 byte request cap.
  3. Handler trait: MetadataHandler is generic — future cloud providers (AWS IMDS, Azure IMDS) can reuse the same bind_in_netns + accept loop infrastructure.
  4. Port choice: 8174 (unprivileged, avoids CAP_NET_BIND_SERVICE).
  5. Readiness signal: oneshot::Sender<SocketAddr> signals when bound, with 5s timeout.
  6. Conditional startup: Only starts if GCE_METADATA_HOST is present in provider env (i.e., a google-cloud provider is attached).
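
The bind-then-signal steps above can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the proposed implementation: the setns() call is omitted so the sketch runs unprivileged, a std mpsc channel stands in for the oneshot channel, and bind_on_dedicated_thread and bind_listener are hypothetical names. The demo binds port 0 to avoid colliding with anything already on 8174.

```rust
use std::io;
use std::net::{SocketAddr, TcpListener};
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Readiness timeout taken from the design notes.
const READY_TIMEOUT: Duration = Duration::from_secs(5);

/// Binds the listener and reports the address it actually got.
fn bind_listener(addr: &str) -> io::Result<(TcpListener, SocketAddr)> {
    let listener = TcpListener::bind(addr)?;
    let local = listener.local_addr()?;
    Ok((listener, local))
}

/// Binds on a short-lived OS thread and hands the listener back to the caller.
fn bind_on_dedicated_thread(addr: &str) -> io::Result<(TcpListener, SocketAddr)> {
    let (ready_tx, ready_rx) = mpsc::channel();
    let addr = addr.to_string();

    let worker = thread::spawn(move || {
        // The real design would call setns(netns_fd, CLONE_NEWNET) here,
        // before binding, so only this thread ever enters the sandbox netns.
        let _ = ready_tx.send(bind_listener(&addr));
        // The thread exits right after the bind attempt.
    });

    // Wait for the readiness signal, giving up after the timeout.
    let outcome = ready_rx
        .recv_timeout(READY_TIMEOUT)
        .map_err(|_| io::Error::new(io::ErrorKind::TimedOut, "listener not ready within 5s"))?;
    let _ = worker.join();
    outcome
}

fn main() -> io::Result<()> {
    // Port 0 is a demo choice; the proposal uses 127.0.0.1:8174.
    let (_listener, addr) = bind_on_dedicated_thread("127.0.0.1:0")?;
    println!("metadata listener bound on {}", addr);
    Ok(())
}
```

A socket keeps the network namespace it was created in, so the returned listener can be moved to the async runtime for the accept loop while the runtime's own threads stay in the original namespace.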

SDK Compatibility: Three-Layer Env Var Injection

A new child_env_resolved() method on ProviderCredentialState handles all three SDKs:

| Env Var | Value | Purpose |
| --- | --- | --- |
| GCE_METADATA_HOST | 127.0.0.1:8174 | Go: instant OnGCE()=true + data fetch target. Python/Node.js: data fetch target |
| GCE_METADATA_IP | 127.0.0.1:8174 | Python: detection ping target (separate from GCE_METADATA_HOST) |
| METADATA_SERVER_DETECTION | assume-present | Node.js: skip BIOS probe that fails in sandboxes |
| GCP_PROJECT_ID, GOOGLE_CLOUD_PROJECT | Resolved from provider config | Non-secret config, un-placeholderized for SDK startup reads |
| CLOUD_ML_REGION, GCP_LOCATION | Resolved from provider config | Region aliases |
| GCP_SERVICE_ACCOUNT_EMAIL | Resolved from provider config | SA email for metadata /email endpoint |
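
A rough sketch of the injection these variables imply follows. It assumes user-supplied values must win (the non-overwrite behavior listed under the unit tests); GcpProviderConfig, set_default, and inject_gcp_env are hypothetical names, not the actual child_env_resolved() implementation.

```rust
use std::collections::BTreeMap;

/// Loopback address of the emulated metadata server.
const METADATA_LOOPBACK: &str = "127.0.0.1:8174";

/// Hypothetical view of the non-secret provider config.
struct GcpProviderConfig {
    project_id: Option<String>,
    region: Option<String>,
    service_account_email: Option<String>,
}

/// Sets `key` only if the user has not already provided a value.
fn set_default(env: &mut BTreeMap<String, String>, key: &str, value: &str) {
    env.entry(key.to_string()).or_insert_with(|| value.to_string());
}

fn inject_gcp_env(env: &mut BTreeMap<String, String>, cfg: &GcpProviderConfig) {
    // Detection and data-fetch targets for the Go, Python, and Node.js SDKs.
    set_default(env, "GCE_METADATA_HOST", METADATA_LOOPBACK);
    set_default(env, "GCE_METADATA_IP", METADATA_LOOPBACK);
    set_default(env, "METADATA_SERVER_DETECTION", "assume-present");

    // Non-secret config is injected as real values under each alias.
    if let Some(project) = &cfg.project_id {
        for key in ["GCP_PROJECT_ID", "GOOGLE_CLOUD_PROJECT"] {
            set_default(env, key, project);
        }
    }
    if let Some(region) = &cfg.region {
        for key in ["CLOUD_ML_REGION", "GCP_LOCATION"] {
            set_default(env, key, region);
        }
    }
    if let Some(email) = &cfg.service_account_email {
        set_default(env, "GCP_SERVICE_ACCOUNT_EMAIL", email);
    }
}

fn main() {
    let cfg = GcpProviderConfig {
        project_id: Some("demo-project".to_string()),
        region: None,
        service_account_email: None,
    };
    let mut env = BTreeMap::new();
    inject_gcp_env(&mut env, &cfg);
    for (key, value) in &env {
        println!("{}={}", key, value);
    }
}
```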

GCE Metadata API Surface

| Endpoint | Response | Content-Type |
| --- | --- | --- |
| GET / | computeMetadata/\n | text/plain |
| GET /computeMetadata/v1/instance/service-accounts/default/token | {"access_token":"<placeholder>","expires_in":N,"token_type":"Bearer"} | application/json |
| GET /computeMetadata/v1/instance/service-accounts/default/email | SA email (real value) | text/plain |
| GET /computeMetadata/v1/instance/service-accounts/default/scopes | https://www.googleapis.com/auth/cloud-platform | text/plain |
| GET /computeMetadata/v1/instance/service-accounts/default/aliases | default\n | text/plain |
| GET /computeMetadata/v1/instance/service-accounts/default?recursive=true | JSON with aliases, email, scopes | application/json |
| GET /computeMetadata/v1/instance/service-accounts/ | default/\n | text/plain |
| GET /computeMetadata/v1/project/project-id | Project ID (real value) | text/plain |

All responses include Metadata-Flavor: Google header. Requests without Metadata-Flavor: Google → 403. Requests with X-Forwarded-For → 403.
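
The routing and header rules can be sketched as a pure function over method, request target, and headers. Everything here is hypothetical illustration: the MetadataHandler signature, the GceMetadata struct, the order of the checks, the assumption that header names arrive lower-cased, and the exact layout of the recursive JSON are not confirmed by the proposal.

```rust
use std::collections::HashMap;

/// Minimal response model. A real server would also attach
/// `Metadata-Flavor: Google` to every response.
#[derive(Debug)]
struct Response {
    status: u16,
    content_type: &'static str,
    body: String,
}

/// Hypothetical shape of the generic handler trait.
trait MetadataHandler {
    fn handle(&self, method: &str, target: &str, headers: &HashMap<String, String>) -> Response;
}

/// Hypothetical GCE context; `token_placeholder` is None when no credential is loaded.
struct GceMetadata {
    token_placeholder: Option<String>,
    expires_in: u64,
    email: String,
    project_id: String,
}

const SCOPE: &str = "https://www.googleapis.com/auth/cloud-platform";

fn text(status: u16, body: &str) -> Response {
    Response { status, content_type: "text/plain", body: body.to_string() }
}

fn json(body: String) -> Response {
    Response { status: 200, content_type: "application/json", body }
}

impl MetadataHandler for GceMetadata {
    fn handle(&self, method: &str, target: &str, headers: &HashMap<String, String>) -> Response {
        // Header names are assumed to be lower-cased by the HTTP parser.
        if method != "GET" {
            return text(405, "method not allowed\n");
        }
        if headers.contains_key("x-forwarded-for") {
            return text(403, "forbidden\n");
        }
        if headers.get("metadata-flavor").map(String::as_str) != Some("Google") {
            return text(403, "forbidden\n");
        }
        match target {
            "/" => text(200, "computeMetadata/\n"),
            "/computeMetadata/v1/instance/service-accounts/" => text(200, "default/\n"),
            "/computeMetadata/v1/instance/service-accounts/default/token" => {
                match &self.token_placeholder {
                    // Values are assumed to need no JSON escaping.
                    Some(tok) => json(format!(
                        "{{\"access_token\":\"{}\",\"expires_in\":{},\"token_type\":\"Bearer\"}}",
                        tok, self.expires_in
                    )),
                    None => text(503, "no credentials\n"),
                }
            }
            "/computeMetadata/v1/instance/service-accounts/default/email" => text(200, &self.email),
            "/computeMetadata/v1/instance/service-accounts/default/scopes" => text(200, SCOPE),
            "/computeMetadata/v1/instance/service-accounts/default/aliases" => text(200, "default\n"),
            "/computeMetadata/v1/instance/service-accounts/default?recursive=true" => json(format!(
                "{{\"aliases\":[\"default\"],\"email\":\"{}\",\"scopes\":[\"{}\"]}}",
                self.email, SCOPE
            )),
            "/computeMetadata/v1/project/project-id" => text(200, &self.project_id),
            _ => text(404, "not found\n"),
        }
    }
}

fn main() {
    let gce = GceMetadata {
        token_placeholder: Some("openshell:resolve:env:v1_GCP_SA_ACCESS_TOKEN".to_string()),
        expires_in: 3600,
        email: "sa@demo-project.iam.gserviceaccount.com".to_string(),
        project_id: "demo-project".to_string(),
    };
    let mut headers = HashMap::new();
    headers.insert("metadata-flavor".to_string(), "Google".to_string());
    let r = gce.handle("GET", "/computeMetadata/v1/project/project-id", &headers);
    println!("{} {} {}", r.status, r.content_type, r.body);
}
```

Keeping the handler free of socket code makes the endpoint routing, 403, 404, 405, and 503 cases straightforward to unit test without binding a port.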

Token Security: Placeholder-Based Resolution

The metadata /token endpoint serves placeholders (openshell:resolve:env:v{revision}_GCP_SA_ACCESS_TOKEN), not real token values. Real values are only resolved at the proxy layer when the token is used in outbound API requests via Authorization: Bearer headers. Non-secret values (project ID, SA email) are served as real values, matching real GCE metadata server behavior.
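
As a sketch of the two halves of this scheme, the first function below builds the placeholder in the format quoted above and the second performs the proxy-side swap on an Authorization header. The function names, the secrets map, and the choice to return None for an unknown placeholder are assumptions made for illustration; "ya29.example" is a made-up token value.

```rust
use std::collections::HashMap;

const PLACEHOLDER_PREFIX: &str = "openshell:resolve:env:";

/// Builds the placeholder served by the /token endpoint for a credential revision.
fn token_placeholder(revision: u64) -> String {
    format!("{}v{}_GCP_SA_ACCESS_TOKEN", PLACEHOLDER_PREFIX, revision)
}

/// Proxy-side step: swaps a placeholder bearer token for the real secret.
/// Returns None when the header carries a placeholder the proxy cannot resolve,
/// so the caller can reject the request instead of forwarding the placeholder.
fn resolve_authorization(value: &str, secrets: &HashMap<String, String>) -> Option<String> {
    match value.strip_prefix("Bearer ") {
        Some(token) if token.starts_with(PLACEHOLDER_PREFIX) => {
            secrets.get(token).map(|real| format!("Bearer {}", real))
        }
        // Anything that is not a placeholder passes through unchanged.
        _ => Some(value.to_string()),
    }
}

fn main() {
    let placeholder = token_placeholder(3);
    let mut secrets = HashMap::new();
    secrets.insert(placeholder.clone(), "ya29.example".to_string());

    let header = format!("Bearer {}", placeholder);
    println!("{:?}", resolve_authorization(&header, &secrets));
}
```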

Provider Architecture

New google-cloud provider type alongside existing google-vertex-ai:

  • service_account_token: Gateway signs JWT with SA key, exchanges at oauth2.googleapis.com/token for access token. cloud-platform scope covers all Google APIs. Reuses existing mint_google_service_account_jwt().
  • adc_token: Gateway exchanges gcloud ADC refresh token for access token. Same OAuth2 refresh flow as existing Vertex AI ADC credential.
  • inject_env(): Generalized via ProviderRegistry — replaces inline Vertex AI config injection in resolve_provider_environment(). Both google-cloud and google-vertex-ai providers inject type-specific env vars through the same interface.
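
One way the generalized inject_env() dispatch could look is sketched below. The ProviderPlugin trait, the config keys project_id and region, and the specific variables each plugin sets are hypothetical; in particular, which variables the Vertex plugin injects is an assumption for the sake of the example.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical plugin interface: each provider type injects its own env vars.
trait ProviderPlugin {
    fn inject_env(&self, config: &HashMap<String, String>, env: &mut BTreeMap<String, String>);
}

struct GoogleCloud;
struct VertexAi;

impl ProviderPlugin for GoogleCloud {
    fn inject_env(&self, config: &HashMap<String, String>, env: &mut BTreeMap<String, String>) {
        env.entry("GCE_METADATA_HOST".to_string())
            .or_insert_with(|| "127.0.0.1:8174".to_string());
        if let Some(project) = config.get("project_id") {
            env.entry("GOOGLE_CLOUD_PROJECT".to_string())
                .or_insert_with(|| project.clone());
        }
    }
}

impl ProviderPlugin for VertexAi {
    fn inject_env(&self, config: &HashMap<String, String>, env: &mut BTreeMap<String, String>) {
        // Assumed for illustration: Vertex contributes the region alias.
        if let Some(region) = config.get("region") {
            env.entry("CLOUD_ML_REGION".to_string())
                .or_insert_with(|| region.clone());
        }
    }
}

struct ProviderRegistry {
    plugins: HashMap<String, Box<dyn ProviderPlugin>>,
}

impl ProviderRegistry {
    fn with_defaults() -> Self {
        let mut plugins: HashMap<String, Box<dyn ProviderPlugin>> = HashMap::new();
        plugins.insert("google-cloud".to_string(), Box::new(GoogleCloud));
        plugins.insert("google-vertex-ai".to_string(), Box::new(VertexAi));
        ProviderRegistry { plugins }
    }

    /// Returns false when no plugin is registered for `provider_type`.
    fn inject_env(
        &self,
        provider_type: &str,
        config: &HashMap<String, String>,
        env: &mut BTreeMap<String, String>,
    ) -> bool {
        match self.plugins.get(provider_type) {
            Some(plugin) => {
                plugin.inject_env(config, env);
                true
            }
            None => false,
        }
    }
}

fn main() {
    let registry = ProviderRegistry::with_defaults();
    let mut config = HashMap::new();
    config.insert("project_id".to_string(), "demo-project".to_string());
    let mut env = BTreeMap::new();
    registry.inject_env("google-cloud", &config, &mut env);
    println!("{:?}", env);
}
```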

Existing Patterns to Follow

| Pattern | Location on main | How it applies |
| --- | --- | --- |
| Synthetic HTTP serving | crates/openshell-sandbox/src/policy_local.rs | PolicyLocalContext route dispatch, response format, OCSF audit |
| setns() on dedicated thread | crates/openshell-sandbox/src/ssh.rs | connect_in_netns pattern: OS thread + setns to avoid tokio contamination |
| SA JWT token minting | crates/openshell-server/src/provider_refresh.rs:501-544 | mint_google_service_account_jwt(): RS256 JWT → Google access token |
| Provider env var injection | crates/openshell-server/src/grpc/provider.rs:488-527 | Inline Vertex AI config injection, to be generalized |
| Credential snapshot | crates/openshell-sandbox/src/provider_credentials.rs | ProviderCredentialState atomic snapshot for credential access |
| Provider profile YAML | providers/google-vertex-ai.yaml | Credential definitions, refresh strategies, scopes |

Proposed Approach

Loopback HTTP server on 127.0.0.1:8174 inside the sandbox netns, reachable by all three SDKs via direct TCP. Three env vars (GCE_METADATA_HOST, GCE_METADATA_IP, METADATA_SERVER_DETECTION) ensure detection succeeds across Go, Python, and Node.js. MetadataHandler trait enables future cloud provider emulators (AWS IMDS, Azure). New google-cloud provider type with SA JWT + ADC OAuth2 refresh. SA keys never enter the sandbox.

Scope Assessment

  • Complexity: Medium
  • Confidence: High — clear implementation path, all building blocks exist on main
  • New files: 7
  • Modified files: 8
  • Issue type: feat

Risks & Open Questions

  • Loopback server lifecycle: Server runs for sandbox lifetime. If credential state is stale (refresh delay), tokens served may be near-expiry. Same latency as current env var injection — acceptable.
  • Port 8174 collision: Unlikely but possible if sandbox processes bind the same port. Could be made configurable.
  • setns safety: Uses a dedicated OS thread that exits after bind — no namespace contamination of tokio thread pool.
  • Placeholder tokens: SDKs receive placeholder strings, not real tokens; the proxy resolves them on outbound requests. SDKs that validate token format locally (e.g., check for a ya29. prefix) may reject the placeholder.
  • Opt-in activation: Metadata server only starts if GCE_METADATA_HOST is present in provider env (conditional on having a google-cloud provider attached).
  • Egress policy: Sandboxes still need egress policy allowing *.googleapis.com for actual API calls.
  • /universe/universe_domain endpoint: Go SDK fetches this during ADC init. Should return 404 (triggers default googleapis.com fallback) or serve the value.
  • Gateway config docs: No new gateway TOML fields — activation is purely provider-driven.
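
The near-expiry concern in the lifecycle risk above comes down to how expires_in is derived from the credential's absolute expiry. A minimal sketch follows, assuming the credential state exposes an expiry timestamp; expires_in_secs is a hypothetical helper, not an existing function.

```rust
use std::time::{Duration, SystemTime};

/// Seconds remaining until `expires_at`, clamped to zero once the token has expired.
fn expires_in_secs(expires_at: SystemTime, now: SystemTime) -> u64 {
    expires_at
        .duration_since(now)
        .map(|remaining| remaining.as_secs())
        .unwrap_or(0)
}

fn main() {
    let now = SystemTime::now();
    let expires_at = now + Duration::from_secs(3600);
    println!("expires_in = {}", expires_in_secs(expires_at, now));
}
```

Clamping to zero keeps a stale snapshot from producing a negative or wrapped value in the /token response.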

Test Considerations

  • Unit tests: Metadata handler endpoint routing, Metadata-Flavor: Google header enforcement, X-Forwarded-For rejection, response format, expiry computation, missing credential handling (503), unknown paths (404)
  • Unit tests: STATIC_CONFIG_KEYS consistency with alias arrays, config_key() resolution, token env key ordering
  • Unit tests: Provider injection — metadata host set, project ID/region/email propagation, non-overwrite of user values
  • Integration tests (requires root): Loopback server bind-in-netns, verify TCP reachability from sandbox namespace
  • E2E tests: Sandbox with google-cloud provider, verify Python/Go/Node.js ADC detection succeeds and token fetch returns valid placeholder
  • Negative tests: Missing Metadata-Flavor header → 403, X-Forwarded-For → 403, unknown paths → 404, no credentials → 503, POST method → 405

Created by spike investigation. Use build-from-issue to plan and implement.
