bug: Homebrew upgrades can keep stale Docker supervisor_image pin from old gateway.toml #1718

@kirit93

Agent Diagnostic

  • Investigated a local Homebrew OpenShell upgrade where sandbox creation got stuck in Provisioning.

  • Found that an older Homebrew install had created /opt/homebrew/var/openshell/gateway.toml.

  • That file persisted across upgrade and still pinned:

    [openshell.drivers.docker]
    supervisor_image = "ghcr.io/nvidia/openshell/supervisor:0.0.43"

  • After upgrading the CLI/gateway to 0.0.54, the Homebrew service still honored the old prefix config, so the gateway launched Docker sandboxes with supervisor 0.0.43.

  • Result: sandbox containers started, but never completed the supervisor relay handshake.

  • Cleaning the old state and reinstalling 0.0.55 removed the stale pin; the gateway then pulled ghcr.io/nvidia/openshell/supervisor:0.0.55 and sandbox creation worked.

Description

Actual behavior:

A Homebrew upgrade can leave an old /opt/homebrew/var/openshell/gateway.toml in place. If that file was generated by an older packaging flow and contains a version-pinned Docker supervisor_image, the upgraded gateway keeps launching sandboxes with that old supervisor image.

In my case:

CLI/gateway:        0.0.54
Docker supervisor:  0.0.43

Sandbox creation then got stuck at:

Starting sandbox... Waiting for supervisor relay

Expected behavior:

The upgrade/install path should not silently keep using an old Docker supervisor image that is incompatible with the upgraded gateway.

Possible fixes:

  • Detect stale Homebrew prefix config during install/upgrade.
  • Warn if [openshell.drivers.docker].supervisor_image is pinned to a different OpenShell version than the installed gateway.
  • Migrate/remove old generated Homebrew config when it only contains package-generated defaults.
  • Prefer runtime defaults over old generated prefix config unless the user explicitly opted into it.

Reproduction Steps

  1. Start from an older Homebrew OpenShell install that generated /opt/homebrew/var/openshell/gateway.toml.

  2. Ensure that file contains a pinned Docker supervisor image, for example:

    [openshell.drivers.docker]
    supervisor_image = "ghcr.io/nvidia/openshell/supervisor:0.0.43"

  3. Upgrade OpenShell using the current install script or Homebrew.

  4. Run:

    openshell sandbox create

  5. Observe that sandbox creation can hang at Waiting for supervisor relay.

Environment

  • OS: macOS Darwin 25.1.0 arm64
  • Install method: Homebrew via install.sh
  • Docker: Docker Desktop 28.3.x
  • OpenShell upgrade observed: from a 0.0.43-era config to 0.0.54
  • Confirmed working after clean reinstall: 0.0.55

Logs

Relevant stale config:


[openshell.drivers.docker]
supervisor_image = "ghcr.io/nvidia/openshell/supervisor:0.0.43"


Version mismatch:


openshell --version
openshell 0.0.54

docker exec <sandbox-container> /opt/openshell/bin/openshell-sandbox --version
openshell-sandbox 0.0.43


Sandbox symptoms:


Starting sandbox... Waiting for supervisor relay


Sandbox logs included:


PermissionDenied, message: "this method requires a sandbox principal"
NET:FAIL host.openshell.internal:17670


After cleanup/reinstall, expected behavior returned:


Pulling docker supervisor image image="ghcr.io/nvidia/openshell/supervisor:0.0.55"
Extracting supervisor binary from image to host cache
Server listening address=127.0.0.1:17670

Agent-First Checklist

  • I pointed my agent at the repo and had it investigate this issue
  • I loaded relevant skills (e.g., debug-openshell-cluster, debug-inference, openshell-cli)
  • My agent could not resolve this — the diagnostic above explains why
