fix(loops): survive cold-start sandbox acquisition (don't re-POST-thrash) #185
Merged
…light box, don't re-POST-thrash)
On a cold scale-from-zero (0 warm boxes), the SDK create request times out (~30s)
before the orchestrator finishes provisioning the NAMED box. The recovery scanned
list() ONCE — missing the still-provisioning row — then re-POSTed a fresh cold
provision every backoff, restarting the same wall and never converging within the
600s budget ('could not acquire a running sandbox within budget').
Fix: after a retryable create error, poll list() across a short window
(appearScans=5 × pollMs) for the named box to APPEAR, and attach to it — the
orchestrator usually accepted the create and the row shows up seconds later. Only
re-create if it truly never appears (genuine rollback). Turns a cold acquire from
a hard failure into a (slower) success without touching the orchestrator.
Defense-in-depth for the warm-pool-disabled prod regime; the durable fix is the
ContainerPool warm-box floor (separate, orchestrator-side).
- typecheck clean; loops suite 143/143 (acquire 9/9). Default behavior preserved
when the box appears on the first scan (the prior single-scan happy path).
drewstone added a commit that referenced this pull request on Jun 6, 2026
…de-dup steer-firewall (#187)
* fix(runtime): recover orphaned read-retry + provision-retry hardenings (reconciled with #185)
* refactor(bench): extract shared stats.mts + buildRunRecordFromAttempts; configurable worker provider (dedup the gate zoo)
* refactor(runtime): de-dup steer-firewall to one site; drop dead analyst-driver-hook; document canonical atom
Problem

Prod runs with 0 warm boxes (the `ContainerPool` floor is disabled in deploy; that is a separate orchestrator fix), so every acquisition cold-scales from zero. The SDK `create` request times out (~30s) before the orchestrator finishes provisioning the named box. The recovery scanned `list()` once, missed the still-provisioning row, then re-POSTed a fresh cold provision every backoff, restarting the same wall and never converging within the 600s budget → `could not acquire a running sandbox within budget`. This blocked the eyes-present self-improvement proof.

Fix (`src/runtime/sandbox-acquire.ts`)

After a retryable create error, poll `list()` across a short window (appearScans=5 × pollMs) for the named box to appear, and attach to it. The orchestrator usually accepted the create and the row shows up seconds later. Only re-create if it genuinely never appears (true rollback). Turns a cold acquire from a hard failure into a (slower) success: defense-in-depth that can unblock batch runs without touching the prod orchestrator.

Validation

`tsc` clean; loops suite 143/143 (acquire 9/9). Happy path (box on first scan) byte-identical to the prior single-scan. Updated the one timing-pinned test to a budget that fits the new scan-window (fake clock → still instant).

Scope / the durable companion

This is the kernel-side half. The primary fix is the orchestrator warm-box floor (enable + host-aware `ContainerPool` in agent-dev-container `deploy.yml`/`generate-env.sh`) so acquisitions hit a ready box instead of cold-provisioning. That's a HIGH-risk prod-deploy (ship Part A alone in multi-host → trips `broken` → silently 0 warm), staged separately, not in this PR.
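
For readers without the diff open, here is a minimal sketch of the recovery described under Fix. It is not the code in `src/runtime/sandbox-acquire.ts`: the `Box`, `AcquireDeps`, and `AcquireOpts` shapes, the function names, and the `maxCreates` cap (standing in for the real 600s budget and backoff) are all assumptions made for illustration. It also returns the row as soon as it appears and leaves waiting for the box to reach a running state out of scope.

```typescript
// Hypothetical shapes for illustration; the real SDK types may differ.
interface Box {
  name: string;
  status: "provisioning" | "running";
}

interface AcquireDeps {
  create(name: string): Promise<Box>; // may reject with a retryable timeout
  list(): Promise<Box[]>; // rows the orchestrator currently reports
  sleep(ms: number): Promise<void>; // injectable so tests can use a fake clock
  isRetryable(err: unknown): boolean;
}

interface AcquireOpts {
  appearScans: number; // how many list() scans per window
  pollMs: number; // delay between scans
  maxCreates: number; // assumed stand-in for the time budget
}

// Scan list() up to `scans` times, `pollMs` apart, for a row with this name.
async function waitForAppearance(
  deps: AcquireDeps,
  name: string,
  scans: number,
  pollMs: number,
): Promise<Box | undefined> {
  for (let i = 0; i < scans; i++) {
    const hit = (await deps.list()).find((b) => b.name === name);
    if (hit) return hit; // a hit on scan 0 matches the old single-scan path
    if (i < scans - 1) await deps.sleep(pollMs);
  }
  return undefined;
}

async function acquireNamedBox(
  deps: AcquireDeps,
  name: string,
  opts: AcquireOpts,
): Promise<Box> {
  for (let attempt = 0; attempt < opts.maxCreates; attempt++) {
    try {
      return await deps.create(name);
    } catch (err) {
      if (!deps.isRetryable(err)) throw err;
      // The create may have been accepted even though the request timed out:
      // wait for the named row to show up and attach to it instead of re-POSTing.
      const found = await waitForAppearance(deps, name, opts.appearScans, opts.pollMs);
      if (found) return found;
      // Never appeared within the window: treat as a rollback and re-create.
    }
  }
  throw new Error("could not acquire a running sandbox within budget");
}
```

Injecting `sleep` is what lets a timing-sensitive test run instantly under a fake clock: the test counts requested milliseconds rather than waiting for them.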