fix: reconcile in-flight ledger at watcher startup #302

Merged
ProtocolWarden merged 2 commits into main from fix/startup-inflight-reconcile
Jun 15, 2026

@ProtocolWarden

Problem

Most watchdog cycles fire ORPHANED_IN_FLIGHT_CLEAR (and the matching STALE_RUNNING_REQUEUE). Root cause: coordinator.py records execution_started and pairs it with execution_finished in a finally block. That is correct, but a finally cannot run when the executor process dies between the two markers (a session-limit kill, an OOM, or the SIGTERM that a code-pull restart sends to a watcher mid-dispatch). The (backend, task_id) slot then leaks against the per-backend concurrency cap until board_unblock Rule 10 clears it on its next watchdog cycle. That is roughly 15–30 min of a needlessly held slot, and the leak is most likely exactly when the controller restarts watchers on new code.
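
To make the leak concrete, here is a minimal sketch of how an unpaired execution_started keeps a slot counted against the cap. It assumes the ledger is an append-only list of events with `kind`, `backend`, and `task_id` fields; those field names, the helper names, and the `backend-a` value are illustrative assumptions, not the actual usage_store schema.

```python
from collections import Counter


def open_slots(events):
    """Return the (backend, task_id) pairs that have a start but no finish."""
    still_open = set()
    for ev in events:
        key = (ev["backend"], ev["task_id"])
        if ev["kind"] == "execution_started":
            still_open.add(key)
        elif ev["kind"] == "execution_finished":
            still_open.discard(key)
    return still_open


def in_flight_per_backend(events):
    """Count open slots per backend, as a concurrency cap check would."""
    return Counter(backend for backend, _ in open_slots(events))


# Hypothetical ledger: T-1 completed normally, T-2's executor was killed
# before its finally block could record execution_finished.
events = [
    {"kind": "execution_started", "backend": "backend-a", "task_id": "T-1"},
    {"kind": "execution_finished", "backend": "backend-a", "task_id": "T-1"},
    {"kind": "execution_started", "backend": "backend-a", "task_id": "T-2"},
]
```

With this ledger, T-2 still counts as in flight for `backend-a` even though no process is working on it, which is the state Rule 10 eventually cleans up.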

The existing startup reconciliation covers Plane Running tasks but not the usage_store in-flight ledger.

Change

  • Extract the Rule 10 orphan scan into operations_center/in_flight_reconcile.py (find_orphaned_in_flight / clear_orphaned_in_flight), giving both callers one definition of "what is an orphan". board_unblock._clear_orphaned_in_flight_events now delegates to it; its private _state_name/_is_terminal/_TERMINAL_STATES are imported from the new module (no duplicated bodies), and the now-unused httpx import is dropped.
  • board_worker runs reconcile_in_flight_on_startup once before its poll loop, so a code-pull restart immediately reclaims the slots its own SIGTERM may have leaked. The call is serialised across the role processes (goal/test/improve/spec-author boot together) with an exclusive lock on the usage store, and it is best-effort: it never blocks a watcher from coming up.
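
The shared orphan definition from the first bullet can be sketched roughly as follows. This is not the code in in_flight_reconcile.py: the signature, the event fields, the `TaskLookupError` type, the 24-hour window, and the terminal state names are all assumptions, and the rules are inferred from the test case names in this PR (404/terminal/running, finished-closes-slot, window cutoff, non-404 skip).

```python
from datetime import datetime, timedelta, timezone

# Assumed terminal state names; the real _TERMINAL_STATES set is not shown here.
TERMINAL_STATES = {"done", "cancelled"}


class TaskLookupError(Exception):
    """Hypothetical error raised when the task lookup returns an HTTP error."""

    def __init__(self, status):
        super().__init__(f"task lookup failed with HTTP {status}")
        self.status = status


def find_orphaned_in_flight(events, fetch_state, now, window=timedelta(hours=24)):
    """Return open (backend, task_id) slots whose task is gone or terminal."""
    cutoff = now - window
    open_slots = {}
    for ev in events:
        if ev["at"] < cutoff:
            continue  # outside the scan window: ignored
        key = (ev["backend"], ev["task_id"])
        if ev["kind"] == "execution_started":
            open_slots[key] = ev
        elif ev["kind"] == "execution_finished":
            open_slots.pop(key, None)  # a finish closes the slot

    orphans = []
    for key in open_slots:
        try:
            state = fetch_state(key[1])
        except TaskLookupError as exc:
            if exc.status == 404:
                orphans.append(key)  # task no longer exists
            continue  # any other lookup error: skip rather than guess
        if state.lower() in TERMINAL_STATES:
            orphans.append(key)  # task finished, but the slot was never closed
    return orphans
```

Keeping the scan free of side effects, with clearing as a separate step, is one way to support both the apply and dry-run modes mentioned in the tests.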

Action output is byte-identical to Rule 10, so existing watchdog logging is unaffected.

Tests

  • tests/unit/test_in_flight_reconcile.py covers: 404/terminal/running, finished-closes-slot, window cutoff, non-404 skip, apply/dry-run, record-error, startup lock-skip, never-raise.
  • 99 pass with the existing board_unblock suite (now exercising the delegation); 366 pass across maintenance + board_worker. ruff clean. custodian-multi clean (0 findings).
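
The "startup lock-skip" and "never-raise" cases listed above describe behaviour that could look roughly like the sketch below. It assumes a POSIX system and an advisory `fcntl.flock` lock taken in non-blocking mode on a lock file. The lock path, the two-argument signature, and the boolean return value are hypothetical; the PR only states that the call takes an exclusive lock on the usage store and is best-effort.

```python
import fcntl
import logging

log = logging.getLogger("in_flight_reconcile")


def reconcile_in_flight_on_startup(lock_path, reconcile):
    """Run reconcile() once under an exclusive lock, and never raise.

    Returns True if this process ran the reconcile, False if it skipped
    (another process holds the lock) or the reconcile failed.
    """
    try:
        with open(lock_path, "a") as handle:
            try:
                # Non-blocking: if a sibling role process already holds the
                # lock, skip instead of waiting for it.
                fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError:
                return False
            try:
                reconcile()
                return True
            finally:
                fcntl.flock(handle, fcntl.LOCK_UN)
    except Exception:
        # Best-effort: log and continue so the watcher still comes up.
        log.exception("startup in-flight reconcile failed; continuing")
        return False
```

In this sketch the first role process to boot does the work and the others skip, so a simultaneous restart of several watchers clears each leaked slot only once.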

🤖 Generated with Claude Code

ProtocolWarden and others added 2 commits June 15, 2026 02:19
OC custodian-audit runs ~94s under light load but consistently exceeds
120s when the sweep is running parallel jobs. The timeout caused the sweep
to emit `custodian-audit timed out (>120s)` for OperationsCenter each
cycle even though the audit itself reports 0 findings. 180s gives the OC
audit a safe margin even under concurrent sweep load.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

A dispatch records execution_started and pairs execution_finished in a
finally (coordinator.py). That is correct, but a finally cannot run when
the executor process itself dies between the two markers — a session-limit
kill, an OOM, or the SIGTERM a code-pull restart sends to a watcher
mid-dispatch. The (backend, task_id) slot then leaks against the per-backend
concurrency cap until board_unblock Rule 10 clears it on its next watchdog
cycle (~15-30 min later).

Extract the Rule 10 orphan scan into operations_center/in_flight_reconcile.py
so the watcher-startup path can run the identical check. board_unblock now
delegates to it (no duplicated logic; shared state_name/is_terminal). The
board_worker runs reconcile_in_flight_on_startup once before its poll loop,
serialised across role processes with an exclusive lock on the usage store
and best-effort so it never blocks startup. A code-pull restart therefore
immediately reclaims the slots its own SIGTERM may have leaked.

Tests: tests/unit/test_in_flight_reconcile.py. Action output is byte-identical
to Rule 10, so existing watchdog logging and board_unblock tests are unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ProtocolWarden ProtocolWarden merged commit 59bf8dc into main Jun 15, 2026
17 of 18 checks passed
@ProtocolWarden ProtocolWarden deleted the fix/startup-inflight-reconcile branch June 15, 2026 07:47