fix: reconcile in-flight ledger at watcher startup #302
Merged
OC custodian-audit runs ~94s under light load but consistently exceeds 120s when the sweep is running parallel jobs. The timeout caused the sweep to emit `custodian-audit timed out (>120s)` for OperationsCenter each cycle even though the audit itself reports 0 findings. 180s gives the OC audit a safe margin even under concurrent sweep load.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A dispatch records execution_started and pairs execution_finished in a finally (coordinator.py). That is correct, but a finally cannot run when the executor process itself dies between the two markers: a session-limit kill, an OOM, or the SIGTERM a code-pull restart sends to a watcher mid-dispatch. The (backend, task_id) slot then leaks against the per-backend concurrency cap until board_unblock Rule 10 clears it on its next watchdog cycle (~15-30 min later).

Extract the Rule 10 orphan scan into operations_center/in_flight_reconcile.py so the watcher-startup path can run the identical check. board_unblock now delegates to it (no duplicated logic; shared state_name/is_terminal). The board_worker runs reconcile_in_flight_on_startup once before its poll loop, serialised across role processes with an exclusive lock on the usage store and best-effort so it never blocks startup, so a code-pull restart immediately reclaims the slots its own SIGTERM may have leaked.

Tests: tests/unit/test_in_flight_reconcile.py. Action output is byte-identical to Rule 10, so existing watchdog logging and board_unblock tests are unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
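The failure mode described in that commit message can be reproduced in isolation. The following is a minimal sketch, not the code in coordinator.py: the child script, the file-based ledger, and the `run_dispatch` helper are all hypothetical stand-ins. It assumes a POSIX system, where an unhandled SIGTERM terminates a Python process without unwinding, so the `finally` never runs.

```python
import os
import signal
import subprocess
import sys
import tempfile
import time

# Hypothetical stand-in for a dispatch: append a start marker, "work" for a
# while, and append the finish marker from a finally block.
CHILD = r"""
import sys, time

def record(event):
    with open(sys.argv[1], "a") as fh:
        fh.write(event + "\n")

record("execution_started")
try:
    time.sleep(float(sys.argv[2]))   # stands in for the real dispatch
finally:
    record("execution_finished")     # skipped if the process is killed
"""


def run_dispatch(kill: bool) -> list:
    """Run the child once and return the markers it left in the ledger."""
    fd, ledger = tempfile.mkstemp()
    os.close(fd)
    try:
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD, ledger, "30" if kill else "0.1"]
        )
        if kill:
            # Wait until the start marker is on disk, then terminate the
            # child the way a restart would.
            deadline = time.monotonic() + 10
            while os.path.getsize(ledger) == 0 and time.monotonic() < deadline:
                time.sleep(0.05)
            proc.send_signal(signal.SIGTERM)
        proc.wait(timeout=60)
        with open(ledger) as fh:
            return fh.read().split()
    finally:
        os.unlink(ledger)
```

Calling `run_dispatch(kill=False)` leaves both markers in the ledger, while `run_dispatch(kill=True)` leaves only the start marker. That unpaired start marker is the leaked slot that a later reconcile pass has to find.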
Problem
Most watchdog cycles fire `ORPHANED_IN_FLIGHT_CLEAR` (and the matching `STALE_RUNNING_REQUEUE`). Root cause: `coordinator.py` records `execution_started` and pairs `execution_finished` in a `finally`. That is correct, but a `finally` can't run when the executor process dies between the two markers (a session-limit kill, an OOM, or the SIGTERM a code-pull restart sends to a watcher mid-dispatch). The `(backend, task_id)` slot then leaks against the per-backend concurrency cap until `board_unblock` Rule 10 clears it on its next watchdog cycle: ~15–30 min of a needlessly-held slot, and the leak is most likely exactly when the controller restarts watchers on new code.

The existing startup reconciliation covers Plane `Running` tasks but not the `usage_store` in-flight ledger.

Change
- `operations_center/in_flight_reconcile.py` (`find_orphaned_in_flight` / `clear_orphaned_in_flight`): one definition of "what is an orphan" for both callers.
- `board_unblock._clear_orphaned_in_flight_events` now delegates to it; its private `_state_name` / `_is_terminal` / `_TERMINAL_STATES` are imported from the new module (no duplicated bodies), and the now-unused `httpx` import is dropped.
- `board_worker` runs `reconcile_in_flight_on_startup` once before its poll loop, so a code-pull restart immediately reclaims the slots its own SIGTERM may have leaked. Serialised across the role processes (goal/test/improve/spec-author boot together) with an exclusive lock on the usage store, and best-effort: it never blocks a watcher from coming up.

Action output is byte-identical to Rule 10, so existing watchdog logging is unaffected.
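For readers who have not opened the diff, the shape of the change can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `in_flight_reconcile.py`: the event dictionaries, the `lookup_state` callable, the `TaskNotFound` exception, the terminal state names, the window semantics, the lock-file path, and both function signatures are hypothetical. It also assumes a POSIX system, since it uses `fcntl.flock`.

```python
import fcntl
from typing import Callable, Iterable, List, Optional, Tuple

# Assumed terminal state names; the real _TERMINAL_STATES is not shown here.
TERMINAL_STATES = frozenset({"done", "cancelled"})


class TaskNotFound(Exception):
    """Hypothetical stand-in for a 404 from the board API."""


def find_orphans(
    events: Iterable[dict],
    lookup_state: Callable[[str], str],
    now: float,
    window_s: float,
) -> List[Tuple[str, str]]:
    """Return (backend, task_id) slots that were started but never finished
    and whose task is gone or already terminal."""
    open_slots = {}
    for ev in events:  # assumed to be ordered oldest-first
        key = (ev["backend"], ev["task_id"])
        if ev["kind"] == "execution_started":
            open_slots[key] = ev["ts"]
        elif ev["kind"] == "execution_finished":
            open_slots.pop(key, None)  # a finish marker closes the slot

    orphans = []
    for key, started in open_slots.items():
        if now - started > window_s:
            continue  # assumed cutoff: starts older than the window are ignored
        try:
            state = lookup_state(key[1])
        except TaskNotFound:
            orphans.append(key)  # task no longer exists
            continue
        except Exception:
            continue  # any other lookup failure: cannot tell, keep the slot
        if state.lower() in TERMINAL_STATES:
            orphans.append(key)  # task finished, but the slot never closed
    return orphans


def reconcile_on_startup(
    lock_path: str, reconcile: Callable[[], int]
) -> Optional[int]:
    """Run reconcile() under an exclusive, non-blocking lock; never raise."""
    try:
        with open(lock_path, "a") as fh:
            try:
                fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError:
                return None  # another role process holds the lock: skip
            return reconcile()  # lock is released when fh is closed
    except Exception:
        return None  # best-effort: startup must not be blocked
```

Taking the lock with `LOCK_NB` rather than waiting is one way to get the "never blocks a watcher from coming up" property: when several role processes boot together, one does the scan and the others skip it.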
Tests
- `tests/unit/test_in_flight_reconcile.py`: 404/terminal/running, finished-closes-slot, window cutoff, non-404 skip, apply/dry-run, record-error, startup lock-skip, never-raise.
- `board_unblock` suite (now exercising the delegation); 366 pass across maintenance + board_worker.
- ruff clean.
- `custodian-multi` clean (0 findings).

🤖 Generated with Claude Code