fix: idle-based startup readiness and port-safe teardown recycle#9
Conversation
Addresses two harness races (#8) that cause intermittent failures under sharded/concurrent CI. Race 1 — startup deadline: runHarperCommand used a single wall-clock deadline, so a slow-but-healthy boot was indistinguishable from a hang. Replace it with an idle timeout that resets on every chunk of Harper output (limit = time-since-last -progress), plus a generous CI-aware absolute ceiling as a backstop. New startupMaxMs option and HARPER_INTEGRATION_TEST_STARTUP_MAX_MS env var. Race 2 — teardown loopback recycle: killHarper escalated to SIGKILL after only 200ms and resolved without confirming exit; teardownHarper then immediately recycled the loopback address while Harper's fixed ports could still be held by lingering workers, causing EADDRINUSE/ECONNREFUSED for the next suite. Give a longer, configurable SIGTERM grace (wait for real exit after SIGKILL), and verify the fixed ports are actually free before recycling the address. Adds the repo's first unit tests (node:test) and an npm test script; tests are excluded from the published build. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Gemini:
- Match the startup completion marker against accumulated stdout, not a single
chunk, so a marker split across stream chunks is still detected.
- startHarper now waits for the fixed ports to be free before spawning, closing
the killHarper()->startHarper() restart race (no-op for fresh pool addresses).
- killHarper unregisters its 'exit' listener in finish() so it can't leak if the
SIGKILL backstop fires.
- Close the log streams in runHarperCommand's spawn-error handler ('exit' never
fires for a failed spawn), avoiding leaked file descriptors.
- Document the startupTimeoutMs semantic change (absolute deadline -> idle window).
Codex:
- Raise devEngines runtime to >=22.18.0 so `npm test` (which runs .ts directly via
native type stripping) works across the declared contributor Node range.
- Update CONTRIBUTING.md / AGENTS.md for the new `npm test` command and Node req.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
Test files don't belong in src/ (the compiled tree) and no Harper repo lays them out that way — the peer oauth package uses test/, core uses unitTests/. Move both *.test.ts into test/, point `npm test` at test/**/*.test.ts, have tsconfig.json type-check both src/ and test/ while tsconfig.build.json narrows the emit to src/ (so tests are never published). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
Ethan-Arrowood
left a comment
There was a problem hiding this comment.
The whole point of the loopback address system was to avoid the port collision. If an address is in use, it should be locked. New processes should be given a new address instead. Is that not working?
The isPortFree() implementation is a race condition no matter what. There is no way to guarantee the port remains available between calling that and assigning it to another server.
|
After digging into this, here's where I landed on the TL;DR
The pool is doing its job — the collision is a sequential recycle raceThe loopback pool correctly guarantees two live holders never share an address, so concurrent suites never collide on the fixed ports. The collision isn't concurrent, it's temporal:
So the pool's lock protects the allocation table, not the OS port state. The address gets unlocked before the previous tenant is actually gone. The real root cause is in
|
|
@Ethan-Arrowood right and thanks for also digging in. This stemmed from earlier test flake investigations and you know how ai can often eagerly and confidently go down the wrong path. LMS what else can be done here... |
…port-poll to assertion Per Ethan's review: the loopback pool already serializes address ownership, so the collision was the *temporal* recycle race — the address was unlocked before the previous tenant's ports were actually free. The real fix is in killHarper, not a port poll. - Spawn Harper detached (POSIX) so it leads its own process group; killHarper now signals the whole tree (negative PID on POSIX, `taskkill /T` on Windows), SIGTERM → grace → SIGKILL, then waits for exit. A dead tree releases its listening sockets, so ports are free once killHarper returns. - Drop the proactive startHarper pre-spawn port wait (redundant given a complete kill). - Keep waitForPortsFree only as a warn-only post-kill assertion in teardownHarper: it should pass instantly; a warning means a child process escaped the tree kill. - Add a test proving the group kill reaps a child process, not just the parent. Note: Harper's listeners run in worker_threads (in-process), so they die with the parent regardless; the tree kill is belt-and-suspenders for component-spawned child processes. Windows taskkill path is untested locally. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Spawning Harper detached means it survives signals sent to the runner's process group (Ctrl+C, CI group-kill), so an interrupted runner could orphan it holding the fixed ports. Track live Harper processes and SIGKILL their groups on the runner's 'exit'/SIGINT/SIGTERM (re-raising the signal so the runner still dies). This addresses the orphan window introduced by `detached`. The deeper pool issue (removeDeadProcessesFromPool keys on the runner PID and only sweeps when full) is pre-existing and unaffected by this change — tracked separately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Updated and added a workflow for win/linux in #12 |
| const port = typeof address === 'object' && address ? address.port : 0; | ||
| await close(server); | ||
| return port; | ||
| } |
There was a problem hiding this comment.
😅 We've moved the race condition to a unit test now -- talk about potential test flake!
I don't necessarily know the best way to test this otherwise. Maybe its a necessary evil and if it ever does flake then automatic retries would work?
There was a problem hiding this comment.
Confined that race to just the two tests that genuinely need a free port — the bound-port tests no longer round-trip through bind/close at all.
🤖 Posted by Claude on Nathan's behalf
There was a problem hiding this comment.
I haven't really given much thought to testing this project yet. I'm fine with this and we can always iterate later.
There was a problem hiding this comment.
Wall-clock upper-bound assertions will flake on contended CI (harperLifecycle.test.ts, portUtils.test.ts)
ok(Date.now() - start < 1500, 'should exit well before the grace deadline');
ok(Date.now() - start < 500, 'should not have polled/waited');
ok(Date.now() - start < 200, 'should not wait the grace period ...');Upper-bound timing asserts are themselves a flakiness source — a GC pause or scheduler stall on a loaded runner can blow a 200ms ceiling even when the code is correct, which is ironic for a PR fixing flakes. The lower-bound check (>= 150 for a 200ms grace) is fine and meaningful. I'd loosen or drop the upper bounds, or gate them behind !IS_CI.
(Ty Claude)
There was a problem hiding this comment.
Agreed, the irony wasn't lost on me. Dropped the <1500 (the SIGTERM signalCode already proves it died before the grace escalation), loosened the already-dead <200→<1000, and gated the waitForPortsFree <500 behind !IS_CI. Kept the >=150 lower bound. (2db4dfe)
🤖 Posted by Claude on Nathan's behalf
| pollIntervalMs = 100 | ||
| ): Promise<boolean> { | ||
| const deadline = Date.now() + timeoutMs; | ||
| for (;;) { |
There was a problem hiding this comment.
| for (;;) { | |
| while(true) { |
Something wrong with while ? 😆
There was a problem hiding this comment.
Ha — no 😄 switched to while (true). It's what the rest of src/ uses anyway.
🤖 Posted by Claude on Nathan's behalf
| return port; | ||
| } | ||
|
|
||
| test('isPortFree returns true for an unbound port', async () => { |
There was a problem hiding this comment.
Claude says only this test truly needs the getEphemeralPort implementation and has an irreducible / unavoidable TOCTOU ("time-of-check to time-of-use") race condition.
There was a problem hiding this comment.
And then you can add some kind of retry here; either just retry on failure or something more complicated like an internal retry-on-EADDRINUSE
There was a problem hiding this comment.
Right — those two now go through a withFreePorts() wrapper that re-acquires fresh ports and retries if the bind-after-release window loses one. (2db4dfe)
🤖 Posted by Claude on Nathan's behalf
| test('isPortFree returns false for a bound port', async () => { | ||
| const port = await getEphemeralPort(HOST); | ||
| const server = await listen(HOST, port); | ||
| try { | ||
| strictEqual(await isPortFree(HOST, port), false); | ||
| } finally { | ||
| await close(server); | ||
| } | ||
| }); | ||
|
|
||
| test('waitForPortsFree resolves true immediately when all ports are free', async () => { | ||
| const a = await getEphemeralPort(HOST); | ||
| const b = await getEphemeralPort(HOST); | ||
| const start = Date.now(); | ||
| const freed = await waitForPortsFree(HOST, [a, b], 1000, 50); | ||
| ok(freed, 'expected ports to be reported free'); | ||
| ok(Date.now() - start < 500, 'should not have polled/waited'); | ||
| }); | ||
|
|
||
| test('waitForPortsFree waits until a held port is released, then resolves true', async () => { | ||
| const port = await getEphemeralPort(HOST); | ||
| const server = await listen(HOST, port); | ||
| // Release the port shortly after we start waiting. | ||
| setTimeout(() => void close(server), 200); | ||
| const freed = await waitForPortsFree(HOST, [port], 2000, 50); | ||
| ok(freed, 'expected the port to become free before the timeout'); | ||
| }); | ||
|
|
||
| test('waitForPortsFree resolves false when a port stays held past the timeout', async () => { | ||
| const port = await getEphemeralPort(HOST); | ||
| const server = await listen(HOST, port); | ||
| try { | ||
| const freed = await waitForPortsFree(HOST, [port], 300, 50); | ||
| strictEqual(freed, false); | ||
| } finally { | ||
| await close(server); | ||
| } | ||
| }); |
There was a problem hiding this comment.
The rest of these tests should all just bind to post 0 directly and read the assigned port off the live server, and test against it while it's still bound. This would improve the tests holistically.
There was a problem hiding this comment.
Done — added a listenEphemeral() helper; the three bound-port tests now bind to port 0 and assert against the live port while it's still listening, so no re-listen race. (2db4dfe)
🤖 Posted by Claude on Nathan's behalf
| settled = true; | ||
| clearTimers(); | ||
| reject(new HarperStartupError(message, stdout, stderr)); | ||
| proc.kill(); |
There was a problem hiding this comment.
I think this should be signalHarperTree(proc, 'SIGKILL') now.
There was a problem hiding this comment.
Good catch — that path was still proc.kill(), which only signals the direct child now that Harper's spawned detached. Switched it to signalHarperTree(proc, 'SIGKILL') so a failed startup reaps the whole group too. (2db4dfe)
🤖 Posted by Claude on Nathan's behalf
…e-port race, deflake timing asserts - runHarperCommand startup-failure path now signalHarperTree(SIGKILL) instead of proc.kill(), so the detached process tree is reaped rather than just the direct child - portUtils.test: bound-port tests bind to port 0 and assert the live port (no re-listen race); the two free-port tests retry via withFreePorts() to absorb the irreducible TOCTOU - deflake timing asserts: drop redundant <1500 SIGTERM bound (signalCode already proves pre-grace death), loosen already-dead <200 -> <1000, gate waitForPortsFree <500 behind !IS_CI; keep the >=150 lower bound - waitForPortsFree: for(;;) -> while(true) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks @Ethan-Arrowood — the kill-path root-cause writeup was the real unlock here; the process-group kill + wait-for-death is what makes the port-poll redundant on the normal paths. Pushed 2db4dfe addressing the inline comments (replies in each thread). The one thing I left as a judgment call: I treated the timing asserts per-merit rather than uniformly (drop the redundant one, loosen the fast-path one, gate the busy-wait one behind Separately tracking the mid-shard runner-kill pool-leak you flagged (dead-PID sweep only firing when the pool is full) — out of scope for this PR. 🤖 Posted by Claude on Nathan's behalf |
Fixes two flaky-CI races in the integration-test lifecycle helpers (#8).
Startup (
runHarperCommand) — replaced the single wall-clock deadline, which can't tell a slow boot from a hang, with an idle timeout that resets on each chunk of Harper output, plus a CI-aware absolute cap.Teardown (
killHarper/teardownHarper) — kill Harper's whole process tree (process group on POSIX,taskkill /Ton Windows) and wait for it to exit before recycling the loopback address. It previously SIGKILL'd after 200ms and recycled immediately, so a still-bound port hitEADDRINUSEon the next suite. A warn-only port-free check remains as a safety assertion.Behavior change:
startupTimeoutMs(and…_STARTUP_TIMEOUT_MS) now mean the idle / no-output window, not an absolute deadline — use the newstartupMaxMsfor a hard ceiling.Adds the repo's first
node:testsuite; cross-platform CI is in #12.Review focus: promise settlement in
runHarperCommand, and the runner-exit SIGINT/SIGTERM cleanup intrackHarperProcess.🤖 Generated with Claude Code