Skip to content

fix: idle-based startup readiness and port-safe teardown recycle#9

Merged
heskew merged 6 commits into
mainfrom
fix/harness-startup-readiness-and-teardown-recycle
Jun 9, 2026
Merged

fix: idle-based startup readiness and port-safe teardown recycle#9
heskew merged 6 commits into
mainfrom
fix/harness-startup-readiness-and-teardown-recycle

Conversation

@heskew

@heskew heskew commented Jun 4, 2026

Copy link
Copy Markdown
Member

Fixes two flaky-CI races in the integration-test lifecycle helpers (#8).

Startup (runHarperCommand) — replaced the single wall-clock deadline, which can't tell a slow boot from a hang, with an idle timeout that resets on each chunk of Harper output, plus a CI-aware absolute cap.

Teardown (killHarper / teardownHarper) — kill Harper's whole process tree (process group on POSIX, taskkill /T on Windows) and wait for it to exit before recycling the loopback address. It previously SIGKILL'd after 200ms and recycled immediately, so a still-bound port hit EADDRINUSE on the next suite. A warn-only port-free check remains as a safety assertion.

Behavior change: startupTimeoutMs (and …_STARTUP_TIMEOUT_MS) now mean the idle / no-output window, not an absolute deadline — use the new startupMaxMs for a hard ceiling.

Adds the repo's first node:test suite; cross-platform CI is in #12.

Review focus: promise settlement in runHarperCommand, and the runner-exit SIGINT/SIGTERM cleanup in trackHarperProcess.


🤖 Generated with Claude Code

heskew and others added 2 commits June 4, 2026 11:30
Addresses two harness races (#8) that cause
intermittent failures under sharded/concurrent CI.

Race 1 — startup deadline: runHarperCommand used a single wall-clock deadline,
so a slow-but-healthy boot was indistinguishable from a hang. Replace it with an
idle timeout that resets on every chunk of Harper output (limit = time-since-last
-progress), plus a generous CI-aware absolute ceiling as a backstop. New
startupMaxMs option and HARPER_INTEGRATION_TEST_STARTUP_MAX_MS env var.

Race 2 — teardown loopback recycle: killHarper escalated to SIGKILL after only
200ms and resolved without confirming exit; teardownHarper then immediately
recycled the loopback address while Harper's fixed ports could still be held by
lingering workers, causing EADDRINUSE/ECONNREFUSED for the next suite. Give a
longer, configurable SIGTERM grace (wait for real exit after SIGKILL), and verify
the fixed ports are actually free before recycling the address.

Adds the repo's first unit tests (node:test) and an npm test script; tests are
excluded from the published build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Gemini:
- Match the startup completion marker against accumulated stdout, not a single
  chunk, so a marker split across stream chunks is still detected.
- startHarper now waits for the fixed ports to be free before spawning, closing
  the killHarper()->startHarper() restart race (no-op for fresh pool addresses).
- killHarper unregisters its 'exit' listener in finish() so it can't leak if the
  SIGKILL backstop fires.
- Close the log streams in runHarperCommand's spawn-error handler ('exit' never
  fires for a failed spawn), avoiding leaked file descriptors.
- Document the startupTimeoutMs semantic change (absolute deadline -> idle window).

Codex:
- Raise devEngines runtime to >=22.18.0 so `npm test` (which runs .ts directly via
  native type stripping) works across the declared contributor Node range.
- Update CONTRIBUTING.md / AGENTS.md for the new `npm test` command and Node req.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@gemini-code-assist

Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Test files don't belong in src/ (the compiled tree) and no Harper repo lays
them out that way — the peer oauth package uses test/, core uses unitTests/.
Move both *.test.ts into test/, point `npm test` at test/**/*.test.ts, have
tsconfig.json type-check both src/ and test/ while tsconfig.build.json narrows
the emit to src/ (so tests are never published).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@gemini-code-assist

Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@heskew heskew requested review from Ethan-Arrowood and kriszyp June 4, 2026 22:33

@Ethan-Arrowood Ethan-Arrowood left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole point of the loopback address system was to avoid the port collision. If an address is in use, it should be locked. New processes should be given a new address instead. Is that not working?

The isPortFree() implementation is a race condition no matter what. There is no way to guarantee the port remains available between calling that and assigning it to another server.

@Ethan-Arrowood

Ethan-Arrowood commented Jun 5, 2026

Copy link
Copy Markdown
Member

After digging into this, here's where I landed on the isPortFree()/waitForPortsFree() approach and what I think the underlying problem actually is.

TL;DR

isPortFree() isn't wrong, but it's treating a symptom. The real fix is in killHarper, and if that's done properly the port-polling becomes redundant on the normal lifecycle.

The pool is doing its job — the collision is a sequential recycle race

The loopback pool correctly guarantees two live holders never share an address, so concurrent suites never collide on the fixed ports. The collision isn't concurrent, it's temporal:

  1. Suite A holds 127.0.0.2, runs Harper.
  2. Teardown → killHarper (old code SIGKILL's after 200ms and resolves without waiting for exit).
  3. releaseLoopbackAddress('127.0.0.2') nulls the slot — pool now thinks it's free.
  4. Harper's listeners on 127.0.0.2:9925… are still bound (parent maybe not even dead; workers/children lingering).
  5. Suite B legitimately acquires 127.0.0.2EADDRINUSE.

So the pool's lock protects the allocation table, not the OS port state. The address gets unlocked before the previous tenant is actually gone.

The real root cause is in killHarper

Two things, both in the kill path:

  1. It didn't wait for death. SIGKILL at 200ms + resolve immediately → teardown raced ahead and recycled the address while Harper was potentially still exiting. The grace + wait-for-exit change in this PR is the genuinely valuable fix.
  2. It only signals the direct child. Harper is spawned without detached: true (harperLifecycle.ts:276), so there's no process group. proc.kill() hits only the parent PID; any real child processes are orphaned and keep the fixed ports bound after the parent dies. SIGKILL also skips Harper's own graceful shutdown, so its workers never get told to close.

The longer SIGTERM grace partially mitigates (2) indirectly by letting Harper shut its own tree down — which is why the grace bump matters more than the port-polling.

On isPortFree() specifically

It's a TOCTOU check and can't guarantee anything — but the nuance is that it isn't acting as a lock here, it's a "wait for the previous tenant to drain" heuristic, and the pool already provides the mutual exclusion:

  • Teardown side: ordering is sound (killHarperwaitForPortsFree while the address is still locked → release). No competitor can grab the address during the wait, so the TOCTOU window is benign.
  • startHarper side: pure belt-and-suspenders.

The honest framing: it works in practice only because the pool serializes address ownership. It's not load-bearing for correctness — it's papering over an incomplete kill. If the kill were complete, waitForPortsFree would return on the first poll every time.

Recommendation

Make killHarper guarantee the whole Harper process tree is dead before teardown releases the address:

  • Spawn with detached: true (new process group), signal the group (process.kill(-proc.pid, 'SIGTERM') → escalate to SIGKILL), then wait for exit.
  • "Process tree dead ⇒ listening ports free" is an OS invariant (TIME_WAIT is per-connection, not per-listener; SO_REUSEADDR is on by default), so on the normal teardown + restart paths the port-polling becomes redundant and can be dropped — or kept only as a cheap post-kill assertion that warns if it ever trips.

Two residual caveats, both fine to track separately:

  • killHarper only has the parent's exit event, not each worker child's — with a group SIGKILL the children's sockets are released essentially synchronously, so the gap is negligible.
  • A runner killed mid-shard (teardown never runs) is not fixed by either approach — that's the separate pool bug where removeDeadProcessesFromPool keys on the runner PID and the sweep only fires when the pool is full.

The startup idle-timeout watchdog (Race 1) is unrelated to the above and looks well-built.

sent with Claude Opus 4.8

@heskew

heskew commented Jun 5, 2026

Copy link
Copy Markdown
Member Author

@Ethan-Arrowood right and thanks for also digging in. This stemmed from earlier test flake investigations and you know how ai can often eagerly and confidently go down the wrong path. LMS what else can be done here...

heskew and others added 2 commits June 5, 2026 10:55
…port-poll to assertion

Per Ethan's review: the loopback pool already serializes address ownership, so the
collision was the *temporal* recycle race — the address was unlocked before the
previous tenant's ports were actually free. The real fix is in killHarper, not a
port poll.

- Spawn Harper detached (POSIX) so it leads its own process group; killHarper now
  signals the whole tree (negative PID on POSIX, `taskkill /T` on Windows), SIGTERM
  → grace → SIGKILL, then waits for exit. A dead tree releases its listening sockets,
  so ports are free once killHarper returns.
- Drop the proactive startHarper pre-spawn port wait (redundant given a complete kill).
- Keep waitForPortsFree only as a warn-only post-kill assertion in teardownHarper:
  it should pass instantly; a warning means a child process escaped the tree kill.
- Add a test proving the group kill reaps a child process, not just the parent.

Note: Harper's listeners run in worker_threads (in-process), so they die with the
parent regardless; the tree kill is belt-and-suspenders for component-spawned child
processes. Windows taskkill path is untested locally.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Spawning Harper detached means it survives signals sent to the runner's process
group (Ctrl+C, CI group-kill), so an interrupted runner could orphan it holding
the fixed ports. Track live Harper processes and SIGKILL their groups on the
runner's 'exit'/SIGINT/SIGTERM (re-raising the signal so the runner still dies).

This addresses the orphan window introduced by `detached`. The deeper pool issue
(removeDeadProcessesFromPool keys on the runner PID and only sweeps when full) is
pre-existing and unaffected by this change — tracked separately.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@heskew

heskew commented Jun 5, 2026

Copy link
Copy Markdown
Member Author

Updated and added a workflow for win/linux in #12

Comment thread test/portUtils.test.ts
const port = typeof address === 'object' && address ? address.port : 0;
await close(server);
return port;
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😅 We've moved the race condition to a unit test now -- talk about potential test flake!

I don't necessarily know the best way to test this otherwise. Maybe its a necessary evil and if it ever does flake then automatic retries would work?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confined that race to just the two tests that genuinely need a free port — the bound-port tests no longer round-trip through bind/close at all.


🤖 Posted by Claude on Nathan's behalf

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't really given much thought to testing this project yet. I'm fine with this and we can always iterate later.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wall-clock upper-bound assertions will flake on contended CI (harperLifecycle.test.ts, portUtils.test.ts)

ok(Date.now() - start < 1500, 'should exit well before the grace deadline');
ok(Date.now() - start < 500, 'should not have polled/waited');
ok(Date.now() - start < 200, 'should not wait the grace period ...');

Upper-bound timing asserts are themselves a flakiness source — a GC pause or scheduler stall on a loaded runner can blow a 200ms ceiling even when the code is correct, which is ironic for a PR fixing flakes. The lower-bound check (>= 150 for a 200ms grace) is fine and meaningful. I'd loosen or drop the upper bounds, or gate them behind !IS_CI.

(Ty Claude)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, the irony wasn't lost on me. Dropped the <1500 (the SIGTERM signalCode already proves it died before the grace escalation), loosened the already-dead <200<1000, and gated the waitForPortsFree <500 behind !IS_CI. Kept the >=150 lower bound. (2db4dfe)


🤖 Posted by Claude on Nathan's behalf

Comment thread src/portUtils.ts Outdated
pollIntervalMs = 100
): Promise<boolean> {
const deadline = Date.now() + timeoutMs;
for (;;) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (;;) {
while(true) {

Something wrong with while ? 😆

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's too readable? 🫠

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha — no 😄 switched to while (true). It's what the rest of src/ uses anyway.


🤖 Posted by Claude on Nathan's behalf

Comment thread test/portUtils.test.ts
return port;
}

test('isPortFree returns true for an unbound port', async () => {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude says only this test truly needs the getEphemeralPort implementation and has an irreducible / unavoidable TOCTOU ("time-of-check to time-of-use") race condition.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then you can add some kind of retry here; either just retry on failure or something more complicated like an internal retry-on-EADDRINUSE

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right — those two now go through a withFreePorts() wrapper that re-acquires fresh ports and retries if the bind-after-release window loses one. (2db4dfe)


🤖 Posted by Claude on Nathan's behalf

Comment thread test/portUtils.test.ts
Comment on lines +35 to +72
test('isPortFree returns false for a bound port', async () => {
const port = await getEphemeralPort(HOST);
const server = await listen(HOST, port);
try {
strictEqual(await isPortFree(HOST, port), false);
} finally {
await close(server);
}
});

test('waitForPortsFree resolves true immediately when all ports are free', async () => {
const a = await getEphemeralPort(HOST);
const b = await getEphemeralPort(HOST);
const start = Date.now();
const freed = await waitForPortsFree(HOST, [a, b], 1000, 50);
ok(freed, 'expected ports to be reported free');
ok(Date.now() - start < 500, 'should not have polled/waited');
});

test('waitForPortsFree waits until a held port is released, then resolves true', async () => {
const port = await getEphemeralPort(HOST);
const server = await listen(HOST, port);
// Release the port shortly after we start waiting.
setTimeout(() => void close(server), 200);
const freed = await waitForPortsFree(HOST, [port], 2000, 50);
ok(freed, 'expected the port to become free before the timeout');
});

test('waitForPortsFree resolves false when a port stays held past the timeout', async () => {
const port = await getEphemeralPort(HOST);
const server = await listen(HOST, port);
try {
const freed = await waitForPortsFree(HOST, [port], 300, 50);
strictEqual(freed, false);
} finally {
await close(server);
}
});

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rest of these tests should all just bind to post 0 directly and read the assigned port off the live server, and test against it while it's still bound. This would improve the tests holistically.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — added a listenEphemeral() helper; the three bound-port tests now bind to port 0 and assert against the live port while it's still listening, so no re-listen race. (2db4dfe)


🤖 Posted by Claude on Nathan's behalf

Comment thread src/harperLifecycle.ts Outdated
settled = true;
clearTimers();
reject(new HarperStartupError(message, stdout, stderr));
proc.kill();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be signalHarperTree(proc, 'SIGKILL') now.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — that path was still proc.kill(), which only signals the direct child now that Harper's spawned detached. Switched it to signalHarperTree(proc, 'SIGKILL') so a failed startup reaps the whole group too. (2db4dfe)


🤖 Posted by Claude on Nathan's behalf

…e-port race, deflake timing asserts

- runHarperCommand startup-failure path now signalHarperTree(SIGKILL) instead of proc.kill(), so the detached process tree is reaped rather than just the direct child
- portUtils.test: bound-port tests bind to port 0 and assert the live port (no re-listen race); the two free-port tests retry via withFreePorts() to absorb the irreducible TOCTOU
- deflake timing asserts: drop redundant <1500 SIGTERM bound (signalCode already proves pre-grace death), loosen already-dead <200 -> <1000, gate waitForPortsFree <500 behind !IS_CI; keep the >=150 lower bound
- waitForPortsFree: for(;;) -> while(true)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@heskew

heskew commented Jun 8, 2026

Copy link
Copy Markdown
Member Author

Thanks @Ethan-Arrowood — the kill-path root-cause writeup was the real unlock here; the process-group kill + wait-for-death is what makes the port-poll redundant on the normal paths.

Pushed 2db4dfe addressing the inline comments (replies in each thread). The one thing I left as a judgment call: I treated the timing asserts per-merit rather than uniformly (drop the redundant one, loosen the fast-path one, gate the busy-wait one behind !IS_CI) — happy to make it uniform if you'd prefer.

Separately tracking the mid-shard runner-kill pool-leak you flagged (dead-PID sweep only firing when the pool is full) — out of scope for this PR.


🤖 Posted by Claude on Nathan's behalf

@Ethan-Arrowood Ethan-Arrowood left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great work

@heskew heskew merged commit 6a38699 into main Jun 9, 2026
@heskew heskew mentioned this pull request Jun 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants