fix: idle-based startup readiness and port-safe teardown recycle by heskew · Pull Request #9 · HarperFast/integration-testing

heskew · 2026-06-04T21:50:25Z

Fixes two flaky-CI races in the integration-test lifecycle helpers (#8).

Startup (runHarperCommand) — replaced the single wall-clock deadline, which can't tell a slow boot from a hang, with an idle timeout that resets on each chunk of Harper output, plus a CI-aware absolute cap.

Teardown (killHarper / teardownHarper) — kill Harper's whole process tree (process group on POSIX, taskkill /T on Windows) and wait for it to exit before recycling the loopback address. It previously SIGKILL'd after 200ms and recycled immediately, so a still-bound port hit EADDRINUSE on the next suite. A warn-only port-free check remains as a safety assertion.

Behavior change: startupTimeoutMs (and …_STARTUP_TIMEOUT_MS) now mean the idle / no-output window, not an absolute deadline — use the new startupMaxMs for a hard ceiling.

Adds the repo's first node:test suite; cross-platform CI is in #12.

Review focus: promise settlement in runHarperCommand, and the runner-exit SIGINT/SIGTERM cleanup in trackHarperProcess.

🤖 Generated with Claude Code

Addresses two harness races (#8) that cause intermittent failures under sharded/concurrent CI. Race 1 — startup deadline: runHarperCommand used a single wall-clock deadline, so a slow-but-healthy boot was indistinguishable from a hang. Replace it with an idle timeout that resets on every chunk of Harper output (limit = time-since-last -progress), plus a generous CI-aware absolute ceiling as a backstop. New startupMaxMs option and HARPER_INTEGRATION_TEST_STARTUP_MAX_MS env var. Race 2 — teardown loopback recycle: killHarper escalated to SIGKILL after only 200ms and resolved without confirming exit; teardownHarper then immediately recycled the loopback address while Harper's fixed ports could still be held by lingering workers, causing EADDRINUSE/ECONNREFUSED for the next suite. Give a longer, configurable SIGTERM grace (wait for real exit after SIGKILL), and verify the fixed ports are actually free before recycling the address. Adds the repo's first unit tests (node:test) and an npm test script; tests are excluded from the published build. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Gemini: - Match the startup completion marker against accumulated stdout, not a single chunk, so a marker split across stream chunks is still detected. - startHarper now waits for the fixed ports to be free before spawning, closing the killHarper()->startHarper() restart race (no-op for fresh pool addresses). - killHarper unregisters its 'exit' listener in finish() so it can't leak if the SIGKILL backstop fires. - Close the log streams in runHarperCommand's spawn-error handler ('exit' never fires for a failed spawn), avoiding leaked file descriptors. - Document the startupTimeoutMs semantic change (absolute deadline -> idle window). Codex: - Raise devEngines runtime to >=22.18.0 so `npm test` (which runs .ts directly via native type stripping) works across the declared contributor Node range. - Update CONTRIBUTING.md / AGENTS.md for the new `npm test` command and Node req. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gemini-code-assist · 2026-06-04T21:52:43Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Test files don't belong in src/ (the compiled tree) and no Harper repo lays them out that way — the peer oauth package uses test/, core uses unitTests/. Move both *.test.ts into test/, point `npm test` at test/**/*.test.ts, have tsconfig.json type-check both src/ and test/ while tsconfig.build.json narrows the emit to src/ (so tests are never published). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gemini-code-assist · 2026-06-04T22:33:54Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Ethan-Arrowood

The whole point of the loopback address system was to avoid the port collision. If an address is in use, it should be locked. New processes should be given a new address instead. Is that not working?

The isPortFree() implementation is a race condition no matter what. There is no way to guarantee the port remains available between calling that and assigning it to another server.

Ethan-Arrowood · 2026-06-05T14:52:11Z

After digging into this, here's where I landed on the isPortFree()/waitForPortsFree() approach and what I think the underlying problem actually is.

TL;DR

isPortFree() isn't wrong, but it's treating a symptom. The real fix is in killHarper, and if that's done properly the port-polling becomes redundant on the normal lifecycle.

The pool is doing its job — the collision is a sequential recycle race

The loopback pool correctly guarantees two live holders never share an address, so concurrent suites never collide on the fixed ports. The collision isn't concurrent, it's temporal:

Suite A holds 127.0.0.2, runs Harper.
Teardown → killHarper (old code SIGKILL's after 200ms and resolves without waiting for exit).
releaseLoopbackAddress('127.0.0.2') nulls the slot — pool now thinks it's free.
Harper's listeners on 127.0.0.2:9925… are still bound (parent maybe not even dead; workers/children lingering).
Suite B legitimately acquires 127.0.0.2 → EADDRINUSE.

So the pool's lock protects the allocation table, not the OS port state. The address gets unlocked before the previous tenant is actually gone.

The real root cause is in `killHarper`

Two things, both in the kill path:

It didn't wait for death. SIGKILL at 200ms + resolve immediately → teardown raced ahead and recycled the address while Harper was potentially still exiting. The grace + wait-for-exit change in this PR is the genuinely valuable fix.
It only signals the direct child. Harper is spawned without detached: true (harperLifecycle.ts:276), so there's no process group. proc.kill() hits only the parent PID; any real child processes are orphaned and keep the fixed ports bound after the parent dies. SIGKILL also skips Harper's own graceful shutdown, so its workers never get told to close.

The longer SIGTERM grace partially mitigates (2) indirectly by letting Harper shut its own tree down — which is why the grace bump matters more than the port-polling.

On `isPortFree()` specifically

It's a TOCTOU check and can't guarantee anything — but the nuance is that it isn't acting as a lock here, it's a "wait for the previous tenant to drain" heuristic, and the pool already provides the mutual exclusion:

Teardown side: ordering is sound (killHarper → waitForPortsFree while the address is still locked → release). No competitor can grab the address during the wait, so the TOCTOU window is benign.
startHarper side: pure belt-and-suspenders.

The honest framing: it works in practice only because the pool serializes address ownership. It's not load-bearing for correctness — it's papering over an incomplete kill. If the kill were complete, waitForPortsFree would return on the first poll every time.

Recommendation

Make killHarper guarantee the whole Harper process tree is dead before teardown releases the address:

Spawn with detached: true (new process group), signal the group (process.kill(-proc.pid, 'SIGTERM') → escalate to SIGKILL), then wait for exit.
"Process tree dead ⇒ listening ports free" is an OS invariant (TIME_WAIT is per-connection, not per-listener; SO_REUSEADDR is on by default), so on the normal teardown + restart paths the port-polling becomes redundant and can be dropped — or kept only as a cheap post-kill assertion that warns if it ever trips.

Two residual caveats, both fine to track separately:

killHarper only has the parent's exit event, not each worker child's — with a group SIGKILL the children's sockets are released essentially synchronously, so the gap is negligible.
A runner killed mid-shard (teardown never runs) is not fixed by either approach — that's the separate pool bug where removeDeadProcessesFromPool keys on the runner PID and the sweep only fires when the pool is full.

The startup idle-timeout watchdog (Race 1) is unrelated to the above and looks well-built.

_{sent with Claude Opus 4.8}

heskew · 2026-06-05T15:56:44Z

@Ethan-Arrowood right and thanks for also digging in. This stemmed from earlier test flake investigations and you know how ai can often eagerly and confidently go down the wrong path. LMS what else can be done here...

…port-poll to assertion Per Ethan's review: the loopback pool already serializes address ownership, so the collision was the *temporal* recycle race — the address was unlocked before the previous tenant's ports were actually free. The real fix is in killHarper, not a port poll. - Spawn Harper detached (POSIX) so it leads its own process group; killHarper now signals the whole tree (negative PID on POSIX, `taskkill /T` on Windows), SIGTERM → grace → SIGKILL, then waits for exit. A dead tree releases its listening sockets, so ports are free once killHarper returns. - Drop the proactive startHarper pre-spawn port wait (redundant given a complete kill). - Keep waitForPortsFree only as a warn-only post-kill assertion in teardownHarper: it should pass instantly; a warning means a child process escaped the tree kill. - Add a test proving the group kill reaps a child process, not just the parent. Note: Harper's listeners run in worker_threads (in-process), so they die with the parent regardless; the tree kill is belt-and-suspenders for component-spawned child processes. Windows taskkill path is untested locally. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Spawning Harper detached means it survives signals sent to the runner's process group (Ctrl+C, CI group-kill), so an interrupted runner could orphan it holding the fixed ports. Track live Harper processes and SIGKILL their groups on the runner's 'exit'/SIGINT/SIGTERM (re-raising the signal so the runner still dies). This addresses the orphan window introduced by `detached`. The deeper pool issue (removeDeadProcessesFromPool keys on the runner PID and only sweeps when full) is pre-existing and unaffected by this change — tracked separately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

heskew · 2026-06-05T20:53:24Z

Updated and added a workflow for win/linux in #12

Ethan-Arrowood · 2026-06-08T21:57:58Z

+	const port = typeof address === 'object' && address ? address.port : 0;
+	await close(server);
+	return port;
+}


😅 We've moved the race condition to a unit test now -- talk about potential test flake!

I don't necessarily know the best way to test this otherwise. Maybe its a necessary evil and if it ever does flake then automatic retries would work?

Confined that race to just the two tests that genuinely need a free port — the bound-port tests no longer round-trip through bind/close at all.

🤖 Posted by Claude on Nathan's behalf

Ethan-Arrowood · 2026-06-08T22:08:24Z

I haven't really given much thought to testing this project yet. I'm fine with this and we can always iterate later.

Wall-clock upper-bound assertions will flake on contended CI (harperLifecycle.test.ts, portUtils.test.ts)

ok(Date.now() - start < 1500, 'should exit well before the grace deadline'); ok(Date.now() - start < 500, 'should not have polled/waited'); ok(Date.now() - start < 200, 'should not wait the grace period ...');

Upper-bound timing asserts are themselves a flakiness source — a GC pause or scheduler stall on a loaded runner can blow a 200ms ceiling even when the code is correct, which is ironic for a PR fixing flakes. The lower-bound check (>= 150 for a 200ms grace) is fine and meaningful. I'd loosen or drop the upper bounds, or gate them behind !IS_CI.

(Ty Claude)

Agreed, the irony wasn't lost on me. Dropped the <1500 (the SIGTERM signalCode already proves it died before the grace escalation), loosened the already-dead <200→<1000, and gated the waitForPortsFree <500 behind !IS_CI. Kept the >=150 lower bound. (2db4dfe)

🤖 Posted by Claude on Nathan's behalf

Ethan-Arrowood · 2026-06-08T22:09:05Z

+	pollIntervalMs = 100
+): Promise<boolean> {
+	const deadline = Date.now() + timeoutMs;
+	for (;;) {


Suggested change

for (;;) {

while(true) {

Something wrong with while ? 😆

it's too readable? 🫠

Ha — no 😄 switched to while (true). It's what the rest of src/ uses anyway.

🤖 Posted by Claude on Nathan's behalf

Ethan-Arrowood · 2026-06-08T22:11:31Z

+	return port;
+}
+
+test('isPortFree returns true for an unbound port', async () => {


Claude says only this test truly needs the getEphemeralPort implementation and has an irreducible / unavoidable TOCTOU ("time-of-check to time-of-use") race condition.

And then you can add some kind of retry here; either just retry on failure or something more complicated like an internal retry-on-EADDRINUSE

Right — those two now go through a withFreePorts() wrapper that re-acquires fresh ports and retries if the bind-after-release window loses one. (2db4dfe)

🤖 Posted by Claude on Nathan's behalf

Ethan-Arrowood · 2026-06-08T22:12:25Z

+test('isPortFree returns false for a bound port', async () => {
+	const port = await getEphemeralPort(HOST);
+	const server = await listen(HOST, port);
+	try {
+		strictEqual(await isPortFree(HOST, port), false);
+	} finally {
+		await close(server);
+	}
+});
+
+test('waitForPortsFree resolves true immediately when all ports are free', async () => {
+	const a = await getEphemeralPort(HOST);
+	const b = await getEphemeralPort(HOST);
+	const start = Date.now();
+	const freed = await waitForPortsFree(HOST, [a, b], 1000, 50);
+	ok(freed, 'expected ports to be reported free');
+	ok(Date.now() - start < 500, 'should not have polled/waited');
+});
+
+test('waitForPortsFree waits until a held port is released, then resolves true', async () => {
+	const port = await getEphemeralPort(HOST);
+	const server = await listen(HOST, port);
+	// Release the port shortly after we start waiting.
+	setTimeout(() => void close(server), 200);
+	const freed = await waitForPortsFree(HOST, [port], 2000, 50);
+	ok(freed, 'expected the port to become free before the timeout');
+});
+
+test('waitForPortsFree resolves false when a port stays held past the timeout', async () => {
+	const port = await getEphemeralPort(HOST);
+	const server = await listen(HOST, port);
+	try {
+		const freed = await waitForPortsFree(HOST, [port], 300, 50);
+		strictEqual(freed, false);
+	} finally {
+		await close(server);
+	}
+});


The rest of these tests should all just bind to post 0 directly and read the assigned port off the live server, and test against it while it's still bound. This would improve the tests holistically.

Done — added a listenEphemeral() helper; the three bound-port tests now bind to port 0 and assert against the live port while it's still listening, so no re-listen race. (2db4dfe)

🤖 Posted by Claude on Nathan's behalf

Ethan-Arrowood · 2026-06-08T22:13:56Z

+			settled = true;
+			clearTimers();
+			reject(new HarperStartupError(message, stdout, stderr));
 			proc.kill();


I think this should be signalHarperTree(proc, 'SIGKILL') now.

Good catch — that path was still proc.kill(), which only signals the direct child now that Harper's spawned detached. Switched it to signalHarperTree(proc, 'SIGKILL') so a failed startup reaps the whole group too. (2db4dfe)

🤖 Posted by Claude on Nathan's behalf

…e-port race, deflake timing asserts - runHarperCommand startup-failure path now signalHarperTree(SIGKILL) instead of proc.kill(), so the detached process tree is reaped rather than just the direct child - portUtils.test: bound-port tests bind to port 0 and assert the live port (no re-listen race); the two free-port tests retry via withFreePorts() to absorb the irreducible TOCTOU - deflake timing asserts: drop redundant <1500 SIGTERM bound (signalCode already proves pre-grace death), loosen already-dead <200 -> <1000, gate waitForPortsFree <500 behind !IS_CI; keep the >=150 lower bound - waitForPortsFree: for(;;) -> while(true) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

heskew · 2026-06-08T23:21:47Z

Thanks @Ethan-Arrowood — the kill-path root-cause writeup was the real unlock here; the process-group kill + wait-for-death is what makes the port-poll redundant on the normal paths.

Pushed 2db4dfe addressing the inline comments (replies in each thread). The one thing I left as a judgment call: I treated the timing asserts per-merit rather than uniformly (drop the redundant one, loosen the fast-path one, gate the busy-wait one behind !IS_CI) — happy to make it uniform if you'd prefer.

Separately tracking the mid-shard runner-kill pool-leak you flagged (dead-PID sweep only firing when the pool is full) — out of scope for this PR.

🤖 Posted by Claude on Nathan's behalf

Ethan-Arrowood

great work

heskew and others added 2 commits June 4, 2026 11:30

This was referenced Jun 4, 2026

Tests: cover remaining runHarperCommand startup-watchdog branches #10

Open

Tests: integration coverage for startHarper/teardownHarper against a real Harper #11

Open

heskew marked this pull request as ready for review June 4, 2026 22:33

heskew requested review from Ethan-Arrowood and kriszyp June 4, 2026 22:33

Ethan-Arrowood reviewed Jun 5, 2026

View reviewed changes

heskew and others added 2 commits June 5, 2026 10:55

This was referenced Jun 5, 2026

test(ci): drop startup-timeout workarounds, now handled by integration-testing 0.5.x (#1139) HarperFast/harper#1149

Merged

ci: add cross-platform test workflow (ubuntu + windows × Node 22 & 24) #12

Merged

Ethan-Arrowood reviewed Jun 8, 2026

View reviewed changes

heskew mentioned this pull request Jun 8, 2026

Loopback pool leaks addresses from runners killed mid-shard (dead-PID sweep only runs when the pool is full) #13

Open

Ethan-Arrowood approved these changes Jun 9, 2026

View reviewed changes

heskew merged commit 6a38699 into main Jun 9, 2026

heskew mentioned this pull request Jun 9, 2026

Release v0.5.0 #14

Merged

Conversation

heskew commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot commented Jun 4, 2026

Uh oh!

gemini-code-assist Bot commented Jun 4, 2026

Uh oh!

Ethan-Arrowood left a comment

Choose a reason for hiding this comment

Uh oh!

Ethan-Arrowood commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TL;DR

The pool is doing its job — the collision is a sequential recycle race

The real root cause is in killHarper

On isPortFree() specifically

Recommendation

Uh oh!

heskew commented Jun 5, 2026

Uh oh!

heskew commented Jun 5, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

heskew commented Jun 8, 2026

Uh oh!

Ethan-Arrowood left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

heskew commented Jun 4, 2026 •

edited

Loading

Ethan-Arrowood commented Jun 5, 2026 •

edited

Loading

The real root cause is in `killHarper`

On `isPortFree()` specifically