fix(loopback): prevent SO_REUSEPORT co-binds via an operations-port conflict canary #17
…onflict canary

Under concurrency, the loopback pool can hand the same 127.0.0.x to two suites that are alive at the same time (e.g. a slot freed while its Harper node lingered). On Linux, Harper binds its HTTP/replication ports with SO_REUSEPORT, so the freshly-assigned node silently co-binds an address another node still owns. The kernel then splits connections between the two nodes and corrupts both suites (manifests as replication sockets that never connect, "no encryption keys", missing databases/tables, etc.). macOS binds those ports without SO_REUSEPORT, which is why this only reproduces on Linux.

Two defenses:

- Conflict canary in getNextAvailableLoopbackAddress: after claiming a slot, probe the operations-API port (9925) with a plain, non-SO_REUSEPORT bind. The operations API is the one port Harper binds exclusively (main-thread only), so EADDRINUSE here means a node still holds the address. On conflict we park the slot under our PID (so no other process retries the poisoned address) and claim the next one. Override the probe port via HARPER_INTEGRATION_TEST_CONFLICT_PROBE_PORT.
- teardownHarper no longer recycles an address whose ports are still held by a child that escaped the tree-kill; it parks the slot (reclaimed when the per-file process exits).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
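The conflict canary described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the helper name `isLoopbackAddressInUse` appears later in the conversation, but its signature and body here are assumptions, as is the constant name. Only the 9925 default and the EADDRINUSE check come from the text.

```typescript
import { createServer } from "node:net";

// Operations-API port named in the commit message; the constant name is hypothetical.
const DEFAULT_PROBE_PORT = 9925;

// Hypothetical signature: resolves true when another socket already holds
// host:port, false when a plain bind succeeds.
function isLoopbackAddressInUse(
  host: string,
  port: number = DEFAULT_PROBE_PORT,
): Promise<boolean> {
  return new Promise((resolve, reject) => {
    const server = createServer();
    const onError = (err: Error) => {
      const code = (err as Error & { code?: string }).code;
      if (code === "EADDRINUSE") resolve(true); // something still owns the address
      else reject(err); // any other failure is surfaced, not guessed at
    };
    server.once("error", onError);
    // A plain listen() does not request SO_REUSEPORT, so it cannot co-bind.
    server.listen({ host, port }, () => {
      // Bind succeeded: drop the error handler so a late socket error
      // cannot change the verdict, then free the port again.
      server.removeListener("error", onError);
      server.close(() => resolve(false));
    });
  });
}
```

A caller would probe the freshly claimed address before handing it to a suite, and move on to another slot when the probe resolves `true`.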
Code Review
This pull request introduces a conflict canary mechanism to prevent port conflicts and silent co-binding under SO_REUSEPORT by probing loopback addresses before allocation and avoiding recycling addresses if ports remain in use after teardown. The review feedback highlights several critical issues: a potential infinite hang/deadlock under pool exhaustion when poisoned slots are permanently parked under the current PID, a potential TypeError during teardown if the hostname is undefined, a crash risk if the conflict probe port environment variable is parsed as NaN, and a race condition where the server's 'error' listener is not cleaned up after a successful bind.
…cases

From PR review (gemini-code-assist):

- Pool-exhaustion deadlock: parking poisoned slots under our own live PID could fill the pool with our PID, leaving nothing reclaimable and spinning forever. Instead release poisoned slots back to the pool and skip them via a per-attempt local Set (cleared before each wait so they can be re-probed once the lingering node exits). Removed the now-unused findAvailableIndex helper.
- teardownHarper: nest the release under `if (ctx.harper.hostname)` so an early startup failure (no hostname) can't call releaseLoopbackAddress(undefined) and mask the real error with a TypeError.
- CONFLICT_PROBE_PORT: fall back to 9925 on a NaN/out-of-range env override instead of crashing on server.listen(NaN).
- isLoopbackAddressInUse: remove the 'error' listener once listen() succeeds so a post-bind socket error during close() can't flip the verdict.

Validated locally: skip-in-use returns the next address; full-pool exhaustion recovers without deadlock once an address frees; an invalid probe-port env falls back cleanly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
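Two of the fixes above, the probe-port fallback and the per-attempt skip set, can be illustrated with a small in-memory sketch. Every name here (`parseProbePort`, `claimAddress`, the `claimed` set standing in for the shared pool state) is hypothetical. Only the 9925 default and the release-and-skip behavior are taken from the commit message.

```typescript
const FALLBACK_PROBE_PORT = 9925; // default named in the commit message

// Hypothetical parser: any missing, non-numeric, or out-of-range override
// falls back to the default instead of reaching server.listen(NaN).
function parseProbePort(raw: string | undefined): number {
  const port = Number(raw);
  const valid = Number.isInteger(port) && port >= 1 && port <= 65535;
  return valid ? port : FALLBACK_PROBE_PORT;
}

type Probe = (address: string) => Promise<boolean>;

// Hypothetical allocation pass. `claimed` stands in for the shared pool state.
// Returns undefined when nothing is usable; the caller would then wait and
// retry, which starts over with a fresh skip set.
async function claimAddress(
  pool: string[],
  claimed: Set<string>,
  isInUse: Probe,
): Promise<string | undefined> {
  const skipped = new Set<string>(); // local to this attempt only
  for (;;) {
    const candidate = pool.find((a) => !claimed.has(a) && !skipped.has(a));
    if (candidate === undefined) return undefined;
    claimed.add(candidate); // claim first, then probe
    if (!(await isInUse(candidate))) return candidate;
    claimed.delete(candidate); // release rather than park under our PID
    skipped.add(candidate); // avoid re-picking it during this attempt
  }
}
```

Because a poisoned slot is released rather than parked, the pool never fills up with entries owned by the live process, and the skip set only prevents the same attempt from picking the slot again.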
Thanks for the thorough review. All four were valid and are addressed in ec2a7ce.

Added local validation for each: skip-in-use returns the next address; a fully-exhausted pool recovers without deadlock the moment an address frees (~3s); an invalid probe-port env falls back cleanly. End-to-end harper-pro Integration Tests came back green across all 4 shards × Node 22/24/26.

-- Claude Opus 4.8
0.5.2 fixes the loopback-address pool's SO_REUSEPORT co-bind: under concurrency the pool could hand the same 127.0.0.x to two simultaneously-alive suites, and on Linux Harper binds HTTP/replication with SO_REUSEPORT, so both nodes silently co-bound the address. The kernel then split connections between them, corrupting both suites. That is the root cause of the persistently-red Integration Tests (selectiveTableSubscription sockets that never connect, "no encryption keys", missing databases/tables).

0.5.2 adds an operations-port conflict canary at allocation time and stops teardown from recycling an address whose ports are still held. See HarperFast/integration-testing#17. Pairs with HarperFast/harper#1301 (ops API never uses SO_REUSEPORT) to keep the canary's invariant true.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
Under concurrent execution the loopback pool can hand the same `127.0.0.x` to two suites that are alive at the same time (e.g. a slot freed while its Harper node lingered, or a child that outlived teardown). On Linux, Harper binds its HTTP/replication ports with SO_REUSEPORT, so the freshly-assigned node silently co-binds an address another node still owns. The kernel then load-balances incoming connections across the two nodes, corrupting both suites.

Observed in harper-pro Integration Tests as: `selectiveTableSubscription` replication sockets that never connect, `unable to generate JWT as there are no encryption keys`, `database 'data' does not exist`, etc., all the same root cause. macOS binds those ports without SO_REUSEPORT, which is why it only reproduces on Linux and "passes in isolation."

Confirmed in a failing run: `Issue_135 ….3` (alive 12:37:40→12:39:36) and `Selective_table_subscription.3` (12:38:31→12:38:38) both held `127.0.0.3` simultaneously.

Fix

- Conflict canary in `getNextAvailableLoopbackAddress`: after claiming a slot, probe the operations-API port (9925) with a plain, non-SO_REUSEPORT bind. The operations API is the one port Harper binds exclusively (main-thread only), so `EADDRINUSE` here reliably means a node still holds the address. On conflict, park the slot under our PID (so nobody else retries the poisoned address) and claim the next one. Probe port overridable via `HARPER_INTEGRATION_TEST_CONFLICT_PROBE_PORT`.
- `teardownHarper` no longer recycles an address whose ports are still held by an escaped child; it parks the slot (reclaimed when the per-file process exits).

Validation

Local functional tests against the built output:

- With `127.0.0.71:9925` occupied → pool logs the skip and returns `127.0.0.72`.
- Without a conflict → pool returns `127.0.0.71`, no regression.

End-to-end harper-pro Integration Tests validation via a git-dep branch is in progress; will link results before merge.
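The teardown rule from the Fix section (recycle an address only when nothing still holds its ports) can be sketched as a small decision function. This is an illustration under assumptions, not the code in `teardownHarper`: the function name, the `SlotAction` type, the injected `isPortHeld` probe, and the choice of which ports to check are all hypothetical. The hostname guard reflects the review follow-up commit.

```typescript
type SlotAction = "skip" | "release" | "park";

// Hypothetical helper: decides what teardown should do with a loopback slot.
// `ports` is whatever set of ports the caller considers relevant (an assumption).
async function decideSlotAction(
  hostname: string | undefined,
  ports: number[],
  isPortHeld: (host: string, port: number) => Promise<boolean>,
): Promise<SlotAction> {
  // Startup failed before an address was assigned: there is nothing to release.
  if (!hostname) return "skip";
  for (const port of ports) {
    // A child that escaped the kill still owns the address, so recycling it
    // would let the next node co-bind. Keep the slot parked instead.
    if (await isPortHeld(hostname, port)) return "park";
  }
  return "release"; // all ports are free, so the address can be reused
}
```

Keeping the decision separate from the pool bookkeeping makes it easy to exercise with a fake probe, without spawning any Harper process.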
🤖 Generated with Claude Code