fix(loopback): prevent SO_REUSEPORT co-binds via an operations-port conflict canary #17

Merged
kriszyp merged 2 commits into main from kris/loopback-conflict-canary
Jun 15, 2026

@kriszyp kriszyp commented Jun 15, 2026

Member

Problem

Under concurrent execution the loopback pool can hand the same 127.0.0.x to two suites that are alive at the same time (e.g. a slot freed while its Harper node lingered, or a child that outlived teardown). On Linux, Harper binds its HTTP/replication ports with SO_REUSEPORT, so the freshly-assigned node silently co-binds an address another node still owns. The kernel then load-balances incoming connections across the two nodes, corrupting both suites.

Observed in the harper-pro Integration Tests as selectiveTableSubscription replication sockets that never connect, "unable to generate JWT as there are no encryption keys", "database 'data' does not exist", and similar failures, all with the same root cause. macOS binds those ports without SO_REUSEPORT, which is why the problem reproduces only on Linux; a suite run in isolation passes because no second node is competing for its address.

Confirmed in a failing run: Issue_135 … .3 (alive 12:37:40→12:39:36) and Selective_table_subscription .3 (12:38:31→12:38:38) both held 127.0.0.3 simultaneously.

Fix

  1. Conflict canary in getNextAvailableLoopbackAddress: after claiming a slot, probe the operations-API port (9925) with a plain, non-SO_REUSEPORT bind. The operations API is the one port Harper binds exclusively (main-thread only), so EADDRINUSE here reliably means a node still holds the address. On conflict, park the slot under our PID (so nobody else retries the poisoned address) and claim the next one. The probe port can be overridden via HARPER_INTEGRATION_TEST_CONFLICT_PROBE_PORT.
  2. Teardown hardening: teardownHarper no longer recycles an address whose ports are still held by an escaped child; it parks the slot (reclaimed when the per-file process exits).
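
The canary probe in item 1 could look roughly like the sketch below. The PR names isLoopbackAddressInUse later in the thread but does not show its body, so the signature, the default port argument, and the handling of non-EADDRINUSE errors are all assumptions made for illustration.

```typescript
import { createServer } from "node:net";

// Hypothetical signature: resolves true when host:port is already bound
// exclusively by another socket, false when a plain bind succeeds.
function isLoopbackAddressInUse(host: string, port = 9925): Promise<boolean> {
  return new Promise((resolve) => {
    const server = createServer();
    const onError = (err: Error): void => {
      // Assumption: only EADDRINUSE counts as a conflict; any other bind
      // error is reported as "not in use" in this sketch.
      resolve((err as NodeJS.ErrnoException).code === "EADDRINUSE");
    };
    server.once("error", onError);
    // A plain listen() without reusePort, so a second bind on an
    // exclusively held port fails instead of silently co-binding.
    server.listen({ host, port, exclusive: true }, () => {
      // Drop the listener once the bind succeeded so a later socket
      // error during close() cannot change the verdict.
      server.off("error", onError);
      server.close(() => resolve(false));
    });
  });
}
```

The probe binds and immediately closes, so it holds the port only briefly; it answers "is something there right now", not "will the port stay free".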

Validation

Local functional tests against the built output:

  • In-use address (127.0.0.71:9925 occupied) → pool logs the skip and returns 127.0.0.72.
  • No conflict → returns the first address (127.0.0.71), no regression.

End-to-end harper-pro Integration Tests validation via a git-dep branch is in progress; will link results before merge.

🤖 Generated with Claude Code

…onflict canary

Under concurrency, the loopback pool can hand the same 127.0.0.x to two suites that
are alive at the same time (e.g. a slot freed while its Harper node lingered). On Linux,
Harper binds its HTTP/replication ports with SO_REUSEPORT, so the freshly-assigned node
silently co-binds an address another node still owns — the kernel then splits connections
between the two nodes and corrupts both suites (manifests as replication sockets that
never connect, "no encryption keys", missing databases/tables, etc.). macOS binds those
ports without SO_REUSEPORT, which is why this only reproduces on Linux.

Two defenses:
- Conflict canary in getNextAvailableLoopbackAddress: after claiming a slot, probe the
  operations-API port (9925) with a plain, non-SO_REUSEPORT bind. The operations API is
  the one port Harper binds exclusively (main-thread only), so EADDRINUSE here means a
  node still holds the address. On conflict we park the slot under our PID (so no other
  process retries the poisoned address) and claim the next one. Override the probe port
  via HARPER_INTEGRATION_TEST_CONFLICT_PROBE_PORT.
- teardownHarper no longer recycles an address whose ports are still held by a child that
  escaped the tree-kill; it parks the slot (reclaimed when the per-file process exits).
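
As a rough illustration of the second defense, the teardown decision reduces to "park if the ports are still held, otherwise release". The SlotActions interface and the recycleOrPark name below are hypothetical; the PR does not show how teardownHarper detects held ports or how the pool stores slots.

```typescript
// Hypothetical helpers, injected so the decision logic can be shown in isolation.
interface SlotActions {
  portsStillHeld(address: string): Promise<boolean>; // e.g. a bind probe
  release(address: string): void; // return the slot to the pool
  park(address: string): void; // keep the slot claimed until this process exits
}

async function recycleOrPark(
  address: string,
  actions: SlotActions,
): Promise<"released" | "parked"> {
  if (await actions.portsStillHeld(address)) {
    // An escaped child still owns the ports: do not hand the address out again.
    actions.park(address);
    return "parked";
  }
  actions.release(address);
  return "released";
}
```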

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a conflict canary mechanism to prevent port conflicts and silent co-binding under SO_REUSEPORT by probing loopback addresses before allocation and avoiding recycling addresses if ports remain in use after teardown. The review feedback highlights several critical issues: a potential infinite hang/deadlock under pool exhaustion when poisoned slots are permanently parked under the current PID, a potential TypeError during teardown if the hostname is undefined, a crash risk if the conflict probe port environment variable is parsed as NaN, and a race condition where the server's 'error' listener is not cleaned up after a successful bind.

Comment thread src/loopbackAddressPool.ts Outdated
Comment thread src/harperLifecycle.ts Outdated
Comment thread src/loopbackAddressPool.ts Outdated
Comment thread src/loopbackAddressPool.ts Outdated
…cases

From PR review (gemini-code-assist):
- Pool-exhaustion deadlock: parking poisoned slots under our own live PID could fill
  the pool with our PID, leaving nothing reclaimable and spinning forever. Instead
  release poisoned slots back to the pool and skip them via a per-attempt local Set
  (cleared before each wait so they can be re-probed once the lingering node exits).
  Removed the now-unused findAvailableIndex helper.
- teardownHarper: nest the release under `if (ctx.harper.hostname)` so an early
  startup failure (no hostname) can't call releaseLoopbackAddress(undefined) and mask
  the real error with a TypeError.
- CONFLICT_PROBE_PORT: fall back to 9925 on a NaN/out-of-range env override instead of
  crashing on server.listen(NaN).
- isLoopbackAddressInUse: remove the 'error' listener once listen() succeeds so a
  post-bind socket error during close() can't flip the verdict.

Validated locally: skip-in-use returns the next address; full-pool exhaustion recovers
without deadlock once an address frees; an invalid probe-port env falls back cleanly.
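
A minimal sketch of the release-and-skip loop described in the first bullet, assuming an in-memory Set stands in for the real pool (which is shared across processes and not shown in this PR). The function name claimFreeAddress, the waitMs and maxRounds parameters, and the injected probe are illustrative only.

```typescript
type Probe = (address: string) => Promise<boolean>;

async function claimFreeAddress(
  addresses: readonly string[],
  claimed: Set<string>, // assumption: stands in for the shared pool state
  isInUse: Probe,
  waitMs = 50,
  maxRounds = 100,
): Promise<string> {
  for (let round = 0; round < maxRounds; round++) {
    // Per-attempt skip set: starts empty each round, so an address that was
    // poisoned before the wait is probed again afterwards.
    const skipped = new Set<string>();
    for (;;) {
      const address = addresses.find((a) => !claimed.has(a) && !skipped.has(a));
      if (address === undefined) break; // nothing left to try this round
      claimed.add(address);
      if (!(await isInUse(address))) return address;
      // Poisoned: release the slot instead of parking it under our PID,
      // and only remember locally not to retry it in this round.
      claimed.delete(address);
      skipped.add(address);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, waitMs));
  }
  throw new Error("no loopback address became available");
}
```

Because poisoned slots go back to the pool, the pool can never fill up with entries owned by the waiting process itself, which is what made the earlier park-under-PID approach spin forever.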

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

kriszyp commented Jun 15, 2026

Member Author

Thanks for the thorough review — all four were valid and are addressed in ec2a7ce:

  • Pool-exhaustion deadlock (high): good catch. Dropped the park-under-PID approach; poisoned slots are now released back to the pool and skipped via a per-attempt local Set (cleared before each wait so they're re-probed once a lingering node exits). Removed the now-unused findAvailableIndex.
  • releaseLoopbackAddress(undefined) TypeError: the release is now nested under if (ctx.harper.hostname).
  • NaN probe port: CONFLICT_PROBE_PORT falls back to 9925 on a non-numeric/out-of-range override.
  • Error-listener leak: removed via server.off('error', …) on successful bind.

Added local validation for each: skip-in-use returns the next address; a fully-exhausted pool recovers without deadlock the moment an address frees (~3s); an invalid probe-port env falls back cleanly. End-to-end harper-pro Integration Tests came back green across all 4 shards × Node 22/24/26.
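
For the probe-port fallback, one way to validate the override is sketched below. The helper name resolveConflictProbePort and the exact accepted range (1 to 65535) are assumptions; the thread only states that a non-numeric or out-of-range value falls back to 9925.

```typescript
const DEFAULT_CONFLICT_PROBE_PORT = 9925;

// Hypothetical helper: returns the override only when it is a usable TCP port.
function resolveConflictProbePort(raw: string | undefined): number {
  const parsed = Number(raw); // NaN for undefined or non-numeric text
  const valid = Number.isInteger(parsed) && parsed >= 1 && parsed <= 65535;
  return valid ? parsed : DEFAULT_CONFLICT_PROBE_PORT;
}
```

In the test harness this would presumably be called with process.env.HARPER_INTEGRATION_TEST_CONFLICT_PROBE_PORT, so that server.listen never receives NaN.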

— Claude Opus 4.8

@kriszyp kriszyp merged commit d0575a8 into main Jun 15, 2026
6 checks passed
@kriszyp kriszyp deleted the kris/loopback-conflict-canary branch June 15, 2026 14:41
kriszyp added a commit to HarperFast/harper-pro that referenced this pull request Jun 15, 2026
0.5.2 fixes the loopback-address pool's SO_REUSEPORT co-bind: under concurrency the
pool could hand the same 127.0.0.x to two simultaneously-alive suites, and on Linux
Harper binds HTTP/replication with SO_REUSEPORT, so both nodes silently co-bound the
address — the kernel then split connections between them, corrupting both suites. That
is the root cause of the persistently-red Integration Tests (selectiveTableSubscription
sockets that never connect, "no encryption keys", missing databases/tables). 0.5.2 adds
an operations-port conflict canary at allocation time and stops teardown from recycling
an address whose ports are still held.

See HarperFast/integration-testing#17. Pairs with HarperFast/harper#1301 (ops API never
uses SO_REUSEPORT) to keep the canary's invariant true.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>