fix(tunnel-node): bound per-session memory to prevent OOM #1401

Open: yyoyoian-pixel wants to merge 1 commit into therealaleph:main from yyoyoian-pixel:fix/tunnel-node-oom-backpressure

Summary

The tunnel-node reader_task appended upstream data to each session's read_buf with no size limit. With an Apps Script RTT of 2-7 s, fast upstreams (video, downloads) could push tens of MB into a buffer between drains. Across multiple concurrent sessions this compounded until the VM ran out of memory and the process was OOM-killed.
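
For a rough sense of scale, growth between drains is roughly upstream rate times drain interval. The sketch below is illustrative only: the 5 MB/s upstream rate is a hypothetical figure, not a measurement from this deployment, and the helper names are made up for the example.

```rust
/// Bytes that pile up in an unbounded buffer between two drains,
/// assuming a constant upstream rate and one drain per round trip.
fn buffered_between_drains(upstream_bytes_per_sec: u64, drain_interval_secs: u64) -> u64 {
    upstream_bytes_per_sec * drain_interval_secs
}

const MB: u64 = 1024 * 1024;

/// Hypothetical upstream rate used only for illustration.
const ASSUMED_UPSTREAM: u64 = 5 * MB;

/// Per-session buffer growth (in MB) at the low and high end of the
/// 2-7 s RTT range quoted above.
fn per_session_range_mb() -> (u64, u64) {
    (
        buffered_between_drains(ASSUMED_UPSTREAM, 2) / MB,
        buffered_between_drains(ASSUMED_UPSTREAM, 7) / MB,
    )
}
```

Under that assumption a single session holds roughly 10 to 35 MB between drains, and the total scales linearly with the number of concurrent fast sessions.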

Changes

  • READ_BUF_CAP (32 MB) backpressure: reader_task pauses reads when the per-session buffer reaches the cap and resumes once the client drains it. Bounds per-session memory regardless of upstream throughput.
  • Conditional last_active bump — TCP data ops now only refresh last_active on real uplink writes or when a drain returns downstream data. Empty long-poll batches no longer keep idle sessions alive past the 300s reaper (matches the existing udp_data had_uplink pattern).
  • abort_all() in reaper + batch-drain cleanup — previously only reader_handle.abort() was called, leaking udpgw_handle on virtual (udpgw) sessions.
  • Diagnostic logging in cleanup_task — logs TCP session count and total read_buf size every 30s so memory pressure is observable in logs.
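
The backpressure item can be sketched as follows. This is a simplified, synchronous illustration using only the standard library, not the code in this PR: the `Session` struct, `push_with_backpressure`, and `drain` are hypothetical names, and the real reader_task may be structured quite differently. Only `READ_BUF_CAP` (32 MB) and the 50 ms poll interval come from the description.

```rust
use std::sync::Mutex;
use std::thread;
use std::time::Duration;

/// Cap from the PR description (32 MB per session).
const READ_BUF_CAP: usize = 32 * 1024 * 1024;
/// Poll interval from the PR description (50 ms).
const BACKPRESSURE_POLL: Duration = Duration::from_millis(50);

/// Hypothetical, minimal stand-in for a tunnel session.
struct Session {
    read_buf: Mutex<Vec<u8>>,
}

/// Append `chunk` once the buffer is below `cap`; otherwise sleep and retry.
/// The buffer can exceed `cap` by less than one chunk, because the check
/// happens before the append.
fn push_with_backpressure(session: &Session, chunk: &[u8], cap: usize, poll: Duration) {
    loop {
        {
            let mut buf = session.read_buf.lock().unwrap();
            if buf.len() < cap {
                buf.extend_from_slice(chunk);
                return;
            }
        } // lock released here, so the drain side can make progress
        thread::sleep(poll);
    }
}

/// Client-side drain: take everything buffered so far, leaving an empty buffer.
fn drain(session: &Session) -> Vec<u8> {
    std::mem::take(&mut *session.read_buf.lock().unwrap())
}
```

A reader loop would call `push_with_backpressure(&session, &chunk, READ_BUF_CAP, BACKPRESSURE_POLL)` for each chunk read from upstream. While the buffer is full, the reader simply stops pulling from the socket, so ordinary TCP flow control slows the upstream sender.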

Validation

  • cargo test in tunnel-node: 38 passed, 0 failed.
  • Deployed to production VPS: memory held at ~26% (1.0 GB) over 21 hours / 72 GB of traffic, where the prior build OOM-killed itself at the same age.

Test plan

  • Unit tests pass (cargo test)
  • Deployed and monitored on production VPS for 21h — memory stable, no OOM
  • Reviewer sanity-check on the backpressure sleep-loop interval (50ms)
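
As input for the interval review, here is a small back-of-the-envelope sketch. It assumes a paused reader re-checks the buffer once per interval for the whole stall, and that a stall lasts about one RTT; the function name is hypothetical.

```rust
use std::time::Duration;

/// Poll interval from the test plan (50 ms).
const POLL_INTERVAL: Duration = Duration::from_millis(50);

/// Number of times a paused reader wakes to re-check the buffer during
/// one stall. `poll` must be at least 1 ms, otherwise this divides by zero.
fn wakeups_during_stall(stall: Duration, poll: Duration) -> u128 {
    stall.as_millis() / poll.as_millis()
}
```

For stalls of 2 s and 7 s this gives 40 and 140 wakeups per paused session. The worst-case delay between a drain and the reader resuming is one interval (50 ms), which is small next to the 2-7 s RTT.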

🤖 Generated with Claude Code

The reader_task appended upstream data into each session's read_buf
with no size limit. With Apps Script RTT of 2-7s, fast upstreams
(video, downloads) could push tens of MB between drains; multiple
sessions compounded to exhaust the VM and trigger an OOM kill.

- READ_BUF_CAP (32 MB): reader_task pauses when the buffer is full and
  resumes once the client drains it, bounding per-session memory.
- Conditional last_active bump: TCP data ops only refresh last_active
  on real uplink writes or when a drain returns data, so empty polls no
  longer keep idle sessions alive past the reaper (matches udp_data).
- abort_all() in reaper and batch-drain cleanup: previously only
  reader_handle was aborted, leaking udpgw_handle on virtual sessions.
- Diagnostic logging in cleanup_task: logs tcp session count and total
  read_buf size every 30s so memory pressure is observable.

Validated in production: memory held at ~26% over 21h / 72 GB of
traffic where the prior build OOM-killed itself at the same age.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>