Skip to content

fix(retry): keep flow-control wired to the active connection across resumes#5405

Open
bogomya wants to merge 1 commit into
nodejs:mainfrom
bogomya:retry-resume-flow-control
Open

fix(retry): keep flow-control wired to the active connection across resumes#5405
bogomya wants to merge 1 commit into
nodejs:mainfrom
bogomya:retry-resume-flow-control

Conversation

@bogomya

@bogomya bogomya commented Jun 9, 2026

Copy link
Copy Markdown

This relates to...

A hang in RetryAgent / RetryHandler when a transparently-resumed (Range) response body is consumed under backpressure.

Rationale

When a response body is consumed under backpressure and the connection drops mid-stream, RetryAgent's transparent Range-resume reconnects but the resumed body never resumes reading — the request hangs forever on a live socket with data still buffered. This reproducibly hangs large downloads (e.g. ~1 GB from S3) on a single transient network blip.

Root cause

Each transparent retry/resume is a separate dispatch with its own RequestController (created per dispatch in lib/core/request.js). The downstream body, however, is created once in lib/api/api-request.js with resume: () => controller.resume() closured over the first controller. On a 206 resume, RetryHandler.onResponseStart takes the headersSent === true branch and returns without re-calling handler.onResponseStart, so the downstream is never given the new connection's controller.

The consequence is split flow-control:

  • backpressure on the resumed body → onResponseDatacontrollerB.pause() (the new connection), but
  • the consumer's drain → () => controllerA.resume() (the original controller). controllerA was never paused, so RequestController.resume()'s if (this.#paused) guard makes it a no-op — the new connection's parser is never resumed.

Why pause and resume diverge

They reach the controller through different paths — which is why one tracks the live connection while the other stays frozen:

// PAUSE — api-request.js onResponseData(): the controller arrives as an ARGUMENT
// on every chunk, so it is always the CURRENT connection (controllerB after a resume)
onResponseData (controller, chunk) {
  if (this.body.push(chunk) === false) controller.pause()   // pauses controllerB
}

// RESUME — api-request.js onResponseStart(): the body is created ONCE and its
// resume hook CLOSES OVER the controller from that call (controllerA) — frozen
this.body = new Readable({ resume: () => controller.resume(), /* ... */ })  // always controllerA

onResponseStart is not called again on a resume (it would re-emit headers/body to the consumer), so the resume closure is never refreshed. The body's abort hook is closured the same way, so an AbortSignal fired after a resume aborts the dead original and leaks the live connection — the same fix covers it.

So after a resume followed by the first bit of backpressure, the resumed connection is never read again. (bodyTimeout cannot rescue it either: while the parser is paused the body timeout is intentionally suppressed — see #447.) This only triggers when resume and backpressure coincide, which is why the existing retry tests (no backpressure) never caught it.

A control run isolates backpressure as the trigger: the same reconnect with a non-backpressured (flowing) consumer completes in ~200 ms (resume works), whereas resume + backpressure hangs. Both conditions are required.

Verified live, single network drop, socket alive throughout:

STATE  paused=true  socket.destroyed=false  bytesRead=566602  readableLength=35840   <- 35 KB sat UNREAD on a live socket
onParserTimeout BODY paused=true -> SKIP (suppressed)
resume() calls on the resumed parser: 0

Changes

RetryHandler now hands the downstream handler a stable RetryController proxy for the whole request, and forwards pause() / resume() / abort() (plus the paused / aborted / reason / rawHeaders / rawTrailers getters) to the controller of the currently active connection, re-pointed on each (re)dispatch. The downstream therefore always flow-controls the live connection, so a resumed body resumes (and aborts) the right connection.

The proxy's target is re-pointed in onRequestStart only: lib/core/request.js creates one RequestController per dispatch and passes that same instance to every later callback of the dispatch, and onRequestStart is the first callback of every dispatch (including each retry), so a single re-point per dispatch keeps the proxy on the active connection. The pause()/resume() calls inside onResponseStartWithRetry deliberately stay on the raw connection controller (not the proxy): they hold that exact connection while the retry policy decides, and routing them through the proxy could resume a different connection if a later dispatch re-points it in between.

Features

N/A

Bug Fixes

  • Fix RetryAgent / RetryHandler hanging forever when a resumed (Range) response body is consumed under backpressure: flow-control is no longer split across the old and new connection controllers.
  • Fix an AbortSignal aborting the wrong connection after a resume: the abort hit the dead original and leaked the live resumed connection's socket; it now aborts the live connection.

Breaking Changes and Deprecations

None. No public API change; the behavior change is limited to the previously-broken resume path.

Reproduction

Minimal standalone repro (before this PR it hangs forever; after, it completes byte-accurate):

const { createServer } = require('node:http')
const { Writable } = require('node:stream')
const { pipeline } = require('node:stream/promises')
const { Agent, RetryAgent, request } = require('undici')

const TOTAL = 4 * 1024 * 1024
const PART = 1 * 1024 * 1024

const server = createServer((req, res) => {
  if (!req.headers.range) {
    // First attempt: send part of the body, then reset the connection
    // (ECONNRESET) so RetryAgent transparently resumes with a Range request.
    res.writeHead(200, { 'content-length': String(TOTAL) })
    res.write(Buffer.alloc(PART))
    setTimeout(() => res.socket.resetAndDestroy(), 100)
  } else {
    // Resume: honor the Range with a 206 and the remainder.
    const start = Number(/bytes=(\d+)-/.exec(req.headers.range)[1])
    res.writeHead(206, {
      'content-range': `bytes ${start}-${TOTAL - 1}/${TOTAL}`,
      'content-length': String(TOTAL - start)
    })
    res.end(Buffer.alloc(TOTAL - start))
  }
})

server.listen(0, async () => {
  const dispatcher = new RetryAgent(new Agent(), { maxRetries: 5, minTimeout: 100, timeoutFactor: 1 })
  const { body } = await request(`http://localhost:${server.address().port}`, { dispatcher })

  // Slow consumer => sustained backpressure on the resumed connection (the trigger).
  let received = 0
  const slow = new Writable({
    highWaterMark: 16 * 1024,
    write (chunk, enc, cb) { received += chunk.length; setTimeout(cb, 5) }
  })

  await pipeline(body, slow) // before this PR: hangs forever; after: resolves
  console.log('done:', received, 'of', TOTAL, '->', received === TOTAL ? 'OK' : 'MISMATCH')
  server.close()
  await dispatcher.close()
})

test/client-retry-resume-backpressure.js automates both scenarios — the backpressure hang and the abort routing — each fails without the fix and passes with it. Existing *retry* tests and npm run lint pass.

Status

  • I have read and agreed to the Developer's Certificate of Origin
  • Tested
  • [S] Benchmarked (optional)
  • [S] Documented (no public API change)
  • Review ready
  • In review
  • Merge ready

…esumes

RetryHandler handed each connection's per-dispatch controller straight through
to the downstream handler. But a transparent retry/resume is a *separate*
dispatch with its *own* controller, while the downstream body's resume() stays
closured over the FIRST controller (api-request.js). After a resume, backpressure
pauses the new connection's controller, but the consumer resumes the original
(now-dead) one — and since that controller was never paused, resume() is a no-op.
The resumed body is then never read again and the request hangs forever on a
live socket with data still buffered.

Hand the downstream handler a stable RetryController proxy that always forwards
pause()/resume()/abort() to the controller of the currently active connection.

Confirmed against a live ~1GB S3 download with a real mid-stream network drop:
before, the resumed connection delivered a few hundred KB then stalled forever;
after, it resumes and completes byte-accurate.

Adds a resume-under-backpressure regression test — the existing retry tests
don't exercise backpressure, which is why this went unnoticed.

Signed-off-by: Dmitry Bogomya <dmitry.bogomya@gmail.com>
@bogomya bogomya force-pushed the retry-resume-flow-control branch from 0fbd619 to 3bbd13f Compare June 9, 2026 14:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant