fix(retry): keep flow-control wired to the active connection across resumes#5405
Open
bogomya wants to merge 1 commit into
Open
fix(retry): keep flow-control wired to the active connection across resumes#5405bogomya wants to merge 1 commit into
bogomya wants to merge 1 commit into
Conversation
5 tasks
…esumes RetryHandler handed each connection's per-dispatch controller straight through to the downstream handler. But a transparent retry/resume is a *separate* dispatch with its *own* controller, while the downstream body's resume() stays closured over the FIRST controller (api-request.js). After a resume, backpressure pauses the new connection's controller, but the consumer resumes the original (now-dead) one — and since that controller was never paused, resume() is a no-op. The resumed body is then never read again and the request hangs forever on a live socket with data still buffered. Hand the downstream handler a stable RetryController proxy that always forwards pause()/resume()/abort() to the controller of the currently active connection. Confirmed against a live ~1GB S3 download with a real mid-stream network drop: before, the resumed connection delivered a few hundred KB then stalled forever; after, it resumes and completes byte-accurate. Adds a resume-under-backpressure regression test — the existing retry tests don't exercise backpressure, which is why this went unnoticed. Signed-off-by: Dmitry Bogomya <dmitry.bogomya@gmail.com>
0fbd619 to
3bbd13f
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This relates to...
A hang in
RetryAgent/RetryHandlerwhen a transparently-resumed (Range) response body is consumed under backpressure.Rationale
When a response body is consumed under backpressure and the connection drops mid-stream,
RetryAgent's transparent Range-resume reconnects but the resumed body never resumes reading — the request hangs forever on a live socket with data still buffered. This reproducibly hangs large downloads (e.g. ~1 GB from S3) on a single transient network blip.Root cause
Each transparent retry/resume is a separate dispatch with its own
RequestController(created per dispatch inlib/core/request.js). The downstream body, however, is created once inlib/api/api-request.jswithresume: () => controller.resume()closured over the first controller. On a206resume,RetryHandler.onResponseStarttakes theheadersSent === truebranch and returns without re-callinghandler.onResponseStart, so the downstream is never given the new connection's controller.The consequence is split flow-control:
onResponseData→controllerB.pause()(the new connection), but() => controllerA.resume()(the original controller).controllerAwas never paused, soRequestController.resume()'sif (this.#paused)guard makes it a no-op — the new connection's parser is never resumed.Why pause and resume diverge
They reach the controller through different paths — which is why one tracks the live connection while the other stays frozen:
onResponseStartis not called again on a resume (it would re-emit headers/body to the consumer), so the resume closure is never refreshed. The body'saborthook is closured the same way, so anAbortSignalfired after a resume aborts the dead original and leaks the live connection — the same fix covers it.So after a resume followed by the first bit of backpressure, the resumed connection is never read again. (
bodyTimeoutcannot rescue it either: while the parser ispausedthe body timeout is intentionally suppressed — see #447.) This only triggers when resume and backpressure coincide, which is why the existing retry tests (no backpressure) never caught it.A control run isolates backpressure as the trigger: the same reconnect with a non-backpressured (flowing) consumer completes in ~200 ms (resume works), whereas resume + backpressure hangs. Both conditions are required.
Verified live, single network drop, socket alive throughout:
Changes
RetryHandlernow hands the downstream handler a stableRetryControllerproxy for the whole request, and forwardspause()/resume()/abort()(plus thepaused/aborted/reason/rawHeaders/rawTrailersgetters) to the controller of the currently active connection, re-pointed on each (re)dispatch. The downstream therefore always flow-controls the live connection, so a resumed body resumes (and aborts) the right connection.The proxy's target is re-pointed in
onRequestStartonly:lib/core/request.jscreates oneRequestControllerper dispatch and passes that same instance to every later callback of the dispatch, andonRequestStartis the first callback of every dispatch (including each retry), so a single re-point per dispatch keeps the proxy on the active connection. Thepause()/resume()calls insideonResponseStartWithRetrydeliberately stay on the raw connection controller (not the proxy): they hold that exact connection while the retry policy decides, and routing them through the proxy could resume a different connection if a later dispatch re-points it in between.Features
N/A
Bug Fixes
RetryAgent/RetryHandlerhanging forever when a resumed (Range) response body is consumed under backpressure: flow-control is no longer split across the old and new connection controllers.AbortSignalaborting the wrong connection after a resume: the abort hit the dead original and leaked the live resumed connection's socket; it now aborts the live connection.Breaking Changes and Deprecations
None. No public API change; the behavior change is limited to the previously-broken resume path.
Reproduction
Minimal standalone repro (before this PR it hangs forever; after, it completes byte-accurate):
test/client-retry-resume-backpressure.jsautomates both scenarios — the backpressure hang and the abort routing — each fails without the fix and passes with it. Existing*retry*tests andnpm run lintpass.Status