fix: inter-chunk stream timeout and download part guard (RFC)#1236
Open
planetf1 wants to merge 7 commits into
Open
fix: inter-chunk stream timeout and download part guard (RFC)#1236planetf1 wants to merge 7 commits into
planetf1 wants to merge 7 commits into
Conversation
Adds ModelOption.STREAM_TIMEOUT (default 60 s) and threads it through all backend send_to_queue call sites. A hung or stalled backend iterator now surfaces as a TimeoutError in the queue rather than blocking forever. Addresses the root cause reported in generative-computing#650. Adds a _MAX_PARTS=1000 guard to the retriever parquet download loop so a malformed or unexpectedly large dataset cannot exhaust disk. Prototype / RFC — see generative-computing#1235 for discussion. Default timeout value and whether MAX_NEW_TOKENS should also gain a default cap are open questions. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Adds a complete ModelOption reference table to configure-model-options, a streaming-timeout section explaining the default 60 s inter-chunk bound and how to override it, and a matching note on the async/streaming how-to page. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
The largest corpus in frreiss/mt-rag-embeddings currently has 20 parts. 1000 was effectively unbounded for this dataset; 50 gives reasonable headroom while actually catching runaway downloads. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Backend defaults vary widely — vLLM defaults to 16 tokens which silently truncates most responses. Added a warning callout and recommendation to always set MAX_NEW_TOKENS explicitly in production code. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
- Rename _DEFAULT_CHUNK_TIMEOUT -> DEFAULT_CHUNK_TIMEOUT (public export); all backends now import and reference it, eliminating the hardcoded 60.0 - Use aiter()/anext() builtins instead of __aiter__/__anext__ dunders - Close underlying AsyncGenerator on timeout via isinstance guard - Clarify docstring: timeout covers TTFT as well as inter-chunk gaps; no trailing sentinel after TimeoutError (exception item is the terminator) - Fix off-by-one in retriever download guard: allow _MAX_PARTS+1 loop iterations so a corpus of exactly _MAX_PARTS parts can see the 404 - Add tests: timeout fires + no trailing sentinel, None disables timeout, DEFAULT_CHUNK_TIMEOUT value exported correctly (all 14 pass) - Update docs: explicit TTFT note, add 300s slow-model example alongside the None disable example Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
0a1e230 to
036e813
Compare
Backend-default STREAM_TIMEOUT was silently dropped because the call sites read from the raw per-call model_options instead of the merged model_opts dict produced by _simplify_and_merge(). This meant setting STREAM_TIMEOUT on a backend constructor or via push_model_options() had no effect. Fix: four sites (ollama, litellm, watsonx, openai standard path) now read `model_opts.get(ModelOption.STREAM_TIMEOUT, DEFAULT_CHUNK_TIMEOUT)` in line with every other option lookup in those functions. HF's three sites were already correct (they receive the merged dict under the model_options param). Also: update model_options.py docstrings to say "per-chunk" and note time-to-first-token coverage; bump test timeout from 0.05s to 0.5s to reduce flakiness risk on loaded CI runners. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
ajbozarth
reviewed
Jun 9, 2026
ajbozarth
left a comment
Contributor
There was a problem hiding this comment.
A quick review from Claude to start:
Nice scoped fix — the streaming hang really did need bounding, and the API surface (single DEFAULT_CHUNK_TIMEOUT constant, single STREAM_TIMEOUT ModelOption, None to disable) is restrained. Tests pin the right invariants, especially that a timeout uses the exception itself as the stream terminator with no trailing sentinel.
Four action items inline. The (model_options or {}).get(...) consistency item and the aclose() reach-through are worth addressing in this PR; the cancel-hook miss can be a follow-up.
Replace the isinstance(ait, AsyncGenerator) guard before aclose() with
duck-typing via getattr — OpenAI's AsyncStream and HF's
AsyncTextIteratorStreamer are class-based AsyncIterators, not
AsyncGenerator instances, so the original branch was unreachable on the
two streaming paths that matter most. The new path handles both aclose()
(async generators) and close() (class-based iterators), logging any
cleanup failure at DEBUG rather than swallowing it silently.
Drop the `(model_options or {})` defensive wrapper at the four
STREAM_TIMEOUT read sites (hf.py:3, openai.py:1). The parameter is
typed dict[str, Any] and is always a merged non-None dict at those call
sites — the guard was dead code and inconsistent with the other four
sites in the same files.
Add TODO(generative-computing#1242) at both _cancel_hook arming sites noting that direct
avalue()/astream() callers do not fire the cancel event on
STREAM_TIMEOUT; the stream_with_chunking path already mitigates this.
Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
ajbozarth
approved these changes
Jun 9, 2026
ajbozarth
left a comment
Contributor
There was a problem hiding this comment.
LGTM on a code read through and Claude says so too
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #650. Addresses all three concerns in #1235.
Fixes #650
Fixes #1235
What changed
Streaming hang fix (
helpers/async_helpers.py)A stalled backend iterator previously blocked the process forever.
send_to_queuehad no per-chunk timeout, so its iteration loop waited indefinitely and no sentinel ever reached the queue. Every downstream consumer blocked with it.This wraps each
anext()call inasyncio.timeout(). A stalled stream now surfaces as aTimeoutErrorthrough the queue. On timeout, the underlying iterator's close method is called via duck-typing (getattr(ait, "aclose", None) or getattr(ait, "close", None)) to release HTTP connections cleanly — this covers bothasync def-styleAsyncGeneratorand class-based iterators such as OpenAI'sAsyncStreamand HF'sAsyncTextIteratorStreamer.DEFAULT_CHUNK_TIMEOUT = 60.0is a public constant exported fromhelpers/. All eight backend call sites import and use it — no hardcoded values. The newModelOption.STREAM_TIMEOUTkey lets callers override per-call or at the backend/session level.The 60 s default covers time-to-first-token as well as inter-chunk gaps. Slow local inference (large models on CPU, long prompts) can legitimately exceed this before the first token. Use a higher value or
Nonefor those deployments.Retriever download guard (
formatters/granite/retrievers/util.py)The parquet download loop now stops at
_MAX_PARTS + 1attempts. The extra iteration allows a corpus of exactly_MAX_PARTSparts to receive the terminating 404 rather than raising spuriously. The largest corpus currently has 20 parts; 50 gives headroom.Documentation (
docs/)ModelOptionreference table updated:STREAM_TIMEOUTdescription notes it covers TTFT.MAX_NEW_TOKENSrow flags that backend defaults vary widely (vLLM defaults to 16). Both streaming how-to pages show three examples: tight timeout, slow-model value (300 s), andNoneto disable.Tests (
test/helpers/test_async_helpers.py)Three new cases: timeout fires and puts
TimeoutErrorwith no trailing sentinel;Nonedisables the timeout and clean completion still gets a sentinel;DEFAULT_CHUNK_TIMEOUTis 60.0. All 14 tests pass.Follow-up
#1242 tracks the cancel hook not firing for direct
avalue()/astream()callers on HFSTREAM_TIMEOUT— the worker thread keeps generating until natural completion on those paths. Marked withTODO(#1242)at the two_cancel_hookarming sites. Thestream_with_chunkingpath already callscancel_generation()and is not affected.🤖 Generated with Claude Code