fix: inter-chunk stream timeout and download part guard (RFC) by planetf1 · Pull Request #1236 · generative-computing/mellea

planetf1 · 2026-06-09T10:15:46Z

Closes #650. Addresses all three concerns in #1235.

What changed

Streaming hang fix (helpers/async_helpers.py)

A stalled backend iterator previously blocked the process forever. send_to_queue had no per-chunk timeout, so its iteration loop waited indefinitely and no sentinel ever reached the queue. Every downstream consumer blocked with it.

This wraps each anext() call in asyncio.timeout(). A stalled stream now surfaces as a TimeoutError through the queue. On timeout, the underlying iterator's close method is called via duck-typing (getattr(ait, "aclose", None) or getattr(ait, "close", None)) to release HTTP connections cleanly — this covers both async def-style AsyncGenerator and class-based iterators such as OpenAI's AsyncStream and HF's AsyncTextIteratorStreamer.

DEFAULT_CHUNK_TIMEOUT = 60.0 is a public constant exported from helpers/. All eight backend call sites import and use it — no hardcoded values. The new ModelOption.STREAM_TIMEOUT key lets callers override per-call or at the backend/session level.

The 60 s default covers time-to-first-token as well as inter-chunk gaps. Slow local inference (large models on CPU, long prompts) can legitimately exceed this before the first token. Use a higher value or None for those deployments.

Retriever download guard (formatters/granite/retrievers/util.py)

The parquet download loop now stops at _MAX_PARTS + 1 attempts. The extra iteration allows a corpus of exactly _MAX_PARTS parts to receive the terminating 404 rather than raising spuriously. The largest corpus currently has 20 parts; 50 gives headroom.

Documentation (docs/)

ModelOption reference table updated: STREAM_TIMEOUT description notes it covers TTFT. MAX_NEW_TOKENS row flags that backend defaults vary widely (vLLM defaults to 16). Both streaming how-to pages show three examples: tight timeout, slow-model value (300 s), and None to disable.

Tests (test/helpers/test_async_helpers.py)

Three new cases: timeout fires and puts TimeoutError with no trailing sentinel; None disables the timeout and clean completion still gets a sentinel; DEFAULT_CHUNK_TIMEOUT is 60.0. All 14 tests pass.

Follow-up

#1242 tracks the cancel hook not firing for direct avalue()/astream() callers on HF STREAM_TIMEOUT — the worker thread keeps generating until natural completion on those paths. Marked with TODO(#1242) at the two _cancel_hook arming sites. The stream_with_chunking path already calls cancel_generation() and is not affected.

🤖 Generated with Claude Code

Adds ModelOption.STREAM_TIMEOUT (default 60 s) and threads it through all backend send_to_queue call sites. A hung or stalled backend iterator now surfaces as a TimeoutError in the queue rather than blocking forever. Addresses the root cause reported in generative-computing#650. Adds a _MAX_PARTS=1000 guard to the retriever parquet download loop so a malformed or unexpectedly large dataset cannot exhaust disk. Prototype / RFC — see generative-computing#1235 for discussion. Default timeout value and whether MAX_NEW_TOKENS should also gain a default cap are open questions. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

Adds a complete ModelOption reference table to configure-model-options, a streaming-timeout section explaining the default 60 s inter-chunk bound and how to override it, and a matching note on the async/streaming how-to page. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

The largest corpus in frreiss/mt-rag-embeddings currently has 20 parts. 1000 was effectively unbounded for this dataset; 50 gives reasonable headroom while actually catching runaway downloads. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

Backend defaults vary widely — vLLM defaults to 16 tokens which silently truncates most responses. Added a warning callout and recommendation to always set MAX_NEW_TOKENS explicitly in production code. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

- Rename _DEFAULT_CHUNK_TIMEOUT -> DEFAULT_CHUNK_TIMEOUT (public export); all backends now import and reference it, eliminating the hardcoded 60.0 - Use aiter()/anext() builtins instead of __aiter__/__anext__ dunders - Close underlying AsyncGenerator on timeout via isinstance guard - Clarify docstring: timeout covers TTFT as well as inter-chunk gaps; no trailing sentinel after TimeoutError (exception item is the terminator) - Fix off-by-one in retriever download guard: allow _MAX_PARTS+1 loop iterations so a corpus of exactly _MAX_PARTS parts can see the 404 - Add tests: timeout fires + no trailing sentinel, None disables timeout, DEFAULT_CHUNK_TIMEOUT value exported correctly (all 14 pass) - Update docs: explicit TTFT note, add 300s slow-model example alongside the None disable example Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

Backend-default STREAM_TIMEOUT was silently dropped because the call sites read from the raw per-call model_options instead of the merged model_opts dict produced by _simplify_and_merge(). This meant setting STREAM_TIMEOUT on a backend constructor or via push_model_options() had no effect. Fix: four sites (ollama, litellm, watsonx, openai standard path) now read `model_opts.get(ModelOption.STREAM_TIMEOUT, DEFAULT_CHUNK_TIMEOUT)` in line with every other option lookup in those functions. HF's three sites were already correct (they receive the merged dict under the model_options param). Also: update model_options.py docstrings to say "per-chunk" and note time-to-first-token coverage; bump test timeout from 0.05s to 0.5s to reduce flakiness risk on loaded CI runners. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

ajbozarth

A quick review from Claude to start:

Nice scoped fix — the streaming hang really did need bounding, and the API surface (single DEFAULT_CHUNK_TIMEOUT constant, single STREAM_TIMEOUT ModelOption, None to disable) is restrained. Tests pin the right invariants, especially that a timeout uses the exception itself as the stream terminator with no trailing sentinel.

Four action items inline. The (model_options or {}).get(...) consistency item and the aclose() reach-through are worth addressing in this PR; the cancel-hook miss can be a follow-up.

Replace the isinstance(ait, AsyncGenerator) guard before aclose() with duck-typing via getattr — OpenAI's AsyncStream and HF's AsyncTextIteratorStreamer are class-based AsyncIterators, not AsyncGenerator instances, so the original branch was unreachable on the two streaming paths that matter most. The new path handles both aclose() (async generators) and close() (class-based iterators), logging any cleanup failure at DEBUG rather than swallowing it silently. Drop the `(model_options or {})` defensive wrapper at the four STREAM_TIMEOUT read sites (hf.py:3, openai.py:1). The parameter is typed dict[str, Any] and is always a merged non-None dict at those call sites — the guard was dead code and inconsistent with the other four sites in the same files. Add TODO(generative-computing#1242) at both _cancel_hook arming sites noting that direct avalue()/astream() callers do not fire the cancel event on STREAM_TIMEOUT; the stream_with_chunking path already mitigates this. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

ajbozarth

LGTM on a code read through and Claude says so too

github-actions Bot added the bug Something isn't working label Jun 9, 2026

planetf1 marked this pull request as ready for review June 9, 2026 10:31

planetf1 requested review from a team as code owners June 9, 2026 10:31

planetf1 requested review from AngeloDanducci, ajbozarth and psschwei June 9, 2026 10:31

planetf1 added 3 commits June 9, 2026 11:47

planetf1 force-pushed the fix/stream-liveness-bounds branch from 0a1e230 to 036e813 Compare June 9, 2026 10:56

ajbozarth reviewed Jun 9, 2026

View reviewed changes

Comment thread mellea/backends/huggingface.py Outdated

Comment thread mellea/helpers/async_helpers.py Outdated

Comment thread mellea/backends/huggingface.py

Comment thread docs/docs/how-to/configure-model-options.md

planetf1 mentioned this pull request Jun 9, 2026

fix(hf): fire _cancel_event on STREAM_TIMEOUT for direct avalue/astream paths #1242

Open

planetf1 requested a review from ajbozarth June 9, 2026 18:07

ajbozarth approved these changes Jun 9, 2026

View reviewed changes

planetf1 enabled auto-merge June 10, 2026 10:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: inter-chunk stream timeout and download part guard (RFC)#1236

fix: inter-chunk stream timeout and download part guard (RFC)#1236
planetf1 wants to merge 7 commits into
generative-computing:mainfrom
planetf1:fix/stream-liveness-bounds

planetf1 commented Jun 9, 2026 •

edited

Loading

Uh oh!

ajbozarth left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ajbozarth left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

planetf1 commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changed

Follow-up

Uh oh!

ajbozarth left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ajbozarth left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

planetf1 commented Jun 9, 2026 •

edited

Loading