Skip to content

fix: inter-chunk stream timeout and download part guard (RFC)#1236

Open
planetf1 wants to merge 7 commits into
generative-computing:mainfrom
planetf1:fix/stream-liveness-bounds
Open

fix: inter-chunk stream timeout and download part guard (RFC)#1236
planetf1 wants to merge 7 commits into
generative-computing:mainfrom
planetf1:fix/stream-liveness-bounds

Conversation

@planetf1

@planetf1 planetf1 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Closes #650. Addresses all three concerns in #1235.

Fixes #650
Fixes #1235

What changed

Streaming hang fix (helpers/async_helpers.py)

A stalled backend iterator previously blocked the process forever. send_to_queue had no per-chunk timeout, so its iteration loop waited indefinitely and no sentinel ever reached the queue. Every downstream consumer blocked with it.

This wraps each anext() call in asyncio.timeout(). A stalled stream now surfaces as a TimeoutError through the queue. On timeout, the underlying iterator's close method is called via duck-typing (getattr(ait, "aclose", None) or getattr(ait, "close", None)) to release HTTP connections cleanly — this covers both async def-style AsyncGenerator and class-based iterators such as OpenAI's AsyncStream and HF's AsyncTextIteratorStreamer.

DEFAULT_CHUNK_TIMEOUT = 60.0 is a public constant exported from helpers/. All eight backend call sites import and use it — no hardcoded values. The new ModelOption.STREAM_TIMEOUT key lets callers override per-call or at the backend/session level.

The 60 s default covers time-to-first-token as well as inter-chunk gaps. Slow local inference (large models on CPU, long prompts) can legitimately exceed this before the first token. Use a higher value or None for those deployments.

Retriever download guard (formatters/granite/retrievers/util.py)

The parquet download loop now stops at _MAX_PARTS + 1 attempts. The extra iteration allows a corpus of exactly _MAX_PARTS parts to receive the terminating 404 rather than raising spuriously. The largest corpus currently has 20 parts; 50 gives headroom.

Documentation (docs/)

ModelOption reference table updated: STREAM_TIMEOUT description notes it covers TTFT. MAX_NEW_TOKENS row flags that backend defaults vary widely (vLLM defaults to 16). Both streaming how-to pages show three examples: tight timeout, slow-model value (300 s), and None to disable.

Tests (test/helpers/test_async_helpers.py)

Three new cases: timeout fires and puts TimeoutError with no trailing sentinel; None disables the timeout and clean completion still gets a sentinel; DEFAULT_CHUNK_TIMEOUT is 60.0. All 14 tests pass.

Follow-up

#1242 tracks the cancel hook not firing for direct avalue()/astream() callers on HF STREAM_TIMEOUT — the worker thread keeps generating until natural completion on those paths. Marked with TODO(#1242) at the two _cancel_hook arming sites. The stream_with_chunking path already calls cancel_generation() and is not affected.

🤖 Generated with Claude Code

Adds ModelOption.STREAM_TIMEOUT (default 60 s) and threads it through
all backend send_to_queue call sites. A hung or stalled backend iterator
now surfaces as a TimeoutError in the queue rather than blocking forever.
Addresses the root cause reported in generative-computing#650.

Adds a _MAX_PARTS=1000 guard to the retriever parquet download loop so
a malformed or unexpectedly large dataset cannot exhaust disk.

Prototype / RFC — see generative-computing#1235 for discussion. Default timeout value and
whether MAX_NEW_TOKENS should also gain a default cap are open questions.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@github-actions github-actions Bot added the bug Something isn't working label Jun 9, 2026
Adds a complete ModelOption reference table to configure-model-options,
a streaming-timeout section explaining the default 60 s inter-chunk
bound and how to override it, and a matching note on the async/streaming
how-to page.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@planetf1 planetf1 marked this pull request as ready for review June 9, 2026 10:31
@planetf1 planetf1 requested review from a team as code owners June 9, 2026 10:31
planetf1 added 3 commits June 9, 2026 11:47
The largest corpus in frreiss/mt-rag-embeddings currently has 20 parts.
1000 was effectively unbounded for this dataset; 50 gives reasonable
headroom while actually catching runaway downloads.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Backend defaults vary widely — vLLM defaults to 16 tokens which silently
truncates most responses. Added a warning callout and recommendation to
always set MAX_NEW_TOKENS explicitly in production code.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
- Rename _DEFAULT_CHUNK_TIMEOUT -> DEFAULT_CHUNK_TIMEOUT (public export);
  all backends now import and reference it, eliminating the hardcoded 60.0
- Use aiter()/anext() builtins instead of __aiter__/__anext__ dunders
- Close underlying AsyncGenerator on timeout via isinstance guard
- Clarify docstring: timeout covers TTFT as well as inter-chunk gaps; no
  trailing sentinel after TimeoutError (exception item is the terminator)
- Fix off-by-one in retriever download guard: allow _MAX_PARTS+1 loop
  iterations so a corpus of exactly _MAX_PARTS parts can see the 404
- Add tests: timeout fires + no trailing sentinel, None disables timeout,
  DEFAULT_CHUNK_TIMEOUT value exported correctly (all 14 pass)
- Update docs: explicit TTFT note, add 300s slow-model example alongside
  the None disable example

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@planetf1 planetf1 force-pushed the fix/stream-liveness-bounds branch from 0a1e230 to 036e813 Compare June 9, 2026 10:56
Backend-default STREAM_TIMEOUT was silently dropped because the call sites
read from the raw per-call model_options instead of the merged model_opts
dict produced by _simplify_and_merge(). This meant setting STREAM_TIMEOUT
on a backend constructor or via push_model_options() had no effect.

Fix: four sites (ollama, litellm, watsonx, openai standard path) now read
`model_opts.get(ModelOption.STREAM_TIMEOUT, DEFAULT_CHUNK_TIMEOUT)` in line
with every other option lookup in those functions. HF's three sites were
already correct (they receive the merged dict under the model_options param).

Also: update model_options.py docstrings to say "per-chunk" and note
time-to-first-token coverage; bump test timeout from 0.05s to 0.5s to
reduce flakiness risk on loaded CI runners.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>

@ajbozarth ajbozarth left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick review from Claude to start:

Nice scoped fix — the streaming hang really did need bounding, and the API surface (single DEFAULT_CHUNK_TIMEOUT constant, single STREAM_TIMEOUT ModelOption, None to disable) is restrained. Tests pin the right invariants, especially that a timeout uses the exception itself as the stream terminator with no trailing sentinel.

Four action items inline. The (model_options or {}).get(...) consistency item and the aclose() reach-through are worth addressing in this PR; the cancel-hook miss can be a follow-up.

Comment thread mellea/backends/huggingface.py Outdated
Comment thread mellea/helpers/async_helpers.py Outdated
Comment thread mellea/backends/huggingface.py
Comment thread docs/docs/how-to/configure-model-options.md
Replace the isinstance(ait, AsyncGenerator) guard before aclose() with
duck-typing via getattr — OpenAI's AsyncStream and HF's
AsyncTextIteratorStreamer are class-based AsyncIterators, not
AsyncGenerator instances, so the original branch was unreachable on the
two streaming paths that matter most. The new path handles both aclose()
(async generators) and close() (class-based iterators), logging any
cleanup failure at DEBUG rather than swallowing it silently.

Drop the `(model_options or {})` defensive wrapper at the four
STREAM_TIMEOUT read sites (hf.py:3, openai.py:1). The parameter is
typed dict[str, Any] and is always a merged non-None dict at those call
sites — the guard was dead code and inconsistent with the other four
sites in the same files.

Add TODO(generative-computing#1242) at both _cancel_hook arming sites noting that direct
avalue()/astream() callers do not fire the cancel event on
STREAM_TIMEOUT; the stream_with_chunking path already mitigates this.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@planetf1 planetf1 requested a review from ajbozarth June 9, 2026 18:07

@ajbozarth ajbozarth left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM on a code read through and Claude says so too

@planetf1 planetf1 enabled auto-merge June 10, 2026 10:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reliability: missing resource bounds on streaming (liveness hang, download guard, token cap) bug: generate_from_raw hanging when ollama under load

2 participants