quiet-node · quiet-node · Jun 21, 2026 · Jun 19, 2026 · Jun 19, 2026 · Jun 19, 2026
diff --git a/docs/configurations.md b/docs/configurations.md
@@ -151,7 +151,7 @@ Upgrading from an older version is automatic: a pre-providers config with a flat
 | Constant          | Default    | Tunable? | Bounds              | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
 | :---------------- | :--------- | :------- | :------------------ | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `active_provider` | `"builtin"` | Yes      | id of a provider    | Which provider receives inference. Must match the `id` of one of the `[[inference.providers]]` entries; an empty or dangling value resets to `builtin`. Exception: a config that predates the providers list is pinned to `ollama` on load, because no working built-in provider existed when that file was written.                                                                                                                                                                                                                                                                                              |
-| `num_ctx`         | `16384`    | Yes      | `[2048, 1048576]`   | Context window size in tokens sent to the active provider with every request. For the built-in engine, the value becomes `--ctx-size` when the `llama-server` process starts, so changing it restarts the engine. For Ollama, warmup and chat share this value so the same runner instance and its cached KV prefix for the system prompt are reused: they must match or Ollama creates a second runner and the warmup saves nothing. Ollama silently clamps this to the model's physical maximum. For OpenAI-compatible providers the value is informational only; the server controls the actual context. Raise to fit longer conversations: each doubling roughly doubles VRAM for the KV cache; lower to reclaim GPU memory. See [Tuning the Context Window](./tuning-context-window.md). |
+| `num_ctx`         | `16384`    | Yes      | `[2048, 1048576]`   | Context window size in tokens sent to the active provider with every request. For the built-in engine, the value becomes `--ctx-size` when the `llama-server` process starts, so changing it restarts the engine. For Ollama, warmup and chat share this value so the same runner instance and its cached KV prefix for the system prompt are reused: they must match or Ollama creates a second runner and the warmup saves nothing. Ollama silently clamps this to the model's physical maximum. For OpenAI-compatible providers the value is informational only; the server controls the actual context. Raise to fit longer conversations: the KV cache grows roughly linearly with the context size (the model weights stay the same), so each doubling roughly doubles its memory footprint; benchmark on your hardware before pushing it high, and lower to reclaim memory. See [Tuning the Context Window](./tuning-context-window.md). |
 | `keep_warm_inactivity_minutes` | `0` | Yes | `-1` or `[0, 1440]` | Minutes of inactivity before Thuki releases the active model from memory. Governs both local providers: the built-in engine stops its sidecar to free RAM, and Ollama is told to release the model from VRAM. Not applicable to a remote OpenAI-compatible server, whose residency Thuki does not manage. `0` uses the provider's natural short default (about 5 minutes): Ollama defers to its own timer, the built-in engine applies its own ~5-minute timer (`DEFAULT_BUILTIN_IDLE_MINUTES`). `-1` keeps the model resident forever. Raise for longer sessions between uses; lower to reclaim memory sooner. |
 
 Each `[[inference.providers]]` block has these fields:
@@ -185,11 +185,22 @@ The table below also lists the baked-in safety limits that govern Thuki's commun
 | `DEFAULT_BUILTIN_IDLE_MINUTES`              | `5 min`  | No       | The fixed translation of the `keep_warm_inactivity_minutes = 0` sentinel for the built-in engine, not a separate preference. The built-in engine has no external daemon to defer to, so `0` ("use the provider's natural short default") resolves to this value. Users who want a different timeout set `keep_warm_inactivity_minutes` directly (`N` minutes, or `-1` for forever). | —      | The idle window the built-in engine applies when `keep_warm_inactivity_minutes` is `0`. After this many minutes of inactivity the sidecar is stopped to free RAM. |
 | `ENGINE_HEALTH_PROBE_TIMEOUT_SECS`          | `5 s`    | No       | Internal lifecycle contract between the runner and the engine process. A wedged-but-connected server must not park the poll loop forever; loopback probes are normally instant so 5 s is generous. The poll interval and deadline are the user-facing knobs. | —      | How long a single `/health` GET is allowed to take inside the startup poll loop. If the engine has accepted the TCP connection but stopped responding, this timeout causes the probe to return an error (treated as Wait and retried after `ENGINE_HEALTH_POLL_INTERVAL_MS`). |
 | `ENGINE_COMMAND_QUEUE_CAPACITY`             | `64`     | No       | Bounds memory under command bursts; 64 slots is ample for all UI-driven traffic (Ensure, Touch, SetIdleMinutes, Shutdown) under any realistic usage pattern. | —      | Capacity of the bounded `mpsc` channel that carries commands from `EngineHandle` to the runner actor task. Back-pressure from a full queue is not observable in normal use. |
+| `ENGINE_STDERR_TAIL_LINES`                  | `20`     | No       | Defense-in-depth bound on captured subprocess output: 20 lines cover the load-error block `llama-server` prints on exit without retaining its whole log. | —      | Number of trailing `llama-server` stderr lines the runner keeps so a crash can report the engine's own reason (e.g. `unknown model architecture`) instead of a generic message. |
+| `ENGINE_STDERR_TAIL_LINE_MAX_BYTES`         | `500`    | No       | Defense-in-depth bound on attacker-influenced data: a single pathological newline-less stderr line (e.g. an enormous architecture string echoed from crafted GGUF metadata) is capped during the read, so neither peak read buffering nor the retained tail can grow without limit. | —      | Maximum bytes buffered and retained per captured engine stderr line. |
+| `ENGINE_CRASH_FALLBACK_MESSAGE`             | `"engine process exited unexpectedly"` | No | Internal diagnostic fallback surfaced only when the real reason is unavailable; not meaningful to expose. | n/a | Reason reported when the built-in engine process exits without leaving any stderr to capture (e.g. an external `SIGKILL`). |
 | `DOWNLOAD_PROGRESS_MIN_INTERVAL_MS`         | `500 ms` | No       | Pure IPC hygiene: a fast local connection can deliver thousands of chunks per second and the UI only needs a few updates per second, so throttling below the UI refresh rate is invisible to the user. | —      | Minimum interval between `Progress` events emitted while a model file downloads. An update is also emitted whenever at least 1% of the file has arrived since the last one, whichever comes first, and a final 100% update always precedes verification. |
 | `BLOB_HASH_BUFFER_BYTES`                     | `4 MiB`  | No       | Internal I/O buffer with no user-visible effect beyond verify speed. A few-MB buffer turns hashing a multi-GB blob into a few hundred reads instead of hundreds of thousands. | —      | Read-buffer size for streaming a downloaded blob through SHA-256 during verification. The common path hashes bytes as they download, so this applies only to a full-length partial left from a prior run or a resumed download's on-disk prefix. |
 | `MAX_HF_API_BODY_BYTES`                     | `4 MiB`  | No       | Defense-in-depth bound on attacker-controlled data from a remote service, mirroring `MAX_OLLAMA_TAGS_BODY_BYTES`. | —      | The largest Hugging Face API response body (repo file listings) Thuki will accept while resolving a model to download. Larger responses are rejected mid-stream and the request returns an error. |
+| `MAX_GGUF_KV_COUNT`                         | `4096`   | No       | Defense-in-depth bound on a downloaded GGUF's metadata-key count. A corrupt or hostile `metadata_kv_count` could otherwise drive an unbounded scan; real models carry a few dozen entries, so 4096 never truncates legitimate metadata. | —      | The most GGUF metadata key-value pairs the reasoning classifier scans when reading a downloaded model's chat template. Scanning stops at the cap. |
+| `MAX_GGUF_KEY_BYTES`                        | `1 KiB`  | No       | Defense-in-depth bound on a downloaded GGUF's metadata-key length. Keys are short dotted identifiers (`tokenizer.chat_template`); capping the length stops a corrupt length field from forcing a large allocation. | —      | The longest GGUF metadata key the reasoning classifier will read. A longer key stops the scan. |
+| `MAX_GGUF_STRING_BYTES`                     | `4 MiB`  | No       | Defense-in-depth bound on a downloaded GGUF's string values. Real chat templates run a few KB to ~100 KB; 4 MiB never truncates one while bounding the memory a corrupt length field can demand. | —      | The largest GGUF string value (the chat template or architecture) the reasoning classifier will materialize. A larger value stops the scan and the model relies on the runtime backstop instead. |
 | `HF_API_TIMEOUT_SECS`                       | `15 s`   | No       | Protocol cap on a hung remote service so the download UI cannot stall on metadata resolution; 15 s is generous for a small metadata call over the internet. | —      | How long Thuki waits for a Hugging Face API metadata call (repo file listing) to respond before giving up. Applies to resolving pasted repo ids and listing a repo's GGUF files, not to the model download itself. |
 | `HF_BASE_URL`                               | `https://huggingface.co` | No | Single origin for model metadata and downloads. Provenance comes from the pinned repo revisions in the curated starter registry, and those pins are only meaningful against the canonical Hub; an arbitrary mirror could serve different content under the same revision ids. | — | The Hugging Face origin Thuki uses for all model metadata calls and blob downloads. Every starter in the registry pins a repo at an exact revision and carries a compiled-in sha256 digest checked after download; the digest catches truncation, bit rot, and resume corruption, while the pinned revision on the canonical Hub is what fixes which content is fetched. |
+| `HF_SEARCH_LIMIT`                           | `30`     | No       | The per-page step for the in-app model browser. The "Load more" control raises the requested page size in multiples of this value, so it is a layout step rather than a user preference. | —      | How many GGUF model repos the first page of an in-app Hugging Face search returns, most-downloaded first. |
+| `HF_SEARCH_LIMIT_MAX`                        | `120`    | No       | Defense-in-depth bound on request size: "Load more" grows the requested page size in `HF_SEARCH_LIMIT` steps, and this caps the largest single request so a runaway page count cannot ask the Hub for an unbounded result set. | —      | The largest page size a single in-app Hugging Face search request may ask for, regardless of how many times "Load more" was pressed. |
+| `MAX_MODEL_CONTEXT_LENGTH`                   | `1 M`    | No       | Defense-in-depth bound on attacker-controlled GGUF metadata: a repo's `context_length` is editable (`gguf_set_metadata.py`) and occasionally inflated, so a value above this sane ceiling is treated as untrustworthy and dropped rather than shown. Mirrors the `num_ctx` upper bound; 1 M tokens covers every current model. | — | The largest model context window Thuki will trust and display from a Browse-all repo's parsed GGUF metadata. A larger declared value is dropped (no context window shown) rather than rendered. Curated Staff Picks models carry a hand-vetted value in the registry instead. |
+| `RUNTIME_OVERHEAD_GB`                        | `2.0`    | No       | Feeds the approximate RAM-fit hint shown in Library and Discover only; the authoritative per-starter memory estimates live in the model registry. A user-tunable overhead would imply a precision the hint does not claim. | —      | Resident-memory overhead added on top of a model's weights size (KV cache plus runtime buffers) when estimating whether it fits in this Mac's RAM. |
+| `MAX_HF_SEARCH_QUERY_LEN`                   | `200 bytes` | No    | Defense-in-depth bound on attacker-influenced input: the query reaches the fixed Hub host (no SSRF) and is percent-encoded by the client, but an unbounded string is still rejected to cap request size. | —      | The longest search string Thuki sends to the Hugging Face model search. A longer query is rejected before any network call. |
 | `OPENAI_MODELS_TIMEOUT_SECS`                | `5 s`    | No       | Protocol cap on a hung server so the Settings model dropdown cannot stall; the OpenAI-compatible server is local or LAN-hosted in the common case, so 5 s is generous. | —      | How long Thuki waits for an OpenAI-compatible server's `/v1/models` listing to respond before giving up. Applies to the Settings model dropdown for that provider, not to chat requests. |
 | `MAX_SSE_LINE_BYTES`                        | `1 MiB`  | No       | Defense-in-depth bound on attacker-controlled stream data. A malicious or broken chat server could otherwise grow a single stream line without limit and exhaust memory. | —      | The longest single Server-Sent-Events line Thuki accepts while streaming a chat response from an OpenAI-compatible (`/v1`) server. A stream line exceeding this aborts the response with an error. |