From 4eb3dd6f7bb35c86141a5cbb9c85c044e99f6e14 Mon Sep 17 00:00:00 2001
From: webdevtodayjason <jason@webdevtoday.com>
Date: Sat, 27 Jun 2026 11:54:59 -0500
Subject: [PATCH] =?UTF-8?q?docs:=20cover=200.4.44=20=E2=80=94=20quantize,?=
 =?UTF-8?q?=20federation,=20models=20redesign,=20secrets?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Bring the docs site up to the 0.4.44 product surface.

- add guides/quantize.mdx — in-browser AWQ / NVFP4 quantization, idle-node
  guardrail, HF push, Qwen3.5 multimodal/hybrid handling (AWQ verified;
  NVFP4-on-Qwen3.5 marked experimental)
- add guides/secrets.mdx — local Secrets store; HF read vs write tokens
  (write required for pushes, read-only rejected up front); NGC/W&B/OpenAI
- add guides/federation.mdx — master router (route /v1/* by model name),
  load/unload any node from the UI, model stacking + persistence, mem-util knob
- rewrite guides/models.mdx — two-list Installed/Browse redesign, Unload
  (was DELETE), serve-from-on-disk-weights
- introduction.mdx — add Quantization / Federated serving / Model stacking
  cards; add federation + quantization to the architecture box
- docs.json — add the three new pages to the Guides nav

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01JfM4xyZR4DdC3W74ea99Mi
---
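Note on guides/federation.mdx: the routing rule it describes (inspect the
`model` field of a `/v1/*` request, forward to the node serving that model)
can be sketched as below. This is only an illustration under stated
assumptions: the `ROUTES` table, the node addresses, and `pick_upstream` are
hypothetical names, not AINode's actual implementation, and failover is left
out.

```python
from typing import Optional

# Hypothetical routing table: model id -> base URL of the node serving it.
# The addresses and ports are illustrative, not AINode's real data.
ROUTES = {
    "Qwen/Qwen3.5-4B": "http://10.0.0.11:8000",
    "microsoft/Phi-4": "http://10.0.0.12:8000",
}


def pick_upstream(path: str, body: dict, routes: dict) -> Optional[str]:
    """Return the upstream URL for a /v1/* request, or None if it cannot be routed."""
    if not path.startswith("/v1/"):
        return None  # only /v1/* requests are routed by model name
    base = routes.get(body.get("model", ""))
    if base is None:
        return None  # no node currently serves the requested model
    return base + path


print(pick_upstream("/v1/chat/completions", {"model": "microsoft/Phi-4"}, ROUTES))
```

A request naming a model that no node serves returns `None` here; what the
real master does in that case is not specified in this patch.
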
 docs.json             |  3 ++
 guides/federation.mdx | 45 +++++++++++++++++++++++
 guides/models.mdx     | 48 ++++++++++++++++++++-----
 guides/quantize.mdx   | 84 +++++++++++++++++++++++++++++++++++++++++++
 guides/secrets.mdx    | 54 ++++++++++++++++++++++++++++
 introduction.mdx      | 14 ++++++--
 6 files changed, 237 insertions(+), 11 deletions(-)
 create mode 100644 guides/federation.mdx
 create mode 100644 guides/quantize.mdx
 create mode 100644 guides/secrets.mdx
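
Note on guides/secrets.mdx: the push-token rule (prefer the write slot, fall
back to the read slot only if it has write scope, reject before uploading)
might look roughly like this. `pick_push_token` and the `scope_of` callback
are hypothetical; how AINode actually detects a token's scope is not shown in
this patch.

```python
from typing import Callable, Optional


def pick_push_token(
    write_token: Optional[str],
    read_token: Optional[str],
    scope_of: Callable[[str], Optional[str]],
) -> str:
    """Pick the token for a Hub push, or raise before any upload starts."""
    # Try the write slot first, then the read slot; either must carry write scope.
    for token in (write_token, read_token):
        if token and scope_of(token) == "write":
            return token
    raise PermissionError("no write-scoped Hugging Face token; push rejected up front")


# Illustrative scope lookup backed by a dict (the token values are placeholders).
scopes = {"hf_write_example": "write", "hf_read_example": "read"}
print(pick_push_token("hf_write_example", "hf_read_example", scopes.get))
```

Raising before the upload mirrors the documented behavior that a read-only
token fails fast instead of ending a multi-GB transfer in a `403`.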
diff --git a/docs.json b/docs.json
index 7799eef..731fc18 100644
--- a/docs.json
+++ b/docs.json
@@ -66,7 +66,10 @@
               "guides/chat",
               "guides/models",
               "guides/training",
+              "guides/quantize",
               "guides/distributed",
+              "guides/federation",
+              "guides/secrets",
               "guides/update"
             ]
           },
diff --git a/guides/federation.mdx b/guides/federation.mdx
new file mode 100644
index 0000000..2d74f35
--- /dev/null
+++ b/guides/federation.mdx
@@ -0,0 +1,45 @@
+---
+title: "Federated Serving"
+description: "One master routes requests to whichever node holds each model — load and unload any model on any node from the browser."
+icon: "network-wired"
+---
+
+A cluster doesn't have to shard one giant model across every node. With
+**federation**, each node can serve its own model(s), and the **master** routes
+each `/v1/*` request to the node that holds the requested model — a single
+OpenAI-compatible endpoint in front of the whole fleet.
+
+## How routing works
+
+- The master inspects the `model` field of each `/v1/*` request and forwards it to
+  the node currently serving that model.
+- You manage the whole fleet from one browser tab: see every node, what each is
+  serving, and **load or unload a model on any node** without SSHing in.
+- If a target node is mid-swap or unreachable, the master applies **routing
+  failover** rather than hanging the request.
+
+<Note>
+Federation (one endpoint, many models, routed by name) is distinct from
+**distributed tensor-parallel** (one model sharded across nodes — see
+[Distributed Inference](/guides/distributed)). Use federation to serve many
+models across a fleet; use TP to serve one model too big for a single node.
+</Note>
+
+## Model stacking
+
+A single node can hold **more than one model at a time**. Stacked instances each
+get their own port, and they **persist across restarts** — AINode replays them on
+boot so your fleet comes back exactly as you left it. Loads are serialized so two
+models never try to claim the same memory at once.
+
+<Tip>
+To request a model by name, point your OpenAI client at the master's `/v1`
+endpoint and set `model` to the exact id you loaded. A specific stacked
+instance is also reachable directly on its own port.
+</Tip>
+
+## Per-node memory headroom
+
+Each node exposes a **memory-utilization** knob so you can leave room for a second
+stacked model or for a quantization job (which needs the full unified memory — see
+[Quantize a Model](/guides/quantize)).
diff --git a/guides/models.mdx b/guides/models.mdx
index d1c886d..17ce662 100644
--- a/guides/models.mdx
+++ b/guides/models.mdx
@@ -1,43 +1,73 @@
 ---
 title: "Models"
-description: "Download, manage, and launch models from the browser."
+description: "Download, manage, launch, and unload models from the browser."
 icon: "cube"
 ---
 
+The **Models** page is split into two lists:
+
+- **Installed** — models already on disk, ready to launch. Shows quantization,
+  size, and architecture per model.
+- **Browse** — the live catalog: curated picks plus live Hugging Face search.
+
 ## Downloading models
 
-Open the **Downloads** tab. Three ways to add a model:
+In **Browse**, add a model three ways:
 
-1. **Catalog** — 50+ curated models, click to download
-2. **Search** — search Hugging Face live
-3. **Custom** — paste any HF repo ID (e.g. `microsoft/Phi-4`)
+1. **Catalog** — curated models (including frontier MoE and NVFP4/AWQ picks for
+   GB10 clusters), click to download.
+2. **Search** — search Hugging Face live; results show **FITS GPU** / availability
+   badges computed from your cluster's aggregate VRAM.
+3. **Custom** — paste any HF repo id (e.g. `microsoft/Phi-4`).
 
 Downloads run in the background with a progress bar. Click **✕** to cancel.
+Already-downloaded models show as **Installed** instead of offering a re-download.
 
 ## Launching a model
 
-Once downloaded, click the model card → **▶ Launch Model**. AINode saves the model to config and restarts the engine.
+In **Installed**, click a model → **▶ Launch**. On a single node AINode serves it
+directly; on a cluster you can pick which nodes it spans (see
+[Distributed Inference](/guides/distributed)) or serve different models on
+different nodes (see [Federated Serving](/guides/federation)).
+
+## Unloading a model
+
+Each running instance has an **Unload** control that stops just that instance and
+frees its memory. (Unload force-clears a dead or phantom instance too, so a node
+that stopped responding doesn't stay stuck "loaded".) Unload a node's models
+before starting a [quantization job](/guides/quantize) — quantization needs the
+full unified memory.
+
+## Serving from on-disk weights
+
+AINode serves directly from the weights in `~/.ainode/models/<slug>`. A model you
+downloaded once — or produced with a [quantization](/guides/quantize) or
+[fine-tuning](/guides/training) job — launches immediately, with no re-download.
 
 ## Gated models
 
-Llama, Gemma, and other gated repos require a Hugging Face token:
+Llama, Gemma, and other gated repos require a Hugging Face **read** token:
 
 ```bash
 ainode config --hf-token hf_xxxxxxxxxxxxxxxx
 ```
 
+See [Tokens & Secrets](/guides/secrets) for read vs write tokens.
+
 ## Supported models
 
-Any model compatible with vLLM works. Tested models include:
+Any model compatible with vLLM works. Tested families include:
 
 - Meta Llama 3 / 3.1 (8B, 70B, 405B)
-- Qwen 2.5 (0.5B → 72B)
+- Qwen 2.5 and Qwen3 / Qwen3.5 (incl. frontier MoE, e.g. Qwen3-235B-A22B)
 - Mistral / Mixtral
 - Google Gemma 2
 - DeepSeek V3
 - Phi-3 / Phi-4
 - Command R+
 
+AWQ (`awq_marlin`) and NVFP4 quantized checkpoints serve natively on GB10.
+
 ## Shared storage across cluster
 
 To avoid downloading the same model on every node, mount a shared NFS path:
diff --git a/guides/quantize.mdx b/guides/quantize.mdx
new file mode 100644
index 0000000..e3930a9
--- /dev/null
+++ b/guides/quantize.mdx
@@ -0,0 +1,84 @@
+---
+title: "Quantize a Model"
+description: "Compress any model to AWQ or NVFP4 in the browser, then serve it or push it to Hugging Face."
+icon: "compress"
+---
+
+AINode quantizes models **on your own GPU** — no external service, no notebook.
+A quant job runs llm-compressor one-shot post-training quantization (PTQ) in a GPU
+container and writes a compressed-tensors checkpoint that vLLM serves natively.
+
+## Run a quantization
+
+Open **Training → Quantize a Model**.
+
+<Steps>
+  <Step title="Base model">
+    A Hugging Face repo id (e.g. `Qwen/Qwen3.5-4B`) or a model already in **Installed**.
+    An on-disk copy is used automatically when present, so the job can run offline and reproducibly.
+  </Step>
+  <Step title="Scheme">
+    - **AWQ** — W4A16, 4-bit weights. Serves as `awq_marlin` on GB10. **Proven.**
+    - **NVFP4** — Blackwell-native 4-bit float. Newer; verified on dense text models.
+  </Step>
+  <Step title="Calibration samples">
+    Default **256**, drawn from `HuggingFaceH4/ultrachat_200k`. More samples improve
+    calibration but lengthen the job.
+  </Step>
+  <Step title="Push to Hugging Face (optional)">
+    Tick **Push result to Hugging Face** and (optionally) name the repo. Requires
+    a **write** token — see [Secrets](/guides/secrets). The push happens after the
+    job finishes and creates a **private** repo under your token's namespace.
+  </Step>
+</Steps>
+
+<Warning>
+The target node must be **idle**. Quantization needs the full unified memory, so
+AINode rejects a quant job with a `409` while a model is loaded. **Unload all
+models on the node first**, or resubmit with `force: true`.
+</Warning>
+
+When the job finishes, the result appears in **Installed** as
+`<org--name>-<scheme>` (e.g. `Qwen--Qwen3.5-4B-awq`), ready to launch.
+
+## Multimodal & hybrid models (Qwen3.5)
+
+Qwen3.5 models bundle a vision tower and use Gated-DeltaNet linear attention.
+AINode handles this automatically: it loads the full model class so the saved
+config is complete (vLLM-servable), keeps the vision tower, embeddings, `lm_head`
+and the `linear_attn` projections in bf16, and saves the image processor.
+
+<Note>
+**AWQ on Qwen3.5 is verified servable. NVFP4 on Qwen3.5 is experimental and not
+yet validated** — prefer AWQ for the Qwen3.5 family today.
+</Note>
+
+## API
+
+```bash
+# Start a quant job (idle node required)
+curl -X POST http://localhost:3000/api/training/jobs \
+  -H 'Content-Type: application/json' \
+  -d '{
+        "method": "quantize",
+        "base_model": "Qwen/Qwen3.5-4B",
+        "scheme": "awq",
+        "calib_samples": 256,
+        "push_to_hf": false
+      }'
+
+# Poll status / progress (set JOB_ID to your job's id first)
+curl "http://localhost:3000/api/training/jobs/$JOB_ID"
+
+# When done, the quantized model is listed in the catalog
+curl http://localhost:3000/api/models | grep awq
+```
+
+To push to the Hub, add `"push_to_hf": true` (and optionally `"hf_repo": "name"`).
+The job validates write scope **before** running, so a read-only token fails fast.
+
+## Why 4-bit on GB10
+
+GB10 decode is memory-bandwidth bound. 4-bit weights mean fewer bytes read per
+token, so a quantized model both **fits more easily** and **decodes faster**.
+AWQ (`awq_marlin`) is the proven kernel path on GB10's sm120.
diff --git a/guides/secrets.mdx b/guides/secrets.mdx
new file mode 100644
index 0000000..08fd755
--- /dev/null
+++ b/guides/secrets.mdx
@@ -0,0 +1,54 @@
+---
+title: "Tokens & Secrets"
+description: "Store Hugging Face, NGC, W&B, and OpenAI credentials locally — read vs write HF tokens."
+icon: "key"
+---
+
+AINode keeps credentials in a local **Secrets store** at `~/.ainode/secrets.json`
+(file mode `0600`, values obfuscated at rest). Secrets never appear in API
+responses — the UI only ever shows a masked value (last 4 characters). Set them in
+**Config → Secrets**, where each credential has a **Test** button that reports the
+detected scope without revealing the value.
+
+## Hugging Face: read vs write
+
+There are two HF token slots, and the difference matters:
+
+| Slot | Used for | Required scope |
+|---|---|---|
+| **HuggingFace Token (read)** | Download gated models (Llama, Gemma, …) | `read` is enough |
+| **HuggingFace Token (write)** | **Push** quantized / fine-tuned models to the Hub | `write` (or a write-scoped fine-grained token) |
+
+AINode prefers the **write** token for pushes and falls back to the read slot
+only if the token stored there has write scope. A read-only token is
+**rejected up front**, before any multi-GB upload starts, so you never wait
+through a transfer that ends in a `403`.
+
+<Tip>
+Keep your everyday downloads on a `read` token and add a separate `write` token
+only when you intend to push models back to the Hub.
+</Tip>
+
+### Setting the read token from the CLI
+
+```bash
+ainode config --hf-token hf_xxxxxxxxxxxxxxxx   # set
+ainode config --hf-token ""                    # clear
+```
+
+This populates the read slot and propagates automatically to gated-model
+downloads and training jobs. The write token is set in the **Secrets** UI.
+
+## Other supported credentials
+
+| Secret | Purpose |
+|---|---|
+| **NVIDIA NGC API Key** | Pull NGC-hosted models and containers |
+| **Weights & Biases API Key** | Stream training metrics to W&B |
+| **OpenAI API Key** | Benchmark comparisons against hosted OpenAI models |
+
+## Where pushes land
+
+When you push a quantized or fine-tuned model, AINode creates a **private** repo
+under the write token's namespace (`<your-namespace>/<repo-name>`). See
+[Quantize a Model](/guides/quantize) for the push workflow.
diff --git a/introduction.mdx b/introduction.mdx
index 5d74dc8..4fac734 100644
--- a/introduction.mdx
+++ b/introduction.mdx
@@ -35,9 +35,18 @@ curl -fsSL https://ainode.dev/install | bash
   <Card title="Fine-tuning" icon="brain" href="/guides/training">
     LoRA, QLoRA, and full fine-tune from the browser. No notebooks.
   </Card>
+  <Card title="Quantization" icon="compress" href="/guides/quantize">
+    Compress any model to AWQ or NVFP4 on your own GPU, then serve it or push it to Hugging Face.
+  </Card>
   <Card title="Multi-node clustering" icon="server" href="/guides/distributed">
     Automatic peer discovery. One model sharded across all GPUs in the cluster.
   </Card>
+  <Card title="Federated serving" icon="network-wired" href="/guides/federation">
+    One endpoint, many models — the master routes each request to the node that holds the model.
+  </Card>
+  <Card title="Model stacking" icon="layer-group" href="/guides/federation">
+    Run several models per node. They persist across restarts and replay on boot.
+  </Card>
   <Card title="Prometheus metrics" icon="chart-line" href="/api-reference/metrics">
     `/metrics` endpoint for Grafana, Prometheus, VictoriaMetrics.
   </Card>
@@ -147,9 +156,10 @@ AINode ships as a single unified container image. Every node in the cluster runs
 ghcr.io/getainode/ainode:latest   ← pulled by the installer
          │
          ├── aiohttp web server (chat UI + API proxy, port 3000 / 8000)
-         ├── vLLM inference engine (launched as subprocess)
+         ├── federated master router (routes /v1/* by model name)
+         ├── vLLM inference engine (one or more instances per node)
          ├── UDP discovery broadcaster (port 5679)
-         └── training pipeline (LoRA / QLoRA / Full)
+         └── training pipeline (LoRA / QLoRA / Full + quantization: AWQ / NVFP4)
 ```
 
 No host Python venv. No source builds. Upgrade is `ainode update`.