diff --git a/docs.json b/docs.json
index 7799eef..731fc18 100644
--- a/docs.json
+++ b/docs.json
@@ -66,7 +66,10 @@
"guides/chat",
"guides/models",
"guides/training",
+ "guides/quantize",
"guides/distributed",
+ "guides/federation",
+ "guides/secrets",
"guides/update"
]
},
diff --git a/guides/federation.mdx b/guides/federation.mdx
new file mode 100644
index 0000000..2d74f35
--- /dev/null
+++ b/guides/federation.mdx
@@ -0,0 +1,45 @@
+---
+title: "Federated Serving"
+description: "One master routes requests to whichever node holds each model — load and unload any model on any node from the browser."
+icon: "network-wired"
+---
+
+A cluster doesn't have to shard one giant model across every node. With
+**federation**, each node can serve its own model(s), and the **master** routes
+each `/v1/*` request to the node that holds the requested model — a single
+OpenAI-compatible endpoint in front of the whole fleet.
+
+## How routing works
+
+- The master inspects the `model` field of each `/v1/*` request and forwards it to
+ the node currently serving that model.
+- You manage the whole fleet from one browser tab: see every node, what each is
+ serving, and **load or unload a model on any node** without SSHing in.
+- If a target node is mid-swap or unreachable, the master applies **routing
+ failover** rather than hanging the request.
+
+
+Federation (one endpoint, many models, routed by name) is distinct from
+**distributed tensor-parallel** (one model sharded across nodes — see
+[Distributed Inference](/guides/distributed)). Use federation to serve many
+models across a fleet; use TP to serve one model too big for a single node.
+
+
+## Model stacking
+
+A single node can hold **more than one model at a time**. Stacked instances each
+get their own port, and they **persist across restarts** — AINode replays them on
+boot so your fleet comes back exactly as you left it. Loads are serialized so two
+models never try to claim the same memory at once.
+
+
+When you ask the master for a model by name, point your OpenAI client at the
+master's `/v1` endpoint and set `model` to the exact id you loaded. A specific
+stacked instance is also reachable directly on its own port.
+
+
+## Per-node memory headroom
+
+Each node exposes a **memory-utilization** knob so you can leave room for a second
+stacked model or for a quantization job (which needs the full unified memory — see
+[Quantize a Model](/guides/quantize)).
diff --git a/guides/models.mdx b/guides/models.mdx
index d1c886d..17ce662 100644
--- a/guides/models.mdx
+++ b/guides/models.mdx
@@ -1,43 +1,73 @@
---
title: "Models"
-description: "Download, manage, and launch models from the browser."
+description: "Download, manage, launch, and unload models from the browser."
icon: "cube"
---
+The **Models** page is split into two lists:
+
+- **Installed** — models already on disk, ready to launch. Shows quantization,
+ size, and architecture per model.
+- **Browse** — the live catalog: curated picks plus live Hugging Face search.
+
## Downloading models
-Open the **Downloads** tab. Three ways to add a model:
+In **Browse**, add a model three ways:
-1. **Catalog** — 50+ curated models, click to download
-2. **Search** — search Hugging Face live
-3. **Custom** — paste any HF repo ID (e.g. `microsoft/Phi-4`)
+1. **Catalog** — curated models (including frontier MoE and NVFP4/AWQ picks for
+ GB10 clusters), click to download.
+2. **Search** — search Hugging Face live; results show **FITS GPU** / availability
+ badges computed from your cluster's aggregate VRAM.
+3. **Custom** — paste any HF repo id (e.g. `microsoft/Phi-4`).
Downloads run in the background with a progress bar. Click **✕** to cancel.
+Already-downloaded models show as **Installed** instead of offering a re-download.
## Launching a model
-Once downloaded, click the model card → **▶ Launch Model**. AINode saves the model to config and restarts the engine.
+In **Installed**, click a model → **▶ Launch**. On a single node AINode serves it
+directly; on a cluster you can pick which nodes span it (see
+[Distributed Inference](/guides/distributed)) or serve different models on
+different nodes (see [Federated Serving](/guides/federation)).
+
+## Unloading a model
+
+Each running instance has an **Unload** control that stops just that instance and
+frees its memory. (Unload force-clears a dead or phantom instance too, so a node
+that stopped responding doesn't stay stuck "loaded".) Unload a node's models
+before starting a [quantization job](/guides/quantize) — quantization needs the
+full unified memory.
+
+## Serving from on-disk weights
+
+AINode serves directly from the weights in `~/.ainode/models/`. A model you
+downloaded once — or produced with a [quantization](/guides/quantize) or
+[fine-tuning](/guides/training) job — is immediately launchable; no re-download.
## Gated models
-Llama, Gemma, and other gated repos require a Hugging Face token:
+Llama, Gemma, and other gated repos require a Hugging Face **read** token:
```bash
ainode config --hf-token hf_xxxxxxxxxxxxxxxx
```
+See [Tokens & Secrets](/guides/secrets) for read vs write tokens.
+
## Supported models
-Any model compatible with vLLM works. Tested models include:
+Any model compatible with vLLM works. Tested families include:
- Meta Llama 3 / 3.1 (8B, 70B, 405B)
-- Qwen 2.5 (0.5B → 72B)
+- Qwen 2.5 and Qwen3 / Qwen3.5 (incl. frontier MoE, e.g. Qwen3-235B-A22B)
- Mistral / Mixtral
- Google Gemma 2
- DeepSeek V3
- Phi-3 / Phi-4
- Command R+
+AWQ (`awq_marlin`) and NVFP4 quantized checkpoints serve natively on GB10.
+
## Shared storage across cluster
To avoid downloading the same model on every node, mount a shared NFS path:
diff --git a/guides/quantize.mdx b/guides/quantize.mdx
new file mode 100644
index 0000000..e3930a9
--- /dev/null
+++ b/guides/quantize.mdx
@@ -0,0 +1,84 @@
+---
+title: "Quantize a Model"
+description: "Compress any model to AWQ or NVFP4 in the browser, then serve it or push it to Hugging Face."
+icon: "compress"
+---
+
+AINode quantizes models **on your own GPU** — no external service, no notebook.
+A quant job runs llm-compressor one-shot PTQ inside a GPU container and writes a
+compressed-tensors checkpoint that vLLM serves natively.
+
+## Run a quantization
+
+Open **Training → Quantize a Model**.
+
+
+
+ A Hugging Face repo id (`Qwen/Qwen3.5-4B`) or a model already in **Installed**.
+ An on-disk copy is used automatically when present (offline + reproducible).
+
+
+ - **AWQ** — W4A16, 4-bit weights. Serves as `awq_marlin` on GB10. **Proven.**
+ - **NVFP4** — Blackwell-native 4-bit float. Newer; verified on dense text models.
+
+
+ Default **256**, drawn from `HuggingFaceH4/ultrachat_200k`. More samples =
+ better calibration, longer job.
+
+
+ Tick **Push result to Hugging Face** and (optionally) name the repo. Requires
+ a **write** token — see [Secrets](/guides/secrets). The push happens after the
+ job finishes and creates a **private** repo under your token's namespace.
+
+
+
+
+The target node must be **idle**. Quantization needs the full unified memory, so
+AINode refuses to start a quant job while a model is loaded — **unload all models
+on the node first** (or resubmit with `force: true`). You'll get a `409` otherwise.
+
+
+When the job finishes, the result appears in **Installed** as
+`-` (e.g. `Qwen--Qwen3.5-4B-awq`), ready to launch.
+
+## Multimodal & hybrid models (Qwen3.5)
+
+Qwen3.5 models bundle a vision tower and use Gated-DeltaNet linear attention.
+AINode handles this automatically: it loads the full model class so the saved
+config is complete (vLLM-servable), keeps the vision tower, embeddings, `lm_head`
+and the `linear_attn` projections in bf16, and saves the image processor.
+
+
+**AWQ on Qwen3.5 is verified servable. NVFP4 on Qwen3.5 is experimental and not
+yet validated** — prefer AWQ for the Qwen3.5 family today.
+
+
+## API
+
+```bash
+# Start a quant job (idle node required)
+curl -X POST http://localhost:3000/api/training/jobs \
+ -H 'Content-Type: application/json' \
+ -d '{
+ "method": "quantize",
+ "base_model": "Qwen/Qwen3.5-4B",
+ "scheme": "awq",
+ "calib_samples": 256,
+ "push_to_hf": false
+ }'
+
+# Poll status / progress
+curl http://localhost:3000/api/training/jobs/{job_id}
+
+# When done, the quantized model is listed in the catalog
+curl http://localhost:3000/api/models | grep awq
+```
+
+To push to the Hub, add `"push_to_hf": true` (and optionally `"hf_repo": "name"`).
+The job validates write scope **before** running, so a read-only token fails fast.
+
+## Why 4-bit on GB10
+
+GB10 decode is memory-bandwidth bound. 4-bit weights mean fewer bytes read per
+token, so a quantized model both **fits more easily** and **decodes faster**.
+AWQ (`awq_marlin`) is the proven kernel path on GB10's sm120.
diff --git a/guides/secrets.mdx b/guides/secrets.mdx
new file mode 100644
index 0000000..08fd755
--- /dev/null
+++ b/guides/secrets.mdx
@@ -0,0 +1,54 @@
+---
+title: "Tokens & Secrets"
+description: "Store Hugging Face, NGC, W&B, and OpenAI credentials locally — read vs write HF tokens."
+icon: "key"
+---
+
+AINode keeps credentials in a local **Secrets store** at `~/.ainode/secrets.json`
+(file mode `0600`, values obfuscated at rest). Secrets never appear in API
+responses — the UI only ever shows a masked value (last 4 characters). Set them in
+**Config → Secrets**, where each credential has a **Test** button that reports the
+detected scope without revealing the value.
+
+## Hugging Face: read vs write
+
+There are two HF token slots, and the difference matters:
+
+| Slot | Used for | Required scope |
+|---|---|---|
+| **HuggingFace Token (read)** | Download gated models (Llama, Gemma, …) | `read` is enough |
+| **HuggingFace Token (write)** | **Push** quantized / fine-tuned models to the Hub | `write` (or a write-scoped fine-grained token) |
+
+AINode prefers the **write** token for pushes and falls back to the read token
+only if it has write scope. A read-only token is **rejected up front** — before
+any multi-GB upload starts — so you never wait through a transfer that ends in a
+`403`.
+
+
+Keep your everyday downloads on a `read` token and add a separate `write` token
+only when you intend to push models back to the Hub.
+
+
+### Setting the read token from the CLI
+
+```bash
+ainode config --hf-token hf_xxxxxxxxxxxxxxxx # set
+ainode config --hf-token "" # clear
+```
+
+This populates the read slot and propagates automatically to gated-model
+downloads and training jobs. The write token is set in the **Secrets** UI.
+
+## Other supported credentials
+
+| Secret | Purpose |
+|---|---|
+| **NVIDIA NGC API Key** | Pull NGC-hosted models and containers |
+| **Weights & Biases API Key** | Stream training metrics to W&B |
+| **OpenAI API Key** | Benchmark comparisons against hosted OpenAI models |
+
+## Where pushes land
+
+When you push a quantized or fine-tuned model, AINode creates a **private** repo
+under the write token's namespace (`/`). See
+[Quantize a Model](/guides/quantize) for the push workflow.
diff --git a/introduction.mdx b/introduction.mdx
index 5d74dc8..4fac734 100644
--- a/introduction.mdx
+++ b/introduction.mdx
@@ -35,9 +35,18 @@ curl -fsSL https://ainode.dev/install | bash
LoRA, QLoRA, and full fine-tune from the browser. No notebooks.
+
+ Compress any model to AWQ or NVFP4 on your own GPU, then serve it or push it to Hugging Face.
+
Automatic peer discovery. One model sharded across all GPUs in the cluster.
+
+ One endpoint, many models — the master routes each request to the node that holds the model.
+
+
+ Run several models per node. They persist across restarts and replay on boot.
+
`/metrics` endpoint for Grafana, Prometheus, VictoriaMetrics.
@@ -147,9 +156,10 @@ AINode ships as a single unified container image. Every node in the cluster runs
ghcr.io/getainode/ainode:latest ← pulled by the installer
│
├── aiohttp web server (chat UI + API proxy, port 3000 / 8000)
- ├── vLLM inference engine (launched as subprocess)
+ ├── federated master router (routes /v1/* by model name)
+ ├── vLLM inference engine (one or more instances per node)
├── UDP discovery broadcaster (port 5679)
- └── training pipeline (LoRA / QLoRA / Full)
+ └── training pipeline (LoRA / QLoRA / Full + quantization: AWQ / NVFP4)
```
No host Python venv. No source builds. Upgrade is `ainode update`.