From 4eb3dd6f7bb35c86141a5cbb9c85c044e99f6e14 Mon Sep 17 00:00:00 2001 From: webdevtodayjason Date: Sat, 27 Jun 2026 11:54:59 -0500 Subject: [PATCH] =?UTF-8?q?docs:=20cover=200.4.44=20=E2=80=94=20quantize,?= =?UTF-8?q?=20federation,=20models=20redesign,=20secrets?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Bring the docs site up to the 0.4.44 product surface. - add guides/quantize.mdx — in-browser AWQ / NVFP4 quantization, idle-node guardrail, HF push, Qwen3.5 multimodal/hybrid handling (AWQ verified; NVFP4-on-Qwen3.5 marked experimental) - add guides/secrets.mdx — local Secrets store; HF read vs write tokens (write required for pushes, read-only rejected up front); NGC/W&B/OpenAI - add guides/federation.mdx — master router (route /v1/* by model name), load/unload any node from the UI, model stacking + persistence, mem-util knob - rewrite guides/models.mdx — two-list Installed/Browse redesign, Unload (was DELETE), serve-from-on-disk-weights - introduction.mdx — add Quantization / Federated serving / Model stacking cards; add federation + quantization to the architecture box - docs.json — add the three new pages to the Guides nav Co-Authored-By: Claude Opus 4.8 (1M context) Claude-Session: https://claude.ai/code/session_01JfM4xyZR4DdC3W74ea99Mi --- docs.json | 3 ++ guides/federation.mdx | 45 +++++++++++++++++++++++ guides/models.mdx | 48 ++++++++++++++++++++----- guides/quantize.mdx | 84 +++++++++++++++++++++++++++++++++++++++++++ guides/secrets.mdx | 54 ++++++++++++++++++++++++++++ introduction.mdx | 14 ++++++-- 6 files changed, 237 insertions(+), 11 deletions(-) create mode 100644 guides/federation.mdx create mode 100644 guides/quantize.mdx create mode 100644 guides/secrets.mdx diff --git a/docs.json b/docs.json index 7799eef..731fc18 100644 --- a/docs.json +++ b/docs.json @@ -66,7 +66,10 @@ "guides/chat", "guides/models", "guides/training", + "guides/quantize", "guides/distributed", + "guides/federation", + "guides/secrets", "guides/update" ] }, diff --git a/guides/federation.mdx b/guides/federation.mdx new file mode 100644 index 0000000..2d74f35 --- /dev/null +++ b/guides/federation.mdx @@ -0,0 +1,45 @@ +--- +title: "Federated Serving" +description: "One master routes requests to whichever node holds each model — load and unload any model on any node from the browser." +icon: "network-wired" +--- + +A cluster doesn't have to shard one giant model across every node. With +**federation**, each node can serve its own model(s), and the **master** routes +each `/v1/*` request to the node that holds the requested model — a single +OpenAI-compatible endpoint in front of the whole fleet. + +## How routing works + +- The master inspects the `model` field of each `/v1/*` request and forwards it to + the node currently serving that model. +- You manage the whole fleet from one browser tab: see every node, what each is + serving, and **load or unload a model on any node** without SSHing in. +- If a target node is mid-swap or unreachable, the master applies **routing + failover** rather than hanging the request. + + +Federation (one endpoint, many models, routed by name) is distinct from +**distributed tensor-parallel** (one model sharded across nodes — see +[Distributed Inference](/guides/distributed)). Use federation to serve many +models across a fleet; use TP to serve one model too big for a single node. + + +## Model stacking + +A single node can hold **more than one model at a time**. Stacked instances each +get their own port, and they **persist across restarts** — AINode replays them on +boot so your fleet comes back exactly as you left it. Loads are serialized so two +models never try to claim the same memory at once. + + +When you ask the master for a model by name, point your OpenAI client at the +master's `/v1` endpoint and set `model` to the exact id you loaded. A specific +stacked instance is also reachable directly on its own port. + + +## Per-node memory headroom + +Each node exposes a **memory-utilization** knob so you can leave room for a second +stacked model or for a quantization job (which needs the full unified memory — see +[Quantize a Model](/guides/quantize)). diff --git a/guides/models.mdx b/guides/models.mdx index d1c886d..17ce662 100644 --- a/guides/models.mdx +++ b/guides/models.mdx @@ -1,43 +1,73 @@ --- title: "Models" -description: "Download, manage, and launch models from the browser." +description: "Download, manage, launch, and unload models from the browser." icon: "cube" --- +The **Models** page is split into two lists: + +- **Installed** — models already on disk, ready to launch. Shows quantization, + size, and architecture per model. +- **Browse** — the live catalog: curated picks plus live Hugging Face search. + ## Downloading models -Open the **Downloads** tab. Three ways to add a model: +In **Browse**, add a model three ways: -1. **Catalog** — 50+ curated models, click to download -2. **Search** — search Hugging Face live -3. **Custom** — paste any HF repo ID (e.g. `microsoft/Phi-4`) +1. **Catalog** — curated models (including frontier MoE and NVFP4/AWQ picks for + GB10 clusters), click to download. +2. **Search** — search Hugging Face live; results show **FITS GPU** / availability + badges computed from your cluster's aggregate VRAM. +3. **Custom** — paste any HF repo id (e.g. `microsoft/Phi-4`). Downloads run in the background with a progress bar. Click **✕** to cancel. +Already-downloaded models show as **Installed** instead of offering a re-download. ## Launching a model -Once downloaded, click the model card → **▶ Launch Model**. AINode saves the model to config and restarts the engine. +In **Installed**, click a model → **▶ Launch**. On a single node AINode serves it +directly; on a cluster you can pick which nodes span it (see +[Distributed Inference](/guides/distributed)) or serve different models on +different nodes (see [Federated Serving](/guides/federation)). + +## Unloading a model + +Each running instance has an **Unload** control that stops just that instance and +frees its memory. (Unload force-clears a dead or phantom instance too, so a node +that stopped responding doesn't stay stuck "loaded".) Unload a node's models +before starting a [quantization job](/guides/quantize) — quantization needs the +full unified memory. + +## Serving from on-disk weights + +AINode serves directly from the weights in `~/.ainode/models/`. A model you +downloaded once — or produced with a [quantization](/guides/quantize) or +[fine-tuning](/guides/training) job — is immediately launchable; no re-download. ## Gated models -Llama, Gemma, and other gated repos require a Hugging Face token: +Llama, Gemma, and other gated repos require a Hugging Face **read** token: ```bash ainode config --hf-token hf_xxxxxxxxxxxxxxxx ``` +See [Tokens & Secrets](/guides/secrets) for read vs write tokens. + ## Supported models -Any model compatible with vLLM works. Tested models include: +Any model compatible with vLLM works. Tested families include: - Meta Llama 3 / 3.1 (8B, 70B, 405B) -- Qwen 2.5 (0.5B → 72B) +- Qwen 2.5 and Qwen3 / Qwen3.5 (incl. frontier MoE, e.g. Qwen3-235B-A22B) - Mistral / Mixtral - Google Gemma 2 - DeepSeek V3 - Phi-3 / Phi-4 - Command R+ +AWQ (`awq_marlin`) and NVFP4 quantized checkpoints serve natively on GB10. + ## Shared storage across cluster To avoid downloading the same model on every node, mount a shared NFS path: diff --git a/guides/quantize.mdx b/guides/quantize.mdx new file mode 100644 index 0000000..e3930a9 --- /dev/null +++ b/guides/quantize.mdx @@ -0,0 +1,84 @@ +--- +title: "Quantize a Model" +description: "Compress any model to AWQ or NVFP4 in the browser, then serve it or push it to Hugging Face." +icon: "compress" +--- + +AINode quantizes models **on your own GPU** — no external service, no notebook. +A quant job runs llm-compressor one-shot PTQ inside a GPU container and writes a +compressed-tensors checkpoint that vLLM serves natively. + +## Run a quantization + +Open **Training → Quantize a Model**. + + + + A Hugging Face repo id (`Qwen/Qwen3.5-4B`) or a model already in **Installed**. + An on-disk copy is used automatically when present (offline + reproducible). + + + - **AWQ** — W4A16, 4-bit weights. Serves as `awq_marlin` on GB10. **Proven.** + - **NVFP4** — Blackwell-native 4-bit float. Newer; verified on dense text models. + + + Default **256**, drawn from `HuggingFaceH4/ultrachat_200k`. More samples = + better calibration, longer job. + + + Tick **Push result to Hugging Face** and (optionally) name the repo. Requires + a **write** token — see [Secrets](/guides/secrets). The push happens after the + job finishes and creates a **private** repo under your token's namespace. + + + + +The target node must be **idle**. Quantization needs the full unified memory, so +AINode refuses to start a quant job while a model is loaded — **unload all models +on the node first** (or resubmit with `force: true`). You'll get a `409` otherwise. + + +When the job finishes, the result appears in **Installed** as +`-` (e.g. `Qwen--Qwen3.5-4B-awq`), ready to launch. + +## Multimodal & hybrid models (Qwen3.5) + +Qwen3.5 models bundle a vision tower and use Gated-DeltaNet linear attention. +AINode handles this automatically: it loads the full model class so the saved +config is complete (vLLM-servable), keeps the vision tower, embeddings, `lm_head` +and the `linear_attn` projections in bf16, and saves the image processor. + + +**AWQ on Qwen3.5 is verified servable. NVFP4 on Qwen3.5 is experimental and not +yet validated** — prefer AWQ for the Qwen3.5 family today. + + +## API + +```bash +# Start a quant job (idle node required) +curl -X POST http://localhost:3000/api/training/jobs \ + -H 'Content-Type: application/json' \ + -d '{ + "method": "quantize", + "base_model": "Qwen/Qwen3.5-4B", + "scheme": "awq", + "calib_samples": 256, + "push_to_hf": false + }' + +# Poll status / progress +curl http://localhost:3000/api/training/jobs/{job_id} + +# When done, the quantized model is listed in the catalog +curl http://localhost:3000/api/models | grep awq +``` + +To push to the Hub, add `"push_to_hf": true` (and optionally `"hf_repo": "name"`). +The job validates write scope **before** running, so a read-only token fails fast. + +## Why 4-bit on GB10 + +GB10 decode is memory-bandwidth bound. 4-bit weights mean fewer bytes read per +token, so a quantized model both **fits more easily** and **decodes faster**. +AWQ (`awq_marlin`) is the proven kernel path on GB10's sm120. diff --git a/guides/secrets.mdx b/guides/secrets.mdx new file mode 100644 index 0000000..08fd755 --- /dev/null +++ b/guides/secrets.mdx @@ -0,0 +1,54 @@ +--- +title: "Tokens & Secrets" +description: "Store Hugging Face, NGC, W&B, and OpenAI credentials locally — read vs write HF tokens." +icon: "key" +--- + +AINode keeps credentials in a local **Secrets store** at `~/.ainode/secrets.json` +(file mode `0600`, values obfuscated at rest). Secrets never appear in API +responses — the UI only ever shows a masked value (last 4 characters). Set them in +**Config → Secrets**, where each credential has a **Test** button that reports the +detected scope without revealing the value. + +## Hugging Face: read vs write + +There are two HF token slots, and the difference matters: + +| Slot | Used for | Required scope | +|---|---|---| +| **HuggingFace Token (read)** | Download gated models (Llama, Gemma, …) | `read` is enough | +| **HuggingFace Token (write)** | **Push** quantized / fine-tuned models to the Hub | `write` (or a write-scoped fine-grained token) | + +AINode prefers the **write** token for pushes and falls back to the read token +only if it has write scope. A read-only token is **rejected up front** — before +any multi-GB upload starts — so you never wait through a transfer that ends in a +`403`. + + +Keep your everyday downloads on a `read` token and add a separate `write` token +only when you intend to push models back to the Hub. + + +### Setting the read token from the CLI + +```bash +ainode config --hf-token hf_xxxxxxxxxxxxxxxx # set +ainode config --hf-token "" # clear +``` + +This populates the read slot and propagates automatically to gated-model +downloads and training jobs. The write token is set in the **Secrets** UI. + +## Other supported credentials + +| Secret | Purpose | +|---|---| +| **NVIDIA NGC API Key** | Pull NGC-hosted models and containers | +| **Weights & Biases API Key** | Stream training metrics to W&B | +| **OpenAI API Key** | Benchmark comparisons against hosted OpenAI models | + +## Where pushes land + +When you push a quantized or fine-tuned model, AINode creates a **private** repo +under the write token's namespace (`/`). See +[Quantize a Model](/guides/quantize) for the push workflow. diff --git a/introduction.mdx b/introduction.mdx index 5d74dc8..4fac734 100644 --- a/introduction.mdx +++ b/introduction.mdx @@ -35,9 +35,18 @@ curl -fsSL https://ainode.dev/install | bash LoRA, QLoRA, and full fine-tune from the browser. No notebooks. + + Compress any model to AWQ or NVFP4 on your own GPU, then serve it or push it to Hugging Face. + Automatic peer discovery. One model sharded across all GPUs in the cluster. + + One endpoint, many models — the master routes each request to the node that holds the model. + + + Run several models per node. They persist across restarts and replay on boot. + `/metrics` endpoint for Grafana, Prometheus, VictoriaMetrics. @@ -147,9 +156,10 @@ AINode ships as a single unified container image. Every node in the cluster runs ghcr.io/getainode/ainode:latest ← pulled by the installer │ ├── aiohttp web server (chat UI + API proxy, port 3000 / 8000) - ├── vLLM inference engine (launched as subprocess) + ├── federated master router (routes /v1/* by model name) + ├── vLLM inference engine (one or more instances per node) ├── UDP discovery broadcaster (port 5679) - └── training pipeline (LoRA / QLoRA / Full) + └── training pipeline (LoRA / QLoRA / Full + quantization: AWQ / NVFP4) ``` No host Python venv. No source builds. Upgrade is `ainode update`.