Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions docs.json
Original file line number Diff line number Diff line change
Expand Up @@ -66,7 +66,10 @@
"guides/chat",
"guides/models",
"guides/training",
"guides/quantize",
"guides/distributed",
"guides/federation",
"guides/secrets",
"guides/update"
]
},
Expand Down
45 changes: 45 additions & 0 deletions guides/federation.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
---
title: "Federated Serving"
description: "One master routes requests to whichever node holds each model — load and unload any model on any node from the browser."
icon: "network-wired"
---

A cluster doesn't have to shard one giant model across every node. With
**federation**, each node can serve its own model(s), and the **master** routes
each `/v1/*` request to the node that holds the requested model — a single
OpenAI-compatible endpoint in front of the whole fleet.

## How routing works

- The master inspects the `model` field of each `/v1/*` request and forwards it to
the node currently serving that model.
- You manage the whole fleet from one browser tab: see every node, what each is
serving, and **load or unload a model on any node** without SSHing in.
- If a target node is mid-swap or unreachable, the master applies **routing
failover** rather than hanging the request.

<Note>
Federation (one endpoint, many models, routed by name) is distinct from
**distributed tensor-parallel** (one model sharded across nodes — see
[Distributed Inference](/guides/distributed)). Use federation to serve many
models across a fleet; use TP to serve one model too big for a single node.
</Note>

## Model stacking

A single node can hold **more than one model at a time**. Stacked instances each
get their own port, and they **persist across restarts** — AINode replays them on
boot so your fleet comes back exactly as you left it. Loads are serialized so two
models never try to claim the same memory at once.

<Tip>
When you ask the master for a model by name, point your OpenAI client at the
master's `/v1` endpoint and set `model` to the exact id you loaded. A specific
stacked instance is also reachable directly on its own port.
</Tip>

## Per-node memory headroom

Each node exposes a **memory-utilization** knob so you can leave room for a second
stacked model or for a quantization job (which needs the full unified memory — see
[Quantize a Model](/guides/quantize)).
48 changes: 39 additions & 9 deletions guides/models.mdx
Original file line number Diff line number Diff line change
@@ -1,43 +1,73 @@
---
title: "Models"
description: "Download, manage, and launch models from the browser."
description: "Download, manage, launch, and unload models from the browser."
icon: "cube"
---

The **Models** page is split into two lists:

- **Installed** — models already on disk, ready to launch. Shows quantization,
size, and architecture per model.
- **Browse** — the live catalog: curated picks plus live Hugging Face search.

## Downloading models

Open the **Downloads** tab. Three ways to add a model:
In **Browse**, add a model three ways:

1. **Catalog** — 50+ curated models, click to download
2. **Search** — search Hugging Face live
3. **Custom** — paste any HF repo ID (e.g. `microsoft/Phi-4`)
1. **Catalog** — curated models (including frontier MoE and NVFP4/AWQ picks for
GB10 clusters), click to download.
2. **Search** — search Hugging Face live; results show **FITS GPU** / availability
badges computed from your cluster's aggregate VRAM.
3. **Custom** — paste any HF repo id (e.g. `microsoft/Phi-4`).

Downloads run in the background with a progress bar. Click **✕** to cancel.
Already-downloaded models show as **Installed** instead of offering a re-download.

## Launching a model

Once downloaded, click the model card → **▶ Launch Model**. AINode saves the model to config and restarts the engine.
In **Installed**, click a model → **▶ Launch**. On a single node AINode serves it
directly; on a cluster you can pick which nodes span it (see
[Distributed Inference](/guides/distributed)) or serve different models on
different nodes (see [Federated Serving](/guides/federation)).

## Unloading a model

Each running instance has an **Unload** control that stops just that instance and
frees its memory. (Unload force-clears a dead or phantom instance too, so a node
that stopped responding doesn't stay stuck "loaded".) Unload a node's models
before starting a [quantization job](/guides/quantize) — quantization needs the
full unified memory.

## Serving from on-disk weights

AINode serves directly from the weights in `~/.ainode/models/<slug>`. A model you
downloaded once — or produced with a [quantization](/guides/quantize) or
[fine-tuning](/guides/training) job — is immediately launchable; no re-download.

## Gated models

Llama, Gemma, and other gated repos require a Hugging Face token:
Llama, Gemma, and other gated repos require a Hugging Face **read** token:

```bash
ainode config --hf-token hf_xxxxxxxxxxxxxxxx
```

See [Tokens & Secrets](/guides/secrets) for read vs write tokens.

## Supported models

Any model compatible with vLLM works. Tested models include:
Any model compatible with vLLM works. Tested families include:

- Meta Llama 3 / 3.1 (8B, 70B, 405B)
- Qwen 2.5 (0.5B → 72B)
- Qwen 2.5 and Qwen3 / Qwen3.5 (incl. frontier MoE, e.g. Qwen3-235B-A22B)
- Mistral / Mixtral
- Google Gemma 2
- DeepSeek V3
- Phi-3 / Phi-4
- Command R+

AWQ (`awq_marlin`) and NVFP4 quantized checkpoints serve natively on GB10.

## Shared storage across cluster

To avoid downloading the same model on every node, mount a shared NFS path:
Expand Down
84 changes: 84 additions & 0 deletions guides/quantize.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
---
title: "Quantize a Model"
description: "Compress any model to AWQ or NVFP4 in the browser, then serve it or push it to Hugging Face."
icon: "compress"
---

AINode quantizes models **on your own GPU** — no external service, no notebook.
A quant job runs llm-compressor one-shot PTQ inside a GPU container and writes a
compressed-tensors checkpoint that vLLM serves natively.

## Run a quantization

Open **Training → Quantize a Model**.

<Steps>
<Step title="Base model">
A Hugging Face repo id (`Qwen/Qwen3.5-4B`) or a model already in **Installed**.
An on-disk copy is used automatically when present (offline + reproducible).
</Step>
<Step title="Scheme">
- **AWQ** — W4A16, 4-bit weights. Serves as `awq_marlin` on GB10. **Proven.**
- **NVFP4** — Blackwell-native 4-bit float. Newer; verified on dense text models.
</Step>
<Step title="Calibration samples">
Default **256**, drawn from `HuggingFaceH4/ultrachat_200k`. More samples =
better calibration, longer job.
</Step>
<Step title="Push to Hugging Face (optional)">
Tick **Push result to Hugging Face** and (optionally) name the repo. Requires
a **write** token — see [Secrets](/guides/secrets). The push happens after the
job finishes and creates a **private** repo under your token's namespace.
</Step>
</Steps>

<Warning>
The target node must be **idle**. Quantization needs the full unified memory, so
AINode refuses to start a quant job while a model is loaded — **unload all models
on the node first** (or resubmit with `force: true`). You'll get a `409` otherwise.
</Warning>

When the job finishes, the result appears in **Installed** as
`<org--name>-<scheme>` (e.g. `Qwen--Qwen3.5-4B-awq`), ready to launch.

## Multimodal & hybrid models (Qwen3.5)

Qwen3.5 models bundle a vision tower and use Gated-DeltaNet linear attention.
AINode handles this automatically: it loads the full model class so the saved
config is complete (vLLM-servable), keeps the vision tower, embeddings, `lm_head`
and the `linear_attn` projections in bf16, and saves the image processor.

<Note>
**AWQ on Qwen3.5 is verified servable. NVFP4 on Qwen3.5 is experimental and not
yet validated** — prefer AWQ for the Qwen3.5 family today.
</Note>

## API

```bash
# Start a quant job (idle node required)
curl -X POST http://localhost:3000/api/training/jobs \
-H 'Content-Type: application/json' \
-d '{
"method": "quantize",
"base_model": "Qwen/Qwen3.5-4B",
"scheme": "awq",
"calib_samples": 256,
"push_to_hf": false
}'

# Poll status / progress
curl http://localhost:3000/api/training/jobs/{job_id}

# When done, the quantized model is listed in the catalog
curl http://localhost:3000/api/models | grep awq
```

To push to the Hub, add `"push_to_hf": true` (and optionally `"hf_repo": "name"`).
The job validates write scope **before** running, so a read-only token fails fast.

## Why 4-bit on GB10

GB10 decode is memory-bandwidth bound. 4-bit weights mean fewer bytes read per
token, so a quantized model both **fits more easily** and **decodes faster**.
AWQ (`awq_marlin`) is the proven kernel path on GB10's sm120.
54 changes: 54 additions & 0 deletions guides/secrets.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
---
title: "Tokens & Secrets"
description: "Store Hugging Face, NGC, W&B, and OpenAI credentials locally — read vs write HF tokens."
icon: "key"
---

AINode keeps credentials in a local **Secrets store** at `~/.ainode/secrets.json`
(file mode `0600`, values obfuscated at rest). Secrets never appear in API
responses — the UI only ever shows a masked value (last 4 characters). Set them in
**Config → Secrets**, where each credential has a **Test** button that reports the
detected scope without revealing the value.

## Hugging Face: read vs write

There are two HF token slots, and the difference matters:

| Slot | Used for | Required scope |
|---|---|---|
| **HuggingFace Token (read)** | Download gated models (Llama, Gemma, …) | `read` is enough |
| **HuggingFace Token (write)** | **Push** quantized / fine-tuned models to the Hub | `write` (or a write-scoped fine-grained token) |

AINode prefers the **write** token for pushes and falls back to the read token
only if it has write scope. A read-only token is **rejected up front** — before
any multi-GB upload starts — so you never wait through a transfer that ends in a
`403`.

<Tip>
Keep your everyday downloads on a `read` token and add a separate `write` token
only when you intend to push models back to the Hub.
</Tip>

### Setting the read token from the CLI

```bash
ainode config --hf-token hf_xxxxxxxxxxxxxxxx # set
ainode config --hf-token "" # clear
```

This populates the read slot and propagates automatically to gated-model
downloads and training jobs. The write token is set in the **Secrets** UI.

## Other supported credentials

| Secret | Purpose |
|---|---|
| **NVIDIA NGC API Key** | Pull NGC-hosted models and containers |
| **Weights & Biases API Key** | Stream training metrics to W&B |
| **OpenAI API Key** | Benchmark comparisons against hosted OpenAI models |

## Where pushes land

When you push a quantized or fine-tuned model, AINode creates a **private** repo
under the write token's namespace (`<your-namespace>/<repo-name>`). See
[Quantize a Model](/guides/quantize) for the push workflow.
14 changes: 12 additions & 2 deletions introduction.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -35,9 +35,18 @@ curl -fsSL https://ainode.dev/install | bash
<Card title="Fine-tuning" icon="brain" href="/guides/training">
LoRA, QLoRA, and full fine-tune from the browser. No notebooks.
</Card>
<Card title="Quantization" icon="compress" href="/guides/quantize">
Compress any model to AWQ or NVFP4 on your own GPU, then serve it or push it to Hugging Face.
</Card>
<Card title="Multi-node clustering" icon="server" href="/guides/distributed">
Automatic peer discovery. One model sharded across all GPUs in the cluster.
</Card>
<Card title="Federated serving" icon="network-wired" href="/guides/federation">
One endpoint, many models — the master routes each request to the node that holds the model.
</Card>
<Card title="Model stacking" icon="layer-group" href="/guides/federation">
Run several models per node. They persist across restarts and replay on boot.
</Card>
<Card title="Prometheus metrics" icon="chart-line" href="/api-reference/metrics">
`/metrics` endpoint for Grafana, Prometheus, VictoriaMetrics.
</Card>
Expand Down Expand Up @@ -147,9 +156,10 @@ AINode ships as a single unified container image. Every node in the cluster runs
ghcr.io/getainode/ainode:latest ← pulled by the installer
├── aiohttp web server (chat UI + API proxy, port 3000 / 8000)
├── vLLM inference engine (launched as subprocess)
├── federated master router (routes /v1/* by model name)
├── vLLM inference engine (one or more instances per node)
├── UDP discovery broadcaster (port 5679)
└── training pipeline (LoRA / QLoRA / Full)
└── training pipeline (LoRA / QLoRA / Full + quantization: AWQ / NVFP4)
```

No host Python venv. No source builds. Upgrade is `ainode update`.
Expand Down