Local LLM serving across multiple machines. Each model gets its own directory with configuration; shared engine scripts handle the actual launching.
# Setup (once per machine)
./setup.sh
# Run a model
./run.sh qwen-3.5-4b # from root
cd qwen-3.5-4b && ../run.sh # from model dir
./run.sh gemma-4-26b-a4b --engine vllm # override engine| Hostname | Hardware | GPU Memory | OS | Primary Backend |
|---|---|---|---|---|
| smarty | RTX PRO 6000 Blackwell | 96 GB VRAM | Ubuntu Linux | llama-server (GGUF), bare-metal vLLM |
| snappy | Mac Mini M4 Pro | 64 GB unified | macOS | mlx-vlm (MLX) |
| scrappy | RTX 3070 Laptop | 8 GB VRAM | Windows 11 | — |
| sparky | DGX Spark GB10 | 128 GB unified | Ubuntu Linux | offline |
| Port | Model | Type | Quant | KV Cache | Context | Parallel |
|---|---|---|---|---|---|---|
| 2025 | Qwen 3.5 9B | big dense | UD-Q4_K_XL | q8_0 | 64K | 2 |
| 2026 | Qwen 3.5 27B | big dense | UD-Q4_K_XL | q8_0 | 64K | 2 |
| 2027 | Qwen 3.5 35B A3B | MoE | UD-Q4_K_XL | q8_0 | 64K | 8 |
| 2028 | Qwen 3.6 35B A3B | MoE | UD-Q4_K_XL | q8_0 | 64K | 8 |
| 2029 | Qwen 3.5 4B | small dense | UD-Q4_K_XL | q8_0 | 64K | 2 |
| 2030 | Qwen 3.5 2B | small dense | Q8_0 | q8_0 | 32K | 2 |
| 2031 | Qwen 3.5 0.8B | small dense | Q8_0 | q8_0 | 32K | 2 |
| 2032 | Qwen 3.6 27B | big dense | UD-Q4_K_XL | q8_0 | 64K | 2 |
| 2033 | Nemotron 3 Super 120B A12B | MoE (NVFP4) | NVFP4 | fp8 | 64K | 8 |
| 2034 | Nemotron 3 Nano 30B A3B | MoE (NVFP4) | NVFP4 | fp8 | 64K | 8 |
| 2035 | Nemotron Cascade 2 30B A3B | MoE | UD-Q4_K_XL | q8_0 | 64K | 8 |
| 2036 | Gemma 4 26B-A4B | MoE | UD-Q4_K_XL | q8_0 | 64K | 8 |
| 2037 | Gemma 4 31B | big dense | UD-Q4_K_XL | q8_0 | 64K | 2 |
| 2038 | Gemma 4 E4B | small dense | UD-Q4_K_XL | q8_0 | 64K | 2 |
| 2039 | Gemma 4 E2B | small dense | Q8_0 | q8_0 | 32K | 2 |
| 2043 | Gemma 4 12B | big dense | UD-Q4_K_XL | q8_0 | 64K | 2 |
| 4007 | Penumbra | custom | — | — | — | — |
models.server/
├── run.sh # Single entry point — detects platform, dispatches
├── setup.sh # Environment setup (MLX on macOS, vLLM on Linux)
├── scripts/
│ ├── run-llama.sh # Generic llama.cpp launcher
│ ├── run-mlx.sh # Generic MLX launcher
│ ├── run-vllm.sh # Generic vLLM launcher
│ ├── run-cpu.sh # Generic CPU-only launcher (Pi)
│ ├── parse-config.py # Reads model.json → shell variables
│ ├── setup-common.sh # Shared helpers (CUDA env, venv paths)
│ ├── setup-vllm.sh # Creates/updates .venv-vllm
│ └── setup-mlx.sh # Creates/updates .venv-mlx
├── <model-id>/
│ ├── model.json # All config: ports, quants, engine settings
│ ├── launchd/ # macOS service unit
│ └── systemd/ # Linux service unit
├── .venv-mlx/ # Shared MLX venv (macOS)
├── .venv-vllm/ # Shared vLLM venv (Linux)
├── llama.cpp/ # llama.cpp build scripts
├── whisper.cpp/ # whisper.cpp build scripts
└── bench/ # Benchmark results
run.sh picks the engine automatically:
- macOS →
mlx(mlx-vlm) - ARM Linux without CUDA →
cpu(Raspberry Pi) - Linux with CUDA →
llama(llama.cpp), orvllmif model has no GGUF (NVFP4)
Override with --engine: ./run.sh qwen-3.5-4b --engine vllm
GGUF-quantized models via llama.cpp. OpenAI-compatible API at /v1/chat/completions. CUDA + flash attention on smarty, Metal on snappy.
llama.cpp PR #22673 adds MTP (Multi-Token Prediction) speculative decoding using draft heads baked into the main GGUF (no separate drafter file). Set llama.mtp=true in model.json to pass --spec-type draft-mtp; optional llama.mtp_n_max overrides --spec-draft-n-max (llama.cpp default 3, PR notes 2-3 is the sweet spot for ~1.7-2x speedup at 72-83% accept rate). Requires a llama.cpp build from after PR #22673 and a GGUF repo that ships MTP heads (e.g. unsloth's *-MTP-GGUF variants). Used by both Qwen 3.6 models.
Vision Language Models via mlx-vlm. macOS only (Apple Silicon / MLX). Uses mlx-community/ quantized models. Serves at /chat/completions (no /v1 prefix).
mlx-vlm>=0.6.0 supports speculative decoding on the server. Add optional mlx.draft_model, mlx.draft_kind, and mlx.draft_block_size fields in model.json to pass --draft-model, --draft-kind, and --draft-block-size; set MLX_DISABLE_DRAFT=1 when launching to run without the configured drafter.
Gemma 4 MTP drafters work but only help large/slow targets. E2B/E4B run with mlx.draft_enabled=false (MTP measured slower than no-drafter on E4B — 66.8 vs 70.6 tok/s; see bench/BENCHMARKS.md); 26B-A4B/31B keep draft_enabled=true pending an MLX bench. The Gemma 4 MTP rollback crash (mlx-vlm#1260, AttributeError: 'list' object has no attribute 'max') is fixed upstream in mlx-vlm 0.6.1 (our PR #1261) — hence the >=0.6.1 floor in setup-mlx.sh. The old local patch has been removed.
GPU-accelerated serving via vLLM. Linux only (CUDA). Supports online FP8 quantization, Marlin NVFP4, and continuous batching for high-throughput concurrent serving.
| Model size | Weight quant | KV cache | Context | Parallel slots |
|---|---|---|---|---|
| >= 4B | UD-Q4_K_XL | q8_0 / fp8 | 64K | MoE: 8, big dense: 2, small: 2 |
| < 4B | Q8_0 | q8_0 / fp8 | 32K | 2 |
NVFP4 models (Nemotron Nano/Super) use vLLM with Marlin backend instead of llama.cpp.
- Create
<model-id>/directory - Add
model.jsonwith all engine config (see any existing model for the schema) - Add
launchd/andsystemd/service units - Follow the quantization standards above
- Test:
./run.sh <model-id>
ln -s ~/src/models.server/<model-id>/launchd/ai.kortexa.<model-id>.plist ~/Library/LaunchAgents/
launchctl load ~/Library/LaunchAgents/ai.kortexa.<model-id>.plist
launchctl start ai.kortexa.<model-id>sudo ln -s ~/src/models.server/<model-id>/systemd/kortexa-ai-llm-<model-id>.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl start kortexa-ai-llm-<model-id>