Model Serving Infrastructure

Local LLM serving across multiple machines. Each model gets its own directory with configuration; shared engine scripts handle the actual launching.

Quick Start

# Setup (once per machine)
./setup.sh

# Run a model
./run.sh qwen-3.5-4b                    # from root
cd qwen-3.5-4b && ../run.sh             # from model dir
./run.sh gemma-4-26b-a4b --engine vllm  # override engine

Machines

Hostname	Hardware	GPU Memory	OS	Primary Backend
smarty	RTX PRO 6000 Blackwell	96 GB VRAM	Ubuntu Linux	`llama-server` (GGUF), bare-metal vLLM
snappy	Mac Mini M4 Pro	64 GB unified	macOS	`mlx-vlm` (MLX)
scrappy	RTX 3070 Laptop	8 GB VRAM	Windows 11	—
sparky	DGX Spark GB10	128 GB unified	Ubuntu Linux	offline

Model Inventory

Port	Model	Type	Quant	KV Cache	Context	Parallel
2025	Qwen 3.5 9B	big dense	UD-Q4_K_XL	q8_0	64K	2
2026	Qwen 3.5 27B	big dense	UD-Q4_K_XL	q8_0	64K	2
2027	Qwen 3.5 35B A3B	MoE	UD-Q4_K_XL	q8_0	64K	8
2028	Qwen 3.6 35B A3B	MoE	UD-Q4_K_XL	q8_0	64K	8
2029	Qwen 3.5 4B	small dense	UD-Q4_K_XL	q8_0	64K	2
2030	Qwen 3.5 2B	small dense	Q8_0	q8_0	32K	2
2031	Qwen 3.5 0.8B	small dense	Q8_0	q8_0	32K	2
2032	Qwen 3.6 27B	big dense	UD-Q4_K_XL	q8_0	64K	2
2033	Nemotron 3 Super 120B A12B	MoE (NVFP4)	NVFP4	fp8	64K	8
2034	Nemotron 3 Nano 30B A3B	MoE (NVFP4)	NVFP4	fp8	64K	8
2035	Nemotron Cascade 2 30B A3B	MoE	UD-Q4_K_XL	q8_0	64K	8
2036	Gemma 4 26B-A4B	MoE	UD-Q4_K_XL	q8_0	64K	8
2037	Gemma 4 31B	big dense	UD-Q4_K_XL	q8_0	64K	2
2038	Gemma 4 E4B	small dense	UD-Q4_K_XL	q8_0	64K	2
2039	Gemma 4 E2B	small dense	Q8_0	q8_0	32K	2
2043	Gemma 4 12B	big dense	UD-Q4_K_XL	q8_0	64K	2
4007	Penumbra	custom	—	—	—	—

Directory Structure

models.server/
├── run.sh                  # Single entry point — detects platform, dispatches
├── setup.sh                # Environment setup (MLX on macOS, vLLM on Linux)
├── scripts/
│   ├── run-llama.sh        # Generic llama.cpp launcher
│   ├── run-mlx.sh          # Generic MLX launcher
│   ├── run-vllm.sh         # Generic vLLM launcher
│   ├── run-cpu.sh          # Generic CPU-only launcher (Pi)
│   ├── parse-config.py     # Reads model.json → shell variables
│   ├── setup-common.sh     # Shared helpers (CUDA env, venv paths)
│   ├── setup-vllm.sh       # Creates/updates .venv-vllm
│   └── setup-mlx.sh        # Creates/updates .venv-mlx
├── <model-id>/
│   ├── model.json          # All config: ports, quants, engine settings
│   ├── launchd/            # macOS service unit
│   └── systemd/            # Linux service unit
├── .venv-mlx/              # Shared MLX venv (macOS)
├── .venv-vllm/             # Shared vLLM venv (Linux)
├── llama.cpp/              # llama.cpp build scripts
├── whisper.cpp/            # whisper.cpp build scripts
└── bench/                  # Benchmark results

Engine Auto-Detection

run.sh picks the engine automatically:

macOS → mlx (mlx-vlm)
ARM Linux without CUDA → cpu (Raspberry Pi)
Linux with CUDA → llama (llama.cpp), or vllm if model has no GGUF (NVFP4)

Override with --engine: ./run.sh qwen-3.5-4b --engine vllm

Serving Backends

llama-server (llama.cpp)

GGUF-quantized models via llama.cpp. OpenAI-compatible API at /v1/chat/completions. CUDA + flash attention on smarty, Metal on snappy.

llama.cpp PR #22673 adds MTP (Multi-Token Prediction) speculative decoding using draft heads baked into the main GGUF (no separate drafter file). Set llama.mtp=true in model.json to pass --spec-type draft-mtp; optional llama.mtp_n_max overrides --spec-draft-n-max (llama.cpp default 3, PR notes 2-3 is the sweet spot for ~1.7-2x speedup at 72-83% accept rate). Requires a llama.cpp build from after PR #22673 and a GGUF repo that ships MTP heads (e.g. unsloth's *-MTP-GGUF variants). Used by both Qwen 3.6 models.

mlx-vlm

Vision Language Models via mlx-vlm. macOS only (Apple Silicon / MLX). Uses mlx-community/ quantized models. Serves at /chat/completions (no /v1 prefix).

mlx-vlm>=0.6.0 supports speculative decoding on the server. Add optional mlx.draft_model, mlx.draft_kind, and mlx.draft_block_size fields in model.json to pass --draft-model, --draft-kind, and --draft-block-size; set MLX_DISABLE_DRAFT=1 when launching to run without the configured drafter.

Gemma 4 MTP drafters work but only help large/slow targets. E2B/E4B run with mlx.draft_enabled=false (MTP measured slower than no-drafter on E4B — 66.8 vs 70.6 tok/s; see bench/BENCHMARKS.md); 26B-A4B/31B keep draft_enabled=true pending an MLX bench. The Gemma 4 MTP rollback crash (mlx-vlm#1260, AttributeError: 'list' object has no attribute 'max') is fixed upstream in mlx-vlm 0.6.1 (our PR #1261) — hence the >=0.6.1 floor in setup-mlx.sh. The old local patch has been removed.

vLLM

GPU-accelerated serving via vLLM. Linux only (CUDA). Supports online FP8 quantization, Marlin NVFP4, and continuous batching for high-throughput concurrent serving.

Quantization Standards

Model size	Weight quant	KV cache	Context	Parallel slots
>= 4B	UD-Q4_K_XL	q8_0 / fp8	64K	MoE: 8, big dense: 2, small: 2
< 4B	Q8_0	q8_0 / fp8	32K	2

NVFP4 models (Nemotron Nano/Super) use vLLM with Marlin backend instead of llama.cpp.

Adding a New Model

Create <model-id>/ directory
Add model.json with all engine config (see any existing model for the schema)
Add launchd/ and systemd/ service units
Follow the quantization standards above
Test: ./run.sh <model-id>

Service Management

macOS (launchd)

ln -s ~/src/models.server/<model-id>/launchd/ai.kortexa.<model-id>.plist ~/Library/LaunchAgents/
launchctl load ~/Library/LaunchAgents/ai.kortexa.<model-id>.plist
launchctl start ai.kortexa.<model-id>

Linux (systemd)

sudo ln -s ~/src/models.server/<model-id>/systemd/kortexa-ai-llm-<model-id>.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl start kortexa-ai-llm-<model-id>

Name		Name	Last commit message	Last commit date
Latest commit History 64 Commits
bench		bench
embeddinggemma-300m		embeddinggemma-300m
gemma-4-12b		gemma-4-12b
gemma-4-26b-a4b		gemma-4-26b-a4b
gemma-4-31b		gemma-4-31b
gemma-4-e2b		gemma-4-e2b
gemma-4-e4b		gemma-4-e4b
llama.cpp		llama.cpp
nemotron-3-nano-30b-a3b		nemotron-3-nano-30b-a3b
nemotron-3-super-120b-a12b		nemotron-3-super-120b-a12b
nemotron-cascade-2-30b-a3b		nemotron-cascade-2-30b-a3b
penumbra		penumbra
qwen-3.5-0.8b		qwen-3.5-0.8b
qwen-3.5-27b		qwen-3.5-27b
qwen-3.5-2b		qwen-3.5-2b
qwen-3.5-4b		qwen-3.5-4b
qwen-3.5-9b		qwen-3.5-9b
qwen-3.6-27b		qwen-3.6-27b
qwen-3.6-35b-a3b-prism-nvfp4		qwen-3.6-35b-a3b-prism-nvfp4
qwen-3.6-35b-a3b		qwen-3.6-35b-a3b
qwen3-embedding-0.6b		qwen3-embedding-0.6b
scripts		scripts
whisper.cpp		whisper.cpp
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
README.md		README.md
run.sh		run.sh
setup.sh		setup.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Model Serving Infrastructure

Quick Start

Machines

Model Inventory

Directory Structure

Engine Auto-Detection

Serving Backends

llama-server (llama.cpp)

mlx-vlm

vLLM

Quantization Standards

Adding a New Model

Service Management

macOS (launchd)

Linux (systemd)

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Model Serving Infrastructure

Quick Start

Machines

Model Inventory

Directory Structure

Engine Auto-Detection

Serving Backends

llama-server (llama.cpp)

mlx-vlm

vLLM

Quantization Standards

Adding a New Model

Service Management

macOS (launchd)

Linux (systemd)

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages