ROCm: discrete GPU memory management by cattivik66 · Pull Request #461 · antirez/ds4

cattivik66 · 2026-06-26T19:18:48Z

Summary

With these changes, the ds4flash.gguf (DeepSeek-V4 Flash) model runs on
discrete GPUs with 48 GB VRAM (tested: AMD Radeon Pro W7800 x2). Without
this PR the upstream code hard-fails: the streaming expert cache exhausts
VRAM during model load and cuda_stream_resident_alloc() returns -1,
preventing the model from starting at all.
Sorry I was not able to test on other systems, but I expect the code to be fully
compatible with the current supported systems as they should be untouched.

The PR adds:

A host-mapped expert overflow path so routed experts degrade gracefully
instead of aborting.
Automatic discrete/integrated GPU detection so the registered-weights
path activates without manual configuration.
Three env-var tunables for operators to size the VRAM budget.

Compilation

Built on ROCm 7.2 with hipcc:

make rocm ROCM_ARCH=gfx1100

Linker flags: -O3 -ffast-math -g -fno-finite-math-only -pthread -D__HIP_PLATFORM_AMD__ --offload-arch=gfx1100 -lhipblas -lhipblaslt

Problem

Expert cache OOM -- cuda_stream_resident_alloc() returns -1 when
cuda_stream_resident_make_room() fails the 16 GiB free-reserve floor.
On 48 GB cards the working set of routed experts from the Flash model
exceeds the remaining VRAM after model weights + KV cache, so the
startup crashes with "streaming expert cache cannot keep ... MiB".
Raw host-pointer reads -- when cuda_model_range_ptr_from_fd() can't fit
a range in the arena it returns cuda_model_ptr(), which for discrete GPUs
is a plain host virtual address. On discrete GPUs this causes slow uncached
PCIe reads on every kernel access. In distributed mode, the raw pointer
also fails the cuda_model_range_is_cached() check, crashing the worker.

Validation

Test setup

GPU: AMD Radeon Pro W7800 48 GB (gfx1100 / Navi 31) x2
ROCm: 7.2, hipcc, HIP runtime
Model: ds4flash.gguf (DeepSeek-V4 Flash, IQ2XXS quant, ~47.6 GiB)
Build: make rocm ROCM_ARCH=gfx1100

Single GPU (auto-detection, no DS4_ROCM_REGISTERED_WEIGHTS needed)

DS4_ROCM_STREAM_FREE_RESERVE_GIB=4 \
DS4_CUDA_Q8_F16_CACHE_MB=0 \
./ds4 --model ds4flash.gguf --backend rocm --ssd-streaming \
  -p "Hello, who are you?" -n 30 --nothink

ds4: ROCm backend initialized on AMD Radeon Pro W7800 48GB (sm_110)
ds4:   using 80% total for model + cached experts: 38.39 GiB
ds4:   cached expert count: 4580 (30.19 GiB)
Hi there! I'm DeepSeek, an AI assistant created by the Chinese company
DeepSeek (深度求索). I'm here to help...
ds4: prefill: 1.44 t/s, generation: 4.42 t/s

Dual GPU (distributed, layers split 0:21 + 22:output)

The auto-detection correctly identifies the W7800 as discrete, enabling the
registered-weights fallback path automatically. No DS4_ROCM_REGISTERED_WEIGHTS
env var is needed.

Terminal 1 (GPU 1, worker, layers 22:output):

ROCR_VISIBLE_DEVICES=1 DS4_LOCK_FILE=/tmp/ds4-w.lock \
DS4_CUDA_Q8_F16_CACHE_MB=0 DS4_ROCM_STREAM_FREE_RESERVE_GIB=1 \
./ds4 --role worker --layers 22:output --ssd-streaming \
  --coordinator 127.0.0.1 9000 \
  --backend rocm -m ds4flash.gguf

Terminal 2 (GPU 0, coordinator, layers 0:21):

ROCR_VISIBLE_DEVICES=0 DS4_LOCK_FILE=/tmp/ds4-c.lock \
DS4_CUDA_Q8_F16_CACHE_MB=0 DS4_ROCM_STREAM_FREE_RESERVE_GIB=1 \
./ds4 --role coordinator --layers 0:21 --ssd-streaming \
  --listen 127.0.0.1 9000 \
  --backend rocm -m ds4flash.gguf \
  -p "Hello, who are you?" -n 50 --nothink

Coordinator output:

processing 15 input tokens: 15/15 (100.0%)
Hi there! I'm DeepSeek, an AI assistant created by DeepSeek (深度求索)，
a Chinese company. I'm here to help you with a wide range of tasks...
ds4: prefill: 0.91 t/s, generation: 4.09 t/s

Both GPUs participate in every token via the distributed layer pipeline.
Each GPU uses ~42 GiB VRAM during inference.

Dual GPU distributed (full residency, no SSD)

Coordinator: layers 0:20 (21 layers), Worker: layers 21:output (22 layers)
Context: -c 1024 (min KV cache ~102 MiB)
Env vars: DS4_ROCM_HOST_MAPPED_EXPERT_FALLBACK=0 DS4_CUDA_Q8_F16_CACHE_MB=0

Run 1 (cold page cache): Prefill: 20.50 t/s, Generation: 32.80 t/s
Run 2 (warm page cache): Prefill: 43.10 t/s, Generation: 32.74 t/s

Model load: ~15s per GPU
VRAM: ~32 GiB model (in arena) + ~8 GiB (cudaHostRegister'd) per GPU

Outcome summary

Scenario	Before	After
ds4flash.gguf on single 48 GB W7800	fails to load (VRAM OOM)	loads and infers at 4.42 t/s
ds4flash.gguf on dual 48 GB W7800 (distributed)	fails to load (VRAM OOM)	loads and infers at 4.09 t/s
Larger models / APUs	unchanged	unchanged
Upstream defaults (no env vars)	unchanged	auto-detection works out of the box

Changes

`ds4_rocm.h`

Added cudaDevAttrIntegrated -> hipDeviceAttributeIntegrated mapping for the
HIP/CUDA compatibility layer, enabling cudaDeviceGetAttribute to query whether
the GPU is integrated or discrete.

`rocm/ds4_rocm_runtime.cuh`

1. Host-mapped expert fallback (`cuda_stream_resident_alloc`)

When a routed expert cannot be placed in VRAM:

Copy the expert weights (gate | up | down) into a contiguous malloc'd host buffer.
Register that buffer with cudaHostRegister(cudaHostRegisterMapped).
Obtain a GPU-side pointer via cudaHostGetDevicePointer.
Store the expert with host_mapped=1 so the GPU kernel reads through the
registered mapping (zero-copy over PCIe).

The fallback is enabled by default and activates only when the existing
cudaMalloc path would have returned -1. Set DS4_ROCM_HOST_MAPPED_EXPERT_FALLBACK=0
to restore the previous hard-fail behaviour.

Two new fields (host_mapped, host_base) are added to cuda_stream_resident_expert;
cuda_stream_resident_cache_release() was updated to call cudaHostUnregister+free
for host-mapped experts while preserving the original cudaFree path for VRAM experts.

2. Automatic discrete GPU detection (`cuda_device_is_discrete`)

Queries cudaDeviceGetAttribute(cudaDevAttrIntegrated) at first use and caches
the result. On discrete GPUs (separate VRAM), the model arena fallthrough
returns NULL instead of a raw host pointer, forcing the cudaHostRegister
mapping path. On integrated GPUs (APUs, unified memory), the raw host pointer
path is preserved for zero-copy access.

DS4_ROCM_REGISTERED_WEIGHTS env var overrides auto-detection:

"1" = force discrete (always use registered path)
"0" = force integrated (always use raw host pointer)

When the attribute query fails (older ROCm versions), defaults to discrete
(the safer choice).

3. `DS4_ROCM_STREAM_FREE_RESERVE_GIB` (VRAM free-reserve floor)

Override the hard-coded 16 GiB free-reserve in cuda_stream_resident_free_reserve_bytes().
Fractional values accepted (e.g. 4.5). On 48 GB cards a reserve of ~4 GiB is
sufficient; the upstream 16 GiB leaves too little room for the Flash model's
routed-expert working set.

Default (unset): 16 GiB -- upstream behaviour unchanged.

4. `DS4_CUDA_Q8_F16_CACHE_MB` (q8->f16 cache budget)

The optional q8->f16 transpose cache defaults to unlimited (UINT64_MAX).
On tight discrete GPUs this cache can consume VRAM needed for expert residency.
Set to 0 to disable it entirely.

Default (unset): unlimited -- upstream behaviour unchanged.

Environment variables

Variable	Purpose	Default
`DS4_ROCM_STREAM_FREE_RESERVE_GIB`	VRAM free-reserve floor (GiB)	`16`
`DS4_ROCM_HOST_MAPPED_EXPERT_FALLBACK`	Set `0` to disable host-mapped expert overflow	enabled
`DS4_CUDA_Q8_F16_CACHE_MB`	q8->f16 cache limit in MiB (`0`=disable)	unlimited
`DS4_ROCM_REGISTERED_WEIGHTS`	Override GPU type detection (`1`=discrete, `0`=integrated)	auto-detect

Safety

All defaults preserve upstream behaviour. No env var means no change.
Discrete/integrated detection is automatic; APU users (Strix Halo) get the
same raw-pointer path as before.
Host-mapped experts are automatically cleaned up on cache teardown.
The DS4_ROCM_HOST_MAPPED_EXPERT_FALLBACK=0 escape hatch restores
the previous hard-fail path.
Single-file change (plus 1-line compat mapping) -- no impact on CUDA, Metal,
or CPU backends.
No ds4.c or protocol changes -- distributed inference works identically.

Add host-mapped expert cache fallback for VRAM-constrained discrete GPUs (e.g. AMD W7800 48GB). When the streaming expert cache is exhausted, copy the expert weights to pinned host memory and map them into the GPU address space via cudaHostRegister/cudaHostGetDevicePointer. Inference proceeds (slower) instead of hard-failing. Add automatic discrete/integrated GPU detection via cudaDeviceGetAttribute so the registered-weights path activates without manual configuration. DS4_ROCM_REGISTERED_WEIGHTS env var overrides detection for edge cases. Add three env var tunables: - DS4_ROCM_STREAM_FREE_RESERVE_GIB: override VRAM free-reserve floor (default 16 GiB). - DS4_CUDA_Q8_F16_CACHE_MB: limit the optional q8->f16 cache to free VRAM for routed-expert residency (e.g. DS4_CUDA_Q8_F16_CACHE_MB=0). - DS4_ROCM_HOST_MAPPED_EXPERT_FALLBACK: set to 0 to disable host-mapped expert overflow (default: enabled). Host-mapped fallback can be disabled with DS4_ROCM_HOST_MAPPED_EXPERT_FALLBACK=0. All defaults preserve upstream behavior.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ROCm: discrete GPU memory management#461

ROCm: discrete GPU memory management#461
cattivik66 wants to merge 1 commit into
antirez:mainfrom
cattivik66:fix/rocm-discrete-gpu-memory

cattivik66 commented Jun 26, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cattivik66 commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Compilation

Problem

Validation

Test setup

Single GPU (auto-detection, no DS4_ROCM_REGISTERED_WEIGHTS needed)

Dual GPU (distributed, layers split 0:21 + 22:output)

Dual GPU distributed (full residency, no SSD)

Outcome summary

Changes

ds4_rocm.h

rocm/ds4_rocm_runtime.cuh

1. Host-mapped expert fallback (cuda_stream_resident_alloc)

2. Automatic discrete GPU detection (cuda_device_is_discrete)

3. DS4_ROCM_STREAM_FREE_RESERVE_GIB (VRAM free-reserve floor)

4. DS4_CUDA_Q8_F16_CACHE_MB (q8->f16 cache budget)

Environment variables

Safety

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cattivik66 commented Jun 26, 2026 •

edited

Loading

`ds4_rocm.h`

`rocm/ds4_rocm_runtime.cuh`

1. Host-mapped expert fallback (`cuda_stream_resident_alloc`)

2. Automatic discrete GPU detection (`cuda_device_is_discrete`)

3. `DS4_ROCM_STREAM_FREE_RESERVE_GIB` (VRAM free-reserve floor)

4. `DS4_CUDA_Q8_F16_CACHE_MB` (q8->f16 cache budget)