Skip to content

ROCm: discrete GPU memory management#461

Open
cattivik66 wants to merge 1 commit into
antirez:mainfrom
cattivik66:fix/rocm-discrete-gpu-memory
Open

ROCm: discrete GPU memory management#461
cattivik66 wants to merge 1 commit into
antirez:mainfrom
cattivik66:fix/rocm-discrete-gpu-memory

Conversation

@cattivik66

@cattivik66 cattivik66 commented Jun 26, 2026

Copy link
Copy Markdown

Summary

With these changes, the ds4flash.gguf (DeepSeek-V4 Flash) model runs on
discrete GPUs with 48 GB VRAM (tested: AMD Radeon Pro W7800 x2). Without
this PR the upstream code hard-fails: the streaming expert cache exhausts
VRAM during model load and cuda_stream_resident_alloc() returns -1,
preventing the model from starting at all.
Sorry I was not able to test on other systems, but I expect the code to be fully
compatible with the current supported systems as they should be untouched.

The PR adds:

  • A host-mapped expert overflow path so routed experts degrade gracefully
    instead of aborting.
  • Automatic discrete/integrated GPU detection so the registered-weights
    path activates without manual configuration.
  • Three env-var tunables for operators to size the VRAM budget.

Compilation

Built on ROCm 7.2 with hipcc:

make rocm ROCM_ARCH=gfx1100

Linker flags: -O3 -ffast-math -g -fno-finite-math-only -pthread -D__HIP_PLATFORM_AMD__ --offload-arch=gfx1100 -lhipblas -lhipblaslt

Problem

  1. Expert cache OOM -- cuda_stream_resident_alloc() returns -1 when
    cuda_stream_resident_make_room() fails the 16 GiB free-reserve floor.
    On 48 GB cards the working set of routed experts from the Flash model
    exceeds the remaining VRAM after model weights + KV cache, so the
    startup crashes with "streaming expert cache cannot keep ... MiB".

  2. Raw host-pointer reads -- when cuda_model_range_ptr_from_fd() can't fit
    a range in the arena it returns cuda_model_ptr(), which for discrete GPUs
    is a plain host virtual address. On discrete GPUs this causes slow uncached
    PCIe reads on every kernel access. In distributed mode, the raw pointer
    also fails the cuda_model_range_is_cached() check, crashing the worker.

Validation

Test setup

  • GPU: AMD Radeon Pro W7800 48 GB (gfx1100 / Navi 31) x2
  • ROCm: 7.2, hipcc, HIP runtime
  • Model: ds4flash.gguf (DeepSeek-V4 Flash, IQ2XXS quant, ~47.6 GiB)
  • Build: make rocm ROCM_ARCH=gfx1100

Single GPU (auto-detection, no DS4_ROCM_REGISTERED_WEIGHTS needed)

DS4_ROCM_STREAM_FREE_RESERVE_GIB=4 \
DS4_CUDA_Q8_F16_CACHE_MB=0 \
./ds4 --model ds4flash.gguf --backend rocm --ssd-streaming \
  -p "Hello, who are you?" -n 30 --nothink
ds4: ROCm backend initialized on AMD Radeon Pro W7800 48GB (sm_110)
ds4:   using 80% total for model + cached experts: 38.39 GiB
ds4:   cached expert count: 4580 (30.19 GiB)
Hi there! I'm DeepSeek, an AI assistant created by the Chinese company
DeepSeek (深度求索). I'm here to help...
ds4: prefill: 1.44 t/s, generation: 4.42 t/s

Dual GPU (distributed, layers split 0:21 + 22:output)

The auto-detection correctly identifies the W7800 as discrete, enabling the
registered-weights fallback path automatically. No DS4_ROCM_REGISTERED_WEIGHTS
env var is needed.

Terminal 1 (GPU 1, worker, layers 22:output):

ROCR_VISIBLE_DEVICES=1 DS4_LOCK_FILE=/tmp/ds4-w.lock \
DS4_CUDA_Q8_F16_CACHE_MB=0 DS4_ROCM_STREAM_FREE_RESERVE_GIB=1 \
./ds4 --role worker --layers 22:output --ssd-streaming \
  --coordinator 127.0.0.1 9000 \
  --backend rocm -m ds4flash.gguf

Terminal 2 (GPU 0, coordinator, layers 0:21):

ROCR_VISIBLE_DEVICES=0 DS4_LOCK_FILE=/tmp/ds4-c.lock \
DS4_CUDA_Q8_F16_CACHE_MB=0 DS4_ROCM_STREAM_FREE_RESERVE_GIB=1 \
./ds4 --role coordinator --layers 0:21 --ssd-streaming \
  --listen 127.0.0.1 9000 \
  --backend rocm -m ds4flash.gguf \
  -p "Hello, who are you?" -n 50 --nothink

Coordinator output:

processing 15 input tokens: 15/15 (100.0%)
Hi there! I'm DeepSeek, an AI assistant created by DeepSeek (深度求索),
a Chinese company. I'm here to help you with a wide range of tasks...
ds4: prefill: 0.91 t/s, generation: 4.09 t/s

Both GPUs participate in every token via the distributed layer pipeline.
Each GPU uses ~42 GiB VRAM during inference.

Dual GPU distributed (full residency, no SSD)

Coordinator: layers 0:20 (21 layers), Worker: layers 21:output (22 layers)
Context: -c 1024 (min KV cache ~102 MiB)
Env vars: DS4_ROCM_HOST_MAPPED_EXPERT_FALLBACK=0 DS4_CUDA_Q8_F16_CACHE_MB=0

Run 1 (cold page cache): Prefill: 20.50 t/s, Generation: 32.80 t/s
Run 2 (warm page cache): Prefill: 43.10 t/s, Generation: 32.74 t/s

Model load: ~15s per GPU
VRAM: ~32 GiB model (in arena) + ~8 GiB (cudaHostRegister'd) per GPU

Outcome summary

Scenario Before After
ds4flash.gguf on single 48 GB W7800 fails to load (VRAM OOM) loads and infers at 4.42 t/s
ds4flash.gguf on dual 48 GB W7800 (distributed) fails to load (VRAM OOM) loads and infers at 4.09 t/s
Larger models / APUs unchanged unchanged
Upstream defaults (no env vars) unchanged auto-detection works out of the box

Changes

ds4_rocm.h

Added cudaDevAttrIntegrated -> hipDeviceAttributeIntegrated mapping for the
HIP/CUDA compatibility layer, enabling cudaDeviceGetAttribute to query whether
the GPU is integrated or discrete.

rocm/ds4_rocm_runtime.cuh

1. Host-mapped expert fallback (cuda_stream_resident_alloc)

When a routed expert cannot be placed in VRAM:

  • Copy the expert weights (gate | up | down) into a contiguous malloc'd host buffer.
  • Register that buffer with cudaHostRegister(cudaHostRegisterMapped).
  • Obtain a GPU-side pointer via cudaHostGetDevicePointer.
  • Store the expert with host_mapped=1 so the GPU kernel reads through the
    registered mapping (zero-copy over PCIe).

The fallback is enabled by default and activates only when the existing
cudaMalloc path would have returned -1. Set DS4_ROCM_HOST_MAPPED_EXPERT_FALLBACK=0
to restore the previous hard-fail behaviour.

Two new fields (host_mapped, host_base) are added to cuda_stream_resident_expert;
cuda_stream_resident_cache_release() was updated to call cudaHostUnregister+free
for host-mapped experts while preserving the original cudaFree path for VRAM experts.

2. Automatic discrete GPU detection (cuda_device_is_discrete)

Queries cudaDeviceGetAttribute(cudaDevAttrIntegrated) at first use and caches
the result. On discrete GPUs (separate VRAM), the model arena fallthrough
returns NULL instead of a raw host pointer, forcing the cudaHostRegister
mapping path. On integrated GPUs (APUs, unified memory), the raw host pointer
path is preserved for zero-copy access.

DS4_ROCM_REGISTERED_WEIGHTS env var overrides auto-detection:

  • "1" = force discrete (always use registered path)
  • "0" = force integrated (always use raw host pointer)

When the attribute query fails (older ROCm versions), defaults to discrete
(the safer choice).

3. DS4_ROCM_STREAM_FREE_RESERVE_GIB (VRAM free-reserve floor)

Override the hard-coded 16 GiB free-reserve in cuda_stream_resident_free_reserve_bytes().
Fractional values accepted (e.g. 4.5). On 48 GB cards a reserve of ~4 GiB is
sufficient; the upstream 16 GiB leaves too little room for the Flash model's
routed-expert working set.

Default (unset): 16 GiB -- upstream behaviour unchanged.

4. DS4_CUDA_Q8_F16_CACHE_MB (q8->f16 cache budget)

The optional q8->f16 transpose cache defaults to unlimited (UINT64_MAX).
On tight discrete GPUs this cache can consume VRAM needed for expert residency.
Set to 0 to disable it entirely.

Default (unset): unlimited -- upstream behaviour unchanged.

Environment variables

Variable Purpose Default
DS4_ROCM_STREAM_FREE_RESERVE_GIB VRAM free-reserve floor (GiB) 16
DS4_ROCM_HOST_MAPPED_EXPERT_FALLBACK Set 0 to disable host-mapped expert overflow enabled
DS4_CUDA_Q8_F16_CACHE_MB q8->f16 cache limit in MiB (0=disable) unlimited
DS4_ROCM_REGISTERED_WEIGHTS Override GPU type detection (1=discrete, 0=integrated) auto-detect

Safety

  • All defaults preserve upstream behaviour. No env var means no change.
  • Discrete/integrated detection is automatic; APU users (Strix Halo) get the
    same raw-pointer path as before.
  • Host-mapped experts are automatically cleaned up on cache teardown.
  • The DS4_ROCM_HOST_MAPPED_EXPERT_FALLBACK=0 escape hatch restores
    the previous hard-fail path.
  • Single-file change (plus 1-line compat mapping) -- no impact on CUDA, Metal,
    or CPU backends.
  • No ds4.c or protocol changes -- distributed inference works identically.

Add host-mapped expert cache fallback for VRAM-constrained discrete GPUs
(e.g. AMD W7800 48GB). When the streaming expert cache is exhausted, copy
the expert weights to pinned host memory and map them into the GPU address
space via cudaHostRegister/cudaHostGetDevicePointer. Inference proceeds
(slower) instead of hard-failing.

Add automatic discrete/integrated GPU detection via cudaDeviceGetAttribute
so the registered-weights path activates without manual configuration.
DS4_ROCM_REGISTERED_WEIGHTS env var overrides detection for edge cases.

Add three env var tunables:

- DS4_ROCM_STREAM_FREE_RESERVE_GIB: override VRAM free-reserve floor
  (default 16 GiB).

- DS4_CUDA_Q8_F16_CACHE_MB: limit the optional q8->f16 cache to free
  VRAM for routed-expert residency (e.g. DS4_CUDA_Q8_F16_CACHE_MB=0).

- DS4_ROCM_HOST_MAPPED_EXPERT_FALLBACK: set to 0 to disable host-mapped
  expert overflow (default: enabled).

Host-mapped fallback can be disabled with DS4_ROCM_HOST_MAPPED_EXPERT_FALLBACK=0.
All defaults preserve upstream behavior.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant