Fix CUDA MoE router hardcoded to 256 experts by slackarea · Pull Request #466 · antirez/ds4

slackarea · 2026-06-27T07:32:14Z

The CUDA MoE router rejects any model whose routed-expert count isn't 256.
ds4_gpu_router_select_tensor / _batch_tensor both start with
if (n_expert != 256u || n_expert_used != 6u || fabsf(scale-1.5f)>1e-6f) return 0;
and the three router_select kernels hardcode 256 (logits/probs stride, top-k bound) and
the 1.5f routed-weight scale. On a DeepSeek-V4-Pro GGUF (384 experts), prefill fails with
"gpu layer N ffn batch encode failed" because the router-select call returns 0.
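
To make the failure mode concrete, here is a minimal host-side sketch of the guard quoted above. The function name router_guard_original is hypothetical; only the condition itself is taken from the code. It shows that a 384-expert model is rejected before any kernel is launched.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical stand-alone copy of the guard condition quoted above.
 * Returns 1 if the router call would proceed, 0 if it is rejected. */
int router_guard_original(uint32_t n_expert, uint32_t n_expert_used, float scale) {
    if (n_expert != 256u || n_expert_used != 6u || fabsf(scale - 1.5f) > 1e-6f) return 0;
    return 1;
}
```

With this condition, (256, 6, 1.5f) passes, while any call with n_expert == 384 returns 0 regardless of the other arguments.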

Fix (minimal, zero-regression for Flash):

- Parametrize the serial router_select_kernel with n_expert and scale.
- Relax both host-wrapper guards to accept 256 or 384; replace the 256u*sizeof(float)
  bias/logits/probs checks with n_expert.
- Dispatch: n_expert != 256 -> the parametrized serial kernel; 256 stays on the
  existing fast warp/parallel kernels (unchanged).
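
The relaxed guard and dispatch can be sketched on the host side roughly as follows. This is an illustration under assumptions, not the patch itself: the enum, the function name router_pick_path, the per-token byte-size parameters, and the use of "<" for the size comparison are all hypothetical. Only the accepted expert counts (256 or 384), the top-6 requirement, and the n_expert-based sizing come from the description.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical path selector; the real code launches CUDA kernels instead. */
enum router_path { ROUTER_REJECT = 0, ROUTER_FAST_256 = 1, ROUTER_SERIAL = 2 };

enum router_path router_pick_path(uint32_t n_expert, uint32_t n_expert_used,
                                  size_t bias_bytes, size_t logits_bytes,
                                  size_t probs_bytes) {
    /* Relaxed guard: 256 or 384 routed experts, still top-6 only. */
    if (n_expert != 256u && n_expert != 384u) return ROUTER_REJECT;
    if (n_expert_used != 6u) return ROUTER_REJECT;

    /* Size checks use n_expert instead of a hardcoded 256 (sizes are per token;
     * the exact comparison used by the real checks is an assumption). */
    size_t need = (size_t)n_expert * sizeof(float);
    if (bias_bytes < need || logits_bytes < need || probs_bytes < need) return ROUTER_REJECT;

    /* 256 keeps the existing fast kernels; anything else goes serial. */
    return n_expert == 256u ? ROUTER_FAST_256 : ROUTER_SERIAL;
}
```

Keeping the 256 branch on the untouched kernels is what makes the change zero-regression for Flash: only the previously rejected 384 case reaches new code.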

Regression (Linux, CUDA, H200, DeepSeek-V4-Flash q2-imatrix; clean main + this patch only):

- ds4_test --logprob-vectors: PASS
- Flash decode: 39.67 t/s (256 fast path untouched; baseline ~39 t/s)

Caveat / follow-up: the patch still assumes n_expert_used == 6 (the for j<6 loops plus the
guard). That is fine for Flash/Pro (top-6) but not for other top-k values (e.g. GLM-5.2
top-8). The serial path is correctness-first (1 thread/token); the warp/parallel kernels
could be parametrized for speed (they depend on shared sprob[256] and the 32x8 unroll).
Happy to extend if useful.
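
For the top-k follow-up, a CPU reference of a serial selection with k as a parameter might look like the sketch below. It is not part of this patch. The function name router_select_topk, the assumption that per-expert probabilities are already computed, and the normalize-then-scale weighting are all assumptions made for illustration; the kernel body is not shown in this PR.

```c
#include <stdint.h>

/* Serial top-k over one token's expert probabilities (hypothetical reference).
 * Writes k expert indices and weights; returns 0 on invalid arguments, 1 on success. */
int router_select_topk(const float *probs, uint32_t n_expert, uint32_t k, float scale,
                       int32_t *out_idx, float *out_w) {
    if (k == 0u || k > n_expert) return 0;

    float sum = 0.0f;
    for (uint32_t j = 0; j < k; j++) {          /* replaces the fixed j < 6 loop */
        int32_t best = -1;
        for (uint32_t e = 0; e < n_expert; e++) {
            int taken = 0;
            for (uint32_t p = 0; p < j; p++) {
                if (out_idx[p] == (int32_t)e) taken = 1;
            }
            if (taken) continue;
            /* Strict '>' keeps the lowest index on ties. */
            if (best < 0 || probs[e] > probs[best]) best = (int32_t)e;
        }
        out_idx[j] = best;
        out_w[j] = probs[best];
        sum += probs[best];
    }

    /* Assumed weighting: normalize the selected probabilities, then apply scale. */
    if (sum > 0.0f) {
        for (uint32_t j = 0; j < k; j++) out_w[j] = out_w[j] / sum * scale;
    }
    return 1;
}
```

A reference like this is O(k * n_expert) per token, which matches the correctness-first spirit of the serial path; it could also serve as an oracle when parametrizing the warp/parallel kernels later.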

The CUDA MoE router rejects any model whose routed-expert count isn't 256: ds4_gpu_router_select_tensor / _batch_tensor guard on n_expert==256 && scale==1.5, and the three router_select kernels hardcode 256 (logits/probs stride, top-k bound) and the 1.5 scale.

Fix (minimal, zero-regression for the 256 fast path):
- Parametrize the serial router_select_kernel with n_expert and scale.
- Relax both host-wrapper guards to accept 256 or 384; size the bias/logits/probs checks by n_expert instead of 256.
- Dispatch n_expert != 256 to the parametrized serial kernel; 256 stays on the existing fast warp/parallel kernels (unchanged).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

This was referenced Jun 28, 2026

Feasibility: a GLM-5.2 (GlmMoeDsa) backend for DS4 — runs the real 744B with correct logits #470

Open

DeepSeek-V4-Pro: CUDA-streaming + CPU paths produce wrong output / errors #471

Open

slackarea wants to merge 1 commit into antirez:main from slackarea:fix-cuda-router-expert-count
