Fix CUDA MoE router hardcoded to 256 experts #466

Open
slackarea wants to merge 1 commit into antirez:main from slackarea:fix-cuda-router-expert-count
@slackarea

The CUDA MoE router rejects any model whose routed-expert count isn't 256:
ds4_gpu_router_select_tensor / _batch_tensor do

    if (n_expert != 256u || n_expert_used != 6u || fabsf(scale-1.5f)>1e-6f) return 0;

and the three router_select kernels hardcode 256 (logits/probs stride, top-k bound) and
the 1.5f routed-weight scale. On a DeepSeek-V4-Pro GGUF (384 experts), prefill fails with
"gpu layer N ffn batch encode failed" (the router-select call returns 0).
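
To make the failure mode concrete, here is a minimal host-side sketch that wraps the guard condition quoted above in a standalone function. The function name router_guard_original is hypothetical and exists only for illustration; the condition itself is copied from the description. With n_expert = 384 the first clause alone rejects the call, whatever the other two arguments are.

```c
#include <math.h>

/* Hypothetical wrapper around the guard quoted in the description.
 * Returns 1 when the router would proceed, 0 when it bails out
 * (mirroring the `return 0` in the host wrappers). */
static int router_guard_original(unsigned n_expert,
                                 unsigned n_expert_used,
                                 float scale) {
    if (n_expert != 256u || n_expert_used != 6u ||
        fabsf(scale - 1.5f) > 1e-6f) return 0;
    return 1;
}
```

Calling router_guard_original(384u, 6u, 1.5f) returns 0, which is the condition that surfaces as the batch encode failure during prefill.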

Fix (minimal, zero-regression for Flash):

  • Parametrize the serial router_select_kernel with n_expert and scale.
  • Relax both host-wrapper guards to accept 256 or 384; replace the 256u*sizeof(float)
    bias/logits/probs checks with n_expert.
  • Dispatch: n_expert != 256 -> the parametrized serial kernel; 256 stays on the
    existing fast warp/parallel kernels (unchanged).
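
The three bullets above can be sketched on the CPU as follows. This is not the patch itself: it is a plain C reference for what a one-thread-per-token serial kernel might compute, under stated assumptions. All names (router_shape_supported, router_use_fast_path, router_select_serial, ROUTER_TOPK) are hypothetical. The routing math is also an assumption borrowed from DeepSeek-style routers: experts are ranked by probs + bias, and the selected weights are the unbiased probs, normalized and multiplied by scale. The PR text does not spell out these details.

```c
#define ROUTER_TOPK 6  /* the PR keeps n_expert_used fixed at 6 */

/* Hypothetical relaxed guard: accept the two expert counts named in the PR. */
static int router_shape_supported(unsigned n_expert, unsigned n_expert_used) {
    return (n_expert == 256u || n_expert == 384u) &&
           n_expert_used == ROUTER_TOPK;
}

/* Hypothetical dispatch: 1 = keep the existing 256-expert fast kernels,
 * 0 = fall back to the parametrized serial path. */
static int router_use_fast_path(unsigned n_expert) {
    return n_expert == 256u;
}

/* Serial selection for one token. n_expert and scale are runtime
 * parameters instead of the hardcoded 256 and 1.5f.
 * Assumes n_expert >= ROUTER_TOPK (the guard above ensures this).
 * sel and weight must each have room for ROUTER_TOPK entries. */
static void router_select_serial(const float *probs, const float *bias,
                                 unsigned n_expert, float scale,
                                 int *sel, float *weight) {
    float best[ROUTER_TOPK];
    for (int j = 0; j < ROUTER_TOPK; j++) { sel[j] = -1; best[j] = 0.0f; }

    for (unsigned e = 0; e < n_expert; e++) {
        float s = probs[e] + bias[e];  /* assumed: bias affects ranking only */
        /* Find the slot for s in the descending list; the strict '>'
         * keeps the lower expert index first on ties. */
        int pos = ROUTER_TOPK;
        while (pos > 0 && (sel[pos - 1] < 0 || s > best[pos - 1])) pos--;
        if (pos == ROUTER_TOPK) continue;  /* not among the top 6 */
        for (int j = ROUTER_TOPK - 1; j > pos; j--) {
            best[j] = best[j - 1];
            sel[j] = sel[j - 1];
        }
        best[pos] = s;
        sel[pos] = (int)e;
    }

    /* Assumed: weights come from the unbiased probs, normalized, then scaled. */
    float sum = 0.0f;
    for (int j = 0; j < ROUTER_TOPK; j++) sum += probs[sel[j]];
    for (int j = 0; j < ROUTER_TOPK; j++)
        weight[j] = sum > 0.0f ? scale * probs[sel[j]] / sum : 0.0f;
}
```

A caller would check router_shape_supported first, then branch on router_use_fast_path, so the 256-expert case never reaches the serial routine. The insertion scan costs O(n_expert * 6) per token, which matches the correctness-first intent of the serial path rather than the speed of the warp/parallel kernels.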

Regression (Linux, CUDA, H200, DeepSeek-V4-Flash q2-imatrix; clean main + this patch only):

  • ds4_test --logprob-vectors: PASS
  • Flash decode: 39.67 t/s (256 fast path untouched; baseline ~39 t/s)

Caveat / follow-up: still assumes n_expert_used == 6 (the for j<6 loops + guard) — fine
for Flash/Pro (top-6), not for other top-k (e.g. GLM-5.2 top-8). The serial path is
correctness-first (1 thread/token); the warp/parallel kernels could be parametrized for
speed (shared sprob[256] and the 32x8 unroll). Happy to extend if useful.

The CUDA MoE router rejects any model whose routed-expert count isn't
256: ds4_gpu_router_select_tensor / _batch_tensor guard on
n_expert==256 && scale==1.5, and the three router_select kernels
hardcode 256 (logits/probs stride, top-k bound) and the 1.5 scale.

Fix (minimal, zero-regression for the 256 fast path):
- Parametrize the serial router_select_kernel with n_expert and scale.
- Relax both host-wrapper guards to accept 256 or 384; size the
  bias/logits/probs checks by n_expert instead of 256.
- Dispatch n_expert != 256 to the parametrized serial kernel; 256
  stays on the existing fast warp/parallel kernels (unchanged).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>