Skip to content

[AMD] improve dsr1 fp4 disagg#1584

Open
billishyahao wants to merge 6 commits into
mainfrom
amd/mi355x-dsfp4-may29
Open

[AMD] improve dsr1 fp4 disagg#1584
billishyahao wants to merge 6 commits into
mainfrom
amd/mi355x-dsfp4-may29

Conversation

@billishyahao
Copy link
Copy Markdown
Collaborator

@billishyahao billishyahao commented May 29, 2026

Improve dsr1 fp4 disagg

Co-authored-by: billishyahao bill.he@amd.com
Co-authored-by: Duyi-Wang duyi.wang@amd.com


Note

Low Risk
Benchmark matrix and MoRI env defaults only; no auth, serving API, or application runtime logic changes.

Overview
Updates DeepSeek-R1 FP4 MI355X SGLang disaggregated MTP (dsr1-fp4-mi355x-sglang-disagg-mtp) benchmark coverage and the shared MoRI runtime env used by AMD PD jobs.

The config bumps the SGLang ROCm image to lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260529. For 8k/1k MTP 1×DEP8 + 1×DEP8 scenarios it retunes concurrency (e.g. 128/512 → 384/512 and 64/256 → 192/256), sets DECODE_MTP_SIZE=3 on those rows (including one case that was 2), and adds two new sweep points at conc 96/128 and 48/64. A separate 1k/1k MTP row moves decode DECODE_MTP_SIZE from 2 to 3.

benchmarks/multi_node/amd_utils/env.sh drops SGLANG_MORI_FP8_COMB=true in favor of MORI_COMBINE_DTYPE_PREFILL=fp8_direct_cast and MORI_COMBINE_DTYPE_DECODE=fp8 (consumed by the disagg SGLang launcher for prefill vs decode combine paths).

perf-changelog.yaml documents the image bump and new 128/256 concurrency coverage for this config key.

Reviewed by Cursor Bugbot for commit ae39e6b. Bugbot is set up for automated code reviews on this repo. Configure here.

> Co-authored-by: billishyahao <bill.he@amd.com>
> Co-authored-by: Duyi-Wang <duyi.wang@amd.com>
@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment on lines +2071 to +2090
- "DECODE_MTP_SIZE=3"

# 1*DEP8 + 1*DEP8
- spec-decoding: "mtp"
conc-list: [ 64, 128 ]
prefill:
num-worker: 1
tp: 8
ep: 8
dp-attn: true
additional-settings:
- "PREFILL_NODES=1"
decode:
num-worker: 1
tp: 8
ep: 8
dp-attn: true
additional-settings:
- "DECODE_NODES=1"
- "DECODE_MTP_SIZE=3"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 This PR introduces duplicate benchmark sweep points: after bumping the two pre-existing 1*DEP8 + 1*DEP8 blocks in the ISL=8192 search-space from DECODE_MTP_SIZE=1 to DECODE_MTP_SIZE=3 (lines 2052, 2071), and then adding a new block with conc-list [64, 128] and the same MTP=3 (lines 2073-2090), all three blocks now share byte-identical topology. conc=64 and conc=128 will each be benchmarked twice on the expensive mi355x-disagg multinode runner. Fix by either dropping 64 and 128 from the new block, or consolidating the three blocks into a single entry with conc-list: [64, 128, 256, 512].

Extended reasoning...

What the bug is

In .github/configs/amd-master.yaml, under dsr1-fp4-mi355x-sglang-disagg-mtp / scenarios.fixed-seq-len / isl: 8192, the post-PR YAML contains three 1*DEP8 + 1*DEP8 search-space entries that, after this PR's edits, have byte-identical prefill and decode configuration. The only thing that differs across them is the conc-list:

Lines conc-list DECODE_MTP_SIZE pre-PR DECODE_MTP_SIZE post-PR
~2034-2052 [128, 512] 1 3 (bumped by PR)
~2053-2071 [64, 256] 1 3 (bumped by PR)
~2073-2090 [64, 128] (new) 3 (added by PR)

All three entries are: spec-decoding: mtp, prefill num-worker=1 tp=8 ep=8 dp-attn=true PREFILL_NODES=1, decode num-worker=1 tp=8 ep=8 dp-attn=true DECODE_NODES=1 DECODE_MTP_SIZE=3.

Why existing code does not prevent it

The sweep generator (utils/matrix_logic/generate_sweep_configs.py, multinode branch) emits one matrix entry per search-space entry, passing the conc-list through unchanged. There is no cross-entry deduplication step: each (config, conc) pair from each entry becomes a real benchmark run. So overlapping conc values across two configurationally-identical entries produce two real, identical benchmark runs.

Step-by-step proof

  1. Reader sees the three entries above. All three have identical prefill+decode configuration after the PR.
  2. For the [128, 512] entry the matrix generator will emit benchmark runs at conc=128 and conc=512.
  3. For the [64, 256] entry it will emit runs at conc=64 and conc=256.
  4. For the new [64, 128] entry it will emit runs at conc=64 and conc=128.
  5. Therefore conc=64 is benchmarked twice (entries 2 and 3) and conc=128 is benchmarked twice (entries 1 and 3), each with byte-identical sglang/topology settings.

Why this PR is responsible

Before this PR, the two pre-existing entries had DECODE_MTP_SIZE=1, so any new MTP=3 entry would not have duplicated them. This PR creates the duplication by doing both of:

  • (a) Bumping the two existing entries' DECODE_MTP_SIZE from 1 to 3 (lines 2052 and 2071 of the diff), homogenizing their topology with the new entry, and
  • (b) Adding a new entry with conc-list: [64, 128] and DECODE_MTP_SIZE=3 whose conc values overlap with the two pre-existing entries.

Either change alone would have been fine; the combination produces the duplication.

Impact

This runner is mi355x-disagg with multinode: true. Each benchmark run uses multiple multinode-disagg-grade nodes for isl=8192 osl=1024, so two duplicate sweep points are not free CI-time — they are real GPU-hours on a constrained multinode runner, producing no new data.

How to fix

One-line edits, pick either:

  • Drop 64 and 128 from the new block's conc-list. (But after dropping them, the new block has no novel conc values — implying the entry should simply be removed.)
  • Or consolidate the three blocks into a single entry with conc-list: [64, 128, 256, 512], which is cleaner and matches the apparent intent of running the full set at MTP=3.

@github-actions
Copy link
Copy Markdown
Contributor

@github-actions
Copy link
Copy Markdown
Contributor

@github-actions
Copy link
Copy Markdown
Contributor

@github-actions
Copy link
Copy Markdown
Contributor

@billishyahao billishyahao force-pushed the amd/mi355x-dsfp4-may29 branch from f425a6e to bfb242c Compare May 30, 2026 00:06
@github-actions
Copy link
Copy Markdown
Contributor

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

@github-actions
Copy link
Copy Markdown
Contributor

@github-actions
Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant