[AMD] improve dsr1 fp4 disagg by billishyahao · Pull Request #1584 · SemiAnalysisAI/InferenceX

billishyahao · 2026-05-29T05:18:16Z

Improve dsr1 fp4 disagg

Co-authored-by: billishyahao bill.he@amd.com
Co-authored-by: Duyi-Wang duyi.wang@amd.com

Note

Low Risk
Benchmark matrix and MoRI env defaults only; no auth, serving API, or application runtime logic changes.

Overview
Updates DeepSeek-R1 FP4 MI355X SGLang disaggregated MTP (dsr1-fp4-mi355x-sglang-disagg-mtp) benchmark coverage and the shared MoRI runtime env used by AMD PD jobs.

The config bumps the SGLang ROCm image to lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260529. For 8k/1k MTP 1×DEP8 + 1×DEP8 scenarios it retunes concurrency (e.g. 128/512 → 384/512 and 64/256 → 192/256), sets DECODE_MTP_SIZE=3 on those rows (including one case that was 2), and adds two new sweep points at conc 96/128 and 48/64. A separate 1k/1k MTP row moves decode DECODE_MTP_SIZE from 2 to 3.

benchmarks/multi_node/amd_utils/env.sh drops SGLANG_MORI_FP8_COMB=true in favor of MORI_COMBINE_DTYPE_PREFILL=fp8_direct_cast and MORI_COMBINE_DTYPE_DECODE=fp8 (consumed by the disagg SGLang launcher for prefill vs decode combine paths).

perf-changelog.yaml documents the image bump and new 128/256 concurrency coverage for this config key.

^{Reviewed by Cursor Bugbot for commit ae39e6b. Bugbot is set up for automated code reviews on this repo. Configure here.}

> Co-authored-by: billishyahao <bill.he@amd.com> > Co-authored-by: Duyi-Wang <duyi.wang@amd.com>

github-actions · 2026-05-29T05:18:25Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

claude · 2026-05-29T05:26:19Z

+          - "DECODE_MTP_SIZE=3"
+
+      # 1*DEP8 + 1*DEP8
+      - spec-decoding: "mtp"
+        conc-list: [ 64, 128 ]
+        prefill:
+          num-worker: 1
+          tp: 8
+          ep: 8
+          dp-attn: true
+          additional-settings:
+          - "PREFILL_NODES=1"
+        decode:
+          num-worker: 1
+          tp: 8
+          ep: 8
+          dp-attn: true
+          additional-settings:
+          - "DECODE_NODES=1"
+          - "DECODE_MTP_SIZE=3"


🔴 This PR introduces duplicate benchmark sweep points: after bumping the two pre-existing 1*DEP8 + 1*DEP8 blocks in the ISL=8192 search-space from DECODE_MTP_SIZE=1 to DECODE_MTP_SIZE=3 (lines 2052, 2071), and then adding a new block with conc-list [64, 128] and the same MTP=3 (lines 2073-2090), all three blocks now share byte-identical topology. conc=64 and conc=128 will each be benchmarked twice on the expensive mi355x-disagg multinode runner. Fix by either dropping 64 and 128 from the new block, or consolidating the three blocks into a single entry with conc-list: [64, 128, 256, 512].

Extended reasoning...

What the bug is

In .github/configs/amd-master.yaml, under dsr1-fp4-mi355x-sglang-disagg-mtp / scenarios.fixed-seq-len / isl: 8192, the post-PR YAML contains three 1*DEP8 + 1*DEP8 search-space entries that, after this PR's edits, have byte-identical prefill and decode configuration. The only thing that differs across them is the conc-list:

Lines conc-list DECODE_MTP_SIZE pre-PR DECODE_MTP_SIZE post-PR

~2034-2052 [128, 512] 1 3 (bumped by PR)

~2053-2071 [64, 256] 1 3 (bumped by PR)

~2073-2090 [64, 128] (new) 3 (added by PR)

All three entries are: spec-decoding: mtp, prefill num-worker=1 tp=8 ep=8 dp-attn=true PREFILL_NODES=1, decode num-worker=1 tp=8 ep=8 dp-attn=true DECODE_NODES=1 DECODE_MTP_SIZE=3.

Why existing code does not prevent it

The sweep generator (utils/matrix_logic/generate_sweep_configs.py, multinode branch) emits one matrix entry per search-space entry, passing the conc-list through unchanged. There is no cross-entry deduplication step: each (config, conc) pair from each entry becomes a real benchmark run. So overlapping conc values across two configurationally-identical entries produce two real, identical benchmark runs.

Step-by-step proof

Reader sees the three entries above. All three have identical prefill+decode configuration after the PR.

For the [128, 512] entry the matrix generator will emit benchmark runs at conc=128 and conc=512.

For the [64, 256] entry it will emit runs at conc=64 and conc=256.

For the new [64, 128] entry it will emit runs at conc=64 and conc=128.

Therefore conc=64 is benchmarked twice (entries 2 and 3) and conc=128 is benchmarked twice (entries 1 and 3), each with byte-identical sglang/topology settings.

Why this PR is responsible

Before this PR, the two pre-existing entries had DECODE_MTP_SIZE=1, so any new MTP=3 entry would not have duplicated them. This PR creates the duplication by doing both of:

(a) Bumping the two existing entries' DECODE_MTP_SIZE from 1 to 3 (lines 2052 and 2071 of the diff), homogenizing their topology with the new entry, and

(b) Adding a new entry with conc-list: [64, 128] and DECODE_MTP_SIZE=3 whose conc values overlap with the two pre-existing entries.

Either change alone would have been fine; the combination produces the duplication.

Impact

This runner is mi355x-disagg with multinode: true. Each benchmark run uses multiple multinode-disagg-grade nodes for isl=8192 osl=1024, so two duplicate sweep points are not free CI-time — they are real GPU-hours on a constrained multinode runner, producing no new data.

How to fix

One-line edits, pick either:

Drop 64 and 128 from the new block's conc-list. (But after dropping them, the new block has no novel conc values — implying the entry should simply be removed.)

Or consolidate the three blocks into a single entry with conc-list: [64, 128, 256, 512], which is cleaner and matches the apparent intent of running the full set at MTP=3.

github-actions · 2026-05-29T22:57:15Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26619796352
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26619796352

github-actions · 2026-05-29T23:38:28Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26666404342
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26666404342

github-actions · 2026-05-29T23:38:54Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26619796352
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26619796352

github-actions · 2026-05-29T23:41:21Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26666404342
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26666404342

github-actions · 2026-05-30T05:44:52Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26668569482
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26668569482

github-actions · 2026-05-30T11:08:11Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26668569482
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26668569482

github-actions · 2026-05-30T17:04:37Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26689344210
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26689344210

github-actions · 2026-05-30T20:42:56Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26689767233
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26689767233

[AMD] improve dsr1 fp4 disagg

d93b9c3

> Co-authored-by: billishyahao <bill.he@amd.com> > Co-authored-by: Duyi-Wang <duyi.wang@amd.com>

billishyahao requested a review from a team May 29, 2026 05:18

billishyahao requested review from chunfangamd and seungrokj as code owners May 29, 2026 05:18

github-project-automation Bot added this to InferenceMAX Board May 29, 2026

billishyahao requested review from 1am9trash and yctseng0211 as code owners May 29, 2026 05:18

billishyahao added AMD full-sweep-enabled labels May 29, 2026

add perf changelog

1739443

claude Bot reviewed May 29, 2026

View reviewed changes

arygupt mentioned this pull request May 29, 2026

fix(process_result): fail loudly on zero-throughput disagg runs (no more masked ZeroDivisionError) #1590

Open

arygupt mentioned this pull request May 29, 2026

ci(disagg): fail before writing result file + surface real failure class #1591

Open

billishyahao added 2 commits May 30, 2026 00:01

Merge remote-tracking branch 'inf/main' into amd/mi355x-dsfp4-may29

1e5e851

fix

bfb242c

billishyahao force-pushed the amd/mi355x-dsfp4-may29 branch from f425a6e to bfb242c Compare May 30, 2026 00:06

billishyahao added 2 commits May 30, 2026 16:44

fix regression

9a39f19

sync config

ae39e6b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AMD] improve dsr1 fp4 disagg#1584

[AMD] improve dsr1 fp4 disagg#1584
billishyahao wants to merge 6 commits into
mainfrom
amd/mi355x-dsfp4-may29

billishyahao commented May 29, 2026 •

edited by cursor Bot

Loading

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

claude Bot May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

github-actions Bot commented May 30, 2026

Uh oh!

github-actions Bot commented May 30, 2026

Uh oh!

github-actions Bot commented May 30, 2026

Uh oh!

github-actions Bot commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Lines	conc-list	DECODE_MTP_SIZE pre-PR	DECODE_MTP_SIZE post-PR
~2034-2052	`[128, 512]`	`1`	`3` (bumped by PR)
~2053-2071	`[64, 256]`	`1`	`3` (bumped by PR)
~2073-2090	`[64, 128]`	(new)	`3` (added by PR)

Conversation

billishyahao commented May 29, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

claude Bot May 29, 2026

Choose a reason for hiding this comment

What the bug is

Why existing code does not prevent it

Step-by-step proof

Why this PR is responsible

Impact

How to fix

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

github-actions Bot commented May 30, 2026

Uh oh!

github-actions Bot commented May 30, 2026

Uh oh!

github-actions Bot commented May 30, 2026

Uh oh!

github-actions Bot commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

billishyahao commented May 29, 2026 •

edited by cursor Bot

Loading