support rocm7.2 by Binyang2014 · Pull Request #819 · microsoft/mscclpp

Binyang2014 · 2026-06-17T21:54:49Z

This pull request introduces support for ROCm 7.2 across the build system, CI pipelines, Docker images, and documentation, while also improving ROCm FP8 type selection and CUDA IPC memory handle management. It updates dependencies and configurations to ensure compatibility with ROCm 7.2, adds new options for native FP8 variants, and refines some benchmarking and internal memory handling logic.

Please note: there is an issue in ROCm 7.2 (ROCm 7.2 user library + ROCm 6.2 driver) when code executes in this order: allocate memory -> IPC communication -> allocate new memory -> free old memory.

github-advanced-security · 2026-06-17T22:44:58Z

You are seeing this message because GitHub Code Scanning has recently been set up for this repository, or this pull request contains the workflow file for the Code Scanning tool.

What Enabling Code Scanning Means:

- The 'Security' tab will display more code scanning analysis results (e.g., for the default branch).
- Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results.
- You will be able to see the analysis results for the pull request's branch on this overview once the scans have completed and the checks have passed.

For more information about GitHub Code Scanning, check out the documentation.

Copilot

Pull request overview

This pull request expands MSCCL++ support for ROCm 7.2 across packaging, CI, Docker images, and docs, while also refining ROCm FP8 native-type selection and updating CUDA IPC handle lifecycle management to better accommodate ROCm IPC/mapping limits.

Changes:

- Add ROCm 7.x Python extras/requirements and auto-detect ROCm major version during test deployment installs.
- Extend CI (GitHub Actions CodeQL + Azure Pipelines) and Docker build targets to include ROCm 7.2 images and test runs.
- Update ROCm FP8 selection to be controlled via a CMake option/compile definition; improve CUDA IPC handle caching/closing behavior and adjust fullmesh allreduce context/channel setup.
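To illustrate the first item, here is a minimal Python sketch of mapping a detected ROCm version to the matching install extra (`rocm6` or `rocm7`). The actual logic lives in the shell script `test/deploy/setup.sh` and is not shown in this conversation, so the function name `rocm_extra` and the assumption that the version arrives as a dotted string such as `7.2.0` are hypothetical.

```python
def rocm_extra(version: str) -> str:
    """Return the Python extra name for a dotted ROCm version string.

    Hypothetical helper: only the extra names `rocm6` and `rocm7` come from
    the pull request; how the version string is obtained is an assumption.
    """
    # Take the text before the first dot as the major version.
    major = int(version.strip().split(".")[0])
    if major not in (6, 7):
        raise ValueError(f"no extra defined for ROCm major version {major}")
    return f"rocm{major}"
```

The returned name could then be passed to an install such as `pip install ".[rocm7]"`. Failing loudly on an unknown major version avoids silently installing a mismatched `hip-python`.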

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| test/deploy/setup.sh | Auto-detect ROCm major version and install matching Python extra (`rocm6`/`rocm7`). |
| src/ext/collectives/include/allreduce/allreduce_fullmesh.hpp | Remove per-input channel cache member from the builder state. |
| src/ext/collectives/allreduce/allreduce_fullmesh.cu | Update fullmesh allreduce kernel launch bounds and rework context/channel initialization & keying. |
| src/core/registered_memory.cc | Strengthen CUDA IPC handle serialization validation and standardize unknown-transport error handling. |
| src/core/gpu_ipc_mem.cc | Refine runtime IPC open/close caching (HIP-focused) using `shared_ptr`/`weak_ptr` ownership. |
| python/requirements_rocm7.txt | Add ROCm 7 Python requirements set (incl. `hip-python>=7,<8`). |
| python/mscclpp_benchmark/tuner.py | Remove redundant `reset()` between correctness and timing in the tuning loop. |
| python/mscclpp_benchmark/correctness.py | Remove an extra pre-run barrier in correctness iterations. |
| python/mscclpp_benchmark/bench_collective.py | Add a ROCm 7.2 workaround to free cases between iterations and synchronize ranks. |
| pyproject.toml | Add `rocm7` extra with ROCm 7-compatible `hip-python` dependency. |
| include/mscclpp/gpu_data_types.hpp | Make ROCm native FP8 alias selection depend on a build-controlled macro (FNUZ vs HIP default). |
| docs/quickstart.md | Document ROCm 7.2 docker tag and `rocm7` install extra. |
| docker/build.sh | Add ROCm 7.2 base image target and related build metadata. |
| docker/base-dev-x.dockerfile | Adjust ROCm package installs and normalize extra selection parsing from `TARGET`. |
| CMakeLists.txt | Add `MSCCLPP_ROCM_USE_FNUZ_FP8` option and update default ROCm arch list. |
| .github/workflows/codeql-analysis.yml | Run CodeQL for both `rocm6.2` and `rocm7.2` images. |
| .azure-pipelines/ut.yml | Add ROCm 7.2 container matrix entry for unit tests. |
| .azure-pipelines/templates/rccl-test.yml | Add ROCm scratch-reclaim workaround env var for RCCL tests/benchmarks. |
| .azure-pipelines/rccl-api-test.yml | Add ROCm 7.2 container matrix entry for RCCL API tests. |
| .azure-pipelines/codecov.yml | Add ROCm 7.2 container matrix entry for coverage runs. |
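The `shared_ptr`/`weak_ptr` ownership pattern mentioned for `src/core/gpu_ipc_mem.cc` can be sketched in Python with weak references. This is only an analogy under stated assumptions, not the MSCCL++ implementation: `Mapping`, `HandleCache`, and the `closed` list are hypothetical names, and the real code opens and closes GPU IPC handles in C++. The idea shown is that the cache holds only weak references, so repeated opens of the same handle share one mapping, and the close action runs once when the last owner lets go.

```python
import weakref


class Mapping:
    """Stand-in for an opened IPC mapping (hypothetical type)."""

    def __init__(self, handle: bytes, on_close):
        self.handle = handle
        # Call on_close(handle) once, when this object is garbage collected.
        self._finalizer = weakref.finalize(self, on_close, handle)


class HandleCache:
    """Cache of open mappings that does not keep them alive by itself."""

    def __init__(self):
        # Values are weak references, similar in spirit to weak_ptr.
        self._open = weakref.WeakValueDictionary()
        # Records closed handles; a real cache would close the mapping here.
        self.closed = []

    def open(self, handle: bytes) -> Mapping:
        mapping = self._open.get(handle)
        if mapping is None:
            # No live owner: open a fresh mapping and remember it weakly.
            mapping = Mapping(handle, self.closed.append)
            self._open[handle] = mapping
        return mapping
```

Callers hold the returned `Mapping` strongly, which plays the role of `shared_ptr`. Once every caller drops its reference, the mapping is closed and a later `open` of the same handle creates a new one, which keeps the number of simultaneously open mappings bounded by what is actually in use.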

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Binyang2014 added 2 commits June 17, 2026 21:53

support rocm7.2

52619b0

WIP

86ed263

Binyang2014 added 5 commits June 18, 2026 00:08

WIP

1d86530

update

32287d8

update

0999af8

walkaround

d4b484f

Merge branch 'main' into binyli/rocm7

6fdb6d3

Binyang2014 marked this pull request as ready for review June 23, 2026 17:57

Binyang2014 requested review from a team and Copilot June 23, 2026 18:00

Copilot started reviewing on behalf of Binyang2014 June 23, 2026 18:00

Copilot AI reviewed Jun 23, 2026

Comment thread src/ext/collectives/allreduce/allreduce_fullmesh.cu

Comment thread src/core/gpu_ipc_mem.cc Outdated

mahdiehghazim reviewed Jun 23, 2026

Comment thread include/mscclpp/gpu_data_types.hpp Outdated

Comment thread CMakeLists.txt

Comment thread python/mscclpp_benchmark/bench_collective.py Outdated

Comment thread src/core/gpu_ipc_mem.cc

Binyang2014 and others added 3 commits June 23, 2026 16:58

Potential fix for pull request finding

fcde5b3

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

address comments

87093c9

fix

fb47060

Binyang2014 wants to merge 10 commits into main from binyli/rocm7

4 participants
