Add support for allgather nvls algorithms #817

Open
Empyreus wants to merge 25 commits into main from rjsouza/nvls-allgather-pr

Conversation

Empyreus commented Jun 8, 2026

Contributor

Add support for testing with NVLS by introducing the MSCCLPP_FORCE_DISABLE_IB environment variable, which disables IB for NVLS runs.

Add the functions needed for the allgather implementation.

Implement the initial NVLS allgather algorithm.

Results with allgather_nvls_zero_copy.py:

image

@Empyreus Empyreus marked this pull request as ready for review June 8, 2026 20:21
Comment thread python/mscclpp/language/internal/operations.py Outdated
Comment thread src/core/executor/executor.cc Outdated
Comment thread src/core/include/execution_kernel.hpp Outdated
Comment thread src/core/include/execution_kernel.hpp Outdated

Copilot AI left a comment

Contributor

Pull request overview

This PR extends the MSCCL++ execution/runtime and Python language layer to enable an initial NVLS-based AllGather implementation, including a new NVLS “multicast store” executor op and a new environment knob to disable IB transport for NVLS-focused runs.

Changes:

  • Add a new executor op MULTI_STORE (plan opcode gstore) and device-side implementation to multicast/broadcast data via NVLS without reduction.
  • Add MSCCLPP_FORCE_DISABLE_IB to force-disable IB transport selection/registration (useful for MNNVL/NVLink-centric setups).
  • Update Python SwitchChannel broadcast (GroupStore) JSON emission to the src_buff/dst_buff schema and add an AllGather NVLS zero-copy test program.
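To make the MULTI_STORE semantics described above concrete, here is a minimal host-side sketch in Python. It is not MSCCL++ code: it models an NVLS multicast store as a write that lands at the same offset in every rank's buffer, and builds an AllGather out of one such store per rank. The names `multicast_store` and `allgather_via_multicast` are hypothetical and exist only for illustration; the real op runs on the device through the executor.

```python
def multicast_store(buffers, offset, data):
    """Model a multicast store: copy `data` to `offset` in every rank's buffer.

    No reduction is applied; the bytes are written as-is.
    """
    for buf in buffers:
        buf[offset:offset + len(data)] = data


def allgather_via_multicast(chunks):
    """Each rank broadcasts its chunk into slot `rank` of every output buffer."""
    n = len(chunks)
    chunk_len = len(chunks[0])
    if any(len(c) != chunk_len for c in chunks):
        raise ValueError("all ranks must contribute chunks of equal size")

    # One output buffer per rank, large enough to hold every rank's chunk.
    buffers = [bytearray(n * chunk_len) for _ in range(n)]
    for rank, chunk in enumerate(chunks):
        multicast_store(buffers, rank * chunk_len, chunk)
    return buffers


# Four simulated ranks, each contributing four bytes equal to its rank id.
chunks = [bytes([rank] * 4) for rank in range(4)]
outputs = allgather_via_multicast(chunks)
```

After the loop, every simulated rank holds the concatenation of all chunks in rank order, which is the AllGather result. Because the store carries no reduction, the sketch never needs to know the element type.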

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tools/npkit/npkit_trace_generator.py | Extends NPKIT op list to include the new MULTI_STORE executor op. |
| src/core/include/execution_common.hpp | Adds OperationType::MULTI_STORE to the executor op enum. |
| src/core/include/execution_kernel.hpp | Implements the MULTI_STORE device op handler and wires it into the executor dispatch. |
| src/core/executor/execution_plan.cc | Adds plan opcode mapping for gstore → OperationType::MULTI_STORE. |
| src/core/executor/executor.cc | Adds IB-disable env gating to transport selection and local memory registration. |
| src/core/env.cpp | Reads/logs the new MSCCLPP_FORCE_DISABLE_IB env var. |
| include/mscclpp/env.hpp | Documents the new MSCCLPP_FORCE_DISABLE_IB environment option. |
| python/mscclpp/language/internal/operations.py | Updates GroupStore.to_dict() to emit src_buff/dst_buff for NVLS broadcast ops. |
| python/mscclpp/language/tests/single_node/allgather_nvls_zero_copy.py | Adds a single-node NVLS-based AllGather program for testing/benchmarking. |
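The env gating summarized above can be pictured with a small Python sketch. This is an illustration under assumptions, not the code in executor.cc: the helper names `ib_force_disabled` and `select_transports` are hypothetical, the transport names are plain illustrative strings, and treating the value "1" as "enabled" is an assumption, since the accepted values of MSCCLPP_FORCE_DISABLE_IB are not stated in this thread.

```python
import os


def ib_force_disabled(environ=None):
    """Return True when the knob is set to "1" (assumed convention)."""
    environ = os.environ if environ is None else environ
    return environ.get("MSCCLPP_FORCE_DISABLE_IB", "0").strip() == "1"


def select_transports(available, environ=None):
    """Drop IB transports from the candidate list when the knob is set."""
    if ib_force_disabled(environ):
        return [name for name in available if not name.startswith("IB")]
    return list(available)


# Illustrative transport names only.
candidates = ["CudaIpc", "IB0", "IB1"]
with_knob = select_transports(candidates, {"MSCCLPP_FORCE_DISABLE_IB": "1"})
without_knob = select_transports(candidates, {})
```

Passing the environment as a parameter keeps the sketch testable without mutating the process environment.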

Comment thread src/core/include/execution_common.hpp
Comment thread src/core/include/execution_kernel.hpp Outdated

@Binyang2014 Binyang2014 left a comment

Contributor

LGTM, pls address copilot PR first

Comment thread src/core/executor/executor.cc Outdated
@Empyreus
Contributor Author

Updated Results (image: results)

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Comment thread src/core/include/execution_kernel.hpp
Comment thread src/core/include/execution_kernel.hpp Outdated
Comment on lines +582 to +586
// MULTI_STORE is a pure data-movement op: it broadcasts bytes from a local (unicast) source buffer
// to the NVLS multicast destination with no reduction. `multimem.st` writes raw register bits without
// any type conversion, so the data type is irrelevant here -- we move raw bytes using the widest
// available multimem store unit (16 -> 8 -> 4 bytes). This keeps the op fully type agnostic (works for
// any dtype, including uint8_t and FP8, on any arch that supports MULTI_STORE).
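The "16 -> 8 -> 4 bytes" fallback in the comment above can be sketched on the host side as a simple unit-selection routine. This Python snippet is an illustration only: `pick_store_unit` is a hypothetical name, and the rule that the unit must divide both addresses and the byte count is an assumption about how such a fallback is typically decided, not a description of the kernel's actual checks.

```python
def pick_store_unit(src_addr, dst_addr, nbytes, units=(16, 8, 4)):
    """Return the widest unit (in bytes) usable for the whole transfer.

    Assumed rule: the unit must divide the source address, the destination
    address, and the transfer size. Returns None when no unit qualifies.
    """
    for unit in units:  # units are listed widest first
        if src_addr % unit == 0 and dst_addr % unit == 0 and nbytes % unit == 0:
            return unit
    return None


wide = pick_store_unit(0, 0, 64)      # everything is 16-byte aligned
medium = pick_store_unit(0, 8, 24)    # destination is only 8-byte aligned
narrow = pick_store_unit(4, 4, 12)    # only 4-byte alignment holds
unaligned = pick_store_unit(0, 0, 6)  # size is not a multiple of 4
```

Since the choice depends only on addresses and byte counts, it is independent of the element type, which matches the type-agnostic intent stated in the comment.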
Comment thread python/mscclpp/language/channel.py Outdated
Comment thread python/mscclpp/language/internal/operations.py Outdated
Comment thread python/mscclpp/language/internal/operations.py Outdated
Comment thread python/mscclpp/language/channel.py Outdated
Comment thread python/mscclpp/language/channel.py Outdated
Comment thread python/mscclpp/language/channel.py
Comment thread python/mscclpp/language/channel.py
Comment thread python/mscclpp/language/channel.py Outdated
Comment thread src/core/include/execution_kernel.hpp Outdated

4 participants