feat(examples): L3 ring allreduce (chunked RS+AG, a2a3 verified) by georgebisbas · Pull Request #975 · hw-native-sys/simpler

georgebisbas · 2026-06-02T14:36:58Z

Summary

Reopens the work from #972 (that PR cannot be reopened after branch rebase).

Adds examples/workers/l3/allreduce_ring_distributed/ — chunked ring allreduce
(RS + AG on a logical ring), separate from mesh allreduce_distributed/.

Closes / supersedes: #972

Ring uses +10624B HCCL window (chunked + per-round signals); mesh uses 4096B.

Algorithm

Stage-in: P chunk slots in HCCL window
Reduce-scatter: (P-1) ring steps, per-round TNOTIFY/TWAIT
Allgather: (P-1) ring steps
Recv from left neighbour chunks[] via CommRemotePtr after each barrier
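
The schedule above can be sketched as a host-side simulation in plain Python. This is not the AIV kernel: the per-round TNOTIFY/TWAIT barrier is stood in for by a snapshot taken before each step, reads through CommRemotePtr become ordinary list reads, and the chunk-index convention (rank r publishes chunk (r - s) mod P in reduce-scatter step s) is one common choice that the PR text does not confirm. The name `ring_allreduce` is hypothetical.

```python
from typing import List


def ring_allreduce(inputs: List[List[float]]) -> List[List[float]]:
    """Simulate a chunked ring allreduce (sum) over P in-process 'ranks'."""
    p = len(inputs)
    n = len(inputs[0])
    if p < 2 or n % p != 0:
        raise ValueError("need at least 2 ranks and a length divisible by P")
    chunk = n // p

    # Stage-in: each rank splits its input into P chunk slots.
    slots = [[list(x[c * chunk:(c + 1) * chunk]) for c in range(p)] for x in inputs]

    # Reduce-scatter: (P-1) steps; rank r accumulates the chunk its left
    # neighbour published in this step.
    for s in range(p - 1):
        # Snapshot plays the role of the per-round barrier (assumption).
        published = [slots[r][(r - s) % p][:] for r in range(p)]
        for r in range(p):
            left = (r - 1) % p
            c = (left - s) % p
            slots[r][c] = [a + b for a, b in zip(slots[r][c], published[left])]

    # After reduce-scatter, rank r holds the fully reduced chunk (r + 1) % P.
    # Allgather: (P-1) steps; rank r copies the reduced chunk from its left.
    for s in range(p - 1):
        published = [slots[r][(r + 1 - s) % p][:] for r in range(p)]
        for r in range(p):
            left = (r - 1) % p
            c = (left + 1 - s) % p
            slots[r][c] = published[left]

    # Stage-out: concatenate the P chunk slots back into one vector per rank.
    return [[v for c in range(p) for v in slots[r][c]] for r in range(p)]
```

With four ranks and eight elements per rank, every returned vector equals the elementwise sum of the four inputs.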

Test plan

python examples/workers/l3/allreduce_ring_distributed/main.py -p a2a3 -d 5-6
python examples/workers/l3/allreduce_ring_distributed/main.py -p a2a3 -d 0-3
python examples/workers/l3/allreduce_distributed/main.py -p a2a3 -d 0-3
CI green

coderabbitai · 2026-06-02T14:37:12Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 99a511ac-ba13-4618-9f3b-6ff150b46b87

Walkthrough

This pull request adds a complete new L3 worker example implementing distributed ring AllReduce with chunked reduce-scatter and allgather phases. It includes AICORE kernel primitives, a full kernel implementation, orchestration wiring, Python runtime setup with golden-output validation, and integration tests.

Changes

Ring Allreduce Distributed Example

Layer / File(s)	Summary
Documentation and Package Setup `examples/workers/l3/README.md`, `examples/workers/l3/allreduce_ring_distributed/__init__.py`	Example entry added to the L3 README table, and package structure created with license header and relative import marker.
Ring Allreduce Primitives and Helpers `examples/workers/l3/allreduce_ring_distributed/kernels/aiv/allreduce_ring_common.hpp`	Compile-time constants (`kAllReduceCount`, `kMaxSupportedRanks`, `kChunkMax`), dynamic tensor/tile type aliases, and inline AICORE helpers for remote pointer computation, peer barrier synchronization, chunk copy with MTE2/MTE3 flag ordering, left-neighbor receipt, and scratch memory layout binding.
AIV Kernel Reduce-Scatter and Allgather Logic `examples/workers/l3/allreduce_ring_distributed/kernels/aiv/allreduce_ring_kernel.cpp`	Kernel entrypoint validates parameters, binds scratch layout, stages input into chunks, executes (nranks−1) reduce-scatter steps with remote exchange and tiled accumulation, then (nranks−1) allgather steps to disseminate reduced chunks, and stages the final result back to output with pipeline synchronization.
Orchestration and Task Submission `examples/workers/l3/allreduce_ring_distributed/kernels/orchestration/allreduce_ring_orch.cpp`	Orchestration config specifies 5 expected arguments; orchestration entry extracts tensors and scalars, packages them into an Arg payload, and submits the AIV task.
Python Compilation and Helper Configuration `examples/workers/l3/allreduce_ring_distributed/main.py` (lines 1–134)	Top-level constants define `ALLREDUCE_COUNT` and sizing calculations; `scratch_float_elems()` and `parse_device_range()` validate inputs with divisibility and rank-count bounds; `build_chip_callable()` compiles kernel and orchestration binaries and returns a `ChipCallable` hierarchy; `expected_output()` generates golden output.
Runtime Execution and Golden Validation `examples/workers/l3/allreduce_ring_distributed/main.py` (lines 136–246)	`run()` allocates per-rank shared-memory tensors, initializes the worker, defines an orchestration function that allocates ring domains and scratch buffers, wires tensor/scalar arguments, submits the execution DAG, validates each rank's result against golden output (1e-3 tolerance), and ensures worker cleanup. `main()` provides CLI parsing for platform, device range, and optional PTO ISA commit.
Integration Tests `examples/workers/l3/allreduce_ring_distributed/test_allreduce.py`	Two parameterized pytest tests: `test_ring_allreduce_distributed()` for 2-device and `test_ring_allreduce_distributed_multi_rank()` for 4-device configurations, both asserting `run()` returns exit code 0.
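
As a rough illustration of the device-range parsing summarised in the table, the sketch below turns a `-d` value such as `5-6` into a list of device IDs and rejects unsupported rank counts. It assumes the range is inclusive (consistent with `-d 5-6` being the 2-device case) and assumes bounds of 2 to 16 ranks, the example range the review comment mentions; the actual `parse_device_range()` in `main.py` may differ.

```python
from typing import List

MIN_RANKS = 2   # assumed lower bound, not taken from the PR code
MAX_RANKS = 16  # assumed upper bound, not taken from the PR code


def parse_device_range(spec: str) -> List[int]:
    """Parse an inclusive 'start-end' range such as '0-3' into [0, 1, 2, 3]."""
    start_text, sep, end_text = spec.partition("-")
    if not sep:
        raise ValueError(f"expected 'start-end', got {spec!r}")
    start, end = int(start_text), int(end_text)  # int() raises ValueError on junk
    if start < 0 or end < start:
        raise ValueError(f"invalid device range {spec!r}")
    device_ids = list(range(start, end + 1))
    if not MIN_RANKS <= len(device_ids) <= MAX_RANKS:
        raise ValueError(
            f"{len(device_ids)} ranks requested; supported range is "
            f"{MIN_RANKS}..{MAX_RANKS}"
        )
    return device_ids
```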

Sequence Diagram

sequenceDiagram
  participant Python as Python Runtime
  participant Worker
  participant Orch as Orchestration Layer
  participant AIVKernel as AIV Kernel
  participant Rank0
  participant Rank1
  Python->>Worker: Initialize worker
  Python->>Python: Allocate per-rank input/output tensors
  Python->>Python: Allocate ring domain window and scratch buffer
  Python->>Orch: Submit orchestration DAG with tensor/scalar args
  Orch->>AIVKernel: rt_submit_aiv_task (3 tensors + 2 scalars)
  AIVKernel->>Rank0: Validate nranks, bind scratch layout
  Rank0->>Rank0: Stage input into chunk slots
  par Reduce-Scatter Phase
    Rank0->>Rank1: Publish chunk, barrier signal
    Rank1-->>Rank0: Send left-neighbor chunk
    Rank0->>Rank0: Load/accumulate tile with MTE flags
  end
  par Allgather Phase
    Rank0->>Rank1: Publish reduced chunk for dissemination
    Rank1-->>Rank0: Send chunk from previous round
    Rank0->>Rank0: Store chunk to output slot
  end
  AIVKernel->>AIVKernel: Stage concatenated chunks to output tensor
  AIVKernel->>Worker: Return
  Python->>Python: Compute golden expected output
  Python->>Python: Validate each rank output vs golden (1e-3 tolerance)
  Python->>Worker: Close worker
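
The validation step at the end of the diagram can be sketched as follows. The 1e-3 tolerance comes from the walkthrough; the golden itself is assumed here to be the elementwise sum of all ranks' inputs, which the PR does not spell out (it only says the golden matches the mesh example). The helper names `golden_sum`, `max_abs_diff`, and `validate` are hypothetical.

```python
from typing import List, Sequence


def golden_sum(inputs: Sequence[Sequence[float]]) -> List[float]:
    """Assumed golden: elementwise sum across all ranks' inputs."""
    return [sum(column) for column in zip(*inputs)]


def max_abs_diff(actual: Sequence[float], golden: Sequence[float]) -> float:
    """Largest absolute elementwise difference (0.0 for empty vectors)."""
    return max((abs(a - g) for a, g in zip(actual, golden)), default=0.0)


def validate(
    outputs: Sequence[Sequence[float]],
    inputs: Sequence[Sequence[float]],
    tol: float = 1e-3,
) -> bool:
    """Return True if every rank's output matches the golden within tol."""
    golden = golden_sum(inputs)
    return all(
        len(out) == len(golden) and max_abs_diff(out, golden) <= tol
        for out in outputs
    )
```

Reporting the maximum difference per rank, rather than only pass or fail, makes failures like the "max golden diff 99 on second chunk" noted in the commit history easier to localise.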

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

In rings we gather, chunk by chunk,
Each rank reduces with a hunch,
Scatter down, then gather round,
AllReduce magic, distributed sound! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 35.29% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the main change: adding a new L3 ring allreduce example implementation with chunked reduce-scatter and allgather, including platform verification. It accurately reflects the primary purpose of the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description clearly relates to the changeset, describing the addition of a ring allreduce example to examples/workers/l3/.

gemini-code-assist

Code Review

This pull request introduces a distributed ring AllReduce implementation, featuring chunked reduce-scatter and allgather algorithms. The code review feedback highlights several critical optimization and correctness improvements. First, the exchange buffer is completely unused by remote ranks and should be removed along with its redundant memory copies across the kernel, helper functions, and Python host code to improve performance and reduce scratch memory usage. Second, the kernel must explicitly zero-initialize the local signal slots to prevent undefined behavior, as device memory is not guaranteed to be zero-initialized. Finally, the unnecessary `from __future__ import annotations` import in `main.py` should be removed.

New L3 example separate from mesh allreduce_distributed: stage-in, (P-1) reduce-scatter and (P-1) allgather ring rounds over HCCL window chunks with per-round TNOTIFY/TWAIT barriers. Same golden as mesh. P=2/P=4 pytest; default CLI devices 0-3.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/workers/l3/allreduce_ring_distributed/main.py`:
- Around line 136-145: Validate the device_ids input at the top of run(): check
that device_ids is non-empty and that nranks = len(device_ids) is within the
supported range (e.g., between 2 and 16 as the example expects); if not, raise a
ValueError with a clear message so downstream calls (like scratch_float_elems)
don't hit ZeroDivisionError or unsupported configurations. Add this check
directly in run() before calling scratch_float_elems() or computing window_size,
referencing run() and scratch_float_elems() in the message so the caller can see
which entrypoint enforces the constraint.
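
A minimal sketch of the kind of guard this comment asks for is shown below. The helper name `validate_device_ids` and the bounds 2 and 16 are assumptions taken from the comment's "e.g." wording, not from the merged code; in the example it would be called at the top of `run()` before any sizing is computed.

```python
from typing import Sequence

MIN_RANKS = 2   # assumed, per the review comment's example range
MAX_RANKS = 16  # assumed, per the review comment's example range


def validate_device_ids(device_ids: Sequence[int]) -> int:
    """Return nranks, or raise ValueError before any sizing is computed."""
    nranks = len(device_ids)
    if nranks == 0:
        raise ValueError(
            "run(): device_ids must not be empty "
            "(scratch_float_elems() would divide by zero)"
        )
    if not MIN_RANKS <= nranks <= MAX_RANKS:
        raise ValueError(
            f"run(): got {nranks} ranks; scratch_float_elems() supports "
            f"{MIN_RANKS}..{MAX_RANKS}"
        )
    return nranks
```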

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a173f9be-0a2b-4f97-bf62-6560ebde9a86

📥 Commits

Reviewing files that changed from the base of the PR and between d61dee4 and 75c3152.

📒 Files selected for processing (7)

examples/workers/l3/README.md
examples/workers/l3/allreduce_ring_distributed/__init__.py
examples/workers/l3/allreduce_ring_distributed/kernels/aiv/allreduce_ring_common.hpp
examples/workers/l3/allreduce_ring_distributed/kernels/aiv/allreduce_ring_kernel.cpp
examples/workers/l3/allreduce_ring_distributed/kernels/orchestration/allreduce_ring_orch.cpp
examples/workers/l3/allreduce_ring_distributed/main.py
examples/workers/l3/allreduce_ring_distributed/test_allreduce.py

Drop RingZeroSignals (per-round barrier rows used once; zeroing raced peer notify and caused AICPU 507018 timeout). Recv via left neighbour chunks[] after barrier, not local exchange mirror (max golden diff 99 on second chunk). Size scratch CommBufferSpec to (P+1)*chunk elements. Align ring example with mesh L3 style: single allreduce_ring_kernel.cpp (no common header), phase banners, and matching orch/main.py comments.
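
The scratch sizing mentioned in this commit, (P+1)*chunk elements, can be illustrated with a short sketch. The element count of 256 is a made-up placeholder (the value of `ALLREDUCE_COUNT` is not given in the PR text), and the function body is an assumption about how `scratch_float_elems()` might combine the divisibility check with the sizing formula.

```python
ALLREDUCE_COUNT = 256  # hypothetical placeholder; the real constant is not shown


def scratch_float_elems(nranks: int, count: int = ALLREDUCE_COUNT) -> int:
    """Number of float elements in the scratch buffer: (P + 1) * chunk."""
    if nranks <= 0:
        raise ValueError("nranks must be positive")
    if count % nranks != 0:
        raise ValueError(f"count {count} is not divisible by nranks {nranks}")
    chunk = count // nranks  # elements per ring chunk
    return (nranks + 1) * chunk
```

Under these assumed numbers, four ranks give a chunk of 64 elements and a scratch buffer of 320 elements.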

Mirror parse_device_range() so pytest/CLI callers cannot pass an empty list or unsupported rank count into scratch_float_elems().

gemini-code-assist Bot reviewed Jun 2, 2026

georgebisbas force-pushed the feat/l3-ring-allreduce-skeleton branch from 75c3152 to 497ae58 on June 2, 2026 14:41

coderabbitai Bot reviewed Jun 2, 2026

Comment thread examples/workers/l3/allreduce_ring_distributed/main.py

georgebisbas force-pushed the feat/l3-ring-allreduce-skeleton branch from 497ae58 to 690efbc on June 2, 2026 15:12

georgebisbas added 2 commits June 2, 2026 17:17

Fix: ruff-format main.py for ring allreduce example

a43cd32

Fix: validate device_ids rank count in ring allreduce run()

9c2236a

georgebisbas force-pushed the feat/l3-ring-allreduce-skeleton branch from 3a9b9b8 to 9c2236a on June 4, 2026 15:37

ChaoWao approved these changes Jun 8, 2026

ChaoWao merged commit 607f78a into hw-native-sys:main Jun 8, 2026
31 checks passed

ChaoWao merged 4 commits into hw-native-sys:main from georgebisbas:feat/l3-ring-allreduce-skeleton

2 participants
