From d7bd0d5bbbcc838dd604867ea867396a1d2c6791 Mon Sep 17 00:00:00 2001
From: Changho Hwang
Date: Mon, 22 Jun 2026 21:26:25 +0000
Subject: [PATCH 1/2] Build the Qwen3 third-party reproduce package and final comparison report.

---
 examples/qwen3/reproduce.md | 96 +++++++++++++++++++++++++++++++++++++
 1 file changed, 96 insertions(+)
 create mode 100644 examples/qwen3/reproduce.md

diff --git a/examples/qwen3/reproduce.md b/examples/qwen3/reproduce.md
new file mode 100644
index 00000000..a1149bba
--- /dev/null
+++ b/examples/qwen3/reproduce.md
@@ -0,0 +1,96 @@
+# Qwen3 reproduce skeleton and evidence inventory
+
+This document is a reproduce skeleton, not a final comparison report. It lists the Qwen3 evidence that can be rerun from this branch and the evidence that is still missing.
+
+## Scope
+
+- ARK-only evidence belongs in this repo under `examples/qwen3/`.
+- The SGLang baseline and profiler harness stay local-only outside this repo.
+- Fixed SGLang numbers below are target context only. This branch does not include commands that launch, drive, or profile SGLang.
+- A missing `PERF_GATE` line is missing evidence. It is not a pass.
+- This branch makes no end-to-end speedup claim.
+
+## Hardware and build assumptions
+
+Use the same topology as ARK CUDA CI:
+
+- 8×A100-80GB.
+- CUDA-capable ARK build container.
+- Repository root checked out at the branch SHA under review.
+- Qwen3 component gates run after the normal ARK build.
+
+From the repository root, build ARK with the project workflow:
+
+```bash
+mkdir -p build
+cd build
+cmake ..
+make -j"$(nproc)" ut ark_py
+export ARK_ROOT="$PWD"
+export PYTHONPATH="$PWD/python${PYTHONPATH:+:$PYTHONPATH}"
+```
+
+## SGLang baseline context
+
+Local-only SGLang measurements from 2026-06-15 define the targets. They are not rerunnable from this branch.
+
+| Config | Prefill TTFT | Decode/token | Context |
+|--------|-------------:|-------------:|---------|
+| TP=8 matched regime | 47.41 ms | 4.26 ms | Primary comparison target. |
+| TP=1 SGLang best | 52.31 ms | 12.98 ms | Reference only. |
+
+The local TP=8 profile was decode-dominated and communication-bound:
+
+| Component | SGLang profile budget | Share | Status on this branch |
+|-----------|----------------------:|------:|-----------------------|
+| nccl / comm | 214.69 ms | 77.3% | Missing branch evidence. |
+| gemm | 38.08 ms | 13.7% | Missing branch evidence. |
+| attention | 20.93 ms | 7.5% | Missing branch evidence. |
+| norms_rope | 3.40 ms | 1.2% | Missing branch evidence. |
+
+## ARK-only evidence inventory at this branch SHA
+
+Current branch artifacts under `examples/qwen3/`:
+
+| Artifact | Evidence type | Rerun command from `build/` | Status |
+|----------|---------------|-----------------------------|--------|
+| `examples/qwen3/reproduce.md` | Reproduce skeleton and evidence inventory | Not applicable. | Present. |
+| `examples/qwen3/bench_*.py` | ARK component perf gates | None available on this branch. | Missing. |
+| `examples/qwen3/test_*.py` | Torch-reference equivalence tests | None available on this branch. | Missing. |
+
+Inventory check from `build/`:
+
+```bash
+find ../examples/qwen3 -maxdepth 1 \( -name 'bench_*.py' -o -name 'test_*.py' \) -print | sort
+```
+
+Expected result for this branch: no Qwen3 benchmark or equivalence-test files are listed.
+
+## Rerun slots for future ARK-only evidence
+
+When a component benchmark lands on this branch, rerun it from `build/` and require exactly one machine-readable line:
+
+```bash
+python3 ../examples/qwen3/bench_.py
+# PERF_GATE name= ark_ms= sglang_ms= ratio=
+```
+
+When a component equivalence test lands on this branch, rerun only that component from `build/`:
+
+```bash
+python3 -m pytest -q ../examples/qwen3/test_.py
+```
+
+These slots are placeholders until the branch contains the named files. They do not create evidence by themselves.
+
+## Missing final-comparison prerequisites
+
+The final Qwen3 ARK vs SGLang comparison remains blocked.
+
+- Q13 end-to-end prefill/decode sweep evidence is unavailable.
+- Q12A-PRE-VPR still needs a strict clean-environment A100 `PERF_GATE name=kv_cache_slot` rerun.
+- PR #270 is unmerged, so its all-reduce evidence is not branch evidence here.
+- Required Qwen3 component implementations, equivalence tests, and ARK-only perf gates are missing or blocked on this branch.
+- No branch artifact proves a full ARK Qwen3 op graph, TP=8 decode path, or ≥3.5× end-to-end speedup.
+
+A future final report must cite rerunnable repo artifacts at its branch SHA for every ARK correctness and performance claim.

From eaac6319f9dcbd0d3ffc2060517f2512d7a467da Mon Sep 17 00:00:00 2001
From: Changho Hwang
Date: Mon, 22 Jun 2026 21:46:32 +0000
Subject: [PATCH 2/2] Build the Qwen3 third-party reproduce package and final comparison report.

---
 examples/qwen3/reproduce.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/examples/qwen3/reproduce.md b/examples/qwen3/reproduce.md
index a1149bba..6bbf7843 100644
--- a/examples/qwen3/reproduce.md
+++ b/examples/qwen3/reproduce.md
@@ -12,7 +12,7 @@ This document is a reproduce skeleton, not a final comparison report. It lists t
 
 ## Hardware and build assumptions
 
-Use the same topology as ARK CUDA CI:
+Use the target local topology for the Qwen3 comparison:
 
 - 8×A100-80GB.
 - CUDA-capable ARK build container.
@@ -66,9 +66,9 @@ find ../examples/qwen3 -maxdepth 1 \( -name 'bench_*.py' -o -name 'test_*.py' \)
 
 Expected result for this branch: no Qwen3 benchmark or equivalence-test files are listed.
 
-## Rerun slots for future ARK-only evidence
+## Future ARK-only evidence contract (not runnable on this branch)
 
-When a component benchmark lands on this branch, rerun it from `build/` and require exactly one machine-readable line:
+No component benchmark files are present on this branch; when one lands, rerun it from `build/` and require exactly one machine-readable line:
 
 ```bash
 python3 ../examples/qwen3/bench_.py
@@ -87,9 +87,9 @@ These slots are placeholders until the branch contains the named files. They do
 
 The final Qwen3 ARK vs SGLang comparison remains blocked.
 
-- Q13 end-to-end prefill/decode sweep evidence is unavailable.
-- Q12A-PRE-VPR still needs a strict clean-environment A100 `PERF_GATE name=kv_cache_slot` rerun.
-- PR #270 is unmerged, so its all-reduce evidence is not branch evidence here.
+- End-to-end prefill/decode sweep evidence is not present in this branch.
+- A strict clean-environment A100 `PERF_GATE name=kv_cache_slot` rerun is not present in this branch.
+- All-reduce evidence is not present as a branch artifact here.
 - Required Qwen3 component implementations, equivalence tests, and ARK-only perf gates are missing or blocked on this branch.
 - No branch artifact proves a full ARK Qwen3 op graph, TP=8 decode path, or ≥3.5× end-to-end speedup.
 
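
Review note: the evidence contract in this series requires exactly one `PERF_GATE` line per benchmark run and treats a missing line as missing evidence, not a pass. A checker for that rule might look like the sketch below. It is not part of the patch and rests on assumptions: the gate line is taken to start with the literal token `PERF_GATE` followed by whitespace-separated `key=value` pairs, the `ark_ms`, `sglang_ms`, and `ratio` values are taken to be plain floats, and the function name `parse_perf_gate` is hypothetical. The sketch does not check how `ratio` relates to the two timings, because the patch does not define that.

```python
import sys

# Field names taken from the PERF_GATE template in reproduce.md.
REQUIRED_KEYS = ("name", "ark_ms", "sglang_ms", "ratio")


def parse_perf_gate(stdout: str) -> dict:
    """Return the fields of the single PERF_GATE line found in stdout.

    Raises ValueError when there is not exactly one PERF_GATE line,
    when a required field is absent, or when a numeric field is not a float.
    Assumed line shape (not confirmed by the patch):
        PERF_GATE name=<str> ark_ms=<float> sglang_ms=<float> ratio=<float>
    """
    gate_lines = []
    for line in stdout.splitlines():
        tokens = line.split()
        if tokens and tokens[0] == "PERF_GATE":
            gate_lines.append(tokens[1:])

    # Zero lines is missing evidence; more than one is ambiguous. Both fail.
    if len(gate_lines) != 1:
        raise ValueError(
            f"expected exactly one PERF_GATE line, found {len(gate_lines)}"
        )

    fields = dict(tok.split("=", 1) for tok in gate_lines[0] if "=" in tok)
    missing = [key for key in REQUIRED_KEYS if key not in fields]
    if missing:
        raise ValueError(f"PERF_GATE line lacks fields: {', '.join(missing)}")

    return {
        "name": fields["name"],
        "ark_ms": float(fields["ark_ms"]),
        "sglang_ms": float(fields["sglang_ms"]),
        "ratio": float(fields["ratio"]),
    }


if __name__ == "__main__":
    # Reads benchmark output from stdin; exits non-zero on any violation.
    try:
        print(parse_perf_gate(sys.stdin.read()))
    except ValueError as err:
        sys.exit(f"PERF_GATE check failed: {err}")
```

Piping a benchmark's stdout into such a script would turn "no gate line" into a hard failure rather than a silent pass, which is the behavior the Scope section asks for.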