Add AVX-512 kernel by SomeB1oody · Pull Request #100 · bluss/matrixmultiply

SomeB1oody · 2026-06-20T23:35:19Z

Add runtime-dispatched AVX-512 kernels (gated #[cfg(has_avx512)]). When a CPU reports avx512f and the compiler is 1.89+, the dispatcher selects them automatically; otherwise behaviour is unchanged. The MSRV is not affected.

Measured single-thread on an AMD Ryzen 9 9950X (Zen 5), AVX-512 is basically the same as the FMA kernel across all sizes (f32 and f64): Zen 5 sustains the same peak FMA throughput whether it issues 4x256-bit or 2x512-bit FMAs per cycle (~64 f32 / ~32 f64 flop/cycle), and it runs 512-bit code at full boost (no down-clock).

A ~2x speedup is expected on cores with two 512-bit FMA units but only 2x256-bit for AVX2, i.e. Intel server parts (Skylake-SP, Cascade Lake, Ice Lake-SP, Sapphire Rapids). I don't have such hardware, so benchmarks from anyone who does are very welcome.

SomeB1oody · 2026-06-21T00:42:52Z

Here are the things that are worth doing next:

Make the kernel-size bigger globally rather than widening only when AVX-512 is enabled. The trade-off is a slightly larger masked-output scratch buffer for every build.
AVX-512 cgemm/zgemm (complex) kernels.

SomeB1oody · 2026-06-23T17:29:51Z

AVX512 cgemm/zgemm support done.

Here's performance comparison:
sgemm (f32)

┌──────┬─────────┬───────────┬───────────┬─────────────┬─────────┐
│ size │ AVX2 ns │ AVX512 ns │ AVX2 GF/s │ AVX512 GF/s │ speedup │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│    4 │      37 │        46 │       3.5 │         2.8 │   0.80x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│    8 │      38 │        67 │      26.9 │        15.3 │   0.57x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   16 │     124 │       108 │      66.1 │        75.9 │   1.15x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   32 │     595 │       434 │     110.1 │       151.0 │   1.37x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   64 │   4,006 │     2,497 │     130.9 │       210.0 │   1.60x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│  127 │  29,382 │    17,032 │     139.4 │       240.5 │   1.73x │
└──────┴─────────┴───────────┴───────────┴─────────────┴─────────┘

dgemm (f64)

┌──────┬─────────┬───────────┬───────────┬─────────────┬─────────┐
│ size │ AVX2 ns │ AVX512 ns │ AVX2 GF/s │ AVX512 GF/s │ speedup │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│    8 │      53 │        35 │      19.3 │        29.3 │   1.51x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   16 │     191 │       119 │      42.9 │        68.8 │   1.61x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   32 │   1,049 │       612 │      62.5 │       107.1 │   1.71x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   64 │   7,650 │     4,270 │      68.5 │       122.8 │   1.79x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│  127 │  56,623 │    30,271 │      72.4 │       135.3 │   1.87x │
└──────┴─────────┴───────────┴───────────┴─────────────┴─────────┘

cgemm (complex f32)

┌──────┬─────────┬───────────┬───────────┬─────────────┬─────────┐
│ size │ AVX2 ns │ AVX512 ns │ AVX2 GF/s │ AVX512 GF/s │ speedup │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   16 │     974 │       360 │      33.6 │        91.0 │   2.71x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   32 │   6,979 │     2,238 │      37.6 │       117.1 │   3.12x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   64 │  52,935 │    15,433 │      39.6 │       135.9 │   3.43x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│  127 │ 407,009 │   117,148 │      40.3 │       139.9 │   3.47x │
└──────┴─────────┴───────────┴───────────┴─────────────┴─────────┘

zgemm (complex f64)

┌──────┬─────────┬───────────┬───────────┬─────────────┬─────────┐
│ size │ AVX2 ns │ AVX512 ns │ AVX2 GF/s │ AVX512 GF/s │ speedup │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   16 │   1,057 │       566 │      31.0 │        57.9 │   1.87x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   32 │   7,686 │     3,903 │      34.1 │        67.2 │   1.97x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   64 │  58,192 │    29,087 │      36.0 │        72.1 │   2.00x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│  127 │ 446,119 │   221,109 │      36.7 │        74.1 │   2.02x │
└──────┴─────────┴───────────┴───────────┴─────────────┴─────────┘

SomeB1oody · 2026-06-25T21:46:10Z

@bluss Could u please take a look at it and approve CI test? AVX512 really improves perf of large-matrix multiplication.

bluss

thanks for adding this. On my puny coding laptop, the benefit is not as great as in your results for sgemm and dgemm; maybe because more care has gone into the AVX2 kernel. But that's not a problem, just pointing to future improvements.

bluss · 2026-06-25T16:29:06Z

    unsafe fn pack_mr(kc: usize, mc: usize, pack: &mut [Self::Elem],
                      a: *const Self::Elem, rsa: isize, csa: isize)
    {
-        // safety: any CPU with avx512f also has avx2


This commit (commenting on the commit message) does not follow git's conventions for commits, it's missing the empty line after the subject.

https://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html

bluss · 2026-06-25T16:36:17Z

+macro_rules! cmat_mul {
+    ($modname:ident, $gemm:ident, $real:ty, $(($name:ident, $m:expr, $n:expr, $k:expr))+) => {
+        mod $modname {
+            use bencher::Bencher;


Please remove these new benchmarks, they are obsolete anyway.

Prefer the python benchmark script. Example use:

./benches/benchloop.py -t f32 f64 c32 c64 -s 256 --threads 0 2>/dev/null

while a bit janky in its own way, it produces a csv table of benchmark results for larger matrices. For example:

before pr

m k n layout type average_ns minimum_ns median_ns samples gflops nc kc mc threads 256 256 256 FCC f32 440629 436260 439157 7810 76.15121110957291 0 256 256 256 FCC f64 895523 882979 890395 1560 37.469090129455076 0 256 256 256 FCC c32 3417605 3388634 3411751 310 39.27245190711039 0 256 256 256 FCC c64 6852613 6733463 6849942 310 19.58635749603837 0

after pr

m k n layout type average_ns minimum_ns median_ns samples gflops nc kc mc threads 256 256 256 FCC f32 388369 385705 388099 7810 86.39832736392451 0 256 256 256 FCC f64 807061 792724 800311 1560 41.576079131565024 0 256 256 256 FCC c32 1925052 1892153 1918993 1560 69.72161167594435 0 256 256 256 FCC c64 3806188 3728752 3797703 310 35.26303167368506 0

(from the laptop where this is written.)

Sign in to view

+        with:
+          toolchain: ${{ matrix.rust }}
+          targets: ${{ matrix.target }}
+      - name: Set up Intel SDE


bluss · 2026-06-26T10:48:50Z

-    assert!(nr > 0 && nr <= 8);
-    assert!(mr * nr * size_of::<K::Elem>() <= 8 * 4 * 8);
-    assert!(K::align_to() <= 32);
+    // The bounds widen when AVX-512 is enabled (see KERNEL_MAX_* below).


unnecessary comment

Suggested change

// The bounds widen when AVX-512 is enabled (see KERNEL_MAX_* below).

bluss · 2026-06-26T10:51:24Z

 //!   - `fma`
 //!   - `avx`
 //!   - `sse2`
+//!   - `avx512` (Rust 1.89 or later)


feature is called avx512f, so we should use that name here. Don't need to mention rust version imo.

bluss · 2026-06-26T10:51:39Z

 //!
-//! Some features are enabled with later versions: from Rust 1.61 AArch64 NEON support.
+//! Some features are enabled with later versions: from Rust 1.61 AArch64 NEON
+//! support, and from Rust 1.89 x86/x86-64 AVX-512 support.


This is good.

bluss · 2026-06-26T10:52:10Z

    }}
 }

+// AVX-512 sgemm microkernel (16 kernel rows)


unnecessary comment. If it needs to be "said"; use the cfg.

bluss · 2026-06-26T10:55:01Z

+    let (rsc, csc) = if prefer_row_major_c { (rsc, csc) } else { (csc, rsc) };
+
+    // Compute A B. Load one 8-wide row of B, FMA against each broadcast A elem.
+    let mut bv = _mm512_loadu_pd(b);


why not request 64 byte alignment for this kernel and use _mm512_load_pd?

bluss · 2026-06-26T11:31:06Z

@bluss Could u please take a look at it and approve CI test? AVX512 really improves perf of large-matrix multiplication.

no need to nag about it, I've already approved a couple of runs and it's not my job to babysit random PRs. 😄

Use Intel Software Development Emulator (SDE) to ensure AVX512 is available. Use `MMTEST_FEATURE=avx512f` to force running on AVX512 kernel. Use `MMTEST_FAST_TEST=1` to avoid timeout.

The new `pack_avx512` function calls `pack_impl` like what `pack_avx2` does. It has no difference in performance compared with `pack_avx2`. It's for the consistency of the kernel architecture.

- Generate avx512 kernel tests with the existing `test_arch_kernels_x86!` macro. - Drop `cmat_mul!` benchmarks. - Use `avx512f` in crate docs and remove duplicate rust-version note. - Remove uncessary comments (gemm.rs and loopmacros.rs).

Request 64-byte alignment for the packed buffers (`KernelAvx512::align_to` and `KERNEL_MAX_ALIGN`). Use `_mm512_load_pd` / `_mm512_load_ps` for packed-B reads. Only sgemm and dgemm changes to 64-byte alignment. cgemm is capped at 32 and zgemm is left with it. Performance(single-thread, forced avx512f on 9950x), GFlop/s: 256 512 f32 281 -> 298 (+6%) 287 -> 303 (+6%) f64 124 -> 137 (+10%) 122 -> 135 (10%) The benefit of 64-byte alignment should be bigger on Intel server parts because unaligned 512-bit loads are more expensive there.

SomeB1oody · 2026-06-26T18:07:55Z

@bluss Thank u for ur code review:

Commit message format fixed
avx512f test now generated with the existing test_arch_kernels_x86! macro
Complex benchmarks removed.
Renamed the feature to avx512f and dropped the duplicate rust-version note (it's already documented further down).
Unnecessary comments removed.
64-byte alignment done for sgemm and dgemm. cgemm is structurally capped at 32 (min(MR, NR) * size_of == 32), so I left it (and zgemm) as is.

Single-thread, forced avx512f on the 9950X (Zen 5), GFlop/s:

type 256 512

f32 281 -> 298 (+6%) 287 -> 303 (+6%)

f64 124 -> 137 (+10%) 122 -> 135 (+10%)

Should help more on Intel server parts where unaligned 512-bit loads are costlier.

SomeB1oody changed the title ~~Add AVX-512 sgemm/dgemm kernel~~ Add AVX-512 kernel Jun 23, 2026

bluss reviewed Jun 26, 2026

View reviewed changes

SomeB1oody force-pushed the feat-avx512-support branch from da2bc53 to 9f90d62 Compare June 26, 2026 17:49

SomeB1oody added 9 commits June 26, 2026 11:00

Add AVX512 Kernel Support

826d81e

Use pack_avx2 and fold alpha into the final FMA

f854f72

Add CI to test AVX512 Kernel

95d39c3

Use Intel Software Development Emulator (SDE) to ensure AVX512 is available. Use `MMTEST_FEATURE=avx512f` to force running on AVX512 kernel. Use `MMTEST_FAST_TEST=1` to avoid timeout.

Upgrade SDE env to v5.0 to avoid download issue during CI

683a957

Rename the CI test so it's more detailed

b9eb0ba

Use a dedicated pack_avx512 for the AVX-512 kernels

bdde7fd

The new `pack_avx512` function calls `pack_impl` like what `pack_avx2` does. It has no difference in performance compared with `pack_avx2`. It's for the consistency of the kernel architecture.

Add AVX512 support for cgemm/zgemm

0837a14

Address review feedback on the avx512 kernels

e51f60f

- Generate avx512 kernel tests with the existing `test_arch_kernels_x86!` macro. - Drop `cmat_mul!` benchmarks. - Use `avx512f` in crate docs and remove duplicate rust-version note. - Remove uncessary comments (gemm.rs and loopmacros.rs).

SomeB1oody force-pushed the feat-avx512-support branch from 9f90d62 to 6d03be7 Compare June 26, 2026 18:02

Conversation

SomeB1oody commented Jun 20, 2026

Uh oh!

SomeB1oody commented Jun 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SomeB1oody commented Jun 23, 2026

Uh oh!

SomeB1oody commented Jun 25, 2026

Uh oh!

bluss left a comment

Choose a reason for hiding this comment

Uh oh!

bluss Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

bluss Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

This comment was marked as resolved.

Uh oh!

bluss Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

bluss Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

bluss Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

bluss Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

bluss Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

bluss commented Jun 26, 2026

Uh oh!

SomeB1oody commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

SomeB1oody commented Jun 21, 2026 •

edited

Loading

SomeB1oody commented Jun 26, 2026 •

edited

Loading