Skip to content

Add AVX-512 kernel#100

Open
SomeB1oody wants to merge 9 commits into
bluss:masterfrom
SomeB1oody:feat-avx512-support
Open

Add AVX-512 kernel#100
SomeB1oody wants to merge 9 commits into
bluss:masterfrom
SomeB1oody:feat-avx512-support

Conversation

@SomeB1oody

Copy link
Copy Markdown

Add runtime-dispatched AVX-512 kernels (gated #[cfg(has_avx512)]). When a CPU reports avx512f and the compiler is 1.89+, the dispatcher selects them automatically; otherwise behaviour is unchanged. The MSRV is not affected.

Measured single-thread on an AMD Ryzen 9 9950X (Zen 5), AVX-512 is basically the same as the FMA kernel across all sizes (f32 and f64): Zen 5 sustains the same peak FMA throughput whether it issues 4x256-bit or 2x512-bit FMAs per cycle (~64 f32 / ~32 f64 flop/cycle), and it runs 512-bit code at full boost (no down-clock).

A ~2x speedup is expected on cores with two 512-bit FMA units but only 2x256-bit for AVX2, i.e. Intel server parts (Skylake-SP, Cascade Lake, Ice Lake-SP, Sapphire Rapids). I don't have such hardware, so benchmarks from anyone who does are very welcome.

@SomeB1oody

SomeB1oody commented Jun 21, 2026

Copy link
Copy Markdown
Author

Here are the things that are worth doing next:

  1. Make the kernel-size bigger globally rather than widening only when AVX-512 is enabled. The trade-off is a slightly larger masked-output scratch buffer for every build.

  2. AVX-512 cgemm/zgemm (complex) kernels.

@SomeB1oody SomeB1oody changed the title Add AVX-512 sgemm/dgemm kernel Add AVX-512 kernel Jun 23, 2026
@SomeB1oody

Copy link
Copy Markdown
Author

AVX512 cgemm/zgemm support done.

Here's performance comparison:
sgemm (f32)

┌──────┬─────────┬───────────┬───────────┬─────────────┬─────────┐
│ size │ AVX2 ns │ AVX512 ns │ AVX2 GF/s │ AVX512 GF/s │ speedup │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│    4 │      37 │        46 │       3.5 │         2.8 │   0.80x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│    8 │      38 │        67 │      26.9 │        15.3 │   0.57x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   16 │     124 │       108 │      66.1 │        75.9 │   1.15x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   32 │     595 │       434 │     110.1 │       151.0 │   1.37x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   64 │   4,006 │     2,497 │     130.9 │       210.0 │   1.60x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│  127 │  29,382 │    17,032 │     139.4 │       240.5 │   1.73x │
└──────┴─────────┴───────────┴───────────┴─────────────┴─────────┘

dgemm (f64)

┌──────┬─────────┬───────────┬───────────┬─────────────┬─────────┐
│ size │ AVX2 ns │ AVX512 ns │ AVX2 GF/s │ AVX512 GF/s │ speedup │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│    8 │      53 │        35 │      19.3 │        29.3 │   1.51x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   16 │     191 │       119 │      42.9 │        68.8 │   1.61x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   32 │   1,049 │       612 │      62.5 │       107.1 │   1.71x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   64 │   7,650 │     4,270 │      68.5 │       122.8 │   1.79x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│  127 │  56,623 │    30,271 │      72.4 │       135.3 │   1.87x │
└──────┴─────────┴───────────┴───────────┴─────────────┴─────────┘

cgemm (complex f32)

┌──────┬─────────┬───────────┬───────────┬─────────────┬─────────┐
│ size │ AVX2 ns │ AVX512 ns │ AVX2 GF/s │ AVX512 GF/s │ speedup │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   16 │     974 │       360 │      33.6 │        91.0 │   2.71x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   32 │   6,979 │     2,238 │      37.6 │       117.1 │   3.12x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   64 │  52,935 │    15,433 │      39.6 │       135.9 │   3.43x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│  127 │ 407,009 │   117,148 │      40.3 │       139.9 │   3.47x │
└──────┴─────────┴───────────┴───────────┴─────────────┴─────────┘

zgemm (complex f64)

┌──────┬─────────┬───────────┬───────────┬─────────────┬─────────┐
│ size │ AVX2 ns │ AVX512 ns │ AVX2 GF/s │ AVX512 GF/s │ speedup │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   16 │   1,057 │       566 │      31.0 │        57.9 │   1.87x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   32 │   7,686 │     3,903 │      34.1 │        67.2 │   1.97x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│   64 │  58,192 │    29,087 │      36.0 │        72.1 │   2.00x │
├──────┼─────────┼───────────┼───────────┼─────────────┼─────────┤
│  127 │ 446,119 │   221,109 │      36.7 │        74.1 │   2.02x │
└──────┴─────────┴───────────┴───────────┴─────────────┴─────────┘

@SomeB1oody

Copy link
Copy Markdown
Author

@bluss Could u please take a look at it and approve CI test? AVX512 really improves perf of large-matrix multiplication.

@bluss bluss left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding this. On my puny coding laptop, the benefit is not as great as in your results for sgemm and dgemm; maybe because more care has gone into the AVX2 kernel. But that's not a problem, just pointing to future improvements.

Comment thread src/dgemm_kernel.rs Outdated
unsafe fn pack_mr(kc: usize, mc: usize, pack: &mut [Self::Elem],
a: *const Self::Elem, rsa: isize, csa: isize)
{
// safety: any CPU with avx512f also has avx2

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commit (commenting on the commit message) does not follow git's conventions for commits, it's missing the empty line after the subject.

https://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html

Comment thread benches/benchmarks.rs Outdated
macro_rules! cmat_mul {
($modname:ident, $gemm:ident, $real:ty, $(($name:ident, $m:expr, $n:expr, $k:expr))+) => {
mod $modname {
use bencher::Bencher;

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove these new benchmarks, they are obsolete anyway.

Prefer the python benchmark script. Example use:

./benches/benchloop.py -t f32 f64 c32 c64 -s 256 --threads 0 2>/dev/null

while a bit janky in its own way, it produces a csv table of benchmark results for larger matrices. For example:

before pr

m    k    n    layout  type  average_ns  minimum_ns  median_ns  samples  gflops              nc  kc  mc  threads
256  256  256  FCC     f32   440629      436260      439157     7810     76.15121110957291               0
256  256  256  FCC     f64   895523      882979      890395     1560     37.469090129455076              0
256  256  256  FCC     c32   3417605     3388634     3411751    310      39.27245190711039               0
256  256  256  FCC     c64   6852613     6733463     6849942    310      19.58635749603837               0

after pr

m    k    n    layout  type  average_ns  minimum_ns  median_ns  samples  gflops              nc  kc  mc  threads
256  256  256  FCC     f32   388369      385705      388099     7810     86.39832736392451               0
256  256  256  FCC     f64   807061      792724      800311     1560     41.576079131565024              0
256  256  256  FCC     c32   1925052     1892153     1918993    1560     69.72161167594435               0
256  256  256  FCC     c64   3806188     3728752     3797703    310      35.26303167368506               0

(from the laptop where this is written.)

Comment thread src/dgemm_kernel.rs
Comment thread .github/workflows/ci.yml
with:
toolchain: ${{ matrix.rust }}
targets: ${{ matrix.target }}
- name: Set up Intel SDE

This comment was marked as resolved.

Comment thread src/gemm.rs Outdated
assert!(nr > 0 && nr <= 8);
assert!(mr * nr * size_of::<K::Elem>() <= 8 * 4 * 8);
assert!(K::align_to() <= 32);
// The bounds widen when AVX-512 is enabled (see KERNEL_MAX_* below).

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary comment

Suggested change
// The bounds widen when AVX-512 is enabled (see KERNEL_MAX_* below).

Comment thread src/lib.rs Outdated
//! - `fma`
//! - `avx`
//! - `sse2`
//! - `avx512` (Rust 1.89 or later)

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feature is called avx512f, so we should use that name here. Don't need to mention rust version imo.

Comment thread src/lib.rs
//!
//! Some features are enabled with later versions: from Rust 1.61 AArch64 NEON support.
//! Some features are enabled with later versions: from Rust 1.61 AArch64 NEON
//! support, and from Rust 1.89 x86/x86-64 AVX-512 support.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good.

Comment thread src/loopmacros.rs Outdated
}}
}

// AVX-512 sgemm microkernel (16 kernel rows)

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary comment. If it needs to be "said"; use the cfg.

Comment thread src/dgemm_kernel.rs Outdated
let (rsc, csc) = if prefer_row_major_c { (rsc, csc) } else { (csc, rsc) };

// Compute A B. Load one 8-wide row of B, FMA against each broadcast A elem.
let mut bv = _mm512_loadu_pd(b);

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not request 64 byte alignment for this kernel and use _mm512_load_pd?

@bluss

bluss commented Jun 26, 2026

Copy link
Copy Markdown
Owner

@bluss Could u please take a look at it and approve CI test? AVX512 really improves perf of large-matrix multiplication.

no need to nag about it, I've already approved a couple of runs and it's not my job to babysit random PRs. 😄

@SomeB1oody SomeB1oody force-pushed the feat-avx512-support branch from da2bc53 to 9f90d62 Compare June 26, 2026 17:49
Use Intel Software Development Emulator (SDE) to ensure AVX512 is available.
Use `MMTEST_FEATURE=avx512f` to force running on AVX512 kernel.
Use `MMTEST_FAST_TEST=1` to avoid timeout.
The new `pack_avx512` function calls `pack_impl` like what `pack_avx2` does.
It has no difference in performance compared with `pack_avx2`. It's for the consistency of the kernel architecture.
- Generate avx512 kernel tests with the existing `test_arch_kernels_x86!` macro.
- Drop `cmat_mul!` benchmarks.
- Use `avx512f` in crate docs and remove duplicate rust-version note.
- Remove uncessary comments (gemm.rs and loopmacros.rs).
Request 64-byte alignment for the packed buffers (`KernelAvx512::align_to` and `KERNEL_MAX_ALIGN`).
Use `_mm512_load_pd` / `_mm512_load_ps` for packed-B reads.
Only sgemm and dgemm changes to 64-byte alignment. cgemm is capped at 32 and zgemm is left with it.

Performance(single-thread, forced avx512f on 9950x), GFlop/s:
         256                 512
f32  281 -> 298 (+6%)     287 -> 303 (+6%)
f64  124 -> 137 (+10%)    122 -> 135 (10%)

The benefit of 64-byte alignment should be bigger on Intel server parts because unaligned 512-bit loads are more expensive there.
@SomeB1oody SomeB1oody force-pushed the feat-avx512-support branch from 9f90d62 to 6d03be7 Compare June 26, 2026 18:02
@SomeB1oody

SomeB1oody commented Jun 26, 2026

Copy link
Copy Markdown
Author

@bluss Thank u for ur code review:

  • Commit message format fixed

  • avx512f test now generated with the existing test_arch_kernels_x86! macro

  • Complex benchmarks removed.

  • Renamed the feature to avx512f and dropped the duplicate rust-version note (it's already documented further down).

  • Unnecessary comments removed.

  • 64-byte alignment done for sgemm and dgemm. cgemm is structurally capped at 32 (min(MR, NR) * size_of == 32), so I left it (and zgemm) as is.

    Single-thread, forced avx512f on the 9950X (Zen 5), GFlop/s:

    type 256 512
    f32 281 -> 298 (+6%) 287 -> 303 (+6%)
    f64 124 -> 137 (+10%) 122 -> 135 (+10%)

    Should help more on Intel server parts where unaligned 512-bit loads are costlier.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants