Add AVX-512 kernel #100
Here are the things that are worth doing next:
AVX512 cgemm/zgemm support done. Here's a performance comparison:
[Figures: dgemm (f64), cgemm (complex f32), zgemm (complex f64)]
@bluss Could you please take a look at it and approve the CI run? AVX512 really improves the performance of large-matrix multiplication.
bluss
left a comment
thanks for adding this. On my puny coding laptop, the benefit is not as great as in your results for sgemm and dgemm; maybe because more care has gone into the AVX2 kernel. But that's not a problem, just pointing to future improvements.
| unsafe fn pack_mr(kc: usize, mc: usize, pack: &mut [Self::Elem], | ||
| a: *const Self::Elem, rsa: isize, csa: isize) | ||
| { | ||
| // safety: any CPU with avx512f also has avx2 |
This commit (commenting on the commit message) does not follow git's conventions for commits: it's missing the empty line after the subject.
https://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html
| macro_rules! cmat_mul { | ||
| ($modname:ident, $gemm:ident, $real:ty, $(($name:ident, $m:expr, $n:expr, $k:expr))+) => { | ||
| mod $modname { | ||
| use bencher::Bencher; |
Please remove these new benchmarks, they are obsolete anyway.
Prefer the python benchmark script. Example use:
./benches/benchloop.py -t f32 f64 c32 c64 -s 256 --threads 0 2>/dev/null

While a bit janky in its own way, it produces a csv table of benchmark results for larger matrices. For example:
before pr
m k n layout type average_ns minimum_ns median_ns samples gflops nc kc mc threads
256 256 256 FCC f32 440629 436260 439157 7810 76.15121110957291 0
256 256 256 FCC f64 895523 882979 890395 1560 37.469090129455076 0
256 256 256 FCC c32 3417605 3388634 3411751 310 39.27245190711039 0
256 256 256 FCC c64 6852613 6733463 6849942 310 19.58635749603837 0
after pr
m k n layout type average_ns minimum_ns median_ns samples gflops nc kc mc threads
256 256 256 FCC f32 388369 385705 388099 7810 86.39832736392451 0
256 256 256 FCC f64 807061 792724 800311 1560 41.576079131565024 0
256 256 256 FCC c32 1925052 1892153 1918993 1560 69.72161167594435 0
256 256 256 FCC c64 3806188 3728752 3797703 310 35.26303167368506 0
(from the laptop where this is written.)
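The `gflops` column in these tables can be reproduced from `average_ns`. The sketch below is not part of the benchmark script; `gemm_flops` and `gflops` are hypothetical helper names, and the flop counts (2 per multiply-add for real types, 8 for complex types) are an assumption that happens to match the reported figures.

```rust
/// Floating-point operations in an (m x k) by (k x n) matrix product.
/// Assumption: 2 flops per multiply-add for real types and 8 for complex
/// types, which is the count that reproduces the tables above.
fn gemm_flops(m: u64, k: u64, n: u64, complex: bool) -> f64 {
    let per_mac = if complex { 8.0 } else { 2.0 };
    per_mac * (m * k * n) as f64
}

/// GFLOP/s from a flop count and a time in nanoseconds (flop/ns == GFLOP/s).
fn gflops(flops: f64, ns: f64) -> f64 {
    flops / ns
}

fn main() {
    // (type, is complex, average_ns before, average_ns after), copied from the tables.
    let rows = [
        ("f32", false, 440_629.0, 388_369.0),
        ("f64", false, 895_523.0, 807_061.0),
        ("c32", true, 3_417_605.0, 1_925_052.0),
        ("c64", true, 6_852_613.0, 3_806_188.0),
    ];
    for (ty, complex, before, after) in rows {
        let flops = gemm_flops(256, 256, 256, complex);
        println!(
            "{ty}: {:.2} -> {:.2} GFLOP/s, speedup {:.2}x",
            gflops(flops, before),
            gflops(flops, after),
            before / after
        );
    }
}
```

Read this way, the real kernels gain a little on this laptop, while the complex kernels gain the most.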
| with: | ||
| toolchain: ${{ matrix.rust }} | ||
| targets: ${{ matrix.target }} | ||
| - name: Set up Intel SDE |
This comment was marked as resolved.
| assert!(nr > 0 && nr <= 8); | ||
| assert!(mr * nr * size_of::<K::Elem>() <= 8 * 4 * 8); | ||
| assert!(K::align_to() <= 32); | ||
| // The bounds widen when AVX-512 is enabled (see KERNEL_MAX_* below). |
unnecessary comment
| // The bounds widen when AVX-512 is enabled (see KERNEL_MAX_* below). |
| //! - `fma` | ||
| //! - `avx` | ||
| //! - `sse2` | ||
| //! - `avx512` (Rust 1.89 or later) |
feature is called avx512f, so we should use that name here. Don't need to mention rust version imo.
| //! | ||
| //! Some features are enabled with later versions: from Rust 1.61 AArch64 NEON support. | ||
| //! Some features are enabled with later versions: from Rust 1.61 AArch64 NEON | ||
| //! support, and from Rust 1.89 x86/x86-64 AVX-512 support. |
| }} | ||
| } | ||
|
|
||
| // AVX-512 sgemm microkernel (16 kernel rows) |
unnecessary comment. If it needs to be "said"; use the cfg.
| let (rsc, csc) = if prefer_row_major_c { (rsc, csc) } else { (csc, rsc) }; | ||
|
|
||
| // Compute A B. Load one 8-wide row of B, FMA against each broadcast A elem. | ||
| let mut bv = _mm512_loadu_pd(b); |
why not request 64 byte alignment for this kernel and use _mm512_load_pd?
no need to nag about it, I've already approved a couple of runs and it's not my job to babysit random PRs. 😄 |
da2bc53 to 9f90d62
Use Intel Software Development Emulator (SDE) to ensure AVX512 is available. Use `MMTEST_FEATURE=avx512f` to force running on AVX512 kernel. Use `MMTEST_FAST_TEST=1` to avoid timeout.
The new `pack_avx512` function calls `pack_impl`, just as `pack_avx2` does. It performs the same as `pack_avx2` and exists only for consistency of the kernel architecture.
- Generate avx512 kernel tests with the existing `test_arch_kernels_x86!` macro.
- Drop `cmat_mul!` benchmarks.
- Use `avx512f` in crate docs and remove duplicate rust-version note.
- Remove unnecessary comments (gemm.rs and loopmacros.rs).
Request 64-byte alignment for the packed buffers (`KernelAvx512::align_to` and `KERNEL_MAX_ALIGN`).
Use `_mm512_load_pd` / `_mm512_load_ps` for packed-B reads.
Only sgemm and dgemm change to 64-byte alignment. cgemm is capped at 32 bytes, and zgemm is left at that as well.
Performance (single-thread, forced avx512f on 9950X), GFlop/s:
256 512
f32 281 -> 298 (+6%) 287 -> 303 (+6%)
f64 124 -> 137 (+10%) 122 -> 135 (+10%)
The benefit of 64-byte alignment should be bigger on Intel server parts because unaligned 512-bit loads are more expensive there.
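As a rough illustration of the alignment requirement behind this change: an aligned 512-bit load such as `_mm512_load_pd` needs its address to be a multiple of 64 bytes. The sketch below is not the crate's packing code; `Row64`, `is_aligned` and `packed_rows` are hypothetical names, and the type-level `align(64)` is just one simple way to get such a buffer.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical packed row: eight f64 values (one 512-bit vector) forced onto
/// a 64-byte boundary. Illustration only, not taken from the crate.
#[repr(C, align(64))]
#[derive(Clone, Copy)]
struct Row64([f64; 8]);

/// True if `ptr` is a multiple of `align` bytes.
fn is_aligned<T>(ptr: *const T, align: usize) -> bool {
    (ptr as usize) % align == 0
}

/// Allocate `rows` zeroed rows; the Vec's allocation honours the alignment of `Row64`.
fn packed_rows(rows: usize) -> Vec<Row64> {
    vec![Row64([0.0; 8]); rows]
}

fn main() {
    let buf = packed_rows(4);
    for row in &buf {
        // An aligned 512-bit load from this row is only valid if this holds.
        assert!(is_aligned(row.0.as_ptr(), 64));
    }
    println!("align = {}, size = {}", align_of::<Row64>(), size_of::<Row64>());
}
```

Because each row is exactly 64 bytes, every row in the buffer starts on a 64-byte boundary, not just the first one.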
9f90d62 to 6d03be7
|
@bluss Thank you for your code review:
|
Add runtime-dispatched AVX-512 kernels (gated by `#[cfg(has_avx512)]`). When a CPU reports `avx512f` and the compiler is 1.89+, the dispatcher selects them automatically; otherwise behaviour is unchanged. The MSRV is not affected.

Measured single-thread on an AMD Ryzen 9 9950X (Zen 5), the AVX-512 kernel performs basically the same as the FMA kernel across all sizes (f32 and f64): Zen 5 sustains the same peak FMA throughput whether it issues 4x256-bit or 2x512-bit FMAs per cycle (~64 f32 / ~32 f64 flop/cycle), and it runs 512-bit code at full boost (no down-clock).
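The flop/cycle figures follow from simple arithmetic: FMA units times lanes per vector times two flops per FMA. A minimal sketch, with `peak_flops_per_cycle` as a hypothetical helper and the unit counts taken from this description rather than measured:

```rust
/// Peak flop/cycle for `units` FMA pipes of `vector_bits` width operating on
/// `elem_bits`-wide elements. Each FMA counts as two flops per lane.
fn peak_flops_per_cycle(units: u32, vector_bits: u32, elem_bits: u32) -> u32 {
    let lanes = vector_bits / elem_bits;
    units * lanes * 2
}

fn main() {
    for (name, elem_bits) in [("f32", 32), ("f64", 64)] {
        // Four 256-bit FMAs per cycle versus two 512-bit FMAs per cycle.
        let four_256 = peak_flops_per_cycle(4, 256, elem_bits);
        let two_512 = peak_flops_per_cycle(2, 512, elem_bits);
        // For comparison: a core limited to two 256-bit FMAs per cycle.
        let two_256 = peak_flops_per_cycle(2, 256, elem_bits);
        println!("{name}: 4x256 = {four_256}, 2x512 = {two_512}, 2x256 = {two_256}");
    }
}
```

With equal peak throughput for 4x256-bit and 2x512-bit issue, the wider vectors cannot add compute headroom on their own, which is consistent with the measurement.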
A ~2x speedup is expected on cores with two 512-bit FMA units but only 2x256-bit for AVX2, i.e. Intel server parts (Skylake-SP, Cascade Lake, Ice Lake-SP, Sapphire Rapids). I don't have such hardware, so benchmarks from anyone who does are very welcome.
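The runtime dispatch described in this summary can be sketched roughly as follows. This is not the crate's dispatcher: `Kernel`, `select_kernel` and `detect_kernel` are hypothetical names, the `avx512_compiled` flag stands in for the `has_avx512` cfg, and the fallback order (fma, avx, sse2) is assumed from the feature list in the crate docs.

```rust
/// Hypothetical kernel tags, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kernel {
    Avx512,
    Fma,
    Avx,
    Sse2,
    Fallback,
}

/// Pick the widest kernel whose CPU feature is reported.
/// `avx512_compiled` models whether the AVX-512 kernels were built at all.
fn select_kernel(avx512_compiled: bool, avx512f: bool, fma: bool, avx: bool, sse2: bool) -> Kernel {
    if avx512_compiled && avx512f {
        Kernel::Avx512
    } else if fma {
        Kernel::Fma
    } else if avx {
        Kernel::Avx
    } else if sse2 {
        Kernel::Sse2
    } else {
        Kernel::Fallback
    }
}

/// Query the running CPU on x86/x86-64.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn detect_kernel(avx512_compiled: bool) -> Kernel {
    select_kernel(
        avx512_compiled,
        is_x86_feature_detected!("avx512f"),
        is_x86_feature_detected!("fma"),
        is_x86_feature_detected!("avx"),
        is_x86_feature_detected!("sse2"),
    )
}

/// Other architectures never see the x86 kernels in this sketch.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn detect_kernel(_avx512_compiled: bool) -> Kernel {
    Kernel::Fallback
}

fn main() {
    println!("selected: {:?}", detect_kernel(true));
}
```

Keeping the selection logic in a pure function separate from the CPU query makes the "otherwise behaviour is unchanged" property easy to check: with the AVX-512 flag off, the result is whatever the older feature flags would have chosen.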