OneDNN BRGeMM Micro-Kernel Integration for BF16 MatMul by bbhattar · Pull Request #903 · google/gemma.cpp

bbhattar · 2026-04-28T22:10:19Z

This PR integrates OneDNN BRGeMM (Batch-Reduced General Matrix Multiply) micro-kernels as an alternative compute path for BF16 MatMul on Intel Xeon platforms with AMX or AVX-512 BF16 support.

What

When enabled via the GEMMA_ONEDNN_BRGEMM compile-time flag, BF16×BF16 MatMul operations are dispatched to JIT-compiled BRGeMM kernels instead of the Highway SIMD path. This targets Gemma model workloads (FFW projections, attention) on Intel Xeon Scalable processors (SPR/EMR, i.e. Sapphire Rapids and Emerald Rapids). The flag is supported in both the CMake and Bazel builds.

How to Enable

# CMake
cmake -DGEMMA_ONEDNN_BRGEMM=ON ..

# Bazel
bazel build --define gemma_onednn_brgemm=1 ...

Runtime Fallback

When GEMMA_ONEDNN_BRGEMM is enabled at compile time, the BRGeMM path activates for BF16×BF16 operations whose dimensions meet AMX tile constraints (M, N, K ≥ 32 and K % 32 == 0). All other cases — non-BF16 types, smaller or non-aligned dimensions, mixed precision — fall through to the standard Highway SIMD MatMul path automatically.
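The fallback rule above can be read as a single predicate. The following is a minimal sketch of that reading, not the PR's actual `UseOneDnnBrgemm()`: the helper names `ShapeFitsBrgemm` and `WouldUseBrgemm`, the constants, and the `is_bf16_pair` parameter are all hypothetical, and only the thresholds (M, N, K ≥ 32 and K % 32 == 0) come from the description.

```cpp
#include <cassert>
#include <cstddef>

// Default the flag to "off" when the build system does not define it.
#ifndef GEMMA_ONEDNN_BRGEMM
#define GEMMA_ONEDNN_BRGEMM 0
#endif

// Thresholds quoted in the PR description; the constant names are hypothetical.
constexpr size_t kMinDim = 32;
constexpr size_t kKAlign = 32;

// True if the shape meets the stated constraints: M, N, K >= 32 and K % 32 == 0.
constexpr bool ShapeFitsBrgemm(size_t M, size_t N, size_t K) {
  return M >= kMinDim && N >= kMinDim && K >= kMinDim && K % kKAlign == 0;
}

// is_bf16_pair stands in for "both operands are BF16" (assumption: the real
// code derives this from the matrix element types).
constexpr bool WouldUseBrgemm(bool is_bf16_pair, size_t M, size_t N, size_t K) {
  return GEMMA_ONEDNN_BRGEMM != 0 && is_bf16_pair && ShapeFitsBrgemm(M, N, K);
}
```

Any call for which `WouldUseBrgemm` returns false corresponds to the cases the description says fall through to the Highway path, for example a shape with K = 48 (not a multiple of 32) or a mixed-precision operand pair.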

Changes

File	Description
`ops/brgemm.h`	Types, caches, thread-local buffers, `UseOneDnnBrgemm()`, autotuning candidates
`ops/brgemm-inl.h`	`DoMatMul_BRGeMM()`: kernel JIT/caching, B-packing with hugepages, tiled parallel execution
`ops/matmul-inl.h`	BRGeMM dispatch block in `MatMul()` guarded by `#if GEMMA_ONEDNN_BRGEMM`
`ops/matmul.h`	`#include "ops/brgemm.h"`, `brgemm_autotune` field in `MMPerKey`
`ops/bench_matmul.cc`	Check `brgemm_autotune.Best()` to avoid infinite loop when BRGeMM handles dispatch
`CMakeLists.txt`	`GEMMA_ONEDNN_BRGEMM` option, FetchContent for OneDNN v3.11, conditional target linking
`BUILD.bazel`	`config_setting` for `gemma_onednn_brgemm`, conditional OneDNN dep and defines for x86_64
`MODULE.bazel`	OneDNN v3.11 `http_archive` dependency
`bazel/onednn.BUILD`	Bazel build rules for OneDNN
`util/zones.h`	`kBRGeMM` caller enum for thread pool dispatch
`util/zones.cc`	`CallerName` mapping for `kBRGeMM`
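To make "autotuning candidates" concrete, here is a hedged sketch that enumerates (M-block, batch) pairs from the two constant arrays quoted in the review thread. It assumes the candidates are the full Cartesian product of those arrays, which the PR text does not confirm; `BrgemmCandidate` and `EnumerateCandidates` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical candidate type: one (M-block size, batch size) configuration.
struct BrgemmCandidate {
  int64_t m_blk;
  int64_t batch;
};

// Enumerates every (m_blk, batch) pair. The array contents are taken from the
// diff excerpt in the review thread; pairing them as a Cartesian product is an
// assumption made for illustration.
inline std::vector<BrgemmCandidate> EnumerateCandidates() {
  static constexpr int64_t kMBlkValues[] = {32, 64};
  static constexpr int64_t kBatchValues[] = {16, 32, 64, 128, 256};
  std::vector<BrgemmCandidate> out;
  for (int64_t m_blk : kMBlkValues) {
    for (int64_t batch : kBatchValues) {
      out.push_back({m_blk, batch});
    }
  }
  return out;
}
```

Under that assumption an autotuner would time each of the resulting configurations for a given shape and keep the fastest, which is presumably what `brgemm_autotune.Best()` reports.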

Testing

- matmul_test passes with and without GEMMA_ONEDNN_BRGEMM (all original test shapes, types, and correctness checks preserved)
- bench_matmul runs successfully with BRGeMM enabled
- No changes to existing tests; zero impact when OneDNN is not enabled or on non-x86 platforms

google-cla · 2026-04-28T22:10:29Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

jan-wassenberg

Very nice work :) Just some fairly minor suggestions:

jan-wassenberg · 2026-04-29T15:59:40Z

+  static constexpr int64_t kMBlkValues[] = {32, 64};
+  static constexpr int64_t kBatchValues[] = {16, 32, 64, 128, 256};
+
+  const int64_t k_chunks = static_cast<int64_t>(K) / kKBlk;


Should this round up? We have hwy::DivCeil.

bbhattar · May 5, 2026

Padding the K dimension and rounding up is one alternative. We instead use integer division and handle the remainder with a dedicated tail-shaped kernel.
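The difference between the two approaches can be sketched as follows. This is an illustration only: the block size is passed as a parameter because the value of `kKBlk` is not shown in the excerpt, and `KSplit`, `SplitK`, and `CeilDiv` are hypothetical names (the reviewer's suggestion refers to `hwy::DivCeil`, which is not used here so the snippet stays self-contained).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Result of splitting K into full blocks plus a remainder.
struct KSplit {
  int64_t k_chunks;  // number of full k_blk-sized blocks (floor division)
  int64_t k_tail;    // leftover elements for a separate tail kernel
};

// Floor division plus explicit remainder, as described in the reply.
constexpr KSplit SplitK(size_t K, int64_t k_blk) {
  return KSplit{static_cast<int64_t>(K) / k_blk,
                static_cast<int64_t>(K) % k_blk};
}

// Rounding up instead: the last block would be partially filled, so K would
// have to be padded to CeilDiv(K, k_blk) * k_blk elements.
constexpr int64_t CeilDiv(int64_t a, int64_t b) { return (a + b - 1) / b; }
```

For example, with K = 160 and an assumed block size of 64, `SplitK` yields two full blocks and a tail of 32, whereas rounding up gives three blocks and would require padding K to 192.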

jan-wassenberg

Very nice, thanks for making the changes!

jan-wassenberg · 2026-05-07T15:20:02Z

Looks like we require a rebase of the PR, then ready to land this :)

bbhattar · 2026-05-11T22:54:03Z

> Looks like we require a rebase of the PR, then ready to land this :)

@jan-wassenberg rebased. Should be okay to merge now.

jan-wassenberg

Thanks, beginning the import :)

jan-wassenberg · 2026-05-15T16:33:09Z

Apologies, this got held up in our internal CI due to the formatting of the BUILD file, sigh. I'll fix it and merge manually, but that will have to wait until I return from OOO on May 26.

============= Commits ==============
-- 09ddbf4 by Bibek Bhattarai <bibek.bhattarai@intel.com>: Tested and benchmarked OneDNN BRGeMM integration against dev branch
-- 1308355 by Bibek Bhattarai <bibek.bhattarai@intel.com>: fixing the copyright info
-- 656444f by Bibek Bhattarai <bibek.bhattarai@intel.com>: Removing OneTBB dependency
-- f8527a1 by Bibek Bhattarai <bibek.bhattarai@intel.com>: Fixed the compile time flag to designate BRGEMM path
-- 0dde315 by Bibek Bhattarai <bibek.bhattarai@intel.com>: Adding the cmake based build support for oneDNN BGGeMM
-- fd3b119 by Bibek Bhattarai <bibek.bhattarai@intel.com>: fixed dtypes and syntax divergence from codebase
-- 6640021 by Bibek Bhattarai <bibek.bhattarai@intel.com>: changed lda and ldb to size_t. Added conversions inplace for brgemm and transform inits
-- 9d6bbee by Bibek Bhattarai <bibek.bhattarai@intel.com>: Replaced / and % with Divide and Remainder utils from hwy::Divisor
-- 45708ea by Bibek Bhattarai <bibek.bhattarai@intel.com>: Moved the BRGeMM Kernel inits to a separate HWY_NOINLINE helper function
-- acf7592 by Bibek Bhattarai <bibek.bhattarai@intel.com>: Added HWY_WARN and fallback instead of exiting
-- 7bdf4c6 by Bibek Bhattarai <bibek.bhattarai@intel.com>: using hwy::AlignedVector instead of std::vector for scratch and tc_storage
====================================
Resolves #903.
PiperOrigin-RevId: 925971432

bbhattar added 5 commits April 13, 2026 18:31

Tested and benchmarked OneDNN BRGeMM integration against dev branch

09ddbf4

fixing the copyright info

1308355

Removing OneTBB dependency

656444f

Fixed the compile time flag to designate BRGEMM path

f8527a1

Adding the cmake based build support for oneDNN BGGeMM

0dde315

bbhattar force-pushed the feature/onednn-brgemm branch from 629b569 to e072d70 on April 28, 2026 22:19

jan-wassenberg requested changes Apr 29, 2026

bbhattar force-pushed the feature/onednn-brgemm branch from e072d70 to f3a75ca on May 5, 2026 18:53

bbhattar added 4 commits May 5, 2026 19:03

fixed dtypes and syntax divergence from codebase

fd3b119

changed lda and ldb to size_t. Added conversions inplace for brgemm and transform inits

6640021

Replaced / and % with Divide and Remainder utils from hwy::Divisor

9d6bbee

Moved the BRGeMM Kernel inits to a separate HWY_NOINLINE helper function

45708ea

bbhattar force-pushed the feature/onednn-brgemm branch from 649e233 to acf7592 on May 6, 2026 00:49

bbhattar added 2 commits May 6, 2026 00:54

Added HWY_WARN and fallback instead of exiting

acf7592

using hwy::AlignedVector instead of std::vector for scratch and tc_storage

7bdf4c6

bbhattar requested a review from jan-wassenberg May 6, 2026 22:31

jan-wassenberg approved these changes May 7, 2026

bbhattar added 3 commits May 7, 2026 10:30

Merge branch 'dev' into feature/onednn-brgemm

0e8d31a

Merge branch 'dev' into feature/onednn-brgemm

dd78e91

Merge branch 'dev' into feature/onednn-brgemm

3a2212e

Merge branch 'dev' into feature/onednn-brgemm

93b38fb

jan-wassenberg approved these changes May 12, 2026

jan-wassenberg added the copybara-import label (Trigger Copybara for merging pull requests) on May 12, 2026

bbhattar added 2 commits May 20, 2026 13:19

Merge branch 'dev' into feature/onednn-brgemm

b2df643

Merge branch 'dev' into feature/onednn-brgemm

22875ca

jan-wassenberg approved these changes May 29, 2026

jan-wassenberg removed and re-added the copybara-import label (Trigger Copybara for merging pull requests) on May 29, 2026

Merge branch 'dev' into feature/onednn-brgemm

c3787f8

copybara-service Bot merged commit f7b2d1d into google:dev Jun 3, 2026
7 of 8 checks passed

copybara-service Bot mentioned this pull request Jun 3, 2026

Merge #903: OneDNN BRGeMM Micro-Kernel Integration for BF16 MatMul #925

Merged
