PTX Backend by WillTrojak · Pull Request #18 · PyFR/GiMMiK

WillTrojak · 2026-05-15T12:23:17Z

This adds a PTX backend to GiMMiK. The key features are:

Mild optimisation of existing CUDA algorithms.
Optional async loads for some sparse kernels.
Added dense generation for Hopper and above.

Optimisations have focused on FP64; FP32 is future work.

FreddieWitherden · 2026-05-15T18:31:49Z

I know this is an utter pain, but for FP32/FP64 can you confirm correctness for all relevant PyFR matrices at a suite of N values, for every case where a kernel is expected to work (on A100/H100/B100)?
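A check along these lines could be driven by a small harness such as the sketch below. It is only an illustration: `check_kernel`, `numpy_kernel` and the tolerance rule are hypothetical and not part of GiMMiK or PyFR, and the stand-in kernel would need to be swapped for a real launch of the generated PTX on each GPU.

```python
import numpy as np


def check_kernel(run_kernel, a, n_values, dtype, seed=0):
    """Compare run_kernel(a, b) with a float64 NumPy reference for each n.

    Returns a list of (n, max_abs_error) pairs for the cases that fail.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=dtype)

    # Assumed tolerance: a multiple of machine epsilon scaled by the largest
    # absolute row sum of A (not a PyFR or GiMMiK setting)
    tol = 100 * np.finfo(dtype).eps * max(1.0, float(np.abs(a).sum(axis=1).max()))

    failures = []
    for n in n_values:
        b = rng.uniform(-1, 1, size=(a.shape[1], n)).astype(dtype)
        ref = a.astype(np.float64) @ b.astype(np.float64)
        out = run_kernel(a, b)
        err = float(np.abs(out - ref).max())

        # Written as "not <=" so that a NaN error also counts as a failure
        if not err <= tol:
            failures.append((n, err))

    return failures


def numpy_kernel(a, b):
    # Stand-in for a real GPU launch so that the sketch runs anywhere
    return a @ b


a = np.array([[1.0, 0.0, 0.5],
              [0.0, 2.0, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
n_values = [1, 7, 32, 1000]

fp32_failures = check_kernel(numpy_kernel, a, n_values, np.float32)
fp64_failures = check_kernel(numpy_kernel, a, n_values, np.float64)

# A deliberately wrong kernel should be reported for every n
bad_failures = check_kernel(lambda a, b: a @ b + 1, a, n_values, np.float64)
```

Looping this over the relevant matrices, both dtypes and each device would give one pass/fail table per GPU.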

FreddieWitherden · 2026-05-15T18:33:25Z

+                         .param .u64 _c)
+{
+% endif
+    .reg .u32 n, id, tid_x, tid_y;


Ensure we throw higher up if n is too big.

Checking here

We don't handle n being too large in any of the other backends.

In the embedded case we do, see https://github.com/PyFR/GiMMiK/blob/master/gimmik/kernels/cuda/cstream.mako#L20 (the argument case doesn't, but that is not currently used for CUDA).
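For illustration, a host-side guard that throws before a kernel is launched might look like the sketch below. The function name `validate_n`, the default block size and the assumption that the column index is formed as `ctaid.x * ntid.x + tid.x` in 32-bit registers are all hypothetical and not taken from the PR.

```python
U32_MAX = 2**32 - 1


def validate_n(n, blockx=128):
    """Return the number of blocks for n columns, or raise if n is unusable.

    Assumes (hypothetically) that the kernel holds n and the global column
    index in .u32 registers.
    """
    if not isinstance(n, int) or n <= 0:
        raise ValueError(f'n must be a positive integer, got {n!r}')

    # Ceiling division: blocks needed to cover all n columns
    nblocks = -(-n // blockx)

    # The largest index a thread can form is nblocks*blockx - 1, so the
    # padded extent must itself fit in an unsigned 32-bit register
    if nblocks * blockx > U32_MAX:
        raise ValueError(f'n = {n} is too large for 32-bit indexing')

    return nblocks


nblocks_small = validate_n(1000)
nblocks_large = validate_n(2**32 - 128)
```

Raising here, rather than inside the generated PTX, keeps the kernels free of extra branches.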

FreddieWitherden · 2026-05-21T13:29:40Z

+        nnz = np.count_nonzero(arr)
+        nuq = len(np.unique(np.abs(arr)))
+        density = nnz / arr.size
+        return (nuq <= 28) or (density <= 0.15)


Check if these could do with tuning

I think that would be a separate PR.
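To make any later tuning easier to experiment with, the two thresholds could be exposed as parameters. The sketch below reproduces the predicate from the diff under a hypothetical name (`few_values_or_sparse`), with the 28 and 0.15 values as defaults; what the predicate selects in the backend is not shown here.

```python
import numpy as np


def few_values_or_sparse(arr, max_unique=28, max_density=0.15):
    """Predicate from the diff, with its two thresholds made tunable."""
    arr = np.asarray(arr)
    nnz = np.count_nonzero(arr)

    # Note that zero counts as one of the unique magnitudes when present
    nuq = len(np.unique(np.abs(arr)))
    density = nnz / arr.size

    return bool((nuq <= max_unique) or (density <= max_density))


# Identity: only two unique magnitudes (0 and 1)
ident = few_values_or_sparse(np.eye(4))

# Fully dense with 64 distinct values: fails both tests
dense = few_values_or_sparse(np.arange(1.0, 65.0).reshape(8, 8))

# 41 unique magnitudes but density 40/1600 = 0.025
diag = few_values_or_sparse(np.diag(np.arange(1.0, 41.0)))

# Same matrix with a tighter (hypothetical) density threshold
diag_tight = few_values_or_sparse(np.diag(np.arange(1.0, 41.0)),
                                  max_density=0.01)
```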

FreddieWitherden · 2026-05-22T15:28:41Z

+%   for idx, kx in enumerate(bchunks[bb]):
+    ld.shared.${pftype} bv, [bsub_thread + ${bsub_off(buf_cur, idx)}];
+%    for j, row_j in enumerate(mcx):
+<%    jx = A[row_j, kx] %>


See if NumPy indexing can be used in place of the inner lookup in the for loop, e.g. A[mcx, kx].
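As a small illustration of the suggestion, NumPy integer-array indexing with a list of rows and a scalar column returns the same values as the per-row lookup. The matrix and index values below are made up for the example; the template may still need a loop over the resulting values to emit one instruction per entry.

```python
import numpy as np

A = np.arange(12.0).reshape(3, 4)
mcx = [0, 2]   # rows handled by this chunk (example values)
kx = 1         # current column (example value)

# Per-element lookup, as in the template loop
loop_vals = [A[row_j, kx] for row_j in mcx]

# Single indexing expression picking column kx for all rows in mcx
vec_vals = A[mcx, kx]

same = list(vec_vals) == loop_vals
```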

FreddieWitherden · 2026-06-03T17:58:32Z

JSON looks solid. See if we can factor out some of the common code so that other backends (CUDA) can also use it. That would also make the code easier to evaluate standalone. I'll start trying to chunk through the kernels, but it would be great if you could give a one-sentence sketch of their general approach.

FreddieWitherden · 2026-06-03T18:01:15Z

+    }
+
+    # Map Supported CC -> Minimum PTX version
+    PTX_SM = {(8, 0): (7, 0), (9, 0): (8, 6), (10, 0): (8, 7), (10, 3): (8, 7),


Is this okay when new GPUs are released?
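One possible way to cope with compute capabilities that are missing from the table is to fall back to the nearest known entry at or below the device. The sketch below uses the table from the diff; the helper `min_ptx_version` is hypothetical, and whether such a fallback is appropriate for every future architecture is an assumption rather than something established in this thread.

```python
# Table as it appears in the diff: supported CC -> minimum PTX version
PTX_SM = {(8, 0): (7, 0), (9, 0): (8, 6), (10, 0): (8, 7), (10, 3): (8, 7),
          (12, 0): (8, 7), (12, 1): (8, 7)}


def min_ptx_version(cc, table=PTX_SM):
    """Look up cc, falling back to the largest known CC not exceeding it."""
    # Tuples compare lexicographically, so (8, 9) < (9, 0) < (10, 0)
    known = [k for k in sorted(table) if k <= cc]
    if not known:
        raise ValueError(f'Compute capability {cc} is not supported')

    return table[known[-1]]


exact = min_ptx_version((9, 0))
between = min_ptx_version((8, 9))
future = min_ptx_version((13, 0))
```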

FreddieWitherden · 2026-06-03T18:01:30Z

+    PTX_SM = {(8, 0): (7, 0), (9, 0): (8, 6), (10, 0): (8, 7), (10, 3): (8, 7),
+              (12, 0): (8, 7), (12, 1): (8, 7)}
+
+    PTX_TEMPLATE_FAMILY = {


Can this be in the config?

FreddieWitherden · 2026-06-03T18:04:25Z

+            'fzero': ('0f00000000' if dtype == 'float'
+                      else '0d0000000000000000'),
+            'beta_zero': self.beta == 0,
+            'mbar_maxwait': '0x989680',


What does this correspond to?

It is 10'000'000 in hexadecimal. It overrides the system time limit in the membar wait, setting it to 10 ms. This is generally a good idea.
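The conversion can be checked quickly. The sketch assumes the value is interpreted in nanoseconds, which is inferred from the 10 ms figure above rather than stated in the diff.

```python
# Hex literal from the diff
mbar_maxwait = int('0x989680', 16)

# Assumed unit: nanoseconds
NS_PER_MS = 1_000_000
wait_ms = mbar_maxwait / NS_PER_MS
```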

Will Trojak and others added 6 commits December 2, 2025 22:13

[wip] added ptx generator for bstream

0cd7485

Addtional sparse and dense work

626c2f5

Dense and sparse optimisation

bbbb8ef

Added warp specialised dense kernel

393b409

Performance tuning and cleanup

67d1beb

Whitespace

e2a818b

WillTrojak mentioned this pull request May 15, 2026

Support for GiMMiK PTX Provider PyFR/PyFR#556

Open

FreddieWitherden reviewed May 15, 2026

View reviewed changes

Comment thread gimmik/kernels/ptx/bstream-msplit.mako Outdated

FreddieWitherden reviewed May 15, 2026

View reviewed changes

Comment thread gimmik/ptx.py Outdated

FreddieWitherden reviewed May 15, 2026

View reviewed changes

Comment thread gimmik/kernels/ptx/base.mako Outdated

FreddieWitherden reviewed May 15, 2026

View reviewed changes

Comment thread gimmik/kernels/ptx/cstream-ksplit.mako Outdated

FreddieWitherden reviewed May 15, 2026

View reviewed changes

Comment thread gimmik/kernels/ptx/bstream.mako

Cleanups, formating and addressign comments

7d7299a

FreddieWitherden reviewed May 19, 2026

View reviewed changes

Comment thread gimmik/ptx.py Outdated

General cleanups and moved smem to pyfr

1d405c3

FreddieWitherden reviewed May 21, 2026

View reviewed changes

Comment thread gimmik/ptx.py Outdated

WillTrojak added 3 commits May 21, 2026 09:26

Fixed missing import

0e86053

Fixed additional args

1f62b5f

Cleanup and added PTX Version to handle older drivers.

79f41cb

FreddieWitherden reviewed May 22, 2026

View reviewed changes

Comment thread gimmik/kernels/ptx/bstream-msplit.mako Outdated

FreddieWitherden reviewed May 22, 2026

View reviewed changes

Comment thread gimmik/kernels/ptx/dense-mma-gAd.mako Outdated

Further cleanup

7b59ca4

This was referenced May 27, 2026

added float4 and double2 #17

Closed

Added spill to shared and launch bounds #16

Closed

WillTrojak added 6 commits June 2, 2026 09:54

Moved the PTX backend to use tuned kernel profiles for different CCs

577208a

Added grid stridded kernel and cleanup

e7d4252

msplit dmma

d4e1216

Refactored arg setup

b9ac47c

Updated Blackwell profile

66e3796

updated sm100 config

010d13c

FreddieWitherden reviewed Jun 3, 2026

View reviewed changes

Comment thread gimmik/ptx.py


2 participants