PTX Backend #18

Open
WillTrojak wants to merge 18 commits into PyFR:master from WillTrojak:feature/ptx

@WillTrojak

Member

This adds a PTX backend to GiMMiK. The key features are:

  • Mild optimisation of existing CUDA algorithms.
  • Optional async loads for some sparse kernels.
  • Added dense generation for Hopper and above.

Optimisations have focused on FP64; FP32 is future work.

@FreddieWitherden

Contributor

I know this is an utter pain, but for FP32/FP64 can you confirm correctness for all relevant PyFR matrices at a suite of N values, for all instances where a kernel is expected to work on A100/H100/B100?

Comment thread gimmik/kernels/ptx/base.mako Outdated
.param .u64 _c)
{
% endif
.reg .u32 n, id, tid_x, tid_y;

Contributor

Ensure we throw higher up if n is too big.

Contributor

Checking here

Member Author

We don't handle n being too large in any of the other backends.

Contributor

We do in the embedded case: https://github.com/PyFR/GiMMiK/blob/master/gimmik/kernels/cuda/cstream.mako#L20 (the argument case doesn't, but that is not currently used for CUDA).
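
For illustration only, a guard of this kind could live in the Python layer before any PTX is rendered. This is a minimal sketch under stated assumptions: the helper name `check_n` is hypothetical, and the 32-bit limit is inferred solely from the `.reg .u32 n` declaration quoted above; the real bound may be tighter once byte offsets are derived from `n`.

```python
U32_MAX = 2**32 - 1  # assumed limit: n is held in a .u32 register


def check_n(n):
    """Hypothetical guard: fail early if n cannot be held in a u32."""
    if not isinstance(n, int) or n <= 0:
        raise ValueError(f'n must be a positive integer, got {n!r}')
    if n > U32_MAX:
        raise ValueError(f'n={n} exceeds the assumed 32-bit limit {U32_MAX}')
    return n
```

Raising here, rather than inside the template, keeps the failure close to the caller and independent of which kernel template is selected.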

Comment thread gimmik/cuda.py
nnz = np.count_nonzero(arr)
nuq = len(np.unique(np.abs(arr)))
density = nnz / arr.size
return (nuq <= 28) or (density <= 0.15)

Contributor

Check if these could do with tuning

Member Author

I think that would be a separate PR.
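
If the thresholds are tuned in a follow-up, one option is to expose them as parameters so they can be swept without editing the heuristic. The sketch below reuses the arithmetic from the snippet under review; the function name `is_candidate` and the parameter names `max_unique` and `max_density` are hypothetical, and the defaults are simply the values quoted above.

```python
import numpy as np


def is_candidate(arr, max_unique=28, max_density=0.15):
    """Hypothetical wrapper around the quoted heuristic with tunable limits."""
    nnz = np.count_nonzero(arr)
    nuq = len(np.unique(np.abs(arr)))   # distinct magnitudes, zero included
    density = nnz / arr.size            # fraction of non-zero entries
    return (nuq <= max_unique) or (density <= max_density)
```

A tuning run could then call `is_candidate(arr, max_unique=u, max_density=d)` over a grid of `u` and `d` and compare against measured kernel timings.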

Comment thread gimmik/kernels/ptx/bstream-msplit.mako Outdated
% for idx, kx in enumerate(bchunks[bb]):
ld.shared.${pftype} bv, [bsub_thread + ${bsub_off(buf_cur, idx)}];
% for j, row_j in enumerate(mcx):
<% jx = A[row_j, kx] %>

Contributor

See if NumPy indexing, e.g. A[mcx, kx], can be used in place of the per-row lookup in the for loop.
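
As a small illustration of the suggestion, NumPy integer-array indexing gathers every `A[row_j, kx]` for a row chunk in one expression. The matrix and the values of `mcx` and `kx` below are made up for the example; only the names come from the template.

```python
import numpy as np

A = np.arange(12, dtype=float).reshape(4, 3)
mcx = [0, 2, 3]   # hypothetical chunk of row indices
kx = 1            # hypothetical column index

# Per-row lookup, as in the template's inner loop
loop_vals = [A[row_j, kx] for row_j in mcx]

# Single gather using integer-array indexing
vec_vals = A[mcx, kx]
```

The template would still need to iterate over the gathered values to emit one instruction per row, so the gain is mainly in readability of the lookup.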

@FreddieWitherden

Contributor

JSON looks solid. See if we can factor out some of the common code so that other backends (CUDA) can also use it. It also makes the code easier to evaluate standalone. I'll start trying to chunk through the kernels, but it would be great if you could give a one-sentence sketch of their general approach.

Comment thread gimmik/ptx.py
}

# Map Supported CC -> Minimum PTX version
PTX_SM = {(8, 0): (7, 0), (9, 0): (8, 6), (10, 0): (8, 7), (10, 3): (8, 7),

Contributor

Is this okay when new GPUs are released?
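
One possible way to make the lookup tolerant of unseen devices, sketched under assumptions: fall back to the newest known compute capability that does not exceed the device's. The table is the one quoted in this thread; the function name `min_ptx_version` is hypothetical, and whether such a fallback is actually safe for architecture-specific kernels is a separate question this sketch does not answer.

```python
# Table as quoted in the review thread: supported CC -> minimum PTX version
PTX_SM = {(8, 0): (7, 0), (9, 0): (8, 6), (10, 0): (8, 7), (10, 3): (8, 7),
          (12, 0): (8, 7), (12, 1): (8, 7)}


def min_ptx_version(cc, table=PTX_SM):
    """Hypothetical lookup: use the newest known CC not above the device CC."""
    known = [k for k in table if k <= cc]   # tuples compare lexicographically
    if not known:
        raise ValueError(f'compute capability {cc} is older than any entry')
    return table[max(known)]
```

With this shape, a device reporting an unlisted capability such as `(13, 0)` would resolve to the `(12, 1)` entry instead of raising a `KeyError`.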

Comment thread gimmik/ptx.py
PTX_SM = {(8, 0): (7, 0), (9, 0): (8, 6), (10, 0): (8, 7), (10, 3): (8, 7),
(12, 0): (8, 7), (12, 1): (8, 7)}

PTX_TEMPLATE_FAMILY = {

Contributor

Can this be in the config?

Comment thread gimmik/ptx.py
'fzero': ('0f00000000' if dtype == 'float'
else '0d0000000000000000'),
'beta_zero': self.beta == 0,
'mbar_maxwait': '0x989680',

Contributor

What does this correspond to?

Member Author

It is 10,000,000 in decimal. It overrides the system time limit in the mbarrier wait, setting it to 10 ms. This is generally a good idea.
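
The conversion can be checked quickly in Python. The nanosecond unit is an assumption inferred from the 10 ms figure given in the reply; the variable names are illustrative only.

```python
MBAR_MAXWAIT = '0x989680'          # value from the template arguments

wait_ns = int(MBAR_MAXWAIT, 16)    # hex literal -> integer
wait_ms = wait_ns / 1_000_000      # assumed unit: nanoseconds

print(wait_ns, wait_ms)
```

Storing the decimal value and formatting it with `hex()` when building the template arguments would make the intent clearer at the point of definition.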


2 participants