PTX Backend #18
I know this is an utter pain, but for FP32/FP64 can you confirm correctness for all relevant PyFR matrices at a suite of N values, for all instances where a kernel is expected to work (on A100/H100/B100)?
.param .u64 _c)
{
% endif
.reg .u32 n, id, tid_x, tid_y;
Ensure we throw higher up if n is too big.
We don't handle n being too large in any of the other backends.
In the embedded case we do: https://github.com/PyFR/GiMMiK/blob/master/gimmik/kernels/cuda/cstream.mako#L20 (the argument case doesn't, but that is not currently used for CUDA).
nnz = np.count_nonzero(arr)
nuq = len(np.unique(np.abs(arr)))
density = nnz / arr.size
return (nuq <= 28) or (density <= 0.15)
Check whether these thresholds could do with tuning.
I think that would be a separate PR.
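If the thresholds are revisited later, one way to make them tunable is to lift them into keyword arguments. This is only a sketch: the function name `is_suitable` and the parameter names `max_unique` and `max_density` are hypothetical, and only the default values 28 and 0.15 come from the diff.

```python
import numpy as np


def is_suitable(arr, max_unique=28, max_density=0.15):
    # Hypothetical wrapper; only the defaults (28 and 0.15) come from the diff
    arr = np.asarray(arr)
    nnz = np.count_nonzero(arr)
    # Number of distinct magnitudes (zero counts as one, as in the original)
    nuq = len(np.unique(np.abs(arr)))
    density = nnz / arr.size
    return bool((nuq <= max_unique) or (density <= max_density))
```

With the limits exposed like this, a tuning sweep can call the predicate with different values without touching the selection logic.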
% for idx, kx in enumerate(bchunks[bb]):
ld.shared.${pftype} bv, [bsub_thread + ${bsub_off(buf_cur, idx)}];
% for j, row_j in enumerate(mcx):
<% jx = A[row_j, kx] %>
See if NumPy indexing can be used in place of the per-row lookup in the for loop, i.e. A[mcx, kx].
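For reference, a small sketch of what that indexing does. The names `A`, `mcx` and `kx` are taken from the template, but the values are made up for illustration.

```python
import numpy as np

# Illustrative values only; in the template these come from the operator matrix
A = np.arange(12, dtype=float).reshape(4, 3)
mcx = [0, 2, 3]   # rows handled by this chunk
kx = 1            # current column

# Per-row scalar lookup, as in the template loop
looped = [A[row_j, kx] for row_j in mcx]

# Single fancy-indexing call returning the same entries as a 1D array
col = A[mcx, kx]

assert col.tolist() == looped
```

The template would still need a loop to emit one instruction per entry, so this only moves the lookup out of the loop body.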
JSON looks solid. See if we can factor out some of the common code so that other backends (CUDA) can also use it; that also makes the code easier to evaluate standalone. I'll start trying to chunk through the kernels, but it would be great if you could give a one-sentence sketch of their general approach.
}

# Map Supported CC -> Minimum PTX version
PTX_SM = {(8, 0): (7, 0), (9, 0): (8, 6), (10, 0): (8, 7), (10, 3): (8, 7),
Is this okay when new GPUs are released?
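One possible policy for unlisted GPUs is to fall back to the newest listed compute capability that does not exceed the device's. The sketch below uses the table from the diff; the helper `resolve_target` is hypothetical, and the fallback assumes the generated PTX is forward compatible with newer hardware, which would need checking for any architecture-specific features.

```python
# Table copied from the diff: compute capability -> minimum PTX version
PTX_SM = {(8, 0): (7, 0), (9, 0): (8, 6), (10, 0): (8, 7), (10, 3): (8, 7),
          (12, 0): (8, 7), (12, 1): (8, 7)}


def resolve_target(cc, table=PTX_SM):
    # Hypothetical helper: newest listed CC that does not exceed the device CC
    candidates = [k for k in table if k <= cc]
    if not candidates:
        raise ValueError(f'Unsupported compute capability {cc}')
    target = max(candidates)
    # Return both the CC to target and the PTX version to declare
    return target, table[target]
```

Under this policy a device reporting (8, 6) would be compiled for (8, 0) with PTX (7, 0), and a future (13, 0) device would reuse the (12, 1) entry rather than fail with a KeyError.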
PTX_SM = {(8, 0): (7, 0), (9, 0): (8, 6), (10, 0): (8, 7), (10, 3): (8, 7),
          (12, 0): (8, 7), (12, 1): (8, 7)}

PTX_TEMPLATE_FAMILY = {
Can this be in the config?
'fzero': ('0f00000000' if dtype == 'float'
          else '0d0000000000000000'),
'beta_zero': self.beta == 0,
'mbar_maxwait': '0x989680',
What does this correspond to?
It is 10'000'000 in hex: it overrides the system time limit in the mbarrier wait, setting it to 10 ms. This is generally a good idea.
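The magic constants in this dictionary can be checked, or generated, with the standard library. The sketch below is an illustration rather than the PR's code: the helper name `ptx_float_literal` is hypothetical, and reading 10'000'000 as 10 ms assumes the limit is counted in nanoseconds.

```python
import struct


def ptx_float_literal(value, dtype):
    # PTX exact literals: 0f + 8 hex digits (FP32), 0d + 16 hex digits (FP64)
    if dtype == 'float':
        return '0f' + struct.pack('>f', value).hex().upper()
    return '0d' + struct.pack('>d', value).hex().upper()


# Reproduces the two 'fzero' strings from the diff
assert ptx_float_literal(0.0, 'float') == '0f00000000'
assert ptx_float_literal(0.0, 'double') == '0d0000000000000000'

# 'mbar_maxwait': 0x989680 is 10'000'000, i.e. 10 ms if the unit is ns
assert int('0x989680', 16) == 10_000_000
```

Generating the strings this way, or at least naming the constant with a comment giving its decimal value, would answer the question without a round trip through a hex calculator.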
This adds a PTX backend to GiMMiK. The key features are:
Optimisations have focused on FP64; FP32 is future work.