compiler: Avoid int32 overflow in linearized host-device transfer size#2939
Open
gaoflow wants to merge 3 commits into
Open
compiler: Avoid int32 overflow in linearized host-device transfer size#2939gaoflow wants to merge 3 commits into
gaoflow wants to merge 3 commits into
Conversation
When a host-device data transfer is linearized, its array section size is emitted as a product of the Function's per-dimension sizes, e.g. `copyin(u[0:u_vec->size[0]*u_vec->size[1]*u_vec->size[2]*u_vec->size[3]])`. The `size[i]` fields are 32-bit C ints, so for a Function with more than ~2**31 elements the product overflows `int` before it is used as the transfer bound, yielding a bogus size and a corrupt/failed device transfer. Cast each factor of the product to a 64-bit integer so the multiplication is carried out in 64-bit arithmetic. Casting the whole product would be too late (the overflow would already have occurred), so each factor is cast individually. Non-product bounds (a single size, an offset, a constant) cannot overflow and are left untouched, as are non-transfer expressions. Fixes devitocodes#2777
FabioLuporini
requested changes
May 29, 2026
Address review: replace the ad-hoc _avoid_overflow helper with the existing as_long. as_long only substituted plain Symbols (retrieve_symbols), so it was a no-op on the IndexedPointer size factors (vec->size[i]) of a linearized transfer bound; extend it to retrieve_terminals so Indexed/IndexedPointer leaves are cast too. Keep the cast scoped to Mul products in PragmaTransfer so non-linearized multi-dimensional sections are not needlessly upcast.
mloubout
reviewed
Jun 1, 2026
…nt directly Per review: as_long already walks the expression args, so the cast() helper and its is_Mul check are unnecessary. Apply as_long to the section extent directly. Output is unchanged: the start bound is always 0/an offset (left as-is) and the extent is the size product that as_long promotes to 64-bit.
Contributor
Author
|
Good point — dropped the sections = ''.join([f'[{ccode(i)}:{ccode(as_long(j))}]'
for i, j in self.sections])The generated code is unchanged: the start bound |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Fixes #2777.
When a host↔device data transfer is linearized, its array section size is emitted as a product of the
Function's per-dimension sizes, for example:#pragma acc enter data copyin(u[0:u_vec->size[0]*u_vec->size[1]*u_vec->size[2]*u_vec->size[3]])The
u_vec->size[i]fields are 32-bit Cints, so the productsize[0]*size[1]*size[2]*size[3]is evaluated in 32-bit arithmetic. For aFunctionwith more than~2**31elements (e.g. the reporter's1295**3 ≈ 2.17e9points, ~24.5 GB) the product overflowsintbefore it is used as the transfer bound, producing a bogus size (the reporter saw18446744065653020036) and a corrupt / failed device transfer — independent ofindex-mode=int64/linearize=True, because those control the kernel index type, not the type of the transfer-clause arithmetic.As @mloubout noted on the issue, the fix is to perform the size multiplication in 64-bit. This casts each factor of a product section bound to a 64-bit integer:
#pragma acc enter data copyin(u[0:(long)(u_vec->size[0])*(long)(u_vec->size[1])*(long)(u_vec->size[2])*(long)(u_vec->size[3])])Casting the whole product (
(long)(a*b*c)) would be too late — the overflow would already have happened in 32-bit — so each factor is cast individually, which forces every multiplication to be 64-bit regardless of operand ordering.The change lives in
PragmaTransfer._generate, so it is scoped to host-device transfer clauses only. Non-product bounds (a single dimension size, an offset, a constant) cannot overflow and are left untouched, addressing the concern that there is "no reason to uselongfor all of those". Non-transfer expressions (e.g. free-space guards, TMA descriptors) are unaffected.Reproduction
Before:
After:
Verification
openacccopyin/copyout/deleteandopenmpmap(to:/release:)) and to 2D/3D Functions.[0:s0][0:s1]...) is unchanged — there is no product there, hence no overflow.TestPassesOptional::test_linearize_transfer_no_overflowasserting eachsize[i]factor of a linearized transfer is cast tolongand that no bare 32-bit product remains.test_gpu_openmp.pyexpectations (test_basic,test_multiple_eqns) whose OpenMP transfers use the flattened product form.linearize=True, build and run unchanged (no device transfers emitted).flake8clean on the changed files.