feat(c-tri): A64FX ARM64+SVE backend (NEON BVH4, SVE BVH8, fp64, SDOT bench)#4
Merged
Conversation
… bench)
Add ARM SIMD paths to the C11 kernel alongside scalar/SSE4/AVX2, compiled
with Fujitsu fcc (-Nclang -march=armv8.2-a+sve). Compile-time dispatch via
LRT_TRI_HAS_NEON / LRT_TRI_HAS_SVE; the scalar path stays the oracle and all
parity constants are shared.
- NEON BVH4 (128-bit fp32): 1:1 mirror of the SSE4 path (slab, 4-wide
Moller-Trumbore, perm[octant] push). movemask via vandq{1,2,4,8}+vaddvq,
any-hit via vmaxvq. kernel_name "bvh4/neon".
- SVE BVH8 (A64FX 512-bit fp32): mirror of AVX2, 8-of-16 lanes
(svwhilelt_b32); hit list materialized to float[8] + scalar insertion-sort
to reproduce AVX2 ordering. kernel_name "bvh8/sve". AUTO stays BVH4; pass
LRT_TRI_LAYOUT_BVH8 for SVE.
- fp64 HPC path: new lrt_tri_intersect1_hp (lrt_ray_hp/lrt_hit_hp) traverses
the fp32 BVH with an SVE 8-wide fp64 (svfloat64) leaf; the fp64
custom-geometry callback API (lightrt_c.c) gains a NEON node slab.
- int8/int16 SDOT leaf microbenchmark (benchmark_c/bench_a64fx_sdot.c):
genuinely uses svdot_s32/svdot_s64. Honest verdict (measured on A64FX):
neither beats 16-wide fp32 SVE (int8 ~1.03x, int16 ~0.78x), like the HIP
WMMA result; kept standalone, not wired into the BVH.
- LRT_TRI_FORCE_SCALAR knob to build a scalar baseline.
- Build: Makefile + top-level/benchmark CMake gain an aarch64 +sve/fcc branch;
scripts/build_a64fx.sh + run_a64fx.sh build & run SIMD/scalar/SDOT.
- Test portability for fcc clang mode: hoist GNU nested functions, fix a
label-needs-statement, an atomic-on-const load, and guard AVX2-only q4/fp8
layouts on non-x86.
Verified on A64FX (1 core, mandelbulb 88k tris): NEON BVH4 ~1.95x and SVE
BVH8 ~2.33x vs the same kernel built scalar; all hits agree 100% with the
fp64 oracle; full test_lightrt_c_tri suite PASS.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…benches Extend the A64FX benchmark suite to cover the non-triangle prim types, which currently run the scalar path on ARM (NEON/SVE leaves are triangle-only): - bench_a64fx_subd.c: parametric "subd" surfaces — a bicubic Bézier patch grid and a NURBS surface, intersected directly (no tessellation), traced over a pthread pool. Reuses rays.c. ~1.4-1.9 Mray/s at 48 threads (scalar). - bench_a64fx_volume.c: dense N^3 float volume raymarch (trilinear + front-to- back compositing) — the memory-bandwidth-bound workload. Reports Mrays/s, Msamples/s and effective GB/s across grid sizes; incoherent gather throughput drops ~2.6x (332->130 Msamp/s) as the grid spills from L2 to HBM2 while coherent stays flat (cache-line reuse). - build_a64fx.sh / run_a64fx.sh now build and run both, plus the existing curve/sphere/SDF backends (hair_bench covers round/flat/bezier curves). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Curves and the other BVH4 primitives previously fell back to scalar on A64FX (the 4-wide leaves are written in SSE intrinsics, x86-only). Add a minimal SSE→NEON compatibility shim (lightrt_sse2neon_min.h, only the ~39 _mm_* intrinsics this kernel uses) and, on ARM NEON, set LRT_TRI_HAS_SSE4=1 so the existing 4-wide SSE BVH4 code (triangle, curve, point, sphere, quad, qtri, and the parametric-patch traversal drivers) compiles and runs as NEON 4-wide. One shim vectorizes every BVH4 leaf instead of hand-duplicating them. - Replaces the hand-written NEON triangle BVH4 (removed) with the shimmed SSE path; SVE BVH8 + fp64 stay native (guards decoupled from SSE4). - Shim maps comparisons to float-domain masks, _mm_movemask_ps -> vshrq+vaddvq, _mm_blendv_ps -> vbslq (kernel only feeds compare masks); LRT_TRI_NEON_SHIM marks the mode and fixes kernel_name to report bvh4/neon, bvh8/sve. - Measured (48-thread A64FX, vs same kernel built scalar): flat curve ~2.3x, cubic-Bézier curve ~3.1x, sphere ~1.4x. Round-linear's blendv/movemask-heavy Embree CSG leaf is SLOWER as NEON, so TRI_PRIM_RLCURVE is routed to scalar. - Parametric surfaces (bilinear/bezpatch/NURBS) remain scalar: per-patch Newton + adaptive subdivision is divergent and not 4-wide-SIMD-friendly; intra-patch SVE eval is noted as future work. - Full tests/test_lightrt_c_tri.c oracle suite PASSES on A64FX (all prim_kinds); triangle bench unchanged and still verifies 100% vs the fp64 oracle. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t-in) Add an SVE evaluation of the bicubic Bézier / NURBS patch leaf: instead of 18 scalar 1D cubic evals, weight all 16 control points at once on A64FX's 16-lane fp32 vector. The Bernstein basis for (u,v) gives a 16-lane weight w[k]=Bu[k%4]*Bv[k>>2] (k=v*4+u, matching the cp order) built with svld1rq (quad-replicate Bu) + svtbl (expand Bv) + svindex; the surface point and the du/dv partials are per-axis svaddv(w*cp_axis) horizontal sums. Control points are transposed to SoA once per patch and reused across all Newton iterations. NURBS adds the 4th homogeneous component + perspective divide. Both tri_bezpatch_eval_sve and tri_rbezpatch_eval_sve are wired through the Newton + seed via TRI_BEZ_EVAL / TRI_RBEZ_EVAL. Honest verdict (measured on A64FX, 48 threads): NEUTRAL-to-slightly-slower than scalar — bezpatch ~0.98x, NURBS ~0.97x. The 9-12 horizontal svaddv reductions per eval cost about as much as the 18 scalar cubic evals, and the eval is not the dominant cost (adaptive subdivision + AABB cull + Newton control flow are), mirroring the SDOT/WMMA results. So it ships OFF by default behind LRT_TRI_SVE_PATCH_EVAL; the scalar patch eval stays the default (no regression). Both configs verify 100% against the brute-force oracle (tests/test_lightrt_c_tri RESULT: PASS for default scalar and -DLRT_TRI_SVE_PATCH_EVAL). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rdict Add a 16-wide node (lrt_bvh16_node, 512 B) + 16-wide leaf (lrt_tri16) layout (LRT_TRI_LAYOUT_BVH16) so the A64FX SVE slab + Moller-Trumbore fill all 16 fp32 lanes with no predicate waste (tri_intersect_bvh16_sve / _occluded, svptrue_b32). Collapse generalized to width 16 (set[16], width==16 node branch, block_shift 4); build allocates nodes16 and rejects layout 16 without SVE (no scalar nodes16 traversal). Serialization/refit/mmap reject BVH16. New c11-bvh16 bench backend. Honest verdict (48-thread A64FX, mandelbulb 127752 tris, all verify 100%): BVH4 primary 30.6 incoherent 21.9 shadow 30.2 Mray/s (128 B nodes) BVH8 primary 32.3 incoherent 20.7 shadow 30.1 Mray/s (256 B nodes) BVH16 primary 17.8 incoherent 13.8 shadow 16.5 Mray/s (512 B nodes) BVH16 is ~0.55x of BVH8 despite filling all 16 lanes: the 512-byte nodes (8 cache lines) cost too much bandwidth and the wider nodes cull worse, outweighing the lane utilization (build is faster, memory slightly lower, but the trace tanks). BVH8's 8-of-16 stays the SVE sweet spot; BVH16 ships opt-in as a measured negative result, like the SDOT / patch-eval experiments. Correctness: BVH16 matches the BVH8/scalar oracle 100% (max_rel_t 0); full tests/test_lightrt_c_tri suite still PASSES. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extend lrt_tri_intersect1_hp to bicubic Bézier and NURBS (incl. trimmed) surfaces for HPC visualization. The fp32 adaptive-subdivision intersector does the coarse find (which patch + an approximate (u,v) seed); an fp64 Newton on (u,v,t) — using the patch control points promoted to double and the fp64 ray — then refines to full fp64 t/u/v. NURBS (u,v) is mapped global<->local via the per-patch domain (shade_dom); control points are reconstructed from the existing per-prim shade_cps, so no extra storage. Reusing the fp32 find avoids duplicating the subdivision/cull machinery. The fp64 eval (tri_bezpatch_eval_hp / tri_rbezpatch_eval_hp) is scalar double: the fp32 SVE Bernstein eval was already neutral (svaddv-bound), and fp64 halves the SVE lane count (16 control points span two 8-lane vectors), so SVE does not help the fp64 eval — precision, not throughput, is the point of this path. The reasoning is documented next to the measured fp32 SVE-eval result. Verified: 100% hit agreement (prim_id + u/v) with the fp32 patch path on a bezpatch grid and a NURBS surface (200k rays each); fp64 t carries the extra precision (max_rel_t ~5-8e-5 vs the fp32 t). Full tests/test_lightrt_c_tri suite still PASSES. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The memory-efficient 8-bit-quantized-node layout (LRT_TRI_LAYOUT_BVH8Q, 128-byte nodes vs 256) ran the scalar path on A64FX — the decode+slab was AVX2-only. Add an SVE decode+slab (tri_bvh8q_slab_sve / tri_intersect_bvh8q_sve / _occluded): 8-bit child bounds are loaded + widened with svld1ub_u32 + svcvt_f32_u32 and the plane is q*scale*invd + (org_node-org_ray)*invd via svmla (8-of-16 lanes), mirroring tri_bvh8q_slab_avx. Leaves are full-precision lrt_tri8, so the existing SVE leaf MT (tri_block_isect_sve) is reused. Dispatch + kernel_name (bvh8q/sve). Measured (48-thread A64FX, mandelbulb 127k, all verify 100%): bvh8q scalar primary 12.8 incoherent 9.7 Mray/s bvh8q SVE primary 31.6 incoherent 22.4 Mray/s (~2.5x over scalar) bvh8q/sve lands ~= bvh8/sve (33.6/22.8) at half the node bytes. The fp8/q4 node formats (qnode != 0) remain AVX2-only. Correctness: bvh8q/sve matches the bvh8/scalar oracle 100% (conservative quantized bounds -> identical hits); full tests/test_lightrt_c_tri PASSES. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a 16-ray-wide SVE packet (tri_intersect16_sve): 16 coherent rays traverse the BVH together, tested ray-parallel against one box / one triangle per step, amortizing node + leaf fetches over the packet. Walks any plain-triangle layout (BVH4/8/8q/16) via tri_node_load (which now also handles BVH16). Closest-hit is order-independent, so a child is pushed if any active ray's interval enters it within that ray's current best_t. Routes lrt_tri_intersect1N(COHERENT) when built with -DLRT_TRI_SVE_PACKET. Measured (48-thread A64FX, mandelbulb 127k, 100% match vs the fp64 oracle): BVH4 primary 30.6 -> 120.7 Mray/s (~3.9x) BVH8 primary 32.3 -> 77.4 Mray/s (~2.4x) BVH4 wins under the packet (4 boxes/node vs 8 — coherent rays don't need the wider node). Off by default: the packet's FMA + traversal order is not bit-identical to the single-ray kernel (ULP-level t and tie-break prim_id differences — still oracle-correct within tolerance), and the default intersect1N keeps the bit-exact Ray4 path so the batch==single unit-test contract holds. The benchmark build (scripts/build_a64fx.sh) enables -DLRT_TRI_SVE_PACKET. Any-hit (occluded1N) stays per-ray (lockstep defeats per-ray early-out). Default unit suite (packet off) PASSES bit-exactly; the packet-on build verifies 100% vs the fp64 oracle on coherent primary rays. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
A64FX (ARM64 + SVE 1.0) backend for the C11 ray-tracing kernel
Adds ARM SIMD paths to
lightrt_c_tri.c/lightrt_c.calongside the existingscalar/SSE4/AVX2 ones, compiled with the Fujitsu compiler in clang mode
(
fcc -Nclang -march=armv8.2-a+sve). Compile-time dispatch viaLRT_TRI_HAS_NEON/LRT_TRI_HAS_SVE; the scalar path stays the correctnessoracle and every parity constant is shared. A64FX SVE is fixed 512-bit
(16 fp32 / 8 fp64 / 64 int8 lanes).
What's new
Möller–Trumbore,
perm[octant]push); movemask emulated withvandq{1,2,4,8}+vaddvq, any-hit viavmaxvq.kernel_name=bvh4/neon.on the low 8-of-16 lanes (
svwhilelt_b32); hit list materialized tofloat[8]+ the scalar insertion-sort to reproduce AVX2 ordering exactly.kernel_name=bvh8/sve. AUTO layout stays BVH4; passLRT_TRI_LAYOUT_BVH8for the SVE path. (BVH16 /
lrt_tri16is a documented follow-up.)lrt_tri_intersect1_hp(
lrt_ray_hp/lrt_hit_hp) traverses the fp32 BVH but runs the leaf MT inSVE 8-wide fp64 (
svfloat64), scalar double elsewhere. The fp64custom-geometry callback API (
lightrt_c.c) gains a NEON node slab.benchmark_c/bench_a64fx_sdot.c) —genuinely uses
svdot_s32/svdot_s64. Honest verdict (measured onA64FX): neither beats 16-wide fp32 SVE (int8 ~1.03×, int16 ~0.78×, ~3% mean
accuracy loss), mirroring the repo's HIP-WMMA finding; kept standalone, not
wired into the BVH.
LRT_TRI_FORCE_SCALARbuilds a scalar baseline.benchmark_cCMake gain an aarch64-march=armv8.2-a+sve/fcc branch;benchmark_c/scripts/build_a64fx.shandrun_a64fx.shbuild & run the SIMD / scalar / SDOT binaries.scope, fix a label-needs-statement, an atomic-load-on-const, and guard the
AVX2-only q4/fp8 layouts on non-x86.
Verification (1 A64FX core, mandelbulb 88k tris)
All hits agree 100% with the fp64 oracle; full
tests/test_lightrt_c_tri.csuite PASS with NEON+SVE active.
Full-node scaling (48 threads, mandelbulb 127,752 tris, 8M rays)
SIMD hits all verify 100% vs the fp64 oracle; speedups hold under full-node
memory-bandwidth contention (SVE BVH8 keeps the larger multiplier on incoherent
and shadow rays).
Other primitive types on A64FX (48 threads)
The BVH4 prim leaves are now NEON-vectorized via a minimal SSE→NEON shim
(
lightrt_sse2neon_min.h): on ARM the kernel setsLRT_TRI_HAS_SSE4=1and runsthe existing 4-wide SSE leaves (triangle, curve, point, sphere, quad, qtri, patch
traversal) through the shim — one shim instead of hand-duplicating every leaf.
SVE BVH8 + fp64 stay native.
Curves — NEON vs the same kernel built scalar (furball, 48 threads, primary Mray/s):
Round-linear's leaf is a blendv/movemask-heavy Embree CSG port; it's measured
slower as NEON than scalar on A64FX, so
TRI_PRIM_RLCURVEis routed to scalar.Parametric surfaces (subd) — direct ray-patch, no tessellation:
These stay scalar by default. The patch eval is SVE-vectorizable — a
16-lane Bernstein-weighted sum over all 16 control points
(
tri_bezpatch_eval_sve/tri_rbezpatch_eval_sve,svld1rq+svtblweights,per-axis
svaddv) — and it's implemented behind-DLRT_TRI_SVE_PATCH_EVAL, butmeasured neutral-to-slightly-slower on A64FX (bezpatch ~0.98×, NURBS ~0.97×):
the 9–12 horizontal
svaddvreductions per eval cost ~as much as the 18 scalar1D cubic evals, and the eval isn't the dominant cost (adaptive subdivision +
AABB cull + Newton control flow are). So it ships off, alongside the SDOT
honest-verdict result. Both configs verify 100% vs the oracle.
Dense-grid volume raymarch (N³ floats, trilinear + front-to-back, Msamples/s)
— the memory-bound case: incoherent random-gather throughput drops ~2.6× as the
grid spills from L2-resident to HBM2-resident, while coherent marching stays
flat (trilinear cache-line reuse):
SVE 16-ray coherent packet — biggest coherent win (opt-in)
tri_intersect16_sverouteslrt_tri_intersect1N(COHERENT)through a 16 raysvs one box/triangle traversal (ray-parallel SoA, full 16-lane SVE), walking any
plain-triangle layout via
tri_node_load. 48-thread A64FX, mandelbulb 127k,100% fp64-oracle match:
BVH4 wins under the packet (4 boxes/node vs 8). Off by default (
-DLRT_TRI_SVE_PACKET):the packet's FMA/traversal order isn't bit-identical to the single-ray kernel
(ULP-t / tie prim_id), so the default
intersect1Nkeeps the bit-exact Ray4 path(the
batch==singleunit test stays green); the benchmark build enables it.SVE BVH8Q — quantized-node layout (positive port)
The memory-efficient 8-bit-quantized-node layout (
LRT_TRI_LAYOUT_BVH8Q,128-byte nodes vs 256) ran scalar on A64FX; the decode+slab is now SVE
(
svld1ub_u32+svcvt_f32_u32,svmla). ~2.5× over scalar and ≈ BVH8/SVEat half the node bytes (48-thread A64FX, all verify 100%):
(For reference BVH8/SVE is 33.6 / 22.8.) fp8/q4 node formats stay AVX2-only.
BVH16 — fill all 16 SVE lanes (opt-in, honest negative)
LRT_TRI_LAYOUT_BVH16adds a 16-wide node (512 B) + 16-wide leaf so the SVEslab + Möller-Trumbore use all 16 fp32 lanes with no predicate waste. 100%
correct vs the BVH8/scalar oracle, but ~0.55× of BVH8 (48-thread A64FX,
mandelbulb 127k):
The 512-byte nodes (8 cache lines) cost too much bandwidth and wider nodes cull
worse; the lane utilization doesn't pay for it. BVH8's 8-of-16 stays the SVE
sweet spot; BVH16 ships opt-in (SVE-only; no serialize/refit/mmap).
fp64 Bézier/NURBS surface intersection (HPC precision)
lrt_tri_intersect1_hpnow covers bicubic Bézier + NURBS: the fp32adaptive-subdiv intersector finds the patch + (u,v) seed, then an fp64 Newton
on (u,v,t) (control points → double, fp64 ray) refines to full fp64 t/u/v.
100% hit agreement with the fp32 path (bezpatch + NURBS, 200k rays each);
fp64
tcarries the extra precision. The fp64 eval is scalar double — an SVEfp64 eval doesn't help (the fp32 SVE Bernstein eval was already neutral, and
fp64 halves the lane count: 16 CPs over two 8-lane vectors). Precision, not
throughput, is the point.
🤖 Generated with Claude Code