feat(c-tri): A64FX ARM64+SVE backend (NEON BVH4, SVE BVH8, fp64, SDOT bench) by syoyo · Pull Request #4 · lighttransport/lightrt

syoyo · 2026-06-29T23:20:46Z

A64FX (ARM64 + SVE 1.0) backend for the C11 ray-tracing kernel

Adds ARM SIMD paths to lightrt_c_tri.c / lightrt_c.c alongside the existing
scalar/SSE4/AVX2 ones, compiled with the Fujitsu compiler in clang mode
(fcc -Nclang -march=armv8.2-a+sve). Compile-time dispatch via
LRT_TRI_HAS_NEON / LRT_TRI_HAS_SVE; the scalar path stays the correctness
oracle and every parity constant is shared. A64FX SVE is fixed 512-bit
(16 fp32 / 8 fp64 / 64 int8 lanes).

What's new

NEON BVH4 (128-bit fp32) — 1:1 mirror of the SSE4 path (slab, 4-wide
Möller–Trumbore, perm[octant] push); movemask emulated with
vandq{1,2,4,8}+vaddvq, any-hit via vmaxvq. kernel_name = bvh4/neon.
SVE BVH8 (A64FX 512-bit fp32) — mirror of AVX2, one node/leaf per vector
on the low 8-of-16 lanes (svwhilelt_b32); hit list materialized to
float[8] + the scalar insertion-sort to reproduce AVX2 ordering exactly.
kernel_name = bvh8/sve. AUTO layout stays BVH4; pass LRT_TRI_LAYOUT_BVH8
for the SVE path. (BVH16 / lrt_tri16 is a documented follow-up.)
fp64 high-precision (HPC visualization) — new lrt_tri_intersect1_hp
(lrt_ray_hp / lrt_hit_hp) traverses the fp32 BVH but runs the leaf MT in
SVE 8-wide fp64 (svfloat64), scalar double elsewhere. The fp64
custom-geometry callback API (lightrt_c.c) gains a NEON node slab.
int8/int16 SDOT leaf microbenchmark (benchmark_c/bench_a64fx_sdot.c) —
genuinely uses svdot_s32 / svdot_s64. Honest verdict (measured on
A64FX): neither beats 16-wide fp32 SVE (int8 ~1.03×, int16 ~0.78×, ~3% mean
accuracy loss), mirroring the repo's HIP-WMMA finding; kept standalone, not
wired into the BVH. LRT_TRI_FORCE_SCALAR builds a scalar baseline.
Build — Makefile + top-level/benchmark_c CMake gain an aarch64
-march=armv8.2-a+sve/fcc branch; benchmark_c/scripts/build_a64fx.sh and
run_a64fx.sh build & run the SIMD / scalar / SDOT binaries.
Test portability for fcc clang mode — hoist GNU nested functions to file
scope, fix a label-needs-statement, an atomic-load-on-const, and guard the
AVX2-only q4/fp8 layouts on non-x86.

Verification (1 A64FX core, mandelbulb 88k tris)

All hits agree 100% with the fp64 oracle; full tests/test_lightrt_c_tri.c
suite PASS with NEON+SVE active.

kernel	scalar	SIMD	speedup
BVH4 primary	1.88	3.66 Mray/s	1.95× (NEON)
BVH8 primary	1.43	3.33 Mray/s	2.33× (SVE)
BVH8 incoherent / shadow	0.29 / 0.37	0.65 / 0.84	2.24× / 2.30×

Full-node scaling (48 threads, mandelbulb 127,752 tris, 8M rays)

SIMD hits all verify 100% vs the fp64 oracle; speedups hold under full-node
memory-bandwidth contention (SVE BVH8 keeps the larger multiplier on incoherent
and shadow rays).

kernel	workload	scalar	SIMD	speedup
BVH4 (NEON)	primary	18.82	39.25 Mray/s	2.09×
BVH4 (NEON)	incoherent / shadow	13.45 / 20.28	21.31 / 26.55	1.58× / 1.31×
BVH8 (SVE)	primary	13.89	34.50 Mray/s	2.48×
BVH8 (SVE)	incoherent / shadow	11.45 / 11.66	23.41 / 27.66	2.05× / 2.37×

Other primitive types on A64FX (48 threads)

The BVH4 prim leaves are now NEON-vectorized via a minimal SSE→NEON shim
(lightrt_sse2neon_min.h): on ARM the kernel sets LRT_TRI_HAS_SSE4=1 and runs
the existing 4-wide SSE leaves (triangle, curve, point, sphere, quad, qtri, patch
traversal) through the shim — one shim instead of hand-duplicating every leaf.
SVE BVH8 + fp64 stay native.

Curves — NEON vs the same kernel built scalar (furball, 48 threads, primary Mray/s):

curve type	scalar	NEON	speedup
flat ribbon	0.90	2.05	2.3×
cubic Bézier	1.11	3.44	3.1×
sphere (point)	15.7	22.4	1.4×
round-linear	2.03	1.46	kept scalar (NEON CSG leaf slower)

Round-linear's leaf is a blendv/movemask-heavy Embree CSG port; it's measured
slower as NEON than scalar on A64FX, so TRI_PRIM_RLCURVE is routed to scalar.

Parametric surfaces (subd) — direct ray-patch, no tessellation:

surface	primary	incoherent
bicubic Bézier (1600 patches)	1.86	1.56
NURBS (25×25 bicubic net)	1.42	1.31

These stay scalar by default. The patch eval is SVE-vectorizable — a
16-lane Bernstein-weighted sum over all 16 control points
(tri_bezpatch_eval_sve/tri_rbezpatch_eval_sve, svld1rq+svtbl weights,
per-axis svaddv) — and it's implemented behind -DLRT_TRI_SVE_PATCH_EVAL, but
measured neutral-to-slightly-slower on A64FX (bezpatch ~0.98×, NURBS ~0.97×):
the 9–12 horizontal svaddv reductions per eval cost ~as much as the 18 scalar
1D cubic evals, and the eval isn't the dominant cost (adaptive subdivision +
AABB cull + Newton control flow are). So it ships off, alongside the SDOT
honest-verdict result. Both configs verify 100% vs the oracle.

Dense-grid volume raymarch (N³ floats, trilinear + front-to-back, Msamples/s)
— the memory-bound case: incoherent random-gather throughput drops ~2.6× as the
grid spills from L2-resident to HBM2-resident, while coherent marching stays
flat (trilinear cache-line reuse):

grid	working set	primary (coherent)	incoherent (gather)
64³	1 MB (L2)	108	332
128³	8 MB	105	307
256³	64 MB (HBM)	98	202
512³	512 MB (HBM)	85	130

SVE 16-ray coherent packet — biggest coherent win (opt-in)

tri_intersect16_sve routes lrt_tri_intersect1N(COHERENT) through a 16 rays
vs one box/triangle traversal (ray-parallel SoA, full 16-lane SVE), walking any
plain-triangle layout via tri_node_load. 48-thread A64FX, mandelbulb 127k,
100% fp64-oracle match:

coherent primary	per-ray	16-ray packet	speedup
BVH4	30.6	120.7 Mray/s	3.9×
BVH8	32.3	77.4 Mray/s	2.4×

BVH4 wins under the packet (4 boxes/node vs 8). Off by default (-DLRT_TRI_SVE_PACKET):
the packet's FMA/traversal order isn't bit-identical to the single-ray kernel
(ULP-t / tie prim_id), so the default intersect1N keeps the bit-exact Ray4 path
(the batch==single unit test stays green); the benchmark build enables it.

SVE BVH8Q — quantized-node layout (positive port)

The memory-efficient 8-bit-quantized-node layout (LRT_TRI_LAYOUT_BVH8Q,
128-byte nodes vs 256) ran scalar on A64FX; the decode+slab is now SVE
(svld1ub_u32+svcvt_f32_u32, svmla). ~2.5× over scalar and ≈ BVH8/SVE
at half the node bytes (48-thread A64FX, all verify 100%):

	scalar	SVE	speedup
BVH8Q primary	12.8	31.6	2.47×
BVH8Q incoherent	9.7	22.4	2.32×

(For reference BVH8/SVE is 33.6 / 22.8.) fp8/q4 node formats stay AVX2-only.

BVH16 — fill all 16 SVE lanes (opt-in, honest negative)

LRT_TRI_LAYOUT_BVH16 adds a 16-wide node (512 B) + 16-wide leaf so the SVE
slab + Möller-Trumbore use all 16 fp32 lanes with no predicate waste. 100%
correct vs the BVH8/scalar oracle, but ~0.55× of BVH8 (48-thread A64FX,
mandelbulb 127k):

layout	primary	incoherent	shadow	node
BVH4	30.6	21.9	30.2	128 B
BVH8	32.3	20.7	30.1	256 B
BVH16	17.8	13.8	16.5	512 B

The 512-byte nodes (8 cache lines) cost too much bandwidth and wider nodes cull
worse; the lane utilization doesn't pay for it. BVH8's 8-of-16 stays the SVE
sweet spot; BVH16 ships opt-in (SVE-only; no serialize/refit/mmap).

fp64 Bézier/NURBS surface intersection (HPC precision)

lrt_tri_intersect1_hp now covers bicubic Bézier + NURBS: the fp32
adaptive-subdiv intersector finds the patch + (u,v) seed, then an fp64 Newton
on (u,v,t) (control points → double, fp64 ray) refines to full fp64 t/u/v.
100% hit agreement with the fp32 path (bezpatch + NURBS, 200k rays each);
fp64 t carries the extra precision. The fp64 eval is scalar double — an SVE
fp64 eval doesn't help (the fp32 SVE Bernstein eval was already neutral, and
fp64 halves the lane count: 16 CPs over two 8-lane vectors). Precision, not
throughput, is the point.

🤖 Generated with Claude Code

… bench) Add ARM SIMD paths to the C11 kernel alongside scalar/SSE4/AVX2, compiled with Fujitsu fcc (-Nclang -march=armv8.2-a+sve). Compile-time dispatch via LRT_TRI_HAS_NEON / LRT_TRI_HAS_SVE; the scalar path stays the oracle and all parity constants are shared. - NEON BVH4 (128-bit fp32): 1:1 mirror of the SSE4 path (slab, 4-wide Moller-Trumbore, perm[octant] push). movemask via vandq{1,2,4,8}+vaddvq, any-hit via vmaxvq. kernel_name "bvh4/neon". - SVE BVH8 (A64FX 512-bit fp32): mirror of AVX2, 8-of-16 lanes (svwhilelt_b32); hit list materialized to float[8] + scalar insertion-sort to reproduce AVX2 ordering. kernel_name "bvh8/sve". AUTO stays BVH4; pass LRT_TRI_LAYOUT_BVH8 for SVE. - fp64 HPC path: new lrt_tri_intersect1_hp (lrt_ray_hp/lrt_hit_hp) traverses the fp32 BVH with an SVE 8-wide fp64 (svfloat64) leaf; the fp64 custom-geometry callback API (lightrt_c.c) gains a NEON node slab. - int8/int16 SDOT leaf microbenchmark (benchmark_c/bench_a64fx_sdot.c): genuinely uses svdot_s32/svdot_s64. Honest verdict (measured on A64FX): neither beats 16-wide fp32 SVE (int8 ~1.03x, int16 ~0.78x), like the HIP WMMA result; kept standalone, not wired into the BVH. - LRT_TRI_FORCE_SCALAR knob to build a scalar baseline. - Build: Makefile + top-level/benchmark CMake gain an aarch64 +sve/fcc branch; scripts/build_a64fx.sh + run_a64fx.sh build & run SIMD/scalar/SDOT. - Test portability for fcc clang mode: hoist GNU nested functions, fix a label-needs-statement, an atomic-on-const load, and guard AVX2-only q4/fp8 layouts on non-x86. Verified on A64FX (1 core, mandelbulb 88k tris): NEON BVH4 ~1.95x and SVE BVH8 ~2.33x vs the same kernel built scalar; all hits agree 100% with the fp64 oracle; full test_lightrt_c_tri suite PASS. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…benches Extend the A64FX benchmark suite to cover the non-triangle prim types, which currently run the scalar path on ARM (NEON/SVE leaves are triangle-only): - bench_a64fx_subd.c: parametric "subd" surfaces — a bicubic Bézier patch grid and a NURBS surface, intersected directly (no tessellation), traced over a pthread pool. Reuses rays.c. ~1.4-1.9 Mray/s at 48 threads (scalar). - bench_a64fx_volume.c: dense N^3 float volume raymarch (trilinear + front-to- back compositing) — the memory-bandwidth-bound workload. Reports Mrays/s, Msamples/s and effective GB/s across grid sizes; incoherent gather throughput drops ~2.6x (332->130 Msamp/s) as the grid spills from L2 to HBM2 while coherent stays flat (cache-line reuse). - build_a64fx.sh / run_a64fx.sh now build and run both, plus the existing curve/sphere/SDF backends (hair_bench covers round/flat/bezier curves). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Curves and the other BVH4 primitives previously fell back to scalar on A64FX (the 4-wide leaves are written in SSE intrinsics, x86-only). Add a minimal SSE→NEON compatibility shim (lightrt_sse2neon_min.h, only the ~39 _mm_* intrinsics this kernel uses) and, on ARM NEON, set LRT_TRI_HAS_SSE4=1 so the existing 4-wide SSE BVH4 code (triangle, curve, point, sphere, quad, qtri, and the parametric-patch traversal drivers) compiles and runs as NEON 4-wide. One shim vectorizes every BVH4 leaf instead of hand-duplicating them. - Replaces the hand-written NEON triangle BVH4 (removed) with the shimmed SSE path; SVE BVH8 + fp64 stay native (guards decoupled from SSE4). - Shim maps comparisons to float-domain masks, _mm_movemask_ps -> vshrq+vaddvq, _mm_blendv_ps -> vbslq (kernel only feeds compare masks); LRT_TRI_NEON_SHIM marks the mode and fixes kernel_name to report bvh4/neon, bvh8/sve. - Measured (48-thread A64FX, vs same kernel built scalar): flat curve ~2.3x, cubic-Bézier curve ~3.1x, sphere ~1.4x. Round-linear's blendv/movemask-heavy Embree CSG leaf is SLOWER as NEON, so TRI_PRIM_RLCURVE is routed to scalar. - Parametric surfaces (bilinear/bezpatch/NURBS) remain scalar: per-patch Newton + adaptive subdivision is divergent and not 4-wide-SIMD-friendly; intra-patch SVE eval is noted as future work. - Full tests/test_lightrt_c_tri.c oracle suite PASSES on A64FX (all prim_kinds); triangle bench unchanged and still verifies 100% vs the fp64 oracle. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…t-in) Add an SVE evaluation of the bicubic Bézier / NURBS patch leaf: instead of 18 scalar 1D cubic evals, weight all 16 control points at once on A64FX's 16-lane fp32 vector. The Bernstein basis for (u,v) gives a 16-lane weight w[k]=Bu[k%4]*Bv[k>>2] (k=v*4+u, matching the cp order) built with svld1rq (quad-replicate Bu) + svtbl (expand Bv) + svindex; the surface point and the du/dv partials are per-axis svaddv(w*cp_axis) horizontal sums. Control points are transposed to SoA once per patch and reused across all Newton iterations. NURBS adds the 4th homogeneous component + perspective divide. Both tri_bezpatch_eval_sve and tri_rbezpatch_eval_sve are wired through the Newton + seed via TRI_BEZ_EVAL / TRI_RBEZ_EVAL. Honest verdict (measured on A64FX, 48 threads): NEUTRAL-to-slightly-slower than scalar — bezpatch ~0.98x, NURBS ~0.97x. The 9-12 horizontal svaddv reductions per eval cost about as much as the 18 scalar cubic evals, and the eval is not the dominant cost (adaptive subdivision + AABB cull + Newton control flow are), mirroring the SDOT/WMMA results. So it ships OFF by default behind LRT_TRI_SVE_PATCH_EVAL; the scalar patch eval stays the default (no regression). Both configs verify 100% against the brute-force oracle (tests/test_lightrt_c_tri RESULT: PASS for default scalar and -DLRT_TRI_SVE_PATCH_EVAL). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rdict Add a 16-wide node (lrt_bvh16_node, 512 B) + 16-wide leaf (lrt_tri16) layout (LRT_TRI_LAYOUT_BVH16) so the A64FX SVE slab + Moller-Trumbore fill all 16 fp32 lanes with no predicate waste (tri_intersect_bvh16_sve / _occluded, svptrue_b32). Collapse generalized to width 16 (set[16], width==16 node branch, block_shift 4); build allocates nodes16 and rejects layout 16 without SVE (no scalar nodes16 traversal). Serialization/refit/mmap reject BVH16. New c11-bvh16 bench backend. Honest verdict (48-thread A64FX, mandelbulb 127752 tris, all verify 100%): BVH4 primary 30.6 incoherent 21.9 shadow 30.2 Mray/s (128 B nodes) BVH8 primary 32.3 incoherent 20.7 shadow 30.1 Mray/s (256 B nodes) BVH16 primary 17.8 incoherent 13.8 shadow 16.5 Mray/s (512 B nodes) BVH16 is ~0.55x of BVH8 despite filling all 16 lanes: the 512-byte nodes (8 cache lines) cost too much bandwidth and the wider nodes cull worse, outweighing the lane utilization (build is faster, memory slightly lower, but the trace tanks). BVH8's 8-of-16 stays the SVE sweet spot; BVH16 ships opt-in as a measured negative result, like the SDOT / patch-eval experiments. Correctness: BVH16 matches the BVH8/scalar oracle 100% (max_rel_t 0); full tests/test_lightrt_c_tri suite still PASSES. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Extend lrt_tri_intersect1_hp to bicubic Bézier and NURBS (incl. trimmed) surfaces for HPC visualization. The fp32 adaptive-subdivision intersector does the coarse find (which patch + an approximate (u,v) seed); an fp64 Newton on (u,v,t) — using the patch control points promoted to double and the fp64 ray — then refines to full fp64 t/u/v. NURBS (u,v) is mapped global<->local via the per-patch domain (shade_dom); control points are reconstructed from the existing per-prim shade_cps, so no extra storage. Reusing the fp32 find avoids duplicating the subdivision/cull machinery. The fp64 eval (tri_bezpatch_eval_hp / tri_rbezpatch_eval_hp) is scalar double: the fp32 SVE Bernstein eval was already neutral (svaddv-bound), and fp64 halves the SVE lane count (16 control points span two 8-lane vectors), so SVE does not help the fp64 eval — precision, not throughput, is the point of this path. The reasoning is documented next to the measured fp32 SVE-eval result. Verified: 100% hit agreement (prim_id + u/v) with the fp32 patch path on a bezpatch grid and a NURBS surface (200k rays each); fp64 t carries the extra precision (max_rel_t ~5-8e-5 vs the fp32 t). Full tests/test_lightrt_c_tri suite still PASSES. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The memory-efficient 8-bit-quantized-node layout (LRT_TRI_LAYOUT_BVH8Q, 128-byte nodes vs 256) ran the scalar path on A64FX — the decode+slab was AVX2-only. Add an SVE decode+slab (tri_bvh8q_slab_sve / tri_intersect_bvh8q_sve / _occluded): 8-bit child bounds are loaded + widened with svld1ub_u32 + svcvt_f32_u32 and the plane is q*scale*invd + (org_node-org_ray)*invd via svmla (8-of-16 lanes), mirroring tri_bvh8q_slab_avx. Leaves are full-precision lrt_tri8, so the existing SVE leaf MT (tri_block_isect_sve) is reused. Dispatch + kernel_name (bvh8q/sve). Measured (48-thread A64FX, mandelbulb 127k, all verify 100%): bvh8q scalar primary 12.8 incoherent 9.7 Mray/s bvh8q SVE primary 31.6 incoherent 22.4 Mray/s (~2.5x over scalar) bvh8q/sve lands ~= bvh8/sve (33.6/22.8) at half the node bytes. The fp8/q4 node formats (qnode != 0) remain AVX2-only. Correctness: bvh8q/sve matches the bvh8/scalar oracle 100% (conservative quantized bounds -> identical hits); full tests/test_lightrt_c_tri PASSES. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add a 16-ray-wide SVE packet (tri_intersect16_sve): 16 coherent rays traverse the BVH together, tested ray-parallel against one box / one triangle per step, amortizing node + leaf fetches over the packet. Walks any plain-triangle layout (BVH4/8/8q/16) via tri_node_load (which now also handles BVH16). Closest-hit is order-independent, so a child is pushed if any active ray's interval enters it within that ray's current best_t. Routes lrt_tri_intersect1N(COHERENT) when built with -DLRT_TRI_SVE_PACKET. Measured (48-thread A64FX, mandelbulb 127k, 100% match vs the fp64 oracle): BVH4 primary 30.6 -> 120.7 Mray/s (~3.9x) BVH8 primary 32.3 -> 77.4 Mray/s (~2.4x) BVH4 wins under the packet (4 boxes/node vs 8 — coherent rays don't need the wider node). Off by default: the packet's FMA + traversal order is not bit-identical to the single-ray kernel (ULP-level t and tie-break prim_id differences — still oracle-correct within tolerance), and the default intersect1N keeps the bit-exact Ray4 path so the batch==single unit-test contract holds. The benchmark build (scripts/build_a64fx.sh) enables -DLRT_TRI_SVE_PACKET. Any-hit (occluded1N) stays per-ray (lockstep defeats per-ray early-out). Default unit suite (packet off) PASSES bit-exactly; the packet-on build verifies 100% vs the fp64 oracle on coherent primary rays. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

syoyo and others added 8 commits June 30, 2026 08:18

syoyo merged commit 2d470c6 into main Jul 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(c-tri): A64FX ARM64+SVE backend (NEON BVH4, SVE BVH8, fp64, SDOT bench)#4

feat(c-tri): A64FX ARM64+SVE backend (NEON BVH4, SVE BVH8, fp64, SDOT bench)#4
syoyo merged 8 commits into
mainfrom
a64fx-sve-neon-kernels

syoyo commented Jun 29, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

syoyo commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

A64FX (ARM64 + SVE 1.0) backend for the C11 ray-tracing kernel

What's new

Verification (1 A64FX core, mandelbulb 88k tris)

Full-node scaling (48 threads, mandelbulb 127,752 tris, 8M rays)

Other primitive types on A64FX (48 threads)

SVE 16-ray coherent packet — biggest coherent win (opt-in)

SVE BVH8Q — quantized-node layout (positive port)

BVH16 — fill all 16 SVE lanes (opt-in, honest negative)

fp64 Bézier/NURBS surface intersection (HPC precision)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

syoyo commented Jun 29, 2026 •

edited

Loading