Skip to content

feat(c-tri): A64FX ARM64+SVE backend (NEON BVH4, SVE BVH8, fp64, SDOT bench)#4

Merged
syoyo merged 8 commits into
mainfrom
a64fx-sve-neon-kernels
Jul 1, 2026
Merged

feat(c-tri): A64FX ARM64+SVE backend (NEON BVH4, SVE BVH8, fp64, SDOT bench)#4
syoyo merged 8 commits into
mainfrom
a64fx-sve-neon-kernels

Conversation

@syoyo

@syoyo syoyo commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

A64FX (ARM64 + SVE 1.0) backend for the C11 ray-tracing kernel

Adds ARM SIMD paths to lightrt_c_tri.c / lightrt_c.c alongside the existing
scalar/SSE4/AVX2 ones, compiled with the Fujitsu compiler in clang mode
(fcc -Nclang -march=armv8.2-a+sve). Compile-time dispatch via
LRT_TRI_HAS_NEON / LRT_TRI_HAS_SVE; the scalar path stays the correctness
oracle and every parity constant is shared. A64FX SVE is fixed 512-bit
(16 fp32 / 8 fp64 / 64 int8 lanes).

What's new

  • NEON BVH4 (128-bit fp32) — 1:1 mirror of the SSE4 path (slab, 4-wide
    Möller–Trumbore, perm[octant] push); movemask emulated with
    vandq{1,2,4,8}+vaddvq, any-hit via vmaxvq. kernel_name = bvh4/neon.
  • SVE BVH8 (A64FX 512-bit fp32) — mirror of AVX2, one node/leaf per vector
    on the low 8-of-16 lanes (svwhilelt_b32); hit list materialized to
    float[8] + the scalar insertion-sort to reproduce AVX2 ordering exactly.
    kernel_name = bvh8/sve. AUTO layout stays BVH4; pass LRT_TRI_LAYOUT_BVH8
    for the SVE path. (BVH16 / lrt_tri16 is a documented follow-up.)
  • fp64 high-precision (HPC visualization) — new lrt_tri_intersect1_hp
    (lrt_ray_hp / lrt_hit_hp) traverses the fp32 BVH but runs the leaf MT in
    SVE 8-wide fp64 (svfloat64), scalar double elsewhere. The fp64
    custom-geometry callback API (lightrt_c.c) gains a NEON node slab.
  • int8/int16 SDOT leaf microbenchmark (benchmark_c/bench_a64fx_sdot.c) —
    genuinely uses svdot_s32 / svdot_s64. Honest verdict (measured on
    A64FX): neither beats 16-wide fp32 SVE
    (int8 ~1.03×, int16 ~0.78×, ~3% mean
    accuracy loss), mirroring the repo's HIP-WMMA finding; kept standalone, not
    wired into the BVH. LRT_TRI_FORCE_SCALAR builds a scalar baseline.
  • Build — Makefile + top-level/benchmark_c CMake gain an aarch64
    -march=armv8.2-a+sve/fcc branch; benchmark_c/scripts/build_a64fx.sh and
    run_a64fx.sh build & run the SIMD / scalar / SDOT binaries.
  • Test portability for fcc clang mode — hoist GNU nested functions to file
    scope, fix a label-needs-statement, an atomic-load-on-const, and guard the
    AVX2-only q4/fp8 layouts on non-x86.

Verification (1 A64FX core, mandelbulb 88k tris)

All hits agree 100% with the fp64 oracle; full tests/test_lightrt_c_tri.c
suite PASS with NEON+SVE active.

kernel scalar SIMD speedup
BVH4 primary 1.88 3.66 Mray/s 1.95× (NEON)
BVH8 primary 1.43 3.33 Mray/s 2.33× (SVE)
BVH8 incoherent / shadow 0.29 / 0.37 0.65 / 0.84 2.24× / 2.30×

Full-node scaling (48 threads, mandelbulb 127,752 tris, 8M rays)

SIMD hits all verify 100% vs the fp64 oracle; speedups hold under full-node
memory-bandwidth contention (SVE BVH8 keeps the larger multiplier on incoherent
and shadow rays).

kernel workload scalar SIMD speedup
BVH4 (NEON) primary 18.82 39.25 Mray/s 2.09×
BVH4 (NEON) incoherent / shadow 13.45 / 20.28 21.31 / 26.55 1.58× / 1.31×
BVH8 (SVE) primary 13.89 34.50 Mray/s 2.48×
BVH8 (SVE) incoherent / shadow 11.45 / 11.66 23.41 / 27.66 2.05× / 2.37×

Other primitive types on A64FX (48 threads)

The BVH4 prim leaves are now NEON-vectorized via a minimal SSE→NEON shim
(lightrt_sse2neon_min.h): on ARM the kernel sets LRT_TRI_HAS_SSE4=1 and runs
the existing 4-wide SSE leaves (triangle, curve, point, sphere, quad, qtri, patch
traversal) through the shim — one shim instead of hand-duplicating every leaf.
SVE BVH8 + fp64 stay native.

Curves — NEON vs the same kernel built scalar (furball, 48 threads, primary Mray/s):

curve type scalar NEON speedup
flat ribbon 0.90 2.05 2.3×
cubic Bézier 1.11 3.44 3.1×
sphere (point) 15.7 22.4 1.4×
round-linear 2.03 1.46 kept scalar (NEON CSG leaf slower)

Round-linear's leaf is a blendv/movemask-heavy Embree CSG port; it's measured
slower as NEON than scalar on A64FX, so TRI_PRIM_RLCURVE is routed to scalar.

Parametric surfaces (subd) — direct ray-patch, no tessellation:

surface primary incoherent
bicubic Bézier (1600 patches) 1.86 1.56
NURBS (25×25 bicubic net) 1.42 1.31

These stay scalar by default. The patch eval is SVE-vectorizable — a
16-lane Bernstein-weighted sum over all 16 control points
(tri_bezpatch_eval_sve/tri_rbezpatch_eval_sve, svld1rq+svtbl weights,
per-axis svaddv) — and it's implemented behind -DLRT_TRI_SVE_PATCH_EVAL, but
measured neutral-to-slightly-slower on A64FX (bezpatch ~0.98×, NURBS ~0.97×):
the 9–12 horizontal svaddv reductions per eval cost ~as much as the 18 scalar
1D cubic evals, and the eval isn't the dominant cost (adaptive subdivision +
AABB cull + Newton control flow are). So it ships off, alongside the SDOT
honest-verdict result. Both configs verify 100% vs the oracle.

Dense-grid volume raymarch (N³ floats, trilinear + front-to-back, Msamples/s)
— the memory-bound case: incoherent random-gather throughput drops ~2.6× as the
grid spills from L2-resident to HBM2-resident, while coherent marching stays
flat (trilinear cache-line reuse):

grid working set primary (coherent) incoherent (gather)
64³ 1 MB (L2) 108 332
128³ 8 MB 105 307
256³ 64 MB (HBM) 98 202
512³ 512 MB (HBM) 85 130

SVE 16-ray coherent packet — biggest coherent win (opt-in)

tri_intersect16_sve routes lrt_tri_intersect1N(COHERENT) through a 16 rays
vs one box/triangle
traversal (ray-parallel SoA, full 16-lane SVE), walking any
plain-triangle layout via tri_node_load. 48-thread A64FX, mandelbulb 127k,
100% fp64-oracle match:

coherent primary per-ray 16-ray packet speedup
BVH4 30.6 120.7 Mray/s 3.9×
BVH8 32.3 77.4 Mray/s 2.4×

BVH4 wins under the packet (4 boxes/node vs 8). Off by default (-DLRT_TRI_SVE_PACKET):
the packet's FMA/traversal order isn't bit-identical to the single-ray kernel
(ULP-t / tie prim_id), so the default intersect1N keeps the bit-exact Ray4 path
(the batch==single unit test stays green); the benchmark build enables it.

SVE BVH8Q — quantized-node layout (positive port)

The memory-efficient 8-bit-quantized-node layout (LRT_TRI_LAYOUT_BVH8Q,
128-byte nodes vs 256) ran scalar on A64FX; the decode+slab is now SVE
(svld1ub_u32+svcvt_f32_u32, svmla). ~2.5× over scalar and ≈ BVH8/SVE
at half the node bytes (48-thread A64FX, all verify 100%):

scalar SVE speedup
BVH8Q primary 12.8 31.6 2.47×
BVH8Q incoherent 9.7 22.4 2.32×

(For reference BVH8/SVE is 33.6 / 22.8.) fp8/q4 node formats stay AVX2-only.

BVH16 — fill all 16 SVE lanes (opt-in, honest negative)

LRT_TRI_LAYOUT_BVH16 adds a 16-wide node (512 B) + 16-wide leaf so the SVE
slab + Möller-Trumbore use all 16 fp32 lanes with no predicate waste. 100%
correct vs the BVH8/scalar oracle, but ~0.55× of BVH8 (48-thread A64FX,
mandelbulb 127k):

layout primary incoherent shadow node
BVH4 30.6 21.9 30.2 128 B
BVH8 32.3 20.7 30.1 256 B
BVH16 17.8 13.8 16.5 512 B

The 512-byte nodes (8 cache lines) cost too much bandwidth and wider nodes cull
worse; the lane utilization doesn't pay for it. BVH8's 8-of-16 stays the SVE
sweet spot
; BVH16 ships opt-in (SVE-only; no serialize/refit/mmap).

fp64 Bézier/NURBS surface intersection (HPC precision)

lrt_tri_intersect1_hp now covers bicubic Bézier + NURBS: the fp32
adaptive-subdiv intersector finds the patch + (u,v) seed, then an fp64 Newton
on (u,v,t) (control points → double, fp64 ray) refines to full fp64 t/u/v.
100% hit agreement with the fp32 path (bezpatch + NURBS, 200k rays each);
fp64 t carries the extra precision. The fp64 eval is scalar double — an SVE
fp64 eval doesn't help (the fp32 SVE Bernstein eval was already neutral, and
fp64 halves the lane count: 16 CPs over two 8-lane vectors). Precision, not
throughput, is the point.

🤖 Generated with Claude Code

syoyo and others added 8 commits June 30, 2026 08:18
… bench)

Add ARM SIMD paths to the C11 kernel alongside scalar/SSE4/AVX2, compiled
with Fujitsu fcc (-Nclang -march=armv8.2-a+sve). Compile-time dispatch via
LRT_TRI_HAS_NEON / LRT_TRI_HAS_SVE; the scalar path stays the oracle and all
parity constants are shared.

- NEON BVH4 (128-bit fp32): 1:1 mirror of the SSE4 path (slab, 4-wide
  Moller-Trumbore, perm[octant] push). movemask via vandq{1,2,4,8}+vaddvq,
  any-hit via vmaxvq. kernel_name "bvh4/neon".
- SVE BVH8 (A64FX 512-bit fp32): mirror of AVX2, 8-of-16 lanes
  (svwhilelt_b32); hit list materialized to float[8] + scalar insertion-sort
  to reproduce AVX2 ordering. kernel_name "bvh8/sve". AUTO stays BVH4; pass
  LRT_TRI_LAYOUT_BVH8 for SVE.
- fp64 HPC path: new lrt_tri_intersect1_hp (lrt_ray_hp/lrt_hit_hp) traverses
  the fp32 BVH with an SVE 8-wide fp64 (svfloat64) leaf; the fp64
  custom-geometry callback API (lightrt_c.c) gains a NEON node slab.
- int8/int16 SDOT leaf microbenchmark (benchmark_c/bench_a64fx_sdot.c):
  genuinely uses svdot_s32/svdot_s64. Honest verdict (measured on A64FX):
  neither beats 16-wide fp32 SVE (int8 ~1.03x, int16 ~0.78x), like the HIP
  WMMA result; kept standalone, not wired into the BVH.
- LRT_TRI_FORCE_SCALAR knob to build a scalar baseline.
- Build: Makefile + top-level/benchmark CMake gain an aarch64 +sve/fcc branch;
  scripts/build_a64fx.sh + run_a64fx.sh build & run SIMD/scalar/SDOT.
- Test portability for fcc clang mode: hoist GNU nested functions, fix a
  label-needs-statement, an atomic-on-const load, and guard AVX2-only q4/fp8
  layouts on non-x86.

Verified on A64FX (1 core, mandelbulb 88k tris): NEON BVH4 ~1.95x and SVE
BVH8 ~2.33x vs the same kernel built scalar; all hits agree 100% with the
fp64 oracle; full test_lightrt_c_tri suite PASS.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…benches

Extend the A64FX benchmark suite to cover the non-triangle prim types, which
currently run the scalar path on ARM (NEON/SVE leaves are triangle-only):

- bench_a64fx_subd.c: parametric "subd" surfaces — a bicubic Bézier patch grid
  and a NURBS surface, intersected directly (no tessellation), traced over a
  pthread pool. Reuses rays.c. ~1.4-1.9 Mray/s at 48 threads (scalar).
- bench_a64fx_volume.c: dense N^3 float volume raymarch (trilinear + front-to-
  back compositing) — the memory-bandwidth-bound workload. Reports Mrays/s,
  Msamples/s and effective GB/s across grid sizes; incoherent gather throughput
  drops ~2.6x (332->130 Msamp/s) as the grid spills from L2 to HBM2 while
  coherent stays flat (cache-line reuse).
- build_a64fx.sh / run_a64fx.sh now build and run both, plus the existing
  curve/sphere/SDF backends (hair_bench covers round/flat/bezier curves).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Curves and the other BVH4 primitives previously fell back to scalar on A64FX
(the 4-wide leaves are written in SSE intrinsics, x86-only). Add a minimal
SSE→NEON compatibility shim (lightrt_sse2neon_min.h, only the ~39 _mm_*
intrinsics this kernel uses) and, on ARM NEON, set LRT_TRI_HAS_SSE4=1 so the
existing 4-wide SSE BVH4 code (triangle, curve, point, sphere, quad, qtri, and
the parametric-patch traversal drivers) compiles and runs as NEON 4-wide. One
shim vectorizes every BVH4 leaf instead of hand-duplicating them.

- Replaces the hand-written NEON triangle BVH4 (removed) with the shimmed SSE
  path; SVE BVH8 + fp64 stay native (guards decoupled from SSE4).
- Shim maps comparisons to float-domain masks, _mm_movemask_ps -> vshrq+vaddvq,
  _mm_blendv_ps -> vbslq (kernel only feeds compare masks); LRT_TRI_NEON_SHIM
  marks the mode and fixes kernel_name to report bvh4/neon, bvh8/sve.
- Measured (48-thread A64FX, vs same kernel built scalar): flat curve ~2.3x,
  cubic-Bézier curve ~3.1x, sphere ~1.4x. Round-linear's blendv/movemask-heavy
  Embree CSG leaf is SLOWER as NEON, so TRI_PRIM_RLCURVE is routed to scalar.
- Parametric surfaces (bilinear/bezpatch/NURBS) remain scalar: per-patch Newton
  + adaptive subdivision is divergent and not 4-wide-SIMD-friendly; intra-patch
  SVE eval is noted as future work.
- Full tests/test_lightrt_c_tri.c oracle suite PASSES on A64FX (all prim_kinds);
  triangle bench unchanged and still verifies 100% vs the fp64 oracle.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t-in)

Add an SVE evaluation of the bicubic Bézier / NURBS patch leaf: instead of 18
scalar 1D cubic evals, weight all 16 control points at once on A64FX's 16-lane
fp32 vector. The Bernstein basis for (u,v) gives a 16-lane weight
w[k]=Bu[k%4]*Bv[k>>2] (k=v*4+u, matching the cp order) built with svld1rq
(quad-replicate Bu) + svtbl (expand Bv) + svindex; the surface point and the
du/dv partials are per-axis svaddv(w*cp_axis) horizontal sums. Control points
are transposed to SoA once per patch and reused across all Newton iterations.
NURBS adds the 4th homogeneous component + perspective divide. Both
tri_bezpatch_eval_sve and tri_rbezpatch_eval_sve are wired through the Newton +
seed via TRI_BEZ_EVAL / TRI_RBEZ_EVAL.

Honest verdict (measured on A64FX, 48 threads): NEUTRAL-to-slightly-slower than
scalar — bezpatch ~0.98x, NURBS ~0.97x. The 9-12 horizontal svaddv reductions
per eval cost about as much as the 18 scalar cubic evals, and the eval is not
the dominant cost (adaptive subdivision + AABB cull + Newton control flow are),
mirroring the SDOT/WMMA results. So it ships OFF by default behind
LRT_TRI_SVE_PATCH_EVAL; the scalar patch eval stays the default (no regression).

Both configs verify 100% against the brute-force oracle (tests/test_lightrt_c_tri
RESULT: PASS for default scalar and -DLRT_TRI_SVE_PATCH_EVAL).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rdict

Add a 16-wide node (lrt_bvh16_node, 512 B) + 16-wide leaf (lrt_tri16) layout
(LRT_TRI_LAYOUT_BVH16) so the A64FX SVE slab + Moller-Trumbore fill all 16 fp32
lanes with no predicate waste (tri_intersect_bvh16_sve / _occluded, svptrue_b32).
Collapse generalized to width 16 (set[16], width==16 node branch, block_shift 4);
build allocates nodes16 and rejects layout 16 without SVE (no scalar nodes16
traversal). Serialization/refit/mmap reject BVH16. New c11-bvh16 bench backend.

Honest verdict (48-thread A64FX, mandelbulb 127752 tris, all verify 100%):
  BVH4  primary 30.6  incoherent 21.9  shadow 30.2 Mray/s  (128 B nodes)
  BVH8  primary 32.3  incoherent 20.7  shadow 30.1 Mray/s  (256 B nodes)
  BVH16 primary 17.8  incoherent 13.8  shadow 16.5 Mray/s  (512 B nodes)
BVH16 is ~0.55x of BVH8 despite filling all 16 lanes: the 512-byte nodes (8
cache lines) cost too much bandwidth and the wider nodes cull worse, outweighing
the lane utilization (build is faster, memory slightly lower, but the trace
tanks). BVH8's 8-of-16 stays the SVE sweet spot; BVH16 ships opt-in as a
measured negative result, like the SDOT / patch-eval experiments.

Correctness: BVH16 matches the BVH8/scalar oracle 100% (max_rel_t 0); full
tests/test_lightrt_c_tri suite still PASSES.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extend lrt_tri_intersect1_hp to bicubic Bézier and NURBS (incl. trimmed)
surfaces for HPC visualization. The fp32 adaptive-subdivision intersector does
the coarse find (which patch + an approximate (u,v) seed); an fp64 Newton on
(u,v,t) — using the patch control points promoted to double and the fp64 ray —
then refines to full fp64 t/u/v. NURBS (u,v) is mapped global<->local via the
per-patch domain (shade_dom); control points are reconstructed from the existing
per-prim shade_cps, so no extra storage. Reusing the fp32 find avoids
duplicating the subdivision/cull machinery.

The fp64 eval (tri_bezpatch_eval_hp / tri_rbezpatch_eval_hp) is scalar double:
the fp32 SVE Bernstein eval was already neutral (svaddv-bound), and fp64 halves
the SVE lane count (16 control points span two 8-lane vectors), so SVE does not
help the fp64 eval — precision, not throughput, is the point of this path. The
reasoning is documented next to the measured fp32 SVE-eval result.

Verified: 100% hit agreement (prim_id + u/v) with the fp32 patch path on a
bezpatch grid and a NURBS surface (200k rays each); fp64 t carries the extra
precision (max_rel_t ~5-8e-5 vs the fp32 t). Full tests/test_lightrt_c_tri
suite still PASSES.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The memory-efficient 8-bit-quantized-node layout (LRT_TRI_LAYOUT_BVH8Q, 128-byte
nodes vs 256) ran the scalar path on A64FX — the decode+slab was AVX2-only. Add
an SVE decode+slab (tri_bvh8q_slab_sve / tri_intersect_bvh8q_sve / _occluded):
8-bit child bounds are loaded + widened with svld1ub_u32 + svcvt_f32_u32 and the
plane is q*scale*invd + (org_node-org_ray)*invd via svmla (8-of-16 lanes),
mirroring tri_bvh8q_slab_avx. Leaves are full-precision lrt_tri8, so the existing
SVE leaf MT (tri_block_isect_sve) is reused. Dispatch + kernel_name (bvh8q/sve).

Measured (48-thread A64FX, mandelbulb 127k, all verify 100%):
  bvh8q  scalar  primary 12.8  incoherent  9.7 Mray/s
  bvh8q  SVE     primary 31.6  incoherent 22.4 Mray/s   (~2.5x over scalar)
bvh8q/sve lands ~= bvh8/sve (33.6/22.8) at half the node bytes. The fp8/q4 node
formats (qnode != 0) remain AVX2-only.

Correctness: bvh8q/sve matches the bvh8/scalar oracle 100% (conservative
quantized bounds -> identical hits); full tests/test_lightrt_c_tri PASSES.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a 16-ray-wide SVE packet (tri_intersect16_sve): 16 coherent rays traverse
the BVH together, tested ray-parallel against one box / one triangle per step,
amortizing node + leaf fetches over the packet. Walks any plain-triangle layout
(BVH4/8/8q/16) via tri_node_load (which now also handles BVH16). Closest-hit is
order-independent, so a child is pushed if any active ray's interval enters it
within that ray's current best_t. Routes lrt_tri_intersect1N(COHERENT) when
built with -DLRT_TRI_SVE_PACKET.

Measured (48-thread A64FX, mandelbulb 127k, 100% match vs the fp64 oracle):
  BVH4 primary 30.6 -> 120.7 Mray/s  (~3.9x)
  BVH8 primary 32.3 ->  77.4 Mray/s  (~2.4x)
BVH4 wins under the packet (4 boxes/node vs 8 — coherent rays don't need the
wider node).

Off by default: the packet's FMA + traversal order is not bit-identical to the
single-ray kernel (ULP-level t and tie-break prim_id differences — still
oracle-correct within tolerance), and the default intersect1N keeps the
bit-exact Ray4 path so the batch==single unit-test contract holds. The benchmark
build (scripts/build_a64fx.sh) enables -DLRT_TRI_SVE_PACKET. Any-hit (occluded1N)
stays per-ray (lockstep defeats per-ray early-out).

Default unit suite (packet off) PASSES bit-exactly; the packet-on build verifies
100% vs the fp64 oracle on coherent primary rays.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@syoyo syoyo merged commit 2d470c6 into main Jul 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant