IxVM kernel: cut Aiur FFT cost on reduction- and check-heavy shards by arthurpaulino · Pull Request #438 · argumentcomputer/ix

arthurpaulino · 2026-06-10T11:31:52Z

Summary

Eight commits across six optimization areas in the Aiur IxVM kernel, each
targeting a specific hotspot found by profiling per-function FFT cost on real
shards of a compiled Init environment. No semantics change — every shard still
checks without error, and the full lake test -- --ignored ixvm suite passes
(297 tests) at every step.

Cumulative effect on a reduction-heavy shard (Init shard 9, one owned constant,
deep reduction): 45.5G → 16.6G total FFT (−64%). The structural/keyed-lookup
areas (1–5) account for 45.5G → 37.8G; the Nat-reduction area (6) takes it the
rest of the way by collapsing two symbolic-arithmetic blowups from O(n)/O(n²) to
O(1). That same area makes a previously intractable proof class — the UTF-8 codec
lemma ByteArray.utf8DecodeChar?_utf8EncodeChar_append, whose checking spawns
deep Nat.add x 0xC0 succ-towers and c >>> 6/12/18 division-towers — reduce in
constant rather than tower-depth cost.

Method

Built a .ixe of the Init closure, profiled and sharded it
(ix profile → ix shard --shards 64), then ran ix check per shard or per
constant and read the per-function FFT table. Shards 10 (check-heavy: lookups,
address compares, deserialize) and 9 (reduction-heavy: whnf, substitution, spine
collection) exercised different profiles; small hand-built reproducers
(x + D = …, x >>> D = …) isolated and scaled the symbolic-arithmetic blowups.

Aiur cost model behind the changes: FFT ≈ per-function width × row height,
charged at the function's max-arm width on every row. Levers: (a) factor a
wide/rare arm into its own circuit, (b) cut row count via better memoization,
(c) replace linear scans with keyed lookups, (d) keep symbolic primitives compact
so they never materialize a tower.

Areas

1. Cheaper Expr.Share / universe lookups — commit 3774860.
The sharing-table and universe-array lookups walked the list with
list_lookup_u64 (per-step U64 predecessor). Switched to a field index and, for
Expr.Share, to list_drop — which returns the sublist pointer, so repeated
lookups dedup in the memo cache. Shard 10: total FFT 3.89G → 2.11G; the merged
pointer-list walk fell from 1.93G to 5.8M.

2. Hot/cold split of address_eq — commit 69b6936.
The primitive-dispatch gauntlet compares each Const head against ~50 hardcoded
addresses, making address_eq (a full 32-limb compare, width 109) the hottest
circuit. Almost every comparison differs at limb 0. Split it: address_eq
compares limb 0 only and delegates the rest to address_eq_tail, whose height is
just the rare limb-0 matches. (A nested match in the same function does nothing —
Aiur charges full width per row regardless of arm; the split must be a function
boundary.) Shard 10: width 109 → 80, 509.8M → 374.1M; total 2.106G → 1.973G.

3. Hot/cold split of whnf_with_spine — commit 0e30d78.
On reduction-heavy consts this is the single hottest circuit (width 89 × 3.36M
rows), its width set by two arms a minority of rows reach: Const-head and
Proj-head dispatch. Factored both into whnf_const_head / whnf_proj_head. The
Proj arm is the jackpot — 3.4K rows but nearly half the width. Shard 9: width
89 → 34, 6.48G → 2.48G.

4. Non-tail, single-definition collect_spine — commit 5b7fabc.
collect_spine peeled the App chain with a tail-recursive accumulator. Tail
recursion buys nothing in Aiur (stack depth is free), and the accumulator landed
in the memo key, blocking reuse — 0 cache hits across 5.38M rows. Rewrote it
non-tail, keyed on the expression alone (2.19M hits now), and merged the two
verbatim copies into one definition in kernelTypes. Shard 9: 2.65G → 0.96G.

5. Keyed skip-set for sharded check — commit a52967f.
The largest address_eq source on a sharded check was addr_in_list, an O(N)
linear scan of the assumption-leaf list per checked const (1.81M calls on shard
9). Replaced with an RBTreeMap skip-set keyed on the first 4 address bytes:
build once, membership = one rbtree lookup + one confirming address_eq. Shard 9:
address_eq 2.07M rows / 3.48G → 274K rows / 0.40G; total 41.7G → 37.8G.

6. Compact symbolic Nat reduction — commits dbc4177, 5dcab7f, d50c9a6.
Nat primitives applied to (symbolic base, literal) delta-unfolded into towers
the kernel then re-walked: Nat.add x 0xC0 → succ^192(x), and Nat.div x 2
(what x >>> k unfolds to, k times) → an exploding division algorithm. Three
coordinated commits keep them compact:

dbc4177: try_nat_linear_rec reduces Nat.rec base (succ-step) (Lit n) with
a symbolic base directly to Nat.add base (Lit n) instead of n iota steps.
5dcab7f: whnf (try_nat_offset_stuck) leaves Nat.add base (Lit n) stuck as
a compact offset; def-eq (nat_offset_of + try_def_eq_nat) decomposes each
side to base + KLimbs and decides base ≟ base on equal offsets.
d50c9a6: generalizes the stuck rule to Nat.div/Nat.mod base (Lit k)
(k ≥ 2), so symbolic >>> reduces to nested-stuck-div in O(1).

Benchmarks: x + D = Nat.rec x succ-step D goes O(D) → depth-independent
(328M → 28.5M at D=512, −91%); x >>> 18 = (x >>> 9) >>> 9 goes 15.65G → 455M
(−97%), matching x >>> 6. All magnitudes stay KLimbs (no Goldilocks
collapse); only de Bruijn / top-position indices are G.

Soundness

Areas 1–4 are structural/refactor rewrites with identical results (each shard
still checks).
Area 2: address_eq returns the same value as a full 32-limb compare — one
differing limb proves inequality.
Area 5: key collisions overwrite, sound because the only effect is a missed
skip (a frontier const gets rechecked, never wrongly trusted); the confirming
address_eq rules out false positives; the 4-byte key makes collisions
negligible regardless.
Area 6: keeping Nat.add/div/mod symbolic literal stuck is the correct normal
form (these are irreducible for a symbolic base). try_def_eq_nat returns
eq = 1 only when offsets are equal and bases are def-eq (+ k is injective);
otherwise it falls back to the generic path. The full ixvm suite passes (297
tests), BAD negative cases still get rejected (no false accept), and
recursor-on-symbolic-offset reduces correctly.

Expr.Share resolution and universe lookups walked the sharing/univ lists with `list_lookup_u64` (per-step `u64_is_zero` + `relaxed_u64_pred`). Switch to: - universe lookups: `list_lookup(univs, flatten_u64(idx))` — cheap per-step field subtraction instead of the U64 predecessor. - Expr.Share (Convert convert_expr + Ingress deref_share): `let ListNode.Cons(e, _) = load(list_drop(sharing, flatten_u64(idx)));`. list_drop returns the sublist *pointer*; repeated Share lookups collapse to shared memo entries, so the walk is near-free. Measured on `lake exe ix check --ixe init.ixe --ixes init.ixes --shard 10` (Init, owned=3, closure=676): total FFT 3.89G -> 2.11G. The merged pointer-list walk drops from 1.93G (list_lookup) to 5.8M (list_drop, 0.27%). (Measurement run had in-circuit blake3 verification disabled — not committed.)

The primitive-dispatch gauntlet in whnf compares each Const head against ~50 known primitive addresses, so address_eq is the single hottest circuit. The full 32-limb compare ran at width 109 on every call. Split it: `address_eq` now compares limb 0 only and rejects (the common case — a single differing limb proves inequality), delegating the remaining 31-limb compare to a separate `address_eq_tail`. Because Aiur charges a function's full width on every one of its rows, the cold path must be a separate function (a nested match in address_eq changes nothing — measured identical); as its own circuit, address_eq_tail's height is only the rare limb-0-match rows. Measured on `lake exe ix check --ixe init.ixe --ixes init.ixes --shard 10`: - address_eq: width 109 -> 80, FFT 509.8M -> 374.1M - address_eq_tail (new): width 108, height 1683, FFT 1.9M - total shard FFT: 2.106G -> 1.973G Limb-0 is the sweet spot: hot height (259991) dominates tail height (1683), so pulling a second limb forward (N=2) widens the hot circuit by more than it saves in the tail (measured 1.976G, worse). Result is identical to a full 32-limb compare (sound). (Measurement run had in-circuit blake3 verification disabled — not committed.)

whnf_with_spine is the single hottest circuit on reduction-heavy consts: width 89 charged on every one of its 3.36M rows. The width was set by two arms that only a minority of rows reach — the Const-head dispatch (delta/iota/quot/prim gauntlet) and the Proj-head reduction. Factor both into their own functions (`whnf_const_head`, `whnf_proj_head`). Because Aiur charges a function's full width on every row, the ~76% of steps that are App/Lam/Let now run at the narrow residual width instead of 89; the wide dispatch only taxes its own (smaller) row count. Measured on `lake exe ix check --ixe init.ixe --ixes init.ixes --shard 9` (1 reduction-heavy owned const): - whnf_with_spine: width 89 -> 34, FFT 6.48G -> 2.48G - whnf_const_head (new): width 77, height 821490, FFT 1.24G - whnf_proj_head (new): width 56, height 3452, FFT 0.002G - total shard FFT: 45.54G -> 42.78G (-6.1%) The Proj arm is rare (3452 rows) but was nearly half the width; extracting it dropped whnf_with_spine 65 -> 34 for ~zero relocated cost. Pure refactor.

Replace the tail-recursive `collect_spine_go(e, acc)` (and its verbatim copy `collect_spine_simple_go`) with one non-tail `collect_spine(e)` in the kernelTypes block. Repoint all callers (whnf, primitive, inductive_check, def_eq); delete both old copies. Keyed on `e` alone, it now dedups shared sub-spines (0 -> 2.19M cache hits). Measured on `lake exe ix check --ixe init.ixe --ixes init.ixes --shard 9`: - collect_spine: 5.38M rows / 2.65G -> 2.17M rows / 0.96G - list_snoc.Ptr: +0.96G (new) - total shard FFT: 42.78G -> 41.67G

check_all_skipping tested each closure const against the assumption-leaf list via addr_in_list, an O(N) linear address_eq scan — the single largest address_eq source on sharded checks (1.81M calls on Init shard 9, more than all primitive dispatch combined). Build an RBTreeMap skip-set once (keyed on the first 4 address bytes), then test membership with one rbtree lookup + one confirming address_eq. Key collisions overwrite: sound because a missed skip only rechecks a frontier const, and the confirming address_eq rules out false positives. Measured on `lake exe ix check --ixe init.ixe --ixes init.ixes --shard 9`: - address_eq: 2.07M rows / 3.48G -> 274K rows / 0.40G - rbtree build + lookups: +~0.46G - total shard FFT: 41.67G -> 37.84G

try_nat_linear_rec only folded `Nat.rec base (λ_ ih => succ ih) (Lit n)` when the base was also a literal; a symbolic base declined and fell through to n iota steps that materialize succ^n(base). Reduce the symbolic-base case directly to the offset primitive `Nat.add base (Lit n)` instead. This keeps the value in the same compact `base + n` form a literal already has, so def-eq comparing it converges instead of descending n unary succ layers (the UTF-8 `x + 0xC0` reduction class). Offset magnitude stays KLimbs; only the nat_add top-position index is a (small) G. Measured on a `x + D = Nat.rec x succ-step D` benchmark: the unary def-eq descent collapses (nat_succ_of D -> ~constant), ~-15% total FFT at D=512. The remaining cost is the in-whnf chain reduction (substitution), unchanged here.

Reducing `Nat.add x (Lit n)` (symbolic x) delta-unfolded into a succ^n(x) tower, and the nat_succ dispatch then whnf-cascaded the whole chain — O(n) substitution per reduction, the dominant cost on Nat-arithmetic proofs (the UTF-8 codec class). Two coordinated changes: - whnf (try_nat_offset_stuck): leave `Nat.add base (Lit n)` (non-literal base, nonzero n) stuck as a compact offset instead of unfolding it. - def-eq (nat_offset_of + try_def_eq_nat): decompose each side to base + KLimbs offset (reading `Nat.add base (Lit m)` in O(1), peeling the few succs whnf exposes) and decide `base_a ≟ base_b` only when offsets are equal; otherwise fall back. All magnitudes stay KLimbs (no Goldilocks collapse). Measured on `x + D = Nat.rec x succ-step D` (depths 64..512): total FFT goes from O(D) (328M at D=512) to a depth-independent ~28.5M (-91% at D=512). Full `lake test -- --ignored ixvm` passes (297 tests; BAD cases still rejected); a recursor-on-symbolic-offset completeness test reduces correctly.

`Nat.div x (Lit k)` / `Nat.mod x (Lit k)` with a symbolic base delta-unfolded the division algorithm and expanded hugely, even though the result is irreducible for a symbolic base. `Nat.shiftRight x k` unfolds to k nested `Nat.div _ 2`, so a symbolic `>>>` materialized a deep, super-linear tower — the dominant cost in the UTF-8 codec's `c >>> 6/12/18` encoding. Generalize `try_nat_offset_stuck` (which already keeps `Nat.add base (Lit n)` compact) to also leave `Nat.div`/`Nat.mod base (Lit k)` (k ≥ 2) stuck. Thresholds keep `x/1 = x` and `x/0 = 0` reducing. Both `x >>> 18` and `(x >>> 9) >>> 9` then reduce to the same nested-stuck-div form, so def-eq stays structural and the identity still holds. Measured on `x >>> 18 = (x >>> 9) >>> 9`: total FFT 15.65G -> 455M (-97%), now depth-independent (matches `x >>> 6`). No regression on the Nat.add reproducers; full `lake test -- --ignored ixvm` passes (297 tests).

arthurpaulino added 5 commits June 9, 2026 11:49

arthurpaulino requested a review from gabriel-barrett June 10, 2026 11:31

arthurpaulino added 3 commits June 10, 2026 08:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IxVM kernel: cut Aiur FFT cost on reduction- and check-heavy shards#438

IxVM kernel: cut Aiur FFT cost on reduction- and check-heavy shards#438
arthurpaulino wants to merge 8 commits into
mainfrom
ap/optimize

arthurpaulino commented Jun 10, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

arthurpaulino commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Method

Areas

Soundness

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

arthurpaulino commented Jun 10, 2026 •

edited

Loading