Skip to content

IxVM kernel: cut Aiur FFT cost on reduction- and check-heavy shards#438

Open
arthurpaulino wants to merge 8 commits into
mainfrom
ap/optimize
Open

IxVM kernel: cut Aiur FFT cost on reduction- and check-heavy shards#438
arthurpaulino wants to merge 8 commits into
mainfrom
ap/optimize

Conversation

@arthurpaulino

@arthurpaulino arthurpaulino commented Jun 10, 2026

Copy link
Copy Markdown
Member

Summary

Eight commits across six optimization areas in the Aiur IxVM kernel, each
targeting a specific hotspot found by profiling per-function FFT cost on real
shards of a compiled Init environment. No semantics change — every shard still
checks without error, and the full lake test -- --ignored ixvm suite passes
(297 tests) at every step.

Cumulative effect on a reduction-heavy shard (Init shard 9, one owned constant,
deep reduction): 45.5G → 16.6G total FFT (−64%). The structural/keyed-lookup
areas (1–5) account for 45.5G → 37.8G; the Nat-reduction area (6) takes it the
rest of the way by collapsing two symbolic-arithmetic blowups from O(n)/O(n²) to
O(1). That same area makes a previously intractable proof class — the UTF-8 codec
lemma ByteArray.utf8DecodeChar?_utf8EncodeChar_append, whose checking spawns
deep Nat.add x 0xC0 succ-towers and c >>> 6/12/18 division-towers — reduce in
constant rather than tower-depth cost.

Method

Built a .ixe of the Init closure, profiled and sharded it
(ix profileix shard --shards 64), then ran ix check per shard or per
constant and read the per-function FFT table. Shards 10 (check-heavy: lookups,
address compares, deserialize) and 9 (reduction-heavy: whnf, substitution, spine
collection) exercised different profiles; small hand-built reproducers
(x + D = …, x >>> D = …) isolated and scaled the symbolic-arithmetic blowups.

Aiur cost model behind the changes: FFT ≈ per-function width × row height,
charged at the function's max-arm width on every row. Levers: (a) factor a
wide/rare arm into its own circuit, (b) cut row count via better memoization,
(c) replace linear scans with keyed lookups, (d) keep symbolic primitives compact
so they never materialize a tower.

Areas

1. Cheaper Expr.Share / universe lookups — commit 3774860.
The sharing-table and universe-array lookups walked the list with
list_lookup_u64 (per-step U64 predecessor). Switched to a field index and, for
Expr.Share, to list_drop — which returns the sublist pointer, so repeated
lookups dedup in the memo cache. Shard 10: total FFT 3.89G → 2.11G; the merged
pointer-list walk fell from 1.93G to 5.8M.

2. Hot/cold split of address_eq — commit 69b6936.
The primitive-dispatch gauntlet compares each Const head against ~50 hardcoded
addresses, making address_eq (a full 32-limb compare, width 109) the hottest
circuit. Almost every comparison differs at limb 0. Split it: address_eq
compares limb 0 only and delegates the rest to address_eq_tail, whose height is
just the rare limb-0 matches. (A nested match in the same function does nothing —
Aiur charges full width per row regardless of arm; the split must be a function
boundary.) Shard 10: width 109 → 80, 509.8M → 374.1M; total 2.106G → 1.973G.

3. Hot/cold split of whnf_with_spine — commit 0e30d78.
On reduction-heavy consts this is the single hottest circuit (width 89 × 3.36M
rows), its width set by two arms a minority of rows reach: Const-head and
Proj-head dispatch. Factored both into whnf_const_head / whnf_proj_head. The
Proj arm is the jackpot — 3.4K rows but nearly half the width. Shard 9: width
89 → 34, 6.48G → 2.48G.

4. Non-tail, single-definition collect_spine — commit 5b7fabc.
collect_spine peeled the App chain with a tail-recursive accumulator. Tail
recursion buys nothing in Aiur (stack depth is free), and the accumulator landed
in the memo key, blocking reuse — 0 cache hits across 5.38M rows. Rewrote it
non-tail, keyed on the expression alone (2.19M hits now), and merged the two
verbatim copies into one definition in kernelTypes. Shard 9: 2.65G → 0.96G.

5. Keyed skip-set for sharded check — commit a52967f.
The largest address_eq source on a sharded check was addr_in_list, an O(N)
linear scan of the assumption-leaf list per checked const (1.81M calls on shard
9). Replaced with an RBTreeMap skip-set keyed on the first 4 address bytes:
build once, membership = one rbtree lookup + one confirming address_eq. Shard 9:
address_eq 2.07M rows / 3.48G → 274K rows / 0.40G; total 41.7G → 37.8G.

6. Compact symbolic Nat reduction — commits dbc4177, 5dcab7f, d50c9a6.
Nat primitives applied to (symbolic base, literal) delta-unfolded into towers
the kernel then re-walked: Nat.add x 0xC0succ^192(x), and Nat.div x 2
(what x >>> k unfolds to, k times) → an exploding division algorithm. Three
coordinated commits keep them compact:

  • dbc4177: try_nat_linear_rec reduces Nat.rec base (succ-step) (Lit n) with
    a symbolic base directly to Nat.add base (Lit n) instead of n iota steps.
  • 5dcab7f: whnf (try_nat_offset_stuck) leaves Nat.add base (Lit n) stuck as
    a compact offset; def-eq (nat_offset_of + try_def_eq_nat) decomposes each
    side to base + KLimbs and decides base ≟ base on equal offsets.
  • d50c9a6: generalizes the stuck rule to Nat.div/Nat.mod base (Lit k)
    (k ≥ 2), so symbolic >>> reduces to nested-stuck-div in O(1).

Benchmarks: x + D = Nat.rec x succ-step D goes O(D) → depth-independent
(328M → 28.5M at D=512, −91%); x >>> 18 = (x >>> 9) >>> 9 goes 15.65G → 455M
(−97%), matching x >>> 6. All magnitudes stay KLimbs (no Goldilocks
collapse); only de Bruijn / top-position indices are G.

Soundness

  • Areas 1–4 are structural/refactor rewrites with identical results (each shard
    still checks).
  • Area 2: address_eq returns the same value as a full 32-limb compare — one
    differing limb proves inequality.
  • Area 5: key collisions overwrite, sound because the only effect is a missed
    skip
    (a frontier const gets rechecked, never wrongly trusted); the confirming
    address_eq rules out false positives; the 4-byte key makes collisions
    negligible regardless.
  • Area 6: keeping Nat.add/div/mod symbolic literal stuck is the correct normal
    form (these are irreducible for a symbolic base). try_def_eq_nat returns
    eq = 1 only when offsets are equal and bases are def-eq (+ k is injective);
    otherwise it falls back to the generic path. The full ixvm suite passes (297
    tests), BAD negative cases still get rejected (no false accept), and
    recursor-on-symbolic-offset reduces correctly.

Expr.Share resolution and universe lookups walked the sharing/univ lists
with `list_lookup_u64` (per-step `u64_is_zero` + `relaxed_u64_pred`). Switch
to:

- universe lookups: `list_lookup(univs, flatten_u64(idx))` — cheap per-step
  field subtraction instead of the U64 predecessor.
- Expr.Share (Convert convert_expr + Ingress deref_share):
  `let ListNode.Cons(e, _) = load(list_drop(sharing, flatten_u64(idx)));`.
  list_drop returns the sublist *pointer*; repeated Share lookups collapse
  to shared memo entries, so the walk is near-free.

Measured on `lake exe ix check --ixe init.ixe --ixes init.ixes --shard 10`
(Init, owned=3, closure=676): total FFT 3.89G -> 2.11G. The merged
pointer-list walk drops from 1.93G (list_lookup) to 5.8M (list_drop, 0.27%).

(Measurement run had in-circuit blake3 verification disabled — not committed.)
The primitive-dispatch gauntlet in whnf compares each Const head against ~50
known primitive addresses, so address_eq is the single hottest circuit. The
full 32-limb compare ran at width 109 on every call.

Split it: `address_eq` now compares limb 0 only and rejects (the common case —
a single differing limb proves inequality), delegating the remaining 31-limb
compare to a separate `address_eq_tail`. Because Aiur charges a function's full
width on every one of its rows, the cold path must be a separate function (a
nested match in address_eq changes nothing — measured identical); as its own
circuit, address_eq_tail's height is only the rare limb-0-match rows.

Measured on `lake exe ix check --ixe init.ixe --ixes init.ixes --shard 10`:
- address_eq: width 109 -> 80, FFT 509.8M -> 374.1M
- address_eq_tail (new): width 108, height 1683, FFT 1.9M
- total shard FFT: 2.106G -> 1.973G

Limb-0 is the sweet spot: hot height (259991) dominates tail height (1683), so
pulling a second limb forward (N=2) widens the hot circuit by more than it
saves in the tail (measured 1.976G, worse). Result is identical to a full
32-limb compare (sound).

(Measurement run had in-circuit blake3 verification disabled — not committed.)
whnf_with_spine is the single hottest circuit on reduction-heavy consts: width
89 charged on every one of its 3.36M rows. The width was set by two arms that
only a minority of rows reach — the Const-head dispatch (delta/iota/quot/prim
gauntlet) and the Proj-head reduction.

Factor both into their own functions (`whnf_const_head`, `whnf_proj_head`).
Because Aiur charges a function's full width on every row, the ~76% of steps
that are App/Lam/Let now run at the narrow residual width instead of 89; the
wide dispatch only taxes its own (smaller) row count.

Measured on `lake exe ix check --ixe init.ixe --ixes init.ixes --shard 9`
(1 reduction-heavy owned const):
- whnf_with_spine: width 89 -> 34, FFT 6.48G -> 2.48G
- whnf_const_head (new): width 77, height 821490, FFT 1.24G
- whnf_proj_head (new): width 56, height 3452, FFT 0.002G
- total shard FFT: 45.54G -> 42.78G (-6.1%)

The Proj arm is rare (3452 rows) but was nearly half the width; extracting it
dropped whnf_with_spine 65 -> 34 for ~zero relocated cost. Pure refactor.
Replace the tail-recursive `collect_spine_go(e, acc)` (and its verbatim copy
`collect_spine_simple_go`) with one non-tail `collect_spine(e)` in the
kernelTypes block. Repoint all callers (whnf, primitive, inductive_check,
def_eq); delete both old copies. Keyed on `e` alone, it now dedups shared
sub-spines (0 -> 2.19M cache hits).

Measured on `lake exe ix check --ixe init.ixe --ixes init.ixes --shard 9`:
- collect_spine: 5.38M rows / 2.65G -> 2.17M rows / 0.96G
- list_snoc.Ptr: +0.96G (new)
- total shard FFT: 42.78G -> 41.67G
check_all_skipping tested each closure const against the assumption-leaf list
via addr_in_list, an O(N) linear address_eq scan — the single largest address_eq
source on sharded checks (1.81M calls on Init shard 9, more than all primitive
dispatch combined).

Build an RBTreeMap skip-set once (keyed on the first 4 address bytes), then test
membership with one rbtree lookup + one confirming address_eq. Key collisions
overwrite: sound because a missed skip only rechecks a frontier const, and the
confirming address_eq rules out false positives.

Measured on `lake exe ix check --ixe init.ixe --ixes init.ixes --shard 9`:
- address_eq: 2.07M rows / 3.48G -> 274K rows / 0.40G
- rbtree build + lookups: +~0.46G
- total shard FFT: 41.67G -> 37.84G
try_nat_linear_rec only folded `Nat.rec base (λ_ ih => succ ih) (Lit n)` when
the base was also a literal; a symbolic base declined and fell through to n
iota steps that materialize succ^n(base). Reduce the symbolic-base case
directly to the offset primitive `Nat.add base (Lit n)` instead.

This keeps the value in the same compact `base + n` form a literal already has,
so def-eq comparing it converges instead of descending n unary succ layers (the
UTF-8 `x + 0xC0` reduction class). Offset magnitude stays KLimbs; only the
nat_add top-position index is a (small) G.

Measured on a `x + D = Nat.rec x succ-step D` benchmark: the unary def-eq
descent collapses (nat_succ_of D -> ~constant), ~-15% total FFT at D=512. The
remaining cost is the in-whnf chain reduction (substitution), unchanged here.
Reducing `Nat.add x (Lit n)` (symbolic x) delta-unfolded into a succ^n(x)
tower, and the nat_succ dispatch then whnf-cascaded the whole chain — O(n)
substitution per reduction, the dominant cost on Nat-arithmetic proofs (the
UTF-8 codec class).

Two coordinated changes:
- whnf (try_nat_offset_stuck): leave `Nat.add base (Lit n)` (non-literal base,
  nonzero n) stuck as a compact offset instead of unfolding it.
- def-eq (nat_offset_of + try_def_eq_nat): decompose each side to base + KLimbs
  offset (reading `Nat.add base (Lit m)` in O(1), peeling the few succs whnf
  exposes) and decide `base_a ≟ base_b` only when offsets are equal; otherwise
  fall back. All magnitudes stay KLimbs (no Goldilocks collapse).

Measured on `x + D = Nat.rec x succ-step D` (depths 64..512): total FFT goes
from O(D) (328M at D=512) to a depth-independent ~28.5M (-91% at D=512). Full
`lake test -- --ignored ixvm` passes (297 tests; BAD cases still rejected);
a recursor-on-symbolic-offset completeness test reduces correctly.
`Nat.div x (Lit k)` / `Nat.mod x (Lit k)` with a symbolic base delta-unfolded
the division algorithm and expanded hugely, even though the result is
irreducible for a symbolic base. `Nat.shiftRight x k` unfolds to k nested
`Nat.div _ 2`, so a symbolic `>>>` materialized a deep, super-linear tower —
the dominant cost in the UTF-8 codec's `c >>> 6/12/18` encoding.

Generalize `try_nat_offset_stuck` (which already keeps `Nat.add base (Lit n)`
compact) to also leave `Nat.div`/`Nat.mod base (Lit k)` (k ≥ 2) stuck. Thresholds
keep `x/1 = x` and `x/0 = 0` reducing. Both `x >>> 18` and `(x >>> 9) >>> 9` then
reduce to the same nested-stuck-div form, so def-eq stays structural and the
identity still holds.

Measured on `x >>> 18 = (x >>> 9) >>> 9`: total FFT 15.65G -> 455M (-97%), now
depth-independent (matches `x >>> 6`). No regression on the Nat.add reproducers;
full `lake test -- --ignored ixvm` passes (297 tests).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant