IxVM kernel: cut Aiur FFT cost on reduction- and check-heavy shards#438
Open
arthurpaulino wants to merge 8 commits into
Open
IxVM kernel: cut Aiur FFT cost on reduction- and check-heavy shards#438arthurpaulino wants to merge 8 commits into
arthurpaulino wants to merge 8 commits into
Conversation
Expr.Share resolution and universe lookups walked the sharing/univ lists with `list_lookup_u64` (per-step `u64_is_zero` + `relaxed_u64_pred`). Switch to: - universe lookups: `list_lookup(univs, flatten_u64(idx))` — cheap per-step field subtraction instead of the U64 predecessor. - Expr.Share (Convert convert_expr + Ingress deref_share): `let ListNode.Cons(e, _) = load(list_drop(sharing, flatten_u64(idx)));`. list_drop returns the sublist *pointer*; repeated Share lookups collapse to shared memo entries, so the walk is near-free. Measured on `lake exe ix check --ixe init.ixe --ixes init.ixes --shard 10` (Init, owned=3, closure=676): total FFT 3.89G -> 2.11G. The merged pointer-list walk drops from 1.93G (list_lookup) to 5.8M (list_drop, 0.27%). (Measurement run had in-circuit blake3 verification disabled — not committed.)
The primitive-dispatch gauntlet in whnf compares each Const head against ~50 known primitive addresses, so address_eq is the single hottest circuit. The full 32-limb compare ran at width 109 on every call. Split it: `address_eq` now compares limb 0 only and rejects (the common case — a single differing limb proves inequality), delegating the remaining 31-limb compare to a separate `address_eq_tail`. Because Aiur charges a function's full width on every one of its rows, the cold path must be a separate function (a nested match in address_eq changes nothing — measured identical); as its own circuit, address_eq_tail's height is only the rare limb-0-match rows. Measured on `lake exe ix check --ixe init.ixe --ixes init.ixes --shard 10`: - address_eq: width 109 -> 80, FFT 509.8M -> 374.1M - address_eq_tail (new): width 108, height 1683, FFT 1.9M - total shard FFT: 2.106G -> 1.973G Limb-0 is the sweet spot: hot height (259991) dominates tail height (1683), so pulling a second limb forward (N=2) widens the hot circuit by more than it saves in the tail (measured 1.976G, worse). Result is identical to a full 32-limb compare (sound). (Measurement run had in-circuit blake3 verification disabled — not committed.)
whnf_with_spine is the single hottest circuit on reduction-heavy consts: width 89 charged on every one of its 3.36M rows. The width was set by two arms that only a minority of rows reach — the Const-head dispatch (delta/iota/quot/prim gauntlet) and the Proj-head reduction. Factor both into their own functions (`whnf_const_head`, `whnf_proj_head`). Because Aiur charges a function's full width on every row, the ~76% of steps that are App/Lam/Let now run at the narrow residual width instead of 89; the wide dispatch only taxes its own (smaller) row count. Measured on `lake exe ix check --ixe init.ixe --ixes init.ixes --shard 9` (1 reduction-heavy owned const): - whnf_with_spine: width 89 -> 34, FFT 6.48G -> 2.48G - whnf_const_head (new): width 77, height 821490, FFT 1.24G - whnf_proj_head (new): width 56, height 3452, FFT 0.002G - total shard FFT: 45.54G -> 42.78G (-6.1%) The Proj arm is rare (3452 rows) but was nearly half the width; extracting it dropped whnf_with_spine 65 -> 34 for ~zero relocated cost. Pure refactor.
Replace the tail-recursive `collect_spine_go(e, acc)` (and its verbatim copy `collect_spine_simple_go`) with one non-tail `collect_spine(e)` in the kernelTypes block. Repoint all callers (whnf, primitive, inductive_check, def_eq); delete both old copies. Keyed on `e` alone, it now dedups shared sub-spines (0 -> 2.19M cache hits). Measured on `lake exe ix check --ixe init.ixe --ixes init.ixes --shard 9`: - collect_spine: 5.38M rows / 2.65G -> 2.17M rows / 0.96G - list_snoc.Ptr: +0.96G (new) - total shard FFT: 42.78G -> 41.67G
check_all_skipping tested each closure const against the assumption-leaf list via addr_in_list, an O(N) linear address_eq scan — the single largest address_eq source on sharded checks (1.81M calls on Init shard 9, more than all primitive dispatch combined). Build an RBTreeMap skip-set once (keyed on the first 4 address bytes), then test membership with one rbtree lookup + one confirming address_eq. Key collisions overwrite: sound because a missed skip only rechecks a frontier const, and the confirming address_eq rules out false positives. Measured on `lake exe ix check --ixe init.ixe --ixes init.ixes --shard 9`: - address_eq: 2.07M rows / 3.48G -> 274K rows / 0.40G - rbtree build + lookups: +~0.46G - total shard FFT: 41.67G -> 37.84G
try_nat_linear_rec only folded `Nat.rec base (λ_ ih => succ ih) (Lit n)` when the base was also a literal; a symbolic base declined and fell through to n iota steps that materialize succ^n(base). Reduce the symbolic-base case directly to the offset primitive `Nat.add base (Lit n)` instead. This keeps the value in the same compact `base + n` form a literal already has, so def-eq comparing it converges instead of descending n unary succ layers (the UTF-8 `x + 0xC0` reduction class). Offset magnitude stays KLimbs; only the nat_add top-position index is a (small) G. Measured on a `x + D = Nat.rec x succ-step D` benchmark: the unary def-eq descent collapses (nat_succ_of D -> ~constant), ~-15% total FFT at D=512. The remaining cost is the in-whnf chain reduction (substitution), unchanged here.
Reducing `Nat.add x (Lit n)` (symbolic x) delta-unfolded into a succ^n(x) tower, and the nat_succ dispatch then whnf-cascaded the whole chain — O(n) substitution per reduction, the dominant cost on Nat-arithmetic proofs (the UTF-8 codec class). Two coordinated changes: - whnf (try_nat_offset_stuck): leave `Nat.add base (Lit n)` (non-literal base, nonzero n) stuck as a compact offset instead of unfolding it. - def-eq (nat_offset_of + try_def_eq_nat): decompose each side to base + KLimbs offset (reading `Nat.add base (Lit m)` in O(1), peeling the few succs whnf exposes) and decide `base_a ≟ base_b` only when offsets are equal; otherwise fall back. All magnitudes stay KLimbs (no Goldilocks collapse). Measured on `x + D = Nat.rec x succ-step D` (depths 64..512): total FFT goes from O(D) (328M at D=512) to a depth-independent ~28.5M (-91% at D=512). Full `lake test -- --ignored ixvm` passes (297 tests; BAD cases still rejected); a recursor-on-symbolic-offset completeness test reduces correctly.
`Nat.div x (Lit k)` / `Nat.mod x (Lit k)` with a symbolic base delta-unfolded the division algorithm and expanded hugely, even though the result is irreducible for a symbolic base. `Nat.shiftRight x k` unfolds to k nested `Nat.div _ 2`, so a symbolic `>>>` materialized a deep, super-linear tower — the dominant cost in the UTF-8 codec's `c >>> 6/12/18` encoding. Generalize `try_nat_offset_stuck` (which already keeps `Nat.add base (Lit n)` compact) to also leave `Nat.div`/`Nat.mod base (Lit k)` (k ≥ 2) stuck. Thresholds keep `x/1 = x` and `x/0 = 0` reducing. Both `x >>> 18` and `(x >>> 9) >>> 9` then reduce to the same nested-stuck-div form, so def-eq stays structural and the identity still holds. Measured on `x >>> 18 = (x >>> 9) >>> 9`: total FFT 15.65G -> 455M (-97%), now depth-independent (matches `x >>> 6`). No regression on the Nat.add reproducers; full `lake test -- --ignored ixvm` passes (297 tests).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Eight commits across six optimization areas in the Aiur IxVM kernel, each
targeting a specific hotspot found by profiling per-function FFT cost on real
shards of a compiled
Initenvironment. No semantics change — every shard stillchecks without error, and the full
lake test -- --ignored ixvmsuite passes(297 tests) at every step.
Cumulative effect on a reduction-heavy shard (Init shard 9, one owned constant,
deep reduction): 45.5G → 16.6G total FFT (−64%). The structural/keyed-lookup
areas (1–5) account for 45.5G → 37.8G; the Nat-reduction area (6) takes it the
rest of the way by collapsing two symbolic-arithmetic blowups from O(n)/O(n²) to
O(1). That same area makes a previously intractable proof class — the UTF-8 codec
lemma
ByteArray.utf8DecodeChar?_utf8EncodeChar_append, whose checking spawnsdeep
Nat.add x 0xC0succ-towers andc >>> 6/12/18division-towers — reduce inconstant rather than tower-depth cost.
Method
Built a
.ixeof theInitclosure, profiled and sharded it(
ix profile→ix shard --shards 64), then ranix checkper shard or perconstant and read the per-function FFT table. Shards 10 (check-heavy: lookups,
address compares, deserialize) and 9 (reduction-heavy: whnf, substitution, spine
collection) exercised different profiles; small hand-built reproducers
(
x + D = …,x >>> D = …) isolated and scaled the symbolic-arithmetic blowups.Aiur cost model behind the changes: FFT ≈
per-function width × row height,charged at the function's max-arm width on every row. Levers: (a) factor a
wide/rare arm into its own circuit, (b) cut row count via better memoization,
(c) replace linear scans with keyed lookups, (d) keep symbolic primitives compact
so they never materialize a tower.
Areas
1. Cheaper
Expr.Share/ universe lookups — commit3774860.The sharing-table and universe-array lookups walked the list with
list_lookup_u64(per-step U64 predecessor). Switched to a field index and, forExpr.Share, tolist_drop— which returns the sublist pointer, so repeatedlookups dedup in the memo cache. Shard 10: total FFT 3.89G → 2.11G; the merged
pointer-list walk fell from 1.93G to 5.8M.
2. Hot/cold split of
address_eq— commit69b6936.The primitive-dispatch gauntlet compares each
Consthead against ~50 hardcodedaddresses, making
address_eq(a full 32-limb compare, width 109) the hottestcircuit. Almost every comparison differs at limb 0. Split it:
address_eqcompares limb 0 only and delegates the rest to
address_eq_tail, whose height isjust the rare limb-0 matches. (A nested match in the same function does nothing —
Aiur charges full width per row regardless of arm; the split must be a function
boundary.) Shard 10: width 109 → 80, 509.8M → 374.1M; total 2.106G → 1.973G.
3. Hot/cold split of
whnf_with_spine— commit0e30d78.On reduction-heavy consts this is the single hottest circuit (width 89 × 3.36M
rows), its width set by two arms a minority of rows reach: Const-head and
Proj-head dispatch. Factored both into
whnf_const_head/whnf_proj_head. TheProj arm is the jackpot — 3.4K rows but nearly half the width. Shard 9: width
89 → 34, 6.48G → 2.48G.
4. Non-tail, single-definition
collect_spine— commit5b7fabc.collect_spinepeeled the App chain with a tail-recursive accumulator. Tailrecursion buys nothing in Aiur (stack depth is free), and the accumulator landed
in the memo key, blocking reuse — 0 cache hits across 5.38M rows. Rewrote it
non-tail, keyed on the expression alone (2.19M hits now), and merged the two
verbatim copies into one definition in
kernelTypes. Shard 9: 2.65G → 0.96G.5. Keyed skip-set for sharded check — commit
a52967f.The largest
address_eqsource on a sharded check wasaddr_in_list, an O(N)linear scan of the assumption-leaf list per checked const (1.81M calls on shard
9). Replaced with an
RBTreeMapskip-set keyed on the first 4 address bytes:build once, membership = one rbtree lookup + one confirming
address_eq. Shard 9:address_eq2.07M rows / 3.48G → 274K rows / 0.40G; total 41.7G → 37.8G.6. Compact symbolic Nat reduction — commits
dbc4177,5dcab7f,d50c9a6.Natprimitives applied to (symbolic base, literal) delta-unfolded into towersthe kernel then re-walked:
Nat.add x 0xC0→succ^192(x), andNat.div x 2(what
x >>> kunfolds to, k times) → an exploding division algorithm. Threecoordinated commits keep them compact:
dbc4177:try_nat_linear_recreducesNat.rec base (succ-step) (Lit n)witha symbolic base directly to
Nat.add base (Lit n)instead of n iota steps.5dcab7f: whnf (try_nat_offset_stuck) leavesNat.add base (Lit n)stuck asa compact offset; def-eq (
nat_offset_of+try_def_eq_nat) decomposes eachside to
base + KLimbsand decidesbase ≟ baseon equal offsets.d50c9a6: generalizes the stuck rule toNat.div/Nat.mod base (Lit k)(k ≥ 2), so symbolic
>>>reduces to nested-stuck-div in O(1).Benchmarks:
x + D = Nat.rec x succ-step Dgoes O(D) → depth-independent(328M → 28.5M at D=512, −91%);
x >>> 18 = (x >>> 9) >>> 9goes 15.65G → 455M(−97%), matching
x >>> 6. All magnitudes stayKLimbs(no Goldilockscollapse); only de Bruijn /
top-position indices areG.Soundness
still checks).
address_eqreturns the same value as a full 32-limb compare — onediffering limb proves inequality.
skip (a frontier const gets rechecked, never wrongly trusted); the confirming
address_eqrules out false positives; the 4-byte key makes collisionsnegligible regardless.
Nat.add/div/mod symbolic literalstuck is the correct normalform (these are irreducible for a symbolic base).
try_def_eq_natreturnseq = 1only when offsets are equal and bases are def-eq (+ kis injective);otherwise it falls back to the generic path. The full
ixvmsuite passes (297tests),
BADnegative cases still get rejected (no false accept), andrecursor-on-symbolic-offset reduces correctly.