Reduce CoreML conversion peak and ANE steady-state memory by ChinChangYang · Pull Request #1202 · lightvector/KataGo

ChinChangYang · 2026-05-30T09:12:24Z

Summary

Draft. Cuts memory during the on-device KataGo→CoreML conversion and while running the ANE/CoreML path, with byte-identical converter output. Three changes (first commit):

Non-owning weight views — the converter's weight tensors become views into the parsed model instead of owning extra FP32 copies (derived/transposed tensors keep an owned buffer), dropping redundant resident weight copies during conversion. CoreML serialization is made deterministic (SetSerializationDeterministic) so output is byte-stable.
Streaming parser — the KataGo model parser streams the gzip through a bounded ~1 MB refill buffer instead of decompressing the whole file into memory, preserving the existing NaN/Inf weight validation.
Release weights on ANE — ModelDesc::releaseWeights() frees the in-memory weight arrays (keeping scalar shape metadata). The Metal backend calls it on the ANE (CoreML) path after converting from the model file on disk, gated by a new ComputeContext::aneOnly flag so it only fires when every configured device is ANE — the GPU/MPSGraph path keeps its weights. Serialized under computeHandleMutex; only scalar dims are read afterward.

A follow-up commit (Enforce non-owning weight-view contract at compile time) hardens the first bullet: WeightEntry now stores a non-owning view, so its backing std::vector must outlive serialization. The rvalue overloads of addConstOp/registerWeight are = deleted so a temporary can never bind to the view path (it would dangle) — temporaries are forced through the owning addOwnedConstOp/registerOwnedWeight. No codegen change: every existing call site passes an lvalue model member, verified by a full METAL rebuild plus a negative-compile test confirming the deleted overload is selected for a temporary.

Merged: Metal GPU + CoreML/ANE transformer support (#1205)

This branch now merges #1205 (transformer-net support — SiLU, GQA, learnable RoPE, RMSNorm, SwiGLU — for the Metal GPU and CoreML/ANE paths) so the two efforts reconcile in one place. All four merge conflicts were in the katagocoreml converter and came from a single root: this PR's non-owning FloatView weight registration vs #1205's per-weight FP32 precision tiers (is_fp32). Both are kept (registerWeight stamps entry.is_fp32; WeightSerializer stores fp16 iff use_fp16 && !is_fp32; addConstOp keeps the emitConstOp refactor and passes is_fp32 = (m_weight_dtype == FLOAT32)).

The merge required one integration fix (commit Conform CoreML transformer derived consts to the owned-weight + FP32 contract). The transformer attention builder emits four function-local tensors — RoPE cos/sin tables, the rotation matrix R, and per-head out-projection weight slices. After this PR made WeightEntry::data a non-owning view, those locals must:

be registered through the owning addOwnedConstOp (with std::move) so the view doesn't dangle once the build function returns (serialization runs afterwards); and
carry the same per-weight FP32 marking as addConstOp, because emitConstOp declares each const's dtype as m_weight_dtype. Without (2) a derived const that lands in the attention/value-head FP32 sub-region of an FP16 model is declared FP32 but stored FP16, which CoreML/ANE rejects at load ("Metadata data type does not match requested type", BNNS error −14), crashing every FP16-ANE transformer.

addOwnedConstOp/registerOwnedWeight now thread is_fp32 = (m_weight_dtype == FLOAT32), mirroring addConstOp so stored dtype always matches the declared dtype. This also fixes the same latent mismatch for addLinearOp's transposed value-head weights. The streaming parser needed no change (the new parseTransformer* use the unchanged readFloats), and ModelDesc::releaseWeights() already covers the transformer descriptors (including ropeFreqs).

Measured (19×19, ANE path)

Resident-memory A/B on the CoreML/ANE path (metalDeviceToUseThread0=100), cold conversion every run, before vs after this PR's three memory levers. before is this branch with the levers removed (the #1205 transformer tip 39f82f6d — no non-owning views, no streaming parser, no aneOnly release); after is HEAD. Peak RSS is the /usr/bin/time -l maximum over load+convert; idle steady-state is ps-sampled after the model is loaded and released. Numbers are GB, reproduced across two runs (<0.1% spread).

Model	Peak RSS (load+convert), before → after	Idle steady-state RSS, before → after
Classical 40b (`b40c256`)	1.50 → 0.80 GB (−47%)	0.38 → 0.25 GB (−33%)
18b nbt (`b18c384nbt`)	0.88 → 0.48 GB (−45%)	0.59 → 0.20 GB (−66%)
Transformer (`b10c384h6nbttflrs`)	1.58 → 1.37 GB (−13%)	0.66 → 0.46 GB (−31%)

The peak win is largest on the classical 40b, where the conversion transients the levers target (duplicate owning weight copies + whole-gzip decompression) dominate; the idle win is largest on the 18b nbt, which has the most resident FP32 weight for the ANE releaseWeights to free.

Why the transformer's peak reduces less (−13%), and why its absolute peak is highest despite the smallest file. Both are expected and not a regression — verified by instrumented byte-accounting at serialization time:

The peak levers (non-owning views + streaming parser) act on host RAM during parse/convert. The transformer gives them less to remove: (i) ~39% of its converter-side weight bytes are derived tensors — RoPE cos/sin tables (the bulk), the rotation matrix, and per-head out-projection slices — that are host-computed and therefore keep an owned buffer (addOwnedConstOp), so the non-owning-view lever can dedup ~0% of them, versus ~100% of a conv net's weights (the conv nets carry only ~0.07 MB of owned tensors total). (ii) Its 38 MB file makes the streaming parser's removed whole-decompress transient small (~42 MB, vs ~187 MB for the 40b). Combined, the levers free ~80 MB here vs ~365 MB on the 40b.
The ~1.4 GB absolute peak is reached after the C++ converter returns (which itself peaks at only ~0.24 GB): it is CoreML/ANE compilation of the MIL graph into an ANE program. The transformer emits ~3.4× more MIL ops than the 40b (op-level attention: per-block RoPE, GQA per-head slice/concat, 361×361 softmax), and the ANE compiler's memory scales with graph complexity, not weight bytes — so the smallest model on disk has the largest compile-time footprint. This floor sits downstream of, and is unaffected by, all three (host-side) levers.

Cross-backend parity vs an Eigen reference is unchanged on both the GPU and ANE paths.

Test plan

METAL build; ./katago runtests (All tests passed) + ./katago runnnlayertests (14 configurations)
./katago testgpuerror vs Eigen reference — GPU and ANE paths both within tolerance, no NaN
ANE-path RSS A/B (above); GPU path unaffected (aneOnly false)
Full rungpuerrortest.sh regression vs current Eigen FP32 references: Metal GPU 37/37 pass; CoreML/ANE 31/31 supported configs pass (the 6 "failures" are the expected pre-v8 SIGABRTs — run4/grun50/g103, nets predating CoreML-converter support). The two transformer nets the merge had broken on the FP16 ANE path (b7c96h3tfrs, b7c96h6kv3qk32v16 GQA) now load and match the reference to <0.0005% winrate; convnet ANE output is byte-identical and the GPU path is unchanged.

Note

The CI cache fix that was previously bundled here has been split into its own PR: #1204 (orthogonal fix for a stale-cache link failure on the build-macos-metal job). This PR is now the memory work plus the merged #1205 transformer support.

🤖 Generated with Claude Code

Cuts memory during the on-device KataGo -> CoreML conversion and while running the ANE/CoreML path, with byte-identical converter output: - The converter's weight tensors become non-owning views into the parsed model instead of owning extra FP32 copies; derived/transposed tensors keep an owned buffer. This drops redundant resident weight copies during conversion. CoreML model serialization is made deterministic (SetSerializationDeterministic) so the output is byte-stable. - The KataGo model parser streams the gzip through a bounded ~1 MB refill buffer instead of decompressing the whole file into memory, while preserving the existing NaN/Inf weight validation. - ModelDesc gains releaseWeights(), which frees the in-memory weight arrays (keeping scalar shape metadata). The Metal backend calls it on the ANE (CoreML) path after converting from the model file on disk, gated by a new ComputeContext::aneOnly flag so it only fires when every configured device is ANE -- the GPU/MPSGraph path keeps its weights. The call is serialized under computeHandleMutex and only scalar dims are read afterward. Measured on b18c384nbt (19x19) over the ANE path: idle steady-state RSS 0.59 GB -> 0.19 GB; peak (load+convert) 0.87 GB -> 0.48 GB. Cross-backend parity vs an Eigen reference is unchanged on both the GPU and ANE paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

WeightEntry stores a non-owning view (const float*, count) into the live KataGoModelDesc, so the backing std::vector must outlive serialization. addConstOp/registerWeight took the data by const& and silently stored a pointer to it; a caller passing a temporary would bind to that const& and leave the view dangling, read much later during serialization. Delete the rvalue overloads of both so any such call fails to compile, forcing temporaries through addOwnedConstOp/registerOwnedWeight (which take ownership). Named lvalues (the model-member call sites) still bind to the const& overload, so no existing caller changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Own the gzFile with a custom-deleter unique_ptr so it closes on every exit path (normal return, exception, bad_alloc); removes the manual try/catch+gzclose in parse() and the ordering caveat on buffer allocation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Introduce a KataGo-local non-owning FloatView for WeightEntry::data instead of a raw const float*/size_t pair; convert to MILBlob::Util::Span only inside WeightSerializer, keeping the MILBlob dependency out of Operations.hpp. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The ComputeHandle member-order comment claimed that declaring mpsGraphOnlyHandle before coremlOnlyHandle is what prevents a GPU handle from reading freed weights. That overstates the ordering's role: within a single ComputeHandle exactly one handle is built (mutually exclusive on gpuIdx, enforced by the ctor's exactly-one check), and releaseWeights() only fires on an aneOnly context where no MPSGraph handle is ever built. Reframe the declaration order as belt-and-suspenders and point at ComputeContext::aneOnly as the actual invariant. Comment-only change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Replace the file-local releaseXXX free functions in desc.cpp (which reached into each desc struct's internals from outside) with releaseWeights() member methods on each weight-bearing struct, matching the existing OO convention used by applyScale8ToReduceActivations() and iterConvLayers(). Each container delegates to its members; type-erased block dispatch is inlined with the same cast pattern those methods use. Behavior-preserving: same set of freed vectors, same block recursion, same metaEncoderVersion guard. ModelDesc::releaseWeights() keeps its signature, so the metalbackend.cpp call site is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Move the 11 leaf/container releaseWeights() definitions in desc.cpp out of the bottom cluster (inherited from the old free-function layout) and place each immediately after its struct's last existing method, matching the file's per-struct grouping convention used by every other method. ModelDesc::releaseWeights() stays put, already adjacent to its siblings. Pure relocation: function bodies and desc.h are unchanged; only two stray double-blank lines were normalized to single. Verified clean Metal build, testgpuerror vs Eigen reference (g170-b6c96) at <0.0004% winrate error, and runtests all pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ChinChangYang · 2026-06-01T00:29:20Z

I reviewed and approved this.

Resolve conflicts where upstream transformer support landed alongside the CoreML/ANE memory work: - desc.h / desc.cpp: keep both upstream getNumParameters() and the branch's releaseWeights() on SGFMetadataEncoderDesc. Extend releaseWeights() to the new transformer block kinds (TRANSFORMER_ATTENTION/FFN) in TrunkDesc and NestedBottleneckResidualBlockDesc, add releaseWeights() to RMSNormLayerDesc / TransformerRMSNormDesc / TransformerAttentionDesc / TransformerFFNDesc, and also release trunkTipRMSNorm -- mirroring upstream's getNumParameters() coverage so an ANE-path model containing transformer blocks frees its weights instead of hitting ASSERT_UNREACHABLE. - metalbackend.cpp: adopt upstream's 3-arg ComputeContext ctor (NHWC mode removed) while keeping the branch's context->aneOnly assignment. Verified: METAL build; runtests; runnnlayertests; testgpuerror ANE + GPU vs Eigen reference within tolerance, no NaN; ANE-path RSS still ~0.20 GB idle / ~0.49 GB peak (memory levers preserved). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…(v15) Implement the LLaMA-style transformer-hybrid forward pass (RMSNorm, multi-head attention with learnable 2D RoPE, SwiGLU FFN) plus ACTIVATION_SILU across the Metal GPU (MPSGraph) and CoreML/ANE (MIL) backends, so the v15 b10c384h6nbttflrs model runs end-to-end. Metal GPU (MPSGraph) — verified via testgpuerror vs Eigen reference at sizes 9/13/19 (winrate error ~0.0001%, well under threshold): - metallayers.swift: TransformerRMSNormLayer, TrunkRMSNormLayer, TransformerAttentionBlock, TransformerFFNBlock, silu() activation, SWTransformer*/SWRMSNorm descriptors; Trunk branches on trunkNormKind - metalbackend.cpp: SILU bridge + transformer/RMSNorm desc bridges, wired into residualBlocksToSwift and trunkDescToSwift CoreML/ANE (katagocoreml MIL) — implemented end-to-end; fp32 model logically correct and consistent across CPU/ANE/GPU. fp16 ANE path is numerically precision-limited (~5%) due to fp16 matmul accumulation in the deep attention stack: - types/parser: ActivationType::Silu, trunk_norm_kind, transformer block kinds 4/5, RMSNorm/attention/FFN descriptors - MILBuilder: addSiluOps, RMSNorm ops, transformer attention/FFN blocks. Fixes 4 CoreML bugs: reshape-after-transpose, fp16 mask overflow, fp16 RMSNorm reduce_sum overflow (reduce_mean) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

# Conflicts: # cpp/external/katagocoreml/src/parser/KataGoParser.cpp

GQA models (numKVHeads != numHeads, e.g. b7c96h6kv3qk32v16tflrs) crashed on the Metal GPU path with: MPSNDArray.mm: NDArray dimension length > INT_MAX repeatKVHeads expanded the KV heads via reshape -> broadcast -> reshape, passing -1 for the batch dim in the broadcast target shape. Unlike reshape, MPSGraph.broadcast(_:shape:) does not infer -1 and treats it as a literal (near-INT_MAX) dimension, tripping the NDArray assertion. Replace the broadcast with a shape-safe slice + concat: slice each KV head (dim 1) and concatenate groupSize copies consecutively, so query head h uses kv = h / groupSize, matching the Eigen reference (kvh = h / kvGroupSize). No -1 broadcast. Verified: testgpuerror GPU vs Eigen reference at 9/13/19 now passes (~0.00003% winrate); non-GQA models (incl. b10c384h6) unaffected since the GQA branch is gated on numKVHeads != numHeads. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The MIL builder's inline activation dispatch (buildValueHead v2, policy-head pass activation, and both SGF metadata encoder layers) handled only ReLU and Mish; SiLU silently fell through to the else branch and applied NO activation at all. This corrupted the value-head pool -> v2 -> v3 scalar path for every SiLU model, producing large errors in winrate/score/lead while ownership (which branches off v1, before v2) stayed correct. Add an ActivationType::Silu branch (addSiluOps) at all four sites. The generic conv/BN activation path already handled SiLU, which is why the trunk and v1/ownership were fine. Root-caused via systematic debugging: CoreML-CPU(fp32) error was identical to ANE (-> logical bug, not fp16), and perfect ownership with wrong scalars localized it to the value-head post-pooling path. This corrects the earlier "ANE is fp16-precision-limited (~5%)" conclusion -- that 5.66% on b10c384h6 was this bug. After the fix, testgpuerror ANE vs Eigen drops to GPU-level accuracy for all models: b10c384h6 5.66% -> ~0.00005-0.0002% cnorm 11-13% -> ~0.00007% rsnh 22-29% -> ~0.00004-0.0001% Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The CoreML MIL builder threw "GQA (numKVHeads != numHeads) not supported" for grouped-query-attention transformer models, while the Metal GPU (MPSGraph) path already handled GQA. Port that support to the MIL builder. In buildTransformerAttentionBlock, remove the throw guard and, after the RoPE block and before the scores matmul, repeat each KV head groupSize (= numHeads/numKVHeads) times along the head axis via slice_by_size + concat (interleave=false), so query head h consumes kv head h/groupSize. This matches the Eigen reference (kvh = h/kvGroupSize) and the GPU repeatKVHeads ordering. RoPE stays before the repeat (its cos/sin tables are numKVHeads-shaped). The block is gated by numKVHeads != numHeads, so the standard MHA path is unchanged. Verified on b7c96h6kv3qk32v16tflrs-fson-bnh (6 query / 3 KV heads, qk32/v16) vs Eigen reference: ANE testgpuerror 9/13/19 = 0.00002-0.00003% winrate (previously a hard throw); GPU unchanged; non-GQA model ANE error identical to pre-change; runtests and runnnlayertests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Transformer models failed testgpuerror on the CoreML/ANE FP16 path: the ANE accumulates FP16 matmuls AND convs in FP16 (unlike OpenCL/CUDA/TRT, which accumulate in FP32), so wide/deep transformers lose too much precision and miss the thresholds at larger board sizes. BF16 is not an option (no compute path in CoreML: cast op, ArrayFeatureType and MLMultiArray all lack bf16; coremltools confirms FLOAT16/FLOAT32 only). Follow KataGo's FP16 convention (spatial convs FP16, non-spatial FP32), channel-gated for the ANE since every FP32 op runs off the FP16-only ANE: - RMSNorm reduction cores: FP32 in FP16 mode (always). - Non-spatial (FFN/Q-K-V proj/pooling/matmul): FP32 (always). MIL `linear` needs const weight/bias so it can't runtime-cast; only `matmul` is wrapped. - Convs: FP32 only for wide trunks (>= 320ch); narrower keep convs on-ANE. - Narrow trunks (< 256ch) sit on the testgpuerror thresholds and no partial FP32 config passes all board sizes (islands cast back to FP16 leave a noisy FP16 spatial stream); build them fully FP32 (off-ANE, cheap since small). Weights stay FP16-stored via runtime up-casts, except full-FP32 models. Add per-weight FP32 serialization (WeightEntry.is_fp32) so a const declared FP32 inside an otherwise-FP16 model is stored FP32 (fixes the load-time "storage and type have different number of elements" abort and enables the full-FP32 tier). Also fixes addFloatScalarConstOp keying storage off m_use_fp16 instead of the declared m_weight_dtype. Result: all 4 transformer test models (b10c384h6/b4c256h4/b7c96h3/ b7c96h6kv3-GQA) pass testgpuerror on ANE FP16 at sizes 9/13/19; runtests and runnnlayertests pass. All changes gated on m_use_fp16; FP32 mode unchanged. The 256/320 channel thresholds are width heuristics validated on these models. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Three near-identical blocks wrapped global pooling in FP32 (policy head, value head, gpool residual block): cast input/mask up to FP32, flip m_weight_dtype, pool, restore, cast pooled features back to FP16 - with inconsistent save-variable names and one site using castFixed vs addCastOp for the output cast. Extract a single addGlobalPoolingFp32(input, mask, channels, output, valueVariant) helper and a small RAII ScopedFp32 guard for the temporary m_weight_dtype flip. The three call sites become one-liners. Behavior-preserving: same emitted op sequence; testgpuerror output is byte-identical across all precision tiers (partial-FP32 b10c384h6, full-FP32 b7c96h3, non-spatial-FP32 b4c256h4), all 12 transformer gate runs pass, runtests and runnnlayertests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The width-keyed precision tiers (commit 3839e52) forced FP32 ops off the FP16-only ANE on plain production convnets, not just transformers. b18c384nbt ran ~2.6x slower on the ANE path (160 vs 416 visits/s) with no accuracy benefit. The dominant cost is the per-block global-pooling FP32 (non-spatial), which breaks the ANE pipeline once per gpool-residual block; conv-FP32 is secondary. Add a recursive blocksContainTransformer() helper and gate all three escalations (full-FP32, non-spatial-FP32, conv-FP32) on transformer-block presence. Convnets now run pure FP16 on the ANE (the long-standing pre-tier path); for transformer models the added "&& hasTransformer" is always true, so their emitted MIL is byte-identical and behavior is unchanged. Verified on the ANE FP16 path: b18c384nbt testgpuerror passes (winrate 99%=0.57%, max=0.87%) and recovers full throughput (424 visits/s); b28c512nbt passes (99%=0.41%); all 4 transformer test models x sizes 9/13/19 pass with numbers byte-identical to before; runtests and runnnlayertests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…former-silu-b10c384h6

makeRopeTables allocated cosBuf/sinBuf with UnsafeMutablePointer.allocate and handed them to Data(floatsNoCopy:), which uses deallocator: .none, so the buffers were never freed -- a leak on every graph build (per attention block, per board size). Unlike the other floatsNoCopy callers (weights/gamma/beta), which point at C++-descriptor memory that lives for the model's lifetime, these tables have no persistent owner. Switch to managed [Float32] arrays and copy into the Data via Data(buffer:) so MPSGraph owns the bytes -- avoids both the leak and a use-after-free that a naive deallocate() on the no-copy path would cause. Output-neutral: testgpuerror on the GQA + learnable-RoPE model (b7c96h6kv3qk32v16tflrs, board 19) vs Eigen FP32 reference matches to 0.00028% max winrate error over 2247 positions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The Metal forward pass (metallayers.swift TransformerFFNBlock) only implements the SwiGLU path (SiLU(linear1) * gate). A non-SwiGLU model carries no gate weights, so building the Swift descriptor from the empty linearGate would crash obscurely (or silently misbehave). Eigen (eigenbackend.cpp) and CoreML (katagocoreml MILBuilder) both throw a clear "non-SwiGLU transformer FFN not supported" error in this case; the Metal GPU path had no such guard. Add the matching StringError at the FFN descriptor conversion so all three backends fail loudly and consistently. No behavior change for any current model (all use useSwiGLU=true): the guard sits on an untaken path. Verified the SwiGLU model b10c384h6nbttflrs still passes testgpuerror on both GPU (0.00005% winrate) and ANE (unchanged from baseline); runtests and runnnlayertests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The trunk-tip dispatch in metallayers.swift compared trunkNormKind against the literal 1, while the rest of the codebase uses the named constants from desc.h (TRUNK_NORM_KIND_STANDARD/_RMSNORM). Add matching Swift constants and use TRUNK_NORM_KIND_RMSNORM at the comparison site. Pure literal-to-named-constant rename; no behavior change. Verified both branches still pass testgpuerror at GPU-level accuracy: RMSNorm tip (b10c384h6nbttflrs) 0.00005% winrate, BatchNorm tip (b7c96h6kv3 GQA) 0.00003%; runtests and runnnlayertests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…into feature/coreml-conversion-memory-levers (PR lightvector#1202) Brings Metal GPU + CoreML/ANE transformer-net support (SiLU, GQA, learnable RoPE, RMSNorm, SwiGLU) onto the CoreML-conversion memory branch (non-owning FloatView weight views, streaming parser, ANE weight release). All four conflicts were in the katagocoreml converter, from one root cause: lightvector#1202 reworked weight registration (non-owning FloatView views, registerOwnedWeight, deleted rvalue overloads) while lightvector#1205 added per-weight FP32 precision tiers (is_fp32). Resolved to keep both: - Operations.{hpp,cpp}: registerWeight stamps entry.is_fp32; keep the non-owning view plus registerOwnedWeight. - WeightSerializer.cpp: keep the FloatView-based count plus per-weight store_fp16 = use_fp16 && !is_fp32. - MILBuilder.cpp addConstOp: keep the emitConstOp refactor and pass is_fp32 = (m_weight_dtype == FLOAT32). A follow-up commit retrofits lightvector#1205's transformer derived consts to lightvector#1202's owned-weight + FP32-marking contract (required for the FP16 ANE path). Verified after the follow-up fix: katago runtests and runnnlayertests pass, and testgpuerror vs fresh Eigen FP32 references passes on convnet + all three transformer nets on both the Metal GPU and CoreML/ANE paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…contract The transformer attention builder emits four function-local std::vector<float> tensors: RoPE cos/sin tables, the rotation matrix R, and per-head out-projection weight slices. After merging the transformer support onto the FloatView branch, these needed two fixes: 1. Dangling view. lightvector#1202 made WeightEntry::data a non-owning FloatView, so addConstOp registers a view whose backing buffer must outlive serialization. These locals were passed to addConstOp and would dangle once the build function returns (serialization runs afterwards). Route them through addOwnedConstOp so KataGoOps owns the buffer until serialization. (Under lightvector#1205's owning WeightEntry they were copied, so this only surfaces post-merge.) 2. dtype mismatch. emitConstOp declares each const's dtype as m_weight_dtype, but addOwnedConstOp / registerOwnedWeight stored at the global mode (is_fp32 hardcoded false). In an FP16 model these derived consts land in the attention / value-head FP32 sub-region (m_weight_dtype == FLOAT32), so they were declared FP32 but stored FP16. CoreML/ANE then rejects the model at load ("Metadata data type does not match requested type", BNNS error -14), which SIGABRT'd every FP16 ANE transformer. Thread is_fp32 through registerOwnedWeight and have addOwnedConstOp pass is_fp32 = (m_weight_dtype == FLOAT32), mirroring addConstOp so the stored dtype always matches the declared dtype. This also fixes the same latent mismatch for addLinearOp's transposed value-head weights. Verified with testgpuerror against fresh Eigen FP32 references: b7c96h3tfrs and b7c96h6gqa, which previously SIGABRT'd on the FP16 ANE path, now load and match to <0.0005% winrate; convnet ANE output is byte-identical and the Metal GPU path is unchanged. katago runtests and runnnlayertests also pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ChinChangYang · 2026-06-04T12:30:24Z

Completed. Plan to set this PR to ready when #1205 is merged.

…contract The transformer attention builder emits four function-local std::vector<float> tensors: RoPE cos/sin tables, the rotation matrix R, and per-head out-projection weight slices. After merging the transformer support onto the FloatView branch, these needed two fixes: 1. Dangling view. lightvector#1202 made WeightEntry::data a non-owning FloatView, so addConstOp registers a view whose backing buffer must outlive serialization. These locals were passed to addConstOp and would dangle once the build function returns (serialization runs afterwards). Route them through addOwnedConstOp so KataGoOps owns the buffer until serialization. (Under lightvector#1205's owning WeightEntry they were copied, so this only surfaces post-merge.) 2. dtype mismatch. emitConstOp declares each const's dtype as m_weight_dtype, but addOwnedConstOp / registerOwnedWeight stored at the global mode (is_fp32 hardcoded false). In an FP16 model these derived consts land in the attention / value-head FP32 sub-region (m_weight_dtype == FLOAT32), so they were declared FP32 but stored FP16. CoreML/ANE then rejects the model at load ("Metadata data type does not match requested type", BNNS error -14), which SIGABRT'd every FP16 ANE transformer. Thread is_fp32 through registerOwnedWeight and have addOwnedConstOp pass is_fp32 = (m_weight_dtype == FLOAT32), mirroring addConstOp so the stored dtype always matches the declared dtype. This also fixes the same latent mismatch for addLinearOp's transposed value-head weights. Verified with testgpuerror against fresh Eigen FP32 references: b7c96h3tfrs and b7c96h6gqa, which previously SIGABRT'd on the FP16 ANE path, now load and match to <0.0005% winrate; convnet ANE output is byte-identical and the Metal GPU path is unchanged. katago runtests and runnnlayertests also pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 8481a94)

The cherry-picked per-struct releaseWeights() refactor (44342a3/98b17ebb) predates this branch's MLX transformer port, so it only added releaseWeights() to the non-transformer descriptors. Extend the coverage to the transformer descriptors present on this branch (RMSNormLayerDesc, TransformerRMSNormDesc, TransformerAttentionDesc incl. ropeFreqs, TransformerFFNDesc) and handle TRANSFORMER_ATTENTION_BLOCK_KIND / TRANSFORMER_FFN_BLOCK_KIND plus trunkTipRMSNorm in the trunk release walk. Without this, releasing weights on a transformer model would hit ASSERT_UNREACHABLE. This makes desc.cpp/desc.h byte-identical to the lightvector#1202 feature branch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Port the lightvector#1202 ANE steady-state memory lever to the MLX backend. Add ComputeContext::aneOnly, set in createComputeContext when every configured device index is MLX_MUX_ANE, and call ModelDesc::releaseWeights() in convertAndCreateCoreMLOnlyHandleMLX after the model has been converted to CoreML on disk. Safe because: the ANE path re-reads the model from modelPath (not the in-memory weight arrays); the ComputeHandle ctor takes the MLX_MUX_ANE early-return before building any MLX/GPU model (the only weight-array consumer); only scalar dims are read afterward, which releaseWeights() preserves; and it runs under computeHandleMutex. Mirrors the Metal backend's aneOnly release. GPU path unaffected (aneOnly is false whenever any thread uses MLX_MUX_GPU). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ChinChangYang mentioned this pull request May 30, 2026

Reduce CoreML conversion peak + ANE steady-state memory (port PR #27 C++ levers to master) ChinChangYang/KataGo#28

Closed

5 tasks

ChinChangYang and others added 3 commits May 31, 2026 17:28

ChinChangYang mentioned this pull request May 31, 2026

CI: key macOS-metal CMake cache on Homebrew dependency versions #1204

Closed

ChinChangYang force-pushed the feature/coreml-conversion-memory-levers branch from a6f2b94 to 6bfa617 Compare May 31, 2026 09:29

ChinChangYang and others added 3 commits May 31, 2026 21:57

ChinChangYang marked this pull request as ready for review June 1, 2026 00:29

ChinChangYang and others added 13 commits June 1, 2026 09:02

Merge branch 'master' into feature/metal-transformer-silu-b10c384h6

c5fd1e2

# Conflicts: # cpp/external/katagocoreml/src/parser/KataGoParser.cpp

Merge remote-tracking branch 'origin/master' into feature/metal-trans…

f770888

…former-silu-b10c384h6

ChinChangYang marked this pull request as draft June 2, 2026 15:26

ChinChangYang and others added 2 commits June 4, 2026 12:04

ChinChangYang mentioned this pull request Jun 4, 2026

Add MLX backend for Apple Silicon (GPU + ANE/CoreML dispatch, incl. transformer nets) #1199

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce CoreML conversion peak and ANE steady-state memory#1202

Reduce CoreML conversion peak and ANE steady-state memory#1202
ChinChangYang wants to merge 22 commits into
lightvector:masterfrom
ChinChangYang:feature/coreml-conversion-memory-levers

ChinChangYang commented May 30, 2026 •

edited

Loading

Uh oh!

ChinChangYang commented Jun 1, 2026

Uh oh!

ChinChangYang commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ChinChangYang commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Merged: Metal GPU + CoreML/ANE transformer support (#1205)

Measured (19×19, ANE path)

Test plan

Note

Uh oh!

ChinChangYang commented Jun 1, 2026

Uh oh!

ChinChangYang commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ChinChangYang commented May 30, 2026 •

edited

Loading