Reduce CoreML conversion peak and ANE steady-state memory#1202
Draft
ChinChangYang wants to merge 22 commits into
Draft
Reduce CoreML conversion peak and ANE steady-state memory#1202ChinChangYang wants to merge 22 commits into
ChinChangYang wants to merge 22 commits into
Conversation
Cuts memory during the on-device KataGo -> CoreML conversion and while running the ANE/CoreML path, with byte-identical converter output: - The converter's weight tensors become non-owning views into the parsed model instead of owning extra FP32 copies; derived/transposed tensors keep an owned buffer. This drops redundant resident weight copies during conversion. CoreML model serialization is made deterministic (SetSerializationDeterministic) so the output is byte-stable. - The KataGo model parser streams the gzip through a bounded ~1 MB refill buffer instead of decompressing the whole file into memory, while preserving the existing NaN/Inf weight validation. - ModelDesc gains releaseWeights(), which frees the in-memory weight arrays (keeping scalar shape metadata). The Metal backend calls it on the ANE (CoreML) path after converting from the model file on disk, gated by a new ComputeContext::aneOnly flag so it only fires when every configured device is ANE -- the GPU/MPSGraph path keeps its weights. The call is serialized under computeHandleMutex and only scalar dims are read afterward. Measured on b18c384nbt (19x19) over the ANE path: idle steady-state RSS 0.59 GB -> 0.19 GB; peak (load+convert) 0.87 GB -> 0.48 GB. Cross-backend parity vs an Eigen reference is unchanged on both the GPU and ANE paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
5 tasks
WeightEntry stores a non-owning view (const float*, count) into the live KataGoModelDesc, so the backing std::vector must outlive serialization. addConstOp/registerWeight took the data by const& and silently stored a pointer to it; a caller passing a temporary would bind to that const& and leave the view dangling, read much later during serialization. Delete the rvalue overloads of both so any such call fails to compile, forcing temporaries through addOwnedConstOp/registerOwnedWeight (which take ownership). Named lvalues (the model-member call sites) still bind to the const& overload, so no existing caller changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Own the gzFile with a custom-deleter unique_ptr so it closes on every exit path (normal return, exception, bad_alloc); removes the manual try/catch+gzclose in parse() and the ordering caveat on buffer allocation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Introduce a KataGo-local non-owning FloatView for WeightEntry::data instead of a raw const float*/size_t pair; convert to MILBlob::Util::Span only inside WeightSerializer, keeping the MILBlob dependency out of Operations.hpp. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
a6f2b94 to
6bfa617
Compare
The ComputeHandle member-order comment claimed that declaring mpsGraphOnlyHandle before coremlOnlyHandle is what prevents a GPU handle from reading freed weights. That overstates the ordering's role: within a single ComputeHandle exactly one handle is built (mutually exclusive on gpuIdx, enforced by the ctor's exactly-one check), and releaseWeights() only fires on an aneOnly context where no MPSGraph handle is ever built. Reframe the declaration order as belt-and-suspenders and point at ComputeContext::aneOnly as the actual invariant. Comment-only change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the file-local releaseXXX free functions in desc.cpp (which reached into each desc struct's internals from outside) with releaseWeights() member methods on each weight-bearing struct, matching the existing OO convention used by applyScale8ToReduceActivations() and iterConvLayers(). Each container delegates to its members; type-erased block dispatch is inlined with the same cast pattern those methods use. Behavior-preserving: same set of freed vectors, same block recursion, same metaEncoderVersion guard. ModelDesc::releaseWeights() keeps its signature, so the metalbackend.cpp call site is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Move the 11 leaf/container releaseWeights() definitions in desc.cpp out of the bottom cluster (inherited from the old free-function layout) and place each immediately after its struct's last existing method, matching the file's per-struct grouping convention used by every other method. ModelDesc::releaseWeights() stays put, already adjacent to its siblings. Pure relocation: function bodies and desc.h are unchanged; only two stray double-blank lines were normalized to single. Verified clean Metal build, testgpuerror vs Eigen reference (g170-b6c96) at <0.0004% winrate error, and runtests all pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
Author
|
I reviewed and approved this. |
Resolve conflicts where upstream transformer support landed alongside the CoreML/ANE memory work: - desc.h / desc.cpp: keep both upstream getNumParameters() and the branch's releaseWeights() on SGFMetadataEncoderDesc. Extend releaseWeights() to the new transformer block kinds (TRANSFORMER_ATTENTION/FFN) in TrunkDesc and NestedBottleneckResidualBlockDesc, add releaseWeights() to RMSNormLayerDesc / TransformerRMSNormDesc / TransformerAttentionDesc / TransformerFFNDesc, and also release trunkTipRMSNorm -- mirroring upstream's getNumParameters() coverage so an ANE-path model containing transformer blocks frees its weights instead of hitting ASSERT_UNREACHABLE. - metalbackend.cpp: adopt upstream's 3-arg ComputeContext ctor (NHWC mode removed) while keeping the branch's context->aneOnly assignment. Verified: METAL build; runtests; runnnlayertests; testgpuerror ANE + GPU vs Eigen reference within tolerance, no NaN; ANE-path RSS still ~0.20 GB idle / ~0.49 GB peak (memory levers preserved). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(v15) Implement the LLaMA-style transformer-hybrid forward pass (RMSNorm, multi-head attention with learnable 2D RoPE, SwiGLU FFN) plus ACTIVATION_SILU across the Metal GPU (MPSGraph) and CoreML/ANE (MIL) backends, so the v15 b10c384h6nbttflrs model runs end-to-end. Metal GPU (MPSGraph) — verified via testgpuerror vs Eigen reference at sizes 9/13/19 (winrate error ~0.0001%, well under threshold): - metallayers.swift: TransformerRMSNormLayer, TrunkRMSNormLayer, TransformerAttentionBlock, TransformerFFNBlock, silu() activation, SWTransformer*/SWRMSNorm descriptors; Trunk branches on trunkNormKind - metalbackend.cpp: SILU bridge + transformer/RMSNorm desc bridges, wired into residualBlocksToSwift and trunkDescToSwift CoreML/ANE (katagocoreml MIL) — implemented end-to-end; fp32 model logically correct and consistent across CPU/ANE/GPU. fp16 ANE path is numerically precision-limited (~5%) due to fp16 matmul accumulation in the deep attention stack: - types/parser: ActivationType::Silu, trunk_norm_kind, transformer block kinds 4/5, RMSNorm/attention/FFN descriptors - MILBuilder: addSiluOps, RMSNorm ops, transformer attention/FFN blocks. Fixes 4 CoreML bugs: reshape-after-transpose, fp16 mask overflow, fp16 RMSNorm reduce_sum overflow (reduce_mean) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts: # cpp/external/katagocoreml/src/parser/KataGoParser.cpp
GQA models (numKVHeads != numHeads, e.g. b7c96h6kv3qk32v16tflrs) crashed on the Metal GPU path with: MPSNDArray.mm: NDArray dimension length > INT_MAX repeatKVHeads expanded the KV heads via reshape -> broadcast -> reshape, passing -1 for the batch dim in the broadcast target shape. Unlike reshape, MPSGraph.broadcast(_:shape:) does not infer -1 and treats it as a literal (near-INT_MAX) dimension, tripping the NDArray assertion. Replace the broadcast with a shape-safe slice + concat: slice each KV head (dim 1) and concatenate groupSize copies consecutively, so query head h uses kv = h / groupSize, matching the Eigen reference (kvh = h / kvGroupSize). No -1 broadcast. Verified: testgpuerror GPU vs Eigen reference at 9/13/19 now passes (~0.00003% winrate); non-GQA models (incl. b10c384h6) unaffected since the GQA branch is gated on numKVHeads != numHeads. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The MIL builder's inline activation dispatch (buildValueHead v2, policy-head pass activation, and both SGF metadata encoder layers) handled only ReLU and Mish; SiLU silently fell through to the else branch and applied NO activation at all. This corrupted the value-head pool -> v2 -> v3 scalar path for every SiLU model, producing large errors in winrate/score/lead while ownership (which branches off v1, before v2) stayed correct. Add an ActivationType::Silu branch (addSiluOps) at all four sites. The generic conv/BN activation path already handled SiLU, which is why the trunk and v1/ownership were fine. Root-caused via systematic debugging: CoreML-CPU(fp32) error was identical to ANE (-> logical bug, not fp16), and perfect ownership with wrong scalars localized it to the value-head post-pooling path. This corrects the earlier "ANE is fp16-precision-limited (~5%)" conclusion -- that 5.66% on b10c384h6 was this bug. After the fix, testgpuerror ANE vs Eigen drops to GPU-level accuracy for all models: b10c384h6 5.66% -> ~0.00005-0.0002% cnorm 11-13% -> ~0.00007% rsnh 22-29% -> ~0.00004-0.0001% Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The CoreML MIL builder threw "GQA (numKVHeads != numHeads) not supported" for grouped-query-attention transformer models, while the Metal GPU (MPSGraph) path already handled GQA. Port that support to the MIL builder. In buildTransformerAttentionBlock, remove the throw guard and, after the RoPE block and before the scores matmul, repeat each KV head groupSize (= numHeads/numKVHeads) times along the head axis via slice_by_size + concat (interleave=false), so query head h consumes kv head h/groupSize. This matches the Eigen reference (kvh = h/kvGroupSize) and the GPU repeatKVHeads ordering. RoPE stays before the repeat (its cos/sin tables are numKVHeads-shaped). The block is gated by numKVHeads != numHeads, so the standard MHA path is unchanged. Verified on b7c96h6kv3qk32v16tflrs-fson-bnh (6 query / 3 KV heads, qk32/v16) vs Eigen reference: ANE testgpuerror 9/13/19 = 0.00002-0.00003% winrate (previously a hard throw); GPU unchanged; non-GQA model ANE error identical to pre-change; runtests and runnnlayertests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Transformer models failed testgpuerror on the CoreML/ANE FP16 path: the ANE accumulates FP16 matmuls AND convs in FP16 (unlike OpenCL/CUDA/TRT, which accumulate in FP32), so wide/deep transformers lose too much precision and miss the thresholds at larger board sizes. BF16 is not an option (no compute path in CoreML: cast op, ArrayFeatureType and MLMultiArray all lack bf16; coremltools confirms FLOAT16/FLOAT32 only). Follow KataGo's FP16 convention (spatial convs FP16, non-spatial FP32), channel-gated for the ANE since every FP32 op runs off the FP16-only ANE: - RMSNorm reduction cores: FP32 in FP16 mode (always). - Non-spatial (FFN/Q-K-V proj/pooling/matmul): FP32 (always). MIL `linear` needs const weight/bias so it can't runtime-cast; only `matmul` is wrapped. - Convs: FP32 only for wide trunks (>= 320ch); narrower keep convs on-ANE. - Narrow trunks (< 256ch) sit on the testgpuerror thresholds and no partial FP32 config passes all board sizes (islands cast back to FP16 leave a noisy FP16 spatial stream); build them fully FP32 (off-ANE, cheap since small). Weights stay FP16-stored via runtime up-casts, except full-FP32 models. Add per-weight FP32 serialization (WeightEntry.is_fp32) so a const declared FP32 inside an otherwise-FP16 model is stored FP32 (fixes the load-time "storage and type have different number of elements" abort and enables the full-FP32 tier). Also fixes addFloatScalarConstOp keying storage off m_use_fp16 instead of the declared m_weight_dtype. Result: all 4 transformer test models (b10c384h6/b4c256h4/b7c96h3/ b7c96h6kv3-GQA) pass testgpuerror on ANE FP16 at sizes 9/13/19; runtests and runnnlayertests pass. All changes gated on m_use_fp16; FP32 mode unchanged. The 256/320 channel thresholds are width heuristics validated on these models. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Three near-identical blocks wrapped global pooling in FP32 (policy head, value head, gpool residual block): cast input/mask up to FP32, flip m_weight_dtype, pool, restore, cast pooled features back to FP16 - with inconsistent save-variable names and one site using castFixed vs addCastOp for the output cast. Extract a single addGlobalPoolingFp32(input, mask, channels, output, valueVariant) helper and a small RAII ScopedFp32 guard for the temporary m_weight_dtype flip. The three call sites become one-liners. Behavior-preserving: same emitted op sequence; testgpuerror output is byte-identical across all precision tiers (partial-FP32 b10c384h6, full-FP32 b7c96h3, non-spatial-FP32 b4c256h4), all 12 transformer gate runs pass, runtests and runnnlayertests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The width-keyed precision tiers (commit 3839e52) forced FP32 ops off the FP16-only ANE on plain production convnets, not just transformers. b18c384nbt ran ~2.6x slower on the ANE path (160 vs 416 visits/s) with no accuracy benefit. The dominant cost is the per-block global-pooling FP32 (non-spatial), which breaks the ANE pipeline once per gpool-residual block; conv-FP32 is secondary. Add a recursive blocksContainTransformer() helper and gate all three escalations (full-FP32, non-spatial-FP32, conv-FP32) on transformer-block presence. Convnets now run pure FP16 on the ANE (the long-standing pre-tier path); for transformer models the added "&& hasTransformer" is always true, so their emitted MIL is byte-identical and behavior is unchanged. Verified on the ANE FP16 path: b18c384nbt testgpuerror passes (winrate 99%=0.57%, max=0.87%) and recovers full throughput (424 visits/s); b28c512nbt passes (99%=0.41%); all 4 transformer test models x sizes 9/13/19 pass with numbers byte-identical to before; runtests and runnnlayertests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…former-silu-b10c384h6
makeRopeTables allocated cosBuf/sinBuf with UnsafeMutablePointer.allocate and handed them to Data(floatsNoCopy:), which uses deallocator: .none, so the buffers were never freed -- a leak on every graph build (per attention block, per board size). Unlike the other floatsNoCopy callers (weights/gamma/beta), which point at C++-descriptor memory that lives for the model's lifetime, these tables have no persistent owner. Switch to managed [Float32] arrays and copy into the Data via Data(buffer:) so MPSGraph owns the bytes -- avoids both the leak and a use-after-free that a naive deallocate() on the no-copy path would cause. Output-neutral: testgpuerror on the GQA + learnable-RoPE model (b7c96h6kv3qk32v16tflrs, board 19) vs Eigen FP32 reference matches to 0.00028% max winrate error over 2247 positions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Metal forward pass (metallayers.swift TransformerFFNBlock) only implements the SwiGLU path (SiLU(linear1) * gate). A non-SwiGLU model carries no gate weights, so building the Swift descriptor from the empty linearGate would crash obscurely (or silently misbehave). Eigen (eigenbackend.cpp) and CoreML (katagocoreml MILBuilder) both throw a clear "non-SwiGLU transformer FFN not supported" error in this case; the Metal GPU path had no such guard. Add the matching StringError at the FFN descriptor conversion so all three backends fail loudly and consistently. No behavior change for any current model (all use useSwiGLU=true): the guard sits on an untaken path. Verified the SwiGLU model b10c384h6nbttflrs still passes testgpuerror on both GPU (0.00005% winrate) and ANE (unchanged from baseline); runtests and runnnlayertests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The trunk-tip dispatch in metallayers.swift compared trunkNormKind against the literal 1, while the rest of the codebase uses the named constants from desc.h (TRUNK_NORM_KIND_STANDARD/_RMSNORM). Add matching Swift constants and use TRUNK_NORM_KIND_RMSNORM at the comparison site. Pure literal-to-named-constant rename; no behavior change. Verified both branches still pass testgpuerror at GPU-level accuracy: RMSNorm tip (b10c384h6nbttflrs) 0.00005% winrate, BatchNorm tip (b7c96h6kv3 GQA) 0.00003%; runtests and runnnlayertests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…into feature/coreml-conversion-memory-levers (PR lightvector#1202) Brings Metal GPU + CoreML/ANE transformer-net support (SiLU, GQA, learnable RoPE, RMSNorm, SwiGLU) onto the CoreML-conversion memory branch (non-owning FloatView weight views, streaming parser, ANE weight release). All four conflicts were in the katagocoreml converter, from one root cause: lightvector#1202 reworked weight registration (non-owning FloatView views, registerOwnedWeight, deleted rvalue overloads) while lightvector#1205 added per-weight FP32 precision tiers (is_fp32). Resolved to keep both: - Operations.{hpp,cpp}: registerWeight stamps entry.is_fp32; keep the non-owning view plus registerOwnedWeight. - WeightSerializer.cpp: keep the FloatView-based count plus per-weight store_fp16 = use_fp16 && !is_fp32. - MILBuilder.cpp addConstOp: keep the emitConstOp refactor and pass is_fp32 = (m_weight_dtype == FLOAT32). A follow-up commit retrofits lightvector#1205's transformer derived consts to lightvector#1202's owned-weight + FP32-marking contract (required for the FP16 ANE path). Verified after the follow-up fix: katago runtests and runnnlayertests pass, and testgpuerror vs fresh Eigen FP32 references passes on convnet + all three transformer nets on both the Metal GPU and CoreML/ANE paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…contract The transformer attention builder emits four function-local std::vector<float> tensors: RoPE cos/sin tables, the rotation matrix R, and per-head out-projection weight slices. After merging the transformer support onto the FloatView branch, these needed two fixes: 1. Dangling view. lightvector#1202 made WeightEntry::data a non-owning FloatView, so addConstOp registers a view whose backing buffer must outlive serialization. These locals were passed to addConstOp and would dangle once the build function returns (serialization runs afterwards). Route them through addOwnedConstOp so KataGoOps owns the buffer until serialization. (Under lightvector#1205's owning WeightEntry they were copied, so this only surfaces post-merge.) 2. dtype mismatch. emitConstOp declares each const's dtype as m_weight_dtype, but addOwnedConstOp / registerOwnedWeight stored at the global mode (is_fp32 hardcoded false). In an FP16 model these derived consts land in the attention / value-head FP32 sub-region (m_weight_dtype == FLOAT32), so they were declared FP32 but stored FP16. CoreML/ANE then rejects the model at load ("Metadata data type does not match requested type", BNNS error -14), which SIGABRT'd every FP16 ANE transformer. Thread is_fp32 through registerOwnedWeight and have addOwnedConstOp pass is_fp32 = (m_weight_dtype == FLOAT32), mirroring addConstOp so the stored dtype always matches the declared dtype. This also fixes the same latent mismatch for addLinearOp's transposed value-head weights. Verified with testgpuerror against fresh Eigen FP32 references: b7c96h3tfrs and b7c96h6gqa, which previously SIGABRT'd on the FP16 ANE path, now load and match to <0.0005% winrate; convnet ANE output is byte-identical and the Metal GPU path is unchanged. katago runtests and runnnlayertests also pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
Author
|
Completed. Plan to set this PR to ready when #1205 is merged. |
ChinChangYang
added a commit
to ChinChangYang/KataGo
that referenced
this pull request
Jun 4, 2026
…contract The transformer attention builder emits four function-local std::vector<float> tensors: RoPE cos/sin tables, the rotation matrix R, and per-head out-projection weight slices. After merging the transformer support onto the FloatView branch, these needed two fixes: 1. Dangling view. lightvector#1202 made WeightEntry::data a non-owning FloatView, so addConstOp registers a view whose backing buffer must outlive serialization. These locals were passed to addConstOp and would dangle once the build function returns (serialization runs afterwards). Route them through addOwnedConstOp so KataGoOps owns the buffer until serialization. (Under lightvector#1205's owning WeightEntry they were copied, so this only surfaces post-merge.) 2. dtype mismatch. emitConstOp declares each const's dtype as m_weight_dtype, but addOwnedConstOp / registerOwnedWeight stored at the global mode (is_fp32 hardcoded false). In an FP16 model these derived consts land in the attention / value-head FP32 sub-region (m_weight_dtype == FLOAT32), so they were declared FP32 but stored FP16. CoreML/ANE then rejects the model at load ("Metadata data type does not match requested type", BNNS error -14), which SIGABRT'd every FP16 ANE transformer. Thread is_fp32 through registerOwnedWeight and have addOwnedConstOp pass is_fp32 = (m_weight_dtype == FLOAT32), mirroring addConstOp so the stored dtype always matches the declared dtype. This also fixes the same latent mismatch for addLinearOp's transposed value-head weights. Verified with testgpuerror against fresh Eigen FP32 references: b7c96h3tfrs and b7c96h6gqa, which previously SIGABRT'd on the FP16 ANE path, now load and match to <0.0005% winrate; convnet ANE output is byte-identical and the Metal GPU path is unchanged. katago runtests and runnnlayertests also pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 8481a94)
ChinChangYang
added a commit
to ChinChangYang/KataGo
that referenced
this pull request
Jun 4, 2026
The cherry-picked per-struct releaseWeights() refactor (44342a3/98b17ebb) predates this branch's MLX transformer port, so it only added releaseWeights() to the non-transformer descriptors. Extend the coverage to the transformer descriptors present on this branch (RMSNormLayerDesc, TransformerRMSNormDesc, TransformerAttentionDesc incl. ropeFreqs, TransformerFFNDesc) and handle TRANSFORMER_ATTENTION_BLOCK_KIND / TRANSFORMER_FFN_BLOCK_KIND plus trunkTipRMSNorm in the trunk release walk. Without this, releasing weights on a transformer model would hit ASSERT_UNREACHABLE. This makes desc.cpp/desc.h byte-identical to the lightvector#1202 feature branch. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ChinChangYang
added a commit
to ChinChangYang/KataGo
that referenced
this pull request
Jun 4, 2026
Port the lightvector#1202 ANE steady-state memory lever to the MLX backend. Add ComputeContext::aneOnly, set in createComputeContext when every configured device index is MLX_MUX_ANE, and call ModelDesc::releaseWeights() in convertAndCreateCoreMLOnlyHandleMLX after the model has been converted to CoreML on disk. Safe because: the ANE path re-reads the model from modelPath (not the in-memory weight arrays); the ComputeHandle ctor takes the MLX_MUX_ANE early-return before building any MLX/GPU model (the only weight-array consumer); only scalar dims are read afterward, which releaseWeights() preserves; and it runs under computeHandleMutex. Mirrors the Metal backend's aneOnly release. GPU path unaffected (aneOnly is false whenever any thread uses MLX_MUX_GPU). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Draft. Cuts memory during the on-device KataGo→CoreML conversion and while running the ANE/CoreML path, with byte-identical converter output. Three changes (first commit):
SetSerializationDeterministic) so output is byte-stable.ModelDesc::releaseWeights()frees the in-memory weight arrays (keeping scalar shape metadata). The Metal backend calls it on the ANE (CoreML) path after converting from the model file on disk, gated by a newComputeContext::aneOnlyflag so it only fires when every configured device is ANE — the GPU/MPSGraph path keeps its weights. Serialized undercomputeHandleMutex; only scalar dims are read afterward.A follow-up commit (Enforce non-owning weight-view contract at compile time) hardens the first bullet:
WeightEntrynow stores a non-owning view, so its backingstd::vectormust outlive serialization. The rvalue overloads ofaddConstOp/registerWeightare= deleted so a temporary can never bind to the view path (it would dangle) — temporaries are forced through the owningaddOwnedConstOp/registerOwnedWeight. No codegen change: every existing call site passes an lvalue model member, verified by a full METAL rebuild plus a negative-compile test confirming the deleted overload is selected for a temporary.Merged: Metal GPU + CoreML/ANE transformer support (#1205)
This branch now merges #1205 (transformer-net support — SiLU, GQA, learnable RoPE, RMSNorm, SwiGLU — for the Metal GPU and CoreML/ANE paths) so the two efforts reconcile in one place. All four merge conflicts were in the katagocoreml converter and came from a single root: this PR's non-owning
FloatViewweight registration vs #1205's per-weight FP32 precision tiers (is_fp32). Both are kept (registerWeightstampsentry.is_fp32;WeightSerializerstores fp16 iffuse_fp16 && !is_fp32;addConstOpkeeps theemitConstOprefactor and passesis_fp32 = (m_weight_dtype == FLOAT32)).The merge required one integration fix (commit Conform CoreML transformer derived consts to the owned-weight + FP32 contract). The transformer attention builder emits four function-local tensors — RoPE cos/sin tables, the rotation matrix
R, and per-head out-projection weight slices. After this PR madeWeightEntry::dataa non-owning view, those locals must:addOwnedConstOp(withstd::move) so the view doesn't dangle once the build function returns (serialization runs afterwards); andaddConstOp, becauseemitConstOpdeclares each const's dtype asm_weight_dtype. Without (2) a derived const that lands in the attention/value-head FP32 sub-region of an FP16 model is declared FP32 but stored FP16, which CoreML/ANE rejects at load ("Metadata data type does not match requested type", BNNS error −14), crashing every FP16-ANE transformer.addOwnedConstOp/registerOwnedWeightnow threadis_fp32 = (m_weight_dtype == FLOAT32), mirroringaddConstOpso stored dtype always matches the declared dtype. This also fixes the same latent mismatch foraddLinearOp's transposed value-head weights. The streaming parser needed no change (the newparseTransformer*use the unchangedreadFloats), andModelDesc::releaseWeights()already covers the transformer descriptors (includingropeFreqs).Measured (19×19, ANE path)
Resident-memory A/B on the CoreML/ANE path (
metalDeviceToUseThread0=100), cold conversion every run, before vs after this PR's three memory levers. before is this branch with the levers removed (the #1205 transformer tip39f82f6d— no non-owning views, no streaming parser, noaneOnlyrelease); after is HEAD. Peak RSS is the/usr/bin/time -lmaximum over load+convert; idle steady-state isps-sampled after the model is loaded and released. Numbers are GB, reproduced across two runs (<0.1% spread).b40c256)b18c384nbt)b10c384h6nbttflrs)The peak win is largest on the classical 40b, where the conversion transients the levers target (duplicate owning weight copies + whole-gzip decompression) dominate; the idle win is largest on the 18b nbt, which has the most resident FP32 weight for the ANE
releaseWeightsto free.Why the transformer's peak reduces less (−13%), and why its absolute peak is highest despite the smallest file. Both are expected and not a regression — verified by instrumented byte-accounting at serialization time:
addOwnedConstOp), so the non-owning-view lever can dedup ~0% of them, versus ~100% of a conv net's weights (the conv nets carry only ~0.07 MB of owned tensors total). (ii) Its 38 MB file makes the streaming parser's removed whole-decompress transient small (~42 MB, vs ~187 MB for the 40b). Combined, the levers free ~80 MB here vs ~365 MB on the 40b.Cross-backend parity vs an Eigen reference is unchanged on both the GPU and ANE paths.
Test plan
./katago runtests(All tests passed) +./katago runnnlayertests(14 configurations)./katago testgpuerrorvs Eigen reference — GPU and ANE paths both within tolerance, no NaNaneOnlyfalse)rungpuerrortest.shregression vs current Eigen FP32 references: Metal GPU 37/37 pass; CoreML/ANE 31/31 supported configs pass (the 6 "failures" are the expected pre-v8 SIGABRTs —run4/grun50/g103, nets predating CoreML-converter support). The two transformer nets the merge had broken on the FP16 ANE path (b7c96h3tfrs,b7c96h6kv3qk32v16GQA) now load and match the reference to <0.0005% winrate; convnet ANE output is byte-identical and the GPU path is unchanged.Note
The CI cache fix that was previously bundled here has been split into its own PR: #1204 (orthogonal fix for a stale-cache link failure on the
build-macos-metaljob). This PR is now the memory work plus the merged #1205 transformer support.🤖 Generated with Claude Code