Support file-based serialize/deserialize for DiskANN indices using Save/Load traits#1079
Support file-based serialize/deserialize for DiskANN indices using Save/Load traits#1079suhasjs wants to merge 28 commits into
Conversation
…borrow for Writer::finish() since it calls std::io::Write::flush() that needs a mutable borrow
…ave::Results<Handle>
…w directly impls for Value<'static>
… wherever needed + modify writer return type to Result<>
… required for their Load/Save impls)
… diskann-providers with a new SingleUseWriteProvider (also for reads with SingleUseReadProvider)
…ad_from_bin<> to only use one read since VectorDataIterator already has the metadata needed for
…ghborProviderAsync
hildebrandmw
left a comment
There was a problem hiding this comment.
Thanks Suhas! My comments this round are mainly about enum handling. I think we can do this more cleanly by having enums "handle themselves" rather than lifting it into a somewhat fragile interaction in the Save/Load traits.
Other than that, the code inside diskann-record could use considerable hardening and testing.
Do we want to plan out how plugins like VFS would work in this PR, or save that for the future? I ask because VFS is used pretty heavily in tests, so having a replacement that's compliant with this new system will be required. And it will be a good stress test that we can expand the backend serialization to other formats in the future.
… that wraps anyhow::Error along with a recoverable bool to allow probing APIUs
…-> throws an error now
Resolve conflicts in Cargo.toml (versions + members), diskann-vector/Cargo.toml (dev-deps), diskann-providers/src/storage/bin.rs (single-read load_from_bin), diskann-providers/src/index/wrapped_async.rs (combine tests), diskann/src/graph/index.rs (combine impls + tests), and port FastMemoryVectorProviderAsync / MemoryVectorProviderAsync save/load impls to main's T: VectorRepr (formerly Data: GraphDataType in this branch).
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1079 +/- ##
==========================================
- Coverage 89.46% 89.25% -0.22%
==========================================
Files 482 493 +11
Lines 91082 93018 +1936
==========================================
+ Hits 81491 83021 +1530
- Misses 9591 9997 +406
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
hildebrandmw
left a comment
There was a problem hiding this comment.
A few more drive-by comments.
|
|
||
| Keys starting with `$` are reserved for infrastructure (`$version`, `$variant`, | ||
| `$handle`). Since Rust identifiers can't start with `$`, the `save_fields!` macro is | ||
| inherently safe — no runtime check needed in the macro path. |
There was a problem hiding this comment.
- This struct can be created without any issues:
pub struct Test {
version: String,
handle: String,
variant: String,
}
- The choice of programming language shouldn’t determine the field names in a public contract.
| } | ||
| ``` | ||
|
|
||
| The lifetime `'a` allows the save path to borrow from the data being saved (zero-copy |
There was a problem hiding this comment.
Is there a measurable benefit of introducing a lifetime here? I would say that config load/save is outside of hot path or at least its contribution is negligible to overall RAM/CPU usage, since config is tiny.
…/Deserialize<> for Version; updated sample_output
| } | ||
| } | ||
|
|
||
| impl diskann_record::load::Load<'_> for Config { |
There was a problem hiding this comment.
Config::load bypasses its own Builder.
Config::load directly constructs Self { pruned_degree, max_degree, ... } from deserialized fields — bypassing the Builder which enforces invariants like
max_degree >= pruned_degree and alpha >= 1.0.
A corrupted or hand-edited manifest with max_degree: 4, pruned_degree: 64
would load successfully and produce a Config that Builder would have
rejected, silently causing excessive pruning or algorithmic misbehavior.
The loader should feed values through Builder or at minimum call a validation method.
Open Question: Should deserialization bypass safety checks?
-
Option A — Yes, separate
::load(): Accept a dedicated load path that
skips the builder. Risk: two constructors accepting the same parameters in
different forms (struct fields vs. serialized DTO), with only one enforcing
invariants. -
Option B — No, chain through the constructor: The domain type exposes an
API to create itself and an API to export the values needed to recreate
itself. A separate adapter is responsible for persisting those values
(including versioning and format mapping) and calling the validated
constructor on load. This keeps the contract between DTO and public API
explicit.
There was a problem hiding this comment.
My guiding principles during this review:
-
Keep code simple. Extend functionality by adding new code, not modifying
existing code. -
Separate concerns clearly.
- Domain types must support save/restore of their state so it can be
recreated later. - How that state is persisted (format, versioning, I/O) is not the
domain type's responsibility.
- Domain types must support save/restore of their state so it can be
Let's take Disk-index checkpointing as an example:
// Domain type owns state, exposes it through public API
let builder = DiskIndexBuilder::new(config);
let offset = builder.save(); // returns resumption state
let builder = DiskIndexBuilder::new_from_offset(config, offset); // validates on entryResponsibilities are clean:
DiskIndexBuildercan stop and return the metadata needed to resume (offset). Restore process is still safe, because offset is validated.- The caller decides how to persist
configandoffset(format, versioning, storage).
The proposed design would require adding next methods to the builder:
// Domain type now also owns serialization mapping
builder.save(SaveContext) -> Value // new: format-specific logic on domain type
DiskIndexBuilder::load(Object) -> Self // new: bypasses new_from_offset() validationThis asks the domain type to handle serialization mapping and
versioning/backward-compatibility — responsibilities that belong in an external
adapter.
The adapter should deserialize, then call
DiskIndexBuilder::new_from_offset(config, offset) — reusing the validated
constructor.
| type TestNeighborProvider = SimpleNeighborProviderAsync<u32>; | ||
|
|
||
| fn round_trip_helper(provider: &TestNeighborProvider) -> TestNeighborProvider { | ||
| let dir = tempfile::tempdir().expect("tempdir"); |
There was a problem hiding this comment.
this test should work against VFS
| P: StorageWriteProvider, | ||
| { | ||
| storage::bin::save_to_bin(self, provider, path) | ||
| } |
There was a problem hiding this comment.
We need to understand why we add responsibility of saving data to a file to the fast memory provider, when it already implements GetData trait needed by storage::bin::save_to_bin().
| [dependencies] | ||
| anyhow.workspace = true | ||
| serde = { workspace = true, features = ["derive"] } | ||
| serde_json.workspace = true |
There was a problem hiding this comment.
serde_json propagates to diskann and diskann-vector crates.
Serialization should belong in an adapter layer, not in domain crates:
- Domain types stay format-agnostic; adapters own the format.
- Inner crates must not depend on serialization frameworks.
diskann-vector (Tier 1) and diskann (Tier 3) now depend on
diskann-record. Every domain type below has impl Save/impl Load directly
on it, embedding format-specific logic (constructing Records, inserting
Value::Null tags, calling single_key(), etc.).
| Crate / file | Type(s) | Lines |
|---|---|---|
diskann-vector/src/distance/metric.rs |
Metric |
112–151 |
diskann/src/graph/config/mod.rs |
Config, PruneKind, IntraBatchCandidates |
712–869 |
diskann/src/graph/config/experimental.rs |
InsertRetry |
82–125 |
diskann/src/graph/index.rs |
DiskANNIndex<DP> |
2948–2987 |
If diskann-record were swapped for another format, all of these files — plus
the 6 provider-side files — would need modification (10+ files across 3 crates).
Example: Metric (a Tier 1 type in diskann-vector) now has impl Save and impl Load directly on the domain type, pulling a serialization framework into a foundation crate. Same for Config, PruneKind, IntraBatchCandidates, InsertRetry, and DiskANNIndex in the core diskann crate.
Recommendation: Move all impl Save/Load out of diskann-vector and
diskann into a dedicated adapter crate (e.g. diskann-record-adapters) or
consolidate them in diskann-providers. The adapter depends on both the domain
types and diskann-record; the domain crates keep zero serialization
dependencies.
| } | ||
| } | ||
|
|
||
| impl diskann_record::load::Load<'_> for Config { |
There was a problem hiding this comment.
My guiding principles during this review:
-
Keep code simple. Extend functionality by adding new code, not modifying
existing code. -
Separate concerns clearly.
- Domain types must support save/restore of their state so it can be
recreated later. - How that state is persisted (format, versioning, I/O) is not the
domain type's responsibility.
- Domain types must support save/restore of their state so it can be
Let's take Disk-index checkpointing as an example:
// Domain type owns state, exposes it through public API
let builder = DiskIndexBuilder::new(config);
let offset = builder.save(); // returns resumption state
let builder = DiskIndexBuilder::new_from_offset(config, offset); // validates on entryResponsibilities are clean:
DiskIndexBuildercan stop and return the metadata needed to resume (offset). Restore process is still safe, because offset is validated.- The caller decides how to persist
configandoffset(format, versioning, storage).
The proposed design would require adding next methods to the builder:
// Domain type now also owns serialization mapping
builder.save(SaveContext) -> Value // new: format-specific logic on domain type
DiskIndexBuilder::load(Object) -> Self // new: bypasses new_from_offset() validationThis asks the domain type to handle serialization mapping and
versioning/backward-compatibility — responsibilities that belong in an external
adapter.
The adapter should deserialize, then call
DiskIndexBuilder::new_from_offset(config, offset) — reusing the validated
constructor.
There was a problem hiding this comment.
Pull request overview
This PR introduces a new workspace crate (diskann-record) that provides a versioned, file-backed save/load framework (JSON manifest + side-car binary artifacts) and wires it into DiskANN’s Rust components to prototype end-to-end index serialization/deserialization (including providers and key config/enum types).
Changes:
- Add
diskann-recordcrate implementingsave::Save/load::Load(+Saveable/Loadable), a manifest schema (Value,Record,$version,$handle), and disk entry points (save_to_disk,load_from_disk). - Implement Save/Load for
DiskANNIndex, core config types, vector distanceMetric, and multiple providers (vector + neighbor + default provider), plus round-trip tests. - Add “single-use” read/write storage shims to bridge existing byte-oriented storage helpers with the new record-based reader/writer model.
Reviewed changes
Copilot reviewed 31 out of 34 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| diskann/src/graph/index.rs | Implements diskann_record Save/Load for DiskANNIndex (persisting config + data provider). |
| diskann/src/graph/config/mod.rs | Adds Save/Load for Config, PruneKind, IntraBatchCandidates + round-trip tests. |
| diskann/src/graph/config/experimental.rs | Adds Save/Load for experimental::InsertRetry. |
| diskann/Cargo.toml | Adds diskann-record dependency and tempfile for tests. |
| diskann-vector/src/distance/metric.rs | Adds Save/Load for Metric + manifest-tamper test; minor TryFrom<i32> refactor. |
| diskann-vector/Cargo.toml | Adds diskann-record and tempfile (dev) dependencies. |
| diskann-record/src/version.rs | Defines Version wire encoding ("major.minor.patch"). |
| diskann-record/src/save/value.rs | Defines manifest Value/Record/Handle wire types and serde behavior. |
| diskann-record/src/save/mod.rs | Defines save-side traits, entry point, primitive impls, and save_fields! macro. |
| diskann-record/src/save/error.rs | Implements save-side error wrapper. |
| diskann-record/src/save/context.rs | Implements save-side context, artifact writer allocation, and atomic manifest finalize. |
| diskann-record/src/number.rs | Adds lossless numeric container for JSON numbers + safe narrowing conversions. |
| diskann-record/src/load/mod.rs | Defines load-side traits, entry point, primitive impls, and load_fields! macro. |
| diskann-record/src/load/error.rs | Implements load-side error wrapper + recoverable/critical classification (Kind). |
| diskann-record/src/load/context.rs | Implements load-side context/object/array traversal + side-car artifact reading. |
| diskann-record/src/lib.rs | Crate-level docs, reserved-key helper, and extensive round-trip + security tests. |
| diskann-record/sample_output/manifest.json | Example manifest output for a saved index/provider tree. |
| diskann-record/HANDOFF.md | Handoff/design notes for the new crate and current prototype state. |
| diskann-record/Cargo.toml | New crate manifest (deps: anyhow/serde/serde_json; dev: tempfile). |
| diskann-providers/src/storage/record_shim.rs | Adds single-use read/write storage providers to adapt record I/O to existing helpers. |
| diskann-providers/src/storage/mod.rs | Exposes the new record shim types from the storage module. |
| diskann-providers/src/storage/bin.rs | Adjusts vector loading to avoid double-opening (supports single-use readers). |
| diskann-providers/src/model/graph/provider/async_/simple_neighbor_provider.rs | Adds Save/Load for neighbor provider via side-car graph.bin + round-trip tests. |
| diskann-providers/src/model/graph/provider/async_/memory_vector_provider.rs | Adds Save/Load for in-memory vector provider via side-car vectors.bin + tests. |
| diskann-providers/src/model/graph/provider/async_/inmem/provider.rs | Adds Save/Load for DefaultProvider composing sub-providers; adds round-trip tests. |
| diskann-providers/src/model/graph/provider/async_/fast_memory_vector_provider.rs | Adds Save/Load for fast in-memory vector provider (persists metric + prefetch tunables) + tests. |
| diskann-providers/src/model/graph/provider/async_/common.rs | Adds Save/Load for small leaf types (StartPoints, NoStore, NoDeletes, PrefetchCacheLineLevel) + tests. |
| diskann-providers/src/model/configuration/index_configuration.rs | Adds Save/Load for IndexConfiguration (drops prefetch knobs on load) + tests. |
| diskann-providers/src/index/wrapped_async.rs | Adds end-to-end index save/load/search parity test using diskann-record. |
| diskann-providers/Cargo.toml | Adds dependency on diskann-record. |
| Cargo.toml | Adds diskann-record to workspace members and workspace dependencies. |
| Cargo.lock | Locks new crate + dependency edges (diskann-record, tempfile usage). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…zation of DiskANN indexes (#1188) <!-- Thanks for contributing a pull request! Please ensure you have taken a look at the contribution guidelines: https://github.com/microsoft/DiskANN/blob/main/CONTRIBUTING.md --> - [x] Does this PR have a descriptive title that could go in our release notes? - [x] Does this PR add any new dependencies? - [ ] Does this PR modify any existing APIs? - [ ] Is the change to the API backwards compatible? - [x] Should this result in any changes to our documentation, either updating existing docs or adding new ones? #### Reference Issues/PRs Part 1/2 of #1079. Also see #737. <!-- Example: Fixes #1234. See also #3456. Please use keywords (e.g., Fixes) to create link to the issues or pull requests you resolved, so that they will automatically be closed when your pull request is merged. See https://github.com/blog/1506-closing-issues-via-pull-requests --> #### What does this implement/fix? Briefly explain your changes. Introduces a new create `diskann-record` to support serialization + deserialization for DiskANN indexes. It provides a small, backend-agnostic framework for persisting structured Rust values as a **versioned** manifest plus side-car binary artifacts, and reloading them later. `diskann_record` is intended to be a new __foundational__ crate, so it only depends on `anyhow` (`serde` is feature-gated and only used for on-disk formats). Summary of changes: - Introduces two new traits (`diskann_record::save::Save` and `diskann_record::load::Load`), along with two macros `load_fields` and `save_fields` for simple, plain `struct`s. - Records carry `Version` that allow for durable indexes: `load_legacy` explicitly supports loading serialized representations of older versions of the same struct. - Manifest is JSON-compatible and carries a list of artifact filenames (if serialized using disk-backend) - Implement two custom backends -- `Disk` backend that writes artifacts to individual files, and a `Memory` backend that simply serializes to a `Vec<u8>`. `Disk` backend depends on `serde`, whereas `Memory` does not. - `diskann-record` is generic enough to allow saving with one backend and loading with another provider. It boils down to how `SaveContext`/`LoadContext` are implemented for each provider, except for `Memory` provider -- it pipes the output of its `SaveContext` directly into its `LoadContext`, so it never leaves in-process memory. #### Any other comments?
Reference Issues/PRs
Initial attempt at #737
What does this implement/fix? Briefly explain your changes.
Introduces two new traits (
diskann_record::save::Saveanddiskann_record::load::Load)and a new creatediskann-recordto support file-based serialization + deserialization. This PR also implements the traits forDiskANNIndex(and all nested structs) for a successful prototype.Sample output can be found here.
Any other comments?
This PR is meant to be in a draft state until we're happy with the abstractions and the results.