Skip to content

Support file-based serialize/deserialize for DiskANN indices using Save/Load traits#1079

Open
suhasjs wants to merge 28 commits into
mainfrom
mhildebr/saveload
Open

Support file-based serialize/deserialize for DiskANN indices using Save/Load traits#1079
suhasjs wants to merge 28 commits into
mainfrom
mhildebr/saveload

Conversation

@suhasjs

@suhasjs suhasjs commented May 16, 2026

Copy link
Copy Markdown
Contributor
  • Does this PR have a descriptive title that could go in our release notes?
  • Does this PR add any new dependencies?
  • Does this PR modify any existing APIs?
  • Is the change to the API backwards compatible?
  • Should this result in any changes to our documentation, either updating existing docs or adding new ones?

Reference Issues/PRs

Initial attempt at #737

What does this implement/fix? Briefly explain your changes.

Introduces two new traits (diskann_record::save::Save and diskann_record::load::Load)and a new create diskann-record to support file-based serialization + deserialization. This PR also implements the traits for DiskANNIndex (and all nested structs) for a successful prototype.
Sample output can be found here.

Any other comments?

This PR is meant to be in a draft state until we're happy with the abstractions and the results.

Mark Hildebrand and others added 21 commits April 15, 2026 14:32
…borrow for Writer::finish() since it calls std::io::Write::flush() that needs a mutable borrow
… wherever needed + modify writer return type to Result<>
… diskann-providers with a new SingleUseWriteProvider (also for reads with SingleUseReadProvider)
…ad_from_bin<> to only use one read since VectorDataIterator already has the metadata needed for
@suhasjs suhasjs added this to the 2026-05 milestone May 16, 2026
@suhasjs suhasjs self-assigned this May 16, 2026
@suhasjs suhasjs added the enhancement New feature or request label May 16, 2026
@suhasjs suhasjs moved this to In Progress in DiskANN backlog May 16, 2026

@hildebrandmw hildebrandmw left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Suhas! My comments this round are mainly about enum handling. I think we can do this more cleanly by having enums "handle themselves" rather than lifting it into a somewhat fragile interaction in the Save/Load traits.

Other than that, the code inside diskann-record could use considerable hardening and testing.

Do we want to plan out how plugins like VFS would work in this PR, or save that for the future? I ask because VFS is used pretty heavily in tests, so having a replacement that's compliant with this new system will be required. And it will be a good stress test that we can expand the backend serialization to other formats in the future.

Comment thread diskann-record/src/save/mod.rs Outdated
Comment thread diskann-record/src/load/error.rs
Comment thread diskann-record/src/load/context.rs
Comment thread diskann-providers/src/storage/record_shim.rs
Comment thread diskann-record/src/save/value.rs
@arrayka arrayka linked an issue May 19, 2026 that may be closed by this pull request
suhasjs added 3 commits May 21, 2026 15:28
Resolve conflicts in Cargo.toml (versions + members), diskann-vector/Cargo.toml (dev-deps), diskann-providers/src/storage/bin.rs (single-read load_from_bin), diskann-providers/src/index/wrapped_async.rs (combine tests), diskann/src/graph/index.rs (combine impls + tests), and port FastMemoryVectorProviderAsync / MemoryVectorProviderAsync save/load impls to main's T: VectorRepr (formerly Data: GraphDataType in this branch).
@codecov-commenter

codecov-commenter commented May 27, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 78.64476% with 416 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.25%. Comparing base (e2dc9a0) to head (9bfa925).

Files with missing lines Patch % Lines
diskann-providers/src/storage/record_shim.rs 70.17% 68 Missing ⚠️
diskann-record/src/save/value.rs 71.25% 48 Missing ⚠️
diskann-record/src/load/context.rs 75.95% 44 Missing ⚠️
diskann-record/src/load/error.rs 51.42% 34 Missing ⚠️
diskann-record/src/lib.rs 86.61% 32 Missing ⚠️
diskann-record/src/save/context.rs 67.02% 31 Missing ⚠️
...roviders/src/model/graph/provider/async_/common.rs 73.19% 26 Missing ⚠️
diskann-record/src/save/error.rs 0.00% 24 Missing ⚠️
diskann/src/graph/config/mod.rs 83.84% 21 Missing ⚠️
diskann-record/src/number.rs 57.14% 18 Missing ⚠️
... and 12 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1079      +/-   ##
==========================================
- Coverage   89.46%   89.25%   -0.22%     
==========================================
  Files         482      493      +11     
  Lines       91082    93018    +1936     
==========================================
+ Hits        81491    83021    +1530     
- Misses       9591     9997     +406     
Flag Coverage Δ
miri 89.25% <78.64%> (-0.22%) ⬇️
unittests 88.90% <78.64%> (-0.22%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
diskann-providers/src/index/wrapped_async.rs 64.22% <100.00%> (+4.34%) ⬆️
diskann-providers/src/storage/bin.rs 93.65% <62.50%> (+1.58%) ⬆️
diskann-record/src/load/mod.rs 94.11% <94.11%> (ø)
...ers/src/model/configuration/index_configuration.rs 94.00% <92.06%> (+3.19%) ⬆️
...aph/provider/async_/fast_memory_vector_provider.rs 94.67% <93.90%> (-0.27%) ⬇️
.../src/model/graph/provider/async_/inmem/provider.rs 91.70% <95.00%> (+1.10%) ⬆️
...el/graph/provider/async_/memory_vector_provider.rs 96.62% <92.06%> (-1.66%) ⬇️
diskann/src/graph/config/experimental.rs 89.79% <77.27%> (-10.21%) ⬇️
diskann/src/graph/index.rs 95.91% <66.66%> (-0.25%) ⬇️
.../graph/provider/async_/simple_neighbor_provider.rs 94.44% <93.25%> (-0.03%) ⬇️
... and 13 more

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@hildebrandmw hildebrandmw left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more drive-by comments.

Comment thread diskann-record/HANDOFF.md

Keys starting with `$` are reserved for infrastructure (`$version`, `$variant`,
`$handle`). Since Rust identifiers can't start with `$`, the `save_fields!` macro is
inherently safe — no runtime check needed in the macro path.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This struct can be created without any issues:
pub struct Test {
    version: String,
    handle: String,
    variant: String,
}
  1. The choice of programming language shouldn’t determine the field names in a public contract.

Comment thread diskann-record/sample_output/manifest.json Outdated
Comment thread diskann-record/HANDOFF.md
}
```

The lifetime `'a` allows the save path to borrow from the data being saved (zero-copy

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a measurable benefit of introducing a lifetime here? I would say that config load/save is outside of hot path or at least its contribution is negligible to overall RAM/CPU usage, since config is tiny.

Comment thread diskann-record/src/load/context.rs
Comment thread diskann-record/HANDOFF.md
Comment thread diskann-record/sample_output/manifest.json Outdated
…/Deserialize<> for Version; updated sample_output
}
}

impl diskann_record::load::Load<'_> for Config {

@arrayka arrayka May 28, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Config::load bypasses its own Builder.

Config::load directly constructs Self { pruned_degree, max_degree, ... } from deserialized fields — bypassing the Builder which enforces invariants like
max_degree >= pruned_degree and alpha >= 1.0.

A corrupted or hand-edited manifest with max_degree: 4, pruned_degree: 64
would load successfully and produce a Config that Builder would have
rejected, silently causing excessive pruning or algorithmic misbehavior.

The loader should feed values through Builder or at minimum call a validation method.

Open Question: Should deserialization bypass safety checks?

  • Option A — Yes, separate ::load(): Accept a dedicated load path that
    skips the builder. Risk: two constructors accepting the same parameters in
    different forms (struct fields vs. serialized DTO), with only one enforcing
    invariants.

  • Option B — No, chain through the constructor: The domain type exposes an
    API to create itself and an API to export the values needed to recreate
    itself. A separate adapter is responsible for persisting those values
    (including versioning and format mapping) and calling the validated
    constructor on load. This keeps the contract between DTO and public API
    explicit.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guiding principles during this review:

  1. Keep code simple. Extend functionality by adding new code, not modifying
    existing code.

  2. Separate concerns clearly.

    • Domain types must support save/restore of their state so it can be
      recreated later.
    • How that state is persisted (format, versioning, I/O) is not the
      domain type's responsibility.

Let's take Disk-index checkpointing as an example:

// Domain type owns state, exposes it through public API
let builder = DiskIndexBuilder::new(config);
let offset = builder.save();                    // returns resumption state
let builder = DiskIndexBuilder::new_from_offset(config, offset);  // validates on entry

Responsibilities are clean:

  • DiskIndexBuilder can stop and return the metadata needed to resume (offset). Restore process is still safe, because offset is validated.
  • The caller decides how to persist config and offset (format, versioning, storage).

The proposed design would require adding next methods to the builder:

// Domain type now also owns serialization mapping
builder.save(SaveContext) -> Value       // new: format-specific logic on domain type
DiskIndexBuilder::load(Object) -> Self   // new: bypasses new_from_offset() validation

This asks the domain type to handle serialization mapping and
versioning/backward-compatibility — responsibilities that belong in an external
adapter.

The adapter should deserialize, then call
DiskIndexBuilder::new_from_offset(config, offset) — reusing the validated
constructor.

type TestNeighborProvider = SimpleNeighborProviderAsync<u32>;

fn round_trip_helper(provider: &TestNeighborProvider) -> TestNeighborProvider {
let dir = tempfile::tempdir().expect("tempdir");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test should work against VFS

P: StorageWriteProvider,
{
storage::bin::save_to_bin(self, provider, path)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to understand why we add responsibility of saving data to a file to the fast memory provider, when it already implements GetData trait needed by storage::bin::save_to_bin().

Comment thread diskann-record/Cargo.toml
[dependencies]
anyhow.workspace = true
serde = { workspace = true, features = ["derive"] }
serde_json.workspace = true

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

serde_json propagates to diskann and diskann-vector crates.

Serialization should belong in an adapter layer, not in domain crates:

  • Domain types stay format-agnostic; adapters own the format.
  • Inner crates must not depend on serialization frameworks.

diskann-vector (Tier 1) and diskann (Tier 3) now depend on
diskann-record. Every domain type below has impl Save/impl Load directly
on it, embedding format-specific logic (constructing Records, inserting
Value::Null tags, calling single_key(), etc.).

Crate / file Type(s) Lines
diskann-vector/src/distance/metric.rs Metric 112–151
diskann/src/graph/config/mod.rs Config, PruneKind, IntraBatchCandidates 712–869
diskann/src/graph/config/experimental.rs InsertRetry 82–125
diskann/src/graph/index.rs DiskANNIndex<DP> 2948–2987

If diskann-record were swapped for another format, all of these files — plus
the 6 provider-side files — would need modification (10+ files across 3 crates).

Example: Metric (a Tier 1 type in diskann-vector) now has impl Save and impl Load directly on the domain type, pulling a serialization framework into a foundation crate. Same for Config, PruneKind, IntraBatchCandidates, InsertRetry, and DiskANNIndex in the core diskann crate.

Recommendation: Move all impl Save/Load out of diskann-vector and
diskann into a dedicated adapter crate (e.g. diskann-record-adapters) or
consolidate them in diskann-providers. The adapter depends on both the domain
types and diskann-record; the domain crates keep zero serialization
dependencies.

}
}

impl diskann_record::load::Load<'_> for Config {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guiding principles during this review:

  1. Keep code simple. Extend functionality by adding new code, not modifying
    existing code.

  2. Separate concerns clearly.

    • Domain types must support save/restore of their state so it can be
      recreated later.
    • How that state is persisted (format, versioning, I/O) is not the
      domain type's responsibility.

Let's take Disk-index checkpointing as an example:

// Domain type owns state, exposes it through public API
let builder = DiskIndexBuilder::new(config);
let offset = builder.save();                    // returns resumption state
let builder = DiskIndexBuilder::new_from_offset(config, offset);  // validates on entry

Responsibilities are clean:

  • DiskIndexBuilder can stop and return the metadata needed to resume (offset). Restore process is still safe, because offset is validated.
  • The caller decides how to persist config and offset (format, versioning, storage).

The proposed design would require adding next methods to the builder:

// Domain type now also owns serialization mapping
builder.save(SaveContext) -> Value       // new: format-specific logic on domain type
DiskIndexBuilder::load(Object) -> Self   // new: bypasses new_from_offset() validation

This asks the domain type to handle serialization mapping and
versioning/backward-compatibility — responsibilities that belong in an external
adapter.

The adapter should deserialize, then call
DiskIndexBuilder::new_from_offset(config, offset) — reusing the validated
constructor.

@suhasjs suhasjs marked this pull request as ready for review June 9, 2026 15:09
@suhasjs suhasjs requested review from a team and Copilot June 9, 2026 15:09

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new workspace crate (diskann-record) that provides a versioned, file-backed save/load framework (JSON manifest + side-car binary artifacts) and wires it into DiskANN’s Rust components to prototype end-to-end index serialization/deserialization (including providers and key config/enum types).

Changes:

  • Add diskann-record crate implementing save::Save/load::Load (+ Saveable/Loadable), a manifest schema (Value, Record, $version, $handle), and disk entry points (save_to_disk, load_from_disk).
  • Implement Save/Load for DiskANNIndex, core config types, vector distance Metric, and multiple providers (vector + neighbor + default provider), plus round-trip tests.
  • Add “single-use” read/write storage shims to bridge existing byte-oriented storage helpers with the new record-based reader/writer model.

Reviewed changes

Copilot reviewed 31 out of 34 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
diskann/src/graph/index.rs Implements diskann_record Save/Load for DiskANNIndex (persisting config + data provider).
diskann/src/graph/config/mod.rs Adds Save/Load for Config, PruneKind, IntraBatchCandidates + round-trip tests.
diskann/src/graph/config/experimental.rs Adds Save/Load for experimental::InsertRetry.
diskann/Cargo.toml Adds diskann-record dependency and tempfile for tests.
diskann-vector/src/distance/metric.rs Adds Save/Load for Metric + manifest-tamper test; minor TryFrom<i32> refactor.
diskann-vector/Cargo.toml Adds diskann-record and tempfile (dev) dependencies.
diskann-record/src/version.rs Defines Version wire encoding ("major.minor.patch").
diskann-record/src/save/value.rs Defines manifest Value/Record/Handle wire types and serde behavior.
diskann-record/src/save/mod.rs Defines save-side traits, entry point, primitive impls, and save_fields! macro.
diskann-record/src/save/error.rs Implements save-side error wrapper.
diskann-record/src/save/context.rs Implements save-side context, artifact writer allocation, and atomic manifest finalize.
diskann-record/src/number.rs Adds lossless numeric container for JSON numbers + safe narrowing conversions.
diskann-record/src/load/mod.rs Defines load-side traits, entry point, primitive impls, and load_fields! macro.
diskann-record/src/load/error.rs Implements load-side error wrapper + recoverable/critical classification (Kind).
diskann-record/src/load/context.rs Implements load-side context/object/array traversal + side-car artifact reading.
diskann-record/src/lib.rs Crate-level docs, reserved-key helper, and extensive round-trip + security tests.
diskann-record/sample_output/manifest.json Example manifest output for a saved index/provider tree.
diskann-record/HANDOFF.md Handoff/design notes for the new crate and current prototype state.
diskann-record/Cargo.toml New crate manifest (deps: anyhow/serde/serde_json; dev: tempfile).
diskann-providers/src/storage/record_shim.rs Adds single-use read/write storage providers to adapt record I/O to existing helpers.
diskann-providers/src/storage/mod.rs Exposes the new record shim types from the storage module.
diskann-providers/src/storage/bin.rs Adjusts vector loading to avoid double-opening (supports single-use readers).
diskann-providers/src/model/graph/provider/async_/simple_neighbor_provider.rs Adds Save/Load for neighbor provider via side-car graph.bin + round-trip tests.
diskann-providers/src/model/graph/provider/async_/memory_vector_provider.rs Adds Save/Load for in-memory vector provider via side-car vectors.bin + tests.
diskann-providers/src/model/graph/provider/async_/inmem/provider.rs Adds Save/Load for DefaultProvider composing sub-providers; adds round-trip tests.
diskann-providers/src/model/graph/provider/async_/fast_memory_vector_provider.rs Adds Save/Load for fast in-memory vector provider (persists metric + prefetch tunables) + tests.
diskann-providers/src/model/graph/provider/async_/common.rs Adds Save/Load for small leaf types (StartPoints, NoStore, NoDeletes, PrefetchCacheLineLevel) + tests.
diskann-providers/src/model/configuration/index_configuration.rs Adds Save/Load for IndexConfiguration (drops prefetch knobs on load) + tests.
diskann-providers/src/index/wrapped_async.rs Adds end-to-end index save/load/search parity test using diskann-record.
diskann-providers/Cargo.toml Adds dependency on diskann-record.
Cargo.toml Adds diskann-record to workspace members and workspace dependencies.
Cargo.lock Locks new crate + dependency edges (diskann-record, tempfile usage).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread diskann-record/src/save/context.rs
Comment thread diskann-record/src/save/value.rs
Comment thread diskann-record/HANDOFF.md
Comment thread diskann-record/HANDOFF.md
Comment thread diskann-record/src/load/error.rs
Comment thread diskann-record/src/load/error.rs
@harsha-simhadri harsha-simhadri modified the milestones: 2026-05, 2026-07 Jun 30, 2026
suhasjs added a commit that referenced this pull request Jun 30, 2026
…zation of DiskANN indexes (#1188)

<!--
Thanks for contributing a pull request! Please ensure you have taken a
look at
the contribution guidelines:
https://github.com/microsoft/DiskANN/blob/main/CONTRIBUTING.md
-->
- [x] Does this PR have a descriptive title that could go in our release
notes?
- [x] Does this PR add any new dependencies?
- [ ] Does this PR modify any existing APIs?
- [ ] Is the change to the API backwards compatible?
- [x] Should this result in any changes to our documentation, either
updating existing docs or adding new ones?

#### Reference Issues/PRs
Part 1/2 of #1079. Also see #737. 
<!--
Example: Fixes #1234. See also #3456.
Please use keywords (e.g., Fixes) to create link to the issues or pull
requests
you resolved, so that they will automatically be closed when your pull
request
is merged. See
https://github.com/blog/1506-closing-issues-via-pull-requests
-->

#### What does this implement/fix? Briefly explain your changes.
Introduces a new create `diskann-record` to support serialization +
deserialization for DiskANN indexes. It provides a small,
backend-agnostic framework for persisting structured Rust values as a
**versioned** manifest plus side-car binary artifacts, and reloading
them later. `diskann_record` is intended to be a new __foundational__
crate, so it only depends on `anyhow` (`serde` is feature-gated and only
used for on-disk formats).

Summary of changes:
- Introduces two new traits (`diskann_record::save::Save` and
`diskann_record::load::Load`), along with two macros `load_fields` and
`save_fields` for simple, plain `struct`s.
- Records carry `Version` that allow for durable indexes: `load_legacy`
explicitly supports loading serialized representations of older versions
of the same struct.
- Manifest is JSON-compatible and carries a list of artifact filenames
(if serialized using disk-backend)
- Implement two custom backends -- `Disk` backend that writes artifacts
to individual files, and a `Memory` backend that simply serializes to a
`Vec<u8>`. `Disk` backend depends on `serde`, whereas `Memory` does not.
- `diskann-record` is generic enough to allow saving with one backend
and loading with another provider. It boils down to how
`SaveContext`/`LoadContext` are implemented for each provider, except
for `Memory` provider -- it pipes the output of its `SaveContext`
directly into its `LoadContext`, so it never leaves in-process memory.

#### Any other comments?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

Document and extend index serialization format

7 participants