Support u64 vertex ids in the bftree provider by JordanMaples · Pull Request #1216 · microsoft/DiskANN

JordanMaples · 2026-06-30T21:28:51Z

What

Make the bf-tree provider generic over its vertex-id type so it can index past
the u32::MAX (~4.3B) ceiling, while keeping u32 as the default (no memory or
storage cost for existing datasets).

New id.rs — BfTreeId trait with u32 / u64 impls, from_index /
try_from_index, and validate_id_capacity.
provider.rs — BfTreeProvider<T, Q = QuantVectorProvider, I = u32>.
The vertex id type threads through construction, save, and load. The persisted
SavedParams records id_width; load validates it against the requested type
so a u32 index can't be silently reopened as u64 (or vice versa).
neighbors.rs / vectors.rs — neighbor cells and starting-point seeding
are generic over I, sized via size_of::<I>().

Why

Larger-than-memory, billion-scale datasets need vertex ids wider than u32.
Defaulting I = u32 keeps the common case unchanged; opting into u64 is a
single type parameter at the call site.

Hardening

The second commit adds defensive coverage for the wide-id path:

Capacity guard rejecting id counts that overflow the chosen width.
Save/load rejects an id_width mismatch instead of misreading data.
Tests for high-bit (> u32::MAX) ids and their low-bit collisions, plus a
u64 save/load round-trip.

Notes

Two commits, each builds and lints clean in isolation (clean bisect).
Based independently off main (does not include the lazy-init change in
the sibling PR).
Open design question: bf-tree's I is an internal id. Mark's inmem2
direction uses InternalId = u32 (hardcoded) + generic external id. Worth
aligning on whether bftree needs wide internal ids or should follow the same
internal-u32 / external-generic split before merge.

The provider hardcoded `u32` vertex ids, capping any bf-tree-backed index at ~4.29B vectors. That ceiling is at odds with the crate's purpose of supporting larger-than-memory, billion-scale-and-beyond datasets. Introduce a `BfTreeId` trait (`VectorId + IntoUsize` plus index-construction) implemented for `u32` and `u64`, and thread a new id type parameter `I` through `BfTreeProvider<T, Q, I = u32>` and all of its accessors, strategies, and save/load impls. The default `I = u32` keeps every existing caller (benchmark harness, tests) compiling and behaving identically, while `u64` is now available opt-in for indexes that exceed the 32-bit range. `iter()`/`IntoIterator` now map `0..total` through `I::from_index` instead of relying on `Range<u32>` (a generic `Range<I>` would require the unstable `Step` trait). `VectorProvider::starting_points` is likewise generic over the id type. The on-disk neighbor format follows the id width and is self- consistent on load (the width is not persisted, mirroring the metric). A new `test_quantized_index_search_u64_ids` builds and searches a `BfTreeProvider<_, _, u64>` index end-to-end to prove the u64 path is functional, not merely compilable. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Reinforce the newly-reachable u64 id path surfaced by review: - neighbors: size record key/value buffers by size_of::<I>() instead of hardcoded size_of::<u32>(), so u64 providers reserve correct record sizes. - id: add validate_id_capacity to reject vertex counts that overflow the id type, and guard new_empty with it. - provider: persist id_width in SavedParams (back-compat default of 4) and validate it on load, so a u64 index can't be silently loaded as u32. - tests: add u64 high-bit id coverage (ids/neighbors > u32::MAX, low-bit collision check) plus a u64 save/load round-trip, id-width mismatch rejection, and capacity-guard rejection. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR generalizes the diskann-bftree provider to support u64 vertex IDs (while keeping u32 as the default), enabling bf-tree indexes to exceed the u32::MAX vertex ceiling. It threads a new BfTreeId abstraction through the provider, neighbor storage, vector starting-point logic, and save/load metadata, and adds tests that exercise the wide-id path end-to-end.

Changes:

Introduces BfTreeId (u32/u64) and adds capacity validation to prevent silent truncation when minting IDs from indices.
Makes BfTreeProvider and related accessors/strategies generic over vertex-id type I, including neighbor record sizing via size_of::<I>().
Persists id_width in SavedParams and validates width/capacity on load; adds tests for u64 high-bit IDs and u64 save/load round-trips.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
diskann-bftree/src/id.rs	Adds `BfTreeId` trait + capacity validation helper for index-derived IDs.
diskann-bftree/src/lib.rs	Exposes the new `id` module and re-exports `BfTreeId`.
diskann-bftree/src/vectors.rs	Makes starting-point ID generation generic over `BfTreeId`.
diskann-bftree/src/neighbors.rs	Sizes bf-tree neighbor records based on `I` width; adds `u64` ID collision/round-trip test.
diskann-bftree/src/provider.rs	Makes `BfTreeProvider` generic over `I`; adds save/load `id_width` validation and `u64` path tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

            .quant_vector_provider_config
            .use_snapshot(params.use_snapshot);

+        crate::id::validate_id_capacity::<I>(params.max_points + params.num_start_points.get())?;


+            saved_params.id_width,
+        )));
+    }
+    crate::id::validate_id_capacity::<I>(saved_params.max_points + saved_params.frozen_points.get())


codecov-commenter · 2026-06-30T21:44:07Z

Codecov Report

❌ Patch coverage is 94.58484% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.83%. Comparing base (5b11031) to head (c861d22).

Files with missing lines	Patch %	Lines
diskann-bftree/src/provider.rs	94.59%	12 Missing ⚠️
diskann-bftree/src/vectors.rs	50.00%	2 Missing ⚠️
diskann-bftree/src/id.rs	95.83%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1216      +/-   ##
==========================================
+ Coverage   89.82%   89.83%   +0.01%     
==========================================
  Files         489      490       +1     
  Lines       93575    93794     +219     
==========================================
+ Hits        84051    84264     +213     
- Misses       9524     9530       +6

Flag	Coverage Δ
miri	`89.83% <94.58%> (+0.01%)`	⬆️
unittests	`89.49% <94.58%> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
diskann-bftree/src/lib.rs	`59.09% <ø> (ø)`
diskann-bftree/src/neighbors.rs	`96.83% <100.00%> (+0.15%)`	⬆️
diskann-bftree/src/id.rs	`95.83% <95.83%> (ø)`
diskann-bftree/src/vectors.rs	`90.05% <50.00%> (-0.48%)`	⬇️
diskann-bftree/src/provider.rs	`91.25% <94.59%> (+0.79%)`	⬆️

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

harsha-simhadri · 2026-06-30T22:01:13Z

Does u64 bloat adjacency list, and if so, is there a chance we can bring in a delta compression ?

JordanMaples · 2026-07-01T16:57:42Z

Does u64 bloat adjacency list, and if so, is there a chance we can bring in a delta compression ?

@harsha-simhadri yes, in u64 mode it doubles the adjacency-list storage (neighbor ids are stored inline), but that's explicit opt-in and there are no u64 consumers yet. I can tackle delta compression as a follow-up. Neighbor order doesn't affect search correctness, so we can sort ids and delta+varint them, which recovers essentially all of the u64 overhead.

harsha-simhadri · 2026-07-01T17:44:20Z

Does u64 bloat adjacency list, and if so, is there a chance we can bring in a delta compression ?

@harsha-simhadri yes, in u64 mode it doubles the adjacency-list storage (neighbor ids are stored inline), but that's explicit opt-in and there are no u64 consumers yet. I can tackle delta compression as a follow-up. Neighbor order doesn't affect search correctness, so we can sort ids and delta+varint them, which recovers essentially all of the u64 overhead.

thanks, could you create an issue for future follow up. thanks

hildebrandmw · 2026-07-01T18:27:31Z

-    pub fn iter(&self) -> std::ops::Range<u32> {
-        0..(self.full_vectors.total() as u32)
+    pub fn iter(&self) -> std::iter::Map<std::ops::Range<usize>, fn(usize) -> I> {
+        (0..self.full_vectors.total()).map(I::from_index as fn(usize) -> I)


This is probably not the move, nor is it a great precedent to set. The problem is that std::iter::Map<.., fn(usize) -> I> folds the function pointer into the map. In particular, a function pointer is (usually) not inlineable, so the resulting iterator has a large number of undesirable properties, including an indirect call for each evaluation. Godbolt Link.

It's probably worth baking more logic into the BfTreeId trait to avoid needing these kinds of inefficient fallbacks.

hildebrandmw · 2026-07-01T18:30:53Z

+/// (for billion-scale-and-beyond, larger-than-memory datasets). On a 64-bit
+/// target `u64` covers every representable `usize`, so its conversions never
+/// fail.
+pub trait BfTreeId: VectorId + IntoUsize {


Suggestion: Try to avoid IntoUsize as a trait bound and just make this an as_index method of BfTreeId. It's a little more defensive and self contained. It may also be a good idea to drop the VectorId bound as well and instead include the required traits directly, but that's perhaps less important since VectorId still pulls in way too much.

JordanMaples and others added 2 commits June 30, 2026 14:11

JordanMaples requested review from a team and Copilot June 30, 2026 21:28

Copilot started reviewing on behalf of JordanMaples June 30, 2026 21:39 View session

Copilot AI reviewed Jun 30, 2026

View reviewed changes

hildebrandmw reviewed Jul 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Support u64 vertex ids in the bftree provider#1216

Support u64 vertex ids in the bftree provider#1216
JordanMaples wants to merge 2 commits into
mainfrom
jordanmaples/bftree_u64_ids

JordanMaples commented Jun 30, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

codecov-commenter commented Jun 30, 2026

Uh oh!

harsha-simhadri commented Jun 30, 2026

Uh oh!

JordanMaples commented Jul 1, 2026

Uh oh!

harsha-simhadri commented Jul 1, 2026

Uh oh!

hildebrandmw Jul 1, 2026

Uh oh!

hildebrandmw Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

Conversation

JordanMaples commented Jun 30, 2026

What

Why

Hardening

Notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

codecov-commenter commented Jun 30, 2026

Codecov Report

Uh oh!

harsha-simhadri commented Jun 30, 2026

Uh oh!

JordanMaples commented Jul 1, 2026

Uh oh!

harsha-simhadri commented Jul 1, 2026

Uh oh!

hildebrandmw Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

hildebrandmw Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants