Skip to content

Support u64 vertex ids in the bftree provider#1216

Open
JordanMaples wants to merge 2 commits into
mainfrom
jordanmaples/bftree_u64_ids
Open

Support u64 vertex ids in the bftree provider#1216
JordanMaples wants to merge 2 commits into
mainfrom
jordanmaples/bftree_u64_ids

Conversation

@JordanMaples

Copy link
Copy Markdown
Contributor

What

Make the bf-tree provider generic over its vertex-id type so it can index past
the u32::MAX (~4.3B) ceiling, while keeping u32 as the default (no memory or
storage cost for existing datasets).

  • New id.rsBfTreeId trait with u32 / u64 impls, from_index /
    try_from_index, and validate_id_capacity.
  • provider.rsBfTreeProvider<T, Q = QuantVectorProvider, I = u32>.
    The vertex id type threads through construction, save, and load. The persisted
    SavedParams records id_width; load validates it against the requested type
    so a u32 index can't be silently reopened as u64 (or vice versa).
  • neighbors.rs / vectors.rs — neighbor cells and starting-point seeding
    are generic over I, sized via size_of::<I>().

Why

Larger-than-memory, billion-scale datasets need vertex ids wider than u32.
Defaulting I = u32 keeps the common case unchanged; opting into u64 is a
single type parameter at the call site.

Hardening

The second commit adds defensive coverage for the wide-id path:

  • Capacity guard rejecting id counts that overflow the chosen width.
  • Save/load rejects an id_width mismatch instead of misreading data.
  • Tests for high-bit (> u32::MAX) ids and their low-bit collisions, plus a
    u64 save/load round-trip.

Notes

  • Two commits, each builds and lints clean in isolation (clean bisect).
  • Based independently off main (does not include the lazy-init change in
    the sibling PR).
  • Open design question: bf-tree's I is an internal id. Mark's inmem2
    direction uses InternalId = u32 (hardcoded) + generic external id. Worth
    aligning on whether bftree needs wide internal ids or should follow the same
    internal-u32 / external-generic split before merge.

JordanMaples and others added 2 commits June 30, 2026 14:11
The provider hardcoded `u32` vertex ids, capping any bf-tree-backed index
at ~4.29B vectors. That ceiling is at odds with the crate's purpose of
supporting larger-than-memory, billion-scale-and-beyond datasets.

Introduce a `BfTreeId` trait (`VectorId + IntoUsize` plus index-construction)
implemented for `u32` and `u64`, and thread a new id type parameter
`I` through `BfTreeProvider<T, Q, I = u32>` and all of its accessors,
strategies, and save/load impls. The default `I = u32` keeps every existing
caller (benchmark harness, tests) compiling and behaving identically, while
`u64` is now available opt-in for indexes that exceed the 32-bit range.

`iter()`/`IntoIterator` now map `0..total` through `I::from_index` instead
of relying on `Range<u32>` (a generic `Range<I>` would require the unstable
`Step` trait). `VectorProvider::starting_points` is likewise generic over the
id type. The on-disk neighbor format follows the id width and is self-
consistent on load (the width is not persisted, mirroring the metric).

A new `test_quantized_index_search_u64_ids` builds and searches a
`BfTreeProvider<_, _, u64>` index end-to-end to prove the u64 path is
functional, not merely compilable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reinforce the newly-reachable u64 id path surfaced by review:

- neighbors: size record key/value buffers by size_of::<I>() instead of
  hardcoded size_of::<u32>(), so u64 providers reserve correct record sizes.
- id: add validate_id_capacity to reject vertex counts that overflow the id
  type, and guard new_empty with it.
- provider: persist id_width in SavedParams (back-compat default of 4) and
  validate it on load, so a u64 index can't be silently loaded as u32.
- tests: add u64 high-bit id coverage (ids/neighbors > u32::MAX, low-bit
  collision check) plus a u64 save/load round-trip, id-width mismatch
  rejection, and capacity-guard rejection.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR generalizes the diskann-bftree provider to support u64 vertex IDs (while keeping u32 as the default), enabling bf-tree indexes to exceed the u32::MAX vertex ceiling. It threads a new BfTreeId abstraction through the provider, neighbor storage, vector starting-point logic, and save/load metadata, and adds tests that exercise the wide-id path end-to-end.

Changes:

  • Introduces BfTreeId (u32/u64) and adds capacity validation to prevent silent truncation when minting IDs from indices.
  • Makes BfTreeProvider and related accessors/strategies generic over vertex-id type I, including neighbor record sizing via size_of::<I>().
  • Persists id_width in SavedParams and validates width/capacity on load; adds tests for u64 high-bit IDs and u64 save/load round-trips.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
diskann-bftree/src/id.rs Adds BfTreeId trait + capacity validation helper for index-derived IDs.
diskann-bftree/src/lib.rs Exposes the new id module and re-exports BfTreeId.
diskann-bftree/src/vectors.rs Makes starting-point ID generation generic over BfTreeId.
diskann-bftree/src/neighbors.rs Sizes bf-tree neighbor records based on I width; adds u64 ID collision/round-trip test.
diskann-bftree/src/provider.rs Makes BfTreeProvider generic over I; adds save/load id_width validation and u64 path tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

.quant_vector_provider_config
.use_snapshot(params.use_snapshot);

crate::id::validate_id_capacity::<I>(params.max_points + params.num_start_points.get())?;
saved_params.id_width,
)));
}
crate::id::validate_id_capacity::<I>(saved_params.max_points + saved_params.frozen_points.get())
@codecov-commenter

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 94.58484% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.83%. Comparing base (5b11031) to head (c861d22).

Files with missing lines Patch % Lines
diskann-bftree/src/provider.rs 94.59% 12 Missing ⚠️
diskann-bftree/src/vectors.rs 50.00% 2 Missing ⚠️
diskann-bftree/src/id.rs 95.83% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1216      +/-   ##
==========================================
+ Coverage   89.82%   89.83%   +0.01%     
==========================================
  Files         489      490       +1     
  Lines       93575    93794     +219     
==========================================
+ Hits        84051    84264     +213     
- Misses       9524     9530       +6     
Flag Coverage Δ
miri 89.83% <94.58%> (+0.01%) ⬆️
unittests 89.49% <94.58%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
diskann-bftree/src/lib.rs 59.09% <ø> (ø)
diskann-bftree/src/neighbors.rs 96.83% <100.00%> (+0.15%) ⬆️
diskann-bftree/src/id.rs 95.83% <95.83%> (ø)
diskann-bftree/src/vectors.rs 90.05% <50.00%> (-0.48%) ⬇️
diskann-bftree/src/provider.rs 91.25% <94.59%> (+0.79%) ⬆️

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@harsha-simhadri

Copy link
Copy Markdown
Contributor

Does u64 bloat adjacency list, and if so, is there a chance we can bring in a delta compression ?

@JordanMaples

Copy link
Copy Markdown
Contributor Author

Does u64 bloat adjacency list, and if so, is there a chance we can bring in a delta compression ?

@harsha-simhadri yes, in u64 mode it doubles the adjacency-list storage (neighbor ids are stored inline), but that's explicit opt-in and there are no u64 consumers yet. I can tackle delta compression as a follow-up. Neighbor order doesn't affect search correctness, so we can sort ids and delta+varint them, which recovers essentially all of the u64 overhead.

@harsha-simhadri

Copy link
Copy Markdown
Contributor

Does u64 bloat adjacency list, and if so, is there a chance we can bring in a delta compression ?

@harsha-simhadri yes, in u64 mode it doubles the adjacency-list storage (neighbor ids are stored inline), but that's explicit opt-in and there are no u64 consumers yet. I can tackle delta compression as a follow-up. Neighbor order doesn't affect search correctness, so we can sort ids and delta+varint them, which recovers essentially all of the u64 overhead.

thanks, could you create an issue for future follow up. thanks

pub fn iter(&self) -> std::ops::Range<u32> {
0..(self.full_vectors.total() as u32)
pub fn iter(&self) -> std::iter::Map<std::ops::Range<usize>, fn(usize) -> I> {
(0..self.full_vectors.total()).map(I::from_index as fn(usize) -> I)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably not the move, nor is it a great precedent to set. The problem is that std::iter::Map<.., fn(usize) -> I> folds the function pointer into the map. In particular, a function pointer is (usually) not inlineable, so the resulting iterator has a large number of undesirable properties, including an indirect call for each evaluation. Godbolt Link.

It's probably worth baking more logic into the BfTreeId trait to avoid needing these kinds of inefficient fallbacks.

Comment thread diskann-bftree/src/id.rs
/// (for billion-scale-and-beyond, larger-than-memory datasets). On a 64-bit
/// target `u64` covers every representable `usize`, so its conversions never
/// fail.
pub trait BfTreeId: VectorId + IntoUsize {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Try to avoid IntoUsize as a trait bound and just make this an as_index method of BfTreeId. It's a little more defensive and self contained. It may also be a good idea to drop the VectorId bound as well and instead include the required traits directly, but that's perhaps less important since VectorId still pulls in way too much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants