Support u64 vertex ids in the bftree provider#1216
Conversation
The provider hardcoded `u32` vertex ids, capping any bf-tree-backed index at ~4.29B vectors. That ceiling is at odds with the crate's purpose of supporting larger-than-memory, billion-scale-and-beyond datasets. Introduce a `BfTreeId` trait (`VectorId + IntoUsize` plus index-construction) implemented for `u32` and `u64`, and thread a new id type parameter `I` through `BfTreeProvider<T, Q, I = u32>` and all of its accessors, strategies, and save/load impls. The default `I = u32` keeps every existing caller (benchmark harness, tests) compiling and behaving identically, while `u64` is now available opt-in for indexes that exceed the 32-bit range. `iter()`/`IntoIterator` now map `0..total` through `I::from_index` instead of relying on `Range<u32>` (a generic `Range<I>` would require the unstable `Step` trait). `VectorProvider::starting_points` is likewise generic over the id type. The on-disk neighbor format follows the id width and is self- consistent on load (the width is not persisted, mirroring the metric). A new `test_quantized_index_search_u64_ids` builds and searches a `BfTreeProvider<_, _, u64>` index end-to-end to prove the u64 path is functional, not merely compilable. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reinforce the newly-reachable u64 id path surfaced by review: - neighbors: size record key/value buffers by size_of::<I>() instead of hardcoded size_of::<u32>(), so u64 providers reserve correct record sizes. - id: add validate_id_capacity to reject vertex counts that overflow the id type, and guard new_empty with it. - provider: persist id_width in SavedParams (back-compat default of 4) and validate it on load, so a u64 index can't be silently loaded as u32. - tests: add u64 high-bit id coverage (ids/neighbors > u32::MAX, low-bit collision check) plus a u64 save/load round-trip, id-width mismatch rejection, and capacity-guard rejection. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR generalizes the diskann-bftree provider to support u64 vertex IDs (while keeping u32 as the default), enabling bf-tree indexes to exceed the u32::MAX vertex ceiling. It threads a new BfTreeId abstraction through the provider, neighbor storage, vector starting-point logic, and save/load metadata, and adds tests that exercise the wide-id path end-to-end.
Changes:
- Introduces
BfTreeId(u32/u64) and adds capacity validation to prevent silent truncation when minting IDs from indices. - Makes
BfTreeProviderand related accessors/strategies generic over vertex-id typeI, including neighbor record sizing viasize_of::<I>(). - Persists
id_widthinSavedParamsand validates width/capacity on load; adds tests foru64high-bit IDs andu64save/load round-trips.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| diskann-bftree/src/id.rs | Adds BfTreeId trait + capacity validation helper for index-derived IDs. |
| diskann-bftree/src/lib.rs | Exposes the new id module and re-exports BfTreeId. |
| diskann-bftree/src/vectors.rs | Makes starting-point ID generation generic over BfTreeId. |
| diskann-bftree/src/neighbors.rs | Sizes bf-tree neighbor records based on I width; adds u64 ID collision/round-trip test. |
| diskann-bftree/src/provider.rs | Makes BfTreeProvider generic over I; adds save/load id_width validation and u64 path tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| .quant_vector_provider_config | ||
| .use_snapshot(params.use_snapshot); | ||
|
|
||
| crate::id::validate_id_capacity::<I>(params.max_points + params.num_start_points.get())?; |
| saved_params.id_width, | ||
| ))); | ||
| } | ||
| crate::id::validate_id_capacity::<I>(saved_params.max_points + saved_params.frozen_points.get()) |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1216 +/- ##
==========================================
+ Coverage 89.82% 89.83% +0.01%
==========================================
Files 489 490 +1
Lines 93575 93794 +219
==========================================
+ Hits 84051 84264 +213
- Misses 9524 9530 +6
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
Does u64 bloat adjacency list, and if so, is there a chance we can bring in a delta compression ? |
@harsha-simhadri yes, in u64 mode it doubles the adjacency-list storage (neighbor ids are stored inline), but that's explicit opt-in and there are no u64 consumers yet. I can tackle delta compression as a follow-up. Neighbor order doesn't affect search correctness, so we can sort ids and delta+varint them, which recovers essentially all of the u64 overhead. |
thanks, could you create an issue for future follow up. thanks |
| pub fn iter(&self) -> std::ops::Range<u32> { | ||
| 0..(self.full_vectors.total() as u32) | ||
| pub fn iter(&self) -> std::iter::Map<std::ops::Range<usize>, fn(usize) -> I> { | ||
| (0..self.full_vectors.total()).map(I::from_index as fn(usize) -> I) |
There was a problem hiding this comment.
This is probably not the move, nor is it a great precedent to set. The problem is that std::iter::Map<.., fn(usize) -> I> folds the function pointer into the map. In particular, a function pointer is (usually) not inlineable, so the resulting iterator has a large number of undesirable properties, including an indirect call for each evaluation. Godbolt Link.
It's probably worth baking more logic into the BfTreeId trait to avoid needing these kinds of inefficient fallbacks.
| /// (for billion-scale-and-beyond, larger-than-memory datasets). On a 64-bit | ||
| /// target `u64` covers every representable `usize`, so its conversions never | ||
| /// fail. | ||
| pub trait BfTreeId: VectorId + IntoUsize { |
There was a problem hiding this comment.
Suggestion: Try to avoid IntoUsize as a trait bound and just make this an as_index method of BfTreeId. It's a little more defensive and self contained. It may also be a good idea to drop the VectorId bound as well and instead include the required traits directly, but that's perhaps less important since VectorId still pulls in way too much.
What
Make the bf-tree provider generic over its vertex-id type so it can index past
the
u32::MAX(~4.3B) ceiling, while keepingu32as the default (no memory orstorage cost for existing datasets).
id.rs—BfTreeIdtrait withu32/u64impls,from_index/try_from_index, andvalidate_id_capacity.provider.rs—BfTreeProvider<T, Q = QuantVectorProvider, I = u32>.The vertex id type threads through construction, save, and load. The persisted
SavedParamsrecordsid_width; load validates it against the requested typeso a
u32index can't be silently reopened asu64(or vice versa).neighbors.rs/vectors.rs— neighbor cells and starting-point seedingare generic over
I, sized viasize_of::<I>().Why
Larger-than-memory, billion-scale datasets need vertex ids wider than
u32.Defaulting
I = u32keeps the common case unchanged; opting intou64is asingle type parameter at the call site.
Hardening
The second commit adds defensive coverage for the wide-id path:
id_widthmismatch instead of misreading data.> u32::MAX) ids and their low-bit collisions, plus au64save/load round-trip.Notes
main(does not include the lazy-init change inthe sibling PR).
Iis an internal id. Mark'sinmem2direction uses
InternalId = u32(hardcoded) + generic external id. Worthaligning on whether bftree needs wide internal ids or should follow the same
internal-u32 / external-generic split before merge.