Skip to content

fix(fast-path): three bugs in FastPathFragmentWriter and _can_use_fast_path#43

Closed
FANNG1 wants to merge 4 commits into
daft-engine:mainfrom
FANNG1:fix/fast-path-type-erasure-and-collect-sideeffect
Closed

fix(fast-path): three bugs in FastPathFragmentWriter and _can_use_fast_path#43
FANNG1 wants to merge 4 commits into
daft-engine:mainfrom
FANNG1:fix/fast-path-type-erasure-and-collect-sideeffect

Conversation

@FANNG1

@FANNG1 FANNG1 commented Jun 25, 2026

Copy link
Copy Markdown

Fixes three independent bugs in the fast-path column-merge code path (lance_merge_column.py).

Changes

Bug 1 — Type erasure: fixed_size_list<float32>[N] becomes list<float64> (closes #40)

FastPathFragmentWriter built the new-column Arrow table via pa.array(s.to_pylist()).
Python floats are always float64 and Python lists carry no fixed-size constraint, so
fixed_size_list<item: float>[N] (e.g. an embedding UDF's output) was silently widened
to list<item: double>.

Fix: use s.to_arrow().combine_chunks() to preserve the Arrow type that daft declared
in the UDF's return_dtype.


Bug 2 — df.collect() in _can_use_fast_path exhausts one-shot BlobFile objects (closes #42)

_can_use_fast_path called len(df.collect()) to count rows. Daft caches the result in
df._result_cache. When the pipeline contains take_blobs(), the cache holds BlobFile
instances (one-shot streams). The subsequent groupby().map_groups() in _merge_fast_path
received the same exhausted objects; blob.read() returned b'', causing all downstream
UDFs to produce null or incorrect output.

Fix: replace df.collect() with df.count_rows(), which does not set _result_cache.


Bug 3 — next_fid collides with nested struct child field IDs (closes #41)

next_fid = max(f.id() for f in lance_schema.fields()) + 1 only scanned top-level fields.
A struct column with M children occupies M+1 Lance field IDs (parent + each child). The new
file's "fields" entry therefore collided with an existing child field, and the new column
read back as null for every row.

Fix: recurse into f.children() to find the true maximum field ID across all descendants.


Tests

tests/io/lance/test_fast_path_merge.py — new TestRegressions class with one test per bug:

  • test_fixed_size_list_float32_type_preserved
  • test_next_fid_skips_nested_child_field_ids
  • test_blob_pipeline_merge_produces_nonnull_results

fanng added 4 commits June 19, 2026 12:32
…t_path

Bug 1 — type erasure (FastPathFragmentWriter.__call__):
pa.array(s.to_pylist()) drops Arrow type info from daft Series.
For fixed_size_list<float32>[N], Python floats become float64 and
the fixed-size list becomes a variable-length list<double>, which
causes lance's stream merger to fail with a row-count mismatch
(1 != N). Fix: use s.to_arrow() / combine_chunks() to preserve
the declared daft return type.

Bug 2 — collect() side-effect (_can_use_fast_path):
len(df.collect()) materialises results into df._result_cache.
One-shot Python objects (e.g. lance BlobFile) cached there are
exhausted; a subsequent groupby().map_groups() that re-uses the
same df object receives stale objects and produces null values.
Fix: use df.count_rows(), which does not set _result_cache.

Bug 3 — next_fid under-counts child field IDs (FastPathFragmentWriter):
lance_schema.fields() returns only top-level fields. Nested types
(struct, blob-v2, etc.) have child fields with their own IDs.
max(f.id() for top-level f) skips these, so next_fid can collide
with an existing child field ID. The committed fragment metadata
then maps the new column file to the wrong schema field, causing
lance to read null for the new column. Fix: recurse into children
when computing the max field ID.
…r bugs

Covers:
- fixed_size_list<float32>[N] type erasure via pa.array(s.to_pylist())
- next_fid collision with nested struct child field IDs
- BlobFile exhaustion caused by collect() side-effect in _can_use_fast_path
@FANNG1

FANNG1 commented Jun 26, 2026

Copy link
Copy Markdown
Author

Superseded by three focused PRs for easier review: #44 (type erasure), #45 (collect side-effect), #46 (next_fid child IDs).

@FANNG1 FANNG1 closed this Jun 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment