Skip to content

fix(fast-path): use count_rows() instead of collect() in _can_use_fast_path#45

Open
FANNG1 wants to merge 1 commit into
daft-engine:mainfrom
FANNG1:fix/fast-path-collect-sideeffect
Open

fix(fast-path): use count_rows() instead of collect() in _can_use_fast_path#45
FANNG1 wants to merge 1 commit into
daft-engine:mainfrom
FANNG1:fix/fast-path-collect-sideeffect

Conversation

@FANNG1

@FANNG1 FANNG1 commented Jun 26, 2026

Copy link
Copy Markdown

Fixes #42

Problem

_can_use_fast_path used len(df.collect()) to count rows. collect() populates df._result_cache with the materialised results. When the pipeline includes take_blobs(), those cached results include BlobFile instances — one-shot streams whose .read() returns b"" after the first call.

The fast-path merge subsequently calls df.groupby("fragment_id").map_groups(...). Because _result_cache is already set, Daft returns the same exhausted BlobFile objects instead of re-materialising from Lance. Downstream UDFs that call blob.read() receive empty bytes and produce null or incorrect output.

Fix

Replace len(df.collect()) with df.count_rows(). count_rows() does not set _result_cache, so the pipeline re-executes fresh when the merge runs.

Test

TestRegressions::test_fast_path_check_does_not_set_result_cache — verifies that _can_use_fast_path leaves df._result_cache unset after it returns, confirming count_rows() (not collect()) is used internally.

…t_path

collect() populates df._result_cache with one-shot Python objects (e.g. BlobFile
from take_blobs()). A subsequent groupby().map_groups() re-uses the same
df object, receiving the same exhausted objects from the cache; downstream UDFs
that call blob.read() get b'' and produce null/wrong output.

count_rows() does not set _result_cache, so the pipeline re-executes fresh.

Fixes daft-engine#42
@FANNG1 FANNG1 force-pushed the fix/fast-path-collect-sideeffect branch from 0e7a835 to 43da0eb Compare June 26, 2026 09:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(fast-path): df.collect() in _can_use_fast_path exhausts one-shot BlobFile objects, corrupting downstream UDF results

1 participant