Merge/v1.5 variegata into main #507
Merged
evertlammerts merged 15 commits on Jun 26, 2026
…with a live connection / transaction
This PR unifies arrow exports across query result types, and makes sure we always provide the schema from within a transaction.

We are dealing with 3 arrow export types:
- Arrow Table
- Arrow RecordBatch
- Arrow C Stream

... across 3 result types:
- MaterializedQueryResult
- ArrowQueryResult
- StreamingQueryResult

The `MaterializedQueryResult` paths are now unified. We re-feed the backing `ColumnDataCollection` to the engine for parallel conversion into an `ArrowQueryResult`, and then we delegate to the corresponding `ArrowQueryResult` path.

The `ArrowQueryResult` paths deal with materialized data already, and we have no way to plug into the transaction that generated it. The actual fix for this is to cache the schema when creating the `ArrowQueryResult`, during `Finalize`. This is a core change that we will probably apply in v2.0. The workaround is to fetch the schema in a separate transaction. For all paths, since we are already dealing with materialized data, we create an arrow table. Then for the streaming paths we return the corresponding stream types directly from the table.

The `StreamingQueryResult` paths always have access to a valid transaction context, and can get the arrow schema on demand even when that requires catalog access.

As a side effect of this PR, consuming an arrow C stream (reading from `con.sql(q).__arrow_c_stream__()`) is now lazy, i.e. not materialized. This makes consumption slower, but allows streaming much larger datasets. The materialized paths are overall a little faster, as are the streaming paths other than the C stream.
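To make the lazy-versus-materialized trade-off concrete, here is a minimal pure-Python sketch. It does not use DuckDB or Arrow: `produce_batches`, `materialized_stream`, `lazy_stream`, and `consume` are hypothetical stand-ins, and a list of integers plays the role of a record batch. It only shows the order in which batches are produced and consumed under each strategy.

```python
from typing import Iterator, List

Batch = List[int]  # stand-in for an Arrow RecordBatch (assumption for illustration)


def produce_batches(n_batches: int, batch_size: int, log: List[str]) -> Iterator[Batch]:
    """Hypothetical engine: yields one batch at a time and logs each production."""
    for i in range(n_batches):
        log.append(f"produce {i}")
        yield list(range(i * batch_size, (i + 1) * batch_size))


def materialized_stream(n_batches: int, batch_size: int, log: List[str]) -> Iterator[Batch]:
    """Eager export: all batches exist in memory before the consumer sees the first."""
    table = list(produce_batches(n_batches, batch_size, log))
    return iter(table)


def lazy_stream(n_batches: int, batch_size: int, log: List[str]) -> Iterator[Batch]:
    """Lazy export: a batch is produced only when the consumer asks for it."""
    return produce_batches(n_batches, batch_size, log)


def consume(stream: Iterator[Batch], log: List[str]) -> int:
    """Consumer: sums every value and logs each batch as it is read."""
    total = 0
    for i, batch in enumerate(stream):
        log.append(f"consume {i}")
        total += sum(batch)
    return total


eager_log: List[str] = []
lazy_log: List[str] = []
eager_total = consume(materialized_stream(3, 2, eager_log), eager_log)
lazy_total = consume(lazy_stream(3, 2, lazy_log), lazy_log)

print(eager_log)  # all "produce" entries come before any "consume" entry
print(lazy_log)   # "produce" and "consume" entries alternate
```

Both strategies yield the same data. The eager one holds every batch at once, while the lazy one never needs more than one batch resident, at the cost of interleaving production with consumption.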
```
┌───────────────────────────────────────────────────┬────────────────────┬───────────────────┬───────────────────┐
│ benchmark expression                              │ wall base→now (ms) │ CPU base→now (ms) │ mem base→now (MB) │
├───────────────────────────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
│ r=con.sql(q); r.execute(); r.to_arrow_table()     │ 159 → 161          │ 259 → 286         │ 847 → 875         │
├───────────────────────────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
│ r=con.sql(q); r.execute(); r.to_arrow_reader()    │ 161 → 144          │ 255 → 263         │ 896 → 877         │
├───────────────────────────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
│ r=con.sql(q); r.execute(); r.__arrow_c_stream__() │ 157 → 136          │ 282 → 235         │ 854 → 881         │
├───────────────────────────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
│ con.sql(q).to_arrow_table()                       │ 52 → 35            │ 267 → 244         │ 855 → 854         │
├───────────────────────────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
│ con.execute(q).to_arrow_table()                   │ 202 → 174          │ 212 → 193         │ 548 → 554         │
├───────────────────────────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
│ con.sql(q).to_arrow_reader()                      │ 186 → 175          │ 199 → 187         │ 552 → 552         │
├───────────────────────────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
│ con.sql(q).__arrow_c_stream__()                   │ 48 → 173           │ 250 → 189         │ 857 → 554         │
└───────────────────────────────────────────────────┴────────────────────┴───────────────────┴───────────────────┘
```
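The `base → now` cells are easier to compare as relative changes. The small helper below is a hypothetical convenience, not part of the PR. It converts the last row of the table, `con.sql(q).__arrow_c_stream__()`, into percentages, using values copied from the table.

```python
def relative_change(cell: str) -> float:
    """Parse a 'base → now' cell and return the change relative to base, in percent."""
    base_text, now_text = cell.split("→")
    base, now = float(base_text), float(now_text)
    return (now - base) / base * 100.0


# Values copied from the last table row: con.sql(q).__arrow_c_stream__()
wall = relative_change("48 → 173")
cpu = relative_change("250 → 189")
mem = relative_change("857 → 554")

print(f"wall {wall:+.1f}%  CPU {cpu:+.1f}%  mem {mem:+.1f}%")
# prints: wall +260.4%  CPU -24.4%  mem -35.4%
```

For this row, wall time rises sharply while CPU time and memory both drop, which matches the description of the C stream becoming lazy.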
Bump duckdb submodule:
- Target branch: v1.5-variegata
- Date: 2026-06-17 07:32:20
- DuckDB SHA: ceb2aef3e30c5c04cf97eea4af3990a274bd49bb
- Trigger: https://github.com/duckdb/duckdb-python/actions/runs/27671964362

Bump duckdb submodule:
- Target branch: v1.5-variegata
- Date: 2026-06-21 06:31:40
- DuckDB SHA: c4770ecba48065b691843da2e6eb9f91e3fea77b
- Trigger: https://github.com/duckdb/duckdb-python/actions/runs/27895532903
Periodic forward-merge of release-branch bugfixes into main. Notably brings in duckdb#495 "Unify arrow exports across all query result types" (the materialized slow-path lifetime / connection-GC fix and the test_arrow_refeed suite), replacing main's older SchemaCachingStreamWrapper/ArrowQueryResultStreamWrapper approach.

Submodule: external/duckdb is kept at main's pin 0361de441a (v1.5's submodule bumps discarded; git fast-forwarded the gitlink to main's newer pin).

Conflict resolution:
- .github/workflows/packaging_wheels.yml: applied both intents: v1.5's windows-2025 -> windows-2022 (consistent with targeted_test.yml) and main's ARM64-comment removal.

Adaptation for main's newer core:
- pyresult.cpp: core's ColumnDataRef now takes vector<Identifier> (not vector<string>); promote the deduplicated scan names to Identifiers explicitly in MakeColumnDataScanStatement.

Verified: clean build; tests/fast/arrow + tests/fast/udf = 2436 passed, 0 failed (incl. test_capsule_slow_path_survives_connection_gc and the new test_arrow_refeed suite).
No description provided.