Merge/v1.5 variegata into main #507

Merged
evertlammerts merged 15 commits into duckdb:main from evertlammerts:merge/v1.5-variegata-into-main
Jun 26, 2026

@evertlammerts

Member

No description provided.

evertlammerts and others added 15 commits June 12, 2026 23:21
This PR unifies arrow exports across query result types, and makes sure
we always provide the schema from within a transaction.

We are dealing with 3 arrow export types:
- Arrow Table
- Arrow RecordBatch
- Arrow C Stream

... across 3 result types:
- MaterializedQueryResult
- ArrowQueryResult
- StreamingQueryResult

The `MaterializedQueryResult` paths are now unified. We re-feed the backing
ColumnDataCollection to the engine for parallel conversion into an
`ArrowQueryResult`, and then we delegate to the corresponding
`ArrowQueryResult` path.

The `ArrowQueryResult` paths deal with materialized data already, and we
have no way to plug into the transaction that generated it. The actual
fix for this is to cache the schema when creating the
`ArrowQueryResult`, during `Finalize`. This is a core change that we
will probably apply in v2.0. The workaround is to fetch the schema in a
separate transaction. For all paths, since we are already dealing with
materialized data, we create an arrow table. Then for the streaming
paths we return the corresponding stream types directly from the table.

The `StreamingQueryResult` paths always have access to a valid
transaction context, and can get the arrow schema on demand even when
that requires catalog access.

As a side effect of this PR, consuming an arrow c stream (reading from
`con.sql(q).__arrow_c_stream__()`) is now lazy, i.e. not materialized.
This makes consumption slower in wall-clock time, but it uses less
memory and allows streaming much larger datasets.
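
The trade-off between a materialized and a lazy stream can be illustrated with a toy model in plain Python. This is only a sketch: `produce_batches`, `materialized_stream`, `lazy_stream` and `take` are hypothetical names invented for illustration, and the generator stands in for the query engine. It does not reflect how duckdb-python implements the export internally.

```python
from itertools import islice
from typing import Iterator, List

Batch = List[int]


def produce_batches(n_batches: int, batch_size: int, log: List[int]) -> Iterator[Batch]:
    """Toy stand-in for a query engine: computes one batch at a time."""
    for i in range(n_batches):
        log.append(i)  # record that batch i was actually computed
        start = i * batch_size
        yield list(range(start, start + batch_size))


def materialized_stream(n_batches: int, batch_size: int, log: List[int]) -> Iterator[Batch]:
    """Compute every batch up front, then hand out an iterator over the result."""
    all_batches = list(produce_batches(n_batches, batch_size, log))
    return iter(all_batches)


def lazy_stream(n_batches: int, batch_size: int, log: List[int]) -> Iterator[Batch]:
    """Hand out the generator itself, so batches are computed on demand."""
    return produce_batches(n_batches, batch_size, log)


def take(stream: Iterator[Batch], k: int) -> List[Batch]:
    """Consume only the first k batches of a stream."""
    return list(islice(stream, k))


eager_log: List[int] = []
lazy_log: List[int] = []
eager_first = take(materialized_stream(1000, 4, eager_log), 2)
lazy_first = take(lazy_stream(1000, 4, lazy_log), 2)

# Both streams yield the same data, but the amount of work done differs.
print(len(eager_log), len(lazy_log))
```

In this model the materialized stream holds all batches in memory before the consumer sees the first one, while the lazy stream only ever holds the batch currently being consumed. That is the reason a lazy stream can handle datasets that would not fit in memory, at the cost of doing the work while the consumer waits.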

The materialized paths are overall a little faster, as are the
streaming paths other than the c stream.

```
  ┌───────────────────────────────────────────────────┬────────────────────┬───────────────────┬───────────────────┐
  │               benchmark expression                │ wall base→now (ms) │ CPU base→now (ms) │ mem base→now (MB) │
  ├───────────────────────────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
  │ r=con.sql(q); r.execute(); r.to_arrow_table()     │ 159 → 161          │ 259 → 286         │ 847 → 875         │
  ├───────────────────────────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
  │ r=con.sql(q); r.execute(); r.to_arrow_reader()    │ 161 → 144          │ 255 → 263         │ 896 → 877         │
  ├───────────────────────────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
  │ r=con.sql(q); r.execute(); r.__arrow_c_stream__() │ 157 → 136          │ 282 → 235         │ 854 → 881         │
  ├───────────────────────────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
  │ con.sql(q).to_arrow_table()                       │ 52 → 35            │ 267 → 244         │ 855 → 854         │
  ├───────────────────────────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
  │ con.execute(q).to_arrow_table()                   │ 202 → 174          │ 212 → 193         │ 548 → 554         │
  ├───────────────────────────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
  │ con.sql(q).to_arrow_reader()                      │ 186 → 175          │ 199 → 187         │ 552 → 552         │
  ├───────────────────────────────────────────────────┼────────────────────┼───────────────────┼───────────────────┤
  │ con.sql(q).__arrow_c_stream__()                   │ 48 → 173           │ 250 → 189         │ 857 → 554         │
  └───────────────────────────────────────────────────┴────────────────────┴───────────────────┴───────────────────┘

```
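
For readers who want to collect comparable wall and CPU numbers themselves, a minimal sketch using only the standard library follows. The `measure` helper is a hypothetical name and is not the harness used to produce the table above; the workload is a placeholder that would be swapped for one of the benchmark expressions.

```python
import time
from typing import Any, Callable, Dict


def measure(fn: Callable[[], Any]) -> Dict[str, Any]:
    """Run fn once and report wall-clock and process CPU time in milliseconds."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn()
    cpu_ms = (time.process_time() - cpu_start) * 1000.0
    wall_ms = (time.perf_counter() - wall_start) * 1000.0
    return {"result": result, "wall_ms": wall_ms, "cpu_ms": cpu_ms}


# Placeholder workload; replace the lambda with e.g. a call that exports a
# query result to arrow.
stats = measure(lambda: sum(i * i for i in range(100_000)))
print(f"wall={stats['wall_ms']:.1f} ms  cpu={stats['cpu_ms']:.1f} ms")
```

`time.process_time()` sums CPU time over all threads of the process, so CPU time can exceed wall time when work runs in parallel. That matches rows such as `con.sql(q).to_arrow_table()`, where wall time is far below CPU time.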
Bump duckdb submodule:
- Target branch: v1.5-variegata
- Date: 2026-06-17 07:32:20
- DuckDB SHA: ceb2aef3e30c5c04cf97eea4af3990a274bd49bb
- Trigger:
https://github.com/duckdb/duckdb-python/actions/runs/27671964362
Bump duckdb submodule:
- Target branch: v1.5-variegata
- Date: 2026-06-21 06:31:40
- DuckDB SHA: c4770ecba48065b691843da2e6eb9f91e3fea77b
- Trigger:
https://github.com/duckdb/duckdb-python/actions/runs/27895532903
Periodic forward-merge of release-branch bugfixes into main. Notably brings
in duckdb#495 "Unify arrow exports across all query result types" (the materialized
slow-path lifetime / connection-GC fix and the test_arrow_refeed suite),
replacing main's older SchemaCachingStreamWrapper/ArrowQueryResultStreamWrapper
approach.

Submodule: external/duckdb is kept at main's pin 0361de441a (v1.5's submodule
bumps discarded; git fast-forwarded the gitlink to main's newer pin).

Conflict resolution:
- .github/workflows/packaging_wheels.yml: applied both intents — v1.5's
  windows-2025 -> windows-2022 (consistent with targeted_test.yml) and main's
  ARM64-comment removal.

Adaptation for main's newer core:
- pyresult.cpp: core's ColumnDataRef now takes vector<Identifier> (not
  vector<string>); promote the deduplicated scan names to Identifiers
  explicitly in MakeColumnDataScanStatement.

Verified: clean build; tests/fast/arrow + tests/fast/udf = 2436 passed,
0 failed (incl. test_capsule_slow_path_survives_connection_gc and the new
test_arrow_refeed suite).
@evertlammerts evertlammerts merged commit 56c26cc into duckdb:main Jun 26, 2026
15 checks passed
@evertlammerts evertlammerts deleted the merge/v1.5-variegata-into-main branch June 26, 2026 09:44