[lake/paimon] Tiering support to scan and write arrow record batch to paimon #3418

Open
luoyuxia wants to merge 6 commits into apache:main from luoyuxia:tiering-scan-arrow-record-batch

luoyuxia (Contributor) commented Jun 2, 2026

Summary

  • Adjust TieringSplitReader to support scanning Arrow record batches and writing them directly to Paimon via SupportsRecordBatchWrite
  • Add Arrow2PaimonVectorConverter for converting Arrow vectors to Paimon columnar vectors
  • Add AppendOnlyArrowBatchHelper for writing Arrow batches to Paimon append-only tables
  • Add runtime check for unshaded Arrow availability to gracefully fall back to row-based path
  • Fix CompletedFetch to support fetching Arrow batches with offset skipping and advanceNextFetchOffset guard
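
For readers unfamiliar with the fallback mentioned in the summary, a runtime availability check of this kind is commonly written as a classpath probe. The following is a minimal sketch, not the code in this PR: the class name `ArrowAvailability`, the probed class, and the returned path labels are all assumptions made for illustration.

```java
final class ArrowAvailability {
    // Assumption: probing this unshaded Arrow class; the PR may probe a different one.
    static final String ARROW_PROBE_CLASS = "org.apache.arrow.vector.VectorSchemaRoot";

    private ArrowAvailability() {}

    /** Returns true if the class can be located by the loader, without initializing it. */
    static boolean isClassPresent(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing or unloadable class: treat as "not available".
            return false;
        }
    }

    /** Hypothetical selector: use the Arrow batch path only when unshaded Arrow is present. */
    static String chooseWritePath(ClassLoader loader) {
        return isClassPresent(ARROW_PROBE_CLASS, loader) ? "arrow-batch" : "row";
    }
}
```

Catching `LinkageError` as well as `ClassNotFoundException` keeps the probe from failing the job when the class exists but its dependencies do not, which is the situation a graceful fallback is meant to cover.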

Stacked on #2995 — diff will shrink after that PR merges.

Closes #2966

luoyuxia added 4 commits May 15, 2026 11:43
…batches

Replace IOUtils.closeQuietly with an explicit close that surfaces the
exception when Arrow batches have not been released before the scanner
is closed.
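
The behavioural difference described in this commit message can be illustrated with a small stand-in. This is a sketch under assumptions: `TrackingScanner`, `Batch`, and the local `closeQuietly` helper are hypothetical and only mimic the idea of a scanner that counts unreleased batches.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class TrackingScanner implements AutoCloseable {
    // Number of batches handed out and not yet released.
    private final AtomicInteger unreleased = new AtomicInteger();

    /** Hypothetical batch handle; closing it releases the batch exactly once. */
    final class Batch implements AutoCloseable {
        private boolean released;

        @Override
        public void close() {
            if (!released) {
                released = true;
                unreleased.decrementAndGet();
            }
        }
    }

    Batch poll() {
        unreleased.incrementAndGet();
        return new Batch();
    }

    /** Explicit close: fails loudly if any batch is still held. */
    @Override
    public void close() {
        int leaked = unreleased.get();
        if (leaked > 0) {
            throw new IllegalStateException(
                    leaked + " Arrow batch(es) not released before scanner close");
        }
    }

    /** Stand-in for a quiet close: the leak above would be silently swallowed. */
    static void closeQuietly(AutoCloseable c) {
        try {
            c.close();
        } catch (Exception ignored) {
            // Intentionally ignored; this is the behaviour the commit moves away from.
        }
    }
}
```

Calling `close()` directly on a scanner with an outstanding batch throws, while passing the same scanner to `closeQuietly` hides the leak.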
luoyuxia force-pushed the tiering-scan-arrow-record-batch branch 3 times, most recently from 65c25ab to ec60770, on June 2, 2026 07:31
luoyuxia force-pushed the tiering-scan-arrow-record-batch branch from ec60770 to 97876b0 on June 2, 2026 07:38
- Use lastScannedOffset instead of logOffset for split completion check
  in TieringSplitReader to handle the case where scanned records are all
  beyond the stoppingOffset and no records are written
- Add lastScannedOffset field to LogOffsetAndTimestamp to track the last
  scanned offset independently from the last written offset
- Fix Arrow TIME type conversion to handle TimeMicroVector, TimeNanoVector,
  and TimeSecVector in addition to TimeMilliVector
- Fix PaimonTieringITCase to write all records in one batch to avoid
  non-deterministic read order from multiple Paimon data files
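
To make the TIME fix concrete: the four Arrow time vectors differ in unit (TimeSecVector holds seconds, TimeMilliVector milliseconds, TimeMicroVector microseconds, TimeNanoVector nanoseconds), so a converter has to normalize all of them to one target unit. The sketch below assumes the target is an int of milliseconds since midnight and does not use the Arrow API; `ArrowTimeToMillis` and `toMillisOfDay` are hypothetical names, and the real converter may be structured differently.

```java
import java.util.concurrent.TimeUnit;

final class ArrowTimeToMillis {
    private ArrowTimeToMillis() {}

    /**
     * Normalizes a raw time-of-day value in the given unit to milliseconds of day.
     * Sub-millisecond precision is truncated; values that do not fit in an int throw.
     */
    static int toMillisOfDay(long rawValue, TimeUnit unit) {
        return Math.toIntExact(unit.toMillis(rawValue));
    }
}
```

For example, `toMillisOfDay(1, TimeUnit.SECONDS)` yields 1000 and `toMillisOfDay(1500, TimeUnit.MICROSECONDS)` yields 1, since the fractional millisecond is dropped.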

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
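
The split completion change in the commit above is easier to follow with a toy model. This is a sketch, not the TieringSplitReader logic: `SplitProgress` is hypothetical, and it assumes `stoppingOffset` is exclusive and that records at or beyond it are scanned but never written.

```java
final class SplitProgress {
    private final long stoppingOffset; // assumed exclusive upper bound of the split
    private long lastWrittenOffset = -1;
    private long lastScannedOffset = -1;

    SplitProgress(long stoppingOffset) {
        this.stoppingOffset = stoppingOffset;
    }

    /** Records one scanned offset; only offsets below the stopping offset are written. */
    void onRecordScanned(long offset) {
        lastScannedOffset = offset;
        if (offset < stoppingOffset) {
            lastWrittenOffset = offset;
        }
    }

    /** Completion check based on what was written; can stay false forever. */
    boolean finishedByWrittenOffset() {
        return lastWrittenOffset >= stoppingOffset - 1;
    }

    /** Completion check based on what was scanned, independent of writes. */
    boolean finishedByScannedOffset() {
        return lastScannedOffset >= stoppingOffset - 1;
    }
}
```

With a stopping offset of 10 and scanned offsets 10, 11, and 12, nothing is written, so the written-offset check never reports completion while the scanned-offset check does.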

Development

Successfully merging this pull request may close these issues.

TieringSourceReader adjust to scan arrow record batch and write arrow record batch to lake