Support quoted multiline fields in parallel CSV parsing #651

Merged
adsharma merged 3 commits into LadybugDB:main from rahul-iyer:parallel-csv-multiline-ranges-clean on Jul 5, 2026
@rahul-iyer

This change adds a planned-range path for the parallel CSV reader so files with quoted newlines can
still use parallel parsing without splitting logical rows across workers.

The core change is a boundary-planning step that computes safe logical row ranges before worker parsing
begins. When MULTILINE_PARALLEL=true, the reader uses those planned ranges instead of naive fixed block
cuts. That lets workers parse independently while preserving correctness for multiline quoted fields.
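The boundary-planning idea can be illustrated with a minimal sketch. It assumes a single sequential pass that tracks quote state and cuts only at newlines outside quotes. The function name `planRowRanges` and its parameters are hypothetical and are not the PR's `CSVBoundaryScanner` API, whose actual strategy (fixed-chunk overlap, SIMD seed scanning) is likely more involved.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper for illustration only (not the PR's implementation).
// Returns half-open byte ranges [start, end) that each contain whole logical rows.
std::vector<std::pair<std::size_t, std::size_t>> planRowRanges(
    const std::string& data, std::size_t targetChunkSize, char quote = '"') {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    if (targetChunkSize == 0) {
        targetChunkSize = 1;
    }
    bool inQuotes = false;
    std::size_t rangeStart = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        const char c = data[i];
        if (c == quote) {
            // An escaped quote ("") toggles twice, so the state stays correct.
            inQuotes = !inQuotes;
        } else if (c == '\n' && !inQuotes) {
            // A newline outside quotes ends a logical row. Cut here once the
            // current range has reached the target size.
            if (i + 1 - rangeStart >= targetChunkSize) {
                ranges.emplace_back(rangeStart, i + 1);
                rangeStart = i + 1;
            }
        }
    }
    // Whatever remains after the last cut forms the final range.
    if (rangeStart < data.size()) {
        ranges.emplace_back(rangeStart, data.size());
    }
    return ranges;
}
```

For the input `a,b\n1,"x\ny"\n2,z\n` with a target size of 6 bytes, a naive cut at byte 6 followed by a seek to the next newline would start a worker inside the quoted field. The sketch instead defers the cut to the end of the second row.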

What changed

  • Added MULTILINE_PARALLEL as a CSV reader option.

  • Added boundary planning for parallel CSV using CSVBoundaryScanner::planFixedChunkOverlap(...).

  • Added csv_boundary_scanner.cpp to the CSV reader build target.

  • Updated ParallelCSVReader to:

    • switch into planned-range mode when multiline parallel parsing is enabled
    • seek to planned range starts instead of fixed block boundaries
    • allow quoted newlines in planned-range mode
    • use the structural parser on planned ranges when supported
  • Kept the existing fixed-block path unchanged for non-multiline parallel reads.
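To show why planned ranges let workers tolerate quoted newlines, here is a small sketch of a per-range worker routine. The name `countRowsInRange` is hypothetical and the logic is deliberately simplified (it only counts rows, and it counts empty lines as rows). It relies on one assumption that planned-range mode is meant to guarantee: each range starts at a logical row boundary, outside any quoted field.

```cpp
#include <cstddef>
#include <string>

// Hypothetical worker routine for illustration only.
// Counts logical rows in the byte range [begin, end) of data.
// Assumes begin is a row boundary, so the quote state starts as "outside quotes".
std::size_t countRowsInRange(const std::string& data, std::size_t begin,
                             std::size_t end, char quote = '"') {
    std::size_t rows = 0;
    bool inQuotes = false;
    bool rowHasBytes = false;
    for (std::size_t i = begin; i < end && i < data.size(); ++i) {
        const char c = data[i];
        if (c == quote) {
            inQuotes = !inQuotes;
        }
        if (c == '\n' && !inQuotes) {
            // Only a newline outside quotes terminates a logical row.
            ++rows;
            rowHasBytes = false;
        } else {
            rowHasBytes = true;
        }
    }
    // Count a final row that has no trailing newline.
    if (rowHasBytes) {
        ++rows;
    }
    return rows;
}
```

Because every range begins and ends on a row boundary, the per-range counts can be summed without coordination between workers, and the total matches a single pass over the whole buffer.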

Benchmark Results

Median of 3 runs. Dataset results came from /private/tmp/lbug_dataset_benchmark_counts_after_simd.csv.
The synthetic comparison came from /private/tmp/lbug_mode_benchmark_results.csv. In the dataset runs,
outputs were identical across modes for every case listed below.

| Dataset | File | Serial (s) | Parallel (s) | Parallel Speedup | Multiline Parallel (s) | Multiline Speedup | Output Match |
|---|---|---|---|---|---|---|---|
| amazon | dataset/snap/amazon0601/csv/amazon-edges.csv | 0.196 | 0.047 | 4.21x | 0.060 | 3.28x | Yes |
| twitter | dataset/snap/twitter/csv/twitter-edges.csv | 0.175 | 0.044 | 3.97x | 0.055 | 3.18x | Yes |
| comment | dataset/ldbc-sf01/Comment.csv | 0.067 | 0.033 | 2.05x | 0.038 | 1.76x | Yes |
| post | dataset/ldbc-sf01/Post.csv | 0.063 | 0.031 | 1.99x | 0.036 | 1.75x | Yes |
| facebook | dataset/snap/facebook/edges.csv | 0.072 | 0.031 | 2.35x | 0.037 | 1.93x | Yes |
| embeddings | dataset/embeddings/embeddings-960-1k.csv | 0.077 | 0.059 | 1.31x | 0.066 | 1.17x | Yes |

| Workload | Serial (s) | Parallel (s) | Parallel Status | Multiline Parallel (s) | Multiline Status | Notes |
|---|---|---|---|---|---|---|
| clean_256mb | 1.046 | 0.152 | Success | 0.203 | Success | Plain parallel is fastest on non-multiline input; multiline path still gives 5.16x over serial |
| multiline_256mb | 0.783 | 0.089 | Fails on quoted newlines | 0.148 | Success | Multiline parallel succeeds and is 5.30x faster than serial |

Optimize CSV boundary seed scanning

Add SIMD structural CSV parser path

Update dataset submodule
@adsharma

adsharma commented Jul 4, 2026

[edit: rebased and pushed]
@rahul-iyer could you rebase your dataset module against origin/main? That should fix the test failure.

@adsharma

adsharma commented Jul 5, 2026

Windows run with the proposed fix: https://github.com/LadybugDB/ladybug/actions/runs/28749519654

@adsharma adsharma merged commit 8ca0a4d into LadybugDB:main Jul 5, 2026
7 of 8 checks passed