Support quoted multiline fields in parallel CSV parsing by rahul-iyer · Pull Request #651 · LadybugDB/ladybug

rahul-iyer · 2026-07-04T01:16:52Z

This change adds a planned-range path for the parallel CSV reader so files with quoted newlines can
still use parallel parsing without splitting logical rows across workers.

The core change is a boundary-planning step that computes safe logical row ranges before worker parsing
begins. When MULTILINE_PARALLEL=true, the reader uses those planned ranges instead of naive fixed block
cuts. That lets workers parse independently while preserving correctness for multiline quoted fields.
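
To make the boundary-planning idea concrete, here is a minimal sketch of one way to compute safe row ranges. It is not the PR's `CSVBoundaryScanner::planFixedChunkOverlap(...)` implementation, whose code is not shown here. The names `RowRange` and `planRowRanges` are hypothetical, and the sketch assumes `"` is the quote character, `\n` terminates a row, and embedded quotes are escaped by doubling. It does a single serial pass, tracking whether the scan is inside a quoted field, and only cuts at a newline that lies outside quotes at or beyond the nominal chunk boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Half-open byte range [begin, end). Each range starts at a row start and
// ends just past a row-terminating newline (or at end of input).
struct RowRange {
    std::size_t begin;
    std::size_t end;
};

// Hypothetical helper for illustration only; not the LadybugDB API.
// Assumptions: '"' quotes fields, '\n' ends rows, embedded quotes are doubled.
std::vector<RowRange> planRowRanges(std::string_view data, std::size_t chunkSize) {
    assert(chunkSize > 0);
    std::vector<RowRange> ranges;
    bool inQuotes = false;
    std::size_t rangeBegin = 0;
    std::size_t nextCut = chunkSize;  // nominal fixed-block boundary
    for (std::size_t i = 0; i < data.size(); ++i) {
        const char c = data[i];
        if (c == '"') {
            // A doubled quote ("") toggles twice, leaving the state unchanged.
            inQuotes = !inQuotes;
        } else if (c == '\n' && !inQuotes && i + 1 >= nextCut) {
            // First unquoted newline at or past the nominal cut: safe to split.
            ranges.push_back({rangeBegin, i + 1});
            rangeBegin = i + 1;
            nextCut = rangeBegin + chunkSize;
        }
    }
    if (rangeBegin < data.size()) {
        ranges.push_back({rangeBegin, data.size()});  // trailing remainder
    }
    return ranges;
}
```

With the input `a,b\n1,"x\ny"\n2,z\n` and a chunk size of 6, a naive cut at byte 6 would split the second row, whereas this sketch defers the cut to the end of that row. The pass is serial for clarity; a real planner would presumably want something cheaper than a full scan.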

What changed

- Added MULTILINE_PARALLEL as a CSV reader option.
- Added boundary planning for parallel CSV using CSVBoundaryScanner::planFixedChunkOverlap(...).
- Added csv_boundary_scanner.cpp to the CSV reader build target.
- Updated ParallelCSVReader to:
  - switch into planned-range mode when multiline parallel parsing is enabled
  - seek to planned range starts instead of fixed block boundaries
  - allow quoted newlines in planned-range mode
  - use the structural parser on planned ranges when supported
- Kept the existing fixed-block path unchanged for non-multiline parallel reads.
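
As an illustration of why planned ranges let workers run independently, the sketch below counts logical rows inside one byte range while allowing quoted newlines. It is not `ParallelCSVReader`'s actual code: `countRowsInRange` is a hypothetical name, and the sketch assumes `"` quoting with doubled-quote escapes and `\n` row terminators. The key precondition is that `begin` sits at a row start outside any quoted field, which is what a planned range is meant to guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical per-worker routine for illustration only.
// Precondition (assumed): `begin` is a row start outside any quoted field.
std::size_t countRowsInRange(std::string_view data, std::size_t begin, std::size_t end) {
    assert(begin <= end && end <= data.size());
    std::size_t rows = 0;
    bool inQuotes = false;
    std::size_t rowStart = begin;
    for (std::size_t i = begin; i < end; ++i) {
        if (data[i] == '"') {
            inQuotes = !inQuotes;  // doubled quotes toggle twice and cancel out
        } else if (data[i] == '\n' && !inQuotes) {
            ++rows;                // a newline outside quotes ends a logical row
            rowStart = i + 1;
        }
    }
    if (rowStart < end) {
        ++rows;  // final row without a trailing newline
    }
    return rows;
}
```

For the input `a,b\n1,"x\ny"\n2,z\n`, the row-aligned ranges [0, 12) and [12, 16) yield 2 + 1 = 3 rows, matching a scan of the whole buffer. Naive fixed cuts [0, 6) and [6, 16) yield 2 + 2 = 4, because the second worker starts in the middle of a row and misreads the quote state.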

Benchmark Results

Median of 3 runs. Dataset results came from /private/tmp/lbug_dataset_benchmark_counts_after_simd.csv.
The synthetic comparison came from /private/tmp/lbug_mode_benchmark_results.csv. In the dataset runs,
outputs were identical across modes for every case listed below.

| Dataset | File | Serial (s) | Parallel (s) | Parallel Speedup | Multiline Parallel (s) | Multiline Speedup | Output Match |
| --- | --- | --- | --- | --- | --- | --- | --- |
| amazon | `dataset/snap/amazon0601/csv/amazon-edges.csv` | 0.196 | 0.047 | 4.21x | 0.060 | 3.28x | Yes |
| twitter | `dataset/snap/twitter/csv/twitter-edges.csv` | 0.175 | 0.044 | 3.97x | 0.055 | 3.18x | Yes |
| comment | `dataset/ldbc-sf01/Comment.csv` | 0.067 | 0.033 | 2.05x | 0.038 | 1.76x | Yes |
| post | `dataset/ldbc-sf01/Post.csv` | 0.063 | 0.031 | 1.99x | 0.036 | 1.75x | Yes |
| facebook | `dataset/snap/facebook/edges.csv` | 0.072 | 0.031 | 2.35x | 0.037 | 1.93x | Yes |
| embeddings | `dataset/embeddings/embeddings-960-1k.csv` | 0.077 | 0.059 | 1.31x | 0.066 | 1.17x | Yes |

| Workload | Serial (s) | Parallel (s) | Parallel Status | Multiline Parallel (s) | Multiline Status | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| `clean_256mb` | 1.046 | 0.152 | Success | 0.203 | Success | Plain parallel is fastest on non-multiline input; multiline path still gives 5.16x over serial |
| `multiline_256mb` | 0.783 | 0.089 | Fails on quoted newlines | 0.148 | Success | Multiline parallel succeeds and is 5.30x faster than serial |


adsharma · 2026-07-04T18:13:56Z

[Edit: rebased and pushed.]
@rahul-iyer could you rebase your dataset submodule against origin/main? That should fix the test failure.

adsharma · 2026-07-05T16:31:00Z

@rahul-iyer only one failure left to fix, on Windows:

https://github.com/LadybugDB/ladybug/actions/runs/28725297352/job/85236841991

adsharma · 2026-07-05T16:58:58Z

Windows run with the proposed fix: https://github.com/LadybugDB/ladybug/actions/runs/28749519654

Commits

- 4567150 Support quoted multiline fields in parallel CSV parsing
  (Optimize CSV boundary seed scanning; add SIMD structural CSV parser path; update dataset submodule)
- 9e5d9aa fix: the failing tests due to missing newline
- 6d34f93 fix: use _BitScanForward on windows
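
The last commit title mentions using `_BitScanForward` on Windows. The PR's actual change is not shown here, so the following is only a sketch of the usual portability pattern: MSVC does not provide the GCC/Clang builtin `__builtin_ctz`, so a wrapper selects the intrinsic per compiler. The names `lowestSetBit` and `setBitPositions` are hypothetical, and the assumption that the non-Windows path used `__builtin_ctz` is mine, not the source's.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Hypothetical wrapper: index of the lowest set bit of a non-zero 32-bit mask.
inline unsigned lowestSetBit(std::uint32_t mask) {
    assert(mask != 0);  // both intrinsics are undefined for a zero input
#if defined(_MSC_VER)
    unsigned long index = 0;
    _BitScanForward(&index, mask);  // MSVC intrinsic declared in <intrin.h>
    return static_cast<unsigned>(index);
#else
    return static_cast<unsigned>(__builtin_ctz(mask));  // GCC/Clang builtin
#endif
}

// Collect the positions of all set bits, e.g. the newline or quote positions
// encoded in a 32-byte SIMD comparison mask.
inline std::vector<unsigned> setBitPositions(std::uint32_t mask) {
    std::vector<unsigned> positions;
    while (mask != 0) {
        positions.push_back(lowestSetBit(mask));
        mask &= mask - 1;  // clear the lowest set bit
    }
    return positions;
}
```

A structural scanner typically walks such masks bit by bit, which is why a missing count-trailing-zeros primitive shows up as a platform-specific build or test failure.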

adsharma merged commit 8ca0a4d into LadybugDB:main Jul 5, 2026
7 of 8 checks passed

adsharma merged 3 commits into LadybugDB:main from rahul-iyer:parallel-csv-multiline-ranges-clean
