Skip to content

perf(arrow-ipc): Add writer benchmarks for dictionaries#10122

Open
JakeDern wants to merge 1 commit into
apache:mainfrom
JakeDern:ipc-writer-dict-benches
Open

perf(arrow-ipc): Add writer benchmarks for dictionaries#10122
JakeDern wants to merge 1 commit into
apache:mainfrom
JakeDern:ipc-writer-dict-benches

Conversation

@JakeDern

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

This PR adds writer benchmarks for dictionaries so that we can measure the performance impact of code changes on those code paths.

What changes are included in this PR?

Three new benchmarks:

  • StreamWriter benchmark for dictionaries
  • StreamWriter benchmark for delta dictionaries
  • FileWriter benchmark for delta dictionaries

Are these changes tested?

Yes, just benchmarks included which I ran locally.

Are there any user-facing changes?

No.

@github-actions github-actions Bot added the arrow Changes to the arrow crate label Jun 11, 2026
@JakeDern JakeDern marked this pull request as ready for review June 11, 2026 16:52
@JakeDern

Copy link
Copy Markdown
Contributor Author

CC: @alamb @Rich-T-kid

@Rich-T-kid

Copy link
Copy Markdown
Contributor

taking a look 👀

@Rich-T-kid Rich-T-kid left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JakeDern the PR looks good to me. I have some comments but they arent blocking. 🚀 nice work

for i in 0..n {
let mut builder = StringDictionaryBuilder::<UInt32Type>::new();
for r in 0..num_rows {
builder.append_value(format!("batch {i} value {}", r % (num_rows / 2)));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For an 8k batch size there will be 4k unique values, and for 64k there will be 32k unique values. I don't think this alone is the right way to benchmark dictionaries. Dictionaries are focused on low cardinality, so it makes more sense to parameterize benchmarks by target cardinality. For example, (5%, 10%, 25%, 50%) unique values relative to batch size.
The implementation should also grow with the number of unique values, since that's the point of the IPC format in delta mode:

A dictionary batch with isDelta set indicates that its vector should be concatenated with those of any previous batches with the same id.

Varying cardinality helps detect whether the encoder is doing O(N) work (proportional to total rows) when it should be doing O(K) work (proportional to unique values).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the thoughtful review!

Varying cardinality helps detect whether the encoder is doing O(N) work (proportional to total rows) when it should be doing O(K) work (proportional to unique values).

I would clarify that we are doing O(K + N) work as there is still per row overhead in encoding the dictionary keys as well as per unique value cost in encoding the dictionary values.

I agree that benchmarks varying the amount of unique values could yield useful information, but these simple benchmarks can still answer whether we've meaningfully changed the amount of work required to emit dictionary batches.

I think until those additional parameters are needed to demonstrate something the current set cannot, I favor the simplicity of omitting them.

Just my .02 of course, happy to take more feedback on this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense to me.

for i in 0..n {
// 3/4 of the rows reuse values shared by every batch, the other 1/4
// introduce values unique to this batch which extends the dictionary.
for r in 0..num_rows {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment to before.

@Rich-T-kid

Copy link
Copy Markdown
Contributor

a possible minor/medium optimization for the uncompressed case came to mind & I remembered the current benchmarks dont cover this.
@JakeDern what do you think about adding an identical case to

group.bench_function("StreamWriter/write_10/zstd", |b| {
        let batch = create_batch(8192, true);
        let mut buffer = Vec::with_capacity(2 * 1024 * 1024);
        b.iter(move || {
            buffer.clear();
            let options = IpcWriteOptions::default()
                .try_with_compression(Some(CompressionType::ZSTD))
                .unwrap();
            let mut writer =
                StreamWriter::try_new_with_options(&mut buffer, batch.schema().as_ref(), options)
                    .unwrap();
            for _ in 0..10 {
                writer.write(&batch).unwrap();
            }
            writer.finish().unwrap();
        })
    });

but with try_with_compression set to None. It may make sense to actually refactor the StreamWriter benchmarks to iterate over the possible compression types.
I would add it but that would come with possible merge conflicts with this change. If not thats fine I can submit a PR.

@JakeDern

Copy link
Copy Markdown
Contributor Author

a possible minor/medium optimization for the uncompressed case came to mind & I remembered the current benchmarks dont cover this. @JakeDern what do you think about adding an identical case to

group.bench_function("StreamWriter/write_10/zstd", |b| {
        let batch = create_batch(8192, true);
        let mut buffer = Vec::with_capacity(2 * 1024 * 1024);
        b.iter(move || {
            buffer.clear();
            let options = IpcWriteOptions::default()
                .try_with_compression(Some(CompressionType::ZSTD))
                .unwrap();
            let mut writer =
                StreamWriter::try_new_with_options(&mut buffer, batch.schema().as_ref(), options)
                    .unwrap();
            for _ in 0..10 {
                writer.write(&batch).unwrap();
            }
            writer.finish().unwrap();
        })
    });

but with try_with_compression set to None. It may make sense to actually refactor the StreamWriter benchmarks to iterate over the possible compression types. I would add it but that would come with possible merge conflicts with this change. If not thats fine I can submit a PR.

I think this benchmark is covering that case:

group.bench_function("StreamWriter/write_10", |b| {
let batch = create_batch(8192, true);
let mut buffer = Vec::with_capacity(2 * 1024 * 1024);
b.iter(move || {
buffer.clear();
let mut writer = StreamWriter::try_new(&mut buffer, batch.schema().as_ref()).unwrap();
for _ in 0..10 {
writer.write(&batch).unwrap();
}
writer.finish().unwrap();
})
});

The default StreamWriter::try_new uses the default IpcWriteOptions which sets batch_compression_type: None. Let me know if I'm misunderstanding though.

@Rich-T-kid

Rich-T-kid commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

yea your right, source.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

arrow-ipc: Extend writer benchmarks to include dictionaries

2 participants