
[WIP] fix: read Dict(FixedSizeBinary) from parquet without dict encoding#10232

Draft
Jefffrey wants to merge 1 commit into apache:main from Jefffrey:dict-fsb-parquet-rt


@Jefffrey

Contributor

Which issue does this PR close?

Rationale for this change

If a Dictionary(FixedSizeBinary) column was written to a parquet file with dictionary encoding turned off (so the values were encoded only as PLAIN), reading that file back into a Dictionary(FixedSizeBinary) led to an unavoidable panic, so a roundtrip was not possible. This PR fixes that bug.

What changes are included in this PR?

Coerce the binary values to FixedSizeBinary when reading non-dictionary-encoded values, so that the array is constructed with the buffer layout its data type requires.
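
As a rough illustration of the coercion described above, here is a std-only sketch. The `OffsetBytes` and `FixedBytes` types and the `coerce_to_fixed` function are hypothetical stand-ins invented for this example, not types from the arrow or parquet crates. It assumes there are no null slots and that the offsets start at 0.

```rust
/// Simplified stand-in for a variable-length binary array:
/// `offsets[i]..offsets[i + 1]` delimits slot `i` inside `values`.
/// Hypothetical type, not the arrow crate's.
pub struct OffsetBytes {
    pub offsets: Vec<i32>,
    pub values: Vec<u8>,
}

/// Simplified stand-in for a fixed-size binary array: a values buffer only,
/// where slot `i` occupies `values[i * size..(i + 1) * size]`.
#[derive(Debug, PartialEq)]
pub struct FixedBytes {
    pub size: i32,
    pub values: Vec<u8>,
}

/// Reinterprets offset-delimited bytes as fixed-width bytes.
/// Assumes no null slots; fails if any slot is not exactly `size` bytes wide.
pub fn coerce_to_fixed(input: OffsetBytes, size: i32) -> Result<FixedBytes, String> {
    if size <= 0 {
        return Err(format!("size must be positive, got {size}"));
    }
    // Every consecutive pair of offsets must span exactly `size` bytes.
    for (i, pair) in input.offsets.windows(2).enumerate() {
        let width = pair[1] - pair[0];
        if width != size {
            return Err(format!("slot {i} has {width} bytes, expected {size}"));
        }
    }
    // The values buffer must hold exactly `slots * size` bytes.
    let slots = input.offsets.len().saturating_sub(1);
    if input.values.len() != slots * size as usize {
        return Err(format!(
            "values buffer has {} bytes, expected {}",
            input.values.len(),
            slots * size as usize
        ));
    }
    // The offsets are dropped: the fixed-width layout does not need them.
    Ok(FixedBytes { size, values: input.values })
}
```

The point of the sketch is that the values bytes are reused unchanged and only the offsets are discarded, which mirrors the shape of building a FixedSizeBinaryArray from `binary.values()` in the review comment further down.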

Are these changes tested?

Yes, added new tests.

Are there any user-facing changes?

No.

@github-actions (bot) added the parquet (Changes to the parquet crate) label Jun 29, 2026
let data = DictionaryArray::<K>::new(keys, Arc::new(values));
let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(data)]).unwrap();
let data = Arc::new(DictionaryArray::<K>::new(keys, Arc::new(values))) as ArrayRef;
one_column_roundtrip(Arc::clone(&data), true);

Contributor Author


Taken from the reproduction: added a one_column_roundtrip() call here to test the roundtrip, where one of the cases disables dictionary encoding entirely.

The rest of the test changes are just restructuring.

ArrowType::Dictionary(k, v) => (k, v.as_ref().clone()),
_ => unreachable!(),
};
let array = if let ArrowType::FixedSizeBinary(size) = value_type {

Contributor Author


Matching what was done above in the Self::Dict path:

let values = if let ArrowType::FixedSizeBinary(size) = **value_type {
    let binary = values.as_binary::<i32>();
    Arc::new(FixedSizeBinaryArray::new(
        size,
        binary.values().clone(),
        binary.nulls().cloned(),
    )) as _
} else {
    values
};

This is because values.into_array() assumes a generic byte array with an offsets buffer:

/// Converts this into an [`ArrayRef`] with the provided `data_type` and `null_buffer`
pub fn into_array(self, null_buffer: Option<Buffer>, data_type: ArrowType) -> ArrayRef {
    let array_data_builder = ArrayDataBuilder::new(data_type)
        .len(self.len())
        .add_buffer(Buffer::from_vec(self.offsets))
        .add_buffer(Buffer::from_vec(self.values))
        .null_bit_buffer(null_buffer);
    let data = match cfg!(debug_assertions) {
        true => array_data_builder.build().unwrap(),
        false => unsafe { array_data_builder.build_unchecked() },
    };
    make_array(data)
}

That is the wrong structure for a FixedSizeBinary array (it has no offsets buffer, etc.).
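
To make the mismatch concrete, here is a small std-only model of the buffer-count check. The `Layout` enum and both functions are hypothetical and exist only for this illustration; the real validation lives in the arrow crate and covers far more than buffer counts. The error string is shaped after the message in the linked issue.

```rust
/// Hypothetical, simplified view of two Arrow physical layouts.
#[derive(Debug, Clone, Copy)]
pub enum Layout {
    /// Variable-length binary: an offsets buffer plus a values buffer.
    Binary,
    /// Fixed-size binary of the given byte width: a values buffer only.
    FixedSizeBinary(i32),
}

/// Number of data buffers each layout carries (the validity bitmap is separate).
pub fn expected_buffers(layout: Layout) -> usize {
    match layout {
        Layout::Binary => 2,
        Layout::FixedSizeBinary(_) => 1,
    }
}

/// Rejects a buffer list whose length does not match the layout.
pub fn check_buffers(layout: Layout, buffers: &[Vec<u8>]) -> Result<(), String> {
    let expected = expected_buffers(layout);
    if buffers.len() != expected {
        return Err(format!(
            "Expected {} buffers in array of type {:?}, got {}",
            expected,
            layout,
            buffers.len()
        ));
    }
    Ok(())
}
```

Passing an offsets buffer and a values buffer together with `Layout::FixedSizeBinary(4)` fails this check, which is the same kind of failure that into_array() runs into when it always adds both buffers.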

@Jefffrey Jefffrey marked this pull request as draft June 29, 2026 03:10
@Jefffrey Jefffrey changed the title fix: read Dict(FixedSizeBinary) from parquet without dict encoding [WIP] fix: read Dict(FixedSizeBinary) from parquet without dict encoding Jun 29, 2026


Development

Successfully merging this pull request may close these issues.

InvalidArgumentError("Expected 1 buffers in array of type FixedSizeBinary(4), got 2"
