feat: add bucket and level statistics to ManifestFileMeta #350
Merged
JingsongLi merged 1 commit on Jun 2, 2026
`paimon-rust`'s `MANIFEST_FILE_META_SCHEMA` was missing the four bucket and level statistics fields added to `org.apache.paimon.manifest.ManifestFileMeta.SCHEMA` in apache/paimon#5345 (Mar 2025). Manifests written by `paimon-rust` therefore carried no `_MIN_BUCKET`, `_MAX_BUCKET`, `_MIN_LEVEL`, or `_MAX_LEVEL` summaries.

The Java reader handles their absence gracefully (`Integer != null` guards short-circuit each pruning branch in `ManifestsReader.filterManifestFileMeta`), so this is not a correctness bug, but it disables three optimizations on tables originating from `paimon-rust`:

- `specifiedBucket` pruning (bucket-targeted Spark/Flink scans read every manifest instead of filtering by bucket range);
- `levelMinMaxFilter` pruning (compaction reads extra manifests);
- `onlyReadRealBuckets` (commit-time conflict checks no longer skip virtual-bucket manifests).

The fix adds four `Option<i32>` fields to `ManifestFileMeta`, extends the Avro schema with `["null", "int"]` unions defaulting to `null` (in the same order as Java), aggregates min/max from manifest entries at write time via a new `with_bucket_level_stats` chain method, and surfaces the values through the manual Avro decoder. Serializer version stays at `2`. Java did not bump it either, since the fields are optional and back-compat is guaranteed by `default: null`.

Tests cover three angles:

- A round-trip through `ManifestList::write`/`read` preserves explicit bucket/level values.
- A real `TableCommit::commit` with messages on multiple buckets / levels produces the correct aggregate in the manifest list.
- A manifest list written in the pre-5345 schema (no bucket/level fields) still decodes, with the new getters returning `None` instead of failing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
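The write-time aggregation described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `Entry`, `Meta`, and `attach_stats` are hypothetical, simplified stand-ins (the real `ManifestFileMeta` and manifest entry types carry many more fields), and only the `with_bucket_level_stats` name and its four-argument shape are taken from the description above.

```rust
/// Hypothetical, simplified stand-in for a manifest entry.
#[derive(Debug, Clone, Copy)]
pub struct Entry {
    pub bucket: i32,
    pub level: i32,
}

/// Hypothetical, simplified stand-in for `ManifestFileMeta`.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct Meta {
    pub file_name: String,
    pub min_bucket: Option<i32>,
    pub max_bucket: Option<i32>,
    pub min_level: Option<i32>,
    pub max_level: Option<i32>,
}

impl Meta {
    /// Stats default to `None`, i.e. "no information".
    pub fn new(file_name: &str) -> Self {
        Meta {
            file_name: file_name.to_string(),
            ..Default::default()
        }
    }

    /// Chain method that attaches the four summaries to an existing value.
    pub fn with_bucket_level_stats(
        mut self,
        min_bucket: Option<i32>,
        max_bucket: Option<i32>,
        min_level: Option<i32>,
        max_level: Option<i32>,
    ) -> Self {
        self.min_bucket = min_bucket;
        self.max_bucket = max_bucket;
        self.min_level = min_level;
        self.max_level = max_level;
        self
    }
}

/// Aggregates min/max over the entries. `Iterator::min`/`max` return `None`
/// for an empty slice, so an empty entry list leaves all four stats `None`.
pub fn attach_stats(meta: Meta, entries: &[Entry]) -> Meta {
    meta.with_bucket_level_stats(
        entries.iter().map(|e| e.bucket).min(),
        entries.iter().map(|e| e.bucket).max(),
        entries.iter().map(|e| e.level).min(),
        entries.iter().map(|e| e.level).max(),
    )
}
```

Relying on `Iterator::min` and `Iterator::max` keeps the empty-list case free of special handling, which lines up with the "all four stay `None`" behaviour the PR describes.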
JingsongLi (Contributor) approved these changes on Jun 2, 2026 and left a comment:
LGTM. I checked the schema/order against current Java ManifestFileMeta, the fast Avro decode path, and the manifest write aggregation. The new fields are nullable and preserve legacy reads as expected.
Verification: cargo test -p paimon passes locally on this PR.
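For readers unfamiliar with the nullable-union shape the review refers to, the sketch below renders one such Avro field declaration. It is illustrative only: `nullable_int_field` and `NEW_FIELDS` are hypothetical helpers, and the exact JSON text inside `MANIFEST_FILE_META_SCHEMA` may be laid out differently. The field names, the `["null", "int"]` union, and the `null` default come from the PR description.

```rust
/// The four new field names, in the order given in the PR description.
pub const NEW_FIELDS: [&str; 4] = ["_MIN_BUCKET", "_MAX_BUCKET", "_MIN_LEVEL", "_MAX_LEVEL"];

/// Renders an Avro record field whose type is a `["null", "int"]` union.
/// Avro requires a union's default to match its first branch, so `null`
/// must be listed first for `"default": null` to be valid.
pub fn nullable_int_field(name: &str) -> String {
    format!(r#"{{"name": "{name}", "type": ["null", "int"], "default": null}}"#)
}

/// Renders all four declarations as a comma-separated fragment.
pub fn new_fields_fragment() -> String {
    NEW_FIELDS
        .iter()
        .map(|name| nullable_int_field(name))
        .collect::<Vec<_>>()
        .join(",\n")
}
```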
Purpose
`paimon-rust`'s `MANIFEST_FILE_META_SCHEMA` declares only 9 fields, while the upstream Java `org.apache.paimon.manifest.ManifestFileMeta.SCHEMA` declares 12. The four bucket / level pruning fields added in apache/paimon#5345 (`_MIN_BUCKET`, `_MAX_BUCKET`, `_MIN_LEVEL`, `_MAX_LEVEL`) are absent from both the Avro schema and the Rust struct. Manifests written by `paimon-rust` therefore carry no bucket / level summary.

This is not a correctness bug: the Java reader's pruning code in `ManifestsReader.filterManifestFileMeta` guards every branch with `Integer != null`, so missing fields just short-circuit the optimization. But it silently disables three production optimizations on tables originating from `paimon-rust`:

- `specifiedBucket` pruning: bucket-targeted Spark/Flink scans (`ReadBuilder.withBucket(...)`, bucketed joins, runtime filter pushdown) fall back to reading every manifest instead of filtering by bucket range.
- `levelMinMaxFilter` pruning: Java's compaction (`CompactAction` / minor compaction) reads extra manifests that could be skipped.
- `onlyReadRealBuckets`: commit-time `StrictModeChecker` no longer skips virtual / index-only manifests (those with negative bucket numbers).

Brief change log
- Add four `Option<i32>` fields to `ManifestFileMeta` (`crates/paimon/src/spec/manifest_file_meta.rs`) with `#[serde(rename, default, skip_serializing_if = "Option::is_none")]`, mirroring the existing `min_row_id` / `max_row_id` pattern, plus getters.
- Add a `with_bucket_level_stats(min_bucket, max_bucket, min_level, max_level)` chain method for writers that already have a constructed `ManifestFileMeta`.
- Extend `MANIFEST_FILE_META_SCHEMA` with the four `["null", "int"]` Avro fields, inserted between `_SCHEMA_ID` and `_MIN_ROW_ID` to match the Java field order.
- Update the manual Avro decoder (`crates/paimon/src/spec/avro/manifest_file_meta_decode.rs`) to read the new fields when present (otherwise they fall through the `_ => skip_nullable_field` arm as `None`).
- Aggregate min/max bucket and level from the manifest entries in `TableCommit::write_manifest_file` (`crates/paimon/src/table/table_commit.rs`) and attach them via the new chain method. When the entry list is empty, all four stay `None`, matching the shape of files written before this change.
- Extend `new_with_version` to accept the new positional `Option<i32>` arguments; `new` still defaults them to `None`, so non-writer call sites (tests, `objects_file.rs`) need no churn beyond passing `None` through `new_with_version`.

Serializer version is intentionally not bumped: it stays at `2`. Java did the same in apache/paimon#5345 because the fields are nullable with `default: null`, so old and new files coexist.

Tests
Three new tests:

- `spec::manifest_list::tests::test_manifest_list_roundtrip_preserves_bucket_level_stats`: writes a manifest list with explicit bucket / level values via `ManifestList::write` and asserts they survive the Avro round-trip through `ManifestList::read`.
- `table::table_commit::tests::test_commit_writes_bucket_and_level_stats_into_manifest_list`: drives a real `TableCommit::commit` with messages spanning buckets `[0, 3]` and levels `[0, 2]`, then reads back the manifest list and asserts the aggregates. Exercises the full plumbing from `CommitMessage` through `messages_to_entries` to `write_manifest_file`.
- `spec::manifest_list::tests::test_manifest_list_decodes_legacy_without_bucket_level_fields`: fabricates a manifest list written under the pre-5345 Avro schema (no bucket / level fields) and asserts it decodes cleanly with the new getters returning `None`. Pins the back-compat contract.

Local verification:

- `cargo build -p paimon`: ok
- `cargo test -p paimon --lib`: 675 passed, 0 failed
- `cargo clippy -p paimon --all-targets -- -D warnings`: clean
- `cargo fmt --check`: clean

API and Format
On-disk format: extends `MANIFEST_FILE_META_SCHEMA` with four optional Avro fields. Old `paimon-rust` files decode unchanged (missing fields → `None`); new files are readable by Java (which already has these fields) and by any reader that follows Avro's nullable-union default semantics. Serializer version stays at `2`.

Public API:

- `ManifestFileMeta` gains four `Option<i32>` getters (`min_bucket`, `max_bucket`, `min_level`, `max_level`) and a `with_bucket_level_stats(...)` chain method.
- `ManifestFileMeta::new(...)` signature unchanged; new fields default to `None`.
- `ManifestFileMeta::new_with_version(...)` signature gains four `Option<i32>` arguments (positional, after `schema_id`, before `min_row_id`). This is `pub(crate)`; the only external impact is keeping the Avro decoder in sync.

Documentation
Rustdoc comments on the new fields and `with_bucket_level_stats` explain the back-compat semantics (`None` = stats absent, treat as "no information") and reference apache/paimon#5345 for the upstream change. No prose docs need updating.
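
The "`None` means no information" contract can be illustrated from the reader's side with a small sketch. This is not the Java `ManifestsReader` logic or any code from this PR: `BucketStats` and `may_contain_bucket` are hypothetical names, and the check is reduced to the bucket-range case only.

```rust
/// Hypothetical, simplified stand-in for the bucket summary of one manifest.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BucketStats {
    pub min_bucket: Option<i32>,
    pub max_bucket: Option<i32>,
}

/// Returns `true` when a manifest may contain `bucket` and so must be read.
/// A manifest is pruned only when both bounds are present and exclude the
/// bucket; absent stats never cause a skip, so legacy files stay correct.
pub fn may_contain_bucket(stats: BucketStats, bucket: i32) -> bool {
    match (stats.min_bucket, stats.max_bucket) {
        (Some(min), Some(max)) => min <= bucket && bucket <= max,
        // Stats absent (for example, a file written before the new fields
        // existed): no information, so the manifest cannot be pruned.
        _ => true,
    }
}
```

Under this reading, a manifest without stats is always scanned, which is why their absence costs performance rather than correctness.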