fix(transaction): de-duplicate file paths within a FastAppend batch#2509
fix(transaction): de-duplicate file paths within a FastAppend batch#2509SreeramGarlapati wants to merge 2 commits into
Conversation
xanderbailey
left a comment
There was a problem hiding this comment.
Nice PR, thanks for the contribution, left a couple of comments / questions
| } | ||
| if !duplicates_in_batch.is_empty() { | ||
| let mut paths: Vec<&str> = duplicates_in_batch.into_iter().collect(); | ||
| paths.sort_unstable(); |
There was a problem hiding this comment.
Curious to know if we need the sort?
There was a problem hiding this comment.
moot after blackmwk's fail-fast suggestion (below) - we now report just the first dup, no list to sort.
blackmwk
left a comment
There was a problem hiding this comment.
Thanks @SreeramGarlapati for this pr!
c6f1688 to
eb8d123
Compare
|
thanks for the reviews @xanderbailey and @blackmwk - addressed all ur comments. |
| } | ||
| } | ||
|
|
||
| pub(super) fn check_no_duplicate_paths_in_batch(files: &[DataFile]) -> Result<()> { |
There was a problem hiding this comment.
Why putting it in SnapshotProducer? I think we should just put it in FastAppendAction.
| #[async_trait] | ||
| impl TransactionAction for FastAppendAction { | ||
| async fn commit(self: Arc<Self>, table: &Table) -> Result<ActionCommit> { | ||
| if self.check_duplicate { |
There was a problem hiding this comment.
I think we should always add this check. The check_duplicate flag is used to save checking against existing data files in snapshot since it may introduce extra io. This is not the case for the newly added data files, and if we don't do the check, it's a bug.
|
Nice catch on the The batch-internal check probably shouldn't be gated on Worth flagging that Scope is |
|
Changed direction after checking the Java reference: Iceberg Java doesn't error on a repeated path within a batch, it de-duplicates by file location ( @blackmwk the logic now lives in @viirya MergeAppend / overwrite: keeping that as a separate follow-up, out of scope here. |
A FastAppend batch carrying the same file path twice writes both entries into the manifest today, so scans double-count rows and the snapshot summary is inflated. Drop the duplicates, keeping the first occurrence, matching Iceberg Java (DataFileSet, keyed on file location). The dedupe is in-memory and always runs, independent of check_duplicate (which gates only the cross-snapshot I/O check). Co-authored-by: Aman Rawat <rawataaryan09@gmail.com>
viirya
left a comment
There was a problem hiding this comment.
The direction change since my last comment makes sense, and I verified it against the Java reference: DataFileSet (org.apache.iceberg.util.DataFileSet) is keyed purely on file.location(), FastAppend.appendFile does dataFiles.add(file) and silently skips on collision (no throw), and it keeps the first occurrence via LinkedHashSet semantics. MergeAppend behaves identically. So dedupe-first-wins is the faithful behavior, not fail-fast — good call checking the reference.
The implementation matches that precisely: dedupe_added_files filters on seen.insert(file_path), keeping the first and dropping later collisions, and it runs unconditionally (addressing @blackmwk's append.rs:88 point and my earlier note — intra-batch dedupe is pure in-memory work, unrelated to the I/O-cost rationale behind check_duplicate). test_fast_append_dedupes_intra_batch_duplicate_paths even pins the first-wins semantics by asserting the kept file's file_size_in_bytes == 100 out of 100/200/300.
I pulled the branch and ran it: all 8 transaction::append tests pass (including my earlier delete-only-manifest regression test, no regression), clippy and rustfmt clean.
One thing worth surfacing for the record, not a blocker: dedupe-by-path means two DataFiles that share a file_path but differ in metadata (size, record_count) collapse to the first, silently. That's exactly what Java does, and it's the right call, but it's a quieter resolution than the "reject" framing in #2507 — so anyone reading #2507 later should know the fix intentionally became "drop the duplicate" rather than "error out". The PR description already explains this well.
LGTM.
Closes #2507.
A FastAppend batch carrying the same file path twice writes both entries into the manifest today, so scans double-count rows and the snapshot summary is inflated.
This drops the duplicates and keeps the first occurrence, matching Iceberg Java, which de-duplicates by file location via
DataFileSet(used inFastAppend). The dedupe is in-memory and always runs, independent ofcheck_duplicate(which only gates the I/O-bound cross-snapshot check).Tested: a batch with repeated paths now commits a single manifest entry, with and without
check_duplicate.