Skip to content

fix(transaction): de-duplicate file paths within a FastAppend batch#2509

Open
SreeramGarlapati wants to merge 2 commits into
apache:mainfrom
SreeramGarlapati:main
Open

fix(transaction): de-duplicate file paths within a FastAppend batch#2509
SreeramGarlapati wants to merge 2 commits into
apache:mainfrom
SreeramGarlapati:main

Conversation

@SreeramGarlapati

@SreeramGarlapati SreeramGarlapati commented May 26, 2026

Copy link
Copy Markdown
Contributor

Closes #2507.

A FastAppend batch carrying the same file path twice writes both entries into the manifest today, so scans double-count rows and the snapshot summary is inflated.

This drops the duplicates and keeps the first occurrence, matching Iceberg Java, which de-duplicates by file location via DataFileSet (used in FastAppend). The dedupe is in-memory and always runs, independent of check_duplicate (which only gates the I/O-bound cross-snapshot check).

Tested: a batch with repeated paths now commits a single manifest entry, with and without check_duplicate.

@SreeramGarlapati

Copy link
Copy Markdown
Contributor Author

Hi @blackmwk — appreciate a quick review on this when you have a moment. Short, well-scoped fix (one function plus two tests); context in the linked #2507.

@xanderbailey xanderbailey left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR, thanks for the contribution, left a couple of comments / questions

}
if !duplicates_in_batch.is_empty() {
let mut paths: Vec<&str> = duplicates_in_batch.into_iter().collect();
paths.sort_unstable();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious to know if we need the sort?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moot after blackmwk's fail-fast suggestion (below) - we now report just the first dup, no list to sort.

Comment thread crates/iceberg/src/transaction/snapshot.rs Outdated

@blackmwk blackmwk left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @SreeramGarlapati for this pr!

Comment thread crates/iceberg/src/transaction/snapshot.rs Outdated
@SreeramGarlapati SreeramGarlapati force-pushed the main branch 4 times, most recently from c6f1688 to eb8d123 Compare May 30, 2026 02:17
@SreeramGarlapati

Copy link
Copy Markdown
Contributor Author

thanks for the reviews @xanderbailey and @blackmwk - addressed all ur comments.
pl. re-review now.

}
}

pub(super) fn check_no_duplicate_paths_in_batch(files: &[DataFile]) -> Result<()> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why putting it in SnapshotProducer? I think we should just put it in FastAppendAction.

#[async_trait]
impl TransactionAction for FastAppendAction {
async fn commit(self: Arc<Self>, table: &Table) -> Result<ActionCommit> {
if self.check_duplicate {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should always add this check. The check_duplicate flag is used to save checking against existing data files in snapshot since it may introduce extra io. This is not the case for the newly added data files, and if we don't do the check, it's a bug.

@viirya

viirya commented May 31, 2026

Copy link
Copy Markdown
Member

Nice catch on the HashSet-eats-the-duplicates bug — the fail-fast rewrite reads cleanly. Two things before this lands, both building on @blackmwk's open comments:

The batch-internal check probably shouldn't be gated on check_duplicate. I agree with @blackmwk's point on append.rs:88. The check_duplicate flag exists to let callers skip the cross-snapshot validation, which costs manifest I/O. The intra-batch check is pure in-memory work with no such cost, and skipping it means commit() can silently write a manifest that references the same file_path twice — i.e. exactly the corruption this PR is meant to prevent. I'd run check_no_duplicate_paths_in_batch unconditionally.

Worth flagging that test_fast_append_intra_batch_duplicate_check_can_be_disabled currently locks in that behavior — it asserts two identical paths commit successfully when the flag is off. If the check becomes unconditional, that test's assertion should flip to expecting an error.

Scope is FastAppendAction only. This fixes the fast-append path; it'd be worth a quick follow-up to confirm MergeAppend / overwrite don't share the same latent issue (they may route through different validation). Not blocking this PR.

@SreeramGarlapati SreeramGarlapati changed the title fix(transaction): detect duplicate file paths within a single FastAppend batch fix(transaction): de-duplicate file paths within a FastAppend batch Jun 18, 2026
@SreeramGarlapati

Copy link
Copy Markdown
Contributor Author

Changed direction after checking the Java reference: Iceberg Java doesn't error on a repeated path within a batch, it de-duplicates by file location (DataFileSet, keyed on file.location()). So rather than fail the commit, this now drops the duplicate and keeps the first occurrence. More forgiving for retry / file-discovery paths that can legitimately hand the same file twice, and it still removes the double-count corruption #2507 is about.

@blackmwk the logic now lives in FastAppendAction as you suggested, and isn't gated on check_duplicate (your append.rs:88 note + @viirya) since it's pure in-memory work.

@viirya MergeAppend / overwrite: keeping that as a separate follow-up, out of scope here.

A FastAppend batch carrying the same file path twice writes both entries into
the manifest today, so scans double-count rows and the snapshot summary is
inflated. Drop the duplicates, keeping the first occurrence, matching Iceberg
Java (DataFileSet, keyed on file location). The dedupe is in-memory and always
runs, independent of check_duplicate (which gates only the cross-snapshot I/O check).

Co-authored-by: Aman Rawat <rawataaryan09@gmail.com>

@viirya viirya left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The direction change since my last comment makes sense, and I verified it against the Java reference: DataFileSet (org.apache.iceberg.util.DataFileSet) is keyed purely on file.location(), FastAppend.appendFile does dataFiles.add(file) and silently skips on collision (no throw), and it keeps the first occurrence via LinkedHashSet semantics. MergeAppend behaves identically. So dedupe-first-wins is the faithful behavior, not fail-fast — good call checking the reference.

The implementation matches that precisely: dedupe_added_files filters on seen.insert(file_path), keeping the first and dropping later collisions, and it runs unconditionally (addressing @blackmwk's append.rs:88 point and my earlier note — intra-batch dedupe is pure in-memory work, unrelated to the I/O-cost rationale behind check_duplicate). test_fast_append_dedupes_intra_batch_duplicate_paths even pins the first-wins semantics by asserting the kept file's file_size_in_bytes == 100 out of 100/200/300.

I pulled the branch and ran it: all 8 transaction::append tests pass (including my earlier delete-only-manifest regression test, no regression), clippy and rustfmt clean.

One thing worth surfacing for the record, not a blocker: dedupe-by-path means two DataFiles that share a file_path but differ in metadata (size, record_count) collapse to the first, silently. That's exactly what Java does, and it's the right call, but it's a quieter resolution than the "reject" framing in #2507 — so anyone reading #2507 later should know the fix intentionally became "drop the duplicate" rather than "error out". The PR description already explains this well.

LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(transaction): FastAppendAction silently accepts duplicate file paths within a single batch

4 participants