Skip to content

Export partition - Introduce commit lock#1847

Open
arthurpassos wants to merge 3 commits into
antalya-26.3from
export-partition-commit-lock
Open

Export partition - Introduce commit lock#1847
arthurpassos wants to merge 3 commits into
antalya-26.3from
export-partition-commit-lock

Conversation

@arthurpassos
Copy link
Copy Markdown
Collaborator

@arthurpassos arthurpassos commented May 27, 2026

Note for the reviewer:

Normally, an export task is comitted by the thread that uploads the last part. At the same time, there is a cleanup thread that runs periodically that might try to commit tasks that are marked as PENDING and already uploaded all parts. Those tasks are considered to be in commit state, and if this happens, there may be a race condition.

Introducing a commit lock to prevent that from happening

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Prevent commit race conditions by adding a per-task commit_lock

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

@arthurpassos arthurpassos added antalya port-antalya PRs to be ported to all new Antalya releases antalya-26.3 labels May 27, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 27, 2026

Workflow [PR], commit [ef0ea11]

@DimensionWieldr
Copy link
Copy Markdown
Collaborator

No new fails in PR.

AI audit returned "No blockers or majors."

Existing export partition regression test coverage is adequate.

LGTM

@DimensionWieldr DimensionWieldr added the verified Approved for release label May 29, 2026
@ilejn
Copy link
Copy Markdown
Collaborator

ilejn commented May 29, 2026

@arthurpassos I am definitely not the best person to review this, but some questions/suggestions

  1. Let's mention 'Export partition' in PR title.
  2. Is replica_name definitely exists and is meaningful? Cannot it be e.g. empty string?
  3. A bit strange that both successful and unsuccessful lock attempts logs has same severity, but i don't think that WARNING makes sense here.
  4. What is actually happening if commit lock is already acquired by another replica?
  5. Why not introduce a test?

@arthurpassos arthurpassos changed the title introduce commit lock Export partition - Introduce commit lock May 29, 2026
@arthurpassos
Copy link
Copy Markdown
Collaborator Author

@arthurpassos I am definitely not the best person to review this, but some questions/suggestions

  1. Let's mention 'Export partition' in PR title.
  2. Is replica_name definitely exists and is meaningful? Cannot it be e.g. empty string?
  3. A bit strange that both successful and unsuccessful lock attempts logs has same severity, but i don't think that WARNING makes sense here.
  4. What is actually happening if commit lock is already acquired by another replica?
  5. Why not introduce a test?
  1. Done
  2. AFAIK, you can't create a replicated* table without either providing the replica name or having {replica} defined.
  3. hm.. I am not very good at deciding log severities, but failing to acquire a lock is not an error by default. It just means another replica grabbed it first. So all I need is logging to keep track of it and understand what is going on internally if I have to debug it.
  4. It means the other replica is trying to commit it, and this replica can skip it.
  5. Testing race conditions is always hard, not sure how to cause this specific scenario. At the same time, I have 100+ export partition tests for happy and unhappy cases. They are passing

@il9ue
Copy link
Copy Markdown

il9ue commented May 30, 2026

Approach looks solid
— ephemeral commit_lock + RAII release is the correct primitive.

However, One thing before merge:
commit_lock is created as a child of entry_path, and the holder lives until function scope exits. If the successful-commit path does a non-recursive zk->remove(entry_path) while the lock is still held, ZK will reject it with ZNOTEMPTY. Can you confirm the entry isn't torn down inside the lock's lifetime? If it is, release before teardown or remove recursively.

(Minor: failed-acquire LOG_INFO will fire every poll for tasks another replica owns — DEBUG may fit better.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

antalya antalya-26.3 port-antalya PRs to be ported to all new Antalya releases verified Approved for release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants