Skip to content

[release/6.x] Cherry pick: Recovery and Join snapshot ledger offset fix (#7901)#7917

Open
cjen1-msft wants to merge 5 commits into
microsoft:release/6.xfrom
cjen1-msft:backport-6-7901
Open

[release/6.x] Cherry pick: Recovery and Join snapshot ledger offset fix (#7901)#7917
cjen1-msft wants to merge 5 commits into
microsoft:release/6.xfrom
cjen1-msft:backport-6-7901

Conversation

@cjen1-msft
Copy link
Copy Markdown
Contributor

backport #7901

Co-authored-by: Amaury Chamayou <amaury@xargs.fr>
@cjen1-msft cjen1-msft requested a review from a team as a code owner June 3, 2026 10:37
@achamayou
Copy link
Copy Markdown
Member

@cjen1-msft I think this needs something like #7918 as well as a rev up of the version in pyproject.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Backport of #7901 to the 6.x release branch to fix an edge case where recovery/join from a snapshot whose seqno is beyond the end of the available ledger should still resume ledger writing correctly from the snapshot boundary. This is accompanied by new regression coverage in both Python E2E tests and C++ host ledger unit tests.

Changes:

  • Update host ledger truncation logic to support a recovery-mode “forward truncate” past current ledger end, and tighten file selection when reading by index.
  • Add new recovery/join regression tests which construct ledger variants with snapshot/ledger offsets and verify recovery/join succeeds.
  • Bump release version to 6.0.28 and document the fix in the changelog.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/recovery.py Adds an E2E recovery test that recovers from snapshot/ledger offset variants.
tests/reconfiguration.py Adds an E2E join-node test that joins from snapshot/ledger offset variants.
tests/infra/utils.py Adds helper to synthesize ledger chunk files for the new tests.
src/host/test/ledger.cpp Adds unit tests covering recovery-mode truncation behavior at/beyond ledger end.
src/host/ledger.h Implements recovery-mode forward truncation and improves file lookup bounds-checking.
python/pyproject.toml Bumps Python package version to 6.0.28.
CHANGELOG.md Adds 6.0.28 entry describing the fix.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/infra/utils.py
Comment thread CHANGELOG.md
Comment thread src/host/ledger.h
Comment on lines +1472 to +1491
// During recovery, a snapshot can be after the current end of the host
// ledger. Regular truncation requires that the truncation index is within
// the ledger and otherwise skips the truncation. This is a special case
// to handle the recovery forward truncation
if (recovery_mode && idx >= last_idx)
{
// Close any open files as the ledger should restart cleanly from a new
// chunk.
auto file = get_latest_file();
if (file != nullptr)
{
file->complete();
}
// Don't use any of the files on disk for writing
use_existing_files = false;
last_idx_on_init.reset();
// Set last_idx to the recovery idx, which may be past the current end
// of the ledger
last_idx = idx;
return;
@cjen1-msft cjen1-msft added the run-long-test Run Long Test job label Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run-long-test Run Long Test job

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants