Recovery and Join snapshot ledger offset fix by cjen1-msft · Pull Request #7901 · microsoft/CCF

cjen1-msft · 2026-05-19T17:35:18Z

Per #7891, we had poor test coverage for the case where a snapshot is after the end of the ledger.
This PR adds coverage for that case, reproducing the issue where recovering nodes fail when there is a gap between the end of the ledger and the starting snapshot.
It also fixes that issue.

Supersedes: #7890

Copilot

Pull request overview

This PR fixes recovery/join robustness when the local ledger ends at (or before) the snapshot seqno, ensuring subsequent recovery writes resume correctly from the snapshot boundary. It also adds targeted test coverage to reproduce and guard against “gappy ledger” scenarios described in #7891.

Changes:

- Fix host ledger file selection and truncation behaviour to support recovery truncation beyond the current ledger end.
- Add new end-to-end tests for recovery and node join from snapshot across multiple “ledger end vs snapshot” variants (complete/incomplete chunks).
- Add a shared test utility to synthesize ledger chunk files for these scenarios, and document the user-visible fix in the changelog.
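The boundary behaviour described in the overview can be sketched as a small model. This is only an illustration, not the code in `src/host/ledger.h`: the names `ResumePoint` and `resume_point` are hypothetical, and the sketch assumes seqnos are plain integers and that writes resume immediately after whichever of the ledger end or the snapshot is later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResumePoint:
    """Hypothetical result type, not part of CCF."""

    next_write_seqno: int  # first seqno the node would append after start-up
    gap: int  # entries missing between the ledger end and the snapshot


def resume_point(ledger_end: int, snapshot_seqno: int) -> ResumePoint:
    """Simplified model of where writes resume when starting from a snapshot."""
    if ledger_end < 0 or snapshot_seqno < 0:
        raise ValueError("seqnos must be non-negative")
    # Entries in (ledger_end, snapshot_seqno] are absent from the local ledger
    # when the snapshot is ahead of it; otherwise nothing is missing.
    gap = max(0, snapshot_seqno - ledger_end)
    # Assumption: new entries are appended after the later of the two points.
    return ResumePoint(next_write_seqno=max(ledger_end, snapshot_seqno) + 1, gap=gap)


if __name__ == "__main__":
    print(resume_point(1000, 2000))
    print(resume_point(2000, 2000))
```

With a ledger ending at 1000 and a snapshot at 2000, this model reports a gap of 1000 entries and resumes writing at 2001, which is the "snapshot beyond the ledger end" case the new tests target.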

Custom instructions used:

.github/copilot-instructions.md
.github/instructions/changelog.instructions.md
.github/instructions/reviewing.instructions.md

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `src/host/ledger.h` | Adjusts file lookup and extends `truncate()` with a recovery-aware path for snapshot boundaries beyond the local ledger end. |
| `src/host/test/ledger.cpp` | Adds unit tests covering recovery truncation behaviours at/after/beyond ledger end. |
| `tests/infra/utils.py` | Adds `write_ledger_chunk()` helper to generate complete/incomplete ledger chunks for test scenarios. |
| `tests/recovery.py` | Adds a recovery test that exercises recovery-from-snapshot with multiple crafted ledger variants. |
| `tests/reconfiguration.py` | Adds a join test that joins nodes from snapshot with crafted ledger offsets/variants. |
| `CHANGELOG.md` | Notes the recovery/join fix in the Unreleased “Fixed” section with PR reference. |
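To make the complete/incomplete chunk distinction concrete, here is a rough sketch of detecting a ledger/snapshot gap from file names alone. It does not use `write_ledger_chunk()` or any CCF API. The naming scheme is an assumption (`ledger_<start>-<end>`, optionally suffixed `.committed`, for complete chunks and `ledger_<start>` for a chunk still being written), and `last_complete_seqno` and `gap_before_snapshot` are hypothetical helpers.

```python
import re
from typing import Iterable

# Assumed naming scheme (not taken from this PR):
#   complete chunk:   ledger_<start>-<end> or ledger_<start>-<end>.committed
#   incomplete chunk: ledger_<start>
_CHUNK_NAME = re.compile(r"^ledger_(\d+)(?:-(\d+))?(?:\.committed)?$")


def last_complete_seqno(file_names: Iterable[str]) -> int:
    """Highest end seqno among complete chunks, or 0 if there are none."""
    last = 0
    for name in file_names:
        match = _CHUNK_NAME.match(name)
        # Incomplete chunks have no end seqno in their name and are skipped,
        # so any entries they hold are deliberately not counted here.
        if match and match.group(2) is not None:
            last = max(last, int(match.group(2)))
    return last


def gap_before_snapshot(file_names: Iterable[str], snapshot_seqno: int) -> int:
    """Entries between the last complete chunk and the snapshot (0 if none)."""
    return max(0, snapshot_seqno - last_complete_seqno(file_names))


if __name__ == "__main__":
    files = ["ledger_1-500.committed", "ledger_501-1000", "ledger_1001"]
    print(gap_before_snapshot(files, 2000))
```

Skipping incomplete chunks is a conservative choice for the sketch: reading their actual contents would be needed to know how far they extend.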

cjen1-msft · 2026-05-29T13:22:41Z

There is a hard tradeoff here (thanks to @maxtropets for raising this!).

Consider recovering from a ledger that runs until 1000, and a snapshot at 2000.
Within that gap there is a rekey.

If we recover from the snapshot (this PR) we will be unable to serve any receipts for the ledger as we cannot reconstruct the commit evidence, nor backchain the endorsements.
Alternatively if we recover from the ledger we are truncating at least 1000 txs.
In this case the snapshot is probably the right thing to choose.

The other extreme is a snapshot at 1002 with the last ledger entry at 1000, where 1001 was a rekey.
In this case we are not necessarily gaining anything from the snapshot, but we lose the ability to serve receipts for the historical ledger.

I don't know which of these two we want: either we optimise for the current state of the KV and ensure we implicitly contain all committed transactions, or we optimise for being able to serve receipts, at the cost of effectively larger truncations.

A bad outcome of the 'restart from the snapshot' approach is that, upon restart, it is non-trivial to distinguish between a node whose ledger starts from a snapshot and has an unrecoverable tail, and a 'better' node whose ledger runs right up to the snapshot and has a recoverable tail.
In practice I'd expect the 'better' node to have transactions after the snapshot, so this might be sufficiently unlikely to be a non-issue.

To be specific, the other option would probably be to ignore snapshots past the end of the ledger.
A bad outcome of this approach is that we will likely end up replaying more ledgers in their entirety.
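The two options in this comment can be laid side by side with a small worked model. This is a simplified reading of the tradeoff, not CCF behaviour: `Outcome` and `compare_strategies` are hypothetical names, and the model assumes that historical receipts become unservable exactly when a rekey falls in the gap between the ledger end and the snapshot.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Outcome:
    """Hypothetical summary of one recovery strategy."""

    strategy: str
    min_txs_discarded: int  # lower bound on transactions dropped
    historical_receipts_servable: bool


def compare_strategies(
    ledger_end: int, snapshot_seqno: int, rekey_seqnos: Iterable[int]
) -> List[Outcome]:
    """Compare 'recover from snapshot' with 'ignore snapshots past the ledger end'."""
    if snapshot_seqno < ledger_end:
        raise ValueError("this model only covers snapshots at or after the ledger end")
    # Assumption: a rekey inside the gap is what prevents serving old receipts.
    rekey_in_gap = any(ledger_end < s <= snapshot_seqno for s in rekey_seqnos)
    return [
        # Keeps the state up to the snapshot, but may lose old receipts.
        Outcome("recover_from_snapshot", 0, not rekey_in_gap),
        # Keeps receipts, but drops at least everything in the gap.
        Outcome("recover_from_ledger", snapshot_seqno - ledger_end, True),
    ]


if __name__ == "__main__":
    for outcome in compare_strategies(1000, 2000, [1500]):
        print(outcome)
    for outcome in compare_strategies(1000, 1002, [1001]):
        print(outcome)
```

For the first example in the comment (ledger to 1000, snapshot at 2000), recovering from the ledger discards at least 1000 transactions; for the second (snapshot at 1002, rekey at 1001), it discards at least 2 while the snapshot path still loses historical receipts.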

cjen1-msft · 2026-06-01T09:29:14Z

To update after a discussion: the primary purpose of the node is to preserve the fuzzy 'history', not to be able to serve receipts. So we should recover from the snapshot, not from the ledger.

cjen1-msft added 3 commits May 19, 2026 18:32

Add tests for ledger-snapshot offset

42b374f

Add fix for recovery

54ba7f4

Merge branch 'main' into split/recovery-snapshot-ledger-offset

5930ba7

cjen1-msft added the run-long-test Run Long Test job label May 19, 2026

cjen1-msft added 5 commits May 19, 2026 18:40

changelogging

4e55215

fmt

f10fd58

Fix unit test

9fdefb9

remove vestigal call to ledger_files_invariant

eb00dd2

fmt

527943e

cjen1-msft mentioned this pull request May 20, 2026

Update ledger invariants and add tests for gappy ledgers during join and recovery #7890

Closed

cjen1-msft added 3 commits May 21, 2026 16:12

Don't write commit idx in recovery

fc0d2df

simplify changes to ledger.h

922dd94

Merge branch 'main' into split/recovery-snapshot-ledger-offset

c2b1ef0

cjen1-msft marked this pull request as ready for review May 26, 2026 13:32

cjen1-msft requested a review from a team as a code owner May 26, 2026 13:32

Copilot AI review requested due to automatic review settings May 26, 2026 13:32

Copilot started reviewing on behalf of cjen1-msft May 26, 2026 13:33

Copilot AI reviewed May 26, 2026

Comment thread tests/reconfiguration.py

cjen1-msft added 2 commits May 26, 2026 15:45

fix bad merge

818b8fa

Merge branch 'main' into split/recovery-snapshot-ledger-offset

2c4c480

maxtropets reviewed May 28, 2026

Comment thread src/host/ledger.h Outdated

maxtropets reviewed May 28, 2026

Comment thread src/host/ledger.h

maxtropets reviewed May 28, 2026

Comment thread tests/reconfiguration.py

maxtropets reviewed May 28, 2026

Comment thread tests/recovery.py

maxtropets approved these changes May 28, 2026

Merge branch 'main' into split/recovery-snapshot-ledger-offset

7e17318

Merge branch 'main' into split/recovery-snapshot-ledger-offset

81269c3

cjen1-msft marked this pull request as draft June 1, 2026 12:27

MUST_REVERT: check for failing long-test

255b191

cjen1-msft force-pushed the split/recovery-snapshot-ledger-offset branch from a0ec7a8 to 255b191 Compare June 1, 2026 12:47

cjen1-msft added 2 commits June 1, 2026 17:42

Revert "MUST_REVERT: check for failing long-test"

85d8454

This reverts commit 255b191.

Fix partial copy race condition

112ba53

cjen1-msft marked this pull request as ready for review June 2, 2026 09:08

achamayou reviewed Jun 2, 2026

Comment thread src/host/ledger.h

achamayou reviewed Jun 2, 2026

Comment thread src/host/ledger.h

achamayou reviewed Jun 2, 2026

Comment thread CHANGELOG.md Outdated

achamayou approved these changes Jun 2, 2026

cjen1-msft and others added 4 commits June 2, 2026 17:19

Update CHANGELOG.md

0ddf489

Co-authored-by: Amaury Chamayou <amaury@xargs.fr>

Clean up truncation fix

45740be

fmt

0e22ec6

fixup

b64494f

achamayou mentioned this pull request Jun 2, 2026

Fix async ledger callback lifetime capture #7915

Merged

Merge branch 'main' into split/recovery-snapshot-ledger-offset

1505831

achamayou added this to the 7.0.4 milestone Jun 3, 2026

cjen1-msft merged commit 3a554d2 into microsoft:main Jun 3, 2026
23 of 29 checks passed

cjen1-msft mentioned this pull request Jun 3, 2026

[release/6.x] Cherry pick: Recovery and Join snapshot ledger offset fix (#7901) #7917

Open

cjen1-msft merged 23 commits into microsoft:main from cjen1-msft:split/recovery-snapshot-ledger-offset

4 participants
