Recovery and Join snapshot ledger offset fix#7901
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes recovery/join robustness when the local ledger ends at (or before) the snapshot seqno, ensuring subsequent recovery writes resume correctly from the snapshot boundary. It also adds targeted test coverage to reproduce and guard against “gappy ledger” scenarios described in #7891.
Changes:
- Fix host ledger file selection and truncation behaviour to support recovery truncation beyond the current ledger end.
- Add new end-to-end tests for recovery and node join from snapshot across multiple “ledger end vs snapshot” variants (complete/incomplete chunks).
- Add a shared test utility to synthesize ledger chunk files for these scenarios, and document the user-visible fix in the changelog.
Custom instructions used:
.github/copilot-instructions.md.github/instructions/changelog.instructions.md.github/instructions/reviewing.instructions.md
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
src/host/ledger.h |
Adjusts file lookup and extends truncate() with a recovery-aware path for snapshot boundaries beyond the local ledger end. |
src/host/test/ledger.cpp |
Adds unit tests covering recovery truncation behaviours at/after/beyond ledger end. |
tests/infra/utils.py |
Adds write_ledger_chunk() helper to generate complete/incomplete ledger chunks for test scenarios. |
tests/recovery.py |
Adds a recovery test that exercises recovery-from-snapshot with multiple crafted ledger variants. |
tests/reconfiguration.py |
Adds a join test that joins nodes from snapshot with crafted ledger offsets/variants. |
CHANGELOG.md |
Notes the recovery/join fix in the Unreleased “Fixed” section with PR reference. |
|
There is a hard tradeoff here (thanks to @maxtropets for raising this!). Consider recovering from a ledger that runs until 1000, and a snapshot at 2000. If we recover from the snapshot (this PR) we will be unable to serve any receipts for the ledger as we cannot reconstruct the commit evidence, nor backchain the endorsements. The other limit of this, is if the snapshot is at 1002, and the last ledger is at 1000, and 1001 was a rekey. I don't know which of these two we want, we either optimise for the current state of the KV and ensure we implicitly contain all committed transactions. Or we optimise for being able to serve receipts, at the cost of effectively greater truncates. A bad outcome from the 'restart from the snapshot' approach is that upon restart it is non-trivial to distinguish between a ledger starting from a snapshot and has an unrecoverable tail, and a 'better' node with a ledger right up to the snapshot that has a recoverable tail. To be specific the other option would probably be to ignore snapshots past the end of the ledger. |
|
To update after a discussion. The primary purpose of the node is to preserve the fuzzy 'history' not to be able to serve receipts. So we should recover from the snapshot, not from the ledger. |
a0ec7a8 to
255b191
Compare
Co-authored-by: Amaury Chamayou <amaury@xargs.fr>
Per #7891 we had poor test coverage if a snapshot was after the end of the ledger.
This PR adds better coverage for that, reproducing the issue where recovering nodes fail when there is a gap between the end of the ledger and the starting snapshot.
It also adds a fix for that issue.
Supersedes: #7890