feat: accept multiple URLs for checkpoint sync#374
Conversation
Greptile SummaryThis PR makes
Confidence Score: 4/5Safe to merge with minor log message improvements recommended. The failover logic is correct and backward-compatible. Two non-blocking issues exist: a warning message that says "trying next URL" even when there is no next URL to try, and a startup log that prints full URL strings which could expose embedded credentials in environments that use auth-bearing URLs. bin/ethlambda/src/main.rs — the warning log wording and the startup info log that prints all URLs verbatim.
|
| Filename | Overview |
|---|---|
| bin/ethlambda/src/main.rs | Extends --checkpoint-sync-url to accept a Vec of URLs with comma-delimiter support; adds try_checkpoint_url helper for per-URL retry; fetch_initial_state now iterates the slice and fails over sequentially. Two minor issues: the "trying next URL" warning fires even when no next URL exists, and the startup info log prints all URLs verbatim which could expose embedded credentials. |
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
bin/ethlambda/src/main.rs:639-658
**Misleading "trying next URL" on last/only failure**
The warning message `"Checkpoint sync failed for this peer; trying next URL"` is emitted unconditionally after every failed attempt, including the final URL in the list (and the only URL in the single-URL case). When startup is about to abort with no more peers to try, operators will see the log say "trying next URL" right before the process exits, which is confusing. Consider checking whether there are remaining URLs before deciding which message to emit — or drop the "trying next URL" suffix and log the total count of remaining peers instead.
### Issue 2 of 2
bin/ethlambda/src/main.rs:636
**Potential credential exposure in startup log**
`info!(urls = ?checkpoint_urls, ...)` prints all URLs via their `Debug` representation, which includes the full URL string. If any URL contains embedded authentication material (basic-auth credentials or token query parameters), they will appear in plaintext in the log stream. Consider logging only the count of URLs, or stripping auth components before logging.
Reviews (1): Last reviewed commit: "feat: accept multiple URLs for checkpoin..." | Re-trigger Greptile
|
@Aliemeka commits must be signed in order to be able to merge the PR. Can you sign them? thanks! |
27e215c to
1a581a7
Compare
1a581a7 to
9cf6482
Compare
| async fn try_checkpoint_url( | ||
| url: &str, | ||
| genesis_time: u64, | ||
| validators: &[ethlambda_types::state::Validator], |
There was a problem hiding this comment.
We should import this instead:
| validators: &[ethlambda_types::state::Validator], | |
| validators: &[Validator], |
| url: &str, | ||
| genesis_time: u64, | ||
| validators: &[ethlambda_types::state::Validator], | ||
| ) -> Result<(State, ethlambda_types::block::SignedBlock), checkpoint_sync::CheckpointSyncError> { |
There was a problem hiding this comment.
Same
| ) -> Result<(State, ethlambda_types::block::SignedBlock), checkpoint_sync::CheckpointSyncError> { | |
| ) -> Result<(State, SignedBlock), checkpoint_sync::CheckpointSyncError> { |
| validators: &[ethlambda_types::state::Validator], | ||
| ) -> Result<(State, ethlambda_types::block::SignedBlock), checkpoint_sync::CheckpointSyncError> { | ||
| const MAX_ANCHOR_FETCH_ATTEMPTS: u32 = 3; | ||
| const ANCHOR_FETCH_RETRY_DELAY: std::time::Duration = std::time::Duration::from_secs(1); |
There was a problem hiding this comment.
Same
| const ANCHOR_FETCH_RETRY_DELAY: std::time::Duration = std::time::Duration::from_secs(1); | |
| const ANCHOR_FETCH_RETRY_DELAY: Duration = Duration::from_secs(1); |
| /// Fetch the finalized anchor from a single checkpoint URL, retrying transient | ||
| /// races where the peer advances finalization between the state and block | ||
| /// fetches. | ||
| async fn try_checkpoint_url( |
There was a problem hiding this comment.
Can we move this to the checkpoint sync module?
|
|
||
| let mut attempt = 1; | ||
| let mut iter = checkpoint_urls.iter().peekable(); | ||
| let (state, signed_block) = loop { |
There was a problem hiding this comment.
Can we move this loop to the checkpoint sync module?
|
Commit 9cf6482 is still unsigned |
| const ANCHOR_FETCH_RETRY_DELAY: std::time::Duration = std::time::Duration::from_secs(1); | ||
| // Checkpoint sync path: try URLs in order, fail over to the next on error. | ||
| // Log only the count — URLs may carry basic-auth credentials or token query | ||
| // parameters; per-URL log lines below redact those before emission. |
There was a problem hiding this comment.
The comment promises redaction that isn't implemented — URLs are logged verbatim.
This comment says per-URL log lines "redact those before emission," but none of them do. Every site logs the raw URL via %url:
- L589 (
try_checkpoint_urlretry):warn!(%url, attempt, max, "…retrying") - L650:
info!(%url, "Checkpoint sync successful with this peer") - L655 / L657:
warn!(%url, %err, …)
So a URL carrying credentials (https://user:pass@peer:5052) or a token query param (?token=…) ends up in logs in plaintext — exactly the case this comment says is handled. Two options:
- Implement the redaction the comment claims: sanitize before logging (strip userinfo and sensitive query params), e.g. a small
redact(url)helper used at all four sites. Theinfo!(url_count = …)summary line below can stay as the extra-cautious default. - Drop the claim if redaction isn't in scope for this PR — remove the "per-URL log lines below redact those before emission" sentence so the comment doesn't describe behavior that doesn't exist.
Leaning toward (1) since you've already identified the risk, but either is fine as long as the comment and the code agree.
There was a problem hiding this comment.
We already dropped the redaction before, we should drop the comment too
| tokio::time::sleep(ANCHOR_FETCH_RETRY_DELAY).await; | ||
| attempt += 1; | ||
| let Some(url) = iter.next() else { | ||
| return Err(checkpoint_sync::CheckpointSyncError::AllPeersFailed); |
There was a problem hiding this comment.
AllPeersFailed discards the underlying error, which contradicts the PR description and changes single-URL behavior.
The description says "the surfaced error on total failure is the last URL's error," but this returns the generic AllPeersFailed. The real cause survives only in the warn! lines — the value propagated to the call site (fetch_initial_state(...).inspect_err(|err| error!(%err, "Failed to initialize state"))?) loses it.
This is also a behavior change for the single-URL backward-compat path: previously a single failing URL surfaced its specific error (AnchorPairingMismatch, an HTTP/verification error, …); now it always surfaces "all checkpoint peers failed". The PR notes single-URL error behavior is unchanged, but the propagated error type does change here.
Suggest preserving the cause — either return the last error directly, or have the variant carry it:
#[error("all checkpoint peers failed; last error: {0}")]
AllPeersFailed(Box<CheckpointSyncError>),and track the last error in the loop:
let mut last_err = None;
let (state, signed_block) = loop {
let Some(url) = iter.next() else {
return Err(CheckpointSyncError::AllPeersFailed(
Box::new(last_err.expect("loop runs at least once for a non-empty list")),
));
};
match try_checkpoint_url(url, genesis.genesis_time, &validators).await {
Ok(pair) => { info!(%url, "Checkpoint sync successful with this peer"); break pair; }
Err(err) => {
if iter.peek().is_some() {
warn!(%url, %err, "Checkpoint sync failed for this peer; trying next URL");
} else {
warn!(%url, %err, "Checkpoint sync failed for this peer; no more URLs to try");
}
last_err = Some(err);
}
}
};That keeps the new variant (nice for the multi-URL case) while restoring the specific cause in the surfaced error and the top-level error! log. The expect is safe because the is_empty() guard above guarantees ≥1 iteration. If you'd rather keep it simple, just return Err(err) on the last URL instead — but then AllPeersFailed is unused and the description should drop the "last URL's error" wording.
| /// in order; the first one that succeeds is used and any failures fall | ||
| /// over to the next URL. Startup only aborts if every URL fails. | ||
| #[arg(long, value_delimiter = ',')] | ||
| checkpoint_sync_url: Option<Vec<String>>, |
There was a problem hiding this comment.
Two small non-blocking polish notes on this flag (both optional).
1 — Option<Vec<String>> is redundant. With clap, a Vec<String> arg already defaults to empty when the flag is absent, and downstream you collapse None and Some(vec![]) to the same thing anyway (options.checkpoint_sync_url.as_deref().unwrap_or(&[]) at L216, then fetch_initial_state branches on is_empty()). Dropping the Option removes the redundant layer:
#[arg(long, value_delimiter = ',')]
checkpoint_sync_url: Vec<String>,and the call site becomes &options.checkpoint_sync_url.
2 — empty entries slip through the delimiter. --checkpoint-sync-url u1,, u1,,u2, or --checkpoint-sync-url "" all produce empty-string URLs in the Vec. Each one becomes a wasted failover attempt that fails deep inside fetch_finalized_anchor and emits a confusing warn! for an empty URL. Trimming + filtering before the list is used avoids that, e.g.:
options.checkpoint_sync_url
.iter()
.map(|u| u.trim())
.filter(|u| !u.is_empty())Neither blocks merge — just tidies the type and guards against trailing/empty commas.
🗒️ Description / Motivation
Adds redundancy to checkpoint sync by letting operators supply more than one peer URL via
--checkpoint-sync-url. The flag now accepts either a single URL (existing behavior), a comma-separated list, or multiple repeated occurrences. URLs are tried sequentially: the first peer that successfully serves a valid anchor wins, and we only fail startup if every URL fails. It is backward compatible. PR closes #111What Changed
bin/ethlambda/src/main.rs--checkpoint-sync-urlis nowOption<Vec<String>>withvalue_delimiter = ',', sou1,u2and repeated--checkpoint-sync-urlflags both populate the list. A single URL still works unchanged (backwards-compatible).try_checkpoint_url, which wraps the existing per-URLAnchorPairingMismatchretry (3 attempts, 1s backoff). No behavior change for the single-URL case.fetch_initial_statenow takes&[String], iterates URLs in order, logs awarn!with the URL and underlying error on each failure, and moves on to the next. Returns the last error only if every URL fails.Correctness / Behavior Guarantees
try_checkpoint_url, not changed).checkpoint_sync.rs— every peer's anchor is still validated against the local genesis config and internal state/block consistency before it is accepted.Tests Added / Run
main.rsorchestration code that drives the already-testedfetch_finalized_anchor/verify_checkpoint_statepaths. Existingcheckpoint_synctests (verification + URL normalization) still pass.cargo clippy -p ethlambda --all-targets -- -D warningscargo test --workspace --releaseRelated Issues / PRs
✅ Verification Checklist
make fmt— cleanmake lint(clippy with-D warnings) — cleancargo test --workspace --release— all passing