Skip to content

feat: accept multiple URLs for checkpoint sync#374

Open
Aliemeka wants to merge 8 commits into
lambdaclass:mainfrom
Aliemeka:feat/accept-multiple-urls-for-checkpoint-sync-i
Open

feat: accept multiple URLs for checkpoint sync#374
Aliemeka wants to merge 8 commits into
lambdaclass:mainfrom
Aliemeka:feat/accept-multiple-urls-for-checkpoint-sync-i

Conversation

@Aliemeka
Copy link
Copy Markdown
Contributor

🗒️ Description / Motivation

Adds redundancy to checkpoint sync by letting operators supply more than one peer URL via --checkpoint-sync-url. The flag now accepts either a single URL (existing behavior), a comma-separated list, or multiple repeated occurrences. URLs are tried sequentially: the first peer that successfully serves a valid anchor wins, and we only fail startup if every URL fails. It is backward compatible. PR closes #111

What Changed

  • bin/ethlambda/src/main.rs
    • --checkpoint-sync-url is now Option<Vec<String>> withvalue_delimiter = ',', so u1,u2 and repeated --checkpoint-sync-url flags both populate the list. A single URL still works unchanged (backwards-compatible).
    • Extracted try_checkpoint_url, which wraps the existing per-URLAnchorPairingMismatch retry (3 attempts, 1s backoff). No behavior change for the single-URL case.
    • fetch_initial_state now takes &[String], iterates URLs in order, logs a warn! with the URL and underlying error on each failure, and moves on to the next. Returns the last error only if every URL fails.
    • Updated CLI help and function doc comments to describe the failover semantics.

Correctness / Behavior Guarantees

  • Empty URL list ⇒ genesis init, identical to before.
  • Single URL ⇒ same retry/error behavior as before (the per-URL retry was moved into try_checkpoint_url, not changed).
  • Multi-URL ⇒ first-success-wins failover. Operators can see which peers failed via the warn logs; the surfaced error on total failure is the last URL's error.
  • No changes to the anchor verification rules in checkpoint_sync.rs — every peer's anchor is still validated against the local genesis config and internal state/block consistency before it is accepted.

Tests Added / Run

  • No new unit tests — all multi-URL logic lives in main.rs orchestration code that drives the already-tested fetch_finalized_anchor / verify_checkpoint_state paths. Existing checkpoint_sync tests (verification + URL normalization) still pass.
  • Commands run:
    • cargo clippy -p ethlambda --all-targets -- -D warnings
    • cargo test --workspace --release

Related Issues / PRs

✅ Verification Checklist

  • Ran make fmt — clean
  • Ran make lint (clippy with -D warnings) — clean
  • Ran cargo test --workspace --release — all passing

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 15, 2026

Greptile Summary

This PR makes --checkpoint-sync-url accept multiple URLs (comma-separated or repeated flags) and tries them sequentially, falling back to the next peer on failure and only aborting startup if every URL fails. The existing per-URL retry logic for AnchorPairingMismatch is cleanly extracted into try_checkpoint_url with no behavior change for the single-URL case.

  • The failover loop is logically correct and backward-compatible, but the warning message \"trying next URL\" fires unconditionally after every failure — including the last URL in the list — which misleads operators right before a startup abort.
  • The info! at startup logs all URLs via their full debug representation; any embedded authentication material in URLs would appear in plaintext in the log stream.

Confidence Score: 4/5

Safe to merge with minor log message improvements recommended.

The failover logic is correct and backward-compatible. Two non-blocking issues exist: a warning message that says "trying next URL" even when there is no next URL to try, and a startup log that prints full URL strings which could expose embedded credentials in environments that use auth-bearing URLs.

bin/ethlambda/src/main.rs — the warning log wording and the startup info log that prints all URLs verbatim.

Security Review

  • Credential exposure in logs (bin/ethlambda/src/main.rs, line 636): The startup info! log prints all checkpoint-sync URLs using Debug formatting. If any URL contains embedded authentication material (basic-auth credentials or token query parameters), they will be written to the log stream in plaintext.

Important Files Changed

Filename Overview
bin/ethlambda/src/main.rs Extends --checkpoint-sync-url to accept a Vec of URLs with comma-delimiter support; adds try_checkpoint_url helper for per-URL retry; fetch_initial_state now iterates the slice and fails over sequentially. Two minor issues: the "trying next URL" warning fires even when no next URL exists, and the startup info log prints all URLs verbatim which could expose embedded credentials.
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
bin/ethlambda/src/main.rs:639-658
**Misleading "trying next URL" on last/only failure**

The warning message `"Checkpoint sync failed for this peer; trying next URL"` is emitted unconditionally after every failed attempt, including the final URL in the list (and the only URL in the single-URL case). When startup is about to abort with no more peers to try, operators will see the log say "trying next URL" right before the process exits, which is confusing. Consider checking whether there are remaining URLs before deciding which message to emit — or drop the "trying next URL" suffix and log the total count of remaining peers instead.

### Issue 2 of 2
bin/ethlambda/src/main.rs:636
**Potential credential exposure in startup log**

`info!(urls = ?checkpoint_urls, ...)` prints all URLs via their `Debug` representation, which includes the full URL string. If any URL contains embedded authentication material (basic-auth credentials or token query parameters), they will appear in plaintext in the log stream. Consider logging only the count of URLs, or stripping auth components before logging.

Reviews (1): Last reviewed commit: "feat: accept multiple URLs for checkpoin..." | Re-trigger Greptile

Comment thread bin/ethlambda/src/main.rs Outdated
Comment thread bin/ethlambda/src/main.rs Outdated
Comment thread bin/ethlambda/src/main.rs Outdated
Comment thread bin/ethlambda/src/main.rs Outdated
@Aliemeka Aliemeka requested a review from MegaRedHand May 22, 2026 15:22
@pablodeymo
Copy link
Copy Markdown
Collaborator

@Aliemeka commits must be signed in order to be able to merge the PR. Can you sign them? thanks!

@Aliemeka Aliemeka force-pushed the feat/accept-multiple-urls-for-checkpoint-sync-i branch from 27e215c to 1a581a7 Compare May 25, 2026 13:49
@Aliemeka Aliemeka force-pushed the feat/accept-multiple-urls-for-checkpoint-sync-i branch from 1a581a7 to 9cf6482 Compare May 25, 2026 13:55
Comment thread bin/ethlambda/src/main.rs
async fn try_checkpoint_url(
url: &str,
genesis_time: u64,
validators: &[ethlambda_types::state::Validator],
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should import this instead:

Suggested change
validators: &[ethlambda_types::state::Validator],
validators: &[Validator],

Comment thread bin/ethlambda/src/main.rs
url: &str,
genesis_time: u64,
validators: &[ethlambda_types::state::Validator],
) -> Result<(State, ethlambda_types::block::SignedBlock), checkpoint_sync::CheckpointSyncError> {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

Suggested change
) -> Result<(State, ethlambda_types::block::SignedBlock), checkpoint_sync::CheckpointSyncError> {
) -> Result<(State, SignedBlock), checkpoint_sync::CheckpointSyncError> {

Comment thread bin/ethlambda/src/main.rs
validators: &[ethlambda_types::state::Validator],
) -> Result<(State, ethlambda_types::block::SignedBlock), checkpoint_sync::CheckpointSyncError> {
const MAX_ANCHOR_FETCH_ATTEMPTS: u32 = 3;
const ANCHOR_FETCH_RETRY_DELAY: std::time::Duration = std::time::Duration::from_secs(1);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

Suggested change
const ANCHOR_FETCH_RETRY_DELAY: std::time::Duration = std::time::Duration::from_secs(1);
const ANCHOR_FETCH_RETRY_DELAY: Duration = Duration::from_secs(1);

Comment thread bin/ethlambda/src/main.rs
/// Fetch the finalized anchor from a single checkpoint URL, retrying transient
/// races where the peer advances finalization between the state and block
/// fetches.
async fn try_checkpoint_url(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this to the checkpoint sync module?

Comment thread bin/ethlambda/src/main.rs

let mut attempt = 1;
let mut iter = checkpoint_urls.iter().peekable();
let (state, signed_block) = loop {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this loop to the checkpoint sync module?

@MegaRedHand
Copy link
Copy Markdown
Collaborator

Commit 9cf6482 is still unsigned

Comment thread bin/ethlambda/src/main.rs
const ANCHOR_FETCH_RETRY_DELAY: std::time::Duration = std::time::Duration::from_secs(1);
// Checkpoint sync path: try URLs in order, fail over to the next on error.
// Log only the count — URLs may carry basic-auth credentials or token query
// parameters; per-URL log lines below redact those before emission.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment promises redaction that isn't implemented — URLs are logged verbatim.

This comment says per-URL log lines "redact those before emission," but none of them do. Every site logs the raw URL via %url:

  • L589 (try_checkpoint_url retry): warn!(%url, attempt, max, "…retrying")
  • L650: info!(%url, "Checkpoint sync successful with this peer")
  • L655 / L657: warn!(%url, %err, …)

So a URL carrying credentials (https://user:pass@peer:5052) or a token query param (?token=…) ends up in logs in plaintext — exactly the case this comment says is handled. Two options:

  1. Implement the redaction the comment claims: sanitize before logging (strip userinfo and sensitive query params), e.g. a small redact(url) helper used at all four sites. The info!(url_count = …) summary line below can stay as the extra-cautious default.
  2. Drop the claim if redaction isn't in scope for this PR — remove the "per-URL log lines below redact those before emission" sentence so the comment doesn't describe behavior that doesn't exist.

Leaning toward (1) since you've already identified the risk, but either is fine as long as the comment and the code agree.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already dropped the redaction before, we should drop the comment too

Comment thread bin/ethlambda/src/main.rs
tokio::time::sleep(ANCHOR_FETCH_RETRY_DELAY).await;
attempt += 1;
let Some(url) = iter.next() else {
return Err(checkpoint_sync::CheckpointSyncError::AllPeersFailed);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AllPeersFailed discards the underlying error, which contradicts the PR description and changes single-URL behavior.

The description says "the surfaced error on total failure is the last URL's error," but this returns the generic AllPeersFailed. The real cause survives only in the warn! lines — the value propagated to the call site (fetch_initial_state(...).inspect_err(|err| error!(%err, "Failed to initialize state"))?) loses it.

This is also a behavior change for the single-URL backward-compat path: previously a single failing URL surfaced its specific error (AnchorPairingMismatch, an HTTP/verification error, …); now it always surfaces "all checkpoint peers failed". The PR notes single-URL error behavior is unchanged, but the propagated error type does change here.

Suggest preserving the cause — either return the last error directly, or have the variant carry it:

#[error("all checkpoint peers failed; last error: {0}")]
AllPeersFailed(Box<CheckpointSyncError>),

and track the last error in the loop:

let mut last_err = None;
let (state, signed_block) = loop {
    let Some(url) = iter.next() else {
        return Err(CheckpointSyncError::AllPeersFailed(
            Box::new(last_err.expect("loop runs at least once for a non-empty list")),
        ));
    };
    match try_checkpoint_url(url, genesis.genesis_time, &validators).await {
        Ok(pair) => { info!(%url, "Checkpoint sync successful with this peer"); break pair; }
        Err(err) => {
            if iter.peek().is_some() {
                warn!(%url, %err, "Checkpoint sync failed for this peer; trying next URL");
            } else {
                warn!(%url, %err, "Checkpoint sync failed for this peer; no more URLs to try");
            }
            last_err = Some(err);
        }
    }
};

That keeps the new variant (nice for the multi-URL case) while restoring the specific cause in the surfaced error and the top-level error! log. The expect is safe because the is_empty() guard above guarantees ≥1 iteration. If you'd rather keep it simple, just return Err(err) on the last URL instead — but then AllPeersFailed is unused and the description should drop the "last URL's error" wording.

Comment thread bin/ethlambda/src/main.rs
/// in order; the first one that succeeds is used and any failures fall
/// over to the next URL. Startup only aborts if every URL fails.
#[arg(long, value_delimiter = ',')]
checkpoint_sync_url: Option<Vec<String>>,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small non-blocking polish notes on this flag (both optional).

1 — Option<Vec<String>> is redundant. With clap, a Vec<String> arg already defaults to empty when the flag is absent, and downstream you collapse None and Some(vec![]) to the same thing anyway (options.checkpoint_sync_url.as_deref().unwrap_or(&[]) at L216, then fetch_initial_state branches on is_empty()). Dropping the Option removes the redundant layer:

#[arg(long, value_delimiter = ',')]
checkpoint_sync_url: Vec<String>,

and the call site becomes &options.checkpoint_sync_url.

2 — empty entries slip through the delimiter. --checkpoint-sync-url u1,, u1,,u2, or --checkpoint-sync-url "" all produce empty-string URLs in the Vec. Each one becomes a wasted failover attempt that fails deep inside fetch_finalized_anchor and emits a confusing warn! for an empty URL. Trimming + filtering before the list is used avoids that, e.g.:

options.checkpoint_sync_url
    .iter()
    .map(|u| u.trim())
    .filter(|u| !u.is_empty())

Neither blocks merge — just tidies the type and guards against trailing/empty commas.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Accept multiple URLs for checkpoint sync

3 participants