From fc0c12552d8ce0b74448a124c090274f22e020fa Mon Sep 17 00:00:00 2001 From: bdchatham Date: Fri, 12 Jun 2026 14:34:20 -0700 Subject: [PATCH 1/2] docs: scrub stale discover-peers references (PLT-452 follow-up) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit PR-C's cross-review (sei-network + kubernetes lenses) flagged docs/comments that still present the retired discover-peers task as part of the current bootstrap flow. Peering is controller-owned via the config-apply persistent_peers override; there is no discover-peers task. Fixed (misstate current behavior): - internal/task/bootstrap_resources.go: bootstrapWaitCommand doc comment no longer lists discover-peers in the bootstrap-task set it waits on. - README.md: Full node "Key tasks" chain drops discover-peers. - docs/design-seinode-import-volume.md: the "runs exactly as today" task list drops discover-peers. - docs/design/composable-genesis.md: both per-node init-plan flows drop the discover-peers step. - .agent/runbooks/migrating-validator-to-byo-secrets.md: the bootstrap-plan chain + its explanation now credit config-apply (via the resolved-peer override) with writing persistent-peers, not discover-peers. Deliberately left: - docs/known-issues-node-alarms.md: a dated incident record — it accurately describes a failure that occurred when discover-peers existed; rewriting it would falsify history. - Three Go test-comment analogies (common_overrides_test.go, controller_test.go) and the intentional TestCEL_DiscoverPeers_Removed_Rejected guard — internal references that don't misstate current behavior. Co-Authored-By: Claude Opus 4.8 --- .agent/runbooks/migrating-validator-to-byo-secrets.md | 9 +++++---- README.md | 2 +- docs/design-seinode-import-volume.md | 2 +- docs/design/composable-genesis.md | 2 -- internal/task/bootstrap_resources.go | 2 +- 5 files changed, 8 insertions(+), 9 deletions(-) diff --git a/.agent/runbooks/migrating-validator-to-byo-secrets.md b/.agent/runbooks/migrating-validator-to-byo-secrets.md index d2b621d4..a359cd6d 100644 --- a/.agent/runbooks/migrating-validator-to-byo-secrets.md +++ b/.agent/runbooks/migrating-validator-to-byo-secrets.md @@ -271,9 +271,10 @@ Ordered. Do not reorder steps 3 and 4. - Flux decrypts the Secrets and the SeiNode's bootstrap plan runs (no-snapshot validator): ensure-data-pvc -> validate-signing-key -> validate-node-key -> apply-rbac-proxy-config -> apply-statefulset -> apply-service -> - configure-genesis -> config-apply -> discover-peers -> config-validate -> - mark-ready (config-apply writes the base config; discover-peers then writes - persistent-peers; config-validate checks the assembled result last). seid then + configure-genesis -> config-apply -> config-validate -> + mark-ready (config-apply writes the base config and folds in + persistent-peers from the controller-resolved peer set; config-validate + checks the assembled result last). seid then block-syncs once the pod is up (there is no "block-sync" task — it's seid catching up, observed via catching_up/height, not the plan). - Expect the networking.tcp DNS race on cold start (§6 finding 1): ~6-8 min of @@ -302,7 +303,7 @@ Numbered as encountered. 1–4 are the ones that change operator behavior. **1. `networking.tcp` cold-start DNS race (self-heals).** A `tcp` validator gets a per-pod internet-facing NLB + an external-DNS Route53 record, and seid is configured with that hostname as its P2P `external-address`. On first boot seid resolves its own external address *before* external-DNS has published it → `lookup ...: no such host` → CrashLoopBackOff, made worse by CoreDNS negative-cache TTL. It clears on its own in ~6–8 min. **This is expected; do not drop `networking.tcp` to "fix" it** — a public arctic-1 validator needs the NLB. -**2. Deploy clean — never blow-away-and-recreate a running chain's SND.** This bit the dry-run's genesis chain, whose nodes use **controller-generated** `node_key.json`: recreating the SND regenerates those keys per pod, but peers keep the *old* NodeIDs in `persistent_peers` → every P2P dial is rejected (`peer NodeID = X, want Y`) and the chain wedges (never produces a block). A fresh deploy's `discover-peers` wires correct NodeIDs the first time. **For *this* BYO validator the NodeID is pinned by the `nodeKey` Secret and is stable across recreation** — so its NodeID won't churn, but recreation is still hazardous for a different reason: finding 3 (it destroys the data PVC). Bottom line: don't recreate a running validator's SND; if you must replace, do it clean. +**2. Deploy clean — never blow-away-and-recreate a running chain's SND.** This bit the dry-run's genesis chain, whose nodes use **controller-generated** `node_key.json`: recreating the SND regenerates those keys per pod, but peers keep the *old* NodeIDs in `persistent_peers` → every P2P dial is rejected (`peer NodeID = X, want Y`) and the chain wedges (never produces a block). A fresh deploy resolves and writes the correct NodeIDs into `persistent_peers` the first time. **For *this* BYO validator the NodeID is pinned by the `nodeKey` Secret and is stable across recreation** — so its NodeID won't churn, but recreation is still hazardous for a different reason: finding 3 (it destroys the data PVC). Bottom line: don't recreate a running validator's SND; if you must replace, do it clean. **3. Deleting the SeiNode destroys the data PVC.** A `Failed` SeiNode does **not** self-heal from a spec edit — the controller treats `Failed` as terminal and emits "Delete and recreate the resource to retry", so the only way to replan is to delete the SeiNode. But the data PVC carries a controller ownerRef **directly to the SeiNode** (the SeiNode owns the StatefulSet→Pod *and* the PVC in parallel — the PVC is **not** a StatefulSet `volumeClaimTemplate`, so it is not protected by the STS's `WhenDeleted=Retain`). Deleting the SeiNode therefore GC-deletes the PVC, destroying chain state and forcing a full re-sync. The **consensus identity survives** (it's in the Secrets), so the validator comes back as itself — but budget for the resync. Note the scope of `deletionPolicy: Retain`: it governs the **SND→child** cascade only — it protects the PVC when the *SND* is deleted, but a manual `kubectl delete seinode ` (the delete-to-replan action) still GC-deletes the PVC via the SeiNode ownerRef regardless. diff --git a/README.md b/README.md index a45583d2..a078101a 100644 --- a/README.md +++ b/README.md @@ -72,7 +72,7 @@ spec: | Mode | Condition | Key tasks | |------|-----------|-----------| -| Full node | `spec.fullNode` set | `configure-genesis` > `snapshot-restore` > `config-apply` > `discover-peers` > `mark-ready` | +| Full node | `spec.fullNode` set | `configure-genesis` > `snapshot-restore` > `config-apply` > `mark-ready` | | Validator | `spec.validator` set | Same as full node, or genesis ceremony flow for new networks | | Archive | `spec.archive` set | State sync with archival pruning configuration | | Replayer | `spec.replayer` set | Snapshot restore with result export for shadow validation | diff --git a/docs/design-seinode-import-volume.md b/docs/design-seinode-import-volume.md index f469f0b4..88eb0d70 100644 --- a/docs/design-seinode-import-volume.md +++ b/docs/design-seinode-import-volume.md @@ -62,7 +62,7 @@ spec: pvcName: data-archive-0-0 # name of a pre-existing PVC in the SeiNode's namespace ``` -Planner behavior: **the init plan is unchanged.** The only difference is inside `ensure-data-pvc`: if `spec.dataVolume.import.pvcName` is set, the task verifies the named PVC instead of creating a fresh one. Every successor task (`apply-statefulset`, `apply-service`, `configure-genesis`, `config-apply`, `discover-peers`, `configure-state-sync`, `config-validate`, `mark-ready`) runs exactly as it does today. +Planner behavior: **the init plan is unchanged.** The only difference is inside `ensure-data-pvc`: if `spec.dataVolume.import.pvcName` is set, the task verifies the named PVC instead of creating a fresh one. Every successor task (`apply-statefulset`, `apply-service`, `configure-genesis`, `config-apply`, `configure-state-sync`, `config-validate`, `mark-ready`) runs exactly as it does today. This is a deliberate "no extra fluff" choice: import is a PVC-source substitution, not a bootstrap off-ramp. The operator is trusted to provide a PVC whose contents are compatible with the rest of the init progression. If the imported data is from an incompatible seid version, the wrong chain, or in an unexpected on-disk format, seid will fail to start on the pod and the operator gets a clear signal from the Failed plan — same failure channel as any other init problem. diff --git a/docs/design/composable-genesis.md b/docs/design/composable-genesis.md index aee77f2f..b0771536 100644 --- a/docs/design/composable-genesis.md +++ b/docs/design/composable-genesis.md @@ -54,7 +54,6 @@ that no individual node can perform alone. ``` init-validator → create keys, gentx, publish identity to S3 configure-genesis → download assembled genesis.json from S3 (retries until available) -discover-peers → resolve network peers configure-state-sync → (only if StateSync is set) config-patch → apply TOML config patches mark-ready → signal bootstrap complete @@ -66,7 +65,6 @@ register-validator → (only on existing chains) submit create-validator tx ``` init-validator → create keys (gentx not needed; genesis already exists) configure-genesis → download existing chain genesis from S3 (immediately available) -discover-peers → resolve network peers configure-state-sync → sync to chain tip config-patch → apply TOML config patches mark-ready → signal bootstrap complete diff --git a/internal/task/bootstrap_resources.go b/internal/task/bootstrap_resources.go index 1939d0e8..d2f80e09 100644 --- a/internal/task/bootstrap_resources.go +++ b/internal/task/bootstrap_resources.go @@ -235,7 +235,7 @@ func buildBootstrapPodSpec(node *seiv1alpha1.SeiNode, snap *seiv1alpha1.Snapshot // to report healthy and then runs seid with --halt-height. Polls /v0/healthz // which returns 503 until the mark-ready task completes, ensuring all // bootstrap sidecar tasks (snapshot-restore, configure-genesis, config-apply, -// discover-peers, config-validate) have finished before seid starts. +// config-validate) have finished before seid starts. // // Uses bash's /dev/tcp to make raw HTTP requests instead of wget/curl, which // are not available on all sei images. From 32852750e8af3d477ace448e56b7d264104f9f9f Mon Sep 17 00:00:00 2001 From: bdchatham Date: Fri, 12 Jun 2026 14:37:29 -0700 Subject: [PATCH 2/2] docs: fold in cross-review polish (dated incident annotation, drop dangling analogy) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit PR-401 cross-review (sei-network) optional hardenings: - docs/known-issues-node-alarms.md: annotate the discover-peers mention as historical so a future reader grepping discover-peers isn't misled — keeps the incident record intact, adds the controller-owned-peering context. - controller_test.go: drop the trailing "Mirrors the DiscoverPeers poll shape" analogy — the first sentence already states the poll-to-completion shape, and the analogy referenced now-deleted code. Co-Authored-By: Claude Opus 4.8 --- docs/known-issues-node-alarms.md | 2 +- internal/controller/nodetask/controller_test.go | 3 +-- 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/docs/known-issues-node-alarms.md b/docs/known-issues-node-alarms.md index 0fa6f3c0..2f622d62 100644 --- a/docs/known-issues-node-alarms.md +++ b/docs/known-issues-node-alarms.md @@ -8,7 +8,7 @@ Recurring alerts observed during SeiNode and SeiNodeDeployment deployments. Thes **Environment:** dev **Severity:** critical (alert), expected during iteration -**What happens:** The shadow replayer fails during bootstrap, typically at `discover-peers` or `configure-state-sync`. Each failed deployment requires deleting and recreating the SeiNode. +**What happens:** The shadow replayer fails during bootstrap, typically at `discover-peers` or `configure-state-sync`. Each failed deployment requires deleting and recreating the SeiNode. _(Historical: `discover-peers` was a sidecar bootstrap task when this incident occurred; peering is now controller-owned via the config-apply `persistent_peers` override and is no longer a distinct task.)_ **Root causes encountered:** 1. **Pruned peers (resolved):** State-syncer EC2 nodes pruned blocks below the snapshot height (200440000). `configure-state-sync` queries peers for a block hash at the trust height and gets empty responses. Fix: use a snapshot at a height within peers' retention window. diff --git a/internal/controller/nodetask/controller_test.go b/internal/controller/nodetask/controller_test.go index 2d78ebdc..614d764e 100644 --- a/internal/controller/nodetask/controller_test.go +++ b/internal/controller/nodetask/controller_test.go @@ -701,8 +701,7 @@ func TestTaskParamsForKind_RestartSeid(t *testing.T) { // RestartSeid is poll-to-completion (registered sidecarTask[...](false)): unlike // MarkReady's fire-and-forget ack, the controller polls GetTask until the -// restart-seid task reports terminal (seid's RPC back up). Mirrors the -// DiscoverPeers poll shape. +// restart-seid task reports terminal (seid's RPC back up). func TestReconcile_RestartSeid_EndToEnd(t *testing.T) { g := NewWithT(t) ctx := context.Background()