sei-protocol · bdchatham · Jun 12, 2026
diff --git a/.agent/runbooks/migrating-validator-to-byo-secrets.md b/.agent/runbooks/migrating-validator-to-byo-secrets.md
@@ -271,9 +271,10 @@ Ordered. Do not reorder steps 3 and 4.
    - Flux decrypts the Secrets and the SeiNode's bootstrap plan runs (no-snapshot
      validator): ensure-data-pvc -> validate-signing-key -> validate-node-key ->
      apply-rbac-proxy-config -> apply-statefulset -> apply-service ->
-     configure-genesis -> config-apply -> discover-peers -> config-validate ->
-     mark-ready (config-apply writes the base config; discover-peers then writes
-     persistent-peers; config-validate checks the assembled result last). seid then
+     configure-genesis -> config-apply -> config-validate ->
+     mark-ready (config-apply writes the base config and folds in
+     persistent-peers from the controller-resolved peer set; config-validate
+     checks the assembled result last). seid then
      block-syncs once the pod is up (there is no "block-sync" task — it's seid
      catching up, observed via catching_up/height, not the plan).
    - Expect the networking.tcp DNS race on cold start (§6 finding 1): ~6-8 min of
@@ -302,7 +303,7 @@ Numbered as encountered. 1–4 are the ones that change operator behavior.
 
 **1. `networking.tcp` cold-start DNS race (self-heals).** A `tcp` validator gets a per-pod internet-facing NLB + an external-DNS Route53 record, and seid is configured with that hostname as its P2P `external-address`. On first boot seid resolves its own external address *before* external-DNS has published it → `lookup ...: no such host` → CrashLoopBackOff, made worse by CoreDNS negative-cache TTL. It clears on its own in ~6–8 min. **This is expected; do not drop `networking.tcp` to "fix" it** — a public arctic-1 validator needs the NLB.
 
-**2. Deploy clean — never blow-away-and-recreate a running chain's SND.** This bit the dry-run's genesis chain, whose nodes use **controller-generated** `node_key.json`: recreating the SND regenerates those keys per pod, but peers keep the *old* NodeIDs in `persistent_peers` → every P2P dial is rejected (`peer NodeID = X, want Y`) and the chain wedges (never produces a block). A fresh deploy's `discover-peers` wires correct NodeIDs the first time. **For *this* BYO validator the NodeID is pinned by the `nodeKey` Secret and is stable across recreation** — so its NodeID won't churn, but recreation is still hazardous for a different reason: finding 3 (it destroys the data PVC). Bottom line: don't recreate a running validator's SND; if you must replace, do it clean.
+**2. Deploy clean — never blow-away-and-recreate a running chain's SND.** This bit the dry-run's genesis chain, whose nodes use **controller-generated** `node_key.json`: recreating the SND regenerates those keys per pod, but peers keep the *old* NodeIDs in `persistent_peers` → every P2P dial is rejected (`peer NodeID = X, want Y`) and the chain wedges (never produces a block). A fresh deploy resolves and writes the correct NodeIDs into `persistent_peers` the first time. **For *this* BYO validator the NodeID is pinned by the `nodeKey` Secret and is stable across recreation** — so its NodeID won't churn, but recreation is still hazardous for a different reason: finding 3 (it destroys the data PVC). Bottom line: don't recreate a running validator's SND; if you must replace, do it clean.
 
 **3. Deleting the SeiNode destroys the data PVC.** A `Failed` SeiNode does **not** self-heal from a spec edit — the controller treats `Failed` as terminal and emits "Delete and recreate the resource to retry", so the only way to replan is to delete the SeiNode. But the data PVC carries a controller ownerRef **directly to the SeiNode** (the SeiNode owns the StatefulSet→Pod *and* the PVC in parallel — the PVC is **not** a StatefulSet `volumeClaimTemplate`, so it is not protected by the STS's `WhenDeleted=Retain`). Deleting the SeiNode therefore GC-deletes the PVC, destroying chain state and forcing a full re-sync. The **consensus identity survives** (it's in the Secrets), so the validator comes back as itself — but budget for the resync. Note the scope of `deletionPolicy: Retain`: it governs the **SND→child** cascade only — it protects the PVC when the *SND* is deleted, but a manual `kubectl delete seinode <child>` (the delete-to-replan action) still GC-deletes the PVC via the SeiNode ownerRef regardless.
 

diff --git a/README.md b/README.md
@@ -72,7 +72,7 @@ spec:
 
 | Mode | Condition | Key tasks |
 |------|-----------|-----------|
-| Full node | `spec.fullNode` set | `configure-genesis` > `snapshot-restore` > `config-apply` > `discover-peers` > `mark-ready` |
+| Full node | `spec.fullNode` set | `configure-genesis` > `snapshot-restore` > `config-apply` > `mark-ready` |
 | Validator | `spec.validator` set | Same as full node, or genesis ceremony flow for new networks |
 | Archive | `spec.archive` set | State sync with archival pruning configuration |
 | Replayer | `spec.replayer` set | Snapshot restore with result export for shadow validation |

diff --git a/docs/design-seinode-import-volume.md b/docs/design-seinode-import-volume.md
@@ -62,7 +62,7 @@ spec:
       pvcName: data-archive-0-0      # name of a pre-existing PVC in the SeiNode's namespace
 ```
 
-Planner behavior: **the init plan is unchanged.** The only difference is inside `ensure-data-pvc`: if `spec.dataVolume.import.pvcName` is set, the task verifies the named PVC instead of creating a fresh one. Every successor task (`apply-statefulset`, `apply-service`, `configure-genesis`, `config-apply`, `discover-peers`, `configure-state-sync`, `config-validate`, `mark-ready`) runs exactly as it does today.
+Planner behavior: **the init plan is unchanged.** The only difference is inside `ensure-data-pvc`: if `spec.dataVolume.import.pvcName` is set, the task verifies the named PVC instead of creating a fresh one. Every successor task (`apply-statefulset`, `apply-service`, `configure-genesis`, `config-apply`, `configure-state-sync`, `config-validate`, `mark-ready`) runs exactly as it does today.
 
 This is a deliberate "no extra fluff" choice: import is a PVC-source substitution, not a bootstrap off-ramp. The operator is trusted to provide a PVC whose contents are compatible with the rest of the init progression. If the imported data is from an incompatible seid version, the wrong chain, or in an unexpected on-disk format, seid will fail to start on the pod and the operator gets a clear signal from the Failed plan — same failure channel as any other init problem.
 

diff --git a/docs/design/composable-genesis.md b/docs/design/composable-genesis.md
@@ -54,7 +54,6 @@ that no individual node can perform alone.
 ```
 init-validator          → create keys, gentx, publish identity to S3
 configure-genesis       → download assembled genesis.json from S3 (retries until available)
-discover-peers          → resolve network peers
 configure-state-sync    → (only if StateSync is set)
 config-patch            → apply TOML config patches
 mark-ready              → signal bootstrap complete
@@ -66,7 +65,6 @@ register-validator      → (only on existing chains) submit create-validator tx
 ```
 init-validator          → create keys (gentx not needed; genesis already exists)
 configure-genesis       → download existing chain genesis from S3 (immediately available)
-discover-peers          → resolve network peers
 configure-state-sync    → sync to chain tip
 config-patch            → apply TOML config patches
 mark-ready              → signal bootstrap complete

diff --git a/docs/known-issues-node-alarms.md b/docs/known-issues-node-alarms.md
@@ -8,7 +8,7 @@ Recurring alerts observed during SeiNode and SeiNodeDeployment deployments. Thes
 **Environment:** dev
 **Severity:** critical (alert), expected during iteration
 
-**What happens:** The shadow replayer fails during bootstrap, typically at `discover-peers` or `configure-state-sync`. Each failed deployment requires deleting and recreating the SeiNode.
+**What happens:** The shadow replayer fails during bootstrap, typically at `discover-peers` or `configure-state-sync`. Each failed deployment requires deleting and recreating the SeiNode. _(Historical: `discover-peers` was a sidecar bootstrap task when this incident occurred; peering is now controller-owned via the `config-apply` `persistent_peers` override and is no longer a distinct task.)_
 
 **Root causes encountered:**
 1. **Pruned peers (resolved):** State-syncer EC2 nodes pruned blocks below the snapshot height (200440000). `configure-state-sync` queries peers for a block hash at the trust height and gets empty responses. Fix: use a snapshot at a height within peers' retention window.

diff --git a/internal/controller/nodetask/controller_test.go b/internal/controller/nodetask/controller_test.go
@@ -701,8 +701,7 @@ func TestTaskParamsForKind_RestartSeid(t *testing.T) {
 
 // RestartSeid is poll-to-completion (registered sidecarTask[...](false)): unlike
 // MarkReady's fire-and-forget ack, the controller polls GetTask until the
-// restart-seid task reports terminal (seid's RPC back up). Mirrors the
-// DiscoverPeers poll shape.
+// restart-seid task reports a terminal state (seid's RPC is back up).
 func TestReconcile_RestartSeid_EndToEnd(t *testing.T) {
 	g := NewWithT(t)
 	ctx := context.Background()

diff --git a/internal/task/bootstrap_resources.go b/internal/task/bootstrap_resources.go
@@ -235,7 +235,7 @@ func buildBootstrapPodSpec(node *seiv1alpha1.SeiNode, snap *seiv1alpha1.Snapshot
 // to report healthy and then runs seid with --halt-height. Polls /v0/healthz
 // which returns 503 until the mark-ready task completes, ensuring all
 // bootstrap sidecar tasks (snapshot-restore, configure-genesis, config-apply,
-// discover-peers, config-validate) have finished before seid starts.
+// config-validate) have finished before seid starts.
 //
 // Uses bash's /dev/tcp to make raw HTTP requests instead of wget/curl, which
 // are not available on all sei images.