From fc0c12552d8ce0b74448a124c090274f22e020fa Mon Sep 17 00:00:00 2001
From: bdchatham <bdchatham@gmail.com>
Date: Fri, 12 Jun 2026 14:34:20 -0700
Subject: [PATCH 1/2] docs: scrub stale discover-peers references (PLT-452
 follow-up)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

PR-C's cross-review (sei-network + kubernetes lenses) flagged docs/comments
that still present the retired discover-peers task as part of the current
bootstrap flow. Peering is controller-owned via the config-apply
persistent_peers override; there is no discover-peers task.

Fixed (misstate current behavior):
- internal/task/bootstrap_resources.go: bootstrapWaitCommand doc comment no
  longer lists discover-peers in the bootstrap-task set it waits on.
- README.md: Full node "Key tasks" chain drops discover-peers.
- docs/design-seinode-import-volume.md: the "runs exactly as today" task
  list drops discover-peers.
- docs/design/composable-genesis.md: both per-node init-plan flows drop the
  discover-peers step.
- .agent/runbooks/migrating-validator-to-byo-secrets.md: the bootstrap-plan
  chain + its explanation now credit config-apply (via the resolved-peer
  override) with writing persistent-peers, not discover-peers.

Deliberately left:
- docs/known-issues-node-alarms.md: a dated incident record — it accurately
  describes a failure that occurred when discover-peers existed; rewriting it
  would falsify history.
- Three Go test-comment analogies (common_overrides_test.go,
  controller_test.go) and the intentional TestCEL_DiscoverPeers_Removed_Rejected
  guard — internal references that don't misstate current behavior.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 .agent/runbooks/migrating-validator-to-byo-secrets.md | 9 +++++----
 README.md                                             | 2 +-
 docs/design-seinode-import-volume.md                  | 2 +-
 docs/design/composable-genesis.md                     | 2 --
 internal/task/bootstrap_resources.go                  | 2 +-
 5 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/.agent/runbooks/migrating-validator-to-byo-secrets.md b/.agent/runbooks/migrating-validator-to-byo-secrets.md
index d2b621d4..a359cd6d 100644
--- a/.agent/runbooks/migrating-validator-to-byo-secrets.md
+++ b/.agent/runbooks/migrating-validator-to-byo-secrets.md
@@ -271,9 +271,10 @@ Ordered. Do not reorder steps 3 and 4.
    - Flux decrypts the Secrets and the SeiNode's bootstrap plan runs (no-snapshot
      validator): ensure-data-pvc -> validate-signing-key -> validate-node-key ->
      apply-rbac-proxy-config -> apply-statefulset -> apply-service ->
-     configure-genesis -> config-apply -> discover-peers -> config-validate ->
-     mark-ready (config-apply writes the base config; discover-peers then writes
-     persistent-peers; config-validate checks the assembled result last). seid then
+     configure-genesis -> config-apply -> config-validate ->
+     mark-ready (config-apply writes the base config and folds in
+     persistent-peers from the controller-resolved peer set; config-validate
+     checks the assembled result last). seid then
      block-syncs once the pod is up (there is no "block-sync" task — it's seid
      catching up, observed via catching_up/height, not the plan).
    - Expect the networking.tcp DNS race on cold start (§6 finding 1): ~6-8 min of
@@ -302,7 +303,7 @@ Numbered as encountered. 1–4 are the ones that change operator behavior.
 
 **1. `networking.tcp` cold-start DNS race (self-heals).** A `tcp` validator gets a per-pod internet-facing NLB + an external-DNS Route53 record, and seid is configured with that hostname as its P2P `external-address`. On first boot seid resolves its own external address *before* external-DNS has published it → `lookup ...: no such host` → CrashLoopBackOff, made worse by CoreDNS negative-cache TTL. It clears on its own in ~6–8 min. **This is expected; do not drop `networking.tcp` to "fix" it** — a public arctic-1 validator needs the NLB.
 
-**2. Deploy clean — never blow-away-and-recreate a running chain's SND.** This bit the dry-run's genesis chain, whose nodes use **controller-generated** `node_key.json`: recreating the SND regenerates those keys per pod, but peers keep the *old* NodeIDs in `persistent_peers` → every P2P dial is rejected (`peer NodeID = X, want Y`) and the chain wedges (never produces a block). A fresh deploy's `discover-peers` wires correct NodeIDs the first time. **For *this* BYO validator the NodeID is pinned by the `nodeKey` Secret and is stable across recreation** — so its NodeID won't churn, but recreation is still hazardous for a different reason: finding 3 (it destroys the data PVC). Bottom line: don't recreate a running validator's SND; if you must replace, do it clean.
+**2. Deploy clean — never blow-away-and-recreate a running chain's SND.** This bit the dry-run's genesis chain, whose nodes use **controller-generated** `node_key.json`: recreating the SND regenerates those keys per pod, but peers keep the *old* NodeIDs in `persistent_peers` → every P2P dial is rejected (`peer NodeID = X, want Y`) and the chain wedges (never produces a block). A fresh deploy resolves and writes the correct NodeIDs into `persistent_peers` the first time. **For *this* BYO validator the NodeID is pinned by the `nodeKey` Secret and is stable across recreation** — so its NodeID won't churn, but recreation is still hazardous for a different reason: finding 3 (it destroys the data PVC). Bottom line: don't recreate a running validator's SND; if you must replace, do it clean.
 
 **3. Deleting the SeiNode destroys the data PVC.** A `Failed` SeiNode does **not** self-heal from a spec edit — the controller treats `Failed` as terminal and emits "Delete and recreate the resource to retry", so the only way to replan is to delete the SeiNode. But the data PVC carries a controller ownerRef **directly to the SeiNode** (the SeiNode owns the StatefulSet→Pod *and* the PVC in parallel — the PVC is **not** a StatefulSet `volumeClaimTemplate`, so it is not protected by the STS's `WhenDeleted=Retain`). Deleting the SeiNode therefore GC-deletes the PVC, destroying chain state and forcing a full re-sync. The **consensus identity survives** (it's in the Secrets), so the validator comes back as itself — but budget for the resync. Note the scope of `deletionPolicy: Retain`: it governs the **SND→child** cascade only — it protects the PVC when the *SND* is deleted, but a manual `kubectl delete seinode <child>` (the delete-to-replan action) still GC-deletes the PVC via the SeiNode ownerRef regardless.
 
diff --git a/README.md b/README.md
index a45583d2..a078101a 100644
--- a/README.md
+++ b/README.md
@@ -72,7 +72,7 @@ spec:
 
 | Mode | Condition | Key tasks |
 |------|-----------|-----------|
-| Full node | `spec.fullNode` set | `configure-genesis` > `snapshot-restore` > `config-apply` > `discover-peers` > `mark-ready` |
+| Full node | `spec.fullNode` set | `configure-genesis` > `snapshot-restore` > `config-apply` > `mark-ready` |
 | Validator | `spec.validator` set | Same as full node, or genesis ceremony flow for new networks |
 | Archive | `spec.archive` set | State sync with archival pruning configuration |
 | Replayer | `spec.replayer` set | Snapshot restore with result export for shadow validation |
diff --git a/docs/design-seinode-import-volume.md b/docs/design-seinode-import-volume.md
index f469f0b4..88eb0d70 100644
--- a/docs/design-seinode-import-volume.md
+++ b/docs/design-seinode-import-volume.md
@@ -62,7 +62,7 @@ spec:
       pvcName: data-archive-0-0      # name of a pre-existing PVC in the SeiNode's namespace
 ```
 
-Planner behavior: **the init plan is unchanged.** The only difference is inside `ensure-data-pvc`: if `spec.dataVolume.import.pvcName` is set, the task verifies the named PVC instead of creating a fresh one. Every successor task (`apply-statefulset`, `apply-service`, `configure-genesis`, `config-apply`, `discover-peers`, `configure-state-sync`, `config-validate`, `mark-ready`) runs exactly as it does today.
+Planner behavior: **the init plan is unchanged.** The only difference is inside `ensure-data-pvc`: if `spec.dataVolume.import.pvcName` is set, the task verifies the named PVC instead of creating a fresh one. Every successor task (`apply-statefulset`, `apply-service`, `configure-genesis`, `config-apply`, `configure-state-sync`, `config-validate`, `mark-ready`) runs exactly as it does today.
 
 This is a deliberate "no extra fluff" choice: import is a PVC-source substitution, not a bootstrap off-ramp. The operator is trusted to provide a PVC whose contents are compatible with the rest of the init progression. If the imported data is from an incompatible seid version, the wrong chain, or in an unexpected on-disk format, seid will fail to start on the pod and the operator gets a clear signal from the Failed plan — same failure channel as any other init problem.
 
diff --git a/docs/design/composable-genesis.md b/docs/design/composable-genesis.md
index aee77f2f..b0771536 100644
--- a/docs/design/composable-genesis.md
+++ b/docs/design/composable-genesis.md
@@ -54,7 +54,6 @@ that no individual node can perform alone.
 ```
 init-validator          → create keys, gentx, publish identity to S3
 configure-genesis       → download assembled genesis.json from S3 (retries until available)
-discover-peers          → resolve network peers
 configure-state-sync    → (only if StateSync is set)
 config-patch            → apply TOML config patches
 mark-ready              → signal bootstrap complete
@@ -66,7 +65,6 @@ register-validator      → (only on existing chains) submit create-validator tx
 ```
 init-validator          → create keys (gentx not needed; genesis already exists)
 configure-genesis       → download existing chain genesis from S3 (immediately available)
-discover-peers          → resolve network peers
 configure-state-sync    → sync to chain tip
 config-patch            → apply TOML config patches
 mark-ready              → signal bootstrap complete
diff --git a/internal/task/bootstrap_resources.go b/internal/task/bootstrap_resources.go
index 1939d0e8..d2f80e09 100644
--- a/internal/task/bootstrap_resources.go
+++ b/internal/task/bootstrap_resources.go
@@ -235,7 +235,7 @@ func buildBootstrapPodSpec(node *seiv1alpha1.SeiNode, snap *seiv1alpha1.Snapshot
 // to report healthy and then runs seid with --halt-height. Polls /v0/healthz
 // which returns 503 until the mark-ready task completes, ensuring all
 // bootstrap sidecar tasks (snapshot-restore, configure-genesis, config-apply,
-// discover-peers, config-validate) have finished before seid starts.
+// config-validate) have finished before seid starts.
 //
 // Uses bash's /dev/tcp to make raw HTTP requests instead of wget/curl, which
 // are not available on all sei images.

From 32852750e8af3d477ace448e56b7d264104f9f9f Mon Sep 17 00:00:00 2001
From: bdchatham <bdchatham@gmail.com>
Date: Fri, 12 Jun 2026 14:37:29 -0700
Subject: [PATCH 2/2] docs: fold in cross-review polish (dated incident
 annotation, drop dangling analogy)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

PR-401 cross-review (sei-network) optional hardenings:
- docs/known-issues-node-alarms.md: annotate the discover-peers mention as
  historical so a future reader grepping discover-peers isn't misled — keeps
  the incident record intact, adds the controller-owned-peering context.
- controller_test.go: drop the trailing "Mirrors the DiscoverPeers poll shape"
  analogy — the first sentence already states the poll-to-completion shape, and
  the analogy referenced now-deleted code.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/known-issues-node-alarms.md                | 2 +-
 internal/controller/nodetask/controller_test.go | 3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/docs/known-issues-node-alarms.md b/docs/known-issues-node-alarms.md
index 0fa6f3c0..2f622d62 100644
--- a/docs/known-issues-node-alarms.md
+++ b/docs/known-issues-node-alarms.md
@@ -8,7 +8,7 @@ Recurring alerts observed during SeiNode and SeiNodeDeployment deployments. Thes
 **Environment:** dev
 **Severity:** critical (alert), expected during iteration
 
-**What happens:** The shadow replayer fails during bootstrap, typically at `discover-peers` or `configure-state-sync`. Each failed deployment requires deleting and recreating the SeiNode.
+**What happens:** The shadow replayer fails during bootstrap, typically at `discover-peers` or `configure-state-sync`. Each failed deployment requires deleting and recreating the SeiNode. _(Historical: `discover-peers` was a sidecar bootstrap task when this incident occurred; peering is now controller-owned via the config-apply `persistent_peers` override and is no longer a distinct task.)_
 
 **Root causes encountered:**
 1. **Pruned peers (resolved):** State-syncer EC2 nodes pruned blocks below the snapshot height (200440000). `configure-state-sync` queries peers for a block hash at the trust height and gets empty responses. Fix: use a snapshot at a height within peers' retention window.
diff --git a/internal/controller/nodetask/controller_test.go b/internal/controller/nodetask/controller_test.go
index 2d78ebdc..614d764e 100644
--- a/internal/controller/nodetask/controller_test.go
+++ b/internal/controller/nodetask/controller_test.go
@@ -701,8 +701,7 @@ func TestTaskParamsForKind_RestartSeid(t *testing.T) {
 
 // RestartSeid is poll-to-completion (registered sidecarTask[...](false)): unlike
 // MarkReady's fire-and-forget ack, the controller polls GetTask until the
-// restart-seid task reports terminal (seid's RPC back up). Mirrors the
-// DiscoverPeers poll shape.
+// restart-seid task reports terminal (seid's RPC back up).
 func TestReconcile_RestartSeid_EndToEnd(t *testing.T) {
 	g := NewWithT(t)
 	ctx := context.Background()