Merged
9 changes: 5 additions & 4 deletions .agent/runbooks/migrating-validator-to-byo-secrets.md
@@ -271,9 +271,10 @@ Ordered. Do not reorder steps 3 and 4.
- Flux decrypts the Secrets and the SeiNode's bootstrap plan runs (no-snapshot
validator): ensure-data-pvc -> validate-signing-key -> validate-node-key ->
apply-rbac-proxy-config -> apply-statefulset -> apply-service ->
-configure-genesis -> config-apply -> discover-peers -> config-validate ->
-mark-ready (config-apply writes the base config; discover-peers then writes
-persistent-peers; config-validate checks the assembled result last). seid then
+configure-genesis -> config-apply -> config-validate ->
+mark-ready (config-apply writes the base config and folds in
+persistent-peers from the controller-resolved peer set; config-validate
+checks the assembled result last). seid then
block-syncs once the pod is up (there is no "block-sync" task — it's seid
catching up, observed via catching_up/height, not the plan).
- Expect the networking.tcp DNS race on cold start (§6 finding 1): ~6-8 min of
@@ -302,7 +303,7 @@ Numbered as encountered. 1–4 are the ones that change operator behavior.

**1. `networking.tcp` cold-start DNS race (self-heals).** A `tcp` validator gets a per-pod internet-facing NLB + an external-DNS Route53 record, and seid is configured with that hostname as its P2P `external-address`. On first boot seid resolves its own external address *before* external-DNS has published it → `lookup ...: no such host` → CrashLoopBackOff, made worse by CoreDNS negative-cache TTL. It clears on its own in ~6–8 min. **This is expected; do not drop `networking.tcp` to "fix" it** — a public arctic-1 validator needs the NLB.

-**2. Deploy clean — never blow-away-and-recreate a running chain's SND.** This bit the dry-run's genesis chain, whose nodes use **controller-generated** `node_key.json`: recreating the SND regenerates those keys per pod, but peers keep the *old* NodeIDs in `persistent_peers` → every P2P dial is rejected (`peer NodeID = X, want Y`) and the chain wedges (never produces a block). A fresh deploy's `discover-peers` wires correct NodeIDs the first time. **For *this* BYO validator the NodeID is pinned by the `nodeKey` Secret and is stable across recreation** — so its NodeID won't churn, but recreation is still hazardous for a different reason: finding 3 (it destroys the data PVC). Bottom line: don't recreate a running validator's SND; if you must replace, do it clean.
+**2. Deploy clean — never blow-away-and-recreate a running chain's SND.** This bit the dry-run's genesis chain, whose nodes use **controller-generated** `node_key.json`: recreating the SND regenerates those keys per pod, but peers keep the *old* NodeIDs in `persistent_peers` → every P2P dial is rejected (`peer NodeID = X, want Y`) and the chain wedges (never produces a block). A fresh deploy resolves and writes the correct NodeIDs into `persistent_peers` the first time. **For *this* BYO validator the NodeID is pinned by the `nodeKey` Secret and is stable across recreation** — so its NodeID won't churn, but recreation is still hazardous for a different reason: finding 3 (it destroys the data PVC). Bottom line: don't recreate a running validator's SND; if you must replace, do it clean.

**3. Deleting the SeiNode destroys the data PVC.** A `Failed` SeiNode does **not** self-heal from a spec edit — the controller treats `Failed` as terminal and emits "Delete and recreate the resource to retry", so the only way to replan is to delete the SeiNode. But the data PVC carries a controller ownerRef **directly to the SeiNode** (the SeiNode owns the StatefulSet→Pod *and* the PVC in parallel — the PVC is **not** a StatefulSet `volumeClaimTemplate`, so it is not protected by the STS's `WhenDeleted=Retain`). Deleting the SeiNode therefore GC-deletes the PVC, destroying chain state and forcing a full re-sync. The **consensus identity survives** (it's in the Secrets), so the validator comes back as itself — but budget for the resync. Note the scope of `deletionPolicy: Retain`: it governs the **SND→child** cascade only — it protects the PVC when the *SND* is deleted, but a manual `kubectl delete seinode <child>` (the delete-to-replan action) still GC-deletes the PVC via the SeiNode ownerRef regardless.
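Finding 3 rests on the Kubernetes garbage-collection rule that a dependent is removed once none of its owners remain. A minimal Go sketch of that rule follows, assuming simplified stand-in types rather than the real `metav1.OwnerReference`; the object names `validator-0` and `data-validator-0` are hypothetical and do not come from the runbook.

```go
package main

import "fmt"

// OwnerRef is a simplified, hypothetical stand-in for a Kubernetes owner reference.
type OwnerRef struct {
	Kind string
	Name string
}

// Object is a simplified, hypothetical stand-in for a namespaced object such as a PVC.
type Object struct {
	Name   string
	Owners []OwnerRef
}

// collectedWhenDeleted reports whether deleting the owner (kind, name) leaves obj
// with no remaining owners, which is when garbage collection removes it.
func collectedWhenDeleted(obj Object, kind, name string) bool {
	if len(obj.Owners) == 0 {
		return false // unowned objects are never garbage-collected
	}
	for _, o := range obj.Owners {
		if o.Kind != kind || o.Name != name {
			return false // some other owner would still exist
		}
	}
	return true
}

func main() {
	// Assumed shape from finding 3: the PVC's only owner is the SeiNode itself.
	pvc := Object{
		Name:   "data-validator-0",
		Owners: []OwnerRef{{Kind: "SeiNode", Name: "validator-0"}},
	}
	fmt.Println(collectedWhenDeleted(pvc, "SeiNode", "validator-0"))     // true
	fmt.Println(collectedWhenDeleted(pvc, "StatefulSet", "validator-0")) // false
}
```

Under these assumptions the StatefulSet is irrelevant to the PVC's lifetime, which is why a StatefulSet retention policy does not protect it.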

2 changes: 1 addition & 1 deletion README.md
@@ -72,7 +72,7 @@ spec:

| Mode | Condition | Key tasks |
|------|-----------|-----------|
-| Full node | `spec.fullNode` set | `configure-genesis` > `snapshot-restore` > `config-apply` > `discover-peers` > `mark-ready` |
+| Full node | `spec.fullNode` set | `configure-genesis` > `snapshot-restore` > `config-apply` > `mark-ready` |
| Validator | `spec.validator` set | Same as full node, or genesis ceremony flow for new networks |
| Archive | `spec.archive` set | State sync with archival pruning configuration |
| Replayer | `spec.replayer` set | Snapshot restore with result export for shadow validation |
2 changes: 1 addition & 1 deletion docs/design-seinode-import-volume.md
@@ -62,7 +62,7 @@ spec:
pvcName: data-archive-0-0 # name of a pre-existing PVC in the SeiNode's namespace
```

-Planner behavior: **the init plan is unchanged.** The only difference is inside `ensure-data-pvc`: if `spec.dataVolume.import.pvcName` is set, the task verifies the named PVC instead of creating a fresh one. Every successor task (`apply-statefulset`, `apply-service`, `configure-genesis`, `config-apply`, `discover-peers`, `configure-state-sync`, `config-validate`, `mark-ready`) runs exactly as it does today.
+Planner behavior: **the init plan is unchanged.** The only difference is inside `ensure-data-pvc`: if `spec.dataVolume.import.pvcName` is set, the task verifies the named PVC instead of creating a fresh one. Every successor task (`apply-statefulset`, `apply-service`, `configure-genesis`, `config-apply`, `configure-state-sync`, `config-validate`, `mark-ready`) runs exactly as it does today.

This is a deliberate "no extra fluff" choice: import is a PVC-source substitution, not a bootstrap off-ramp. The operator is trusted to provide a PVC whose contents are compatible with the rest of the init progression. If the imported data is from an incompatible seid version, the wrong chain, or in an unexpected on-disk format, seid will fail to start on the pod and the operator gets a clear signal from the Failed plan — same failure channel as any other init problem.
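The verify-instead-of-create branch described above can be sketched in Go as follows. This is an illustration under stated assumptions, not the controller's actual code: `DataVolumeSpec`, `ensureDataPVC`, the in-memory `existing` set, and the default name `data-node-0` are all hypothetical. Only `data-archive-0-0` comes from the example spec.

```go
package main

import "fmt"

// DataVolumeSpec is a hypothetical, flattened view of spec.dataVolume.import.pvcName.
type DataVolumeSpec struct {
	ImportPVCName string
}

// ensureDataPVC returns the PVC name the rest of the plan should use.
// existing is a hypothetical stand-in for a lookup against the cluster.
func ensureDataPVC(spec DataVolumeSpec, defaultName string, existing map[string]bool) (name string, created bool, err error) {
	if spec.ImportPVCName != "" {
		// Import path: verify only, never create.
		if !existing[spec.ImportPVCName] {
			return "", false, fmt.Errorf("imported PVC %q not found", spec.ImportPVCName)
		}
		return spec.ImportPVCName, false, nil
	}
	// Default path: create a fresh PVC if it is not there yet.
	if !existing[defaultName] {
		existing[defaultName] = true
		return defaultName, true, nil
	}
	return defaultName, false, nil
}

func main() {
	existing := map[string]bool{"data-archive-0-0": true}

	name, created, err := ensureDataPVC(DataVolumeSpec{ImportPVCName: "data-archive-0-0"}, "data-node-0", existing)
	fmt.Println(name, created, err)

	name, created, err = ensureDataPVC(DataVolumeSpec{}, "data-node-0", existing)
	fmt.Println(name, created, err)
}
```

Either branch hands a PVC name to the successor tasks, which is consistent with the init plan itself staying unchanged.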

2 changes: 0 additions & 2 deletions docs/design/composable-genesis.md
@@ -54,7 +54,6 @@ that no individual node can perform alone.
```
init-validator → create keys, gentx, publish identity to S3
configure-genesis → download assembled genesis.json from S3 (retries until available)
-discover-peers → resolve network peers
configure-state-sync → (only if StateSync is set)
config-patch → apply TOML config patches
mark-ready → signal bootstrap complete
@@ -66,7 +65,6 @@ register-validator → (only on existing chains) submit create-validator tx
```
init-validator → create keys (gentx not needed; genesis already exists)
configure-genesis → download existing chain genesis from S3 (immediately available)
-discover-peers → resolve network peers
configure-state-sync → sync to chain tip
config-patch → apply TOML config patches
mark-ready → signal bootstrap complete
2 changes: 1 addition & 1 deletion docs/known-issues-node-alarms.md
@@ -8,7 +8,7 @@ Recurring alerts observed during SeiNode and SeiNodeDeployment deployments. Thes
**Environment:** dev
**Severity:** critical (alert), expected during iteration

-**What happens:** The shadow replayer fails during bootstrap, typically at `discover-peers` or `configure-state-sync`. Each failed deployment requires deleting and recreating the SeiNode.
+**What happens:** The shadow replayer fails during bootstrap, typically at `discover-peers` or `configure-state-sync`. Each failed deployment requires deleting and recreating the SeiNode. _(Historical: `discover-peers` was a sidecar bootstrap task when this incident occurred; peering is now controller-owned via the config-apply `persistent_peers` override and is no longer a distinct task.)_

**Root causes encountered:**
1. **Pruned peers (resolved):** State-syncer EC2 nodes pruned blocks below the snapshot height (200440000). `configure-state-sync` queries peers for a block hash at the trust height and gets empty responses. Fix: use a snapshot at a height within peers' retention window.
3 changes: 1 addition & 2 deletions internal/controller/nodetask/controller_test.go
@@ -701,8 +701,7 @@ func TestTaskParamsForKind_RestartSeid(t *testing.T) {

// RestartSeid is poll-to-completion (registered sidecarTask[...](false)): unlike
// MarkReady's fire-and-forget ack, the controller polls GetTask until the
-// restart-seid task reports terminal (seid's RPC back up). Mirrors the
-// DiscoverPeers poll shape.
+// restart-seid task reports terminal (seid's RPC back up).
func TestReconcile_RestartSeid_EndToEnd(t *testing.T) {
g := NewWithT(t)
ctx := context.Background()
2 changes: 1 addition & 1 deletion internal/task/bootstrap_resources.go
@@ -235,7 +235,7 @@ func buildBootstrapPodSpec(node *seiv1alpha1.SeiNode, snap *seiv1alpha1.Snapshot
// to report healthy and then runs seid with --halt-height. Polls /v0/healthz
// which returns 503 until the mark-ready task completes, ensuring all
// bootstrap sidecar tasks (snapshot-restore, configure-genesis, config-apply,
-// discover-peers, config-validate) have finished before seid starts.
+// config-validate) have finished before seid starts.
//
// Uses bash's /dev/tcp to make raw HTTP requests instead of wget/curl, which
// are not available on all sei images.