feat(statesync): canonical-syncer ConfigMap + fail-closed StateSyncReady gate (PLT-452) by bdchatham · Pull Request #397 · sei-protocol/sei-k8s-controller

bdchatham · 2026-06-11T23:24:36Z

The state-sync half of PLT-452 (controller side). Stacked on #396 (peering) — base is plt452/peering-consolidation; retarget to main once #396 merges. Trust model is signed off (owner-accepted posture; see the PLT-452 sign-off comment).

What changed

Canonical-syncer ConfigMap as the trust root. A controller-level ConfigMap (keyed by chainID, data[chainID] = bare host:port syncer RPC endpoints) replaces the label-derived Status.ResolvedRPCWitnesses witness feed. configureStateSyncTask now sources rpc_servers from Status.ResolvedStateSyncers. New optional SEI_STATESYNC_SYNCERS_CONFIGMAP / SEI_STATESYNC_SYNCERS_NAMESPACE platform config.
Fail-closed StateSyncReady gate (between reconcilePeers and ResolvePlan): enabled + <2 configured syncers (missing ConfigMap / unset ref / no chain entry) → False/NoSyncersConfigured + Warning Event + no state-sync plan; disabled → False/NotApplicable; enabled + ≥2 → True/Ready. Always-present condition, stable reason enum, ObservedGeneration, single optimistic-lock patch. An active plan is left to finish (no yank on a transient blip).
Dropped the label-derived witness path (Result.Witnesses/peerRPCAddress in internal/peering); Status.ResolvedRPCWitnesses left present-but-unwritten (deprecated — CRD-field removal is a one-way door, deferred to the version bump).

Scope boundary

The controller's ≥2 is a configured-count check. The ≥2 byte-for-byte witness agreement + kill duplicate-to-fill is the seictl sidecar's job — a separate coordinated PR. The ≥2 agreement guarantee is not in effect until that lands.

Cross-review (4 experts, signed-off trust model)

sei-network (correctness), idiom (condition discipline + single-patch) — clean, no blocks.
platform + security flagged that the controller's pre-existing cluster-wide configmaps write lets it rewrite its own trust root — violating signed-off precondition Flux/dev deployment key; why does it exist on this repo #1 (read-only on the trust root). Per owner decision: ship the working controller half now, fix the RBAC boundary as a fast-follow tracked in PLT-471 (a PLT-452 launch-gate blocker, landing with the namespace lockdown that decides the trust-root namespace). The in-repo read-only marker is documentary-only and now says so.

Follow-ups

PLT-471 — scope controller RBAC to read-only on the trust-root ConfigMap (launch-gate blocker).
seictl trust-hardening (≥2 agreement) — separate cross-repo PR.

🤖 Generated with Claude Code

cursor · 2026-06-11T23:24:42Z

PR Summary

High Risk
Changes state-sync trust inputs and blocks node bootstrap when syncers are misconfigured; incorrect or missing canonical syncer config can prevent state-sync nodes from initializing.

Overview
State-sync RPC witnesses no longer come from label-resolved peers. The controller reads curated canonical syncer endpoints from an optional GitOps-mounted YAML file (SEI_CONTROLLER_CONFIG → stateSync.syncers by chain) on every reconcile, writes them to status.resolvedStateSyncers, and feeds them into ConfigureStateSyncTask instead of resolvedRPCWitnesses (deprecated, left unwritten).

A new StateSyncReady condition and stateSyncBlocksPlan gate fail closed when state-sync is enabled but fewer than two syncers are configured (or the source is missing); init plans with state-sync are not built, with a one-shot StateSyncBlocked warning. Peering now only resolves persistent_peers; witness derivation was removed from internal/peering.

The manager deployment mounts an optional sei-controller-config ConfigMap (directory mount for live updates). Reconcile still runs StatefulSet/terminal-plan cleanup on transient parse/read errors and requeues on syncer-source failures.

^{Reviewed by Cursor Bugbot for commit ab26a2b. Bugbot is set up for automated code reviews on this repo. Configure here.}

…ady gate (PLT-452) Replace the label-derived ResolvedRPCWitnesses witness feed with a controller- level canonical-syncer ConfigMap (keyed by chainID) as the state-sync trust root. configureStateSyncTask now sources rpc_servers from Status.ResolvedStateSyncers. Add a fail-closed StateSyncReady condition + gate (between reconcilePeers and ResolvePlan): when state-sync is enabled and <2 canonical syncers are configured (missing ConfigMap / unset ref / no chain entry all count), the controller sets StateSyncReady=False/NoSyncersConfigured, emits a Warning Event, and builds no state-sync plan — never a witness-less or untrusted-fallback sync. An active plan is left to finish (no yank on a transient ConfigMap blip). Disabled nodes report False/NotApplicable (always-present per condition discipline). The condition + ResolvedStateSyncers flush through the existing single optimistic- lock status patch. The controller's ≥2 is a CONFIGURED-COUNT check; the ≥2 byte-for-byte witness agreement (kill duplicate-to-fill) is the sidecar's job, landing with the seictl trust-hardening bump. Status.ResolvedRPCWitnesses is left present-but-unwritten (deprecated; CRD-field removal is a one-way door, deferred to the version bump). Read-only RBAC on the trust-root ConfigMap is not yet enforced (pre-existing cluster-wide configmaps grant) — tracked as a PLT-452 launch-gate blocker in PLT-471. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…turns StateSyncReady is documented always-present, but Reconcile returned early on spec.Paused and PhaseFailed before the gate seeded it, so a paused- or failed-from-creation node never got the condition (Cursor Bugbot). Split the gate: reconcileStateSyncGate (resolution — sets the always-present condition) now runs once before the early-returns, so the condition rides the existing flush on every path; enforceStateSyncGate (fail-closed stop/event) runs only on the active path. Single optimistic-lock patch preserved; gocyclo held under 30 by extracting emitPhaseTransition. Adds a paused-node always-present test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ration Per owner direction, drop the planned seictl ≥2-agreement/equivocation hardening as over-engineering. The controller already solves the real problem: state-sync witnesses come from a curated canonical-syncer ConfigMap (seeded per cluster) instead of label-matched peers, decoupling peers from state-sync and removing the sharp edge of using fine-as-peers-but-invalid-as-witnesses nodes. The sidecar keeps its existing trust-pinning behavior; reliability comes from curating the syncer set, not cross-witness checking. Comment-only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The new statesync_test.go tripped goconst on the main-targeted lint (default, sei:latest, a:26657/b:26657, sei-test). Hoist them to test consts; the rest of the package keeps its existing literals (unchanged, grandfathered). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…iew) Move the fail-closed state-sync gate from a reconcile short-circuit to a plan-construction check, fixing three Cursor findings: - HIGH: terminal plans now always clear — handleTerminalPlan runs before the block check, so a fail-closed node no longer strands Complete/Failed plans. - MED: a transient canonical-syncer ConfigMap read error is non-fatal — it sets StateSyncReady=Unknown/ConfigMapReadError, clears ResolvedStateSyncers, and requeues, instead of aborting StatefulSet/Paused/Failed/flush handling. - MED: StateSyncBlocked fires once per transition, not every requeue. enforceStateSyncGate is deleted; stateSyncBlocksPlan declines a state-sync plan when StateSyncReady != True (init path only — running update plans carry no state-sync task). Fail-closed correctness is preserved: never builds with <2 or stale/unresolved syncers. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

bdchatham · 2026-06-12T18:26:58Z

+	StateSyncSyncersConfigMap string
+
+	// StateSyncSyncersNamespace is the namespace of StateSyncSyncersConfigMap.
+	// Required when StateSyncSyncersConfigMap is set (Validate enforces the pair).
+	StateSyncSyncersNamespace string


Instead of making these parameters config map, can you not just mount a configuration file like application config that supports state syncers? Honestly probably makes sense to create a linear ticket to migrate existing environment variables to this configuration file

…t the API Per owner review: replace the live-API-read canonical-syncer ConfigMap with a read-only mounted file (YAML chainID -> [host:port]). Removes the live K8s API dependency and the configmaps read-RBAC from the state-sync path, and closes the controller-binary in-process write path to the trust root (the read-only mount physically enforces what the RBAC marker only documented). Source swap only — the fail-closed StateSyncReady gate, plan-construction gating, terminal-cleanup ordering, and event-on-transition are unchanged. Mapping preserved: missing/empty file -> fail closed (NoSyncersConfigured); read/parse error -> transient (Unknown/SyncerSourceError + requeue, reconcile not aborted). Mounted as a directory volume (never subPath) from an optional ConfigMap so the controller starts and fails closed when syncers aren't provisioned. First slice of PLT-475 (env -> mounted app-config). The residual cluster-wide configmaps write grant remains PLT-471's fix. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit f4cd247. Configure here.}

cursor · 2026-06-12T18:49:48Z

+	// to drive a requeue; poll so the gate re-resolves once the API recovers.
+	if stateSyncTransient && result.IsZero() {
+		return ctrl.Result{RequeueAfter: statusPollInterval}, nil
+	}


Fail-closed syncers never requeue

High Severity

When state sync is enabled but the gate is False/NoSyncersConfigured, reconcile often returns with no RequeueAfter. Only Running nodes get the 30s poll; pending or initializing fail-closed nodes do not. GitOps updates to the mounted syncer file then never re-run reconcileStateSyncGate, so nodes can stay blocked after syncers appear.

^{Reviewed by Cursor Bugbot for commit f4cd247. Configure here.}

Per owner review: make #397's mounted file the controller's generic application-config file (SEI_CONTROLLER_CONFIG -> /etc/sei-controller/config.yaml) with state-sync as one section, rather than a bespoke syncers file — so it's the foundation PLT-475 extends, not a throwaway shape. - platform.FileConfig{ StateSync: { Syncers map[chainID][]host:port } } typed, sibling to the env-sourced platform.Config (file-sourced vs env-sourced kept distinct; the config path is still an env var, so the env-driven convention holds). - platform.Config.StateSyncSyncersFile -> ControllerConfigFile; canonicalSyncers reads cfg.StateSync.Syncers[chainID]. Conventional os.ReadFile + sigs.k8s.io/yaml into the typed struct, fresh per reconcile for the dynamic syncer section. - Mount renamed: ConfigMap sei-controller-config, directory mount at /etc/sei-controller (not subPath), readOnly, optional. Fail-closed/transient semantics unchanged. First slice of PLT-475. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

cursor Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread internal/controller/node/controller.go Outdated

cursor Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread internal/controller/node/statesync.go

cursor Bot reviewed Jun 12, 2026

View reviewed changes

Comment thread internal/controller/node/statesync.go Outdated

bdchatham and others added 3 commits June 12, 2026 10:36

bdchatham changed the base branch from plt452/peering-consolidation to main June 12, 2026 17:37

bdchatham force-pushed the plt452/statesync-controller branch from 31f19b9 to 66b7712 Compare June 12, 2026 17:38

cursor Bot reviewed Jun 12, 2026

View reviewed changes

Comment thread internal/controller/node/controller.go Outdated

cursor Bot reviewed Jun 12, 2026

View reviewed changes

Comment thread internal/controller/node/controller.go Outdated

bdchatham commented Jun 12, 2026

View reviewed changes

cursor Bot reviewed Jun 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(statesync): canonical-syncer ConfigMap + fail-closed StateSyncReady gate (PLT-452)#397

feat(statesync): canonical-syncer ConfigMap + fail-closed StateSyncReady gate (PLT-452)#397
bdchatham wants to merge 7 commits into
mainfrom
plt452/statesync-controller

bdchatham commented Jun 11, 2026

Uh oh!

cursor Bot commented Jun 11, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bdchatham Jun 12, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bdchatham commented Jun 11, 2026

What changed

Scope boundary

Cross-review (4 experts, signed-off trust model)

Follow-ups

Uh oh!

cursor Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bdchatham Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jun 12, 2026

Choose a reason for hiding this comment

Fail-closed syncers never requeue

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cursor Bot commented Jun 11, 2026 •

edited

Loading