Skip to content

feat(statesync): canonical-syncer ConfigMap + fail-closed StateSyncReady gate (PLT-452)#397

Open
bdchatham wants to merge 7 commits into
mainfrom
plt452/statesync-controller
Open

feat(statesync): canonical-syncer ConfigMap + fail-closed StateSyncReady gate (PLT-452)#397
bdchatham wants to merge 7 commits into
mainfrom
plt452/statesync-controller

Conversation

@bdchatham

Copy link
Copy Markdown
Collaborator

The state-sync half of PLT-452 (controller side). Stacked on #396 (peering) — base is plt452/peering-consolidation; retarget to main once #396 merges. Trust model is signed off (owner-accepted posture; see the PLT-452 sign-off comment).

What changed

  • Canonical-syncer ConfigMap as the trust root. A controller-level ConfigMap (keyed by chainID, data[chainID] = bare host:port syncer RPC endpoints) replaces the label-derived Status.ResolvedRPCWitnesses witness feed. configureStateSyncTask now sources rpc_servers from Status.ResolvedStateSyncers. New optional SEI_STATESYNC_SYNCERS_CONFIGMAP / SEI_STATESYNC_SYNCERS_NAMESPACE platform config.
  • Fail-closed StateSyncReady gate (between reconcilePeers and ResolvePlan): enabled + <2 configured syncers (missing ConfigMap / unset ref / no chain entry) → False/NoSyncersConfigured + Warning Event + no state-sync plan; disabled → False/NotApplicable; enabled + ≥2 → True/Ready. Always-present condition, stable reason enum, ObservedGeneration, single optimistic-lock patch. An active plan is left to finish (no yank on a transient blip).
  • Dropped the label-derived witness path (Result.Witnesses/peerRPCAddress in internal/peering); Status.ResolvedRPCWitnesses left present-but-unwritten (deprecated — CRD-field removal is a one-way door, deferred to the version bump).

Scope boundary

  • The controller's ≥2 is a configured-count check. The ≥2 byte-for-byte witness agreement + kill duplicate-to-fill is the seictl sidecar's job — a separate coordinated PR. The ≥2 agreement guarantee is not in effect until that lands.

Cross-review (4 experts, signed-off trust model)

  • sei-network (correctness), idiom (condition discipline + single-patch) — clean, no blocks.
  • platform + security flagged that the controller's pre-existing cluster-wide configmaps write lets it rewrite its own trust root — violating signed-off precondition Flux/dev deployment key; why does it exist on this repo #1 (read-only on the trust root). Per owner decision: ship the working controller half now, fix the RBAC boundary as a fast-follow tracked in PLT-471 (a PLT-452 launch-gate blocker, landing with the namespace lockdown that decides the trust-root namespace). The in-repo read-only marker is documentary-only and now says so.

Follow-ups

  • PLT-471 — scope controller RBAC to read-only on the trust-root ConfigMap (launch-gate blocker).
  • seictl trust-hardening (≥2 agreement) — separate cross-repo PR.

🤖 Generated with Claude Code

@cursor

cursor Bot commented Jun 11, 2026

Copy link
Copy Markdown

PR Summary

High Risk
Changes state-sync trust inputs and blocks node bootstrap when syncers are misconfigured; incorrect or missing canonical syncer config can prevent state-sync nodes from initializing.

Overview
State-sync RPC witnesses no longer come from label-resolved peers. The controller reads curated canonical syncer endpoints from an optional GitOps-mounted YAML file (SEI_CONTROLLER_CONFIGstateSync.syncers by chain) on every reconcile, writes them to status.resolvedStateSyncers, and feeds them into ConfigureStateSyncTask instead of resolvedRPCWitnesses (deprecated, left unwritten).

A new StateSyncReady condition and stateSyncBlocksPlan gate fail closed when state-sync is enabled but fewer than two syncers are configured (or the source is missing); init plans with state-sync are not built, with a one-shot StateSyncBlocked warning. Peering now only resolves persistent_peers; witness derivation was removed from internal/peering.

The manager deployment mounts an optional sei-controller-config ConfigMap (directory mount for live updates). Reconcile still runs StatefulSet/terminal-plan cleanup on transient parse/read errors and requeues on syncer-source failures.

Reviewed by Cursor Bugbot for commit ab26a2b. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread internal/controller/node/controller.go Outdated
Comment thread internal/controller/node/statesync.go
Comment thread internal/controller/node/statesync.go Outdated
bdchatham and others added 3 commits June 12, 2026 10:36
…ady gate (PLT-452)

Replace the label-derived ResolvedRPCWitnesses witness feed with a controller-
level canonical-syncer ConfigMap (keyed by chainID) as the state-sync trust
root. configureStateSyncTask now sources rpc_servers from Status.ResolvedStateSyncers.

Add a fail-closed StateSyncReady condition + gate (between reconcilePeers and
ResolvePlan): when state-sync is enabled and <2 canonical syncers are configured
(missing ConfigMap / unset ref / no chain entry all count), the controller sets
StateSyncReady=False/NoSyncersConfigured, emits a Warning Event, and builds no
state-sync plan — never a witness-less or untrusted-fallback sync. An active
plan is left to finish (no yank on a transient ConfigMap blip). Disabled nodes
report False/NotApplicable (always-present per condition discipline). The
condition + ResolvedStateSyncers flush through the existing single optimistic-
lock status patch.

The controller's ≥2 is a CONFIGURED-COUNT check; the ≥2 byte-for-byte witness
agreement (kill duplicate-to-fill) is the sidecar's job, landing with the seictl
trust-hardening bump. Status.ResolvedRPCWitnesses is left present-but-unwritten
(deprecated; CRD-field removal is a one-way door, deferred to the version bump).

Read-only RBAC on the trust-root ConfigMap is not yet enforced (pre-existing
cluster-wide configmaps grant) — tracked as a PLT-452 launch-gate blocker in
PLT-471.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…turns

StateSyncReady is documented always-present, but Reconcile returned early on
spec.Paused and PhaseFailed before the gate seeded it, so a paused- or
failed-from-creation node never got the condition (Cursor Bugbot). Split the
gate: reconcileStateSyncGate (resolution — sets the always-present condition)
now runs once before the early-returns, so the condition rides the existing
flush on every path; enforceStateSyncGate (fail-closed stop/event) runs only on
the active path. Single optimistic-lock patch preserved; gocyclo held under 30
by extracting emitPhaseTransition. Adds a paused-node always-present test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ration

Per owner direction, drop the planned seictl ≥2-agreement/equivocation
hardening as over-engineering. The controller already solves the real problem:
state-sync witnesses come from a curated canonical-syncer ConfigMap (seeded
per cluster) instead of label-matched peers, decoupling peers from state-sync
and removing the sharp edge of using fine-as-peers-but-invalid-as-witnesses
nodes. The sidecar keeps its existing trust-pinning behavior; reliability comes
from curating the syncer set, not cross-witness checking. Comment-only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bdchatham bdchatham changed the base branch from plt452/peering-consolidation to main June 12, 2026 17:37
@bdchatham bdchatham force-pushed the plt452/statesync-controller branch from 31f19b9 to 66b7712 Compare June 12, 2026 17:38
Comment thread internal/controller/node/controller.go Outdated
The new statesync_test.go tripped goconst on the main-targeted lint (default,
sei:latest, a:26657/b:26657, sei-test). Hoist them to test consts; the rest of
the package keeps its existing literals (unchanged, grandfathered).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread internal/controller/node/controller.go Outdated
…iew)

Move the fail-closed state-sync gate from a reconcile short-circuit to a
plan-construction check, fixing three Cursor findings:

- HIGH: terminal plans now always clear — handleTerminalPlan runs before the
  block check, so a fail-closed node no longer strands Complete/Failed plans.
- MED: a transient canonical-syncer ConfigMap read error is non-fatal — it sets
  StateSyncReady=Unknown/ConfigMapReadError, clears ResolvedStateSyncers, and
  requeues, instead of aborting StatefulSet/Paused/Failed/flush handling.
- MED: StateSyncBlocked fires once per transition, not every requeue.

enforceStateSyncGate is deleted; stateSyncBlocksPlan declines a state-sync plan
when StateSyncReady != True (init path only — running update plans carry no
state-sync task). Fail-closed correctness is preserved: never builds with <2 or
stale/unresolved syncers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread internal/platform/platform.go Outdated
Comment on lines +74 to +78
StateSyncSyncersConfigMap string

// StateSyncSyncersNamespace is the namespace of StateSyncSyncersConfigMap.
// Required when StateSyncSyncersConfigMap is set (Validate enforces the pair).
StateSyncSyncersNamespace string

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of making these parameters config map, can you not just mount a configuration file like application config that supports state syncers? Honestly probably makes sense to create a linear ticket to migrate existing environment variables to this configuration file

…t the API

Per owner review: replace the live-API-read canonical-syncer ConfigMap with a
read-only mounted file (YAML chainID -> [host:port]). Removes the live K8s API
dependency and the configmaps read-RBAC from the state-sync path, and closes the
controller-binary in-process write path to the trust root (the read-only mount
physically enforces what the RBAC marker only documented).

Source swap only — the fail-closed StateSyncReady gate, plan-construction gating,
terminal-cleanup ordering, and event-on-transition are unchanged. Mapping
preserved: missing/empty file -> fail closed (NoSyncersConfigured); read/parse
error -> transient (Unknown/SyncerSourceError + requeue, reconcile not aborted).
Mounted as a directory volume (never subPath) from an optional ConfigMap so the
controller starts and fails closed when syncers aren't provisioned.

First slice of PLT-475 (env -> mounted app-config). The residual cluster-wide
configmaps write grant remains PLT-471's fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit f4cd247. Configure here.

// to drive a requeue; poll so the gate re-resolves once the API recovers.
if stateSyncTransient && result.IsZero() {
return ctrl.Result{RequeueAfter: statusPollInterval}, nil
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fail-closed syncers never requeue

High Severity

When state sync is enabled but the gate is False/NoSyncersConfigured, reconcile often returns with no RequeueAfter. Only Running nodes get the 30s poll; pending or initializing fail-closed nodes do not. GitOps updates to the mounted syncer file then never re-run reconcileStateSyncGate, so nodes can stay blocked after syncers appear.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit f4cd247. Configure here.

Per owner review: make #397's mounted file the controller's generic
application-config file (SEI_CONTROLLER_CONFIG -> /etc/sei-controller/config.yaml)
with state-sync as one section, rather than a bespoke syncers file — so it's the
foundation PLT-475 extends, not a throwaway shape.

- platform.FileConfig{ StateSync: { Syncers map[chainID][]host:port } } typed,
  sibling to the env-sourced platform.Config (file-sourced vs env-sourced kept
  distinct; the config path is still an env var, so the env-driven convention holds).
- platform.Config.StateSyncSyncersFile -> ControllerConfigFile; canonicalSyncers
  reads cfg.StateSync.Syncers[chainID]. Conventional os.ReadFile + sigs.k8s.io/yaml
  into the typed struct, fresh per reconcile for the dynamic syncer section.
- Mount renamed: ConfigMap sei-controller-config, directory mount at
  /etc/sei-controller (not subPath), readOnly, optional.

Fail-closed/transient semantics unchanged. First slice of PLT-475.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant