feat(statesync): canonical-syncer ConfigMap + fail-closed StateSyncReady gate (PLT-452)#397
feat(statesync): canonical-syncer ConfigMap + fail-closed StateSyncReady gate (PLT-452)#397bdchatham wants to merge 7 commits into
Conversation
PR SummaryHigh Risk Overview A new The manager deployment mounts an optional Reviewed by Cursor Bugbot for commit ab26a2b. Bugbot is set up for automated code reviews on this repo. Configure here. |
…ady gate (PLT-452) Replace the label-derived ResolvedRPCWitnesses witness feed with a controller- level canonical-syncer ConfigMap (keyed by chainID) as the state-sync trust root. configureStateSyncTask now sources rpc_servers from Status.ResolvedStateSyncers. Add a fail-closed StateSyncReady condition + gate (between reconcilePeers and ResolvePlan): when state-sync is enabled and <2 canonical syncers are configured (missing ConfigMap / unset ref / no chain entry all count), the controller sets StateSyncReady=False/NoSyncersConfigured, emits a Warning Event, and builds no state-sync plan — never a witness-less or untrusted-fallback sync. An active plan is left to finish (no yank on a transient ConfigMap blip). Disabled nodes report False/NotApplicable (always-present per condition discipline). The condition + ResolvedStateSyncers flush through the existing single optimistic- lock status patch. The controller's ≥2 is a CONFIGURED-COUNT check; the ≥2 byte-for-byte witness agreement (kill duplicate-to-fill) is the sidecar's job, landing with the seictl trust-hardening bump. Status.ResolvedRPCWitnesses is left present-but-unwritten (deprecated; CRD-field removal is a one-way door, deferred to the version bump). Read-only RBAC on the trust-root ConfigMap is not yet enforced (pre-existing cluster-wide configmaps grant) — tracked as a PLT-452 launch-gate blocker in PLT-471. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…turns StateSyncReady is documented always-present, but Reconcile returned early on spec.Paused and PhaseFailed before the gate seeded it, so a paused- or failed-from-creation node never got the condition (Cursor Bugbot). Split the gate: reconcileStateSyncGate (resolution — sets the always-present condition) now runs once before the early-returns, so the condition rides the existing flush on every path; enforceStateSyncGate (fail-closed stop/event) runs only on the active path. Single optimistic-lock patch preserved; gocyclo held under 30 by extracting emitPhaseTransition. Adds a paused-node always-present test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ration Per owner direction, drop the planned seictl ≥2-agreement/equivocation hardening as over-engineering. The controller already solves the real problem: state-sync witnesses come from a curated canonical-syncer ConfigMap (seeded per cluster) instead of label-matched peers, decoupling peers from state-sync and removing the sharp edge of using fine-as-peers-but-invalid-as-witnesses nodes. The sidecar keeps its existing trust-pinning behavior; reliability comes from curating the syncer set, not cross-witness checking. Comment-only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
31f19b9 to
66b7712
Compare
The new statesync_test.go tripped goconst on the main-targeted lint (default, sei:latest, a:26657/b:26657, sei-test). Hoist them to test consts; the rest of the package keeps its existing literals (unchanged, grandfathered). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…iew) Move the fail-closed state-sync gate from a reconcile short-circuit to a plan-construction check, fixing three Cursor findings: - HIGH: terminal plans now always clear — handleTerminalPlan runs before the block check, so a fail-closed node no longer strands Complete/Failed plans. - MED: a transient canonical-syncer ConfigMap read error is non-fatal — it sets StateSyncReady=Unknown/ConfigMapReadError, clears ResolvedStateSyncers, and requeues, instead of aborting StatefulSet/Paused/Failed/flush handling. - MED: StateSyncBlocked fires once per transition, not every requeue. enforceStateSyncGate is deleted; stateSyncBlocksPlan declines a state-sync plan when StateSyncReady != True (init path only — running update plans carry no state-sync task). Fail-closed correctness is preserved: never builds with <2 or stale/unresolved syncers. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| StateSyncSyncersConfigMap string | ||
|
|
||
| // StateSyncSyncersNamespace is the namespace of StateSyncSyncersConfigMap. | ||
| // Required when StateSyncSyncersConfigMap is set (Validate enforces the pair). | ||
| StateSyncSyncersNamespace string |
There was a problem hiding this comment.
Instead of making these parameters config map, can you not just mount a configuration file like application config that supports state syncers? Honestly probably makes sense to create a linear ticket to migrate existing environment variables to this configuration file
…t the API Per owner review: replace the live-API-read canonical-syncer ConfigMap with a read-only mounted file (YAML chainID -> [host:port]). Removes the live K8s API dependency and the configmaps read-RBAC from the state-sync path, and closes the controller-binary in-process write path to the trust root (the read-only mount physically enforces what the RBAC marker only documented). Source swap only — the fail-closed StateSyncReady gate, plan-construction gating, terminal-cleanup ordering, and event-on-transition are unchanged. Mapping preserved: missing/empty file -> fail closed (NoSyncersConfigured); read/parse error -> transient (Unknown/SyncerSourceError + requeue, reconcile not aborted). Mounted as a directory volume (never subPath) from an optional ConfigMap so the controller starts and fails closed when syncers aren't provisioned. First slice of PLT-475 (env -> mounted app-config). The residual cluster-wide configmaps write grant remains PLT-471's fix. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit f4cd247. Configure here.
| // to drive a requeue; poll so the gate re-resolves once the API recovers. | ||
| if stateSyncTransient && result.IsZero() { | ||
| return ctrl.Result{RequeueAfter: statusPollInterval}, nil | ||
| } |
There was a problem hiding this comment.
Fail-closed syncers never requeue
High Severity
When state sync is enabled but the gate is False/NoSyncersConfigured, reconcile often returns with no RequeueAfter. Only Running nodes get the 30s poll; pending or initializing fail-closed nodes do not. GitOps updates to the mounted syncer file then never re-run reconcileStateSyncGate, so nodes can stay blocked after syncers appear.
Reviewed by Cursor Bugbot for commit f4cd247. Configure here.
Per owner review: make #397's mounted file the controller's generic application-config file (SEI_CONTROLLER_CONFIG -> /etc/sei-controller/config.yaml) with state-sync as one section, rather than a bespoke syncers file — so it's the foundation PLT-475 extends, not a throwaway shape. - platform.FileConfig{ StateSync: { Syncers map[chainID][]host:port } } typed, sibling to the env-sourced platform.Config (file-sourced vs env-sourced kept distinct; the config path is still an env var, so the env-driven convention holds). - platform.Config.StateSyncSyncersFile -> ControllerConfigFile; canonicalSyncers reads cfg.StateSync.Syncers[chainID]. Conventional os.ReadFile + sigs.k8s.io/yaml into the typed struct, fresh per reconcile for the dynamic syncer section. - Mount renamed: ConfigMap sei-controller-config, directory mount at /etc/sei-controller (not subPath), readOnly, optional. Fail-closed/transient semantics unchanged. First slice of PLT-475. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>


The state-sync half of PLT-452 (controller side). Stacked on #396 (peering) — base is
plt452/peering-consolidation; retarget tomainonce #396 merges. Trust model is signed off (owner-accepted posture; see the PLT-452 sign-off comment).What changed
data[chainID]= barehost:portsyncer RPC endpoints) replaces the label-derivedStatus.ResolvedRPCWitnesseswitness feed.configureStateSyncTasknow sourcesrpc_serversfromStatus.ResolvedStateSyncers. New optionalSEI_STATESYNC_SYNCERS_CONFIGMAP/SEI_STATESYNC_SYNCERS_NAMESPACEplatform config.StateSyncReadygate (betweenreconcilePeersandResolvePlan): enabled + <2 configured syncers (missing ConfigMap / unset ref / no chain entry) →False/NoSyncersConfigured+ Warning Event + no state-sync plan; disabled →False/NotApplicable; enabled + ≥2 →True/Ready. Always-present condition, stable reason enum,ObservedGeneration, single optimistic-lock patch. An active plan is left to finish (no yank on a transient blip).Result.Witnesses/peerRPCAddressininternal/peering);Status.ResolvedRPCWitnessesleft present-but-unwritten (deprecated — CRD-field removal is a one-way door, deferred to the version bump).Scope boundary
Cross-review (4 experts, signed-off trust model)
configmapswrite lets it rewrite its own trust root — violating signed-off precondition Flux/dev deployment key; why does it exist on this repo #1 (read-only on the trust root). Per owner decision: ship the working controller half now, fix the RBAC boundary as a fast-follow tracked in PLT-471 (a PLT-452 launch-gate blocker, landing with the namespace lockdown that decides the trust-root namespace). The in-repo read-only marker is documentary-only and now says so.Follow-ups
🤖 Generated with Claude Code