deploy to prod #875

Merged
JoaquinBN merged 14 commits into main from dev
Jul 2, 2026
Conversation

@JoaquinBN

Collaborator

No description provided.

rasca and others added 14 commits July 1, 2026 22:15
The version-shame window on the Wall of Shame is no longer hardcoded to
three days. It now reads a NODE_VERSION_SHAME_GRACE_DAYS setting (default
three days, env-overridable), so the grace period can be tuned per
environment without a code change. The version verdict logic also moves out
of the wallet viewset into a shared version_status helper — with an explicit
node_version override for callers that already know the running version — so
the same rule can be reused by the Grafana-driven sync added later in this
branch.

## Claude Implementation Notes
- backend/validators/version_status.py: New compute_version_status(wallet, target, now, node_version=...) — extracted from the viewset; grace from settings.NODE_VERSION_SHAME_GRACE_DAYS via default_grace_days(); node_version param lets the Grafana sync pass the observed version (compares via NodeVersionMixin._compare_versions, no operator required)
- backend/validators/views.py: _version_context now delegates to the helper; removed hardcoded VERSION_SHAME_GRACE_DAYS constant and unused timedelta import
- backend/tally/settings.py: Add NODE_VERSION_SHAME_GRACE_DAYS (env-overridable, default 3, applied globally at evaluation time)
- backend/validators/tests/test_version_status.py: Unit tests — parity + grace configurable via the setting
- backend/CLAUDE.md, CHANGELOG.md: Document the setting and the shared helper
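
The configurable grace window described above could look roughly like the following. This is a minimal sketch, not the code in `version_status.py`: it reads the environment directly instead of Django settings, `grace_expired` and its `target_since` parameter are hypothetical names, and treating the boundary as inclusive is an assumption.

```python
import os
from datetime import datetime, timedelta, timezone
from typing import Mapping, Optional


def default_grace_days(environ: Mapping[str, str] = os.environ) -> int:
    """Read the grace period at call time, falling back to three days."""
    return int(environ.get("NODE_VERSION_SHAME_GRACE_DAYS", "3"))


def grace_expired(
    target_since: datetime,  # hypothetical: when the target version became active
    now: datetime,
    grace_days: Optional[int] = None,
) -> bool:
    """True once a node still behind the target is past the grace window."""
    days = default_grace_days() if grace_days is None else grace_days
    # Assumption: the window closes exactly `days` after the target went active.
    return now >= target_since + timedelta(days=days)


if __name__ == "__main__":
    since = datetime(2026, 7, 1, tzinfo=timezone.utc)
    print(grace_expired(since, since + timedelta(days=2)))
```

Reading the value inside the function rather than at import time is what lets tests and environments override it without a code change.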
Every Grafana status sync now captures a per-run observation for each active
validator wallet (on-chain status, metrics, logs, and the node version read
from the Prometheus `version` label) and latches it into a per-day rollup on
the existing daily snapshot. Metrics and logs latch pessimistically — shamed
at any point means the day is shamed — while version latches optimistically:
once a node has upgraded during a day, an earlier stale reading cannot shame
it. Per-day sample counters record whether the node was seen reporting at all,
the building blocks for uptime-streak and days-in-shame reporting.

Version labels are normalised at ingest ('v' prefix stripped, capped to the
column length, and when a node briefly reports two series right after an
upgrade the higher parseable one wins), so bad node-reported data can never
corrupt or abort a whole network's history. The rollup is fully rebuildable
from the raw observation log, whose rows are retained forever by explicit
decision. History writing is best-effort and isolated from the live status
update, so a failure there never corrupts the Wall of Shame status. No points
or public API behaviour changes in this step.

## Claude Implementation Notes
- backend/validators/models.py: New ValidatorWalletObservation (append-only raw log); extend ValidatorWalletStatusSnapshot with metrics_status/logs_status/version_status, node_version, metrics_samples/logs_samples
- backend/validators/grafana_service.py: PromQL adds the `version` label; parse_response returns a 4-tuple with version_by_address (normalised via _normalize_version, capped to _VERSION_MAX_LENGTH, higher parseable version wins on duplicate series via _safe_parse); sync_network computes per-wallet version_status via compute_version_status; _record_history writes observations + latched rollup (worst-of-day _latch for metrics/logs, best-of-day _latch_version for version)
- backend/validators/management/commands/rebuild_daily_snapshots.py: New command to re-materialise rollups from observations; --days N cutoff snapped to the local-day boundary so the oldest day is never rebuilt from partial observations
- backend/validators/migrations/0015_*: New model + snapshot columns
- backend/validators/tests/test_grafana_service.py: version-label parse (normalised), duplicate-series keeps-higher, overlong-label truncation, observation/rollup writes, both latch directions, no-observations-on-failure, rebuild day-boundary regression
- backend/CLAUDE.md, CHANGELOG.md: Document the observation log, rollup columns, latch directions, rebuild command, and retain-forever decision
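
The two latch directions and the ingest normalisation can be illustrated as follows. This is a sketch under assumptions: the `"ok"`/`"shamed"` labels, the function names, and the 32-character cap are placeholders, since the real column values and `_VERSION_MAX_LENGTH` are not shown in this PR.

```python
from typing import Optional

# Hypothetical status labels ordered by severity.
SEVERITY = {"ok": 0, "shamed": 1}


def latch_worst(current: Optional[str], observed: str) -> str:
    """Pessimistic latch (metrics/logs): one shamed reading shames the day."""
    if current is None:
        return observed
    return max(current, observed, key=SEVERITY.__getitem__)


def latch_best(current: Optional[str], observed: str) -> str:
    """Optimistic latch (version): one good reading clears the day."""
    if current is None:
        return observed
    return min(current, observed, key=SEVERITY.__getitem__)


def normalize_version(label: str, max_length: int = 32) -> str:
    """Strip a leading 'v' and cap the label to the (assumed) column length."""
    label = label.strip()
    if label.startswith("v"):
        label = label[1:]
    return label[:max_length]
```

Because each latch only depends on the previous rollup value and the new observation, replaying the raw observation log in any order within a day yields the same rollup, which is what makes the rebuild command possible.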
The Wall of Shame now surfaces how long each validator has gone without being
shamed. Every wallet reports a consecutive clean-day streak and the reasons the
streak was last broken, and each operator gets a per-network streak using
any-node-clean roll-up: a network-day counts as clean if at least one of the
operator's nodes was healthy that day. Streaks are computed on read from the
daily observability rollup, so they cost one extra snapshot query per request
(the endpoint stays cached for 60 seconds) and start accumulating from deploy.

## Claude Implementation Notes
- backend/validators/streaks.py: New module. clean_streak(wallet_ids, now, index) walks the daily rollup backward counting consecutive clean days (any-node-clean over the given wallet ids); clean day = active + >=1 metrics & logs sample + no shame dim. A partial today never breaks the streak; broken_by only attributes a reason for observed days (edge-of-history returns []). load_snapshot_index prefetches the window in one query.
- backend/validators/views.py: wall_of_shame builds the snapshot index once and per-wallet streaks, passes them to the serializer via context and into _build_validator_groups; groups gain network_streaks (per-network any-node-clean) and each node entry gains clean_streak_days / clean_streak_broken_by
- backend/validators/serializers.py: WallOfShameSerializer adds clean_streak_days + clean_streak_broken_by (from context, no N+1)
- backend/validators/tests/test_streaks.py: streak counting, shame/gap/version breaks, unsynced-today, any-node-clean operator roll-up
- backend/validators/tests/test_grafana_service.py: endpoint exposes streak fields + network_streaks
- backend/CLAUDE.md, CHANGELOG.md: Document the streak fields
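
The backward walk with any-node-clean roll-up might be sketched like this. It is a simplified illustration of the behaviour as described in this commit, not the code in `streaks.py`: the index is assumed to be a plain dict keyed by `(wallet_id, date)` holding a precomputed clean flag, and `broken_by` attribution is left out.

```python
from datetime import date, timedelta
from typing import Dict, Iterable, Tuple


def clean_streak(
    wallet_ids: Iterable[int],
    today: date,
    index: Dict[Tuple[int, date], bool],  # assumed shape: (wallet_id, day) -> clean?
    max_days: int = 180,
) -> int:
    """Count consecutive clean days, walking backward from today."""
    wallets = list(wallet_ids)
    streak = 0
    for offset in range(max_days):
        day = today - timedelta(days=offset)
        # Any-node-clean: one healthy wallet makes the whole day clean.
        if any(index.get((wallet, day), False) for wallet in wallets):
            streak += 1
        elif offset == 0:
            continue  # a partial "today" never breaks the streak
        else:
            break  # a shamed or missing day ends the streak
    return streak
```

Passing a single wallet id gives the per-wallet streak; passing all of an operator's wallet ids on one network gives the per-network streak.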
Grafana becomes the single source of truth for validator node versions. The
status sync reads each node's reported version and: promotes the fleet's
highest stable release to the active upgrade target the first time it is seen
(ignoring pre-release and build-tagged versions), keeps each operator's
recorded version in step with what their nodes actually run, and awards the
node-upgrade contribution directly — with the existing sooner-is-better
bonus — the moment a visible operator reaches the target, with no manual
submission or steward review.

Detection covers every reporting node regardless of on-chain status, so a
quarantined validator that upgrades still records it and earns the award.
Versions that the packaging library cannot parse are excluded from comparisons instead of
aborting the run, and one operator's failure never blocks version updates or
awards for the rest. Because versions are now observed rather than
self-reported, the portal stops accepting manual edits: the profile shows the
detected version read-only, the two backend write paths are closed, and the
old save()-driven pending-submission flow is removed. Dedup on the shared
notes key guarantees nothing is ever awarded twice.

## Claude Implementation Notes
- backend/validators/grafana_service.py: new _sync_node_versions({address_lower: version}) — matches wallets of ANY on-chain status, filters to semver-valid AND PEP 440-parseable versions, auto-creates TargetNodeVersion from the highest stable observed (never blindly supersedes an unparseable active target), writes node_version_<network> via Validator.objects.update() (max across the operator's nodes), per-operator try/except fault isolation; _award_node_upgrade creates a direct approved Contribution (early-bonus 4/3/2/1, dedup on `version {v} [{network}]`, multiplier fallback via _allow_missing_multiplier)
- backend/validators/node_version.py: remove NodeVersionMixin.save() and _create_upgrade_submission (dead once the portal can't write versions); keep fields, validation, comparison helpers, calculate_early_upgrade_bonus
- backend/users/serializers.py: UserProfileUpdateSerializer drops the writable node_version fields and the custom update()
- backend/validators/views.py: ValidatorViewSet.my_profile is GET-only (PATCH → 405)
- frontend/src/routes/ProfileEdit.svelte: node version inputs replaced with read-only display ("Not detected yet" fallback, auto-detected hint); removed the related state/change-tracking/save logic
- backend/validators/tests/test_node_version_sync.py: auto-target stable-only guard, supersede/no-op cases, max-across-nodes, single award + dedup, invisible-operator no-award, quarantined-wallet award, PEP 440-invalid isolation, one-failing-operator isolation
- backend/validators/tests/test_node_version_tracking.py, test_api.py: drop save()-driven submission tests; /validators/me PATCH asserts read-only
- backend/CLAUDE.md, frontend/CLAUDE.md, CHANGELOG.md: document Grafana as source of truth and the read-only portal surface
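
Picking the highest stable release while ignoring pre-release, build-tagged, and unparseable versions could be sketched as below. The helper names are hypothetical, the sketch checks PEP 440 parseability only (the notes above also mention a semver check), and the network name in the usage line is a placeholder.

```python
from typing import Iterable, Optional

from packaging.version import InvalidVersion, Version


def highest_stable(observed: Iterable[str]) -> Optional[str]:
    """Return the highest stable version string seen, or None."""
    best = None
    for raw in observed:
        try:
            parsed = Version(raw)
        except InvalidVersion:
            continue  # unparseable input is skipped, never fatal
        # Skip pre-/dev-releases and build-tagged (local) versions.
        if parsed.is_prerelease or parsed.local is not None:
            continue
        if best is None or parsed > best[0]:
            best = (parsed, raw)
    return best[1] if best else None


def award_note(version: str, network: str) -> str:
    """Dedup key in the `version {v} [{network}]` format from the notes above."""
    return f"version {version} [{network}]"


if __name__ == "__main__":
    seen = ["0.4.0", "0.5.0-rc1", "0.4.1+build5", "not-a-version", "0.4.1"]
    print(highest_stable(seen), "|", award_note("0.4.1", "testnet"))
```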
Grafana's version label is self-reported by the node being judged and
rewarded, so the automatic flows no longer trust a single reporter. An
upgrade target is only auto-created when the new stable release is seen
on portal-known wallets of at least two distinct operators, and a
broadcast notification announces it so validators learn about the grace
period before they can be shamed. Versions from unknown Prometheus
series or banned wallets count for nothing, recorded operator versions
only move forward (a skipped scrape can't flash a downgrade onto the
Wall of Shame), and removing the node-upgrade multiplier now pauses the
auto-award entirely instead of awarding at 1.0.

The shame verdict never falls back to lexicographic version comparison:
unparseable versions read as version-unknown, and a parseable version
series always beats an unparseable duplicate regardless of frame order.
A sync run where a whole datasource comes back empty still updates the
self-healing live statuses but skips the permanent daily history latch,
so an infrastructure blackout can't shame every validator's recorded
day. Version detection also runs before the active-wallet early return,
so networks with zero active wallets still record versions and awards.

Uptime streaks now skip days with no monitoring data instead of breaking
on them. Days spent quarantined or inactive break the streak with an
explicit status reason, version-only rollups count as observed days, and
the maximum streak honors the 180-day window. Both snapshot writers share
one day-bucketing function, so the (wallet, date) key can never split.

## Claude Implementation Notes
- backend/validators/grafana_service.py: MIN_OPERATORS_FOR_AUTO_TARGET
  consensus guard + known/non-banned wallet restriction in
  _sync_node_versions; _broadcast_auto_target helper; monotonic
  node_version writes; award skipped on missing multiplier
  (_allow_missing_multiplier escape hatch removed); parseability gate
  for version_status in sync_network; datasource-blackout guard around
  _record_history; _sync_node_versions moved before the no-active-
  wallets early return; prefer-parseable rule in parse_response;
  defensive handlers log at exception level
- backend/validators/streaks.py: clean_streak skips unobserved days,
  breaks on non-active snapshot rows, range capped at max_days;
  _has_observation counts version_status; _shame_dims attributes
  'status' from the on-chain column even without Grafana data
- backend/validators/genlayer_validators_service.py: snapshot date uses
  timezone.localdate() to match the Grafana rollup bucketing
- backend/validators/management/commands/rebuild_daily_snapshots.py:
  --days 0 no longer means "all"; summary counts during iteration
  instead of a second full scan
- backend/validators/tests/: coverage for the consensus guard, unknown
  address rejection, banned exclusion, monotonic writes, multiplier
  kill switch, auto-target notification, blackout guard, unparseable
  version verdicts, zero-active-wallet version sync, streak skip/status
  semantics (80 tests, all passing)
- backend/CLAUDE.md, frontend/CLAUDE.md: docs updated to the new
  behavior; validators mutation contract corrected to staff-only;
  ProfileEdit.svelte filename fixed
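
The consensus guard and the forward-only version writes can be sketched as follows. This is an illustration under assumptions: the function names and dict shapes are hypothetical, and `release_key` is a simplified numeric stand-in for the PEP 440 comparison the real code uses.

```python
from typing import Dict, Optional, Tuple


def has_quorum(
    version: str,
    version_by_wallet: Dict[str, str],   # observed: wallet address -> version
    operator_by_wallet: Dict[str, str],  # portal-known, non-banned wallets only
    min_operators: int,
) -> bool:
    """True when enough distinct known operators report `version`."""
    operators = {
        operator_by_wallet[wallet]
        for wallet, seen in version_by_wallet.items()
        if seen == version and wallet in operator_by_wallet  # unknown series ignored
    }
    return len(operators) >= min_operators


def release_key(version: str) -> Tuple[int, ...]:
    """Simplified stand-in: compares plain dotted numeric versions only."""
    return tuple(int(part) for part in version.split("."))


def next_recorded(recorded: Optional[str], observed: str) -> str:
    """Monotonic write: the recorded version only ever moves forward."""
    if recorded is None:
        return observed
    return observed if release_key(observed) > release_key(recorded) else recorded
```

Counting operators rather than wallets means one operator running many nodes cannot satisfy the guard alone.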
Closes the still-open CodeRabbit findings: the version verdict shared by
the Wall of Shame and the Grafana sync now compares versions only via
PEP 440 parsing — an unparseable legacy or vendor-format version reads
as 'on' when it exactly equals the target string and 'unknown'
otherwise, never a lexicographic comparison that misorders versions.
The per-operator network-streak rollup pre-groups wallet ids once
instead of rescanning every operator-network pair for every group.

## Claude Implementation Notes
- backend/validators/version_status.py: safe_parse_version added (shared
  helper); compute_version_status verdicts via parsed comparison with
  exact-string-equality escape for vendor formats; 'unknown' when
  incomparable; NodeVersionMixin._compare_versions no longer used here
- backend/validators/grafana_service.py: imports safe_parse_version
  instead of a local duplicate; sync-loop parseability gate removed in
  favor of the hardened shared verdict; parse_response docstring
  documents the parseable-beats-unparseable rule
- backend/validators/views.py: wallet_ids_by_operator pre-grouping
  removes the O(groups x pairs) scan in _build_validator_groups
- backend/validators/tests/test_version_status.py: unparseable-version
  and vendor-format-equality verdict coverage
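
The hardened verdict can be pictured roughly as below. This is a sketch of the rule as described, not the repository code: `safe_parse_version` here is an assumed shape of the shared helper, the `"behind"` label is a placeholder for whatever non-`'on'` verdicts exist, treating a node ahead of the target as `'on'` is an assumption, and grace-period handling is omitted.

```python
from typing import Optional

from packaging.version import InvalidVersion, Version


def safe_parse_version(raw: Optional[str]) -> Optional[Version]:
    """Parse a PEP 440 version, returning None instead of raising."""
    try:
        return Version(raw)
    except (InvalidVersion, TypeError):
        return None


def version_verdict(node_version: Optional[str], target: str) -> str:
    """Return 'on', 'behind' (placeholder label), or 'unknown'."""
    if not node_version:
        return "unknown"
    if node_version == target:
        return "on"  # exact-string escape for vendor formats
    node, goal = safe_parse_version(node_version), safe_parse_version(target)
    if node is None or goal is None:
        return "unknown"  # never fall back to string comparison
    return "on" if node >= goal else "behind"


if __name__ == "__main__":
    # Lexicographically "0.10.0" < "0.9.0", which is the misordering avoided here.
    print(version_verdict("0.10.0", "0.9.0"))
```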
The node-upgrade award dedup now runs inside a transaction holding a
lock on the user row, so even the residual stale-lock-takeover window
can never double-award the same version. The minimum number of distinct
operators required before a new stable release auto-creates the
fleet-wide upgrade target is now a setting (default 2), tunable without
a code deploy like the shame grace period. The new snapshot and
observation model fields document their intent through help_text,
matching the model's existing convention.

## Claude Implementation Notes
- backend/validators/grafana_service.py: _award_node_upgrade wraps dedup
  check + create in transaction.atomic with select_for_update on the
  user row (no-op on SQLite, real lock on Postgres); module constant
  MIN_OPERATORS_FOR_AUTO_TARGET replaced by min_operators_for_auto_target()
  reading NODE_VERSION_MIN_OPERATORS_FOR_AUTO_TARGET at call time
- backend/tally/settings.py: NODE_VERSION_MIN_OPERATORS_FOR_AUTO_TARGET
  env-driven setting (default 2)
- backend/validators/models.py + migrations/0015: help_text on all new
  ValidatorWalletStatusSnapshot / ValidatorWalletObservation fields;
  migration regenerated in place (same name, help_text-only diff,
  makemigrations --check clean)
- backend/validators/tests/test_node_version_sync.py: threshold
  configurability test via override_settings
- backend/CLAUDE.md: env var documented; constant reference updated
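
Why the lock matters: a dedup check followed by a create is only safe when both happen while holding the same lock. The sketch below swaps in a per-user `threading.Lock` and an in-memory set as a stand-in for the `transaction.atomic` plus `select_for_update` row lock described above; `AwardLedger` and its methods are hypothetical names.

```python
import threading
from typing import Dict, Set, Tuple


class AwardLedger:
    """In-memory stand-in for the contribution table with per-user locks."""

    def __init__(self) -> None:
        self._registry_lock = threading.Lock()
        self._user_locks: Dict[int, threading.Lock] = {}
        self.awards: Set[Tuple[int, str]] = set()

    def _lock_for(self, user_id: int) -> threading.Lock:
        # Guard the lock registry itself so two threads share one user lock.
        with self._registry_lock:
            return self._user_locks.setdefault(user_id, threading.Lock())

    def award_once(self, user_id: int, note: str) -> bool:
        """Check-then-create under the user's lock; True only for the first caller."""
        with self._lock_for(user_id):
            if (user_id, note) in self.awards:
                return False
            self.awards.add((user_id, note))
            return True
```

With the check and the insert inside one critical section, concurrent sync runs for the same user serialise, and only the first one creates the award.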
Banned users are blocked from submitting contributions everywhere else,
so the Grafana version sync must not let them in through the back door:
wallets whose linked user is banned no longer count toward the
auto-target quorum, no longer get version write-backs, and can never
receive the direct node-upgrade award (the award gate also re-checks
the ban as a second layer).

## Claude Implementation Notes
- backend/validators/grafana_service.py: wallet query in
  _sync_node_versions filters operator__user__is_banned=False alongside
  the on-chain banned-wallet exclusion; award gate re-checks
  is_banned; docstrings updated
- backend/validators/tests/test_node_version_sync.py: banned-user test
  covering quorum, version write-back, and award paths
Per product decision, a single validator seen running a new stable
release is enough to auto-create the fleet-wide upgrade target; the
operator quorum setting now defaults to 1. The setting remains so the
bar can be raised without a deploy if version spoofing ever becomes a
concern. Known-wallet, banned-wallet, and banned-user restrictions are
unchanged.

## Claude Implementation Notes
- backend/tally/settings.py: NODE_VERSION_MIN_OPERATORS_FOR_AUTO_TARGET
  default 2 -> 1, comment explains the decision and when to raise it
- backend/validators/grafana_service.py: fallback default 1; docstrings
  and comments updated to match
- backend/validators/tests/test_node_version_sync.py: single-operator
  target creation is the default-path test again; quorum test now
  overrides the setting to 2 and covers both rejection and corroboration
- backend/CLAUDE.md: default documented as 1
…istory-streaks

Grafana-sourced shame history: uptime streaks + auto node-version
@JoaquinBN JoaquinBN merged commit d7461b3 into main Jul 2, 2026
5 checks passed
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

@JoaquinBN JoaquinBN deployed to cron-job July 3, 2026 05:25 — with GitHub Actions Active
3 participants