Skip to content

Scale-metrics observability follow-up: Grafana dashboard JSON + alertmanager wiring (US-27.15 backlog) #196

Description

@mkreyman

Backlog follow-up (user-signed-off scope boundary for US-27.15)

US-27.15 shipped the durable signal layer: three Telemetry.Metrics (db_statement_timeout counter, heavy-read latency distribution, recall under-fill counter), a Prometheus reporter on the internal :9568/metrics port (behind the fault-isolating Loopctl.Telemetry.MetricsReporter wrapper) scraped by Fly's managed Prometheus, bounded cardinality, and the actual firing alert path.

Correction (R2): the firing path is the loopctl ScaleAlerts webhook, NOT PromQL on fly-metrics.net

fly-metrics.net's managed Grafana has alerting DISABLED (per Fly's docs: "Fly.io doesn't include built-in alerting on metrics, so you'll need to set up alerting yourself against the Prometheus endpoint."). So the PromQL expressions in the runbook visualize a trend but do not fire on fly-metrics.net — they are query bodies, not a firing path.

The firing path that ships (AC-27.15.2) is Loopctl.Telemetry.ScaleAlerts: a supervised, loopctl-owned threshold checker that windows the three scale signals via atomic ETS counters (no per-event GenServer call), evaluates them on a timer, debounces (edge-triggered, re-arming), and on breach POSTs an id-only JSON alert to an operator-configured webhook (SCALE_ALERT_WEBHOOK_URL) via the same :webhook_delivery DI the webhook worker uses. No external Grafana/Alertmanager is required to get an alert that fires.

Already shipped (do NOT redo)

  • The three metrics + the Prometheus reporter (wrapped for fault isolation) + the fly.toml [metrics] scrape + bounded cardinality gate.
  • The firing alert path: Loopctl.Telemetry.ScaleAlerts — windowed thresholds, edge-triggered debounce, id-only webhook POST, opt-in via SCALE_ALERT_WEBHOOK_URL. Config knobs + payload shape documented in the runbook.

Tracked for later (out of scope for US-27.15 — OPTIONAL self-hosted)

  • Curated Grafana dashboard JSON (panels for the 3 metrics + the US-27.11 pool SLOs) checked into the repo — for an operator running their own Grafana (fly-metrics.net cannot install these).
  • Alertmanager routes/receivers (paging) wired to the PromQL query bodies in the runbook, for a self-hosted Prometheus/Alertmanager stack. (This is an ALTERNATIVE to the shipped ScaleAlerts webhook, not a prerequisite for alerting.)
  • Optional: a log-shipper for >7-day per-tenant retention when the tenant_id counter label is gated off above the cap.

Reference: docs/runbooks/knowledge-scale.md -> "Metrics & alerting (US-27.15)" -> "The firing alert path: Loopctl.Telemetry.ScaleAlerts".

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions