Scale-metrics observability follow-up: Grafana dashboard JSON + alertmanager wiring (US-27.15 backlog)

## Backlog follow-up (user-signed-off scope boundary for US-27.15)

US-27.15 shipped the durable signal layer: three `Telemetry.Metrics` (db_statement_timeout counter, heavy-read latency distribution, recall under-fill counter), a Prometheus reporter on the internal `:9568/metrics` port (behind the fault-isolating `Loopctl.Telemetry.MetricsReporter` wrapper) scraped by Fly's managed Prometheus, bounded cardinality, **and the actual firing alert path**.

### Correction (R2): the firing path is the loopctl ScaleAlerts webhook, NOT PromQL on fly-metrics.net

`fly-metrics.net`'s managed Grafana has **alerting DISABLED** (per Fly's docs: *"Fly.io doesn't include built-in alerting on metrics, so you'll need to set up alerting yourself against the Prometheus endpoint."*). So the PromQL expressions in the runbook **visualize** a trend but **do not fire** on fly-metrics.net — they are **query bodies**, not a firing path.

The **firing** path that ships (AC-27.15.2) is **`Loopctl.Telemetry.ScaleAlerts`**: a supervised, loopctl-owned threshold checker that windows the three scale signals via atomic ETS counters (no per-event GenServer call), evaluates them on a timer, debounces (edge-triggered, re-arming), and on breach POSTs an id-only JSON alert to an operator-configured webhook (`SCALE_ALERT_WEBHOOK_URL`) via the same `:webhook_delivery` DI the webhook worker uses. No external Grafana/Alertmanager is required to get an alert that fires.

### Already shipped (do NOT redo)
- The three metrics + the Prometheus reporter (wrapped for fault isolation) + the fly.toml `[metrics]` scrape + bounded cardinality gate.
- **The firing alert path: `Loopctl.Telemetry.ScaleAlerts`** — windowed thresholds, edge-triggered debounce, id-only webhook POST, opt-in via `SCALE_ALERT_WEBHOOK_URL`. Config knobs + payload shape documented in the runbook.

### Tracked for later (out of scope for US-27.15 — OPTIONAL self-hosted)
- Curated **Grafana dashboard JSON** (panels for the 3 metrics + the US-27.11 pool SLOs) checked into the repo — for an operator running **their own** Grafana (fly-metrics.net cannot install these).
- **Alertmanager** routes/receivers (paging) wired to the PromQL **query bodies** in the runbook, for a self-hosted Prometheus/Alertmanager stack. (This is an ALTERNATIVE to the shipped ScaleAlerts webhook, not a prerequisite for alerting.)
- Optional: a log-shipper for >7-day per-tenant retention when the `tenant_id` counter label is gated off above the cap.

Reference: `docs/runbooks/knowledge-scale.md` -> "Metrics & alerting (US-27.15)" -> "The firing alert path: `Loopctl.Telemetry.ScaleAlerts`".


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scale-metrics observability follow-up: Grafana dashboard JSON + alertmanager wiring (US-27.15 backlog) #196

Backlog follow-up (user-signed-off scope boundary for US-27.15)

Correction (R2): the firing path is the loopctl ScaleAlerts webhook, NOT PromQL on fly-metrics.net

Already shipped (do NOT redo)

Tracked for later (out of scope for US-27.15 — OPTIONAL self-hosted)

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Scale-metrics observability follow-up: Grafana dashboard JSON + alertmanager wiring (US-27.15 backlog) #196

Description

Backlog follow-up (user-signed-off scope boundary for US-27.15)

Correction (R2): the firing path is the loopctl ScaleAlerts webhook, NOT PromQL on fly-metrics.net

Already shipped (do NOT redo)

Tracked for later (out of scope for US-27.15 — OPTIONAL self-hosted)

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions