Backlog follow-up (user-signed-off scope boundary for US-27.15)
US-27.15 shipped the durable signal layer: three Telemetry.Metrics (db_statement_timeout counter, heavy-read latency distribution, recall under-fill counter), a Prometheus reporter on the internal :9568/metrics port (behind the fault-isolating Loopctl.Telemetry.MetricsReporter wrapper) scraped by Fly's managed Prometheus, bounded cardinality, and the actual firing alert path.
Correction (R2): the firing path is the loopctl ScaleAlerts webhook, NOT PromQL on fly-metrics.net
fly-metrics.net's managed Grafana has alerting DISABLED (per Fly's docs: "Fly.io doesn't include built-in alerting on metrics, so you'll need to set up alerting yourself against the Prometheus endpoint."). So the PromQL expressions in the runbook visualize a trend but do not fire on fly-metrics.net — they are query bodies, not a firing path.
The firing path that ships (AC-27.15.2) is Loopctl.Telemetry.ScaleAlerts: a supervised, loopctl-owned threshold checker that windows the three scale signals via atomic ETS counters (no per-event GenServer call), evaluates them on a timer, debounces (edge-triggered, re-arming), and on breach POSTs an id-only JSON alert to an operator-configured webhook (SCALE_ALERT_WEBHOOK_URL) via the same :webhook_delivery DI the webhook worker uses. No external Grafana/Alertmanager is required to get an alert that fires.
Already shipped (do NOT redo)
- The three metrics + the Prometheus reporter (wrapped for fault isolation) + the fly.toml
[metrics] scrape + bounded cardinality gate.
- The firing alert path:
Loopctl.Telemetry.ScaleAlerts — windowed thresholds, edge-triggered debounce, id-only webhook POST, opt-in via SCALE_ALERT_WEBHOOK_URL. Config knobs + payload shape documented in the runbook.
Tracked for later (out of scope for US-27.15 — OPTIONAL self-hosted)
- Curated Grafana dashboard JSON (panels for the 3 metrics + the US-27.11 pool SLOs) checked into the repo — for an operator running their own Grafana (fly-metrics.net cannot install these).
- Alertmanager routes/receivers (paging) wired to the PromQL query bodies in the runbook, for a self-hosted Prometheus/Alertmanager stack. (This is an ALTERNATIVE to the shipped ScaleAlerts webhook, not a prerequisite for alerting.)
- Optional: a log-shipper for >7-day per-tenant retention when the
tenant_id counter label is gated off above the cap.
Reference: docs/runbooks/knowledge-scale.md -> "Metrics & alerting (US-27.15)" -> "The firing alert path: Loopctl.Telemetry.ScaleAlerts".
Backlog follow-up (user-signed-off scope boundary for US-27.15)
US-27.15 shipped the durable signal layer: three
Telemetry.Metrics(db_statement_timeout counter, heavy-read latency distribution, recall under-fill counter), a Prometheus reporter on the internal:9568/metricsport (behind the fault-isolatingLoopctl.Telemetry.MetricsReporterwrapper) scraped by Fly's managed Prometheus, bounded cardinality, and the actual firing alert path.Correction (R2): the firing path is the loopctl ScaleAlerts webhook, NOT PromQL on fly-metrics.net
fly-metrics.net's managed Grafana has alerting DISABLED (per Fly's docs: "Fly.io doesn't include built-in alerting on metrics, so you'll need to set up alerting yourself against the Prometheus endpoint."). So the PromQL expressions in the runbook visualize a trend but do not fire on fly-metrics.net — they are query bodies, not a firing path.The firing path that ships (AC-27.15.2) is
Loopctl.Telemetry.ScaleAlerts: a supervised, loopctl-owned threshold checker that windows the three scale signals via atomic ETS counters (no per-event GenServer call), evaluates them on a timer, debounces (edge-triggered, re-arming), and on breach POSTs an id-only JSON alert to an operator-configured webhook (SCALE_ALERT_WEBHOOK_URL) via the same:webhook_deliveryDI the webhook worker uses. No external Grafana/Alertmanager is required to get an alert that fires.Already shipped (do NOT redo)
[metrics]scrape + bounded cardinality gate.Loopctl.Telemetry.ScaleAlerts— windowed thresholds, edge-triggered debounce, id-only webhook POST, opt-in viaSCALE_ALERT_WEBHOOK_URL. Config knobs + payload shape documented in the runbook.Tracked for later (out of scope for US-27.15 — OPTIONAL self-hosted)
tenant_idcounter label is gated off above the cap.Reference:
docs/runbooks/knowledge-scale.md-> "Metrics & alerting (US-27.15)" -> "The firing alert path:Loopctl.Telemetry.ScaleAlerts".