feat(health-checks): emit alert events on issue lifecycle transitions#59984
feat(health-checks): emit alert events on issue lifecycle transitions#59984rafaeelaudibert wants to merge 1 commit into
Conversation
Add \$health_check_issue_firing / \$health_check_issue_resolved internal events, emitted from the Temporal orchestrator for each row returned by bulk_upsert (newly active) and bulk_resolve (newly resolved). Event properties carry a uniform envelope — kind, severity, issue_id, title, summary, link, payload — so HogFunction destination templates can render any health check's alert without per-kind branching. Add an AlertContent dataclass and a HealthCheck.render_alert classmethod that concrete checks override to produce user-facing title/summary/link strings. The default falls back to a generic kind+severity message linking to /health. Override render_alert on SdkOutdatedCheck to name the affected SDK and link to /health/sdk-doctor. Per-issue emission failures (render or Kafka) are swallowed and reported via capture_exception so one bad row cannot break the batch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. |
There was a problem hiding this comment.
Additive feature with robust exception handling — all alert emission failures are swallowed and reported to Sentry, so the health-check orchestrator batch cannot be broken. Code is clean, well-tested (happy path plus four failure scenarios), and makes no data model or API contract changes.
Prompt To Fix All With AIFix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
posthog/temporal/health_checks/alerts.py:24-28
`_check_class_for_kind` reads `_DETECT_FNS` directly without first calling `ensure_registry_loaded()`. In the current primary code path (`_process_batch_detection` → `emit_health_check_alert`) the registry is always already loaded, so this is safe. But if `emit_health_check_alert` is ever called from a context that hasn't gone through the orchestrator, `_DETECT_FNS.get(kind)` will silently return `None` and fall back to generic alert content — the check-specific `render_alert` override is skipped with no warning.
```suggestion
def _check_class_for_kind(kind: str) -> type[HealthCheck] | None:
# The registry only stores instance-bound detect callables, but every
# HealthCheck subclass binds `cls()` so the underlying class is reachable
# via the bound method's `__self__`.
from posthog.temporal.health_checks.registry import ensure_registry_loaded
ensure_registry_loaded()
fn = _DETECT_FNS.get(kind)
```
### Issue 2 of 2
posthog/temporal/health_checks/tests/test_alert_emission.py:65-77
`test_produce_failure_is_swallowed_and_captured` and `test_render_failure_is_swallowed_and_captured` are structurally identical — both patch `_check_class_for_kind`, make either `produce_internal_event` or `render_alert` raise, then assert `fired is False` and `capture_exception` was called once. Per the team's convention, these should be a single parameterized (sub)test that varies the failing patch target.
Reviews (1): Last reviewed commit: "feat(health-checks): emit alert events o..." | Re-trigger Greptile |
| def _check_class_for_kind(kind: str) -> type[HealthCheck] | None: | ||
| # The registry only stores instance-bound detect callables, but every | ||
| # HealthCheck subclass binds `cls()` so the underlying class is reachable | ||
| # via the bound method's `__self__`. | ||
| fn = _DETECT_FNS.get(kind) |
There was a problem hiding this comment.
_check_class_for_kind reads _DETECT_FNS directly without first calling ensure_registry_loaded(). In the current primary code path (_process_batch_detection → emit_health_check_alert) the registry is always already loaded, so this is safe. But if emit_health_check_alert is ever called from a context that hasn't gone through the orchestrator, _DETECT_FNS.get(kind) will silently return None and fall back to generic alert content — the check-specific render_alert override is skipped with no warning.
| def _check_class_for_kind(kind: str) -> type[HealthCheck] | None: | |
| # The registry only stores instance-bound detect callables, but every | |
| # HealthCheck subclass binds `cls()` so the underlying class is reachable | |
| # via the bound method's `__self__`. | |
| fn = _DETECT_FNS.get(kind) | |
| def _check_class_for_kind(kind: str) -> type[HealthCheck] | None: | |
| # The registry only stores instance-bound detect callables, but every | |
| # HealthCheck subclass binds `cls()` so the underlying class is reachable | |
| # via the bound method's `__self__`. | |
| from posthog.temporal.health_checks.registry import ensure_registry_loaded | |
| ensure_registry_loaded() | |
| fn = _DETECT_FNS.get(kind) |
Prompt To Fix With AI
This is a comment left during a code review.
Path: posthog/temporal/health_checks/alerts.py
Line: 24-28
Comment:
`_check_class_for_kind` reads `_DETECT_FNS` directly without first calling `ensure_registry_loaded()`. In the current primary code path (`_process_batch_detection` → `emit_health_check_alert`) the registry is always already loaded, so this is safe. But if `emit_health_check_alert` is ever called from a context that hasn't gone through the orchestrator, `_DETECT_FNS.get(kind)` will silently return `None` and fall back to generic alert content — the check-specific `render_alert` override is skipped with no warning.
```suggestion
def _check_class_for_kind(kind: str) -> type[HealthCheck] | None:
# The registry only stores instance-bound detect callables, but every
# HealthCheck subclass binds `cls()` so the underlying class is reachable
# via the bound method's `__self__`.
from posthog.temporal.health_checks.registry import ensure_registry_loaded
ensure_registry_loaded()
fn = _DETECT_FNS.get(kind)
```
How can I resolve this? If you propose a fix, please make it concise.| def test_produce_failure_is_swallowed_and_captured(self, _produce, _lookup, capture): | ||
| fired = emit_health_check_alert(_make_issue(), status="firing") | ||
| self.assertFalse(fired) | ||
| capture.assert_called_once() | ||
|
|
||
| @patch("posthog.temporal.health_checks.alerts.capture_exception") | ||
| @patch("posthog.temporal.health_checks.alerts._check_class_for_kind", return_value=_BadCheck) | ||
| @patch("posthog.temporal.health_checks.alerts.produce_internal_event") | ||
| def test_render_failure_is_swallowed_and_captured(self, produce, _lookup, capture): | ||
| fired = emit_health_check_alert(_make_issue(), status="firing") | ||
| self.assertFalse(fired) | ||
| produce.assert_not_called() | ||
| capture.assert_called_once() |
There was a problem hiding this comment.
test_produce_failure_is_swallowed_and_captured and test_render_failure_is_swallowed_and_captured are structurally identical — both patch _check_class_for_kind, make either produce_internal_event or render_alert raise, then assert fired is False and capture_exception was called once. Per the team's convention, these should be a single parameterized (sub)test that varies the failing patch target.
Prompt To Fix With AI
This is a comment left during a code review.
Path: posthog/temporal/health_checks/tests/test_alert_emission.py
Line: 65-77
Comment:
`test_produce_failure_is_swallowed_and_captured` and `test_render_failure_is_swallowed_and_captured` are structurally identical — both patch `_check_class_for_kind`, make either `produce_internal_event` or `render_alert` raise, then assert `fired is False` and `capture_exception` was called once. Per the team's convention, these should be a single parameterized (sub)test that varies the failing patch target.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
ClickHouse migration SQL per cloud environmentNo ClickHouse migrations changed in this PR. |

Problem
PostHog has ~10 concrete health checks under
posthog/temporal/health_checks/(SDK outdated, ingestion warnings, external data failures, web-vitals gaps, no-pageleave, scroll depth, authorized URLs, reverse proxy, materialized view failures, …). The original SDK Doctor alerting PR (#58211) wired Slack/Discord/Teams/email/webhook destinations to one of those kinds via a bespoke event name ($sdk_doctor_alert_firing), a bespokeHogFunctionConfigurationContextId, a bespoke per-team Redis cooldown, and ~120 lines of hand-authored destination templates insub-templates.ts. Repeating that pattern per kind would bloatsub-templates.tsto ~2000 lines and require ~10 copies of the emission glue.This PR introduces a single emission path that fires for every health check, with a uniform event envelope. Concrete checks override one
render_alertclassmethod to provide their user-facing copy; nothing else per-kind is needed.Changes
Framework —
posthog/temporal/health_checks/framework.py:AlertContentfrozen dataclass:title,summary,link(relative path; templates concatenate onto{project.url})HealthCheck.render_alert(issue) -> AlertContentclassmethod with a generic default that usescls.name/cls.kind/issue.severityand links to/healthEmitter — new
posthog/temporal/health_checks/alerts.py:Two internal event names —
$health_check_issue_firingand$health_check_issue_resolved— produced viaproduce_internal_eventwith a uniform envelope:Per-issue
try/exceptmirroringproducts/logs/backend/temporal/activities.pyso one bad render or one Kafka error never breaks the batch. Failures are captured viaposthog.exceptions_capture.capture_exception.Orchestrator —
posthog/temporal/health_checks/processing.py:Iterates the newly-active and newly-resolved lists returned by the previous PR (#59983) and calls
emit_health_check_alert(issue, status=...)per row.SDK Doctor —
products/growth/backend/temporal/health_checks/sdk_outdated.py:SdkOutdatedCheck.render_alertreturns a human-readable summary keyed offissue.payload['sdk_name']andlatest_version, withlink="/health/sdk-doctor".How did you test this code?
I'm Claude (Opus 4.7) — automated checks only.
posthog/temporal/health_checks/tests/test_alert_emission.pycovering:render_alertexceptions are swallowed and capturedruff/ formatter /ty checkpass via lint-staged pre-commitPublish to changelog?
No — wait for the user-facing pieces (#59985, #59986).
Docs update
None needed yet — user-facing docs land with the rollout PR.
🤖 Agent context
Authored via Claude Code (Opus 4.7) in collaboration with @rafaeelaudibert. Originated from a review of #58211; plan refined remotely via Ultraplan.
Decisions worth flagging for reviewers:
ACTIVE, no re-emission. Onlyto_create(brand-new + reactivated) andbulk_resolveoutputs fire. This subsumes the 7-day per-team Redis cooldown the original PR added inproducts/growth/backend/sdk_doctor_alerts.py. That file gets removed in the SDK-Doctor migration further up the stack._DETECT_FNS[kind].__self__. The registry stores the bounddetectmethod per kind; the unbound class is reachable via the method's__self__. No new registry surface area was needed.This PR stacks on #59983 and is followed by #59985 and #59986.