feat(health-checks): emit alert events on issue lifecycle transitions by rafaeelaudibert · Pull Request #59984 · PostHog/posthog

rafaeelaudibert · 2026-05-26T01:25:15Z

Problem

PostHog has ~10 concrete health checks under posthog/temporal/health_checks/ (SDK outdated, ingestion warnings, external data failures, web-vitals gaps, no-pageleave, scroll depth, authorized URLs, reverse proxy, materialized view failures, …). The original SDK Doctor alerting PR (#58211) wired Slack/Discord/Teams/email/webhook destinations to one of those kinds via a bespoke event name ($sdk_doctor_alert_firing), a bespoke HogFunctionConfigurationContextId, a bespoke per-team Redis cooldown, and ~120 lines of hand-authored destination templates in sub-templates.ts. Repeating that pattern per kind would bloat sub-templates.ts to ~2000 lines and require ~10 copies of the emission glue.

This PR introduces a single emission path that fires for every health check, with a uniform event envelope. Concrete checks override one render_alert classmethod to provide their user-facing copy; nothing else per-kind is needed.

Changes

Framework — posthog/temporal/health_checks/framework.py:

New AlertContent frozen dataclass: title, summary, link (relative path; templates concatenate onto {project.url})
New HealthCheck.render_alert(issue) -> AlertContent classmethod with a generic default that uses cls.name / cls.kind / issue.severity and links to /health

Emitter — new posthog/temporal/health_checks/alerts.py:

Two internal event names — $health_check_issue_firing and $health_check_issue_resolved — produced via produce_internal_event with a uniform envelope:

kind, severity, issue_id, title, summary, link, payload

Per-issue try/except mirroring products/logs/backend/temporal/activities.py so one bad render or one Kafka error never breaks the batch. Failures are captured via posthog.exceptions_capture.capture_exception.

Orchestrator — posthog/temporal/health_checks/processing.py:

Iterates the newly-active and newly-resolved lists returned by the previous PR (#59983) and calls emit_health_check_alert(issue, status=...) per row.

SDK Doctor — products/growth/backend/temporal/health_checks/sdk_outdated.py:

SdkOutdatedCheck.render_alert returns a human-readable summary keyed off issue.payload['sdk_name'] and latest_version, with link="/health/sdk-doctor".

How did you test this code?

I'm Claude (Opus 4.7) — automated checks only.

5 new tests in posthog/temporal/health_checks/tests/test_alert_emission.py covering:
- Firing emits with the full envelope
- Resolved emits the resolved event name
- Unregistered kinds fall back to the generic content
- Kafka failures are swallowed and captured
- render_alert exceptions are swallowed and captured
ruff / formatter / ty check pass via lint-staged pre-commit
Did not run end-to-end against a real Kafka / temporal worker locally

Publish to changelog?

No — wait for the user-facing pieces (#59985, #59986).

Docs update

None needed yet — user-facing docs land with the rollout PR.

🤖 Agent context

Authored via Claude Code (Opus 4.7) in collaboration with @rafaeelaudibert. Originated from a review of #58211; plan refined remotely via Ultraplan.

Decisions worth flagging for reviewers:

Cooldown via lifecycle, not Redis. Built on top of #59983's delta returns: while an issue stays ACTIVE, no re-emission. Only to_create (brand-new + reactivated) and bulk_resolve outputs fire. This subsumes the 7-day per-team Redis cooldown the original PR added in products/growth/backend/sdk_doctor_alerts.py. That file gets removed in the SDK-Doctor migration further up the stack.
Symmetric firing/resolved events. Logs alerting also does this; symmetric semantics let users get a "recovered" notification without extra plumbing.
Registry lookup via _DETECT_FNS[kind].__self__. The registry stores the bound detect method per kind; the unbound class is reachable via the method's __self__. No new registry surface area was needed.
Existing precedents kept intact. Error tracking and logs alerting still use their own event names and sub-templates; they remain sibling alert families. No refactor of those is included here.

This PR stacks on #59983 and is followed by #59985 and #59986.

Add \$health_check_issue_firing / \$health_check_issue_resolved internal events, emitted from the Temporal orchestrator for each row returned by bulk_upsert (newly active) and bulk_resolve (newly resolved). Event properties carry a uniform envelope — kind, severity, issue_id, title, summary, link, payload — so HogFunction destination templates can render any health check's alert without per-kind branching. Add an AlertContent dataclass and a HealthCheck.render_alert classmethod that concrete checks override to produce user-facing title/summary/link strings. The default falls back to a generic kind+severity message linking to /health. Override render_alert on SdkOutdatedCheck to name the affected SDK and link to /health/sdk-doctor. Per-issue emission failures (render or Kafka) are swallowed and reported via capture_exception so one bad row cannot break the batch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rafaeelaudibert · 2026-05-26T01:25:36Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

github-actions

Additive feature with robust exception handling — all alert emission failures are swallowed and reported to Sentry, so the health-check orchestrator batch cannot be broken. Code is clean, well-tested (happy path plus four failure scenarios), and makes no data model or API contract changes.

greptile-apps · 2026-05-26T01:29:03Z

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
posthog/temporal/health_checks/alerts.py:24-28
`_check_class_for_kind` reads `_DETECT_FNS` directly without first calling `ensure_registry_loaded()`. In the current primary code path (`_process_batch_detection` → `emit_health_check_alert`) the registry is always already loaded, so this is safe. But if `emit_health_check_alert` is ever called from a context that hasn't gone through the orchestrator, `_DETECT_FNS.get(kind)` will silently return `None` and fall back to generic alert content — the check-specific `render_alert` override is skipped with no warning.

```suggestion
def _check_class_for_kind(kind: str) -> type[HealthCheck] | None:
    # The registry only stores instance-bound detect callables, but every
    # HealthCheck subclass binds `cls()` so the underlying class is reachable
    # via the bound method's `__self__`.
    from posthog.temporal.health_checks.registry import ensure_registry_loaded

    ensure_registry_loaded()
    fn = _DETECT_FNS.get(kind)
```

### Issue 2 of 2
posthog/temporal/health_checks/tests/test_alert_emission.py:65-77
`test_produce_failure_is_swallowed_and_captured` and `test_render_failure_is_swallowed_and_captured` are structurally identical — both patch `_check_class_for_kind`, make either `produce_internal_event` or `render_alert` raise, then assert `fired is False` and `capture_exception` was called once. Per the team's convention, these should be a single parameterized (sub)test that varies the failing patch target.

_{Reviews (1): Last reviewed commit: "feat(health-checks): emit alert events o..." | Re-trigger Greptile}

greptile-apps · 2026-05-26T01:29:07Z

+def _check_class_for_kind(kind: str) -> type[HealthCheck] | None:
+    # The registry only stores instance-bound detect callables, but every
+    # HealthCheck subclass binds `cls()` so the underlying class is reachable
+    # via the bound method's `__self__`.
+    fn = _DETECT_FNS.get(kind)


_check_class_for_kind reads _DETECT_FNS directly without first calling ensure_registry_loaded(). In the current primary code path (_process_batch_detection → emit_health_check_alert) the registry is always already loaded, so this is safe. But if emit_health_check_alert is ever called from a context that hasn't gone through the orchestrator, _DETECT_FNS.get(kind) will silently return None and fall back to generic alert content — the check-specific render_alert override is skipped with no warning.

Suggested change

def _check_class_for_kind(kind: str) -> type[HealthCheck] | None:

# The registry only stores instance-bound detect callables, but every

# HealthCheck subclass binds `cls()` so the underlying class is reachable

# via the bound method's `__self__`.

fn = _DETECT_FNS.get(kind)

def _check_class_for_kind(kind: str) -> type[HealthCheck] | None:

# The registry only stores instance-bound detect callables, but every

# HealthCheck subclass binds `cls()` so the underlying class is reachable

# via the bound method's `__self__`.

from posthog.temporal.health_checks.registry import ensure_registry_loaded

ensure_registry_loaded()

fn = _DETECT_FNS.get(kind)

Prompt To Fix With AI

This is a comment left during a code review. Path: posthog/temporal/health_checks/alerts.py Line: 24-28 Comment: `_check_class_for_kind` reads `_DETECT_FNS` directly without first calling `ensure_registry_loaded()`. In the current primary code path (`_process_batch_detection` → `emit_health_check_alert`) the registry is always already loaded, so this is safe. But if `emit_health_check_alert` is ever called from a context that hasn't gone through the orchestrator, `_DETECT_FNS.get(kind)` will silently return `None` and fall back to generic alert content — the check-specific `render_alert` override is skipped with no warning. ```suggestion def _check_class_for_kind(kind: str) -> type[HealthCheck] | None: # The registry only stores instance-bound detect callables, but every # HealthCheck subclass binds `cls()` so the underlying class is reachable # via the bound method's `__self__`. from posthog.temporal.health_checks.registry import ensure_registry_loaded ensure_registry_loaded() fn = _DETECT_FNS.get(kind) ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-26T01:29:08Z

+    def test_produce_failure_is_swallowed_and_captured(self, _produce, _lookup, capture):
+        fired = emit_health_check_alert(_make_issue(), status="firing")
+        self.assertFalse(fired)
+        capture.assert_called_once()
+
+    @patch("posthog.temporal.health_checks.alerts.capture_exception")
+    @patch("posthog.temporal.health_checks.alerts._check_class_for_kind", return_value=_BadCheck)
+    @patch("posthog.temporal.health_checks.alerts.produce_internal_event")
+    def test_render_failure_is_swallowed_and_captured(self, produce, _lookup, capture):
+        fired = emit_health_check_alert(_make_issue(), status="firing")
+        self.assertFalse(fired)
+        produce.assert_not_called()
+        capture.assert_called_once()


test_produce_failure_is_swallowed_and_captured and test_render_failure_is_swallowed_and_captured are structurally identical — both patch _check_class_for_kind, make either produce_internal_event or render_alert raise, then assert fired is False and capture_exception was called once. Per the team's convention, these should be a single parameterized (sub)test that varies the failing patch target.

Prompt To Fix With AI

This is a comment left during a code review. Path: posthog/temporal/health_checks/tests/test_alert_emission.py Line: 65-77 Comment: `test_produce_failure_is_swallowed_and_captured` and `test_render_failure_is_swallowed_and_captured` are structurally identical — both patch `_check_class_for_kind`, make either `produce_internal_event` or `render_alert` raise, then assert `fired is False` and `capture_exception` was called once. Per the team's convention, these should be a single parameterized (sub)test that varies the failing patch target. How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

github-actions · 2026-05-26T01:35:53Z

ClickHouse migration SQL per cloud environment

No ClickHouse migrations changed in this PR.

rafaeelaudibert requested review from a team, MattBro, fercgomes and slshults and removed request for a team May 26, 2026 01:25

graphite-app Bot added the stamphog Request AI review from stamphog label May 26, 2026

rafaeelaudibert mentioned this pull request May 26, 2026

feat(health-alerts): mount entry points and add per-check render_alert #59986

Open

This was referenced May 26, 2026

feat(health-alerts): add wizard and central scene for health-check alerts #59985

Open

feat(health-checks): return lifecycle deltas from bulk_upsert/bulk_resolve #59983

Open

assign-reviewers-posthog Bot requested a review from a team May 26, 2026 01:25

github-actions Bot approved these changes May 26, 2026

View reviewed changes

greptile-apps Bot reviewed May 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(health-checks): emit alert events on issue lifecycle transitions#59984

feat(health-checks): emit alert events on issue lifecycle transitions#59984
rafaeelaudibert wants to merge 1 commit into
rafa/03-health-issue-deltasfrom
rafa/04-health-check-alert-events

rafaeelaudibert commented May 26, 2026 •

edited

Loading

Uh oh!

rafaeelaudibert commented May 26, 2026 •

edited

Loading

Uh oh!

github-actions Bot left a comment

Uh oh!

greptile-apps Bot commented May 26, 2026

Uh oh!

greptile-apps Bot May 26, 2026

Uh oh!

greptile-apps Bot May 26, 2026

Uh oh!

github-actions Bot commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rafaeelaudibert commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Changes

How did you test this code?

Publish to changelog?

Docs update

🤖 Agent context

Uh oh!

rafaeelaudibert commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented May 26, 2026

Uh oh!

greptile-apps Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 26, 2026

ClickHouse migration SQL per cloud environment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

rafaeelaudibert commented May 26, 2026 •

edited

Loading

rafaeelaudibert commented May 26, 2026 •

edited

Loading