Skip to content

feat(health-checks): emit alert events on issue lifecycle transitions#59984

Open
rafaeelaudibert wants to merge 1 commit into
rafa/03-health-issue-deltasfrom
rafa/04-health-check-alert-events
Open

feat(health-checks): emit alert events on issue lifecycle transitions#59984
rafaeelaudibert wants to merge 1 commit into
rafa/03-health-issue-deltasfrom
rafa/04-health-check-alert-events

Conversation

@rafaeelaudibert
Copy link
Copy Markdown
Member

@rafaeelaudibert rafaeelaudibert commented May 26, 2026

Problem

PostHog has ~10 concrete health checks under posthog/temporal/health_checks/ (SDK outdated, ingestion warnings, external data failures, web-vitals gaps, no-pageleave, scroll depth, authorized URLs, reverse proxy, materialized view failures, …). The original SDK Doctor alerting PR (#58211) wired Slack/Discord/Teams/email/webhook destinations to one of those kinds via a bespoke event name ($sdk_doctor_alert_firing), a bespoke HogFunctionConfigurationContextId, a bespoke per-team Redis cooldown, and ~120 lines of hand-authored destination templates in sub-templates.ts. Repeating that pattern per kind would bloat sub-templates.ts to ~2000 lines and require ~10 copies of the emission glue.

This PR introduces a single emission path that fires for every health check, with a uniform event envelope. Concrete checks override one render_alert classmethod to provide their user-facing copy; nothing else per-kind is needed.

Changes

Framework — posthog/temporal/health_checks/framework.py:

  • New AlertContent frozen dataclass: title, summary, link (relative path; templates concatenate onto {project.url})
  • New HealthCheck.render_alert(issue) -> AlertContent classmethod with a generic default that uses cls.name / cls.kind / issue.severity and links to /health

Emitter — new posthog/temporal/health_checks/alerts.py:

Two internal event names — $health_check_issue_firing and $health_check_issue_resolved — produced via produce_internal_event with a uniform envelope:

kind, severity, issue_id, title, summary, link, payload

Per-issue try/except mirroring products/logs/backend/temporal/activities.py so one bad render or one Kafka error never breaks the batch. Failures are captured via posthog.exceptions_capture.capture_exception.

Orchestrator — posthog/temporal/health_checks/processing.py:

Iterates the newly-active and newly-resolved lists returned by the previous PR (#59983) and calls emit_health_check_alert(issue, status=...) per row.

SDK Doctor — products/growth/backend/temporal/health_checks/sdk_outdated.py:

SdkOutdatedCheck.render_alert returns a human-readable summary keyed off issue.payload['sdk_name'] and latest_version, with link="/health/sdk-doctor".

How did you test this code?

I'm Claude (Opus 4.7) — automated checks only.

  • 5 new tests in posthog/temporal/health_checks/tests/test_alert_emission.py covering:
    • Firing emits with the full envelope
    • Resolved emits the resolved event name
    • Unregistered kinds fall back to the generic content
    • Kafka failures are swallowed and captured
    • render_alert exceptions are swallowed and captured
  • ruff / formatter / ty check pass via lint-staged pre-commit
  • Did not run end-to-end against a real Kafka / temporal worker locally

Publish to changelog?

No — wait for the user-facing pieces (#59985, #59986).

Docs update

None needed yet — user-facing docs land with the rollout PR.

🤖 Agent context

Authored via Claude Code (Opus 4.7) in collaboration with @rafaeelaudibert. Originated from a review of #58211; plan refined remotely via Ultraplan.

Decisions worth flagging for reviewers:

  • Cooldown via lifecycle, not Redis. Built on top of #59983's delta returns: while an issue stays ACTIVE, no re-emission. Only to_create (brand-new + reactivated) and bulk_resolve outputs fire. This subsumes the 7-day per-team Redis cooldown the original PR added in products/growth/backend/sdk_doctor_alerts.py. That file gets removed in the SDK-Doctor migration further up the stack.
  • Symmetric firing/resolved events. Logs alerting also does this; symmetric semantics let users get a "recovered" notification without extra plumbing.
  • Registry lookup via _DETECT_FNS[kind].__self__. The registry stores the bound detect method per kind; the unbound class is reachable via the method's __self__. No new registry surface area was needed.
  • Existing precedents kept intact. Error tracking and logs alerting still use their own event names and sub-templates; they remain sibling alert families. No refactor of those is included here.

This PR stacks on #59983 and is followed by #59985 and #59986.

Add \$health_check_issue_firing / \$health_check_issue_resolved internal
events, emitted from the Temporal orchestrator for each row returned by
bulk_upsert (newly active) and bulk_resolve (newly resolved). Event
properties carry a uniform envelope — kind, severity, issue_id, title,
summary, link, payload — so HogFunction destination templates can render
any health check's alert without per-kind branching.

Add an AlertContent dataclass and a HealthCheck.render_alert classmethod
that concrete checks override to produce user-facing title/summary/link
strings. The default falls back to a generic kind+severity message
linking to /health.

Override render_alert on SdkOutdatedCheck to name the affected SDK and
link to /health/sdk-doctor.

Per-issue emission failures (render or Kafka) are swallowed and reported
via capture_exception so one bad row cannot break the batch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rafaeelaudibert rafaeelaudibert requested review from a team, MattBro, fercgomes and slshults and removed request for a team May 26, 2026 01:25
@graphite-app graphite-app Bot added the stamphog Request AI review from stamphog label May 26, 2026
Copy link
Copy Markdown
Member Author

rafaeelaudibert commented May 26, 2026

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additive feature with robust exception handling — all alert emission failures are swallowed and reported to Sentry, so the health-check orchestrator batch cannot be broken. Code is clean, well-tested (happy path plus four failure scenarios), and makes no data model or API contract changes.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 26, 2026

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
posthog/temporal/health_checks/alerts.py:24-28
`_check_class_for_kind` reads `_DETECT_FNS` directly without first calling `ensure_registry_loaded()`. In the current primary code path (`_process_batch_detection``emit_health_check_alert`) the registry is always already loaded, so this is safe. But if `emit_health_check_alert` is ever called from a context that hasn't gone through the orchestrator, `_DETECT_FNS.get(kind)` will silently return `None` and fall back to generic alert content — the check-specific `render_alert` override is skipped with no warning.

```suggestion
def _check_class_for_kind(kind: str) -> type[HealthCheck] | None:
    # The registry only stores instance-bound detect callables, but every
    # HealthCheck subclass binds `cls()` so the underlying class is reachable
    # via the bound method's `__self__`.
    from posthog.temporal.health_checks.registry import ensure_registry_loaded

    ensure_registry_loaded()
    fn = _DETECT_FNS.get(kind)
```

### Issue 2 of 2
posthog/temporal/health_checks/tests/test_alert_emission.py:65-77
`test_produce_failure_is_swallowed_and_captured` and `test_render_failure_is_swallowed_and_captured` are structurally identical — both patch `_check_class_for_kind`, make either `produce_internal_event` or `render_alert` raise, then assert `fired is False` and `capture_exception` was called once. Per the team's convention, these should be a single parameterized (sub)test that varies the failing patch target.

Reviews (1): Last reviewed commit: "feat(health-checks): emit alert events o..." | Re-trigger Greptile

Comment on lines +24 to +28
def _check_class_for_kind(kind: str) -> type[HealthCheck] | None:
# The registry only stores instance-bound detect callables, but every
# HealthCheck subclass binds `cls()` so the underlying class is reachable
# via the bound method's `__self__`.
fn = _DETECT_FNS.get(kind)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 _check_class_for_kind reads _DETECT_FNS directly without first calling ensure_registry_loaded(). In the current primary code path (_process_batch_detectionemit_health_check_alert) the registry is always already loaded, so this is safe. But if emit_health_check_alert is ever called from a context that hasn't gone through the orchestrator, _DETECT_FNS.get(kind) will silently return None and fall back to generic alert content — the check-specific render_alert override is skipped with no warning.

Suggested change
def _check_class_for_kind(kind: str) -> type[HealthCheck] | None:
# The registry only stores instance-bound detect callables, but every
# HealthCheck subclass binds `cls()` so the underlying class is reachable
# via the bound method's `__self__`.
fn = _DETECT_FNS.get(kind)
def _check_class_for_kind(kind: str) -> type[HealthCheck] | None:
# The registry only stores instance-bound detect callables, but every
# HealthCheck subclass binds `cls()` so the underlying class is reachable
# via the bound method's `__self__`.
from posthog.temporal.health_checks.registry import ensure_registry_loaded
ensure_registry_loaded()
fn = _DETECT_FNS.get(kind)
Prompt To Fix With AI
This is a comment left during a code review.
Path: posthog/temporal/health_checks/alerts.py
Line: 24-28

Comment:
`_check_class_for_kind` reads `_DETECT_FNS` directly without first calling `ensure_registry_loaded()`. In the current primary code path (`_process_batch_detection``emit_health_check_alert`) the registry is always already loaded, so this is safe. But if `emit_health_check_alert` is ever called from a context that hasn't gone through the orchestrator, `_DETECT_FNS.get(kind)` will silently return `None` and fall back to generic alert content — the check-specific `render_alert` override is skipped with no warning.

```suggestion
def _check_class_for_kind(kind: str) -> type[HealthCheck] | None:
    # The registry only stores instance-bound detect callables, but every
    # HealthCheck subclass binds `cls()` so the underlying class is reachable
    # via the bound method's `__self__`.
    from posthog.temporal.health_checks.registry import ensure_registry_loaded

    ensure_registry_loaded()
    fn = _DETECT_FNS.get(kind)
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +65 to +77
def test_produce_failure_is_swallowed_and_captured(self, _produce, _lookup, capture):
fired = emit_health_check_alert(_make_issue(), status="firing")
self.assertFalse(fired)
capture.assert_called_once()

@patch("posthog.temporal.health_checks.alerts.capture_exception")
@patch("posthog.temporal.health_checks.alerts._check_class_for_kind", return_value=_BadCheck)
@patch("posthog.temporal.health_checks.alerts.produce_internal_event")
def test_render_failure_is_swallowed_and_captured(self, produce, _lookup, capture):
fired = emit_health_check_alert(_make_issue(), status="firing")
self.assertFalse(fired)
produce.assert_not_called()
capture.assert_called_once()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 test_produce_failure_is_swallowed_and_captured and test_render_failure_is_swallowed_and_captured are structurally identical — both patch _check_class_for_kind, make either produce_internal_event or render_alert raise, then assert fired is False and capture_exception was called once. Per the team's convention, these should be a single parameterized (sub)test that varies the failing patch target.

Prompt To Fix With AI
This is a comment left during a code review.
Path: posthog/temporal/health_checks/tests/test_alert_emission.py
Line: 65-77

Comment:
`test_produce_failure_is_swallowed_and_captured` and `test_render_failure_is_swallowed_and_captured` are structurally identical — both patch `_check_class_for_kind`, make either `produce_internal_event` or `render_alert` raise, then assert `fired is False` and `capture_exception` was called once. Per the team's convention, these should be a single parameterized (sub)test that varies the failing patch target.

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@github-actions
Copy link
Copy Markdown
Contributor

ClickHouse migration SQL per cloud environment

No ClickHouse migrations changed in this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stamphog Request AI review from stamphog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant