Skip to content

Gate external event producers through telemetry config#258

Merged
Sayan- merged 3 commits into
mainfrom
hypeship/gate-external-events
May 28, 2026
Merged

Gate external event producers through telemetry config#258
Sayan- merged 3 commits into
mainfrom
hypeship/gate-external-events

Conversation

@Sayan-
Copy link
Copy Markdown
Contributor

@Sayan- Sayan- commented May 28, 2026

Overview

External event producers (POST /telemetry/events, sysmon's kmsg reader, the CDP proxy, the api_call middleware, cdpmonitor) used to publish directly to the raw EventStream, which bypassed the active TelemetrySession and leaked events into the stream even when telemetry was unconfigured or the category was disabled. This PR routes every producer through TelemetrySession so they all honor the same gating.

Concretely:

  • POST /telemetry/events now returns 204 No Content when filtered, and 200 with the assigned envelope when admitted (previously: 200 with seq: 0 to signal a drop, which violated the schema)
  • TelemetrySession is documented as a non-nil invariant: NewTelemetrySession panics on a nil EventStream, ApiService construction rejects a nil session, and methods don't tolerate a nil receiver
  • All events must be published through TelemetrySession, never directly to EventStream. AGENTS.md documents the rule.

System events still flow whenever a telemetry session is active: the events.System force-include in Start / UpdateConfig is unchanged. Only the bypass paths are closed.

Pls don't review this PR by commit. I had opus rewrite whatever another agent initially came up with (which was just not good)

Testing

CI + smoke tests

# Test Build Result Notes
G1 POST /telemetry/events with no active session → expect 204 headless pass Returned 204 0B, body empty
G2 Synthetic kmsg OOM with no active session → expect 0 SSE events headless pass SSE silent for 6 s after injection; HTTP path also returned 204 for shim posts
T1 service_crashed phase=running (supervisorctl signal KILL chromium) headless pass One event with phase=running, pid=273, service_name=chromium
T2 service_crashed phase=gave_up (flaky service exhausting startretries) headless pass One event with phase=gave_up, service_name=flaky, no pid field
T3 Clean stop is suppressed (supervisorctl stop chromium) headless pass Zero new service_crashed events in the 4 s window
T4 system_oom_kill (synthetic kmsg dump) headless pass constraint=none, mem_total_kb=2097152, mem_free_kb=18240, top_tasks[0].name=chromium, trigger_process_name=chromium
T5 system_oom_kill (real cgroup OOM under --memory 512m) headless pass Two constraint=memcg events (python3 + chromium); mem_total_kb / mem_free_kb correctly omitted; top_tasks names clean (python3, Xvfb, chromium, kernel-images-a, chromedriver)
T6 Headful service_crashed phase=running headful pass Identical event shape as T1, fired against chromium-headful-test:latest

All events carry category: "system" and metadata.telemetry_session_id correctly. api_call events also flow once api is captured (visible in T1 as a side-effect of the shim's POST).


Note

Medium Risk
Changes which events reach the stream and the POST /telemetry/events contract; misconfiguration could drop expected telemetry until PUT /telemetry starts a session.

Overview
External telemetry producers no longer write to the raw EventStream. TelemetrySession.Publish is now the single gate: it applies the active session, category filter, and telemetry_session_id metadata before anything reaches SSE/S2.

POST /telemetry/events, sysmon (kmsg OOM), CDP monitor/proxy, and api_call HTTP middleware all use a shared func(Event) (Envelope, bool) callback wired to telemetrySession.Publish in main.go. Filtered publishes return 204 No Content; admitted events return 200 with the real envelope (replacing the old 200 + seq: 0 drop signal). OpenAPI and generated clients were updated accordingly.

TelemetrySession is documented as a required non-nil dependency; producer callback types and tests were updated. AGENTS.md and lib/sysmon docs now state that producers must not bypass the session.

Reviewed by Cursor Bugbot for commit 049b98c. Bugbot is set up for automated code reviews on this repo. Configure here.

Sayan- and others added 2 commits May 28, 2026 18:36
POST /telemetry/events and sysmon previously published directly to the
event stream, so events flowed even with no telemetry session active or
the category disabled. Route both through TelemetrySession so callers
that explicitly turn telemetry off (or omit a config) no longer see
external producers leak into the stream. Filtered publishes still ack
the request but return seq=0 to signal the drop.
@Sayan- Sayan- changed the title gate external event producers through telemetry config Gate external event producers through telemetry config May 28, 2026
@Sayan- Sayan- marked this pull request as ready for review May 28, 2026 20:03
@Sayan- Sayan- requested review from archandatta and rgarcia May 28, 2026 20:04
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 049b98c. Configure here.

Comment thread server/lib/telemetry/telemetry.go
Comment thread server/cmd/api/api/events.go
@firetiger-agent
Copy link
Copy Markdown

Monitoring Plan: Route In-VM Telemetry Through TelemetrySession

What this PR does: Enforces a rule that all in-VM telemetry producers publish through the TelemetrySession (which applies category gating and session metadata) rather than directly to the raw event stream. In practice, this means POST /telemetry/events inside a browser VM now returns 204 instead of 200 when no telemetry session is active or the event category is disabled — and OOM events from the kernel monitor are gated through the same path.

Intended effect:

  • POST /telemetry/events 204 responses: baseline 0 (endpoint previously always returned 200); confirmed if supervisord-shim crash reports continue arriving during active sessions with no new error logs
  • system_oom_kill event delivery: baseline — system category always enabled; confirmed if OOM events continue to reach consumers normally (no missing crash telemetry reports)

Risks:

  • supervisord-shim 204 handling — WARN/ERROR logs from in-VM services referencing telemetry/events; alert if any new error pattern appears in API logs containing telemetry + shim context
  • OOM events silently dropped — if a session starts without the system category (should never happen per code), system_oom_kill events would be lost; alert if users report missing VM crash telemetry
  • Unexpected caller breakage — any in-VM client treating HTTP 204 as failure; alert if API error rate rises >2× baseline (~111K errors/hr during active hours)

Status updates will be posted automatically on this PR as monitoring progresses.

View monitor

Copy link
Copy Markdown
Contributor

@rgarcia rgarcia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff!

@Sayan- Sayan- merged commit 4675f3b into main May 28, 2026
10 checks passed
@Sayan- Sayan- deleted the hypeship/gate-external-events branch May 28, 2026 21:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants