fix(telemetry): refactor to shared_ptr and add deterministic shutdown()#326
Conversation
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ac2de8df27
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
BenchmarksBenchmark execution time: 2026-06-09 10:17:36 Comparing candidate commit 31b8ae3 in PR branch Found 0 performance improvements and 3 performance regressions! Performance is the same for 5 metrics, 0 unstable metrics.
|
db6b0ab to
a179e32
Compare
Refactors Telemetry to be heap-allocated via a shared_ptr (enable_shared_from_this + factory create()). Scheduled-task callbacks now capture a weak_from_this() so they become no-ops automatically once the object is no longer reachable. Move semantics are removed. Adds a shutdown() function that cancels scheduled tasks, sends the app-closing payload, and releases the HTTP client. After the call, all HTTP sending becomes a no-op: send_payload() snapshots http_client_ under http_client_mutex_ and returns early if null. http_client_ is reset under the same mutex in shutdown(), so any concurrent call that already holds a live snapshot completes cleanly while new calls short-circuit. If this is the last shared_ptr to the client, the background Curl thread is joined at that point. Rationale: shutdown() cannot guarantee that, after it returns, there are no in-flight HTTP callbacks still holding a reference to Telemetry state. Using a shared_ptr ensures any late callback finds a dead weak_ptr and short-circuits instead of accessing a destroyed object. The load-bearing weak_from_this() captures are the on_response and on_error lambdas inside send_payload(). Those run on CurlImpl's independent event_loop_ thread (via handle_message()), which has no blocking-cancel mechanism. app_closing() calls drain() with a 2-second deadline, so if the agent is slow, requests can still be in flight on the curl thread after shutdown() returns. The weak.lock() guard is what prevents a use-after-free in that window. The weak_from_this() captures in the scheduled-task callbacks (schedule_tasks()) are defense-in-depth for a different concern: ThreadedEventScheduler's cancel() blocks until any in-progress callback finishes, making those captures redundant against the threaded scheduler. However, the EventScheduler interface does not mandate a blocking cancel(), so the captures guard against non-blocking implementations where a callback could otherwise touch Telemetry state after it has been torn down. Also fixes a race in test_span.cpp: disabling telemetry there prevents app_started() from posting to the Curl background thread, which would otherwise fail to reach the (absent) agent asynchronously and call log_error() on the shared MockLogger, racing against the error_count == 0 assertion. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
a179e32 to
31b8ae3
Compare
zacharycmontoya
left a comment
There was a problem hiding this comment.
LGTM. Admittedly this is my first time seeing usage of the std::enable_shared_from_this and weak_from_this API's but it makes sense to me!
Refactors Telemetry to be heap-allocated via a shared_ptr (enable_shared_from_this + factory create()). Scheduled-task callbacks now capture a weak_from_this() so they become no-ops automatically once the object is no longer reachable. Move semantics are removed.
Adds a shutdown() function that cancels scheduled tasks, sends the app-closing payload, and releases the HTTP client. After the call, all HTTP sending becomes a no-op: send_payload() snapshots http_client_ under http_client_mutex_ and returns early if null. http_client_ is reset under the same mutex in shutdown(), so any concurrent call that already holds a live snapshot completes cleanly while new calls short-circuit. If this is the last shared_ptr to the client, the background Curl thread is joined at that point.
Rationale: shutdown() cannot guarantee that, after it returns, there are no in-flight HTTP callbacks still holding a reference to Telemetry state. Using a shared_ptr ensures any late callback finds a dead weak_ptr and short-circuits instead of accessing a destroyed object.
The load-bearing weak_from_this() captures are the on_response and on_error lambdas inside send_payload(). Those run on CurlImpl's independent event_loop_ thread (via handle_message()), which has no blocking-cancel mechanism. app_closing() calls drain() with a 2-second deadline, so if the agent is slow, requests can still be in flight on the curl thread after shutdown() returns. The weak.lock() guard is what prevents a use-after-free in that window.
The weak_from_this() captures in the scheduled-task callbacks (schedule_tasks()) are defense-in-depth for a different concern: ThreadedEventScheduler's cancel() blocks until any in-progress callback finishes, making those captures redundant against the threaded scheduler. However, the EventScheduler interface does not mandate a blocking cancel(), so the captures guard against non-blocking implementations where a callback could otherwise touch Telemetry state after it has been torn down.
Also fixes a race in test_span.cpp: disabling telemetry there prevents app_started() from posting to the Curl background thread, which would otherwise fail to reach the (absent) agent asynchronously and call log_error() on the shared MockLogger, racing against the error_count == 0 assertion.
Description
Motivation
Additional Notes
Jira ticket: [PROJ-IDENT]