Skip to content

fix(telemetry): refactor to shared_ptr and add deterministic shutdown()#326

Merged
cataphract merged 3 commits into
mainfrom
glopes/telemetry-shutdown
Jun 9, 2026
Merged

fix(telemetry): refactor to shared_ptr and add deterministic shutdown()#326
cataphract merged 3 commits into
mainfrom
glopes/telemetry-shutdown

Conversation

@cataphract

Copy link
Copy Markdown
Contributor

Refactors Telemetry to be heap-allocated via a shared_ptr (enable_shared_from_this + factory create()). Scheduled-task callbacks now capture a weak_from_this() so they become no-ops automatically once the object is no longer reachable. Move semantics are removed.

Adds a shutdown() function that cancels scheduled tasks, sends the app-closing payload, and releases the HTTP client. After the call, all HTTP sending becomes a no-op: send_payload() snapshots http_client_ under http_client_mutex_ and returns early if null. http_client_ is reset under the same mutex in shutdown(), so any concurrent call that already holds a live snapshot completes cleanly while new calls short-circuit. If this is the last shared_ptr to the client, the background Curl thread is joined at that point.

Rationale: shutdown() cannot guarantee that, after it returns, there are no in-flight HTTP callbacks still holding a reference to Telemetry state. Using a shared_ptr ensures any late callback finds a dead weak_ptr and short-circuits instead of accessing a destroyed object.

The load-bearing weak_from_this() captures are the on_response and on_error lambdas inside send_payload(). Those run on CurlImpl's independent event_loop_ thread (via handle_message()), which has no blocking-cancel mechanism. app_closing() calls drain() with a 2-second deadline, so if the agent is slow, requests can still be in flight on the curl thread after shutdown() returns. The weak.lock() guard is what prevents a use-after-free in that window.

The weak_from_this() captures in the scheduled-task callbacks (schedule_tasks()) are defense-in-depth for a different concern: ThreadedEventScheduler's cancel() blocks until any in-progress callback finishes, making those captures redundant against the threaded scheduler. However, the EventScheduler interface does not mandate a blocking cancel(), so the captures guard against non-blocking implementations where a callback could otherwise touch Telemetry state after it has been torn down.

Also fixes a race in test_span.cpp: disabling telemetry there prevents app_started() from posting to the Curl background thread, which would otherwise fail to reach the (absent) agent asynchronously and call log_error() on the shared MockLogger, racing against the error_count == 0 assertion.

Description

Motivation

Additional Notes

Jira ticket: [PROJ-IDENT]

@cataphract cataphract requested review from a team as code owners June 8, 2026 15:35
@cataphract cataphract requested review from xlamorlette-datadog and removed request for a team June 8, 2026 15:35
@datadog-datadog-prod-us1

datadog-datadog-prod-us1 Bot commented Jun 8, 2026

Copy link
Copy Markdown

Pipelines

Fix all issues with BitsAI

⚠️ Warnings

🚦 1 Pipeline job failed

DataDog/apm-reliability/dd-trace-cpp | check-big-regressions   View in Datadog   GitLab

ℹ️ Info

🎯 Code Coverage (details)
Patch Coverage: 32.74%
Overall Coverage: 91.05% (+0.20%)

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 31b8ae3 | Docs | Datadog PR Page | Give us feedback!

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ac2de8df27

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/datadog/telemetry/telemetry_impl.cpp
Comment thread src/datadog/telemetry/telemetry.cpp
@pr-commenter

pr-commenter Bot commented Jun 8, 2026

Copy link
Copy Markdown

Benchmarks

Benchmark execution time: 2026-06-09 10:17:36

Comparing candidate commit 31b8ae3 in PR branch glopes/telemetry-shutdown with baseline commit 93ed055 in branch main.

Found 0 performance improvements and 3 performance regressions! Performance is the same for 5 metrics, 0 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

scenario:BM_TraceID_ParseHex/128bit

  • 🟥 execution_time [+11.119ns; +11.183ns] or [+9.025%; +9.077%]

scenario:BM_TraceID_ParseHex/64bit

  • 🟥 execution_time [+5.813ns; +5.843ns] or [+8.392%; +8.436%]

scenario:BM_TraceTinyCCSource

  • 🟥 execution_time [+2.005ms; +2.243ms] or [+2.636%; +2.949%]

cataphract and others added 3 commits June 9, 2026 11:11
Refactors Telemetry to be heap-allocated via a shared_ptr
(enable_shared_from_this + factory create()). Scheduled-task callbacks
now capture a weak_from_this() so they become no-ops automatically once
the object is no longer reachable. Move semantics are removed.

Adds a shutdown() function that cancels scheduled tasks, sends the
app-closing payload, and releases the HTTP client. After the call, all
HTTP sending becomes a no-op: send_payload() snapshots http_client_
under http_client_mutex_ and returns early if null. http_client_ is
reset under the same mutex in shutdown(), so any concurrent call that
already holds a live snapshot completes cleanly while new calls
short-circuit. If this is the last shared_ptr to the client, the
background Curl thread is joined at that point.

Rationale: shutdown() cannot guarantee that, after it returns, there
are no in-flight HTTP callbacks still holding a reference to Telemetry
state. Using a shared_ptr ensures any late callback finds a dead
weak_ptr and short-circuits instead of accessing a destroyed object.

The load-bearing weak_from_this() captures are the on_response and
on_error lambdas inside send_payload(). Those run on CurlImpl's
independent event_loop_ thread (via handle_message()), which has no
blocking-cancel mechanism. app_closing() calls drain() with a 2-second
deadline, so if the agent is slow, requests can still be in flight on
the curl thread after shutdown() returns. The weak.lock() guard is what
prevents a use-after-free in that window.

The weak_from_this() captures in the scheduled-task callbacks
(schedule_tasks()) are defense-in-depth for a different concern:
ThreadedEventScheduler's cancel() blocks until any in-progress
callback finishes, making those captures redundant against the
threaded scheduler. However, the EventScheduler interface does not
mandate a blocking cancel(), so the captures guard against
non-blocking implementations where a callback could otherwise touch
Telemetry state after it has been torn down.

Also fixes a race in test_span.cpp: disabling telemetry there prevents
app_started() from posting to the Curl background thread, which would
otherwise fail to reach the (absent) agent asynchronously and call
log_error() on the shared MockLogger, racing against the
error_count == 0 assertion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@zacharycmontoya zacharycmontoya left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Admittedly this is my first time seeing usage of the std::enable_shared_from_this and weak_from_this API's but it makes sense to me!

@cataphract cataphract merged commit 2952e1f into main Jun 9, 2026
36 of 38 checks passed
@cataphract cataphract deleted the glopes/telemetry-shutdown branch June 9, 2026 17:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants