fix(telemetry): refactor to shared_ptr and add deterministic shutdown() by cataphract · Pull Request #326 · DataDog/dd-trace-cpp

cataphract · 2026-06-08T15:35:07Z

Refactors Telemetry to be heap-allocated via a shared_ptr (enable_shared_from_this + factory create()). Scheduled-task callbacks now capture a weak_from_this() so they become no-ops automatically once the object is no longer reachable. Move semantics are removed.

Adds a shutdown() function that cancels scheduled tasks, sends the app-closing payload, and releases the HTTP client. After the call, all HTTP sending becomes a no-op: send_payload() snapshots http_client_ under http_client_mutex_ and returns early if null. http_client_ is reset under the same mutex in shutdown(), so any concurrent call that already holds a live snapshot completes cleanly while new calls short-circuit. If this is the last shared_ptr to the client, the background Curl thread is joined at that point.

Rationale: shutdown() cannot guarantee that, after it returns, there are no in-flight HTTP callbacks still holding a reference to Telemetry state. Using a shared_ptr ensures any late callback finds a dead weak_ptr and short-circuits instead of accessing a destroyed object.

The load-bearing weak_from_this() captures are the on_response and on_error lambdas inside send_payload(). Those run on CurlImpl's independent event_loop_ thread (via handle_message()), which has no blocking-cancel mechanism. app_closing() calls drain() with a 2-second deadline, so if the agent is slow, requests can still be in flight on the curl thread after shutdown() returns. The weak.lock() guard is what prevents a use-after-free in that window.

The weak_from_this() captures in the scheduled-task callbacks (schedule_tasks()) are defense-in-depth for a different concern: ThreadedEventScheduler's cancel() blocks until any in-progress callback finishes, making those captures redundant against the threaded scheduler. However, the EventScheduler interface does not mandate a blocking cancel(), so the captures guard against non-blocking implementations where a callback could otherwise touch Telemetry state after it has been torn down.

Also fixes a race in test_span.cpp: disabling telemetry there prevents app_started() from posting to the Curl background thread, which would otherwise fail to reach the (absent) agent asynchronously and call log_error() on the shared MockLogger, racing against the error_count == 0 assertion.

Description

Motivation

Additional Notes

Jira ticket: [PROJ-IDENT]

datadog-datadog-prod-us1 · 2026-06-08T15:36:38Z

✨ Fix all issues with BitsAI

⚠️ Warnings

🚦 1 Pipeline job failed

DataDog/apm-reliability/dd-trace-cpp | check-big-regressions

ℹ️ Info

🎯 Code Coverage (details)
• Patch Coverage: 32.74%
• Overall Coverage: 91.05% (+0.20%)

Useful? React with 👍 / 👎

_{This comment will be updated automatically if new data arrives.

🔗 Commit SHA: 31b8ae3 | Docs | Datadog PR Page | Give us feedback!}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ac2de8df27

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

pr-commenter · 2026-06-08T15:38:42Z

Benchmarks

Benchmark execution time: 2026-06-09 10:17:36

Comparing candidate commit 31b8ae3 in PR branch glopes/telemetry-shutdown with baseline commit 93ed055 in branch main.

Found 0 performance improvements and 3 performance regressions! Performance is the same for 5 metrics, 0 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

🟩 = significantly better candidate vs. baseline
🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

scenario:BM_TraceID_ParseHex/128bit

🟥 execution_time [+11.119ns; +11.183ns] or [+9.025%; +9.077%]

scenario:BM_TraceID_ParseHex/64bit

🟥 execution_time [+5.813ns; +5.843ns] or [+8.392%; +8.436%]

scenario:BM_TraceTinyCCSource

🟥 execution_time [+2.005ms; +2.243ms] or [+2.636%; +2.949%]

Refactors Telemetry to be heap-allocated via a shared_ptr (enable_shared_from_this + factory create()). Scheduled-task callbacks now capture a weak_from_this() so they become no-ops automatically once the object is no longer reachable. Move semantics are removed. Adds a shutdown() function that cancels scheduled tasks, sends the app-closing payload, and releases the HTTP client. After the call, all HTTP sending becomes a no-op: send_payload() snapshots http_client_ under http_client_mutex_ and returns early if null. http_client_ is reset under the same mutex in shutdown(), so any concurrent call that already holds a live snapshot completes cleanly while new calls short-circuit. If this is the last shared_ptr to the client, the background Curl thread is joined at that point. Rationale: shutdown() cannot guarantee that, after it returns, there are no in-flight HTTP callbacks still holding a reference to Telemetry state. Using a shared_ptr ensures any late callback finds a dead weak_ptr and short-circuits instead of accessing a destroyed object. The load-bearing weak_from_this() captures are the on_response and on_error lambdas inside send_payload(). Those run on CurlImpl's independent event_loop_ thread (via handle_message()), which has no blocking-cancel mechanism. app_closing() calls drain() with a 2-second deadline, so if the agent is slow, requests can still be in flight on the curl thread after shutdown() returns. The weak.lock() guard is what prevents a use-after-free in that window. The weak_from_this() captures in the scheduled-task callbacks (schedule_tasks()) are defense-in-depth for a different concern: ThreadedEventScheduler's cancel() blocks until any in-progress callback finishes, making those captures redundant against the threaded scheduler. However, the EventScheduler interface does not mandate a blocking cancel(), so the captures guard against non-blocking implementations where a callback could otherwise touch Telemetry state after it has been torn down. Also fixes a race in test_span.cpp: disabling telemetry there prevents app_started() from posting to the Curl background thread, which would otherwise fail to reach the (absent) agent asynchronously and call log_error() on the shared MockLogger, racing against the error_count == 0 assertion. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

zacharycmontoya

LGTM. Admittedly this is my first time seeing usage of the std::enable_shared_from_this and weak_from_this API's but it makes sense to me!

cataphract requested review from a team as code owners June 8, 2026 15:35

cataphract requested review from xlamorlette-datadog and removed request for a team June 8, 2026 15:35

chatgpt-codex-connector Bot reviewed Jun 8, 2026

View reviewed changes

Comment thread src/datadog/telemetry/telemetry_impl.cpp

Comment thread src/datadog/telemetry/telemetry.cpp

cataphract force-pushed the glopes/telemetry-shutdown branch from db6b0ab to a179e32 Compare June 9, 2026 10:09

cataphract temporarily deployed to dev June 9, 2026 10:10 — with GitHub Actions Inactive

cataphract and others added 3 commits June 9, 2026 11:11

fix comment wording

cf43352

run clang-format

31b8ae3

cataphract force-pushed the glopes/telemetry-shutdown branch from a179e32 to 31b8ae3 Compare June 9, 2026 10:11

cataphract temporarily deployed to dev June 9, 2026 10:12 — with GitHub Actions Inactive

cataphract temporarily deployed to dev June 9, 2026 10:16 — with GitHub Actions Inactive

zacharycmontoya approved these changes Jun 9, 2026

View reviewed changes

cataphract merged commit 2952e1f into main Jun 9, 2026
36 of 38 checks passed

cataphract deleted the glopes/telemetry-shutdown branch June 9, 2026 17:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(telemetry): refactor to shared_ptr and add deterministic shutdown()#326

fix(telemetry): refactor to shared_ptr and add deterministic shutdown()#326
cataphract merged 3 commits into
mainfrom
glopes/telemetry-shutdown

cataphract commented Jun 8, 2026

Uh oh!

datadog-datadog-prod-us1 Bot commented Jun 8, 2026 •

edited by datadog-datadog-prod-us1-2 Bot

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

Uh oh!

pr-commenter Bot commented Jun 8, 2026 •

edited

Loading

Explanation

More details about the CI and significant changes

Uh oh!

zacharycmontoya left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

cataphract commented Jun 8, 2026

Description

Motivation

Additional Notes

Uh oh!

datadog-datadog-prod-us1 Bot commented Jun 8, 2026 • edited by datadog-datadog-prod-us1-2 Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ Warnings

ℹ️ Info

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

pr-commenter Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmarks

Explanation

More details about the CI and significant changes

scenario:BM_TraceID_ParseHex/128bit

scenario:BM_TraceID_ParseHex/64bit

scenario:BM_TraceTinyCCSource

Uh oh!

zacharycmontoya left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

datadog-datadog-prod-us1 Bot commented Jun 8, 2026 •

edited by datadog-datadog-prod-us1-2 Bot

Loading

pr-commenter Bot commented Jun 8, 2026 •

edited

Loading