Skip to content

Start collecting async IDs in profiles #5524

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Apr 8, 2025
Merged

Conversation

szegedi
Copy link
Contributor

@szegedi szegedi commented Apr 2, 2025

What does this PR do?

  • Upgrades @datadog/pprof to 5.7.0
  • Enables its feature for collecting Node's execution async IDs in profiles.
  • Adds a DD_PROFILING_ASYNC_ID_ENABLED env var to regulate the feature. It defaults to true when timelines are enabled. We have this env var mostly so we have a lever to give to customers to turn the feature off if it causes problems. We do not anticipate problems. (An earlier implementation of this feature did not guard against trying to retrieve the async ID while V8 was doing garbage collection. That caused crashes, and it has been fixed in pprof 5.7.0.)

Motivation

We have an insight in the profiling backend for detecting long-running event callbacks that block other callbacks from running on the event loop thread. An initial version of this insight uses span IDs associated with profiling samples to identify streaks of adjacent samples that belong to the same invocation. We have since upgraded the backend algorithm to also be able to use execution async IDs when available. Execution async ID is a better datum for this for two reasons:

  • it is less likely to cause false positives if two callbacks working on behalf of the same span run immediately one after the other, and
  • async IDs are available to the profiler even when tracing is turned off.

Additional information

Additional Notes

Jira: PROF-11107

@szegedi szegedi requested a review from a team as a code owner April 2, 2025 14:02
@szegedi szegedi requested a review from nsavoire April 2, 2025 14:02
@szegedi szegedi enabled auto-merge (squash) April 2, 2025 14:03
Copy link

github-actions bot commented Apr 2, 2025

Overall package size

Self size: 9.24 MB
Deduped: 101.64 MB
No deduping: 102.16 MB

Dependency sizes | name | version | self size | total size | |------|---------|-----------|------------| | @datadog/libdatadog | 0.5.0 | 29.83 MB | 29.83 MB | | @datadog/native-appsec | 8.5.1 | 19.26 MB | 19.27 MB | | @datadog/native-iast-taint-tracking | 3.3.0 | 13.77 MB | 13.78 MB | | @datadog/pprof | 5.7.0 | 9.89 MB | 10.26 MB | | @opentelemetry/core | 1.30.1 | 908.66 kB | 7.16 MB | | protobufjs | 7.4.0 | 2.77 MB | 5.42 MB | | @datadog/wasm-js-rewriter | 3.1.0 | 2.37 MB | 2.52 MB | | @datadog/native-metrics | 3.1.0 | 1.06 MB | 1.46 MB | | @opentelemetry/api | 1.8.0 | 1.21 MB | 1.21 MB | | import-in-the-middle | 1.13.1 | 117.64 kB | 839.26 kB | | source-map | 0.7.4 | 226 kB | 226 kB | | opentracing | 0.14.7 | 194.81 kB | 194.81 kB | | lru-cache | 7.18.3 | 133.92 kB | 133.92 kB | | pprof-format | 2.1.0 | 111.69 kB | 111.69 kB | | @datadog/sketches-js | 2.1.1 | 109.9 kB | 109.9 kB | | lodash.sortby | 4.7.0 | 75.76 kB | 75.76 kB | | ignore | 5.3.2 | 53.63 kB | 53.63 kB | | istanbul-lib-coverage | 3.2.0 | 29.34 kB | 29.34 kB | | rfdc | 1.4.1 | 27.15 kB | 27.15 kB | | @isaacs/ttlcache | 1.4.1 | 25.2 kB | 25.2 kB | | tlhunter-sorted-set | 0.1.0 | 24.94 kB | 24.94 kB | | dc-polyfill | 0.1.6 | 24.56 kB | 24.56 kB | | shell-quote | 1.8.2 | 23.54 kB | 23.54 kB | | limiter | 1.1.5 | 23.17 kB | 23.17 kB | | retry | 0.13.1 | 18.85 kB | 18.85 kB | | semifies | 1.0.0 | 15.84 kB | 15.84 kB | | jest-docblock | 29.7.0 | 8.99 kB | 12.76 kB | | crypto-randomuuid | 1.0.0 | 11.18 kB | 11.18 kB | | ttl-set | 1.0.0 | 4.61 kB | 9.69 kB | | path-to-regexp | 0.1.12 | 6.6 kB | 6.6 kB | | koalas | 1.0.2 | 6.47 kB | 6.47 kB | | module-details-from-path | 1.0.3 | 4.47 kB | 4.47 kB |

🤖 This report was automatically generated by heaviest-objects-in-the-universe

Copy link

codecov bot commented Apr 2, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.27%. Comparing base (b94052c) to head (bbb17e6).
Report is 23 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5524      +/-   ##
==========================================
+ Coverage   79.18%   79.27%   +0.08%     
==========================================
  Files         512      512              
  Lines       23158    23179      +21     
==========================================
+ Hits        18338    18374      +36     
+ Misses       4820     4805      -15     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@datadog-datadog-prod-us1
Copy link

datadog-datadog-prod-us1 bot commented Apr 2, 2025

Datadog Report

Branch report: szegedi/async-id-profiler
Commit report: 59eabcf
Test service: dd-trace-js-integration-tests

✅ 0 Failed, 928 Passed, 0 Skipped, 14m 33.72s Total Time

Copy link
Collaborator

@BridgeAR BridgeAR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@BridgeAR BridgeAR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests are actually failing. It seems like the wall.js options have to be changed to know about the option.

@pr-commenter
Copy link

pr-commenter bot commented Apr 2, 2025

Benchmarks

Benchmark execution time: 2025-04-07 12:38:15

Comparing candidate commit bbb17e6 in PR branch szegedi/async-id-profiler with baseline commit b94052c in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 946 metrics, 17 unstable metrics.

@szegedi szegedi marked this pull request as draft April 2, 2025 19:43
auto-merge was automatically disabled April 2, 2025 19:43

Pull request was converted to draft

assert.equal(spans.size, 9, encoded)
assert.equal(asyncIds.size, 9, encoded)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this could even be zero due to the IDs sometimes missing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We collect multiple samples for each span – we'll have a total of 9 spans, each span runs for ~100ms, and we sample at 99Hz, so statistically each span will have ~9.9 samples (I might increase the span execution time to 102ms which'd bring the expected sample value a tad over 10.) So we'd need to fail to capture the async ID in all 9-10 samples for a span in order for a particular async ID to be completely absent from this set.

Unfortunately, this is exactly what we see on Ubuntu profiler tests now: it observes 2-3 async IDs. Not sure what to make of it, but I'll be inspecting this by running some tests locally on an arm Ubuntu VM to try to reproduce.

Copy link
Contributor Author

@szegedi szegedi Apr 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a closer look at the profiling GH workflow though it turns out that we only run profiling tests with active-lts version of Node on macOS, but additionally run them with oldest-maintenance-lts and newest-maintenance-lts on Linux.

Oldest maintenance is 18.20.8 and for it the test indeed fails even on my local macOS. It also fails for 20.19.0, but it passes for 22.14.0. I guess one way to fix the issue is to only enable the feature on Node 22 and above. It's an interesting question though why does node::AsyncHooksGetExecutionAsyncId() work reliably on Node 22 but not on earlier ones. Do you maybe have some information on this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I bisected on minor node versions and 22.10.0 is the first one where the test reliably passes, and 22.9.0 the last one where it reliably fails. The functions that execute here: AsyncHooksGetExecutionAsyncId and its callees Environment::GetCurrent and Environment::execution_async_id don't seem like they had a code change in years. Except for nodejs/node@d6bcc44 which did indeed go into 22.10.0 but I'd be surprised if that was the one that made the difference.

@szegedi szegedi force-pushed the szegedi/async-id-profiler branch from c9eb137 to abf8de9 Compare April 7, 2025 11:56
@szegedi szegedi force-pushed the szegedi/async-id-profiler branch from abf8de9 to bbb17e6 Compare April 7, 2025 12:29
@szegedi szegedi marked this pull request as ready for review April 7, 2025 13:52
@szegedi szegedi requested a review from BridgeAR April 7, 2025 13:53
@BridgeAR BridgeAR merged commit 9ec7ef9 into master Apr 8, 2025
427 checks passed
@BridgeAR BridgeAR deleted the szegedi/async-id-profiler branch April 8, 2025 00:35
@wconti27 wconti27 mentioned this pull request Apr 8, 2025
wconti27 pushed a commit that referenced this pull request Apr 8, 2025
* Upgrade to profiler 5.7.0 for asyncId collection capability

* Add a config option to disable async ID collection

* Alphabetically sort some long option lists

* Add async ID assertions to the integration test

* Only enable the feature for Node.js version >=22.10.0
wconti27 pushed a commit that referenced this pull request Apr 9, 2025
* Upgrade to profiler 5.7.0 for asyncId collection capability

* Add a config option to disable async ID collection

* Alphabetically sort some long option lists

* Add async ID assertions to the integration test

* Only enable the feature for Node.js version >=22.10.0
szegedi added a commit that referenced this pull request Apr 30, 2025
BridgeAR pushed a commit that referenced this pull request Apr 30, 2025
* Revert "Constrain async ID collection to main thread only (#5559)"

This reverts commit 5fd6690.

* Revert "Start collecting async IDs in profiles (#5524)"

This reverts commit 9ec7ef9.

* Restore actually useful test code change
dd-trace-js bot pushed a commit that referenced this pull request May 1, 2025
* Revert "Constrain async ID collection to main thread only (#5559)"

This reverts commit 5fd6690.

* Revert "Start collecting async IDs in profiles (#5524)"

This reverts commit 9ec7ef9.

* Restore actually useful test code change
szegedi added a commit that referenced this pull request May 1, 2025
* Revert "Constrain async ID collection to main thread only (#5559)"

This reverts commit 5fd6690.

* Revert "Start collecting async IDs in profiles (#5524)"

This reverts commit 9ec7ef9.

* Restore actually useful test code change
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants