-
Notifications
You must be signed in to change notification settings - Fork 341
Start collecting async IDs in profiles #5524
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Overall package sizeSelf size: 9.24 MB Dependency sizes| name | version | self size | total size | |------|---------|-----------|------------| | @datadog/libdatadog | 0.5.0 | 29.83 MB | 29.83 MB | | @datadog/native-appsec | 8.5.1 | 19.26 MB | 19.27 MB | | @datadog/native-iast-taint-tracking | 3.3.0 | 13.77 MB | 13.78 MB | | @datadog/pprof | 5.7.0 | 9.89 MB | 10.26 MB | | @opentelemetry/core | 1.30.1 | 908.66 kB | 7.16 MB | | protobufjs | 7.4.0 | 2.77 MB | 5.42 MB | | @datadog/wasm-js-rewriter | 3.1.0 | 2.37 MB | 2.52 MB | | @datadog/native-metrics | 3.1.0 | 1.06 MB | 1.46 MB | | @opentelemetry/api | 1.8.0 | 1.21 MB | 1.21 MB | | import-in-the-middle | 1.13.1 | 117.64 kB | 839.26 kB | | source-map | 0.7.4 | 226 kB | 226 kB | | opentracing | 0.14.7 | 194.81 kB | 194.81 kB | | lru-cache | 7.18.3 | 133.92 kB | 133.92 kB | | pprof-format | 2.1.0 | 111.69 kB | 111.69 kB | | @datadog/sketches-js | 2.1.1 | 109.9 kB | 109.9 kB | | lodash.sortby | 4.7.0 | 75.76 kB | 75.76 kB | | ignore | 5.3.2 | 53.63 kB | 53.63 kB | | istanbul-lib-coverage | 3.2.0 | 29.34 kB | 29.34 kB | | rfdc | 1.4.1 | 27.15 kB | 27.15 kB | | @isaacs/ttlcache | 1.4.1 | 25.2 kB | 25.2 kB | | tlhunter-sorted-set | 0.1.0 | 24.94 kB | 24.94 kB | | dc-polyfill | 0.1.6 | 24.56 kB | 24.56 kB | | shell-quote | 1.8.2 | 23.54 kB | 23.54 kB | | limiter | 1.1.5 | 23.17 kB | 23.17 kB | | retry | 0.13.1 | 18.85 kB | 18.85 kB | | semifies | 1.0.0 | 15.84 kB | 15.84 kB | | jest-docblock | 29.7.0 | 8.99 kB | 12.76 kB | | crypto-randomuuid | 1.0.0 | 11.18 kB | 11.18 kB | | ttl-set | 1.0.0 | 4.61 kB | 9.69 kB | | path-to-regexp | 0.1.12 | 6.6 kB | 6.6 kB | | koalas | 1.0.2 | 6.47 kB | 6.47 kB | | module-details-from-path | 1.0.3 | 4.47 kB | 4.47 kB |🤖 This report was automatically generated by heaviest-objects-in-the-universe |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #5524 +/- ##
==========================================
+ Coverage 79.18% 79.27% +0.08%
==========================================
Files 512 512
Lines 23158 23179 +21
==========================================
+ Hits 18338 18374 +36
+ Misses 4820 4805 -15 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Datadog ReportBranch report: ✅ 0 Failed, 928 Passed, 0 Skipped, 14m 33.72s Total Time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests are actually failing. It seems like the wall.js options have to be changed to know about the option.
BenchmarksBenchmark execution time: 2025-04-07 12:38:15 Comparing candidate commit bbb17e6 in PR branch Found 0 performance improvements and 0 performance regressions! Performance is the same for 946 metrics, 17 unstable metrics. |
Pull request was converted to draft
assert.equal(spans.size, 9, encoded) | ||
assert.equal(asyncIds.size, 9, encoded) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this could even be zero due to the IDs sometimes missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We collect multiple samples for each span – we'll have a total of 9 spans, each span runs for ~100ms, and we sample at 99Hz, so statistically each span will have ~9.9 samples (I might increase the span execution time to 102ms which'd bring the expected sample value a tad over 10.) So we'd need to fail to capture the async ID in all 9-10 samples for a span in order for a particular async ID to be completely absent from this set.
Unfortunately, this is exactly what we see on Ubuntu profiler tests now: it observes 2-3 async IDs. Not sure what to make of it, but I'll be inspecting this by running some tests locally on an arm Ubuntu VM to try to reproduce.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a closer look at the profiling GH workflow though it turns out that we only run profiling tests with active-lts
version of Node on macOS, but additionally run them with oldest-maintenance-lts
and newest-maintenance-lts
on Linux.
Oldest maintenance is 18.20.8 and for it the test indeed fails even on my local macOS. It also fails for 20.19.0, but it passes for 22.14.0. I guess one way to fix the issue is to only enable the feature on Node 22 and above. It's an interesting question though why does node::AsyncHooksGetExecutionAsyncId()
work reliably on Node 22 but not on earlier ones. Do you maybe have some information on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I bisected on minor node versions and 22.10.0 is the first one where the test reliably passes, and 22.9.0 the last one where it reliably fails. The functions that execute here: AsyncHooksGetExecutionAsyncId
and its callees Environment::GetCurrent
and Environment::execution_async_id
don't seem like they had a code change in years. Except for nodejs/node@d6bcc44 which did indeed go into 22.10.0 but I'd be surprised if that was the one that made the difference.
c9eb137
to
abf8de9
Compare
abf8de9
to
bbb17e6
Compare
* Upgrade to profiler 5.7.0 for asyncId collection capability * Add a config option to disable async ID collection * Alphabetically sort some long option lists * Add async ID assertions to the integration test * Only enable the feature for Node.js version >=22.10.0
* Upgrade to profiler 5.7.0 for asyncId collection capability * Add a config option to disable async ID collection * Alphabetically sort some long option lists * Add async ID assertions to the integration test * Only enable the feature for Node.js version >=22.10.0
What does this PR do?
@datadog/pprof
to 5.7.0DD_PROFILING_ASYNC_ID_ENABLED
env var to regulate the feature. It defaults to true when timelines are enabled. We have this env var mostly so we have a lever to give to customers to turn the feature off if it causes problems. We do not anticipate problems. (An earlier implementation of this feature did not guard against trying to retrieve the async ID while V8 was doing garbage collection. That caused crashes, and it has been fixed in pprof 5.7.0.)Motivation
We have an insight in the profiling backend for detecting long-running event callbacks that block other callbacks from running on the event loop thread. An initial version of this insight uses span IDs associated with profiling samples to identify streaks of adjacent samples that belong to the same invocation. We have since upgraded the backend algorithm to also be able to use execution async IDs when available. Execution async ID is a better datum for this for two reasons:
Additional information
Additional Notes
Jira: PROF-11107