Commit 79e7db7

[NFC][SYCL] Speed up device_impl::CallOnceCache on fast path for libstdc++ (#18597)
libstdc++'s implementation of `std::call_once` isn't as performant as it could be, for ABI compatibility reasons (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66146#c53). We can optimize the fast path at the cost of extra memory and a slower slow path, both of which matter less. Based on the generated code (https://godbolt.org/z/1YaW5xozY), I wouldn't be surprised if the same change helped on Windows, but I'd prefer to investigate and implement that in a separate PR if necessary.
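The fast-path guard described in this message can be sketched in isolation as follows. This is a minimal illustration, not the runtime's code: `GuardedOnce`, `run_guarded_once_demo`, and the string payload are hypothetical names chosen for the example. An acquire load on an extra atomic flag lets already-initialized reads skip `std::call_once` entirely, while `std::call_once` still provides the exactly-once guarantee on the slow path.

```cpp
#include <atomic>
#include <cassert>
#include <cstdlib>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for one cached entry; not the SYCL runtime type.
struct GuardedOnce {
  std::once_flag flag;
  std::atomic_bool initialized{false};
  std::string value;

  template <typename Init> const std::string &get(Init &&init) {
    // Fast path: once the value is published, only this acquire load runs.
    if (!initialized.load(std::memory_order_acquire)) {
      // Slow path: std::call_once still guarantees a single initialization,
      // even if several threads observe `initialized == false` at once.
      std::call_once(flag, [&] {
        init(value);
        // Publish the value; pairs with the acquire load above.
        initialized.store(true, std::memory_order_release);
      });
    }
    return value;
  }
};

// Hypothetical driver: hammers one entry from several threads and reports
// how many times the initializer actually ran.
int run_guarded_once_demo() {
  GuardedOnce entry;
  std::atomic<int> init_calls{0};
  auto init = [&](std::string &out) {
    ++init_calls;
    out = "cached";
  };

  std::vector<std::thread> threads;
  for (int i = 0; i < 8; ++i) {
    threads.emplace_back([&] {
      for (int j = 0; j < 1000; ++j)
        if (entry.get(init) != "cached")
          std::abort(); // Would indicate a reader saw an unpublished value.
    });
  }
  for (auto &t : threads)
    t.join();
  return init_calls.load();
}
```

Threads that lose the race either block inside `std::call_once` until the winner finishes or observe the release store; in both cases the write to `value` happens before their read.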
1 parent b91d3e2 commit 79e7db7

1 file changed: +30 -0 lines changed
sycl/source/detail/device_impl.hpp

Lines changed: 30 additions & 0 deletions
@@ -220,13 +220,32 @@ class device_impl : public std::enable_shared_from_this<device_impl> {
     }
   };

+#if defined(_GLIBCXX_RELEASE)
+  // libstdc++'s std::call_once is significantly slower than libc++
+  // implementation (30-40ns for libc++ CallOnceCache/EagerCache vs 50-60ns for
+  // CallOnceCache when using libstdc++ for queries of simple types like
+  // `ur_device_usm_access_capability_flags_t`). libc++ implements it via
+  // `__cxa_guard_*` (same as function static variables initialization) but
+  // libstdc++ cannot do that without an ABI break:
+  // https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66146#c53.
+  //
+  // We do care about performance of the fast path and can pay extra costs in
+  // memory/slow-down during single init call, so add an extra flag to optimize.
+#define GUARD_STD_CALL_ONCE_WITH_EXTRA_CHECK 1
+#else
+#define GUARD_STD_CALL_ONCE_WITH_EXTRA_CHECK 0
+#endif
+
   // CallOnce - initialize on first query, but exactly once so that we could
   // return cached values by reference. Important for `std::vector` /
   // `std::string` values where returning cached values by value would cause
   // heap allocations.
   template <typename Desc> struct CallOnceCached {
     std::once_flag flag;
     typename Desc::return_type value;
+#if GUARD_STD_CALL_ONCE_WITH_EXTRA_CHECK
+    std::atomic_bool initialized = false;
+#endif
   };

   template <typename Initializer, typename... Descs>
@@ -241,13 +260,24 @@ class device_impl : public std::enable_shared_from_this<device_impl> {

     template <typename Desc> decltype(auto) get() {
       auto &Entry = *static_cast<CallOnceCached<Desc> *>(this);
+#if GUARD_STD_CALL_ONCE_WITH_EXTRA_CHECK
+      if (!Entry.initialized.load(std::memory_order_acquire)) {
+        std::call_once(Entry.flag, [&]() {
+          Initializer::template init<Desc>(device, Entry.value);
+          Entry.initialized.store(true, std::memory_order_release);
+        });
+      }
+#else
       std::call_once(Entry.flag, Initializer::template init<Desc>, device,
                      Entry.value);
+#endif
       // Extra parentheses to return as reference (see `decltype(auto)`).
       return (std::as_const(Entry.value));
     }
   };

+#undef GUARD_STD_CALL_ONCE_WITH_EXTRA_CHECK
+
   // get_info and get_info_impl need to know if a particular query is cacheable.
   // It's easier if all the cache instances (eager/call-once * UR/SYCL) are
   // merged into a single object.
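
The surrounding cache structure visible in this diff (one `CallOnceCached<Desc>` base per descriptor, selected with a `static_cast` on `this`, and a `get()` that returns the cached value by const reference) can be sketched standalone. Everything named here is a hypothetical stand-in: `NameDesc`, `SizesDesc`, `Cached`, `DemoInitializer`, and `MergedCache` are not the runtime's types, and the `device` argument that the real initializer takes is omitted.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical descriptors: each names the type of the value it caches.
struct NameDesc { using return_type = std::string; };
struct SizesDesc { using return_type = std::vector<int>; };

// One cache slot per descriptor.
template <typename Desc> struct Cached {
  std::once_flag flag;
  typename Desc::return_type value;
};

// Hypothetical initializer; counts calls so exactly-once can be checked.
struct DemoInitializer {
  static inline int calls = 0;
  template <typename Desc>
  static void init(typename Desc::return_type &out) {
    ++calls;
    if constexpr (std::is_same_v<Desc, NameDesc>)
      out = "demo-device";
    else
      out = {1, 2, 3};
  }
};

// All slots merged into a single object via variadic inheritance.
template <typename Initializer, typename... Descs>
struct MergedCache : Cached<Descs>... {
  template <typename Desc> decltype(auto) get() {
    // Pick the base-class slot that belongs to this descriptor.
    auto &Entry = *static_cast<Cached<Desc> *>(this);
    std::call_once(Entry.flag,
                   [&] { Initializer::template init<Desc>(Entry.value); });
    // Returned as `const return_type &`, so no copy or heap allocation.
    return (std::as_const(Entry.value));
  }
};
```

Repeated calls to `get<NameDesc>()` on the same `MergedCache` yield the same address, which is what makes returning `std::string` or `std::vector` values from the cache allocation-free.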
