Description
Hi,
We have a weird issue where users report crashes, but so far all my attempts to reproduce it have failed. The crash seems to happen in the destructor of the websocket callback client, and the stack traces are always the same. Here is one example:
Thread 64 Crashed:: Dispatch queue: com.apple.root.default-qos
0 libsystem_kernel.dylib 0x00007fff960630ae __pthread_kill + 10
1 libsystem_pthread.dylib 0x00007fff96cc6500 pthread_kill + 90
2 libsystem_c.dylib 0x00007fff9a11c41b __abort + 145
3 libsystem_c.dylib 0x00007fff9a11c38a abort + 144
4 libc++abi.dylib 0x00007fff8bc52f81 abort_message + 257
5 libc++abi.dylib 0x00007fff8bc7896a default_terminate_handler() + 46
6 com.cisco.SparkMacDesktop 0x000000010261ad09 CLSTerminateHandler() + 270
7 libc++abi.dylib 0x00007fff8bc7619e std::__terminate(void (*)()) + 8
8 libc++abi.dylib 0x00007fff8bc7622d std::terminate() + 77
9 libcpprest.2.7.dylib 0x00000001033b1ba9 web::websockets::client::details::wspp_callback_client::~wspp_callback_client() + 889
10 libc++.1.dylib 0x00007fff90411cb8 std::__1::__shared_weak_count::__release_shared() + 44
11 com.cisco.SparkMacDesktop 0x000000010275ba4f MercuryManager::_websocketConnect(int) + 1705
12 com.cisco.SparkMacDesktop 0x0000000102675e75 std::__1::__function::__func<pplx::details::_MakeVoidToUnitFunc(std::__1::function<void ()> const&)::'lambda'(), std::__1::allocator<pplx::details::_MakeVoidToUnitFunc(std::__1::function<void ()> const&)::'lambda'()>, unsigned char ()>::operator()() + 13
13 com.cisco.SparkMacDesktop 0x00000001027619f9 pplx::details::_PPLTaskHandle<unsigned char, pplx::task::_InitialTaskHandle<void, MercuryManager::websocketConnect(int)::$_0, pplx::details::_TypeSelectorNoAsync>, pplx::details::_TaskProcHandle>::invoke() const + 265
14 com.cisco.SparkMacDesktop 0x00000001026508f3 pplx::details::_TaskProcHandle::RunChoreBridge(void*) + 19
15 libdispatch.dylib 0x00007fff95f123c3 _dispatch_client_callout + 8
16 libdispatch.dylib 0x00007fff95f16253 _dispatch_root_queue_drain + 1890
17 libdispatch.dylib 0x00007fff95f15ab8 _dispatch_worker_thread3 + 91
18 libsystem_pthread.dylib 0x00007fff96cc34f2 _pthread_wqthread + 1129
19 libsystem_pthread.dylib 0x00007fff96cc1375 start_wqthread + 13
Now, the interesting bit is that I can see from the Casablanca code that some engineer might have run into this issue before, because there seems to be a specific attempt to "handle" it:
...
switch (m_state) {
case DESTROYED:
// This should be impossible
std::abort();
...
(I am not fully sure this actually works, because if the object is indeed dead then accessing m_state might already be a bad idea, but anyway, that's not the point I am trying to make here.)
Based on the stack trace and the code in ws_client_wspp.cpp, this looks like a smoking gun: someone tries to delete the object twice and the hard-coded abort() fires.
Now, as much as we can ever be sure, I am pretty sure that we, the consuming app, are not doing this double delete. We hold a unique pointer to a websocket callback client. When the close handler (set up via set_close_handler) is invoked, for example when I lose network, we create a new PPL task, wait a bit, and attempt to restore the websocket connection. To do that, we create a brand new callback client (with make_unique) and replace the existing member field with this new pointer. At this point the old callback client, now in the CLOSED state, is released and its destructor is called.
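To make the ownership pattern described above concrete, here is a minimal single-threaded sketch. It does not use cpprestsdk at all: FakeClient, FakeManager, State and destructor_calls_after_reconnect are hypothetical stand-ins invented for illustration, and the real app does the replacement from a PPL task rather than inline.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the websocket callback client (NOT the cpprestsdk type).
enum class State { Created, Connected, Closed, Destroyed };

struct FakeClient {
    State state = State::Created;
    int* destructor_calls;  // counter owned by the caller, outlives the client

    explicit FakeClient(int* counter) : destructor_calls(counter) {}

    ~FakeClient() {
        state = State::Destroyed;
        ++*destructor_calls;  // record every destructor invocation
    }
};

// Hypothetical owner that mirrors the reconnect pattern from the post.
struct FakeManager {
    std::unique_ptr<FakeClient> client;

    void reconnect(int* counter) {
        // Assigning a new unique_ptr releases the previously owned client,
        // which runs its destructor once.
        client = std::make_unique<FakeClient>(counter);
    }
};

// Returns how many times the first client's destructor ran after one reconnect.
int destructor_calls_after_reconnect() {
    int old_calls = 0;
    int new_calls = 0;
    FakeManager manager;
    manager.client = std::make_unique<FakeClient>(&old_calls);
    manager.client->state = State::Closed;  // pretend the close handler fired
    manager.reconnect(&new_calls);
    assert(new_calls == 0);  // the replacement client is still alive here
    return old_calls;
}
```

Calling destructor_calls_after_reconnect() in this sketch reports a single destructor run for the old client. The sketch deliberately does not model several tasks replacing the member concurrently, so it says nothing about what happens in that case.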
I simply see no way in C++ for the destructor to be invoked twice. Do you have any idea what's going on here? I presume there were some issues in this area, and that's why the code path with abort() is there in the destructor?
Regards,
Gergely