Skip to content

Perl SEGV #23326

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ancientwizard opened this issue May 25, 2025 · 15 comments
Open

Perl SEGV #23326

ancientwizard opened this issue May 25, 2025 · 15 comments

Comments

@ancientwizard
Copy link

ancientwizard commented May 25, 2025

Module:

Description
Perl SEGV's during SIGCHLD.

BTW I opened #23241 before but that issue has been fixed by a patch and pull request submitted the DBD::Oracle myself. This isn't an Oracle only issue. Apparently and dynamic library that creates threads could find their way into Perl land trying to service an interrupt.

Steps to Reproduce
The steps to reproduce are reported within perl5-dbi/DBD-Oracle#192
I have added a test that reproduces the SEGV t/92-segv-fork.t

Expected behavior
Perl should not SEGV

Perl configuration

CFLAGS="-fsanitize=address -g -fno-omit-frame-pointer" LDFLAGS="-fsanitize=address" \
./Configure -des -Dusethreads -Duse64bitint -Duse64bitall -Dprefix=/your-install-path \
      -Alddlflags="-shared -g -O0 -L/usr/local/lib -fstack-protector-strong" \
      -Accflags="-g -O0"
@tonycoz
Copy link
Contributor

tonycoz commented May 25, 2025

Possibly fixed by 85e9706

@ancientwizard
Copy link
Author

Possibly fixed by 85e9706

So you're suggesting that the Oracle instant client lib starts one or more additional non Perl threads that may leak into Perl land causing mayhem? That wouldn't seem like the typical M.O. of a dynamic library.

@tonycoz
Copy link
Contributor

tonycoz commented May 26, 2025

It's only a problem when you mix library worker threads with signals.

Your backtrace:

#6  0x00007fffeeff1ec9 in kpucpincrtime () from /usr/lib/oracle/23/client64/lib/libclntsh.so.23.1
#7  0x00007ffff7bb51ca in start_thread (arg=<optimized out>) at pthread_create.c:479
#8  0x00007ffff6e5d8d3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

shows this isn't a perl thread.

@ancientwizard
Copy link
Author

It's only a problem when you mix library worker threads with signals.

Your backtrace:

#6  0x00007fffeeff1ec9 in kpucpincrtime () from /usr/lib/oracle/23/client64/lib/libclntsh.so.23.1
#7  0x00007ffff7bb51ca in start_thread (arg=<optimized out>) at pthread_create.c:479
#8  0x00007ffff6e5d8d3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

shows this isn't a perl thread.

Thanks for the observation. I thought the GDB output seemed to suggest this wasn't a main thread; but I wasn't exactly ready to believe because I'm not creating any additional Perl threads. I have confirmed that once the Perl has connected and then disconnected from the DB the process does have an additional thread. I can easily see it by inspecting /proc/$$/status. Once created my app is extremely suspectable to SEGV's.

I'm in no position to coerce or convince the Perl maintainers to make change to Perl's SIG handling. ALSO: I have yet to find a way to tell the OCI to
A) not create threads
B) keep its threads on a leash!

I suppose there is a reason Oracle didn't have a built-in solution to funnel signals to the main thread; starting with they know a whole lot more than I do on the subject.

@xenu
Copy link
Member

xenu commented May 26, 2025

So does this still happen on blead or not? It's not clear which Perl version you're using.

@bulk88
Copy link
Contributor

bulk88 commented May 26, 2025

Possibly fixed by 85e9706

I read somewhere online (gossip), pthreads wants to or did actually deploy to prod a change, that their pthread_t type is now a 16 byte struct and not a 8 byte size_t opaque pointer. I forgot the rational but either it removes 1 level of ptr derefs, or a pthread_t object has no destructor and no memory behind it anymore, its not ref counted, its not malloced, its pass by copy, with a benefit of infinite CPU core count scaling because there is no more user mode global lock for getter/setter methods for pthread metadata/TLS.

Upgrading to 16 byte structs did cause widespread generic common POSIX app breakage, because suddenly == C operator syntax errors.

Google brings up https://forums.oracle.com/ords/apexds/post/issue-with-system-call-with-libclntsh-so-19-1-0673 and SIGCHILD drama.

@bulk88
Copy link
Contributor

bulk88 commented May 26, 2025

Thanks for the observation. I thought the GDB output seemed to suggest this wasn't a main thread; but I wasn't exactly ready to believe because I'm not creating any additional Perl threads. I have confirmed that once the Perl has connected and then disconnected from the DB the process does have an additional thread. I can easily see it by inspecting /proc/$$/status. Once created my app is extremely suspectable to SEGV's.

I'm in no position to coerce or convince the Perl maintainers to make change to Perl's SIG handling. ALSO: I have yet to find a way to tell the OCI to A) not create threads B) keep its threads on a leash!

I suppose there is a reason Oracle didn't have a built-in solution to funnel signals to the main thread; starting with they know a whole lot more than I do on the subject.

SP my API names before using them. dTHX;/Perl_get_context();/ENTER/SAVETEMPS/PUSHMARK/Perl_call_sv() executing on a random OS thread, from a OS/LibC thread pool, is always user error. And a very common recurrent mousetrap bug for all Perl CPAN XS code running on Windows. Without detaching a my_perl ptr from 1 OS thread, pause/sleeping/idling the old OS thread, and attach the my_perl ptr to a new OS thread with Perl_set_context(), any other behavior is illegal/invalid.

1 my_perl ptr can be bound to exactly 1 OS TID at any given time.

WinPerl has an my_perl ptr of last resort for signal handlers and a very small % of all possible permutations of thread pool accidents (most will just SEGV on WinPerl). See PL_curinterp specifically. IDK what Perl on other OSes do. Perl also has a "safe signals" API that packetizes/async (0.25-30 ms away) resends the unix signal event back to the root my_perl, to execute at a safe control flow location inside the run loop. its called PERL_ASYNC_CHECK().

interpreter * perl_alloc_using(IPerlMem **ipM, IPerlMem **ipMS, IPerlMem **ipMP, IPerlEnv **ipE, IPerlStdIO **ipStd, IPerlLIO **ipLIO, IPerlDir **ipD, IPerlSock **ipS, IPerlProc **ipP)
{
  interpreter *my_perl;
  _PerlIO **fd;
  interpreter *result;

  my_perl = (interpreter *)(*ipM)->pCalloc(ipM, 1ui64, 4816ui64);
  if ( PL_curinterp )
  {
    Perl_set_context(my_perl);
  }
  else
  {
    PL_curinterp = my_perl;
    if ( my_perl && !PL_veto_switch_non_tTHX_context )
      Perl_switch_locale_context(my_perl);
    PL_thr_key = TlsAlloc();
    if ( PL_thr_key == -1 )
    {
      fd = Perl_PerlIO_stderr(my_perl);
      PerlIO_printf(fd, "panic: TlsAlloc");
      exit(1);
    }
    Perl_set_context(my_perl);
    InitializeCriticalSection(&PL_op_mutex);
    InitializeCriticalSection(&PL_check_mutex);
    InitializeCriticalSection(&PL_keyword_plugin_mutex);
    InitializeCriticalSection(&PL_hints_mutex);
    InitializeCriticalSection(&PL_locale_mutex.lock);

@jkeenan
Copy link
Contributor

jkeenan commented May 26, 2025

@tonycoz, is this a perl problem or a DBD-Oracle problem? If the latter, then we should direct the OP to a more appropriate place to file this bug report. Thanks.

@ancientwizard
Copy link
Author

ancientwizard commented May 26, 2025

@tonycoz, is this a perl problem or a DBD-Oracle problem? If the latter, then we should direct the OP to a more appropriate place to file this bug report. Thanks.

This is a Perl vs libraries that create non Perl worker threads. PERL can't cope with non Perl threads entering its domain.

@tonycoz
Copy link
Contributor

tonycoz commented May 26, 2025

@tonycoz, is this a perl problem or a DBD-Oracle problem? If the latter, then we should direct the OP to a more appropriate place to file this bug report. Thanks.

It appears to be the same as #22487 which is fixed in blead.

We won't know if it fixes it for OP until they test it, though they might run into other DBD::Oracle issues along the way.

Edit: if you're using a vendor perl you might be able to convince them to backport 85e9706

@tonycoz
Copy link
Contributor

tonycoz commented May 26, 2025

I read somewhere online (gossip), pthreads wants to or did actually deploy to prod a change, that their pthread_t type is now a 16 byte struct and not a 8 byte size_t opaque pointer

pthreads isn't a specific implementation, but a specification. That specification does allow for non-scalar pthread_t which is why the test code in 85e9706 uses pthread_equal().

Though I kind of doubt any implementation does use non-scalar pthread_t - it would break too much code.

I thought the GDB output seemed to suggest this wasn't a main thread; but I wasn't exactly ready to believe because I'm not creating any additional Perl threads.

The thread start function for the thread is in libclntsh.so.23.1 rather than in the perl binary, libperl, or threads.so (the loadable object that does perl threads)

@bulk88
Copy link
Contributor

bulk88 commented May 28, 2025

Though I kind of doubt any implementation does use non-scalar pthread_t - it would break too much code.

https://www.ibm.com/docs/en/open-xl-c-cpp-zos/2.1.0?topic=guide-binary-compatibility

Internal type went from char thd [8] to void * I guess.

https://github.com/nginx/unit/blob/master/src/nxt_thread_id.h#L171 good reference.

It seems from my opinion, to say, certain pthread impls, pack a ptr and a 0 based kernel tid or a certain readable or unreadable memory address that the kernel returns as a "thread obj ref", together in 1 void * or into 8 bytes. The bitfield space is gained by extreme alignment of the master pthread struct, at 4096, or 10s of KBs to free up space in the void * for the kernel TID. The platform/CPU/OS might be lacking a dedicated TLS register, and having to ask the kernel each time for an abstract TID, and somehow convert the abstract number back into their large pthread internal C struct.

@ancientwizard
Copy link
Author

I have tested the patch placed in "bleed" (mentioned above) that forces Perl "WOOL' on to the non-Perl thread and then allows it to continue to handle the interrupt. I've also tested an alternative (though a real Perl internals developer should make it conform to Perl standards); Both work well to eliminate the SEGV when a non-Perl worker thread is called upon to enter Perl land to handle an interrupt. For the time being I'll apply said patch to any Perl I build within my organization as I can't use bleed there.

The alternative is shown here: perl5-dbi/DBD-Oracle#192

@tonycoz
Copy link
Contributor

tonycoz commented Jun 1, 2025

The patch you show as applied to blead in perl5-dbi/DBD-Oracle#192 isn't what was applied to blead. blead (5.42 to be) currently uses pthread_kill() to forward the signal to the main thread.

@ancientwizard
Copy link
Author

The example I used was something I made up (the alternative) simply to convey the intent not the final code. I'll have another look and if there is a bleed fix that passes the interrupt along that's a better solution and why I offered the example SU-DO code.

As for the the original approach that someone claimed to put in bleed is on them. I saw the code and the comment they made and I tested it. I don't really know anything about bleed. I just tried to connect the dots so others could follow If they wished.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants