-
Notifications
You must be signed in to change notification settings - Fork 581
Win32: htonl/htons/ntohl/ntohs change slow winsock exports -> 1 CPU op/ins #23330
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: blead
Are you sure you want to change the base?
Conversation
PostgreSQL found this exact performance problem on Windows OS (WinSock.dll problem) and on MSVC (the & << | macro and final mach code output) and fixed both in September 2017. Read entire thread, someone accuses GCC on Linux -O2 (year 2017) of emitting "very low quality" machine code for GCCs/Linuxs attempt to do an inlined https://www.postgresql.org/message-id/20170914063418.sckdzgjfrsbekae4%40alap3.anarazel.de update: with my buggy VC 2022 v19.36,
I am going pretend Storable.pm is not a P5P maintained product in the next sentence. The 0.001% defect is |
I have no strong opinions on whether this commit is the best solution to the problem being addressed (I tend to steer clear of Win32 stuff), but I am strongly opposed to the proposed commit message. It is verbose and unclear, and makes it very difficult to see at a glance what is being changed. Here's an example of what I think such a commit message should look like. I'm not proposing this as the actual text of the commit message, since I don't understand the real details of the commit. It's to give you an idea of what I regard as a suitable commit message. Win32: speed up htons etc byte-order swapping Various part of perl, in particular pack() and Storable, make extensive use of the htonl/htons/ntohl/ntohs library functions to swap the byte order of short and long ints. Unfortunately in Win32 these functions can often be relatively slow, for the reasons described below. This commit introduces a new define, PERL_MY_HOST_NET_BYTE_SWAP (set by default). If set, then htons etc, which were formerly mapped to win32_htonl etc, are now mapped to the pre-existing inline my_swap32 etc functions. On non-win32 platforms, nothing has changed. ... long explanation of the issue ... |
# define HAS_HTONLL | ||
# if (BYTEORDER & 0xffff) == 0x4321 | ||
# define ntohll(x) ((x)&0xFFFFFFFFFFFFFFFF) | ||
# define htonll(x) ntohll(x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't actually thing we need ntohll
and htonll
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/Perl/perl5/blob/blead/pp_pack.c#L146
https://github.com/Perl/perl5/blob/blead/pp_pack.c#L240
https://github.com/Perl/perl5/blob/blead/pp_pack.c#L3026
https://github.com/Perl/perl5/blob/blead/pp_pack.c#L2966
PP pack(), is doing the worst possible algorithm to swap byte order for 'Q'/U64.
https://github.com/Perl/perl5/blob/blead/ext/B/B.xs#L1608
B:: wants htonll at this line and is DIYing it.
https://github.com/Perl/perl5/blob/blead/cpan/Digest-SHA/src/sha.h#L91
Digest::SHA wants htonll at this line and is DIYing it.
https://github.com/Perl/perl5/blob/blead/dist/Storable/Storable.xs#L1059
Storable wants htonll at this line and is DIYing it.
https://github.com/Perl/perl5/blob/blead/ext/XS-APItest/APItest.xs#L7881
XS::APItest and P5P's default hash algo S_perl_siphash_seed_state()
want it on BE CPUs
https://grep.metacpan.org/search?q=ntohll&qft=*.xs%2C*.c%2C*.h&qd=&qifl=
atleast 5 CPAN modules are DIYing it or using OS version or it, there are more, im using a narrow regexp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't actually thing we need
ntohll
andhtonll
Whats your thoughts about evangelizing to CPAN authors identifiers my_swap64
or htonll
or PerlSock_htonll
?
_ # 1 and # 3 are P5P proprietary. # 2 is filling in missing POSIX APIs, where the perl VM impl, will get (good thing) accidental usage by CPAN XS authors, who dont know, dont care, or bare minimum care about P5P invented identifiers.
These XS authors are beginner or low effort or FT non-Perl C devs but low time people. So they will use max POSIX C/Linux C tokens that they are comfortable with where possible, vs the P5P invented "portable" C tokens that are an alien language from an extraterrestrial planet to them.
The idea of the change sounds like a very good idea, but I can't help wondering if it can't be wired more simply. |
What are your thoughts on dropping the "disable the hook/optimization and use slow OS htonl" build option nobody on WinOS or AnyOS will ever use or know its there? It will reduce the visual complexity and length of this patch if I do that. It exists because of some tests/benchmarks analysis I was doing in writing the patch, and I decided to not delete it, although I could've/should've. Remember the hook is turned on by P5P only for know problem platforms (currently WinOS, all CCs), never for all platforms. |
…order Changing BE/LE byte order is a very common operation used in many places inside libperl, and some core bundled XS modules, and in many CPAN XS modules. Storable.xs and PP pack()/unpack() are the largest most frequent users of byte swapping in Perl interp git repo. Some Perl interp repo .c or .h or .xs files DIY their own byte swap algorithms with macros or static inline functions. Others like Storable.xs and PP pack()/unpack() use the htonl(), htons(), ntohs(), and ntohs() macros/functions provided by the OS/CC, or depending on config.sh, equivelent polyfills from perl.h. On all modern CPU archs, i386/ARM/x64/PowerPC, there is a dedicate CPU instruction for doing it. Perl on Linux compiled with Clang or GCC, automatically is using the appropriate inlined intrinsic CPU opcode when using the CCs/OSes htonl(). Those 2 CCs also recogize perl.h's my_swap32() algorithm, and all other .h/.c files that use that statement as a euphemism to inline htonl() to 1 CPU opcode. On Windows, the situation is very bad compare to the above. The htonl(), htons(), ntohs(), and ntohs() functions/macros, provided to Perl from both MSVC and GCC, are exceptionally slow and inefficient, and unoptimized. MSVC has further problems, generating very inefficient machine code to swap byte order, using the DIY shift and mask algorithm. Since day 1 of Win32/64, C symbols/tokens htonl(), htons(), ntohs(), and ntohs() have been extern "C" PE symbol table exported functions from ws2_32.dll. ws2_32.dll is AKA WinSock, Win32/64's front end public facing lib for TCP/IP sockets. It is not a light weight DLL, but loads multiple other DLLs, filter/FW DLLs, middleware and backendware DLLs. WinPerl delay loads (RTLD_LAZY) the Winsock DLL until the first attempt is made by [lib] perl5XX.dll to go on the WWW. These 4 functions do 1 things and 1 thing only, swap bytes around. Storable.xs and PP keywords pack/unpack heavily use these 4 functions, but they have nothing to do with ethernet, token ring, or TCPIP. They should be using the correct inlined single CPU instruction to do this. All modern CPUs (i386/x64/ARM) have a dedicated CPU opcode for BE/LE swapping. x86 introduced the 32 bit/U32 bswap instruction with the release of the i486. U16 variables can use i386's "ror eax, 8" (bitwise roll 8 bits). Both Mingw GCC and MSVC CCs, will never optimize htonl/htons/ntohl/ntohs to 1 opcode on Win32/64, ask for the C linker symbol, and you will get the linker symbol. So Storable.xs and PP pack/unpack on both CCs, are calling Winsock's exported functions to do the swap. This is slow since its a function call. Even worse, its not 1 opcode inside ws2_32.dll but 11 to 15 CPU ops. MSVC 2022 CC binaries prior to build number 19.37 (released Aug 2023), don't even recognize "((x & 0xFF)<<24)|((x>>24) & 0xFF)|((x & 0x0000FF00) <<8)|((x & 0x00FF0000)>>8)" as a C level euphemism for "I want the byte swap CPU instruction". Mingw GCC does recognize that macro to mean 1 CPU instruction, but Mingw GCC won't modify or optimize MS's htonl()/etc identifiers/symbols for you, you need to DIY that macro yourself to trigger the 1 CPU op optimization. To fix all this, hook the 4 functions with the CPP to prevent the 4 tokens from calling the Winsock export table implimentations. 2nd, with all MSVC CCs versions use the official proper MS specific way to do a "byte swap". Which is instrinsic function _byteswap_ulong(). Theoretically RtlUlongByteSwap() also exists, but its much less mentioned on MSDN and the general WWW, and lives in "wdm.h" which is intended for Ring 0 kernel drivers, so header clashes can occur now, or in the future, so pick the easier and simple _byteswap_ulong() category of intrinsics, not the Ring 0 ones. Since _byteswap_ulong() is an intrinsic function and not a macro, S_my_swap32() isn't needed to prevent multi-eval problems. htonll()/ntohll() were added simply because it was easy to write them, and multiple interp core .c files, and interp core .xs files want a 64 bit byte swap function/macro, since most people use 64 bit pointer CPUs. 2 different algorithms for htonll() are included, I picked the one with less C src code ops, but I didn't check if neither, 1 or the other, or both are auto-detected by GCC and Clang as a 64 bit byte-swap instrinsic request without formally asking for the instrinsic with a named identifier. The vtable hooks that CPerlHost/iperlsys.h use to implement psuedo-fork and ithreads, also had to be disabled for the 4 functions. Or else XSUB.h will redirect all CPAN XS mods (not -DPERL_CORE !), to the CPerlHost and iperlsys.h vtables, which then redirect to winsock, negating this fix. Other less-than-perfect CC/CPUs/OSes combos might be discovered in the future, so PERL_MY_HOST_NET_BYTE_SWAP define is cross OS and not Win32 specific. Rumors online say GCC only added its __builtin_bswap32() and matching that macro, in 2008/v4.0. So S_my_swap32() isn't being turned on for any OSes/CCs other than Mingw and MSVC on Win32, because "if it ain't broke, don't fix it". If the CCs and .h'es for Linux/Android/OSX are already perfect, if Perl attempts to hook, intercept, or use a token or identifier from 6 years ago, not 18 months ago, more harm (deoptimization) can happen that good. MSVC produced 11-15 CPU ops for S_my_swap32()'s "ISO C" macro, before VC 2022 19.37 Aug 2023. Winsock has the same exact machine code. I'll assume MS/other major Windows corporate users, assumed that "ISO C" byte swap macro is fundamentally wrong, since acknowleging that endianness exists violates "ISO C", and that code base is now "some Vendor's C", so why fix htonl() or that long non-CPU arch specific macro, instead of the intrinsic? That program already is aware of OS and Vendor and platform names and isn't portable
314ac03
to
002f8ba
Compare
GCC added pattern matching the DIYed |
This patch started when I saw this call stack while single stepping Storeable.pm/.xs. And to my horror, why on earth is Storeable.pm trying to go on the WWW. It has nothing to do with TCP/IP, or IP4/IP6. What Storeable.xs was really wanting to do is execute C function
htonl()
which is 100% okay safe sane normal. It wasn't trying to go online.But for personal performance wishlist reasons, I don't want to see ws2_32.dll getting mapped into addr space just for that tiny
htonl
function.sample of how Storable.xs uses htonl
Next problem, after CPP hooking htonl() and its friends, to perl.h's backup implementation of htonl(), MSVC 2022 x64 -O1 was producing horrible machine code, instead of x64's
bswap
instruction.disassembling
ws2_32.dll
's htonl shows the same bad machine code as above. Here is a screenshot I found of someone else complaining about using the long Pure ISO C macro with a modern MSVC CC. The flaw in this screenshot was a production grade.a
/.lib
compiled and distributed by another entity containing a-Od
-O0
machine code version of C++ ecosystem's htonl().Strawberry/Mingw correctly recognizes the pure ISO C expression as a euphemism for the current CPU's arch specific endian/byte swap instruction, regardless of its ascii identifier name for that CPU. BUT WinPerl built with Strawberry/Mingw, still will execute PE symbol table's extern C
htonl()
function ptr, if you typehtonl
in src code instead of typing the long Pure ISO C macro/expression. Ubuntulibperl.so
doesnt import a htonl C linker symbol at all.In writing and researching this PR on IRC there was a long talk
The VC 2022 I have and the VC 2022 TonyC has, werent producing the same mach code output.
eventually the different mach code output between 2 people both with "VC 2022" mystery was solved using godbolt's collection of very many (monthly) point releases of MSVC 2022 as
More research shows MS publicly want the public to write
_byteswap_uint64
,_byteswap_ulong
,_byteswap_ushort
, and not misuse/abuse functional calls from a source code compatibility layer in a C language TCPIP driver library (aka BSD sockets).So to fix everything, WinPerl, regardless of CC vendor, will always CPP hook from perl.h token
htonl
and its friends on WinOS, so thehtonl()
the linker symbol fromws2_32.dll
will never ever again be reachable from Perl in C. which forever, on WinOS, will always be coming from the export table. All MS CRT, past, present, and future, will forever use ascii namebyteswap_ulong
for what Linux API callshtonl
.byteswap_ulong()
on MSVC is both a CPU inline intrinsic, if you turn that intrinsic "on" during CC compile time, but it is also a real extern "C" function exported fromucrtbase.dll
, for spec compliance/GetProcAddress/Linker symbol/-O0 -Od
reasons.The
ucrtbase.dll
version I found on my system is just as terrible as thews2_32.dll
version. MS devs in .c/.cpp lang probably wrote the CPU arch independent formal linker symbol extern "C" backup function using the samereturn ((x & 0xFF) << 24) | ((x >> 24) & 0xFF) | ((x & 0x0000FF00) << 8) | ((x & 0x00FF0000) >> 8);
macro Perl is using.Now after detangling
ws2_32.dll
from WinPerl, next problem to fix is0.25 stars
codegen on VC 2022 build number 19.36 released May 16, 2023 and lower.So to fix that codegen problem, just replace that long macro with
byteswap_ulong()
andNow all WinPerl binaries are identical to what all LinPerl binaries have been doing since forever. Bug is fixed, defect removed. End of story.
Now my commit above, I didn't fully think out certain parts and they need input from the public.
Is there any reason for WinPerl to ever again use slow
htonl()
from WinSock.dll for any reason?Is this build option to revert back to slow
htonl()
in Windows only, a waste of screen space, tarball space, and repo space?I am keeping the Win32 MSVC specific intrinsic logic in perl.h right next to the P5P Pure ISO C polyfill version instead the byte swaping backends living in 2 different .h files.
I also believe, there are Linux/Commercial Unix C compilers out there, that also have this optimization bug, where
return ((x & 0xFF) << 24) | ((x >> 24) & 0xFF) | ((x & 0x0000FF00) << 8) | ((x & 0x00FF0000) >> 8);
is not regexp-ed by the closed source Unix CC at compile time, into the appropriate 1 opcode line CPU instruction for whatever CPU arch that Unix is running on. I think this optimization was invented by GCC, copy-catted by Clang because of public outrage, and doesn't exist in any other open or closed source C compilers.Its reasonable for the dev team of a C compiler, to assume, someone writing
return ((x & 0xFF) << 24) | ((x >> 24) & 0xFF) | ((x & 0x0000FF00) << 8) | ((x & 0x00FF0000) >> 8);
instead of reading the man page for their C compiler, is an incompetent fool and "We" (CC authors) will not correct such "user error/user incompetent" C code.memcpy()/strlen()/memcmp()
are more reasonable to assume they can turn into 1 CPU opcode inline intrinsic, since all 3 are just "1 node/1 C struct/1 AST node" inside the C compiler.return ((x & 0xFF) << 24) | ((x >> 24) & 0xFF) | ((x & 0x0000FF00) << 8) | ((x & 0x00FF0000) >> 8);
requires a regexp inside the C compiler matching atleast "23 nodes/23 C structs/23 AST nodes" before doing the optimization tobswap
instruction. Now the C compiler has anO(n*23)
CPU usage problem from what used to beO(1)
.BTW, for
htons()
on i386/x64, its done withroll16((U16) n, 8)
or(U16)n = ((U16)n) >ROLL> 8;
, I dont think abswap16
exists on Intel CPUs. But remember folks, ISO C, has never, and never will add math operators roll, pop_count(), count_leading_0s(), **, and count_leading_1s().exp()
is a linker symbol, not a C math operator, please contact the authors of man page oflibc.so
to learn more about linker symbolexp()
. Maybe it is an abbreviation of expires and opensgdb
instantly when you call it and pauses the process.I think its a very bad idea for perl.h to override htonl()/etc, 100% of the time, on all CCs and all CPUs and all OSes. Lets assume all CCs on earth do the RIGHT THING like GCC and Clang do, until some human uses a dis-assembler, and proves Foo C Compiler v0.0 -O2 did something wrong. The OS authors/OS .h authors/CC authors will always know more, and keep better routine maintenance and watch over what C token
htonl()
does under the hood on their own products/projects, than P5P volunteers can do.But !!!! P5P can't predict what current or future CCs, and what future CPU archs, will need perl.h to re-implement
htonl()
, hence I'm leaving provisions that Perl on Linux/Android/Solaris/AIX/IBM Z/OS, will need the same hooking WinPerl needs.My risk assessment is,
return ((x & 0xFF) << 24) | ((x >> 24) & 0xFF) | ((x & 0x0000FF00) << 8) | ((x & 0x00FF0000) >> 8);
will fail to optimize or fail to inline, in machine code of current or future RISCV CPUs and WASM CPUs. WASM is a real CPU ISA BTW. The spec was written so a hardware CPU can be lithographied that runs un-recompile WASM byte sequences right off the DDR ram sticks. ARM32 can execute the byte stream in unmodified.class
disk files remember, if your load the correct DRM license key into the ARM CPU at runtime to enable the feature.Also, the commit in this PR needs less src code comments.
But I don't know where and how to document this new "hooking htonl()" feature.
my failed attempt at humor that Im going to hide instead of deleting
What does "host" mean in the spec? Whose host, which address, which room and year in history?What is "endianness" in ISO C?
Your app has lost the portability game, it is now aware of the OS Vendor's ascii string name and aware of the ascii string name of a CPU arch, which is a big no no to "portable" SW. Remember Abstract OS v𝑖⌃² changes the byteorder using an RNG of your C program and its integer math operators each time your create a new process to run it, to 1 of 8 possible byte orders for doing math using C type
uint32_t
. Write a unit test involving,htonl()
,memcpy()
,fwrite()
, and a usb stick being walked between 2 random Linux or Unix systems anywhere on earth jkjkcommit msg below
Since day 1 of Win32/64, C symbols/tokens htonl(), htons(), ntohs(), and ntohs() have been extern "C" PE symbol table exported functions from ws2_32.dll. WinSock aka WinOS's front end public facing lib for TCP/IP along with alot of legacy support for a dozen other protocols that nobody remembers. These 4 functions do 1 things and 1 thing only, swap bytes around. Windows 16-64 has and forever will be a LE OS regardless of CPU archs/HW. Mixed endian systems don't exist. Undumping a process between BE/LE HW Unix boxes or any OS kernels doesn't exist.
Storable.xs and PP keywords pack/unpack heavily use these 4 functions, but they have nothing to do with ethernet, token ring, or TCPIP. All modern CPUs (i386/x64/ARM) have a dedicated CPU opcode for BE/LE swapping. x86 introduced the 32 bit/U32 bswap instruction with the release of the i486. U16 variables can use i386's "ror eax, 8" (bitwise roll 8 bits).
Both Mingw GCC and MSVC CCs, will never optimize htonl/htons/ntohl/ntohs to 1 opcode, ask for the C linker symbol, and you will get the linker symbol. So Storable.xs and PP pack/unpack on both CCs, are calling Winsock's exported functions to do the swap. This is slow. It has overhead.
MSVC 2022 CC binaries prior to build number 19.37 (released Aug 2023), don't even recognize "((x & 0xFF)<<24)|((x>>24) & 0xFF)|((x & 0x0000FF00) <<8)|((x & 0x00FF0000)>>8)" as a C level euphemism for "I want the byte swap CPU instruction". To fix all this, hook the 4 functions with the CPP to prevent the 4 tokens from calling the Winsock export table implimentations. 2nd, with all MSVC CCs versions use the official proper MS specific way to do a "byte swap". Which is instrinsic function _byteswap_ulong(). Theoretically RtlUlongByteSwap() also exists, but its much less mentioned on MSDN and the general WWW, and lives in "wdm.h" which is intended for Ring 0 kernel drivers, so header clashes can occur now, or in the future, trying to use a Ring 0 .h in a mostly user mode .h #include chain.
Since _byteswap_ulong() is an intrinsic function and not a macro, S_my_swap32() isn't needed to prevent multi-eval problems.
htonll()/ntohll() were added simply because it was easy to write them. 2 different algorithms for htonll() are included, I picked the one with less C src code ops, but I didn't check if neither, 1 or the other, or both are auto-detected by GCC and Clang as a 64 bit byte-swap instrinsic request without formally asking for the instrinsic with a named identifier.
The vtable hooks that CPerlHost/iperlsys.h use to implement psuedo-fork and ithreads, also had to be disabled for the 4 functions. Or else XSUB.h will redirect all CPAN XS mods (not -DPERL_CORE !), to the CPerlHost and iperlsys.h vtables, which then redirect to winsock, negating this fix.
Other less-than-perfect CC/CPUs/OSes combos might be discovered in the future, so PERL_MY_HOST_NET_BYTE_SWAP define is cross OS and not Win32 specific. S_my_swap32() isn't being turned on for any OSes/CCs other than Mingw and MSVC on Win32, because "if it ain't broke, don't fix it". If the CCs and .h'es for Linux/Android/OSX are already perfect, if Perl attempts to hook, intercept, or use a token or identifier from 6 years ago, not 18 months ago, more harm (deoptimization) can happen that good. MSVC produced 11-15 CPU ops for S_my_swap32()'s "ISO C" macro, before VC 2022 19.37 Aug 2023. Winsock has the same exact machine code.
I'll assume MS/other major Windows corporate users, assumed that "ISO C" byte swap macro is fundamentally wrong, since acknowleging that endianness exists in IT, violates "ISO C", and that code base is now "some Vendor's C", so why fix htonl() or that long non-CPU arch specific macro, instead of the intrinsic? That program already is aware of OS and Vendor and platform names and isn't portable,