Win32: htonl/htons/ntohl/ntohs change slow winsock exports -> 1 CPU op/ins #23330

bulk88 · 2025-05-26T01:19:40Z

This patch started when I saw this call stack while single stepping Storeable.pm/.xs. And to my horror, why on earth is Storeable.pm trying to go on the WWW. It has nothing to do with TCP/IP, or IP4/IP6. What Storeable.xs was really wanting to do is execute C function htonl() which is 100% okay safe sane normal. It wasn't trying to go online.

But for personal performance wishlist reasons, I don't want to see ws2_32.dll getting mapped into addr space just for that tiny htonl function.

>	ntdll.dll!LdrGetProcedureAddress ()	Unknown
  | KernelBase.dll!GetProcAddress()	Unknown
  | perl541.dll!__delayLoadHelper2(const ImgDelayDescr * pidd, __int64(*)() * ppfnIATEntry) Line 352	C++
  | perl541.dll!__tailMerge_ws2_32_dll ()	Unknown
  | Storable.dll!store_hash(interpreter * my_perl, stcxt * cxt, hv * hv) Line 2927	C
  | Storable.dll!store(interpreter * my_perl, stcxt * cxt, sv * sv) Line 4516	C
  | Storable.dll!do_store(interpreter * my_perl, _PerlIO * * f, sv * sv, int optype, int network_order, sv * * res) Line 4700	C
  | Storable.dll!XS_Storable_mstore(interpreter * my_perl, cv * cv) Line 7902	C
  | [Inline Frame] perl541.dll!Perl_rpp_invoke_xs(interpreter *) Line 1177	C++
  | perl541.dll!Perl_pp_entersub(interpreter * my_perl) Line 6529	C++
  | perl541.dll!Perl_runops_standard(interpreter * my_perl) Line 41	C++
  | perl541.dll!Perl_call_sv(interpreter * my_perl, sv * sv, long arg_flags) Line 3253	C++
  | threads.dll!S_jmpenv_run(interpreter * my_perl, int action, _ithread * thread, int * len_p, int * exit_app_p, int * exit_code_p) Line 521	C++
  | threads.dll!S_ithread_run(void * arg) Line 640	C++
  | kernel32.dll!BaseThreadInitThunk ()	Unknown
  | ntdll.dll!RtlUserThreadStart ()	Unknown

sample of how Storable.xs uses htonl

#ifdef HAS_HTONL
#define WLEN(x)                                                         \
    STMT_START {                                                        \
        ASSERT(sizeof(x) == sizeof(int), ("WLEN writing an int"));      \
        if (cxt->netorder) {                                            \
            int y = (int) htonl(x);                                     \
            if (!cxt->fio)                                              \
                MBUF_PUTINT(y);                                         \
            else if (PerlIO_write(cxt->fio,oI(&y),oS(sizeof(y))) != oS(sizeof(y))) \
                return -1;                                              \
        } else {                                                        \
            if (!cxt->fio)                                              \
                MBUF_PUTINT(x);                                         \
            else if (PerlIO_write(cxt->fio,oI(&x),                      \
                                  oS(sizeof(x))) != oS(sizeof(x)))      \
                return -1;                                              \
        }                                                               \
    } STMT_END

Next problem, after CPP hooking htonl() and its friends, to perl.h's backup implementation of htonl(), MSVC 2022 x64 -O1 was producing horrible machine code, instead of x64's bswap instruction.

  | ###########################################################
  | #  elif BYTEORDER == 0x1234 \|\| BYTEORDER == 0x12345678
  | #    define htonl(x)    st_my_swap32(x)
  | #    define ntohl(x)    st_my_swap32(x)
  |  
  | PERL_STATIC_INLINE U32
  | st_my_swap32(const U32 x) {
  | return ((x & 0xFF) << 24) \| ((x >> 24) & 0xFF)
  | \| ((x & 0x0000FF00) << 8) \| ((x & 0x00FF0000) >> 8);
  | }
  | ###########################################################
  | mov     ecx, [rbp+arg_18]
  | mov     edx, ecx
  | and     edx, 0FF0000h
  | mov     eax, ecx
  | shr     eax, 10h
  | or      edx, eax
  | mov     eax, ecx
  | shl     eax, 10h
  | and     ecx, 0FF00h
  | or      eax, ecx
  | shr     edx, 8
  | shl     eax, 8
  | or      edx, eax
  | mov     [rbp+Src], edx
  | test    r10, r10
  | jnz     short loc_180004B65

disassembling ws2_32.dll's htonl shows the same bad machine code as above. Here is a screenshot I found of someone else complaining about using the long Pure ISO C macro with a modern MSVC CC. The flaw in this screenshot was a production grade .a/.lib compiled and distributed by another entity containing a -Od -O0 machine code version of C++ ecosystem's htonl().

Strawberry/Mingw correctly recognizes the pure ISO C expression as a euphemism for the current CPU's arch specific endian/byte swap instruction, regardless of its ascii identifier name for that CPU. BUT WinPerl built with Strawberry/Mingw, still will execute PE symbol table's extern C htonl() function ptr, if you type htonl in src code instead of typing the long Pure ISO C macro/expression. Ubuntu libperl.so doesnt import a htonl C linker symbol at all.

In writing and researching this PR on IRC there was a long talk

[22:06] <TonyC> https://godbolt.org/z/1bGh9eWsM both gcc and MSVC optimize it pretty well

The VC 2022 I have and the VC 2022 TonyC has, werent producing the same mach code output.

�01[22:10] <bulk88> TonyC I wonder what build number of MSVC 2022 is being used by godbolt, since MS offical blog says they "fixed it" in November 2022 https://devblogs.microsoft.com/cppblog/a-tour-of-4-msvc-backend-improvements/
�01[22:11] <bulk88> but in Feb 2023 users are complaining that blog article was BS and lies https://developercommunity.visualstudio.com/t/MSVC-Not-using-bswap-when-possible/917784
[22:12] <TonyC> you can select different versions from the compiler dropdown
�01[22:18] <bulk88> https://godbolt.org/z/W79hxYG95 using MSVC 2022 latest, using the exact P5P blead code, I broke your example tony
�01[22:18] <bulk88> P5P is using U8 and U16 const literal ints operands, your version is 100% U64 const literal ints
[22:22] <TonyC> you picked a 32-bit compiler, but even that gets it right if the parameter to g() is a U64 rather than U32
�01[22:23] <bulk88> bswap op x86-32 was not a day 1 opcode, it was added in the release of the 486, so CC devs have a legit rational to never ever use the bswap opcode unless the end user tells the CC to do it
�01[22:24] <bulk88> IDK how C lang spec, and ones compliment sign extend promotion, are involved in this

eventually the different mach code output between 2 people both with "VC 2022" mystery was solved using godbolt's collection of very many (monthly) point releases of MSVC 2022 as

�01[22:28] <bulk88> MSVC x64  version 19.37  17.7 was the first build number that has the "perfection" 1 op code machine code
�01[22:29] <bulk88> 19.36 and lower have the 10 op code medicore version
�01[22:30] <bulk88>  build number Visual Studio 2022 version 17.7 was published on Aug 8th 2023
�01[22:38] <bulk88> I have build number  19.36.32535 installed, therefore I have the bug

More research shows MS publicly want the public to write _byteswap_uint64, _byteswap_ulong, _byteswap_ushort, and not misuse/abuse functional calls from a source code compatibility layer in a C language TCPIP driver library (aka BSD sockets).

So to fix everything, WinPerl, regardless of CC vendor, will always CPP hook from perl.h token htonl and its friends on WinOS, so the htonl() the linker symbol from ws2_32.dll will never ever again be reachable from Perl in C. which forever, on WinOS, will always be coming from the export table. All MS CRT, past, present, and future, will forever use ascii name byteswap_ulong for what Linux API calls htonl. byteswap_ulong() on MSVC is both a CPU inline intrinsic, if you turn that intrinsic "on" during CC compile time, but it is also a real extern "C" function exported from ucrtbase.dll, for spec compliance/GetProcAddress/Linker symbol/-O0 -Od reasons.

The ucrtbase.dll version I found on my system is just as terrible as the ws2_32.dll version. MS devs in .c/.cpp lang probably wrote the CPU arch independent formal linker symbol extern "C" backup function using the same return ((x & 0xFF) << 24) | ((x >> 24) & 0xFF) | ((x & 0x0000FF00) << 8) | ((x & 0x00FF0000) >> 8); macro Perl is using.

; unsigned int __cdecl byteswap_ulong(unsigned int Long)
public _byteswap_ulong
_byteswap_ulong proc near
mov     eax, ecx
mov     r8d, 0FF00h
and     eax, r8d
mov     edx, ecx
shl     edx, 10h
add     eax, edx
mov     edx, ecx
shl     eax, 8
shr     edx, 8
and     edx, r8d
shr     ecx, 18h
add     eax, edx
add     eax, ecx
retn
_byteswap_ulong endp

Now after detangling ws2_32.dll from WinPerl, next problem to fix is 0.25 stars codegen on VC 2022 build number 19.36 released May 16, 2023 and lower.

So to fix that codegen problem, just replace that long macro with byteswap_ulong() and

#    ifdef _MSC_VER
#      pragma intrinsic(_byteswap_ulong)
#      pragma intrinsic(_byteswap_ushort)
#    endif

Now all WinPerl binaries are identical to what all LinPerl binaries have been doing since forever. Bug is fixed, defect removed. End of story.

Now my commit above, I didn't fully think out certain parts and they need input from the public.

Is there any reason for WinPerl to ever again use slow htonl() from WinSock.dll for any reason?

Is this build option to revert back to slow htonl() in Windows only, a waste of screen space, tarball space, and repo space?

I am keeping the Win32 MSVC specific intrinsic logic in perl.h right next to the P5P Pure ISO C polyfill version instead the byte swaping backends living in 2 different .h files.

I also believe, there are Linux/Commercial Unix C compilers out there, that also have this optimization bug, where return ((x & 0xFF) << 24) | ((x >> 24) & 0xFF) | ((x & 0x0000FF00) << 8) | ((x & 0x00FF0000) >> 8); is not regexp-ed by the closed source Unix CC at compile time, into the appropriate 1 opcode line CPU instruction for whatever CPU arch that Unix is running on. I think this optimization was invented by GCC, copy-catted by Clang because of public outrage, and doesn't exist in any other open or closed source C compilers.

Its reasonable for the dev team of a C compiler, to assume, someone writing return ((x & 0xFF) << 24) | ((x >> 24) & 0xFF) | ((x & 0x0000FF00) << 8) | ((x & 0x00FF0000) >> 8); instead of reading the man page for their C compiler, is an incompetent fool and "We" (CC authors) will not correct such "user error/user incompetent" C code. memcpy()/strlen()/memcmp() are more reasonable to assume they can turn into 1 CPU opcode inline intrinsic, since all 3 are just "1 node/1 C struct/1 AST node" inside the C compiler. return ((x & 0xFF) << 24) | ((x >> 24) & 0xFF) | ((x & 0x0000FF00) << 8) | ((x & 0x00FF0000) >> 8); requires a regexp inside the C compiler matching atleast "23 nodes/23 C structs/23 AST nodes" before doing the optimization to bswap instruction. Now the C compiler has an O(n*23) CPU usage problem from what used to be O(1).

BTW, for htons() on i386/x64, its done with roll16((U16) n, 8) or (U16)n = ((U16)n) >ROLL> 8;, I dont think a bswap16 exists on Intel CPUs. But remember folks, ISO C, has never, and never will add math operators roll, pop_count(), count_leading_0s(), **, and count_leading_1s().

exp() is a linker symbol, not a C math operator, please contact the authors of man page of libc.so to learn more about linker symbol exp(). Maybe it is an abbreviation of expires and opens gdb instantly when you call it and pauses the process.

I think its a very bad idea for perl.h to override htonl()/etc, 100% of the time, on all CCs and all CPUs and all OSes. Lets assume all CCs on earth do the RIGHT THING like GCC and Clang do, until some human uses a dis-assembler, and proves Foo C Compiler v0.0 -O2 did something wrong. The OS authors/OS .h authors/CC authors will always know more, and keep better routine maintenance and watch over what C token htonl() does under the hood on their own products/projects, than P5P volunteers can do.

But !!!! P5P can't predict what current or future CCs, and what future CPU archs, will need perl.h to re-implement htonl(), hence I'm leaving provisions that Perl on Linux/Android/Solaris/AIX/IBM Z/OS, will need the same hooking WinPerl needs.

My risk assessment is, return ((x & 0xFF) << 24) | ((x >> 24) & 0xFF) | ((x & 0x0000FF00) << 8) | ((x & 0x00FF0000) >> 8); will fail to optimize or fail to inline, in machine code of current or future RISCV CPUs and WASM CPUs. WASM is a real CPU ISA BTW. The spec was written so a hardware CPU can be lithographied that runs un-recompile WASM byte sequences right off the DDR ram sticks. ARM32 can execute the byte stream in unmodified .class disk files remember, if your load the correct DRM license key into the ARM CPU at runtime to enable the feature.

Also, the commit in this PR needs less src code comments.

But I don't know where and how to document this new "hooking htonl()" feature.

1 large comment in 1 place, single line comments every where else?
1 large comment in 1 place, no comments anywhere else?
alot of 1 line comments saying "go read PR # 987654"?

my failed attempt at humor that Im going to hide instead of deleting

What does "host" mean in the spec? Whose host, which address, which room and year in history?

What is "endianness" in ISO C?

Your app has lost the portability game, it is now aware of the OS Vendor's ascii string name and aware of the ascii string name of a CPU arch, which is a big no no to "portable" SW. Remember Abstract OS v𝑖⌃² changes the byteorder using an RNG of your C program and its integer math operators each time your create a new process to run it, to 1 of 8 possible byte orders for doing math using C type uint32_t. Write a unit test involving, htonl(), memcpy(), fwrite(), and a usb stick being walked between 2 random Linux or Unix systems anywhere on earth jkjk

commit msg below

Since day 1 of Win32/64, C symbols/tokens htonl(), htons(), ntohs(), and ntohs() have been extern "C" PE symbol table exported functions from ws2_32.dll. WinSock aka WinOS's front end public facing lib for TCP/IP along with alot of legacy support for a dozen other protocols that nobody remembers. These 4 functions do 1 things and 1 thing only, swap bytes around. Windows 16-64 has and forever will be a LE OS regardless of CPU archs/HW. Mixed endian systems don't exist. Undumping a process between BE/LE HW Unix boxes or any OS kernels doesn't exist.

Storable.xs and PP keywords pack/unpack heavily use these 4 functions, but they have nothing to do with ethernet, token ring, or TCPIP. All modern CPUs (i386/x64/ARM) have a dedicated CPU opcode for BE/LE swapping. x86 introduced the 32 bit/U32 bswap instruction with the release of the i486. U16 variables can use i386's "ror eax, 8" (bitwise roll 8 bits).

Both Mingw GCC and MSVC CCs, will never optimize htonl/htons/ntohl/ntohs to 1 opcode, ask for the C linker symbol, and you will get the linker symbol. So Storable.xs and PP pack/unpack on both CCs, are calling Winsock's exported functions to do the swap. This is slow. It has overhead.

MSVC 2022 CC binaries prior to build number 19.37 (released Aug 2023), don't even recognize "((x & 0xFF)<<24)|((x>>24) & 0xFF)|((x & 0x0000FF00) <<8)|((x & 0x00FF0000)>>8)" as a C level euphemism for "I want the byte swap CPU instruction". To fix all this, hook the 4 functions with the CPP to prevent the 4 tokens from calling the Winsock export table implimentations. 2nd, with all MSVC CCs versions use the official proper MS specific way to do a "byte swap". Which is instrinsic function _byteswap_ulong(). Theoretically RtlUlongByteSwap() also exists, but its much less mentioned on MSDN and the general WWW, and lives in "wdm.h" which is intended for Ring 0 kernel drivers, so header clashes can occur now, or in the future, trying to use a Ring 0 .h in a mostly user mode .h #include chain.

Since _byteswap_ulong() is an intrinsic function and not a macro, S_my_swap32() isn't needed to prevent multi-eval problems.

htonll()/ntohll() were added simply because it was easy to write them. 2 different algorithms for htonll() are included, I picked the one with less C src code ops, but I didn't check if neither, 1 or the other, or both are auto-detected by GCC and Clang as a 64 bit byte-swap instrinsic request without formally asking for the instrinsic with a named identifier.

The vtable hooks that CPerlHost/iperlsys.h use to implement psuedo-fork and ithreads, also had to be disabled for the 4 functions. Or else XSUB.h will redirect all CPAN XS mods (not -DPERL_CORE !), to the CPerlHost and iperlsys.h vtables, which then redirect to winsock, negating this fix.

Other less-than-perfect CC/CPUs/OSes combos might be discovered in the future, so PERL_MY_HOST_NET_BYTE_SWAP define is cross OS and not Win32 specific. S_my_swap32() isn't being turned on for any OSes/CCs other than Mingw and MSVC on Win32, because "if it ain't broke, don't fix it". If the CCs and .h'es for Linux/Android/OSX are already perfect, if Perl attempts to hook, intercept, or use a token or identifier from 6 years ago, not 18 months ago, more harm (deoptimization) can happen that good. MSVC produced 11-15 CPU ops for S_my_swap32()'s "ISO C" macro, before VC 2022 19.37 Aug 2023. Winsock has the same exact machine code.

I'll assume MS/other major Windows corporate users, assumed that "ISO C" byte swap macro is fundamentally wrong, since acknowleging that endianness exists in IT, violates "ISO C", and that code base is now "some Vendor's C", so why fix htonl() or that long non-CPU arch specific macro, instead of the intrinsic? That program already is aware of OS and Vendor and platform names and isn't portable,

This set of changes requires a perldelta entry, and I need help writing it. <<<<< TODO: THIS ONE

bulk88 · 2025-05-26T03:52:32Z

PostgreSQL found this exact performance problem on Windows OS (WinSock.dll problem) and on MSVC (the & << | macro and final mach code output) and fixed both in September 2017.

postgres/postgres@510b8cb

Read entire thread, someone accuses GCC on Linux -O2 (year 2017) of emitting "very low quality" machine code for GCCs/Linuxs attempt to do an inlined htonl() optimization, written in GCC inline assembly grammar. I (bulk88) guess that ML post author is complaining about 100% pointless and redundant register to register mov opcodes. GCC IDK when in history solved this by inventing a non-back compat V2 GCC inline assembly grammar called https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html

https://www.postgresql.org/message-id/20170914063418.sckdzgjfrsbekae4%40alap3.anarazel.de

update:

with my buggy VC 2022 v19.36, store_array() in Storable.xs is now a 99.999% perfect

mov     rcx, [rbp+var_10]
mov     eax, ecx
shr     rcx, 20h
bswap   ecx
mov     [rbp+Src], ecx
bswap   eax
mov     [rbp+arg_C], eax
test    rdx, rdx
jnz     loc_1800042E0

I am going pretend Storable.pm is not a P5P maintained product in the next sentence. The 0.001% defect is Storeable.xs writing htonl() twice in src code, to emulate htonll(), which is not P5P's problem or responsibility. More technically accurate, its not a problem or responsibility of P5P's collection of .h files and P5P's libperl.so file.

iabyn · 2025-05-26T09:19:47Z

I have no strong opinions on whether this commit is the best solution to the problem being addressed (I tend to steer clear of Win32 stuff), but I am strongly opposed to the proposed commit message. It is verbose and unclear, and makes it very difficult to see at a glance what is being changed.

Here's an example of what I think such a commit message should look like. I'm not proposing this as the actual text of the commit message, since I don't understand the real details of the commit. It's to give you an idea of what I regard as a suitable commit message.

Win32: speed up htons etc byte-order swapping

Various part of perl, in particular pack() and Storable, make extensive use of the htonl/htons/ntohl/ntohs library functions to swap the byte order of short and long ints.

Unfortunately in Win32 these functions can often be relatively slow, for the reasons described below. This commit introduces a new define, PERL_MY_HOST_NET_BYTE_SWAP (set by default). If set, then htons etc, which were formerly mapped to win32_htonl etc, are now mapped to the pre-existing inline my_swap32 etc functions.
In addition, the implementation of my_swap32 etc has been improved, using the compiler-specific _byteswap_uint64 pragma if available.

On non-win32 platforms, nothing has changed.

... long explanation of the issue ...

perl.h

Leont · 2025-05-26T17:42:29Z

The idea of the change sounds like a very good idea, but I can't help wondering if it can't be wired more simply.

bulk88 · 2025-05-27T00:30:18Z

The idea of the change sounds like a very good idea, but I can't help wondering if it can't be wired more simply.

What are your thoughts on dropping the "disable the hook/optimization and use slow OS htonl" build option nobody on WinOS or AnyOS will ever use or know its there? It will reduce the visual complexity and length of this patch if I do that. It exists because of some tests/benchmarks analysis I was doing in writing the patch, and I decided to not delete it, although I could've/should've.

Remember the hook is turned on by P5P only for know problem platforms (currently WinOS, all CCs), never for all platforms.

…order Changing BE/LE byte order is a very common operation used in many places inside libperl, and some core bundled XS modules, and in many CPAN XS modules. Storable.xs and PP pack()/unpack() are the largest most frequent users of byte swapping in Perl interp git repo. Some Perl interp repo .c or .h or .xs files DIY their own byte swap algorithms with macros or static inline functions. Others like Storable.xs and PP pack()/unpack() use the htonl(), htons(), ntohs(), and ntohs() macros/functions provided by the OS/CC, or depending on config.sh, equivelent polyfills from perl.h. On all modern CPU archs, i386/ARM/x64/PowerPC, there is a dedicate CPU instruction for doing it. Perl on Linux compiled with Clang or GCC, automatically is using the appropriate inlined intrinsic CPU opcode when using the CCs/OSes htonl(). Those 2 CCs also recogize perl.h's my_swap32() algorithm, and all other .h/.c files that use that statement as a euphemism to inline htonl() to 1 CPU opcode. On Windows, the situation is very bad compare to the above. The htonl(), htons(), ntohs(), and ntohs() functions/macros, provided to Perl from both MSVC and GCC, are exceptionally slow and inefficient, and unoptimized. MSVC has further problems, generating very inefficient machine code to swap byte order, using the DIY shift and mask algorithm. Since day 1 of Win32/64, C symbols/tokens htonl(), htons(), ntohs(), and ntohs() have been extern "C" PE symbol table exported functions from ws2_32.dll. ws2_32.dll is AKA WinSock, Win32/64's front end public facing lib for TCP/IP sockets. It is not a light weight DLL, but loads multiple other DLLs, filter/FW DLLs, middleware and backendware DLLs. WinPerl delay loads (RTLD_LAZY) the Winsock DLL until the first attempt is made by [lib] perl5XX.dll to go on the WWW. These 4 functions do 1 things and 1 thing only, swap bytes around. Storable.xs and PP keywords pack/unpack heavily use these 4 functions, but they have nothing to do with ethernet, token ring, or TCPIP. They should be using the correct inlined single CPU instruction to do this. All modern CPUs (i386/x64/ARM) have a dedicated CPU opcode for BE/LE swapping. x86 introduced the 32 bit/U32 bswap instruction with the release of the i486. U16 variables can use i386's "ror eax, 8" (bitwise roll 8 bits). Both Mingw GCC and MSVC CCs, will never optimize htonl/htons/ntohl/ntohs to 1 opcode on Win32/64, ask for the C linker symbol, and you will get the linker symbol. So Storable.xs and PP pack/unpack on both CCs, are calling Winsock's exported functions to do the swap. This is slow since its a function call. Even worse, its not 1 opcode inside ws2_32.dll but 11 to 15 CPU ops. MSVC 2022 CC binaries prior to build number 19.37 (released Aug 2023), don't even recognize "((x & 0xFF)<<24)|((x>>24) & 0xFF)|((x & 0x0000FF00) <<8)|((x & 0x00FF0000)>>8)" as a C level euphemism for "I want the byte swap CPU instruction". Mingw GCC does recognize that macro to mean 1 CPU instruction, but Mingw GCC won't modify or optimize MS's htonl()/etc identifiers/symbols for you, you need to DIY that macro yourself to trigger the 1 CPU op optimization. To fix all this, hook the 4 functions with the CPP to prevent the 4 tokens from calling the Winsock export table implimentations. 2nd, with all MSVC CCs versions use the official proper MS specific way to do a "byte swap". Which is instrinsic function _byteswap_ulong(). Theoretically RtlUlongByteSwap() also exists, but its much less mentioned on MSDN and the general WWW, and lives in "wdm.h" which is intended for Ring 0 kernel drivers, so header clashes can occur now, or in the future, so pick the easier and simple _byteswap_ulong() category of intrinsics, not the Ring 0 ones. Since _byteswap_ulong() is an intrinsic function and not a macro, S_my_swap32() isn't needed to prevent multi-eval problems. htonll()/ntohll() were added simply because it was easy to write them, and multiple interp core .c files, and interp core .xs files want a 64 bit byte swap function/macro, since most people use 64 bit pointer CPUs. 2 different algorithms for htonll() are included, I picked the one with less C src code ops, but I didn't check if neither, 1 or the other, or both are auto-detected by GCC and Clang as a 64 bit byte-swap instrinsic request without formally asking for the instrinsic with a named identifier. The vtable hooks that CPerlHost/iperlsys.h use to implement psuedo-fork and ithreads, also had to be disabled for the 4 functions. Or else XSUB.h will redirect all CPAN XS mods (not -DPERL_CORE !), to the CPerlHost and iperlsys.h vtables, which then redirect to winsock, negating this fix. Other less-than-perfect CC/CPUs/OSes combos might be discovered in the future, so PERL_MY_HOST_NET_BYTE_SWAP define is cross OS and not Win32 specific. Rumors online say GCC only added its __builtin_bswap32() and matching that macro, in 2008/v4.0. So S_my_swap32() isn't being turned on for any OSes/CCs other than Mingw and MSVC on Win32, because "if it ain't broke, don't fix it". If the CCs and .h'es for Linux/Android/OSX are already perfect, if Perl attempts to hook, intercept, or use a token or identifier from 6 years ago, not 18 months ago, more harm (deoptimization) can happen that good. MSVC produced 11-15 CPU ops for S_my_swap32()'s "ISO C" macro, before VC 2022 19.37 Aug 2023. Winsock has the same exact machine code. I'll assume MS/other major Windows corporate users, assumed that "ISO C" byte swap macro is fundamentally wrong, since acknowleging that endianness exists violates "ISO C", and that code base is now "some Vendor's C", so why fix htonl() or that long non-CPU arch specific macro, instead of the intrinsic? That program already is aware of OS and Vendor and platform names and isn't portable

bulk88 · 2025-05-28T08:54:27Z

gcc-mirror/gcc@73984f8

GCC added pattern matching the DIYed htonl() macro only recently in 2014 if above commit is true

khwilliamson · 2025-06-18T20:46:44Z

I agree with @iabyn about commit messages. And I want to amplify on that.

A commit message is a permanent record. It will be read many many times in the future. A poorly written commit message is technical debt, wasting the time of whoever does eventually read it.

One such person might be someone with hiring authority over the person who wrote the message. When writing a commit message, one should ask oneself would this commit message play well the next time I need to look for $work? If you think you won't ever be in that position again to need a new job, I claim you're overconfident. Just this past weekend, I ran into some people who stayed where I used to be employed. One was laid off a few years ago; the other was cut down to one day a week. I was shocked by lay-off reports of people who were once thought as crucial to the company.

For my well-being and to avoid technical debt, I refuse to review a PR without a decent commit message, and I will strenuously object to any such actually being merged. It is unfair to those who come after us.

A commit message doesn't have to be totally dry, but its wit needs brevity. It has to be in proper English grammar and free of gratuitous asides. It should be free of obscure terminology not familiar to someone who isn't familiar with general computerese and/or isn't a native English speaker.

Some commit messages can be very short. I think "foo.c: Fix typo in comment" is sufficient for the purpose. That is just the header; no further explanation nor message body is needed.

Like a newspaper article, a commit message needs to answer who what when where and why. The author and when are automatically filled in for you. I think the where and a brief what should go in the title. That leaves the body to contain an expanded what, and why.

Should the PSC come up with commit message guidelines?

khwilliamson · 2025-06-18T21:01:03Z

Maybe AI could be used to improve commit messages?

bulk88 · 2025-06-19T00:22:29Z

I agree with @iabyn about commit messages. And I want to amplify on that.

A commit message is a permanent record. It will be read many many times in the future. A poorly written commit message is technical debt, wasting the time of whoever does eventually read it.

My current commit message exceeds my own screen height. I don't like that. I think that a more than 1/4th of a screen long commit message is git repo bloat. Since P5P is FOSS software with a forever and permanent archived ML/bug tracker, more than 1/4 of a screen of commit message body is never necessary. The PHD doctorate long commit message or homicide case file long commit message details can be instead saved in the ML/bug tracker. That way they don't affect git clone time/git log/git blame/etc.

Should I just change the commit message to "See the details about this commit in it's PR at PR #23330."?

Problem solved? This is my favorite path forward. Also I think there should be some public API POD in https://perldoc.perl.org/perlclib but its not crucial to making this PR commitable.

Also I need input on my comment #23330 (comment)

Whats your thoughts about evangelizing to CPAN authors identifiers my_swap64 or htonll or PerlSock_htonll?

_ # 1 and # 3 are P5P proprietary. # 2 is filling in missing POSIX APIs, where the perl VM impl, will get (good thing) accidental usage by CPAN XS authors, who dont know, dont care, or bare minimum care about P5P invented identifiers.

These XS authors are beginner or low effort or FT non-Perl C devs but low time people. So they will use max POSIX C/Linux C tokens that they are comfortable with where possible, vs the P5P invented "portable" C tokens that are an alien language from an extraterrestrial planet to them.

No matter who writes this commit. me or someone else, the commit message, and all technical background info ^^^^ up above in this PR, is only readable by fluent x86 Assembly devs.

No C developer, P5P or not, will ever understand the defects and flaws of this PR, nor understand fixes, nor understand the rational behind each design choice made, when there were 2 or more possible solutions/design choices to pick from. It is beyond scope for me to explain what Virtual Address Space is to an HTML 5 Javascript or React JSX dev.

Some photos from my bookshelf of where my knowledge comes from.

The rational and tech talk I have made so far, will probably never be read again, after it is in blead perl again. The only reason I made it so verbose, is so that in the future, no P5P dev, who is a Linux Host OS + Win Guest/Cloud OS person, doesn't mess with the Win32-specific code not understanding, why it is exactly the way I wrote it. I also made it it highly verbose, since if some Asm-reading P5P person other than me, in the future, finds these exact same byte swap problems with a future Perl running on RISC-V or future Perl running on Loongson port. What to do, how to fix it, was already documented and done years ago.

Regarding "no C developer, P5P or not can read it", I can say the same exact thing about blead Perl right now.

No C or Perl developer, P5P or not, will ever understand a single commit from https://github.com/Perl/metaconfig or https://github.com/Perl/perl5/blob/blead/Configure

Should that framework and https://github.com/Perl/perl5/blob/blead/Configure be removed right now from Perl because they are a costly maintenance burden, written in unintelligible programming languages such as awk and m4 and an unknown undocumented variant of sh language?

I don't think so.

Update:

I just realized not even someone fluent x86 Assembly can 100% understand this commit. 25%-50% of this commit is dealing with bugs, in specific build numbers of MSVC's cl.exe/link.exe, along with MS's compliance [gossip? conspiracy theory?] with Intel's software patent on byte order swapping at https://patents.google.com/patent/US20040010676A1/en that expired in 2022.

Leont reviewed May 26, 2025

View reviewed changes

perl.h Show resolved Hide resolved

bulk88 force-pushed the win32_no_winsock_byte_swaper_funcs branch from 314ac03 to 002f8ba Compare May 27, 2025 11:47

bulk88 mentioned this pull request Jun 15, 2025

pp_reverse - chunk-at-a-time string reversal #23371

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Win32: htonl/htons/ntohl/ntohs change slow winsock exports -> 1 CPU op/ins #23330

Win32: htonl/htons/ntohl/ntohs change slow winsock exports -> 1 CPU op/ins #23330

bulk88 commented May 26, 2025

Uh oh!

bulk88 commented May 26, 2025 •

edited

Loading

Uh oh!

iabyn commented May 26, 2025

Uh oh!

Uh oh!

Leont commented May 26, 2025

Uh oh!

bulk88 commented May 27, 2025

Uh oh!

bulk88 commented May 28, 2025

Uh oh!

khwilliamson commented Jun 18, 2025

Uh oh!

khwilliamson commented Jun 18, 2025

Uh oh!

bulk88 commented Jun 19, 2025 •

edited

Loading

Uh oh!

Uh oh!

Win32: htonl/htons/ntohl/ntohs change slow winsock exports -> 1 CPU op/ins #23330

Are you sure you want to change the base?

Win32: htonl/htons/ntohl/ntohs change slow winsock exports -> 1 CPU op/ins #23330

Conversation

bulk88 commented May 26, 2025

Uh oh!

bulk88 commented May 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

iabyn commented May 26, 2025

Uh oh!

Uh oh!

Leont commented May 26, 2025

Uh oh!

bulk88 commented May 27, 2025

Uh oh!

bulk88 commented May 28, 2025

Uh oh!

khwilliamson commented Jun 18, 2025

Uh oh!

khwilliamson commented Jun 18, 2025

Uh oh!

bulk88 commented Jun 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

bulk88 commented May 26, 2025 •

edited

Loading

bulk88 commented Jun 19, 2025 •

edited

Loading