Skip to content

Win32: htonl/htons/ntohl/ntohs change slow winsock exports -> 1 CPU op/ins #23330

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: blead
Choose a base branch
from

Conversation

bulk88
Copy link
Contributor

@bulk88 bulk88 commented May 26, 2025

This patch started when I saw this call stack while single stepping Storeable.pm/.xs. And to my horror, why on earth is Storeable.pm trying to go on the WWW. It has nothing to do with TCP/IP, or IP4/IP6. What Storeable.xs was really wanting to do is execute C function htonl() which is 100% okay safe sane normal. It wasn't trying to go online.

But for personal performance wishlist reasons, I don't want to see ws2_32.dll getting mapped into addr space just for that tiny htonl function.

>	ntdll.dll!LdrGetProcedureAddress ()	Unknown
  | KernelBase.dll!GetProcAddress()	Unknown
  | perl541.dll!__delayLoadHelper2(const ImgDelayDescr * pidd, __int64(*)() * ppfnIATEntry) Line 352	C++
  | perl541.dll!__tailMerge_ws2_32_dll ()	Unknown
  | Storable.dll!store_hash(interpreter * my_perl, stcxt * cxt, hv * hv) Line 2927	C
  | Storable.dll!store(interpreter * my_perl, stcxt * cxt, sv * sv) Line 4516	C
  | Storable.dll!do_store(interpreter * my_perl, _PerlIO * * f, sv * sv, int optype, int network_order, sv * * res) Line 4700	C
  | Storable.dll!XS_Storable_mstore(interpreter * my_perl, cv * cv) Line 7902	C
  | [Inline Frame] perl541.dll!Perl_rpp_invoke_xs(interpreter *) Line 1177	C++
  | perl541.dll!Perl_pp_entersub(interpreter * my_perl) Line 6529	C++
  | perl541.dll!Perl_runops_standard(interpreter * my_perl) Line 41	C++
  | perl541.dll!Perl_call_sv(interpreter * my_perl, sv * sv, long arg_flags) Line 3253	C++
  | threads.dll!S_jmpenv_run(interpreter * my_perl, int action, _ithread * thread, int * len_p, int * exit_app_p, int * exit_code_p) Line 521	C++
  | threads.dll!S_ithread_run(void * arg) Line 640	C++
  | kernel32.dll!BaseThreadInitThunk ()	Unknown
  | ntdll.dll!RtlUserThreadStart ()	Unknown

sample of how Storable.xs uses htonl

#ifdef HAS_HTONL
#define WLEN(x)                                                         \
    STMT_START {                                                        \
        ASSERT(sizeof(x) == sizeof(int), ("WLEN writing an int"));      \
        if (cxt->netorder) {                                            \
            int y = (int) htonl(x);                                     \
            if (!cxt->fio)                                              \
                MBUF_PUTINT(y);                                         \
            else if (PerlIO_write(cxt->fio,oI(&y),oS(sizeof(y))) != oS(sizeof(y))) \
                return -1;                                              \
        } else {                                                        \
            if (!cxt->fio)                                              \
                MBUF_PUTINT(x);                                         \
            else if (PerlIO_write(cxt->fio,oI(&x),                      \
                                  oS(sizeof(x))) != oS(sizeof(x)))      \
                return -1;                                              \
        }                                                               \
    } STMT_END

Next problem, after CPP hooking htonl() and its friends, to perl.h's backup implementation of htonl(), MSVC 2022 x64 -O1 was producing horrible machine code, instead of x64's bswap instruction.

  | ###########################################################
  | #  elif BYTEORDER == 0x1234 \|\| BYTEORDER == 0x12345678
  | #    define htonl(x)    st_my_swap32(x)
  | #    define ntohl(x)    st_my_swap32(x)
  |  
  | PERL_STATIC_INLINE U32
  | st_my_swap32(const U32 x) {
  | return ((x & 0xFF) << 24) \| ((x >> 24) & 0xFF)
  | \| ((x & 0x0000FF00) << 8) \| ((x & 0x00FF0000) >> 8);
  | }
  | ###########################################################
  | mov     ecx, [rbp+arg_18]
  | mov     edx, ecx
  | and     edx, 0FF0000h
  | mov     eax, ecx
  | shr     eax, 10h
  | or      edx, eax
  | mov     eax, ecx
  | shl     eax, 10h
  | and     ecx, 0FF00h
  | or      eax, ecx
  | shr     edx, 8
  | shl     eax, 8
  | or      edx, eax
  | mov     [rbp+Src], edx
  | test    r10, r10
  | jnz     short loc_180004B65

disassembling ws2_32.dll's htonl shows the same bad machine code as above. Here is a screenshot I found of someone else complaining about using the long Pure ISO C macro with a modern MSVC CC. The flaw in this screenshot was a production grade .a/.lib compiled and distributed by another entity containing a -Od -O0 machine code version of C++ ecosystem's htonl().

msvc-odhtonl-failure

Strawberry/Mingw correctly recognizes the pure ISO C expression as a euphemism for the current CPU's arch specific endian/byte swap instruction, regardless of its ascii identifier name for that CPU. BUT WinPerl built with Strawberry/Mingw, still will execute PE symbol table's extern C htonl() function ptr, if you type htonl in src code instead of typing the long Pure ISO C macro/expression. Ubuntu libperl.so doesnt import a htonl C linker symbol at all.

In writing and researching this PR on IRC there was a long talk

[22:06] <TonyC> https://godbolt.org/z/1bGh9eWsM both gcc and MSVC optimize it pretty well

The VC 2022 I have and the VC 2022 TonyC has, werent producing the same mach code output.

�01[22:10] <bulk88> TonyC I wonder what build number of MSVC 2022 is being used by godbolt, since MS offical blog says they "fixed it" in November 2022 https://devblogs.microsoft.com/cppblog/a-tour-of-4-msvc-backend-improvements/
�01[22:11] <bulk88> but in Feb 2023 users are complaining that blog article was BS and lies https://developercommunity.visualstudio.com/t/MSVC-Not-using-bswap-when-possible/917784
[22:12] <TonyC> you can select different versions from the compiler dropdown
�01[22:18] <bulk88> https://godbolt.org/z/W79hxYG95 using MSVC 2022 latest, using the exact P5P blead code, I broke your example tony
�01[22:18] <bulk88> P5P is using U8 and U16 const literal ints operands, your version is 100% U64 const literal ints
[22:22] <TonyC> you picked a 32-bit compiler, but even that gets it right if the parameter to g() is a U64 rather than U32
�01[22:23] <bulk88> bswap op x86-32 was not a day 1 opcode, it was added in the release of the 486, so CC devs have a legit rational to never ever use the bswap opcode unless the end user tells the CC to do it
�01[22:24] <bulk88> IDK how C lang spec, and ones compliment sign extend promotion, are involved in this

eventually the different mach code output between 2 people both with "VC 2022" mystery was solved using godbolt's collection of very many (monthly) point releases of MSVC 2022 as

�01[22:28] <bulk88> MSVC x64  version 19.37  17.7 was the first build number that has the "perfection" 1 op code machine code
�01[22:29] <bulk88> 19.36 and lower have the 10 op code medicore version
�01[22:30] <bulk88>  build number Visual Studio 2022 version 17.7 was published on Aug 8th 2023
�01[22:38] <bulk88> I have build number  19.36.32535 installed, therefore I have the bug

More research shows MS publicly want the public to write _byteswap_uint64, _byteswap_ulong, _byteswap_ushort, and not misuse/abuse functional calls from a source code compatibility layer in a C language TCPIP driver library (aka BSD sockets).

So to fix everything, WinPerl, regardless of CC vendor, will always CPP hook from perl.h token htonl and its friends on WinOS, so the htonl() the linker symbol from ws2_32.dll will never ever again be reachable from Perl in C. which forever, on WinOS, will always be coming from the export table. All MS CRT, past, present, and future, will forever use ascii name byteswap_ulong for what Linux API calls htonl. byteswap_ulong() on MSVC is both a CPU inline intrinsic, if you turn that intrinsic "on" during CC compile time, but it is also a real extern "C" function exported from ucrtbase.dll, for spec compliance/GetProcAddress/Linker symbol/-O0 -Od reasons.

The ucrtbase.dll version I found on my system is just as terrible as the ws2_32.dll version. MS devs in .c/.cpp lang probably wrote the CPU arch independent formal linker symbol extern "C" backup function using the same return ((x & 0xFF) << 24) | ((x >> 24) & 0xFF) | ((x & 0x0000FF00) << 8) | ((x & 0x00FF0000) >> 8); macro Perl is using.

; unsigned int __cdecl byteswap_ulong(unsigned int Long)
public _byteswap_ulong
_byteswap_ulong proc near
mov     eax, ecx
mov     r8d, 0FF00h
and     eax, r8d
mov     edx, ecx
shl     edx, 10h
add     eax, edx
mov     edx, ecx
shl     eax, 8
shr     edx, 8
and     edx, r8d
shr     ecx, 18h
add     eax, edx
add     eax, ecx
retn
_byteswap_ulong endp

Now after detangling ws2_32.dll from WinPerl, next problem to fix is 0.25 stars codegen on VC 2022 build number 19.36 released May 16, 2023 and lower.

So to fix that codegen problem, just replace that long macro with byteswap_ulong() and

#    ifdef _MSC_VER
#      pragma intrinsic(_byteswap_ulong)
#      pragma intrinsic(_byteswap_ushort)
#    endif

Now all WinPerl binaries are identical to what all LinPerl binaries have been doing since forever. Bug is fixed, defect removed. End of story.

Now my commit above, I didn't fully think out certain parts and they need input from the public.

Is there any reason for WinPerl to ever again use slow htonl() from WinSock.dll for any reason?

Is this build option to revert back to slow htonl() in Windows only, a waste of screen space, tarball space, and repo space?

I am keeping the Win32 MSVC specific intrinsic logic in perl.h right next to the P5P Pure ISO C polyfill version instead the byte swaping backends living in 2 different .h files.

I also believe, there are Linux/Commercial Unix C compilers out there, that also have this optimization bug, where return ((x & 0xFF) << 24) | ((x >> 24) & 0xFF) | ((x & 0x0000FF00) << 8) | ((x & 0x00FF0000) >> 8); is not regexp-ed by the closed source Unix CC at compile time, into the appropriate 1 opcode line CPU instruction for whatever CPU arch that Unix is running on. I think this optimization was invented by GCC, copy-catted by Clang because of public outrage, and doesn't exist in any other open or closed source C compilers.

Its reasonable for the dev team of a C compiler, to assume, someone writing return ((x & 0xFF) << 24) | ((x >> 24) & 0xFF) | ((x & 0x0000FF00) << 8) | ((x & 0x00FF0000) >> 8); instead of reading the man page for their C compiler, is an incompetent fool and "We" (CC authors) will not correct such "user error/user incompetent" C code. memcpy()/strlen()/memcmp() are more reasonable to assume they can turn into 1 CPU opcode inline intrinsic, since all 3 are just "1 node/1 C struct/1 AST node" inside the C compiler. return ((x & 0xFF) << 24) | ((x >> 24) & 0xFF) | ((x & 0x0000FF00) << 8) | ((x & 0x00FF0000) >> 8); requires a regexp inside the C compiler matching atleast "23 nodes/23 C structs/23 AST nodes" before doing the optimization to bswap instruction. Now the C compiler has an O(n*23) CPU usage problem from what used to be O(1).

BTW, for htons() on i386/x64, its done with roll16((U16) n, 8) or (U16)n = ((U16)n) >ROLL> 8;, I dont think a bswap16 exists on Intel CPUs. But remember folks, ISO C, has never, and never will add math operators roll, pop_count(), count_leading_0s(), **, and count_leading_1s().

exp() is a linker symbol, not a C math operator, please contact the authors of man page of libc.so to learn more about linker symbol exp(). Maybe it is an abbreviation of expires and opens gdb instantly when you call it and pauses the process.

I think its a very bad idea for perl.h to override htonl()/etc, 100% of the time, on all CCs and all CPUs and all OSes. Lets assume all CCs on earth do the RIGHT THING like GCC and Clang do, until some human uses a dis-assembler, and proves Foo C Compiler v0.0 -O2 did something wrong. The OS authors/OS .h authors/CC authors will always know more, and keep better routine maintenance and watch over what C token htonl() does under the hood on their own products/projects, than P5P volunteers can do.

But !!!! P5P can't predict what current or future CCs, and what future CPU archs, will need perl.h to re-implement htonl(), hence I'm leaving provisions that Perl on Linux/Android/Solaris/AIX/IBM Z/OS, will need the same hooking WinPerl needs.

My risk assessment is, return ((x & 0xFF) << 24) | ((x >> 24) & 0xFF) | ((x & 0x0000FF00) << 8) | ((x & 0x00FF0000) >> 8); will fail to optimize or fail to inline, in machine code of current or future RISCV CPUs and WASM CPUs. WASM is a real CPU ISA BTW. The spec was written so a hardware CPU can be lithographied that runs un-recompile WASM byte sequences right off the DDR ram sticks. ARM32 can execute the byte stream in unmodified .class disk files remember, if your load the correct DRM license key into the ARM CPU at runtime to enable the feature.

Also, the commit in this PR needs less src code comments.

But I don't know where and how to document this new "hooking htonl()" feature.

  • 1 large comment in 1 place, single line comments every where else?
  • 1 large comment in 1 place, no comments anywhere else?
  • alot of 1 line comments saying "go read PR # 987654"?
my failed attempt at humor that Im going to hide instead of deleting What does "host" mean in the spec? Whose host, which address, which room and year in history?

What is "endianness" in ISO C?

Your app has lost the portability game, it is now aware of the OS Vendor's ascii string name and aware of the ascii string name of a CPU arch, which is a big no no to "portable" SW. Remember Abstract OS v𝑖² changes the byteorder using an RNG of your C program and its integer math operators each time your create a new process to run it, to 1 of 8 possible byte orders for doing math using C type uint32_t. Write a unit test involving, htonl(), memcpy(), fwrite(), and a usb stick being walked between 2 random Linux or Unix systems anywhere on earth jkjk

commit msg below


Since day 1 of Win32/64, C symbols/tokens htonl(), htons(), ntohs(), and ntohs() have been extern "C" PE symbol table exported functions from ws2_32.dll. WinSock aka WinOS's front end public facing lib for TCP/IP along with alot of legacy support for a dozen other protocols that nobody remembers. These 4 functions do 1 things and 1 thing only, swap bytes around. Windows 16-64 has and forever will be a LE OS regardless of CPU archs/HW. Mixed endian systems don't exist. Undumping a process between BE/LE HW Unix boxes or any OS kernels doesn't exist.

Storable.xs and PP keywords pack/unpack heavily use these 4 functions, but they have nothing to do with ethernet, token ring, or TCPIP. All modern CPUs (i386/x64/ARM) have a dedicated CPU opcode for BE/LE swapping. x86 introduced the 32 bit/U32 bswap instruction with the release of the i486. U16 variables can use i386's "ror eax, 8" (bitwise roll 8 bits).

Both Mingw GCC and MSVC CCs, will never optimize htonl/htons/ntohl/ntohs to 1 opcode, ask for the C linker symbol, and you will get the linker symbol. So Storable.xs and PP pack/unpack on both CCs, are calling Winsock's exported functions to do the swap. This is slow. It has overhead.

MSVC 2022 CC binaries prior to build number 19.37 (released Aug 2023), don't even recognize "((x & 0xFF)<<24)|((x>>24) & 0xFF)|((x & 0x0000FF00) <<8)|((x & 0x00FF0000)>>8)" as a C level euphemism for "I want the byte swap CPU instruction". To fix all this, hook the 4 functions with the CPP to prevent the 4 tokens from calling the Winsock export table implimentations. 2nd, with all MSVC CCs versions use the official proper MS specific way to do a "byte swap". Which is instrinsic function _byteswap_ulong(). Theoretically RtlUlongByteSwap() also exists, but its much less mentioned on MSDN and the general WWW, and lives in "wdm.h" which is intended for Ring 0 kernel drivers, so header clashes can occur now, or in the future, trying to use a Ring 0 .h in a mostly user mode .h #include chain.

Since _byteswap_ulong() is an intrinsic function and not a macro, S_my_swap32() isn't needed to prevent multi-eval problems.

htonll()/ntohll() were added simply because it was easy to write them. 2 different algorithms for htonll() are included, I picked the one with less C src code ops, but I didn't check if neither, 1 or the other, or both are auto-detected by GCC and Clang as a 64 bit byte-swap instrinsic request without formally asking for the instrinsic with a named identifier.

The vtable hooks that CPerlHost/iperlsys.h use to implement psuedo-fork and ithreads, also had to be disabled for the 4 functions. Or else XSUB.h will redirect all CPAN XS mods (not -DPERL_CORE !), to the CPerlHost and iperlsys.h vtables, which then redirect to winsock, negating this fix.

Other less-than-perfect CC/CPUs/OSes combos might be discovered in the future, so PERL_MY_HOST_NET_BYTE_SWAP define is cross OS and not Win32 specific. S_my_swap32() isn't being turned on for any OSes/CCs other than Mingw and MSVC on Win32, because "if it ain't broke, don't fix it". If the CCs and .h'es for Linux/Android/OSX are already perfect, if Perl attempts to hook, intercept, or use a token or identifier from 6 years ago, not 18 months ago, more harm (deoptimization) can happen that good. MSVC produced 11-15 CPU ops for S_my_swap32()'s "ISO C" macro, before VC 2022 19.37 Aug 2023. Winsock has the same exact machine code.

I'll assume MS/other major Windows corporate users, assumed that "ISO C" byte swap macro is fundamentally wrong, since acknowleging that endianness exists in IT, violates "ISO C", and that code base is now "some Vendor's C", so why fix htonl() or that long non-CPU arch specific macro, instead of the intrinsic? That program already is aware of OS and Vendor and platform names and isn't portable,


  • This set of changes requires a perldelta entry, and I need help writing it. <<<<< TODO: THIS ONE

@bulk88
Copy link
Contributor Author

bulk88 commented May 26, 2025

PostgreSQL found this exact performance problem on Windows OS (WinSock.dll problem) and on MSVC (the & << | macro and final mach code output) and fixed both in September 2017.

postgres/postgres@510b8cb

Read entire thread, someone accuses GCC on Linux -O2 (year 2017) of emitting "very low quality" machine code for GCCs/Linuxs attempt to do an inlined htonl() optimization, written in GCC inline assembly grammar. I (bulk88) guess that ML post author is complaining about 100% pointless and redundant register to register mov opcodes. GCC IDK when in history solved this by inventing a non-back compat V2 GCC inline assembly grammar called https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html

https://www.postgresql.org/message-id/20170914063418.sckdzgjfrsbekae4%40alap3.anarazel.de

update:

with my buggy VC 2022 v19.36, store_array() in Storable.xs is now a 99.999% perfect

mov     rcx, [rbp+var_10]
mov     eax, ecx
shr     rcx, 20h
bswap   ecx
mov     [rbp+Src], ecx
bswap   eax
mov     [rbp+arg_C], eax
test    rdx, rdx
jnz     loc_1800042E0

I am going pretend Storable.pm is not a P5P maintained product in the next sentence. The 0.001% defect is Storeable.xs writing htonl() twice in src code, to emulate htonll(), which is not P5P's problem or responsibility. More technically accurate, its not a problem or responsibility of P5P's collection of .h files and P5P's libperl.so file.

@iabyn
Copy link
Contributor

iabyn commented May 26, 2025

I have no strong opinions on whether this commit is the best solution to the problem being addressed (I tend to steer clear of Win32 stuff), but I am strongly opposed to the proposed commit message. It is verbose and unclear, and makes it very difficult to see at a glance what is being changed.

Here's an example of what I think such a commit message should look like. I'm not proposing this as the actual text of the commit message, since I don't understand the real details of the commit. It's to give you an idea of what I regard as a suitable commit message.


Win32: speed up htons etc byte-order swapping

Various part of perl, in particular pack() and Storable, make extensive use of the htonl/htons/ntohl/ntohs library functions to swap the byte order of short and long ints.

Unfortunately in Win32 these functions can often be relatively slow, for the reasons described below. This commit introduces a new define, PERL_MY_HOST_NET_BYTE_SWAP (set by default). If set, then htons etc, which were formerly mapped to win32_htonl etc, are now mapped to the pre-existing inline my_swap32 etc functions.
In addition, the implementation of my_swap32 etc has been improved, using the compiler-specific _byteswap_uint64 pragma if available.

On non-win32 platforms, nothing has changed.

... long explanation of the issue ...

# define HAS_HTONLL
# if (BYTEORDER & 0xffff) == 0x4321
# define ntohll(x) ((x)&0xFFFFFFFFFFFFFFFF)
# define htonll(x) ntohll(x)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't actually thing we need ntohll and htonll

Copy link
Contributor Author

@bulk88 bulk88 May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/Perl/perl5/blob/blead/pp_pack.c#L146
https://github.com/Perl/perl5/blob/blead/pp_pack.c#L240
https://github.com/Perl/perl5/blob/blead/pp_pack.c#L3026
https://github.com/Perl/perl5/blob/blead/pp_pack.c#L2966

PP pack(), is doing the worst possible algorithm to swap byte order for 'Q'/U64.

https://github.com/Perl/perl5/blob/blead/ext/B/B.xs#L1608

B:: wants htonll at this line and is DIYing it.

https://github.com/Perl/perl5/blob/blead/cpan/Digest-SHA/src/sha.h#L91

Digest::SHA wants htonll at this line and is DIYing it.

https://github.com/Perl/perl5/blob/blead/dist/Storable/Storable.xs#L1059

Storable wants htonll at this line and is DIYing it.

https://github.com/Perl/perl5/blob/blead/ext/XS-APItest/APItest.xs#L7881

XS::APItest and P5P's default hash algo S_perl_siphash_seed_state() want it on BE CPUs

https://grep.metacpan.org/search?q=ntohll&qft=*.xs%2C*.c%2C*.h&qd=&qifl=

atleast 5 CPAN modules are DIYing it or using OS version or it, there are more, im using a narrow regexp

Copy link
Contributor Author

@bulk88 bulk88 May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't actually thing we need ntohll and htonll

Whats your thoughts about evangelizing to CPAN authors identifiers my_swap64 or htonll or PerlSock_htonll?

_ # 1 and # 3 are P5P proprietary. # 2 is filling in missing POSIX APIs, where the perl VM impl, will get (good thing) accidental usage by CPAN XS authors, who dont know, dont care, or bare minimum care about P5P invented identifiers.

These XS authors are beginner or low effort or FT non-Perl C devs but low time people. So they will use max POSIX C/Linux C tokens that they are comfortable with where possible, vs the P5P invented "portable" C tokens that are an alien language from an extraterrestrial planet to them.

@Leont
Copy link
Contributor

Leont commented May 26, 2025

The idea of the change sounds like a very good idea, but I can't help wondering if it can't be wired more simply.

@bulk88
Copy link
Contributor Author

bulk88 commented May 27, 2025

The idea of the change sounds like a very good idea, but I can't help wondering if it can't be wired more simply.

What are your thoughts on dropping the "disable the hook/optimization and use slow OS htonl" build option nobody on WinOS or AnyOS will ever use or know its there? It will reduce the visual complexity and length of this patch if I do that. It exists because of some tests/benchmarks analysis I was doing in writing the patch, and I decided to not delete it, although I could've/should've.

Remember the hook is turned on by P5P only for know problem platforms (currently WinOS, all CCs), never for all platforms.

…order

Changing BE/LE byte order is a very common operation used in many
places inside libperl, and some core bundled XS modules, and in many
CPAN XS modules. Storable.xs and PP pack()/unpack() are the largest most
frequent users of byte swapping in Perl interp git repo. Some Perl interp
repo .c or .h or .xs files DIY their own byte swap algorithms with macros
or static inline functions. Others like Storable.xs and PP pack()/unpack()
use the htonl(), htons(), ntohs(), and ntohs() macros/functions provided
by the OS/CC, or depending on config.sh, equivelent polyfills from perl.h.

On all modern CPU archs, i386/ARM/x64/PowerPC, there is a dedicate CPU
instruction for doing it. Perl on Linux compiled with Clang or GCC,
automatically is using the appropriate inlined intrinsic CPU opcode
when using the CCs/OSes htonl(). Those 2 CCs also recogize perl.h's
my_swap32() algorithm, and all other .h/.c files that use that statement
as a euphemism to inline htonl() to 1 CPU opcode.

On Windows, the situation is very bad compare to the above.
The htonl(), htons(), ntohs(), and ntohs() functions/macros,
provided to Perl from both MSVC and GCC, are exceptionally slow and
inefficient, and unoptimized. MSVC has further problems, generating
very inefficient machine code to swap byte order, using the DIY shift
and mask algorithm.

Since day 1 of Win32/64, C symbols/tokens htonl(), htons(), ntohs(), and
ntohs() have been extern "C" PE symbol table exported functions from
ws2_32.dll. ws2_32.dll is AKA WinSock, Win32/64's front end public facing
lib for TCP/IP sockets. It is not a light weight DLL, but loads multiple
other DLLs, filter/FW DLLs, middleware and backendware DLLs. WinPerl delay
loads (RTLD_LAZY) the Winsock DLL until the first attempt is made by
[lib] perl5XX.dll to go on the WWW.

These 4 functions do 1 things and 1 thing only, swap bytes around.

Storable.xs and PP keywords pack/unpack heavily use these 4 functions, but
they have nothing to do with ethernet, token ring, or TCPIP. They should
be using the correct inlined single CPU instruction to do this. All modern
CPUs (i386/x64/ARM) have a dedicated CPU opcode for BE/LE swapping.
x86 introduced the 32 bit/U32 bswap instruction with the release of the
i486. U16 variables can use i386's "ror eax, 8" (bitwise roll 8 bits).

Both Mingw GCC and MSVC CCs, will never optimize htonl/htons/ntohl/ntohs
to 1 opcode on Win32/64, ask for the C linker symbol, and you will get
the linker symbol. So Storable.xs and PP pack/unpack on both CCs, are
calling Winsock's exported functions to do the swap. This is slow since
its a function call. Even worse, its not 1 opcode inside ws2_32.dll but
11 to 15 CPU ops.

MSVC 2022 CC binaries prior to build number 19.37 (released Aug 2023),
don't even recognize "((x & 0xFF)<<24)|((x>>24) & 0xFF)|((x & 0x0000FF00)
<<8)|((x & 0x00FF0000)>>8)" as a C level euphemism for
"I want the byte swap CPU instruction". Mingw GCC does recognize that
macro to mean 1 CPU instruction, but Mingw GCC won't modify or optimize
MS's htonl()/etc identifiers/symbols for you, you need to DIY that macro
yourself to trigger the 1 CPU op optimization.

To fix all this, hook the 4 functions with the CPP to prevent the 4
tokens from calling the Winsock export table implimentations. 2nd, with
all MSVC CCs versions use the official proper MS specific way to do a
"byte swap". Which is instrinsic function _byteswap_ulong().
Theoretically RtlUlongByteSwap() also exists, but its much less mentioned
on MSDN and the general WWW, and lives in "wdm.h" which is intended
for Ring 0 kernel drivers, so header clashes can occur now, or in the
future, so pick the easier and simple _byteswap_ulong() category of
intrinsics, not the Ring 0 ones.

Since _byteswap_ulong() is an intrinsic function and not a macro,
S_my_swap32() isn't needed to prevent multi-eval problems.

htonll()/ntohll() were added simply because it was easy to write them,
and multiple interp core .c files, and interp core .xs files want a 64 bit
byte swap function/macro, since most people use 64 bit pointer CPUs.

2 different algorithms for htonll() are included, I picked the one with
less C src code ops, but I didn't check if neither, 1 or the other, or
both are auto-detected by GCC and Clang as a 64 bit byte-swap instrinsic
request without formally asking for the instrinsic with a named identifier.

The vtable hooks that CPerlHost/iperlsys.h use to implement psuedo-fork
and ithreads, also had to be disabled for the 4 functions. Or else XSUB.h
will redirect all CPAN XS mods (not -DPERL_CORE !), to the CPerlHost
and iperlsys.h vtables, which then redirect to winsock, negating this fix.

Other less-than-perfect CC/CPUs/OSes combos might be discovered in the
future, so PERL_MY_HOST_NET_BYTE_SWAP define is cross OS and not Win32
specific. Rumors online say GCC only added its __builtin_bswap32() and
matching that macro, in 2008/v4.0.

So S_my_swap32() isn't being turned on for any OSes/CCs other than
Mingw and MSVC on Win32, because "if it ain't broke, don't fix it".
If the CCs and .h'es for Linux/Android/OSX are already perfect, if Perl
attempts to hook, intercept, or use a token or identifier from 6 years
ago, not 18 months ago, more harm (deoptimization) can happen that good.
MSVC produced 11-15 CPU ops for S_my_swap32()'s "ISO C" macro, before
VC 2022 19.37 Aug 2023. Winsock has the same exact machine code.

I'll assume MS/other major Windows corporate users, assumed that
"ISO C" byte swap macro is fundamentally wrong, since acknowleging
that endianness exists violates "ISO C", and that code base is
now "some Vendor's C", so why fix htonl() or that long non-CPU arch
specific macro, instead of the intrinsic? That program already is aware of
OS and Vendor and platform names and isn't portable
@bulk88 bulk88 force-pushed the win32_no_winsock_byte_swaper_funcs branch from 314ac03 to 002f8ba Compare May 27, 2025 11:47
@bulk88
Copy link
Contributor Author

bulk88 commented May 28, 2025

gcc-mirror/gcc@73984f8

GCC added pattern matching the DIYed htonl() macro only recently in 2014 if above commit is true

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants