Vectorize `search` for 32-bit and 64-bit elements, also improve 8-bit and 16-bit vectorization #5484
I'd be curious about benchmark results for this new Only one of algorithm type is of interest, so but for all element sizes, also I've modified the benchmarks before the actual changes, so commit 1fb4bfe is a convenient point for benchmarking the "before" state. My own results for E-core affinity
Click to expand 5950X results:
Ignoring the /4 and /5 "evil" cases, I'm somewhat concerned about perf regressions for 1-byte search.
I then suggest restructuring so that on forward
if constexpr (sizeof(_Ty) >= 4) {
    if (_Use_avx2()) {
        // new AVX2 path
    }
} else {
    if (_Use_sse42()) {
        // old SSE4.2 path
    }
}
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.
🔍 🕵️ 🔎
📜 Overview
AVX2-based search for all four element widths, modeled on the SSE4.2 approach for 8-bit and 16-bit elements.

Since there is no `pcmpestr*` instruction for the AVX vector width, a traditional `find`-like approach is used instead.

Notable differences:
- `pcmpestr*` compares all characters that fit, whereas `pcmpeq*` compares only the first, so the rest of the characters need to be confirmed even if the needle fits in the vector
- For `search` (forward search), the variable step is no longer better than the fixed step. See below
- `pcmpeq*` returns a byte bitmask instead of an element bitmask, so the mask is masked to contain only the initial byte positions; also, some element size multiplications are not needed
- `^= (1 << _Pos)` is used instead of `_bittestandreset`. Apparently, this is faster, and should have been used for SSE too

🚶 Fixed vs variable step
For SSE4.2, there is the `pcmpestri` instruction, which returns the first or last match position, and the `pcmpestrm` instruction, which returns a bitmask of matches. `pcmpestri` is faster; also, `pcmpestri` is not affected by DevCom-10689455, unlike `pcmpestrm`.

Both instructions check all the characters they can, not just the initial character of a potential match. With a smaller needle they can end up making a full match, if the match start is closer to the beginning of the haystack part vector.

For `search` (the forward one), this makes it appealing to use only `pcmpestri` for matching: if the match does not fit, step so that `pcmpestri` would have matched, so that it is a full match next time. If the match is not confirmed, we have to step so that we skip the unconfirmed match but attempt to match immediately past it. This results in a variable step, but generally faster progress if there are not too many apparent matches.

For `find_end` (the backward one) this does not work well, as the matches we need to prioritize are closer to the end, which would result in more matches to confirm with an additional comparison, and a still smaller step, so `pcmpestri` is clearly worse than `pcmpestrm` with a fixed step.

For AVX2, we only match the first character, so there are no cases where a variable step is better.
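The fixed-step, first-element approach can be sketched in portable scalar C++. This is not the code from this PR: it uses no intrinsics, and the compare-and-extract-byte-mask step is emulated with a plain loop so that the example runs anywhere. All names here (`block_bytes`, `initial_byte_positions`, `byte_mask_of_first_matches`, `lowest_set_bit`, `search_sketch`) are hypothetical. It only illustrates three points from the description: masking the byte bitmask down to initial byte positions, clearing handled bits with `^=`, and always advancing by one full block.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t block_bytes = 32; // width of one AVX2 vector, in bytes

// Bits at the first byte of every element, e.g. 0x55555555 for 2-byte
// elements, 0x11111111 for 4-byte elements, 0x01010101 for 8-byte elements.
template <class T>
constexpr std::uint32_t initial_byte_positions() {
    std::uint32_t mask = 0;
    for (std::size_t i = 0; i < block_bytes; i += sizeof(T)) {
        mask |= std::uint32_t{1} << i;
    }
    return mask;
}

// Scalar stand-in (an assumption of this sketch) for comparing one block
// against the first needle element and extracting a byte mask: every byte
// of an equal element contributes one set bit.
template <class T>
std::uint32_t byte_mask_of_first_matches(const T* block, T first) {
    std::uint32_t mask = 0;
    for (std::size_t i = 0; i < block_bytes / sizeof(T); ++i) {
        if (block[i] == first) {
            for (std::size_t b = 0; b < sizeof(T); ++b) {
                mask |= std::uint32_t{1} << (i * sizeof(T) + b);
            }
        }
    }
    return mask;
}

// Index of the lowest set bit; the caller guarantees mask != 0.
inline unsigned lowest_set_bit(std::uint32_t mask) {
    unsigned bit = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        ++bit;
    }
    return bit;
}

// Returns the index of the first occurrence of needle in hay,
// or hay.size() if there is none.
template <class T>
std::size_t search_sketch(const std::vector<T>& hay, const std::vector<T>& needle) {
    constexpr std::size_t elems = block_bytes / sizeof(T);
    if (needle.empty()) return 0;
    if (needle.size() > hay.size()) return hay.size();
    const std::size_t last_start = hay.size() - needle.size();

    std::size_t pos = 0;
    for (; pos + elems <= hay.size(); pos += elems) { // fixed step: one block
        // Keep only one bit per element, at its initial byte.
        std::uint32_t mask = byte_mask_of_first_matches(hay.data() + pos, needle[0])
                           & initial_byte_positions<T>();
        while (mask != 0) {
            const unsigned bit = lowest_set_bit(mask);
            const std::size_t cand = pos + bit / sizeof(T);
            if (cand > last_start) return hay.size(); // needle can no longer fit
            // Only the first element matched so far; confirm the rest.
            if (std::equal(needle.begin(), needle.end(), hay.data() + cand)) return cand;
            mask ^= std::uint32_t{1} << bit; // clear the handled candidate
        }
    }
    for (; pos <= last_start; ++pos) { // scalar tail for the last partial block
        if (std::equal(needle.begin(), needle.end(), hay.data() + pos)) return pos;
    }
    return hay.size();
}
```

Because each candidate only proves that the first element matched, every one needs a confirming comparison, so knowing where in the block a candidate sits gives no reason to pick a different step size.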
🐌 Perf bug fix in SSE4.2 path
There was a confusion of elements and bytes in the SSE4.2 path of the `find_end` remaining part check. The intention was to mask out the positions that were already checked. It worked correctly for 8-bit elements, but masked out too few bits for 16-bit elements. Although the impact is very small, I'm fixing this here in a separate commit, to avoid inconsistencies and make the comparison of these branches easier.

⏱️ Benchmark results
STL/benchmarks/src/search.cpp
Lines 42 to 50 in 396eaf8
classic_search<std::uint8_t>/0
classic_search<std::uint8_t>/1
classic_search<std::uint8_t>/2
classic_search<std::uint8_t>/3
classic_search<std::uint8_t>/4
classic_search<std::uint8_t>/5
classic_search<std::uint16_t>/0
classic_search<std::uint16_t>/1
classic_search<std::uint16_t>/2
classic_search<std::uint16_t>/3
classic_search<std::uint16_t>/4
classic_search<std::uint16_t>/5
classic_search<std::uint32_t>/0
classic_search<std::uint32_t>/1
classic_search<std::uint32_t>/2
classic_search<std::uint32_t>/3
classic_search<std::uint32_t>/4
classic_search<std::uint32_t>/5
classic_search<std::uint64_t>/0
classic_search<std::uint64_t>/1
classic_search<std::uint64_t>/2
classic_search<std::uint64_t>/3
classic_search<std::uint64_t>/4
classic_search<std::uint64_t>/5
classic_find_end<std::uint8_t>/0
classic_find_end<std::uint8_t>/1
classic_find_end<std::uint8_t>/2
classic_find_end<std::uint8_t>/3
classic_find_end<std::uint8_t>/4
classic_find_end<std::uint8_t>/5
classic_find_end<std::uint16_t>/0
classic_find_end<std::uint16_t>/1
classic_find_end<std::uint16_t>/2
classic_find_end<std::uint16_t>/3
classic_find_end<std::uint16_t>/4
classic_find_end<std::uint16_t>/5
classic_find_end<std::uint32_t>/0
classic_find_end<std::uint32_t>/1
classic_find_end<std::uint32_t>/2
classic_find_end<std::uint32_t>/3
classic_find_end<std::uint32_t>/4
classic_find_end<std::uint32_t>/5
classic_find_end<std::uint64_t>/0
classic_find_end<std::uint64_t>/1
classic_find_end<std::uint64_t>/2
classic_find_end<std::uint64_t>/3
classic_find_end<std::uint64_t>/4
classic_find_end<std::uint64_t>/5
Other results include unreachable perf of `strstr` and the same cases reached via other codepaths
c_strstr/0
c_strstr/1
c_strstr/2
c_strstr/3
c_strstr/4
c_strstr/5
ranges_search<std::uint8_t>/0
ranges_search<std::uint8_t>/1
ranges_search<std::uint8_t>/2
ranges_search<std::uint8_t>/3
ranges_search<std::uint8_t>/4
ranges_search<std::uint8_t>/5
ranges_search<std::uint16_t>/0
ranges_search<std::uint16_t>/1
ranges_search<std::uint16_t>/2
ranges_search<std::uint16_t>/3
ranges_search<std::uint16_t>/4
ranges_search<std::uint16_t>/5
ranges_search<std::uint32_t>/0
ranges_search<std::uint32_t>/1
ranges_search<std::uint32_t>/2
ranges_search<std::uint32_t>/3
ranges_search<std::uint32_t>/4
ranges_search<std::uint32_t>/5
ranges_search<std::uint64_t>/0
ranges_search<std::uint64_t>/1
ranges_search<std::uint64_t>/2
ranges_search<std::uint64_t>/3
ranges_search<std::uint64_t>/4
ranges_search<std::uint64_t>/5
search_default_searcher<std::uint8_t>/0
search_default_searcher<std::uint8_t>/1
search_default_searcher<std::uint8_t>/2
search_default_searcher<std::uint8_t>/3
search_default_searcher<std::uint8_t>/4
search_default_searcher<std::uint8_t>/5
search_default_searcher<std::uint16_t>/0
search_default_searcher<std::uint16_t>/1
search_default_searcher<std::uint16_t>/2
search_default_searcher<std::uint16_t>/3
search_default_searcher<std::uint16_t>/4
search_default_searcher<std::uint16_t>/5
search_default_searcher<std::uint32_t>/0
search_default_searcher<std::uint32_t>/1
search_default_searcher<std::uint32_t>/2
search_default_searcher<std::uint32_t>/3
search_default_searcher<std::uint32_t>/4
search_default_searcher<std::uint32_t>/5
search_default_searcher<std::uint64_t>/0
search_default_searcher<std::uint64_t>/1
search_default_searcher<std::uint64_t>/2
search_default_searcher<std::uint64_t>/3
search_default_searcher<std::uint64_t>/4
search_default_searcher<std::uint64_t>/5
member_find<not_highly_aligned_string>/0
member_find<not_highly_aligned_string>/1
member_find<not_highly_aligned_string>/2
member_find<not_highly_aligned_string>/3
member_find<not_highly_aligned_string>/4
member_find<not_highly_aligned_string>/5
member_find<not_highly_aligned_wstring>/0
member_find<not_highly_aligned_wstring>/1
member_find<not_highly_aligned_wstring>/2
member_find<not_highly_aligned_wstring>/3
member_find<not_highly_aligned_wstring>/4
member_find<not_highly_aligned_wstring>/5
ranges_find_end<std::uint8_t>/0
ranges_find_end<std::uint8_t>/1
ranges_find_end<std::uint8_t>/2
ranges_find_end<std::uint8_t>/3
ranges_find_end<std::uint8_t>/4
ranges_find_end<std::uint8_t>/5
ranges_find_end<std::uint16_t>/0
ranges_find_end<std::uint16_t>/1
ranges_find_end<std::uint16_t>/2
ranges_find_end<std::uint16_t>/3
ranges_find_end<std::uint16_t>/4
ranges_find_end<std::uint16_t>/5
ranges_find_end<std::uint32_t>/0
ranges_find_end<std::uint32_t>/1
ranges_find_end<std::uint32_t>/2
ranges_find_end<std::uint32_t>/3
ranges_find_end<std::uint32_t>/4
ranges_find_end<std::uint32_t>/5
ranges_find_end<std::uint64_t>/0
ranges_find_end<std::uint64_t>/1
ranges_find_end<std::uint64_t>/2
ranges_find_end<std::uint64_t>/3
ranges_find_end<std::uint64_t>/4
ranges_find_end<std::uint64_t>/5
member_rfind<not_highly_aligned_string>/0
member_rfind<not_highly_aligned_string>/1
member_rfind<not_highly_aligned_string>/2
member_rfind<not_highly_aligned_string>/3
member_rfind<not_highly_aligned_string>/4
member_rfind<not_highly_aligned_string>/5
member_rfind<not_highly_aligned_wstring>/0
member_rfind<not_highly_aligned_wstring>/1
member_rfind<not_highly_aligned_wstring>/2
member_rfind<not_highly_aligned_wstring>/3
member_rfind<not_highly_aligned_wstring>/4
member_rfind<not_highly_aligned_wstring>/5