Vectorize search for 32-bit and 64-bit elements, also improve 8-bit and 16-bit vectorization #5484

Merged
merged 12 commits into from
May 17, 2025

AlexGuteniev
Contributor

@AlexGuteniev AlexGuteniev commented May 10, 2025

📜 Overview

AVX2-based search for all four element widths, modeled on the SSE4.2 approach for 8-bit and 16-bit elements.
Because the pcmpestr* instructions are not available at AVX vector width, a traditional find-like approach is used instead.

Notable differences:

  • pcmpestr* compares all characters that fit, whereas pcmpeq* compares only the first one, so the rest of the characters must be confirmed even if the needle fits in the vector
  • as a result, for search (the forward search), the variable step is no longer better than the fixed step; see below
  • pcmpeq* returns a byte bitmask instead of an element bitmask, so the mask is masked to keep only the initial byte position of each element; also, some element size multiplications are no longer needed
  • for 32-bit and 64-bit elements, an AVX2 masked load can be used to load a partial value, instead of a temporary buffer
  • ^= (1 << _Pos) is used instead of _bittestandreset. Apparently this is faster, and it should have been used for SSE as well
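To illustrate the byte bitmask handling from the list above, here is a minimal portable sketch. It does not use intrinsics and is not the PR's code: the 32-bit `byte_mask` stands in for what a byte-wise movemask of a 32-byte compare result would give, and the names `lowest_set_bit`, `keep_initial_bytes`, and `candidate_offsets` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Index of the lowest set bit. Precondition: mask != 0.
inline unsigned lowest_set_bit(std::uint32_t mask) {
    assert(mask != 0);
    unsigned pos = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        ++pos;
    }
    return pos;
}

// Keep only the bit that corresponds to the first byte of every element.
// elem_size is the element width in bytes (1, 2, 4, or 8).
inline std::uint32_t keep_initial_bytes(std::uint32_t byte_mask, std::size_t elem_size) {
    std::uint32_t keep = 0;
    for (std::size_t bit = 0; bit < 32; bit += elem_size) {
        keep |= std::uint32_t{1} << bit;
    }
    return byte_mask & keep;
}

// Collect the byte offsets of candidate matches, lowest first.
// The offsets are already in bytes, so no multiplication by elem_size is needed.
inline std::vector<unsigned> candidate_offsets(std::uint32_t byte_mask, std::size_t elem_size) {
    std::vector<unsigned> result;
    std::uint32_t mask = keep_initial_bytes(byte_mask, elem_size);
    while (mask != 0) {
        const unsigned pos = lowest_set_bit(mask);
        result.push_back(pos);
        mask ^= std::uint32_t{1} << pos; // clear the bit that was just handled
    }
    return result;
}
```

For example, with 4-byte elements a byte mask of `0x0000F0F0` (elements 1 and 3 equal to the first needle element) reduces to `0x00001010`, which yields the byte offsets 4 and 12.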

🚶 Fixed vs variable step

For SSE4.2, there is the pcmpestri instruction, which returns the first or last match position, and the pcmpestrm instruction, which returns a bitmask of matches. pcmpestri is faster, and unlike pcmpestrm it is not affected by DevCom-10689455.

Both instructions check all the characters they can, not just the initial character of a potential match. With a smaller needle they can end up making a full match, if the match starts close enough to the beginning of the haystack part held in the vector.

For search (the forward one), this makes it appealing to use only pcmpestri for matching: if the match does not fit in the vector, take a step such that pcmpestri would match it in full next time. If the match is not confirmed, we have to take a step that skips the unconfirmed match, but attempts to match immediately past it. This results in a variable step, but generally faster progress if there are not too many apparent matches.

For find_end (the backward one) this does not work well: the matches we need to prioritize are closer to the end, which results in more matches that need to be confirmed with an additional comparison, and a still smaller step. So pcmpestri is clearly worse there than pcmpestrm with a fixed step.

For AVX2, we only match the first character, so there are no cases where the variable step is better.
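The fixed-step approach can be sketched in portable scalar C++ as follows. This is only an illustration of the control flow, not the PR's implementation: the inner loop that builds `mask` stands in for a vector compare of a 32-byte block against the first needle element, and `fixed_step_search` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the index of the first occurrence of needle in haystack,
// or haystack.size() if there is none.
template <class T>
std::size_t fixed_step_search(const std::vector<T>& haystack, const std::vector<T>& needle) {
    constexpr std::size_t block = 32 / sizeof(T); // elements per 32-byte "vector"
    if (needle.empty()) {
        return 0;
    }
    if (needle.size() > haystack.size()) {
        return haystack.size();
    }
    const std::size_t last_start = haystack.size() - needle.size();

    for (std::size_t base = 0; base <= last_start; base += block) { // fixed step
        const std::size_t count = std::min(block, last_start - base + 1);

        // Stand-in for the vector compare: one bit per element equal to needle[0].
        std::uint32_t mask = 0;
        for (std::size_t i = 0; i < count; ++i) {
            if (haystack[base + i] == needle[0]) {
                mask |= std::uint32_t{1} << i;
            }
        }

        // Only the first element is known to match, so confirm every candidate.
        while (mask != 0) {
            std::size_t pos = 0;
            while (((mask >> pos) & 1u) == 0) {
                ++pos;
            }
            assert(base + pos <= last_start);
            const auto start = haystack.begin() + static_cast<std::ptrdiff_t>(base + pos);
            if (std::equal(needle.begin(), needle.end(), start)) {
                return base + pos;
            }
            mask ^= std::uint32_t{1} << pos; // drop the rejected candidate
        }
    }
    return haystack.size();
}
```

Since every candidate needs a separate confirmation anyway, the step does not depend on where candidates were found, which is why a variable step brings no benefit here.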

🐌 Perf bug fix in SSE4.2 path

There was a confusion of elements and bytes in the SSE4.2 path of the find_end remaining part check. The intention was to mask out the positions that were already checked. It worked correctly for 8-bit elements, but masked out too few bits for 16-bit elements. Although the impact is very small, I'm fixing this here in a separate commit, to avoid inconsistencies and to make the comparison of these branches easier.

⏱️ Benchmark results

constexpr data_and_pattern patterns[] = {
/* 0. Small, closer to end */ {lorem_ipsum, "aliquet"sv},
/* 1. Large, closer to end */ {lorem_ipsum, "aliquet malesuada"sv},
/* 2. Small, closer to begin */ {lorem_ipsum, "pulvinar"sv},
/* 3. Large, closer to begin */ {lorem_ipsum, "dapibus elit interdum"sv},
/* 4. Small, evil */ {fill_pattern_view<3000, false>, fill_pattern_view<7, true>},
/* 5. Large, evil */ {fill_pattern_view<3000, false>, fill_pattern_view<20, true>},
};

Benchmark Before After Speedup
classic_search<std::uint8_t>/0 258 ns 231 ns 1.12
classic_search<std::uint8_t>/1 282 ns 256 ns 1.10
classic_search<std::uint8_t>/2 29.3 ns 18.2 ns 1.61
classic_search<std::uint8_t>/3 18.0 ns 13.4 ns 1.34
classic_search<std::uint8_t>/4 1533 ns 4756 ns 0.32
classic_search<std::uint8_t>/5 11641 ns 4590 ns 2.54
classic_search<std::uint16_t>/0 552 ns 262 ns 2.11
classic_search<std::uint16_t>/1 608 ns 275 ns 2.21
classic_search<std::uint16_t>/2 55.3 ns 22.3 ns 2.48
classic_search<std::uint16_t>/3 25.2 ns 14.6 ns 1.73
classic_search<std::uint16_t>/4 7801 ns 4016 ns 1.94
classic_search<std::uint16_t>/5 16790 ns 10673 ns 1.57
classic_search<std::uint32_t>/0 1522 ns 296 ns 5.14
classic_search<std::uint32_t>/1 1658 ns 383 ns 4.33
classic_search<std::uint32_t>/2 137 ns 24.5 ns 5.59
classic_search<std::uint32_t>/3 72.7 ns 18.9 ns 3.85
classic_search<std::uint32_t>/4 5355 ns 3693 ns 1.45
classic_search<std::uint32_t>/5 14815 ns 12080 ns 1.23
classic_search<std::uint64_t>/0 2303 ns 531 ns 4.34
classic_search<std::uint64_t>/1 2573 ns 586 ns 4.39
classic_search<std::uint64_t>/2 145 ns 39.6 ns 3.66
classic_search<std::uint64_t>/3 70.8 ns 22.5 ns 3.15
classic_search<std::uint64_t>/4 5706 ns 10248 ns 0.56
classic_search<std::uint64_t>/5 15606 ns 15303 ns 1.02
Benchmark Before After Speedup
classic_find_end<std::uint8_t>/0 21.3 ns 21.3 ns 1.00
classic_find_end<std::uint8_t>/1 20.3 ns 20.9 ns 0.97
classic_find_end<std::uint8_t>/2 300 ns 72.2 ns 4.16
classic_find_end<std::uint8_t>/3 361 ns 86.4 ns 4.18
classic_find_end<std::uint8_t>/4 4838 ns 3314 ns 1.46
classic_find_end<std::uint8_t>/5 15236 ns 3209 ns 4.75
classic_find_end<std::uint16_t>/0 37.4 ns 23.1 ns 1.62
classic_find_end<std::uint16_t>/1 36.6 ns 19.2 ns 1.91
classic_find_end<std::uint16_t>/2 599 ns 115 ns 5.21
classic_find_end<std::uint16_t>/3 701 ns 171 ns 4.10
classic_find_end<std::uint16_t>/4 9741 ns 2958 ns 3.29
classic_find_end<std::uint16_t>/5 15943 ns 10612 ns 1.50
classic_find_end<std::uint32_t>/0 99.7 ns 20.2 ns 4.94
classic_find_end<std::uint32_t>/1 105 ns 24.7 ns 4.25
classic_find_end<std::uint32_t>/2 1404 ns 200 ns 7.02
classic_find_end<std::uint32_t>/3 1725 ns 284 ns 6.07
classic_find_end<std::uint32_t>/4 5784 ns 2794 ns 2.07
classic_find_end<std::uint32_t>/5 17022 ns 10808 ns 1.57
classic_find_end<std::uint64_t>/0 138 ns 35.5 ns 3.89
classic_find_end<std::uint64_t>/1 118 ns 35.2 ns 3.35
classic_find_end<std::uint64_t>/2 1584 ns 426 ns 3.72
classic_find_end<std::uint64_t>/3 1821 ns 493 ns 3.69
classic_find_end<std::uint64_t>/4 5697 ns 9842 ns 0.58
classic_find_end<std::uint64_t>/5 15768 ns 13600 ns 1.16
Other results, including the unreachable perf of `strstr` and the same cases reached via other code paths:
Benchmark Before After Speedup
c_strstr/0 184 ns 183 ns 1.01
c_strstr/1 212 ns 215 ns 0.99
c_strstr/2 12.7 ns 13.4 ns 0.95
c_strstr/3 9.16 ns 10.1 ns 0.91
c_strstr/4 1402 ns 1478 ns 0.95
c_strstr/5 15762 ns 15807 ns 1.00
ranges_search<std::uint8_t>/0 257 ns 230 ns 1.12
ranges_search<std::uint8_t>/1 280 ns 255 ns 1.10
ranges_search<std::uint8_t>/2 29.0 ns 17.9 ns 1.62
ranges_search<std::uint8_t>/3 16.5 ns 13.1 ns 1.26
ranges_search<std::uint8_t>/4 1400 ns 4721 ns 0.30
ranges_search<std::uint8_t>/5 12674 ns 4664 ns 2.72
ranges_search<std::uint16_t>/0 510 ns 267 ns 1.91
ranges_search<std::uint16_t>/1 579 ns 287 ns 2.02
ranges_search<std::uint16_t>/2 50.6 ns 23.6 ns 2.14
ranges_search<std::uint16_t>/3 24.8 ns 15.2 ns 1.63
ranges_search<std::uint16_t>/4 7184 ns 4054 ns 1.77
ranges_search<std::uint16_t>/5 16005 ns 10771 ns 1.49
ranges_search<std::uint32_t>/0 1403 ns 296 ns 4.74
ranges_search<std::uint32_t>/1 1432 ns 382 ns 3.75
ranges_search<std::uint32_t>/2 132 ns 24.3 ns 5.43
ranges_search<std::uint32_t>/3 56.1 ns 18.8 ns 2.98
ranges_search<std::uint32_t>/4 5285 ns 3714 ns 1.42
ranges_search<std::uint32_t>/5 28818 ns 12333 ns 2.34
ranges_search<std::uint64_t>/0 1779 ns 542 ns 3.28
ranges_search<std::uint64_t>/1 1996 ns 580 ns 3.44
ranges_search<std::uint64_t>/2 135 ns 39.2 ns 3.44
ranges_search<std::uint64_t>/3 72.5 ns 22.9 ns 3.17
ranges_search<std::uint64_t>/4 5245 ns 10382 ns 0.51
ranges_search<std::uint64_t>/5 15289 ns 15183 ns 1.01
search_default_searcher<std::uint8_t>/0 257 ns 229 ns 1.12
search_default_searcher<std::uint8_t>/1 292 ns 252 ns 1.16
search_default_searcher<std::uint8_t>/2 28.6 ns 17.8 ns 1.61
search_default_searcher<std::uint8_t>/3 14.9 ns 13.1 ns 1.14
search_default_searcher<std::uint8_t>/4 1406 ns 4666 ns 0.30
search_default_searcher<std::uint8_t>/5 11555 ns 4590 ns 2.52
search_default_searcher<std::uint16_t>/0 513 ns 264 ns 1.94
search_default_searcher<std::uint16_t>/1 595 ns 276 ns 2.16
search_default_searcher<std::uint16_t>/2 53.7 ns 21.9 ns 2.45
search_default_searcher<std::uint16_t>/3 23.9 ns 14.0 ns 1.71
search_default_searcher<std::uint16_t>/4 7201 ns 3996 ns 1.80
search_default_searcher<std::uint16_t>/5 15742 ns 10681 ns 1.47
search_default_searcher<std::uint32_t>/0 1514 ns 298 ns 5.08
search_default_searcher<std::uint32_t>/1 1644 ns 378 ns 4.35
search_default_searcher<std::uint32_t>/2 137 ns 24.1 ns 5.68
search_default_searcher<std::uint32_t>/3 58.9 ns 18.5 ns 3.18
search_default_searcher<std::uint32_t>/4 6015 ns 3725 ns 1.61
search_default_searcher<std::uint32_t>/5 17738 ns 12318 ns 1.44
search_default_searcher<std::uint64_t>/0 2060 ns 531 ns 3.88
search_default_searcher<std::uint64_t>/1 2337 ns 573 ns 4.08
search_default_searcher<std::uint64_t>/2 152 ns 41.0 ns 3.71
search_default_searcher<std::uint64_t>/3 68.6 ns 25.0 ns 2.74
search_default_searcher<std::uint64_t>/4 6997 ns 11251 ns 0.62
search_default_searcher<std::uint64_t>/5 17749 ns 16792 ns 1.06
member_find<not_highly_aligned_string>/0 258 ns 232 ns 1.11
member_find<not_highly_aligned_string>/1 283 ns 254 ns 1.11
member_find<not_highly_aligned_string>/2 29.5 ns 18.7 ns 1.58
member_find<not_highly_aligned_string>/3 16.8 ns 13.1 ns 1.28
member_find<not_highly_aligned_string>/4 1410 ns 4635 ns 0.30
member_find<not_highly_aligned_string>/5 12208 ns 4541 ns 2.69
member_find<not_highly_aligned_wstring>/0 509 ns 262 ns 1.94
member_find<not_highly_aligned_wstring>/1 579 ns 283 ns 2.05
member_find<not_highly_aligned_wstring>/2 51.0 ns 23.1 ns 2.21
member_find<not_highly_aligned_wstring>/3 24.8 ns 15.8 ns 1.57
member_find<not_highly_aligned_wstring>/4 7192 ns 3964 ns 1.81
member_find<not_highly_aligned_wstring>/5 15564 ns 10700 ns 1.45
ranges_find_end<std::uint8_t>/0 22.4 ns 22.0 ns 1.02
ranges_find_end<std::uint8_t>/1 20.8 ns 21.3 ns 0.98
ranges_find_end<std::uint8_t>/2 298 ns 71.1 ns 4.19
ranges_find_end<std::uint8_t>/3 374 ns 87.4 ns 4.28
ranges_find_end<std::uint8_t>/4 4839 ns 3290 ns 1.47
ranges_find_end<std::uint8_t>/5 15299 ns 3243 ns 4.72
ranges_find_end<std::uint16_t>/0 37.9 ns 23.6 ns 1.61
ranges_find_end<std::uint16_t>/1 37.0 ns 18.0 ns 2.06
ranges_find_end<std::uint16_t>/2 601 ns 114 ns 5.27
ranges_find_end<std::uint16_t>/3 707 ns 169 ns 4.18
ranges_find_end<std::uint16_t>/4 9640 ns 2956 ns 3.26
ranges_find_end<std::uint16_t>/5 15681 ns 10605 ns 1.48
ranges_find_end<std::uint32_t>/0 98.7 ns 19.6 ns 5.04
ranges_find_end<std::uint32_t>/1 102 ns 23.8 ns 4.29
ranges_find_end<std::uint32_t>/2 1484 ns 218 ns 6.81
ranges_find_end<std::uint32_t>/3 1782 ns 284 ns 6.27
ranges_find_end<std::uint32_t>/4 5313 ns 2830 ns 1.88
ranges_find_end<std::uint32_t>/5 14732 ns 10760 ns 1.37
ranges_find_end<std::uint64_t>/0 102 ns 34.0 ns 3.00
ranges_find_end<std::uint64_t>/1 105 ns 34.2 ns 3.07
ranges_find_end<std::uint64_t>/2 1439 ns 413 ns 3.48
ranges_find_end<std::uint64_t>/3 1734 ns 490 ns 3.54
ranges_find_end<std::uint64_t>/4 5574 ns 9602 ns 0.58
ranges_find_end<std::uint64_t>/5 15474 ns 13516 ns 1.14
member_rfind<not_highly_aligned_string>/0 22.0 ns 21.6 ns 1.02
member_rfind<not_highly_aligned_string>/1 20.9 ns 21.4 ns 0.98
member_rfind<not_highly_aligned_string>/2 298 ns 71.9 ns 4.14
member_rfind<not_highly_aligned_string>/3 364 ns 87.6 ns 4.16
member_rfind<not_highly_aligned_string>/4 4892 ns 3345 ns 1.46
member_rfind<not_highly_aligned_string>/5 15381 ns 3275 ns 4.70
member_rfind<not_highly_aligned_wstring>/0 38.3 ns 23.6 ns 1.62
member_rfind<not_highly_aligned_wstring>/1 37.5 ns 18.6 ns 2.02
member_rfind<not_highly_aligned_wstring>/2 600 ns 116 ns 5.17
member_rfind<not_highly_aligned_wstring>/3 716 ns 173 ns 4.14
member_rfind<not_highly_aligned_wstring>/4 9725 ns 2956 ns 3.29
member_rfind<not_highly_aligned_wstring>/5 15831 ns 10801 ns 1.47
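The Speedup column in the tables above appears to be the Before time divided by the After time, rounded to two decimals, so values below 1.00 are regressions. A small sketch of that arithmetic, with `speedup` as a hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>

// Ratio of the old time to the new time, rounded to two decimal places.
// Values above 1.0 mean the new code is faster.
inline double speedup(double before_ns, double after_ns) {
    assert(after_ns > 0.0);
    return std::round(before_ns / after_ns * 100.0) / 100.0;
}
```

For instance, `speedup(258.0, 231.0)` reproduces the 1.12 reported for `classic_search<std::uint8_t>/0`, and `speedup(1533.0, 4756.0)` reproduces the 0.32 reported for `classic_search<std::uint8_t>/4`.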

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner May 10, 2025 12:23
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews May 10, 2025
@StephanTLavavej StephanTLavavej added the performance Must go faster label May 10, 2025
@microsoft microsoft deleted a comment May 10, 2025
@StephanTLavavej StephanTLavavej self-assigned this May 10, 2025
@AlexGuteniev
Contributor Author

I'd be curious about benchmark results for this new search/find_end vectorization compared to main on other CPUs. AMD CPUs are of most interest, but Intel CPUs of a significantly higher or significantly lower grade are also of interest. Mine is a 12th Gen Intel(R) Core(TM) i5-1235U.

Only one algorithm type is of interest, but for all element sizes; strstr is also still an interesting baseline. So running out\bench\benchmark-search.exe --benchmark_filter="strstr|classic_" is a good way to filter in only what is needed.

I've modified the benchmarks before the actual changes, so commit 1fb4bfe is a convenient point for benchmarking the "before" state.

My own results for E-core affinity
Benchmark Before After Speedup
c_strstr/0 775 ns 772 ns 1.00
c_strstr/1 854 ns 854 ns 1.00
c_strstr/2 69.6 ns 69.7 ns 1.00
c_strstr/3 37.9 ns 38.0 ns 1.00
c_strstr/4 3091 ns 3098 ns 1.00
c_strstr/5 48235 ns 48337 ns 1.00
classic_search<std::uint8_t>/0 771 ns 545 ns 1.41
classic_search<std::uint8_t>/1 832 ns 675 ns 1.23
classic_search<std::uint8_t>/2 78.9 ns 27.2 ns 2.90
classic_search<std::uint8_t>/3 34.2 ns 22.0 ns 1.55
classic_search<std::uint8_t>/4 2151 ns 7439 ns 0.29
classic_search<std::uint8_t>/5 20011 ns 7297 ns 2.74
classic_search<std::uint16_t>/0 1656 ns 494 ns 3.35
classic_search<std::uint16_t>/1 1679 ns 429 ns 3.91
classic_search<std::uint16_t>/2 148 ns 34.5 ns 4.29
classic_search<std::uint16_t>/3 61.1 ns 20.5 ns 2.98
classic_search<std::uint16_t>/4 10407 ns 6029 ns 1.73
classic_search<std::uint16_t>/5 26230 ns 18793 ns 1.40
classic_search<std::uint32_t>/0 3488 ns 529 ns 6.59
classic_search<std::uint32_t>/1 3843 ns 614 ns 6.26
classic_search<std::uint32_t>/2 297 ns 36.5 ns 8.14
classic_search<std::uint32_t>/3 101 ns 27.2 ns 3.71
classic_search<std::uint32_t>/4 13720 ns 5782 ns 2.37
classic_search<std::uint32_t>/5 28629 ns 22707 ns 1.26
classic_search<std::uint64_t>/0 3872 ns 1158 ns 3.34
classic_search<std::uint64_t>/1 4229 ns 1271 ns 3.33
classic_search<std::uint64_t>/2 301 ns 94.6 ns 3.18
classic_search<std::uint64_t>/3 109 ns 44.9 ns 2.43
classic_search<std::uint64_t>/4 19182 ns 14350 ns 1.34
classic_search<std::uint64_t>/5 42780 ns 23834 ns 1.79
classic_find_end<std::uint8_t>/0 52.4 ns 33.6 ns 1.56
classic_find_end<std::uint8_t>/1 47.6 ns 32.2 ns 1.48
classic_find_end<std::uint8_t>/2 747 ns 126 ns 5.93
classic_find_end<std::uint8_t>/3 871 ns 157 ns 5.55
classic_find_end<std::uint8_t>/4 7467 ns 5302 ns 1.41
classic_find_end<std::uint8_t>/5 29309 ns 5186 ns 5.65
classic_find_end<std::uint16_t>/0 94.7 ns 36.1 ns 2.62
classic_find_end<std::uint16_t>/1 89.2 ns 28.2 ns 3.16
classic_find_end<std::uint16_t>/2 1543 ns 176 ns 8.77
classic_find_end<std::uint16_t>/3 1816 ns 292 ns 6.22
classic_find_end<std::uint16_t>/4 14271 ns 4653 ns 3.07
classic_find_end<std::uint16_t>/5 31692 ns 17964 ns 1.76
classic_find_end<std::uint32_t>/0 139 ns 30.5 ns 4.56
classic_find_end<std::uint32_t>/1 136 ns 50.0 ns 2.72
classic_find_end<std::uint32_t>/2 2433 ns 365 ns 6.67
classic_find_end<std::uint32_t>/3 2890 ns 509 ns 5.68
classic_find_end<std::uint32_t>/4 12794 ns 4451 ns 2.87
classic_find_end<std::uint32_t>/5 28414 ns 19038 ns 1.49
classic_find_end<std::uint64_t>/0 177 ns 55.2 ns 3.21
classic_find_end<std::uint64_t>/1 174 ns 55.9 ns 3.11
classic_find_end<std::uint64_t>/2 2905 ns 915 ns 3.17
classic_find_end<std::uint64_t>/3 3441 ns 1099 ns 3.13
classic_find_end<std::uint64_t>/4 19198 ns 13903 ns 1.38
classic_find_end<std::uint64_t>/5 42716 ns 23509 ns 1.82

@StephanTLavavej StephanTLavavej removed their assignment May 16, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews May 16, 2025
@StephanTLavavej
Member

5950X results:
Benchmark Before After Speedup
c_strstr/0 141 ns 138 ns 1.02
c_strstr/1 159 ns 155 ns 1.03
c_strstr/2 15.3 ns 14.7 ns 1.04
c_strstr/3 9.60 ns 9.27 ns 1.04
c_strstr/4 1062 ns 1058 ns 1.00
c_strstr/5 12351 ns 11788 ns 1.05
classic_search<std::uint8_t>/0 158 ns 202 ns 0.78
classic_search<std::uint8_t>/1 173 ns 229 ns 0.76
classic_search<std::uint8_t>/2 20.6 ns 16.6 ns 1.24
classic_search<std::uint8_t>/3 12.1 ns 13.3 ns 0.91
classic_search<std::uint8_t>/4 1016 ns 3058 ns 0.33
classic_search<std::uint8_t>/5 8219 ns 2966 ns 2.77
classic_search<std::uint16_t>/0 307 ns 223 ns 1.38
classic_search<std::uint16_t>/1 348 ns 210 ns 1.66
classic_search<std::uint16_t>/2 33.4 ns 19.6 ns 1.70
classic_search<std::uint16_t>/3 19.6 ns 12.7 ns 1.54
classic_search<std::uint16_t>/4 5003 ns 2530 ns 1.98
classic_search<std::uint16_t>/5 13542 ns 9725 ns 1.39
classic_search<std::uint32_t>/0 1967 ns 283 ns 6.95
classic_search<std::uint32_t>/1 2144 ns 344 ns 6.23
classic_search<std::uint32_t>/2 172 ns 22.6 ns 7.61
classic_search<std::uint32_t>/3 70.2 ns 16.6 ns 4.23
classic_search<std::uint32_t>/4 6325 ns 2320 ns 2.73
classic_search<std::uint32_t>/5 16600 ns 9653 ns 1.72
classic_search<std::uint64_t>/0 1372 ns 472 ns 2.91
classic_search<std::uint64_t>/1 1500 ns 521 ns 2.88
classic_search<std::uint64_t>/2 119 ns 39.3 ns 3.03
classic_search<std::uint64_t>/3 51.2 ns 21.6 ns 2.37
classic_search<std::uint64_t>/4 6645 ns 9358 ns 0.71
classic_search<std::uint64_t>/5 16154 ns 13946 ns 1.16
member_find<not_highly_aligned_string>/0 156 ns 203 ns 0.77
member_find<not_highly_aligned_string>/1 173 ns 228 ns 0.76
member_find<not_highly_aligned_string>/2 20.7 ns 17.0 ns 1.22
member_find<not_highly_aligned_string>/3 12.5 ns 13.0 ns 0.96
member_find<not_highly_aligned_string>/4 1013 ns 3056 ns 0.33
member_find<not_highly_aligned_string>/5 8181 ns 2965 ns 2.76
member_find<not_highly_aligned_wstring>/0 307 ns 223 ns 1.38
member_find<not_highly_aligned_wstring>/1 347 ns 211 ns 1.64
member_find<not_highly_aligned_wstring>/2 34.1 ns 20.4 ns 1.67
member_find<not_highly_aligned_wstring>/3 20.1 ns 13.6 ns 1.48
member_find<not_highly_aligned_wstring>/4 4981 ns 2514 ns 1.98
member_find<not_highly_aligned_wstring>/5 13623 ns 9676 ns 1.41
member_find<not_highly_aligned_u32string>/0 1143 ns 283 ns 4.04
member_find<not_highly_aligned_u32string>/1 1226 ns 346 ns 3.54
member_find<not_highly_aligned_u32string>/2 92.6 ns 23.0 ns 4.03
member_find<not_highly_aligned_u32string>/3 37.3 ns 17.2 ns 2.17
member_find<not_highly_aligned_u32string>/4 7569 ns 2315 ns 3.27
member_find<not_highly_aligned_u32string>/5 17833 ns 9722 ns 1.83
classic_find_end<std::uint8_t>/0 17.8 ns 19.7 ns 0.90
classic_find_end<std::uint8_t>/1 17.1 ns 19.6 ns 0.87
classic_find_end<std::uint8_t>/2 192 ns 71.9 ns 2.67
classic_find_end<std::uint8_t>/3 236 ns 86.3 ns 2.73
classic_find_end<std::uint8_t>/4 6726 ns 3131 ns 2.15
classic_find_end<std::uint8_t>/5 19026 ns 2869 ns 6.63
classic_find_end<std::uint16_t>/0 28.2 ns 22.0 ns 1.28
classic_find_end<std::uint16_t>/1 27.9 ns 15.7 ns 1.78
classic_find_end<std::uint16_t>/2 435 ns 107 ns 4.07
classic_find_end<std::uint16_t>/3 484 ns 154 ns 3.14
classic_find_end<std::uint16_t>/4 13612 ns 2248 ns 6.06
classic_find_end<std::uint16_t>/5 24622 ns 9311 ns 2.64
classic_find_end<std::uint32_t>/0 108 ns 14.9 ns 7.25
classic_find_end<std::uint32_t>/1 102 ns 20.9 ns 4.88
classic_find_end<std::uint32_t>/2 1834 ns 172 ns 10.66
classic_find_end<std::uint32_t>/3 2139 ns 294 ns 7.28
classic_find_end<std::uint32_t>/4 6390 ns 2048 ns 3.12
classic_find_end<std::uint32_t>/5 16613 ns 9647 ns 1.72
classic_find_end<std::uint64_t>/0 107 ns 34.5 ns 3.10
classic_find_end<std::uint64_t>/1 102 ns 37.0 ns 2.76
classic_find_end<std::uint64_t>/2 1845 ns 498 ns 3.70
classic_find_end<std::uint64_t>/3 2144 ns 552 ns 3.88
classic_find_end<std::uint64_t>/4 5712 ns 8877 ns 0.64
classic_find_end<std::uint64_t>/5 15608 ns 13323 ns 1.17
member_rfind<not_highly_aligned_string>/0 18.4 ns 19.9 ns 0.92
member_rfind<not_highly_aligned_string>/1 17.8 ns 20.0 ns 0.89
member_rfind<not_highly_aligned_string>/2 193 ns 72.4 ns 2.67
member_rfind<not_highly_aligned_string>/3 237 ns 86.7 ns 2.73
member_rfind<not_highly_aligned_string>/4 6729 ns 2742 ns 2.45
member_rfind<not_highly_aligned_string>/5 19110 ns 2640 ns 7.24
member_rfind<not_highly_aligned_wstring>/0 28.7 ns 21.8 ns 1.32
member_rfind<not_highly_aligned_wstring>/1 28.5 ns 16.3 ns 1.75
member_rfind<not_highly_aligned_wstring>/2 435 ns 107 ns 4.07
member_rfind<not_highly_aligned_wstring>/3 484 ns 157 ns 3.08
member_rfind<not_highly_aligned_wstring>/4 13626 ns 2247 ns 6.06
member_rfind<not_highly_aligned_wstring>/5 24005 ns 9375 ns 2.56
member_rfind<not_highly_aligned_u32string>/0 79.5 ns 15.9 ns 5.00
member_rfind<not_highly_aligned_u32string>/1 76.2 ns 21.6 ns 3.53
member_rfind<not_highly_aligned_u32string>/2 1283 ns 173 ns 7.42
member_rfind<not_highly_aligned_u32string>/3 1507 ns 296 ns 5.09
member_rfind<not_highly_aligned_u32string>/4 8882 ns 2052 ns 4.33
member_rfind<not_highly_aligned_u32string>/5 27068 ns 9658 ns 2.80

Ignoring the /4 and /5 "evil" cases, I'm somewhat concerned about perf regressions for 1-byte search.

@AlexGuteniev
Contributor Author

> I'm somewhat concerned about perf regressions for 1-byte search.

I then suggest restructuring so that for forward search we have:

if constexpr (sizeof(_Ty) >= 4) {
    if (_Use_avx2()) {
        // new AVX2 path
    }
} else {
    if (_Use_sse42()) {
        // old SSE4.2 path
    }
}
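The suggested dispatch can be written as a small compilable sketch. Everything here is illustrative: `search_path`, `cpu_features`, and `choose_forward_search_path` are hypothetical names, the feature flags are passed in rather than detected, and the fall-through to a scalar path is an assumption, so this should not be read as the code that was eventually merged.

```cpp
#include <cstdint>

enum class search_path { avx2, sse42, scalar };

// Hypothetical stand-in for runtime CPU feature detection.
struct cpu_features {
    bool avx2;
    bool sse42;
};

// Picks the forward-search path by element width:
// 32-bit and 64-bit elements use the new AVX2 path,
// 8-bit and 16-bit elements keep the old SSE4.2 path.
template <class T>
constexpr search_path choose_forward_search_path(const cpu_features& cpu) {
    static_assert(sizeof(T) == 1 || sizeof(T) == 2 || sizeof(T) == 4 || sizeof(T) == 8,
        "unsupported element width");
    if constexpr (sizeof(T) >= 4) {
        if (cpu.avx2) {
            return search_path::avx2;
        }
    } else {
        if (cpu.sse42) {
            return search_path::sse42;
        }
    }
    return search_path::scalar; // assumed fallback when the feature is missing
}

// Compile-time checks of the intended selection.
static_assert(choose_forward_search_path<std::uint32_t>({true, true}) == search_path::avx2);
static_assert(choose_forward_search_path<std::uint16_t>({true, true}) == search_path::sse42);
```

With this shape, 1-byte and 2-byte forward search would stay on the existing SSE4.2 code, which is the case where the 5950X results show regressions for the AVX2 path.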

@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews May 16, 2025
@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit b214da0 into microsoft:main May 17, 2025
39 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews May 17, 2025
@StephanTLavavej
Member

🔍 🕵️ 🔎
