Description
I'm trying to limit the generation of wide AVX instructions to reduce the frequency impact and the resulting performance regression.
For the following example (consecutive FP divisions): https://godbolt.org/z/reP9c78cM I get a vector division: vdivpd %ymm0, %ymm1, %ymm0
which uses 256-bit wide registers. I've checked the IR, and SLP does indeed generate: %5 = fdiv <4 x double> %2, %4
When I try to limit the register size to 128, I get the same result. The same happens even when building with -mllvm -slp-max-reg-size=1, which should basically disable SLP vectorization completely. Wide AVX is known to cause significant performance regressions from reduced frequency on some CPUs (especially older ones).
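
For readers who cannot open the link, here is a minimal sketch of the kind of code being described. It is not the exact Godbolt source: the function name div4, the use of restrict, and the compile line in the comment (in particular -O3 and -mavx2) are assumptions made for illustration. Only -mllvm -slp-max-reg-size=... comes from the report above.

```c
/* Hypothetical reproducer, not the code behind the Godbolt link.
 * Assumed compile line (only the -mllvm option is taken from the report):
 *   clang -O3 -mavx2 -mllvm -slp-max-reg-size=128 -S div4.c
 * Four consecutive scalar divisions on adjacent elements are the shape
 * that the SLP vectorizer may merge into one <4 x double> fdiv. */
void div4(double *restrict out, const double *restrict a,
          const double *restrict b)
{
    out[0] = a[0] / b[0];
    out[1] = a[1] / b[1];
    out[2] = a[2] / b[2];
    out[3] = a[3] / b[3];
}
```

Checking the generated assembly for ymm versus xmm operands on the division is how the 256-bit and 128-bit cases can be told apart.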