SIMD: extend NEON coverage, and include neon in oiio_simd_caps #3599

aras-p · 2022-10-12T19:28:23Z

Description

Related to points raised over at #3268, this improves NEON coverage a bit. It already works "out of the box" (at least when building on an Apple Silicon, using all default cmake options), but a bunch of functions were missing native NEON implementations. Also oiio_simd_caps() did not indicate in any way that NEON SIMD is actually compiled in.

Use hardware float<->half conversion (similar to F16C on Intel)
Use hardware sqrt
Use horizontal adds in more places ("reduce_add")
Equivalents of _mm_movemask_ps for NEON in vbool4
Other smaller bits & bobs

Tests

None of ARM platforms are tested on Github CI here. I manually ran various _test executables produced by a local build before & after the changes, and everything is the same correctness wise.

Testing on Apple M1 Max, Release build, notable simd_test Mvals/sec changes:

vfloat4
- load from half[]: 2457 -> 5890
- store to half[]: 1041 -> 6074
- simd sqrt: 2151 -> 6075
vfloat8
- load from half[]: 2433 -> 10367
- store to half[]: 1235 -> 11192
- reduce_add: 7377 -> 11971
vfloat16
- load from half[]: 2672 -> 14505
- store to half[]: 1322 -> 12558
- reduce_add: 7123 -> 11858
- operator<: 3345 -> 7500 (similar with other compare ops)
vint4
- blend: 4403 -> 5007
- blend0: 5187 -> 6076
- rotl: 1775 -> 2932
vint8
- load from int[]: 9415 -> 12152
vint16
- operator<<: 4656 -> 5865 (similar for >>, srl)
- rotl : 4042 -> 5137
vbool4
- reduce_and: 875 -> 1361
- reduce_or: 985 -> 1373

Checklist:

I have read the contribution guidelines.
If this is more extensive than a small change to existing code, I
have previously submitted a Contributor License Agreement
(individual, and if there is any way my
employers might think my programming belongs to them, then also
corporate).
I have updated the documentation, if applicable.
I have ensured that the change is tested somewhere in the testsuite
(adding new test cases if necessary).
My code follows the prevailing code style of this project.

…imd_caps * Use hardware float<->half conversion (similar to F16C on Intel) * Use hardware sqrt * Use horizontal adds in more places ("reduce_add") * Equivalents of _mm_movemask_ps for NEON in vbool4 * Other smaller bits & bobs Testing on Apple M1 Max, Release build, notable simd_test Mvals/sec changes: vfloat4 * load from half[]: 2457 -> 5890 * store to half[]: 1041 -> 6074 * simd sqrt: 2151 -> 6075 vfloat8 * load from half[]: 2433 -> 10367 * store to half[]: 1235 -> 11192 * reduce_add: 7377 -> 11971 vfloat16 * load from half[]: 2672 -> 14505 * store to half[]: 1322 -> 12558 * reduce_add: 7123 -> 11858 * operator<: 3345 -> 7500 (similar with other compare ops) vint4 * blend: 4403 -> 5007 * blend0: 5187 -> 6076 * rotl: 1775 -> 2932 vint8 * load from int[]: 9415 -> 12152 vint16 * operator<<: 4656 -> 5865 (similar for >>, srl) * rotl : 4042 -> 5137 vbool4 * reduce_and: 875 -> 1361 * reduce_or: 985 -> 1373

lgritz

Outstanding!

lgritz · 2022-10-12T22:46:27Z

We are looking forward to GHA eventually having Mac runners that are Apple Silicon / ARM. No ETA yet as far as I know.

Another possible approach is to revive using Travis-CI for additional CI coverage of some OS+Arch combos that GHA doesn't seem to have. The GHA would still be our primary CI but with some extra Travis (maybe not necessarily even being on every PR, but just a daily or weekly check?) to cover additional combinations. They don't have ARM Mac, but they do have ARM Linux, which would still let us test much of the same code paths.

aras-p · 2022-10-13T03:56:45Z

Yeah GHA has an open roadmap item for Apple Silicon support here: github/roadmap#528 -- I hope that happens sometime soon(ish).

…imd_caps (AcademySoftwareFoundation#3599) * Use hardware float<->half conversion (similar to F16C on Intel) * Use hardware sqrt * Use horizontal adds in more places ("reduce_add") * Equivalents of _mm_movemask_ps for NEON in vbool4 * Other smaller bits & bobs Testing on Apple M1 Max, Release build, notable simd_test Mvals/sec changes: vfloat4 * load from half[]: 2457 -> 5890 * store to half[]: 1041 -> 6074 * simd sqrt: 2151 -> 6075 vfloat8 * load from half[]: 2433 -> 10367 * store to half[]: 1235 -> 11192 * reduce_add: 7377 -> 11971 vfloat16 * load from half[]: 2672 -> 14505 * store to half[]: 1322 -> 12558 * reduce_add: 7123 -> 11858 * operator<: 3345 -> 7500 (similar with other compare ops) vint4 * blend: 4403 -> 5007 * blend0: 5187 -> 6076 * rotl: 1775 -> 2932 vint8 * load from int[]: 9415 -> 12152 vint16 * operator<<: 4656 -> 5865 (similar for >>, srl) * rotl : 4042 -> 5137 vbool4 * reduce_and: 875 -> 1361 * reduce_or: 985 -> 1373

lgritz approved these changes Oct 12, 2022

View reviewed changes

lgritz merged commit 5320f79 into AcademySoftwareFoundation:master Oct 13, 2022

lgritz mentioned this pull request Dec 22, 2022

[BUG] 2.4.6.1 build fails on ARM 64 (aarch64) #3721

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

SIMD: extend NEON coverage, and include neon in oiio_simd_caps #3599

SIMD: extend NEON coverage, and include neon in oiio_simd_caps #3599

Uh oh!

aras-p commented Oct 12, 2022

Uh oh!

lgritz left a comment

Uh oh!

lgritz commented Oct 12, 2022

Uh oh!

aras-p commented Oct 13, 2022

Uh oh!

Uh oh!

SIMD: extend NEON coverage, and include neon in oiio_simd_caps #3599

SIMD: extend NEON coverage, and include neon in oiio_simd_caps #3599

Uh oh!

Conversation

aras-p commented Oct 12, 2022

Description

Tests

Checklist:

Uh oh!

lgritz left a comment

Choose a reason for hiding this comment

Uh oh!

lgritz commented Oct 12, 2022

Uh oh!

aras-p commented Oct 13, 2022

Uh oh!

Uh oh!