Skip to content

SIMD: extend NEON coverage, and include neon in oiio_simd_caps #3599

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 13, 2022

Conversation

aras-p
Copy link
Contributor

@aras-p aras-p commented Oct 12, 2022

Description

Related to points raised over at #3268, this improves NEON coverage a bit. It already works "out of the box" (at least when building on an Apple Silicon, using all default cmake options), but a bunch of functions were missing native NEON implementations. Also oiio_simd_caps() did not indicate in any way that NEON SIMD is actually compiled in.

  • Use hardware float<->half conversion (similar to F16C on Intel)
  • Use hardware sqrt
  • Use horizontal adds in more places ("reduce_add")
  • Equivalents of _mm_movemask_ps for NEON in vbool4
  • Other smaller bits & bobs

Tests

None of ARM platforms are tested on Github CI here. I manually ran various _test executables produced by a local build before & after the changes, and everything is the same correctness wise.

Testing on Apple M1 Max, Release build, notable simd_test Mvals/sec changes:

  • vfloat4
    • load from half[]: 2457 -> 5890
    • store to half[]: 1041 -> 6074
    • simd sqrt: 2151 -> 6075
  • vfloat8
    • load from half[]: 2433 -> 10367
    • store to half[]: 1235 -> 11192
    • reduce_add: 7377 -> 11971
  • vfloat16
    • load from half[]: 2672 -> 14505
    • store to half[]: 1322 -> 12558
    • reduce_add: 7123 -> 11858
    • operator<: 3345 -> 7500 (similar with other compare ops)
  • vint4
    • blend: 4403 -> 5007
    • blend0: 5187 -> 6076
    • rotl: 1775 -> 2932
  • vint8
    • load from int[]: 9415 -> 12152
  • vint16
    • operator<<: 4656 -> 5865 (similar for >>, srl)
    • rotl : 4042 -> 5137
  • vbool4
    • reduce_and: 875 -> 1361
    • reduce_or: 985 -> 1373

Checklist:

  • I have read the contribution guidelines.
  • If this is more extensive than a small change to existing code, I
    have previously submitted a Contributor License Agreement
    (individual, and if there is any way my
    employers might think my programming belongs to them, then also
    corporate).
  • I have updated the documentation, if applicable.
  • I have ensured that the change is tested somewhere in the testsuite
    (adding new test cases if necessary).
  • My code follows the prevailing code style of this project.

…imd_caps

* Use hardware float<->half conversion (similar to F16C on Intel)
* Use hardware sqrt
* Use horizontal adds in more places ("reduce_add")
* Equivalents of _mm_movemask_ps for NEON in vbool4
* Other smaller bits & bobs

Testing on Apple M1 Max, Release build, notable simd_test Mvals/sec
changes:

vfloat4
* load from half[]: 2457 -> 5890
* store to half[]: 1041 -> 6074
* simd sqrt: 2151 -> 6075

vfloat8
* load from half[]: 2433 -> 10367
* store to half[]: 1235 -> 11192
* reduce_add: 7377 -> 11971

vfloat16
* load from half[]: 2672 -> 14505
* store to half[]: 1322 -> 12558
* reduce_add: 7123 -> 11858
* operator<:  3345 -> 7500 (similar with other compare ops)

vint4
* blend: 4403 -> 5007
* blend0: 5187 -> 6076
* rotl: 1775 -> 2932

vint8
* load from int[]: 9415 -> 12152

vint16
* operator<<: 4656 -> 5865 (similar for >>, srl)
* rotl : 4042 -> 5137

vbool4
* reduce_and: 875 -> 1361
* reduce_or: 985 -> 1373
Copy link
Collaborator

@lgritz lgritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outstanding!

@lgritz
Copy link
Collaborator

lgritz commented Oct 12, 2022

We are looking forward to GHA eventually having Mac runners that are Apple Silicon / ARM. No ETA yet as far as I know.

Another possible approach is to revive using Travis-CI for additional CI coverage of some OS+Arch combos that GHA doesn't seem to have. The GHA would still be our primary CI but with some extra Travis (maybe not necessarily even being on every PR, but just a daily or weekly check?) to cover additional combinations. They don't have ARM Mac, but they do have ARM Linux, which would still let us test much of the same code paths.

@lgritz lgritz merged commit 5320f79 into AcademySoftwareFoundation:master Oct 13, 2022
@aras-p
Copy link
Contributor Author

aras-p commented Oct 13, 2022

Yeah GHA has an open roadmap item for Apple Silicon support here: github/roadmap#528 -- I hope that happens sometime soon(ish).

lgritz pushed a commit to lgritz/OpenImageIO that referenced this pull request Oct 13, 2022
…imd_caps (AcademySoftwareFoundation#3599)

* Use hardware float<->half conversion (similar to F16C on Intel)
* Use hardware sqrt
* Use horizontal adds in more places ("reduce_add")
* Equivalents of _mm_movemask_ps for NEON in vbool4
* Other smaller bits & bobs

Testing on Apple M1 Max, Release build, notable simd_test Mvals/sec
changes:

vfloat4
* load from half[]: 2457 -> 5890
* store to half[]: 1041 -> 6074
* simd sqrt: 2151 -> 6075

vfloat8
* load from half[]: 2433 -> 10367
* store to half[]: 1235 -> 11192
* reduce_add: 7377 -> 11971

vfloat16
* load from half[]: 2672 -> 14505
* store to half[]: 1322 -> 12558
* reduce_add: 7123 -> 11858
* operator<:  3345 -> 7500 (similar with other compare ops)

vint4
* blend: 4403 -> 5007
* blend0: 5187 -> 6076
* rotl: 1775 -> 2932

vint8
* load from int[]: 9415 -> 12152

vint16
* operator<<: 4656 -> 5865 (similar for >>, srl)
* rotl : 4042 -> 5137

vbool4
* reduce_and: 875 -> 1361
* reduce_or: 985 -> 1373
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants