Skip to content

Add AVX512BW variant of fletcher #9517

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 30, 2019

Conversation

rdolbeau
Copy link
Contributor

@rdolbeau rdolbeau commented Oct 27, 2019

It is much faster than AVX512F when byteswapping on Skylake-SP and newer, as we can do the byteswap in a single vshufb insteads of many instructions.

Signed-off-by: Romain Dolbeau [email protected]

Motivation and Context

Faster Fletcher on AXV512BW-capable machine. Performance only.

Description

This is the same code as AVX512, but changing the byteswap to the same principle as in the AVX (Intel) code: single pshufb instead of multiple instructions.

All AVX512-capable machine are also AVX512BW-capable, except for the Knight Landings processor (Intel Xeon Phi x1xx), so this should be generally useful.

How Has This Been Tested?

Pretty much only 'ztest' for validity.

Performance on a Xeon Platinum 8170:

̀cat /proc/spl/kstat/zfs/fletcher_4_bench 
0 0 0x01 -1 0 4687155613279 5306314643165
implementation   native         byteswap       
scalar           6598567882     2914387751     
superscalar      8619575466     3807285484     
superscalar4     7909377800     4574394878     
sse2             14823627397    8382296968     
ssse3            14871946829    13268853915    
avx2             22377596138    20723574960    
avx512f          33799684523    12442937636    
avx512bw         33785735010    29645697974    
fastest          avx512f        avx512bw    

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • [ x] Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • [x ] My code follows the ZFS on Linux code style requirements.
  • [x ] I have updated the documentation accordingly.
  • [ x] I have read the contributing document.
  • I have added tests to cover my changes.
  • [ x] All new and existing tests passed.
  • [ x] All commit messages are properly formatted and contain Signed-off-by.

@rdolbeau rdolbeau force-pushed the simd-avx512bw-fletcher branch from d41614d to 6b14f8a Compare October 27, 2019 09:36
@codecov
Copy link

codecov bot commented Oct 27, 2019

Codecov Report

Merging #9517 into master will decrease coverage by 0.05%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9517      +/-   ##
==========================================
- Coverage   79.16%    79.1%   -0.06%     
==========================================
  Files         416      416              
  Lines      123661   123667       +6     
==========================================
- Hits        97895    97827      -68     
- Misses      25766    25840      +74
Flag Coverage Δ
#kernel 79.73% <0%> (-0.01%) ⬇️
#user 66.83% <0%> (-0.11%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d46f0de...2c365a4. Read the comment docs.

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Oct 28, 2019
Copy link
Contributor

@behlendorf behlendorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This is definitely a nice improvement for the byteswap case, and I agree I'm sure this will be helpful to some systems. The code and approach taken look correct to me. I haven't yet been able to verify the patch on the appropriate hardware, but I'll see if I can round something up.

@behlendorf behlendorf requested review from ironMann and tuxoko October 28, 2019 23:11
Copy link
Contributor

@ironMann ironMann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! (I cannot try the code atm)

Perhaps avoid duplicating the code by adding the new byteswap method to zfs_fletcher_avx512.c, like it was done for sse case.

@rdolbeau
Copy link
Contributor Author

@behlendorf @ironMann Thanks; I'll try to follow @ironMann suggestion and merge with avx512, as indeed that's a lot of duplicated code.

It is much faster than AVX512F when byteswapping on Skylake-SP
and newer, as we can do the byteswap in a single vshufb insteads
of many instructions.

Signed-off-by: Romain Dolbeau <[email protected]>
@rdolbeau rdolbeau force-pushed the simd-avx512bw-fletcher branch from 6b14f8a to 2c365a4 Compare October 30, 2019 07:52
@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Oct 30, 2019
@behlendorf
Copy link
Contributor

I've manually tested the patched with ztest and everything looks good. This is ready to go. Also, just another data point for future reference:

$ cat /proc/spl/kstat/zfs/fletcher_4_bench 
0 0 0x01 -1 0 18336380950 24249484731
implementation   native         byteswap       
scalar           6228223463     4987307307     
superscalar      8169752501     4502635203     
superscalar4     7460699360     6105662245     
sse2             14031638315    7912012361     
ssse3            14027234462    12551782074    
avx2             21087205076    19234057034    
avx512f          32509506092    11911328563    
avx512bw         32521718573    28283999556    
fastest          avx512bw       avx512bw    

@behlendorf behlendorf merged commit 0b2a642 into openzfs:master Oct 30, 2019
cvoltz pushed a commit to cvoltz/zfs that referenced this pull request Nov 1, 2019
It is much faster than AVX512F when byteswapping on Skylake-SP
and newer, as we can do the byteswap in a single vshufb instead
of many instructions.

Reviewed by: Gvozden Neskovic <[email protected]>
Reviewed-by: Chunwei Chen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Romain Dolbeau <[email protected]>
Closes openzfs#9517
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants