
RAIDZ: Use cache blocking during parity math #15448


Merged
behlendorf merged 1 commit into openzfs:master from the raidz_cache branch on Oct 30, 2023

Conversation

@amotin (Member) commented Oct 24, 2023

RAIDZ parity is calculated by adding data one column at a time. That works fine for small blocks, but for large blocks the result of the previous addition may already have been evicted from the CPU caches to main memory, so on top of the extra memory write, an extra read is needed to bring it back.

This patch splits large parity operations into 64KB chunks, which should in most cases fit into the L2 caches of CPUs from the last decade. I haven't touched the more complicated data reconstruction cases, to avoid overcomplicating the code; those should be relatively rare.
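
For context, here is a minimal, self-contained sketch of the cache-blocking idea in plain C. The function names, the `PARITY_CHUNK` constant, and the XOR-only (RAIDZ1-style P parity) math are illustrative assumptions for this writeup, not the actual OpenZFS vdev_raidz code, which also performs the Galois-field math for the higher parities and has SIMD implementations. The point is only the loop restructuring: instead of sweeping each data column over the full block length, the block is walked in 64KB chunks so the parity chunk being accumulated stays hot in L2 while every column is added to it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Blocking factor; illustrative stand-in for the 64KB chunk size in the patch. */
#define	PARITY_CHUNK	(64 * 1024)

/*
 * Naive column-at-a-time parity: each data column is swept over the full
 * block length.  Once the block outgrows L2, every sweep re-reads the
 * parity buffer from memory and writes it back.
 */
void
parity_xor_naive(uint8_t *p, uint8_t *const *cols, size_t ncols, size_t len)
{
	memcpy(p, cols[0], len);
	for (size_t c = 1; c < ncols; c++) {
		for (size_t i = 0; i < len; i++)
			p[i] ^= cols[c][i];
	}
}

/*
 * Cache-blocked parity: walk the block in 64KB chunks and add every column
 * to the parity chunk before moving on, so the chunk stays resident in L2.
 */
void
parity_xor_blocked(uint8_t *p, uint8_t *const *cols, size_t ncols, size_t len)
{
	for (size_t off = 0; off < len; off += PARITY_CHUNK) {
		size_t n = len - off;
		if (n > PARITY_CHUNK)
			n = PARITY_CHUNK;
		memcpy(p + off, cols[0] + off, n);
		for (size_t c = 1; c < ncols; c++) {
			for (size_t i = 0; i < n; i++)
				p[off + i] ^= cols[c][off + i];
		}
	}
}
```

With this blocking, the per-step working set is roughly one parity chunk plus one data chunk (about 128KB in this XOR-only sketch, a few hundred KB with two or three parities), which comfortably fits the L2 cache sizes discussed above.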

My tests on a Xeon Gold 6242R CPU with 1MB of L2 cache per core show up to 10%/20% memory traffic reduction when writing blocks of ~4MB and up to 4-wide RAIDZ/RAIDZ2. Older CPUs with 256KB of L2 cache should see the effect even on smaller blocks. Wider vdevs may need bigger blocks before they benefit.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@amotin
Copy link
Member Author

amotin commented Oct 24, 2023

Writing ~6.8GB/s to 4-wide RAIDZ with different block sizes:
[benchmark chart]

Writing ~4.8GB/s to 4-wide RAIDZ2 with different block sizes:
[benchmark chart]

@behlendorf added the "Type: Performance" and "Status: Code Review Needed" labels on Oct 24, 2023
@amotin force-pushed the raidz_cache branch 2 times, most recently from 3c1266e to 585b9a3, on October 24, 2023 21:47
RAIDZ parity is calculated by adding data one column at a time.  It
works OK for small blocks, but for large blocks results of previous
addition may already be evicted from CPU caches to main memory, and
in addition to extra memory write require extra read to get it back.

This patch splits large parity operations into 64KB chunks, that
should in most cases fit into CPU L2 caches from the last decade.
I haven't touched more complicated cases of data reconstruction to
not overcomplicate the code.  Those should be relatively rare.

My tests on Xeon Gold 6242R CPU with 1MB of L2 cache per core show
up to 10/20% memory traffic reduction when writing to 4-wide RAIDZ/
RAIDZ2 blocks of ~4MB and up.  Older CPUs with 256KB of L2 cache
should see the effect even on smaller blocks.  Wider vdevs may need
bigger blocks to be affected.

Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
@behlendorf added the "Status: Accepted" label and removed the "Status: Code Review Needed" label on Oct 30, 2023
@behlendorf merged commit 05a7348 into openzfs:master on Oct 30, 2023
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request on Dec 12, 2023
RAIDZ parity is calculated by adding data one column at a time.  It
works OK for small blocks, but for large blocks results of previous
addition may already be evicted from CPU caches to main memory, and
in addition to extra memory write require extra read to get it back.

This patch splits large parity operations into 64KB chunks, that
should in most cases fit into CPU L2 caches from the last decade.
I haven't touched more complicated cases of data reconstruction to
not overcomplicate the code.  Those should be relatively rare.

My tests on Xeon Gold 6242R CPU with 1MB of L2 cache per core show
up to 10/20% memory traffic reduction when writing to 4-wide RAIDZ/
RAIDZ2 blocks of ~4MB and up.  Older CPUs with 256KB of L2 cache
should see the effect even on smaller blocks.  Wider vdevs may need
bigger blocks to be affected.

Reviewed-by: Brian Atkinson <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#15448
@amotin deleted the raidz_cache branch on March 22, 2024 17:03
Labels
Status: Accepted · Type: Performance
3 participants