
Improve zfs send performance by bypassing the ARC #10067


Merged: 1 commit merged into openzfs:master from zol/send_directio on Mar 10, 2020

Conversation

@ahrens (Member) commented Feb 27, 2020

Motivation and Context

When doing a zfs send on a dataset with small recordsize (e.g. 8K),
performance is dominated by the per-block overheads. This is especially
true with zfs send --compressed, which further reduces the amount of
data sent, for the same number of blocks. Several threads are involved,
but the limiting factor is the send_prefetch thread, which is 100% on
CPU.

The main job of the send_prefetch thread is to issue zio's for the
data that will be needed by the main thread. It does this by calling
arc_read(ARC_FLAG_PREFETCH). This has an immediate cost of creating
an arc_hdr, which takes around 14% of one CPU. It also induces later
costs by other threads (a sketch of this flow follows the list):

  • Since the data was only prefetched, dmu_send()->dmu_dump_write() will
    need to call arc_read() again to get the data. This will have to look
    up the arc_hdr in the hash table and copy the data from the scatter ABD
    in the arc_hdr to a linear ABD in arc_buf. This takes 27% of one CPU.
  • dmu_dump_write() needs to call arc_buf_destroy(). This takes 11% of one CPU.
  • arc_adjust() will need to evict this arc_hdr, taking about 50% of one
    CPU.
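
To make the two-pass pattern concrete, here is a minimal, compile-only C sketch of the pre-patch flow. It is an illustration under stated assumptions, not the actual OpenZFS code (which lives in dmu_send.c and arc.c); every type and model_* function is a hypothetical stand-in.

```c
/*
 * Hypothetical sketch of the pre-patch flow; all names below are
 * illustrative stand-ins, not real OpenZFS kernel APIs.
 */
#include <stdbool.h>
#include <stddef.h>

typedef struct blkptr blkptr_t;   /* on-disk block pointer (stand-in) */
typedef struct arc_buf arc_buf_t; /* ARC buffer handle (stand-in) */

arc_buf_t *model_arc_read(const blkptr_t *bp, bool prefetch_only);
const void *model_buf_data(arc_buf_t *buf, size_t *lenp);
void model_arc_buf_destroy(arc_buf_t *buf);
void model_dump_write(const void *data, size_t len);

/* send_prefetch thread: creates an arc_hdr and starts the I/O (~14% CPU). */
void
prefetch_block(const blkptr_t *bp)
{
	(void) model_arc_read(bp, true); /* models ARC_FLAG_PREFETCH */
}

/*
 * Main thread: a second ARC hash-table lookup plus a copy from the
 * scatter ABD to a linear ABD (~27% CPU), then a destroy (~11% CPU);
 * arc_adjust() later evicts the arc_hdr (~50% CPU).
 */
void
send_block(const blkptr_t *bp)
{
	size_t len;
	arc_buf_t *buf = model_arc_read(bp, false);
	const void *data = model_buf_data(buf, &len);

	model_dump_write(data, len);
	model_arc_buf_destroy(buf);
}
```

Every block is thus touched twice by the ARC (once to prefetch, once to read), which is where the per-block overhead accumulates.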

Description

All of these costs can be avoided by bypassing the ARC if the data is
not already cached. This commit changes zfs send to check for the
data in the ARC, and if it is not found then we directly call
zio_read(), reading the data into a linear ABD which is used by
dmu_dump_write() directly.
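
Under the same stand-in naming as the sketch above, the new flow might look like this; again a hypothetical illustration, not the actual dmu_send.c change:

```c
/* Hypothetical sketch of the post-patch flow; stand-in names only. */
#include <stdbool.h>

typedef struct blkptr blkptr_t; /* stand-in */
typedef struct abd abd_t;       /* ABD data buffer (stand-in) */

bool model_arc_cached(const blkptr_t *bp);        /* ARC hash-table lookup */
abd_t *model_arc_read_abd(const blkptr_t *bp);    /* cached path via the ARC */
abd_t *model_zio_read_linear(const blkptr_t *bp); /* direct zio_read() */
void model_dump_write_abd(abd_t *abd);
void model_abd_free(abd_t *abd);

/*
 * Read one block for zfs send: take the ARC path only on a cache hit;
 * on a miss, read straight into a linear ABD, skipping arc_hdr creation,
 * the second hash lookup, the scatter-to-linear copy, and later eviction.
 */
void
send_block(const blkptr_t *bp)
{
	abd_t *abd;

	if (model_arc_cached(bp))
		abd = model_arc_read_abd(bp);    /* hit: data already cached */
	else
		abd = model_zio_read_linear(bp); /* miss: bypass the ARC */

	model_dump_write_abd(abd);
	model_abd_free(abd);
}
```

Note that a cache hit still uses the ARC, so workloads whose send data is already cached keep the old behavior.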

In addition to improving the performance of zfs send, this change
makes zfs send not pollute the ARC cache. In most cases the data will
not be reused, so this allows us to keep caching useful data in the MRU
(hit-once) part of the ARC.

How Has This Been Tested?

The performance improvement is best expressed in terms of how many
blocks can be processed by zfs send in one second. This change
increases the metric by 50%, from ~100,000 to ~150,000. When the amount
of data per block is small (e.g. 2KB), there is a corresponding
reduction in the elapsed time of zfs send >/dev/null (from 86 minutes
to 58 minutes in this test case).
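
As a rough sanity check, assuming elapsed time scales inversely with blocks processed per second: 86 min × (100,000 / 150,000) ≈ 57 min, consistent with the measured 58 minutes.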

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [x] Performance enhancement (non-breaking change which improves efficiency)
  • [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • All commit messages are properly formatted and contain Signed-off-by.

@ahrens ahrens requested a review from pcd1193182 February 27, 2020 05:40
@ahrens ahrens added the Type: Performance and Status: Code Review Needed labels Feb 27, 2020
@ahrens ahrens added the Component: Send/Recv label Feb 27, 2020
@behlendorf (Contributor)

I've resubmitted the builds which failed due to the unrelated failure of the alloc_class_013_pos test. The Ubuntu 18.04 failure also appears to be unrelated. It may have been introduced by the recent commit cccbed9; this is the first time I've seen this particular failure in the CI.

@behlendorf behlendorf added the Status: Accepted label and removed the Status: Code Review Needed label Mar 5, 2020
@ahrens ahrens force-pushed the zol/send_directio branch from 3ea3e99 to e12a85b Compare March 6, 2020 16:41
@ahrens ahrens force-pushed the zol/send_directio branch from e12a85b to bd0a1eb Compare March 9, 2020 18:25
@ahrens (Member Author) commented Mar 9, 2020

Re-pushed the same changes to try to get the tests to run to completion.

@behlendorf (Contributor)

Only expected failures in the updated push. This is ready to merge.

@behlendorf behlendorf merged commit 1dc32a6 into openzfs:master Mar 10, 2020
@matveevandrey matveevandrey mentioned this pull request Apr 16, 2020
ahrens added a commit to ahrens/zfs that referenced this pull request May 14, 2020
Improve zfs send performance by bypassing the ARC
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes openzfs#10067
@georgeyil georgeyil mentioned this pull request Nov 11, 2020
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
Improve zfs send performance by bypassing the ARC