-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Improve zfs send performance by bypassing the ARC #10067
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I've resubmitted the builds which failed due to the unrelated failure of the |
3ea3e99
to
e12a85b
Compare
When doing a zfs send on a dataset with small recordsize (e.g. 8K), performance is dominated by the per-block overheads. This is especially true with `zfs send --compressed`, which further reduces the amount of data sent, for the same number of blocks. Several threads are involved, but the limiting factor is the `send_prefetch` thread, which is 100% on CPU. The main job of the `send_prefetch` thread is to issue zio's for the data that will be needed by the main thread. It does this by calling `arc_read(ARC_FLAG_PREFETCH)`. This has an immediate cost of creating an arc_hdr, which takes around 14% of one CPU. It also induces later costs by other threads: * Since the data was only prefetched, dmu_send()->dmu_dump_write() will need to call arc_read() again to get the data. This will have to look up the arc_hdr in the hash table and copy the data from the scatter ABD in the arc_hdr to a linear ABD in arc_buf. This takes 27% of one CPU. * dmu_dump_write() needs to arc_buf_destroy() This takes 11% of one CPU. * arc_adjust() will need to evict this arc_hdr, taking about 50% of one CPU. All of these costs can be avoided by bypassing the ARC if the data is not already cached. This commit changes `zfs send` to check for the data in the ARC, and if it is not found then we directly call `zio_read()`, reading the data into a linear ABD which is used by dmu_dump_write() directly. The performance improvement is best expressed in terms of how many blocks can be processed by `zfs send` in one second. This change increases the metric by 50%, from ~100,000 to ~150,000. When the amount of data per block is small (e.g. 2KB), there is a corresponding reduction in the elapsed time of `zfs send >/dev/null` (from 86 minutes to 58 minutes in this test case). In addition to improving the performance of `zfs send`, this change makes `zfs send` not pollute the ARC cache. In most cases the data will not be reused, so this allows us to keep caching useful data in the MRU (hit-once) part of the ARC. Signed-off-by: Matthew Ahrens <[email protected]>
e12a85b
to
bd0a1eb
Compare
re-pushed the same changes to try and get the tests to run to completion. |
Only expected failures in the updated push. This is ready to merge. |
…rmance by bypassing the ARC When doing a zfs send on a dataset with small recordsize (e.g. 8K), performance is dominated by the per-block overheads. This is especially true with `zfs send --compressed`, which further reduces the amount of data sent, for the same number of blocks. Several threads are involved, but the limiting factor is the `send_prefetch` thread, which is 100% on CPU. The main job of the `send_prefetch` thread is to issue zio's for the data that will be needed by the main thread. It does this by calling `arc_read(ARC_FLAG_PREFETCH)`. This has an immediate cost of creating an arc_hdr, which takes around 14% of one CPU. It also induces later costs by other threads: * Since the data was only prefetched, dmu_send()->dmu_dump_write() will need to call arc_read() again to get the data. This will have to look up the arc_hdr in the hash table and copy the data from the scatter ABD in the arc_hdr to a linear ABD in arc_buf. This takes 27% of one CPU. * dmu_dump_write() needs to arc_buf_destroy() This takes 11% of one CPU. * arc_adjust() will need to evict this arc_hdr, taking about 50% of one CPU. All of these costs can be avoided by bypassing the ARC if the data is not already cached. This commit changes `zfs send` to check for the data in the ARC, and if it is not found then we directly call `zio_read()`, reading the data into a linear ABD which is used by dmu_dump_write() directly. The performance improvement is best expressed in terms of how many blocks can be processed by `zfs send` in one second. This change increases the metric by 50%, from ~100,000 to ~150,000. When the amount of data per block is small (e.g. 2KB), there is a corresponding reduction in the elapsed time of `zfs send >/dev/null` (from 86 minutes to 58 minutes in this test case). In addition to improving the performance of `zfs send`, this change makes `zfs send` not pollute the ARC cache. In most cases the data will not be reused, so this allows us to keep caching useful data in the MRU (hit-once) part of the ARC. Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes openzfs#10067
When doing a zfs send on a dataset with small recordsize (e.g. 8K), performance is dominated by the per-block overheads. This is especially true with `zfs send --compressed`, which further reduces the amount of data sent, for the same number of blocks. Several threads are involved, but the limiting factor is the `send_prefetch` thread, which is 100% on CPU. The main job of the `send_prefetch` thread is to issue zio's for the data that will be needed by the main thread. It does this by calling `arc_read(ARC_FLAG_PREFETCH)`. This has an immediate cost of creating an arc_hdr, which takes around 14% of one CPU. It also induces later costs by other threads: * Since the data was only prefetched, dmu_send()->dmu_dump_write() will need to call arc_read() again to get the data. This will have to look up the arc_hdr in the hash table and copy the data from the scatter ABD in the arc_hdr to a linear ABD in arc_buf. This takes 27% of one CPU. * dmu_dump_write() needs to arc_buf_destroy() This takes 11% of one CPU. * arc_adjust() will need to evict this arc_hdr, taking about 50% of one CPU. All of these costs can be avoided by bypassing the ARC if the data is not already cached. This commit changes `zfs send` to check for the data in the ARC, and if it is not found then we directly call `zio_read()`, reading the data into a linear ABD which is used by dmu_dump_write() directly. The performance improvement is best expressed in terms of how many blocks can be processed by `zfs send` in one second. This change increases the metric by 50%, from ~100,000 to ~150,000. When the amount of data per block is small (e.g. 2KB), there is a corresponding reduction in the elapsed time of `zfs send >/dev/null` (from 86 minutes to 58 minutes in this test case). In addition to improving the performance of `zfs send`, this change makes `zfs send` not pollute the ARC cache. In most cases the data will not be reused, so this allows us to keep caching useful data in the MRU (hit-once) part of the ARC. Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Serapheim Dimitropoulos <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes openzfs#10067
Motivation and Context
When doing a zfs send on a dataset with small recordsize (e.g. 8K),
performance is dominated by the per-block overheads. This is especially
true with
zfs send --compressed
, which further reduces the amount ofdata sent, for the same number of blocks. Several threads are involved,
but the limiting factor is the
send_prefetch
thread, which is 100% onCPU.
The main job of the
send_prefetch
thread is to issue zio's for thedata that will be needed by the main thread. It does this by calling
arc_read(ARC_FLAG_PREFETCH)
. This has an immediate cost of creatingan arc_hdr, which takes around 14% of one CPU. It also induces later
costs by other threads:
need to call arc_read() again to get the data. This will have to look
up the arc_hdr in the hash table and copy the data from the scatter ABD
in the arc_hdr to a linear ABD in arc_buf. This takes 27% of one CPU.
CPU.
Description
All of these costs can be avoided by bypassing the ARC if the data is
not already cached. This commit changes
zfs send
to check for thedata in the ARC, and if it is not found then we directly call
zio_read()
, reading the data into a linear ABD which is used bydmu_dump_write() directly.
In addition to improving the performance of
zfs send
, this changemakes
zfs send
not pollute the ARC cache. In most cases the data willnot be reused, so this allows us to keep caching useful data in the MRU
(hit-once) part of the ARC.
How Has This Been Tested?
The performance improvement is best expressed in terms of how many
blocks can be processed by
zfs send
in one second. This changeincreases the metric by 50%, from ~100,000 to ~150,000. When the amount
of data per block is small (e.g. 2KB), there is a corresponding
reduction in the elapsed time of
zfs send >/dev/null
(from 86 minutesto 58 minutes in this test case).
Types of changes
Checklist:
Signed-off-by
.