Scale worker threads and taskqs with number of CPUs. #11966

Merged
merged 1 commit into openzfs:master from amotin:scale May 14, 2021

Conversation

amotin
Member

@amotin amotin commented Apr 29, 2021

While the use of dynamic taskqs reduces the number of idle threads, the
hardcoded 8 taskqs of each kind are serious overkill for small systems:
they complicate CPU scheduling and increase I/O reordering while
providing no real locking benefit, which simply is not needed there.

On the other hand, 12*8 worker threads per kind can overload almost any
system nowadays. For example, a pool of several fast SSDs with SHA256
checksums makes the system barely responsive during a scrub, or, with
dedup enabled, barely responsive during a large file deletion.

To address both problems this patch introduces the ZTI_SCALE macro,
similar to ZTI_BATCH but with multiple taskqs whose count depends on
the number of CPUs, to be used where lock scalability is needed but
request ordering is not so important. The code creates a new taskq for
roughly every 6 worker threads (fewer on small systems, more on very
large ones), up to 80% of the CPU cores (the previous 75% rounded down
poorly). Both the number of threads and the threads per taskq are now
tunable, in case somebody really wants to devote all of the system's
power to ZFS.
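
Below is a small standalone C sketch of the sizing policy described above, for illustration only: the tunable names (zio_taskq_batch_pct, zio_taskq_batch_tpq) and the exact rounding are assumptions of this sketch, not necessarily the code as merged.

```c
/*
 * Illustrative sketch of the ZTI_SCALE sizing described above: ~80% of
 * CPUs for worker threads, split into taskqs of roughly 6 threads each.
 * Tunable names and exact rounding are assumptions, not the merged code.
 */
#include <stdio.h>

static unsigned int zio_taskq_batch_pct = 80; /* assumed: % of CPUs to use */
static unsigned int zio_taskq_batch_tpq = 0;  /* assumed: threads/taskq, 0 = auto */

static void
zti_scale(unsigned int ncpus, unsigned int *taskqs, unsigned int *threads)
{
	/* Use ~80% of the CPUs for worker threads, but never fewer than one. */
	unsigned int cpus = ncpus * zio_taskq_batch_pct / 100;
	unsigned int count;

	if (cpus < 1)
		cpus = 1;

	if (zio_taskq_batch_tpq > 0) {
		/* Explicit threads-per-taskq requested: round to nearest. */
		count = (cpus + zio_taskq_batch_tpq / 2) / zio_taskq_batch_tpq;
	} else {
		/*
		 * Aim for ~6 threads per taskq, but never create more taskqs
		 * than there are threads per taskq, so small systems get
		 * fewer threads per taskq and very large systems get more.
		 */
		count = 1 + cpus / 6;
		while (count * count > cpus)
			count--;
	}
	if (count < 1)
		count = 1;

	*taskqs = count;
	*threads = cpus;
}

int
main(void)
{
	unsigned int sizes[] = { 1, 8, 16, 40, 128 };

	for (unsigned int i = 0; i < sizeof (sizes) / sizeof (sizes[0]); i++) {
		unsigned int q, t;

		zti_scale(sizes[i], &q, &t);
		printf("%3u CPUs -> %2u taskq(s), %3u worker threads total\n",
		    sizes[i], q, t);
	}
	return (0);
}
```

With these assumptions an 8-CPU system ends up with 2 taskqs and 6 worker threads in total, in line with the 2 taskqs mentioned in the testing notes below.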

While some benchmarks obviously show a small peak performance reduction
(not really a big one, especially on systems with SMT, where using the
second hardware threads does not add as much performance as the first),
they also show a dramatic latency reduction and much smoother user-space
operation under high CPU usage by ZFS.

How Has This Been Tested?

The patch was tested on FreeBSD with 8 and 40 CPU threads, using a striped pool of 5 Intel Optane 905p SSDs with SHA256 checksums.
With 8 CPU threads and without the patch, starting a scrub while running QD1 random uncached reads with 128KB records in fio drops IOPS from 1174 to 35 and increases 95th-percentile latency from 1ms to 77ms, with a maximum of 204ms. With the patch, the second IOPS number rises to 327, while 95th-percentile latency drops to 5ms with a maximum of 7.5ms.
Large-file delete tests with dedup enabled also show dramatic latency/interactivity improvements, while profiles show very little lock contention despite only 2 taskqs on the 8-thread system.
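
For reference, a fio invocation approximating the foreground workload described above could look like the following; the file path, size, and runtime are assumptions, and the "uncached" part would presumably come from limiting ARC caching on the dataset (e.g. zfs set primarycache=metadata on the dataset) rather than from fio itself.

```sh
# Hypothetical approximation of the QD1 128KB random read load used above;
# path, size, and runtime are assumptions, not the exact test setup.
fio --name=qd1-randread --filename=/tank/fio/testfile --size=32g \
    --rw=randread --bs=128k --iodepth=1 --numjobs=1 \
    --time_based --runtime=60 --group_reporting
```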

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@amotin amotin added the Status: Code Review Needed Ready for review and testing label Apr 29, 2021
@amotin amotin requested a review from behlendorf April 29, 2021 02:42
@amotin amotin self-assigned this Apr 29, 2021
@amotin amotin requested a review from ahrens April 29, 2021 02:49
@h1z1

h1z1 commented May 1, 2021

If the target metric is latency, why use cpu/thread count at all? None of this takes any hardware architecture into consideration. What were your tests based on? (i/dcache sizes for example).

@amotin
Member Author

amotin commented May 1, 2021

If the target metric is latency, why use cpu/thread count at all? None of this takes any hardware architecture into consideration.

Because it is about scheduling latency and task priorities; the hardware architecture is more a question for the CPU scheduler. I do think it would eventually be good to bind these multiple taskqs to NUMA domains, but I haven't gotten to it yet, especially in a cross-platform way.

What were your tests based on? (i/dcache sizes for example).

I don't see how it matters, but in my case it was 2x Xeon Silver 4114, with all cores and limited to 2 cores per socket for the 8-CPU-thread test.

@amotin amotin force-pushed the scale branch 2 times, most recently from 06305d7 to e2a423c May 1, 2021 15:58
@h1z1

h1z1 commented May 1, 2021

If the target metric is latency, why use cpu/thread count at all? None of this takes any hardware architecture into consideration.

Because it is about scheduling latency and task priorities; the hardware architecture is more a question for the CPU scheduler.

You're assuming the work scales linearly; it doesn't.

I do think it would eventually be good to bind these multiple taskqs to
NUMA domains, but I haven't gotten to it yet, especially in a cross-platform way.

What? You might want to test that. Having done so, I CAN say it makes a major difference!

What were your tests based on? (i/dcache sizes for example).

I don't see how it matters, but in my case it was 2x Xeon Silver 4114, with all cores and limited to 2 cores per socket for the 8-CPU-thread test.

Would you at least agree that hardware is nowhere near representative of most users? It doesn't even represent your initial goal of small systems.

@amotin
Member Author

amotin commented May 1, 2021

Would you at least agree that hardware is nowhere near representative of most users? It doesn't even represent your initial goal of small systems.

It is close enough. It is just what I have on my lab desk right now, without spending time to set up 10 different machines. Since the goal is just to leave some CPU cores not permanently occupied by ZFS to allow user space to run, I don't care whether those cores are on the same silicon or different. If you'd like to help me with testing, please do; I'd appreciate it.

@h1z1

h1z1 commented May 1, 2021

It is close enough.

I'm not aware of any embedded systems with Xeons.

It is just what I have on my lab desk right now, without spending time to set up 10 different machines.

If it scaled you wouldn't need to.

Since the goal is just to leave some CPU cores not permanently occupied by ZFS to allow user space to run, I don't care whether those cores are on the same silicon or different.

That is an absurd position to take. You're doing the equivalent of Google telling the world "make -j" is faster, based on their own dedicated hardware, with zero thought given to solving the actual problem. It's pretty basic queueing theory.

For example, a pool of several fast SSDs with SHA256 checksums makes the system barely responsive during a scrub, or, with dedup enabled, barely responsive during a large file deletion.

Why is that happening, for example?

If you'd like to help me with testing, please do; I'd appreciate it.

I don't have Xeons to test with, sorry.

@amotin
Member Author

amotin commented May 1, 2021

I'm not aware of any embedded systems with Xeons. ... I don't have Xeons to test with, sorry.

Are you kidding me again?

For example, a pool of several fast SSDs with SHA256 checksums makes the system barely responsive during a scrub, or, with dedup enabled, barely responsive during a large file deletion.

Why is that happening, for example?

Because scrub runs completely within the kernel, and it runs as fast as both the disks and the CPU allow. You may remember PR #11166 from last year (you should, because you annoyed me with weird accusations there too), which reduced scrub's impact on payload at the disk level. Now I am looking at the other side -- the CPU.

@h1z1

h1z1 commented May 2, 2021

I'm not aware of any embedded systems with Xeons. ... I don't have Xeons to test with, sorry.

Are you kidding me again?

?

For example, a pool of several fast SSDs with SHA256 checksums makes the system barely responsive during a scrub, or, with dedup enabled, barely responsive during a large file deletion.

Why is that happening, for example?

Because scrub runs completely within the kernel, and it runs as fast as both the disks and the CPU allow.

Yes, lots of things run in the kernel and do not, or at least should not, lock up a system as you describe. Those would be bugs. Funnily enough, I hit something very similar to this yesterday simply by clearing events.

You may remember PR #11166 from last year (you should, because you annoyed me with weird accusations there too), which reduced scrub's impact on payload at the disk level. Now I am looking at the other side -- the CPU.

That thread had nothing to do with this.

You posted numbers that didn't make sense; I was looking for clarification. Your taking offense to that is bizarre for such a change.

And I'm tired of dealing with you. Cheers.

@amotin
Member Author

amotin commented May 2, 2021

I'm not aware of any embedded systems with Xeons. ... I don't have Xeons to test with, sorry.
Are you kidding me again?
?

First you draw attention to embedded systems, which are not Xeons. I understand that. BTW, I actually have a request from a guy with a low-end embedded CPU asking for a change like this. But then you say you cannot test the code because you have no Xeons. I understand that not everybody has a Xeon, but how would that stop you from testing on the non-Xeons you do have, if you think my tests are not representative?

Yes, lots of things run in the kernel and do not, or at least should not, lock up a system as you describe.

Not many things in the kernel consume as much CPU time as ZFS with gzip-9, SHA256, and gigabytes per second of traffic.

Funnily enough, I hit something very similar to this yesterday simply by clearing events.

Cool. Do you want to analyze it and propose a patch for others to review?

You posted numbers that didn't make sense; I was looking for clarification. Your taking offense to that is bizarre for such a change.

Because you are unable to explain what you find wrong in my numbers. You mention random facts without explaining how they relate to anything. This time it was "i/dcache sizes for example".

And I'm tired of dealing with you. Cheers.

I can probably live with that. Come back when you have something constructive to propose. I've never used a "ban" button in my life, but you are making me wonder whether GitHub has one. I really tried to find your background by looking at your GitHub history, but I failed.

@h1z1

h1z1 commented May 2, 2021

Not many things in the kernel consume as much CPU time as ZFS with gzip-9, SHA256, and gigabytes per second of traffic.

Funnily enough, I hit something very similar to this yesterday simply by clearing events.

Cool. Do you want to analyze it and propose a patch for others to review?

Up until this little tirade of yours, I was.

You posted numbers that didn't make sense; I was looking for clarification. Your taking offense to that is bizarre for such a change.

Because you are unable to explain what you find wrong in my numbers. You mention random facts without explaining how they relate to anything. This time it was "i/dcache sizes for example".

And I'm tired of dealing with you. Cheers.

I can probably live with that. Come back when you have something constructive to propose. I've never used a "ban" button in my life, but you are making me wonder whether GitHub has one. I really tried to find your background by looking at your GitHub history, but I failed.

You do whatever petty things you need; I have no control over that. If asking questions bothers you, perhaps you should refrain from making public statements. Continued harassment will not be tolerated either.

@amotin
Member Author

amotin commented May 2, 2021

Continued harassment will not be tolerated either.

No, it won't. You've opened my GitHub block list.

@bghira

bghira commented May 2, 2021

On the other hand, 12*8 worker threads per kind can overload almost any
system nowadays. For example, a pool of several fast SSDs with SHA256
checksums makes the system barely responsive during a scrub, or, with
dedup enabled, barely responsive during a large file deletion.

There are several reasons that could occur; in my experience, dedup deletes are very I/O intensive.

Though I've definitely had a scrub bring my system to its knees, to the point that network-disk-based VMs disconnect.

@amotin
Member Author

amotin commented May 2, 2021

There are several reasons that could occur; in my experience, dedup deletes are very I/O intensive.

Testing this change, I easily reproduced in particular heavy CPU load lasting for seconds during the deduped delete. The DDT size was not big enough to cause significant disk load in that test.

Also, speaking of small systems, let's say single-core: having 8 taskqs means that multiple small-block I/Os aggregated into one disk I/O will try to wake up multiple threads (at least one per taskq) at the same time to handle it. That means many more context switches than needed.

@h1z1 h1z1 mentioned this pull request May 4, 2021
@amotin
Member Author

amotin commented May 10, 2021

Any more comments here?

Contributor

@behlendorf behlendorf left a comment

The reasoning behind this change makes good sense to me and I can confirm it works as intended. In my local testing I saw good performance as well and nothing I would consider a regression. My only concern is that we accidentally dropped the TASKQ_DYNAMIC flag, which we should add back.

@amotin
Member Author

amotin commented May 14, 2021

My only concern is that we accidentally dropped the TASKQ_DYNAMIC flag, which we should add back.

Restored.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels May 14, 2021
@behlendorf behlendorf merged commit 7457b02 into openzfs:master May 14, 2021
rincebrain pushed a commit to rincebrain/zfs that referenced this pull request May 17, 2021
rincebrain pushed a commit to rincebrain/zfs that referenced this pull request May 17, 2021
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request May 28, 2021
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021
tle211212 pushed a commit to tle211212/zfs that referenced this pull request Jul 9, 2021
@amotin amotin deleted the scale branch August 24, 2021 20:18