Scale worker threads and taskqs with number of CPUs. #11966

Merged
merged 1 commit into openzfs:master from amotin:scale May 14, 2021

Conversation

amotin
Member

@amotin amotin commented Apr 29, 2021

While the use of dynamic taskqs reduces the number of idle threads, the
hardcoded 8 taskqs of each kind are serious overkill for small systems:
they complicate CPU scheduling and increase I/O reordering while
providing no real locking benefit, which simply is not needed there.

On the other hand, 12*8 worker threads per kind can overload almost any
system nowadays. For example, a pool of several fast SSDs with SHA256
checksums makes the system barely responsive during a scrub, or, with
dedup enabled, barely responsive during a large file deletion.

To address both problems this patch introduces the ZTI_SCALE macro,
similar to ZTI_BATCH but with multiple taskqs whose count depends on
the number of CPUs, to be used where lock scalability is needed but
request ordering is not so important. The code creates a new taskq for
roughly every 6 worker threads (fewer on small systems, more on very
large ones), up to 80% of the CPU cores (the previous 75% rounded down
poorly). Both the number of threads and the threads per taskq are now
tunable, in case somebody really wants to devote all of the system's
power to ZFS.
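
Below is a small standalone C sketch of the sizing policy described above, for illustration only: the tunable names (zio_taskq_batch_pct, zio_taskq_batch_tpq) and the exact rounding are assumptions of this sketch, not necessarily the code as merged.

```c
/*
 * Illustrative sketch of the ZTI_SCALE sizing described above: ~80% of
 * CPUs for worker threads, split into taskqs of roughly 6 threads each.
 * Tunable names and exact rounding are assumptions, not the merged code.
 */
#include <stdio.h>

static unsigned int zio_taskq_batch_pct = 80; /* assumed: % of CPUs to use */
static unsigned int zio_taskq_batch_tpq = 0;  /* assumed: threads/taskq, 0 = auto */

static void
zti_scale(unsigned int ncpus, unsigned int *taskqs, unsigned int *threads)
{
	/* Use ~80% of the CPUs for worker threads, but never fewer than one. */
	unsigned int cpus = ncpus * zio_taskq_batch_pct / 100;
	unsigned int count;

	if (cpus < 1)
		cpus = 1;

	if (zio_taskq_batch_tpq > 0) {
		/* Explicit threads-per-taskq requested: round to nearest. */
		count = (cpus + zio_taskq_batch_tpq / 2) / zio_taskq_batch_tpq;
	} else {
		/*
		 * Aim for ~6 threads per taskq, but never create more taskqs
		 * than there are threads per taskq, so small systems get
		 * fewer threads per taskq and very large systems get more.
		 */
		count = 1 + cpus / 6;
		while (count * count > cpus)
			count--;
	}
	if (count < 1)
		count = 1;

	*taskqs = count;
	*threads = cpus;
}

int
main(void)
{
	unsigned int sizes[] = { 1, 8, 16, 40, 128 };

	for (unsigned int i = 0; i < sizeof (sizes) / sizeof (sizes[0]); i++) {
		unsigned int q, t;

		zti_scale(sizes[i], &q, &t);
		printf("%3u CPUs -> %2u taskq(s), %3u worker threads total\n",
		    sizes[i], q, t);
	}
	return (0);
}
```

With these assumptions an 8-CPU system ends up with 2 taskqs and 6 worker threads in total, in line with the 2 taskqs mentioned in the testing notes below.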

While some benchmarks obviously show a small peak performance reduction
(not really a big one, especially on systems with SMT, where using the
second hardware threads does not add as much performance as the first),
they also show a dramatic latency reduction and much smoother user-space
operation under high CPU usage by ZFS.

How Has This Been Tested?

The patch was tested on FreeBSD with 8 and 40 CPU threads, using a striped pool of 5 Intel Optane 905p SSDs with SHA256 checksums.
With 8 CPU threads and without the patch, starting a scrub while running QD1 random uncached reads with 128KB records in fio drops IOPS from 1174 to 35 and increases 95th-percentile latency from 1ms to 77ms, with a maximum of 204ms. With the patch, the second IOPS number rises to 327, while 95th-percentile latency drops to 5ms with a maximum of 7.5ms.
Large-file delete tests with dedup enabled also show dramatic latency/interactivity improvements, while profiles show very little lock contention despite only 2 taskqs on the 8-thread system.
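
For reference, a fio invocation approximating the foreground workload described above could look like the following; the file path, size, and runtime are assumptions, and the "uncached" part would presumably come from limiting ARC caching on the dataset (e.g. zfs set primarycache=metadata on the dataset) rather than from fio itself.

```sh
# Hypothetical approximation of the QD1 128KB random read load used above;
# path, size, and runtime are assumptions, not the exact test setup.
fio --name=qd1-randread --filename=/tank/fio/testfile --size=32g \
    --rw=randread --bs=128k --iodepth=1 --numjobs=1 \
    --time_based --runtime=60 --group_reporting
```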

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@amotin amotin added the Status: Code Review Needed Ready for review and testing label Apr 29, 2021
@amotin amotin requested a review from behlendorf April 29, 2021 02:42
@amotin amotin self-assigned this Apr 29, 2021
@amotin amotin requested a review from ahrens April 29, 2021 02:49
@h1z1

h1z1 commented May 1, 2021

If the target metric is latency, why use cpu/thread count at all? None of this takes any hardware architecture into consideration. What were your tests based on? (i/dcache sizes for example).

@amotin
Member Author

amotin commented May 1, 2021

If the target metric is latency, why use cpu/thread count at all? None of this takes any hardware architecture into consideration.

Because it is about scheduling latency and task priorities; the hardware architecture is more a question for the CPU scheduler. I do think it would eventually be good to bind these multiple taskqs to NUMA domains, but I haven't gotten to it yet, especially in a cross-platform way.

What were your tests based on? (i/dcache sizes for example).

I don't see how it matters, but in my case it was 2x Xeon Silver 4114, with all cores and limited to 2 cores per socket for the 8-CPU-thread test.

@amotin amotin force-pushed the scale branch 2 times, most recently from 06305d7 to e2a423c May 1, 2021 15:58
@h1z1

h1z1 commented May 1, 2021

If the target metric is latency, why use cpu/thread count at all? None of this takes any hardware architecture into consideration.

Because it is about scheduling latency and task priorities; the hardware architecture is more a question for the CPU scheduler.

You're assuming the work scales linearly; it doesn't.

I do think it would eventually be good to bind these multiple taskqs to
NUMA domains, but I haven't gotten to it yet, especially in a cross-platform way.

What? You might want to test that. Having done so, I CAN say it makes a major difference!

What were your tests based on? (i/dcache sizes for example).

I don't see how it matters, but in my case it was 2x Xeon Silver 4114, with all cores and limited to 2 cores per socket for the 8-CPU-thread test.

Would you at least agree that hardware is nowhere near representative of most users? It doesn't even represent your initial goal of small systems.

@amotin
Member Author

amotin commented May 1, 2021

Would you at least agree that hardware is nowhere near representative of most users? It doesn't even represent your initial goal of small systems.

It is close enough. It is just what I have on my lab desk right now, without spending time to set up 10 different machines. Since the goal is just to leave some CPU cores not permanently occupied by ZFS to allow user space to run, I don't care whether those cores are on the same silicon or different. If you'd like to help me with testing, please do; I'd appreciate it.

@h1z1

h1z1 commented May 1, 2021

It is close enough.

I'm not aware of any embedded systems with Xeons.

It is just what I have on my lab desk right now, without spending time to set up 10 different machines.

If it scaled you wouldn't need to.

Since the goal is just to leave some CPU cores not permanently occupied by ZFS to allow user space to run, I don't care whether those cores are on the same silicon or different.

That is an absurd position to take. You're doing the equivalent of Google telling the world "make -j" is faster, based on their own dedicated hardware, with zero thought given to solving the actual problem. It's pretty basic queueing theory.

For example, a pool of several fast SSDs with SHA256 checksums makes the system barely responsive during a scrub, or, with dedup enabled, barely responsive during a large file deletion.

Why is that happening, for example?

If you'd like to help me with testing, please do; I'd appreciate it.

I don't have Xeons to test with, sorry.

@amotin
Member Author

amotin commented May 1, 2021

I'm not aware of any embedded systems with Xeons. ... I don't have Xeons to test with, sorry.

Are you kidding me again?

For example, a pool of several fast SSDs with SHA256 checksums makes the system barely responsive during a scrub, or, with dedup enabled, barely responsive during a large file deletion.

Why is that happening, for example?

Because scrub runs completely within the kernel, and it runs as fast as both the disks and the CPU allow. You may remember PR #11166 from last year (you should, because you annoyed me with weird accusations there too), which reduced scrub's impact on payload at the disk level. Now I am looking at the other side -- the CPU.

@h1z1

h1z1 commented May 2, 2021

I'm not aware of any embedded systems with Xeons. ... I don't have Xeons to test with, sorry.

Are you kidding me again?

?

For example, a pool of several fast SSDs with SHA256 checksums makes the system barely responsive during a scrub, or, with dedup enabled, barely responsive during a large file deletion.

Why is that happening, for example?

Because scrub runs completely within the kernel, and it runs as fast as both the disks and the CPU allow.

Yes, lots of things run in the kernel and do not, or at least should not, lock up a system as you describe. Those would be bugs. Funnily enough, I hit something very similar to this yesterday simply by clearing events.

You may remember PR #11166 from last year (you should, because you annoyed me with weird accusations there too), which reduced scrub's impact on payload at the disk level. Now I am looking at the other side -- the CPU.

That thread had nothing to do with this.

You posted numbers that didn't make sense; I was looking for clarification. Your taking offense to that is bizarre for such a change.

And I'm tired of dealing with you. Cheers.

@amotin
Member Author

amotin commented May 2, 2021

I'm not aware of any embedded systems with Xeons. ... I don't have Xeons to test with, sorry.
Are you kidding me again?
?

First you draw attention to embedded systems, which are not Xeons. I understand that. BTW, I actually have a request from a guy with a low-end embedded CPU asking for a change like this. But then you say you cannot test the code because you have no Xeons. I understand that not everybody has a Xeon, but how would that stop you from testing on the non-Xeons you do have, if you think my tests are not representative?

Yes, lots of things run in the kernel and do not, or at least should not, lock up a system as you describe.

Not many things in the kernel consume as much CPU time as ZFS with gzip-9, SHA256, and gigabytes per second of traffic.

Funnily enough, I hit something very similar to this yesterday simply by clearing events.

Cool. Do you want to analyze it and propose a patch for others to review?

You posted numbers that didn't make sense; I was looking for clarification. Your taking offense to that is bizarre for such a change.

Because you are unable to explain what you find wrong in my numbers. You mention random facts without explaining how they relate to anything. This time it was "i/dcache sizes for example".

And I'm tired of dealing with you. Cheers.

I can probably live with that. Come back when you have something constructive to propose. I've never used a "ban" button in my life, but you are making me wonder whether GitHub has one. I really tried to find your background by looking at your GitHub history, but I failed.

@h1z1

h1z1 commented May 2, 2021

Not many things in the kernel consume as much CPU time as ZFS with gzip-9, SHA256, and gigabytes per second of traffic.

Funnily enough, I hit something very similar to this yesterday simply by clearing events.

Cool. Do you want to analyze it and propose a patch for others to review?

Up until this little tirade of yours, I was.

You posted numbers that didn't make sense; I was looking for clarification. Your taking offense to that is bizarre for such a change.

Because you are unable to explain what you find wrong in my numbers. You mention random facts without explaining how they relate to anything. This time it was "i/dcache sizes for example".

And I'm tired of dealing with you. Cheers.

I can probably live with that. Come back when you have something constructive to propose. I've never used a "ban" button in my life, but you are making me wonder whether GitHub has one. I really tried to find your background by looking at your GitHub history, but I failed.

You do whatever petty things you need; I have no control over that. If asking questions bothers you, perhaps you should refrain from making public statements. Continued harassment will not be tolerated either.

@amotin
Member Author

amotin commented May 2, 2021

Continued harassment will not be tolerated either.

No, it won't. You've opened my GitHub block list.

@bghira

bghira commented May 2, 2021

On the other hand, 12*8 worker threads per kind can overload almost any
system nowadays. For example, a pool of several fast SSDs with SHA256
checksums makes the system barely responsive during a scrub, or, with
dedup enabled, barely responsive during a large file deletion.

There are several reasons that could occur; in my experience, dedup deletes are very I/O intensive.

Though I've definitely had a scrub bring my system to its knees, to the point that network-disk-based VMs disconnect.

@amotin
Member Author

amotin commented May 2, 2021

There are several reasons that could occur; in my experience, dedup deletes are very I/O intensive.

Testing this change, I easily reproduced in particular heavy CPU load lasting for seconds during the deduped delete. The DDT size was not big enough to cause significant disk load in that test.

Also, speaking of small systems, let's say single-core: having 8 taskqs means that multiple small-block I/Os aggregated into one disk I/O will try to wake up multiple threads (at least one per taskq) at the same time to handle it. That means many more context switches than needed.

@h1z1 h1z1 mentioned this pull request May 4, 2021
@amotin
Member Author

amotin commented May 10, 2021

Any more comments here?

Contributor

@behlendorf behlendorf left a comment

The reasoning behind this change makes good sense to me and I can confirm it works as intended. In my local testing I saw good performance as well and nothing I would consider a regression. My only concern is that we accidentally dropped the TASKQ_DYNAMIC flag, which we should add back.

@amotin
Member Author

amotin commented May 14, 2021

My only concern is that we accidentally dropped the TASKQ_DYNAMIC flag, which we should add back.

Restored.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels May 14, 2021
@behlendorf behlendorf merged commit 7457b02 into openzfs:master May 14, 2021
rincebrain pushed a commit to rincebrain/zfs that referenced this pull request May 17, 2021
rincebrain pushed a commit to rincebrain/zfs that referenced this pull request May 17, 2021
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request May 28, 2021
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021
tle211212 pushed a commit to tle211212/zfs that referenced this pull request Jul 9, 2021
@amotin amotin deleted the scale branch August 24, 2021 20:18