Scale worker threads and taskqs with number of CPUs. #11966
Conversation
If the target metric is latency, why use CPU/thread count at all? None of this takes any hardware architecture into consideration. What were your tests based on? (I/D-cache sizes, for example.)
Because it is about scheduling latency and task priorities. Architecture there is more a question for the CPU scheduler, etc. I do think it would be good, longer term, to bind these multiple taskqs to NUMA domains, but I haven't gotten to it yet, especially in a cross-platform way.
I don't see how it matters, but in my case it was 2x Xeon Silver 4114, with all cores and then limited to 2 cores per socket (for the 8-CPU-thread test).
Force-pushed from 06305d7 to e2a423c.
You're assuming the work scales linearly; it doesn't.
What? You might want to test that. Having done so I CAN say it makes a major difference!
Would you at least agree that that hardware is nowhere near representative of most users? It doesn't even represent your initial goal of small systems.
It is close enough. It is just what I have on my lab desk right now, without spending time setting up 10 different machines. Since the goal is just to leave some CPU cores not permanently occupied by ZFS so that user space can run, I don't care whether those cores are on the same silicon or different. If you would like to help me with testing, please do; I'll appreciate that.
I'm not aware of any embedded systems with Xeons.
If it scaled you wouldn't need to.
That is an absurd position to take. You're doing the equivalent of Google telling the world "make -j" is faster, based on their own dedicated hardware, with zero thought put into solving the actual problem. It's pretty basic queueing theory.
Why is that happening, for example?
I don't have Xeons to test with, sorry.
Are you kidding me again?
Because scrub runs completely within the kernel, and it runs as fast as both the disks and the CPU allow. You may remember PR #11166 from last year (you should, because you annoyed me with weird accusations there too), which reduced scrub effects on payload at the disk level. Now I am looking at the other side -- the CPU.
?
Yes, lots of things run in the kernel and do not, or at least should not, lock up a system as you describe. Those would be bugs. Funnily enough, I hit something very similar to this yesterday simply clearing events.
That thread had nothing to do with this. You posted numbers that didn't make sense; I was looking for clarification. Your taking offense to that is bizarre for such a change. And I'm tired of dealing with you. Cheers.
First you draw attention to embedded systems, which are not Xeons. I understand that. BTW, I actually have a request from a guy with a low-end embedded CPU asking for a change like this. But then you say you cannot test the code because you have no Xeons. I understand not everybody has Xeons, but how does that stop you from testing on the non-Xeons you do have, if you think my tests are not representative?
Not many things in the kernel consume as much CPU time as ZFS with gzip-9 and SHA-256 at gigabytes per second of traffic.
Cool. Do you want to analyze it and propose a patch for others to review?
Because you are unable to explain what you see wrong in my numbers. You mention random facts without explaining how they relate to anything. This time it was "i/dcache sizes for example".
I can probably live with this. Come back when you have something constructive to propose. I've never used a "ban" button in my life, but you are making me wonder whether GitHub has one. I really tried to find your background by looking at your GitHub history, but I failed.
Up to this little tyrant of yours, I was.
You do whatever petty things you need; I have no control over that. If asking questions bothers you, perhaps you should refrain from making public statements. Continued harassment will not be tolerated either.
No. It won't. You've opened my GitHub block list.
There are several reasons that could occur -- in my experience, dedup deletes are very I/O intensive, though I've definitely had a scrub bring my system to its knees to the point that network-disk-based VMs disconnect.
Testing this change, I easily reproduced, in particular, heavy CPU load lasting seconds during the dedupped delete. The DDT size was not big enough to cause significant disk load in the test. Also, speaking of small systems, let's say single-core: having 8 taskqs means that multiple small-block I/Os aggregated into one disk I/O will try to wake up multiple threads (at least one per taskq) at the same time to handle it. That means many more context switches than needed.
Any more comments here?
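To make the wakeup argument concrete, here is a tiny user-space toy model (not ZFS code; the round-robin dispatch policy and all names are assumptions for illustration): one physical I/O completing K aggregated child I/Os, each handed to one of Q taskqs, touches roughly min(K, Q) taskqs, so 8 taskqs cost far more thread wakeups per disk I/O than 2 on a small machine.
```c
/*
 * Toy model only -- not OpenZFS code.  Each child completion is dispatched
 * to a taskq (here round-robin); the first task landing on an untouched
 * taskq wakes one of its idle worker threads, so the count of distinct
 * taskqs touched approximates the wakeups caused by one aggregated disk I/O.
 */
#include <stdbool.h>
#include <stdio.h>

#define	MAX_TASKQS	64

static unsigned int
wakeups_per_aggregated_io(unsigned int children, unsigned int ntaskqs)
{
	bool touched[MAX_TASKQS] = { false };
	unsigned int woken = 0;

	for (unsigned int c = 0; c < children; c++) {
		unsigned int q = c % ntaskqs;	/* pick a taskq per child */
		if (!touched[q]) {
			touched[q] = true;
			woken++;	/* an idle worker thread wakes up */
		}
	}
	return (woken);
}

int
main(void)
{
	/* 16 small child I/Os aggregated into one disk I/O. */
	printf("8 taskqs: %u wakeups\n", wakeups_per_aggregated_io(16, 8));
	printf("2 taskqs: %u wakeups\n", wakeups_per_aggregated_io(16, 2));
	return (0);
}
```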
The reasoning behind this change makes good sense to me, and I can confirm it works as intended. In my local testing I saw good performance as well, and nothing I would consider a regression. My only concern is that we accidentally dropped the TASKQ_DYNAMIC flag, which we should add back.
Restored.
While use of dynamic taskqs allows the number of idle threads to be reduced,
8 hardcoded taskqs of each kind is a big overkill for small systems,
complicating CPU scheduling, increasing I/O reordering, etc., while
providing no real locking benefits; they are just not needed there.
On the other hand, 12*8 worker threads per kind are able to overload
almost any system nowadays. For example, a pool of several fast SSDs
with SHA256 checksums makes the system barely responsive during scrub,
or, with dedup enabled, barely responsive during large file deletion.
To address both problems this patch introduces the ZTI_SCALE macro,
similar to ZTI_BATCH, but with multiple taskqs, depending on the number
of CPUs, to be used in places where lock scalability is needed but
request ordering is not so important. The code creates a new taskq for
every ~6 worker threads (fewer for small systems, but more for very
large ones), up to 80% of CPU cores (the previous 75% was not good for
rounding down). Both the number of threads and the threads per taskq
are now tunable in case somebody really wants to use all of the
system's power for ZFS.
While some benchmarks obviously show a small peak performance reduction
(not so big really, especially on systems with SMT, where use of the
second threads does not give as much performance as the first ones),
they also show a dramatic latency reduction and much smoother user-
space operation under high CPU usage by ZFS.
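To make the sizing above concrete, here is a rough user-space sketch of the arithmetic described. It is illustrative only: the fixed 6-threads-per-taskq value and the variable names are assumptions, and the actual OpenZFS code additionally adjusts threads per taskq for very small and very large systems.
```c
/*
 * Illustrative sketch only -- not the actual OpenZFS implementation.
 * Target roughly 80% of the CPUs as worker threads and split them into
 * taskqs of about 6 threads each; both knobs are meant to be tunable.
 */
#include <stdio.h>

static unsigned int zio_taskq_batch_pct = 80;	/* % of CPUs used (tunable) */
static unsigned int zio_taskq_batch_tpq = 6;	/* threads per taskq (tunable) */

static void
zti_scale(unsigned int ncpus, unsigned int *nthreads, unsigned int *ntaskqs)
{
	/* Use about 80% of the CPUs, but always at least one thread. */
	*nthreads = (ncpus * zio_taskq_batch_pct) / 100;
	if (*nthreads < 1)
		*nthreads = 1;

	/* One taskq per ~6 worker threads, rounded up, at least one. */
	*ntaskqs = (*nthreads + zio_taskq_batch_tpq - 1) / zio_taskq_batch_tpq;
}

int
main(void)
{
	unsigned int cpus[] = { 1, 4, 8, 40, 128 };

	for (unsigned int i = 0; i < sizeof (cpus) / sizeof (cpus[0]); i++) {
		unsigned int nt, nq;

		zti_scale(cpus[i], &nt, &nq);
		printf("%3u CPUs -> %3u threads across %2u taskqs\n",
		    cpus[i], nt, nq);
	}
	return (0);
}
```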
How Has This Been Tested?
The patch was tested on FreeBSD with 8 and 40 CPU threads on a striped pool of 5 Intel Optane 905p SSDs with SHA256 checksums.
With 8 CPU threads and without the patch, starting a scrub while running a QD1 random uncached 128KB-record read with fio drops IOPS from 1174 to 35 and increases the 95th-percentile latency from 1ms to 77ms, with a maximum of 204ms. With the patch, the second IOPS number rises to 327, while the 95th-percentile latency drops to 5ms with a maximum of 7.5ms.
Large-file delete tests with dedup enabled also show dramatic latency/interactivity improvements, while profiles show very little lock contention despite only 2 taskqs on an 8-thread system.