Implement parallel ARC eviction #16486
(force-pushed from 4cd510d to f45bf2e)
(force-pushed from f45bf2e to 146fe45)
I've been casually testing this out (combined with the parallel_dbuf_evict PR) over the last couple of weeks (most recently, 5b070d1). I've not been hammering it hard or specifically, just letting it do its thing on my messing-around desktop system. I hit a probable regression today, though: while mv'ing a meager 8GB of files from one pool to another, all my ZFS I/O got really high-latency, and iotop showed that the copy part of the move (this being a mv across pools, so in reality it's a copy-and-remove) was running at a painful few hundred KB/sec, and the zfs arc_evict thread was taking a whole core... but just one core. In time it all cleared up, and of course I can't conclusively blame this PR's changes, but I was left with two fuzzy observations:
(force-pushed from 146fe45 to e128026)
I have updated the patch with different logic for picking the default maximum number of ARC eviction threads. The new logic aims to pick one eighth of the available CPUs, with a minimum of 2 and a maximum of 16.
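For illustration, here is a minimal sketch of the default described above (the helper name is made up for the example; it is not the PR's actual code):

```c
/*
 * Sketch of the default described above: one eighth of the available
 * CPUs, clamped to the range [2, 16].  Illustrative only.
 */
static unsigned int
arc_evict_threads_default(unsigned int ncpus)
{
	unsigned int n = ncpus / 8;

	if (n < 2)
		n = 2;
	if (n > 16)
		n = 16;
	return (n);	/* e.g. 8 CPUs -> 2, 64 -> 8, 128+ -> 16 */
}
```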
Why would we need two evict threads on a single-core system? In that case I would probably prefer to disable taskqs completely. If that is a way to make it more logarithmic, then I would think about
Right now, this is only enabled by a separate tunable, so for the single-CPU case we don't expect it to be enabled. But for something like 4-12 core systems, we would want it to use at least 2 threads, and then grow from there, reaching 16 threads at 128 cores.
Now that you mention it, I've noticed it's disabled by default. I don't like the idea of tuning it manually in production depending on system size. I would prefer to have reasonable automatic defaults.
(force-pushed from b6a65a2 to e99733e)
Hey! So, here's what changed in the patch:

Formula

There is now a different formula for automatically scaling the number of evict threads when the parameter is left at its automatic setting.
It looks like this (the x axis is the CPU count and the y axis is the evict thread count): [graph and table of values omitted].
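The exact formula and table did not survive above, so purely as an illustration of the kind of logarithmic-ish scaling being discussed (the name and constants here are assumptions, not necessarily what the patch uses):

```c
/*
 * Illustration only: grow the evict thread count roughly with the
 * logarithm of the CPU count rather than linearly, still capped at 16.
 */
static unsigned int
evict_threads_log_scaled(unsigned int ncpus)
{
	unsigned int bits = 0;

	while (ncpus > 1) {	/* floor(log2(ncpus)) */
		ncpus >>= 1;
		bits++;
	}
	if (bits < 2)
		bits = 2;
	return (bits > 16 ? 16 : bits);	/* e.g. 4 -> 2, 64 -> 6, 256 -> 8 */
}
```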
Fewer parameters
This approach was suggested by @tonyhutter in another PR (#16487 (comment)).

Stability improvements

It is no longer possible to modify the actual evict thread count at runtime. Since the evict taskqs are only created during arc_init(), the module saves the actual number of evict threads it is going to use and ignores later changes to zfs_arc_evict_threads.
Thanks for automating it. A few comments on that part, and please take a look at my earlier comments.
I am not sure it is right, but it seems GCC does not like it:
(force-pushed from d899eaf to 3218719)
I gave this another spin (not in isolation, though, FYI - it was along with the parallel dbuf eviction PR) and got a repeat of the previously noted behavior, so it seems not to be a coincidence. In stress-testing the intended use case (chugging through data when the ARC is already full), this PR seems benign and probably even beneficial - multiple arc reaper threads are active and busy, and throughput is very healthy. However, later, just puttering around in desktop usage under quite light reads, I noticed that a reading app was blocked for several seconds at a time and the experience was quite unpleasant. Lo and behold, one or more arc_evict threads were again busy on a core.
Thanks. We are getting closer. ;)
Is anything left unresolved?
@alex-stetsenko I'll take another look a bit later, but meanwhile it would be good to fix the style issue (a line is too long), squash it all into one commit, and rebase on top of master.
Yes, I also found that I had an older version. I still need to double-check and get testing done. :) Thanks for looking at it too! :)
(force-pushed from 7555d5e to c6f7271)
This one has fallen to me to get over the line. Last push(es):
The second commit is all new. I was trying to document the tunables in response to @tssge's comment above, and couldn't really figure out how to explain it clearly, so I instead tried to simplify their relationship, balancing the operator's intent against the number of unused threads while still giving room to grow. Dunno how I did; let me know. (It really would be so much easier if we could just tell a taskq to change its number of threads; something in between fixed and dynamic. It probably wouldn't even be that hard.) Anyway, take a look, and let's land this beast.
(force-pushed from 729215e to 3c1b3f7)
Ok, I'm giving up on the thread/threads_max split for now. There just isn't an obvious good way to balance these for all imagined use cases, especially with the taskq control tools we have available, and I don't even know if we have an actual use case to target. So, the last push removes it. The actual evict code itself hasn't changed since my last pass. The only difference is in the setup: we now preallocate the arg array at module load, since we know it won't change in size. This follows the pattern used with the markers array - if we're on the proper evict thread, use the preallocated set, and if not, allocate our own. I think that's it!
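A tiny sketch of the prealloc-or-allocate pattern described here (all names are hypothetical, and calloc stands in for the kernel allocator):

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct evict_arg {
	uint64_t	ea_bytes;	/* share this worker should evict */
	uint64_t	ea_evicted;	/* how much it actually evicted */
} evict_arg_t;

/* Filled in once at module load, sized for the fixed thread count. */
static evict_arg_t *arc_evict_args;

/*
 * On the dedicated evict thread we can use the preallocated array;
 * other callers allocate a temporary one instead.
 */
static evict_arg_t *
evict_args_get(unsigned int nthreads, int on_evict_thread)
{
	if (on_evict_thread)
		return (arc_evict_args);
	return (calloc(nthreads, sizeof (evict_arg_t)));
}
```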
On systems with enormous amounts of memory, the single arc_evict thread can become a bottleneck if reads and writes are stuck behind it, waiting for old data to be evicted before new data can take its place.

This commit adds support for evicting from multiple ARC lists in parallel, by farming the evict work out to some number of threads and then accumulating their results.

A new tuneable, zfs_arc_evict_threads, sets the number of threads. By default, it will scale based on the number of CPUs.

Sponsored-by: Expensify, Inc.
Sponsored-by: Klara, Inc.
Co-authored-by: Allan Jude <[email protected]>
Co-authored-by: Mateusz Piotrowski <[email protected]>
Co-authored-by: Alexander Stetsenko <[email protected]>
Signed-off-by: Allan Jude <[email protected]>
Signed-off-by: Mateusz Piotrowski <[email protected]>
Signed-off-by: Alexander Stetsenko <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
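As a rough userspace sketch of the farm-out-and-accumulate idea the commit message describes (hypothetical names; the real change dispatches work on a taskq inside arc.c rather than creating threads per request):

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define	NWORKERS	4	/* stand-in for zfs_arc_evict_threads */

typedef struct evict_arg {
	uint64_t	ea_bytes;	/* share this worker should evict */
	uint64_t	ea_evicted;	/* how much it reports back */
} evict_arg_t;

/* Stand-in for evicting from one set of ARC lists. */
static void *
evict_worker(void *arg)
{
	evict_arg_t *ea = arg;

	ea->ea_evicted = ea->ea_bytes;	/* pretend we evicted it all */
	return (NULL);
}

int
main(void)
{
	uint64_t todo = 64 << 20;	/* pretend we must evict 64 MiB */
	uint64_t total = 0;
	evict_arg_t args[NWORKERS];
	pthread_t tids[NWORKERS];

	/* Farm the work out: each worker gets an equal share. */
	for (int i = 0; i < NWORKERS; i++) {
		args[i].ea_bytes = todo / NWORKERS;
		args[i].ea_evicted = 0;
		pthread_create(&tids[i], NULL, evict_worker, &args[i]);
	}

	/* Accumulate the results once everyone is done. */
	for (int i = 0; i < NWORKERS; i++) {
		pthread_join(tids[i], NULL);
		total += args[i].ea_evicted;
	}
	printf("evicted %llu bytes\n", (unsigned long long)total);
	return (0);
}
```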
Thank you!
Could we have some more eyes on this to finally push it through?
The change looks good to me. I applied this PR on our 2.2 branch and tested it with an in-house workload for a few days; no red flags.
On systems with enormous amounts of memory, the single arc_evict thread can become a bottleneck if reads and writes are stuck behind it, waiting for old data to be evicted before new data can take its place.

This commit adds support for evicting from multiple ARC lists in parallel, by farming the evict work out to some number of threads and then accumulating their results.

A new tuneable, zfs_arc_evict_threads, sets the number of threads. By default, it will scale based on the number of CPUs.

Sponsored-by: Expensify, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Youzhong Yang <[email protected]>
Signed-off-by: Allan Jude <[email protected]>
Signed-off-by: Mateusz Piotrowski <[email protected]>
Signed-off-by: Alexander Stetsenko <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Co-authored-by: Rob Norris <[email protected]>
Co-authored-by: Mateusz Piotrowski <[email protected]>
Co-authored-by: Alexander Stetsenko <[email protected]>
Closes #16486
I installed this via zfs-2.3.3 GA on a few large-memory NFS servers (384GB - 1TB). While the parallel kernel threads are occasionally used, it appears that one thread still dominates even when it is nearly 100% busy. Is there anything I can do to trigger more parallelism under load?

[root@zfs9 ~]# uname -a
Linux zfs9 4.18.0-553.56.1.el8_10.x86_64 #1 SMP Tue Jun 10 17:00:45 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
[root@zfs9 ~]# cat /sys/module/zfs/version
2.3.3-1
[root@zfs9 ~]# top
top - 07:57:44 up 1 day, 16:13, 2 users, load average: 2049.98, 1991.14, 1946.32
Tasks: 3235 total, 6 running, 3229 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.1 us, 7.4 sy, 0.0 ni, 89.4 id, 2.8 wa, 0.1 hi, 0.2 si, 0.0 st
MiB Mem : 385086.8 total, 20213.4 free, 262595.1 used, 102278.4 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 116220.8 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3254 root 20 0 0 0 0 R 100.0 0.0 498:39.68 arc_evict
3244 root 20 0 0 0 0 R 57.9 0.0 287:27.41 arc_prune
1209011 root 20 0 0 0 0 R 57.9 0.0 1:27.65 arc_prune
1210543 root 20 0 0 0 0 S 57.9 0.0 0:29.53 arc_prune
1211522 root 20 0 0 0 0 R 57.9 0.0 0:08.02 arc_prune
1211645 root 20 0 69060 8012 3764 R 21.1 0.0 0:00.07 top
6693 root 0 -20 0 0 0 S 5.3 0.0 144:28.39 z_rd_int_0
6694 root 0 -20 0 0 0 S 5.3 0.0 144:26.18 z_rd_int_1
[root@zfs9 ~]# ps -ef | grep arc_evict
root 3245 2 0 Jun19 ? 00:07:55 [arc_evict]
root 3246 2 0 Jun19 ? 00:07:54 [arc_evict]
root 3247 2 0 Jun19 ? 00:07:54 [arc_evict]
root 3248 2 0 Jun19 ? 00:07:55 [arc_evict]
root 3249 2 0 Jun19 ? 00:07:55 [arc_evict]
root 3250 2 0 Jun19 ? 00:07:55 [arc_evict]
root 3251 2 0 Jun19 ? 00:07:55 [arc_evict]
root 3252 2 0 Jun19 ? 00:07:55 [arc_evict]
root 3254 2 22 Jun19 ? 09:28:35 [arc_evict]
@stuartthebruce A single thread is used when the requested eviction amount is too small to be worth splitting between multiple threads. It may be that your kernel version requests too little at a time.
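A small sketch of that kind of cutoff, under the assumption that there is some minimum amount of work worth handing to each thread (the name and threshold here are illustrative, not the module's actual tunables):

```c
#include <stdint.h>

#define	MIN_EVICT_PER_THREAD	(1ULL << 20)	/* assumed: 1 MiB per worker */

/*
 * Decide how many evict workers a request should use: fall back to a
 * single (inline) thread whenever the amount is too small to split.
 */
static unsigned int
evict_workers_for(uint64_t bytes, unsigned int max_threads)
{
	uint64_t n = bytes / MIN_EVICT_PER_THREAD;

	if (n <= 1)
		return (1);	/* small request: no fan-out to the taskq */
	return ((unsigned int)(n < max_threads ? n : max_threads));
}
```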
How about when there are a lot of small eviction requests? Is there anything useful I can profile on my systems to perhaps help improve that? Note, in case it makes a difference, I am seeing this on both the EL8.10 kernel (4.18.0-553.56.1.el8_10.x86_64) and the EL9.6 kernel (5.14.0-570.22.1.el9_6.x86_64).
@stuartthebruce Requests from the kernel come in via […]. As an alternative pressure source, ZFS itself can evict old data when there is new active I/O, and in that case we should have sufficient hysteresis in the form of […].
Sponsored-by: Expensify, Inc.
Sponsored-by: Klara, Inc.
Motivation and Context
Read and write performance can become limited by the arc_evict process being single threaded.
Additional data cannot be added to the ARC until sufficient existing data is evicted.
On many-core systems with TBs of RAM, a single thread becomes a significant bottleneck.
With this change, we see a 25% increase in read and write throughput.
Description
Use a new taskq to run multiple arc_evict() threads at once, each given a fraction of the desired memory to reclaim.

How Has This Been Tested?
Benchmarking with a full ARC to measure the performance difference.