
Implement parallel ARC eviction #16486


Merged: 1 commit merged into openzfs:master on May 14, 2025

Conversation

allanjude
Contributor

Sponsored-by: Expensify, Inc.
Sponsored-by: Klara, Inc.

Motivation and Context

Read and write performance can become limited by the arc_evict process being single-threaded.
Additional data cannot be added to the ARC until sufficient existing data is evicted.

On many-core systems with TBs of RAM, a single thread becomes a significant bottleneck.

With this change, we see a 25% increase in read and write throughput.

Description

Use a new taskq to run multiple arc_evict() threads at once, each given a fraction of the desired memory to reclaim.
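For illustration only, here is a userspace analogue of that dispatch-and-accumulate idea, using POSIX threads instead of a kernel taskq; all of the names (evict_arg_t, evict_worker, NTHREADS) are invented for this sketch and are not the PR's actual code:

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 4

typedef struct {
	uint64_t bytes_asked;   /* this worker's share of the request */
	uint64_t bytes_evicted; /* how much it actually reclaimed */
} evict_arg_t;

static void *
evict_worker(void *arg)
{
	evict_arg_t *ea = arg;

	/* Pretend each worker only manages to free 90% of its share. */
	ea->bytes_evicted = ea->bytes_asked - ea->bytes_asked / 10;
	return (NULL);
}

int
main(void)
{
	uint64_t total = 1ULL << 30;	/* want to evict 1 GiB */
	pthread_t tid[NTHREADS];
	evict_arg_t args[NTHREADS];
	uint64_t evicted = 0;

	for (int i = 0; i < NTHREADS; i++) {
		/* Give each worker an equal fraction of the request. */
		args[i].bytes_asked = total / NTHREADS;
		pthread_create(&tid[i], NULL, evict_worker, &args[i]);
	}
	for (int i = 0; i < NTHREADS; i++) {
		pthread_join(tid[i], NULL);
		evicted += args[i].bytes_evicted;	/* accumulate results */
	}
	printf("evicted %llu of %llu bytes\n",
	    (unsigned long long)evicted, (unsigned long long)total);
	return (0);
}
```

Build with `cc -pthread`; each worker handles an equal fraction of the request and the caller sums what was actually evicted.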

How Has This Been Tested?

Benchmarking with a full ARC to measure the performance difference.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@adamdmoss
Contributor

adamdmoss commented Sep 12, 2024

I've been casually testing this out (combined with the parallel_dbuf_evict PR) over the last couple of weeks (most recently, 5b070d1).

I've not been hammering it hard or specifically, just letting it do its thing with my messing-around desktop system.

Hit a probable regression today, though: while mv'ing a meager 8GB of files from one pool to another, all my zfs IO got really high-latency, and an iotop showed that the copy part of the move (this being a mv across pools, so in reality it's a copy-and-remove) was running at a painful few 100KB/sec, and the zfs arc_evict thread was taking a whole core... but just one core.

In time it all cleared up and of course I can't conclusively blame this PR's changes, but I'm left with two fuzzy observations:

  • In many years of mucking around with ZFS I've never(?) seemed to get the 'arc_evict is pegging CPU badly' issue until I started testing this PR's changes (though I'm aware the issue occurs in the wild for folks on master/release ZFSes)
  • arc_evict was only using one core as far as I can tell, so I guess the parallelism which is the point of this PR just wasn't kicking-in for some reason anyway and/or the spinning was happening outside of the parallelized part

@0mp 0mp force-pushed the parallel_arc_evict branch from 146fe45 to e128026 on September 12, 2024 10:03
@0mp
Contributor

0mp commented Sep 12, 2024

I have updated the patch with a different logic for picking the default maximum number of ARC eviction threads. The new logic aims to pick the number that is one-eighth of the available CPUs, with a minimum of 2 and a maximum of 16.

@amotin
Member

amotin commented Sep 12, 2024

one-eighth of the available CPUs, with a minimum of 2 and a maximum of 16.

Why would we need two evict threads on a single-core system? In that case I would probably prefer to disable taskqs completely. If that is a way to make it more logarithmic, then I would think about highbit(), though then it will grow pretty slowly for very large systems, so that the limit of 16 will never be reached. But I am not exactly sure the faster growth would make sense, since it may cause more lock contention in the memory allocator, etc.

@behlendorf behlendorf added the "Status: Code Review Needed" label (Ready for review and testing) Sep 13, 2024
@allanjude
Contributor Author

one-eighth of the available CPUs, with a minimum of 2 and a maximum of 16.

Why would we need two evict threads on a single-core system? In that case I would probably prefer to disable taskqs completely. If that is a way to make it more logarithmic, then I would think about highbit(), though then it will grow pretty slowly for very large systems, so that the limit of 16 will never be reached. But I am not exactly sure the faster growth would make sense, since it may cause more lock contention in the memory allocator, etc.

Right now, multiple threads are only enabled via a separate tunable. So for the single-CPU case, we don't expect it to be enabled. But for something like 4-12 core systems, we would want it to use at least 2 threads, and then grow from there, reaching 16 threads at 128 cores.

@amotin
Member

amotin commented Sep 16, 2024

Right now, multiple threads are only enabled via a separate tunable. So for the single-CPU case, we don't expect it to be enabled.

Now that you mention it, I've noticed it's been disabled by default. I don't like the idea of tuning it manually in production depending on system size. I would prefer to have reasonable automatic defaults.

@0mp 0mp force-pushed the parallel_arc_evict branch 2 times, most recently from b6a65a2 to e99733e on October 23, 2024 19:49
@0mp
Contributor

0mp commented Oct 23, 2024

Hey! So, here's what changed in the patch:

Formula

There is now a different formula for automatically scaling the number of evict threads when the parameter is set to 0. The formula is:

MIN(MAX(max_ncpus > 6 ? 2 : 1, ilog2(max_ncpus) + (max_ncpus >> 6)), 16);

It looks like this (the x axis is the CPU count and the y axis is the evict thread count):

[plot omitted: evict thread count vs. CPU count]

Here's also a table:

| CPUs | zfs_arc_evict_threads | Evict threads count | Using taskq? |
| --- | --- | --- | --- |
| 1 | 0 | 1 (autoscaled) | No |
| 2 | 0 | 1 (autoscaled) | No |
| 5 | 0 | 1 (autoscaled) | No |
| 6 | 0 | 2 (autoscaled) | Yes |
| 1024 | 0 | 16 (autoscaled) | Yes |
| (not using autoscaling, CPU count is irrelevant) | 1 | 1 | No |
| (not using autoscaling, CPU count is irrelevant) | 32 | 32 | Yes |
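To make the scaling concrete, the short standalone program below evaluates the formula for a few CPU counts. It assumes that ilog2() means floor(log2(n)), and the helper and function names are illustrative rather than the module's actual identifiers:

```c
#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Assumed meaning of ilog2(): floor(log2(n)) for n >= 1. */
static int
ilog2(unsigned int n)
{
	int b = -1;

	while (n != 0) {
		n >>= 1;
		b++;
	}
	return (b);
}

/* Default evict-thread count for a given CPU count, per the formula above. */
static int
default_evict_threads(unsigned int max_ncpus)
{
	return (MIN(MAX(max_ncpus > 6 ? 2 : 1,
	    ilog2(max_ncpus) + (max_ncpus >> 6)), 16));
}

int
main(void)
{
	unsigned int cpus[] = { 1, 2, 6, 8, 16, 64, 128, 1024 };

	for (size_t i = 0; i < sizeof (cpus) / sizeof (cpus[0]); i++)
		printf("%4u CPUs -> %d evict threads\n",
		    cpus[i], default_evict_threads(cpus[i]));
	return (0);
}
```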

Fewer parameters

zfs_arc_evict_threads is now the only parameter exposed to control the evict thread count. The zfs_arc_evict_threads_parallel parameter has been removed in favor of enabling the use of taskqs whenever there are two or more evict threads.

This approach has been suggested by @tonyhutter in another PR (#16487 (comment)).

Stability improvements

It is no longer possible to modify the actual evict thread count at runtime. Since the evict taskqs are only created during arc_init(), the module saves the actual number of evict threads it is going to use and ignores changes to zfs_arc_evict_threads from then on. This behavior has been documented in the manual page.
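A minimal sketch of that snapshot-at-init behavior, with illustrative names rather than the PR's actual identifiers:

```c
#include <stdio.h>

/* Module parameter: writable at any time, but only read once. */
static unsigned int zfs_arc_evict_threads = 0;	/* 0 means "autoscale" */

/* The value the eviction code actually uses, fixed at arc_init() time. */
static unsigned int arc_evict_threads_live;

static void
arc_evict_threads_init(unsigned int autoscaled_default)
{
	/*
	 * Capture the tunable exactly once; the evict taskq is sized from
	 * this snapshot, so later writes to the parameter have no effect.
	 */
	arc_evict_threads_live = (zfs_arc_evict_threads != 0) ?
	    zfs_arc_evict_threads : autoscaled_default;
}

int
main(void)
{
	arc_evict_threads_init(4);	/* pretend autoscaling chose 4 */
	zfs_arc_evict_threads = 32;	/* later change: ignored by design */
	printf("threads in use: %u\n", arc_evict_threads_live);	/* prints 4 */
	return (0);
}
```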

Member

@amotin amotin left a comment


Thanks for automating it. A few comments on that part, and please take a look at my earlier comments.

@amotin amotin added the "Status: Revision Needed" label (Changes are required for the PR to be accepted) Nov 6, 2024
@github-actions github-actions bot removed the "Status: Revision Needed" label (Changes are required for the PR to be accepted) Nov 12, 2024
@amotin
Member

amotin commented Nov 19, 2024

I am not sure it is right, but it seems GCC does not like it:

  module/zfs/arc.c: In function 'arc_evict_state':
  module/zfs/arc.c:4095:15: error: 'evarg' may be used uninitialized in this function [-Werror=maybe-uninitialized]
    evict_arg_t *evarg;
                 ^~~~~
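A common remedy for this class of warning is to give the pointer a definite initial value so every path is well defined. The fragment below is a minimal illustration of the pattern GCC is flagging, not the PR's actual code:

```c
#include <stdlib.h>

typedef struct evict_arg {
	unsigned long bytes;
} evict_arg_t;

unsigned long
example(int use_taskq, int nthreads)
{
	evict_arg_t *evarg = NULL;	/* defined on every path, silencing GCC */
	unsigned long total = 0;

	if (use_taskq)
		evarg = calloc(nthreads, sizeof (*evarg));

	/* ... dispatch work, fill in evarg[i].bytes ... */

	if (evarg != NULL) {
		for (int i = 0; i < nthreads; i++)
			total += evarg[i].bytes;
		free(evarg);
	}
	return (total);
}
```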

@amotin amotin added the "Status: Revision Needed" label (Changes are required for the PR to be accepted) Nov 19, 2024
@github-actions github-actions bot removed the "Status: Revision Needed" label (Changes are required for the PR to be accepted) Nov 21, 2024
@adamdmoss
Contributor

adamdmoss commented Nov 29, 2024

I gave this another spin (not in isolation though FYI - it was along with the parallel dbuf eviction PR) and got a repeat of the previously noted behavior. Seems to not be coincidence.

In stress-testing the intended use-case (chugging through data when the arc is already full) this PR seems benign and probably even beneficial - multiple arc reaper threads are active and busy, and throughput is very healthy.

However, later just puttering around in desktop usage under quite light reads I noticed that a reading app is blocked for several seconds at a time and the experience was quite unpleasant. Lo and behold, one or more arc_evict threads were spinning hard, eventually settling down but then spinning hard again for several seconds under further light read load. This pattern repeated until I rebooted (into a zfs branch minus this PR :) ).

Member

@amotin amotin left a comment


Thanks. We are getting closer. ;)

@alex-stetsenko
Contributor

Thanks. We are getting closer. ;)

Is anything left unresolved?

@amotin
Member

amotin commented Dec 10, 2024

@alex-stetsenko I'll take another look a bit later, but meanwhile it would be good to fix the style issue (some line is too long), squash it all into one commit, and rebase on top of master.

@tsoome
Contributor

tsoome commented May 1, 2025

@tsoome from just the stack traces, I think you've got args to zfs_arc_evict_state() swapped.

Signatures from this PR are:

static uint64_t
arc_evict_state(arc_state_t *state, arc_buf_contents_t type, uint64_t spa, uint64_t bytes);

static uint64_t
arc_flush_state(arc_state_t *state, uint64_t spa, arc_buf_contents_t type, boolean_t retry);

So your call to arc_flush_state is:

     arc_flush_state(state = ffffffffc001d380, spa = 58ac7875e1998fb9, type = 2, retry = FALSE)

(I don't know what type == 2 is; does illumos have another kind?)

Then that makes an evict call:

    arc_evict_state(state, type, spa, ARC_EVICT_ALL);

which should come through as:

    arc_evict_state(state = ffffffffc001d380, type = 2, spa = 58ac7875e1998fb9, bytes = ffffffffffffffff);

but you seem to have:

    arc_evict_state(state = ffffffffc001d380, type = 58ac7875e1998fb9, spa = ffffffffffffffff, bytes = 2);

i.e., type and spa swapped. I'm guessing that leads to evict_arg_t being filled out wrong; spa = 100000000 certainly looks like an alignment quirk.

If that's true, I can totally see how that happened: I spent a little while looking at those two signatures feeling like something was wrong, but not seeing it until I put them side-by-side.

Ahh, looking back, it seems the signature was changed in a8d83e2, so I bet illumos has the previous arg order. Probably just swap them, and it should come to life?

Yes, I have found too that I had an older version. I still need to double-check and get testing done. :) Thanks for looking at it too! :)

@robn robn force-pushed the parallel_arc_evict branch 2 times, most recently from 7555d5e to c6f7271 on May 1, 2025 10:44
@robn
Member

robn commented May 1, 2025

This one has fallen to me to get over the line. Last push(es):

  • rebase to master
  • various boilerplate updates (copyright notices, dates, etc)
  • cleanups, consistency, some comments as I read through it and figured out what's up

The second commit is all new. I was trying to document the tunables in response to @tssge's comment above, and couldn't really figure out how to explain it clearly, so I instead tried to simplify their relationship to balance the operator's intent against the number of unused threads while still giving room to grow. Dunno how I did, let me know.

(It really would be so much easier if we could just tell a taskq to change its number of threads; something in between fixed and dynamic. Probably wouldn't even be that hard).

Anyway, take a look, and let's land this beast.

@robn robn force-pushed the parallel_arc_evict branch 3 times, most recently from 729215e to 3c1b3f7 on May 2, 2025 02:09
@robn robn force-pushed the parallel_arc_evict branch from 3c1b3f7 to 9f650a0 on May 6, 2025 23:53
@robn
Member

robn commented May 6, 2025

Ok, I'm giving up on the thread/threads_max split for now. There just isn't an obvious good way to balance these for all imagined use cases, especially with the taskq control tools we have available, and I don't even know if we have an actual use case to target.

So, the last push removes zfs_arc_evict_threads_max entirely, leaving just zfs_arc_evict_threads. This is now settable at module load time only. The default of 0 computes it as before, and I expect that's what is going to work for most people. We can make changes later if it comes up.

The actual evict code itself hasn't changed since my last pass. The only difference is in the setup: we now preallocate the arg array at module load, since we know it won't change in size. This follows the pattern used with the markers array - if we're on the proper evict thread, use the preallocated set, and if not, allocate our own.
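A rough sketch of that preallocate-or-allocate-locally pattern, written as a small userspace program with invented names (the real code uses the kernel allocator and the taskq machinery):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
	uint64_t bytes_evicted;
} evict_arg_t;

/* Preallocated once at "module load" time for the regular evict thread. */
static evict_arg_t *arc_evict_arg;
static int arc_evict_nthreads = 4;

static uint64_t
evict_with_args(bool on_evict_thread)
{
	/*
	 * The dedicated evict thread reuses the preallocated array;
	 * any other caller allocates (and later frees) its own.
	 */
	evict_arg_t *evarg = on_evict_thread ? arc_evict_arg :
	    calloc(arc_evict_nthreads, sizeof (*evarg));
	uint64_t total = 0;

	/* ... hand one entry to each worker, then accumulate ... */
	for (int i = 0; i < arc_evict_nthreads; i++)
		total += evarg[i].bytes_evicted;

	if (!on_evict_thread)
		free(evarg);
	return (total);
}

int
main(void)
{
	arc_evict_arg = calloc(arc_evict_nthreads, sizeof (*arc_evict_arg));
	uint64_t total = evict_with_args(true);

	free(arc_evict_arg);
	return (total == 0 ? 0 : 1);
}
```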

I think that's it!

On systems with enormous amounts of memory, the single arc_evict thread
can become a bottleneck if reads and writes are stuck behind it, waiting
for old data to be evicted before new data can take its place.

This commit adds support for evicting from multiple ARC lists in
parallel, by farming the evict work out to some number of threads and
then accumulating their results.

A new tuneable, zfs_arc_evict_threads, sets the number of threads. By
default, it will scale based on the number of CPUs.

Sponsored-by: Expensify, Inc.
Sponsored-by: Klara, Inc.
Co-authored-by: Allan Jude <[email protected]>
Co-authored-by: Mateusz Piotrowski <[email protected]>
Co-authored-by: Alexander Stetsenko <[email protected]>
Signed-off-by: Allan Jude <[email protected]>
Signed-off-by: Mateusz Piotrowski <[email protected]>
Signed-off-by: Alexander Stetsenko <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
@robn robn force-pushed the parallel_arc_evict branch from 9f650a0 to ffb797e on May 7, 2025 00:49
Member

@amotin amotin left a comment


Thank you!

@amotin amotin requested a review from tssge May 7, 2025 00:59
@amotin
Member

amotin commented May 8, 2025

Could we have some more eyes on this to finally push it through?

@allanjude allanjude requested review from grwilson and removed request for tssge May 9, 2025 00:52
Contributor

@youzhongyang youzhongyang left a comment


The change looks good to me. I applied this PR on our 2.2 branch and tested it with an in-house workload for a few days; nothing to flag.

@tonyhutter tonyhutter added the "Status: Accepted" label (Ready to integrate; reviewed, tested) and removed the "Status: Code Review Needed" label (Ready for review and testing) May 13, 2025
@amotin amotin merged commit b6916f9 into openzfs:master May 14, 2025
23 of 24 checks passed
ixhamza pushed a commit to truenas/zfs that referenced this pull request May 14, 2025
On systems with enormous amounts of memory, the single arc_evict thread
can become a bottleneck if reads and writes are stuck behind it, waiting
for old data to be evicted before new data can take its place.

This commit adds support for evicting from multiple ARC lists in
parallel, by farming the evict work out to some number of threads and
then accumulating their results.

A new tuneable, zfs_arc_evict_threads, sets the number of threads. By
default, it will scale based on the number of CPUs.

Sponsored-by: Expensify, Inc.
Sponsored-by: Klara, Inc.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Youzhong Yang <[email protected]>
Signed-off-by: Allan Jude <[email protected]>
Signed-off-by: Mateusz Piotrowski <[email protected]>
Signed-off-by: Alexander Stetsenko <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Co-authored-by: Rob Norris <[email protected]>
Co-authored-by: Mateusz Piotrowski <[email protected]>
Co-authored-by: Alexander Stetsenko <[email protected]>
Closes openzfs#16486
@satmandu satmandu mentioned this pull request May 26, 2025
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jun 13, 2025
behlendorf pushed a commit that referenced this pull request Jun 17, 2025
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jun 17, 2025
@stuartthebruce

I installed this via the zfs-2.3.3 GA release on a few large-memory NFS servers (384 GB to 1 TB of RAM). While the parallel kernel threads are occasionally used, it appears that one thread still dominates even when it is nearly 100% busy. Is there anything I can do to trigger more parallelism under load?

[root@zfs9 ~]# uname -a
Linux zfs9 4.18.0-553.56.1.el8_10.x86_64 #1 SMP Tue Jun 10 17:00:45 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

[root@zfs9 ~]# cat /sys/module/zfs/version 
2.3.3-1

[root@zfs9 ~]# top

top - 07:57:44 up 1 day, 16:13,  2 users,  load average: 2049.98, 1991.14, 1946.32
Tasks: 3235 total,   6 running, 3229 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.1 us,  7.4 sy,  0.0 ni, 89.4 id,  2.8 wa,  0.1 hi,  0.2 si,  0.0 st
MiB Mem : 385086.8 total,  20213.4 free, 262595.1 used, 102278.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 116220.8 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                            
   3254 root      20   0       0      0      0 R 100.0   0.0 498:39.68 arc_evict                          
   3244 root      20   0       0      0      0 R  57.9   0.0 287:27.41 arc_prune                          
1209011 root      20   0       0      0      0 R  57.9   0.0   1:27.65 arc_prune                          
1210543 root      20   0       0      0      0 S  57.9   0.0   0:29.53 arc_prune                          
1211522 root      20   0       0      0      0 R  57.9   0.0   0:08.02 arc_prune                          
1211645 root      20   0   69060   8012   3764 R  21.1   0.0   0:00.07 top                                
   6693 root       0 -20       0      0      0 S   5.3   0.0 144:28.39 z_rd_int_0                         
   6694 root       0 -20       0      0      0 S   5.3   0.0 144:26.18 z_rd_int_1              
[root@zfs9 ~]# ps -ef | grep arc_evict
root        3245       2  0 Jun19 ?        00:07:55 [arc_evict]
root        3246       2  0 Jun19 ?        00:07:54 [arc_evict]
root        3247       2  0 Jun19 ?        00:07:54 [arc_evict]
root        3248       2  0 Jun19 ?        00:07:55 [arc_evict]
root        3249       2  0 Jun19 ?        00:07:55 [arc_evict]
root        3250       2  0 Jun19 ?        00:07:55 [arc_evict]
root        3251       2  0 Jun19 ?        00:07:55 [arc_evict]
root        3252       2  0 Jun19 ?        00:07:55 [arc_evict]
root        3254       2 22 Jun19 ?        09:28:35 [arc_evict]

@amotin
Member

amotin commented Jun 23, 2025

@stuartthebruce Single thread is used when the requested eviction amount makes no sense to split between multiple threads. It may be your kernel version requests too little at a time.
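One plausible shape of that decision, purely for illustration: if the request is below some per-thread minimum, keep it on a single thread. The threshold value and names below are assumptions, not the actual OpenZFS logic:

```c
#include <stdint.h>

/*
 * Illustrative only: below some per-thread minimum, the bookkeeping costs
 * more than parallel eviction saves, so a small request stays on one thread.
 */
#define MIN_EVICT_PER_THREAD	(1ULL << 20)	/* assumed 1 MiB per worker */

int
evict_thread_count(uint64_t bytes_to_evict, int max_threads)
{
	uint64_t want = bytes_to_evict / MIN_EVICT_PER_THREAD;

	if (want <= 1)
		return (1);	/* too small to be worth splitting */
	return (want < (uint64_t)max_threads ? (int)want : max_threads);
}
```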

@stuartthebruce

@stuartthebruce Single thread is used when the requested eviction amount makes no sense to split between multiple threads. It may be your kernel version requests too little at a time.

How about when there are a lot of small eviction requests? Is there anything useful I can profile on my systems to perhaps help improve that? Note: if it makes a difference, I am seeing this on both the EL8.10 kernel (4.18.0-553.56.1.el8_10.x86_64) and the EL9.6 kernel (5.14.0-570.22.1.el9_6.x86_64).

@amotin
Member

amotin commented Jun 23, 2025

@stuartthebruce Requests from the kernel come in via arc_shrinker_scan(). You may trace it to see what is going on there. I don't have any specific data for different versions, but over the last couple of years we have seen the kernel both not requesting enough and requesting half of the ARC at once, so even though I hope for the best, I don't expect too much sanity there.

As an alternative pressure source, ZFS itself can evict old data when there is new active I/O, and in that case we should have sufficient hysteresis in the form of zfs_arc_overflow_shift, zfs_arc_shrink_shift, etc.
