optimize locking checks in mempool allocator #11126


Merged

merged 1 commit into openzfs:master on Nov 2, 2020

Conversation

BrainSlayer
Contributor

@BrainSlayer BrainSlayer commented Oct 29, 2020

Avoid checking the whole array of objects each time by removing the self-organized
memory reaping; this can be managed by the global memory reap callback, which is
called every 60 seconds. This significantly reduces the number of locking operations.
In addition, the memory release function has been optimized to take locks only where
really needed.

Signed-off-by: Sebastian Gottschall [email protected]

Motivation and Context

Description

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@BrainSlayer
Contributor Author

just a draft right now for checking

@PrivatePuffin
Contributor

Nice work @BrainSlayer! Looking forward to the benefits! :)

@mjguzik
Contributor

mjguzik commented Oct 29, 2020

I think this is a nice step forward. Regardless, I think it should wait for #11129 and then get rebased.

@PrivatePuffin
Contributor

I think this is a nice step forward. Regardless, I think it should wait for #11129 and then get rebased.

Considering the state of this draft and #11129, I'm pretty sure that issue will solve itself ;)

@BrainSlayer BrainSlayer marked this pull request as ready for review October 30, 2020 08:29
@mjguzik
Contributor

mjguzik commented Oct 31, 2020

I'm not going to argue anymore about this patch. I think the commits combined provide a marked improvement, in particular the one I was looking for (overhead avoided if zstd is not in use). However, I presume someone else will have to accept this PR.

@BrainSlayer
Contributor Author

@mjguzik why someone else?

@mjguzik
Contributor

mjguzik commented Oct 31, 2020

@behlendorf I think the patch (once it finalizes) should land in 2.0. The zfs_zstd_cache_reap_now routine is called from arc_reap_cb_check regardless of whether zstd is in use or not. Without the patch the zstd shrinker walks the entire array, trylocking each element. Since the array is sized to 4 * ncpus this amounts to significant single-threaded overhead on bigger machines (for example my test box has 104 threads, giving 416 locks to take). With the patch the behavior is short-circuited if there are no buffers to begin with, and some of the locking is elided even if the count is != 0.
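To make the shape of that change concrete, here is a minimal userland sketch (hypothetical names and pthreads instead of the kernel mutex API, so not the actual OpenZFS code): the reap path tracks how many buffers are cached, so the walk, and most of the trylocking, can be skipped when the pool is empty.

    #include <pthread.h>
    #include <stdlib.h>

    struct pool_slot {
        pthread_mutex_t lock;
        void           *mem;     /* cached buffer, NULL when the slot is empty */
    };

    static struct pool_slot *pool;      /* sized 4 * ncpus, like the allocator's array */
    static unsigned pool_count;         /* number of slots */
    static unsigned pool_buffers;       /* slots currently holding a buffer */

    static void pool_reap(void)
    {
        unsigned i;

        /* Short-circuit: if zstd never cached anything, skip all 4 * ncpus locks. */
        if (pool_buffers == 0)
            return;

        for (i = 0; i < pool_count; i++) {
            if (pool[i].mem == NULL)            /* elide the trylock for empty slots */
                continue;
            if (pthread_mutex_trylock(&pool[i].lock) == 0) {
                if (pool[i].mem != NULL) {      /* re-check under the lock */
                    free(pool[i].mem);
                    pool[i].mem = NULL;
                    pool_buffers--;             /* the real code would keep this counter atomic */
                }
                pthread_mutex_unlock(&pool[i].lock);
            }
        }
    }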

@mjguzik
Contributor

mjguzik commented Oct 31, 2020

@mjguzik why someone else?

I did not write any of the allocator modulo the cosmetic patch. I presume a buy-in from other people who worked on this is necessary.

As for myself, I have more comments, but given the state reached they would be only getting in the way of closing this PR. I'll post them later.

@behlendorf behlendorf added the Status: Code Review Needed (Ready for review and testing) label Oct 31, 2020
@PrivatePuffin
Contributor

I did not write any of the allocator modulo the cosmetic patch. I presume a buy-in from other people who worked on this is necessary.

That's just @c0d3z3r0 (who has decided not to work on ZFS anymore) and @allanjude, for the most part...
And me, for some testing, docs and performance work...

My two cents: this patch does solve the issue as reported by @mjguzik and isn't too involved.
I agree with @mjguzik that we should all collaborate on improving this PR and that we don't have to PR all patches and tweaks at once.

I also agree that this should indeed be backported into 2.0.

Contributor

@PrivatePuffin PrivatePuffin left a comment

Just needs a little squash, but does what it is supposed to do.

@BrainSlayer
Contributor Author

@Ornias1993 so you request a squash of all 3 commits?

@PrivatePuffin
Contributor

PrivatePuffin commented Nov 1, 2020

@BrainSlayer The "one commit" "rule" still applies for small PRs like this one, so yes.
And take a look at those tests ofc, but I don't think those are relevant.

But LGTM

avoid checking the whole array of objects each time by removing the self
organized memory reaping. this can be managed by the global memory reap
callback which is called every 60 seconds. this will reduce the use of
locking operations significantly.

Signed-off-by: Sebastian Gottschall <[email protected]>
@BrainSlayer
Contributor Author

done

Contributor

@behlendorf behlendorf left a comment

I'll make sure this gets backported to 2.0.

@behlendorf behlendorf added the Status: Accepted (Ready to integrate: reviewed, tested) label and removed the Status: Code Review Needed (Ready for review and testing) label Nov 2, 2020
@behlendorf behlendorf merged commit 7eefaf0 into openzfs:master Nov 2, 2020
@PrivatePuffin
Contributor

PrivatePuffin commented Nov 2, 2020

@behlendorf Thanks!

@mjguzik and @BrainSlayer
Now that we have this out of the way, concerning the other proposed improvements:
I've been thinking about what @mjguzik said about not wanting to work on things that aren't going to be accepted (and the catch-22 that creates), so I've come up with some criteria that I think everyone with previous work on zstd-on-zfs agrees on.

If the following criteria are met, merging an enhancement of the allocator would be most easily agreed on as "acceptable":

  • Not limiting the current capabilities (supported ZSTD levels, platforms, options, logging)
  • Not negatively impacting current performance by more than a few percent (both compression ratio AND speeds)
  • Not too many platform-dependent tweaks (we've seen with Allan's proposal how much discussion that creates)
  • Passing all tests and checklists (obviously)

As long as improvements don't cross any of those lines, I personally think they would have a high chance of getting accepted/agreed upon.

I hope that clears up some of the endless discussion about your proposed changes @mjguzik, and gives you some idea of what you can expect in terms of responses :)

@BrainSlayer
Contributor Author

BrainSlayer commented Nov 3, 2020

@Ornias1993 I understand and will follow that guidance. In general I'm waiting now just for input/ideas for new enhancements. I'm not aware right now of any other issue left, except for maybe making the memory reaping polling time configurable or things like that.

@mjguzik
Contributor

mjguzik commented Nov 3, 2020

Maybe my other reply came off as a little dramatic, so let me restate my key point here:

For any compression level + blocksize there seems to be a max allocation associated with it, which is also trivial to provoke for every CPU. For the default it seems to be 1.2MB. I did a quick test with recordsize=1M and compression=zstd-19, which gave me 18MB buffers. On a 104-thread box that's 1.8GB RAM just to facilitate zstd. You can easily provoke all the allocation by running a few 'yes > somefile' processes. I don't think memory usage of this sort is sustainable in real deployments and the most viable option I see here is to reduce supported compression levels. It may be some tweaks can be made, but achievable memory usage has to go down significantly. The 1.2MB buffer per CPU may be tolerable. Short of this, I don't think any caching policy changes will be of significance.
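As a back-of-the-envelope check of those numbers (the buffer sizes are the ones quoted above; the one-buffer-per-CPU assumption is taken from the comment, not from the code):

    #include <stdio.h>

    int main(void)
    {
        unsigned ncpus = 104;        /* the test box mentioned above */
        double mb_default = 1.2;     /* ~1.2 MB buffer at the default level */
        double mb_zstd19 = 18.0;     /* ~18 MB buffer at zstd-19 with recordsize=1M */

        printf("default: ~%.0f MB\n", ncpus * mb_default);          /* ~125 MB */
        printf("zstd-19: ~%.1f GB\n", ncpus * mb_zstd19 / 1024.0);  /* ~1.8 GB */
        return 0;
    }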

That said, I'm not involved with zstd and I'm not willing to spend time debating the assessment made above. If you disagree with it, I'm just going to back off from zstd altogether.

@PrivatePuffin
Contributor

compression=zstd-19

I think we should be clearer that everything beyond ZSTD-9 is "for future use" and not actually expected to be used in any sane capacity 👍

I don't think memory usage of this sort is sustainable in real deployments

You don't have to, it's not supposed to be used by many.

The most viable option I see here is to reduce supported compression levels

If someone wants to run idiotic levels of something, we shouldn't prevent it just because we think it's weird. There is a lot worse you can do in the ZFS CLI.

It may be some tweaks can be made, but achievable memory usage has to go down significantly

You keep focussing on zstd-19, which is NOT even close to the default setting for a reason. Please focus on anything < ZSTD-9 in the discussion... We know ZSTD-19 is insane and added it anyway in case we have more of a use (and tweaks) for it in the future.

If you disagree with it, I'm just going to back off from zstd altogether.

You don't have to run ZSTD-19; we have advised against it quite frequently.
IMHO, it's up to the downstreams and users running ZSTD natively to determine up to what level of ZSTD they want to support or use, not for us to decide how insane someone wants to go.

To give you two examples:

  1. Myself, an end user:
    Even with a 104-thread system, I would NOT run ZSTD-19, because the gain/cost ratio is off in terms of ratio/CPU. But if I owned a 104-thread system, I doubt I would bother much about one or two GB more or less.

  2. TrueNAS, a downstream implementation of ZFS including ZSTD:
    I implemented just 3, 5 and 7 into TrueNAS, because anything higher didn't have much of a gain and to prevent users selecting insane levels.

Anyhow @mjguzik, just to TL;DR:
As I understand it, you are saying insanely high levels should be removed because you don't want to use them because they are insane?


@BrainSlayer

In general I'm waiting now just for input/ideas for new enhancements. I'm not aware right now of any other issue left

How about making the multiplier in pool_count into a tunable? I was thinking about that one, and for pure-storage servers or small home servers *4 is fine...
However: hyperconverged systems might have many cores, but only a fraction of them would be pressured by ZFS/ZSTD (because most would be in use by the actual services being run, not zstd compression).

@mjguzik
Contributor

mjguzik commented Nov 3, 2020

My experience is that if a setting is easily accessible, the assumption is it basically works and there will be people using it, in this case requiring the high RAM usage to be fine. That said, I don't think we have common ground here so I'm going to back off.

@BrainSlayer
Contributor Author

BrainSlayer commented Nov 3, 2020

The level is a user decision. The memory consumption for certain levels might be important for documentation. The default level has been chosen since it provides results comparable to gzip. Levels like 19 are more like xz and very slow, so it's not just about memory consumption; it's also about performance. But even if 19 has high memory consumption, zstd will handle it and skips compression if allocation fails.

Making the multiplier configurable is pointless. I chose 4 since ZFS itself creates 4 times as many I/O threads as available CPU cores/threads. So on a typical i7 with hyperthreading enabled, 32 I/O threads will run, so these slots are really required. If you lower the available mempool you will run into locking barriers during allocation under heavy load, so decreasing it will slow down zstd; increasing it does not have an impact at all and just creates bloat.
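For illustration only (a hypothetical userland sketch, names are not from the ZFS code), the sizing argument above boils down to:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN); /* e.g. 8 on a hyperthreaded i7 */
        long io_threads = ncpus * 4;                /* per the comment: ZFS runs ~4 I/O threads per CPU thread */
        long pool_slots = io_threads;               /* one mempool slot per potential concurrent allocation */

        printf("cpu threads: %ld, io threads: %ld, pool slots: %ld\n",
               ncpus, io_threads, pool_slots);
        return 0;
    }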

@PrivatePuffin
Contributor

My experience is that if a setting is easily accessible, the assumption is it basically works

Then your experience is wrong. There are A LOT of ZFS settings that definitely don't "basically work" without significant downsides.

there will be people using it

I expect there will be, and they should be aware of the consequences. That's why @BrainSlayer and I agree on updating the documentation to reflect this.

That said, I don't think we have common ground here so I'm going to back off.

Sorry to bother you; I was under the (wrong) impression that you had more suggestions than simply lowering the number of supported ZSTD levels. That's why I tried to find out what you wanted changed.

Making the multiplier configurable is pointless. I chose 4 since ZFS itself creates 4 times as many I/O threads as available CPU cores/threads. So on a typical i7 with hyperthreading enabled, 32 I/O threads will run, so these slots are really required. If you lower the available mempool you will run into locking barriers during allocation under heavy load, so decreasing it will slow down zstd; increasing it does not have an impact at all and just creates bloat.

Ahh shoot, forgot about the IO threads, yes you're right...

behlendorf pushed a commit that referenced this pull request Nov 3, 2020
Avoid checking the whole array of objects each time by removing the self
organized memory reaping. this can be managed by the global memory reap
callback which is called every 60 seconds. this will reduce the use of
locking operations significantly.

Reviewed-by: Kjeld Schouten <[email protected]>
Reviewed-by: Mateusz Guzik <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Sebastian Gottschall <[email protected]>
Closes #11126
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021
Labels
Status: Accepted (Ready to integrate: reviewed, tested)

4 participants