Split zstd into OS dependent files (use FreeBSD UMA) #10975
Conversation
@c0d3z3r0 and @BrainSlayer might want to have a look.
It's just an opinion, but I don't think this will create a performance benefit over the generic solution we made. What's important here is a side-by-side comparison between both solutions: the mempool running on FreeBSD vs. the new code on FreeBSD, using high and low compression levels. Even if the UMA allocator results are identical, it will create higher code maintenance effort in the future. The mempool solution gives us more control over the management, whereas UMA takes care of how it works on its own. So please provide a side-by-side comparison so we can check whether it makes sense at all.
@BrainSlayer That being said: I would like to have some performance comparison tests of this before merging. I think it would be fine to just run the compression test script we used previously on both and dump the results here (no need to make it look fancy).
Indeed, and as you can see, we actually re-use the kmem_caches that ZFS already maintains for allocations up to 16 MB (the max record size). I only added 32 MB, 64 MB, and the zstd max estimate, plus a dedicated bucket for the decompression context that is exactly the correct size.
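To make the routing concrete, here is a minimal sketch of the dispatch described above. It is an illustration only, assuming ZFS kernel headers: the function name and the `zstd_large_cache` identifier are made up for this example and are not the names used in the patch.

```c
#include <sys/kmem.h>
#include <sys/spa.h>	/* SPA_MAXBLOCKSIZE (the 16 MB max record size) */
#include <sys/zio.h>	/* zio_data_buf_alloc() */

/* Hypothetical cache for the oversized compression workspaces. */
static kmem_cache_t *zstd_large_cache;

/*
 * Illustrative routing: requests up to the maximum record size reuse the
 * zio_data_buf kmem caches ZFS already maintains; anything larger comes
 * from a dedicated cache created for the zstd workspaces.
 */
static void *
zstd_alloc_sketch(size_t size)
{
	if (size <= SPA_MAXBLOCKSIZE)
		return (zio_data_buf_alloc(size));

	return (kmem_cache_alloc(zstd_large_cache, KM_NOSLEEP));
}
```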
I am working on these, but I wanted to share the code in the meantime.
In this case, I actually feel it might make sense to do it the other way around. The likely platforms to be added (illumos and OS X) will have SPL implementations of kmem_cache_alloc() etc., so they will likely use the identical code I added for FreeBSD here. One main advantage of using the OS allocator is that it takes care of cleanup, idle detection, reaping, etc., so you don't have the issue of #10969.
Nice work, I've no issue with your allocator by-design (at this point at least) :)
No pressure, thanks for sharing anyway. I know how much work it is to do those tests!
That's a very good point.
I agree, if we can use it we should. It's preferred for a multitude of reasons; #10969 is just one of them, although I don't want to say problems like #10969 would be fully prevented, they would just be decreased in frequency. @BrainSlayer I think we shouldn't forget that the reason we built a custom allocator was that some distros went completely haywire when using the normal system allocator, and it was mostly a placeholder until someone would find a way to make the system allocator work without downsides (and with at least the same performance). Because the performance was quite good, we didn't expect to move to the system allocator any time soon, but we shouldn't forget how we made the decision to go custom. That all being said: let's say we introduce OS X and it could use 95% of the allocator code of FreeBSD, are we going to duplicate it again? Add OS X changes into the FreeBSD allocator? Instead of splitting per OS, isn't it more sane to split it per allocator?
@Ornias1993 I'm fine with the OS allocator if it provides the same performance. But what if not? That's the point.
I've taken the time to read through it.
Code-wise the split is totally fine: there is no duplicate code, as promised.
I would indeed like to see it not be based around the OS, but just create different files (to be included) for the different allocators under /module/zstd.
So:
/module/zstd/zfs_zstd_alloc_uma.c
/module/zstd/zfs_zstd_alloc_pool.c
That also makes sure the ZSTD code stays somewhat contained instead of spread all over the source tree, and possible future OSes can either use one of those two allocators or create a new one without duplicating code (with a bit of luck, that is).
TLDR:
- Move the allocator files (back) to /module/zstd but keep them separate from zfs_zstd.c
- Add a short benchmark run
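For illustration, one way the suggested split could be wired up is a build-time include from zfs_zstd.c. This is only a sketch under the assumption of an include-based split; the file names are taken from the suggestion above, and this is not how the patch is currently structured.

```c
/*
 * Hypothetical wiring inside module/zstd/zfs_zstd.c: pick the allocator
 * implementation at build time while keeping both variants contained
 * under module/zstd.  The file names follow the suggestion above.
 */
#if defined(__FreeBSD__)
#include "zfs_zstd_alloc_uma.c"	/* UMA / kmem_cache backed allocator */
#else
#include "zfs_zstd_alloc_pool.c"	/* custom mempool allocator */
#endif
```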
@markjdb could you please comment on the issues raised here?
MarkJ is a virtual memory expert and can provide more insight on the advantages of using UMA, and may be able to shed some light on how we can develop a test to show the advantages more clearly.
Ah, thanks for that info Allan, of course all insight is welcome. I noticed your absolute ton of continued performance tests on the mailing list (awesome work btw!), did you happen to have time to do some of them with the UMA allocator?
UMA is great, especially for large systems with many memory channels and NUMA zones. The improvement on those systems can be an order of magnitude, with less appreciable gains on more commodity hardware, though the Ryzen series of systems would possibly benefit on workloads where memory latency is a concern.
I'd say the main advantage of using UMA here is that the allocator automatically gets tied into the system's memory reclamation policy. We use centralized logic to decide when and how much to reclaim from UMA caches. The custom mempool allocator caches large objects (up to 64 MB per slot, and there are 2*4*ncpu slots), and the core kernel has no way to request a flush of those caches; we have to wait up to 60 s for the reaper even if the system is under severe memory pressure.

I'm curious about the original motivation for this custom pool allocator. I see "some distros completely went haywire when using the normal system allocator", but no info in the commit log message.

As for performance, it's hard to say since I don't know what "typical" zstd allocation patterns look like. For example, how many back-to-back allocations does zstd perform in a given context? Does a context use a mix of allocation sizes? Is there a mix of allocation sizes globally? Looking at the code, I can see a few general advantages of UMA.
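As a concrete illustration of what "tied into the system's reclamation policy" means here: a dedicated kmem cache (backed by a UMA zone on FreeBSD) needs no zstd-specific reaper at all, because the kernel drains idle per-CPU buckets itself under memory pressure. This is only a sketch; the cache name and the use of ZSTD_estimateDCtxSize() for sizing are illustrative assumptions, not necessarily what the patch does.

```c
#include <sys/kmem.h>
#include "zstd.h"	/* ZSTD_estimateDCtxSize() */

/* Illustrative name; not necessarily the identifier used in the patch. */
static kmem_cache_t *zstd_dctx_cache;

static void
zstd_dctx_cache_init(void)
{
	/*
	 * On FreeBSD this kmem cache is backed by a UMA zone, so the
	 * kernel's own low-memory handling can drain idle per-CPU
	 * buckets itself -- no zstd-specific reaper or timeout needed.
	 */
	zstd_dctx_cache = kmem_cache_create("zstd_dctx",
	    ZSTD_estimateDCtxSize(), 0, NULL, NULL, NULL, NULL, NULL, 0);
}
```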
As for testing, I don't know much about how ZFS decides when to compress a file. For instance, if a file is compressed in the ARC and I read it, does the decompressed file end up cached in RAM instead? If there is some straightforward way to trigger decompression of a large file in a loop (e.g.,
I think the key missing piece of information here is real-world memory usage before and after the patch. More importantly, the huge zones mentioned make me wonder if their existence with per-CPU buckets is justified. While given enough RAM(tm) such caching provides the fastest allocation/free, it is quite likely not feasible in real-world workloads. Perhaps the required concurrency levels can be sustained easily with a handful of caches per NUMA node; to the best of my knowledge the FreeBSD kernel allocator does not provide such a feature (perhaps @markjdb can correct me here). At least the FreeBSD allocator provides several statistics for zones which may give enough insight into how much is consumed and how much we could get away with. If not, it should not be fundamentally hard to hack it up for testing purposes (for example, just add a trivial atomic counter which tracks maximum usage).

It was mentioned that the plain system allocator was causing trouble. I don't know what that usage looked like at the time; I suspect it just passed everything to malloc every time. For the multipage sizes which get passed here, both kernels will end up allocating pages directly -- this is slow and you are not expected to do it often. That said, in my opinion the evaluation mentioned above should be considered a blocker for the patch.

As for the current state of zstd_mempool_alloc, I think it needs replacement on Linux as well. I understand it was created to combat some problems with memory allocation, and it tries to balance total memory use against allocation speed. Even if it is here to stay for the foreseeable future, there are significant improvements which can be done. The code allocates a ZSTD_POOL_MAX array of zstd_pool items.
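The quoted snippet from zfs_zstd.c did not survive formatting here; roughly, the pattern being described looks like the following simplified sketch (field and variable names approximate the shipped code, but this is not a verbatim quote of the source):

```c
#include <sys/kmem.h>
#include <sys/mutex.h>
#include <sys/time.h>

/*
 * Sketch of the pool layout being discussed.  Each slot holds a buffer,
 * its size, a per-slot lock that doubles as the "in use" marker, and an
 * expiration timestamp.  The slots sit back to back in one array with
 * no cache-line padding between them.
 */
struct zstd_pool {
	void		*mem;
	size_t		size;
	kmutex_t	barrier;
	hrtime_t	timeout;
};

static struct zstd_pool *zstd_mempool;
static unsigned int pool_count;

static void
zstd_mempool_init_sketch(void)
{
	/* 4 slots per CPU, uncapped: a 104-thread box gets 416 per pool. */
	pool_count = boot_ncpus * 4;
	zstd_mempool = kmem_zalloc(pool_count * sizeof (struct zstd_pool),
	    KM_SLEEP);
	for (unsigned int i = 0; i < pool_count; i++)
		mutex_init(&zstd_mempool[i].barrier, NULL,
		    MUTEX_DEFAULT, NULL);
}
```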
Given the way this is allocated (the slots sit one after another with no padding), this layout induces false sharing.
The slot count is not capped in any manner and already poses a problem: my test system has 104 threads and pool_count got to 416, which as you will see below is quite suboptimal.
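The quoted allocation loop also did not make it into this thread; continuing the sketch above, an approximate version of the walk being criticised (again, not the verbatim zfs_zstd.c source) looks like this:

```c
/*
 * Approximate sketch: every request scans the whole slot array from
 * index 0, trylocking each slot, and hands out the first unlocked
 * buffer that is at least as big as the request, keeping the lock
 * held as the "in use" marker.
 */
static void *
zstd_mempool_alloc_sketch(size_t size)
{
	for (unsigned int i = 0; i < pool_count; i++) {
		struct zstd_pool *pool = &zstd_mempool[i];

		/* At least one atomic per slot, every time. */
		if (!mutex_tryenter(&pool->barrier))
			continue;

		/* First fit: any buffer big enough is handed out, even
		 * if it is wildly oversized for the request. */
		if (pool->mem != NULL && size <= pool->size) {
			pool->timeout = gethrtime() + SEC2NSEC(120);
			return (pool->mem);	/* lock stays held */
		}
		mutex_exit(&pool->barrier);
	}
	/* No fit found: allocate a fresh buffer instead (omitted here). */
	return (NULL);
}
```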
So this code walks the entire array from 0 every single time. On my box that's 416 lock acquires and releases, which is very expensive single-threaded. Multithreaded it looks much worse, as with enough concurrency all the CPUs keep trying to acquire these locks in lockstep, which floors performance. Note I don't know what kind of concurrency level is realistic to trigger here given all the other locks which have to be taken to get to this code. The general point, though, is that the current scheme is counter-productive -- it makes allocation slower as the number of CPUs grows, and in the face of contention it triggers a very expensive series of cacheline ping-pongs. This would be a problem (albeit to a lesser degree) even at much smaller scales (say the aforementioned 16).

That said, it is guaranteed the allocator would perform faster single-threaded if it just had one lock and an array of buffers to return. It is very likely it would be significantly faster multithreaded at modest concurrency even with said global lock. Note that the lock profiling tooling of your choosing may not show the failing trylocks above; it is best to look at a CPU profile instead (e.g., perf top on Linux). Moreover, given the 2 minute expiration policy, it is really wasteful to keep evaluating every buffer every time you allocate. There are a number of periodic tasks performed by ZFS as it is; such work can be added there instead. The code does not return a decent fit either, just the first buffer which is big enough to hold the request.

All in all, the SLAB/ZONE memory use evaluation I mentioned above may be time-consuming to do right, and even then it may turn out that zstd is a bad fit there. So my suggestion for the time being is to reimplement the zstd allocator with one global lock and an array of bounded size holding likely "big enough" buffers. The current notion of returning with a lock held would be removed; instead the spot in the array would get nullified. While this does not scale and will become a problem at some point, it should perform better than the current code.

Just to get example real-world data, I added a dataset with zstd compression as the output directory for a kernel compile and gathered the distribution of requested sizes (rounded down to powers of 2); the key observation is that there are huge gaps in the requested sizes.
Let me just explain again what the motivation is. kmalloc, vmalloc, kvmalloc -- all these wonderful functions, and yes also slab allocation, just take too long. These allocation functions are simply too slow once the allocation size gets big enough (like 16 MB for the higher compression ratios), so the ZFS tests will fail with a timeout at the end. The mempool variant is just lightning fast, and that's the reason. Using a standard system allocator again will bring back the same problem we already had.

For your locking arguments: consider that it's a mutex_trylock, and it will not acquire 416 locks each time. If zstd is under load, a lot of the slots are in use at the same time, so the number of available slots is much smaller; all 416 slots will only be available if zstd is under little or no use. So there is no performance impact, and in the end we also did a lot of benchmarks with that code to find the best solution. Of course I could just use a single lock covering the whole loop, but I don't think there would be any performance impact.

Returning the first matching buffer (even if bigger than requested) is intentional. The goal is to avoid reallocation: reusing a bigger, already allocated but unused buffer is faster than reallocating a new block that fits the size exactly. As for the 2 minute policy, an additional callback for releasing memory has already been implemented. Originally the intention of the code was to be self-organizing without tainting other parts of ZFS too much; I agree this can be improved now that we have that callback.

Replacing the lock handling is not that simple. The lock being held is the marker for a memory object being in use -- it's all about thread safety. I have to ensure that accessing or checking these objects is atomic.
Potential problems of this very sort are precisely why I'm skeptical about the patch proposed in this review. As noted earlier, really big allocations are not expected to happen often. Slight reordering:
I noted this might be the case. Moreover, I found it worrisome that zfs already performs such big allocs using zone/slab. Either way, my primary recommendation here is improving the existing allocator.
In the current code, assuming no contention, each allocation will perform 416 atomics to trylock and 415 atomics to unlock, with the remaining one performed on "free". Atomics are very expensive even when the target line is cache hot, including on modern CPUs, and regardless of contention level the code will always perform 416 atomics to trylock. I presume you don't necessarily have hardware with this many threads, but you can easily simulate the problem: boot with the pool_count value hardcoded to 416 and run your workload of choice. The problem is significantly worse in the face of contention.
See above. There is a fundamental difference between one lock (2 atomics) vs several.
I mentioned this may be problematic if sizes are wildly different, and I posted a histogram of the sizes I got in my simple test; the key insight there was that about half the allocations are ~1 KB and then there is a huge gap to the rest. I extended the code with basic instrumentation to gather requested sizes, failed trylocks, and the difference between the requested size and the buffer used to satisfy it, then re-ran the test and aggregated the results with a dtrace one-liner (note that quantize rounds down to the nearest power of 2).
For failed trylocks: the most common allocation failed 64+ trylocks, while only a small minority did not fail any. More importantly though, comparing the requested sizes against the sizes of the buffers actually handed out:
As you can see, for some of them the difference is off by more than a megabyte. In other words, the current allocator is handing out way oversized buffers, likely resulting in more memory use to satisfy other allocations. The complete allocation list can be found here; the first column is the requested size, the second the size of the selected buffer, and the third how many times the given combination was encountered. The small 1176-byte buffer can be adequately satisfied by the system allocator, and special-casing it would remove the majority of the above problem. Another note is that for all the big buffers the actual allocated size is a multiple of 4096, despite the very odd sizes requested; not rounding requests up passed up on some matching opportunities. In my opinion, distributions from a handful of workloads should be gathered and, based on that, the best multiple-of-page-size round-ups decided. Then the allocator can start handing out possibly oversized buffers, but nothing extravagant, and still have better memory use than now.
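For illustration only (this is an assumption about how the rounding could be done, not code from the patch), rounding each request up to a page multiple before searching the cache would let requests that differ only by a few bytes map onto the same cached buffer size:

```c
#include <sys/sysmacros.h>	/* P2ROUNDUP() */

/*
 * Hypothetical helper: round the requested size up to a multiple of the
 * page size so that near-identical requests reuse the same cached
 * buffers, improving reuse while bounding how oversized a handed-out
 * buffer can be.
 */
static size_t
zstd_round_request(size_t size)
{
	return (P2ROUNDUP(size, (size_t)PAGESIZE));
}
```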
It is simple: should you have an array of buffers or linked lists of buffers, on alloc you take the buffer out of the cache and on free you put it back. Nobody else has to see it in between.
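A minimal sketch of the single-global-lock scheme being suggested (all names are hypothetical; this is not a proposed patch): nullifying the slot marks the buffer as in use, so the consumer never holds a per-buffer lock.

```c
#include <sys/kmem.h>
#include <sys/mutex.h>

#define	ZSTD_CACHE_SLOTS	16	/* bounded, not scaled by CPU count */

/* Hypothetical single-lock buffer cache; the lock is mutex_init()ed at
 * module load (not shown). */
static kmutex_t zstd_cache_lock;
static void *zstd_cache_buf[ZSTD_CACHE_SLOTS];
static size_t zstd_cache_size[ZSTD_CACHE_SLOTS];

static void *
zstd_cache_get(size_t size)
{
	void *buf = NULL;

	mutex_enter(&zstd_cache_lock);
	for (int i = 0; i < ZSTD_CACHE_SLOTS; i++) {
		if (zstd_cache_buf[i] != NULL && size <= zstd_cache_size[i]) {
			/* Take the buffer out; the NULL slot is the
			 * "in use" marker, no lock is held by the caller. */
			buf = zstd_cache_buf[i];
			zstd_cache_buf[i] = NULL;
			break;
		}
	}
	mutex_exit(&zstd_cache_lock);
	return (buf);	/* NULL: caller allocates a fresh buffer */
}

static void
zstd_cache_put(void *buf, size_t size)
{
	mutex_enter(&zstd_cache_lock);
	for (int i = 0; i < ZSTD_CACHE_SLOTS; i++) {
		if (zstd_cache_buf[i] == NULL) {
			zstd_cache_buf[i] = buf;
			zstd_cache_size[i] = size;
			buf = NULL;
			break;
		}
	}
	mutex_exit(&zstd_cache_lock);
	if (buf != NULL)
		kmem_free(buf, size);	/* cache full: release the buffer */
}
```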
Be aware that the current zstd implementation is not very optimised for such single-stream writes (just a side note).
Regardless of what's going on here, can I get some reviews on #11129?
I was wondering if it would be better to just use Using a slightly cleaned up version of this patch on
This is just to make the diff easier to read.

Signed-off-by: Allan Jude <[email protected]>
This allows FreeBSD to use UMA, while Linux uses the custom mempool.
Also split private bits of sys/zstd/zstd.h into sys/zstd/zstd_impl.h.

This new FreeBSD version uses the zio_data_buf caches for all sizes less
than 16 MB, and uses its own dedicated kmem_caches for the decompression
context, and for higher compression levels at larger block sizes that
require the larger compression workspace.

Sponsored-by: The FreeBSD Foundation
Signed-off-by: Allan Jude <[email protected]>
I tried to explain that a 16 MB kmem cache is an incredibly fishy ordeal in the face of memory shortage. Apart from that, the very fact that allocations of that sort are to be supported at all comes at a cost in terms of which solutions are viable. I created a zstd-19 dataset, re-tested the kernel build on it, validated that I do get the big allocations, and measured the memory use reached.
That came to over 400 MB of RAM just to handle in-flight compression, and without any form of per-CPU caching. This can definitely be reduced with better allocator behavior, but it's still going to be way too big for this purpose. In other words, zstd memory usage has to be bounded up front to much smaller levels.

The difficulty here is evaluating memory use vs. compression ratios vs. compression speed across an array of workloads to check which configurations make the most sense. That's a rather involved project and I don't expect it to be completed anytime soon. In the meantime, the bare minimum which should be done is to test compression levels vs. memory usage in at least two workloads. The default seems to be allocating ~1.2 MB buffers, which with better caching may be tolerable. So this would artificially limit which zstd compression levels are configurable to begin with. There is the problem of existing datasets which already have a higher level configured; they will have to support at least reading the data, and compressing with the smaller limit despite the configuration.

As for the allocator, there is a guaranteed significant improvement to be had from the changes outlined earlier (a single global lock, a bounded slot array, page-size round-ups, and handing the small requests back to the system allocator).
None of those points are hard to implement and I can take care of it later, but the buy-in to limit memory use has to be there.
@mjguzik We could at least make the per-core multiplier a tunable, to give users an option to lower it in the case of insane numbers of threads?
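A hypothetical sketch of such a tunable, assuming the ZFS_MODULE_PARAM helper; the parameter name is made up for illustration and nothing like it exists in the patch:

```c
/*
 * Hypothetical tunable: expose the per-CPU slot multiplier so that
 * administrators can shrink the mempool on machines with very high
 * thread counts.  pool_count would then be derived as
 * boot_ncpus * zfs_zstd_pool_slots_per_cpu.
 */
static unsigned int zfs_zstd_pool_slots_per_cpu = 4;

ZFS_MODULE_PARAM(zfs, zfs_, zstd_pool_slots_per_cpu, UINT, ZMOD_RW,
	"Number of zstd mempool slots allocated per CPU");
```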
A basic version of this had been done on low-ish core count systems (32 threads max afaik) and the memory usage was "nothing worrisome"... (as in: nothing that would cause real trouble). Not much testing has been done on very-many-core systems though...
We already do limit it to
I would not agree it's okay to redesign something as basic as the number of supported levels for an algorithm that's already released (even if it's just in an RC).
If we're talking Linux or "all-OS": you sound like these ideas weren't tested. Please read back through the insanely long history of zstd-on-zfs to get a glimpse of the endless attempts, discussions, and distro weirdness that were encountered.
That might be a fair tweak
How about: first fit if it's a good fit, otherwise look for the best fit?
I think everyone here is open to optimisation, but if by "buy-in" you mean "at all costs" (like removing many zstd levels), that's a no for some/most of us, I think.
The ~100 thread box, while above average, is nothing special; bigger systems have been available for years. This really should work at least OK on them without hand-tuning, and it's not hard to get there. The current issues are somewhat self-induced by the current design, see below.
What exact data are we talking about? In particular, was memory use by the cache measured? The ~3 MB buffers which I can provoke on my own are very worrisome, and the 16 MB claimed elsewhere is in my opinion a non-starter for anything but highly specialized deployments, as it will tie up hundreds of megs of RAM.
See above.
Again, see above. The allocs which can be provoked may, in my opinion, be quite hard to sustain in production. Also see below.
It is a fair point that I'm not aware of the history here. However: 1. pushing multi-megabyte requests to vmalloc et al. at a high rate is basically expected to cause trouble, so I'm not surprised it did; 2. if per-CPU caching of these allocations was attempted at the slab layer, I'm also not surprised it failed, as it ties up even more memory (and is why I don't think the patch in this review is a viable way to go). Can you point me at specific attempts?

I find it very unlikely that delegating the transient <2KB requests to kmalloc would result in trouble; if anything it should lessen the total memory use. Note I'm talking about making sure this lands in the 2KB malloc slab (both FreeBSD and Linux have one and both already use it). I can tell you that zfs is already making tons of allocations which fit the 2KB slab and they use it on FreeBSD. Allocations of that sort are definitely also made on Linux and they clearly perform at least adequately, although I did not check whether that's the kmalloc slab or something else (it's definitely not vmalloc). While in principle it is possible that adding these buffers tipped something over, I find it incredibly unlikely.
I'm going to reiterate and expand a little. I was asked to comment on the FreeBSD side of the patch in this PR, and I noted it looks like a non-starter if big sizes are at play. For zstd itself, as a zfs user, I need it to at least not get in the way when it is not employed; this is sorted out with #11126. Ultimately you are free to ignore my feedback, but at the same time I'm not going to spend time on a project where that happens.

The 16 MB-induced worry is not just some minor concern; suitability for production use hinges on behavior in corner cases. For the default setup the allocation request is at 1.2 MB and it is trivial to provoke it on all CPUs. For the 104-thread box, that means over 100 MB of RAM going directly to transient zstd buffers. If 16 MB allocations are possible this is over 1.5 GB, which is a complete non-starter, and 16 MB requests on a 32-bit system come to 0.5 GB. I can provide some patches if size reduction is fine. In my opinion sticking to the current configuration will lead to significant problems and I'm not going to be on the hook for it.

I plotted sizes in pages vs. times requested, as seen in https://people.freebsd.org/~mjg/zfs/sizes-distribution.txt mentioned in previous comments (excluding the 1KB+ buffer). As you can see, even without trying to provoke anything, the "max" is what ends up being allocated the most and what has to be expected to be there at least once per CPU.

The current allocation scheme also easily suffers pessimal behavior. Say someone locked pool[0] to evaluate it for a fit and it's a bad fit, so they move on. During that, another thread might trylock pool[0] and have it fail because the lock is already taken, causing that thread to move on to pool[1]. If this thread has a request which is a good fit for the size in pool[0], the buffer got skipped for no good reason. That said, more measurements will be needed, but so far I'm leaning towards a simplistic allocator.
Later addition: testing 'yes > file-on-zstd'. Running the above racks up allocations as the I/O is spread across multiple zfs threads -- literally over 1.5 GB of RAM used for running 'yes' with its output redirected to a file.
You are talking about limiting "levels"; your "see above" doesn't talk about levels.
I wasn't talking about vmalloc explicitly.
I'm certainly not going to dig through it AGAIN (I already did, multiple times by now).
Again, not what I said.
And at the same time you ignore my multiple comments about the removal of ZSTD levels. Also, Brainslayer isn't a ZFS maintainer either; if you PR a fix, it's up to the maintainers to decide on it, not me or even Brainslayer. Besides: Brainslayer is known to be overly "sturdy" when discussing improvements; we had to talk for days to get him to stop ignoring a test failure. That's okay for him to do, and I can deal with it, but PLEASE do not take his sturdiness as representative of this project as a whole. I think we are all open to reviewing a proposal, but if it's going to nuke the ZSTD compression ratio or remove ZSTD levels, I don't think many are going to be very enthusiastic... That's not a "we don't want it", that's a warning to keep looking at the bigger picture.
Allan asked to close this as it has been overtaken by later work and he does not plan to continue it, even though he is not completely rejecting the idea.
Motivation and Context
Split zstd into OS dependent files
This allows FreeBSD to use its UMA allocator, while Linux uses the custom mempool. Also split the private bits of sys/zstd/zstd.h out into sys/zstd/zstd_impl.h.
This new FreeBSD version uses the zio_data_buf caches for all sizes less than 16 MB, and uses its own dedicated kmem_caches for the decompression context and for the higher compression levels at larger block sizes that require the larger compression workspace.
Description
This was originally going to be part of the main ZSTD pull request, but I was asked to put it off to ensure ZSTD got merged in time for OpenZFS 2.0.
I have posted this as two commits at first, to make it easier to review. The first commit just duplicates zfs_zstd.c to zfs_zstd_os.c for each of the two supported OSes, then the second commit removes the duplicate code. This makes it easier to see that nothing has really changed for the Linux version.
How Has This Been Tested?
I have run ztest against it extensively, and verified with the zstd kstats that allocations do not fail.