
Split zstd into OS-dependent files (use FreeBSD UMA) #10975


Closed
allanjude wants to merge 2 commits from the openzfs_zstd_freebsd_uma branch

Conversation

allanjude
Contributor

Motivation and Context

Split zstd into OS-dependent files

This allows FreeBSD to use its UMA allocator, while Linux uses the custom mempool

Also split private bits of sys/zstd/zstd.h out into sys/zstd/zstd_impl.h

This new FreeBSD version uses the existing zio_data_buf caches for all sizes less
than 16 MB, and uses its own dedicated kmem_caches for the decompression
context and for higher compression levels at larger block sizes that
require the larger compression workspace.

Description

This was originally going to be part of the main ZSTD pull request, but I was asked to put it off to ensure ZSTD got merged in time for OpenZFS 2.0.

I have posted this as two commits to make it easier to review. The first commit just duplicates zfs_zstd.c to zfs_zstd_os.c for each of the two supported OSes, then the second commit removes the duplicate code. This makes it easier to see that nothing has really changed for the Linux version.

How Has This Been Tested?

I have run ztest against it extensively, and verified with the zstd kstats that allocations do not fail.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@allanjude allanjude force-pushed the openzfs_zstd_freebsd_uma branch from 6dfef3f to 3db50a0 on September 24, 2020 01:22
@PrivatePuffin
Contributor

@c0d3z3r0 and @BrainSlayer might want to have a look.

@BrainSlayer
Contributor

It's just an opinion, but I don't think this will yield a performance benefit over the generic solution we made. So what's important here is a side-by-side comparison of both solutions: the mempool running on FreeBSD vs. the new code on FreeBSD, using high and low compression levels. Even if the UMA allocator's results are identical, it will create higher code maintenance effort in the future. The mempool solution gives us more control over the management, whereas UMA takes over how it works. But please provide a side-by-side comparison to check if it makes sense at all.

@PrivatePuffin
Contributor

PrivatePuffin commented Sep 24, 2020

@BrainSlayer
It would be preferred to always use an OS-provided allocator if we can, imho. Even if that adds code, it means we don't actually have to maintain the allocator itself, which might(!) be preferred by some.

That being said: I would like to have some performance comparison tests of this before merging. I think it would be fine to just run the compression test script we used previously on both and dump the results here (no need to make it look fancy).

Also:
@allanjude
I noticed the old allocator is put into "linux"; I don't think that's what we want... Even if this is implemented, the old allocator is the most platform-independent of the two, I think, so it should be the default unless overridden by a specific OS allocator (FreeBSD in this case).

@allanjude
Contributor Author

@BrainSlayer
It would be preferred to always use an OS-provided allocator if we can, imho. Even if that adds code, it means we don't actually have to maintain the allocator itself, which might(!) be preferred by some.

Indeed, and as you can see, we actually re-use the kmem_caches that ZFS already maintains for allocations up to 16 MB (the max record size). I only added 32 MB, 64 MB, and the zstd max estimate, plus a dedicated bucket for the decompression context that is exactly the correct size.
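
For illustration only, a rough sketch of that size-based selection (not the literal patch code; zstd_huge_cache is a hypothetical name for a dedicated large-workspace cache):

#include <sys/spa.h>
#include <sys/zio.h>
#include <sys/kmem.h>

/* Hypothetical cache, created with kmem_cache_create() at module init */
static kmem_cache_t *zstd_huge_cache;

/*
 * Sketch: requests up to SPA_MAXBLOCKSIZE (16 MB) reuse the zio_data_buf
 * caches ZFS already maintains; larger compression workspaces fall
 * through to a dedicated cache of their own.
 */
static void *
zstd_freebsd_alloc_sketch(size_t size)
{
	if (size <= SPA_MAXBLOCKSIZE)
		return (zio_data_buf_alloc(size));
	return (kmem_cache_alloc(zstd_huge_cache, KM_NOSLEEP));
}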

That being said: I would like to have some performance comparison tests of this before merging. I think it would be fine to just run the compression test script we used previously on both and dump the results here (no need to make it look fancy).

I am working on these, but I wanted to share the code in the mean time

Also:
@allanjude
I noticed the old allocator is put into "linux"; I don't think that's what we want... Even if this is implemented, the old allocator is the most platform-independent of the two, I think, so it should be the default unless overridden by a specific OS allocator (FreeBSD in this case).

In this case, I actually feel it might make sense to do it the other way. The likely platforms to be added (illumos and OS X) will have SPL implementations of kmem_cache_alloc() etc, so will likely use the identical code that I added for FreeBSD here.

One main advantage of using the OS allocator is that it takes care of cleanup, idle detection, reaping, etc., so you don't have the issue of #10969

@PrivatePuffin
Contributor

PrivatePuffin commented Sep 24, 2020

Indeed, and as you can see, we actually re-use the kmem_caches that ZFS already maintains for allocations up to 16 MB (the max record size). I only added 32 MB, 64 MB, and the zstd max estimate, plus a dedicated bucket for the decompression context that is exactly the correct size.

Nice work, I've no issue with your allocator by design (at this point at least) :)

I am working on these, but I wanted to share the code in the mean time

No pressure, thanks for sharing anyway. I know how much work it is to do those tests!

In this case, I actually feel it might make sense to do it the other way. The likely platforms to be added (illumos and OS X) will have SPL implementations of kmem_cache_alloc() etc, so will likely use the identical code that I added for FreeBSD here.

That's a very good point.
Besides OS X (which @lundman is working on quite actively indeed!), any word from the illumos folks about preparing to switch in such a way that our code would need to be optimised for them?

One main advantage of using the OS allocator is that it takes care of cleanup, idle detection, reaping, etc., so you don't have the issue of #10969

I agree, if we can use it we should. It's preferred for a multitude of reasons; #10969 is just one of them, although I don't want to say problems like #10969 would be fully prevented, only decreased in frequency.

@BrainSlayer I think we shouldn't forget that the reason we built a custom allocator was that some distros completely went haywire when using the normal system allocator, and it was mostly a placeholder until someone would find a way to make the system allocator work without downsides (and with at least the same performance).

Because the performance was quite good, we didn't expect to move to the system allocator any time soon. But we shouldn't forget how we made the decision to go custom.


That all being said:
Do we want to split this into OS-specific parts?

Let's say we introduce OS X and it could use 95% of the FreeBSD allocator code: are we going to duplicate it again? Add OS X changes into the FreeBSD allocator?

Instead of splitting per OS, isn't it more sane to split it per allocator?
Something like:
ZSTD_ALLOC_POOL
ZSTD_ALLOC_UMA

@BrainSlayer
Contributor

@Ornias1993 I'm fine with the OS allocator if it provides the same performance, but what if it doesn't? That's the point.
Secondly, we should make code maintenance easier: if we have to change something, changing 3 different code implementations is surely not the best idea, so keeping the split code paths as simple as possible is the best way. Anyway, I still want to see a comparison between both allocation approaches, because if the UMA allocator decreases performance it would surely not be a good idea to use this allocator for FreeBSD.

@behlendorf behlendorf added the Status: Design Review Needed Architecture or design is under discussion label Sep 25, 2020
@behlendorf behlendorf marked this pull request as draft September 25, 2020 21:43
Contributor

@PrivatePuffin PrivatePuffin left a comment


I've taken the time to read through it.
Code-wise the split is totally fine: there is no duplicate code, as promised.

I would indeed like to see it not being based around OS, but just creating different files (to be included) for the different allocators under /module/zstd

So:
/module/zstd/zfs_zstd_alloc_uma.c
/module/zstd/zfs_zstd_alloc_pool.c

That also keeps the zstd code somewhat contained instead of spread all over the tree, and possible future OSes can either use one of those two allocators or create a new one, without creating duplicate code (with a bit of luck, that is).

TLDR:

  • Move the allocator files (back) to /module/zstd but keep them separate from zfs_zstd.c
  • Add a short benchmark run

@mattmacy
Contributor

@markjdb could you please comment on the issues raised here?

@PrivatePuffin
Contributor

@markjdb could you please comment on the issues raised here?

@mattmacy No offence intended at all, but for the record: where did your question for @markjdb come from?
I'm not on the mailing lists anymore so I might have missed something, but more people aren't, so just to be sure...

@allanjude
Contributor Author

@markjdb could you please comment on the issues raised here?

@mattmacy No offence intended at all, but for the record: where did your question for @markjdb come from?
I'm not on the mailing lists anymore so I might have missed something, but more people aren't, so just to be sure...

MarkJ is a virtual memory expert and can provide more insight on the advantages of using UMA, and may be able to shed some light on how we can develop a test to show the advantages more clearly.

@PrivatePuffin
Contributor

PrivatePuffin commented Oct 27, 2020

@markjdb could you please comment on the issues raised here?

@mattmacy No offence intended at all, but for the record: where did your question for @markjdb come from?
I'm not on the mailing lists anymore so I might have missed something, but more people aren't, so just to be sure...

MarkJ is a virtual memory expert and can provide more insight on the advantages of using UMA, and may be able to shed some light on how we can develop a test to show the advantages more clearly.

Ah, thanks for that info Allan, of course all insight is welcome.
Though I'm all for using system allocators where/when we can, even if it turns out to be a tie.

I noticed your absolute shit-ton of continued performance tests on the mailing list (awesome work, btw!), did you happen to have some time to do some of them with the UMA allocator?

@bghira

bghira commented Oct 28, 2020

UMA is great, especially for large systems with many memory channels and NUMA zones. The improvement on those systems can be an order of magnitude, with less appreciable gains on more commodity hardware, though the Ryzen series of systems could possibly benefit on workloads where memory latency is a concern.

@markjdb
Contributor

markjdb commented Oct 28, 2020

I'd say the main advantage of using UMA here is that the allocator automatically gets tied in to the system's memory reclamation policy. We use centralized logic to decide when and how much to reclaim from UMA caches. The custom mempool allocator caches large objects (up to 64 MB per slot, and there are 2*4*ncpu slots), and the core kernel has no way to request a flush of the caches; we have to wait for up to 60s for the reaper even if the system is under severe memory pressure.

I'm curious about the original motivation for this custom pool allocator. I see "some distro's completely went haywire when using the normal system allocator" but no info in the commit log message.

As for performance, it's hard to say since I don't know what "typical" zstd allocation patterns look like. For example, how many back-to-back allocations does zstd perform in a given context? Does a context use a mix of allocation sizes? Is there a mix of allocation sizes globally? Looking at the code I can see a few general advantages of UMA:

  • On NUMA systems, UMA will return domain-local memory. The custom allocator will return the first chunk it finds. All allocations start their scan at the same index so locality isn't preserved. That is, if a CPU frees a chunk of memory and then goes to allocate another, it is unlikely to get back the chunk that it previously freed.
  • The custom allocator returns the first chunk of acceptable size, so if there's a mix of chunk sizes available it might end up wasting a large amount of memory. UMA uses separate zones for each power of 2, so there's a tighter bound on wasted memory. This might not be much of a problem for zstd in practice.
  • The custom allocator's working set is limited to 4*ncpu items. If more are required it'll go to the slow path. UMA dynamically estimates each zone's working set size and tries to keep enough items cached to avoid ever having to go to the low-level page allocator. Again, this might not be a problem in practice today, but OTOH if 4*ncpu items is an overestimate, then we potentially have unused items sitting around.

As for testing, I don't know much about how ZFS decides when to compress a file. For instance, if a file is compressed in the ARC and I read it, does the decompressed file end up cached in RAM instead? If there is some straightforward way to trigger decompression of a large file in a loop (e.g., while true; do dd if=/large/file of=/dev/null bs=1M; done), I would try to demonstrate a benefit on NUMA systems by running a set of such loops pinned to CPUs in each domain, and see if there is some measurable throughput improvement that comes from always using domain-local memory for decompression contexts. In particular, measure the distribution of loop timings and compare the bounds of the distribution when using UMA vs. the custom allocator.

@mjguzik
Contributor

mjguzik commented Oct 29, 2020

I think the key missing piece of information here is real-world memory usage before and after the patch. More importantly, the huge zones mentioned make me wonder if their existence with per-CPU buckets is justified. While, given enough RAM(tm), such caching provides the fastest allocation/free, it is quite likely not feasible in real-world workloads. Perhaps the concurrency levels required can be very easily sustained with a handful of caches per NUMA node. To the best of my knowledge the FreeBSD kernel allocator does not provide such a feature (perhaps @markjdb can correct me here). At the least, the FreeBSD allocator provides several statistics for zones which may provide enough insight into how much is consumed and how much we could get away with. If not, it should not be fundamentally hard to hack it up for testing purposes (for example, just add a trivial atomic counter which tracks max usage).

It was mentioned that the system allocator alone was causing trouble. I don't know what such usage looked like at the time; I suspect everything was just passed to malloc every time. For multi-page sizes, which get passed here, both kernels will end up allocating pages directly -- this is slow and you are not expected to do it often.

That said, in my opinion the evaluation mentioned above should be considered a blocker for the patch.

As for the current state of zstd_mempool_alloc, I think it needs replacement on Linux as well. I understand it was created to combat some problems with memory allocation and it tries to balance total memory use and allocation speed. Even if it is here to stay for the foreseeable future, there are significant improvements which can be done here.

The code allocates a ZSTD_POOL_MAX array of zstd_pool items.

/* Structure for pooled memory objects */
struct zstd_pool {
        void *mem;
        size_t size;
        kmutex_t barrier;
        hrtime_t timeout;
};

Given the way this is allocated (one after another with no padding) this layout induces false sharing.
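
One conventional way to avoid that, sketched here purely for illustration (the 128-byte alignment is an assumed cache-line upper bound, not a measured value), is to pad each slot out to a cache-line multiple so neighbouring locks never share a line:

/* Illustration only, not proposed code */
#define ZSTD_POOL_ALIGN	128

struct zstd_pool_padded {
        void            *mem;
        size_t          size;
        kmutex_t        barrier;
        hrtime_t        timeout;
} __attribute__((__aligned__(ZSTD_POOL_ALIGN)));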

/*
 * This variable represents the maximum count of the pool based on the number
 * of CPUs plus some buffer. We default to cpu count * 4, see init_zstd.
 */
static int pool_count = 16;

#define ZSTD_POOL_MAX           pool_count

This is not capped in any manner and already poses a problem. My test system has 104 threads and pool_count got to 416, which as you will see below is quite suboptimal.
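
A trivial mitigation, sketched purely for illustration (the cap of 64 is an arbitrary placeholder, not a tuned value), would be to clamp the count where the init code computes it:

/* Sketch: keep scaling with CPU count, but never beyond a fixed cap. */
pool_count = MIN(boot_ncpus * 4, 64);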

static void *
zstd_mempool_alloc(struct zstd_pool *zstd_mempool, size_t size)
{
[snip]
        /* Seek for preallocated memory slot and free obsolete slots */
        for (int i = 0; i < ZSTD_POOL_MAX; i++) {
                pool = &zstd_mempool[i];
[snip]
                        if (size && !mem && pool->mem && size <= pool->size) {
                                pool->timeout = gethrestime_sec() +
                                    ZSTD_POOL_TIMEOUT;
                                mem = pool->mem;
                                continue;
                        }

                        /* Free memory if unused object older than 2 minutes */
                        if (pool->mem && gethrestime_sec() > pool->timeout) {
                                vmem_free(pool->mem, pool->size);
                                pool->mem = NULL;
                                pool->size = 0;
                                pool->timeout = 0;
                        }

                        mutex_exit(&pool->barrier);
                }
         }

So this code walks the entire array from 0 every single time. On my box that's 416 lock acquires and releases, which is very expensive even single-threaded. Multithreaded, things look much worse: with enough concurrency all the CPUs keep trying to acquire these locks in lockstep, which floors performance. Note I don't know what kind of concurrency level is realistic here given all the other locks which have to be taken to get to this code. The general point, though, is that the current scheme is counter-productive -- it makes allocation slower as the number of present CPUs grows, and in the face of contention it triggers a very expensive series of cache-line ping-pongs.

The above would be a problem (albeit to a lesser degree) even at much smaller scales (say aforementioned 16).

That said, it is guaranteed the allocator would perform faster single-threaded if it just had one lock and an array of buffers to return. It is very likely it would be significantly faster multithreaded, even at modest concurrency, even with said global lock. Note that the lock-profiling tooling of your choosing may not show the failing trylocks above; it would be best to look at a CPU profile instead (e.g., perf top on Linux).

Moreover, given the 2-minute expiration policy, it is really wasteful to keep evaluating every buffer every time you alloc. There are a number of periodic tasks performed by ZFS as it is; such work can be added there instead.

The code does not return a decent fit either, just the first buffer which is big enough to hold the request.

All in all, the SLAB/ZONE memory use evaluation I mentioned above may be time consuming to do right and even then it may turn out that zstd is a bad fit for adding there.

That said, my suggestion for the time being is to reimplement the zstd allocator to have one global lock and an array of bounded size with likely "big enough" buffers. The current notion of returning with a lock held would be removed; instead, the spot in the array would get nullified. Note that while this does not scale and will become a problem at some point, it should perform better than the current code.
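
As a minimal sketch of that idea (illustrative only; the names, slot count, and buffer size are placeholders, and the mutex is assumed to be set up with mutex_init() at module load):

#define ZSTD_SIMPLE_POOL_SLOTS  32                      /* placeholder bound */
#define ZSTD_SIMPLE_POOL_BUFSZ  (2 * 1024 * 1024)       /* "big enough" placeholder */

static kmutex_t zstd_simple_lock;
static void *zstd_simple_pool[ZSTD_SIMPLE_POOL_SLOTS];

static void *
zstd_simple_alloc(size_t size)
{
        void *mem = NULL;

        if (size > ZSTD_SIMPLE_POOL_BUFSZ)
                return (vmem_alloc(size, KM_NOSLEEP));

        mutex_enter(&zstd_simple_lock);
        for (int i = 0; i < ZSTD_SIMPLE_POOL_SLOTS; i++) {
                if (zstd_simple_pool[i] != NULL) {
                        /* Take the buffer out of the cache; no lock is held while it is in use. */
                        mem = zstd_simple_pool[i];
                        zstd_simple_pool[i] = NULL;
                        break;
                }
        }
        mutex_exit(&zstd_simple_lock);

        if (mem == NULL)
                mem = vmem_alloc(ZSTD_SIMPLE_POOL_BUFSZ, KM_NOSLEEP);
        return (mem);
}

static void
zstd_simple_free(void *mem, size_t size)
{
        if (size > ZSTD_SIMPLE_POOL_BUFSZ) {
                vmem_free(mem, size);
                return;
        }
        mutex_enter(&zstd_simple_lock);
        for (int i = 0; i < ZSTD_SIMPLE_POOL_SLOTS; i++) {
                if (zstd_simple_pool[i] == NULL) {
                        /* Put the buffer back into an empty slot. */
                        zstd_simple_pool[i] = mem;
                        mem = NULL;
                        break;
                }
        }
        mutex_exit(&zstd_simple_lock);
        if (mem != NULL)
                vmem_free(mem, ZSTD_SIMPLE_POOL_BUFSZ);
}

With a single lock, each alloc/free pair costs two atomics instead of hundreds of trylocks, at the price of not scaling past the point where that one lock becomes contended.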

Just to get example real-world data, I added a dataset with zstd compression as the output directory for a kernel compile and got the following distribution (warning: rounded down to a power of 2):

dtrace -n 'fbt::zstd_mempool_alloc:entry { @ = quantize(arg1); }'            
dtrace: description 'fbt::zstd_mempool_alloc:entry ' matched 1 probe    
^C


           value  ------------- Distribution ------------- count    
              -1 |                                         0        
               0 |                                         2        
               1 |                                         0        
               2 |                                         0        
               4 |                                         0        
               8 |                                         0        
              16 |                                         0        
              32 |                                         0        
              64 |                                         0        
             128 |                                         0        
             256 |                                         0        
             512 |                                         0        
            1024 |@@@@@@@@@@@@@@@@@@                       31235    
            2048 |                                         0        
            4096 |                                         0        
            8192 |                                         0        
           16384 |                                         86       
           32768 |@@                                       3197     
           65536 |@@                                       3081     
          131072 |@@@@@                                    8564     
          262144 |@                                        1417     
          524288 |@@                                       3699     
         1048576 |@@@@@@@@@@                               17056    
         2097152 |                                         0        

The key observation is that there are huge gaps in the requested sizes.

@BrainSlayer
Contributor

BrainSlayer commented Oct 29, 2020

Let me just explain again what the motivation is. kmalloc, vmalloc, kvmalloc: all these wonderful functions, and yes also slab allocation, just take too long. These allocation functions are simply too slow once the allocation size is big enough (like 16 MB for higher compression ratios etc.), so ZFS tests will fail due to a timeout in the end. The mempool variant is just lightning fast, and that's the reason for it. Using a standard system allocator again will result in the same problem we already had. As for your locking arguments: consider that it's a mutex_trylock and will not acquire 416 locks each time. If zstd is under load, a lot of the slots are in use at the same time, so the number of available slots is much smaller; 416 slots will only be available if zstd is under light or no load, so there is no performance impact. And in the end we also did a lot of benchmarks with that code to find the best solution. Of course I could just use a single lock covering the whole loop, but I don't think there would be any performance impact.

Returning the first matching buffer (even if bigger than requested) is intentional; the goal is to avoid reallocation. Reusing a bigger, already-allocated but unused buffer is faster than allocating a new block that just fits the size.

As for the 2-minute policy: an additional callback for releasing memory has already been implemented. Originally the intention of the code was to be self-organizing without tainting other parts of ZFS too much; I agree that this can be improved now that we have that callback. Replacing the lock handling is not that simple: the lock being held is the marker for a memory object being in use. It's all about thread safety; I have to ensure that access to or checking of these objects is atomic.

@BrainSlayer
Contributor

#11126

@mjguzik
Contributor

mjguzik commented Oct 29, 2020

Let me just explain again what the motivation is. kmalloc, vmalloc, kvmalloc: all these wonderful functions, and yes also slab allocation, just take too long. These allocation functions are simply too slow once the allocation size is big enough (like 16 MB for higher compression ratios etc.), so ZFS tests will fail due to a timeout in the end.

Potential problems of this very sort are precisely why I'm skeptical about the patch proposed in this review. As noted earlier, really big allocations are not expected to happen often.

Slight reordering:

Using a standard system allocator again will result in the same problem we already had.

I noted this might be the case. Moreover, I find it worrisome that ZFS already performs such big allocs using zone/slab. Either way, my primary recommendation here is improving the existing allocator.

The mempool variant is just lightning fast, and that's the reason for it. As for your locking arguments: consider that it's a mutex_trylock and will not acquire 416 locks each time. If zstd is under load, a lot of the slots are in use at the same time, so the number of available slots is much smaller; 416 slots will only be available if zstd is under light or no load, so there is no performance impact.

In the current code, assuming no contention, each allocation will perform 416 atomics to trylock and 415 atomics to unlock (the remaining one is performed on free). Atomics are very expensive even when the target line is cache hot, including on modern CPUs. Regardless of contention level the code will always perform 416 atomics to trylock. I presume you don't necessarily have hardware with this many threads, but you can easily simulate the problem: boot with the pool_count value hardcoded to 416, run yes | dd if=/dev/stdin of=file-on-zstd-dataset and sudo perf top on another terminal. You will find locking primitives doing the spurious atomics at the top of the profile, with everything else far below.

The problem is significantly worse in face of contention.

And in the end we also did a lot of benchmarks with that code to find the best solution. Of course I could just use a single lock covering the whole loop, but I don't think there would be any performance impact.

See above. There is a fundamental difference between one lock (2 atomics) vs several.

Returning the first matching buffer (even if bigger than requested) is intentional; the goal is to avoid reallocation. Reusing a bigger, already-allocated but unused buffer is faster than allocating a new block that just fits the size.

I mentioned this may be problematic if sizes are wildly different. I also posted a histogram of sizes I got in my simple test. The key insight there was that about half the allocations are ~1 KB, and then there is a huge gap to the rest.

I extended the code with basic instrumentation to gather requested sizes, failed trylocks and size differences between requested size and buffer which was used to satisfy it. Then I ran make -j 104 buildkernel with object files on a zstd-enabled dataset.

The dtrace one-liner is: dtrace -n ':::zstd-alloc { @sizes[arg0, arg1] = count(); } :::zstd-alloc { @["contention"] = quantize(arg2); @["mismatch"] = quantize(arg1 - arg0); } :::zstd-alloc_miss* { @["misses"] = quantize(arg0); }' . I'm not familiar with Linux instrumentation past systemtap, so I don't know how to provide one for that system.

Note quantize rounds down to the nearest power of 2.

For failing trylocks I got:

  contention
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@                                     7762
               1 |@                                        936
               2 |@                                        1094
               4 |@                                        1947
               8 |@@                                       4091
              16 |@@@@@                                    8137
              32 |@@@@@@@@                                 14462
              64 |@@@@@@@@@@@@@@@                          26322
             128 |@@@                                      5589
             256 |                                         0

The most common allocation failed 64+ trylocks. Small minority did not fail any.

More importantly though, size-wise (size allocated minus size requested, i.e., how oversized buffers tend to be for a given alloc):

  mismatch
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@                   38704
               1 |                                         0
               2 |                                         0
               4 |                                         0
               8 |                                         0
              16 |                                         0
              32 |                                         0
              64 |                                         0
             128 |                                         0
             256 |                                         0
             512 |                                         0
            1024 |                                         0
            2048 |                                         340
            4096 |                                         152
            8192 |                                         293
           16384 |                                         747
           32768 |@@@                                      4940
           65536 |@@                                       3777
          131072 |@@                                       3133
          262144 |@@                                       4307
          524288 |@@@@@@                                   11177
         1048576 |@@                                       2770
         2097152 |                                         0

As you can see, for some of them the difference is off by more than a megabyte. In other words, the current allocator is handing out way-oversized buffers, likely resulting in more memory use to satisfy other allocations. The complete allocation list can be found at https://people.freebsd.org/~mjg/zfs/sizes-distribution.txt. The first column is the requested size, the second the size of the selected buffer, and the third how many times the given combination was encountered.

The small 1176-byte buffer can be adequately satisfied by the system allocator, and special-casing it would remove the majority of the above problem.
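
Hypothetically (a sketch only; the 2 KB cutoff and function name are illustrative, not tuned or taken from the code), that special case could be as simple as:

/* Route the small bookkeeping allocation straight to the kernel heap;
 * everything else keeps going through the existing pool. */
static void *
zstd_alloc_small_sketch(struct zstd_pool *pool, size_t size)
{
        if (size <= 2048)
                return (kmem_alloc(size, KM_NOSLEEP));
        return (zstd_mempool_alloc(pool, size));
}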

Another note is that for all the big buffers, the actual allocated size is a multiple of 4096, despite very odd sizes requested. Not rounding it up passed up on some matching opportunities.

In my opinion, distributions from a handful of workloads should be gathered and, based on those, the best multiple-of-page-size round-ups decided. Then the allocator can start handing out possibly oversized buffers, but nothing extravagant, and still have better memory use than now.

As for the 2-minute policy: an additional callback for releasing memory has already been implemented. Originally the intention of the code was to be self-organizing without tainting other parts of ZFS too much; I agree that this can be improved now that we have that callback.

Replacing the lock handling is not that simple: the lock being held is the marker for a memory object being in use. It's all about thread safety; I have to ensure that access to or checking of these objects is atomic.

It is simple. Should you have an array of buffers or linked lists of buffers, on alloc you take the buffer out of the cache. You put it back on free. Nobody else has to see it.

@PrivatePuffin
Contributor

run yes | dd if=/dev/stdin of=file-on-zstd-dataset

Be aware that the current zstd implementation is not very optimised for such single-stream writes (just a sidenote)

@mjguzik mjguzik mentioned this pull request Oct 29, 2020
@mjguzik
Contributor

mjguzik commented Oct 29, 2020

Regardless of what's going on here, can I get some reviews on #11129?

@allanjude
Contributor Author

I was wondering if it would be better to just use kmem_alloc() for any allocation larger than the kmem_caches that already exist (the largest is 16 MB). However, when extracting a FreeBSD src.txz to a zstd-19 compressed dataset with recordsize=1m, there were enough allocations larger than 16 MB that I think having the largest kmem_cache will be beneficial. It also has basically no cost if the user doesn't use very high compression levels and large block sizes.

Using a slightly cleaned up version of this patch on tar -xf src.tar -C /test/zstd-19; zpool sync:

dtrace -n 'fbt:openzfs:zstd_alloc:entry { @["size"] = quantize(arg1); }'
dtrace: description 'fbt:openzfs:zstd_alloc:entry ' matched 1 probe

  size
           value  ------------- Distribution ------------- count
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@@                     78549
            2048 |                                         0
            4096 |                                         0
            8192 |                                         0
           16384 |                                         0
           32768 |                                         0
           65536 |                                         0
          131072 |@@@@@@@                                  28127
          262144 |@@@@@@@                                  28972
          524288 |@@@@                                     14661
         1048576 |@                                        4012
         2097152 |                                         1488
         4194304 |                                         669
         8388608 |                                         264
        16777216 |                                         406
        33554432 |                                         0

This is just to make the diff easier to read

Signed-off-by: Allan Jude <[email protected]>
This allows FreeBSD to use UMA, while Linux uses the custom mempool

Also split private bits of sys/zstd/zstd.h into sys/zstd/zstd_impl.h

This new FreeBSD version uses the zio_data_buf_cache's for all sizes
less than 16 MB, and uses its own dedicated kmem_caches for the
decompression context, and for higher compression levels at larger
block sizes that require the larger compression workspace.

Sponsored-by: The FreeBSD Foundation
Signed-off-by: Allan Jude <[email protected]>
@allanjude allanjude force-pushed the openzfs_zstd_freebsd_uma branch from 88ac1e9 to 127ba0f on October 31, 2020 18:51
@mjguzik
Contributor

mjguzik commented Oct 31, 2020

I tried to explain that a 16 MB kmem cache is an incredibly fishy ordeal in the face of memory shortage. Apart from that, the very fact that allocations of this sort are to be supported comes at a cost in terms of what solutions are viable.

I created a zstd-19 dataset and re-tested the kernel build on it. I validated that I get the big allocations. The memory use reached is:

kstat.zfs.misc.zstd.size: 420764287
kstat.zfs.misc.zstd.buffers: 269

That's over 400MB of RAM just to handle in-flight compression and without any form of per-CPU caching. This can definitely be reduced with better behavior of the allocator, but it's still going to be way too big for this purpose.

In other words zstd memory usage has to be bounded upfront to much smaller levels. The difficulty here is evaluating memory use vs compression ratios vs compression speed vs an array of workloads to check what configurations make the most sense. That's a rather involved project and I don't expect it to be completed anytime soon.

In the meantime, the bare minimum which should be done is to test compression levels vs. memory usage in at least 2 workloads. The default seems to allocate ~1.2 MB buffers, which with better caching may be tolerable.

So this would artificially limit what zstd compression levels are configurable to begin with. There is the problem of existing datasets which already have a higher limit configured. They will have to support at least reading the data, and compressing with a smaller limit despite the configuration.

As for the allocator, there is a guaranteed significant improvement from the following:

  1. delegate the 1KB+ allocs to kmalloc. They happen all the time at the beginning of compression (and comprise half of total allocations). They will fit in the 2KB slab. While a little wasteful, should be fine. Perhaps someone can look into shrinking them to <= 1KB so that they fit in the lower slab.
  2. round up all allocations at least to a multiple of the page size, since that's what they actually take. Also see point 4.
  3. drop the first-fit policy in favor of best fit. The current code notoriously hands out hugely mismatched buffers, triggering more allocations than necessary.
  4. probably the most sensible thing here would be to split the de facto observed allocation range into n "slabs" and allocate buffers sized to those.
  5. concurrency-wise there can be a lock per "slab". That should be fine for the foreseeable future.

None of the above points are hard to implement and I can take care of it later, but the buy-in to limit memory use has to be there.
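
Purely as an illustration of points 2 and 3 (helper names are placeholders and locking is elided; this is not code from this PR or #11129):

/* Round every request up to a page multiple, then pick the smallest
 * cached buffer that still fits (best fit) instead of the first fit. */
static struct zstd_pool *
zstd_best_fit_sketch(struct zstd_pool *pool, int count, size_t size)
{
        struct zstd_pool *best = NULL;

        size = P2ROUNDUP(size, PAGESIZE);
        for (int i = 0; i < count; i++) {
                if (pool[i].mem == NULL || pool[i].size < size)
                        continue;
                if (best == NULL || pool[i].size < best->size)
                        best = &pool[i];        /* tighter fit wins */
        }
        return (best);
}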

@PrivatePuffin
Contributor

PrivatePuffin commented Oct 31, 2020

The difficulty here is evaluating memory use vs compression ratios vs compression speed vs an array of workloads to check what configurations make the most sense. That's a rather involved project and I don't expect it to be completed anytime soon.

@mjguzik We could at least make the multiplier-per-core a tuneable to give users an option to lower it in case of insane amounts of threads?

In the meantime, the bare minimum which should be done is to test compression levels vs. memory usage in at least 2 workloads.

A basic version of this had been done on low-ish core-count systems (32 threads max, afaik) and the memory usage was "nothing worrisome"... (as in: nothing that would cause real trouble)

Not much testing has been done on very-many-cores systems though...

So this would artificially limit what zstd compression levels are configurable to begin with

We already do limit it to < ZSTD19 which was already a "future proof compromise"

There is the problem of existing datasets which have a higher limit configured already. They will have to support at least reading the data and compressing with smaller limit despite the configuration.

I would not agree it's okay to redesign something as basic as the number of supported levels for an algorithm already released (even if it's just in an RC)

delegate the 1KB+ allocs to kmalloc. They happen all the time at the beginning of compression (and comprise half of total allocations). They will fit in the 2KB slab. While a little wasteful, should be fine. Perhaps someone can look into shrinking them to <= 1KB so that they fit in the lower slab.

If talking Linux or "all-OS": you sound like these ideas weren't tested. Please read back the insanely long history of zstd-on-zfs to get a glimpse of the endless tries, discussions, and weirdness with some distros that were encountered.

round up all allocations at least to a multiple of the page size, since that's what they actually take

That might be a fair tweak

drop the first fit policy in favor of best fit. the current code notoriously hands out hugely mismatched buffers triggering more allocations than necessary

How about: first fit if it's a good fit, otherwise look for best fit?

the buy-in to limit memory use has to be there.

I think everyone here is open to optimisation, but if with "buy-in" you mean "at all costs" (like removing many zstd levels), that's a no for some/most of us I think.

@mjguzik
Contributor

mjguzik commented Nov 1, 2020

The difficulty here is evaluating memory use vs compression ratios vs compression speed vs an array of workloads to check what configurations make the most sense. That's a rather involved project and I don't expect it to be completed anytime soon.

@mjguzik We could at least make the multiplier-per-core a tuneable to give users an option to lower it in case of insane amounts of threads?

The ~100-thread box, while above average, is nothing special. Bigger systems have been available for years. This really should work at least OK on them without hand-tuning, and it's not hard to get there. The current issues are somewhat self-induced by the current design; see below.

In the meantime, the bare minimum which should be done is to test compression levels vs. memory usage in at least 2 workloads.

A basic version of this had been done on low-ish core-count systems (32 threads max, afaik) and the memory usage was "nothing worrisome"... (as in: nothing that would cause real trouble)

What exact data are we talking about? In particular was memory use by the cache here measured?

The ~3MB buffers which I can provoke on my own are very worrisome. 16MB claimed elsewhere is in my opinion a non-starter for anything but highly specialized deployments as it will tie up hundreds of megs of RAM.

Not much testing has been done on very-many-cores systems though...

So this would artificially limit what zstd compression levels are configurable to begin with

We already do limit it to < ZSTD19 which was already a "future proof compromise"

See above.

There is the problem of existing datasets which have a higher limit configured already. They will have to support at least reading the data and compressing with smaller limit despite the configuration.

I would not agree it's okey to redesign something as basic as the amount of supported levels for an algorithm already released (even if it's just in an RC)

Again, see above. The allocs which can be provoked may, in my opinion, be quite hard to sustain in production. Also see below.

delegate the 1KB+ allocs to kmalloc. They happen all the time at the beginning of compression (and comprise half of total allocations). They will fit in the 2KB slab. While a little wasteful, should be fine. Perhaps someone can look into shrinking them to <= 1KB so that they fit in the lower slab.

If talking Linux or "all-OS": you sound like these ideas weren't tested. Please read back the insanely long history of zstd-on-zfs to get a glimpse of the endless tries, discussions, and weirdness with some distros that were encountered.

It is a fair point that I'm not aware of the history here. However: 1. pushing multi-megabyte requests to vmalloc et al. at a high rate is basically expected to cause trouble, so I'm not surprised it did; 2. if per-CPU caching of these allocations was attempted at the slab layer, I'm also not surprised it failed, as it ties up even more memory (and is why I don't think the patch in this review is a viable way to go).

Can you point me at specific attempts?

I find it very unlikely that delegating the transient <2KB requests to kmalloc would result in trouble. If anything it should lessen the total memory use. Note I'm talking about making sure this lands in the 2KB malloc slab (both FreeBSD and Linux have one and both use it as it is). I can tell you though that ZFS is already making tons of allocs which fit the 2KB slab, and they use it on FreeBSD. Allocations of the sort are definitely also made on Linux and they clearly perform at least adequately, albeit I did not check whether that's the kmalloc slab or something else (it's definitely not vmalloc). While in principle it is possible that adding these buffers tipped something over, I find it incredibly unlikely.

round up all allocations at least to a multiple of the page size, since that's what they actually take

That might be a fair tweak

drop the first fit policy in favor of best fit. the current code notoriously hands out hugely mismatched buffers triggering more allocations than necessary

How about: first fit if it's a good fit, otherwise look for best fit?

the buy-in to limit memory use has to be there.

I think everyone here is open to optimisation, but if with "buy-in" you mean "at all costs" (like removing many zstd levels), that's a no for some/most of us I think.

I'm going to reiterate and expand a little.

I was asked to comment on the FreeBSD side of the patch in this PR. I noted it looks like a non-starter if big sizes are at play. For zstd itself, as a ZFS user, I need it to at least not get in the way when not employed. This is sorted out with #11126.

Ultimately you are free to ignore my feedback, but at the same time I'm not going to spend time on a project where this happens. The 16MB-induced worry is not just some minor concern. Suitability for production use hinges on behavior in corner cases.

For the default setup the allocation request is about 1.2 MB, and it is trivial to provoke it on all CPUs. For the 104-thread box that means over 100 MB of RAM going directly to transient zstd buffers. If 16 MB allocations are possible this is over 1.5 GB, which is a complete non-starter. The 16 MB requests on a 32-bit system are 0.5 GB. I can provide some patches if size reduction is acceptable. In my opinion sticking to the current configurations will lead to significant problems, and I'm not going to be on the hook for it.

I plotted sizes in pages vs times requested as seen in https://people.freebsd.org/~mjg/zfs/sizes-distribution.txt mentioned in previous comments (excluded the 1KB+ buffer)

(plot: allocation request size in pages vs. number of requests)

As you can see, even without trying to provoke anything, the "max" is what ends up being allocated the most and what has to be expected to be there at least per CPU.

The current allocation scheme easily suffers pessimal behavior. Say someone locked pool[0] to evaluate it for a fit, and it's a bad fit, so they move on. During that time another thread might trylock pool[0] and have it fail because the lock is already taken, causing that thread to move on to pool[1]. If this thread has a request which is a good fit for the size in pool[0], the buffer got skipped for no good reason.

That said, more measurements will be needed, but so far I'm leaning towards a simplistic allocator:

  1. move the 1.1KB buffer to malloc
  2. only allocate "max" sized buffers (in this case they all would be 1.2MB). But again only viable if the max is not in range of 16MB or similar. The 3MB I got with zstd-19 is imo already problematic, giving lower bound to > 300MB on my system.

Later addition: testing 'yes > file-on-zstd'
compression=zstd recordsize=1M -> 2604825 byte requests
compression=zstd-19 recordsize=1M -> 19389039 byte requests

Running the above racks up:

kstat.zfs.misc.zstd.size: 1644948564                              
kstat.zfs.misc.zstd.buffers: 160

... as the I/O is spread across multiple zfs threads. That's literally over 1.5GB RAM used for running 'yes' with output redirected to a file.

@PrivatePuffin
Contributor

PrivatePuffin commented Nov 1, 2020

See above.

You are talking about limiting "levels", but your "see above" doesn't talk about levels.
Are you mixing jargon now? ZSTD levels are jargon for zstd compression levels, not something alloc-related.

  1. pushing multimegabyte requests to vmalloc et all at high rate is basically expected to cause trouble, so I'm not surprised it did.

I wasn't talking about vmalloc explicitly.
Please stop using strawman arguments....

Can you point me at specific attempts?

I'm certainly not going to dig through it AGAIN (I already did multiple times by now).
And there are no documented attempts, because BrainSlayer isn't very good at keeping git history and we collaborated a lot on it...

I find it very unlikely that delegating the transient <2KB requests to kmalloc would be resulting in trouble

Again, not what I said.

Ultimately you are free to ignore my feedback, but at the same time I'm not going to spend time on a project where this happens. The 16MB-induced worry is not just some minor concern. Suitability for production use hinges on behavior in corner cases.

And at the same time you ignore my multiple comments about the removal of ZSTD levels.
To be clear: I'm NOT the main allocator guru here; I'm not even capable of discussing the intricacies of the allocators themselves to this detail. So yes, I'm ignoring that.

Also, BrainSlayer isn't a ZFS maintainer either. If you PR a fix, it's up to the maintainers to decide on it. Not me, or even BrainSlayer.

Besides: BrainSlayer is known to be overly "sturdy" when discussing improvements; we had to talk for days to get him to stop ignoring a test failure. That's okay for him to do, and I can deal with it. But PLEASE do not take his sturdiness as representative of this project as a whole.

I think we are all open to reviewing a proposal. But if it's going to nuke the zstd compression ratio or remove zstd levels, I don't think many are going to be very enthusiastic... That's not a "we don't want it", that's a warning to keep looking at the bigger picture.

But to be clear:
Please just make a PR so we can discuss actual changes instead of endless semi-off-topic discussion on @allanjude's PR. Code says a LOT more than endlessly discussing principles.

@amotin
Member

amotin commented Oct 29, 2024

Allan asked to close this as overcome by later work; he does not plan to continue it, though he is not completely rejecting the idea.

@amotin amotin closed this Oct 29, 2024