Description
System information
Type | Version/Name
--- | ---
Linux | Debian
Distribution Name | debian
Distribution Version | buster (10.7)
Linux Kernel | 4.19.0-12-amd64
Architecture | amd64
ZFS Version | OpenZFS 2.0.0
SPL Version | (SPL is integrated in OpenZFS 2.0.0)
Describe the problem you're observing
Setup
- Supermicro X10DRU-i+
- LSI 9361-8i connected to a 6 / 12 Gbps SATA/SAS backplane
- 2 x Seagate ST4000NM000A (denoted `sda` and `sdb`), connected to the backplane
- 1 x Seagate ST4000NM0035 (denoted `sdc`), connected to the backplane
- 128 GB RAM (ECC, of course)
`sda` and `sdb` make up a mirrored ZFS VDEV. The OS boots from this VDEV. There is only one pool, called `rpool`. `rpool` does not contain any other VDEVs besides that mirror. The root file system is the dataset `rpool/system`, mounted at `/`.
There is no swap file on that system (yet).
`rpool` has been created using the following command:
zpool create -o ashift=12 -o altroot=/mnt -O acltype=posixacl -O canmount=off -O checksum=on -O compression=off -O mountpoint=none -O sharesmb=off -O sharenfs=off -O xattr=sa rpool mirror /dev/disk/by-id/ata-ST4000...-part1 /dev/disk/by-id/ata-ST4000...-part1
That is, the pool and the VDEV have `ashift=12`.
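For completeness, the effective ashift of the VDEV can be cross-checked against the cached pool configuration, e.g.:

```sh
# Show the cached pool configuration and filter for the vdev ashift
zdb -C rpool | grep ashift
```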
`rpool/system` has been created using the following command:
zfs create -o aclinherit=passthrough -o acltype=posixacl -o atime=on -o canmount=on -o checksum=on -o compression=off -o mountpoint=/ -o overlay=off -o primarycache=all -o redundant_metadata=all -o relatime=off -o secondarycache=none -o setuid=on -o sharesmb=off -o sharenfs=off -o logbias=latency -o snapdev=hidden -o snapdir=hidden -o sync=standard -o xattr=sa -o casesensitivity=sensitive -o normalization=none -o utf8only=off rpool/system
We have also created a ZVOL using the following command:
zfs create -b 4096 -o checksum=on -o compression=off -o primarycache=metadata -o redundant_metadata=all -o secondarycache=none -o logbias=latency -o snapdev=hidden -o sync=standard -V 100G rpool/zvol-test
That ZVOL is mounted on `/blob`.
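For completeness, the ZVOL was formatted and mounted roughly along these lines (the file system on the ZVOL is ext4 in this sketch; the exact type is not important for the problem described below):

```sh
# Sketch of how the ZVOL ends up on /blob; ext4 is an assumption here,
# the block device path is the standard /dev/zvol/<pool>/<volume> link.
mkfs.ext4 /dev/zvol/rpool/zvol-test
mkdir -p /blob
mount /dev/zvol/rpool/zvol-test /blob
```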
`sdc` contains a partition with a normal ext4 file system which is mounted on `/mnt`. That file system contains just several dozen ISO files (average size about 6 GB).
On that machine, nothing runs besides the standard services the distribution installs. Notably, there is no VM running and nothing else that could produce a substantial workload.
In this state, when starting `watch -n 1 zpool iostat -ylv 1 1` and watching it for a while, there is indeed nearly no load on the ZFS disks. Every few seconds or so, some kilobytes hit the VDEV, which is expected.
Copying to the dataset (not the ZVOL): No problem
Now we open the iostats in one terminal window (`watch -n 1 zpool iostat -ylv 1 1`) and start to copy ISO files from `sdc` onto the ZFS dataset `rpool/system` in another terminal window (`rsync --progress /mnt/*.iso ~/test`, where `~/test` is part of the root file system and thus is on `rpool/system`).
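In other words, the whole test consists of just these two commands running side by side:

```sh
# Terminal 1: per-vdev throughput and latency, refreshed every second
watch -n 1 zpool iostat -ylv 1 1

# Terminal 2: copy the ISO files from the ext4 source disk onto the dataset
rsync --progress /mnt/*.iso ~/test
```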
While the copy runs, `rsync` shows a few drops in bandwidth every now and then, but there are no noticeable holdups, and the drops in bandwidth are short. Likewise, `zpool iostat` shows the two disks in the VDEV being hit with data rates that are to be expected. The swings in disk load reported by `zpool iostat` are surprisingly large, though (the load constantly jumps between something like 30 MB/s and 300 MB/s), but there are no real holdups either. In summary, the copy runs at over 100 MB/s on average and does not stall for any longer time.
We have interrupted that test after 30 GB or so because we didn't expect anything new from letting it run longer. However, we repeated it several times, each time copying different ISO files and rebooting beforehand. The behavior was the same each time.
Copying to the ZVOL: Problem
When we do exactly the same thing, but copy to the ZVOL instead of the dataset (`rsync --progress /mnt/*.iso /blob`), the situation changes. `rsync` initially shows the copy running at roughly 190 MB/s for a few seconds, then it stalls. Thereafter it copies again for a few seconds at the rate denoted above, then stalls again, and so on.
The problem is that the holdups last for a long time during which absolutely nothing happens, up to several minutes (!). However, `zpool iostat` shows that the two ZFS disks are under heavy load during this time, constantly (more or less) being hit with over 100 MB/s. Even when we interrupt the copy by hitting Ctrl-C in the terminal window where `rsync` runs, this high load lasts for several minutes until everything returns to normal.
There must be extreme write amplification somewhere, the amplification factor being somewhere between 5 and 10. For example, if we copy 40 GB that way, this would normally take about 5 minutes. But actually it takes at least half an hour, although the ZFS disks are under heavy load all that time.
For that reason, ZVOLs are currently just not usable for us, which imposes a major problem. What could be going on there?
Our own thoughts and what we have tried already:
First, we'd like to stress again that the ZVOL test did not happen within a VM. The problem is definitely not due to QEMU or to (para)virtualized data transfer.
Secondly, we are aware that it might not be the best idea to run ZFS on disks which are attached to a RAID controller like the LSI 9361-8i, or to run it on hybrid disks like the ones we have. However, we have configured that controller for JBOD mode, and the OS sees the disks as individual drives, as expected. The key point regarding possible hardware problems is that copying large amounts of data to the ZFS dataset (`rpool/system`) works as expected. If the problems with the ZVOL were due to hardware, we would see the same problems with the dataset; this is not the case, though.
Thirdly, the problem is not due to the ZFS version. Debian buster comes with ZoL 0.7.12, and we originally noticed the problem there. We desperately need working ZVOLs, so we installed OpenZFS 2.0.0 on that machine, which changed exactly nothing with respect to this problem.
As a further test, we created the ZVOL with `volblocksize=512` and ran the tests again. Again, nothing changed. We repeated the process with `volblocksize` values of `8192`, `16384` and `128k`. Again, no luck: maybe it stalled a few seconds earlier or later, or for a longer or shorter time in each test compared to the others, but the general picture remained the same. Between the stalls, the copy ran for a few seconds at the expected speed, then it stalled for many seconds, mostly even a few minutes, while `zpool iostat` was showing a constant data rate of roughly 100 MB/s on each disk, and so on. After interrupting the copy, both ZFS disks continued to be hit with a data rate of 100 MB/s or more for several minutes.
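Each of these runs re-created the test ZVOL from scratch, roughly like this (shown here for 16k; `volblocksize` cannot be changed after creation, and the remaining properties match the creation command above):

```sh
# Re-create the test ZVOL with a different volblocksize (16k as an example);
# volblocksize is fixed at creation time, so the old volume is destroyed first.
zfs destroy rpool/zvol-test
zfs create -b 16384 -o checksum=on -o compression=off -o primarycache=metadata \
  -o redundant_metadata=all -o secondarycache=none -o logbias=latency \
  -o snapdev=hidden -o sync=standard -V 100G rpool/zvol-test
```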
Then we tested the ZVOL with `sync=disabled`. That didn't change anything. The same goes for `primarycache=all` (instead of `metadata`) (but at least this was expected), and for `logbias=throughput` (instead of `latency`).
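These properties were changed on the existing test ZVOL roughly as follows:

```sh
# Property changes tried on the test ZVOL; none of them affected the stalls.
zfs set sync=disabled rpool/zvol-test
zfs set primarycache=all rpool/zvol-test
zfs set logbias=throughput rpool/zvol-test
```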
Next, we thought that it might have something to do with the physical sector size of the ZFS disks being 512 bytes while the pool (and the VDEV) had `ashift=12`. Therefore, we destroyed the pool, re-created it with `ashift=9`, re-created all file systems / datasets as described above, and ran all tests again. Once again, this didn't change anything.
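The re-created pool differed from the original one only in the ashift value, i.e. roughly:

```sh
# Same pool layout as before, but with 512-byte sectors assumed (ashift=9)
zpool create -o ashift=9 -o altroot=/mnt -O acltype=posixacl -O canmount=off \
  -O checksum=on -O compression=off -O mountpoint=none -O sharesmb=off \
  -O sharenfs=off -O xattr=sa \
  rpool mirror /dev/disk/by-id/ata-ST4000...-part1 /dev/disk/by-id/ata-ST4000...-part1
```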
We then went back to the original pool with `ashift=12` and used it for the further tests. At this point, we were out of ideas about what to do next, so we read up on the ZFS I/O scheduler and tested a large number of combinations of `zfs_dirty_data_max`, `zfs_delay_scale`, `zfs_vdev_async_write_max_active`, `zfs_vdev_async_write_min_active`, `zfs_vdev_async_write_active_max_dirty_percent`, and `zfs_vdev_async_write_active_min_dirty_percent`.
To our surprise, the last five of these barely influenced the behavior. However, the first one (`zfs_dirty_data_max`), which originally was set to 4 GB, changed the situation when we set it to a low value, e.g. 512 MB. The improvement was that there were fewer long-lasting holdups: there were actually more holdups, but all of them were so short that it became acceptable. However, the average data rate did not increase, because the transfer rate `rsync` reported was now limited to about 30 MB/s, mostly hovering around 10 MB/s or 20 MB/s. There were no phases with high data rates any more.
So the copying was more "responsive" with low values of `zfs_dirty_data_max`, but that didn't help because the data rate per se was drastically limited. In summary, changing the I/O scheduler parameters explained in the documentation referred to above did not lead anywhere.
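These tunables can be inspected and changed at runtime via the module parameter files; the 512 MB example mentioned above would look roughly like this (`zfs_dirty_data_max` is in bytes):

```sh
# Lower the dirty data limit to 512 MB (value in bytes) at runtime
echo $((512 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max

# The other write-throttle / scheduler tunables we varied can be read the same way
cat /sys/module/zfs/parameters/zfs_delay_scale
cat /sys/module/zfs/parameters/zfs_vdev_async_write_min_active
cat /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
cat /sys/module/zfs/parameters/zfs_vdev_async_write_active_min_dirty_percent
cat /sys/module/zfs/parameters/zfs_vdev_async_write_active_max_dirty_percent
```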
The last thing we looked into was `zfs_txg_timeout`. Setting it to a lower value didn't improve the situation when copying to the ZVOL (but increased the load on the ZFS disks when the system was completely idle). Setting it to a higher value didn't improve copying either (but reduced the load on the ZFS disks when the system was idle).
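`zfs_txg_timeout` is the maximum number of seconds between transaction group syncs (5 by default); for these tests it can be changed on the fly, roughly like this:

```sh
# zfs_txg_timeout: maximum seconds between txg syncs (default 5);
# can be changed at runtime for testing
cat /sys/module/zfs/parameters/zfs_txg_timeout
echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout
```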
Now we are completely out of ideas. We probably could look into other parameters of the ZFS module (`/sys/module/zfs/parameters`) or the disk drivers (`/sys/block/sdx`). But this would be just wild guessing and a waste of time. Therefore, we are hoping that somebody is willing to give us some hints.
What we did not try, and why not
`zfs_arc_max` is set to 4 GB on that system (how we set it is sketched after this list), and we did not test larger values for the following reasons:
- That parameter is about reading, not writing, and the copy source is an ext4 partition of a physical disk, so no ZFS parameter would have any effect on the copy source.
- We clearly have a problem with writing here, not with reading (remembering that copying to the normal dataset (not the ZVOL) works normally).
- When we began working with ZFS some years ago, the first thing we had to solve was a system which started normally, but then became totally unresponsive and finally totally locked up within minutes. The cause of that problem was that ZFS was eating up all available RAM for its ARC cache until the machine crashed or hung. Since then, we always limit the ARC size (and never ever had any stability issues or crashes with ZFS again).
- Our goal is to run a bunch of VMs with ZVOL storage (the tests described above are just, eehm, tests before we put even more effort into switching completely to ZFS). The number of VMs and the memory they will be given is precisely known. It would not make any sense to test larger ARC sizes, because the ARC size at the end of the day couldn't be much larger than 4 GB.
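For completeness, the ARC limit is applied via the usual module option; in our setup it looks roughly like this (the file name under `/etc/modprobe.d` is our choice):

```sh
# Limit the ARC to 4 GiB (value in bytes); applied via a module option,
# e.g. in /etc/modprobe.d/zfs.conf (file name is our choice):
#   options zfs zfs_arc_max=4294967296
# The current limit can be inspected at runtime:
cat /sys/module/zfs/parameters/zfs_arc_max
```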
We did not try to use a secondary cache (L2ARC). Again, the copy source is not on ZFS, so an L2ARC wouldn't make any sense here; furthermore, we have a writing problem, not a reading problem.
We did not try to use an SLOG. This would not make any sense, because one of our tests was to set `sync=disabled` on the copy destination ZVOL, and this did not change the observed behavior in the slightest. Therefore, we know that our problem is not due to sync writes, and thus an SLOG wouldn't improve the behavior.
Describe how to reproduce the problem
Install a system similar to the one described above, issue the commands described above, and watch the long-lasting holdups in the terminal window where `rsync` runs and the heavy disk load that `zpool iostat` shows in the other terminal window, leading to high disk wear and low bandwidth.
Since it is not easy to set up a system like ours, we are willing to give remote access to one of these systems if somebody is interested in investigating the problem. In this case, please leave a comment stating how we can get into contact.
Include any warning/errors/backtraces from the system logs
If somebody tells us what exactly is needed here, we'll provide it immediately :-). We guess `zpool iostat` or other tools produce output that is more valuable than the log files, but since we are neither Linux nor ZFS experts, we are a bit lost here. Notably, we don't know how to operate `dtrace` or `strace` properly. If somebody tells us what to do, we'll try our best.
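In the meantime, here is roughly what we could capture during one of the stalls, if that is useful (the selection of commands is just our guess at what might be relevant):

```sh
# Rough snapshot of system state during a stall (our guess at what might be relevant)
dmesg | tail -n 100                       # recent kernel messages, hung-task warnings
zpool events -v | tail -n 50              # recent ZFS events
cat /proc/spl/kstat/zfs/arcstats          # ARC statistics
cat /proc/spl/kstat/zfs/dmu_tx            # transaction assign/delay counters
cat /proc/spl/kstat/zfs/rpool/txgs        # per-txg timings for the pool
# Stacks of blocked tasks (requires sysrq to be enabled):
echo w > /proc/sysrq-trigger && dmesg | tail -n 200
```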