Description
System information
Type | Version/Name
--- | ---
Linux | Debian
Distribution Name | debian
Distribution Version | buster (10.7)
Linux Kernel | 4.19.0-12-amd64
Architecture | amd64
ZFS Version | OpenZFS 2.0.0
SPL Version | (SPL is integrated in OpenZFS 2.0.0)
Describe the problem you're observing
Setup
- Supermicro X10DRU-i+
- LSI 9361-8i connected to a 6 / 12 Gbps SATA/SAS backplane
- 2 x Seagate ST4000NM000A (denoted `sda` and `sdb`), connected to the backplane
- 1 x Seagate ST4000NM0035 (denoted `sdc`), connected to the backplane
- 128 GB RAM (ECC, of course)
`sda` and `sdb` make up a mirrored ZFS VDEV. The OS boots from this VDEV. There is only one pool, called `rpool`. `rpool` does not contain any other VDEVs besides that mirror. The root file system is the dataset `rpool/system`, mounted at `/`.
There is no swap file on that system (yet).
`rpool` has been created using the following command:
zpool create -o ashift=12 -o altroot=/mnt -O acltype=posixacl -O canmount=off -O checksum=on -O compression=off -O mountpoint=none -O sharesmb=off -O sharenfs=off -O xattr=sa rpool mirror /dev/disk/by-id/ata-ST4000...-part1 /dev/disk/by-id/ata-ST4000...-part1
That is, the pool and the VDEV have `ashift=12`.
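For completeness, the effective ashift of the VDEV can be cross-checked against the cached pool configuration, e.g.:

```sh
# Show the cached pool configuration and filter for the vdev ashift
zdb -C rpool | grep ashift
```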
`rpool/system` has been created using the following command:
zfs create -o aclinherit=passthrough -o acltype=posixacl -o atime=on -o canmount=on -o checksum=on -o compression=off -o mountpoint=/ -o overlay=off -o primarycache=all -o redundant_metadata=all -o relatime=off -o secondarycache=none -o setuid=on -o sharesmb=off -o sharenfs=off -o logbias=latency -o snapdev=hidden -o snapdir=hidden -o sync=standard -o xattr=sa -o casesensitivity=sensitive -o normalization=none -o utf8only=off rpool/system
We have also created a ZVOL using the following command:
zfs create -b 4096 -o checksum=on -o compression=off -o primarycache=metadata -o redundant_metadata=all -o secondarycache=none -o logbias=latency -o snapdev=hidden -o sync=standard -V 100G rpool/zvol-test
That ZVOL is mounted on `/blob`.
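For completeness, the ZVOL was formatted and mounted roughly along these lines (the file system on the ZVOL is ext4 in this sketch; the exact type is not important for the problem described below):

```sh
# Sketch of how the ZVOL ends up on /blob; ext4 is an assumption here,
# the block device path is the standard /dev/zvol/<pool>/<volume> link.
mkfs.ext4 /dev/zvol/rpool/zvol-test
mkdir -p /blob
mount /dev/zvol/rpool/zvol-test /blob
```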
`sdc` contains a partition with a normal ext4 file system which is mounted on `/mnt`. That file system contains just several dozen ISO files (average size about 6 GB).
On that machine, nothing runs besides the standard services the distribution installs. Notably, there is no VM running and nothing else that could produce a substantial workload.
In this state, when starting `watch -n 1 zpool iostat -ylv 1 1` and watching it for a while, there is indeed nearly no load on the ZFS disks. Every few seconds or so, some kilobytes hit the VDEV, which is expected.
Copying to the dataset (not the ZVOL): No problem
Now we open the iostats in one terminal window (`watch -n 1 zpool iostat -ylv 1 1`) and start to copy ISO files from `sdc` onto the ZFS dataset `rpool/system` in another terminal window (`rsync --progress /mnt/*.iso ~/test`, where `~/test` is part of the root file system and thus is on `rpool/system`).
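In other words, the whole test consists of just these two commands running side by side:

```sh
# Terminal 1: per-vdev throughput and latency, refreshed every second
watch -n 1 zpool iostat -ylv 1 1

# Terminal 2: copy the ISO files from the ext4 source disk onto the dataset
rsync --progress /mnt/*.iso ~/test
```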
While the copy runs, `rsync` shows a few drops in bandwidth every now and then, but there are no noticeable holdups, and the drops in bandwidth are short. Likewise, `zpool iostat` shows the two disks in the VDEV being hit with data rates that are to be expected. The swings in disk load reported by `zpool iostat` are surprisingly large, though (the load constantly jumps between something like 30 MB/s and 300 MB/s), but there are no real holdups either. In summary, the copy runs at over 100 MB/s on average and does not stall for any longer time.
We have interrupted that test after 30 GB or so because we didn't expect anything new from letting it run longer. However, we repeated it several times, each time copying different ISO files and rebooting beforehand. The behavior was the same each time.
Copying to the ZVOL: Problem
When we do exactly the same thing, but copy to the ZVOL instead of the dataset (`rsync --progress /mnt/*.iso /blob`), the situation changes. `rsync` initially shows the copy running at roughly 190 MB/s for a few seconds, then it stalls. Thereafter it copies again for a few seconds at the rate denoted above, then stalls again, and so on.
The problem is that the holdups last for a long time during which absolutely nothing happens, up to several minutes (!). However, `zpool iostat` shows that the two ZFS disks are under heavy load during this time, constantly (more or less) being hit with over 100 MB/s. Even when we interrupt the copy by hitting Ctrl-C in the terminal window where `rsync` runs, this high load lasts for several minutes until everything returns to normal.
There must be extreme write amplification somewhere, the amplification factor being somewhere between 5 and 10. For example, if we copy 40 GB that way, this would normally take about 5 minutes. But actually it takes at least half an hour, although the ZFS disks are under heavy load all that time.
For that reason, ZVOLs are currently just not usable for us, which imposes a major problem. What could be going on there?
Our own thoughts and what we have tried already:
First, we'd like to stress again that the ZVOL test did not happen within a VM. The problem is definitely not due to QEMU or to (para)virtualized data transfer.
Secondly, we are aware that it might not be the best idea to run ZFS on disks which are attached to a RAID controller like the LSI 9361-8i, or to run it on hybrid disks like the ones we have. However, we have configured that controller for JBOD mode, and the OS sees the disks as individual drives, as expected. The key point regarding possible hardware problems is that copying large amounts of data to the ZFS dataset (`rpool/system`) works as expected. If the problems with the ZVOL were due to hardware, we would see the same problems with the dataset; this is not the case, though.
Thirdly, the problem is not due to the ZFS version. Debian buster comes with ZoL 0.7.12, and we originally noticed the problem there. We desperately need working ZVOLs, so we installed OpenZFS 2.0.0 on that machine, which changed exactly nothing with respect to this problem.
As a further test, we created the ZVOL with `volblocksize=512` and ran the tests again. Again, nothing changed. We repeated the process with `volblocksize` values of `8192`, `16384` and `128k`. Again, no luck: maybe it stalled a few seconds earlier or later, or for a longer or shorter time in each test compared to the others, but the general picture remained the same. Between the stalls, the copy ran for a few seconds at the expected speed, then it stalled for many seconds, mostly even a few minutes, while `zpool iostat` was showing a constant data rate of roughly 100 MB/s on each disk, and so on. After interrupting the copy, both ZFS disks continued to be hit with a data rate of 100 MB/s or more for several minutes.
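Each of these runs re-created the test ZVOL from scratch, roughly like this (shown here for 16k; `volblocksize` cannot be changed after creation, and the remaining properties match the creation command above):

```sh
# Re-create the test ZVOL with a different volblocksize (16k as an example);
# volblocksize is fixed at creation time, so the old volume is destroyed first.
zfs destroy rpool/zvol-test
zfs create -b 16384 -o checksum=on -o compression=off -o primarycache=metadata \
  -o redundant_metadata=all -o secondarycache=none -o logbias=latency \
  -o snapdev=hidden -o sync=standard -V 100G rpool/zvol-test
```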
Then we tested the ZVOL with `sync=disabled`. That didn't change anything. The same goes for `primarycache=all` (instead of `metadata`) (but at least this was expected), and for `logbias=throughput` (instead of `latency`).
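These properties were changed on the existing test ZVOL roughly as follows:

```sh
# Property changes tried on the test ZVOL; none of them affected the stalls.
zfs set sync=disabled rpool/zvol-test
zfs set primarycache=all rpool/zvol-test
zfs set logbias=throughput rpool/zvol-test
```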
Next, we thought that it might have something to do with the physical sector size of the ZFS disks being 512 bytes while the pool (and the VDEV) had `ashift=12`. Therefore, we destroyed the pool, re-created it with `ashift=9`, re-created all file systems / datasets as described above, and ran all tests again. Once again, this didn't change anything.
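The re-created pool differed from the original one only in the ashift value, i.e. roughly:

```sh
# Same pool layout as before, but with 512-byte sectors assumed (ashift=9)
zpool create -o ashift=9 -o altroot=/mnt -O acltype=posixacl -O canmount=off \
  -O checksum=on -O compression=off -O mountpoint=none -O sharesmb=off \
  -O sharenfs=off -O xattr=sa \
  rpool mirror /dev/disk/by-id/ata-ST4000...-part1 /dev/disk/by-id/ata-ST4000...-part1
```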
We then went back to the original pool with `ashift=12` and used it for the further tests. At this point, we were out of ideas about what to do next, so we read up on the ZFS I/O scheduler and tested a large number of combinations of `zfs_dirty_data_max`, `zfs_delay_scale`, `zfs_vdev_async_write_max_active`, `zfs_vdev_async_write_min_active`, `zfs_vdev_async_write_active_max_dirty_percent`, and `zfs_vdev_async_write_active_min_dirty_percent`.
To our surprise, the last five of these barely influenced the behavior. However, the first one (`zfs_dirty_data_max`), which originally was set to 4 GB, changed the situation when we set it to a low value, e.g. 512 MB. The improvement was that there were fewer long-lasting holdups: there were actually more holdups, but all of them were so short that it became acceptable. However, the average data rate did not increase, because the transfer rate `rsync` reported was now limited to about 30 MB/s, mostly hovering around 10 MB/s or 20 MB/s. There were no phases with high data rates any more.
So the copying was more "responsive" with low values of `zfs_dirty_data_max`, but that didn't help because the data rate per se was drastically limited. In summary, changing the I/O scheduler parameters explained in the documentation referred to above did not lead anywhere.
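These tunables can be inspected and changed at runtime via the module parameter files; the 512 MB example mentioned above would look roughly like this (`zfs_dirty_data_max` is in bytes):

```sh
# Lower the dirty data limit to 512 MB (value in bytes) at runtime
echo $((512 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max

# The other write-throttle / scheduler tunables we varied can be read the same way
cat /sys/module/zfs/parameters/zfs_delay_scale
cat /sys/module/zfs/parameters/zfs_vdev_async_write_min_active
cat /sys/module/zfs/parameters/zfs_vdev_async_write_max_active
cat /sys/module/zfs/parameters/zfs_vdev_async_write_active_min_dirty_percent
cat /sys/module/zfs/parameters/zfs_vdev_async_write_active_max_dirty_percent
```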
The last thing we looked into was `zfs_txg_timeout`. Setting it to a lower value didn't improve the situation when copying to the ZVOL (but increased the load on the ZFS disks when the system was completely idle). Setting it to a higher value didn't improve copying either (but reduced the load on the ZFS disks when the system was idle).
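`zfs_txg_timeout` is the maximum number of seconds between transaction group syncs (5 by default); for these tests it can be changed on the fly, roughly like this:

```sh
# zfs_txg_timeout: maximum seconds between txg syncs (default 5);
# can be changed at runtime for testing
cat /sys/module/zfs/parameters/zfs_txg_timeout
echo 10 > /sys/module/zfs/parameters/zfs_txg_timeout
```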
Now we are completely out of ideas. We probably could look into other parameters of the ZFS module (`/sys/module/zfs/parameters`) or the disk drivers (`/sys/block/sdx`). But this would be just wild guessing and a waste of time. Therefore, we are hoping that somebody is willing to give us some hints.
What we did not try, and why not
`zfs_arc_max` is set to 4 GB on that system (how we set it is sketched after this list), and we did not test larger values for the following reasons:
- That parameter is about reading, not writing, and the copy source is an ext4 partition of a physical disk, so no ZFS parameter would have any effect on the copy source.
- We clearly have a problem with writing here, not with reading (remembering that copying to the normal dataset (not the ZVOL) works normally).
- When we began working with ZFS some years ago, the first thing we had to solve was a system which started normally, but then became totally unresponsive and finally totally locked up within minutes. The cause of that problem was that ZFS was eating up all available RAM for its ARC cache until the machine crashed or hung. Since then, we always limit the ARC size (and never ever had any stability issues or crashes with ZFS again).
- Our goal is to run a bunch of VMs with ZVOL storage (the tests described above are just, eehm, tests before we put even more effort into switching completely to ZFS). The number of VMs and the memory they will be given is precisely known. It would not make any sense to test larger ARC sizes, because the ARC size at the end of the day couldn't be much larger than 4 GB.
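For completeness, the ARC limit is applied via the usual module option; in our setup it looks roughly like this (the file name under `/etc/modprobe.d` is our choice):

```sh
# Limit the ARC to 4 GiB (value in bytes); applied via a module option,
# e.g. in /etc/modprobe.d/zfs.conf (file name is our choice):
#   options zfs zfs_arc_max=4294967296
# The current limit can be inspected at runtime:
cat /sys/module/zfs/parameters/zfs_arc_max
```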
We did not try to use a secondary cache (L2ARC). Again, the copy source is not on ZFS, so an L2ARC wouldn't make any sense here; furthermore, we have a writing problem, not a reading problem.
We did not try to use an SLOG. This would not make any sense, because one of our tests was to set `sync=disabled` on the copy destination ZVOL, and this did not change the observed behavior in the slightest. Therefore, we know that our problem is not due to sync writes, and thus an SLOG wouldn't improve the behavior.
Describe how to reproduce the problem
Install a system similar to the one described above, issue the commands described above, and watch the long-lasting holdups in the terminal window where `rsync` runs and the heavy disk load that `zpool iostat` shows in the other terminal window, leading to high disk wear and low bandwidth.
Since it is not easy to set up a system like ours, we are willing to give remote access to one of these systems if somebody is interested in investigating the problem. In this case, please leave a comment stating how we can get into contact.
Include any warning/errors/backtraces from the system logs
If somebody tells us what exactly is needed here, we'll provide it immediately :-). We guess `zpool iostat` or other tools produce output that is more valuable than the log files, but since we are neither Linux nor ZFS experts, we are a bit lost here. Notably, we don't know how to operate `dtrace` or `strace` properly. If somebody tells us what to do, we'll try our best.
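In the meantime, here is roughly what we could capture during one of the stalls, if that is useful (the selection of commands is just our guess at what might be relevant):

```sh
# Rough snapshot of system state during a stall (our guess at what might be relevant)
dmesg | tail -n 100                       # recent kernel messages, hung-task warnings
zpool events -v | tail -n 50              # recent ZFS events
cat /proc/spl/kstat/zfs/arcstats          # ARC statistics
cat /proc/spl/kstat/zfs/dmu_tx            # transaction assign/delay counters
cat /proc/spl/kstat/zfs/rpool/txgs        # per-txg timings for the pool
# Stacks of blocked tasks (requires sysrq to be enabled):
echo w > /proc/sysrq-trigger && dmesg | tail -n 200
```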