Description
System information
Type | Version/Name
--- | ---
Distribution Name | Proxmox 8.04
Kernel Version | 6.2.16-12-pve
Architecture | Linux x64
OpenZFS Version | zfs-2.1.12-pve1
Drives | 8x WDC connected through SATA
Describe the problem you're observing
There is an old issue which partly relates to this, but I think it has never been classified as what it actually is: a bug, and what is worse, one that leads to data destruction.
Just to reiterate what I wrote about this here (#10094 (comment)): I have a Linux box with 8 WDC 18 TByte SATA drives, 4 of which are connected through the mainboard controllers (AMD FCH variants) and 4 through an ASMEDIA ASM1166. They form a raidz2 pool running under Proxmox with a 6.2 kernel. During my nightly backups, the drives would regularly fail (sometimes "degraded" and sometimes "failed"), and errors showed up in the system log, more often than not "unaligned write" errors.
First thing to note is that one poster in that thread mentioned that the "unaligned write" message is itself the result of a bug in libata, in that "other" errors are mapped onto it in the SCSI translation code (https://lore.kernel.org/all/[email protected]/). Thus, the actual error message is meaningless.
In the old issue, several possible remedies were offered, such as:
1. Faulty SATA cables (I replaced them all, no change, but I admit this could be the problem in some cases)
2. Faulty disks (mine were known to be good, and the errors were randomly distributed among them anyway)
3. Power saving on the SATA link or the PCI bus (disabling this did not help)
4. Problematic controllers (the FCH and the ASM1166 chips as well as a JMB585 all showed the same behaviour)
5. Limiting SATA speed to 3.0 Gbps or even to 1.5 Gbps (3.0 Gbps did not help, and was not even possible with the ASM1166, as the speed was always reset to 6.0 Gbps, but I could test it with the FCH and JMB585 controllers)
6. Disabling NCQ (guess what, this helped! See the sketch right after this list.)
7. Replacing the SATA controllers with an LSI 9211-8i (I guess this would have helped, as others have reported, because it probably does not use NCQ)
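For reference, here is a minimal sketch of how I tested remedy 6: NCQ can be switched off per drive by forcing the SATA queue depth to 1 via sysfs. The drive names are an assumption (adjust them to your pool members), and the setting does not survive a reboot.

```python
#!/usr/bin/env python3
"""Disable NCQ on the pool members by forcing the SATA queue depth to 1."""
from pathlib import Path

DRIVES = ["sda", "sdb", "sdc", "sdd", "sde", "sdf", "sdg", "sdh"]  # assumption: adjust to your system

for dev in DRIVES:
    qd = Path(f"/sys/block/{dev}/device/queue_depth")
    if not qd.exists():
        print(f"{dev}: no queue_depth attribute, skipping")
        continue
    old = qd.read_text().strip()
    qd.write_text("1\n")  # queue depth 1 = no command queueing (NCQ off)
    print(f"{dev}: queue_depth {old} -> {qd.read_text().strip()}")
```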
I am 99% sure that it boils down to a bad interaction between OpenZFS and libata with NCQ enabled, and I have a theory why this is so:
When you look at how NCQ works, it is a queue of up to 32 tasks (or, to be exact, 31 for implementation reasons) that can be handed to the disk drive. Those tasks can be handled in any order by the drive hardware, e.g. in order to minimize seek times. Thus, when you give the drive 3 tasks, like "read sectors 1, 42 and 2", the drive might decide to reorder them and read sector 42 last, saving one seek operation in the process.
Now imagine a time of high I/O pressure, like during my nightly backups. OpenZFS has some queues of its own, from which tasks are handed to the drives, and for each task started, OpenZFS expects a result (but in no particular order). However, whenever a task completes, it opens up a slot in the NCQ queue, which is immediately filled with another task because of the high I/O pressure. That means that sector 42 could potentially never be read at all, provided that other tasks keep being prioritized higher by the drive hardware.
I believe this is exactly what is happening: if one task's result is not received within the expected time frame, a timeout or an unspecific error occurs, which is then reported as "unaligned write".
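To make the starvation argument concrete, here is a toy model (purely illustrative; this is not how libata or any real drive firmware actually schedules requests): a drive that always services the request nearest to its current head position, while the host immediately refills every freed queue slot, can postpone one unlucky request past any timeout.

```python
import random

QUEUE_DEPTH = 31   # usable NCQ slots
TIMEOUT = 500      # host gives up after this many completions
random.seed(1)

head = 500_000     # current head position (arbitrary)
victim = 42        # the unlucky, far-away request
# Fill the queue with requests clustered around the head, plus the victim.
queue = {victim} | {head + random.randrange(-5_000, 5_000) for _ in range(QUEUE_DEPTH - 1)}

for completed in range(1, TIMEOUT + 1):
    pick = min(queue, key=lambda s: abs(s - head))  # toy "shortest seek first" policy
    queue.remove(pick)
    head = pick
    if pick == victim:
        print(f"sector {victim} serviced after {completed} completions")
        break
    # High I/O pressure: the freed slot is refilled at once, and the new
    # request is typically close to wherever the head currently is.
    queue.add(head + random.randrange(-5_000, 5_000))
else:
    print(f"sector {victim} still pending after {TIMEOUT} completions -> timeout")
```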
IMHO, this is the result of putting one (or more) queues within OpenZFS in front of a smaller hardware queue (i.e. NCQ).
It explains why both remedies 6 and (probably) 7 from the list above cure the problem: without NCQ, every task must be finished before the next one can be started. It also explains why this problem is not as evident with other filesystems; were this a general problem with libata, it would have been fixed long ago.
I would even guess that reducing SATA speed to 1.5 Gbps would help (one user reported this). I bet this is simply because the resulting throughput of ~150 MByte/s is somewhat lower than what modern hard disks can sustain, so the disk can always finish tasks before the next one is started, whereas 3 Gbps is still faster than modern spinning rust.
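Some back-of-the-envelope arithmetic behind that guess (SATA uses 8b/10b encoding, so roughly 80% of the line rate is payload; the ~270 MByte/s sustained rate for a modern 18 TB drive is my assumption):

```python
# Effective SATA payload bandwidth vs. an assumed large-drive sequential rate.
DRIVE_SEQ_MB_S = 270  # assumption: sustained rate of a modern 18 TB drive
for gbps in (1.5, 3.0, 6.0):
    payload_mb_s = gbps * 1e9 * 0.8 / 8 / 1e6  # 8b/10b encoding, bits -> bytes
    relation = "slower than" if payload_mb_s < DRIVE_SEQ_MB_S else "faster than"
    print(f"SATA {gbps} Gbps -> ~{payload_mb_s:.0f} MB/s payload, {relation} the drive")
```

Only the 1.5 Gbps link stays below what the drive itself can deliver, which would explain why that setting, and only that setting, behaves like disabling NCQ.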
If I am right, two things should be considered:
a. The problem should be analysed and fixed in a better way than just disabling NCQ, for example by throttling the libata NCQ queue when pressure gets too high, just before errors are thrown. This would give the drive time to finish existing tasks.
b. There should be a warning or some kind of automatism to disable NCQ for OpenZFS for the time being.
I also think that the performance impact of disabling NCQ with OpenZFS is probably negligible, because OpenZFS has prioritized queues for different operations anyway.
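For what it's worth, those per-vdev queue limits can be inspected through the module parameters. A small read-only sketch (parameter names can vary slightly between OpenZFS versions, and the path assumes the zfs module is loaded):

```python
from pathlib import Path

# List the OpenZFS per-vdev I/O scheduler limits, e.g. zfs_vdev_max_active
# and the per-class *_min_active / *_max_active tunables.
params = Path("/sys/module/zfs/parameters")
for p in sorted(params.glob("zfs_vdev_*active*")):
    print(f"{p.name} = {p.read_text().strip()}")
```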
Describe how to reproduce the problem
Create a raidz2 pool and copy a large number of files to it, preferably from a fast source like an NVMe disk.
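For illustration, this is roughly the kind of parallel copy load that triggers it for me; the source and destination paths are placeholders, and a plain `cp -a` or rsync of a few terabytes from an NVMe source behaves the same.

```python
#!/usr/bin/env python3
"""Copy a large directory tree onto the raidz2 pool with several workers
to keep the per-drive queues saturated. SRC and DST are placeholders."""
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

SRC = Path("/mnt/nvme/backup-source")  # assumption: fast NVMe source
DST = Path("/tank/backup")             # assumption: dataset on the raidz2 pool

files = [p for p in SRC.rglob("*") if p.is_file()]
with ThreadPoolExecutor(max_workers=8) as pool:
    for f in files:
        target = DST / f.relative_to(SRC)
        target.parent.mkdir(parents=True, exist_ok=True)
        pool.submit(shutil.copy2, f, target)
print(f"queued {len(files)} files for copy")
```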
Include any warning/errors/backtraces from the system logs
The logged errors are not meaningful because of another bug in the libata/SCSI abstraction layer, see: https://lore.kernel.org/all/[email protected]/