Description
System information
Type | Version/Name |
---|---|
Distribution Name | Debian |
Distribution Version | Buster |
Linux Kernel | 5.10.0-0.bpo.5-amd64 |
Architecture | amd64 |
ZFS Version | 2.0.3-1~bpo10+1 |
SPL Version | 2.0.3-1~bpo10+1 |
Describe the problem you're observing
Since upgrading to 2.0.x and enabling crypto, every week or so, I start to have issues with my zfs send/receive-based backups. Upon investigating, I will see output like this:
zpool status -v
pool: rpool
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 0B in 00:03:37 with 0 errors on Mon May 3 16:58:33 2021
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
nvme0n1p7 ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
<0xeb51>:<0x0>
Of note, the <0xeb51> is sometimes a snapshot name; if I zfs destroy the snapshot, it is replaced by this tag.
Bug #11688 implies that zfs destroy on the snapshot and then a scrub will fix it. For me, it did not. If I run a scrub without rebooting after seeing this kind of zpool status output, I get the following in very short order, and the scrub (and eventually much of the system) hangs:
[393801.328126] VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
[393801.328129] PANIC at arc.c:3790:arc_buf_destroy()
[393801.328130] Showing stack for process 363
[393801.328132] CPU: 2 PID: 363 Comm: z_rd_int Tainted: P U OE 5.10.0-0.bpo.5-amd64 #1 Debian 5.10.24-1~bpo10+1
[393801.328133] Hardware name: Dell Inc. XPS 15 7590/0VYV0G, BIOS 1.8.1 07/03/2020
[393801.328134] Call Trace:
[393801.328140] dump_stack+0x6d/0x88
[393801.328149] spl_panic+0xd3/0xfb [spl]
[393801.328153] ? __wake_up_common_lock+0x87/0xc0
[393801.328221] ? zei_add_range+0x130/0x130 [zfs]
[393801.328225] ? __cv_broadcast+0x26/0x30 [spl]
[393801.328275] ? zfs_zevent_post+0x238/0x2a0 [zfs]
[393801.328302] arc_buf_destroy+0xf3/0x100 [zfs]
[393801.328331] arc_read_done+0x24d/0x490 [zfs]
[393801.328388] zio_done+0x43d/0x1020 [zfs]
[393801.328445] ? zio_vdev_io_assess+0x4d/0x240 [zfs]
[393801.328502] zio_execute+0x90/0xf0 [zfs]
[393801.328508] taskq_thread+0x2e7/0x530 [spl]
[393801.328512] ? wake_up_q+0xa0/0xa0
[393801.328569] ? zio_taskq_member.isra.11.constprop.17+0x60/0x60 [zfs]
[393801.328574] ? taskq_thread_spawn+0x50/0x50 [spl]
[393801.328576] kthread+0x116/0x130
[393801.328578] ? kthread_park+0x80/0x80
[393801.328581] ret_from_fork+0x22/0x30
However, I want to stress that this backtrace is not the original cause of the problem; it only appears if I do a scrub without first rebooting.
After that panic, the scrub stalled -- and a second error appeared:
zpool status -v
pool: rpool
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub in progress since Sat May 8 08:11:07 2021
152G scanned at 132M/s, 1.63M issued at 1.41K/s, 172G total
0B repaired, 0.00% done, no estimated completion time
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
nvme0n1p7 ONLINE 0 0 0
errors: Permanent errors have been detected in the following files:
<0xeb51>:<0x0>
rpool/crypt/debian-1/home/jgoerzen/no-backup@[elided]-hourly-2021-05-07_02.17.01--2d:<0x0>
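To catch this state early (for example from a backup script), the error list can be pulled out of the zpool status -v text. The following is only a minimal sketch based on the output format shown above; the function names parse_permanent_errors and is_unnamed_dataset are hypothetical helpers, not part of any ZFS tooling, and the parsing assumes the entries follow the "errors: Permanent errors" line as they do in this report.

```python
import re

# Matches entries such as "<0xeb51>:<0x0>", where the dataset is shown only
# as a hex object ID instead of a dataset or snapshot name.
HEX_DATASET = re.compile(r"^<0x[0-9a-fA-F]+>:")


def parse_permanent_errors(status_text):
    """Return the entries listed after the 'errors: Permanent errors' line.

    Assumption: the text has the layout shown in this report. Collection
    stops if another 'pool: ' header is encountered.
    """
    entries = []
    in_errors = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("errors: Permanent errors"):
            in_errors = True
            continue
        if in_errors and stripped.startswith("pool: "):
            in_errors = False
            continue
        if in_errors and stripped:
            entries.append(stripped)
    return entries


def is_unnamed_dataset(entry):
    """True if the entry names its dataset only by hex object ID."""
    return HEX_DATASET.match(entry) is not None
```

An entry for which is_unnamed_dataset returns True corresponds to the case described above, where the snapshot no longer has a name and only the tag remains.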
I have found that the solution to this issue is to reboot into single-user mode and run a scrub. Sometimes it takes several scrubs, maybe even with some reboots in between, but eventually this clears up the issue. If I reboot before scrubbing, I do not get the panic or the hung scrub.
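The repeated-scrub part of that workaround could be scripted roughly as follows. This is a hedged sketch, not something I have run as-is: the function name scrub_until_clean and the max_rounds parameter are hypothetical, it assumes the installed zpool supports scrub -w (wait for completion), and it assumes a clean pool reports "errors: No known data errors". It does not handle the reboots that are sometimes needed in between, and it should only be started after a fresh reboot, given the panic described above.

```python
import subprocess


def scrub_until_clean(pool, max_rounds=3, run=subprocess.run):
    """Scrub `pool` up to `max_rounds` times.

    Returns the number of the round after which zpool status reported no
    errors, or None if errors were still listed after the last round.
    `run` is injectable so the loop can be exercised without a real pool.
    """
    for round_no in range(1, max_rounds + 1):
        # Assumption: 'zpool scrub -w' blocks until the scrub finishes.
        run(["zpool", "scrub", "-w", pool], check=True)
        status = run(
            ["zpool", "status", "-v", pool],
            check=True,
            capture_output=True,
            text=True,
        ).stdout
        if "errors: No known data errors" in status:
            return round_no
    return None
```

A call such as scrub_until_clean("rpool") returning None would mean the errors persisted, which matches the cases where a further reboot was needed.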
I run this same version of ZoL on two other machines, one of which runs this same kernel version. What is unique about this machine?
- It is a laptop
- It uses ZFS crypto (the others use LUKS)
I made a significant effort to rule out hardware issues, including running several memory tests and the built-in Dell diagnostics. I believe I have ruled that out.
Describe how to reproduce the problem
I can't reproduce it at will. I have to wait a while for it to recur.
Include any warning/errors/backtraces from the system logs
See above
Potentially related bugs
- I already mentioned "permanent errors (ereport.fs.zfs.authentication) reported after syncoid snapshot/send workload" #11688, which seems similar, but a scrub doesn't immediately resolve the issue here
- A quite similar backtrace also involving arc_buf_destroy is in "silent corruption for thousands files gives input/output error but cannot be detected with scrub - at least for openzfs 2.0.0" #11443. The behavior described there has some parallels to what I observe. I am uncertain from the discussion what that means for this.
- In "silent corruption gives input/output error but cannot be detected with scrub, experienced on 0.7.5 and 0.8.3 versions" #10697 there are some similar symptoms, but it looks like a different issue to me