ZFS corruption related to snapshots post-2.0.x upgrade

### System information

Type | Version/Name
 --- | ---
Distribution Name	| Debian
Distribution Version	| Buster
Linux Kernel	| 5.10.0-0.bpo.5-amd64
Architecture	| amd64
ZFS Version	| 2.0.3-1~bpo10+1
SPL Version	| 2.0.3-1~bpo10+1

### Describe the problem you're observing

Since upgrading to 2.0.x and enabling crypto, every week or so, I start to have issues with my zfs send/receive-based backups.  Upon investigating, I will see output like this:

```
zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:03:37 with 0 errors on Mon May  3 16:58:33 2021
config:

	NAME         STATE     READ WRITE CKSUM
	rpool        ONLINE       0     0     0
	  nvme0n1p7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xeb51>:<0x0>
```

Of note, the `<0xeb51>` is sometimes a snapshot name; if I `zfs destroy` the snapshot, it is replaced by this tag.

Bug #11688 implies that zfs destroy on the snapshot and then a scrub will fix it.  For me, it did not.  If I run a scrub **without rebooting** after seeing this kind of `zpool status` output, I get the following in very short order, and the scrub (and eventually much of the system) hangs:

```
[393801.328126] VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
[393801.328129] PANIC at arc.c:3790:arc_buf_destroy()
[393801.328130] Showing stack for process 363
[393801.328132] CPU: 2 PID: 363 Comm: z_rd_int Tainted: P     U     OE     5.10.0-0.bpo.5-amd64 #1 Debian 5.10.24-1~bpo10+1
[393801.328133] Hardware name: Dell Inc. XPS 15 7590/0VYV0G, BIOS 1.8.1 07/03/2020
[393801.328134] Call Trace:
[393801.328140]  dump_stack+0x6d/0x88
[393801.328149]  spl_panic+0xd3/0xfb [spl]
[393801.328153]  ? __wake_up_common_lock+0x87/0xc0
[393801.328221]  ? zei_add_range+0x130/0x130 [zfs]
[393801.328225]  ? __cv_broadcast+0x26/0x30 [spl]
[393801.328275]  ? zfs_zevent_post+0x238/0x2a0 [zfs]
[393801.328302]  arc_buf_destroy+0xf3/0x100 [zfs]
[393801.328331]  arc_read_done+0x24d/0x490 [zfs]
[393801.328388]  zio_done+0x43d/0x1020 [zfs]
[393801.328445]  ? zio_vdev_io_assess+0x4d/0x240 [zfs]
[393801.328502]  zio_execute+0x90/0xf0 [zfs]
[393801.328508]  taskq_thread+0x2e7/0x530 [spl]
[393801.328512]  ? wake_up_q+0xa0/0xa0
[393801.328569]  ? zio_taskq_member.isra.11.constprop.17+0x60/0x60 [zfs]
[393801.328574]  ? taskq_thread_spawn+0x50/0x50 [spl]
[393801.328576]  kthread+0x116/0x130
[393801.328578]  ? kthread_park+0x80/0x80
[393801.328581]  ret_from_fork+0x22/0x30
```

However I want to stress that this backtrace is not the original **cause** of the problem, and it only appears if I do a scrub without first rebooting.

After that panic, the scrub stalled -- and a second error appeared:

```
zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Sat May  8 08:11:07 2021
	152G scanned at 132M/s, 1.63M issued at 1.41K/s, 172G total
	0B repaired, 0.00% done, no estimated completion time
config:

	NAME         STATE     READ WRITE CKSUM
	rpool        ONLINE       0     0     0
	  nvme0n1p7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xeb51>:<0x0>
        rpool/crypt/debian-1/home/jgoerzen/no-backup@[elided]-hourly-2021-05-07_02.17.01--2d:<0x0>
```

I have found the solution to this issue is to reboot into single-user mode and run a scrub.  Sometimes it takes several scrubs, maybe even with some reboots in between, but eventually it will clear up the issue.  If I reboot before scrubbing, I do not get the panic or the hung scrub.

I run this same version of ZoL on two other machines, one of which runs this same kernel version. What is unique about this machine?

- It is a laptop
- It uses ZFS crypto (the others use LUKS)

I made a significant effort to rule out hardware issues, including running several memory tests and the built-in Dell diagnostics.  I believe I have rules that out.

### Describe how to reproduce the problem

I can't at will.  I have to wait for a spell.

### Include any warning/errors/backtraces from the system logs

See above

### Potentially related bugs

- I already mentioned #11688 which seems similar, but a scrub doesn't immediately resolve the issue here
- A quite similar backtrace also involving `arc_buf_destroy` is in #11443.  The behavior described there has some parallels to what I observe.  I am uncertain from the discussion what that means for this.
- In #10697 there are some similar symptoms, but it looks like a different issue to me

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ZFS corruption related to snapshots post-2.0.x upgrade #12014

System information

Describe the problem you're observing

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

Potentially related bugs

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Type	Version/Name
Distribution Name	Debian
Distribution Version	Buster
Linux Kernel	5.10.0-0.bpo.5-amd64
Architecture	amd64
ZFS Version	2.0.3-1~bpo10+1
SPL Version	2.0.3-1~bpo10+1

ZFS corruption related to snapshots post-2.0.x upgrade #12014

Description

System information

Describe the problem you're observing

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

Potentially related bugs

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions