### Description
### System information

Type | Version/Name
--- | ---
Distribution Name | Ubuntu
Distribution Version | 20.04
Kernel Version | 5.15.0-67
Architecture | x86_64
OpenZFS Version | zfs-2.2.99
### Describe the problem you're observing
After upgrading a system which did not have the `head_errlog` feature (introduced in 0409d33), and which had an existing scrub I/O error recorded, to a version which did have this feature, the system entered a panic loop during a scrub.

When a pool transitions to a state where the `head_errlog` feature is enabled, any persistent errlog state is converted to a new format. This involves deleting the old persistent errlog objects (`errlog_last` and `errlog_scrub`) and replacing them with new ones. While the in-memory state (`spa_errlog_last` and `spa_errlog_scrub`) is updated, the persistent pointers to these objects in the MOS directory object are not. Those pointers are currently only updated in `spa_errlog_sync()` when it is called during the sync phase (and there is "something to do").
### Describe how to reproduce the problem
The following sequence will reproduce the issue:
Start with a pool which does not have the `head_errlog` feature (this was introduced in release 6.0.15.0).

```shell
$ zpool create -o feature@head_errlog=disabled test_pool sdb
```
A data error occurs in the pool at some point and is discovered via a scrub. This results in the `errlog_last` entry in the pool's MOS directory having a non-zero value (the object ID of a ZAP object that records the error block). Note that on pool import the entries from the MOS directory object are used to populate fields in the spa (e.g. `errlog_last` is used to initialize `spa_errlog_last`).
```shell
$ dd if=/dev/urandom of=/test_pool/file bs=128K count=1000
$ zinject -b 36:2:0:0:1 test_pool
$ zpool scrub test_pool
$ sudo zdb -dddd test_pool 1
```
```
Dataset mos [META], ID 0, cr_txg 4, 141K, 61 objects, rootbp DVA[0]=<0:c0016200:200> DVA[1]=<0:e0016200:200> DVA[2]=<0:100022800:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique triple size=1000L/200P birth=89L/89P fill=61 cksum=0000000961588255:000003cffd0c0ccd:0000c9dbf729c1f0:001c559859613442

    Object  lvl  iblk  dblk  dsize  dnsize  lsize  %full  type
         1    1  128K   16K  13.5K     512    32K  100.00 object directory
        dnode flags: USED_BYTES
        dnode maxblkid: 1
        Fat ZAP stats:
                Pointer table:
                        1024 elements
                        zt_blk: 0
                        zt_numblks: 0
                        zt_shift: 10
                        zt_blks_copied: 0
                        zt_nextblk: 0
                ZAP entries: 19
                Leaf blocks: 1
                Total blocks: 2
                zap_block_type: 0x8000000000000001
                zap_magic: 0x2f52ab2ab
                zap_salt: 0x1df1dddb
                Leafs with 2^n pointers:
                        9:      1 *
                Blocks with n*5 entries:
                        3:      1 *
                Blocks n/10 full:
                        2:      1 *
                Entries with n chunks:
                        3:     13 *************
                        4:      3 ***
                        5:      1 *
                        6:      1 *
                        7:      0
                        8:      0
                        9:      1 *
                Buckets with n entries:
                        0:    494 ****************************************
                        1:     17 **
                        2:      1 *

                creation_version = 5000
                feature_descriptions = 53
                sync_bplist = 62
                org.illumos:checksum_salt = 8b469083295342722ca32f7963150eacd1153da3fa0471e12d59130d73b853d2
                scan = 1 2 0 0 49 1 49 1694731576 1694731576 131278336 131278336 44544 0 1 1 3 0 0 0 -1 0 0 0 0
                free_bpobj = 41
                feature_enabled_txg = 63
                deflate = 1
                errlog_scrub = 0
                root_dataset = 32
                com.delphix:log_spacemap_zap = 69
                config = 61
                error_scrub = 0 0 0 0 0 0 0 0 0
                history = 60
                features_for_write = 52
                features_for_read = 51
                com.delphix:bookmark_v2_recalculated = 1
                com.delphix:vdev_zap_map = 64
                errlog_last = 89
```
Upgrade the pool to a later version (e.g., 7.0.0.0.0). This causes `spa_errlog_upgrade()` to be called, which transfers the entries from the old object and deletes it (replacing the value of `spa_errlog_last`). Note that this function does not update the MOS directory object, so the directory continues to reference the old (now deleted) object.
```shell
$ zpool set feature@head_errlog=enabled test_pool
$ sdb
sdb> spa
ADDR               NAME
------------------------------------------------------------
0xffff8968924e4000 rpool
0xffff89690d298000 test_pool
sdb> addr 0xffff89690d298000 | cast spa_t * | member spa_errlog_last
(uint64_t)129
```
Before another scrub, reboot the system. This causes the spa to be initialized with now-stale data from the MOS directory object. At this point `spa_errlog_last` references a deleted object, and the MOS directory object continues to reference that deleted object as well.
```shell
$ reboot
...
$ sdb
sdb> spa
ADDR               NAME
------------------------------------------------------------
0xffff9cc08c598000 rpool
0xffff9cc096e6c000 test_pool
sdb> addr 0xffff9cc096e6c000 | cast spa_t * | member spa_errlog_last
(uint64_t)89
```
Start a new scrub. When we attempt to rotate the new scrub data (from `spa_errlog_scrub`) into `spa_errlog_last`, we detect a non-zero value in `spa_errlog_last` and attempt to free that (already freed) object. This triggers the panic loop.
```shell
$ zpool scrub test_pool
```
### Include any warning/errors/backtraces from the system logs
Panic stack:

```
[   34.473834] VERIFY0(0 == dmu_object_free(spa->spa_meta_objset, spa_err_obj, tx)) failed (0 == 2)
[   34.475480] PANIC at spa_errlog.c:1139:delete_errlog()
[   34.476432] Kernel panic - not syncing: VERIFY0(0 == dmu_object_free(spa->spa_meta_objset, spa_err_obj, tx)) failed (0 == 2)
[   34.478437] CPU: 1 PID: 1059 Comm: txg_sync Tainted: P           OE     5.4.0-135-dx2023011005-db2b50090-generic #152
[   34.480332] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
[   34.482249] Call Trace:
[   34.482712]  dump_stack+0x6d/0x8b
[   34.483327]  panic+0x101/0x2e3
[   34.483894]  spl_panic+0xcf/0x102 [spl]
[   34.487189]  delete_errlog+0x124/0x130 [zfs]
[   34.496503]  spa_errlog_sync+0x256/0x260 [zfs]
[   34.499310]  spa_sync_iterate_to_convergence+0xe5/0x250 [zfs]
[   34.500415]  spa_sync+0x2f7/0x670 [zfs]
[   34.501181]  txg_sync_thread+0x22d/0x2d0 [zfs]
[   34.502988]  thread_generic_wrapper+0x83/0xa0 [spl]
[   34.503889]  kthread+0x104/0x140
[   34.506175]  ret_from_fork+0x1f/0x40
```