Description
System information
Type | Version/Name |
---|---|
Distribution Name | Red Hat Enterprise Linux and Rocky Linux |
Distribution Version | RHEL 8.6 / Rocky 8.4 |
Kernel Version | 4.18+ kernels |
Architecture | x86_64 |
OpenZFS Version | zfs-2.1.11 |
Describe the problem you're observing
In rare instances, during a single drive failure or offline, we see both a sequential and a healing resilver initiated at the same time. Once the healing resilver completes (with errors), the rebuild gets canceled.
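One quick way to see which resilver types are being kicked off is to follow the verbose event stream while this happens; a "sequential" and a "healing" `resilver_start` appearing back to back is the signature of this issue. A minimal sketch (the pool name is just an example):

```sh
# Follow the verbose ZFS event stream and print only the resilver start
# events together with their reported type ("sequential" vs "healing").
zpool events -vf pool-oss0 | grep -E 'resilver_start|resilver_type'
```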
zpool-history:
Snip of `zpool history -il` from the problematic zpool. Both resilvers started at the same time, in the same txg, and ran for around 1 hour 20 minutes.
[txg:2610151] rebuild vdev_id=0 vdev_guid=17410720669444886704 started
[txg:2610151] scan setup func=2 mintxg=3 maxtxg=2610154
[txg:2610156] vdev attach spare in vdev=draid2-0-0 for vdev=/dev/disk/by-vdev/enc2-0
[txg:2610841] scan done errors=2
[txg:2610842] rebuild vdev_id=0 vdev_guid=17410720669444886704 canceled
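To narrow a long history down to just the scan-related entries, something like the following can be used (`scan setup func=2` corresponds to a healing resilver, `func=1` to a scrub; the pool name is an example):

```sh
# Show only the rebuild and scan entries from the internal pool history so
# that a rebuild and a healing scan starting in the same txg stand out.
zpool history -il pool-oss0 | grep -E 'rebuild|scan (setup|done)'
```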
Describe how to reproduce the problem
+ zpool create -f -o cachefile=none -o failmode=panic -O canmount=off pool-oss0 draid1:10d:1s /root/test/files/file1 /root/test/files/file2 /root/test/files/file3 /root/test/files/file4 /root/test/files/file5 /root/test/files/file6 /root/test/files/file7 /root/test/files/file8 /root/test/files/file9 /root/test/files/file10 /root/test/files/file11 /root/test/files/file12 /root/test/files/file13 /root/test/files/file14 /root/test/files/file15 /root/test/files/file16 /root/test/files/file17 /root/test/files/file18 /root/test/files/file19 /root/test/files/file20
+ zfs create -o mountpoint=/mnt/ost0 -o recordsize=4M -o compression=off pool-oss0/ost0
+ zpool status -v
pool: pool-oss0
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
pool-oss0 ONLINE 0 0 0
draid1:10d:20c:1s-0 ONLINE 0 0 0
/root/test/files/file1 ONLINE 0 0 0
/root/test/files/file2 ONLINE 0 0 0
/root/test/files/file3 ONLINE 0 0 0
/root/test/files/file4 ONLINE 0 0 0
/root/test/files/file5 ONLINE 0 0 0
/root/test/files/file6 ONLINE 0 0 0
/root/test/files/file7 ONLINE 0 0 0
/root/test/files/file8 ONLINE 0 0 0
/root/test/files/file9 ONLINE 0 0 0
/root/test/files/file10 ONLINE 0 0 0
/root/test/files/file11 ONLINE 0 0 0
/root/test/files/file12 ONLINE 0 0 0
/root/test/files/file13 ONLINE 0 0 0
/root/test/files/file14 ONLINE 0 0 0
/root/test/files/file15 ONLINE 0 0 0
/root/test/files/file16 ONLINE 0 0 0
/root/test/files/file17 ONLINE 0 0 0
/root/test/files/file18 ONLINE 0 0 0
/root/test/files/file19 ONLINE 0 0 0
/root/test/files/file20 ONLINE 0 0 0
spares
draid1-0-0 AVAIL
errors: No known data errors
+ zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
pool-oss0 18.0G 1.11M 18.0G - - 0% 0% 1.00x ONLINE -
+ zfs list
NAME USED AVAIL REFER MOUNTPOINT
pool-oss0 694K 15.6G 118K /pool-oss0
pool-oss0/ost0 118K 15.6G 118K /mnt/ost0
+ zpool offline -f pool-oss0 /root/test/files/file10
+ rm -f /root/test/files/file10
<Wait for rebuild to complete ...>
pool: pool-oss0
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scan: scrub repaired 0B in 00:00:01 with 0 errors on Fri May 19 12:07:48 2023
scan: resilvered (draid1:10d:20c:1s-0) 81.5K in 00:00:05 with 0 errors on Fri May 19 12:07:47 2023
config:
NAME STATE READ WRITE CKSUM
pool-oss0 DEGRADED 0 0 0
draid1:10d:20c:1s-0 DEGRADED 0 0 0
/root/test/files/file1 ONLINE 0 0 0
/root/test/files/file2 ONLINE 0 0 0
/root/test/files/file3 ONLINE 0 0 0
/root/test/files/file4 ONLINE 0 0 0
/root/test/files/file5 ONLINE 0 0 0
/root/test/files/file6 ONLINE 0 0 0
/root/test/files/file7 ONLINE 0 0 0
/root/test/files/file8 ONLINE 0 0 0
/root/test/files/file9 ONLINE 0 0 0
spare-9 DEGRADED 0 0 0
/root/test/files/file10 FAULTED 0 0 0 external device fault
draid1-0-0 ONLINE 0 0 0
/root/test/files/file11 ONLINE 0 0 0
/root/test/files/file12 ONLINE 0 0 0
/root/test/files/file13 ONLINE 0 0 0
/root/test/files/file14 ONLINE 0 0 0
/root/test/files/file15 ONLINE 0 0 0
/root/test/files/file16 ONLINE 0 0 0
/root/test/files/file17 ONLINE 0 0 0
/root/test/files/file18 ONLINE 0 0 0
/root/test/files/file19 ONLINE 0 0 0
/root/test/files/file20 ONLINE 0 0 0
spares
draid1-0-0 INUSE currently in use
errors: No known data errors
==== rebuild complete. ====
+ zpool offline -f pool-oss0 /root/test/files/file11
+ rm -f /root/test/files/file11
<Wait for rebuild to complete ...>
+ set +x
pool: pool-oss0
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scan: scrub repaired 0B in 00:00:01 with 0 errors on Fri May 19 12:07:48 2023
scan: resilvered (draid1:10d:20c:1s-0) 81.5K in 00:00:05 with 0 errors on Fri May 19 12:07:47 2023
config:
NAME STATE READ WRITE CKSUM
pool-oss0 DEGRADED 0 0 0
draid1:10d:20c:1s-0 DEGRADED 0 0 0
/root/test/files/file1 ONLINE 0 0 0
/root/test/files/file2 ONLINE 0 0 0
/root/test/files/file3 ONLINE 0 0 0
/root/test/files/file4 ONLINE 0 0 0
/root/test/files/file5 ONLINE 0 0 0
/root/test/files/file6 ONLINE 0 0 0
/root/test/files/file7 ONLINE 0 0 0
/root/test/files/file8 ONLINE 0 0 0
/root/test/files/file9 ONLINE 0 0 0
spare-9 DEGRADED 0 0 0
/root/test/files/file10 FAULTED 0 0 0 external device fault
draid1-0-0 ONLINE 0 0 0
/root/test/files/file11 FAULTED 0 0 0 external device fault
/root/test/files/file12 ONLINE 0 0 0
/root/test/files/file13 ONLINE 0 0 0
/root/test/files/file14 ONLINE 0 0 0
/root/test/files/file15 ONLINE 0 0 0
/root/test/files/file16 ONLINE 0 0 0
/root/test/files/file17 ONLINE 0 0 0
/root/test/files/file18 ONLINE 0 0 0
/root/test/files/file19 ONLINE 0 0 0
/root/test/files/file20 ONLINE 0 0 0
spares
draid1-0-0 INUSE currently in use
errors: No known data errors
==== rebuild complete. ====
+ truncate -s 2G /root/test/files/file10 /root/test/files/file11
+ zpool replace -w pool-oss0 /root/test/files/file10
+ sleep 5
+ echo 'Running zpool clear ...'
Running zpool clear ...
+ zpool clear pool-oss0
<At this point both the rebuild (sequential resilver) and the healing resilver are initiated.>
pool: pool-oss0
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: resilver (draid1:10d:20c:1s-0) canceled on Fri May 19 12:08:02 2023
config:
NAME STATE READ WRITE CKSUM
pool-oss0 DEGRADED 0 0 0
draid1:10d:20c:1s-0 DEGRADED 0 0 0
/root/test/files/file1 ONLINE 0 0 42
/root/test/files/file2 ONLINE 0 0 72
/root/test/files/file3 ONLINE 0 0 36
/root/test/files/file4 ONLINE 0 0 48
/root/test/files/file5 ONLINE 0 0 54
/root/test/files/file6 ONLINE 0 0 90
/root/test/files/file7 ONLINE 0 0 18
/root/test/files/file8 ONLINE 0 0 48
/root/test/files/file9 ONLINE 0 0 78
/root/test/files/file10 ONLINE 0 0 72
spare-10 DEGRADED 0 0 54
/root/test/files/file11 UNAVAIL 0 0 0 invalid label
draid1-0-0 ONLINE 0 0 0
/root/test/files/file12 ONLINE 0 0 30
/root/test/files/file13 ONLINE 0 0 60
/root/test/files/file14 ONLINE 0 0 54
/root/test/files/file15 ONLINE 0 0 78
/root/test/files/file16 ONLINE 0 0 66
/root/test/files/file17 ONLINE 0 0 72
/root/test/files/file18 ONLINE 0 0 72
/root/test/files/file19 ONLINE 0 0 72
/root/test/files/file20 ONLINE 0 0 72
spares
draid1-0-0 INUSE currently in use
errors: Permanent errors have been detected in the following files:
<metadata>:<0x0>
<metadata>:<0x3d>
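For convenience, the steps above can be collapsed into a single script. This is only a sketch: the transcript does not show how the 20 backing files were created, so the 1G size below is an assumption (chosen to roughly match the 18.0G pool size reported by `zpool list`), and the polling helper simply waits until `zpool status` stops reporting a scan in progress.

```sh
#!/bin/bash
# Consolidated form of the reproducer above (sketch; file sizes are an
# assumption, timing may need tuning on faster or slower systems).
set -ex

FILES=$(echo /root/test/files/file{1..20})
mkdir -p /root/test/files
for f in $FILES; do truncate -s 1G "$f"; done

zpool create -f -o cachefile=none -o failmode=panic -O canmount=off \
    pool-oss0 draid1:10d:1s $FILES
zfs create -o mountpoint=/mnt/ost0 -o recordsize=4M -o compression=off pool-oss0/ost0

wait_for_scan() {
    # Give the rebuild a moment to show up in zpool status, then poll until
    # no resilver/rebuild/scrub is reported as in progress.
    sleep 2
    while zpool status pool-oss0 | grep -q 'in progress'; do sleep 1; done
}

# Fault two data drives, one after the other, letting the distributed spare
# rebuild finish in between.
zpool offline -f pool-oss0 /root/test/files/file10
rm -f /root/test/files/file10
wait_for_scan

zpool offline -f pool-oss0 /root/test/files/file11
rm -f /root/test/files/file11
wait_for_scan

# Recreate the backing files, replace the first faulted drive, then clear
# the pool; at this point both a sequential rebuild and a healing resilver
# are kicked off.
truncate -s 2G /root/test/files/file10 /root/test/files/file11
zpool replace -w pool-oss0 /root/test/files/file10
sleep 5
echo 'Running zpool clear ...'
zpool clear pool-oss0
zpool status -v pool-oss0
```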
zpool-history:
2023-05-19.12:05:36 zpool create -f -o cachefile=none -o failmode=panic -O canmount=off pool-oss0 draid1:10d:1s /root/test/files/file1 /root/test/files/file2 /root/test/files/file3 /root/test/files/file4 /root/test/files/file5 /root/test/files/file6 /root/test/files/file7 /root/test/files/file8 /root/test/files/file9 /root/test/files/file10 /root/test/files/file11 /root/test/files/file12 /root/test/files/file13 /root/test/files/file14 /root/test/files/file15 /root/test/files/file16 /root/test/files/file17 /root/test/files/file18 /root/test/files/file19 /root/test/files/file20 [user 0 (root) on rocky6x-kvm2:linux]
2023-05-19.12:05:36 [txg:6] create pool-oss0/ost0 (259) [on rocky6x-kvm2]
2023-05-19.12:05:36 [txg:7] set pool-oss0/ost0 (259) mountpoint=/mnt/ost0 [on rocky6x-kvm2]
2023-05-19.12:05:36 [txg:7] set pool-oss0/ost0 (259) recordsize=4194304 [on rocky6x-kvm2]
2023-05-19.12:05:36 [txg:7] set pool-oss0/ost0 (259) compression=2 [on rocky6x-kvm2]
2023-05-19.12:05:36 (278ms) ioctl create
input:
type: 2
props:
mountpoint: '/mnt/ost0'
recordsize: 4194304
compression: 2
[user 0 (root) on rocky6x-kvm2:linux]
2023-05-19.12:05:36 zfs create -o mountpoint=/mnt/ost0 -o recordsize=4M -o compression=off pool-oss0/ost0 [user 0 (root) on rocky6x-kvm2:linux]
2023-05-19.12:07:42 [txg:33] rebuild vdev_id=0 vdev_guid=16844522658709568318 started [on rocky6x-kvm2]
2023-05-19.12:07:42 [txg:37] vdev attach spare in vdev=draid1-0-0 for vdev=/root/test/files/file10 [on rocky6x-kvm2]
2023-05-19.12:07:42 zpool offline -f pool-oss0 /root/test/files/file10 [user 0 (root) on rocky6x-kvm2:linux]
2023-05-19.12:07:47 [txg:37] rebuild vdev_id=0 vdev_guid=16844522658709568318 complete [on rocky6x-kvm2]
2023-05-19.12:07:47 [txg:37] scan setup func=1 mintxg=0 maxtxg=37 [on rocky6x-kvm2]
2023-05-19.12:07:48 [txg:39] scan done errors=0 [on rocky6x-kvm2]
2023-05-19.12:07:49 zpool offline -f pool-oss0 /root/test/files/file11 [user 0 (root) on rocky6x-kvm2:linux]
2023-05-19.12:07:49 [txg:46] scan setup func=2 mintxg=3 maxtxg=46 [on rocky6x-kvm2]
2023-05-19.12:07:49 [txg:48] vdev attach replace vdev=/root/test/files/file10 for vdev=/root/test/files/file10/old [on rocky6x-kvm2]
2023-05-19.12:07:49 [txg:48] scan done errors=0 [on rocky6x-kvm2]
2023-05-19.12:07:50 [txg:50] detach vdev=/root/test/files/file10/old [on rocky6x-kvm2]
2023-05-19.12:07:50 [txg:51] detach vdev=draid1-0-0 [on rocky6x-kvm2]
2023-05-19.12:07:50 zpool replace -w pool-oss0 /root/test/files/file10 [user 0 (root) on rocky6x-kvm2:linux]
2023-05-19.12:07:55 [txg:53] rebuild vdev_id=0 vdev_guid=16844522658709568318 started [on rocky6x-kvm2] -> sequential resilver
2023-05-19.12:07:55 [txg:53] scan setup func=2 mintxg=3 maxtxg=56 [on rocky6x-kvm2] -> healing resilver
2023-05-19.12:07:56 [txg:55] scan done errors=6 [on rocky6x-kvm2]
2023-05-19.12:08:02 [txg:57] rebuild vdev_id=0 vdev_guid=16844522658709568318 canceled [on rocky6x-kvm2]
2023-05-19.12:07:56 [txg:57] vdev attach spare in vdev=draid1-0-0 for vdev=/root/test/files/file11 [on rocky6x-kvm2]
2023-05-19.12:07:56 zpool clear pool-oss0 [user 0 (root) on rocky6x-kvm2:linux]
zpool-events:
Snip of relevant events around the `zpool clear`:
May 19 2023 12:07:55.701923737 resource.fs.zfs.statechange
version = 0x0
class = "resource.fs.zfs.statechange"
pool = "pool-oss0"
pool_guid = 0x676926306413a9b7
pool_state = 0x0
pool_context = 0x0
vdev_guid = 0x88353c55104c6630
vdev_state = "UNAVAIL" (0x4)
vdev_path = "/root/test/files/file11"
vdev_laststate = "ONLINE" (0x7)
time = 0x6467669b 0x29d68199
eid = 0x4b
May 19 2023 12:07:55.702923744 sysevent.fs.zfs.vdev_clear
version = 0x0
class = "sysevent.fs.zfs.vdev_clear"
pool = "pool-oss0"
pool_guid = 0x676926306413a9b7
pool_state = 0x0
pool_context = 0x0
vdev_guid = 0x88353c55104c6630
vdev_state = "UNAVAIL" (0x4)
vdev_path = "/root/test/files/file11"
time = 0x6467669b 0x29e5c3e0
eid = 0x4c
May 19 2023 12:07:55.977925890 sysevent.fs.zfs.vdev_spare
version = 0x0
class = "sysevent.fs.zfs.vdev_spare"
pool = "pool-oss0"
pool_guid = 0x676926306413a9b7
pool_state = 0x0
pool_context = 0x0
vdev_guid = 0xc4b175f5e14653c
vdev_state = "ONLINE" (0x7)
vdev_path = "draid1-0-0"
time = 0x6467669b 0x3a49f702
eid = 0x4d
May 19 2023 12:07:55.977925890 sysevent.fs.zfs.resilver_start
version = 0x0
class = "sysevent.fs.zfs.resilver_start"
pool = "pool-oss0"
pool_guid = 0x676926306413a9b7
pool_state = 0x0
pool_context = 0x0
vdev_guid = 0xe9c3c41cab66fb3e
vdev_state = "DEGRADED" (0x6)
resilver_type = "sequential"
time = 0x6467669b 0x3a49f702
eid = 0x4e
May 19 2023 12:07:55.977925890 sysevent.fs.zfs.vdev_attach
version = 0x0
class = "sysevent.fs.zfs.vdev_attach"
pool = "pool-oss0"
pool_guid = 0x676926306413a9b7
pool_state = 0x0
pool_context = 0x0
vdev_guid = 0xc4b175f5e14653c
vdev_state = "ONLINE" (0x7)
vdev_path = "draid1-0-0"
time = 0x6467669b 0x3a49f702
eid = 0x4f
May 19 2023 12:07:55.984925944 sysevent.fs.zfs.history_event
version = 0x0
class = "sysevent.fs.zfs.history_event"
pool = "pool-oss0"
pool_guid = 0x676926306413a9b7
pool_state = 0x0
pool_context = 0x0
history_hostname = "rocky6x-kvm2"
history_internal_str = "vdev_id=0 vdev_guid=16844522658709568318 started"
history_internal_name = "rebuild"
history_txg = 0x35
history_time = 0x6467669b
time = 0x6467669b 0x3ab4c6f8
eid = 0x50
May 19 2023 12:07:55.985925952 sysevent.fs.zfs.resilver_start
version = 0x0
class = "sysevent.fs.zfs.resilver_start"
pool = "pool-oss0"
pool_guid = 0x676926306413a9b7
pool_state = 0x0
pool_context = 0x0
resilver_type = "healing"
time = 0x6467669b 0x3ac40940
eid = 0x51
Include any warning/errors/backtraces from the system logs
None