
Sequential (rebuild) and healing resilver initiated at the same time/txg #14881


Description


System information

Type                   Version/Name
Distribution Name      Red Hat Enterprise Linux / Rocky Linux
Distribution Version   RHEL 8.6 / Rocky 8.4
Kernel Version         4.18+ kernels
Architecture           x86_64
OpenZFS Version        zfs-2.1.11

Describe the problem you're observing

In rare instances, when a single drive fails or is taken offline, we see both a sequential (rebuild) and a healing resilver initiated at the same time. Once the healing resilver completes (with errors), the rebuild is canceled.
zpool-history:
Snippet of zpool history -il from the problematic zpool. Both resilvers started at the same time and txg, and ran for roughly 1 hour 20 minutes.

[txg:2610151] rebuild vdev_id=0 vdev_guid=17410720669444886704 started 
[txg:2610151] scan setup func=2 mintxg=3 maxtxg=2610154
[txg:2610156] vdev attach spare in vdev=draid2-0-0 for vdev=/dev/disk/by-vdev/enc2-0 
[txg:2610841] scan done errors=2 
[txg:2610842] rebuild vdev_id=0 vdev_guid=17410720669444886704 canceled
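
One way to spot the overlap after the fact (a rough sketch, using the pool name from the reproducer below) is to filter the internal history for rebuild and scan entries and compare their txgs:

# A "rebuild ... started" and a "scan setup func=2" entry carrying the same
# [txg:N] means the sequential and healing resilvers were initiated together.
zpool history -il pool-oss0 | grep -E 'rebuild .* (started|complete|canceled)|scan (setup|done)'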

Describe how to reproduce the problem

+ zpool create -f -o cachefile=none -o failmode=panic -O canmount=off pool-oss0 draid1:10d:1s /root/test/files/file1 /root/test/files/file2 /root/test/files/file3 /root/test/files/file4 /root/test/files/file5 /root/test/files/file6 /root/test/files/file7 /root/test/files/file8 /root/test/files/file9 /root/test/files/file10 /root/test/files/file11 /root/test/files/file12 /root/test/files/file13 /root/test/files/file14 /root/test/files/file15 /root/test/files/file16 /root/test/files/file17 /root/test/files/file18 /root/test/files/file19 /root/test/files/file20
+ zfs create -o mountpoint=/mnt/ost0 -o recordsize=4M -o compression=off pool-oss0/ost0
+ zpool status -v
  pool: pool-oss0
 state: ONLINE
config:

        NAME                         STATE     READ WRITE CKSUM
        pool-oss0                    ONLINE       0     0     0
          draid1:10d:20c:1s-0        ONLINE       0     0     0
            /root/test/files/file1   ONLINE       0     0     0
            /root/test/files/file2   ONLINE       0     0     0
            /root/test/files/file3   ONLINE       0     0     0
            /root/test/files/file4   ONLINE       0     0     0
            /root/test/files/file5   ONLINE       0     0     0
            /root/test/files/file6   ONLINE       0     0     0
            /root/test/files/file7   ONLINE       0     0     0
            /root/test/files/file8   ONLINE       0     0     0
            /root/test/files/file9   ONLINE       0     0     0
            /root/test/files/file10  ONLINE       0     0     0
            /root/test/files/file11  ONLINE       0     0     0
            /root/test/files/file12  ONLINE       0     0     0
            /root/test/files/file13  ONLINE       0     0     0
            /root/test/files/file14  ONLINE       0     0     0
            /root/test/files/file15  ONLINE       0     0     0
            /root/test/files/file16  ONLINE       0     0     0
            /root/test/files/file17  ONLINE       0     0     0
            /root/test/files/file18  ONLINE       0     0     0
            /root/test/files/file19  ONLINE       0     0     0
            /root/test/files/file20  ONLINE       0     0     0
        spares
          draid1-0-0                 AVAIL

errors: No known data errors
+ zpool list
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
pool-oss0  18.0G  1.11M  18.0G        -         -     0%     0%  1.00x    ONLINE  -
+ zfs list
NAME             USED  AVAIL     REFER  MOUNTPOINT
pool-oss0        694K  15.6G      118K  /pool-oss0
pool-oss0/ost0   118K  15.6G      118K  /mnt/ost0


+ zpool offline -f pool-oss0 /root/test/files/file10
+ rm -f /root/test/files/file10
<Wait for rebuild to complete ... (one way to script this wait is sketched after the status output below)>

  pool: pool-oss0
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 00:00:01 with 0 errors on Fri May 19 12:07:48 2023
  scan: resilvered (draid1:10d:20c:1s-0) 81.5K in 00:00:05 with 0 errors on Fri May 19 12:07:47 2023
config:

        NAME                           STATE     READ WRITE CKSUM
        pool-oss0                      DEGRADED     0     0     0
          draid1:10d:20c:1s-0          DEGRADED     0     0     0
            /root/test/files/file1     ONLINE       0     0     0
            /root/test/files/file2     ONLINE       0     0     0
            /root/test/files/file3     ONLINE       0     0     0
            /root/test/files/file4     ONLINE       0     0     0
            /root/test/files/file5     ONLINE       0     0     0
            /root/test/files/file6     ONLINE       0     0     0
            /root/test/files/file7     ONLINE       0     0     0
            /root/test/files/file8     ONLINE       0     0     0
            /root/test/files/file9     ONLINE       0     0     0
            spare-9                    DEGRADED     0     0     0
              /root/test/files/file10  FAULTED      0     0     0  external device fault
              draid1-0-0               ONLINE       0     0     0
            /root/test/files/file11    ONLINE       0     0     0
            /root/test/files/file12    ONLINE       0     0     0
            /root/test/files/file13    ONLINE       0     0     0
            /root/test/files/file14    ONLINE       0     0     0
            /root/test/files/file15    ONLINE       0     0     0
            /root/test/files/file16    ONLINE       0     0     0
            /root/test/files/file17    ONLINE       0     0     0
            /root/test/files/file18    ONLINE       0     0     0
            /root/test/files/file19    ONLINE       0     0     0
            /root/test/files/file20    ONLINE       0     0     0
        spares
          draid1-0-0                   INUSE     currently in use

errors: No known data errors
==== rebuild complete. ====
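
In our test script the "<Wait for rebuild to complete ...>" steps above and below are a simple polling loop; a minimal sketch (the pool name and sleep interval are just examples) looks like this:

# Poll zpool status until no resilver/rebuild is reported as in progress;
# once it finishes, the scan line changes to a "resilvered (...)" summary
# like the one shown above.
while zpool status pool-oss0 | grep -q 'in progress'; do
    sleep 1
done
echo "==== rebuild complete. ===="

On releases that support it, zpool wait -t resilver pool-oss0 should serve the same purpose.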
+ zpool offline -f pool-oss0 /root/test/files/file11
+ rm -f /root/test/files/file11
<Wait for rebuild to complete ...>

+ set +x
  pool: pool-oss0
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 00:00:01 with 0 errors on Fri May 19 12:07:48 2023
  scan: resilvered (draid1:10d:20c:1s-0) 81.5K in 00:00:05 with 0 errors on Fri May 19 12:07:47 2023
config:

        NAME                           STATE     READ WRITE CKSUM
        pool-oss0                      DEGRADED     0     0     0
          draid1:10d:20c:1s-0          DEGRADED     0     0     0
            /root/test/files/file1     ONLINE       0     0     0
            /root/test/files/file2     ONLINE       0     0     0
            /root/test/files/file3     ONLINE       0     0     0
            /root/test/files/file4     ONLINE       0     0     0
            /root/test/files/file5     ONLINE       0     0     0
            /root/test/files/file6     ONLINE       0     0     0
            /root/test/files/file7     ONLINE       0     0     0
            /root/test/files/file8     ONLINE       0     0     0
            /root/test/files/file9     ONLINE       0     0     0
            spare-9                    DEGRADED     0     0     0
              /root/test/files/file10  FAULTED      0     0     0  external device fault
              draid1-0-0               ONLINE       0     0     0
            /root/test/files/file11    FAULTED      0     0     0  external device fault
            /root/test/files/file12    ONLINE       0     0     0
            /root/test/files/file13    ONLINE       0     0     0
            /root/test/files/file14    ONLINE       0     0     0
            /root/test/files/file15    ONLINE       0     0     0
            /root/test/files/file16    ONLINE       0     0     0
            /root/test/files/file17    ONLINE       0     0     0
            /root/test/files/file18    ONLINE       0     0     0
            /root/test/files/file19    ONLINE       0     0     0
            /root/test/files/file20    ONLINE       0     0     0
        spares
          draid1-0-0                   INUSE     currently in use

errors: No known data errors
==== rebuild complete. ====

+ truncate -s 2G /root/test/files/file10 /root/test/files/file11
+ zpool replace -w pool-oss0 /root/test/files/file10
+ sleep 5
+ echo 'Running zpool clear ...'
Running zpool clear ...
+ zpool clear pool-oss0
<At this point both the rebuild (sequential resilver) and the healing resilver are initiated.>
  pool: pool-oss0
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilver (draid1:10d:20c:1s-0) canceled on Fri May 19 12:08:02 2023
config:

        NAME                           STATE     READ WRITE CKSUM
        pool-oss0                      DEGRADED     0     0     0
          draid1:10d:20c:1s-0          DEGRADED     0     0     0
            /root/test/files/file1     ONLINE       0     0    42
            /root/test/files/file2     ONLINE       0     0    72
            /root/test/files/file3     ONLINE       0     0    36
            /root/test/files/file4     ONLINE       0     0    48
            /root/test/files/file5     ONLINE       0     0    54
            /root/test/files/file6     ONLINE       0     0    90
            /root/test/files/file7     ONLINE       0     0    18
            /root/test/files/file8     ONLINE       0     0    48
            /root/test/files/file9     ONLINE       0     0    78
            /root/test/files/file10    ONLINE       0     0    72
            spare-10                   DEGRADED     0     0    54
              /root/test/files/file11  UNAVAIL      0     0     0  invalid label
              draid1-0-0               ONLINE       0     0     0
            /root/test/files/file12    ONLINE       0     0    30
            /root/test/files/file13    ONLINE       0     0    60
            /root/test/files/file14    ONLINE       0     0    54
            /root/test/files/file15    ONLINE       0     0    78
            /root/test/files/file16    ONLINE       0     0    66
            /root/test/files/file17    ONLINE       0     0    72
            /root/test/files/file18    ONLINE       0     0    72
            /root/test/files/file19    ONLINE       0     0    72
            /root/test/files/file20    ONLINE       0     0    72
        spares
          draid1-0-0                   INUSE     currently in use

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>
        <metadata>:<0x3d>

zpool-history:

2023-05-19.12:05:36 zpool create -f -o cachefile=none -o failmode=panic -O canmount=off pool-oss0 draid1:10d:1s /root/test/files/file1 /root/test/files/file2 /root/test/files/file3 /root/test/files/file4 /root/test/files/file5 /root/test/files/file6 /root/test/files/file7 /root/test/files/file8 /root/test/files/file9 /root/test/files/file10 /root/test/files/file11 /root/test/files/file12 /root/test/files/file13 /root/test/files/file14 /root/test/files/file15 /root/test/files/file16 /root/test/files/file17 /root/test/files/file18 /root/test/files/file19 /root/test/files/file20 [user 0 (root) on rocky6x-kvm2:linux]
2023-05-19.12:05:36 [txg:6] create pool-oss0/ost0 (259)   [on rocky6x-kvm2]
2023-05-19.12:05:36 [txg:7] set pool-oss0/ost0 (259) mountpoint=/mnt/ost0 [on rocky6x-kvm2]
2023-05-19.12:05:36 [txg:7] set pool-oss0/ost0 (259) recordsize=4194304 [on rocky6x-kvm2]
2023-05-19.12:05:36 [txg:7] set pool-oss0/ost0 (259) compression=2 [on rocky6x-kvm2]
2023-05-19.12:05:36 (278ms) ioctl create
    input:
        type: 2
        props:
            mountpoint: '/mnt/ost0'
            recordsize: 4194304
            compression: 2
 [user 0 (root) on rocky6x-kvm2:linux]
2023-05-19.12:05:36 zfs create -o mountpoint=/mnt/ost0 -o recordsize=4M -o compression=off pool-oss0/ost0 [user 0 (root) on rocky6x-kvm2:linux]
2023-05-19.12:07:42 [txg:33] rebuild vdev_id=0 vdev_guid=16844522658709568318 started [on rocky6x-kvm2]
2023-05-19.12:07:42 [txg:37] vdev attach spare in vdev=draid1-0-0 for vdev=/root/test/files/file10 [on rocky6x-kvm2]
2023-05-19.12:07:42 zpool offline -f pool-oss0 /root/test/files/file10 [user 0 (root) on rocky6x-kvm2:linux]
2023-05-19.12:07:47 [txg:37] rebuild vdev_id=0 vdev_guid=16844522658709568318 complete [on rocky6x-kvm2]
2023-05-19.12:07:47 [txg:37] scan setup func=1 mintxg=0 maxtxg=37 [on rocky6x-kvm2]
2023-05-19.12:07:48 [txg:39] scan done errors=0 [on rocky6x-kvm2]
2023-05-19.12:07:49 zpool offline -f pool-oss0 /root/test/files/file11 [user 0 (root) on rocky6x-kvm2:linux]
2023-05-19.12:07:49 [txg:46] scan setup func=2 mintxg=3 maxtxg=46 [on rocky6x-kvm2]
2023-05-19.12:07:49 [txg:48] vdev attach replace vdev=/root/test/files/file10 for vdev=/root/test/files/file10/old [on rocky6x-kvm2]
2023-05-19.12:07:49 [txg:48] scan done errors=0 [on rocky6x-kvm2]
2023-05-19.12:07:50 [txg:50] detach vdev=/root/test/files/file10/old [on rocky6x-kvm2]
2023-05-19.12:07:50 [txg:51] detach vdev=draid1-0-0 [on rocky6x-kvm2]
2023-05-19.12:07:50 zpool replace -w pool-oss0 /root/test/files/file10 [user 0 (root) on rocky6x-kvm2:linux]
2023-05-19.12:07:55 [txg:53] rebuild vdev_id=0 vdev_guid=16844522658709568318 started [on rocky6x-kvm2] -> sequential resilver
2023-05-19.12:07:55 [txg:53] scan setup func=2 mintxg=3 maxtxg=56 [on rocky6x-kvm2]  -> healing resilver
2023-05-19.12:07:56 [txg:55] scan done errors=6 [on rocky6x-kvm2]
2023-05-19.12:08:02 [txg:57] rebuild vdev_id=0 vdev_guid=16844522658709568318 canceled [on rocky6x-kvm2]
2023-05-19.12:07:56 [txg:57] vdev attach spare in vdev=draid1-0-0 for vdev=/root/test/files/file11 [on rocky6x-kvm2]
2023-05-19.12:07:56 zpool clear pool-oss0 [user 0 (root) on rocky6x-kvm2:linux]

zpool events (around the zpool clear):
Snippet of the relevant events:

May 19 2023 12:07:55.701923737 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool = "pool-oss0"
        pool_guid = 0x676926306413a9b7
        pool_state = 0x0
        pool_context = 0x0
        vdev_guid = 0x88353c55104c6630
        vdev_state = "UNAVAIL" (0x4)
        vdev_path = "/root/test/files/file11"
        vdev_laststate = "ONLINE" (0x7)
        time = 0x6467669b 0x29d68199
        eid = 0x4b

May 19 2023 12:07:55.702923744 sysevent.fs.zfs.vdev_clear
        version = 0x0
        class = "sysevent.fs.zfs.vdev_clear"
        pool = "pool-oss0"
        pool_guid = 0x676926306413a9b7
        pool_state = 0x0
        pool_context = 0x0
        vdev_guid = 0x88353c55104c6630
        vdev_state = "UNAVAIL" (0x4)
        vdev_path = "/root/test/files/file11"
        time = 0x6467669b 0x29e5c3e0
        eid = 0x4c

May 19 2023 12:07:55.977925890 sysevent.fs.zfs.vdev_spare
        version = 0x0
        class = "sysevent.fs.zfs.vdev_spare"
        pool = "pool-oss0"
        pool_guid = 0x676926306413a9b7
        pool_state = 0x0
        pool_context = 0x0
        vdev_guid = 0xc4b175f5e14653c
        vdev_state = "ONLINE" (0x7)
        vdev_path = "draid1-0-0"
        time = 0x6467669b 0x3a49f702
        eid = 0x4d

May 19 2023 12:07:55.977925890 sysevent.fs.zfs.resilver_start
        version = 0x0
        class = "sysevent.fs.zfs.resilver_start"
        pool = "pool-oss0"
        pool_guid = 0x676926306413a9b7
        pool_state = 0x0
        pool_context = 0x0
        vdev_guid = 0xe9c3c41cab66fb3e
        vdev_state = "DEGRADED" (0x6)
        resilver_type = "sequential"
        time = 0x6467669b 0x3a49f702
        eid = 0x4e

May 19 2023 12:07:55.977925890 sysevent.fs.zfs.vdev_attach
        version = 0x0
        class = "sysevent.fs.zfs.vdev_attach"
        pool = "pool-oss0"
        pool_guid = 0x676926306413a9b7
        pool_state = 0x0
        pool_context = 0x0
        vdev_guid = 0xc4b175f5e14653c
        vdev_state = "ONLINE" (0x7)
        vdev_path = "draid1-0-0"
        time = 0x6467669b 0x3a49f702
        eid = 0x4f

May 19 2023 12:07:55.984925944 sysevent.fs.zfs.history_event
        version = 0x0
        class = "sysevent.fs.zfs.history_event"
        pool = "pool-oss0"
        pool_guid = 0x676926306413a9b7
        pool_state = 0x0
        pool_context = 0x0
        history_hostname = "rocky6x-kvm2"
        history_internal_str = "vdev_id=0 vdev_guid=16844522658709568318 started"
        history_internal_name = "rebuild"
        history_txg = 0x35
        history_time = 0x6467669b
        time = 0x6467669b 0x3ab4c6f8
        eid = 0x50

May 19 2023 12:07:55.985925952 sysevent.fs.zfs.resilver_start
        version = 0x0
        class = "sysevent.fs.zfs.resilver_start"
        pool = "pool-oss0"
        pool_guid = 0x676926306413a9b7
        pool_state = 0x0
        pool_context = 0x0
        resilver_type = "healing"
        time = 0x6467669b 0x3ac40940
        eid = 0x51
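
The events above were captured with zpool events -v; a rough sketch of how to confirm that both resilver types fired back to back on this pool is:

# List resilver_start events together with their resilver_type field.
# A "sequential" start immediately followed by a "healing" start, as seen
# here, is the symptom described in this report.
zpool events -v pool-oss0 | grep -E 'resilver_start|resilver_type'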

Include any warning/errors/backtraces from the system logs

None
