Skip to content

hang in call to zil_lwb_flush_wait_all, blocking txg_sync #14982

Closed
@prakashsurya

Description

@prakashsurya

We've seen zil_lwb_flush_wait_all hang on a few systems, the stack looks like this:

Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257040] INFO: task txg_sync:2620216 blocked for more than 120 seconds.
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257183]       Tainted: P           OE     5.4.0-137-dx2023022419-17ff49f96-generic #154
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257369] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257509] txg_sync        D    0 2620216      2 0x80004080
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257511] Call Trace:
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257517]  __schedule+0x2de/0x710
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257521]  ? __wake_up_common_lock+0x8a/0xc0
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257523]  schedule+0x42/0xb0
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257531]  cv_wait_common+0x11e/0x160 [spl]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257532]  ? __wake_up_pollfree+0x40/0x40
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257535]  __cv_wait+0x15/0x20 [spl]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257719]  zil_lwb_flush_wait_all+0x51/0x90 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257831]  zil_sync+0x89/0x4c0 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257837]  ? spl_kmem_alloc_impl+0xdd/0x100 [spl]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257901]  ? dmu_objset_projectquota_enabled+0x50/0x50 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257905]  ? taskq_wait_check+0x31/0x40 [spl]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257909]  ? taskq_wait+0x2f/0xa0 [spl]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.257976]  ? do_raw_spin_unlock+0x9/0x10 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258035]  dmu_objset_sync+0x339/0x460 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258099]  dsl_dataset_sync+0x5e/0x1e0 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258168]  dsl_pool_sync+0xa0/0x410 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258254]  ? spa_errlog_sync+0x26c/0x290 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258330]  spa_sync_iterate_to_convergence+0xf1/0x250 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258410]  spa_sync+0x31c/0x6c0 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258483]  txg_sync_thread+0x22d/0x2d0 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258550]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258555]  thread_generic_wrapper+0x83/0xa0 [spl]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258558]  kthread+0x104/0x140
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258563]  ? kasan_check_write.constprop.0+0x10/0x10 [spl]
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258564]  ? kthread_park+0x90/0x90
Jun  7 13:00:05 phlpwcgap06 kernel: [5428563.258567]  ret_from_fork+0x1f/0x40

As a result of txg_sync being blocked, all TXGs halt, and thus all IO to this pool effectively stops.. we've resolved this via a reboot.. and we haven't yet been able to capture a kernel crash dump when this situation occurs.

So far, this situation seems relatively rare.. only been reported 2 times.. but since it causes all IO to the pool to hang, it's relatively impactful when it does happen, as it requires an admin to detect the situation and manually reboot the system to recover.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Status: StaleNo recent activity for issueType: DefectIncorrect behavior (e.g. crash, hang)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions