Description
We've seen `zil_lwb_flush_wait_all` hang on a few systems; the stack looks like this:
```
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257040] INFO: task txg_sync:2620216 blocked for more than 120 seconds.
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257183] Tainted: P OE 5.4.0-137-dx2023022419-17ff49f96-generic #154
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257369] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257509] txg_sync D 0 2620216 2 0x80004080
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257511] Call Trace:
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257517] __schedule+0x2de/0x710
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257521] ? __wake_up_common_lock+0x8a/0xc0
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257523] schedule+0x42/0xb0
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257531] cv_wait_common+0x11e/0x160 [spl]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257532] ? __wake_up_pollfree+0x40/0x40
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257535] __cv_wait+0x15/0x20 [spl]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257719] zil_lwb_flush_wait_all+0x51/0x90 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257831] zil_sync+0x89/0x4c0 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257837] ? spl_kmem_alloc_impl+0xdd/0x100 [spl]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257901] ? dmu_objset_projectquota_enabled+0x50/0x50 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257905] ? taskq_wait_check+0x31/0x40 [spl]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257909] ? taskq_wait+0x2f/0xa0 [spl]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257976] ? do_raw_spin_unlock+0x9/0x10 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258035] dmu_objset_sync+0x339/0x460 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258099] dsl_dataset_sync+0x5e/0x1e0 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258168] dsl_pool_sync+0xa0/0x410 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258254] ? spa_errlog_sync+0x26c/0x290 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258330] spa_sync_iterate_to_convergence+0xf1/0x250 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258410] spa_sync+0x31c/0x6c0 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258483] txg_sync_thread+0x22d/0x2d0 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258550] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258555] thread_generic_wrapper+0x83/0xa0 [spl]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258558] kthread+0x104/0x140
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258563] ? kasan_check_write.constprop.0+0x10/0x10 [spl]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258564] ? kthread_park+0x90/0x90
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258567] ret_from_fork+0x1f/0x40
```
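For context on where the wait happens: `zil_lwb_flush_wait_all()` sits in `cv_wait()` until every outstanding lwb write/flush issued for the syncing txg has completed. A minimal paraphrase of that logic (based on our reading of `module/zfs/zil.c`; this is not the verbatim source, and exact names/details may vary by release):

```c
/*
 * Simplified paraphrase of zil_lwb_flush_wait_all() from
 * module/zfs/zil.c -- illustrative only, not the literal source.
 */
static void
zil_lwb_flush_wait_all(zilog_t *zilog, uint64_t txg)
{
	mutex_enter(&zilog->zl_lwb_io_lock);
	/*
	 * Sleep until the count of in-flight lwb writes/flushes for
	 * this txg drops to zero.  If a completion is lost (or the
	 * counter is never decremented), this cv_wait() never returns
	 * and txg_sync hangs here, which matches the stack above.
	 */
	while (zilog->zl_lwb_inflight[txg & TXG_MASK] > 0)
		cv_wait(&zilog->zl_lwb_io_cv, &zilog->zl_lwb_io_lock);
	mutex_exit(&zilog->zl_lwb_io_lock);
}
```

If that reading is right, the hang implies the in-flight count for the syncing txg never reached zero, i.e. some lwb completion was never delivered or never decremented the counter.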
As a result of txg_sync being blocked, all TXGs halt, and thus all IO to this pool effectively stops. We've resolved this via a reboot each time, and we haven't yet been able to capture a kernel crash dump when this situation occurs.
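For anyone wondering why a single blocked thread stalls the entire pool: the sync thread processes txgs strictly serially, so a hang inside `spa_sync()` means no later txg can ever sync. A heavily simplified sketch of the shape of that loop (paraphrased from `module/zfs/txg.c`; illustrative only, not the literal source):

```c
/*
 * Heavily simplified sketch of txg_sync_thread() from
 * module/zfs/txg.c -- illustrative only, details differ in the
 * real code (timeouts, exit handling, dirty-data kicks, etc.).
 */
static void
txg_sync_thread(void *arg)
{
	dsl_pool_t *dp = arg;
	tx_state_t *tx = &dp->dp_tx;

	for (;;) {
		uint64_t txg;

		/* Wait for a quiesced txg that is ready to sync. */
		mutex_enter(&tx->tx_sync_lock);
		while (tx->tx_quiesced_txg == 0)
			cv_wait(&tx->tx_sync_more_cv, &tx->tx_sync_lock);
		txg = tx->tx_quiesced_txg;
		tx->tx_quiesced_txg = 0;
		tx->tx_syncing_txg = txg;
		mutex_exit(&tx->tx_sync_lock);

		/*
		 * This is where the stack above is stuck.  Because this
		 * loop is strictly serial, a hang here means no later
		 * txg can ever sync.
		 */
		spa_sync(dp->dp_spa, txg);

		/* Mark the txg synced and wake any waiters. */
		mutex_enter(&tx->tx_sync_lock);
		tx->tx_synced_txg = txg;
		tx->tx_syncing_txg = 0;
		cv_broadcast(&tx->tx_sync_done_cv);
		mutex_exit(&tx->tx_sync_lock);
	}
}
```

Once the open txg fills up, new writers block in `dmu_tx_wait()`/`txg_wait_open()`, which is why all IO to the pool appears to stop.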
So far, this situation seems relatively rare (it has only been reported twice), but since it causes all IO to the pool to hang, it's quite impactful when it does happen: an admin has to detect the situation and manually reboot the system to recover.