Description
We've seen `zil_lwb_flush_wait_all` hang on a few systems; the stack looks like this:
```
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257040] INFO: task txg_sync:2620216 blocked for more than 120 seconds.
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257183] Tainted: P OE 5.4.0-137-dx2023022419-17ff49f96-generic #154
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257369] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257509] txg_sync D 0 2620216 2 0x80004080
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257511] Call Trace:
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257517] __schedule+0x2de/0x710
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257521] ? __wake_up_common_lock+0x8a/0xc0
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257523] schedule+0x42/0xb0
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257531] cv_wait_common+0x11e/0x160 [spl]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257532] ? __wake_up_pollfree+0x40/0x40
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257535] __cv_wait+0x15/0x20 [spl]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257719] zil_lwb_flush_wait_all+0x51/0x90 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257831] zil_sync+0x89/0x4c0 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257837] ? spl_kmem_alloc_impl+0xdd/0x100 [spl]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257901] ? dmu_objset_projectquota_enabled+0x50/0x50 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257905] ? taskq_wait_check+0x31/0x40 [spl]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257909] ? taskq_wait+0x2f/0xa0 [spl]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.257976] ? do_raw_spin_unlock+0x9/0x10 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258035] dmu_objset_sync+0x339/0x460 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258099] dsl_dataset_sync+0x5e/0x1e0 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258168] dsl_pool_sync+0xa0/0x410 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258254] ? spa_errlog_sync+0x26c/0x290 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258330] spa_sync_iterate_to_convergence+0xf1/0x250 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258410] spa_sync+0x31c/0x6c0 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258483] txg_sync_thread+0x22d/0x2d0 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258550] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258555] thread_generic_wrapper+0x83/0xa0 [spl]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258558] kthread+0x104/0x140
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258563] ? kasan_check_write.constprop.0+0x10/0x10 [spl]
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258564] ? kthread_park+0x90/0x90
Jun 7 13:00:05 phlpwcgap06 kernel: [5428563.258567] ret_from_fork+0x1f/0x40
```
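For context on where the wait happens: `zil_lwb_flush_wait_all()` sits in `cv_wait()` until every outstanding lwb write/flush issued for the syncing txg has completed. A minimal paraphrase of that logic (based on our reading of `module/zfs/zil.c`; this is not the verbatim source, and exact names/details may vary by release):

```c
/*
 * Simplified paraphrase of zil_lwb_flush_wait_all() from
 * module/zfs/zil.c -- illustrative only, not the literal source.
 */
static void
zil_lwb_flush_wait_all(zilog_t *zilog, uint64_t txg)
{
	mutex_enter(&zilog->zl_lwb_io_lock);
	/*
	 * Sleep until the count of in-flight lwb writes/flushes for
	 * this txg drops to zero.  If a completion is lost (or the
	 * counter is never decremented), this cv_wait() never returns
	 * and txg_sync hangs here, which matches the stack above.
	 */
	while (zilog->zl_lwb_inflight[txg & TXG_MASK] > 0)
		cv_wait(&zilog->zl_lwb_io_cv, &zilog->zl_lwb_io_lock);
	mutex_exit(&zilog->zl_lwb_io_lock);
}
```

If that reading is right, the hang implies the in-flight count for the syncing txg never reached zero, i.e. some lwb completion was never delivered or never decremented the counter.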
As a result of txg_sync being blocked, all TXGs halt, and thus all IO to this pool effectively stops. We've resolved this via a reboot each time, and we haven't yet been able to capture a kernel crash dump when this situation occurs.
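For anyone wondering why a single blocked thread stalls the entire pool: the sync thread processes txgs strictly serially, so a hang inside `spa_sync()` means no later txg can ever sync. A heavily simplified sketch of the shape of that loop (paraphrased from `module/zfs/txg.c`; illustrative only, not the literal source):

```c
/*
 * Heavily simplified sketch of txg_sync_thread() from
 * module/zfs/txg.c -- illustrative only, details differ in the
 * real code (timeouts, exit handling, dirty-data kicks, etc.).
 */
static void
txg_sync_thread(void *arg)
{
	dsl_pool_t *dp = arg;
	tx_state_t *tx = &dp->dp_tx;

	for (;;) {
		uint64_t txg;

		/* Wait for a quiesced txg that is ready to sync. */
		mutex_enter(&tx->tx_sync_lock);
		while (tx->tx_quiesced_txg == 0)
			cv_wait(&tx->tx_sync_more_cv, &tx->tx_sync_lock);
		txg = tx->tx_quiesced_txg;
		tx->tx_quiesced_txg = 0;
		tx->tx_syncing_txg = txg;
		mutex_exit(&tx->tx_sync_lock);

		/*
		 * This is where the stack above is stuck.  Because this
		 * loop is strictly serial, a hang here means no later
		 * txg can ever sync.
		 */
		spa_sync(dp->dp_spa, txg);

		/* Mark the txg synced and wake any waiters. */
		mutex_enter(&tx->tx_sync_lock);
		tx->tx_synced_txg = txg;
		tx->tx_syncing_txg = 0;
		cv_broadcast(&tx->tx_sync_done_cv);
		mutex_exit(&tx->tx_sync_lock);
	}
}
```

Once the open txg fills up, new writers block in `dmu_tx_wait()`/`txg_wait_open()`, which is why all IO to the pool appears to stop.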
So far, this situation seems relatively rare (it has only been reported twice), but since it causes all IO to the pool to hang, it's quite impactful when it does happen: an admin has to detect the situation and manually reboot the system to recover.