Commit 9909121
Fixing race condition with rangelocks
There existed a race condition between when a Direct I/O write could complete and when a sync operation was issued. If there was an ARC buffer associated with the dbuf, a Direct I/O write would sleep waiting on previous TXGs to sync out their dirty records associated with that dbuf. This was necessary in order to safely destroy the ARC buffer, because a previous dirty record's dr_data may point at the dbuf's db_buf. The main issue with this approach is that a Direct I/O write holds the rangelock across the entire block, so when a sync on that same block was issued and tried to grab the rangelock as reader, it was blocked indefinitely: the Direct I/O write that was now sleeping held that same rangelock as writer. This led to a complete deadlock.

This commit fixes the issue and removes the wait in dmu_write_direct_done(). The ARC buffer, if there is one associated with the dbuf, is now destroyed before the Direct I/O write is ever issued. This implementation heavily borrows from the block cloning implementation. A new function, dmu_buf_will_clone_or_dio(), is called in both dmu_write_direct() and dmu_brt_clone() and does the following (a condensed sketch follows the trace below):

1. Undirties a dirty record for the dbuf if one is currently associated with the current TXG.
2. Destroys the ARC buffer if the previous dirty record's dr_data does not point at the dbuf's ARC buffer (db_buf).
3. Sets the dbuf's data pointers to NULL.
4. Redirties the dbuf using db_state = DB_NOFILL.

As part of this commit, the dmu_write_direct_done() function was also cleaned up. dmu_sync_done() is now called before undirtying the dbuf dirty record associated with a failed Direct I/O write. This is the correct ordering and how it always should have been.

An additional benefit of these modifications is that a Direct I/O write no longer stalls when the user mixes buffered and O_DIRECT operations. They also unify the block cloning and Direct I/O write paths, as both need to call dbuf_fix_old_data() before destroying the ARC buffer.

There is also general code cleanup in this commit. Various dbuf stats were removed because they are no longer necessary, and unused functions were removed to make the Direct I/O code paths cleaner.

Below is the race condition stack trace that was consistently observed in the CI runs for the dio_random test case and that prompted these changes:

[ 7795.294473] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 9954.769075] INFO: task z_wr_int:1051869 blocked for more than 120 seconds.
[ 9954.770512] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.772159] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.773848] task:z_wr_int state:D stack:0 pid:1051869 ppid:2 flags:0x80004080
[ 9954.775512] Call Trace:
[ 9954.776406] __schedule+0x2d1/0x870
[ 9954.777386] ? free_one_page+0x204/0x530
[ 9954.778466] schedule+0x55/0xf0
[ 9954.779355] cv_wait_common+0x16d/0x280 [spl]
[ 9954.780491] ? finish_wait+0x80/0x80
[ 9954.781450] dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs]
[ 9954.782889] dmu_write_direct_done+0x90/0x3b0 [zfs]
[ 9954.784255] zio_done+0x373/0x1d50 [zfs]
[ 9954.785410] zio_execute+0xee/0x210 [zfs]
[ 9954.786588] taskq_thread+0x205/0x3f0 [spl]
[ 9954.787673] ? wake_up_q+0x60/0x60
[ 9954.788571] ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs]
[ 9954.790079] ? taskq_lowest_id+0xc0/0xc0 [spl]
[ 9954.791199] kthread+0x134/0x150
[ 9954.792082] ? set_kthread_struct+0x50/0x50
[ 9954.793189] ret_from_fork+0x35/0x40
[ 9954.794108] INFO: task txg_sync:1051894 blocked for more than 120 seconds.
[ 9954.795535] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.797103] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.798669] task:txg_sync state:D stack:0 pid:1051894 ppid:2 flags:0x80004080
[ 9954.800267] Call Trace:
[ 9954.801096] __schedule+0x2d1/0x870
[ 9954.801972] ? __wake_up_common+0x7a/0x190
[ 9954.802963] schedule+0x55/0xf0
[ 9954.803884] schedule_timeout+0x19f/0x320
[ 9954.804837] ? __next_timer_interrupt+0xf0/0xf0
[ 9954.805932] ? taskq_dispatch+0xab/0x280 [spl]
[ 9954.806959] io_schedule_timeout+0x19/0x40
[ 9954.807989] __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 9954.809110] ? finish_wait+0x80/0x80
[ 9954.810068] __cv_timedwait_io+0x15/0x20 [spl]
[ 9954.811103] zio_wait+0x1ad/0x4f0 [zfs]
[ 9954.812255] dsl_pool_sync+0xcb/0x6c0 [zfs]
[ 9954.813442] ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[ 9954.814648] spa_sync_iterate_to_convergence+0xcb/0x310 [zfs]
[ 9954.816023] spa_sync+0x362/0x8f0 [zfs]
[ 9954.817110] txg_sync_thread+0x27a/0x3b0 [zfs]
[ 9954.818267] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[ 9954.819510] ? spl_assert.constprop.0+0x20/0x20 [spl]
[ 9954.820643] thread_generic_wrapper+0x63/0x90 [spl]
[ 9954.821709] kthread+0x134/0x150
[ 9954.822590] ? set_kthread_struct+0x50/0x50
[ 9954.823584] ret_from_fork+0x35/0x40
[ 9954.824444] INFO: task fio:1055501 blocked for more than 120 seconds.
[ 9954.825781] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.827315] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.828871] task:fio state:D stack:0 pid:1055501 ppid:1055490 flags:0x00004080
[ 9954.830463] Call Trace:
[ 9954.831280] __schedule+0x2d1/0x870
[ 9954.832159] ? dbuf_hold_copy+0xec/0x230 [zfs]
[ 9954.833396] schedule+0x55/0xf0
[ 9954.834286] cv_wait_common+0x16d/0x280 [spl]
[ 9954.835291] ? finish_wait+0x80/0x80
[ 9954.836235] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 9954.837543] zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 9954.838838] zfs_get_data+0x566/0x810 [zfs]
[ 9954.840034] zil_lwb_commit+0x194/0x3f0 [zfs]
[ 9954.841154] zil_lwb_write_issue+0x68/0xb90 [zfs]
[ 9954.842367] ? __list_add+0x12/0x30 [zfs]
[ 9954.843496] ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 9954.844665] ? zil_alloc_lwb+0x217/0x360 [zfs]
[ 9954.845852] zil_commit_waiter_timeout+0x1f3/0x570 [zfs]
[ 9954.847203] zil_commit_waiter+0x1d2/0x3b0 [zfs]
[ 9954.848380] zil_commit_impl+0x6d/0xd0 [zfs]
[ 9954.849550] zfs_fsync+0x66/0x90 [zfs]
[ 9954.850640] zpl_fsync+0xe5/0x140 [zfs]
[ 9954.851729] do_fsync+0x38/0x70
[ 9954.852585] __x64_sys_fsync+0x10/0x20
[ 9954.853486] do_syscall_64+0x5b/0x1b0
[ 9954.854416] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.855466] RIP: 0033:0x7eff236bb057
[ 9954.856388] Code: Unable to access opcode bytes at RIP 0x7eff236bb02d.
[ 9954.857651] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[ 9954.859141] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007eff236bb057
[ 9954.860496] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI: 0000000000000006
[ 9954.861945] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09: 0000000000000000
[ 9954.863327] R10: 0000000000056000 R11: 0000000000000293 R12: 0000000000000003
[ 9954.864765] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15: 000055e4d1f13ae8
[ 9954.866149] INFO: task fio:1055502 blocked for more than 120 seconds.
[ 9954.867490] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.869029] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.870571] task:fio state:D stack:0 pid:1055502 ppid:1055490 flags:0x00004080
[ 9954.872162] Call Trace:
[ 9954.872947] __schedule+0x2d1/0x870
[ 9954.873844] schedule+0x55/0xf0
[ 9954.874716] schedule_timeout+0x19f/0x320
[ 9954.875645] ? __next_timer_interrupt+0xf0/0xf0
[ 9954.876722] io_schedule_timeout+0x19/0x40
[ 9954.877677] __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 9954.878822] ? finish_wait+0x80/0x80
[ 9954.879694] __cv_timedwait_io+0x15/0x20 [spl]
[ 9954.880763] zio_wait+0x1ad/0x4f0 [zfs]
[ 9954.881865] dmu_write_abd+0x174/0x1c0 [zfs]
[ 9954.883074] dmu_write_uio_direct+0x79/0x100 [zfs]
[ 9954.884285] dmu_write_uio_dnode+0xb2/0x320 [zfs]
[ 9954.885507] dmu_write_uio_dbuf+0x47/0x60 [zfs]
[ 9954.886687] zfs_write+0x581/0xe20 [zfs]
[ 9954.887822] ? iov_iter_get_pages+0xe9/0x390
[ 9954.888862] ? trylock_page+0xd/0x20 [zfs]
[ 9954.890005] ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 9954.891217] ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 9954.892391] zpl_iter_write_direct+0xd4/0x170 [zfs]
[ 9954.893663] ? rrw_exit+0xc6/0x200 [zfs]
[ 9954.894764] zpl_iter_write+0xd5/0x110 [zfs]
[ 9954.895911] new_sync_write+0x112/0x160
[ 9954.896881] vfs_write+0xa5/0x1b0
[ 9954.897701] ksys_write+0x4f/0xb0
[ 9954.898569] do_syscall_64+0x5b/0x1b0
[ 9954.899417] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.900515] RIP: 0033:0x7eff236baa47
[ 9954.901363] Code: Unable to access opcode bytes at RIP 0x7eff236baa1d.
[ 9954.902673] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[ 9954.904099] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007eff236baa47
[ 9954.905535] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI: 0000000000000005
[ 9954.906902] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09: 0000000000000000
[ 9954.908339] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000e4000
[ 9954.909705] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15: 000055e4d1f13ae8
[ 9954.911129] INFO: task fio:1055504 blocked for more than 120 seconds.
[ 9954.912381] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.913978] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.915434] task:fio state:D stack:0 pid:1055504 ppid:1055493 flags:0x00000080
[ 9954.917082] Call Trace:
[ 9954.917773] __schedule+0x2d1/0x870
[ 9954.918648] ? zilog_dirty+0x4f/0xc0 [zfs]
[ 9954.919831] schedule+0x55/0xf0
[ 9954.920717] cv_wait_common+0x16d/0x280 [spl]
[ 9954.921704] ? finish_wait+0x80/0x80
[ 9954.922639] zfs_rangelock_enter_writer+0x46/0x1c0 [zfs]
[ 9954.923940] zfs_rangelock_enter_impl+0x12a/0x1b0 [zfs]
[ 9954.925306] zfs_write+0x703/0xe20 [zfs]
[ 9954.926406] zpl_iter_write_buffered+0xb2/0x120 [zfs]
[ 9954.927687] ? rrw_exit+0xc6/0x200 [zfs]
[ 9954.928821] zpl_iter_write+0xbe/0x110 [zfs]
[ 9954.930028] new_sync_write+0x112/0x160
[ 9954.930913] vfs_write+0xa5/0x1b0
[ 9954.931758] ksys_write+0x4f/0xb0
[ 9954.932666] do_syscall_64+0x5b/0x1b0
[ 9954.933544] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.934689] RIP: 0033:0x7fcaee8f0a47
[ 9954.935551] Code: Unable to access opcode bytes at RIP 0x7fcaee8f0a1d.
[ 9954.936893] RSP: 002b:00007fff56b2c240 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[ 9954.938327] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fcaee8f0a47
[ 9954.939777] RDX: 000000000001d000 RSI: 00007fca8300b010 RDI: 0000000000000006
[ 9954.941187] RBP: 00007fca8300b010 R08: 0000000000000000 R09: 0000000000000000
[ 9954.942655] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000001d000
[ 9954.944062] R13: 0000557a2006bac0 R14: 000000000001d000 R15: 0000557a2006bae8
[ 9954.945525] INFO: task fio:1055505 blocked for more than 120 seconds.
[ 9954.946819] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.948466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.949959] task:fio state:D stack:0 pid:1055505 ppid:1055493 flags:0x00004080
[ 9954.951653] Call Trace:
[ 9954.952417] __schedule+0x2d1/0x870
[ 9954.953393] ? finish_wait+0x3e/0x80
[ 9954.954315] schedule+0x55/0xf0
[ 9954.955212] cv_wait_common+0x16d/0x280 [spl]
[ 9954.956211] ? finish_wait+0x80/0x80
[ 9954.957159] zil_commit_waiter+0xfa/0x3b0 [zfs]
[ 9954.958343] zil_commit_impl+0x6d/0xd0 [zfs]
[ 9954.959524] zfs_fsync+0x66/0x90 [zfs]
[ 9954.960626] zpl_fsync+0xe5/0x140 [zfs]
[ 9954.961763] do_fsync+0x38/0x70
[ 9954.962638] __x64_sys_fsync+0x10/0x20
[ 9954.963520] do_syscall_64+0x5b/0x1b0
[ 9954.964470] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.965567] RIP: 0033:0x7fcaee8f1057
[ 9954.966490] Code: Unable to access opcode bytes at RIP 0x7fcaee8f102d.
[ 9954.967752] RSP: 002b:00007fff56b2c230 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[ 9954.969260] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fcaee8f1057
[ 9954.970628] RDX: 0000000000000000 RSI: 0000557a2006bac0 RDI: 0000000000000005
[ 9954.972092] RBP: 00007fca84152a18 R08: 0000000000000000 R09: 0000000000000000
[ 9954.973484] R10: 0000000000035000 R11: 0000000000000293 R12: 0000000000000003
[ 9954.974958] R13: 0000557a2006bac0 R14: 0000000000000000 R15: 0000557a2006bae8
[10077.648150] INFO: task z_wr_int:1051869 blocked for more than 120 seconds.
[10077.649541] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1
[10077.651116] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10077.652782] task:z_wr_int state:D stack:0 pid:1051869 ppid:2 flags:0x80004080
[10077.654420] Call Trace:
[10077.655267] __schedule+0x2d1/0x870
[10077.656179] ? free_one_page+0x204/0x530
[10077.657192] schedule+0x55/0xf0
[10077.658004] cv_wait_common+0x16d/0x280 [spl]
[10077.659018] ? finish_wait+0x80/0x80
[10077.660013] dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs]
[10077.661396] dmu_write_direct_done+0x90/0x3b0 [zfs]
[10077.662617] zio_done+0x373/0x1d50 [zfs]
[10077.663783] zio_execute+0xee/0x210 [zfs]
[10077.664921] taskq_thread+0x205/0x3f0 [spl]
[10077.665982] ? wake_up_q+0x60/0x60
[10077.666842] ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs]
[10077.668295] ? taskq_lowest_id+0xc0/0xc0 [spl]
[10077.669360] kthread+0x134/0x150
[10077.670191] ? set_kthread_struct+0x50/0x50
[10077.671209] ret_from_fork+0x35/0x40
[10077.672076] INFO: task txg_sync:1051894 blocked for more than 120 seconds.
[10077.673467] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1
[10077.675112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10077.676612] task:txg_sync state:D stack:0 pid:1051894 ppid:2 flags:0x80004080
[10077.678288] Call Trace:
[10077.679024] __schedule+0x2d1/0x870
[10077.679948] ? __wake_up_common+0x7a/0x190
[10077.681042] schedule+0x55/0xf0
[10077.681899] schedule_timeout+0x19f/0x320
[10077.682951] ? __next_timer_interrupt+0xf0/0xf0
[10077.684005] ? taskq_dispatch+0xab/0x280 [spl]
[10077.685085] io_schedule_timeout+0x19/0x40
[10077.686080] __cv_timedwait_common+0x19e/0x2c0 [spl]
[10077.687227] ? finish_wait+0x80/0x80
[10077.688123] __cv_timedwait_io+0x15/0x20 [spl]
[10077.689206] zio_wait+0x1ad/0x4f0 [zfs]
[10077.690300] dsl_pool_sync+0xcb/0x6c0 [zfs]
[10077.691435] ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[10077.692636] spa_sync_iterate_to_convergence+0xcb/0x310 [zfs]
[10077.693997] spa_sync+0x362/0x8f0 [zfs]
[10077.695112] txg_sync_thread+0x27a/0x3b0 [zfs]
[10077.696239] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[10077.697512] ? spl_assert.constprop.0+0x20/0x20 [spl]
[10077.698639] thread_generic_wrapper+0x63/0x90 [spl]
[10077.699687] kthread+0x134/0x150
[10077.700567] ? set_kthread_struct+0x50/0x50
[10077.701502] ret_from_fork+0x35/0x40
[10077.702430] INFO: task fio:1055501 blocked for more than 120 seconds.
[10077.703697] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1
[10077.705309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10077.706780] task:fio state:D stack:0 pid:1055501 ppid:1055490 flags:0x00004080
[10077.708479] Call Trace:
[10077.709231] __schedule+0x2d1/0x870
[10077.710190] ? dbuf_hold_copy+0xec/0x230 [zfs]
[10077.711368] schedule+0x55/0xf0
[10077.712286] cv_wait_common+0x16d/0x280 [spl]
[10077.713316] ? finish_wait+0x80/0x80
[10077.714262] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[10077.715566] zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[10077.716878] zfs_get_data+0x566/0x810 [zfs]
[10077.718032] zil_lwb_commit+0x194/0x3f0 [zfs]
[10077.719234] zil_lwb_write_issue+0x68/0xb90 [zfs]
[10077.720413] ? __list_add+0x12/0x30 [zfs]
[10077.721525] ? __raw_spin_unlock+0x5/0x10 [zfs]
[10077.722708] ? zil_alloc_lwb+0x217/0x360 [zfs]
[10077.723931] zil_commit_waiter_timeout+0x1f3/0x570 [zfs]
[10077.725273] zil_commit_waiter+0x1d2/0x3b0 [zfs]
[10077.726438] zil_commit_impl+0x6d/0xd0 [zfs]
[10077.727586] zfs_fsync+0x66/0x90 [zfs]
[10077.728675] zpl_fsync+0xe5/0x140 [zfs]
[10077.729755] do_fsync+0x38/0x70
[10077.730607] __x64_sys_fsync+0x10/0x20
[10077.731482] do_syscall_64+0x5b/0x1b0
[10077.732415] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[10077.733487] RIP: 0033:0x7eff236bb057
[10077.734399] Code: Unable to access opcode bytes at RIP 0x7eff236bb02d.
[10077.735657] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[10077.737163] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007eff236bb057
[10077.738526] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI: 0000000000000006
[10077.739966] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09: 0000000000000000
[10077.741336] R10: 0000000000056000 R11: 0000000000000293 R12: 0000000000000003
[10077.742773] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15: 000055e4d1f13ae8
[10077.744168] INFO: task fio:1055502 blocked for more than 120 seconds.
[10077.745505] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1
[10077.747073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10077.748642] task:fio state:D stack:0 pid:1055502 ppid:1055490 flags:0x00004080
[10077.750233] Call Trace:
[10077.751011] __schedule+0x2d1/0x870
[10077.751915] schedule+0x55/0xf0
[10077.752811] schedule_timeout+0x19f/0x320
[10077.753762] ? __next_timer_interrupt+0xf0/0xf0
[10077.754824] io_schedule_timeout+0x19/0x40
[10077.755782] __cv_timedwait_common+0x19e/0x2c0 [spl]
[10077.756922] ? finish_wait+0x80/0x80
[10077.757788] __cv_timedwait_io+0x15/0x20 [spl]
[10077.758845] zio_wait+0x1ad/0x4f0 [zfs]
[10077.759941] dmu_write_abd+0x174/0x1c0 [zfs]
[10077.761144] dmu_write_uio_direct+0x79/0x100 [zfs]
[10077.762327] dmu_write_uio_dnode+0xb2/0x320 [zfs]
[10077.763523] dmu_write_uio_dbuf+0x47/0x60 [zfs]
[10077.764749] zfs_write+0x581/0xe20 [zfs]
[10077.765825] ? iov_iter_get_pages+0xe9/0x390
[10077.766842] ? trylock_page+0xd/0x20 [zfs]
[10077.767956] ? __raw_spin_unlock+0x5/0x10 [zfs]
[10077.769189] ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[10077.770343] zpl_iter_write_direct+0xd4/0x170 [zfs]
[10077.771570] ? rrw_exit+0xc6/0x200 [zfs]
[10077.772674] zpl_iter_write+0xd5/0x110 [zfs]
[10077.773834] new_sync_write+0x112/0x160
[10077.774805] vfs_write+0xa5/0x1b0
[10077.775634] ksys_write+0x4f/0xb0
[10077.776526] do_syscall_64+0x5b/0x1b0
[10077.777386] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[10077.778488] RIP: 0033:0x7eff236baa47
[10077.779339] Code: Unable to access opcode bytes at RIP 0x7eff236baa1d.
[10077.780655] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[10077.782056] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007eff236baa47
[10077.783507] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI: 0000000000000005
[10077.784890] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09: 0000000000000000
[10077.786303] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000e4000
[10077.787637] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15: 000055e4d1f13ae8

Signed-off-by: Brian Atkinson <[email protected]>
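
For orientation, the heart of the new dmu_buf_will_clone_or_dio() added in module/zfs/dbuf.c below can be condensed to the following sketch. This is a trimmed illustration, not the verbatim implementation: the asserts, the space-accounting commentary, and the DTrace state tracking from the actual hunk are omitted.

void
dmu_buf_will_clone_or_dio(dmu_buf_t *db_fake, dmu_tx_t *tx)
{
	dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;

	mutex_enter(&db->db_mtx);

	/* 1. Undirty any dirty record for this block in the current TXG. */
	VERIFY3B(dbuf_undirty(db, tx), ==, B_FALSE);

	if (db->db_buf != NULL) {
		/*
		 * 2. Destroy the ARC buffer only if the previous dirty
		 * record's dr_data does not still reference it.
		 */
		dbuf_dirty_record_t *dr = list_head(&db->db_dirty_records);
		if (dr == NULL || dr->dt.dl.dr_data != db->db_buf)
			arc_buf_destroy(db->db_buf, db);

		/*
		 * 3. NULL out the data pointers so future reads go down
		 * to the devices for the latest version of the data.
		 */
		db->db_buf = NULL;
		dbuf_clear_data(db);
	}

	/* 4. Redirty the dbuf as DB_NOFILL before the write is issued. */
	db->db_state = DB_NOFILL;

	mutex_exit(&db->db_mtx);
}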
1 parent e6e5b4c commit 9909121

6 files changed: +170 -334 lines changed

include/sys/dbuf.h

Lines changed: 1 addition & 12 deletions
@@ -330,14 +330,6 @@ typedef struct dmu_buf_impl {

	/* The buffer was partially read. More reads may follow. */
	uint8_t db_partial_read;
-
-	/*
-	 * This block is being held under a writer rangelock of a Direct I/O
-	 * write that is waiting for previous buffered writes to synced out
-	 * due to mixed buffered and O_DIRECT operations. This is needed to
-	 * check whether to grab the rangelock in zfs_get_data().
-	 */
-	uint8_t db_mixed_io_dio_wait;
} dmu_buf_impl_t;

#define DBUF_HASH_MUTEX(h, idx) \
@@ -386,7 +378,7 @@ dmu_buf_impl_t *dbuf_find(struct objset *os, uint64_t object, uint8_t level,
    uint64_t blkid, uint64_t *hash_out);

int dbuf_read(dmu_buf_impl_t *db, zio_t *zio, uint32_t flags);
-void dmu_buf_will_clone(dmu_buf_t *db, dmu_tx_t *tx);
+void dmu_buf_will_clone_or_dio(dmu_buf_t *db, dmu_tx_t *tx);
void dmu_buf_will_not_fill(dmu_buf_t *db, dmu_tx_t *tx);
void dmu_buf_will_fill(dmu_buf_t *db, dmu_tx_t *tx, boolean_t canfail);
boolean_t dmu_buf_fill_done(dmu_buf_t *db, dmu_tx_t *tx, boolean_t failed);
@@ -395,9 +387,6 @@ dbuf_dirty_record_t *dbuf_dirty(dmu_buf_impl_t *db, dmu_tx_t *tx);
dbuf_dirty_record_t *dbuf_dirty_lightweight(dnode_t *dn, uint64_t blkid,
    dmu_tx_t *tx);
boolean_t dbuf_undirty(dmu_buf_impl_t *db, dmu_tx_t *tx);
-void dmu_buf_direct_mixed_io_wait(dmu_buf_impl_t *db, uint64_t txg,
-    boolean_t read);
-void dmu_buf_undirty(dmu_buf_impl_t *db, dmu_tx_t *tx);
blkptr_t *dmu_buf_get_bp_from_dbuf(dmu_buf_impl_t *db);
int dmu_buf_untransform_direct(dmu_buf_impl_t *db, spa_t *spa);
arc_buf_t *dbuf_loan_arcbuf(dmu_buf_impl_t *db);

include/sys/dmu.h

Lines changed: 0 additions & 1 deletion
@@ -1078,7 +1078,6 @@ typedef struct zgd {
	struct blkptr *zgd_bp;
	dmu_buf_t *zgd_db;
	struct zfs_locked_range *zgd_lr;
-	boolean_t zgd_grabbed_rangelock;
	void *zgd_private;
} zgd_t;

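With the Direct I/O wait gone, zfs_get_data() no longer has to check whether a sleeping Direct I/O writer already holds the rangelock (the check that the removed db_mixed_io_dio_wait flag existed for), so the zgd_grabbed_rangelock bookkeeping field can be dropped along with it.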
module/zfs/dbuf.c

Lines changed: 72 additions & 148 deletions
@@ -85,10 +85,7 @@ typedef struct dbuf_stats {
	/*
	 * Statistics for Direct I/O.
	 */
-	kstat_named_t direct_mixed_io_read_wait;
-	kstat_named_t direct_mixed_io_write_wait;
	kstat_named_t direct_sync_wait;
-	kstat_named_t direct_undirty;
	/*
	 * Statistics about the dbuf hash table.
	 */
@@ -137,10 +134,7 @@ dbuf_stats_t dbuf_stats = {
	{ "cache_total_evicts", KSTAT_DATA_UINT64 },
	{ { "cache_levels_N", KSTAT_DATA_UINT64 } },
	{ { "cache_levels_bytes_N", KSTAT_DATA_UINT64 } },
-	{ "direct_mixed_io_read_wait", KSTAT_DATA_UINT64 },
-	{ "direct_mixed_io_write_wait", KSTAT_DATA_UINT64 },
	{ "direct_sync_wait", KSTAT_DATA_UINT64 },
-	{ "direct_undirty", KSTAT_DATA_UINT64 },
	{ "hash_hits", KSTAT_DATA_UINT64 },
	{ "hash_misses", KSTAT_DATA_UINT64 },
	{ "hash_collisions", KSTAT_DATA_UINT64 },
@@ -162,10 +156,7 @@ struct {
	wmsum_t cache_total_evicts;
	wmsum_t cache_levels[DN_MAX_LEVELS];
	wmsum_t cache_levels_bytes[DN_MAX_LEVELS];
-	wmsum_t direct_mixed_io_read_wait;
-	wmsum_t direct_mixed_io_write_wait;
	wmsum_t direct_sync_wait;
-	wmsum_t direct_undirty;
	wmsum_t hash_hits;
	wmsum_t hash_misses;
	wmsum_t hash_collisions;
@@ -911,14 +902,8 @@ dbuf_kstat_update(kstat_t *ksp, int rw)
		ds->cache_levels_bytes[i].value.ui64 =
		    wmsum_value(&dbuf_sums.cache_levels_bytes[i]);
	}
-	ds->direct_mixed_io_read_wait.value.ui64 =
-	    wmsum_value(&dbuf_sums.direct_mixed_io_read_wait);
-	ds->direct_mixed_io_write_wait.value.ui64 =
-	    wmsum_value(&dbuf_sums.direct_mixed_io_write_wait);
	ds->direct_sync_wait.value.ui64 =
	    wmsum_value(&dbuf_sums.direct_sync_wait);
-	ds->direct_undirty.value.ui64 =
-	    wmsum_value(&dbuf_sums.direct_undirty);
	ds->hash_hits.value.ui64 =
	    wmsum_value(&dbuf_sums.hash_hits);
	ds->hash_misses.value.ui64 =
@@ -1021,10 +1006,7 @@ dbuf_init(void)
		wmsum_init(&dbuf_sums.cache_levels[i], 0);
		wmsum_init(&dbuf_sums.cache_levels_bytes[i], 0);
	}
-	wmsum_init(&dbuf_sums.direct_mixed_io_read_wait, 0);
-	wmsum_init(&dbuf_sums.direct_mixed_io_write_wait, 0);
	wmsum_init(&dbuf_sums.direct_sync_wait, 0);
-	wmsum_init(&dbuf_sums.direct_undirty, 0);
	wmsum_init(&dbuf_sums.hash_hits, 0);
	wmsum_init(&dbuf_sums.hash_misses, 0);
	wmsum_init(&dbuf_sums.hash_collisions, 0);
@@ -1097,10 +1079,7 @@ dbuf_fini(void)
		wmsum_fini(&dbuf_sums.cache_levels[i]);
		wmsum_fini(&dbuf_sums.cache_levels_bytes[i]);
	}
-	wmsum_fini(&dbuf_sums.direct_mixed_io_read_wait);
-	wmsum_fini(&dbuf_sums.direct_mixed_io_write_wait);
	wmsum_fini(&dbuf_sums.direct_sync_wait);
-	wmsum_fini(&dbuf_sums.direct_undirty);
	wmsum_fini(&dbuf_sums.hash_hits);
	wmsum_fini(&dbuf_sums.hash_misses);
	wmsum_fini(&dbuf_sums.hash_collisions);
@@ -1271,9 +1250,8 @@ dbuf_clear_data(dmu_buf_impl_t *db)
{
	ASSERT(MUTEX_HELD(&db->db_mtx));
	dbuf_evict_user(db);
-	/* Direct I/O writes may have data */
-	if (db->db_buf == NULL)
-		db->db.db_data = NULL;
+	ASSERT3P(db->db_buf, ==, NULL);
+	db->db.db_data = NULL;
	if (db->db_state != DB_NOFILL) {
		db->db_state = DB_UNCACHED;
		DTRACE_SET_STATE(db, "clear data");
@@ -1733,10 +1711,11 @@ dbuf_read_impl(dmu_buf_impl_t *db, dnode_t *dn, zio_t *zio, uint32_t flags,
 * have been modified in a previous transaction group before we access them in
 * the current active group.
 *
- * This function is used in three places: when we are dirtying a buffer for the
+ * This function is used in four places: when we are dirtying a buffer for the
 * first time in a txg, when we are freeing a range in a dnode that includes
- * this buffer, and when we are accessing a buffer which was received compressed
- * and later referenced in a WRITE_BYREF record.
+ * this buffer, when doing block cloning or issuing a Direct I/O write with
+ * abuffer, and when we are accessing a buffer which was received compressed and
+ * later referenced in a WRITE_BYREF record.
 *
 * Note that when we are called from dbuf_free_range() we do not put a hold on
 * the buffer, we just traverse the active dbuf list for the dnode.
@@ -2789,93 +2768,6 @@ dmu_buf_is_dirty(dmu_buf_t *db_fake, dmu_tx_t *tx)
	return (dr != NULL);
}

-void
-dmu_buf_direct_mixed_io_wait(dmu_buf_impl_t *db, uint64_t txg, boolean_t read)
-{
-	ASSERT(MUTEX_HELD(&db->db_mtx));
-
-	if (read == B_TRUE) {
-		/*
-		 * If a buffered read is in process, a Direct I/O read will
-		 * wait for the buffered I/O to complete.
-		 */
-		ASSERT3U(txg, ==, 0);
-		while (db->db_state == DB_READ) {
-			DBUF_STAT_BUMP(direct_mixed_io_read_wait);
-			cv_wait(&db->db_changed, &db->db_mtx);
-		}
-	} else {
-		/*
-		 * There must be an ARC buf associated with this Direct I/O
-		 * write otherwise there is no reason to wait for previous
-		 * dirty records to sync out.
-		 *
-		 * The db_state will temporarily be set to DB_CACHED so that
-		 * that any synchronous writes issued through the ZIL will
-		 * still be handled properly. In particular, the call to
-		 * dbuf_read() in dmu_sync_late_arrival() must account for the
-		 * data still being in the ARC. After waiting here for previous
-		 * TXGs to sync out, dmu_write_direct_done() will update the
-		 * db_state.
-		 */
-		ASSERT3P(db->db_buf, !=, NULL);
-		ASSERT3U(txg, >, 0);
-		db->db_mixed_io_dio_wait = TRUE;
-		db->db_state = DB_CACHED;
-		while (dbuf_find_dirty_lte(db, txg) != NULL) {
-			DBUF_STAT_BUMP(direct_mixed_io_write_wait);
-			cv_wait(&db->db_changed, &db->db_mtx);
-		}
-		db->db_mixed_io_dio_wait = FALSE;
-	}
-}
-
-/*
- * Direct I/O writes may need to undirty the open-context dirty record
- * associated with it in the event of an I/O error.
- */
-void
-dmu_buf_undirty(dmu_buf_impl_t *db, dmu_tx_t *tx)
-{
-	/*
-	 * Direct I/O writes always happen in open-context.
-	 */
-	ASSERT(!dmu_tx_is_syncing(tx));
-	ASSERT(MUTEX_HELD(&db->db_mtx));
-	ASSERT(db->db_state == DB_NOFILL || db->db_state == DB_UNCACHED);
-
-
-	/*
-	 * In the event of an I/O error we will handle the metaslab clean up in
-	 * zio_done(). Also, the dirty record's dr_overridden_by BP is not
-	 * currently set as that is done in dmu_sync_done(). Since the db_state
-	 * is still set to DB_NOFILL, dbuf_unoverride() will not be called in
-	 * dbuf_undirty() and the dirty record's BP will not be added the SPA's
-	 * spa_free_bplist via zio_free().
-	 *
-	 * This function can also be called in the event that a Direct I/O
-	 * write is overwriting a previous Direct I/O to the same block for
-	 * this TXG. It is important to go ahead and free up the space
-	 * accounting in this case through dbuf_undirty() -> dbuf_unoverride()
-	 * -> zio_free(). This is necessary because the space accounting for
-	 * determining if a write can occur in zfs_write() happens through
-	 * dmu_tx_assign(). This can cause an issue with Direct I/O writes in
-	 * the case of overwrites, because all DVA allocations are being done
-	 * in open-context. Constanstly allowing Direct I/O overwrites to the
-	 * same blocks can exhaust the pools available space leading to ENOSPC
-	 * errors at the DVA allcoation part of the ZIO pipeline, which will
-	 * eventually suspend the pool. By cleaning up space accounting now
-	 * the ENOSPC pool suspend can be avoided.
-	 *
-	 * Since we are undirtying the record for the Direct I/O in
-	 * open-context we must have a hold on the db, so it should never be
-	 * evicted after calling dbuf_undirty().
-	 */
-	VERIFY3B(dbuf_undirty(db, tx), ==, B_FALSE);
-
-	DBUF_STAT_BUMP(direct_undirty);
-}
-
/*
 * Normally the db_blkptr points to the most recent on-disk content for the
 * dbuf (and anything newer will be cached in the dbuf). However, a recent
@@ -2951,27 +2843,71 @@ dmu_buf_untransform_direct(dmu_buf_impl_t *db, spa_t *spa)
}

void
-dmu_buf_will_clone(dmu_buf_t *db_fake, dmu_tx_t *tx)
+dmu_buf_will_clone_or_dio(dmu_buf_t *db_fake, dmu_tx_t *tx)
{
	dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;

	/*
-	 * Block cloning: We are going to clone into this block, so undirty
-	 * modifications done to this block so far in this txg. This includes
-	 * writes and clones into this block.
+	 * Block clones and Direct I/O writes always happen in open-context.
	 */
+	ASSERT(!dmu_tx_is_syncing(tx));
+	ASSERT(db->db_level == 0);
+	ASSERT(db->db_blkid != DMU_BONUS_BLKID);
+	ASSERT(db->db.db_object != DMU_META_DNODE_OBJECT);
+
	mutex_enter(&db->db_mtx);
	DBUF_VERIFY(db);
-	VERIFY(!dbuf_undirty(db, tx));
+
+	/*
+	 * We are going to clone or issue a Direct I/O write on this block, so
+	 * undirty modifications done to this block so far in this txg. This
+	 * includes writes and clones into this block.
+	 *
+	 * If there dirty record associated with this txg from a previous Direct
+	 * I/O write then space accounting cleanup takes place. It is important
+	 * to go ahead free up the space accounting through dbuf_undirty() ->
+	 * dbuf_unoverride() -> zio_free(). Space accountiung for determining
+	 * if a write can occur in zfs_write() happens through dmu_tx_assign().
+	 * This can cuase an issue with Direct I/O writes in the case of
+	 * overwriting the same block, because all DVA allocations are being
+	 * done in open-context. Constantly allowing Direct I/O overwrites to
+	 * the same block can exhaust the pools available space leading to
+	 * ENOSPC errors at the DVA allocation part of the ZIO pipeline, which
+	 * will eventually suspend the pool. By cleaning up sapce acccounting
+	 * now, the ENOSPC error can be avoided.
+	 *
+	 * Since we are undirtying the record in open-context, we must have a
+	 * hold on the db, so it should never be evicted after calling
+	 * dbuf_undirty().
+	 */
+	VERIFY3B(dbuf_undirty(db, tx), ==, B_FALSE);
	ASSERT0P(dbuf_find_dirty_eq(db, tx->tx_txg));
+
	if (db->db_buf != NULL) {
-		arc_buf_destroy(db->db_buf, db);
+		/*
+		 * If there is an associated ARC buffer with this dbuf we can
+		 * only destroy it if the previous dirty record does not
+		 * reference it.
+		 */
+		dbuf_dirty_record_t *dr = list_head(&db->db_dirty_records);
+		if (dr == NULL || dr->dt.dl.dr_data != db->db_buf)
+			arc_buf_destroy(db->db_buf, db);
+
+		/*
+		 * Setting the dbuf's data pointers to NULL will force all
+		 * future reads down to the devices to get the most up to date
+		 * version of the data after a Direct I/O write has completed.
+		 */
		db->db_buf = NULL;
		dbuf_clear_data(db);
	}

+	ASSERT3P(db->db_buf, ==, NULL);
+	ASSERT3P(db->db.db_data, ==, NULL);
+
	db->db_state = DB_NOFILL;
-	DTRACE_SET_STATE(db, "allocating NOFILL buffer for clone");
+	DTRACE_SET_STATE(db,
+	    "allocating NOFILL buffer for clone or direct I/O write");

	DBUF_VERIFY(db);
	mutex_exit(&db->db_mtx);
@@ -3510,7 +3446,6 @@ dbuf_create(dnode_t *dn, uint8_t level, uint64_t blkid,
	db->db_user_immediate_evict = FALSE;
	db->db_freed_in_flight = FALSE;
	db->db_pending_evict = FALSE;
-	db->db_mixed_io_dio_wait = FALSE;

	if (blkid == DMU_BONUS_BLKID) {
		ASSERT3P(parent, ==, dn->dn_dbuf);
@@ -4766,25 +4701,6 @@ dbuf_sync_leaf(dbuf_dirty_record_t *dr, dmu_tx_t *tx)
	dprintf_dbuf_bp(db, db->db_blkptr, "blkptr=%p", db->db_blkptr);

	mutex_enter(&db->db_mtx);
-
-	/*
-	 * It is possible a buffered read has come in after a Direct I/O
-	 * write and is currently transistioning the db_state from DB_READ
-	 * in dbuf_read_impl() to another state in dbuf_read_done(). We
-	 * have to wait in order for the dbuf state to change from DB_READ
-	 * before syncing the dirty record of the Direct I/O write.
-	 */
-	if (db->db_state == DB_READ && !dr->dt.dl.dr_brtwrite) {
-		ASSERT3P(*datap, ==, NULL);
-		ASSERT3P(db->db_buf, ==, NULL);
-		ASSERT3P(db->db.db_data, ==, NULL);
-		ASSERT3U(dr->dt.dl.dr_override_state, ==, DR_OVERRIDDEN);
-		while (db->db_state == DB_READ) {
-			DBUF_STAT_BUMP(direct_sync_wait);
-			cv_wait(&db->db_changed, &db->db_mtx);
-		}
-	}
-
	/*
	 * To be synced, we must be dirtied. But we might have been freed
	 * after the dirty.
@@ -4797,13 +4713,21 @@ dbuf_sync_leaf(dbuf_dirty_record_t *dr, dmu_tx_t *tx)
		ASSERT(db->db.db_data != dr->dt.dl.dr_data);
	} else if (db->db_state == DB_READ) {
		/*
-		 * This buffer has a clone we need to write, and an in-flight
-		 * read on the BP we're about to clone. Its safe to issue the
-		 * write here because the read has already been issued and the
-		 * contents won't change.
+		 * This buffer was either cloned or had a Direct I/O write
+		 * occur and has an in-flgiht read on the BP. It is safe to
+		 * issue the write here, because the read has already been
+		 * issued and the contents won't change.
+		 *
+		 * We can verify the case of both the clone and Direct I/O
+		 * write by making sure the first dirty record for the dbuf
+		 * has no ARC buffer associated with it.
		 */
-		ASSERT(dr->dt.dl.dr_brtwrite &&
-		    dr->dt.dl.dr_override_state == DR_OVERRIDDEN);
+		dbuf_dirty_record_t *dr_head =
+		    list_head(&db->db_dirty_records);
+		ASSERT3P(db->db_buf, ==, NULL);
+		ASSERT3P(db->db.db_data, ==, NULL);
+		ASSERT3P(dr_head->dt.dl.dr_data, ==, NULL);
+		ASSERT3U(dr_head->dt.dl.dr_override_state, ==, DR_OVERRIDDEN);
	} else {
		ASSERT(db->db_state == DB_CACHED || db->db_state == DB_NOFILL);
	}
@@ -5503,7 +5427,7 @@ EXPORT_SYMBOL(dbuf_dirty);
EXPORT_SYMBOL(dmu_buf_set_crypt_params);
EXPORT_SYMBOL(dmu_buf_will_dirty);
EXPORT_SYMBOL(dmu_buf_is_dirty);
-EXPORT_SYMBOL(dmu_buf_will_clone);
+EXPORT_SYMBOL(dmu_buf_will_clone_or_dio);
EXPORT_SYMBOL(dmu_buf_will_not_fill);
EXPORT_SYMBOL(dmu_buf_will_fill);
EXPORT_SYMBOL(dmu_buf_fill_done);

module/zfs/dmu.c

Lines changed: 1 addition & 1 deletion
@@ -2439,7 +2439,7 @@ dmu_brt_clone(objset_t *os, uint64_t object, uint64_t offset, uint64_t length,
	ASSERT(db->db_blkid != DMU_SPILL_BLKID);
	ASSERT(BP_IS_HOLE(bp) || dbuf->db_size == BP_GET_LSIZE(bp));

-	dmu_buf_will_clone(dbuf, tx);
+	dmu_buf_will_clone_or_dio(dbuf, tx);

	mutex_enter(&db->db_mtx);

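dmu_write_direct() itself is not among the hunks shown on this page. Per the commit message, it now makes the same dmu_buf_will_clone_or_dio() call before issuing the write, so the block cloning path above and the Direct I/O write path both reach the ZIO pipeline with the dbuf already in DB_NOFILL and no ARC buffer attached.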