Commit 9909121
Fixing race condition with rangelocks
There existed a race condition between when a Direct I/O write could complete and when a sync operation was issued. If there was an ARC buffer associated with the dbuf, a Direct I/O write would sleep waiting on previous TXGs to sync out their dirty records associated with that dbuf. This was necessary in order to safely destroy the ARC buffer, because a previous dirty record's dr_data may point at the dbuf's db_buf. The main issue with this approach is that a Direct I/O write holds the rangelock across the entire block, so when a sync on that same block was issued and tried to grab the rangelock as reader, it was blocked indefinitely: the Direct I/O write that was now sleeping held that same rangelock as writer. This led to a complete deadlock.

This commit fixes the issue and removes the wait in dmu_write_direct_done(). The ARC buffer, if there is one associated with the dbuf, is now destroyed before the Direct I/O write is ever issued. This implementation heavily borrows from the block cloning implementation. A new function, dmu_buf_will_clone_or_dio(), is called in both dmu_write_direct() and dmu_brt_clone() and does the following (a condensed sketch follows the trace below):

1. Undirties a dirty record for the dbuf if one is currently associated with the current TXG.
2. Destroys the ARC buffer if the previous dirty record's dr_data does not point at the dbuf's ARC buffer (db_buf).
3. Sets the dbuf's data pointers to NULL.
4. Redirties the dbuf using db_state = DB_NOFILL.

As part of this commit, the dmu_write_direct_done() function was also cleaned up. dmu_sync_done() is now called before undirtying the dbuf dirty record associated with a failed Direct I/O write. This is the correct ordering and how it always should have been.

An additional benefit of these modifications is that a Direct I/O write no longer stalls when the user mixes buffered and O_DIRECT operations. They also unify the block cloning and Direct I/O write paths, as both need to call dbuf_fix_old_data() before destroying the ARC buffer.

There is also general code cleanup in this commit. Various dbuf stats were removed because they are no longer necessary, and unused functions were removed to make the Direct I/O code paths cleaner.

Below is the race condition stack trace that was consistently observed in the CI runs for the dio_random test case and that prompted these changes:

[ 7795.294473] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[ 9954.769075] INFO: task z_wr_int:1051869 blocked for more than 120 seconds.
[ 9954.770512] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.772159] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.773848] task:z_wr_int state:D stack:0 pid:1051869 ppid:2 flags:0x80004080
[ 9954.775512] Call Trace:
[ 9954.776406] __schedule+0x2d1/0x870
[ 9954.777386] ? free_one_page+0x204/0x530
[ 9954.778466] schedule+0x55/0xf0
[ 9954.779355] cv_wait_common+0x16d/0x280 [spl]
[ 9954.780491] ? finish_wait+0x80/0x80
[ 9954.781450] dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs]
[ 9954.782889] dmu_write_direct_done+0x90/0x3b0 [zfs]
[ 9954.784255] zio_done+0x373/0x1d50 [zfs]
[ 9954.785410] zio_execute+0xee/0x210 [zfs]
[ 9954.786588] taskq_thread+0x205/0x3f0 [spl]
[ 9954.787673] ? wake_up_q+0x60/0x60
[ 9954.788571] ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs]
[ 9954.790079] ? taskq_lowest_id+0xc0/0xc0 [spl]
[ 9954.791199] kthread+0x134/0x150
[ 9954.792082] ? set_kthread_struct+0x50/0x50
[ 9954.793189] ret_from_fork+0x35/0x40
[ 9954.794108] INFO: task txg_sync:1051894 blocked for more than 120 seconds.
[ 9954.795535] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.797103] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.798669] task:txg_sync state:D stack:0 pid:1051894 ppid:2 flags:0x80004080
[ 9954.800267] Call Trace:
[ 9954.801096] __schedule+0x2d1/0x870
[ 9954.801972] ? __wake_up_common+0x7a/0x190
[ 9954.802963] schedule+0x55/0xf0
[ 9954.803884] schedule_timeout+0x19f/0x320
[ 9954.804837] ? __next_timer_interrupt+0xf0/0xf0
[ 9954.805932] ? taskq_dispatch+0xab/0x280 [spl]
[ 9954.806959] io_schedule_timeout+0x19/0x40
[ 9954.807989] __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 9954.809110] ? finish_wait+0x80/0x80
[ 9954.810068] __cv_timedwait_io+0x15/0x20 [spl]
[ 9954.811103] zio_wait+0x1ad/0x4f0 [zfs]
[ 9954.812255] dsl_pool_sync+0xcb/0x6c0 [zfs]
[ 9954.813442] ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[ 9954.814648] spa_sync_iterate_to_convergence+0xcb/0x310 [zfs]
[ 9954.816023] spa_sync+0x362/0x8f0 [zfs]
[ 9954.817110] txg_sync_thread+0x27a/0x3b0 [zfs]
[ 9954.818267] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[ 9954.819510] ? spl_assert.constprop.0+0x20/0x20 [spl]
[ 9954.820643] thread_generic_wrapper+0x63/0x90 [spl]
[ 9954.821709] kthread+0x134/0x150
[ 9954.822590] ? set_kthread_struct+0x50/0x50
[ 9954.823584] ret_from_fork+0x35/0x40
[ 9954.824444] INFO: task fio:1055501 blocked for more than 120 seconds.
[ 9954.825781] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.827315] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.828871] task:fio state:D stack:0 pid:1055501 ppid:1055490 flags:0x00004080
[ 9954.830463] Call Trace:
[ 9954.831280] __schedule+0x2d1/0x870
[ 9954.832159] ? dbuf_hold_copy+0xec/0x230 [zfs]
[ 9954.833396] schedule+0x55/0xf0
[ 9954.834286] cv_wait_common+0x16d/0x280 [spl]
[ 9954.835291] ? finish_wait+0x80/0x80
[ 9954.836235] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[ 9954.837543] zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[ 9954.838838] zfs_get_data+0x566/0x810 [zfs]
[ 9954.840034] zil_lwb_commit+0x194/0x3f0 [zfs]
[ 9954.841154] zil_lwb_write_issue+0x68/0xb90 [zfs]
[ 9954.842367] ? __list_add+0x12/0x30 [zfs]
[ 9954.843496] ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 9954.844665] ? zil_alloc_lwb+0x217/0x360 [zfs]
[ 9954.845852] zil_commit_waiter_timeout+0x1f3/0x570 [zfs]
[ 9954.847203] zil_commit_waiter+0x1d2/0x3b0 [zfs]
[ 9954.848380] zil_commit_impl+0x6d/0xd0 [zfs]
[ 9954.849550] zfs_fsync+0x66/0x90 [zfs]
[ 9954.850640] zpl_fsync+0xe5/0x140 [zfs]
[ 9954.851729] do_fsync+0x38/0x70
[ 9954.852585] __x64_sys_fsync+0x10/0x20
[ 9954.853486] do_syscall_64+0x5b/0x1b0
[ 9954.854416] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.855466] RIP: 0033:0x7eff236bb057
[ 9954.856388] Code: Unable to access opcode bytes at RIP 0x7eff236bb02d.
[ 9954.857651] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[ 9954.859141] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007eff236bb057
[ 9954.860496] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI: 0000000000000006
[ 9954.861945] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09: 0000000000000000
[ 9954.863327] R10: 0000000000056000 R11: 0000000000000293 R12: 0000000000000003
[ 9954.864765] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15: 000055e4d1f13ae8
[ 9954.866149] INFO: task fio:1055502 blocked for more than 120 seconds.
[ 9954.867490] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.869029] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.870571] task:fio state:D stack:0 pid:1055502 ppid:1055490 flags:0x00004080
[ 9954.872162] Call Trace:
[ 9954.872947] __schedule+0x2d1/0x870
[ 9954.873844] schedule+0x55/0xf0
[ 9954.874716] schedule_timeout+0x19f/0x320
[ 9954.875645] ? __next_timer_interrupt+0xf0/0xf0
[ 9954.876722] io_schedule_timeout+0x19/0x40
[ 9954.877677] __cv_timedwait_common+0x19e/0x2c0 [spl]
[ 9954.878822] ? finish_wait+0x80/0x80
[ 9954.879694] __cv_timedwait_io+0x15/0x20 [spl]
[ 9954.880763] zio_wait+0x1ad/0x4f0 [zfs]
[ 9954.881865] dmu_write_abd+0x174/0x1c0 [zfs]
[ 9954.883074] dmu_write_uio_direct+0x79/0x100 [zfs]
[ 9954.884285] dmu_write_uio_dnode+0xb2/0x320 [zfs]
[ 9954.885507] dmu_write_uio_dbuf+0x47/0x60 [zfs]
[ 9954.886687] zfs_write+0x581/0xe20 [zfs]
[ 9954.887822] ? iov_iter_get_pages+0xe9/0x390
[ 9954.888862] ? trylock_page+0xd/0x20 [zfs]
[ 9954.890005] ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 9954.891217] ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[ 9954.892391] zpl_iter_write_direct+0xd4/0x170 [zfs]
[ 9954.893663] ? rrw_exit+0xc6/0x200 [zfs]
[ 9954.894764] zpl_iter_write+0xd5/0x110 [zfs]
[ 9954.895911] new_sync_write+0x112/0x160
[ 9954.896881] vfs_write+0xa5/0x1b0
[ 9954.897701] ksys_write+0x4f/0xb0
[ 9954.898569] do_syscall_64+0x5b/0x1b0
[ 9954.899417] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.900515] RIP: 0033:0x7eff236baa47
[ 9954.901363] Code: Unable to access opcode bytes at RIP 0x7eff236baa1d.
[ 9954.902673] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[ 9954.904099] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007eff236baa47
[ 9954.905535] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI: 0000000000000005
[ 9954.906902] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09: 0000000000000000
[ 9954.908339] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000e4000
[ 9954.909705] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15: 000055e4d1f13ae8
[ 9954.911129] INFO: task fio:1055504 blocked for more than 120 seconds.
[ 9954.912381] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.913978] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.915434] task:fio state:D stack:0 pid:1055504 ppid:1055493 flags:0x00000080
[ 9954.917082] Call Trace:
[ 9954.917773] __schedule+0x2d1/0x870
[ 9954.918648] ? zilog_dirty+0x4f/0xc0 [zfs]
[ 9954.919831] schedule+0x55/0xf0
[ 9954.920717] cv_wait_common+0x16d/0x280 [spl]
[ 9954.921704] ? finish_wait+0x80/0x80
[ 9954.922639] zfs_rangelock_enter_writer+0x46/0x1c0 [zfs]
[ 9954.923940] zfs_rangelock_enter_impl+0x12a/0x1b0 [zfs]
[ 9954.925306] zfs_write+0x703/0xe20 [zfs]
[ 9954.926406] zpl_iter_write_buffered+0xb2/0x120 [zfs]
[ 9954.927687] ? rrw_exit+0xc6/0x200 [zfs]
[ 9954.928821] zpl_iter_write+0xbe/0x110 [zfs]
[ 9954.930028] new_sync_write+0x112/0x160
[ 9954.930913] vfs_write+0xa5/0x1b0
[ 9954.931758] ksys_write+0x4f/0xb0
[ 9954.932666] do_syscall_64+0x5b/0x1b0
[ 9954.933544] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.934689] RIP: 0033:0x7fcaee8f0a47
[ 9954.935551] Code: Unable to access opcode bytes at RIP 0x7fcaee8f0a1d.
[ 9954.936893] RSP: 002b:00007fff56b2c240 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[ 9954.938327] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fcaee8f0a47
[ 9954.939777] RDX: 000000000001d000 RSI: 00007fca8300b010 RDI: 0000000000000006
[ 9954.941187] RBP: 00007fca8300b010 R08: 0000000000000000 R09: 0000000000000000
[ 9954.942655] R10: 0000000000000000 R11: 0000000000000293 R12: 000000000001d000
[ 9954.944062] R13: 0000557a2006bac0 R14: 000000000001d000 R15: 0000557a2006bae8
[ 9954.945525] INFO: task fio:1055505 blocked for more than 120 seconds.
[ 9954.946819] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1
[ 9954.948466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 9954.949959] task:fio state:D stack:0 pid:1055505 ppid:1055493 flags:0x00004080
[ 9954.951653] Call Trace:
[ 9954.952417] __schedule+0x2d1/0x870
[ 9954.953393] ? finish_wait+0x3e/0x80
[ 9954.954315] schedule+0x55/0xf0
[ 9954.955212] cv_wait_common+0x16d/0x280 [spl]
[ 9954.956211] ? finish_wait+0x80/0x80
[ 9954.957159] zil_commit_waiter+0xfa/0x3b0 [zfs]
[ 9954.958343] zil_commit_impl+0x6d/0xd0 [zfs]
[ 9954.959524] zfs_fsync+0x66/0x90 [zfs]
[ 9954.960626] zpl_fsync+0xe5/0x140 [zfs]
[ 9954.961763] do_fsync+0x38/0x70
[ 9954.962638] __x64_sys_fsync+0x10/0x20
[ 9954.963520] do_syscall_64+0x5b/0x1b0
[ 9954.964470] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[ 9954.965567] RIP: 0033:0x7fcaee8f1057
[ 9954.966490] Code: Unable to access opcode bytes at RIP 0x7fcaee8f102d.
[ 9954.967752] RSP: 002b:00007fff56b2c230 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[ 9954.969260] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fcaee8f1057
[ 9954.970628] RDX: 0000000000000000 RSI: 0000557a2006bac0 RDI: 0000000000000005
[ 9954.972092] RBP: 00007fca84152a18 R08: 0000000000000000 R09: 0000000000000000
[ 9954.973484] R10: 0000000000035000 R11: 0000000000000293 R12: 0000000000000003
[ 9954.974958] R13: 0000557a2006bac0 R14: 0000000000000000 R15: 0000557a2006bae8
[10077.648150] INFO: task z_wr_int:1051869 blocked for more than 120 seconds.
[10077.649541] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1
[10077.651116] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10077.652782] task:z_wr_int state:D stack:0 pid:1051869 ppid:2 flags:0x80004080
[10077.654420] Call Trace:
[10077.655267] __schedule+0x2d1/0x870
[10077.656179] ? free_one_page+0x204/0x530
[10077.657192] schedule+0x55/0xf0
[10077.658004] cv_wait_common+0x16d/0x280 [spl]
[10077.659018] ? finish_wait+0x80/0x80
[10077.660013] dmu_buf_direct_mixed_io_wait+0x84/0x1a0 [zfs]
[10077.661396] dmu_write_direct_done+0x90/0x3b0 [zfs]
[10077.662617] zio_done+0x373/0x1d50 [zfs]
[10077.663783] zio_execute+0xee/0x210 [zfs]
[10077.664921] taskq_thread+0x205/0x3f0 [spl]
[10077.665982] ? wake_up_q+0x60/0x60
[10077.666842] ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs]
[10077.668295] ? taskq_lowest_id+0xc0/0xc0 [spl]
[10077.669360] kthread+0x134/0x150
[10077.670191] ? set_kthread_struct+0x50/0x50
[10077.671209] ret_from_fork+0x35/0x40
[10077.672076] INFO: task txg_sync:1051894 blocked for more than 120 seconds.
[10077.673467] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1
[10077.675112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10077.676612] task:txg_sync state:D stack:0 pid:1051894 ppid:2 flags:0x80004080
[10077.678288] Call Trace:
[10077.679024] __schedule+0x2d1/0x870
[10077.679948] ? __wake_up_common+0x7a/0x190
[10077.681042] schedule+0x55/0xf0
[10077.681899] schedule_timeout+0x19f/0x320
[10077.682951] ? __next_timer_interrupt+0xf0/0xf0
[10077.684005] ? taskq_dispatch+0xab/0x280 [spl]
[10077.685085] io_schedule_timeout+0x19/0x40
[10077.686080] __cv_timedwait_common+0x19e/0x2c0 [spl]
[10077.687227] ? finish_wait+0x80/0x80
[10077.688123] __cv_timedwait_io+0x15/0x20 [spl]
[10077.689206] zio_wait+0x1ad/0x4f0 [zfs]
[10077.690300] dsl_pool_sync+0xcb/0x6c0 [zfs]
[10077.691435] ? spa_errlog_sync+0x2f0/0x3d0 [zfs]
[10077.692636] spa_sync_iterate_to_convergence+0xcb/0x310 [zfs]
[10077.693997] spa_sync+0x362/0x8f0 [zfs]
[10077.695112] txg_sync_thread+0x27a/0x3b0 [zfs]
[10077.696239] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[10077.697512] ? spl_assert.constprop.0+0x20/0x20 [spl]
[10077.698639] thread_generic_wrapper+0x63/0x90 [spl]
[10077.699687] kthread+0x134/0x150
[10077.700567] ? set_kthread_struct+0x50/0x50
[10077.701502] ret_from_fork+0x35/0x40
[10077.702430] INFO: task fio:1055501 blocked for more than 120 seconds.
[10077.703697] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1
[10077.705309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10077.706780] task:fio state:D stack:0 pid:1055501 ppid:1055490 flags:0x00004080
[10077.708479] Call Trace:
[10077.709231] __schedule+0x2d1/0x870
[10077.710190] ? dbuf_hold_copy+0xec/0x230 [zfs]
[10077.711368] schedule+0x55/0xf0
[10077.712286] cv_wait_common+0x16d/0x280 [spl]
[10077.713316] ? finish_wait+0x80/0x80
[10077.714262] zfs_rangelock_enter_reader+0xa1/0x1f0 [zfs]
[10077.715566] zfs_rangelock_enter_impl+0xbf/0x1b0 [zfs]
[10077.716878] zfs_get_data+0x566/0x810 [zfs]
[10077.718032] zil_lwb_commit+0x194/0x3f0 [zfs]
[10077.719234] zil_lwb_write_issue+0x68/0xb90 [zfs]
[10077.720413] ? __list_add+0x12/0x30 [zfs]
[10077.721525] ? __raw_spin_unlock+0x5/0x10 [zfs]
[10077.722708] ? zil_alloc_lwb+0x217/0x360 [zfs]
[10077.723931] zil_commit_waiter_timeout+0x1f3/0x570 [zfs]
[10077.725273] zil_commit_waiter+0x1d2/0x3b0 [zfs]
[10077.726438] zil_commit_impl+0x6d/0xd0 [zfs]
[10077.727586] zfs_fsync+0x66/0x90 [zfs]
[10077.728675] zpl_fsync+0xe5/0x140 [zfs]
[10077.729755] do_fsync+0x38/0x70
[10077.730607] __x64_sys_fsync+0x10/0x20
[10077.731482] do_syscall_64+0x5b/0x1b0
[10077.732415] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[10077.733487] RIP: 0033:0x7eff236bb057
[10077.734399] Code: Unable to access opcode bytes at RIP 0x7eff236bb02d.
[10077.735657] RSP: 002b:00007ffffb8e5320 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[10077.737163] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007eff236bb057
[10077.738526] RDX: 0000000000000000 RSI: 000055e4d1f13ac0 RDI: 0000000000000006
[10077.739966] RBP: 00007efeb8ed8000 R08: 0000000000000000 R09: 0000000000000000
[10077.741336] R10: 0000000000056000 R11: 0000000000000293 R12: 0000000000000003
[10077.742773] R13: 000055e4d1f13ac0 R14: 0000000000000000 R15: 000055e4d1f13ae8
[10077.744168] INFO: task fio:1055502 blocked for more than 120 seconds.
[10077.745505] Tainted: P OE -------- - - 4.18.0-553.5.1.el8_10.x86_64 #1
[10077.747073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10077.748642] task:fio state:D stack:0 pid:1055502 ppid:1055490 flags:0x00004080
[10077.750233] Call Trace:
[10077.751011] __schedule+0x2d1/0x870
[10077.751915] schedule+0x55/0xf0
[10077.752811] schedule_timeout+0x19f/0x320
[10077.753762] ? __next_timer_interrupt+0xf0/0xf0
[10077.754824] io_schedule_timeout+0x19/0x40
[10077.755782] __cv_timedwait_common+0x19e/0x2c0 [spl]
[10077.756922] ? finish_wait+0x80/0x80
[10077.757788] __cv_timedwait_io+0x15/0x20 [spl]
[10077.758845] zio_wait+0x1ad/0x4f0 [zfs]
[10077.759941] dmu_write_abd+0x174/0x1c0 [zfs]
[10077.761144] dmu_write_uio_direct+0x79/0x100 [zfs]
[10077.762327] dmu_write_uio_dnode+0xb2/0x320 [zfs]
[10077.763523] dmu_write_uio_dbuf+0x47/0x60 [zfs]
[10077.764749] zfs_write+0x581/0xe20 [zfs]
[10077.765825] ? iov_iter_get_pages+0xe9/0x390
[10077.766842] ? trylock_page+0xd/0x20 [zfs]
[10077.767956] ? __raw_spin_unlock+0x5/0x10 [zfs]
[10077.769189] ? zfs_setup_direct+0x7e/0x1b0 [zfs]
[10077.770343] zpl_iter_write_direct+0xd4/0x170 [zfs]
[10077.771570] ? rrw_exit+0xc6/0x200 [zfs]
[10077.772674] zpl_iter_write+0xd5/0x110 [zfs]
[10077.773834] new_sync_write+0x112/0x160
[10077.774805] vfs_write+0xa5/0x1b0
[10077.775634] ksys_write+0x4f/0xb0
[10077.776526] do_syscall_64+0x5b/0x1b0
[10077.777386] entry_SYSCALL_64_after_hwframe+0x61/0xc6
[10077.778488] RIP: 0033:0x7eff236baa47
[10077.779339] Code: Unable to access opcode bytes at RIP 0x7eff236baa1d.
[10077.780655] RSP: 002b:00007ffffb8e5330 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[10077.782056] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007eff236baa47
[10077.783507] RDX: 00000000000e4000 RSI: 00007efeb7dd4000 RDI: 0000000000000005
[10077.784890] RBP: 00007efeb7dd4000 R08: 0000000000000000 R09: 0000000000000000
[10077.786303] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000e4000
[10077.787637] R13: 000055e4d1f13ac0 R14: 00000000000e4000 R15: 000055e4d1f13ae8

Signed-off-by: Brian Atkinson <[email protected]>
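
For orientation, the heart of the new dmu_buf_will_clone_or_dio() added in module/zfs/dbuf.c below can be condensed to the following sketch. This is a trimmed illustration, not the verbatim implementation: the asserts, the space-accounting commentary, and the DTrace state tracking from the actual hunk are omitted.

void
dmu_buf_will_clone_or_dio(dmu_buf_t *db_fake, dmu_tx_t *tx)
{
	dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;

	mutex_enter(&db->db_mtx);

	/* 1. Undirty any dirty record for this block in the current TXG. */
	VERIFY3B(dbuf_undirty(db, tx), ==, B_FALSE);

	if (db->db_buf != NULL) {
		/*
		 * 2. Destroy the ARC buffer only if the previous dirty
		 * record's dr_data does not still reference it.
		 */
		dbuf_dirty_record_t *dr = list_head(&db->db_dirty_records);
		if (dr == NULL || dr->dt.dl.dr_data != db->db_buf)
			arc_buf_destroy(db->db_buf, db);

		/*
		 * 3. NULL out the data pointers so future reads go down
		 * to the devices for the latest version of the data.
		 */
		db->db_buf = NULL;
		dbuf_clear_data(db);
	}

	/* 4. Redirty the dbuf as DB_NOFILL before the write is issued. */
	db->db_state = DB_NOFILL;

	mutex_exit(&db->db_mtx);
}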
1 parent e6e5b4c commit 9909121

6 files changed: +170 -334 lines changed

include/sys/dbuf.h

Lines changed: 1 addition & 12 deletions
@@ -330,14 +330,6 @@ typedef struct dmu_buf_impl {

	/* The buffer was partially read. More reads may follow. */
	uint8_t db_partial_read;
-
-	/*
-	 * This block is being held under a writer rangelock of a Direct I/O
-	 * write that is waiting for previous buffered writes to synced out
-	 * due to mixed buffered and O_DIRECT operations. This is needed to
-	 * check whether to grab the rangelock in zfs_get_data().
-	 */
-	uint8_t db_mixed_io_dio_wait;
} dmu_buf_impl_t;

#define DBUF_HASH_MUTEX(h, idx) \
@@ -386,7 +378,7 @@ dmu_buf_impl_t *dbuf_find(struct objset *os, uint64_t object, uint8_t level,
    uint64_t blkid, uint64_t *hash_out);

int dbuf_read(dmu_buf_impl_t *db, zio_t *zio, uint32_t flags);
-void dmu_buf_will_clone(dmu_buf_t *db, dmu_tx_t *tx);
+void dmu_buf_will_clone_or_dio(dmu_buf_t *db, dmu_tx_t *tx);
void dmu_buf_will_not_fill(dmu_buf_t *db, dmu_tx_t *tx);
void dmu_buf_will_fill(dmu_buf_t *db, dmu_tx_t *tx, boolean_t canfail);
boolean_t dmu_buf_fill_done(dmu_buf_t *db, dmu_tx_t *tx, boolean_t failed);
@@ -395,9 +387,6 @@ dbuf_dirty_record_t *dbuf_dirty(dmu_buf_impl_t *db, dmu_tx_t *tx);
dbuf_dirty_record_t *dbuf_dirty_lightweight(dnode_t *dn, uint64_t blkid,
    dmu_tx_t *tx);
boolean_t dbuf_undirty(dmu_buf_impl_t *db, dmu_tx_t *tx);
-void dmu_buf_direct_mixed_io_wait(dmu_buf_impl_t *db, uint64_t txg,
-    boolean_t read);
-void dmu_buf_undirty(dmu_buf_impl_t *db, dmu_tx_t *tx);
blkptr_t *dmu_buf_get_bp_from_dbuf(dmu_buf_impl_t *db);
int dmu_buf_untransform_direct(dmu_buf_impl_t *db, spa_t *spa);
arc_buf_t *dbuf_loan_arcbuf(dmu_buf_impl_t *db);

include/sys/dmu.h

Lines changed: 0 additions & 1 deletion
@@ -1078,7 +1078,6 @@ typedef struct zgd {
	struct blkptr *zgd_bp;
	dmu_buf_t *zgd_db;
	struct zfs_locked_range *zgd_lr;
-	boolean_t zgd_grabbed_rangelock;
	void *zgd_private;
} zgd_t;

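With the Direct I/O wait gone, zfs_get_data() no longer has to check whether a sleeping Direct I/O writer already holds the rangelock (the check that the removed db_mixed_io_dio_wait flag existed for), so the zgd_grabbed_rangelock bookkeeping field can be dropped along with it.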
module/zfs/dbuf.c

Lines changed: 72 additions & 148 deletions
@@ -85,10 +85,7 @@ typedef struct dbuf_stats {
	/*
	 * Statistics for Direct I/O.
	 */
-	kstat_named_t direct_mixed_io_read_wait;
-	kstat_named_t direct_mixed_io_write_wait;
	kstat_named_t direct_sync_wait;
-	kstat_named_t direct_undirty;
	/*
	 * Statistics about the dbuf hash table.
	 */
@@ -137,10 +134,7 @@ dbuf_stats_t dbuf_stats = {
	{ "cache_total_evicts", KSTAT_DATA_UINT64 },
	{ { "cache_levels_N", KSTAT_DATA_UINT64 } },
	{ { "cache_levels_bytes_N", KSTAT_DATA_UINT64 } },
-	{ "direct_mixed_io_read_wait", KSTAT_DATA_UINT64 },
-	{ "direct_mixed_io_write_wait", KSTAT_DATA_UINT64 },
	{ "direct_sync_wait", KSTAT_DATA_UINT64 },
-	{ "direct_undirty", KSTAT_DATA_UINT64 },
	{ "hash_hits", KSTAT_DATA_UINT64 },
	{ "hash_misses", KSTAT_DATA_UINT64 },
	{ "hash_collisions", KSTAT_DATA_UINT64 },
@@ -162,10 +156,7 @@ struct {
	wmsum_t cache_total_evicts;
	wmsum_t cache_levels[DN_MAX_LEVELS];
	wmsum_t cache_levels_bytes[DN_MAX_LEVELS];
-	wmsum_t direct_mixed_io_read_wait;
-	wmsum_t direct_mixed_io_write_wait;
	wmsum_t direct_sync_wait;
-	wmsum_t direct_undirty;
	wmsum_t hash_hits;
	wmsum_t hash_misses;
	wmsum_t hash_collisions;
@@ -911,14 +902,8 @@ dbuf_kstat_update(kstat_t *ksp, int rw)
		ds->cache_levels_bytes[i].value.ui64 =
		    wmsum_value(&dbuf_sums.cache_levels_bytes[i]);
	}
-	ds->direct_mixed_io_read_wait.value.ui64 =
-	    wmsum_value(&dbuf_sums.direct_mixed_io_read_wait);
-	ds->direct_mixed_io_write_wait.value.ui64 =
-	    wmsum_value(&dbuf_sums.direct_mixed_io_write_wait);
	ds->direct_sync_wait.value.ui64 =
	    wmsum_value(&dbuf_sums.direct_sync_wait);
-	ds->direct_undirty.value.ui64 =
-	    wmsum_value(&dbuf_sums.direct_undirty);
	ds->hash_hits.value.ui64 =
	    wmsum_value(&dbuf_sums.hash_hits);
	ds->hash_misses.value.ui64 =
@@ -1021,10 +1006,7 @@ dbuf_init(void)
		wmsum_init(&dbuf_sums.cache_levels[i], 0);
		wmsum_init(&dbuf_sums.cache_levels_bytes[i], 0);
	}
-	wmsum_init(&dbuf_sums.direct_mixed_io_read_wait, 0);
-	wmsum_init(&dbuf_sums.direct_mixed_io_write_wait, 0);
	wmsum_init(&dbuf_sums.direct_sync_wait, 0);
-	wmsum_init(&dbuf_sums.direct_undirty, 0);
	wmsum_init(&dbuf_sums.hash_hits, 0);
	wmsum_init(&dbuf_sums.hash_misses, 0);
	wmsum_init(&dbuf_sums.hash_collisions, 0);
@@ -1097,10 +1079,7 @@ dbuf_fini(void)
		wmsum_fini(&dbuf_sums.cache_levels[i]);
		wmsum_fini(&dbuf_sums.cache_levels_bytes[i]);
	}
-	wmsum_fini(&dbuf_sums.direct_mixed_io_read_wait);
-	wmsum_fini(&dbuf_sums.direct_mixed_io_write_wait);
	wmsum_fini(&dbuf_sums.direct_sync_wait);
-	wmsum_fini(&dbuf_sums.direct_undirty);
	wmsum_fini(&dbuf_sums.hash_hits);
	wmsum_fini(&dbuf_sums.hash_misses);
	wmsum_fini(&dbuf_sums.hash_collisions);
@@ -1271,9 +1250,8 @@ dbuf_clear_data(dmu_buf_impl_t *db)
{
	ASSERT(MUTEX_HELD(&db->db_mtx));
	dbuf_evict_user(db);
-	/* Direct I/O writes may have data */
-	if (db->db_buf == NULL)
-		db->db.db_data = NULL;
+	ASSERT3P(db->db_buf, ==, NULL);
+	db->db.db_data = NULL;
	if (db->db_state != DB_NOFILL) {
		db->db_state = DB_UNCACHED;
		DTRACE_SET_STATE(db, "clear data");
@@ -1733,10 +1711,11 @@ dbuf_read_impl(dmu_buf_impl_t *db, dnode_t *dn, zio_t *zio, uint32_t flags,
 * have been modified in a previous transaction group before we access them in
 * the current active group.
 *
- * This function is used in three places: when we are dirtying a buffer for the
+ * This function is used in four places: when we are dirtying a buffer for the
 * first time in a txg, when we are freeing a range in a dnode that includes
- * this buffer, and when we are accessing a buffer which was received compressed
- * and later referenced in a WRITE_BYREF record.
+ * this buffer, when doing block cloning or issuing a Direct I/O write with
+ * abuffer, and when we are accessing a buffer which was received compressed and
+ * later referenced in a WRITE_BYREF record.
 *
 * Note that when we are called from dbuf_free_range() we do not put a hold on
 * the buffer, we just traverse the active dbuf list for the dnode.
@@ -2789,93 +2768,6 @@ dmu_buf_is_dirty(dmu_buf_t *db_fake, dmu_tx_t *tx)
	return (dr != NULL);
}

-void
-dmu_buf_direct_mixed_io_wait(dmu_buf_impl_t *db, uint64_t txg, boolean_t read)
-{
-	ASSERT(MUTEX_HELD(&db->db_mtx));
-
-	if (read == B_TRUE) {
-		/*
-		 * If a buffered read is in process, a Direct I/O read will
-		 * wait for the buffered I/O to complete.
-		 */
-		ASSERT3U(txg, ==, 0);
-		while (db->db_state == DB_READ) {
-			DBUF_STAT_BUMP(direct_mixed_io_read_wait);
-			cv_wait(&db->db_changed, &db->db_mtx);
-		}
-	} else {
-		/*
-		 * There must be an ARC buf associated with this Direct I/O
-		 * write otherwise there is no reason to wait for previous
-		 * dirty records to sync out.
-		 *
-		 * The db_state will temporarily be set to DB_CACHED so that
-		 * that any synchronous writes issued through the ZIL will
-		 * still be handled properly. In particular, the call to
-		 * dbuf_read() in dmu_sync_late_arrival() must account for the
-		 * data still being in the ARC. After waiting here for previous
-		 * TXGs to sync out, dmu_write_direct_done() will update the
-		 * db_state.
-		 */
-		ASSERT3P(db->db_buf, !=, NULL);
-		ASSERT3U(txg, >, 0);
-		db->db_mixed_io_dio_wait = TRUE;
-		db->db_state = DB_CACHED;
-		while (dbuf_find_dirty_lte(db, txg) != NULL) {
-			DBUF_STAT_BUMP(direct_mixed_io_write_wait);
-			cv_wait(&db->db_changed, &db->db_mtx);
-		}
-		db->db_mixed_io_dio_wait = FALSE;
-	}
-}
-
-/*
- * Direct I/O writes may need to undirty the open-context dirty record
- * associated with it in the event of an I/O error.
- */
-void
-dmu_buf_undirty(dmu_buf_impl_t *db, dmu_tx_t *tx)
-{
-	/*
-	 * Direct I/O writes always happen in open-context.
-	 */
-	ASSERT(!dmu_tx_is_syncing(tx));
-	ASSERT(MUTEX_HELD(&db->db_mtx));
-	ASSERT(db->db_state == DB_NOFILL || db->db_state == DB_UNCACHED);
-
-
-	/*
-	 * In the event of an I/O error we will handle the metaslab clean up in
-	 * zio_done(). Also, the dirty record's dr_overridden_by BP is not
-	 * currently set as that is done in dmu_sync_done(). Since the db_state
-	 * is still set to DB_NOFILL, dbuf_unoverride() will not be called in
-	 * dbuf_undirty() and the dirty record's BP will not be added the SPA's
-	 * spa_free_bplist via zio_free().
-	 *
-	 * This function can also be called in the event that a Direct I/O
-	 * write is overwriting a previous Direct I/O to the same block for
-	 * this TXG. It is important to go ahead and free up the space
-	 * accounting in this case through dbuf_undirty() -> dbuf_unoverride()
-	 * -> zio_free(). This is necessary because the space accounting for
-	 * determining if a write can occur in zfs_write() happens through
-	 * dmu_tx_assign(). This can cause an issue with Direct I/O writes in
-	 * the case of overwrites, because all DVA allocations are being done
-	 * in open-context. Constanstly allowing Direct I/O overwrites to the
-	 * same blocks can exhaust the pools available space leading to ENOSPC
-	 * errors at the DVA allcoation part of the ZIO pipeline, which will
-	 * eventually suspend the pool. By cleaning up space accounting now
-	 * the ENOSPC pool suspend can be avoided.
-	 *
-	 * Since we are undirtying the record for the Direct I/O in
-	 * open-context we must have a hold on the db, so it should never be
-	 * evicted after calling dbuf_undirty().
-	 */
-	VERIFY3B(dbuf_undirty(db, tx), ==, B_FALSE);
-
-	DBUF_STAT_BUMP(direct_undirty);
-}
-
/*
 * Normally the db_blkptr points to the most recent on-disk content for the
 * dbuf (and anything newer will be cached in the dbuf). However, a recent
@@ -2951,27 +2843,71 @@ dmu_buf_untransform_direct(dmu_buf_impl_t *db, spa_t *spa)
}

void
-dmu_buf_will_clone(dmu_buf_t *db_fake, dmu_tx_t *tx)
+dmu_buf_will_clone_or_dio(dmu_buf_t *db_fake, dmu_tx_t *tx)
{
	dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;

	/*
-	 * Block cloning: We are going to clone into this block, so undirty
-	 * modifications done to this block so far in this txg. This includes
-	 * writes and clones into this block.
+	 * Block clones and Direct I/O writes always happen in open-context.
	 */
+	ASSERT(!dmu_tx_is_syncing(tx));
+	ASSERT(db->db_level == 0);
+	ASSERT(db->db_blkid != DMU_BONUS_BLKID);
+	ASSERT(db->db.db_object != DMU_META_DNODE_OBJECT);
+
	mutex_enter(&db->db_mtx);
	DBUF_VERIFY(db);
-	VERIFY(!dbuf_undirty(db, tx));
+
+	/*
+	 * We are going to clone or issue a Direct I/O write on this block, so
+	 * undirty modifications done to this block so far in this txg. This
+	 * includes writes and clones into this block.
+	 *
+	 * If there dirty record associated with this txg from a previous Direct
+	 * I/O write then space accounting cleanup takes place. It is important
+	 * to go ahead free up the space accounting through dbuf_undirty() ->
+	 * dbuf_unoverride() -> zio_free(). Space accountiung for determining
+	 * if a write can occur in zfs_write() happens through dmu_tx_assign().
+	 * This can cuase an issue with Direct I/O writes in the case of
+	 * overwriting the same block, because all DVA allocations are being
+	 * done in open-context. Constantly allowing Direct I/O overwrites to
+	 * the same block can exhaust the pools available space leading to
+	 * ENOSPC errors at the DVA allocation part of the ZIO pipeline, which
+	 * will eventually suspend the pool. By cleaning up sapce acccounting
+	 * now, the ENOSPC error can be avoided.
+	 *
+	 * Since we are undirtying the record in open-context, we must have a
+	 * hold on the db, so it should never be evicted after calling
+	 * dbuf_undirty().
+	 */
+	VERIFY3B(dbuf_undirty(db, tx), ==, B_FALSE);
	ASSERT0P(dbuf_find_dirty_eq(db, tx->tx_txg));
+
	if (db->db_buf != NULL) {
-		arc_buf_destroy(db->db_buf, db);
+		/*
+		 * If there is an associated ARC buffer with this dbuf we can
+		 * only destroy it if the previous dirty record does not
+		 * reference it.
+		 */
+		dbuf_dirty_record_t *dr = list_head(&db->db_dirty_records);
+		if (dr == NULL || dr->dt.dl.dr_data != db->db_buf)
+			arc_buf_destroy(db->db_buf, db);
+
+		/*
+		 * Setting the dbuf's data pointers to NULL will force all
+		 * future reads down to the devices to get the most up to date
+		 * version of the data after a Direct I/O write has completed.
+		 */
		db->db_buf = NULL;
		dbuf_clear_data(db);
	}

+	ASSERT3P(db->db_buf, ==, NULL);
+	ASSERT3P(db->db.db_data, ==, NULL);
+
	db->db_state = DB_NOFILL;
-	DTRACE_SET_STATE(db, "allocating NOFILL buffer for clone");
+	DTRACE_SET_STATE(db,
+	    "allocating NOFILL buffer for clone or direct I/O write");

	DBUF_VERIFY(db);
	mutex_exit(&db->db_mtx);
@@ -3510,7 +3446,6 @@ dbuf_create(dnode_t *dn, uint8_t level, uint64_t blkid,
	db->db_user_immediate_evict = FALSE;
	db->db_freed_in_flight = FALSE;
	db->db_pending_evict = FALSE;
-	db->db_mixed_io_dio_wait = FALSE;

	if (blkid == DMU_BONUS_BLKID) {
		ASSERT3P(parent, ==, dn->dn_dbuf);
@@ -4766,25 +4701,6 @@ dbuf_sync_leaf(dbuf_dirty_record_t *dr, dmu_tx_t *tx)
	dprintf_dbuf_bp(db, db->db_blkptr, "blkptr=%p", db->db_blkptr);

	mutex_enter(&db->db_mtx);
-
-	/*
-	 * It is possible a buffered read has come in after a Direct I/O
-	 * write and is currently transistioning the db_state from DB_READ
-	 * in dbuf_read_impl() to another state in dbuf_read_done(). We
-	 * have to wait in order for the dbuf state to change from DB_READ
-	 * before syncing the dirty record of the Direct I/O write.
-	 */
-	if (db->db_state == DB_READ && !dr->dt.dl.dr_brtwrite) {
-		ASSERT3P(*datap, ==, NULL);
-		ASSERT3P(db->db_buf, ==, NULL);
-		ASSERT3P(db->db.db_data, ==, NULL);
-		ASSERT3U(dr->dt.dl.dr_override_state, ==, DR_OVERRIDDEN);
-		while (db->db_state == DB_READ) {
-			DBUF_STAT_BUMP(direct_sync_wait);
-			cv_wait(&db->db_changed, &db->db_mtx);
-		}
-	}
-
	/*
	 * To be synced, we must be dirtied. But we might have been freed
	 * after the dirty.
@@ -4797,13 +4713,21 @@ dbuf_sync_leaf(dbuf_dirty_record_t *dr, dmu_tx_t *tx)
		ASSERT(db->db.db_data != dr->dt.dl.dr_data);
	} else if (db->db_state == DB_READ) {
		/*
-		 * This buffer has a clone we need to write, and an in-flight
-		 * read on the BP we're about to clone. Its safe to issue the
-		 * write here because the read has already been issued and the
-		 * contents won't change.
+		 * This buffer was either cloned or had a Direct I/O write
+		 * occur and has an in-flgiht read on the BP. It is safe to
+		 * issue the write here, because the read has already been
+		 * issued and the contents won't change.
+		 *
+		 * We can verify the case of both the clone and Direct I/O
+		 * write by making sure the first dirty record for the dbuf
+		 * has no ARC buffer associated with it.
		 */
-		ASSERT(dr->dt.dl.dr_brtwrite &&
-		    dr->dt.dl.dr_override_state == DR_OVERRIDDEN);
+		dbuf_dirty_record_t *dr_head =
+		    list_head(&db->db_dirty_records);
+		ASSERT3P(db->db_buf, ==, NULL);
+		ASSERT3P(db->db.db_data, ==, NULL);
+		ASSERT3P(dr_head->dt.dl.dr_data, ==, NULL);
+		ASSERT3U(dr_head->dt.dl.dr_override_state, ==, DR_OVERRIDDEN);
	} else {
		ASSERT(db->db_state == DB_CACHED || db->db_state == DB_NOFILL);
	}
@@ -5503,7 +5427,7 @@ EXPORT_SYMBOL(dbuf_dirty);
EXPORT_SYMBOL(dmu_buf_set_crypt_params);
EXPORT_SYMBOL(dmu_buf_will_dirty);
EXPORT_SYMBOL(dmu_buf_is_dirty);
-EXPORT_SYMBOL(dmu_buf_will_clone);
+EXPORT_SYMBOL(dmu_buf_will_clone_or_dio);
EXPORT_SYMBOL(dmu_buf_will_not_fill);
EXPORT_SYMBOL(dmu_buf_will_fill);
EXPORT_SYMBOL(dmu_buf_fill_done);

module/zfs/dmu.c

Lines changed: 1 addition & 1 deletion
@@ -2439,7 +2439,7 @@ dmu_brt_clone(objset_t *os, uint64_t object, uint64_t offset, uint64_t length,
	ASSERT(db->db_blkid != DMU_SPILL_BLKID);
	ASSERT(BP_IS_HOLE(bp) || dbuf->db_size == BP_GET_LSIZE(bp));

-	dmu_buf_will_clone(dbuf, tx);
+	dmu_buf_will_clone_or_dio(dbuf, tx);

	mutex_enter(&db->db_mtx);

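dmu_write_direct() itself is not among the hunks shown on this page. Per the commit message, it now makes the same dmu_buf_will_clone_or_dio() call before issuing the write, so the block cloning path above and the Direct I/O write path both reach the ZIO pipeline with the dbuf already in DB_NOFILL and no ARC buffer attached.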