Description
I believe we've uncovered a bug resulting from the changes made in f63811f.
We have a system that's stuck with the following stacks:
0xffff8911c9862000 UNINTERRUPTIBLE 1
__schedule+0x2c8
__schedule+0x2c8
schedule+0x69
cv_wait_common+0x16e
__cv_wait+0x15
spa_config_enter_impl.constprop.0+0x11a
spa_config_enter+0x12
spa_vdev_state_enter+0x88
zfs_ioc_pool_reopen+0x63
zfsdev_ioctl_common+0x295
zfsdev_ioctl+0x57
__x64_sys_ioctl+0x92
__x64_sys_ioctl+0x92
__x64_sys_ioctl+0x92
__x64_sys_ioctl+0x92
do_syscall_64+0x59
do_syscall_64+0x59
entry_SYSCALL_64+0x99
i.e. this thread is attempting to get the lock as writer..
0xffff8912c5266000 UNINTERRUPTIBLE 1
__schedule+0x2c8
__schedule+0x2c8
schedule+0x69
cv_wait_common+0x16e
__cv_wait+0x15
spa_config_enter_impl.constprop.0+0x1de
spa_config_enter+0x12
spa_sync_rewrite_vdev_config+0x5b
spa_sync+0x37b
txg_sync_thread+0x27a
thread_generic_wrapper+0x6c
kthread+0x127
ret_from_fork+0x1f
0x0+0x0
i.e. `spa_sync` is also trying to get this lock..
and then we have ZIL threads also trying to get the lock.. e.g..
0xffff89120363a000 UNINTERRUPTIBLE 20
__schedule+0x2c8
__schedule+0x2c8
schedule+0x69
cv_wait_common+0x16e
__cv_wait+0x15
spa_config_enter_impl.constprop.0+0x1de
spa_config_enter+0x12
zil_lwb_write_issue+0x19f
zil_commit_writer+0xe9
zil_commit_impl+0x62
zil_commit+0x79
zfs_write+0xbad
zpl_iter_write+0xe7
do_iter_readv_writev+0x14f
do_iter_readv_writev+0x14f
do_iter_write+0x8c
do_iter_write+0x8c
vfs_iter_write+0x19
nfsd_vfs_write+0x149
nfsd4_write+0x130
nfsd4_proc_compound+0x415
nfsd_dispatch+0x16c
svc_process_common+0x3d5
svc_process+0xbc
nfsd+0xed
kthread+0x127
ret_from_fork+0x1f
0x0+0x0
0xffff8912c0276000 UNINTERRUPTIBLE 1
__schedule+0x2c8
__schedule+0x2c8
schedule+0x69
cv_wait_common+0x16e
__cv_wait+0x15
spa_config_enter_impl.constprop.0+0x1de
spa_config_enter+0x12
zil_lwb_write_issue+0x19f
zil_commit_waiter_timeout+0x1e8
zil_commit_waiter+0x215
zil_commit_impl+0x6d
zil_commit+0x79
zfs_fsync+0x89
zpl_fsync+0xe6
vfs_fsync_range+0x46
do_fsync+0x3d
do_fsync+0x3d
__x64_sys_fdatasync+0x17
__x64_sys_fdatasync+0x17
__x64_sys_fdatasync+0x17
do_syscall_64+0x59
do_syscall_64+0x59
entry_SYSCALL_64+0x99
Once in this state, the system remains there indefinitely.. it's deadlocked.
What we found is that we believe the config lock is held (as reader) by ZIL IOs that have not yet completed.. and there are a lot of ZIL IOs pending..
sdb> zio -r | member io_done ! grep zil_lwb | sort | uniq -c
32 (zio_done_func_t *)zil_lwb_flush_vdevs_done+0x0 = 0xffffffffc03a8620
183 (zio_done_func_t *)zil_lwb_write_done+0x0 = 0xffffffffc03a5d60
With trees of ZIOs like this:
0xffff8912d76b3a20 WRITE OPEN - -
0xffff8911c55e09b0 WRITE OPEN - -
0xffff89111508ab98 WRITE ISSUE_ASYNC - -
0xffff891115089838 WRITE ISSUE_ASYNC - -
0xffff89121cc884d8 WRITE ISSUE_ASYNC - -
0xffff89121cc88e88 WRITE ISSUE_ASYNC - -
0xffff89121cc8a1e8 WRITE ISSUE_ASYNC - -
0xffff8912cb470e88 WRITE ISSUE_ASYNC - -
0xffff8912cb471d10 WRITE OPEN - -
0xffff8912d76b21e8 WRITE OPEN - -
0xffff8912d76b3a20 WRITE OPEN - -
0xffff8911c55e09b0 WRITE OPEN - -
0xffff89111508ab98 WRITE ISSUE_ASYNC - -
0xffff891115089838 WRITE ISSUE_ASYNC - -
0xffff89121cc884d8 WRITE ISSUE_ASYNC - -
0xffff89121cc88e88 WRITE ISSUE_ASYNC - -
0xffff89121cc8a1e8 WRITE ISSUE_ASYNC - -
0xffff8912cb470e88 WRITE ISSUE_ASYNC - -
0xffff8912cb471d10 WRITE OPEN - -
0xffff8912dbbf5838 WRITE OPEN - -
0xffff8912d76b21e8 WRITE OPEN - -
0xffff8912d76b3a20 WRITE OPEN - -
0xffff8911c55e09b0 WRITE OPEN - -
0xffff89111508ab98 WRITE ISSUE_ASYNC - -
0xffff891115089838 WRITE ISSUE_ASYNC - -
0xffff89121cc884d8 WRITE ISSUE_ASYNC - -
0xffff89121cc88e88 WRITE ISSUE_ASYNC - -
0xffff89121cc8a1e8 WRITE ISSUE_ASYNC - -
0xffff8912cb470e88 WRITE ISSUE_ASYNC - -
0xffff8912cb471d10 WRITE OPEN - -
It's hard to see from this output, but we verified these ZIOs to be for ZIL blocks.
So, at this point, the question we tried to answer is: why aren't these ZIOs completing? From what we can tell, that's the cause of the deadlock.. i.e. we have ZIL blocks that took the config lock as reader (i.e. acquired in `zil_lwb_write_issue`), got issued, but never dropped that lock because they never completed (i.e. they would have dropped it in `zil_lwb_flush_vdevs_done`).. All other stuck threads are stuck because these ZIL IOs haven't dropped the lock..
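For context, here's that acquire/release pairing paraphrased from zil.c as I read it (everything unrelated to the config lock is elided):

```c
/*
 * Paraphrased from zil.c (details and error handling elided); the point
 * is the pairing: each LWB takes a reader hold on the config lock when
 * it is issued, tagged with the lwb pointer, and only drops that hold
 * once the LWB's ZIO tree fully completes.
 */

/* in zil_lwb_write_issue(): */
spa_config_enter(zilog->zl_spa, SCL_STATE, lwb, RW_READER);
/* ... the lwb's write ZIO (and its root/flush ZIOs) are then issued ... */

/* in zil_lwb_flush_vdevs_done(), the lwb's final "done" callback: */
lwb_t *lwb = zio->io_private;
spa_config_exit(lwb->lwb_zilog->zl_spa, SCL_STATE, lwb);
```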
I believe what is happening is: prior to f63811f, we would have issued all LWBs in the order they were allocated, because we called `zil_lwb_write_issue` while holding the `zl_issuer_lock`.. meaning that, from a ZIO parent/child perspective, the child ZIOs would always get issued prior to their parents getting issued..
Meaning, by the time the parent ZIO tries to get the config lock in `zil_lwb_write_issue`, its child ZIO would have necessarily already taken its own reader hold on the config lock.. i.e. the child ZIO would always get its reader lock before the parent gets its own reader lock..
I think, due to the changes in f63811f, this isn't necessarily true anymore. Specifically, we now drop the `zl_issuer_lock` before we call `zil_lwb_write_issue`.. such that it's possible for the thread to call `schedule` in between dropping the `zl_issuer_lock` and actually issuing the LWBs..
While that thread is sleeping, it's also possible for another thread to acquire the `zl_issuer_lock` and call `zil_lwb_write_issue` on its LWBs.. all while the first thread is still sleeping and hasn't yet issued those original LWBs..
Meaning, the second thread's LWBs will be parents of the first thread's LWBs (from the perspective of the LWB root ZIOs).. and these parent LWBs would get issued (and thus grab the config lock as reader) prior to the child LWBs being issued..
Further, if those child LWBs are then unable to get the config lock as reader (as is our case), then the child ZIOs will never complete, and they'll prevent the parent ZIOs from completing as well.
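To make the ordering change concrete, here's a simplified sketch of the control flow as I understand it (this is a paraphrase, not the actual zil.c code; everything unrelated to lock/issue ordering is elided):

```c
/* Prior to f63811f: LWBs were issued while zl_issuer_lock was held,   */
/* in allocation order, so a child LWB's reader hold on the config     */
/* lock was always taken before its parent's.                          */
mutex_enter(&zilog->zl_issuer_lock);
/* ... LWBs are created and issued (zil_lwb_write_issue()) here ... */
mutex_exit(&zilog->zl_issuer_lock);

/* After f63811f: LWBs are collected under the lock, then issued only  */
/* after the lock has been dropped.                                    */
mutex_enter(&zilog->zl_issuer_lock);
/* ... LWBs are created and appended to the local "ilwbs" list ... */
mutex_exit(&zilog->zl_issuer_lock);

/*
 * Window: this thread can be descheduled right here, while another
 * thread takes zl_issuer_lock and issues *later* (parent) LWBs before
 * these (child) LWBs have been issued.
 */
while ((lwb = list_remove_head(ilwbs)) != NULL)
	zil_lwb_write_issue(zilog, lwb);
```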
Since that's a little convoluted, I want to be clear.. Here's what is happening:
1. Thread A calls `zil_commit_writer`, gets the `zl_issuer_lock`, builds up its `ilwbs` list, drops the `zl_issuer_lock`, then goes to sleep before processing and issuing LWBs from the `ilwbs` list..
2. Thread B calls `zil_commit_writer`, gets the `zl_issuer_lock`, builds up its own `ilwbs` list, drops the `zl_issuer_lock`, and calls `zil_lwb_write_issue` on all of its `ilwbs`..
3. Thread C (unrelated to the ZIL) tries to get the config lock as writer (e.g. `zfs_ioc_pool_reopen` is called), but is blocked due to ZIOs from (2) holding the lock as reader, so it waits..
4. Thread A wakes up, calls `zil_lwb_write_issue`, and tries to get the config lock as reader, but is blocked due to the thread from (3) trying to get the lock as writer, so it waits..
At this point, the system is deadlocked.. the ZIOs from (4) are children of the ZIOs from (2).. the ZIOs from (4) are blocked on the thread from (3).. the thread from (3) is blocked by the ZIOs from (2).. and the ZIOs from (2) are blocked waiting on the ZIOs from (4).. so we're deadlocked..
I've verified this to be true.. for example, we can take this tree of ZIOs from earlier:
0xffff8912dbbf5838 WRITE OPEN - -
0xffff8912d76b21e8 WRITE OPEN - -
0xffff8912d76b3a20 WRITE OPEN - -
0xffff8911c55e09b0 WRITE OPEN - -
0xffff89111508ab98 WRITE ISSUE_ASYNC - -
0xffff891115089838 WRITE ISSUE_ASYNC - -
0xffff89121cc884d8 WRITE ISSUE_ASYNC - -
0xffff89121cc88e88 WRITE ISSUE_ASYNC - -
0xffff89121cc8a1e8 WRITE ISSUE_ASYNC - -
0xffff8912cb470e88 WRITE ISSUE_ASYNC - -
0xffff8912cb471d10 WRITE OPEN - -
In this case, `0xffff8912cb471d10` is a child of `0xffff8912cb470e88`.. since `0xffff8912cb470e88` is in the ZIO pipeline, we know it must hold the config lock as reader (e.g. it's a parent ZIO from (2) in the prior example).. `0xffff8912cb471d10` is a child ZIO that would have been created in (1) of the prior example, but hasn't been issued yet.. we can verify it's stuck as I described in (4), by getting the `lwb_t` from that `zio_t`, and then mapping that pointer to a thread on the system:
sdb> echo 0xffff8912cb471d10 | zio | member io_private
(void *)0xffff891202b2a620
Thread 675 (pid 1586):
#0 0xffffffffa3a3a568 in context_switch (rf=0xffff9b7cc559b7c0, next=0xffff891204ca4000, prev=0xffff8912057c2000, rq=0xffff8912f9730ac0) at linux/kernel/sched/core.c:4994
#1 0xffffffffa3a3a568 in __schedule (sched_mode=sched_mode@entry=0) at linux/kernel/sched/core.c:6341
#2 0xffffffffa3a3ab69 in schedule () at linux/kernel/sched/core.c:6424
#3 0xffffffffc012d78e in cv_wait_common (cvp=cvp@entry=0xffff891245c261c0, mp=mp@entry=0xffff891245c26180, state=state@entry=2, io=io@entry=0) at zfs/module/os/linux/spl/spl-condvar.c:146
#4 0xffffffffc012d8d5 in __cv_wait (cvp=cvp@entry=0xffff891245c261c0, mp=mp@entry=0xffff891245c26180) at zfs/module/os/linux/spl/spl-condvar.c:171
#5 0xffffffffc031d3de in spa_config_enter_impl (spa=0xffff891245c24000, locks=locks@entry=2, rw=rw@entry=RW_READER, mmp_flag=mmp_flag@entry=0, tag=0xffff891202b2a620) at zfs/module/zfs/spa_misc.c:613
#6 0xffffffffc031d4d2 in spa_config_enter (spa=<optimized out>, locks=locks@entry=2, tag=tag@entry=0xffff891202b2a620, rw=rw@entry=RW_READER) at zfs/module/zfs/spa_misc.c:643
#7 0xffffffffc03a6a7f in zil_lwb_write_issue (zilog=zilog@entry=0xffff8911107f7800, lwb=0xffff891202b2a620) at zfs/module/zfs/zil.c:1947
#8 0xffffffffc03ac5e9 in zil_commit_writer (zilog=zilog@entry=0xffff8911107f7800, zcw=zcw@entry=0xffff8912c5e9d720) at zfs/module/zfs/zil.c:3009
#9 0xffffffffc03ad112 in zil_commit_impl (zilog=zilog@entry=0xffff8911107f7800, foid=foid@entry=139) at zfs/module/zfs/zil.c:3540
#10 0xffffffffc03ad1e9 in zil_commit (zilog=zilog@entry=0xffff8911107f7800, foid=139) at zfs/module/zfs/zil.c:3503
#11 0xffffffffc03a1aed in zfs_write (zp=zp@entry=0xffff8912db672c60, uio=uio@entry=0xffff9b7cc559bb98, ioflag=1085442, cr=cr@entry=0xffff8912756e0d80) at zfs/module/zfs/zfs_vnops.c:786
#12 0xffffffffc0406447 in zpl_iter_write (kiocb=0xffff9b7cc559bc48, from=<optimized out>) at zfs/module/os/linux/zfs/zpl_file.c:392
#13 0xffffffffa317ca9f in call_write_iter (file=<unavailable>, iter=<unavailable>, kio=<unavailable>) at linux/include/linux/fs.h:2097
#14 0xffffffffa317ca9f in do_iter_readv_writev (filp=<unavailable>, filp@entry=0xffff89125443f300, iter=<unavailable>, iter@entry=0xffff9b7cc559bd28, ppos=ppos@entry=0xffff9b7cc559bd20, type=<unavailable>, type@entry=1, flags=flags@entry=4) at linux/fs/read_write.c:731
#15 0xffffffffa317de3c in do_iter_write (flags=4, pos=0xffff9b7cc559bd20, iter=0xffff9b7cc559bd28, file=0xffff89125443f300) at linux/fs/read_write.c:857
#16 0xffffffffa317de3c in do_iter_write (file=0xffff89125443f300, iter=0xffff9b7cc559bd28, pos=0xffff9b7cc559bd20, flags=4) at linux/fs/read_write.c:838
#17 0xffffffffa317df99 in vfs_iter_write (file=file@entry=0xffff89125443f300, iter=iter@entry=0xffff9b7cc559bd28, ppos=ppos@entry=0xffff9b7cc559bd20, flags=flags@entry=4) at linux/fs/read_write.c:898
#18 0xffffffffc08d0cd9 in nfsd_vfs_write (rqstp=rqstp@entry=0xffff891204fd0000, fhp=fhp@entry=0xffff8912057d0030, nf=<optimized out>, offset=71827456, vec=vec@entry=0xffff891204fd0b88, vlen=<optimized out>, cnt=0xffff9b7cc559bdb0, stable=2, verf=0xffff8912057c9210) at linux/fs/nfsd/vfs.c:1009
#19 0xffffffffc08e1620 in nfsd4_write (rqstp=0xffff891204fd0000, cstate=0xffff8912057d0030, u=0xffff8912057c91a0) at linux/fs/nfsd/nfs4proc.c:1044
#20 0xffffffffc08e3f05 in nfsd4_proc_compound (rqstp=0xffff891204fd0000) at linux/fs/nfsd/nfs4proc.c:2558
#21 0xffffffffc08ca9cc in nfsd_dispatch (rqstp=0xffff891204fd0000, statp=0xffff891218d9d014) at linux/fs/nfsd/nfssvc.c:1037
#22 0xffffffffc07df255 in svc_process_common (rqstp=rqstp@entry=0xffff891204fd0000, argv=argv@entry=0xffff891204fd01a8, resv=resv@entry=0xffff891204fd0288) at linux/net/sunrpc/svc.c:1402
#23 0xffffffffc07df65c in svc_process (rqstp=rqstp@entry=0xffff891204fd0000) at linux/net/sunrpc/svc.c:1542
#24 0xffffffffc08ca28d in nfsd (vrqstp=0xffff891204fd0000) at linux/fs/nfsd/nfssvc.c:969
#25 0xffffffffa2ee6f57 in kthread (_create=0xffff891203ca2600) at linux/kernel/kthread.c:319
#26 0xffffffffa2e04c3f in ret_from_fork () at linux/arch/x86/entry/entry_64.S:298
Here we can see the `lwb` pointer matches the frame:
#7 0xffffffffc03a6a7f in zil_lwb_write_issue (zilog=zilog@entry=0xffff8911107f7800, lwb=0xffff891202b2a620) at zfs/module/zfs/zil.c:1947
and we can see this thread is stuck in `zil_lwb_write_issue` trying to get the config lock as reader.. stuck because this other thread is trying to get the config lock as writer:
Thread 719 (pid 7956):
#0 0xffffffffa3a3a568 in context_switch (rf=0xffff9b7cc3a63be8, next=0xffff891219658000, prev=0xffff8911c9862000, rq=0xffff8912f9630ac0) at linux/kernel/sched/core.c:4994
#1 0xffffffffa3a3a568 in __schedule (sched_mode=sched_mode@entry=0) at linux/kernel/sched/core.c:6341
#2 0xffffffffa3a3ab69 in schedule () at linux/kernel/sched/core.c:6424
#3 0xffffffffc012d78e in cv_wait_common (cvp=cvp@entry=0xffff891245c261c0, mp=mp@entry=0xffff891245c26180, state=state@entry=2, io=io@entry=0) at zfs/module/os/linux/spl/spl-condvar.c:146
#4 0xffffffffc012d8d5 in __cv_wait (cvp=cvp@entry=0xffff891245c261c0, mp=mp@entry=0xffff891245c26180) at zfs/module/os/linux/spl/spl-condvar.c:171
#5 0xffffffffc031d31a in spa_config_enter_impl (spa=spa@entry=0xffff891245c24000, locks=locks@entry=22, rw=rw@entry=RW_WRITER, mmp_flag=mmp_flag@entry=0, tag=0xffff891245c24000) at zfs/module/zfs/spa_misc.c:629
#6 0xffffffffc031d4d2 in spa_config_enter (spa=spa@entry=0xffff891245c24000, locks=locks@entry=22, tag=tag@entry=0xffff891245c24000, rw=rw@entry=RW_WRITER) at zfs/module/zfs/spa_misc.c:643
#7 0xffffffffc031d568 in spa_vdev_state_enter (spa=0xffff891245c24000, oplocks=oplocks@entry=0) at zfs/module/zfs/spa_misc.c:1472
#8 0xffffffffc038e453 in zfs_ioc_pool_reopen (pool=0xffff8912d391c000 "domain0", innvl=<optimized out>, outnvl=<optimized out>) at zfs/module/zfs/zfs_ioctl.c:5871
#9 0xffffffffc0396705 in zfsdev_ioctl_common (vecnum=vecnum@entry=61, zc=zc@entry=0xffff8912d391c000, flag=flag@entry=0) at zfs/module/zfs/zfs_ioctl.c:7800
#10 0xffffffffc03e4c27 in zfsdev_ioctl (filp=<optimized out>, cmd=23101, arg=140289147176688) at zfs/module/os/linux/zfs/zfs_ioctl_os.c:144
#11 0xffffffffa3198682 in vfs_ioctl (arg=140289147176688, cmd=<optimized out>, filp=0xffff891208b9e800) at linux/fs/ioctl.c:51
#12 0xffffffffa3198682 in __do_sys_ioctl (arg=140289147176688, cmd=23101, fd=6) at linux/fs/ioctl.c:874
#13 0xffffffffa3198682 in __se_sys_ioctl (arg=140289147176688, cmd=23101, fd=6) at linux/fs/ioctl.c:860
#14 0xffffffffa3198682 in __x64_sys_ioctl (regs=<optimized out>) at linux/fs/ioctl.c:860
#15 0xffffffffa3a2c709 in do_syscall_x64 (nr=<optimized out>, regs=0xffff9b7cc3a63f58) at linux/arch/x86/entry/common.c:50
#16 0xffffffffa3a2c709 in do_syscall_64 (regs=0xffff9b7cc3a63f58, nr=<optimized out>) at linux/arch/x86/entry/common.c:80
#17 0xffffffffa3c00099 in entry_SYSCALL_64 () at linux/arch/x86/entry/entry_64.S:118
FWIW, the unrelated (to the ZIL) thread attempting to get the config lock as writer is necessary to trigger the deadlock.. which might not be a common case for most users, but is common for Delphix.. which might help explain why we've seen this, but perhaps others have not.
Further, I think what this boils down to is: due to how we obtain the config lock as reader for each LWB that we issue, and the parent/child relationship of LWB ZIOs, I think we need to issue the LWBs in parent/child order.. i.e. always issue the child ZIOs before the parents, as we did prior to f63811f.. so that we cannot get into a situation where a parent ZIO holds the config lock and a child ZIO can't get it.. or, perhaps, change how we acquire the config lock when issuing the ZIOs..
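For example, one possible direction (purely a hypothetical sketch, not a tested patch) would be to take each LWB's reader hold on the config lock while `zl_issuer_lock` is still held, so the holds are acquired in the same child-before-parent order in which the LWBs were created, even if the ZIOs themselves are still issued after the lock is dropped:

```c
/*
 * Hypothetical sketch only -- not a tested patch.  Idea: acquire the
 * per-LWB reader hold on the config lock while zl_issuer_lock is still
 * held (i.e. in creation, child-before-parent, order), and have
 * zil_lwb_write_issue() skip the spa_config_enter() it does today.
 */
mutex_enter(&zilog->zl_issuer_lock);
/* ... LWBs are created and appended to the local "ilwbs" list ... */
for (lwb = list_head(ilwbs); lwb != NULL; lwb = list_next(ilwbs, lwb))
	spa_config_enter(zilog->zl_spa, SCL_STATE, lwb, RW_READER);
mutex_exit(&zilog->zl_issuer_lock);

/* The holds were taken in child-before-parent order above, so the    */
/* order in which the ZIOs are actually issued no longer matters.     */
while ((lwb = list_remove_head(ilwbs)) != NULL)
	zil_lwb_write_issue(zilog, lwb);
```

That said, simply issuing the LWBs while `zl_issuer_lock` is still held (as we did before f63811f) would also restore the child-before-parent ordering, at the cost of re-expanding that lock's scope.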
@amotin does this make sense to you?