Commit 2921ad6
Fix zrele race in zrele_async that can cause hang

There is a race condition in zfs_zrele_async when we are checking if we
would be the one to evict an inode. This can lead to a txg sync
deadlock. Instead of calling into iput directly, we attempt to perform
the atomic decrement ourselves, unless that would set the i_count value
to zero. In that case, we dispatch a call to iput to run later, to
prevent a deadlock from occurring.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes #11527
Closes #11530
1 parent b8e6401 commit 2921ad6

File tree

1 file changed
+22 -12 lines changed

module/os/linux/zfs/zfs_vnops_os.c

Lines changed: 22 additions & 12 deletions
@@ -87,15 +87,18 @@
  *	must be checked with ZFS_VERIFY_ZP(zp). Both of these macros
  *	can return EIO from the calling function.
  *
- * (2)	zrele() should always be the last thing except for zil_commit()
- *	(if necessary) and ZFS_EXIT(). This is for 3 reasons:
- *	First, if it's the last reference, the vnode/znode
- *	can be freed, so the zp may point to freed memory. Second, the last
- *	reference will call zfs_zinactive(), which may induce a lot of work --
- *	pushing cached pages (which acquires range locks) and syncing out
- *	cached atime changes. Third, zfs_zinactive() may require a new tx,
- *	which could deadlock the system if you were already holding one.
- *	If you must call zrele() within a tx then use zfs_zrele_async().
+ * (2)	zrele() should always be the last thing except for zil_commit() (if
+ *	necessary) and ZFS_EXIT(). This is for 3 reasons: First, if it's the
+ *	last reference, the vnode/znode can be freed, so the zp may point to
+ *	freed memory. Second, the last reference will call zfs_zinactive(),
+ *	which may induce a lot of work -- pushing cached pages (which acquires
+ *	range locks) and syncing out cached atime changes. Third,
+ *	zfs_zinactive() may require a new tx, which could deadlock the system
+ *	if you were already holding one. This deadlock occurs because the tx
+ *	currently being operated on prevents a txg from syncing, which
+ *	prevents the new tx from progressing, resulting in a deadlock. If you
+ *	must call zrele() within a tx, use zfs_zrele_async(). Note that iput()
+ *	is a synonym for zrele().
  *
  * (3)	All range locks must be grabbed before calling dmu_tx_assign(),
  *	as they can span dmu_tx_assign() calls.
@@ -398,11 +401,18 @@ zfs_zrele_async(znode_t *zp)
 	ASSERT(atomic_read(&ip->i_count) > 0);
 	ASSERT(os != NULL);
 
-	if (atomic_read(&ip->i_count) == 1)
+	/*
+	 * If decrementing the count would put us at 0, we can't do it inline
+	 * here, because that would be synchronous. Instead, dispatch an iput
+	 * to run later.
+	 *
+	 * For more information on the dangers of a synchronous iput, see the
+	 * header comment of this file.
+	 */
+	if (!atomic_add_unless(&ip->i_count, -1, 1)) {
 		VERIFY(taskq_dispatch(dsl_pool_zrele_taskq(dmu_objset_pool(os)),
 		    (task_func_t *)iput, ip, TQ_SLEEP) != TASKQID_INVALID);
-	else
-		zrele(zp);
+	}
 }