Commit 12433ce

Fix read errors race after block cloning
While investigating the read errors that triggered the panic fixed in openzfs#16042, I found a race in the sync process between the moment the dirty record for a cloned block is removed and the moment the dbuf is destroyed. If dmu_buf_hold_array_by_dnode() takes a hold on a cloned dbuf before it is synced/destroyed, then dbuf_read_impl() may see it still in DB_NOFILL state, but without the dirty record. This case is not an error but is equivalent to DB_UNCACHED, since the dbuf block pointer has already been updated by dbuf_write_ready().

Unfortunately, it is impossible to safely change the dbuf state to DB_UNCACHED there, since another cloning may already be in progress that dropped the dbuf lock before creating a new dirty record and is protected only by the range lock.

Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
1 parent 39be46f commit 12433ce

1 file changed: +11 -4 lines changed

module/zfs/dbuf.c

Lines changed: 11 additions & 4 deletions
@@ -1581,6 +1581,7 @@ dbuf_read_impl(dmu_buf_impl_t *db, zio_t *zio, uint32_t flags,
 	}
 
 	if (db->db_state == DB_UNCACHED) {
+uncached:
 		if (db->db_blkptr == NULL) {
 			bpp = NULL;
 		} else {
@@ -1593,12 +1594,18 @@ dbuf_read_impl(dmu_buf_impl_t *db, zio_t *zio, uint32_t flags,
 		ASSERT3S(db->db_state, ==, DB_NOFILL);
 
 		/*
-		 * Block cloning: If we have a pending block clone,
-		 * we don't want to read the underlying block, but the content
-		 * of the block being cloned, so we have the most recent data.
+		 * If we have a pending block clone, we don't want to read
+		 * the underlying block, but the content of the block being
+		 * cloned, pointed by the dirty record, so we have the most
+		 * recent data. If there is no dirty record, then we hit a
+		 * race in a sync process when the dirty record is already
+		 * removed, while the dbuf is not yet destroyed. Such case
+		 * is equivalent to uncached.
 		 */
 		dr = list_head(&db->db_dirty_records);
-		if (dr == NULL || !dr->dt.dl.dr_brtwrite) {
+		if (dr == NULL)
+			goto uncached;
+		if (!dr->dt.dl.dr_brtwrite) {
 			err = EIO;
 			goto early_unlock;
 		}
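
The behavioral change in the second hunk can be illustrated with a small stand-alone model. This is only a sketch, not OpenZFS code: every `sk_`-prefixed type, enum value, and function is hypothetical and made up for illustration, and the `patched` flag is an assumption that lets one function show the branch before and after this commit. It models only the decision taken for a dbuf in the uncached or no-fill state, not locking or I/O.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real OpenZFS types. */
typedef enum { SK_UNCACHED, SK_NOFILL } sk_state_t;

typedef struct sk_dirty_record {
	bool brtwrite;	/* models dr->dt.dl.dr_brtwrite */
} sk_dirty_record_t;

typedef struct sk_dbuf {
	sk_state_t state;		/* models db->db_state */
	sk_dirty_record_t *dirty;	/* models the head of db_dirty_records */
} sk_dbuf_t;

typedef enum {
	SK_READ_BLKPTR,		/* read through the dbuf's own block pointer */
	SK_READ_CLONE_SOURCE,	/* read the block the dirty record points to */
	SK_FAIL_EIO		/* treat the state as an error */
} sk_read_action_t;

/*
 * Decide how a read is served. With patched == false the function
 * follows the old combined check (dr == NULL || !brtwrite => EIO);
 * with patched == true a missing dirty record falls through to the
 * uncached path, as the "goto uncached" in the diff does.
 */
static sk_read_action_t
sk_read_decision(const sk_dbuf_t *db, bool patched)
{
	assert(db != NULL);

	if (db->state == SK_UNCACHED)
		return (SK_READ_BLKPTR);

	/* Remaining case in this model: SK_NOFILL. */
	if (db->dirty == NULL)
		return (patched ? SK_READ_BLKPTR : SK_FAIL_EIO);
	if (!db->dirty->brtwrite)
		return (SK_FAIL_EIO);
	return (SK_READ_CLONE_SOURCE);
}
```

In this model, a no-fill dbuf with no dirty record yields `SK_FAIL_EIO` when `patched` is false and `SK_READ_BLKPTR` when it is true. The other three cases (uncached, pending clone, dirty record that is not a clone) are unaffected by the flag, which matches the narrow scope of the change.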
