Commit 602b5dc

amotin authored and behlendorf committed
Fix read errors race after block cloning
While investigating the read errors that triggered the panic fixed in #16042, I found a race in the sync process between the moment the dirty record for a cloned block is removed and the moment the dbuf is destroyed. If dmu_buf_hold_array_by_dnode() takes a hold on a cloned dbuf before it is synced/destroyed, then dbuf_read_impl() may see it still in DB_NOFILL state, but without the dirty record. Such a case is not an error, but is equivalent to DB_UNCACHED, since the dbuf block pointer has already been updated by dbuf_write_ready().

Unfortunately, it is impossible to safely change the dbuf state to DB_UNCACHED there, since another cloning may already be in progress, having dropped the dbuf lock before creating a new dirty record, protected only by the range lock.

Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Robert Evans <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored by: iXsystems, Inc.
Closes #16052
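The timeline described in the message can be walked through with a small model. This is only an illustrative sketch: the `model_*` types and functions are hypothetical stand-ins invented here, not ZFS definitions, and the step ordering (clone, block pointer update, dirty record removal) is taken from the description above rather than from the real sync code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the ZFS structures. */
typedef enum { MODEL_UNCACHED, MODEL_NOFILL } model_state_t;
typedef struct { int id; } model_bp_t;

typedef struct {
	model_state_t state;	/* models db->db_state */
	model_bp_t *blkptr;	/* models db->db_blkptr */
	bool has_dirty;		/* models a non-empty db_dirty_records */
	bool brtwrite;		/* models dr->dt.dl.dr_brtwrite */
	model_bp_t overridden;	/* models dr->dt.dl.dr_overridden_by */
} model_dbuf_t;

/*
 * Block pointer selection as described by this commit. Returns 0 and
 * stores the block id to read in *out (-1 means "no block pointer"),
 * or returns EIO.
 */
static int
model_pick_bp(const model_dbuf_t *db, int *out)
{
	const model_bp_t *bpp = NULL;

	if (db->state == MODEL_NOFILL && db->has_dirty) {
		if (!db->brtwrite)
			return (EIO);
		bpp = &db->overridden;
	}
	/* No dirty record: treat the dbuf as if it were uncached. */
	if (bpp == NULL && db->blkptr != NULL)
		bpp = db->blkptr;
	*out = (bpp != NULL) ? bpp->id : -1;
	return (0);
}

/* Replays the timeline up to `step` and returns the block id a reader sees. */
static int
model_read_at_step(int step)
{
	model_bp_t on_disk = { .id = 1 };	/* block before cloning */
	model_dbuf_t db = { MODEL_UNCACHED, &on_disk, false, false, { 0 } };
	int id = -2;

	if (step >= 1) {	/* clone: dirty record created, NOFILL */
		db.state = MODEL_NOFILL;
		db.has_dirty = true;
		db.brtwrite = true;
		db.overridden.id = 2;
	}
	if (step >= 2)		/* write ready: block pointer updated */
		on_disk.id = 2;
	if (step >= 3)		/* dirty record removed, dbuf not yet destroyed */
		db.has_dirty = false;

	int err = model_pick_bp(&db, &id);
	assert(err == 0);
	return (id);
}
```

In this model, step 3 is the race window: the state is still NOFILL, the dirty record is gone, and the reader falls back to the already updated block pointer instead of failing.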
1 parent d5fb6ab commit 602b5dc

1 file changed, +20 -21 lines changed


module/zfs/dbuf.c

Lines changed: 20 additions & 21 deletions
@@ -1548,7 +1548,7 @@ dbuf_read_impl(dmu_buf_impl_t *db, dnode_t *dn, zio_t *zio, uint32_t flags,
 	zbookmark_phys_t zb;
 	uint32_t aflags = ARC_FLAG_NOWAIT;
 	int err, zio_flags;
-	blkptr_t bp, *bpp;
+	blkptr_t bp, *bpp = NULL;
 
 	ASSERT(!zfs_refcount_is_zero(&db->db_holds));
 	ASSERT(MUTEX_HELD(&db->db_mtx));
@@ -1562,29 +1562,28 @@ dbuf_read_impl(dmu_buf_impl_t *db, dnode_t *dn, zio_t *zio, uint32_t flags,
 		goto early_unlock;
 	}
 
-	if (db->db_state == DB_UNCACHED) {
-		if (db->db_blkptr == NULL) {
-			bpp = NULL;
-		} else {
-			bp = *db->db_blkptr;
+	/*
+	 * If we have a pending block clone, we don't want to read the
+	 * underlying block, but the content of the block being cloned,
+	 * pointed by the dirty record, so we have the most recent data.
+	 * If there is no dirty record, then we hit a race in a sync
+	 * process when the dirty record is already removed, while the
+	 * dbuf is not yet destroyed. Such case is equivalent to uncached.
+	 */
+	if (db->db_state == DB_NOFILL) {
+		dbuf_dirty_record_t *dr = list_head(&db->db_dirty_records);
+		if (dr != NULL) {
+			if (!dr->dt.dl.dr_brtwrite) {
+				err = EIO;
+				goto early_unlock;
+			}
+			bp = dr->dt.dl.dr_overridden_by;
 			bpp = &bp;
 		}
-	} else {
-		dbuf_dirty_record_t *dr;
-
-		ASSERT3S(db->db_state, ==, DB_NOFILL);
+	}
 
-		/*
-		 * Block cloning: If we have a pending block clone,
-		 * we don't want to read the underlying block, but the content
-		 * of the block being cloned, so we have the most recent data.
-		 */
-		dr = list_head(&db->db_dirty_records);
-		if (dr == NULL || !dr->dt.dl.dr_brtwrite) {
-			err = EIO;
-			goto early_unlock;
-		}
-		bp = dr->dt.dl.dr_overridden_by;
+	if (bpp == NULL && db->db_blkptr != NULL) {
+		bp = *db->db_blkptr;
 		bpp = &bp;
 	}
 
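One way to check what the second hunk changes is to model the removed and added branches side by side and enumerate every input. The sketch below is an assumption-laden illustration: all `model_*` names are hypothetical, the structures are reduced to the few fields the hunk touches, and locking, the ARC read, and the `early_unlock` path are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the ZFS structures. */
typedef enum { MODEL_UNCACHED, MODEL_NOFILL } model_state_t;
typedef struct { int id; } model_bp_t;

typedef struct {
	model_state_t state;	/* models db->db_state */
	model_bp_t *blkptr;	/* models db->db_blkptr */
	bool has_dirty;		/* models a non-empty db_dirty_records */
	bool brtwrite;		/* models dr->dt.dl.dr_brtwrite */
	model_bp_t overridden;	/* models dr->dt.dl.dr_overridden_by */
} model_dbuf_t;

/* Removed logic: NOFILL without a dirty record is reported as EIO. */
static int
model_old(const model_dbuf_t *db, int *out)
{
	if (db->state == MODEL_UNCACHED) {
		*out = (db->blkptr != NULL) ? db->blkptr->id : -1;
		return (0);
	}
	if (!db->has_dirty || !db->brtwrite)
		return (EIO);
	*out = db->overridden.id;
	return (0);
}

/* Added logic: NOFILL without a dirty record falls back to blkptr. */
static int
model_new(const model_dbuf_t *db, int *out)
{
	const model_bp_t *bpp = NULL;

	if (db->state == MODEL_NOFILL && db->has_dirty) {
		if (!db->brtwrite)
			return (EIO);
		bpp = &db->overridden;
	}
	if (bpp == NULL && db->blkptr != NULL)
		bpp = db->blkptr;
	*out = (bpp != NULL) ? bpp->id : -1;
	return (0);
}

/*
 * Enumerates all 16 combinations of the four inputs. Returns how many
 * differ between old and new logic, and stores in *outside_race how
 * many of those are not the "NOFILL without dirty record" case.
 */
static int
model_count_mismatches(int *outside_race)
{
	model_bp_t disk = { .id = 1 };
	int total = 0;

	*outside_race = 0;
	for (int bits = 0; bits < 16; bits++) {
		model_dbuf_t db = {
			.state = (bits & 1) ? MODEL_NOFILL : MODEL_UNCACHED,
			.has_dirty = (bits & 2) != 0,
			.brtwrite = (bits & 4) != 0,
			.blkptr = (bits & 8) ? &disk : NULL,
			.overridden = { .id = 2 },
		};
		int old_id = -2, new_id = -2;
		int old_err = model_old(&db, &old_id);
		int new_err = model_new(&db, &new_id);

		if (old_err == new_err && old_id == new_id)
			continue;
		total++;
		if (!(db.state == MODEL_NOFILL && !db.has_dirty))
			(*outside_race)++;
	}
	return (total);
}
```

Within this model the two versions disagree only when the state is NOFILL and no dirty record exists, which matches the race the commit message describes; every other combination, including the EIO for a non-clone dirty record, is unchanged.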
