Fix read errors race after block cloning

amotin · amotin · commit 12433ce1c60b · 2024-04-02T18:02:01.000-04:00
Investigating read errors triggering panic fixed in openzfs#16042 I've found that we have a race in a sync process between the moment dirty record for cloned block is removed and the moment dbuf is destroyed. If dmu_buf_hold_array_by_dnode() take a hold on a cloned dbuf before it is synced/destroyed, then dbuf_read_impl() may see it still in DB_NOFILL state, but without the dirty record. Such case is not an error, but equivalent to DB_CACHED, since the dbuf block pointer is already updated by dbuf_write_ready(). Unfortunately it is impossible to safely change the dbuf state to DB_CACHED there, since there may already be another cloning in progress, that dropped dbuf lock before creating a new dirty record, protected only by the range lock. Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc.
diff --git a/module/zfs/dbuf.c b/module/zfs/dbuf.c
@@ -1581,6 +1581,7 @@ dbuf_read_impl(dmu_buf_impl_t *db, zio_t *zio, uint32_t flags,
 	}
 
 	if (db->db_state == DB_UNCACHED) {
+uncached:
 		if (db->db_blkptr == NULL) {
 			bpp = NULL;
 		} else {
@@ -1593,12 +1594,18 @@ dbuf_read_impl(dmu_buf_impl_t *db, zio_t *zio, uint32_t flags,
 		ASSERT3S(db->db_state, ==, DB_NOFILL);
 
 		/*
-		 * Block cloning: If we have a pending block clone,
-		 * we don't want to read the underlying block, but the content
-		 * of the block being cloned, so we have the most recent data.
+		 * If we have a pending block clone, we don't want to read
+		 * the underlying block, but the content of the block being
+		 * cloned, pointed by the dirty record, so we have the most
+		 * recent data.  If there is no dirty record, then we hit a
+		 * race in a sync process when the dirty record is already
+		 * removed, while the dbuf is not yet destroyed.  Such case
+		 * is equivalent to uncached.
 		 */
 		dr = list_head(&db->db_dirty_records);
-		if (dr == NULL || !dr->dt.dl.dr_brtwrite) {
+		if (dr == NULL)
+			goto uncached;
+		if (!dr->dt.dl.dr_brtwrite) {
 			err = EIO;
 			goto early_unlock;
 		}