Skip to content

Commit 2bba9fd

Browse files
jasonbkingbehlendorf
authored andcommitted
Zpool can start allocating from metaslab before TRIMs have completed
When doing a manual TRIM on a zpool, the metaslab being TRIMmed is potentially re-enabled before all queued TRIM zios for that metaslab have completed. Since TRIM zios have the lowest priority, it is possible to get into a situation where allocations occur from the just re-enabled metaslab and cut ahead of queued TRIMs to the same metaslab. If the ranges overlap, this will cause corruption. We were able to trigger this pretty consistently with a small single top-level vdev zpool (i.e. small number of metaslabs) with heavy parallel write activity while performing a manual TRIM against a somewhat 'slow' device (so TRIMs took a bit of time to complete). With the patch, we've not been able to recreate it since. It was on illumos, but inspection of the OpenZFS trim code looks like the relevant pieces are largely unchanged and so it appears it would be vulnerable to the same issue. Reviewed-by: Igor Kozhukhov <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Jason King <[email protected]> Illumos-issue: https://www.illumos.org/issues/15939 Closes #15395
1 parent 30ee2ee commit 2bba9fd

File tree

1 file changed

+19
-9
lines changed

1 file changed

+19
-9
lines changed

module/zfs/vdev_trim.c

Lines changed: 19 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,7 @@
2323
* Copyright (c) 2016 by Delphix. All rights reserved.
2424
* Copyright (c) 2019 by Lawrence Livermore National Security, LLC.
2525
* Copyright (c) 2021 Hewlett Packard Enterprise Development LP
26+
* Copyright 2023 RackTop Systems, Inc.
2627
*/
2728

2829
#include <sys/spa.h>
@@ -591,6 +592,7 @@ vdev_trim_ranges(trim_args_t *ta)
591592
uint64_t extent_bytes_max = ta->trim_extent_bytes_max;
592593
uint64_t extent_bytes_min = ta->trim_extent_bytes_min;
593594
spa_t *spa = vd->vdev_spa;
595+
int error = 0;
594596

595597
ta->trim_start_time = gethrtime();
596598
ta->trim_bytes_done = 0;
@@ -610,19 +612,32 @@ vdev_trim_ranges(trim_args_t *ta)
610612
uint64_t writes_required = ((size - 1) / extent_bytes_max) + 1;
611613

612614
for (uint64_t w = 0; w < writes_required; w++) {
613-
int error;
614-
615615
error = vdev_trim_range(ta, VDEV_LABEL_START_SIZE +
616616
rs_get_start(rs, ta->trim_tree) +
617617
(w *extent_bytes_max), MIN(size -
618618
(w * extent_bytes_max), extent_bytes_max));
619619
if (error != 0) {
620-
return (error);
620+
goto done;
621621
}
622622
}
623623
}
624624

625-
return (0);
625+
done:
626+
/*
627+
* Make sure all TRIMs for this metaslab have completed before
628+
* returning. TRIM zios have lower priority over regular or syncing
629+
* zios, so all TRIM zios for this metaslab must complete before the
630+
* metaslab is re-enabled. Otherwise it's possible write zios to
631+
* this metaslab could cut ahead of still queued TRIM zios for this
632+
* metaslab causing corruption if the ranges overlap.
633+
*/
634+
mutex_enter(&vd->vdev_trim_io_lock);
635+
while (vd->vdev_trim_inflight[0] > 0) {
636+
cv_wait(&vd->vdev_trim_io_cv, &vd->vdev_trim_io_lock);
637+
}
638+
mutex_exit(&vd->vdev_trim_io_lock);
639+
640+
return (error);
626641
}
627642

628643
static void
@@ -941,11 +956,6 @@ vdev_trim_thread(void *arg)
941956
}
942957

943958
spa_config_exit(spa, SCL_CONFIG, FTAG);
944-
mutex_enter(&vd->vdev_trim_io_lock);
945-
while (vd->vdev_trim_inflight[0] > 0) {
946-
cv_wait(&vd->vdev_trim_io_cv, &vd->vdev_trim_io_lock);
947-
}
948-
mutex_exit(&vd->vdev_trim_io_lock);
949959

950960
range_tree_destroy(ta.trim_tree);
951961

0 commit comments

Comments
 (0)