
Commit 6f5aac3

Reduce latency effects of non-interactive I/O
Investigating the influence of scrub (especially sequential) on random read latency, I've noticed that on some HDDs a single 4KB read may take up to 4 seconds! Deeper investigation showed that many HDDs heavily prioritize sequential reads even when those are submitted with a queue depth of 1.

This patch addresses the latency from two sides:
- by using _min_active queue depths for non-interactive requests while interactive request(s) are active, and for a few requests after;
- by throttling further if no interactive requests have completed while a configured number of non-interactive ones did.

While there, I've also modified vdev_queue_class_to_issue() to give more chances to schedule at least _min_active requests to the lowest priorities. It should reduce starvation when several non-interactive processes run at the same time as interactive ones, and I think it should make it possible to set zfs_vdev_max_active as low as 1.

I've benchmarked this change with 4KB random reads from a ZVOL with 16KB block size on a newly written, non-fragmented pool. On a fragmented pool I also saw improvements, but not so dramatic. Below are log2 histograms of the random read latency in milliseconds for different devices:

4 2x mirror vdevs of SATA HDD WDC WD20EFRX-68EUZN0
before: 0, 0, 2, 1, 12, 21, 19, 18, 10, 15, 17, 21
after:  0, 0, 0, 24, 101, 195, 419, 250, 47, 4, 0, 0
, i.e. maximum latency reduced from 2s to 500ms.

4 2x mirror vdevs of SATA HDD WDC WD80EFZX-68UW8N0
before: 0, 0, 2, 31, 38, 28, 18, 12, 17, 20, 24, 10, 3
after:  0, 0, 55, 247, 455, 470, 412, 181, 36, 0, 0, 0, 0
, i.e. from 4s to 250ms.

1 SAS HDD SEAGATE ST14000NM0048
before: 0, 0, 29, 70, 107, 45, 27, 1, 0, 0, 1, 4, 19
after:  1, 29, 681, 1261, 676, 1633, 67, 1, 0, 0, 0, 0, 0
, i.e. from 4s to 125ms.

1 SAS SSD SEAGATE XS3840TE70014
before (microseconds): 0, 0, 0, 0, 0, 0, 0, 0, 70, 18343, 82548, 618
after:                 0, 0, 0, 0, 0, 0, 0, 0, 283, 92351, 34844, 90

I've also measured scrub time during the test and on idle pools. On an idle fragmented pool, scrub got a few percent faster due to using QD3 instead of the previous QD2. On an idle non-fragmented pool I measured no difference. On a busy non-fragmented pool I measured a scrub time increase of about 1.5-1.7x, while the IOPS increase reached 5-9x.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Matthew Ahrens <[email protected]>
Reviewed-by: Ryan Moeller <[email protected]>
Signed-off-by: Alexander Motin <[email protected]>
Sponsored-By: iXsystems, Inc.
Closes #11166
1 parent f67bebb commit 6f5aac3

File tree: 3 files changed, +145 -18 lines changed

include/sys/vdev_impl.h

Lines changed: 3 additions & 0 deletions

@@ -165,6 +165,9 @@ struct vdev_queue {
 	avl_tree_t	vq_write_offset_tree;
 	avl_tree_t	vq_trim_offset_tree;
 	uint64_t	vq_last_offset;
+	zio_priority_t	vq_last_prio;	/* Last sent I/O priority. */
+	uint32_t	vq_ia_active;	/* Active interactive I/Os. */
+	uint32_t	vq_nia_credit;	/* Non-interactive I/Os credit. */
 	hrtime_t	vq_io_complete_ts; /* time last i/o completed */
 	hrtime_t	vq_io_delta_ts;
 	zio_t		vq_io_search; /* used as local for stack reduction */

man/man5/zfs-module-parameters.5

Lines changed: 37 additions & 2 deletions

@@ -2029,8 +2029,7 @@ Default value: \fB1\fR.
 .ad
 .RS 12n
 The maximum number of I/Os active to each device. Ideally, this will be >=
-the sum of each queue's max_active. It must be at least the sum of each
-queue's min_active. See the section "ZFS I/O SCHEDULER".
+the sum of each queue's max_active. See the section "ZFS I/O SCHEDULER".
 .sp
 Default value: \fB1,000\fR.
 .RE
@@ -2179,6 +2178,42 @@ See the section "ZFS I/O SCHEDULER".
 Default value: \fB1\fR.
 .RE

+.sp
+.ne 2
+.na
+\fBzfs_vdev_nia_delay\fR (int)
+.ad
+.RS 12n
+For non-interactive I/O (scrub, resilver, removal, initialize and rebuild),
+the number of concurrently-active I/O's is limited to *_min_active, unless
+the vdev is "idle". When there are no interactive I/Os active (sync or
+async), and zfs_vdev_nia_delay I/Os have completed since the last
+interactive I/O, then the vdev is considered to be "idle", and the number
+of concurrently-active non-interactive I/O's is increased to *_max_active.
+See the section "ZFS I/O SCHEDULER".
+.sp
+Default value: \fB5\fR.
+.RE
+
+.sp
+.ne 2
+.na
+\fBzfs_vdev_nia_credit\fR (int)
+.ad
+.RS 12n
+Some HDDs tend to prioritize sequential I/O so high, that concurrent
+random I/O latency reaches several seconds. On some HDDs it happens
+even if sequential I/Os are submitted one at a time, and so setting
+*_max_active to 1 does not help. To prevent non-interactive I/Os, like
+scrub, from monopolizing the device no more than zfs_vdev_nia_credit
+I/Os can be sent while there are outstanding incomplete interactive
+I/Os. This enforced wait ensures the HDD services the interactive I/O
+within a reasonable amount of time.
+See the section "ZFS I/O SCHEDULER".
+.sp
+Default value: \fB5\fR.
+.RE
+
 .sp
 .ne 2
 .na

module/zfs/vdev_queue.c

Lines changed: 105 additions & 16 deletions

@@ -121,16 +121,17 @@

 /*
  * The maximum number of i/os active to each device. Ideally, this will be >=
- * the sum of each queue's max_active. It must be at least the sum of each
- * queue's min_active.
+ * the sum of each queue's max_active.
  */
 uint32_t zfs_vdev_max_active = 1000;

 /*
  * Per-queue limits on the number of i/os active to each device. If the
  * number of active i/os is < zfs_vdev_max_active, then the min_active comes
- * into play. We will send min_active from each queue, and then select from
- * queues in the order defined by zio_priority_t.
+ * into play. We will send min_active from each queue round-robin, and then
+ * send from queues in the order defined by zio_priority_t up to max_active.
+ * Some queues have additional mechanisms to limit number of active I/Os in
+ * addition to min_active and max_active, see below.
  *
  * In general, smaller max_active's will lead to lower latency of synchronous
  * operations. Larger max_active's may lead to higher overall throughput,
@@ -151,7 +152,7 @@ uint32_t zfs_vdev_async_read_max_active = 3;
 uint32_t zfs_vdev_async_write_min_active = 2;
 uint32_t zfs_vdev_async_write_max_active = 10;
 uint32_t zfs_vdev_scrub_min_active = 1;
-uint32_t zfs_vdev_scrub_max_active = 2;
+uint32_t zfs_vdev_scrub_max_active = 3;
 uint32_t zfs_vdev_removal_min_active = 1;
 uint32_t zfs_vdev_removal_max_active = 2;
 uint32_t zfs_vdev_initializing_min_active = 1;
@@ -171,6 +172,28 @@ uint32_t zfs_vdev_rebuild_max_active = 3;
 int zfs_vdev_async_write_active_min_dirty_percent = 30;
 int zfs_vdev_async_write_active_max_dirty_percent = 60;

+/*
+ * For non-interactive I/O (scrub, resilver, removal, initialize and rebuild),
+ * the number of concurrently-active I/O's is limited to *_min_active, unless
+ * the vdev is "idle". When there are no interactive I/Os active (sync or
+ * async), and zfs_vdev_nia_delay I/Os have completed since the last
+ * interactive I/O, then the vdev is considered to be "idle", and the number
+ * of concurrently-active non-interactive I/O's is increased to *_max_active.
+ */
+uint_t zfs_vdev_nia_delay = 5;
+
+/*
+ * Some HDDs tend to prioritize sequential I/O so high that concurrent
+ * random I/O latency reaches several seconds. On some HDDs it happens
+ * even if sequential I/Os are submitted one at a time, and so setting
+ * *_max_active to 1 does not help. To prevent non-interactive I/Os, like
+ * scrub, from monopolizing the device no more than zfs_vdev_nia_credit
+ * I/Os can be sent while there are outstanding incomplete interactive
+ * I/Os. This enforced wait ensures the HDD services the interactive I/O
+ * within a reasonable amount of time.
+ */
+uint_t zfs_vdev_nia_credit = 5;
+
 /*
  * To reduce IOPs, we aggregate small adjacent I/Os into one large I/O.
  * For read I/Os, we also aggregate across small adjacency gaps; for writes
@@ -261,7 +284,7 @@ vdev_queue_timestamp_compare(const void *x1, const void *x2)
 }

 static int
-vdev_queue_class_min_active(zio_priority_t p)
+vdev_queue_class_min_active(vdev_queue_t *vq, zio_priority_t p)
 {
 	switch (p) {
 	case ZIO_PRIORITY_SYNC_READ:
@@ -273,15 +296,19 @@ vdev_queue_class_min_active(zio_priority_t p)
 	case ZIO_PRIORITY_ASYNC_WRITE:
 		return (zfs_vdev_async_write_min_active);
 	case ZIO_PRIORITY_SCRUB:
-		return (zfs_vdev_scrub_min_active);
+		return (vq->vq_ia_active == 0 ? zfs_vdev_scrub_min_active :
+		    MIN(vq->vq_nia_credit, zfs_vdev_scrub_min_active));
 	case ZIO_PRIORITY_REMOVAL:
-		return (zfs_vdev_removal_min_active);
+		return (vq->vq_ia_active == 0 ? zfs_vdev_removal_min_active :
+		    MIN(vq->vq_nia_credit, zfs_vdev_removal_min_active));
 	case ZIO_PRIORITY_INITIALIZING:
-		return (zfs_vdev_initializing_min_active);
+		return (vq->vq_ia_active == 0 ?zfs_vdev_initializing_min_active:
+		    MIN(vq->vq_nia_credit, zfs_vdev_initializing_min_active));
 	case ZIO_PRIORITY_TRIM:
 		return (zfs_vdev_trim_min_active);
 	case ZIO_PRIORITY_REBUILD:
-		return (zfs_vdev_rebuild_min_active);
+		return (vq->vq_ia_active == 0 ? zfs_vdev_rebuild_min_active :
+		    MIN(vq->vq_nia_credit, zfs_vdev_rebuild_min_active));
 	default:
 		panic("invalid priority %u", p);
 		return (0);
@@ -337,7 +364,7 @@ vdev_queue_max_async_writes(spa_t *spa)
 }

 static int
-vdev_queue_class_max_active(spa_t *spa, zio_priority_t p)
+vdev_queue_class_max_active(spa_t *spa, vdev_queue_t *vq, zio_priority_t p)
 {
 	switch (p) {
 	case ZIO_PRIORITY_SYNC_READ:
@@ -349,14 +376,34 @@ vdev_queue_class_max_active(spa_t *spa, zio_priority_t p)
 	case ZIO_PRIORITY_ASYNC_WRITE:
 		return (vdev_queue_max_async_writes(spa));
 	case ZIO_PRIORITY_SCRUB:
+		if (vq->vq_ia_active > 0) {
+			return (MIN(vq->vq_nia_credit,
+			    zfs_vdev_scrub_min_active));
+		} else if (vq->vq_nia_credit < zfs_vdev_nia_delay)
+			return (zfs_vdev_scrub_min_active);
 		return (zfs_vdev_scrub_max_active);
 	case ZIO_PRIORITY_REMOVAL:
+		if (vq->vq_ia_active > 0) {
+			return (MIN(vq->vq_nia_credit,
+			    zfs_vdev_removal_min_active));
+		} else if (vq->vq_nia_credit < zfs_vdev_nia_delay)
+			return (zfs_vdev_removal_min_active);
 		return (zfs_vdev_removal_max_active);
 	case ZIO_PRIORITY_INITIALIZING:
+		if (vq->vq_ia_active > 0) {
+			return (MIN(vq->vq_nia_credit,
+			    zfs_vdev_initializing_min_active));
+		} else if (vq->vq_nia_credit < zfs_vdev_nia_delay)
+			return (zfs_vdev_initializing_min_active);
 		return (zfs_vdev_initializing_max_active);
 	case ZIO_PRIORITY_TRIM:
 		return (zfs_vdev_trim_max_active);
 	case ZIO_PRIORITY_REBUILD:
+		if (vq->vq_ia_active > 0) {
+			return (MIN(vq->vq_nia_credit,
+			    zfs_vdev_rebuild_min_active));
+		} else if (vq->vq_nia_credit < zfs_vdev_nia_delay)
+			return (zfs_vdev_rebuild_min_active);
 		return (zfs_vdev_rebuild_max_active);
 	default:
 		panic("invalid priority %u", p);
@@ -372,17 +419,24 @@ static zio_priority_t
 vdev_queue_class_to_issue(vdev_queue_t *vq)
 {
 	spa_t *spa = vq->vq_vdev->vdev_spa;
-	zio_priority_t p;
+	zio_priority_t p, n;

 	if (avl_numnodes(&vq->vq_active_tree) >= zfs_vdev_max_active)
 		return (ZIO_PRIORITY_NUM_QUEUEABLE);

-	/* find a queue that has not reached its minimum # outstanding i/os */
-	for (p = 0; p < ZIO_PRIORITY_NUM_QUEUEABLE; p++) {
+	/*
+	 * Find a queue that has not reached its minimum # outstanding i/os.
+	 * Do round-robin to reduce starvation due to zfs_vdev_max_active
+	 * and vq_nia_credit limits.
+	 */
+	for (n = 0; n < ZIO_PRIORITY_NUM_QUEUEABLE; n++) {
+		p = (vq->vq_last_prio + n + 1) % ZIO_PRIORITY_NUM_QUEUEABLE;
 		if (avl_numnodes(vdev_queue_class_tree(vq, p)) > 0 &&
 		    vq->vq_class[p].vqc_active <
-		    vdev_queue_class_min_active(p))
+		    vdev_queue_class_min_active(vq, p)) {
+			vq->vq_last_prio = p;
 			return (p);
+		}
 	}

 	/*
@@ -392,8 +446,10 @@ vdev_queue_class_to_issue(vdev_queue_t *vq)
 	for (p = 0; p < ZIO_PRIORITY_NUM_QUEUEABLE; p++) {
 		if (avl_numnodes(vdev_queue_class_tree(vq, p)) > 0 &&
 		    vq->vq_class[p].vqc_active <
-		    vdev_queue_class_max_active(spa, p))
+		    vdev_queue_class_max_active(spa, vq, p)) {
+			vq->vq_last_prio = p;
 			return (p);
+		}
 	}

 	/* No eligible queued i/os */
@@ -493,6 +549,20 @@ vdev_queue_io_remove(vdev_queue_t *vq, zio_t *zio)
 	}
 }

+static boolean_t
+vdev_queue_is_interactive(zio_priority_t p)
+{
+	switch (p) {
+	case ZIO_PRIORITY_SCRUB:
+	case ZIO_PRIORITY_REMOVAL:
+	case ZIO_PRIORITY_INITIALIZING:
+	case ZIO_PRIORITY_REBUILD:
+		return (B_FALSE);
+	default:
+		return (B_TRUE);
+	}
+}
+
 static void
 vdev_queue_pending_add(vdev_queue_t *vq, zio_t *zio)
 {
@@ -502,6 +572,12 @@ vdev_queue_pending_add(vdev_queue_t *vq, zio_t *zio)
 	ASSERT(MUTEX_HELD(&vq->vq_lock));
 	ASSERT3U(zio->io_priority, <, ZIO_PRIORITY_NUM_QUEUEABLE);
 	vq->vq_class[zio->io_priority].vqc_active++;
+	if (vdev_queue_is_interactive(zio->io_priority)) {
+		if (++vq->vq_ia_active == 1)
+			vq->vq_nia_credit = 1;
+	} else if (vq->vq_ia_active > 0) {
+		vq->vq_nia_credit--;
+	}
 	avl_add(&vq->vq_active_tree, zio);

 	if (shk->kstat != NULL) {
@@ -520,6 +596,13 @@ vdev_queue_pending_remove(vdev_queue_t *vq, zio_t *zio)
 	ASSERT(MUTEX_HELD(&vq->vq_lock));
 	ASSERT3U(zio->io_priority, <, ZIO_PRIORITY_NUM_QUEUEABLE);
 	vq->vq_class[zio->io_priority].vqc_active--;
+	if (vdev_queue_is_interactive(zio->io_priority)) {
+		if (--vq->vq_ia_active == 0)
+			vq->vq_nia_credit = 0;
+		else
+			vq->vq_nia_credit = zfs_vdev_nia_credit;
+	} else if (vq->vq_ia_active == 0)
+		vq->vq_nia_credit++;
 	avl_remove(&vq->vq_active_tree, zio);

 	if (shk->kstat != NULL) {
@@ -1072,6 +1155,12 @@ ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, rebuild_max_active, INT, ZMOD_RW,
 ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, rebuild_min_active, INT, ZMOD_RW,
 	"Min active rebuild I/Os per vdev");

+ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, nia_credit, INT, ZMOD_RW,
+	"Number of non-interactive I/Os to allow in sequence");
+
+ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, nia_delay, INT, ZMOD_RW,
+	"Number of non-interactive I/Os before _max_active");
+
 ZFS_MODULE_PARAM(zfs_vdev, zfs_vdev_, queue_depth_pct, INT, ZMOD_RW,
 	"Queue depth percentage for each top-level vdev");
 /* END CSTYLED */
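The round-robin pass that the reworked vdev_queue_class_to_issue() adds can be illustrated with a simplified standalone function (hypothetical C; arrays replace the AVL trees and per-class counters of the real code):

```c
#include <assert.h>

#define NUM_PRIO 6  /* stand-in for ZIO_PRIORITY_NUM_QUEUEABLE */

/*
 * Scan priority classes starting just after the last issued priority,
 * wrapping around, and pick the first class that has queued I/Os and has
 * not yet reached its minimum active count. Starting after last_prio
 * (rather than always at 0) keeps low-priority classes from being starved
 * of their min_active slots by zfs_vdev_max_active or vq_nia_credit limits.
 */
static int
pick_min_active_class(int last_prio, const int queued[NUM_PRIO],
    const int active[NUM_PRIO], const int min_active[NUM_PRIO])
{
	for (int n = 0; n < NUM_PRIO; n++) {
		int p = (last_prio + n + 1) % NUM_PRIO;
		if (queued[p] > 0 && active[p] < min_active[p])
			return (p);
	}
	return (-1);	/* no class is under its minimum */
}
```

For example, with every class queued and at a minimum of 1, a scan after issuing priority 0 picks priority 1 next rather than re-checking priority 0 first, which is exactly the anti-starvation behavior described in the commit message.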
