Description
Describe the feature you would like to see added to OpenZFS
The traversal code in traverse_visitbp() visits blocks recursively. An indirect (non-L0) block of 128K can contain 1024 block pointers of 128 bytes each. In a full traverse, or in an incremental traverse where all blocks were modified, the code can therefore traverse a large number of blocks pointed to by a single indirect block. The traversal code issues prefetches for the blocks it will traverse below an indirect block, which can result in a large number of async reads being queued on the vdev queue.
The proposal is to account for the prefetches issued for blocks pointed to by an indirect block and to limit the maximum number of prefetches issued in one go.
Module Param:
zfs_traverse_indirect_prefetch_limit: Limit on the number of prefetch IOs issued in one go while traversing blocks pointed to by an indirect (non-L0) block.
Local counters:
prefetched: Local counter for the number of prefetch IOs issued for blocks pointed to by the indirect block.
prefetchidx: Index of the next block pointed to by the indirect block for which a prefetch IO is to be issued.
prefetchtriggeridx: Index at which the next set of prefetch IOs is to be triggered while traversing the blocks pointed to by the indirect block.
The basic logic is to maintain the prefetch counters described above. After each set of prefetches issued in one go, prefetchtriggeridx is set somewhere in the middle of the blocks just prefetched. This ensures that prefetching of the next set of blocks is triggered ahead of time and has a sufficient time window before the demand read for those blocks is issued.
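The counter logic above could be sketched roughly as follows. This is a standalone simulation, not a patch against traverse_visitbp(): issue_prefetch(), visit_children(), prefetch_calls and max_ahead are hypothetical names used only for illustration, the default limit of 32 is an arbitrary assumption, and placing the trigger at exactly half of the last batch is just one reading of "somewhere in the middle".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the proposed module parameter (value assumed). */
static uint32_t zfs_traverse_indirect_prefetch_limit = 32;

/* Bookkeeping for this sketch only; not part of the proposal. */
static uint32_t prefetch_calls;	/* total prefetches issued */
static uint32_t max_ahead;	/* peak count of prefetched, unvisited children */

/* Stand-in for issuing an async prefetch read for child idx. */
static void
issue_prefetch(uint32_t idx)
{
	(void) idx;
	prefetch_calls++;
}

/*
 * Walk the epb children of one indirect block. Prefetches are issued in
 * batches of at most zfs_traverse_indirect_prefetch_limit, and the next
 * batch is triggered once the walk reaches the middle of the previous one.
 */
static void
visit_children(uint32_t epb)
{
	uint32_t limit = zfs_traverse_indirect_prefetch_limit;
	uint32_t prefetchidx = 0;		/* next child to prefetch */
	uint32_t prefetchtriggeridx = 0;	/* child at which to prefetch again */

	prefetch_calls = 0;
	max_ahead = 0;

	for (uint32_t i = 0; i < epb; i++) {
		if (i >= prefetchtriggeridx && prefetchidx < epb) {
			uint32_t start = prefetchidx;
			uint32_t prefetched = 0;

			while (prefetchidx < epb && prefetched < limit) {
				issue_prefetch(prefetchidx++);
				prefetched++;
			}
			/* Re-arm the trigger in the middle of this batch. */
			prefetchtriggeridx = start + prefetched / 2;
		}

		/* With a nonzero limit, child i was prefetched before its visit. */
		if (limit > 0)
			assert(prefetchidx > i);
		if (prefetchidx > i && prefetchidx - i > max_ahead)
			max_ahead = prefetchidx - i;

		/* The recursive visit (demand read) of child i would go here. */
	}
}
```

In this sketch, calling visit_children(1024) with the assumed limit of 32 still prefetches all 1024 children, but never has more than 48 prefetched children waiting to be visited (one full batch plus the unvisited half of the previous one), instead of up to 1024 at once.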
How will this feature improve OpenZFS?
It optimizes the prefetch logic in the traversal code, avoiding the unnecessary bursts of prefetch IOs that are possible in workloads with large files or directories. This avoids excessive queuing of prefetch IOs on the vdev queues and excessive caching of prefetch buffers in the ARC. In effect, it would help other IO-sensitive primary workloads running on the system.
Additional context
The optimization would apply to the following code block in traverse_visitbp():
static int
traverse_visitbp(traverse_data_t *td, const dnode_phys_t *dnp,
    const blkptr_t *bp, const zbookmark_phys_t *zb)
{
.....
	if (BP_GET_LEVEL(bp) > 0) {
....
		/* recursively visitbp() blocks below this */
		for (i = 0; i < epb; i++) {
			SET_BOOKMARK(czb, zb->zb_objset, zb->zb_object,
			    zb->zb_level - 1,
			    zb->zb_blkid * epb + i);
			err = traverse_visitbp(td, dnp,
			    &((blkptr_t *)buf->b_data)[i], czb);
			if (err != 0)
				break;
		}

		kmem_free(czb, sizeof (zbookmark_phys_t));

	}