Description
Describe the feature you would like to see added to OpenZFS
Copying large sparse files with BRT should take time proportional to the number of allocated blocks rather than time proportional to the number of records in the file (i.e. apparent size).
On my system, copying a 128G file with one 4k block of data takes about 4 seconds:
$ dd if=/dev/random of=x seek=32M count=1 bs=4k
$ zpool sync
$ time cp --reflink=always x x.2
real 0m3.851s
user 0m0.000s
sys 0m3.823s
And a 512G file takes about 4x longer:
$ dd if=/dev/random of=x seek=128M count=1 bs=4k
$ zpool sync
$ time cp --reflink=always x x.2
real 0m15.071s
user 0m0.001s
sys 0m15.036s
Lastly, a 2T file takes about 4x longer again:
$ dd if=/dev/random of=x seek=512M count=1 bs=4k
$ zpool sync
$ time cp --reflink=always x x.2
real 1m2.359s
user 0m0.000s
sys 1m2.337s
This is Fedora 39, kernel 6.6.8-200.fc39.x86_64, running 8f2f6cd (non-DEBUG) with a default-created pool (128k recordsize) on a single consumer NVMe vdev.
(I'm filing this as a feature because the performance is reasonable enough, but some might argue it is a regression versus the existing behavior of copying with sparse-aware tools.)
How will this feature improve OpenZFS?
While block cloning sparse files works reasonably well, ZFS can easily create petabyte-sized sparse files, and recent cp can copy them efficiently with SEEK_DATA. Ideally, block cloning should have similar performance.
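For illustration, here is a minimal userspace sketch of the SEEK_DATA/SEEK_HOLE walk that sparse-aware tools such as cp rely on; it visits only allocated extents, so its cost scales with the amount of data rather than the apparent size. This is just a simplified example with minimal error handling, not part of the proposal itself.

/*
 * Sketch: walk only the allocated extents of a sparse file using
 * lseek(SEEK_DATA)/lseek(SEEK_HOLE).  A real copy tool would read
 * (or clone) each extent it finds instead of just printing it.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return (1);
	}
	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return (1);
	}
	off_t data = 0;
	for (;;) {
		/* Jump to the start of the next allocated extent. */
		data = lseek(fd, data, SEEK_DATA);
		if (data < 0)
			break;	/* ENXIO: no data past this offset. */
		/* Find where that extent ends. */
		off_t hole = lseek(fd, data, SEEK_HOLE);
		printf("data: %lld..%lld\n", (long long)data, (long long)hole);
		/* A copy would transfer [data, hole) here. */
		data = hole;
	}
	close(fd);
	return (0);
}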
This would help users with very large, sparsely allocated files such as VM images or large sparse array files. Also, users who try to reflink extremely large, mostly empty sparse files might otherwise be surprised by linear copy performance compared to how cp behaved before reflinking was added to ZFS.
Additional context
See dnode_next_offset for a possible implementation, though that function doesn't correctly handle traversing files with dirty blocks. It might be possible to query dbuf state for the range of in-memory dbufs to see whether the cloned range is dirty before attempting cloning, or to use something like dnode_is_dirty as a workaround.
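To make that concrete, below is a rough, hypothetical sketch of how a clone loop could skip unallocated ranges with dnode_next_offset(), in the spirit of dmu_offset_next(). The helper name skip_holes_in_clone_range() and the dirty-dnode handling are assumptions for illustration, not the proposed implementation; locking, error handling, and the BRT bookkeeping are omitted.

/*
 * Hypothetical sketch against OpenZFS internals: advance *offp to the
 * next allocated L0 block in [*offp, end), or to end if none remain.
 */
#include <sys/dnode.h>
#include <sys/dmu.h>

static int
skip_holes_in_clone_range(dnode_t *dn, uint64_t *offp, uint64_t end)
{
	uint64_t off = *offp;
	int err;

	/*
	 * dnode_next_offset() walks the indirect block tree, so finding
	 * the next data block costs roughly O(tree depth) rather than
	 * O(number of records).  It doesn't see dirty, not-yet-synced
	 * blocks, so a dirty dnode would have to be handled separately
	 * (wait for the txg to sync, or fall back to the per-record
	 * walk), similar to how dnode_is_dirty() is used elsewhere.
	 */
	if (dnode_is_dirty(dn))
		return (SET_ERROR(EBUSY));

	err = dnode_next_offset(dn, 0 /* find data, forward */, &off, 1, 1, 0);
	if (err == ESRCH || off >= end) {
		/* No more allocated blocks in the requested range. */
		*offp = end;
		return (0);
	}
	if (err != 0)
		return (err);

	*offp = off;	/* Resume cloning at the next allocated block. */
	return (0);
}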