FreeBSD: zfs_putpages: don't undirty pages until after write completes #17445


Closed
robn wants to merge 2 commits

Conversation

robn
Member

@robn robn commented Jun 9, 2025

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

While working on #17398, I noticed that FreeBSD's zfs_putpages() immediately undirties pages, rather than waiting for confirmation that they have been written out. In theory, this could lose a write to an mmap()'d region if the pool suspends before the write reaches disk and the page is then evicted.

This PR should take care of it. #17398 will need to update it further, but I think this can stand on its own.

Description

zfs_putpages() would put the entire range of pages onto the ZIL, then return VM_PAGER_OK for each page to the kernel. However, an associated zil_commit() or txg sync had not happened at this point, so the write may not actually be on disk.

So, we rework it to use a ZIL commit callback, and do the post-write work of undirtying the page and signaling completion there. We return VM_PAGER_PEND to the kernel instead so it knows that we will take care of it.

[removed after review discussion] We change from a single `zfs_log_write()` covering the whole range to multiple calls, one per page. This is because if `zfs_log_write()` receives a write too large to fit into a single ZIL block, it splits it into multiple writes, but each receives the same callback and argument. The callback would then be called multiple times, but we wouldn't know which pages/range each call was for, and so couldn't clean them up. This approach is not ideal, but it is simple and correct, and so is a good start.

Further work

Right now I want to do enough to ensure correctness once #17398 lands and we can respond to a writeback error.

Later, I would like to look at bringing the async/sync stuff over from the Linux side to help with msync() performance (possibly merging those, since the logic is complicated). That can wait though; I'd like to avoid the additional complexity for now.

How Has This Been Tested?

mmap tests run successfully on 14.2p1.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Contributor

@markjdb markjdb left a comment


This broadly looks good to me; I had one question inline.

@amotin
Member

amotin commented Jun 9, 2025

We change from a single zfs_log_write() to cover the whole range, to multiple calls, each for a single page. This is because if zfs_log_write() receives a write too large to fit into a single ZIL block, it will split it into multiple writes, but each will receive the same callback and argument. This will lead to the callback being called multiple times, but we don't know which pages/range the call is for, and so can't clean them up. This way is not ideal, but is simple and correct, and so is a good start.

Could you explain why it is correct and more than a coincidence? What should protect single-page writes from splitting? Looking into zil_lwb_assign(), I see it calls zil_itx_clone() when it needs to split a WR_NEED_COPY record, which leaves the callback only on the last chunk. Any reason why zfs_log_write() should not do the same for WR_COPIED?

@robn
Member Author

robn commented Jun 9, 2025

Could you explain why it is correct and more than a coincidence? What should protect single-page writes from splitting? Looking into zil_lwb_assign(), I see it calls zil_itx_clone() when it needs to split a WR_NEED_COPY record, which leaves the callback only on the last chunk. Any reason why zfs_log_write() should not do the same for WR_COPIED?

@amotin yes, I think coincidence too. I hadn't seen zil_itx_clone(), and now that I have, I think yes - zfs_log_write() should be doing the same thing. I haven't looked closely yet, but is there a good reason for zfs_log_write() to do the splitting itself, vs just leaving it for zil_lwb_assign() to do if and when it needs to actually write it?

@robn robn force-pushed the freebsd-putpage-async branch from eeea246 to 0d603d1 on June 9, 2025 at 23:43
@amotin
Member

amotin commented Jun 10, 2025

I haven't looked closely yet, but is there a good reason for zfs_log_write() to do the splitting itself, vs just leaving it for zil_lwb_assign() to do if and when it needs to actually write it?

I'd say it happened historically. The lwb code was not supposed to modify any records. But because write records could be too big, it ended up filling large records itself to avoid memory copies, and since it had to do that anyway, splitting was easy to add there. Smaller COPIED records it just copies as-is, even though it could also be made to split them if needed. And zfs_log_write() has to handle splitting anyway to handle indirect records, which are limited to one block. I was thinking of redesigning it at some point, but never got to it.

@robn
Member Author

robn commented Jun 10, 2025

@amotin here's the alternative version, only putting the callback on the last itx: robn/zfs:freebsd-putpage-async-split.

It's unclear whether or not it's "better" in practice. The mmap_seek_001_pos test is a really good driver for this, as it regularly generates 16-32 itxes for a single zfs_log_write() call. Is one callback and object-lock cycle, but with an extra memory alloc/free, better or worse than 32 calls and no allocation? Is having to wait for them all meaningful? My gut feeling says batching is rarely a losing prospect, so without a bunch of profiling I'm leaning towards this version.

Thoughts?

(either way, I should push the zfs_log_write() change, as multiple callbacks for a single write is obviously wrong).

@amotin
Member

amotin commented Jun 11, 2025

here's the alternative version, only putting the callback on the last itx: robn/zfs:freebsd-putpage-async-split.

I like this alternative one better, especially the zfs_log_write() part. Just for allocation I would probably use something like offsetof(putpage_commit_arg_t, pca_pages[ncount]) (it might be a GCC'ism, but it works with Clang too), and in zfs_log_write() I would do something like:

		if (resid == len) {
			itx->itx_callback = callback;
			itx->itx_callback_data = callback_data;
		}

instead of tying the different iterations together.

It's unclear whether or not it's "better" in practice. The mmap_seek_001_pos test is a really good driver for this, as it regularly generates 16-32 itxes for a single zfs_log_write() call. Is one callback and object-lock cycle, but with an extra memory alloc/free, better or worse than 32 calls and no allocation? Is having to wait for them all meaningful? My gut feeling says batching is rarely a losing prospect, so without a bunch of profiling I'm leaning towards this version.

I was not so much thinking about one vs many callbacks, as one bigger log record vs bunch of tiny ones. Many log records would also require many allocations.

@robn robn force-pushed the freebsd-putpage-async branch from eeebf9f to 824eca9 on June 11, 2025 at 05:43
@robn
Member Author

robn commented Jun 11, 2025

I like this alternative one better

Ok, great! That's two votes to zero. Pushed!

I would do something like

Both suggestions taken. The offsetof thing is an interesting technique, which I will mull on further. I think I like it!

I was not so much thinking about one vs many callbacks, as one bigger log record vs bunch of tiny ones. Many log records would also require many allocations.

Ahh, yeah. In these test cases it did split it up that hard, but there's no particular reason it should. Thanks for explaining!

Contributor

@markjdb markjdb left a comment


The FreeBSD bits look ok to me.

robn added 2 commits June 12, 2025 07:54
If a write is split across multiple itxs, we only want the callback on
the last one; otherwise it will be called for every itx associated with
this single write, which makes it very hard to know what to clean up.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
zfs_putpages() would put the entire range of pages onto the ZIL, then
return VM_PAGER_OK for each page to the kernel. However, an associated
zil_commit() or txg sync had not happened at this point, so the write
may not actually be on disk.

So, we rework it to use a ZIL commit callback, and do the post-write
work of undirtying the page and signaling completion there. We return
VM_PAGER_PEND to the kernel instead so it knows that we will take care
of it.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
@robn robn force-pushed the freebsd-putpage-async branch from 824eca9 to 2da30d0 on June 11, 2025 at 21:55
@amotin amotin added the "Status: Accepted" (ready to integrate: reviewed, tested) label on Jun 12, 2025
behlendorf pushed a commit that referenced this pull request Jun 12, 2025
zfs_putpages() would put the entire range of pages onto the ZIL, then
return VM_PAGER_OK for each page to the kernel. However, an associated
zil_commit() or txg sync had not happened at this point, so the write
may not actually be on disk.

So, we rework it to use a ZIL commit callback, and do the post-write
work of undirtying the page and signaling completion there. We return
VM_PAGER_PEND to the kernel instead so it knows that we will take care
of it.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Mark Johnston <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #17445
@markjdb
Contributor

markjdb commented Jun 26, 2025

Unfortunately, this commit hurts performance a lot. When building a kernel, for instance, the callback to unbusy pages isn't invoked until TXG flush time, and while the sbusy lock is held nothing can modify the page. That leads to a ton of blocking in the page fault handler.

If I modify zfs_putpages() to unconditionally commit the ZIL, then things look better, but I think there's still a slowdown (still testing).

While I think the commit's goal is correct for synchronous writes (in which case we indeed call zil_commit()), I'm not sure about async writes. There, I think the most we want to do is make sure that the contents of the pages are copied into the DMU. I believe it's ok to undirty the page immediately after that point.
