FreeBSD: zfs_putpages: don't undirty pages until after write completes #17445


Closed
robn wants to merge 2 commits

Conversation

robn
Member

@robn robn commented Jun 9, 2025

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

While working on #17398, I noticed that FreeBSD's zfs_putpages() immediately undirties pages, rather than waiting for confirmation that they have been written out. In theory, this could lose a write to an mmap()'d region if the pool suspends before the write reaches disk and the page is then evicted.

This PR should take care of it. #17398 will need to update it further, but I think this can stand on its own.

Description

zfs_putpages() would put the entire range of pages onto the ZIL, then return VM_PAGER_OK for each page to the kernel. However, an associated zil_commit() or txg sync had not happened at this point, so the write may not actually be on disk.

So, we rework it to use a ZIL commit callback, and do the post-write work of undirtying the page and signaling completion there. We return VM_PAGER_PEND to the kernel instead so it knows that we will take care of it.

[removed after review discussion] We change from a single `zfs_log_write()` covering the whole range to multiple calls, one per page. This is because if `zfs_log_write()` receives a write too large to fit into a single ZIL block, it splits it into multiple writes, but each receives the same callback and argument. The callback would then be called multiple times, but we wouldn't know which pages/range each call was for, and so couldn't clean them up. This approach is not ideal, but it is simple and correct, and so is a good start.

Further work

Right now I want to do enough to ensure correctness once #17398 lands and we can respond to a writeback error.

Later, I would like to look at bringing the async/sync stuff over from the Linux side to help with msync() performance (possibly merging those, since the logic is complicated). That can wait though; I'd like to avoid the additional complexity for now.

How Has This Been Tested?

mmap tests run successfully on 14.2p1.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Contributor

@markjdb markjdb left a comment


This broadly looks good to me; I had one question inline.

@amotin
Member

amotin commented Jun 9, 2025

We change from a single zfs_log_write() to cover the whole range, to multiple calls, each for a single page. This is because if zfs_log_write() receives a write too large to fit into a single ZIL block, it will split it into multiple writes, but each will receive the same callback and argument. This will lead to the callback being called multiple times, but we don't know which pages/range the call is for, and so can't clean them up. This way is not ideal, but is simple and correct, and so is a good start.

Could you explain why it is correct and more than a coincidence? What should protect single-page writes from splitting? Looking into zil_lwb_assign(), I see it calls zil_itx_clone() when it needs to split a WR_NEED_COPY record, which leaves the callback only on the last chunk. Any reason why zfs_log_write() should not do the same for WR_COPIED?

@robn
Member Author

robn commented Jun 9, 2025

Could you explain why it is correct and more than a coincidence? What should protect single-page writes from splitting? Looking into zil_lwb_assign(), I see it calls zil_itx_clone() when it needs to split a WR_NEED_COPY record, which leaves the callback only on the last chunk. Any reason why zfs_log_write() should not do the same for WR_COPIED?

@amotin yes, I think coincidence too. I hadn't seen zil_itx_clone(), and now that I have, I think yes - zfs_log_write() should be doing the same thing. I haven't looked closely yet, but is there a good reason for zfs_log_write() to do the splitting itself, vs just leaving it for zil_lwb_assign() to do if and when it needs to actually write it?

@robn robn force-pushed the freebsd-putpage-async branch from eeea246 to 0d603d1 on June 9, 2025 at 23:43
@amotin
Member

amotin commented Jun 10, 2025

I haven't looked closely yet, but is there a good reason for zfs_log_write() to do the splitting itself, vs just leaving it for zil_lwb_assign() to do if and when it needs to actually write it?

I'd say it happened historically. The lwb code was not supposed to modify any records. But because write records could be too big, it ended up filling large records itself to avoid memory copies, and since it had to do that anyway, splitting was easy to add there. Smaller COPIED records it just copies as-is, even though it could also be made to split them if needed. And zfs_log_write() has to handle splitting anyway to handle indirect records, which are limited to one block. I was thinking of redesigning it at some point, but never got to it.

@robn
Member Author

robn commented Jun 10, 2025

@amotin here's the alternative version, only putting the callback on the last itx: robn/zfs:freebsd-putpage-async-split.

It's unclear whether or not it's "better" in practice. The mmap_seek_001_pos test is a really good driver for this, as it regularly generates 16-32 itxes for a single zfs_log_write() call. Is one callback and object-lock cycle, but with an extra memory alloc/free, better or worse than 32 calls and no allocation? Is having to wait for them all meaningful? My gut feeling says batching is rarely a losing prospect, so without a bunch of profiling I'm leaning towards this version.

Thoughts?

(either way, I should push the zfs_log_write() change, as multiple callbacks for a single write is obviously wrong).

@amotin
Member

amotin commented Jun 11, 2025

here's the alternative version, only putting the callback on the last itx: robn/zfs:freebsd-putpage-async-split.

I like this alternative one better, especially the zfs_log_write() part. Just for allocation I would probably use something like offsetof(putpage_commit_arg_t, pca_pages[ncount]) (it might be a GCC'ism, but it works with Clang too), and in zfs_log_write() I would do something like:

		if (resid == len) {
			itx->itx_callback = callback;
			itx->itx_callback_data = callback_data;
		}

instead of tying the different iterations together.

It's unclear whether or not it's "better" in practice. The mmap_seek_001_pos test is a really good driver for this, as it regularly generates 16-32 itxes for a single zfs_log_write() call. Is one callback and object-lock cycle, but with an extra memory alloc/free, better or worse than 32 calls and no allocation? Is having to wait for them all meaningful? My gut feeling says batching is rarely a losing prospect, so without a bunch of profiling I'm leaning towards this version.

I was not so much thinking about one vs many callbacks, as one bigger log record vs bunch of tiny ones. Many log records would also require many allocations.

@robn robn force-pushed the freebsd-putpage-async branch from eeebf9f to 824eca9 on June 11, 2025 at 05:43
@robn
Member Author

robn commented Jun 11, 2025

I like this alternative one better

Ok, great! That's two votes to zero. Pushed!

I would do something like

Both suggestions taken. The offsetof thing is an interesting technique, which I will mull on further. I think I like it!

I was not so much thinking about one vs many callbacks, as one bigger log record vs bunch of tiny ones. Many log records would also require many allocations.

Ahh, yeah. In these test cases it did split it up that hard, but there's no particular reason it should. Thanks for explaining!

Contributor

@markjdb markjdb left a comment


The FreeBSD bits look ok to me.

robn added 2 commits June 12, 2025 07:54
If a write is split across multiple itxs, we only want the callback on
the last one; otherwise it will be called for every itx associated with
this single write, which makes it very hard to know what to clean up.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
zfs_putpages() would put the entire range of pages onto the ZIL, then
return VM_PAGER_OK for each page to the kernel. However, an associated
zil_commit() or txg sync had not happened at this point, so the write
may not actually be on disk.

So, we rework it to use a ZIL commit callback, and do the post-write
work of undirtying the page and signaling completion there. We return
VM_PAGER_PEND to the kernel instead so it knows that we will take care
of it.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
@robn robn force-pushed the freebsd-putpage-async branch from 824eca9 to 2da30d0 on June 11, 2025 at 21:55
@amotin amotin added the "Status: Accepted" (ready to integrate: reviewed, tested) label on Jun 12, 2025
behlendorf pushed a commit that referenced this pull request Jun 12, 2025
zfs_putpages() would put the entire range of pages onto the ZIL, then
return VM_PAGER_OK for each page to the kernel. However, an associated
zil_commit() or txg sync had not happened at this point, so the write
may not actually be on disk.

So, we rework it to use a ZIL commit callback, and do the post-write
work of undirtying the page and signaling completion there. We return
VM_PAGER_PEND to the kernel instead so it knows that we will take care
of it.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Mark Johnston <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #17445
@markjdb
Contributor

markjdb commented Jun 26, 2025

Unfortunately, this commit hurts performance a lot. When building a kernel, for instance, the callback to unbusy pages isn't invoked until TXG flush time, and while the sbusy lock is held nothing can modify the page. That leads to a ton of blocking in the page fault handler.

If I modify zfs_putpages() to unconditionally commit the ZIL, then things look better, but I think there's still a slowdown (still testing).

While I think the commit's goal is correct for synchronous writes (in which case we indeed call zil_commit()), I'm not sure about async writes. There, I think the most we want to do is make sure that the contents of the pages are copied into the DMU. I believe it's ok to undirty the page immediately after that point.
