-
Notifications
You must be signed in to change notification settings - Fork 1.9k
FreeBSD: zfs_putpages: don't undirty pages until after write completes #17445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This broadly looks good to me, I had one question inline.
Could you explain why it is correct and more than a coincidence? What should protect single-page writes from splitting? Looking into |
@amotin yes, I think coincidence too. I hadn't seen |
eeea246
to
0d603d1
Compare
I'd say historically happened. It was not supposed for lwb code to modify any records. But due to write records being too big it got to fill large records to avoid memory copies. And since it has to do it, then splitting is easy to do. For smaller COPIED records it just copies records as-is, even though could also be made to split if needed. And zfs_log_write has to handle splitting any way to handle indirect records, which are limited by one block. I was thinking to redesign it at some point, but never got to it. |
0d603d1
to
0759e5e
Compare
0759e5e
to
eeebf9f
Compare
@amotin here's the alternative version, only putting the callback on the last itx: robn/zfs:freebsd-putpage-async-split. It's unclear whether or not its "better" in practice. The Thoughts? (either way, I should push the |
I like this alternative one better, especially the
instead of tying the different iterations together.
I was not so much thinking about one vs many callbacks, as one bigger log record vs bunch of tiny ones. Many log records would also require many allocations. |
eeebf9f
to
824eca9
Compare
Ok, great! That's two votes to zero. Pushed!
Both suggestions taken. The
Ahh, yeah. In these test cases it did split it up that hard, but there's no particular reason it should. Thanks for explaining! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The FreeBSD bits look ok to me.
If a write is split across mutliple itxs, we only want the callback on the last one, otherwise it will be called for every itx associated with this single write, which makes it very hard to know what to clean up. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
zfs_putpages() would put the entire range of pages onto the ZIL, then return VM_PAGER_OK for each page to the kernel. However, an associated zil_commit() or txg sync had not happened at this point, so the write may not actually be on disk. So, we rework it to use a ZIL commit callback, and do the post-write work of undirtying the page and signaling completion there. We return VM_PAGER_PEND to the kernel instead so it knows that we will take care of it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
824eca9
to
2da30d0
Compare
zfs_putpages() would put the entire range of pages onto the ZIL, then return VM_PAGER_OK for each page to the kernel. However, an associated zil_commit() or txg sync had not happened at this point, so the write may not actually be on disk. So, we rework it to use a ZIL commit callback, and do the post-write work of undirtying the page and signaling completion there. We return VM_PAGER_PEND to the kernel instead so it knows that we will take care of it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Mark Johnston <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #17445
If a write is split across mutliple itxs, we only want the callback on the last one, otherwise it will be called for every itx associated with this single write, which makes it very hard to know what to clean up. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Mark Johnston <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17445
zfs_putpages() would put the entire range of pages onto the ZIL, then return VM_PAGER_OK for each page to the kernel. However, an associated zil_commit() or txg sync had not happened at this point, so the write may not actually be on disk. So, we rework it to use a ZIL commit callback, and do the post-write work of undirtying the page and signaling completion there. We return VM_PAGER_PEND to the kernel instead so it knows that we will take care of it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Mark Johnston <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17445
If a write is split across mutliple itxs, we only want the callback on the last one, otherwise it will be called for every itx associated with this single write, which makes it very hard to know what to clean up. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Mark Johnston <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #17445
zfs_putpages() would put the entire range of pages onto the ZIL, then return VM_PAGER_OK for each page to the kernel. However, an associated zil_commit() or txg sync had not happened at this point, so the write may not actually be on disk. So, we rework it to use a ZIL commit callback, and do the post-write work of undirtying the page and signaling completion there. We return VM_PAGER_PEND to the kernel instead so it knows that we will take care of it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Mark Johnston <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #17445
If a write is split across mutliple itxs, we only want the callback on the last one, otherwise it will be called for every itx associated with this single write, which makes it very hard to know what to clean up. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Mark Johnston <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17445
zfs_putpages() would put the entire range of pages onto the ZIL, then return VM_PAGER_OK for each page to the kernel. However, an associated zil_commit() or txg sync had not happened at this point, so the write may not actually be on disk. So, we rework it to use a ZIL commit callback, and do the post-write work of undirtying the page and signaling completion there. We return VM_PAGER_PEND to the kernel instead so it knows that we will take care of it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Mark Johnston <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17445
Unfortunately, this commit hurts performance a lot. When building a kernel, for instance, the callback to unbusy pages isn't invoked until TXG flush time, and while the sbusy lock is held nothing can modify the page. That leads to a ton of blocking in the page fault handler. If I modify zfs_putpages() to unconditionally commit the ZIL, then things look better, but I think there's still a slowdown (still testing). While I think the commit's goal is correct for synchronous writes (in which case we indeed call zil_commit()), I'm not sure about async writes. There, I think the most we want to do is make sure that the contents of the pages are copied into the DMU. I believe it's ok to undirty the page immediately after that point. |
[Sponsors: Klara, Inc., Wasabi Technology, Inc.]
Motivation and Context
While working on #17398, I noticed that FreeBSD's
zfs_putpage
immediately undirties the page, rather than waiting for confirmation that it has been written out. In theory, this could lead to a write to anmmap()'d
region being lost if the pool suspends before it could be written out, and then the page is evicted.This PR should take care of it. #17398 will need to update it further, but I think this can stand on its own.
Description
zfs_putpages()
would put the entire range of pages onto the ZIL, then returnVM_PAGER_OK
for each page to the kernel. However, an associatedzil_commit()
or txg sync had not happened at this point, so the write may not actually be on disk.So, we rework it to use a ZIL commit callback, and do the post-write work of undirtying the page and signaling completion there. We return
VM_PAGER_PEND
to the kernel instead so it knows that we will take care of it.[removed after review discussion]
We change from a single `zfs_log_write()` to cover the whole range, to multiple calls, each for a single page. This is because if `zfs_log_write()` receives a write too large to fit into a single ZIL block, it will split it into multiple writes, but each will receive the same callback and argument. This will lead to the callback being called multiple times, but we don't which pages/range the call is for, and so can't clean them up. This way is not ideal, but is simple and correct, and so is a good start.Further work
Right now I want to do enough to ensure correctness once #17398 lands and we can respond to a writeback error.
Later, I would like to look at bringing the async/sync stuff over from the Linux side to help with
msync()
performance (possibly merging those, since the logic is complicated). That can wait though; I'd like to avoid the additional complexity for now.How Has This Been Tested?
mmap
tests run successfully on 14.2p1.Types of changes
Checklist:
Signed-off-by
.