Skip to content

syncfs: don't erroneously return success when the pool is suspended #17420

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

robn
Copy link
Member

@robn robn commented Jun 4, 2025

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

See #17416. This PR is tackling the core problem described: syncfs() can return success even when the pool has failed. There's a lot more we can do but this should address correctness, which is more important than anything else.

I'm hesitant to say this closes #17416, since there's more we could do in there to improve the situation. On the other hand, it straight-up says "this is wrong" and this PR un-wrongs it, so idk, reviewer's call :)

Description

See individual commits, described below.

  1. Add a test case to demonstrate the issue, and verify the fix. We don't generally care what syncfs() returns when the pool is up, but we want to be sure that if the pool suspends, it either blocks until it returns, or returns an error. (we do care, of course, but it's not something we can address here; I'm working towards that in ZIL: "crash" the ZIL if the pool suspends during fallback #17398 and later).

  2. Simplify zfs_sync() by removing code for scenarios that can't occur, that is, called with NULL superblock and called with no ZIL. See syncfs() returns success on suspended pools #17416 for more on this.

  3. Change the suspended check to return EIO if the pool suspends. This is nod back to the original reason for this check, which is to not block on suspended pools during shutdown (see discussion in syncfs() returns success on suspended pools #17416). This will likely change in ZIL: "crash" the ZIL if the pool suspends during fallback #17398 once the ZIL can report a failure directly without blocking, but until then, this seems like a reasonable middle road to not block the call.

  4. Ensure safe error behaviour according to what syncfs() can do. The comment explains more, but the short version is that before 5.17 the kernel ignores errors directly returned by sync_fs(), but a workaround is available back to 5.8. If the workaround is not available either, then the call will return success to userspace, and we have no choice but to block until the txg completes. Unamazing, but ensures correctness.

How Has This Been Tested?

The new syncfs_suspend test passes on 5.4.293 (blocks, no fallbacks available) and 6.1.140 (sync_fs return works, immediate error return).

It's difficult to test the superblock writeback error case on a release kernel, since the LTS kernels in the 5.8-5.17 range that have no other options all have the fix backported, and non-LTS 5.x kernels are difficult to build these days. To check, I used 6.1.140, commented out the version check and forced the returned error to 0, leaving the superblock writeback error as the only remaining error state. This went as expected - syncfs() return an error on a suspended pool.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@robn robn changed the title `syncfs: don't erroneously return success when the pool is suspended syncfs: don't erroneously return success when the pool is suspended Jun 4, 2025
Copy link
Contributor

@pcd1193182 pcd1193182 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The solution is sort of gross, but so it the problem space. I can't think of a better approach than this one, given the constraints we have. I would say you could close the github issue, though I would consider opening a new one specifically for "We can't tell if the system is shutting down in the syncfs() path"

@amotin amotin added the Status: Code Review Needed Ready for review and testing label Jun 4, 2025
@robn robn force-pushed the syncfs-suspend branch 3 times, most recently from 4b13dd1 to 6876437 Compare June 11, 2025 06:02
robn added 4 commits June 12, 2025 08:09
Fairly coarse, but if it returns while the pool suspends, it must be
with an error.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
The superblock pointer will always be set, as will z_log, so remove code
supporting cases that can't occur (on Linux at least).

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
If the pool is suspended, we'll just block in zil_commit(). If the
system is shutting down, blocking wouldn't help anyone. So, we should
keep this test for now, but at least return an error for anyone who is
actually interested.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
If the kernel will honour our error returns, use them. If not, fool it
by setting a writeback error on the superblock, if available.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <[email protected]>
@robn robn force-pushed the syncfs-suspend branch from 6876437 to 9415646 Compare June 11, 2025 22:10
@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Jun 11, 2025
behlendorf pushed a commit that referenced this pull request Jun 12, 2025
The superblock pointer will always be set, as will z_log, so remove code
supporting cases that can't occur (on Linux at least).

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #17420
behlendorf pushed a commit that referenced this pull request Jun 12, 2025
If the pool is suspended, we'll just block in zil_commit(). If the
system is shutting down, blocking wouldn't help anyone. So, we should
keep this test for now, but at least return an error for anyone who is
actually interested.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #17420
behlendorf pushed a commit that referenced this pull request Jun 12, 2025
If the kernel will honour our error returns, use them. If not, fool it
by setting a writeback error on the superblock, if available.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #17420
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jun 13, 2025
Fairly coarse, but if it returns while the pool suspends, it must be
with an error.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17420
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jun 13, 2025
The superblock pointer will always be set, as will z_log, so remove code
supporting cases that can't occur (on Linux at least).

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17420
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jun 13, 2025
If the pool is suspended, we'll just block in zil_commit(). If the
system is shutting down, blocking wouldn't help anyone. So, we should
keep this test for now, but at least return an error for anyone who is
actually interested.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17420
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jun 13, 2025
If the kernel will honour our error returns, use them. If not, fool it
by setting a writeback error on the superblock, if available.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17420
behlendorf pushed a commit that referenced this pull request Jun 17, 2025
Fairly coarse, but if it returns while the pool suspends, it must be
with an error.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #17420
behlendorf pushed a commit that referenced this pull request Jun 17, 2025
The superblock pointer will always be set, as will z_log, so remove code
supporting cases that can't occur (on Linux at least).

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #17420
behlendorf pushed a commit that referenced this pull request Jun 17, 2025
If the pool is suspended, we'll just block in zil_commit(). If the
system is shutting down, blocking wouldn't help anyone. So, we should
keep this test for now, but at least return an error for anyone who is
actually interested.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #17420
behlendorf pushed a commit that referenced this pull request Jun 17, 2025
If the kernel will honour our error returns, use them. If not, fool it
by setting a writeback error on the superblock, if available.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes #17420
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jun 17, 2025
Fairly coarse, but if it returns while the pool suspends, it must be
with an error.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17420
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jun 17, 2025
The superblock pointer will always be set, as will z_log, so remove code
supporting cases that can't occur (on Linux at least).

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17420
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jun 17, 2025
If the pool is suspended, we'll just block in zil_commit(). If the
system is shutting down, blocking wouldn't help anyone. So, we should
keep this test for now, but at least return an error for anyone who is
actually interested.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17420
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Jun 17, 2025
If the kernel will honour our error returns, use them. If not, fool it
by setting a writeback error on the superblock, if available.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Paul Dagnelie <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17420
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

syncfs() returns success on suspended pools
4 participants