Tighten up 'zpool remove' space check #11409

Closed

Conversation

behlendorf
Contributor

Motivation and Context

Issue #11356. Handle additional cases where device removal should
be possible but is prevented by an overly aggressive free space check.

Description

The available space check in spa_vdev_remove_top_check() is intended
to verify there is enough available space on the other devices before
starting the removal process. This is obviously a good idea but the
current check can significantly overestimate the space requirements
and often prevent removal when it clearly should be possible.

This change reworks the check to use the per-vdev vs_stats. This
is sufficiently accurate and is convenient because it allows a direct
comparison to the allocated space. If dsl_dir_space_available() were
used instead, the available space of the device being removed would
also be included in the return value and would have to be accounted
for separately.
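
A minimal sketch of the kind of per-vdev comparison described above
(illustrative only; vd is the top-level vdev being removed, rvd is
spa->spa_root_vdev, and allocation classes and indirect/hole vdevs are
ignored for brevity; the actual patch may differ):

	uint64_t avail_elsewhere = 0;

	for (uint64_t c = 0; c < rvd->vdev_children; c++) {
		vdev_t *cvd = rvd->vdev_child[c];

		if (cvd == vd)
			continue;
		/* Free space remaining on the other top-level vdevs. */
		avail_elsewhere += cvd->vdev_stat.vs_dspace -
		    cvd->vdev_stat.vs_alloc;
	}

	/* Everything allocated on vd must fit elsewhere, plus slop. */
	if (avail_elsewhere < vd->vdev_stat.vs_alloc +
	    spa_get_slop_space(spa))
		return (SET_ERROR(ENOSPC));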

Additionally, we reduce the slop space requirement to be in line
with the capacity of the pool after the device has been removed.
This way if a large device is accidentally added to a small pool the
vastly increased slop requirement of the larger pool won't prevent
removal. For example, it was previously possible that even newly
added empty vdevs couldn't be removed due to the increased slop
requirement. This was particularly unfortunate since one of the
main motivations for this feature was to allow such mistakes to be
corrected.

Lastly, it's worth mentioning that by allowing the removal to start
with close to the minimum required free space, it is more likely that
an active pool close to capacity may fail the removal process. This
failure case has always been possible since we can't know what new
data will be written during the removal process. It is correctly
handled and there's no danger to the pool.

How Has This Been Tested?

Added a new remove_unbalanced ZTS test which verifies that when
a new large vdev (1TB) is added to an existing small pool (1GB)
it can still be removed as long as it hasn't been used. This
previously failed.

Additionally, locally ran the "removal" ZTS test group.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Dec 29, 2020
@behlendorf behlendorf requested a review from ahrens December 29, 2020 01:13
@ikozhukhov
Contributor

We can see issues on DilOS with this use case:

root@lenovo:/var/tmp/space# cat space.sh 
#!/bin/sh

# Size of files for a test pool (MB)
FSIZE=256

# vdevs
FILE1=/var/tmp/file01.data
FILE2=/var/tmp/file02.data
FILE3=/var/tmp/file03.data
FILE4=/var/tmp/file04.data

POOL=testpool

# Create two files for the test pool
truncate -s ${FSIZE}m ${FILE1} ${FILE2}

# Create the pool
zpool create -o ashift=12 -o cachefile=none -O compression=on -O atime=off \
        ${POOL} mirror ${FILE1} ${FILE2}

# Create a dataset
zfs create -o recordsize=1m -o compression=on ${POOL}/01

# Prepare compressible contents
dd if=/dev/zero of=/${POOL}/01/f01.tmp bs=1M count=$((FSIZE*16)) status=progress

# Create VDEVs for another mirror
truncate -s ${FSIZE}m ${FILE3} ${FILE4}

# Add them to the test pool
zpool add ${POOL} mirror ${FILE3} ${FILE4}

# Try to remove first mirror
zpool remove ${POOL} mirror-0
root@lenovo:/var/tmp/space# sh -x space.sh 
+ FSIZE=256
+ FILE1=/var/tmp/file01.data
+ FILE2=/var/tmp/file02.data
+ FILE3=/var/tmp/file03.data
+ FILE4=/var/tmp/file04.data
+ POOL=testpool
+ truncate -s 256m /var/tmp/file01.data /var/tmp/file02.data
+ zpool create -o ashift=12 -o cachefile=none -O compression=on -O atime=off testpool mirror /var/tmp/file01.data /var/tmp/file02.data
+ zfs create -o recordsize=1m -o compression=on testpool/01
+ dd if=/dev/zero of=/testpool/01/f01.tmp bs=1M count=4096 status=progress
3883925504 bytes (3.9 GB, 3.6 GiB) copied, 10 s, 388 MB/s
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 10.9779 s, 391 MB/s
+ truncate -s 256m /var/tmp/file03.data /var/tmp/file04.data
+ zpool add testpool mirror /var/tmp/file03.data /var/tmp/file04.data
+ zpool remove testpool mirror-0
cannot remove mirror-0: out of space

@ikozhukhov
Contributor

It is not an issue with this commit; it is an issue with the "out of space" check itself.
@behlendorf could you please try to run the test in the comment above?

@behlendorf
Contributor Author

@ikozhukhov your exact test case still fails, but it's due to the very small size of the vdevs (256M) and the comparatively large spa_min_slop, which is set to 128M. Even with the updated threshold, a minimum of 2*slop + allocated space needs to be available, and there's not quite enough. Increasing the vdev size in your test to even 300M is enough to allow it to pass; similarly, decreasing spa_min_slop would achieve the same effect. In practice, this shouldn't be an issue for any non-test vdev.
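
Rough numbers for the 256M case (illustrative only, ignoring label and metadata overhead):

	required  >= 2 * slop + allocated  =  2 * 128M + allocated  >=  256M
	available  = usable space on the remaining 256M mirror       <   256M

so the check cannot pass; at 300M per vdev the available space edges past the requirement.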

@behlendorf
Contributor Author

@ikozhukhov I've updated the PR to also address your original test case with very small vdevs. This behavior was confusing for users, particularly those who just wanted to test out device removal and would immediately hit this issue.

@behlendorf behlendorf marked this pull request as ready for review December 30, 2020 22:20
@ahrens
Member

ahrens commented Jan 4, 2021

This change reworks the check to use the per-vdev vs_stats.

It looks like the new check ensures that at the time the removal starts, there is enough free space on the remaining devices to hold all the space allocated on the device to remove. However, the free space on the device to be removed may also be necessary to satisfy [ref]reservations. Therefore I think that after the removal you could end up with more space "in use" (including reservations) than there is in the pool. The existing check (based on the DMU's view of space) doesn't have this problem.

Additionally, we reduce the slop space requirement to be in line with
the capacity of the pool after the device has been removed.

Sounds good.

This failure case [an active pool close to capacity may fail the removal process] has always been possible since we can't know what new data will be written during the removal process.

I don't think that's the case. The current code makes the space in the removing device unavailable for use by the DMU, so concurrent write system calls can't cause the removal to fail. Instead, they will (correctly) fail with ENOSPC. See spa_update_dspace():

	if (spa->spa_vdev_removal != NULL) {
		/*
		 * We can't allocate from the removing device, so subtract
		 * its size if it was included in dspace (i.e. if this is a
		 * normal-class vdev, not special/dedup).  This prevents the
		 * DMU/DSL from filling up the (now smaller) pool while we
		 * are in the middle of removing the device.
		 *
		 * Note that the DMU/DSL doesn't actually know or care
		 * how much space is allocated (it does its own tracking
		 * of how much space has been logically used).  So it
		 * doesn't matter that the data we are moving may be
		 * allocated twice (on the old device and the new
		 * device).
		 */

[failing to allocate space for a removal] is correctly handled and there's no danger to the pool.

Where does that happen? I think that you'd hit this assertion in spa_vdev_copy_impl():

			/*
			 * The minimum-size allocation can not fail.
			 */
			ASSERT3U(attempted, >, 1 << spa->spa_max_ashift);

As mentioned above, this shouldn't happen because the DMU would not allow write() system calls to use space on the removing device (and the slop space is large enough for any MOS writes).

@behlendorf behlendorf force-pushed the issue-11356 branch 2 times, most recently from 13cda05 to 22cd16d Compare January 4, 2021 22:57
@behlendorf
Contributor Author

However, the free space on the device to be removed may also be necessary to satisfy [ref]reservations.

That's a good point I hadn't considered. I've restored the use of dsl_dir_space_available() and updated the comment to mention that this will account for reserved space. I've also further reduced the slop requirement for pools where the minimum slop space exceeds a quarter of the total pool capacity. This was the cleanest way I could find to handle very small pools, which is where this issue is most often reported.
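
A minimal sketch of the slop cap described above (illustrative only; spa_get_slop_space() and spa_get_dspace() are the standard accessors, but the exact form in the patch may differ):

	/*
	 * Cap the slop requirement so that spa_min_slop (128M by default)
	 * cannot exceed a quarter of the total pool capacity on very
	 * small pools.
	 */
	uint64_t slop = spa_get_slop_space(spa);
	slop = MIN(slop, spa_get_dspace(spa) >> 2);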

This failure case [an active pool close to capacity may fail the removal process] has always been possible since we can't know what new data will be written during the removal process.
I don't think that's the case

Makes sense. I've removed that comment from the commit message. Thanks for pointing me to the relevant code.

@behlendorf behlendorf force-pushed the issue-11356 branch 2 times, most recently from 12ceac9 to 5d01f56 Compare January 9, 2021 00:50
* pool free, in addition to the normal slop space).
*/

if (available - (available >> 5) < vd->vdev_stat.vs_dspace)
Member

I don't think this is calculating what the comment says. "leave 3% of the space available" to me means that after the removal, AVAIL (dsl_dir_space_available(root_dir)) should be at least 3% (1/2^5) of the size of the pool after the removal.

For example, imagine we have a 100TB pool, are removing a 1TB device, and have 1.04TB AVAIL. The code here would allow the removal to proceed (1.04 - (1.04 >> 5) ≈ 1.0075TB, which is not less than 1TB). After the removal there would be only ~0.04TB AVAIL, but 3% of 99TB ≈ 3TB.

Contributor Author

Indeed. I've updated the calculation accordingly. It now checks that the available space after the removal isn't less than spa_get_dspace() >> 6, i.e. we always leave at least ~1.6% of the pool's normal class available. So for your example:

1.04TB AVAIL - 1.00TB = 0.04TB after removal, which is less than 100TB >> 6 ≈ 1.6TB.


	if (available < vd->vdev_stat.vs_dspace ||
	    available - vd->vdev_stat.vs_dspace <
	    spa_get_dspace(spa) >> 6) {
Member

@ahrens ahrens Jan 14, 2021

I don't think this fixes the problem that removing a disproportionately-huge device can get ENOSPC when we want it to succeed. e.g. imagine a pool with a 1TB device and a 100TB device, with 0.5TB used. If we try to remove the 100TB device, the check would be 100.5 - 100 < 101 * 0.016, or 0.5 < 1.6, which is true. I think we want to require that after removal, there is 1.6% space AVAIL.
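
Illustratively, the requirement being suggested might look like the following (not actual patch code; available and vd are as in the snippet above):

	uint64_t postremove_avail = available - vd->vdev_stat.vs_dspace;
	uint64_t postremove_size = spa_get_dspace(spa) -
	    vd->vdev_stat.vs_dspace;

	/* Require ~1.6% of the post-removal pool to remain AVAIL. */
	if (available < vd->vdev_stat.vs_dspace ||
	    postremove_avail < postremove_size >> 6)
		return (SET_ERROR(ENOSPC));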

While you're right, @ahrens, I'm on an old 2.0.4 version and currently unwilling to upgrade. If nothing has changed in this regard since 2.0.4, then a major tightening up would be massively beneficial, and/or a way to override this with -f when the user does indeed know better. I will tag you both where I record my story from the last 2 days.

@behlendorf behlendorf added Status: Work in Progress Not yet ready for general review and removed Status: Code Review Needed Ready for review and testing labels Feb 8, 2021
The available space check in spa_vdev_remove_top_check() is intended
to verify there is enough available space on the other devices before
starting the removal process.  This is obviously a good idea but the
current check can significantly overestimate the space requirements
for small pools, preventing removal when it clearly should be possible.

This change reworks the check to drop the additional slop space from
the calculation.  This solves two problems:

1) If a large device is accidentally added to a small pool the slop
   requirement is dramatically increased.  This in turn can prevent
   removal of that larger device even if it has never been used.
   This was particularly unfortunate since a main motivation for
   this feature was to allow such mistakes to be corrected.

2) For very small pools with 256M vdevs the minimum slop size of
   128M represents a huge fraction of the total pool capacity.
   Requiring twice this amount of space to be available effectively
   prevents device removal from ever working with these tiny pools.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#11356
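
For readers following the thread, a minimal sketch of the kind of check this commit message describes (illustrative only; it reuses dsl_dir_space_available() as discussed earlier and is not necessarily the code as merged):

	/*
	 * dsl_dir_space_available() reflects the DSL's view of free space
	 * (including reservations), but it also counts the free space on
	 * the vdev being removed, so back that portion out before
	 * comparing against the space allocated on that vdev.  No extra
	 * slop term is added.
	 */
	uint64_t available = dsl_dir_space_available(
	    spa->spa_dsl_pool->dp_root_dir, NULL, 0, B_TRUE);
	uint64_t vd_free = vd->vdev_stat.vs_dspace - vd->vdev_stat.vs_alloc;

	if (available < vd_free ||
	    available - vd_free < vd->vdev_stat.vs_alloc)
		return (SET_ERROR(ENOSPC));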