Tighten up 'zpool remove' space check #11409

Closed

Conversation

behlendorf
Contributor

Motivation and Context

Issue #11356. Handle additional cases where device removal should
be possible but is prevented by an overly aggressive free space check.

Description

The available space check in spa_vdev_remove_top_check() is intended
to verify there is enough available space on the other devices before
starting the removal process. This is obviously a good idea but the
current check can significantly overestimate the space requirements
and often prevent removal when it clearly should be possible.

This change reworks the check to use the per-vdev vs_stats. This
is sufficiently accurate and is convenient because it allows a direct
comparison to the allocated space. If dsl_dir_space_available() were
used instead, the available space of the device being removed would
also be included in the return value and would have to be accounted
for separately.
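
A minimal sketch of the kind of per-vdev comparison described above
(illustrative only; vd is the top-level vdev being removed, rvd is
spa->spa_root_vdev, and allocation classes and indirect/hole vdevs are
ignored for brevity; the actual patch may differ):

	uint64_t avail_elsewhere = 0;

	for (uint64_t c = 0; c < rvd->vdev_children; c++) {
		vdev_t *cvd = rvd->vdev_child[c];

		if (cvd == vd)
			continue;
		/* Free space remaining on the other top-level vdevs. */
		avail_elsewhere += cvd->vdev_stat.vs_dspace -
		    cvd->vdev_stat.vs_alloc;
	}

	/* Everything allocated on vd must fit elsewhere, plus slop. */
	if (avail_elsewhere < vd->vdev_stat.vs_alloc +
	    spa_get_slop_space(spa))
		return (SET_ERROR(ENOSPC));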

Additionally, we reduce the slop space requirement to be in line
with the capacity of the pool after the device has been removed.
This way if a large device is accidentally added to a small pool the
vastly increased slop requirement of the larger pool won't prevent
removal. For example, it was previously possible that even newly
added empty vdevs couldn't be removed due to the increased slop
requirement. This was particularly unfortunate since one of the
main motivations for this feature was to allow such mistakes to be
corrected.

Lastly, it's worth mentioning that by allowing the removal to start
with close to the minimum required free space, it is more likely that
an active pool close to capacity may fail the removal process. This
failure case has always been possible since we can't know what new
data will be written during the removal process. It is correctly
handled and there's no danger to the pool.

How Has This Been Tested?

Added a new remove_unbalanced ZTS test which verifies that when
a new large vdev (1TB) is added to an existing small pool (1GB)
it can still be removed as long as it hasn't been used. This
previously failed.

Additionally, locally ran the "removal" ZTS test group.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Dec 29, 2020
@behlendorf behlendorf requested a review from ahrens December 29, 2020 01:13
@ikozhukhov
Contributor

We can see issues on DilOS with this use case:

root@lenovo:/var/tmp/space# cat space.sh 
#!/bin/sh

# Size of files for a test pool (MB)
FSIZE=256

# vdevs
FILE1=/var/tmp/file01.data
FILE2=/var/tmp/file02.data
FILE3=/var/tmp/file03.data
FILE4=/var/tmp/file04.data

POOL=testpool

# Create two files for the test pool
truncate -s ${FSIZE}m ${FILE1} ${FILE2}

# Create the pool
zpool create -o ashift=12 -o cachefile=none -O compression=on -O atime=off \
        ${POOL} mirror ${FILE1} ${FILE2}

# Create a dataset
zfs create -o recordsize=1m -o compression=on ${POOL}/01

# Prepare compressible contents
dd if=/dev/zero of=/${POOL}/01/f01.tmp bs=1M count=$((FSIZE*16)) status=progress

# Create VDEVs for another mirror
truncate -s ${FSIZE}m ${FILE3} ${FILE4}

# Add them to the test pool
zpool add ${POOL} mirror ${FILE3} ${FILE4}

# Try to remove first mirror
zpool remove ${POOL} mirror-0
root@lenovo:/var/tmp/space# sh -x space.sh 
+ FSIZE=256
+ FILE1=/var/tmp/file01.data
+ FILE2=/var/tmp/file02.data
+ FILE3=/var/tmp/file03.data
+ FILE4=/var/tmp/file04.data
+ POOL=testpool
+ truncate -s 256m /var/tmp/file01.data /var/tmp/file02.data
+ zpool create -o ashift=12 -o cachefile=none -O compression=on -O atime=off testpool mirror /var/tmp/file01.data /var/tmp/file02.data
+ zfs create -o recordsize=1m -o compression=on testpool/01
+ dd if=/dev/zero of=/testpool/01/f01.tmp bs=1M count=4096 status=progress
3883925504 bytes (3.9 GB, 3.6 GiB) copied, 10 s, 388 MB/s
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 10.9779 s, 391 MB/s
+ truncate -s 256m /var/tmp/file03.data /var/tmp/file04.data
+ zpool add testpool mirror /var/tmp/file03.data /var/tmp/file04.data
+ zpool remove testpool mirror-0
cannot remove mirror-0: out of space

@ikozhukhov
Contributor

It is not an issue with this commit; it is an issue with the "out of space" check itself.
@behlendorf could you please try to run the test in the comment above?

@behlendorf
Contributor Author

@ikozhukhov your exact test case still fails, but it's due to the very small size of the vdevs (256M) and the comparatively large spa_min_slop, which is set to 128M. Even with the updated threshold, a minimum of 2*slop + allocated space needs to be available, and there's not quite enough. Increasing the vdev size in your test to even 300M is enough to allow it to pass; similarly, decreasing spa_min_slop would achieve the same effect. In practice, this shouldn't be an issue for any non-test vdev.
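
Rough numbers for the 256M case (illustrative only, ignoring label and metadata overhead):

	required  >= 2 * slop + allocated  =  2 * 128M + allocated  >=  256M
	available  = usable space on the remaining 256M mirror       <   256M

so the check cannot pass; at 300M per vdev the available space edges past the requirement.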

@behlendorf
Contributor Author

@ikozhukhov I've updated the PR to also address your original test case with very small vdevs. This behavior was confusing for users, particularly those who just wanted to test out device removal and would immediately hit this issue.

@behlendorf behlendorf marked this pull request as ready for review December 30, 2020 22:20
@ahrens
Member

ahrens commented Jan 4, 2021

This change reworks the check to use the per-vdev vs_stats.

It looks like the new check ensures that at the time the removal starts, there is enough free space on the remaining devices to hold all the space allocated on the device to remove. However, the free space on the device to be removed may also be necessary to satisfy [ref]reservations. Therefore I think that after the removal you could end up with more space "in use" (including reservations) than there is in the pool. The existing check (based on the DMU's view of space) doesn't have this problem.

Additionally, we reduce the slop space requirement to be in line with
the capacity of the pool after the device has been removed.

Sounds good.

This failure case [an active pool close to capacity may fail the removal process] has always been possible since we can't know what new data will be written during the removal process.

I don't think that's the case. The current code makes the space in the removing device unavailable for use by the DMU, so concurrent write system calls can't cause the removal to fail. Instead, they will (correctly) fail with ENOSPC. See spa_update_dspace():

	if (spa->spa_vdev_removal != NULL) {
		/*
		 * We can't allocate from the removing device, so subtract
		 * its size if it was included in dspace (i.e. if this is a
		 * normal-class vdev, not special/dedup).  This prevents the
		 * DMU/DSL from filling up the (now smaller) pool while we
		 * are in the middle of removing the device.
		 *
		 * Note that the DMU/DSL doesn't actually know or care
		 * how much space is allocated (it does its own tracking
		 * of how much space has been logically used).  So it
		 * doesn't matter that the data we are moving may be
		 * allocated twice (on the old device and the new
		 * device).
		 */

[failing to allocate space for a removal] is correctly handled and there's no danger to the pool.

Where does that happen? I think that you'd hit this assertion in spa_vdev_copy_impl():

			/*
			 * The minimum-size allocation can not fail.
			 */
			ASSERT3U(attempted, >, 1 << spa->spa_max_ashift);

As mentioned above, this shouldn't happen because the DMU would not allow write() system calls to use space on the removing device (and the slop space is large enough for any MOS writes).

@behlendorf behlendorf force-pushed the issue-11356 branch 2 times, most recently from 13cda05 to 22cd16d Compare January 4, 2021 22:57
@behlendorf
Contributor Author

However, the free space on the device to be removed may also be necessary to satisfy [ref]reservations.

That's a good point I hadn't considered. I've restored the use of dsl_dir_space_available() and updated the comment to mention that this will account for reserved space. I've also further reduced the slop requirement for pools where the minimum slop space exceeds a quarter of the total pool capacity. This was the cleanest way I could find to handle very small pools, which is where this issue is most often reported.
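
A minimal sketch of the slop cap described above (illustrative only; spa_get_slop_space() and spa_get_dspace() are the standard accessors, but the exact form in the patch may differ):

	/*
	 * Cap the slop requirement so that spa_min_slop (128M by default)
	 * cannot exceed a quarter of the total pool capacity on very
	 * small pools.
	 */
	uint64_t slop = spa_get_slop_space(spa);
	slop = MIN(slop, spa_get_dspace(spa) >> 2);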

This failure case [an active pool close to capacity may fail the removal process] has always been possible since we can't know what new data will be written during the removal process.
I don't think that's the case

Makes sense. I've removed that comment from the commit message. Thanks for pointing me to the relevant code.

@behlendorf behlendorf force-pushed the issue-11356 branch 2 times, most recently from 12ceac9 to 5d01f56 Compare January 9, 2021 00:50
* pool free, in addition to the normal slop space).
*/

if (available - (available >> 5) < vd->vdev_stat.vs_dspace)
Member

I don't think this is calculating what the comment says. "leave 3% of the space available" to me means that after the removal, AVAIL (dsl_dir_space_available(root_dir)) should be at least 3% (1/2^5) of the size of the pool after the removal.

For example, imagine we have a 100TB pool, are removing a 1TB device, and have 1.04TB AVAIL. The code here would allow the removal to proceed (1.04 - (1.04 >> 5) ≈ 1.0075TB, which is not less than 1TB). After the removal there would be only ~0.04TB AVAIL, but 3% of 99TB ≈ 3TB.

Contributor Author

Indeed. I've updated the calculation accordingly. It now checks that the available space after the removal isn't less than spa_get_dspace() >> 6, i.e. we always leave at least ~1.6% of the pool's normal class available. So for your example:

1.04TB AVAIL - 1.00TB = 0.04TB after removal, which is less than 100TB >> 6 ≈ 1.6TB.


	if (available < vd->vdev_stat.vs_dspace ||
	    available - vd->vdev_stat.vs_dspace <
	    spa_get_dspace(spa) >> 6) {
Member

@ahrens ahrens Jan 14, 2021

I don't think this fixes the problem that removing a disproportionately-huge device can get ENOSPC when we want it to succeed. e.g. imagine a pool with a 1TB device and a 100TB device, with 0.5TB used. If we try to remove the 100TB device, the check would be 100.5 - 100 < 101 * 0.016, or 0.5 < 1.6, which is true. I think we want to require that after removal, there is 1.6% space AVAIL.
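
Illustratively, the requirement being suggested might look like the following (not actual patch code; available and vd are as in the snippet above):

	uint64_t postremove_avail = available - vd->vdev_stat.vs_dspace;
	uint64_t postremove_size = spa_get_dspace(spa) -
	    vd->vdev_stat.vs_dspace;

	/* Require ~1.6% of the post-removal pool to remain AVAIL. */
	if (available < vd->vdev_stat.vs_dspace ||
	    postremove_avail < postremove_size >> 6)
		return (SET_ERROR(ENOSPC));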

While you're right, @ahrens, I'm on an old 2.0.4 version and currently unwilling to upgrade. If nothing has changed in this regard since 2.0.4, then a major tightening up would be massively beneficial, and/or a way to override this with -f when the user does indeed know better. I will tag you both where I record my story from the last 2 days.

@behlendorf behlendorf added Status: Work in Progress Not yet ready for general review and removed Status: Code Review Needed Ready for review and testing labels Feb 8, 2021
The available space check in spa_vdev_remove_top_check() is intended
to verify there is enough available space on the other devices before
starting the removal process.  This is obviously a good idea but the
current check can significantly overestimate the space requirements
for small pools, preventing removal when it clearly should be possible.

This change reworks the check to drop the additional slop space from
the calculation.  This solves two problems:

1) If a large device is accidentally added to a small pool the slop
   requirement is dramatically increased.  This in turn can prevent
   removal of that larger device even if it has never been used.
   This was particularly unfortunate since a main motivation for
   this feature was to allow such mistakes to be corrected.

2) For very small pools with 256M vdevs the minimum slop size of
   128M represents a huge fraction of the total pool capacity.
   Requiring twice this amount of space to be available effectively
   prevents device removal from ever working with these tiny pools.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#11356
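
For readers following the thread, a minimal sketch of the kind of check this commit message describes (illustrative only; it reuses dsl_dir_space_available() as discussed earlier and is not necessarily the code as merged):

	/*
	 * dsl_dir_space_available() reflects the DSL's view of free space
	 * (including reservations), but it also counts the free space on
	 * the vdev being removed, so back that portion out before
	 * comparing against the space allocated on that vdev.  No extra
	 * slop term is added.
	 */
	uint64_t available = dsl_dir_space_available(
	    spa->spa_dsl_pool->dp_root_dir, NULL, 0, B_TRUE);
	uint64_t vd_free = vd->vdev_stat.vs_dspace - vd->vdev_stat.vs_alloc;

	if (available < vd_free ||
	    available - vd_free < vd->vdev_stat.vs_alloc)
		return (SET_ERROR(ENOSPC));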