
ZTS: Remove ashift setting from dedup_quota test #17250

Merged
tonyhutter merged 1 commit into openzfs:master from the dedup_quota_ashift branch on Apr 18, 2025

Conversation

@amotin (Member) commented Apr 16, 2025

The test writes 1M of 1KB blocks, which may produce up to 1GB of dirty data. On top of that, ashift=12 likely produces an additional 4GB of ZIO buffers during the sync process, since each 1KB block is padded out to a 4KB allocation. On top of that, we likely need some (up to 4GB of) page cache, since the pool resides on files. And finally we need to cache the DDT. It is not surprising that the test regularly ends up in OOMs, possibly depending on TXG size variations.

Also replace fio, with its pretty strange parameter set, with a set of dd writes and TXG commits, which is just what we need here.

While here, remove compression. It has nothing to do with this test and only wastes CI CPU time.
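
For illustration, the replacement workload is conceptually along these lines. This is only a sketch in ZTS style: the dataset path, chunk count, and chunk size are placeholders rather than the exact parameters of the test, and it assumes the dataset uses a 1K recordsize so that each 1KB write becomes its own block.

# Write ~1M unique 1KB blocks in chunks, forcing a TXG commit after each
# chunk so dirty data stays bounded while the DDT grows predictably.
typeset -i i=0
while (( i < 64 )); do
        log_must dd if=/dev/urandom of=/$POOL/$TESTFS/file bs=1k \
            count=16384 seek=$((i * 16384)) conv=notrunc
        log_must zpool sync $POOL
        (( i += 1 ))
done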

How Has This Been Tested?

Very briefly, by hand. Let's see what the CI says about it.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@amotin (Member, Author) commented Apr 17, 2025

Hooray! It passed on all platforms on the first attempt, and seems faster than before, even though the variation is significant, from ~1.5 to ~8 minutes.

@amotin amotin added the Status: Code Review Needed Ready for review and testing label Apr 17, 2025
@amotin amotin requested review from pcd1193182 and tonyhutter April 17, 2025 14:50
@pcd1193182 (Contributor) commented Apr 17, 2025

While it passed on all the CI platforms, it now fails intermittently when I run it locally (where previously it did not):

PR change:		 6/10 pass
remove compression:	10/10 pass
remove ashift:		 7/10 pass
original:		10/10 pass

Removing the ashift from the pool appears to make the test flaky, at least on my system. Removing compression alone had no negative effect. I am investigating the failure to see if I can get insight into why this is causing problems.

@pcd1193182 (Contributor) commented:

Okay, I noticed that the failing runs always used a dedup device, while the passing runs used a special device in ddt_dedup_vdev_limit, so I think something is going wrong there. Investigating further.

@amotin (Member, Author) commented Apr 17, 2025

I've restarted the CI to try my luck again. @pcd1193182 I wonder how exactly it fails for you.

@pcd1193182 (Contributor) commented:

So the failure mode in my runs is that the number of entries ends up greater than the target of 650000 in the ddt_dedup_vdev_limit test. And from looking at the zdb output, it looks like the on-disk size is larger than the disk space allocated for the dedup class. It seems like we're not respecting that limit for some reason.
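
For context, the on-disk DDT numbers referenced here come from zdb's dedup statistics, which can be dumped with something like the following (the pool name is a placeholder):

# Dump dedup (DDT) statistics: entry counts and on-disk/in-core sizes
zdb -DD $POOL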

@amotin (Member, Author) commented Apr 17, 2025

@pcd1193182 In the CI logs I regularly saw a big discrepancy between the dedup_table_size property and the output of zpool list -v. I don't know why, though it might be a question of when TXG commits update the data, since the test obviously passed. I actually wonder where the 650000 comes from. Why not a million, matching the number of written blocks? The value of 800,000 entries in the comment is even more mysterious to me.
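
For reference, the two numbers being compared can be read with something like the following (using the test's $POOL as a stand-in for the pool name; the property and command come straight from the discussion above):

# DDT size as tracked by the pool property, in bytes
zpool get -Hp -o value dedup_table_size $POOL
# Space actually allocated per vdev, including the dedup/special class
zpool list -v $POOL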

@pcd1193182 (Contributor) commented Apr 17, 2025

@amotin By expanding the dbgmsg output we gather from the test, it looks like we flush ~400k log entries before we start spilling into the normal class, and then we keep flushing another ~380k after that. It seems like the log kept building up while we were doing the writes, and it didn't actually notice we were over quota (and so stop adding things to the log) until the dedup device filled up.

My guess is that this is because the dedup log is stored on the special device (as a metadata object), which means that if we're using a special device we notice we're over dedup quota once the log + DDT is big enough to fill the object. Whereas if we are using a dedup device, we only store the actual ddt entries there, and so it doesn't trip the quota until it's too late, and the dedup log has hundreds of thousands of extra entries to handle.

As for why removing the 4k made this appear... my guess is that the dirty data inflation was actually helping us by forcing things to spread into more TXGs. That slowed it down enough that the dedup code noticed that we were at the limit before it added the extra entries to the log.

EDIT: It also makes sense why this was only appearing in my runs and not in the CI; I'm probably using a significantly beefier VM than the CI runners, and the problem only appears if we manage to shove all the extra stuff into the log before we can flush enough out to trip the quota.
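
If it helps, one rough way to check the "more TXGs" theory on Linux is to diff the latest TXG number from the txgs kstat before and after the write phase. This is only a sketch; it assumes the Linux procfs kstat layout, and the pool name is a placeholder.

# The latest TXG number is the first column of the last line of the kstat.
txg_now() {
        tail -1 /proc/spl/kstat/zfs/$POOL/txgs | awk '{print $1}'
}
before=$(txg_now)
# ... run the write workload here ...
after=$(txg_now)
echo "TXGs elapsed: $((after - before))"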

@amotin (Member, Author) commented Apr 17, 2025

@pcd1193182 IIRC we introduced zp_storage_type specifically to make sure DDT logs are stored the same way as the DDT. But considering that this test uses a 1KB recordsize, I wonder if the special vdev might see enough additional usage from indirect blocks to affect the results. Actually, I quickly ran a manual test with both dedup and special vdevs and got this:

# zpool list -v
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
optane      254G  1.27G   253G        -         -     0%     0%  1.00x    ONLINE  -
  nvd2p1    256G  1.00G   253G        -         -     0%  0.39%      -    ONLINE
dedup          -      -      -        -         -      -      -      -         -
  md1       200M   179M  12.9M        -         -    27%  93.3%      -    ONLINE
special        -      -      -        -         -      -      -      -         -
  md0       200M  96.5M  95.5M        -         -    23%  50.3%      -    ONLINE

So the special vdev usage by metadata (I think) is significant enough here to affect the result.

PS: Maybe in the "dedup" case we could reduce the size to 100MB to compensate?

@pcd1193182 (Contributor) commented:

Ah interesting, I didn't know about that feature! But you're right that indirect blocks would also make a significant difference, at this recordsize. We could try having a smaller dedup device than special to account for that. I also tried the simple hack of increasing DEDUP_LOG_FLUSH_ENTRIES_MIN to 500000 for this test, and with that it now passes consistently on my VM.
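
For reference, that kind of per-test override would normally go through the ZTS tunable helpers, roughly as sketched below; it assumes DEDUP_LOG_FLUSH_ENTRIES_MIN is wired into the ZTS tunables table and mirrors the hack described above, not the change actually committed:

# Raise the dedup log flush minimum for this test only, restore on cleanup.
log_must save_tunable DEDUP_LOG_FLUSH_ENTRIES_MIN
log_must set_tunable32 DEDUP_LOG_FLUSH_ENTRIES_MIN 500000
# ... run the test body ...
log_must restore_tunable DEDUP_LOG_FLUSH_ENTRIES_MIN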

@pcd1193182 (Contributor) commented:

Okay, with the size set to 100M for dedup and 200M for special, the test also passes consistently.

diff --git a/tests/zfs-tests/tests/functional/dedup/dedup_quota.ksh b/tests/zfs-tests/tests/functional/dedup/dedup_quota.ksh
index 3d5b80599..d62ac434e 100755
--- a/tests/zfs-tests/tests/functional/dedup/dedup_quota.ksh
+++ b/tests/zfs-tests/tests/functional/dedup/dedup_quota.ksh
@@ -189,10 +189,12 @@ function ddt_dedup_vdev_limit
        # add a dedicated dedup/special VDEV and enable an automatic quota
        if (( RANDOM % 2 == 0 )) ; then
                class="special"
+               size="200M"
        else
                class="dedup"
+               size="100M"
        fi
-       log_must truncate -s 200M $VDEV_DEDUP
+       log_must truncate -s $size $VDEV_DEDUP
        log_must zpool add $POOL $class $VDEV_DEDUP
        log_must zpool set dedup_table_quota=auto $POOL

@amotin force-pushed the dedup_quota_ashift branch from dabba94 to 0755384 on April 17, 2025 19:23
@tonyhutter (Contributor) left a comment:

Nice detective work!

@amotin force-pushed the dedup_quota_ashift branch from 0755384 to 3a36804 on April 18, 2025 01:30
@amotin (Member, Author) commented Apr 18, 2025

And of course the final version triggered OOMs on two platforms again. Lacking additional ideas about the OOMs, I decided to replace fio, with its honestly very weird set of parameters, with a loop of dd and zpool sync calls. If nothing else, it should generate a very predictable workload, which is just what we need for this test. Let's see what happens now.

@amotin force-pushed the dedup_quota_ashift branch from 3a36804 to cd75ac8 on April 18, 2025 01:47
The test writes 1M of 1KB blocks, which may produce up to 1GB of
dirty data.  On top of that, ashift=12 likely produces an additional
4GB of ZIO buffers during the sync process.  On top of that we
likely need some page cache, since the pool resides on files.  And
finally we need to cache the DDT.  It is not surprising that the
test regularly ends up in OOMs, possibly depending on TXG size
variations.

Also replace fio, with its pretty strange parameter set, with a set
of dd writes and TXG commits, which is just what we need here.

While here, remove compression.  It has nothing to do with this
test and only wastes CI CPU time.

Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
@amotin force-pushed the dedup_quota_ashift branch from cd75ac8 to 2e3d984 on April 18, 2025 13:12
@amotin (Member, Author) commented Apr 18, 2025

It passed this time. Rebasing and restarting once more just in case.

Meanwhile, while the test formally passes, looking at the logs I still see a big discrepancy between the reported DDT size of 153105408 and the dedup device ALLOC of 40.1M. I wonder if we somehow leak part of the DDT into the main pool. Though it may be complicated by the very small sizes of the special/dedup vdevs.

@amotin amotin requested review from pcd1193182 and tonyhutter April 18, 2025 13:17
@tonyhutter tonyhutter merged commit 1d8f625 into openzfs:master Apr 18, 2025
22 of 24 checks passed
@amotin amotin deleted the dedup_quota_ashift branch April 19, 2025 00:35
@amotin amotin added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Apr 19, 2025
snajpa pushed a commit to vpsfreecz/zfs that referenced this pull request May 2, 2025
robn pushed a commit to robn/zfs that referenced this pull request May 23, 2025
@robn robn mentioned this pull request May 23, 2025
tonyhutter pushed a commit that referenced this pull request May 28, 2025