spa: clear checkpoint information during retry #17319
Merged
Motivation and Context
During spa loading, the spa_add() function allocates the spa_t structure. The problematic rewind logic starts in the spa_load_best() function. spa_load() is called, which is a wrapper for spa_load_impl(). One of the first actions in that function is to call spa_ld_read_checkpoint_txg(). If a checkpoint exists, this function sets spa->spa_checkpoint_txg.

However, during spa_load_impl(), many steps may fail after reading the checkpoint, such as reading feature flags or MOS directories. If one of these steps fails, the function returns an error.
If the failure is something other than ZFS_ERR_NO_CHECKPOINT, a rewind is attempted in the retry loop of spa_load_best(). But because the checkpoint has already been read (and spa_checkpoint_txg is set), and nothing clears this state on failure, another call to spa_ld_read_checkpoint_txg() during the rewind trips an assertion, and we might end up with a crash.
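To illustrate the failure mode described above, here is a minimal, self-contained C model. It is not OpenZFS code: mock_spa_t, mock_read_checkpoint(), mock_load_once(), mock_load_best() and the MOCK_* codes are hypothetical stand-ins. The precondition is assumed to be "the checkpoint field is still zero", and it is modeled as an error return rather than an assertion so that the outcome can be observed without aborting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for spa_t; only the state relevant here. */
typedef struct mock_spa {
    uint64_t checkpoint_txg; /* models spa->spa_checkpoint_txg */
    int failures_left;       /* number of load attempts that should fail */
} mock_spa_t;

enum { MOCK_OK = 0, MOCK_ELOAD = 1, MOCK_ESTALE = 2 };

/*
 * Models spa_ld_read_checkpoint_txg(): assumes the field must still be
 * unset on entry. Returns MOCK_ESTALE where the real code would assert.
 */
int
mock_read_checkpoint(mock_spa_t *spa, uint64_t on_disk_txg)
{
    assert(spa != NULL);
    if (spa->checkpoint_txg != 0)
        return (MOCK_ESTALE);
    spa->checkpoint_txg = on_disk_txg;
    return (MOCK_OK);
}

/* Models one load attempt: read the checkpoint, then run later steps. */
int
mock_load_once(mock_spa_t *spa, uint64_t on_disk_txg)
{
    int err = mock_read_checkpoint(spa, on_disk_txg);
    if (err != MOCK_OK)
        return (err);
    if (spa->failures_left > 0) {
        /* A later step failed; the checkpoint field stays set. */
        spa->failures_left--;
        return (MOCK_ELOAD);
    }
    return (MOCK_OK);
}

/*
 * Models the retry loop. With clear_on_retry set, the stale field is
 * reset before each retry, which is the approach this PR proposes.
 */
int
mock_load_best(mock_spa_t *spa, uint64_t on_disk_txg, bool clear_on_retry,
    int max_retries)
{
    int err = mock_load_once(spa, on_disk_txg);
    for (int i = 0; err == MOCK_ELOAD && i < max_retries; i++) {
        if (clear_on_retry)
            spa->checkpoint_txg = 0;
        err = mock_load_once(spa, on_disk_txg);
    }
    return (err);
}
```

With one injected failure (failures_left = 1), mock_load_best(&spa, 42, false, 3) returns MOCK_ESTALE on the first retry, while the same call with clear_on_retry set to true returns MOCK_OK and leaves checkpoint_txg at 42.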
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Description
There are several potential ways to address this issue:
- Clear spa_checkpoint_txg during retry (proposed in this PR).
- spa_ld_read_checkpoint_txg().
- spa_t structure cleanup before retry.

How Has This Been Tested?
I hit this error while working on #16853. In that PR, we add one additional TXG during export, which seems to trigger this loop during import. The failure was triggered by the checkpoint_zhack_feat test.

However, I don't think this bug is limited to that PR; it is a more general problem that users can hit.