-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Avoid posting duplicate zpool events #10861
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Don Brady <[email protected]>
Address merge conflicts after PR-10857 landed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks nice! This will definitely be useful, it'll be great to catch the duplicates.
tests/zfs-tests/tests/functional/cli_root/zpool_events/zpool_events_duplicates.ksh
Outdated
Show resolved
Hide resolved
tests/zfs-tests/tests/functional/cli_root/zpool_events/zpool_events_duplicates.ksh
Outdated
Show resolved
Hide resolved
tests/zfs-tests/tests/functional/cli_root/zpool_events/zpool_events_duplicates.ksh
Outdated
Show resolved
Hide resolved
tests/zfs-tests/tests/functional/cli_root/zpool_events/zpool_events_duplicates.ksh
Outdated
Show resolved
Hide resolved
Signed-off-by: Don Brady <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really nice addition. I had a couple questions about the scheduling the cleaner.
Signed-off-by: Don Brady <[email protected]>
The FreeBSD stable/12 failure should be fixed by a rebase. |
Duplicate io and checksum ereport events can misrepresent that things are worse than they seem. Ideally the zpool events and the corresponding vdev stat error counts in a zpool status should be for unique errors -- not the same error being counted over and over. This can be demonstrated in a simple example. With a single bad block in a datafile and just 5 reads of the file we end up with a degraded vdev, even though there is only one unique error in the pool. The proposed solution to the above issue, is to eliminate duplicates when posting events and when updating vdev error stats. We now save recent error events of interest when posting events so that we can easily check for duplicates when posting an error. Reviewed by: Brad Lewis <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes #10861
Duplicate io and checksum ereport events can misrepresent that things are worse than they seem. Ideally the zpool events and the corresponding vdev stat error counts in a zpool status should be for unique errors -- not the same error being counted over and over. This can be demonstrated in a simple example. With a single bad block in a datafile and just 5 reads of the file we end up with a degraded vdev, even though there is only one unique error in the pool. The proposed solution to the above issue, is to eliminate duplicates when posting events and when updating vdev error stats. We now save recent error events of interest when posting events so that we can easily check for duplicates when posting an error. Reviewed by: Brad Lewis <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes openzfs#10861
Duplicate io and checksum ereport events can misrepresent that things are worse than they seem. Ideally the zpool events and the corresponding vdev stat error counts in a zpool status should be for unique errors -- not the same error being counted over and over. This can be demonstrated in a simple example. With a single bad block in a datafile and just 5 reads of the file we end up with a degraded vdev, even though there is only one unique error in the pool. The proposed solution to the above issue, is to eliminate duplicates when posting events and when updating vdev error stats. We now save recent error events of interest when posting events so that we can easily check for duplicates when posting an error. Reviewed by: Brad Lewis <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Don Brady <[email protected]> Closes openzfs#10861
Motivation and Context
Duplicate io and checksum ereport events can misrepresent that things are worse than they seem. Ideally the zpool events and the corresponding vdev stat error counts in a zpool status should be for unique errors -- not the same error being counted over and over. This can be demonstrated in a simple example. With a single bad block in a datafile and just 5 reads of the file we end up with a degraded vdev, even though there is only one unique error in the pool.
Example
Note that the zfs diagnosis agent will diagnose a vdev as degraded if it encounters 10 errors of the same type (io | checksum) within 10 minutes. These errors do not have to be unique! Any runtime condition that is retying a failed block read can easily generate a stream of error events that will trigger a degrade diagnosis.
Ideally, a single bad block should not be enough to degrade a vdev and it should only contribute once to the vdev's stat error count.
Description
The proposed solution to the above issue, is to eliminate duplicates when posting events and when updating vdev error stats. We now save recent error events of interest when posting events so that we can easily check for duplicates when posting an error. The bulk of the implementation changes are in
module/zfs/zfs_fm.c
Note: some function prototype changes (like dropping unused args) had ripple effects across a number of files. Don't let this number of files changed scare you away from offering a review!
There are two tunables introduced,
zfs_zevent_retain_max
andzfs_zevent_retain_expire_secs
. The duplicate checking mechanism can be disabled by setting thezfs_zevent_retain_max
to zero.Also added a
zio_priority
to the ereport payload since that adds the context of read/write and sync/async which is useful when evaluating duplicate errors.How Has This Been Tested?
functional/cli_root/zpool_events
suite of tests that cover zpool events.zpool_events_duplicates
, that generates duplicate block errors and confirms that duplicate events are not posted.Note that the existing test,
zpool_events_errors
, verifies that the event count matches the vdev error stat counts so this test confirms that we have not broken that constraint and adds adequate code coverage.Types of changes
Checklist:
Signed-off-by
.