Checksum errors may not be counted #11545

Open
@ahrens

System information

Type                  Version/Name
Distribution Name
Distribution Version
Linux Kernel
Architecture
ZFS Version           after 4f07282
SPL Version

Describe the problem you're observing

If a block is damaged again after having been repaired once, the checksum error is not reported when the block is repaired the second time. This causes confusion (e.g. while testing) because there is no visibility into the checksum errors that are being detected (and potentially corrected).

This is a change in behavior caused by #10861. I understand the desire to limit the rate of event generation since we keep so few of them. However:

  1. This justification doesn't apply to the checksum error counts (vs_checksum_errors): it doesn't cost anything to count to a large number.
  2. After a block is repaired (e.g. by zpool scrub) or errors are discarded (zpool clear), it would be reasonable to report the error again (and even to generate another event).

I'd suggest that we make at least one (and perhaps all) of the following changes:

  1. always count the checksum errors
  2. reset the "recent" errors when a scrub completes, so that newly-discovered errors will be logged and counted
  3. reset the "recent" errors when zpool clear is run
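
To make the three suggestions concrete, here is a minimal toy model in C. It is not OpenZFS code: the `toy_*` names, the fixed-size "recent" ring, and the `always_count` flag are all hypothetical, and the way #10861 actually tracks recent errors is not shown in this issue. The sketch only illustrates the difference between suppressing duplicate events and suppressing the count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RECENT_MAX 8	/* hypothetical size of the "recent" set */

/* Toy stand-in for a vdev; NOT the real OpenZFS structures. */
typedef struct toy_vdev {
	uint64_t checksum_errors;	/* models vs_checksum_errors */
	uint64_t events_posted;		/* events actually generated */
	uint64_t recent[TOY_RECENT_MAX];	/* blocks reported recently */
	size_t recent_count;		/* valid entries in recent[] */
	size_t recent_next;		/* next ring slot to overwrite */
} toy_vdev_t;

/* Return true if this block was already reported recently. */
static bool
toy_is_recent(const toy_vdev_t *vd, uint64_t block)
{
	for (size_t i = 0; i < vd->recent_count; i++) {
		if (vd->recent[i] == block)
			return (true);
	}
	return (false);
}

/* Remember a block, overwriting the oldest entry once the ring is full. */
static void
toy_remember(toy_vdev_t *vd, uint64_t block)
{
	vd->recent[vd->recent_next] = block;
	vd->recent_next = (vd->recent_next + 1) % TOY_RECENT_MAX;
	if (vd->recent_count < TOY_RECENT_MAX)
		vd->recent_count++;
}

/*
 * Report a checksum error on a block.  With always_count == false the
 * counter is skipped for duplicates (the behavior described in this
 * issue); with always_count == true only the event is rate limited
 * (suggestion 1).
 */
void
toy_report_checksum_error(toy_vdev_t *vd, uint64_t block, bool always_count)
{
	assert(vd != NULL);
	bool dup = toy_is_recent(vd, block);

	if (always_count || !dup)
		vd->checksum_errors++;
	if (!dup) {
		vd->events_posted++;
		toy_remember(vd, block);
	}
}

/* Suggestions 2 and 3: forget "recent" errors (scrub done or clear). */
void
toy_forget_recent(toy_vdev_t *vd)
{
	assert(vd != NULL);
	vd->recent_count = 0;
	vd->recent_next = 0;
}

/* Zero the counters only, leaving the "recent" set untouched. */
void
toy_clear_counts(toy_vdev_t *vd)
{
	assert(vd != NULL);
	vd->checksum_errors = 0;
	vd->events_posted = 0;
}
```

In this model, reporting the same block twice with `always_count == false` and a `toy_clear_counts()` in between leaves the counter at zero the second time, which matches the confusing output in this report. Calling `toy_forget_recent()` first, or passing `always_count == true`, makes the second error visible again.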

Describe how to reproduce the problem

zpool create ... raidz ...
silently damage one disk (dd of=/dev/dsk/...)
zpool scrub
Scrub reports that it repaired some space, and vdev reports some checksum errors:

  scan: scrub repaired 1.00M in 00:00:03 with 0 errors on Fri Jan 29 04:32:40 2021
config:

	NAME                         STATE     READ WRITE CKSUM
	test                         ONLINE       0     0     0
	  raidz1-0                   ONLINE       0     0     0
	    /var/tmp/expand_vdevs/1  ONLINE       0     0    28
	    /var/tmp/expand_vdevs/2  ONLINE       0     0     0
	    /var/tmp/expand_vdevs/3  ONLINE       0     0     0
	    /var/tmp/expand_vdevs/4  ONLINE       0     0     0

silently damage one disk AGAIN (dd of=/dev/dsk/...)
zpool scrub AGAIN
Scrub reports that it repaired some space, BUT vdev reports no checksum errors:

  scan: scrub repaired 1.00M in 00:00:02 with 0 errors on Fri Jan 29 04:33:01 2021
config:

	NAME                         STATE     READ WRITE CKSUM
	test                         ONLINE       0     0     0
	  raidz1-0                   ONLINE       0     0     0
	    /var/tmp/expand_vdevs/1  ONLINE       0     0     0
	    /var/tmp/expand_vdevs/2  ONLINE       0     0     0
	    /var/tmp/expand_vdevs/3  ONLINE       0     0     0
	    /var/tmp/expand_vdevs/4  ONLINE       0     0     0

Include any warning/errors/backtraces from the system logs

@don-brady @behlendorf

Labels

Status: Understood (the root cause of the issue is known)
Type: Defect (incorrect behavior, e.g. crash, hang)