Frequent Checksum Errors During Scrub on ZFS Pool #16452

Open
@KoffeinKaio

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Debian |
| Distribution Version | 12 |
| Kernel Version | 6.10.3-amd64 |
| Architecture | amd64 |
| OpenZFS Version | zfs-2.2.5-1 |

Describe the problem you're observing

Checksum errors appear during scrubs: in rare cases none, in most cases 3-4 spread across different drives, and sometimes not even the same drives from one scrub to the next.
No SMART errors are logged.

Changed so far:
Replaced the RAM with ECC RAM
Installed a new PSU
Connected 3 drives of the pool to another PCIe SATA card
Replaced all cables to the drives

This problem was originally a Q&A discussion (#16445), but since I'm out of options on what could be causing it, maybe it is a bug somewhere?

Describe how to reproduce the problem

Run a scrub: zpool scrub Data

Include any warning/errors/backtraces from the system logs

root@pve:~# zpool status -v
  pool: Data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Thu Aug 15 17:15:59 2024
	35.6T / 84.1T scanned at 14.4G/s, 3.92T / 84.1T issued at 1.58G/s
	516K repaired, 4.66% done, 14:24:32 to go
config:

	NAME                                   STATE     READ WRITE CKSUM
	Data                                   ONLINE       0     0     0
	  raidz2-0                             ONLINE       0     0     0
	    ata-ST18000NM003D-3DL103_ZVT4WHGT  ONLINE       0     0     0
	    ata-ST18000NM003D-3DL103_ZVT5NTBG  ONLINE       0     0     0
	    ata-ST18000NM003D-3DL103_ZVT5MER8  ONLINE       0     0     0
	    ata-ST18000NM003D-3DL103_ZVT0SCR9  ONLINE       0     0     1  (repairing)
	    ata-ST18000NM003D-3DL103_ZVT07MLA  ONLINE       0     0     0
	    ata-ST18000NM003D-3DL103_ZVT9JNZR  ONLINE       0     0     0
	    ata-ST18000NM003D-3DL103_ZVT9C7NM  ONLINE       0     0     1  (repairing)
	    ata-ST18000NM003D-3DL103_ZVTAEL6D  ONLINE       0     0     1  (repairing)
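
Because the affected drives change from scrub to scrub, it may help to save the `zpool status` output after each scrub and tally which devices keep reporting checksum errors, for example to see whether they cluster on one controller. The following is only a minimal sketch of that idea: the helper names `cksum_errors` and `tally` are hypothetical, the sample is an excerpt of the output above, and the parser assumes plain integer counters (abbreviated counts such as `1.2K` are not handled).

```python
import re
from collections import Counter

# One device row of the `zpool status` config table:
# NAME  STATE  READ  WRITE  CKSUM  [optional note such as "(repairing)"]
# Assumption: counters are plain integers; abbreviated values are skipped.
ROW = re.compile(
    r"^\s+(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)


def cksum_errors(status_text):
    """Return {device_name: cksum_count} for rows with a non-zero CKSUM column."""
    errors = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if match and int(match.group(5)) > 0:
            errors[match.group(1)] = int(match.group(5))
    return errors


def tally(snapshots):
    """Count in how many saved status outputs each device showed checksum errors."""
    seen = Counter()
    for text in snapshots:
        seen.update(cksum_errors(text).keys())
    return seen


# Excerpt of the status output from this report (indentation as spaces).
sample = """
    NAME                                   STATE     READ WRITE CKSUM
    Data                                   ONLINE       0     0     0
      raidz2-0                             ONLINE       0     0     0
        ata-ST18000NM003D-3DL103_ZVT0SCR9  ONLINE       0     0     1  (repairing)
        ata-ST18000NM003D-3DL103_ZVT07MLA  ONLINE       0     0     0
"""

if __name__ == "__main__":
    print(cksum_errors(sample))
```

Feeding `tally` one saved output per scrub gives a per-device count of how many scrubs reported errors, which can then be compared against which SATA card or port each drive is attached to.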

I don't want to blindly swap out the CPU or motherboard, or buy another HBA controller. Any ideas?

Labels: Type: Defect (incorrect behavior, e.g. crash, hang)
