Disk corruption without zfs_vdev_disk_classic=1 for a single virtual machine. #16279

Closed
@angstymeat

Description


System information

Type                  Version/Name
Distribution Name     Fedora Core
Distribution Version  23
Kernel Version        4.8.13-100.fc23.x86_64
Architecture          x86_64
OpenZFS Version       zfs-2.2.99-534_gc98295e

Describe the problem you're observing

I'm migrating our virtual machines from VMware ESXi to Proxmox 8.2.2. It has gone smoothly except for a single virtual machine, one of our older systems running proprietary software, which is why it is still running Fedora Core 23. I have four disks connected over a SAS JBOD that are passed through directly to the VM, the same configuration they had under VMware.

This VM immediately began exhibiting disk corruption, reporting numerous read, write, and checksum errors. I immediately stopped it, booted the VMware version using the same disks, and scrubbed the pool. No errors were reported.
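
In case it helps anyone verifying the same way, this was just the standard scrub cycle; "tank" below is a placeholder for the actual pool name:

    zpool scrub tank
    zpool status -v tank    # re-run until the scrub finishes; expect "No known data errors"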

I have at least a dozen other virtual machines that I have migrated to Proxmox, also using ZFS, most of them running the latest version, and none of them exhibits this issue. The VM configuration (hardware type, CPU type, etc.) is the same across all of them (memory size and CPU count vary).

Some of them are Fedora Core 18, some are CentOS 7, and some are CentOS 8. None of them have this issue.

The VM was originally FC22 when I migrated it, and thinking it was a kernel issue, I updated it to FC23 (the kernel went from 4.4 to 4.8); however, the issue persisted.

While searching I came across #15533, which exhibited the same symptoms, but I'm not running on top of LUKS or anything similar. When I applied zfs_vdev_disk_classic=1, the errors went away.
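
For reference, I set the workaround persistently via modprobe; as far as I can tell this parameter has to be in place before the zfs module loads, so echoing into /sys/module/zfs/parameters/ at runtime is not enough. A minimal sketch (the file name under modprobe.d is arbitrary):

    # /etc/modprobe.d/zfs.conf -- make the workaround persistent across reboots
    options zfs zfs_vdev_disk_classic=1

    # if zfs is loaded from the initramfs (the default on Fedora), rebuild it too
    dracut -f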

Again, none of my other VMs need this option set. Other than the kernel versions, I can't figure out what is different or why this is happening. We either use older kernels like 3.10 under CentOS 7, or newer ones like 4.11 and above (FC24, CentOS 8, etc.).

Describe how to reproduce the problem

Currently, I can get this to occur regularly using this particular zpool on this particular machine under Proxmox 8.2.2, but not under VMware. I boot the machine and start running our software, which performs many small reads and writes in multiple threads (it is collecting seismic data from multiple sources) to memory-mapped files.
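
I can't share the software itself, but a rough stand-in for the I/O pattern with fio (mmap engine, small mixed random I/O from several threads) would look something like the following; the thread count, block size, and path are guesses, not the exact workload:

    # /tank/data is a placeholder directory on the affected pool
    fio --name=mmap-stress --directory=/tank/data \
        --ioengine=mmap --rw=randrw --bs=4k \
        --numjobs=8 --size=1g --time_based --runtime=300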

Include any warning/errors/backtraces from the system logs

Under FC22 I would see many errors in the system logs about losing the connection to the disks; however, I did not keep those logs while I was debugging. These errors appeared only in the VM, not on the Proxmox host.

Under FC23 I get the following:

Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918200]  snd_hda_codec irqbypass iTCO_vendor_support crct10dif_pclmul crc32_pclmul snd_hda_core snd_hwdep crc32c_intel snd_seq snd_seq_device ghash_clmulni_intel snd_pcm intel_rapl_perf i2c_i801 i2c_smbus virtio_balloon joydev snd_timer snd lpc_ich soundcore shpchp acpi_cpufreq tpm_tis tpm_tis_core tpm qemu_fw_cfg nfsd auth_rpcgss nfs_acl lockd grace sunrpc virtio_net virtio_console virtio_scsi bochs_drm drm_kms_helper ttm drm serio_raw virtio_pci virtio_ring virtio lz4 lz4_compress
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918216] CPU: 0 PID: 958 Comm: z_wr_int Tainted: P        W  OE   4.8.13-100.fc23.x86_64 #1
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918217] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918217]  0000000000000286 000000007e16f741 ffff9c0259177a10 ffffffffa73e496e
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918219]  0000000000000000 0000000000000000 ffff9c0259177a50 ffffffffa70a0ecb
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918220]  000002d600000000 ffff9c025903e340 0000000000000000 0000000000001000
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918221] Call Trace:
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918225]  [<ffffffffa73e496e>] dump_stack+0x63/0x85
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918226]  [<ffffffffa70a0ecb>] __warn+0xcb/0xf0
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918227]  [<ffffffffa70a0ffd>] warn_slowpath_null+0x1d/0x20
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918273]  [<ffffffffc099084a>] vbio_fill_cb+0x15a/0x190 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918319]  [<ffffffffc09906f0>] ? vbio_completion+0xa0/0xa0 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918358]  [<ffffffffc085b17e>] abd_iterate_page_func+0xce/0x190 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918405]  [<ffffffffc09910c5>] vdev_disk_io_rw+0x1d5/0x2e0 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918453]  [<ffffffffc098fb81>] vdev_disk_io_start+0x161/0x490 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918507]  [<ffffffffc0980ac2>] zio_vdev_io_start+0x142/0x310 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918557]  [<ffffffffc097f4ff>] zio_execute+0x8f/0xf0 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918606]  [<ffffffffc0932f03>] vdev_queue_io_done+0x123/0x220 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918654]  [<ffffffffc097daca>] zio_vdev_io_done+0x9a/0x210 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918701]  [<ffffffffc097f4ff>] zio_execute+0x8f/0xf0 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918706]  [<ffffffffc06cf4fd>] taskq_thread+0x29d/0x4d0 [spl]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918708]  [<ffffffffa70cbb50>] ? wake_up_q+0x70/0x70
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918754]  [<ffffffffc097f470>] ? zio_reexecute+0x4a0/0x4a0 [zfs]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918759]  [<ffffffffc06cf260>] ? taskq_thread_spawn+0x60/0x60 [spl]
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918760]  [<ffffffffa70c0bf8>] kthread+0xd8/0xf0
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918762]  [<ffffffffa77ffdff>] ret_from_fork+0x1f/0x40
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918763]  [<ffffffffa70c0b20>] ? kthread_worker_fn+0x170/0x170
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918774] ---[ end trace 368f8d93b1defe8c ]---
Jun 17 22:33:18 dhcp-10-10-11-22 kernel: : [    5.918800] ------------[ cut here ]------------
