Description
System information
Type | Version/Name |
---|---|
Distribution Name | Ubuntu |
Distribution Version | 22.04.4 |
Kernel Version | 6.5.0-25-generic (HWE Kernel) |
Architecture | x86_64 |
OpenZFS Version | zfs-2.1.5-1ubuntu6~22.04.2 / zfs-kmod-2.2.0-0ubuntu1~23.10.1 |
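The OpenZFS row above lists userland tools from the 2.1 series next to a kernel module from the 2.2 series. A minimal sketch for spotting such a split, assuming `zfs version` prints one `zfs-...` line and one `zfs-kmod-...` line (the helper name `zfs_version_skew` is made up for illustration, and whether the split is related to the crash is not established here):

```shell
#!/bin/sh
# Hypothetical helper: reads `zfs version`-style output on stdin and reports
# whether userland and kernel module share the same major.minor series.
zfs_version_skew() {
    awk '
        # "zfs-kmod-" is 9 characters, so the version starts at position 10.
        /^zfs-kmod-/ { split(substr($0, 10), k, "."); kmod = k[1] "." k[2]; next }
        # "zfs-" is 4 characters, so the version starts at position 5.
        /^zfs-/      { split(substr($0, 5),  u, "."); user = u[1] "." u[2] }
        END {
            if (user == "" || kmod == "") { print "unknown"; exit 2 }
            if (user == kmod) { print "match " user }
            else { print "skew userland=" user " kmod=" kmod; exit 1 }
        }
    '
}

# Usage: zfs version | zfs_version_skew
```

The function exits non-zero on a mismatch, so it can be used as a pre-flight check in a provisioning script.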
Describe the problem you're observing
We run several machines with LXD (5.20) hosting ephemeral GitHub Actions runners, so containers are created and deleted at a high rate. The containers live on a ZFS filesystem that was created by LXD. After setting up another machine, we noticed that it crashed after about 16 hours of use.
After comparing the machines (all are set up identically, or should be), we noticed that on the working machines the ZFS dataset was created before ZFS 2.2 (I guess with 2.0 or 2.1), while on the newest machine it was created with ZFS 2.2.
After this discovery we destroyed the ZFS pool and recreated it like this:
truncate -s 512G /var/snap/lxd/common/lxd/disks/default_legacy.img
zpool create -m none -O compression=on -o compatibility=openzfs-2.0-linux default_legacy /var/snap/lxd/common/lxd/disks/default_legacy.img
zpool set autotrim=on default_legacy
Since that change, the server has been running without issues.
The feature differences between the non-working and the working pool were these:
default unsupported@org.openzfs:zilsaxattr readonly local
default unsupported@com.delphix:head_errlog readonly local
default unsupported@org.openzfs:blake3 inactive local
default unsupported@com.fudosecurity:block_cloning readonly local
default unsupported@com.klarasystems:vdev_zaps_v2 readonly local
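A comparison like the one above can be reproduced by diffing the feature properties of the two pools. The sketch below assumes the output of `zpool get -H -o property,value all <pool>` has been saved to one file per pool; the helper name `feature_diff` and the file names are placeholders, not part of any ZFS tooling.

```shell
#!/bin/sh
# Hypothetical helper: print feature properties whose value differs between
# two saved `zpool get -H -o property,value all <pool>` dumps.
# Only feature@... and unsupported@... properties are considered.
# Output columns: property, value in first file, value in second file
# ("-" means the property is absent from that file).
feature_diff() {
    awk '
        $1 !~ /^(feature|unsupported)@/ { next }
        FILENAME == ARGV[1] { a[$1] = $2; next }
        {
            seen[$1] = 1
            if (!($1 in a)) {
                print $1, "-", $2
            } else if (a[$1] != $2) {
                print $1, a[$1], $2
            }
        }
        END { for (p in a) if (!(p in seen)) print p, a[p], "-" }
    ' "$1" "$2" | sort
}

# Usage (pool names taken from this report):
#   zpool get -H -o property,value all default_legacy > legacy.txt
#   zpool get -H -o property,value all default        > new.txt
#   feature_diff legacy.txt new.txt
```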
In the syslog I found this page fault followed by a NULL pointer dereference:
Mar 2 01:45:06 garm3 kernel: [40773.506304] BUG: unable to handle page fault for address: ffffadd4e4636000
Mar 2 01:45:06 garm3 kernel: [40773.506603] #PF: supervisor write access in kernel mode
Mar 2 01:45:06 garm3 kernel: [40773.506838] #PF: error_code(0x0002) - not-present page
Mar 2 01:45:06 garm3 kernel: [40773.507064] PGD 100000067 P4D 100000067 PUD 7ef220067 PMD a39af3067 PTE 0
Mar 2 01:45:06 garm3 kernel: [40773.507292] Oops: 0002 [#1] PREEMPT SMP NOPTI
Mar 2 01:45:06 garm3 kernel: [40773.507516] CPU: 18 PID: 2064375 Comm: fuse-overlayfs Tainted: P O 6.5.0-21-generic #21~22.04.1-Ubuntu
Mar 2 01:45:06 garm3 kernel: [40773.507737] Hardware name: ASUS System Product Name/Pro WS 565-ACE, BIOS 9901 10/13/2022
Mar 2 01:45:06 garm3 kernel: [40773.507944] RIP: 0010:memcpy+0x8/0x10
Mar 2 01:45:06 garm3 kernel: [40773.508153] Code: 09 c2 48 89 d0 49 f7 e1 49 01 d0 eb c8 cc cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 d1 <f3> a4 e9 31 81 01 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
Mar 2 01:45:06 garm3 kernel: [40773.508587] RSP: 0018:ffffadd4052239c0 EFLAGS: 00010286
Mar 2 01:45:06 garm3 kernel: [40773.508809] RAX: ffffadd4e4627490 RBX: ffff955018c42e38 RCX: 00000000000113d8
Mar 2 01:45:06 garm3 kernel: [40773.509031] RDX: 000000000001ff48 RSI: ffffadd4ec244bb8 RDI: ffffadd4e4636000
Mar 2 01:45:06 garm3 kernel: [40773.509254] RBP: ffffadd405223a00 R08: 0000000000000000 R09: 0000000000000000
Mar 2 01:45:06 garm3 kernel: [40773.509476] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9566547ea000
Mar 2 01:45:06 garm3 kernel: [40773.509699] R13: 000000000001ff48 R14: ffffadd4e4627490 R15: ffffadd4ec236000
Mar 2 01:45:06 garm3 kernel: [40773.509925] FS: 00007f8ec10de740(0000) GS:ffff956b6ee80000(0000) knlGS:0000000000000000
Mar 2 01:45:06 garm3 kernel: [40773.510152] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 2 01:45:06 garm3 kernel: [40773.510376] CR2: ffffadd4e4636000 CR3: 0000000eaa2dc000 CR4: 0000000000750ee0
Mar 2 01:45:06 garm3 kernel: [40773.510611] PKRU: 55555554
Mar 2 01:45:06 garm3 kernel: [40773.510836] Call Trace:
Mar 2 01:45:06 garm3 kernel: [40773.511057] <TASK>
Mar 2 01:45:06 garm3 kernel: [40773.511278] ? show_regs+0x6d/0x80
Mar 2 01:45:06 garm3 kernel: [40773.511498] ? __die+0x24/0x80
Mar 2 01:45:06 garm3 kernel: [40773.511716] ? page_fault_oops+0x99/0x1b0
Mar 2 01:45:06 garm3 kernel: [40773.511935] ? kernelmode_fixup_or_oops+0xb2/0x140
Mar 2 01:45:06 garm3 kernel: [40773.512152] ? __bad_area_nosemaphore+0x1a5/0x2c0
Mar 2 01:45:06 garm3 kernel: [40773.512368] ? srso_alias_return_thunk+0x5/0x7f
Mar 2 01:45:06 garm3 kernel: [40773.512583] ? __wake_up_common_lock+0x8b/0xd0
Mar 2 01:45:06 garm3 kernel: [40773.512797] ? bad_area_nosemaphore+0x16/0x30
Mar 2 01:45:06 garm3 kernel: [40773.513007] ? do_kern_addr_fault+0x7b/0xa0
Mar 2 01:45:06 garm3 kernel: [40773.513217] ? exc_page_fault+0x10d/0x1b0
Mar 2 01:45:06 garm3 kernel: [40773.513429] ? asm_exc_page_fault+0x27/0x30
Mar 2 01:45:06 garm3 kernel: [40773.513642] ? memcpy+0x8/0x10
Mar 2 01:45:06 garm3 kernel: [40773.513855] ? zil_lwb_commit+0x5f/0x340 [zfs]
Mar 2 01:45:06 garm3 kernel: [40773.514241] zil_lwb_write_issue+0x7d/0x920 [zfs]
Mar 2 01:45:06 garm3 kernel: [40773.514570] zil_commit_writer+0x91/0x140 [zfs]
Mar 2 01:45:06 garm3 kernel: [40773.514888] zil_commit_impl+0x64/0xf0 [zfs]
Mar 2 01:45:06 garm3 kernel: [40773.515204] zil_commit+0x3d/0x80 [zfs]
Mar 2 01:45:06 garm3 kernel: [40773.515515] zfs_write+0xaad/0xca0 [zfs]
Mar 2 01:45:06 garm3 kernel: [40773.515839] zpl_iter_write+0x118/0x160 [zfs]
Mar 2 01:45:06 garm3 kernel: [40773.516148] vfs_write+0x254/0x440
Mar 2 01:45:06 garm3 kernel: [40773.516354] __x64_sys_pwrite64+0xa6/0xd0
Mar 2 01:45:06 garm3 kernel: [40773.516558] do_syscall_64+0x5b/0x90
Mar 2 01:45:06 garm3 kernel: [40773.516762] ? srso_alias_return_thunk+0x5/0x7f
Mar 2 01:45:06 garm3 kernel: [40773.516963] ? do_syscall_64+0x67/0x90
Mar 2 01:45:06 garm3 kernel: [40773.517164] ? srso_alias_return_thunk+0x5/0x7f
Mar 2 01:45:06 garm3 kernel: [40773.517367] ? syscall_exit_to_user_mode+0x37/0x60
Mar 2 01:45:06 garm3 kernel: [40773.517566] ? srso_alias_return_thunk+0x5/0x7f
Mar 2 01:45:06 garm3 kernel: [40773.517766] ? do_syscall_64+0x67/0x90
Mar 2 01:45:06 garm3 kernel: [40773.517965] ? srso_alias_return_thunk+0x5/0x7f
Mar 2 01:45:06 garm3 kernel: [40773.518158] ? exit_to_user_mode_prepare+0x9b/0xb0
Mar 2 01:45:06 garm3 kernel: [40773.518350] ? srso_alias_return_thunk+0x5/0x7f
Mar 2 01:45:06 garm3 kernel: [40773.518539] ? syscall_exit_to_user_mode+0x37/0x60
Mar 2 01:45:06 garm3 kernel: [40773.518726] ? srso_alias_return_thunk+0x5/0x7f
Mar 2 01:45:06 garm3 kernel: [40773.518911] ? do_syscall_64+0x67/0x90
Mar 2 01:45:06 garm3 kernel: [40773.519087] ? do_syscall_64+0x67/0x90
Mar 2 01:45:06 garm3 kernel: [40773.519254] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Mar 2 01:45:06 garm3 kernel: [40773.519419] RIP: 0033:0x7f8ec11f39aa
Mar 2 01:45:06 garm3 kernel: [40773.519601] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb ba 0f 1f 00 f3 0f 1e fa 49 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 12 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5e c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
Mar 2 01:45:06 garm3 kernel: [40773.519946] RSP: 002b:00007ffcd4d0f148 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
Mar 2 01:45:06 garm3 kernel: [40773.520126] RAX: ffffffffffffffda RBX: 0000000000002000 RCX: 00007f8ec11f39aa
Mar 2 01:45:06 garm3 kernel: [40773.520306] RDX: 0000000000002000 RSI: 00007f8ec0fdc060 RDI: 00000000000002e0
Mar 2 01:45:06 garm3 kernel: [40773.520485] RBP: 0000000000000000 R08: 0000000000002000 R09: 000000000000000e
Mar 2 01:45:06 garm3 kernel: [40773.520661] R10: 0000000000002000 R11: 0000000000000246 R12: 00007ffcd4d0f3d8
Mar 2 01:45:06 garm3 kernel: [40773.520835] R13: 0000000000000000 R14: 00007ffcd4d0f278 R15: 0000000000000000
Mar 2 01:45:06 garm3 kernel: [40773.521012] </TASK>
Mar 2 01:45:06 garm3 kernel: [40773.521184] Modules linked in: tls unix_diag overlay nf_conntrack_netlink xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE xfrm_user xfrm_algo xt_addrtype nft_compat veth nft_masq nft_chain_nat bridge stp llc zfs(PO) spl(O) ebtable_filter ebtables ip6table_raw ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_raw iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter nf_tables nfnetlink vhost_vsock vmw_vsock_virtio_transport_common vhost vhost_iotlb vsock binfmt_misc intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm irqbypass rapl eeepc_wmi wmi_bmof ccp k10temp mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid1 crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ast ghash_clmulni_intel mfd_aaeon drm_shmem_helper asus_wmi aesni_intel
Mar 2 01:45:06 garm3 kernel: [40773.521263] video drm_kms_helper ledtrig_audio sparse_keymap crypto_simd platform_profile nvme cryptd drm igb nvme_core i2c_piix4 ahci dca nvme_common i2c_algo_bit xhci_pci libahci xhci_pci_renesas wmi gpio_amdpt
Mar 2 01:45:06 garm3 kernel: [40773.523585] CR2: ffffadd4e4636000
Mar 2 01:45:06 garm3 kernel: [40773.523820] ---[ end trace 0000000000000000 ]---
Mar 2 01:45:06 garm3 kernel: [40773.524057] RIP: 0010:memcpy+0x8/0x10
Mar 2 01:45:06 garm3 kernel: [40773.524296] Code: 09 c2 48 89 d0 49 f7 e1 49 01 d0 eb c8 cc cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 48 89 f8 48 89 d1 <f3> a4 e9 31 81 01 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
Mar 2 01:45:06 garm3 kernel: [40773.524801] RSP: 0018:ffffadd4052239c0 EFLAGS: 00010286
Mar 2 01:45:06 garm3 kernel: [40773.525060] RAX: ffffadd4e4627490 RBX: ffff955018c42e38 RCX: 00000000000113d8
Mar 2 01:45:06 garm3 kernel: [40773.525324] RDX: 000000000001ff48 RSI: ffffadd4ec244bb8 RDI: ffffadd4e4636000
Mar 2 01:45:06 garm3 kernel: [40773.525587] RBP: ffffadd405223a00 R08: 0000000000000000 R09: 0000000000000000
Mar 2 01:45:06 garm3 kernel: [40773.525852] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9566547ea000
Mar 2 01:45:06 garm3 kernel: [40773.526122] R13: 000000000001ff48 R14: ffffadd4e4627490 R15: ffffadd4ec236000
Mar 2 01:45:06 garm3 kernel: [40773.526390] FS: 00007f8ec10de740(0000) GS:ffff956b6ee80000(0000) knlGS:0000000000000000
Mar 2 01:45:06 garm3 kernel: [40773.526660] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 2 01:45:06 garm3 kernel: [40773.526931] CR2: ffffadd4e4636000 CR3: 0000000eaa2dc000 CR4: 0000000000750ee0
Mar 2 01:45:06 garm3 kernel: [40773.527205] PKRU: 55555554
Mar 2 01:45:06 garm3 kernel: [40773.527479] note: fuse-overlayfs[2064375] exited with irqs disabled
Mar 2 01:45:08 garm3 kernel: [40775.600880] overlayfs: fs on '/home/runner/.local/share/containers/storage/overlay/compat2342864860/lower1' does not support file handles, falling back to xino=off.
Mar 2 01:45:09 garm3 kernel: [40775.919960] BUG: kernel NULL pointer dereference, address: 0000000000000020
Mar 2 01:45:09 garm3 kernel: [40775.920527] #PF: supervisor read access in kernel mode
Mar 2 01:45:09 garm3 kernel: [40775.921066] #PF: error_code(0x0000) - not-present page
Mar 2 01:45:09 garm3 kernel: [40775.921445] PGD 0 P4D 0
Mar 2 01:45:09 garm3 kernel: [40775.921798] Oops: 0000 [#2] PREEMPT SMP NOPTI
Mar 2 01:45:09 garm3 kernel: [40775.922173] CPU: 17 PID: 2038119 Comm: z_wr_int_h Tainted: P D O 6.5.0-21-generic #21~22.04.1-Ubuntu
Mar 2 01:45:09 garm3 kernel: [40775.922540] Hardware name: ASUS System Product Name/Pro WS 565-ACE, BIOS 9901 10/13/2022
Mar 2 01:45:09 garm3 kernel: [40775.922909] RIP: 0010:list_add+0x1/0x20 [spl]
Mar 2 01:45:09 garm3 kernel: [40775.923304] Code: 89 1c 24 5b 41 5c 41 5d 5d 31 c0 31 d2 31 f6 31 ff e9 53 24 c4 d9 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 <48> 8b 16 48 89 e5 e8 a4 ff ff ff 5d 31 d2 31 f6 31 ff e9 28 24 c4
Mar 2 01:45:09 garm3 kernel: [40775.924070] RSP: 0018:ffffadd4107afc18 EFLAGS: 00010046
Mar 2 01:45:09 garm3 kernel: [40775.924445] RAX: ffffadd4e4635000 RBX: 0000000000000000 RCX: 0000000000000000
Mar 2 01:45:09 garm3 kernel: [40775.924818] RDX: 0000000000000000 RSI: 0000000000000020 RDI: ffffadd4e4635018
Mar 2 01:45:09 garm3 kernel: [40775.925189] RBP: ffffadd4107afc40 R08: 0000000000000000 R09: 0000000000000000
Mar 2 01:45:09 garm3 kernel: [40775.925563] R10: 0000000000000000 R11: 0000000000000000 R12: ffff954cea463400
Mar 2 01:45:09 garm3 kernel: [40775.925957] R13: ffff954cc190b920 R14: ffff954cc190b800 R15: ffff954cc190b8b8
Mar 2 01:45:09 garm3 kernel: [40775.926329] FS: 0000000000000000(0000) GS:ffff956b6ee40000(0000) knlGS:0000000000000000
Mar 2 01:45:09 garm3 kernel: [40775.926706] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 2 01:45:09 garm3 kernel: [40775.927079] CR2: 0000000000000020 CR3: 0000000e67d5e000 CR4: 0000000000750ee0
Mar 2 01:45:09 garm3 kernel: [40775.927452] PKRU: 55555554
Mar 2 01:45:09 garm3 kernel: [40775.927816] Call Trace:
Mar 2 01:45:09 garm3 kernel: [40775.928175] <TASK>
Mar 2 01:45:09 garm3 kernel: [40775.928528] ? show_regs+0x6d/0x80
Mar 2 01:45:09 garm3 kernel: [40775.928877] ? __die+0x24/0x80
Mar 2 01:45:09 garm3 kernel: [40775.929219] ? page_fault_oops+0x99/0x1b0
Mar 2 01:45:09 garm3 kernel: [40775.929560] ? do_user_addr_fault+0x31d/0x6b0
Mar 2 01:45:09 garm3 kernel: [40775.929903] ? exc_page_fault+0x83/0x1b0
Mar 2 01:45:09 garm3 kernel: [40775.930228] ? asm_exc_page_fault+0x27/0x30
Mar 2 01:45:09 garm3 kernel: [40775.930549] ? list_add+0x1/0x20 [spl]
Mar 2 01:45:09 garm3 kernel: [40775.930864] ? list_add+0xc/0x20 [spl]
Mar 2 01:45:09 garm3 kernel: [40775.931172] ? spl_cache_shrink+0x27/0xc0 [spl]
Mar 2 01:45:09 garm3 kernel: [40775.931481] spl_cache_flush+0x66
Mar 4 13:59:10 garm3 multipathd[764]: --------start up--------
The only thing from the kernel side I was able to retrieve was this (unable to scroll or catch it otherwise, sorry):
Not sure if this is helpful since this is far from my expertise, but maybe it makes sense to anyone here.
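To narrow traces like the two above down to the frames inside the ZFS and SPL modules, something along these lines may help. It is only a sketch: the helper name `zfs_frames` is made up, and the pattern assumes the syslog line layout shown in this report (timestamp in square brackets, then the frame, then `[zfs]` or `[spl]` at the end of the line).

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log lines on stdin and prints only the
# stack frames that belong to the zfs or spl modules. The syslog prefix and
# the "? " marker for unreliable frames are stripped. RIP lines and truncated
# frames do not match and are skipped.
zfs_frames() {
    sed -n -E 's/^.*\] +(\? )?([A-Za-z0-9_.]+\+0x[0-9a-f]+\/0x[0-9a-f]+ \[(zfs|spl)\])$/\2/p'
}

# Usage: zfs_frames < /var/log/syslog
```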