Kernel oops page fault triggered by Docker in arc_prune #16324

Open
@maxpoulin64

System information

Type                  Version/Name
Distribution Name     ArchLinux
Distribution Version  Latest
Kernel Version        6.8.9-zen1-1-zen
Architecture          x86_64
OpenZFS Version       2.2.4-2

I'm holding at 6.8.9 specifically to stay within the officially supported kernel versions.

Describe the problem you're observing

Extracting large container images in Docker causes ZFS to trigger an unhandled page fault, which locks up the filesystem until the next reboot. Sync never completes, and a normal shutdown never completes either.

Describe how to reproduce the problem

With Docker's ZFS storage driver, running this particular container reliably hangs ZFS on my system during image extraction.

docker run -it --rm -p 8080:8080 --gpus all --name localai quay.io/go-skynet/local-ai:latest-aio-gpu-hipblas

It gets stuck on a line such as this one and never completes. Killing the Docker daemon turns it into a zombie, and IO is completely hosed.

6ddbee975253: Extracting  352.2MB/352.2MB
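
One way to confirm the hang described above is to look for tasks stuck in uninterruptible sleep (state D) or zombie state (state Z). The following is a minimal sketch, assuming a standard Linux /proc layout; the helper names `parse_stat` and `stuck_tasks` are hypothetical and not part of Docker or OpenZFS. A task can be in state D briefly during normal IO, so only tasks that stay there are of interest.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state


def stuck_tasks(proc_root="/proc", states=("D", "Z")):
    """List (pid, comm, state) for tasks whose state is in `states`."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # task exited while scanning, or unexpected format
        if state in states:
            found.append((pid, comm, state))
    return found
```

Calling `stuck_tasks()` a few seconds apart and comparing the results shows which tasks never leave state D or Z.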

Include any warning/errors/backtraces from the system logs

[184791.050957] BUG: unable to handle page fault for address: 00000000208db6e0
[184791.050969] #PF: supervisor instruction fetch in kernel mode
[184791.050972] #PF: error_code(0x0010) - not-present page
[184791.050975] PGD 0 P4D 0 
[184791.050981] Oops: 0010 [#1] PREEMPT SMP NOPTI
[184791.050985] CPU: 11 PID: 482 Comm: arc_prune Tainted: P        W  OE      6.8.9-zen1-1-zen #1 b3e4ad3c9dbde87c9fb9d46fb90ca62a28a66a12
[184791.050992] Hardware name: Micro-Star International Co., Ltd. MS-7B09/X399 GAMING PRO CARBON AC (MS-7B09), BIOS 1.B0 08/09/2018
[184791.050995] RIP: 0010:0x208db6e0
[184791.051042] Code: Unable to access opcode bytes at 0x208db6b6.
[184791.051045] RSP: 0018:ffffb417d2293ce0 EFLAGS: 00010246
[184791.051049] RAX: 00000000208db6e0 RBX: ffffb417d2293d94 RCX: 0000000000000000
[184791.051052] RDX: 0000000000000000 RSI: ffffb417d2293d30 RDI: ffff97e1ac586a80
[184791.051056] RBP: 0000000000003ae0 R08: 0000000000006d66 R09: ffff97e4860e2e90
[184791.051059] R10: ffff97e4860e2e80 R11: ffff97e1f96c0000 R12: ffff97e538d00000
[184791.051063] R13: ffff97e48bf9d780 R14: ffff97e4860e2e28 R15: ffff97e1ac586a80
[184791.051066] FS:  0000000000000000(0000) GS:ffff97e46e4c0000(0000) knlGS:0000000000000000
[184791.051070] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[184791.051074] CR2: 00000000208db6e0 CR3: 000000019b3e6000 CR4: 00000000003506f0
[184791.051077] Call Trace:
[184791.051082]  <TASK>
[184791.051085]  ? __die+0x10f/0x120
[184791.051092]  ? page_fault_oops+0x171/0x4e0
[184791.051101]  ? exc_page_fault+0x7f/0x180
[184791.051107]  ? asm_exc_page_fault+0x26/0x30
[184791.051119]  ? zfs_prune+0xb0/0x4e0 [zfs 158ff065068c3ea6e221f98356463834dc655cec]
[184791.051438]  ? zpl_prune_sb+0x36/0x60 [zfs 158ff065068c3ea6e221f98356463834dc655cec]
[184791.051653]  ? arc_prune_task+0x22/0x40 [zfs 158ff065068c3ea6e221f98356463834dc655cec]
[184791.051880]  ? taskq_thread+0x2d4/0x6f0 [spl 44541b25f59ba0491e81482257bd475148318e14]
[184791.051901]  ? srso_return_thunk+0x5/0x5f
[184791.051907]  ? finish_task_switch.isra.0+0x94/0x2f0
[184791.051914]  ? __pfx_default_wake_function+0x10/0x10
[184791.051924]  ? __pfx_taskq_thread+0x10/0x10 [spl 44541b25f59ba0491e81482257bd475148318e14]
[184791.051941]  ? kthread+0xe8/0x120
[184791.051946]  ? __pfx_kthread+0x10/0x10
[184791.051951]  ? ret_from_fork+0x34/0x50
[184791.051955]  ? __pfx_kthread+0x10/0x10
[184791.051960]  ? ret_from_fork_asm+0x1b/0x30
[184791.051969]  </TASK>
[184791.051971] Modules linked in: xt_conntrack nf_conntrack_netlink xfrm_user xfrm_algo ip6table_nat ip6table_filter ip6_tables xt_addrtype br_netfilter overlay rfcomm snd_seq_dummy snd_hrtimer snd_seq wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel bridge stp llc uhid cmac algif_hash algif_skcipher af_alg xt_MASQUERADE bnep xt_nat iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c crc32c_generic iptable_filter dm_crypt cbc encrypted_keys vfat fat intel_rapl_msr intel_rapl_common btusb snd_hda_codec_realtek btrtl crct10dif_pclmul snd_hda_codec_generic btintel crc32_pclmul iwlmvm snd_hda_codec_hdmi btbcm crc32c_intel snd_usb_audio btmtk polyval_clmulni snd_hda_intel snd_usbmidi_lib polyval_generic mac80211 gf128mul libarc4 snd_intel_dspcfg snd_ump ghash_clmulni_intel snd_intel_sdw_acpi bluetooth snd_rawmidi sha512_ssse3 joydev snd_seq_device snd_hda_codec sha256_ssse3 ecdh_generic iwlwifi mousedev sha1_ssse3 mc
[184791.052084]  razerkbd(OE) crc16 aesni_intel snd_hda_core crypto_simd snd_hwdep cryptd snd_pcm igb cfg80211 rapl ptp snd_timer sp5100_tco pps_core gpio_amdpt snd dca wmi_bmof rfkill pcspkr soundcore gpio_generic mxm_wmi i2c_piix4 k10temp mac_hid kvmfr(OE) sg crypto_user loop nfnetlink ip_tables x_tables hid_steam ff_memless hid_logitech_hidpp hid_logitech_dj hid_generic trusted asn1_encoder tee dm_mod usbhid amdgpu vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd video amdxcp i2c_algo_bit drm_ttm_helper ttm kvm_amd drm_exec gpu_sched drm_suballoc_helper kvm nvme drm_buddy nvme_core drm_display_helper xhci_pci irqbypass cec ccp nvme_auth xhci_pci_renesas wmi zfs(POE) spl(OE) vendor_reset(OE) nct6775 nct6775_core hwmon_vid i2c_dev
[184791.052189] CR2: 00000000208db6e0
[184791.052193] ---[ end trace 0000000000000000 ]---
[184791.052196] RIP: 0010:0x208db6e0
[184791.052216] Code: Unable to access opcode bytes at 0x208db6b6.
[184791.052219] RSP: 0018:ffffb417d2293ce0 EFLAGS: 00010246
[184791.052223] RAX: 00000000208db6e0 RBX: ffffb417d2293d94 RCX: 0000000000000000
[184791.052226] RDX: 0000000000000000 RSI: ffffb417d2293d30 RDI: ffff97e1ac586a80
[184791.052229] RBP: 0000000000003ae0 R08: 0000000000006d66 R09: ffff97e4860e2e90
[184791.052232] R10: ffff97e4860e2e80 R11: ffff97e1f96c0000 R12: ffff97e538d00000
[184791.052235] R13: ffff97e48bf9d780 R14: ffff97e4860e2e28 R15: ffff97e1ac586a80
[184791.052238] FS:  0000000000000000(0000) GS:ffff97e46e4c0000(0000) knlGS:0000000000000000
[184791.052241] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[184791.052244] CR2: 00000000208db6e0 CR3: 000000019b3e6000 CR4: 00000000003506f0
[184791.052248] note: arc_prune[482] exited with irqs disabled
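
For readers less familiar with oops output, the two #PF lines in the log above are a decoding of the x86 page-fault error code. The sketch below reproduces that decoding for the low six bits only; the table and the function name `decode_pf_error_code` are illustrative, not kernel code. CR2 and RIP both report 0x208db6e0, which is consistent with the CPU trying to execute code at that address, although the log alone does not show how it got there.

```python
# Low bits of the x86 page-fault error code: (mask, meaning if set, meaning if clear).
# Higher bits (shadow stack, SGX) are left out of this sketch.
PF_BITS = [
    (0x01, "protection violation", "not-present page"),
    (0x02, "write access", "read or fetch access"),
    (0x04, "user mode", "supervisor mode"),
    (0x08, "reserved bit set", None),
    (0x10, "instruction fetch", None),
    (0x20, "protection-key violation", None),
]


def decode_pf_error_code(code):
    """Return a list of human-readable properties for a #PF error code."""
    parts = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            parts.append(if_set)
        elif if_clear is not None:
            parts.append(if_clear)
    return parts


# Error code from the log: error_code(0x0010)
print(decode_pf_error_code(0x0010))
```

For 0x0010 this reports a not-present page, supervisor mode, and an instruction fetch, matching the kernel's own summary in the log.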

The stack trace is always the same. Disk passes scrub with 0 errors after rebooting.
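
To check that two occurrences really share the same stack trace, the function names can be extracted from each log and compared. This is a rough sketch that assumes the dmesg format shown above, with a timestamp prefix and an optional "?" marker on each frame; `call_trace_signature` is a hypothetical helper, and `SAMPLE` is a shortened excerpt of the trace from this report.

```python
import re

# Matches e.g. "[184791.051119]  ? zfs_prune+0xb0/0x4e0 [zfs ...]"
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+\??\s*([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def call_trace_signature(log_text):
    """Return the function names of the first <TASK>...</TASK> block, in order."""
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        if "</TASK>" in line:
            break  # only the first trace is considered
        if "<TASK>" in line:
            in_trace = True
        elif in_trace:
            match = FRAME_RE.match(line)
            if match:
                frames.append(match.group(1))
    return frames


SAMPLE = """\
[184791.051077] Call Trace:
[184791.051082]  <TASK>
[184791.051119]  ? zfs_prune+0xb0/0x4e0 [zfs 158ff065068c3ea6e221f98356463834dc655cec]
[184791.051438]  ? zpl_prune_sb+0x36/0x60 [zfs 158ff065068c3ea6e221f98356463834dc655cec]
[184791.051653]  ? arc_prune_task+0x22/0x40 [zfs 158ff065068c3ea6e221f98356463834dc655cec]
[184791.051880]  ? taskq_thread+0x2d4/0x6f0 [spl 44541b25f59ba0491e81482257bd475148318e14]
[184791.051969]  </TASK>
"""

print(call_trace_signature(SAMPLE))
```

Two saved logs have the same trace when `call_trace_signature(log_a) == call_trace_signature(log_b)`. Offsets and timestamps are ignored, so only the sequence of function names is compared.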

Labels: Type: Defect (incorrect behavior, e.g. crash, hang)
