ztest crashes with default zfs_abd_scatter_min_size #12793

Open
@rincebrain

System information

Type                  Version/Name
Distribution Name     Debian
Distribution Version  11
Kernel Version        not relevant
Architecture          x86_64
OpenZFS Version       ded851b

Describe the problem you're observing

ztest crashes an awful lot.

Most of the crashes, IME, look something like:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7737537 in __GI_abort () at abort.c:79
#2  0x00007ffff7ae6923 in vpanic (fmt=0x7ffff7d74540 "Got SIGSEGV at address: 0x%lx\n", adx=adx@entry=0x7fffffffce58) at kernel.c:612
#3  0x00007ffff7ae69bb in panic (fmt=fmt@entry=0x7ffff7d74540 "Got SIGSEGV at address: 0x%lx\n") at kernel.c:621
#4  0x00007ffff7afaeb6 in arc_buf_sigsegv (sig=<optimized out>, si=<optimized out>, unused=<optimized out>) at ../../module/zfs/arc.c:1515
#5  <signal handler called>
#6  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:437
#7  0x00007ffff7aef9be in abd_copy_to_buf_off_cb (buf=<optimized out>, size=size@entry=4096, private=private@entry=0x7fffffffd5a8) at ../../module/zfs/abd.c:828
#8  0x00007ffff7af1044 in abd_iterate_func (private=0x7fffffffd5a8, func=0x7ffff7aef9a0 <abd_copy_to_buf_off_cb>, size=4096, off=<optimized out>, abd=0x7fffd8021c90) at ../../module/zfs/abd.c:805
#9  abd_iterate_func (abd=0x7fffd8021c90, off=<optimized out>, size=<optimized out>, func=0x7ffff7aef9a0 <abd_copy_to_buf_off_cb>, private=0x7fffffffd5a8) at ../../module/zfs/abd.c:780
#10 0x00007ffff7af1278 in abd_copy_to_buf_off (buf=<optimized out>, abd=<optimized out>, off=off@entry=0, size=<optimized out>) at ../../module/zfs/abd.c:842
#11 0x00007ffff7b0239e in abd_copy_to_buf (size=<optimized out>, abd=<optimized out>, buf=<optimized out>) at ../../include/sys/abd.h:159
#12 arc_buf_fill (buf=0x555555b35a90, spa=spa@entry=0x5555556569a0, zb=zb@entry=0x7fffffffd6a0, flags=flags@entry=0) at ../../module/zfs/arc.c:2067
#13 0x00007ffff7b0327d in arc_untransform (buf=<optimized out>, spa=0x5555556569a0, zb=zb@entry=0x7fffffffd6a0, in_place=in_place@entry=B_FALSE) at ../../module/zfs/arc.c:2171
#14 0x00007ffff7b366f6 in dmu_objset_own_impl (ds=ds@entry=0x5555556af3b0, type=type@entry=DMU_OST_ANY, readonly=readonly@entry=B_TRUE, decrypt=decrypt@entry=B_TRUE, osp=osp@entry=0x7fffffffd8c8, tag=<optimized out>)
    at ../../module/zfs/dmu_objset.c:774
#15 0x00007ffff7b3abed in dmu_objset_own_impl (tag=0x555555573cf0 <__func__.9>, osp=0x7fffffffd8c8, decrypt=B_TRUE, readonly=B_TRUE, type=DMU_OST_ANY, ds=0x5555556af3b0) at ../../module/zfs/dmu_objset.c:757
#16 dmu_objset_own (name=name@entry=0x5555565cbfa0 "ztest/ds_4", type=type@entry=DMU_OST_ANY, readonly=readonly@entry=B_TRUE, decrypt=decrypt@entry=B_TRUE, tag=tag@entry=0x555555573cf0 <__func__.9>, osp=osp@entry=0x7fffffffd8c8)
    at ../../module/zfs/dmu_objset.c:808
#17 0x0000555555563fcb in ztest_dmu_objset_own (name=name@entry=0x5555565cbfa0 "ztest/ds_4", type=type@entry=DMU_OST_ANY, readonly=readonly@entry=B_TRUE, tag=tag@entry=0x555555573cf0 <__func__.9>, osp=osp@entry=0x7fffffffd8c8,
    decrypt=B_TRUE) at ztest.c:1602
#18 0x0000555555567d9c in ztest_replay_zil_cb (name=name@entry=0x5555565cbfa0 "ztest/ds_4", arg=arg@entry=0x0) at ztest.c:7255
#19 0x00007ffff7b35d0d in dmu_objset_find_impl (spa=spa@entry=0x5555556569a0, name=name@entry=0x5555565cbfa0 "ztest/ds_4", func=func@entry=0x555555567d70 <ztest_replay_zil_cb>, arg=arg@entry=0x0, flags=flags@entry=2)
    at ../../module/zfs/dmu_objset.c:2951
#20 0x00007ffff7b35e40 in dmu_objset_find_impl (spa=0x5555556569a0, name=name@entry=0x55555557a960 <ztest_opts> "ztest", func=func@entry=0x555555567d70 <ztest_replay_zil_cb>, arg=arg@entry=0x0, flags=flags@entry=2)
    at ../../module/zfs/dmu_objset.c:2894
#21 0x00007ffff7b3b303 in dmu_objset_find (name=name@entry=0x55555557a960 <ztest_opts> "ztest", func=func@entry=0x555555567d70 <ztest_replay_zil_cb>, arg=arg@entry=0x0, flags=flags@entry=2) at ../../module/zfs/dmu_objset.c:2967
#22 0x000055555555cdc4 in ztest_run (zs=0x7ffff7ffb738) at ztest.c:7563
#23 main (argc=<optimized out>, argv=<optimized out>) at ztest.c:8062

After a round of bisecting, I ended up at 87c25d5, which I would not have guessed, but here we are. And lo, if you extend ztest to set zfs_abd_scatter_min_size to 4097 on x86_64, it goes from crashing practically always to crashing never so far.
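
To make the 4097 concrete: the faulting copy in the backtrace above has size=4096, and 4097 is the smallest threshold that would push a 4096-byte buffer onto the linear path, assuming the allocator picks linear when the size is strictly below zfs_abd_scatter_min_size. The sketch below only models that assumed comparison. The model_* names are hypothetical, and the 1536 starting value is a placeholder, not a number taken from this report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the zfs_abd_scatter_min_size tunable (placeholder value). */
static size_t model_scatter_min_size = 1536;

/* Hypothetical setter, playing the role of the ztest change described above. */
static void
model_set_scatter_min_size(size_t bytes)
{
	assert(bytes > 0);
	model_scatter_min_size = bytes;
}

/*
 * Assumed rule: requests strictly below the threshold get one linear
 * buffer, everything else is built from scatter chunks.
 */
static bool
model_abd_is_linear(size_t size)
{
	return (size < model_scatter_min_size);
}
```

With the placeholder threshold, model_abd_is_linear(4096) is false (scatter); after model_set_scatter_min_size(4097) it is true, so 4 KiB buffers would no longer go through the chunk allocator in this model.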

If we ask valgrind, it first complains a lot about reads of uninitialized values in the crypto code, but if you zero those, the output narrows down to eventually spitting out:

==129414== Thread 109:
==129414== Conditional jump or move depends on uninitialised value(s)
==129414==    at 0x483EEEE: bcmp (vg_replace_strmem.c:1111)
==129414==    by 0x48ABA33: abd_cmp_buf_off_cb (abd.c:852)
==129414==    by 0x48AD043: abd_iterate_func (abd.c:805)
==129414==    by 0x48AD043: abd_iterate_func (abd.c:780)
==129414==    by 0x48AD304: abd_cmp_buf_off (abd.c:866)
==129414==    by 0x48AD377: abd_cmp_buf (abd.h:165)
==129414==    by 0x48AD377: abd_return_buf (abd.c:673)
==129414==    by 0x48C0F42: arc_read_done (arc.c:5692)
==129414==    by 0x4A606A3: zio_done (zio.c:4835)
==129414==    by 0x4A54968: __zio_execute (zio.c:2209)
==129414==    by 0x4A54968: zio_execute (zio.c:2122)
==129414==    by 0x48A3961: taskq_thread (taskq.c:237)
==129414==    by 0x4EFFEA6: start_thread (pthread_create.c:477)
==129414==    by 0x5018DEE: clone (clone.S:95)
==129414==  Uninitialised value was created by a heap allocation
==129414==    at 0x483AEB8: memalign (vg_replace_malloc.c:906)
==129414==    by 0x483AFCE: posix_memalign (vg_replace_malloc.c:1070)
==129414==    by 0x48AF0BF: umem_alloc_aligned (umem.h:105)
==129414==    by 0x48AF0BF: abd_alloc_chunks (abd_os.c:579)
==129414==    by 0x48ABF8C: abd_alloc (abd.c:192)
==129414==    by 0x48BC838: arc_hdr_alloc_abd (arc.c:3191)
==129414==    by 0x48C24DD: arc_read (arc.c:6188)
==129414==    by 0x4A4C5BF: zil_read_log_block (zil.c:241)
==129414==    by 0x4A4C5BF: zil_parse (zil.c:398)
==129414==    by 0x4A4D659: zil_check_log_chain (zil.c:975)
==129414==    by 0x48F225A: dmu_objset_find_dp_impl (dmu_objset.c:2725)
==129414==    by 0x48F2625: dmu_objset_find_dp_cb (dmu_objset.c:2758)
==129414==    by 0x48A3961: taskq_thread (taskq.c:237)
==129414==    by 0x4EFFEA6: start_thread (pthread_create.c:477)
==129414==
==129414==
==129414== Process terminating with default action of signal 6 (SIGABRT): dumping core
==129414==    at 0x4F56CE1: raise (raise.c:51)
==129414==    by 0x4F40536: abort (abort.c:79)
==129414==    by 0x48A2922: vpanic (kernel.c:612)
==129414==    by 0x48A29BA: panic (kernel.c:621)
==129414==    by 0x48B6EB5: arc_buf_sigsegv (arc.c:1515)
==129414==    by 0x4F0B13F: ??? (in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so)
==129414==    by 0x483F7F2: memmove (vg_replace_strmem.c:1270)
==129414==    by 0x48AB9BD: abd_copy_to_buf_off_cb (abd.c:828)
==129414==    by 0x48AD043: abd_iterate_func (abd.c:805)
==129414==    by 0x48AD043: abd_iterate_func (abd.c:780)
==129414==    by 0x48AD277: abd_copy_to_buf_off (abd.c:842)
==129414==    by 0x48BE39D: abd_copy_to_buf (abd.h:159)
==129414==    by 0x48BE39D: arc_buf_fill (arc.c:2067)
==129414==    by 0x48BF27C: arc_untransform (arc.c:2171)

Obviously we could just make ztest set zfs_abd_scatter_min_size to 4097 for now, but that seems problematic, and it's not presently clear to me whether the logical flaw is in the umem implementation or elsewhere. (I will keep looking, of course.)
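
The valgrind report above traces the uninitialised bytes to a posix_memalign allocation. One way to check whether those bytes matter to the crash, as a rough debugging sketch only, would be to zero aligned allocations at the point they are made. The helper name debug_alloc_aligned_zeroed is hypothetical and nothing like it is claimed to exist in libzpool.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical debugging helper: allocate aligned memory and zero it,
 * so that later reads are defined from valgrind's point of view.
 */
static void *
debug_alloc_aligned_zeroed(size_t align, size_t size)
{
	void *buf = NULL;

	/* posix_memalign() wants a power of two that is >= sizeof (void *). */
	assert(align >= sizeof (void *) && (align & (align - 1)) == 0);

	if (posix_memalign(&buf, align, size) != 0)
		return (NULL);

	/* The contents are indeterminate after posix_memalign(); define them. */
	memset(buf, 0, size);
	return (buf);
}
```

Zeroing like this would quiet the "uninitialised value" warning, but it may also mask the underlying bug, so it is better suited to narrowing things down than to serving as a fix.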

Describe how to reproduce the problem

Above.

Include any warning/errors/backtraces from the system logs

Above.
