Description
While stress testing TestStackBarrierProfiling at 54bd5a7 on master, I got a segfault in sigtrampgo
in signal_linux.go because g != nil, but g.m == nil.
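For context, the handler's entry path looks roughly like this (a simplified paraphrase of runtime.sigtrampgo around that commit, not the verbatim source; getg, setg, badsignal and sighandler are the runtime's own helpers, so this is not standalone code):

func sigtrampgo(sig uint32, info *siginfo, ctx unsafe.Pointer) {
	g := getg()
	if g == nil {
		// No Go g at all: report the signal as "bad" and bail out.
		badsignal(uintptr(sig))
		return
	}
	// signal_linux.go:20 in the crash above: this assumes any non-nil g
	// has a non-nil g.m, and faults when it doesn't, because the nil m
	// is dereferenced to find the signal-handling goroutine.
	setg(g.m.gsignal)
	sighandler(sig, info, ctx, g)
	setg(g)
}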
I've saved the binary and core file. Here is some preliminary digging through the core:
Core was generated by `./pprof.test -test.run=TestStackBarrierProfiling'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000000000043de2e in runtime.sigtrampgo (sig=11, info=0xc820061bf0,
ctx=0xc820061ac0) at /home/austin/go.dev/src/runtime/signal_linux.go:20
20 setg(g.m.gsignal)
Loading Go Runtime support.
(gdb) bt
#0 0x000000000043de2e in runtime.sigtrampgo (sig=11, info=0xc820061bf0,
ctx=0xc820061ac0) at /home/austin/go.dev/src/runtime/signal_linux.go:20
#1 0x000000000045c93b in runtime.sigtramp ()
at /home/austin/go.dev/src/runtime/sys_linux_amd64.s:234
#2 0x000000000045c940 in runtime.sigtramp ()
at /home/austin/go.dev/src/runtime/sys_linux_amd64.s:235
#3 0x0000000000000001 in ?? ()
#4 0x0000000000000000 in ?? ()
(gdb) print/x *g
$5 = {stack = {lo = 0xc820092000, hi = 0xc820092800},
stackguard0 = 0xfffffffffffffade, stackguard1 = 0xffffffffffffffff,
_panic = 0x0, _defer = 0x0, m = 0x0, stackAlloc = 0x1000, sched = {sp = 0x0,
pc = 0x459140, g = 0xc820001200, ctxt = 0x0, ret = 0x0, lr = 0x0,
bp = 0x4}, syscallsp = 0x0, syscallpc = 0x0, stkbar = []runtime.stkbar,
stkbarPos = 0x0, stktopsp = 0xc8200927d8, param = 0x0, atomicstatus = 0x2,
stackLock = 0x0, goid = 0x6, waitsince = 0x0,
waitreason = "GC worker (idle)", schedlink = 0x0, preempt = 0x1,
paniconfault = 0x0, preemptscan = 0x0, gcscandone = 0x1, gcscanvalid = 0x0,
throwsplit = 0x0, raceignore = 0x0, sysblocktraced = 0x0,
sysexitticks = 0x0, sysexitseq = 0x0, lockedm = 0x0, sig = 0x0,
writebuf = []uint8, sigcode0 = 0x0, sigcode1 = 0x0, sigpc = 0x0,
gopc = 0x418163, startpc = 0x4181d0, racectx = 0x0, waiting = 0x0,
gcAssistBytes = 0x0}
(gdb) x/i g.sched.pc
0x459140 <runtime.systemstack_switch>: retq
(gdb) x/i g.gopc
0x418163 <runtime.gcBgMarkStartWorkers+147>:
lea 0x2a4c96(%rip),%rbx # 0x6bce00 <runtime.work>
(gdb) x/i g.startpc
0x4181d0 <runtime.gcBgMarkWorker>: mov %fs:0xfffffffffffffff8,%rcx
(gdb) print/x *(struct 'runtime.sigcontext'*)(ctx+40)
$11 = {r8 = 0x0, r9 = 0x0, r10 = 0xc82003ef18, r11 = 0x206, r12 = 0x800,
r13 = 0x400, r14 = 0x9, r15 = 0x8, rdi = 0x6bcdb0, rsi = 0x0,
rbp = 0xc820026000, rbx = 0x0, rdx = 0x0, rax = 0xc820001200,
rcx = 0xc820001200, rsp = 0xc820092738, rip = 0x45920d, eflags = 0x10246,
cs = 0x33, gs = 0x0, fs = 0x0, __pad0 = 0x0, err = 0x4, trapno = 0xe,
oldmask = 0x0, cr2 = 0x0, fpstate = 0xc820061c80, __reserved1 = {0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}
(gdb) x/i $11.rip
0x45920d <runtime.morestack+13>: mov (%rbx),%rsi
(gdb) x/21a $11.rsp
0xc820092738: 0x4189e2 <runtime.gcFlushGCWork+130> 0x417390 <runtime.gcMarkDone+560>
0xc820092748: 0x619060 <runtime.stopTheWorldWithSema.f> 0xffffffff00000001
0xc820092758: 0xffffffff00000001 0x4186a8 <runtime.gcBgMarkWorker+1240>
0xc820092768: 0x0 0x0
0xc820092778: 0xffffffff 0x10
0xc820092788: 0x14 0x0
0xc820092798: 0xffffffff01000000 0xfffffffe
0xc8200927a8: 0x2120869068188 0x0
0xc8200927b8: 0xc820034800 0xc820001200
0xc8200927c8: 0x618e78 <runtime.gcBgMarkWorker.func1.f> 0x45bba1 <runtime.goexit+1>
0xc8200927d8: 0xc820024a00
(gdb) disassemble 'runtime.gcBgMarkWorker'
0x00000000004186a3 <+1235>: callq 0x417160 <runtime.gcMarkDone>
0x00000000004186a8 <+1240>: mov %fs:0xfffffffffffffff8,%rax
(gdb) disassemble 'runtime.gcMarkDone'
0x000000000041738b <+555>: callq 0x418960 <runtime.gcFlushGCWork>
0x0000000000417390 <+560>: callq 0x41a650 <runtime.gcWakeAllAssists>
(gdb) disassemble 'runtime.gcFlushGCWork'
0x00000000004189dd <+125>: callq 0x459290 <runtime.morestack_noctxt>
0x00000000004189e2 <+130>: jmpq 0x418960 <runtime.gcFlushGCWork>
This appears to be a nested signal. The original signal was a SIGSEGV in morestack at MOVQ m_g0(BX), SI because BX (getg().m) is 0, and the signal handler then crashed for the same reason. The return addresses on the stack show the chain gcBgMarkWorker -> gcMarkDone -> gcFlushGCWork -> morestack_noctxt -> morestack, so we clearly tried to grow the stack in gcFlushGCWork (the stack is very small because this test runs in gcstackbarrierall mode), but I'm not sure why there wasn't an M attached at that point. It may be related to the fact that we've already stopped the world in gcMarkDone.
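For reference, the prologue of morestack does the moral equivalent of the following (a Go paraphrase of the assembly, not real runnable code, just to show where the original fault comes from):

	gp := getg()  // g for the goroutine whose stack is full
	mp := gp.m    // in this core, 0 (BX == 0)
	g0 := mp.g0   // MOVQ m_g0(BX), SI: dereferences the nil m and faults
	// ...switch to g0 and call newstack to grow gp's stack...

That matches rbx = 0x0 and cr2 = 0x0 in the sigcontext above, and it is the same broken invariant (a running g with g.m == nil) that then takes down sigtrampgo when the handler fires.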
This is relatively easy to reproduce: it happened five times in 3,000 stress runs on my workstation (~25 minutes total).