runtime: nil pointer dereference in sigtrampgo #13363

Closed
@aclements

While stress testing TestStackBarrierProfiling at 54bd5a7 on master, I got a segfault in sigtrampgo in signal_linux.go because g != nil, but g.m == nil.

I've saved the binary and core file. Here is some preliminary digging through the core:

Core was generated by `./pprof.test -test.run=TestStackBarrierProfiling'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000000000043de2e in runtime.sigtrampgo (sig=11, info=0xc820061bf0, 
    ctx=0xc820061ac0) at /home/austin/go.dev/src/runtime/signal_linux.go:20
20      setg(g.m.gsignal)
Loading Go Runtime support.
(gdb) bt
#0  0x000000000043de2e in runtime.sigtrampgo (sig=11, info=0xc820061bf0, 
    ctx=0xc820061ac0) at /home/austin/go.dev/src/runtime/signal_linux.go:20
#1  0x000000000045c93b in runtime.sigtramp ()
    at /home/austin/go.dev/src/runtime/sys_linux_amd64.s:234
#2  0x000000000045c940 in runtime.sigtramp ()
    at /home/austin/go.dev/src/runtime/sys_linux_amd64.s:235
#3  0x0000000000000001 in ?? ()
#4  0x0000000000000000 in ?? ()
(gdb) print/x *g
$5 = {stack = {lo = 0xc820092000, hi = 0xc820092800}, 
  stackguard0 = 0xfffffffffffffade, stackguard1 = 0xffffffffffffffff, 
  _panic = 0x0, _defer = 0x0, m = 0x0, stackAlloc = 0x1000, sched = {sp = 0x0, 
    pc = 0x459140, g = 0xc820001200, ctxt = 0x0, ret = 0x0, lr = 0x0, 
    bp = 0x4}, syscallsp = 0x0, syscallpc = 0x0, stkbar =  []runtime.stkbar, 
  stkbarPos = 0x0, stktopsp = 0xc8200927d8, param = 0x0, atomicstatus = 0x2, 
  stackLock = 0x0, goid = 0x6, waitsince = 0x0, 
  waitreason = "GC worker (idle)", schedlink = 0x0, preempt = 0x1, 
  paniconfault = 0x0, preemptscan = 0x0, gcscandone = 0x1, gcscanvalid = 0x0, 
  throwsplit = 0x0, raceignore = 0x0, sysblocktraced = 0x0, 
  sysexitticks = 0x0, sysexitseq = 0x0, lockedm = 0x0, sig = 0x0, 
  writebuf =  []uint8, sigcode0 = 0x0, sigcode1 = 0x0, sigpc = 0x0, 
  gopc = 0x418163, startpc = 0x4181d0, racectx = 0x0, waiting = 0x0, 
  gcAssistBytes = 0x0}
(gdb) x/i g.sched.pc
   0x459140 <runtime.systemstack_switch>:   retq   
(gdb) x/i g.gopc
   0x418163 <runtime.gcBgMarkStartWorkers+147>: 
    lea    0x2a4c96(%rip),%rbx        # 0x6bce00 <runtime.work>
(gdb) x/i g.startpc
   0x4181d0 <runtime.gcBgMarkWorker>:   mov    %fs:0xfffffffffffffff8,%rcx

(gdb) print/x *(struct 'runtime.sigcontext'*)(ctx+40)
$11 = {r8 = 0x0, r9 = 0x0, r10 = 0xc82003ef18, r11 = 0x206, r12 = 0x800, 
  r13 = 0x400, r14 = 0x9, r15 = 0x8, rdi = 0x6bcdb0, rsi = 0x0, 
  rbp = 0xc820026000, rbx = 0x0, rdx = 0x0, rax = 0xc820001200, 
  rcx = 0xc820001200, rsp = 0xc820092738, rip = 0x45920d, eflags = 0x10246, 
  cs = 0x33, gs = 0x0, fs = 0x0, __pad0 = 0x0, err = 0x4, trapno = 0xe, 
  oldmask = 0x0, cr2 = 0x0, fpstate = 0xc820061c80, __reserved1 = {0x0, 0x0, 
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}
(gdb) x/i $11.rip
   0x45920d <runtime.morestack+13>: mov    (%rbx),%rsi
(gdb) x/21a $11.rsp
0xc820092738:   0x4189e2 <runtime.gcFlushGCWork+130>    0x417390 <runtime.gcMarkDone+560>
0xc820092748:   0x619060 <runtime.stopTheWorldWithSema.f>   0xffffffff00000001
0xc820092758:   0xffffffff00000001  0x4186a8 <runtime.gcBgMarkWorker+1240>
0xc820092768:   0x0 0x0
0xc820092778:   0xffffffff  0x10
0xc820092788:   0x14    0x0
0xc820092798:   0xffffffff01000000  0xfffffffe
0xc8200927a8:   0x2120869068188 0x0
0xc8200927b8:   0xc820034800    0xc820001200
0xc8200927c8:   0x618e78 <runtime.gcBgMarkWorker.func1.f>   0x45bba1 <runtime.goexit+1>
0xc8200927d8:   0xc820024a00
(gdb) disassemble 'runtime.gcBgMarkWorker'
   0x00000000004186a3 <+1235>:  callq  0x417160 <runtime.gcMarkDone>
   0x00000000004186a8 <+1240>:  mov    %fs:0xfffffffffffffff8,%rax
(gdb) disassemble 'runtime.gcMarkDone'
   0x000000000041738b <+555>:   callq  0x418960 <runtime.gcFlushGCWork>
   0x0000000000417390 <+560>:   callq  0x41a650 <runtime.gcWakeAllAssists>
(gdb) disassemble 'runtime.gcFlushGCWork'  
   0x00000000004189dd <+125>:   callq  0x459290 <runtime.morestack_noctxt>
   0x00000000004189e2 <+130>:   jmpq   0x418960 <runtime.gcFlushGCWork>

This appears to be a nested signal. The original signal was a SIGSEGV in morestack at MOVQ m_g0(BX), SI because BX (getg().m) is 0, which matches rbx = 0x0 in the saved sigcontext. The signal handler then crashed for the same reason: sigtrampgo also dereferences g.m. We clearly tried to grow the stack in gcFlushGCWork (the stack is very small because this test runs in gcstackbarrierall mode), but I'm not sure why there wasn't an M at that point. It may be related to the fact that gcMarkDone has already stopped the world by then.
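To make the failure mode concrete, here is a minimal sketch in ordinary Go of the state seen in the core: a g whose m field is nil. The types `g` and `m` below are toy stand-ins, not the runtime's real structures, and `signalG` is a hypothetical helper invented for illustration. It is not the runtime's code and not a proposed fix; a check like this would only make the handler fail more legibly, and it does not explain why the M was missing in the first place.

```go
package main

import "fmt"

// Toy stand-ins for the runtime's g and m. They carry only the fields the
// report talks about and are NOT the real runtime types.
type m struct {
	gsignal *g // goroutine whose stack the signal handler runs on
}

type g struct {
	goid int64
	m    *m
}

// signalG is a hypothetical helper: it returns the signal-handling g for
// gp, or false when gp is nil or gp has no m (the state in the core dump).
func signalG(gp *g) (*g, bool) {
	if gp == nil || gp.m == nil {
		return nil, false
	}
	return gp.m.gsignal, true
}

func main() {
	sigG := &g{goid: 1}
	healthy := &g{goid: 5, m: &m{gsignal: sigG}}
	orphan := &g{goid: 6} // g != nil but m == nil, as in the report

	for _, gp := range []*g{nil, healthy, orphan} {
		if target, ok := signalG(gp); ok {
			fmt.Println("would switch to signal g with goid", target.goid)
		} else {
			fmt.Println("no usable m; cannot switch to a signal stack")
		}
	}
}
```

The unguarded form, `setg(g.m.gsignal)`, corresponds to calling `gp.m.gsignal` directly on `orphan`, which faults on the nil `m` just as line 20 of signal_linux.go does in the backtrace.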

This is relatively easy to reproduce. It happened five times out of 3,000 stress runs on my workstation (which took ~25 minutes).
