This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Segmentation fault: 11 #17043

Open
@tranvanhoa533

Description

I trained ArcFace on 8 GPUs and hit a segmentation fault after roughly 25,500 batches (see the log below).

Error Message

INFO:root:Iter[0] Batch [25300]	Speed: 521.90 samples/sec
INFO:root:Iter[25320] fc7_acc 0.0 	 fc7_ce 13.38002197265625
INFO:root:Iter[0] Batch [25320]	Speed: 515.28 samples/sec
INFO:root:Iter[25340] fc7_acc 0.0 	 fc7_ce 13.379957275390625
INFO:root:Iter[0] Batch [25340]	Speed: 514.40 samples/sec
INFO:root:Iter[25360] fc7_acc 0.004999999888241291 	 fc7_ce 13.37499755859375
INFO:root:Iter[0] Batch [25360]	Speed: 495.12 samples/sec
INFO:root:Iter[25380] fc7_acc 0.0 	 fc7_ce 13.380018310546875
INFO:root:Iter[0] Batch [25380]	Speed: 517.99 samples/sec
INFO:root:Iter[25400] fc7_acc 0.0 	 fc7_ce 13.3799951171875
INFO:root:Iter[0] Batch [25400]	Speed: 516.58 samples/sec
INFO:root:Iter[25420] fc7_acc 0.0024999999441206455 	 fc7_ce 13.377520751953124
INFO:root:Iter[0] Batch [25420]	Speed: 499.38 samples/sec
INFO:root:Iter[25440] fc7_acc 0.0024999999441206455 	 fc7_ce 13.37696044921875
INFO:root:Iter[0] Batch [25440]	Speed: 515.53 samples/sec
INFO:root:Iter[25460] fc7_acc 0.0 	 fc7_ce 13.3800244140625
INFO:root:Iter[0] Batch [25460]	Speed: 527.34 samples/sec
INFO:root:Iter[25480] fc7_acc 0.0 	 fc7_ce 13.38001953125
INFO:root:Iter[0] Batch [25480]	Speed: 504.62 samples/sec
INFO:root:Iter[25500] fc7_acc 0.0 	 fc7_ce 13.38001953125
INFO:root:Iter[0] Batch [25500]	Speed: 527.38 samples/sec
INFO:root:Iter[25520] fc7_acc 0.0 	 fc7_ce 13.37994873046875
INFO:root:Iter[0] Batch [25520]	Speed: 514.74 samples/sec

Segmentation fault: 11

Stack trace returned 10 entries:
[bt] (0) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x4015ca) [0x7f7ca48725ca]
[bt] (1) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x341c826) [0x7f7ca788d826]
[bt] (2) /lib64/libc.so.6(+0x363b0) [0x7f7da0e303b0]
[bt] (3) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x309b98e) [0x7f7ca750c98e]
[bt] (4) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x30a03d5) [0x7f7ca75113d5]
[bt] (5) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x30a0f6f) [0x7f7ca7511f6f]
[bt] (6) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x2e8) [0x7f7ca71d0618]
[bt] (7) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2cb0b09) [0x7f7ca7121b09]
[bt] (8) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2cba444) [0x7f7ca712b444]
[bt] (9) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2cbe5d2) [0x7f7ca712f5d2]
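
Most frames in the backtrace above are unsymbolized offsets into libmxnet.so, so it does not show which Python call was in flight when the fault occurred. One way to get that extra context is Python's standard `faulthandler` module, which dumps the Python-level traceback of every thread when a fatal signal arrives. The following is a minimal sketch, not part of the original training script; the helper name `enable_crash_tracebacks` and the log file name are hypothetical, and it only reports Python frames, so it complements the native backtrace rather than replacing it.

```python
import faulthandler
import sys


def enable_crash_tracebacks(log_path=None):
    """Enable faulthandler so a fatal signal (e.g. SIGSEGV) dumps Python tracebacks.

    log_path is a hypothetical file name chosen for illustration.
    If it is None, the dump goes to sys.stderr.
    The returned stream must stay open for the lifetime of the process.
    """
    stream = open(log_path, "w") if log_path else sys.stderr
    # all_threads=True dumps every Python thread, not only the faulting one.
    faulthandler.enable(file=stream, all_threads=True)
    return stream


if __name__ == "__main__":
    # Assumed usage: call this once near the top of the training script.
    crash_log = enable_crash_tracebacks("segfault_traceback.log")
    print("faulthandler enabled:", faulthandler.is_enabled())
```

The same effect is available without code changes by starting the interpreter with `python -X faulthandler` or setting the `PYTHONFAULTHANDLER` environment variable, in which case the dump goes to stderr.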

To Reproduce

I used the code from the insightface repository and ran train_parall.py with per-batch-size 50.

What have you tried to solve it?

I tried installing different MXNet versions (1.4.0, 1.4.1, 1.5.0, 1.5.1) via pip.

Environment

  • Python 3.6
  • CentOS 7.6
  • CUDA 10.0
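
The list above does not record which MXNet build was active when the crash happened, and several versions were tried. A small script like the following could capture that alongside the Python and OS details. This is a sketch under assumptions: the function name `collect_environment` is hypothetical, and the only MXNet attribute it relies on is `mxnet.__version__`.

```python
import platform


def collect_environment():
    """Return a dict of version details useful in a crash report."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    try:
        # Imported lazily so the script still runs where MXNet is absent.
        import mxnet
        info["mxnet"] = mxnet.__version__
    except ImportError:
        info["mxnet"] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

Running it inside the same virtual environment as the training job confirms which of the installed versions was actually picked up.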
