This repository was archived by the owner on Nov 17, 2023. It is now read-only.
Segmentation fault: 11 #17043
Open
Description
I trained ArcFace on 8 GPUs and hit a segmentation fault after some iterations (around batch 25520 in the log below).
Error Message
INFO:root:Iter[0] Batch [25300] Speed: 521.90 samples/sec
INFO:root:Iter[25320] fc7_acc 0.0 fc7_ce 13.38002197265625
INFO:root:Iter[0] Batch [25320] Speed: 515.28 samples/sec
INFO:root:Iter[25340] fc7_acc 0.0 fc7_ce 13.379957275390625
INFO:root:Iter[0] Batch [25340] Speed: 514.40 samples/sec
INFO:root:Iter[25360] fc7_acc 0.004999999888241291 fc7_ce 13.37499755859375
INFO:root:Iter[0] Batch [25360] Speed: 495.12 samples/sec
INFO:root:Iter[25380] fc7_acc 0.0 fc7_ce 13.380018310546875
INFO:root:Iter[0] Batch [25380] Speed: 517.99 samples/sec
INFO:root:Iter[25400] fc7_acc 0.0 fc7_ce 13.3799951171875
INFO:root:Iter[0] Batch [25400] Speed: 516.58 samples/sec
INFO:root:Iter[25420] fc7_acc 0.0024999999441206455 fc7_ce 13.377520751953124
INFO:root:Iter[0] Batch [25420] Speed: 499.38 samples/sec
INFO:root:Iter[25440] fc7_acc 0.0024999999441206455 fc7_ce 13.37696044921875
INFO:root:Iter[0] Batch [25440] Speed: 515.53 samples/sec
INFO:root:Iter[25460] fc7_acc 0.0 fc7_ce 13.3800244140625
INFO:root:Iter[0] Batch [25460] Speed: 527.34 samples/sec
INFO:root:Iter[25480] fc7_acc 0.0 fc7_ce 13.38001953125
INFO:root:Iter[0] Batch [25480] Speed: 504.62 samples/sec
INFO:root:Iter[25500] fc7_acc 0.0 fc7_ce 13.38001953125
INFO:root:Iter[0] Batch [25500] Speed: 527.38 samples/sec
INFO:root:Iter[25520] fc7_acc 0.0 fc7_ce 13.37994873046875
INFO:root:Iter[0] Batch [25520] Speed: 514.74 samples/sec
Segmentation fault: 11
Stack trace returned 10 entries:
[bt] (0) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x4015ca) [0x7f7ca48725ca]
[bt] (1) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x341c826) [0x7f7ca788d826]
[bt] (2) /lib64/libc.so.6(+0x363b0) [0x7f7da0e303b0]
[bt] (3) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x309b98e) [0x7f7ca750c98e]
[bt] (4) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x30a03d5) [0x7f7ca75113d5]
[bt] (5) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x30a0f6f) [0x7f7ca7511f6f]
[bt] (6) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x2e8) [0x7f7ca71d0618]
[bt] (7) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2cb0b09) [0x7f7ca7121b09]
[bt] (8) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2cba444) [0x7f7ca712b444]
[bt] (9) /home/zdeploy/AILab/hoavt2/dl-py3-ku/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x2cbe5d2) [0x7f7ca712f5d2]
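The backtrace above contains only native frames from libmxnet.so, so it does not show which Python call was active when the crash happened. One way to get that extra information is Python's standard `faulthandler` module. The snippet below is a minimal sketch, not part of insightface or MXNet: the helper name `enable_crash_traceback` and the log file name are made up for illustration, and the call would have to be added near the top of train_parall.py. Because the failing operator appears to run on an MXNet engine worker thread (frame 6 is inside `PushFCompute`), the Python traceback may only show the main thread waiting, so this may narrow the problem down rather than pinpoint it.

```python
import faulthandler


def enable_crash_traceback(log_path="segfault_traceback.log"):
    """Dump the Python traceback of all threads to log_path on a fatal signal.

    Hypothetical helper: the name and default path are illustrative only.
    """
    # The file must stay open for as long as the handler is enabled,
    # so the handle is returned to the caller instead of being closed here.
    log_file = open(log_path, "w")
    # Installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    handle = enable_crash_traceback()
    print("faulthandler enabled:", faulthandler.is_enabled())
```

The same handler can be switched on without editing the script by running `python -X faulthandler train_parall.py ...` or by setting the `PYTHONFAULTHANDLER` environment variable, in which case the traceback goes to stderr.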
To Reproduce
I used the code from the insightface repository and ran train_parall.py with per-batch-size 50.
What have you tried to solve it?
I tried installing different MXNet versions (1.4.0, 1.4.1, 1.5.0, 1.5.1) via pip.
Environment
- Python 3.6
- CentOS 7.6
- CUDA 10.0
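The environment list does not say which MXNet pip distribution is installed (for example plain `mxnet` versus a CUDA build such as `mxnet-cu100`), which matters when several versions have been installed in turn. The following is a rough sketch for collecting that information using only the standard library. The function names are hypothetical. It assumes `importlib.metadata`, which is in the standard library from Python 3.8; on Python 3.6 the `importlib_metadata` backport would need to be installed for the fallback import to work.

```python
import platform

try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    import importlib_metadata as metadata  # backport, assumed installed on 3.6


def installed_mxnet_packages():
    """Return {distribution name: version} for installed packages named mxnet*."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name and name.lower().startswith("mxnet"):
            found[name] = dist.version
    return found


def environment_report():
    """Collect the details that are useful to paste into a bug report."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "mxnet_packages": installed_mxnet_packages(),
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If more than one `mxnet*` distribution shows up, leftovers from an earlier install may be shadowing the intended build. That is only a possibility to rule out, not a confirmed cause of this crash.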