Docker run hangs after parallel builds - symptom is TestRouter in integration #12236

Closed
smarterclayton opened this issue Dec 12, 2016 · 19 comments
Labels: area/tests, component/build, kind/test-flake, priority/P1

Comments

@smarterclayton
Contributor

smarterclayton commented Dec 12, 2016

TestRouter was flaking after we enabled parallel builds. It is the first execution of a container after the builds run (several minutes later); the container would be created and started, but the process inside never actually launched. The test would fail, the container would be cleaned up, and subsequent containers worked fine. I suspect this is a bug in docker or devicemapper that is triggered by a race in docker.

Narrowed down to changes caused by #12218, will revert unless we can triage the change before tomorrow morning.

@smarterclayton
Contributor Author

@smarterclayton smarterclayton self-assigned this Dec 12, 2016
@smarterclayton smarterclayton added kind/test-flake priority/P0 labels Dec 12, 2016
@bparees
Contributor

bparees commented Dec 12, 2016

@smarterclayton
Contributor Author

Yeah it's almost every build.

@smarterclayton
Contributor Author

TestRouter starts its server, then tries to start the router container. Create and Start succeed, but no output is ever logged. The test's wait times out, then the container is torn down:

Dec 12 17:40:16 ip-172-18-8-178.ec2.internal dockerd-current[2303]: time="2016-12-12T17:40:16.633978636-05:00" level=info msg="{Action=create, Username=ec2-user, LoginUID=1000, PID=30486}"
Dec 12 17:40:16 ip-172-18-8-178.ec2.internal dockerd-current[2303]: time="2016-12-12T17:40:16.921336740-05:00" level=info msg="{Action=start, Username=ec2-user, LoginUID=1000, PID=30486}"

# we invoke start - and then router_test tries to listen on the channel (which implies that container.Running returned true)

Dec 12 17:40:49 ip-172-18-8-178.ec2.internal dockerd-current[2303]: time="2016-12-12T17:40:49.192398481-05:00" level=info msg="{Action=stop, Username=ec2-user, LoginUID=1000, PID=30486}"
Dec 12 17:40:49 ip-172-18-8-178.ec2.internal dockerd-current[2303]: time="2016-12-12T17:40:49.504717681-05:00" level=info msg="{Action=logs, Username=ec2-user, LoginUID=1000, PID=30486}"

# we wait 30 seconds, then fail, go into the defer for container cleanup, and get the logs

# now we start the next container

Dec 12 17:40:49 ip-172-18-8-178.ec2.internal dockerd-current[2303]: time="2016-12-12T17:40:49.506453258-05:00" level=info msg="{Action=remove, Username=ec2-user, LoginUID=1000, PID=30486}"
Dec 12 17:40:50 ip-172-18-8-178.ec2.internal dockerd-current[2303]: time="2016-12-12T17:40:50.246975397-05:00" level=info msg="{Action=create, Username=ec2-user, LoginUID=1000, PID=30609}"
Dec 12 17:40:50 ip-172-18-8-178.ec2.internal dockerd-current[2303]: time="2016-12-12T17:40:50.483641328-05:00" level=info msg="{Action=start, Username=ec2-user, LoginUID=1000, PID=30609}"
Dec 12 17:40:52 ip-172-18-8-178.ec2.internal dockerd-current[2303]: I1212 22:40:52.312631       1 reflector.go:200] Starting reflector *api.Service (10m0s) from github.com/openshift/origin/pkg/router/template/service_lookup.go:30

What in the world could cause the container to fail to start up the first time? @mrunalp this may be related to the suspicious hangs on docker build I was seeing.
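
For anyone else digging through a failed run, here is a rough way to pull the same daemon audit lines out of the journal and line them up against the test log (the unit name and the time window here are assumptions, not taken from the failing host):

# Pull the daemon's audit lines for the window around the failing test;
# adjust the unit name and timestamps for the host being triaged.
journalctl -u docker --since "2016-12-12 17:40:00" --until "2016-12-12 17:41:00" \
  | grep -E 'Action=(create|start|stop|logs|remove)'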

@smarterclayton
Contributor Author

Is it possible that running parallel builds queues up a lot of container removes, such that the next container create (which is when TestRouter runs) can't make progress? And then after the remove and cleanup the next container starts fine?
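
One way to check that theory (just a sketch, not wired into the test): watch the daemon's event stream while the parallel builds wind down and see whether a burst of container removals lines up with the stalled create.

# Stream container destroy events from the daemon; a long tail of these
# right before TestRouter's create would support the queued-removes theory.
docker events --filter 'type=container' --filter 'event=destroy'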

@mrunalp
Member

mrunalp commented Dec 13, 2016

@smarterclayton I tried to reproduce this as you suggested. I started make release, and when I tried to run a container a few minutes later it took forever. The one time I was able to launch top, wa (iowait) was around 95%.

[root@ip-172-18-2-207 origin]# time docker run -it --rm fedora ls


real    52m56.074s
user    0m0.014s
sys     0m0.046s
[root@ip-172-18-2-207 origin]# 
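
If anyone retries this, it may help to capture iowait over the whole window instead of eyeballing top; something along these lines (assumes procps-ng's vmstat with timestamp support is on the AMI):

# Record CPU/iowait with timestamps in the background, then run the timed
# container so the stall can be lined up against the wa column afterwards.
vmstat -t 5 > /tmp/vmstat.log &
VMSTAT_PID=$!
time docker run -it --rm fedora ls
kill "$VMSTAT_PID"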

@smarterclayton
Contributor Author

Oooo. I ran without parallel builds and the issue stopped reproducing. So it looks like parallel docker builds on the Amazon AMI cause a problem for the next container run. I'm going to try a workaround that creates a container and then removes it and see if that has an impact (or starts one and then immediately stops it).
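
Roughly what I have in mind for the warm-up, run right after the builds and before the integration tests (a sketch; busybox is just a placeholder image):

# Create and immediately remove a throwaway container so any post-build
# stall in the daemon is absorbed here rather than by TestRouter.
id=$(docker create busybox true) && docker rm -f "$id"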

@smarterclayton
Contributor Author

Mrunal, do you have any suspicions about what could cause this?

@smarterclayton smarterclayton changed the title TestRouter in integration is failing due to docker build Docker run hangs after parallel builds - symptom is TestRouter in integration Dec 13, 2016
@mrunalp
Member

mrunalp commented Dec 13, 2016

@smarterclayton I think we may just be hitting the VM's I/O limits. That said, I plan to enable some more debugging and retry this today.

@smarterclayton
Contributor Author

smarterclayton commented Dec 13, 2016 via email

@danmcp
Contributor

danmcp commented Dec 13, 2016

@smarterclayton There isn't a monitor; you can only watch the I/O and estimate whether your usage might be approaching the limit after X amount of time. You can also rerun the test with a larger disk or with provisioned IOPS and see if it still hits it.
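
For the record, a rough way to do that watching (a sketch; assumes sysstat is installed on the test host):

# Extended per-device stats every 10 seconds; sustained ~100% %util on
# docker's storage device during and after the builds would point at the
# EBS volume limit rather than the daemon itself.
iostat -dxm 10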

@stevekuznetsov
Contributor

@smarterclayton this has been resolved, right? As in, no parallel builds are happening and this shouldn't manifest today.

@smarterclayton
Contributor Author

smarterclayton commented Dec 20, 2016 via email

@smarterclayton
Contributor Author

This may be related to #11016 - if the kubelet is aborting container starts at some point, it's likely this would manifest the same way.

@smarterclayton smarterclayton modified the milestones: 1.5.0, 1.6.0 Mar 6, 2017
@smarterclayton
Contributor Author

Moving to 1.6 because we have coverage on #11016

@liggitt
Contributor

liggitt commented May 23, 2017

is this still an issue?

@stevekuznetsov
Contributor

We need to revisit parallel builds for images; that could significantly reduce test duration.

@smarterclayton smarterclayton modified the milestones: 3.6.0, 3.6.x Oct 1, 2017
@bparees
Contributor

bparees commented Oct 25, 2017

@stevekuznetsov did you do so? This seems like a dead issue at this point.

@stevekuznetsov
Contributor

/close

We're doing parallel builds with Origin Builds in the future anyway
