Docker run hangs after parallel builds - symptom is TestRouter in integration #12236

Closed
smarterclayton opened this issue Dec 12, 2016 · 19 comments
Labels: area/tests, component/build, kind/test-flake, priority/P1

Comments

@smarterclayton
Contributor

smarterclayton commented Dec 12, 2016

TestRouter was flaking after we enabled parallel builds. It is the first execution of a container after the builds run (several minutes later); the container would be created and started, but the process inside never actually launched. The test would fail, the container would be cleaned up, and subsequent containers worked fine. I suspect this is a bug in docker or devicemapper that is triggered by a race in docker.

Narrowed down to changes caused by #12218, will revert unless we can triage the change before tomorrow morning.

@smarterclayton
Contributor Author

@smarterclayton smarterclayton self-assigned this Dec 12, 2016
@smarterclayton smarterclayton added kind/test-flake priority/P0 labels Dec 12, 2016
@bparees
Contributor

bparees commented Dec 12, 2016

@smarterclayton
Contributor Author

Yeah it's almost every build.

@smarterclayton
Contributor Author

TestRouter starts its server, then tries to start the router container. Create and Start succeed, but no output is ever logged. The test's wait times out, then the container is torn down:

Dec 12 17:40:16 ip-172-18-8-178.ec2.internal dockerd-current[2303]: time="2016-12-12T17:40:16.633978636-05:00" level=info msg="{Action=create, Username=ec2-user, LoginUID=1000, PID=30486}"
Dec 12 17:40:16 ip-172-18-8-178.ec2.internal dockerd-current[2303]: time="2016-12-12T17:40:16.921336740-05:00" level=info msg="{Action=start, Username=ec2-user, LoginUID=1000, PID=30486}"

# we invoke start - and then router_test tries to listen on the channel (which implies that container.Running returned true)

Dec 12 17:40:49 ip-172-18-8-178.ec2.internal dockerd-current[2303]: time="2016-12-12T17:40:49.192398481-05:00" level=info msg="{Action=stop, Username=ec2-user, LoginUID=1000, PID=30486}"
Dec 12 17:40:49 ip-172-18-8-178.ec2.internal dockerd-current[2303]: time="2016-12-12T17:40:49.504717681-05:00" level=info msg="{Action=logs, Username=ec2-user, LoginUID=1000, PID=30486}"

# we wait 30 seconds, then fail, go into the defer for container cleanup, and get the logs

# now we start the next container

Dec 12 17:40:49 ip-172-18-8-178.ec2.internal dockerd-current[2303]: time="2016-12-12T17:40:49.506453258-05:00" level=info msg="{Action=remove, Username=ec2-user, LoginUID=1000, PID=30486}"
Dec 12 17:40:50 ip-172-18-8-178.ec2.internal dockerd-current[2303]: time="2016-12-12T17:40:50.246975397-05:00" level=info msg="{Action=create, Username=ec2-user, LoginUID=1000, PID=30609}"
Dec 12 17:40:50 ip-172-18-8-178.ec2.internal dockerd-current[2303]: time="2016-12-12T17:40:50.483641328-05:00" level=info msg="{Action=start, Username=ec2-user, LoginUID=1000, PID=30609}"
Dec 12 17:40:52 ip-172-18-8-178.ec2.internal dockerd-current[2303]: I1212 22:40:52.312631       1 reflector.go:200] Starting reflector *api.Service (10m0s) from github.com/openshift/origin/pkg/router/template/service_lookup.go:30

What in the world could cause the container to fail to start up the first time? @mrunalp this may be related to the suspicious hangs on docker build I was seeing.
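
For anyone else digging through a failed run, here is a rough way to pull the same daemon audit lines out of the journal and line them up against the test log (the unit name and the time window here are assumptions, not taken from the failing host):

# Pull the daemon's audit lines for the window around the failing test;
# adjust the unit name and timestamps for the host being triaged.
journalctl -u docker --since "2016-12-12 17:40:00" --until "2016-12-12 17:41:00" \
  | grep -E 'Action=(create|start|stop|logs|remove)'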

@smarterclayton
Contributor Author

Is it possible that running parallel builds queues up a lot of container removes, such that the next container create (which is when TestRouter runs) can't make progress? And then after the remove and cleanup the next container starts fine?
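
One way to check that theory (just a sketch, not wired into the test): watch the daemon's event stream while the parallel builds wind down and see whether a burst of container removals lines up with the stalled create.

# Stream container destroy events from the daemon; a long tail of these
# right before TestRouter's create would support the queued-removes theory.
docker events --filter 'type=container' --filter 'event=destroy'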

@mrunalp
Member

mrunalp commented Dec 13, 2016

@smarterclayton I tried to reproduce this as you suggested. I started make release, and when I tried to run a container a few minutes later it took forever. The one time I was able to launch top, wa (iowait) was around 95%.

[root@ip-172-18-2-207 origin]# time docker run -it --rm fedora ls


real    52m56.074s
user    0m0.014s
sys     0m0.046s
[root@ip-172-18-2-207 origin]# 
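
If anyone retries this, it may help to capture iowait over the whole window instead of eyeballing top; something along these lines (assumes procps-ng's vmstat with timestamp support is on the AMI):

# Record CPU/iowait with timestamps in the background, then run the timed
# container so the stall can be lined up against the wa column afterwards.
vmstat -t 5 > /tmp/vmstat.log &
VMSTAT_PID=$!
time docker run -it --rm fedora ls
kill "$VMSTAT_PID"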

@smarterclayton
Contributor Author

Oooo. I ran without parallel builds and the issue stopped reproducing. So it looks like parallel docker builds on the Amazon AMI cause a problem for the next container run. I'm going to try a workaround that creates a container and then removes it and see if that has an impact (or starts one and then immediately stops it).
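
Roughly what I have in mind for the warm-up, run right after the builds and before the integration tests (a sketch; busybox is just a placeholder image):

# Create and immediately remove a throwaway container so any post-build
# stall in the daemon is absorbed here rather than by TestRouter.
id=$(docker create busybox true) && docker rm -f "$id"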

@smarterclayton
Contributor Author

Mrunal, do you have any suspicions about what could cause this?

@smarterclayton smarterclayton changed the title TestRouter in integration is failing due to docker build Docker run hangs after parallel builds - symptom is TestRouter in integration Dec 13, 2016
@mrunalp
Member

mrunalp commented Dec 13, 2016

@smarterclayton I think we may just be hitting the VM's I/O limits. That said, I plan to enable some more debugging and retry this today.

@smarterclayton
Contributor Author

smarterclayton commented Dec 13, 2016 via email

@danmcp
Contributor

danmcp commented Dec 13, 2016

@smarterclayton There isn't a monitor; you can only watch the I/O and estimate whether your usage might be approaching the limit after X amount of time. You can also rerun the test with a larger disk or with provisioned IOPS and see if it still hits it.
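
For the record, a rough way to do that watching (a sketch; assumes sysstat is installed on the test host):

# Extended per-device stats every 10 seconds; sustained ~100% %util on
# docker's storage device during and after the builds would point at the
# EBS volume limit rather than the daemon itself.
iostat -dxm 10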

@stevekuznetsov
Contributor

@smarterclayton this has been resolved, right? As in, no parallel builds are happening and this shouldn't manifest today.

@smarterclayton
Contributor Author

smarterclayton commented Dec 20, 2016 via email

@smarterclayton
Contributor Author

This may be related to #11016 - if the kubelet is aborting container starts at some point, it's likely this would manifest the same way.

@smarterclayton smarterclayton modified the milestones: 1.5.0, 1.6.0 Mar 6, 2017
@smarterclayton
Contributor Author

Moving to 1.6 because we have coverage on #11016

@liggitt
Contributor

liggitt commented May 23, 2017

is this still an issue?

@stevekuznetsov
Contributor

We need to revisit parallel builds for images; that could significantly reduce test duration.

@smarterclayton smarterclayton modified the milestones: 3.6.0, 3.6.x Oct 1, 2017
@bparees
Contributor

bparees commented Oct 25, 2017

@stevekuznetsov did you do so? This seems like a dead issue at this point.

@stevekuznetsov
Contributor

/close

We're doing parallel builds with Origin Builds in the future anyway
