Docker run hangs after parallel builds - symptom is TestRouter in integration #12236
Comments
Yeah, it's almost every build.
TestRouter starts its server, then tries to start the router container. Create and Start succeed - but no output is ever logged. The create times out, then the container is torn down:
What in the world could cause the container to fail to start up the first time? @mrunalp may be related to the suspicious hangs on docker build I was seeing.
Is it possible that running parallel builds queues up a lot of container removes, such that the next container create (which is when TestRouter runs) can't make progress? And then after the remove and cleanup, the next container starts fine?
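One way to check that theory (just a sketch, assuming docker 1.10+ so `docker events` supports type filters; the log path and grep pattern are placeholders) would be to stream the daemon's container events while the parallel builds tear down and see whether a pile-up of destroy events lines up with the stalled create:

```bash
# Stream container lifecycle events in the background while the parallel
# builds finish; a burst of destroy events right before the stalled
# create would support the queued-removes theory.
docker events --filter 'type=container' > /tmp/docker-events.log &
EVENTS_PID=$!

# ... run the parallel builds, then the first test container, here ...

kill "$EVENTS_PID"
grep -E ' (create|destroy) ' /tmp/docker-events.log
```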
@smarterclayton I tried to reproduce this as you suggested. I started make release and when I tried to run a container a few minutes later, it took forever. The one time I was able to launch top, iowait (wa) was around 95%.
Oooo. I ran without parallel builds and the issue stopped reproducing. So it looks like parallel docker builds on the Amazon AMI cause a problem for the next container to be run. I'm going to try a workaround that creates a container and then removes it and see if that has an impact (or start and then immediately stop).
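For reference, that warm-up workaround would look roughly like this (a sketch only; `busybox` and the `warmup` name are just stand-ins for any small image already present on the host):

```bash
# Option 1: create a throwaway container and immediately remove it, so any
# queued devicemapper/remove work gets flushed before the first real test
# container runs.
cid=$(docker create busybox true)
docker rm -f "$cid"

# Option 2: start a container and immediately stop it.
docker run -d --name warmup busybox sleep 60
docker stop warmup && docker rm warmup
```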
Mrunal, do you have any suspicions about what could cause this?
@smarterclayton I think that we may just be hitting the VM limits. That said, I plan to enable some more debug logging and retry this today.
A 52 minute pause? @danmcp if this is the IO credit shortfall, how can we verify?
@smarterclayton There isn't a monitor; you can only watch the I/O and estimate whether your usage might be approaching the limit after X amount of time. You can also rerun the test with a larger disk or provisioned IOPS and see if it still hits it.
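For the record, the rough estimate would come from comparing sustained IOPS against the gp2 baseline, something like this (a sketch, assuming a gp2 EBS root volume and the sysstat package installed; the volume size is a placeholder):

```bash
# gp2 earns ~3 IOPS per GiB as a baseline, bursts to 3000 IOPS, and starts
# with ~5.4M I/O credits (about 30 minutes at full burst).
VOL_GIB=100                        # placeholder: use the actual volume size
BASELINE=$((VOL_GIB * 3))
echo "baseline IOPS for ${VOL_GIB} GiB gp2: ${BASELINE}"

# Watch the root volume's r/s + w/s columns while 'make release' runs; if
# their sum sits well above the baseline for tens of minutes, a burst-credit
# shortfall is plausible.
iostat -dx 5
```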
@smarterclayton this has been resolved, right? As in, no parallel builds are happening and this shouldn't manifest today.
The bug in docker or the kernel isn't fixed. This should be repurposed since parallelism is a net win.
This may be related to #11016 - if the kubelet is aborting starts at some point, it's likely it would manifest the same way.
Moving to 1.6 because we have coverage on #11016.
Is this still an issue?
We need to revisit parallel builds for images; it could significantly reduce test duration.
@stevekuznetsov did you do so? This seems like a dead issue at this point.
/close We're doing parallel builds with Origin.
TestRouter was flaking after we enabled parallel builds. It's the first execution of a container after we run the builds (several minutes later); the container would create and start, but never actually launch the process. The test would fail, the container would be cleaned up, and subsequent containers worked fine. Suspect this is a bug in docker or devicemapper that is triggered by a race in docker.
Narrowed down to changes caused by #12218, will revert unless we can triage the change before tomorrow morning.
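A minimal reproduction of the shape of the failure would look roughly like this (a sketch; the tag names and Dockerfiles are placeholders, not the actual release targets):

```bash
# Kick off several image builds in parallel, roughly what `make release`
# does with parallel builds enabled.
for i in 1 2 3 4; do
  docker build -t "repro-$i" -f "Dockerfile.$i" . &
done
wait

# First container run after the builds: when the bug is hit, create/start
# return but the process inside never emits output, so this appears to hang.
time docker run --rm busybox echo "container started"
```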