Router fails to get endpoints during e2e #8403
Early in the run:
@Kargakis @ironcladlou should this happen in e2e? Getting two deployments of the router? |
NM, that's because of how the router install works. It's triggering a reload by changing the env for the syn eater. |
This seems wrong:
|
|
So it looks like we schedule both deployment pods. The second one? starts running, and creates at least one pod in the second group. It scales up, then for some reason gets scaled down (failed, death?) at
It is incredibly worrying to me that we don't see the logs of the failed pods - it may be because they were empty, but I'm not seeing how the deployment logs could be empty at all. |
Looks like we had two deployment pods running at the same time. That shouldn't be happening? |
Moving this to @pweil-'s team for triage |
@Kargakis - I haven't been able to reproduce this yet. Of note in the logs, there were some scheduling issues due to the node not being registered. I tried to reproduce that by deleting my node and re-adding it right after a back-off in scheduling deploy-1 (after I triggered a config change), but that didn't seem to work. Sending your way to continue looking. Will check in in the morning |
May be useful to run only install_router and then wait for endpoints, then tear it down over and over. The deploy race doesn't seem to be correlated with other workloads. |
The second deployment comes from the config generator
What might have changed the router template between 1 and 2? Or is it probably a race? Looking |
ouch |
It seems that the first deployment is cancelled by the deploymentconfig controller because it noticed the second version:
The deployment correctly switches from New to Failed:
Later, the already existing deployer is being scheduled
The deployer controller and the deployer don't care much about the deployment phase, so eventually the deployer pod will complete successfully and switch the deployment from Failed to Complete. Still not sure why the second deployment started - we've probably been hitting this forever, but after #8163 we started cleaning up the cancel annotation on successful deployers. I think we should enforce the deployment phase direction for now and fail deployers if a deployment is already failed. |
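A minimal sketch of the phase-direction rule being proposed in the comment above. The phase names and helpers are hypothetical stand-ins rather than the real deployapi types; the point is only that a deployment already in a terminal phase (Failed or Complete) is never moved again, so a late-finishing deployer cannot flip Failed back to Complete:

```go
package main

import "fmt"

// Phase mirrors the deployment phases discussed in this thread (hypothetical
// names, not the real deployapi constants).
type Phase string

const (
	PhaseNew      Phase = "New"
	PhasePending  Phase = "Pending"
	PhaseRunning  Phase = "Running"
	PhaseComplete Phase = "Complete"
	PhaseFailed   Phase = "Failed"
)

// rank orders phases so a deployment can only move "forward"; terminal
// phases (Complete, Failed) share the highest rank and are never left.
func rank(p Phase) int {
	switch p {
	case PhaseNew:
		return 0
	case PhasePending:
		return 1
	case PhaseRunning:
		return 2
	case PhaseComplete, PhaseFailed:
		return 3
	}
	return -1
}

// canTransition rejects any update that would move a deployment backward or
// flip one terminal phase to another (e.g. Failed -> Complete).
func canTransition(from, to Phase) bool {
	if rank(from) == 3 {
		return false // already terminal; nothing may change it
	}
	return rank(to) > rank(from)
}

func main() {
	fmt.Println(canTransition(PhaseFailed, PhaseComplete)) // false: the case hit in this issue
	fmt.Println(canTransition(PhaseNew, PhaseFailed))      // true: normal cancellation
}
```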
The second deployment happens because of Line 628 in 28e2e75
I think #8417 and #8418 should be enough for this |
Saw this happen again https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/14441/artifact/origin/artifacts/test-end-to-end-docker/logs/container-router-1-deploy-deployment.log
|
@Kargakis One thing I don't think we accounted for is the strategy code itself checking to ensure the deployment it's about to execute is still valid. Even though we take steps to prevent their scheduling in the first place, and handle the outcomes more correctly, it's still possible the invalidated deployer pod gets scheduled and then spins up before deadlines or deletions are effective. I'm thinking the strategy code itself needs to inspect the state of things and bail. This is evidenced by the output @smarterclayton brought: the strategy code knew about the later deployment which should invalidate the executing code, but instead just scales it down and keeps going. |
I think by this point we have more than enough context to exit early if we detect the deployment is superseded. |
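A sketch of that early exit, assuming the strategy can read its own deployment's version and the config's latest version; the helper below is a hypothetical illustration, not the real strategy code:

```go
package main

import (
	"fmt"
	"os"
)

// ensureNotSuperseded is a hypothetical pre-flight check for the strategy:
// refuse to execute a deployment whose version is lower than the config's
// latest version, instead of scaling the newer deployment down.
func ensureNotSuperseded(targetVersion, latestVersion int64) error {
	if targetVersion < latestVersion {
		return fmt.Errorf("deployment version %d has been superseded by version %d; exiting without making changes",
			targetVersion, latestVersion)
	}
	return nil
}

func main() {
	// router-1-deploy waking up after router-2 has already deployed.
	if err := ensureNotSuperseded(1, 2); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```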
Question is - are we running multiple here, or just one? And what changed that suddenly we're running parallel deployments? Is this new? Old? Why did this start happening now? |
Looked to me like router-2 deployed successfully, and then the router-1-deploy pod got scheduled and executed. The common deployer logic which detects the "from" deployment and "old" deployments has a problem: it considers the latest of all completed deployments not equal to the target (router-1 in this case) to be the "from" and everything else to be "old". So if router-2 is completed before router-1-deploy is scheduled and router-1-deploy gets an opportunity to execute, router-2 is considered "from". I think version checking here will correct it. I'm not sure it has anything to do with parallelism- it still seems to be related to a pending deployer pod getting hung up in scheduling and getting a chance to run after a newer deployment was already created and deployed. @Kargakis something else I noticed after a fresh look this morning is that the deploymentConfigController should be waiting for cancelled deployments to have an outcome before creating the new one. If the old deployment has a deployer pod hanging out in pending awaiting scheduling, I'm not sure why a new version would even be created leading to a new version/deployer pod unless the old deployment erroneously got updated to failed/complete before its deployer pod finished. So in addition to the common deployer code not doing the right thing when we get into this situation, it's not yet clear to me how router-2 was allowed to exist to begin with prior to the completion of router-1. |
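A sketch of the version check suggested here, using hypothetical types in place of the real RC and annotation plumbing. The rule is that only completed deployments older than the target can be treated as "from" or "old", so a stale deployer never scales down a newer deployment:

```go
package main

import "fmt"

// deployment is a hypothetical stand-in for a deployment RC plus the version
// read from its version annotation.
type deployment struct {
	Name     string
	Version  int64
	Complete bool
}

// fromDeployment only considers completed deployments *older* than the target
// as candidates for "from", so a stale deployer (router-1-deploy) never
// treats a newer, already-complete deployment (router-2) as the thing to
// scale down.
func fromDeployment(target deployment, all []deployment) (deployment, bool) {
	var from deployment
	found := false
	for _, d := range all {
		if !d.Complete || d.Version >= target.Version {
			continue // newer or incomplete deployments are never "from"
		}
		if !found || d.Version > from.Version {
			from, found = d, true
		}
	}
	return from, found
}

func main() {
	target := deployment{Name: "router-1", Version: 1}
	all := []deployment{{Name: "router-2", Version: 2, Complete: true}}
	if _, ok := fromDeployment(target, all); !ok {
		fmt.Println("no valid 'from' deployment; the stale deployer should bail")
	}
}
```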
Didn't realize @Kargakis is out of the office- I'll be digging into this later in the morning. |
Instead of waiting for previous cancelled deployments, can't we just delete the older deployers? Also, the deployer code seems broken. |
Thinking about this yet again, patching the deployer CLI is only a bandaid. The router-2 RC shouldn't exist since router-1-deploy exists/is non-terminal and router-1 is presumably not complete/failed. That seems to be the root issue from what I can see. |
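To make the suspected root issue concrete, here is a sketch of the guard the deploymentConfigController seems to need, under the assumption that it can see each prior RC's phase and whether its deployer pod has resolved (all names below are hypothetical):

```go
package main

import "fmt"

// existingDeployment is a hypothetical view of an older deployment: the RC's
// phase plus whether its deployer pod has reached a terminal state.
type existingDeployment struct {
	Name             string
	Phase            string // "New", "Pending", "Running", "Complete", "Failed"
	DeployerTerminal bool   // deployer pod is gone, succeeded, or failed
}

// canCreateNextDeployment sketches the rule discussed above: a new RC should
// only be created once every prior deployment is in a terminal phase *and*
// its deployer pod has itself resolved, so router-2 cannot appear while
// router-1-deploy is still pending scheduling.
func canCreateNextDeployment(prior []existingDeployment) bool {
	for _, d := range prior {
		terminalPhase := d.Phase == "Complete" || d.Phase == "Failed"
		if !terminalPhase || !d.DeployerTerminal {
			return false
		}
	}
	return true
}

func main() {
	prior := []existingDeployment{{Name: "router-1", Phase: "Failed", DeployerTerminal: false}}
	fmt.Println(canCreateNextDeployment(prior)) // false: router-1-deploy is still out there
}
```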
Wouldn't it be confusing if a second deployment is expected but we still haven't created it? |
We should NEVER have more than one deployer pod running at a time. Ever. |
Agreed. That's why I am proposing deletion of older deployers when a newer one is created. |
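A sketch of that proposal, with a hypothetical pod-deleter interface standing in for the real client: when the deployer for version N is about to be created, any deployer pods for older versions are deleted first so two deployers never run at once:

```go
package main

import "fmt"

// podDeleter is a hypothetical stand-in for the kube pod client used by the
// controller.
type podDeleter interface {
	DeletePod(namespace, name string) error
}

type deployerPod struct {
	Name    string
	Version int64
}

// deleteOlderDeployers removes any lingering deployer pods that belong to
// deployments older than newVersion before the new deployer is created.
func deleteOlderDeployers(c podDeleter, namespace string, pods []deployerPod, newVersion int64) error {
	for _, p := range pods {
		if p.Version < newVersion {
			if err := c.DeletePod(namespace, p.Name); err != nil {
				return fmt.Errorf("deleting stale deployer %s: %v", p.Name, err)
			}
		}
	}
	return nil
}

// fakeDeleter just logs deletions for demonstration.
type fakeDeleter struct{}

func (fakeDeleter) DeletePod(ns, name string) error {
	fmt.Printf("deleted %s/%s\n", ns, name)
	return nil
}

func main() {
	pods := []deployerPod{{Name: "router-1-deploy", Version: 1}}
	_ = deleteOlderDeployers(fakeDeleter{}, "default", pods, 2)
}
```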
This is a problem: https://github.com/openshift/origin/blob/master/pkg/deploy/controller/deployment/controller.go#L61 If we set the RC phase to failed without waiting for the deployer pods to resolve to a terminal phase, the deployment config controller will see the RC in a terminal phase and decide that there are no outstanding deployments, and make a new one. I don't think we should ever transition the RC phase to a terminal phase unless:
Otherwise we can't rely on the RC terminal phases as being true in terms of deployer pods. |
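The exact conditions in the comment above didn't survive formatting, but one plausible reading is sketched below with hypothetical names: the RC only moves to a terminal phase once any deployer pod for it has itself finished or no longer exists.

```go
package main

import "fmt"

// deployerPodState is a hypothetical summary of the deployer pod for an RC.
type deployerPodState struct {
	Exists   bool
	Finished bool // pod phase is Succeeded or Failed
}

// mayMarkTerminal sketches the guard discussed above: the deployment
// controller may only move the RC to Failed/Complete once the deployer pod
// has resolved (or was never created), so a terminal RC phase reliably means
// no deployer pod is still going to run.
func mayMarkTerminal(pod deployerPodState) bool {
	return !pod.Exists || pod.Finished
}

func main() {
	// router-1-deploy exists but is stuck Pending on scheduling:
	fmt.Println(mayMarkTerminal(deployerPodState{Exists: true, Finished: false})) // false
}
```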
Good catch, I think this is the only place we do such a thing and never … |
Actually, that failed update is only happening when the RC is in New, in which case how would any deployer pods have been created to cause an issue? Maybe I jumped the gun... |
We create the deployer when the deployment is New, right? |
I just noticed that we create the pod and then break without transitioning out of New, so if before the next sync there's a cancellation, we'll enter the scenario I describe. So yeah, I think my analysis still stands. |
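One way to close that window, sketched with hypothetical types (this is an illustration, not necessarily what #8478 does): create the deployer pod and move the deployment out of New in the same sync, so a cancellation arriving before the next sync can no longer see a New deployment with a live deployer pod.

```go
package main

import "fmt"

// These types are hypothetical stand-ins for the controller's deployment and
// pod handling; the point is only the ordering of the two steps.
type deployment struct {
	Name  string
	Phase string
}

type podCreator func(deployerName string) error

// handleNewDeployment creates the deployer pod and updates the phase in the
// same sync, instead of breaking after pod creation and leaving the phase at
// New until the next sync (where a cancellation can slip in between).
func handleNewDeployment(d *deployment, create podCreator) error {
	if d.Phase != "New" {
		return nil
	}
	if err := create(d.Name + "-deploy"); err != nil {
		return err
	}
	d.Phase = "Pending" // updated immediately, not on a later sync
	return nil
}

func main() {
	d := &deployment{Name: "router-1", Phase: "New"}
	_ = handleNewDeployment(d, func(name string) error {
		fmt.Println("created deployer pod", name)
		return nil
	})
	fmt.Println("deployment phase:", d.Phase)
}
```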
@ironcladlou - assigning this to you since you're already working it and @Kargakis is on PTO until the end of the week. |
@smarterclayton is this now considered fixed as of #8478? |
@ironcladlou can we have an issue to fix the deployer too? Not a blocker but I think we should fix it for 1.3 |
It'll always be racing with something (cancellation, other deployments it might just miss when looking, etc.). Seems more valuable to keep on top of enforcing the "only 1 deployer pod should exist at a time" invariant externally. It's still probably useful to have a "last line of defense" in the deployer code. Not sure about 1.3 - I'll let @pweil- and @smarterclayton chime in there. |
Related: #8500 |
Deployers should bail if their invariants are violated and they detect it. We can do more in 1.3. Leaving open to ensure it's well and truly gone. |
removing from the blocker list for now then |
Opened #8972 for the deployer. Is there anything else tbd here? |
No. |
https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/5537/
Happening fairly frequently