(fix) registry pods do not come up again after node failure #3366
Conversation
So far it looks pretty good =D Should we add any unit tests for the new behaviour? And maybe the test we missed in the original bug fix?
I was scared you'd ask this. :p I did think about it, looked into it, and it looks like it's going to be a little bit of effort to write a test for this, mainly because it's not really unit testable, so an e2e test is the viable option. However, I'm not sure we can mimic a node going down from within our tests. I just figured the juice might not be worth the squeeze; however, if someone has a better idea (or really insists we include a test for this even if it'll require some effort), I'm happy to include one in this PR.
What's the signal we get in the code that a node has gone down? Are some pods unreachable?
We don't really get any signal that a node's gone down, we just discover pod/s that have been deleted but not removed. And that "discovery" is done here. We could add a test for that, but it's a pretty simple function. The real useful test would have been if we could mimic pods "hanging around" in an e2e setting. In fact, the entire …
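For readers following along, here is a rough sketch of what that "discovery" amounts to; `isPodDead` is a hypothetical name, not the helper used in this PR. The idea is that there is no explicit node-down signal: there are only pods the apiserver has already marked for deletion but that the lister still returns, because the kubelet on the unreachable node never confirmed their termination.

```go
package reconciler // hypothetical placement, for illustration only

import corev1 "k8s.io/api/core/v1"

// isPodDead: a pod counts as "dead" when it has been marked for deletion
// (DeletionTimestamp set) yet is still being listed, which is what happens
// when its node has gone away and nothing can finish tearing it down.
func isPodDead(pod *corev1.Pod) bool {
	return pod.DeletionTimestamp != nil
}
```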
I may have found a way to e2e test this... working on it now...
Would it be possible to mock those client responses? e.g. list says they are there, get says they aren't?
Okay I have some tests now.
This did not work out. I was thinking about using …
I went with the unit-test route, which mocks these interactions. So the entire feature is tested by three components of tests: …
This is the absolute best we can do. e2e isn't possible without much more digging in, and frankly at this point not necessary at all.
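A minimal sketch of what mocking those interactions can look like with client-go's fake clientset; the actual tests in this PR may use OLM's own fakes, and the names below are illustrative. The fake is seeded with a pod that is already marked for deletion, as happens when its node becomes unreachable, and the test asserts the pod is still listed in that state.

```go
package reconciler_test // hypothetical placement, for illustration only

import (
	"context"
	"testing"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func TestDeadPodIsStillListed(t *testing.T) {
	// A pod whose node went away: deleted five minutes ago, never removed.
	deletedAt := metav1.NewTime(time.Now().Add(-5 * time.Minute))
	deadPod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:              "catalog-registry-pod",
			Namespace:         "olm",
			DeletionTimestamp: &deletedAt,
		},
	}

	// The fake clientset plays the role of the apiserver in the test.
	client := fake.NewSimpleClientset(deadPod)

	pods, err := client.CoreV1().Pods("olm").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		t.Fatal(err)
	}
	if len(pods.Items) != 1 || pods.Items[0].DeletionTimestamp == nil {
		t.Fatal("expected one pod stuck in a deleted-but-not-removed state")
	}
}
```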
Pivoted the PR to include the logic inside CheckRegistryServer.
		continue
	}
	healthy = false
	forceDeletionErrs = append(forceDeletionErrs, pkgerrors.Errorf("found %s in a deleted but not removed state", pod.Name))
Do we need to return an error from this function when we find a wedged pod and successfully delete it? My initial thought is that we would only error from this function if:
- we failed to determine healthy
- we failed to delete wedged pods.
I started out thinking the same, but then realized that that'd be an "artificial" error. If we include both of the scenarios you mentioned in the definition of "error", we're sort of sullying the definition, which is "something went wrong". In the case of "we failed to determine healthy", we are already sending that signal through the boolean variable anyway. So I decided to not include both, and only include the second scenario of "we failed to delete the wedged pods".
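A hypothetical condensation of that choice (only `forceDeletionErrs` comes from the diff above; everything else is named for illustration): finding wedged pods only flips the healthy signal, and an error is returned solely when force-deleting one of them fails.

```go
package reconciler // hypothetical placement, for illustration only

import (
	corev1 "k8s.io/api/core/v1"
	utilerrors "k8s.io/apimachinery/pkg/util/errors"
)

// handleWedgedPods reports unhealthy whenever wedged pods are found, but only
// returns an error if one of the force deletions actually fails.
func handleWedgedPods(wedged []*corev1.Pod, forceDelete func(*corev1.Pod) error) (healthy bool, err error) {
	var forceDeletionErrs []error
	for _, pod := range wedged {
		if deleteErr := forceDelete(pod); deleteErr != nil {
			// something genuinely went wrong: the wedged pod could not be removed
			forceDeletionErrs = append(forceDeletionErrs, deleteErr)
		}
	}
	if len(forceDeletionErrs) > 0 {
		return false, utilerrors.NewAggregate(forceDeletionErrs)
	}
	// wedged pods existed but were cleaned up: signal via the boolean only
	return len(wedged) == 0, nil
}
```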
		logger.WithFields(logrus.Fields{"pod.namespace": sourceNamespace, "pod.name": pod.GetName()}).Debug("pod is alive")
		continue
	}
	foundDeadPod = true
This is trickier than I originally thought. If we have:
- at least one alive pod
- the alive pods are otherwise deemed healthy
- we successfully delete the dead pods
What should we return from CheckRegistryServer? Seems like we could say healthy in that case?
Yea this is a bit confusing, I'm thinking if we detect even one dead pod, we just force delete all the pods, and let EnsureRegistryServer recreate everything. Otherwise we're in a very non-deterministic state...
That seems like it could have significant performance implications. There's a lot of disk/CPU/memory cost paid when catalog pods are started, so we should minimize that as much as possible.
I think it is okay to force delete the dead pods, and if there are any alive pods, we just move forward as if those were the only pods that were there to begin with.
That's fair too. I was a bit confused about there being multiple pods in the first place, since my impression is that we only create one registry pod per catalog, but it turns out we have pods mainly because we use the lister to list via label selector. In any case this should return one pod, unless I'm unaware of some feature that allows us to spin up multiple registry pods per catalog.
Changed it back to just deleting dead pods.
There are definitely multiple pods when we're polling. In that case, we spin up a new pod to see what the digest is, and we compare the digests to see if we need to use the new pod (if it has a new digest) or keep using the old pod (when the digests match)
There may also be multiple pods when the pod spec changes (e.g. when the catalog source image is changed).
Right, both are transitory states though, and we're in that state because EnsureRegistryServer has already been called. But yes I think this is in a good state now.
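A small sketch of where that thread landed; `partitionPods` is a hypothetical helper, not code from this PR. Dead pods are separated out for force deletion, while any alive pods are kept and treated as if they were the only pods there to begin with.

```go
package reconciler // hypothetical placement, for illustration only

import corev1 "k8s.io/api/core/v1"

// partitionPods splits the listed registry pods into those still serving and
// those marked for deletion but still hanging around (e.g. on a lost node).
func partitionPods(pods []*corev1.Pod) (alive, dead []*corev1.Pod) {
	for _, pod := range pods {
		if pod.DeletionTimestamp != nil {
			dead = append(dead, pod)
			continue
		}
		alive = append(alive, pod)
	}
	return alive, dead
}
```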
@joelanford are you good with this?
	service == nil || c.currentServiceAccount(source) == nil {
		return false, nil
	}

	if deadPodsDetected, e := detectAndDeleteDeadPods(logger, c.OpClient, currentPods, source.GetNamespace()); deadPodsDetected {
I still think we should return true, nil from this function in the case where:
- There is at least one healthy pod
- There are multiple "dead" pods
- We deleted all of the "dead" pods successfully.
That sort of behavior would mean that we would short circuit and avoid calling EnsureRegistryServer, which (if called) would:
- end up making some no-op calls to the apiserver
- (maybe, not sure) have some strange issues related to caching, where subsequent calls to currentPods would still return the dead pods because the deletion has not propagated back to our informer cache yet (I'm assuming c.Lister is backed by a cache. If it is not, then we'd be making more apiserver calls unnecessarily)
Essentially, I think both of the following should result in identical behavior:
- There are 1 or more alive pods, and no dead pods
- There are 1 or more alive pods, and we deleted all of the dead pods (hence: there are no dead pods, so this is actually a variant of (1))
@joelanford thanks for the discussion yesterday. Thought about it a little more and we can do this. I've had to add an additional change though, highlighting it in the next comment...
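For completeness, a hypothetical condensation of the behavior agreed on here; the names are illustrative, not the PR's code. After the dead pods are force-deleted, the health verdict depends only on whether any alive pods remain, so the no-op EnsureRegistryServer round trip is skipped whenever it can be.

```go
package reconciler // hypothetical placement, for illustration only

// registryHealthy condenses the agreement above: a failed force deletion is an
// error; no surviving alive pods means unhealthy (so EnsureRegistryServer runs
// and recreates everything); otherwise the successfully deleted dead pods are
// treated as if they had never been there, and we report healthy.
func registryHealthy(alivePods int, deletionErr error) (bool, error) {
	if deletionErr != nil {
		return false, deletionErr
	}
	if alivePods == 0 {
		return false, nil
	}
	return true, nil
}
```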
[PR 3201](operator-framework#3201) attempted to solve the issue by deleting the pods stuck in `Terminating` due to an unreachable node. However, the logic to do that was included in `EnsureRegistryServer`, which only gets executed if polling is requested by the user. This PR moves the logic of checking for dead pods out of `EnsureRegistryServer` and puts it in `CheckRegistryServer` instead. This way, if any dead pods are detected during `CheckRegistryServer`, the value of `healthy` is returned as `false`, which in turn triggers `EnsureRegistryServer`.
Description of the change:
PR 3201 attempted to solve the issue by deleting the pods stuck in `Terminating` due to an unreachable node. However, the logic to do that was included in `EnsureRegistryServer`, which only gets executed if polling is requested by the user.
This PR moves the logic of checking for dead pods out of `EnsureRegistryServer` and puts it in `CheckRegistryServer` instead. This way, if any dead pods are detected during `CheckRegistryServer`, the value of `healthy` is returned as `false`, which in turn triggers `EnsureRegistryServer`.
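A sketch of the control flow this refers to; the interface and signatures below are simplified stand-ins for OLM's registry reconciler, not its exact API. The point is that the syncer only calls `EnsureRegistryServer` when `CheckRegistryServer` reports unhealthy, so putting the dead-pod check inside `CheckRegistryServer` makes the recreate path run even for catalogs that never poll.

```go
package reconciler // hypothetical placement, for illustration only

import "github.com/sirupsen/logrus"

// registryReconciler is a simplified stand-in for OLM's reconciler interface.
type registryReconciler interface {
	CheckRegistryServer(logger *logrus.Entry) (healthy bool, err error)
	EnsureRegistryServer(logger *logrus.Entry) error
}

// syncRegistry sketches why moving the dead-pod check matters: an unhealthy
// result (including "dead pods were found and force-deleted, nothing alive
// remains") always leads to EnsureRegistryServer, with or without polling.
func syncRegistry(logger *logrus.Entry, rec registryReconciler) error {
	healthy, err := rec.CheckRegistryServer(logger)
	if err != nil {
		return err
	}
	if !healthy {
		return rec.EnsureRegistryServer(logger)
	}
	return nil
}
```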
Motivation for the change:
Architectural changes:
Testing remarks:
Reviewer Checklist
- Docs updated or added to /doc
- Tests marked as [FLAKE] are truly flaky and have an issue