(fix) registry pods do not come up again after node failure #3366
Conversation
So far it looks pretty good =D Should we add any unit tests for the new behaviour? And maybe the test we missed in the original bug fix?
I was scared you'd ask this. :p I did think about it, looked into it, and it looks like it's going to be a little bit of effort to write a test for this, mainly because it's not really unit testable, so an e2e test is the viable option. However, I'm not sure we can mimic a node going down from within our tests. I just figured the juice might not be worth the squeeze; however, if someone has a better idea (or really insists we include a test for this even if it'll require some effort), I'm happy to include one in this PR.
What's the signal we get in the code that a node has gone down? Are some pods unreachable?
We don't really get any signal that a node's gone down, we just discover pod/s that have been deleted but not removed. And that "discovery" is done here. We could add a test for that, but it's a pretty simple function. The real useful test would have been if we could mimic pods "hanging around" in an e2e setting. In fact, the entire …
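For readers following along, here is a rough sketch of what that "discovery" amounts to; `isPodDead` is a hypothetical name, not the helper used in this PR. The idea is that there is no explicit node-down signal: there are only pods the apiserver has already marked for deletion but that the lister still returns, because the kubelet on the unreachable node never confirmed their termination.

```go
package reconciler // hypothetical placement, for illustration only

import corev1 "k8s.io/api/core/v1"

// isPodDead: a pod counts as "dead" when it has been marked for deletion
// (DeletionTimestamp set) yet is still being listed, which is what happens
// when its node has gone away and nothing can finish tearing it down.
func isPodDead(pod *corev1.Pod) bool {
	return pod.DeletionTimestamp != nil
}
```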
I may have found a way to e2e test this... working on it now...
Would it be possible to mock those client responses? e.g. list says they are there, get says they aren't?
Okay I have some tests now.
This did not work out. I was thinking about using …
I went with the unit-test route, which mocks these interactions. So the entire feature is tested by three components of tests: …
This is the absolute best we can do. e2e isn't possible without much more digging in, and frankly at this point not necessary at all.
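A minimal sketch of what mocking those interactions can look like with client-go's fake clientset; the actual tests in this PR may use OLM's own fakes, and the names below are illustrative. The fake is seeded with a pod that is already marked for deletion, as happens when its node becomes unreachable, and the test asserts the pod is still listed in that state.

```go
package reconciler_test // hypothetical placement, for illustration only

import (
	"context"
	"testing"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func TestDeadPodIsStillListed(t *testing.T) {
	// A pod whose node went away: deleted five minutes ago, never removed.
	deletedAt := metav1.NewTime(time.Now().Add(-5 * time.Minute))
	deadPod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:              "catalog-registry-pod",
			Namespace:         "olm",
			DeletionTimestamp: &deletedAt,
		},
	}

	// The fake clientset plays the role of the apiserver in the test.
	client := fake.NewSimpleClientset(deadPod)

	pods, err := client.CoreV1().Pods("olm").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		t.Fatal(err)
	}
	if len(pods.Items) != 1 || pods.Items[0].DeletionTimestamp == nil {
		t.Fatal("expected one pod stuck in a deleted-but-not-removed state")
	}
}
```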
Pivoted the PR to include the logic inside CheckRegistryServer.
		continue
	}
	healthy = false
	forceDeletionErrs = append(forceDeletionErrs, pkgerrors.Errorf("found %s in a deleted but not removed state", pod.Name))
Do we need to return an error from this function when we find a wedged pod and successfully delete it? My initial thought is that we would only error from this function if:
- we failed to determine healthy
- we failed to delete wedged pods.
I started out thinking the same, but then realized that that'd be an "artificial" error. If we include both of the scenarios you mentioned in the definition of "error", we're sort of sullying the definition, which is "something went wrong". In the case of "we failed to determine healthy", we are already sending that signal through the boolean variable anyway. So I decided to not include both, and only include the second scenario of "we failed to delete the wedged pods".
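A hypothetical condensation of that choice (only `forceDeletionErrs` comes from the diff above; everything else is named for illustration): finding wedged pods only flips the healthy signal, and an error is returned solely when force-deleting one of them fails.

```go
package reconciler // hypothetical placement, for illustration only

import (
	corev1 "k8s.io/api/core/v1"
	utilerrors "k8s.io/apimachinery/pkg/util/errors"
)

// handleWedgedPods reports unhealthy whenever wedged pods are found, but only
// returns an error if one of the force deletions actually fails.
func handleWedgedPods(wedged []*corev1.Pod, forceDelete func(*corev1.Pod) error) (healthy bool, err error) {
	var forceDeletionErrs []error
	for _, pod := range wedged {
		if deleteErr := forceDelete(pod); deleteErr != nil {
			// something genuinely went wrong: the wedged pod could not be removed
			forceDeletionErrs = append(forceDeletionErrs, deleteErr)
		}
	}
	if len(forceDeletionErrs) > 0 {
		return false, utilerrors.NewAggregate(forceDeletionErrs)
	}
	// wedged pods existed but were cleaned up: signal via the boolean only
	return len(wedged) == 0, nil
}
```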
		logger.WithFields(logrus.Fields{"pod.namespace": sourceNamespace, "pod.name": pod.GetName()}).Debug("pod is alive")
		continue
	}
	foundDeadPod = true
This is trickier than I originally thought. If we have:
- at least one alive pod
- the alive pods are otherwise deemed healthy
- we successfully delete the dead pods
What should we return from CheckRegistryServer? Seems like we could say healthy in that case?
Yea this is a bit confusing, I'm thinking if we detect even one dead pod, we just force delete all the pods, and let EnsureRegistryServer recreate everything. Otherwise we're in a very non-deterministic state...
That seems like it could have significant performance implications. There's a lot of disk/CPU/memory cost paid when catalog pods are started, so we should minimize that as much as possible.
I think it is okay to force delete the dead pods, and if there are any alive pods, we just move forward as if those were the only pods that were there to begin with.
That's fair too. I was a bit confused about there being multiple pods in the first place, since my impression is that we only create one registry pod per catalog, but it turns out we have pods mainly because we use the lister to list via label selector. In any case this should return one pod, unless I'm unaware of some feature that allows us to spin up multiple registry pods per catalog.
Changed it back to just deleting dead pods.
There are definitely multiple pods when we're polling. In that case, we spin up a new pod to see what the digest is, and we compare the digests to see if we need to use the new pod (if it has a new digest) or keep using the old pod (when the digests match)
There may also be multiple pods when the pod spec changes (e.g. when the catalog source image is changed).
Right, both are transitory states though, and we're in that state because EnsureRegistryServer has already been called. But yes I think this is in a good state now.
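A small sketch of where that thread landed; `partitionPods` is a hypothetical helper, not code from this PR. Dead pods are separated out for force deletion, while any alive pods are kept and treated as if they were the only pods there to begin with.

```go
package reconciler // hypothetical placement, for illustration only

import corev1 "k8s.io/api/core/v1"

// partitionPods splits the listed registry pods into those still serving and
// those marked for deletion but still hanging around (e.g. on a lost node).
func partitionPods(pods []*corev1.Pod) (alive, dead []*corev1.Pod) {
	for _, pod := range pods {
		if pod.DeletionTimestamp != nil {
			dead = append(dead, pod)
			continue
		}
		alive = append(alive, pod)
	}
	return alive, dead
}
```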
@joelanford are you good with this?
	service == nil || c.currentServiceAccount(source) == nil {
		return false, nil
	}

	if deadPodsDetected, e := detectAndDeleteDeadPods(logger, c.OpClient, currentPods, source.GetNamespace()); deadPodsDetected {
I still think we should return true, nil from this function in the case where:
- There is at least one healthy pod
- There are multiple "dead" pods
- We deleted all of the "dead" pods successfully.
That sort of behavior would mean that we would short circuit and avoid calling EnsureRegistryServer, which (if called) would:
- end up making some no-op calls to the apiserver
- (maybe, not sure) have some strange issues related to caching, where subsequent calls to currentPods would still return the dead pods because the deletion has not propagated back to our informer cache yet (I'm assuming c.Lister is backed by a cache. If it is not, then we'd be making more apiserver calls unnecessarily)
Essentially, I think both of the following should result in identical behavior:
- There are 1 or more alive pods, and no dead pods
- There are 1 or more alive pods, and we deleted all of the dead pods (hence: there are no dead pods, so this is actually a variant of (1))
@joelanford thanks for the discussion yesterday. Thought about it a little more and we can do this. I've had to add an additional change though, highlighting it in the next comment...
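For completeness, a hypothetical condensation of the behavior agreed on here; the names are illustrative, not the PR's code. After the dead pods are force-deleted, the health verdict depends only on whether any alive pods remain, so the no-op EnsureRegistryServer round trip is skipped whenever it can be.

```go
package reconciler // hypothetical placement, for illustration only

// registryHealthy condenses the agreement above: a failed force deletion is an
// error; no surviving alive pods means unhealthy (so EnsureRegistryServer runs
// and recreates everything); otherwise the successfully deleted dead pods are
// treated as if they had never been there, and we report healthy.
func registryHealthy(alivePods int, deletionErr error) (bool, error) {
	if deletionErr != nil {
		return false, deletionErr
	}
	if alivePods == 0 {
		return false, nil
	}
	return true, nil
}
```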
[PR 3201](operator-framework#3201) attempted to solve the issue by deleting the pods stuck in `Terminating` due to an unreachable node. However, the logic to do that was included in `EnsureRegistryServer`, which only gets executed if polling is requested by the user. This PR moves the logic of checking for dead pods out of `EnsureRegistryServer` and puts it in `CheckRegistryServer` instead. This way, if any dead pods are detected during `CheckRegistryServer`, the value of `healthy` is returned as `false`, which in turn triggers `EnsureRegistryServer`.
Description of the change:
PR 3201 attempted to solve the issue by deleting the pods stuck in `Terminating` due to an unreachable node. However, the logic to do that was included in `EnsureRegistryServer`, which only gets executed if polling is requested by the user.
This PR moves the logic of checking for dead pods out of `EnsureRegistryServer` and puts it in `CheckRegistryServer` instead. This way, if any dead pods are detected during `CheckRegistryServer`, the value of `healthy` is returned as `false`, which in turn triggers `EnsureRegistryServer`.
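A sketch of the control flow this refers to; the interface and signatures below are simplified stand-ins for OLM's registry reconciler, not its exact API. The point is that the syncer only calls `EnsureRegistryServer` when `CheckRegistryServer` reports unhealthy, so putting the dead-pod check inside `CheckRegistryServer` makes the recreate path run even for catalogs that never poll.

```go
package reconciler // hypothetical placement, for illustration only

import "github.com/sirupsen/logrus"

// registryReconciler is a simplified stand-in for OLM's reconciler interface.
type registryReconciler interface {
	CheckRegistryServer(logger *logrus.Entry) (healthy bool, err error)
	EnsureRegistryServer(logger *logrus.Entry) error
}

// syncRegistry sketches why moving the dead-pod check matters: an unhealthy
// result (including "dead pods were found and force-deleted, nothing alive
// remains") always leads to EnsureRegistryServer, with or without polling.
func syncRegistry(logger *logrus.Entry, rec registryReconciler) error {
	healthy, err := rec.CheckRegistryServer(logger)
	if err != nil {
		return err
	}
	if !healthy {
		return rec.EnsureRegistryServer(logger)
	}
	return nil
}
```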
Motivation for the change:
Architectural changes:
Testing remarks:
Reviewer Checklist
- Docs updated or added to /doc
- Tests marked as [FLAKE] are truly flaky and have an issue