[bug] Operator becoming non-functional after transient RBAC changes #1419


Closed
Tracked by #1422
andreaTP opened this issue Aug 24, 2022 · 6 comments · Fixed by #1571

@andreaTP
Collaborator

Bug Report

Hi all and thanks for the amazing project!
I was looking at real-world edge cases where the functionality of the operator gets compromised because the Informers crash in the background.
A little playing with RBAC resources while the operator is running turns out to render it completely unresponsive to any CR event.

What did you do?

  • start a new minikube cluster
  • deploy the sample tomcat-operator
  • kubectl apply -f sample-operators/tomcat-operator/k8s/tomcat-sample1.yaml
  • kubectl delete serviceaccount/tomcat-operator -n tomcat-operator
  • wait for the reconciliation loop to exhaust the retries
  • re-create the Service Account:
kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tomcat-operator
  namespace: tomcat-operator
EOF

Now the operator becomes completely unresponsive:

  • doesn't react to changes to the test-tomcat1 CR
  • doesn't react to the creation of a new CR e.g. kubectl apply -f sample-operators/tomcat-operator/k8s/tomcat-sample2.yaml

What did you expect to see?

The operator pod should (probably) restart when it loses access to the API, in order to restore the communication.
Alternatively, the situation should be handled internally and the connections of the SharedInformers somehow restored.
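
For illustration, a minimal sketch of the restart-based option, assuming a Fabric8-style informer handle whose stopped() future completes exceptionally on unrecoverable failure (the exact API may differ):

import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

// Hypothetical watchdog: when an informer dies in the background (e.g. after
// RBAC is revoked and the retries are exhausted), exit the JVM so Kubernetes
// restarts the pod and the informer connections get re-established.
public class InformerWatchdog {
  public static void crashOnFailure(SharedIndexInformer<?> informer) {
    informer.stopped().whenComplete((unused, error) -> {
      if (error != null) {
        System.err.println("Informer stopped unexpectedly: " + error);
        System.exit(1); // non-zero exit; restartPolicy: Always brings the pod back
      }
    });
  }
}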

What did you see instead? Under which circumstances?

The operator remains unresponsive but alive.

Environment

Kubernetes cluster type:
minikube

java-operator-sdk version (from pom.xml):
main

$ java -version

openjdk version "11.0.15" 2022-04-19
OpenJDK Runtime Environment Temurin-11.0.15+10 (build 11.0.15+10)
OpenJDK 64-Bit Server VM Temurin-11.0.15+10 (build 11.0.15+10, mixed mode)

$ kubectl version

Possible Solution

The best would be a callback endpoint in the Controller that gets called when an error happens in the SharedInformers, so that the user can decide what to do.
At the very minimum, in this specific situation, I do believe that crashing the operator is the correct behavior, but it would be nice to have a more generic mechanism for handling SharedInformer failures, which currently happen silently in the background.
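
A rough sketch of what such a hook could look like (the names below are illustrative, not an existing API of the SDK):

import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

// Hypothetical callback the SDK could invoke whenever a SharedInformer stops,
// instead of failing silently in the background.
@FunctionalInterface
public interface InformerStoppedHandler {
  void onStop(SharedIndexInformer<?> informer, Throwable cause);
}

// A user could then plug in their own policy, e.g. crash the process so the
// pod gets restarted (the minimal behavior suggested above):
class CrashOnErrorHandler implements InformerStoppedHandler {
  @Override
  public void onStop(SharedIndexInformer<?> informer, Throwable cause) {
    if (cause != null) {
      System.exit(1);
    }
  }
}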

Additional context

During my test, I verified that the communication with the API server gets restored if the API server becomes temporarily unavailable; that's great work 👍

@andreaTP
Collaborator Author

cc. @lburgazzoli

@csviri
Collaborator

csviri commented Aug 24, 2022

Probably related to this issue:
#1405

@andreaTP
Collaborator Author

Related to #1170 also.
Please note that a pod restart recovers the situation.

@andreaTP
Collaborator Author

@csviri do we have an integration test for this?

@csviri
Collaborator

csviri commented Oct 27, 2022

@andreaTP
Collaborator Author

awesome! thanks!
