Cannot get secondary resource from context after operator restart #1299

Closed
morhidi opened this issue Jun 23, 2022 · 4 comments · Fixed by #1300

morhidi commented Jun 23, 2022

Hi folks,

users encountered a blocking issue, FLINK-28008, in the Flink Kubernetes Operator. It is related to primary/secondary resources and appeared after upgrading the Operator to JOSDK v3.0.2.

We are managing session jobs (primary resource) and session clusters (secondary resource) with the Operator. Once a session job finishes, users usually delete it. However, when the Operator is restarted, newly submitted session jobs can no longer find the session cluster. The session cluster must be deleted and recreated to make it work, which unfortunately is not an acceptable workaround in our case. I've uploaded the repro logs here

Could you please take a look into this issue?

Thanks,
Matyas

@csviri csviri self-assigned this Jun 23, 2022

csviri commented Jun 23, 2022

thx @morhidi, will take a look. My first guess is that this is related to the ordering of event sources and/or their indexes. Will dig deeper and try to come up with a solution.

@csviri csviri linked a pull request Jun 23, 2022 that will close this issue

csviri commented Jun 23, 2022

The problem occurs when there is a "many-to-one" or "many-to-many" relationship between the primary and the secondary resources.
Created an integration test to reproduce this issue; you can see when it happens:

https://github.com/java-operator-sdk/java-operator-sdk/blob/80fb6fc4430f2e1ba9a874e0327f148b4a0d1873/operator-framework/src/test/java/io/javaoperatorsdk/operator/sample/primarytosecondary/JobReconciler.java

This replicates the case for Flink Operator.

In this example the primary resource references the secondary in its spec (by name); there is no owner reference or annotation on the secondary resource. Without a primaryToSecondary mapper, if the secondary arrives first, it maps itself to primaries based on the index in the primary resource informer (see example). If a primary resource is received after that, the secondary resource's index does not contain the new resource, so the secondary won't be accessible through the getSecondaryResource API.

Unfortunately there is no efficient way to cover this other than having a PrimaryToSecondary mapper. Will provide the fix. You can already see how it will work from the link above / related PR.
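
To make the ordering problem concrete, here is a minimal plain-Java model of the two lookup strategies. It is only a sketch of the reasoning above and does not use JOSDK classes; the names `SecondaryDrivenIndex`, `PrimaryDrivenIndex`, `onPrimary`, `onSecondary`, and `getSecondary` are hypothetical and chosen for illustration, and resources are reduced to plain name strings.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Simplified model (NOT JOSDK code): all class and method names are assumptions.
public class IndexOrderingSketch {

  /** Links are only created when a secondary event arrives. */
  static class SecondaryDrivenIndex {
    // primary name -> name of the secondary referenced in the primary's spec
    private final Map<String, String> primarySpecs = new HashMap<>();
    // primary name -> secondaries linked at secondary-arrival time
    private final Map<String, Set<String>> linked = new HashMap<>();

    void onPrimary(String primary, String referencedSecondary) {
      // The primary is recorded, but no link to an existing secondary is made.
      primarySpecs.put(primary, referencedSecondary);
    }

    void onSecondary(String secondary) {
      // The secondary maps itself to the primaries known at this moment.
      for (Map.Entry<String, String> e : primarySpecs.entrySet()) {
        if (e.getValue().equals(secondary)) {
          linked.computeIfAbsent(e.getKey(), k -> new HashSet<>()).add(secondary);
        }
      }
    }

    Optional<String> getSecondary(String primary) {
      return linked.getOrDefault(primary, Collections.emptySet()).stream().findFirst();
    }
  }

  /** The lookup is derived from the primary's own spec, so arrival order does not matter. */
  static class PrimaryDrivenIndex {
    private final Map<String, String> primaryToSecondary = new HashMap<>();
    private final Set<String> knownSecondaries = new HashSet<>();

    void onPrimary(String primary, String referencedSecondary) {
      primaryToSecondary.put(primary, referencedSecondary);
    }

    void onSecondary(String secondary) {
      knownSecondaries.add(secondary);
    }

    Optional<String> getSecondary(String primary) {
      String ref = primaryToSecondary.get(primary);
      if (ref != null && knownSecondaries.contains(ref)) {
        return Optional.of(ref);
      }
      return Optional.empty();
    }
  }

  public static void main(String[] args) {
    // After a restart the cluster (secondary) is seen first; old jobs were deleted.
    SecondaryDrivenIndex naive = new SecondaryDrivenIndex();
    naive.onSecondary("session-cluster");
    naive.onPrimary("job-1", "session-cluster");
    System.out.println("secondary-driven lookup found: " + naive.getSecondary("job-1").isPresent());

    PrimaryDrivenIndex mapped = new PrimaryDrivenIndex();
    mapped.onSecondary("session-cluster");
    mapped.onPrimary("job-1", "session-cluster");
    System.out.println("primary-driven lookup found: " + mapped.getSecondary("job-1").isPresent());
  }
}
```

In this model the secondary-driven lookup misses `job-1` because the link was computed before the job existed, while the primary-driven lookup resolves it from the job's own reference regardless of arrival order.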


csviri commented Jun 23, 2022

Will create a separate issue to generalize the concept for event sources for external resources for v3.1


csviri commented Jun 24, 2022
