baremetal: propose a new Second Level Operator for bare metal #212


Merged

Conversation

markmc (Contributor) commented Feb 17, 2020

The cluster-baremetal-operator Second Level Operator (SLO) will provide bare metal machine management capabilities needed for the Machine API provider on the baremetal platform. There is no equivalent for this component on other platforms.

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 17, 2020
markmc (Contributor Author) commented Feb 17, 2020

This draft just lays out the problem and describes the two options - CVO-managed versus MAO-managed. Based on initial feedback from MAO folks, I suspect the preference will be for CVO-managed.

I've also just learned about #200, which seems like a likely path forward - i.e. we would add a baremetal cluster profile? I will follow up with more details on what this would look like as a cluster profile.

@markmc markmc mentioned this pull request Feb 17, 2020
markmc (Contributor Author) commented Feb 18, 2020

Some great feedback here, in #200, and in separate email threads. I thought I'd try to capture that feedback here for completeness, organized by topic.


markmc (Contributor Author) commented Feb 18, 2020

Next step - I'll re-work this to:

  1. Explore the details of the design of evolving baremetal-operator into a "real operator", or indeed adding a new metal3-operator.
  2. Capture the details of how this evolved-BMO should be a CVO-managed SLO that is "disabled" except on the baremetal platform.

The Bare Metal Operator is currently launched by the Machine API
Operator when the cluster is running on the baremetal infrastructure
platform.

Discuss adding a new SLO that would manage the BareMetalHost
controller for the baremetal infrastructure platform.
markmc force-pushed the baremetal-operator-management branch from 63c9880 to 54e116d on February 27, 2020 12:28
markmc changed the title from "baremetal: discuss options for managing baremetal-operator" to "baremetal: propose a new Second Level Operator for bare metal" on Feb 27, 2020
markmc (Contributor Author) commented Feb 27, 2020

Next step - I'll re-work this to:

  1. Explore the details of the design of evolving baremetal-operator into a "real operator", or indeed adding a new metal3-operator.
  2. Capture the details of how this evolved-BMO should be a CVO-managed SLO that is "disabled" except on the baremetal platform.

Ok, new revision pushed with these details, and also a prototype here: https://github.com/markmc/cluster-baremetal-operator

Base our strategy for transferring the metal3 deployment between
operators on a similar approach followed for oauth apiservices.
markmc (Contributor Author) commented Apr 28, 2020

Ok, I've added an upgrade/downgrade section now

markmc added 2 commits April 28, 2020 09:45
Consistently refer to the new component as
`cluster-baremetal-operator` or CBO.

Mention in the summary that the new component will be the natural
place to add future bare metal specific functionality.
Rather than introducing a new openshift-baremetal namespace, continue
to use the openshift-machine-api namespace. This avoids the need to
move the baremetalhost resources to a new namespace.
Comment on lines +284 to +287
resource. A new `baremetal.openshift.io/owned` annotation will
indicate that CBO is in control. Only one of these annotations
should be set on a resource but, for the avoidance of doubt, CBO
takes precedence.
Member

I see that MAO opted to not use an owner reference for reasons specific to them. From what I understand of this proposal, CBO would not have such a limitation. Since CBO is taking over as the long-term owner, it's worth considering whether it makes sense for it to express its ownership through a standard owner reference instead of an annotation. It could do both, but that diminishes the value of the annotation.

There is value in the consistency of having both operators use an annotation, though they each have their own annotation and thus their own independent ways of identifying their ownership. Maybe it makes more sense to optimize for the long-term state than the transitional state, and not have a baremetal.openshift.io/owned annotation to either get rid of later or retain as technical debt.
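
For illustration, a minimal sketch in Go of what expressing ownership through a standard owner reference could look like. The helper name and the choice of the CBO deployment as the owner are assumptions for the sake of the example, not something settled in this thread.

```go
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setMetal3Owner is a hypothetical helper: instead of (or in addition to) the
// baremetal.openshift.io/owned annotation, CBO could record its ownership of
// the metal3 deployment with a standard ownerReference. Both deployments live
// in the openshift-machine-api namespace, so a same-namespace reference works.
func setMetal3Owner(metal3, cbo *appsv1.Deployment) {
	controller := true
	metal3.OwnerReferences = []metav1.OwnerReference{{
		APIVersion: "apps/v1",
		Kind:       "Deployment",
		Name:       cbo.Name,
		UID:        cbo.UID,
		Controller: &controller,
	}}
}
```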

Contributor Author

I see that MAO opted to not use an owner reference for reasons specific to them. From what I understand of this proposal, CBO would not have such a limitation.

I assume you're suggesting that the CBO deployment would be what we'd list as the owner of the metal3 deployment?

The logic appears to apply here too? Is running MAO off-cluster any more of a valid use case than running CBO off-cluster? The only use case I can think of is during development?

And likewise, the "the controller itself is the owner" comment seems to apply here. Unless it could make sense to make the provisioning resource the owner?

(I'm not sure I can judge how valid the MAO justification is ... I'm just saying it appears to apply here too)

Since CBO is taking over as the long-term owner, it's worth considering whether it makes sense for it to express its ownership through a standard owner reference instead of an annotation. It could do both, but that diminishes the value of the annotation.

There is value in the consistency of having both operators use an annotation, though they each have their own annotation and thus their own independent ways of identifying their ownership. Maybe it makes more sense to optimize for the long-term state than the transitional state, and not have a baremetal.openshift.io/owned annotation to either get rid of later or retain as technical debt.

It's an interesting suggestion. I must admit though I find it challenging to think through. There are 2 aspects:

  1. What would be an appropriate ownerRef for the metal3 deployment?

I struggle with this one because I feel like there's tons of "case law" I'm missing.

For example, looking at my running 4.5 cluster, I see exactly 1 owner reference on a deployment - the OLM packageserver deployment lists its ClusterServiceVersion as an owner.

  2. Are there any gotchas with using ownerReferences for the purposes of this upgrade hand-off?

I can't think of any particularly, but I worry I'm missing something.

Here's my first cut at code in MAO to stand down if the CBO has claimed ownership: openshift/machine-api-operator@master...markmc:cbo-gate
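
For reference, a rough sketch of what that stand-down gate amounts to: a hypothetical helper assuming a plain client-go clientset; the actual change is in the branch linked above.

```go
package example

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cboOwnsMetal3 is a hypothetical check: MAO would stand down from managing
// the metal3 deployment once CBO has claimed it by setting the
// baremetal.openshift.io/owned annotation.
func cboOwnsMetal3(ctx context.Context, client kubernetes.Interface) (bool, error) {
	d, err := client.AppsV1().Deployments("openshift-machine-api").Get(ctx, "metal3", metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	_, owned := d.Annotations["baremetal.openshift.io/owned"]
	return owned, nil
}
```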

markmc added a commit to markmc/machine-api-operator that referenced this pull request May 2, 2020
As per openshift/enhancements#212 MAO should stand down from managing
the metal3 deployment if cluster-baremetal-operator has claimed
ownership by setting the baremetal.openshift.io annotation.

This allows a smooth transition on upgrade of bare metal components to
a new Second Level Operator (SLO) for this platform.
* A sense that the MAO is responsible for a significant amount of bare
metal specific tasks that should be declared outside of the scope of
the MAO, particularly since it does not have equivalent
responsibilities on any other platform.
Contributor

Does this also apply to vSphere? Is the vSphere actuator going to be moved as part of this work?

markmc (Contributor Author) May 9, 2020

Looking back to a discussion elsewhere in February, this seems like a continuation of a question you asked there, so I'm going to include that here.

Doug:

I would like a technical explanation of why a new operator is the right approach. Why is the operator responsible for standing up the controller for the machine API not the right place for doing the work to stand up the controller for the machine API for bare metal?

David Eads @deads2k :

The machine-api-operator is responsible for making a call to the cloud provider API to get a machine provisioned, but it is not responsible for running the components that cloud provider API requires in order to function. My understanding of the metal3 additions to the machine-api-operator is that it has two parts

  1. An actuator that knows how to make a request for a machine. This actuator clearly
    belongs in the machine-api-operator.
  2. A metal3 deployment which contains an ipa-downloader, a machine-os-downloader, an ip-setter, a MariaDB, an Ironic conductor, an Ironic API, and an Ironic inspector. And I have a memory of this even having a way to provide DHCP for the machines being provisioned. This doesn't appear to belong in the machine-api-operator.

The machine-api-operator is a client to a cloud provider API which provisions machines. The actuator for vSphere doesn't include infrastructure for vSphere to be able to launch its machines, only the calls to the API to create them. We would expect metal3 to do the same.

Contributor Author

David's explanation is pretty reasonable I think, but Doug's follow-on question is equally reasonable.

That is ... why is the vsphere actuator the only actuator codebase that actually lives in the machine-api-operator codebase?

Or ... why does the machine-controller-manager binary for vsphere come from the machine-api-operator image? What's so special about vsphere?

I'm tempted to guess that the only reason it's special is because the authors of the vsphere provider are also the maintainers of the machine-api-operator repo? But it'd be great to get that background from @enxebre

Contributor Author

Just for reference, I don't see any of the rationale for that in e.g. openshift/machine-api-operator#447 or #154

Contributor

If I've understood correctly, the logic for having the vSphere provider inside the MAO repository is that we (the cloud team) were planning to move to a monorepo approach for the MAO and the providers that we manage (AWS, Azure, GCP, vSphere) and possibly other providers as well. Having each provider in a separate repository created a lot of maintenance overhead when we made small changes to the core of the machine API; raising BZs and PRs for each provider just to revendor the core MAO repo takes a lot of time. I'm not sure if that's still a priority, but I think that was the motivation at the time.

Contributor

Ah, so the trend would actually be in the opposite direction? All of the cluster-api-provider repos should be moving their code into the MAO repo? Presumably including cluster-api-provider-baremetal?

Contributor

I guess so, though I'll defer to @enxebre as I joined the team after this was done, so I'm not entirely sure what the plan was.

enxebre (Member) May 28, 2020

The machine API operator is intentionally simple and fairly rudimentary by design.
It literally just enforces the lifecycle of a kubernetes deployment resource.
This deployment has the same orthogonal structure everywhere. It knows nothing about clouds.

Now, where the code base for the containers run by that deployment lives has zero impact on the design described above.

I'm tempted to guess that the only reason it's special is because the authors of the vsphere provider are also the maintainers of the machine-api-operator repo?

Yes. The home for that code base is determined by factors totally irrelevant to MAO design. These factors are variable in time and context. It's merely a matter of enabling the best development workflow for each scenario.

You'll see AWS, GCP and Azure in individual repos because that made sense at the time they were created. They were forks mimicking an upstream code organisation.
vSphere was created in a completely different context. Having it in its own repo would have had only drawbacks and no benefits. Therefore we chose to put the code base in the MAO repo purely to enable a healthier development workflow (e.g. removing the dependency revendoring hassle, on-the-fly CI).

The metal3 bare metal or the OpenStack actuators could well live in the MAO repo and that would make no difference at all to the MAO design. It is just natural and convenient for them to be in individual repos to enable a more effective development workflow given our team organisation.

Now, managing the lifecycle of the underlying cloud required for the machine API to work, designing and understanding how to fetch the required external config, giving semantics to possible failure scenarios and communicating those appropriately, that's beyond the MAO design scope and only relevant in a bare metal environment. This is the knowledge that I think is reasonable to embed in an ad-hoc operator, and it would favour a more sustainable component ecosystem with clear responsibility boundaries. The MAO assumes this underlying "cloud" exists.

markmc added 4 commits May 9, 2020 09:38
Thanks to Michael Hrivnak.
Thanks to Doug Hellmann for this explanation on why it's better for us
to take over the existing metal3 deployment rather than launching a
new one.
Lots of good discussion in PR comments that would be good to capture
as rationale that can be referred to later.
markmc force-pushed the baremetal-operator-management branch from ff0bad2 to 945dc8d on May 9, 2020 10:24
markmc (Contributor Author) commented May 9, 2020

Added a new discussion section to the doc capturing some of the key points of debate along the way

- Be a new `openshift/cluster-baremetal-operator` project.
- Publish an image called `cluster-baremetal-operator`.
- Use CVO run-level 30, so its manifests will be applied in parallel
with the MAO.
Contributor

Why do we want to run it in parallel instead of having the MAO launch it? We could move the cluster-api-provider-baremetal deployment into the new operator along with everything else, and then the MAO could go back to simply managing 1 Deployment with platform-specific content.

Contributor

+1 to the above suggestion, then it wouldn't run at all on clusters that aren't bare metal. Is there any precedent within OpenShift for this kind of thing, a third-level operator?

CounterPoint: This goes against https://github.com/openshift/enhancements/pull/212/files#r422472511, so I can still see some value in having it separate (e.g. having its own namespace (eventually) would be good if it's going to deploy lots of resources)

Contributor

I think we have settled on the approach where this SLO is launched by the CVO. The operator then puts itself in a Disabled state immediately when not running on the baremetal platform.
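
For illustration, a minimal sketch of that platform gate, assuming the openshift/client-go config clientset (the helper name is hypothetical):

```go
package example

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// runningOnBareMetal is a hypothetical helper: CBO is installed by the CVO on
// every cluster, but only does real work when the infrastructure platform is
// bare metal; otherwise it would immediately report itself as Disabled.
func runningOnBareMetal(ctx context.Context, client configclient.Interface) (bool, error) {
	infra, err := client.ConfigV1().Infrastructures().Get(ctx, "cluster", metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	return infra.Status.PlatformStatus != nil &&
		infra.Status.PlatformStatus.Type == configv1.BareMetalPlatformType, nil
}
```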

release image.
2. Implement a `ClusterOperator` resource, including updating its
status field with the `Available/Progressing/Degraded` conditions,
operator version number, and any `relatedObjects`.
stbenjam (Member) May 14, 2020

Could we use these status fields, like `Degraded=true` or `Available=false,Progressing=true`, to relay information to the installer about how the underlying provisioning activities are going?

The installer takes a naïve approach in determining failure: if it's taking longer than 30m (60m for metal) for all operators to come up, then it's marked as failed. The installer can't tell the difference between "this is taking a while" vs "this will never complete."

However, it's conceivable to exceed 60m. POST can take a long time, or UEFI firmwares can be buggy and need to retry PXE. Indeed, we retry PXE on failure and have seen need for it in some environments.

I'd like to see us take a more intelligent approach for baremetal. If we know we're still working on provisioning and Metal3's Ironic hasn't given up, then we could conceivably wait longer. A recent, successful deployment of a cluster with 3 masters and 6 workers took 97m.

Alternatively, we might know 15m into the process that your BMC creds are garbage, and we could fail early.

It's a really poor UX to have the installer make users wait when it's knowable earlier that the install has failed, and similarly bad UX to fail when we know the underlying infrastructure is still making progress and there's a good chance of success.

I'm not even sure if that belongs in CBO or MAO, and maybe it belongs as a separate enhancement request.

What do you think?

Member

It's a very good suggestion. Honestly, we could probably trust ironic to apply its own timeouts (and adjust ironic-image as we see fit).

## Summary

The `cluster-baremetal-operator` component will provide bare metal
machine management capabilities needed for the Machine API provider on
Member

"machine" in this context usually refers to the kubernetes API machine resource.
Maybe "will provide raw infrastructure management capabilities needed to satisfy the metal3 bare metal Machine API provider requests" would be more accurate here?

Probably good to include a doc glossary: machine (CRD), host (CRD), instance/infrastructure.

- Upgrades
- API stability
- Additional bare metal specific functionality in future
- Upstream (Metal3) implications
Contributor

This section probably needs either some more details or cleanup - I think the most critical items are covered below, e.g. the '"Not In Use" SLO Behaviors' section for impact on non-baremetal platforms and the discussion around upgrade strategy.

will be to reconcile the `Provisioning` singleton resource.

### Test Plan

Contributor

We should add some test plan details here, e.g. at a minimum we need to ensure the fully integrated solution passes the existing e2e-baremetal-ipi tests; upgrade CI will also soon be in place, which can help us prove the upgrade aspects are robust.

hardys (Contributor) commented Jun 24, 2020

I added a couple of comments which could be resolved in a followup - I think that @sadasu is taking this work over from @markmc so I'd propose we go ahead and merge this unless there are any objections - this will unblock work on the implementation and @sadasu can push a follow-up PR to resolve the remaining comments (since she can't push to the fork this originates from to update this PR).

/lgtm

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 24, 2020
@openshift-merge-robot openshift-merge-robot merged commit 006f1c3 into openshift:master Jun 24, 2020
dhellmann (Contributor) left a comment

I'm happy with this approach.

4. On infrastructure platforms other than `baremetal`, the CBO should
follow emerging patterns for "not in use" components, for
example setting its cluster operator condition to `Disabled=true`,
`Available=true`, and `Progressing=False`.
Contributor

This implies, I think, that we would not support deploying a UPI cluster on metal (resulting in a platform of ""), then on day 2 configuring the machine API to support fencing hosts or provisioning new nodes. I think that's probably OK for now, but in the future we may want to reconsider what the trigger is for deciding when the new operator is enabled to support that case. Changing that trigger wouldn't be the only work needed, so we'll want another enhancement to work through the design when the time comes.
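
For reference, a sketch of the "not in use" conditions quoted above, written against the openshift/api config types. The Disabled condition type is an emerging convention rather than a constant in configv1, the Degraded=false entry is assumed rather than listed in the excerpt, and the Reason/Message strings are illustrative only.

```go
package example

import (
	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// notInUseConditions sketches the status CBO might report on its ClusterOperator
// when the infrastructure platform is not baremetal: Disabled=true,
// Available=true, Progressing=false, plus an assumed Degraded=false.
func notInUseConditions() []configv1.ClusterOperatorStatusCondition {
	now := metav1.Now()
	return []configv1.ClusterOperatorStatusCondition{
		{Type: "Disabled", Status: configv1.ConditionTrue, Reason: "UnsupportedPlatform",
			Message: "the cluster is not running on the baremetal platform", LastTransitionTime: now},
		{Type: configv1.OperatorAvailable, Status: configv1.ConditionTrue, LastTransitionTime: now},
		{Type: configv1.OperatorProgressing, Status: configv1.ConditionFalse, LastTransitionTime: now},
		{Type: configv1.OperatorDegraded, Status: configv1.ConditionFalse, LastTransitionTime: now},
	}
}
```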

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhellmann, hardys, markmc


3. A change to remove `metal3` deployment from MAO.

In `4.N-dev`, once the CBO has been added to the release, we can
remove all awareness of the `metal3` deployment from MAO.
Member

I'm not sure why all this is needed. Why do we need to do any check for the CBO being available, or annotations at all?
The CBO either is running or not. If it is running in 4.N, then that version of MAO needs to know nothing about bare metal. During a 4.N upgrade the CBO just takes over; in a downgrade the old MAO code takes over.

Contributor Author

I think the main issue is that during an upgrade, we can't actually guarantee that the old MAO is shut down before the new CBO is running.

If you can explain how we could do that, then great - nobody involved loves this extra complexity.

Downgrade is perhaps a corner case not worth over-designing for, but ... AFAICT, in a downgrade scenario, the CBO will not be automatically deleted by the CVO. So you'll have both the metal3-managing MAO and the CBO running, until the admin manually deletes the CBO.

Also if you could explain why this was needed for the oauth example (in the "prior art" section) but not needed in this case, then that would be super helpful.

Contributor Author

Also, you at least have a very short term challenge with this transition - you can't merge "remove metal3 management from MAO" and "add CBO to the release image" as an atomic change. There will be some set of release images with either a regression or a conflict.

(But again, my main concern is e.g. a 4.6->4.7 upgrade where we have to transition from MAO to CBO, and we can't guarantee that the 4.6 MAO is shut down first)

enxebre (Member) Sep 1, 2020

Thanks @markmc I share all those concerns. I think we could possibly take advantage of prefix order https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/upgrades.md#upgrades-and-order for upgrades to guarantee that the old MAO is shut down before the new CBO is running.

For the downgrade corner case, hopefully letting the MAO check for CBO existence should be enough, so we can remove some complexity without getting annotations involved.

fwiw the rationale for the annotation in MAO was purely to give us control over which specific deployments the informer is watching in that namespace; there are probably more elegant ways to configure the informer to do this.
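
For comparison, a sketch of the annotation-free alternative being discussed: a hypothetical helper that assumes the ClusterOperator created by CBO is named `baremetal`.

```go
package example

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
)

// cboExists is a hypothetical version of the handover check: instead of looking
// for an annotation on the metal3 deployment, MAO keys off the existence of the
// baremetal ClusterOperator that CBO would create.
func cboExists(ctx context.Context, client configclient.Interface) (bool, error) {
	_, err := client.ConfigV1().ClusterOperators().Get(ctx, "baremetal", metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```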

Contributor Author

Thanks @markmc I share all those concerns. I think we could possibly take advantage of prefix order https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/upgrades.md#upgrades-and-order for upgrades to guarantee that the old MAO is shut down before the new CBO is running.

(duplicating from openshift/machine-api-operator#689)

Yeah, but it's pretty ugly to put CBO in a later runlevel just for this temporary upgrade handover. Even that link suggests using another mechanism - "Over time, operators should be better at detecting their prerequisites without overcomplicating or risking the predictability of upgrades"

For the downgrade corner case, hopefully letting the MAO check for CBO existence should be enough, so we can remove some complexity without getting annotations involved.

I think I could buy that the "existence of a baremetal clusteroperator" check could serve the same purpose as the annotation - at least I'm not seeing any obvious reason why not, months after writing the enhancement! (i.e. if there's a good reason, I should have captured it in the enhancement)

fwiw the rationale for the annotation in MAO was purely to give us control over which specific deployments the informer is watching in that namespace; there are probably more elegant ways to configure the informer to do this.

Just to be open about some of the irritation here - the level of scrutiny this minor, temporary, and exhaustively documented piece of "inelegance" is getting seems way out of proportion with e.g. openshift/machine-api-operator#424. Particularly since this entire move is motivated by the clear desire for the MAO maintainers to not have to worry about bare metal use cases.

Member

Yeah, we are totally on the same page. I meant that the way it's done in MAO is not elegant, so hopefully we can avoid perpetuating the pattern and later improve it in MAO. The proposal and openshift/machine-api-operator#689 both look good to me in their current form. I was just making sure I got the rationale behind all the details after being away for a little while. Thanks for the clarifications.

@markmc markmc deleted the baremetal-operator-management branch January 28, 2021 09:55