baremetal: propose a new Second Level Operator for bare metal #212
Conversation
This draft just lays out the problem and describes the two options - CVO-managed versus MAO-managed. Based on initial feedback from MAO folks, I suspect the preference will be for CVO-managed. And I've just learned about #200, which seems like a likely path forward - i.e. we would add a …
Some great feedback here, in #200, and in separate email threads. I thought I'd try and capture that feedback here for completeness, organized by topic.
Next step - I'll re-work this to: …
The Bare Metal Operator is currently launched by the Machine API Operator when the cluster is running on the baremetal infrastructure platform. Discuss adding a new SLO that would manage the BareMetalHost controller for the baremetal infrastructure platform.
Ok, new revision pushed with these details, and also a prototype here: https://github.com/markmc/cluster-baremetal-operator
Base our strategy for transferring the metal3 deployment between operators on a similar approach followed for oauth apiservices.
Ok, I've added an upgrade/downgrade section now
Consistently refer to the new component as `cluster-baremetal-operator` or CBO. Mention in the summary that the new component will be the natural place to add future bare metal specific functionality.
Rather than introducing a new `openshift-baremetal` namespace, continue to use the `openshift-machine-api` namespace. This avoids the need to move the BareMetalHost resources to a new namespace.
> …resource. A new `baremetal.openshift.io/owned` annotation will indicate that CBO is in control. Only one of these annotations should be set on a resource but, for the avoidance of doubt, CBO takes precedence.
I see that MAO opted to not use an owner reference for reasons specific to them. From what I understand of this proposal, CBO would not have such a limitation. Since CBO is taking over as the long-term owner, it's worth considering whether it makes sense for it to express its ownership through a standard owner reference instead of an annotation. It could do both, but that diminishes the value of the annotation.
There is value in the consistency of having both operators use an annotation, though they each have their own annotation and thus their own independent ways of identifying their ownership. Maybe it makes more sense to optimize for the long-term state than the transitional state, and not have a `baremetal.openshift.io/owned` annotation to either get rid of later or retain as technical debt.
> I see that MAO opted to not use an owner reference for reasons specific to them. From what I understand of this proposal, CBO would not have such a limitation.
I assume you're suggesting that the CBO deployment would be what we'd list as the owner of the metal3 deployment?
The logic appears to apply here? Is running MAO off-cluster any more of a valid use case than running CBO off-cluster? The only use case I can think of is during development.
And likewise, "the controller itself is the owner" comment seems to apply here. Unless it could make sense to make the provisioning resource the owner?
(I'm not sure I can judge how valid the MAO justification is ... I'm just saying it appears to apply here too)
> Since CBO is taking over as the long-term owner, it's worth considering whether it makes sense for it to express its ownership through a standard owner reference instead of an annotation. It could do both, but that diminishes the value of the annotation.
> There is value in the consistency of having both operators use an annotation, though they each have their own annotation and thus their own independent ways of identifying their ownership. Maybe it makes more sense to optimize for the long-term state than the transitional state, and not have a `baremetal.openshift.io/owned` annotation to either get rid of later or retain as technical debt.
It's an interesting suggestion. I must admit though I find it challenging to think through. There are 2 aspects:
- What would be an appropriate ownerRef for the `metal3` deployment?
I struggle with this one because I feel like there's tons of "case law" I'm missing.
For example, looking at my running 4.5 cluster, I see exactly 1 owner reference on a deployment - the OLM packageserver deployment lists its ClusterServiceVersion as an owner.
- Are there any gotchas with using ownerReferences for the purposes of this upgrade hand-off?
I can't think of any in particular, but I worry I'm missing something.
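For what it's worth, here's a rough sketch of what expressing ownership through a standard owner reference could look like, assuming controller-runtime and making the `Provisioning` singleton the owner (the import path and type name for `Provisioning` are hypothetical here, not taken from the proposal):

```go
package ownership

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	// Hypothetical import path for the Provisioning API type.
	metal3iov1alpha1 "github.com/openshift/cluster-baremetal-operator/apis/metal3.io/v1alpha1"
)

// claimViaOwnerRef records the Provisioning singleton as the controlling
// owner of the metal3 Deployment via a standard ownerReference. A
// namespaced object may have a cluster-scoped owner, so this is valid
// for a Deployment.
func claimViaOwnerRef(prov *metal3iov1alpha1.Provisioning, metal3 *appsv1.Deployment, scheme *runtime.Scheme) error {
	return controllerutil.SetControllerReference(prov, metal3, scheme)
}
```

One wrinkle worth noting: with `controller=true` set, garbage collection would delete the metal3 deployment if the `Provisioning` resource were ever deleted, which may or may not be desirable during the hand-off.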
Here's my first cut at code in MAO to stand down if the CBO has claimed ownership: openshift/machine-api-operator@master...markmc:cbo-gate
As per openshift/enhancements#212 MAO should stand down from managing the metal3 deployment if cluster-baremetal-operator has claimed ownership by setting the baremetal.openshift.io annotation. This allows a smooth transition on upgrade of bare metal components to a new Second Level Operator (SLO) for this platform.
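As a minimal illustration of that stand-down check (names here are illustrative, not the actual MAO patch):

```go
package standdown

import appsv1 "k8s.io/api/apps/v1"

// cboOwnedAnnotation is the annotation CBO sets on the metal3 Deployment
// to claim ownership, per the proposal text.
const cboOwnedAnnotation = "baremetal.openshift.io/owned"

// standDown reports whether MAO should stop reconciling the metal3
// Deployment. Per the proposal, the CBO annotation takes precedence even
// if MAO's own annotation is still present on the resource.
func standDown(metal3 *appsv1.Deployment) bool {
	_, claimedByCBO := metal3.Annotations[cboOwnedAnnotation]
	return claimedByCBO
}
```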
> * A sense that the MAO is responsible for a significant amount of bare metal specific tasks that should be declared outside of the scope of the MAO, particularly since it does not have equivalent responsibilities on any other platform.
Does this also apply to vSphere? Is the vSphere actuator going to be moved as part of this work?
Looking back to a discussion elsewhere in February, this seems like a continuation of a question you asked there, so I'm going to include that here.
Doug:
> I would like a technical explanation of why a new operator is the right approach. Why is the operator responsible for standing up the controller for the machine API not the right place for doing the work to stand up the controller for the machine API for bare metal?
David Eads @deads2k:
> The machine-api-operator is responsible for making a call to the cloud provider API to get a machine provisioned, but it is not responsible for running the components that cloud provider API requires in order to function. My understanding of the metal3 additions to the machine-api-operator is that it has two parts:
>
> - An actuator that knows how to make a request for a machine. This actuator clearly belongs in the machine-api-operator.
> - A metal3 deployment which contains an ipa-downloader, a machine-os-downloader, an ip-setter, a MariaDB, an ironic conductor, an ironic api, and an ironic inspector. And I have a memory of this even having a way to provide DHCP for the machines provided. This doesn't appear to belong in the machine-api-operator.
>
> The machine-api-operator is a client to a cloud provider API which provisions machines. The actuator for vsphere doesn't include infrastructure for vsphere to be able to launch its machines, only the calls to the API to create them. We would expect metal3 to do the same.
David's explanation is pretty reasonable I think, but Doug's follow-on question is equally reasonable.
That is ... why is the vsphere actuator the only actuator codebase that actually lives in the machine-api-operator codebase?
Or ... why does the `machine-controller-manager` binary for vsphere come from the machine-api-operator image? What's so special about vsphere?
I'm tempted to guess that the only reason it's special is because the authors of the vsphere provider are also the maintainers of the machine-api-operator repo? But it'd be great to get that background from @enxebre
Just for reference, I don't see any of the rationale for that in e.g. openshift/machine-api-operator#447 or #154
If I've understood correctly, the logic for having the vSphere provider inside the MAO repository is that we (the cloud team) were planning to move to a monorepo approach for the MAO and the providers that we manage (AWS, Azure, GCP, vSphere), and possibly other providers as well. Having each provider in a separate repository created a lot of maintenance overhead: when we make small changes to the core of the machine API, raising BZs and PRs for each provider just to re-vendor the core MAO repo takes a lot of time. I'm not sure if that's still a priority, but I think that was the motivation at the time.
Ah, so the trend would actually be in the opposite direction? All of the cluster-api-provider repos should be moving their code into the MAO repo? Presumably including cluster-api-provider-baremetal?
I guess so, though I'll defer to @enxebre as I joined the team after this was done, so I'm not entirely sure what the plan was.
The machine API operator is intentionally simple and fairly rudimentary by design.
It literally just enforces the lifecycle of a kubernetes deployment resource.
This deployment has the same orthogonal structure everywhere. It knows nothing about clouds.
Now, where the code base for the containers run by that deployment lives has zero impact on the design described above.
> I'm tempted to guess that the only reason it's special is because the authors of the vsphere provider are also the maintainers of the machine-api-operator repo?
Yes. The home for that code base responds to factors totally irrelevant to the MAO design. These factors vary with time and context. It's merely a matter of enabling the best development workflow for each scenario.
You'll see AWS, GCP and Azure in individual repos because that made sense at the time they were created. They were forks mimicking an upstream code organisation.
vSphere was created in a completely different context. Having it in its own repo would have had only drawbacks and no benefits, so we chose to put the code base in the MAO repo purely to enable a healthier development workflow (e.g. removing the dependency re-vendoring hassle, on-the-fly CI).
The metal3 bare metal or the OpenStack actuators could well live in the MAO repo and that would make no difference at all to the MAO design. It is just natural and convenient for them to be in individual repos, enabling a more effective development workflow given our teams' organisation.
Now, managing the lifecycle of the underlying cloud required for the machine API to work - designing and understanding how to fetch the required external config, giving semantics to possible failure scenarios, and communicating those appropriately - is beyond the MAO design scope and only relevant in a bare metal environment. That is the knowledge I think it is reasonable to embed in an ad-hoc operator, and it would favour a more sustainable component ecosystem with clear responsibility boundaries. The MAO assumes this underlying "cloud" exists.
Thanks to Michael Hrivnak.
Thanks to Doug Hellmann for this explanation of why it's better for us to take over the existing metal3 deployment rather than launching a new one.
Lots of good discussion in PR comments that would be good to capture as rationale that can be referred to later.
Added a new discussion section to the doc capturing some of the key points of debate along the way
> - Be a new `openshift/cluster-baremetal-operator` project.
> - Publish an image called `cluster-baremetal-operator`.
> - Use CVO run-level 30, so its manifests will be applied in parallel with the MAO.
Why do we want to run it in parallel instead of having the MAO launch it? We could move the cluster-api-provider-baremetal deployment into the new operator along with everything else, and then the MAO could go back to simply managing 1 Deployment with platform-specific content.
+1 to the above suggestion, then it wouldn't run at all on clusters that aren't baremetal. Is there any precedent within OpenShift for this kind of thing, a third level operator?
CounterPoint: This goes against https://github.com/openshift/enhancements/pull/212/files#r422472511, so I can still see some value in having it separate (e.g. having its own namespace (eventually) would be good if it's going to deploy lots of resources).
I think we have settled on the approach where this SLO is launched by the CVO. The operator then puts itself in a Disabled state immediately when not running on the baremetal platform.
> …release image.
> 2. Implement a `ClusterOperator` resource, including updating its status field with the `Available/Progressing/Degraded` conditions, operator version number, and any `relatedObjects`.
Could we use these status fields like `Degraded=true` or `Available=false,Progressing=true` to relay information to the installer about how the underlying provisioning activities are going?
The installer takes a naïve approach in determining failure: if it's taking longer than 30m (60m for metal) for all operators to come up, then it's marked as failed. The installer can't tell the difference between "this is taking a while" vs "this will never complete."
However, it's conceivable to exceed 60m. POST can take a long time, or UEFI firmware can be buggy and require retrying PXE. Indeed, we retry PXE on failure and have seen the need for it in some environments.
I'd like to see us take a more intelligent approach for baremetal. If we know we're still working on provisioning and Metal3's Ironic hasn't given up, then we could conceivably wait longer. A recent, successful deployment of a cluster with 3 masters and 6 workers took 97m.
Alternatively, we might know 15m into the process that your BMC creds are garbage, and we could fail early.
It's a really poor UX to have the installer make users wait when it's knowable earlier that the install is failed, and similarly bad UX to fail when we know the underlying infrastructure is still making progress and there's a good chance of success.
I'm not even sure if that belongs in CBO or MAO, and maybe it belongs as a separate enhancement request.
What do you think?
It's a very good suggestion. Honestly, we could probably trust ironic to apply its own timeouts (and adjust ironic-image as we see fit).
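To make the idea concrete, here's a hedged sketch of how CBO might map provisioning state onto `ClusterOperator` conditions - fail fast on known-fatal errors, stay `Progressing` while Ironic is still working. The inputs (`hostsDeploying`, `fatalBMCError`) are hypothetical; in practice they would be derived from watching `BareMetalHost` resources:

```go
package conditions

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// provisioningCondition translates host-provisioning state into a single
// ClusterOperator status condition the installer could act on.
func provisioningCondition(hostsDeploying int, fatalBMCError bool) configv1.ClusterOperatorStatusCondition {
	now := metav1.Now()
	switch {
	case fatalBMCError:
		// Known-unrecoverable, e.g. bad BMC credentials: fail early
		// instead of making the user wait out the installer timeout.
		return configv1.ClusterOperatorStatusCondition{
			Type:               configv1.OperatorDegraded,
			Status:             configv1.ConditionTrue,
			Reason:             "BMCAccessFailed",
			Message:            "provisioning cannot succeed with the supplied BMC credentials",
			LastTransitionTime: now,
		}
	case hostsDeploying > 0:
		// Still making progress: signal that waiting longer is worthwhile.
		return configv1.ClusterOperatorStatusCondition{
			Type:               configv1.OperatorProgressing,
			Status:             configv1.ConditionTrue,
			Reason:             "HostsDeploying",
			Message:            fmt.Sprintf("%d hosts still provisioning", hostsDeploying),
			LastTransitionTime: now,
		}
	default:
		return configv1.ClusterOperatorStatusCondition{
			Type:               configv1.OperatorAvailable,
			Status:             configv1.ConditionTrue,
			Reason:             "AsExpected",
			LastTransitionTime: now,
		}
	}
}
```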
> ## Summary
>
> The `cluster-baremetal-operator` component will provide bare metal machine management capabilities needed for the Machine API provider on …
"machine" in this context usually refers to the kubernetes API machine resource.
Maybe "will provide raw infrastructure management capabilities needed to satisfy the metal3 bare metal Machine API provider requests" would be more accurate here?
Probably good to include a doc glossary: machine (CRD), host (CRD), instance/infrastructure.
> - Upgrades
> - API stability
> - Additional bare metal specific functionality in future
> - Upstream (Metal3) implications
This section probably needs either some more details or cleanup - I think the most critical items are covered below, e.g. "Not In Use SLO Behaviors" for the impact on non-baremetal platforms and the discussion around upgrade strategy.
> will be to reconcile the `Provisioning` singleton resource.
>
> ### Test Plan
We should add some test plan details here, e.g. at a minimum we need to ensure the fully integrated solution passes the existing e2e-baremetal-ipi tests; also, upgrade CI will soon be in place, which can help us prove the upgrade aspects are robust.
I added a couple of comments which could be resolved in a follow-up. I think @sadasu is taking this work over from @markmc, so I'd propose we go ahead and merge this unless there are any objections - this will unblock work on the implementation, and @sadasu can push a follow-up PR to resolve the remaining comments (since she can't push to the fork this originates from to update this PR). /lgtm
I'm happy with this approach.
> 4. On infrastructure platforms other than `baremetal`, the CBO should follow emerging patterns for "not in use" components, for example setting its cluster operator conditions to `Disabled=true`, `Available=true`, and `Progressing=False`.
This implies, I think, that we would not support deploying a UPI cluster on metal (resulting in a platform of `""`), then on day 2 configuring the machine API to support fencing hosts or provisioning new nodes. I think that's probably OK for now, but in the future we may want to reconsider what the trigger is for deciding when the new operator is enabled to support that case. Changing that trigger wouldn't be the only work needed, so we'll want another enhancement to work through the design when the time comes.
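A sketch of the platform gate being discussed, assuming the `Infrastructure` config API (the function name is illustrative):

```go
package gate

import configv1 "github.com/openshift/api/config/v1"

// cboEnabled reports whether CBO should actively manage the metal3
// deployment. On any other platform - including UPI metal, which reports
// an empty/"None" platform - the operator would instead set a custom
// Disabled=true condition alongside Available=true and Progressing=False.
func cboEnabled(infra *configv1.Infrastructure) bool {
	if infra == nil || infra.Status.PlatformStatus == nil {
		return false
	}
	return infra.Status.PlatformStatus.Type == configv1.BareMetalPlatformType
}
```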
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: dhellmann, hardys, markmc. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
> 3. A change to remove the `metal3` deployment from MAO.
>
> In `4.N-dev`, once the CBO has been added to the release, we can remove all awareness of the `metal3` deployment from MAO.
I'm not sure why all this is needed. Why do we need to do any check for the CBO to be available, or annotations at all?
The CBO either is running or not. If it is running in 4.N, then that version of MAO needs to know nothing about baremetal. During a 4.N upgrade the CBO just takes over; in a downgrade the old MAO code takes over.
I think the main issue is that during an upgrade, we can't actually guarantee that the old MAO is shut down before the new CBO is running.
If you can explain how we could do that, then great - nobody involved loves this extra complexity.
Downgrade is perhaps a corner case not worth over-designing for, but ... AFAICT, in a downgrade scenario, the CBO will not be automatically deleted by the CVO. So you'll have both the metal3-managing MAO and the CBO running, until the admin manually deletes the CBO.
Also if you could explain why this was needed for the oauth example (in the "prior art" section) but not needed in this case, then that would be super helpful.
Also, you at least have a very short term challenge with this transition - you can't merge "remove metal3 management from MAO" and "add CBO to the release image" as an atomic change. There will be some set of release images with either a regression or a conflict.
(But again, my main concern is e.g. a 4.6->4.7 upgrade where we have to transition from MAO to CBO, and we can't guarantee that the 4.6 MAO is shut down first)
Thanks @markmc I share all those concerns. I think we could possibly take advantage of prefix order https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/upgrades.md#upgrades-and-order for upgrades to guarantee that the old MAO is shut down before the new CBO is running.
For the downgrade corner case, hopefully having the MAO check for the CBO's existence should be enough, so we can remove some complexity without getting annotations involved.
fwiw, the rationale for the annotation in MAO was purely to give us control over which specific deployments the informer is watching in that namespace; there are probably more elegant ways to configure the informer to do this.
> Thanks @markmc I share all those concerns. I think we could possibly take advantage of prefix order https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/upgrades.md#upgrades-and-order for upgrades to guarantee that the old MAO is shut down before the new CBO is running.
(duplicating from openshift/machine-api-operator#689)
Yeah, but it's pretty ugly to put CBO in a later runlevel just for this temporary upgrade handover. Even that link suggests using another mechanism - "Over time, operators should be better at detecting their prerequisites without overcomplicating or risking the predictability of upgrades"
> For the downgrade corner case, hopefully having the MAO check for the CBO's existence should be enough, so we can remove some complexity without getting annotations involved.
I think I could buy that the "existence of a baremetal clusteroperator" check could serve the same purpose as the annotation - at least I'm not seeing any obvious reason why not, months after writing the enhancement! (i.e. if there's a good reason, I should have captured it in the enhancement)
> fwiw, the rationale for the annotation in MAO was purely to give us control over which specific deployments the informer is watching in that namespace; there are probably more elegant ways to configure the informer to do this.
Just to be open about some of the irritation here - the level of scrutiny this minor, temporary, and exhaustively documented piece of "inelegance" is getting seems way out of proportion with e.g. openshift/machine-api-operator#424. Particularly since this entire move is motivated by the clear desire of the MAO maintainers not to have to worry about bare metal use cases.
Yeah, we are totally on the same page. I meant that how it's done in MAO is not elegant, so hopefully we can avoid perpetuating the pattern and later improve it in MAO. The proposal and openshift/machine-api-operator#689 look all good to me in their current form. I was just making sure I understood the rationale behind all the details after being away for a little while. Thanks for the clarifications.
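For reference, a sketch of the annotation-free alternative floated above - MAO standing down based purely on the existence of the CBO's `ClusterOperator` - assuming the openshift config clientset and that CBO registers its ClusterOperator as `baremetal`:

```go
package standdown

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	configclient "github.com/openshift/client-go/config/clientset/versioned"
)

// cboExists reports whether the "baremetal" ClusterOperator is present,
// which MAO could treat as the signal to stop managing metal3.
func cboExists(ctx context.Context, cfg configclient.Interface) (bool, error) {
	_, err := cfg.ConfigV1().ClusterOperators().Get(ctx, "baremetal", metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```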
The `cluster-baremetal-operator` Second Level Operator (SLO) will provide bare metal machine management capabilities needed for the Machine API provider on the `baremetal` platform. There is no equivalent for this component on other platforms.