
baremetal: propose a new Second Level Operator for bare metal #212


Merged
308 changes: 308 additions & 0 deletions enhancements/baremetal/an-slo-for-baremetal.md
@@ -0,0 +1,308 @@
---
title: an-slo-for-baremetal
authors:
- "@markmc"
reviewers:
- "@abhinavdahiya"
- "@dhellmann"
- "@enxebre"
- "@eparis"
- "@hardys"
- "@sadasu"
- "@smarterclayton"
- "@stbenjam"
approvers:
- TBD
creation-date: 2020-02-13
last-updated: 2020-02-21
status: provisional
see-also:
- https://github.com/markmc/cluster-baremetal-operator
replaces:
superseded-by:
---

# An SLO for baremetal

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Open Questions

1. How to handle upgrades?
2. Which release should this change target?

## Summary

The Bare Metal Operator provides bare metal machine management
capabilities needed for the Machine API provider on the `baremetal`
platform, and there is no equivalent for this component on other
platforms.

The Bare Metal Operator is currently managed by the Machine API
Operator in a fashion that requires the Machine API Operator to have
significant bare metal specific knowledge. This proposal outlines a
plan for the Bare Metal Operator to become a fully-fledged Second
Level Operator under the management of the Cluster Version Operator.

(To avoid likely confusion, please note that this proposal describes a
new project tentatively named Bare Metal Operator (BMO), implying that
the current baremetal-operator project will be renamed to reflect the
fact it is "just a controller").

## Motivation

In order to bring up a cluster using the baremetal platform:

1. The installer needs to capture bare metal host information and
provisioning configuration for later use.
2. Something needs to install the CRDs for these bare metal specific
resources created by the installer.
3. Something needs to launch a controller for these resources.

Currently (1) is achieved by the installer creating:

* A Provisioning resource manifest
* Manifests for BareMetalHost resources and their associated secrets

and these manifests are applied by the cluster-bootstrap component
towards the end of the cluster bootstrapping process.

This resource creation step does not succeed until step (2)
completes - i.e. the relevant CRDs are applied - and this is currently
done by the Cluster Version Operator (CVO) as it applies the manifests
for the MAO.

Finally, (3) happens when the MAO detects that it is running on the
`baremetal` infrastructure platform and instantiates a relatively
complex Deployment including the BareMetalHost controller and various
containers running other components, all from the Metal3 project. The
configuration of this deployment is driven by the Provisioning
resource created by the installer in (1).
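
For illustration, here is a minimal Go sketch of the kind of platform check described in step (3), assuming the `Infrastructure` type from `config.openshift.io/v1` (github.com/openshift/api) and a controller-runtime client; the package and function names are hypothetical, not taken from the MAO code base.

```go
// Package sketch is illustrative only.
package sketch

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// isBareMetalPlatform reads the cluster-scoped "cluster" Infrastructure
// resource and reports whether the cluster runs on the baremetal platform.
func isBareMetalPlatform(ctx context.Context, c client.Client) (bool, error) {
	infra := &configv1.Infrastructure{}
	if err := c.Get(ctx, types.NamespacedName{Name: "cluster"}, infra); err != nil {
		return false, err
	}
	// PlatformStatus is the non-deprecated way to read the platform type.
	if infra.Status.PlatformStatus == nil {
		return false, nil
	}
	return infra.Status.PlatformStatus.Type == configv1.BareMetalPlatformType, nil
}
```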

There are two problems emerging with this design:

* A sense that the MAO is responsible for a significant number of bare
  metal specific tasks that should be declared outside of the scope of
  the MAO, particularly since it does not have equivalent
  responsibilities on any other platform.

Contributor:

Does this also apply to vSphere? Is the vSphere actuator going to be moved as part of this work?

@markmc (Contributor Author), May 9, 2020:

Looking back to a discussion elsewhere in February, this seems like a continuation of a question you asked there, so I'm going to include that here.

Doug:

I would like a technical explanation of why a new operator is the right approach. Why is the operator responsible for standing up the controller for the machine API not the right place for doing the work to stand up the controller for the machine API for bare metal?

David Eads @deads2k :

The machine-api-operator is responsible for making a call to the cloud provider API to get a machine provisioned, but it is not responsible for running the components that the cloud provider API requires in order to function. My understanding of the metal3 additions to the machine-api-operator is that it has two parts:

  1. An actuator that knows how to make a request for a machine. This actuator clearly
    belongs in the machine-api-operator.
  2. A metal3 deployment which contains an ipa-downloader, a machine-os-downloader, an ip-setter, a maria db, an ironic conductor, ironic api, and an ironic inspector. And I have a memory of this even having a way to provide DHCP for the machines provided. This doesn't appear to belong in the machine-api-operator.

The machine-api-operator is a client to a cloud provider API which provisions machines. The actuator for vsphere doesn't include infrastructure for vsphere to be able to launch its machines, only the calls to the API to create them. We would expect metal3 to do the same.

@markmc (Contributor Author):

David's explanation is pretty reasonable I think, but Doug's follow-on question is equally reasonable.

That is ... why is the vsphere actuator the only actuator codebase that actually lives in the machine-api-operator codebase?

Or ... why does the machine-controller-manager binary for vsphere come from the machine-api-operator image? What's so special about vsphere?

I'm tempted to guess that the only reason it's special is because the authors of the vsphere provider are also the maintainers of the machine-api-operator repo? But it'd be great to get that background from @enxebre

@markmc (Contributor Author):

Just for reference, I don't see any of the rationale for that in e.g. openshift/machine-api-operator#447 or #154

Contributor:

If I've understood correctly, the logic for having the vSphere provider inside the MAO repository is that we (the cloud team) were planning to move to a monorepo approach for the MAO and the providers that we manage (AWS, Azure, GCP, vSphere), and possibly other providers as well. Having each provider in a separate repository created a lot of maintenance overhead when we made small changes to the core of the machine API: raising BZs and PRs for each provider just to revendor the core MAO repo took a lot of time. I'm not sure if that's still a priority, but I think that was the motivation at the time.

Contributor:

Ah, so the trend would actually be in the opposite direction? All of the cluster-api-provider repos should be moving their code into the MAO repo? Presumably including cluster-api-provider-baremetal?

Contributor:

I guess so, though I'll defer to @enxebre as I joined the team after this was done so I'm not entirely sure what the plan was

@enxebre (Member), May 28, 2020:

The machine API operator is intentionally simple and fairly rudimentary by design.
It literally just enforces the lifecycle of a kubernetes deployment resource.
This deployment has the same orthogonal structure everywhere. It knows nothing about clouds.

Now, where the code base for the containers run by that deployment lives has zero impact on the design described above.

I'm tempted to guess that the only reason it's special is because the authors of the vsphere provider are also the maintainers of the machine-api-operator repo?

Yes. The home for that code base is determined by factors totally irrelevant to the mao design. These factors are variable in time and context. It's merely a matter of enabling the best development workflow for each scenario.

You'll see aws, gcp and azure in individual repos because that made sense at the time they were created. They were forks mimicking an upstream code organisation.
vSphere was created in a completely different context. Having it in its own repo would have had only drawbacks and no benefits. Therefore we chose to put the code base in the mao repo purely to enable a healthier development workflow (e.g. removing the dependency revendoring hassle, on-the-fly CI).

The metal3 bare metal or the openstack actuators could well live in the mao repo and that would make no difference at all to the mao design. It is just natural and convenient for them to be individual repos to enable a more effective development workflow given our teams' organisation.

Now, managing the lifecycle of the underlying cloud required for the machine API to work, designing and understanding how to fetch the required external config, giving semantics to possible failure scenarios and communicating those appropriately - that's beyond the mao design scope, and it's only relevant in a bare metal environment. This is the knowledge that I think is reasonable to embed in an ad-hoc operator, and it would favour a more sustainable component ecosystem with clear responsibility boundaries. The mao assumes this underlying "cloud" exists.


* The expanding need for bare metal specific manifests - for example,
  new CRDs used to drive new BMO capabilities - to be installed early
  in cluster bring-up means introducing yet more bare metal specific
  concerns into the MAO.

Steps (2) and (3) are aspects of cluster bring-up for which the CVO is
clearly well-suited. However, to date, it was understood that creating
a Second Level Operator (SLO) - in other words, a CVO-managed operator -
for bare metal would not make sense, since it would imply a component
installed and running on clusters where it is not needed.

### Goals

Allow bare metal machine management capabilities to be fully
encapsulated in a new SLO.

Ensure that this new SLO has minimal impact on clusters not running on
the `baremetal` platform.

### Non-Goals

## Proposal

Recognizing that bare metal support warrants the creation of a new
"subsystem":

1. Create a new, OpenShift-specific project called the Bare Metal
Operator.
2. This project should implement a new SLO which is responsible for
installing bare metal specific CRDs and running the BareMetalHost
controller and related Metal3 components.
3. The BMO should meet all of the standard expectations of an SLO,
including for example keeping a ClusterOperator resource updated.
4. On infrastructure platforms other than `baremetal`, the BMO should
   follow emerging patterns for "not in use" components, for
   example setting its `ClusterOperator` conditions to `Disabled=true`,
   `Available=true`, and `Progressing=False`.

Contributor:

This implies, I think, that we would not support deploying a UPI cluster on metal (resulting in a platform of ""), then on day 2 configuring the machine API to support fencing hosts or provisioning new nodes. I think that's probably OK for now, but in the future we may want to reconsider what the trigger is for deciding when the new operator is enabled to support that case. Changing that trigger wouldn't be the only work needed, so we'll want another enhancement to work through the design when the time comes.


### Risks and Mitigations

(FIXME)

- Impact on non-baremetal platforms
- Implementation cost and complexity
- Upgrades
- API stability
- Additional bare metal specific functionality in future
- Upstream (Metal3) implications

Contributor:

This section probably needs either some more details or cleanup - I think the most critical items are covered below, e.g. the "Not In Use" SLO Behaviors section for impact on non-baremetal platforms and the discussion around upgrade strategy.


## Design Details

The Bare Metal Operator is a new Second Level Operator (SLO) whose
operand is a controller for the BareMetalHost resource and associated
components from the Metal3 project. The sections below cover
different areas of the design of this new SLO.

### Standard SLO Behaviors

As an SLO, the BMO is expected to adhere to the standard behaviors
expected of an SLO, including:

1. The BMO image should be tagged with
`io.openshift.release.operator=true` and contain a `/manifests`
directory with all of the manifests it requires the CVO to apply,
along with an `image-references` file listing the images referenced
by those manifests that need to be included in the OpenShift
release image.
2. Implement a `ClusterOperator` resource, including updating its
status field with the `Available/Progressing/Degraded` conditions,
operator version number, and any `relatedObjects`.
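
As a hedged illustration of item (2), the sketch below shows the kind of status a bare metal SLO might write to its `ClusterOperator` resource using the `config.openshift.io/v1` types; the operand name, version plumbing, and related objects are assumptions rather than a settled design.

```go
// Package sketch is illustrative only.
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	configv1 "github.com/openshift/api/config/v1"
)

// baremetalClusterOperatorStatus sketches the status block for a "baremetal"
// ClusterOperator: standard conditions, the operator version, and
// relatedObjects. The version name and object list are illustrative.
func baremetalClusterOperatorStatus(version string) configv1.ClusterOperatorStatus {
	now := metav1.Now()
	return configv1.ClusterOperatorStatus{
		Conditions: []configv1.ClusterOperatorStatusCondition{
			{Type: configv1.OperatorAvailable, Status: configv1.ConditionTrue, LastTransitionTime: now},
			{Type: configv1.OperatorProgressing, Status: configv1.ConditionFalse, LastTransitionTime: now},
			{Type: configv1.OperatorDegraded, Status: configv1.ConditionFalse, LastTransitionTime: now},
		},
		Versions: []configv1.OperandVersion{
			{Name: "operator", Version: version},
		},
		RelatedObjects: []configv1.ObjectReference{
			// Pointing at the operator's own namespace lets inspection
			// tooling collect its resources.
			{Group: "", Resource: "namespaces", Name: "openshift-baremetal"},
			{Group: "metal3.io", Resource: "provisionings", Name: "provisioning-configuration"},
		},
	}
}
```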

@stbenjam (Member), May 14, 2020:

Could we use these status fields, like `Degraded=true` or `Available=false, Progressing=true`, to relay information to the installer about how the underlying provisioning activities are going?

The installer takes a naïve approach in determining failure: if it's taking longer than 30m (60m for metal) for all operators to come up, then it's marked as failed. The installer can't tell the difference between "this is taking a while" vs "this will never complete."

However, it's conceivable to exceed 60m. POST can take a long time, or UEFI firmwares can be buggy and need to retry PXE. Indeed, we retry PXE on failure and have seen need for it in some environments.

I'd like to see us take a more intelligent approach for baremetal. If we know we're still working on provisioning and Metal3's Ironic hasn't given up, then we could conceivably wait longer. A recent, successful deployment of a cluster with 3 masters and 6 workers took 97m.

Alternatively, we might know 15m into the process that your BMC creds are garbage, and we could fail early.

It's a really poor UX to have the installer make users wait when it's knowable earlier that the install is failed, and similarly bad UX to fail when we know the underlying infrastructure is still making progress and there's a good chance of success.

I'm not even sure if that belongs in CBO or MAO, and maybe it belongs as a separate enhancement request.

What do you think?

Member:

It's a very good suggestion. Honestly, we could probably trust ironic to apply its own timeouts (and adjust ironic-image as we see fit).


While not required initially, other common SLO patterns can be
considered in future:

1. Implement an operator configuration resource, including
   `OperatorSpec` and `OperatorStatus` (as per openshift/api#125)
2. Implement cluster-level configuration under `config.openshift.io`
(as per openshift/api#124)
3. Expose a `/metrics` endpoint for Prometheus to be configured to
   scrape (via a `ServiceMonitor`) and define any relevant Prometheus
   alert rules based on those metrics, as sketched below.
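
A minimal sketch of item (3), assuming the Prometheus Go client (`client_golang`); the metric name and listen address are placeholders, and a real operator would normally let its framework wire up the registry and server.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// reconcileErrors is a hypothetical counter; the real metric names would be
// chosen when the BMO defines its metrics.
var reconcileErrors = promauto.NewCounter(prometheus.CounterOpts{
	Name: "cluster_baremetal_operator_reconcile_errors_total",
	Help: "Number of failed reconciles of the Provisioning resource.",
})

func main() {
	// Expose the default registry on /metrics; a ServiceMonitor pointing at
	// this port is what lets Prometheus discover and scrape it.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```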

### "Not In Use" SLO Behaviors

Unlike most other SLOs, the BMO is not applicable to all cluster
configurations. On clusters running on an infrastructure platform
other than `baremetal` it should adhere to the emerging expected
behaviors for "not in use" SLOs, including:

1. Setting its `ClusterOperator` condition to `Disabled=true`,
`Available=true`, `Progressing=False` with appropriate messages.
2. User interfaces should convey this disabled state differently than
a failure mode (e.g. by graying it out).
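
For clarity, here is a sketch of what that condition set could look like in Go. `Disabled` is not one of the standard condition types in `config.openshift.io/v1`, so it is expressed as a custom condition type string; the reason and message values are illustrative only.

```go
// Package sketch is illustrative only.
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	configv1 "github.com/openshift/api/config/v1"
)

// notInUseConditions returns the condition set described above for clusters
// that are not on the baremetal platform. Reason and Message are placeholders.
func notInUseConditions() []configv1.ClusterOperatorStatusCondition {
	now := metav1.Now()
	reason := "UnsupportedPlatform"
	return []configv1.ClusterOperatorStatusCondition{
		{
			// Custom condition type; consoles can use it to gray the operator out.
			Type:               configv1.ClusterStatusConditionType("Disabled"),
			Status:             configv1.ConditionTrue,
			Reason:             reason,
			Message:            "Nothing to do on this infrastructure platform",
			LastTransitionTime: now,
		},
		{
			Type:               configv1.OperatorAvailable,
			Status:             configv1.ConditionTrue,
			Reason:             reason,
			LastTransitionTime: now,
		},
		{
			Type:               configv1.OperatorProgressing,
			Status:             configv1.ConditionFalse,
			Reason:             reason,
			LastTransitionTime: now,
		},
	}
}
```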

Currently, insights-operator is the only other example of an SLO
following this pattern of using a `Disabled` status. Other somewhat
similar cases following different patterns include:

- Image registry and samples have a `Removed` management state where
`Degraded=False`, `Progressing=False`, `Available=True` with
`Reason=currentlyUnmanaged`.
- The cloud credential operator has a `disabled` config map setting
  that can be used to disable the operator; it then sets its
  ClusterOperator status conditions to `Degraded=False`,
  `Progressing=False`, `Available=True` with
  `Reason=OperatorDisabledByAdmin` for all three conditions.

### BMO Details

The BMO will:

- Be a new `openshift/cluster-baremetal-operator` project.
- Publish an image called `cluster-baremetal-operator`.
- Use CVO run-level 30, so its manifests will be applied in parallel
  with the MAO's.

Contributor:

Why do we want to run it in parallel instead of having the MAO launch it? We could move the cluster-api-provider-baremetal deployment into the new operator along with everything else, and then the MAO could go back to simply managing 1 Deployment with platform-specific content.

Contributor:

+1 to the above suggestion, then it wouldn't run at all on clusters that aren't baremetal. Is there any precedent within OpenShift for this kind of thing, a third level operator?

Counterpoint: This goes against https://github.com/openshift/enhancements/pull/212/files#r422472511, so I can still see some value in having it separate (e.g. having its own namespace (eventually) would be good if it's going to deploy lots of resources).

Contributor:

I think we have settled on the approach where this SLO is launched by the CVO. The operator then puts itself in a Disabled state immediately when not running on a Baremetal Platform.

- Add a new `baremetal` `ClusterOperator` with an additional
`Disabled` status for non-baremetal platforms.
- Use a new namespace called `openshift-baremetal`.
- Install the `metal3.io` `Provisioning` (cluster-scoped) and
`BareMetalHost` (namespaced) CRDs.
- Run under a new `openshift-baremetal/cluster-baremetal-operator`
`ServiceAccount`.
- Be launched by a `openshift-baremetal/cluster-baremetal-operator`
`Deployment`, copying much of the MAO pod spec in terms of
`system-node-critical` priority class, running on masters, security
context, resource requests, etc.
- Implement a controller reconciling the singleton
  `provisioning-configuration` cluster-scoped `Provisioning` resource.
- Do nothing except set `Disabled=true`, `Available=true`, and
  `Progressing=False` when the `Infrastructure` resource indicates a
  platform type other than `BareMetal`.
- Based on the values in the `Provisioning` resource, create a
  `metal3` `Deployment` and an associated `metal3-mariadb-password`
  secret under the `openshift-baremetal` namespace. This is the same
  as what the MAO currently creates under the `openshift-machine-api`
  namespace (a minimal reconcile sketch follows below).
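
To tie the list above together, here is a hedged sketch of the reconcile flow, assuming a recent controller-runtime and reading the `metal3.io` `Provisioning` resource as unstructured so that the sketch does not depend on the metal3 Go types; the API version, type names, and the omitted rendering step are assumptions rather than a confirmed design.

```go
// Package sketch is illustrative only.
package sketch

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ProvisioningReconciler is a hypothetical reconciler for the singleton
// provisioning-configuration resource described above.
type ProvisioningReconciler struct {
	client.Client
}

func (r *ProvisioningReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Check the infrastructure platform (as in the earlier platform-check
	//    sketch). On anything other than BareMetal, report Disabled=true,
	//    Available=true, Progressing=False and do nothing else.
	infra := &configv1.Infrastructure{}
	if err := r.Get(ctx, types.NamespacedName{Name: "cluster"}, infra); err != nil {
		return ctrl.Result{}, err
	}
	if infra.Status.PlatformStatus == nil || infra.Status.PlatformStatus.Type != configv1.BareMetalPlatformType {
		return ctrl.Result{}, nil
	}

	// 2. Fetch the cluster-scoped singleton Provisioning resource created by
	//    the installer. The group/version used here is an assumption.
	prov := &unstructured.Unstructured{}
	prov.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "metal3.io", Version: "v1alpha1", Kind: "Provisioning",
	})
	if err := r.Get(ctx, types.NamespacedName{Name: "provisioning-configuration"}, prov); err != nil {
		return ctrl.Result{}, err
	}

	// 3. Based on the values in prov, render and apply the metal3 Deployment
	//    and the metal3-mariadb-password secret into openshift-baremetal,
	//    mirroring what the MAO does today in openshift-machine-api. Omitted.
	return ctrl.Result{}, nil
}
```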

### Test Plan

Contributor:

We should add some test plan details here, e.g. at a minimum we need to ensure the fully integrated solution passes the existing e2e-baremetal-ipi tests, and upgrade CI will soon be in place, which can help us prove the upgrade aspects are robust.

### Graduation Criteria

### Upgrade / Downgrade Strategy


### Version Skew Strategy


## Implementation History


## Drawbacks


## Alternatives

### Continue with the BMO under the MAO

In order to fully encapsulate the responsibilities of the BMO in an
operator - and remove the bare metal specific code and manifests from
the MAO - the MAO coul add a generic operator management framework
for platform specific operators, and the BMO would integrate with this
framework.

Such a generic mechanism - where the MAO could discover and apply any
required manifests from the BMO image - would mean adding operator
management capabilities that look very much like some of the CVO's
capabilities.

Adding such a framework seems unnecessarily complex, when there will
only be a single user of this framework.

### Add platform awareness to the CVO

In order to reduce the impact of the BMO when running on non bare
metal platforms, the CVO could gain the ability to manage operators
that are platform-specific, meaning the BMO would only be
installed and run when the CVO detects (via an
`io.openshift.release.platform` image label, for example) that this
operator is required on this platform.

While this may seem a minimal extension to the CVO's capabilities, we
want to avoid a trend where the CVO continues to gain more and more of
such conditional behavior.

### Use a CVO cluster profile

The [cluster profiles
enhancement](https://github.com/openshift/enhancements/pull/200) offers
a generic framework for conditions that affect how the CVO applies the
content in a release image. This framework could be used in this case
by creating a `baremetal` cluster profile, and the BMO would only be
installed when this profile is active.

As per the enhancement document, cluster profiles are being introduced
to initially handle two specific cases (hypershift, CRC) and there is
a desire to proceed cautiously and avoid using cluster profiles
extensively at this point. Also, the enhancement proposes that only a
single cluster profile can be activated at a time, and such a
`baremetal` profile is not something that would naturally be mutually
exclusive with other potential profiles.

Compared to the proposed mechanism for reducing the impact of the BMO
on non bare metal platforms - i.e. the `Disabled` state - there are
greater potential downsides to jumping into using cluster profiles for
this purpose at such an early stage.

## References

- "/enhancements/baremetal/baremetal-provisioning-config.md"
- https://github.com/openshift/machine-api-operator/pull/302
- https://github.com/metal3-io/baremetal-operator/issues/227
- https://github.com/openshift/enhancements/pull/90
- https://github.com/openshift/enhancements/pull/102