Commit 945dc8d

baremetal SLO: add a new discussion section
Lots of good discussion in PR comments that would be good to capture as rationale that can be referred to later.
1 parent ec009dc commit 945dc8d

1 file changed: +149 -0 lines changed

enhancements/baremetal/an-slo-for-baremetal.md

Lines changed: 149 additions & 0 deletions
@@ -434,6 +434,155 @@ bare metal platforms - i.e. the `Disabled` state - there are greater
potential downsides from jumping into using cluster profiles for this
at this early stage.

## Discussion

**Q: Should BMO be CVO-managed, OLM-managed, or SLO-managed?**

@smarterclayton

I believe [BMO] should be managed by the machine api operator. CVO
does not manage "operators", it manages resources. It does not do
conditional logic for operator deployment. That's the responsibility
of second level operators, of which MAO is one.

I don't see much difference between the current mechanism of MAO
deploying an actuator (a controller AKA an operator) and MAO deploying
the bare metal operator.

Why can't launching BMO under MAO be exactly like launching an
actuator, and then BMO manages the actuator? Or simply make the bare
metal actuator own the responsibility of managing the lifecycle of its
components?

How can we make "managing sub operators" cheaper by reducing
deployment complexity?

There needs to be a second level operator that either deploys or
manages the appropriate machine components for the current
infrastructure platform.

There appears to be a missing “machine-infrastructure” operator that
acts like cluster network operator and deploys the right
components. I’m really confused why that wouldn’t just be “machine api
operator”.

Having unique operators per infrastructure sounds like an anti-pattern
if we already have a top level operator.

@deads2k

There are development and support benefits to being able to divide
responsibilities between the machine-api-operator making calls to a
cloud provider API and the mechanisms that provide those cloud
provider APIs themselves and the support infrastructure for the
machines. Doing so forces good planning and API boundaries on both the
MAO and the baremetal deployments. … clear separation of
responsibility and failures for both developers and customers.

@smarterclayton

An SLO is a "component" or "subsystem" - given what we know today,
bare metal feels like our one infrastructure platform that most
deserves to be viewed as its own subsystem.

**Q: How should BMO behave if it is SLO-managed?**

@deads2k

[Add BMO] to the payload and then the baremetal operator would put
itself into a Disabled state if it was on a non-metal platform.

@smarterclayton

Disabled operators already need special treatment in the API. They
must be efficient and self-effacing when not used, as the image
registry, samples, and insights operators must be (marked disabled,
deemphasized in the UI).

The baremetal-operator is installed by default. If infrastructure !=
BareMetal on startup, it just pauses (does nothing) and sets its
cluster operator conditions to Disabled=True, Available=True,
Progressing=False with appropriate messages; if infrastructure ==
BareMetal, it runs as normal. The cluster operator object is
always set, but when disabled, user interfaces should convey that
disabled state differently than failing (by graying it out).

BMO must fully participate in CVO lifecycle. CVO enforces upgrade
rules. BMO API must be stable.

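The startup behavior described in the comment above can be sketched as a small decision function. This is a minimal Python illustration, not the operator's actual code; the function name `startup_conditions` and the string-valued statuses are assumptions made for the example. The conditions reported on bare metal other than `Disabled` are omitted because the discussion does not specify them.

```python
BARE_METAL = "BareMetal"


def startup_conditions(platform_type):
    """Return the cluster operator conditions to set at startup.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    if platform_type != BARE_METAL:
        # Non-metal platform: pause and report Disabled, while still
        # reporting Available so the cluster is not shown as failing.
        return {
            "Disabled": "True",
            "Available": "True",
            "Progressing": "False",
        }
    # Bare metal: run as normal. Assumption: the remaining conditions are
    # then driven by regular reconciliation, so only Disabled is fixed here.
    return {"Disabled": "False"}


if __name__ == "__main__":
    for platform in ("AWS", "BareMetal"):
        print(platform, startup_conditions(platform))
```

Calling `startup_conditions("AWS")` yields a disabled but available operator, which is the state a user interface could gray out instead of flagging as a failure.
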
**Q: Should bare metal specific CRDs be installed on all clusters or
only on bare metal clusters?**

@smarterclayton

[Bare metal specific CRDs] feel like they are part of MAO, just like
CNO installs CRDs for the two core platform types. In general, CNO
already demonstrates this pattern and is successful doing so, so the
default answer for this pattern is MAO should behave like CNO and any
deviation needs justification.

@derekwaynecarr

I think it's an error that we have namespaces and CRDs deployed to a
cluster for contexts that are not appropriate. We should aspire to
move away from that rather than continue to lean into it. For example,
every cluster has an openshift-kni-infra or openshift-ovirt-infra even
where it is not appropriate.

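The alternative suggested in the last comment, creating platform-specific namespaces only where they apply, could look roughly like the following Python sketch. The mapping from namespace to platform type and the helper name `namespaces_for` are assumptions for illustration; only the two namespace names come from the discussion.

```python
# Assumed mapping of infrastructure namespaces to the platform type
# they belong to; not taken from any OpenShift manifest.
INFRA_NAMESPACES = {
    "openshift-kni-infra": "BareMetal",
    "openshift-ovirt-infra": "oVirt",
}


def namespaces_for(platform_type):
    """Return the infra namespaces appropriate for the given platform."""
    return sorted(
        namespace
        for namespace, platform in INFRA_NAMESPACES.items()
        if platform == platform_type
    )


if __name__ == "__main__":
    for platform in ("BareMetal", "oVirt", "AWS"):
        print(platform, namespaces_for(platform))
```

Under this scheme a cluster on an unrelated platform would get neither namespace, in contrast to the current behavior the comment criticizes.
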
**Q: Why not use CVO profiles to control when BMO is deployed?**

@smarterclayton

CVO Profiles were not intended to be dynamic or conditional (and there
are substantial risks to doing that).

Profiles don't seem appropriate for conditional parameterization of
the payload based on global configuration.

The general problem with profiles is that they expand the scope of the
templating the payload provides. [..] If we expanded this to include
operators that are determined by infrastructure, then we're
potentially introducing a new variable (not just a new profile), since
we very well may want to deploy bare metal operator in a hypershift
mode.

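The concern about introducing a new variable rather than a new profile can be illustrated with a short Python sketch. The profile names and the `include_bmo` flag are hypothetical; the point is only that an infrastructure-dependent choice is orthogonal to the profile, so the number of payload renderings multiplies instead of growing by one.

```python
from itertools import product


def payload_variants(profiles, include_bmo_options=(False, True)):
    """Enumerate every (profile, include_bmo) rendering of the payload.

    Hypothetical model: real payload templating is not structured this way.
    """
    return [
        {"profile": profile, "include_bmo": include_bmo}
        for profile, include_bmo in product(profiles, include_bmo_options)
    ]


if __name__ == "__main__":
    variants = payload_variants(["profile-a", "profile-b"])
    for variant in variants:
        print(variant)
    print(len(variants), "variants")
```

With two profiles and one boolean variable there are already four renderings to test and support, and each further variable doubles that again.
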
**Q: Why not name the new operator "metal3-operator"? Should this
operator come from the Metal3 upstream project?**

@markmc

Naming - in terms of what name shows up in `oc get clusteroperator`, I
think that should be a name reflecting the functionality in user terms
rather than the software project brand. And if `baremetal` is the name
of the clusteroperator, then I think it makes sense to follow that
through and metal3 is an implementation detail.

Scope - if we imagine other bare metal related functionality in
OpenShift that isn't directly related to the Metal3 project, do we
think that should fall under another SLO, or this one? I think it's
best to say this new SLO is where bare metal related functionality
would be managed.

Upstream project - you could imagine an upstream project which would
encapsulate the [kustomize-based deployment
scenarios](https://github.com/openshift/baremetal-operator/blob/master/docs/ironic-endpoint-keepalived-configuration.md#kustomization-structure)
in the metal3/baremetal-operator project. We could re-use something
like that, but we would also need to add OpenShift integration
downstream - e.g. the clusteroperator, and checking the platform type
in the infrastructure resource. Is there an example of another SLO
that is derived from an operator that is more generally applicable?

**Q: Why not use `ownerReferences` on the `metal3` deployment to
indicate that it is owned by the CBO?**

(Discussion ongoing in the PR)

**Q: If the concern is there is "too much bare metal stuff" in the MAO
repo, wouldn't that concern also apply to the [vSphere
actuator](https://github.com/openshift/machine-api-operator/tree/master/pkg/controller/vsphere)?**

(Discussion ongoing in the PR)

## References

- "/enhancements/baremetal/baremetal-provisioning-config.md"
