Skip to content
This repository was archived by the owner on Mar 23, 2020. It is now read-only.

configure host networking policy with default bridge #47

Merged
merged 2 commits into from
Sep 23, 2019

Conversation

phoracek
Copy link
Contributor

@phoracek phoracek commented Aug 28, 2019

Configure linux bridge as the default interface (so users can connect their VMs to the L2 network).

Depends on: openshift/machine-config-operator#1124 openshift/baremetal-runtimecfg#20

@phoracek phoracek marked this pull request as ready for review September 3, 2019 12:21
@phoracek phoracek force-pushed the setup_bridge branch 2 times, most recently from 906b585 to 371ad43 Compare September 3, 2019 12:30
Copy link
Member

@SchSeba SchSeba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

nice job!

@celebdor I have a question for the network configuration of the environment.
The default_iface is a regular interface (no vlan one) ?
did the interface on the switch side is configured using trunk with native vlan or just a vlan?

CNV/cnv-2.1.0.sh Outdated

echo "Waiting until all nodes come back"
sleep 10
until oc get nodes 2> /dev/null; do sleep 10; done
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can add here a check of the hostnetwork CR to check the state contains the new bridge?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in the following block. We check that the new machineCIDR route is on the bridge.

@sreichar
Copy link
Collaborator

sreichar commented Sep 5, 2019

@phoracek see @SchSeba approves, will merge, can you just confirm it has been tested?

(CI on PRs should be coming soon, but until then jsut checking)

@phoracek
Copy link
Contributor Author

phoracek commented Sep 5, 2019

/hold

@sreichar Want to check against the subnet as @celebdor suggested. We also need to properly test this on bare-metal.

@phoracek phoracek changed the title configure linux bridge as the default network interface WIP configure linux bridge as the default network interface Sep 5, 2019
CNV/cnv-2.1.0.sh Outdated

echo "Setting up a bridge as the default interface"
machineCIDR=$(grep 'machineCIDR' ../OpenShift/install-config.yaml | sed 's/\(.*\): *\(.*\)/\2/')
while ! default_iface=$(oc get nodenetworkstate -o jsonpath="{.items[0].status.currentState.routes.running[?(@.destination==\"${machineCIDR}\")].next-hop-interface}"); do

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have to be so smart? Isn't it known that default_iface==eno1?
I worry that this code would break in case brext already exists on top of eno1.

(Don't get me wrong: at some point we'd want a smart Policy of "apply this on the nic that is on the default route and keep the existing IP")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately we need this. On bare-metal env I used the default nic was eno4, on virtual ens3 (or something similar), it is not even always the first found NIC. This script has two assumptions.

  • All nodes are the same, have the same nic connected to the default network
  • This script is called only once during the initial installation (we can improve it and just skip the config in case brext is already default)

@dankenigsberg
Copy link

I find idempotency quite important. Let's bail out early if brext already exists - we don't want to wait for Karim to report this.

CNV/cnv-2.1.0.sh Outdated

echo "Applying new network configuration on all nodes, one by one, in order to sustain quorum"
machineCIDR=$(grep 'machineCIDR' ../OpenShift/install-config.yaml | sed 's/\(.*\): *\(.*\)/\2/')
storageCIDR="172.22.0.0/24" # TODO: get from a config file
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this subnet is reserved for provisioning, and the only nic in the deployment which is up on this network is the node where the baremetal-operator is running, so you probably need to choose a different range and configure storage traffic to go via a vlan - I think @sreichar has a spreadsheet with vlan IDs and subnet ranges to refer to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On my libvirt setup, this subnet is available on all nodes on the secondary interface. storageCIDR is indeed a bad name for this. I will drop VLAN configuration from this PR and limit it to bridge, so we get something working in first. I'd like to sync with you folks to discuss what should be done with the VLAN part.

@phoracek phoracek force-pushed the setup_bridge branch 2 times, most recently from 96b18ec to eae009e Compare September 13, 2019 09:05
@phoracek
Copy link
Contributor Author

phoracek commented Sep 13, 2019

@celebdor @karmab the script works fine except one issue. Once we reconfigure networking on two masters (successfully, keeping connectivity), we lose API server Unable to connect to the server: dial tcp 192.168.111.5:6443: connect: no route to host. Any suggestions how to fix that? I think it fails to reach the IP address given by keepalived.

@celebdor
Copy link

keepalived pods need to be restarted to pick up on the interface change

@phoracek
Copy link
Contributor Author

Pushed some changes and fixes. In this state, the configuration should work well. The only issue we have yet to solve is the keepalived restart. Then we can move to test this on bare-metal and hopefully merge.

@phoracek phoracek changed the title WIP configure linux bridge as the default network interface configure linux bridge as the default network interface Sep 16, 2019
@johnbieren
Copy link
Contributor

johnbieren commented Sep 17, 2019

[test] since I canceled the build earlier to debug a problem in the CI environment

@karmab
Copy link
Contributor

karmab commented Sep 18, 2019

we can install ocs before cnv, order doesn't matter.
let s confirm nmstate addresses the bridge reconfiguration before merging though.
we will also need to remove the corresponding steps from the post install script (it also means that when there s no internal registry needed, we will no longer reboot the masters during the install)

@phoracek
Copy link
Contributor Author

Before we can test this PR, we need keepalived fix. @celebdor could we use the initial workaround you created, so we have something to start with? I'm happy to open needed PRs, just tell me where it should be done.

@karmab once we have keepalived fixed and this PR merged, would you be so kind to remove post-install bits? I'm not that familiar with the project.

@johnbieren
Copy link
Contributor

johnbieren commented Sep 18, 2019

Before we can test this PR, we need keepalived fix. @celebdor could we use the initial workaround you created, so we have something to start with? I'm happy to open needed PRs, just tell me where it should be done.

@karmab once we have keepalived fixed and this PR merged, would you be so kind to remove post-install bits? I'm not that familiar with the project.

I think for this to be properly tested in CI, the post-install bits should be removed as part of it. In CI right now, this passed, but we think it is because the post-install bits removed the bridge so the changes in this PR saw the bridge and didn't do anything.

@jparrill
Copy link
Contributor

jparrill commented Sep 18, 2019

Needs rebase on CNV/cnv-2.1.0.sh:160:

from:

metadata:
  name: hyperconverged-cluster
  namespace: "${TARGET_NAMESPACE}"
spec: {}

to:

metadata:
  name: hyperconverged-cluster
  namespace: "${TARGET_NAMESPACE}"
spec:
  BareMetalPlatform: true
EOF

I've tested the usual way OCP > CNV > OCS and works fine.

@jparrill
Copy link
Contributor

jparrill commented Sep 18, 2019

OCP42 and OCS went fine but the CNV deployment went bad for me:

+ echo 'Creating registry secret'
Creating registry secret
+ cat
+ oc create -f -
secret/quay-registry-rh-osbs-operators created
+ echo 'Creating OperatorGroup'
Creating OperatorGroup
+ cat
+ oc create -f -
operatorgroup.operators.coreos.com/openshift-cnv-group created
+ echo 'Creating OperatorSource'
Creating OperatorSource
+ oc create -f -
+ cat
operatorsource.operators.coreos.com/rh-osbs-operators created
+ echo 'Give the cluster 30 seconds to create the catalogSourceConfig...'
Give the cluster 30 seconds to create the catalogSourceConfig...
+ sleep 30
+ cat
+ oc apply -f -
catalogsourceconfig.operators.coreos.com/hco-catalogsource-config created
+ echo 'Give the cluster 30 seconds to process catalogSourceConfig...'
Give the cluster 30 seconds to process catalogSourceConfig...
+ sleep 30
+ oc wait deploy hco-catalogsource-config --for condition=available -n openshift-marketplace --timeout=360s
deployment.extensions/hco-catalogsource-config condition met
++ seq 1 10
+ for i in $(seq 1 $RETRIES)
+ echo 'Waiting for packagemanifest '\''kubevirt-hyperconverged'\'' to be created in namespace '\''openshift-cnv'\''...'
Waiting for packagemanifest 'kubevirt-hyperconverged' to be created in namespace 'openshift-cnv'...
+ oc get packagemanifest -n openshift-cnv kubevirt-hyperconverged
NAME                      CATALOG             AGE
kubevirt-hyperconverged   rh-osbs-operators   59s
+ break
+ echo 'Creating Subscription'
Creating Subscription
+ cat
+ oc create -f -
subscription.operators.coreos.com/hco-operatorhub created
+ echo 'Give OLM 60 seconds to process the subscription...'
Give OLM 60 seconds to process the subscription...
+ sleep 60
+ sed 's/approved: false/approved: true/'
+ oc apply -n openshift-cnv -f -
++ oc get installplan -n openshift-cnv --no-headers
++ grep 'kubevirt-hyperconverged-operator.v{CNV_VERSION}'
++ awk '{print $1}'
No resources found.
+ oc get installplan -o yaml -n openshift-cnv
error: no objects passed to apply
make: *** [Makefile:8: deploy] Error 1
Ending Time: 16:11:54
Seconds: 123

The actual status:

[root@nostromo ~]# oc get pods -n openshift-marketplace
NAME                                        READY   STATUS    RESTARTS   AGE
certified-operators-7bc89d8c5d-h4rsk        1/1     Running   0          49m
community-operators-9d6976f8f-72b5q         1/1     Running   0          49m
hco-catalogsource-config-75ffbc7577-sstms   1/1     Running   0          17m
local-storage-manifests-8z662               1/1     Running   0          29m
marketplace-operator-874f769f6-52hk8        1/1     Running   0          50m
ocs-catalogsource-l7x9v                     1/1     Running   0          29m
redhat-operators-75c54d5bc4-9gb2f           1/1     Running   0          49m
rh-osbs-operators-6fbc6c5dcc-q9rpj          1/1     Running   0          18m
[root@nostromo ~]# oc get ip -n openshift-cnv
No resources found.
[root@nostromo ~]# oc get csv -n openshift-cnv
No resources found.

image

UPDATE: Issued a new bug

@jparrill jparrill mentioned this pull request Sep 18, 2019
@phoracek
Copy link
Contributor Author

Ok, removal of code from post-install scared me. Let me use this only for bridge (I will panic and add VLAN soon, don't worry).

@phoracek phoracek force-pushed the setup_bridge branch 2 times, most recently from 52e6719 to e4628cb Compare September 18, 2019 16:43
@phoracek phoracek changed the title configure host networking policy configure host networking policy with default bridge Sep 18, 2019
@phoracek phoracek force-pushed the setup_bridge branch 3 times, most recently from c2fc7d8 to 115a917 Compare September 20, 2019 08:38
@phoracek
Copy link
Contributor Author

We were finally able to verify this. The new script successfully set up the default bridge on a cluster deployed with @celebdor's openshift/baremetal-runtimecfg#20 and openshift/machine-config-operator#1124. Once those two PRs are merged, we can finish this one.

@celebdor
Copy link

I hope we can soon improve it so that no re-triggering of the script is necessary when growing the worker amount. Until then...
/approve

@johnbieren
Copy link
Contributor

johnbieren commented Sep 20, 2019

[test]

@phoracek
Copy link
Contributor Author

@karmab the CI failed with:

"Error from server (NotFound): deployments.extensions \"hco-catalogsource-config\" not found"

Do I need to rebase?

Configure linux bridge as the default network interface.

Signed-off-by: Petr Horacek <[email protected]>
Due to a bug in kubernetes-nmstate/nmstate/NM, we get a random MAC
address on our bridge after reboot. Because of that we get a different
IP address and lose the host.

With this patch, we explicitly request MAC of the NIC on the bridge.

Signed-off-by: Petr Horacek <[email protected]>
@phoracek
Copy link
Contributor Author

CI failed on CNV deloyment: https://jenkins-fci-continuous-productization.cloud.paas.psi.redhat.com/job/Install_Scripts_CI/66/console

I rebased the PR, hopefully it will work now.

@karmab
Copy link
Contributor

karmab commented Sep 23, 2019

lgtt

@karmab karmab merged commit d780d13 into openshift-kni:master Sep 23, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants