-
Notifications
You must be signed in to change notification settings - Fork 20
configure host networking policy with default bridge #47
Conversation
22d1e93
to
161c142
Compare
906b585
to
371ad43
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
nice job!
@celebdor I have a question for the network configuration of the environment.
The default_iface
is a regular interface (no vlan one) ?
did the interface on the switch side is configured using trunk with native vlan or just a vlan?
CNV/cnv-2.1.0.sh
Outdated
|
||
echo "Waiting until all nodes come back" | ||
sleep 10 | ||
until oc get nodes 2> /dev/null; do sleep 10; done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can add here a check of the hostnetwork CR to check the state contains the new bridge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in the following block. We check that the new machineCIDR route is on the bridge.
371ad43
to
a01d7c4
Compare
a01d7c4
to
4c268e9
Compare
CNV/cnv-2.1.0.sh
Outdated
|
||
echo "Setting up a bridge as the default interface" | ||
machineCIDR=$(grep 'machineCIDR' ../OpenShift/install-config.yaml | sed 's/\(.*\): *\(.*\)/\2/') | ||
while ! default_iface=$(oc get nodenetworkstate -o jsonpath="{.items[0].status.currentState.routes.running[?(@.destination==\"${machineCIDR}\")].next-hop-interface}"); do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have to be so smart? Isn't it known that default_iface==eno1
?
I worry that this code would break in case brext
already exists on top of eno1
.
(Don't get me wrong: at some point we'd want a smart Policy of "apply this on the nic that is on the default route and keep the existing IP")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately we need this. On bare-metal env I used the default nic was eno4, on virtual ens3 (or something similar), it is not even always the first found NIC. This script has two assumptions.
- All nodes are the same, have the same nic connected to the default network
- This script is called only once during the initial installation (we can improve it and just skip the config in case brext is already default)
I find idempotency quite important. Let's bail out early if |
4c268e9
to
5dd8106
Compare
CNV/cnv-2.1.0.sh
Outdated
|
||
echo "Applying new network configuration on all nodes, one by one, in order to sustain quorum" | ||
machineCIDR=$(grep 'machineCIDR' ../OpenShift/install-config.yaml | sed 's/\(.*\): *\(.*\)/\2/') | ||
storageCIDR="172.22.0.0/24" # TODO: get from a config file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note this subnet is reserved for provisioning, and the only nic in the deployment which is up on this network is the node where the baremetal-operator is running, so you probably need to choose a different range and configure storage traffic to go via a vlan - I think @sreichar has a spreadsheet with vlan IDs and subnet ranges to refer to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On my libvirt setup, this subnet is available on all nodes on the secondary interface. storageCIDR is indeed a bad name for this. I will drop VLAN configuration from this PR and limit it to bridge, so we get something working in first. I'd like to sync with you folks to discuss what should be done with the VLAN part.
96b18ec
to
eae009e
Compare
@celebdor @karmab the script works fine except one issue. Once we reconfigure networking on two masters (successfully, keeping connectivity), we lose API server |
eae009e
to
aa3aa8a
Compare
keepalived pods need to be restarted to pick up on the interface change |
aa3aa8a
to
9a2e696
Compare
Pushed some changes and fixes. In this state, the configuration should work well. The only issue we have yet to solve is the keepalived restart. Then we can move to test this on bare-metal and hopefully merge. |
[test] since I canceled the build earlier to debug a problem in the CI environment |
we can install ocs before cnv, order doesn't matter. |
Before we can test this PR, we need keepalived fix. @celebdor could we use the initial workaround you created, so we have something to start with? I'm happy to open needed PRs, just tell me where it should be done. @karmab once we have keepalived fixed and this PR merged, would you be so kind to remove post-install bits? I'm not that familiar with the project. |
I think for this to be properly tested in CI, the post-install bits should be removed as part of it. In CI right now, this passed, but we think it is because the post-install bits removed the bridge so the changes in this PR saw the bridge and didn't do anything. |
Needs rebase on CNV/cnv-2.1.0.sh:160: from:
to:
I've tested the usual way OCP > CNV > OCS and works fine. |
OCP42 and OCS went fine but the CNV deployment went bad for me:
The actual status:
UPDATE: Issued a new bug |
Ok, removal of code from post-install scared me. Let me use this only for bridge (I will panic and add VLAN soon, don't worry). |
52e6719
to
e4628cb
Compare
c2fc7d8
to
115a917
Compare
We were finally able to verify this. The new script successfully set up the default bridge on a cluster deployed with @celebdor's openshift/baremetal-runtimecfg#20 and openshift/machine-config-operator#1124. Once those two PRs are merged, we can finish this one. |
55bcc6e
to
938c5d8
Compare
I hope we can soon improve it so that no re-triggering of the script is necessary when growing the worker amount. Until then... |
[test] |
@karmab the CI failed with:
Do I need to rebase? |
Configure linux bridge as the default network interface. Signed-off-by: Petr Horacek <[email protected]>
Due to a bug in kubernetes-nmstate/nmstate/NM, we get a random MAC address on our bridge after reboot. Because of that we get a different IP address and lose the host. With this patch, we explicitly request MAC of the NIC on the bridge. Signed-off-by: Petr Horacek <[email protected]>
938c5d8
to
a54fe41
Compare
CI failed on CNV deloyment: https://jenkins-fci-continuous-productization.cloud.paas.psi.redhat.com/job/Install_Scripts_CI/66/console I rebased the PR, hopefully it will work now. |
lgtt |
Configure linux bridge as the default interface (so users can connect their VMs to the L2 network).
Depends on: openshift/machine-config-operator#1124 openshift/baremetal-runtimecfg#20