DNM/TEST: OCPBUGS-56281 #2581
base: release-4.19
Conversation
Signed-off-by: Surya Seetharaman <[email protected]>
This is the CRD for Network QoS, based on the enhancement ovn-kubernetes/ovn-kubernetes#4366 Signed-off-by: Flavio Fernandes <[email protected]>
Signed-off-by: Flavio Fernandes <[email protected]>
cd go-controller && ./hack/update-codegen.sh Signed-off-by: Flavio Fernandes <[email protected]>
This commit adds a new flag, enable-network-qos, which feature-gates the Network QoS feature. The flag is exposed in the kind scripts as -nqe or --network-qos-enable (kind.sh -nqe, kind-helm.sh -nqe). Alternatively, it can be enabled via an exported shell variable: OVN_NETWORK_QOS_ENABLE=true kind.sh Signed-off-by: Flavio Fernandes <[email protected]>
This commit adds the preparatory bits needed by the Network QoS controller in the following commit. It also grants the system account sufficient permissions to list/watch/patch the new CRD. Signed-off-by: Flavio Fernandes <[email protected]>
This commit implements Network QoS controller, based on the enhancement ovn-kubernetes/ovn-kubernetes#4366 Signed-off-by: Flavio Fernandes <[email protected]> Signed-off-by: Xiaobin Qu <[email protected]>
Signed-off-by: Xiaobin Qu <[email protected]>
e2e: skip test: the Network QoS feature depends on multi-homing to exercise a secondary NAD. Signed-off-by: Flavio Fernandes <[email protected]>
misc changes to address review comments. Signed-off-by: Xiaobin Qu <[email protected]> Signed-off-by: Flavio Fernandes <[email protected]>
Signed-off-by: Flavio Fernandes <[email protected]>
Use the NetworkSelector spec to match net-attach-defs instead of explicitly specifying a list of net-attach-defs. Signed-off-by: Xiaobin Qu <[email protected]> Signed-off-by: Flavio Fernandes <[email protected]>
Fix golangci-lint findings:
- importas: `k8s.io/api/core/v1` must be imported as `corev1` (network_qos_pod.go:9, network_qos_controller.go:9, network_qos_namespace.go:8, network_qos_node.go:10)
- importas: `github.com/ovn-org/libovsdb/ovsdb` must not be aliased as `libovsdb` (network_qos_ovnnb.go:10)
- importas: `k8s.io/apimachinery/pkg/api/errors` must be imported with the alias `apierrors` (types.go:13)
- errcheck: error return value of `addrset.AddAddresses` is not checked (network_qos_test.go:547)
- gci: import blocks not formatted per the configured section order (factory_test.go:41,44; network_qos_test.go:12; further occurrences hidden by --max-same-issues)
- revive unused-parameter: `old` (factory_test.go:2160), `nqosKey` (network_qos.go:70,84,91), `value` (types.go:352), `nbGlobal` (network_qos_test.go:906)
Signed-off-by: Flavio Fernandes <[email protected]>
Change `port` in the classifier to `ports`, which holds a list of protocol-and-port combinations. Signed-off-by: Xiaobin Qu <[email protected]> Signed-off-by: Flavio Fernandes <[email protected]>
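As an illustration of the new shape, a NetworkQoS classifier might look like the sketch below; the exact field names and apiVersion are assumptions based on this commit message and the enhancement (ovn-kubernetes/ovn-kubernetes#4366), so the generated CRD remains the authoritative schema:

```yaml
# Illustrative sketch only; field names are assumed from the commit description.
apiVersion: k8s.ovn.org/v1
kind: NetworkQoS
metadata:
  name: example-qos
  namespace: default
spec:
  priority: 100
  egress:
  - dscp: 40
    classifier:
      ports:              # previously a single `port` entry
      - protocol: TCP
        port: 8080
      - protocol: UDP
        port: 5353
```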
cd go-controller && ./hack/update-codegen.sh Signed-off-by: Flavio Fernandes <[email protected]>
Tests:
- host-networked pod -> nodeport -> ovnk backend (UDP)
- VM (1500 MTU) -> host-networked pod on another node (TCP)
- ovnk pod -> host-networked pod on another node (UDP)
Each of these tests would trigger a node to install a lower-MTU route in its cache towards another node, due to PMTUD. The test fails if it detects such a route. Signed-off-by: Tim Rozet <[email protected]>
Create nftables rules to block sending ICMP needs frag/packet too big for known Kubernetes node IPs. PMTUD between nodes can be detrimental to the cluster. Note, this does not affect PMTUD messages received from an intermediary router. For shared gateway mode, also install OpenFlow rules to drop needs-frag packets from the OVN GR that are destined to known Kubernetes nodes. Signed-off-by: Tim Rozet <[email protected]>
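The nftables rules described above could look roughly like the following sketch; the table, chain, and set names here are assumptions for illustration (the actual names are defined by the implementation), and the addresses are placeholders:

```nft
table inet no-pmtud {
    # known Kubernetes node IPs (placeholder addresses)
    set remote-node-ips {
        type ipv4_addr
        elements = { 192.0.2.10, 192.0.2.11 }
    }
    set remote-node-ips-v6 {
        type ipv6_addr
        elements = { 2001:db8::10, 2001:db8::11 }
    }
    chain output {
        type filter hook output priority filter; policy accept;
        # drop ICMP "fragmentation needed" sent towards other cluster nodes
        icmp type destination-unreachable icmp code frag-needed ip daddr @remote-node-ips counter drop
        # IPv6 equivalent: ICMPv6 "packet too big"
        icmpv6 type packet-too-big ip6 daddr @remote-node-ips-v6 counter drop
    }
}
```

PMTUD messages arriving from intermediary routers are unaffected because only locally generated packets towards the node-IP sets traverse this output hook.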
This test works under the premise that sending UDP packets via netcat would trigger ICMP needs frag. For shared gateway mode, packets to a host-networked pod endpoint are not fragmented and are rerouted out of the GR back to the next host endpoint. However, for OVN-networked endpoints and shared gateway mode, the packet will hit the GR and then, even if the DF bit is not set, the OVN GR will send back an ICMP needs frag (because OVS is incapable of fragmentation across interfaces with different MTUs). For local gateway mode, large packets that hit an intermediary node are DNAT'ed in iptables to the cluster IP service. There is then a route for the cluster IP service set to a 1400-byte MTU, which causes the kernel to drop the packet when the DF bit is set. Since our new logic prevents ICMP needs frag from installing a cached MTU route, the client will continue to send packets with a 1500-byte MTU and they will be dropped. We choose not to fix this for now as it was identified as not a practical use case: https://issues.redhat.com/browse/OCPBUGS-7609 It could be fixed in the future by using an ip rule to match on the nodeport port range and redirecting traffic to another routing table with a lower MTU on the route. Signed-off-by: Tim Rozet <[email protected]>
This allows ovn_cluster_router to send ICMP needs frag when too large a packet is sent from a pod (such as from a kubevirt VM with too large an MTU). See https://issues.redhat.com/browse/FDP-1295 for more details. Signed-off-by: Tim Rozet <[email protected]>
Today, we require that a VF-accelerated gateway interface be explicitly configured. This change allows a VF-based accelerated gateway interface without explicit configuration. Signed-off-by: Yun Zhou <[email protected]>
- namespace/pod/node handlers detect possible changes and trigger reconciliation by putting the networkqos in the event queue. - reconcile networkqos rules in the networkqos handler only Signed-off-by: Xiaobin Qu <[email protected]> Signed-off-by: Flavio Fernandes <[email protected]>
Signed-off-by: nithyar <[email protected]>
Guard against PMTUD route caching for Kubernetes Node IPs
Signed-off-by: Xiaobin Qu <[email protected]>
- acquire the lock at the start of the networkqos handler - add missing DeepEqual check in the networkqos event handler Signed-off-by: Xiaobin Qu <[email protected]>
Add Support for Network QoS
Signed-off-by: nithyar <[email protected]>
Signed-off-by: nithyar <[email protected]>
Signed-off-by: nithyar <[email protected]>
A chassis should handle multiple encaps, and they can be added or removed dynamically. The CreateOrUpdateChassis function is programmed to set only the list of current encaps based on the node-encap-ips annotation, and stale entries are garbage collected. Signed-off-by: nithyar <[email protected]>
Use network ID as one of the ACL DB keys to avoid any potential conflicts for networks with the same name. Signed-off-by: Patryk Diak <[email protected]>
Signed-off-by: Patryk Diak <[email protected]>
…n_isolation.go Signed-off-by: Patryk Diak <[email protected]>
Signed-off-by: Patryk Diak <[email protected]>
Use the constant value for global DB entries. Use the actual controller for per-network ACLs. Signed-off-by: Patryk Diak <[email protected]>
…ction Signed-off-by: Patryk Diak <[email protected]>
Drop traffic between advertised UDN networks
In our pod handlers we have code that checks whether a pod is scheduled. If the pod is not scheduled, we do not add it to our retry framework. However, there are times when we automatically enqueue all pods on a node or across the cluster. This can happen when a new node comes up (pre-IC) and we need to add all pods on that node. Another case is when a new UDN/NAD is created and the controller spins up; we add all pods then. We should not be queuing pods to the retry framework that are not scheduled; it is a waste of operations. However, even if we do enqueue them, we definitely should not be treating a non-scheduled resource as an error and retrying it again later. This commit changes the retry framework to detect the above case and log an error. It does not trigger retrying of the resource, which may perpetually fail and then cause OVNKubernetesResourceRetryFailure. Signed-off-by: Tim Rozet <[email protected]>
Related to investigating the root cause for #5260. This commit removes adding pods that are not scheduled to the retry framework. When the pod is scheduled, the controller will receive an event. Additionally, these functions that add pods were using the kubeclient instead of the informer cache. That means every time a UDN was added, we would issue a kubeclient call to get all pods, which is really bad for performance. Signed-off-by: Tim Rozet <[email protected]>
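The scheduling check at the heart of these two commits can be sketched in Go. The `Pod` struct here is a minimal stand-in for `corev1.Pod` (the real controller inspects `pod.Spec.NodeName`), so this is only an illustration of the filtering logic, not the actual ovn-kubernetes code:

```go
package main

import "fmt"

// Pod is a minimal stand-in for corev1.Pod; NodeName models Spec.NodeName,
// which stays empty until the scheduler assigns the pod to a node.
type Pod struct {
	Name     string
	NodeName string
}

// shouldEnqueue mirrors the idea in the commits above: only pods that are
// already scheduled are handed to the retry framework. Unscheduled pods are
// skipped (not retried as errors), because a fresh update event arrives once
// the scheduler assigns them a node.
func shouldEnqueue(p Pod) bool {
	return p.NodeName != ""
}

func main() {
	pods := []Pod{
		{Name: "web-0", NodeName: "node-a"}, // scheduled: enqueue
		{Name: "web-1"},                     // pending: skip, no retry
	}
	for _, p := range pods {
		fmt.Printf("%s enqueue=%v\n", p.Name, shouldEnqueue(p))
	}
}
```

Listing pods via the informer cache instead of the kubeclient applies the same filter without a round trip to the API server on every UDN add.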
/hold
/payload-aggregate periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-gcp-ovn-rt-upgrade 5
@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a56e64f0-35ba-11f0-86d6-758e923bf91e-0
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: jluhrsen The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@neisw: This PR was included in a payload test run from openshift/origin#29811
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/22557bf0-35d1-11f0-8776-367250159e02-0
/retest
@jluhrsen: trigger 0 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
/payload-aggregate periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade 5
@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/11eba6e0-3608-11f0-9475-24a92b865a9c-0
/payload-aggregate periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade 5
build02 had issues
@neisw: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/15654400-365a-11f0-92fc-3d3394dd2fbf-0
/payload-aggregate periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade 5
@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/aa884580-36c5-11f0-8f6e-20bdb195ed88-0
/payload-aggregate periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade 5
@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8edc4100-3761-11f0-8eb0-7d8ede057fba-0
This is a full downstream sync from upstream master to 4.19, plus two in-flight PRs on top, to test whether we can resolve OCPBUGS-56281.