DNM/TEST: OCPBUGS-56281 w/out 2560 #2582


Open
wants to merge 81 commits into base: release-4.19

Conversation

jluhrsen
Contributor

πŸ“‘ Description

Fixes #

Additional Information for reviewers

βœ… Checks

  • My code requires changes to the documentation
  • if so, I have updated the documentation as required
  • My code requires tests
  • if so, I have added and/or updated the tests as required
  • All the tests have passed in the CI

How to verify it

tssurya and others added 30 commits April 22, 2025 22:56
This is the CRD for Network QoS, based on the enhancement
ovn-kubernetes/ovn-kubernetes#4366

Signed-off-by: Flavio Fernandes <[email protected]>
cd go-controller && ./hack/update-codegen.sh

Signed-off-by: Flavio Fernandes <[email protected]>
This commit adds a new flag called
enable-network-qos, which can be used
to feature gate the Network QoS feature.

The following flag is exposed in kind
scripts as -nqe or --network-qos-enable:

   kind.sh -nqe
   kind-helm.sh -nqe

Alternatively, it can be enabled via an
exported shell variable:

   OVN_NETWORK_QOS_ENABLE=true kind.sh

Signed-off-by: Flavio Fernandes <[email protected]>
This commit adds the preparation bits needed by the network qos
controller in the following commit. It also grants the system account
sufficient permissions to list/watch/patch the new CRD accordingly.

Signed-off-by: Flavio Fernandes <[email protected]>
This commit implements Network QoS controller, based on the enhancement
ovn-kubernetes/ovn-kubernetes#4366

Signed-off-by: Flavio Fernandes <[email protected]>
Signed-off-by: Xiaobin Qu <[email protected]>
e2e skip test: Network QoS feature depends on multi-homing
to exercise secondary NAD

Signed-off-by: Flavio Fernandes <[email protected]>
misc changes to address review comments.

Signed-off-by: Xiaobin Qu <[email protected]>
Signed-off-by: Flavio Fernandes <[email protected]>
use NetworkSelector spec to match net-attach-defs instead of explicitly
specifying a list of net-attach-defs.

Signed-off-by: Xiaobin Qu <[email protected]>
Signed-off-by: Flavio Fernandes <[email protected]>
pkg/ovn/controller/network_qos/network_qos_pod.go:9:2: import "k8s.io/api/core/v1" imported as "v1" but must be "corev1" according to config (importas)
v1 "k8s.io/api/core/v1"
^
pkg/ovn/controller/network_qos/network_qos_test.go:547:26: Error return value of addrset.AddAddresses is not checked (errcheck)
addrset.AddAddresses([]string{"10.194.188.4"})
^
pkg/factory/factory_test.go:41: File is not gci-ed with --skip-generated -s standard -s default -s prefix(k8s.io,sigs.k8s.io) -s prefix(github.com/ovn-org) -s localmodule -s dot --custom-order (gci)
"github.com/ovn-org/ovn-kubernetes/go-controller/pkg/util"
pkg/factory/factory_test.go:44: File is not gci-ed with --skip-generated -s standard -s default -s prefix(k8s.io,sigs.k8s.io) -s prefix(github.com/ovn-org) -s localmodule -s dot --custom-order (gci)
networkqosfake "github.com/ovn-org/ovn-kubernetes/go-controller/pkg/crd/networkqos/v1/apis/clientset/versioned/fake"
pkg/ovn/controller/network_qos/network_qos_test.go:12: File is not gci-ed with --skip-generated -s standard -s default -s prefix(k8s.io,sigs.k8s.io) -s prefix(github.com/ovn-org) -s localmodule -s dot --custom-order (gci)
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
pkg/factory/factory_test.go:2160:21: unused-parameter: parameter 'old' seems to be unused, consider removing or renaming it as _ (revive)
UpdateFunc: func(old, new interface{}) {
^
pkg/ovn/controller/network_qos/network_qos.go:70:43: unused-parameter: parameter 'nqosKey' seems to be unused, consider removing or renaming it as _ (revive)
return c.nqosCache.DoWithLock(key, func(nqosKey string) error {
^
pkg/ovn/controller/network_qos/network_qos.go:84:43: unused-parameter: parameter 'nqosKey' seems to be unused, consider removing or renaming it as _ (revive)
return c.nqosCache.DoWithLock(key, func(nqosKey string) error {
^
pkg/ovn/controller/network_qos/network_qos.go:91:42: unused-parameter: parameter 'nqosKey' seems to be unused, consider removing or renaming it as _ (revive)
return c.nqosCache.DoWithLock(key, func(nqosKey string) error {
^
pkg/ovn/controller/network_qos/types.go:352:28: unused-parameter: parameter 'value' seems to be unused, consider removing or renaming it as _ (revive)
dest.Pods.Range(func(key, value any) bool {
^
pkg/ovn/controller/network_qos/network_qos_test.go:906:12: unused-parameter: parameter 'nbGlobal' seems to be unused, consider removing or renaming it as _ (revive)
p := func(nbGlobal *nbdb.NBGlobal) bool {
^
pkg/ovn/controller/network_qos/network_qos_controller.go:9:2: import "k8s.io/api/core/v1" imported as "v1" but must be "corev1" according to config (importas)
v1 "k8s.io/api/core/v1"
^
level=info msg="[runner/max_same_issues] 14/17 issues with text "File is not gci-ed with --skip-generated -s standard -s default -s prefix(k8s.io,sigs.k8s.io) -s prefix(github.com/ovn-org) -s localmodule -s dot --custom-order" were hidden, use --max-same-issues"
level=info msg="[runner/max_same_issues] 2/5 issues with text "import \"k8s.io/api/core/v1\" imported as \"v1\" but must be \"corev1\" according to config" were hidden, use --max-same-issues"
level=info msg="[runner] Issues before processing: 154, after processing: 15"
level=info msg="[runner] Processors filtering stat (out/in): invalid_issue: 154/154, path_prettifier: 154/154, max_same_issues: 15/31, severity-rules: 15/15, path_prefixer: 15/15, filename_unadjuster: 154/154, identifier_marker: 104/104, source_code: 15/15, cgo: 154/154, exclude: 104/104, exclude-rules: 34/104, diff: 31/31,
max_from_linter: 15/15, fixer: 15/15, sort_results: 15/15, skip_files: 154/154, autogenerated_exclude: 104/113, nolint: 31/34, uniq_by_line: 31/31, max_per_file_from_linter: 31/31, path_shortener: 15/15, skip_dirs: 113/154"
level=info msg="[runner] processing took 20.696604ms with stages: nolint: 9.359103ms, path_prettifier: 4.425669ms, autogenerated_exclude: 3.521318ms, source_code: 1.252048ms, exclude-rules: 986.811Β΅s, identifier_marker: 704.888Β΅s, skip_dirs: 394.602Β΅s, max_same_issues: 15.034Β΅s, cgo: 9.933Β΅s, invalid_issue: 7.606Β΅s, uniq_by_
line: 6.214Β΅s, path_shortener: 4.436Β΅s, filename_unadjuster: 3.748Β΅s, max_from_linter: 3.014Β΅s, max_per_file_from_linter: 672ns, fixer: 346ns, skip_files: 311ns, diff: 231ns, exclude: 218ns, sort_results: 204ns, severity-rules: 104ns, path_prefixer: 94ns"
level=info msg="[runner] linters took 16.440293927s with stages: goanalysis_metalinter: 16.419554906s"
level=info msg="File cache stats: 8 entries of total size 176.9KiB"
level=info msg="Memory: 431 samples, avg is 1072.1MB, max is 4950.4MB"
level=info msg="Execution took 43.614943666s"
pkg/ovn/controller/network_qos/network_qos_namespace.go:8:2: import "k8s.io/api/core/v1" imported as "v1" but must be "corev1" according to config (importas)
v1 "k8s.io/api/core/v1"
^
pkg/ovn/controller/network_qos/network_qos_node.go:10:2: import "k8s.io/api/core/v1" imported as "v1" but must be "corev1" according to config (importas)
v1 "k8s.io/api/core/v1"
^
pkg/ovn/controller/network_qos/network_qos_ovnnb.go:10:2: import "github.com/ovn-org/libovsdb/ovsdb" imported as "libovsdb" but must be "" according to config (importas)
libovsdb "github.com/ovn-org/libovsdb/ovsdb"
^
pkg/ovn/controller/network_qos/types.go:13:2: import "k8s.io/apimachinery/pkg/api/errors" imported without alias but must be with alias "apierrors" according to config (importas)
"k8s.io/apimachinery/pkg/api/errors"
^

Signed-off-by: Flavio Fernandes <[email protected]>
change `port` in classifier to `ports` which contains a list of
protocol&port combinations.

Signed-off-by: Xiaobin Qu <[email protected]>
Signed-off-by: Flavio Fernandes <[email protected]>
cd go-controller && ./hack/update-codegen.sh

Signed-off-by: Flavio Fernandes <[email protected]>
Tests:
 - host networked pod -> nodeport -> ovnk backend (UDP)
 - VM (1500 mtu) -> host networked pod on another node (TCP)
 - ovnk pod -> host networked pod on another node (UDP)

Each of these tests would trigger a node to install a lower-MTU route
in its cache towards another node, due to PMTUD. The test fails if it
detects such a route.

Signed-off-by: Tim Rozet <[email protected]>
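The route-cache check these tests rely on can be sketched as follows. This is a minimal, hypothetical helper (not the actual e2e code): it parses the output of `ip route get <nodeIP>` run on a node and reports whether PMTUD has installed a cached exception route with a lowered MTU.

```go
package main

import (
	"fmt"
	"strings"
)

// hasCachedLowMTU reports whether `ip route get` output shows a cached
// route whose MTU is below the expected link MTU, which would indicate
// a PMTUD exception route. Simplified: real e2e code would exec the
// command on the node and also handle IPv6/ECMP output.
func hasCachedLowMTU(routeOutput string, linkMTU int) bool {
	fields := strings.Fields(routeOutput)
	for i, f := range fields {
		if f == "mtu" && i+1 < len(fields) {
			var mtu int
			if _, err := fmt.Sscanf(fields[i+1], "%d", &mtu); err == nil && mtu < linkMTU {
				return true
			}
		}
	}
	return false
}

func main() {
	out := "10.0.0.2 via 10.0.0.1 dev eth0 src 10.0.0.3\n    cache expires 598sec mtu 1400"
	fmt.Println(hasCachedLowMTU(out, 1500)) // prints "true": a 1400-byte cached route was found
}
```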
Create nftables rules to block sending ICMP needs frag/packet too big
for known Kubernetes node IPs. PMTUD between nodes can be detrimental
to the cluster. Note, this does not affect PMTUD messages received from
an intermediary router. For shared gateway mode, also install openflow
rules to drop needs-frag packets from the OVN GR that are destined to
known Kubernetes nodes.

Signed-off-by: Tim Rozet <[email protected]>
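The nftables part of this could look roughly like the sketch below. The table, chain, and set names and the addresses are illustrative, not the rules this commit actually installs; ovnkube would maintain the set of known remote node IPs.

```
table inet pmtud-guard {
    # Hypothetical set name; populated with known remote Kubernetes node IPs.
    set remote-node-ips {
        type ipv4_addr
        elements = { 10.0.0.2, 10.0.0.3 }
    }
    chain output {
        type filter hook output priority filter; policy accept;
        # Drop locally generated ICMP "fragmentation needed" destined to
        # other cluster nodes. An analogous icmpv6 packet-too-big rule
        # would cover IPv6. Messages from intermediary routers do not
        # match the set, so their PMTUD still works.
        ip daddr @remote-node-ips icmp type destination-unreachable icmp code frag-needed drop
    }
}
```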
This test works under the premise that sending UDP packets via netcat
would trigger ICMP needs frag.

For shared gateway mode, packets to a host networked pod endpoint are
not fragmented and are rerouted out of GR back to the next host endpoint.
However, for OVN networked endpoints and shared gateway mode, the packet
will hit the GR and then even if DF bit is not set, OVN GR will send
back an ICMP needs frag (because OVS is incapable of fragmentation
across different MTU boundary interfaces).

For local gateway mode, large packets that hit an intermediary node are
DNAT'ed in iptables to the cluster IP service. Then there is a route for
cluster IP service set to 1400 byte MTU. This will trigger the kernel to
drop the packet when DF bit is set.

Since our new logic prevents ICMP needs frag from installing a cached
MTU route, the client will continue to send packets with a 1500 byte MTU
and they will be dropped. We choose not to fix this for now as it was
identified as not a practical use case:
https://issues.redhat.com/browse/OCPBUGS-7609

It could be fixed in the future by using an ip rule to match on nodeport
port range, and then redirecting traffic to another routing table with a
lower MTU on the route.

Signed-off-by: Tim Rozet <[email protected]>
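The future fix suggested above might be sketched like this. These commands are illustrative only (the routing-table number, gateway, MTU, and nodeport range are placeholders, and this is explicitly not implemented in this PR):

```
# A dedicated routing table whose default route locks a lower MTU:
ip route add default via 10.0.0.1 dev eth0 mtu lock 1400 table 100
# Steer nodeport traffic into that table (UDP shown; repeat for TCP):
ip rule add ipproto udp dport 30000-32767 lookup 100
```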
This allows ovn_cluster_router to send ICMP needs frag when too large of
a packet is sent from a pod (such as from a kubevirt VM with too large
an MTU).

See https://issues.redhat.com/browse/FDP-1295 for more details.

Signed-off-by: Tim Rozet <[email protected]>
Today, we require that the VF-accelerated gateway interface be
explicitly configured. This change allows a VF-based accelerated
gateway interface without explicit configuration.

Signed-off-by: Yun Zhou <[email protected]>
- namespace/pod/node handlers detect possible changes and trigger
  reconciliation by putting networkqos in event queue.
- reconcile networkqos rules in networkqos handler only

Signed-off-by: Xiaobin Qu <[email protected]>
Signed-off-by: Flavio Fernandes <[email protected]>
Guard against PMTUD route caching for Kubernetes Node IPs
- acquire lock on the start in networkqos handler
- add missing DeepEqual check in networkqos event handler

Signed-off-by: Xiaobin Qu <[email protected]>
A chassis should handle multiple encaps, and they can be added or
removed dynamically. The CreateOrUpdateChassis function is programmed
to set only the list of current encaps based on the node-encap-ips
annotation; stale entries are garbage collected.

Signed-off-by: nithyar <[email protected]>
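The reconcile-and-garbage-collect idea behind that change can be sketched as a set diff. The function name and signature below are illustrative, not the real ovn-kubernetes API: given the desired encap IPs from the node-encap-ips annotation and the encaps currently recorded for the chassis, compute what to add and which stale entries to remove.

```go
package main

import "fmt"

// syncEncaps computes the adds and removals needed to make the current
// chassis encap list match the desired list from the annotation.
func syncEncaps(desired, current []string) (toAdd, toRemove []string) {
	want := map[string]bool{}
	for _, ip := range desired {
		want[ip] = true
	}
	have := map[string]bool{}
	for _, ip := range current {
		have[ip] = true
		if !want[ip] {
			toRemove = append(toRemove, ip) // stale entry to garbage-collect
		}
	}
	for _, ip := range desired {
		if !have[ip] {
			toAdd = append(toAdd, ip) // newly annotated encap
		}
	}
	return toAdd, toRemove
}

func main() {
	add, del := syncEncaps(
		[]string{"10.0.0.5", "192.168.1.5"}, // from node-encap-ips
		[]string{"10.0.0.5", "172.16.0.5"},  // currently in the DB
	)
	fmt.Println(add, del) // prints "[192.168.1.5] [172.16.0.5]"
}
```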
kyrtapz and others added 14 commits May 16, 2025 16:30
no functional changes

Signed-off-by: Patryk Diak <[email protected]>
Use network ID as one of the ACL DB keys to avoid any potential
conflicts for networks with the same name.

Signed-off-by: Patryk Diak <[email protected]>
Use the constant value for global DB entries.
Use the actual controller for per-network ACLs.

Signed-off-by: Patryk Diak <[email protected]>
Drop traffic between advertised UDN networks
In our pod handlers we have code that checks if a pod is scheduled. If
the pod is not scheduled, then we do not add the pod to our retry
framework. However, there are times when we automatically enqueue all
pods on a node or across the cluster. This can happen when a new node
comes up (pre-IC) and we need to add all pods on that node. Another
case is when a new UDN/NAD is created and the controller spins up; we
then add all pods.

We shouldn't be queuing pods to the retry framework that are not
scheduled. It's a waste of operations. However, even if we do enqueue
them, we definitely should not be treating a non-scheduled resource as
an error and retrying it again later.

This commit changes the retry framework to detect the above case and
log an error. It does not trigger retrying of the resource, which may
perpetually fail and then cause OVNKubernetesResourceRetryFailure.

Signed-off-by: Tim Rozet <[email protected]>
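The scheduled-pod gate described above can be sketched as follows. The `Pod` struct here is a simplified stand-in for `corev1.Pod` (a pod counts as scheduled once the scheduler has bound it, i.e. `Spec.NodeName` is set), and `shouldEnqueue` is an illustrative name, not the real retry-framework function.

```go
package main

import "fmt"

// Pod is a simplified stand-in for corev1.Pod.
type Pod struct {
	Name     string
	NodeName string // empty until the scheduler binds the pod to a node
}

// shouldEnqueue mirrors the change: unscheduled pods are skipped with a
// log line instead of being enqueued and retried as errors. The pod
// will be handled when its scheduled event arrives.
func shouldEnqueue(p Pod) bool {
	if p.NodeName == "" {
		fmt.Printf("skipping unscheduled pod %s: not an error, handled on the scheduled event\n", p.Name)
		return false
	}
	return true
}

func main() {
	fmt.Println(shouldEnqueue(Pod{Name: "pending-pod"}))               // skip message, then "false"
	fmt.Println(shouldEnqueue(Pod{Name: "web-1", NodeName: "node-a"})) // "true"
}
```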
Related to investigating the root cause for #5260. This commit removes
adding pods that are not scheduled to the retry framework. When the pod
is scheduled the controller will receive an event.

Additionally, the functions that add pods were using the kubeclient
instead of the informer cache. That means every time a UDN was added we
would issue a kubeclient call to get all pods, which is really bad for
performance.

Signed-off-by: Tim Rozet <[email protected]>
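The performance point is that a lister read hits the informer's local cache (a memory read) while a kubeclient list is a full apiserver round trip per UDN addition. A toy illustration, with simplified stand-in types rather than client-go:

```go
package main

import "fmt"

// podLister is a stand-in for an informer-backed lister: cache holds the
// locally synced pods, apiCalls counts the round trips a direct client
// list would make.
type podLister struct {
	cache    []string
	apiCalls *int
}

// listFromCache returns pods without touching the apiserver.
func (l podLister) listFromCache() []string {
	return l.cache
}

// listFromAPI simulates the old behavior: one apiserver call per list.
func (l podLister) listFromAPI() []string {
	*l.apiCalls++
	return l.cache
}

func main() {
	calls := 0
	l := podLister{cache: []string{"pod-a", "pod-b"}, apiCalls: &calls}
	for i := 0; i < 100; i++ { // e.g. 100 UDN additions
		_ = l.listFromCache()
	}
	fmt.Println(calls) // prints "0": no apiserver load from the cache path
}
```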
@openshift-ci openshift-ci bot requested review from trozet and tssurya May 20, 2025 22:35

openshift-ci bot commented May 20, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jluhrsen
Once this PR has been reviewed and has the lgtm label, please assign knobunc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jluhrsen

/hold
just to see if we are clear of our bug

/payload-aggregate periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-gcp-ovn-rt-upgrade 5

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 20, 2025

openshift-ci bot commented May 20, 2025

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-gcp-ovn-rt-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c5f3cac0-35ca-11f0-8bc6-7cdf8d41479a-0

@jluhrsen

/retest


openshift-ci bot commented May 21, 2025

@jluhrsen: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-hypershift-conformance-techpreview 6c813ae link false /test e2e-aws-ovn-hypershift-conformance-techpreview
ci/prow/openshift-e2e-gcp-ovn-techpreview-upgrade 6c813ae link false /test openshift-e2e-gcp-ovn-techpreview-upgrade
ci/prow/e2e-gcp-ovn 6c813ae link true /test e2e-gcp-ovn
ci/prow/e2e-aws-ovn-edge-zones 6c813ae link true /test e2e-aws-ovn-edge-zones
ci/prow/e2e-metal-ipi-ovn-ipv6 6c813ae link true /test e2e-metal-ipi-ovn-ipv6
ci/prow/security 6c813ae link false /test security
ci/prow/e2e-gcp-ovn-techpreview 6c813ae link true /test e2e-gcp-ovn-techpreview
ci/prow/e2e-aws-ovn-hypershift-kubevirt 6c813ae link false /test e2e-aws-ovn-hypershift-kubevirt
ci/prow/4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade 6c813ae link true /test 4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@jluhrsen

/payload-aggregate periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-gcp-ovn-rt-upgrade 5
infra troubles y'day. let's see now


openshift-ci bot commented May 21, 2025

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-gcp-ovn-rt-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/158ae280-365c-11f0-9741-4dd6b842b810-0

@jluhrsen

> /payload-aggregate periodic-ci-openshift-release-master-ci-4.18-upgrade-from-stable-4.17-e2e-gcp-ovn-rt-upgrade 5 infra troubles y'day. let's see now
wrong version!

/payload-aggregate periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade 5
infra troubles y'day. trying again


openshift-ci bot commented May 21, 2025

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7260e810-365c-11f0-86c0-52c951da2904-0

@jluhrsen

/payload-aggregate periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade 5


openshift-ci bot commented May 22, 2025

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a3d3f680-36c5-11f0-92f9-51cece8af405-0

@jluhrsen

/payload-aggregate periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade 5


openshift-ci bot commented May 22, 2025

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/89a93ad0-3761-11f0-8319-4f9b521e6056-0

Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.