-
Notifications
You must be signed in to change notification settings - Fork 40.7k
DRA kubelet: fix potential flake in unit test #131065
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRA kubelet: fix potential flake in unit test #131065
Conversation
If the test binary ran long enough after test completion to reach the ResourceSlice removal grace period, that background activity started and failed because it was using out-dated state and an invalid testing.T pointer, causing a panic. The root cause was to leave those background activities running. They need to be stopped before a test returns.
/triage accepted |
/lgtm |
LGTM label has been added. Git tree hash: 5405b5075b5bd57e71fc1016a0a0a9e6fcbe7a93
|
/assign @klueska @SergeyKanzhelev @mrunalp |
/priority import-soon Let's try to get this fixed before 1.33 is released, otherwise backport it, because this flake is seen more often in downstream testing. /cc @kubernetes/release-team /cc @liggitt |
@pohly: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
@pohly: GitHub didn't allow me to request PR reviews from the following users: kubernetes/release-team. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/priority important-soon |
/test pull-kubernetes-unit |
/milestone v1.33 |
/assign @ffromani For approval. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a question to make sure I fully get the flow, other than that LGTM and I can approve
@@ -149,15 +149,26 @@ func TestRegistrationHandler(t *testing.T) { | |||
// Set expected slice fields for the next call of this reactor. | |||
// The reactor will be called next time when resourceslices object is deleted | |||
// by the kubelet after plugin deregistration. | |||
expectedSliceFields = fields.Set{"spec.nodeName": nodeName, "spec.driver": test.pluginName} | |||
|
|||
switch len(expectedSliceFields) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
even factoring in my very limited knowledge of DRA, is not clear to me why we need case 2
now. Every other change I made I can follow and make sense to me, but this specific addition is not clear to me. Could you please elaborate a bit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test registers two plugins. case 1
is for the cleanup after unregistering the first one (test.pluginName
). case 2
is for the cleanup after the other one.
In practice, the test usually exits before getting to that second delete, but some future variant might hit it. I was seeing it when experimenting with different delays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, IIUC it is about experimenting with the timing and interplay between 2 plugins, which makes sense to me and I agree it is useful to have and make sure it works.
/approve the changes make sense to me. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ffromani, pohly The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hmm, a lot of unrelated flakes in https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/131065/pull-kubernetes-e2e-gce/1909912840441958400 /test pull-kubernetes-e2e-gce |
commit 9ba7dce Author: Kubernetes Release Robot <[email protected]> Date: Tue Apr 22 16:27:30 2025 +0000 CHANGELOG: Update directory for v1.30.12 release commit 191c34e Author: Kubernetes Release Robot <[email protected]> Date: Tue Apr 22 16:15:31 2025 +0000 CHANGELOG: Update directory for v1.31.8 release commit 7bf818f Author: Kubernetes Release Robot <[email protected]> Date: Tue Apr 22 16:21:58 2025 +0000 CHANGELOG: Update directory for v1.32.4 release commit 680ea07 Merge: 0d9dccf e467c95 Author: Kubernetes Prow Robot <[email protected]> Date: Mon Apr 21 09:19:06 2025 -0700 Merge pull request kubernetes#131369 from ameukam/update-1242-master [Go] Bump dependencies, images and versions used to Go 1.24.2 and distroless iptables commit 0d9dccf Merge: 66931f0 95b926c Author: Kubernetes Prow Robot <[email protected]> Date: Sat Apr 19 16:20:59 2025 -0700 Merge pull request kubernetes#131382 from liggitt/watch-list-e2e Correctly feature-gate WatchList e2e commit 95b926c Author: Jordan Liggitt <[email protected]> Date: Sat Apr 19 17:18:31 2025 -0400 Feature-gate watchlist e2e commit 66931f0 Merge: b53b9fb 660df22 Author: Kubernetes Prow Robot <[email protected]> Date: Fri Apr 18 07:55:08 2025 -0700 Merge pull request kubernetes#131359 from deads2k/disable Stop exposing list-via-watch from the server commit e467c95 Author: Arnaud Meukam <[email protected]> Date: Fri Apr 18 15:49:48 2025 +0200 [Go] Bump dependencies, images and versions used to Go 1.24.2 and distroless-iptables Signed-off-by: Arnaud Meukam <[email protected]> commit 660df22 Author: David Eads <[email protected]> Date: Thu Apr 17 16:34:46 2025 -0400 Stop exposing list-via-watch from the server With StreamingCollectionEncodingToJSON and StreamingCollectionEncodingToProtobuf, the WatchList must re-justify its necessity. To prevent an ecosystem from building around a feature that may not be promoted, we will stop serving list-via-watch until performance numbers can justify its inclusion. This also stops the kube-controller-manager from using the list-via-watch by default. The fallback is a regular list, so during the skew during an upgrade the "right" thing will happen and the new StreamingCollectionEncoding will be used. commit b53b9fb Merge: 44c230b a8f6d77 Author: Kubernetes Prow Robot <[email protected]> Date: Wed Apr 16 11:05:08 2025 -0700 Merge pull request kubernetes#131015 from aojea/final_servicecidr Add the missing Conformance Test for ServiceCIDR and IPAddress APIS commit a8f6d77 Author: Antonio Ojea <[email protected]> Date: Tue Apr 15 14:52:14 2025 +0000 ServiceCIDR and IPAddess Conformance Change-Id: I6ee188cc8c163c312f8a8da9f1277d83e1ea634c commit 44c230b Author: Kubernetes Release Robot <[email protected]> Date: Tue Apr 15 17:20:14 2025 +0000 CHANGELOG: Update directory for v1.33.0-rc.1 release commit 30469e1 Merge: cfed8c7 0266d3b Author: Kubernetes Prow Robot <[email protected]> Date: Mon Apr 14 11:17:06 2025 -0700 Merge pull request kubernetes#131263 from aojea/dualstack_upgrade Allow to convert clusters Service CIDRs from single to dual stack commit 0266d3b Author: Antonio Ojea <[email protected]> Date: Fri Apr 11 15:37:28 2025 +0000 Allow single-to-dual-stack reconfiguration for ServiceCIDR This change modifies the validation logic for ServiceCIDR updates (`ValidateServiceCIDRUpdate`) to specifically permit upgrading a single-stack ServiceCIDR (either IPv4 or IPv6) to a dual-stack configuration. This reconfiguration path is considered safe because it only involves adding a new CIDR range without altering the existing primary CIDR. This ensures that existing Service IP allocations are not disrupted. Other modifications, such as: - Downgrading from dual-stack to single-stack - Reordering CIDRs in a dual-stack configuration - Changing the primary CIDR during a single-to-dual-stack reconfiguration remain disallowed by the validation. These operations carry a higher risk of breaking existing Services or cluster networking configurations. Preventing these updates automatically encourages administrators to perform such changes manually after carefully assessing the potential impact on their specific cluster environment. The validation errors and controller logs provide guidance when such disallowed changes are attempted. Change-Id: I41dc09dfddb05f277925da2262f8114d6accbd1d commit cfed8c7 Merge: d6e3f34 7d7fc2d Author: Kubernetes Prow Robot <[email protected]> Date: Mon Apr 14 08:19:13 2025 -0700 Merge pull request kubernetes#131234 from carlory/fix-131229 Fix flaky test: Metrics should grab all metrics from kubelet /metrics/resource endpoint commit d6e3f34 Merge: b15dfce 18249aa Author: Kubernetes Prow Robot <[email protected]> Date: Mon Apr 14 08:19:06 2025 -0700 Merge pull request kubernetes#131211 from BenTheElder/max-pods fix flaky garbage collector tests commit b15dfce Merge: 908bdb3 e5a5f72 Author: Kubernetes Prow Robot <[email protected]> Date: Fri Apr 11 04:46:42 2025 -0700 Merge pull request kubernetes#131240 from jsafrane/selinux-tests-tag Tag SELinux tests that require SELinux warning controller commit 7d7fc2d Author: carlory <[email protected]> Date: Thu Apr 10 17:58:05 2025 +0800 Fix flaky test: Metrics should grab all metrics from kubelet /metrics/resource endpoint Signed-off-by: carlory <[email protected]> commit 908bdb3 Merge: 88dfcb2 505836c Author: Kubernetes Prow Robot <[email protected]> Date: Thu Apr 10 17:08:48 2025 -0700 Merge pull request kubernetes#131250 from jimangel/publishing-release-33 staging/publishing: add release-1.33 rules commit 505836c Author: Jim Angel <[email protected]> Date: Thu Apr 10 16:42:44 2025 -0500 staging/publishing: add release-1.33 rules commit e5a5f72 Author: Jan Safranek <[email protected]> Date: Thu Apr 10 14:28:25 2025 +0200 Tag SELinux tests that require SELinux warning controller The controller is available only with SELinuxChangePolicy FG enabled. Kubernetes e2e jobs do the right --focus & --skip explicitly in the SELinux jobs, this just helps other test runners to figure out what FGs are needed to run the tests. When at it, I noticed SELinuxMountReadWriteOncePod tag is always required, so add it on the common level and not in each test. commit 88dfcb2 Merge: cacd595 52298cf Author: Kubernetes Prow Robot <[email protected]> Date: Wed Apr 9 07:06:48 2025 -0700 Merge pull request kubernetes#131065 from pohly/dra-kubelet-registration-unit-test-fix DRA kubelet: fix potential flake in unit test commit cacd595 Author: Kubernetes Release Robot <[email protected]> Date: Tue Apr 8 19:22:49 2025 +0000 CHANGELOG: Update directory for v1.33.0-rc.0 release commit 18249aa Author: Benjamin Elder <[email protected]> Date: Tue Apr 8 15:54:45 2025 -0700 hack/update-conformance-yaml.sh adds [Serial] tags to some tests that schedule lots of pods commit 640489a Merge: 92af6ab b1a9cc3 Author: Kubernetes Prow Robot <[email protected]> Date: Tue Apr 8 15:12:51 2025 -0700 Merge pull request kubernetes#131196 from siyuanfoundation/forward-api bug fix: fix version order in emulation forward compatibility. commit 1eab303 Author: Benjamin Elder <[email protected]> Date: Tue Apr 8 14:46:38 2025 -0700 mark tests that use estimateMaxPods as serial commit b2933c0 Author: Benjamin Elder <[email protected]> Date: Tue Apr 8 14:46:06 2025 -0700 estimate some system daemonset overhead for max pods commit b1a9cc3 Author: Siyuan Zhang <[email protected]> Date: Mon Apr 7 18:38:43 2025 -0700 bug fix: fix version order in emulation forward compatibility. Signed-off-by: Siyuan Zhang <[email protected]> commit 92af6ab Merge: ab3e83f 2ef4a84 Author: Kubernetes Prow Robot <[email protected]> Date: Tue Apr 8 12:18:44 2025 -0700 Merge pull request kubernetes#131204 from dims/move-to-released-version-of-prometheus/client_golang-v1.22.0-from-rc.0 Move to released version of prometheus/client_golang v1.22.0 from rc.0 commit 2ef4a84 Author: Davanum Srinivas <[email protected]> Date: Tue Apr 8 08:35:18 2025 -0400 Move to released version of prometheus/client_golang v1.22.0 from rc.0 Signed-off-by: Davanum Srinivas <[email protected]> commit ab3e83f Merge: 195803c 10a7d6f Author: Kubernetes Prow Robot <[email protected]> Date: Tue Apr 8 02:06:44 2025 -0700 Merge pull request kubernetes#131146 from mauriciopoppe/csi-proxy-1-2-1 Bump CSI Proxy to v1.2.1-gke.2 commit 10a7d6f Author: Mauricio Poppe <[email protected]> Date: Tue Apr 1 20:22:44 2025 +0000 Update CSI Proxy to v1.2.1-gke.2 commit 52298cf Author: Patrick Ohly <[email protected]> Date: Wed Mar 26 09:21:50 2025 +0100 DRA kubelet: fix potential flake in unit test If the test binary ran long enough after test completion to reach the ResourceSlice removal grace period, that background activity started and failed because it was using out-dated state and an invalid testing.T pointer, causing a panic. The root cause was to leave those background activities running. They need to be stopped before a test returns.
What type of PR is this?
/kind bug
/kind failing-test
/kind flake
What this PR does / why we need it:
If the test binary ran long enough after test completion to reach the ResourceSlice removal grace period, that background activity started and failed because it was using out-dated state and an invalid testing.T pointer, causing a panic.
The root cause was to leave those background activities running. They need to be stopped before a test returns.
Which issue(s) this PR fixes:
Fixes #131056
Special notes for your reviewer:
Does this PR introduce a user-facing change?
/assign @bart0sh