Skip to content

DRA kubelet: fix potential flake in unit test #131065

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

pohly
Copy link
Contributor

@pohly pohly commented Mar 26, 2025

What type of PR is this?

/kind bug
/kind failing-test
/kind flake

What this PR does / why we need it:

If the test binary ran long enough after test completion to reach the ResourceSlice removal grace period, that background activity started and failed because it was using out-dated state and an invalid testing.T pointer, causing a panic.

The root cause was to leave those background activities running. They need to be stopped before a test returns.

Which issue(s) this PR fixes:

Fixes #131056

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

/assign @bart0sh

If the test binary ran long enough after test completion to reach the
ResourceSlice removal grace period, that background activity started and failed
because it was using out-dated state and an invalid testing.T pointer, causing
a panic.

The root cause was to leave those background activities running. They need to
be stopped before a test returns.
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 26, 2025
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. labels Mar 26, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 26, 2025
@bart0sh
Copy link
Contributor

bart0sh commented Mar 26, 2025

/triage accepted
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 26, 2025
@bart0sh bart0sh moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Mar 26, 2025
@bart0sh bart0sh moved this from Triage to PRs - Needs Reviewer in SIG Node CI/Test Board Mar 26, 2025
@bart0sh
Copy link
Contributor

bart0sh commented Mar 26, 2025

/lgtm

@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 5405b5075b5bd57e71fc1016a0a0a9e6fcbe7a93

@bart0sh bart0sh moved this from PRs - Needs Reviewer to PRs - Needs Approver in SIG Node CI/Test Board Mar 26, 2025
@bart0sh
Copy link
Contributor

bart0sh commented Mar 26, 2025

/assign @klueska @SergeyKanzhelev @mrunalp
for approval

@pohly
Copy link
Contributor Author

pohly commented Apr 9, 2025

/priority import-soon

Let's try to get this fixed before 1.33 is released, otherwise backport it, because this flake is seen more often in downstream testing.

/cc @kubernetes/release-team

/cc @liggitt

@k8s-ci-robot k8s-ci-robot requested a review from liggitt April 9, 2025 06:09
@k8s-ci-robot
Copy link
Contributor

@pohly: The label(s) priority/import-soon cannot be applied, because the repository doesn't have them.

In response to this:

/priority import-soon

Let's try to get this fixed before 1.33 is released, otherwise backport it, because this flake is seen more often in downstream testing.

/cc @kubernetes/release-team

/cc @liggitt

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Copy link
Contributor

@pohly: GitHub didn't allow me to request PR reviews from the following users: kubernetes/release-team.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/priority import-soon

Let's try to get this fixed before 1.33 is released, otherwise backport it, because this flake is seen more often in downstream testing.

/cc @kubernetes/release-team

/cc @liggitt

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@pohly
Copy link
Contributor Author

pohly commented Apr 9, 2025

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Apr 9, 2025
@pohly
Copy link
Contributor Author

pohly commented Apr 9, 2025

/test pull-kubernetes-unit

@npolshakova
Copy link
Contributor

/milestone v1.33

@pohly
Copy link
Contributor Author

pohly commented Apr 9, 2025

/assign @ffromani

For approval.

Copy link
Contributor

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question to make sure I fully get the flow, other than that LGTM and I can approve

@@ -149,15 +149,26 @@ func TestRegistrationHandler(t *testing.T) {
// Set expected slice fields for the next call of this reactor.
// The reactor will be called next time when resourceslices object is deleted
// by the kubelet after plugin deregistration.
expectedSliceFields = fields.Set{"spec.nodeName": nodeName, "spec.driver": test.pluginName}

switch len(expectedSliceFields) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even factoring in my very limited knowledge of DRA, is not clear to me why we need case 2 now. Every other change I made I can follow and make sense to me, but this specific addition is not clear to me. Could you please elaborate a bit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test registers two plugins. case 1 is for the cleanup after unregistering the first one (test.pluginName). case 2 is for the cleanup after the other one.

In practice, the test usually exits before getting to that second delete, but some future variant might hit it. I was seeing it when experimenting with different delays.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, IIUC it is about experimenting with the timing and interplay between 2 plugins, which makes sense to me and I agree it is useful to have and make sure it works.

@ffromani
Copy link
Contributor

ffromani commented Apr 9, 2025

/approve

the changes make sense to me.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 9, 2025
@pohly
Copy link
Contributor Author

pohly commented Apr 9, 2025

Hmm, a lot of unrelated flakes in https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/131065/pull-kubernetes-e2e-gce/1909912840441958400

/test pull-kubernetes-e2e-gce

@k8s-ci-robot k8s-ci-robot merged commit 88dfcb2 into kubernetes:master Apr 9, 2025
18 checks passed
@github-project-automation github-project-automation bot moved this from PRs - Needs Approver to Done in SIG Node CI/Test Board Apr 9, 2025
@pohly pohly moved this from 👀 In review to ✅ Done in Dynamic Resource Allocation Apr 17, 2025
saisindhuri91 added a commit to linux-on-ibm-z/kubernetes that referenced this pull request Apr 23, 2025
commit 9ba7dce
Author: Kubernetes Release Robot <[email protected]>
Date:   Tue Apr 22 16:27:30 2025 +0000

    CHANGELOG: Update directory for v1.30.12 release

commit 191c34e
Author: Kubernetes Release Robot <[email protected]>
Date:   Tue Apr 22 16:15:31 2025 +0000

    CHANGELOG: Update directory for v1.31.8 release

commit 7bf818f
Author: Kubernetes Release Robot <[email protected]>
Date:   Tue Apr 22 16:21:58 2025 +0000

    CHANGELOG: Update directory for v1.32.4 release

commit 680ea07
Merge: 0d9dccf e467c95
Author: Kubernetes Prow Robot <[email protected]>
Date:   Mon Apr 21 09:19:06 2025 -0700

    Merge pull request kubernetes#131369 from ameukam/update-1242-master

    [Go] Bump dependencies, images and versions used to Go 1.24.2 and distroless iptables

commit 0d9dccf
Merge: 66931f0 95b926c
Author: Kubernetes Prow Robot <[email protected]>
Date:   Sat Apr 19 16:20:59 2025 -0700

    Merge pull request kubernetes#131382 from liggitt/watch-list-e2e

    Correctly feature-gate WatchList e2e

commit 95b926c
Author: Jordan Liggitt <[email protected]>
Date:   Sat Apr 19 17:18:31 2025 -0400

    Feature-gate watchlist e2e

commit 66931f0
Merge: b53b9fb 660df22
Author: Kubernetes Prow Robot <[email protected]>
Date:   Fri Apr 18 07:55:08 2025 -0700

    Merge pull request kubernetes#131359 from deads2k/disable

    Stop exposing list-via-watch from the server

commit e467c95
Author: Arnaud Meukam <[email protected]>
Date:   Fri Apr 18 15:49:48 2025 +0200

    [Go] Bump dependencies, images and versions used to Go 1.24.2 and distroless-iptables

    Signed-off-by: Arnaud Meukam <[email protected]>

commit 660df22
Author: David Eads <[email protected]>
Date:   Thu Apr 17 16:34:46 2025 -0400

    Stop exposing list-via-watch from the server

    With StreamingCollectionEncodingToJSON and
    StreamingCollectionEncodingToProtobuf, the WatchList must re-justify its
    necessity.  To prevent an ecosystem from building around a feature that
    may not be promoted, we will stop serving list-via-watch until
    performance numbers can justify its inclusion.

    This also stops the kube-controller-manager from using the
    list-via-watch by default.  The fallback is a regular list, so during
    the skew during an upgrade the "right" thing will happen and the new
    StreamingCollectionEncoding will be used.

commit b53b9fb
Merge: 44c230b a8f6d77
Author: Kubernetes Prow Robot <[email protected]>
Date:   Wed Apr 16 11:05:08 2025 -0700

    Merge pull request kubernetes#131015 from aojea/final_servicecidr

    Add the missing Conformance Test for ServiceCIDR and IPAddress APIS

commit a8f6d77
Author: Antonio Ojea <[email protected]>
Date:   Tue Apr 15 14:52:14 2025 +0000

    ServiceCIDR and IPAddess Conformance

    Change-Id: I6ee188cc8c163c312f8a8da9f1277d83e1ea634c

commit 44c230b
Author: Kubernetes Release Robot <[email protected]>
Date:   Tue Apr 15 17:20:14 2025 +0000

    CHANGELOG: Update directory for v1.33.0-rc.1 release

commit 30469e1
Merge: cfed8c7 0266d3b
Author: Kubernetes Prow Robot <[email protected]>
Date:   Mon Apr 14 11:17:06 2025 -0700

    Merge pull request kubernetes#131263 from aojea/dualstack_upgrade

    Allow to convert clusters Service CIDRs from single to dual stack

commit 0266d3b
Author: Antonio Ojea <[email protected]>
Date:   Fri Apr 11 15:37:28 2025 +0000

    Allow single-to-dual-stack reconfiguration for ServiceCIDR

    This change modifies the validation logic for ServiceCIDR updates
    (`ValidateServiceCIDRUpdate`) to specifically permit upgrading a
    single-stack ServiceCIDR (either IPv4 or IPv6) to a dual-stack
    configuration.

    This reconfiguration path is considered safe because it only involves adding
    a new CIDR range without altering the existing primary CIDR. This
    ensures that existing Service IP allocations are not disrupted.

    Other modifications, such as:
    - Downgrading from dual-stack to single-stack
    - Reordering CIDRs in a dual-stack configuration
    - Changing the primary CIDR during a single-to-dual-stack
      reconfiguration

    remain disallowed by the validation. These operations carry a higher
    risk of breaking existing Services or cluster networking
    configurations. Preventing these updates automatically encourages
    administrators to perform such changes manually after carefully
    assessing the potential impact on their specific cluster environment.
    The validation errors and controller logs provide guidance when such
    disallowed changes are attempted.

    Change-Id: I41dc09dfddb05f277925da2262f8114d6accbd1d

commit cfed8c7
Merge: d6e3f34 7d7fc2d
Author: Kubernetes Prow Robot <[email protected]>
Date:   Mon Apr 14 08:19:13 2025 -0700

    Merge pull request kubernetes#131234 from carlory/fix-131229

    Fix flaky test: Metrics should grab all metrics from kubelet /metrics/resource endpoint

commit d6e3f34
Merge: b15dfce 18249aa
Author: Kubernetes Prow Robot <[email protected]>
Date:   Mon Apr 14 08:19:06 2025 -0700

    Merge pull request kubernetes#131211 from BenTheElder/max-pods

    fix flaky garbage collector tests

commit b15dfce
Merge: 908bdb3 e5a5f72
Author: Kubernetes Prow Robot <[email protected]>
Date:   Fri Apr 11 04:46:42 2025 -0700

    Merge pull request kubernetes#131240 from jsafrane/selinux-tests-tag

    Tag SELinux tests that require SELinux warning controller

commit 7d7fc2d
Author: carlory <[email protected]>
Date:   Thu Apr 10 17:58:05 2025 +0800

    Fix flaky test: Metrics should grab all metrics from kubelet /metrics/resource endpoint

    Signed-off-by: carlory <[email protected]>

commit 908bdb3
Merge: 88dfcb2 505836c
Author: Kubernetes Prow Robot <[email protected]>
Date:   Thu Apr 10 17:08:48 2025 -0700

    Merge pull request kubernetes#131250 from jimangel/publishing-release-33

    staging/publishing: add release-1.33 rules

commit 505836c
Author: Jim Angel <[email protected]>
Date:   Thu Apr 10 16:42:44 2025 -0500

    staging/publishing: add release-1.33 rules

commit e5a5f72
Author: Jan Safranek <[email protected]>
Date:   Thu Apr 10 14:28:25 2025 +0200

    Tag SELinux tests that require SELinux warning controller

    The controller is available only with SELinuxChangePolicy FG enabled.
    Kubernetes e2e jobs do the right --focus & --skip explicitly in the SELinux
    jobs, this just helps other test runners to figure out what FGs are needed
    to run the tests.

    When at it, I noticed SELinuxMountReadWriteOncePod tag is always required, so
    add it on the common level and not in each test.

commit 88dfcb2
Merge: cacd595 52298cf
Author: Kubernetes Prow Robot <[email protected]>
Date:   Wed Apr 9 07:06:48 2025 -0700

    Merge pull request kubernetes#131065 from pohly/dra-kubelet-registration-unit-test-fix

    DRA kubelet: fix potential flake in unit test

commit cacd595
Author: Kubernetes Release Robot <[email protected]>
Date:   Tue Apr 8 19:22:49 2025 +0000

    CHANGELOG: Update directory for v1.33.0-rc.0 release

commit 18249aa
Author: Benjamin Elder <[email protected]>
Date:   Tue Apr 8 15:54:45 2025 -0700

    hack/update-conformance-yaml.sh

    adds [Serial] tags to some tests that schedule lots of pods

commit 640489a
Merge: 92af6ab b1a9cc3
Author: Kubernetes Prow Robot <[email protected]>
Date:   Tue Apr 8 15:12:51 2025 -0700

    Merge pull request kubernetes#131196 from siyuanfoundation/forward-api

    bug fix: fix version order in emulation forward compatibility.

commit 1eab303
Author: Benjamin Elder <[email protected]>
Date:   Tue Apr 8 14:46:38 2025 -0700

    mark tests that use estimateMaxPods as serial

commit b2933c0
Author: Benjamin Elder <[email protected]>
Date:   Tue Apr 8 14:46:06 2025 -0700

    estimate some system daemonset overhead for max pods

commit b1a9cc3
Author: Siyuan Zhang <[email protected]>
Date:   Mon Apr 7 18:38:43 2025 -0700

    bug fix: fix version order in emulation forward compatibility.

    Signed-off-by: Siyuan Zhang <[email protected]>

commit 92af6ab
Merge: ab3e83f 2ef4a84
Author: Kubernetes Prow Robot <[email protected]>
Date:   Tue Apr 8 12:18:44 2025 -0700

    Merge pull request kubernetes#131204 from dims/move-to-released-version-of-prometheus/client_golang-v1.22.0-from-rc.0

    Move to released version of prometheus/client_golang v1.22.0 from rc.0

commit 2ef4a84
Author: Davanum Srinivas <[email protected]>
Date:   Tue Apr 8 08:35:18 2025 -0400

    Move to released version of prometheus/client_golang v1.22.0 from rc.0

    Signed-off-by: Davanum Srinivas <[email protected]>

commit ab3e83f
Merge: 195803c 10a7d6f
Author: Kubernetes Prow Robot <[email protected]>
Date:   Tue Apr 8 02:06:44 2025 -0700

    Merge pull request kubernetes#131146 from mauriciopoppe/csi-proxy-1-2-1

    Bump CSI Proxy to v1.2.1-gke.2

commit 10a7d6f
Author: Mauricio Poppe <[email protected]>
Date:   Tue Apr 1 20:22:44 2025 +0000

    Update CSI Proxy to v1.2.1-gke.2

commit 52298cf
Author: Patrick Ohly <[email protected]>
Date:   Wed Mar 26 09:21:50 2025 +0100

    DRA kubelet: fix potential flake in unit test

    If the test binary ran long enough after test completion to reach the
    ResourceSlice removal grace period, that background activity started and failed
    because it was using out-dated state and an invalid testing.T pointer, causing
    a panic.

    The root cause was to leave those background activities running. They need to
    be stopped before a test returns.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
Status: Done
Archived in project
Development

Successfully merging this pull request may close these issues.

DRA: Unit test TestRegistrationHandler panics
8 participants