
migrate critical kind jobs to the new nodepool #34851


Merged

Conversation

upodroid
Member

@upodroid upodroid commented May 26, 2025

/cc @BenTheElder @aojea @pacoxu

Part of kubernetes/k8s.io#5276
Tested in kubernetes/kubernetes#131948 and #34840

tl;dr: There is a bad kernel update in Ubuntu (the underlying OS of the GKE nodes); it was fixed in a newer kernel version, but the fix hasn't been rolled out yet. This change migrates the jobs to a nodepool with cgroups v2, faster VMs, and Container-Optimized OS.
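
For illustration only, a minimal sketch (not the exact job config in this PR) of the scheduling fields such a migration adds to a job's pod spec. The toleration values match the "dedicated" taint discussed in the review thread below; the nodeSelector key/value is a hypothetical placeholder for whatever label the new nodepool carries:

  # Sketch: toleration for the temporary "dedicated" taint on the new nodepool,
  # plus a hypothetical nodeSelector pinning the job to that pool.
  spec:
    nodeSelector:
      dedicated: sig-testing        # hypothetical label; the real key may differ
    tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "sig-testing"
      effect: "NoSchedule"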

@aojea
Member

aojea commented May 26, 2025

/lgtm
/approve

@pacoxu this is the one that should fix the CI

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/config Issues or PRs related to code in /config labels May 26, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, upodroid

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/conformance Issues or PRs related to kubernetes conformance tests approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm "Looks good to me", indicates that a PR is ready to be merged. area/jobs sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels May 26, 2025
@pacoxu
Member

pacoxu commented May 26, 2025

@pacoxu this is the one that should fix the CI

Thanks for the information.

/lgtm

@upodroid
Member Author

/easycla

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 26, 2025
@k8s-ci-robot k8s-ci-robot merged commit 010416f into kubernetes:master May 26, 2025
7 checks passed
@k8s-ci-robot
Contributor

@upodroid: Updated the job-config configmap in namespace default at cluster test-infra-trusted using the following files:

  • key kind-release-blocking.yaml using file config/jobs/kubernetes-sigs/kind/kind-release-blocking.yaml
  • key kubernetes-code-organization.yaml using file config/jobs/kubernetes/sig-arch/kubernetes-code-organization.yaml
  • key kubernetes-kind.yaml using file config/jobs/kubernetes/sig-testing/kubernetes-kind.yaml

In response to this:

/cc @BenTheElder @aojea @pacoxu

Part of kubernetes/k8s.io#5276
Tested in kubernetes/kubernetes#131948 and #34840

tl;dr: There is a bad kernel update in Ubuntu (the underlying OS of the GKE nodes); it was fixed in a newer kernel version, but the fix hasn't been rolled out yet. This change migrates the jobs to a nodepool with cgroups v2, faster VMs, and Container-Optimized OS.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@aojea
Member

aojea commented May 26, 2025

@upodroid this job https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-kind-ipv6-e2e-parallel/1926455138575192064 is still failing and running on gke-prow-build-pool5-2021092812495606-e8f905a4-vdpr / 10.128.0.12, should it be moved too?

@upodroid
Member Author

It has been moved. Check the new runs.

Also, it's a lot faster: the job went from 33m down to 25m.

- key: "dedicated"
operator: "Equal"
value: "sig-testing"
effect: "NoSchedule"
Contributor

Just for my understanding: is this a permanent thing or a temporary stop-gap?

Member

My understanding is that this is a long-standing issue that the kernel bug accelerated: kubernetes/k8s.io#5276 (comment)

Contributor

@pohly pohly May 26, 2025

Hmm, that it's permanent doesn't become clear to me from that comment.

If it's permanent, then we need to document this "dedicated" node taint (and the matching node labels) somewhere and explain to job authors when to tolerate it.

cc @BenTheElder

Member Author

The taint is temporary; it allowed us to test the new nodepool without scheduling the other prow jobs onto it. It will be removed tomorrow, after I promote it to be the main nodepool.
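
For context, this is how a temporary taint like that would appear on the new pool's nodes (standard Kubernetes Node spec fields; the values mirror the toleration in this PR, shown as a sketch rather than the exact nodepool config):

  # Shape of `kubectl get node <node> -o yaml` output for a node in the new pool
  # while the temporary taint is still applied.
  spec:
    taints:
    - key: dedicated
      value: sig-testing
      effect: NoSchedule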

Contributor

Makes sense. Is reverting this PR, once it's no longer needed, tracked somewhere?

Member

In previous PRs I did put a comment on these, but we will want to mass-remove the toleration from all jobs once the taint is removed from the node pool and the old node pool is phased out.

Similarly, we'll want to remove the node selector at the same time.

Member

We should keep a tracking bug; I had been using kubernetes/k8s.io#5276

Member Author

@upodroid upodroid May 27, 2025

I cleaned up the tolerations and nodeselectors in #34861

Member

thanks!
