migrate critical kind jobs to the new nodepool #34851
Conversation
/lgtm @pacoxu this is the one that should fix the CI
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: aojea, upodroid
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Thanks for the information. /lgtm
/easycla
@upodroid: Updated the
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@upodroid this job https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-kubernetes-kind-ipv6-e2e-parallel/1926455138575192064 is still failing and running in the old nodepool
It has been moved. Check the new runs. It's also a lot faster: 33m down to 25m.
- key: "dedicated"
  operator: "Equal"
  value: "sig-testing"
  effect: "NoSchedule"
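For context, here is a minimal sketch of where a toleration like the one above sits in a prow job definition, assuming a periodic job; the job name, interval, container image, command, and nodeSelector label value are hypothetical placeholders, not values taken from this PR:

periodics:
- name: ci-example-kind-job          # hypothetical job name
  interval: 2h                       # hypothetical
  decorate: true
  spec:
    nodeSelector:
      cloud.google.com/gke-nodepool: pool-example   # hypothetical pool name
    tolerations:
    - key: "dedicated"               # matches the taint discussed in this PR
      operator: "Equal"
      value: "sig-testing"
      effect: "NoSchedule"
    containers:
    - image: example.test/image:latest               # hypothetical image
      command:
      - runner.sh                                     # hypothetical entrypoint

Jobs without this toleration cannot be scheduled onto nodes carrying the dedicated=sig-testing:NoSchedule taint, which is what keeps the rest of the prow jobs off the new pool while it is being tested.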
Just for my understanding: is this a permanent thing or a temporary stop-gap?
My understanding is that this is a long-standing issue that this kernel bug accelerated: kubernetes/k8s.io#5276 (comment)
Hmm, it isn't clear to me from that comment that this is permanent.
If it's permanent, then we need to document this "dedicated" node taint somewhere and explain to job authors when to tolerate it.
cc @BenTheElder
The taint is temporary; it allowed us to test the new nodepool without the other prow jobs being scheduled onto it. It will be removed tomorrow after I promote it to the main nodepool.
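For reference, this is roughly how that taint appears on the node side while the new pool is being tested; this is only an illustration of the Kubernetes taint shape, and the node name below is a hypothetical placeholder, not output from the actual cluster:

apiVersion: v1
kind: Node
metadata:
  name: gke-example-new-pool-node    # hypothetical node name
spec:
  taints:
  - key: dedicated                   # same key/value/effect as the toleration in this PR
    value: sig-testing
    effect: NoSchedule

Once the taint is dropped from the nodepool, the per-job tolerations become no-ops and can be removed.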
Makes sense. Is reverting this PR tracked somewhere when it's no longer needed?
In previous PRs I did put a comment on these, but we will want to mass-remove the toleration from all jobs when the taint is removed from the node pool and the old node pool is phased out.
Similarly, we'll want to remove the node selector at the same time.
We should keep a tracking bug; I had been using kubernetes/k8s.io#5276
I cleaned up the tolerations and nodeselectors in #34861
thanks!
/cc @BenTheElder @aojea @pacoxu
Part of kubernetes/k8s.io#5276
Tested in kubernetes/kubernetes#131948 and #34840
tl;dr: There is a bad kernel update in Ubuntu (the underlying OS of the GKE nodes); the bug is fixed in a newer kernel version, but that fix hasn't been rolled out yet. This change migrates these jobs to a nodepool with cgroups v2, faster VMs, and Container-Optimized OS.