-
Notifications
You must be signed in to change notification settings - Fork 667
DNS resolution for kubelet container image does not retry #11052
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I don't think the analysis is correct here, there are two pull errors:
So I don't see a bug so far. New resolver settings are correctly picked up after a retry. |
Yeah, it looks like this is coming from Outside of that, though, I think the premise is wrong: just because an image is not found at one point in time does not mean it will never be found. That is, while a 404 error should be cause to cancel immediate retries, as it is semantically a terminal error, they should not constitute a forever condition. |
Talos is not completely stuck there - e.g. changing an image reference to a correct one would cause a new attempt, but I don't think Talos should retry on its own in this condition. |
No, it's not stuck in the sense that an external event (such as the restart of The issue with image availability is that it is not a fixed-over-time value. It is entirely reasonable to presume that the unavailability of an image is an unavailability yet of that image. It does beg the question of how often and under what conditions Talos should retry, but I would suggest doing the same thing that Kubernetes does in this condition: a bounded exponential backoff. In this example, presuming all the current assumptions are true, it is the fault of the |
Bug Report
New with v1.10:
The kubelet service attempts to start, but it uses the incorrect DNS server (8.8.8.8), which fails to resolve. This occurs long after the machineconfig with the correct DNS servers is up and otherwise operational. After this failure,
kubelet
never comes up.Manually restarting the kubelet service via
talosctl service restart kubelet
does allow the kubelet to come up.When running v1.9.6, this works (logs attached at the bottom).
Description
From console logs:
Effect: kubelet never starts because its resolution fails.
Manually restarting kubelet service causes it to come up.
Logs
Full logs attached; trimmed logs below:
console-log.txt
As seen from
services
:For comparison, here are the logs from a new, working v1.9.6 node in the same cluster.
working-console-log.txt
Environment
The text was updated successfully, but these errors were encountered: