DNS resolution for kubelet container image does not retry #11052

Open
Ulexus opened this issue May 20, 2025 · 4 comments


@Ulexus
Contributor

Ulexus commented May 20, 2025

Bug Report

New with v1.10:

The kubelet service attempts to start, but the image pull uses the incorrect DNS server (8.8.8.8), so the lookup fails. This occurs long after the machine config with the correct DNS servers has been applied and the node is otherwise operational. After this failure, kubelet never comes up.
Manually restarting the kubelet service via talosctl service restart kubelet does allow the kubelet to come up.

When running v1.9.6, this works (logs attached at the bottom).

Description

From console logs:

  • Containerd is started before the correct DNS server is loaded: line 17
  • Correct DNS server is loaded: line 97
  • Containerd resolves using incorrect DNS server: line 201

Effect: kubelet never starts because its image resolution fails.

Manually restarting the kubelet service causes it to come up.

Logs

Full logs attached; trimmed logs below:

 user: warning: [2025-05-19T23:46:35.229803076Z]: [talos] service[containerd](Running): Process Process(["/bin/containerd" "--address" "/system/run/containerd/containerd.sock" "--state" "/system/run/containerd" "--root" "/system/var/lib/containerd"]) started      
 with PID 1127                                                                                                                                                                                                                                                          
 user: warning: [2025-05-19T23:46:35.969378076Z]: [talos] service[containerd](Running): Health check successful                                                                                                                                                         
 user: warning: [2025-05-19T23:46:37.719428076Z]: [talos] setting resolvers {"component": "controller-runtime", "controller": "network.ResolverSpecController", "resolvers": ["169.254.169.253", "fd00:ec2::253"], "searchDomains": ["ec2.internal"]}    
 user: warning: [2025-05-19T23:46:38.761669076Z]: [talos] service[cri](Starting): Starting service                                                                                                                                                                      
 user: warning: [2025-05-19T23:46:38.850301076Z]: [talos] service[cri](Running): Process Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"]) started with PID 2035                                         
 user: warning: [2025-05-19T23:46:39.152175076Z]: [talos] service[kubelet](Starting): Starting service                                                                                                                                                                  
 user: warning: [2025-05-19T23:46:39.152860076Z]: [talos] service[kubelet](Waiting): Waiting for volume "/var/lib" to be mounted, volume "/var/lib/kubelet" to be mounted, volume "/var/log" to be mounted, volume "/var/log/audit" to be mounted, volume "/var/log/    
 containers" to be mounted, volume "/var/log/pods" to be mounted, volume "/var/lib/kubelet/seccomp" to be mounted, volume "/var/lib/kubelet/seccomp/profiles" to be mounted, volume "/var/log/audit/kube" to be mounted, volume "/var/mnt" to be mounted, service       
 "cri" to be "up", time sync, network                                                                                                                                                                                                                                   
 user: warning: [2025-05-19T23:46:39.175239076Z]: [talos] created route {"component": "controller-runtime", "controller": "network.RouteSpecController", "destination": "default", "gateway": "", "table": "RoutingTable(180)", "link": "kubespan", "priority": 1,      
 "family": "inet4"}                                                                                                                                                                                                                                                     
 user: warning: [2025-05-19T23:46:39.177365076Z]: [talos] created route {"component": "controller-runtime", "controller": "network.RouteSpecController", "destination": "default", "gateway": "", "table": "RoutingTable(180)", "link": "kubespan", "priority": 1,      
 "family": "inet6"}                                                                                                                                                                                                                                                     
 user: warning: [2025-05-19T23:46:39.794763076Z]: [talos] service[cri](Running): Health check successful                                                                                                                                                                
 user: warning: [2025-05-19T23:46:39.795498076Z]: [talos] service[kubelet](Preparing): Running pre state                                                                                                                                                                
 user: warning: [2025-05-19T23:46:53.665193076Z]: [talos] task startAllServices (1/1): service "apid" to be "up", service "kubelet" to be "up"                                                                                                                          
 user: warning: [2025-05-19T23:46:59.715580076Z]: level=info msg=fetch failed error=failed to do request: Head "https://ghcr.io/v2/siderolabs/kubelet/manifests/sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb": dial tcp: lookup ghcr.io on   
 8.8.8.8:53: read udp 10.224.128.22:59060->8.8.8.8:53: i/o timeout host=ghcr.io image=ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb                                                                                
 user: warning: [2025-05-19T23:46:59.779107076Z]: level=info msg=fetch failed after status: 404 Not Found host=ghcr.io image=ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb                                         
 user: warning: [2025-05-19T23:46:59.787276076Z]: [talos] service[kubelet](Failed): Failed to run pre stage: 1 error(s) occurred:                                                                                                                                       
 user: warning: [2025-05-19T23:46:59.790284076Z]:  failed to pull image "ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb": failed to resolve reference "ghcr.io/siderolabs/                                          
 kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb": ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb: not found                                                                        

console-log.txt

As seen from services:

$ talosctl services
NODE                                     SERVICE      STATE     HEALTH   LAST CHANGE   LAST EVENT
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   apid         Running   OK       4m5s ago      Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   auditd       Running   OK       4m31s ago     Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   containerd   Running   OK       4m31s ago     Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   cri          Running   OK       4m28s ago     Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   dashboard    Running   ?        4m30s ago     Process Process(["/sbin/dashboard"]) started with PID 1949
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   kubelet      Failed    ?        4m7s ago      Failed to run pre stage: 1 error(s) occurred:
                                         failed to pull image "ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb": failed to resolve reference "ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb": ghcr.io/siderolabs/kubelet@sha256:0072b6738306b927cb85ad53999c2f9691f2f533cff22f4afc30350c3b9e62bb: not found
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   machined     Running   OK       4m31s ago     Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   syslogd      Running   OK       4m30s ago     Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   udevd        Running   OK       4m32s ago     Health check successful
$ talosctl service kubelet restart
NODE                                     RESPONSE
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   Service "kubelet" restarted
$ talosctl services
NODE                                     SERVICE      STATE     HEALTH   LAST CHANGE   LAST EVENT
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   apid         Running   OK       6m57s ago     Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   auditd       Running   OK       7m24s ago     Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   containerd   Running   OK       7m24s ago     Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   cri          Running   OK       7m20s ago     Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   dashboard    Running   ?        7m22s ago     Process Process(["/sbin/dashboard"]) started with PID 1949
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   kubelet      Running   ?        1s ago        Started task kubelet (PID 2159) for container kubelet
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   machined     Running   OK       7m24s ago     Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   syslogd      Running   OK       7m23s ago     Health check successful
fd26:7b81:1226:c202:cd4:81ff:feef:e9ff   udevd        Running   OK       7m24s ago     Health check successful

For comparison, here are the logs from a new, working v1.9.6 node in the same cluster.

working-console-log.txt

Environment

  • Talos version: v1.10.1 (though also fails in v1.10.0)
  • Kubernetes version: 1.33.0
  • Platform: various x86 on AWS
@smira
Member

smira commented May 20, 2025

I don't think the analysis is correct here; there are two pull errors:

  • a failure to do the DNS lookup, which is retried
  • an HTTP 404 from the registry, which is not retried, as it's a terminal error (the image is not found)

So I don't see a bug so far. New resolver settings are correctly picked up after a retry.
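For illustration, a minimal Go sketch of that distinction: transient DNS failures are retried, while a registry "not found" is treated as terminal. The import path, helper names, and retry wiring are assumptions for the sketch, not Talos's actual pull code.

package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"

	// NOTE: import path is an assumption; older containerd versions expose
	// the same helpers under github.com/containerd/containerd/errdefs.
	"github.com/containerd/errdefs"
)

// pullWithRetry is a hypothetical wrapper illustrating the distinction above:
// transient errors (DNS/network) are retried, while a registry "not found"
// is treated as terminal and aborts immediately.
func pullWithRetry(ctx context.Context, pull func(context.Context) error) error {
	for {
		err := pull(ctx)
		if err == nil {
			return nil
		}

		// Terminal: the registry reports that the image does not exist.
		if errdefs.IsNotFound(err) {
			return fmt.Errorf("image not found, giving up: %w", err)
		}

		// Transient: DNS lookup failures (like the 8.8.8.8 timeout in the
		// logs above) are retried after a short delay.
		var dnsErr *net.DNSError
		if errors.As(err, &dnsErr) {
			select {
			case <-time.After(5 * time.Second):
				continue
			case <-ctx.Done():
				return ctx.Err()
			}
		}

		// Anything else is surfaced to the caller unchanged.
		return err
	}
}

func main() {
	// Simulated pull that always reports "not found": the wrapper gives up
	// immediately instead of retrying.
	err := pullWithRetry(context.Background(), func(context.Context) error {
		return fmt.Errorf("resolve image: %w", errdefs.ErrNotFound)
	})
	fmt.Println(err)
}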

@smira smira added the Stale label May 20, 2025
@Ulexus
Contributor Author

Ulexus commented May 20, 2025

Yeah, it looks like this is coming from containerd, not Talos. I haven't completely run through that morass, but I'm concerned that, at least at a high level, they are using the NotFound term to be inclusive of "not resolved". I haven't found code to indicate that that is what is happening, though; rather, it appears to be as you indicate, and it's just the logs that are confusing.

Outside of that, though, I think the premise is wrong: just because an image is not found at one point in time does not mean it will never be found. That is, while a 404 error should be cause to cancel immediate retries, as it is semantically a terminal error, it should not constitute a forever condition.

@smira
Member

smira commented May 20, 2025

Talos is not completely stuck there; e.g., changing the image reference to a correct one would cause a new attempt. But I don't think Talos should retry on its own in this condition.

@Ulexus
Contributor Author

Ulexus commented May 20, 2025

No, it's not stuck in the sense that an external event (such as the kubelet restart described above) is sufficient to get it unstuck, but this doesn't facilitate automation.

The issue with image availability is that it is not fixed over time. It is entirely reasonable to presume that an image which is unavailable now is simply not available yet. That raises the question of how often and under what conditions Talos should retry, but I would suggest doing the same thing that Kubernetes does in this condition: a bounded exponential backoff.

In this example, presuming all the current assumptions are true, the fault lies with the ghcr.io server for misrepresenting the failure as a 404, and a simple retry would have resolved the problem. A more common case, though, would be that the image is not yet available for some reason (a slow mirror, a new service which hasn't yet synced, an out-of-order or non-deterministically-ordered deployment, etc.). In all of these cases, a bounded exponential backoff retry strategy would automatically resolve the issue without causing undue load on the container registry, even if the image never appears.
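As a rough illustration of that suggestion, here is a minimal Go sketch of a bounded exponential backoff with jitter; the function names, delays, and cap are illustrative assumptions, not a proposal for Talos's exact implementation.

package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

// retryWithBackoff sketches the bounded exponential backoff suggested above:
// retry until success, doubling the delay up to a cap so the registry is not
// hammered even if the image never appears.
func retryWithBackoff(ctx context.Context, op func(context.Context) error) error {
	const (
		initialDelay = 1 * time.Second
		maxDelay     = 5 * time.Minute
	)

	delay := initialDelay

	for {
		err := op(ctx)
		if err == nil {
			return nil
		}
		fmt.Printf("attempt failed: %v; retrying in %s\n", err, delay)

		// Add a little jitter so many nodes don't retry in lockstep.
		jitter := time.Duration(rand.Int63n(int64(delay) / 2))

		select {
		case <-time.After(delay + jitter):
		case <-ctx.Done():
			return ctx.Err()
		}

		// Double the delay, bounded by maxDelay.
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
}

func main() {
	// Toy operation: fails twice (image "not yet available"), then succeeds.
	attempts := 0
	_ = retryWithBackoff(context.Background(), func(context.Context) error {
		attempts++
		if attempts < 3 {
			return fmt.Errorf("image not yet available (attempt %d)", attempts)
		}
		return nil
	})
	fmt.Println("succeeded after", attempts, "attempts")
}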

@github-actions github-actions bot removed the Stale label May 21, 2025