Does anyone else have issues where the task driver will "lose" a running container, then enter a loop where nomad continually tries to restart the allocation and fails, because the task uses a statically mapped port that cannot be re-bound while the original container still holds it? I'm seeing this happen fairly regularly (at least once every couple of months); it takes down my homelab and needs manual intervention to fix.
An example allocation that can't be restarted due to a port conflict with the still-running container:
Apr 16, '25 12:04:33 +0100 | Not Restarting | Error was unrecoverable
Apr 16, '25 12:04:33 +0100 | Driver Failure | rpc error: code = Unknown desc = failed to start task, could not start container: cannot start container, status code: 500: {"cause":"address already in use","message":"cannot listen on the TCP port: listen tcp4 81.187.21.248:9116: bind: address already in use","response":500}
Apr 16, '25 12:04:31 +0100 | Task Setup | Building Task Directory
Apr 16, '25 12:04:31 +0100 | Received | Task received by client
In this case I found there were actually two copies of this task still running, on different nodes, that the task driver had previously lost, and nomad was cycling between the two nodes, trying to start the task on each and repeatedly failing. When an allocation is lost, its consul service registrations are also dropped, so the services become inaccessible even though the containers are still running.
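For anyone hitting the same thing, something like this should identify the orphaned container on the affected node (illustrative commands only; 9116 is the port from the events above):

ss -ltnp | grep 9116                                               # shows which process still holds the statically mapped port
podman ps --format '{{.ID}} {{.Names}} {{.Ports}}' | grep 9116     # identifies the container nomad has lost track of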
Task driver is running with the following plugin config:
The lost allocations have all been GC'd, so I can't see what happened to them at the end. Looking at the journal on the client, the restarts appear spontaneous.
Apr 15 22:56:52 caron. nomad[3730345]: 2025-04-15T22:56:52.058Z [INFO] client.device_mgr.nomad-device-usb: fingerprinting USB devices: plugin=usb @module=usb timestamp=2025-04-15T22:56:52.058Z
Apr 15 22:56:52 caron. nomad[3730345]: 2025-04-15T22:56:52.058Z [INFO] client.device_mgr.nomad-device-usb: no changes to devices: plugin=usb @module=usb timestamp=2025-04-15T22:56:52.058Z
Apr 15 22:57:52 caron. nomad[3730345]: 2025-04-15T22:57:52.058Z [INFO] client.device_mgr.nomad-device-usb: fingerprinting USB devices: plugin=usb @module=usb timestamp=2025-04-15T22:57:52.058Z
Apr 15 22:57:52 caron. nomad[3730345]: 2025-04-15T22:57:52.058Z [INFO] client.device_mgr.nomad-device-usb: no changes to devices: plugin=usb @module=usb timestamp=2025-04-15T22:57:52.058Z
Apr 15 22:58:52 caron. nomad[3730345]: 2025-04-15T22:58:52.059Z [INFO] client.device_mgr.nomad-device-usb: fingerprinting USB devices: plugin=usb @module=usb timestamp=2025-04-15T22:58:52.059Z
Apr 15 22:58:52 caron. nomad[3730345]: 2025-04-15T22:58:52.059Z [INFO] client.device_mgr.nomad-device-usb: no changes to devices: plugin=usb @module=usb timestamp=2025-04-15T22:58:52.059Z
Apr 15 22:59:31 caron. nomad[3730345]: 2025-04-15T22:59:31.794Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=f27c20dd-b775-6005-a784-b6867e7ffd62 task=exporter type=Received msg="Task received by client" failed=false
Apr 15 22:59:31 caron. nomad[3730345]: 2025-04-15T22:59:31.942Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=1bd610be-9d13-a031-0240-b96af69f80b9 task=runtime type=Received msg="Task received by client" failed=false
Apr 15 22:59:32 caron. nomad[3730345]: 2025-04-15T22:59:32.085Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=1c7cd278-f81c-d06f-a143-4bd72276b154 task=oauth2-proxy type=Received msg="Task received by client" failed=false
Apr 15 22:59:32 caron. nomad[3730345]: 2025-04-15T22:59:32.087Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=1c7cd278-f81c-d06f-a143-4bd72276b154 task=unconditional-oauth2-proxy type=Received msg="Task received by client" failed=false
Apr 15 22:59:32 caron. nomad[3730345]: 2025-04-15T22:59:32.112Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=f27c20dd-b775-6005-a784-b6867e7ffd62 task=exporter type="Task Setup" msg="Building Task Directory" failed=false
Apr 15 22:59:32 caron. nomad[3730345]: 2025-04-15T22:59:32.177Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=1965140e-5839-6736-fdc7-a8cd35cfe904 task=scraper type=Received msg="Task received by client" failed=false
Apr 15 22:59:32 caron. nomad[3730345]: 2025-04-15T22:59:32.289Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=1bd610be-9d13-a031-0240-b96af69f80b9 task=runtime type="Task Setup" msg="Building Task Directory" failed=false
...
Apr 15 22:59:35 caron. nomad[3730345]: 2025-04-15T22:59:35.661Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=41245b44-6d54-2d58-16e6-3c3a3cea323e task=flightradar24 type="Task Setup" msg="Building Task Directory" failed=false
Apr 15 22:59:36 caron. nomad[3730345]: 2025-04-15T22:59:36.722Z [ERROR] client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: name=default error=<nil> period="2025-04-17 10:59:35.722091808 +0000 UTC m=+473441.722540049"
Apr 15 22:59:36 caron. nomad[3730345]: 2025-04-15T22:59:36.834Z [ERROR] client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: name=default error=<nil> period="2025-04-17 10:59:41.834317554 +0000 UTC m=+473447.834765815"
Apr 15 22:59:36 caron. nomad[3730345]: 2025-04-15T22:59:36.907Z [ERROR] client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: name=default error=<nil> period="2025-04-17 10:59:45.9073335 +0000 UTC m=+473451.907781755"
Apr 15 22:59:36 caron. nomad[3730345]: 2025-04-15T22:59:36.911Z [ERROR] client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: name=default error=<nil> period="2025-04-17 10:59:35.911881226 +0000 UTC m=+473441.912329497"
Apr 15 22:59:37 caron. nomad[3730345]: 2025-04-15T22:59:37.153Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=f27c20dd-b775-6005-a784-b6867e7ffd62 task=exporter type="Driver Failure"
Apr 15 22:59:37 caron. nomad[3730345]: msg=
Apr 15 22:59:37 caron. nomad[3730345]: | rpc error: code = Unknown desc = failed to start task, could not start container: cannot start container, status code: 500: {"cause":"address already in use","message":"cannot listen on the TCP port: listen tcp4 81.187.21.247:9116: bind: address alread>
Apr 15 22:59:37 caron. nomad[3730345]: failed=false
Apr 15 22:59:37 caron. nomad[3730345]: 2025-04-15T22:59:37.621Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=f27c20dd-b775-6005-a784-b6867e7ffd62 task=exporter
Apr 15 22:59:37 caron. nomad[3730345]: error=
Apr 15 22:59:37 caron. nomad[3730345]: | rpc error: code = Unknown desc = failed to start task, could not start container: cannot start container, status code: 500: {"cause":"address already in use","message":"cannot listen on the TCP port: listen tcp4 81.187.21.247:9116: bind: address alread>
From the three nomad servers, the only logs recorded around these times were:
Aside from that one missed heartbeat from a different node (crow), there doesn't seem to be any obvious trigger for nomad to be restarting tasks that are otherwise running (on caron). I'm not seeing any log messages relating to lost allocations on either client or server nodes.
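For context, the search looks roughly like this (illustrative; assumes nomad runs under the 'nomad' systemd unit, and the alloc ID is the one from the client log above):

journalctl -u nomad --since "2025-04-15 22:50" --until "2025-04-15 23:05" | grep -iE 'lost|missed heartbeat|alloc_runner'
nomad alloc status -verbose f27c20dd-b775-6005-a784-b6867e7ffd62   # already GC'd by the time I looked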
When this does happen, I need to podman stop each of the containers that nomad is trying to restart, and then typically run a nomad job eval -force-reschedule to get it running again without waiting for the next retry.
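Roughly, the manual recovery looks like this (sketch only; the container and job names are placeholders):

podman ps                                     # on the affected node, list the containers nomad has lost track of
podman stop <lost-container-id>               # stop each container that is blocking the static port
nomad job eval -force-reschedule <job-name>   # force a reschedule instead of waiting for the next retry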