Driver loses running containers, fails to restart with port conflicts #432

optiz0r opened this issue Apr 16, 2025 · 0 comments

optiz0r commented Apr 16, 2025

Does anyone else have issues where the task driver will "lose" a running container, then enter a loop where Nomad continually tries to restart the allocation, failing because the task uses a statically mapped port that cannot be re-bound while the original container still holds it? I'm seeing this happen fairly regularly (at least once every couple of months); it takes down my homelab and needs manual intervention to fix.
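
For context, the affected tasks reserve a static host port in the job's network block, roughly like the sketch below (illustrative only, not the actual job spec; the port label is a placeholder, and 9116 is the port from the error events):

network {
  # "metrics" is a placeholder label; the static port matches the exporter's 9116 seen in the events below
  port "metrics" {
    static = 9116
  }
}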

An example allocation that can't be restarted due to a port conflict with the still-running container:

Apr 16, '25 12:04:33 +0100 | Not Restarting | Error was unrecoverable
Apr 16, '25 12:04:33 +0100 | Driver Failure | rpc error: code = Unknown desc = failed to start task, could not start container: cannot start container, status code: 500: {"cause":"address already in use","message":"cannot listen on the TCP port: listen tcp4 81.187.21.248:9116: bind: address already in use","response":500}
Apr 16, '25 12:04:31 +0100 | Task Setup | Building Task Directory
Apr 16, '25 12:04:31 +0100 | Received | Task received by client

In this case I found there were actually two copies of this task running, on different nodes, that the task driver had previously lost, and Nomad was cycling between the two nodes, trying to start the task on each and repeatedly failing. When an allocation is lost, its Consul service registrations are also dropped, so the services become inaccessible even though the containers are still running.

The task driver is running with the following plugin config:

{
  "plugin": {
    "nomad-driver-podman": {
      "config": {
        "socket": [
          {
            "name": "default",
            "socket_path": "unix:///run/podman/podman.sock"
          },
          {
            "name": "ben",
            "socket_path": "unix://run/user/1093000004/podman/podman.sock"
          }
        ],
        "volumes": {
          "enabled": true,
          "selinuxlabel": "z"
        }
      }
    }
  }
}

The lost allocations have all been GC'd, so I can't see what happened to them at the end. Looking at the journal on the client, the restarts seem spontaneous.

Apr 15 22:56:52 caron. nomad[3730345]:     2025-04-15T22:56:52.058Z [INFO]  client.device_mgr.nomad-device-usb: fingerprinting USB devices: plugin=usb @module=usb timestamp=2025-04-15T22:56:52.058Z
Apr 15 22:56:52 caron. nomad[3730345]:     2025-04-15T22:56:52.058Z [INFO]  client.device_mgr.nomad-device-usb: no changes to devices: plugin=usb @module=usb timestamp=2025-04-15T22:56:52.058Z
Apr 15 22:57:52 caron. nomad[3730345]:     2025-04-15T22:57:52.058Z [INFO]  client.device_mgr.nomad-device-usb: fingerprinting USB devices: plugin=usb @module=usb timestamp=2025-04-15T22:57:52.058Z
Apr 15 22:57:52 caron. nomad[3730345]:     2025-04-15T22:57:52.058Z [INFO]  client.device_mgr.nomad-device-usb: no changes to devices: plugin=usb @module=usb timestamp=2025-04-15T22:57:52.058Z
Apr 15 22:58:52 caron. nomad[3730345]:     2025-04-15T22:58:52.059Z [INFO]  client.device_mgr.nomad-device-usb: fingerprinting USB devices: plugin=usb @module=usb timestamp=2025-04-15T22:58:52.059Z
Apr 15 22:58:52 caron. nomad[3730345]:     2025-04-15T22:58:52.059Z [INFO]  client.device_mgr.nomad-device-usb: no changes to devices: plugin=usb @module=usb timestamp=2025-04-15T22:58:52.059Z
Apr 15 22:59:31 caron. nomad[3730345]:     2025-04-15T22:59:31.794Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=f27c20dd-b775-6005-a784-b6867e7ffd62 task=exporter type=Received msg="Task received by client" failed=false
Apr 15 22:59:31 caron. nomad[3730345]:     2025-04-15T22:59:31.942Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1bd610be-9d13-a031-0240-b96af69f80b9 task=runtime type=Received msg="Task received by client" failed=false
Apr 15 22:59:32 caron. nomad[3730345]:     2025-04-15T22:59:32.085Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1c7cd278-f81c-d06f-a143-4bd72276b154 task=oauth2-proxy type=Received msg="Task received by client" failed=false
Apr 15 22:59:32 caron. nomad[3730345]:     2025-04-15T22:59:32.087Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1c7cd278-f81c-d06f-a143-4bd72276b154 task=unconditional-oauth2-proxy type=Received msg="Task received by client" failed=false
Apr 15 22:59:32 caron. nomad[3730345]:     2025-04-15T22:59:32.112Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=f27c20dd-b775-6005-a784-b6867e7ffd62 task=exporter type="Task Setup" msg="Building Task Directory" failed=false
Apr 15 22:59:32 caron. nomad[3730345]:     2025-04-15T22:59:32.177Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1965140e-5839-6736-fdc7-a8cd35cfe904 task=scraper type=Received msg="Task received by client" failed=false
Apr 15 22:59:32 caron. nomad[3730345]:     2025-04-15T22:59:32.289Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1bd610be-9d13-a031-0240-b96af69f80b9 task=runtime type="Task Setup" msg="Building Task Directory" failed=false
...
Apr 15 22:59:35 caron. nomad[3730345]:     2025-04-15T22:59:35.661Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=41245b44-6d54-2d58-16e6-3c3a3cea323e task=flightradar24 type="Task Setup" msg="Building Task Directory" failed=false
Apr 15 22:59:36 caron. nomad[3730345]:     2025-04-15T22:59:36.722Z [ERROR] client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: name=default error=<nil> period="2025-04-17 10:59:35.722091808 +0000 UTC m=+473441.722540049"
Apr 15 22:59:36 caron. nomad[3730345]:     2025-04-15T22:59:36.834Z [ERROR] client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: name=default error=<nil> period="2025-04-17 10:59:41.834317554 +0000 UTC m=+473447.834765815"
Apr 15 22:59:36 caron. nomad[3730345]:     2025-04-15T22:59:36.907Z [ERROR] client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: name=default error=<nil> period="2025-04-17 10:59:45.9073335 +0000 UTC m=+473451.907781755"
Apr 15 22:59:36 caron. nomad[3730345]:     2025-04-15T22:59:36.911Z [ERROR] client.vault: error during renewal of lease or token failed due to a non-fatal error; retrying: name=default error=<nil> period="2025-04-17 10:59:35.911881226 +0000 UTC m=+473441.912329497"
Apr 15 22:59:37 caron. nomad[3730345]:     2025-04-15T22:59:37.153Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=f27c20dd-b775-6005-a784-b6867e7ffd62 task=exporter type="Driver Failure"
Apr 15 22:59:37 caron. nomad[3730345]:   msg=
Apr 15 22:59:37 caron. nomad[3730345]:   | rpc error: code = Unknown desc = failed to start task, could not start container: cannot start container, status code: 500: {"cause":"address already in use","message":"cannot listen on the TCP port: listen tcp4 81.187.21.247:9116: bind: address alread>
Apr 15 22:59:37 caron. nomad[3730345]:    failed=false
Apr 15 22:59:37 caron. nomad[3730345]:     2025-04-15T22:59:37.621Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=f27c20dd-b775-6005-a784-b6867e7ffd62 task=exporter
Apr 15 22:59:37 caron. nomad[3730345]:   error=
Apr 15 22:59:37 caron. nomad[3730345]:   | rpc error: code = Unknown desc = failed to start task, could not start container: cannot start container, status code: 500: {"cause":"address already in use","message":"cannot listen on the TCP port: listen tcp4 81.187.21.247:9116: bind: address alread>

From the three Nomad servers, the only logs recorded around these times were:

Apr 15 18:19:23 branson nomad[3121592]: ==> Newer Nomad version available: 1.10.0 (currently running: 1.9.4)
Apr 15 22:59:31 branson nomad[3121592]:     2025-04-15T22:59:31.709Z [WARN]  nomad.heartbeat: node TTL expired: node_id=110fb005-c953-88e2-bd9c-021bddc50826
Apr 16 08:21:02 branson nomad[3121592]:     2025-04-16T08:21:02.024Z [ERROR] nomad.event_broker: failed resolving ACL for secretID, closing subscriptions: error="ACL token not found"

Aside from that one expired heartbeat, which was for a different node (crow), there doesn't seem to be any obvious trigger for Nomad to restart tasks that are otherwise running on caron. I'm not seeing any log messages relating to lost allocations on either the client or the server nodes.

When this does happen, I need to podman stop each of the containers that Nomad is trying to restart, and then typically run a nomad job eval -force-reschedule to get the job running again without waiting for the next retry.
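
For reference, the manual recovery looks roughly like this (a sketch; <container-id> and <job-name> are placeholders):

# list running containers and identify the orphaned one still bound to the static port (e.g. 9116)
podman ps

# stop the orphaned container so the port is freed
podman stop <container-id>

# force Nomad to reschedule the failed allocation immediately instead of waiting for the next retry
nomad job eval -force-reschedule <job-name>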
