
[🐛 Bug]: Chrome nodes stuck on Termination state #2168

Closed
@Aymen-Ben-S

Description

What happened?

We consistently see Chrome node pods get stuck in the Terminating state.

I'm not sure I can provide exact steps to reproduce, but I'm happy to share logs from a system where this is happening.

Command used to start Selenium Grid with Docker (or Kubernetes)

global:
  seleniumGrid:
    imageRegistry: {{ fvt_image_registry }}/selenium
    imagePullSecret: xxx
hub:
  imageTag: 4.18.1-20240224
chromeNode:
  imageTag: 122.0-20240224
  resources:
    requests:
      cpu: "0.1"
firefoxNode:
  enabled: false
edgeNode:
  enabled: false
autoscaling:
  enabled: true
  scalingType: job
  scaledOptions:
    maxReplicaCount: 999
  scaledJobOptions:
    scalingStrategy:
      strategy: default
ingress:
  hostname: selenium-grid.local
  path: /selenium

Relevant log output

NAME                                              READY   STATUS        RESTARTS      AGE     IP               NODE           NOMINATED NODE   READINESS GATES
keda-operator-d44bc8ffc-f7rzk                     1/1     Running       0             161m    172.30.107.176   10.74.145.3    <none>           <none>
keda-operator-metrics-apiserver-b994566dc-8b59f   1/1     Running       1 (14h ago)   14h     172.30.180.80    10.74.145.8    <none>           <none>
selenium-grid-selenium-chrome-node-25f9x-5b6dz    1/1     Terminating   0             3h24m   172.30.93.77     10.48.76.225   <none>           <none>
selenium-grid-selenium-chrome-node-5cx4x-j9clf    1/1     Terminating   0             174m    172.30.180.158   10.74.145.8    <none>           <none>
selenium-grid-selenium-chrome-node-5hpg8-m2xdz    1/1     Running       0             160m    172.30.202.92    10.48.76.223   <none>           <none>
selenium-grid-selenium-chrome-node-7mt5f-gwgm6    1/1     Running       0             160m    172.30.139.72    10.74.145.20   <none>           <none>
selenium-grid-selenium-chrome-node-cmggr-vvl2l    1/1     Running       0             160m    172.30.180.178   10.74.145.8    <none>           <none>
selenium-grid-selenium-chrome-node-fmq9j-qlvmj    1/1     Terminating   0             174m    172.30.107.131   10.74.145.3    <none>           <none>
selenium-grid-selenium-chrome-node-fxgnj-hb8qf    1/1     Terminating   0             174m    172.30.93.97     10.48.76.225   <none>           <none>
selenium-grid-selenium-chrome-node-gsrsp-h9tzz    1/1     Terminating   0             3h16m   172.30.93.83     10.48.76.225   <none>           <none>
selenium-grid-selenium-chrome-node-xd72s-vws74    1/1     Terminating   0             3h24m   172.30.202.89    10.48.76.223   <none>           <none>
selenium-grid-selenium-chrome-node-xm8h6-69wl8    1/1     Terminating   0             3h24m   172.30.139.32    10.74.145.20   <none>           <none>
selenium-grid-selenium-chrome-node-xqt4h-hkt67    1/1     Terminating   0             3h16m   172.30.139.102   10.74.145.20   <none>           <none>
selenium-grid-selenium-hub-5f49c8fc47-vzmt6       1/1     Running       0             14h     172.30.202.81    10.48.76.223   <none>           <none>
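
To check whether the stuck pods cluster on one worker node, the listing above can be grouped by node. This is a minimal sketch, assuming the default column layout of `kubectl get pods -o wide` and that the NOMINATED NODE and READINESS GATES columns are single tokens such as `<none>`; the function name `terminating_by_node` and the `SAMPLE` excerpt are illustrative, not part of kubectl or the Helm chart.

```python
from collections import defaultdict
from typing import Dict, List

# Excerpt of the `kubectl get pods -o wide` listing above (rows copied verbatim).
SAMPLE = """\
NAME                                              READY   STATUS        RESTARTS      AGE     IP               NODE           NOMINATED NODE   READINESS GATES
keda-operator-metrics-apiserver-b994566dc-8b59f   1/1     Running       1 (14h ago)   14h     172.30.180.80    10.74.145.8    <none>           <none>
selenium-grid-selenium-chrome-node-25f9x-5b6dz    1/1     Terminating   0             3h24m   172.30.93.77     10.48.76.225   <none>           <none>
selenium-grid-selenium-chrome-node-5cx4x-j9clf    1/1     Terminating   0             174m    172.30.180.158   10.74.145.8    <none>           <none>
selenium-grid-selenium-chrome-node-5hpg8-m2xdz    1/1     Running       0             160m    172.30.202.92    10.48.76.223   <none>           <none>
selenium-grid-selenium-chrome-node-fxgnj-hb8qf    1/1     Terminating   0             174m    172.30.93.97     10.48.76.225   <none>           <none>
"""


def terminating_by_node(listing: str) -> Dict[str, List[str]]:
    """Group the names of pods in STATUS "Terminating" by the NODE they are scheduled on."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in listing.strip().splitlines()[1:]:  # skip the header row
        tokens = line.split()
        if len(tokens) < 3:
            continue
        # RESTARTS may contain spaces ("1 (14h ago)"), so NODE is read from the right:
        # the last three columns are NODE, NOMINATED NODE and READINESS GATES.
        name, status, node = tokens[0], tokens[2], tokens[-3]
        if status == "Terminating":
            groups[node].append(name)
    return dict(groups)


if __name__ == "__main__":
    for node, pods in sorted(terminating_by_node(SAMPLE).items()):
        print(f"{node}: {len(pods)} terminating -> {', '.join(pods)}")
```

Run against the full listing above, the same function reports 8 Terminating pods spread over 5 worker nodes (three on 10.48.76.225, two on 10.74.145.20, and one each on 10.74.145.8, 10.74.145.3 and 10.48.76.223), so the problem does not appear to be confined to a single host.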




Chrome node log (pod selenium-grid-selenium-chrome-node-25f9x-5b6dz, IP 172.30.93.77):
2024-03-11 18:58:48,902 INFO Included extra file "/etc/supervisor/conf.d/selenium.conf" during parsing
2024-03-11 18:58:48,905 INFO RPC interface 'supervisor' initialized
2024-03-11 18:58:48,905 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-03-11 18:58:48,906 INFO supervisord started with pid 8
2024-03-11 18:58:49,909 INFO spawned: 'xvfb' with pid 9
2024-03-11 18:58:49,912 INFO spawned: 'vnc' with pid 10
2024-03-11 18:58:49,915 INFO spawned: 'novnc' with pid 11
2024-03-11 18:58:49,917 INFO spawned: 'selenium-node' with pid 12
2024-03-11 18:58:49,938 INFO success: selenium-node entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
E: [pulseaudio] main.c: Daemon startup failed.
Home directory not accessible: Permission denied
No PulseAudio daemon running, or not running as session daemon.
Home directory not accessible: Permission denied
No PulseAudio daemon running, or not running as session daemon.
Home directory not accessible: Permission denied
No PulseAudio daemon running, or not running as session daemon.
Appending Selenium options: --session-timeout 300
Appending Selenium options: --register-period 60
Appending Selenium options: --register-cycle 5
Appending Selenium options: --heartbeat-period 30
Appending Selenium options: --log-level INFO
Generating Selenium Config
Setting up SE_NODE_HOST...
Tracing is disabled
Selenium Grid Node configuration: 
[events]
publish = "tcp://selenium-grid-selenium-hub.selenium:4442"
subscribe = "tcp://selenium-grid-selenium-hub.selenium:4443"

[server]
port = "5555"
[node]
grid-url = "http://admin:[email protected]:4444"
session-timeout = "300"
override-max-sessions = false
detect-drivers = false
drain-after-session-count = 1
max-sessions = 1

[[node.driver-configuration]]
display-name = "chrome"
stereotype = '{"browserName": "chrome", "browserVersion": "122.0", "platformName": "Linux", "goog:chromeOptions": {"binary": "/usr/bin/google-chrome"}}'
max-sessions = 1

Starting Selenium Grid Node...
2024-03-11 18:58:51,777 INFO success: xvfb entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-11 18:58:51,777 INFO success: vnc entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-11 18:58:51,777 INFO success: novnc entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
18:58:51.977 INFO [LoggingOptions.configureLogEncoding] - Using the system default encoding
18:58:51.985 INFO [OpenTelemetryTracer.createTracer] - Using OpenTelemetry for tracing
18:58:52.554 INFO [UnboundZmqEventBus.<init>] - Connecting to tcp://selenium-grid-selenium-hub.selenium:4442 and tcp://selenium-grid-selenium-hub.selenium:4443
18:58:52.767 INFO [UnboundZmqEventBus.<init>] - Sockets created
18:58:53.781 INFO [UnboundZmqEventBus.<init>] - Event bus ready
18:58:54.166 INFO [NodeServer.createHandlers] - Reporting self as: http://172.30.93.77:5555
18:58:54.252 INFO [NodeOptions.getSessionFactories] - Detected 1 available processors
18:58:54.763 INFO [NodeOptions.report] - Adding chrome for {"browserName": "chrome","browserVersion": "122.0","goog:chromeOptions": {"binary": "\u002fusr\u002fbin\u002fgoogle-chrome"},"platformName": "linux","se:noVncPort": 7900,"se:vncEnabled": true} 1 times
2024-03-11T18:58:54UTC [Probe.Startup] - Wait for the Node to report its status
18:58:54.884 INFO [Node.<init>] - Binding additional locator mechanisms: relative
18:58:55.373 INFO [NodeServer$1.start] - Starting registration process for Node http://172.30.93.77:5555
18:58:55.375 INFO [NodeServer.execute] - Started Selenium node 4.18.1 (revision b1d3319b48): http://172.30.93.77:5555
18:58:55.464 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
18:58:55.984 INFO [NodeServer.lambda$createHandlers$2] - Node has been added
18:58:57.964 INFO [LocalNode.checkSessionCount] - Draining Node, configured sessions value (1) has been reached.
18:58:57.972 INFO [LocalNode.newSession] - Session created by the Node. Id: b03b291c5d5108416cf0ac1327aeeda8, Caps: Capabilities {acceptInsecureCerts: true, browserName: chrome-headless-shell, browserVersion: 122.0.6261.69, chrome: {chromedriverVersion: 122.0.6261.69 (81bc525b6a36..., userDataDir: /tmp/.org.chromium.Chromium...}, fedcm:accounts: true, goog:chromeOptions: {debuggerAddress: localhost:43491}, networkConnectionEnabled: false, pageLoadStrategy: normal, platformName: linux, proxy: Proxy(), se:bidiEnabled: false, se:cdp: ws://admin:admin@selenium-g..., se:cdpVersion: 122.0.6261.69, se:vnc: ws://admin:admin@selenium-g..., se:vncEnabled: true, se:vncLocalAddress: ws://172.30.93.77:7900, setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: dismiss and notify, webauthn:extension:credBlob: true, webauthn:extension:largeBlob: true, webauthn:extension:minPinLength: true, webauthn:extension:prf: true, webauthn:virtualAuthenticators: true}
2024-03-11T18:58:58UTC [Probe.Startup] - Node responds the ID: 81a24be5-1a59-402b-9407-62e5be087a72 with status: UP
2024-03-11T18:58:58UTC [Probe.Startup] - Grid responds a matched Node ID: 81a24be5-1a59-402b-9407-62e5be087a72
2024-03-11T18:58:58UTC [Probe.Startup] - Node ID: 81a24be5-1a59-402b-9407-62e5be087a72 is found in the Grid. Node is ready.
19:03:34.310 INFO [SessionSlot.stop] - Stopping session b03b291c5d5108416cf0ac1327aeeda8
19:03:34.350 INFO [LocalNode.stopTimedOutSession] - Node draining complete!
19:03:35.357 INFO [NodeServer.lambda$createHandlers$3] - Shutting down
2024-03-11 19:03:35,722 INFO exited: selenium-node (exit status 0; expected)
2024-03-11 19:03:35,722 WARN received SIGINT indicating exit request
2024-03-11 19:03:35,723 INFO waiting for xvfb, vnc, novnc to die
2024-03-11 19:03:37,727 INFO stopped: novnc (terminated by SIGTERM)
2024-03-11 19:03:38,730 INFO stopped: vnc (terminated by SIGTERM)
2024-03-11 19:03:38,731 INFO waiting for xvfb to die
2024-03-11 19:03:39,732 INFO stopped: xvfb (terminated by SIGTERM)
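
The log above mixes Selenium's `HH:MM:SS.mmm` timestamps with supervisord's `YYYY-MM-DD HH:MM:SS,mmm` format. A minimal sketch that measures how long the in-container shutdown took, from the `[SessionSlot.stop]` line to the last supervisord `stopped:` line, assuming both formats use the same UTC clock and that all lines fall on 2024-03-11; the helper names `parse_ts` and `shutdown_duration` are hypothetical.

```python
import re
from datetime import datetime
from typing import Optional

# Assumption: every line comes from 2024-03-11 and both log formats use the same (UTC) clock.
LOG_DATE = "2024-03-11"
SELENIUM_TS = re.compile(r"^(\d{2}:\d{2}:\d{2})\.(\d{3}) ")  # e.g. "19:03:34.310 INFO ..."
SUPERVISOR_TS = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}),(\d{3}) ")  # e.g. "2024-03-11 19:03:35,722 INFO ..."

# Shutdown section of the Chrome node log above (lines copied verbatim).
SAMPLE_LOG = """\
19:03:34.310 INFO [SessionSlot.stop] - Stopping session b03b291c5d5108416cf0ac1327aeeda8
19:03:34.350 INFO [LocalNode.stopTimedOutSession] - Node draining complete!
19:03:35.357 INFO [NodeServer.lambda$createHandlers$3] - Shutting down
2024-03-11 19:03:35,722 INFO exited: selenium-node (exit status 0; expected)
2024-03-11 19:03:35,722 WARN received SIGINT indicating exit request
2024-03-11 19:03:35,723 INFO waiting for xvfb, vnc, novnc to die
2024-03-11 19:03:37,727 INFO stopped: novnc (terminated by SIGTERM)
2024-03-11 19:03:38,730 INFO stopped: vnc (terminated by SIGTERM)
2024-03-11 19:03:38,731 INFO waiting for xvfb to die
2024-03-11 19:03:39,732 INFO stopped: xvfb (terminated by SIGTERM)
"""


def parse_ts(line: str) -> Optional[datetime]:
    """Return the timestamp of a Selenium or supervisord log line, or None if it has none."""
    m = SUPERVISOR_TS.match(line)
    if m:
        return datetime.strptime(f"{m[1]} {m[2]}.{m[3]}", "%Y-%m-%d %H:%M:%S.%f")
    m = SELENIUM_TS.match(line)
    if m:
        return datetime.strptime(f"{LOG_DATE} {m[1]}.{m[2]}", "%Y-%m-%d %H:%M:%S.%f")
    return None


def shutdown_duration(log: str) -> float:
    """Seconds from the first "[SessionSlot.stop]" line to the last supervisord "stopped:" line."""
    start: Optional[datetime] = None
    end: Optional[datetime] = None
    for line in log.splitlines():
        ts = parse_ts(line)
        if ts is None:
            continue
        if start is None and "[SessionSlot.stop]" in line:
            start = ts
        if "INFO stopped:" in line:
            end = ts
    if start is None or end is None:
        raise ValueError("shutdown markers not found in log")
    return (end - start).total_seconds()


if __name__ == "__main__":
    print(f"in-container shutdown took {shutdown_duration(SAMPLE_LOG):.3f} s")
```

For this log the result is about 5.4 s: the Selenium node and every supervisord-managed process report stopping by 19:03:39, yet the describe output still lists the container as Running.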


Chrome node pod description (describe pod output for the same pod):
Name:                      selenium-grid-selenium-chrome-node-25f9x-5b6dz
Namespace:                 selenium
Priority:                  0
Service Account:           selenium-grid-selenium-serviceaccount
Node:                      10.48.76.225/10.48.76.225
Start Time:                Mon, 11 Mar 2024 18:58:22 +0000
Labels:                    app=selenium-grid-selenium-chrome-node
                           app.kubernetes.io/component=selenium-grid-4.18.1-20240224
                           app.kubernetes.io/instance=selenium-grid
                           app.kubernetes.io/managed-by=helm
                           app.kubernetes.io/name=selenium-grid-selenium-chrome-node
                           app.kubernetes.io/version=4.18.1-20240224
                           controller-uid=449ef28c-fc3f-4da4-8b3f-31469fb86d9d
                           helm.sh/chart=selenium-grid-0.28.4
                           job-name=selenium-grid-selenium-chrome-node-25f9x
                           scaledjob.keda.sh/name=selenium-grid-selenium-chrome-node
Annotations:               checksum/event-bus-configmap: 2698802d0bbf358d1634b47dff1ef36c5fc2501a27a9d2eef02c7874eb9496f8
                           checksum/logging-configmap: 7f721b250f90c8a5877dc9217b97f0a14392b420edf0e5af105a60944d2b9dc3
                           checksum/node-configmap: 3b6c0fffa6e6a10d57e5455ce21e1e7ee55e0638f15ff521b8c96fe8c10d8e91
                           checksum/server-configmap: ac6520a86bfffa04b4946bbce02ac8f1be341d800f4ab09f4e7cf274f74d3770
                           cni.projectcalico.org/containerID: a880de0585776e16db52c8eb9c290406e1b69dfd510cd15fc0c421cd6a9ed1dc
                           cni.projectcalico.org/podIP: 
                           cni.projectcalico.org/podIPs: 
                           k8s.v1.cni.cncf.io/network-status:
                             [{
                                 "name": "k8s-pod-network",
                                 "ips": [
                                     "172.30.93.77"
                                 ],
                                 "default": true,
                                 "dns": {}
                             }]
                           k8s.v1.cni.cncf.io/networks-status:
                             [{
                                 "name": "k8s-pod-network",
                                 "ips": [
                                     "172.30.93.77"
                                 ],
                                 "default": true,
                                 "dns": {}
                             }]
                           openshift.io/scc: restricted-v2
                           seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:                    Terminating (lasts 3h15m)
Termination Grace Period:  30s
SeccompProfile:            RuntimeDefault
IP:                        172.30.93.77
IPs:
  IP:           172.30.93.77
Controlled By:  Job/selenium-grid-selenium-chrome-node-25f9x
Containers:
  selenium-grid-selenium-chrome-node:
    Container ID:   cri-o://f177006210af63a402465588e62b42ca96387427102f59824d4a6b29b197ab21
    Image:          docker-na-private.artifactory.swg-devops.com/wiotp-docker-local/selenium/node-chrome:122.0-20240224
    Image ID:       docker-na-private.artifactory.swg-devops.com/wiotp-docker-local/selenium/node-chrome@sha256:3b50643ff9885215c9142fefecace7b17efdde64a8c863ef702bd4a6c3e6a378
    Port:           5555/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Mon, 11 Mar 2024 18:58:48 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:     100m
      memory:  1Gi
    Startup:   exec [bash -c /opt/selenium/nodeProbe.sh Startup >> /proc/1/fd/1] delay=0s timeout=60s period=5s #success=1 #failure=12
    Environment Variables from:
      selenium-grid-selenium-event-bus       ConfigMap  Optional: false
      selenium-grid-selenium-node-config     ConfigMap  Optional: false
      selenium-grid-selenium-logging-config  ConfigMap  Optional: false
      selenium-grid-selenium-server-config   ConfigMap  Optional: false
      selenium-grid-selenium-secrets         Secret     Optional: false
    Environment:
      SE_OTEL_SERVICE_NAME:     selenium-grid-selenium-chrome-node
      SE_NODE_PORT:             5555
      SE_NODE_REGISTER_PERIOD:  60
      SE_NODE_REGISTER_CYCLE:   5
    Mounts:
      /dev/shm from dshm (rw)
      /opt/selenium/nodePreStop.sh from selenium-grid-selenium-node-config (rw,path="nodePreStop.sh")
      /opt/selenium/nodeProbe.sh from selenium-grid-selenium-node-config (rw,path="nodeProbe.sh")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-n6b64 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  selenium-grid-selenium-node-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      selenium-grid-selenium-node-config
    Optional:  false
  dshm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  1Gi
  kube-api-access-n6b64:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age   From     Message
  ----     ------             ----  ----     -------
  Warning  FailedPreStopHook  165m  kubelet  Exec lifecycle hook ([bash -c /opt/selenium/nodePreStop.sh >> /proc/1/fd/1]) for Container "selenium-grid-selenium-chrome-node" in Pod "selenium-grid-selenium-chrome-node-25f9x-5b6dz_selenium(8ff39bf5-08ec-4467-b609-0db98b02ff8c)" failed - error: rpc error: code = Unknown desc = command error: time="2024-03-11T14:37:35-05:00" level=fatal msg="nsexec-1[153181]: failed to open /proc/187153/ns/ipc: No such file or directory"
time="2024-03-11T14:37:35-05:00" level=fatal msg="nsexec-0[153169]: failed to sync with stage-1: next state: Invalid argument"
time="2024-03-11T14:37:35-05:00" level=error msg="exec failed: unable to start container process: error executing setns process: exit status 1"
, stdout: , stderr: , exit code -1, message: ""
  Warning  FailedPreStopHook  135m  kubelet  Exec lifecycle hook ([bash -c /opt/selenium/nodePreStop.sh >> /proc/1/fd/1]) for Container "selenium-grid-selenium-chrome-node" in Pod "selenium-grid-selenium-chrome-node-25f9x-5b6dz_selenium(8ff39bf5-08ec-4467-b609-0db98b02ff8c)" failed - error: rpc error: code = Unknown desc = command error: time="2024-03-11T15:08:06-05:00" level=fatal msg="nsexec-1[358906]: failed to open /proc/187153/ns/ipc: No such file or directory"
time="2024-03-11T15:08:06-05:00" level=fatal msg="nsexec-0[358893]: failed to sync with stage-1: next state: Invalid argument"
time="2024-03-11T15:08:06-05:00" level=error msg="exec failed: unable to start container process: error executing setns process: exit status 1"
, stdout: , stderr: , exit code -1, message: ""
  Warning  FailedPreStopHook  104m  kubelet  Exec lifecycle hook ([bash -c /opt/selenium/nodePreStop.sh >> /proc/1/fd/1]) for Container "selenium-grid-selenium-chrome-node" in Pod "selenium-grid-selenium-chrome-node-25f9x-5b6dz_selenium(8ff39bf5-08ec-4467-b609-0db98b02ff8c)" failed - error: rpc error: code = Unknown desc = command error: time="2024-03-11T15:38:36-05:00" level=fatal msg="nsexec-1[149417]: failed to open /proc/187153/ns/ipc: No such file or directory"
time="2024-03-11T15:38:36-05:00" level=fatal msg="nsexec-0[149388]: failed to sync with stage-1: next state: Invalid argument"
time="2024-03-11T15:38:36-05:00" level=error msg="exec failed: unable to start container process: error executing setns process: exit status 1"
, stdout: , stderr: , exit code -1, message: ""
  Warning  FailedPreStopHook  74m  kubelet  Exec lifecycle hook ([bash -c /opt/selenium/nodePreStop.sh >> /proc/1/fd/1]) for Container "selenium-grid-selenium-chrome-node" in Pod "selenium-grid-selenium-chrome-node-25f9x-5b6dz_selenium(8ff39bf5-08ec-4467-b609-0db98b02ff8c)" failed - error: rpc error: code = Unknown desc = command error: time="2024-03-11T16:09:06-05:00" level=fatal msg="nsexec-1[454447]: failed to open /proc/187153/ns/ipc: No such file or directory"
time="2024-03-11T16:09:06-05:00" level=fatal msg="nsexec-0[454427]: failed to sync with stage-1: next state: Invalid argument"
time="2024-03-11T16:09:06-05:00" level=error msg="exec failed: unable to start container process: error executing setns process: exit status 1"
, stdout: , stderr: , exit code -1, message: ""
  Warning  FailedPreStopHook  43m  kubelet  Exec lifecycle hook ([bash -c /opt/selenium/nodePreStop.sh >> /proc/1/fd/1]) for Container "selenium-grid-selenium-chrome-node" in Pod "selenium-grid-selenium-chrome-node-25f9x-5b6dz_selenium(8ff39bf5-08ec-4467-b609-0db98b02ff8c)" failed - error: rpc error: code = Unknown desc = command error: time="2024-03-11T16:39:37-05:00" level=fatal msg="nsexec-1[259528]: failed to open /proc/187153/ns/ipc: No such file or directory"
time="2024-03-11T16:39:37-05:00" level=fatal msg="nsexec-0[259505]: failed to sync with stage-1: next state: Invalid argument"
time="2024-03-11T16:39:37-05:00" level=error msg="exec failed: unable to start container process: error executing setns process: exit status 1"
, stdout: , stderr: , exit code -1, message: ""
  Normal   Killing            13m (x7 over 3h16m)  kubelet  Stopping container selenium-grid-selenium-chrome-node
  Warning  FailedKillPod      13m (x6 over 165m)   kubelet  error killing pod: [failed to "KillContainer" for "selenium-grid-selenium-chrome-node" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "8ff39bf5-08ec-4467-b609-0db98b02ff8c" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
  Warning  FailedPreStopHook  13m                  kubelet  Exec lifecycle hook ([bash -c /opt/selenium/nodePreStop.sh >> /proc/1/fd/1]) for Container "selenium-grid-selenium-chrome-node" in Pod "selenium-grid-selenium-chrome-node-25f9x-5b6dz_selenium(8ff39bf5-08ec-4467-b609-0db98b02ff8c)" failed - error: rpc error: code = Unknown desc = command error: time="2024-03-11T17:10:07-05:00" level=fatal msg="nsexec-1[51553]: failed to open /proc/187153/ns/ipc: No such file or directory"
time="2024-03-11T17:10:07-05:00" level=fatal msg="nsexec-0[51532]: failed to sync with stage-1: next state: Invalid argument"
time="2024-03-11T17:10:07-05:00" level=error msg="exec failed: unable to start container process: error executing setns process: exit status 1"
, stdout: , stderr: , exit code -1, message: ""
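
The `FailedPreStopHook` messages carry timestamps in UTC-05:00, while the container log is in UTC. A minimal sketch that converts the six hook-failure timestamps (copied from the events above) to UTC and computes the gap between consecutive attempts; `XVFB_STOPPED` is taken from the container log's last `stopped: xvfb` line and is assumed to be UTC, and all names here are illustrative.

```python
from datetime import datetime, timezone
from typing import List

# Timestamps copied from the FailedPreStopHook events (UTC-05:00).
HOOK_FAILURES = [
    "2024-03-11T14:37:35-05:00",
    "2024-03-11T15:08:06-05:00",
    "2024-03-11T15:38:36-05:00",
    "2024-03-11T16:09:06-05:00",
    "2024-03-11T16:39:37-05:00",
    "2024-03-11T17:10:07-05:00",
]
# Last line of the container log; assumed to be UTC.
XVFB_STOPPED = datetime(2024, 3, 11, 19, 3, 39, tzinfo=timezone.utc)


def to_utc(stamps: List[str]) -> List[datetime]:
    """Parse ISO 8601 timestamps with a UTC offset and normalize them to UTC."""
    return [datetime.fromisoformat(s).astimezone(timezone.utc) for s in stamps]


def retry_gaps_seconds(times: List[datetime]) -> List[float]:
    """Seconds between consecutive hook failures."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    utc = to_utc(HOOK_FAILURES)
    print("first hook failure (UTC):", utc[0].isoformat())
    print("seconds after last container log line:", (utc[0] - XVFB_STOPPED).total_seconds())
    print("gaps between attempts (s):", retry_gaps_seconds(utc))
```

With these events the first failure falls at 19:37:35 UTC, about 34 minutes after the container log's final line, and later attempts recur roughly every 30.5 minutes (1830 to 1831 s). Every attempt fails on the same missing /proc/187153/ns/ipc, which is consistent with the runtime still targeting a host PID that no longer exists.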

Operating System

OpenShift 4.12.49

Docker Selenium version (image tag)

4.18.1

Selenium Grid chart version (chart version)

0.28.4

Metadata

    Labels

    I-autoscaling-k8s: Issue relates to autoscaling in Kubernetes, or the scaler in KEDA
