
metrics-server reporting inconsistent numbers of control plane nodes #803

Closed
Description

@techstep

What happened:

When I run kubectl top nodes or kubectl get nodemetrics on a k8s cluster with metrics-server, at least one control-plane node is almost always missing from the output, and which node is missing changes from run to run, roughly every minute. All three control-plane nodes are up and healthy, and the worker nodes show up every time.
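
A minimal loop like the following shows the flapping (a sketch, not the exact commands I ran); the node count should always be 6, but it fluctuates:

# Count the nodes the Metrics API reports, once a minute; on this
# cluster the total should be 3 control-plane + 3 worker nodes.
while true; do
  kubectl get nodemetrics --no-headers | wc -l
  sleep 60
done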

What you expected to happen:

I expected to see all three control-plane nodes as well as all three worker nodes on every run.

Anything else we need to know?:

  • I have looked through the metrics-server logs and found that the scrape requests to all nodes, control plane and worker alike, received 200 responses; moreover, making those requests manually returned the metrics I was expecting to see (see the sketch after this list).

  • While the control-plane nodes flicker in and out of the node metrics, the number and type of pods reported remains consistent, and the pod metrics look completely fine.

  • The problem persists whether I run one replica or two.

  • We are running metrics-server on the control plane because we could not otherwise get metrics for pods running on the control plane.
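
For reference, the manual check looked roughly like this (a sketch: the secret lookup assumes the token lives in the service account's auto-created secret, which holds on this cluster version, and <node-ip> is a placeholder):

# Read the metrics-server service-account token (1.20-era clusters
# store it in an auto-created secret).
TOKEN=$(kubectl -n metrics-server get secret \
  $(kubectl -n metrics-server get sa metrics-server -o jsonpath='{.secrets[0].name}') \
  -o jsonpath='{.data.token}' | base64 -d)
# Scrape the kubelet resource-metrics endpoint on one node;
# -k mirrors the deployment's --kubelet-insecure-tls flag.
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  "https://<node-ip>:10250/metrics/resource"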

Environment:

  • Kubernetes distribution (GKE, EKS, Kubeadm, the hard way, etc.): kubeadm on top of OpenStack using ClusterAPI

  • Container Network Setup (flannel, calico, etc.): calico

  • Kubernetes version (use kubectl version): 1.21 (client), 1.20 (server)

  • Metrics Server manifest

apiVersion: v1
items:
- apiVersion: apps/v1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "26"
      meta.helm.sh/release-name: metrics-server
      meta.helm.sh/release-namespace: metrics-server
    creationTimestamp: "2021-07-13T18:41:53Z"
    generation: 26
    labels:
      app.kubernetes.io/instance: metrics-server
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: metrics-server
      helm.sh/chart: metrics-server-5.8.14
    name: metrics-server
    namespace: metrics-server
    resourceVersion: "11957101"
    uid: [redacted]
  spec:
    progressDeadlineSeconds: 600
    replicas: 2
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        app.kubernetes.io/instance: metrics-server
        app.kubernetes.io/name: metrics-server
    strategy:
      rollingUpdate:
        maxSurge: 25%
        maxUnavailable: 25%
      type: RollingUpdate
    template:
      metadata:
        annotations:
          ad.datadoghq.com/nginx-ingress-controller.check_names: '["kube_metrics_server"]'
          ad.datadoghq.com/nginx-ingress-controller.init_configs: '[{}]'
          ad.datadoghq.com/nginx-ingress-controller.instances: |
            [
              {
                "prometheus_url": "https://%%host%%:443/metrics"
              }
            ]
          enable.version-checker.io/metrics-server: "true"
          override-url.version-checker.io/metrics-server: bitnami/metrics-server
        creationTimestamp: null
        labels:
          app.kubernetes.io/instance: metrics-server
          app.kubernetes.io/managed-by: Helm
          app.kubernetes.io/name: metrics-server
          helm.sh/chart: metrics-server-5.8.14
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: node-role.kubernetes.io/master
                  operator: Exists
        containers:
        - command:
          - /pod_nanny
          - --config-dir=/etc/config
          - --cpu=100m
          - --extra-cpu=7m
          - --memory=300Mi
          - --extra-memory=3Mi
          - --threshold=10
          - --deployment=metrics-server
          - --container=metrics-server
          env:
          - name: MY_POD_NAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.name
          - name: MY_POD_NAMESPACE
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
          - name: ADDON_NAME
            value: metrics
          image: [image_mirror]/k8s.gcr.io/addon-resizer:1.8.11
          imagePullPolicy: IfNotPresent
          name: pod-nanny
          resources:
            limits:
              cpu: 100m
              memory: 20Mi
            requests:
              cpu: 100m
              memory: 20Mi
          securityContext:
            runAsGroup: 65534
            runAsUser: 65534
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /etc/config
            name: nanny-config-volume
        - args:
          - --secure-port=8443
          - --cert-dir=/tmp
          - --kubelet-insecure-tls=true
          - --kubelet-preferred-address-types=[InternalDNS,InternalIP,ExternalDNS,ExternalIP]
          - --profiling=true
          command:
          - metrics-server
          image: [image_mirror]/bitnami/metrics-server:0.5.0
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /livez
              port: https
              scheme: HTTPS
            initialDelaySeconds: 40
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: metrics-server
          ports:
          - containerPort: 8443
            hostPort: 8443
            name: https
            protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /readyz
              port: https
              scheme: HTTPS
            initialDelaySeconds: 40
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              cpu: 142m
              memory: 318Mi
            requests:
              cpu: 142m
              memory: 318Mi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
              - ALL
            readOnlyRootFilesystem: true
            runAsGroup: 10001
            runAsNonRoot: true
            runAsUser: 10001
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /etc/config
            name: nanny-config-volume
          - mountPath: /tmp
            name: tmpdir
        dnsPolicy: ClusterFirst
        hostNetwork: true
        imagePullSecrets:
        - name: regcred-pseudo
        priorityClassName: highest-platform
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: metrics-server
        serviceAccountName: metrics-server
        terminationGracePeriodSeconds: 30
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
          operator: Exists
        volumes:
        - configMap:
            defaultMode: 420
            name: nanny-config-metrics-server
          name: nanny-config-volume
        - emptyDir: {}
          name: tmpdir
  status:
    availableReplicas: 2
    conditions:
    - lastTransitionTime: "2021-07-27T20:09:32Z"
      lastUpdateTime: "2021-07-27T20:09:32Z"
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: "True"
      type: Available
    - lastTransitionTime: "2021-07-13T18:41:54Z"
      lastUpdateTime: "2021-07-27T20:10:01Z"
      message: ReplicaSet "metrics-server-[redacted]" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: "True"
      type: Progressing
    observedGeneration: 26
    readyReplicas: 2
    replicas: 2
    updatedReplicas: 2
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

  • Kubelet config:
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: [redacted]
    server: https://[redacted]:6443
  name: default-cluster
contexts:
- context:
    cluster: default-cluster
    namespace: default
    user: default-auth
  name: default-context
current-context: default-context
kind: Config
preferences: {}
users:
- name: default-auth
  user:
    client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
    client-key: /var/lib/kubelet/pki/kubelet-client-current.pem
  • Status of Metrics API:
kubectl describe apiservice v1beta1.metrics.k8s.io
Name:         v1beta1.metrics.k8s.io
Namespace:
Labels:       app.kubernetes.io/instance=metrics-server
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=metrics-server
              helm.sh/chart=metrics-server-5.8.14
Annotations:  meta.helm.sh/release-name: metrics-server
              meta.helm.sh/release-namespace: metrics-server
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2021-07-13T18:57:27Z
  Resource Version:    11462943
  UID:                 86dd3191-802e-4695-996a-017984296eff
Spec:
  Group:                     metrics.k8s.io
  Group Priority Minimum:    100
  Insecure Skip TLS Verify:  true
  Service:
    Name:            metrics-server
    Namespace:       metrics-server
    Port:            443
  Version:           v1beta1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2021-07-25T06:47:48Z
    Message:               all checks passed
    Reason:                Passed
    Status:                True
    Type:                  Available
Events:                    <none>
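
To see which NodeMetrics objects come back on a single request, a raw query against the Metrics API works as a quick check (a sketch; jq is an assumption, not something the cluster ships with):

# Ask the aggregated Metrics API directly and print the node names
# present in this one response; run repeatedly to observe the flapping.
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" \
  | jq -r '.items[].metadata.name'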

/kind bug
