Description
What happened:
When I run kubectl top nodes or kubectl get nodemetrics on a k8s cluster with metrics-server, I almost always have at least one control-plane node unaccounted for. Which control-plane node(s) are missing changes from run to run, roughly every minute. All three control-plane nodes are up and healthy, and the worker nodes show up every time.
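For what it's worth, the flicker is easy to watch with something like the following (a rough bash sketch, not an exact repro script; the 60-second sleep simply mirrors the default metric resolution and is an assumption):

# Record which nodes report metrics on each run and diff successive runs.
prev=""
while true; do
  cur="$(kubectl get nodemetrics -o name | sort)"
  if [ "$cur" != "$prev" ]; then
    date
    diff <(printf '%s\n' "$prev") <(printf '%s\n' "$cur") || true
  fi
  prev="$cur"
  sleep 60
done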
What you expected to happen:
I expected to see all three worker nodes and all three control-plane nodes.
Anything else we need to know?:
- I have looked through the metrics-server logs and found that the requests to the nodes, control plane and worker alike, received 200 responses; moreover, manually making those requests returned the metrics I expected to see (see the sketch after this list).
- While the control-plane nodes flicker in and out of existence in the commands above, the actual number and type of pods remains consistent, and the metrics for the pods look completely fine.
- The problem persists whether I am running one or two replicas.
- We are running metrics-server on the control-plane nodes because we could not otherwise get metrics for pods running on the control plane.
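Roughly the kind of manual check referred to in the first bullet (a sketch rather than the exact commands; NODE is a placeholder, and it assumes the API-server proxy to the kubelet is reachable and permitted by RBAC):

# Query a kubelet's resource metrics through the API-server proxy
# (this is the endpoint metrics-server scrapes); NODE is a placeholder.
NODE=control-plane-0
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/metrics/resource" | head

# Compare with what the Metrics API currently reports for that node.
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes/${NODE}"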
Environment:
- Kubernetes distribution (GKE, EKS, Kubeadm, the hard way, etc.): kubeadm on top of OpenStack using ClusterAPI
- Container Network Setup (flannel, calico, etc.): calico
- Kubernetes version (use kubectl version): 1.21 (client), 1.20 (server)
- Metrics Server manifest:
apiVersion: v1
items:
- apiVersion: apps/v1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "26"
      meta.helm.sh/release-name: metrics-server
      meta.helm.sh/release-namespace: metrics-server
    creationTimestamp: "2021-07-13T18:41:53Z"
    generation: 26
    labels:
      app.kubernetes.io/instance: metrics-server
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: metrics-server
      helm.sh/chart: metrics-server-5.8.14
    name: metrics-server
    namespace: metrics-server
    resourceVersion: "11957101"
    uid: '[redacted]'
  spec:
    progressDeadlineSeconds: 600
    replicas: 2
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        app.kubernetes.io/instance: metrics-server
        app.kubernetes.io/name: metrics-server
    strategy:
      rollingUpdate:
        maxSurge: 25%
        maxUnavailable: 25%
      type: RollingUpdate
    template:
      metadata:
        annotations:
          ad.datadoghq.com/nginx-ingress-controller.check_names: '["kube_metrics_server"]'
          ad.datadoghq.com/nginx-ingress-controller.init_configs: '[{}]'
          ad.datadoghq.com/nginx-ingress-controller.instances: |
            [
              {
                "prometheus_url": "https://%%host%%:443/metrics"
              }
            ]
          enable.version-checker.io/metrics-server: "true"
          override-url.version-checker.io/metrics-server: bitnami/metrics-server
        creationTimestamp: null
        labels:
          app.kubernetes.io/instance: metrics-server
          app.kubernetes.io/managed-by: Helm
          app.kubernetes.io/name: metrics-server
          helm.sh/chart: metrics-server-5.8.14
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: node-role.kubernetes.io/master
                  operator: Exists
        containers:
        - command:
          - /pod_nanny
          - --config-dir=/etc/config
          - --cpu=100m
          - --extra-cpu=7m
          - --memory=300Mi
          - --extra-memory=3Mi
          - --threshold=10
          - --deployment=metrics-server
          - --container=metrics-server
          env:
          - name: MY_POD_NAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.name
          - name: MY_POD_NAMESPACE
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
          - name: ADDON_NAME
            value: metrics
          image: [image_mirror]/k8s.gcr.io/addon-resizer:1.8.11
          imagePullPolicy: IfNotPresent
          name: pod-nanny
          resources:
            limits:
              cpu: 100m
              memory: 20Mi
            requests:
              cpu: 100m
              memory: 20Mi
          securityContext:
            runAsGroup: 65534
            runAsUser: 65534
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /etc/config
            name: nanny-config-volume
        - args:
          - --secure-port=8443
          - --cert-dir=/tmp
          - --kubelet-insecure-tls=true
          - --kubelet-preferred-address-types=[InternalDNS,InternalIP,ExternalDNS,ExternalIP]
          - --profiling=true
          command:
          - metrics-server
          image: [image_mirror]/bitnami/metrics-server:0.5.0
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /livez
              port: https
              scheme: HTTPS
            initialDelaySeconds: 40
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: metrics-server
          ports:
          - containerPort: 8443
            hostPort: 8443
            name: https
            protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /readyz
              port: https
              scheme: HTTPS
            initialDelaySeconds: 40
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              cpu: 142m
              memory: 318Mi
            requests:
              cpu: 142m
              memory: 318Mi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
              - ALL
            readOnlyRootFilesystem: true
            runAsGroup: 10001
            runAsNonRoot: true
            runAsUser: 10001
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /etc/config
            name: nanny-config-volume
          - mountPath: /tmp
            name: tmpdir
        dnsPolicy: ClusterFirst
        hostNetwork: true
        imagePullSecrets:
        - name: regcred-pseudo
        priorityClassName: highest-platform
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: metrics-server
        serviceAccountName: metrics-server
        terminationGracePeriodSeconds: 30
        tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
          operator: Exists
        volumes:
        - configMap:
            defaultMode: 420
            name: nanny-config-metrics-server
          name: nanny-config-volume
        - emptyDir: {}
          name: tmpdir
  status:
    availableReplicas: 2
    conditions:
    - lastTransitionTime: "2021-07-27T20:09:32Z"
      lastUpdateTime: "2021-07-27T20:09:32Z"
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: "True"
      type: Available
    - lastTransitionTime: "2021-07-13T18:41:54Z"
      lastUpdateTime: "2021-07-27T20:10:01Z"
      message: ReplicaSet "metrics-server-[redacted]" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: "True"
      type: Progressing
    observedGeneration: 26
    readyReplicas: 2
    replicas: 2
    updatedReplicas: 2
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
- Kubelet config:
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: [redacted]
    server: https://[redacted]:6443
  name: default-cluster
contexts:
- context:
    cluster: default-cluster
    namespace: default
    user: default-auth
  name: default-context
current-context: default-context
kind: Config
preferences: {}
users:
- name: default-auth
  user:
    client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
    client-key: /var/lib/kubelet/pki/kubelet-client-current.pem
- Status of Metrics API:
kubectl describe apiservice v1beta1.metrics.k8s.io
Name:         v1beta1.metrics.k8s.io
Namespace:
Labels:       app.kubernetes.io/instance=metrics-server
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=metrics-server
              helm.sh/chart=metrics-server-5.8.14
Annotations:  meta.helm.sh/release-name: metrics-server
              meta.helm.sh/release-namespace: metrics-server
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2021-07-13T18:57:27Z
  Resource Version:    11462943
  UID:                 86dd3191-802e-4695-996a-017984296eff
Spec:
  Group:                     metrics.k8s.io
  Group Priority Minimum:    100
  Insecure Skip TLS Verify:  true
  Service:
    Name:       metrics-server
    Namespace:  metrics-server
    Port:       443
  Version:           v1beta1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2021-07-25T06:47:48Z
    Message:               all checks passed
    Reason:                Passed
    Status:                True
    Type:                  Available
Events:  <none>
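For anyone trying to reproduce or triage this, a hedged diagnostic sketch: the namespace and Service name come from the manifest above, while the use of jq is an assumption (a plain -o json dump works too). It lists the endpoints behind the metrics-server Service and prints each node's sample timestamp and window straight from the Metrics API, to see whether the missing control-plane node simply has a stale or absent sample in whichever replica answered.

# List the endpoints behind the metrics-server Service; with replicas: 2 and
# hostNetwork: true these should be two control-plane host IPs on port 8443.
kubectl -n metrics-server get endpoints metrics-server -o wide

# Print each node's sample timestamp and window from the Metrics API.
# Requires jq; substitute a plain JSON dump if jq is not available.
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" \
  | jq -r '.items[] | [.metadata.name, .timestamp, .window] | @tsv'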
/kind bug