Example prometheus rules for kube api #17608
@@ -0,0 +1,31 @@
# Prometheus and Alertmanager Rules

## Loading Rules

With this deployment method, all files in the rules directory are mounted into the pod as a configmap.

1. Create a configmap from the rules directory:

        oc create configmap base-rules --from-file=rules/
1. Attach the configmap to the prometheus statefulset as a volume:

        oc volume statefulset/prometheus --add \
          --configmap-name=base-rules --name=base-rules -t configmap \
          --mount-path=/etc/prometheus/rules
1. Delete the pod so it restarts with the new configuration (see the notes after this list):

        oc delete $(oc get pods -o name --selector='app=prometheus')
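The mount path matters because the deployed Prometheus configuration is assumed to glob rule files from that directory. A sketch of the corresponding `prometheus.yml` stanza (an assumption, not shown in this PR):

    rule_files:
      - /etc/prometheus/rules/*.rules

To confirm the rules were mounted after the restart, you can list the directory inside the pod. This sketch assumes the statefulset's first pod is named `prometheus-0` and the container is named `prometheus`; adjust both for your deployment:

    # List the rule files mounted from the configmap
    oc exec prometheus-0 -c prometheus -- ls /etc/prometheus/rules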
## Updating Rules

1. Edit or add a local rules file

> Review comment: oh I see you are going for a local update
1. Validate the rules directory (`promtool` may be downloaded from the [Prometheus web site](https://prometheus.io/download/)):

        promtool check rules rules/*.rules
1. Update the configmap:

        oc create configmap base-rules --from-file=rules/ --dry-run -o yaml | oc apply -f -
1. Delete the pod so it restarts with the new configuration:

        oc delete $(oc get pods -o name --selector='app=prometheus')
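`oc create configmap --dry-run -o yaml` only renders the manifest; piping it to `oc apply` is what updates the existing configmap in place. As a quick sanity check (a sketch, not part of the committed workflow), you can confirm the edited rule text landed before deleting the pod:

    # The updated rule bodies should appear in the configmap data
    oc get configmap base-rules -o yaml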
@@ -0,0 +1,73 @@
groups:
- name: kubernetes-rules
  rules:
  - alert: DockerLatencyHigh
    expr: max(kubelet_docker_operations_latency_microseconds{quantile="0.9"}) / 1e+06 > 1

> Review comment: would be nice to have this alert per node (and use $instance in the description)
> Reply: Good catch. Thanks.
    for: 5m
    labels:
      severity: warning

> Review comment: Is there a reason for the severity being a label and not an annotation?
> Reply: Having it as a label allows the alert to be routed to different receivers in Alertmanager.
> Reply: That's my understanding (upstream docs). We could also add it as an annotation if we determine it's useful in some way.
> Review comment: Hmm, makes sense.
    annotations:
      summary: Docker latency is high
      description: "Docker latency is {{ $value }} seconds for 90% of kubelet operations"
      alertType: latency
      miqTarget: ContainerNode
      component: container runtime
      selfHealing: false
      url:
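The per-node variant suggested in the review thread above might look like the following sketch (an assumption based on the comment, not the committed rule): aggregating with `by (instance)` keeps one series per node, so the alert fires per node and `$labels.instance` can be used in the description.

    - alert: DockerLatencyHigh
      # max by (instance) keeps a separate series per node instead of one cluster-wide max
      expr: max by (instance) (kubelet_docker_operations_latency_microseconds{quantile="0.9"}) / 1e+06 > 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Docker latency is high
        description: "Docker latency on {{ $labels.instance }} is {{ $value }} seconds for 90% of kubelet operations"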
  - alert: KubernetesAPIDown
    expr: up{job="kubernetes-apiservers"} == 0
    for: 10m
    labels:
      severity: error
    annotations:
      summary: Kubernetes API server unreachable
      description: "Kubernetes API server unreachable on {{ $labels.cluster }} instance {{ $labels.instance }}"
      alertType: availability
      miqTarget: ContainerNode
      component: kubernetes
      selfHealing: false
      url:
  - alert: KubernetesAPIAbsent

> Review comment: I guess this alert will also be firing in case Prometheus isn't scraping metrics.
> Reply: Right, this is to ensure we catch silent failures.
    expr: absent(up{job="kubernetes-apiservers"})
    for: 5m
    labels:
      severity: error
    annotations:
      summary: Kubernetes API server absent
      description: Kubernetes API server absent
      alertType: availability
      miqTarget: ContainerNode
      component: kubernetes
      selfHealing: false
      url:
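As the thread above notes, the two alerts are complementary. Expressions for experimenting in the Prometheus console (the same queries the alerts use):

    # Fires per target: the series exists and reports the target as down
    up{job="kubernetes-apiservers"} == 0

    # Fires when no such series exists at all, i.e. nothing is being scraped for the job
    absent(up{job="kubernetes-apiservers"})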
  - alert: KubernetesAPIErrorsHigh
    expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m]) * 100 > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes API server errors high
      description: "Kubernetes API server errors (response code 5xx) are {{ $value }}% of total requests"
      alertType: errors
      miqTarget: ContainerNode
      component: kubernetes
      selfHealing: false
      url:
  - alert: KubernetesAPILatencyHigh
    expr: apiserver_request_latencies_summary{quantile="0.9",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"} / 1e+06 > .5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes API server latency high
      description: "Kubernetes API server request latency is {{ $value }} seconds for 90% of requests. NOTE: long-standing requests have been removed from alert query."
      alertType: latency
      miqTarget: ContainerNode
      component: kubernetes
      selfHealing: false
      url:
> Review comment: Do we want to say something about reload through sending HUP to the prometheus process?
> Reply: This is a developer-focused workflow, so I thought local files -> restart was the most straightforward path.
> Review comment: Makes sense.
> Review comment: Maybe it's still worth mentioning the HUP for operators out there. Up to you.
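For operators, the reload-over-restart path mentioned in the thread might look like the following sketch. It assumes the same pod and container names as earlier (`prometheus-0`, container `prometheus`), that the image ships a `kill` binary, and that Prometheus runs as PID 1 in the container. Since configmap volumes are refreshed in running pods after a short kubelet sync delay, a reload alone can pick up updated rule files:

    # Ask the running Prometheus process to re-read its config and rule files
    oc exec prometheus-0 -c prometheus -- kill -HUP 1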