Example prometheus rules for kube api #17608

Conversation
cc @smarterclayton

LGTM
examples/prometheus/rules/README.md (outdated)

> ## Updating Rules
>
> NOTE: We cannot yet "update" a configmap from a local file (see [comment](https://github.com/kubernetes/kubernetes/issues/30558#issuecomment-326643503)). For now we delete and recreate. Why not use the Pometheus API `/-/reload/` endpoint? It can take over 60 seconds for changes to a configmap to appear in a pod (see [detailed explaination](https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/#mounted-configmaps-are-updated-automatically)). It is more reliable to simply delete the pod so it creates a new one with the new configmap. This has the cost of ~10s downtime but ensures you've got the updated config.
s/Pometheus/Prometheus/
examples/prometheus/rules/os.rules (outdated)

```yaml
- record: instance:fd_utilization
  expr: process_open_fds / process_max_fds

- alert: FdExhaustionIn4Hrs
```
predict_linear() alerts won't fire when the process has exhausted (or is about to exhaust) all file descriptors. You would need an alert like "instance:fd_utilization >= 0.99" for those cases.
> predict_linear() alerts won't fire when the process has exhausted (or is about to exhaust) all file descriptors.

I'm not sure I understand. Admittedly I'm lacking the benefit of much real-world data since we don't have node exporter deployed broadly yet. I could just pull this os.rules file out for now as we get more hands-on experience.
Ok, that makes sense. A picture is worth a thousand words! Thanks for the description.
@simonpasquier would you agree predict_linear() is useful for canary-type warning ("You're headed for danger") but critical alerting on resource saturation should use the bare utilization value?
@aweiteka that's correct. predict_linear() works fine when the metric's trend is more or less steady. Since the trend is sometimes more erratic, you still need a safety belt: a fixed threshold close to the maximum saturation level.
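For what it's worth, here is a minimal sketch of how the two could be paired, reusing the `instance:fd_utilization` recording rule quoted above; the second alert name, thresholds, and durations are placeholders I made up, not values from the PR:

```yaml
groups:
- name: os.rules
  rules:
  # Recording rule from the hunk above: fraction of file descriptors in use.
  - record: instance:fd_utilization
    expr: process_open_fds / process_max_fds

  # Canary-type warning: the linear trend predicts FD exhaustion within 4 hours.
  - alert: FdExhaustionIn4Hrs
    expr: predict_linear(instance:fd_utilization[1h], 4 * 3600) > 1
    for: 10m
    labels:
      severity: warning
    annotations:
      description: "{{ $labels.instance }} is predicted to exhaust its file descriptors within 4 hours"

  # Safety belt with a fixed threshold: fires even when the trend is erratic
  # or the process has already exhausted its file descriptors.
  - alert: FdNearExhaustion
    expr: instance:fd_utilization >= 0.99
    for: 5m
    labels:
      severity: critical
    annotations:
      description: "{{ $labels.instance }} has used 99% or more of its file descriptors"
```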
I removed os.rules for now until we better determine what we want to be alerting on. An example rule set for etcd will be good to have with the documentation.
examples/prometheus/prometheus.yaml (outdated)

```
@@ -258,6 +258,8 @@ objects:
    prometheus.yml: |
      rule_files:
        - 'prometheus.rules'
        - '*.rules'
```
#17553 needs to merge first, but minor.
examples/prometheus/rules/README.md (outdated)

> ## Updating Rules
>
> NOTE: We cannot yet "update" a configmap from a local file (see [comment](https://github.com/kubernetes/kubernetes/issues/30558#issuecomment-326643503)). For now we delete and recreate. Why not use the Prometheus API `/-/reload/` endpoint? It can take over 60 seconds for changes to a configmap to appear in a pod (see [detailed explaination](https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/#mounted-configmaps-are-updated-automatically)). It is more reliable to simply delete the pod so it creates a new one with the new configmap. This has the cost of ~10s downtime but ensures you've got the updated config.
Might be better to structure this separately.
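As a rough sketch of the delete-and-recreate workflow the note describes (the configmap name `base-rules` and the local file name are assumptions taken from this discussion, not necessarily what the example ships):

```sh
# No in-place configmap update yet, so delete and recreate from the local file.
oc delete configmap base-rules
oc create configmap base-rules --from-file=kube.rules

# Restart Prometheus by deleting the pod so it comes back with the new
# configmap mounted, instead of waiting ~60s+ for the mount to refresh.
oc delete $(oc get pods -o name --selector='app=prometheus')
```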
Per discussion with @smarterclayton, the etcd metrics don't have values by default. I'll be removing the etcd rules file and adding a basic kube service rules file.
Here's a pass at a few rules for the kube service to monitor "golden rules" errors and latency. The file also serves as an example of using annotations for grouping and automation.
```yaml
    selfHealing: false
    url:

- alert: KubernetesAPIDown
```
Usually you create an additional rule that fires when Prometheus hasn't discovered any API server (e.g. the service discovery is broken).

Comment outdated as it has been added in the latest commit...
examples/prometheus/rules/kube.rules (outdated)

```yaml
      severity: warning
    annotations:
      summary: Kubernetes API server unreachable
      description: "Kubernetes API server unreachable"
```
You could add the instance label in the description:

```yaml
description: "Kubernetes API server unreachable on {{ $labels.instance }}"
```
/retest

/retest

You need to run
> ## Updating Rules
>
> 1. Edit or add a local rules file
add `oc edit configmap base-rules`?
oh I see you are going for a local update
>        --mount-path=/etc/prometheus/rules
>
> 1. Delete pod to restart with new configuration
>
>        oc delete $(oc get pods -o name --selector='app=prometheus')
Do we want to say something about reloading by sending HUP to the Prometheus process?
This is a developer-focused workflow, so I thought local files -> restart was the most straightforward path.
Makes sense
Maybe it's still worth mentioning the HUP option for operators out there. Up to you.
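For operators who want to avoid the ~10s restart, a hedged sketch of the HUP route (it assumes the container is named `prometheus`, runs the server as PID 1, and ships a `kill` binary; the mounted configmap still needs to refresh before the reload picks up new rules):

```sh
# Signal Prometheus to reload its configuration and rule files in place.
POD=$(oc get pods -o name --selector='app=prometheus' | cut -d/ -f2)
oc exec "$POD" -c prometheus -- kill -HUP 1
```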
```yaml
    expr: max(kubelet_docker_operations_latency_microseconds{quantile="0.9"}) / 1e+06 > 1
    for: 5m
    labels:
      severity: warning
```
Is there a reason for the severity being a label and not an annotation?
It's part of the alert definition metadata and not the metric that generated it.
Having it as a label allows the alert to be routed to different receivers in AlertManager.
> Having it as a label allows the alert to be routed to different receivers in AlertManager.

That's my understanding (upstream docs). We could also add it as an annotation if we determine it's useful in some way.
Hmm, makes sense.

It also means that if you did not define a severity label on your alert and the generating expression has that label, the alert will pick it up from there (I'm not sure what the precedence is if both exist).
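To illustrate the routing point, a minimal Alertmanager config sketch; the receiver names and route layout are made up for the example:

```yaml
route:
  receiver: team-email            # default receiver for everything else
  routes:
  # Only labels are available for matching here; annotations are not.
  - match:
      severity: critical
    receiver: oncall-pager
receivers:
- name: team-email
- name: oncall-pager
```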
```yaml
rules:

- alert: DockerLatencyHigh
  expr: max(kubelet_docker_operations_latency_microseconds{quantile="0.9"}) / 1e+06 > 1
```
Would be nice to have this alert per node:

```
max(kubelet_docker_operations_latency_microseconds{quantile="0.9"}) by (instance) / 1e+06
```

(and use `{{ $labels.instance }}` in the description)
Good catch. Thanks.
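A sketch of the rule after that change, reusing the threshold and metadata from the quoted hunk (the summary/description wording is mine):

```yaml
- alert: DockerLatencyHigh
  # Aggregate per node so each slow node raises its own alert.
  expr: max(kubelet_docker_operations_latency_microseconds{quantile="0.9"}) by (instance) / 1e+06 > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Docker latency is high
    description: "90th percentile docker operation latency is above 1s on {{ $labels.instance }}"
```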
```yaml
    selfHealing: false
    url:

- alert: KubernetesAPIAbsent
```
I guess this alert will also fire in case Prometheus isn't scraping metrics.
Right, this is to ensure we catch silent failures.
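A minimal sketch of what such an absent-metrics rule could look like, assuming the API server scrape job is labelled `kubernetes-apiservers` (the job name in the actual example config may differ):

```yaml
- alert: KubernetesAPIAbsent
  # Fires when Prometheus has no API server series at all, e.g. broken service
  # discovery or a silently failing scrape, which an up == 0 alert cannot catch.
  expr: absent(up{job="kubernetes-apiservers"}) == 1
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: Kubernetes API server metrics absent
    description: "No up series found for the Kubernetes API server scrape job"
```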
/retest

@aweiteka looks good, only thing missing is updated severities.

@aweiteka do you plan to update the severities?

We're tweaking some of the queries to not be so chatty.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: aweiteka. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS files.

@aweiteka: The following tests failed.

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting. If this issue is safe to close now please do so. /lifecycle stale

Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting. If this issue is safe to close now please do so. /lifecycle rotten

Rotten issues close after 30d of inactivity. Reopen the issue by commenting. /close