
Example prometheus rules for kube api #17608


Closed
wants to merge 2 commits into from

Conversation

aweiteka
Contributor

@aweiteka aweiteka commented Dec 5, 2017

Signed-off-by: Aaron Weitekamp [email protected]

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 5, 2017
@aweiteka
Contributor Author

aweiteka commented Dec 5, 2017

cc @smarterclayton
related to #17553

@imcsk8
Contributor

imcsk8 commented Dec 5, 2017

LGTM


## Updating Rules

NOTE: We cannot yet "update" a configmap from a local file (see [comment](https://github.com/kubernetes/kubernetes/issues/30558#issuecomment-326643503)). For now we delete and recreate. Why not use the Pometheus API `/-/reload/` endpoint? It can take over 60 seconds for changes to a configmap to appear in a pod (see [detailed explanation](https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/#mounted-configmaps-are-updated-automatically)). It is more reliable to simply delete the pod so it creates a new one with the new configmap. This has the cost of ~10s downtime but ensures you've got the updated config.
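For illustration, a minimal sketch of that delete/recreate workflow, assuming the rules configmap is named `base-rules` (a name that appears later in this review), the local rules live in a `rules/` directory, and the pods carry the `app=prometheus` label used in the PR's own `oc delete` example:

```sh
# Delete and recreate the configmap from a local rules directory
# (configmap name and rules/ path are assumptions, not fixed by this PR).
oc delete configmap base-rules
oc create configmap base-rules --from-file=rules/

# Delete the pod so it comes back up with the new configmap mounted
oc delete $(oc get pods -o name --selector='app=prometheus')
```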
Contributor

s/Pometheus/Prometheus/

- record: instance:fd_utilization
expr: process_open_fds / process_max_fds

- alert: FdExhaustionIn4Hrs
Contributor

predict_linear() alerts won't fire when the process has exhausted (or is about to exhaust) all file descriptors. You would need an alert like "instance:fd_utilization >= 0.99" for those cases.
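A minimal sketch of such a safety-belt rule, reusing the `instance:fd_utilization` record above (the alert name, `for` duration, and severity are illustrative assumptions):

```yaml
- alert: FdExhaustionClose
  expr: instance:fd_utilization >= 0.99
  for: 5m
  labels:
    severity: critical
  annotations:
    description: "{{ $labels.instance }} has (nearly) exhausted its available file descriptors"
```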

Contributor Author

predict_linear() alerts won't fire when the process has exhausted (or is about to exhaust) all file descriptors.

I'm not sure I understand. Admittedly I'm lacking the benefit of much real-world data since we don't have node exporter deployed broadly yet. I could just pull this os.yaml file out for now until we get more hands-on experience.

Contributor

predict_linear() may fail to predict the correct value for some edge cases. For instance the predicted slope is decreasing but the last point is already at the threshold value. In this situation the alert won't trigger because predict_linear() will return a value below the threshold.

(image: predict_linear edge case)

Contributor Author

Ok, that makes sense. A picture is worth a thousand words! Thanks for the description.

Contributor Author

@simonpasquier would you agree predict_linear() is useful for canary-type warning ("You're headed for danger") but critical alerting on resource saturation should use the bare utilization value?

Contributor

@aweiteka that's correct. predict_linear() works fine when the metric's trend is more or less steady. Since it's sometimes more erratic, you still need a safety belt: a fixed threshold close to the max saturation level.

@aweiteka
Contributor Author

I removed os.rules for now while we better determine what we want to alert on. An example rule set for etcd will be good to have with the documentation.

@@ -258,6 +258,8 @@ objects:
prometheus.yml: |
rule_files:
- 'prometheus.rules'
- '*.rules'
Contributor

#17553 needs to merge first, but minor.


## Updating Rules

NOTE: We cannot yet "update" a configmap from a local file (see [comment](https://github.com/kubernetes/kubernetes/issues/30558#issuecomment-326643503)). For now we delete and recreate. Why not use the Prometheus API `/-/reload/` endpoint? It can take over 60 seconds for changes to a configmap to appear in a pod (see [detailed explanation](https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/#mounted-configmaps-are-updated-automatically)). It is more reliable to simply delete the pod so it creates a new one with the new configmap. This has the cost of ~10s downtime but ensures you've got the updated config.
Contributor

Might be better to structure this separately.

@aweiteka
Contributor Author

aweiteka commented Jan 2, 2018

Per discussion with @smarterclayton, the etcd metrics don't have values by default. I'll be removing the etcd rules file and adding a basic kube service rules file.

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 3, 2018
@aweiteka
Contributor Author

aweiteka commented Jan 3, 2018

Here's a pass at a few rules for the kube service to monitor "golden signal" errors and latency. The file also serves as an example of using annotations for grouping and automation.
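As a rough illustration of the shape of those rules, a sketch of an API error-rate alert carrying the grouping/automation annotations; the alert name, expression, and thresholds are placeholders, not the PR's actual queries:

```yaml
- alert: KubernetesAPIErrorsHigh
  expr: sum(rate(apiserver_request_count{code=~"5.."}[5m])) by (instance) / sum(rate(apiserver_request_count[5m])) by (instance) > 0.05
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Kubernetes API server error rate is high
    description: "5xx ratio above 5% on {{ $labels.instance }}"
    selfHealing: false
    url:
```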

selfHealing: false
url:

- alert: KubernetesAPIDown
Contributor

@simonpasquier simonpasquier Jan 4, 2018

Usually you create an additional rule that fires when Prometheus hasn't discovered any API server (e.g. the service discovery is broken).

Comment outdated as it has been added in the latest commit...
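For reference, such a rule is typically written with absent(); a minimal sketch, assuming the API servers are scraped under a job named `kubernetes-apiservers` (the job name, duration, and severity are assumptions, not taken from this PR):

```yaml
- alert: KubernetesAPIAbsent
  expr: absent(up{job="kubernetes-apiservers"} == 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: No Kubernetes API server targets discovered
    description: "Prometheus has no API server targets; service discovery may be broken"
```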

severity: warning
annotations:
summary: Kubernetes API server unreachable
description: "Kubernetes API server unreachable"
Contributor

you could add the instance's label in the description:

  description: "Kubernetes API server unreachable on {{ $labels.instance }}"

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 4, 2018
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 8, 2018
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 8, 2018
@aweiteka
Contributor Author

aweiteka commented Jan 8, 2018

/retest

@aweiteka aweiteka changed the title Example prometheus rules for etcd Example prometheus rules for kube api Jan 9, 2018
@aweiteka
Contributor Author

/retest

@smarterclayton
Contributor

You need to run hack/update-generated-bindata.sh and check that in.


## Updating Rules

1. Edit or add a local rules file


add `oc edit configmap base-rules`?


oh I see you are going for a local update

--mount-path=/etc/prometheus/rules
1. Delete pod to restart with new configuration

oc delete $(oc get pods -o name --selector='app=prometheus')


do we want to say something about reloading by sending HUP to the prometheus process?

Contributor Author

This is a developer-focused workflow, so I thought local files -> restart was the most straightforward path.


Makes sense


Maybe it's still worth mentioning the HUP for operators out there. Up to you.
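For operators who prefer an in-place reload, a sketch of the SIGHUP approach; the container name, the PID 1 assumption, and the presence of a kill binary in the image all depend on the deployment and are not taken from this PR:

```sh
# Ask Prometheus to re-read its configuration without restarting the pod
# (assumes Prometheus runs as PID 1 in a container named "prometheus").
oc exec <prometheus-pod> -c prometheus -- kill -HUP 1
```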

expr: max(kubelet_docker_operations_latency_microseconds{quantile="0.9"}) / 1e+06 > 1
for: 5m
labels:
severity: warning


Is there a reason for the severity being a label and not an annotation?
It's part of the alert definition metadata and not the metric that generated it.

Contributor

Having it as a label allows the alert to be routed to different receivers in AlertManager.
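A minimal Alertmanager routing sketch showing how the severity label drives routing (the receiver names here are made up):

```yaml
route:
  receiver: default
  routes:
  - match:
      severity: critical
    receiver: pagerduty
  - match:
      severity: warning
    receiver: email

receivers:
- name: default
- name: pagerduty
- name: email
```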

Contributor Author

Having it as a label allows the alert to be routed to different receivers in AlertManager.

That's my understanding (upstream docs). We could also add it as an annotation if we determine it's useful in some way.


Hmm, makes sense.
It also means that if you did not define a severity on your alert and the generating expression has that label, it will be set from that (I'm not sure what the precedence is if both exist).

rules:

- alert: DockerLatencyHigh
expr: max(kubelet_docker_operations_latency_microseconds{quantile="0.9"}) / 1e+06 > 1


would be nice to have this alert per node:
max(kubelet_docker_operations_latency_microseconds{quantile="0.9"}) by (instance) / 1e+06

(and use $instance in the description)
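Putting that together, a sketch of the per-node variant; the `for` and severity values mirror the existing rule, while the description text is illustrative:

```yaml
- alert: DockerLatencyHigh
  expr: max(kubelet_docker_operations_latency_microseconds{quantile="0.9"}) by (instance) / 1e+06 > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    description: "Docker operations p90 latency above 1s on {{ $labels.instance }}"
```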

Contributor Author

Good catch. Thanks.

selfHealing: false
url:

- alert: KubernetesAPIAbsent


I guess this alert will also fire if Prometheus isn't scraping metrics.

Contributor Author

Right, this is to ensure we catch silent failures.

@aweiteka
Contributor Author

/retest

@moolitayer

@aweiteka looks good, only thing missing is updated severities

@moolitayer

@aweiteka do you plan to update the severities?

@aweiteka
Contributor Author

@aweiteka do you plan to update the severities?

We're tweaking some of the queries to not be so chatty.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aweiteka
We suggest the following additional approver: smarterclayton

Assign the PR to them by writing /assign @smarterclayton in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@openshift-ci-robot

openshift-ci-robot commented Jan 30, 2018

@aweiteka: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/openshift-jenkins/verify | 1e77ad0 | link | /test verify |
| ci/openshift-jenkins/gcp | 1e77ad0 | link | /test gcp |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 30, 2018
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 30, 2018
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close
