
Example prometheus rules for kube api #17608


Closed
wants to merge 2 commits into from

Conversation

aweiteka
Contributor

@aweiteka aweiteka commented Dec 5, 2017

Signed-off-by: Aaron Weitekamp [email protected]

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 5, 2017
@aweiteka
Contributor Author

aweiteka commented Dec 5, 2017

cc @smarterclayton
related to #17553

@imcsk8
Contributor

imcsk8 commented Dec 5, 2017

LGTM


## Updating Rules

NOTE: We cannot yet "update" a configmap from a local file (see [comment](https://github.com/kubernetes/kubernetes/issues/30558#issuecomment-326643503)). For now we delete and recreate. Why not use the Pometheus API `/-/reload/` endpoint? It can take over 60 seconds for changes to a configmap to appear in a pod (see [detailed explanation](https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/#mounted-configmaps-are-updated-automatically)). It is more reliable to simply delete the pod so it creates a new one with the new configmap. This has the cost of ~10s downtime but ensures you've got the updated config.
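For illustration, a minimal sketch of that delete/recreate workflow, assuming the rules configmap is named `base-rules` (a name that appears later in this review), the local rules live in a `rules/` directory, and the pods carry the `app=prometheus` label used in the PR's own `oc delete` example:

```sh
# Delete and recreate the configmap from a local rules directory
# (configmap name and rules/ path are assumptions, not fixed by this PR).
oc delete configmap base-rules
oc create configmap base-rules --from-file=rules/

# Delete the pod so it comes back up with the new configmap mounted
oc delete $(oc get pods -o name --selector='app=prometheus')
```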
Contributor

s/Pometheus/Prometheus/

- record: instance:fd_utilization
expr: process_open_fds / process_max_fds

- alert: FdExhaustionIn4Hrs
Contributor

predict_linear() alerts won't fire when the process has exhausted (or is about to exhaust) all file descriptors. You would need an alert like "instance:fd_utilization >= 0.99" for those cases.
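A minimal sketch of such a safety-belt rule, reusing the `instance:fd_utilization` record above (the alert name, `for` duration, and severity are illustrative assumptions):

```yaml
- alert: FdExhaustionClose
  expr: instance:fd_utilization >= 0.99
  for: 5m
  labels:
    severity: critical
  annotations:
    description: "{{ $labels.instance }} has (nearly) exhausted its available file descriptors"
```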

Contributor Author

predict_linear() alerts won't fire when the process has exhausted (or is about to exhaust) all file descriptors.

I'm not sure I understand. Admittedly I'm lacking the benefit of much real-world data since we don't have node exporter deployed broadly yet. I could just pull this os.yaml file out for now until we get more hands-on experience.

Contributor

predict_linear() may fail to predict the correct value for some edge cases. For instance the predicted slope is decreasing but the last point is already at the threshold value. In this situation the alert won't trigger because predict_linear() will return a value below the threshold.

(image: predict_linear edge case)

Contributor Author

Ok, that makes sense. A picture is worth a thousand words! Thanks for the description.

Contributor Author

@simonpasquier would you agree predict_linear() is useful for canary-type warning ("You're headed for danger") but critical alerting on resource saturation should use the bare utilization value?

Contributor

@aweiteka that's correct. predict_linear() works fine when the metric's trend is more or less steady. Since it's sometimes more erratic, you still need a safety belt: a fixed threshold close to the max saturation level.

@aweiteka
Contributor Author

I removed os.rules for now while we better determine what we want to alert on. An example rule set for etcd will be good to have with the documentation.

@@ -258,6 +258,8 @@ objects:
prometheus.yml: |
rule_files:
- 'prometheus.rules'
- '*.rules'
Contributor

#17553 needs to merge first, but minor.


## Updating Rules

NOTE: We cannot yet "update" a configmap from a local file (see [comment](https://github.com/kubernetes/kubernetes/issues/30558#issuecomment-326643503)). For now we delete and recreate. Why not use the Prometheus API `/-/reload/` endpoint? It can take over 60 seconds for changes to a configmap to appear in a pod (see [detailed explanation](https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/#mounted-configmaps-are-updated-automatically)). It is more reliable to simply delete the pod so it creates a new one with the new configmap. This has the cost of ~10s downtime but ensures you've got the updated config.
Contributor

Might be better to structure this separately.

@aweiteka
Contributor Author

aweiteka commented Jan 2, 2018

Per discussion with @smarterclayton, the etcd metrics don't have values by default. I'll be removing the etcd rules file and adding a basic kube service rules file.

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 3, 2018
@aweiteka
Contributor Author

aweiteka commented Jan 3, 2018

Here's a pass at a few rules for the kube service to monitor "golden signal" errors and latency. The file also serves as an example of using annotations for grouping and automation.
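As a rough illustration of the shape of those rules, a sketch of an API error-rate alert carrying the grouping/automation annotations; the alert name, expression, and thresholds are placeholders, not the PR's actual queries:

```yaml
- alert: KubernetesAPIErrorsHigh
  expr: sum(rate(apiserver_request_count{code=~"5.."}[5m])) by (instance) / sum(rate(apiserver_request_count[5m])) by (instance) > 0.05
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Kubernetes API server error rate is high
    description: "5xx ratio above 5% on {{ $labels.instance }}"
    selfHealing: false
    url:
```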

selfHealing: false
url:

- alert: KubernetesAPIDown
Contributor

@simonpasquier simonpasquier Jan 4, 2018

Usually you create an additional rule that fires when Prometheus hasn't discovered any API server (e.g. the service discovery is broken).

Comment outdated as it has been added in the latest commit...
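For reference, such a rule is typically written with absent(); a minimal sketch, assuming the API servers are scraped under a job named `kubernetes-apiservers` (the job name, duration, and severity are assumptions, not taken from this PR):

```yaml
- alert: KubernetesAPIAbsent
  expr: absent(up{job="kubernetes-apiservers"} == 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: No Kubernetes API server targets discovered
    description: "Prometheus has no API server targets; service discovery may be broken"
```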

severity: warning
annotations:
summary: Kubernetes API server unreachable
description: "Kubernetes API server unreachable"
Contributor

you could add the instance's label in the description:

  description: "Kubernetes API server unreachable on {{ $labels.instance }}"

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 4, 2018
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 8, 2018
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 8, 2018
@aweiteka
Contributor Author

aweiteka commented Jan 8, 2018

/retest

@aweiteka aweiteka changed the title Example prometheus rules for etcd Example prometheus rules for kube api Jan 9, 2018
@aweiteka
Contributor Author

/retest

@smarterclayton
Contributor

You need to run hack/update-generated-bindata.sh and check that in.


## Updating Rules

1. Edit or add a local rules file


add `oc edit configmap base-rules`?


oh I see you are going for a local update

--mount-path=/etc/prometheus/rules
1. Delete pod to restart with new configuration

oc delete $(oc get pods -o name --selector='app=prometheus')


do we want to say something about reloading by sending HUP to the prometheus process?

Contributor Author

This is a developer-focused workflow, so I thought local files -> restart was the most straightforward path.


Makes sense


Maybe it's still worth mentioning the HUP for operators out there. Up to you.
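For operators who prefer an in-place reload, a sketch of the SIGHUP approach; the container name, the PID 1 assumption, and the presence of a kill binary in the image all depend on the deployment and are not taken from this PR:

```sh
# Ask Prometheus to re-read its configuration without restarting the pod
# (assumes Prometheus runs as PID 1 in a container named "prometheus").
oc exec <prometheus-pod> -c prometheus -- kill -HUP 1
```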

expr: max(kubelet_docker_operations_latency_microseconds{quantile="0.9"}) / 1e+06 > 1
for: 5m
labels:
severity: warning


Is there a reason for the severity being a label and not an annotation?
It's part of the alert definition metadata and not the metric that generated it.

Contributor

Having it as a label allows the alert to be routed to different receivers in AlertManager.
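A minimal Alertmanager routing sketch showing how the severity label drives routing (the receiver names here are made up):

```yaml
route:
  receiver: default
  routes:
  - match:
      severity: critical
    receiver: pagerduty
  - match:
      severity: warning
    receiver: email

receivers:
- name: default
- name: pagerduty
- name: email
```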

Contributor Author

Having it as a label allows the alert to be routed to different receivers in AlertManager.

That's my understanding (upstream docs). We could also add it as an annotation if we determine it's useful in some way.


Hmm, makes sense.
It also means that if you did not define a severity on your alert and the generating expression has that label, it will be set from that (I'm not sure what the precedence is if both exist).

rules:

- alert: DockerLatencyHigh
expr: max(kubelet_docker_operations_latency_microseconds{quantile="0.9"}) / 1e+06 > 1


would be nice to have this alert per node:
max(kubelet_docker_operations_latency_microseconds{quantile="0.9"}) by (instance) / 1e+06

(and use $instance in the description)
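Putting that together, a sketch of the per-node variant; the `for` and severity values mirror the existing rule, while the description text is illustrative:

```yaml
- alert: DockerLatencyHigh
  expr: max(kubelet_docker_operations_latency_microseconds{quantile="0.9"}) by (instance) / 1e+06 > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    description: "Docker operations p90 latency above 1s on {{ $labels.instance }}"
```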

Contributor Author

Good catch. Thanks.

selfHealing: false
url:

- alert: KubernetesAPIAbsent


I guess this alert will also fire if Prometheus isn't scraping metrics.

Contributor Author

Right, this is to ensure we catch silent failures.

@aweiteka
Contributor Author

/retest

@moolitayer

@aweiteka looks good, only thing missing is updated severities

@moolitayer

@aweiteka do you plan to update the severities?

@aweiteka
Contributor Author

@aweiteka do you plan to update the severities?

We're tweaking some of the queries to not be so chatty.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aweiteka
We suggest the following additional approver: smarterclayton

Assign the PR to them by writing /assign @smarterclayton in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@openshift-ci-robot

openshift-ci-robot commented Jan 30, 2018

@aweiteka: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/openshift-jenkins/verify | 1e77ad0 | link | /test verify |
| ci/openshift-jenkins/gcp | 1e77ad0 | link | /test gcp |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 30, 2018
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 30, 2018
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close
