# Example prometheus rules for kube api #17608

Closed · wants to merge 2 commits
3 changes: 2 additions & 1 deletion examples/prometheus/README.md
@@ -53,6 +53,7 @@ $ oc process -f prometheus-standalone.yaml | oc apply -f -

You can find the Prometheus route by invoking `oc get routes` and then browsing in your web console. Users who are granted `view` access on the namespace will be able to log in to Prometheus.

To load rules, see the [rules README](/examples/prometheus/rules/README.md).

## Useful metrics queries

@@ -175,4 +176,4 @@ Returns the number of successfully completed builds.

> openshift_build_total{phase="Failed"} offset 5m

Returns the failed build totals, per failure reason, from 5 minutes ago.
3 changes: 2 additions & 1 deletion examples/prometheus/prometheus.yaml
@@ -366,7 +366,7 @@ objects:
expr: up{job="kubernetes-nodes"} == 0
annotations:
miqTarget: "ContainerNode"
severity: "HIGH"
severity: error
message: "{{$labels.instance}} is down"

recording.rules: |
@@ -385,6 +385,7 @@
prometheus.yml: |
rule_files:
- '*.rules'
- 'rules/*.rules'

# A scrape configuration for running Prometheus on a Kubernetes cluster.
# This uses separate scrape configs for cluster components (i.e. API server, node)
31 changes: 31 additions & 0 deletions examples/prometheus/rules/README.md
@@ -0,0 +1,31 @@
# Prometheus and Alertmanager Rules

## Loading Rules

With this deployment method, all files in the rules directory are mounted into the pod as a configmap.

1. Create a configmap of the rules directory

        oc create configmap base-rules --from-file=rules/

1. Attach the configmap to the prometheus statefulset as a volume

        oc volume statefulset/prometheus --add \
          --configmap-name=base-rules --name=base-rules -t configmap \
          --mount-path=/etc/prometheus/rules

1. Delete the pod so it restarts with the new configuration

        oc delete $(oc get pods -o name --selector='app=prometheus')
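A quick way to confirm the rules landed in the pod is to list the mount. This is a sketch only; the statefulset pod name `prometheus-0` and the container name `prometheus` are assumptions:

    # List the rule files mounted into the Prometheus container (names assumed)
    oc exec prometheus-0 -c prometheus -- ls /etc/prometheus/rules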

Reviewer: Do we want to say something about reloading by sending HUP to the prometheus process?

Author: This is a developer-focused workflow, so I thought local files -> restart was the most straightforward path.

Reviewer: Makes sense.

Reviewer: Maybe it's still worth mentioning the HUP for operators out there. Up to you.
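For reference, Prometheus re-reads its configuration and rule files on SIGHUP, so the reload the reviewers mention could look roughly like this. A sketch only: the pod name `prometheus-0`, the container name `prometheus`, a `kill` binary being present in the image, and the server running as PID 1 are all assumptions:

    # Signal the Prometheus server to reload its configuration and rules in place
    # (pod/container names and PID 1 are assumptions, not part of the PR)
    oc exec prometheus-0 -c prometheus -- kill -HUP 1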


## Updating Rules

1. Edit or add a local rules file

Reviewer: add `oc edit configmap base-rules`?

Reviewer: oh I see, you are going for a local update

1. Validate the rules directory (`promtool` may be downloaded from the [Prometheus web site](https://prometheus.io/download/))

        promtool check rules rules/*.rules

1. Update the configmap

        oc create configmap base-rules --from-file=rules/ --dry-run -o yaml | oc apply -f -

1. Delete the pod so it restarts with the new configuration

        oc delete $(oc get pods -o name --selector='app=prometheus')
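As the reviewer notes above, an in-place alternative to the local-file workflow is to edit the configmap directly and then restart the pod. A sketch under the same naming assumptions as before:

    # Edit the rule files stored in the configmap directly
    oc edit configmap base-rules
    # Restart the pod so Prometheus picks up the change
    oc delete $(oc get pods -o name --selector='app=prometheus')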

73 changes: 73 additions & 0 deletions examples/prometheus/rules/kube.rules
@@ -0,0 +1,73 @@
groups:
- name: kubernetes-rules
  rules:

  - alert: DockerLatencyHigh
    expr: max(kubelet_docker_operations_latency_microseconds{quantile="0.9"}) / 1e+06 > 1

Reviewer: would be nice to have this alert per node,
`max(kubelet_docker_operations_latency_microseconds{quantile="0.9"}) by (instance) / 1e+06`,
and use $instance in the description. (A sketch of this per-node variant follows the rule below.)

Author: Good catch. Thanks.

    for: 5m
    labels:
      severity: warning

Reviewer: Is there a reason for the severity being a label and not an annotation? It's part of the alert definition metadata, not the metric that generated it.

Reviewer: Having it as a label allows the alert to be routed to different receivers in AlertManager.

Author: That's my understanding (upstream docs). We could also add it as an annotation if we determine it's useful in some way.

Reviewer: Hmm, makes sense. It also means that if you did not define a severity on your alert and a generating expression has that label, it will be set from that (I'm not sure what the precedence is if both exist).
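For context on the routing point above, a severity label can be matched in the Alertmanager routing tree, roughly like this. A sketch only; the receiver names and the rest of the Alertmanager configuration are made up for illustration:

    route:
      receiver: default-notifications
      routes:
      # Error-severity alerts go to a paging receiver (names are illustrative)
      - match:
          severity: error
        receiver: pager
      # Warnings go to a lower-urgency channel
      - match:
          severity: warning
        receiver: chat-room
    receivers:
    - name: default-notifications
    - name: pager
    - name: chat-room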

    annotations:
      summary: Docker latency is high
      description: "Docker latency is {{ $value }} seconds for 90% of kubelet operations"
      alertType: latency
      miqTarget: ContainerNode
      component: container runtime
      selfHealing: false
      url:
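A per-node variant of DockerLatencyHigh, along the lines the reviewer suggested, might look roughly like this. A sketch only, not part of the PR; the threshold and severity are carried over from the rule above:

    # Sketch: same rule, aggregated per node so the alert identifies the instance
    - alert: DockerLatencyHigh
      expr: max(kubelet_docker_operations_latency_microseconds{quantile="0.9"}) by (instance) / 1e+06 > 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: Docker latency is high
        description: "Docker latency on {{ $labels.instance }} is {{ $value }} seconds for 90% of kubelet operations"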

  - alert: KubernetesAPIDown

Reviewer (@simonpasquier, Jan 4, 2018): Usually you create an additional rule that fires when Prometheus hasn't discovered any API server (e.g. the service discovery is broken). Comment outdated, as it has been added in the latest commit...

    expr: up{job="kubernetes-apiservers"} == 0
    for: 10m
    labels:
      severity: error
    annotations:
      summary: Kubernetes API server unreachable
      description: "Kubernetes API server unreachable on {{ $labels.cluster }} instance {{ $labels.instance }}"
      alertType: availability
      miqTarget: ContainerNode
      component: kubernetes
      selfHealing: false
      url:

  - alert: KubernetesAPIAbsent

Reviewer: I guess this alert will also be firing in case prometheus isn't scraping metrics.

Author: Right, this is to ensure we catch silent failures.

    expr: absent(up{job="kubernetes-apiservers"})
    for: 5m
    labels:
      severity: error
    annotations:
      summary: Kubernetes API server absent
      description: Kubernetes API server absent
      alertType: availability
      miqTarget: ContainerNode
      component: kubernetes
      selfHealing: false
      url:

  - alert: KubernetesAPIErrorsHigh
    expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m]) * 100 > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes API server errors high
      description: "Kubernetes API server errors (response code 5xx) are {{ $value }}% of total requests"
      alertType: errors
      miqTarget: ContainerNode
      component: kubernetes
      selfHealing: false
      url:

  - alert: KubernetesAPILatencyHigh
    expr: apiserver_request_latencies_summary{quantile="0.9",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"} / 1e+06 > .5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Kubernetes API server latency high
      description: "Kubernetes API server request latency is {{ $value }} seconds for 90% of requests. NOTE: long-standing requests have been removed from alert query."
      alertType: latency
      miqTarget: ContainerNode
      component: kubernetes
      selfHealing: false
      url: