Commit 951d15c

add example etcd rules and a doc
Signed-off-by: Aaron Weitekamp <[email protected]>
1 parent 9425ace commit 951d15c

4 files changed: +102 -1 lines changed

examples/prometheus/README.md

Lines changed: 2 additions & 1 deletion
@@ -53,6 +53,7 @@ $ oc process -f prometheus-standalone.yaml | oc apply -f -
 
 You can find the Prometheus route by invoking `oc get routes` and then browsing in your web console. Users who are granted `view` access on the namespace will have access to login to Prometheus.
 
+To load rules see [rules README](/examples/prometheus/rules/README.md).
 
 ## Useful metrics queries
 
@@ -175,4 +176,4 @@ Returns the number of successfully completed builds.
 
 > openshift_build_total{phase="Failed"} offset 5m
 
-Returns the failed builds totals, per failure reason, from 5 minutes ago.
+Returns the failed builds totals, per failure reason, from 5 minutes ago.

examples/prometheus/prometheus.yaml

Lines changed: 1 addition & 0 deletions
@@ -273,6 +273,7 @@ objects:
     prometheus.yml: |
       rule_files:
         - '*.rules'
+        - 'rules/*.rules'
 
       # A scrape configuration for running Prometheus on a Kubernetes cluster.
       # This uses separate scrape configs for cluster components (i.e. API server, node)
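
The added `rules/*.rules` entry matters because a `*` in a glob does not cross directory separators, so `*.rules` alone would not pick up files mounted in a `rules/` subdirectory. The sketch below uses shell globbing as an analogy for that behavior; the temporary directory and the file name `top.rules` are made up for illustration and are not part of the commit.

```shell
#!/bin/sh
# Sketch: show which files each rule_files pattern would match.
# Shell globbing stands in for Prometheus's own matching here (an analogy,
# not Prometheus itself). The directory layout is hypothetical.
set -eu

tmp=$(mktemp -d)
mkdir -p "$tmp/rules"
touch "$tmp/top.rules" "$tmp/rules/kube.rules"

# Expand each pattern relative to the pretend configuration directory.
top_level=$(cd "$tmp" && echo *.rules)
nested=$(cd "$tmp" && echo rules/*.rules)

echo "*.rules       -> $top_level"
echo "rules/*.rules -> $nested"

rm -rf "$tmp"
```

Relative `rule_files` paths are resolved against the directory that holds the Prometheus configuration file, which is why the pattern is written as `rules/*.rules` rather than as an absolute path.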

examples/prometheus/rules/README.md

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+# Prometheus and Alertmanager Rules
+
+## Loading Rules
+
+With this deployment method all files in the rules directory are mounted into the pod as a configmap.
+
+1. Create a configmap of the rules directory
+
+        oc create configmap base-rules --from-file=rules/
+1. Attach the configmap to the prometheus statefulset as a volume
+
+        oc volume statefulset/prometheus --add \
+            --configmap-name=base-rules --name=base-rules -t configmap \
+            --mount-path=/etc/prometheus/rules
+1. Delete pod to restart with new configuration
+
+        oc delete $(oc get pods -o name --selector='app=prometheus')
+
+## Updating Rules
+
+1. Edit or add a local rules file
+1. Validate the rules directory. ('promtool' may be downloaded from the [Prometheus web site](https://prometheus.io/download/).)
+
+        promtool check rules rules/*.rules
+1. Update the configmap
+
+        oc create configmap base-rules --from-file=rules/ --dry-run -o yaml | oc apply -f -
+1. Delete pod to restart with new configuration
+
+        oc delete $(oc get pods -o name --selector='app=prometheus')
+
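
The "Updating Rules" steps in the README can be chained into one script so that a failed validation stops the rollout. This is a minimal sketch, not part of the commit: the `DRY_RUN`, `RULES_DIR`, and `CONFIGMAP` variables and the `run` and `update_rules` helpers are hypothetical names, and `oc delete pods --selector=...` is used here as an equivalent of the README's `oc delete $(oc get pods ...)` form. It assumes `promtool` and `oc` are on the `PATH` and that you are logged in to the project running Prometheus.

```shell
#!/bin/sh
# Sketch: validate, publish, and reload the rules directory in one pass.
# DRY_RUN defaults to 1 (hypothetical switch) so nothing touches the
# cluster until you opt in with DRY_RUN=0.
set -eu

RULES_DIR="${RULES_DIR:-rules}"
CONFIGMAP="${CONFIGMAP:-base-rules}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

update_rules() {
  # 1. Validate every rules file; stop here if promtool reports an error.
  run promtool check rules "$RULES_DIR"/*.rules || return 1
  # 2. Regenerate the configmap and apply it over the existing one.
  #    (Newer clients may expect --dry-run=client instead of --dry-run.)
  run sh -c "oc create configmap $CONFIGMAP --from-file=$RULES_DIR/ --dry-run -o yaml | oc apply -f -" || return 1
  # 3. Delete the pod so the statefulset restarts it with the new rules.
  run oc delete pods --selector=app=prometheus || return 1
}

update_rules
```

Run it once as-is to review the planned commands, then again with `DRY_RUN=0` to apply them.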

examples/prometheus/rules/kube.rules

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
+groups:
+- name: kubernetes-rules
+  rules:
+
+  - alert: DockerLatencyHigh
+    expr: max(kubelet_docker_operations_latency_microseconds{quantile="0.9"}) / 1e+06 > 1
+    for: 5m
+    labels:
+      severity: warning
+    annotations:
+      summary: Docker latency is high
+      description: "Docker latency is {{ $value }} seconds for 90% of kubelet operations"
+      alertType: latency
+      component: container runtime
+      selfHealing: false
+      url:
+
+  - alert: KubernetesAPIDown
+    expr: up{job="kubernetes-apiservers"} == 0
+    for: 10m
+    labels:
+      severity: critical
+    annotations:
+      summary: Kubernetes API server unreachable
+      description: "Kubernetes API server unreachable on {{ $labels.cluster }} instance {{ $labels.instance }}"
+      alertType: availability
+      component: kubernetes
+      selfHealing: false
+      url:
+
+  - alert: KubernetesAPIAbsent
+    expr: absent(up{job="kubernetes-apiservers"})
+    for: 5m
+    labels:
+      severity: critical
+    annotations:
+      summary: Kubernetes API server absent
+      description: Kubernetes API server absent
+      alertType: availability
+      component: kubernetes
+      selfHealing: false
+      url:
+
+  - alert: KubernetesAPIErrorsHigh
+    expr: rate(apiserver_request_count{code=~"^(?:5..)$"}[5m]) / rate(apiserver_request_count[5m]) * 100 > 5
+    for: 5m
+    labels:
+      severity: warning
+    annotations:
+      summary: Kubernetes API server errors high
+      description: "Kubernetes API server errors (response code 5xx) are {{ $value }}% of total requests"
+      alertType: errors
+      component: kubernetes
+      selfHealing: false
+      url:
+
+  - alert: KubernetesAPILatencyHigh
+    expr: apiserver_request_latencies_summary{quantile="0.9",subresource!="log",verb!~"^(?:WATCH|WATCHLIST|PROXY|CONNECT)$"} / 1e+06 > .5
+    for: 10m
+    labels:
+      severity: warning
+    annotations:
+      summary: Kubernetes API server latency high
+      description: "Kubernetes API server request latency is {{ $value }} seconds for 90% of requests. NOTE: long-standing requests have been removed from alert query."
+      alertType: latency
+      component: kubernetes
+      selfHealing: false
+      url:
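
For a quick review of a rules file like the one above, it can help to list each alert alongside its `for` duration and severity. The following is a rough sketch under stated assumptions: `summarize_alerts` is a hypothetical helper, it relies on plain text matching rather than a YAML parser, and it assumes the layout used in `kube.rules`, where each `- alert:` entry is followed by one `for:` and one `severity:` key. The sample input is a two-rule excerpt written to a temporary file.

```shell
#!/bin/sh
# Sketch: print "<alert> <for> <severity>" for each rule in a rules file.
# Text matching only; assumes the key order used in kube.rules.
set -eu

summarize_alerts() {
  awk '
    $1 == "-" && $2 == "alert:" { name = $3 }   # start of a rule
    $1 == "for:"                { hold = $2 }   # pending duration
    $1 == "severity:"           { print name, hold, $2 }
  ' "$1"
}

# Two-rule excerpt of kube.rules, used only for demonstration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
groups:
- name: kubernetes-rules
  rules:

  - alert: DockerLatencyHigh
    expr: max(kubelet_docker_operations_latency_microseconds{quantile="0.9"}) / 1e+06 > 1
    for: 5m
    labels:
      severity: warning

  - alert: KubernetesAPIDown
    expr: up{job="kubernetes-apiservers"} == 0
    for: 10m
    labels:
      severity: critical
EOF

summary=$(summarize_alerts "$sample")
rm -f "$sample"
echo "$summary"
```

This does not replace `promtool check rules`, which remains the way to confirm that the file is valid.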
