# Monitoring namespace
Prometheus is accessible at [prom.k-space.ee](https://prom.k-space.ee/)
and the corresponding AlertManager is accessible at [am.k-space.ee](https://am.k-space.ee/).
Both are [deployed by ArgoCD](https://argocd.k-space.ee/applications/monitoring)
from this Git repo directory using the Prometheus operator.

Note that Prometheus and the other monitoring stack components should use an
appropriate node selector so that they get scheduled on nodes hosted in
a privileged VLAN, where they have access to UPS SNMP targets,
Mikrotik router/switch APIs etc.
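
As a minimal sketch of such a node selector, the fragment below pins the Prometheus pods via the `nodeSelector` field of the operator's `Prometheus` resource. The resource name and the node label `example.com/privileged-vlan` are hypothetical; use whatever label the nodes in the privileged VLAN actually carry in this cluster.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus            # hypothetical resource name
  namespace: monitoring
spec:
  # Schedule Prometheus pods only on nodes carrying this label.
  # The label key and value are assumptions for illustration.
  nodeSelector:
    example.com/privileged-vlan: "true"
```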
## For users
To add monitoring targets inside the Kubernetes cluster, use the
[PodMonitor](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/getting-started.md#using-podmonitors) or ServiceMonitor custom
resources.
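
A PodMonitor for an in-cluster workload could look roughly like the sketch below. The names `example-app` and `metrics` are hypothetical, and depending on how the Prometheus resource in this repo selects monitors, extra labels on the PodMonitor may be required.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: example-app           # hypothetical
  namespace: example-app      # hypothetical; the namespace the pods run in
spec:
  selector:
    matchLabels:
      app: example-app        # must match the labels on the target pods
  podMetricsEndpoints:
    - port: metrics           # name of the container port exposing metrics
      path: /metrics
```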

For external targets, (ab)use the Probe CRD as seen in `node-exporter.yaml`
or `ping-exporter.yaml`.
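
For orientation, a generic Probe has roughly the shape sketched below. This is not a copy of `node-exporter.yaml` or `ping-exporter.yaml`: the exporter address, probe path and target IP are placeholders, so consult those files for the settings actually used here.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: example-probe         # hypothetical
  namespace: monitoring
spec:
  jobName: example-probe
  interval: 60s
  prober:
    # Exporter that performs the probe on behalf of Prometheus.
    # Address and path are assumptions for illustration.
    url: example-exporter.monitoring.svc:9427
    path: /probe
  targets:
    staticConfig:
      static:
        - 192.0.2.10          # placeholder address of an external target
```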

Alerts are sent to the #kube-prod Slack channel. The alerting rules are automatically
picked up by the Prometheus operator from Kubernetes manifests that use
the operator's
[PrometheusRule](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/alerting.md#deploying-prometheus-rules) custom resources.
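
As an illustration, a PrometheusRule that alerts on low disk space could be sketched as follows. The rule name, threshold, duration and severity label are assumptions rather than rules taken from this repo, and the rule selector of the Prometheus resource may require additional labels on the manifest.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-disk-space    # hypothetical
  namespace: monitoring
spec:
  groups:
    - name: example-disk-space
      rules:
        - alert: FilesystemAlmostFull
          # Fires when less than 10% of a filesystem is available.
          expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Less than 10% space left on {{ $labels.mountpoint }} of {{ $labels.instance }}"
```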
Sample queries:
* [SSD/HDD temperatures](https://prom.k-space.ee/graph?g0.expr=%7B__name__%3D~%22smartmon_(temperature_celsius%7Cairflow_temperature_cel)_raw_value%22%7D&g0.tab=0&g0.stacked=0&g0.range_input=1d)
* [HDD power on hours](https://prom.k-space.ee/graph?g0.range_input=30m&g0.expr=smartmon_power_on_hours_raw_value&g0.tab=0), 8760 hours per year
* [CPU/NB temperatures](https://prom.k-space.ee/graph?g0.range_input=1h&g0.expr=node_hwmon_temp_celsius&g0.tab=0)
* [Disk space left](https://prom.k-space.ee/graph?g0.range_input=1h&g0.expr=node_filesystem_avail_bytes&g0.tab=1)
* Minio [s3 egress](https://prom.k-space.ee/graph?g0.expr=rate(minio_s3_traffic_sent_bytes%5B3m%5D)&g0.tab=0&g0.display_mode=lines&g0.show_exemplars=0&g0.range_input=6h), [internode egress](https://prom.k-space.ee/graph?g0.expr=rate(minio_inter_node_traffic_sent_bytes%5B2m%5D)&g0.tab=0&g0.display_mode=lines&g0.show_exemplars=0&g0.range_input=6h), [storage used](https://prom.k-space.ee/graph?g0.expr=minio_node_disk_used_bytes&g0.tab=0&g0.display_mode=lines&g0.show_exemplars=0&g0.range_input=6h)
## For administrators
To reconfigure SNMP targets etc., recreate the `snmp-exporter` ConfigMap from `snmp-configs.yaml`:
```
kubectl delete -n monitoring configmap snmp-exporter
kubectl create -n monitoring configmap snmp-exporter --from-file=snmp.yml=snmp-configs.yaml
```
To set Slack secrets:
```
kubectl create -n monitoring secret generic slack-secrets \
--from-literal=webhook-url=https://hooks.slack.com/services/...
```
To set Mikrotik secrets:
```
kubectl create -n monitoring secret generic mikrotik-exporter \
--from-literal=MIKROTIK_PASSWORD='f7W!H*Pu' \
--from-literal=PROMETHEUS_BEARER_TOKEN="$(head -c 32 /dev/urandom | base64 | head -c 30)"
```