Background
TiDB ships with Prometheus + Grafana to monitor the performance metrics of the TiKV, TiDB, and PD components in a cluster, but it lacks real-time alert notifications out of the box. This post describes how to push alert messages to a WeCom (Enterprise WeChat) group via webhook.
Installing the tool
PrometheusAlert is an open-source message hub for ops alerting. It accepts alerts from the mainstream monitoring system Prometheus, the log system Graylog, and the data-visualization system Grafana, and can forward them to DingTalk, WeChat, WeCom, Huawei Cloud SMS, Tencent Cloud SMS and phone calls, Alibaba Cloud SMS and phone calls, and more.
Project: https://github.com/feiyu563/PrometheusAlert
Install it step by step by following the README.
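The project README also documents a Docker image. The sketch below only composes the launch command; the image name `feiyu563/prometheus-alert` and default port 8080 are assumptions taken from the README, so verify them against the current docs before running on your host:

```shell
# Compose the PrometheusAlert launch command; image name and port are
# assumptions from the project README -- run the printed command on
# your Docker host once you have verified them.
PA_IMAGE="feiyu563/prometheus-alert"
PA_PORT=8080

RUN_CMD="docker run -d --name prometheus-alert -p ${PA_PORT}:8080 ${PA_IMAGE}"
echo "${RUN_CMD}"
```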
After installation, open the web UI.
Note the template address shown there; it will be needed later.
Alerting option 1: Grafana + PrometheusAlert (did not work)
The TiDB cluster includes a built-in Grafana deployment where alerts can be configured, with a set of default alert rules.
Add a notification channel, choose webhook, and set the URL to http://10.20.10.118:8080/prometheusalert?type=wx&tpl=grafana-wx&wxurl=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxxxxxxxxxx&at=xxxxxxx
Here key is the key of the alert bot in the WeCom group.
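Before wiring the whole pipeline together, the bot key can be verified by posting a message directly to the WeCom group-bot webhook. The send URL and JSON shape below follow the WeCom group-bot API; the key is a placeholder you must replace:

```shell
# Replace with your real WeCom group-bot key.
KEY="xxxxxxxxxxxxx"

# A minimal text message per the WeCom group-bot API.
PAYLOAD='{"msgtype":"text","text":{"content":"alert channel test"}}'

# Post the test message; with a placeholder key WeCom returns an error JSON.
curl -s --max-time 10 -X POST \
  -H 'Content-Type: application/json' \
  -d "${PAYLOAD}" \
  "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=${KEY}" \
  || echo "request failed (check network and key)"
```

If the key is valid, the test message appears in the group and the API responds with errcode 0.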
Test result: the WeCom group received the messages normally.
Back in the Grafana alert-rule editor, however, a red warning appears: "Template variables are not supported in alert queries".
Official response:
https://github.com/grafana/grafana/issues/9334
Template variables are not supported in alerting.
Template variables should be used for discovery and drill down. Not controlling alert rules
The default dashboards use template variables, which Grafana alerting does not support.
Workaround: copy the dashboard JSON and replace every template variable with a constant.
That is quite a lot of work, so I gave up; interested readers can give it a try.
Alerting option 2: Prometheus Alertmanager + PrometheusAlert
Send alert messages to PrometheusAlert through the Alertmanager bundled with the TiDB cluster.
Alertmanager address: http://IP:9093/#/alerts
Edit the Prometheus configuration file prometheus.yml and add the following:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - '10.20.10.61:9093'
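It is worth validating the edited configuration before restarting Prometheus; `promtool check config` ships with Prometheus itself. The sketch below writes a minimal demo file (the path is an example; a TiUP deployment keeps the real prometheus.yml under the prometheus component's deploy directory) and validates it if promtool is installed:

```shell
# Write a minimal demo config containing the alerting block above;
# the path is illustrative, not the real TiUP deploy path.
CFG="/tmp/prometheus-demo.yml"
cat > "${CFG}" <<'EOF'
global:
  scrape_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - '10.20.10.61:9093'
EOF

# Validate with promtool when available (bundled with Prometheus).
if command -v promtool >/dev/null 2>&1; then
  promtool check config "${CFG}"
else
  echo "promtool not installed; skipping validation"
fi
```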
Edit the Alertmanager configuration file alertmanager.yml:
global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: "localhost:25"
  smtp_from: "alertmanager@example.org"
  smtp_auth_username: "alertmanager"
  smtp_auth_password: "password"
  # smtp_require_tls: true

  # The Slack webhook URL.
  # slack_api_url: ''

route:
  # A default receiver
  # receiver: "blackhole"
  receiver: "webhook1"

  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ["env", "instance", "alertname", "type", "group", "job"]

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 3m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 3m

  routes:
    # - match:
    #   receiver: webhook-kafka-adapter
    #   continue: true
    # - match:
    #     env: test-cluster
    #   receiver: db-alert-slack
    # - match:
    #     env: test-cluster
    #   receiver: db-alert-email

receivers:
  # - name: 'webhook-kafka-adapter'
  #   webhook_configs:
  #     - send_resolved: true
  #       url: 'http://10.0.3.6:28082/v1/alertmanager'

  # - name: 'db-alert-slack'
  #   slack_configs:
  #     - channel: '#alerts'
  #       username: 'db-alert'
  #       icon_emoji: ':bell:'
  #       title: '{{ .CommonLabels.alertname }}'
  #       text: '{{ .CommonAnnotations.summary }} {{ .CommonAnnotations.description }} expr: {{ .CommonLabels.expr }} http://172.0.0.1:9093/#/alerts'

  # - name: "db-alert-email"
  #   email_configs:
  #     - send_resolved: true
  #       to: "example@example.com"

  - name: webhook1
    webhook_configs:
      - url: 'http://10.20.10.118:8080/prometheusalert?type=wx&tpl=prometheus-wx&wxurl=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxxxxxxxx&at=xxxxxxxxx'
        send_resolved: true  # also notify when the alert is resolved

  # This doesn't alert anything, please configure your own receiver
  # - name: "blackhole"
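To exercise the whole chain (Alertmanager -> PrometheusAlert -> WeCom group) without waiting for a real incident, you can post a synthetic alert to Alertmanager's v2 API. The address below is taken from this post's setup; adjust it to your own:

```shell
# Fire a synthetic alert at Alertmanager; the address is this post's
# example deployment, not a universal default.
AM="http://10.20.10.61:9093"

# A minimal v2 alert: a JSON array of alerts with labels/annotations.
ALERT='[{
  "labels": {"alertname": "ChannelTest", "severity": "warning", "env": "test"},
  "annotations": {"summary": "alert pipeline test"}
}]'

curl -s --max-time 5 -X POST \
  -H 'Content-Type: application/json' \
  -d "${ALERT}" \
  "${AM}/api/v2/alerts" \
  || echo "request failed (is Alertmanager reachable?)"
```

After `group_wait` (30s in the config above) elapses, the ChannelTest alert should show up in the WeCom group.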
Restart the alertmanager and prometheus services:
tiup cluster restart cluster-name -N ip:9093
tiup cluster restart cluster-name -N ip:9090
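After the restart, both services can be checked via their health endpoints; Prometheus and Alertmanager each expose /-/healthy. The IP below is assumed from this post's example cluster:

```shell
# Probe both services; IPs/ports are this post's example values.
for URL in "http://10.20.10.61:9090/-/healthy" \
           "http://10.20.10.61:9093/-/healthy"; do
  echo "checking ${URL}"
  curl -s --max-time 5 "${URL}" || echo "unreachable: ${URL}"
done
```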
The resulting alerts in the WeCom group: