Background
TiDB ships with Prometheus + Grafana to monitor the performance metrics of the TiKV, TiDB, and PD components in a cluster, but it lacks real-time alert notifications out of the box. This post describes how to push alert messages to a WeCom (Enterprise WeChat) group via webhook.
Installing the tool
PrometheusAlert is an open-source message hub for ops alerting. It accepts alerts from the mainstream monitoring system Prometheus, the log system Graylog, and the data-visualization system Grafana, and can forward them to DingTalk, WeChat, WeCom, Huawei Cloud SMS, Tencent Cloud SMS and phone calls, Alibaba Cloud SMS and phone calls, and more.
Project: https://github.com/feiyu563/PrometheusAlert
Install it step by step by following the README.
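The project README also documents a Docker image. The sketch below only composes the launch command; the image name `feiyu563/prometheus-alert` and default port 8080 are assumptions taken from the README, so verify them against the current docs before running on your host:

```shell
# Compose the PrometheusAlert launch command; image name and port are
# assumptions from the project README -- run the printed command on
# your Docker host once you have verified them.
PA_IMAGE="feiyu563/prometheus-alert"
PA_PORT=8080

RUN_CMD="docker run -d --name prometheus-alert -p ${PA_PORT}:8080 ${PA_IMAGE}"
echo "${RUN_CMD}"
```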
After installation, open the web UI.
Note the template address shown there; it will be needed later.
Alerting option 1: Grafana + PrometheusAlert (did not work)
The TiDB cluster includes a built-in Grafana deployment where alerts can be configured, with a set of default alert rules.
Add a notification channel, choose webhook, and set the URL to http://10.20.10.118:8080/prometheusalert?type=wx&tpl=grafana-wx&wxurl=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxxxxxxxxxx&at=xxxxxxx
Here key is the key of the alert bot in the WeCom group.
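Before wiring the whole pipeline together, the bot key can be verified by posting a message directly to the WeCom group-bot webhook. The send URL and JSON shape below follow the WeCom group-bot API; the key is a placeholder you must replace:

```shell
# Replace with your real WeCom group-bot key.
KEY="xxxxxxxxxxxxx"

# A minimal text message per the WeCom group-bot API.
PAYLOAD='{"msgtype":"text","text":{"content":"alert channel test"}}'

# Post the test message; with a placeholder key WeCom returns an error JSON.
curl -s --max-time 10 -X POST \
  -H 'Content-Type: application/json' \
  -d "${PAYLOAD}" \
  "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=${KEY}" \
  || echo "request failed (check network and key)"
```

If the key is valid, the test message appears in the group and the API responds with errcode 0.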
Test result: the WeCom group received the messages normally.
Back in the Grafana alert-rule editor, however, a red warning appears: "Template variables are not supported in alert queries".
Official response:
https://github.com/grafana/grafana/issues/9334
Template variables are not supported in alerting.
Template variables should be used for discovery and drill down. Not controlling alert rules
The default dashboards use template variables, which Grafana alerting does not support.
Workaround: copy the dashboard JSON and replace every template variable with a constant.
That is quite a lot of work, so I gave up; interested readers can give it a try.
Alerting option 2: Prometheus Alertmanager + PrometheusAlert
Send alert messages to PrometheusAlert through the Alertmanager bundled with the TiDB cluster.
Alertmanager address: http://IP:9093/#/alerts
Edit the Prometheus configuration file prometheus.yml and add the following:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - '10.20.10.61:9093'
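It is worth validating the edited configuration before restarting Prometheus; `promtool check config` ships with Prometheus itself. The sketch below writes a minimal demo file (the path is an example; a TiUP deployment keeps the real prometheus.yml under the prometheus component's deploy directory) and validates it if promtool is installed:

```shell
# Write a minimal demo config containing the alerting block above;
# the path is illustrative, not the real TiUP deploy path.
CFG="/tmp/prometheus-demo.yml"
cat > "${CFG}" <<'EOF'
global:
  scrape_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - '10.20.10.61:9093'
EOF

# Validate with promtool when available (bundled with Prometheus).
if command -v promtool >/dev/null 2>&1; then
  promtool check config "${CFG}"
else
  echo "promtool not installed; skipping validation"
fi
```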
Edit the Alertmanager configuration file alertmanager.yml:
global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: "localhost:25"
  smtp_from: "alertmanager@example.org"
  smtp_auth_username: "alertmanager"
  smtp_auth_password: "password"
  # smtp_require_tls: true

  # The Slack webhook URL.
  # slack_api_url: ''

route:
  # A default receiver
  # receiver: "blackhole"
  receiver: "webhook1"

  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ["env", "instance", "alertname", "type", "group", "job"]

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 3m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 3m

  routes:
    # - match:
    #   receiver: webhook-kafka-adapter
    #   continue: true
    # - match:
    #     env: test-cluster
    #   receiver: db-alert-slack
    # - match:
    #     env: test-cluster
    #   receiver: db-alert-email

receivers:
  # - name: 'webhook-kafka-adapter'
  #   webhook_configs:
  #     - send_resolved: true
  #       url: 'http://10.0.3.6:28082/v1/alertmanager'

  # - name: 'db-alert-slack'
  #   slack_configs:
  #     - channel: '#alerts'
  #       username: 'db-alert'
  #       icon_emoji: ':bell:'
  #       title: '{{ .CommonLabels.alertname }}'
  #       text: '{{ .CommonAnnotations.summary }} {{ .CommonAnnotations.description }} expr: {{ .CommonLabels.expr }} http://172.0.0.1:9093/#/alerts'

  # - name: "db-alert-email"
  #   email_configs:
  #     - send_resolved: true
  #       to: "example@example.com"

  - name: webhook1
    webhook_configs:
      - url: 'http://10.20.10.118:8080/prometheusalert?type=wx&tpl=prometheus-wx&wxurl=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxxxxxxxx&at=xxxxxxxxx'
        send_resolved: true  # also notify when the alert is resolved

  # This doesn't alert anything, please configure your own receiver
  # - name: "blackhole"
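To exercise the whole chain (Alertmanager -> PrometheusAlert -> WeCom group) without waiting for a real incident, you can post a synthetic alert to Alertmanager's v2 API. The address below is taken from this post's setup; adjust it to your own:

```shell
# Fire a synthetic alert at Alertmanager; the address is this post's
# example deployment, not a universal default.
AM="http://10.20.10.61:9093"

# A minimal v2 alert: a JSON array of alerts with labels/annotations.
ALERT='[{
  "labels": {"alertname": "ChannelTest", "severity": "warning", "env": "test"},
  "annotations": {"summary": "alert pipeline test"}
}]'

curl -s --max-time 5 -X POST \
  -H 'Content-Type: application/json' \
  -d "${ALERT}" \
  "${AM}/api/v2/alerts" \
  || echo "request failed (is Alertmanager reachable?)"
```

After `group_wait` (30s in the config above) elapses, the ChannelTest alert should show up in the WeCom group.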
Restart the alertmanager and prometheus services:
tiup cluster restart cluster-name -N ip:9093
tiup cluster restart cluster-name -N ip:9090
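After the restart, both services can be checked via their health endpoints; Prometheus and Alertmanager each expose /-/healthy. The IP below is assumed from this post's example cluster:

```shell
# Probe both services; IPs/ports are this post's example values.
for URL in "http://10.20.10.61:9090/-/healthy" \
           "http://10.20.10.61:9093/-/healthy"; do
  echo "checking ${URL}"
  curl -s --max-time 5 "${URL}" || echo "unreachable: ${URL}"
done
```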
The resulting alerts in the WeCom group: