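prometheus-webhook (here, prometheus-webhook-dingtalk) is an extension to Alertmanager alerting. It supports DingTalk, WeChat, and email alerts, as well as custom alert templates.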
1. Configure alerting
1. Download and extract the release package
    # Download
    wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.1.0/prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
    # Extract
    tar -zxvf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz

    [tidb@vm172-16-201-64 prometheus-webhook-dingtalk-2.1.0.linux-amd64]$ ll
    total 18744
    -rw-r--r-- 1 tidb tidb     1299 Apr 21 16:20 config.example.yml
    drwxr-xr-x 4 tidb tidb     4096 Apr 21 16:20 contrib
    -rw-r--r-- 1 tidb tidb    11358 Apr 21 16:20 LICENSE
    -rwxr-xr-x 1 tidb tidb 19172733 Apr 21 16:19 prometheus-webhook-dingtalk
2. Configure the webhook startup script
    more /data/webhook-dingtalk/webhook-dingtalk.sh

    #!/bin/bash
    set -e
    WEBHOOK_BIN=/data/webhook-dingtalk/prometheus-webhook-dingtalk
    # --config.file must point to the configuration file described in step 3
    exec "$WEBHOOK_BIN" \
        --web.listen-address=":8060" \
        --config.file="/data/webhook-dingtalk/jms_config.yml" \
        --log.level="info" \
        --log.format="logfmt" \
        --web.enable-lifecycle \
        --web.enable-ui
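Because the startup script uses `exec`, a wrong binary or configuration path only shows up when the service fails to start. The sketch below checks both paths up front. The `preflight` helper is hypothetical and not part of prometheus-webhook-dingtalk; the paths in the usage comment are the ones from the script above.

```shell
#!/bin/bash
# Hypothetical helper: verify the files the startup script depends on.
# Prints "ok" on success; returns 1 for a bad binary, 2 for a bad config.
preflight() {
  bin=$1
  cfg=$2
  if [ ! -x "$bin" ]; then
    echo "binary not executable: $bin" >&2
    return 1
  fi
  if [ ! -r "$cfg" ]; then
    echo "config not readable: $cfg" >&2
    return 2
  fi
  echo "ok"
}

# Usage (paths taken from the startup script):
#   preflight /data/webhook-dingtalk/prometheus-webhook-dingtalk \
#             /data/webhook-dingtalk/jms_config.yml
```

Running the check before `systemctl start` makes a mismatch between the script and the actual file locations easier to spot.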
3. Configure the webhook configuration file
    more /data/webhook-dingtalk_config.yml

    ## Request timeout
    # timeout: 5s

    ## Uncomment following line in order to write template from scratch (be careful!)
    #no_builtin_template: true

    ## Customizable templates path
    #templates:
    #  - contrib/templates/legacy/template.tmpl

    ## You can also override default template using `default_message`
    ## The following example to use the 'legacy' template from v0.3.0
    #default_message:
    #  title: '{{ template "legacy.title" . }}'
    #  text: '{{ template "legacy.content" . }}'

    ## Targets, previously was known as "profiles"
    targets:
      webhook1:
        url: https://oapi.dingtalk.com/robot/send?access_token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
        # secret for signature
        secret: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
      #webhook2:
      #  url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
      webhook_legacy:
        url: https://oapi.dingtalk.com/robot/send?access_token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
        secret: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
        # Customize template content
        message:
          # Use legacy template
          title: '{{ template "legacy.title" . }}'
          text: '{{ template "legacy.content" . }}'
      #webhook_mention_all:
      #  url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx
      #  mention:
      #    all: true
      webhook_mention_users:
        url: https://oapi.dingtalk.com/robot/send?access_token=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
        secret: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
        mention:
          mobiles: ['XXXXXXXXXXXX']
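Each key under `targets` is served by the webhook at the path `/dingtalk/<target>/send`, and that is the URL Alertmanager has to call. The sketch below prints the URL for each target enabled in the file above. The `target_url` helper is hypothetical, and `127.0.0.1` is an assumed host; port 8060 comes from the startup script.

```shell
#!/bin/bash
# Hypothetical helper: build the URL under which a configured target is served.
target_url() {
  host=$1
  port=$2
  target=$3
  printf 'http://%s:%s/dingtalk/%s/send\n' "$host" "$port" "$target"
}

# Print the URL for every target enabled in the configuration file.
for t in webhook1 webhook_legacy webhook_mention_users; do
  target_url 127.0.0.1 8060 "$t"
done
```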
4. Configure alertmanager.yml
    more /data/dm-deploy/alertmanager-9093/conf/alertmanager.yml

    global:
      # The smarthost and SMTP sender used for mail notifications.
      smtp_smarthost: "localhost:25"
      smtp_from: "alertmanager@example.org"
      smtp_auth_username: "alertmanager"
      smtp_auth_password: "password"
      # smtp_require_tls: true

      # The Slack webhook URL.
      # slack_api_url: ''

    route:
      # A default receiver
      receiver: "webhook"

      # The labels by which incoming alerts are grouped together. For example,
      # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
      # be batched into a single group.
      group_by: ["env", "instance", "alertname", "type", "group", "job"]

      # When a new group of alerts is created by an incoming alert, wait at
      # least 'group_wait' to send the initial notification.
      # This way ensures that you get multiple alerts for the same group that start
      # firing shortly after another are batched together on the first
      # notification.
      group_wait: 30s

      # When the first notification was sent, wait 'group_interval' to send a batch
      # of new alerts that started firing for that group.
      group_interval: 3m

      # If an alert has successfully been sent, wait 'repeat_interval' to
      # resend them.
      repeat_interval: 3m

      routes:
      # - match:
      #   receiver: webhook-kafka-adapter
      #   continue: true
      # - match:
      #     env: test-cluster
      #   receiver: db-alert-slack
      # - match:
      #     env: test-cluster
      #   receiver: db-alert-email

    # The IP address below is the address of the machine where the webhook is deployed
    receivers:
      - name: 'webhook'
        webhook_configs:
          - send_resolved: true
            url: 'http://XX.XX.XX.XX:8060/dingtalk/webhook1/send'

    #- name: 'db-alert-slack'
    #  slack_configs:
    #  - channel: '#alerts'
    #    username: 'db-alert'
    #    icon_emoji: ':bell:'
    #    title: '{{ .CommonLabels.alertname }}'
    #    text: '{{ .CommonAnnotations.summary }} {{ .CommonAnnotations.description }} expr: {{ .CommonLabels.expr }} http://172.0.0.1:9093/#/alerts'

    # - name: "db-alert-email"
    #   email_configs:
    #     - send_resolved: true
    #       to: "example@example.com"

    # This doesn't alert anything, please configure your own receiver
    #- name: "blackhole"
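To check the whole chain (Alertmanager, webhook, DingTalk) without waiting for a real alert, one option is to post a test alert to Alertmanager's `/api/v2/alerts` endpoint. The sketch below is one way to do that. The helper names, the alert name `WebhookSmokeTest`, and the label values are illustrative assumptions; replace the Alertmanager address with your own.

```shell
#!/bin/bash
# Hypothetical helper: build a minimal alert list for POST /api/v2/alerts.
# Values are inserted without JSON escaping, so avoid quotes in them.
test_alert_json() {
  alertname=$1
  env_label=$2
  printf '[{"labels":{"alertname":"%s","env":"%s","level":"warning"},"annotations":{"summary":"manual test alert"}}]\n' \
    "$alertname" "$env_label"
}

# Hypothetical helper: send the test alert to an Alertmanager base URL.
send_test_alert() {
  am_url=$1
  test_alert_json WebhookSmokeTest tidb-test |
    curl -sS -X POST -H 'Content-Type: application/json' \
      --data-binary @- "$am_url/api/v2/alerts"
}

# Usage (assumed address):
#   send_test_alert http://127.0.0.1:9093
```

With the `group_wait: 30s` setting shown above, the DingTalk message should arrive roughly half a minute after the alert is posted.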
5. Configure the start-on-boot service unit
    more /etc/systemd/system/prometheus-webhook.service

    [Unit]
    Description=prometheus-webhook service
    After=syslog.target network.target remote-fs.target nss-lookup.target

    [Service]
    LimitNOFILE=1000000
    LimitSTACK=10485760
    User=tidb
    ExecStart=/data/webhook-dingtalk/webhook-dingtalk.sh
    Restart=always
    RestartSec=15s

    [Install]
    WantedBy=multi-user.target
6. Start the webhook
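    # Reload systemd after creating the unit file, and enable start on boot
    sudo systemctl daemon-reload
    sudo systemctl enable prometheus-webhook.service
    # Start the webhook
    sudo systemctl start prometheus-webhook.service
    # Stop the webhook
    sudo systemctl stop prometheus-webhook.service
    # Check the service status
    sudo systemctl status -l prometheus-webhook.service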
7. Restart Alertmanager so that the alert configuration takes effect
    tiup cluster stop tidb-test -N x:9093
    tiup cluster start tidb-test -N x:9093
    # Check the status after startup
    tiup cluster display tidb-test -N x:9093
8. Example alert
    [FIRING:1] tidb_tikvclient_backoff_seconds_count
    Alerts Firing
    TiDB tikvclient_backoff_count error
    Description: cluster: tidb-test, instance: xxxx:10081, values:253.33333333333331
    Graph:
    Details:
    alertname: tidb_tikvclient_backoff_seconds_count
    cluster: tidb-test
    env: tidb-test
    expr: increase( tidb_tikvclient_backoff_seconds_count[10m] ) > 10
    instance: xxxx:10081
    job: tidb
    level: warning
    monitor: prometheus
    type: regionMiss
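9. Notes
Note that TiUP overrides the configuration of the monitoring components with its own parameters. If you edit a monitoring component's configuration file directly, your changes may be overwritten by TiUP during operations such as deploy, scale-out, scale-in, and reload, and will then no longer take effect.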
alertmanager_servers
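- config_file: this field specifies a local file that is transferred to the target machine during the cluster configuration initialization phase and used as the Alertmanager configuration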
    alertmanager_servers:
      - host: 172.16.201.64
        ssh_port: 22
        web_port: 9093
        cluster_port: 9094
        deploy_dir: /data1/tidb-deploy/alertmanager-9093
        data_dir: /data1/tidb-data/alertmanager-9093
        log_dir: /data1/tidb-deploy/alertmanager-9093/log
        arch: amd64
        os: linux
        config_file: /data1/tidb-deploy/alertmanager-9093/conf/alertmanager_test.yml
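One way to apply a `config_file` change is to edit the topology with `tiup cluster edit-config` and then reload only the Alertmanager role. The sketch below wraps those two commands in a dry-run helper so the sequence can be reviewed before it is executed. The `run` and `apply_alertmanager_config` helpers and the `DRY_RUN` variable are hypothetical; `tidb-test` is the cluster name used earlier in this article.

```shell
#!/bin/bash
# Hypothetical wrapper: print commands when DRY_RUN=1 (the default),
# execute them when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Edit the topology (add or change config_file), then reload Alertmanager only.
apply_alertmanager_config() {
  cluster=$1
  run tiup cluster edit-config "$cluster"
  run tiup cluster reload "$cluster" -R alertmanager
}

# Usage:
#   apply_alertmanager_config tidb-test            # dry run, prints the commands
#   DRY_RUN=0 apply_alertmanager_config tidb-test  # actually runs them
```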
2. Modify alerts
1. Find the relevant alert rule in the Prometheus conf directory
    [tidb@vm172-16-201-64 ~]$ cd /data/tidb-deploy/prometheus-9090/conf/
    [tidb@vm172-16-201-64 conf]$ ll
    total 96
    -rw-r--r-- 1 tidb tidb  3500 Jun 28 15:34 binlog.rules.yml
    -rw-r--r-- 1 tidb tidb  4492 Jun 28 15:34 blacker.rules.yml
    -rw-r--r-- 1 tidb tidb    37 Jun 28 15:34 bypass.rules.yml
    -rw-r--r-- 1 tidb tidb  1964 Jun 28 15:34 kafka.rules.yml
    -rw-r--r-- 1 tidb tidb   459 Jun 28 15:34 lightning.rules.yml
    -rw-r--r-- 1 tidb tidb   507 Jun 28 15:34 ngmonitoring.toml
    -rw-r--r-- 1 tidb tidb  5214 Jun 28 15:34 node.rules.yml
    -rw-r--r-- 1 tidb tidb  7920 Jun 28 15:34 pd.rules.yml
    -rw-r--r-- 1 tidb tidb  6199 Jun 28 15:34 prometheus.yml
    -rw-r--r-- 1 tidb tidb  6507 Jun 28 15:34 ticdc.rules.yml
    -rw-r--r-- 1 tidb tidb  6271 Jun 28 15:34 tidb.rules.yml
    -rw-r--r-- 1 tidb tidb  3112 Jun 28 15:34 tiflash.rules.yml
    -rw-r--r-- 1 tidb tidb  4685 Jun 28 15:34 tikv.accelerate.rules.yml
    -rw-r--r-- 1 tidb tidb 13977 Jun 28 15:34 tikv.rules.yml
2. Back up the file, then modify the alert rule
    cp tidb.rules.yml tidb.rules.yml_20220628
    vi tidb.rules.yml
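The backup step can be scripted so that the date suffix is generated automatically, and the edited file can be validated with `promtool check rules` before Prometheus is restarted. The sketch below assumes `promtool` is available on the `PATH`, which may not be the case on a TiUP-managed host. Both helper names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: copy a rule file to <file>_<YYYYMMDD> and print the backup path.
backup_rule_file() {
  file=$1
  backup="${file}_$(date +%Y%m%d)"
  cp -p "$file" "$backup" && echo "$backup"
}

# Hypothetical helper: validate a rule file if promtool is installed.
check_rule_file() {
  if command -v promtool >/dev/null 2>&1; then
    promtool check rules "$1"
  else
    echo "promtool not found, skipping rule check" >&2
  fi
}

# Usage:
#   backup_rule_file tidb.rules.yml
#   vi tidb.rules.yml
#   check_rule_file tidb.rules.yml
```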
3. Restart Prometheus so that the change takes effect
    tiup cluster stop tidb-jms -N x:9090
    tiup cluster start tidb-jms -N x:9090
    # Check the status after startup
    tiup cluster display tidb-jms -N x:9090
4. Temporary silences
https://yunlzheng.gitbook.io/prometheus-book/parti-prometheus-ji-chu/alert/alert-manager-inhibit
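Users or administrators can temporarily mute specific alert notifications directly in the Alertmanager UI. A silence is defined by label matching rules (exact strings or regular expressions); if a new alert notification matches a silence, Alertmanager stops sending it to the receiver.
In the Alertmanager UI, click "New Silence" to open the form for creating a silence: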
1. Create a silence
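In this form you set the start time and duration of the new silence, and add one or more matching rules (string match or regular expression match) in the Matchers section. After filling in the creator and the reason for the silence, click "Create".
"Preview Alerts" shows the alerts that the current matchers would match. Once the silence is created, Alertmanager loads it and sets its state to Pending; when the silence takes effect, its state changes to Active.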
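Active silences
Once a silence is in effect, the alerts it matches no longer appear on the Alerts page of Alertmanager.
Alert information
A silence that is already in effect can be ended manually by clicking its "Expire" button.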
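Silences can also be created without the UI by posting to Alertmanager's `/api/v2/silences` endpoint. The sketch below builds the request body for a silence that matches a single `alertname` exactly. It assumes GNU `date` (for `-d "@..."`) and does not JSON-escape its arguments; the helper names, the author, and the comment are illustrative.

```shell
#!/bin/bash
# Hypothetical helper: build the body for POST /api/v2/silences.
# Arguments: alert name, duration in hours, author, comment.
# Assumes GNU date; values are inserted without JSON escaping.
silence_json() {
  alertname=$1
  hours=$2
  author=$3
  comment=$4
  now=$(date -u +%s)
  starts=$(date -u -d "@$now" +%Y-%m-%dT%H:%M:%SZ)
  ends=$(date -u -d "@$((now + hours * 3600))" +%Y-%m-%dT%H:%M:%SZ)
  printf '{"matchers":[{"name":"alertname","value":"%s","isRegex":false}],"startsAt":"%s","endsAt":"%s","createdBy":"%s","comment":"%s"}\n' \
    "$alertname" "$starts" "$ends" "$author" "$comment"
}

# Hypothetical helper: create the silence on an Alertmanager base URL.
create_silence() {
  am_url=$1
  shift
  silence_json "$@" |
    curl -sS -X POST -H 'Content-Type: application/json' \
      --data-binary @- "$am_url/api/v2/silences"
}

# Usage (assumed address):
#   create_silence http://127.0.0.1:9093 tidb_tikvclient_backoff_seconds_count 2 dba "planned maintenance"
```

A silence created this way shows up in the UI like any other and can be ended there with "Expire".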