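Background
When a TiDB DR Auto-Sync cluster is deployed across two data centers, the monitoring stack should be highly available as well. To achieve this, an independent Prometheus + Alertmanager + Grafana stack is deployed in each of the two physically separate data centers, so that monitoring stays reachable through either one.
This architecture raises two concerns: whether the two monitoring stacks collect consistent data, and how to avoid sending the same alert more than once.
Approach
- Each Prometheus instance scrapes and stores the cluster metrics independently;
- Each Grafana instance uses the Prometheus instance in its own data center as its data source;
- The Alertmanager instances are configured as a cluster. Using the gossip mechanism, when several Alertmanager instances receive the same alert event, only one of them sends the notification.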
Test implementation
Test environment
TiDB v7.1.0 LTS
Deploying two monitoring stacks for a single cluster
# # Server configs are used to specify the configuration of Prometheus Server.
monitoring_servers:
  - host: 30.0.100.40
    port: 9091
    deploy_dir: "/tidb/tidb-deploy/prometheus-8249"
    data_dir: "/data/tidb-data/prometheus-8249"
    log_dir: "/data/tidb-deploy/prometheus-8249/log"
  - host: 30.0.100.42
    port: 9091
    deploy_dir: "/tidb/tidb-deploy/prometheus-8249"
    data_dir: "/data/tidb-data/prometheus-8249"
    log_dir: "/data/tidb-deploy/prometheus-8249/log"
# # Server configs are used to specify the configuration of Grafana Servers.
grafana_servers:
  - host: 30.0.100.40
    deploy_dir: /data/tidb-deploy/grafana-3000
  - host: 30.0.100.42
    deploy_dir: /data/tidb-deploy/grafana-3000
# # Server configs are used to specify the configuration of Alertmanager Servers.
alertmanager_servers:
  - host: 30.0.100.40
    deploy_dir: "/data/tidb-deploy/alertmanager-9093"
    data_dir: "/data/tidb-data/alertmanager-9093"
    log_dir: "/data/tidb-deploy/alertmanager-9093/log"
  - host: 30.0.100.42
    deploy_dir: "/data/tidb-deploy/alertmanager-9093"
    data_dir: "/data/tidb-data/alertmanager-9093"
    log_dir: "/data/tidb-deploy/alertmanager-9093/log"
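Adjusting the monitoring data path
Adjust the Grafana data source.
Check the Prometheus configuration and confirm that the Alertmanager targets are set.
Log in to Alertmanager and confirm that the Alertmanager instances have formed a cluster (TiDB configures this automatically).
HAProxy + Keepalived should be reused as a reverse proxy in front of the Prometheus instances, and the dashboard's Prometheus data source changed to point at the proxy, so that the failure of a single Prometheus does not affect the dashboard.
The HAProxy configuration is omitted here.
The dashboard configuration is as follows: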
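Webhook implementation
- Write a Go program that converts the Alertmanager webhook payload into Feishu API calls
(omitted)
- Test: use an HTTP API testing tool to confirm that the Feishu webhook adapter receives and parses the alert event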
{
  "version": "4",
  "groupKey": "123333",
  "status": "firing",
  "receiver": "target",
  "groupLabels": {"group": "group1"},
  "commonLabels": {"server": "test"},
  "commonAnnotations": {"server": "test"},
  "externalURL": "http://30.0.100.40:3000",
  "alerts": [
    {
      "labels": {"server": "test"},
      "annotations": {"server": "test"},
      "startsAt": "2023-08-12T07:20:50.52Z",
      "endsAt": "2023-08-12T09:20:50.52Z"
    }
  ]
}
2023/08/20 10:40:20 172.31.0.4 - {"version":"4","groupKey":"123333","status":"firing","Receiver":"target","GroupLabels":{"group":"group1"},"CommonLabels":{"server":"test"},"CommonAnnotations":{"server":"test"},"ExternalURL":"http://30.0.100.40:3000","Alerts":[{"labels":{"server":"test"},"annotations":{"server":"test"},"startsAt":"2023-08-12T07:20:50.52Z","endsAt":"2023-08-12T09:20:50.52Z"}]}
[GIN] 2023/08/20 - 10:40:20 | 200 | 621.879µs | 172.31.0.4 | POST "/alert-feishu"
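Since the adapter's source is omitted, the following is only a minimal sketch of the conversion step, not the program used in this test. It assumes the payload fields shown in the test request and assumes the text-message body accepted by a Feishu custom bot webhook (msg_type plus content.text). The type and function names (WebhookPayload, FeishuText, parsePayload, toFeishu) are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Alert mirrors one entry of the "alerts" array in the Alertmanager webhook payload.
type Alert struct {
	Status      string            `json:"status"`
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	StartsAt    time.Time         `json:"startsAt"`
	EndsAt      time.Time         `json:"endsAt"`
}

// WebhookPayload mirrors the top-level fields used in the test request.
type WebhookPayload struct {
	Version      string            `json:"version"`
	GroupKey     string            `json:"groupKey"`
	Status       string            `json:"status"`
	Receiver     string            `json:"receiver"`
	CommonLabels map[string]string `json:"commonLabels"`
	ExternalURL  string            `json:"externalURL"`
	Alerts       []Alert           `json:"alerts"`
}

// FeishuText is the text-message body assumed for a Feishu custom bot webhook.
type FeishuText struct {
	MsgType string `json:"msg_type"`
	Content struct {
		Text string `json:"text"`
	} `json:"content"`
}

// parsePayload decodes the JSON body posted by Alertmanager.
func parsePayload(body []byte) (WebhookPayload, error) {
	var p WebhookPayload
	err := json.Unmarshal(body, &p)
	return p, err
}

// toFeishu builds a one-line text summary of the alert group.
func toFeishu(p WebhookPayload) FeishuText {
	msg := FeishuText{MsgType: "text"}
	msg.Content.Text = fmt.Sprintf("[%s] receiver=%s alerts=%d",
		p.Status, p.Receiver, len(p.Alerts))
	return msg
}

func main() {
	body := []byte(`{"version":"4","groupKey":"123333","status":"firing","receiver":"target",
		"alerts":[{"labels":{"server":"test"},
		"startsAt":"2023-08-12T07:20:50.52Z","endsAt":"2023-08-12T09:20:50.52Z"}]}`)
	p, err := parsePayload(body)
	if err != nil {
		panic(err)
	}
	out, err := json.Marshal(toFeishu(p))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // this JSON would be POSTed to the Feishu bot URL
}
```

In a real adapter these two functions would sit inside the HTTP handler for /alert-feishu, with the marshalled message forwarded to the bot's webhook URL.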
Configuring the Alertmanager webhook
- Write an Alertmanager configuration file template, add the receiver and webhook definitions, and store it on the TiUP control machine
# Fragment of the Alertmanager configuration; routes sits under the top-level route section.
routes:
  - match:
    receiver: webhook-feishu-adapter
    continue: true
receivers:
  - name: 'webhook-feishu-adapter'
    webhook_configs:
      - send_resolved: true
        url: 'http://30.0.100.42:9999/alert-feishu'
- Run tiup cluster edit-config and add config_file under each alertmanager_servers entry, pointing to the Alertmanager configuration file written in the previous step
alertmanager_servers:
  - host: 30.0.100.40
    ssh_port: 22
    web_port: 9093
    cluster_port: 9094
    deploy_dir: /data/tidb-deploy/alertmanager-9093
    data_dir: /data/tidb-data/alertmanager-9093
    log_dir: /data/tidb-deploy/alertmanager-9093/log
    arch: arm64
    os: linux
    config_file: /home/tidb/monitor-template/alert_config_40.yaml
  - host: 30.0.100.42
    ssh_port: 22
    web_port: 9093
    cluster_port: 9094
    deploy_dir: /data/tidb-deploy/alertmanager-9093
    data_dir: /data/tidb-data/alertmanager-9093
    log_dir: /data/tidb-deploy/alertmanager-9093/log
    arch: arm64
    os: linux
    config_file: /home/tidb/monitor-template/alert_config_42.yaml
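- Trigger an alert and confirm that no duplicate notifications are produced
- Shut down the monitoring components in one data center and confirm that alerts are still delivered
- Restart the TiDB component that was stopped earlier and confirm that the alert recovery notification is triggered
(The issue seen here is a bug in the webhook code: it does not reference the recovery time.)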
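Conclusion
In a multi-data-center environment, the monitoring components need to be highly available just as the cluster itself does. This article looked at monitoring access and alert consolidation across data centers and worked through a highly available deployment of cluster monitoring that spans them.
Questions and discussion are welcome.
Reference: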
https://www.prometheus.wang/ha/alertmanager-high-availability.html