Column - A High-Availability Solution for Monitoring Components in a TiDB Same-City Dual-Data-Center Deployment

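Background

When a TiDB DR Auto-Sync cluster is deployed across two data centers, monitoring should be highly available as well. To that end, an independent Prometheus + Alertmanager + Grafana stack is deployed in each of the two physically separate data centers, so that either monitoring stack can be accessed.

This architecture has to address two issues: keeping the data collected by the two monitoring stacks consistent, and avoiding duplicate alert notifications.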

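Approach

Each of the two Prometheus instances collects and stores the cluster's monitoring data independently.

Each of the two Grafana instances uses its own local Prometheus as its data source.

The Alertmanager instances are configured as a cluster. Through the gossip mechanism, when several Alertmanager instances receive the same alert event, only one of them sends the alert notification.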
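
The deduplication idea can be illustrated with a toy model in Go. This is only a sketch under simplifying assumptions, not Alertmanager's actual code: the names peer, notificationLog, and dispatch are hypothetical, the shared map stands in for the notification log that real peers replicate over gossip, and walking the peers in order stands in for the staggered wait each peer applies before sending.

```go
package main

import "fmt"

// peer is a toy stand-in for one Alertmanager instance in the cluster.
type peer struct {
	name  string
	alive bool
}

// notificationLog stands in for the notification log that real peers
// replicate over gossip; here it is a plain map shared by all peers.
type notificationLog map[string]string

// dispatch walks the peers in cluster order and lets the first live peer
// that finds no log entry for the alert group send the notification.
// It returns the names of the peers that actually sent.
func dispatch(peers []peer, log notificationLog, groupKey string) []string {
	var senders []string
	for _, p := range peers {
		if !p.alive {
			continue // a failed peer cannot send
		}
		if _, sent := log[groupKey]; sent {
			continue // another peer already notified for this group
		}
		log[groupKey] = p.name
		senders = append(senders, p.name)
	}
	return senders
}

func main() {
	peers := []peer{{"30.0.100.40", true}, {"30.0.100.42", true}}

	// Both peers healthy: only the first one sends.
	fmt.Println(dispatch(peers, notificationLog{}, "group1"))

	// First peer down: the second one takes over.
	peers[0].alive = false
	fmt.Println(dispatch(peers, notificationLog{}, "group1"))
}
```

In this model exactly one peer sends per alert group, and the surviving peer takes over when the other is down, which is the behavior the dual-center setup relies on.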

Simulated Implementation

Test environment

TiDB v7.1.0 LTS

Deploy two sets of monitoring components for a single cluster

# # Server configs are used to specify the configuration of Prometheus Server.
monitoring_servers:
  - host: 30.0.100.40
    port: 9091
    deploy_dir: "/tidb/tidb-deploy/prometheus-8249"
    data_dir: "/data/tidb-data/prometheus-8249"
    log_dir: "/data/tidb-deploy/prometheus-8249/log"
  - host: 30.0.100.42
    port: 9091
    deploy_dir: "/tidb/tidb-deploy/prometheus-8249"
    data_dir: "/data/tidb-data/prometheus-8249"
    log_dir: "/data/tidb-deploy/prometheus-8249/log"

# # Server configs are used to specify the configuration of Grafana Servers.
grafana_servers:
  - host: 30.0.100.40
    deploy_dir: /data/tidb-deploy/grafana-3000
  - host: 30.0.100.42
    deploy_dir: /data/tidb-deploy/grafana-3000

# # Server configs are used to specify the configuration of Alertmanager Servers.
alertmanager_servers:
  - host: 30.0.100.40
    deploy_dir: "/data/tidb-deploy/alertmanager-9093"
    data_dir: "/data/tidb-data/alertmanager-9093"
    log_dir: "/data/tidb-deploy/alertmanager-9093/log"
  - host: 30.0.100.42
    deploy_dir: "/data/tidb-deploy/alertmanager-9093"
    data_dir: "/data/tidb-data/alertmanager-9093"
    log_dir: "/data/tidb-deploy/alertmanager-9093/log"

Adjust the monitoring data path

Adjust the Grafana datasource

Check the Prometheus configuration and make sure the Alertmanager addresses are set

Log in to Alertmanager and confirm that the Alertmanager instances have formed a cluster (here TiDB completes this configuration automatically)
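
The same check can also be scripted. The following Go sketch queries Alertmanager's GET /api/v2/status endpoint and inspects the cluster section of the response. The helper names parseStatus and clusterHealthy are hypothetical, only the fields needed here are declared, and it is assumed that the reported peer list includes the queried instance itself.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// clusterStatus mirrors the "cluster" object of the /api/v2/status
// response; only the fields used below are declared.
type clusterStatus struct {
	Status string `json:"status"`
	Peers  []struct {
		Name    string `json:"name"`
		Address string `json:"address"`
	} `json:"peers"`
}

type statusResponse struct {
	Cluster clusterStatus `json:"cluster"`
}

// parseStatus decodes the body of a status response.
func parseStatus(body []byte) (clusterStatus, error) {
	var s statusResponse
	err := json.Unmarshal(body, &s)
	return s.Cluster, err
}

// clusterHealthy reports whether the cluster is ready and has at least
// wantPeers members.
func clusterHealthy(c clusterStatus, wantPeers int) bool {
	return c.Status == "ready" && len(c.Peers) >= wantPeers
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: amcheck http://30.0.100.40:9093")
		return
	}
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(os.Args[1] + "/api/v2/status")
	if err != nil {
		fmt.Println("request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read failed:", err)
		os.Exit(1)
	}
	c, err := parseStatus(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		os.Exit(1)
	}
	for _, p := range c.Peers {
		fmt.Println("peer:", p.Name, p.Address)
	}
	// Two Alertmanager instances are expected in this deployment.
	fmt.Println("cluster healthy:", clusterHealthy(c, 2))
}
```

Running it against either Alertmanager address from the topology should list both peers when the cluster has formed.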

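In addition, the existing HAProxy + Keepalived setup should be reused as a reverse proxy in front of the Prometheus instances, and the dashboard's Prometheus data source should be changed to point to it, so that the failure of a single Prometheus does not affect the dashboard.

The HAProxy configuration is omitted here.

The dashboard is configured as follows (screenshot not available):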

Webhook Implementation

Write a Go program that converts Alertmanager webhook requests into Feishu API calls

Omitted.
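
For readers who want a starting point, here is a minimal sketch of such an adapter. It is not the author's program: it uses only the Go standard library (the log output below suggests the original used Gin), the message layout is invented for illustration, and the FEISHU_WEBHOOK_URL environment variable is a hypothetical name for the Feishu custom bot webhook address. The path /alert-feishu and port 9999 match the Alertmanager configuration later in this article.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"os"
	"strings"
)

// alert and webhookPayload cover the subset of the Alertmanager
// webhook payload that this sketch uses.
type alert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	StartsAt    string            `json:"startsAt"`
}

type webhookPayload struct {
	Status   string  `json:"status"`
	Receiver string  `json:"receiver"`
	Alerts   []alert `json:"alerts"`
}

// feishuText is the JSON body of a plain-text Feishu custom bot message.
type feishuText struct {
	MsgType string `json:"msg_type"`
	Content struct {
		Text string `json:"text"`
	} `json:"content"`
}

// toFeishu renders the whole payload as one text message.
func toFeishu(p webhookPayload) feishuText {
	var b strings.Builder
	fmt.Fprintf(&b, "[%s] %d alert(s) via %s", p.Status, len(p.Alerts), p.Receiver)
	for _, a := range p.Alerts {
		fmt.Fprintf(&b, "\n- labels=%v startsAt=%s", a.Labels, a.StartsAt)
	}
	var m feishuText
	m.MsgType = "text"
	m.Content.Text = b.String()
	return m
}

// alertHandler decodes an Alertmanager request and forwards it to Feishu.
func alertHandler(feishuURL string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var p webhookPayload
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil {
			http.Error(w, "invalid payload", http.StatusBadRequest)
			return
		}
		body, err := json.Marshal(toFeishu(p))
		if err != nil {
			http.Error(w, "encode failed", http.StatusInternalServerError)
			return
		}
		resp, err := http.Post(feishuURL, "application/json", bytes.NewReader(body))
		if err != nil {
			http.Error(w, "forward failed", http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	feishuURL := os.Getenv("FEISHU_WEBHOOK_URL") // hypothetical variable name
	if feishuURL == "" {
		fmt.Println("set FEISHU_WEBHOOK_URL to start the adapter")
		return
	}
	http.HandleFunc("/alert-feishu", alertHandler(feishuURL))
	log.Fatal(http.ListenAndServe(":9999", nil))
}
```

Keeping the conversion in a pure function such as toFeishu makes the message format easy to test without a running Alertmanager or Feishu endpoint.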

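Test: using an HTTP API testing tool, send the request body below to the adapter, then confirm from the adapter log that follows it that the Feishu webhook adapter received and parsed the alert event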

{
  "version": "4",
  "groupKey": "123333",
  "status": "firing",
  "receiver": "target",
  "groupLabels": {"group":"group1"},
  "commonLabels": {"server":"test"},
  "commonAnnotations": {"server":"test"},
  "externalURL": "http://30.0.100.40:3000",
  "alerts": [
    {
      "labels": {"server":"test"},
      "annotations": {"server":"test"},
      "startsAt": "2023-08-12T07:20:50.52Z",
      "endsAt": "2023-08-12T09:20:50.52Z"
    }
  ]
}

2023/08/20 10:40:20 172.31.0.4 - {"version":"4","groupKey":"123333","status":"firing","Receiver":"target","GroupLabels":{"group":"group1"},"CommonLabels":{"server":"test"},"CommonAnnotations":{"server":"test"},"ExternalURL":"http://30.0.100.40:3000","Alerts":[{"labels":{"server":"test"},"annotations":{"server":"test"},"startsAt":"2023-08-12T07:20:50.52Z","endsAt":"2023-08-12T09:20:50.52Z"}]}
[GIN] 2023/08/20 - 10:40:20 | 200 |     621.879µs |      172.31.0.4 | POST     "/alert-feishu"

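Configure the Alertmanager webhook

Write an Alertmanager configuration file template, add the receiver and webhook definitions to it, and store it at a path on the TiUP control machine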

  routes:
  - match:
    receiver: webhook-feishu-adapter
    continue: true

receivers:
  - name: 'webhook-feishu-adapter'
    webhook_configs:
    - send_resolved: true
      url: 'http://30.0.100.42:9999/alert-feishu'

Run tiup cluster edit-config and add config_file under each alertmanager_servers entry, with the path pointing to the Alertmanager configuration file written in the previous step

alertmanager_servers:
- host: 30.0.100.40
  ssh_port: 22
  web_port: 9093
  cluster_port: 9094
  deploy_dir: /data/tidb-deploy/alertmanager-9093
  data_dir: /data/tidb-data/alertmanager-9093
  log_dir: /data/tidb-deploy/alertmanager-9093/log
  arch: arm64
  os: linux
  config_file: /home/tidb/monitor-template/alert_config_40.yaml
- host: 30.0.100.42
  ssh_port: 22
  web_port: 9093
  cluster_port: 9094
  deploy_dir: /data/tidb-deploy/alertmanager-9093
  data_dir: /data/tidb-data/alertmanager-9093
  log_dir: /data/tidb-deploy/alertmanager-9093/log
  arch: arm64
  os: linux
  config_file: /home/tidb/monitor-template/alert_config_42.yaml