Column - A Brief Study of How TiDB Monitoring Works

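TiDB's monitoring and alerting is quite complex. The TiDB code base embeds the Prometheus client library, so a running tidb-server exposes a set of its runtime metrics on the status port, 10080. Prometheus then pulls (scrapes) that endpoint, stores the samples, and makes them available for display. To understand how this works, let's walk through the mechanism with a little Go.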

1. Creating the application

First we write a small Go program that imports the Prometheus client library and exposes the program's own metrics. The code is as follows:

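package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// promhttp.Handler serves the default registry, which already
	// includes the Go runtime and process collectors.
	http.Handle("/metrics", promhttp.Handler())
	// ListenAndServe only returns on failure, so log the error and exit.
	log.Fatal(http.ListenAndServe(":8080", nil))
}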

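After writing the code, run it. The program's current metrics can then be fetched with curl http://127.0.0.1:8080/metrics. The output looks like this (the pattern is quoted because an unquoted # would start a shell comment):

[root@vmtest ~]# curl -s http://127.0.0.1:8080/metrics | grep -v '#'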
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
go_goroutines 6
go_info{version="go1.18.1"} 1
go_memstats_alloc_bytes 2.15052e+06
go_memstats_alloc_bytes_total 2.15052e+06
go_memstats_buck_hash_sys_bytes 4224

The output consists of metrics about the Go process itself, including allocated heap memory, the number of goroutines, and so on.

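2. Custom metrics

With the application in place, we can develop metrics of our own. A metric is registered while the program runs and is then exposed on the same endpoint, where Prometheus can pull it.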

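package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Define a counter. The name and help text appear in the /metrics output.
	temp := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "test_Counter",
		Help: "The current Counter",
	})

	// Register it with the default registry; MustRegister panics on failure.
	prometheus.MustRegister(temp)
	// Hardcoded value for the demo; a real program would call Inc or Add as events occur.
	temp.Add(20)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}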

This program uses the prometheus package to create a new Counter and registers it with MustRegister. Once registered, the metric is served over the HTTP endpoint. Run the program again, and curl http://127.0.0.1:8080/metrics returns our custom test_Counter.

[root@vmtest ~]# curl -s http://127.0.0.1:8080/metrics | grep -i test
# HELP test_Counter The current Counter
# TYPE test_Counter counter
test_Counter 20

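Of course, the value here is hardcoded. A more complex program can use any of the four Prometheus metric types:

Counter: a counter that only increases
Gauge: a value that can go up or down
Histogram: samples counted into buckets
Summary: similar to a histogram, also used to analyze the distribution of samples

All that remains is to configure Prometheus to scrape this endpoint, after which it pulls the data periodically over HTTP.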
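To make the "pull" step concrete, here is a minimal sketch of a single scrape using only the Go standard library. It is not how Prometheus itself is implemented: the function names parseMetrics and scrape are hypothetical, the parser ignores optional timestamps, and an httptest server stands in for the demo application so the example runs on its own.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strconv"
	"strings"
)

// parseMetrics reads text exposition lines of the form "name{labels} value"
// and returns series -> value. Comment lines (# HELP, # TYPE) are skipped.
// Simplification: optional timestamps after the value are not handled.
func parseMetrics(text string) map[string]float64 {
	out := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		i := strings.LastIndex(line, " ")
		if i < 0 {
			continue
		}
		v, err := strconv.ParseFloat(line[i+1:], 64)
		if err != nil {
			continue
		}
		out[strings.TrimSpace(line[:i])] = v
	}
	return out
}

// scrape performs one pull: an HTTP GET followed by parsing.
func scrape(url string) (map[string]float64, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return parseMetrics(string(body)), nil
}

func main() {
	// Stand-in target that serves the same text as the test_Counter output above.
	target := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "# HELP test_Counter The current Counter\n# TYPE test_Counter counter\ntest_Counter 20\n")
	}))
	defer target.Close()

	m, err := scrape(target.URL + "/metrics")
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	fmt.Println("test_Counter =", m["test_Counter"])
}
```

Pointing scrape at http://127.0.0.1:8080/metrics while the demo program is running should return the same series; a real Prometheus server repeats this on a fixed interval and stores each result with a timestamp.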

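3. Studying monitoring in TiDB

Next, let's explore monitoring in TiDB itself, using the TiDB_query_duration alert as an example. Start with its alert expression: strip away the outer PromQL functions and the innermost element is tidb_server_handle_query_duration_seconds_bucket.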

histogram_quantile(0.99, sum(rate(tidb_server_handle_query_duration_seconds_bucket[1m])) BY (le, instance)) > 1
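To get a feel for what the outer histogram_quantile function does with the bucket series, here is a simplified Go sketch of quantile estimation by linear interpolation inside a bucket. It illustrates the idea and is not Prometheus's actual implementation; the names bucket, quantile, and sampleBuckets are hypothetical. It works on raw cumulative counts, whereas the alert expression first applies rate() over one minute and sums by le and instance. The sample data is taken from the Select output shown later in this article.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// bucket is one cumulative histogram bucket: the number of samples <= upper.
type bucket struct {
	upper float64 // the "le" label; math.Inf(1) for +Inf
	count float64 // cumulative count up to and including upper
}

// quantile estimates the q-quantile. It assumes buckets are sorted by upper
// bound and that the last bucket is +Inf.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 || q < 0 || q > 1 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	// First finite bucket whose cumulative count reaches the rank.
	i := sort.Search(len(buckets)-1, func(i int) bool { return buckets[i].count >= rank })
	if i == len(buckets)-1 {
		// The rank lies in the +Inf bucket: report the highest finite bound.
		return buckets[len(buckets)-2].upper
	}
	lower, below := 0.0, 0.0
	if i > 0 {
		lower, below = buckets[i-1].upper, buckets[i-1].count
	}
	inBucket := buckets[i].count - below
	if inBucket == 0 {
		return lower
	}
	// Assume samples are spread evenly inside the bucket.
	return lower + (buckets[i].upper-lower)*(rank-below)/inBucket
}

// sampleBuckets holds the first ten Select buckets from the article's output.
// The remaining finite buckets all read 46 and are omitted for brevity.
func sampleBuckets() []bucket {
	return []bucket{
		{0.0005, 32}, {0.001, 37}, {0.002, 39}, {0.004, 39}, {0.008, 40},
		{0.016, 45}, {0.032, 45}, {0.064, 45}, {0.128, 45}, {0.256, 46},
		{math.Inf(1), 46},
	}
}

func main() {
	b := sampleBuckets()
	fmt.Printf("p50=%.6f p99=%.5f\n", quantile(0.5, b), quantile(0.99, b))
}
```

With this data the 99th percentile lands at roughly 0.197 seconds, inside the (0.128, 0.256] bucket, so the alert condition (greater than 1 second) would not be met.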

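From the expression above and the official documentation, we can tell that this is a histogram of SQL statement execution time. In the TiDB source code, all metrics live under the metrics directory, which in turn contains the metrics of each component, and the definition studied here can be found in metrics/distsql.go. The code is simple: it uses the prometheus package to create a histogram metric named handle_query_duration_seconds. (The exposed metric name is the namespace, subsystem, and name joined with underscores. As far as I can tell, the histogram in distsql.go uses the subsystem distsql, and a histogram with the same name and the same bucket definition under the subsystem server is defined in metrics/server.go, so the analysis below applies to either.)

A histogram samples observations (here, durations) and counts them into configurable buckets, so that samples can later be selected by range. A bucket layout therefore has to be defined as well, and the definition uses the ExponentialBuckets function. It takes three arguments: the upper bound of the first bucket, the factor, and the number of buckets.

The call simply generates a series of bucket bounds: take the smallest value, multiply it by the factor 2 to get the next one, and so on until there are 29 values in total. Let's write similar code ourselves and run it.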

package main  
  
import "fmt"  
  
func main() {  
   buckets := make([]float64, 29)  
   start := 0.0005  
   for i := range buckets {  
      buckets[i] = start  
      start *= 2  
   }  
   fmt.Println("demo:", buckets)  
}

Running this code yields the bucket bounds. The smallest is 0.0005 and the largest is 134217.728.

demo: [0.0005 0.001 0.002 0.004 0.008 0.016 0.032 0.064 0.128 0.256 0.512 1.024 2.048 4.096 8.192 16.384 32.768 65.536 131.072 262.144 524.288 1048.576 2097.152 4194.304 8388.608 16777.216 33554.432 67108.864 134217.728]

Once the metric is defined, it has to be registered. Registration is done in the RegisterMetrics function in metrics/metrics.go.

prometheus.MustRegister(DistSQLQueryHistogram)

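When tidb-server is running and executes a SQL statement, the fetchResp function in distsql/select_result.go updates the histogram. At that point the job is essentially done: the data can be queried from outside through the HTTP endpoint. Below is the set of values whose sql_type is Select.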

curl http://127.0.0.1:10080/metrics | grep -i tidb_server_handle_query_duration_seconds_bucket | grep -i select
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="0.0005"} 32
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="0.001"} 37
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="0.002"} 39
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="0.004"} 39
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="0.008"} 40
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="0.016"} 45
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="0.032"} 45
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="0.064"} 45
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="0.128"} 45
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="0.256"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="0.512"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="1.024"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="2.048"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="4.096"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="8.192"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="16.384"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="32.768"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="65.536"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="131.072"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="262.144"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="524.288"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="1048.576"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="2097.152"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="4194.304"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="8388.608"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="16777.216"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="33554.432"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="67108.864"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="134217.728"} 46
tidb_server_handle_query_duration_seconds_bucket{sql_type="Select",le="+Inf"} 46
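One way to read the list above is to difference neighbouring le values, since each series counts every sample at or below its bound. The following minimal Go sketch does that for the first ten values of the output; the function name perBucket is hypothetical.

```go
package main

import "fmt"

// perBucket converts cumulative bucket counts (as exposed in the "le" series)
// into the number of samples that fell into each individual interval.
func perBucket(cumulative []uint64) []uint64 {
	out := make([]uint64, len(cumulative))
	var prev uint64
	for i, c := range cumulative {
		out[i] = c - prev
		prev = c
	}
	return out
}

func main() {
	// First ten "le" bounds and cumulative counts from the Select output.
	les := []string{"0.0005", "0.001", "0.002", "0.004", "0.008",
		"0.016", "0.032", "0.064", "0.128", "0.256"}
	cumulative := []uint64{32, 37, 39, 39, 40, 45, 45, 45, 45, 46}

	lower := "0"
	for i, n := range perBucket(cumulative) {
		fmt.Printf("(%s, %s]: %d\n", lower, les[i], n)
		lower = les[i]
	}
}
```

The per-interval counts add up to 46, which is why every series from le="0.256" upward reads 46: no Select statement took longer than 0.256 seconds.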

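Looking at these values, you may wonder why the count stops changing once it reaches 46 at le="0.256". The reason is that the bucket values are cumulative: each le series counts every sample less than or equal to that bound. So 32 samples fell at or below 0.0005, 5 fell in (0.0005, 0.001], and 2 fell in (0.001, 0.002]. The slowest sample fell in (0.128, 0.256], and every larger bucket includes all 46. Let's verify this with an ANALYZE statement. First, record the current value of each bucket in the histogram.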

curl http://127.0.0.1:10080/metrics | grep -i tidb_server_handle_query_duration_seconds_bucket | grep -i AnalyzeTable
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.001"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.002"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.004"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.008"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.016"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.032"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.064"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.128"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.256"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.512"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="1.024"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="2.048"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="4.096"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="8.192"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="16.384"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="32.768"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="65.536"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="131.072"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="262.144"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="524.288"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="1048.576"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="2097.152"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="4194.304"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="8388.608"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="16777.216"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="33554.432"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="67108.864"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="134217.728"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="+Inf"} 2

Run an ANALYZE from the MySQL client. This statement took 8.38 seconds.

MySQL [test]> analyze table stock_bak;
Query OK, 0 rows affected, 1 warning (8.38 sec)

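By our reasoning, 8.38 seconds falls into the le="16.384" bucket, that is, the interval (8.192, 16.384]. That bucket currently reads 1, so it should become 2. Because the counts are cumulative, the same holds for every bucket up to le="524.288", which all go from 1 to 2, and the buckets from le="1048.576" onward each increase by 1, from 2 to 3. Let's curl again to check whether this reasoning holds.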

tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.001"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.002"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.004"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.008"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.016"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.032"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.064"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.128"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.256"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="0.512"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="1.024"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="2.048"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="4.096"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="8.192"} 1
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="16.384"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="32.768"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="65.536"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="131.072"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="262.144"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="524.288"} 2
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="1048.576"} 3
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="2097.152"} 3
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="4194.304"} 3
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="8388.608"} 3
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="16777.216"} 3
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="33554.432"} 3
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="67108.864"} 3
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="134217.728"} 3
tidb_server_handle_query_duration_seconds_bucket{sql_type="AnalyzeTable",le="+Inf"} 3
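The update rule confirmed by this output can be sketched in a few lines of Go. This is only an illustration working directly on cumulative counts: the real client library keeps per-bucket counters and accumulates them when exporting. The names expBounds, snapshotBefore, and observe are hypothetical, and because the le="0.0005" series does not appear in the AnalyzeTable output above, the sketch starts at 0.001.

```go
package main

import "fmt"

// expBounds returns n upper bounds: start, start*factor, start*factor^2, ...
func expBounds(start, factor float64, n int) []float64 {
	bounds := make([]float64, n)
	for i := range bounds {
		bounds[i] = start
		start *= factor
	}
	return bounds
}

// snapshotBefore reproduces the AnalyzeTable values recorded before the
// statement ran: 1 up to le=524.288, then 2. counts has one extra slot at
// the end for the +Inf bucket.
func snapshotBefore() ([]float64, []uint64) {
	bounds := expBounds(0.001, 2, 28) // 0.001 ... 134217.728
	counts := make([]uint64, len(bounds)+1)
	for i := range counts {
		if i < len(bounds) && bounds[i] < 1000 {
			counts[i] = 1
		} else {
			counts[i] = 2
		}
	}
	return bounds, counts
}

// observe records one sample on cumulative counts: every bucket whose upper
// bound is >= v goes up by one, and so does +Inf.
func observe(bounds []float64, counts []uint64, v float64) {
	for i, b := range bounds {
		if v <= b {
			counts[i]++
		}
	}
	counts[len(bounds)]++
}

func main() {
	bounds, counts := snapshotBefore()
	observe(bounds, counts, 8.38) // the ANALYZE statement took 8.38 seconds
	for i, b := range bounds {
		fmt.Printf("le=%v %d\n", b, counts[i])
	}
	fmt.Printf("le=+Inf %d\n", counts[len(bounds)])
}
```

After the observation, le=8.192 stays at 1, le=16.384 through le=524.288 read 2, and everything from le=1048.576 up to +Inf reads 3, matching the curl output.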

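With this understanding, we can now think about how to add a native metric to TiDB: define a metric of your own and register it, following the approach described above.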

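Conclusion

This article briefly described how monitoring is implemented in TiDB and explained how to add a metric natively. An alternative is to obtain the values with SQL queries and expose the results to Prometheus through node_exporter (for example via its textfile collector). However, querying through SQL puts extra load on the database. Some queries against data dictionary views are slow, and once one query is delayed, the queries that keep arriving behind it pile up, which can have a severe impact on the database.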