
[TiDB Enterprise Practice: 麦谷科技] A Detailed Upgrade Best Practice: From TiDB v5.3.0 to v7.5.2

Ethan_chen  Published on 2024-10-09

Authors: 脚本小王子, Ethan_chen

1. Overview

Our production TiDB cluster is currently on v5.3.0. We need to upgrade it to v7.5.2 to resolve the issues we have hit in production and to use new features that improve cluster performance and reduce maintenance cost.

Because the upgrade spans many versions, before touching production we read every release note from v5.3.0 to v7.5.2 (this is important), recorded the items that could affect us along with the new features we cared about, and then validated everything first in a virtual machine environment we built ourselves.

The database is the lowest-level core component, so stability is critical. The upgrade path we defined was: [personal environment (VMs on a work laptop) upgrade test] -> [development environment upgrade validation] -> [test environment upgrade validation] -> [production upgrade].

After upgrading the development environment, we let the developers use it for about a week to confirm nothing was wrong, and only then upgraded the test environment. The test environment is the closest to production, so problems are most likely to surface there; this is exactly where we discovered that the TiCDC maxwell output format differed from the old version's, which saved us from a serious upgrade-induced outage. After the test environment upgrade we observed it for roughly two weeks before upgrading production.

It is worth noting that for the same version, a cluster that reached that version by upgrading and a cluster deployed directly on that version do not have the same default parameters. To make sure every problem was exposed before the production upgrade, we started our rehearsal from the version we had originally deployed. For example, we first adopted TiDB at v4.0.4, upgraded to v4.0.10, and then to our current v5.3.0; so in the non-production validation we first deployed v4.0.4, upgraded it step by step to v5.3.0, and then took a VM snapshot at that point so we could repeatedly test the v5.3.0 to v7.5.2 upgrade.

We recommend saving the full output of show global variables and show config before the upgrade. If something regresses after the upgrade, comparing the variables and configuration before and after makes it much faster to locate the cause (for example, our backups became noticeably slower after the upgrade, and the comparison showed that backup.num-threads had been changed automatically from 8 to 2).
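A minimal sketch of capturing that baseline, assuming the mysql client can reach a TiDB server at 127.0.0.1:4000 (adjust host, port, and credentials to your environment):

# Snapshot variables and configuration before the upgrade
mysql -h 127.0.0.1 -P 4000 -u root -p -e "SHOW GLOBAL VARIABLES;" > variables_before_upgrade.txt
mysql -h 127.0.0.1 -P 4000 -u root -p -e "SHOW CONFIG;" > config_before_upgrade.txt
# Repeat after the upgrade with *_after_upgrade.txt file names, then compare
diff variables_before_upgrade.txt variables_after_upgrade.txt
diff config_before_upgrade.txt config_after_upgrade.txt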

Backups are always a DBA's last ace. Before the real production upgrade we defined a full plus incremental backup strategy and applied it to a separate standby cluster. In the end the standby cluster was never activated, but knowing it existed kept us calm when problems came up during the upgrade and gave us the courage to keep trying.

The upgrade was mainly intended to achieve the following:

1.1 Fix bugs in our current version

(1) Backups take too long, and incremental backup and restore are too slow (restoring through TiDB Binlog is currently very slow).

(2) TiCDC can hang (v6.5.8 fixed a large number of TiCDC issues).

(3) Every login from a specific IP range logs the error "Failed to get user record", generating a huge volume of logs.

(4) After a tidb-server OOM, loading statistics is so slow that the business cannot recover promptly.

During TiDB startup, SQL executed before the initial statistics have finished loading may get unreasonable execution plans and hurt performance. To avoid this, v7.1.0 introduced the configuration item force-init-stats, which controls whether TiDB waits for statistics initialization to complete before serving traffic. It is enabled by default since v7.2.0 (see the configuration sketch after this list).

(5) Fix wrong query results after changing a FLOAT column to DOUBLE (fixed in v5.3.1).

(6) Fix an issue where querying the INFORMATION_SCHEMA.CLUSTER_SLOW_QUERY table could cause a TiDB server OOM, which could be triggered when viewing slow-query records in the Grafana dashboard #33893 (an important bug fix in 5.3.2, a version the official docs advise against using).
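A minimal sketch of pinning the force-init-stats behavior mentioned in item (4) through TiUP. We assume the item sits under the TiDB performance section of the topology and that the cluster is named tidb-test as in the rehearsal below; check the exact key against the docs for your target version:

tiup cluster edit-config tidb-test
# in the editor, under server_configs -> tidb, add:
#   performance.force-init-stats: true
tiup cluster reload tidb-test -R tidb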

1.2 New features that improve performance and reduce maintenance cost

(1) v6.3.0 supports automatic partitioning.

(2) v7.1.0 supports partition reorganization (splitting and merging partitions).

(3) TiDB introduced the cached table feature in v6.0.0.

(4) v6.2.0 supports point-in-time recovery (PITR), which can restore the backed-up cluster to a snapshot at any historical point in time.

(5) Starting from v7.4.0, TiDB supports the WITH ROLLUP modifier and the GROUPING function in the GROUP BY clause (see the sketch after this list).

(6) Persisted statistics-collection options via tidb_persist_analyze_options (new in v5.4.0).

(7) Reduced impact of backups on the cluster.

(8) v6.0.0 optimizes in-memory pessimistic locking, which can reduce latency by about 10% and increase QPS by about 10%.
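A minimal sketch of the WITH ROLLUP and GROUPING usage mentioned in item (5); the table test.dept_sales and its columns are invented purely for illustration:

# Per-department totals plus a grand-total row produced by WITH ROLLUP
mysql -h 127.0.0.1 -P 4000 -u root -p -e "
    SELECT dept, GROUPING(dept) AS is_rollup_row, SUM(amount) AS total_amount
    FROM test.dept_sales
    GROUP BY dept WITH ROLLUP;"

Ordinary group rows have is_rollup_row = 0; the extra summary row added by WITH ROLLUP has is_rollup_row = 1.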

2. Performance after the upgrade

(1) Better cluster stability: after upgrading to 7.5.2, the number of TiDB cluster OOMs dropped sharply compared with before.

(2) The TiKV component now reclaims space through its own GC, so we no longer need to restart TiKV nodes to help GC reclaim regions.

(3) Table TTL (Time To Live) now cleans up expired data, removing the database load previously caused by our expiry-cleanup scripts (see the sketch after the figures below).

(4) As the figures below show, both database response time and TiDB server memory improved after this upgrade.

[Figure: database response time before and after the upgrade]

[Figure: TiDB server memory usage before and after the upgrade]
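A minimal sketch of the table TTL cleanup mentioned in item (3); the table test.device_log, its create_time column, and the 30-day window are illustrative and should follow your own retention policy:

# Let TiDB's background TTL jobs delete rows whose create_time is older than 30 days
mysql -h 127.0.0.1 -P 4000 -u root -p -e "
    ALTER TABLE test.device_log TTL = create_time + INTERVAL 30 DAY;"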

 

3. Upgrade rehearsal (non-production upgrade)

3.1 Pre-upgrade checks

3.1.1 Stop related scheduled jobs

Stop every scheduled job that performs backup, restore, or DDL operations.

3.1.2 Check server-version

Make sure server-version is either empty or set to the real TiDB version currently running, to avoid unexpected behavior:

mysql> show config where name like '%server-version%';
+------+---------------------+----------------+-------+
| Type | Instance            | Name           | Value |
+------+---------------------+----------------+-------+
| tidb | 192.168.68.129:4000 | server-version |       |
| tidb | 192.168.68.128:4000 | server-version |       |
+------+---------------------+----------------+-------+
2 rows in set, 1 warning (0.07 sec)
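If the value needs to be cleared or corrected, a minimal sketch using TiUP (we assume the item is set under server_configs -> tidb in the cluster topology; the reload restarts the TiDB instances):

tiup cluster edit-config tidb-test
# under server_configs -> tidb, set:
#   server-version: ""
tiup cluster reload tidb-test -R tidb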

3.1.3 System architecture check

When deploying TiFlash on Linux AMD64 hardware, the CPU must support the AVX2 instruction set; the following command must produce output:

cat /proc/cpuinfo | grep avx2

When deploying TiFlash on Linux ARM64 hardware, the CPU must support the ARMv8 architecture; the following command must produce output:

cat /proc/cpuinfo | grep 'crc32' | grep 'asimd'

Note: for a self-test environment built on virtual machines whose CPUs do not support AVX2, you can work around this by modifying the TiFlash startup script (production must support AVX2). However, that file gets overwritten during the upgrade, which then makes the upgrade fail, so you have to keep checking whether the script has been overwritten and change it back promptly. The following script can be run on every TiFlash node to monitor and patch the file in real time (be sure to start this script before kicking off the upgrade):

function update_tiflash_script()
{
        # Path of the run_tiflash.sh script -- adjust this to your actual deployment path
        scripts_path='/data/tidb-deploy/tiflash-9000/scripts'

        now=$(date +%F%T | sed -r 's/-|://g')
        # Back up the original script first
        cp "${scripts_path}/run_tiflash.sh" "${scripts_path}/run_tiflash.sh.${now}"
        while true; do
                echo "Monitoring '${scripts_path}/run_tiflash.sh'; press Ctrl+C to stop this script once the upgrade has finished"
                isExist=$(grep -c 'required_cpu_flags' "${scripts_path}/run_tiflash.sh")
                if [ "${isExist}" != "0" ]; then
                        isModified=$(grep -c 'required_cpu_flags="avx"' "${scripts_path}/run_tiflash.sh")
                        if [ "${isModified}" = "0" ]; then
                                echo "Not found 'required_cpu_flags=\"avx\"', trying to modify..."
                                # Comment out the original assignment
                                sed -i 's/required_cpu_flags=/# required_cpu_flags=&/g' "${scripts_path}/run_tiflash.sh"
                                # Then insert the relaxed requirement above it
                                sed -i '/# required_cpu_flags=/i\    required_cpu_flags="avx"' "${scripts_path}/run_tiflash.sh"
                        fi
                fi
                sleep 0.5
        done
}
update_tiflash_script

Note that although this workaround prevents the whole-cluster upgrade from failing because the TiFlash upgrade fails, after the cluster upgrade succeeds TiFlash itself still fails to start because the CPU does not support AVX2.

Reference: https://asktug.com/t/topic/1021704/22

 

3.1.4 Prometheus considerations

When upgrading a cluster from a version earlier than v5.3 to v5.3 or later, the Prometheus deployed by default is upgraded from v2.8.1 to v2.27.1, which adds features and fixes a security risk. Compared with v2.8.1, Prometheus v2.27.1 changes the Alert time format; see the corresponding Prometheus commit for details.

 

3.2 Performing the upgrade

3.2.1 Upgrade TiUP

The tiup version must be at least 1.11.3:

tiup update --self
tiup --version

3.2.2 Upgrade TiUP Cluster

The tiup cluster version must be at least 1.11.3:

tiup update cluster
tiup cluster --version

3.2.3 Make sure no DDL is running

mysql> admin show ddl;
+------------+--------------------------------------+---------------------+--------------+--------------------------------------+-------+
| SCHEMA_VER | OWNER_ID                             | OWNER_ADDRESS       | RUNNING_JOBS | SELF_ID                              | QUERY |
+------------+--------------------------------------+---------------------+--------------+--------------------------------------+-------+
|        100 | cad28f9e-dcda-4782-8e14-c792604d4275 | 192.168.68.128:4000 |              | cad28f9e-dcda-4782-8e14-c792604d4275 |       |
+------------+--------------------------------------+---------------------+--------------+--------------------------------------+-------+
1 row in set (0.01 sec)

Note: do not run DDL during the upgrade.

 

3.2.4 Make sure no backup or restore is running

mysql> show backups;
Empty set (0.00 sec)
mysql> show restores;
Empty set (0.00 sec)

3.2.5 Check the current cluster health

[root@localhost ~]# tiup cluster check tidb-test --cluster
Checking updates for component cluster... Timedout (after 2s)
+ Download necessary tools
...... <some logs omitted>
Checking region status of the cluster tidb-test...
All regions are healthy.
[root@localhost ~]#

When the check finishes it prints the region status result. If the result is "All regions are healthy", every region in the cluster is healthy and you can proceed with the upgrade. If the result is "Regions are not fully healthy: m miss-peer, n pending-peer" together with "Please fix unhealthy regions before other operations.", some regions are in an abnormal state; fix them first and only continue once the check reports "All regions are healthy" again.

If errors are reported, you can first try the automatic fix:

tiup cluster check tidb-test --cluster --apply

3.2.6 Upgrade the TiDB cluster

tiup cluster upgrade tidb-test v7.5.2

This step restarts each component:

[root@localhost ~]# tiup cluster upgrade tidb-test v7.5.2
Before the upgrade, it is recommended to read the upgrade guide at https://docs.pingcap.com/tidb/stable/upgrade-tidb-using-tiup and finish the preparation steps.
This operation will upgrade tidb v5.3.0 cluster tidb-test to v7.5.2:
will upgrade and restart component "            tiflash" to "v7.5.2",
will upgrade and restart component "                cdc" to "v7.5.2",
will upgrade and restart component "                 pd" to "v7.5.2",
will upgrade and restart component "               tikv" to "v7.5.2",
will upgrade and restart component "               pump" to "v7.5.2",
will upgrade and restart component "               tidb" to "v7.5.2",
will upgrade and restart component "            drainer" to "v7.5.2",
will upgrade and restart component "         prometheus" to "v7.5.2",
will upgrade and restart component "            grafana" to "v7.5.2",
will upgrade component     "node-exporter" to "",
will upgrade component "blackbox-exporter" to "".
Do you want to continue? [y/N]:(default=N) y

3.2.7 Upgrade br

(1) Backup before the upgrade

BR backup of the development environment before the upgrade (i.e., taken before step 3.2.6):

[tidb@localhost ~]$ br backup full --pd "192.168.100.164:2379" -s "local:///nfs/full_20240701" --log-file backup_full.log
Detail BR log in backup_full.log
Full backup <---------------------------------------------------------------------------------------------------------------------\..............................................................................................> 55.57%{"level":"warn","ts":"2024-07-01T16:19:11.501+0800","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"endpoint://client-1a1b68b6-4fbb-423c-806b-a471b994fbad/192.168.100.164:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Full backup <-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
Checksum <----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
[2024/07/01 16:37:57.134 +08:00] [INFO] [collector.go:65] ["Full backup success summary"] [total-ranges=8105] [ranges-succeed=8105] [ranges-failed=0] [backup-checksum=5m34.229454455s] [backup-fast-checksum=578.178902ms] [backup-total-ranges=12445] [total-take=30m12.00797445s] [BackupTS=450840825721257986] [total-kv=2812160710] [total-kv-size=221.7GB] [average-speed=122.3MB/s] [backup-data-size(after-compressed)=47.51GB] [Size=47506313613]
[tidb@localhost ~]$

Total time 30m12s; the backup is 47.5G.

 

(2) Upgrade br

# Download URL:
wget https://download.pingcap.org/tidb-community-toolkit-v7.5.2-linux-amd64.tar.gz
tar xvzf tidb-community-toolkit-v7.5.2-linux-amd64.tar.gz
cd tidb-community-toolkit-v7.5.2-linux-amd64
tar xvzf br-v7.5.2-linux-amd64.tar.gz
cp br /usr/bin
# Try a backup
su - tidb
br backup full --pd "192.168.100.164:2379" -s "local:///nfs/full_20240701_2" --log-file backup_full_2.log

(3) Backup after upgrading br

# With backup.num-threads=2
[tidb@localhost ~]$ time br backup full --pd "192.168.100.164:2379" -s "local:///nfs/full_20240701_2" --log-file backup_full_2.log
Detail BR log in backup_full_2.log
Full Backup <----.................................................................................................................................................................................................................> 1.68%{"level":"warn","ts":"2024-07-01T20:02:02.873855+0800","logger":"etcd-client","caller":"v3@v3.5.10/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001e6700/192.168.100.164:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Full Backup <---/.................................................................................................................................................................................................................> 1.69%{"level":"warn","ts":"2024-07-01T20:19:33.890588+0800","logger":"etcd-client","caller":"v3@v3.5.10/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001e6700/192.168.100.164:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Full Backup <----\................................................................................................................................................................................................................> 1.98%
Full Backup <----/................................................................................................................................................................................................................> 1.98%
Full Backup <-----................................................................................................................................................................................................................> 1.98%
Full Backup <----\................................................................................................................................................................................................................> 1.99%
Full Backup <----|................................................................................................................................................................................................................> 1.99%
Full Backup <---------------------------------------------\......................................................................................................................................................................> 21.39%{"level":"warn","ts":"2024-07-01T20:59:41.485154+0800","logger":"etcd-client","caller":"v3@v3.5.10/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001e6700/192.168.100.164:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Full Backup <-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
Checksum <----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
[2024/07/01 22:12:29.231 +08:00] [INFO] [collector.go:77] ["Full Backup success summary"] [total-ranges=8111] [ranges-succeed=8111] [ranges-failed=0] [backup-fast-checksum=1.012687562s] [backup-checksum=7m47.96059744s] [backup-total-ranges=12509] [total-take=2h12m22.072926235s] [BackupTS=450844481438089217] [total-kv=2812160811] [total-kv-size=221.7GB] [average-speed=27.91MB/s] [backup-data-size(after-compressed)=50.03GB] [Size=50026772216]
 
real    132m31.496s
user    1m27.695s
sys     1m2.156s
 
 
# With backup.num-threads=4
[tidb@localhost ~]$ time br backup full --pd "192.168.100.164:2379" -s "local:///nfs/full_20240701_3" --log-file backup_full_3.log
Detail BR log in backup_full_3.log
Full Backup <-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
Checksum <----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
[2024/07/02 00:09:55.013 +08:00] [INFO] [collector.go:77] ["Full Backup success summary"] [total-ranges=8104] [ranges-succeed=8104] [ranges-failed=0] [backup-checksum=7m5.51865244s] [backup-fast-checksum=932.893636ms] [backup-total-ranges=12509] [total-take=40m7.219175349s] [backup-data-size(after-compressed)=50.03GB] [Size=50026756234] [BackupTS=450847779061760014] [total-kv=2812160813] [total-kv-size=221.7GB] [average-speed=92.09MB/s]
 
real    40m7.655s
user    0m46.710s
sys     0m34.209s

 

# With the ratelimit parameter
[tidb@localhost ~]$ time br backup full --pd "192.168.100.164:2379" -s "local:///nfs/full_20240701_4" --ratelimit 128 --log-file backup_full_4.log
Detail BR log in backup_full_4.log
[2024/07/02 00:11:52.901 +08:00] [WARN] [backup.go:312] ["setting `--ratelimit` and `--concurrency` at the same time, ignoring `--concurrency`: `--ratelimit` forces sequential (i.e. concurrency = 1) backup"] [ratelimit=134.2MB/s] [concurrency-specified=4]
Full Backup <-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
Checksum <----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------> 100.00%
[2024/07/02 01:28:02.552 +08:00] [INFO] [collector.go:77] ["Full Backup success summary"] [total-ranges=8104] [ranges-succeed=8104] [ranges-failed=0] [backup-checksum=7m9.190860539s] [backup-fast-checksum=1.025710248s] [backup-total-ranges=12509] [total-take=1h16m9.669375594s] [total-kv=2812160813] [total-kv-size=221.7GB] [average-speed=48.51MB/s] [backup-data-size(after-compressed)=50.03GB] [Size=50026756230] [BackupTS=450848440988729349]
 
real    76m9.965s
user    1m3.750s
sys     0m49.402s

(4) Backup comparison before and after the upgrade

                                            Backup time    Backup size
Before the upgrade (backup.num-threads=6)   30m12s         47.5G
After the upgrade  (backup.num-threads=2)   2h12m22s       50.0G
After the upgrade  (backup.num-threads=4)   40m7s          50.0G
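If the lower post-upgrade default explains the slowdown, the thread count can be raised again. A minimal sketch using SET CONFIG, on the assumption that this TiKV item can be adjusted online in your version (to make the change persistent, also set it under tikv in the topology via tiup cluster edit-config); host and credentials are illustrative:

# Raise the TiKV backup thread count online, then verify it
mysql -h 127.0.0.1 -P 4000 -u root -p -e 'SET CONFIG tikv `backup.num-threads` = 4;'
mysql -h 127.0.0.1 -P 4000 -u root -p -e "SHOW CONFIG WHERE name = 'backup.num-threads';"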

 

3.3 Error handling

3.3.1 Error 1

Error: init config failed: 192.168.68.128:9093: transfer from /data/tidb-deploy/alertmanager-9093/conf/alertmanager.yml to /data/tidb-deploy/alertmanager-9093/conf/alertmanager.yml failed: failed to scp /data/tidb-deploy/alertmanager-9093/conf/alertmanager.yml to tidb@192.168.68.128:/data/tidb-deploy/alertmanager-9093/conf/alertmanager.yml: Process exited with status 1

Fix: first remove alertmanager by scaling it in, then run the upgrade again:

tiup cluster scale-in tidb-test --node 192.168.68.128:9093

 

3.3.2 Error 2

Error: failed to get leader count 192.168.68.129: metric tikv_raftstore_region_count{type="leader"} not found

First try fetching this metric manually:

[root@localhost ~]# curl -sl 192.168.68.129:20180/metrics | grep 'tikv_raftstore_region_count{type="leader"}'
tikv_raftstore_region_count{type="leader"} 3

The output shows the metric can be fetched, so we can resume the previous run. First check how far it got:

[root@localhost ~]# tiup cluster audit
ID           Time                       Command
--           ----                       -------
gqmyWYvCwwx  2024-05-23T18:46:36+08:00  /root/.tiup/components/cluster/v1.15.1/tiup-cluster check ./topology.yaml
gqmz0wgydxj  2024-05-23T18:47:39+08:00  /root/.tiup/components/cluster/v1.15.1/tiup-cluster check ./topology.yaml
......
grCGsxGpBfj  2024-06-26T17:48:19+08:00  /root/.tiup/components/cluster/v1.15.2/tiup-cluster display tidb-test
grCGG9LP9dp  2024-06-26T17:51:27+08:00  /root/.tiup/components/cluster/v1.15.2/tiup-cluster scale-in tidb-test --node 192.168.68.128:9093
grCGT8Nzntq  2024-06-26T17:54:40+08:00  /root/.tiup/components/cluster/v1.15.2/tiup-cluster display tidb-test
grCGYsZn30J  2024-06-26T17:55:56+08:00  /root/.tiup/components/cluster/v1.15.2/tiup-cluster upgrade tidb-test v7.5.2
grCHxJRwQ3s  2024-06-26T18:04:30+08:00  /root/.tiup/components/cluster/v1.15.2/tiup-cluster audit
The upgrade stopped at the command with ID grCGYsZn30J, so we resume it:
[root@localhost ~]# tiup cluster replay grCGYsZn30J
Will replay the command `tiup cluster upgrade tidb-test v7.5.2`
Do you want to continue? [y/N]: (default=N) y
......
Upgraded cluster `tidb-test` successfully

Reference: https://docs.pingcap.com/zh/tidb/v7.5/upgrade-tidb-using-tiup#41-%E5%8D%87%E7%BA%A7%E6%97%B6%E6%8A%A5%E9%94%99%E4%B8%AD%E6%96%AD%E5%A4%84%E7%90%86%E5%AE%8C%E6%8A%A5%E9%94%99%E5%90%8E%E5%A6%82%E4%BD%95%E7%BB%A7%E7%BB%AD%E5%8D%87%E7%BA%A7

 

3.3.3 Error 3

[root@localhost ~]# tiup cluster upgrade tidb-test v7.5.2
Before the upgrade, it is recommended to read the upgrade guide at https://docs.pingcap.com/tidb/stable/upgrade-tidb-using-tiup and finish the preparation steps.
This operation will upgrade tidb v5.3.0 cluster tidb-test to v7.5.2:
will upgrade and restart component "            tiflash" to "v7.5.2",
will upgrade and restart component "                cdc" to "v7.5.2",
will upgrade and restart component "                 pd" to "v7.5.2",
will upgrade and restart component "               tikv" to "v7.5.2",
will upgrade and restart component "               pump" to "v7.5.2",
will upgrade and restart component "               tidb" to "v7.5.2",
will upgrade and restart component "            drainer" to "v7.5.2",
will upgrade and restart component "         prometheus" to "v7.5.2",
will upgrade and restart component "            grafana" to "v7.5.2",
will upgrade component     "node-exporter" to "",
will upgrade component "blackbox-exporter" to "".
Do you want to continue? [y/N]:(default=N) y
Upgrading cluster...
 
...... <some logs omitted>
 
  - Generate config blackbox_exporter -> 192.168.68.128 ... Done
+ [ Serial ] - UpgradeCluster
Upgrading component tiflash
        Restarting instance 192.168.68.132:9000
 
Error: failed to restart: 192.168.68.132 tiflash-9000.service, please check the instance's log(/data/tidb-deploy/tiflash-9000/log) for more detail.: timed out waiting for port 3930 to be started after 2m0s
 
Verbose debug logs has been written to /root/.tiup/logs/tiup-cluster-debug-2024-06-28-21-40-11.log.
[root@localhost ~]#

Checking the TiFlash log showed that the CPU does not support AVX2.

[Figure: TiFlash log showing that AVX2 is not supported]

After restarting the cluster, many nodes failed to come up.

[Figure: cluster status showing multiple nodes failing to start after the restart]

Apart from TiFlash (which cannot start because AVX2 is unsupported), the other nodes came back up normally once started one by one.

 

4. Production upgrade

 

4.1 Decide on the upgrade plan

[Figure: production upgrade plan]

4.2 Preliminary preparation

(1) System architecture check: when deploying TiFlash on Linux AMD64 hardware, the CPU must support the AVX2 instruction set.

# When deploying TiFlash on Linux AMD64 hardware, the CPU must support AVX2; the following command must produce output:

cat /proc/cpuinfo | grep avx2

# When deploying TiFlash on Linux ARM64 hardware, the CPU must support the ARMv8 architecture; the following command must produce output:

cat /proc/cpuinfo | grep 'crc32' | grep 'asimd'

(2) Slim down TiDB (purge expired historical data to reduce the time spent migrating leaders during the restart).

(3) Restart TiKV to release disk space (on v5.3.0 TiKV cannot release the space otherwise, so a restart is needed).

(4) Rebuild Prometheus (because a shared Prometheus had caused tiup to lose control of the Prometheus instance).

(5) Modify the DDL statements in the statistics jobs (any DDL running during the upgrade window would make the upgrade fail).

(6) Export all grant statements and apply them to the new cluster (the linked attachment contains the source code; the password is tidb). See the sketch after this list.

(7) Provide the CPU, memory, and disk specs of each node of the new cluster (so Ops can request the servers).

(8) Write the post-upgrade TiCDC toml files (the old maxwell format is incompatible with the new version's maxwell format, so TiCDC has to be rebuilt).
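A minimal sketch of exporting the grant statements mentioned in item (6), assuming root access through the mysql client; the host_ip placeholder and the output file name all_grants.sql are ours:

# Password supplied via MYSQL_PWD (or ~/.my.cnf) so the pipeline does not prompt
export MYSQL_PWD='***'
# Build one SHOW GRANTS statement per account, execute them, and save the results as runnable SQL
mysql -h host_ip -P 4000 -u root -N -e \
    "SELECT CONCAT('SHOW GRANTS FOR ''', user, '''@''', host, ''';') FROM mysql.user" \
  | mysql -h host_ip -P 4000 -u root -N \
  | sed 's/$/;/' > all_grants.sql

The resulting all_grants.sql can be replayed on the new cluster and reused later to restore the DDL privileges revoked before the upgrade (section 4.7).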

4.3 Pre-upgrade preparation

Because this is a major cross-version upgrade, we prepared for the worst case: a standby cluster with the same configuration as production, with the full production backup restored onto it in advance.

(1) Start the cluster backup at noon.

(2) Request servers for the new cluster.

(3) Comment out all maintenance jobs (archiving, backup, statistics processing).

(4) Deploy the new cluster.

(5) Set up br for the new cluster on one of its PD servers.

(6) Mount OSS on the new cluster (for backup and restore).

(7) Restore the full backup onto the new cluster.

For reference: during the validity test, the full restore took 149 minutes.

# br restore full --pd <new-cluster PD ip>:2379 -s local://<backup path> --log-file restorefull.log
# [Remember to replace the file name]
br restore full --pd host_ip:2379 -s local:///dbbak/tidbFullBak/mg_tidb_full_20240717130001 --log-file restorefull.log

(8) Take the first incremental backup on the old cluster.

# Get the TS of the previous backup [remember to change the backup directory]
LAST_BACKUP_TS=`br validate decode --field="end-version" -s local:///dbbak/tidbFullBak/mg_tidb_full_20240717130001 | tail -n1`
echo $LAST_BACKUP_TS
# Start the incremental backup
br backup full\
    --pd host_ip:2379 \
    --ratelimit 128 \
    -s local:///dbbak/tidbFullBak/mg_tidb_incr_20240717_1 \
    --lastbackupts ${LAST_BACKUP_TS}

(9) Run the first incremental restore on the new cluster.

Note: if the first incremental backup is quick, you can skip this restore and instead take one more incremental backup based on the full backup right before the real upgrade.

br restore full --pd host_ip:2379 -s local:///dbbak/tidbFullBak/mg_tidb_incr_20240717_1 --log-file restoreincr.log

 

4.4 Upgrade the cluster and components

4.4.1 Revoke all DDL privileges
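A minimal, illustrative sketch of revoking DDL-related privileges from one application account; the account name app_user is hypothetical, and in practice the statements should be generated from the grant inventory exported in section 4.2, revoking only privileges the account actually holds at that level:

mysql -h host_ip -P 4000 -u root -p -e \
    "REVOKE CREATE, ALTER, DROP, INDEX ON *.* FROM 'app_user'@'%';"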

4.4.2 Take another incremental backup

# Option 1: get the TS of the previous (full) backup [remember to change the backup directory]

# Option 2: take this incremental on top of the previous incremental backup
LAST_BACKUP_TS=`br validate decode --field="end-version" -s local:///dbbak/tidbFullBak/mg_tidb_incr_20240717_1 | tail -n1`
echo $LAST_BACKUP_TS

# [Choose only one of options 1 and 2 above, depending on your situation]

# Start the incremental backup
br backup full\
    --pd host_ip:2379 \
    --ratelimit 128 \
    -s local:///dbbak/tidbFullBak/mg_tidb_incr_20240717_2 \
    --lastbackupts ${LAST_BACKUP_TS}

4.4.3 Upgrade the cluster

4.4.3.1 System checks

(1) Stop every scheduled job that performs backup, restore, or DDL operations.

(2) Make sure server-version is either empty or set to the real TiDB version currently running, to avoid unexpected behavior:

mysql> show config where name like '%server-version%';
+------+---------------------+----------------+-------+
| Type | Instance            | Name           | Value |
+------+---------------------+----------------+-------+
| tidb | 192.168.68.129:4000 | server-version |       |
| tidb | 192.168.68.128:4000 | server-version |       |
+------+---------------------+----------------+-------+
2 rows in set, 1 warning (0.07 sec)

4.4.3.2 Upgrade TiUP

The tiup version must be at least 1.11.3:

tiup update --self
tiup --version

4.4.3.3 Upgrade TiUP Cluster

The tiup cluster version must be at least 1.11.3:

tiup update cluster
tiup cluster --version

4.4.3.4 Make sure no DDL is running

mysql> admin show ddl;
+------------+--------------------------------------+---------------------+--------------+--------------------------------------+-------+
| SCHEMA_VER | OWNER_ID                             | OWNER_ADDRESS       | RUNNING_JOBS | SELF_ID                              | QUERY |
+------------+--------------------------------------+---------------------+--------------+--------------------------------------+-------+
|        100 | cad28f9e-dcda-4782-8e14-c792604d4275 | 192.168.68.128:4000 |              | cad28f9e-dcda-4782-8e14-c792604d4275 |       |
+------------+--------------------------------------+---------------------+--------------+--------------------------------------+-------+
1 row in set (0.01 sec)

Make sure the RUNNING_JOBS column is empty.

Note: do not run DDL during the upgrade.

4.4.3.5 Make sure no backup or restore is running

mysql> show backups;
Empty set (0.00 sec)
​
mysql> show restores;
Empty set (0.00 sec)

4.4.3.6 Check the current cluster health

[root@localhost ~]# tiup cluster check mg-tidb --cluster
Checking updates for component cluster... Timedout (after 2s)
+ Download necessary tools
......
Checking region status of the cluster tidb-test...
All regions are healthy.
[root@localhost ~]#

When the check finishes it prints the region status result. If the result is "All regions are healthy", every region in the cluster is healthy and you can proceed with the upgrade. If the result is "Regions are not fully healthy: m miss-peer, n pending-peer" together with "Please fix unhealthy regions before other operations.", some regions are in an abnormal state; fix them first and only continue once the check reports "All regions are healthy" again.

If errors are reported, you can first try the automatic fix:

tiup cluster check mg-tidb --cluster --apply

4.4.3.7 Remove the TiCDC changefeeds and record the current time

Time recorded: xxxx:xxx:xxx

# Remove the cdc changefeed for table xxx_cmdinfo (repeat for the other tables)
cdc cli changefeed remove --changefeed-id cmdinfo-kafka --pd=http://host_ip:2379,http://host_ip:2379,http://host_ip:2379 --force

# After a task is removed, its replication state (mainly the replication checkpoint) is kept for 24 hours, and a task with the same name cannot be created during that window. To delete the task information completely, specify --force or -f; all of the changefeed's information is then cleaned up and a changefeed with the same name can be created immediately.

4.4.3.8 Upgrade the TiDB cluster

tiup cluster upgrade mg-tidb v7.5.2

This step restarts each component:

[root@localhost ~]# tiup cluster upgrade mg-tidb v7.5.2
Before the upgrade, it is recommended to read the upgrade guide at https://docs.pingcap.com/tidb/stable/upgrade-tidb-using-tiup and finish the preparation steps.
This operation will upgrade tidb v5.3.0 cluster tidb-test to v7.5.2:
will upgrade and restart component "            tiflash" to "v7.5.2",
will upgrade and restart component "                cdc" to "v7.5.2",
will upgrade and restart component "                 pd" to "v7.5.2",
will upgrade and restart component "               tikv" to "v7.5.2",
will upgrade and restart component "               pump" to "v7.5.2",
will upgrade and restart component "               tidb" to "v7.5.2",
will upgrade and restart component "            drainer" to "v7.5.2",
will upgrade and restart component "         prometheus" to "v7.5.2",
will upgrade and restart component "            grafana" to "v7.5.2",
will upgrade component     "node-exporter" to "",
will upgrade component "blackbox-exporter" to "".
Do you want to continue? [y/N]:(default=N) y

4.5 Upgrade TiCDC

The old TiCDC used the maxwell format and wrote one record per line to Kafka; after upgrading to v7.5.2, multiple records end up on one line. We therefore had to rebuild TiCDC and switch to the canal-json format.

4.5.1 Upgrade TiCDC

# List changefeeds (v5.3.0 command) and confirm the old changefeeds have been removed
tiup cdc cli changefeed list --pd=host_ip:2379
# An output of [] means they have all been removed
# Upgrade cdc to v7.5.2
tiup update cdc:v7.5.2
# List changefeeds (v7.5.2 command)
tiup cdc cli changefeed list --server=host_ip:8300

4.5.2 Create the changefeeds

# Recreate the changefeed for table xxx_cmdinfo
tiup cdc cli changefeed create \
    --server=172.16.5.9:8300  \
    --sink-uri="kafka://xxx.xxx.xxx.xxx:9092/ticdc_xxx_cmdinfo?protocol=canal-json&kafka-version=2.4.1&partition-num=6&max-message-bytes=67108864&replication-factor=1" \
    --changefeed-id="xxx-cmdinfo-kafka" \
    --config=xxx_cmdinfo.toml

4.5.3 Update data to recover the TiCDC changes lost during the upgrade

Take the time recorded in section 4.4.3.7 ("Remove the TiCDC changefeeds and record the current time"), move it about 10 minutes earlier to be safe, and then run an update on the affected tables for every row at or after that time (for example, update an unimportant column; here we add one second to create_time):

update xxx_cmdinfo set create_time = date_add(create_time, interval +1 second) where create_time > 'xxx:xx:xx';
 

4.5.4 Modify the Kafka consumer code

Because the format TiCDC writes to Kafka has also changed, the related Kafka consumer code has to be modified. See section 5.1 ("Abnormal maxwell-format JSON records synced to Kafka by TiCDC after upgrading to v7.5.2") for the format differences.

 

4.6 Upgrade br

# Download URL:

wget https://download.pingcap.org/tidb-community-toolkit-v7.5.2-linux-amd64.tar.gz
tar xvzf tidb-community-toolkit-v7.5.2-linux-amd64.tar.gz
cd tidb-community-toolkit-v7.5.2-linux-amd64
tar xvzf br-v7.5.2-linux-amd64.tar.gz
cp br /usr/bin
# Try a backup
# su - tidb
# br backup full --pd "host_ip:2379" -s "local:///nfs/full_20240701_2" --log-file backup_full_2.log

4.7 Restore the revoked DDL privileges

Run the grant statements saved earlier to restore the revoked DDL privileges.
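A minimal sketch, assuming the grants were exported earlier into a file such as all_grants.sql (as in the sketch in section 4.2); adjust host and credentials to your environment:

# Replay the saved grant statements to restore the revoked privileges
mysql -h host_ip -P 4000 -u root -p < all_grants.sql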

 

5. Issues encountered during the upgrade

5.1 Abnormal maxwell-format JSON records synced to Kafka by TiCDC after upgrading to v7.5.2

After the upgrade, when consuming the maxwell-format data TiCDC writes to Kafka, v5.3.0 produced one record per line, while v7.5.2 puts multiple records on one line with no delimiter between them.

To get one record per line again, the output format of the v7.5.2 changefeed has to be switched to canal-json, which did not exist in v5.3.0.

==========insert==============
# maxwell insert format (before the upgrade)
{"database":"test","table":"t","type":"insert","ts":1637823163,"data":{"create_time":"2018-01-01 00:00:00","dept":1,"id":1,"last_login_time":"2018-03-01 12:00:00","name":"user_1"}}
# canal-json insert format (after the upgrade)
{"id":0,"database":"test","table":"t","pkNames":["id"],"isDdl":false,"type":"INSERT","es":1721044339113,"ts":1721044339899,"sql":"","sqlType":{"id":4,"dept":-6,"name":12,"create_time":93,"last_login_time":93},"mysqlType":{"dept":"tinyint","name":"varchar","create_time":"datetime","last_login_time":"datetime","id":"int"},"old":null,"data":[{"id":"1","dept":"1","name":"user_1","create_time":"2018-01-01 00:00:00","last_login_time":"2018-03-01 12:00:00"}]}
 
==========update==============
# maxwell update format (before the upgrade)
{"database":"test","table":"t","type":"update","ts":1637824161,"data":{"create_time":"2021-11-25 15:09:21","dept":1,"id":1,"last_login_time":"2018-03-01 12:00:00","name":"user_1"},"old":{"create_time":"2018-01-01 00:00:00"}}
# canal-json update format (after the upgrade)
{"database": "test", "table": "t", "type": "update","ts":1637824161,"data": {"create_time":"2021-11-25 15:09:21","dept":1,"id":1,"last_login_time":"2018-03-01 12:00:00","name":"user_1"},"old":{"create_time":"2018-01-01 00:00:00"}}
 
==========delete==============
# maxwell delete format (before the upgrade)
{"database":"test","table":"t","type":"delete","ts":1637824320,"old":{"create_time":"2021-11-25 15:10:46","dept":1,"id":1,"last_login_time":"2018-03-01 12:00:00","name":"user_1"}}
# canal-json delete format (after the upgrade)
{"database": "test", "table": "t", "type": "delete","ts":1637824161,"data": {"create_time":"2021-11-25 15:09:21","dept":1,"id":1,"last_login_time":"2018-03-01 12:00:00","name":"user_1"},"old":{"create_time":"2018-01-01 00:00:00"}}
 

Solution: we had the developers modify the Kafka consumer code. Before the upgrade we stopped TiCDC replication, waited until the Kafka topics were fully consumed and the data was confirmed up to date, upgraded the TiDB cluster, and then released the new consumer code.

Reference: https://asktug.com/t/topic/1005840?replies_to_post_number=2

 

5.2 A single PD node failed to start

With guidance from teachers in the TiDB community and by reading the source code, we learned that PD startup depends on environment variables: before starting, a PD node first checks whether certain environment variables are set. We had earlier modified the environment variables to make DM easier to use, and that change prevented the PD node from starting.

Figure 1 below shows the server's environment variables; Figure 2 shows the relevant PD source code.

Figure 1:

[Figure: the server's environment variables]

Figure 2:

[Figure: the relevant PD source code]

 

5.3 Backup and restore

A newly deployed cluster has new_collations_enabled_on_first_bootstrap set to true by default, while our cluster upgraded from v5.3.0 to v7.5.2 has it set to false. Because the parameter did not match, the backup could not be restored.

The new cluster has to be deployed with new_collations_enabled_on_first_bootstrap set to false for the full restore to succeed. (This parameter only takes effect when the cluster is first bootstrapped.)
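A minimal sketch of checking which collation framework a cluster is actually running, so the old and new clusters can be compared before attempting the restore (the flag chosen at bootstrap is recorded in the mysql.tidb system table; host and credentials are illustrative):

mysql -h host_ip -P 4000 -u root -p -e \
    "SELECT VARIABLE_NAME, VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME = 'new_collation_enabled';"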

 

6. Outstanding issues after the upgrade

Our TiDB Dashboard traffic visualization on 7.5.2 has a problem: once a table generates any traffic, it keeps showing traffic even when there are no further reads or writes. This takes away our most powerful weapon for locating problems through traffic visualization, and we are still looking for a solution.

[Figure: TiDB Dashboard traffic visualization showing traffic that never clears for the table]

Details on asktug: https://asktug.com/t/topic/1029492/1

7. Summary

This TiDB cluster upgrade was an adventure full of challenges and rewards. Along the way I truly came to appreciate the value of careful planning and thorough preparation. The exchanges with and feedback from the TiDB community were like a window onto the future, showing this product's endless potential to keep improving. I am confident that TiDB will continue to deliver an ever more stable, efficient, and easy-to-use database solution, and I look forward to working with the community to find more ways to improve performance and reduce cost.

Special thanks to @升级导师-军军, @升级导师-刘培梁, @表妹, and all the other experts in the group for their strong support.

 


Copyright notice: This article is an original work by a TiDB community user and is licensed under CC BY-NC-SA 4.0. When republishing, please include a link to the original article and this notice.
