博客 - 一篇文章带你玩转 TiDB 灾难恢复

一篇文章带你玩转TiDB灾难恢复

一、背景

高可用是 TiDB 的另一大特点，TiDB/TiKV/PD 这三个组件都能容忍部分实例失效，不影响整个集群的可用性。下面分别说明这三个组件的可用性、单个实例失效后的后果以及如何恢复。

TiDB

TiDB 是无状态的，推荐至少部署两个实例，前端通过负载均衡组件对外提供服务。当单个实例失效时，会影响正在这个实例上进行的 Session，从应用的角度看，会出现单次请求失败的情况，重新连接后即可继续获得服务。单个实例失效后，可以重启这个实例或者部署一个新的实例。

PD

PD 是一个集群，通过 Raft 协议保持数据的一致性，单个实例失效时，如果这个实例不是 Raft 的 leader，那么服务完全不受影响；如果这个实例是 Raft 的 leader，会重新选出新的 Raft leader，自动恢复服务。PD 在选举的过程中无法对外提供服务，这个时间大约是3秒钟。推荐至少部署三个 PD 实例，单个实例失效后，重启这个实例或者添加新的实例。

TiKV

TiKV 是一个集群，通过 Raft 协议保持数据的一致性（副本数量可配置，默认保存三副本），并通过 PD 做负载均衡调度。单个节点失效时，会影响这个节点上存储的所有 Region。对于 Region 中的 Leader 节点，会中断服务，等待重新选举；对于 Region 中的 Follower 节点，不会影响服务。当某个 TiKV 节点失效，并且在一段时间内（默认 30 分钟）无法恢复，PD 会将其上的数据迁移到其他的 TiKV 节点上。

二、架构

wtidb28.add.shbt.qihoo.net  192.168.1.1  TiDB/PD/pump/prometheus/grafana/CCS
wtidb27.add.shbt.qihoo.net  192.168.1.2  TiDB
wtidb26.add.shbt.qihoo.net  192.168.1.3  TiDB
wtidb22.add.shbt.qihoo.net  192.168.1.4  TiKV
wtidb21.add.shbt.qihoo.net  192.168.1.5  TiKV
wtidb20.add.shbt.qihoo.net  192.168.1.6  TiKV
wtidb19.add.shbt.qihoo.net  192.168.1.7  TiKV
wtidb18.add.shbt.qihoo.net  192.168.1.8  TiKV
wtidb17.add.shbt.qihoo.net  192.168.1.9  TiFlash
wtidb16.add.shbt.qihoo.net  192.168.1.10  TiFlash

集群采用3TiDB节点，5TiKV，2TiFlash架构来测试灾难恢复，TiFlash采用的方式是先部署集群，后部署TiFlash的方式，版本3.1.0GA

在恢复之前，首先应该通过 pd-ctl 调整 PD 的配置，禁用相关的调度，包括：

 config set region-schedule-limit 0
 config set replica-schedule-limit 0
 config set leader-schedule-limit 0
 config set merge-schedule-limit 0

这样可以将恢复过程中可能的异常情况降到最少。下面将按照出现掉电故障的节点个数及
TiKV 配置副本数来分别分析，并给出解决方案

三、宕机两台测试

集群默认3副本，5台机器宕机任意两台，理论上存在三种情况，一种是3副本中，有两个副本正巧在宕机的这两台上，一种是3副本中，只有一个region在宕机的两台机器上，还一种就是宕机的两台机器里不存在某些内容的任何副本，本次我们测试让wtidb21和wtidb22两个TiKV节点宕机。

我们先看一下宕机前测试表的状况

mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily;
+----------+
| count(*) |
+----------+
|  1653394 |
+----------+
1 row in set (0.91 sec)

mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily force index (idx_day_ver_ch);
+----------+
| count(*) |
+----------+
|  1653394 |
+----------+
1 row in set (0.98 sec)

两台同时宕机后：

mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily;               
ERROR 9002 (HY000): TiKV server timeout 
mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily force index (idx_day_ver_ch);
ERROR 9005 (HY000): Region is unavailable

看一下宕机的两台store_id

/data1/tidb-ansible-3.1.0/resources/bin/pd-ctl -i -u http://192.168.1.1:2379
» store

知道是1和4

检查大于等于一半副本数在故障节点上的region

[tidb@wtidb28 bin]$ /data1/tidb-ansible-3.1.0/resources/bin/pd-ctl  -u http://192.168.1.1:2379  -d region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1,4) then . else empty end) | length>=$total-length)}'
{"id":18,"peer_stores":[4,6,1]}
{"id":405,"peer_stores":[7,4,1]}
{"id":120,"peer_stores":[4,1,6]}
{"id":337,"peer_stores":[4,5,1]}
{"id":128,"peer_stores":[4,1,6]}
{"id":112,"peer_stores":[1,4,6]}
{"id":22,"peer_stores":[4,6,1]}
{"id":222,"peer_stores":[7,4,1]}
{"id":571,"peer_stores":[4,6,1]}

在剩余正常的kv节点上执行停kv的脚本：

ps -ef|grep tikv
sh /data1/tidb/deploy/scripts/stop_tikv.sh
ps -ef|grep tikv

变更其属主，将其拷贝至tidb目录下
chown -R tidb. /home/helei/tikv-ctl

这个分情况, 如果是 region 数量太多，那么按照 region 来修复的话，速度会比较慢，并且繁琐，这个使用使用 all 操作比较便捷，但是有可能误杀有两个 peer 的副本，也就是说可能你坏的这台机器，有个region只有一个在这台机器上，但他也会只保留一个region副本在集群里
下面的操作要在所有存活的节点先执行stop kv操作（要求kv是关闭状态），然后执行

[tidb@wtidb20 tidb]$ ./tikv-ctl --db /data1/tidb/deploy/data/db unsafe-recover remove-fail-stores -s 1,4 --all-regions                                                                    
removing stores [1, 4] from configrations...
success

重启pd节点

ansible-playbook stop.yml --tags=pd
这里如果pd都关了的话，你是登不上库的
[helei@db-admin01 ~]$ /usr/local/mysql56/bin/mysql -u xxxx -h xxxx -P xxxx -pxxxx
xxxxx 
…
…
…
ansible-playbook start.yml --tags=pd

重启存活的kv节点
sh /data1/tidb/deploy/scripts/start_tikv.sh

检查没有处于leader状态的region
[tidb@wtidb28 bin]$ /data1/tidb-ansible-3.1.0/resources/bin/pd-ctl -u http://192.168.1.1:2379 -d region --jq '.regions[]|select(has("leader")|not)|{id: .id,peer_stores: [.peers[].store_id]}'
这里我没有搜到任何的非leader region，只有副本数是3，且同时挂3台机器以上，且正巧有些region全部的region都在这3台机器上，前面步骤是unsafe all-region，pd这个检查没有处于leader状态的region步骤才会显示出来，才会需要对应到表查询丢了那些数据，才需要去创建空region啥的，我这个情况，只要还保留一个副本，不管unsafe执行的是all-regions，还是指定的具体的region号，都是不需要后面的步骤

正常启动集群后，可以通过pd-ctl来观看之前的region数，理论上在使用unsafe --all-regions后，仅剩的1个region成为leader，剩余的kv节点通过raft协议将其再次复制出2个follower拷贝到其他store上
例如本案例里的
{"id":18,"peer_stores":[4,6,1]}
通过pd-ctl可以看到他现在在犹豫1,4kv节点损坏，在执行unsafe-recover remove-fail-stores --all-regions后，将1,4的移除，仅剩的6成为leader，利用raft协议在5,7节点复制出新的follower，达成3副本顺利启动集群

» region 18
{
  "id": 18,
  "start_key": "7480000000000000FF0700000000000000F8",
  "end_key": "7480000000000000FF0900000000000000F8",
  "epoch": {
    "conf_ver": 60,
    "version": 4
  },
  "peers": [
    {
      "id": 717,
      "store_id": 6
    },
    {
      "id": 59803,
      "store_id": 7
    },
    {
      "id": 62001,
      "store_id": 5
    }
  ],
  "leader": {
    "id": 717,
    "store_id": 6
  },
  "written_bytes": 0,
  "read_bytes": 0,
  "written_keys": 0,
  "read_keys": 0,
  "approximate_size": 1,
  "approximate_keys": 0
}

如果只同时挂了2台机器，那么到这里就结束了，如果只挂1台那么不用处理的

先看一下数据现在是没问题的，之前的步骤恢复的很顺利

mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily;
+----------+
| count(*) |
+----------+
|  1653394 |
+----------+
1 row in set (0.86 sec)

mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily force index (idx_day_ver_ch);
+----------+
| count(*) |
+----------+
|  1653394 |
+----------+
1 row in set (0.98 sec)

这里有个插曲

当我把1,4宕掉的节点恢复，这期间集群一直没有新的数据写入，原本是6作为leader，新生成的5，7作为follower作为副本，而恢复后，将5,7剔除，重新将1,4作为follower了，region 18还是1,4,6的store_id。

四、宕机3台测试

如果同时挂了3台及以上，那么上面的非leader步骤检查是会有内容的
我们这次让如下三台宕机：

wtidb22.add.shbt.qihoo.net  192.168.1.4  TiKV
wtidb21.add.shbt.qihoo.net  192.168.1.5  TiKV
wtidb20.add.shbt.qihoo.net  192.168.1.6  TiKV

首先，停止所有正常的tikv，本案例是wtidb19,wtidb18

看一下宕机的两台store_id

/data1/tidb-ansible-3.1.0/resources/bin/pd-ctl -i -u http://192.168.1.1:2379
» store

知道是1、4、5

检查大于等于一半副本数在故障节点上的region

[tidb@wtidb28 bin]$  /data1/tidb-ansible-3.1.0/resources/bin/pd-ctl  -u http://192.168.1.1:2379  -d region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1,4,5) then . else empty end) | length>=$total-length)}'
{"id":156,"peer_stores":[1,4,6]}
{"id":14,"peer_stores":[6,1,4]}
{"id":89,"peer_stores":[5,4,1]}
{"id":144,"peer_stores":[1,4,6]}
{"id":148,"peer_stores":[6,1,4]}
{"id":152,"peer_stores":[7,1,4]}
{"id":260,"peer_stores":[6,1,4]}
{"id":480,"peer_stores":[7,1,4]}
{"id":132,"peer_stores":[5,4,6]}
{"id":22,"peer_stores":[6,1,4]}
{"id":27,"peer_stores":[4,1,6]}
{"id":37,"peer_stores":[1,4,6]}
{"id":42,"peer_stores":[5,4,6]}
{"id":77,"peer_stores":[5,4,6]}
{"id":116,"peer_stores":[5,4,6]}
{"id":222,"peer_stores":[6,1,4]}
{"id":69,"peer_stores":[5,4,6]}
{"id":73,"peer_stores":[7,4,1]}
{"id":81,"peer_stores":[5,4,1]}
{"id":128,"peer_stores":[6,1,4]}
{"id":2,"peer_stores":[5,6,4]}
{"id":10,"peer_stores":[7,4,1]}
{"id":18,"peer_stores":[6,1,4]}
{"id":571,"peer_stores":[6,5,4]}
{"id":618,"peer_stores":[7,1,4]}
{"id":218,"peer_stores":[6,5,1]}
{"id":47,"peer_stores":[1,4,6]}
{"id":52,"peer_stores":[6,1,4]}
{"id":57,"peer_stores":[4,7,1]}
{"id":120,"peer_stores":[6,1,4]}
{"id":179,"peer_stores":[5,1,4]}
{"id":460,"peer_stores":[5,7,1]}
{"id":93,"peer_stores":[6,1,4]}
{"id":112,"peer_stores":[6,1,4]}
{"id":337,"peer_stores":[5,6,4]}
{"id":400,"peer_stores":[5,7,1]}

现在还剩两台存活

wtidb19.add.shbt.qihoo.net  192.168.1.7  TiKV
wtidb18.add.shbt.qihoo.net  192.168.1.8  TiKV

下面的操作要在所有存活的节(本案例是wtidb19和wtidb18)点先执行stop kv操作（要求kv是关闭状态），然后执行

[tidb@wtidb19 tidb]$ ./tikv-ctl --db /data1/tidb/deploy/data/db unsafe-recover remove-fail-stores -s 1,4,5 --all-regions                                                                 
removing stores [1, 4, 5] from configrations...
success

重启pd节点

ansible-playbook stop.yml --tags=pd
ansible-playbook start.yml --tags=pd

重启存活的kv节点
sh /data1/tidb/deploy/scripts/start_tikv.sh

检查没有处于leader状态的region，这里看到，1,4,5因为所有的region都在损坏的3台机器上，这些region丢弃后数据是恢复不了的

[tidb@wtidb28 tidb-ansible-3.1.0]$ /data1/tidb-ansible-3.1.0/resources/bin/pd-ctl  -u http://192.168.1.1:2379 -d region --jq '.regions[]|select(has("leader")|not)|{id: .id,peer_stores: [.peers[].store_id]}'
{"id":179,"peer_stores":[5,1,4]}
{"id":81,"peer_stores":[5,4,1]}
{"id":89,"peer_stores":[5,4,1]}

根据region ID，确认region属于哪张表

[tidb@wtidb28 tidb-ansible-3.1.0]$ curl http://192.168.1.1:10080/regions/179
{
 "region_id": 179,
 "start_key": "dIAAAAAAAAA7X2mAAAAAAAAAAwOAAAAAATQXJwE4LjQuMAAAAPwBYWxsAAAAAAD6A4AAAAAAAqs0",
 "end_key": "dIAAAAAAAAA7X3KAAAAAAAODBA==",
 "frames": [
  {
   "db_name": "hl",
   "table_name": "rpt_qdas_show_shoujizhushou_channelver_mix_daily(p201910)",
   "table_id": 59,
   "is_record": false,
   "index_name": "key2",
   "index_id": 3,
   "index_values": [
    "20191015",
    "8.4.0",
    "all",
    "174900"
   ]
  },
  {
   "db_name": "hl",
   "table_name": "rpt_qdas_show_shoujizhushou_channelver_mix_daily(p201910)",
   "table_id": 59,
   "is_record": true,
   "record_id": 230148
  }
 ]
}

这时候去看集群状态的话，

» store
{
  "count": 5,
  "stores": [
    {
      "store": {
        "id": 1,
        "address": "192.168.1.4:20160",
        "version": "3.1.0",
        "state_name": "Down"
      },
      "status": {
        "leader_weight": 1,
        "region_count": 3,
        "region_weight": 1,
        "start_ts": "1970-01-01T08:00:00+08:00"
      }
    },
    {
      "store": {
        "id": 4,
        "address": "192.168.1.5:20160",
        "version": "3.1.0",
        "state_name": "Down"
      },
      "status": {
        "leader_weight": 1,
        "region_count": 3,
        "region_weight": 1,
        "start_ts": "1970-01-01T08:00:00+08:00"
      }
    },
    {
      "store": {
        "id": 5,
        "address": "192.168.1.6:20160",
        "version": "3.1.0",
        "state_name": "Down"

监控也是没数据

库里查询也依旧被阻塞

mysql> use hl
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily;

创建空region解决unavaliable状态，这个命令要求pd，kv处于关闭状态
这里必须一个一个-r的写，要不报错：

[tidb@wtidb19 tidb]$ ./tikv-ctl --db /data1/tidb/deploy/data/db recreate-region -p 192.168.1.1:2379 -r 89,179,81
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ParseIntError { kind: InvalidDigit }', src/libcore/result.rs:1188:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
Aborted

./tikv-ctl --db /data1/tidb/deploy/data/db recreate-region -p '192.168.1.1:2379' -r 89
./tikv-ctl --db /data1/tidb/deploy/data/db recreate-region -p '192.168.1.1:2379' -r 81
./tikv-ctl --db /data1/tidb/deploy/data/db recreate-region -p '192.168.1.1:2379' -r 179

启动pd和tikv后，再次运行
[tidb@wtidb28 tidb-ansible-3.1.0]$ /data1/tidb-ansible-3.1.0/resources/bin/pd-ctl -u http://192.168.1.1:2379 -d region --jq '.regions[]|select(has("leader")|not)|{id: .id,peer_stores: [.peers[].store_id]}'
没有任何结果则符合预期

这里再次查询可以看到丢了数据，因为我们有几个region（81,89,179）都丢失了

mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily;
+----------+
| count(*) |
+----------+
|  1262523 |
+----------+
1 row in set (0.92 sec)

这里可以看到索引数据不再region（81,89,179）中，所以还跟之前一样

mysql> select count(*) from rpt_qdas_show_shoujizhushou_channelver_mix_daily force index (idx_day_ver_ch);
+----------+
| count(*) |
+----------+
|  1653394 |
+----------+
1 row in set (1.01 sec)

至此，测试完成

别忘了恢复调度：
恢复操作之后
将 PD 的调度参数还原

五、总结

看完这篇文章，相信你不会再虚TiDB的多点掉电问题的数据恢复了，正常情况下，极少数出现集群同时宕机多台机器的，如果只宕机了一台，那么并不影响集群的运行，他会自动处理，当某个 TiKV 节点失效，并且在一段时间内（默认 30 分钟）无法恢复，PD 会将其上的数据迁移到其他的 TiKV 节点上。但如果同时宕机两台，甚至3台及以上，那么看过这篇文章的你相信你一定不会再手忙脚乱不知所措了！~