
Recovering a Cluster That Lost 2 of 3 Replicas with unsafe-recover

weixiaobing · posted on 2023-03-24

In TiDB, depending on the replica rules the user has defined, a piece of data may be stored on several nodes at the same time, so that reads and writes are unaffected when a single node or a minority of nodes is temporarily offline or damaged. However, when a majority or all of a Region's replicas go offline within a short period of time, the Region becomes temporarily unavailable and cannot serve reads or writes.

If the majority of replicas of a data range are permanently damaged (for example by disk failure) and the nodes cannot be brought back online, the range stays unavailable. In that case, if the user wants to bring the cluster back into service and can tolerate data rollback or data loss, they can in principle manually remove the unavailable replicas so that the Region re-forms a majority, allowing the upper-layer application to read and write this data shard again (the data may be stale, or even empty).
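
With the default of 3 replicas, the Raft quorum is 2, so losing 2 replicas of a Region leaves the remaining copy unable to elect a leader. The configured replica count can be confirmed in pd-ctl (a minimal check, using the same pd-ctl access shown later in this article); the output includes the max-replicas setting:

» config show replication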

Cluster information

[tidb@vm116 ~]$ 
[tidb@vm116 ~]$ tiup cluster display tidb-prd
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster display tidb-prd
Cluster type:       tidb
Cluster name:       tidb-prd
Cluster version:    v5.4.3
Deploy user:        tidb
SSH type:           builtin
TLS encryption:     enabled
CA certificate:     /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt
Client private key: /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem
Client certificate: /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt
Dashboard URL:      https://10.2.103.116:32379/dashboard
Grafana URL:        http://10.2.103.116:5000
ID                  Role          Host          Ports        OS/Arch       Status   Data Dir                            Deploy Dir
--                  ----          ----          -----        -------       ------   --------                            ----------
10.2.103.116:9793   alertmanager  10.2.103.116  9793/9794    linux/x86_64  Up       /data1/tidb-data/alertmanager-9793  /data1/tidb-deploy/alertmanager-9793
10.2.103.116:5000   grafana       10.2.103.116  5000         linux/x86_64  Up       -                                   /data1/tidb-deploy/grafana-5000
10.2.103.116:32379  pd            10.2.103.116  32379/3380   linux/x86_64  Up|L|UI  /data1/tidb-data/pd-32379           /data1/tidb-deploy/pd-32379
10.2.103.116:9390   prometheus    10.2.103.116  9390/32020   linux/x86_64  Up       /data1/tidb-data/prometheus-9390    /data1/tidb-deploy/prometheus-9390
10.2.103.116:43000  tidb          10.2.103.116  43000/20080  linux/x86_64  Up       -                                   /data1/tidb-deploy/tidb-34000
10.2.103.116:30160  tikv          10.2.103.116  30160/30180  linux/x86_64  Up       /data1/tidb-data/tikv-30160         /data1/tidb-deploy/tikv-30160
10.2.103.116:30162  tikv          10.2.103.116  30162/30182  linux/x86_64  Up       /data1/tidb-data/tikv-30162         /data1/tidb-deploy/tikv-30162
10.2.103.116:30163  tikv          10.2.103.116  30163/30183  linux/x86_64  Up       /data1/tidb-data/tikv-30163         /data1/tidb-deploy/tikv-30163
Total nodes: 8

Query the data

MySQL [(none)]> use test;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MySQL [test]> select count(*) from t3;
+----------+
| count(*) |
+----------+
|  3271488 |
+----------+
1 row in set (0.00 sec)

Simulate TiKV failure by forcibly scaling in 2 TiKV nodes

[tidb@vm116 ~]$ tiup cluster scale-in tidb-prd -N 10.2.103.116:30160,10.2.103.116:30162 --force 
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster scale-in tidb-prd -N 10.2.103.116:30160,10.2.103.116:30162 --force

  ██     ██  █████  ██████  ███    ██ ██ ███    ██  ██████
  ██     ██ ██   ██ ██   ██ ████   ██ ██ ████   ██ ██
  ██  █  ██ ███████ ██████  ██ ██  ██ ██ ██ ██  ██ ██   ███
  ██ ███ ██ ██   ██ ██   ██ ██  ██ ██ ██ ██  ██ ██ ██    ██
   ███ ███  ██   ██ ██   ██ ██   ████ ██ ██   ████  ██████

Forcing scale in is unsafe and may result in data loss for stateful components.
DO NOT use `--force` if you have any component in Pending Offline status.
The process is irreversible and could NOT be cancelled.
Only use `--force` when some of the servers are already permanently offline.
Are you sure to continue?
(Type "Yes, I know my data might be lost." to continue)
: Yes, I know my data might be lost.
This operation will delete the 10.2.103.116:30160,10.2.103.116:30162 nodes in `tidb-prd` and all their data.
Do you want to continue? [y/N]:(default=N) y
The component `[tikv]` will become tombstone, maybe exists in several minutes or hours, after that you can use the prune command to clean it
Do you want to continue? [y/N]:(default=N) y
Scale-in nodes...
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [ Serial ] - ClusterOperate: operation=DestroyOperation, options={Roles:[] Nodes:[10.2.103.116:30160 10.2.103.116:30162] Force:true SSHTimeout:5 OptTimeout:120 APITimeout:600 IgnoreConfigCheck:false NativeSSH:false SSHType: Concurrency:5 SSHProxyHost: SSHProxyPort:22 SSHProxyUser:tidb SSHProxyIdentity:/home/tidb/.ssh/id_rsa SSHProxyUsePassword:false SSHProxyTimeout:5 CleanupData:false CleanupLog:false CleanupAuditLog:false RetainDataRoles:[] RetainDataNodes:[] DisplayMode:default Operation:StartOperation}
Stopping component tikv
        Stopping instance 10.2.103.116
        Stop tikv 10.2.103.116:30160 success
Destroying component tikv
        Destroying instance 10.2.103.116
Destroy 10.2.103.116 success
- Destroy tikv paths: [/data1/tidb-data/tikv-30160 /data1/tidb-deploy/tikv-30160/log /data1/tidb-deploy/tikv-30160 /etc/systemd/system/tikv-30160.service]
Stopping component tikv
        Stopping instance 10.2.103.116
        Stop tikv 10.2.103.116:30162 success
Destroying component tikv
        Destroying instance 10.2.103.116
Destroy 10.2.103.116 success
- Destroy tikv paths: [/data1/tidb-data/tikv-30162 /data1/tidb-deploy/tikv-30162/log /data1/tidb-deploy/tikv-30162 /etc/systemd/system/tikv-30162.service]
+ [ Serial ] - UpdateMeta: cluster=tidb-prd, deleted=`'10.2.103.116:30160','10.2.103.116:30162'`
+ [ Serial ] - UpdateTopology: cluster=tidb-prd
+ Refresh instance configs
  - Generate config pd -> 10.2.103.116:32379 ... Done
  - Generate config tikv -> 10.2.103.116:30163 ... Done
  - Generate config tidb -> 10.2.103.116:43000 ... Done
  - Generate config prometheus -> 10.2.103.116:9390 ... Done
  - Generate config grafana -> 10.2.103.116:5000 ... Done
  - Generate config alertmanager -> 10.2.103.116:9793 ... Done
+ Reload prometheus and grafana
  - Reload prometheus -> 10.2.103.116:9390 ... Done
  - Reload grafana -> 10.2.103.116:5000 ... Done
Scaled cluster `tidb-prd` in successfully
[tidb@vm116 ~]$ 

Querying the data returns an error

MySQL [test]> select count(*) from t3;
ERROR 9005 (HY000): Region is unavailable
MySQL [test]> 

Error logs

[2023/03/24 11:08:45.587 +08:00] [INFO] [<unknown>] ["Subchannel 0x7f4177d8e640: Retry in 1000 milliseconds"]
[2023/03/24 11:08:45.587 +08:00] [ERROR] [raft_client.rs:504] ["connection aborted"] [addr=10.2.103.116:30160] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] })))"] [store_id=1]
[2023/03/24 11:08:45.587 +08:00] [ERROR] [raft_client.rs:776] ["connection abort"] [addr=10.2.103.116:30160] [store_id=1]
[2023/03/24 11:08:46.180 +08:00] [INFO] [raft_client.rs:742] ["resolve store address ok"] [addr=10.2.103.116:30162] [store_id=5002]
[2023/03/24 11:08:46.180 +08:00] [INFO] [raft_client.rs:627] ["server: new connection with tikv endpoint"] [store_id=5002] [addr=10.2.103.116:30162]
[2023/03/24 11:08:46.181 +08:00] [INFO] [<unknown>] ["Connect failed: {\"created\":\"@1679627326.181128634\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:10.2.103.116:30162\"}"]
[2023/03/24 11:08:46.181 +08:00] [INFO] [<unknown>] ["Subchannel 0x7f4177d8ed40: Retry in 1000 milliseconds"]
[2023/03/24 11:08:46.181 +08:00] [ERROR] [raft_client.rs:504] ["connection aborted"] [addr=10.2.103.116:30162] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] })))"] [store_id=5002]
[2023/03/24 11:08:46.181 +08:00] [ERROR] [raft_client.rs:776] ["connection abort"] [addr=10.2.103.116:30162] [store_id=5002]
[2023/03/24 11:08:48.162 +08:00] [WARN] [endpoint.rs:606] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 7001, leader may None\" not_leader { region_id: 7001 }"]
[2023/03/24 11:08:48.174 +08:00] [WARN] [endpoint.rs:606] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 7001, leader may None\" not_leader { region_id: 7001 }"]
[2023/03/24 11:08:50.587 +08:00] [INFO] [raft_client.rs:742] ["resolve store address ok"] [addr=10.2.103.116:30160] [store_id=1]
[2023/03/24 11:08:50.587 +08:00] [INFO] [raft_client.rs:627] ["server: new connection with tikv endpoint"] [store_id=1] [addr=10.2.103.116:30160]
[2023/03/24 11:08:50.588 +08:00] [INFO] [<unknown>] ["Connect failed: {\"created\":\"@1679627330.588107444\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:10.2.103.116:30160\"}"]
[2023/03/24 11:08:50.588 +08:00] [INFO] [<unknown>] ["Subchannel 0x7f4177d8f440: Retry in 1000 milliseconds"]
[2023/03/24 11:08:50.588 +08:00] [ERROR] [raft_client.rs:504] ["connection aborted"] [addr=10.2.103.116:30160] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] })))"] [store_id=1]
[2023/03/24 11:08:50.588 +08:00] [INFO] [store.rs:2580] ["broadcasting unreachable"] [unreachable_store_id=1] [store_id=5001]
[2023/03/24 11:08:50.588 +08:00] [ERROR] [raft_client.rs:776] ["connection abort"] [addr=10.2.103.116:30160] [store_id=1]
[2023/03/24 11:08:51.181 +08:00] [INFO] [raft_client.rs:742] ["resolve store address ok"] [addr=10.2.103.116:30162] [store_id=5002]
[2023/03/24 11:08:51.181 +08:00] [INFO] [raft_client.rs:627] ["server: new connection with tikv endpoint"] [store_id=5002] [addr=10.2.103.116:30162]
[2023/03/24 11:08:51.182 +08:00] [INFO] [<unknown>] ["Connect failed: {\"created\":\"@1679627331.182361851\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.9.1+1.38.0/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:10.2.103.116:30162\"}"]
[2023/03/24 11:08:51.182 +08:00] [INFO] [<unknown>] ["Subchannel 0x7f4177d8fb40: Retry in 999 milliseconds"]
[2023/03/24 11:08:51.182 +08:00] [ERROR] [raft_client.rs:504] ["connection aborted"] [addr=10.2.103.116:30162] [receiver_err="Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [sink_error="Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] })))"] [store_id=5002]
[2023/03/24 11:08:51.182 +08:00] [INFO] [store.rs:2580] ["broadcasting unreachable"] [unreachable_store_id=5002] [store_id=5001]
[2023/03/24 11:08:51.182 +08:00] [ERROR] [raft_client.rs:776] ["connection abort"] [addr=10.2.103.116:30162] [store_id=5002]
^C
[tidb@vm116 log]$ 

Repair procedure (versions earlier than v6.1)

1. Pause PD scheduling

[tidb@vm116 ~]$ tiup ctl:v5.4.3 pd -u "https://10.2.103.116:32379" --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v5.4.3/ctl pd -u https://10.2.103.116:32379 --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
» config set region-schedule-limit 0
Success!
» config set replica-schedule-limit 0
Success!
» config set leader-schedule-limit 0
Success!
» config set merge-schedule-limit 0
Success!
» config set hot-region-schedule-limit 0
Success!
» 
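
Before zeroing these limits, it is worth noting their current values so that step 8 can restore the cluster's original settings rather than guessed defaults. A minimal sketch, run in the same pd-ctl session as above:

» config show

Record the region-schedule-limit, replica-schedule-limit, leader-schedule-limit, merge-schedule-limit and hot-region-schedule-limit fields from the output.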

2. Check replicas

Use pd-ctl to find the Regions that have at least half of their replicas on the failed stores. Requirement: PD must be running.

»  region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1,5002) then . else empty end) | length>=$total-length) }'
{"id":2003,"peer_stores":[1,5002,5001]}
{"id":7001,"peer_stores":[1,5002,5001]}
{"id":7005,"peer_stores":[1,5002,5001]}
{"id":7009,"peer_stores":[1,5002,5001]}

»  

3. Stop the TiKV node to be repaired

[tidb@vm116 ~]$ tiup cluster display tidb-prd
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster display tidb-prd
Cluster type:       tidb
Cluster name:       tidb-prd
Cluster version:    v5.4.3
Deploy user:        tidb
SSH type:           builtin
TLS encryption:     enabled
CA certificate:     /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt
Client private key: /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem
Client certificate: /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt
Dashboard URL:      https://10.2.103.116:32379/dashboard
Grafana URL:        http://10.2.103.116:5000
ID                  Role          Host          Ports        OS/Arch       Status   Data Dir                            Deploy Dir
--                  ----          ----          -----        -------       ------   --------                            ----------
10.2.103.116:9793   alertmanager  10.2.103.116  9793/9794    linux/x86_64  Up       /data1/tidb-data/alertmanager-9793  /data1/tidb-deploy/alertmanager-9793
10.2.103.116:5000   grafana       10.2.103.116  5000         linux/x86_64  Up       -                                   /data1/tidb-deploy/grafana-5000
10.2.103.116:32379  pd            10.2.103.116  32379/3380   linux/x86_64  Up|L|UI  /data1/tidb-data/pd-32379           /data1/tidb-deploy/pd-32379
10.2.103.116:9390   prometheus    10.2.103.116  9390/32020   linux/x86_64  Up       /data1/tidb-data/prometheus-9390    /data1/tidb-deploy/prometheus-9390
10.2.103.116:43000  tidb          10.2.103.116  43000/20080  linux/x86_64  Up       -                                   /data1/tidb-deploy/tidb-34000
10.2.103.116:30163  tikv          10.2.103.116  30163/30183  linux/x86_64  Up       /data1/tidb-data/tikv-30163         /data1/tidb-deploy/tikv-30163
Total nodes: 6
[tidb@vm116 ~]$ tiup cluster stop tidb-prd -N 10.2.103.116:30163
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster stop tidb-prd -N 10.2.103.116:30163
Will stop the cluster tidb-prd with nodes: 10.2.103.116:30163, roles: .
Do you want to continue? [y/N]:(default=N) y
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [ Serial ] - StopCluster
Stopping component tikv
        Stopping instance 10.2.103.116
        Stop tikv 10.2.103.116:30163 success
Stopping component node_exporter
Stopping component blackbox_exporter
Stopped cluster `tidb-prd` successfully

4. Run unsafe-recover

On every instance that has not failed, remove, for all Regions, the Peers located on the failed stores. Requirements: run this on every surviving machine, with its TiKV node stopped (see the loop sketch after the command output below).

[tidb@vm116 v5.4.3]$  ./tikv-ctl  --data-dir  /data1/tidb-data/tikv-30163  unsafe-recover remove-fail-stores -s 1,5002 --all-regions
[2023/03/24 11:25:47.978 +08:00] [WARN] [config.rs:612] ["compaction guard is disabled due to region info provider not available"]
[2023/03/24 11:25:47.978 +08:00] [WARN] [config.rs:715] ["compaction guard is disabled due to region info provider not available"]
removing stores [1, 5002] from configurations...
success
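
In this cluster only one TiKV store (tikv-30163) survives, so the command is run once. If several stores survive, the same command must be executed against every surviving data directory while those TiKV processes are stopped; a sketch, where /data1/tidb-data/tikv-30164 is a hypothetical second surviving store:

for dir in /data1/tidb-data/tikv-30163 /data1/tidb-data/tikv-30164; do
  ./tikv-ctl --data-dir "$dir" unsafe-recover remove-fail-stores -s 1,5002 --all-regions
done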

5. Start the repaired TiKV

[tidb@vm116 v5.4.3]$ tiup cluster start  tidb-prd -N 10.2.103.116:30163
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster start tidb-prd -N 10.2.103.116:30163
Starting cluster tidb-prd...
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [ Serial ] - StartCluster
Starting component tikv
        Starting instance 10.2.103.116:30163
        Start instance 10.2.103.116:30163 success
Starting component node_exporter
        Starting instance 10.2.103.116
        Start 10.2.103.116 success
Starting component blackbox_exporter
        Starting instance 10.2.103.116
        Start 10.2.103.116 success
+ [ Serial ] - UpdateTopology: cluster=tidb-prd
Started cluster `tidb-prd` successfully

6. Check Region leaders

Use pd-ctl to check for Regions that have no Leader. Requirement: PD must be running.

[tidb@vm116 v5.4.3]$ tiup ctl:v5.4.3 pd -u "https://10.2.103.116:32379" --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v5.4.3/ctl pd -u https://10.2.103.116:32379 --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
» region --jq '.regions[]|select(has("leader")|not)|{id: .id, peer_stores: [.peers[].store_id]}'

»

7. Data consistency check

Check data and index consistency. Requirement: PD, TiKV, and TiDB must all be running.

MySQL [test]> select count(*) from t3;
+----------+
| count(*) |
+----------+
|  3271488 |
+----------+
1 row in set (0.55 sec)

MySQL [test]> 
MySQL [test]> admin check table t3;
Query OK, 0 rows affected (0.00 sec)

MySQL [test]>
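
Only t3 is checked here. If the failed stores held Regions from many tables, each affected table should be checked; a sketch that generates ADMIN CHECK TABLE statements for every non-system table (the list of schemas to exclude is an assumption):

SELECT CONCAT('ADMIN CHECK TABLE `', table_schema, '`.`', table_name, '`;') AS check_stmt
FROM information_schema.tables
WHERE table_schema NOT IN ('mysql', 'INFORMATION_SCHEMA', 'PERFORMANCE_SCHEMA', 'METRICS_SCHEMA');

Run the generated statements one by one; any inconsistency is reported as an error.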

8. Restore scheduling (set the limits back to their pre-recovery values)

[tidb@vm116 ~]$ tiup ctl:v5.4.3 pd -u "https://10.2.103.116:32379" --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v5.4.3/ctl pd -u https://10.2.103.116:32379 --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
» config set region-schedule-limit 2000
Success!
» config set replica-schedule-limit 32
Success!
» config set leader-schedule-limit 8
Success!
» config set merge-schedule-limit 16
Success!
» config set hot-region-schedule-limit 2
Success!
» 

Edge cases

1. No such region

./tikv-ctl  --data-dir  /data3/tidb/data unsafe-recover remove-fail-stores -s 1 -r 50377
[INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[INFO] [mod.rs:479] ["encryption is disabled."]
[WARN] [config.rs:587] ["compaction guard is disabled due to region info provider not available"]
[WARN] [config.rs:682] ["compaction guard is disabled due to region info provider not available"]
removing stores [1] from configurations...
Debugger::remove_fail_stores: "No such region 50377 on the store"
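
This message indicates that the specified Region has no peer in this store's data directory, so there is nothing to remove on this store and the error can be ignored for it. To double-check where the Region's peers actually live (assuming PD is still running), pd-ctl can print the Region:

» region 50377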

2. Create an empty Region to resolve the Unavailable error

When every replica of a Region has been lost, removing failed stores alone cannot restore a majority, so the Region has to be recreated as an empty one. Requirement: PD must be running, and the TiKV node targeted by the command must be stopped.

./tikv-ctl  --ca-path  /data3/tidb/deploy/tls/ca.crt  --key-path  /data3/tidb/deploy/tls/tikv.pem --cert-path /data3/tidb/deploy/tls/tikv.crt  --data-dir  /data3/tidb/data recreate-region -p https://10.2.103.116:32379  -r 50377
[INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[INFO] [mod.rs:479] ["encryption is disabled."]
initing empty region  with peer_id ...
success

Repair with v6.1 and later

1. Cluster information

[tidb@vm116 ~]$ tiup cluster display tidb-prd 
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster display tidb-prd
Cluster type:       tidb
Cluster name:       tidb-prd
Cluster version:    v6.1.5
Deploy user:        tidb
SSH type:           builtin
TLS encryption:     enabled
CA certificate:     /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt
Client private key: /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem
Client certificate: /home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt
Dashboard URL:      https://10.2.103.116:32379/dashboard
Grafana URL:        http://10.2.103.116:5000
ID                  Role          Host          Ports        OS/Arch       Status   Data Dir                            Deploy Dir
--                  ----          ----          -----        -------       ------   --------                            ----------
10.2.103.116:9793   alertmanager  10.2.103.116  9793/9794    linux/x86_64  Up       /data1/tidb-data/alertmanager-9793  /data1/tidb-deploy/alertmanager-9793
10.2.103.116:5000   grafana       10.2.103.116  5000         linux/x86_64  Up       -                                   /data1/tidb-deploy/grafana-5000
10.2.103.116:32379  pd            10.2.103.116  32379/3380   linux/x86_64  Up|L|UI  /data1/tidb-data/pd-32379           /data1/tidb-deploy/pd-32379
10.2.103.116:9390   prometheus    10.2.103.116  9390/32020   linux/x86_64  Up       /data1/tidb-data/prometheus-9390    /data1/tidb-deploy/prometheus-9390
10.2.103.116:43000  tidb          10.2.103.116  43000/20080  linux/x86_64  Up       -                                   /data1/tidb-deploy/tidb-34000
10.2.103.116:30160  tikv          10.2.103.116  30160/30180  linux/x86_64  Up       /data1/tidb-data/tikv-30160         /data1/tidb-deploy/tikv-30160
10.2.103.116:30162  tikv          10.2.103.116  30162/30182  linux/x86_64  Up       /data1/tidb-data/tikv-30162         /data1/tidb-deploy/tikv-30162
10.2.103.116:30163  tikv          10.2.103.116  30163/30183  linux/x86_64  Up       /data1/tidb-data/tikv-30163         /data1/tidb-deploy/tikv-30163
Total nodes: 8

2. Force scale-in of 2 TiKV nodes

[tidb@vm116 ~]$ tiup cluster scale-in tidb-prd -N 10.2.103.116:30160,10.2.103.116:30162 --force 
tiup is checking updates for component cluster ...
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster scale-in tidb-prd -N 10.2.103.116:30160,10.2.103.116:30162 --force

  ██     ██  █████  ██████  ███    ██ ██ ███    ██  ██████
  ██     ██ ██   ██ ██   ██ ████   ██ ██ ████   ██ ██
  ██  █  ██ ███████ ██████  ██ ██  ██ ██ ██ ██  ██ ██   ███
  ██ ███ ██ ██   ██ ██   ██ ██  ██ ██ ██ ██  ██ ██ ██    ██
   ███ ███  ██   ██ ██   ██ ██   ████ ██ ██   ████  ██████

Forcing scale in is unsafe and may result in data loss for stateful components.
DO NOT use `--force` if you have any component in Pending Offline status.
The process is irreversible and could NOT be cancelled.
Only use `--force` when some of the servers are already permanently offline.
Are you sure to continue?
(Type "Yes, I know my data might be lost." to continue)
: Yes, I know my data might be lost.
This operation will delete the 10.2.103.116:30160,10.2.103.116:30162 nodes in `tidb-prd` and all their data.
Do you want to continue? [y/N]:(default=N) y
The component `[tikv]` will become tombstone, maybe exists in several minutes or hours, after that you can use the prune command to clean it
Do you want to continue? [y/N]:(default=N) y
Scale-in nodes...
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.116
+ [ Serial ] - ClusterOperate: operation=DestroyOperation, options={Roles:[] Nodes:[10.2.103.116:30160 10.2.103.116:30162] Force:true SSHTimeout:5 OptTimeout:120 APITimeout:600 IgnoreConfigCheck:false NativeSSH:false SSHType: Concurrency:5 SSHProxyHost: SSHProxyPort:22 SSHProxyUser:tidb SSHProxyIdentity:/home/tidb/.ssh/id_rsa SSHProxyUsePassword:false SSHProxyTimeout:5 CleanupData:false CleanupLog:false CleanupAuditLog:false RetainDataRoles:[] RetainDataNodes:[] DisplayMode:default Operation:StartOperation}
failed to delete tikv: error requesting https://10.2.103.116:32379/pd/api/v1/store/7014, response: "[PD:core:ErrStoresNotEnough]can not remove store 7014 since the number of up stores would be 2 while need 3"
, code 400
Stopping component tikv
        Stopping instance 10.2.103.116
        Stop tikv 10.2.103.116:30162 success
Destroying component tikv
        Destroying instance 10.2.103.116
Destroy 10.2.103.116 success
- Destroy tikv paths: [/data1/tidb-data/tikv-30162 /data1/tidb-deploy/tikv-30162/log /data1/tidb-deploy/tikv-30162 /etc/systemd/system/tikv-30162.service]
failed to delete tikv: error requesting https://10.2.103.116:32379/pd/api/v1/store/7013, response: "[PD:core:ErrStoresNotEnough]can not remove store 7013 since the number of up stores would be 2 while need 3"
, code 400
Stopping component tikv
        Stopping instance 10.2.103.116
        Stop tikv 10.2.103.116:30160 success
Destroying component tikv
        Destroying instance 10.2.103.116
Destroy 10.2.103.116 success
- Destroy tikv paths: [/etc/systemd/system/tikv-30160.service /data1/tidb-data/tikv-30160 /data1/tidb-deploy/tikv-30160/log /data1/tidb-deploy/tikv-30160]
+ [ Serial ] - UpdateMeta: cluster=tidb-prd, deleted=`'10.2.103.116:30160','10.2.103.116:30162'`
+ [ Serial ] - UpdateTopology: cluster=tidb-prd
+ Refresh instance configs
  - Generate config pd -> 10.2.103.116:32379 ... Done
  - Generate config tikv -> 10.2.103.116:30163 ... Done
  - Generate config tidb -> 10.2.103.116:43000 ... Done
  - Generate config prometheus -> 10.2.103.116:9390 ... Done
  - Generate config grafana -> 10.2.103.116:5000 ... Done
  - Generate config alertmanager -> 10.2.103.116:9793 ... Done
+ Reload prometheus and grafana
  - Reload prometheus -> 10.2.103.116:9390 ... Done
  - Reload grafana -> 10.2.103.116:5000 ... Done
Scaled cluster `tidb-prd` in successfully

3. Query store information

[tidb@vm116 ~]$ tiup ctl:v5.4.3 pd -u "https://10.2.103.116:32379" --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v5.4.3/ctl pd -u https://10.2.103.116:32379 --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
» store
{
  "count": 3,
  "stores": [
    {
      "store": {
        "id": 7014,
        "address": "10.2.103.116:30162",
        "version": "6.1.5",
        "peer_address": "10.2.103.116:30162",
        "status_address": "10.2.103.116:30182",
        "git_hash": "e554126f6e83a6ddc944ddc51746b6def303ec1a",
        "start_timestamp": 1679629145,
        "deploy_path": "/data1/tidb-deploy/tikv-30162/bin",
        "last_heartbeat": 1679629565041246521,
        "node_state": 1,
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "492GiB",
        "available": "424GiB",
        "used_size": "324.6MiB",
        "leader_count": 2,
        "leader_weight": 1,
        "leader_score": 2,
        "leader_size": 178,
        "region_count": 7,
        "region_weight": 1,
        "region_score": 410.17607715843184,
        "region_size": 251,
        "slow_score": 1,
        "start_ts": "2023-03-24T11:39:05+08:00",
        "last_heartbeat_ts": "2023-03-24T11:46:05.041246521+08:00",
        "uptime": "7m0.041246521s"
      }
    },
    {
      "store": {
        "id": 5001,
        "address": "10.2.103.116:30163",
        "version": "6.1.5",
        "peer_address": "10.2.103.116:30163",
        "status_address": "10.2.103.116:30183",
        "git_hash": "e554126f6e83a6ddc944ddc51746b6def303ec1a",
        "start_timestamp": 1679629139,
        "deploy_path": "/data1/tidb-deploy/tikv-30163/bin",
        "last_heartbeat": 1679629599338071366,
        "node_state": 1,
        "state_name": "Up"
      },
      "status": {
        "capacity": "492GiB",
        "available": "435GiB",
        "used_size": "342MiB",
        "leader_count": 5,
        "leader_weight": 1,
        "leader_score": 5,
        "leader_size": 73,
        "region_count": 7,
        "region_weight": 1,
        "region_score": 409.8657305188558,
        "region_size": 251,
        "slow_score": 1,
        "start_ts": "2023-03-24T11:38:59+08:00",
        "last_heartbeat_ts": "2023-03-24T11:46:39.338071366+08:00",
        "uptime": "7m40.338071366s"
      }
    },
    {
      "store": {
        "id": 7013,
        "address": "10.2.103.116:30160",
        "version": "6.1.5",
        "peer_address": "10.2.103.116:30160",
        "status_address": "10.2.103.116:30180",
        "git_hash": "e554126f6e83a6ddc944ddc51746b6def303ec1a",
        "start_timestamp": 1679629155,
        "deploy_path": "/data1/tidb-deploy/tikv-30160/bin",
        "last_heartbeat": 1679629565148211763,
        "node_state": 1,
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "492GiB",
        "available": "424GiB",
        "used_size": "324.6MiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 7,
        "region_weight": 1,
        "region_score": 410.17510527918483,
        "region_size": 251,
        "slow_score": 1,
        "start_ts": "2023-03-24T11:39:15+08:00",
        "last_heartbeat_ts": "2023-03-24T11:46:05.148211763+08:00",
        "uptime": "6m50.148211763s"
      }
    }
  ]
}

4. Querying the data returns an error

MySQL [test]> select count(*) from t3;
ERROR 9002 (HY000): TiKV server timeout
MySQL [test]>

5. TiKV error logs

[2023/03/24 11:48:47.549 +08:00] [ERROR] [raft_client.rs:824] ["connection abort"] [addr=10.2.103.116:30160] [store_id=7013]
[2023/03/24 11:48:47.813 +08:00] [INFO] [raft.rs:1550] ["starting a new election"] [term=9] [raft_id=7004] [region_id=7001]
[2023/03/24 11:48:47.813 +08:00] [INFO] [raft.rs:1170] ["became pre-candidate at term 9"] [term=9] [raft_id=7004] [region_id=7001]
[2023/03/24 11:48:47.813 +08:00] [INFO] [raft.rs:1299] ["broadcasting vote request"] [to="[250648, 250650]"] [log_index=1863] [log_term=9] [term=9] [type=MsgRequestPreVote] [raft_id=7004] [region_id=7001]
[2023/03/24 11:48:47.927 +08:00] [INFO] [<unknown>] ["subchannel 0x7fae72e89c00 {address=ipv4:10.2.103.116:30162, args=grpc.client_channel_factory=0x7faea78591a8, grpc.default_authority=10.2.103.116:30162, grpc.http2_scheme=https, grpc.internal.channel_credentials=0x7fae00a22260, grpc.internal.security_connector=0x7fae72e0ad80, grpc.internal.subchannel_pool=0x7faea7832d60, grpc.primary_user_agent=grpc-rust/0.10.2, grpc.resource_quota=0x7faea799c630, grpc.server_uri=dns:///10.2.103.116:30162, random id=324}: connect failed: {\"created\":\"@1679629727.927814392\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.10.1+1.44.0/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:10.2.103.116:30162\"}"]
[2023/03/24 11:48:47.928 +08:00] [INFO] [<unknown>] ["subchannel 0x7fae72e89c00 {address=ipv4:10.2.103.116:30162, args=grpc.client_channel_factory=0x7faea78591a8, grpc.default_authority=10.2.103.116:30162, grpc.http2_scheme=https, grpc.internal.channel_credentials=0x7fae00a22260, grpc.internal.security_connector=0x7fae72e0ad80, grpc.internal.subchannel_pool=0x7faea7832d60, grpc.primary_user_agent=grpc-rust/0.10.2, grpc.resource_quota=0x7faea799c630, grpc.server_uri=dns:///10.2.103.116:30162, random id=324}: Retry in 999 milliseconds"]
[2023/03/24 11:48:47.928 +08:00] [INFO] [advance.rs:296] ["check leader failed"] [to_store=7014] [error="\"[rpc failed] RpcFailure: 14-UNAVAILABLE failed to connect to all addresses\""]
[2023/03/24 11:48:47.928 +08:00] [INFO] [<unknown>] ["subchannel 0x7fae72eb5c00 {address=ipv4:10.2.103.116:30160, args=grpc.client_channel_factory=0x7faea78591a8, grpc.default_authority=10.2.103.116:30160, grpc.http2_scheme=https, grpc.internal.channel_credentials=0x7fae00a22ee0, grpc.internal.security_connector=0x7fae72fbd640, grpc.internal.subchannel_pool=0x7faea7832d60, grpc.primary_user_agent=grpc-rust/0.10.2, grpc.resource_quota=0x7faea799c630, grpc.server_uri=dns:///10.2.103.116:30160, random id=325}: connect failed: {\"created\":\"@1679629727.928184608\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.10.1+1.44.0/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:10.2.103.116:30160\"}"]
[2023/03/24 11:48:47.928 +08:00] [INFO] [<unknown>] ["subchannel 0x7fae72eb5c00 {address=ipv4:10.2.103.116:30160, args=grpc.client_channel_factory=0x7faea78591a8, grpc.default_authority=10.2.103.116:30160, grpc.http2_scheme=https, grpc.internal.channel_credentials=0x7fae00a22ee0, grpc.internal.security_connector=0x7fae72fbd640, grpc.internal.subchannel_pool=0x7faea7832d60, grpc.primary_user_agent=grpc-rust/0.10.2, grpc.resource_quota=0x7faea799c630, grpc.server_uri=dns:///10.2.103.116:30160, random id=325}: Retry in 1000 milliseconds"]
[2023/03/24 11:48:47.928 +08:00] [INFO] [advance.rs:296] ["check leader failed"] [to_store=7013] [error="\"[rpc failed] RpcFailure: 14-UNAVAILABLE failed to connect to all addresses\""]
[2023/03/24 11:48:48.037 +08:00] [WARN] [endpoint.rs:621] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 7009, leader may None\" not_leader { region_id: 7009 }"]
[2023/03/24 11:48:48.089 +08:00] [WARN] [endpoint.rs:621] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 7001, leader may None\" not_leader { region_id: 7001 }"]
[2023/03/24 11:48:48.409 +08:00] [WARN] [endpoint.rs:621] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 252001, leader may None\" not_leader { region_id: 252001 }"]
^C
[tidb@vm116 log]$

6. PD unsafe recovery

1. Run the recovery command

[tidb@vm116 ctl]$ tiup ctl:v6.1.5  pd -u "https://10.2.103.116:32379" --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v6.1.5/ctl pd -u https://10.2.103.116:32379 --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-prd/tls/client.crt -i
» unsafe remove-failed-stores 7013,7014
Success!

2. Check recovery progress

» unsafe remove-failed-stores show
[
  {
    "info": "Unsafe recovery enters collect report stage: failed stores 7013, 7014",
    "time": "2023-03-24 11:56:44.910"
  },
  {
    "info": "Unsafe recovery enters force leader stage",
    "time": "2023-03-24 11:56:49.390",
    "actions": {
      "store 5001": [
        "force leader on regions: 7001, 7005, 7009, 252001, 252005, 252009, 2003"
      ]
    }
  },
  {
    "info": "Collecting reports from alive stores(0/1)",
    "time": "2023-03-24 11:57:02.286",
    "details": [
      "Stores that have not dispatched plan: ",
      "Stores that have reported to PD: ",
      "Stores that have not reported to PD: 5001"
    ]
  }
]

3. Recovery finished

» unsafe remove-failed-stores show
[
  {
    "info": "Unsafe recovery enters collect report stage: failed stores 7013, 7014",
    "time": "2023-03-24 11:56:44.910"
  },
  {
    "info": "Unsafe recovery enters force leader stage",
    "time": "2023-03-24 11:56:49.390",
    "actions": {
      "store 5001": [
        "force leader on regions: 7001, 7005, 7009, 252001, 252005, 252009, 2003"
      ]
    }
  },
  {
    "info": "Unsafe recovery enters demote failed voter stage",
    "time": "2023-03-24 11:57:20.434",
    "actions": {
      "store 5001": [
        "region 7001 demotes peers { id:250648 store_id:7014 }, { id:250650 store_id:7013 }",
        "region 7005 demotes peers { id:250647 store_id:7013 }, { id:250649 store_id:7014 }",
        "region 7009 demotes peers { id:250644 store_id:7014 }, { id:250646 store_id:7013 }",
        "region 252001 demotes peers { id:252003 store_id:7013 }, { id:252004 store_id:7014 }",
        "region 252005 demotes peers { id:252007 store_id:7013 }, { id:252008 store_id:7014 }",
        "region 252009 demotes peers { id:252011 store_id:7013 }, { id:252012 store_id:7014 }",
        "region 2003 demotes peers { id:250643 store_id:7013 }, { id:250645 store_id:7014 }"
      ]
    }
  },
  {
    "info": "Unsafe recovery finished",
    "time": "2023-03-24 11:57:22.443",
    "details": [
      "affected table ids: 73, 77, 68, 70"
    ]
  }
]
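
After the "Unsafe recovery finished" stage is reported, the leaderless-Region check from the pre-v6.1 procedure can be repeated in pd-ctl; an empty result means every Region has a leader again:

» region --jq '.regions[]|select(has("leader")|not)|{id: .id, peer_stores: [.peers[].store_id]}'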

7. Query the data

MySQL [test]> select count(*) from t3;
+----------+
| count(*) |
+----------+
|  3271488 |
+----------+
1 row in set (0.50 sec)

MySQL [test]> 
MySQL [test]> admin check table t3;
Query OK, 0 rows affected (0.00 sec)

MySQL [test]> 

Summary:

1. Configure PD scheduling and placement for high availability against unexpected node failures: use multiple location labels, such as data center, rack, and host, so that replicas are spread across failure domains and the risk of data loss is reduced (see the topology sketch after this list).

2. Before v6.1, recovering from the loss of a majority of replicas is fairly tedious and requires a lot of manual intervention; since v6.1 the recovery is much simpler. If possible, upgrade to v6.1 or later so that this kind of failure can be recovered quickly.
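
As a sketch of the label-based placement mentioned in point 1 (hosts and label values here are hypothetical), a tiup topology can declare location labels on PD and matching labels on each TiKV so that the three replicas of a Region land in different failure domains:

server_configs:
  pd:
    replication.location-labels: ["zone", "rack", "host"]

tikv_servers:
  - host: 10.0.1.1          # hypothetical host in zone z1
    config:
      server.labels: { zone: "z1", rack: "r1", host: "h1" }
  - host: 10.0.1.2          # hypothetical host in zone z2
    config:
      server.labels: { zone: "z2", rack: "r2", host: "h2" }
  - host: 10.0.1.3          # hypothetical host in zone z3
    config:
      server.labels: { zone: "z3", rack: "r3", host: "h3" }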


Copyright notice: This is an original article by a TiDB community user, licensed under CC BY-NC-SA 4.0. When reposting, please include a link to the original article and this notice.
