Case 1:
Background: cluster GC was not working and DROP TABLE did not reclaim disk space; the GC safepoint was found stuck at 2024-05-22. Cluster version: v5.3.1.
1. Errors found in the logs and in admin check table
mysql> admin check table SuperXXXall;
ERROR 1105 (HY000): unexpected resolve err: commit_ts_expired:<start_ts:450046591374983318 attempted_commit_ts:450046591977652390 key:"t\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365" min_commit_ts:450046592187367871 > , lock: key: 7480000000000431df5f698000000000000002040000000077b02bb303f80000012adbadf5, primary: 7480000000000431df5f698000000000000001038000000013daff1103886518f58e543f44, txnStartTS: 450046591374983318, lockForUpdateTS:4500465913
start_ts:
mysql> select tidb_parse_tso(450046591374983318);
+------------------------------------+
| tidb_parse_tso(450046591374983318) |
+------------------------------------+
| 2024-05-27 14:31:41.522000 |
+------------------------------------+
1 row in set (0.00 sec)
attempted_commit_ts:
mysql> select tidb_parse_tso(450046591977652390);
+------------------------------------+
| tidb_parse_tso(450046591977652390) |
+------------------------------------+
| 2024-05-27 14:31:43.821000 |
+------------------------------------+
1 row in set (0.00 sec)
min_commit_ts:
mysql> select tidb_parse_tso(450046592187367871);
+------------------------------------+
| tidb_parse_tso(450046592187367871) |
+------------------------------------+
| 2024-05-27 14:31:44.621000 |
+------------------------------------+
1 row in set (0.01 sec)
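These three timestamps pin the stuck transaction to 2024-05-27 14:31, well after the safepoint stuck at 2024-05-22. As a side note, tidb_parse_tso is easy to reproduce offline when no TiDB is reachable: a TSO is a 64-bit value whose high bits are wall-clock milliseconds and whose low 18 bits are a logical counter. A minimal sketch (the +08:00 zone matches the session above):

```python
from datetime import datetime, timedelta, timezone

def parse_tso(tso: int):
    """Split a TiDB TSO: the high bits are physical milliseconds since
    the Unix epoch, the low 18 bits are a logical counter."""
    physical_ms = tso >> 18
    logical = tso & ((1 << 18) - 1)
    tz = timezone(timedelta(hours=8))  # the session above runs in UTC+8
    ts = (datetime.fromtimestamp(physical_ms // 1000, tz=tz)
          + timedelta(milliseconds=physical_ms % 1000))
    return ts, logical

ts, logical = parse_tso(450046591374983318)
print(ts)  # 2024-05-27 14:31:41.522000+08:00
```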
2. Confirm the data content
lock key:
mysql> select tidb_decode_key('7480000000000431df5f698000000000000002040000000077b02bb303f80000012adbadf5');
+-----------------------------------------------------------------------------------------------+
| tidb_decode_key('7480000000000431df5f698000000000000002040000000077b02bb303f80000012adbadf5') |
+-----------------------------------------------------------------------------------------------+
| {"index_id":2,"index_vals":{"ipnum":"3213214421"},"table_id":274911} |
+-----------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)
primary:
mysql> select tidb_decode_key('7480000000000431df5f698000000000000001038000000013daff1103886518f58e543f44');
+-----------------------------------------------------------------------------------------------+
| tidb_decode_key('7480000000000431df5f698000000000000001038000000013daff1103886518f58e543f44') |
+-----------------------------------------------------------------------------------------------+
| {"index_id":1,"index_vals":{"aid":"111111111","sn":"111111111111111111"},"table_id":274911} |
+-----------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
index_id 1 corresponds to uk_aid_sn; index_id 2 corresponds to index(ip).
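The fixed-length header that tidb_decode_key reads can also be unpacked by hand. These index keys are laid out as 't', an 8-byte table ID, the '_i' marker, then an 8-byte index ID, each integer stored big-endian with the sign bit flipped (memcomparable encoding). The sketch below decodes only this header and ignores the encoded index values that follow:

```python
SIGN_MASK = 0x8000000000000000

def decode_key_header(key_hex: str) -> dict:
    """Decode table_id and index_id from a TiDB index-key prefix.

    Layout: 't' | 8-byte table_id | '_i' | 8-byte index_id | values,
    integers big-endian with the sign bit flipped (memcomparable)."""
    key = bytes.fromhex(key_hex)
    assert key[0:1] == b"t", "not a table key"
    table_id = int.from_bytes(key[1:9], "big") ^ SIGN_MASK
    assert key[9:11] == b"_i", "not an index key"
    index_id = int.from_bytes(key[11:19], "big") ^ SIGN_MASK
    return {"table_id": table_id, "index_id": index_id}

print(decode_key_header(
    "7480000000000431df5f698000000000000002040000000077b02bb303f80000012adbadf5"))
# {'table_id': 274911, 'index_id': 2}
```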
3. Confirm the region information
Find the corresponding regions:
lock:
fdc@fdc-tidb04-onlinetidb:~$ curl 'http://{ip}:10080/mvcc/hex/7480000000000431df5f698000000000000002040000000077b02bb303f80000012adbadf5'
{
"key": "7480000000000431DF5F698000000000000002040000000077B02BB303F80000012ADBADF5",
"region_id": 366711639,
"value": {
"info": {
"lock": {
"start_ts": 450046591374983318,
"primary": "dIAAAAAABDHfX2mAAAAAAAAAAQOAAAAAE9r/EQOIZRj1jlQ/RA==",
"short_value": "MA=="
},
"writes": [
{
"start_ts": 450046591374983318,
"commit_ts": 450046591977652390,
"short_value": "MA=="
}
]
}
}
}
primary:
fdc@fdc-tidb04-onlinetidb:~$ curl 'http://{ip}:10080/mvcc/hex/7480000000000431df5f698000000000000001038000000013daff1103886518f58e543f44'
{
"key": "7480000000000431DF5F698000000000000001038000000013DAFF1103886518F58E543F44",
"region_id": 40507174,
"value": {
"info": {
"writes": [
{
"start_ts": 450046591374983318,
"commit_ts": 450046591977652390,
"short_value": "eAAAASrbrfM="
}
]
}
}
}
A lock on the key is confirmed.
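The two responses show the anomaly precisely: the index key still holds a lock from start_ts 450046591374983318, yet both it and the primary already carry a committed write record for that same start_ts. The transaction committed, but its secondary lock was never cleaned up, and that orphan lock is what GC keeps tripping over. A small consistency check along these lines, assuming the JSON shape returned by the /mvcc/hex endpoint:

```python
def find_stale_locks(mvcc_info: dict) -> list:
    """Return start_ts values that still hold a lock even though a
    commit record for the same start_ts already exists in write CF."""
    info = mvcc_info["value"]["info"]
    lock = info.get("lock")
    committed = {w["start_ts"] for w in info.get("writes", [])}
    if lock and lock["start_ts"] in committed:
        return [lock["start_ts"]]
    return []

# trimmed-down copy of the /mvcc/hex response for the index key
secondary = {
    "value": {"info": {
        "lock": {"start_ts": 450046591374983318, "short_value": "MA=="},
        "writes": [{"start_ts": 450046591374983318,
                    "commit_ts": 450046591977652390}],
    }}
}
print(find_stale_locks(secondary))  # [450046591374983318]
```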
4. Confirm the data distribution
tiup ctl:v5.3.1 pd -u {ip}:2379 region key 7480000000000431DF5F698000000000000002040000000077B02BB303F80000012ADBADF5
{
"id": 344684358,
"start_key": "748000000000042DFF8400000000000000F8",
"end_key": "7480000000000431FFDF5F698000000000FF0000010380000000FF0000854D03885C91FFF79BAF31BF000000FC",
"epoch": {
"conf_ver": 18923,
"version": 231376
},
"peers": [
{
"id": 1389106238,
"store_id": 348337315,
"role_name": "Voter"
},
{
"id": 1506170971,
"store_id": 477311915,
"role_name": "Voter"
},
{
"id": 1619480854,
"store_id": 935248854,
"role_name": "Voter"
}
],
"leader": {
"id": 1506170971,
"store_id": 477311915,
"role_name": "Voter"
},
"written_bytes": 2826,
"read_bytes": 0,
"written_keys": 2,
"read_keys": 0,
"approximate_size": 83,
"approximate_keys": 1176620
}
5. Evict the leader to reduce the impact on the business
tiup ctl:v5.3.1 pd -u 10.90.230.8:2379 scheduler add evict-leader-scheduler 477311915
# when done, remember to remove the scheduler so the leaders can move back
tiup ctl:v5.3.1 pd -u 10.90.230.8:2379 scheduler remove evict-leader-scheduler
6. Confirm the MVCC information
Encode the key:
fdc@fdc-tidb04-onlinetidb:~$ tiup ctl:v5.3.1 tikv --to-escaped "7480000000000431df5f698000000000000002040000000077b02bb303f80000012adbadf5"
t\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365
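The --to-escaped conversion can be cross-checked by hand: each raw key byte either passes through (printable ASCII) or becomes a three-digit octal escape. A sketch under that assumption (tikv-ctl additionally doubles backslashes):

```python
def to_escaped(key_hex: str) -> str:
    """Render raw key bytes the way `tikv-ctl --to-escaped` does
    (assumed): printable ASCII passes through, a backslash is doubled,
    and every other byte becomes a three-digit octal escape."""
    out = []
    for b in bytes.fromhex(key_hex):
        if b == 0x5C:              # backslash
            out.append("\\\\")
        elif 0x20 <= b <= 0x7E:    # printable ASCII
            out.append(chr(b))
        else:
            out.append("\\%03o" % b)
    return "".join(out)

print(to_escaped("7480000000000431df5f69800000000000000204"
                 "0000000077b02bb303f80000012adbadf5"))
```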
Inspect the MVCC records:
tiup ctl:v5.3.1 tikv --data-dir=/ssd1/tidb/tikv/deploy1/data mvcc -k "zt\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365" --show-cf=lock,write,default
fdc@fdc-tidb02-onlinetikv:~$ tiup ctl:v5.3.1 tikv --data-dir=/ssd1/tidb/tikv/deploy1/data mvcc -k "zt\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365" --show-cf=lock,write,default
Starting component ctl: /home/fdc/.tiup/components/ctl/v5.3.1/ctl tikv --data-dir=/ssd1/tidb/tikv/deploy1/data mvcc -k zt\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365 --show-cf=lock,write,default
[2025/09/22 16:13:06.667 +08:00] [INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[2025/09/22 16:13:06.667 +08:00] [INFO] [mod.rs:479] ["encryption is disabled."]
[2025/09/22 16:13:06.671 +08:00] [WARN] [config.rs:587] ["compaction guard is disabled due to region info provider not available"]
[2025/09/22 16:13:06.671 +08:00] [WARN] [config.rs:682] ["compaction guard is disabled due to region info provider not available"]
no mvcc infos for zt\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365
no mvcc infos
Inspect with recover-mvcc (read-only):
fdc@fdc-tidb02-onlinetikv:~$ tiup ctl:v5.3.1 tikv --data-dir=/ssd1/tidb/tikv/deploy1/data recover-mvcc --read-only -r 40507174 -p 10.90.230.8:2379
Starting component ctl: /home/fdc/.tiup/components/ctl/v5.3.1/ctl tikv --data-dir=/ssd1/tidb/tikv/deploy1/data recover-mvcc --read-only -r 40507174 -p ******
[2025/09/22 16:17:42.552 +08:00] [INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[2025/09/22 16:17:42.552 +08:00] [INFO] [mod.rs:479] ["encryption is disabled."]
[2025/09/22 16:17:42.556 +08:00] [WARN] [config.rs:587] ["compaction guard is disabled due to region info provider not available"]
[2025/09/22 16:17:42.556 +08:00] [WARN] [config.rs:682] ["compaction guard is disabled due to region info provider not available"]
Recover regions: [40507174], pd: ["10.90.230.8:2379"], read_only: true
[2025/09/22 16:17:45.303 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=10.90.230.8:2379]
[2025/09/22 16:17:45.304 +08:00] [INFO] [<unknown>] ["TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter"]
[2025/09/22 16:17:45.307 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fa62ac05100 for subchannel 0x7fa66a612280"]
[2025/09/22 16:17:45.309 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://10.191.128.76:2379]
[2025/09/22 16:17:45.309 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fa62ac05280 for subchannel 0x7fa66a612600"]
[2025/09/22 16:17:45.310 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://10.90.230.8:2379]
[2025/09/22 16:17:45.313 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fa62ac05400 for subchannel 0x7fa66a612280"]
[2025/09/22 16:17:45.315 +08:00] [INFO] [util.rs:668] ["connected to PD member"] [endpoints=http://10.90.230.8:2379]
[2025/09/22 16:17:45.315 +08:00] [INFO] [util.rs:536] ["all PD endpoints are consistent"] [endpoints="[\"10.90.230.8:2379\"]"]
success!
[2025/09/22 16:17:45.328 +08:00] [INFO] [debug.rs:956] ["thread 0: skip write 0 rows"]
[2025/09/22 16:17:45.328 +08:00] [INFO] [debug.rs:959] ["thread 0: total fix default: 0, lock: 0, write: 0"]
[2025/09/22 16:17:45.328 +08:00] [INFO] [debug.rs:968] ["thread 0 has finished working."]
Still no concrete MVCC lock was found. At this point we were puzzled: were we looking in the wrong place?
7. Confirm the region status
mysql> select * from TIKV_REGION_STATUS where region_id in (366711639,40507174)\G
*************************** 1. row ***************************
REGION_ID: 366711639
START_KEY: 7480000000000431FFDF5F698000000000FF0000020400000000FF77A751AA03B80000FF0021B1B123000000FC
END_KEY: 7480000000000431FFDF5F698000000000FF0000020400000000FF77B1949A03980000FF0014C54AF8000000FC
TABLE_ID: 274911
DB_NAME: DB
TABLE_NAME: SuperXXXall
IS_INDEX: 1
INDEX_ID: 2
INDEX_NAME: ip
EPOCH_CONF_VER: 18971
EPOCH_VERSION: 210340
WRITTEN_BYTES: 0
READ_BYTES: 0
APPROXIMATE_SIZE: 70
APPROXIMATE_KEYS: 1108205
REPLICATIONSTATUS_STATE: NULL
REPLICATIONSTATUS_STATEID: NULL
*************************** 2. row ***************************
REGION_ID: 40507174
START_KEY: 7480000000000431FFDF5F698000000000FF0000010380000000FF13DAFBF103886518FF703DBBA158000000FC
END_KEY: 7480000000000431FFDF5F698000000000FF0000010380000000FF13DB043303886A23FF907234F2FC000000FC
TABLE_ID: 274911
DB_NAME: DB
TABLE_NAME: SuperXXXall
IS_INDEX: 1
INDEX_ID: 1
INDEX_NAME: uk_aid_sn
EPOCH_CONF_VER: 19460
EPOCH_VERSION: 210350
WRITTEN_BYTES: 0
READ_BYTES: 0
APPROXIMATE_SIZE: 68
APPROXIMATE_KEYS: 973219
REPLICATIONSTATUS_STATE: NULL
REPLICATIONSTATUS_STATEID: NULL
2 rows in set (1 min 19.87 sec)
Both regions are on indexes. If nothing else works, we could tombstone the regions or create empty regions directly; in theory this would not affect the original row data.
8. Confirm the region location
tiup ctl:v5.3.1 pd region 366711639
Starting component `ctl`: /home/fdc/.tiup/components/ctl/v5.3.1/ctl pd region 366711639
{
"id": 366711639,
"start_key": "7480000000000431FFDF5F698000000000FF0000020400000000FF77A751AA03B80000FF0021B1B123000000FC",
"end_key": "7480000000000431FFDF5F698000000000FF0000020400000000FF77B1949A03980000FF0014C54AF8000000FC",
"epoch": {
"conf_ver": 18971,
"version": 210340
},
"peers": [
{
"id": 737290763,
"store_id": 400136022,
"role_name": "Voter"
},
{
"id": 1073621434,
"store_id": 456470117,
"role_name": "Voter"
},
{
"id": 1553665102,
"store_id": 477311917,
"role_name": "Voter"
}
],
"leader": {
"id": 1073621434,
"store_id": 456470117,
"role_name": "Voter"
},
"written_bytes": 0,
"read_bytes": 0,
"written_keys": 0,
"read_keys": 0,
"approximate_size": 70,
"approximate_keys": 1108205
}
Try the same steps on the other TiKV node:
Encode the key as before, then inspect the MVCC records:
tiup ctl:v5.3.1 tikv --data-dir=/ssd1/tikv/deploy/data mvcc -k "zt\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365" --show-cf=lock,write,default
fdc@fdc-tidb02-onlinetikv:~$ tiup ctl:v5.3.1 tikv --data-dir=/ssd1/tikv/deploy/data mvcc -k "zt\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365" --show-cf=lock,write,default
Starting component ctl: /home/fdc/.tiup/components/ctl/v5.3.1/ctl tikv --data-dir=/ssd1/tidb/tikv/deploy1/data mvcc -k zt\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365 --show-cf=lock,write,default
[2025/09/22 16:13:06.667 +08:00] [INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[2025/09/22 16:13:06.667 +08:00] [INFO] [mod.rs:479] ["encryption is disabled."]
[2025/09/22 16:13:06.671 +08:00] [WARN] [config.rs:587] ["compaction guard is disabled due to region info provider not available"]
[2025/09/22 16:13:06.671 +08:00] [WARN] [config.rs:682] ["compaction guard is disabled due to region info provider not available"]
no mvcc infos for zt\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365
no mvcc infos
Inspect with recover-mvcc:
fdc@fdc-tidb06-onlinetikv:~$ tiup ctl:v5.3.1 tikv --data-dir=/ssd1/tikv/deploy/data recover-mvcc --read-only -r 366711639 -p 10.90.230.8:2379
Starting component ctl: /home/fdc/.tiup/components/ctl/v5.3.1/ctl tikv --data-dir=/ssd1/tikv/deploy/data recover-mvcc --read-only -r 366711639 -p ******
[2025/09/25 11:29:04.220 +08:00] [INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[2025/09/25 11:29:04.220 +08:00] [INFO] [mod.rs:479] ["encryption is disabled."]
[2025/09/25 11:29:04.224 +08:00] [WARN] [config.rs:587] ["compaction guard is disabled due to region info provider not available"]
[2025/09/25 11:29:04.224 +08:00] [WARN] [config.rs:682] ["compaction guard is disabled due to region info provider not available"]
Recover regions: [366711639], pd: ["10.90.230.8:2379"], read_only: true
[2025/09/25 11:29:07.042 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=10.90.230.8:2379]
[2025/09/25 11:29:07.044 +08:00] [INFO] [<unknown>] ["Disabling AF_INET6 sockets because socket() failed."]
[2025/09/25 11:29:07.044 +08:00] [INFO] [<unknown>] ["TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter"]
[2025/09/25 11:29:07.047 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7f9eb3606680 for subchannel 0x7f9ef7211f00"]
[2025/09/25 11:29:07.049 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://10.191.128.76:2379]
[2025/09/25 11:29:07.050 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7f9eb3606800 for subchannel 0x7f9ef7212280"]
[2025/09/25 11:29:07.050 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://10.90.230.8:2379]
[2025/09/25 11:29:07.054 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7f9eb3606980 for subchannel 0x7f9ef7211f00"]
[2025/09/25 11:29:07.056 +08:00] [INFO] [util.rs:668] ["connected to PD member"] [endpoints=http://10.90.230.8:2379]
[2025/09/25 11:29:07.056 +08:00] [INFO] [util.rs:536] ["all PD endpoints are consistent"] [endpoints="[\"10.90.230.8:2379\"]"]
[2025/09/25 11:29:07.257 +08:00] [INFO] [debug.rs:1098] ["thread 0: LOCK for_update_ts is less than WRITE ts, key: 7480000000000431FFDF5F698000000000FF0000020400000000FF77B02BB303F80000FF012ADBADF5000000FC, for_update_ts: 450046591374983318, commit_ts: 450046591977652390"]
[2025/09/25 11:29:07.580 +08:00] [INFO] [debug.rs:1063] ["thread 0: scan 1000000 rows"]
[2025/09/25 11:29:07.614 +08:00] [INFO] [debug.rs:956] ["thread 0: skip write 1 rows"]
[2025/09/25 11:29:07.614 +08:00] [INFO] [debug.rs:959] ["thread 0: total fix default: 0, lock: 1, write: 0"]
[2025/09/25 11:29:07.614 +08:00] [INFO] [debug.rs:968] ["thread 0 has finished working."]
success!
Still no concrete MVCC lock was shown in the mvcc output.
But this time recover-mvcc at least saw the lock on the key (total fix lock: 1). We ran the repair here first (recover-mvcc without --read-only), then repeated the same procedure on the other stores. After all three stores had been processed, the problem was resolved: GC obtained the lock normally and reclaimed the space.
Case 2:
1. Confirm the errors from the logs
["resolve store address failed"] [err_code=KV:Unknown] [err="Other(\"[src/server/resolve.rs:124]: unknown error \\\"[components/pd_client/src/util.rs:954]: invalid store ID 8769963922, not found\\\"\")"] [store_id=8769963922] [thread_id=203]
[raft_client.rs:829] ["resolve store address failed"] [err_code=KV:Unknown] [err="Other(\"[src/server/resolve.rs:124]: unknown error \\\"[components/pd_client/src/util.rs:954]: invalid store ID 32996937412, not found\\\"\")"] [store_id=32996937412] [thread_id=203]
[raft_client.rs:829] ["resolve store address failed"] [err_code=KV:Unknown] [err="Other(\"[src/server/resolve.rs:124]: unknown error \\\"[components/pd_client/src/util.rs:954]: invalid store ID 15351649554, not found\\\"\")"] [store_id=15351649554] [thread_id=203]
2025-09-11 11:20:23 (UTC+08:00) PD 10.191.0.46:2379 [operator_controller.go:944] ["invalid store ID"] [store-id=8769963922]
2025-09-11 11:21:56 (UTC+08:00) PD 10.191.0.46:2379 [operator_controller.go:944] ["invalid store ID"] [store-id=15351649554]
2. Try restarting TiKV
A similar case on the forum was temporarily resolved by restarting TiKV: https://asktug.com/t/topic/1045115/1
The restart had no effect; the TiKV logs still showed the errors:
[2025/09/11 14:45:58.229 +08:00] [INFO] [resolve.rs:121] ["resolve store not found"] [store_id=32996937412] [thread_id=18]
[2025/09/11 14:45:58.229 +08:00] [ERROR] [raft_client.rs:829] ["resolve store address failed"] [err_code=KV:Unknown] [err="Other(\"[src/server/resolve.rs:124]: unknown error \\\"[components/pd_client/src/util.rs:954]: invalid store ID 32996937412, not found\\\"\")"] [store_id=32996937412] [thread_id=202]
3. Check for abnormal regions
tiup ctl:v7.5.4 pd -u 10.191.0.46:2379 region check down-peer
{
"count": 2,
"regions": [
{
"id": 20278184100,
"start_key": "7480000000000365FF955F698000000000FF0000020146334535FF35444144FF464645FF4441353938FF4644FF453136423143FF41FF31374538393446FFFF0000000000000000FFF703800000281D38FF963E000000000000F9",
"end_key": "7480000000000365FF955F698000000000FF0000020146383942FF41383535FF303333FF3843363630FF3333FF464243333245FF37FF39423243463832FFFF0000000000000000FFF703800000281B55FFFC7B000000000000F9",
"epoch": {
"conf_ver": 9382,
"version": 103309
},
"peers": [
{
"role_name": "Learner",
"is_learner": true,
"id": 21571927681,
"store_id": 8769963922,
"role": 1
},
{
"role_name": "Voter",
"id": 22597784773,
"store_id": 23
},
{
"role_name": "Learner",
"is_learner": true,
"id": 22603334967,
"store_id": 8769963926,
"role": 1
},
{
"role_name": "Voter",
"id": 33232384663,
"store_id": 32996937412
}
],
"leader": {
"role_name": "Voter",
"id": 22597784773,
"store_id": 23
},
"down_peers": [
{
"peer": {
"role_name": "Learner",
"is_learner": true,
"id": 21571927681,
"store_id": 8769963922,
"role": 1
},
"down_seconds": 8285684
},
{
"peer": {
"role_name": "Learner",
"is_learner": true,
"id": 22603334967,
"store_id": 8769963926,
"role": 1
},
"down_seconds": 8285684
},
{
"peer": {
"role_name": "Voter",
"id": 33232384663,
"store_id": 32996937412
},
"down_seconds": 8285684
}
],
"pending_peers": [
{
"role_name": "Learner",
"is_learner": true,
"id": 21571927681,
"store_id": 8769963922,
"role": 1
},
{
"role_name": "Learner",
"is_learner": true,
"id": 22603334967,
"store_id": 8769963926,
"role": 1
},
{
"role_name": "Voter",
"id": 33232384663,
"store_id": 32996937412
}
],
"cpu_usage": 0,
"written_bytes": 0,
"read_bytes": 0,
"written_keys": 0,
"read_keys": 0,
"approximate_size": 465,
"approximate_keys": 4353659
},
{
"id": 29486864661,
"start_key": "74800000000004E2FF605F728000003695FF505EE90000000000FA",
"end_key": "74800000000004E2FF605F728000003695FF65A3D10000000000FA",
"epoch": {
"conf_ver": 13178,
"version": 143231
},
"peers": [
{
"role_name": "Voter",
"id": 29963431801,
"store_id": 21961705754
},
{
"role_name": "Learner",
"is_learner": true,
"id": 29976105937,
"store_id": 15351649554,
"role": 1
},
{
"role_name": "Learner",
"is_learner": true,
"id": 29998759908,
"store_id": 8769963926,
"role": 1
},
{
"role_name": "Voter",
"id": 33232575337,
"store_id": 32996937412
}
],
"leader": {
"role_name": "Voter",
"id": 29963431801,
"store_id": 21961705754
},
"down_peers": [
{
"peer": {
"role_name": "Learner",
"is_learner": true,
"id": 29976105937,
"store_id": 15351649554,
"role": 1
},
"down_seconds": 8285666
},
{
"peer": {
"role_name": "Learner",
"is_learner": true,
"id": 29998759908,
"store_id": 8769963926,
"role": 1
},
"down_seconds": 8285666
},
{
"peer": {
"role_name": "Voter",
"id": 33232575337,
"store_id": 32996937412
},
"down_seconds": 8285666
}
],
"pending_peers": [
{
"role_name": "Learner",
"is_learner": true,
"id": 29976105937,
"store_id": 15351649554,
"role": 1
},
{
"role_name": "Learner",
"is_learner": true,
"id": 29998759908,
"store_id": 8769963926,
"role": 1
},
{
"role_name": "Voter",
"id": 33232575337,
"store_id": 32996937412
}
],
"cpu_usage": 0,
"written_bytes": 0,
"read_bytes": 0,
"written_keys": 0,
"read_keys": 0,
"approximate_size": 767,
"approximate_keys": 1350662
}
]
}
4. Try adding the replicas back
Both abnormal regions have peers on the failed stores; try adding the replicas back on other nodes.
For the first region, 20278184100, only the leader peer is still alive. Try adding replicas (pd-ctl usage from the docs):
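Before issuing operators, the peer list above is worth reading closely: region 20278184100 has two voters and two learners, and one voter plus both learners sit on the vanished stores. With only one of two voters alive, a Raft quorum of 2 is unreachable, so a membership-change entry can never commit, which is consistent with the operator timeouts seen later. A rough health check over the pd-ctl output, assuming standard Raft rules (learners do not vote):

```python
def raft_health(peers: list, failed_stores: set) -> dict:
    """Summarize voter health for one region; learners do not vote,
    so quorum is computed over voters only."""
    voters = [p for p in peers if p.get("role_name") == "Voter"]
    alive = [p for p in voters if p["store_id"] not in failed_stores]
    quorum = len(voters) // 2 + 1
    return {"voters": len(voters), "alive": len(alive),
            "quorum": quorum, "can_commit": len(alive) >= quorum}

# peers of region 20278184100 as reported by `region check down-peer`
peers = [
    {"role_name": "Learner", "store_id": 8769963922},
    {"role_name": "Voter", "store_id": 23},
    {"role_name": "Learner", "store_id": 8769963926},
    {"role_name": "Voter", "store_id": 32996937412},
]
failed = {8769963922, 8769963926, 15351649554, 32996937412}
print(raft_health(peers, failed))
# {'voters': 2, 'alive': 1, 'quorum': 2, 'can_commit': False}
```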
>> operator add add-peer 1 2 // add a replica of Region 1 on store 2
>> operator add add-learner 1 2 // add a learner replica of Region 1 on store 2
Add replicas on stores 16 and 33472595794:
tiup ctl:v7.5.4 pd -u {ip}:2379 operator add add-peer 20278184100 16
Success! The operator is created.
Multiple operators cannot run on the same region at once, so the stores have to be done one by one:
tiup ctl:v7.5.4 pd -u {ip}:2379 operator add add-peer 20278184100 33472595794
While the operators run, confirm the GC configuration and the tables involved.
The tables involved:
mysql> select * from TIKV_REGION_STATUS where REGION_ID = 20278184100\G
*************************** 1. row ***************************
REGION_ID: 20278184100
START_KEY: 7480000000000365FF955F698000000000FF0000020146334535FF35444144FF464645FF4441353938FF4644FF453136423143FF41FF31374538393446FFFF0000000000000000FFF703800000281D38FF963E000000000000F9
END_KEY: 7480000000000365FF955F698000000000FF0000020146383942FF41383535FF303333FF3843363630FF3333FF464243333245FF37FF39423243463832FFFF0000000000000000FFF703800000281B55FFFC7B000000000000F9
TABLE_ID: NULL
DB_NAME: NULL
TABLE_NAME: NULL
IS_INDEX: 0
INDEX_ID: NULL
INDEX_NAME: NULL
IS_PARTITION: 0
PARTITION_ID: NULL
PARTITION_NAME: NULL
EPOCH_CONF_VER: 9382
EPOCH_VERSION: 103309
WRITTEN_BYTES: 0
READ_BYTES: 0
APPROXIMATE_SIZE: 465
APPROXIMATE_KEYS: 4353659
REPLICATIONSTATUS_STATE: NULL
REPLICATIONSTATUS_STATEID: NULL
1 row in set (32.32 sec)
mysql> select * from TIKV_REGION_STATUS where REGION_ID = 29486864661\G
*************************** 1. row ***************************
REGION_ID: 29486864661
START_KEY: 74800000000004E2FF605F728000003695FF505EE90000000000FA
END_KEY: 74800000000004E2FF605F728000003695FF65A3D10000000000FA
TABLE_ID: 10505
DB_NAME: mars_p1log
TABLE_NAME: loginrole
IS_INDEX: 0
INDEX_ID: NULL
INDEX_NAME: NULL
IS_PARTITION: 1
PARTITION_ID: 320096
PARTITION_NAME: p20240815
EPOCH_CONF_VER: 13178
EPOCH_VERSION: 143231
WRITTEN_BYTES: 0
READ_BYTES: 0
APPROXIMATE_SIZE: 767
APPROXIMATE_KEYS: 1350662
REPLICATIONSTATUS_STATE: NULL
REPLICATIONSTATUS_STATEID: NULL
1 row in set (31.31 sec)
The operator eventually timed out after 51m.
Try remove-peer (pd-ctl usage from the docs):
>> operator add remove-peer 1 2 // remove a replica of Region 1 from store 2
tiup ctl:v7.5.4 pd -u {ip}:2379 operator check 20278184100
tiup ctl:v7.5.4 pd -u {ip}:2379 operator remove 20278184100
tiup ctl:v7.5.4 pd -u {ip}:2379 operator add remove-peer 20278184100 8769963922
The remove-peer on the first (empty) region also timed out in the end:
[2025/09/11 16:21:55.912 +08:00] [INFO] [operator_controller.go:659] ["operator timeout"] [region-id=20278184100] [takes=4m39.750985291s] [operator="\"admin-remove-peer {rm peer: store [8769963922]} (kind:admin,region, region:20278184100(103309, 9382), createAt:2025-09-11 16:17:16.161505793 +0800 CST m=+17048519.309387956, startAt:2025-09-11 16:17:16.161565328 +0800 CST m=+17048519.309447485, currentStep:0, size:465, steps:[0:{remove peer on store 8769963922}], timeout:[4m39s]) timeout\""] [additional-info="{\"cancel-reason\":\"timeout\"}"]
Adding a peer back for the second region timed out as well:
fdc@fdc-tidb01-tidbp1:~$ tiup ctl:v7.5.4 pd -u {ip}:2379 operator add add-peer 29486864661 16
Success! The operator is created.
# check the relevant configurations
mysql> show config where name like '%enable-remove-down-replica';
+------+------------------+-------------------------------------+-------+
| Type | Instance | Name | Value |
+------+------------------+-------------------------------------+-------+
| pd | {ip}:2379 | schedule.enable-remove-down-replica | true |
| pd | {ip}:2379 | schedule.enable-remove-down-replica | true |
| pd | {ip}:2379 | schedule.enable-remove-down-replica | true |
+------+------------------+-------------------------------------+-------+
3 rows in set (0.06 sec)
mysql> show config where name like '%enable-replace-offline-replica';
+------+------------------+-----------------------------------------+-------+
| Type | Instance | Name | Value |
+------+------------------+-----------------------------------------+-------+
| pd | {ip}:2379 | schedule.enable-replace-offline-replica | true |
| pd | {ip}:2379 | schedule.enable-replace-offline-replica | true |
| pd | {ip}:2379 | schedule.enable-replace-offline-replica | true |
+------+------------------+-----------------------------------------+-------+
The configuration was confirmed as well: both switches are enabled.
We then prepared to perform lossy (unsafe) recovery, but in the end it could not run either:
fdc@fdc-tidb01-tidbp1:~/tipd/deploy/log$ tiup ctl:v7.5.4 pd -u {ip}:2379 store 32996937412
Failed to get store: [404] "[PD:core:ErrStoreNotFound]store 32996937412 not found"
fdc@fdc-tidb01-tidbp1:~/tipd/deploy/log$ tiup ctl:v7.5.4 pd -u {ip}:2379 store 8769963922
Failed to get store: [404] "[PD:core:ErrStoreNotFound]store 8769963922 not found"
fdc@fdc-tidb01-tidbp1:~/tipd/deploy/log$ tiup ctl:v7.5.4 pd -u {ip}:2379 store 8769963926
Failed to get store: [404] "[PD:core:ErrStoreNotFound]store 8769963926 not found"
----------------
tiup ctl:v7.5.4 pd -u {ip}:2379 unsafe remove-failed-stores 8769963922,8769963926,32996937412
Failed! [500] "[PD:unsaferecovery:ErrUnsafeRecoveryInvalidInput]invalid input store 32996937412 doesn't exist"
5. Try recreate-region
https://tidb.net/blog/ddef26a5#4%C2%A0%C2%A0%20%E5%BC%82%E5%B8%B8%E5%A4%84%E7%90%86%E4%B8%89%E6%9D%BF%E6%96%A7/4.3%20%E7%AC%AC%E4%B8%89%E6%8B%9B%EF%BC%9A%E9%87%8D%E5%BB%BAregion
4.3 Trick 3: rebuild the region
If all replicas of a region are lost, or a small number of empty regions with no data cannot elect a leader, recreate-region can be used to rebuild the region.
(1) All replicas lost, after multi-replica failure recovery has been executed
Check for regions whose replicas are all lost; the if clause lists the store_ids of the failed TiKVs:
pd-ctl region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total |map(if .==(4,5,7) then . else empty end)|length>$total-length)}' |sort
(2) A small number of regions with no data and no leader, with no prior recovery done on the cluster
Use curl http://tidb_ip:10080/regions/{region_id} to check the objects on the region. If the frames field is empty, the region is an empty region with no data and rebuilding it has no impact; otherwise data will be lost.
(3) Rebuild the region
Stop the surviving TiKV instances the region involves, then run on one healthy TiKV:
tikv-ctl --data-dir /data/tidb-data/tikv-20160 recreate-region -p 'pd_ip:pd_port' -r <region_id>
Note: older versions used the --db parameter instead of --data-dir; the directory must be that of a healthy TiKV. Also, when copy-pasting the command, check that the quotes and hyphens have not been converted to full-width characters.
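The jq filter in step (1) is dense; the Python port below may be easier to audit. The store IDs (4, 5, 7) in the quoted article are that author's failed stores, generalized here into a parameter, and the sample regions are made up for illustration:

```python
def regions_with_majority_lost(regions: list, failed_stores: set) -> list:
    """Port of the pd-ctl jq filter: keep regions where peers on failed
    stores outnumber the remaining peers (the Raft majority is gone)."""
    lost = []
    for region in regions:
        stores = [p["store_id"] for p in region["peers"]]
        failed = [s for s in stores if s in failed_stores]
        if len(failed) > len(stores) - len(failed):
            lost.append({"id": region["id"], "peer_stores": stores})
    return lost

# made-up sample: region 1 has lost 2 of its 3 replicas
regions = [
    {"id": 1, "peers": [{"store_id": 4}, {"store_id": 5}, {"store_id": 9}]},
    {"id": 2, "peers": [{"store_id": 7}, {"store_id": 8}, {"store_id": 9}]},
]
print(regions_with_majority_lost(regions, {4, 5, 7}))
# [{'id': 1, 'peer_stores': [4, 5, 9]}]
```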
Confirm the region content:
curl 'http://{ip}:10081/regions/20278184100'
{
"start_key": "dIAAAAAAA2WVX2mAAAAAAAAAAgFGM0U1NURBRP9GRkVEQTU5OP9GREUxNkIxQ/9BMTdFODk0Rv8AAAAAAAAAAPcDgAAAKB04lj4=",
"end_key": "dIAAAAAAA2WVX2mAAAAAAAAAAgFGODlCQTg1Nf8wMzM4QzY2MP8zM0ZCQzMyRf83OUIyQ0Y4Mv8AAAAAAAAAAPcDgAAAKBtV/Hs=",
"start_key_hex": "7480000000000365955f698000000000000002014633453535444144ff4646454441353938ff4644453136423143ff4131374538393446ff0000000000000000f703800000281d38963e",
"end_key_hex": "7480000000000365955f698000000000000002014638394241383535ff3033333843363630ff3333464243333245ff3739423243463832ff0000000000000000f703800000281b55fc7b",
"region_id": 20278184100,
"frames": null
curl 'http://{ip}:10081/regions/29486864661'
{
"start_key": "dIAAAAAABOJgX3KAAAA2lVBe6Q==",
"end_key": "dIAAAAAABOJgX3KAAAA2lWWj0Q==",
"start_key_hex": "74800000000004e2605f728000003695505ee9",
"end_key_hex": "74800000000004e2605f72800000369565a3d1",
"region_id": 29486864661,
"frames": [
{
"db_name": "db",
"table_name": "table",
"table_id": 320096,
"is_record": true, (包含具体记录,不是索引)
"record_id": 234433306345
}
]
}
The region with empty frames is safe to rebuild without impact. But in the end this approach had no effect either.
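As an aside, the start_key/end_key fields in these responses are base64 of the raw key bytes, and the *_hex fields are the hex of the same bytes, so the two representations can be cross-checked offline:

```python
import base64

def b64_key_to_hex(b64_key: str) -> str:
    """The TiDB HTTP API returns region keys as base64 of the raw key
    bytes; the *_hex fields are the hex of those same bytes."""
    return base64.b64decode(b64_key).hex()

# start_key of region 29486864661 from the response above
print(b64_key_to_hex("dIAAAAAABOJgX3KAAAA2lVBe6Q=="))
# 74800000000004e2605f728000003695505ee9
```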
6. unsafe remove-failed-stores --auto-detect: the final solution
Reference: https://asktug.com/t/topic/1029821/105
tiup ctl:v7.5.4 pd -u {ip}:2379 unsafe remove-failed-stores --auto-detect
fdc@fdc-tidb01-tidbp1:~$ tiup ctl:v7.5.4 pd -u {ip}:2379 unsafe remove-failed-stores show
Starting component ctl: /home/fdc/.tiup/components/ctl/v7.5.4/ctl pd -u 10.191.0.46:2379 unsafe remove-failed-stores show
[
{
"info": "Unsafe recovery enters collect report stage",
"time": "2025-09-23 09:40:55.189",
"details": [
"auto detect mode with no specified Failed stores"
]
},
{
"info": "Unsafe recovery enters demote Failed voter stage",
"time": "2025-09-23 09:41:44.825",
"actions": {
"store 21961705754": [
"tombstone the peer of region 29486864661"
],
"store 23": [
"tombstone the peer of region 20278184100"
]
}
},
{
"info": "Unsafe recovery Finished",
"time": "2025-09-23 09:43:13.212",
"details": [
"affected table ids: 320096, 222613",
"no newly created empty regions"
]
}
]
After this completed, the lock could be obtained normally and GC ran.
Summary
Recovery methods:
- Use tikv-ctl recover-mvcc to clean up locks that GC cannot resolve
- Use unsafe remove-failed-stores --auto-detect to clean up residual store ID references
Operational practices:
- When scaling in TiKV, avoid force scale-in whenever possible; budget extra time and follow the normal decommission process
- Upgrade the cluster version in a timely manner
- Monitor the GC safepoints of every component. In both cases the TiDB and PD safepoints were normal while TiKV was never able to run GC and reclaim space; such divergence should be detected and handled promptly
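On the last point, PD tracks one GC safepoint per service (visible via pd-ctl service-gc-safepoint), and the useful alert is "some service's safepoint lags far behind now". A hypothetical monitoring sketch, assuming the safepoints have already been fetched into a service-name → TSO map (the service names and threshold are illustrative):

```python
def find_gc_blocker(safepoints: dict, now_tso: int,
                    max_lag_seconds: int = 24 * 3600) -> list:
    """Flag services whose GC safepoint lags the current TSO by more
    than max_lag_seconds (TSO high bits = physical milliseconds)."""
    now_ms = now_tso >> 18
    blockers = []
    for service, tso in safepoints.items():
        lag_s = (now_ms - (tso >> 18)) / 1000
        if lag_s > max_lag_seconds:
            blockers.append(service)
    return blockers

# hypothetical snapshot: gc_worker's safepoint is ~4.5 days behind
safepoints = {"gc_worker": 1716400000000 << 18, "cdc": 1716791000000 << 18}
print(find_gc_blocker(safepoints, 1716791501522 << 18))
# ['gc_worker']
```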