
Notes on Two GC Failure Fixes

小老板努力变强 · published on 2025-09-30

Case 1:

Background: we found that GC on a cluster had stopped working: drop table did not reclaim disk space, and the GC safepoint was stuck at 2024-05-22. Cluster version: v5.3.1.
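As a first check, the GC status that TiDB records can be read directly from the mysql.tidb table (a routine query; tikv_gc_safe_point and tikv_gc_last_run_time show where GC is stuck):

mysql> select VARIABLE_NAME, VARIABLE_VALUE from mysql.tidb where VARIABLE_NAME like 'tikv_gc%';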

1. Errors found in the logs and via admin check table

mysql> admin check table SuperXXXall;
ERROR 1105 (HY000): unexpected resolve err: commit_ts_expired:<start_ts:450046591374983318 attempted_commit_ts:450046591977652390 key:"t\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365" min_commit_ts:450046592187367871 > , lock: key: 7480000000000431df5f698000000000000002040000000077b02bb303f80000012adbadf5, primary: 7480000000000431df5f698000000000000001038000000013daff1103886518f58e543f44, txnStartTS: 450046591374983318, lockForUpdateTS:4500465913

start_ts:
mysql> select tidb_parse_tso(450046591374983318);
+------------------------------------+
| tidb_parse_tso(450046591374983318) |
+------------------------------------+
| 2024-05-27 14:31:41.522000         |
+------------------------------------+
1 row in set (0.00 sec)

attempted_commit_ts:
mysql> select tidb_parse_tso(450046591977652390);
+------------------------------------+
| tidb_parse_tso(450046591977652390) |
+------------------------------------+
| 2024-05-27 14:31:43.821000         |
+------------------------------------+
1 row in set (0.00 sec)

min_commit_ts:
mysql> select tidb_parse_tso(450046592187367871);
+------------------------------------+
| tidb_parse_tso(450046592187367871) |
+------------------------------------+
| 2024-05-27 14:31:44.621000         |
+------------------------------------+
1 row in set (0.01 sec)

Note that attempted_commit_ts (14:31:43.821) is earlier than min_commit_ts (14:31:44.621), which is exactly why resolving this lock keeps failing with commit_ts_expired.

2. Confirm the data content

lock key:
mysql> select tidb_decode_key('7480000000000431df5f698000000000000002040000000077b02bb303f80000012adbadf5');
+-----------------------------------------------------------------------------------------------+
| tidb_decode_key('7480000000000431df5f698000000000000002040000000077b02bb303f80000012adbadf5') |
+-----------------------------------------------------------------------------------------------+
| {"index_id":2,"index_vals":{"ipnum":"3213214421"},"table_id":274911}                          |
+-----------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)

primary:
mysql> select tidb_decode_key('7480000000000431df5f698000000000000001038000000013daff1103886518f58e543f44');
+-----------------------------------------------------------------------------------------------+
| tidb_decode_key('7480000000000431df5f698000000000000001038000000013daff1103886518f58e543f44') |
+-----------------------------------------------------------------------------------------------+
| {"index_id":1,"index_vals":{"aid":"111111111","sn":"111111111111111111"},"table_id":274911}   |
+-----------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

index_id 1 corresponds to uk_aid_sn; index_id 2 corresponds to index(ip).
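To double-check the index_id-to-name mapping rather than trusting memory, INFORMATION_SCHEMA.TIDB_INDEXES carries it (a quick cross-check; table name taken from the error above):

mysql> select KEY_NAME, INDEX_ID from information_schema.tidb_indexes where TABLE_NAME = 'SuperXXXall';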

3. Confirm the region info

Find the regions that hold the two keys.
lock:
fdc@fdc-tidb04-onlinetidb:~$ curl 'http://{ip}:10080/mvcc/hex/7480000000000431df5f698000000000000002040000000077b02bb303f80000012adbadf5'
{
 "key": "7480000000000431DF5F698000000000000002040000000077B02BB303F80000012ADBADF5",
 "region_id": 366711639,
 "value": {
  "info": {
   "lock": {
    "start_ts": 450046591374983318,
    "primary": "dIAAAAAABDHfX2mAAAAAAAAAAQOAAAAAE9r/EQOIZRj1jlQ/RA==",
    "short_value": "MA=="
   },
   "writes": [
    {
     "start_ts": 450046591374983318,
     "commit_ts": 450046591977652390,
     "short_value": "MA=="
    }
   ]
  }
 }
}

primary: 
fdc@fdc-tidb04-onlinetidb:~$ curl 'http://{ip}:10080/mvcc/hex/7480000000000431df5f698000000000000001038000000013daff1103886518f58e543f44'
{
 "key": "7480000000000431DF5F698000000000000001038000000013DAFF1103886518F58E543F44",
 "region_id": 40507174,
 "value": {
  "info": {
   "writes": [
    {
     "start_ts": 450046591374983318,
     "commit_ts": 450046591977652390,
     "short_value": "eAAAASrbrfM="
    }
   ]
  }
 }
}

The lock is confirmed on the lock key.

4. Confirm the data distribution

tiup ctl:v5.3.1 pd -u {ip}:2379 region key 7480000000000431DF5F698000000000000002040000000077B02BB303F80000012ADBADF5
{
  "id": 344684358,
  "start_key": "748000000000042DFF8400000000000000F8",
  "end_key": "7480000000000431FFDF5F698000000000FF0000010380000000FF0000854D03885C91FFF79BAF31BF000000FC",
  "epoch": {
    "conf_ver": 18923,
    "version": 231376
  },
  "peers": [
    {
      "id": 1389106238,
      "store_id": 348337315,
      "role_name": "Voter"
    },
    {
      "id": 1506170971,
      "store_id": 477311915, 
      "role_name": "Voter"
    },
    {
      "id": 1619480854,
      "store_id": 935248854,
      "role_name": "Voter"
    }
  ],
  "leader": {
    "id": 1506170971,
    "store_id": 477311915,
    "role_name": "Voter"
  },
  "written_bytes": 2826,
  "read_bytes": 0,
  "written_keys": 2,
  "read_keys": 0,
  "approximate_size": 83,
  "approximate_keys": 1176620
}

5. Evict the leader to reduce the impact on the business

tiup ctl:v5.3.1 pd -u 10.90.230.8:2379 scheduler add evict-leader-scheduler 477311915

# remember to remove the scheduler once the work is done, so leaders can move back
tiup ctl:v5.3.1 pd -u 10.90.230.8:2379 scheduler remove evict-leader-scheduler 
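To confirm the eviction scheduler is in place (and later that it has been removed), list the active schedulers:

tiup ctl:v5.3.1 pd -u {ip}:2379 scheduler show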

6. Confirm the MVCC info

Encode the key into tikv-ctl's escaped form:
fdc@fdc-tidb04-onlinetidb:~$ tiup ctl:v5.3.1 tikv --to-escaped "7480000000000431df5f698000000000000002040000000077b02bb303f80000012adbadf5"
t\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365

Inspect the MVCC records (note the z prefix: TiKV prefixes data keys with z in RocksDB, and tikv-ctl's local mode expects that form):
tiup ctl:v5.3.1 tikv --data-dir=/ssd1/tidb/tikv/deploy1/data mvcc -k "zt\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365" --show-cf=lock,write,default

fdc@fdc-tidb02-onlinetikv:~$ tiup ctl:v5.3.1 tikv --data-dir=/ssd1/tidb/tikv/deploy1/data mvcc -k "zt\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365" --show-cf=lock,write,default
Starting component ctl: /home/fdc/.tiup/components/ctl/v5.3.1/ctl tikv --data-dir=/ssd1/tidb/tikv/deploy1/data mvcc -k zt\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365 --show-cf=lock,write,default
[2025/09/22 16:13:06.667 +08:00] [INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[2025/09/22 16:13:06.667 +08:00] [INFO] [mod.rs:479] ["encryption is disabled."]
[2025/09/22 16:13:06.671 +08:00] [WARN] [config.rs:587] ["compaction guard is disabled due to region info provider not available"]
[2025/09/22 16:13:06.671 +08:00] [WARN] [config.rs:682] ["compaction guard is disabled due to region info provider not available"]
no mvcc infos for zt\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365
no mvcc infos

Inspect with recover-mvcc in read-only mode:
fdc@fdc-tidb02-onlinetikv:~$ tiup ctl:v5.3.1 tikv --data-dir=/ssd1/tidb/tikv/deploy1/data recover-mvcc --read-only -r 40507174 -p 10.90.230.8:2379
Starting component ctl: /home/fdc/.tiup/components/ctl/v5.3.1/ctl tikv --data-dir=/ssd1/tidb/tikv/deploy1/data recover-mvcc --read-only -r 40507174 -p ******
[2025/09/22 16:17:42.552 +08:00] [INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[2025/09/22 16:17:42.552 +08:00] [INFO] [mod.rs:479] ["encryption is disabled."]
[2025/09/22 16:17:42.556 +08:00] [WARN] [config.rs:587] ["compaction guard is disabled due to region info provider not available"]
[2025/09/22 16:17:42.556 +08:00] [WARN] [config.rs:682] ["compaction guard is disabled due to region info provider not available"]
Recover regions: [40507174], pd: ["10.90.230.8:2379"], read_only: true
[2025/09/22 16:17:45.303 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=10.90.230.8:2379]
[2025/09/22 16:17:45.304 +08:00] [INFO] [<unknown>] ["TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter"]
[2025/09/22 16:17:45.307 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fa62ac05100 for subchannel 0x7fa66a612280"]
[2025/09/22 16:17:45.309 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://10.191.128.76:2379]
[2025/09/22 16:17:45.309 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fa62ac05280 for subchannel 0x7fa66a612600"]
[2025/09/22 16:17:45.310 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://10.90.230.8:2379]
[2025/09/22 16:17:45.313 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fa62ac05400 for subchannel 0x7fa66a612280"]
[2025/09/22 16:17:45.315 +08:00] [INFO] [util.rs:668] ["connected to PD member"] [endpoints=http://10.90.230.8:2379]
[2025/09/22 16:17:45.315 +08:00] [INFO] [util.rs:536] ["all PD endpoints are consistent"] [endpoints="[\"10.90.230.8:2379\"]"]
[2025/09/22 16:17:45.328 +08:00] [INFO] [debug.rs:956] ["thread 0: skip write 0 rows"]
[2025/09/22 16:17:45.328 +08:00] [INFO] [debug.rs:959] ["thread 0: total fix default: 0, lock: 0, write: 0"]
[2025/09/22 16:17:45.328 +08:00] [INFO] [debug.rs:968] ["thread 0 has finished working."]
success!

Still no concrete MVCC lock to be seen. At this point we were puzzled: were we looking in the wrong place? (As the later steps show, yes: region 40507174 only holds the primary key, and the lock itself lives in region 366711639.)

7. Confirm the region status

mysql> select * from TIKV_REGION_STATUS where region_id in (366711639, 40507174)\G
*************************** 1. row ***************************
                REGION_ID: 366711639
                START_KEY: 7480000000000431FFDF5F698000000000FF0000020400000000FF77A751AA03B80000FF0021B1B123000000FC
                  END_KEY: 7480000000000431FFDF5F698000000000FF0000020400000000FF77B1949A03980000FF0014C54AF8000000FC
                 TABLE_ID: 274911
                  DB_NAME: DB
               TABLE_NAME: SuperXXXall
                 IS_INDEX: 1
                 INDEX_ID: 2
               INDEX_NAME: ip
           EPOCH_CONF_VER: 18971
            EPOCH_VERSION: 210340
            WRITTEN_BYTES: 0
               READ_BYTES: 0
         APPROXIMATE_SIZE: 70
         APPROXIMATE_KEYS: 1108205
  REPLICATIONSTATUS_STATE: NULL
REPLICATIONSTATUS_STATEID: NULL
*************************** 2. row ***************************
                REGION_ID: 40507174
                START_KEY: 7480000000000431FFDF5F698000000000FF0000010380000000FF13DAFBF103886518FF703DBBA158000000FC
                  END_KEY: 7480000000000431FFDF5F698000000000FF0000010380000000FF13DB043303886A23FF907234F2FC000000FC
                 TABLE_ID: 274911
                  DB_NAME: DB
               TABLE_NAME: SuperXXXall
                 IS_INDEX: 1
                 INDEX_ID: 1
               INDEX_NAME: uk_aid_sn
           EPOCH_CONF_VER: 19460
            EPOCH_VERSION: 210350
            WRITTEN_BYTES: 0
               READ_BYTES: 0
         APPROXIMATE_SIZE: 68
         APPROXIMATE_KEYS: 973219
  REPLICATIONSTATUS_STATE: NULL
REPLICATIONSTATUS_STATEID: NULL
2 rows in set (1 min 19.87 sec)

Both regions cover index data only. If nothing else works, we can simply tombstone the regions or recreate them as empty regions; in theory the original row data is unaffected. See the sketch below.
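For reference, the tombstone fallback would look roughly like the sketch below (we did not run it). The target TiKV instance must be stopped first, and --force may be needed if PD still considers the peer valid:

# on the node holding the problem peer, with that tikv instance stopped:
tiup ctl:v5.3.1 tikv --data-dir=/ssd1/tidb/tikv/deploy1/data tombstone -r 366711639 -p {ip}:2379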

8. Confirm the region location

tiup ctl:v5.3.1 pd region 366711639
Starting component `ctl`: /home/fdc/.tiup/components/ctl/v5.3.1/ctl pd region 366711639
{
  "id": 366711639,
  "start_key": "7480000000000431FFDF5F698000000000FF0000020400000000FF77A751AA03B80000FF0021B1B123000000FC",
  "end_key": "7480000000000431FFDF5F698000000000FF0000020400000000FF77B1949A03980000FF0014C54AF8000000FC",
  "epoch": {
    "conf_ver": 18971,
    "version": 210340
  },
  "peers": [
    {
      "id": 737290763,
      "store_id": 400136022,
      "role_name": "Voter"
    },
    {
      "id": 1073621434,
      "store_id": 456470117,
      "role_name": "Voter"
    },
    {
      "id": 1553665102,
      "store_id": 477311917,
      "role_name": "Voter"
    }
  ],
  "leader": {
    "id": 1073621434,
    "store_id": 456470117,
    "role_name": "Voter"
  },
  "written_bytes": 0,
  "read_bytes": 0,
  "written_keys": 0,
  "read_keys": 0,
  "approximate_size": 70,
  "approximate_keys": 1108205
}

Try the same on another TiKV node.

Encode the key as before, then inspect the MVCC records:
tiup ctl:v5.3.1 tikv --data-dir=/ssd1/tikv/deploy/data mvcc -k "zt\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365" --show-cf=lock,write,default

fdc@fdc-tidb02-onlinetikv:~$ tiup ctl:v5.3.1 tikv --data-dir=/ssd1/tikv/deploy/data mvcc -k "zt\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365" --show-cf=lock,write,default
Starting component ctl: /home/fdc/.tiup/components/ctl/v5.3.1/ctl tikv --data-dir=/ssd1/tikv/deploy/data mvcc -k zt\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365 --show-cf=lock,write,default
[2025/09/22 16:13:06.667 +08:00] [INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[2025/09/22 16:13:06.667 +08:00] [INFO] [mod.rs:479] ["encryption is disabled."]
[2025/09/22 16:13:06.671 +08:00] [WARN] [config.rs:587] ["compaction guard is disabled due to region info provider not available"]
[2025/09/22 16:13:06.671 +08:00] [WARN] [config.rs:682] ["compaction guard is disabled due to region info provider not available"]
no mvcc infos for zt\200\000\000\000\000\0041\337_i\200\000\000\000\000\000\000\002\004\000\000\000\000w\260+\263\003\370\000\000\001*\333\255\365
no mvcc infos

Inspect with recover-mvcc in read-only mode:
fdc@fdc-tidb06-onlinetikv:~$  tiup ctl:v5.3.1 tikv --data-dir=/ssd1/tikv/deploy/data recover-mvcc --read-only -r 366711639 -p 10.90.230.8:2379
Starting component ctl: /home/fdc/.tiup/components/ctl/v5.3.1/ctl tikv --data-dir=/ssd1/tikv/deploy/data recover-mvcc --read-only -r 366711639 -p ******
[2025/09/25 11:29:04.220 +08:00] [INFO] [mod.rs:118] ["encryption: none of key dictionary and file dictionary are found."]
[2025/09/25 11:29:04.220 +08:00] [INFO] [mod.rs:479] ["encryption is disabled."]
[2025/09/25 11:29:04.224 +08:00] [WARN] [config.rs:587] ["compaction guard is disabled due to region info provider not available"]
[2025/09/25 11:29:04.224 +08:00] [WARN] [config.rs:682] ["compaction guard is disabled due to region info provider not available"]
Recover regions: [366711639], pd: ["10.90.230.8:2379"], read_only: true
[2025/09/25 11:29:07.042 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=10.90.230.8:2379]
[2025/09/25 11:29:07.044 +08:00] [INFO] [<unknown>] ["Disabling AF_INET6 sockets because socket() failed."]
[2025/09/25 11:29:07.044 +08:00] [INFO] [<unknown>] ["TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter"]
[2025/09/25 11:29:07.047 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7f9eb3606680 for subchannel 0x7f9ef7211f00"]
[2025/09/25 11:29:07.049 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://10.191.128.76:2379]
[2025/09/25 11:29:07.050 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7f9eb3606800 for subchannel 0x7f9ef7212280"]
[2025/09/25 11:29:07.050 +08:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://10.90.230.8:2379]
[2025/09/25 11:29:07.054 +08:00] [INFO] [<unknown>] ["New connected subchannel at 0x7f9eb3606980 for subchannel 0x7f9ef7211f00"]
[2025/09/25 11:29:07.056 +08:00] [INFO] [util.rs:668] ["connected to PD member"] [endpoints=http://10.90.230.8:2379]
[2025/09/25 11:29:07.056 +08:00] [INFO] [util.rs:536] ["all PD endpoints are consistent"] [endpoints="[\"10.90.230.8:2379\"]"]
[2025/09/25 11:29:07.257 +08:00] [INFO] [debug.rs:1098] ["thread 0: LOCK for_update_ts is less than WRITE ts, key: 7480000000000431FFDF5F698000000000FF0000020400000000FF77B02BB303F80000FF012ADBADF5000000FC, for_update_ts: 450046591374983318, commit_ts: 450046591977652390"]
[2025/09/25 11:29:07.580 +08:00] [INFO] [debug.rs:1063] ["thread 0: scan 1000000 rows"]
[2025/09/25 11:29:07.614 +08:00] [INFO] [debug.rs:956] ["thread 0: skip write 1 rows"]
[2025/09/25 11:29:07.614 +08:00] [INFO] [debug.rs:959] ["thread 0: total fix default: 0, lock: 1, write: 0"]
[2025/09/25 11:29:07.614 +08:00] [INFO] [debug.rs:968] ["thread 0 has finished working."]
success!

We still did not see the lock via the mvcc subcommand, but this time recover-mvcc reported one.

At least recover-mvcc saw the lock on the key, so we attempted a real repair (see the command below) and did the same on the other stores. After all three stores were handled, the problem was resolved: GC could resolve the lock normally and started reclaiming space again.
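The actual repair was the same recover-mvcc invocation without --read-only, run for the region on each of the three stores in turn (tikv-ctl's local mode requires the TiKV instance to be stopped while it works on the data directory):

tiup ctl:v5.3.1 tikv --data-dir=/ssd1/tikv/deploy/data recover-mvcc -r 366711639 -p 10.90.230.8:2379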

Case 2:

1. Confirm the errors from the logs

["resolve store address failed"] [err_code=KV:Unknown] [err="Other(\"[src/server/resolve.rs:124]: unknown error \\\"[components/pd_client/src/util.rs:954]: invalid store ID 8769963922, not found\\\"\")"] [store_id=8769963922] [thread_id=203]

[raft_client.rs:829] ["resolve store address failed"] [err_code=KV:Unknown] [err="Other(\"[src/server/resolve.rs:124]: unknown error \\\"[components/pd_client/src/util.rs:954]: invalid store ID 32996937412, not found\\\"\")"] [store_id=32996937412] [thread_id=203]

[raft_client.rs:829] ["resolve store address failed"] [err_code=KV:Unknown] [err="Other(\"[src/server/resolve.rs:124]: unknown error \\\"[components/pd_client/src/util.rs:954]: invalid store ID 15351649554, not found\\\"\")"] [store_id=15351649554] [thread_id=203]

2025-09-11 11:20:23 (UTC+08:00) PD 10.191.0.46:2379 [operator_controller.go:944] ["invalid store ID"] [store-id=8769963922]

2025-09-11 11:21:56 (UTC+08:00) PD 10.191.0.46:2379 [operator_controller.go:944] ["invalid store ID"] [store-id=15351649554]

2. Try restarting TiKV

A similar case on the forum was temporarily resolved by restarting TiKV: https://asktug.com/t/topic/1045115/1
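The restart itself was a normal rolling restart of the TiKV role (the cluster name here is a placeholder):

tiup cluster restart <cluster-name> -R tikv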

The restart had no effect; the TiKV logs still show the errors:
[2025/09/11 14:45:58.229 +08:00] [INFO] [resolve.rs:121] ["resolve store not found"] [store_id=32996937412] [thread_id=18]
[2025/09/11 14:45:58.229 +08:00] [ERROR] [raft_client.rs:829] ["resolve store address failed"] [err_code=KV:Unknown] [err="Other(\"[src/server/resolve.rs:124]: unknown error \\\"[components/pd_client/src/util.rs:954]: invalid store ID 32996937412, not found\\\"\")"] [store_id=32996937412] [thread_id=202]

3. Check for abnormal regions

 tiup ctl:v7.5.4 pd -u 10.191.0.46:2379 region check down-peer
{
    "count": 2,
    "regions": [
        {
            "id": 20278184100,
            "start_key": "7480000000000365FF955F698000000000FF0000020146334535FF35444144FF464645FF4441353938FF4644FF453136423143FF41FF31374538393446FFFF0000000000000000FFF703800000281D38FF963E000000000000F9",
            "end_key": "7480000000000365FF955F698000000000FF0000020146383942FF41383535FF303333FF3843363630FF3333FF464243333245FF37FF39423243463832FFFF0000000000000000FFF703800000281B55FFFC7B000000000000F9",
            "epoch": {
                "conf_ver": 9382,
                "version": 103309
            },
            "peers": [
                {
                    "role_name": "Learner",
                    "is_learner": true,
                    "id": 21571927681,
                    "store_id": 8769963922,
                    "role": 1
                },
                {
                    "role_name": "Voter",
                    "id": 22597784773,
                    "store_id": 23
                },
                {
                    "role_name": "Learner",
                    "is_learner": true,
                    "id": 22603334967,
                    "store_id": 8769963926,
                    "role": 1
                },
                {
                    "role_name": "Voter",
                    "id": 33232384663,
                    "store_id": 32996937412
                }
            ],
            "leader": {
                "role_name": "Voter",
                "id": 22597784773,
                "store_id": 23
            },
            "down_peers": [
                {
                    "peer": {
                        "role_name": "Learner",
                        "is_learner": true,
                        "id": 21571927681,
                        "store_id": 8769963922,
                        "role": 1
                    },
                    "down_seconds": 8285684
                },
                {
                    "peer": {
                        "role_name": "Learner",
                        "is_learner": true,
                        "id": 22603334967,
                        "store_id": 8769963926,
                        "role": 1
                    },
                    "down_seconds": 8285684
                },
                {
                    "peer": {
                        "role_name": "Voter",
                        "id": 33232384663,
                        "store_id": 32996937412
                    },
                    "down_seconds": 8285684
                }
            ],
            "pending_peers": [
                {
                    "role_name": "Learner",
                    "is_learner": true,
                    "id": 21571927681,
                    "store_id": 8769963922,
                    "role": 1
                },
                {
                    "role_name": "Learner",
                    "is_learner": true,
                    "id": 22603334967,
                    "store_id": 8769963926,
                    "role": 1
                },
                {
                    "role_name": "Voter",
                    "id": 33232384663,
                    "store_id": 32996937412
                }
            ],
            "cpu_usage": 0,
            "written_bytes": 0,
            "read_bytes": 0,
            "written_keys": 0,
            "read_keys": 0,
            "approximate_size": 465,
            "approximate_keys": 4353659
        },
        {
            "id": 29486864661,
            "start_key": "74800000000004E2FF605F728000003695FF505EE90000000000FA",
            "end_key": "74800000000004E2FF605F728000003695FF65A3D10000000000FA",
            "epoch": {
                "conf_ver": 13178,
                "version": 143231
            },
            "peers": [
                {
                    "role_name": "Voter",
                    "id": 29963431801,
                    "store_id": 21961705754
                },
                {
                    "role_name": "Learner",
                    "is_learner": true,
                    "id": 29976105937,
                    "store_id": 15351649554,
                    "role": 1
                },
                {
                    "role_name": "Learner",
                    "is_learner": true,
                    "id": 29998759908,
                    "store_id": 8769963926,
                    "role": 1
                },
                {
                    "role_name": "Voter",
                    "id": 33232575337,
                    "store_id": 32996937412
                }
            ],
            "leader": {
                "role_name": "Voter",
                "id": 29963431801,
                "store_id": 21961705754
            },
            "down_peers": [
                {
                    "peer": {
                        "role_name": "Learner",
                        "is_learner": true,
                        "id": 29976105937,
                        "store_id": 15351649554,
                        "role": 1
                    },
                    "down_seconds": 8285666
                },
                {
                    "peer": {
                        "role_name": "Learner",
                        "is_learner": true,
                        "id": 29998759908,
                        "store_id": 8769963926,
                        "role": 1
                    },
                    "down_seconds": 8285666
                },
                {
                    "peer": {
                        "role_name": "Voter",
                        "id": 33232575337,
                        "store_id": 32996937412
                    },
                    "down_seconds": 8285666
                }
            ],
            "pending_peers": [
                {
                    "role_name": "Learner",
                    "is_learner": true,
                    "id": 29976105937,
                    "store_id": 15351649554,
                    "role": 1
                },
                {
                    "role_name": "Learner",
                    "is_learner": true,
                    "id": 29998759908,
                    "store_id": 8769963926,
                    "role": 1
                },
                {
                    "role_name": "Voter",
                    "id": 33232575337,
                    "store_id": 32996937412
                }
            ],
            "cpu_usage": 0,
            "written_bytes": 0,
            "read_bytes": 0,
            "written_keys": 0,
            "read_keys": 0,
            "approximate_size": 767,
            "approximate_keys": 1350662
        }
    ]
}
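A jq filter can also enumerate every region that still references the vanished stores, in case more are affected than the down-peer check shows (a sketch using the store ids from the errors above):

tiup ctl:v7.5.4 pd -u {ip}:2379 region --jq='.regions[] | select([.peers[].store_id] | any(. == 8769963922 or . == 8769963926 or . == 32996937412 or . == 15351649554)) | .id'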

4. Try to add the replicas back

Both regions have peers on the vanished stores, so we tried to rebuild the replicas on other nodes. For the first region, 20278184100, only the leader is left; try adding replicas (pd-ctl usage for reference):

>> operator add add-peer 1 2   // add a replica of Region 1 on store 2
>> operator add add-learner 1 2  // add a learner replica of Region 1 on store 2

Add on store 16 and store 33472595794:
tiup ctl:v7.5.4 pd -u {ip}:2379 operator add add-peer 20278184100 16
Success! The operator is created.

Multiple operators cannot run on one region at the same time; we had to go store by store:
tiup ctl:v7.5.4 pd -u {ip}:2379 operator add add-peer 20278184100 33472595794

While the operators ran, confirm the tables involved and the scheduling configs.

Tables involved:
mysql> select * from TIKV_REGION_STATUS where REGION_ID = 20278184100\G
*************************** 1. row ***************************
                REGION_ID: 20278184100
                START_KEY: 7480000000000365FF955F698000000000FF0000020146334535FF35444144FF464645FF4441353938FF4644FF453136423143FF41FF31374538393446FFFF0000000000000000FFF703800000281D38FF963E000000000000F9
                  END_KEY: 7480000000000365FF955F698000000000FF0000020146383942FF41383535FF303333FF3843363630FF3333FF464243333245FF37FF39423243463832FFFF0000000000000000FFF703800000281B55FFFC7B000000000000F9
                 TABLE_ID: NULL
                  DB_NAME: NULL
               TABLE_NAME: NULL
                 IS_INDEX: 0
                 INDEX_ID: NULL
               INDEX_NAME: NULL
             IS_PARTITION: 0
             PARTITION_ID: NULL
           PARTITION_NAME: NULL
           EPOCH_CONF_VER: 9382
            EPOCH_VERSION: 103309
            WRITTEN_BYTES: 0
               READ_BYTES: 0
         APPROXIMATE_SIZE: 465
         APPROXIMATE_KEYS: 4353659
  REPLICATIONSTATUS_STATE: NULL
REPLICATIONSTATUS_STATEID: NULL
1 row in set (32.32 sec)

mysql> select * from TIKV_REGION_STATUS where REGION_ID = 29486864661\G
*************************** 1. row ***************************
                REGION_ID: 29486864661
                START_KEY: 74800000000004E2FF605F728000003695FF505EE90000000000FA
                  END_KEY: 74800000000004E2FF605F728000003695FF65A3D10000000000FA
                 TABLE_ID: 10505
                  DB_NAME: mars_p1log
               TABLE_NAME: loginrole
                 IS_INDEX: 0
                 INDEX_ID: NULL
               INDEX_NAME: NULL
             IS_PARTITION: 1
             PARTITION_ID: 320096
           PARTITION_NAME: p20240815
           EPOCH_CONF_VER: 13178
            EPOCH_VERSION: 143231
            WRITTEN_BYTES: 0
               READ_BYTES: 0
         APPROXIMATE_SIZE: 767
         APPROXIMATE_KEYS: 1350662
  REPLICATIONSTATUS_STATE: NULL
REPLICATIONSTATUS_STATEID: NULL
1 row in set (31.31 sec)

The add-peer operator eventually timed out after 51m.

Try remove-peer
>> operator add remove-peer 1 2                         // remove the replica of Region 1 from store 2
tiup ctl:v7.5.4 pd -u {ip}:2379 operator check 20278184100
tiup ctl:v7.5.4 pd -u {ip}:2379 operator remove 20278184100
tiup ctl:v7.5.4 pd -u {ip}:2379 operator add remove-peer 20278184100 8769963922

The remove-peer on the first (empty) region also timed out eventually:
[2025/09/11 16:21:55.912 +08:00] [INFO] [operator_controller.go:659] ["operator timeout"] [region-id=20278184100] [takes=4m39.750985291s] [operator="\"admin-remove-peer {rm peer: store [8769963922]} (kind:admin,region, region:20278184100(103309, 9382), createAt:2025-09-11 16:17:16.161505793 +0800 CST m=+17048519.309387956, startAt:2025-09-11 16:17:16.161565328 +0800 CST m=+17048519.309447485, currentStep:0, size:465, steps:[0:{remove peer on store 8769963922}], timeout:[4m39s]) timeout\""] [additional-info="{\"cancel-reason\":\"timeout\"}"]

Adding a peer for the second region timed out as well:
fdc@fdc-tidb01-tidbp1:~$ tiup ctl:v7.5.4 pd -u {ip}:2379 operator add add-peer 29486864661 16
Success! The operator is created.

# check the relevant configs
mysql> show config where name like '%enable-remove-down-replica';
+------+------------------+-------------------------------------+-------+
| Type | Instance         | Name                                | Value |
+------+------------------+-------------------------------------+-------+
| pd   | {ip}:2379 | schedule.enable-remove-down-replica | true  |
| pd   | {ip}:2379 | schedule.enable-remove-down-replica | true  |
| pd   | {ip}:2379 | schedule.enable-remove-down-replica | true  |
+------+------------------+-------------------------------------+-------+
3 rows in set (0.06 sec)

mysql> show config where name like '%enable-replace-offline-replica';
+------+------------------+-----------------------------------------+-------+
| Type | Instance         | Name                                    | Value |
+------+------------------+-----------------------------------------+-------+
| pd   | {ip}:2379 | schedule.enable-replace-offline-replica | true  |
| pd   | {ip}:2379 | schedule.enable-replace-offline-replica | true  |
| pd   | {ip}:2379 | schedule.enable-replace-offline-replica | true  |
+------+------------------+-----------------------------------------+-------+

The configs were confirmed to all be enabled.

We then prepared a lossy recovery (unsafe recover), but it would not run either:

fdc@fdc-tidb01-tidbp1:~/tipd/deploy/log$ tiup ctl:v7.5.4 pd -u {ip}:2379 store 32996937412
Failed to get store: [404] "[PD:core:ErrStoreNotFound]store 32996937412 not found"

fdc@fdc-tidb01-tidbp1:~/tipd/deploy/log$ tiup ctl:v7.5.4 pd -u {ip}:2379 store 8769963922
Failed to get store: [404] "[PD:core:ErrStoreNotFound]store 8769963922 not found"

fdc@fdc-tidb01-tidbp1:~/tipd/deploy/log$ tiup ctl:v7.5.4 pd -u {ip}:2379 store 8769963926
Failed to get store: [404] "[PD:core:ErrStoreNotFound]store 8769963926 not found"

----------------

tiup ctl:v7.5.4 pd -u {ip}:2379 unsafe remove-failed-stores 8769963922,8769963926,32996937412
Failed! [500] "[PD:unsaferecovery:ErrUnsafeRecoveryInvalidInput]invalid input store 32996937412 doesn't exist"

5. Try recreate-region

https://tidb.net/blog/ddef26a5#4%C2%A0%C2%A0%20%E5%BC%82%E5%B8%B8%E5%A4%84%E7%90%86%E4%B8%89%E6%9D%BF%E6%96%A7/4.3%20%E7%AC%AC%E4%B8%89%E6%8B%9B%EF%BC%9A%E9%87%8D%E5%BB%BAregion

4.3 Trick 3: rebuild the region
       If all replicas of a region are lost, or a handful of data-less empty regions cannot elect a leader, the region can be rebuilt with recreate-region.
(1)    All replicas lost, after a multi-replica failure recovery has been executed
Check for regions whose replicas are all lost; inside the if, put the store_ids of the failed TiKV nodes:
pd-ctl region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total |map(if .==(4,5,7) then . else empty end)|length>$total-length)}' |sort
(2)    A few regions hold no data and cannot elect a leader, and nothing has been done to the cluster yet
       Use curl http://tidb_ip:10080/regions/{region_id} to check what the region contains. If the frames field is empty, the region is a data-less empty region and rebuilding it is harmless; otherwise data will be lost.
(3)    Rebuild the region
       Stop the surviving TiKV instances that host the region, then run on one of the healthy TiKV nodes:
    tikv-ctl --data-dir /data/tidb-data/tikv-20160 recreate-region -p 'pd_ip:pd_port' -r  <region_id>
       Note: older versions use the --db parameter instead of --data-dir; point it at the directory of a healthy TiKV. Also, when copying the command, watch out for full-width quotes and dashes.
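Applied to this cluster, the rebuild would look roughly like this (a sketch only: every surviving TiKV instance hosting the region has to be stopped first, and the data dir is the one used earlier):

# on one healthy tikv node, with the involved tikv instances stopped:
tiup ctl:v7.5.4 tikv --data-dir=/ssd1/tikv/deploy/data recreate-region -p '{ip}:2379' -r 20278184100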

Confirm the region contents:

curl 'http://{ip}:10081/regions/20278184100'
{
 "start_key": "dIAAAAAAA2WVX2mAAAAAAAAAAgFGM0U1NURBRP9GRkVEQTU5OP9GREUxNkIxQ/9BMTdFODk0Rv8AAAAAAAAAAPcDgAAAKB04lj4=",
 "end_key": "dIAAAAAAA2WVX2mAAAAAAAAAAgFGODlCQTg1Nf8wMzM4QzY2MP8zM0ZCQzMyRf83OUIyQ0Y4Mv8AAAAAAAAAAPcDgAAAKBtV/Hs=",
 "start_key_hex": "7480000000000365955f698000000000000002014633453535444144ff4646454441353938ff4644453136423143ff4131374538393446ff0000000000000000f703800000281d38963e",
 "end_key_hex": "7480000000000365955f698000000000000002014638394241383535ff3033333843363630ff3333464243333245ff3739423243463832ff0000000000000000f703800000281b55fc7b",
 "region_id": 20278184100,
 "frames": null
 
curl 'http://{ip}:10081/regions/29486864661'
{
 "start_key": "dIAAAAAABOJgX3KAAAA2lVBe6Q==",
 "end_key": "dIAAAAAABOJgX3KAAAA2lWWj0Q==",
 "start_key_hex": "74800000000004e2605f728000003695505ee9",
 "end_key_hex": "74800000000004e2605f72800000369565a3d1",
 "region_id": 29486864661,
 "frames": [
  {
   "db_name": "db",
   "table_name": "table",
   "table_id": 320096,
   "is_record": true, (包含具体记录,不是索引)
   "record_id": 234433306345
  }
 ]
}

The region with empty frames is safe to rebuild. But in the end this had no effect either.

6. Final solution: unsafe remove-failed-stores --auto-detect

Reference: https://asktug.com/t/topic/1029821/105

tiup ctl:v7.5.4 pd -u {ip}:2379 unsafe remove-failed-stores --auto-detect

fdc@fdc-tidb01-tidbp1:~$ tiup ctl:v7.5.4 pd -u {ip}:2379 unsafe remove-failed-stores show
Starting component ctl: /home/fdc/.tiup/components/ctl/v7.5.4/ctl pd -u 10.191.0.46:2379 unsafe remove-failed-stores show
[
  {
    "info": "Unsafe recovery enters collect report stage",
    "time": "2025-09-23 09:40:55.189",
    "details": [
      "auto detect mode with no specified Failed stores"
    ]
  },
  {
    "info": "Unsafe recovery enters demote Failed voter stage",
    "time": "2025-09-23 09:41:44.825",
    "actions": {
      "store 21961705754": [
        "tombstone the peer of region 29486864661"
      ],
      "store 23": [
        "tombstone the peer of region 20278184100"
      ]
    }
  },
  {
    "info": "Unsafe recovery Finished",
    "time": "2025-09-23 09:43:13.212",
    "details": [
      "affected table ids: 320096, 222613",
      "no newly created empty regions"
    ]
  }
]

After this completed, the locks could be resolved normally and GC ran again.
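Verification afterwards is the same pair of checks used earlier: the down-peer list should come back empty, and the GC safepoint should start advancing again:

tiup ctl:v7.5.4 pd -u {ip}:2379 region check down-peer
mysql> select VARIABLE_NAME, VARIABLE_VALUE from mysql.tidb where VARIABLE_NAME like 'tikv_gc%';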

Summary

Recovery approaches:

  1. Clean up locks that GC cannot resolve with the tikv-ctl recover-mvcc tool
  2. Clean up leftover store-id references with unsafe remove-failed-stores --auto-detect

Operational takeaways:

  1. When scaling in TiKV, avoid force scale-in whenever possible. Budget enough time for the operation and go through the normal decommission flow
  2. Keep the cluster version up to date
  3. Monitor the GC safepoints of every component. In both of these cases the TiDB and PD safepoints looked normal, yet TiKV could never actually run GC and reclaim space; catch and handle this promptly (see the sketch below)
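For point 3, one low-cost check that would have caught both cases is watching the service GC safepoints on PD alongside the TiDB-side safepoint, and alerting when they stop advancing for more than a few GC intervals:

# service safepoints on PD (CDC/BR tasks can also hold GC back)
tiup ctl:v7.5.4 pd -u {ip}:2379 service-gc-safepoint
# TiDB-side GC status
mysql> select VARIABLE_NAME, VARIABLE_VALUE from mysql.tidb where VARIABLE_NAME like 'tikv_gc%';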


Copyright notice: This is an original article by a TiDB community user, published under the CC BY-NC-SA 4.0 license. Please include a link to the original and this notice when reposting.
