1. Background
The office lost power unexpectedly and the UPS drained as well. After power came back and the servers were rebooted, the TiDB cluster would not come up: every start reported PD as successful (it actually was not) and then hung at the TiKV stage. Checking the PD and TiKV logs under tidb-deploy showed that PD had never started successfully.
TiKV log excerpt:
[2025/06/25 10:53:30.904 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.111:2379] [thread_id=1]
[2025/06/25 10:53:30.904 +08:00] [INFO] [<unknown>] ["subchannel 0x7f555a8cd000 {address=ipv4:192.168.0.111:2379, args=grpc.client_channel_factory=0x7f555a89cbf0, grpc.default_authority=192.168.0.111:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f555a8344e0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f555a90d510, grpc.server_uri=dns:///192.168.0.111:2379}: connect failed: {\"created\":\"@1750820010.904417027\",\"description\":\"Failed to connect to remote host: No route to host\",\"errno\":113,\"file\":\"/workspace/.cargo/registry/src/mirrors.tuna.tsinghua.edu.cn-df7c3c540f42cdbd/grpcio-sys-0.10.3+1.44.0-patched/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":205,\"os_error\":\"No route to host\",\"syscall\":\"getsockopt(SO_ERROR)\",\"target_address\":\"ipv4:192.168.0.111:2379\"}"] [thread_id=13]
[2025/06/25 10:53:30.904 +08:00] [INFO] [<unknown>] ["subchannel 0x7f555a8cd000 {address=ipv4:192.168.0.111:2379, args=grpc.client_channel_factory=0x7f555a89cbf0, grpc.default_authority=192.168.0.111:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f555a8344e0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f555a90d510, grpc.server_uri=dns:///192.168.0.111:2379}: Retry in 1000 milliseconds"] [thread_id=13]
[2025/06/25 10:53:30.904 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [endpoints=192.168.0.111:2379] [thread_id=1]
[2025/06/25 10:53:30.904 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.112:2379] [thread_id=1]
[2025/06/25 10:53:32.905 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.112:2379] [thread_id=1]
[2025/06/25 10:53:33.206 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.110:2379] [thread_id=1]
[2025/06/25 10:53:35.207 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.110:2379] [thread_id=1]
[2025/06/25 10:53:35.207 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.111:2379] [thread_id=1]
[2025/06/25 10:53:37.208 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.111:2379] [thread_id=1]
[2025/06/25 10:53:37.208 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.112:2379] [thread_id=1]
[2025/06/25 10:53:39.209 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.112:2379] [thread_id=1]
[2025/06/25 10:53:39.510 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.110:2379] [thread_id=1]
[2025/06/25 10:53:41.511 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.110:2379] [thread_id=1]
[2025/06/25 10:53:41.511 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.111:2379] [thread_id=1]
[2025/06/25 10:53:43.512 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.111:2379] [thread_id=1]
[2025/06/25 10:53:43.512 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.112:2379] [thread_id=1]
[2025/06/25 10:53:45.513 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.112:2379] [thread_id=1]
[2025/06/25 10:53:45.814 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.110:2379] [thread_id=1]
[2025/06/25 10:53:47.815 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.110:2379] [thread_id=1]
[2025/06/25 10:53:47.815 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.111:2379] [thread_id=1]
[2025/06/25 10:53:47.815 +08:00] [INFO] [<unknown>] ["subchannel 0x7f555a9ca800 {address=ipv4:192.168.0.111:2379, args=grpc.client_channel_factory=0x7f555a89cbf0, grpc.default_authority=192.168.0.111:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f555a8344e0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f555a90d510, grpc.server_uri=dns:///192.168.0.111:2379}: connect failed: {\"created\":\"@1750820027.815459575\",\"description\":\"Failed to connect to remote host: No route to host\",\"errno\":113,\"file\":\"/workspace/.cargo/registry/src/mirrors.tuna.tsinghua.edu.cn-df7c3c540f42cdbd/grpcio-sys-0.10.3+1.44.0-patched/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":205,\"os_error\":\"No route to host\",\"syscall\":\"getsockopt(SO_ERROR)\",\"target_address\":\"ipv4:192.168.0.111:2379\"}"] [thread_id=13]
[2025/06/25 10:53:47.815 +08:00] [INFO] [<unknown>] ["subchannel 0x7f555a9ca800 {address=ipv4:192.168.0.111:2379, args=grpc.client_channel_factory=0x7f555a89cbf0, grpc.default_authority=192.168.0.111:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f555a8344e0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f555a90d510, grpc.server_uri=dns:///192.168.0.111:2379}: Retry in 1000 milliseconds"] [thread_id=13]
[2025/06/25 10:53:47.815 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [endpoints=192.168.0.111:2379] [thread_id=1]
[2025/06/25 10:53:47.815 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.112:2379] [thread_id=1]
[2025/06/25 10:53:47.815 +08:00] [INFO] [<unknown>] ["subchannel 0x7f555a8c0800 {address=ipv4:192.168.0.112:2379, args=grpc.client_channel_factory=0x7f555a89cbf0, grpc.default_authority=192.168.0.112:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f555a8344e0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f555a90d510, grpc.server_uri=dns:///192.168.0.112:2379}: connect failed: {\"created\":\"@1750820027.815866425\",\"description\":\"Failed to connect to remote host: No route to host\",\"errno\":113,\"file\":\"/workspace/.cargo/registry/src/mirrors.tuna.tsinghua.edu.cn-df7c3c540f42cdbd/grpcio-sys-0.10.3+1.44.0-patched/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":205,\"os_error\":\"No route to host\",\"syscall\":\"getsockopt(SO_ERROR)\",\"target_address\":\"ipv4:192.168.0.112:2379\"}"] [thread_id=13]
[2025/06/25 10:53:47.815 +08:00] [INFO] [<unknown>] ["subchannel 0x7f555a8c0800 {address=ipv4:192.168.0.112:2379, args=grpc.client_channel_factory=0x7f555a89cbf0, grpc.default_authority=192.168.0.112:2379, grpc.initial_reconnect_backoff_ms=1000, grpc.internal.subchannel_pool=0x7f555a8344e0, grpc.keepalive_time_ms=10000, grpc.keepalive_timeout_ms=3000, grpc.max_receive_message_length=-1, grpc.max_reconnect_backoff_ms=5000, grpc.max_send_message_length=-1, grpc.primary_user_agent=grpc-rust/0.10.4, grpc.resource_quota=0x7f555a90d510, grpc.server_uri=dns:///192.168.0.112:2379}: Retry in 999 milliseconds"] [thread_id=13]
[2025/06/25 10:53:47.816 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: \"failed to connect to all addresses\", details: [] }))"] [endpoints=192.168.0.112:2379] [thread_id=1]
[2025/06/25 10:53:48.116 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.110:2379] [thread_id=1]
[2025/06/25 10:53:50.117 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.110:2379] [thread_id=1]
[2025/06/25 10:53:50.117 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.111:2379] [thread_id=1]
[2025/06/25 10:53:52.118 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.111:2379] [thread_id=1]
[2025/06/25 10:53:52.118 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.112:2379] [thread_id=1]
[2025/06/25 10:53:54.119 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.112:2379] [thread_id=1]
[2025/06/25 10:53:54.420 +08:00] [INFO] [util.rs:639] ["connecting to PD endpoint"] [endpoints=192.168.0.110:2379] [thread_id=1]
[2025/06/25 10:53:56.421 +08:00] [INFO] [util.rs:601] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=192.168.0.110:2379] [thread_id=1]
Checking PD next: the PD cluster had not started successfully. Each member kept starting elections but could not reach its peers:
[2025/06/25 11:09:33.494 +08:00] [INFO] [raft] [zap_raft.go:77] ["c269c4a75aa6e6c1 is starting a new election at term 2"]
[2025/06/25 11:09:33.494 +08:00] [INFO] [raft] [zap_raft.go:77] ["c269c4a75aa6e6c1 became pre-candidate at term 2"]
[2025/06/25 11:09:33.494 +08:00] [INFO] [raft] [zap_raft.go:77] ["c269c4a75aa6e6c1 received MsgPreVoteResp from c269c4a75aa6e6c1 at term 2"]
[2025/06/25 11:09:33.494 +08:00] [INFO] [raft] [zap_raft.go:77] ["c269c4a75aa6e6c1 [logterm: 2, index: 1900314] sent MsgPreVote request to 2a308483ede69c8d at term 2"]
[2025/06/25 11:09:33.494 +08:00] [INFO] [raft] [zap_raft.go:77] ["c269c4a75aa6e6c1 [logterm: 2, index: 1900314] sent MsgPreVote request to bb1260d81c39ab2f at term 2"]
[2025/06/25 11:09:36.591 +08:00] [WARN] [probing_status.go:68] ["prober detected unhealthy status"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=2a308483ede69c8d] [rtt=0s] [error="dial tcp 192.168.0.111:2380: connect: no route to host"]
[2025/06/25 11:09:36.591 +08:00] [WARN] [probing_status.go:68] ["prober detected unhealthy status"] [round-tripper-name=ROUND_TRIPPER_SNAPSHOT] [remote-peer-id=2a308483ede69c8d] [rtt=0s] [error="dial tcp 192.168.0.111:2380: connect: no route to host"]
[2025/06/25 11:09:36.594 +08:00] [WARN] [probing_status.go:68] ["prober detected unhealthy status"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=bb1260d81c39ab2f] [rtt=0s] [error="dial tcp 192.168.0.112:2380: connect: no route to host"]
[2025/06/25 11:09:36.594 +08:00] [WARN] [probing_status.go:68] ["prober detected unhealthy status"] [round-tripper-name=ROUND_TRIPPER_SNAPSHOT] [remote-peer-id=bb1260d81c39ab2f] [rtt=0s] [error="dial tcp 192.168.0.112:2380: connect: no route to host"]
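Both logs point at the same symptom: "no route to host" on port 2379 (the PD client port) and 2380 (the PD peer port). Before touching any data, it is worth confirming basic reachability between the nodes. A minimal probe sketch, using the three PD hosts from the logs above (replace them with your own):

```shell
#!/usr/bin/env bash
# Probe the PD client (2379) and peer (2380) ports on every PD host.
# The IPs below are taken from the logs above; substitute your own.
for host in 192.168.0.110 192.168.0.111 192.168.0.112; do
  for port in 2379 2380; do
    if timeout 1 bash -c "</dev/tcp/${host}/${port}" 2>/dev/null; then
      echo "OK   ${host}:${port}"
    else
      echo "FAIL ${host}:${port}"   # matches the 'no route to host' errors
    fi
  done
done
```

If probes still fail once the hosts are up, the fix lies in the network (firewall rules, routes, NIC configuration), not in TiDB itself.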
2. Following tutorials found online, I tried to recover by removing two of the PD nodes; it did not succeed. This approach (method 1) is still worth trying first, since it is simple and carries little risk of data loss.
https://docs.pingcap.com/zh/tidb/stable/production-deployment-using-tiup/
3. Reinstall the TiDB cluster. Before doing so, record the current cluster ID and the largest allocated ID, and back up the TiKV data disks (tidb-data); backing up the data is the most important step.
Current cluster ID:
grep "cluster id" /www/tidb-deploy/pd-2379/log/pd.log
cluster-id=7510129864311934607
Get the largest allocated ID. The alloc-id value is used in step 6: take the value found here and add 1000 to it; the result is the alloc-id to pass to pd-recover.
grep "idAllocator allocates a new id" {{/path/to}}/pd*.log | awk -F'=' '{print $2}' | awk -F']' '{print $1}' | sort -r -n | head -n 1
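The grep/awk pipeline above, plus the +1000 margin, can be wrapped into one small script. A sketch; PD_LOG keeps the placeholder path from the command above and must be pointed at your real pd*.log files:

```shell
#!/usr/bin/env bash
# Extract the largest ID PD ever allocated and add a 1000 margin, so
# pd-recover never hands out an ID the old data may already be using.
# PD_LOG is a placeholder: set it to your actual PD log files.
# (Left unquoted below on purpose so the glob expands.)
PD_LOG=${PD_LOG:-/www/tidb-deploy/pd-2379/log/pd*.log}

max_id=$(grep -h "idAllocator allocates a new id" $PD_LOG 2>/dev/null \
  | awk -F'=' '{print $2}' \
  | awk -F']' '{print $1}' \
  | sort -rn | head -n 1)

echo "largest allocated id: ${max_id:-none found}"
echo "pass to pd-recover:   -alloc-id $(( max_id + 1000 ))"
```

The margin exists because allocations made shortly before the crash may never have reached the log; 1000 leaves room for them.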
4. Wipe PD's old data directories, then reinstall the TiDB cluster with tiup. Once the new cluster is up, move the old TiKV data into the new cluster's tidb-data directory and restart the cluster.
Official deployment documentation:
https://docs.pingcap.com/zh/tidb/stable/production-deployment-using-tiup/
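Before wiping anything in step 4, snapshot the TiKV data directory: it holds the only copy of the actual rows. A minimal sketch; /www/tidb-data is an assumed path (mirroring the /www/tidb-deploy layout above), and TiKV should be stopped first so the copy is consistent:

```shell
#!/usr/bin/env bash
# Copy the TiKV data dir to a backup location before any cleanup.
# SRC and DEST are assumptions: override them with your real paths,
# and make sure DEST sits on a disk with enough free space.
SRC=${SRC:-/www/tidb-data}
DEST=${DEST:-/backup/tidb-data-$(date +%F)}

if [ -d "$SRC" ]; then
  mkdir -p "$DEST"
  cp -a "$SRC"/. "$DEST"/    # -a preserves owners, perms, timestamps
  echo "backed up $SRC -> $DEST"
else
  echo "source $SRC not found; set SRC to your tidb-data path" >&2
fi
```

rsync -a works equally well here and can resume an interrupted copy.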
5. After migrating the old TiKV data into the new cluster, TiKV refused to start; its log reported cluster ID mismatch.
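The two IDs can be pulled out of the logs to confirm the mismatch. A sketch; the paths follow the /www/tidb-deploy layout above, and tikv-20160 is an assumed directory name based on TiKV's default port:

```shell
#!/usr/bin/env bash
# Show the cluster ID the new PD generated next to the mismatch error
# that TiKV logged. Both paths are assumptions; adjust to your deploy.
grep -o 'cluster-id=[0-9]*' /www/tidb-deploy/pd-2379/log/pd.log 2>/dev/null | tail -n 1
grep 'cluster ID mismatch' /www/tidb-deploy/tikv-20160/log/tikv.log 2>/dev/null | tail -n 1
```

If the first command prints an ID different from the one recorded in step 3, step 6 below is the fix.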
6. Change the cluster ID: use pd-recover to set the new cluster's cluster ID to the old cluster's ID.
tiup pd-recover -endpoints http://192.168.0.110:2379 -cluster-id 7510129864311934607 -alloc-id 6000
7. Restart the cluster. It started successfully and the problem was resolved.
tiup cluster restart tidb-ly
Check the status:
tiup cluster display tidb-ly