(1) Symptom: during an upgrade with tiup, stopping a tikv node times out with ERROR Run Command Timeout, even though logging in to 192.168.1.43 shows that tikv has in fact already stopped.
2020-06-29T05:21:18.289+0800 INFO Stopping instance 192.168.1.43
2020-06-29T05:22:58.364+0800 INFO SSHCommand {"host": "192.168.1.43", "port": "22", "cmd": "export LANG=C; PATH=$PATH:/usr/bin:/usr/sbin sudo -H -u root bash -c \"systemctl daemon-reload && systemctl stop tikv-20160.service\"", "stdout": "", "stderr": "Run Command Timeout!\n"}
2020-06-29T05:22:58.364+0800 ERROR Run Command Timeout!
2020-06-29T05:22:58.364+0800 INFO Execute command finished {"code": 1, "error": "failed to upgrade: failed to stop 192.168.1.43: failed to stop: tikv 192.168.1.43:20160: executor.ssh.execute_timedout: Execute command over SSH timedout for 'tidb@192.168.1.43:22' {ssh_stderr: Run Command Timeout!\n, ssh_stdout: , ssh_command: export LANG=C; PATH=$PATH:/usr/bin:/usr/sbin sudo -H -u root bash -c \"systemctl daemon-reload && systemctl stop tikv-20160.service\"}", "errorVerbose": "executor.ssh.execute_timedout: Execute command over SSH timedout for 'tidb@192.168.1.43:22' {ssh_stderr: Run Command Timeout!\n, ssh_stdout: , ssh_command: export LANG=C; PATH=$PATH:/usr/bin:/usr/sbin sudo -H -u root bash -c \"systemctl daemon-reload && systemctl stop tikv-20160.service\"}\n at github.com/pingcap/tiup/pkg/cluster/executor.(*SSHExecutor).Execute()\n\tgithub.com/pingcap/tiup@/pkg/cluster/executor/ssh.go:172\n at github.com/pingcap/tiup/pkg/cluster/module.(*SystemdModule).Execute()\n\tgithub.com/pingcap/tiup@/pkg/cluster/module/systemd.go:89\n at github.com/pingcap/tiup/pkg/cluster/operation.stopInstance()\n\tgithub.com/pingcap/tiup@/pkg/cluster/operation/action.go:574\n at github.com/pingcap/tiup/pkg/cluster/operation.Upgrade()\n\tgithub.com/pingcap/tiup@/pkg/cluster/operation/upgrade.go:99\n at github.com/pingcap/tiup/pkg/cluster/task.(*ClusterOperate).Execute()\n\tgithub.com/pingcap/tiup@/pkg/cluster/task/action.go:53\n at github.com/pingcap/tiup/pkg/cluster/task.(*Serial).Execute()\n\tgithub.com/pingcap/tiup@/pkg/cluster/task/task.go:189\n at github.com/pingcap/tiup/components/cluster/command.upgrade()\n\tgithub.com/pingcap/tiup@/components/cluster/command/upgrade.go:174\n at github.com/pingcap/tiup/components/cluster/command.newUpgradeCmd.func1()\n\tgithub.com/pingcap/tiup@/components/cluster/command/upgrade.go:50\n at github.com/spf13/cobra.(*Command).execute()\n\tgithub.com/spf13/cobra@v1.0.0/command.go:842\n at github.com/spf13/cobra.(*Command).ExecuteC()\n\tgithub.com/spf13/cobra@v1.0.0/command.go:950\n at github.com/spf13/cobra.(*Command).Execute()\n\tgithub.com/spf13/cobra@v1.0.0/command.go:887\n at github.com/pingcap/tiup/components/cluster/command.Execute()\n\tgithub.com/pingcap/tiup@/components/cluster/command/root.go:220\n at main.main()\n\tgithub.com/pingcap/tiup@/components/cluster/main.go:19\n at runtime.main()\n\truntime/proc.go:203\n at runtime.goexit()\n\truntime/asm_amd64.s:1357\nfailed to stop: tikv 192.168.1.43:20160\ngithub.com/pingcap/tiup/pkg/cluster/operation.stopInstance\n\tgithub.com/pingcap/tiup@/pkg/cluster/operation/action.go:593\ngithub.com/pingcap/tiup/pkg/cluster/operation.Upgrade\n\tgithub.com/pingcap/tiup@/pkg/cluster/operation/upgrade.go:99\ngithub.com/pingcap/tiup/pkg/cluster/task.(*ClusterOperate).Execute\n\tgithub.com/pingcap/tiup@/pkg/cluster/task/action.go:53\ngithub.com/pingcap/tiup/pkg/cluster/task.(*Serial).Execute\n\tgithub.com/pingcap/tiup@/pkg/cluster/task/task.go:189\ngithub.com/pingcap/tiup/components/cluster/command.upgrade\n\tgithub.com/pingcap/tiup@/components/cluster/command/upgrade.go:174\ngithub.com/pingcap/tiup/components/cluster/command.newUpgradeCmd.func1\n\tgithub.com/pingcap/tiup@/components/cluster/command/upgrade.go:50\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/cobra@v1.0.0/command.go:842\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/cobra@v1.0.0/command.go:950\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/cobra@v1.0.0/command.go:887\ngithub.com/pingcap/tiup/components/cluster/command.Execute\n\tgithub.com/pingcap/tiup@/components/cluster/command/root.go:220\nmain.main\n\tgithub.com/pingcap/tiup@/components/cluster/main.go:19\nruntime.main\n\truntime/proc.go:203\nruntime.goexit\n\truntime/asm_amd64.s:1357\nfailed to stop 192.168.1.43\nfailed to upgrade"}
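The check mentioned in the symptom (logging in to the host to confirm that tikv has already stopped) can be done with systemctl. This is a minimal sketch, assuming it is run on 192.168.1.43 itself and that the unit name is tikv-20160.service as shown in the log. The helper name is_stopped is hypothetical and only used for illustration.

```shell
# Run on the TiKV host (192.168.1.43 in the log above).
SERVICE="tikv-20160.service"   # unit name taken from the SSHCommand log line

# Hypothetical helper: treat the systemd states "inactive" and "failed"
# as "the service is no longer running".
is_stopped() {
  case "$1" in
    inactive|failed) return 0 ;;
    *) return 1 ;;
  esac
}

# `systemctl is-active` prints the unit state (for example active, inactive, failed).
# It exits non-zero for anything but "active", so `|| true` keeps the script going.
STATE="$(systemctl is-active "$SERVICE" 2>/dev/null || true)"

if is_stopped "$STATE"; then
  echo "$SERVICE has stopped (state: $STATE)"
else
  echo "$SERVICE is not stopped (state: ${STATE:-unknown})"
fi
```

If the service reports as stopped here while tiup still reports a timeout, the stop command most likely just took longer than tiup was willing to wait.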
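(2) Solution:
1. Upgrade tiup to the latest version: tiup update --self && tiup update --all (this updates tiup itself and its components).
The reason for upgrading is to be able to use the following two flags, which are available in the latest tiup: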
tiup cluster --help
Flags:
-h, --help help for tiup
--ssh-timeout int Timeout in seconds to connect host via SSH, ignored for operations that don't need an SSH connection. (default 5)
-v, --version version for tiup
--wait-timeout int Timeout in seconds to wait for an operation to complete, ignored for operations that don't fit. (default 60)
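If the error is an SSH connection timeout: --ssh-timeout is the time the control machine waits to establish an SSH connection to the tikv/pd/tidb hosts. Increase it when the network is poor or similar.
If the error is ERROR Run Command Timeout: --wait-timeout is the time the control machine waits for a command to finish on the tikv/pd/tidb hosts. Increase it when commands run slowly.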
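As an illustration of raising both timeouts, the sketch below assembles an upgrade command with larger values. The cluster name, target version, and the chosen timeout values are placeholders (assumptions), not values from the original incident. The two flags are the ones listed in the help output above.

```shell
# Placeholders (assumptions): replace with your own cluster name and target version.
CLUSTER_NAME="my-cluster"
TARGET_VERSION="v4.0.2"

# Example values only; the defaults shown in the help output are 5 and 60 seconds.
SSH_TIMEOUT=30     # seconds to wait for the SSH connection
WAIT_TIMEOUT=300   # seconds to wait for an operation such as "systemctl stop"

UPGRADE_CMD="tiup cluster upgrade ${CLUSTER_NAME} ${TARGET_VERSION} --ssh-timeout ${SSH_TIMEOUT} --wait-timeout ${WAIT_TIMEOUT}"

# Print the command for review; run it yourself once the values look right.
echo "${UPGRADE_CMD}"
```

Printing the command instead of executing it keeps the sketch safe to try. For the Run Command Timeout in the log above, --wait-timeout is the one to raise.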
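2. If the upgrade still fails after adjusting the timeouts and retrying several times, fall back on the last resort: --force
A rolling upgrade upgrades all components one by one. While TiKV is being upgraded, each TiKV instance in turn has all of its leaders transferred away before the instance is stopped. The default timeout for this is 5 minutes; once it is exceeded, the instance is stopped directly.
If you would rather upgrade immediately without evicting leaders, add --force to the upgrade command. This causes performance jitter (doing it in the early-morning off-peak hours is strongly recommended, to keep the impact to a minimum) but does not cause data loss.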
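The advice above, using --force only during off-peak hours, could be sketched as a small guard around the command. The cluster name, target version, the helper name is_off_peak, and the 00:00 to 05:59 window are all assumptions made for illustration. Adjust them to your own traffic pattern.

```shell
# Placeholders (assumptions): replace with your own cluster name and target version.
CLUSTER_NAME="my-cluster"
TARGET_VERSION="v4.0.2"

FORCE_UPGRADE_CMD="tiup cluster upgrade ${CLUSTER_NAME} ${TARGET_VERSION} --force"

# Hypothetical helper: succeeds when the given hour (0-23) falls inside the
# assumed off-peak window of 00:00-05:59.
is_off_peak() {
  [ "$1" -ge 0 ] && [ "$1" -lt 6 ]
}

HOUR="$(date +%H)"
HOUR="${HOUR#0}"   # drop a leading zero, e.g. "03" becomes "3"

if is_off_peak "$HOUR"; then
  # Print the command for review rather than running it automatically.
  echo "${FORCE_UPGRADE_CMD}"
else
  echo "Hour ${HOUR} is outside the assumed off-peak window; consider postponing the forced upgrade."
fi
```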