
A Case of PD Scale-Out Failure Caused by a Failed TLS Enablement in TiDB

Dora, published on 2024-07-10

Background

  1. The cluster's TiUP directory had previously been deleted, which lost the TLS certificates, so TLS had to be re-enabled.

  2. The TLS enablement procedure was tested in a test environment, after which scaling the two PD nodes back out failed. The steps were:

    1. Scale in two PD nodes
    2. Enable TLS
    3. Scale out the original two PD nodes; PD reported errors on startup
    4. Restart the cluster

After the restart, all three PD nodes were down, and recovery required either pd-recover or destroying and rebuilding the cluster.
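
For reference, pd-recover is invoked roughly as below. This is a hedged sketch rather than the exact command used here: the endpoint scheme depends on whether the freshly started PD is serving TLS, -cluster-id must be the old cluster's ID (readable from the PD logs or from pd-ctl output, as shown later in this article), and -alloc-id must be larger than the largest ID the old cluster ever allocated.

# values illustrative, taken from the reproduction environment below
pd-recover -endpoints http://172.16.201.73:52379 -cluster-id 7299695983639846267 -alloc-id 100000000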

Note: enabling TLS requires keeping only one PD; if there are multiple PD nodes, scale in to a single one first.

Troubleshooting

Checking the TiUP logs

Looking at the TiUP logs, there is an error from the time TLS was being enabled:
2024-01-04T16:11:00.718+0800    DEBUG   TaskFinish  {"task": "Restart Cluster", "error": "failed to stop: failed to stop: xxxx node_exporter-9100.service, please check the instance's log() for more detail.: timed out waiting for port 9100 to be stopped after 2m0s", "errorVerbose": "timed out waiting for port 9100 to be stopped after 2m0s\ngithub.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute\n\tgithub.com/pingcap/tiup/pkg/cluster/module/wait_for.go:91\ngithub.com/pingcap/tiup/pkg/cluster/spec.PortStopped\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:130\ngithub.com/pingcap/tiup/pkg/cluster/operation.systemctlMonitor.func1\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:338\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20220819030929-7fc1605a5dde/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1594\nfailed to stop: 10.196.32.49 node_exporter-9100.service, please check the instance's log() for more detail.\nfailed to stop"}
2024-01-04T16:11:00.718+0800    INFO    Execute command finished    {"code": 1, "error": "failed to stop: failed to stop: xxxx node_exporter-9100.service, please check the instance's log() for more detail.: timed out waiting for port 9100 to be stopped after 2m0s", "errorVerbose": "timed out waiting for port 9100 to be stopped after 2m0s\ngithub.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute\n\tgithub.com/pingcap/tiup/pkg/cluster/module/wait_for.go:91\ngithub.com/pingcap/tiup/pkg/cluster/spec.PortStopped\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:130\ngithub.com/pingcap/tiup/pkg/cluster/operation.systemctlMonitor.func1\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:338\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20220819030929-7fc1605a5dde/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1594\nfailed to stop: 10.196.32.49 node_exporter-9100.service, please check the instance's log() for more detail.\nfailed to stop"}
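
When this timeout appears, it helps to check the stuck service and its port directly on the affected host. A minimal check, using the systemd unit name from the log above and node_exporter's default port 9100:

systemctl status node_exporter-9100.service
ss -lntp | grep :9100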

Checking the PD logs

One of the PD instances started with http rather than https:

[Screenshot: PD startup log showing the peer URL still using http]
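
The same evidence can also be pulled straight from the PD log; a minimal check, assuming the default tiup log path used in the reproduction environment below:

grep -iE "advertise|peer-urls" /tidb-deploy/cc/pd-52379/log/pd.log | tail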

The problem is now largely clear: an error during TLS enablement left the subsequent PD configuration in an inconsistent state.

Reproducing the problem

Scale in PD

tiup cluster scale-in tidb-cc -N 172.16.201.159:52379
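
Before enabling TLS, confirm that only one PD instance remains, for example:

tiup cluster display tidb-cc | grep -w pd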

Enable TLS

When enabling TLS, notice that the final step updates PD's member configuration:

+ [ Serial ] - Reload PD Members
        Update pd-172.16.201.73-52379 peerURLs: [https://172.16.201.73:52380]

[tidb@vm172-16-201-73 /tidb-deploy/cc/pd-52379/scripts]$ tiup cluster tls tidb-cc enable
tiup is checking updates for component cluster ...
A new version of cluster is available:
   The latest version:         v1.14.1
   Local installed version:    v1.12.3
   Update current component:   tiup update cluster
   Update all components:      tiup update --all

Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.12.3/tiup-cluster tls tidb-cc enable
Enable/Disable TLS will stop and restart the cluster `tidb-cc`
Do you want to continue? [y/N]:(default=N) y
Generate certificate: /home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.159
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.159
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.99
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ [Parallel] - UserSSH: user=tidb, host=172.16.201.73
+ Copy certificate to remote host
  - Generate certificate pd -> 172.16.201.73:52379 ... Done
  - Generate certificate tikv -> 172.16.201.73:25160 ... Done
  - Generate certificate tikv -> 172.16.201.159:25160 ... Done
  - Generate certificate tikv -> 172.16.201.99:25160 ... Done
  - Generate certificate tidb -> 172.16.201.73:54000 ... Done
  - Generate certificate tidb -> 172.16.201.159:54000 ... Done
  - Generate certificate prometheus -> 172.16.201.73:59090 ... Done
  - Generate certificate grafana -> 172.16.201.73:43000 ... Done
  - Generate certificate alertmanager -> 172.16.201.73:59093 ... Done
+ Copy monitor certificate to remote host
  - Generate certificate node_exporter -> 172.16.201.159 ... Done
  - Generate certificate node_exporter -> 172.16.201.99 ... Done
  - Generate certificate node_exporter -> 172.16.201.73 ... Done
  - Generate certificate blackbox_exporter -> 172.16.201.73 ... Done
  - Generate certificate blackbox_exporter -> 172.16.201.159 ... Done
  - Generate certificate blackbox_exporter -> 172.16.201.99 ... Done
+ Refresh instance configs
  - Generate config pd -> 172.16.201.73:52379 ... Done
  - Generate config tikv -> 172.16.201.73:25160 ... Done
  - Generate config tikv -> 172.16.201.159:25160 ... Done
  - Generate config tikv -> 172.16.201.99:25160 ... Done
  - Generate config tidb -> 172.16.201.73:54000 ... Done
  - Generate config tidb -> 172.16.201.159:54000 ... Done
  - Generate config prometheus -> 172.16.201.73:59090 ... Done
  - Generate config grafana -> 172.16.201.73:43000 ... Done
  - Generate config alertmanager -> 172.16.201.73:59093 ... Done
+ Refresh monitor configs
  - Generate config node_exporter -> 172.16.201.73 ... Done
  - Generate config node_exporter -> 172.16.201.159 ... Done
  - Generate config node_exporter -> 172.16.201.99 ... Done
  - Generate config blackbox_exporter -> 172.16.201.73 ... Done
  - Generate config blackbox_exporter -> 172.16.201.159 ... Done
  - Generate config blackbox_exporter -> 172.16.201.99 ... Done
+ [ Serial ] - Save meta
+ [ Serial ] - Restart Cluster
Stopping component alertmanager
        Stopping instance 172.16.201.73
        Stop alertmanager 172.16.201.73:59093 success
Stopping component grafana
        Stopping instance 172.16.201.73
        Stop grafana 172.16.201.73:43000 success
Stopping component prometheus
        Stopping instance 172.16.201.73
        Stop prometheus 172.16.201.73:59090 success
Stopping component tidb
        Stopping instance 172.16.201.159
        Stopping instance 172.16.201.73
        Stop tidb 172.16.201.159:54000 success
        Stop tidb 172.16.201.73:54000 success
Stopping component tikv
        Stopping instance 172.16.201.99
        Stopping instance 172.16.201.73
        Stopping instance 172.16.201.159
        Stop tikv 172.16.201.73:25160 success
        Stop tikv 172.16.201.99:25160 success
        Stop tikv 172.16.201.159:25160 success
Stopping component pd
        Stopping instance 172.16.201.73
        Stop pd 172.16.201.73:52379 success
Stopping component node_exporter
        Stopping instance 172.16.201.99
        Stopping instance 172.16.201.73
        Stopping instance 172.16.201.159
        Stop 172.16.201.73 success
        Stop 172.16.201.99 success
        Stop 172.16.201.159 success
Stopping component blackbox_exporter
        Stopping instance 172.16.201.99
        Stopping instance 172.16.201.73
        Stopping instance 172.16.201.159
        Stop 172.16.201.73 success
        Stop 172.16.201.99 success
        Stop 172.16.201.159 success
Starting component pd
        Starting instance 172.16.201.73:52379
        Start instance 172.16.201.73:52379 success
Starting component tikv
        Starting instance 172.16.201.99:25160
        Starting instance 172.16.201.73:25160
        Starting instance 172.16.201.159:25160
        Start instance 172.16.201.99:25160 success
        Start instance 172.16.201.73:25160 success
        Start instance 172.16.201.159:25160 success
Starting component tidb
        Starting instance 172.16.201.159:54000
        Starting instance 172.16.201.73:54000
        Start instance 172.16.201.159:54000 success
        Start instance 172.16.201.73:54000 success
Starting component prometheus
        Starting instance 172.16.201.73:59090
        Start instance 172.16.201.73:59090 success
Starting component grafana
        Starting instance 172.16.201.73:43000
        Start instance 172.16.201.73:43000 success
Starting component alertmanager
        Starting instance 172.16.201.73:59093
        Start instance 172.16.201.73:59093 success
Starting component node_exporter
        Starting instance 172.16.201.159
        Starting instance 172.16.201.99
        Starting instance 172.16.201.73
        Start 172.16.201.73 success
        Start 172.16.201.99 success
        Start 172.16.201.159 success
Starting component blackbox_exporter
        Starting instance 172.16.201.159
        Starting instance 172.16.201.99
        Starting instance 172.16.201.73
        Start 172.16.201.73 success
        Start 172.16.201.99 success
        Start 172.16.201.159 success
+ [ Serial ] - Reload PD Members
        Update pd-172.16.201.73-52379 peerURLs: [https://172.16.201.73:52380]
Enabled TLS between TiDB components for cluster `tidb-cc` successfully
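
After enablement reports success, it is worth verifying that PD really answers over TLS before doing anything else. One way is to call the PD HTTP API with the certificates tiup generated (paths as created for this cluster):

curl --cacert /home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/ca.crt \
     --cert /home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/client.crt \
     --key /home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/client.pem \
     https://172.16.201.73:52379/pd/api/v1/members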

Simulate a node_exporter stop failure during TLS enablement
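
One simple way to simulate the failure, assuming tiup's stop check only watches whether port 9100 is still listening (which is what the error message above suggests): stop node_exporter manually, occupy its port with a dummy listener, and then run the enablement:

sudo systemctl stop node_exporter-9100.service
python3 -m http.server 9100 &      # dummy process holding port 9100
tiup cluster tls tidb-cc enable    # "waiting for port 9100 to be stopped" now times out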

Check the member information

[tidb@vm172-16-201-73 ~]$ tiup ctl:v7.1.1 pd -u https://172.16.201.73:52379 --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/ca.crt --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/client.pem --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/client.crt member

{
  "header": {
    "cluster_id": 7299695983639846267
  },
  "members": [
    {
      "name": "pd-172.16.201.73-52379",
      "member_id": 781118713925753452,
      "peer_urls": [
        "http://172.16.201.73:52380"
      ],
      "client_urls": [
        "https://172.16.201.73:52379"
      ],

The peer_urls field in the member list is indeed http rather than https, so the problem is successfully reproduced.
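
The skipped Reload PD Members step is essentially an etcd member update, since PD embeds etcd. Below is a hedged sketch of applying it by hand with etcdctl, assuming an etcdctl v3 binary that can reach PD's client port with the cluster certificates; note that etcdctl prints member IDs in hex, so take the ID from its own member list output rather than from pd-ctl:

ETCDCTL_API=3 etcdctl --endpoints=https://172.16.201.73:52379 \
  --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/ca.crt \
  --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/client.crt \
  --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/client.pem \
  member list
# then, using the hex member ID from the listing:
ETCDCTL_API=3 etcdctl --endpoints=https://172.16.201.73:52379 \
  --cacert=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/ca.crt \
  --cert=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/client.crt \
  --key=/home/tidb/.tiup/storage/cluster/clusters/tidb-cc/tls/client.pem \
  member update <hex-member-id> --peer-urls=https://172.16.201.73:52380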

Summary

  1. TLS enablement failed because stopping node_exporter failed; this was assumed to be harmless, so the remaining steps were carried out anyway.
  2. The final step of TLS enablement is the PD member configuration update; the node_exporter stop failure meant this step never ran.
  3. Going forward, make sure TLS enablement has fully succeeded before taking the next step; pd-ctl can also be used to inspect the member state.
  4. If this situation comes up again, disable TLS first and then enable it again. If stopping node_exporter takes too long during enablement, add the --wait-timeout flag to the tiup invocation and manually kill the node_exporter processes on the hosts so the TLS operation can proceed, as sketched below.
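
A minimal sketch of that remediation, using the flags named above:

# give slow stops more headroom (in seconds) when toggling TLS
tiup cluster tls tidb-cc disable --wait-timeout 300
tiup cluster tls tidb-cc enable --wait-timeout 300
# if a node_exporter still refuses to exit, kill it on the affected host
pkill -f node_exporter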
