
Zero Database Downtime During an Availability-Zone Failure in the TiDB Labs Cloud Environment

EINTR · published 2025-02-25

Log in to TiDB Labs, select the "Zero database downtime during an availability-zone failure" lab, and pay the required credits to launch the lab environment. After a few minutes the environment is ready, and you receive a private key and the corresponding host IP addresses.

(Screenshot: lab environment details with the private key and host IPs)

1. Database environment setup

After logging in to the host, copy the provided template and edit it into the topology file used to deploy the database.

cp template-nine-nodes.yaml nine-nodes.yaml

[ec2-user@ip-10-90-3-53 ~]$ cat nine-nodes.yaml 
global:
  user: "ec2-user"
  ssh_port: 22
  deploy_dir: "/tidb-deploy"
  data_dir: "/tidb-data"
  arch: "amd64"
server_configs:
  pd:
    replication.max-replicas: 3
pd_servers:
  - host: 10.90.3.53
    client_port: 2379
    peer_port: 2380
    advertise_client_addr: "http://13.212.46.63:2379"
  - host: 10.90.2.107
    client_port: 2379
    peer_port: 2380
    advertise_client_addr: "http://52.221.245.22:2379"
  - host: 10.90.1.143
    client_port: 2379
    peer_port: 2380
    advertise_client_addr: "http://47.129.246.237:2379"
tidb_servers:
  - host: 10.90.2.172
    port: 4000
    status_port: 10080
    deploy_dir: /tidb-deploy/tidb-4000
    log_dir: /tidb-deploy/tidb-4000/log
  - host: 10.90.1.216
    port: 4000
    status_port: 10080
    deploy_dir: /tidb-deploy/tidb-4000
    log_dir: /tidb-deploy/tidb-4000/log
tikv_servers:
  - host: 10.90.3.65
    port: 20160
    status_port: 20180
  - host: 10.90.2.144
    port: 20160
    status_port: 20180
  - host: 10.90.1.254
    port: 20160
    status_port: 20180
monitoring_servers:
  - host: 10.90.4.114
grafana_servers:
  - host: 10.90.4.114
alertmanager_servers:
  - host: 10.90.4.114
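
This topology spreads three PD members and three TiKV replicas (`replication.max-replicas: 3`) across three subnets, one per availability zone. Raft keeps serving as long as a majority of replicas survive, so the cluster tolerates losing one whole zone. A quick sketch of that majority arithmetic (my illustration, not part of the lab):

```shell
# Raft quorum for n replicas is floor(n/2) + 1; the cluster stays available
# as long as at most n - quorum replicas are lost at the same time.
quorum() { echo $(( $1 / 2 + 1 )); }

for n in 3 5; do
  q=$(quorum "$n")
  echo "replicas=$n quorum=$q tolerates=$(( n - q )) failure(s)"
done
```

With 3 replicas the quorum is 2, so one zone can drop out; tolerating two simultaneous zone failures would need 5 replicas spread across 5 zones.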

Validate the topology file and make sure it contains no errors:

tiup mirror set tidb-community-server-8.1.1-linux-amd64
tiup cluster check nine-nodes.yaml \
  --user ec2-user \
  -i /home/ec2-user/.ssh/pe-class-key-singapore.pem \
  --apply

// Output after the checks complete
+ Try to apply changes to fix failed checks
  - Applying changes on 10.90.3.53 ... ⠧ Sysctl: host=10.90.3.53 vm.swappiness = 0
  - Applying changes on 10.90.1.143 ... ⠧ Sysctl: host=10.90.1.143 vm.swappiness = 0
  - Applying changes on 10.90.1.254 ... ⠧ Sysctl: host=10.90.1.254 vm.swappiness = 0
  - Applying changes on 10.90.4.114 ... ⠧ Sysctl: host=10.90.4.114 vm.swappiness = 0
  - Applying changes on 10.90.2.107 ... ⠧ Sysctl: host=10.90.2.107 vm.swappiness = 0
  - Applying changes on 10.90.3.65 ... ⠧ 
  - Applying changes on 10.90.2.144 ... ⠧ 
  - Applying changes on 10.90.2.172 ... ⠧ 
  - Applying changes on 10.90.1.216 ... ⠧ 
+ Try to apply changes to fix failed checks
  - Applying changes on 10.90.3.53 ... Done
  - Applying changes on 10.90.1.143 ... Done
  - Applying changes on 10.90.1.254 ... Done
  - Applying changes on 10.90.4.114 ... Done
  - Applying changes on 10.90.2.107 ... Done
+ Try to apply changes to fix failed checks
  - Applying changes on 10.90.3.53 ... Done
+ Try to apply changes to fix failed checks
  - Applying changes on 10.90.3.53 ... Done
  - Applying changes on 10.90.1.143 ... Done
  - Applying changes on 10.90.1.254 ... Done
  - Applying changes on 10.90.4.114 ... Done
  - Applying changes on 10.90.2.107 ... Done
  - Applying changes on 10.90.3.65 ... Done
  - Applying changes on 10.90.2.144 ... Done
  - Applying changes on 10.90.2.172 ... Done
  - Applying changes on 10.90.1.216 ... Done

After the checks pass, deploy the cluster using the topology file:

tiup cluster deploy tidb-demo 8.1.1 ./nine-nodes.yaml \
  --user ec2-user \
  -i /home/ec2-user/.ssh/pe-class-key-singapore.pem \
  --yes

// Output after a successful deployment
Enabling component blackbox_exporter
        Enabling instance 10.90.4.114
        Enabling instance 10.90.2.107
        Enabling instance 10.90.1.254
        Enabling instance 10.90.1.143
        Enabling instance 10.90.2.172
        Enabling instance 10.90.2.144
        Enabling instance 10.90.3.53
        Enabling instance 10.90.1.216
        Enabling instance 10.90.3.65
        Enable 10.90.4.114 success
        Enable 10.90.1.254 success
        Enable 10.90.3.65 success
        Enable 10.90.2.107 success
        Enable 10.90.1.216 success
        Enable 10.90.2.172 success
        Enable 10.90.3.53 success
        Enable 10.90.2.144 success
        Enable 10.90.1.143 success
Cluster `tidb-demo` deployed successfully, you can start it with command: `tiup cluster start tidb-demo --init`

Start the cluster. (Running `start` without `--init` leaves root's password empty, which the mysql commands below rely on.)

[ec2-user@ip-10-90-3-53 ~]$ tiup cluster start tidb-demo

// Output after a successful start
+ [ Serial ] - UpdateTopology: cluster=tidb-demo
Started cluster `tidb-demo` successfully

Check the status of the cluster components:

[ec2-user@ip-10-90-3-53 ~]$ tiup cluster display tidb-demo
Cluster type:       tidb
Cluster name:       tidb-demo
Cluster version:    v8.1.1
Deploy user:        ec2-user
SSH type:           builtin
Dashboard URL:      http://52.221.245.22:2379/dashboard
Grafana URL:        http://10.90.4.114:3000
ID                 Role          Host         Ports        OS/Arch       Status  Data Dir                      Deploy Dir
--                 ----          ----         -----        -------       ------  --------                      ----------
10.90.4.114:9093   alertmanager  10.90.4.114  9093/9094    linux/x86_64  Up      /tidb-data/alertmanager-9093  /tidb-deploy/alertmanager-9093
10.90.4.114:3000   grafana       10.90.4.114  3000         linux/x86_64  Up      -                             /tidb-deploy/grafana-3000
10.90.1.143:2379   pd            10.90.1.143  2379/2380    linux/x86_64  Up|L    /tidb-data/pd-2379            /tidb-deploy/pd-2379
10.90.2.107:2379   pd            10.90.2.107  2379/2380    linux/x86_64  Up      /tidb-data/pd-2379            /tidb-deploy/pd-2379
10.90.3.53:2379    pd            10.90.3.53   2379/2380    linux/x86_64  Up      /tidb-data/pd-2379            /tidb-deploy/pd-2379
10.90.4.114:9090   prometheus    10.90.4.114  9090/12020   linux/x86_64  Up      /tidb-data/prometheus-9090    /tidb-deploy/prometheus-9090
10.90.1.216:4000   tidb          10.90.1.216  4000/10080   linux/x86_64  Up      -                             /tidb-deploy/tidb-4000
10.90.2.172:4000   tidb          10.90.2.172  4000/10080   linux/x86_64  Up      -                             /tidb-deploy/tidb-4000
10.90.1.254:20160  tikv          10.90.1.254  20160/20180  linux/x86_64  Up      /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
10.90.2.144:20160  tikv          10.90.2.144  20160/20180  linux/x86_64  Up      /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
10.90.3.65:20160   tikv          10.90.3.65   20160/20180  linux/x86_64  Up      /tidb-data/tikv-20160         /tidb-deploy/tikv-20160

2. Deploy the application

Set the IP environment variables required by the Java program.

// The IPs are reachable TiDB server addresses
export HOST_DB1_PRIVATE_IP=10.90.2.172
export HOST_DB2_PRIVATE_IP=10.90.1.216
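
Before launching the workload, it is worth failing fast on a typo in these variables. This guard is my addition, not part of the lab:

```shell
# Values from this lab's topology; substitute your own TiDB node IPs.
export HOST_DB1_PRIVATE_IP=10.90.2.172
export HOST_DB2_PRIVATE_IP=10.90.1.216

# ${var:?} makes the shell abort with an error if the variable is unset or empty.
: "${HOST_DB1_PRIVATE_IP:?}" "${HOST_DB2_PRIVATE_IP:?}"
echo "workload targets: ${HOST_DB1_PRIVATE_IP} ${HOST_DB2_PRIVATE_IP}"
```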

Compile and run the Java program:

javac -cp .:misc/mysql-connector-java-5.1.36-bin.jar DemoJdbcEndlessInsertDummyCSP.java
java -cp .:misc/mysql-connector-java-5.1.36-bin.jar DemoJdbcEndlessInsertDummyCSP

Once running, it prints the activity of its two worker threads:

(Screenshot: output from the two worker threads)

Open another terminal and monitor the data changes in the table:

qdb1(){
mysql -h ${HOST_DB1_PRIVATE_IP} -P 4000 -uroot --connect-timeout 1 2>/dev/null << EOF
  SELECT COUNT(event) FROM test.dummy;
EOF
}

qdb2(){
mysql -h ${HOST_DB2_PRIVATE_IP} -P 4000 -uroot --connect-timeout 1 2>/dev/null << EOF
  SELECT COUNT(event) FROM test.dummy;
EOF
}

query1(){
  echo;
  date;
  qdb1 || qdb2
  sleep 2;
}

query2(){
  echo;
  date;
  qdb2 || qdb1
  sleep 2;
}

while true; do
  query1;
  query2;
done;
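
The `qdb1 || qdb2` pattern is what lets the monitor survive the outage: `||` only runs the second query when the first one fails with a non-zero exit code, which `--connect-timeout 1` forces quickly on an unreachable host. The same idea generalized, with stub commands standing in for the two mysql calls (hypothetical names, for illustration only):

```shell
# try_in_order: run each given command until one succeeds; fail only if all do.
try_in_order() {
  local cmd
  for cmd in "$@"; do
    "$cmd" && return 0
  done
  return 1
}

# Stubs simulating the two endpoints: db1 is "down", db2 answers.
db1() { return 1; }
db2() { echo "answered by db2"; }

try_in_order db1 db2
```

In the lab the "commands" are the two mysql invocations, so a query keeps returning results as long as at least one TiDB server is reachable.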

While it runs, it shows output like this:

(Screenshot: periodic COUNT(event) results from the monitor loop)

Log in to the database and check the PD status:

(Screenshot: PD member status)

3. Simulate an availability-zone failure

Simulate the AZ failure with the commands provided by the lab: they swap the demo subnet's network ACL for a restrictive "cage" ACL, cutting that subnet off from the rest of the cluster.

[ec2-user@ip-10-90-3-53 ~]$ VPC_ID=`aws ec2 describe-vpcs \
>   --filters "Name=tag:Name,Values=wqlixueyang-hotmail-com" \
>   --query "Vpcs[0].VpcId" \
>   --output text \
>   --region ap-southeast-1`
[ec2-user@ip-10-90-3-53 ~]$ SUBNET_ID=`aws ec2 describe-subnets \
>   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=tag:Name,Values=demo-subnet-1" \
>   --query "Subnets[0].SubnetId" \
>   --output text \
>   --region ap-southeast-1`
[ec2-user@ip-10-90-3-53 ~]$ CAGE_NACL_ID=`aws ec2 describe-network-acls \
>   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=default,Values=false" "Name=tag:Name,Values=cage" \
>   --query "NetworkAcls[0].NetworkAclId" \
>   --output text \
>   --region ap-southeast-1`
[ec2-user@ip-10-90-3-53 ~]$ ASSOC_ID=`aws ec2 describe-network-acls \
>   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=default,Values=true" \
>   --query "NetworkAcls[0].Associations" \
>   --output text \
>   --region ap-southeast-1 | grep ${SUBNET_ID} | awk -F" " '{print $1}'`
[ec2-user@ip-10-90-3-53 ~]$ aws ec2 replace-network-acl-association \
>   --association-id ${ASSOC_ID} \
>   --network-acl-id ${CAGE_NACL_ID} \
>   --region ap-southeast-1
{
    "NewAssociationId": "aclassoc-0956a8b18b9835948"
}

Check the cluster status:

[ec2-user@ip-10-90-3-53 ~]$ tiup cluster display tidb-demo
Cluster type:       tidb
Cluster name:       tidb-demo
Cluster version:    v8.1.1
Deploy user:        ec2-user
SSH type:           builtin
Dashboard URL:      http://52.221.245.22:2379/dashboard
Grafana URL:        http://10.90.4.114:3000
ID                 Role          Host         Ports        OS/Arch       Status        Data Dir                      Deploy Dir
--                 ----          ----         -----        -------       ------        --------                      ----------
10.90.4.114:9093   alertmanager  10.90.4.114  9093/9094    linux/x86_64  Up            /tidb-data/alertmanager-9093  /tidb-deploy/alertmanager-9093
10.90.4.114:3000   grafana       10.90.4.114  3000         linux/x86_64  Up            -                             /tidb-deploy/grafana-3000
10.90.1.143:2379   pd            10.90.1.143  2379/2380    linux/x86_64  Down          /tidb-data/pd-2379            /tidb-deploy/pd-2379
10.90.2.107:2379   pd            10.90.2.107  2379/2380    linux/x86_64  Up|L          /tidb-data/pd-2379            /tidb-deploy/pd-2379
10.90.3.53:2379    pd            10.90.3.53   2379/2380    linux/x86_64  Up            /tidb-data/pd-2379            /tidb-deploy/pd-2379
10.90.4.114:9090   prometheus    10.90.4.114  9090/12020   linux/x86_64  Up            /tidb-data/prometheus-9090    /tidb-deploy/prometheus-9090
10.90.1.216:4000   tidb          10.90.1.216  4000/10080   linux/x86_64  Down          -                             /tidb-deploy/tidb-4000
10.90.2.172:4000   tidb          10.90.2.172  4000/10080   linux/x86_64  Up            -                             /tidb-deploy/tidb-4000
10.90.1.254:20160  tikv          10.90.1.254  20160/20180  linux/x86_64  Disconnected  /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
10.90.2.144:20160  tikv          10.90.2.144  20160/20180  linux/x86_64  Up            /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
10.90.3.65:20160   tikv          10.90.3.65   20160/20180  linux/x86_64  Up            /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
Total nodes: 11

Once the zone is isolated, its instances show Down or Disconnected, and traffic that went to 10.90.1.216 is switched seamlessly to 10.90.2.172.

(Screenshot: all application traffic now served through 10.90.2.172)

Next, bring the availability zone back with the lab's script, which restores the subnet's default network ACL:

[ec2-user@ip-10-90-3-53 ~]$ VPC_ID=`aws ec2 describe-vpcs \
>   --filters "Name=tag:Name,Values=wqlixueyang-hotmail-com" \
>   --query "Vpcs[0].VpcId" \
>   --output text \
>   --region ap-southeast-1`
[ec2-user@ip-10-90-3-53 ~]$ SUBNET_ID=`aws ec2 describe-subnets \
>   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=tag:Name,Values=demo-subnet-1" \
>   --query "Subnets[0].SubnetId" \
>   --output text \
>   --region ap-southeast-1`
[ec2-user@ip-10-90-3-53 ~]$ DEFAULT_NACL_ID=`aws ec2 describe-network-acls \
>   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=default,Values=true" \
>   --query "NetworkAcls[0].NetworkAclId" \
>   --output text \
>   --region ap-southeast-1`
[ec2-user@ip-10-90-3-53 ~]$ ASSOC_ID=`aws ec2 describe-network-acls \
>   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=default,Values=false" \
>   --query "NetworkAcls[0].Associations" \
>   --output text \
>   --region ap-southeast-1 | grep ${SUBNET_ID} | awk -F" " '{print $1}'`
[ec2-user@ip-10-90-3-53 ~]$ aws ec2 replace-network-acl-association \
>   --association-id ${ASSOC_ID} \
>   --network-acl-id ${DEFAULT_NACL_ID} \
>   --region ap-southeast-1
{
    "NewAssociationId": "aclassoc-05f1983f8ad847144"
}

The cluster status is back to normal:

[ec2-user@ip-10-90-3-53 ~]$ tiup cluster display tidb-demo
Cluster type:       tidb
Cluster name:       tidb-demo
Cluster version:    v8.1.1
Deploy user:        ec2-user
SSH type:           builtin
Dashboard URL:      http://52.221.245.22:2379/dashboard
Grafana URL:        http://10.90.4.114:3000
ID                 Role          Host         Ports        OS/Arch       Status  Data Dir                      Deploy Dir
--                 ----          ----         -----        -------       ------  --------                      ----------
10.90.4.114:9093   alertmanager  10.90.4.114  9093/9094    linux/x86_64  Up      /tidb-data/alertmanager-9093  /tidb-deploy/alertmanager-9093
10.90.4.114:3000   grafana       10.90.4.114  3000         linux/x86_64  Up      -                             /tidb-deploy/grafana-3000
10.90.1.143:2379   pd            10.90.1.143  2379/2380    linux/x86_64  Up      /tidb-data/pd-2379            /tidb-deploy/pd-2379
10.90.2.107:2379   pd            10.90.2.107  2379/2380    linux/x86_64  Up|L    /tidb-data/pd-2379            /tidb-deploy/pd-2379
10.90.3.53:2379    pd            10.90.3.53   2379/2380    linux/x86_64  Up      /tidb-data/pd-2379            /tidb-deploy/pd-2379
10.90.4.114:9090   prometheus    10.90.4.114  9090/12020   linux/x86_64  Up      /tidb-data/prometheus-9090    /tidb-deploy/prometheus-9090
10.90.1.216:4000   tidb          10.90.1.216  4000/10080   linux/x86_64  Up      -                             /tidb-deploy/tidb-4000
10.90.2.172:4000   tidb          10.90.2.172  4000/10080   linux/x86_64  Up      -                             /tidb-deploy/tidb-4000
10.90.1.254:20160  tikv          10.90.1.254  20160/20180  linux/x86_64  Up      /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
10.90.2.144:20160  tikv          10.90.2.144  20160/20180  linux/x86_64  Up      /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
10.90.3.65:20160   tikv          10.90.3.65   20160/20180  linux/x86_64  Up      /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
Total nodes: 11

The application traffic rebalances across both TiDB endpoints:

(Screenshot: traffic balanced across both TiDB servers again)

4. Summary

Simulating an availability-zone failure in the cloud environment made some instances unavailable, yet application traffic migrated smoothly to the remaining zones without any impact on the business.


Copyright notice: this is an original article by a TiDB community user, released under the CC BY-NC-SA 4.0 license. When reposting, please include a link to the original and this notice.
