
Implementing a Two-Site Three-Center Architecture for a RawKV Cluster with TiKV-CDC

Posted by cchouqiang on 2023-12-05

1. Background

Business continuity and stability are critical to any enterprise, so the disaster-recovery (DR) requirements placed on the database layer keep rising. A traditional high-availability architecture can no longer meet the more demanding DR requirements, and the "two-site three-center" architecture has become the mainstream choice.

For a RawKV cluster, TiKV-CDC is the tool used to replicate data between two clusters.

2. Environment Preparation

Installation media

tikv-cdc download address:

https://github.com/tikv/migration/releases/tag/cdc-v1.1.1

(Note: TiKV-CDC was built specifically for replicating RawKV clusters; it is a different tool from TiCDC.)

tikv-cdc user manual: https://tikv.org/docs/6.5/concepts/explore-tikv-features/cdc/cdc-cn/

3. Cluster Topology

Under the two-site three-center architecture there are three clusters in total: the primary production cluster, the same-city standby cluster, and the remote DR cluster.

The primary production cluster serves the read/write workload and replicates data through TiKV-CDC to the same-city standby cluster and the remote DR cluster, over two independent replication links that do not affect each other.

The TiKV-CDC replication links can be exercised in regular switchover drills to verify that the DR environments are actually usable.

The same-city standby cluster can also serve read traffic.

(Figure: two-site three-center cluster topology)
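The two independent links described above amount to one changefeed per DR target. A minimal sketch of creating them (all changefeed IDs and PD addresses are placeholders; the commands are printed rather than executed so they can be reviewed first):

```shell
#!/bin/bash
# Build one "changefeed create" command per DR target.
# All IDs and addresses below are placeholders, not values from a real deployment.
PD_PROD="PD_MASTER_IP:2379"

make_changefeed_cmd() {
  # $1 = changefeed ID, $2 = downstream PD address
  echo "tikv-cdc cli changefeed create --pd http://$PD_PROD --sink-uri tikv://$2 --changefeed-id $1"
}

# One independent link per target cluster:
make_changefeed_cmd dr-same-city "PD_STANDBY_IP:2379"
make_changefeed_cmd dr-remote    "PD_REMOTE_IP:2379"
```

Because the two changefeeds are separate objects in the upstream cluster, a failure or removal of one link never touches the other.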

Failure scenarios

Single-AZ failure:

1. If either of the two AZs in same-city data center A fails, the cluster is unaffected.

2. If the AZ in same-city data center B fails, the cluster is unaffected.

Region-level failure:

1. If data centers A and B fail at the same time, the DR cluster in remote data center C can take over the traffic directly.

Cluster-level failure:

1. If the primary production cluster runs into trouble, service can be switched to the same-city standby cluster.

Read/write splitting:

1. The same-city standby cluster can serve reads for different types of workloads.

4. Creating the Two-Site Three-Center Replication Links

4.1 Back up the primary cluster's data

RawKV backups are taken with the tikv-br tool.

Back up the flow-index cluster:
$ mkdir -p /data/br
$ ./tikv-br backup raw --gcttl=600m --pd "http://PDIP:2379" --storage "local:///data/br" --dst-api-version v2 --log-file="/tmp/tikv-br.log"
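When the backup finishes, tikv-br reports the backup timestamp (BackupTS), which is needed later as the changefeed's --start-ts. A small helper to pull it out of the log, assuming the log contains a `BackupTS=<number>` field (the exact field name can vary by version, so check your own log first):

```shell
#!/bin/bash
# Extract the BackupTS value from a tikv-br log file.
# Assumes a line containing "BackupTS=<digits>"; adjust the pattern
# if your tikv-br version logs the field differently.
extract_backup_ts() {
  grep -o 'BackupTS=[0-9]*' "$1" | head -n1 | cut -d= -f2
}

# Example: extract_backup_ts /tmp/tikv-br.log
```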

4.2 Restore the data to the DR cluster

Copy the backup data to the DR cluster with scp, then run the restore:
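With local:// storage, each TiKV node's backup files typically sit on that node's local disk, and the restore likewise reads from local disk on the DR cluster's nodes, so the files are copied over host by host. A sketch of that copy step (hostnames are placeholders, assuming a one-to-one node mapping):

```shell
#!/bin/bash
# Copy the local backup directory from each source TiKV node to the
# corresponding DR node. All hostnames below are placeholders.
SRC_NODES=(tikv-prod-1 tikv-prod-2 tikv-prod-3)
DST_NODES=(tikv-dr-1   tikv-dr-2   tikv-dr-3)
BACKUP_DIR=/data/br

copy_cmds() {
  local i
  for i in "${!SRC_NODES[@]}"; do
    echo "scp -r ${SRC_NODES[$i]}:$BACKUP_DIR/ ${DST_NODES[$i]}:$BACKUP_DIR/"
  done
}

copy_cmds   # print the commands for review before running them
```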

$ ./tikv-br restore raw --pd "http://PDIP:2379" --storage "local:///data/br" --log-file "/tmp/restore.log"

4.3 Create the replication link

Create the changefeed (changefeed-id) on the primary cluster's tikv-cdc node.

(PD_MASTER_IP is the PD address of the primary cluster; PD_SLAVE_IP is the PD address of the DR cluster.)

$ cd /home/tidb/tidb_deploy/cdc-8300/bin
$ ./tikv-cdc cli changefeed create --pd "http://PD_MASTER_IP:2379" --sink-uri tikv://PD_SLAVE_IP:2379 --start-ts <BackupTS recorded earlier> --changefeed-id test-cdc

4.4 Check the replication link status

$ ./tikv-cdc cli changefeed --pd http://PD_MASTER_IP:2379 list
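The state can also be checked programmatically; the scripts later in this article pull the `state` field out of the CLI output with a grep/awk/sed pipeline. The same parsing as a small reusable function (assuming the output contains a `"state": "normal"`-style line, which is what those scripts expect):

```shell
#!/bin/bash
# Parse the changefeed state out of tikv-cdc CLI output piped on stdin.
# Mirrors the grep/awk/sed pipeline used by remove_cdc.sh / create_cdc.sh.
get_cdc_state() {
  grep -E '"state"' | awk -F':' '{print $2}' | awk -F',' '{print $1}' | sed 's/[^a-z]//g'
}

# Example:
#   ./tikv-cdc cli changefeed query -s --changefeed-id test-cdc \
#     --pd http://PD_MASTER_IP:2379 | get_cdc_state
```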

5. Primary/Standby Switchover

5.1 Switching from the primary cluster to the DR cluster

5.1.1 Stop TiKV-CDC replication on the primary cluster, after confirming the primary cluster's gRPC traffic is zero

Watch the gRPC traffic on the TiKV-Details dashboard; once it reaches zero, the business traffic has stopped.

5.1.2 Run the remove_cdc.sh script

remove_cdc.sh must be run on the primary cluster's tikv-cdc node.

sh remove_cdc.sh

The remove_cdc.sh script is as follows:

Note: set the CHANGEFEED_ID and pd_master variables to match your environment.

#!/bin/bash
export PATH=/home/ap/cdacs/.tiup/bin:$PATH
TIME=`(date "+%Y-%m-%d_%H-%M-%S")`
#CHANGEFEED_ID,pdipdport are variables
CHANGEFEED_ID=test-cdc-bak
##pd_master is up stream cdc pd info.
pd_master='IP:2379'
ticdc_dir=`ps -ef |grep tikv-[c]dc |awk -F" " '{print $(NF)}'|awk -F"log" '{print $1}'`

##new add 20230106
cdcstate=`${ticdc_dir}bin/tikv-cdc cli changefeed query -s --changefeed-id $CHANGEFEED_ID --pd http://$pd_master |grep -E "state" |awk -F':'  '{print $2}' |awk -F',' '{print $1}' |sed 's/[^a-z]//g'`
logpath="${ticdc_dir}log"
scripts_log="$logpath/remove_changefeed_$TIME.log"


cdc_change(){
echo "cdc changefeed remove."
${ticdc_dir}bin/tikv-cdc cli changefeed remove --changefeed-id $CHANGEFEED_ID --pd=http://$pd_master
if [ $? -eq 0 ]; then
  echo "'$TIME' remove changefeed-id succeeded!!!"
  sleep 5
  ${ticdc_dir}bin/tikv-cdc cli changefeed --pd http://$pd_master list
else
  echo "'$TIME' remove changefeed-id failed!!!"
fi
}


tikvcdcstate()
{
echo "tikvcdc replication state check."
if [ "$cdcstate" = "normal" ];then
   echo "tikvcdc replication state is normal."
else
  echo "tikvcdc replication state is $cdcstate."
  exit 1
fi
}


tikvcdcreplication()
{
tsoa=`${ticdc_dir}bin/tikv-cdc cli tso query --pd=http://$pd_master`

echo "tikv cdc replication progress check."
checknums=10
seconds=2
nums=0
while true
do

tsob=`${ticdc_dir}bin/tikv-cdc cli changefeed query -s --changefeed-id $CHANGEFEED_ID --pd http://$pd_master |grep -E "tso" |awk -F" " '{print $2}'|awk -F"," '{print $1}'`
let nums++
if [ $tsob -ge $tsoa ] && [ $nums -le $checknums ];then
         echo "tikvcdc replication is done,this is $nums times replication check."
         break
      elif [  $tsob -lt $tsoa ] && [ $nums -lt $checknums ];then
         echo "tikvcdc replication is not done,this is $nums times replication check."
         sleep $seconds
      else
         echo "tikvcdc replication is not done,this is $nums times and last replication check."
         break
   fi
done
}

main()
{
tikvcdcstate
tikvcdcreplication
cdc_change
}
main| tee  ${scripts_log}
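The progress check in tikvcdcreplication works because TSOs increase monotonically: once the changefeed's checkpoint TSO passes the PD TSO captured at the start, everything written before the check has reached the downstream. A TSO packs a physical Unix timestamp in milliseconds into its upper bits, above 18 low logical bits, so it can also be decoded for a human-readable view:

```shell
#!/bin/bash
# Decode a TiKV/PD TSO: the physical time (ms since epoch) lives above
# the low 18 logical-counter bits.
tso_to_ms() {
  echo $(( $1 >> 18 ))
}

# Example with a synthetic TSO built from physical time 1700000000000 ms:
# tso_to_ms $(( 1700000000000 << 18 ))
```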

5.1.3 Check the log output

Check the log ${scripts_log} for errors; if there are none, the script ran successfully.

5.1.4 Start TiKV-CDC replication from the DR cluster

Run the create_cdc.sh script; it must be run on the tikv-cdc node.

sh create_cdc.sh

The create_cdc.sh script is as follows:

Note: set the CHANGEFEED_ID, pd_master, and pd_slave variables to match your environment.

#!/bin/bash
export PATH=/home/ap/cdacs/.tiup/bin:$PATH 
TIME=`(date "+%Y-%m-%d_%H-%M-%S")`
#CHANGEFEED_ID,pd_master,pd_slave are variables
CHANGEFEED_ID=xxx-cdc-bak
##pd_master is up stream cdc pd info.
pd_master='IP:2379'
#pd_slave is down stream cdc pd info.
pd_slave='IP:2379'
ticdc_dir=`ps -ef |grep tikv-[c]dc |awk -F" " '{print $(NF)}'|awk -F"log" '{print $1}'`
logpath="${ticdc_dir}log"
scripts_log="$logpath/create_changefeed_$TIME.log"



tikvcdcstate()
{
echo "tikvcdc replication state check."
if [ $cdcstate = "normal" ];then
   echo "tikvcdc replication state is normal."
else 
  echo "tikvcdc replication state is $cdcstate."
  exit 1
fi
}


cdc_change(){
echo "cdc changefeed-id create."
${ticdc_dir}bin/tikv-cdc cli changefeed create --pd "http://$pd_master" --sink-uri tikv://$pd_slave --changefeed-id $CHANGEFEED_ID
if [ $? -eq 0 ]; then
  echo "'$TIME' Create changefeed-id succeeded!!!"
  cdcstate=`${ticdc_dir}bin/tikv-cdc cli changefeed --pd "http://$pd_master" list |grep -E "state" |awk -F':'  '{print $2}' |awk -F',' '{print $1}' |sed 's/\"//g'`
  tikvcdcstate
else
  echo "'$TIME' Create changefeed-id failed!!!"
  exit 1
fi
}



cdc_change | tee  ${scripts_log}

5.1.5 Check the log output

Check the log ${scripts_log} for errors; if there are none, the script ran successfully.

5.2 Switching back from the DR cluster to the primary cluster

5.2.1 Stop TiKV-CDC replication on the DR cluster, after confirming the DR cluster's gRPC traffic is zero

Watch the gRPC traffic on the TiKV-Details dashboard; once it reaches zero, the business traffic has stopped.

5.2.2 Run the remove_cdc.sh script

remove_cdc.sh must be run on the tikv-cdc node.

sh remove_cdc.sh

The remove_cdc.sh script is as follows:

Note: set the CHANGEFEED_ID and pd_master variables to match your environment.

#!/bin/bash
export PATH=/home/ap/cdacs/.tiup/bin:$PATH
TIME=`(date "+%Y-%m-%d_%H-%M-%S")`
#CHANGEFEED_ID,pdipdport are variables
CHANGEFEED_ID=test-cdc-bak
##pd_master is up stream cdc pd info.
pd_master='172.16.11.115:2379'
ticdc_dir=`ps -ef |grep tikv-[c]dc |awk -F" " '{print $(NF)}'|awk -F"log" '{print $1}'`

##new add 20230106
cdcstate=`${ticdc_dir}bin/tikv-cdc cli changefeed query -s --changefeed-id $CHANGEFEED_ID --pd http://$pd_master |grep -E "state" |awk -F':'  '{print $2}' |awk -F',' '{print $1}' |sed 's/[^a-z]//g'`
logpath="${ticdc_dir}log"
scripts_log="$logpath/remove_changefeed_$TIME.log"


cdc_change(){
echo "cdc changefeed remove."
${ticdc_dir}bin/tikv-cdc cli changefeed remove --changefeed-id $CHANGEFEED_ID --pd=http://$pd_master
if [ $? -eq 0 ]; then
  echo "'$TIME' remove changefeed-id succeeded!!!"
  sleep 5
  ${ticdc_dir}bin/tikv-cdc cli changefeed --pd http://$pd_master list
else
  echo "'$TIME' remove changefeed-id failed!!!"
fi
}

##new add 20230106
tikvcdcstate()
{
echo "tikvcdc replication state check."
if [ "$cdcstate" = "normal" ];then
   echo "tikvcdc replication state is normal."
else
  echo "tikvcdc replication state is $cdcstate."
  exit 1
fi
}



##new add 20230106
tikvcdcreplication()
{
tsoa=`${ticdc_dir}bin/tikv-cdc cli tso query --pd=http://$pd_master`

echo "tikv cdc replication progress check."
checknums=10
seconds=2
nums=0
while true
do

tsob=`${ticdc_dir}bin/tikv-cdc cli changefeed query -s --changefeed-id $CHANGEFEED_ID --pd http://$pd_master |grep -E "tso" |awk -F" " '{print $2}'|awk -F"," '{print $1}'`
let nums++
if [ $tsob -ge $tsoa ] && [ $nums -le $checknums ];then
         echo "tikvcdc replication is done,this is $nums times replication check."
         break
      elif [  $tsob -lt $tsoa ] && [ $nums -lt $checknums ];then
         echo "tikvcdc replication is not done,this is $nums times replication check."
         sleep $seconds
      else
         echo "tikvcdc replication is not done,this is $nums times and last replication check."
         break
   fi
done
}

main()
{
tikvcdcstate
tikvcdcreplication
cdc_change
}
main| tee  ${scripts_log}

5.2.3 Check the log output

Check the log ${scripts_log} for errors; if there are none, the script ran successfully.

5.2.4 Start TiKV-CDC replication from the primary cluster

Run the create_cdc.sh script.

create_cdc.sh must be run on the tikv-cdc node.

sh create_cdc.sh

The create_cdc.sh script is as follows:

Note: set the CHANGEFEED_ID, pd_master, and pd_slave variables to match your environment.

#!/bin/bash
set -e
export PATH=/home/cdacs/.tiup/bin:$PATH
TIME=`(date "+%Y-%m-%d_%H-%M-%S")`
#CHANGEFEED_ID,pd_master,pd_slave are variables
CHANGEFEED_ID=test-cdc-bak
##pd_master is up stream cdc pd info.
pd_master='IP:2379'
#pd_slave is down stream cdc pd info.
pd_slave='IP:2379'
ticdc_dir=`ps -ef |grep tikv-[c]dc |awk -F" " '{print $(NF)}'|awk -F"log" '{print $1}'`
logpath="${ticdc_dir}log"
scripts_log="$logpath/create_changefeed_$TIME.log"
#cdcstate=`${ticdc_dir}bin/tikv-cdc cli changefeed --pd "http://$pd_master" list |grep -E "state" |awk -F':'  '{print $2}'|awk -F',' '{print $1}'|sed 's/[^a-z]//g'`
##new add 20230106
tikvcdcstate()
{
cdcstate=`${ticdc_dir}bin/tikv-cdc cli changefeed query -s --changefeed-id $CHANGEFEED_ID --pd http://$pd_master |grep -E "state" |awk -F':'  '{print $2}' |awk -F',' '{print $1}' |sed 's/[^a-z]//g'`
echo "tikvcdc replication state check."
if [ "$cdcstate" = "normal" ];then
   echo "tikvcdc replication state is normal."
else
  echo "tikvcdc replication state is $cdcstate."
  exit 1
fi
}

cdc_change(){
echo "cdc changefeed-id start create."
${ticdc_dir}bin/tikv-cdc cli changefeed create --pd "http://$pd_master" --sink-uri tikv://$pd_slave --changefeed-id $CHANGEFEED_ID
if [ $? -eq 0 ]; then
  echo "'$TIME' Create changefeed-id successed!!!"
  tikvcdcstate
else
  echo "'$TIME' Create changefeed-id failed!!!"
  exit 1
fi
}

cdc_change | tee  ${scripts_log}

5.2.5 Check the log output

Check the log ${scripts_log} for errors; if there are none, the script ran successfully.

6. Summary

1. The two-site three-center architecture provides a strong guarantee of business continuity.

2. Run DR switchover drills on a regular schedule.


Copyright notice: this is an original article by a TiDB community user, released under the CC BY-NC-SA 4.0 license. When republishing, please include a link to the original article and this notice.
