一、背景
业务的连续性和稳定性对企业来说至关重要,因此对数据库侧的容灾能力要求也越来越高,传统的高可用架构已无法满足高要求的容灾要求,两地三中心架构已成为主流架构。
对于rawkv集群来说,需要使用TiKV-CDC来实现在两个集群间的数据同步。
二、环境准备
安装介质
tikv-cdc下载地址:
https://github.com/tikv/migration/releases/tag/cdc-v1.1.1
(注:TiKV-CDC是专门同步rawkv集群开发的,跟TiCDC不同)
tikv-cdc使用手册:https://tikv.org/docs/6.5/concepts/explore-tikv-features/cdc/cdc-cn/
三、集群拓扑
两地三中心架构下,总共有3套集群,分别为:主生产集群、同城应急集群、异地灾备集群。
主生产集群承担读写业务,通过TiKV-CDC同步数据至同城应急集群、异地灾备集群,同步链路为2条独立的链路,互不影响。
TiKV-CDC同步链路可以定期进行切换演练,验证灾备环境的可用性。
同城应急集群可以承担读流量。
故障情况说明
单个AZ级故障:
1、集群中同城A机房2个AZ中任意单AZ级故障,集群不受影响。
2、集群中同城B机房AZ故障,集群不受影响。
区域级故障:
1、同城A、B机房同时发生故障,异地C机房的灾备集群可以直接承载流量。
集群级故障:
1、主生产集群发生问题后,可以切换至同城应急集群继续提供服务。
读写分离:
1、同城应急库可以根据不同业务类型提供读能力。
四、创建两地三中心同步链路
4.1 备份主集群数据
rawkv备份使用的工具为tikv-br
备份流水索引集群
$ mkdir -p /data/br
$ ./tikv-br backup raw --gcttl=600m --pd "http://PDIP:2379" --storage "local:///data/br" --dst-api-version v2 --log-file="/tmp/tikv-br.log"
4.2 还原数据至灾备集群
将备份数据scp到灾备集群上
$ ./tikv-br restore raw --pd "http://PDIP:2379" --storage "local:///data/br" --log-file /tmp/restore.log"
4.3 创建同步链路
在主集群的tikv-cdc机器上创建同步链路changefeed-id
(PD_MASTER_IP为主集群的PD地址,PD_SLAVE_IP为灾备集群的PD地址)
$ cd /home/tidb/tidb_deploy/cdc-8300/bin
./tikv-cdc cli changefeed create --pd "http://PD_MASTER_IP:2379" --sink-uri tikv://PD_SLAVE_IP:2379 --start-ts <前边记录的BackupTs> --changefeed-id test-cdc
4.4 查看同步链路状态
$ ./tikv-cdc cli changefeed --pd http://PD_MASTER_IP:2379 list
五、主备集群切换
5.1 主集群切换到灾备集群
5.1.1 主集群停止TICDC同步,检查主集群grpc流量为0
观察tikv-details下的grpc流量,若为0,则说明业务已经停止。
5.1.2 执行remove_cdc.sh脚本
remove_cdc.sh需要在主集群的tikv-cdc节点执行。
sh remove_cdc.sh
remove_cdc.sh脚本如下:
注意:CHANGEFEED_ID,pd_master这几个变量需要按照实际情况修改。
#!/bin/bash
export PATH=/home/ap/cdacs/.tiup/bin:$PATH
TIME=`(date "+%Y-%m-%d_%H-%M-%S")`
#CHANGEFEED_ID,pdipdport are variables
CHANGEFEED_ID=test-cdc-bak
##pd_master is up stream cdc pd info.
pd_master='IP:2379'
ticdc_dir=`ps -ef |grep tikv-[c]dc |awk -F" " '{print $(NF)}'|awk -F"log" '{print $1}'`
##new add 20230106
cdcstate=`${ticdc_dir}bin/tikv-cdc cli changefeed query -s --changefeed-id $CHANGEFEED_ID --pd http://$pd_master |grep -E "state" |awk -F':' '{print $2}' |awk -F',' '{print $1}' |sed 's/[^a-z]//g'`
logpath="${ticdc_dir}log"
scripts_log="$logpath/remove_changefeed_$TIME.log"
cdc_change(){
echo "cdc changefeed remove."
${ticdc_dir}bin/tikv-cdc cli changefeed remove --changefeed-id $CHANGEFEED_ID --pd=http://$pd_master
if [ $? -eq 0 ]; then
echo "’$TIME‘ remove changefeed-id successed!!!"
sleep 5
${ticdc_dir}bin/tikv-cdc cli changefeed --pd http://$pd_master list
else
echo "‘$TIME’ remove changefeed-id failed!!!"
fi
}
tikvcdcstate()
{
echo "tikvcdc replication state check."
if [ $cdcstate = "normal" ];then
echo "tikvcdc replication state is normal."
else
echo "tikvcdc replication state is $cdcstate."
exit 1
fi
}
tikvcdcreplication()
{
tsoa=`${ticdc_dir}bin/tikv-cdc cli tso query --pd=http://$pd_master`
echo "tikv cdc replication progress check."
checknums=10
seconds=2
nums=0
while true
do
tsob=`${ticdc_dir}bin/tikv-cdc cli changefeed query -s --changefeed-id $CHANGEFEED-ID --pd http://$pd_master |grep -E "tso" |awk -F" " '{print $2}'|awk -F"," '{print $1}'`
let nums++
if [ $tsob -gt $tsoa ] && [ $nums -le $checknums ];then
echo "tikvcdc replication is done,this is $nums times replication check."
break
elif [ $tsob -lt $tsoa ] && [ $nums -lt $checknums ];then
echo "tikvcdc replication is not done,this is $nums times replication check."
sleep $seconds
else
echo "tikvcdc replication is not done,this is $nums times and last replication check."
break
fi
done
}
main()
{
tikvcdcstate
tikvcdcreplication
cdc_change
}
main| tee ${scripts_log}
5.1.3 观察日志输出
检查日志${scripts_log}有无报错,若无报错,则说明执行成功。
5.1.4 灾备集群启动TICDC同步
执行create_cdc.sh脚本,create_cdc.sh需要在tikv-cdc节点执行。
sh create_cdc.sh
create_cdc.sh脚本如下:
注意:CHANGEFEED_ID,pd_master,pd_slave这几个变量需要按照实际情况修改。
#!/bin/bash
export PATH=/home/ap/cdacs/.tiup/bin:$PATH
TIME=`(date "+%Y-%m-%d_%H-%M-%S")`
#CHANGEFEED_ID,pd_master,pd_slave are variables
CHANGEFEED_ID=xxx-cdc-bak
##pd_master is up stream cdc pd info.
pd_master='IP:2379'
#pd_slave is down stream cdc pd info.
pd_slave='IP:2379'
ticdc_dir=`ps -ef |grep tikv-[c]dc |awk -F" " '{print $(NF)}'|awk -F"log" '{print $1}'`
logpath="${ticdc_dir}log"
scripts_log="$logpath/create_changefeed_$TIME.log"
tikvcdcstate()
{
echo "tikvcdc replication state check."
if [ $cdcstate = "normal" ];then
echo "tikvcdc replication state is normal."
else
echo "tikvcdc replication state is $cdcstate."
exit 1
fi
}
cdc_change(){
echo "cdc changefeed-id create."
.${ticdc_dir}bin/tikv-cdc cli changefeed create --pd "http://$pd_master" --sink-uri tikv://$pd_slave --changefeed-id $CHANGEFEED_ID
if [ $? -eq 0 ]; then
echo "'$TIME' Create changefeed-id successed!!!"
cdcstate=`.${ticdc_dir}bin/tikv-cdc cli changefeed --pd "http://$pd_master" list |grep -E "state" |awk -F':' '{print $2}' |awk -F',' '{print $1}' |sed 's/\"//g'`
tikvcdcstate
else
echo "'$TIME' Create changefeed-id failed!!!"
exit 1
fi
}
cdc_change | tee ${scripts_log}
5.1.5 观察日志输出
观察日志${scripts_log}有无报错,若无报错,则说明执行成功。
5.2 灾备集群回切到主集群
5.2.1 灾备集群停止TICDC同步,检查灾备集群grpc流量为0
观察tikv-details下的grpc流量,若为0,则说明业务已经停止。
5.2.2 执行remove_cdc.sh脚本
remove_cdc.sh需要在tikv-cdc节点执行
sh remove_cdc.sh
remove_cdc.sh脚本如下:
注意:CLUSTER_NAME,CHANGEFEED_ID,pdipdport这几个变量需要按照实际情况修改。
#!/bin/bash
export PATH=/home/ap/cdacs/.tiup/bin:$PATH
TIME=`(date "+%Y-%m-%d_%H-%M-%S")`
#CHANGEFEED_ID,pdipdport are variables
CHANGEFEED_ID=test-cdc-bak
##pd_master is up stream cdc pd info.
pd_master='172.16.11.115:2379'
ticdc_dir=`ps -ef |grep tikv-[c]dc |awk -F" " '{print $(NF)}'|awk -F"log" '{print $1}'`
##new add 20230106
cdcstate=`${ticdc_dir}bin/tikv-cdc cli changefeed query -s --changefeed-id $CHANGEFEED_ID --pd http://$pd_master |grep -E "state" |awk -F':' '{print $2}' |awk -F',' '{print $1}' |sed 's/[^a-z]//g'`
logpath="${ticdc_dir}log"
scripts_log="$logpath/remove_changefeed_$TIME.log"
cdc_change(){
echo "cdc changefeed remove."
${ticdc_dir}bin/tikv-cdc cli changefeed remove --changefeed-id $CHANGEFEED_ID --pd=http://$pd_master
if [ $? -eq 0 ]; then
echo "’$TIME‘ remove changefeed-id successed!!!"
sleep 5
${ticdc_dir}bin/tikv-cdc cli changefeed --pd http://$pd_master list
else
echo "‘$TIME’ remove changefeed-id failed!!!"
fi
}
##new add 20230106
tikvcdcstate()
{
echo "tikvcdc replication state check."
if [ $cdcstate = "normal" ];then
echo "tikvcdc replication state is normal."
else
echo "tikvcdc replication state is $cdcstate."
exit 1
fi
}
##new add 20230106
tikvcdcreplication()
{
tsoa=`${ticdc_dir}bin/tikv-cdc cli tso query --pd=http://$pd_master`
echo "tikv cdc replication progress check."
checknums=10
seconds=2
nums=0
while true
do
tsob=`${ticdc_dir}bin/tikv-cdc cli changefeed query -s --changefeed-id $CHANGEFEED-ID --pd http://$pd_master |grep -E "tso" |awk -F" " '{print $2}'|awk -F"," '{print $1}'`
let nums++
if [ $tsob -gt $tsoa ] && [ $nums -le $checknums ];then
echo "tikvcdc replication is done,this is $nums times replication check."
break
elif [ $tsob -lt $tsoa ] && [ $nums -lt $checknums ];then
echo "tikvcdc replication is not done,this is $nums times replication check."
sleep $seconds
else
echo "tikvcdc replication is not done,this is $nums times and last replication check."
break
fi
done
}
main()
{
tikvcdcstate
tikvcdcreplication
cdc_change
}
main| tee ${scripts_log}
5.2.3 观察日志输出
检查$ticdc_dir/log/$logfile有无报错,若无报错,则说明执行成功。
5.2.4 主集群启动TICDC同步
执行create_cdc.sh脚本
create_cdc.sh需要在tikv-cdc节点执行
sh create_cdc.sh
create_cdc.sh脚本如下:
注意:CLUSTER_NAME,CHANGEFEED_ID,pd_master,pd_slave这几个变量需要按照实际情况修改。
#!/bin/bash
set -e
export PATH=/home/cdacs/.tiup/bin:$PATH
TIME=`(date "+%Y-%m-%d_%H-%M-%S")`
#CHANGEFEED_ID,pd_master,pd_slave are variables
CHANGEFEED_ID=test-cdc-bak
##pd_master is up stream cdc pd info.
pd_master='IP:2379'
#pd_slave is down stream cdc pd info.
pd_slave='IP:2379'
ticdc_dir=`ps -ef |grep tikv-[c]dc |awk -F" " '{print $(NF)}'|awk -F"log" '{print $1}'`
logpath="${ticdc_dir}log"
scripts_log="$logpath/create_changefeed_$TIME.log"
#cdcstate=`${ticdc_dir}bin/tikv-cdc cli changefeed --pd "http://$pd_master" list |grep -E "state" |awk -F':' '{print $2}'|awk -F',' '{print $1}'|sed 's/[^a-z]//g'`
##new add 20230106
tikvcdcstate()
{
cdcstate=`${ticdc_dir}bin/tikv-cdc cli changefeed query -s --changefeed-id $CHANGEFEED_ID --pd http://$pd_master |grep -E "state" |awk -F':' '{print $2}' |awk -F',' '{print $1}' |sed 's/[^a-z]//g'`
echo "tikvcdc replication state check."
if [ $cdcstate = normal ];then
echo "tikvcdc replication state is normal."
else
echo "tikvcdc replication state is $cdcstate."
exit 1
fi
}
cdc_change(){
echo "cdc changefeed-id start create."
${ticdc_dir}bin/tikv-cdc cli changefeed create --pd "http://$pd_master" --sink-uri tikv://$pd_slave --changefeed-id $CHANGEFEED_ID
if [ $? -eq 0 ]; then
echo "'$TIME' Create changefeed-id successed!!!"
tikvcdcstate
else
echo "'$TIME' Create changefeed-id failed!!!"
exit 1
fi
}
cdc_change | tee ${scripts_log}
5.2.5 观察日志输出
观察日志${scripts_log}有无报错,若无报错,则说明执行成功。
六、总结
1、两地三中心架构给业务连续性提供了强有力的保障;
2、定期进行灾备切换演练。