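Note: The test environment runs v4.0.15, which is a very old version as far as cdc is concerned and may have quite a few problems. For production, upgrade to a newer release where possible, for example v6.1.6 or v6.5.1; these versions are greatly improved in both performance and functionality. The problem described below did not occur when tested on v5.4.1, so a recent, stable LTS version is recommended.
Basic cdc commands
# Create a cdc task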
tiup ctl:v4.0.15 cdc changefeed create --pd=http://10.2.103.115:32379 --sink-uri="tidb://root:tidb@10.2.103.116:34000/" --changefeed-id="simple-replication-task" --config=cdc.toml
# List cdc tasks and their status
tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
# Query the status of a specific task
tiup ctl:v4.0.15 cdc changefeed query -c simple-replication-task --pd=http://10.2.103.115:32379
# Remove a task
tiup ctl:v4.0.15 cdc changefeed remove -c simple-replication-task --pd=http://10.2.103.115:32379
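Every command above repeats the same ctl version and PD address. A minimal sketch of a wrapper that sets both once; the function name `cdc_cli` and the variables `CTL_VERSION` and `PD_ADDR` are hypothetical names introduced here for illustration and are not part of tiup:

```shell
# Hypothetical helper: set the ctl version and PD address in one place.
# The defaults are the values used in this test environment.
CTL_VERSION="${CTL_VERSION:-v4.0.15}"
PD_ADDR="${PD_ADDR:-http://10.2.103.115:32379}"

# Forward any cdc subcommand to tiup ctl and append the PD address.
cdc_cli() {
  tiup "ctl:${CTL_VERSION}" cdc "$@" --pd="${PD_ADDR}"
}
```

With this in place, `cdc_cli changefeed list` and `cdc_cli changefeed query -c simple-replication-task` expand to the list and query commands shown above.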
PD status
[tidb@vm115 ~]$ tiup cluster display tidb-dev
tiup is checking updates for component cluster ...
A new version of cluster is available:
The latest version: v1.12.0
Local installed version: v1.11.3
Update current component: tiup update cluster
Update all components: tiup update --all
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster display tidb-dev
Cluster type: tidb
Cluster name: tidb-dev
Cluster version: v4.0.15
Deploy user: tidb
SSH type: builtin
Dashboard URL: http://10.2.103.115:32379/dashboard
Grafana URL: http://10.2.103.115:7000
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir
-- ---- ---- ----- ------- ------ -------- ----------
10.2.103.115:9793 alertmanager 10.2.103.115 9793/9794 linux/x86_64 Up /data1/tidb-data/alertmanager-9793 /data1/tidb-deploy/alertmanager-9793
10.2.103.115:9893 alertmanager 10.2.103.115 9893/9894 linux/x86_64 Up /data1/tidb-data/alertmanager-9893 /data1/tidb-deploy/alertmanager-9893
10.2.103.115:8400 cdc 10.2.103.115 8400 linux/x86_64 Up /data1/tidb-data/cdc-8400 /data1/tidb-deploy/cdc-8400
10.2.103.115:7000 grafana 10.2.103.115 7000 linux/x86_64 Up - /data1/tidb-deploy/grafana-7000
10.2.103.115:32379 pd 10.2.103.115 32379/3380 linux/x86_64 Up|L|UI /data1/tidb-data/pd-32379 /data1/tidb-deploy/pd-32379
10.2.103.115:35379 pd 10.2.103.115 35379/3580 linux/x86_64 Up /data1/tidb-data/pd-35379 /data1/tidb-deploy/pd-35379
10.2.103.115:36379 pd 10.2.103.115 36379/3680 linux/x86_64 Up /data1/tidb-data/pd-36379 /data1/tidb-deploy/pd-36379
10.2.103.115:9590 prometheus 10.2.103.115 9590/35020 linux/x86_64 Up /data1/tidb-data/prometheus-9590 /data1/tidb-deploy/prometheus-9590
10.2.103.115:43000 tidb 10.2.103.115 43000/20080 linux/x86_64 Up - /data1/tidb-deploy/tidb-34000
10.2.103.115:30160 tikv 10.2.103.115 30160/30180 linux/x86_64 Up /data1/tidb-data/tikv-30160 /data1/tidb-deploy/tikv-30160
Total nodes: 10
cdc task status
[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
{
"id": "simple-replication-task2",
"summary": {
"state": "normal",
"tso": 440757324149424131,
"checkpoint": "2023-04-13 11:15:59.237",
"error": null
}
}
]
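The `tso` and `checkpoint` fields describe the same point in time. A TiDB TSO packs a physical timestamp in milliseconds into its high bits and an 18-bit logical counter into its low bits, so the checkpoint can be recovered with shell arithmetic. A small sketch (the function names are made up for this example, and 64-bit shell arithmetic is assumed):

```shell
# Physical part of a TSO: milliseconds since the Unix epoch.
tso_physical_ms() {
  echo $(( $1 >> 18 ))
}

# Logical part of a TSO: the low 18 bits (262143 = 2^18 - 1).
tso_logical() {
  echo $(( $1 & 262143 ))
}

tso_physical_ms 440757324149424131   # prints 1681355759237
tso_logical 440757324149424131       # prints 3
```

1681355759237 ms is 2023-04-13 03:15:59.237 UTC, that is 11:15:59.237 at +08:00, which matches the `checkpoint` value in the output above.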
Transferring the PD leader
[tidb@vm115 ~]$ tiup ctl:v4.0.15 pd -u http://10.2.103.115:32379 member leader transfer pd-10.2.103.115-35379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl pd -u http://10.2.103.115:32379 member leader transfer pd-10.2.103.115-35379
Success!
[tidb@vm115 ~]$ tiup cluster display tidb-dev
tiup is checking updates for component cluster ...
A new version of cluster is available:
The latest version: v1.12.0
Local installed version: v1.11.3
Update current component: tiup update cluster
Update all components: tiup update --all
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster display tidb-dev
Cluster type: tidb
Cluster name: tidb-dev
Cluster version: v4.0.15
Deploy user: tidb
SSH type: builtin
Dashboard URL: http://10.2.103.115:32379/dashboard
Grafana URL: http://10.2.103.115:7000
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir
-- ---- ---- ----- ------- ------ -------- ----------
10.2.103.115:9793 alertmanager 10.2.103.115 9793/9794 linux/x86_64 Up /data1/tidb-data/alertmanager-9793 /data1/tidb-deploy/alertmanager-9793
10.2.103.115:9893 alertmanager 10.2.103.115 9893/9894 linux/x86_64 Up /data1/tidb-data/alertmanager-9893 /data1/tidb-deploy/alertmanager-9893
10.2.103.115:8400 cdc 10.2.103.115 8400 linux/x86_64 Up /data1/tidb-data/cdc-8400 /data1/tidb-deploy/cdc-8400
10.2.103.115:7000 grafana 10.2.103.115 7000 linux/x86_64 Up - /data1/tidb-deploy/grafana-7000
10.2.103.115:32379 pd 10.2.103.115 32379/3380 linux/x86_64 Up|UI /data1/tidb-data/pd-32379 /data1/tidb-deploy/pd-32379
10.2.103.115:35379 pd 10.2.103.115 35379/3580 linux/x86_64 Up|L /data1/tidb-data/pd-35379 /data1/tidb-deploy/pd-35379
10.2.103.115:36379 pd 10.2.103.115 36379/3680 linux/x86_64 Up /data1/tidb-data/pd-36379 /data1/tidb-deploy/pd-36379
10.2.103.115:9590 prometheus 10.2.103.115 9590/35020 linux/x86_64 Up /data1/tidb-data/prometheus-9590 /data1/tidb-deploy/prometheus-9590
10.2.103.115:43000 tidb 10.2.103.115 43000/20080 linux/x86_64 Up - /data1/tidb-deploy/tidb-34000
10.2.103.115:30160 tikv 10.2.103.115 30160/30180 linux/x86_64 Up /data1/tidb-data/tikv-30160 /data1/tidb-deploy/tikv-30160
Total nodes: 10
Transferring the PD leader has no effect on cdc
[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
{
"id": "simple-replication-task2",
"summary": {
"state": "normal",
"tso": 440757380830461955,
"checkpoint": "2023-04-13 11:19:35.458",
"error": null
}
}
]
[tidb@vm115 ~]$
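Checking `state` by eye works for a single changefeed but is easy to get wrong with several. A rough sketch that reduces the list output to one `id state` line per changefeed. It assumes the pretty-printed, one-key-per-line JSON layout shown above, and `changefeed_states` is a hypothetical helper name; where a real JSON parser is available, that would be more robust than this line-based approach:

```shell
# Read `cdc changefeed list` output on stdin and print "<id> <state>"
# for each changefeed. Non-JSON lines (banner, log lines) are ignored.
changefeed_states() {
  awk -F'"' '
    /"id":/           { id = $4 }            # remember the current changefeed id
    /"state":/        { print id, $4 }       # state found inside its summary
    /"summary": null/ { print id, "unknown" } # no summary was returned
  '
}

# Usage:
#   tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379 | changefeed_states
```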
Scaling in a PD node
After the scale-in, the cdc task reports an error.
[tidb@vm115 ~]$ tiup cluster scale-in tidb-dev -N 10.2.103.115:35379
tiup is checking updates for component cluster ...
A new version of cluster is available:
The latest version: v1.12.1
Local installed version: v1.11.3
Update current component: tiup update cluster
Update all components: tiup update --all
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster scale-in tidb-dev -N 10.2.103.115:35379
This operation will delete the 10.2.103.115:35379 nodes in `tidb-dev` and all their data.
Do you want to continue? [y/N]:(default=N) y
Scale-in nodes...
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-dev/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-dev/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [ Serial ] - ClusterOperate: operation=DestroyOperation, options={Roles:[] Nodes:[10.2.103.115:35379] Force:false SSHTimeout:5 OptTimeout:120 APITimeout:600 IgnoreConfigCheck:false NativeSSH:false SSHType: Concurrency:5 SSHProxyHost: SSHProxyPort:22 SSHProxyUser:tidb SSHProxyIdentity:/home/tidb/.ssh/id_rsa SSHProxyUsePassword:false SSHProxyTimeout:5 CleanupData:false CleanupLog:false CleanupAuditLog:false RetainDataRoles:[] RetainDataNodes:[] DisplayMode:default Operation:StartOperation}
Stopping component pd
Stopping instance 10.2.103.115
Stop pd 10.2.103.115:35379 success
Destroying component pd
Destroying instance 10.2.103.115
Destroy 10.2.103.115 success
- Destroy pd paths: [/data1/tidb-data/pd-35379 /data1/tidb-deploy/pd-35379/log /data1/tidb-deploy/pd-35379 /etc/systemd/system/pd-35379.service]
+ [ Serial ] - UpdateMeta: cluster=tidb-dev, deleted=`'10.2.103.115:35379'`
+ [ Serial ] - UpdateTopology: cluster=tidb-dev
+ Refresh instance configs
- Generate config pd -> 10.2.103.115:32379 ... Done
- Generate config pd -> 10.2.103.115:36379 ... Done
- Generate config tikv -> 10.2.103.115:30160 ... Done
- Generate config tidb -> 10.2.103.115:43000 ... Done
- Generate config cdc -> 10.2.103.115:8400 ... Done
- Generate config prometheus -> 10.2.103.115:9590 ... Done
- Generate config grafana -> 10.2.103.115:7000 ... Done
- Generate config alertmanager -> 10.2.103.115:9793 ... Done
- Generate config alertmanager -> 10.2.103.115:9893 ... Done
+ Reload prometheus and grafana
- Reload prometheus -> 10.2.103.115:9590 ... Done
- Reload grafana -> 10.2.103.115:7000 ... Done
Scaled cluster `tidb-dev` in successfully
cdc task error
[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[2023/04/14 09:18:52.714 +08:00] [WARN] [client_changefeed.go:170] ["query changefeed info failed"] [error="owner not found"]
[
{
"id": "simple-replication-task2",
"summary": null
}
]
[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[2023/04/14 09:19:00.720 +08:00] [WARN] [client_changefeed.go:170] ["query changefeed info failed"] [error="owner not found"]
[
{
"id": "simple-replication-task2",
"summary": null
}
]
[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
{
"id": "simple-replication-task2",
"summary": {
"state": "stopped",
"tso": 440778127778512897,
"checkpoint": "2023-04-14 09:18:38.784",
"error": {
"addr": "10.2.103.115:8400",
"code": "CDC:ErrProcessorUnknown",
"message": "failed to update info: [CDC:ErrReachMaxTry]reach maximum try: 3"
}
}
}
]
[tidb@vm115 ~]$
cdc error log
[2023/04/14 09:18:40.348 +08:00] [ERROR] [processor.go:497] ["failed to flush task position"] [changefeed=simple-replication-task2] [error="[CDC:ErrPDEtcdAPIError]rpc error: code = Unknown desc = raft: stopped"] [errorVerbose="[CDC:ErrPDEtcdAPIError]rpc error: code = Unknown desc = raft: stopped\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByCause\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/normalize.go:279\ngithub.com/pingcap/ticdc/pkg/errors.WrapError\n\tgithub.com/pingcap/ticdc@/pkg/errors/helper.go:30\ngithub.com/pingcap/ticdc/cdc/kv.CDCEtcdClient.PutTaskPositionOnChange\n\tgithub.com/pingcap/ticdc@/cdc/kv/etcd.go:739\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).flushTaskPosition\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:494\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).flushTaskStatusAndPosition\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:560\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).positionWorker.func1.1\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:318\ngithub.com/pingcap/ticdc/pkg/retry.run\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry_with_opt.go:54\ngithub.com/pingcap/ticdc/pkg/retry.Do\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry_with_opt.go:32\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).positionWorker.func1\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:317\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).positionWorker.func2\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:349\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).positionWorker\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:418\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).Run.func1\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:251\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357"]
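The embedded stack trace makes a log line like this hard to scan. For a first pass it can help to pull out only the bracketed CDC error codes. A small sketch using `grep -o`; the function name `cdc_error_codes` is hypothetical, and the log path in the usage comment is an assumption based on the deploy directory shown earlier:

```shell
# Print each distinct CDC error code found in log text on stdin.
cdc_error_codes() {
  grep -o '\[CDC:[A-Za-z]*\]' | sort -u
}

# Usage (the log path is an assumption; adjust it to your deployment):
#   cdc_error_codes < /data1/tidb-deploy/cdc-8400/log/cdc.log
```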
Solution
Resume the cdc task.
[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed resume -c simple-replication-task2 --pd=http://10.2.103.115:32379
[tidb@vm115 ~]$ tiup ctl:v4.0.15 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
{
"id": "simple-replication-task2",
"summary": {
"state": "normal",
"tso": 440778162669879297,
"checkpoint": "2023-04-14 09:20:51.884",
"error": null
}
}
]
[tidb@vm115 ~]$
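The check-then-resume sequence above can be wrapped in a small script. This is only a sketch under assumptions: the function and variable names are hypothetical, the state is parsed line by line from the pretty-printed list output, and resuming is simply what worked in this test. Whether an automatic resume is appropriate elsewhere depends on why the changefeed stopped.

```shell
CTL_VERSION="v4.0.15"
PD_ADDR="http://10.2.103.115:32379"

# Print the state of one changefeed, parsed from the list output.
changefeed_state() {
  tiup "ctl:${CTL_VERSION}" cdc changefeed list --pd="${PD_ADDR}" |
    awk -F'"' -v want="$1" '
      /"id":/                  { id = $4 }
      /"state":/ && id == want { print $4 }
    '
}

# Resume the changefeed only if it is currently stopped.
resume_if_stopped() {
  if [ "$(changefeed_state "$1")" = "stopped" ]; then
    tiup "ctl:${CTL_VERSION}" cdc changefeed resume -c "$1" --pd="${PD_ADDR}"
  else
    echo "changefeed $1 is not stopped; nothing to do"
  fi
}

# Usage:
#   resume_if_stopped simple-replication-task2
```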
Testing after upgrading to v5.4.1
Scaling in PD
With the same operation, the cdc task reports no error.
[tidb@vm115 ~]$ tiup cluster scale-in tidb-dev -N 10.2.103.115:35379
tiup is checking updates for component cluster ...
A new version of cluster is available:
The latest version: v1.12.1
Local installed version: v1.11.3
Update current component: tiup update cluster
Update all components: tiup update --all
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster scale-in tidb-dev -N 10.2.103.115:35379
This operation will delete the 10.2.103.115:35379 nodes in `tidb-dev` and all their data.
Do you want to continue? [y/N]:(default=N) y
Scale-in nodes...
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-dev/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-dev/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [ Serial ] - ClusterOperate: operation=DestroyOperation, options={Roles:[] Nodes:[10.2.103.115:35379] Force:false SSHTimeout:5 OptTimeout:120 APITimeout:600 IgnoreConfigCheck:false NativeSSH:false SSHType: Concurrency:5 SSHProxyHost: SSHProxyPort:22 SSHProxyUser:tidb SSHProxyIdentity:/home/tidb/.ssh/id_rsa SSHProxyUsePassword:false SSHProxyTimeout:5 CleanupData:false CleanupLog:false CleanupAuditLog:false RetainDataRoles:[] RetainDataNodes:[] DisplayMode:default Operation:StartOperation}
Stopping component pd
Stopping instance 10.2.103.115
Stop pd 10.2.103.115:35379 success
Destroying component pd
Destroying instance 10.2.103.115
Destroy 10.2.103.115 success
- Destroy pd paths: [/data1/tidb-data/pd-35379 /data1/tidb-deploy/pd-35379/log /data1/tidb-deploy/pd-35379 /etc/systemd/system/pd-35379.service]
+ [ Serial ] - UpdateMeta: cluster=tidb-dev, deleted=`'10.2.103.115:35379'`
+ [ Serial ] - UpdateTopology: cluster=tidb-dev
+ Refresh instance configs
- Generate config pd -> 10.2.103.115:32379 ... Done
- Generate config pd -> 10.2.103.115:36379 ... Done
- Generate config tikv -> 10.2.103.115:30160 ... Done
- Generate config tidb -> 10.2.103.115:43000 ... Done
- Generate config cdc -> 10.2.103.115:8400 ... Done
- Generate config prometheus -> 10.2.103.115:9590 ... Done
- Generate config grafana -> 10.2.103.115:7000 ... Done
- Generate config alertmanager -> 10.2.103.115:9793 ... Done
- Generate config alertmanager -> 10.2.103.115:9893 ... Done
+ Reload prometheus and grafana
- Reload prometheus -> 10.2.103.115:9590 ... Done
- Reload grafana -> 10.2.103.115:7000 ... Done
Scaled cluster `tidb-dev` in successfully
[tidb@vm115 ~]$
cdc status is normal
[tidb@vm115 ~]$ tiup ctl:v5.4.1 cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v5.4.1/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
{
"id": "simple-replication-task2",
"summary": {
"state": "normal",
"tso": 440778284890324994,
"checkpoint": "2023-04-14 09:28:38.118",
"error": null
}
}
]
[tidb@vm115 ~]$
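Summary:
1. Any change to production should, where possible, first be simulated and tested in a test environment.
2. Keep production clusters on mainstream, stable versions where possible; very old versions may contain bugs.
3. The latest LTS versions bring a major leap in cdc functionality and performance, so a recent LTS version is recommended.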