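Note: the test environment runs v4.0.15, which is a very old version as far as cdc is concerned and may have quite a few problems. For production, upgrade to a newer version where possible, for example v6.1.6 or v6.5.1; these versions bring major improvements in both performance and functionality. The problem described below did not occur when tested on v5.4.1, so a recent stable LTS release is recommended.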

Basic cdc commands

# Create a cdc task
tiup ctl:v4.0.15 cdc  changefeed create --pd=http://10.2.103.115:32379 --sink-uri="tidb://root:tidb@10.2.103.116:34000/" --changefeed-id="simple-replication-task" --config=cdc.toml 
# List the status of cdc tasks
tiup ctl:v4.0.15  cdc changefeed list --pd=http://10.2.103.115:32379
# Query the status of a specific task
tiup ctl:v4.0.15  cdc changefeed query -c simple-replication-task --pd=http://10.2.103.115:32379
# Remove a task
tiup ctl:v4.0.15  cdc changefeed remove -c simple-replication-task --pd=http://10.2.103.115:32379
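
The commands above repeat the same ctl version and PD address each time. Below is a minimal sketch of factoring those two values into variables. The names CTL_VERSION, PD_ADDR and cdc_changefeed_cmd are made up for this example and are not part of tiup. The function only prints the command line, so it can be reviewed before anything is run.

```shell
# Hypothetical helper: CTL_VERSION, PD_ADDR and cdc_changefeed_cmd are
# illustrative names, not part of tiup.
CTL_VERSION="${CTL_VERSION:-v4.0.15}"
PD_ADDR="${PD_ADDR:-http://10.2.103.115:32379}"

# Print (do not run) the tiup command line for a changefeed subcommand.
# All arguments are passed through as the subcommand and its flags.
cdc_changefeed_cmd() {
  printf 'tiup ctl:%s cdc changefeed %s --pd=%s\n' "$CTL_VERSION" "$*" "$PD_ADDR"
}
```

For example, calling cdc_changefeed_cmd query -c simple-replication-task prints the query command for that task with the configured version and PD address filled in.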

PD status

[tidb@vm115 ~]$ tiup cluster display tidb-dev
tiup is checking updates for component cluster ...
A new version of cluster is available:
   The latest version:         v1.12.0
   Local installed version:    v1.11.3
   Update current component:   tiup update cluster
   Update all components:      tiup update --all

Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster display tidb-dev
Cluster type:       tidb
Cluster name:       tidb-dev
Cluster version:    v4.0.15
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://10.2.103.115:32379/dashboard
Grafana URL:        http://10.2.103.115:7000
ID                  Role          Host          Ports        OS/Arch       Status   Data Dir                            Deploy Dir
--                  ----          ----          -----        -------       ------   --------                            ----------
10.2.103.115:9793   alertmanager  10.2.103.115  9793/9794    linux/x86_64  Up       /data1/tidb-data/alertmanager-9793  /data1/tidb-deploy/alertmanager-9793
10.2.103.115:9893   alertmanager  10.2.103.115  9893/9894    linux/x86_64  Up       /data1/tidb-data/alertmanager-9893  /data1/tidb-deploy/alertmanager-9893
10.2.103.115:8400   cdc           10.2.103.115  8400         linux/x86_64  Up       /data1/tidb-data/cdc-8400           /data1/tidb-deploy/cdc-8400
10.2.103.115:7000   grafana       10.2.103.115  7000         linux/x86_64  Up       -                                   /data1/tidb-deploy/grafana-7000
10.2.103.115:32379  pd            10.2.103.115  32379/3380   linux/x86_64  Up|L|UI  /data1/tidb-data/pd-32379           /data1/tidb-deploy/pd-32379
10.2.103.115:35379  pd            10.2.103.115  35379/3580   linux/x86_64  Up       /data1/tidb-data/pd-35379           /data1/tidb-deploy/pd-35379
10.2.103.115:36379  pd            10.2.103.115  36379/3680   linux/x86_64  Up       /data1/tidb-data/pd-36379           /data1/tidb-deploy/pd-36379
10.2.103.115:9590   prometheus    10.2.103.115  9590/35020   linux/x86_64  Up       /data1/tidb-data/prometheus-9590    /data1/tidb-deploy/prometheus-9590
10.2.103.115:43000  tidb          10.2.103.115  43000/20080  linux/x86_64  Up       -                                   /data1/tidb-deploy/tidb-34000
10.2.103.115:30160  tikv          10.2.103.115  30160/30180  linux/x86_64  Up       /data1/tidb-data/tikv-30160         /data1/tidb-deploy/tikv-30160
Total nodes: 10
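
In the display output, the PD leader is the instance whose Status column carries the L flag (Up|L|UI above). Before changing PD membership it can help to know which instance that is. A small sketch that reads the display output and prints the leader's ID; pd_leader is an illustrative name, not a tiup subcommand, and it assumes the column layout shown above (ID, Role, Host, Ports, OS/Arch, Status).

```shell
# Read `tiup cluster display` output on stdin and print the ID of every PD
# instance whose Status column (6th field) contains the leader flag "L".
# pd_leader is an illustrative name, not a tiup subcommand.
pd_leader() {
  awk '$2 == "pd" {
    n = split($6, flags, "|")          # e.g. "Up|L|UI" -> Up, L, UI
    for (i = 1; i <= n; i++)
      if (flags[i] == "L") print $1
  }'
}
```

Typical use would be: tiup cluster display tidb-dev | pd_leader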

cdc task status

[tidb@vm115 ~]$ tiup ctl:v4.0.15  cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
  {
    "id": "simple-replication-task2",
    "summary": {
      "state": "normal",
      "tso": 440757324149424131,
      "checkpoint": "2023-04-13 11:15:59.237",
      "error": null
    }
  }
]

Transferring the PD leader

[tidb@vm115 ~]$ tiup ctl:v4.0.15 pd  -u http://10.2.103.115:32379 member leader transfer pd-10.2.103.115-35379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl pd -u http://10.2.103.115:32379 member leader transfer pd-10.2.103.115-35379
Success!
[tidb@vm115 ~]$ tiup cluster display tidb-dev
tiup is checking updates for component cluster ...
A new version of cluster is available:
   The latest version:         v1.12.0
   Local installed version:    v1.11.3
   Update current component:   tiup update cluster
   Update all components:      tiup update --all

Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster display tidb-dev
Cluster type:       tidb
Cluster name:       tidb-dev
Cluster version:    v4.0.15
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://10.2.103.115:32379/dashboard
Grafana URL:        http://10.2.103.115:7000
ID                  Role          Host          Ports        OS/Arch       Status  Data Dir                            Deploy Dir
--                  ----          ----          -----        -------       ------  --------                            ----------
10.2.103.115:9793   alertmanager  10.2.103.115  9793/9794    linux/x86_64  Up      /data1/tidb-data/alertmanager-9793  /data1/tidb-deploy/alertmanager-9793
10.2.103.115:9893   alertmanager  10.2.103.115  9893/9894    linux/x86_64  Up      /data1/tidb-data/alertmanager-9893  /data1/tidb-deploy/alertmanager-9893
10.2.103.115:8400   cdc           10.2.103.115  8400         linux/x86_64  Up      /data1/tidb-data/cdc-8400           /data1/tidb-deploy/cdc-8400
10.2.103.115:7000   grafana       10.2.103.115  7000         linux/x86_64  Up      -                                   /data1/tidb-deploy/grafana-7000
10.2.103.115:32379  pd            10.2.103.115  32379/3380   linux/x86_64  Up|UI   /data1/tidb-data/pd-32379           /data1/tidb-deploy/pd-32379
10.2.103.115:35379  pd            10.2.103.115  35379/3580   linux/x86_64  Up|L    /data1/tidb-data/pd-35379           /data1/tidb-deploy/pd-35379
10.2.103.115:36379  pd            10.2.103.115  36379/3680   linux/x86_64  Up      /data1/tidb-data/pd-36379           /data1/tidb-deploy/pd-36379
10.2.103.115:9590   prometheus    10.2.103.115  9590/35020   linux/x86_64  Up      /data1/tidb-data/prometheus-9590    /data1/tidb-deploy/prometheus-9590
10.2.103.115:43000  tidb          10.2.103.115  43000/20080  linux/x86_64  Up      -                                   /data1/tidb-deploy/tidb-34000
10.2.103.115:30160  tikv          10.2.103.115  30160/30180  linux/x86_64  Up      /data1/tidb-data/tikv-30160         /data1/tidb-deploy/tikv-30160
Total nodes: 10

Transferring the PD leader has no effect on cdc

[tidb@vm115 ~]$  tiup ctl:v4.0.15  cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
  {
    "id": "simple-replication-task2",
    "summary": {
      "state": "normal",
      "tso": 440757380830461955,
      "checkpoint": "2023-04-13 11:19:35.458",
      "error": null
    }
  }
]
[tidb@vm115 ~]$
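
One way to confirm that replication kept moving across the leader transfer is to compare the tso values from the two changefeed list outputs. A TiDB TSO stores a physical timestamp in milliseconds in its high bits and an 18-bit logical counter in its low bits, so shifting right by 18 yields milliseconds since the Unix epoch. The helper names below are illustrative, and the sketch assumes a shell with 64-bit integer arithmetic.

```shell
# Illustrative helpers, assuming 64-bit shell arithmetic.
# A TSO is (physical_ms << 18) + logical, so ">> 18" recovers physical_ms.
tso_to_ms() {
  echo $(( $1 >> 18 ))
}

# Milliseconds the checkpoint advanced between two samples:
# first argument is the earlier tso, second argument the later one.
checkpoint_advance_ms() {
  echo $(( ($2 >> 18) - ($1 >> 18) ))
}
```

For the two samples above, checkpoint_advance_ms 440757324149424131 440757380830461955 gives 216221, which matches the 216.221 seconds between the checkpoints 11:15:59.237 and 11:19:35.458.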

Scaling in a PD node

After the scale-in, the cdc task reports an error

[tidb@vm115 ~]$ tiup cluster scale-in tidb-dev -N 10.2.103.115:35379
tiup is checking updates for component cluster ...
A new version of cluster is available:
   The latest version:         v1.12.1
   Local installed version:    v1.11.3
   Update current component:   tiup update cluster
   Update all components:      tiup update --all

Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster scale-in tidb-dev -N 10.2.103.115:35379
This operation will delete the 10.2.103.115:35379 nodes in `tidb-dev` and all their data.
Do you want to continue? [y/N]:(default=N) y
Scale-in nodes...
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-dev/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-dev/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [ Serial ] - ClusterOperate: operation=DestroyOperation, options={Roles:[] Nodes:[10.2.103.115:35379] Force:false SSHTimeout:5 OptTimeout:120 APITimeout:600 IgnoreConfigCheck:false NativeSSH:false SSHType: Concurrency:5 SSHProxyHost: SSHProxyPort:22 SSHProxyUser:tidb SSHProxyIdentity:/home/tidb/.ssh/id_rsa SSHProxyUsePassword:false SSHProxyTimeout:5 CleanupData:false CleanupLog:false CleanupAuditLog:false RetainDataRoles:[] RetainDataNodes:[] DisplayMode:default Operation:StartOperation}
Stopping component pd
        Stopping instance 10.2.103.115
        Stop pd 10.2.103.115:35379 success
Destroying component pd
        Destroying instance 10.2.103.115
Destroy 10.2.103.115 success
- Destroy pd paths: [/data1/tidb-data/pd-35379 /data1/tidb-deploy/pd-35379/log /data1/tidb-deploy/pd-35379 /etc/systemd/system/pd-35379.service]
+ [ Serial ] - UpdateMeta: cluster=tidb-dev, deleted=`'10.2.103.115:35379'`
+ [ Serial ] - UpdateTopology: cluster=tidb-dev
+ Refresh instance configs
  - Generate config pd -> 10.2.103.115:32379 ... Done
  - Generate config pd -> 10.2.103.115:36379 ... Done
  - Generate config tikv -> 10.2.103.115:30160 ... Done
  - Generate config tidb -> 10.2.103.115:43000 ... Done
  - Generate config cdc -> 10.2.103.115:8400 ... Done
  - Generate config prometheus -> 10.2.103.115:9590 ... Done
  - Generate config grafana -> 10.2.103.115:7000 ... Done
  - Generate config alertmanager -> 10.2.103.115:9793 ... Done
  - Generate config alertmanager -> 10.2.103.115:9893 ... Done
+ Reload prometheus and grafana
  - Reload prometheus -> 10.2.103.115:9590 ... Done
  - Reload grafana -> 10.2.103.115:7000 ... Done
Scaled cluster `tidb-dev` in successfully

The cdc task reports an error

[tidb@vm115 ~]$ tiup ctl:v4.0.15  cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[2023/04/14 09:18:52.714 +08:00] [WARN] [client_changefeed.go:170] ["query changefeed info failed"] [error="owner not found"]
[
  {
    "id": "simple-replication-task2",
    "summary": null
  }
]
[tidb@vm115 ~]$ tiup ctl:v4.0.15  cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[2023/04/14 09:19:00.720 +08:00] [WARN] [client_changefeed.go:170] ["query changefeed info failed"] [error="owner not found"]
[
  {
    "id": "simple-replication-task2",
    "summary": null
  }
]
[tidb@vm115 ~]$ tiup ctl:v4.0.15  cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
  {
    "id": "simple-replication-task2",
    "summary": {
      "state": "stopped",
      "tso": 440778127778512897,
      "checkpoint": "2023-04-14 09:18:38.784",
      "error": {
        "addr": "10.2.103.115:8400",
        "code": "CDC:ErrProcessorUnknown",
        "message": "failed to update info: [CDC:ErrReachMaxTry]reach maximum try: 3"
      }
    }
  }
]
[tidb@vm115 ~]$
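
The three outputs above show two unhealthy shapes: a null summary while no owner is found, and a state of stopped. A minimal sketch of a check that treats both as failures, using only sed and grep on the text of the changefeed list output. The function names are illustrative, and the parsing assumes the pretty-printed layout shown above (one "state" field per line).

```shell
# Read `changefeed list` output on stdin and print each "state" value,
# one per line. Assumes the pretty-printed layout shown above.
changefeed_states() {
  sed -n 's/^[[:space:]]*"state":[[:space:]]*"\([^"]*\)".*$/\1/p'
}

# Succeed only if at least one state was reported and every state is
# "normal". A null summary produces no state line, so it counts as a failure.
changefeeds_all_normal() {
  states=$(changefeed_states)
  [ -n "$states" ] || return 1
  ! printf '%s\n' "$states" | grep -qv '^normal$'
}
```

Piping the list output into changefeeds_all_normal after a cluster change gives a non-zero exit status for both the owner-not-found case and the stopped case.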

cdc error log

[2023/04/14 09:18:40.348 +08:00] [ERROR] [processor.go:497] ["failed to flush task position"] [changefeed=simple-replication-task2] [error="[CDC:ErrPDEtcdAPIError]rpc error: code = Unknown desc = raft: stopped"] [errorVerbose="[CDC:ErrPDEtcdAPIError]rpc error: code = Unknown desc = raft: stopped\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByCause\n\tgithub.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/normalize.go:279\ngithub.com/pingcap/ticdc/pkg/errors.WrapError\n\tgithub.com/pingcap/ticdc@/pkg/errors/helper.go:30\ngithub.com/pingcap/ticdc/cdc/kv.CDCEtcdClient.PutTaskPositionOnChange\n\tgithub.com/pingcap/ticdc@/cdc/kv/etcd.go:739\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).flushTaskPosition\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:494\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).flushTaskStatusAndPosition\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:560\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).positionWorker.func1.1\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:318\ngithub.com/pingcap/ticdc/pkg/retry.run\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry_with_opt.go:54\ngithub.com/pingcap/ticdc/pkg/retry.Do\n\tgithub.com/pingcap/ticdc@/pkg/retry/retry_with_opt.go:32\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).positionWorker.func1\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:317\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).positionWorker.func2\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:349\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).positionWorker\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:418\ngithub.com/pingcap/ticdc/cdc.(*oldProcessor).Run.func1\n\tgithub.com/pingcap/ticdc@/cdc/processor.go:251\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57\nruntime.goexit\n\truntime/asm_amd64.s:1357"]
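
The log line is long, and the useful part is the bracketed error code such as [CDC:ErrPDEtcdAPIError]. A small sketch for pulling the distinct codes out of log text; cdc_error_codes is an illustrative name, and the pattern only assumes the [CDC:ErrXxx] form seen in the output above.

```shell
# Read TiCDC log or CLI output on stdin and print the distinct error codes
# that appear in the bracketed form [CDC:ErrXxx], without the brackets.
# cdc_error_codes is an illustrative name.
cdc_error_codes() {
  grep -o '\[CDC:[A-Za-z]*]' | sed 's/^\[//; s/]$//' | sort -u
}
```

It reads from standard input, so the cdc log file can be redirected into it directly.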

Solution

Resume the cdc task

[tidb@vm115 ~]$ tiup ctl:v4.0.15  cdc changefeed resume -c simple-replication-task2 --pd=http://10.2.103.115:32379
[tidb@vm115 ~]$ tiup ctl:v4.0.15  cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v4.0.15/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
  {
    "id": "simple-replication-task2",
    "summary": {
      "state": "normal",
      "tso": 440778162669879297,
      "checkpoint": "2023-04-14 09:20:51.884",
      "error": null
    }
  }
]
[tidb@vm115 ~]$
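
With several changefeeds, resuming them one by one is tedious. The sketch below reads the changefeed list output, picks out the IDs whose state is stopped, and prints one resume command per ID for review. CTL_VERSION, PD_ADDR and the function names are illustrative, and the parsing assumes the pretty-printed layout shown above, where "id" appears before "state" for each changefeed.

```shell
# Hypothetical helpers; CTL_VERSION and PD_ADDR are illustrative variables.
CTL_VERSION="${CTL_VERSION:-v4.0.15}"
PD_ADDR="${PD_ADDR:-http://10.2.103.115:32379}"

# Read `changefeed list` output on stdin; print IDs whose state is "stopped".
# Splitting on double quotes makes field 2 the key and field 4 the value.
stopped_changefeeds() {
  awk -F'"' '
    $2 == "id"                       { id = $4 }
    $2 == "state" && $4 == "stopped" { print id }
  '
}

# Print (do not run) one resume command per stopped changefeed.
resume_commands() {
  stopped_changefeeds | while read -r id; do
    printf 'tiup ctl:%s cdc changefeed resume -c %s --pd=%s\n' \
      "$CTL_VERSION" "$id" "$PD_ADDR"
  done
}
```

Printing instead of executing is a deliberate choice here: a changefeed may have been stopped on purpose, so the commands should be checked before they are run.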

Testing after upgrading to v5.4.1

Scaling in PD

With the same operation, the cdc task does not report an error

[tidb@vm115 ~]$ tiup cluster scale-in tidb-dev -N 10.2.103.115:35379 
tiup is checking updates for component cluster ...
A new version of cluster is available:
   The latest version:         v1.12.1
   Local installed version:    v1.11.3
   Update current component:   tiup update cluster
   Update all components:      tiup update --all

Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.11.3/tiup-cluster scale-in tidb-dev -N 10.2.103.115:35379
This operation will delete the 10.2.103.115:35379 nodes in `tidb-dev` and all their data.
Do you want to continue? [y/N]:(default=N) y
Scale-in nodes...
+ [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-dev/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-dev/ssh/id_rsa.pub
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [Parallel] - UserSSH: user=tidb, host=10.2.103.115
+ [ Serial ] - ClusterOperate: operation=DestroyOperation, options={Roles:[] Nodes:[10.2.103.115:35379] Force:false SSHTimeout:5 OptTimeout:120 APITimeout:600 IgnoreConfigCheck:false NativeSSH:false SSHType: Concurrency:5 SSHProxyHost: SSHProxyPort:22 SSHProxyUser:tidb SSHProxyIdentity:/home/tidb/.ssh/id_rsa SSHProxyUsePassword:false SSHProxyTimeout:5 CleanupData:false CleanupLog:false CleanupAuditLog:false RetainDataRoles:[] RetainDataNodes:[] DisplayMode:default Operation:StartOperation}
Stopping component pd
        Stopping instance 10.2.103.115
        Stop pd 10.2.103.115:35379 success
Destroying component pd
        Destroying instance 10.2.103.115
Destroy 10.2.103.115 success
- Destroy pd paths: [/data1/tidb-data/pd-35379 /data1/tidb-deploy/pd-35379/log /data1/tidb-deploy/pd-35379 /etc/systemd/system/pd-35379.service]
+ [ Serial ] - UpdateMeta: cluster=tidb-dev, deleted=`'10.2.103.115:35379'`
+ [ Serial ] - UpdateTopology: cluster=tidb-dev
+ Refresh instance configs
  - Generate config pd -> 10.2.103.115:32379 ... Done
  - Generate config pd -> 10.2.103.115:36379 ... Done
  - Generate config tikv -> 10.2.103.115:30160 ... Done
  - Generate config tidb -> 10.2.103.115:43000 ... Done
  - Generate config cdc -> 10.2.103.115:8400 ... Done
  - Generate config prometheus -> 10.2.103.115:9590 ... Done
  - Generate config grafana -> 10.2.103.115:7000 ... Done
  - Generate config alertmanager -> 10.2.103.115:9793 ... Done
  - Generate config alertmanager -> 10.2.103.115:9893 ... Done
+ Reload prometheus and grafana
  - Reload prometheus -> 10.2.103.115:9590 ... Done
  - Reload grafana -> 10.2.103.115:7000 ... Done
Scaled cluster `tidb-dev` in successfully
[tidb@vm115 ~]$

cdc status is normal

[tidb@vm115 ~]$ tiup ctl:v5.4.1  cdc changefeed list --pd=http://10.2.103.115:32379
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v5.4.1/ctl cdc changefeed list --pd=http://10.2.103.115:32379
[
  {
    "id": "simple-replication-task2",
    "summary": {
      "state": "normal",
      "tso": 440778284890324994,
      "checkpoint": "2023-04-14 09:28:38.118",
      "error": null
    }
  }
]
[tidb@vm115 ~]$

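Summary:

1. Any change to production should, where possible, be simulated and tested in a test environment first.

2. Upgrade production clusters to a mainstream, stable version where possible; very old versions may contain bugs.

3. The latest LTS versions bring a qualitative leap in cdc functionality and performance, so a recent LTS version is recommended.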

PD migration pitfall: all cdc tasks stop