Column - Summary of problems encountered during DM replication


[TiDB version]: V4.0, DM V1.0.6

This post summarizes some problems encountered during DM replication and how they were resolved.

Attachment: DM同步问题及解决方法.docx (43.7 KB)

Problem 1


Symptom

"Message": "execute statement failed: ALTER TABLE `hdb_prod`.`prod_room_info` MODIFY COLUMN `building_floor` VARCHAR(20) DEFAULT NULL COMMENT '楼层': Error 8200: Unsupported modify column: type varchar(20) not match origin int(255)",

"RawCause": "Error 8200: Unsupported modify column: type varchar(20) not match origin int(255)"

Cause

For column type changes, TiDB only supports changes between integer types, between string types, and between Blob types, and only when the new type is longer than the original one. In addition, the unsigned/charset/collate attributes of a column cannot be changed. The statement above changes an int column to varchar, so it is rejected.

Solution

Method 1

1. Drop the table: drop table xxx

2. Set the task parameter remove-meta: true

3. Re-run the task: start-task task

Method 2 (suitable when the table is small)

Rename the table:

alter table prod_room_info rename to prod_room_info_bak;

Recreate the table:

CREATE TABLE `prod_room_info` (xxx,xxx);

Re-insert the table data:

insert into prod_room_info select * from prod_room_info_bak;

Skip the failing SQL:

sql-skip -w 10.71.80.114:8262 --sql-pattern=~(?i)ALTER\s+TABLE\s+`hdb_prod`.`prod_room_info`\s+ADD --sharding hdb_prod

Resume the task:

resume-task task
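
Before running sql-skip, it can help to check offline that the pattern really matches the statement that failed. The sketch below does this with Python's re module, on the assumption that it behaves like the regular expression engine used by DM for a pattern this simple; the leading ~ is dropped because it only marks the value as a regular expression. The pattern recorded above ends in ADD while the failing statement uses MODIFY COLUMN, so the sketch also tries a hypothetical variant ending in MODIFY, for comparison only.

```python
import re

# Failing statement, copied from the error message above.
failing_ddl = (
    "ALTER TABLE `hdb_prod`.`prod_room_info` MODIFY COLUMN `building_floor` "
    "VARCHAR(20) DEFAULT NULL COMMENT '楼层'"
)

# Pattern as recorded in the sql-skip command above (without the leading "~").
recorded = r"(?i)ALTER\s+TABLE\s+`hdb_prod`.`prod_room_info`\s+ADD"
# Hypothetical variant ending in MODIFY; not part of the original post.
variant = r"(?i)ALTER\s+TABLE\s+`hdb_prod`.`prod_room_info`\s+MODIFY"


def matches(pattern: str, statement: str) -> bool:
    """Return True if the pattern matches anywhere in the statement."""
    return re.search(pattern, statement) is not None


recorded_hits = matches(recorded, failing_ddl)
variant_hits = matches(variant, failing_ddl)
print("recorded pattern matches:", recorded_hits)
print("MODIFY variant matches:", variant_hits)
```

If the pattern does not match the statement shown in query-status, the skip operation is unlikely to take effect, so it is worth adjusting the pattern before resuming the task.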

Problem 2


Symptom

The dump fails when the database name contains a "-" character:

"msg": "[code=32001:class=dump-unit:scope=internal:level=high] mydumper runs with error: exit status 1. \n\n",

Cause

The database name contains a hyphen "-", which makes the dump fail.

Solution

In the mydumpers section of the task configuration, pass -x in extra-args:

mydumpers:
  mysql-replica-04.dump:
    mydumper-path: bin/mydumper
    threads: 4
    chunk-filesize: 64
    skip-tz-utc: true
    extra-args: "-s 100000 -x '^oms-dataex' --no-locks"

Problem 3


Symptom

"Message": "run table schema failed - dbfile ./dumped_data.center_other/ouser.store_business-schema.sql: execute statement failed: CREATE TABLE `store_business` (`id` bigint(20) NOT NULL AUTO_INCREMENT,`business_type` tinyint(2) DEFAULT NULL COMMENT '1 代表类目 2代码品牌',`category_code` varchar(60) DEFAULT NULL COMMENT '类目code',`brand_id` bigint(20) DEFAULT NULL COMMENT '品牌Id',`name` varchar(60) DEFAULT NULL COMMENT '名称',`org_id` bigint(20) DEFAULT NULL COMMENT '店铺Id',`is_deleted` int(2) DEFAULT NULL COMMENT '是否已删除',`create_userid` bigint(20) DEFAULT NULL,`create_username` varchar(100) DEFAULT NULL,`create_user_ip` varchar(60) DEFAULT NULL COMMENT '创建人IP',`create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '创建时间-应用操作时间',`create_time_db` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' COMMENT '创建时间-数据库操作时间',`update_userid` bigint(20) DEFAULT NULL,`update_username` varchar(100) DEFAULT NULL,`update_user_ip` varchar(60) DEFAULT NULL COMMENT '修改人IP',`update_time` timestamp NOT NULL DEFAULT '0000-00-00 00:00:0...: Error 1067: Invalid default value for 'create_time_db'",

"RawCause": "Error 1067: Invalid default value for 'create_time_db'"

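Cause

In the CREATE TABLE statement, several columns of the source table are defined like this:

`create_time_db` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' COMMENT '创建时间-数据库操作时间'

TiDB does not accept DEFAULT '0000-00-00 00:00:00'.

Solution

1. Recreate the table in TiDB, changing the default value of the failing columns to 1970-01-01 10:00:00

2. Skip the failing DDL

sql-skip -w ip:8262 --sql-pattern=~(?i)CREATE\s+TABLE\s+`store_business` --sharding task

3. Resume the task

resume-task task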
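
When many columns or tables carry the zero default, editing the dumped schema by hand is error-prone. The following is a minimal sketch of how step 1 could be prepared, assuming the schema text is available as a string (for example read from the *-schema.sql file named in the error); the helper name replace_zero_defaults is made up for this illustration and is not part of DM.

```python
import re

# Matches DEFAULT '0000-00-00 00:00:00' regardless of keyword case or spacing.
ZERO_DEFAULT = re.compile(r"DEFAULT\s+'0000-00-00 00:00:00'", re.IGNORECASE)


def replace_zero_defaults(ddl: str, new_default: str = "1970-01-01 10:00:00") -> str:
    """Return the DDL text with every zero-timestamp default replaced."""
    return ZERO_DEFAULT.sub(f"DEFAULT '{new_default}'", ddl)


# Column definition copied from the failing CREATE TABLE statement.
column = (
    "`create_time_db` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' "
    "COMMENT '创建时间-数据库操作时间'"
)
fixed = replace_zero_defaults(column)
print(fixed)
```

The rewritten statement is then executed manually in TiDB before the original DDL is skipped.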

Problem 4


Symptom

"errorMsg": "instance mysql-replica-06 table `hdb_broker_1`.`client_baseinfo` of sharding `hdb_broker`.`client_baseinfo` have auto-increment key, would conflict with each other to cause data corruption",

"instruction": "please handle it by yourself, read document https://pingcap.com/docs-cn/dev/reference/tools/data-migration/usage-scenarios/best-practice-dm-shard/#自增主键冲突处理 for more detail (only have Chinese document now, will translate to English later)",

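Cause

In a sharded-database, sharded-table scenario, the upstream tables have an auto-increment primary key id.

Solution

See: https://pingcap.com/docs-cn/dev/reference/tools/data-migration/usage-scenarios/best-practice-dm-shard/#自增主键冲突处理

Modify the downstream table schema as described there.

One more key point:

The following setting must be added to the replication task configuration task.yaml (it skips the auto-increment primary key conflict check):

ignore-checking-items: ["auto_increment_ID"]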
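
To see why the downstream schema has to change, the sketch below merges two shards whose auto-increment ids both start at 1. It uses an in-memory SQLite database purely as a stand-in for the downstream TiDB, and the uid column and the sample rows are hypothetical; only the table name comes from the error message above.

```python
import sqlite3

# Rows from two hypothetical upstream shards; both start their id at 1.
shard_1 = [(1, "uid-a"), (2, "uid-b")]
shard_2 = [(1, "uid-c"), (2, "uid-d")]


def merge(create_sql: str) -> int:
    """Create the merged table, insert the rows of both shards, and
    return how many rows were rejected by a uniqueness constraint."""
    conn = sqlite3.connect(":memory:")
    conn.execute(create_sql)
    rejected = 0
    for row in shard_1 + shard_2:
        try:
            conn.execute("INSERT INTO client_baseinfo (id, uid) VALUES (?, ?)", row)
        except sqlite3.IntegrityError:
            rejected += 1
    conn.close()
    return rejected


# id kept as the primary key: the ids of the second shard collide.
with_pk = merge("CREATE TABLE client_baseinfo (id INTEGER PRIMARY KEY, uid TEXT)")
# id demoted to a plain column; uniqueness moved to uid (assumed globally unique).
without_pk = merge("CREATE TABLE client_baseinfo (id INTEGER NOT NULL, uid TEXT UNIQUE)")
print("rejected with primary key on id:", with_pk)
print("rejected without primary key on id:", without_pk)
```

The second schema only works if the column chosen for the unique key really is unique across all shards, which is worth verifying in the upstream data first.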

Problem 5


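Symptom

"msg": "[code=38033:class=dm-master:scope=internal:level=high] request to dm-worker 10.123.222.246:8262 is timeout, but request may be successful, please execute `query-status` to check status\ngithub.com/pingcap/dm/pkg/terror.(*Error).

Cause

There can be many causes, and the query-status output alone is not enough to pinpoint one. In this case it only became clear afterwards that the disk of the upstream MySQL was full, which caused the error.

Solution

After the upstream is repaired, re-run the task.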

Problem 6


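Symptom

"Message": "execute statement failed: ALTER TABLE `hdb_broker2`.`broker_backup` ADD PRIMARY KEY(`guid`): Error 8200: Unsupported add primary key, alter-primary-key is false",

"RawCause": "Error 8200: Unsupported add primary key, alter-primary-key is false"

Cause

TiDB does not support adding a primary key through ALTER TABLE (alter-primary-key is false here, as the error message shows).

Solution

1. Recreate the table in the downstream according to the upstream table schema

2. Set the task parameter remove-meta: true

3. Re-run the task: start-task task

4. After the task has run, set the parameter back to remove-meta: false

5. stop-task task

6. start-task task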

Problem 7


Symptom

grep "12ac50f28e9f475f92b044b6058be5c5" *client_building_visited*

hdb_broker_7.client_abc.000000012.sql:(1435228,'12ac50f28e9f475f92b044b6058be5c5',NULL),

hdb_broker_7.client_abc.000000012.sql:(1454518,'12ac50f28e9f475f92b044b6058be5c5' ,NULL),

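Cause

In the sharded-database, sharded-table scenario, this table has an auto-increment primary key. The id column was created as a unique key and the auto-increment primary key was changed to an ordinary index, but because the values of that column are not unique in the upstream data (the grep output above shows two rows sharing the same value), a primary key conflict is reported.

Solution

1. Fix the data in the upstream

2. Set the task parameter remove-meta: true

3. Re-run the task: start-task task

4. After the task has run, set the parameter back to remove-meta: false

5. stop-task task

6. start-task task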

Problem 8


Symptom

"Message": "current pos (mysql-bin|000001.000669, 219047745): encountered incompatible DDL in TiDB:\n\tplease confirm your DDL statement is correct and needed.\n\tfor TiDB compatible DDL, please see the docs:\n\t English version: https://pingcap.com/docs/dev/reference/mysql-compatibility/#ddl\n\t Chinese version: https://pingcap.com/docs-cn/dev/reference/mysql-compatibility/#ddl\n\tif the DDL is not needed, you can use a filter rule with \"*\" schema-pattern to ignore it.\n\t : parse statement: line 1 column 2 near \"XA START X'31302e37312e36342e35322e746d313539343131343937373035313030303136',X'31302e37312e36342e35322e746d3332',1096044365\" %!!(MISSING)(EXTRA string=XA START X'31302e37312e36342e35322e746d313539343131343937373035313030303136',X'31302e37312e36342e35322e746d3332',1096044365)",

"RawCause": "line 1 column 2 near \"XA START X'31302e37312e36342e35322e746d313539343131343937373035313030303136',X'31302e37312e36342e35322e746d3332',1096044365\" "

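Cause

TiDB does not support distributed XA transactions.

Solution

Adding a filter for XA statements to task.yaml resolves this error:

filters:
  mysql-replica-05.filter.1:
    schema-pattern: "*"   # Very important: only "*" works here. A pattern based on the upstream database name prefix, such as abc_*, does not skip the error
    sql-pattern: ["XA PREPARE", "XA START", "XA END", "XA COMMIT"]
    action: Ignore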
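
To get a feel for which statements such a filter would cover, the patterns can be tried against sample statements offline. The sketch below assumes that the entries of sql-pattern are combined as alternatives and matched case-insensitively as regular expressions; that is an assumption about the filter's behaviour, not something stated in this post, and the sample statements are shortened, hypothetical ones.

```python
import re

# Patterns copied from the filter rule above.
sql_patterns = ["XA PREPARE", "XA START", "XA END", "XA COMMIT"]

# Assumption: the patterns are OR-ed together and matched case-insensitively.
combined = re.compile("|".join(sql_patterns), re.IGNORECASE)


def ignored(statement: str) -> bool:
    """Return True if the statement would be caught by one of the patterns."""
    return combined.search(statement) is not None


samples = [
    "XA START X'3130',X'3132',1096044365",  # shortened, hypothetical XID
    "XA ROLLBACK X'3130',X'3132',1096044365",
    "INSERT INTO t VALUES (1)",
]
results = {statement: ignored(statement) for statement in samples}
for statement, hit in results.items():
    print(hit, statement)
```

Under this assumption XA ROLLBACK is not covered by the four patterns, so it may be worth checking whether the upstream ever emits it.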

Problem 9


Symptom

[2020/07/09 11:03:03.384 +08:00] [INFO] [mydumper.go:164] ["Connected to a MySQL server"] [task=123] [unit=dump]

[2020/07/09 11:03:03.386 +08:00] [ERROR] [mydumper.go:170] ["There are queries in PROCESSLIST running longer than 60s, aborting dump,"] [task=broker_invite_registe_info] [unit=dump]

[2020/07/09 11:03:03.386 +08:00] [ERROR] [mydumper.go:118] ["dump data exits with error"] [task=123] [unit=dump] ["cost time"=20.801611ms] [error="msg:\"[code=32001:class=dump-unit:scope=internal:level=high] mydumper runs with error: exit status 1. \\tuse --long-query-guard to change the guard value, kill queries (--kill-long-queries) or use \\n\\tdifferent server for dump\\n\" "]

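Cause

Several tasks were dumping at the same time, so mydumper found queries running longer than 60s in PROCESSLIST and aborted the dump.

Solution

Wait until the other dump processes have finished, then run the task again.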

Problem 10


Symptom

"ErrLevel": 3,

"Message": "TCPReader get relay event with error: ERROR 1236 (HY000): Could not find first log file name in binary log index file",

"RawCause": "ERROR 1236 (HY000): Could not find first log file name in binary log index file"

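Cause

The upstream MySQL had a failure, so the binlog that DM needed to fetch was missing.

Solution

Re-running the task as full + incremental does complete the replication again, but this error message keeps showing up in query-status.

Final solution:

1. Stop the worker (take it offline)

ansible-playbook stop.yml --tags=dm-worker -l dm-worker6

2. Edit the inventory.ini file and comment out the line of the dm-worker6 instance, then update the configuration and restart the DM-master service

ansible-playbook rolling_update.yml --tags=dm-master

3. Back up the deployment directory

mv /data/dm-deploy /data/dm-deploy_bak

4. Bring the worker back online: edit the inventory.ini file and uncomment the line of the dm-worker6 instance

5. Redeploy worker6

ansible-playbook deploy.yml --tags=dm-worker -l dm-worker6

ansible-playbook start.yml --tags=dm-worker -l dm-worker6

6. Update the configuration and restart the DM-master service

ansible-playbook rolling_update.yml --tags=dm-master

7. Update the configuration and restart the Prometheus service

ansible-playbook rolling_update_monitor.yml --tags=prometheus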

Problem 11


Symptom

[2020/07/13 12:24:14.134 +08:00] [ERROR] [mydumper.go:170] ["Could not read data from oms.so_attachment: Lost connection to MySQL server during query"] [task=so_attachment] [unit=dump]

[2020/07/13 12:24:14.147 +08:00] [INFO] [mydumper.go:164] ["Thread 1 shutting down"] [task=so_attachment] [unit=dump]

[2020/07/13 12:24:14.147 +08:00] [INFO] [mydumper.go:164] ["Finished dump at: 2020-07-13 12:24:14"] [task=so_attachment] [unit=dump]

[2020/07/13 12:24:14.642 +08:00] [ERROR] [mydumper.go:118] ["dump data exits with error"] [task=so_attachment] [unit=dump] ["cost time"=1h56m9.983838484s] [error="msg:\"[code=32001:class=dump-unit:scope=internal:level=high] mydumper runs with error: exit status 1. \\n\\n\" "]

[2020/07/13 12:24:14.642 +08:00] [INFO] [subtask.go:266] ["unit process returned"] [subtask=so_attachment] [unit=Dump] [stage=Paused] [status={}]

[2020/07/13 12:24:14.642 +08:00] [ERROR] [subtask.go:285] ["unit process error"]

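Cause

The table has a column `sign_str` longtext COMMENT '电子公章图片文件流' that stores very large strings. Dumping it is extremely slow, so the task kept automatically retrying the dump.

Solution

For a table whose total size is moderate but whose individual rows contain a very large column, remove chunk-filesize: 64 and add --rows (-r) to extra-args:

mydumpers:
  mysql-replica-04.dump:
    mydumper-path: bin/mydumper
    threads: 4
    #chunk-filesize: 64   # size of a single dump file
    skip-tz-utc: true
    extra-args: "-T oms.so_attachment -r 3000"   # -r: number of rows per dump file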

Problem 12


Symptom

"Message": "handle a potential duplicate event \u0026{Timestamp:1594677869 EventType:TableMapEvent ServerID:64244 EventSize:94 LogPos:123 Flags:0} in mysql-bin.000333: check event \u0026{Timestamp:1594677869 EventType:TableMapEvent ServerID:64244 EventSize:94 LogPos:123 Flags:0} whether duplicate in /data/dm-deploy/relay_log/433d5dde-b90c-11ea-9d0e-fa2736d1d700.000001/mysql-bin.000333: event from 29 in /data/dm-deploy/relay_log/433d5dde-b90c-11ea-9d0e-fa2736d1d700.000001/mysql-bin.000333 diff from passed-in event \u0026{Timestamp:1594677869 EventType:TableMapEvent ServerID:64244 EventSize:94 LogPos:123 Flags:0}",

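Cause

While DM is pulling relay logs and doing incremental replication, this kind of error can occur if it encounters an upstream binlog file larger than 4 GB.

The reason is that when writing the relay log, DM validates each event against the binlog position and the file size, and it also has to save the replicated binlog position as a checkpoint. However, MySQL officially defines the binlog position as a uint32, so for the part of the file beyond 4 GB the offset value of the binlog position overflows, which leads to the error above.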
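
The overflow described above is ordinary unsigned 32-bit wrap-around. A small worked example, with the helper name stored_position invented for illustration, shows how an offset just past the 4 GiB boundary would look once truncated to uint32:

```python
UINT32_MAX = 2**32 - 1  # largest value a uint32 position can hold


def stored_position(real_offset: int) -> int:
    """Offset as it would appear after being truncated to uint32."""
    return real_offset & UINT32_MAX


real = 4 * 1024**3 + 123  # 123 bytes past the 4 GiB boundary
wrapped = stored_position(real)
print("real offset:   ", real)
print("stored as uint32:", wrapped)
```

An event that physically sits beyond 4 GiB therefore reports a small position that looks like one from the start of the file, which is consistent with the duplicate-event check failing.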

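Solution

For the relay processing unit, recover manually with the following steps:

1. Confirm on the upstream that the binlog file involved in the error is larger than 4 GB.

2. Stop DM-worker.

3. Copy the corresponding upstream binlog file into the relay log directory as a relay log file.

4. Update the relay.meta file in the relay log directory so that pulling starts from the next binlog. If enable_gtid is turned on for the DM-worker, the GTID corresponding to the next binlog must also be updated in relay.meta; if enable_gtid is off, the GTID does not need to be changed. For example, if the values at the time of the error are binlog-name = "mysql-bin.004451" and binlog-pos = 2453, update them to binlog-name = "mysql-bin.004452" and binlog-pos = 4, and also update binlog-gtid = "f0e914ef-54cf-11e7-813d-6c92bf2fa791:1-138218058".

5. Restart DM-worker.

For the binlog replication processing unit, recover manually with the following steps:

1. Confirm on the upstream that the binlog file involved in the error is larger than 4 GB.

2. Stop the replication task with stop-task.

3. In the downstream dm_meta database, update binlog_name in the global checkpoint and in every table checkpoint to the binlog file that caused the error, and update binlog_pos to a valid position that has already been replicated, for example 4. For example, if the failing task is named dm_test, its source-id is replica-1, and the binlog file at the time of the error is mysql-bin|000001.004451, run UPDATE dm_test_syncer_checkpoint SET binlog_name='mysql-bin|000001.004451', binlog_pos = 4 WHERE id='replica-1';

4. Set safe-mode: true in the syncers section of the task configuration so that re-execution is safe.

5. Start the replication task with start-task.

Watch the task status with query-status. Once the relay log file that caused the error has been fully replicated, restore safe-mode to its original value and restart the replication task.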

Problem 13


Symptom

"Message": "TCPReader get relay event with error: ERROR 1236 (HY000): could not find next log; the first event 'mysql-bin.000016' at 4, the last event read from '/data/binlog/mysql-bin.000382' at 300608237, the last byte read from '/data/binlog/mysql-bin.000382' at 300608237.",

"RawCause": "ERROR 1236 (HY000): could not find next log; the first event 'mysql-bin.000016' at 4, the last event read from '/data/binlog/mysql-bin.000382' at 300608237, the last byte read from '/data/binlog/mysql-bin.000382' at 300608237."

Cause

The upstream binlog is abnormal.

Solution

Problem 14


Symptom

"Message": "flush checkpoint (mysql-bin|000001.002944, 263803812)(flushed (mysql-bin|000001.002944, 263803812)): execute statement failed: begin: invalid connection",

"RawCause": "invalid connection"

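Cause

The host in the task configuration was set to the lvx IP (a load balancer), which made the task state keep switching between pause and resume.

Solution

In the task configuration, change the downstream host to the IP of TiDB itself.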

Problem 15


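Symptom

After a worker restart, the task became abnormal, and query-status showed a primary key conflict error.

Cause

Caused by remove-meta: true in the replication task configuration.

Solution

When a worker has to be restarted in this situation:

1. stop-task task

2. Set the parameter remove-meta: false

3. start-task task

4. Restart the worker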

Problem 16


Symptom

"Message": "execute statement failed: ALTER TABLE `finance_pump`.`crm_user_level_down_result` MODIFY COLUMN `union_id` VARCHAR(255) CHARACTER SET UTF8 NULL DEFAULT NULL COMMENT '会员Id' FIRST: Error 8200: Unsupported modify charset from utf8mb4 to utf8",

"RawCause": "Error 8200: Unsupported modify charset from utf8mb4 to utf8"

Cause

The upstream and downstream table schemas are already consistent, but the task keeps reporting the DDL error and cannot start.

Solution

1. Skip by statement

sql-skip -w 10.22.33.238:8262 --sql-pattern=~(?i)ALTER\s+TABLE\s+`hdb_broker2`.`broker_backup`\s+ADD --sharding task

or

2. Skip by binlog position

sql-skip -w 10.22.33.238:8262 --binlog-pos=mysql-bin.000226:340571511 task

3. If neither of the two methods above can skip it, handle the DDL that cannot be skipped by changing the task configuration

Configuration:

filter-rules:
  - mysql-replica-06.filter.1

filters:
  mysql-replica-06.filter.1:
    schema-pattern: "finance_pump"
    sql-pattern: ["ALTER TABLE"]
    action: Ignore