1、Background
Our company has a particular data-analytics workload: the big-data team periodically extracts business data for analysis and refreshes reports on a T+1 schedule. These extraction jobs often interfere with normal business traffic. After some research, we found that TiFlash looked like a good fit for this scenario.
However, while using the TiFlash engine recently we ran into several behaviors that did not match our expectations. This post summarizes what we found.
TiDB version: 4.0.13
2、First Impressions
According to the official documentation, TiFlash has a big advantage for analytical queries such as count(*), so that is the scenario we tested first. The results matched the documentation.
(1) Prepare test data
mysql> show create table tb_monitor_disk_info\G
*************************** 1. row ***************************
Table: tb_monitor_disk_info
Create Table: CREATE TABLE `tb_monitor_disk_info` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`rshost` varchar(20) NOT NULL DEFAULT '' COMMENT '主机地址',
`part` varchar(50) NOT NULL DEFAULT '' COMMENT '分区信息',
`disk_info` json DEFAULT NULL COMMENT 'disk信息 json串, 单位是GB',
`a_time` datetime NOT NULL DEFAULT '2022-01-01 00:00:00',
PRIMARY KEY (`id`),
KEY `idx_rshost` (`rshost`),
KEY `idx_a_time` (`a_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin AUTO_INCREMENT=307494024
1 row in set (0.00 sec)
mysql> select count(*) from tb_monitor_disk_info;
+----------+
| count(*) |
+----------+
| 52916325 |
+----------+
1 row in set (0.07 sec)
mysql> select * from tb_monitor_disk_info limit 10;
+-----------+----------------+-----------+---------------------------------------------------+---------------------+
| id | rshost | part | disk_info | a_time |
+-----------+----------------+-----------+---------------------------------------------------+---------------------+
| 260383923 | 192.168.64.160 | /dev/sdb1 | {"free": "3900", "total": "5361", "used": "1461"} | 2023-02-01 03:39:28 |
| 260383924 | 192.168.64.77 | /dev/sda1 | {"free": "2617", "total": "2748", "used": "19"} | 2023-02-01 03:39:28 |
| 260383925 | 192.168.66.224 | /dev/sdb1 | {"free": "1147", "total": "3574", "used": "2426"} | 2023-02-01 03:39:27 |
| 260383926 | 192.168.64.158 | /dev/sdb1 | {"free": "2746", "total": "5361", "used": "2614"} | 2023-02-01 03:39:28 |
| 260383927 | 192.168.65.33 | /dev/sdb1 | {"free": "2328", "total": "5361", "used": "3033"} | 2023-02-01 03:39:28 |
| 260383928 | 192.168.65.93 | /dev/dfa1 | {"free": "2947", "total": "2978", "used": "31"} | 2023-02-01 03:39:28 |
| 260383929 | 192.168.64.68 | /dev/sda1 | {"free": "2607", "total": "2748", "used": "28"} | 2023-02-01 03:39:28 |
| 260383930 | 192.168.65.83 | /dev/sdb1 | {"free": "3346", "total": "3574", "used": "227"} | 2023-02-01 03:39:27 |
| 260383931 | 192.168.64.245 | /dev/sdb1 | {"free": "3819", "total": "5321", "used": "1233"} | 2023-02-01 03:39:29 |
| 260383932 | 192.168.64.249 | /dev/sdb1 | {"free": "3972", "total": "5321", "used": "1080"} | 2023-02-01 03:39:29 |
+-----------+----------------+-----------+---------------------------------------------------+---------------------+
10 rows in set (0.02 sec)
mysql> select * from information_schema.tiflash_replica;
+--------------+----------------------+----------+---------------+-----------------+-----------+----------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_ID | REPLICA_COUNT | LOCATION_LABELS | AVAILABLE | PROGRESS |
+--------------+----------------------+----------+---------------+-----------------+-----------+----------+
| dbzz_monitor | tb_monitor_disk_info | 455 | 2 | | 1 | 1 |
+--------------+----------------------+----------+---------------+-----------------+-----------+----------+
1 row in set (0.01 sec)
mysql>
All demos use our production monitoring test data, roughly 50 million rows in total, with the TiFlash replica already deployed.
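For reference, the TiFlash replica shown in tiflash_replica above is typically created with a DDL statement like the following (a sketch; the replica count of 2 matches the REPLICA_COUNT seen above):

```sql
-- Request 2 TiFlash replicas for the table; replication runs in the
-- background and progress is visible in information_schema.tiflash_replica.
ALTER TABLE dbzz_monitor.tb_monitor_disk_info SET TIFLASH REPLICA 2;
```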
(2) Test tiflash
- Run the count with the tikv engine
mysql> set SESSION tidb_isolation_read_engines = "tikv";
Query OK, 0 rows affected (0.00 sec)
mysql> desc select count(*) from tb_monitor_disk_info;
+----------------------------+-------------+-----------+------------------------------------------------------+----------------------------------+
| id | estRows | task | access object | operator info |
+----------------------------+-------------+-----------+------------------------------------------------------+----------------------------------+
| StreamAgg_24 | 1.00 | root | | funcs:count(Column#14)->Column#6 |
| └─IndexReader_25 | 1.00 | root | | index:StreamAgg_8 |
| └─StreamAgg_8 | 1.00 | cop[tikv] | | funcs:count(1)->Column#14 |
| └─IndexFullScan_22 | 53220210.00 | cop[tikv] | table:tb_monitor_disk_info, index:idx_rshost(rshost) | keep order:false |
+----------------------------+-------------+-----------+------------------------------------------------------+----------------------------------+
4 rows in set (0.00 sec)
mysql> select count(*) from tb_monitor_disk_info;
+----------+
| count(*) |
+----------+
| 52916325 |
+----------+
1 row in set (2.12 sec)
mysql> select count(*) from tb_monitor_disk_info;
+----------+
| count(*) |
+----------+
| 52916325 |
+----------+
1 row in set (2.32 sec)
mysql> select count(*) from tb_monitor_disk_info;
+----------+
| count(*) |
+----------+
| 52916325 |
+----------+
1 row in set (2.10 sec)
mysql> select count(*) from tb_monitor_disk_info;
+----------+
| count(*) |
+----------+
| 52916325 |
+----------+
1 row in set (2.14 sec)
mysql> select count(*) from tb_monitor_disk_info;
+----------+
| count(*) |
+----------+
| 52916325 |
+----------+
1 row in set (2.07 sec)
mysql>
- Run the count with the tiflash engine
mysql> set SESSION tidb_isolation_read_engines = "tiflash";
Query OK, 0 rows affected (0.00 sec)
mysql> desc select count(*) from tb_monitor_disk_info;
+----------------------------+-------------+-------------------+----------------------------+---------------------------------+
| id | estRows | task | access object | operator info |
+----------------------------+-------------+-------------------+----------------------------+---------------------------------+
| StreamAgg_16 | 1.00 | root | | funcs:count(Column#8)->Column#6 |
| └─TableReader_17 | 1.00 | root | | data:StreamAgg_8 |
| └─StreamAgg_8 | 1.00 | batchCop[tiflash] | | funcs:count(1)->Column#8 |
| └─TableFullScan_15 | 53220210.00 | batchCop[tiflash] | table:tb_monitor_disk_info | keep order:false |
+----------------------------+-------------+-------------------+----------------------------+---------------------------------+
4 rows in set (0.00 sec)
mysql> select count(*) from tb_monitor_disk_info;
+----------+
| count(*) |
+----------+
| 52916325 |
+----------+
1 row in set (0.07 sec)
mysql> select count(*) from tb_monitor_disk_info;
+----------+
| count(*) |
+----------+
| 52916325 |
+----------+
1 row in set (0.05 sec)
mysql> select count(*) from tb_monitor_disk_info;
+----------+
| count(*) |
+----------+
| 52916325 |
+----------+
1 row in set (0.06 sec)
mysql> select count(*) from tb_monitor_disk_info;
+----------+
| count(*) |
+----------+
| 52916325 |
+----------+
1 row in set (0.05 sec)
mysql> select count(*) from tb_monitor_disk_info;
+----------+
| count(*) |
+----------+
| 52916325 |
+----------+
1 row in set (0.06 sec)
mysql>
As shown above, a full-table row count with the tikv engine takes a little over 2 seconds, while the same count with the tiflash engine finishes in under 100 ms. That is a substantial improvement.
3、Falling Short of Expectations
Before rolling out TiFlash, we only verified the speedup on aggregate queries; we did not run targeted tests, such as simulating the actual business query patterns.
This is a cautionary tale: test thoroughly before going live, and if possible replay realistic production workloads.
As a result, after we switched the extraction jobs to the TiFlash engine, their SQL did not actually use TiFlash. Further testing showed that queries whose predicates can use an index are simply not routed to TiFlash. So in practice, TiFlash was not as effortless as online articles make it sound. The tests follow:
mysql> desc select count(*) from tb_monitor_disk_info where a_time >= '2023-02-10' and a_time < '2023-02-11';
+-----------------------------+------------+-----------+------------------------------------------------------+-------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+-----------------------------+------------+-----------+------------------------------------------------------+-------------------------------------------------------------------+
| StreamAgg_24 | 1.00 | root | | funcs:count(Column#8)->Column#6 |
| └─IndexReader_25 | 1.00 | root | | index:StreamAgg_9 |
| └─StreamAgg_9 | 1.00 | cop[tikv] | | funcs:count(1)->Column#8 |
| └─IndexRangeScan_23 | 1513586.15 | cop[tikv] | table:tb_monitor_disk_info, index:idx_a_time(a_time) | range:[2023-02-10 00:00:00,2023-02-11 00:00:00), keep order:false |
+-----------------------------+------------+-----------+------------------------------------------------------+-------------------------------------------------------------------+
4 rows in set (0.00 sec)
mysql> select count(*) from tb_monitor_disk_info where a_time >= '2023-02-10' and a_time < '2023-02-11';
+----------+
| count(*) |
+----------+
| 1775971 |
+----------+
1 row in set (0.25 sec)
mysql> desc select * from tb_monitor_disk_info where a_time >= '2023-02-10' and a_time < '2023-02-11';
+--------------------------------+------------+-----------+------------------------------------------------------+-------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+--------------------------------+------------+-----------+------------------------------------------------------+-------------------------------------------------------------------+
| IndexLookUp_13 | 1513586.15 | root | | |
| ├─IndexRangeScan_11(Build) | 1513586.15 | cop[tikv] | table:tb_monitor_disk_info, index:idx_a_time(a_time) | range:[2023-02-10 00:00:00,2023-02-11 00:00:00), keep order:false |
| └─TableRowIDScan_12(Probe) | 1513586.15 | cop[tikv] | table:tb_monitor_disk_info | keep order:false |
+--------------------------------+------------+-----------+------------------------------------------------------+-------------------------------------------------------------------+
3 rows in set (0.00 sec)
mysql> desc select * from tb_monitor_disk_info ignore index(idx_a_time) where a_time >= '2023-02-10' and a_time < '2023-02-11';
+-------------------------+-------------+--------------+----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+-------------------------+-------------+--------------+----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| TableReader_10 | 1513586.15 | root | | data:Selection_9 |
| └─Selection_9 | 1513586.15 | cop[tiflash] | | ge(dbzz_monitor.tb_monitor_disk_info.a_time, 2023-02-10 00:00:00.000000), lt(dbzz_monitor.tb_monitor_disk_info.a_time, 2023-02-11 00:00:00.000000) |
| └─TableFullScan_8 | 53220210.00 | cop[tiflash] | table:tb_monitor_disk_info | keep order:false |
+-------------------------+-------------+--------------+----------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)
mysql>
Given this behavior, one workaround is the ignore index syntax: with the index ruled out, the optimizer picks TiFlash.
Alternatively, run set SESSION tidb_isolation_read_engines = "tiflash";
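Besides ignore index and the session variable, TiDB v4.0+ also supports a per-statement optimizer hint that pins a table to a given storage engine, which avoids touching session state; a sketch against the demo table:

```sql
-- Force this one query to read the table from TiFlash
SELECT /*+ read_from_storage(tiflash[tb_monitor_disk_info]) */ count(*)
FROM tb_monitor_disk_info
WHERE a_time >= '2023-02-10' AND a_time < '2023-02-11';
```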
4、Summary
Data extraction is an important workload at our company with some special characteristics: jobs walk incremental or full data by a time column, which is very resource-intensive. With that background, we deployed TiFlash to serve these extraction jobs, but in practice it did not deliver the expected results:
(1) Our extraction SQL is not aggregate-style, and because it uses an index, TiDB probably judged it to be a low-cost query and still dispatched it to TiKV, so normal business traffic was still affected.
(2) To address problem (1), there are two options.
- Modify the business logic.
This means changing the application: either rewrite the extraction statements with ignore index, or set tidb_isolation_read_engines when connecting to TiDB. Either way, it is not very friendly to the business side, so whether it is acceptable depends on the actual situation.
- Modify the TiDB configuration.
The other option is to add a new TiDB node and force its engine to TiFlash:
[isolation-read]
engines = ["tiflash"]
Note, however, that tables without a TiFlash replica cannot be queried through such a node, so it must not be exposed to regular business traffic; otherwise queries may fail.
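If you do take the dedicated-node route, it may also be worth keeping the "tidb" engine in the list (the engine serving TiDB's in-memory system tables), so that queries against tables such as those in information_schema still work on that node. A sketch of the config fragment, to be verified against your TiDB version:

```toml
[isolation-read]
# "tiflash" for user tables with TiFlash replicas;
# "tidb" so TiDB memory/system tables remain readable on this node
engines = ["tiflash", "tidb"]
```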
In short, every environment has its own complex workloads and its own flavor of slow queries, and TiFlash comes with real usage restrictions, so if you plan to use TiFlash to solve production query-performance problems, test it thoroughly first.
5、Closing Thoughts
This post is a brief write-up of testing the TiFlash engine against our production environment. Every company's workloads and requirements differ, and you may hit other unknown issues; everything here is for reference only.