背景
9.10号下午18点多,研发人员反馈有个服务的延时变高了,平时都是2s以下,突然变成了10s多
排查过程
-
登录集群的dashboard,查看集群延时和慢查询确实都升高了,并且业务SQL慢的前后都有
ANALYZE TABLE
dip_saos_w.
db_main_dw;
操作,这个Tidb自动触发的收集更新表的统计信息的操作,如下图:
-
咨询了开发人员,
db_main_dw
这个表很早就上了,并且今天也没有做过任何业务上的调整,我试着调整analyze执行的频率,并且手动执行了一下今天变慢的SQL,发现集群在没有做analyze的时候也很慢,如下图: -
接着在群里咨询大家上图变慢的这个SQL今天有没有做过调整,得知下午上过线,这个SQL去掉了一个排序条件,
ORDER BY c.priority desc, a.start_time
变成了ORDER BY a.start_time
,调整前的SQL如下:
SELECT a.task_id, a.task_type, a.task_type_name, a.task_owner, a.task_owner_name, a.task_title, a.start_time, b.customer_account_id, b.customer_name, c.owner, c.owner_name, c.intention_level, c.oppty_status_category, c.oppty_status, c.last_follow_up_time, c.last_follow_up_status, c.last_follow_up_content, c.priority FROM saos_base.tsa_task a LEFT JOIN saos_tur.tur_customer b ON a.customer_Account = b.customer_account_id LEFT JOIN ( SELECT oo.customer_account_id, oo.OWNER, oo.owner_name, oo.intention_level, oo.oppty_status_category, oo.oppty_status, oo.last_follow_up_time, oo.last_follow_up_status, oo.last_follow_up_content, oo.priority FROM saos_tur.rb_opportunity oo WHERE oo.oppty_status_category = 10 ) c ON b.customer_account_id = c.customer_account_id WHERE a.biz_app = 'RB' AND a.task_owner = '5123' AND a.task_type = 7134 AND a.task_state = 11 AND a.start_time < '2020-09-10 00:00:00' ORDER BY c.priority desc, a.start_time ASC LIMIT 0, 10;
- 然后在PingCAP的技术支持群里咨询各位老师得知,原 SQL 语句使用了更为高效的索引。新的语句去掉了原先的排序列更换了为了低效的索引,所以执行变慢。(从最终的情况来看还是与执行计划不准确有直接关系)
- 那么这个SQL还能再快嘛,8s业务肯定是无法接受的,我想到了用tiflash加速,然后我把这个SQL涉及到的表都开启了tiflash,explain 这个SQL发现查询竟然没有走tiflash,然后我手动设置了session 级别的
set @@session.tidb_isolation_read_engines = "tidb,tiflash";
把tikv禁止掉,再次explain 这个SQL都走tiflash了,手动执行时间从8s将到了500ms
- 既然走tiflash查询效率提升比较明显,那就让开发人员改下SQL,让这个查询强制走tiflash,比较尴尬的是我们按官方文档把SQL调整了还是不能走tiflash,强制走tiflash的SQL如下:
SELECT /*+ read_from_storage(tiflash[a,b]) */ a.task_id, a.task_type, a.task_type_name, a.task_owner, a.task_owner_name, a.task_title, a.start_time, b.customer_account_id, b.customer_name, c.owner, c.owner_name, c.intention_level, c.oppty_status_category, c.oppty_status, c.last_follow_up_time, c.last_follow_up_status, c.last_follow_up_content, c.priority FROM saos_base.tsa_task a LEFT JOIN saos_tur.tur_customer b ON a.customer_Account = b.customer_account_id LEFT JOIN ( SELECT oo.customer_account_id, oo.OWNER, oo.owner_name, oo.intention_level, oo.oppty_status_category, oo.oppty_status, oo.last_follow_up_time, oo.last_follow_up_status, oo.last_follow_up_content, oo.priority FROM saos_tur.rb_opportunity oo WHERE oo.oppty_status_category = 10 ) c ON b.customer_account_id = c.customer_account_id WHERE a.biz_app = 'RB' AND a.task_owner = '5123' AND a.task_type = 7134 AND a.task_state = 11 AND a.start_time < '2020-09-10 00:00:00' ORDER BY c.priority desc, a.start_time ASC LIMIT 0, 10;
- 上面问题反馈给群里PingCAP的技术支持小伙伴们然后继续排查,收集了表的健康度、表的统计信息、统计信息直方图,信息如下:
#表健康度
MySQL [(none)]> SHOW STATS_HEALTHY where Table_name in ( 'tsa_task','tur_customer','rb_opportunity');
+-----------+----------------+----------------+---------+
| Db_name | Table_name | Partition_name | Healthy |
+-----------+----------------+----------------+---------+
| saos_base | tsa_task | | 51 |
| saos_tur | tur_customer | | 80 |
| saos_tur | rb_opportunity | | 72 |
+-----------+----------------+----------------+---------+
#表统计信息
MySQL [(none)]> SHOW stats_meta where Table_name in ( 'tsa_task','tur_customer','rb_opportunity');
+-----------+----------------+----------------+---------------------+--------------+-----------+
| Db_name | Table_name | Partition_name | Update_time | Modify_count | Row_count |
+-----------+----------------+----------------+---------------------+--------------+-----------+
| saos_tur | tur_customer | | 2020-09-11 14:25:23 | 157738 | 795808 |
| saos_tur | rb_opportunity | | 2020-09-11 14:27:06 | 144563 | 517778 |
| saos_base | tsa_task | | 2020-09-11 14:27:51 | 1112555 | 2311207 |
+-----------+----------------+----------------+---------------------+--------------+-----------+
#统计信息直方图
MySQL [(none)]> SHOW STATS_HISTOGRAMS where Table_name in ( 'tsa_task','tur_customer','rb_opportunity');
+-----------+----------------+----------------+-------------------------------+----------+---------------------+----------------+------------+---------------------+-------------+
| Db_name | Table_name | Partition_name | Column_name | Is_index | Update_time | Distinct_count | Null_count | Avg_col_size | Correlation |
+-----------+----------------+----------------+-------------------------------+----------+---------------------+----------------+------------+---------------------+-------------+
| saos_base | tsa_task | | biz_code | 0 | 2020-08-13 08:51:18 | 1510400 | 0 | 22.83 | 0.198177 |
| saos_base | tsa_task | | task_type | 0 | 2020-08-13 08:51:18 | 93 | 0 | 7.37 | -0.124592 |
| saos_base | tsa_task | | create_time | 0 | 2020-08-13 08:51:21 | 1309184 | 0 |
+-----------+----------------+----------------+-------------------------------+----------+---------------------+----------------+------------+---------------------+-------------+
65 rows in set (0.00 sec)
- 通过上面信息发现这3个表的健康度都不高,官方建议健康度最好在80以上,所以我们先重新收集下表的统计信息,命令如下:
analyze table tsa_task;
analyze table tur_customer;
analyze table rb_opportunity;
- 然后再次执行发现还是无法走tiflash,但是走tikv变快了,通过执行计划可以看到查询优化器用到了TopN和IndexMergeJoin这两个优化算子,执行计划如下:
explain analyze SELECT a.task_id, a.task_type, a.task_type_name, a.task_owner, a.task_owner_name, a.task_title, a.start_time, b.customer_account_id, b.customer_name, c.owner, c.owner_name, c.intention_level, c.oppty_status_category, c.oppty_status, c.last_follow_up_time, c.last_follow_up_status, c.last_follow_up_content, c.priority
FROM saos_base.tsa_task a
LEFT JOIN saos_tur.tur_customer b ON a.customer_Account = b.customer_account_id
LEFT JOIN ( SELECT oo.customer_account_id, oo.OWNER, oo.owner_name, oo.intention_level, oo.oppty_status_category, oo.oppty_status, oo.last_follow_up_time, oo.last_follow_up_status, oo.last_follow_up_content, oo.priority FROM saos_tur.rb_opportunity oo WHERE oo.oppty_status_category = 10 ) c
ON b.customer_account_id = c.customer_account_id
WHERE a.biz_app = 'RB' AND a.task_owner = '5123' AND a.task_type = 7134 AND a.task_state = 11 AND a.start_time < '2020-09-10 00:00:00'
ORDER BY a.start_time ASC LIMIT 0, 10;
+------------------------------------------------+----------+---------+-----------+--------------------------------------------------------------+-------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+------+
| id | estRows | actRows | task | access object | execution info | operator info | memory | disk |
+------------------------------------------------+----------+---------+-----------+--------------------------------------------------------------+-------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+------+
| Projection_15 | 10.00 | 0 | root | | time:159.603696ms, loops:1, Concurrency:OFF | saos_base.tsa_task.task_id, saos_base.tsa_task.task_type, saos_base.tsa_task.task_type_name, saos_base.tsa_task.task_owner, saos_base.tsa_task.task_owner_name, saos_base.tsa_task.task_title, saos_base.tsa_task.start_time, saos_tur.tur_customer.customer_account_id, saos_tur.tur_customer.customer_name, saos_tur.rb_opportunity.owner, saos_tur.rb_opportunity.owner_name, saos_tur.rb_opportunity.intention_level, saos_tur.rb_opportunity.oppty_status_category, saos_tur.rb_opportunity.oppty_status, saos_tur.rb_opportunity.last_follow_up_time, saos_tur.rb_opportunity.last_follow_up_status, saos_tur.rb_opportunity.last_follow_up_content, saos_tur.rb_opportunity.priority | 3.7265625 KB | N/A |
| └─Limit_21 | 10.00 | 0 | root | | time:159.601289ms, loops:1 | offset:0, count:10 | N/A | N/A |
| └─IndexJoin_160 | 10.00 | 0 | root | | time:159.584142ms, loops:1, Concurrency:4 | left outer join, inner:IndexLookUp_159, outer key:saos_tur.tur_customer.customer_account_id, inner key:saos_tur.rb_opportunity.customer_account_id | 0 Bytes | N/A |
| ├─Limit_174(Build) | 10.00 | 0 | root | | time:159.685046ms, loops:1 | offset:0, count:10 | N/A | N/A |
| │ └─IndexJoin_130 | 10.00 | 0 | root | | time:159.681804ms, loops:1, Concurrency:4 | left outer join, inner:IndexLookUp_129, outer key:saos_base.tsa_task.customer_account, inner key:saos_tur.tur_customer.customer_account_id | 0 Bytes | N/A |
| │ ├─TopN_139(Build) | 10.00 | 0 | root | | time:159.753396ms, loops:1 | saos_base.tsa_task.start_time:asc, offset:0, count:10 | 0 Bytes | N/A |
| │ │ └─IndexLookUp_147 | 10.00 | 0 | root | | time:159.720228ms, loops:2, rpc num: 1, rpc time:14.658549ms, proc keys:19366 | | 369.71875 KB | N/A |
| │ │ ├─IndexRangeScan_70(Build) | 19354.27 | 19366 | cop[tikv] | table:a, index:idx_owner(task_owner) | time:11ms, loops:23 | range:["5123","5123"], keep order:false | N/A | N/A |
| │ │ └─TopN_146(Probe) | 10.00 | 0 | cop[tikv] | | proc max:73ms, min:0s, p80:64ms, p95:73ms, iters:31, tasks:17 | saos_base.tsa_task.start_time:asc, offset:0, count:10 | N/A | N/A |
| │ │ └─Selection_72 | 672.94 | 0 | cop[tikv] | | proc max:72ms, min:0s, p80:64ms, p95:72ms, iters:31, tasks:17 | eq(saos_base.tsa_task.biz_app, "RB"), eq(saos_base.tsa_task.task_state, 11), eq(saos_base.tsa_task.task_type, 7134), lt(saos_base.tsa_task.start_time, 2020-09-10 00:00:00.000000) | N/A | N/A |
| │ │ └─TableRowIDScan_71 | 19354.27 | 19366 | cop[tikv] | table:a | proc max:72ms, min:0s, p80:64ms, p95:72ms, iters:31, tasks:17 | keep order:false | N/A | N/A |
| │ └─IndexLookUp_129(Probe) | 1.00 | | root | | time:0ns, loops:0 | | N/A | N/A |
| │ ├─IndexRangeScan_127(Build) | 1.00 | | cop[tikv] | table:b, index:uniq_customeraccountid(customer_account_id) | time:0ns, loops:0 | range: decided by [eq(saos_tur.tur_customer.customer_account_id, saos_base.tsa_task.customer_account)], keep order:false | N/A | N/A |
| │ └─TableRowIDScan_128(Probe) | 1.00 | | cop[tikv] | table:b | time:0ns, loops:0 | keep order:false | N/A | N/A |
| └─IndexLookUp_159(Probe) | 1.04 | | root | | time:0ns, loops:0 | | N/A | N/A |
| ├─IndexRangeScan_156(Build) | 2.54 | | cop[tikv] | table:oo, index:idx_customer_account_id(customer_account_id) | time:0ns, loops:0 | range: decided by [eq(saos_tur.rb_opportunity.customer_account_id, saos_tur.tur_customer.customer_account_id)], keep order:false | N/A | N/A |
| └─Selection_158(Probe) | 1.04 | | cop[tikv] | | time:0ns, loops:0 | eq(saos_tur.rb_opportunity.oppty_status_category, 10) | N/A | N/A |
| └─TableRowIDScan_157 | 2.54 | | cop[tikv] | table:oo | time:0ns, loops:0 | keep order:false | N/A | N/A |
+------------------------------------------------+----------+---------+-----------+--------------------------------------------------------------+-------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+------+
18 rows in set, 1 warning (0.17 sec)
- Tidb虽然高度兼容Mysql,但是有些用法和Mysql还是不一样的,包括SQL优化,下面是SQL优化相关的原理文章,要想把Tidb用好,SQL的优化这块还需要从原理方面深入学习,SQL优化原理相关
写在最后
特别感谢PingCAP的小伙伴们的大力支持,每次遇到问题会帮忙积极主动跟进处理,再次感谢李仲舒、王贤净、苏丹、高振娇。