insert带来的TiDB集群hang死血案

一、背景

业务在周四 6 月18 日中午12 点 30 左右，开始将数据往之前新建物理分表导数据，指定了shard_row_id_bit 以及 pre_split_region来避免热点。几天后的凌晨，集群出现qps骤降和 duration 耗时异常。
集群配置

集群版本：v3.0.5
集群配置：普通SSD磁盘，128G内存，40 核cpu
tidb21 TiDB/PD/pump/prometheus/grafana/CCS
tidb22 TiDB/PD/pump
tidb23 TiDB/PD/pump
tidb01 TiKV
tidb02 TiKV
tidb03 TiKV
tidb04 TiKV
tidb05 TiKV
tidb06 TiKV
tidb07 TiKV
tidb08 TiKV
tidb09 TiKV
tidb10 TiKV
tidb11 TiKV
tidb12 TiKV
tidb13 TiKV
tidb14 TiKV
tidb15 TiKV
tidb16 TiKV
tidb17 TiKV
tidb18 TiKV
tidb19 TiKV
tidb20 TiKV
wtidb29 TiKV
wtidb30 TiKV

二、现象

分析如下：
在 6 月 21 日凌晨 01：24分左右，qps大幅骤降，durtation大幅增高，报警包含如下三类：

集群region数量和大小上涨缓慢，且此时没有大规模region balance：

pd监控显示各节点磁盘消耗停滞

tidb --> kv error 监控面板下面出现下述的告警信息：

上面有反馈server is busy，这可能是raftsotore线程卡了，导致消息没有及时处理，也可能是写入的事物过多，TiKV进行写入流控，可能是查询量过大，产生了堆积。

通过监控我们定位到server is busy的节点，是ip为218结尾的这台机器出现了问题：

218这个节点的pending commands不正常

其scheduler worker cpu也远高于其他节点
日志大概看了下未见异常，将其重启。

重启问题节点218后，能看到pending command和worker cpu转移了：

对TiKV的日志进行过滤

cat 218.log| grep conflict | awk -F 'tableID=' '{print $2}' 都过滤下写冲突的 tableid 
select * from information_schema.tables where tidb_table_id='93615';
 
["commit failed"] [conn=250060] ["finished txn"="Txn{state=invalid}"] [error="[kv:9007]Write conflict, txnStartTS=417517629610917903, conflictStartTS=417517692315762921, conflictCommitTS=417517692315762921, key={tableID=93643, indexID=1, indexValues={f73732b3d9f35742e39e6bba8e401192, www.baidu.com, }} primary={tableID=93643, indexID=1, indexValues={f73732b3d9f35742e39e6bba8e401192, www.baidu.com, }} [try again later]"]
 
 [kv:9007]Write conflict, txnStartTS=417517629610917903, conflictStartTS=417517692315762921, conflictCommitTS=417517692315762921, key={tableID=93643, indexID=1, indexValues={f73732b3d9f35742e39e6bba8e401192, www.baidu.com, }} primary={tableID=93643, indexID=1, indexValues={f73732b3d9f35742e39e6bba8e401192, www.baidu.com, }} [try again later]"] [errorVerbose="[kv:9007]Write conflict, txnStartTS=417517629610917903, conflictStartTS=417517692315762921, conflictCommitTS=417517692315762921, key={tableID=93643, indexID=1, indexValues={f73732b3d9f35742e39e6bba8e401192, www.baidu.com, }} primary={tableID=93643, indexID=1, indexValues={f73732b3d9f35742e39e6bba8e401192, www.baidu.com, }} [try again later]
10 分钟的日志，这个表的冲突出现了 1147 次

[kv:9007]Write conflict：表示出现了写写冲突
txnStartTS=417517629610917903：表示当前事务的 start_ts 时间戳，可以通过 pd-ctl 工具将时间戳转换为具体时间
conflictStartTS=417517692315762921：表示冲突事务的 start_ts 时间戳，可以通过 pd-ctl 工具将时间戳转换为具体时间
conflictCommitTS=417517692315762921：表示冲突事务的 commit_ts 时间戳，可以通过 pd-ctl 工具将时间戳转换为具体时间
key={tableID=93643, indexID=1, indexValues={string, }}：表示当前事务中冲突的数据，tableID 表示发生冲突的表的 ID，indexID 表示是索引数据发生了冲突。如果是数据发生了冲突，会打印 handle=x 表示对应哪行数据发生了冲突，indexValues 表示发生冲突的索引数据
primary={tableID=93643, indexID=1, indexValues={string, }}：表示当前事务中的 Primary Key 信息

通过pd工具可以换算出时间
./pd-ctl -u https://{PDIP}:2379 tso {TIMESTAMP}

通过 tableID 查找具体的表名：
curl http://{TiDBIP}:10080/db-table/{tableID}

通过 indexID 查找具体的索引名：
SELECT * FROM INFORMATION_SCHEMA.TIDB_INDEXES WHERE TABLE_SCHEMA='{table_name}' AND TABLE_NAME='{table_name}' AND INDEX_ID={indexID};

版本差异
在 v3.0.8 版本之前，TiDB 默认采用乐观事务模型，在事务执行过程中并不会做冲突检测，而是在事务最终 COMMIT 提交时触发两阶段提交，并检测是否存在写写冲突。当出现写写冲突，并且开启了事务重试机制，则 TiDB 会在限定次数内进行重试，最终重试成功或者达到重试次数上限后，会给客户端返回结果。因此，如果 TiDB 集群中存在大量的写写冲突情况，容易导致集群的 Duration 比较高。

另外在 v3.0.8 及之后版本默认使用悲观事务模式，从而避免在事务提交的时候因为冲突而导致失败，无需修改应用程序。悲观事务模式下会在每个 DML 语句执行的时候，加上悲观锁，用于防止其他事务修改相同 Key，从而保证在最后提交的 prewrite 阶段不会出现写写冲突的情况。

出现写写冲突的原因
TiDB 中使用 Percolator 事务模型来实现 TiDB 中的事务。Percolator 总体上就是一个二阶段提交的实现。具体的二阶段提交过程可参考乐观事务文档。
当客户端发起 COMMIT 请求的时候，TiDB 开始两阶段提交：
1. TiDB 从所有要写入的 Key 中选择一个作为当前事务的 Primary Key
2. TiDB 向所有的本次提交涉及到的 TiKV 发起 prewrite 请求，TiKV 判断是否所有 Key 都可以 prewrite 成功
3. TiDB 收到所有 Key 都 prewrite 成功的消息
4. TiDB 向 PD 请求 commit_ts
5. TiDB 向 Primary Key 发起第二阶段提交。Primary Key 所在的 TiKV 收到 commit 操作后，检查数据合法性，清理 prewrite 阶段留下的锁
6. TiDB 收到两阶段提交成功的信息
写写冲突发生在 prewrite 阶段，当发现有其他的事务在写当前 Key (data.commit_ts > txn.start_ts)，则会发生写写冲突。
TiDB 会根据 tidb_disable_txn_auto_retry 和 tidb_retry_limit 参数设置的情况决定是否进行重试，如果设置了不重试，或者重试次数达到上限后还是没有 prewrite 成功，则向 TiDB 返回 Write Conflict 错误

三、排查步骤

监控
从监控信息，初步排查是因为 tikv 比较繁忙，出现了 server is busy 的情况。导出了 21 日异常时间段的 tidb 以及 tikv-details 的监控信息，开始进行排查。
TiDB
Query Details
写写冲突在 21 日 01:15 出现了一个陡升的现象：

KV Duration
KV duration 耗时都集中在 store 16 ，store 16 IP 地址为 218，并且结合 KV Error 的监控，可能从 01:15 左右开始218 就出现响应超时的报错：
TiKV - Details
Erros
通过下述监控面板，server is busy直接能定位到218机器，在 raftstore error 监控面板中，01:15 左右 218 出现了大量的 not leader,这是因为region 因为冲突特别繁忙，没有办法响应请求了

Grpc
通过 grpc message count 以及 message duration 两个监控项看下，从 01:15 左右开始，整个tikv-details 处理消息的数量大幅降低，并且处理 prewrite 的耗时，翻倍上涨，整个写入速度大幅降低。

Thread CPU
raftstore cpu 以及 async apply cpu 利用率同样在 01:15 开始利用率大幅下降，与 grcp 相关监控吻合，集群此时，处理写入的速度确实降低了。

scheduler worker cpu 在 01:15 也降低了，但是 218的 scheduler 的 cpu 异常高于其他 tikv ，并且出现上涨是在 00:45 左右。这是因为该节点冲突严重，scheduler在反复调度，处理pending task。

因为上面的 scheduler worker cpu 的异常现象，排查热点相关的问题，发现在出现问题时，
各个 tikv 节点接收的消息较为均衡，热点现象排除。

Storage
查看 async write duration ，发现从 01:15 开始 async write duration 非常低，故不再查看propose wait ，append，apply wait 以及 apply 耗时，整个写入慢的问题不是发生在这个阶段，很可能是发生在 scheduler 阶段。rocksdb-kv 相关的监控也验证了这一点。

RocksDB - KV
查看 rocksdb-kv 中 wal 以及 write duration 自 01:15 开始耗时下降，与 storage 处的猜测吻合，不是 raftstore ，apply 以及 rocksdb 慢，可能是慢在了 scheduler 。

Scheduler - prewrite
接着查看了 scheduler - prewrite 监控面板，出现 command 以及 latch wait duration 逐渐上涨
的情况，与 grpc 监控 prewrite 耗时增长相匹配。此时，基本确定，写入慢是因为 scheduler -
prewrite 耗时太长导致。

分析小结：

自 21 日 01:15 开始，整个集群的写入速度骤降。
排除了 scheduler cpu 异常导致热点的可能性。
排除了 raftstore ，apply 以及 rocksdb-kv 写入慢的可能性。
初步怀疑，写入慢在了 scheduler 部分。
Scheduler
查看 scheduler 监控，发现 async request error 集中在 not leader，并且 scheduler pending
commands 在 01:15 陡升，并且主要集中在 218。
分析小结：
除调整写入的表外，业务以及 qps 没有调整，所以不太可能是
scheduler-pending-write-threshold 超过 100MB 导致。
scheduler 以及 scheduler - prewrite 监控面板中 pending commands 的数量以及
commands 、latch wait duration 的耗时基本确定写入慢是慢在了 scheduler。
pending commands 主要集中在 218，极大可能是写写冲突比较严重，并且都集
中在 218 ，导致 latch 耗时增长。
处理动作：
● 决定将 218 重启，将部分 leader 切换至其他 tikv 节点，看下是否可以缓解这个现
象
处理反馈：
把218重启后，现在就换了一个节点开始pending commands上涨，此时为 138。可能写
写冲突相当严重，并且集中在某几个 region：

TiKV 日志
因为重启 218 后，scheduler pending 会转移，故过滤了 218 的 log ，发现大量的出
现如下信息, 并且都集中在某几个 region 上 7365862 ，7257865：

[2020/06/21 09:40:43.011 +08:00] [INFO] [process.rs:188] ["get snapshot failed"]
[err="Request(message: \""peer is not leader for region 7365862, leader may Some(id:
7365863 store_id: 16)\"" not_leader { region_id: 7365862 leader { id: 7365863 store_id: 16
} })"] [cid=50406655291]
[2020/06/21 09:40:43.023 +08:00] [INFO] [process.rs:188] ["get snapshot failed"]
[err="Request(message: \""peer is not leader for region 7257865, leader may Some(id:
7257868 store_id: 16)\"" not_leader { region_id: 7257865 leader { id: 7257868 store_id: 16
} })"] [cid=50406655292]

查看 region 7257865 信息如下：

TiDB 日志
查看 TiDB 的日志，出现 218 链接无效，server is timeout , 并且 region cache 需要 refill
的现象,可能是 218 tikv 的 scheduler 太繁忙，并且因为写写冲突严重，都集中在某几个
region，无法响应。这个与 tikv 监控以及日志显示的结论基本一致：

[2020/06/21 09:40:47.001 +08:00] [INFO] [region_cache.go:894] ["mark store's regions
need be refill"] [store=xx.xx.xx.218:20160]
[2020/06/21 09:40:47.001 +08:00] [INFO] [region_cache.go:393] ["switch region peer to
next due to send request fail"] [current="region ID: 7901120, meta: id:7901120
start_key:\""t\"\"200\"\"000\"\"000\"\"000\"\"000\"\"001m\"\"313_i\"\"200\"\"000\"\"000\"\"000\"\"000\"\"000\"\"000\"\"00
1\"\"001f4ee30f3\"\"377b1663596\"\"377040a2655\"\"377c7f7a44f\"\"377\"\"000\"\"000\"\"000\"\"000\"\"000\"\"
000\"\"000\"\"000\"\"367\"\"001autodisc\"\"377over.bil\"\"377lielourd\"\"377.net\"\"000\"\"000\"\"000\"\"000\"\"37
3\""
end_key:\""t\"\"200\"\"000\"\"000\"\"000\"\"000\"\"001m\"\"313_i\"\"200\"\"000\"\"000\"\"000\"\"000\"\"000\"\"000\"\"002
\"" region_epoch:<conf_ver:623 version:60959 > peers:<id:7903456 store_id:5 >

因为定位是写写冲突导致，那么过滤了 21 日 10 分钟 tidb 的监控，发现下表同一个 key 的冲
突出现了 1100 多次：

冲突日志如下：

["commit failed"] [conn=250060] ["finished txn"="Txn{state=invalid}"] [error="[kv:9007]Write
conflict, txnStartTS=417517629610917903, conflictStartTS=417517692315762921,
conflictCommitTS=417517692315762921, key={tableID=93643, indexID=1,
indexValues={f73732b3d9f35742e39e6bba8e401192, www.baidu.com, }}
primary={tableID=93643, indexID=1,
indexValues={f73732b3d9f35742e39e6bba8e401192, www.baidu.com, }} [try again later]"]
[kv:9007]Write conflict, txnStartTS=417517629610917903,
conflictStartTS=417517692315762921, conflictCommitTS=417517692315762921,
key={tableID=93643, indexID=1, indexValues={f73732b3d9f35742e39e6bba8e401192,
www.baidu.com, }} primary={tableID=93643, indexID=1,
indexValues={f73732b3d9f35742e39e6bba8e401192, www.baidu.com, }} [try again later]"]
[errorVerbose="[kv:9007]Write conflict, txnStartTS=417517629610917903,
conflictStartTS=417517692315762921, conflictCommitTS=417517692315762921,
key={tableID=93643, indexID=1, indexValues={f73732b3d9f35742e39e6bba8e401192,
www.baidu.com, }} primary={tableID=93643, indexID=1,
indexValues={f73732b3d9f35742e39e6bba8e401192, www.baidu.com, }} [try again later]
表索引信息如下，UNIQUE KEY `idx_cert_md5_host` (`cert_md5`,`host`) 为 uk：

四、总结回顾

确定是因为写写冲突导致的 tikv scheduler latch 等待，并且集中在某几个 key 和region ，scheduler 的 slot 有限排队等待严重，进而出现集中请求的 region 所在的tikv 很繁忙，出现 server is timeout 的报错。
处理办法：
因为写写冲突都是集中在某个 key 和某个 region 上，并且冲突比较严重，尝试开启 tidb txn-local-latches，让 latch 等待集中在 tidb ，缓解 tikv 的压力。
处理结果：
调整参数后，并没有缓解。
业务逻辑：
与开发沟通后，了解到，业务端的唯一键检查主要靠数据库的 uk 来保证，应用端没有实现相关机制，当数据库报错后，业务再去做相应的报错处理。
因为这样的原因，出现某个 key 的冲突严重。但是这个逻辑在 19 年就已经有了，只是在 21 号爆发了。
可能是 21 号某种业务行为，导致某条数据，会频繁的出现，并进行 insert ，写写冲突爆发。
后续调整为，在满足业务逻辑的前提下，将 insert 语句改造为 insert ignore，当出现相同的数据时，报
1062，并返回 tidb。用户反馈，调整语句后，现象消失，qps 以及 duration 恢复正常：

感谢
感谢PingCap高振娇女神帮忙排查和定位故障，为360云平台TiDB集群运维提供了大力支持~也帮助和提升了我故障解决问题的能力，这篇文章分享给大家，通读这篇文章可以很好地掌握故障排查的思路，结合这个案例也可以让有类似潜在威胁的同学提前规避。这个案例后面还引发了另外两个案例，以后有时间也会一一总结分享给大家~

insert引发的TiDB hang死血案（案情一）

insert带来的TiDB集群hang死血案