一、背景
在企业中经常以业务保护等级合规的目的设计数据库逻辑备份机制,逻辑备份生成平面文件放置在 S3 兼容的存储上。业务有需求要查询历史时段的数据,需要将备份文件重新导入数据库进行查询,由于查询的时间段很可能相对比较模糊,会出现多次的重复操作。
Dumpling 是 TiDB 的逻辑备份工具,兼容 MySQL,具有产品轻量化、使用便捷的特点,能将生产数据备份成 CSV 平面文件,具备支持自定义分隔符等特性,并兼容 S3 存储。DuckDB 是轻量的嵌入式 AP 场景数据库,能支持以外部表方式读取 S3 上面的 CSV 文件。
本次尝试通过 Dumpling 和 DuckDB 的结合,能够将 Dumpling 的备份文件交付给 DuckDB,实现快速打开、快速查询的目的。
二、环境准备
角色 |
配置 |
---|---|
minio |
8C / 16GB |
duckdb |
8C / 16GB |
Duckdb 环境
https://github.com/duckdb/duckdb/releases/download/v1.1.3/duckdb_cli-linux-amd64.zip
# unzip duckdb_cli-linux-amd64.zip
Archive: duckdb_cli-linux-amd64.zip
inflating: duckdb
# vi init.sql
SET extension_directory = '/data/duckdb_ext';
.prompt 'duckdb> '
.timer on
# ./duckdb bak.duckdb --init init.sql
duckdb> SELECT current_database();
┌────────────────────┐
│ current_database() │
│ varchar │
├────────────────────┤
│ bak │
└────────────────────┘
Run Time (s): real 0.001 user 0.000429 sys 0.000501
duckdb> .exit
Minio 单机实验环境
注:
wget https://dl.min.io/server/minio/release/linux-amd64/minio
wget https://dl.min.io/client/mc/release/linux-amd64/mc
# MINIO_ROOT_USER=admin MINIO_ROOT_PASSWORD=password ./minio server /data/miniodata/ --console-address ":9001" &
[1] 9664
[root@vm172-16-201-85 data]# MinIO Object Storage Server
Copyright: 2015-2025 MinIO, Inc.
License: GNU AGPLv3 - https://www.gnu.org/licenses/agpl-3.0.html
Version: RELEASE.2024-12-18T13-15-44Z (go1.23.4 linux/amd64)
API: http://172.16.201.85:9000 http://172.17.0.1:9000 http://127.0.0.1:9000
RootUser: admin
RootPass: password
WebUI: http://172.16.201.85:9001 http://172.17.0.1:9001 http://127.0.0.1:9001
RootUser: admin
RootPass: password
CLI: https://min.io/docs/minio/linux/reference/minio-mc.html#quickstart
$ mc alias set 'myminio' 'http://172.16.201.85:9000' 'admin' 'password'
Docs: https://docs.min.io
WARN: Detected Linux kernel version older than 4.0 release, there are some known potential performance problems with this kernel version. MinIO recommends a minimum of 4.x linux kernel version for best performance
# s3cmd --no-ssl --access_key=admin --secret_key=password --host=172.16.201.85:9000 --host-bucket= mb s3://tidbbak
数据备份
mysql> select count(1) from users10m;
+----------+
| count(1) |
+----------+
| 10000000 |
+----------+
1 row in set (6 min 36.33 sec)
#export AWS_ACCESS_KEY_ID=admin
#export AWS_SECRET_ACCESS_KEY=password
#CURDATE=$(date +%Y%m%d%H%M%S)
#/root/.tiup/components/dumpling/v8.5.0/dumpling -B bakdb -u "${BAKUSER}" -P ${TIDBPORT} -h ${TIDBIP} -p "${BAKPW}" --filetype csv -t 8 -o "s3://${Bucket}/dumpling/${CURDATE}" --s3.endpoint="http://${Endpoint}" -r 200000 -F 256MiB
# s3cmd --no-ssl --access_key=admin --secret_key=password --host=172.16.201.85:9000 --host-bucket= get s3://tidbbak/dumpling/20250128161903/bakdb.users10m.0000000010000.csv
download: 's3://tidbbak/dumpling/20250128161903/bakdb.users10m.0000000010000.csv' -> './bakdb.users10m.0000000010000.csv' [1 of 1]
51980222 of 51980222 100% in 0s 113.40 MB/s done
# more bakdb.users10m.0000000010000.csv
"userid","name","phone","sex","birth","card_type","card_id","reg_date","reg_time","modify_date","unit_id","vip_level","corp_id","job_number","col00","col01","col02","col03","col04","col05","col06","col07","col08","col09","
col10","col11","col12","col13","col14","col15","col16","col17","col18","col19","balance1","balance2","balance3","modify_datetime"
"004804a7-e96e-4cf8-8622-891e0ae9d87d","LKQKM","10240894028","女","1979-10-29","身份证","588802761557381279","2001-05-12","02:41:58","2016-07-09","hDY","4","22100","130353","EDRROXGRCYJ","TENOFL","MDBPTICNPYQKYDTRHSGC","YP
UJVK","JSKWBZYHXR","IXBPDPDPEARSP","QDSLHMNZ","DJBRX","DYCCQDHJIMA","AFISVLDMXLQPZHVZZDQA","FYHYBEYQGYYDVIHG","LZ","NITVMZGXRTYKQFVPFTHB","NCFESYSN","WFSVITJTYPYBEOXYLXU","YZYYG","PZYQPLZ","GJIRATCSLPTFHCSVDH","STOBWVHFSQF
IDP","KPRVIGNNFIDCIBL",2071.61,9169.1201,10000.1234,"2025-01-28 10:42:56"
"004804c4-3494-47de-a7ca-f5ab7ba8e135","OKTICUOYOJNNCULUCGKS","15002789114","男","1988-02-13","身份证","970410241703370828","2000-11-11","09:46:21","2013-01-15","dgjhS","2","6900","107337","SINTVGVWCUQROU","U","LRHRMWTGCKN
","T","SADSQUTZD","OBJMWNAHSUERZJPMRG","TAPMJTRLQJIAYXSCAVUQ","VRJCSHSGUAOWAVBDST","OKOUTEUMWVHJYBTP","EYMJMMBWWJW","PALWUHU","NWKMJADTBYPJZ","OVZSQOMBUBXUT","XDFMQHJFIWXN","JFIA","FDD","XVQUOOODC","P","OVRLI","QKH",9722.4
2,5618.2924,10000.1234,"2025-01-28 10:42:56"
三、访问 S3 上 csv 备份
配置 S3 的访问密钥
安装插件和配置 s3 secret。
duckdb> INSTALL 'httpfs';
duckdb> SELECT * FROM duckdb_extensions();
┌──────────────────┬─────────┬───────────┬──────────────────────┬───────────────────────────────────────────────────────────────────────────────┬───────────────────┬───────────────────┬───────────────────┬────────────────┐
│ extension_name │ loaded │ installed │ install_path │ description │ aliases │ extension_version │ install_mode │ installed_from │
│ varchar │ boolean │ boolean │ varchar │ varchar │ varchar[] │ varchar │ varchar │ varchar │
├──────────────────┼─────────┼───────────┼──────────────────────┼───────────────────────────────────────────────────────────────────────────────┼───────────────────┼───────────────────┼───────────────────┼────────────────┤
│ arrow │ false │ false │ │ A zero-copy data integration between Apache Arrow and DuckDB │ [] │ │ │ │
│ autocomplete │ true │ true │ (BUILT-IN) │ Adds support for autocomplete in the shell │ [] │ │ STATICALLY_LINKED │ │
│ aws │ false │ false │ │ Provides features that depend on the AWS SDK │ [] │ │ │ │
│ azure │ false │ false │ │ Adds a filesystem abstraction for Azure blob storage to DuckDB │ [] │ │ │ │
│ delta │ false │ false │ │ Adds support for Delta Lake │ [] │ │ │ │
│ excel │ false │ false │ │ Adds support for Excel-like format strings │ [] │ │ │ │
│ fts │ true │ true │ (BUILT-IN) │ Adds support for Full-Text Search Indexes │ [] │ v1.1.3 │ STATICALLY_LINKED │ │
│ httpfs │ true │ true │ /data/duckdb_ext/v… │ Adds support for reading and writing files over a HTTP(S) connection │ [http, https, s3] │ v1.1.3 │ REPOSITORY │ core │
...
│ vss │ false │ false │ │ Adds indexing support to accelerate Vector Similarity Search │ [] │ │ │ │
├──────────────────┴─────────┴───────────┴──────────────────────┴───────────────────────────────────────────────────────────────────────────────┴───────────────────┴───────────────────┴───────────────────┴────────────────┤
│ 24 rows 9 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 0.005 user 0.000519 sys 0.001037
duckdb> CREATE PERSISTENT SECRET miniosecret (
TYPE S3,
KEY_ID 'admin',
SECRET 'password',
ENDPOINT '172.16.201.85:9000',
URL_STYLE 'path',
USE_SSL false
);
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ true │
└─────────┘
duckdb> SELECT * FROM duckdb_secrets();
┌─────────────┬─────────┬──────────┬────────────┬────────────┬──────────────────────┬──────────────────────────────────────────────────────────┐
│ name │ type │ provider │ persistent │ storage │ scope │ secret_string │
│ varchar │ varchar │ varchar │ boolean │ varchar │ varchar[] │ varchar │
├─────────────┼─────────┼──────────┼────────────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────┤
│ miniosecret │ s3 │ config │ true │ local_file │ [s3://, s3n://, s3… │ name=miniosecret;type=s3;provider=config;serializable=… │
└─────────────┴─────────┴──────────┴────────────┴────────────┴──────────────────────┴──────────────────────────────────────────────────────────┘
注:PERSISTENT SECRET 存储在 ${HOME}/.duckdb/stored_secrets 目录下。
Duckdb 访问 S3 的 CSV 备份文件
通过 Duckdb 的 S3 expansion 结合 read_csv 访问数据。
duckdb> SELECT count(*) FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true);
100% ▕████████████████████████████████████████████████████████████▏
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 10000000 │
└──────────────┘
Run Time (s): real 18.073 user 17.706364 sys 6.432855
duckdb> SELECT * FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true) limit 1;
┌──────────────────────┬─────────┬─────────────┬─────────┬────────────┬───────────┬────────────────────┬───┬────────────────────┬────────────────┬─────────────────┬──────────┬───────────┬────────────┬─────────────────────┐
│ userid │ name │ phone │ sex │ birth │ card_type │ card_id │ … │ col17 │ col18 │ col19 │ balance1 │ balance2 │ balance3 │ modify_datetime │
│ varchar │ varchar │ int64 │ varchar │ date │ varchar │ varchar │ │ varchar │ varchar │ varchar │ varchar │ varchar │ double │ timestamp │
├──────────────────────┼─────────┼─────────────┼─────────┼────────────┼───────────┼────────────────────┼───┼────────────────────┼────────────────┼─────────────────┼──────────┼───────────┼────────────┼─────────────────────┤
│ 004804a7-e96e-4cf8… │ LKQKM │ 10240894028 │ 女 │ 1979-10-29 │ 身份证 │ 588802761557381279 │ … │ GJIRATCSLPTFHCSVDH │ STOBWVHFSQFIDP │ KPRVIGNNFIDCIBL │ 2071.61 │ 9169.1201 │ 10000.1234 │ 2025-01-28 10:42:56 │
├──────────────────────┴─────────┴─────────────┴─────────┴────────────┴───────────┴────────────────────┴───┴────────────────────┴────────────────┴─────────────────┴──────────┴───────────┴────────────┴─────────────────────┤
│ 1 rows 38 columns (14 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 3.981 user 6.215305 sys 1.525878
duckdb> .mode line
duckdb> SELECT * FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true) limit 1;
userid = 004804a7-e96e-4cf8-8622-891e0ae9d87d
name = LKQKM
phone = 10240894028
sex = 女
birth = 1979-10-29
card_type = 身份证
card_id = 588802761557381279
reg_date = 2001-05-12
reg_time = 02:41:58
modify_date = 2016-07-09
unit_id = hDY
vip_level = 4
corp_id = 22100
job_number = 130353
col00 = EDRROXGRCYJ
col01 = TENOFL
col02 = MDBPTICNPYQKYDTRHSGC
col03 = YPUJVK
col04 = JSKWBZYHXR
col05 = IXBPDPDPEARSP
col06 = QDSLHMNZ
col07 = DJBRX
col08 = DYCCQDHJIMA
col09 = AFISVLDMXLQPZHVZZDQA
col10 = FYHYBEYQGYYDVIHG
col11 = LZ
col12 = NITVMZGXRTYKQFVPFTHB
col13 = NCFESYSN
col14 = WFSVITJTYPYBEOXYLXU
col15 = YZYYG
col16 = PZYQPLZ
col17 = GJIRATCSLPTFHCSVDH
col18 = STOBWVHFSQFIDP
col19 = KPRVIGNNFIDCIBL
balance1 = 2071.61
balance2 = 9169.1201
balance3 = 10000.1234
modify_datetime = 2025-01-28 10:42:56
Run Time (s): real 2.791 user 6.307715 sys 1.495764
duckdb> .mode duckbox
注:.mode line 类似于 MySQL > 的 \G 结束符的输出,.mode table 类似于 MySQL > 的默认输出,.mode duckbox 是 Duckdb 的默认输出,自适应屏幕的简略输出。
标准备份数据导入 Duckdb
默认情况下,使用 auto_detect 访问数据不一定能使用正确的字段格式,先使用 sniff_csv 生成配置模板。
duckdb> .mode line
duckdb> SELECT Prompt FROM sniff_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.0000000010000.csv', sample_size = 10000);
Prompt = FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.0000000010000.csv', auto_detect=false, delim=',', quote='"', escape='"', new_line='\r\n', skip=0, comment='', header=true, columns={'userid': 'VARCHAR', 'name': 'VARCHAR', 'phone': 'BIGINT', 'sex': 'VARCHAR', 'birth': 'DATE', 'card_type': 'VARCHAR', 'card_id': 'VARCHAR', 'reg_date': 'DATE', 'reg_time': 'TIME', 'modify_date': 'DATE', 'unit_id': 'VARCHAR', 'vip_level': 'BIGINT', 'corp_id': 'BIGINT', 'job_number': 'BIGINT', 'col00': 'VARCHAR', 'col01': 'VARCHAR', 'col02': 'VARCHAR', 'col03': 'VARCHAR', 'col04': 'VARCHAR', 'col05': 'VARCHAR', 'col06': 'VARCHAR', 'col07': 'VARCHAR', 'col08': 'VARCHAR', 'col09': 'VARCHAR', 'col10': 'VARCHAR', 'col11': 'VARCHAR', 'col12': 'VARCHAR', 'col13': 'VARCHAR', 'col14': 'VARCHAR', 'col15': 'VARCHAR', 'col16': 'VARCHAR', 'col17': 'VARCHAR', 'col18': 'VARCHAR', 'col19': 'VARCHAR', 'balance1': 'DOUBLE', 'balance2': 'DOUBLE', 'balance3': 'DOUBLE', 'modify_datetime': 'TIMESTAMP'}, dateformat='%Y-%m-%d', sample_size=10000);
Run Time (s): real 0.174 user 0.277318 sys 0.060880
使用 read_csv 的 auto_detect 的自适应失败或 sniff_csv 的 sample_size 内的数据没有对应的格式,最终会导致缺少配置项的导入过程仍有可能会失败,需要手工调整。
需要调整配置模板后导入数据,也可以使用备份时对应的 schema 文件内关于字段的描述。
duckdb> SELECT * FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.0000000120000.csv');
Run Time (s): real 1.472 user 3.583090 sys 0.183995
Conversion Error: CSV Error on Line: 154129
Original Line:
"04e471fa-0a6b-4181-b2a6-e4bc53af5fac","UVJXFMBEC","12308418577","女","2003-12-17","身份证","520850486289953447","2009-09-13",\N,"2012-04-30","8","0","4500","131043","EXDBHQFLZMAJZH","RVWDBE","UBLXVMPIC","CIHYSHCEQZZFBQZ","HPOARFSFFUQSPPN","PQKLEZ","DJFPVM","RBAKOUZZTN","V","PMVYO","V","BCIMNBFA","UOCMGLHVWU","JCUQJYT","KRFFJNMDIYCUBAJMFYK","OKLIMZU","VNSMFYDLH","NWJJLQZP","GKVL","UDUVEYSEBVZMQRKTLF",1606.15,9866.7268,10000.1234,"2025-01-27 09:39:44"
Error when converting column "reg_time". Could not convert string "\N" to 'TIME'
Column reg_time is being converted as type TIME
This type was auto-detected from the CSV file.
Dumpling 默认备份 CSV 时使用 '\N',需要添加 nullstr='\N' 配置项。修改后,再导入。
duckdb> SELECT *
FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true, delim=',', quote='"', new_line='\r\n', skip=0, comment='', nullstr='\N', header=true,
columns = {
'userid': 'VARCHAR',
'name': 'VARCHAR',
'phone': 'BIGINT',
'sex': 'VARCHAR',
'birth': 'DATE',
'card_type': 'VARCHAR',
'card_id': 'VARCHAR',
'reg_date': 'DATE',
'reg_time': 'TIME',
'modify_date': 'DATE',
'unit_id': 'VARCHAR',
'vip_level': 'BIGINT',
'corp_id': 'BIGINT',
'job_number': 'BIGINT',
'col00': 'VARCHAR',
'col01': 'VARCHAR',
'col02': 'VARCHAR',
'col03': 'VARCHAR',
'col04': 'VARCHAR',
'col05': 'VARCHAR',
'col06': 'VARCHAR',
'col07': 'VARCHAR',
'col08': 'VARCHAR',
'col09': 'VARCHAR',
'col10': 'VARCHAR',
'col11': 'VARCHAR',
'col12': 'VARCHAR',
'col13': 'VARCHAR',
'col14': 'VARCHAR',
'col15': 'VARCHAR',
'col16': 'VARCHAR',
'col17': 'VARCHAR',
'col18': 'VARCHAR',
'col19': 'VARCHAR',
'balance1': 'DOUBLE',
'balance2': 'DOUBLE',
'balance3': 'DOUBLE',
'modify_datetime': 'TIMESTAMP'
});
-- 输出略。
duckdb> create table users10m_ctas as SELECT * FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true, delim=',', quote='"', new_line='\r\n', skip=0, comment='',nullstr='\N', header=true,
columns = {
'userid': 'VARCHAR',
'name': 'VARCHAR',
'phone': 'BIGINT',
'sex': 'VARCHAR',
'birth': 'DATE',
'card_type': 'VARCHAR',
'card_id': 'VARCHAR',
'reg_date': 'DATE',
'reg_time': 'TIME',
'modify_date': 'DATE',
'unit_id': 'VARCHAR',
'vip_level': 'BIGINT',
'corp_id': 'BIGINT',
'job_number': 'BIGINT',
'col00': 'VARCHAR',
'col01': 'VARCHAR',
'col02': 'VARCHAR',
'col03': 'VARCHAR',
'col04': 'VARCHAR',
'col05': 'VARCHAR',
'col06': 'VARCHAR',
'col07': 'VARCHAR',
'col08': 'VARCHAR',
'col09': 'VARCHAR',
'col10': 'VARCHAR',
'col11': 'VARCHAR',
'col12': 'VARCHAR',
'col13': 'VARCHAR',
'col14': 'VARCHAR',
'col15': 'VARCHAR',
'col16': 'VARCHAR',
'col17': 'VARCHAR',
'col18': 'VARCHAR',
'col19': 'VARCHAR',
'balance1': 'DOUBLE',
'balance2': 'DOUBLE',
'balance3': 'DOUBLE',
'modify_datetime': 'TIMESTAMP'
});
100% ▕████████████████████████████████████████████████████████████▏
Run Time (s): real 33.892 user 167.084411 sys 14.806282
# s3cmd --no-ssl --access_key=admin --secret_key=password --host=172.16.201.85:9000 --host-bucket= du s3://tidbbak/dumpling/20250128161903/
4913936499 42 objects s3://tidbbak/dumpling/20250128161903/
# ls -lh /data/bak.duckdb
-rw-r--r-- 1 root root 2.9G Jan 28 17:12 /data/bak.duckdb
查看 CTAS 的表结构和 DDL,已经按标识的字段类型创建。
duckdb> desc users10m_ctas;
┌─────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ userid │ VARCHAR │ YES │ │ │ │
│ name │ VARCHAR │ YES │ │ │ │
│ phone │ BIGINT │ YES │ │ │ │
│ sex │ VARCHAR │ YES │ │ │ │
│ birth │ DATE │ YES │ │ │ │
│ card_type │ VARCHAR │ YES │ │ │ │
│ card_id │ VARCHAR │ YES │ │ │ │
│ reg_date │ DATE │ YES │ │ │ │
│ reg_time │ TIME │ YES │ │ │ │
│ modify_date │ DATE │ YES │ │ │ │
│ unit_id │ VARCHAR │ YES │ │ │ │
│ vip_level │ BIGINT │ YES │ │ │ │
│ corp_id │ BIGINT │ YES │ │ │ │
│ job_number │ BIGINT │ YES │ │ │ │
│ col00 │ VARCHAR │ YES │ │ │ │
│ col01 │ VARCHAR │ YES │ │ │ │
│ col02 │ VARCHAR │ YES │ │ │ │
│ col03 │ VARCHAR │ YES │ │ │ │
│ col04 │ VARCHAR │ YES │ │ │ │
│ col05 │ VARCHAR │ YES │ │ │ │
│ col06 │ VARCHAR │ YES │ │ │ │
│ col07 │ VARCHAR │ YES │ │ │ │
│ col08 │ VARCHAR │ YES │ │ │ │
│ col09 │ VARCHAR │ YES │ │ │ │
│ col10 │ VARCHAR │ YES │ │ │ │
│ col11 │ VARCHAR │ YES │ │ │ │
│ col12 │ VARCHAR │ YES │ │ │ │
│ col13 │ VARCHAR │ YES │ │ │ │
│ col14 │ VARCHAR │ YES │ │ │ │
│ col15 │ VARCHAR │ YES │ │ │ │
│ col16 │ VARCHAR │ YES │ │ │ │
│ col17 │ VARCHAR │ YES │ │ │ │
│ col18 │ VARCHAR │ YES │ │ │ │
│ col19 │ VARCHAR │ YES │ │ │ │
│ balance1 │ DOUBLE │ YES │ │ │ │
│ balance2 │ DOUBLE │ YES │ │ │ │
│ balance3 │ DOUBLE │ YES │ │ │ │
│ modify_datetime │ TIMESTAMP │ YES │ │ │ │
├─────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 38 rows 6 columns │
└───────────────────────────────────────────────────────────────────────┘
Run Time (s): real 0.001 user 0.001159 sys 0.000000
duckdb> .mode line
duckdb> WITH table_columns AS (
SELECT
name,
type,
CASE WHEN "notnull" = 1 THEN 'NOT NULL' ELSE '' END AS not_null
FROM pragma_table_info('users10m_ctas'))
SELECT
'CREATE TABLE users10m_ctas (' ||
string_agg(name || ' ' || type || ' ' || not_null, ', ') ||
');' AS create_table_sql
FROM table_columns;
create_table_sql = CREATE TABLE users10m_ctas (userid VARCHAR , name VARCHAR , phone BIGINT , sex VARCHAR , birth DATE , card_type VARCHAR , card_id VARCHAR , reg_date DATE , reg_time TIME , modify_date DATE , unit_id VARCHAR , vip_level BIGINT , corp_id BIGINT , job_number BIGINT , col00 VARCHAR , col01 VARCHAR , col02 VARCHAR , col03 VARCHAR , col04 VARCHAR , col05 VARCHAR , col06 VARCHAR , col07 VARCHAR , col08 VARCHAR , col09 VARCHAR , col10 VARCHAR , col11 VARCHAR , col12 VARCHAR , col13 VARCHAR , col14 VARCHAR , col15 VARCHAR , col16 VARCHAR , col17 VARCHAR , col18 VARCHAR , col19 VARCHAR , balance1 DOUBLE , balance2 DOUBLE , balance3 DOUBLE , modify_datetime TIMESTAMP );
Run Time (s): real 0.002 user 0.002692 sys 0.000146
注意:DuckDB 的 VARCHAR 没有显式长度限制,可以存储任意长度的字符串。实际使用中,字符串长度受限于内存大小和系统性能。
测试原生表和外部表的性能
duckdb> SELECT max(reg_date),min(balance1) FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true, delim=',', quote='"', new_line='\r\n', skip=0, comment='',nullstr='\N', header=true,columns = {中间格式略};
max(reg_date) = 2009-12-30
min(balance1) = 0.02
Run Time (s): real 18.868 user 19.898231 sys 6.671171
duckdb> SELECT max(reg_date),min(balance1) FROM users10m_ctas;
max(reg_date) = 2009-12-30
min(balance1) = 0.02
Run Time (s): real 0.011 user 0.051131 sys 0.001120
duckdb> SELECT * FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true, delim=',', quote='"', new_line='\r\n', skip=0, comment='',nullstr='\N', header=true,columns = {中间格式略}) where userid='04e471fa-0a6b-4181-b2a6-e4bc53af5fac';
100%
userid = 04e471fa-0a6b-4181-b2a6-e4bc53af5fac
name = UVJXFMBEC
phone = 12308418577
sex = 女
birth = 2003-12-17
card_type = 身份证
card_id = 520850486289953447
reg_date = 2009-09-13
reg_time =
modify_date = 2012-04-30
unit_id = 8
vip_level = 0
corp_id = 4500
job_number = 131043
col00 = EXDBHQFLZMAJZH
col01 = RVWDBE
col02 = UBLXVMPIC
col03 = CIHYSHCEQZZFBQZ
col04 = HPOARFSFFUQSPPN
col05 = PQKLEZ
col06 = DJFPVM
col07 = RBAKOUZZTN
col08 = V
col09 = PMVYO
col10 = V
col11 = BCIMNBFA
col12 = UOCMGLHVWU
col13 = JCUQJYT
col14 = KRFFJNMDIYCUBAJMFYK
col15 = OKLIMZU
col16 = VNSMFYDLH
col17 = NWJJLQZP
col18 = GKVL
col19 = UDUVEYSEBVZMQRKTLF
balance1 = 1606.15
balance2 = 9866.7268
balance3 = 10000.1234
modify_datetime = 2025-01-27 09:39:44
Run Time (s): real 15.471 user 25.564100 sys 7.191754
duckdb> SELECT * FROM users10m_ctas where userid='04e471fa-0a6b-4181-b2a6-e4bc53af5fac';
userid = 04e471fa-0a6b-4181-b2a6-e4bc53af5fac
name = UVJXFMBEC
phone = 12308418577
sex = 女
birth = 2003-12-17
card_type = 身份证
card_id = 520850486289953447
reg_date = 2009-09-13
reg_time =
modify_date = 2012-04-30
unit_id = 8
vip_level = 0
corp_id = 4500
job_number = 131043
col00 = EXDBHQFLZMAJZH
col01 = RVWDBE
col02 = UBLXVMPIC
col03 = CIHYSHCEQZZFBQZ
col04 = HPOARFSFFUQSPPN
col05 = PQKLEZ
col06 = DJFPVM
col07 = RBAKOUZZTN
col08 = V
col09 = PMVYO
col10 = V
col11 = BCIMNBFA
col12 = UOCMGLHVWU
col13 = JCUQJYT
col14 = KRFFJNMDIYCUBAJMFYK
col15 = OKLIMZU
col16 = VNSMFYDLH
col17 = NWJJLQZP
col18 = GKVL
col19 = UDUVEYSEBVZMQRKTLF
balance1 = 1606.15
balance2 = 9866.7268
balance3 = 10000.1234
modify_datetime = 2025-01-27 09:39:44
Run Time (s): real 0.006 user 0.009642 sys 0.002340
对比场景成绩
ctas |
max / min |
where 查询 |
|
---|---|---|---|
Read csv from S3 (4.9 GB CSV) |
34 sec |
19 sec |
16 sec |
table(10 M records) |
- |
11 ms |
6 ms |
使用自定义的 CSV 格式
使用控制字符 x01,即 SOH 作为间隔符。能规避字段中包含默认定义分隔符的情况,防止数据读取分列时出错。
注意:如果使用多字符的 delimiter,duckdb 会指定提示 Invalid Input Error: The delimiter option cannot exceed a size of 1 byte. 的超长错误。
export AWS_ACCESS_KEY_ID=admin; export AWS_SECRET_ACCESS_KEY=password;
dumpling --filter 'bakdb.users10m' -u root -P 4500 -h 172.16.201.122 -p root --filetype csv -t 8 -o "s3://tidbbak/dumpling/${CURDATE}" --s3.endpoint="http://172.16.201.85:9000" -r 200000 -F 256MiB --csv-separator=$(echo -e "\x01") --escape-backslash=false --csv-null-value="NULL" --csv-line-terminator=$'\r\n' --csv-delimiter=""
# cat -e bakdb.users10m.0000000390000.csv|more
userid^Aname^Aphone^Asex^Abirth^Acard_type^Acard_id^Areg_date^Areg_time^Amodify_date^Aunit_id^Avip_level^Acorp_id^Ajob_number^Acol00^Acol01^Acol02^Acol03^Acol04^Acol05^Acol06^Acol07^Acol08^Acol09^Acol10^Acol11^Acol12^Acol1
3^Acol14^Acol15^Acol16^Acol17^Acol18^Acol19^Abalance1^Abalance2^Abalance3^Amodify_datetime^M$
1dbf161b-87e9-49e4-bd43-83e8179b4292^ACQAZEBAOQL^A14737946971^AM-gM-^TM-7^A2019-06-15^AM-hM-:M-+M-dM-;M-=M-hM-/M-^A^A599781278347424629^A2003-04-16^ANULL^A2018-04-03^Aeg^A1^A22200^A27564^ARLAWIETOMOH^ALQDFFCOHMXVOR^AYPBIIH
QNIVDM^AXUUKIEXPJRVGQ^AZQMEPKMNHBACDFOYVT^AVFSWFI^ACPSIYHSUBFCI^AYGRFPCEGHAISOGAXIG^AUSAOTBNP^AWDUS^AUTYZOOMCHECUHYAB^ABS^AWJRVSSQVVUSBVZGL^AZ^AOCSGJZYQEIYIKIXVCEX^APYAXMFDV^AORLWDLPQHPMEMOB^AIZZVERY^AGCRJOF^AXKQDXOUXFXRJJ
YORJ^ANULL^ANULL^A10000.1234^A2025-01-27 09:39:44^M$
对应的配置 read_csv 的参数 delim=e'\x01' 和 nullstr='NULL'
duckdb> create table users10m_ctas2 as SELECT *
· FROM read_csv('s3://tidbbak/dumpling/20250129143910/bakdb.users10m.*.csv', union_by_name = true, delim=e'\x01', quote='', new_line='\r\n', skip=0, comment='', nullstr='NULL', header=true,
· columns = {中间格式略});
duckdb> select * from users10m_ctas2 limit 10 ;
┌──────────────────────┬──────────────────────┬─────────────┬─────────┬────────────┬───────────┬────────────────────┬───┬─────────────────┬──────────────────────┬──────────┬───────────┬────────────┬─────────────────────┐
│ userid │ name │ phone │ sex │ birth │ card_type │ card_id │ … │ col18 │ col19 │ balance1 │ balance2 │ balance3 │ modify_datetime │
│ varchar │ varchar │ int64 │ varchar │ date │ varchar │ varchar │ │ varchar │ varchar │ double │ double │ double │ timestamp │
├──────────────────────┼──────────────────────┼─────────────┼─────────┼────────────┼───────────┼────────────────────┼───┼─────────────────┼──────────────────────┼──────────┼───────────┼────────────┼─────────────────────┤
│ 004804a7-e96e-4cf8… │ LKQKM │ 10240894028 │ 女 │ 1979-10-29 │ 身份证 │ 588802761557381279 │ … │ STOBWVHFSQFIDP │ KPRVIGNNFIDCIBL │ 2071.61 │ 9169.1201 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 004804c4-3494-47de… │ OKTICUOYOJNNCULUCGKS │ 15002789114 │ 男 │ 1988-02-13 │ 身份证 │ 970410241703370828 │ … │ OVRLI │ QKH │ 9722.42 │ 5618.2924 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 00480505-5d3e-40f3… │ OEYHUYADGB │ 12955781898 │ 男 │ 2001-03-05 │ 身份证 │ 897826304664568897 │ … │ GXVSEAORZA │ SLDXCJYWYFZFHEHERQPV │ 6997.88 │ 8810.8585 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 0048050a-8c4f-4b68… │ QLCWZCMS │ 16300152778 │ 女 │ 2007-07-11 │ 身份证 │ 177321359793083575 │ … │ FSPAC │ MZKB │ 9225.89 │ 628.1277 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 0048050f-c7e2-4d9d… │ YLQALLEVP │ 10781833261 │ 男 │ 1981-06-23 │ 身份证 │ 620054938335154911 │ … │ CPFGWPHW │ NUMQXOZGAJODSEJYL │ 7555.27 │ 5465.2121 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 00480534-030c-4caa… │ HAWMDVNXMVR │ 17899716473 │ 女 │ 2001-12-26 │ 身份证 │ 599799321632177923 │ … │ DFCXDQQRLCQL │ PFNZ │ 2019.48 │ 8455.7013 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 004805c2-78c6-4d1b… │ DRGFKEKBPP │ 16538556539 │ 女 │ 2016-09-08 │ 身份证 │ 698176678748396854 │ … │ MZNAQX │ UCOBKOKZBQGWKTGBSSXF │ 2793.68 │ 3975.8125 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 004805d5-d97d-49db… │ Y │ 10521349859 │ 男 │ 1998-12-10 │ 身份证 │ 309479851036137986 │ … │ DMONNXPUFSXMVPR │ NRHQXIXNTHMU │ 9898.58 │ 9541.5137 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 004805de-0e71-425f… │ SGFAWOYJ │ 19009647824 │ 男 │ 1973-02-10 │ 身份证 │ 453108689618730419 │ … │ CU │ SKBELQNXTITOWSTLE │ 1370.42 │ 7448.9362 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 004805de-562d-46f1… │ NPTFMCP │ 15951915861 │ 男 │ 2005-02-02 │ 身份证 │ 140249578270439645 │ … │ ALVM │ OIOBZ │ 6013.69 │ 8486.551 │ 10000.1234 │ 2025-01-28 10:42:56 │
├──────────────────────┴──────────────────────┴─────────────┴─────────┴────────────┴───────────┴────────────────────┴───┴─────────────────┴──────────────────────┴──────────┴───────────┴────────────┴─────────────────────┤
│ 10 rows 38 columns (13 shown) │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 0.007 user 0.007134 sys 0.000018
四、总结
备份的数据具有较高的业务价值,可以支持历史快照的即席查询,进一步分析业务的历史变化,但是由于平面文件使用不方便,在日常运营过程中,备份数据没有体现出业务价值。在 MySQL 生态中,通过 dumpling 备份成 CSV,结合 S3 兼容的对象存储,再通过 DuckDB 实现数据备份的快速打开和查询,其他的数据库也可以参考。在具体的实践过程中,可以通过自动化平台,实现按需的自助查询操作,具有较高的业务意义。