
DuckDB x TiDB -- Waking Up CSV Files with DuckDB

Posted by pepezzzz on 2025-02-10
Original (migrated)

1. Background

Enterprises often design logical database backup mechanisms to meet business-protection-level compliance requirements; the logical backups are written as flat files to S3-compatible storage. When the business needs to query data from a historical period, the backup files must be re-imported into a database. Because the target time window is often fuzzy, the import tends to be repeated several times.

Dumpling is TiDB's logical backup tool. It is MySQL-compatible, lightweight, and easy to use; it can export production data as CSV flat files, supports features such as custom delimiters, and can write directly to S3-compatible storage. DuckDB is a lightweight embedded analytical (OLAP) database that can read CSV files on S3 as external tables.

This article combines the two: hand Dumpling's backup files to DuckDB, so that backups can be opened and queried quickly.

2. Environment Preparation

Role   | Configuration
-------|--------------
minio  | 8C / 16GB
duckdb | 8C / 16GB

DuckDB environment

https://github.com/duckdb/duckdb/releases/download/v1.1.3/duckdb_cli-linux-amd64.zip

# unzip duckdb_cli-linux-amd64.zip 
Archive:  duckdb_cli-linux-amd64.zip
  inflating: duckdb    
# vi init.sql 
SET extension_directory = '/data/duckdb_ext';
.prompt 'duckdb> '       
.timer on       
# ./duckdb bak.duckdb --init init.sql
duckdb> SELECT current_database();
┌────────────────────┐
│ current_database() │
│      varchar       │
├────────────────────┤
│ bak                │
└────────────────────┘
Run Time (s): real 0.001 user 0.000429 sys 0.000501
duckdb> .exit

MinIO single-node test environment

Download the server and client binaries:

wget https://dl.min.io/server/minio/release/linux-amd64/minio

wget https://dl.min.io/client/mc/release/linux-amd64/mc

# MINIO_ROOT_USER=admin MINIO_ROOT_PASSWORD=password ./minio server /data/miniodata/   --console-address ":9001" &
[1] 9664
[root@vm172-16-201-85 data]# MinIO Object Storage Server
Copyright: 2015-2025 MinIO, Inc.
License: GNU AGPLv3 - https://www.gnu.org/licenses/agpl-3.0.html
Version: RELEASE.2024-12-18T13-15-44Z (go1.23.4 linux/amd64)

API: http://172.16.201.85:9000  http://172.17.0.1:9000  http://127.0.0.1:9000 
   RootUser: admin 
   RootPass: password 

WebUI: http://172.16.201.85:9001 http://172.17.0.1:9001 http://127.0.0.1:9001 
   RootUser: admin 
   RootPass: password 

CLI: https://min.io/docs/minio/linux/reference/minio-mc.html#quickstart
   $ mc alias set 'myminio' 'http://172.16.201.85:9000' 'admin' 'password'

Docs: https://docs.min.io
WARN: Detected Linux kernel version older than 4.0 release, there are some known potential performance problems with this kernel version. MinIO recommends a minimum of 4.x linux kernel version for best performance

# s3cmd --no-ssl --access_key=admin --secret_key=password  --host=172.16.201.85:9000 --host-bucket= mb s3://tidbbak

Data backup

mysql> select count(1) from users10m;
+----------+
| count(1) |
+----------+
| 10000000 |
+----------+
1 row in set (6 min 36.33 sec)

#export AWS_ACCESS_KEY_ID=admin
#export AWS_SECRET_ACCESS_KEY=password
#CURDATE=$(date +%Y%m%d%H%M%S)
#/root/.tiup/components/dumpling/v8.5.0/dumpling -B bakdb -u "${BAKUSER}" -P ${TIDBPORT} -h ${TIDBIP} -p "${BAKPW}"  --filetype csv -t 8 -o "s3://${Bucket}/dumpling/${CURDATE}" --s3.endpoint="http://${Endpoint}" -r 200000 -F 256MiB

# s3cmd --no-ssl --access_key=admin --secret_key=password  --host=172.16.201.85:9000 --host-bucket= get  s3://tidbbak/dumpling/20250128161903/bakdb.users10m.0000000010000.csv
download: 's3://tidbbak/dumpling/20250128161903/bakdb.users10m.0000000010000.csv' -> './bakdb.users10m.0000000010000.csv'  [1 of 1]
 51980222 of 51980222   100% in    0s   113.40 MB/s  done
# more bakdb.users10m.0000000010000.csv
"userid","name","phone","sex","birth","card_type","card_id","reg_date","reg_time","modify_date","unit_id","vip_level","corp_id","job_number","col00","col01","col02","col03","col04","col05","col06","col07","col08","col09","
col10","col11","col12","col13","col14","col15","col16","col17","col18","col19","balance1","balance2","balance3","modify_datetime"
"004804a7-e96e-4cf8-8622-891e0ae9d87d","LKQKM","10240894028","女","1979-10-29","身份证","588802761557381279","2001-05-12","02:41:58","2016-07-09","hDY","4","22100","130353","EDRROXGRCYJ","TENOFL","MDBPTICNPYQKYDTRHSGC","YP
UJVK","JSKWBZYHXR","IXBPDPDPEARSP","QDSLHMNZ","DJBRX","DYCCQDHJIMA","AFISVLDMXLQPZHVZZDQA","FYHYBEYQGYYDVIHG","LZ","NITVMZGXRTYKQFVPFTHB","NCFESYSN","WFSVITJTYPYBEOXYLXU","YZYYG","PZYQPLZ","GJIRATCSLPTFHCSVDH","STOBWVHFSQF
IDP","KPRVIGNNFIDCIBL",2071.61,9169.1201,10000.1234,"2025-01-28 10:42:56"
"004804c4-3494-47de-a7ca-f5ab7ba8e135","OKTICUOYOJNNCULUCGKS","15002789114","男","1988-02-13","身份证","970410241703370828","2000-11-11","09:46:21","2013-01-15","dgjhS","2","6900","107337","SINTVGVWCUQROU","U","LRHRMWTGCKN
","T","SADSQUTZD","OBJMWNAHSUERZJPMRG","TAPMJTRLQJIAYXSCAVUQ","VRJCSHSGUAOWAVBDST","OKOUTEUMWVHJYBTP","EYMJMMBWWJW","PALWUHU","NWKMJADTBYPJZ","OVZSQOMBUBXUT","XDFMQHJFIWXN","JFIA","FDD","XVQUOOODC","P","OVRLI","QKH",9722.4
2,5618.2924,10000.1234,"2025-01-28 10:42:56"

3. Accessing the CSV Backups on S3

Configure the S3 access key

Install the httpfs extension and configure the S3 secret.

duckdb> INSTALL 'httpfs';
duckdb> SELECT * FROM duckdb_extensions();
┌──────────────────┬─────────┬───────────┬──────────────────────┬───────────────────────────────────────────────────────────────────────────────┬───────────────────┬───────────────────┬───────────────────┬────────────────┐
│  extension_name  │ loaded  │ installed │     install_path     │                                  description                                  │      aliases      │ extension_version │   install_mode    │ installed_from │
│     varchar      │ boolean │  boolean  │       varchar        │                                    varchar                                    │     varchar[]     │      varchar      │      varchar      │    varchar     │
├──────────────────┼─────────┼───────────┼──────────────────────┼───────────────────────────────────────────────────────────────────────────────┼───────────────────┼───────────────────┼───────────────────┼────────────────┤
│ arrow            │ false   │ false     │                      │ A zero-copy data integration between Apache Arrow and DuckDB                  │ []                │                   │                   │                │
│ autocomplete     │ true    │ true      │ (BUILT-IN)           │ Adds support for autocomplete in the shell                                    │ []                │                   │ STATICALLY_LINKED │                │
│ aws              │ false   │ false     │                      │ Provides features that depend on the AWS SDK                                  │ []                │                   │                   │                │
│ azure            │ false   │ false     │                      │ Adds a filesystem abstraction for Azure blob storage to DuckDB                │ []                │                   │                   │                │
│ delta            │ false   │ false     │                      │ Adds support for Delta Lake                                                   │ []                │                   │                   │                │
│ excel            │ false   │ false     │                      │ Adds support for Excel-like format strings                                    │ []                │                   │                   │                │
│ fts              │ true    │ true      │ (BUILT-IN)           │ Adds support for Full-Text Search Indexes                                     │ []                │ v1.1.3            │ STATICALLY_LINKED │                │
│ httpfs           │ true    │ true      │ /data/duckdb_ext/v…  │ Adds support for reading and writing files over a HTTP(S) connection          │ [http, https, s3] │ v1.1.3            │ REPOSITORY        │ core           │
...
│ vss              │ false   │ false     │                      │ Adds indexing support to accelerate Vector Similarity Search                  │ []                │                   │                   │                │
├──────────────────┴─────────┴───────────┴──────────────────────┴───────────────────────────────────────────────────────────────────────────────┴───────────────────┴───────────────────┴───────────────────┴────────────────┤
│ 24 rows                                                                                                                                                                                                          9 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 0.005 user 0.000519 sys 0.001037

duckdb> CREATE PERSISTENT SECRET miniosecret (
            TYPE S3,
            KEY_ID 'admin',
            SECRET 'password',
            ENDPOINT '172.16.201.85:9000',
            URL_STYLE 'path',
            USE_SSL false
        );
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ true    │
└─────────┘

duckdb> SELECT * FROM duckdb_secrets();
┌─────────────┬─────────┬──────────┬────────────┬────────────┬──────────────────────┬──────────────────────────────────────────────────────────┐
│    name     │  type   │ provider │ persistent │  storage   │        scope         │                      secret_string                       │
│   varchar   │ varchar │ varchar  │  boolean   │  varchar   │      varchar[]       │                         varchar                          │
├─────────────┼─────────┼──────────┼────────────┼────────────┼──────────────────────┼──────────────────────────────────────────────────────────┤
│ miniosecret │ s3      │ config   │ true       │ local_file │ [s3://, s3n://, s3…  │ name=miniosecret;type=s3;provider=config;serializable=…  │
└─────────────┴─────────┴──────────┴────────────┴────────────┴──────────────────────┴──────────────────────────────────────────────────────────┘

Note: a PERSISTENT SECRET is stored under the ${HOME}/.duckdb/stored_secrets directory.

Accessing the CSV backup files on S3 from DuckDB

Read the data through DuckDB's S3 support (the httpfs extension) combined with read_csv.

duckdb> SELECT count(*) FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true);
100% ▕████████████████████████████████████████████████████████████▏ 
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│     10000000 │
└──────────────┘
Run Time (s): real 18.073 user 17.706364 sys 6.432855

duckdb> SELECT * FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true) limit 1;
┌──────────────────────┬─────────┬─────────────┬─────────┬────────────┬───────────┬────────────────────┬───┬────────────────────┬────────────────┬─────────────────┬──────────┬───────────┬────────────┬─────────────────────┐
│        userid        │  name   │    phone    │   sex   │   birth    │ card_type │      card_id       │ … │       col17        │     col18      │      col19      │ balance1 │ balance2  │  balance3  │   modify_datetime   │
│       varchar        │ varchar │    int64    │ varchar │    date    │  varchar  │      varchar       │   │      varchar       │    varchar     │     varchar     │ varchar  │  varchar  │   double   │      timestamp      │
├──────────────────────┼─────────┼─────────────┼─────────┼────────────┼───────────┼────────────────────┼───┼────────────────────┼────────────────┼─────────────────┼──────────┼───────────┼────────────┼─────────────────────┤
│ 004804a7-e96e-4cf8…  │ LKQKM   │ 10240894028 │ 女      │ 1979-10-29 │ 身份证    │ 588802761557381279 │ … │ GJIRATCSLPTFHCSVDH │ STOBWVHFSQFIDP │ KPRVIGNNFIDCIBL │ 2071.61  │ 9169.1201 │ 10000.1234 │ 2025-01-28 10:42:56 │
├──────────────────────┴─────────┴─────────────┴─────────┴────────────┴───────────┴────────────────────┴───┴────────────────────┴────────────────┴─────────────────┴──────────┴───────────┴────────────┴─────────────────────┤
│ 1 rows                                                                                                                                                                                               38 columns (14 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 3.981 user 6.215305 sys 1.525878

duckdb> .mode line
duckdb> SELECT * FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true) limit 1;
         userid = 004804a7-e96e-4cf8-8622-891e0ae9d87d
           name = LKQKM
          phone = 10240894028
            sex = 女
          birth = 1979-10-29
      card_type = 身份证
        card_id = 588802761557381279
       reg_date = 2001-05-12
       reg_time = 02:41:58
    modify_date = 2016-07-09
        unit_id = hDY
      vip_level = 4
        corp_id = 22100
     job_number = 130353
          col00 = EDRROXGRCYJ
          col01 = TENOFL
          col02 = MDBPTICNPYQKYDTRHSGC
          col03 = YPUJVK
          col04 = JSKWBZYHXR
          col05 = IXBPDPDPEARSP
          col06 = QDSLHMNZ
          col07 = DJBRX
          col08 = DYCCQDHJIMA
          col09 = AFISVLDMXLQPZHVZZDQA
          col10 = FYHYBEYQGYYDVIHG
          col11 = LZ
          col12 = NITVMZGXRTYKQFVPFTHB
          col13 = NCFESYSN
          col14 = WFSVITJTYPYBEOXYLXU
          col15 = YZYYG
          col16 = PZYQPLZ
          col17 = GJIRATCSLPTFHCSVDH
          col18 = STOBWVHFSQFIDP
          col19 = KPRVIGNNFIDCIBL
       balance1 = 2071.61
       balance2 = 9169.1201
       balance3 = 10000.1234
modify_datetime = 2025-01-28 10:42:56
Run Time (s): real 2.791 user 6.307715 sys 1.495764
duckdb> .mode duckbox

Note: .mode line is similar to MySQL's \G terminator output, .mode table is similar to MySQL's default output, and .mode duckbox is DuckDB's default: a condensed output that adapts to the screen width.
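As an illustration, the rendering rule behind .mode line can be sketched in a few lines of Python (a toy re-implementation, not DuckDB's actual code): one "column = value" pair per row, with column names right-aligned to the widest name.

```python
# Toy re-implementation of DuckDB's `.mode line` rendering:
# one "name = value" pair per line, names right-aligned.
def render_line_mode(row: dict) -> list:
    width = max(len(name) for name in row)
    return [f"{name:>{width}} = {value}" for name, value in row.items()]

row = {"userid": "004804a7", "name": "LKQKM", "balance1": 2071.61}
for line in render_line_mode(row):
    print(line)
```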

Importing a standard backup into DuckDB

By default, auto_detect does not always infer the correct column types. First use sniff_csv to generate a configuration template.

duckdb> .mode line
duckdb> SELECT Prompt FROM sniff_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.0000000010000.csv', sample_size = 10000);
Prompt = FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.0000000010000.csv', auto_detect=false, delim=',', quote='"', escape='"', new_line='\r\n', skip=0, comment='', header=true, columns={'userid': 'VARCHAR', 'name': 'VARCHAR', 'phone': 'BIGINT', 'sex': 'VARCHAR', 'birth': 'DATE', 'card_type': 'VARCHAR', 'card_id': 'VARCHAR', 'reg_date': 'DATE', 'reg_time': 'TIME', 'modify_date': 'DATE', 'unit_id': 'VARCHAR', 'vip_level': 'BIGINT', 'corp_id': 'BIGINT', 'job_number': 'BIGINT', 'col00': 'VARCHAR', 'col01': 'VARCHAR', 'col02': 'VARCHAR', 'col03': 'VARCHAR', 'col04': 'VARCHAR', 'col05': 'VARCHAR', 'col06': 'VARCHAR', 'col07': 'VARCHAR', 'col08': 'VARCHAR', 'col09': 'VARCHAR', 'col10': 'VARCHAR', 'col11': 'VARCHAR', 'col12': 'VARCHAR', 'col13': 'VARCHAR', 'col14': 'VARCHAR', 'col15': 'VARCHAR', 'col16': 'VARCHAR', 'col17': 'VARCHAR', 'col18': 'VARCHAR', 'col19': 'VARCHAR', 'balance1': 'DOUBLE', 'balance2': 'DOUBLE', 'balance3': 'DOUBLE', 'modify_datetime': 'TIMESTAMP'}, dateformat='%Y-%m-%d', sample_size=10000);
Run Time (s): real 0.174 user 0.277318 sys 0.060880

If read_csv's auto_detect fails, or the rows sampled within sniff_csv's sample_size do not cover every value format that appears later in the files, an import using the resulting incomplete option set can still fail and must be adjusted by hand.

Adjust the configuration template before importing; the column definitions in the schema file produced alongside the backup can also be used.
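Deriving the columns option from the backup's schema file can be sketched with stdlib-only Python. The schema text and the type map below are illustrative assumptions, not an official Dumpling-to-DuckDB mapping:

```python
import re

# Rough mapping from common MySQL/TiDB column types to DuckDB types.
# This table is an assumption for illustration, not an official mapping.
TYPE_MAP = {
    "varchar": "VARCHAR", "char": "VARCHAR", "text": "VARCHAR",
    "bigint": "BIGINT", "int": "INTEGER", "tinyint": "TINYINT",
    "date": "DATE", "time": "TIME", "datetime": "TIMESTAMP",
    "timestamp": "TIMESTAMP", "decimal": "DOUBLE", "double": "DOUBLE",
}

def columns_from_schema(schema_sql: str) -> dict:
    """Extract {column: duckdb_type} from a Dumpling *-schema.sql file."""
    cols = {}
    for m in re.finditer(r"`(\w+)`\s+(\w+)", schema_sql):
        name, mysql_type = m.group(1), m.group(2).lower()
        if mysql_type in TYPE_MAP:
            cols[name] = TYPE_MAP[mysql_type]
    return cols

# A made-up fragment in the shape Dumpling writes next to the CSVs.
schema = """CREATE TABLE `users10m` (
  `userid` varchar(36) NOT NULL,
  `phone` bigint(20) DEFAULT NULL,
  `birth` date DEFAULT NULL,
  `balance1` decimal(10,2) DEFAULT NULL,
  `modify_datetime` datetime DEFAULT NULL
);"""
print(columns_from_schema(schema))
```

The resulting dictionary can be pasted into read_csv's columns option.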

duckdb> SELECT * FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.0000000120000.csv');
Run Time (s): real 1.472 user 3.583090 sys 0.183995
Conversion Error: CSV Error on Line: 154129
Original Line: 
"04e471fa-0a6b-4181-b2a6-e4bc53af5fac","UVJXFMBEC","12308418577","女","2003-12-17","身份证","520850486289953447","2009-09-13",\N,"2012-04-30","8","0","4500","131043","EXDBHQFLZMAJZH","RVWDBE","UBLXVMPIC","CIHYSHCEQZZFBQZ","HPOARFSFFUQSPPN","PQKLEZ","DJFPVM","RBAKOUZZTN","V","PMVYO","V","BCIMNBFA","UOCMGLHVWU","JCUQJYT","KRFFJNMDIYCUBAJMFYK","OKLIMZU","VNSMFYDLH","NWJJLQZP","GKVL","UDUVEYSEBVZMQRKTLF",1606.15,9866.7268,10000.1234,"2025-01-27 09:39:44"
Error when converting column "reg_time". Could not convert string "\N" to 'TIME'

Column reg_time is being converted as type TIME
This type was auto-detected from the CSV file.

By default, Dumpling writes NULL values in CSV as '\N', so the nullstr='\N' option must be added. After the change, import again.
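The effect of nullstr='\N' can be illustrated with a small stdlib-only Python sketch (the sample fields are made up): the unquoted two-character marker is replaced by a real NULL when parsing.

```python
import csv
import io

# A CSV fragment in Dumpling's default format: NULL is the
# unquoted two-character sequence \N (backslash + capital N).
data = '"userid","reg_time"\r\n"abc-123",\\N\r\n"def-456","02:41:58"\r\n'

def read_with_nullstr(text: str, nullstr: str = r"\N") -> list:
    """Parse CSV, replacing the NULL marker with None, mirroring
    what read_csv(..., nullstr='\\N') does in DuckDB."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        rows.append([None if field == nullstr else field for field in row])
    return rows

rows = read_with_nullstr(data)
print(rows[1])  # ['abc-123', None]
```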

duckdb> SELECT *
FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true, delim=',', quote='"', new_line='\r\n', skip=0, comment='', nullstr='\N', header=true,
    columns = {
'userid': 'VARCHAR',
'name': 'VARCHAR',
'phone': 'BIGINT',
'sex': 'VARCHAR',
'birth': 'DATE',
'card_type': 'VARCHAR',
'card_id': 'VARCHAR',
'reg_date': 'DATE',
'reg_time': 'TIME',
'modify_date': 'DATE',
'unit_id': 'VARCHAR',
'vip_level': 'BIGINT',
'corp_id': 'BIGINT',
'job_number': 'BIGINT',
'col00': 'VARCHAR',
'col01': 'VARCHAR',
'col02': 'VARCHAR',
'col03': 'VARCHAR',
'col04': 'VARCHAR',
'col05': 'VARCHAR',
'col06': 'VARCHAR',
'col07': 'VARCHAR',
'col08': 'VARCHAR',
'col09': 'VARCHAR',
'col10': 'VARCHAR',
'col11': 'VARCHAR',
'col12': 'VARCHAR',
'col13': 'VARCHAR',
'col14': 'VARCHAR',
'col15': 'VARCHAR',
'col16': 'VARCHAR',
'col17': 'VARCHAR',
'col18': 'VARCHAR',
'col19': 'VARCHAR',
'balance1': 'DOUBLE',
'balance2': 'DOUBLE',
'balance3': 'DOUBLE',
'modify_datetime': 'TIMESTAMP'
    });
-- Output omitted.

duckdb> create table users10m_ctas as  SELECT * FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true, delim=',', quote='"', new_line='\r\n', skip=0, comment='',nullstr='\N', header=true,
    columns = {
'userid': 'VARCHAR',
'name': 'VARCHAR',
'phone': 'BIGINT',
'sex': 'VARCHAR',
'birth': 'DATE',
'card_type': 'VARCHAR',
'card_id': 'VARCHAR',
'reg_date': 'DATE',
'reg_time': 'TIME',
'modify_date': 'DATE',
'unit_id': 'VARCHAR',
'vip_level': 'BIGINT',
'corp_id': 'BIGINT',
'job_number': 'BIGINT',
'col00': 'VARCHAR',
'col01': 'VARCHAR',
'col02': 'VARCHAR',
'col03': 'VARCHAR',
'col04': 'VARCHAR',
'col05': 'VARCHAR',
'col06': 'VARCHAR',
'col07': 'VARCHAR',
'col08': 'VARCHAR',
'col09': 'VARCHAR',
'col10': 'VARCHAR',
'col11': 'VARCHAR',
'col12': 'VARCHAR',
'col13': 'VARCHAR',
'col14': 'VARCHAR',
'col15': 'VARCHAR',
'col16': 'VARCHAR',
'col17': 'VARCHAR',
'col18': 'VARCHAR',
'col19': 'VARCHAR',
'balance1': 'DOUBLE',
'balance2': 'DOUBLE',
'balance3': 'DOUBLE',
'modify_datetime': 'TIMESTAMP'
    });
100% ▕████████████████████████████████████████████████████████████▏ 
Run Time (s): real 33.892 user 167.084411 sys 14.806282

# s3cmd --no-ssl --access_key=admin --secret_key=password  --host=172.16.201.85:9000 --host-bucket=  du s3://tidbbak/dumpling/20250128161903/
  4913936499       42 objects s3://tidbbak/dumpling/20250128161903/

# ls -lh /data/bak.duckdb 
-rw-r--r-- 1 root root 2.9G Jan 28 17:12 /data/bak.duckdb

Inspect the structure and DDL of the CTAS table: the columns were created with the declared types.

duckdb> desc users10m_ctas;
┌─────────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│   column_name   │ column_type │  null   │   key   │ default │  extra  │
│     varchar     │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ userid          │ VARCHAR     │ YES     │         │         │         │
│ name            │ VARCHAR     │ YES     │         │         │         │
│ phone           │ BIGINT      │ YES     │         │         │         │
│ sex             │ VARCHAR     │ YES     │         │         │         │
│ birth           │ DATE        │ YES     │         │         │         │
│ card_type       │ VARCHAR     │ YES     │         │         │         │
│ card_id         │ VARCHAR     │ YES     │         │         │         │
│ reg_date        │ DATE        │ YES     │         │         │         │
│ reg_time        │ TIME        │ YES     │         │         │         │
│ modify_date     │ DATE        │ YES     │         │         │         │
│ unit_id         │ VARCHAR     │ YES     │         │         │         │
│ vip_level       │ BIGINT      │ YES     │         │         │         │
│ corp_id         │ BIGINT      │ YES     │         │         │         │
│ job_number      │ BIGINT      │ YES     │         │         │         │
│ col00           │ VARCHAR     │ YES     │         │         │         │
│ col01           │ VARCHAR     │ YES     │         │         │         │
│ col02           │ VARCHAR     │ YES     │         │         │         │
│ col03           │ VARCHAR     │ YES     │         │         │         │
│ col04           │ VARCHAR     │ YES     │         │         │         │
│ col05           │ VARCHAR     │ YES     │         │         │         │
│ col06           │ VARCHAR     │ YES     │         │         │         │
│ col07           │ VARCHAR     │ YES     │         │         │         │
│ col08           │ VARCHAR     │ YES     │         │         │         │
│ col09           │ VARCHAR     │ YES     │         │         │         │
│ col10           │ VARCHAR     │ YES     │         │         │         │
│ col11           │ VARCHAR     │ YES     │         │         │         │
│ col12           │ VARCHAR     │ YES     │         │         │         │
│ col13           │ VARCHAR     │ YES     │         │         │         │
│ col14           │ VARCHAR     │ YES     │         │         │         │
│ col15           │ VARCHAR     │ YES     │         │         │         │
│ col16           │ VARCHAR     │ YES     │         │         │         │
│ col17           │ VARCHAR     │ YES     │         │         │         │
│ col18           │ VARCHAR     │ YES     │         │         │         │
│ col19           │ VARCHAR     │ YES     │         │         │         │
│ balance1        │ DOUBLE      │ YES     │         │         │         │
│ balance2        │ DOUBLE      │ YES     │         │         │         │
│ balance3        │ DOUBLE      │ YES     │         │         │         │
│ modify_datetime │ TIMESTAMP   │ YES     │         │         │         │
├─────────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┤
│ 38 rows                                                     6 columns │
└───────────────────────────────────────────────────────────────────────┘
Run Time (s): real 0.001 user 0.001159 sys 0.000000
duckdb> .mode line
duckdb> WITH table_columns AS (
                        SELECT
                        name,
                        type,
                        CASE WHEN "notnull" = 1 THEN 'NOT NULL' ELSE '' END AS not_null
                        FROM pragma_table_info('users10m_ctas'))
                        SELECT
                        'CREATE TABLE users10m_ctas (' ||
                        string_agg(name || ' ' || type || ' ' || not_null, ', ') ||
                        ');' AS create_table_sql
                        FROM table_columns;
create_table_sql = CREATE TABLE users10m_ctas (userid VARCHAR , name VARCHAR , phone BIGINT , sex VARCHAR , birth DATE , card_type VARCHAR , card_id VARCHAR , reg_date DATE , reg_time TIME , modify_date DATE , unit_id VARCHAR , vip_level BIGINT , corp_id BIGINT , job_number BIGINT , col00 VARCHAR , col01 VARCHAR , col02 VARCHAR , col03 VARCHAR , col04 VARCHAR , col05 VARCHAR , col06 VARCHAR , col07 VARCHAR , col08 VARCHAR , col09 VARCHAR , col10 VARCHAR , col11 VARCHAR , col12 VARCHAR , col13 VARCHAR , col14 VARCHAR , col15 VARCHAR , col16 VARCHAR , col17 VARCHAR , col18 VARCHAR , col19 VARCHAR , balance1 DOUBLE , balance2 DOUBLE , balance3 DOUBLE , modify_datetime TIMESTAMP );
Run Time (s): real 0.002 user 0.002692 sys 0.000146

Note: DuckDB's VARCHAR has no explicit length limit and can store strings of any length. In practice, string length is bounded by available memory and system performance.

Comparing native-table and external-table performance

duckdb> SELECT max(reg_date),min(balance1) FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true, delim=',', quote='"', new_line='\r\n', skip=0, comment='',nullstr='\N', header=true,columns = {column list omitted});

max(reg_date) = 2009-12-30
min(balance1) = 0.02
Run Time (s): real 18.868 user 19.898231 sys 6.671171

duckdb> SELECT max(reg_date),min(balance1) FROM users10m_ctas;
max(reg_date) = 2009-12-30
min(balance1) = 0.02
Run Time (s): real 0.011 user 0.051131 sys 0.001120

duckdb>  SELECT * FROM read_csv('s3://tidbbak/dumpling/20250128161903/bakdb.users10m.*.csv', union_by_name = true, delim=',', quote='"', new_line='\r\n', skip=0, comment='',nullstr='\N', header=true,columns = {column list omitted}) where userid='04e471fa-0a6b-4181-b2a6-e4bc53af5fac';
100% 
         userid = 04e471fa-0a6b-4181-b2a6-e4bc53af5fac
           name = UVJXFMBEC
          phone = 12308418577
            sex = 女
          birth = 2003-12-17
      card_type = 身份证
        card_id = 520850486289953447
       reg_date = 2009-09-13
       reg_time = 
    modify_date = 2012-04-30
        unit_id = 8
      vip_level = 0
        corp_id = 4500
     job_number = 131043
          col00 = EXDBHQFLZMAJZH
          col01 = RVWDBE
          col02 = UBLXVMPIC
          col03 = CIHYSHCEQZZFBQZ
          col04 = HPOARFSFFUQSPPN
          col05 = PQKLEZ
          col06 = DJFPVM
          col07 = RBAKOUZZTN
          col08 = V
          col09 = PMVYO
          col10 = V
          col11 = BCIMNBFA
          col12 = UOCMGLHVWU
          col13 = JCUQJYT
          col14 = KRFFJNMDIYCUBAJMFYK
          col15 = OKLIMZU
          col16 = VNSMFYDLH
          col17 = NWJJLQZP
          col18 = GKVL
          col19 = UDUVEYSEBVZMQRKTLF
       balance1 = 1606.15
       balance2 = 9866.7268
       balance3 = 10000.1234
modify_datetime = 2025-01-27 09:39:44
Run Time (s): real 15.471 user 25.564100 sys 7.191754

duckdb> SELECT * FROM users10m_ctas where userid='04e471fa-0a6b-4181-b2a6-e4bc53af5fac';
         userid = 04e471fa-0a6b-4181-b2a6-e4bc53af5fac
           name = UVJXFMBEC
          phone = 12308418577
            sex = 女
          birth = 2003-12-17
      card_type = 身份证
        card_id = 520850486289953447
       reg_date = 2009-09-13
       reg_time = 
    modify_date = 2012-04-30
        unit_id = 8
      vip_level = 0
        corp_id = 4500
     job_number = 131043
          col00 = EXDBHQFLZMAJZH
          col01 = RVWDBE
          col02 = UBLXVMPIC
          col03 = CIHYSHCEQZZFBQZ
          col04 = HPOARFSFFUQSPPN
          col05 = PQKLEZ
          col06 = DJFPVM
          col07 = RBAKOUZZTN
          col08 = V
          col09 = PMVYO
          col10 = V
          col11 = BCIMNBFA
          col12 = UOCMGLHVWU
          col13 = JCUQJYT
          col14 = KRFFJNMDIYCUBAJMFYK
          col15 = OKLIMZU
          col16 = VNSMFYDLH
          col17 = NWJJLQZP
          col18 = GKVL
          col19 = UDUVEYSEBVZMQRKTLF
       balance1 = 1606.15
       balance2 = 9866.7268
       balance3 = 10000.1234
modify_datetime = 2025-01-27 09:39:44
Run Time (s): real 0.006 user 0.009642 sys 0.002340

Results by scenario

Source                        | CTAS   | max / min | WHERE query
------------------------------|--------|-----------|------------
read_csv from S3 (4.9 GB CSV) | 34 sec | 19 sec    | 16 sec
Native table (10 M records)   | -      | 11 ms     | 6 ms
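A quick calculation puts the gap between the two storage forms in perspective (wall-clock times taken from the runs above):

```python
# Measured wall-clock times from the runs above.
csv_max_min_s = 18.868      # max/min aggregate over CSV on S3
native_max_min_s = 0.011    # same aggregate on the native table
csv_where_s = 15.471        # point lookup over CSV on S3
native_where_s = 0.006      # point lookup on the native table

print(f"aggregate speedup: {csv_max_min_s / native_max_min_s:.0f}x")
print(f"lookup speedup:    {csv_where_s / native_where_s:.0f}x")
```

Both queries run three orders of magnitude faster once the data is materialized as a native DuckDB table.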

Using a custom CSV format

Use the control character 0x01 (SOH) as the separator. This avoids the case where a field contains the default delimiter, preventing columns from being split incorrectly when the data is read back.

Note: if a multi-character delimiter is specified, DuckDB reports: Invalid Input Error: The delimiter option cannot exceed a size of 1 byte.
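A stdlib-only Python sketch (the sample row is invented) shows why a single SOH byte is a safe separator: commas inside field values no longer collide with the delimiter, so no quoting is needed.

```python
import csv
import io

# One header line and one data row, separated by SOH (0x01), the
# same control character passed to Dumpling via --csv-separator.
SOH = "\x01"
raw = (
    SOH.join(["userid", "name", "balance1"]) + "\r\n"
    + SOH.join(["abc-123", "Smith, John", "2071.61"]) + "\r\n"
)

# The comma inside "Smith, John" splits nothing, because the
# delimiter is a control byte that never appears in the data.
rows = list(csv.reader(io.StringIO(raw), delimiter=SOH))
print(rows[1])  # ['abc-123', 'Smith, John', '2071.61']
```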

export AWS_ACCESS_KEY_ID=admin; export AWS_SECRET_ACCESS_KEY=password; 
dumpling --filter 'bakdb.users10m' -u root -P 4500 -h 172.16.201.122  -p root  --filetype csv -t 8 -o "s3://tidbbak/dumpling/${CURDATE}" --s3.endpoint="http://172.16.201.85:9000" -r 200000 -F 256MiB  --csv-separator=$(echo -e "\x01") --escape-backslash=false --csv-null-value="NULL" --csv-line-terminator=$'\r\n' --csv-delimiter="" 

# cat -e bakdb.users10m.0000000390000.csv|more 
userid^Aname^Aphone^Asex^Abirth^Acard_type^Acard_id^Areg_date^Areg_time^Amodify_date^Aunit_id^Avip_level^Acorp_id^Ajob_number^Acol00^Acol01^Acol02^Acol03^Acol04^Acol05^Acol06^Acol07^Acol08^Acol09^Acol10^Acol11^Acol12^Acol1
3^Acol14^Acol15^Acol16^Acol17^Acol18^Acol19^Abalance1^Abalance2^Abalance3^Amodify_datetime^M$
1dbf161b-87e9-49e4-bd43-83e8179b4292^ACQAZEBAOQL^A14737946971^AM-gM-^TM-7^A2019-06-15^AM-hM-:M-+M-dM-;M-=M-hM-/M-^A^A599781278347424629^A2003-04-16^ANULL^A2018-04-03^Aeg^A1^A22200^A27564^ARLAWIETOMOH^ALQDFFCOHMXVOR^AYPBIIH
QNIVDM^AXUUKIEXPJRVGQ^AZQMEPKMNHBACDFOYVT^AVFSWFI^ACPSIYHSUBFCI^AYGRFPCEGHAISOGAXIG^AUSAOTBNP^AWDUS^AUTYZOOMCHECUHYAB^ABS^AWJRVSSQVVUSBVZGL^AZ^AOCSGJZYQEIYIKIXVCEX^APYAXMFDV^AORLWDLPQHPMEMOB^AIZZVERY^AGCRJOF^AXKQDXOUXFXRJJ
YORJ^ANULL^ANULL^A10000.1234^A2025-01-27 09:39:44^M$

The corresponding read_csv options are delim=e'\x01' and nullstr='NULL'.

duckdb> create table users10m_ctas2 as SELECT *
      · FROM read_csv('s3://tidbbak/dumpling/20250129143910/bakdb.users10m.*.csv', union_by_name = true, delim=e'\x01', quote='', new_line='\r\n', skip=0, comment='', nullstr='NULL', header=true,
      ·     columns = {column list omitted});

duckdb> select * from users10m_ctas2 limit 10 ;
┌──────────────────────┬──────────────────────┬─────────────┬─────────┬────────────┬───────────┬────────────────────┬───┬─────────────────┬──────────────────────┬──────────┬───────────┬────────────┬─────────────────────┐
│        userid        │         name         │    phone    │   sex   │   birth    │ card_type │      card_id       │ … │      col18      │        col19         │ balance1 │ balance2  │  balance3  │   modify_datetime   │
│       varchar        │       varchar        │    int64    │ varchar │    date    │  varchar  │      varchar       │   │     varchar     │       varchar        │  double  │  double   │   double   │      timestamp      │
├──────────────────────┼──────────────────────┼─────────────┼─────────┼────────────┼───────────┼────────────────────┼───┼─────────────────┼──────────────────────┼──────────┼───────────┼────────────┼─────────────────────┤
│ 004804a7-e96e-4cf8…  │ LKQKM                │ 10240894028 │ 女      │ 1979-10-29 │ 身份证    │ 588802761557381279 │ … │ STOBWVHFSQFIDP  │ KPRVIGNNFIDCIBL      │  2071.61 │ 9169.1201 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 004804c4-3494-47de…  │ OKTICUOYOJNNCULUCGKS │ 15002789114 │ 男      │ 1988-02-13 │ 身份证    │ 970410241703370828 │ … │ OVRLI           │ QKH                  │  9722.42 │ 5618.2924 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 00480505-5d3e-40f3…  │ OEYHUYADGB           │ 12955781898 │ 男      │ 2001-03-05 │ 身份证    │ 897826304664568897 │ … │ GXVSEAORZA      │ SLDXCJYWYFZFHEHERQPV │  6997.88 │ 8810.8585 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 0048050a-8c4f-4b68…  │ QLCWZCMS             │ 16300152778 │ 女      │ 2007-07-11 │ 身份证    │ 177321359793083575 │ … │ FSPAC           │ MZKB                 │  9225.89 │  628.1277 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 0048050f-c7e2-4d9d…  │ YLQALLEVP            │ 10781833261 │ 男      │ 1981-06-23 │ 身份证    │ 620054938335154911 │ … │ CPFGWPHW        │ NUMQXOZGAJODSEJYL    │  7555.27 │ 5465.2121 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 00480534-030c-4caa…  │ HAWMDVNXMVR          │ 17899716473 │ 女      │ 2001-12-26 │ 身份证    │ 599799321632177923 │ … │ DFCXDQQRLCQL    │ PFNZ                 │  2019.48 │ 8455.7013 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 004805c2-78c6-4d1b…  │ DRGFKEKBPP           │ 16538556539 │ 女      │ 2016-09-08 │ 身份证    │ 698176678748396854 │ … │ MZNAQX          │ UCOBKOKZBQGWKTGBSSXF │  2793.68 │ 3975.8125 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 004805d5-d97d-49db…  │ Y                    │ 10521349859 │ 男      │ 1998-12-10 │ 身份证    │ 309479851036137986 │ … │ DMONNXPUFSXMVPR │ NRHQXIXNTHMU         │  9898.58 │ 9541.5137 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 004805de-0e71-425f…  │ SGFAWOYJ             │ 19009647824 │ 男      │ 1973-02-10 │ 身份证    │ 453108689618730419 │ … │ CU              │ SKBELQNXTITOWSTLE    │  1370.42 │ 7448.9362 │ 10000.1234 │ 2025-01-28 10:42:56 │
│ 004805de-562d-46f1…  │ NPTFMCP              │ 15951915861 │ 男      │ 2005-02-02 │ 身份证    │ 140249578270439645 │ … │ ALVM            │ OIOBZ                │  6013.69 │  8486.551 │ 10000.1234 │ 2025-01-28 10:42:56 │
├──────────────────────┴──────────────────────┴─────────────┴─────────┴────────────┴───────────┴────────────────────┴───┴─────────────────┴──────────────────────┴──────────┴───────────┴────────────┴─────────────────────┤
│ 10 rows                                                                                                                                                                                            38 columns (13 shown) │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 0.007 user 0.007134 sys 0.000018

4. Summary

Backup data carries real business value: it can support ad-hoc queries against historical snapshots and further analysis of how the business has changed over time. But because flat files are inconvenient to work with, backups rarely deliver that value in day-to-day operations. In the MySQL ecosystem, backing up to CSV with Dumpling, storing the files on S3-compatible object storage, and then opening and querying them with DuckDB makes backups quickly usable; other databases can adopt the same approach. In practice, an automation platform can turn this into on-demand self-service queries, which is of considerable business value.


Copyright notice: this article is an original post by a TiDB community user and is licensed under CC BY-NC-SA 4.0. When republishing, please include a link to the original article and this notice.
