背景
在国产化数据库替代过程中,有一项必不可少的任务就是数据迁移,我们常用的技术手段之一是把数据库的数据导出成csv格式的文件,然后将csv格式的文件导入的国产数据库中。对于TiDB数据库,导入工具为lightning,lightning导入csv文件时,为了提高导入的速度,官方建议csv文件大小为256MB。对于大的csv文件,我们需要进行切分,并且切分后的csv文件命名必须符合lightning的命名要求:schema.table.00001.csv,schema.table.00002.csv,schema.table.00003.csv...
功能
Tisplit可以实现把上百GB的csv文件,切分成若干几百MB的小文件,并且每个切分的文件不会出现断行的情况;切分后的csv文件名符合lightning的导入要求。
代码
下载地址: https://github.com/cchouqiang/TiSplit
function TiSplit_help(){
echo "Auther : cchouqiang"
echo "Introduce : TiSplit A tool for splitting hundreds of GB CSV files into small files. If there are any bugs, please provide feedback in a timely manner, and the author will fix them in a timely manner! "
echo " "
echo "Usage: tisplit.sh [option] [parameter]"
echo "option: -i --input-file [input_csv_path] | | CSV file to be processed;"
echo " -o --operate-path [operate_dir_path] | | Need to process the path of the CSV file"
echo " -m --schema-meta [schema_meta] | | eg: -m schema_name.table_name;"
echo " -l --lines_per_file [lines_per_file] |(default: '') | split lines"
echo " -h --help | | help"
}
# Deal with content of input
if [ $# -le 0 ] || [ $1 = '?' ]; then
TiSplit_help
exit 1
fi
# Set TiSplit options using getopt lib
TEMP=`getopt -o i:o:m:l:h --long help,input-file:,operate-path:,schema-meta:,lines_per_file: -- "$@"`
# Note the quotes around `$TEMP': they are essential!
eval set -- "$TEMP"
while true ; do
case "$1" in
-i|--input-file) echo "Option i == ${2}" ;
Source_oper_file=${2}; shift 2;;
-o|--operate-path) echo "Option o == ${2}" ;
TiSplit_check_dir=${2}
TiSplit_oper_file=${2}/TiSplit_operating_csv_$perfix_hash_time;
TiSplit_oper_dir=${2}/${perfix_hash_time}_operating_dir; shift 2;;
-m|--schema-meta) echo "Option m == ${2}" ;
TiSplit_meta_table=${2}; shift 2;;
-l|--lines_per_file) echo "Option l == ${2}" ;
TiSplit_lines_per_file=${2}; shift 2;;
-h|--help) TiSplit_help; exit 1 ;;
--) shift ; break ;;
*) echo "Internal error!" ; exit 1 ;;
esac
done
# Print information on terminal
echo "---------------------------------------------------------------------------"
echo "------------ TiSplit starting ------------------------------------------"
echo "---------------------------------------------------------------------------"
# Split the file into many small files, which
cd ${TiSplit_check_dir}
split -l ${TiSplit_lines_per_file} ${Source_oper_file} -d -a 6 ${TiSplit_meta_table}.
# Change every files to obey the filename named rule of tidb-lightning
softfiles=$(ls ${TiSplit_check_dir}/${TiSplit_meta_table}*)
for sfile in ${softfiles}
do
mv ${sfile} ${sfile}.csv
done
echo "---------------------------------------------------------------------------"
echo "------------ TiSplit completed !!! ------------------------------"
echo "---------------------------------------------------------------------------"
使用方法
1、将TiSplit脚本放到服务器上。
vi tisplit.sh
2、准备待切分的csv文件,确保存放目录下能容纳2倍的csv空间。
du -sh ${TiSplit_check_dir}
3、估算每个文件切分的行数
tail -2000 data.csv > a.txt
du -sh a.txt
根据a.txt文件的大小,使用256M除以a.txt文件大小,得出的数值乘以2000,即切分的行数
例如:a.txt 4M,(256/4)*2000=128000,即split切分的行数。
4、执行tisplit.sh
参数如下:
待切分的csv文件:-i data.csv
存放csv的路径:-o /home/tidb/csvpoc
数据库及表名:-m ceshi.tab
切分行数: -l 10
[tidb@host1 ~]$ ./tisplit.sh -i data.csv -o /home/tidb/csvpoc -m ceshi.tab -l 10
Option i == data.csv
Option o == /home/tidb/csvpoc
Option m == ceshi.tab
Option l == 10
---------------------------------------------------------------------------
------------ TiSplit starting ------------------------------------------
---------------------------------------------------------------------------
---------------------------------------------------------------------------
------------ TiSplit completed !!! ------------------------------
---------------------------------------------------------------------------
[tidb@host1 ~]$ cd /home/tidb/csvpoc
[tidb@host1 csvpoc]$ ls
ceshi.tab.00000000.csv ceshi.tab.00000007.csv ceshi.tab.00000014.csv ceshi.tab.00000021.csv ceshi.tab.00000028.csv ceshi.tab.00000035.csv data.csv
ceshi.tab.00000001.csv ceshi.tab.00000008.csv ceshi.tab.00000015.csv ceshi.tab.00000022.csv ceshi.tab.00000029.csv ceshi.tab.00000036.csv
ceshi.tab.00000002.csv ceshi.tab.00000009.csv ceshi.tab.00000016.csv ceshi.tab.00000023.csv ceshi.tab.00000030.csv ceshi.tab.00000037.csv
ceshi.tab.00000003.csv ceshi.tab.00000010.csv ceshi.tab.00000017.csv ceshi.tab.00000024.csv ceshi.tab.00000031.csv ceshi.tab.00000038.csv
ceshi.tab.00000004.csv ceshi.tab.00000011.csv ceshi.tab.00000018.csv ceshi.tab.00000025.csv ceshi.tab.00000032.csv ceshi.tab.00000039.csv
ceshi.tab.00000005.csv ceshi.tab.00000012.csv ceshi.tab.00000019.csv ceshi.tab.00000026.csv ceshi.tab.00000033.csv ceshi.tab.00000040.csv
ceshi.tab.00000006.csv ceshi.tab.00000013.csv ceshi.tab.00000020.csv ceshi.tab.00000027.csv ceshi.tab.00000034.csv ceshi.tab.00000041.csv
[tidb@1 csvpoc]$