
First Steps: Setting Up a TiDB Cluster on Kubernetes

数据源的TiDB学习之路 · Published on 2024-10-22

TiDB Operator is an automatic operations system for TiDB clusters on Kubernetes. It provides full lifecycle management for TiDB, including deployment, upgrades, scaling, backup and restore, and configuration changes. With TiDB Operator, TiDB runs seamlessly on public-cloud or self-hosted Kubernetes clusters.

TiDB Operator supports several ways to deploy TiDB clusters on Kubernetes. This article walks through creating a simple Kubernetes cluster in a Linux test environment, deploying TiDB Operator, and then using TiDB Operator to deploy a TiDB cluster.

1 Create a Kubernetes Cluster

This article uses kubeadm to deploy a local test Kubernetes cluster. Before creating the cluster with kubeadm, we need to install containerd (the container runtime), kubelet (the node agent that runs the core Kubernetes services as containers), kubeadm (the Kubernetes deployment tool), and kubectl (the Kubernetes command-line client). The kernel parameter net.ipv4.ip_forward must also be set to 1.

  • Install containerd

Since version 1.24, Kubernetes by default requires a CRI (Container Runtime Interface) compatible container runtime, such as containerd or CRI-O, to run the cluster. In this environment the Kubernetes version is 1.31 and containerd is used as the container runtime. The installation steps follow https://github.com/containerd/containerd/blob/main/docs/getting-started.md

First download the release package, choosing the build that matches your architecture (x86 or ARM), from https://github.com/containerd/containerd/releases

## Download and install containerd
tar -xzvf containerd-1.7.23-linux-arm64.tar.gz
mv bin/* /usr/local/bin/

## Set up systemd management
mkdir /usr/local/lib/systemd/system -p
cd /usr/local/lib/systemd/system

## Edit containerd.service; for the content see https://raw.githubusercontent.com/containerd/containerd/main/containerd.service
vi containerd.service
systemctl daemon-reload
systemctl enable --now containerd
## Start the containerd service
systemctl restart containerd

## Download runc from https://github.com/opencontainers/runc/releases and install it
install -m 755 runc.arm64 /usr/local/sbin/runc
  • Set net.ipv4.ip_forward

To change it temporarily, run sysctl -w net.ipv4.ip_forward=1. To make the change permanent, add net.ipv4.ip_forward=1 to /etc/sysctl.conf and run sysctl -p to apply it.
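
For reference, a short sketch combining both options (the append assumes net.ipv4.ip_forward is not already present in /etc/sysctl.conf):

## Enable IP forwarding immediately (lost on reboot)
sysctl -w net.ipv4.ip_forward=1

## Make it persistent, reload, and verify
echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf
sysctl -p
cat /proc/sys/net/ipv4/ip_forward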

  • Install kubelet, kubeadm, and kubectl

Add the kubernetes.repo yum repository. If downloads are slow, you can switch to a domestic (China) mirror instead.

# This overwrites any existing configuration in /etc/yum.repos.d/kubernetes.repo
cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/v1.31/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/v1.31/rpm/repodata/repomd.xml.key
exclude=kubelet kubeadm kubectl cri-tools kubernetes-cni
EOF

Then install the packages with yum install, start the kubelet service, and enable it at boot:

yum install -y kubelet kubeadm kubectl --disableexcludes=kubernetes
systemctl start kubelet
systemctl enable kubelet
  • Create the Kubernetes cluster

Creating the Kubernetes cluster involves two main steps: first, initialize the control plane with kubeadm init; second, add worker nodes with kubeadm join.

First pick one node as the control-plane node and initialize the cluster there with kubeadm init. Initialization requires a few mandatory parameters, which can be supplied in either of two ways: edit an init.yaml file and pass it with --config init.yaml, or pass the flags directly on the command line.

  1. Using --config init.yaml
Generate the default kubeadm init parameters into init.yaml:
kubeadm config print init-defaults > init.yaml
Edit init.yaml by hand and set parameters such as apiserver-advertise-address:
vi init.yaml
kubeadm init --config init.yaml
  2. Passing the flags directly as name=value on the command line
kubeadm init \
--apiserver-advertise-address=xx.xx.xx.151 \
--image-repository registry.aliyuncs.com/google_containers \
--service-cidr=10.96.0.0/12 \
--pod-network-cidr=10.244.0.0/16

The flag --image-repository registry.aliyuncs.com/google_containers tells kubeadm to pull images from Alibaba Cloud's mirror. The default registry that Kubernetes points to, registry.k8s.io, may be unreachable, so it needs to be replaced with a domestic mirror. The default image list can be checked with kubeadm config images list:

## List the default image locations
kubeadm config images list
==============================================================
registry.k8s.io/kube-apiserver:v1.31.1
registry.k8s.io/kube-controller-manager:v1.31.1
registry.k8s.io/kube-scheduler:v1.31.1
registry.k8s.io/kube-proxy:v1.31.1
registry.k8s.io/coredns/coredns:v1.11.3
registry.k8s.io/pause:3.10
registry.k8s.io/etcd:3.5.15-0

## List the image locations when --image-repository is specified
kubeadm config images list --image-repository registry.aliyuncs.com/google_containers
==============================================================
registry.aliyuncs.com/google_containers/kube-apiserver:v1.31.1
registry.aliyuncs.com/google_containers/kube-controller-manager:v1.31.1
registry.aliyuncs.com/google_containers/kube-scheduler:v1.31.1
registry.aliyuncs.com/google_containers/kube-proxy:v1.31.1
registry.aliyuncs.com/google_containers/coredns:v1.11.3
registry.aliyuncs.com/google_containers/pause:3.10
registry.aliyuncs.com/google_containers/etcd:3.5.15-0
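
If desired, the required images can also be pre-pulled before running kubeadm init; the preflight output later in this article points out the same option:

## Pre-pull the control-plane images from the domestic mirror (optional)
kubeadm config images pull --image-repository registry.aliyuncs.com/google_containers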

The step above may still fail to pull images due to network problems, typically because containerd has no proxy configured. See the issue list entry "kubeadm init fails with failed to pull image" for how to configure a proxy for containerd.

In addition, containerd's own configuration still points to registry.k8s.io as the default image registry; it must be changed to the Alibaba Cloud mirror above. See the issue list entry "kubeadm init fails with context deadline exceeded" for details.

Now rerun kubeadm init and the Kubernetes cluster is created normally; the message "Your Kubernetes control-plane has initialized successfully!" shows that the control plane has been initialized.

## kubeadm reset wipes the previously initialized control plane
kubeadm reset

kubeadm init --apiserver-advertise-address=xx.xx.x.151 --image-repository registry.aliyuncs.com/google_containers --service-cidr=10.96.0.0/12 --pod-network-cidr=10.244.0.0/16
========================================
[init] Using Kubernetes version: v1.31.1
...
Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join xx.xx.x.151:6443 --token hml2xs.agic16co7u1e8lki \
        --discovery-token-ca-cert-hash sha256:570bd607f60eac2d4bde3416dc84ebf9736fd25f20874c293ed372dde2f82f61

To continue using the cluster, follow the instructions in the output above for your current user; if you are root, simply run export KUBECONFIG=/etc/kubernetes/admin.conf.

kubectl -n kube-system get configmap
====================================
NAME                                                   DATA   AGE
coredns                                                1      7m5s
extension-apiserver-authentication                     6      7m8s
kube-apiserver-legacy-service-account-token-tracking   1      7m8s
kube-proxy                                             2      7m5s
kube-root-ca.crt                                       1      7m1s
kubeadm-config                                         1      7m6s
kubelet-config                                         1      7m6s

Although the control plane has been created successfully, the cluster does not yet have any worker nodes, and the container network has not been configured. Next we add worker nodes to the cluster.

First, install containerd, kubeadm, and kubelet on the new node in the same way as above, and start the containerd and kubelet services. Then join the node to the cluster with kubeadm join, copying the full command from the kubeadm init output above:

kubeadm join xx.xx.x.151:6443 --token w6cvcg.or8m1644t6vxlzub \
        --discovery-token-ca-cert-hash sha256:92793ee4cfd14610de745bc1a604557d54fd69fb2cd1dccc3cc6d24be74ff8cb

Note that the token and discovery-token-ca-cert-hash must be the control-plane node's current values; otherwise kubeadm join may fail. See the issue list entry "kubeadm join fails with couldn't validate the identity of the API Server".

[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-check] Waiting for a healthy kubelet at http://127.0.0.1:10248/healthz. This can take up to 4m0s
[kubelet-check] The kubelet is healthy after 501.642626ms
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.

The output above confirms that the node has been added to the Kubernetes cluster; repeating the same steps on more machines adds more worker nodes. In this example two worker nodes are added, and kubectl get nodes shows:

kubectl get nodes
=================
NAME               STATUS     ROLES           AGE     VERSION
host-xx-xx-x-151   NotReady   control-plane   70m     v1.31.1
host-xx-xx-x-152   NotReady   <none>          5m43s   v1.31.1
host-xx-xx-x-153   NotReady   <none>          13s     v1.31.1

The kubectl get nodes output shows every node in NotReady status because no CNI network plugin is installed yet. Install the CNI plugin with a single kubectl apply:

## Install the CNI plugin
kubectl apply -f "https://docs.projectcalico.org/manifests/calico.yaml"
=======================
...
clusterrolebinding.rbac.authorization.k8s.io/calico-kube-controllers created
clusterrolebinding.rbac.authorization.k8s.io/calico-node created
daemonset.apps/calico-node created
deployment.apps/calico-kube-controllers created

Note that calico.yaml pulls images such as docker.io/calico/node:v3.25.0 by default, and pulling from docker.io (ctr images pull docker.io/xxx) may fail due to network restrictions. In that case, replace docker.io with a domestic mirror such as dockerproxy.cn; see the issue list entry "kubectl describe node shows cni plugin not initialized".

Once the CNI network plugin is installed, kubectl get nodes shows all nodes in Ready status.

## Allow workloads to be scheduled on the control-plane node (remove the control-plane taint)
kubectl taint nodes host-xx-xx-xx-151 node-role.kubernetes.io/control-plane-
===============================
node/host-xx-xx-xx-151 untainted

kubectl get nodes
=================
NAME               STATUS   ROLES           AGE   VERSION
host-xx-xx-xx-151   Ready    control-plane   29h   v1.31.1
host-xx-xx-xx-152   Ready    <none>          28h   v1.31.1
host-xx-xx-xx-153   Ready    <none>          28h   v1.31.1

2 Deploy TiDB Operator

With the Kubernetes cluster in place, the next step is to deploy TiDB Operator, which involves two steps:

  1. Install the TiDB Operator CRDs

TiDB Operator defines a number of custom resource definitions (CRDs) that implement the different components of a TiDB cluster. First download the TiDB Operator CRD file, then install it with kubectl create -f crd.yaml.

## Install the TiDB CRDs
curl -o crd.yaml https://raw.githubusercontent.com/pingcap/tidb-operator/v1.6.0/manifests/crd.yaml
kubectl create -f crd.yaml

## Verify that the TiDB CRDs were created
kubectl get crd | grep tidb
==========================
tidbclusterautoscalers.pingcap.com                    2024-10-21T06:23:55Z
tidbclusters.pingcap.com                              2024-10-21T06:23:55Z
tidbdashboards.pingcap.com                            2024-10-21T06:23:55Z
tidbinitializers.pingcap.com                          2024-10-21T06:23:55Z
tidbmonitors.pingcap.com                              2024-10-21T06:23:56Z
tidbngmonitorings.pingcap.com                         2024-10-21T06:23:56Z
  2. Install TiDB Operator

This article installs TiDB Operator with Helm. Following the Helm documentation at https://helm.sh/docs/intro/install/, install Helm via the installer script:

## Download the Helm install script
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh

## Install Helm
sh get_helm.sh
==============
Downloading https://get.helm.sh/helm-v3.16.2-linux-arm64.tar.gz
Verifying checksum... Done.
Preparing to install helm into /usr/local/bin
helm installed into /usr/local/bin/helm

Next, add the PingCAP chart repository with helm repo add pingcap https://charts.pingcap.org/:

helm repo add pingcap https://charts.pingcap.org/
=================================================
"pingcap" has been added to your repositorie

Create a namespace for TiDB Operator with kubectl create namespace tidb-admin:

## Create the tidb-admin namespace
kubectl create namespace tidb-admin
===================================
namespace/tidb-admin created

## List namespaces
kubectl get namespace
=====================
NAME              STATUS   AGE
default           Active   23h
kube-flannel      Active   19h
kube-node-lease   Active   23h
kube-public       Active   23h
kube-system       Active   23h
tidb-admin        Active   2m52s
tigera-operator   Active   19h

Install TiDB Operator with helm install:

helm install --namespace tidb-admin tidb-operator pingcap/tidb-operator --version v1.6.0 \
     --set operatorImage=registry.cn-beijing.aliyuncs.com/tidb/tidb-operator:v1.6.0 \
     --set tidbBackupManagerImage=registry.cn-beijing.aliyuncs.com/tidb/tidb-backup-manager:v1.6.0 \
     --set scheduler.kubeSchedulerImageName=registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler
=======================================
NAME: tidb-operator
LAST DEPLOYED: Mon Oct 21 14:42:59 2024
NAMESPACE: tidb-admin
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Make sure tidb-operator components are running:

    kubectl get pods --namespace tidb-admin -l app.kubernetes.io/instance=tidb-operator

Check whether TiDB Operator is running using the command suggested in the output above; the following output shows that TiDB Operator has been installed and is running normally.

kubectl get pods --namespace tidb-admin -l app.kubernetes.io/instance=tidb-operator
=========================================================================
NAME                                      READY   STATUS    RESTARTS   AGE
tidb-controller-manager-6cb84c5b5-r98m5   1/1     Running   0          97s

3 Deploy the TiDB Cluster and Monitoring

First create a tidb-cluster namespace with kubectl create namespace tidb-cluster, then deploy the TiDB cluster with kubectl -n tidb-cluster apply -f tidb-cluster.yaml.

## Create the tidb-cluster namespace
kubectl create namespace tidb-cluster
curl -o tidb-cluster.yaml https://raw.githubusercontent.com/pingcap/tidb-operator/v1.6.0/examples/advanced/tidb-cluster.yaml

## Create the TiDB cluster (the delete removes any previous attempt)
kubectl delete tc advanced-tidb -n tidb-cluster
kubectl -n tidb-cluster apply -f tidb-cluster.yaml
==================================================
tidbcluster.pingcap.com/advanced-tidb created

Note that tidb-cluster.yaml must configure at least the storageClassName parameters, because the pd, tidb, and tikv components all need persistent storage; otherwise you will hit the situation described in the issue list entry "kubectl get pods -n tidb-cluster shows basic-pd-0 stuck in Pending". The storageClassName settings depend on PersistentVolumes (PV), which must be created in advance.
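
For reference, a hedged sketch of the relevant fields; the storage class names below are the ones that appear in the PV listing later in this article, the sizes are illustrative, and the exact field layout should be checked against the downloaded advanced example:

## Excerpt of tidb-cluster.yaml (assumed layout, following the advanced example)
spec:
  pd:
    storageClassName: pd-storage
    requests:
      storage: "10Gi"      # illustrative size
  tikv:
    storageClassName: tikv-storage
    requests:
      storage: "100Gi"     # illustrative size
  tidb:
    storageClassName: tidb-storage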

Once the installation finishes, kubectl get pods -n tidb-cluster shows the running TiDB component pods.

kubectl get pods -n tidb-cluster
================================
NAME                                      READY   STATUS    RESTARTS   AGE
advanced-tidb-discovery-b8ddc49c5-pm2l6   1/1     Running   0          8m9s
advanced-tidb-pd-0                        1/1     Running   0          8m9s
advanced-tidb-pd-1                        1/1     Running   0          8m9s
advanced-tidb-pd-2                        1/1     Running   0          8m9s
advanced-tidb-tidb-0                      2/2     Running   0          2m38s
advanced-tidb-tidb-1                      2/2     Running   0          3m12s
advanced-tidb-tidb-2                      2/2     Running   0          4m54s
advanced-tidb-tikv-0                      1/1     Running   0          8m2s
advanced-tidb-tikv-1                      1/1     Running   0          8m2s
advanced-tidb-tikv-2                      1/1     Running   0          8m2s

4 Initialize the TiDB Cluster

After the cluster is deployed, some initialization is usually needed, such as setting the root password, creating users, and restricting which hosts may connect.

  • Initialize the root password and create a new user

The following command sets the root password by storing it in a Secret named tidb-secret:

kubectl create secret generic tidb-secret --from-literal=root=root123 --namespace=tidb-cluster

The following variant sets the root password and at the same time creates an additional ordinary user, developer, with its own password; the developer user is created with only the USAGE privilege by default.

kubectl create secret generic tidb-secret --from-literal=root=root123 --from-literal=developer=developer123 --namespace=tidb-cluster

For any other initialization steps, edit tidb-initializer.yaml by hand and run the command below to apply it. This article assumes only the root password is initialized; a sketch of the file is shown below.
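
For reference, a minimal sketch of what tidb-initializer.yaml might contain for this setup; the name tidb-init matches the object created in the output below, and the cluster name and Secret come from the earlier steps, but the exact fields should be verified against the official TiDB Operator example:

## Minimal tidb-initializer.yaml sketch (verify field names against the official template)
apiVersion: pingcap.com/v1alpha1
kind: TidbInitializer
metadata:
  name: tidb-init
  namespace: tidb-cluster
spec:
  image: tnir/mysqlclient            # on ARM, use kanshiori/mysqlclient-arm64 (see the issue list)
  cluster:
    namespace: tidb-cluster
    name: advanced-tidb
  passwordSecret: tidb-secret        # the Secret created above
  # initSql: "GRANT SELECT ON test.* TO 'developer'@'%';"   # hypothetical extra initialization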

kubectl apply -f tidb-initializer.yaml -n tidb-cluster
=====================================================
tidbinitializer.pingcap.com/tidb-init created

Note that initialization pulls an image according to image: tnir/mysqlclient, and this pull may fail; see the issue list for the fix. Also, on an ARM environment the image must be changed to image: kanshiori/mysqlclient-arm64; see the issue list entry "The TiDB initializer pod is in Init:Error with standard_init_linux.go:219".

5 Connect to the TiDB Cluster

Once the TiDB cluster is up, kubectl get all -n tidb-cluster shows its resources, including the service that exposes the database:

kubectl get all -n tidb-cluster
==============================
NAME                                          READY   STATUS    RESTARTS   AGE
pod/advanced-tidb-discovery-b8ddc49c5-pm2l6   1/1     Running   0          48m
pod/advanced-tidb-pd-0                        1/1     Running   0          48m
pod/advanced-tidb-pd-1                        1/1     Running   0          48m
pod/advanced-tidb-pd-2                        1/1     Running   0          48m
pod/advanced-tidb-tidb-0                      2/2     Running   0          43m
pod/advanced-tidb-tidb-1                      2/2     Running   0          43m
pod/advanced-tidb-tidb-2                      2/2     Running   0          45m
pod/advanced-tidb-tikv-0                      1/1     Running   0          48m
pod/advanced-tidb-tikv-1                      1/1     Running   0          48m
pod/advanced-tidb-tikv-2                      1/1     Running   0          48m

NAME                              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                          AGE
service/advanced-tidb-discovery   ClusterIP   10.96.132.251    <none>        10261/TCP,10262/TCP              48m
service/advanced-tidb-pd          ClusterIP   10.101.219.172   <none>        2379/TCP                         48m
service/advanced-tidb-pd-peer     ClusterIP   None             <none>        2380/TCP,2379/TCP                48m
service/advanced-tidb-tidb        NodePort    10.111.104.136   <none>        4000:31263/TCP,10080:32410/TCP   48m
service/advanced-tidb-tidb-peer   ClusterIP   None             <none>        10080/TCP                        48m
service/advanced-tidb-tikv-peer   ClusterIP   None             <none>        20160/TCP                        48m

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/advanced-tidb-discovery   1/1     1            1           48m

NAME                                                DESIRED   CURRENT   READY   AGE
replicaset.apps/advanced-tidb-discovery-b8ddc49c5   1         1         1       48m

NAME                                  READY   AGE
statefulset.apps/advanced-tidb-pd     3/3     48m
statefulset.apps/advanced-tidb-tidb   3/3     48m
statefulset.apps/advanced-tidb-tikv   3/3     48m

The output shows that service/advanced-tidb-tidb is a NodePort service with ClusterIP 10.111.104.136 and service port 4000 (mapped to NodePort 31263 on every node). From inside the cluster network we can connect to TiDB through this IP and port:

mysql -h10.111.104.136 -P4000 -uroot -c
ERROR 1045 (28000): Access denied for user 'root'@'10.244.115.128' (using password: NO)

mysql -h10.111.104.136 -P4000 -uroot -c -proot123
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3938453128
Server version: 8.0.11-TiDB-v8.1.0 TiDB Server (Apache License 2.0) Community Edition, MySQL 8.0 compatible

Copyright (c) 2000, 2023, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

In the output above, the first login as root without a password fails, which confirms that the root password set during initialization has taken effect; the second login with the password connects to the cluster normally.
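
Since the ClusterIP above is only routable from inside the cluster network, clients outside the cluster can instead go through the NodePort mapping (4000:31263 in the service listing) on any node's IP. A sketch, with the node address as a placeholder:

## Connect from outside the cluster via the NodePort (node IP is a placeholder)
mysql -h xx.xx.x.151 -P 31263 -u root -p -c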

6 Troubleshooting

kubeadm init fails with failed to pull image

kubeadm init --config=init.default.yaml
...
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action beforehand using 'kubeadm config images pull'
W1019 09:45:10.324749  794291 checks.go:846] detected that the sandbox image "registry.k8s.io/pause:3.8" of the container runtime is inconsistent with that used by kubeadm.It is recommended to use "registry.aliyuncs.com/google_containers/pause:3.10" as the CRI sandbox image.
error execution phase preflight: [preflight] Some fatal errors occurred
        [ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/kube-apiserver:v1.31.0: failed to pull image registry.aliyuncs.com/google_containers/kube-apiserver:v1.31.0: rpc error: code = DeadlineExceeded desc = failed to pull and unpack image "registry.aliyuncs.com/google_containers/kube-apiserver:v1.31.0": failed to resolve reference "registry.aliyuncs.com/google_containers/kube-apiserver:v1.31.0": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/kube-apiserver/manifests/v1.31.0": dial tcp 120.55.105.209:443: i/o timeout
        [ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/kube-controller-manager:v1.31.0: failed to pull image registry.aliyuncs.com/google_containers/kube-controller-manager:v1.31.0: rpc error: code = DeadlineExceeded desc = failed to pull and unpack image "registry.aliyuncs.com/google_containers/kube-controller-manager:v1.31.0": failed to resolve reference "registry.aliyuncs.com/google_containers/kube-controller-manager:v1.31.0": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/kube-controller-manager/manifests/v1.31.0": dial tcp 120.55.105.209:443: i/o timeout
        [ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/kube-scheduler:v1.31.0: failed to pull image registry.aliyuncs.com/google_containers/kube-scheduler:v1.31.0: rpc error: code = DeadlineExceeded desc = failed to pull and unpack image "registry.aliyuncs.com/google_containers/kube-scheduler:v1.31.0": failed to resolve reference "registry.aliyuncs.com/google_containers/kube-scheduler:v1.31.0": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/kube-scheduler/manifests/v1.31.0": dial tcp 120.55.105.209:443: i/o timeout
        [ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/kube-proxy:v1.31.0: failed to pull image registry.aliyuncs.com/google_containers/kube-proxy:v1.31.0: failed to pull and unpack image "registry.aliyuncs.com/google_containers/kube-proxy:v1.31.0": failed to resolve reference "registry.aliyuncs.com/google_containers/kube-proxy:v1.31.0": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/kube-proxy/manifests/v1.31.0": dial tcp 120.55.105.209:443: i/o timeout
        [ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/coredns:v1.11.3: failed to pull image registry.aliyuncs.com/google_containers/coredns:v1.11.3: failed to pull and unpack image "registry.aliyuncs.com/google_containers/coredns:v1.11.3": failed to resolve reference "registry.aliyuncs.com/google_containers/coredns:v1.11.3": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/coredns/manifests/v1.11.3": dial tcp 120.55.105.209:443: i/o timeout
        [ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/pause:3.10: failed to pull image registry.aliyuncs.com/google_containers/pause:3.10: failed to pull and unpack image "registry.aliyuncs.com/google_containers/pause:3.10": failed to resolve reference "registry.aliyuncs.com/google_containers/pause:3.10": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/pause/manifests/3.10": dial tcp 120.55.105.209:443: i/o timeout
        [ERROR ImagePull]: failed to pull image registry.aliyuncs.com/google_containers/etcd:3.5.15-0: failed to pull image registry.aliyuncs.com/google_containers/etcd:3.5.15-0: rpc error: code = DeadlineExceeded desc = failed to pull and unpack image "registry.aliyuncs.com/google_containers/etcd:3.5.15-0": failed to resolve reference "registry.aliyuncs.com/google_containers/etcd:3.5.15-0": failed to do request: Head "https://registry.aliyuncs.com/v2/google_containers/etcd/manifests/3.5.15-0": dial tcp 120.55.105.209:443: i/o timeout
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
  • Solution:

In a Kubernetes environment that uses containerd as the container runtime, network restrictions make it necessary to configure HTTPS_PROXY and NO_PROXY for containerd so that images can be pulled. Reference: https://blog.csdn.net/Beer_Do/article/details/113253618

## Configure an HTTP/HTTPS proxy for containerd
mkdir /etc/systemd/system/containerd.service.d
cat > /etc/systemd/system/containerd.service.d/http_proxy.conf << EOF
[Service]
Environment="HTTP_PROXY=xx.xx.x.x:3128"
Environment="HTTPS_PROXY=xx.xx.x.x:3128"
Environment="no_proxy=127.0.0.1,localhost,xx.xx.xx.151,10.96.0.0/12"
EOF

## Reload the configuration and restart the containerd service
systemctl daemon-reload
systemctl restart containerd

kubeadm init fails with context deadline exceeded

kubeadm init --config=init.default.yaml
...
[kubelet-check] Waiting for a healthy kubelet at http://127.0.0.1:10248/healthz. This can take up to 4m0s
[kubelet-check] The kubelet is healthy after 1.00191638s
[api-check] Waiting for a healthy API server. This can take up to 4m0s
[api-check] The API server is not healthy after 4m0.000418939s

Unfortunately, an error has occurred:
        context deadline exceeded

This error is likely caused by:
        - The kubelet is not running
        - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
        - 'systemctl status kubelet'
        - 'journalctl -xeu kubelet'
...
  • Solution: As described at https://blog.csdn.net/weixin_43205308/article/details/140554729, containerd's default image registry is registry.k8s.io; it must be changed to the same domestic mirror as above, registry.aliyuncs.com/google_containers. Generate /etc/containerd/config.toml, modify the sandbox_image address, and restart the containerd service for the change to take effect.
## Check the current containerd sandbox image address
containerd config dump | grep sandbox_image

## Generate the default containerd configuration file
mkdir /etc/containerd
containerd config default > /etc/containerd/config.toml

## Edit /etc/containerd/config.toml and point sandbox_image at the domestic mirror, e.g.
## sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.8" (the kubeadm warning above suggests pause:3.10)
vi /etc/containerd/config.toml

## Reload the configuration and restart the containerd service
systemctl daemon-reload
systemctl restart containerd

kubeadm init fails with [ERROR FileContent--proc-sys-net-ipv4-ip_forward]

error execution phase preflight: [preflight] Some fatal errors occurred:
        [ERROR FileContent--proc-sys-net-ipv4-ip_forward]: /proc/sys/net/ipv4/ip_forward contents are not set to 1
  • Solution: This error means the sysctl -w net.ipv4.ip_forward=1 step described earlier was not performed; apply it as described above.

kubeadm init reports [WARNING FileExisting-socat]

[WARNING FileExisting-socat]: socat not found in system path
  • Solution: socat is missing from the system; install it with yum install -y socat.

kubeadm init reports [WARNING Hostname]

 [WARNING Hostname]: hostname "host-xx-xx-x-151" could not be reached
  • Solution: Add the hostname-to-IP mapping in /etc/hosts.
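
For example, entries of the following form on each node (the addresses and hostnames are placeholders matching the masked values used in this article):

## /etc/hosts (placeholder values)
xx.xx.x.151   host-xx-xx-x-151
xx.xx.x.152   host-xx-xx-x-152
xx.xx.x.153   host-xx-xx-x-153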

kubeadm join fails with couldn't validate the identity of the API Server

[preflight] Running pre-flight checks
error execution phase preflight: couldn't validate the identity of the API Server: failed to request the cluster-info ConfigMap: client rate limiter Wait returned an error: context deadline exceeded
To see the stack trace of this error execute with --v=5 or higher
  • Solution: This happens when the token and discovery-token-ca-cert-hash are not the current values. Look up the current token and hash with the commands below and substitute them into the kubeadm join command.
## Show the current tokens
kubeadm token list
## Compute the current discovery-token-ca-cert-hash
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'
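
Alternatively, a fresh token together with a complete join command can be printed in one step on the control-plane node (useful because bootstrap tokens expire after 24 hours by default):

## Generate a new token and print the full kubeadm join command
kubeadm token create --print-join-command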

kubectl describe node shows cni plugin not initialized

kubectl describe node host-xx-xx-x-151
======================================
...
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Sun, 20 Oct 2024 17:30:26 +0800   Sun, 20 Oct 2024 14:39:52 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Sun, 20 Oct 2024 17:30:26 +0800   Sun, 20 Oct 2024 14:39:52 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Sun, 20 Oct 2024 17:30:26 +0800   Sun, 20 Oct 2024 14:39:52 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Sun, 20 Oct 2024 17:30:26 +0800   Sun, 20 Oct 2024 14:39:52 +0800   KubeletNotReady              container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
  • Solution: The root cause is that ctr images pull docker.io/calico/cni:v3.25.0 hits a network problem; changing it to ctr images pull dockerproxy.cn/calico/cni:v3.25.0 pulls the image successfully. So after downloading calico.yaml, replace docker.io with dockerproxy.cn throughout the file.
ctr images pull docker.io/calico/cni:v3.25.0
============================================
docker.io/calico/cni:v3.25.0: resolving      |--------------------------------------|
elapsed: 0.1 s                total:   0.0 B (0.0 B/s)
INFO[0000] trying next host                              error="failed to do request: Head \"https://registry-1.docker.io/v2/calico/cni/manifests/v3.25.0\": EOF" host=registry-1.docker.io
ctr: failed to resolve reference "docker.io/calico/cni:v3.25.0": failed to do request: Head "https://registry-1.docker.io/v2/calico/cni/manifests/v3.25.0": EOF

ctr images pull dockerproxy.cn/calico/cni:v3.25.0
============================================
dockerproxy.cn/calico/cni:v3.25.0:                                                resolved       |++++++++++++++++++++++++++++++++++++++|
index-sha256:a38d53cb8688944eafede2f0eadc478b1b403cefeff7953da57fe9cd2d65e977:    exists         |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:0ec3ba054c2d5c3d70e0632e56724c747af9e91bb9760d6c14f456179d96105d: done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1:    done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:bc84ed7b6a651f36d1486db36f1c2c1181b6c14463ea310823e6c2f69d0af100:    done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:4e9bf0451aa2da91f8516b127064c193b3d0f2183a14dc36627aa9a9d096c715:    done           |++++++++++++++++++++++++++++++++++++++|
config-sha256:0bb8d6f033a0548573ff857c26574d89a8ad4b691aa88a32eddf0c7db06599ef:   done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:ae5822c70daca619af58b197f6a3ea6f7cac1b785f6fbea673fb37be4853f6d5:    done           |++++++++++++++++++++++++++++++++++++++|
elapsed: 3.7 s                                                                    total:   0.0 B (0.0 B/s)
unpacking linux/arm64/v8 sha256:a38d53cb8688944eafede2f0eadc478b1b403cefeff7953da57fe9cd2d65e977...
done: 14.720114ms
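
Accordingly, the manifest itself can be rewritten before applying it; a sketch of that edit (the sed pattern assumes every image in calico.yaml is referenced with the docker.io prefix, as described above):

## Download calico.yaml and switch docker.io to the dockerproxy.cn mirror before applying
curl -o calico.yaml https://docs.projectcalico.org/manifests/calico.yaml
sed -i 's#docker.io/#dockerproxy.cn/#g' calico.yaml
kubectl apply -f calico.yaml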

kubectl get pods -n tidb-cluster shows basic-pd-0 stuck in Pending

kubectl get pods -n tidb-cluster
NAME                               READY   STATUS    RESTARTS   AGE
basic-discovery-85c8d6cd7f-wck48   1/1     Running   0          2m3s
basic-pd-0                         0/1     Pending   0          2m3s
  • Solution: kubectl describe pod shows that the pod has unbound PersistentVolumeClaims (PVCs):
kubectl describe pod basic-pd-0 -n tidb-cluster
==============================================
...
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  4m29s (x4 over 19m)  default-scheduler  0/3 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.

Configure local storage (local PVs) as described at https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/configure-storage-class#%E6%9C%AC%E5%9C%B0-pv-%E9%85%8D%E7%BD%AE

Note that image: "quay.io/external_storage/local-volume-provisioner:v2.3.4" in local-volume-provisioner.yaml should be changed to image: "quay.io/external_storage/local-volume-provisioner:v2.5.0"; v2.3.4 is quite old and may fail with no match for platform in manifest: not found.
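
Once the file has been downloaded (see the commands below), that edit can be made with a one-line sed instead of editing it by hand:

## Bump the local-volume-provisioner image tag from v2.3.4 to v2.5.0
sed -i 's#local-volume-provisioner:v2.3.4#local-volume-provisioner:v2.5.0#g' local-volume-provisioner.yaml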

## Create the directories on every node and bind-mount them
mkdir /data1/pdk8s/pd -p
mkdir /data1/tikvk8s/tikv -p
mkdir /data1/tidbk8s/tidb -p
mount --bind /data1/pdk8s/pd /data1/pdk8s/pd
mount --bind /data1/tikvk8s/tikv /data1/tikvk8s/tikv
mount --bind /data1/tidbk8s/tidb /data1/tidbk8s/tidb

## Configure the local PVs (the delete removes any previous attempt)
curl -o local-volume-provisioner.yaml https://raw.githubusercontent.com/pingcap/tidb-operator/v1.6.0/examples/local-pv/local-volume-provisioner.yaml
vi local-volume-provisioner.yaml
kubectl delete -f local-volume-provisioner.yaml
kubectl apply -f local-volume-provisioner.yaml

kubectl get po -n kube-system -l app=local-volume-provisioner
================================================================
NAME                             READY   STATUS    RESTARTS   AGE
local-volume-provisioner-8f7ms   1/1     Running   0          135m
local-volume-provisioner-xw7h7   1/1     Running   0          136m
local-volume-provisioner-zj27b   1/1     Running   0          8m58s

kubectl get pv
=============
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
local-pv-1d87b555   2459Gi     RWO            Delete           Available           tidb-storage   <unset>                          2m12s
local-pv-1fe2627e   2459Gi     RWO            Delete           Available           pd-storage     <unset>                          2m12s
local-pv-4aba16db   2459Gi     RWO            Delete           Available           pd-storage     <unset>                          2m12s
local-pv-4e85cc9d   2459Gi     RWO            Delete           Available           tikv-storage   <unset>                          2m12s
local-pv-547cf652   2459Gi     RWO            Delete           Available           tikv-storage   <unset>                          2m12s
local-pv-6870ef87   2459Gi     RWO            Delete           Available           tidb-storage   <unset>                          2m12s
local-pv-89a42df0   2459Gi     RWO            Delete           Available           tikv-storage   <unset>                          2m12s
local-pv-a898b600   2459Gi     RWO            Delete           Available           pd-storage     <unset>                          2m12s
local-pv-b039092a   2459Gi     RWO            Delete           Available           tidb-storage   <unset>                          2m12s

pd pods stuck in ImagePullBackOff after installing the TiDB cluster

kubectl describe pod advanced-tidb-pd-2 -n tidb-cluster
======================================================
...
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  19m                   default-scheduler  Successfully assigned tidb-cluster/advanced-tidb-pd-2 to host-xx-xx-xx-151
  Normal   Pulling    18m (x4 over 19m)     kubelet            Pulling image "pingcap/pd:v8.1.0"
  Warning  Failed     18m (x4 over 19m)     kubelet            Failed to pull image "pingcap/pd:v8.1.0": failed to pull and unpack image "docker.io/pingcap/pd:v8.1.0": failed to resolve reference "docker.io/pingcap/pd:v8.1.0": failed to do request: Head "https://registry-1.docker.io/v2/pingcap/pd/manifests/v8.1.0": EOF
  Warning  Failed     18m (x4 over 19m)     kubelet            Error: ErrImagePull
  Warning  Failed     17m (x6 over 19m)     kubelet            Error: ImagePullBackOff
  Normal   BackOff    4m23s (x66 over 19m)  kubelet            Back-off pulling image "pingcap/pd:v8.1.0"
  • Solution: The default registry docker.io is unreachable, so prefix the image-related entries in tidb-cluster.yaml with a domestic mirror. The places that need changing are shown below; note that the image: alpine:3.16.0 entry affects the image pull for the TiDB Server pods.
grep dockerproxy tidb-cluster.yaml
==================================
    image: dockerproxy.cn/alpine:3.16.0
    baseImage: dockerproxy.cn/pingcap/pd
    baseImage: dockerproxy.cn/pingcap/tidb
    baseImage: dockerproxy.cn/pingcap/tikv

The TiDB initializer pod is in Init:Error with standard_init_linux.go:219

kubectl get pods -n tidb-cluster
================================
NAME                                      READY   STATUS       RESTARTS   AGE
advanced-tidb-discovery-b8ddc49c5-pm2l6   1/1     Running      0          70m
advanced-tidb-pd-0                        1/1     Running      0          70m
advanced-tidb-pd-1                        1/1     Running      0          70m
advanced-tidb-pd-2                        1/1     Running      0          70m
advanced-tidb-tidb-0                      2/2     Running      0          65m
advanced-tidb-tidb-1                      2/2     Running      0          65m
advanced-tidb-tidb-2                      2/2     Running      0          67m
advanced-tidb-tidb-initializer-k5hjb      0/1     Init:Error   0          70s
advanced-tidb-tikv-0                      1/1     Running      0          70m
advanced-tidb-tikv-1                      1/1     Running      0          70m
advanced-tidb-tikv-2                      1/1     Running      0          70m

kubectl logs advanced-tidb-tidb-initializer-k5hjb -n tidb-cluster
=================================================================
Defaulted container "mysql-client" out of: mysql-client, wait (init)
Error from server (BadRequest): container "mysql-client" in pod "advanced-tidb-tidb-initializer-k5hjb" is waiting to start: PodInitializing

kubectl logs advanced-tidb-tidb-initializer-47x5r -n tidb-cluster -c wait
=========================================================================
standard_init_linux.go:219: exec user process caused "exec format error"
libcontainer: container start initialization failed: standard_init_linux.go:219: exec user process caused "exec format error"
  • Solution: This happens because the environment is ARM. As described at https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/deploy-cluster-on-arm64#%E5%88%9D%E5%A7%8B%E5%8C%96-tidb-%E9%9B%86%E7%BE%A4, set the spec.image field in the TidbInitializer definition to an ARM64 image, e.g. image: kanshiori/mysqlclient-arm64.


Copyright notice: This is an original article by a TiDB community user, licensed under CC BY-NC-SA 4.0. When republishing, please include a link to the original and this notice.
