Safely removing masters from a Kubernetes HA cluster

I deployed a development K8s cluster with kops on AWS EC2 instances, initially as an HA architecture with 3 masters and 3 nodes.

Now, to save costs, I want to shut down 2 of the 3 masters and keep only 1 running.

I tried kubectl drain, but it had no effect, and simply terminating the nodes left the cluster connection unstable.

Is there a safe way to remove the masters?

heiheiliangliang answered: Safely removing masters from a Kubernetes HA cluster

This has already been discussed in the GitHub question - HA to single master migration.

A solution has already been prepared for you.

Since kops 1.12 introduced etcd-manager, the etcd clusters for both main and events are automatically and periodically backed up to S3 (the same bucket as KOPS_STATE_STORE).
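You can confirm that these backups actually exist with the AWS CLI, for example (a quick check, assuming the CLI is configured for the account that owns the kops state store bucket):

$ aws s3 ls s3://<kops s3 bucket name>/<cluster name>/backups/etcd/main/
$ aws s3 ls s3://<kops s3 bucket name>/<cluster name>/backups/etcd/events/
# each backup should appear as a timestamped prefix such as 2019-10-16T09:42:37Z-000001/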

So, if your cluster runs on kops 1.12 or later, the following steps should work:

  1. Remove the etcd zones from the cluster
$ kops edit cluster

In the etcdClusters section, remove the etcdMembers entries so that only a single instanceGroup remains for each of main and events. For example:

  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-ap-southeast-1a
      name: a
    name: main
  - etcdMembers:
    - instanceGroup: master-ap-southeast-1a
      name: a
    name: events

  2. Apply the changes
$ kops update cluster --yes
$ kops rolling-update cluster --yes

  3. Delete the 2 master instance groups
$ kops delete ig master-xxxxxx-1b
$ kops delete ig master-xxxxxx-1c

This operation cannot be undone; it deletes the 2 master nodes immediately.

Now that 2 of your 3 masters have been deleted, the k8s etcd service will likely fail and the kube-api service will become unreachable. After this step, kops and kubectl commands usually no longer work.

  4. Restart the etcd cluster with a single master node
    This is the tricky part. SSH into the remaining master node, then:
$ sudo systemctl stop protokube
$ sudo systemctl stop kubelet

Download the etcd-manager-ctl tool. If you are using a different etcd-manager version, adjust the download link accordingly:

$ wget https://github.com/kopeio/etcd-manager/releases/download/3.0.20190930/etcd-manager-ctl-linux-amd64
$ mv etcd-manager-ctl-linux-amd64 etcd-manager-ctl
$ chmod +x etcd-manager-ctl
$ mv etcd-manager-ctl /usr/local/bin/

Restore the backups from S3 (see the official docs):

$ etcd-manager-ctl -backup-store=s3://<kops s3 bucket name>/<cluster name>/backups/etcd/main list-backups
$ etcd-manager-ctl -backup-store=s3://<kops s3 bucket name>/<cluster name>/backups/etcd/main restore-backup 2019-10-16T09:42:37Z-000001
# do the same for events
$ etcd-manager-ctl -backup-store=s3://<kops s3 bucket name>/<cluster name>/backups/etcd/events list-backups
$ etcd-manager-ctl -backup-store=s3://<kops s3 bucket name>/<cluster name>/backups/etcd/events restore-backup 2019-10-16T09:42:37Z-000001

This does not start the restore immediately; you need to restart etcd by killing the related containers and then starting kubelet again:
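The exact commands for killing the containers depend on your container runtime; a minimal sketch, assuming the master node runs Docker, could be:

$ sudo docker ps | grep etcd          # note the IDs of the etcd-manager-main and etcd-manager-events containers
$ sudo docker kill <etcd-manager-main container id> <etcd-manager-events container id>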

$ sudo systemctl start kubelet
$ sudo systemctl start protokube

Wait for the restore to finish; after that, kubectl get nodes and kops validate cluster should work again. If not, you can terminate the EC2 instance of the remaining master in the AWS console; the Auto Scaling Group will create a new master node and the etcd cluster will be restored.
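If you want to watch the restore progress, one option (again assuming Docker as the container runtime; container names may differ on your node) is to tail the etcd-manager logs on the master:

$ sudo docker ps | grep etcd-manager-main
$ sudo docker logs -f <etcd-manager-main container id>   # look for the restore-backup command being picked up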


These are the steps to reduce the number of master nodes in a kops-deployed cluster.

Note: before attempting the steps described here, consider whether you could simply recreate the cluster instead. Following these steps did eventually get me from 3 masters down to 1, but under different circumstances additional troubleshooting may be required. Everything I learned from the process is below, but your situation may differ, so success is not guaranteed.

Prerequisites

Go to the AWS console and find the private IP (the MASTER_IP variable below) and availability zone (AZ) of the master node that will remain as the single master after this procedure.
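If you prefer the CLI over the console, the sketch below is one way to list the master IPs and their availability zones; it assumes the instances carry the usual kops tags (k8s.io/role/master and KubernetesCluster), so adjust the filters if your tagging differs.

# list availability zone and private IP of all running masters in the cluster
aws ec2 describe-instances \
  --filters "Name=tag:k8s.io/role/master,Values=1" \
            "Name=tag:KubernetesCluster,Values=<cluster name>" \
            "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].[Placement.AvailabilityZone,PrivateIpAddress]' \
  --output table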

You need AWS CLI access to S3 configured for kops to work, and kubectl configured against the cluster we are about to operate on. If anything goes wrong, you may also need an SSH key that lets you reach the remaining master to restore etcd there (because kubectl will no longer be usable in that case); that scenario is not covered in this document. Provide values for MASTER_IP, AZ, KOPS_STATE_BUCKET and CLUSTER_NAME that match your environment:

# MASTER_IP is the IP of master node in availability zone AZ (so "c" in this example)
export MASTER_IP="172.20.115.115"
export AZ="c"
export KOPS_STATE_BUCKET="mironq-prod-eu-central-1-state-store"
export CLUSTER_NAME="mironq.prod.eu-central-1.aws.svc.example.com"

# no need to change following command unless you use different version of Etcd
export BACKUP_MAIN="s3://${KOPS_STATE_BUCKET}/${CLUSTER_NAME}/backups/etcd/main"
export BACKUP_EVENT="s3://${KOPS_STATE_BUCKET}/${CLUSTER_NAME}/backups/etcd/events"
export ETCD_CMD="/opt/etcd-v3.4.3-linux-amd64/etcdctl --cacert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-ca.crt --cert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.crt --key=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.key --endpoints=https://127.0.0.1:4001"
CONTAINER=$(kubectl get pod -n kube-system -l k8s-app=etcd-manager-main -o=jsonpath='{.items[*].metadata.name}'|tr ' ' '\n'|grep ${MASTER_IP})

Note: your CONTAINER variable should now contain the pod name of the master that will remain, i.e.:

$ echo $CONTAINER 
etcd-manager-main-ip-172-20-109-104.eu-central-1.compute.internal

Now confirm that the etcd backups exist and are recent (taken within roughly the last 15 minutes), and check the current members of the cluster.

kubectl exec -it -n kube-system ${CONTAINER} -- /etcd-manager-ctl -backup-store=${BACKUP_MAIN} list-backups|sort -n
kubectl exec -it -n kube-system ${CONTAINER} -- /etcd-manager-ctl -backup-store=${BACKUP_EVENT} list-backups|sort -n

# Confirm current members of existing Etcd cluster
kubectl exec -it -n kube-system ${CONTAINER} -- ${ETCD_CMD} member list

Removing the etcd nodes

Get the IDs of the etcd nodes that are going to be removed:

MASTERS2DELETE=$(kubectl exec -it -n kube-system ${CONTAINER} -- ${ETCD_CMD} member list|grep -v etcd-${AZ}|cut -d, -f1)
#$ echo $MASTERS2DELETE
#efb9893f347468eb ffea6e819b91a131

Now you are ready to remove the unwanted etcd nodes:

for MASTER in ${MASTERS2DELETE};do echo "Deleting ETCD node ${MASTER}"; kubectl exec -it -n kube-system ${CONTAINER} -- ${ETCD_CMD} member remove ${MASTER}; done
# a few minutes may be needed after this has been executed
# Confirm only one member is left
kubectl exec -it -n kube-system ${CONTAINER} -- ${ETCD_CMD} member list

You will also see that some master nodes are no longer ready:

$ kubectl get node

Scheduling the backup restore

!!! IMPORTANT !!! At this point we need to make sure a new backup is taken before continuing. By default etcd-manager takes a backup every 15 minutes. Wait for a new one to appear, because it will contain the information about the expected number of nodes (=1).
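To see whether a fresh backup has appeared, you can simply re-run the backup listing from above and watch the newest timestamps, for example:

# repeat every few minutes until an entry newer than the member removal shows up
kubectl exec -n kube-system ${CONTAINER} -- /etcd-manager-ctl -backup-store=${BACKUP_MAIN} list-backups | sort -n | tail -3
kubectl exec -n kube-system ${CONTAINER} -- /etcd-manager-ctl -backup-store=${BACKUP_EVENT} list-backups | sort -n | tail -3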

Now that a new backup of this single-node cluster exists, we can schedule its restore after the restart. The code below includes commented-out responses to help you determine whether your commands executed as expected.

Schedule the restore of the "main" cluster.

BACKUP_LIST=$(kubectl exec -it -n kube-system ${CONTAINER} -- /etcd-manager-ctl -backup-store=${BACKUP_MAIN} list-backups|sort -n)

#echo "$BACKUP_LIST"
#[...]
#2020-12-17T14:55:55Z-000001
#2020-12-17T15:11:05Z-000002
#2020-12-17T15:26:13Z-000003
#2020-12-17T15:41:14Z-000001
#2020-12-17T15:56:20Z-000001
#2020-12-17T16:11:35Z-000004
#2020-12-17T16:26:41Z-000005

LATEST_BACKUP=$(echo -n "${BACKUP_LIST}"|tail -1)

# confirm that latest backup has been selected
#$ echo $LATEST_BACKUP 
#2020-12-17T16:26:41Z-000005

kubectl exec -it -n kube-system ${CONTAINER} -- /etcd-manager-ctl -backup-store=${BACKUP_MAIN} restore-backup "${LATEST_BACKUP/%[$'\t\r\n']}"

#Backup Store: s3://mironq-prod-eu-central-1-state-store/mironq.prod.eu-central-1.aws.svc.example.com/backups/etcd/main
#I1217 16:41:59.101078   11608 vfs.go:60] Adding command at s3://mironq-prod-eu-central-1-state-store/mironq.prod.eu-central-1.aws.svc.example.com/backups/etcd/main/control/#2020-12-17T16:41:59Z-000000/_command.json: timestamp:1608223319100999598 restore_backup:<cluster_spec:<member_count:3 etcd_version:"3.4.3" > backup:"2020-12-17T16:26:41Z-000005\r" #> 
#added restore-backup command: timestamp:1608223319100999598 restore_backup:<cluster_spec:<member_count:3 etcd_version:"3.4.3" > backup:"2020-12-17T16:26:41Z-000005\r" > 

Schedule the restore of the "events" cluster.

BACKUP_LIST=$(kubectl exec -it -n kube-system ${CONTAINER} -- /etcd-manager-ctl -backup-store=${BACKUP_EVENT} list-backups|sort -n)

#$ echo "$BACKUP_LIST"
#Backup Store: s3://mironq-prod-eu-central-1-state-store/mironq.prod.eu-central-1.aws.svc.example.com/backups/etcd/events
#I1217 16:48:41.230896   17761 vfs.go:102] listed backups in s3://mironq-prod-eu-central-1-state-store/mironq.prod.eu-central-1.aws.svc.example.com/backups/etcd/events: #[2020-12-17T14:56:08Z-000001 2020-12-17T15:11:17Z-000001 2020-12-17T15:26:26Z-000002 2020-12-17T15:41:27Z-000002 2020-12-17T15:56:32Z-000003 2020-12-17T16:11:41Z-000003 #2020-12-17T16:26:48Z-000004 2020-12-17T16:41:56Z-000001]
#2020-12-17T14:56:08Z-000001
#2020-12-17T15:11:17Z-000001
#2020-12-17T15:26:26Z-000002
#2020-12-17T15:41:27Z-000002
#2020-12-17T15:56:32Z-000003
#2020-12-17T16:11:41Z-000003
#2020-12-17T16:26:48Z-000004
#2020-12-17T16:41:56Z-000001

LATEST_BACKUP=$(echo -n "${BACKUP_LIST}"|tail -1)

# confirm that latest backup has been selected
#$ echo $LATEST_BACKUP 
#2020-12-17T16:41:56Z-000001

kubectl exec -it -n kube-system ${CONTAINER} -- /etcd-manager-ctl -backup-store=${BACKUP_EVENT} restore-backup "${LATEST_BACKUP/%[$'\t\r\n']}"

#Backup Store: s3://mironq-prod-eu-central-1-state-store/mironq.prod.eu-central-1.aws.svc.example.com/backups/etcd/events
#I1217 16:53:17.876318   21958 vfs.go:60] Adding command at s3://mironq-prod-eu-central-1-state-store/mironq.prod.eu-central-1.aws.svc.example.com/backups/etcd/events/control/#2020-12-17T16:53:17Z-000000/_command.json: timestamp:1608223997876256810 restore_backup:<cluster_spec:<member_count:3 etcd_version:"3.4.3" > backup:"2020-12-17T16:41:56Z-000001\r" #> 
#added restore-backup command: timestamp:1608223997876256810 restore_backup:<cluster_spec:<member_count:3 etcd_version:"3.4.3" > backup:"2020-12-17T16:41:56Z-000001\r" > 

Check whether the endpoint is still healthy (it should be):

# check if endpoint is healthy
kubectl exec -it -n kube-system ${CONTAINER} -- ${ETCD_CMD} endpoint health
#https://127.0.0.1:4001 is healthy: successfully committed proposal: took = 7.036109ms

Deleting the instance groups

Example instance group listing:

kops --name ${CLUSTER_NAME} --state s3://${KOPS_STATE_BUCKET} get ig
#NAME           ROLE    MACHINETYPE MIN MAX ZONES
#bastions       Bastion t3.micro    1   1   eu-central-1a,eu-central-1b,eu-central-1c
#master-eu-central-1a   Master  t3.medium   1   1   eu-central-1a
#master-eu-central-1b   Master  t3.medium   1   1   eu-central-1b
#master-eu-central-1c   Master  t3.medium   1   1   eu-central-1c
#nodes          Node    t3.medium   2   6   eu-central-1a,eu-central-1b

Delete the master instance groups in the availability zones where we removed the etcd nodes (a and b in this example, since we want to keep c running as the only master). Edit the command below, replacing [AZ-letter] to match your situation.

kops --name ${CLUSTER_NAME} --state s3://${KOPS_STATE_BUCKET} delete ig master-eu-central-1[AZ-letter]
#InstanceGroup "master-eu-central-1a" found for deletion
#I1217 17:01:39.035294 2538280 delete.go:54] Deleting "master-eu-central-1a"

#Deleted InstanceGroup: "master-eu-central-1a"

Manually edit the cluster by invoking edit mode with the command further below. The goal here is to make the cluster configuration match the remaining etcd node: following the example below, you need to remove the entries for nodes that no longer exist.

Change this:

  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: master-eu-central-1a
      name: a
    - instanceGroup: master-eu-central-1b
      name: b
    - instanceGroup: master-eu-central-1c
      name: c
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: master-eu-central-1a
      name: a
    - instanceGroup: master-eu-central-1b
      name: b
    - instanceGroup: master-eu-central-1c
      name: c
    memoryRequest: 100Mi
    name: events

To this (leaving only the zone that still has a master):

  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: master-eu-central-1c
      name: c
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: master-eu-central-1c
      name: c
    memoryRequest: 100Mi
    name: events

This will open an editor where you can make the change:

kops --name ${CLUSTER_NAME} --state s3://${KOPS_STATE_BUCKET} edit cluster

Applying the kops changes

Apply the changes and force recreation of the master node (the second command will make the cluster unresponsive until the new master has been created and comes back online).

kops --name ${CLUSTER_NAME} --state s3://${KOPS_STATE_BUCKET} update cluster --yes
kops --name ${CLUSTER_NAME} --state s3://${KOPS_STATE_BUCKET} rolling-update cluster --cloudonly --yes
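Once the new master instance is up, you can let kops confirm that the cluster is healthy again; in recent kops releases validate cluster can wait for convergence, for example:

kops --name ${CLUSTER_NAME} --state s3://${KOPS_STATE_BUCKET} validate cluster --wait 10m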

Troubleshooting

Both etcd clusters, "main" and "events", must come back online before the API will start again. If the API server logs complain about being unable to connect to port 4001, your "main" etcd cluster is not up; if the port is 4002, it is "events". Just above, you instructed the etcd clusters to import a backup; that import must complete before the clusters can start.
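To check this from the remaining master, a rough sketch (assuming the default kops log locations and Docker as the container runtime) could look like:

# on the remaining master node
sudo tail -n 50 /var/log/kube-apiserver.log   # connection errors to :4001 point at "main", :4002 at "events"
sudo docker ps | grep etcd-manager            # both the main and events managers should be running
sudo docker logs --tail 50 <etcd-manager-main container id>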

