I deployed a development K8s cluster with kops on AWS EC2 instances, initially as an HA architecture with 3 masters and 3 nodes.
Now, to save costs, I want to shut down 2 of the 3 masters and keep only 1 running.
I tried kubectl drain, but it had no effect, and simply terminating the instances made the cluster connection unstable.
Is there a safe way to remove the masters?
This has already been discussed in the GitHub question - HA to single master migration, and a solution has been worked out there.
Since etcd-manager was introduced in kops 1.12, the etcd clusters for main
and events
are backed up to S3 automatically and periodically (the same bucket as KOPS_STATE_STORE).
So if your k8s cluster version is above 1.12, you can follow these steps:
$ kops edit cluster
In the etcdClusters
section, remove the etcdMembers
entries so that only one instanceGroup
remains for each of main
and events
. For example:
etcdClusters:
- etcdMembers:
  - instanceGroup: master-ap-southeast-1a
    name: a
  name: main
- etcdMembers:
  - instanceGroup: master-ap-southeast-1a
    name: a
  name: events
$ kops update cluster --yes
$ kops rolling-update cluster --yes
$ kops delete ig master-xxxxxx-1b
$ kops delete ig master-xxxxxx-1c
This operation cannot be undone, and it deletes the 2 master nodes immediately.
With 2 of your 3 masters deleted, the k8s etcd service will likely fail and the kube-api service will become unreachable. After this step, the kops
and kubectl
commands usually no longer work. On the remaining master node, stop the services:
$ sudo systemctl stop protokube
$ sudo systemctl stop kubelet
Download the etcd-manager-ctl
tool. If you use a different etcd-manager
version, adjust the download link accordingly:
$ wget https://github.com/kopeio/etcd-manager/releases/download/3.0.20190930/etcd-manager-ctl-linux-amd64
$ mv etcd-manager-ctl-linux-amd64 etcd-manager-ctl
$ chmod +x etcd-manager-ctl
$ mv etcd-manager-ctl /usr/local/bin/
Restore the backups from S3. See the official docs:
$ etcd-manager-ctl -backup-store=s3://<kops s3 bucket name>/<cluster name>/backups/etcd/main list-backups
$ etcd-manager-ctl -backup-store=s3://<kops s3 bucket name>/<cluster name>/backups/etcd/main restore-backup 2019-10-16T09:42:37Z-000001
# do the same for events
$ etcd-manager-ctl -backup-store=s3://<kops s3 bucket name>/<cluster name>/backups/etcd/events list-backups
$ etcd-manager-ctl -backup-store=s3://<kops s3 bucket name>/<cluster name>/backups/etcd/events restore-backup 2019-10-16T09:42:37Z-000001
This does not start the restore immediately; you need to restart etcd by killing the relevant containers and starting kubelet:
$ sudo systemctl start kubelet
$ sudo systemctl start protokube
Wait for the restore to finish, after which kubectl get nodes
and kops validate cluster
should work normally. If not, you can terminate the EC2 instance of the remaining master in the AWS console; the Auto Scaling Group will create a new master and the etcd cluster will be restored.
Note: before you attempt the steps described here, consider whether you could simply recreate the cluster. Following these steps did eventually get me from 3 masters down to 1, but in different circumstances extra troubleshooting may be needed. Everything I learned from this process is below, but your situation may differ, so success is not guaranteed.
Go to the AWS console and identify the private IP (the MASTER_IP variable later) and availability zone (AZ) of the master that will become the single master after this procedure.
You need AWS CLI access to S3 configured for kops to work, and kubectl configured to talk to the cluster we are about to operate on. If anything goes wrong, you may also need an SSH key that lets you reach the remaining master to recover etcd there (since kubectl will no longer be available in that case); this document does not currently cover that case. Provide values for MASTER_IP, AZ, KOPS_STATE_BUCKET and CLUSTER_NAME to match your environment.
# MASTER_IP is the IP of master node in availability zone AZ (so "c" in this example)
export MASTER_IP="172.20.115.115"
export AZ="c"
export KOPS_STATE_BUCKET="mironq-prod-eu-central-1-state-store"
export CLUSTER_NAME="mironq.prod.eu-central-1.aws.svc.example.com"
# no need to change following command unless you use different version of Etcd
export BACKUP_MAIN="s3://${KOPS_STATE_BUCKET}/${CLUSTER_NAME}/backups/etcd/main"
export BACKUP_EVENT="s3://${KOPS_STATE_BUCKET}/${CLUSTER_NAME}/backups/etcd/events"
export ETCD_CMD="/opt/etcd-v3.4.3-linux-amd64/etcdctl --cacert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-ca.crt --cert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.crt --key=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.key --endpoints=https://127.0.0.1:4001"
CONTAINER=$(kubectl get pod -n kube-system -l k8s-app=etcd-manager-main -o=jsonpath='{.items[*].metadata.name}'|tr ' ' '\n'|grep ${MASTER_IP})
Note: your CONTAINER variable should now contain the pod name of the master that will remain, i.e.:
$ echo $CONTAINER
etcd-manager-main-ip-172-20-109-104.eu-central-1.compute.internal
Now confirm that the etcd backups exist and are recent (no more than about 15 minutes old), and check the current number of cluster members.
kubectl exec -it -n kube-system ${CONTAINER} -- /etcd-manager-ctl -backup-store=${BACKUP_MAIN} list-backups|sort -n
kubectl exec -it -n kube-system ${CONTAINER} -- /etcd-manager-ctl -backup-store=${BACKUP_EVENT} list-backups|sort -n
# Confirm current members of existing Etcd cluster
kubectl exec -it -n kube-system ${CONTAINER} -- ${ETCD_CMD} member list
Get the IDs of the etcd nodes that will be removed:
MASTERS2DELETE=$(kubectl exec -it -n kube-system ${CONTAINER} -- ${ETCD_CMD} member list|grep -v etcd-${AZ}|cut -d, -f1)
#$ echo $MASTERS2DELETE
efb9893f347468eb ffea6e819b91a131
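The extraction above can be tried without a live cluster. A minimal sketch, assuming `etcdctl member list` prints one comma-separated row per member with the ID in the first field (the helper name and the sample rows are made up for illustration):

```shell
# Hypothetical helper mirroring the pipeline above: read `etcdctl member list`
# output on stdin and print the IDs of every member except the one whose
# name contains "etcd-<az>".
ids_to_remove() {
  az="$1"
  grep -v "etcd-${az}" | cut -d, -f1
}

# Made-up sample rows in etcdctl's comma-separated format
sample='efb9893f347468eb, started, etcd-a.internal, https://a:2380, https://a:4001
1b2c3d4e5f607182, started, etcd-b.internal, https://b:2380, https://b:4001
ffea6e819b91a131, started, etcd-c.internal, https://c:2380, https://c:4001'

# Keep the master in zone "c"; the other two IDs are printed for removal
echo "$sample" | ids_to_remove c
```

Note the space in `cut -d, -f1`: without it, `cut` treats `,-f1` as the delimiter argument and fails.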
Now you are ready to remove the unneeded etcd nodes:
for MASTER in ${MASTERS2DELETE};do echo "Deleting ETCD node ${MASTER}"; kubectl exec -it -n kube-system ${CONTAINER} -- ${ETCD_CMD} member remove ${MASTER}; done
# a few minutes may be needed after this has been executed
# Confirm only one member is left
kubectl exec -it -n kube-system ${CONTAINER} -- ${ETCD_CMD} member list
You will also see that some master nodes are not ready:
$ kubectl get node
!!! IMPORTANT !!! Now we need to make sure a fresh backup is taken before continuing. By default, etcd-manager takes a backup every 15 minutes. Wait for a new one to appear, because it will contain the information about the expected number of nodes (=1).
Now that we have a new backup for this single-node cluster, we can schedule its restore after the restart. The code below includes commented-out responses to help you determine whether your commands executed as expected.
Schedule the restore of the "main" cluster.
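Because the backup names begin with an ISO-8601 timestamp, they compare correctly as plain strings, so "is there a backup newer than the member removal?" is easy to check. A minimal sketch (the helper name and cutoff value are mine, not part of etcd-manager):

```shell
# Hypothetical helper: read backup names on stdin (as printed by
# `etcd-manager-ctl list-backups`), print the newest one taken after the
# given cutoff timestamp, or nothing if no fresh backup exists yet.
newest_backup_after() {
  cutoff="$1"
  sort | tail -n 1 | awk -v c="$cutoff" '$0 > c { print }'
}

# Made-up sample: only the 16:41 backup postdates the 16:30 cutoff
printf '%s\n' \
  '2020-12-17T16:26:48Z-000004' \
  '2020-12-17T16:41:56Z-000001' |
  newest_backup_after '2020-12-17T16:30:00Z'
```

In practice you would rerun the list-backups command through a helper like this every minute or two until it prints a name, then proceed.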
BACKUP_LIST=$(kubectl exec -it -n kube-system ${CONTAINER} -- /etcd-manager-ctl -backup-store=${BACKUP_MAIN} list-backups|sort -n)
#echo "$BACKUP_LIST"
#[...]
#2020-12-17T14:55:55Z-000001
#2020-12-17T15:11:05Z-000002
#2020-12-17T15:26:13Z-000003
#2020-12-17T15:41:14Z-000001
#2020-12-17T15:56:20Z-000001
#2020-12-17T16:11:35Z-000004
#2020-12-17T16:26:41Z-000005
LATEST_BACKUP=$(echo -n "${BACKUP_LIST}"|tail -1)
# confirm that latest backup has been selected
#$ echo $LATEST_BACKUP
#2020-12-17T16:26:41Z-000005
kubectl exec -it -n kube-system ${CONTAINER} -- /etcd-manager-ctl -backup-store=${BACKUP_MAIN} restore-backup "${LATEST_BACKUP/%[$'\t\r\n']}"
#Backup Store: s3://mironq-prod-eu-central-1-state-store/mironq.prod.eu-central-1.aws.svc.example.com/backups/etcd/main
#I1217 16:41:59.101078 11608 vfs.go:60] Adding command at s3://mironq-prod-eu-central-1-state-store/mironq.prod.eu-central-1.aws.svc.example.com/backups/etcd/main/control/#2020-12-17T16:41:59Z-000000/_command.json: timestamp:1608223319100999598 restore_backup:<cluster_spec:<member_count:3 etcd_version:"3.4.3" > backup:"2020-12-17T16:26:41Z-000005\r" #>
#added restore-backup command: timestamp:1608223319100999598 restore_backup:<cluster_spec:<member_count:3 etcd_version:"3.4.3" > backup:"2020-12-17T16:26:41Z-000005\r" >
Schedule the restore of the "events" cluster.
BACKUP_LIST=$(kubectl exec -it -n kube-system ${CONTAINER} -- /etcd-manager-ctl -backup-store=${BACKUP_EVENT} list-backups|sort -n)
#$ echo "$BACKUP_LIST"
#Backup Store: s3://mironq-prod-eu-central-1-state-store/mironq.prod.eu-central-1.aws.svc.example.com/backups/etcd/events
#I1217 16:48:41.230896 17761 vfs.go:102] listed backups in s3://mironq-prod-eu-central-1-state-store/mironq.prod.eu-central-1.aws.svc.example.com/backups/etcd/events: #[2020-12-17T14:56:08Z-000001 2020-12-17T15:11:17Z-000001 2020-12-17T15:26:26Z-000002 2020-12-17T15:41:27Z-000002 2020-12-17T15:56:32Z-000003 2020-12-17T16:11:41Z-000003 #2020-12-17T16:26:48Z-000004 2020-12-17T16:41:56Z-000001]
#2020-12-17T14:56:08Z-000001
#2020-12-17T15:11:17Z-000001
#2020-12-17T15:26:26Z-000002
#2020-12-17T15:41:27Z-000002
#2020-12-17T15:56:32Z-000003
#2020-12-17T16:11:41Z-000003
#2020-12-17T16:26:48Z-000004
#2020-12-17T16:41:56Z-000001
LATEST_BACKUP=$(echo -n "${BACKUP_LIST}"|tail -1)
# confirm that latest backup has been selected
#$ echo $LATEST_BACKUP
#2020-12-17T16:41:56Z-000001
kubectl exec -it -n kube-system ${CONTAINER} -- /etcd-manager-ctl -backup-store=${BACKUP_EVENT} restore-backup "${LATEST_BACKUP/%[$'\t\r\n']}"
#Backup Store: s3://mironq-prod-eu-central-1-state-store/mironq.prod.eu-central-1.aws.svc.example.com/backups/etcd/events
#I1217 16:53:17.876318 21958 vfs.go:60] Adding command at s3://mironq-prod-eu-central-1-state-store/mironq.prod.eu-central-1.aws.svc.example.com/backups/etcd/events/control/#2020-12-17T16:53:17Z-000000/_command.json: timestamp:1608223997876256810 restore_backup:<cluster_spec:<member_count:3 etcd_version:"3.4.3" > backup:"2020-12-17T16:41:56Z-000001\r" #>
#added restore-backup command: timestamp:1608223997876256810 restore_backup:<cluster_spec:<member_count:3 etcd_version:"3.4.3" > backup:"2020-12-17T16:41:56Z-000001\r" >
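The odd-looking `${LATEST_BACKUP/%[$'\t\r\n']}` expansion in the restore commands is there because `kubectl exec -it` allocates a TTY, so the captured backup name usually ends with a stray carriage return (visible as `\r` inside the backup name in the logged responses above); passing it through unchanged would request a backup that does not exist. A small bash illustration with a made-up value:

```shell
# Simulate a backup name captured from a TTY: note the trailing \r
raw=$'2020-12-17T16:41:56Z-000001\r'

# Strip one trailing tab, CR or newline, exactly as the restore commands do
clean="${raw/%[$'\t\r\n']}"

printf '%s' "$clean"
```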
Check that the endpoint is still healthy (it should be):
# check if endpoint is healthy
kubectl exec -it -n kube-system ${CONTAINER} -- ${ETCD_CMD} endpoint health
#https://127.0.0.1:4001 is healthy: successfully committed proposal: took = 7.036109ms
Example instance group listing:
kops --name ${CLUSTER_NAME} --state s3://${KOPS_STATE_BUCKET} get ig
#NAME ROLE MACHINETYPE MIN MAX ZONES
#bastions Bastion t3.micro 1 1 eu-central-1a,eu-central-1b,eu-central-1c
#master-eu-central-1a Master t3.medium 1 1 eu-central-1a
#master-eu-central-1b Master t3.medium 1 1 eu-central-1b
#master-eu-central-1c Master t3.medium 1 1 eu-central-1c
#nodes Node t3.medium 2 6 eu-central-1a,eu-central-1b
Delete the master instance groups in the availability zones where we disabled the etcd nodes (a and b in this example, since we want to leave c running as the only master). Edit the command below, replacing [AZ-letter] to match your situation.
kops --name ${CLUSTER_NAME} --state s3://${KOPS_STATE_BUCKET} delete ig master-eu-central-1[AZ-letter]
#InstanceGroup "master-eu-central-1a" found for deletion
#I1217 17:01:39.035294 2538280 delete.go:54] Deleting "master-eu-central-1a"
#Deleted InstanceGroup: "master-eu-central-1a"
Manually edit the cluster by invoking edit mode with the command further below. The goal here is to match the cluster configuration to the remaining etcd node: following the example below, you need to delete the entries for the nodes that no longer exist.
Change this:
etcdClusters:
- cpuRequest: 200m
  etcdMembers:
  - instanceGroup: master-eu-central-1a
    name: a
  - instanceGroup: master-eu-central-1b
    name: b
  - instanceGroup: master-eu-central-1c
    name: c
  memoryRequest: 100Mi
  name: main
- cpuRequest: 100m
  etcdMembers:
  - instanceGroup: master-eu-central-1a
    name: a
  - instanceGroup: master-eu-central-1b
    name: b
  - instanceGroup: master-eu-central-1c
    name: c
  memoryRequest: 100Mi
  name: events
to this (leaving only the zone that still has a master):
etcdClusters:
- cpuRequest: 200m
  etcdMembers:
  - instanceGroup: master-eu-central-1c
    name: c
  memoryRequest: 100Mi
  name: main
- cpuRequest: 100m
  etcdMembers:
  - instanceGroup: master-eu-central-1c
    name: c
  memoryRequest: 100Mi
  name: events
This opens an editor where the change can be made:
kops --name ${CLUSTER_NAME} --state s3://${KOPS_STATE_BUCKET} edit cluster
Apply the change and force recreation of the master (the second command will make the cluster unresponsive until the new master is created and comes back online).
kops --name ${CLUSTER_NAME} --state s3://${KOPS_STATE_BUCKET} update cluster --yes
kops --name ${CLUSTER_NAME} --state s3://${KOPS_STATE_BUCKET} rolling-update cluster --cloudonly --yes
Both etcd clusters, "main" and "events", must come back online before the API will start again. If the API server logs complain about failing to connect to port 4001, your "main" etcd cluster is not up; if the port is 4002, it is "events". Just above, you instructed the etcd clusters to import backups; that import must complete before the clusters can start.
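The port-to-cluster mapping in the paragraph above is worth keeping straight while scanning apiserver logs; as a trivial aide-mémoire (the function name is mine, the port numbers are the kops defaults quoted above):

```shell
# Map the etcd client port seen in kube-apiserver connection errors to the
# etcd cluster it belongs to: 4001 is "main", 4002 is "events".
etcd_cluster_for_port() {
  case "$1" in
    4001) echo main ;;
    4002) echo events ;;
    *)    echo unknown ;;
  esac
}

etcd_cluster_for_port 4001   # -> main
etcd_cluster_for_port 4002   # -> events
```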