Reviewing leaf usage in progs, it always occurs right after reap. Combining the leaf and reap methods avoids a redundant query for the strand's children.

It's also typical for nap or donate to be called after the leaf check that follows reap, so build this into reap as well, by calling donate by default, or nap if a nap keyword argument is given. There are a few cases where reap was called without leaf/donate; add a fallthrough keyword argument to support these, so that if there are no children, reap calls neither nap nor donate.

Vm::HostNexus#wait_prep and Kubernetes::UpgradeKubernetesNode#wait_new_node both need the return value of the reapable child(ren). Add a reaper keyword argument for this, which is called once for each child.

The most common pattern for using reap/leaf/donate was:

```ruby
reap
hop_download_lb_cert if leaf?
donate
```

This turns into:

```ruby
reap(:download_lb_cert)
```

The second most common pattern was:

```ruby
reap
donate unless leaf?
pop "upgrade cancelled" # or other code
```

This turns into:

```ruby
reap { pop "upgrade cancelled" }
```

In a few places, I changed operations on strand.children to strand.children_dataset. Now that we are no longer using cached children by default, it's better to do these checks in the database instead of in Ruby. These places deserve careful review:

* Prog::Minio::MinioServerNexus#unavailable
* Prog::Postgres::PostgresResourceNexus#wait
* Prog::Postgres::PostgresServerNexus#unavailable

For Prog::Vnet::LoadBalancerNexus#wait_update_vm_load_balancers, I removed a check on the children completely. It was checking for an exitval using children_dataset directly after reap, which should only be true if there was still an active lease for the child. This also deserves careful review.

This broke many mocked tests. This fixes the mocked tests to use database-backed objects, ensuring that we are testing observable behavior and not implementation details.
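To make the combined semantics concrete, here is a rough sketch of how such a reap could behave. This is only an illustration of the behavior described above, not the actual Prog::Base implementation; the internals (the children_dataset query, child.destroy, public_send of a hop_* method) are assumptions for illustration.

```ruby
# Illustrative sketch only: approximates the combined reap behavior described
# above; the real implementation (lease handling, exitval semantics) differs.
def reap(hop_label = nil, nap: nil, fallthrough: false, reaper: nil)
  # Reap children that have exited, invoking the reaper once per child.
  exited = strand.children_dataset.exclude(exitval: nil).all
  exited.each do |child|
    reaper&.call(child)
    child.destroy
  end

  if strand.children_dataset.empty?
    public_send(:"hop_#{hop_label}") if hop_label  # reap(:download_lb_cert)
    return yield if block_given?                   # reap { pop "upgrade cancelled" }
    return if fallthrough                          # no children: skip nap/donate
  end

  # Children are still running (or no terminal action applied): wait on them,
  # napping for the given interval if a nap keyword argument was passed.
  nap ? self.nap(nap) : donate
end
```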
# frozen_string_literal: true

class Prog::Kubernetes::UpgradeKubernetesNode < Prog::Base
  subject_is :kubernetes_cluster

  def old_vm
    @old_vm ||= Vm[frame.fetch("old_vm_id")]
  end

  def new_vm
    @new_vm ||= Vm[frame.fetch("new_vm_id")]
  end

  def kubernetes_nodepool
    @kubernetes_nodepool ||= KubernetesNodepool[frame.fetch("nodepool_id", nil)]
  end

  def before_run
    if kubernetes_cluster.strand.label == "destroy" && strand.label != "destroy"
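      # With the combined reap API, the block runs only once every child has
      # been reaped; otherwise reap donates to the remaining children.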
      reap { pop "upgrade cancelled" }
    end
  end

  label def start
    new_frame = if kubernetes_nodepool
      {"nodepool_id" => kubernetes_nodepool.id}
    else
      {}
    end

    bud Prog::Kubernetes::ProvisionKubernetesNode, new_frame

    hop_wait_new_node
  end

  label def wait_new_node
    vm_id = nil
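    # reaper: is called once for each reaped child; capture the provisioned
    # VM's id from the child's exit value so it can be stored in the frame below.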
    reaper = lambda do |child|
      vm_id = child.exitval.fetch("vm_id")
    end

    reap(reaper:) do
      current_frame = strand.stack.first
      # This will not work correctly if the strand has multiple children.
      # However, the strand only has a single child, created in start.
      current_frame["new_vm_id"] = vm_id
      strand.modified!(:stack)

      hop_drain_old_node
    end
  end

  label def drain_old_node
    register_deadline("remove_old_node_from_cluster", 60 * 60)

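    # Run the drain from a control-plane VM, polling the status of the
    # daemonized drain_node command and starting or restarting it as needed.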
    vm = kubernetes_cluster.cp_vms.last
    case vm.sshable.d_check("drain_node")
    when "Succeeded"
      hop_remove_old_node_from_cluster
    when "NotStarted"
      vm.sshable.d_run("drain_node", "sudo", "kubectl", "--kubeconfig=/etc/kubernetes/admin.conf",
        "drain", old_vm.name, "--ignore-daemonsets", "--delete-emptydir-data")
      nap 10
    when "InProgress"
      nap 10
    when "Failed"
      vm.sshable.d_restart("drain_node")
      nap 10
    end
    nap 60 * 60
  end

  label def remove_old_node_from_cluster
    if kubernetes_nodepool
      kubernetes_nodepool.remove_vm(old_vm)
    else
      kubernetes_cluster.remove_cp_vm(old_vm)
      kubernetes_cluster.api_server_lb.detach_vm(old_vm)
    end

    # kubeadm reset is necessary for etcd member removal; deleting the node object
    # by itself doesn't remove the node from the etcd membership, hurting the etcd
    # cluster's health
    old_vm.sshable.cmd("sudo kubeadm reset --force")

    hop_delete_node_object
  end

  label def delete_node_object
    res = kubernetes_cluster.client.delete_node(old_vm.name)
    fail "delete node object failed: #{res}" unless res.exitstatus.zero?
    hop_destroy_node
  end

  label def destroy_node
    old_vm.incr_destroy
    pop "upgraded node"
  end
end