Files
ubicloud/prog/kubernetes/upgrade_kubernetes_node.rb
Jeremy Evans 15b91ff0c4 Absorb leaf into reap
Reviewing leaf usage in progs shows that it always occurs right after reap.
Combining the leaf and reap methods avoids a redundant query for the
strand's children.

It's typical for nap or donate to be called after the leaf check that
follows reap.  Build this into reap as well: call donate by default,
or nap if a nap keyword argument is given.
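
As a sketch of how a nap-based call site might read after this change
(the label name and duration are illustrative, not taken from the commit):

```ruby
# Hop to finish once there are no children left; while children remain,
# nap 30 seconds instead of donating.
reap(:finish, nap: 30)
```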

There are a few cases where reap was called without leaf/donate.
Add a fallthrough keyword argument to support this, so that if there
are no children, reap calls neither nap nor donate.
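
A sketch of such a call site (the surrounding code is illustrative,
not from the commit):

```ruby
reap(fallthrough: true)
# With no children, reap returns without napping or donating,
# so execution continues here in the same pass.
```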

Vm::HostNexus#wait_prep and Kubernetes::UpgradeKubernetesNode#wait_new_node
both need the return value of the reapable child(ren). Add a reaper
keyword argument for this, which is called once for each child.
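
A rough sketch of passing a reaper (the real call site is wait_new_node in
the file below; the lambda and label here are illustrative):

```ruby
exitvals = []
reap(reaper: lambda { |child| exitvals << child.exitval }) do
  # The block runs once there are no children left to reap.
  hop_next_step
end
```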

The most common pattern for using reap/leaf/donate was:

```ruby
reap
hop_download_lb_cert if leaf?
donate
```

This turns into:

```ruby
reap(:download_lb_cert)
```

The second most common pattern was:

```ruby
reap
donate unless leaf?
pop "upgrade cancelled" # or other code
```

This turns into:

```ruby
reap { pop "upgrade cancelled" }
```

In a few places, I changed operations on strand.children to
strand.children_dataset.  Now that we are no longer using
cached children by default, it's better to do these checks
in the database instead of in Ruby (see the sketch after the
list below).  These places deserve careful review:

* Prog::Minio::MinioServerNexus#unavailable
* Prog::Postgres::PostgresResourceNexus#wait
* Prog::Postgres::PostgresServerNexus#unavailable
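
A hypothetical before/after showing the kind of change made in those places
(the label check itself is made up for illustration; the point is that the
dataset form does the filtering in the database):

```ruby
# Before: loads every child strand into Ruby and checks it there.
nap 5 unless strand.children.all? { |child| child.label == "wait" }

# After: the same check against the children dataset, so the
# filtering happens in the database.
nap 5 unless strand.children_dataset.exclude(label: "wait").empty?
```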

For Prog::Vnet::LoadBalancerNexus#wait_update_vm_load_balancers,
I removed a check on the children completely. It was checking
for an exitval using children_dataset directly after reap,
which should only be true if there was still an active lease
for the child.  This also deserves careful review.

This broke many mocked tests.  The commit also updates those tests
to use database-backed objects, ensuring that we are testing
observable behavior rather than implementation details.
2025-06-26 03:49:53 +09:00

99 lines
2.5 KiB
Ruby

# frozen_string_literal: true

class Prog::Kubernetes::UpgradeKubernetesNode < Prog::Base
  subject_is :kubernetes_cluster

  def old_vm
    @old_vm ||= Vm[frame.fetch("old_vm_id")]
  end

  def new_vm
    @new_vm ||= Vm[frame.fetch("new_vm_id")]
  end

  def kubernetes_nodepool
    @kubernetes_nodepool ||= KubernetesNodepool[frame.fetch("nodepool_id", nil)]
  end

  def before_run
    if kubernetes_cluster.strand.label == "destroy" && strand.label != "destroy"
      reap { pop "upgrade cancelled" }
    end
  end

  label def start
    new_frame = if kubernetes_nodepool
      {"nodepool_id" => kubernetes_nodepool.id}
    else
      {}
    end

    bud Prog::Kubernetes::ProvisionKubernetesNode, new_frame

    hop_wait_new_node
  end
  label def wait_new_node
    vm_id = nil
    reaper = lambda do |child|
      vm_id = child.exitval.fetch("vm_id")
    end

    reap(reaper:) do
      current_frame = strand.stack.first
      # This will not work correctly if the strand has multiple children.
      # However, the strand only has a single child, created in start.
      current_frame["new_vm_id"] = vm_id
      strand.modified!(:stack)
      hop_drain_old_node
    end
  end
  label def drain_old_node
    register_deadline("remove_old_node_from_cluster", 60 * 60)

    vm = kubernetes_cluster.cp_vms.last
    case vm.sshable.d_check("drain_node")
    when "Succeeded"
      hop_remove_old_node_from_cluster
    when "NotStarted"
      vm.sshable.d_run("drain_node", "sudo", "kubectl", "--kubeconfig=/etc/kubernetes/admin.conf",
        "drain", old_vm.name, "--ignore-daemonsets", "--delete-emptydir-data")
      nap 10
    when "InProgress"
      nap 10
    when "Failed"
      vm.sshable.d_restart("drain_node")
      nap 10
    end

    nap 60 * 60
  end
  label def remove_old_node_from_cluster
    if kubernetes_nodepool
      kubernetes_nodepool.remove_vm(old_vm)
    else
      kubernetes_cluster.remove_cp_vm(old_vm)
      kubernetes_cluster.api_server_lb.detach_vm(old_vm)
    end

    # kubeadm reset is necessary for etcd member removal; deleting the node
    # object by itself doesn't remove the node from the etcd membership,
    # hurting etcd cluster health.
    old_vm.sshable.cmd("sudo kubeadm reset --force")

    hop_delete_node_object
  end
  label def delete_node_object
    res = kubernetes_cluster.client.delete_node(old_vm.name)
    fail "delete node object failed: #{res}" unless res.exitstatus.zero?

    hop_destroy_node
  end

  label def destroy_node
    old_vm.incr_destroy
    pop "upgraded node"
  end
end