mirror of
https://github.com/ubicloud/ubicloud.git
synced 2025-11-28 08:30:27 +08:00
In production we had several incidents that caused host unavailability after the following were logged in syslog:

- A per-cpu kworker processing work on the wrong CPU,
- The internal state of a worker pool being corrupted: all kworkers in the pool are idle, but the number of running kworkers in the pool is non-zero.

We think the state corruption is a consequence of a kworker in the pool having been executed on the wrong CPU.

As a consequence of the worker pool state corruption, some queued deferred work items were never processed, which caused:

- IO stalls,
- VM stop timeouts,
- SSH authentication timeouts,
- ...

After reviewing the evidence in the logs and the Linux 6.8.0-87 source code, our current theory is that there is a race condition between kworker creation and cgroup creation/deletion.

The cgroup side of this race is `update_tasks_cpumask` in `kernel/cgroup/cpuset.c`, which updates the allowed cpumask of tasks after a `root` CPU partition has been created or deleted. When that happens, the list of CPUs available to the system cgroup and other cgroups changes. In rare cases, `update_tasks_cpumask` might migrate a kworker that is still being created to the wrong CPU.

This commit tries to work around the race condition by setting `cpuset.cpus.partition` to `member` instead of `root`. With `member`, the list of CPUs available to other cgroups does not change and `update_tasks_cpumask` is not called, which avoids the described race condition.

The downside of using `member` is that other services might use a bit of the VM's CPU, which is acceptable if we get a more stable system.

Note that the described race condition is only our current understanding of the issue, and we might still have missed a few things. Therefore, we might revert this commit and try other workarounds if we still see the issue after this change.
| File |
|---|
| slice_setup_e2e_spec.rb |
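
The workaround described in the commit message comes down to which value is written to a slice's `cpuset.cpus.partition` file. The following is a minimal sketch of that idea, not the code from this commit: the helper name `configure_slice_cpuset`, its parameters, and the `example.slice` name are hypothetical, and it assumes a cgroup v2 hierarchy (normally mounted at `/sys/fs/cgroup`) with the cpuset controller enabled for the parent. The `cgroup_root` parameter exists only so the sketch can be exercised against a temporary directory.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper (not Ubicloud's actual implementation).
# Writes the CPU list and partition type for a slice's cgroup directory.
# Assumes cgroup v2 and that the cpuset controller is enabled in the parent.
def configure_slice_cpuset(slice_name, cpus, partition: "member", cgroup_root: "/sys/fs/cgroup")
  # cgroup v2 accepts these partition types for cpuset.cpus.partition.
  unless %w[member root isolated].include?(partition)
    raise ArgumentError, "unknown partition type: #{partition}"
  end

  slice_dir = File.join(cgroup_root, slice_name)
  FileUtils.mkdir_p(slice_dir)

  # The CPUs the slice is allowed to use, e.g. "2-3".
  File.write(File.join(slice_dir, "cpuset.cpus"), cpus)

  # "member" keeps the slice a regular member of its parent's partition,
  # so the CPUs available to sibling cgroups are left unchanged.
  # "root" would carve the CPUs out of the parent instead.
  File.write(File.join(slice_dir, "cpuset.cpus.partition"), partition)

  slice_dir
end

# Dry run against a temporary directory instead of the real cgroup filesystem.
Dir.mktmpdir do |root|
  dir = configure_slice_cpuset("example.slice", "2-3", cgroup_root: root)
  puts File.read(File.join(dir, "cpuset.cpus.partition"))
end
```

Defaulting `partition` to `member` mirrors the trade-off stated in the commit message: the slice's CPUs are no longer exclusive, so other services may run on them, but the CPU set seen by other cgroups stays stable.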