ubicloud

mirror of https://github.com/ubicloud/ubicloud.git synced 2025-11-28 08:30:27 +08:00

History

slice_setup_e2e_spec.rb	Attempt to work around the Linux kernel race condition.	2025-11-14 09:14:04 -08:00

Hadi Moshayedi 943193ef28 Attempt to work around the Linux kernel race condition.

In production we had several incidents in which hosts became unavailable
after the following was logged in syslog:

- A per-CPU kworker processing work on the wrong CPU,
- The internal state of a worker pool was corrupted: all kworkers in the
  pool are idle, but the number of running kworkers in the pool is
  non-zero.

We think the state corruption is a consequence of a kworker in the pool
having run on the wrong CPU.
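
One way to look for the first symptom on a live host is to compare the
CPU encoded in each per-CPU kworker's name with the CPU it last ran on.
The sketch below is not part of this commit. It assumes the usual
`kworker/<cpu>:<id>` naming for per-CPU kworkers and reads field 39
(`processor`) of `/proc/<pid>/stat`. The method names are hypothetical,
and the result is only a point-in-time sample.

```ruby
# Hypothetical helper: given one line from /proc/<pid>/stat, return
# [bound_cpu, last_cpu] for a per-CPU kworker, or nil for any other task.
def kworker_cpus(stat_line)
  open_paren = stat_line.index("(")
  close_paren = stat_line.rindex(")")
  return nil unless open_paren && close_paren

  comm = stat_line[(open_paren + 1)...close_paren]
  # Per-CPU kworkers are named "kworker/<cpu>:<id>"; unbound ones use
  # "kworker/u<pool>:<id>" and do not match this pattern.
  match = comm.match(%r{\Akworker/(\d+):\d+})
  return nil unless match

  # Fields after the comm start at field 3 (state), so field 39
  # (processor, the CPU the task last ran on) sits at index 36.
  fields = stat_line[(close_paren + 1)..].split
  return nil if fields.length < 37

  [match[1].to_i, fields[36].to_i]
end

# Hypothetical scan: list [stat_path, bound_cpu, last_cpu] for every
# per-CPU kworker whose last-run CPU differs from the CPU in its name.
def misplaced_kworkers(proc_dir = "/proc")
  Dir.glob(File.join(proc_dir, "[0-9]*", "stat")).filter_map do |path|
    cpus = begin
      kworker_cpus(File.read(path))
    rescue SystemCallError
      nil # the task exited between the glob and the read
    end
    [path, *cpus] if cpus && cpus[0] != cpus[1]
  end
end
```

An empty result from `misplaced_kworkers` does not rule the problem out,
since a kworker may only briefly run on the wrong CPU.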

As a consequence of the worker pool state corruption, some queued
deferred work items were never processed, which caused:

- IO stalls,
- VM stop timeouts,
- SSH authentication timeouts,
- ...

After reviewing the evidence in logs and Linux 6.8.0-87's source code,
our current theory of why this happens is that there is a race condition
between kworker creation and cgroup creation/deletion.

The cgroup side of this race condition is `update_tasks_cpumask` in
`kernel/cgroup/cpuset.c`, which updates the allowed cpumask of tasks
after a `root` CPU partition has been created or deleted. In this case,
the list of CPUs available to the system cgroup and other cgroups
changes.
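
To make the changing CPU list concrete, here is a toy model of the
arithmetic. It is not kernel code: the method name and the 8-CPU host
are assumptions for illustration, and the only behavior it encodes is
that CPUs claimed by a `root` partition are removed from what the other
cgroups may use.

```ruby
# Hypothetical model, not kernel code: CPUs left for the system cgroup and
# other cgroups, given the CPU lists claimed by active root partitions.
def cpus_outside_root_partitions(host_cpus, root_partitions)
  host_cpus - root_partitions.flatten
end

host = (0..7).to_a
before = cpus_outside_root_partitions(host, [])        # no partition yet
during = cpus_outside_root_partitions(host, [[2, 3]])  # partition created
after  = cpus_outside_root_partitions(host, [])        # partition deleted

puts before.inspect # [0, 1, 2, 3, 4, 5, 6, 7]
puts during.inspect # [0, 1, 4, 5, 6, 7]
puts after.inspect  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Each transition in this model (creation and deletion) corresponds to a
point where the allowed cpumask of existing tasks has to be updated.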

In rare cases and under a race condition, `update_tasks_cpumask` might
migrate a kworker that is just being created to the wrong CPU.

This commit tries to work around this race condition by setting
`cpuset.cpus.partition` to `member` instead of `root`. In this case,
the list of CPUs available to other cgroups won't change and
`update_tasks_cpumask` won't be called, avoiding the described race
condition.
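
As a rough illustration of the setting involved, the sketch below writes
the two cgroup v2 cpuset interface files for a slice. It is not the code
changed by this commit: the method name, its arguments, and the example
path are hypothetical. It also assumes the cgroup directory already
exists with the cpuset controller enabled.

```ruby
# Hypothetical helper: write cpuset settings into a slice's cgroup directory.
# "cpuset.cpus" and "cpuset.cpus.partition" are cgroup v2 interface files;
# the method name and directory layout are assumptions for illustration.
def configure_slice_cpuset(cgroup_dir, cpus, partition: "member")
  # Only the two values discussed in this commit message are accepted here.
  unless %w[member root].include?(partition)
    raise ArgumentError, "unsupported partition type: #{partition}"
  end

  File.write(File.join(cgroup_dir, "cpuset.cpus"), cpus)
  File.write(File.join(cgroup_dir, "cpuset.cpus.partition"), partition)
end
```

A call such as `configure_slice_cpuset("/sys/fs/cgroup/example.slice", "2-3")`
(the path is made up) would leave the slice as a `member`, so its CPUs
are not taken away from the other cgroups.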

The downside of using `member` is that other services might use a bit of
the VM's CPU, which is acceptable if we get a more stable system.

Note that the described race condition is just our current understanding
of the issue, and we might still have missed a few things.
Therefore, we might revert this commit and try other workarounds if we
still see the issue after this change.

2025-11-14 09:14:04 -08:00
