ubicloud/prog/learn_cpu.rb
Hadi Moshayedi 4e35b475a1 Fix learning total_dies in arm64.
We used to always determine the number of CPU dies by counting unique
values in `/sys/devices/system/cpu/cpu*/topology/die_id` files. This
method works on x64 systems but is ineffective on ARM64.

The `topology_die_id` function is defined for x64 architectures in
`arch/x86/include/asm/topology.h` but is not implemented for ARM64 in
`arch/arm64/include/asm/topology.h`.

In Linux kernel 5.15 (used in Ubuntu 22.04), the `die_id` attribute is
exposed with a value of -1 if `topology_die_id` is not defined for the
architecture. Therefore, the `die_id` file for ARM64 consistently has
the value -1, causing our method of counting unique values to always
return 1 on Ubuntu 22.04, regardless of the actual number of dies. This
can cause issues if the number of sockets is more than 1, since our code
assumes `total_dies` is a multiple of `total_sockets`.

Starting with Linux kernel 5.17, a change was introduced [1] to expose
the `die_id` file only if `topology_die_id` is defined for the
architecture. Consequently, in Linux kernel 6.8 (used in Ubuntu 24.04),
this file is absent for ARM64 systems. As a result, our method of
counting unique values now produces 0 on Ubuntu 24.04.
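
The two failure modes described above can be reproduced without a real host. The following is a minimal sketch, not the project's code: `count_unique_die_ids` is a hypothetical helper that mirrors the `split("\n")` and `uniq.count` steps, and the input strings are assumed stand-ins for what reading the `die_id` files would yield in each case.

```ruby
# Hypothetical helper: counts distinct die_id values in the text produced
# by reading every cpu*/topology/die_id file, one value per line.
def count_unique_die_ids(cat_output)
  cat_output.split("\n").uniq.count
end

# Assumed arm64 host on kernel 5.15: every CPU reports die_id -1,
# so the count collapses to 1 no matter how many dies exist.
puts count_unique_die_ids("-1\n-1\n-1\n-1\n") # prints 1

# Assumed arm64 host on kernel 6.8: no die_id values are read at all
# (modeled here as empty output), so the count is 0.
puts count_unique_die_ids("") # prints 0

# Assumed x64 host with two dies: the count matches the real topology.
puts count_unique_die_ids("0\n0\n1\n1\n") # prints 2
```

Neither 1 nor 0 is guaranteed to be a multiple of the socket count on a multi-socket arm64 host, which is why the count cannot be trusted there.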

Given the lack of a straightforward way to determine the number of dies
on ARM64 systems, this patch sets `total_dies` equal to `total_sockets`,
assuming one die per socket.

[1] https://github.com/torvalds/linux/commit/2c4dcd7
2024-11-22 12:08:50 -08:00

41 lines
1.2 KiB
Ruby

# frozen_string_literal: true

class Prog::LearnCpu < Prog::Base
  subject_is :sshable

  CpuTopology = Struct.new(:total_cpus, :total_cores, :total_dies, :total_sockets, keyword_init: true)

  def get_arch
    arch = sshable.cmd("common/bin/arch").strip
    fail "BUG: unexpected CPU architecture" unless ["arm64", "x64"].include?(arch)
    arch
  end

  def get_topology
    s = sshable.cmd("/usr/bin/lscpu -Jye")
    parsed = JSON.parse(s).fetch("cpus").map { |cpu|
      [cpu.fetch("socket"), cpu.fetch("core")]
    }
    cpus = parsed.count
    sockets = parsed.map { |socket, _| socket }.uniq.count
    cores = parsed.uniq.count
    CpuTopology.new(total_cpus: cpus, total_cores: cores, total_dies: 0,
      total_sockets: sockets)
  end

  def count_dies(arch:, total_sockets:)
    # Linux kernel doesn't provide die_id information for arm64.
    return total_sockets if arch == "arm64"

    die_ids = sshable.cmd("cat /sys/devices/system/cpu/cpu*/topology/die_id").split("\n")
    die_ids.uniq.count
  end

  label def start
    arch = get_arch
    topo = get_topology
    topo.total_dies = count_dies(total_sockets: topo.total_sockets, arch: arch)
    pop(arch: arch, **topo.to_h)
  end
end
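
To see how `get_topology` derives its counts, the parsing step can be run on its own against sample data. This is a minimal sketch under stated assumptions: `summarize_topology` and `sample_lscpu_json` are hypothetical names, and the JSON is made-up data that keeps only the `cpus` array and the `socket` and `core` fields the program reads; real `lscpu -Jye` output may carry more fields.

```ruby
require "json"

# Hypothetical stand-alone version of the counting done in get_topology.
# Each CPU becomes a [socket, core] pair; distinct pairs are physical cores.
def summarize_topology(lscpu_json)
  pairs = JSON.parse(lscpu_json).fetch("cpus").map { |cpu|
    [cpu.fetch("socket"), cpu.fetch("core")]
  }
  {
    total_cpus: pairs.count,
    total_cores: pairs.uniq.count,
    total_sockets: pairs.map { |socket, _| socket }.uniq.count
  }
end

# Made-up sample: 2 sockets, 2 cores per socket, 2 hardware threads per core.
def sample_lscpu_json
  cpus = (0...8).map { |i| {"cpu" => i, "socket" => i / 4, "core" => i / 2} }
  JSON.generate("cpus" => cpus)
end

p summarize_topology(sample_lscpu_json)
```

For this sample the counts come out to 8 CPUs, 4 cores, and 2 sockets. Counting distinct `[socket, core]` pairs rather than distinct `core` values alone keeps the result correct even if core IDs were to repeat across sockets.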