mirror of
https://github.com/ubicloud/ubicloud.git
synced 2025-10-05 06:12:09 +08:00
22f3aa5c1d
"Terminate SSH sessions promptly and their processes", by yours truly, introduced many problems. It's useful to be lazy about returning SSH `keepalive@openssh.com` packets (note: not the same as TCP keepalive) while maintaining a session, and that was broken by that patch. At the same time, lingering sessions without any interlock theory are bad, and I don't want to go back to that.

Back when writing 22f3aa5c1, I thought that tight timing there could be used as a form of lease-style mutual exclusion, but in retrospect, it would either be *too* tightly timed or our retry interval would have to become much longer.

To resolve this, I intend to add mutual exclusion recorded on the server, costing some overhead with every established connection. My experiments with `hyperfine` suggest 3ms of overhead or less. See the bottom of this message for a benchmark script and its output on my laptop.

In this patch, I am taking the first step to introduce opt-in mutual exclusion to sessions via a constant/global called `SSH_SESSION_LOCK_NAME`. By default, this does nothing until `SSH_SESSION_LOCK_NAME` is defined. This can be set differently for different processes; e.g., `respirate` may bind this symbol but `monitor` may bind it differently, or not at all.

Even with this symbol set, this implementation is nearly a no-op, to test overhead and measure status quo cases of accidental(?) session concurrency where crashing upon mutual exclusion could be disruptive. Instead, it only logs a message when there would have been session lock contention. Eventually, once we have a handle on everything, the `Clog.emit` can be converted to raise.

A way to test this against a `Sshable`:

```
sa = Sshable.first

# Demonstrate what happens without SSH_SESSION_LOCK_NAME set.
p ['precondition: no locking', sa.cmd('pgrep -af session || true')]
sa.invalidate_cache_entry
p ['postcondition: no locking', sa.cmd('pgrep -af session || true')]
sa.invalidate_cache_entry

SSH_SESSION_LOCK_NAME = "test-lock"

# Should display the session flock holding process, which is
# automatically started when the session begins.
p ['precondition: locking', sa.cmd('pgrep -af session')]

# Invalidating the Sshable cache entry creates a second concurrent
# session upon the next cmd.
sa.invalidate_cache_entry

# This should display the contention logging as a side effect and
# show the same session flock process pid.
p ['postcondition: locking', sa.cmd('pgrep -af session')]
```

At the last part, where you expect to see a message as a side effect, it should look like this:

```
....
["precondition: locking", "21256 session-lock-test-lock infinity\n"]
....
{"contended_session_lock":{"exit_code":124,"session_fail_msg":"session lock conflict for test-lock"},"message":"session lock failure","time":"2025-09-08 15:31:37 -0700"}
....
["postcondition: locking", "21256 session-lock-test-lock infinity\n"]
```

Note that the pid (21256) doesn't change, and note the `contended_session_lock` key. You can use `pgrep` to debug the session locking interactively. You might also find the distinctive file descriptor number useful, e.g., `ls /proc/*/fd/999`.

Here's a benchmarking program using `hyperfine` with a closely related version to measure overhead. You can put it into a file (e.g., `bench.bash`) and execute it. It has modifications of exit codes to help ensure the benchmark is doing something meaningful:

```
set -uex

cat > prepare.sh << 'EOF'
#!/usr/bin/bash
pkill -f sessionlockasdf 2>/dev/null || true
flock /dev/shm/session_lockfile true
EOF

cat > locking.sh << 'EOF'
#!/usr/bin/bash
exec 999>/dev/shm/session_lockfile || exit 1
flock -xn 999 || { echo "Another session active."; exit 1; }
exec -a sessionlockasdf sleep infinity </dev/null >/dev/null 2>&1 &
disown
EOF

cat > noop.sh << 'EOF'
#!/usr/bin/bash
exit 0
EOF

chmod +x prepare.sh locking.sh noop.sh

hyperfine -N \
  --prepare './prepare.sh' \
  --command-name "Session Lock" './locking.sh' \
  --command-name "Noop" './noop.sh' \
  --cleanup './prepare.sh'

rm -f prepare.sh locking.sh noop.sh
rm -f /dev/shm/session_lockfile 2>/dev/null || true
```

On my laptop, it has output like this:

```
+ cat
+ cat
+ cat
+ chmod +x prepare.sh locking.sh noop.sh
+ hyperfine -N --prepare ./prepare.sh --command-name 'Session Lock' ./locking.sh --command-name Noop ./noop.sh --cleanup ./prepare.sh
Benchmark 1: Session Lock
  Time (mean ± σ):       2.0 ms ±   0.2 ms    [User: 0.7 ms, System: 1.2 ms]
  Range (min … max):     1.5 ms …   2.3 ms    110 runs

Benchmark 2: Noop
  Time (mean ± σ):     928.3 µs ± 126.4 µs    [User: 386.7 µs, System: 445.3 µs]
  Range (min … max):   579.7 µs … 1221.6 µs    154 runs

Summary
  Noop ran 2.11 ± 0.34 times faster than Session Lock
+ rm -f prepare.sh locking.sh noop.sh
+ rm -f /dev/shm/session_lockfile
```
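The opt-in behavior described above keys off whether the current process has bound the constant at all, probed with `Kernel#defined?`. A minimal standalone sketch of that pattern (the method and constant names mirror the real ones, but this runs outside the `Sshable` model):

```
# Returns the lock name only if this process has opted in by
# binding the SSH_SESSION_LOCK_NAME constant; otherwise nil.
def maybe_ssh_session_lock_name
  SSH_SESSION_LOCK_NAME if defined?(SSH_SESSION_LOCK_NAME)
end

p maybe_ssh_session_lock_name # nil: constant unbound, locking stays off

# A process such as respirate can opt in by defining the constant.
SSH_SESSION_LOCK_NAME = "test-lock"
p maybe_ssh_session_lock_name # "test-lock": locking now engages
```

Because the probe is per-process, `respirate` and `monitor` can make different choices without any configuration plumbing.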
224 lines
6 KiB
Ruby
```
# frozen_string_literal: true

require "net/ssh"
require_relative "../model"

class Sshable < Sequel::Model
  # We need to unrestrict primary key so Sshable.new(...).save_changes works
  # in sshable_spec.rb.
  unrestrict_primary_key

  plugin ResourceMethods, encrypted_columns: [:raw_private_key_1, :raw_private_key_2]

  SSH_CONNECTION_ERRORS = [
    Net::SSH::Disconnect,
    Net::SSH::ConnectionTimeout,
    Errno::ECONNRESET,
    Errno::ECONNREFUSED,
    IOError
  ].freeze

  class SshError < StandardError
    attr_reader :stdout, :stderr, :exit_code, :exit_signal

    def initialize(cmd, stdout, stderr, exit_code, exit_signal)
      @exit_code = exit_code
      @exit_signal = exit_signal
      @stdout = stdout
      @stderr = stderr
      super("command exited with an error: " + cmd)
    end
  end

  def keys
    [raw_private_key_1, raw_private_key_2].compact.map {
      SshKey.from_binary(it)
    }
  end

  def self.repl?
    REPL
  end

  def repl?
    self.class.repl?
  end

  def cmd(cmd, stdin: nil, log: true)
    start = Time.now
    stdout = StringIO.new
    stderr = StringIO.new
    exit_code = nil
    exit_signal = nil
    channel_duration = nil

    begin
      connect.open_channel do |ch|
        channel_duration = Time.now - start
        ch.exec(cmd) do |ch, success|
          ch.on_data do |ch, data|
            $stderr.write(data) if repl?
            stdout.write(data)
          end

          ch.on_extended_data do |ch, type, data|
            $stderr.write(data) if repl?
            stderr.write(data)
          end

          ch.on_request("exit-status") do |ch2, data|
            exit_code = data.read_long
          end

          ch.on_request("exit-signal") do |ch2, data|
            exit_signal = data.read_long
          end

          ch.send_data stdin
          ch.eof!
          ch.wait
        end
      end.wait
    rescue
      invalidate_cache_entry
      raise
    end

    stdout_str = stdout.string.freeze
    stderr_str = stderr.string.freeze

    if log
      Clog.emit("ssh cmd execution") do
        finish = Time.now
        embed = {start:, finish:, cmd:, exit_code:, exit_signal:, ubid:, duration: finish - start}

        # Suppress large outputs to avoid annoyance in duplication
        # when in the REPL. In principle, the user of the REPL could
        # read the Clog output and the feature of printing output in
        # real time to $stderr could be removed, but when supervising
        # a tty, I've found it can be useful to see data arrive in
        # real time from SSH.
        unless repl?
          embed[:stderr] = stderr_str
          embed[:stdout] = stdout_str
        end
        embed[:channel_duration] = channel_duration
        embed[:connect_duration] = @connect_duration if @connect_duration
        {ssh: embed}
      end
    end

    fail SshError.new(cmd, stdout_str, stderr_str, exit_code, exit_signal) unless exit_code.zero?
    stdout_str
  end

  def d_check(unit_name)
    cmd("common/bin/daemonizer2 check #{unit_name.shellescape}")
  end

  def d_clean(unit_name)
    cmd("common/bin/daemonizer2 clean #{unit_name.shellescape}")
  end

  def d_run(unit_name, *run_command, stdin: nil, log: true)
    cmd("common/bin/daemonizer2 run #{unit_name.shellescape} #{Shellwords.join(run_command)}", stdin:, log:)
  end

  def d_restart(unit_name)
    cmd("common/bin/daemonizer2 restart #{unit_name.shellescape}")
  end

  # A huge number of settings are needed to isolate net-ssh from the
  # host system and provide some anti-hanging assurance (keepalive,
  # timeout).
  COMMON_SSH_ARGS = {non_interactive: true, timeout: 10,
    user_known_hosts_file: [], global_known_hosts_file: [],
    verify_host_key: :accept_new, keys: [], key_data: [], use_agent: false,
    keepalive: true, keepalive_interval: 3, keepalive_maxcount: 5}.freeze

  def maybe_ssh_session_lock_name
    SSH_SESSION_LOCK_NAME if defined?(SSH_SESSION_LOCK_NAME)
  end

  def connect
    Thread.current[:clover_ssh_cache] ||= {}

    # Cache hit.
    if (sess = Thread.current[:clover_ssh_cache][[host, unix_user]])
      return sess
    end

    # Cache miss.
    start = Time.now
    sess = start_fresh_session
    @connect_duration = Time.now - start
    Thread.current[:clover_ssh_cache][[host, unix_user]] = sess

    if (lock_name = maybe_ssh_session_lock_name&.shellescape)
      lock_contents = <<LOCK
exec 999>/dev/shm/session-lock-#{lock_name} || exit 92
flock -xn 999 || { echo "Another session active: " #{lock_name}; exit 124; }
exec -a session-lock-#{lock_name} sleep infinity </dev/null >/dev/null 2>&1 &
disown
LOCK

      begin
        cmd(lock_contents, log: false)
      rescue SshError => ex
        session_fail_msg = case (exit_code = ex.exit_code)
        when 92
          "could not create session lock file for #{lock_name}"
        when 124
          "session lock conflict for #{lock_name}"
        else
          "unknown SshError"
        end

        Clog.emit("session lock failure") do
          {contended_session_lock: {exit_code:, session_fail_msg:}}
        end
      end
    end

    sess
  end

  def start_fresh_session(&block)
    Net::SSH.start(host, unix_user, **COMMON_SSH_ARGS, key_data: keys.map(&:private_key), &block)
  end

  def invalidate_cache_entry
    Thread.current[:clover_ssh_cache]&.delete([host, unix_user])
  end

  def available?
    cmd("true") && true
  rescue *SSH_CONNECTION_ERRORS
    false
  end

  def self.reset_cache
    return [] unless (cache = Thread.current[:clover_ssh_cache])

    cache.filter_map do |key, sess|
      sess.close
      nil
    rescue => e
      e
    ensure
      cache.delete(key)
    end
  end
end

# Table: sshable
# Columns:
#  id                | uuid | PRIMARY KEY
#  host              | text |
#  raw_private_key_1 | text |
#  raw_private_key_2 | text |
#  unix_user         | text | NOT NULL DEFAULT 'rhizome'::text
# Indexes:
#  sshable_pkey     | PRIMARY KEY btree (id)
#  sshable_host_key | UNIQUE btree (host)
# Referenced By:
#  vm_host | vm_host_id_fkey | (id) REFERENCES sshable(id)
```