mirror of
https://github.com/ubicloud/ubicloud.git
synced 2025-10-05 22:31:57 +08:00
22f3aa5c1d
"Terminate SSH sessions promptly and their processes", by yours truly, introduced many problems. It's useful to be lazy about returning SSH `keepalive@openssh.com` packets (note: not the same as TCP keepalive) while maintaining a session, and that was broken by that patch. At the same time, lingering sessions without any interlock theory are bad, and I don't want to go back to that. Back when writing22f3aa5c1
, I thought that tight timing there could be used as a form of lease-style mutual exclusion, but in retrospect, it would either be *too* tightly timed or our retry interval would have to become much longer. To resolve this, I intend to add mutual exclusion recorded on the server, costing some overhead with every established connection. My experiments with `hyperfine` suggest 3ms of overhead or less. See the bottom of this message for a benchmark script and its output on my laptop. In this patch, I am taking the first step to introduce opt-in mutual exclusion to sessions via a constant/global called `SSH_SESSION_LOCK_NAME`. By default, this does nothing until `SSH_SESSION_LOCK_NAME` is defined. This can be set differently for different processes; e.g., `respirate` may bind this symbol but `monitor` may bind it differently, or not at all. Even with this symbol set, this implementation is nearly a no-op, to test overhead and measure status quo cases of accidental(?) session concurrency where crashing upon mutual exclusion could be disruptive. Instead, it only logs a message when there would have been session lock contention. Eventually, once we have a handle on everything, the `Clog.emit` can be converted to raise. A way to test this against a `Sshable`: sa = Sshable.first # Demonstrate what happens without SSH_SESSION_LOCK_NAME set. p ['precondition: no locking', sa.cmd('pgrep -af session || true')] sa.invalidate_cache_entry p ['postcondition: no locking', sa.cmd('pgrep -af session || true')] sa.invalidate_cache_entry SSH_SESSION_LOCK_NAME = "test-lock" # Should display the session flock holding process, which is # automatically started when the session begins. p ['precondition: locking', sa.cmd('pgrep -af session')] # Invalidating the Sshable cache entry creates a second concurrent # session upon the next cmd. sa.invalidate_cache_entry # This should display the contention logging as a side effect and # show the same session flock process pid. 
p ['postcondition: locking', sa.cmd('pgrep -af session')] At the last part, where you expect to see a message as a side effect, it should look like this: .... ["precondition: locking", "21256 session-lock-test-lock infinity\n"] .... {"contended_session_lock":{"exit_code":124,"session_fail_msg":"session lock conflict for test-lock"},"message":"session lock failure","time":"2025-09-08 15:31:37 -0700"} .... ["postcondition: locking", "21256 session-lock-test-lock infinity\n"] Note that the pid (21256) doesn't change, and the `contended_session_lock` key. You can use `pgrep` to debug the session locking interactively. You might also find the distinctive file descriptor number useful, e.g., `ls /proc/*/fd/999`. Here's a benchmarking program using `hyperfine` with a closely related version to measure overhead. You can put it into a file (e.g., `bench.bash`) and execute it. It has modifications of exit codes to help ensure the benchmark is doing something meaningful: set -uex cat > prepare.sh << 'EOF' #!/usr/bin/bash pkill -f sessionlockasdf 2>/dev/null || true flock /dev/shm/session_lockfile true EOF cat > locking.sh << 'EOF' #!/usr/bin/bash exec 999>/dev/shm/session_lockfile || exit 1 flock -xn 999 || { echo "Another session active."; exit 1; } exec -a sessionlockasdf sleep infinity </dev/null >/dev/null 2>&1 & disown EOF cat > noop.sh << 'EOF' #!/usr/bin/bash exit 0 EOF chmod +x prepare.sh locking.sh noop.sh hyperfine -N \ --prepare './prepare.sh' \ --command-name "Session Lock" './locking.sh' \ --command-name "Noop" './noop.sh' \ --cleanup './prepare.sh' rm -f prepare.sh locking.sh noop.sh rm -f /dev/shm/session_lockfile 2>/dev/null || true On my laptop, it has output like this: ``` + cat + cat + cat + chmod +x prepare.sh locking.sh noop.sh + hyperfine -N --prepare ./prepare.sh --command-name 'Session Lock' ./locking.sh --command-name Noop ./noop.sh --cleanup ./prepare.sh Benchmark 1: Session Lock Time (mean ± σ): 2.0 ms ± 0.2 ms [User: 0.7 ms, System: 1.2 ms] Range 
(min … max): 1.5 ms … 2.3 ms 110 runs Benchmark 2: Noop Time (mean ± σ): 928.3 µs ± 126.4 µs [User: 386.7 µs, System: 445.3 µs] Range (min … max): 579.7 µs … 1221.6 µs 154 runs Summary Noop ran 2.11 ± 0.34 times faster than Session Lock + rm -f prepare.sh locking.sh noop.sh + rm -f /dev/shm/session_lockfile ```
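The opt-in behavior and the exit-code handling exercised by the spec below can be sketched roughly like this. This is a minimal illustrative sketch, not the actual implementation: `SessionLockSketch`, `maybe_lock_name`, and `session_fail_msg` are hypothetical names (the real method on `Sshable` is `maybe_ssh_session_lock_name`), but the nil-unless-defined lookup and the mapping of the lock script's distinctive exit codes (92 and 124) to log messages follow the expectations in the tests:

```ruby
# Hypothetical sketch of the opt-in lock-name lookup and the
# exit-code-to-message mapping; names are illustrative, not the real API.
module SessionLockSketch
  # Returns nil unless the opt-in constant is defined, mirroring
  # Sshable#maybe_ssh_session_lock_name in the spec below.
  def self.maybe_lock_name
    defined?(SSH_SESSION_LOCK_NAME) ? SSH_SESSION_LOCK_NAME : nil
  end

  # The lock script uses obscure exit codes so failures can be
  # distinguished: 92 when the lock file fd can't be opened, 124 when
  # flock -xn loses the race; anything else is an unrelated SSH error.
  def self.session_fail_msg(exit_code, lock_name)
    case exit_code
    when 92 then "could not create session lock file for #{lock_name}"
    when 124 then "session lock conflict for #{lock_name}"
    else "unknown SshError"
    end
  end
end
```

With `Clog.emit` currently only logging (not raising), `session_fail_msg` is what ends up under the `contended_session_lock` key shown in the output above.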
263 lines
9.8 KiB
Ruby
# frozen_string_literal: true

require_relative "spec_helper"

RSpec.describe Sshable do
  # Avoid using excessive entropy by using one generated key for all
  # tests.
  key = SshKey.generate.keypair.freeze

  subject(:sa) {
    described_class.new(
      id: described_class.generate_uuid,
      host: "test.localhost",
      unix_user: "testuser",
      raw_private_key_1: key
    )
  }

  it "can encrypt and decrypt a field" do
    sa.save_changes

    expect(sa.values[:raw_private_key_1] =~ /\AA[AgQ]..A/).not_to be_nil
    expect(sa.raw_private_key_1).to eq(key)
  end

  describe "#maybe_ssh_session_lock_name" do
    it "does not yield if SSH_SESSION_LOCK_NAME is not defined" do
      expect(sa.maybe_ssh_session_lock_name).to be_nil
    end

    unless ENV["CLOVER_FREEZE"]
      it "yields if SSH_SESSION_LOCK_NAME is defined" do
        stub_const("SSH_SESSION_LOCK_NAME", "testlockname")
        expect(sa.maybe_ssh_session_lock_name).to eq("testlockname")
      end
    end
  end

  describe "session locking" do
    lock_script = <<LOCK
exec 999>/dev/shm/session-lock-testlockname || exit 92
flock -xn 999 || { echo "Another session active: " testlockname; exit 124; }
exec -a session-lock-testlockname sleep infinity </dev/null >/dev/null 2>&1 &
disown
LOCK

    it "interlocks" do
      portable_pkill = lambda { system(%q(ps -eo pid,args | awk '$2=="session-lock-testlockname"{print $1}' | xargs -I {} sh -c 'test -n "{}" && kill {}')) }
      portable_pkill.call
      q_lock_script = lock_script.shellescape
      expect([`bash -c #{q_lock_script}`, $?.exitstatus]).to eq(["", 0])
      expect([`bash -c #{q_lock_script}`, $?.exitstatus]).to eq(["Another session active:  testlockname\n", 124])
      expect(portable_pkill.call).to be true
    end

    describe "exit code handling" do
      before do
        expect(sa).to receive(:maybe_ssh_session_lock_name).and_return("testlockname")
        sa.invalidate_cache_entry
        expect(Net::SSH).to receive(:start) do
          instance_double(Net::SSH::Connection::Session, close: nil)
        end
      end

      it "runs the session lock script if SSH_SESSION_LOCK_NAME is set" do
        expect(sa).to receive(:cmd).with(lock_script, log: false)
        sa.connect
      end

      it "reports a failure to obtain a file descriptor with an obscure exit code" do
        expect(sa).to receive(:cmd).with(lock_script, log: false).and_raise(Sshable::SshError.new(lock_script, "", "", 92, nil))
        expect(Clog).to receive(:emit).with("session lock failure").and_wrap_original do |m, a, &b|
          expect(b.call.dig(:contended_session_lock, :session_fail_msg)).to eq("could not create session lock file for testlockname")
        end
        sa.connect
      end

      it "reports lock conflicts when an obscure exit code is raised" do
        expect(sa).to receive(:cmd).with(lock_script, log: false).and_raise(Sshable::SshError.new(lock_script, "", "", 124, nil))
        expect(Clog).to receive(:emit).with("session lock failure").and_wrap_original do |m, a, &b|
          expect(b.call.dig(:contended_session_lock, :session_fail_msg)).to eq("session lock conflict for testlockname")
        end
        sa.connect
      end

      it "has a generic message for unrecognized errors" do
        expect(sa).to receive(:cmd).with(lock_script, log: false).and_raise(Sshable::SshError.new(lock_script, "", "", 1, nil))
        expect(Clog).to receive(:emit).with("session lock failure").and_wrap_original do |m, a, &b|
          expect(b.call.dig(:contended_session_lock, :session_fail_msg)).to eq("unknown SshError")
        end
        sa.connect
      end
    end
  end

  describe "caching" do
    # The cache is thread local, so re-set the thread state by boxing
    # each test in a new thread.
    around do |ex|
      Thread.new {
        ex.run
      }.join
    end

    it "can cache SSH connections" do
      expect(Net::SSH).to receive(:start) do
        instance_double(Net::SSH::Connection::Session, close: nil)
      end

      expect(Thread.current[:clover_ssh_cache]).to be_nil
      first_time = sa.connect
      expect(Thread.current[:clover_ssh_cache].size).to eq(1)
      second_time = sa.connect
      expect(first_time).to equal(second_time)

      expect(described_class.reset_cache).to eq []
      expect(Thread.current[:clover_ssh_cache]).to be_empty
    end

    it "does not crash if a cache has never been made" do
      expect {
        sa.invalidate_cache_entry
      }.not_to raise_error
    end

    it "can invalidate a single cache entry" do
      sess = instance_double(Net::SSH::Connection::Session, close: nil)
      expect(Net::SSH).to receive(:start).and_return sess
      sa.connect
      expect {
        sa.invalidate_cache_entry
      }.to change { Thread.current[:clover_ssh_cache] }.from({["test.localhost", "testuser"] => sess}).to({})
    end

    it "can reset caches when has cached connection" do
      sess = instance_double(Net::SSH::Connection::Session, close: nil)
      expect(Net::SSH).to receive(:start).and_return sess
      sa.connect
      expect {
        described_class.reset_cache
      }.to change { Thread.current[:clover_ssh_cache] }.from({["test.localhost", "testuser"] => sess}).to({})
    end

    it "can reset caches when has no cached connection" do
      expect(described_class.reset_cache).to eq([])
    end

    it "can reset caches even if session fails while closing" do
      sess = instance_double(Net::SSH::Connection::Session)
      expect(sess).to receive(:close).and_raise Sshable::SshError.new("bogus", "", "", nil, nil)
      expect(Net::SSH).to receive(:start).and_return sess
      sa.connect

      expect(described_class.reset_cache.first).to be_a Sshable::SshError
      expect(Thread.current[:clover_ssh_cache]).to eq({})
    end
  end

  describe "#cmd" do
    let(:session) { instance_double(Net::SSH::Connection::Session) }

    before do
      expect(sa).to receive(:connect).and_return(session).at_least(:once)
    end

    def simulate(cmd:, exit_status:, exit_signal:, stdout:, stderr:)
      expect(session).to receive(:open_channel) do |&blk|
        chan = instance_spy(Net::SSH::Connection::Channel)
        expect(chan).to receive(:exec).with(cmd) do |&blk|
          chan2 = instance_spy(Net::SSH::Connection::Channel)
          expect(chan2).to receive(:on_request).with("exit-status") do |&blk|
            buf = instance_double(Net::SSH::Buffer)
            expect(buf).to receive(:read_long).and_return(exit_status)
            blk.call(nil, buf)
          end

          expect(chan2).to receive(:on_request).with("exit-signal") do |&blk|
            buf = instance_double(Net::SSH::Buffer)
            expect(buf).to receive(:read_long).and_return(exit_signal)
            blk.call(nil, buf)
          end
          expect(chan2).to receive(:on_data).and_yield(instance_double(Net::SSH::Connection::Channel), stdout)
          expect(chan2).to receive(:on_extended_data).and_yield(nil, 1, stderr)

          blk.call(chan2, true)
        end
        blk.call(chan, true)
        chan
      end
    end

    it "can run a command" do
      [false, true].each do |repl_value|
        [false, true].each do |log_value|
          allow(described_class).to receive(:repl?).and_return(repl_value)
          if repl_value
            # Note that in the REPL, stdout and stderr get multiplexed
            # into stderr in real time, packet by packet.
            expect($stderr).to receive(:write).with("hello")
            expect($stderr).to receive(:write).with("world")
          end

          if log_value
            sa.instance_variable_set(:@connect_duration, 1.1)
            expect(Clog).to receive(:emit).with("ssh cmd execution") do |&blk|
              dat = blk.call
              if repl_value
                expect(dat[:ssh].slice(:stdout, :stderr)).to be_empty
              else
                expect(dat[:ssh].slice(:stdout, :stderr)).to eq({stdout: "hello", stderr: "world"})
              end
            end
          end
          simulate(cmd: "echo hello", exit_status: 0, exit_signal: nil, stdout: "hello", stderr: "world")
          expect(sa.cmd("echo hello", log: log_value)).to eq("hello")
        end
      end
    end

    it "raises an error with a non-zero exit status" do
      simulate(cmd: "exit 1", exit_status: 1, exit_signal: 127, stderr: "", stdout: "")
      expect { sa.cmd("exit 1") }.to raise_error Sshable::SshError, "command exited with an error: exit 1"
    end

    it "invalidates the cache if the session raises an error" do
      err = IOError.new("the party is over")
      expect(session).to receive(:open_channel).and_raise err
      expect(sa).to receive(:invalidate_cache_entry)
      expect { sa.cmd("irrelevant") }.to raise_error err
    end
  end

  describe "daemonizer methods" do
    let(:unit_name) { "test_unit" }
    let(:run_command) { "sudo host/bin/setup-vm prep test_unit" }
    let(:stdin_data) { "secret_data" }

    it "calls cmd with the correct check command" do
      expect(sa).to receive(:cmd).with("common/bin/daemonizer2 check test_unit")
      sa.d_check(unit_name)
    end

    it "calls cmd with the correct clean command" do
      expect(sa).to receive(:cmd).with("common/bin/daemonizer2 clean test_unit")
      sa.d_clean(unit_name)
    end

    it "calls cmd with the correct restart command" do
      expect(sa).to receive(:cmd).with("common/bin/daemonizer2 restart test_unit")
      sa.d_restart(unit_name)
    end

    it "calls cmd with the correct run command and no stdin" do
      expect(sa).to receive(:cmd).with("common/bin/daemonizer2 run test_unit sudo\\ host/bin/setup-vm\\ prep\\ test_unit", stdin: nil, log: true)
      sa.d_run(unit_name, run_command)
    end

    it "calls cmd with the correct run command and passes stdin" do
      expect(sa).to receive(:cmd).with("common/bin/daemonizer2 run test_unit sudo\\ host/bin/setup-vm\\ prep\\ test_unit", stdin: stdin_data, log: true)
      sa.d_run(unit_name, run_command, stdin: stdin_data)
    end
  end
end