ubicloud/spec/model/sshable_spec.rb
Daniel Farina c43d8ccd57 Add no-op version of SSH session locking
22f3aa5c1d "Terminate SSH sessions
promptly and their processes", by yours truly, introduced many
problems. It's useful to be lazy about returning SSH
`keepalive@openssh.com` packets (note: not the same as TCP keepalive)
while maintaining a session, and that was broken by that patch.

At the same time, lingering sessions without any interlock theory are
bad, and I don't want to go back to that.

Back when writing 22f3aa5c1, I thought that tight timing there could
be used as a form of lease-style mutual exclusion, but in retrospect,
it would either be *too* tightly timed or our retry interval would
have to become much longer.

To resolve this, I intend to add mutual exclusion recorded on the
server, costing some overhead with every established connection. My
experiments with `hyperfine` suggest 3ms of overhead or less. See the
bottom of this message for a benchmark script and its output on my
laptop.

In this patch, I am taking the first step to introduce opt-in mutual
exclusion to sessions via a constant/global called
`SSH_SESSION_LOCK_NAME`.

By default, this does nothing until `SSH_SESSION_LOCK_NAME` is
defined. This can be set differently for different processes; e.g.,
`respirate` may bind this symbol but `monitor` may bind it
differently, or not at all.

Even with this symbol set, this implementation is nearly a no-op, to
test overhead and measure status quo cases of accidental(?) session
concurrency where crashing upon mutual exclusion could be
disruptive. Instead, it only logs a message when there would have been
session lock contention.

Eventually, once we have a handle on everything, the `Clog.emit` can
be converted to raise.

A way to test this against a `Sshable`:

    sa = Sshable.first

    # Demonstrate what happens without SSH_SESSION_LOCK_NAME set.
    p ['precondition: no locking', sa.cmd('pgrep -af session || true')]
    sa.invalidate_cache_entry
    p ['postcondition: no locking', sa.cmd('pgrep -af session || true')]
    sa.invalidate_cache_entry

    SSH_SESSION_LOCK_NAME = "test-lock"

    # Should display the session flock holding process, which is
    # automatically started when the session begins.
    p ['precondition: locking', sa.cmd('pgrep -af session')]

    # Invalidating the Sshable cache entry creates a second concurrent
    # session upon the next cmd.
    sa.invalidate_cache_entry

    # This should display the contention logging as a side effect and
    # show the same session flock process pid.
    p ['postcondition: locking', sa.cmd('pgrep -af session')]

At the last part, where you expect to see a message as a side effect,
it should look like this:

    ....
    ["precondition: locking", "21256 session-lock-test-lock infinity\n"]
    ....
    {"contended_session_lock":{"exit_code":124,"session_fail_msg":"session lock conflict for test-lock"},"message":"session lock failure","time":"2025-09-08 15:31:37 -0700"}
    ....
    ["postcondition: locking", "21256 session-lock-test-lock infinity\n"]

Note that the pid (21256) doesn't change, and the
`contended_session_lock` key.

You can use `pgrep` to debug the session locking interactively. You
might also find the distinctive file descriptor number useful,
e.g., `ls /proc/*/fd/999`.

Here's a benchmarking program using `hyperfine` with a closely related
version to measure overhead. You can put it into a file
(e.g., `bench.bash`) and execute it. It has modifications of exit
codes to help ensure the benchmark is doing something meaningful:

    set -uex

    cat > prepare.sh << 'EOF'
    #!/usr/bin/bash
    pkill -f sessionlockasdf 2>/dev/null || true
    flock /dev/shm/session_lockfile true
    EOF

    cat > locking.sh << 'EOF'
    #!/usr/bin/bash
    exec 999>/dev/shm/session_lockfile || exit 1
    flock -xn 999 || { echo "Another session active."; exit 1; }
    exec -a sessionlockasdf sleep infinity </dev/null >/dev/null 2>&1 &
    disown
    EOF

    cat > noop.sh << 'EOF'
    #!/usr/bin/bash
    exit 0
    EOF

    chmod +x prepare.sh locking.sh noop.sh

    hyperfine -N \
        --prepare './prepare.sh' \
        --command-name "Session Lock" './locking.sh' \
        --command-name "Noop" './noop.sh' \
        --cleanup './prepare.sh'

    rm -f prepare.sh locking.sh noop.sh
    rm -f /dev/shm/session_lockfile 2>/dev/null || true

On my laptop, it has output like this:

```
+ cat
+ cat
+ cat
+ chmod +x prepare.sh locking.sh noop.sh
+ hyperfine -N --prepare ./prepare.sh --command-name 'Session Lock' ./locking.sh --command-name Noop ./noop.sh --cleanup ./prepare.sh
Benchmark 1: Session Lock
  Time (mean ± σ):       2.0 ms ±   0.2 ms    [User: 0.7 ms, System: 1.2 ms]
  Range (min … max):     1.5 ms …   2.3 ms    110 runs

Benchmark 2: Noop
  Time (mean ± σ):     928.3 µs ± 126.4 µs    [User: 386.7 µs, System: 445.3 µs]
  Range (min … max):   579.7 µs … 1221.6 µs    154 runs

Summary
  Noop ran
    2.11 ± 0.34 times faster than Session Lock
+ rm -f prepare.sh locking.sh noop.sh
+ rm -f /dev/shm/session_lockfile
```
2025-09-18 20:01:15 -07:00

263 lines
9.8 KiB
Ruby

# frozen_string_literal: true
require_relative "spec_helper"
RSpec.describe Sshable do
# Avoid using excessive entropy by using one generated key for all
# tests.
key = SshKey.generate.keypair.freeze
subject(:sa) {
described_class.new(
id: described_class.generate_uuid,
host: "test.localhost",
unix_user: "testuser",
raw_private_key_1: key
)
}
it "can encrypt and decrypt a field" do
sa.save_changes
expect(sa.values[:raw_private_key_1] =~ /\AA[AgQ]..A/).not_to be_nil
expect(sa.raw_private_key_1).to eq(key)
end
describe "#maybe_ssh_session_lock_name" do
it "does not yield if SSH_SESSION_LOCK_NAME is not defined" do
expect(sa.maybe_ssh_session_lock_name).to be_nil
end
unless ENV["CLOVER_FREEZE"]
it "yields if SSH_SESSION_LOCK_NAME is defined" do
stub_const("SSH_SESSION_LOCK_NAME", "testlockname")
expect(sa.maybe_ssh_session_lock_name).to eq("testlockname")
end
end
end
describe "session locking" do
lock_script = <<LOCK
exec 999>/dev/shm/session-lock-testlockname || exit 92
flock -xn 999 || { echo "Another session active: " testlockname; exit 124; }
exec -a session-lock-testlockname sleep infinity </dev/null >/dev/null 2>&1 &
disown
LOCK
it "interlocks" do
portable_pkill = lambda { system(%q(ps -eo pid,args | awk '$2=="session-lock-testlockname"{print $1}' | xargs -I {} sh -c 'test -n "{}" && kill {}')) }
portable_pkill.call
q_lock_script = lock_script.shellescape
expect([`bash -c #{q_lock_script}`, $?.exitstatus]).to eq(["", 0])
expect([`bash -c #{q_lock_script}`, $?.exitstatus]).to eq(["Another session active: testlockname\n", 124])
expect(portable_pkill.call).to be true
end
describe "exit code handling" do
before do
expect(sa).to receive(:maybe_ssh_session_lock_name).and_return("testlockname")
sa.invalidate_cache_entry
expect(Net::SSH).to receive(:start) do
instance_double(Net::SSH::Connection::Session, close: nil)
end
end
it "runs the session lock script if SSH_SESSION_LOCK_NAME is set" do
expect(sa).to receive(:cmd).with(lock_script, log: false)
sa.connect
end
it "reports a failure to obtain a file descriptor with an obscure exit code" do
expect(sa).to receive(:cmd).with(lock_script, log: false).and_raise(Sshable::SshError.new(lock_script, "", "", 92, nil))
expect(Clog).to receive(:emit).with("session lock failure").and_wrap_original do |m, a, &b|
expect(b.call.dig(:contended_session_lock, :session_fail_msg)).to eq("could not create session lock file for testlockname")
end
sa.connect
end
it "reports lock conflicts when an obscure exit code is raised" do
expect(sa).to receive(:cmd).with(lock_script, log: false).and_raise(Sshable::SshError.new(lock_script, "", "", 124, nil))
expect(Clog).to receive(:emit).with("session lock failure").and_wrap_original do |m, a, &b|
expect(b.call.dig(:contended_session_lock, :session_fail_msg)).to eq("session lock conflict for testlockname")
end
sa.connect
end
it "has a generic message for unrecognized errors" do
expect(sa).to receive(:cmd).with(lock_script, log: false).and_raise(Sshable::SshError.new(lock_script, "", "", 1, nil))
expect(Clog).to receive(:emit).with("session lock failure").and_wrap_original do |m, a, &b|
expect(b.call.dig(:contended_session_lock, :session_fail_msg)).to eq("unknown SshError")
end
sa.connect
end
end
end
describe "caching" do
# The cache is thread local, so re-set the thread state by boxing
# each test in a new thread.
around do |ex|
Thread.new {
ex.run
}.join
end
it "can cache SSH connections" do
expect(Net::SSH).to receive(:start) do
instance_double(Net::SSH::Connection::Session, close: nil)
end
expect(Thread.current[:clover_ssh_cache]).to be_nil
first_time = sa.connect
expect(Thread.current[:clover_ssh_cache].size).to eq(1)
second_time = sa.connect
expect(first_time).to equal(second_time)
expect(described_class.reset_cache).to eq []
expect(Thread.current[:clover_ssh_cache]).to be_empty
end
it "does not crash if a cache has never been made" do
expect {
sa.invalidate_cache_entry
}.not_to raise_error
end
it "can invalidate a single cache entry" do
sess = instance_double(Net::SSH::Connection::Session, close: nil)
expect(Net::SSH).to receive(:start).and_return sess
sa.connect
expect {
sa.invalidate_cache_entry
}.to change { Thread.current[:clover_ssh_cache] }.from({["test.localhost", "testuser"] => sess}).to({})
end
it "can reset caches when has cached connection" do
sess = instance_double(Net::SSH::Connection::Session, close: nil)
expect(Net::SSH).to receive(:start).and_return sess
sa.connect
expect {
described_class.reset_cache
}.to change { Thread.current[:clover_ssh_cache] }.from({["test.localhost", "testuser"] => sess}).to({})
end
it "can reset caches when has no cached connection" do
expect(described_class.reset_cache).to eq([])
end
it "can reset caches even if session fails while closing" do
sess = instance_double(Net::SSH::Connection::Session)
expect(sess).to receive(:close).and_raise Sshable::SshError.new("bogus", "", "", nil, nil)
expect(Net::SSH).to receive(:start).and_return sess
sa.connect
expect(described_class.reset_cache.first).to be_a Sshable::SshError
expect(Thread.current[:clover_ssh_cache]).to eq({})
end
end
describe "#cmd" do
let(:session) { instance_double(Net::SSH::Connection::Session) }
before do
expect(sa).to receive(:connect).and_return(session).at_least(:once)
end
def simulate(cmd:, exit_status:, exit_signal:, stdout:, stderr:)
expect(session).to receive(:open_channel) do |&blk|
chan = instance_spy(Net::SSH::Connection::Channel)
expect(chan).to receive(:exec).with(cmd) do |&blk|
chan2 = instance_spy(Net::SSH::Connection::Channel)
expect(chan2).to receive(:on_request).with("exit-status") do |&blk|
buf = instance_double(Net::SSH::Buffer)
expect(buf).to receive(:read_long).and_return(exit_status)
blk.call(nil, buf)
end
expect(chan2).to receive(:on_request).with("exit-signal") do |&blk|
buf = instance_double(Net::SSH::Buffer)
expect(buf).to receive(:read_long).and_return(exit_signal)
blk.call(nil, buf)
end
expect(chan2).to receive(:on_data).and_yield(instance_double(Net::SSH::Connection::Channel), stdout)
expect(chan2).to receive(:on_extended_data).and_yield(nil, 1, stderr)
blk.call(chan2, true)
end
blk.call(chan, true)
chan
end
end
it "can run a command" do
[false, true].each do |repl_value|
[false, true].each do |log_value|
allow(described_class).to receive(:repl?).and_return(repl_value)
if repl_value
# Note that in the REPL, stdout and stderr get multiplexed
# into stderr in real time, packet by packet.
expect($stderr).to receive(:write).with("hello")
expect($stderr).to receive(:write).with("world")
end
if log_value
sa.instance_variable_set(:@connect_duration, 1.1)
expect(Clog).to receive(:emit).with("ssh cmd execution") do |&blk|
dat = blk.call
if repl_value
expect(dat[:ssh].slice(:stdout, :stderr)).to be_empty
else
expect(dat[:ssh].slice(:stdout, :stderr)).to eq({stdout: "hello", stderr: "world"})
end
end
end
simulate(cmd: "echo hello", exit_status: 0, exit_signal: nil, stdout: "hello", stderr: "world")
expect(sa.cmd("echo hello", log: log_value)).to eq("hello")
end
end
end
it "raises an error with a non-zero exit status" do
simulate(cmd: "exit 1", exit_status: 1, exit_signal: 127, stderr: "", stdout: "")
expect { sa.cmd("exit 1") }.to raise_error Sshable::SshError, "command exited with an error: exit 1"
end
it "invalidates the cache if the session raises an error" do
err = IOError.new("the party is over")
expect(session).to receive(:open_channel).and_raise err
expect(sa).to receive(:invalidate_cache_entry)
expect { sa.cmd("irrelevant") }.to raise_error err
end
end
describe "daemonizer methods" do
let(:unit_name) { "test_unit" }
let(:run_command) { "sudo host/bin/setup-vm prep test_unit" }
let(:stdin_data) { "secret_data" }
it "calls cmd with the correct check command" do
expect(sa).to receive(:cmd).with("common/bin/daemonizer2 check test_unit")
sa.d_check(unit_name)
end
it "calls cmd with the correct clean command" do
expect(sa).to receive(:cmd).with("common/bin/daemonizer2 clean test_unit")
sa.d_clean(unit_name)
end
it "calls cmd with the correct restart command" do
expect(sa).to receive(:cmd).with("common/bin/daemonizer2 restart test_unit")
sa.d_restart(unit_name)
end
it "calls cmd with the correct run command and no stdin" do
expect(sa).to receive(:cmd).with("common/bin/daemonizer2 run test_unit sudo\\ host/bin/setup-vm\\ prep\\ test_unit", stdin: nil, log: true)
sa.d_run(unit_name, run_command)
end
it "calls cmd with the correct run command and passes stdin" do
expect(sa).to receive(:cmd).with("common/bin/daemonizer2 run test_unit sudo\\ host/bin/setup-vm\\ prep\\ test_unit", stdin: stdin_data, log: true)
sa.d_run(unit_name, run_command, stdin: stdin_data)
end
end
end