ubicloud/.rubocop.yml
Daniel Farina aa2e0c0749 For load balancers, do one free retry on a stale connection
Under periodically higher monitor load, we see a pattern where a longer
delay between pulses results in offline/online flapping.

This is consistent with this theory: in the event that we run pulses
less than once per ten seconds, we currently record a flap of
availability each and every time from the resulting `IOError`
exception, because servers are configured as of
22f3aa5c1d to close connections ten
seconds after the last non-ClientAlive packet (which is only sent when
the protocol is idle), as seen in `sshd_config`:

    ClientAliveInterval 2
    ClientAliveCountMax 4

This patch makes the consequences of some slowness degrade gracefully:
it anticipates that the server would have dropped connections that
have not had a `check_pulse` run in a while, and offers *one* retry
without recording the failed probe as a flap.
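
A minimal sketch of the "one free retry" idea, not the actual patch.
The class name `PulseChecker`, the `connect:` and `clock:` parameters,
the `flaps` counter, and the `STALE_AFTER_SECONDS` threshold are all
hypothetical and chosen only for illustration. The sketch assumes a
stale session surfaces as an `IOError`, as described above.

```ruby
# Hypothetical threshold: treat a session as possibly stale once it has
# gone unused for about as long as the server-side cutoff described above.
STALE_AFTER_SECONDS = 10

# Illustrative only; this is not the monitor code from the repository.
class PulseChecker
  attr_reader :flaps

  # connect: callable returning a fresh session (hypothetical)
  # clock:   callable returning monotonic seconds, injectable for tests
  def initialize(connect:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @connect = connect
    @clock = clock
    @session = nil
    @last_pulse_at = nil
    @flaps = 0
  end

  # Yields the session to the probe block. If the probe raises IOError
  # and the session had been idle long enough for the server to have
  # dropped it, reconnect and retry once without counting a flap.
  def check_pulse
    free_retry = !@last_pulse_at.nil? &&
      (@clock.call - @last_pulse_at) > STALE_AFTER_SECONDS
    begin
      @session ||= @connect.call
      result = yield @session
    rescue IOError
      @session = nil # force a reconnect on the next attempt
      if free_retry
        free_retry = false # only one free retry per pulse
        retry
      end
      @flaps += 1
      result = "down"
    end
    @last_pulse_at = @clock.call
    result
  end
end
```

A failure on a recently used session, or a second failure after the
free retry, is still recorded as a flap, so real outages are not
masked.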

This is some pretty rough work, because this problem affects *all*
low-level monitoring routines and it's not sensible to go around
solving them all one by one like this.  But we have a timely problem
with load balancer flaps from stale connections when the monitor
process doesn't check in with its SSH connections often enough, and
this can serve as a model for, and a test of, a solution.
2025-07-31 10:46:58 +02:00
