Under periodically higher monitor load, we see a pattern where longer
delays between pulses result in offline/online flapping.
This is consistent with the following theory: if we run pulses less
often than once every ten seconds, we currently record an availability
flap every single time from the resulting `IOError` exception, because
as of 22f3aa5c1d servers are configured to close connections ten
seconds after the last non-ClientAlive packet (ClientAlive packets are
only sent when the protocol is idle), as seen in `sshd_config`:
    ClientAliveInterval 2
    ClientAliveCountMax 4
This patch makes the consequences of some slowness degrade gracefully:
it anticipates that the server will have dropped connections that
have not had a `check_pulse` run in a while, and offers *one* retry
without recording the failed probe as a flap.
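
A minimal sketch of the retry idea, not the actual patch. Apart from
`check_pulse` and `IOError`, every name here (`PulseMonitor`,
`STALE_AFTER`, `connect`, `ping`, `flaps`) is hypothetical, and the
ten-second cutoff is assumed to mirror the server-side behavior
described above:

```python
import time

# Assumption: connections idle longer than this have probably been
# closed by the server already.
STALE_AFTER = 10.0  # seconds


class PulseMonitor:
    """Hypothetical monitor; only the name check_pulse comes from the patch."""

    def __init__(self, connect, clock=time.monotonic):
        self._connect = connect      # callable returning a fresh connection
        self._clock = clock          # injectable so tests can fake time
        self._conn = connect()
        self._last_pulse = clock()
        self.flaps = 0               # count of recorded availability flaps

    def check_pulse(self):
        """Return True if the probe succeeded, False if a flap was recorded."""
        stale = self._clock() - self._last_pulse > STALE_AFTER
        try:
            self._conn.ping()
        except IOError:
            if not stale:
                # The connection was recently known good: a real failure.
                self.flaps += 1
                return False
            # Stale connection: reconnect and retry exactly once, without
            # counting the first failure as a flap.
            try:
                self._conn = self._connect()
                self._conn.ping()
            except IOError:
                self.flaps += 1
                return False
        self._last_pulse = self._clock()
        return True
```

The point of the shape is that the first `IOError` is forgiven only when
the connection was already overdue for a pulse; a failure on a recently
checked connection, or a failed retry, still counts as a flap.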
This is some pretty rough work, because this problem affects *all*
low-level monitoring routines and it's not sensible to go around
solving them all one by one like this. But we have a timely problem
with load balancer flaps from stale connections when the monitor
process doesn't check in with its SSH connections often enough, and
this can be a model for (and a test of) a solution.