I recently had a Jenkins job that would die, seemingly at random. The only thing that really stood out was that it tended to succeed if the runtime was 14 minutes or less and tended to fail if the runtime was 17 minutes or more.
This job did a bunch of database stuff (through an SSH tunnel; more on that soon), so I first did a whole bunch of troubleshooting on the Postgres client and server configs, but nothing seemed relevant. It seemed to disconnect ("connection closed by server") on long queries that would sit there for a while (somewhere around 15 minutes) before coming back with a result. After ruling out the Postgres server (all of the settings looked good, and new sessions had decent timeout configs), I moved on to SSH.
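For the record, the Postgres settings I mean are the usual timeout suspects. A quick illustration of the kind of checks (assuming you can run psql against the server directly):

psql -c "SHOW statement_timeout;"                      # 0 = no statement-level timeout
psql -c "SHOW idle_in_transaction_session_timeout;"    # 0 = idle transactions aren't killed
psql -c "SHOW tcp_keepalives_idle;"                    # 0 = use the OS default keepalive idle time

None of those explained a roughly-15-minute cutoff.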
This job connects to a database by way of a forwarded port through an SSH tunnel (don't ask why; just understand that it's the least worst option available in this context). I figured that maybe the SSH tunnel was failing, since I start it in the background and have it run "sleep infinity" and then never look at it again. However, when I tested locally, my SSH session would run for multiple days without a problem.
Spoiler alert: the answer ended up being the client config, but how do you actually find that out?
SSH has two really cool options.
On the server side, you can run "sudo sshd -T | sort" to have the SSH daemon read the relevant configs and then print out all of the actual values that it's using. So, this'll merge in all of the unspecified defaults as well as all of the various options in "/etc/ssh/sshd_config", "/etc/ssh/sshd_config.d", etc.
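If you only care about keepalives, it's easy to filter that output down. For example (the grep pattern is just what happened to be useful for me):

sudo sshd -T | sort | grep -i alive    # effective ClientAlive* and TCPKeepAlive settings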
On the client side, you can run "ssh -G ${user}@${host} | sort", and it'll do the same thing, but for all of the client-side configs for that particular user and host combination (because maybe you have some custom stuff set up in your SSH config, etc.).
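Again, you can narrow it down to the keepalive-related values. Something like this, with placeholder user/host values:

ssh -G ${user}@${host} | sort | grep -i alive    # effective ServerAlive* and TCPKeepAlive settings for this destination

Note that "ssh -G" doesn't actually connect anywhere; it just evaluates the config for that destination and prints it.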
Now, in my case, it ended up being a keepalive issue. So, on the server side, here's what the relevant settings were:
clientalivecountmax 0
clientaliveinterval 900
tcpkeepalive yes
On the client (which would disconnect sometimes), here's what the relevant settings were:
serveralivecountmax 3
serveraliveinterval 0
tcpkeepalive yes
Here, you can see that the client (which is whatever the default Jenkins Kubernetes agent ended up being) had TCP keepalives enabled, but "serveraliveinterval" was set to "0", which means that it would never send SSH-level keepalive messages to the server at all.
According to the docs, the server should have been sending its own keepalives every 15 minutes ("clientaliveinterval 900"), but whatever it was actually doing, the connection would drop right around that 15-minute mark. Setting "serveraliveinterval" to "60" on the client ended up solving my problem and allowed my SSH sessions to stay active indefinitely, until the script was done with them.
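In my case that meant passing "-o ServerAliveInterval=60" on the command line (see the next section), but you could also bake it into the client config. A minimal sketch, where the "bastion" host alias is just an example:

Host bastion
    ServerAliveInterval 60
    ServerAliveCountMax 3

Either way, the "ssh -G" trick from above is a handy way to confirm that the value actually took effect.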
Little bonus section
My SSH command to set up the tunnel in the background was:
ssh -4 -f -L${localport}:${targetaddress}:${targetport} ${user}@${bastionhost} 'sleep infinity';
"-4" forces it to use an IPv4 address (relevant in my context), and "-f" puts the SSH command into the background before "sleep infinity" gets called, right after all the port forwarding is set up. "sleep infinity" ensures that the connection never closes on its own; the "sleep" command will do nothing forever.
(Obviously, I had the "-o ServerAliveInterval=60" option in there, too.)
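Putting it all together, the tunnel command looked something like this (the shell variables are just placeholders):

ssh -4 -f -o ServerAliveInterval=60 -L${localport}:${targetaddress}:${targetport} ${user}@${bastionhost} 'sleep infinity';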
With this, I could trivially have my container create an SSH session that allowed for port-forwarding, and that session would be available for the entirety of the container's lifetime (the entirety of the Jenkins build).
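From there, the database client just talks to the forwarded local port as if Postgres were local; something like this, again with placeholder values:

psql -h 127.0.0.1 -p ${localport} -U ${dbuser} ${dbname}    # traffic goes through the tunnel to the real database server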