Thursday, March 3, 2022

Dump your SSH settings for quick troubleshooting

I recently had a Jenkins job that would die, seemingly at random.  The only thing that really stood out was that it tended to succeed if the runtime was 14 minutes or less, and it tended to fail if the runtime was 17 minutes or more.

This job did a bunch of database stuff (through an SSH tunnel; more on that soon), so I first did a whole bunch of troubleshooting on the Postgres client and server configs, but nothing seemed relevant.  The job seemed to disconnect ("connection closed by server") on long queries that would sit there for a while (around 15 minutes or so) before coming back with a result.  After ruling out the Postgres server (all of the settings looked good, and new sessions had decent timeout configs), I moved on to SSH.

This job connects to a database by way of a forwarded port through an SSH tunnel (don't ask why; just understand that it's the least worst option available in this context).  I figured that maybe the SSH tunnel was failing, since I start it in the background and have it run "sleep infinity" and then never look at it again.  However, when I tested locally, my SSH session would run for multiple days without a problem.

Spoiler alert: the answer ended up being the client config, but how do you actually find that out?

SSH has two really cool options.

On the server side, you can run "sudo sshd -T | sort" to have the SSH daemon read the relevant configs and then print out all of the actual values that it's using.  This merges in all of the unspecified defaults as well as all of the various options in "/etc/ssh/sshd_config", "/etc/ssh/sshd_config.d", etc.
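
The output is just lowercase "keyword value" pairs, one per line, so it's easy to grep for whatever you care about.  For example, on a stock-ish install you'd see something like this (exact values will vary by server and distro):

sudo sshd -T | sort | grep -E 'permitrootlogin|maxauthtries'
maxauthtries 6
permitrootlogin prohibit-password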

On the client side, you can run "ssh -G ${user}@${host} | sort", and it'll do the same thing, but for all of the client-side configs that apply to that particular user and host combination (because maybe you have some custom stuff set up in your "~/.ssh/config", etc.).
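
Note that "-G" only evaluates and prints the config; it never actually connects, so it's safe to run anywhere.  To zero in on the keepalive-related settings (which is where this story is headed):

ssh -G ${user}@${host} | grep -i alive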

Now, in my case, it ended up being a keepalive issue.  So, on the server side, here's what the relevant settings were:

clientalivecountmax 0
clientaliveinterval 900
tcpkeepalive yes

On the client (which would disconnect sometimes), here's what the relevant settings were:

serveralivecountmax 3
serveraliveinterval 0
tcpkeepalive yes

Here, you can see that the client (which is whatever the default Jenkins Kubernetes agent ended up being) had TCP keepalives enabled, but its "serveraliveinterval" was "0", which means that it never sent SSH-level keepalive messages at all.

According to the docs, the server should have sent out keepalives every 15 minutes, but whatever it was actually doing, the connection would drop right around that 15-minute mark (my guess is that "clientalivecountmax 0" turned the interval into a hard idle timeout rather than a probe).  Setting "serveraliveinterval" to "60" ended up solving my problem and kept my SSH sessions alive until the script was done with them.
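
For reference, here's a minimal sketch of that fix as a client config entry instead of a command-line option (the hostname is a placeholder):

# ~/.ssh/config
Host bastion.example.com
    # Send an SSH-level keepalive every 60 seconds; give up after 3 misses.
    ServerAliveInterval 60
    ServerAliveCountMax 3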

Little bonus section

My SSH command to set up the tunnel in the background was:

ssh -4 -f -L${localport}:${targetaddress}:${targetport} ${user}@${bastionhost} 'sleep infinity';

"-4" forces it to use an IPv4 address (relevant in my context), and "-f" puts the SSH command into the background before "sleep infinity" gets called, right after all the port forwarding is set up.  "sleep infinity" ensures that the connection never closes on its own; the "sleep" command will do nothing forever.

(Obviously, I had the "-o ServerAliveInterval=60" option in there, too.)
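
So, the full command ended up looking something like this:

ssh -4 -f -o ServerAliveInterval=60 -L${localport}:${targetaddress}:${targetport} ${user}@${bastionhost} 'sleep infinity';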

With this, I could trivially have my container create an SSH session that allowed for port-forwarding, and that session would be available for the entirety of the container's lifetime (the entirety of the Jenkins build).
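
Anything in the container that needed the database could then just point at the forwarded local port.  With Postgres, that looks something like this (the connection details are placeholders):

psql -h 127.0.0.1 -p ${localport} -U ${dbuser} ${dbname}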

Tuesday, March 1, 2022

QNAP, NFS, and Filesystem ACLs

I recently spent hours banging my head against a wall trying to figure out why my Plex server couldn't find some new media that I put on its volume in my QNAP.

tl;dr QNAP's "Advanced Folder Permissions" setting turns on POSIX file access control lists (you'll need the "getfacl" and "setfacl" tools installed on Linux to mess with them).  For more information, see this guide from QNAP.

I must have turned this setting on when I rebuilt my NAS a while back, and it never mattered until I did some file operations with the File Manager (or maybe just straight "cp"; I forget which, or both).  Plex refused to see the new file, and I tried debugging the indexer and all of that other Plex stuff before realizing that while Plex could list the file, it couldn't open it, even though the file's normal "ls -l" permissions looked fine.

Apparently a file access control list denied it, but I didn't even have "getfacl" or "setfacl" installed on my machine (and I had never even heard of them before), so I had no idea what was going on.  I eventually installed those tools and verified that while the standard Linux permissions looked fine, the ACL entries did not.
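
On Debian-flavored systems, those tools live in the "acl" package, and inspecting a file is a one-liner (the path is a placeholder).  As a bonus hint, "ls -l" marks files that carry an ACL with a trailing "+" on the permission string.

sudo apt-get install acl
getfacl /path/to/folder/somefile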

"sudo chmod -R +r /path/to/folder" didn't solve my problem, but tearing out the ACL did: "sudo setfacl -b -R /path/to/folder"

I eventually figured out that QNAP's "Advanced Folder Permissions" setting was the cause and just disabled it so it wouldn't bother me again.