Saturday, October 30, 2021

Troubleshooting a weird Nagios NRPE SSL/TLS error

We recently gained limited access to a customer data center in order to monitor some machines that our software is running on.  For historical reasons, we use Nagios as our monitoring tool (yes, I know that it's 2021) and we use NRPE to monitor our Linux boxes (yes, I know that NRPE is deprecated in favor of NCPA).

We had to provide the customer with a list of source IP addresses and target ports (for example, 5666 for NRPE) as part of the process to get the VPN set up.  Foreshadowing: this will become relevant soon.

After getting NRPE installed on all of our machines, we noticed that Nagios was failing to connect to any of them.  The NRPE logs all had the following errors:

Starting up daemon
Server listening on 0.0.0.0 port 5666.
Server listening on :: port 5666.
Warning: Daemon is configured to accept command arguments from clients!
Listening for connections on port 5666
Allowing connections from: 127.0.0.1,::1,[redacted]
Error: Network server getpeername() failure (107: Transport endpoint is not connected)
warning: can't get client address: Connection reset by peer
Error: (!log_opts) Could not complete SSL handshake with [redacted]: 5
warning: can't get client address: Connection reset by peer
Error: Network server getpeername() failure (107: Transport endpoint is not connected)
Error: Network server getpeername() failure (107: Transport endpoint is not connected)
warning: can't get client address: Connection reset by peer
Error: (!log_opts) Could not complete SSL handshake with [redacted]: 5
warning: can't get client address: Connection reset by peer
warning: can't get client address: Connection reset by peer

So, this is obviously an SSL/TLS problem.

However, everyone on the Internet basically says that this is a problem with the NRPE client machine (the Nagios source address isn't listed in "allowed_hosts", it's not set up for SSL correctly, you didn't compile it right, etc.).

After fighting with this for hours, we finally figured out what was wrong.

A hint was the "getpeername() failure"; if you open up the NRPE source code, this runs immediately after the connection is established.  The only way that you could see this error ("Transport endpoint is not connected") is if the socket was closed between that initial connection and "getpeername".

Running "tcpdump" on both sides yielded the following findings:

On Nagios:

Nagios → NRPE machine: SYN
NRPE machine → Nagios: SYN, ACK
Nagios → NRPE machine: ACK
Nagios → NRPE machine: TLSv1 Client Hello
NRPE machine → Nagios: RST, ACK

On the NRPE machine to be monitored:

Nagios → NRPE machine: SYN
NRPE machine → Nagios: SYN, ACK
Nagios → NRPE machine: ACK
Nagios → NRPE machine: RST, ACK

Both machines agreed on the first 3 packets: the classic TCP handshake.  However, they differed on the subsequent packets.  Nagios sent a TLSv1 "Client Hello" packet and immediately had the connection closed by the NRPE machine.  However, the NRPE machine did not see the TLSv1 "Client Hello" at all; rather, it saw that Nagios immediately closed the connection.

This is indicative of some trickery being done by the customer's equipment (firewall, VPN, etc.).  From what I can tell, their infrastructure is quietly inspecting traffic on that port and killing the connection as soon as it sees any TLS packets.  They probably have an incorrect rule set up for port 5666, but anyway, that's the problem here: the network infrastructure is tearing out the TLS packets and closing the connection.

Saturday, July 10, 2021

Migrating from a static volume to a storage pool in QNAP

I bought a QNAP TS-451+ NAS a number of years ago.  At the time, you could only set up what are now called "static volumes"; these are volumes that are composed of a number of disks in some RAID configuration.  After a firmware update, QNAP introduced "storage pools", which act as a layer in between the RAIDed disks and the volumes on top of them.  Storage pools can do snapshots and some other fancy things, but the important thing here is that QNAP was pushing storage pools now, and I had a static volume.

I wanted to migrate from my old static volume to a new storage pool.  I couldn't really find any examples of anyone who had performed such a migration successfully; most of the advice on the Internet was basically, "back up your stuff and reformat".  Given the fact that my volume was almost full and that QNAP does not support an in-place migration, I figured that if I added on some extra storage in the form of an expansion unit, I could probably pull it off with minimal hassle.

(The official QNAP docs generally agree with this.)

tl;dr It was pretty easy to do, just a bit time-consuming.  I'll also note that this was a lossless process (other than my NFS permissions); I didn't have to reinstall anything or restore any backups.

Here's the general workflow:

  1. Attach the expansion unit.
  2. Add the new disks to the expansion unit.
  3. Create a new storage pool on the expansion unit.
  4. Transfer each folder in the original volume to a new folder on the expansion unit.
  5. Write down the NFS settings for the original volume's folders.
  6. Delete the original volume.
  7. Create a new storage pool with the original disks.
  8. Create a new system volume on the main storage pool.
  9. Create new volumes as desired on the main storage pool.
  10. Transfer each folder from the expansion volume to the main volume.
  11. Re-apply the NFS settings on the folders on the main storage pool's volumes.
  12. Detach the expansion unit.
Some details follow.

QNAP sells expansion units that can act as additional storage pools and volumes, and the QNAP OS integrates them pretty well.  I purchased a TR-004 and connected it to my TS-451+ NAS via USB.  I had some new drives that I was planning to use to replace the drives currently in the NAS, so instead of doing that right away, I put them all in the expansion unit and created a new storage pool (let's call this the expansion storage pool).

I had originally tried using File Station to copy and paste all of my folders to a new volume in the expansion unit, but I would get permission-related errors, and I didn't want to deal with individual files when there were millions to transfer.  QNAP has an application called Hybrid Backup Sync, and one of the things that you can do is a 1-way sync "job" that lets you properly copy everything from one folder on one volume to another folder on another volume.  So I created new top-level folders in the expansion volume and then used Hybrid Backup Sync to copy all of my data from the main volume to the expansion volume (it preserved all the file attributes, etc.).

For more information on how to use Hybrid Backup Sync to do this, see this article from QNAP.

(If you're coming from a static volume and you set up a storage pool on the expansion unit, then QNAP has a feature where you can transfer a folder on a static volume to a new volume in a storage pool, but this only works one way; you can't use this feature to transfer back from storage pool to storage pool, only from static volume to storage pool.)

I then wrote down the NFS settings that I had for my folders on the main unit (it's pretty simple, but I did have some owner and whitelist configuration).

Once I had moved everything of mine onto the expansion volume, I deleted the main (system) volume.  QNAP was okay with this and didn't complain at all.  Some sites that I had read claimed that you'd have to reboot or reformat or something if you did this, but at least on modern QNAP OSes, it's fine with you deleting its system volume.

For more information on deleting a volume, see this article from QNAP.

I created a new storage pool with the main unit's existing disks, and then I created a small, thin volume on it to see what would happen.  QNAP quickly decided that this new volume would be the new "system" volume, and it installed some applications on its own, and then it was done.  My guess is that it installed whatever base config it needs to operate on that new volume and maybe transferred the few applications that I already had to it or something.

(I then rebooted the QNAP just to make sure that everything was working, and it ended up being fine.)

On the expansion unit, I renamed all of the top-level folders to end with "_expansion" so that I'd be able to tell them apart from the ones that I would make on the main unit.

Then I used Hybrid Backup Sync to copy my folders from the expansion volume to the main volume.  Once that was done, I modified the NFS settings on the main volume's folders to match what they had been originally.

I tested the connections from all my machines that use the NAS, and then I detached and powered down the expansion unit.  I restarted the NAS and tested the connections again, and everything was perfect.  Now I had a storage pool with thin-provisioned volumes instead of a single, massive static volume.

Monday, July 5, 2021

Working around App Engine's bogus file modification times in Go

When an App Engine application is deployed, the files on the filesystem have their modification times "zeroed"; in this case, they are set to Tuesday, January 1, 1980 at 00:00:01 GMT (with a Unix timestamp of "315532801").  Oddly enough, this isn't January 1, 1970 (with a Unix timestamp of "0"); they're adding 10 years and 1 second for some reason (probably to avoid actually zeroing out the date).

If you found your way here by troubleshooting, you may have seen this for your "Last-Modified" header:

last-modified: Tue, 01 Jan 1980 00:00:01 GMT

There's an issue for this particular problem (currently they're saying that it's working as designed); to follow the issue or make a comment, see issue 168399701.

For App Engine in Go, I've historically bypassed the static files stuff and just had my application serve up the files with "http.FileServer", and I've disabled caching everywhere to play it safe ("Cache-Control: no-cache, no-store, must-revalidate").  Recently, I've begun to experiment with a "max-age" of 1 minute, aligned on 1-minute boundaries, so that I get a bit of help from the GCP proxy and its caching powers while not shooting myself in the foot by allowing stale copies of my files to linger all over the Internet.

This caused me a huge amount of headache recently when my web application wasn't updating in production, despite the new version having been deployed for over 24 hours.  It turns out that the browser (Chrome) was sending its requests with the "If-Modified-Since" header, and my application was responding with a 304 Not Modified response.  No matter how many times my service worker tried to fetch the new data, the server kept telling it that what it had was perfect.

The default HTTP file server in some languages lets you tweak how it responds ("ETag", "Last-Modified", etc.), but not in Go.  "http.FileServer" has no configuration options available to it.

What I ended up doing was wrapping "http.FileServer"'s "ServeHTTP" in another function; this function had two main goals:

  1. Set up a weak ETag value using the expiration date (ideally, I'd use a strong value like the MD5 sum of the contents, but I didn't want to have to rewrite "http.FileServer" just for this).
  2. Remove the request headers related to the modification time ("If-Modified-Since" and "If-Unmodified-Since").  "http.FileServer" definitely respects "If-Modified-Since", and because the modification time is bogus in App Engine, I figured that just quietly removing any headers related to that would keep things simple.
Here's what I ended up with:

staticHandler := http.StripPrefix("/", http.FileServer(http.Dir("/path/to/my/files")))

myHandler.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
   // Cache all the static files aligned at the 1-minute boundary.
   expirationTime := time.Now().Truncate(1 * time.Minute).Add(1 * time.Minute)
   w.Header().Set("Cache-Control", fmt.Sprintf("public, max-age=%0.0f, must-revalidate", time.Until(expirationTime).Seconds()))
   w.Header().Set("ETag", fmt.Sprintf("W/\"exp_%d\"", expirationTime.Unix())) // The ETag is weak ("W/" prefix) because it'll be the same tag for all encodings.

   // Strip the headers that `http.FileServer` will use that rely on modification time.
   // App Engine sets all of the timestamps to January 1, 1980.
   r.Header.Del("If-Modified-Since")
   r.Header.Del("If-Unmodified-Since")

   staticHandler.ServeHTTP(w, r)
})

Anyway, I fought with this for two days before finally realizing what was going on, so hopefully this will let you work around App Engine's bogus file-modification times.

Thursday, April 15, 2021

Using "errors.Is" to detect "connection reset by peer" and work around it

I maintain an application that ties into Emergency Reporting using their REST API.  When an item is updated, I have a Google Cloud Task that attempts to publish a change to a web hook, which connects to the Emergency Reporting API and creates a new incident in that system.  Because it's in Cloud Tasks, if the task fails for any reason, Cloud Tasks will attempt to retry the task until it succeeds.  Cool.

I also have it set up to send any log messages at warning level or higher to a Slack channel.  Also cool.

However, in December of 2020, Emergency Reporting switched to some kind of Microsoft-managed authentication system for their API, and this has only brought problems.  The most common problem is that the authentication API will frequently fail with a "connection reset by peer" error.  My Emergency Reporting wrapper detects this and logs it; my web hook detects a sign-in failure and logs that; and the whole Cloud Task detects that the web hook has failed and logs that.  Cloud Tasks automatically retries the task, which makes another post to the web hook, and everything succeeds the second time.  But by now, I've accumulated a bunch of warnings in the Slack channel.  Not cool.

So here's the thing: the Emergency Reporting API can fail for a lot of reasons, and I'd like to be notified when something important actually happens.  But a standard, run-of-the-mill TCP "connection reset by peer" error is not important at all.

Here's an example of the kind of error that Go's http.Client.PostForm returns in this case:

Could not post form: Post https://login.emergencyreporting.com/login.emergencyreporting.com/B2C_1A_PasswordGrant/oauth2/v2.0/token: read tcp [fddf:3978:feb1:d745::c001]:33391->[2620:1ec:29::19]:443: read: connection reset by peer

Looking at the error, there appear to be 4 layers of error:

  1. The HTTP post
  2. The particular TCP read
  3. A generic "read"
  4. A generic "connection reset by peer"
What I really want to do in this case is detect a generic "connection reset by peer" error and quietly retry the operation, allowing all other errors to be handled as true errors.  Doing string-comparison operations on error text is rarely a good idea, so what does that leave us with?

Go 1.13 adds support for "error wrapping", where one error can "wrap" another one, while still allowing programs to make decisions based on the "wrapped" error.  You may call "errors.Is" to determine if any error in an error chain matches a particular target.

Fortunately, all of the packages in this particular chain of errors utilize this feature.  In particular, the syscall package has a set of distinct Errno errors for each low-level error, including "connection reset by peer" (ECONNRESET).

This lets us do something like this:

tokenResponse, err = client.GenerateToken()
if err != nil {
   // If this was a connection-reset error, then continue to the next retry.
   if errors.Is(err, syscall.ECONNRESET) {
      logrus.Info("Got back a syscall.ECONNRESET from Emergency Reporting.")
      // [attempt to retry the operation]
   } else {
      // This was some other kind of error that we can't handle.
      // [log a proper error message and fail]
   }
}

Since using "errors.Is" to detect the "connection reset by peer" error, I haven't received a single annoying, pointless error message in my Slack channel.  I did have to spend a bit of time trying to figure out what that ultimate, underlying error was, but after that, it's been working flawlessly.

Monday, January 25, 2021

Using LDAP groups to limit access to a Radius server (freeRADIUS 3.0)

Note: this is an updated version of a prior entry for freeRADIUS 3.0.

Anytime I need to create a VPN (to my home network, to my AWS network, etc.), I use SoftEther.  SoftEther is OpenVPN-compatible, supports L2TP/IPsec, and has some neat settings around VPN over ICMP and DNS.  Anyway, once you get it set up, it generally just works (except for the cron job that you need to create to trim its massive log files daily).

At work, we use LDAP for our user authentication and permissions, but SoftEther doesn't support LDAP.  It does, however, support Radius, and freeRADIUS supports using LDAP as a module, so you can easily set up a quick Radius proxy for LDAP.

Quick recap on setting up freeRADIUS with LDAP

I'm assuming that you already have an LDAP server.

Install freeRADIUS and the LDAP module.
sudo apt install freeradius freeradius-ldap
sudo systemctl enable freeradius
sudo systemctl start freeradius

Enable the LDAP module via symlink:
ln -sfn ../mods-available/ldap /etc/freeradius/3.0/mods-enabled/ldap

Then turn on the LDAP module by editing /etc/freeradius/3.0/sites-enabled/default and uncommenting the "ldap" line under the "authorize" block.
authorize {
...
   ldap
...

You'll need to add an "if" statement to set the "Auth-Type"; do this immediately after that "ldap" line.
   if ((ok || updated) && User-Password) {
      update {
         control:Auth-Type := ldap
      }
   }

Do the same for the "Auth-Type LDAP" section, which lives in the "authenticate" block:
authenticate {
...
   Auth-Type LDAP {
      ldap
   }
...

Cool; at this point, freeRADIUS will use whatever LDAP setup is in the /etc/freeradius/3.0/mods-enabled/ldap file.  It won't work yet (because it's not set up for your LDAP server), but that's all that you need in order to back your Radius server with your LDAP server.

Next up, we'll look at configuring it to actually talk to your LDAP server.

Configuring the LDAP module

/etc/freeradius/3.0/mods-enabled/ldap is where the LDAP configuration lives.  In order to understand exactly what's going on, you should know a few things.
  1. Run-time variables, like the current user name, are written as %{Variable-Name}.  For example, the current user name is %{User-Name}.
  2. Similar to shell variables, you can have conditional values.  The basic syntax is %{%{Variable-1}:-%{Variable-2}}.  A typical pattern that you'll see is using the "stripped" user name (the user name without any realm information), but if that's not defined, then use the actual user name: %{%{Stripped-User-Name}:-%{User-Name}}
For your basic LDAP integration (if you provide a valid username and password, you can sign in), you'll need to set the following values in the "ldap" block:
  1. server; this is the hostname or address of your server.  If you're running freeRADIUS on the same LDAP server, then this will be "localhost".
  2. identity; this is the DN for the "bind" user.  That's the user that freeRADIUS will log in as in order to search the directory tree and do its LDAP stuff.  This is typically a read-only user.
  3. password; this is the password for the user configured in identity.
  4. base_dn; this is the base DN to use for all user searches.  It's usually something like dc=example,dc=com, but that'll depend on your LDAP setup.  You'll generally want to set this as the base for all of your users (maybe something like ou=users,dc=example,dc=com, etc.).
Here's an example that assumes that your users are all under ou=users,dc=example,dc=com:
server = "my-ldap-server.example.com"
identity = "uid=my-bind-user,ou=service-users,dc=example,dc=com"
password = "abc123"
base_dn = "ou=users,dc=example,dc=com"

Users

You'll also need to set up user-level things in the "user" block:
  1. filter; this is the LDAP search condition that freeRADIUS will use to try to find the matching LDAP user for the user name that just tried to sign in via Radius.  This is where run-time variables come into play.  For out-of-the-box OpenLDAP, something like this will generally work: (uid=%{%{Stripped-User-Name}:-%{User-Name}}).  What this means is: look for an entity in LDAP (under the base DN defined in base_dn) with a uid property equal to the Radius user name.  Yes, you need the surrounding parentheses.  No, I don't make the rules.
Here's an example that uses "uid" for the user name.
filter = "(uid=%{%{Stripped-User-Name}:-%{User-Name}})"

Remember, filter can be any LDAP filter, so if there were a property that you also wanted to check (such as isAllowedToDoRadius or something), then you could check for that, as well.  For example:
filter = "(&(uid=%{%{Stripped-User-Name}:-%{User-Name}})(isAllowedToDoRadius=yes))"

Filtering by group

So, that'll let any LDAP user authenticate with Radius.  Maybe you want that, maybe you don't.  In my case, I have a whole bunch of users, but I only want a small subset to be able to VPN in using SoftEther.  I added those users to the "vpn-users" group in LDAP.

Note that there are two general grouping strategies in LDAP:
  1. Groups-have-users; in this strategy, the group entity lists the users within the group.  This is the default OpenLDAP strategy.
  2. Users-have-groups; in this strategy, the user entity lists the groups that it belongs to.
If you want to have freeRADIUS respect your groups, you'll need to set the following in /etc/freeradius/3.0/mods-enabled/ldap in the "groups" block:
  1. name_attribute = cn (which turns on tracking groups); and
  2. One of these two options, which each correspond to one of the LDAP grouping strategies:
    1. membership_filter; this is an LDAP filter to use to query for all of the groups that the user belongs to.
    2. membership_attribute; this is the property on the user entity that lists the groups that the user belongs to.
If your groups have users, this might look like:
name_attribute = cn
membership_filter = "(&(objectClass=posixGroup)(memberUid=%{%{Stripped-User-Name}:-%{User-Name}}))"

If your users have groups, this might look like:
name_attribute = cn
membership_attribute = groupName

With that set up, freeRADIUS will now know which groups the user belongs to, but it won't do anything with them.

The last step is to set up some group rules in /etc/freeradius/3.0/users.  There will probably be a few entries in that file already, but by default, none of them will be LDAP-related.  So, at the very bottom, add the LDAP group rules.

Note: In my case, this file was a symlink to "mods-config/files/authorize".  The symlink was a convenience for backward-compatibility in editing the config files; freeRADIUS doesn't actually load "users"; rather, it loads "mods-config/files/authorize", so make sure that you're actually modifying the correct file.

The simplest grouping rules will look like this:
DEFAULT LDAP-Group == "your-group-name-here", Auth-Type := Accept
DEFAULT Auth-Type := Reject
  Reply-Message = "Sorry, you're not part of an authorized group."

This generally means: you have to be a member of "your-group-name-here" or else you'll be rejected (and here's the message to send you).

In my case, my group is "vpn-users", so it looks like this:
DEFAULT LDAP-Group == "vpn-users", Auth-Type := Accept
DEFAULT Auth-Type := Reject
  Reply-Message = "Sorry, you're not part of an authorized group."

Once that's done, restart freeradius and you'll be good to go.
sudo systemctl restart freeradius

To test to see if it worked, you can run the radtest command:
radtest -x ${username} ${password} ${address} ${port} ${secret}

For example, in our case, this might look like:
radtest -x some-user abc123 my-radius-server.example.com 1812 the-gold-is-under-the-bridge

On success, you'll see something like:
rad_recv: Access-Accept packet

On failure, you'll see something like:
rad_recv: Access-Reject packet

Hopefully this helped a bit; I struggle every time I need to do anything with LDAP or Radius.  It's always really hard to find the documentation for what I'm looking for.