Sunday, January 9, 2022

Moving (or renaming) a push-notification ServiceWorker

Service Workers (web workers, etc.) are a relatively new concept.  They can do all kinds of cool things (primarily related to network requests), but they are also the mechanism by which a web site can receive push messages (via Web Push) and show them as OS notifications.

The general rule of Service Workers is to pick a file name (such as "/service-worker.js") and never, ever change it.  That's cool, but sometimes you do need to change it.

In particular, I started my push messaging journey with "platinum-push-messaging", a now-defunct web component built by Google as part of the initial Polymer project.  The promise was cool: just slap this HTML element on your page with a few parameters and boom: you have working push notifications.

When it came out, the push messaging spec was young, and no browsers fully supported its encrypted data payloads, so "platinum-push-messaging" did a lot of work to work around that limitation.  As browsers improved to support the VAPID spec, "platinum-push-messaging" (along with all of the other "platinum" elements) was quietly deprecated and archived (around 2017).

This left me with a problem: a rotting push notification system that couldn't keep up with the spec and the latest browsers.  I hacked the code to all hell to support VAPID and keep the element functioning, but I was just punting.

Apple ruined the declarative promise of the Polymer project by refusing to implement HTML imports, so the web components community adopted the NPM distribution model (and introduced a whole bunch of imperative JavaScript drama and compilation tools).  Anyway, no modern web components are installed with Bower anymore, so that left me with a deprecated Service Worker in a path that I wanted to get rid of: "bower_components/platinum-push-messaging/service-worker.js"

Here was my problem:

  1. I wanted the push messaging Service Worker under my control at the top level of my application, "/push-service-worker.js".
  2. I had hundreds of users who were receiving push notifications via this system, and the upgrade had to be seamless (users couldn't be forced to take any action).
I ended up solving the problem by essentially performing a switcheroo:
  1. I had my application store the Web Push subscription info in HTML local storage.  This would be necessary later as part of the switcheroo.
  2. I removed "bower_components/platinum-push-messaging/".  Any existing clients would regularly attempt to update the service worker, but it would quietly fail, leaving the existing one running just fine.
  3. I removed all references to "platinum-push-messaging" from my code.  The existing Service Worker would continue to run (because that's what Service Workers do) and receive push messages (and show notifications).
  4. I made my own push-messaging web component with my own service worker living at "/push-service-worker.js".
  5. (This laid the framework for performing the switcheroo.)
  6. Upon loading, the part of my application that used to include "platinum-push-messaging" did a migration, if necessary, before loading the new push-messaging component:
    1. It went through all the Service Workers and looked for any legacy ones (these had "$$platinum-push-messaging$$" in the scope).  If it found any, it killed them.

      Note that the "$$platinum-push-messaging$$" in the scope was a cute trick by the web component: a page can only be controlled by one Service Worker, and the scope dictates what that Service Worker can control.  By injecting a bogus "$$platinum-push-messaging$$" at the end of the scope, it ensured that the push-messaging Service Worker couldn't accidentally control any pages and get in the way of a main Service Worker.
    2. Upon finding any legacy Service Workers, it would:
      1. Issue a delete to the web server for the old (legacy) subscription (which was stored in HTML local storage).
      2. Tell the application to auto-enable push notifications.
      3. Resume the normal workflow for the application.
  7. The normal workflow for the application entailed loading the new push-messaging web component once the user was logged in.  If a Service Worker was previously enabled, then it would remain active and enabled.  Otherwise, the application wouldn't try to annoy users by asking them for push notifications.
  8. After the new push-messaging web component was included, it would then check to see if it should be auto-enabled (it would only be auto-enabled as part of a successful migration).
    1. If it was auto-enabled, then it would enable push messaging (the user would have already given permission by virtue of having a legacy push Service Worker running).  When the new push subscription was ready, it would post that information to the web server, and the user would have push messages working again, now using the new Service Worker.  The switcheroo was complete.
That's a bit wordy for a simple switcheroo, but it was very important for me to ensure that my users never lost their push notifications as part of the upgrade.  The simple version is: detect legacy Service Worker, kill legacy Service Worker, delete legacy subscription from web server, enable new Service Worker, and save new subscription to web server.

For any given client, the switcheroo happens only once.  The moment that the legacy Service Worker has been killed, it'll never run again (so there's a chance that if the user killed the page in the milliseconds after the kill but before the save, then they'd lose their push notifications, but I viewed this as extremely unlikely; I could technically have stored a status variable, but it wasn't worth it).  After that, it operates normally.

This means that there are two ways for a user to be upgraded:
  1. They open up the application after it has been upgraded.  The application prompts them to reload to upgrade if it detects a new version, but eventually the browser will do this on its own, typically after the device reboots or the browser has been fully closed.
  2. They click on a push notification, which opens up the application (which is #1, above).
So at this point, it's a waiting game.  I have to maintain support for the switcheroo until all existing push subscriptions have been upgraded.  The new ones have a flag set in the database, so I just need to wait until all subscriptions have the flag.  Active users who are receiving push notifications will eventually click on one, so I made a note to revisit this and remove the switcheroo code once all of the legacy subscriptions have been removed.

I'm not certain what causes a new subscription to be generated (different endpoint, etc.), but I suspect that it has to do with the scope of the Service Worker (otherwise, how would it know, since service worker code can change frequently?).  I played it safe and just assumed that the switcheroo would generate an entirely new subscription, so I deleted the legacy one no matter what and saved the new one no matter what.

Saturday, October 30, 2021

Troubleshooting a weird Nagios NRPE SSL/TLS error

We recently gained limited access to a customer data center in order to monitor some machines that our software is running on.  For historical reasons, we use Nagios as our monitoring tool (yes, I know that it's 2021) and we use NRPE to monitor our Linux boxes (yes, I know that NRPE is deprecated in favor of NCPA).

We had to provide the customer with a list of source IP addresses and target ports (for example, 5666 for NRPE) as part of the process to get the VPN set up.  Foreshadowing: this will become relevant soon.

After getting NRPE installed on all of our machines, we noticed that Nagios was failing to connect to any of them.  The NRPE logs all had the following errors:

Starting up daemon
Server listening on 0.0.0.0 port 5666.
Server listening on :: port 5666.
Warning: Daemon is configured to accept command arguments from clients!
Listening for connections on port 5666
Allowing connections from: 127.0.0.1,::1,[redacted]
Error: Network server getpeername() failure (107: Transport endpoint is not connected)
warning: can't get client address: Connection reset by peer
Error: (!log_opts) Could not complete SSL handshake with [redacted]: 5
warning: can't get client address: Connection reset by peer
Error: Network server getpeername() failure (107: Transport endpoint is not connected)
Error: Network server getpeername() failure (107: Transport endpoint is not connected)
warning: can't get client address: Connection reset by peer
Error: (!log_opts) Could not complete SSL handshake with [redacted]: 5
warning: can't get client address: Connection reset by peer
warning: can't get client address: Connection reset by peer

So, this is obviously an SSL/TLS problem.

However, everyone on the Internet basically says that this is a problem with the NRPE client machine (the Nagios source address isn't listed in "allowed_hosts", it's not set up for SSL correctly, you didn't compile it right, etc.).

After fighting with this for hours, we finally figured out what was wrong.

A hint was the "getpeername() failure"; if you open up the NRPE source code, this runs immediately after the connection is established.  The only way that you could see this error ("Transport endpoint is not connected") is if the socket was closed between that initial connection and "getpeername".

Running "tcpdump" on both sides yielded the following findings:

On Nagios:

Nagios → NRPE machine: SYN
NRPE machine → Nagios: SYN, ACK
Nagios → NRPE machine: ACK
Nagios → NRPE machine: TLSv1 Client Hello
NRPE machine → Nagios: RST, ACK

On the NRPE machine to be monitored:

Nagios → NRPE machine: SYN
NRPE machine → Nagios: SYN, ACK
Nagios → NRPE machine: ACK
Nagios → NRPE machine: RST, ACK

Both machines agreed on the first 3 packets: the classic TCP handshake.  However, they differed on the subsequent packets.  Nagios sent a TLSv1 "Client Hello" packet and immediately had the connection closed by the NRPE machine.  However, the NRPE machine did not see the TLSv1 "Client Hello" at all; rather, it saw that Nagios immediately closed the connection.

This is indicative of some trickery being done by the customer's equipment (firewall, VPN, etc.).  From what I can tell, the equipment is quietly stripping out any TLS packets and killing the connection if it finds any.  They probably have an incorrect port rule set up for port 5666, but anyway, that's the problem here: the network infrastructure is tearing out the TLS packets and closing the connection.

Saturday, July 10, 2021

Migrating from a static volume to a storage pool in QNAP

 I bought a QNAP TS-451+ NAS a number of years ago.  At the time, you could only set up what are now called "static volumes"; these are volumes that are composed of a number of disks in some RAID configuration.  After a firmware update, QNAP introduced "storage pools", which act as a layer in between the RAIDed disks and the volumes on top of them.  Storage pools can do snapshots and some other fancy things, but the important thing here is that QNAP was pushing storage pools now, and I had a static volume.

I wanted to migrate from my old static volume to a new storage pool.  I couldn't really find any examples of anyone who had performed such a migration successfully; most of the advice on the Internet was basically, "back up your stuff and reformat".  Given the fact that my volume was almost full and that QNAP does not support an in-place migration, I figured that if I added on some extra storage in the form of an expansion unit, I could probably pull it off with minimal hassle.

(The official QNAP docs generally agree with this.)

tl;dr It was pretty easy to do, just a bit time-consuming.  I'll also note that this was a lossless process (other than my NFS permissions); I didn't have to reinstall anything or restore any backups.

Here's the general workflow:

  1. Attach the expansion unit.
  2. Add the new disks to the expansion unit.
  3. Create a new storage pool on the expansion unit.
  4. Transfer each folder in the original volume to a new folder on the expansion unit.
  5. Write down the NFS settings for the original volume's folders.
  6. Delete the original volume.
  7. Create a new storage pool with the original disks.
  8. Create a new system volume on the main storage pool.
  9. Create new volumes as desired on the main storage pool.
  10. Transfer each folder from the expansion volume to the main volume.
  11. Re-apply the NFS settings on the folders on the main storage pool's volumes.
  12. Detach the expansion unit.
Some details follow.

QNAP sells expansion units that can act as additional storage pools and volumes, and the QNAP OS integrates them pretty well.  I purchased a TS-004 and connected it to my TS-451+ NAS via USB.  I had some new drives that I was planning to use to replace the drives currently in the NAS, so instead of doing that right away, I put them all in the expansion unit and created a new storage pool (let's call this the expansion storage pool).

I had originally tried using File Station to copy and paste all of my folders to a new volume in the expansion unit, but I would get permission-related errors, and I didn't want to deal with individual files when there were millions to transfer.  QNAP has an application called Hybrid Backup Sync, and one of the things that you can do is a 1-way sync "job" that lets you properly copy everything from one folder on one volume to another folder on another volume.  So I created new top-level folders in the expansion volume and then used Hybrid Backup Sync to copy all of my data from the main volume to the expansion volume (it preserved all the file attributes, etc.).

For more information on how to use Hybrid Backup Sync to do this, see this article from QNAP.

(If you're coming from a static volume and you set up a storage pool on the expansion unit, then QNAP has a feature where you can transfer a folder on a static volume to a new volume in a storage pool, but this only works one way; you can't use this feature to transfer back from storage pool to storage pool, only from static volume to storage pool.)

I then wrote down the NFS settings that I had for my folders on the main unit (it's pretty simple, but I did have some owner and whitelist configuration).

Once I had everything of mine onto the expansion volume, I then deleted the main (system) volume.  QNAP was okay with this and didn't complain at all.  Some sites that I had read claimed that you'd have to reboot or reformat or something if you did this, but at least on modern QNAP OSes, it's fine with you deleting its system volume.

For more information on deleting a volume, see this article from QNAP.

I created a new storage pool with the main unit's existing disks, and then I created a small, thin volume on it to see what would happen.  QNAP quickly decided that this new volume would be the new "system" volume, and it installed some applications on its own, and then it was done.  My guess is that it installed whatever base config it needs to operate on that new volume and maybe transferred the few applications that I already had to it or something.

(I then rebooted the QNAP just to make sure that everything was working, and it ended up being fine.)

On the expansion unit, I renamed all of the top-level folders to end with "_expansion" so that I'd be able to tell them apart from the ones that I would make on the main unit.

Then I used Hybrid Backup Sync to copy my folders from the expansion volume to the main volume.  Once that was done, I modified the NFS settings on the main volume's folders to match what they had been originally.

I tested the connections from all my machines that use the NAS, and then I detached and powered down the expansion unit.  I restarted the NAS and tested the connections again, and everything was perfect.  Now I had a storage pool with thin-provisioned volumes instead of a single, massive static volume.

Monday, July 5, 2021

Working around App Engine's bogus file modification times in Go

When an App Engine application is deployed, the files on the filesystem have their modification times "zeroed"; in this case, they are set to Tuesday, January 1, 1980 at 00:00:01 GMT (with a Unix timestamp of "315532801").  Oddly enough, this isn't January 1, 1970 (with a Unix timestamp of "0"), so they're adding 10 years and 1 second for some reason (probably to avoid actually zeroing out the date).
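
As a quick sanity check of that timestamp, here is a minimal Go sketch that renders it in the HTTP date format.  The names "appEngineModTime" and "formatHTTPDate" are made up for this example; they aren't part of App Engine or the standard library.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// appEngineModTime is the Unix timestamp quoted above.
// The constant name is an assumption for this example.
const appEngineModTime = 315532801

// formatHTTPDate renders a Unix timestamp the way it would appear
// in a "Last-Modified" header.
func formatHTTPDate(unixSeconds int64) string {
	return time.Unix(unixSeconds, 0).UTC().Format(http.TimeFormat)
}

func main() {
	fmt.Println(formatHTTPDate(appEngineModTime))
	// Prints: Tue, 01 Jan 1980 00:00:01 GMT
}
```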

If you found your way here by troubleshooting, you may have seen this for your "Last-Modified" header:

last-modified: Tue, 01 Jan 1980 00:00:01 GMT

There's an issue for this particular problem (currently they're saying that it's working as designed); to follow the issue or make a comment, see issue 168399701.

For App Engine in Go, I've historically bypassed the static files stuff and just had my application serve up the files with "http.FileServer", and I've disabled caching everywhere to play it safe ("Cache-Control: no-cache, no-store, must-revalidate").  Recently, I've begun to experiment with a "max-age" of 1 minute, lined up on 1-minute boundaries, so that I get a bit of help from the GCP proxy and its caching powers while not shooting myself in the foot by allowing stale copies of my files to linger all over the Internet.

This caused me a huge amount of headache recently when my web application wasn't updating in production, despite the new version having been deployed for over 24 hours.  It turns out that the browser (Chrome) was making a request by including the "If-Modified-Since" header, and my application was responding back with a 304 Not Modified response.  No matter how many times my service worker tried to fetch the new data, the server kept telling it that what it had was perfect.

The default HTTP file server in some languages lets you tweak how it responds ("ETag", "Last-Modified", etc.), but not in Go.  "http.FileServer" has no configuration options available to it.

What I ended up doing was wrapping "http.FileServer"'s "ServeHTTP" in another function; this function had two main goals:

  1. Set up a weak ETag value using the expiration date (ideally, I'd use a strong value like the MD5 sum of the contents, but I didn't want to have to rewrite "http.FileServer" just for this).
  2. Remove the request headers related to the modification time ("If-Modified-Since" and "If-Unmodified-Since").  "http.FileServer" definitely respects "If-Modified-Since", and because the modification time is bogus in App Engine, I figured that just quietly removing any headers related to that would keep things simple.
Here's what I ended up with:

staticHandler := http.StripPrefix("/", http.FileServer(http.Dir("/path/to/my/files")))

myHandler.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
   // Cache all the static files aligned at the 1-minute boundary.
   expirationTime := time.Now().Truncate(1 * time.Minute).Add(1 * time.Minute)
   w.Header().Set("Cache-Control", fmt.Sprintf("public, max-age=%0.0f, must-revalidate", time.Until(expirationTime).Seconds()))
   w.Header().Set("ETag", fmt.Sprintf("W/\"exp_%d\"", expirationTime.Unix())) // The ETag is weak ("W/" prefix) because it'll be the same tag for all encodings.

   // Strip the headers that `http.FileServer` will use that rely on modification time.
   // App Engine sets all of the timestamps to January 1, 1980.
   r.Header.Del("If-Modified-Since")
   r.Header.Del("If-Unmodified-Since")

   staticHandler.ServeHTTP(w, r)
})

Anyway, I fought with this for two days before finally realizing what was going on, so hopefully this will let you work around App Engine's bogus file-modification times.

Thursday, April 15, 2021

Using "errors.Is" to detect "connection reset by peer" and work around it

 I maintain an application that ties into Emergency Reporting using their REST API.  When an item is updated, I have a Google Cloud Task that attempts to publish a change to a web hook, which connects to the Emergency Reporting API and creates a new incident in that system.  Because it's in Cloud Tasks, if the task fails for any reason, Cloud Tasks will attempt to retry the task until it succeeds.  Cool.

I also have it set up to send any log messages at warning level or higher to a Slack channel.  Also cool.

However, in December of 2020, Emergency Reporting switched to some kind of Microsoft-managed authentication system for their API, and this has only brought problems.  The most common one is that the authentication API will frequently fail with a "connection reset by peer" error.  My Emergency Reporting wrapper detects this and logs it; my web hook detects a sign-in failure and logs that; and the whole Cloud Task detects that the web hook has failed and logs that.  Cloud Tasks automatically retries the task, which makes another post to the web hook, and everything succeeds the second time.  But by now, I've accumulated a bunch of warnings in the Slack channel.  Not cool.

So here's the thing: the Emergency Reporting API can fail for a lot of reasons, and I'd like to be notified when something important actually happens.  But a standard, run-of-the-mill TCP "connection reset by peer" error is not important at all.

Here's an example of the kind of error that Go's http.Client.PostForm returns in this case:

Could not post form: Post https://login.emergencyreporting.com/login.emergencyreporting.com/B2C_1A_PasswordGrant/oauth2/v2.0/token: read tcp [fddf:3978:feb1:d745::c001]:33391->[2620:1ec:29::19]:443: read: connection reset by peer

Looking at the error, it looks like there are 4 layers of error:

  1. The HTTP post
  2. The particular TCP read
  3. A generic "read"
  4. A generic "connection reset by peer"
What I really want to do in this case is detect a generic "connection reset by peer" error and quietly retry the operation, allowing all other errors to be handled as true errors.  Doing string-comparison operations on error text is rarely a good idea, so what does that leave us with?

Go 1.13 adds support for "error wrapping", where one error can "wrap" another one, while still allowing programs to make decisions based on the "wrapped" error.  You may call "errors.Is" to determine if any error in an error chain matches a particular target.

Fortunately, all of the packages in this particular chain of errors utilize this feature.  In particular, the syscall package has a set of distinct Errno errors for each low-level error, including "connection reset by peer" (ECONNRESET).

This lets us do something like this:

tokenResponse, err = client.GenerateToken()
if err != nil {
   // If this was a connection-reset error, then continue to the next retry.
   if errors.Is(err, syscall.ECONNRESET) {
      logrus.Info("Got back a syscall.ECONNRESET from Emergency Reporting.")
      // [attempt to retry the operation]
   } else {
      // This was some other kind of error that we can't handle.
      // [log a proper error message and fail]
   }
}

Since using "errors.Is" to detect the "connection reset by peer" error, I haven't received a single annoying, pointless error message in my Slack channel.  I did have to spend a bit of time trying to figure out what that ultimate, underlying error was, but after that, it's been working flawlessly.

Monday, January 25, 2021

Using LDAP groups to limit access to a Radius server (freeRADIUS 3.0)

Note: this is an updated version of a prior entry, revised for freeRADIUS 3.0.

Anytime I need to create a VPN (to my home network, to my AWS network, etc.), I use SoftEther.  SoftEther is OpenVPN-compatible, supports L2TP/IPsec, and has some neat settings around VPN over ICMP and DNS.  Anyway, once you get it set up, it generally just works (except for the cronjob that you need to make to trim its massive log files daily).

At work, we use LDAP for our user authentication and permissions, but SoftEther doesn't support LDAP.  It does, however, support Radius, and freeRADIUS supports using LDAP as a module, so you can easily set up a quick Radius proxy for LDAP.

Quick recap on setting up freeRADIUS with LDAP

I'm assuming that you already have an LDAP server.

Install freeRADIUS and the LDAP module.
sudo apt install freeradius freeradius-ldap
sudo systemctl enable freeradius
sudo systemctl start freeradius

Enable the LDAP module via symlink:
ln -sfn ../mods-available/ldap /etc/freeradius/3.0/mods-enabled/ldap

Then turn on the LDAP module by editing /etc/freeradius/3.0/sites-enabled/default and uncommenting the "ldap" line under the "authorize" block.
authorize {
...
   ldap
...

You'll need to add an "if" statement to set the "Auth-Type"; do this immediately after that "ldap" line.
   if ((ok || updated) && User-Password) {
      update {
         control:Auth-Type := ldap
      }
   }

And the same for the "Auth-Type LDAP" block, which lives in the "authenticate" block.
authenticate {
...
   Auth-Type LDAP {
      ldap
   }
...

Cool; at this point, freeRADIUS will use whatever LDAP setup is in the /etc/freeradius/3.0/mods-enabled/ldap file.  It won't work yet (because it's not set up for your LDAP server), but that's all that you need in order to back your Radius server with your LDAP server.

Next up, we'll look at configuring it to actually talk to your LDAP server.

Configuring the LDAP module

/etc/freeradius/3.0/mods-enabled/ldap is where the LDAP configuration lives.  In order to understand exactly what's going on, you should know a few things.
  1. Run-time variables, like the current user name, are written as %{Variable-Name}.  For example, the current user name is %{User-Name}.
  2. Similar to shell variables, you can have conditional values.  The basic syntax is %{%{Variable-1}:-%{Variable-2}}.  A typical pattern that you'll see is using the "stripped" user name (the user name without any realm information), but if that's not defined, then use the actual user name: %{%{Stripped-User-Name}:-%{User-Name}}
For your basic LDAP integration (if you provide a valid username and password, you can sign in), you'll need to set the following values in the "ldap" block:
  1. server; this is the hostname or address of your server.  If you're running freeRADIUS on the same machine as your LDAP server, then this will be "localhost".
  2. identity; this is the DN for the "bind" user.  That's the user that freeRADIUS will log in as in order to search the directory tree and do its LDAP stuff.  This is typically a read-only user.
  3. password; this is the password for the user configured in identity.
  4. base_dn; this is the base DN to use for all user searches.  It's usually something like dc=example,dc=com, but that'll depend on your LDAP setup.  You'll generally want to set this as the base for all of your users (maybe something like ou=users,dc=example,dc=com, etc.).
Here's an example that assumes that your users are all under ou=users,dc=example,dc=com:
server = "my-ldap-server.example.com"
identity = "uid=my-bind-user,ou=service-users,dc=example,dc=com"
password = "abc123"
base_dn = "ou=users,dc=example,dc=com"

Users

You'll also need to set up user-level things in the "user" block:
  1. filter; this is the LDAP search condition that freeRADIUS will use to try to find the matching LDAP user for the user name that just tried to sign in via Radius.  This is where run-time variables will come into play.  For out-of-the-box OpenLDAP, something like this will generally work: (uid=%{%{Stripped-User-Name}:-%{User-Name}}).  What this means is look for an entity in LDAP (under the base DN defined in base_dn) with a uid property of the Radius user name.  Yes, you need the surrounding parentheses.  No, I don't make the rules.
Here's an example that uses "uid" for the user name.
filter = "(uid=%{%{Stripped-User-Name}:-%{User-Name}})"

Remember, filter can be any LDAP filter, so if there were a property that you also wanted to check (such as isAllowedToDoRadius or something), then you could check for that, as well.  For example:
filter = "(&(uid=%{%{Stripped-User-Name}:-%{User-Name}})(isAllowedToDoRadius=yes))"

Filtering by group

So, that'll let any LDAP user authenticate with Radius.  Maybe you want that, maybe you don't.  In my case, I have a whole bunch of users, but I only want a small subset to be able to VPN in using SoftEther.  I added those users to the "vpn-users" group in LDAP.

Note that there are two general grouping strategies in LDAP:
  1. Groups-have-users; in this strategy, the group entity lists the users within the group.  This is the default OpenLDAP strategy.
  2. Users-have-groups; in this strategy, the user entity lists the groups that it belongs to.
If you want to have freeRADIUS respect your groups, you'll need to set the following in /etc/freeradius/3.0/mods-enabled/ldap in the "groups" block:
  1. name_attribute = cn (which turns on tracking groups); and
  2. One of these two options, which each correspond to one of the LDAP grouping strategies:
    1. membership_filter; this is an LDAP filter to use to query for all of the groups that the user belongs to.
    2. membership_attribute; this is the property on the user entity that lists the groups that the user belongs to.
If your groups have users, this might look like:
name_attribute = cn
membership_filter = "(&(objectClass=posixGroup)(memberUid=%{%{Stripped-User-Name}:-%{User-Name}}))"

If your users have groups, this might look like:
name_attribute = cn
membership_attribute = groupName

With that set up, freeRADIUS will now know which groups the user belongs to, but it won't do anything with them.

The last step is to set up some group rules in /etc/freeradius/3.0/users.  There will probably be a few entries in that file already, but by default, none of them will be LDAP-related.  So, at the very bottom, add the LDAP group rules.

Note: In my case, this file was a symlink to "mods-config/files/authorize".  The symlink was a convenience for backward-compatibility in editing the config files; freeRADIUS doesn't actually load "users"; rather, it loads "mods-config/files/authorize", so make sure that you're actually modifying the correct file.

The simplest grouping rules will look like this:
DEFAULT LDAP-Group == "your-group-name-here"
DEFAULT Auth-Type := Reject
  Reply-Message = "Sorry, you're not part of an authorized group."

This generally means: you have to be a member of "your-group-name-here" or else you'll be rejected (and here's the message to send you).

In my case, my group is "vpn-users", so it looks like this:
DEFAULT LDAP-Group == "vpn-users", Auth-Type := Accept
DEFAULT Auth-Type := Reject
  Reply-Message = "Sorry, you're not part of an authorized group."

Once that's done, restart freeradius and you'll be good to go.
sudo systemctl restart freeradius

To test to see if it worked, you can run the radtest command:
radtest -x ${username} ${password} ${address} ${port} ${secret}

For example, in our case, this might look like:
radtest -x some-user abc123 my-radius-server.example.com 1812 the-gold-is-under-the-bridge

On success, you'll see something like:
rad_recv: Access-Accept packet

On failure, you'll see something like:
rad_recv: Access-Reject packet

Hopefully this helped a bit; I struggle every time I need to do anything with LDAP or Radius.  It's always really hard to find the documentation for what I'm looking for.

Tuesday, December 22, 2020

Using LDAP groups to limit access to a Radius server

Anytime I need to create a VPN (to my home network, to my AWS network, etc.), I use SoftEther.  SoftEther is OpenVPN-compatible, supports L2TP/IPsec, and has some neat settings around VPN over ICMP and DNS.  Anyway, once you get it set up, it generally just works (except for the cronjob that you need to make to trim its massive log files daily).

At work, we use LDAP for our user authentication and permissions, but SoftEther doesn't support LDAP.  It does, however, support Radius, and freeRADIUS supports using LDAP as a module, so you can easily set up a quick Radius proxy for LDAP.

Quick recap on setting up freeRADIUS with LDAP

I'm assuming that you already have an LDAP server.

Install freeRADIUS and the LDAP module.
sudo apt install freeradius freeradius-ldap
sudo systemctl enable freeradius
sudo systemctl start freeradius

Then turn on the LDAP module by editing /etc/freeradius/sites-enabled/default and uncommenting the "ldap" line under the "authorize" block.
authorize {
...
   ldap
...

And do the same for the "Auth-Type LDAP" block, which lives under the "authenticate" block.
authenticate {
...
   Auth-Type LDAP {
      ldap
   }
...

Cool; at this point, freeRADIUS will use whatever LDAP setup is in the /etc/freeradius/modules/ldap file.  It won't work yet (because it's not set up for your LDAP server), but that's all that you need in order to back your Radius server with your LDAP server.

Next up, we'll look at configuring it to actually talk to your LDAP server.

Configuring the LDAP module

/etc/freeradius/modules/ldap is where the LDAP configuration lives.  In order to understand exactly what's going on, you should know a few things.
  1. Run-time variables, like the current user name, are written as %{Variable-Name}.  For example, the current user name is %{User-Name}.
  2. Similar to shell variables, you can have conditional values.  The basic syntax is %{%{Variable-1}:-%{Variable-2}}, which means "use Variable-1, but if that's not defined, use Variable-2".  A typical pattern that you'll see is using the "stripped" user name (the user name without any realm information), but if that's not defined, then use the actual user name: %{%{Stripped-User-Name}:-%{User-Name}}
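
Since the syntax borrows from the shell, a shell analogy might help.  This is only a sketch: resolve_user and its two arguments are made-up names for illustration, not anything that freeRADIUS defines.

```shell
# Shell analogy for %{%{Stripped-User-Name}:-%{User-Name}}.
# resolve_user is a hypothetical helper; $1 plays the role of
# Stripped-User-Name (possibly empty) and $2 plays the role of User-Name.
resolve_user() {
  stripped_user_name="$1"
  user_name="$2"
  # ${a:-b} expands to $a if it is set and non-empty, otherwise to b.
  printf '%s\n' "${stripped_user_name:-$user_name}"
}

resolve_user "" "alice@example.com"        # falls back to the full user name
resolve_user "alice" "alice@example.com"   # uses the stripped user name
```

The first call prints the full user name because the stripped one is empty; the second prints the stripped one.
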
For your basic LDAP integration (if you provide a valid username and password, you can sign in), you'll need to set the following values in the "ldap" block:
  1. server; this is the hostname or address of your LDAP server.  If you're running freeRADIUS on the same machine as your LDAP server, then this will be "localhost".
  2. identity; this is the DN for the "bind" user.  That's the user that freeRADIUS will log in as in order to search the directory tree and do its LDAP stuff.  This is typically a read-only user.
  3. password; this is the password for the user configured in identity.
  4. basedn; this is the base DN to use for all user searches.  It's usually something like dc=example,dc=com, but that'll depend on your LDAP setup.  You'll generally want to set this as the base for all of your users (maybe something like ou=users,dc=example,dc=com, etc.).
  5. filter; this is the LDAP search condition that freeRADIUS will use to try to find the matching LDAP user for the user name that just tried to sign in via Radius.  This is where run-time variables will come into play.  For out-of-the-box OpenLDAP, something like this will generally work: (uid=%{%{Stripped-User-Name}:-%{User-Name}}).  What this means is look for an entity in LDAP (under the base DN defined in basedn) with a uid property of the Radius user name.  Yes, you need the surrounding parentheses.  No, I don't make the rules.
Here's an example that assumes that your users are all under ou=users,dc=example,dc=com and have a uid property that is their user name:
server = "my-ldap-server.example.com"
identity = "uid=my-bind-user,ou=service-users,dc=example,dc=com"
password = "abc123"
basedn = "ou=users,dc=example,dc=com"
filter = "(uid=%{%{Stripped-User-Name}:-%{User-Name}})"

Remember, filter can be any LDAP filter, so if there were a property that you also wanted to check (such as isAllowedToDoRadius or something), then you could check for that, as well.  For example:
filter = "(&(uid=%{%{Stripped-User-Name}:-%{User-Name}})(isAllowedToDoRadius=yes))"
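
Before handing a filter to freeRADIUS, it can save some time to try it directly against LDAP with ldapsearch.  The sketch below assumes the example values from above (server, bind user, and base DN); user_filter and find_user are hypothetical helper names, and the user name is dropped into the filter without any escaping.

```shell
# Assumed values, mirroring the example config above; substitute your own.
LDAP_URI="ldap://my-ldap-server.example.com"
BIND_DN="uid=my-bind-user,ou=service-users,dc=example,dc=com"
BASE_DN="ou=users,dc=example,dc=com"

# Build the filter that results once the user name has been substituted in.
# No LDAP escaping is done here, so only use this with plain user names.
user_filter() {
  printf '(uid=%s)\n' "$1"
}

# Search as the bind user (-W prompts for its password) and print only the
# DN of whatever matches.
find_user() {
  ldapsearch -x -H "$LDAP_URI" -D "$BIND_DN" -W -b "$BASE_DN" "$(user_filter "$1")" dn
}

# Usage (needs a reachable LDAP server):
#   find_user some-user
```

If find_user comes back with exactly one entry, the same filter should behave in freeRADIUS; if it comes back with nothing, the problem is in the filter or the base DN rather than in Radius.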

Filtering by group

So, that'll let any LDAP user authenticate with Radius.  Maybe you want that, maybe you don't.  In my case, I have a whole bunch of users, but I only want a small subset to be able to VPN in using SoftEther.  I added those users to the "vpn-users" group in LDAP.

Note that there are two general grouping strategies in LDAP:
  1. Groups-have-users; in this strategy, the group entity lists the users within the group.  This is the default OpenLDAP strategy.
  2. Users-have-groups; in this strategy, the user entity lists the groups that it belongs to.
If you want to have freeRADIUS respect your groups, you'll need to set the following in /etc/freeradius/modules/ldap:
  1. groupname_attribute = cn (which turns on tracking groups); and
  2. One of these two options, which each correspond to one of the LDAP grouping strategies:
    1. groupmembership_filter; this is an LDAP filter to use to query for all of the groups that the user belongs to.
    2. groupmembership_attribute; this is the property on the user entity that lists the groups that the user belongs to.
If your groups have users, this might look like:
groupname_attribute = cn
groupmembership_filter = "(&(objectClass=posixGroup)(memberUid=%{%{Stripped-User-Name}:-%{User-Name}}))"

If your users have groups, this might look like:
groupname_attribute = cn
groupmembership_attribute = groupName

With that set up, freeRADIUS will now know which groups the user belongs to, but it won't do anything with them.
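
For the groups-have-users case, the group membership filter can be checked by hand in the same way.  This is a sketch under assumptions: the group base DN (ou=groups,dc=example,dc=com) is a guess at a typical layout, and group_filter and list_groups are hypothetical helper names.

```shell
# Assumed values; substitute your own server, bind user, and group base DN.
LDAP_URI="ldap://my-ldap-server.example.com"
BIND_DN="uid=my-bind-user,ou=service-users,dc=example,dc=com"
GROUP_BASE_DN="ou=groups,dc=example,dc=com"

# Same shape as the groupmembership_filter above, with the user name filled in.
group_filter() {
  printf '(&(objectClass=posixGroup)(memberUid=%s))\n' "$1"
}

# List the cn of every group that contains the user (-W prompts for the
# bind user's password).
list_groups() {
  ldapsearch -x -H "$LDAP_URI" -D "$BIND_DN" -W -b "$GROUP_BASE_DN" "$(group_filter "$1")" cn
}

# Usage (needs a reachable LDAP server):
#   list_groups some-user
```

The cn values that come back are the group names that the rules in the next step get compared against.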

The last step is to set up some group rules in /etc/freeradius/users.  There will probably be a few entries in that file already, but by default, none of them will be LDAP-related.  So, at the very bottom, add the LDAP group rules.

The simplest grouping rules will look like this:
DEFAULT LDAP-Group == "your-group-name-here"
DEFAULT Auth-Type := Reject
  Reply-Message = "Sorry, you're not part of an authorized group."

This generally means: you have to be a member of "your-group-name-here" or else you'll be rejected (and here's the message to send you).

In my case, my group is "vpn-users", so it looks like this:
DEFAULT LDAP-Group == "vpn-users"
DEFAULT Auth-Type := Reject
  Reply-Message = "Sorry, you're not part of an authorized group."
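
The reason that two entries are enough is that entries in this file are checked from top to bottom, and the first one that matches wins (unless it sets Fall-Through).  Here's a toy model of that in shell; check_group is a hypothetical helper that only mimics the ordering and doesn't talk to LDAP or freeRADIUS.

```shell
# Toy model of the two DEFAULT entries, checked top to bottom.
# Usage: check_group REQUIRED_GROUP [USER_GROUP...]
check_group() {
  required_group="$1"
  shift
  for group in "$@"; do
    if [ "$group" = "$required_group" ]; then
      # The first DEFAULT entry matched, so processing stops here and the
      # Reject entry below it is never reached.
      echo "matched"
      return 0
    fi
  done
  # Nothing matched, so the catch-all DEFAULT entry applies.
  echo "Auth-Type := Reject"
}

check_group vpn-users staff vpn-users   # user is in vpn-users
check_group vpn-users staff             # user is not in vpn-users
```

The first call prints "matched" and the second prints "Auth-Type := Reject", which mirrors what happens to a user inside and outside of the group.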

Once that's done, restart freeradius and you'll be good to go.
sudo systemctl restart freeradius

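To test to see if it worked, you can run the radtest command.  Note that the fourth argument is the NAS port number to report (any number, such as 0, will do), not the server's port; to use a non-default server port, append it to the address as ${address}:${port}.
radtest -x ${username} ${password} ${address} ${nas_port_number} ${secret}

For example, in our case, this might look like:
radtest -x some-user abc123 my-radius-server.example.com:1812 0 the-gold-is-under-the-bridge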

On success, you'll see something like:
rad_recv: Access-Accept packet

On failure, you'll see something like:
rad_recv: Access-Reject packet
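
If you want to script the check (say, to confirm that a user in the group gets in and a user outside of it doesn't), a small wrapper like this might do.  It's a sketch: check_radius_login is a hypothetical name, and it just looks for "Access-Accept" anywhere in whatever radtest prints.

```shell
# Pass the same arguments that you would give to "radtest -x".
# Prints "accepted" if the output mentions Access-Accept, else "rejected".
check_radius_login() {
  if radtest -x "$@" 2>&1 | grep -q 'Access-Accept'; then
    echo "accepted"
  else
    echo "rejected"
  fi
}

# Usage (needs a reachable Radius server):
#   check_radius_login some-user abc123 my-radius-server.example.com 0 the-gold-is-under-the-bridge
```

Anything other than an accept, including a timeout, is reported as "rejected", so fall back to running radtest by hand when the result is surprising.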

Hopefully this helped a bit; I struggle every time I need to do anything with LDAP or Radius.  It's always really hard to find the documentation for what I'm looking for.