Monday, May 23, 2022

sed will blow away your symlinks by default

sed typically outputs to stdout, but sed -i allows you to edit a file “in place”.  However, under the hood, it actually creates a new file and then replaces the original file with the new file.  This means that if the file you're editing is a symlink, sed replaces the symlink itself with a regular file.  This is most likely not what you want.

However, there is a flag to pass to make it work the way that you’d expect:

--follow-symlinks

So, if you're using sed -i, then you probably want to tack on --follow-symlinks, too.
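
For example, with GNU sed (the long option is GNU-specific; the substitution and path here are just placeholders):

sed -i --follow-symlinks 's/foo/bar/g' /path/to/symlinked-file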

Saturday, April 16, 2022

Golang, http.Client, and "too many open files"

I've been having an issue with my application for a while now, and I finally figured out what the problem was.  In this particular case, the application is a web app (so, think REST API written in Go), and one of its nightly routines is to synchronize a whole bunch of data with various third-party ArcGIS systems.  The application keeps a cache of the ArcGIS images, and this job updates them so they're only ever a day old.  This allows it to show map overlays even if the underlying ArcGIS systems are inaccessible (they're random third-party systems that are frequently down for maintenance).

So, imagine 10 threads constantly making HTTP requests for new map tile images; once a large enough batch is done, the cache is updated, and then the process repeats until the entire cache has been refreshed.

In production, I never noticed a direct problem, but there were times when an ArcGIS system would just completely freak out and start lying about not supporting pagination anymore or otherwise spewing weird errors (but again, it's a third-party system, so what can you do?).  In development, I would notice this particular endpoint failing after a while with a "dial" error of "too many open files".  Every time that I looked, though, everything seemed fine, and I just forgot about it.

This last time, though, I watched the main application's open sockets ("ss -anp | grep my-application"), and I noticed that the number of connections just kept increasing.  This reminded me of my old networking days, and it looked like the TCP connections were just accumulating until the OS felt like closing them due to inactivity.

That's when I found that Go's "http.Client" has a method called "CloseIdleConnections()" that immediately closes any idle connections without waiting for the OS to do it for you.

For reasons that are not relevant here, each request to a third-party ArcGIS system uses its own "http.Client", so connections could never be reused between requests.  The application just kept racking up open connections, eventually hitting the default limit of 1024 "open files".  I simply added "defer httpClient.CloseIdleConnections()" after creating the "http.Client", and everything magically behaved as I expected: no more than 10 active connections at any time (one for each of the 10 threads running).

So, if your Go application is getting "too many open files" errors when making a lot of HTTP requests, be sure to either (1) re-architect your application to reuse your "http.Client" whenever possible, or (2) be sure to call "CloseIdleConnections()" on your "http.Client" as soon as you're done with it.
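
Here's a minimal sketch of option (2).  The fetch function and its details are made up for illustration; the important part is the deferred "CloseIdleConnections()" call:

package tiles

import (
    "fmt"
    "io"
    "net/http"
    "time"
)

// fetchTile grabs a single map tile using its own short-lived client.
// Without the deferred CloseIdleConnections, the keep-alive connection
// lingers until the OS times it out, and the open sockets pile up.
func fetchTile(url string) ([]byte, error) {
    httpClient := &http.Client{Timeout: 30 * time.Second}
    defer httpClient.CloseIdleConnections() // release the socket as soon as we're done

    resp, err := httpClient.Get(url)
    if err != nil {
        return nil, fmt.Errorf("fetching tile: %w", err)
    }
    defer resp.Body.Close()

    return io.ReadAll(resp.Body)
}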

I suspect that some of the third-party ArcGIS issues that I was seeing in production might essentially have been self-inflicted denial-of-service problems caused by my application assaulting these poor servers with thousands of connections.

Saturday, April 2, 2022

Service workers, push notifications, and IndexedDB

I have a pretty simple use case: a user wanted my app to provide a list of "recent" notifications that had been sent to her.  Sometimes a lot of notifications will come through in a relatively short time period, and she wants to be able to look at the list of them to make sure that she's handled them all appropriately.

I ended up having the service worker write the notification to an IndexedDB and then having the UI reload the list of notifications when it receives a "message" event from the service worker.

Before we get there, I'll walk you through my process because it was painful.

Detour: All the mistakes that I made

Since I was already using HTML local storage for other things, I figured that I would just record the list of recent notifications in there.  Every time that the page would receive a "message" event, it would add the event data to a list of notifications in local storage.  That kind of worked as long as I was debugging it.  As long as I was looking at the page, the page was open, and it would receive the "message" event.

However, in the real world, my application is installed as a "home screen" app on Android and is usually closed.  When a notification arrived, there was no open page to receive the "message" event, and it was lost.

I then tried to have the service worker write to HTML local storage instead.  It wouldn't matter which side (page or service worker) actually wrote the data since both sides would detect a change immediately.  Except that's not how it works.  Service workers can't use HTML local storage at all: it's a synchronous API, and service workers are only allowed to use asynchronous storage.

Anyway, HTML local storage was impossible as a simple communication and storage mechanism.

Because the page was usually not open, MessageChannel and BroadcastChannel also wouldn't work.

I finally settled on using IndexedDB because a service worker is allowed to use it.  The biggest annoyance (in the design) was that there is no way to have a page "listen" for changes to an IndexedDB, so I couldn't just trivially tell my page to update the list of notifications to display when there was a change to the database.

After implementing IndexedDB, I spent a week trying to figure out why it wasn't working half the time, and that leads us to how service workers actually work.

Detour: How service workers work

Service workers are often described as a background process for your page.  The way that you hear about them, they sound like daemons that are always running and process events when they receive them.

But that's not anywhere near correct in terms of how they are implemented.  Service workers are more like "serverless" functions (such as Google Cloud Functions): they generally aren't running, but when a request comes in that they need to handle, one is spun up to handle it, kept around for a few minutes in case any other requests come in, and then shut down.

So my big mistake was thinking that once I initialized something in my service worker, it would be available more or less indefinitely.  In reality, the browser knows which events a service worker has registered ("push", "message", etc.) and can spin up a new worker whenever it wants, typically to handle such an event, and then shut it down again shortly thereafter.

Service workers have an "install" event that gets run when new service worker code gets downloaded.  This is intended to be run exactly once for that version of the service worker.

There is also an "activate" event that gets run when that version of the service worker takes over as the active worker.  Like "install", it runs once per version; it does not re-run every time the browser spins the worker process back up to handle an event.  That means you can't count on globals you set up during "activate" still being around when a later event arrives, so anything an event handler needs should be created (or lazily re-created) inside that handler.

The "push" event is run when a push message has been received.  Whatever work you need to do should be passed to the event's "waitUntil" method as a promise chain that ultimately results in showing a notification to the user.
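
A bare-bones skeleton of those three handlers looks something like this (the bodies are placeholders rather than my actual code; "skipWaiting" and "clients.claim" are just the typical boilerplate):

// service-worker.js (skeleton)

self.addEventListener('install', (event) => {
  // Runs once when this version of the service worker is installed.
  event.waitUntil(self.skipWaiting());
});

self.addEventListener('activate', (event) => {
  // Runs once when this version takes over as the active worker.
  event.waitUntil(self.clients.claim());
});

self.addEventListener('push', (event) => {
  // Runs whenever a push message arrives; the worker may have just been
  // spun up for this one event, so don't assume any globals still exist.
  const data = event.data ? event.data.json() : {};
  event.waitUntil(
    self.registration.showNotification(data.title || 'Notification', {
      body: data.body,
    })
  );
});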

Detour: How IndexedDB works

IndexedDB was seemingly invented by people who had no concept of Promises in JavaScript.  Its API is entirely insane and based on "onsuccess", "oncomplete", and "onerror" callbacks.  (You can technically also use event listeners, but it's just as insane.)  It's an asynchronous API that doesn't use any of the standard asynchronous syntax that everything else in modern JavaScript uses.  It is what it is.

Here's what you need to know: everything in IndexedDB is callbacks.  Everything.  So, if you want to connect to a database, you'll need to make an IDBRequest and set the "onsuccess" callback.  Once you have the database, you'll need to create a transaction and set the "oncomplete" callback.  Then you can create another IDBRequest for reading or writing data in an object store (essentially a table) and set the "onsuccess" callback.  It's callback hell, but it is what it is.  (Note that there are wrapper libraries that provide Promise-based syntax, but I hate having to wrap a standard feature for no good reason.)

(Also, there's an "onupgradeneeded" callback at the database level that you can use to do any schema- or data-related work if you're changing the database version.)
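
Here's roughly the shape of it, wrapped in Promises the way I describe doing later.  The database, store, and function names are made up for illustration, and the store uses a single out-of-line key:

// Open (or create) the database; resolves with an IDBDatabase.
function openDatabase() {
  return new Promise((resolve, reject) => {
    const request = indexedDB.open('notifications-db', 1);
    request.onupgradeneeded = () => {
      // Runs when the database version changes; create the object store here.
      request.result.createObjectStore('notifications');
    };
    request.onsuccess = () => resolve(request.result);
    request.onerror = () => reject(request.error);
  });
}

// Read a single value by key.
function readValue(db, key) {
  return new Promise((resolve, reject) => {
    const request = db
      .transaction('notifications', 'readonly')
      .objectStore('notifications')
      .get(key);
    request.onsuccess = () => resolve(request.result);
    request.onerror = () => reject(request.error);
  });
}

// Write a single value under a key.
function writeValue(db, key, value) {
  return new Promise((resolve, reject) => {
    const transaction = db.transaction('notifications', 'readwrite');
    transaction.objectStore('notifications').put(value, key);
    transaction.oncomplete = () => resolve();
    transaction.onerror = () => reject(transaction.error);
  });
}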

Putting it all together

I decided that there was no reason to waste cycles opening the IndexedDB on "activate" since there's no guarantee that it'll actually be used.  Instead, I had the "push" event use the previous database connection (if there was one) or create a new connection (if there wasn't).

I put together the following workflow for my service worker (there's a sketch of both sides after these lists):

  1. Register the "push" event handler ("event.waitUntil(...)"):
    1. (Promise) Connect to the IndexedDB.
      1. If we already have a connection from a previous call, then return that.
      2. Otherwise, connect to the IndexedDB and return that (and also store it for quick access the next time so we don't have to reconnect).
    2. (Promise) Read the list of notifications from the database.
    3. (Promise) Add the new notification to the list and write it back to the database.
    4. (Promise) Fire a "message" event to all active pages and show a notification if no page is currently visible to the user.

And for my page:
  1. Load the list of notifications from the IndexedDB when the page loads.  (This sets our starting point, and any changes will be communicated by a "message" event from the service worker.)
  2. Register the "message" event handler:
    1. Reload the list of notifications from the IndexedDB.  (Remember, there's no way to be notified on changes, so receiving the "message" event and reloading is the best that we can do.)
    2. (Handle the message normally; for me, this shows a little toast with the message details and a link to click on to take the user to the appropriate screen.)
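
Stitched together, the two sides look roughly like this.  This is simplified: "openDatabase", "readValue", and "writeValue" are the Promise wrappers sketched above, the "recent" key is made up, and "reloadNotificationList" is a hypothetical page-side function:

// In the service worker: cache the connection across events when possible.
let cachedDb = null;

function getDatabase() {
  // Reuse the connection if this worker instance already opened one.
  if (cachedDb) return Promise.resolve(cachedDb);
  return openDatabase().then((db) => {
    cachedDb = db;
    return db;
  });
}

self.addEventListener('push', (event) => {
  const data = event.data ? event.data.json() : {};
  event.waitUntil(
    getDatabase()
      .then((db) =>
        readValue(db, 'recent').then((list) =>
          writeValue(db, 'recent', [...(list || []), data])
        )
      )
      .catch(() => null) // the database is a nice-to-have; never block the notification
      .then(() => self.clients.matchAll({ type: 'window' }))
      .then((clients) => {
        clients.forEach((client) => client.postMessage(data));
        const pageVisible = clients.some((c) => c.visibilityState === 'visible');
        if (!pageVisible) {
          return self.registration.showNotification(data.title || 'Notification', {
            body: data.body,
          });
        }
      })
  );
});

// In the page: load the list once on startup, then reload it whenever the
// service worker sends a "message" event.
navigator.serviceWorker.addEventListener('message', () => {
  // There's no way to listen for IndexedDB changes, so just reload the list.
  reloadNotificationList();
});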

For me, the database work is a nice-to-have; the notification is the critical part of the workflow.  So I made sure that every database-related error was handled and the Promises resolved no matter what.  This way, even if there was a completely unexpected database issue, it would just get quietly skipped and the notification could be shown to the user.

In my code, I created some simple functions (to deal with the couple of IndexedDB interactions that I needed) that return Promises so I could operate normally.  You could technically just do a single "new Promise(...)" to cover all of the IndexedDB work if you wanted, or you could use one of those fancy wrapper libraries.  In any case, you must call "event.waitUntil" with a Promise chain that ultimately resolves after doing something with the notification.  How you get there is up to you.

I was also using IndexedDB as asynchronous local storage, so I didn't need fancy keys or sorting or anything.  I just put all of my data under a single key that I could "get" and "put" trivially without having to worry about row counts or any other kind of data management.  There's a single object store with a single row in it.

Thursday, March 3, 2022

Dump your SSH settings for quick troubleshooting

I recently had a Jenkins job that would die, seemingly randomly.  The only thing that really stood out was that it would tend to succeed if the runtime was 14 minutes or less, and it would tend to fail if the runtime was 17 minutes or more.

This job did a bunch of database stuff (through an SSH tunnel; more on that soon), so I first did a whole bunch of troubleshooting on the Postgres client and server configs, but nothing seemed relevant.  It seemed to disconnect ("connection closed by server") on these long queries that would sit there for a long time (maybe around 15 minutes or so) and then come back with a result.  After ruling out the Postgres server (all of the settings looked good, and new sessions had decent timeout configs), I moved on to SSH.

This job connects to a database by way of a forwarded port through an SSH tunnel (don't ask why; just understand that it's the least worst option available in this context).  I figured that maybe the SSH tunnel was failing, since I start it in the background and have it run "sleep infinity" and then never look at it again.  However, when I tested locally, my SSH session would run for multiple days without a problem.

Spoiler alert: the answer ended up being the client config, but how do you actually find that out?

SSH has two really cool options.

On the server side, you can run "sudo sshd -T | sort" to have the SSH daemon read the relevant configs and then print out all of the actual values that it's using.  So, this'll merge in all of the unspecified defaults as well as all of the various options in "/etc/ssh/sshd_config", "/etc/ssh/sshd_config.d", etc.

On the client side, you can run "ssh -G ${user}@${host} | sort", and it'll do the same thing, but for all of the client-side configs for that particular user and host combination (because maybe you have some custom stuff set up in your SSH config, etc.).

Now, in my case, it ended up being a keepalive issue.  So, on the server side, here's what the relevant settings were:

clientalivecountmax 0
clientaliveinterval 900
tcpkeepalive yes

On the client (which would disconnect sometimes), here's what the relevant settings were:

serveralivecountmax 3
serveraliveinterval 0
tcpkeepalive yes

Here, you can see that the client (which is whatever the default Jenkins Kubernetes agent ended up being) had TCP keepalives enabled, but "serveraliveinterval" was set to "0", which means that it never sent any SSH-level keepalive messages at all (OS-level TCP keepalives typically don't kick in for hours).

According to the docs, the server should have started sending its own keepalives after 15 minutes of inactivity, but with "clientalivecountmax" at 0, the observed behavior was that the connection simply dropped at the 15-minute mark.  Setting "serveraliveinterval" to "60" on the client ended up solving my problem and allowed my SSH sessions to stay active indefinitely until the script was done with them.
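
The fix can go on the command line ("-o ServerAliveInterval=60") or in the client config.  For example (the host name here is a placeholder):

# ~/.ssh/config
Host bastion.example.com
    ServerAliveInterval 60
    ServerAliveCountMax 3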

Little bonus section

My SSH command to set up the tunnel in the background was:

ssh -4 -f -L${localport}:${targetaddress}:${targetport} ${user}@${bastionhost} 'sleep infinity';

"-4" forces it to use an IPv4 address (relevant in my context), and "-f" puts the SSH command into the background before "sleep infinity" gets called, right after all the port forwarding is set up.  "sleep infinity" ensures that the connection never closes on its own; the "sleep" command will do nothing forever.

(Obviously, I had the "-o ServerAliveInterval=60" option in there, too.)

With this, I could trivially have my container create an SSH session that allowed for port-forwarding, and that session would be available for the entirety of the container's lifetime (the entirety of the Jenkins build).

Tuesday, March 1, 2022

QNAP, NFS, and Filesystem ACLs

I recently spent hours banging my head against a wall trying to figure out why my Plex server couldn't find some new media that I put on its volume in my QNAP.

tl;dr QNAP's "Advanced Folder Permissions" setting turns on file access control lists (you'll need the "getfacl" and "setfacl" tools installed on Linux to mess with them).  For more information, see QNAP's guide on Advanced Folder Permissions.

I must have turned this setting on when I rebuilt my NAS a while back, and it never mattered until I did some file operations with the File Manager or maybe just straight "cp"; I forget which (or both).  Plex refused to see the new file, and I tried debugging the indexer and all that other Plex stuff before realizing that while it could list the file, it couldn't open the file, even though its normal "ls -l" permissions looked fine.

Apparently the file access control list was denying access, but I didn't even have "getfacl" or "setfacl" installed on my machine (and I had never even heard of this before), so I had no idea what was going on.  I eventually installed those tools and verified that while the standard Linux permissions looked fine, the ACL permissions did not.
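
If you've never poked at ACLs before, the inspection looks something like this (the path is a placeholder):

# Show the ACL entries alongside the normal owner/group permissions
getfacl /path/to/folder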

"sudo chmod -R +r /path/to/folder" didn't solve my problem, but tearing out the ACL did: "sudo setfacl -b -R /path/to/folder"

Later, I eventually figured out that it was QNAP's "Advanced Folder Permissions" and just disabled that so it wouldn't bother me again.

Sunday, January 9, 2022

Moving (or renaming) a push-notification ServiceWorker

Service Workers (web workers, etc.) are a relatively new concept.  They can do all kinds of cool things (primarily related to network requests), but they are also the mechanism by which a web site can receive push messages (via Web Push) and show them as OS notifications.

The general rule of Service Workers is to pick a file name (such as "/service-worker.js") and never, ever change it.  That's cool, but sometimes you do need to change it.

In particular, I started my push messaging journey with "platinum-push-messaging", a now-defunct web component built by Google as part of the initial Polymer project.  The promise was cool: just slap this HTML element on your page with a few parameters and boom: you have working push notifications.

When it came out, the push messaging spec was young, and no browsers fully supported its encrypted data payloads, so "platinum-push-messaging" did a lot of work to get around that limitation.  As browsers improved to support the VAPID spec, "platinum-push-messaging" (along with all of the other "platinum" elements) was quietly deprecated and archived (around 2017).

This left me with a problem: a rotting push notification system that couldn't keep up with the spec and the latest browsers.  I hacked the code to all hell to support VAPID and keep the element functioning, but I was just punting.

Apple ruined the declarative promise of the Polymer project by refusing to implement HTML imports, so the web components community adopted the NPM distribution model (and introduced a whole bunch of imperative Javascript drama and compilation tools).  Anyway, no modern web components are installed with Bower anymore, so that left me with a deprecated Service Worker in a path that I wanted to get rid of: "bower_components/platinum-push-messaging/service-worker.js"

Here was my problem:

  1. I wanted the push messaging Service Worker under my control at the top level of my application, "/push-service-worker.js".
  2. I had hundreds of users who were receiving push notifications via this system, and the upgrade had to be seamless (users couldn't be forced to take any action).

I ended up solving the problem by essentially performing a switcheroo:
  1. I had my application store the Web Push subscription info in HTML local storage.  This would be necessary later as part of the switcheroo.
  2. I removed "bower_components/platinum-push-messaging/".  Any existing clients would regularly attempt to update the service worker, but it would quietly fail, leaving the existing one running just fine.
  3. I removed all references to "platinum-push-messaging" from my code.  The existing Service Worker would continue to run (because that's what Service Workers do) and receive push messages (and show notifications).
  4. I made my own push-messaging web component with my own service worker living at "/push-service-worker.js".
  5. (This laid the groundwork for performing the switcheroo.)
  6. Upon loading, the part of my application that used to include "platinum-push-messaging" did a migration, if necessary, before loading the new push-messaging component:
    1. It went through all the Service Workers and looked for any legacy ones (these had "$$platinum-push-messaging$$" in the scope).  If it found any, it killed them (see the sketch after this list).

      Note that the "$$platinum-push-messaging$$" in the scope was a cute trick by the web component: a page can only be controlled by one Service Worker, and the scope dictates what that Service Worker can control.  By injecting a bogus "$$platinum-push-messaging$$" at the end of the scope, it ensured that the push-messaging Service Worker couldn't accidentally control any pages and get in the way of a main Service Worker.
    2. Upon finding any legacy Service Workers, it would:
      1. Issue a delete to the web server for the old (legacy) subscription (which was stored in HTML local storage).
      2. Tell the application to auto-enable push notifications.
      3. Resume the normal workflow for the application.
  7. The normal workflow for the application entailed loading the new push-messaging web component once the user was logged in.  If a Service Worker was previously enabled, then it would remain active and enabled.  Otherwise, the application wouldn't try to annoy users by asking them for push notifications.
  8. After the new push-messaging web component was included, it would then check to see if it should be auto-enabled (it would only be auto-enabled as part of a successful migration).
    1. If it was auto-enabled, then it would enable push messaging (the user would have already given permission by virtue of having a legacy push Service Worker running).  When the new push subscription was ready, it would post that information to the web server, and the user would have push messages working again, now using the new Service Worker.  The switcheroo was complete.
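
The detection-and-kill part of the migration looked roughly like this.  It's simplified, and "deleteLegacySubscriptionFromServer", "autoEnableNewPushWorker", and the local storage key are hypothetical names standing in for my actual helpers:

// Find and unregister any legacy platinum-push-messaging workers.
function killLegacyPushWorkers() {
  return navigator.serviceWorker.getRegistrations().then((registrations) => {
    const legacy = registrations.filter((registration) =>
      registration.scope.includes('$$platinum-push-messaging$$')
    );
    return Promise.all(legacy.map((registration) => registration.unregister()))
      .then(() => legacy.length > 0);
  });
}

killLegacyPushWorkers().then((migrated) => {
  if (migrated) {
    // Delete the old subscription (saved earlier in local storage) from the
    // web server, then tell the app to auto-enable the new push worker.
    deleteLegacySubscriptionFromServer(localStorage.getItem('legacyPushSubscription'));
    autoEnableNewPushWorker();
  }
});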

That's a bit wordy for a simple switcheroo, but it was very important for me to ensure that my users never lost their push notifications as part of the upgrade.  The simple version is: detect legacy Service Worker, kill legacy Service Worker, delete legacy subscription from web server, enable new Service Worker, and save new subscription to web server.

For any given client, the switcheroo happens only once.  The moment that the legacy Service Worker has been killed, it'll never run again (so there's a chance that if the user killed the page in the milliseconds after the kill but before the save, then they'd lose their push notifications, but I viewed this as extremely unlikely; I could technically have stored a status variable, but it wasn't worth it).  After that, it operates normally.

This means that there are two ways for a user to be upgraded:
  1. They open up the application after it has been upgraded.  The application prompts them to reload to upgrade if it detects a new version, but eventually the browser will do this on its own, typically after the device reboots or the browser has been fully closed.
  2. They click on a push notification, which opens up the application (which is #1, above).

So at this point, it's a waiting game.  I have to maintain support for the switcheroo until all existing push subscriptions have been upgraded.  The new ones have a flag set in the database, so I just need to wait until all subscriptions have the flag.  Active users who are receiving push notifications will eventually click on one, so I made a note to revisit this and remove the switcheroo code once all of the legacy subscriptions have been removed.

I'm not certain what causes a new subscription to be generated (different endpoint, etc.), but I suspect that it has to do with the scope of the Service Worker (otherwise, how would it know, since service worker code can change frequently?).  I played it safe and just assumed that the switcheroo would generate an entirely new subscription, so I deleted the legacy one no matter what and saved the new one no matter what.