tag:blogger.com,1999:blog-14984392488602520272024-03-13T23:43:19.011-04:00Sense CodonsDoughttp://www.blogger.com/profile/08152661329266416713noreply@blogger.comBlogger43125tag:blogger.com,1999:blog-1498439248860252027.post-63918857246473326682023-04-05T11:09:00.005-04:002023-04-05T11:11:16.557-04:00Dealing with KDE "plasmashell" freezing<p>I've been using KDE for over a decade now, and something that started happening in the past year or two (at least on Kubuntu 22.04) would be that my whole screen would mostly freeze. Generally, I'd be able to alt-tab between windows, interact with them, etc., but I couldn't click on or interact with anything related to the window manager (the title bars, the task bar, etc.).</p><p>In my case, I'd immediately notice when I came back to my desk and there was obviously a notification at some point, but the rendering got all screwed up:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjn-pCU4dJLl2atBSaMnzUEYczE8-01-o8bc_9tkIrGe9gnMCEEgY_5gsHCMizIKhr9nAaYOgwYt4YvN6iDRRq9HQdVjN32UikS9EjCECsLrdlTZ0k1J4sjJ-qzle-3D6Ay_RYLkTD4hop6fUxuEJDnzl0YqsSNeNYSEehN0NzOWzgQZVMMprsBQo3s/s722/kde-plasma-froen.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="451" data-original-width="722" height="250" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjn-pCU4dJLl2atBSaMnzUEYczE8-01-o8bc_9tkIrGe9gnMCEEgY_5gsHCMizIKhr9nAaYOgwYt4YvN6iDRRq9HQdVjN32UikS9EjCECsLrdlTZ0k1J4sjJ-qzle-3D6Ay_RYLkTD4hop6fUxuEJDnzl0YqsSNeNYSEehN0NzOWzgQZVMMprsBQo3s/w400-h250/kde-plasma-froen.png" width="400" /></a></div><div><br /></div>In this image, you can see that the notification toast window has no visible content and instead looks like the KDE background image. Also, the time is locked at 5:29 PM, which is when this problem happened (I didn't get back to my desk until 8:30 AM the next morning).<div><br /></div><div>The general fix for this is to use a shell (if you have one open, great; if not, press ctrl+alt+F2 to jump to the console) and kill "plasmashell":</div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><div style="text-align: left;"><span style="font-family: courier; font-size: x-small;">killall plasmashell</span></div></blockquote><div><br /></div><div>Once that's done, your window manager should be less broken, but it won't have the taskbar, etc. 
From there, you can press alt+F2 to open the "run" window, and type in:</div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><div style="text-align: left;"><span style="font-family: courier; font-size: x-small;">plasmashell --replace</span></div></blockquote><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghCUJV8zkRVbZGDnoRlo46PEJjbcyDmbmg2KWacesizq69dALmx8MRibk1dC5oQg4A4UboDc-9j7D9Z45csH9VcWS40XdC8fq06r5viC5LVqXkANc_WWWO7pNL7DxCNjUtbriOuCb_LWdiw0J9ATQiVLc0HerHUHV-KLF4c5porLbAe4RoqCgml08O/s828/plasmashell-replace.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="166" data-original-width="828" height="80" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghCUJV8zkRVbZGDnoRlo46PEJjbcyDmbmg2KWacesizq69dALmx8MRibk1dC5oQg4A4UboDc-9j7D9Z45csH9VcWS40XdC8fq06r5viC5LVqXkANc_WWWO7pNL7DxCNjUtbriOuCb_LWdiw0J9ATQiVLc0HerHUHV-KLF4c5porLbAe4RoqCgml08O/w400-h80/plasmashell-replace.png" width="400" /></a></div><br /><div class="separator" style="clear: both; text-align: left;">You can also run this from a terminal somewhere, but you need to make sure that your "DISPLAY" environment variable is set up correctly, etc. I find it easier to do it it from the run window (and I don't have to worry about redirecting its output anywhere, since "plasmashell" does generate some logging noise).</div><div><p><br /></p></div>Douglas Danger Manleyhttp://www.blogger.com/profile/17044194571403366472noreply@blogger.com0tag:blogger.com,1999:blog-1498439248860252027.post-1882297706865191862023-02-10T16:01:00.004-05:002023-02-10T16:01:53.861-05:00Using a dynamic PVC on Kubernetes agents in Jenkins<p style="text-align: left;">I recently had to create a Jenkins job that needed to use a lot of disk space. The short version of the story is that the job needed to dump the contents of a Postgres database and upload that to Artifactory, and the "jfrog" command line tool won't let you stream an upload, so the entire dump had to be present on disk in order for it to work.</p><p style="text-align: left;">I run my Jenkins on Kubernetes, and the Kubernetes hosts absolutely didn't have the disk space needed to dump this database, and it was definitely too big to use a memory-based filesystem.</p><p style="text-align: left;">The solution was to use a <i>dynamic Persistent Volume Claim</i>, which is maybe(?) implemented as an <a href="https://kubernetes.io/docs/concepts/storage/ephemeral-volumes/">ephemeral volume</a> in Kubernetes, but the exact details of what it does under the hood aren't important. What is important is that, as part of the job running, a new Persistent Volume Claim (PVC) gets created and is available for all of the containers in the pod. When the job finishes, the PVC gets destroyed. Perfect.</p><p style="text-align: left;">I couldn't figure out how to create a dynamic PVC as an ordinary volume that would get mounted on all of my containers (it's a thing, but apparently <a href="https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/pipeline/KubernetesDeclarativeAgent.java">not for a declarative pipeline</a>), but I was able to get the "workspace" dynamic PVC working.</p><p style="text-align: left;">A "workspace" volume is shared across all of the containers in the pod and have the Jenkins workspace mounted. 
This has all of the Git contents, including the Jenkinsfile, for the job (I'm assuming that you're using Git-based jobs here). Since all of the containers share the same workspace volume, any work done in one container is instantly visible in all of the others, without the need for Jenkins stashes or anything.</p><p style="text-align: left;">The biggest problem that I ran into was the permissions on the "workspace" file system. Each of my containers had a different idea of what the UID of the user running the container would be, and <i>all </i>of the containers have to agree on the permissions around their "workspace" volume.</p><p style="text-align: left;">I ended up cheating and just forcing all of my containers to run as root (UID 0), since (1) everyone could agree on that, and (2) I didn't have to worry about "sudo" not being installed on some of the containers that needed to install packages as part of their setup.</p><h2 style="text-align: left;">Using "workspace" volumes</h2><p style="text-align: left;">To use a "workspace" volume, set <span style="font-family: courier; font-size: x-small;">workspaceVolume</span> inside the <span style="font-family: courier; font-size: x-small;">kubernetes</span> block:</p><p><span style="font-family: courier; font-size: x-small;">kubernetes {<br /> workspaceVolume dynamicPVC(accessModes: 'ReadWriteOnce', requestsSize: "300Gi")<br /> yaml '''<br />---<br />apiVersion: v1<br />kind: Pod<br />spec:<br /> securityContext:<br /> fsGroup: 0<br /> runAsGroup: 0<br /> runAsUser: 0<br /> containers:<br />[...]</span></p><p>In this example, we allocate a 300GiB volume for the duration of the job run.</p><div>In addition, you can see that I set the user and group information to 0 (for "root"), which let me work around all the annoying UID mismatches across the containers. If you only have one container, then obviously you don't have to do this. Also, if you have full control of your containers, then you can probably set them up with a known user with a fixed UID who can sudo, etc., as necessary.</div><p style="text-align: left;">For more information about using Kubernetes agents in Jenkins, see <a href="https://plugins.jenkins.io/kubernetes/">the official docs</a>, but (at least as of the time of this writing) they're missing a whole lot of information about volume-related things.</p><h2 style="text-align: left;">Troubleshooting</h2><p style="text-align: left;">If you see Jenkins trying to create and then delete pods over and over and over again, you have something else wrong. In my case, the Kubernetes service account that Jenkins uses didn't have any permissions around "persistentvolumeclaims" objects, so every time that the Pod was created, it would fail and try again.</p><p style="text-align: left;">I was only able to see the errors in the Jenkins logs in Kubernetes; they looked something like this:</p><p><span style="font-family: courier; font-size: x-small;">Caused: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.100.0.1:443/api/v1/namespaces/cicd/persistentvolumeclaims. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. 
persistentvolumeclaims is forbidden: User "system:serviceaccount:cicd:default" cannot create resource "persistentvolumeclaims" in API group "" in the namespace "cicd".</span></p><p style="text-align: left;">I didn't have the patience to figure out exactly what was needed, so I just gave it everything:</p><div><div><span style="font-family: courier; font-size: x-small;">- verbs:</span></div><div><span style="font-family: courier; font-size: x-small;"> - create</span></div><div><span style="font-family: courier; font-size: x-small;"> - delete</span></div><div><span style="font-family: courier; font-size: x-small;"> - get</span></div><div><span style="font-family: courier; font-size: x-small;"> - list</span></div><div><span style="font-family: courier; font-size: x-small;"> - patch</span></div><div><span style="font-family: courier; font-size: x-small;"> - update</span></div><div><span style="font-family: courier; font-size: x-small;"> - watch</span></div><div><span style="font-family: courier; font-size: x-small;"> apiGroups:</span></div><div><span style="font-family: courier; font-size: x-small;"> - ''</span></div><div><span style="font-family: courier; font-size: x-small;"> resources:</span></div><div><span style="font-family: courier; font-size: x-small;"> - persistentvolumeclaims</span></div></div><p style="text-align: left;"><br /></p>Douglas Danger Manleyhttp://www.blogger.com/profile/17044194571403366472noreply@blogger.com0tag:blogger.com,1999:blog-1498439248860252027.post-37564821580289761222023-01-31T09:54:00.001-05:002023-01-31T09:54:50.492-05:00Use a custom login page when using Apache to require sign-in<p style="text-align: left;">Apache has its own built-in authentication system(s) for providing access control to a site that it's hosting. You've probably encountered this before using "basic" authentication backed by a flatfile created and edited using the <span style="font-family: courier; font-size: x-small;">htpasswd</span> command.</p><p style="text-align: left;">If you do this using the common guides on the Internet (for example, <a href="https://httpd.apache.org/docs/2.4/howto/auth.html">this guide from Apache itself</a>), then when you go to your site, you'll be presented with your browser's built-in basic-authentication dialog box asking for a username and password. If you provide valid credentials, then you'll be moved on through to the main site, and if you don't, then it'll dump you to a plain "401 Unauthorized" page.</p><p style="text-align: left;">This works fine, but it has three main drawbacks:</p><p style="text-align: left;"></p><ol style="text-align: left;"><li>Password managers (such as LastPass) can't detect this dialog box and autofill it, which is very annoying.</li><li>On some mobile browsers, the dialog gets in the way of normal operations. Even if you have multiple tabs open, whatever tab is trying to get you to log in will get in the way and force you to deal with it.</li><li>If you're using Windows authentication, the browser might detect the 401 error and attempt to sign you in using your domain credentials. 
If the server has a different set of credentials, then it'll mean that you can't actually log in due to Windows trying to auto log in.</li></ol><p style="text-align: left;">(And the built-in popup is really ugly, and it submits the password in plaintext, etc., etc.)</p><h2 style="text-align: left;">Apache "Form" Authentication</h2><p style="text-align: left;">To solve this problem, Apache has a type of <a href="https://httpd.apache.org/docs/2.4/mod/mod_auth_form.html">authentication called "form"</a> that adds an extra step involving an HTML form (that's fully customizable).</p><p style="text-align: left;">The workflow is as follows:</p><p style="text-align: left;"></p><ol style="text-align: left;"><li>Create a login HTML page (you'll have to provide the page).</li><li>Register a handler for that page to POST to (Apache already has the handler).</li><li>Update any "Directory" or "Location" blocks in your Apache config to use the "form" authentication type instead of "basic".</li></ol><div>You'll also need these modules installed and enabled:</div><div><ol style="text-align: left;"><li><span style="font-family: courier; font-size: x-small;">mod_auth_form</span></li><li><span style="font-family: courier; font-size: x-small;">mod_request</span></li><li><span style="font-family: courier; font-size: x-small;">mod_session</span></li><li><span style="font-family: courier; font-size: x-small;">mod_session_cookie</span></li></ol><p style="text-align: left;">On Ubuntu, I believe that these were all installed out of the box but needed to be enabled separately. On Red Hat, I had to install the <span style="font-family: courier; font-size: x-small;">mod_session</span> package, but everything was otherwise already enabled.</p></div><h2 style="text-align: left;">Example</h2><p style="text-align: left;">If you want to try out "form" authentication, I recommend that you get everything working with "basic" authentication first. This is especially true if you have multiple directories that need to be configured separately.</p><p style="text-align: left;">For this example, I'm going to use our Nagios server.</p><p style="text-align: left;">There were two directories that needed to be protected: "/usr/local/nagios/sbin" and "/usr/local/nagios/share". 
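Before the conversion, each of those directories had an ordinary "basic" block that looked roughly like this (a sketch based on the standard Nagios setup, not necessarily your exact config):</p><p><span style="font-family: courier; font-size: x-small;"><Directory "/usr/local/nagios/sbin"><br />  AuthType Basic<br />  AuthName "Nagios Access"<br />  AuthBasicProvider file<br />  AuthUserFile /path/to/your/htpasswd.users<br />  Require valid-user<br /></Directory></span></p><p style="text-align: left;">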
This setup is generally described by <a href="https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/cgisecurity.html">this document</a> (although it covers "digest" authentication instead of "basic").</p><p style="text-align: left;">For both directories that already had "AuthType" set up, the changes are simple:</p><p style="text-align: left;"></p><ol style="text-align: left;"><li>Change <span style="font-family: courier; font-size: x-small;">AuthType Basic</span> to <span style="font-family: courier; font-size: x-small;">AuthType Form</span>.</li><li>Change <span style="font-family: courier; font-size: x-small;">AuthBasicProvider</span> to <span style="font-family: courier; font-size: x-small;">AuthFormProvider</span>.</li><li>Add the login redirect: <span style="font-family: courier; font-size: x-small;">AuthFormLoginRequiredLocation "/login.html"</span></li><li>Enable sessions: <span style="font-family: courier; font-size: x-small;">Session On</span></li><li>Set a cookie name: <span style="font-family: courier; font-size: x-small;">SessionCookieName session path=/</span></li></ol><p style="text-align: left;"><span style="font-family: inherit;">I decided to put my login page at "/login.html" because that makes sense, but you could put it anywhere (and even host it on a different server if you specify a full URL instead of just a path).</span></p><p style="text-align: left;"><span style="font-family: inherit;">That page should contain a "form" with two "input" elements: "httpd_username" and "httpd_password". The form "action" should be set to "/do-login.html" (or whatever handler you want to register with Apache).</span></p><p style="text-align: left;"><span style="font-family: inherit;">At its simplest, "login.html" looks like this:</span></p><p><span style="font-family: courier; font-size: x-small;"><form method="POST" action="<b>/do-login.html</b>"><br /> Username: <input type="text" name="<b>httpd_username</b>" value="" /><br /> Password: <input type="password" name="<b>httpd_password</b>" value="" /><br /> <input type="submit" name="login" value="Login" /><br /></form></span></p><p></p><p></p><p></p><p style="text-align: left;">You'll probably want an "html" tag, a title and body and such, maybe some CSS, but this'll get the job done.</p><p style="text-align: left;">The last step is to register the thing that'll actually process the form data: "/do-login.html"</p><p style="text-align: left;">In your Apache config, add a "location" for it:</p><p><span style="font-family: courier; font-size: x-small;"><Location "/do-login.html"><br /> SetHandler form-login-handler<br /><br /> AuthType form<br /> AuthName "Nagios Access"<br /> AuthFormProvider file<br /> AuthUserFile /path/to/your/htpasswd.users<br /><br /> AuthFormLoginRequiredLocation "/login.html"<br /> AuthFormLoginSuccessLocation "/nagios/"<br /><br /> Session On<br /> SessionCookieName session path=/<br /></Location></span></p><p style="text-align: left;">The key thing here is <span style="font-family: courier; font-size: small;">SetHandler form-login-handler</span>. This tells Apache to use its built-in form handler to take the values from <span style="font-family: courier; font-size: small;">httpd_username</span> and <span style="font-family: courier; font-size: small;">httpd_password</span> and compare them against your authentication provider(s) (in this example, it's just a flatfile, but you could use LDAP, etc.).</p><p style="text-align: left;">The other two options handle the last bit of navigation. 
<span style="font-family: courier; font-size: small;">AuthFormLoginRequiredLocation</span> sends you back to the login page if the username/password combination didn't work (you could potentially have <i>another </i>page here with an error message pre-written). <span style="font-family: courier; font-size: small;">AuthFormLoginSuccessLocation</span> sends you to the place where you want the user to go after login (I'm sending the user to the main Nagios page, but you could send them anywhere).</p><p style="text-align: left;"></p><h2 style="text-align: left;">Notes</h2><h3 style="text-align: left;">Other Authentication Providers</h3><p style="text-align: left;">I've just covered the "file" authentication provider here. If you use "ldap" and/or any others, then that config will need to be copied to every single place where you have "form" authentication set up, just like you would if you were only using the "file" provider.</p><p style="text-align: left;">I found this to be really annoying, since I had two directories to protect plus the form handler, so that brings over another 4 lines or so to each config section, but what matters is that it works.</p><p style="text-align: left;"><br /></p><p></p>Douglas Danger Manleyhttp://www.blogger.com/profile/17044194571403366472noreply@blogger.com4tag:blogger.com,1999:blog-1498439248860252027.post-12745120939091833862022-10-19T15:37:00.001-04:002022-10-19T15:37:11.433-04:00Watch out for SNI when using an nginx reverse proxy<p style="text-align: left;">From time to time, I'll have a use case where some box needs to talk to some website that it can't reach (through networking issues), and the easiest thing to do is to throw an nginx reverse proxy on a network that it <i>can </i>reach (such that the reverse proxy can reach <i>both</i>).</p><p style="text-align: left;">The whole shtick<i> </i>of a reverse proxy is that you can access the reverse proxy <i>directly </i>and it'll forward the request on to the appropriate destination and more or less masquerade itself as if it were the destination. This is in contrast with a normal HTTP proxy that would be configured <i>separately </i>(if supported by whatever tool you're trying to use). 
Sometimes a normal HTTP proxy is the best tool for the job, but sometimes you can cheat with a tweak to <span style="font-family: courier; font-size: x-small;">/etc/hosts</span> and a reverse proxy and nobody needs to know what happened.</p><p style="text-align: left;">Here, we're focused on the reverse proxy.</p><p style="text-align: left;">In this case, we have the following scenario:</p><p style="text-align: left;"></p><ol style="text-align: left;"><li>Box 1 wants to connect to site1.example.com.</li><li>Box 1 cannot reach site1.example.com.</li></ol><div>To cheat using a reverse proxy, we need Box 2, which:</div><div><ol style="text-align: left;"><li>Can be reached by Box 1.</li><li>Can reach site1.example.com.</li></ol><div>To set up the whole reverse proxy thing, we need to:</div><div><ol style="text-align: left;"><li>Set up nginx on Box 2 to listen on port 443 (HTTPS) and reverse proxy to site1.example.com.</li><li>Update <span style="font-family: courier; font-size: small;">/etc/hosts</span> on Box 1 so that site1.example.com points to Box 2's IP address.</li></ol></div></div><p></p><p style="text-align: left;">At first, I was seeing this error message on the reverse proxy's "nginx/error.log":</p><blockquote style="border: none; margin: 0 0 0 40px; padding: 0px;"><p style="text-align: left;"><span style="color: #666666; font-family: courier; font-size: x-small;">connect() to XXXXXX:443 failed (13: Permission denied) while connecting to upstream, client: XXXXXX, server: site1.example.com, request: "GET / HTTP/1.1"</span></p></blockquote><p style="text-align: left;">"Permission denied" isn't great, and that told me that it was something OS-related.</p><p style="text-align: left;">Of course, it was an SELinux thing (in <span style="font-family: courier; font-size: x-small;">/var/log/messages</span>):</p><blockquote style="border: none; margin: 0 0 0 40px; padding: 0px;"><p style="text-align: left;"><span style="color: #666666; font-family: courier; font-size: x-small;">SELinux is preventing /usr/sbin/nginx from name_connect access on the tcp_socket port 443.</span></p></blockquote><p>The workaround was:</p><blockquote style="border: none; margin: 0 0 0 40px; padding: 0px;"><p style="text-align: left;"><span style="color: #666666; font-family: courier; font-size: x-small;">setsebool -P nis_enabled 1</span></p></blockquote><p>This also was suggested by the logs, but it didn't seem to matter:</p><blockquote style="border: none; margin: 0 0 0 40px; padding: 0px;"><p style="text-align: left;"><span style="color: #666666; font-family: courier; font-size: x-small;">setsebool -P httpd_can_network_connect 1</span></p></blockquote><p style="text-align: left;">After fixing that, I was seeing:</p><blockquote style="border: none; margin: 0 0 0 40px; padding: 0px;"><p style="text-align: left;"><span style="color: #666666; font-family: courier; font-size: x-small;">SSL_do_handshake() failed (SSL: error:14094410:SSL routines:ssl3_read_bytes:sslv3 alert handshake failure:SSL alert number 40) while SSL handshaking to upstream, client: XXXXXX, server: site1.example.com, request: "GET / HTTP/1.1"</span></p></blockquote><p style="text-align: left;">After tcpdump-ing the traffic from Box 1 and also another box that could directly talk to site1.example.com, it was clear Box 1 was not using SNI in its requests (SNI is a TLS extension that passes the host name in plaintext so that proxies and load balancers can properly route name-based requests).</p><p style="text-align: left;">It took way too long for me to 
find the nginx setting to enable it (I don't know why it's disabled by default), but it's:</p><blockquote style="border: none; margin: 0 0 0 40px; padding: 0px;"><p style="text-align: left;"><span style="color: #666666; font-family: courier; font-size: x-small;">proxy_ssl_server_name on;</span></p></blockquote><p style="text-align: left;">Anyway, the final nginx config for the reverse proxy on Box 2 was:</p><p><span style="color: #666666; font-family: courier; font-size: x-small;">server {<br /> listen 443 ssl;<br /> server_name site1.example.com;<br /><br /> ssl_certificate /etc/nginx/ssl/server.crt;<br /> ssl_certificate_key /etc/nginx/server.key;<br /> ssl_protocols TLSv1.2;<br /> <br /> location / {<br /> proxy_pass https://site1.example.com;<br /> proxy_ssl_session_reuse on;<br /> proxy_ssl_server_name on;<br /> }<br />}</span></p><p>As far as Box 1 was concerned, it could connect to site1.example.com with only a small tweak to <span style="font-family: courier; font-size: small;">/etc/hosts</span>.</p>Douglas Danger Manleyhttp://www.blogger.com/profile/17044194571403366472noreply@blogger.com0tag:blogger.com,1999:blog-1498439248860252027.post-85061002285830706192022-10-05T22:39:00.000-04:002022-10-05T22:39:01.768-04:00Speed up PNG encoding in Go with NRGBA images<p style="text-align: left;">After <a href="https://blog.sensecodons.com/2022/10/100-cost-savings-switching-from-app.html">migrating my application from Google App Engine to Google Cloud Run</a>, I suddenly had a use case for optimizing CPU utilization.</p><p style="text-align: left;">In my analysis of my most CPU-intensive workloads, it turned out that the majority of the time was spent encoding PNG files.</p><p style="text-align: left;"><b>tl;dr Use <span style="font-family: courier; font-size: x-small;">image.NRGBA</span> when you intend to encode a PNG file.</b></p><p style="text-align: left;">(For reference, this particular application has a Google Maps overlay that synthesizes data from other sources into tiles to be rendered on the map. The main synchronization job runs nightly and attempts to build or download new tiles for the various layers based on data from various ArcGIS systems.)</p><p style="text-align: left;">Looking at my code, I couldn't really reduce the <i>number </i>of calls to <span style="font-family: courier; font-size: x-small;">png.Encode</span>, but that encoder really looked inefficient. I deleted the callgrind files (sorry), but basically, half of the CPU time in <span style="font-family: courier; font-size: x-small;">png.Encode</span> was around memory operations and some <span style="font-family: courier; font-size: x-small;">runtime</span> calls.</p><p style="text-align: left;">I started looking around for maybe some options to pass to the encoder or a more purpose-built implementation. I ended up finding <a href="https://pkg.go.dev/github.com/fumin/png">a package that mentioned a speedup</a>, but only for NRGBA images. 
However, that package looked fairly unused, and I wasn't about to turn all of my image processing over to something with 1 commit and no users.</p><p style="text-align: left;">This got me thinking, though: what is NRGBA?</p><p style="text-align: left;">It turns out that there are (at least) two ways of thinking about the whole alpha channel thing in images:</p><p style="text-align: left;"></p><ol style="text-align: left;"><li>In RGBA, each of the red, green, and blue channels has already been <i>premultiplied </i>by the alpha channel, such that the value of, for example, R can range from 0 to A, but no higher.</li><li>In NRGBA, each of the red, green, and blue channels has its original value, and the alpha channel merely represents the opacity of the pixel in general.</li></ol><p style="text-align: left;">For my human mind, using various tools and software over the years, when I think of "RGBA", I think of "one channel each for red, green, and blue, and one channel for the opacity of the pixel". So what this means is that I'm thinking of "NRGBA" (for non-premultiplied RGBA).</p><p style="text-align: left;">(Apparently there are good use cases for both, and when compositing, at some point you'll have to multiply by the alpha value, so "RGBA" already has that done for you.)</p><p style="text-align: left;">Okay, whatever, so what does this have to do with CPU optimization?</p><p style="text-align: left;">In Go, the <span style="font-family: courier; font-size: x-small;">png.Encode</span> function is <i>optimized for NRGBA images</i>. There's a tiny little hint about this in <a href="https://pkg.go.dev/image/png#Encode">the comment for the function</a>:</p><blockquote><p><span style="color: #666666;">Any <span style="font-family: courier; font-size: x-small;">Image</span> may be encoded, but images that are not <span style="font-family: courier; font-size: x-small;">image.NRGBA</span> might be encoded lossily.</span></p></blockquote><p></p><p>This is corroborated by <a href="https://www.w3.org/TR/PNG-Rationale.html">the PNG rationale document</a>, which explains that</p><blockquote><p><span style="color: #666666;">PNG uses "unassociated" or "non-premultiplied" alpha so that images with separate transparency masks can be stored losslessly.</span></p></blockquote><p>If you want to have the best PNG encoding experience, then you should encode images that use NRGBA already. In fact, <a href="https://cs.opensource.google/go/go/+/refs/tags/go1.19.2:src/image/png/writer.go;l=454">if you open up the code</a>, you'll see that it will <i>convert the image to NRGBA </i>if it's not already in that format.</p><p>Coming back to my callgrind analysis, <i>this </i>is where all that CPU time was spent: converting an RGBA image to an NRGBA image. I certainly thought that it was strange how much work was being done creating a simple PNG file from a mostly-transparent map tile.</p><p>Why did I even have RGBA images? Well, my tiling API has to composite tiles from other systems into a single PNG file, so I simply created that new image with <span style="font-family: courier; font-size: x-small;">image.NewRGBA</span>. And why that function? Because as I mentioned before, I figured "RGBA" meant "RGB with an alpha channel", which is what I wanted so that it would support transparency. 
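For anyone with a similar tiling/compositing setup, a minimal sketch of the fix described below (with the actual compositing omitted) looks like this:</p><p><span style="font-family: courier; font-size: x-small;">package main<br /><br />import (<br />  "image"<br />  "image/png"<br />  "os"<br />)<br /><br />func main() {<br />  // Allocate the tile as NRGBA so that png.Encode can write it out directly<br />  // instead of first converting a premultiplied RGBA image pixel by pixel.<br />  tile := image.NewNRGBA(image.Rect(0, 0, 256, 256)) // was image.NewRGBA(...)<br /><br />  // ... composite the source tiles into "tile" here ...<br /><br />  f, err := os.Create("tile.png")<br />  if err != nil {<br />    panic(err)<br />  }<br />  defer f.Close()<br />  if err := png.Encode(f, tile); err != nil {<br />    panic(err)<br />  }<br />}</span></p><p>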
It <i>never </i>occurred to me that "RGBA" was some weird encoding scheme for pixels in contrast to another encoding scheme called "NRGBA"; my use cases had never had me make such a distinction.</p><p>Anyway, after switching a few <span style="font-family: courier; font-size: x-small;">image.NewRGBA</span> calls to <span style="font-family: courier; font-size: x-small;">image.NewNRGBA</span> (and literally that was it; no other code changed), my code was way more efficient, cutting down on CPU utilization by something like 50-70%. Those RGBA to NRGBA conversions really hurt.</p>Douglas Danger Manleyhttp://www.blogger.com/profile/17044194571403366472noreply@blogger.com1tag:blogger.com,1999:blog-1498439248860252027.post-54994696202417892932022-10-05T20:50:00.002-04:002022-10-19T11:28:51.604-04:00100% cost savings switching from App Engine to Cloud Run... wild<p>I have written numerous times about how much I like Google's App Engine platform and about how I've tried to port as many of my applications to it as possible. The <i>idea </i>of App Engine is still glorious, but I have now been converted to Cloud Run.</p><p>The primary selling point? It has literally reduced my application's bill to $0. That's a 100% reduction in cost from App Engine to Cloud Run. It's absolutely wild, and it's now my new favorite thing.</p><p>Pour one out for App Engine because Cloud Run is here to stay.</p><h2 style="text-align: left;">Aside: a (brief) history of App Engine</h2><p style="text-align: left;">Before I explain <i>how </i>Cloud Run manages to make my application free to run, I need to first run you through a brief history of App Engine, why it was great, why it became less great, and how Cloud Run could eat its lunch like that.</p><h3 style="text-align: left;">App Engine v1</h3><p style="text-align: left;">Google App Engine entered the market in 2011 as a one-stop solution to host infinitely scalable applications. Prior to App Engine, if you wanted to host a web application, you had to either commit to a weird framework and go with that framework's cloud solution, or you had to spin up a bunch of virtual machines (or god help you Lambda functions) to handle your web traffic, your database, your file systems, maybe tie into a CDN, and also set up Let's Encrypt or something similar to handle your TLS certificates.</p><p style="text-align: left;">From the perspective of an application owner or application developer, that's a lot of work that doesn't directly deal with <i>application </i>stuff and ends up just wasting tons of everyone's time.</p><p style="text-align: left;">App Engine was Google's answer to that, something that we would now call "serverless".</p><p style="text-align: left;">The premise was simple: upload a zip file with your application, some YAML files with some metadata (concerning scaling, performance, etc.), and App Engine would take care of all the annoying IT stuff for you. In fact, they tried to <i>hide </i>as much of it from you as possible—they wanted you to focus on the application logic, not the nitty gritty backend details.</p><p style="text-align: left;">(This allowed Google to do all kinds of tricks in the background to make things better and faster, since the implementation details of their infrastructure were generally out of bounds for your application.)</p><p style="text-align: left;">App Engine's killer feature was that it bundled a whole bunch of services together, things that you'd normally have to set up yourself and pay extra for. 
This included, but was not limited to:</p><p style="text-align: left;"></p><ol style="text-align: left;"><li>The web hosting basics:</li><ol><li>File hosting</li><li>A CDN service</li><li>An execution environment in your language of choice (Java, Python, PHP, etc.)</li></ol><li>A <i>cheap </i>No-SQL database to store all your data</li><li>A memcached memory store</li><li>A scheduler that'll hit any of your web endpoints whenever you want</li><li>A task queue for background tasks</li><li>A service that would accept <i>e-mail </i>and transform it into an HTTP POST request</li></ol><p style="text-align: left;">All of this came with a generous free tier (truly free) where you could try stuff out and generally host small applications with no cost at all. If your application were to run up against the free-tier limits, it would just halt those services until the next day when all of the counters reset and your daily limits restarted.</p><p style="text-align: left;">(Did I mention <i>e-mail</i>? I had so many use cases that dealt with accepting e-mail from other systems and taking some action. This was a godsend.)</p><p style="text-align: left;">The free tier gave you 28 hours of instance time (that is, 28 hours of a single instance of the lowest class). For trying stuff out, that was plenty, especially when your application only had a few users and went unused for hours at a time.</p><p style="text-align: left;">(Yes, if you upgraded your instance type <i>one notch</i> that became 14 hours free, but still.)</p><p style="text-align: left;">To use all of the bonus features, you generally had to link against (more or less) the App Engine SDK, which took care of all of the details behind the scene.</p><p style="text-align: left;">This was everything that you needed to build a scalable web application, and it did its job well.</p><h3 style="text-align: left;">App Engine v2</h3><p style="text-align: left;">Years went by, and Google made some big changes to App Engine. To me, these changes went against the great promise of App Engine, but they make sense when looking at Cloud Run as the App Engine successor.</p><p style="text-align: left;">App Engine v2 added support for newer versions of the languages that App Engine v1 had supported (which was good), but it also took away a lot of the one-stop-shop power (which was bad). The stated reason was to remove a lot of the Google Cloud-specific stuff and make the platform a bit more open and compatible with other clouds.</p><p style="text-align: left;">(While that's generally good, the magic of App Engine was that everything was built in, for free.)</p><p style="text-align: left;">Now, an App Engine application could no longer have free memcached data; instead, you could pay for a Memory Store option. For large applications, it probably didn't matter, but for small ones that cost basically nothing to run, this made memcached untenable.</p><p style="text-align: left;">Similarly, the e-mail service was discontinued, and you were encouraged to move to Twilio SendGrid, which could do something similar. Google was nice enough to give all SendGrid customers a free tier that just about made up for what was lost. The big problem was that all of the built-in tools were gone. 
Previously, the App Engine local development console had a place where you could type in an e-mail and send it to your application; now you had to write your own tools.</p><p style="text-align: left;">The scheduler and task queue systems were promoted out of App Engine and into Google Cloud, but the App Engine tooling continued to support them, so they required only minimal changes on the application side.</p><p style="text-align: left;">The cheap No-SQL database, Datastore, was also promoted out of App Engine, and there were no changes to make whatsoever. Yay.</p><p style="text-align: left;">The App Engine SDK was removed, and an App Engine application was simply another application in the cloud, with annoying dependencies and third-party integrations.</p><p style="text-align: left;">(Yes, technically Google added back support for the App Engine SDK years later, but meh.)</p><p style="text-align: left;">Overall, for people who used App Engine exclusively for purpose-built App Engine things, v2 was a major step down from v1. Yes, it moved App Engine toward a more standards-based approach, but the problem was that there weren't standards around half the stuff that App Engine was doing, and in the push to make it more open and in-line with other clouds, it lost a lot of its power (because other clouds weren't doing what App Engine was doing, which is why I was using Google's App Engine).</p><h2 style="text-align: left;">So, why Cloud Run?</h2><p style="text-align: left;">Cloud Run picks up where App Engine v2 left off. With all of the App Engine-specific things removed, the next logical jump was to give Google a container image instead of a zip file and let it do its thing. App Engine had already experimented with this with the "Flexible" environment, but its pricing was not favorable compared to the "Standard" offering for the kinds of projects that I had.</p><p style="text-align: left;">From a build perspective, Cloud Run gives you a bit more flexibility than App Engine. You can give Google a Dockerfile and have it take care of all the compiling and such, or you can do most of that locally and give Google a much simpler Dockerfile with your pre-compiled application files. I don't want to get too deep into it, but you have options.</p><p style="text-align: left;">Cloud Run obviously runs your application as a container. Did App Engine? Maybe? Maybe not? I figured that it would have, but that's for Google to decide (the magic of App Engine).</p><p style="text-align: left;">But here's the thing: App Engine was <i>billed </i>as an instance, not a container.</p><h3 style="text-align: left;">Billing: App Engine vs. Cloud Run</h3><p style="text-align: left;">I'm going to focus on the big thing, which is instance versus container billing. Everything else is comparable (data traffic, etc.).</p><p style="text-align: left;">App Engine billed you on <i>instance</i> time (wall time), normalized to units of the smallest instance class that they support. To run your application at all, there was a 15-minute minimum cost. If your application wasn't doing anything for 15 minutes, then App Engine would shut it down and you wouldn't be billed for that time. 
But for all other time, anytime that there was at least one instance running, you were billed for that instance (more or less as if you had spun up a VM yourself for the time that it was up).</p><p style="text-align: left;">For a lightweight web application, this kind of billing isn't ideal because most of the work that your application does is a little bit of processing centered around database calls. Most of the time the application is either not doing anything at all (no requests are coming in) or it's waiting on a response from a database. The majority of the instance CPU goes unused.</p><p style="text-align: left;">Cloud Run bills on <i>CPU </i>time, that is, the time that the CPU is <i>actually </i>used. So, for that same application, if it's sitting around doing nothing or waiting on a database response, then it's not being billed. And there's no 15-minute minimum or anything. For example, if your application does some request parsing, permissions checks, etc., and then sits around waiting for 1 second for a database query to return, then you'll be billed for like 0.0001 seconds of CPU time, which is great (because your application did very little actual CPU work in that second it took for the request to complete).</p><p style="text-align: left;">Cloud Run gives you basically 48 hours of "vCPU-seconds" (that is, a single CPU churning at 100%) per month. So for an application that sits around doing nothing most of the time, the odds are that you'll never have to pay for any CPU utilization. For a daily average, this comes out to about 1.5 hours of CPU time free per day. Yes, you also pay for the amount of memory that your app uses while it's handling requests, but my app uses like 70MB of memory <i>because it's a web application</i>.</p><p style="text-align: left;">(For me, my app has a nightly job that eats up a bunch of CPU, and then it does just about nothing all day.)</p><p style="text-align: left;">Overall, for me, this is what moved me to Cloud Run. I'm billed for the resources that I'm <i>actually </i>using when I'm actually using them, and the specifics around instances and such are no longer my concern.</p><p style="text-align: left;">This also means that I can use a more microservice-based approach, since I'm billed on what my application <i>does</i>, not what resources Google spins up in the background to support it. With App Engine, running separate "services" would have been too costly (each with its own instances, etc.). With Cloud Run, it's perfect, and I can experiment without needing to worry about massive cost changes.</p><h2 style="text-align: left;">What's the catch?</h2><p style="text-align: left;">App Engine's deployment tools are top-notch, even in App Engine v2. When you deploy your application (once you figure out all the right flags), it'll get your application up and running, deploy your cron jobs, and ensure that your Datastore indexes are pushed.</p><p style="text-align: left;">With Cloud Run, you don't get any of the extra stuff.</p><p style="text-align: left;">I've put together a list of things that you'll have to do to migrate your application from App Engine to Cloud Run, sorted from easiest to hardest.</p><h3 style="text-align: left;">Easy: Deploy your application</h3><p style="text-align: left;">This is essentially <span style="font-family: courier; font-size: x-small;">gcloud run deploy</span> with some options, but it's extremely simple and straightforward. 
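For reference, a minimal deploy looks something like this (the service name, image, and region here are placeholders; your flags will differ):</p><p><span style="font-family: courier; font-size: x-small;">gcloud run deploy my-service \<br />  --image gcr.io/my-project/my-service:latest \<br />  --region us-east1 \<br />  --allow-unauthenticated \<br />  --env-vars-file env.yaml</span></p><p style="text-align: left;">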
Migrating from App Engine will take like 2 seconds.</p><h3>Easy: Get rid of <span style="font-family: courier; font-size: small;">app.yaml</span></h3><p style="text-align: left;">You won't need <span style="font-family: courier; font-size: x-small;">app.yaml</span> anymore, so remove it. If you had environment variables in there, move them to their own file and update your <span style="font-family: courier; font-size: x-small;">gcloud run deploy</span> command to include <span style="font-family: courier; font-size: x-small;">--env-vars-file</span>.</p><p style="text-align: left;">In Cloud Run, everything is HTTPS, so you don't have to worry about insecure HTTP endpoints.</p><p style="text-align: left;">There are no "warmup" endpoints or anything, so just make sure that your container does what it needs to when it starts because as soon as it can receive an HTTP request, it will.</p><p style="text-align: left;">If you had configured a custom instance class, you may need to tweak your Cloud Run service with parameters that better match your old settings. I had only used an F2 class because the F1 class had a limitation on the number of concurrent requests that it could receive, and Cloud Run has no such limitation. You can also configure the maximum number of concurrent requests much, <i>much </i>higher.</p><p style="text-align: left;">Similarly, all of the scaling parameters are... different, and you can deal with those in Cloud Run as necessary. Let your app run for a while and see how Cloud Run behaves; it's definitely a different game from App Engine.</p><h3 style="text-align: left;">Easy: Deploy your task queues</h3><p style="text-align: left;">Starting with App Engine v2, you already had to deploy your own task queues (App Engine v1 used to take care of that for you), so you don't have to change a single thing with <i>deployment.</i> Note that you <i>will </i>have to change how those task queues are used; see below.</p><h3 style="text-align: left;">Easy: Deploy your Datastore indexes</h3><p style="text-align: left;">This is essentially <span style="font-family: courier; font-size: x-small;">gcloud datastore indexes create index.yaml</span> with some options, and you should already have <span style="font-family: courier; font-size: x-small;">index.yaml</span> sitting around, so just add this to your deployment code.</p><h3 style="text-align: left;">Easy: Rip out all "/_ah" endpoints</h3><p style="text-align: left;">App Engine liked to have you put special App Engine things under "/_ah", so I stashed all of my cron job and task queue endpoints in there. Ordinarily, this would be fine, except that Cloud Run quietly treats "/_ah" as special and will quietly fail all requests to any endpoints under it with a permission-denied error. I wasted way too much time trying to figure out what was going on before I realized that it was Cloud Run doing something sneaky and undocumented in the background.</p><p style="text-align: left;">Move any endpoints under "/_ah" to literally any other path and update any references to them.</p><h3 style="text-align: left;">Easy: Create a Dockerfile</h3><p style="text-align: left;">Create a Dockerfile that's as simple or as complex as you want. 
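For a statically compiled Go binary that's built locally, a minimal sketch looks something like this (file names are placeholders, and I'm using "alpine" for the reasons covered near the end of this post):</p><p><span style="font-family: courier; font-size: x-small;">FROM alpine<br /><br /># CA certificates and timezone data; see the "alpine" vs. "scratch" notes below.<br />RUN apk add --no-cache ca-certificates tzdata && echo UTC > /etc/timezone<br /><br />COPY myapp /myapp<br />COPY static/ /static/<br /><br />ENTRYPOINT ["/myapp"]</span></p><p style="text-align: left;">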
I wanted a minimal set of changes from my current workflow, so I had my Dockerfile just include all of the binaries that get compiled locally as well as any static files.</p><p style="text-align: left;">If you want, you can set up a multi-stage Dockerfile that has separate, short-lived containers for building and compiling that ultimately ends up with a single, small container for your application. I'll eventually get there, but actually <i>deploying </i>into Cloud Run took precedence for me over neat CI/CD stuff.</p><h3 style="text-align: left;">Easy: Detect Cloud Run in production</h3><p style="text-align: left;">I have some simple code that detects if it's in Google Cloud or not so that it does smart production-related things if it is. In particular, logging in production is JSON-based so that StackDriver can pick it up properly.</p><p style="text-align: left;">The old "GAE_*" and "GOOGLE_*" environment variables are gone; you'll only get "K_SERVICE" and friends to tell you a <i>tiny </i>amount of information about your Cloud Run environment. You'll basically get back the service name and that's it.</p><p style="text-align: left;">If you want your project ID or region, you'll have to pull those from the "computeMetadata/v1" endpoints. It's super easy; it's just something that you'll have to do.</p><p style="text-align: left;">For more information on using the "computeMetadata/v1" endpoints, see <a href="https://cloud.google.com/run/docs/securing/service-identity">this guide</a>.</p><p style="text-align: left;">For more information about the environment variables available in Cloud Run, see <a href="https://cloud.google.com/run/docs/container-contract">this document</a>.</p><h3 style="text-align: left;">Easy: "Unlink" your Datastore</h3><p style="text-align: left;">Way back in the day, Datastore was merely a part of App Engine. At some point, Google spun it out into its own thing, but for older projects, it's still tied to App Engine. Make sure that you go to your Datastore's admin screen in Google Cloud Console and <i>unlink </i>it from your App Engine project.</p><p style="text-align: left;">This has no practical effects other than:</p><p style="text-align: left;"></p><ol style="text-align: left;"><li>You can no longer put your Datastore in read-only mode.</li><li>You can disable your App Engine project without also disabling your Datastore.</li></ol><p style="text-align: left;">It may take 10-20 minutes for the operation to complete, so just plan on doing this early so it doesn't get in your way.</p><p style="text-align: left;">To learn more about unlinking Datastore from App Engine, see <a href="https://cloud.google.com/datastore/docs/app-engine-requirement">this article</a>.</p><p></p><h3 style="text-align: left;">Medium: Change your task queue code</h3><p style="text-align: left;">If you were using App Engine for task queues, then you were using special App Engine tasks. Since you're not using App Engine anymore, you have to use the more generic HTTP tasks.</p><p style="text-align: left;">In order to do this, you'll need to know the URL for your Cloud Run service. You can <i>technically </i>cheat and use the information in the HTTP request to generally get a URL that (should) work, but I opted to use the "${region}-run.googleapis.com/apis/serving.knative.dev/v1/namespaces/${project-id}/services/${service-name}" endpoint, which returns the Cloud Run URL, among other things. 
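In shell terms, that lookup boils down to something like this (the region, project, and service names are placeholders; in practice you'd do it from code rather than curl):</p><p><span style="font-family: courier; font-size: x-small;"># 1. Get an access token from the metadata server.<br />curl -s -H "Metadata-Flavor: Google" \<br />  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token"<br /><br /># 2. Use the "access_token" value from that JSON to ask the Cloud Run API about the service.<br />curl -s -H "Authorization: Bearer ${TOKEN}" \<br />  "https://us-east1-run.googleapis.com/apis/serving.knative.dev/v1/namespaces/my-project/services/my-service"</span></p><p style="text-align: left;">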
I grabbed that value when my container started and used it anytime I had to create a task.</p><p style="text-align: left;">(And yes, you'll need to get an access token from another "computeMetadata/v1" endpoint in order to access that API.)</p><p style="text-align: left;">For more information on that endpoint, see <a href="https://cloud.google.com/run/docs/reference/rest/v1/namespaces.services/get">this reference</a>.</p><p style="text-align: left;">Once you have the Cloud Run URL, the main difference between an App Engine task and an HTTP task is that App Engine tasks only have the path while the HTTP task has the full URL to hit. I had to change up my internals a bit so that the application knew its base URL, but it wasn't much work.</p><p style="text-align: left;">To learn more about HTTP tasks, see <a href="https://cloud.google.com/tasks/docs/creating-http-target-tasks">this document</a>.</p><div><h3>Medium: Build a custom tool to deploy cron jobs</h3></div><p style="text-align: left;">App Engine used to take care of all the cron job stuff for you, but Cloud Run does not. At the time of this writing, Cloud Run has a beta command that'll create/update cron jobs from a YAML file, but it's not GA yet.</p><p style="text-align: left;">I built a simple tool that calls <span style="font-family: courier; font-size: x-small;">gcloud jobs list</span> and parses the JSON output and compares that to a custom YAML file that I have that describes my cron jobs. Because cron jobs do not refer to an App Engine project, they can reference any HTTP endpoint or a particular Cloud Run service. My YAML file has an option for the Cloud Run service name, and my tool looks up the Cloud Run URL for that service and appends the path onto it.</p><p style="text-align: left;">It's not a lot of work, but it's work that I had to do in order to make my workflows make sense again.</p><p style="text-align: left;">Also note that the human-readable cron syntax ("every 24 hours") is gone and you'll need to use the standard "minute hour day-of-month month day-of-week" syntax that you're used to in most other cron-related things. You can also specify which time zone each cron job should use for interpreting that syntax.</p><p style="text-align: left;">To learn more about the cron syntax, see <a href="https://cloud.google.com/scheduler/docs/configuring/cron-job-schedules">this document</a>.</p><h3 style="text-align: left;">Medium: Update your custom domains</h3><p style="text-align: left;">At minimum, you will now have a completely different URL for your application. If you're using custom domains, you can use Cloud Run's custom domains much like you would use App Engine's custom domains.</p><p style="text-align: left;">However, as far as Google is concerned, this is basically <i>removing </i>a custom domain from App Engine and <i>creating </i>a custom domain in Cloud Run. This means that it'll take anywhere from 20-40 minutes for HTTPS traffic to your domain to work after you make the switch.</p><p style="text-align: left;">Try to schedule this for when you have minimal traffic. As far as I can tell, there's nothing that you can do to speed it up. 
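The mapping itself is just one command (still beta at the time of this writing), something like the following with placeholder names; it's the certificate provisioning afterward that takes all the time:</p><p><span style="font-family: courier; font-size: x-small;">gcloud beta run domain-mappings create \<br />  --service my-service \<br />  --domain www.example.com \<br />  --region us-east1</span></p><p style="text-align: left;">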
You may wish to consider using a load balancer instead (and giving the load balancer the TLS certificate responsibilities), but I didn't want to pay extra for a service that's basically free and only hurts me when I need to make domain changes.</p><h3 style="text-align: left;">Medium: Enable compression</h3><p style="text-align: left;">Because App Engine ran behind Google's CDN, compression was handled by default. Cloud Run does no such thing, so if you want compression, you'll have to do it yourself. (Yes, you could pay for a load balancer, but that doesn't reduce your costs by 100%.) Most HTTP frameworks have an option for it, and if you want to roll your own, it's fairly straightforward.</p><p style="text-align: left;">I ended up going with <a href="https://github.com/NYTimes/gziphandler">https://github.com/NYTimes/gziphandler</a>. You can configure a minimum response size before compressing, and if you don't want to compress <i>everything </i>(for example, image files are already compressed internally these days), you have to provide it with a list of MIME types that <i>should </i>be compressed.</p><h3 style="text-align: left;">Hard: Enable caching</h3><p style="text-align: left;">Because App Engine ran behind Google's CDN, caching was handled by default. If you set a <span style="font-family: courier; font-size: x-small;">Cache-Control</span> header, the CDN would take care of doing the job of an HTTP cache. If you did it right, you could reduce the traffic to your application significantly. Cloud Run does no such thing, and your two practical options are (1) pay for a load balancer (which does not reduce your costs by 100%) or (2) roll your own in some way.</p><p style="text-align: left;">There's nothing stopping you from dropping in a Cloud Run service that is basically an HTTP cache that sits in front of your <i>actual </i>service, but I didn't go that route. I decided to take advantage of the fact that my application knew more about caching than an external cache ever could.</p><p style="text-align: left;">For example, my application knows that its static files can be cached <i>forever </i>because the moment that I replace the service with a new version, the cache will die with it and a new cache will be created. For an external cache, you can't tell it, "hey, cache this until the backing service randomly goes away", so you have to set practical cache durations such as "10 minutes" (so that when you do make application changes, you can be reasonably assured that they'll get to your users fairly quickly).</p><p style="text-align: left;">I ended up writing an HTTP handler that parsed the <span style="font-family: courier; font-size: small;">Cache-Control</span> header, cached things that needed to be cached, supported some extensions for the things that my application could know (such as "cache this file forever"), and served those results, so my main application code could still generally operate exactly as if there were a real HTTP cache in front of it.</p><h3 style="text-align: left;">Extra: Consider using "alpine" instead of "scratch" for Docker</h3><p style="text-align: left;">I use statically compiled binaries (written in Go) for all my stuff, and I wasted a lot of time using "scratch" as the basis for my new Cloud Run Dockerfile.</p><p style="text-align: left;">The tl;dr here is that the Google libraries really, <i>really </i>want to validate the certificates of the Google APIs, and "scratch" does not install any CA certificates or anything. 
I was getting nasty errors about things not working, and it took me a long time to realize that Google's packages were upset about not being able to validate the certificates in their HTTPS calls.</p><p style="text-align: left;">If you're using "alpine", just add <span style="font-family: courier; font-size: x-small;">apk add --no-cache ca-certificates</span> to your build.</p><p style="text-align: left;">Also, if you do any kind of time zone work (all of my application's users and such have a time zone, for example), you'll also want to include the timezone data (otherwise, your date/time calls will fail).</p><p style="text-align: left;">If you're using "alpine", just add <span style="font-family: courier; font-size: x-small;">apk add --no-cache tzdata</span> to your build. Also don't forget to set the system timezone to UTC via <span style="font-family: courier; font-size: small;">echo UTC > /etc/timezone</span>.</p><h3 style="text-align: left;">Extra: You may need your project ID and region for tooling</h3><p style="text-align: left;">Once running, your application can trivially find out what its project ID and region are from the "computeMetadata/v1" APIs, but I found that I needed to know these things in advance while running my deployment tools. Your case may differ from mine, but I needed these in advance.</p><p style="text-align: left;">I have multiple projects (for example, a production one and a staging one), and my tooling needed to know the project ID and region for whatever project it was building/deploying. I added this to a config file that I could trivially replace with a copy from a directory of such config files. I just wanted to mention that you might run into something similar.</p><h3 style="text-align: left;">Extra: Don't forget to disable App Engine</h3><p style="text-align: left;">Once everything looks good, <i>disable </i>App Engine in your project. Make sure that you unlinked<i> </i>your Datastore <i>first</i>, otherwise you'll also disable that when you disable App Engine.</p><h3 style="text-align: left;">Extra: Wait a month before celebrating</h3><p style="text-align: left;">While App Engine's quotas reset every day, Cloud Run's reset every <i>month</i>. This means that you won't necessarily be able to prove out your cost savings until you have spent an entire month running in Cloud Run.</p><p style="text-align: left;">After a few days, you should be able to eyeball the data and check the billing reports to get a sense of what your final costs will be, but remember: you get 48 hours of free CPU time before they start charging you, and if your application is CPU heavy, it might take a few days or weeks to burn through those free hours. Even after going through the free tier, the cost of subsequent usage is quite reasonable. However, the point is that you can't draw a trend line on costs based on the first few days of running. Give it a full month to be sure.</p><h3 style="text-align: left;">Extra: Consider optimizing your code for CPU consumption</h3><p style="text-align: left;">In an instance world like App Engine, as long as your application isn't using up <i>all </i>of the CPU and forcing App Engine to spawn more instances or otherwise have requests queue up, CPU performance isn't all that important. 
Ultimately, it'll cost you the same regardless of whether you can optimize your code to shave off 20% of its CPU consumption.</p><p style="text-align: left;">With Cloud Run, you're charged, in essence, by CPU cycle, so you can drastically reduce your costs by optimizing your code. And if you can keep your stuff efficient enough to stay within the free tier, then your application is basically free. And that's wild.</p><p style="text-align: left;">If you're new to CPU optimization or haven't done it in a while, consider enabling something like "pprof" on your application and then capturing a 30 second workload. You can view the results with a tool like <a href="https://kcachegrind.github.io/html/Home.html">KCachegrind</a> to get a feel for where your application is spending most of its CPU time and see what you can do about cleaning that up. Maybe it's optimizing some for loops; maybe it's caching; maybe it's using a slightly more efficient data structure for the work that you're doing. Whatever it is, find it and reduce it. (And start with the biggest things first.)</p><h2 style="text-align: left;">Conclusion</h2><p style="text-align: left;">Welcome to Cloud Run! I hope your cost savings are as good as mine.</p><div><i>Edit (2022-10-19): added caveats about HTTP compression and caching.</i></div><p></p>Douglas Danger Manleyhttp://www.blogger.com/profile/17044194571403366472noreply@blogger.com0tag:blogger.com,1999:blog-1498439248860252027.post-23943942095688971222022-07-10T14:36:00.001-04:002022-07-10T14:36:06.428-04:00Publishing a Docker image for a GitHub repoIt's 2022, and if you're making a GitHub project, chances are that you'll need to publish a Docker image at some point. These days, it's really easy with GitHub CI/CD and their "actions", which generally take care of all of the hard work.<div><br /></div><div>I'm assuming that you already have a working GitHub CI/CD workflow for building whatever it is that you're building, and I'm only going to focus on the Docker-specific changes that need to be made.</div><div><br /></div><div>You'll want to set up the following workflows:</div><div><ol style="text-align: left;"><li>Upon "pull_request" for the "master" branch, build the Docker image (to make sure that the process works), but don't actually publish it.</li><li>Upon "push" for the "master" branch, build the Docker image <i>and </i>publish it as "latest".</li><li>Upon "release" for the "master" branch, build the Docker image <i>and </i>publish it with the release's tag.</li></ol><div>Before you get started, you'll need to create a Dockerhub "access token" to use for your account. You can do this under "Account Settings" → "Security" → "Access Tokens".</div><h2 style="text-align: left;">Workflow updates</h2></div><h4 style="text-align: left;">Upon "pull_request"</h4><div>This workflow happens every time a commit is made to a pull request. 
While we want to ensure that our Docker image is built so that we know that the process works, we don't actually want to publish that image (at least I don't; you might have a use for it).</div><div><br /></div><div>To build the Docker image, just add the following to your "steps" section:</div><div><div><span style="font-family: courier; font-size: x-small;">- name: Set up Docker Buildx</span></div><div><span style="font-family: courier; font-size: x-small;"> uses: docker/setup-buildx-action@v2</span></div><div><span style="font-family: courier; font-size: x-small;">- name: Build</span></div><div><span style="font-family: courier; font-size: x-small;"> uses: docker/build-push-action@v3</span></div><div><span style="font-family: courier; font-size: x-small;"> with:</span></div><div><span style="font-family: courier; font-size: x-small;"> context: .</span></div><div><span style="font-family: courier; font-size: x-small;"> push: false</span></div><div><span style="font-family: courier; font-size: x-small;"> tags: YOUR_GROUP/YOUR_REPO:latest</span></div></div><div><br /></div><div>Simple enough.</div><div><br /></div><div>The "docker/setup-buildx-action" action does whatever magic needs to happen for Docker stuff to work in the pipeline, and the "docker/build-push-action" builds the image from your Dockerfile and pushes the image. Because we're setting "push: false", it won't actually push.</div><div><h4>Upon "merge" into "master"</h4><div>This workflow happens every time a PR is merged into the "master" branch. In this case, we want to do everything that we did for the "pull_request" case, but we also want to push the image.</div><div><br /></div><div>The changes here are that we'll set "push: true" and also specify our Dockerhub username and password.</div><div><br /></div><div>To build and push the Docker image, just add the following to your "steps" section:</div><div><div><span style="font-family: courier; font-size: x-small;">- name: Set up Docker Buildx</span></div><div><span style="font-family: courier; font-size: x-small;"> uses: docker/setup-buildx-action@v2<br /><div>- name: Login to DockerHub</div><div> uses: docker/login-action@v2</div><div> with:</div><div> username: ${{ secrets.DOCKERHUB_USERNAME }}</div><div> password: ${{ secrets.DOCKERHUB_TOKEN }}</div></span></div><div><span style="font-family: courier; font-size: x-small;">- name: Build</span></div><div><span style="font-family: courier; font-size: x-small;"> uses: docker/build-push-action@v3</span></div><div><span style="font-family: courier; font-size: x-small;"> with:</span></div><div><span style="font-family: courier; font-size: x-small;"> context: .</span></div><div><span style="font-family: courier; font-size: x-small;"> push: true</span></div><div><span style="font-family: courier; font-size: x-small;"> tags: YOUR_GROUP/YOUR_REPO:latest</span></div></div><div><br /></div><div>Boom.</div><div><br /></div><div>The new action "docker/login-action" logs into Dockerhub with your username and password, which is necessary to actually push the image.</div><div><h4>Upon "release"</h4><div>This workflow happens every time a release is created. 
This is generally similar to the "merge" case, except instead of using the "latest" tag, we'll be using the release's tag.</div><div><br /></div><div>To build and push the Docker image, just add the following to your "steps" section:</div><div><div><span style="font-family: courier; font-size: x-small;">- name: Set up Docker Buildx</span></div><div><span style="font-family: courier; font-size: x-small;"> uses: docker/setup-buildx-action@v2<br /><div>- name: Login to DockerHub</div><div> uses: docker/login-action@v2</div><div> with:</div><div> username: ${{ secrets.DOCKERHUB_USERNAME }}</div><div> password: ${{ secrets.DOCKERHUB_TOKEN }}</div></span></div><div><span style="font-family: courier; font-size: x-small;">- name: Build</span></div><div><span style="font-family: courier; font-size: x-small;"> uses: docker/build-push-action@v3</span></div><div><span style="font-family: courier; font-size: x-small;"> with:</span></div><div><span style="font-family: courier; font-size: x-small;"> context: .</span></div><div><span style="font-family: courier; font-size: x-small;"> push: true</span></div><div><span style="font-family: courier; font-size: x-small;"> tags: YOUR_GROUP/YOUR_REPO:${{ github.event.release.tag_name }}</span></div></div><div><br /></div><div>And that's it. The "github.event.release.tag_name" variable holds the name of the Git tag, which is what we'll use for the Docker image tag.</div></div>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-1498439248860252027.post-89022130672595716472022-05-23T12:50:00.005-04:002022-05-23T12:50:52.811-04:00sed will blow away your symlinks by default<p><span style="font-family: courier;">sed</span> typically outputs to stdout, but <span style="font-family: courier;">sed -i</span> allows you to edit a file “in place”. However, under the hood, it actually creates a new file and then replaces the original file with the new file. This means that <span style="font-family: courier;">sed</span> replaces symlinks with normal files. This is most likely <i>not</i> what you want.</p><p>However, there is a flag to pass to make it work the way that you’d expect:</p><p><span style="font-family: courier;">--follow-symlinks</span></p><div>So, if you're using <span style="font-family: courier;">sed -i</span>, then you probably also want to tack on <span style="font-family: courier;">--follow-symlinks</span>, too.</div>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-1498439248860252027.post-83939365125025840262022-04-16T23:16:00.003-04:002022-04-16T23:16:56.891-04:00Golang, http.Client, and "too many open files"<p>I've been having an issue with my application for a while now, and I finally figured out what the problem was. In this particular case, the application is a web app (so, think REST API written in Go), and one of its nightly routines is to synchronize a whole bunch of data with various third-party ArcGIS systems. The application keeps a cache of the ArcGIS images, and this job updates them so they're only ever a day old.
This allows it to show map overlays even if the underlying ArcGIS systems are inaccessible (they're random third-party systems that are frequently down for maintenance).</p><p>So, imagine 10 threads constantly making HTTP requests for new map tile images; once a large enough batch is done, the cache is updated, and then the process repeats until the entire cache has been refreshed.</p><p>In production, I never noticed a direct problem, but there were times when an ArcGIS system would just completely freak out and start lying about not supporting pagination anymore or otherwise spewing weird errors (but again, it's a third-party system, so what can you do?). In development, I would notice this particular endpoint failing after a while with a "dial" error of "too many open files". Every time that I looked, though, everything seemed fine, and I just forgot about it.</p><p>This last time, though, I watched the main application's open sockets ("ss -anp | grep my-application"), and I noticed that the number of connections just kept increasing. This reminded me of my old networking days, and it looked like the TCP connections were just accumulating until the OS felt like closing them due to inactivity.</p><p>That's when I found that Go's "http.Client" has a method called "CloseIdleConnections()" that immediately closes any idle connections without waiting for the OS to do it for you.</p><p>For reasons that are not relevant here, each request to a third-party ArcGIS system uses its own "http.Client", and because of that, there was no way to reuse any connections between requests, and the application just kept racking up open connections, eventually hitting the default limit of 1024 "open files". I simply added "defer httpClient.CloseIdleConnections()" after creating the "http.Client", and everything magically behaved as I expected: no more than 10 active connections at any time (one for each of the 10 threads running).</p><p>So, if your Go application is getting "too many open files" errors when making a lot of HTTP requests, be sure to either (1) re-architect your application to reuse your "http.Client" whenever possible, or (2) be sure to call "CloseIdleConnections()" on your "http.Client" as soon as you're done with it.</p><p>I suspect that some of the third-party ArcGIS issues that I was seeing in production might have essentially been DoS errors caused by my application assaulting these poor servers with thousands of connections.</p>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-1498439248860252027.post-82152414360925305572022-04-02T23:47:00.002-04:002022-04-02T23:55:03.031-04:00Service workers, push notifications, and IndexedDB<p>I have a pretty simple use case: a user wanted my app to provide a list of "recent" notifications that had been sent to her. Sometimes a lot of notifications will come through in a relatively short time period, and she wants to be able to look at the list of them to make sure that she's handled them all appropriately.</p><p>I ended up having the service worker write the notification to an IndexedDB and then having the UI reload the list of notifications when it receives a "message" event from the service worker.</p><p>Before we get there, I'll walk you through my process because it was painful.</p><h2 style="text-align: left;">Detour: All the mistakes that I made</h2><p style="text-align: left;">Since I was already using HTML local storage for other things, I figured that I would just record the list of recent notifications in there. 
Every time that the page would receive a "message" event, it would add the event data to a list of notifications in local storage. That <i>kind of worked </i>as long as I was debugging it. As long as I was looking at the page, <i>the page was open</i>, and it would receive the "message" event.</p><p style="text-align: left;">However, in the real world, my application is installed as a "home screen" app on Android and is usually closed. When a notification arrived, there was no open page to receive the "message" event, and it was lost.</p><p style="text-align: left;">I then tried to have the service worker write to HTML local storage instead. It wouldn't matter which side (page or service worker) actually wrote the data since both sides would detect a change immediately. Except that's not how it works. Service workers can't use HTML local storage because it's a synchronous API, and service workers aren't allowed to use synchronous storage.</p><p style="text-align: left;">Anyway, HTML local storage was impossible as a simple communication and storage mechanism.</p><p style="text-align: left;">Because the page was usually not open, MessageChannel and BroadcastChannel also wouldn't work.</p><p style="text-align: left;">I finally settled on using IndexedDB because a service worker is allowed to use it. The biggest annoyance (in the design) was that there is no way to have a page "listen" for changes to an IndexedDB, so I couldn't just trivially tell my page to update the list of notifications to display when there was a change to the database.</p><p style="text-align: left;">After implementing IndexedDB, I spent a week trying to figure out why it wasn't working half the time, and that leads us to how service workers actually work.</p><h2 style="text-align: left;">Detour: How service workers work</h2><p style="text-align: left;">Service workers are often <i>described</i> as a background process for your page. The way that you hear about them, they sound like daemons that are always running and process events when they receive them.</p><p style="text-align: left;">But that's not anywhere near correct in terms of how they are implemented. Service workers are more like "serverless" functions (such as Google Cloud Functions) in that they generally aren't running, but if a request comes in that they need to handle, then one is spun up to handle the request, and it'll be kept around for a few minutes in case any other requests come in for it to handle, and then it'll be shut down.</p><p style="text-align: left;">So my big mistake was thinking that once I initialized something in my service worker then it would be available more or less indefinitely. The browser knows what events a service worker has registered ("push", "message", etc.) and can spin up a new worker whenever it wants, typically to handle such an event and then shut it down again shortly thereafter.</p><p style="text-align: left;">Service workers have an "install" event that gets run when <i>new </i>service worker code gets downloaded. This is intended to be run <i>exactly once</i> for that version of the service worker.</p><p style="text-align: left;">There is also an "activate" event that gets run when an <i>actual</i> worker has been assigned to the task. You can basically view this as an event that gets run <i>once </i>when a service worker process starts running, regardless of how many times this particular code has been run previously.
If you need to initialize some global things for later functions to call, you should do it here.</p><p style="text-align: left;">The "push" event is run when a push message has been received. Whatever work you need to do should be done in the event's "waitUntil" method as a promise chain that ultimately results in showing a notification to the user.</p><h2 style="text-align: left;">Detour: How IndexedDB works</h2><p style="text-align: left;">IndexedDB was seemingly invented by people who had no concept of Promises in JavaScript. Its API is entirely insane and based on "onsuccess", "oncomplete", and "onerror" callbacks. (You can technically also use event listeners, but it's just as insane.) It's an asynchronous API that doesn't use any of the standard asynchronous syntax as anything else in modern JavaScript. It is what it is.</p><p style="text-align: left;">Here's what you need to know: everything in IndexedDB is callbacks. Everything. So, if you want to connect to a database, you'll need to make an IDBRequest and set the "onsuccess" callback. Once you have the database, you'll need to create a transaction and set the "oncomplete" callback. Then you can create another IDBRequest for reading or writing data from an object store (essentially a table) and setting the "onsuccess" callback. It's callback hell, but it is what it is. (Note that there are wrapper libraries that provide Promise-based syntax, but I hate having to wrap a standard feature for no good reason.)</p><p style="text-align: left;">(Also, there's an "onupgradeneeded" callback at the database level that you can use to do any schema- or data-related work if you're changing the database version.)</p><h2 style="text-align: left;">Putting it all together</h2><p style="text-align: left;">I decided that there was no reason to waste cycles opening the IndexedDB on "activate" since there's no guarantee that it'll actually be used. Instead, I had the "push" event use the previous database connection (if there was one) or create a new connection (if there wasn't).</p><p style="text-align: left;">I put together the following workflow for my service worker:</p><p style="text-align: left;"></p><ol style="text-align: left;"><li>Register the "push" event handler ("event.waitUntil(...)"):</li><ol><li>(Promise) Connect to the IndexedDB.</li><ol><li>If we already have a connection from a previous call, then return that.</li><li>Otherwise, connect to the IndexedDB and return that (and also store it for quick access the next time so we don't have to reconnect).</li></ol><li>(Promise) Read the list of notifications from the database.</li><li>(Promise) Add the new notification to the list and write it back to the database.</li><li>(Promise) Fire a "message" event to all active pages and show a notification if no page is currently visible to the user.</li></ol></ol><div>And for my page:</div><div><ol style="text-align: left;"><li>Load the list of notifications from the IndexedDB when the page loads. (This sets our starting point, and any changes will be communicated by a "message" event from the service worker.)</li><li>Register the "message" event handler:</li><ol><li>Reload the list of notifications from the IndexedDB. 
(Remember, there's no way to be notified on changes, so receiving the "message" event and reloading is the best that we can do.)</li><li>(Handle the message normally; for me, this shows a little toast with the message details and a link to click on to take the user to the appropriate screen.)</li></ol></ol></div><p style="text-align: left;">For me, the database work is a nice-to-have; the notification is the critical part of the workflow. So I made sure that every database-related error was handled and the Promises resolved no matter what. This way, even if there was a completely unexpected database issue, it would just get quietly skipped and the notification could be shown to the user.</p><p style="text-align: left;">In my code, I created some simple functions (to deal with the couple of IndexedDB interactions that I needed) that return Promises so I could operate normally. You could technically just do a single "new Promise(...)" to cover all of the IndexedDB work if you wanted, or you could use one of those fancy wrapper libraries. In any case, you <i>must </i>call "event.waitUntil" with a Promise chain that ultimately resolves after doing something with the notification. How you get there is up to you.</p><p style="text-align: left;">I was also using IndexedDB as asynchronous local storage, so I didn't need fancy keys or sorting or anything. I just put all of my data under a single key that I could "get" and "put" trivially without having to worry about row counts or any other kind of data management. There's a single object store with a single row in it.</p><p></p>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-1498439248860252027.post-8728193604348382412022-03-03T00:03:00.002-05:002022-03-03T00:03:34.100-05:00Dump your SSH settings for quick troubleshooting<p>I recently had a Jenkins job that would die, seemingly randomly. The only thing that really stood out was that it would tend to succeed if the runtime was 14 minutes or less, and it would tend to fail if the runtime was 17 minutes or more.</p><p>This job did a bunch of database stuff (through an SSH tunnel; more on that soon), so I first did a whole bunch of troubleshooting on the Postgres client and server configs, but nothing seemed relevant. It seemed to disconnect ("connection closed by server") on these long queries that would sit there for a long time (maybe around 15 minutes or so) and then come back with a result. After ruling out the Postgres server (all of the settings looked good, and new sessions had decent timeout configs), I moved on to SSH.</p><p>This job connects to a database by way of a forwarded port through an SSH tunnel (don't ask why; just understand that it's the least worst option available in this context). I figured that maybe the SSH tunnel was failing, since I start it in the background and have it run "sleep infinity" and then never look at it again. However, when I tested locally, my SSH session would run for multiple days without a problem.</p><p>Spoiler alert: the answer ended up being the client config, but how do you actually find that out?</p><p>SSH has two really cool options.</p><p>On the <i>server </i>side, you can run "sudo sshd -T | sort" to have the SSH daemon read the relevant configs and then print out all of the actual values that it's using.
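</p><p>If you only care about the keepalive-related settings (which is where this story is headed), you can filter that output down to just those lines:</p><p><span style="font-family: courier; font-size: x-small;">sudo sshd -T | grep -E "clientalive|tcpkeepalive"</span></p><p>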
The "sshd -T" output merges in all of the unspecified defaults as well as all of the various options in "/etc/ssh/sshd_config" and "/etc/ssh/sshd_config.d", etc.</p><p>On the <i>client </i>side, you can run "ssh -G ${user}@${host} | sort", and it'll do the same thing, but for all of the client-side configs for that particular user and host combination (because maybe you have some custom stuff set up in your SSH config, etc.).</p><p>Now, in my case, it ended up being a keepalive issue. So, on the server side, here's what the relevant settings were:</p><p><span style="font-family: courier; font-size: x-small;">clientalivecountmax 0<br />clientaliveinterval 900<br />tcpkeepalive yes</span></p><p style="text-align: left;">On the client (which would disconnect sometimes), here's what the relevant settings were:</p><p><span style="font-family: courier; font-size: x-small;">serveralivecountmax 3<br />serveraliveinterval 0<br />tcpkeepalive yes</span></p><p style="text-align: left;">Here, you can see that the client (which is whatever the default Jenkins Kubernetes agent ended up being) enabled TCP keepalives, but it set "serveraliveinterval" to "0", which means that it wouldn't send any SSH-level keepalive packets at all.</p><p style="text-align: left;">According to the docs, the server <i>should </i>have sent out keepalives every 15 minutes, but whatever it was doing, the connection would drop after 15 minutes. Setting "serveraliveinterval" to "60" ended up solving my problem and allowed my SSH sessions to stay active indefinitely until the script was done with them.</p><h2 style="text-align: left;">Little bonus section</h2><p style="text-align: left;">My SSH command to set up the tunnel in the background was:</p><p style="text-align: left;"><span style="font-family: courier; font-size: x-small;">ssh -4 -f -L${localport}:${targetaddress}:${targetport} ${user}@${bastionhost} 'sleep infinity';</span></p><p style="text-align: left;">"-4" forces it to use an IPv4 address (relevant in my context), and "-f" puts the SSH command into the background before "sleep infinity" gets called, right after all the port forwarding is set up. "sleep infinity" ensures that the connection never closes on its own; the "sleep" command will do nothing forever.</p><p style="text-align: left;">(Obviously, I had the "-o ServerAliveInterval=60" option in there, too.)</p><p style="text-align: left;">With this, I could trivially have my container create an SSH session that allowed for port-forwarding, and that session would be available for the entirety of the container's lifetime (the entirety of the Jenkins build).</p>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-1498439248860252027.post-73689638305725250852022-03-01T15:19:00.002-05:002022-03-01T15:19:13.832-05:00QNAP, NFS, and Filesystem ACLs<p>I recently spent hours banging my head against a wall trying to figure out why my Plex server couldn't find some new media that I put on its volume in my QNAP.</p><p>tl;dr QNAP "Advanced Folder Permissions" turns on file access control lists (you'll need the "getfacl" and "setfacl" tools installed on Linux to mess with them). For more information, see <a href="https://www.qnap.com/en/how-to/faq/article/how-to-configure-sub-folders-acl-for-nfs-clients">this guide from QNAP</a>.</p><p>I must have turned this setting on when I rebuilt my NAS a while back, and it never mattered until I did some file operations with the File Manager or maybe just straight "cp"; I forget which (or both).
Plex refused to see the new file, and I tried debugging the indexer and all that other Plex stuff before realizing that while it could <i>list </i>the file, it couldn't <i>open </i>the file, even though its normal "ls -l" permissions looked fine.</p><p>Apparently the file access control list denied it, but I didn't even have "getfacl" or "setfacl" installed on my machine (and I had never even heard of this before), so I had no idea what was going on. I eventually installed those tools and verified that while the standard Linux permission looked fine, the ACL permissions did not.</p><p>"sudo chmod -R +r /path/to/folder" didn't solve my problem, but tearing out the ACL did: "sudo setfacl -b -R /path/to/folder"</p><p>Later, I eventually figured out that it was QNAP's "Advanced Folder Permissions" and just disabled that so it wouldn't bother me again.</p>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-1498439248860252027.post-42816701758833012212022-01-09T19:59:00.002-05:002022-01-09T19:59:30.762-05:00Moving (or renaming) a push-notification ServiceWorker<p>Service Workers (web workers, etc.) are a relatively new concept. They can do all kinds of cool things (primarily related to network requests), but they are also the mechanism by which a web site can receive push messages (via Web Push) and show them as OS notifications.</p><p>The general rule of Service Workers is to pick a file name (such as "/service-worker.js") and never, ever change it. That's cool, but sometimes you do need to change it.</p><p>In particular, I started my push messaging journey with <a href="https://github.com/googlearchive/platinum-push-messaging">"platinum-push-messaging"</a>, a now-defunct web component built by Google as part of the initial Polymer project. The promise was cool: just slap this HTML element on your page with a few parameters and boom: you have working push notifications.</p><p>When it came out, the push messaging spec was young, and no browsers fully supported its encrypted data payloads, so "platinum-push-messaging" did a lot of work to work around that limitation. As browsers improved to support the VAPID spec, "platinum-push-messaging" (along with all of the other "platinum" elements) were quietly deprecated and archived (around 2017).</p><p>This left me with a problem: a rotting push notification system that couldn't keep up with the spec and the latest browsers. I hacked the code to all hell to support VAPID and keep the element functioning, but I was just punting.</p><p>Apple ruined the declarative promise of the Polymer project by refusing to implement HTML imports, so the web components community adopted the NPM distribution model (and introduced a whole bunch of imperative Javascript drama and compilation tools). 
Anyway, no modern web components are installed with Bower anymore, so that left me with a deprecated Service Worker in a path that I wanted to get rid of: "bower_components/platinum-push-messaging/service-worker.js"</p><p>Here was my problem:</p><p></p><ol style="text-align: left;"><li>I wanted the push messaging Service Worker under my control at the top level of my application, "/push-service-worker.js".</li><li>I had hundreds of users who were receiving push notifications via this system, and the upgrade had to be seamless (users couldn't be forced to take any action).</li></ol><div>I ended up solving the problem by essentially performing a switcheroo:</div><div><ol style="text-align: left;"><li>I had my application store the Web Push subscription info in HTML local storage. This would be necessary later as part of the switcheroo.</li><li>I removed "bower_components/platinum-push-messaging/". Any existing clients would regularly attempt to update the service worker, but it would quietly fail, leaving the existing one running just fine.</li><li>I removed all references to "platinum-push-messaging" from my code. The existing Service Worker would continue to run (because that's what Service Workers do) and receive push messages (and show notifications).</li><li>I made my own push-messaging web component with my own service worker living at "/push-service-worker.js".</li><li>(This laid the framework for performing the switcheroo.)</li><li>Upon loading, the part of my application that used to include "platinum-push-messaging" did a migration, if necessary, <i>before</i> loading the new push-messaging component:</li><ol><li>It went through all the Service Workers and looked for any legacy ones (these had "$$platinum-push-messaging$$" in the scope). If it found any, it killed them.<br /><br />Note that the "$$platinum-push-messaging$$" in the scope was a cute trick by the web component: a page can only be controlled by one Service Worker, and the scope dictates what that Service Worker can control. By injecting a bogus "$$platinum-push-messaging$$" at the end of the scope, it ensured that the push-messaging Service Worker couldn't accidentally control any pages and get in the way of a main Service Worker.</li><li>Upon finding any legacy Service Workers, it would:</li><ol><li>Issue a delete to the web server for the old (legacy) subscription (which was stored in HTML local storage).</li><li>Tell the application to auto-enable push notifications.</li><li>Resume the normal workflow for the application.</li></ol></ol><li>The normal workflow for the application entailed loading the new push-messaging web component once the user was logged in. If a Service Worker was previously enabled, then it would remain active and enabled. Otherwise, the application wouldn't try to annoy users by asking them for push notifications.</li><li>After the new push-messaging web component was included, it would then check to see if it should be auto-enabled (it would only be auto-enabled as part of a successful migration).</li><ol><li>If it was auto-enabled, then it would enable push messaging (the user would have already given permission by virtue of having a legacy push Service Worker running). When the new push subscription was ready, it would post that information to the web server, and the user would have push messages working again, now using the new Service Worker. 
The switcheroo was complete.</li></ol></ol><div>That's a bit wordy for a simple switcheroo, but it was very important for me to ensure that my users never lost their push notifications as part of the upgrade. The simple version is: detect legacy Service Worker, kill legacy Service Worker, delete legacy subscription from web server, enable new Service Worker, and save new subscription to web server.</div></div><p></p><div>For any given client, the switcheroo happens only once. The moment that the legacy Service Worker has been killed, it'll never run again (so there's a chance that if the user killed the page in the milliseconds after the kill but before the save, then they'd lose their push notifications, but I viewed this as extremely unlikely; I could technically have stored a status variable, but it wasn't worth it). After that, it operates normally.</div><p></p><div>This means that there are two ways for a user to be upgraded:</div><div><ol style="text-align: left;"><li>They open up the application after it has been upgraded. The application prompts them to reload to upgrade if it detects a new version, but <i>eventually </i>the browser will do this on its own, typically after the device reboots or the browser has been fully closed.</li><li>They click on a push notification, which opens up the application (which is #1, above).</li></ol><div>So at this point, it's a waiting game. I have to maintain support for the switcheroo until all existing push subscriptions have been upgraded. The new ones have a flag set in the database, so I just need to wait until all subscriptions have the flag. Active users who are receiving push notifications will <i>eventually </i>click on one, so I made a note to revisit this and remove the switcheroo code once all of the legacy subscriptions have been removed.</div><p></p></div><div>I'm not certain what causes a new subscription to be generated (different endpoint, etc.), but I suspect that it has to do with the scope of the Service Worker (otherwise, how would it know, since service worker code can change frequently?). I played it safe and just assumed that the switcheroo would generate an entirely new subscription, so I deleted the legacy one no matter what and saved the new one no matter what.</div>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-1498439248860252027.post-27581455139235399182021-10-30T17:36:00.005-04:002021-10-30T17:37:24.278-04:00Troubleshooting a weird Nagios NRPE SSL/TLS error<p>We recently gained limited access to a customer data center in order to monitor some machines that our software is running on. For historical reasons, we use Nagios as our monitoring tool (yes, I know that it's 2021) and we use NRPE to monitor our Linux boxes (yes, I know that NRPE is deprecated in favor of NCPA).</p><p>We had to provide the customer with a list of source IP addresses and target ports (for example, 5666 for NRPE) as part of the process to get the VPN set up. <i>Foreshadowing: this will become relevant soon.</i></p><p>After getting NRPE installed on all of our machines, we noticed that Nagios was failing to connect to any of them.
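</p><p>NRPE logs to syslog, so depending on the distro, something like this will pull up its recent messages (the service name varies; on Debian/Ubuntu it's "nagios-nrpe-server", while Red Hat-style boxes usually call it just "nrpe"):</p><p><span style="font-family: courier; font-size: x-small;">journalctl -u nagios-nrpe-server --since "1 hour ago"</span></p><p>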
The NRPE logs all had the following errors:</p><p><span style="font-size: x-small;"><span style="font-family: courier;">Starting up daemon<br /></span><span style="font-family: courier;">Server listening on 0.0.0.0 port 5666.<br /></span><span style="font-family: courier;">Server listening on :: port 5666.<br /></span><span style="font-family: courier;">Warning: Daemon is configured to accept command arguments from clients!<br /></span><span style="font-family: courier;">Listening for connections on port 5666<br /></span><span style="font-family: courier;">Allowing connections from: 127.0.0.1,::1,[redacted]<br /></span><span style="font-family: courier;">Error: Network server getpeername() failure (107: Transport endpoint is not connected)<br /></span><span style="font-family: courier;">warning: can't get client address: Connection reset by peer<br /></span><span style="font-family: courier;">Error: (!log_opts) Could not complete SSL handshake with [redacted]: 5<br /></span><span style="font-family: courier;">warning: can't get client address: Connection reset by peer<br /></span><span style="font-family: courier;">Error: Network server getpeername() failure (107: Transport endpoint is not connected)<br /></span><span style="font-family: courier;">Error: Network server getpeername() failure (107: Transport endpoint is not connected)<br /></span><span style="font-family: courier;">warning: can't get client address: Connection reset by peer<br /></span><span style="font-family: courier;">Error: (!log_opts) Could not complete SSL handshake with [redacted]: 5<br /></span><span style="font-family: courier;">warning: can't get client address: Connection reset by peer<br /></span><span style="font-family: courier;">warning: can't get client address: Connection reset by peer</span></span></p><p>So, this is obviously an SSL/TLS problem.</p><p>However, everyone on the Internet basically says that this is a problem with the NRPE client machine (the Nagios source address isn't listed in "allowed_hosts", it's not set up for SSL correctly, you didn't compile it right, etc.).</p><p>After fighting with this for hours, we finally figured out what was wrong.</p><p>A hint was the "getpeername() failure"; if you open up the NRPE source code, this runs immediately after the connection is established. The only way that you could see this error ("Transport endpoint is not connected") is if the socket was closed between that initial connection and "getpeername".</p><p>Running "tcpdump" on both sides yielded the following findings:</p><p>On Nagios:</p><p></p><blockquote><span style="color: #999999;">Nagios → NRPE machine: SYN<br />NRPE machine → Nagios: SYN, ACK<br />Nagios → NRPE machine: ACK<br /></span>Nagios → NRPE machine: TLSv1 Client Hello<br />NRPE machine → Nagios: RST, ACK</blockquote><p></p><p>On the NRPE machine to be monitored:</p><p></p><blockquote><p><span style="color: #999999;">Nagios → NRPE machine: SYN<br />NRPE machine → Nagios: SYN, ACK<br />Nagios → NRPE machine: ACK<br /></span>Nagios → NRPE machine: RST, ACK</p><p></p></blockquote><p>Both machines agreed on the first 3 packets: the classic TCP handshake. However, they differed on the subsequent packets. Nagios sent a TLSv1 "Client Hello" packet and immediately had the connection closed by the NRPE machine. However, the NRPE machine did not see the TLSv1 "Client Hello" at all; rather, it saw that Nagios immediately closed the connection.</p><p>This is indicative of some trickery being done by the customer's equipment (firewall, VPN, etc.). 
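</p><p>If you want to reproduce this kind of capture yourself, something as simple as the following on each end is enough to see the handshake and whatever happens after it (5666 being the NRPE port):</p><p><span style="font-family: courier; font-size: x-small;">sudo tcpdump -i any -nn port 5666</span></p><p>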
From what I can tell, they're quietly stripping out any TLS packets and killing the connection if it finds any. They probably have an incorrect port rule set up for port 5666, but anyway, that's the problem here: the network infrastructure is tearing out the TLS packets and closing the connection.</p>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-1498439248860252027.post-42519413084001600422021-07-10T15:50:00.001-04:002021-07-10T15:53:00.877-04:00Migrating from a static volume to a storage pool in QNAP<p> I bought a QNAP TS-451+ NAS a number of years ago. At the time, you could only set up what are now called "static volumes"; these are volumes that are composed of a number of disks in some RAID configuration. After a firmware update, QNAP introduced "storage pools", which act as a layer in between the RAIDed disks and the volumes on top of them. Storage pools can do snapshots and some other fancy things, but the important thing here is that QNAP was pushing storage pools now, and I had a static volume.</p><p>I wanted to migrate from my old static volume to a new storage pool. I couldn't really find any examples of anyone who had performed such a migration successfully; most of the advice on the Internet was basically, "back up your stuff and reformat". Given the fact that my volume was almost full and that QNAP does not support an in-place migration, I figured that if I added on some extra storage in the form of an expansion unit, I could probably pull it off with minimal hassle.</p><p>(<a href="https://www.qnap.com/en-us/how-to/knowledge-base/article/can-i-convert-a-static-volume-to-thin-or-thick">The official QNAP docs</a> generally agree with this.)</p><p>tl;dr It was pretty easy to do, just a bit time-consuming. I'll also note that this was a lossless process (other than my NFS permissions); I didn't have to reinstall anything or restore any backups.</p><p>Here's the general workflow:</p><p></p><ol style="text-align: left;"><li>Attach the expansion unit.</li><li>Add the new disks to the expansion unit.</li><li>Create a new storage pool on the expansion unit.</li><li>Transfer each folder in the original volume to a new folder on the expansion unit.</li><li>Write down the NFS settings for the original volume's folders.</li><li>Delete the original volume.</li><li>Create a new storage pool with the original disks.</li><li>Create a new system volume on the main storage pool.</li><li>Create new volumes as desired on the main storage pool.</li><li>Transfer each folder from the expansion volume to the main volume.</li><li>Re-apply the NFS settings on the folders on the main storage pool's volumes.</li><li>Detach the expansion unit.</li></ol><div>Some details follow.</div><div><br /></div><div>QNAP sells expansion units that can act as additional storage pools and volumes, and the QNAP OS integrates them pretty well. I purchased a TS-004 and connected it to my TS-451+ NAS via USB. I had some new drives that I was planning to use to replace the drives currently in the NAS, so instead of doing that right away, I put them all in the expansion unit and created a new storage pool (let's call this the expansion storage pool).</div><div><br /></div><div>I had originally tried using File Station to copy and paste all of my folders to a new volume in the expansion unit, but I would get permission-related errors, and I didn't want to deal with individual files when there were millions to transfer. 
QNAP has an application called Hybrid Backup Sync, and one of the things that you can do is a 1-way sync "job" that lets you properly copy everything from one folder on one volume to another folder on another volume. So I created new top-level folders in the expansion volume and then used Hybrid Backup Sync to copy all of my data from the main volume to the expansion volume (it preserved all the file attributes, etc.).</div><div><br /></div><div>For more information how to use Hybrid Backup Sync to do this, see <a href="https://www.qnap.com/en/how-to/knowledge-base/article/how-to-move-shared-folders-to-a-new-volume">this article from QNAP</a>.</div><div><br /></div><div>(If you're coming from a static volume and you set up a storage pool on the expansion unit, then QNAP has a feature where you can transfer a folder on a static volume to a new volume in a storage pool, but this only works one way; you can't use this feature to transfer back from storage pool to storage pool, only from static volume to storage pool.)</div><div><br /></div><div>I then wrote down the NFS settings that I had for my folders on the main unit (it's pretty simple, but I did have some owner and whitelist configuration).</div><div><br /></div><div>Once I had everything of mine onto the expansion volume, I then deleted the main (system) volume. QNAP was okay with this and didn't complain at all. Some sites that I had read claimed that you'd have to reboot or reformat or something if you did this, but at least on modern QNAP OSes, it's fine with you deleting its system volume.</div><div><br /></div><div>For more information on deleting a volume, see <a href="https://www.qnap.com/en/how-to/knowledge-base/article/how-to-remove-a-storage-poolvolume">this article from QNAP</a>.</div><div><br /></div><div>I created a new storage pool with the main unit's existing disks, and then I created a small, thin volume on it to see what would happen. QNAP quickly decided that this new volume would be the new "system" volume, and it installed some applications on its own, and then it was done. My guess is that it installed whatever base config it needs to operate on that new volume and maybe transferred the few applications that I already had to it or something.</div><div><br /></div><div>(I then rebooted the QNAP just to make sure that everything was working, and it ended up being fine.)</div><div><br /></div><div>On the expansion unit, I renamed all of the top-level folders to end with "_expansion" so that I'd be able to tell them apart from the ones that I would make on the main unit.</div><div><br /></div><div>Then I used Hybrid Backup Sync to copy my folders from the expansion volume to the main volume. Once that was done, I modified the NFS settings on the main volume's folders to match what they had been originally.</div><div><br /></div><div>I tested the connections from all my machines that use the NAS, and then I detached and powered down the expansion unit. I restarted the NAS and tested the connections again, and everything was perfect. 
Now I had a storage pool with thin-provisioned volumes instead of a single, massive static volume.</div><p></p>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-1498439248860252027.post-23579290094646843322021-07-05T17:47:00.001-04:002021-07-05T17:49:04.826-04:00Working around App Engine's bogus file modification times in Go<p>When an <a href="https://cloud.google.com/appengine">App Engine</a> application is deployed, the files on the filesystem have their modification times "zeroed"; in this case, they are set to Tuesday, January 1, 1980 at 00:00:01 GMT (with a Unix timestamp of "315532801"). Oddly enough, this isn't January 1, 1970 (with a Unix timestamp of "0"), so they're adding 1 year and 1 second for some reason (probably to avoid actually zeroing out the date).</p><p>If you found your way here by troubleshooting, you may have seen this for your "Last-Modified" header:</p><p><span style="font-family: courier; font-size: x-small;">last-modified: Tue, 01 Jan 1980 00:00:01 GMT</span></p><p>There's an issue for this particular problem (currently they're saying that it's working as designed); to follow the issue or make a comment, see <a href="https://issuetracker.google.com/issues/168399701">issue 168399701</a>.</p><p>For App Engine in Go, I've historically bypassed the static files stuff and just had my application serve up the files with "<a href="https://golang.org/pkg/net/http/#FileServer">http.FileServer</a>", and I've disabled caching everywhere to play it safe ("Cache-Control: no-cache, no-store, must-revalidate"). Recently, I've begun to experiment with a "max-age" of 1-minute lined up on 1-minute boundaries so that I get a bit of help from the GCP proxy and its caching powers while not shooting myself in the foot allowing stale copies of my files to linger all over the Internet.</p><p>This caused me a huge amount of headache recently when my web application wasn't updating in production, despite being pushed for over 24 hours. It turns out that the browser (Chrome) was making a request by including the "If-Modified-Since" header, and my application was responding back with a <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/304">304 Not Modified</a> response. No matter how many times my service worker tried to fetch the new data, the server kept telling it that what it had was perfect.</p><p>The default HTTP file server in some languages lets you tweak how it responds ("ETag", "Last-Modified", etc.), but not in Go. "http.FileServer" has no configuration options available to it.</p><p>What I ended up doing was wrapping "http.FileServer"'s "ServeHTTP" in another function; this function had two main goals:</p><p></p><ol style="text-align: left;"><li>Set up a weak ETag value using the expiration date (ideally, I'd use a strong value like the MD5 sum of the contents, but I didn't want to have to rewrite "http.FileServer" just for this).</li><li>Remove the request headers related to the modification time ("If-Modified-Since" and "If-Unmodified-Since"). 
"http.FileServer" definitely respects "If-Modified-Since", and because the modification time is bogus in App Engine, I figured that just quietly removing any headers related to that would keep things simple.</li></ol><div>Here's what I ended up with:</div><p></p><div><div><span style="font-family: courier; font-size: x-small;">staticHandler := http.StripPrefix("/", http.FileServer(http.Dir("/path/to/my/files")))</span></div><div><span style="font-family: courier; font-size: x-small;"><br /></span></div><span style="font-family: courier; font-size: x-small;">myHandler.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {</span></div><div><div><span style="font-family: courier; font-size: x-small;"><span style="white-space: pre;"> </span>// Cache all the static files aligned at the 1-minute boundary.</span></div><div><span style="font-family: courier; font-size: x-small;"><span style="white-space: pre;"> </span>expirationTime := time.Now().Truncate(1 * time.Minute).Add(1 * time.Minute)</span></div><div><span style="font-family: courier; font-size: x-small;"><span style="white-space: pre;"> </span>w.Header().Set("Cache-Control", fmt.Sprintf("public, max-age=%0.0f, must-revalidate", time.Until(expirationTime).Seconds()))</span></div><div><span style="font-family: courier; font-size: x-small;"><span style="white-space: pre;"> </span>w.Header().Set("ETag", fmt.Sprintf("W/\"exp_%d\"", expirationTime.Unix())) // The ETag is weak ("W/" prefix) because it'll be the same tag for all encodings.</span></div><div><span style="font-family: courier; font-size: x-small;"><br /></span></div><div><span style="font-family: courier; font-size: x-small;"><span style="white-space: pre;"> </span>// Strip the headers that `http.FileServer` will use that rely on modification time.</span></div><div><span style="font-family: courier; font-size: x-small;"><span style="white-space: pre;"> </span>// App Engine sets all of the timestamps to January 1, 1980.</span></div><div><span style="font-family: courier; font-size: x-small;"><span style="white-space: pre;"> </span>r.Header.Del("If-Modified-Since")</span></div><div><span style="font-family: courier; font-size: x-small;"><span style="white-space: pre;"> </span>r.Header.Del("If-Unmodified-Since")</span></div><div><span style="font-family: courier; font-size: x-small;"><br /></span></div><div><span style="font-family: courier; font-size: x-small;"><span style="white-space: pre;"> </span>staticHandler.ServeHTTP(w, r)</span></div><div><span style="font-family: courier; font-size: x-small;">})</span></div></div><div><br /></div><div>Anyway, I fought with this for two days before finally realizing what was going on, so hopefully this will let you work around App Engine's bogus file-modification times.</div>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-1498439248860252027.post-51117811609623862192021-04-15T16:27:00.001-04:002021-04-15T16:27:12.635-04:00Using "errors.Is" to detect "connection reset by peer" and work around it<p> I maintain an application that ties into <a href="https://www.emergencyreporting.com">Emergency Reporting</a> using their REST API. When an item is updated, I have a <a href="https://cloud.google.com/tasks/">Google Cloud Task</a> that attempts to publish a change to a web hook, which connects to the Emergency Reporting API and creates a new incident in that system. Because it's in Cloud Tasks, if the task fails for any reason, Cloud Tasks will attempt to retry the task until it succeeds. 
Cool.</p><p>I also have it set up to send any log messages at warning level or higher to a <a href="https://slack.com/">Slack</a> channel. Also cool.</p><p>However, in December of 2020, Emergency Reporting switched to some kind of Microsoft-managed authentication system for their API, and this has only brought problems. The most common of which is that the authentication API will frequently fail with a "connection reset by peer" error. My Emergency Reporting wrapper detects this and logs it; my web hook detects a sign-in failure and logs that; and the whole Cloud Task detects that the web hook has failed and logs that. Cloud Tasks automatically retries the task, which makes another post to the web hook, and everything succeeds the second time. But by now, I've accumulated a bunch of warnings in the Slack channel. Not cool.</p><p>So here's the thing: the Emergency Reporting API can fail for a lot of reasons, and I'd like to be notified when something important actually happens. But a standard, run-of-the-mill TCP "connection reset by peer" error is not important at all.</p><p>Here's an example of the kind of error that Go's <a href="https://golang.org/pkg/net/http/#Client.PostForm">http.Client.PostForm</a> returns in this case:</p><p><span style="font-family: courier; font-size: x-small;">Could not post form: Post https://login.emergencyreporting.com/login.emergencyreporting.com/B2C_1A_PasswordGrant/oauth2/v2.0/token: read tcp [fddf:3978:feb1:d745::c001]:33391->[2620:1ec:29::19]:443: read: connection reset by peer</span></p><p style="text-align: left;">Looking at the error, it looks like there are 4 layers of error:</p><ol style="text-align: left;"><li>The HTTP post</li><li>The particular TCP read</li><li>A generic "read"</li><li>A generic "connection reset by peer"</li></ol>What I really want to do in this case is detect a generic "connection reset by peer" error and quietly retry the operation, allowing all other errors to be handled as true errors. Doing string-comparison operations on error text is rarely a good idea, so what does that leave us with?<p></p><p style="text-align: left;"><a href="https://golang.org/doc/go1.13#error_wrapping">Go 1.13</a> adds support for "error wrapping", where one error can "wrap" another one, while still allowing programs to make decisions based on the "wrapped" error. You may call "<a href="https://golang.org/pkg/errors/#Is">errors.Is</a>" to determine if any error in an error chain matches a particular target.</p><p style="text-align: left;">Fortunately, all of the packages in this particular chain of errors utilize this feature. 
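</p><p style="text-align: left;">As a quick illustration of how wrapping chains behave (this is a toy example, not the actual Emergency Reporting code), an error wrapped with "fmt.Errorf" and the "%w" verb can still be matched by "errors.Is" no matter how many layers get stacked on top of it:</p><p><span style="font-family: courier; font-size: x-small;">package main<br /><br />import (<br />    "errors"<br />    "fmt"<br />    "syscall"<br />)<br /><br />func main() {<br />    // Pretend this is what the bottom of the network stack returned.<br />    var baseErr error = syscall.ECONNRESET<br /><br />    // Each layer wraps the error below it with the %w verb.<br />    readErr := fmt.Errorf("read: %w", baseErr)<br />    postErr := fmt.Errorf("Could not post form: %w", readErr)<br /><br />    // errors.Is unwraps the whole chain, so the top-level error still matches.<br />    fmt.Println(errors.Is(postErr, syscall.ECONNRESET)) // prints "true"<br />}</span></p><p style="text-align: left;">That unwrapping behavior is exactly what the check below relies on.</p><p style="text-align: left;">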
In particular, the <a href="https://golang.org/pkg/syscall/">syscall</a> package has a set of distinct <a href="https://golang.org/pkg/syscall/#Errno">Errno</a> errors for each low-level error, including "connection reset by peer" (ECONNRESET).<br /></p><p style="text-align: left;">This lets us do something like this:</p><p><span style="font-family: courier; font-size: x-small;">tokenResponse, err = client.GenerateToken()<br />if err != nil {<br /> // If this was a connection-reset error, then continue to the next retry.<br /> if errors.Is(err, syscall.ECONNRESET) {<br /> logrus.Info("Got back a syscall.ECONNRESET from Emergency Reporting.")<br /> // [attempt to retry the operation]<br /> } else {<br /> // This was some other kind of error that we can't handle.<br /> // [log a proper error message and fail]<br /> }<br />}</span></p><p>Since using "errors.Is" to detect the "connection reset by peer" error, I haven't received a single annoying, pointless error message in my Slack channel. I did have to spend a bit of time trying to figure out what that ultimate, underlying error was, but after that, it's been working flawlessly.</p>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-1498439248860252027.post-44054987897420669422021-01-25T15:12:00.001-05:002021-01-25T15:12:29.448-05:00Using LDAP groups to limit access to a Radius server (freeRADIUS 3.0)<p><i>Note: this is an updated version of <a href="https://blog.sensecodons.com/2020/12/using-ldap-groups-to-limit-to-radius.html">a prior entry</a> for freeRADIUS 3.0.</i></p><p>Anytime I need to create a VPN (to my home network, to my AWS network, etc.), I use <a href="https://www.softether.org/">SoftEther</a>. SoftEther is OpenVPN-compatible, supports L2TP/IPsec, and has some neat settings around VPN over ICMP and DNS. Anyway, once you get it set up, it generally just works (except for the cronjob that you need to make to trim its massive log files daily).</p><p>At work, we use LDAP for our user authentication and permissions, but SoftEther doesn't support LDAP. 
It does, however, support Radius, and <a href="https://freeradius.org/">freeRADIUS</a> supports using LDAP as a module, so you can easily set up a quick Radius proxy for LDAP.</p><h2>Quick recap on setting up freeRADIUS with LDAP</h2><div>I'm assuming that you already have an LDAP server.</div><div><br /></div><div>Install freeRADIUS and the LDAP module.</div><div><span style="font-family: courier; font-size: x-small;">sudo apt install freeradius freeradius-ldap</span></div><div><span style="font-family: courier; font-size: x-small;">sudo systemctl enable freeradius</span></div><div><span style="font-family: courier; font-size: x-small;">sudo systemctl start freeradius</span></div><div><br /></div><div>Enable the LDAP module via symlink:</div><div><span style="font-family: courier; font-size: x-small;">ln -sfn ../mods-available/ldap /etc/freeradius/3.0/mods-enabled/ldap</span></div><div><br /></div><div>Then turn on the LDAP module by editing <span style="font-family: courier; font-size: x-small;">/etc/freeradius/3.0/sites-enabled/default</span> and uncommenting the "ldap" line under the "authorize" block.</div><div><span style="font-family: courier; font-size: x-small;">authorize {</span></div><div><span style="font-family: courier; font-size: x-small;">...</span></div><div><span style="font-family: courier; font-size: x-small;"> ldap</span></div><div><span style="font-family: courier; font-size: x-small;">...</span></div><div><br /></div><div>You'll need to add an "if" statement to set the "Auth-Type"; do this immediately after that "ldap" line.</div><div><div><span style="font-family: courier; font-size: small;"> </span><span style="font-family: courier; font-size: x-small;">if ((ok || updated) && User-Password) {</span></div><div><span style="font-family: courier; font-size: small;"> </span><span style="font-family: courier; font-size: small;"> </span><span style="font-family: courier; font-size: small;">update {</span></div><div><span style="font-family: courier; font-size: small;"> </span><span style="font-family: courier; font-size: small;"> </span><span style="font-family: courier; font-size: small;"> </span><span style="font-family: courier; font-size: small;">control:Auth-Type := ldap</span></div><div><span style="font-family: courier; font-size: small;"> </span><span style="font-family: courier; font-size: small;"> </span><span style="font-family: courier; font-size: small;">}</span></div><div><span style="font-family: courier; font-size: small;"> </span><span style="font-family: courier; font-size: x-small;">}</span></div></div><div><br /></div><div>And the same for the "Auth-Type LDAP" block.</div><div><span style="font-family: courier; font-size: x-small;">authorize {</span></div><div><span style="font-family: courier; font-size: x-small;">...</span></div><div><span style="font-family: courier; font-size: x-small;"> Auth-Type LDAP {</span></div><div><span style="font-family: courier; font-size: x-small;"> ldap</span></div><div><span style="font-family: courier; font-size: x-small;"> }</span></div><div><span style="font-family: courier; font-size: x-small;">...</span></div><div><br /></div><div>Cool; at this point, freeRADIUS will use whatever LDAP setup is in the <span style="font-family: courier; font-size: x-small;">/etc/freeradius/3.0/mods-enabled/ldap</span> file. 
It won't work (because it's not set up for your LDAP server), that's all that you need in order to back your Radius server with your LDAP server.</div><div><br /></div><div>Next up, we'll look at configuring it to actually talk to your LDAP server.</div><h2>Configuring the LDAP module</h2><div><span style="font-family: courier; font-size: x-small;">/etc/freeradius/3.0/mods-enabled/ldap</span> is where the LDAP configuration lives. In order to understand exactly what's going on, you should know a few things.</div><div><ol><li><a href="https://wiki.freeradius.org/config/run_time_variables">Run-time variables</a>, like the current user name, are written as <span style="font-family: courier; font-size: x-small;">%{Variable-Name}</span>. For example, the current user name is <span style="font-family: courier; font-size: x-small;">%{User-Name}</span>.<br /></li><li>Similar to shell variables, you can have conditional values. The basic syntax is <span style="font-family: courier; font-size: x-small;">%{%{Variable-1}:-${Variable-2}}</span>. A typical pattern that you'll see is using the "stripped" user name (the user name without any realm information), but if that's not defined, then use the actual user name: <span style="font-family: courier; font-size: x-small;">%{%{Stripped-User-Name}:-%{User-Name}}</span></li></ol><div>For your basic LDAP integration (if you provide a valid username and password, you can sign in), you'll need to set the following values in the "ldap" block:</div></div><div><ol><li><span style="font-family: courier; font-size: x-small;">server</span>; this is the hostname or address of your server. If you're running freeRADIUS on the same LDAP server, then this will be "localhost".</li><li><span style="font-family: courier; font-size: x-small;">identity</span>; this is the DN for the "bind" user. That's the user that freeRADIUS will log in as in order to search the directory tree and do its LDAP stuff. This is typically a read-only user.</li><li><span style="font-family: courier; font-size: x-small;">password</span>; this is the password for the user configured in <span style="font-family: courier; font-size: x-small;">identity</span>.</li><li><span style="font-family: courier; font-size: x-small;">base_dn</span>; this is the base DN to use for all user searches. It's usually something like <span style="font-family: courier; font-size: x-small;">dc=example,dc=com</span>, but that'll depend on your LDAP setup. You'll generally want to set this as the base for all of your users (maybe something like <span style="font-family: courier; font-size: x-small;">ou=users,dc=example,dc=com</span>, etc.).</li></ol><div><div>Here's an example that assumes that your users are all under <span style="font-family: courier; font-size: x-small;">ou=users,dc=example,dc=com</span>:</div><div><span style="font-family: courier; font-size: x-small;">server = "my-ldap-server.example.com"<br />identity = "uid=my-bind-user,ou=service-users,dc=example,dc=com"<br />password = "abc123"<br />base_dn = "ou=users,dc=example,dc=com"</span></div></div><h2 style="text-align: left;">Users</h2><div>You'll also need to set up user-level things in the "user" block:</div><ol><li><span style="font-family: courier; font-size: x-small;">filter</span>; this is the LDAP search condition that freeRADIUS will use to try to find the matching LDAP user for the user name that just tried to sign in via Radius. This is where run-time variables will come into play. 
For out-of-the-box OpenLDAP, something like this will generally work: <span style="font-family: courier; font-size: x-small;">(uid=%{%{Stripped-User-Name}:-%{User-Name}})</span>. What this means is look for an entity in LDAP (under the base DN defined in <span style="font-family: courier; font-size: x-small;">basedn</span>) with a <span style="font-family: courier; font-size: x-small;">uid</span> property of the Radius user name. Yes, you need the surrounding parentheses. No, I don't make the rules.</li></ol><div>Here's an example that uses "uid" for the user name.</div></div><div><span style="font-family: courier; font-size: x-small;">filter = "(uid=%{%{Stripped-User-Name}:-%{User-Name}})"</span></div><div><br /></div><div>Remember, <span style="font-family: courier; font-size: x-small;">filter</span> can be any LDAP filter, so if there were a property that you also wanted to check (such as <span style="font-family: courier; font-size: x-small;">isAllowedToDoRadius</span> or something), then you could check for that, as well. For example:</div><div><span style="font-family: courier; font-size: x-small;">filter = "(&(uid=%{%{Stripped-User-Name}:-%{User-Name}})(isAllowedToDoRadius=yes))"</span></div><h2>Filtering by group</h2><div>So, that'll let any LDAP user authenticate with Radius. Maybe you want that, maybe you don't. In my case, I have a whole bunch of users, but I only want a small subset to be able to VPN in using SoftEther. I added those users to the "vpn-users" group in LDAP.</div><div><br /></div><div>Note that there are two general grouping strategies in LDAP:</div><div><ol><li>Groups-have-users; in this strategy, the group entity lists the users within the group. This is the default OpenLDAP strategy.</li><li>Users-have-groups; in this strategy, the user entity lists the groups that it belongs to.</li></ol><div>If you want to have freeRADIUS respect your groups, you'll need to set the following in <span style="font-family: courier; font-size: x-small;">/etc/freeradius/3.0/mods-enabled/ldap</span> in the "groups" block:</div></div><div><ol><li><span style="font-family: courier; font-size: x-small;">name_attribute = cn</span> (which turns on tracking groups); and</li><li>One of these two options, which each correspond to one of the LDAP grouping strategies:</li><ol><li><span style="font-family: courier; font-size: x-small;">membership_filter</span>; this is an LDAP filter to use to query for all of the groups that the user belongs to.</li><li><span style="font-family: courier; font-size: x-small;">membership_attribute</span>; this is the property on the user entity that lists the groups that the user belongs to.</li></ol></ol><div>If your groups have users, this might look like:</div></div><div><div><span style="font-family: courier; font-size: x-small;">name_attribute = cn</span></div><div><span style="font-family: courier; font-size: x-small;">membership_filter = "(&(objectClass=posixGroup)(memberUid=%{%{Stripped-User-Name}:-%{User-Name}}))"</span></div></div><div><br /></div><div>If your users have groups, this might look like:</div><div><span style="font-family: courier; font-size: x-small;">name_attribute = cn<br />membership_attribute = groupName</span></div><div><br /></div><div>With that set up, freeRADIUS will now <i>know </i>which groups the user belongs to, but it won't do anything with them.</div><div><br /></div><div>The last step is to set up some group rules in <span style="font-family: courier; font-size: x-small;">/etc/freeradius/3.0/users</span>. 
There will probably be a few entries in that file already, but by default, none of them will be LDAP-related. So, at the very bottom, add the LDAP group rules.</div><div><br /></div><div><i>Note: In my case, this file was a symlink to "mods-config/files/authorize". The symlink was a convenience for backward-compatibility in editing the config files; freeRADIUS doesn't actually load "users"; rather, it loads "mods-config/files/authorize", so make sure that you're actually modifying the correct file.</i></div><div><br /></div><div>The simplest grouping rules will look like this:</div><div><div><span style="font-family: courier; font-size: x-small;">DEFAULT LDAP-Group == "your-group-name-here"</span></div><div><span style="font-family: courier; font-size: x-small;">DEFAULT Auth-Type := Reject</span></div><div><span style="font-family: courier; font-size: x-small;"> Reply-Message = "Sorry, you're not part of an authorized group."</span></div></div><div><br /></div><div>This generally means: you have to be a member of "your-group-name-here" or else you'll be rejected (and here's the message to send you).</div><div><br /></div><div>In my case, my group is "vpn-users", so it looks like this:</div><div><div><span style="font-family: courier; font-size: x-small;">DEFAULT LDAP-Group == "vpn-users", Auth-Type := Accept</span></div><div><span style="font-family: courier; font-size: x-small;">DEFAULT Auth-Type := Reject</span></div><div><span style="font-family: courier; font-size: x-small;"> Reply-Message = "Sorry, you're not part of an authorized group."</span></div></div><div><br /></div><div>Once that's done, restart freeradius and you'll be good to go.</div><div><span style="font-family: courier; font-size: x-small;">sudo systemctl restart freeradius</span></div><div><br /></div><div>To test to see if it worked, you can run the radtest command:</div><div><span style="font-family: courier; font-size: x-small;">radtest -x ${username} ${password} ${address} ${port} ${secret}</span></div><div><br /></div><div>For example, in our case, this might look like:</div><div><span style="font-family: courier; font-size: x-small;">radtest -x some-user abc123 my-radius-server.example.com 1812 the-gold-is-under-the-bridge</span></div><div><br /></div><div>On success, you'll see something like:</div><div><span style="font-family: courier; font-size: x-small;">rad_recv: Access-Accept packet</span></div><div><br /></div><div>On failure, you'll see something like:</div><div><span style="font-family: courier; font-size: x-small;">rad_recv: Access-Reject packet</span></div><div><br /></div><div>Hopefully this helped a bit; I struggle every time I need to do <i>anything </i>with LDAP or Radius. It's always really hard to find the documentation for what I'm looking for.</div>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-1498439248860252027.post-31556280085473873002020-12-22T17:58:00.009-05:002020-12-23T09:53:18.977-05:00Using LDAP groups to limit access to a Radius server<p>Anytime I need to create a VPN (to my home network, to my AWS network, etc.), I use <a href="https://www.softether.org/">SoftEther</a>. SoftEther is OpenVPN-compatible, supports L2TP/IPsec, and has some neat settings around VPN over ICMP and DNS. Anyway, once you get it set up, it generally just works (except for the cronjob that you need to make to trim its massive log files daily).</p><p>At work, we use LDAP for our user authentication and permissions, but SoftEther doesn't support LDAP. 
It does, however, support Radius, and <a href="https://freeradius.org/">freeRADIUS</a> supports using LDAP as a module, so you can easily set up a quick Radius proxy for LDAP.</p><h2 style="text-align: left;">Quick recap on setting up freeRADIUS with LDAP</h2><div>I'm assuming that you already have an LDAP server.</div><div><br /></div><div>Install freeRADIUS and the LDAP module.</div><div><span style="font-family: courier; font-size: x-small;">sudo apt install freeradius freeradius-ldap</span></div><div><span style="font-family: courier; font-size: x-small;">sudo systemctl enable freeradius</span></div><div><span style="font-family: courier; font-size: x-small;">sudo systemctl start freeradius</span></div><div><br /></div><div>Then turn on the LDAP module by editing <span style="font-family: courier; font-size: x-small;">/etc/freeradius/sites-enabled/default</span> and uncommenting the "ldap" line under the "authorize" block.</div><div><span style="font-family: courier; font-size: x-small;">authorize {</span></div><div><span style="font-family: courier; font-size: x-small;">...</span></div><div><span style="font-family: courier; font-size: x-small;"> ldap</span></div><div><span style="font-family: courier; font-size: x-small;">...</span></div><div><br /></div><div>And the same for the "Auth-Type LDAP" block.</div><div><span style="font-family: courier; font-size: x-small;">authorize {</span></div><div><span style="font-family: courier; font-size: x-small;">...</span></div><div><span style="font-family: courier; font-size: x-small;"> Auth-Type LDAP {</span></div><div><span style="font-family: courier; font-size: x-small;"> ldap</span></div><div><span style="font-family: courier; font-size: x-small;"> }</span></div><div><span style="font-family: courier; font-size: x-small;">...</span></div><div><br /></div><div>Cool; at this point, freeRADIUS will use whatever LDAP setup is in the <span style="font-family: courier; font-size: x-small;">/etc/freeradius/modules/ldap</span> file. It won't work (because it's not set up for your LDAP server), that's all that you need in order to back your Radius server with your LDAP server.</div><div><br /></div><div>Next up, we'll look at configuring it to actually talk to your LDAP server.</div><h2 style="text-align: left;">Configuring the LDAP module</h2><div><span style="font-family: courier; font-size: x-small;">/etc/freeradius/modules/ldap</span> is where the LDAP configuration lives. In order to understand exactly what's going on, you should know a few things.</div><div><ol style="text-align: left;"><li><a href="https://wiki.freeradius.org/config/run_time_variables">Run-time variables</a>, like the current user name, are written as <span style="font-family: courier; font-size: x-small;">%{Variable-Name}</span>. For example, the current user name is <span style="font-family: courier; font-size: x-small;">%{User-Name}</span>.<br /></li><li>Similar to shell variables, you can have conditional values. The basic syntax is <span style="font-family: courier; font-size: x-small;">%{%{Variable-1}:-${Variable-2}}</span>. 
A typical pattern that you'll see is using the "stripped" user name (the user name without any realm information), but if that's not defined, then use the actual user name: <span style="font-family: courier; font-size: x-small;">%{%{Stripped-User-Name}:-%{User-Name}}</span></li></ol><div>For your basic LDAP integration (if you provide a valid username and password, you can sign in), you'll need to set the following values in the "ldap" block:</div></div><div><ol style="text-align: left;"><li><span style="font-family: courier; font-size: x-small;">server</span>; this is the hostname or address of your server. If you're running freeRADIUS on the same LDAP server, then this will be "localhost".</li><li><span style="font-family: courier; font-size: x-small;">identity</span>; this is the DN for the "bind" user. That's the user that freeRADIUS will log in as in order to search the directory tree and do its LDAP stuff. This is typically a read-only user.</li><li><span style="font-family: courier; font-size: x-small;">password</span>; this is the password for the user configured in <span style="font-family: courier; font-size: x-small;">identity</span>.</li><li><span style="font-family: courier; font-size: x-small;">basedn</span>; this is the base DN to use for all user searches. It's usually something like <span style="font-family: courier; font-size: x-small;">dc=example,dc=com</span>, but that'll depend on your LDAP setup. You'll generally want to set this as the base for all of your users (maybe something like <span style="font-family: courier; font-size: x-small;">ou=users,dc=example,dc=com</span>, etc.).</li><li><span style="font-family: courier; font-size: x-small;">filter</span>; this is the LDAP search condition that freeRADIUS will use to try to find the matching LDAP user for the user name that just tried to sign in via Radius. This is where run-time variables will come into play. For out-of-the-box OpenLDAP, something like this will generally work: <span style="font-family: courier; font-size: x-small;">(uid=%{%{Stripped-User-Name}:-%{User-Name}})</span>. What this means is look for an entity in LDAP (under the base DN defined in <span style="font-family: courier; font-size: x-small;">basedn</span>) with a <span style="font-family: courier; font-size: x-small;">uid</span> property of the Radius user name. Yes, you need the surrounding parentheses. No, I don't make the rules.</li></ol><div>Here's an example that assumes that your users are all under <span style="font-family: courier; font-size: x-small;">ou=users,dc=example,dc=com</span> and have a <span style="font-family: courier; font-size: x-small;">uid</span> property that is their user name:</div></div><div><span style="font-family: courier; font-size: x-small;">server = "my-ldap-server.example.com"<br />identity = "uid=my-bind-user,ou=service-users,dc=example,dc=com"<br />password = "abc123"<br />basedn = "ou=users,dc=example,dc=com"<br />filter = "(uid=%{%{Stripped-User-Name}:-%{User-Name}})"</span></div><div><br /></div><div>Remember, <span style="font-family: courier; font-size: x-small;">filter</span> can be any LDAP filter, so if there were a property that you also wanted to check (such as <span style="font-family: courier; font-size: x-small;">isAllowedToDoRadius</span> or something), then you could check for that, as well. 
For example:</div><div><span style="font-family: courier; font-size: x-small;">filter = "(&(uid=%{%{Stripped-User-Name}:-%{User-Name}})(isAllowedToDoRadius=yes))"</span></div><h2 style="text-align: left;">Filtering by group</h2><div>So, that'll let any LDAP user authenticate with Radius. Maybe you want that, maybe you don't. In my case, I have a whole bunch of users, but I only want a small subset to be able to VPN in using SoftEther. I added those users to the "vpn-users" group in LDAP.</div><div><br /></div><div>Note that there are two general grouping strategies in LDAP:</div><div><ol style="text-align: left;"><li>Groups-have-users; in this strategy, the group entity lists the users within the group. This is the default OpenLDAP strategy.</li><li>Users-have-groups; in this strategy, the user entity lists the groups that it belongs to.</li></ol><div>If you want to have freeRADIUS respect your groups, you'll need to set the following in <span style="font-family: courier; font-size: x-small;">/etc/freeradius/modules/ldap</span>:</div></div><div><ol style="text-align: left;"><li><span style="font-family: courier; font-size: x-small;">groupname_attribute = cn</span> (which turns on tracking groups); and</li><li>One of these two options, which each correspond to one of the LDAP grouping strategies:</li><ol><li><span style="font-family: courier; font-size: x-small;">groupmembership_filter</span>; this is an LDAP filter to use to query for all of the groups that the user belongs to.</li><li><span style="font-family: courier; font-size: x-small;">groupmembership_attribute</span>; this is the property on the user entity that lists the groups that the user belongs to.</li></ol></ol><div>If your groups have users, this might look like:</div></div><div><div><span style="font-family: courier; font-size: x-small;">groupname_attribute = cn</span></div><div><span style="font-family: courier; font-size: x-small;">groupmembership_filter = "(&(objectClass=posixGroup)(memberUid=%{%{Stripped-User-Name}:-%{User-Name}}))"</span></div></div><div><br /></div><div>If your users have groups, this might look like:</div><div><div><span style="font-family: courier; font-size: x-small;">groupname_attribute = cn<br />groupmembership_attribute = groupName</span></div></div><div><br /></div><div>With that set up, freeRADIUS will now <i>know </i>which groups the user belongs to, but it won't do anything with them.</div><div><br /></div><div>The last step is to set up some group rules in <span style="font-family: courier; font-size: x-small;">/etc/freeradius/users</span>. There will probably be a few entries in that file already, but by default, none of them will be LDAP-related. 
So, at the very bottom, add the LDAP group rules.</div><div><br /></div><div>The simplest grouping rules will look like this:</div><div><div><span style="font-family: courier; font-size: x-small;">DEFAULT LDAP-Group == "your-group-name-here"</span></div><div><span style="font-family: courier; font-size: x-small;">DEFAULT Auth-Type := Reject</span></div><div><span style="font-family: courier; font-size: x-small;"> Reply-Message = "Sorry, you're not part of an authorized group."</span></div></div><div><br /></div><div>This generally means: you have to be a member of "your-group-name-here" or else you'll be rejected (and here's the message to send you).</div><div><br /></div><div>In my case, my group is "vpn-users", so it looks like this:</div><div><div><span style="font-family: courier; font-size: x-small;">DEFAULT LDAP-Group == "vpn-users"</span></div><div><span style="font-family: courier; font-size: x-small;">DEFAULT Auth-Type := Reject</span></div><div><span style="font-family: courier; font-size: x-small;"> Reply-Message = "Sorry, you're not part of an authorized group."</span></div></div><div><br /></div><div>Once that's done, restart freeradius and you'll be good to go.</div><div><span style="font-family: courier; font-size: x-small;">sudo systemctl restart freeradius</span></div><div><br /></div><div>To test to see if it worked, you can run the radtest command:</div><div><span style="font-family: courier; font-size: x-small;">radtest -x ${username} ${password} ${address} ${port} ${secret}</span></div><div><br /></div><div>For example, in our case, this might look like:</div><div><span style="font-family: courier; font-size: x-small;">radtest -x some-user abc123 my-radius-server.example.com 1812 the-gold-is-under-the-bridge</span></div><div><br /></div><div>On success, you'll see something like:</div><div><span style="font-family: courier; font-size: x-small;">rad_recv: Access-Accept packet</span></div><div><br /></div><div>On failure, you'll see something like:</div><div><span style="font-family: courier; font-size: x-small;">rad_recv: Access-Reject packet</span></div><div><br /></div><div>Hopefully this helped a bit; I struggle every time I need to do <i>anything </i>with LDAP or Radius. It's always really hard to find the documentation for what I'm looking for.</div>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-1498439248860252027.post-42192219808730939262020-12-07T18:07:00.004-05:002020-12-07T18:08:07.348-05:00Working with the Google Datastore emulator<p> I do a good chunk of my business in Google App Engine; you package up your web application, send it to GCP, and then it takes care of scaling and uptime and all that stuff.</p><p>When I started out in 2014, I created my main application in Java because that was the least-crappy language that was supported. However, in 2020, there are a whole lot more languages (in particular: Go). I've slowly been working on porting my application from Java 8 to Go 1.14. Along the way, I've run into some really annoying issues.</p><p>For today, I'm going to be focusing on the Datastore emulator. In "old" App Engine (Java 8, Go 1.11, Python 2, etc.), they gave you a whole emulator suite. Your application ran inside of that suite, and you had fake Google-based App Engine authentication, inbound e-mail, and a Datastore emulator that also had a web UI that you could use to see your entities and manipulate them. 
The Datastore emulator's web UI wasn't as good as the current one that you get in production, but it was good enough to use for development.</p><p>Well, in "new" App Engine, the emulator suite is gone, and now you have to emulate or mock every aspect of App Engine that you plan on using. It's not a huge deal, but it is a bit inconvenient. In particular, you now have to <a href="https://cloud.google.com/datastore/docs/tools/datastore-emulator">start your own Datastore emulator</a>.</p><p>It's easy to start:</p><p><span style="font-family: courier; font-size: x-small;">gcloud config set project &lt;your-project-id&gt;;<br />gcloud beta emulators datastore start;</span></p><p>There are some environment variables that you'll need to export for the various libraries to detect and use instead of the production instance; run this to see them:</p><p><span style="font-family: courier; font-size: x-small;">gcloud beta emulators datastore env-init;</span></p><p>That part is fine.</p><p>There are also two halfway-decent third-party web UIs for the Datastore emulator:</p><p></p><ol style="text-align: left;"><li><a href="https://github.com/GabiAxel/google-cloud-gui">https://github.com/GabiAxel/google-cloud-gui</a></li><li><a href="https://github.com/streamrail/dsui">https://github.com/streamrail/dsui</a></li></ol><p></p><p>I fought for <i>hours </i>trying to figure out why neither of those two web UIs worked. Neither would show any namespaces (and thus, neither would show any entities).</p><p>The short answer is that despite what the Datastore emulator <i>claims </i>it's using for the project ID, the only thing that it actually uses is "dummy-emulator-datastore-project".</p><p>I got a hint about it by poking around in the emulator's data file, and I got some confirmation in this file, which is the only thing on the Internet at the time of this writing that references that string: <a href="https://code.googlesource.com/gocloud/+/master/datastore/datastore.go">https://code.googlesource.com/gocloud/+/master/datastore/datastore.go</a></p><p>So, if you start the Datastore emulator according to the instructions and either of those two web UIs isn't working, try setting the project ID to "dummy-emulator-datastore-project".</p><p></p><ol style="text-align: left;"><li>In "google-cloud-gui", you set the project ID in the UI when you hit the "+" button to create a new project.</li><li>In "dsui", you set the project ID using the "--projectId" flag.</li></ol><div><br /></div><p></p>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-1498439248860252027.post-58527446986196286262020-07-18T01:27:00.002-04:002020-07-18T01:28:25.564-04:00Google Cloud Functions Issues Upgrading From Go 1.11 To 1.13<div>I use Google Cloud Functions with Go. However, I upgraded from Go 1.11 to Go 1.13 (because Go 1.11 is being deprecated) and ran into some annoying, undocumented issues.</div><h2 style="text-align: left;">Static Files And The Current Working Directory</h2><div>One of my Cloud Functions acts as a tiny web server; it has a few static HTML files that it serves in addition to its dynamic things.</div><div><br /></div><div>In Go 1.11, Cloud Functions put the static files (and all the source files, for that matter) in the working directory of the function. This (1) makes sense, and (2) makes testing easy.</div><div><br /></div><div>However, in Go 1.13, Cloud Functions puts the static files (and all of the source files) in the <font face="courier" size="2">./serverless_function_source_code</font> directory. Why? 
Who knows. All that mattered is that after a simple version upgrade, all of my stuff broke because it couldn't find files that it was able to find before the upgrade.</div><div><br /></div><div>I found that using a <font face="courier" size="2">sync.Once</font> to attempt to change the current working directory (if necessary) is a fairly clean backward-compatible way of handling this issue.</div><div><br /></div><div>Here's an example; it's fairly verbose, but you could rip out most of the logging if you don't want or need it.</div><div><br /></div><div><div><font face="courier" size="2">// GoogleCloudFunctionSourceDirectory is where Google Cloud will put the source code that was uploaded.</font></div><div><font face="courier" size="2">//</font></div><div><font face="courier" size="2">const GoogleCloudFunctionSourceDirectory = "serverless_function_source_code"</font></div><div><font face="courier" size="2"><br /></font></div><div><font face="courier" size="2">// once is an object that will only execute its function one time.</font></div><div><font face="courier" size="2">//</font></div><div><font face="courier" size="2">// Because we want to log during our initialization, we need to handle this in a non-standard</font></div><div><font face="courier" size="2">// function and keep track of our initialization status.</font></div><div><font face="courier" size="2">var once sync.Once</font></div></div><div><font face="courier" size="2"><br /></font></div><div><div><font face="courier" size="2">// Initialize initializes the application.</font></div><div><font face="courier" size="2">//</font></div><div><font face="courier" size="2">// Primarily, this changes the current working directory.</font></div><div><font face="courier" size="2">func Initialize(log *logrus.Logger) {</font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>log.Infof("Initializing the application.")</font></div><div><font face="courier" size="2"><br /></font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>path, err := os.Getwd()</font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>if err != nil {</font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>log.Warnf("Could not find the current working directory: %v", err)</font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>}</font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>log.Infof("Current working directory: %s", path)</font></div><div><font face="courier" size="2"><br /></font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>log.Infof("Looking for top-level source directory: %s", GoogleCloudFunctionSourceDirectory)</font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>fileInfo, err := os.Stat(GoogleCloudFunctionSourceDirectory)</font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>if err == nil && fileInfo.IsDir() {</font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>log.Infof("Found top-level source directory: %s", GoogleCloudFunctionSourceDirectory)</font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>err = os.Chdir(GoogleCloudFunctionSourceDirectory)</font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>if err != nil {</font></div><div><font face="courier" size="2"><span 
style="white-space: pre;"> </span>log.Warnf("Could not change to directory %q: %v", GoogleCloudFunctionSourceDirectory, err)</font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>}</font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>}</font></div><div><font face="courier" size="2"><br /></font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>log.Infof("Initialization complete.")</font></div><div><font face="courier" size="2">}</font></div></div><div><font face="courier" size="2"><br /></font></div><div><div><font face="courier" size="2">// CloudFunction is an HTTP Cloud Function with a request parameter.</font></div><div><font face="courier" size="2">func CloudFunction(w http.ResponseWriter, r *http.Request) {</font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>log := logrus.New()</font></div><div><font face="courier" size="2"><br /></font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>// Initialize our application if we haven't already.</font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>once.Do(func() { Initialize(log) })</font></div></div><div><font face="courier" size="2"><br /></font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>// YOUR CLOUD FUNCTION LOGIC HERE</font></div><div><font face="courier" size="2">}</font></div><div><br /></div><div>For more information, see <a href="https://cloud.google.com/functions/docs/concepts/exec#file_system">the Cloud Functions concepts docs</a>.</div><h2 style="text-align: left;">Logging And Environment Variables</h2><div>For whatever reason, Cloud Functions with Go don't log at anything other than the "default" log level; this means that all of my carefully crafted log messages all just get dumped into the logs at the same severity.</div><div><br /></div><div>I've been using <a href="https://github.com/tekkamanendless/gcfhook">gcfhook</a> with <a href="https://github.com/sirupsen/logrus">logrus</a> to get around this, but it's not an ideal solution. That combination works by nullifying all output of the application and then adding a logrus hook that connects to the StackDriver API to send proper logs over the network. It works fine, but it's silly to have to make a network connection to a logging API when the application itself can output directly.</div><div><br /></div><div>As of Go 1.13, Cloud Functions will no longer set the <font face="courier" size="2">FUNCTION_NAME</font>, <font face="courier" size="2">FUNCTION_REGION</font>, and <font face="courier" size="2">GCP_PROJECT</font> environment variables. This is a problem because we need those three pieces of information in order to use the StackDriver API to send the log messages. You <i>could </i>publish those environment variables back as part of your deployment, but I'd prefer not to.</div><div><br /></div><div>Fortunately, Cloud Functions can now parse (poorly documented) JSON-formatted lines from <font face="courier" size="2">stdout</font> and <font face="courier" size="2">stderr</font>, resulting in proper log messages with severities. The Cloud Functions docs refer to this as <a href="https://cloud.google.com/logging/docs/structured-logging">"structured logging"</a>, but the docs don't seem to apply correctly. 
Cloud Run has <a href="https://cloud.google.com/run/docs/logging#special-fields">a document</a> on how these JSON-formatted lines should look, but it's still a bit hazy.</div><div><br /></div><div>Anyway, the <a href="https://github.com/tekkamanendless/gcfstructuredlogformatter">gcfstructuredlogformatter</a> package introduces a logrus <i>formatter</i> that outputs JSON instead of plain text for logs. This eliminates the need for the extra environment variables and generally simplifies the logging workflow. It should only be a couple of lines of code to sub out gcfhook for gcfstructuredlogformatter.</div><div><br /></div><div>Here's an example:</div><div><br /></div><div><div><font face="courier" size="2">// CloudFunction is an HTTP Cloud Function with a request parameter.</font></div><div><font face="courier" size="2">func CloudFunction(w http.ResponseWriter, r *http.Request) {</font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>log := logrus.New()</font></div><div><font face="courier" size="2"><br /></font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>if value := os.Getenv("FUNCTION_TARGET"); value == "" {</font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>log.Infof("FUNCTION_TARGET is not set; falling back to normal logging.")</font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>} else {</font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>formatter := gcfstructuredlogformatter.New()</font></div><div><font face="courier" size="2"><br /></font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>log.SetFormatter(formatter)</font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>}</font></div><div><font face="courier" size="2"><br /></font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>log.Infof("This is an info message.")</font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>log.Warnf("This is a warning message.")</font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>log.Errorf("This is an error message.")</font></div><div><font face="courier" size="2"><br /></font></div><div><font face="courier" size="2"><span style="white-space: pre;"> </span>// YOUR CLOUD FUNCTION LOGIC HERE</font></div><div><font face="courier" size="2">}</font></div></div><div><br /></div><div>Hopefully this stopped you from banging your head against the wall for a few hours like I was doing as I tried to frantically figure out why the upgrade had failed in such weird ways.</div><div><br /></div><div><br /></div>Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-1498439248860252027.post-38649490747973389542020-06-03T22:39:00.003-04:002020-07-04T18:56:23.910-04:00Sharing a single screen in Slack for LinuxI have a bunch of monitors, and for whatever reason, Slack for Linux refuses to let me limit my screen sharing to a single monitor or application. 
This means that if I try to share my screen on a call, no one can see or read anything because they just see a giant, wide view of three monitors' worth of pixels crammed into their Slack window (typically only one monitor wide).<div><br /></div><div>In my experience, disabling monitors/displays is just not worth it; I'll have to spend too much time getting everything set back up correctly afterward, and that's really inconvenient and annoying.</div><div><br /></div><div>The solution that I've landed on is <a href="https://freedesktop.org/wiki/Software/Xephyr/">Xephyr</a>; Xephyr runs a second X11 server inside a new window, so when I need to get on a call where I'll have to share my screen, I simply:</div><div><ol style="text-align: left;"><li>Launch a new Xephyr display.</li><li>Close Slack.</li><li>Open Slack on the Xephyr display.</li><li>Open whatever else I'll need to share in the Xephyr display, typically a web browser or a terminal.</li><li>Get on the Slack call and share my "screen".</li></ol><div>Some small details:</div></div><div><ul style="text-align: left;"><li>You'll need to open Xephyr with the resolution that you want; given window decorations and such, you may need to play around with this a bit. Once you find out what works, put it in a script.</li><li>In order to resize windows in Xephyr, it'll need to be running a window manager. I struggled to get any "startx"-related things working, but I found that "twm" worked well enough for my purposes.</li><li>Some applications, such as Chrome, won't open on two displays at the same time. I just open a different browser in my Xephyr display (for example, I use "google-chrome" normally and "chromium-browser" in Xephyr), but you can also run Chrome using a different profile directory and it'll run in the other display.</li></ul><div>Install Xephyr and TWM:</div></div><div><pre style="text-align: left;">sudo apt install xserver-xephyr twm</pre></div><div><br /></div><div>Run Xephyr, Slack, and Chromium:</div><pre style="text-align: left;"># Launch Xephyr and create display ":1".<br />Xephyr -ac -noreset -screen 1920x1000 :1 &<br /># Start a window manager in Xephyr.<br />DISPLAY=:1 twm &>/dev/null &<br /># Open Slack in Xephyr.<br />DISPLAY=:1 slack &>/dev/null &<br /># Open Chromium in Xephyr.<br />DISPLAY=:1 chromium-browser &>/dev/null &</pre><div><br /></div><div>It's kind of dirty, but it works extremely well, and I don't have to worry about messing with my monitor setup when I need to give a presentation.</div><div><br /></div><div><i>Edit: an earlier version of this post used "Xephyr -bc -ac -noreset -screen 1920x1000 :1 &" for the Xephyr command; I can't get this to work with "-bc" anymore; I must have copied the wrong command when I published the post.</i></div>Anonymousnoreply@blogger.com3tag:blogger.com,1999:blog-1498439248860252027.post-44627577771288723682020-01-29T22:47:00.000-05:002020-01-29T22:48:15.913-05:00Unit-testing reCAPTCHA v2 and v3 in GoI recently worked on a project where we allowed new users to sign up for our system with a form. A new user would need to provide us with her name, her e-mail address, and a password. In order to prevent spamming, we used <a href="https://developers.google.com/recaptcha/docs/v3">reCAPTCHA v3</a>, and so that meant that we also submitted a reCAPTCHA token along with the rest of the new-user data.<br />
<br />
Unit-testing the sign-up process was fairly simple if we turned off the reCAPTCHA requirement, but the weakest link in the whole process is the one part that we could not control: reCAPTCHA. It would be foolish not to have test coverage around the reCAPTCHA workflow.<br />
<br />
So, how do you unit-test reCAPTCHA?<br />
<h2>
Focus: Server-side testing</h2>
<div>
For the purposes of this post, I'm going to be focusing on testing reCAPTCHA on the server side. This means that I'm not concerned with validating that users acted like humans fiddling around on a website. Instead, I'm concerned with what our sign-up endpoint does when it receives valid and invalid reCAPTCHA tokens.</div>
<h2>
reCAPTCHA in Go</h2>
<div>
There are a variety of Go packages that provide reCAPTCHA support; however, only one of them (1) has support for Go modules, and (2) has support for unit testing built in:</div>
<div>
<a href="https://github.com/tekkamanendless/go-recaptcha">https://github.com/tekkamanendless/go-recaptcha</a></div>
<div>
<br /></div>
<div>
For docs, see:</div>
<div>
<a href="https://godoc.org/github.com/tekkamanendless/go-recaptcha">https://godoc.org/github.com/tekkamanendless/go-recaptcha</a></div>
<div>
<br /></div>
<div>
When you create a new reCAPTCHA site, you're given public and private keys (short little strings, nothing huge). The public key is used on the client side when you make your connection to the reCAPTCHA API, and a response token is provided back. The private key is used on the server side to connect to the reCAPTCHA API and validate the response token.</div>
<div>
<br /></div>
<div>
Since the client side will likely be a line or two of Javascript that generates a token, our server-side work will be focused on validating that token.</div>
<div>
<br /></div>
<div>
Assuming that the newly generated token is "NEW_TOKEN" and that the private key is "YOUR_PRIVATE_KEY", then this is all you have to do in order to validate that token:</div>
<br />
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">import "github.com/tekkamanendless/go-recaptcha" </span><br />
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">// ...</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">recaptchaVerifier := recaptcha.New("YOUR_PRIVATE_KEY")<br />success, err := recaptchaVerifier.Verify("NEW_TOKEN")<br />if err != nil {<br /> // Fail with some 500-level error about not being able to verify the token<br />}</span><br />
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br />if !success {<br /> // Fail with some 400-level error about not being a human<br />}</span><br />
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br />// The token is valid!</span><br />
<div>
<br />
<div>
And that's it! All we really care about is whether or not it worked.</div>
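<div>
<br /></div>
<div>
If it helps to see that check in context, here's a rough sketch (not the project's actual endpoint) of how it might sit inside a sign-up handler; the "recaptcha-token" form field name is just an assumption for illustration, so use whatever your client actually sends:</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">// A sketch of a sign-up handler that gates account creation on the reCAPTCHA check.<br />http.HandleFunc("/sign-up", func(w http.ResponseWriter, r *http.Request) {<br /> // "recaptcha-token" is an assumed form field name.<br /> success, err := recaptchaVerifier.Verify(r.FormValue("recaptcha-token"))<br /> if err != nil {<br /> // Fail with some 500-level error about not being able to verify the token.<br /> http.Error(w, "Could not verify the reCAPTCHA token.", http.StatusInternalServerError)<br /> return<br /> }<br /> if !success {<br /> // Fail with some 400-level error about not being a human.<br /> http.Error(w, "Sorry, you don't appear to be a human.", http.StatusBadRequest)<br /> return<br /> }<br /><br /> // The token is valid; create the new user.<br />})</span></div>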
<h2>
Unit testing</h2>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">tekkamanendless/go-recaptcha</span> includes a package called "recaptchatest" that provides a fake reCAPTCHA API running as an <span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">httptest.Server</span> instance. This server simulates enough of the reCAPTCHA API to let you do the kinds of testing that you need to.</div>
</div>
</div>
</div>
</div>
<div>
<br /></div>
<div>
Just like the actual reCAPTCHA service, you can create multiple "sites" on the test server. Each site will have a public and private key, and you can call the <span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">NewResponseToken</span> method of a site to have that site generate a valid token for that site.</div>
<div>
<br /></div>
<div>
In terms of design, you'll set up the test server, the test site, and the valid token in advance of your test. When you create your <span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">Recaptcha</span> instance with the test site's private key, all you have to do is set the <span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">VerifyEndpoint</span> property of that instance to point to the test server (otherwise, it would try to talk to the real reCAPTCHA API and fail).</div>
<div>
<br /></div>
<div>
Here's a simple example:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">import (</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> "github.com/tekkamanendless/go-recaptcha" </span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"> "github.com/tekkamanendless/go-recaptcha/recaptchatest"</span></div>
</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">// ...</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">// Create a new reCAPTCHA test server, site, and valid token before the main test.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">testServer := recaptchatest.NewServer()</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">defer testServer.Close()</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">site := testServer.NewSite()</span></div>
<div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">token := site.NewResponseToken()</span></div>
</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">// Create the reCAPTCHA verifier with the site's private key.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">recaptchaVerifier := recaptcha.New(site.PrivateKey)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">// Override the endpoint so that it uses the test server.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">recaptchaVerifier.VerifyEndpoint = testServer.VerifyEndpoint()</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">// Run your test.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">// ...</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;"><br /></span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">// Validate that the reCAPTCHA token is good.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">success, err := recaptchaVerifier.Verify(token)</span></div>
</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">assert.Nil(t, err)</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">assert.True(t, success)</span></div>
<div>
<br /></div>
<div>
The <span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">recaptchatest</span> test server doesn't do too much that's fancy, but it will properly return a failure if the same token is verified twice or if the token is too old. It also has some functions to let you tweak the token properties so you don't have to wait around for 2 minutes for a token to age out; you can make one that's already too old (see <a href="https://godoc.org/github.com/tekkamanendless/go-recaptcha/recaptchatest#Site.GenerateToken">Site.GenerateToken</a> for more information).</div>
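<div>
<br /></div>
<div>
For example, picking up right after the test above (where the token has already been verified once), a second call to <span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">Verify</span> should come back as a failure. This is a sketch that assumes the reuse failure surfaces as a false result rather than as an error:</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">// Verifying the same token a second time should not succeed.<br />success, err = recaptchaVerifier.Verify(token)<br />assert.Nil(t, err)<br />assert.False(t, success)</span></div>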
Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-1498439248860252027.post-47605914629773419852019-10-14T18:17:00.000-04:002019-10-14T18:17:18.481-04:00Encrypt your /home directory using LUKS and a spare diskEvery year or two, I rotate a drive out of my NAS. My most recent rotation left me with a spare 1TB SSD. My main machine only had a 250GB SSD, so I figured that I'd just replace my /home directory with a mountpoint on that new disk, giving me lots of space for video editing and such, since I no longer had the room to deal with my GoPro footage.<br />
<br />
My general thought process was as follows:<br />
<br />
<ol>
<li>I don't want to mess too much with my system.</li>
<li>I don't want to clone my whole system onto the new drive.</li>
<li>I want to encrypt my personal data.</li>
<li>I don't really care about encrypting the entire OS.</li>
</ol>
<div>
I had originally looked into some other encryption options, such as encrypting each user's home directory separately, but even in the year 2019 there seemed to be too much drama dealing with that (anytime that I need to make a PAM change, it's a bad day). Using LUKS, the disk (well, partition) is encrypted, so everything kind of comes for free after that.</div>
<div>
<br /></div>
<div>
If you register the partition in /etc/crypttab, your machine will prompt you for the decryption key when it boots (at least Kubuntu 18.04 does).</div>
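<div>
<br /></div>
<div>
For reference, the /etc/crypttab entry for this setup would look something like the following; the UUID here is just a placeholder for your partition's actual UUID (which "blkid" will show you), and "none" means that there's no key file, so you'll be prompted for the passphrase at boot:</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">encrypted-home UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx none luks</span></blockquote>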
<div>
<br /></div>
<div>
One other thing: dealing with encrypted data may be slow if your processor doesn't support hardware-accelerated AES (the AES-NI instruction set). Do a quick check and make sure that "aes" is listed under "Flags":</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">lscpu;</span></blockquote>
If "aes" is there, then you're good to go. If not, then maybe run some tests to see how much CPU overhead disk operations use on LUKS (you can follow this guide, but stop before "Home setup, phase 2", and see if your overhead is acceptable).<br />
<h2>
The plan</h2>
<div>
<ol>
<li>Luks Setup</li>
<ol>
<li>Format the new disk with a single partition.</li>
<li>Set up LUKS on that partition.</li>
<li>Back up the LUKS header data.</li>
</ol>
<li>Home setup, phase 1</li>
<ol>
<li>Copy everything in /home to the new partition.</li>
<li>Update /etc/crypttab.</li>
<li>Update /etc/fstab using a test directory.</li>
<li>Reboot.</li>
<li>Test.</li>
</ol>
<li>Home setup, phase 2</li>
<ol>
<li>Update /etc/fstab using the /home directory.</li>
<li>Reboot.</li>
<li>Test.</li>
</ol>
</ol>
<h3>
LUKS setup</h3>
<div>
Wipe the new disk and make a single partition. For the remainder of this post, I'll be assuming that the partition is /dev/sdx1.</div>
</div>
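<div>
If you need a starting point, something like this with parted should do it (this assumes the new disk shows up as /dev/sdx; double-check with "lsblk" first, since these commands are destructive):</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sudo parted /dev/sdx mklabel gpt;</span></blockquote>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sudo parted -a optimal /dev/sdx mkpart primary 0% 100%;</span></blockquote>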
<div>
<br /></div>
<div>
Install "cryptsetup".</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sudo apt install cryptsetup; </span></blockquote>
<div>
Set up LUKS on the partition. You'll need to give it a passphrase. I recommend something that's easy to type, like a series of four random words (but you do you). You'll have to type this passphrase every time you boot your machine.</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sudo cryptsetup --verify-passphrase luksFormat /dev/sdx1;</span></blockquote>
<div>
Once that's done, you can add more passphrases (a LUKS1 header has 8 key slots in total, so up to 7 more). This may be helpful if you want other people to be able to access the disk, or if you just want some backups, just in case. If there are multiple passphrases, any one of them will work fine; you don't need to have multiple on hand.</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sudo cryptsetup --verify-passphrase luksAddKey /dev/sdx1;</span></blockquote>
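<div>
To see which key slots are in use (or to confirm that a new passphrase actually took), you can dump the header metadata:</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sudo cryptsetup luksDump /dev/sdx1;</span></blockquote>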
<div>
The next step is to "open" the partition. The last argument ("encrypted-home") is the name to use for the decrypted mapping that will appear under "/dev/mapper".</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sudo cryptsetup luksOpen /dev/sdx1 encrypted-home;</span></blockquote>
<div>
At this point, everything is set up and ready. Confirm that with the "status" command.</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sudo cryptsetup status encrypted-home;</span></blockquote>
<div>
Back up the LUKS header data. If this information gets corrupted on the disk, then there is no way to recover your data. Note that if you recover data using the header backup, then the passphrases will be the ones in the header backup, not whatever was on the disk at the time of the recovery.</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sudo cryptsetup luksHeaderBackup /dev/sdx1 --header-backup-file /root/luks.encrypted-home.header;</span></blockquote>
<div>
I put mine in the /root folder (which will not be on the encrypted home partition), and I also backed it up to Google Drive. Remember, if you add, change, or delete passphrases, you'll want to make another backup (otherwise, those changes won't be present during a restoration operation).</div>
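<div>
For the record, restoring is the mirror-image command (shown here with the same backup file path as above):</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sudo cryptsetup luksHeaderRestore /dev/sdx1 --header-backup-file /root/luks.encrypted-home.header;</span></blockquote>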
<div>
<br /></div>
<div>
If you're really hardcore, fill up the partition so that no part of it looks special. Remember, the whole point of encryption is that whatever you write ends up looking random on the underlying disk, so writing a bunch of zeros through the mapping with "dd" will do the trick:</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sudo dd if=/dev/zero of=/dev/mapper/encrypted-home;</span></blockquote>
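<div>
The plain "dd" invocation above works but is slow and silent; a larger block size and progress reporting make it more bearable (it will end with a "No space left on device" error when the mapped partition is full, which is expected):</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sudo dd if=/dev/zero of=/dev/mapper/encrypted-home bs=1M status=progress;</span></blockquote>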
<div>
Before you can do anything with it, you'll need to format the partition. I used EXT4 because everything else on this machine is EXT4.</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sudo mkfs.ext4 /dev/mapper/encrypted-home;</span></blockquote>
<h3>
Home setup, phase 1</h3>
<div>
Once the LUKS partition is all set up, the next set of steps is just a careful copy operation, tweaking a couple /etc files, and verifying that everything worked.</div>
<div>
<br /></div>
<div>
The safest thing to do would be to switch to a live CD here so that you're guaranteed to not be messing with your /home directory, but I just logged out of my window manager and did the next set of steps in the ctrl+alt+f2 terminal. Again, you do you.</div>
<div>
<br /></div>
<div>
Mount the encrypted home directory somewhere where we can access it.</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sudo mkdir /mnt/encrypted-home;</span> </blockquote>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sudo mount /dev/mapper/encrypted-home /mnt/encrypted-home;</span></blockquote>
<div>
Copy over everything in /home. This could take a while.</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sudo cp -a /home/. /mnt/encrypted-home/;</span></blockquote>
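<div>
If you'd rather use something you can safely re-run if it gets interrupted, rsync should work just as well (the extra flags preserve hard links, ACLs, and extended attributes):</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sudo rsync -aHAX /home/. /mnt/encrypted-home/;</span></blockquote>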
<div>
Make sure that /mnt/encrypted-home contains the home folders of your users.</div>
<div>
<br /></div>
<div>
Set up /etc/crypttab. The format is:</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">${/dev/mapper name} UUID="${disk uuid}" none luks</span></blockquote>
<div>
In our case, the /dev/mapper name is going to be "encrypted-home". To find the UUID, run:</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sudo blkid /dev/sdx1;</span></blockquote>
<div>
So, in my particular case, /etc/crypttab looks like:</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">encrypted-home UUID="5e01cb97-ceed-40da-aec4-5f75b025ed4a" none luks</span></blockquote>
<div>
Finally, tell /etc/fstab to mount the partition to our /mnt/encrypted-home directory. We don't want to clobber /home until we know that everything works.</div>
<div>
<br /></div>
<div>
Update /etc/fstab and add:</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">/dev/mapper/encrypted-home /mnt/encrypted-home ext4 defaults 0 0</span></blockquote>
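<div>
Before rebooting, you can have util-linux sanity-check the new /etc/fstab entry (this won't catch every possible problem, but it's a cheap test):</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sudo findmnt --verify;</span></blockquote>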
<div>
Reboot your machine.</div>
<div>
<br /></div>
<div>
When it comes back up, it should ask you for the passphrase for the encrypted-home partition. Give it one of the passphrases that you set up.</div>
<div>
<br /></div>
<div>
Log in and check /mnt/encrypted-home. As long as everything's in there that's supposed to be in there (that is, all of your /home data), then phase 1 is complete.</div>
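<div>
If you want more reassurance than eyeballing the directory listing, a quick spot check with diff works; files that have changed since the copy (browser caches, log files, and so on) will show up as noise, so don't panic over a handful of differences:</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sudo diff -rq /home /mnt/encrypted-home | head -n 20;</span></blockquote>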
<h3>
Home setup, phase 2</h3>
<div>
Now that we know everything works, the next step is to clean up your actual /home directory and then tell /etc/fstab to mount /dev/mapper/encrypted-home at /home.</div>
<div>
<br /></div>
<div>
I didn't want to completely purge my /home directory; instead, I deleted everything large and/or personal in there (leaving my bash profile, some app settings, etc.). This way, if my new disk failed or if I wanted to use my computer without it for some reason, then I'd at least have normal, functioning user accounts. Again, you do you. I've screwed up enough stuff in my time to appreciate having a reasonably nice fallback scenario ready to go.</div>
<div>
<br /></div>
<div>
Update /etc/fstab and change the /dev/mapper/encrypted-home line to mount to /home.</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">/dev/mapper/encrypted-home /home ext4 defaults 0 0</span></blockquote>
<div>
Reboot.</div>
<div>
<br /></div>
<div>
<div>
When it comes back up, it should ask you for the passphrase for the encrypted-home partition. Give it one of the passphrases that you set up.</div>
<div>
<br /></div>
</div>
<div>
Log in. You should now be using an encrypted home directory. Yay.</div>
<div>
<br /></div>
<div>
To confirm, check your mountpoints:</div>
<div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">mount | grep /home</span></blockquote>
<div>
You should see something like:</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">/dev/mapper/encrypted-home on /home type ext4 (rw,relatime,data=ordered)</span></blockquote>
</div>
<div>
Now that everything's working, you can get rid of "/mnt/encrypted-home"; we're not using it anymore.</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">sudo rmdir /mnt/encrypted-home;</span></blockquote>
<div>
<br /></div>
Anonymousnoreply@blogger.com0tag:blogger.com,1999:blog-1498439248860252027.post-84823268683763651392018-03-23T10:37:00.002-04:002018-03-23T10:38:17.354-04:00Fix for when Chrome stops making screen updatesMy desktop environment is KDE, and I use Chrome for my browser. At any given time, I'll have 2-5 windows with 10-40 tabs each. However, every once in a while (usually once every week or so), the rendering of Chrome will freeze. That is, the entire window will remain frozen (visually), but clicks and everything else go through fine (you just can't see the results). Changing my window focus (switching to a different window, opening the "K" menu, etc.) usually causes a single render, but that doesn't help with actually interacting with a (visually) frozen Chrome window.<br />
<br />
Closing Chrome and opening it back up works, but that's really inconvenient.<br />
<br />
I'm still not sure why this happens, but I do have a quick (and convenient) fix: change your compositor's rendering backend (and then change it back). Why does this work? I'm not sure, but since it's obviously a rendering problem, making a rendering change makes sense.<br />
<br />
Step by step:<br />
<br />
<ol>
<li>Open "System Settings".</li>
<li>Open "Display and Monitor".</li>
<li>Go to "Compositor".</li>
<li>Change "Rendering backend" from whatever it is to something else (usually "OpenGL 3.1" to "OpenGL 2.0" or <i>vice versa</i>).</li>
<li>Click "Apply".</li>
</ol>
<div>
This always solves the problem for me. You can even switch it back to the original value after you hit "Apply" the first time.</div>
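<div>
Presumably, suspending and resuming compositing entirely would also force a fresh render; the default shortcut for that is alt+shift+f12, and the command-line equivalent (assuming KWin's usual D-Bus interface is available) would be something like:</div>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">qdbus org.kde.KWin /Compositor suspend;</span></blockquote>
<blockquote class="tr_bq">
<span style="font-family: Courier New, Courier, monospace; font-size: x-small;">qdbus org.kde.KWin /Compositor resume;</span></blockquote>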
<div>
<br /></div>
Anonymousnoreply@blogger.com0