Wednesday, October 19, 2022

Watch out for SNI when using an nginx reverse proxy

From time to time, I'll have a use case where some box needs to talk to some website that it can't reach (because of networking issues), and the easiest thing to do is to throw an nginx reverse proxy on a network that it can reach (such that the reverse proxy can reach both).

The whole shtick of a reverse proxy is that you can access the reverse proxy directly and it'll forward the request on to the appropriate destination, more or less masquerading as that destination.  This is in contrast with a normal HTTP proxy, which would be configured separately (if supported by whatever tool you're trying to use).  Sometimes a normal HTTP proxy is the best tool for the job, but sometimes you can cheat with a tweak to /etc/hosts and a reverse proxy, and nobody needs to know what happened.

Here, we're focused on the reverse proxy.

In this case, we have the following scenario:

  1. Box 1 wants to connect to site1.example.com.
  2. Box 1 cannot reach site1.example.com.
To cheat using a reverse proxy, we need Box 2, which:
  1. Can be reached by Box 1.
  2. Can reach site1.example.com.
To set up the whole reverse proxy thing, we need to:
  1. Set up nginx on Box 2 to listen on port 443 (HTTPS) and reverse proxy to site1.example.com.
  2. Update /etc/hosts on Box 1 so that site1.example.com points to Box 2's IP address.
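
For example, assuming Box 2's IP address is 192.0.2.10 (a made-up address), the /etc/hosts entry on Box 1 would look something like:

192.0.2.10    site1.example.com    # Box 2 (example IP)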

At first, I was seeing this error message in the reverse proxy's nginx error.log:

connect() to XXXXXX:443 failed (13: Permission denied) while connecting to upstream, client: XXXXXX, server: site1.example.com, request: "GET / HTTP/1.1"

"Permission denied" isn't great, and that told me that it was something OS-related.

Of course, it was an SELinux thing (in /var/log/messages):

SELinux is preventing /usr/sbin/nginx from name_connect access on the tcp_socket port 443.

The workaround was:

setsebool -P nis_enabled 1

This also was suggested by the logs, but it didn't seem to matter:

setsebool -P httpd_can_network_connect 1

After fixing that, I was seeing:

SSL_do_handshake() failed (SSL: error:14094410:SSL routines:ssl3_read_bytes:sslv3 alert handshake failure:SSL alert number 40) while SSL handshaking to upstream, client: XXXXXX, server: site1.example.com, request: "GET / HTTP/1.1"

After tcpdump-ing the upstream traffic for requests coming through Box 1 and comparing it against another box that could talk to site1.example.com directly, it was clear that the reverse proxy was not sending SNI in its upstream requests (SNI is a TLS extension that passes the host name in plaintext so that proxies and load balancers can properly route name-based requests).

It took way too long for me to find the nginx setting to enable it (I don't know why it's disabled by default), but it's:

proxy_ssl_server_name on;

Anyway, the final nginx config for the reverse proxy on Box 2 was:

server {
  listen 443 ssl;
  server_name site1.example.com;

  ssl_certificate /etc/nginx/ssl/server.crt;
  ssl_certificate_key /etc/nginx/ssl/server.key;
  ssl_protocols TLSv1.2;
 
  location / {
    proxy_pass https://site1.example.com;
    proxy_ssl_session_reuse on;
    proxy_ssl_server_name on;
  }
}

As far as Box 1 was concerned, it could connect to site1.example.com with only a small tweak to /etc/hosts.

Wednesday, October 5, 2022

Speed up PNG encoding in Go with NRGBA images

After migrating my application from Google App Engine to Google Cloud Run, I suddenly had a use case for optimizing CPU utilization.

In my analysis of my most CPU-intensive workloads, it turned out that the majority of the time was spent encoding PNG files.

tl;dr Use image.NRGBA when you intend to encode a PNG file.

(For reference, this particular application has a Google Maps overlay that synthesizes data from other sources into tiles to be rendered on the map.  The main synchronization job runs nightly and attempts to build or download new tiles for the various layers based on data from various ArcGIS systems.)

Looking at my code, I couldn't really reduce the number of calls to png.Encode, but that encoder really looked inefficient.  I deleted the callgrind files (sorry), but basically, half of the CPU time in png.Encode was around memory operations and some runtime calls.

I started looking around for maybe some options to pass to the encoder or a more purpose-built implementation.  I ended up finding a package that mentioned a speedup, but only for NRGBA images.  However, that package looked fairly unused, and I wasn't about to hand all of my image processing over to something with 1 commit and no users.

This got me thinking, though: what is NRGBA?

It turns out that there are (at least) two ways of thinking about the whole alpha channel thing in images:

  1. In RGBA, each of the red, green, and blue channels has already been premultiplied by the alpha channel, such that the value of, for example, R can range from 0 to A, but no higher.
  2. In NRGBA, each of the red, green, and blue channels has its original value, and the alpha channel merely represents the opacity of the pixel in general.

For my human mind, using various tools and software over the years, when I think of "RGBA", I think of "one channel each for red, green, and blue, and one channel for the opacity of the pixel".  So what this means is that I'm thinking of "NRGBA" (for non-premultiplied RGBA).

(Apparently there are good use cases for both, and when compositing, at some point you'll have to multiply by the alpha value, so "RGBA" already has that done for you.)
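
As a concrete example (the numbers here are mine, just for illustration), here's the same 50%-transparent pure-red pixel in Go's two representations:

package main

import (
    "fmt"
    "image/color"
)

func main() {
    // Non-premultiplied: the red channel keeps its full value, and alpha
    // alone says "this pixel is about 50% opaque".
    n := color.NRGBA{R: 255, G: 0, B: 0, A: 128}

    // Premultiplied: the red channel has already been scaled by alpha,
    // so R can never exceed A.
    p := color.RGBA{R: 128, G: 0, B: 0, A: 128}

    // Both describe (roughly) the same pixel; converting the NRGBA value
    // through the color model shows the premultiplication happening.
    fmt.Println(color.RGBAModel.Convert(n)) // roughly {128 0 0 128}
    fmt.Println(p)
}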

Okay, whatever, so what does this have to do with CPU optimization?

In Go, the png.Encode function is optimized for NRGBA images.  There's a tiny little hint about this in the comment for the function:

Any Image may be encoded, but images that are not image.NRGBA might be encoded lossily.

This is corroborated by the PNG rationale document, which explains that

PNG uses "unassociated" or "non-premultiplied" alpha so that images with separate transparency masks can be stored losslessly.

If you want to have the best PNG encoding experience, then you should encode images that use NRGBA already.  In fact, if you open up the code, you'll see that it will convert the image to NRGBA if it's not already in that format.

Coming back to my callgrind analysis, this is where all that CPU time was spent: converting an RGBA image to an NRGBA image.  I certainly thought that it was strange how much work was being done creating a simple PNG file from a mostly-transparent map tile.

Why did I even have RGBA images?  Well, my tiling API has to composite tiles from other systems into a single PNG file, so I simply created that new image with image.NewRGBA.  And why that function?  Because as I mentioned before, I figured "RGBA" meant "RGB with an alpha channel", which is what I wanted so that it would support transparency.  It never occurred to me that "RGBA" was some weird encoding scheme for pixels in contrast to another encoding scheme called "NRGBA"; my use cases had never had me make such a distinction.

Anyway, after switching a few image.NewRGBA calls to image.NewNRGBA (and literally that was it; no other code changed), my code was way more efficient, cutting down on CPU utilization by something like 50-70%.  Those RGBA-to-NRGBA conversions really hurt.
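
Here's a minimal sketch (not my actual tiling code) of what that change looks like; the only interesting part is the constructor, and png.Encode stays exactly the same:

package main

import (
    "image"
    "image/color"
    "image/draw"
    "image/png"
    "os"
)

func main() {
    // Before: image.NewRGBA(...) meant every png.Encode call paid for an
    // RGBA -> NRGBA conversion.  After: build the tile as NRGBA directly.
    tile := image.NewNRGBA(image.Rect(0, 0, 256, 256))

    // Composite something onto the (mostly transparent) tile.
    draw.Draw(tile, image.Rect(10, 10, 50, 50),
        &image.Uniform{color.NRGBA{R: 255, A: 255}}, image.Point{}, draw.Over)

    f, err := os.Create("tile.png")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // png.Encode takes the NRGBA image as-is, with no conversion step.
    if err := png.Encode(f, tile); err != nil {
        panic(err)
    }
}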

100% cost savings switching from App Engine to Cloud Run... wild

I have written numerous times about how much I like Google's App Engine platform and about how I've tried to port as many of my applications to it as possible.  The idea of App Engine is still glorious, but I  have now been converted to Cloud Run.

The primary selling point?  It has literally reduced my application's bill to $0.  That's a 100% reduction in cost from App Engine to Cloud Run.  It's absolutely wild, and it's now my new favorite thing.

Pour one out for App Engine because Cloud Run is here to stay.

Aside: a (brief) history of App Engine

Before I explain how Cloud Run manages to make my application free to run, I need to first run you through a brief history of App Engine, why it was great, why it became less great, and how Cloud Run could eat its lunch like that.

App Engine v1

Google App Engine entered the market in 2011 as a one-stop solution to host infinitely scalable applications.  Prior to App Engine, if you wanted to host a web application, you had to either commit to a weird framework and go with that framework's cloud solution, or you had to spin up a bunch of virtual machines (or god help you Lambda functions) to handle your web traffic, your database, your file systems, maybe tie into a CDN, and also set up Let's Encrypt or something similar to handle your TLS certificates.

From the perspective of an application owner or application developer, that's a lot of work that doesn't directly deal with application stuff and ends up just wasting tons of everyone's time.

App Engine was Google's answer to that, something that we would now call "serverless".

The premise was simple: upload a zip file with your application, some YAML files with some metadata (concerning scaling, performance, etc.), and App Engine would take care of all the annoying IT stuff for you.  In fact, they tried to hide as much of it from you as possible—they wanted you to focus on the application logic, not the nitty gritty backend details.

(This allowed Google to do all kinds of tricks in the background to make things better and faster, since the implementation details of their infrastructure were generally out of bounds for your application.)

App Engine's killer feature was that it bundled a whole bunch of services together, things that you'd normally have to set up yourself and pay extra for.  This included, but was not limited to:

  1. The web hosting basics:
    1. File hosting
    2. A CDN service
    3. An execution environment in your language of choice (Java, Python, PHP, etc.)
  2. A cheap No-SQL database to store all your data
  3. A memcached memory store
  4. A scheduler that'll hit any of your web endpoints whenever you want
  5. A task queue for background tasks
  6. A service that would accept e-mail and transform it into an HTTP POST request

All of this came with a generous free tier (truly free) where you could try stuff out and generally host small applications with no cost at all.  If your application were to run up against the free-tier limits, it would just halt those services until the next day when all of the counters reset and your daily limits restarted.

(Did I mention e-mail?  I had so many use cases that dealt with accepting e-mail from other systems and taking some action.  This was a godsend.)

The free tier gave you 28 hours of instance time (that is, 28 hours of a single instance of the lowest class).  For trying stuff out, that was plenty, especially when your application only had a few users and went unused for hours at a time.

(Yes, if you upgraded your instance type one notch that became 14 hours free, but still.)

To use all of the bonus features, you generally had to link against (more or less) the App Engine SDK, which took care of all of the details behind the scenes.

This was everything that you needed to build a scalable web application, and it did its job well.

App Engine v2

Years went by, and Google made some big changes to App Engine.  To me, these changes went against the great promise of App Engine, but they make sense when looking at Cloud Run as the App Engine successor.

App Engine v2 added support for newer versions of the languages that App Engine v1 had supported (which was good), but it also took away a lot of the one-stop-shop power (which was bad).  The stated reason was to remove a lot of the Google Cloud-specific stuff and make the platform a bit more open and compatible with other clouds.

(While that's generally good, the magic of App Engine was that everything was built in, for free.)

Now, an App Engine application could no longer have free memcached data; instead, you could pay for a Memory Store option.  For large applications, it probably didn't matter, but for small ones that cost basically nothing to run, this made memcached untenable.

Similarly, the e-mail service was discontinued, and you were encouraged to move to Twilio SendGrid, which could do something similar.  SendGrid was nice enough to give Google Cloud customers a free tier that just about made up for what was lost.  The big problem was that all of the built-in tools were gone.  Previously, the App Engine local development console had a place where you could type in an e-mail and send it to your application; now you had to write your own tools.

The scheduler and task queue systems were promoted out of App Engine and into Google Cloud, but the App Engine tooling continued to support them, so they required only minimal changes on the application side.

The cheap No-SQL database, Datastore, was also promoted out of App Engine, and there were no changes to make whatsoever.  Yay.

The App Engine SDK was removed, and an App Engine application was simply another application in the cloud, with annoying dependencies and third-party integrations.

(Yes, technically Google added back support for the App Engine SDK years later, but meh.)

Overall, for people who used App Engine exclusively for purpose-built App Engine things, v2 was a major step down from v1.  Yes, it moved App Engine toward a more standards-based approach, but there weren't standards around half the stuff that App Engine was doing.  In the push to make it more open and in line with other clouds, it lost a lot of its power (because other clouds weren't doing what App Engine was doing, which is why I was using Google's App Engine in the first place).

So, why Cloud Run?

Cloud Run picks up where App Engine v2 left off.  With all of the App Engine-specific things removed, the next logical jump was to give Google a container image instead of a zip file and let it do its thing.  App Engine had already experimented with this with the "Flexible" environment, but its pricing was not favorable compared to the "Standard" offering for the kinds of projects that I had.

From a build perspective, Cloud Run gives you a bit more flexibility than App Engine.  You can give Google a Dockerfile and have it take care of all the compiling and such, or you can do most of that locally and give Google a much simpler Dockerfile with your pre-compiled application files.  I don't want to get too deep into it, but you have options.

Cloud Run obviously runs your application as a container.  Did App Engine?  Maybe?  Maybe not?  I figured that it would have, but that's for Google to decide (the magic of App Engine).

But here's the thing: App Engine was billed as an instance, not a container.

Billing: App Engine vs. Cloud Run

I'm going to focus on the big thing, which is instance versus container billing.  Everything else is comparable (data traffic, etc.).

App Engine billed you on instance time (wall time), normalized to units of the smallest instance class that they support.  To run your application at all, there was a 15-minute minimum cost.  If your application wasn't doing anything for 15 minutes, then App Engine would shut it down and you wouldn't be billed for that time.  But for all other time, anytime that there was at least one instance running, you were billed for that instance (more or less as if you had spun up a VM yourself for the time that it was up).

For a lightweight web application, this kind of billing isn't ideal because most of the work that your application does is a little bit of processing centered around database calls.  Most of the time the application is either not doing anything at all (no requests are coming in) or it's waiting on a response from a database.  The majority of the instance CPU goes unused.

Cloud Run bills on CPU time, that is, the time that the CPU is actually used.  So, for that same application, if it's sitting around doing nothing or waiting on a database response, then it's not being billed.  And there's no 15-minute minimum or anything.  For example, if your application does some request parsing, permissions checks, etc., and then sits around waiting for 1 second for a database query to return, then you'll be billed for something like 0.0001 seconds of CPU time, which is great (because your application did very little actual CPU work in the second it took for the request to complete).

Cloud Run gives you basically 48 hours' worth of vCPU-seconds per month (that is, 48 hours of a single vCPU churning at 100%).  So for an application that sits around doing nothing most of the time, the odds are that you'll never have to pay for any CPU utilization.  For a daily average, this comes out to about 1.5 hours of CPU time free per day.  Yes, you also pay for the amount of memory that your app uses while it's handling requests, but my app uses like 70MB of memory because it's a web application.

(For me, my app has a nightly job that eats up a bunch of CPU, and then it does just about nothing all day.)

Overall, for me, this is what moved me to Cloud Run.  I'm billed for the resources that I'm actually using when I'm actually using them, and the specifics around instances and such are no longer my concern.

This also means that I can use a more microservice-based approach, since I'm billed on what my application does, not what resources Google spins up in the background to support it.  With App Engine, running separate "services" would have been too costly (each with its own instances, etc.).  With Cloud Run, it's perfect, and I can experiment without needing to worry about massive cost changes.

What's the catch?

App Engine's deployment tools are top-notch, even in App Engine v2.  When you deploy your application (once you figure out all the right flags), it'll get your application up and running, deploy your cron jobs, and ensure that your Datastore indexes are pushed.

With Cloud Run, you don't get any of the extra stuff.

I've put together a list of things that you'll have to do to migrate your application from App Engine to Cloud Run, sorted from easiest to hardest.

Easy: Deploy your application

This is essentially gcloud run deploy with some options, but it's extremely simple and straightforward.  Migrating from App Engine will take like 2 seconds.

Easy: Get rid of app.yaml

You won't need app.yaml anymore, so remove it.  If you had environment variables in there, move them to their own file and update your gcloud run deploy command to include --env-vars-file.

In Cloud Run, everything is HTTPS, so you don't have to worry about insecure HTTP endpoints.

There are no "warmup" endpoints or anything, so just make sure that your container does what it needs to when it starts because as soon as it can receive an HTTP request, it will.

If you had configured a custom instance class, you may need to tweak your Cloud Run service with parameters that better match your old settings.  I had only used an F2 class because the F1 class had a low limit on the number of concurrent requests that it could receive; Cloud Run has no such problem, and you can configure the maximum number of concurrent requests much, much higher.

Similarly, all of the scaling parameters are... different, and you can deal with those in Cloud Run as necessary.  Let your app run for a while and see how Cloud Run behaves; it's definitely a different game from App Engine.

Easy: Deploy your task queues

Starting with App Engine v2, you already had to deploy your own task queues (App Engine v1 used to take care of that for you), so you don't have to change a single thing with deployment.  Note that you will have to change how those task queues are used; see below.

Easy: Deploy your Datastore indexes

This is essentially gcloud datastore indexes create index.yaml with some options, and you should already have index.yaml sitting around, so just add this to your deployment code.

Easy: Rip out all "/_ah" endpoints

App Engine liked to have you put special App Engine things under "/_ah", so I stashed all of my cron job and task queue endpoints in there.  Ordinarily, this would be fine, except that Cloud Run quietly treats "/_ah" as special and will fail all requests to any endpoints under it with a permission-denied error.  I wasted way too much time trying to figure out what was going on before I realized that it was Cloud Run doing something sneaky and undocumented in the background.

Move any endpoints under "/_ah" to literally any other path and update any references to them.

Easy: Create a Dockerfile

Create a Dockerfile that's as simple or as complex as you want.  I wanted a minimal set of changes from my current workflow, so I had my Dockerfile just include all of the binaries that get compiled locally as well as any static files.

If you want, you can set up a multi-stage Dockerfile that has separate, short-lived containers for building and compiling and ultimately ends up with a single, small container for your application.  I'll eventually get there, but actually deploying into Cloud Run took precedence for me over neat CI/CD stuff.

Easy: Detect Cloud Run in production

I have some simple code that detects if it's in Google Cloud or not so that it does smart production-related things if it is.  In particular, logging in production is JSON-based so that StackDriver can pick it up properly.

The old "GAE_*" and "GOOGLE_*" environment variables are gone; you'll only get "K_SERVICE" and friends to tell you a tiny amount of information about your Cloud Run environment.  You'll basically get back the service name and that's it.

If you want your project ID or region, you'll have to pull those from the "computeMetadata/v1" endpoints.  It's super easy; it's just something that you'll have to do.

For more information on using the "computeMetadata/v1" endpoints, see this guide.

For more information about the environment variables available in Cloud Run, see this document.
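
For reference, here's roughly how I pull those values at startup (a sketch; the package and helper names are mine, and error handling is minimal):

package metadata

import (
    "io"
    "net/http"
)

// metadataGet queries the metadata server that Cloud Run exposes to your
// container.  The Metadata-Flavor header is required.
func metadataGet(path string) (string, error) {
    req, err := http.NewRequest("GET",
        "http://metadata.google.internal/computeMetadata/v1/"+path, nil)
    if err != nil {
        return "", err
    }
    req.Header.Set("Metadata-Flavor", "Google")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    b, err := io.ReadAll(resp.Body)
    return string(b), err
}

// ProjectID returns the current project's ID, e.g. "my-project".
func ProjectID() (string, error) {
    return metadataGet("project/project-id")
}

// Region returns something like "projects/123456789/regions/us-central1",
// so you'll likely want to trim everything before the last slash.
func Region() (string, error) {
    return metadataGet("instance/region")
}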

Easy: "Unlink" your Datastore

Way back in the day, Datastore was merely a part of App Engine.  At some point, Google spun it out into its own thing, but for older projects, it's still tied to App Engine.  Make sure that you go to your Datastore's admin screen in Google Cloud Console and unlink it from your App Engine project.

This has no practical effects other than:

  1. You can no longer put your Datastore in read-only mode.
  2. You can disable your App Engine project without also disabling your Datastore.

It may take 10-20 minutes for the operation to complete, so just plan on doing this early so it doesn't get in your way.

To learn more about unlinking Datastore from App Engine, see this article.

Medium: Change your task queue code

If you were using App Engine for task queues, then you were using special App Engine tasks.  Since you're not using App Engine anymore, you have to use the more generic HTTP tasks.

In order to do this, you'll need to know the URL for your Cloud Run service.  You can technically cheat and use the information in the HTTP request to generally get a URL that (should) work, but I opted to use the "${region}-run.googleapis.com/apis/serving.knative.dev/v1/namespaces/${project-id}/services/${service-name}" endpoint, which returns the Cloud Run URL, among other things.  I grabbed that value when my container started and used it anytime I had to create a task.

(And yes, you'll need to get an access token from another "computeMetadata/v1" endpoint in order to access that API.)

For more information on that endpoint, see this reference.
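
Here's roughly what that lookup looks like in Go (a sketch, not my exact code; it reuses the metadataGet helper from the earlier sketch, needs "encoding/json" added to that file's imports, and assumes the service URL comes back in the response's status.url field):

// serviceURL looks up the public URL of a Cloud Run service via the
// regional endpoint described above.  region, project, and service are,
// e.g., "us-central1", "my-project", and "my-service".
func serviceURL(region, project, service string) (string, error) {
    // The metadata server hands out an OAuth2 access token for the
    // service account that the Cloud Run service runs as.
    tokenJSON, err := metadataGet("instance/service-accounts/default/token")
    if err != nil {
        return "", err
    }
    var tok struct {
        AccessToken string `json:"access_token"`
    }
    if err := json.Unmarshal([]byte(tokenJSON), &tok); err != nil {
        return "", err
    }

    url := "https://" + region + "-run.googleapis.com/apis/serving.knative.dev/v1/namespaces/" +
        project + "/services/" + service
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        return "", err
    }
    req.Header.Set("Authorization", "Bearer "+tok.AccessToken)
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    // The interesting part of the response is status.url.
    var svc struct {
        Status struct {
            URL string `json:"url"`
        } `json:"status"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&svc); err != nil {
        return "", err
    }
    return svc.Status.URL, nil
}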

Once you have the Cloud Run URL, the main difference between an App Engine task and an HTTP task is that App Engine tasks only have the path while the HTTP task has the full URL to hit.  I had to change up my internals a bit so that the application knew its base URL, but it wasn't much work.

To learn more about HTTP tasks, see this document.

Medium: Build a custom tool to deploy cron jobs

App Engine used to take care of all the cron job stuff for you, but Cloud Run does not.  At the time of this writing, Cloud Run has a beta command that'll create/update cron jobs from a YAML file, but it's not GA yet.

I built a simple tool that calls gcloud scheduler jobs list, parses the JSON output, and compares that to a custom YAML file of mine that describes my cron jobs.  Because these cron jobs are not tied to an App Engine project, they can reference any HTTP endpoint, including a particular Cloud Run service.  My YAML file has an option for the Cloud Run service name, and my tool looks up the Cloud Run URL for that service and appends the path onto it.

It's not a lot of work, but it's work that I had to do in order to make my workflows make sense again.

Also note that the human-readable cron syntax ("every 24 hours") is gone, and you'll need to use the standard "minute hour day-of-month month day-of-week" syntax that you're used to from most other cron-related things (for example, "0 3 * * *" for 3:00 AM every day).  You can also specify which time zone each cron job should use for interpreting that syntax.

To learn more about the cron syntax, see this document.

Medium: Update your custom domains

At minimum, you will now have a completely different URL for your application.  If you're using custom domains, you can use Cloud Run's custom domains much like you would use App Engine's custom domains.

However, as far as Google is concerned, this is basically removing a custom domain from App Engine and creating a custom domain in Cloud Run.  This means that it'll take anywhere from 20-40 minutes for HTTPS traffic to your domain to work after you make the switch.

Try to schedule this for when you have minimal traffic.  As far as I can tell, there's nothing that you can do to speed it up.  You may wish to consider using a load balancer instead (and giving the load balancer the TLS certificate responsibilities), but I didn't want to pay extra for a service that's basically free and only hurts me when I need to make domain changes.

Medium: Enable compression

Because App Engine ran behind Google's CDN, compression was handled by default.  Cloud Run does no such thing, so if you want compression, you'll have to do it yourself.  (Yes, you could pay for a load balancer, but that doesn't reduce your costs by 100%.)  Most HTTP frameworks have an option for it, and if you want to roll your own, it's fairly straightforward.

I ended up going with https://github.com/NYTimes/gziphandler.  You can configure a minimum response size before compressing, and if you don't want to compress everything (for example, image files are already compressed internally these days), you have to provide it with a list of MIME types that should be compressed.
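
For reference, wiring it up is basically a one-line wrap around your handler (this is a trimmed-down sketch, not my exact setup):

package main

import (
    "net/http"

    "github.com/NYTimes/gziphandler"
)

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Type", "text/html; charset=utf-8")
        w.Write([]byte("<html><body>hello</body></html>"))
    })

    // GzipHandler wraps any http.Handler and compresses responses for
    // clients that send Accept-Encoding: gzip.  The package also lets you
    // configure a minimum response size and the list of compressible MIME
    // types, as mentioned above.
    http.ListenAndServe(":8080", gziphandler.GzipHandler(mux))
}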

Hard: Enable caching

Because App Engine ran behind Google's CDN, caching was handled by default.  If you set a Cache-Control header, the CDN would take care of doing the job of an HTTP cache.  If you did it right, you could reduce the traffic to your application significantly.  Cloud Run does no such thing, and your two practical options are (1) pay for a load balancer (which does not reduce your costs by 100%) or (2) roll your own in some way.

There's nothing stopping you from dropping in a Cloud Run service that is basically an HTTP cache that sits in front of your actual service, but I didn't go that route.  I decided to take advantage of the fact that my application knew more about caching than an external cache ever could.

For example, my application knows that its static files can be cached forever because the moment that I replace the service with a new version, the cache will die with it and a new cache will be created.  For an external cache, you can't tell it, "hey, cache this until the backing service randomly goes away", so you have to set practical cache durations such as "10 minutes" (so that when you do make application changes, you can be reasonably assured that they'll get to your users fairly quickly).

I ended up writing an HTTP handler that parsed the Cache-Control header, cached things that needed to be cached, supported some extensions for the things that my application could know (such as "cache this file forever"), and served those results, so my main application code could still generally operate exactly as if there were a real HTTP cache in front of it.
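
To give a flavor of the approach, here's a heavily simplified sketch (not the handler I actually run): it only remembers successful GET responses that set a Cache-Control header, and it ignores max-age and expiry entirely:

package cache

import (
    "bytes"
    "net/http"
    "sync"
)

type entry struct {
    header http.Header
    body   []byte
}

// Middleware is a toy response cache: anything the wrapped handler marks
// with a Cache-Control header gets remembered (forever) and replayed.
func Middleware(next http.Handler) http.Handler {
    var (
        mu    sync.Mutex
        store = map[string]*entry{}
    )
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if r.Method != http.MethodGet {
            next.ServeHTTP(w, r)
            return
        }
        key := r.URL.String()

        mu.Lock()
        e := store[key]
        mu.Unlock()
        if e != nil {
            for k, v := range e.header {
                w.Header()[k] = v
            }
            w.Write(e.body)
            return
        }

        // Capture the response so we can decide whether to cache it.
        rec := &recorder{ResponseWriter: w, status: http.StatusOK}
        next.ServeHTTP(rec, r)

        if rec.status == http.StatusOK && rec.Header().Get("Cache-Control") != "" {
            mu.Lock()
            store[key] = &entry{header: rec.Header().Clone(), body: rec.buf.Bytes()}
            mu.Unlock()
        }
    })
}

// recorder passes the response through to the client while keeping a copy.
type recorder struct {
    http.ResponseWriter
    status int
    buf    bytes.Buffer
}

func (r *recorder) WriteHeader(status int) {
    r.status = status
    r.ResponseWriter.WriteHeader(status)
}

func (r *recorder) Write(b []byte) (int, error) {
    r.buf.Write(b)
    return r.ResponseWriter.Write(b)
}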

Extra: Consider using "alpine" instead of "scratch" for Docker

I use statically compiled binaries (written in Go) for all my stuff, and I wasted a lot of time using "scratch" as the basis for my new Cloud Run Dockerfile.

The tl;dr here is that the Google libraries really, really want to validate the certificates of the Google APIs, and "scratch" does not install any CA certificates or anything.  I was getting nasty errors about things not working, and it took me a long time to realize that Google's packages were upset about not being able to validate the certificates in their HTTPS calls.

If you're using "alpine", just add apk add --no-cache ca-certificates to your build.

Also, if you do any kind of time zone work (all of my application's users and such have a time zone, for example), you'll also want to include the timezone data (otherwise, your date/time calls will fail).

If you're using "alpine", just add apk add --no-cache tzdata to your build.  Also don't forget to set the system timezone to UTC via echo UTC > /etc/timezone.

Extra: You may need your project ID and region for tooling

Once running, your application can trivially find out what its project ID and region are from the "computeMetadata/v1" APIs, but I found that I needed to know these things in advance while running my deployment tools.  Your case may differ from mine.

I have multiple projects (for example, a production one and a staging one), and my tooling needed to know the project ID and region for whatever project it was building/deploying.  I added this to a config file that I could trivially replace with a copy from a directory of such config files.  I just wanted to mention that you might run into something similar.

Extra: Don't forget to disable App Engine

Once everything looks good, disable App Engine in your project.  Make sure that you unlinked your Datastore first, otherwise you'll also disable that when you disable App Engine.

Extra: Wait a month before celebrating

While App Engine's quotas reset every day, Cloud Run's reset every month.  This means that you won't necessarily be able to prove out your cost savings until you have spent an entire month running in Cloud Run.

After a few days, you should be able to eyeball the data and check the billing reports to get a sense of what your final costs will be, but remember: you get 48 hours of free CPU time before they start charging you, and if your application is CPU heavy, it might take a few days or weeks to burn through those free hours.  Even after going through the free tier, the cost of subsequent usage is quite reasonable.  However, the point is that you can't draw a trend line on costs based on the first few days of running.  Give it a full month to be sure.

Extra: Consider optimizing your code for CPU consumption

In an instance world like App Engine, as long as your application isn't using up all of the CPU and forcing App Engine to spawn more instances or otherwise have requests queue up, CPU performance isn't all that important.  Ultimately, it'll cost you the same regardless of whether you can optimize your code to shave off 20% of its CPU consumption.

With Cloud Run, you're charged, in essence, by CPU cycle, so you can drastically reduce your costs by optimizing your code.  And if you can keep your stuff efficient enough to stay within the free tier, then your application is basically free.  And that's wild.

If you're new to CPU optimization or haven't done it in a while, consider enabling something like "pprof" on your application and then capturing a 30 second workload.  You can view the results with a tool like KCachegrind to get a feel for where your application is spending most of its CPU time and see what you can do about cleaning that up.  Maybe it's optimizing some for loops; maybe it's caching; maybe it's using a slightly more efficient data structure for the work that you're doing.  Whatever it is, find it and reduce it.  (And start with the biggest things first.)
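
In Go, the standard library makes the pprof part painless.  Here's a minimal sketch (a standalone server just for illustration; in a real app you'd hang this off your existing mux or a separate debug port):

package main

import (
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
    // With the blank import above, the profiling endpoints are served
    // automatically by the default mux.
    http.ListenAndServe(":8080", nil)
}

// Then capture a 30-second CPU profile with something like:
//   go tool pprof "http://localhost:8080/debug/pprof/profile?seconds=30"
// (pprof can also emit callgrind output for tools like KCachegrind).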

Conclusion

Welcome to Cloud Run!  I hope your cost savings are as good as mine.

Edit (2022-10-19): added caveats about HTTP compression and caching.

Sunday, July 10, 2022

Publishing a Docker image for a GitHub repo

It's 2022, and if you're making a GitHub project, chances are that you'll need to publish a Docker image at some point.  These days, it's really easy with GitHub CI/CD and their "actions", which generally take care of all of the hard work.

I'm assuming that you already have a working GitHub CI/CD workflow for building whatever it is that you're building, and I'm only going to focus on the Docker-specific changes that need to be made.

You'll want to set up the following workflows:
  1. Upon "pull_request" for the "master" branch, build the Docker image (to make sure that the process works), but don't actually publish it.
  2. Upon "push" for the "master" branch, build the Docker image and publish it as "latest".
  3. Upon "release" for the "master" branch, build the Docker image and publish it with the release's tag.
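
The "on:" triggers for those three workflows look something like this (shown together here for brevity; split them across your workflow files as appropriate, and adjust the branch name to taste):

on:
  pull_request:
    branches: [master]
  push:
    branches: [master]
  release:
    types: [published]
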
Before you get started, you'll need to create a Dockerhub "access token" to use for your account.  You can do this under "Account Settings" → "Security" → "Access Tokens".

Workflow updates

Upon "pull_request"

This workflow happens every time a commit is made to a pull request.  While we want to ensure that our Docker image is built so that we know that the process works, we don't actually want to publish that image (at least I don't; you might have a use for it).

To build the Docker image, just add the following to your "steps" section:
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v2
- name: Build
  uses: docker/build-push-action@v3
  with:
    context: .
    push: false
    tags: YOUR_GROUP/YOUR_REPO:latest

Simple enough.

The "docker/setup-buildx-action" action does whatever magic needs to happen for Docker stuff to work in the pipeline, and the "docker/build-push-action" builds the image from your Dockerfile and pushes the image.  Because we're setting "push: false", it won't actually push.

Upon "merge" into "master"

This workflow happens every time a PR is merged into the "master" branch.  In this case, we want to do everything that we did for the "pull_request" case, but we also want to push the image.

The changes here are that we'll set "push: true" and also specify our Dockerhub username and password.

To build and push the Docker image, just add the following to your "steps" section:
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v2
- name: Login to DockerHub
  uses: docker/login-action@v2
  with:
    username: ${{ secrets.DOCKERHUB_USERNAME }}
    password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Build
  uses: docker/build-push-action@v3
  with:
    context: .
    push: true
    tags: YOUR_GROUP/YOUR_REPO:latest

Boom.

The new action "docker/login-action" logs into Dockerhub with your username and password, which is necessary to actually push the image.

Upon "release"

This workflow happens every time a release is created.  This is generally similar to the "merge" case, except that instead of using the "latest" tag, we'll be using the release's tag.

To build and push the Docker image, just add the following to your "steps" section:
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v2
- name: Login to DockerHub
  uses: docker/login-action@v2
  with:
    username: ${{ secrets.DOCKERHUB_USERNAME }}
    password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Build
  uses: docker/build-push-action@v3
  with:
    context: .
    push: true
    tags: YOUR_GROUP/YOUR_REPO:${{ github.event.release.tag_name }}

And that's it.  The "github.event.release.tag_name" variable holds the name of the Git tag, which is what we'll use for the Docker tag.

Monday, May 23, 2022

sed will blow away your symlinks by default

sed typically outputs to stdout, but sed -i allows you to edit a file “in place”.  However, under the hood, it actually creates a new file and then replaces the original file with the new file.  This means that sed replaces symlinks with normal files.  This is most likely not what you want.

However, there is a flag to pass to make it work the way that you’d expect:

--follow-symlinks

So, if you're using sed -i, then you probably also want to tack on --follow-symlinks, too.
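
For example (with a made-up substitution and file path):

sed -i --follow-symlinks 's/foo/bar/g' /etc/myapp/config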

Saturday, April 16, 2022

Golang, http.Client, and "too many open files"

I've been having an issue with my application for a while now, and I finally figured out what the problem was.  In this particular case, the application is a web app (so, think REST API written in Go), and one of its nightly routines is to synchronize a whole bunch of data with various third-party ArcGIS systems.  The application keeps a cache of the ArcGIS images, and this job updates them so they're only ever a day old.  This allows it to show map overlays even if the underlying ArcGIS systems are inaccessible (they're random third-party systems that are frequently down for maintenance).

So, imagine 10 threads constantly making HTTP requests for new map tile images; once a large enough batch is done, the cache is updated, and then the process repeats until the entire cache has been refreshed.

In production, I never noticed a direct problem, but there were times when an ArcGIS system would just completely freak out and start lying about not supporting pagination anymore or otherwise spewing weird errors (but again, it's a third-party system, so what can you do?).  In development, I would notice this particular endpoint failing after a while with a "dial" error of "too many open files".  Every time that I looked, though, everything seemed fine, and I just forgot about it.

This last time, though, I watched the main application's open sockets ("ss -anp | grep my-application"), and I noticed that the number of connections just kept increasing.  This reminded me of my old networking days, and it looked like the TCP connections were just accumulating until the OS felt like closing them due to inactivity.

That's when I found that Go's "http.Client" has a method called "CloseIdleConnections()" that immediately closes any idle connections without waiting for the OS to do it for you.

For reasons that are not relevant here, each request to a third-party ArcGIS system uses its own "http.Client", and because of that, there was no way to reuse any connections between requests, and the application just kept racking up open connections, eventually hitting the default limit of 1024 "open files".  I simply added "defer httpClient.CloseIdleConnections()" after creating the "http.Client", and everything magically behaved as I expected: no more than 10 active connections at any time (one for each of the 10 threads running).

So, if your Go application is getting "too many open files" errors when making a lot of HTTP requests, be sure to either (1) re-architect your application to reuse your "http.Client" whenever possible, or (2) be sure to call "CloseIdleConnections()" on your "http.Client" as soon as you're done with it.
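
In code, the fix amounts to this (a simplified sketch of option 2, not my actual sync job):

package main

import (
    "io"
    "net/http"
)

// fetchTile grabs one map tile.  Because this creates a fresh http.Client
// per call, the deferred CloseIdleConnections is what keeps idle keep-alive
// sockets from piling up until the process hits "too many open files".
func fetchTile(url string) ([]byte, error) {
    client := &http.Client{}
    defer client.CloseIdleConnections()

    resp, err := client.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    return io.ReadAll(resp.Body)
}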

I suspect that some of the third-party ArcGIS issues that I was seeing in production might have essentially been DoS errors caused by my application assaulting these poor servers with thousands of connections.

Saturday, April 2, 2022

Service workers, push notifications, and IndexedDB

I have a pretty simple use case: a user wanted my app to provide a list of "recent" notifications that had been sent to her.  Sometimes a lot of notifications will come through in a relatively short time period, and she wants to be able to look at the list of them to make sure that she's handled them all appropriately.

I ended up having the service worker write the notification to an IndexedDB and then having the UI reload the list of notifications when it receives a "message" event from the service worker.

Before we get there, I'll walk you through my process because it was painful.

Detour: All the mistakes that I made

Since I was already using HTML local storage for other things, I figured that I would just record the list of recent notifications in there.  Every time that the page would receive a "message" event, it would add the event data to a list of notifications in local storage.  That kind of worked as long as I was debugging it.  As long as I was looking at the page, the page was open, and it would receive the "message" event.

However, in the real world, my application is installed as a "home screen" app on Android and is usually closed.  When a notification arrived, there was no open page to receive the "message" event, and it was lost.

I then tried to have the service worker write to HTML local storage instead.  It wouldn't matter which side (page or service worker) actually wrote the data since both sides would detect a change immediately.  Except that's not how it works.  Service workers can't use HTML local storage because of some rules around being asynchronous or something.

Anyway, HTML local storage was impossible as a simple communication and storage mechanism.

Because the page was usually not opened, MessageChannel and BroadcastChannel also wouldn't work.

I finally settled on using IndexedDB because a service worker is allowed to use it.  The biggest annoyance (in the design) was that there is no way to have a page "listen" for changes to an IndexedDB, so I couldn't just trivially tell my page to update the list of notifications to display when there was a change to the database.

After implementing IndexedDB, I spent a week trying to figure out why it wasn't working half the time, and that leads us to how service workers actually work.

Detour: How service workers work

Service workers are often described as a background process for your page.  The way that you hear about them, they sound like daemons that are always running and process events when they receive them.

But that's not anywhere near correct in terms of how they are implemented.  Service workers are more like "serverless" functions (such as Google Cloud Functions) in that they generally aren't running, but if a request comes in that they need to handle, then one is spun up to handle the request, and it'll be kept around for a few minutes in case any other requests come in for it to handle, and then it'll be shut down.

So my big mistake was thinking that once I initialized something in my service worker then it would be available more or less indefinitely.  The browser knows what events a service worker has registered ("push", "message", etc.) and can spin up a new worker whenever it wants, typically to handle such an event and then shut it down again shortly thereafter.

Service workers have an "install" event that gets run when new service worker code gets downloaded.  This is intended to be run exactly once for that version of the service worker.

There is also an "activate" event that gets run when an actual worker has been assigned to the task.  You can basically view this as an event that runs once whenever a service worker process starts running, regardless of how many times this particular code has been run previously.  If you need to initialize some global things for later functions to call, you should do it here.

The "push" event is run when a push message has been received.  Whatever work you need to do should be done in the event's "waitUntil" method as a promise chain that ultimately results in showing a notification to the user.

Detour: How IndexedDB works

IndexedDB was seemingly invented by people who had no concept of Promises in JavaScript.  Its API is entirely insane and based on "onsuccess", "oncomplete", and "onerror" callbacks.  (You can technically also use event listeners, but it's just as insane.)  It's an asynchronous API that doesn't use any of the standard asynchronous syntax as anything else in modern JavaScript.  It is what it is.

Here's what you need to know: everything in IndexedDB is callbacks.  Everything.  So, if you want to connect to a database, you'll need to make an IDBRequest and set the "onsuccess" callback.  Once you have the database, you'll need to create a transaction and set the "oncomplete" callback.  Then you can create another IDBRequest for reading or writing data from an object store (essentially a table) and setting the "onsuccess" callback.  It's callback hell, but it is what it is.  (Note that there are wrapper libraries that provide Promise-based syntax, but I hate having to wrap a standard feature for no good reason.)

(Also, there's an "onupgradeneeded" callback at the database level that you can use to do any schema- or data-related work if you're changing the database version.)

Putting it all together

I decided that there was no reason to waste cycles opening the IndexedDB on "activate" since there's no guarantee that it'll actually be used.  Instead, I had the "push" event use the previous database connection (if there was one) or create a new connection (if there wasn't).

I put together the following workflow for my service worker:

  1. Register the "push" event handler ("event.waitUntil(...)"):
    1. (Promise) Connect to the IndexedDB.
      1. If we already have a connection from a previous call, then return that.
      2. Otherwise, connect to the IndexedDB and return that (and also store it for quick access the next time so we don't have to reconnect).
    2. (Promise) Read the list of notifications from the database.
    3. (Promise) Add the new notification to the list and write it back to the database.
    4. (Promise) Fire a "message" event to all active pages and show a notification if no page is currently visible to the user.
And for my page:
  1. Load the list of notifications from the IndexedDB when the page loads.  (This sets our starting point, and any changes will be communicated by a "message" event from the service worker.)
  2. Register the "message" event handler:
    1. Reload the list of notifications from the IndexedDB.  (Remember, there's no way to be notified on changes, so receiving the "message" event and reloading is the best that we can do.)
    2. (Handle the message normally; for me, this shows a little toast with the message details and a link to click on to take the user to the appropriate screen.)

For me, the database work is a nice-to-have; the notification is the critical part of the workflow.  So I made sure that every database-related error was handled and the Promises resolved no matter what.  This way, even if there was a completely unexpected database issue, it would just get quietly skipped and the notification could be shown to the user.

In my code, I created some simple functions (to deal with the couple of IndexedDB interactions that I needed) that return Promises so I could operate normally.  You could technically just do a single "new Promise(...)" to cover all of the IndexedDB work if you wanted, or you could use one of those fancy wrapper libraries.  In any case, you must call "event.waitUntil" with a Promise chain that ultimately resolves after doing something with the notification.  How you get there is up to you.

I also was using the IndexedDB as an asynchronous local storage, so I didn't need fancy keys or sorting or anything.  I just put all of my data under a single key that I could "get" and "put" trivially without having to worry about row counts or any other kind of data management.  There's a single object store with a single row in it.

Thursday, March 3, 2022

Dump your SSH settings for quick troubleshooting

I recently had a Jenkins job that would die, seemingly-randomly.  The only thing that really stood out was that it would tend to succeed if the runtime was 14 minutes or less, and it would tend to fail if the runtime was 17 minutes or more.

This job did a bunch of database stuff (through an SSH tunnel; more on that soon), so I first did a whole bunch of troubleshooting on the Postgres client and server configs, but nothing seemed relevant.  It seemed to disconnect ("connection closed by server") on these long queries that would sit there for a long time (maybe around 15 minutes or so) and then come back with a result.  After ruling out the Postgres server (all of the settings looked good, and new sessions had decent timeout configs), I moved on to SSH.

This job connects to a database by way of a forwarded port through an SSH tunnel (don't ask why; just understand that it's the least worst option available in this context).  I figured that maybe the SSH tunnel was failing, since I start it in the background and have it run "sleep infinity" and then never look at it again.  However, when I tested locally, my SSH session would run for multiple days without a problem.

Spoiler alert: the answer ended up being the client config, but how do you actually find that out?

SSH has two really cool options.

On the server side, you can run "sudo sshd -T | sort" to have the SSH daemon read the relevant configs and then print out all of the actual values that it's using.  So, this'll merge in all of the unspecified defaults as well as all of the various options in "/etc/ssh/sshd_config", "/etc/ssh/sshd_config.d", etc.

On the client side, you can run "ssh -G ${user}@${host} | sort", and it'll do the same thing, but for all of the client-side configs for that particular user and host combination (because maybe you have some custom stuff set up in your SSH config, etc.).

Now, in my case, it ended up being a keepalive issue.  So, on the server side, here's what the relevant settings were:

clientalivecountmax 0
clientaliveinterval 900
tcpkeepalive yes

On the client (which would disconnect sometimes), here's what the relevant settings were:

serveralivecountmax 3
serveraliveinterval 0
tcpkeepalive yes

Here, you can see that the client (which is whatever the default Jenkins Kubernetes agent ended up being) enabled TCP keepalives but set "serveraliveinterval" to "0", which means that it would never send SSH-level keepalive messages to the server at all.

According to the docs, the server should have sent out keepalives every 15 minutes, but whatever it was doing, the connection would drop after 15 minutes.  Setting "serveraliveinterval" to "60" ended up solving my problem and allowed my SSH sessions to stay active indefinitely until the script was done with them.

Little bonus section

My SSH command to set up the tunnel in the background was:

ssh -4 -f -L${localport}:${targetaddress}:${targetport} ${user}@${bastionhost} 'sleep infinity';

"-4" forces it to use an IPv4 address (relevant in my context), and "-f" puts the SSH command into the background before "sleep infinity" gets called, right after all the port forwarding is set up.  "sleep infinity" ensures that the connection never closes on its own; the "sleep" command will do nothing forever.

(Obviously, I had the "-o ServerAliveInterval=60" option in there, too.)

With this, I could trivially have my container create an SSH session that allowed for port-forwarding, and that session would be available for the entirety of the container's lifetime (the entirety of the Jenkins build).

Tuesday, March 1, 2022

QNAP, NFS, and Filesystem ACLs

I recently spent hours banging my head against a wall trying to figure out why my Plex server couldn't find some new media that I put on its volume in my QNAP.

tl;dr QNAP "Advanced Folder Permissions" turn on file access control lists (you'll need the "getfacl" and "setfacl" tools installed on Linux to mess with them).  For more information, see this guide from QNAP.

I must have turned this setting on when I rebuilt my NAS a while back, and it never mattered until I did some file operations with the File Manager or maybe just straight "cp"; I forget which (or both).  Plex refused to see the new file, and I tried debugging the indexer and all that other Plex stuff before realizing that while it could list the file, it couldn't open the file, even though its normal "ls -l" permissions looked fine.

Apparently the file access control list denied it, but I didn't even have "getfacl" or "setfacl" installed on my machine (and I had never even heard of this before), so I had no idea what was going on.  I eventually installed those tools and verified that while the standard Linux permission looked fine, the ACL permissions did not.

"sudo chmod -R +r /path/to/folder" didn't solve my problem, but tearing out the ACL did: "sudo setfacl -b -R /path/to/folder"

Later, I eventually figured out that it was QNAP's "Advanced Folder Permissions" and just disabled that so it wouldn't bother me again.

Sunday, January 9, 2022

Moving (or renaming) a push-notification ServiceWorker

Service Workers (web workers, etc.) are a relatively new concept.  They can do all kinds of cool things (primarily related to network requests), but they are also the mechanism by which a web site can receive push messages (via Web Push) and show them as OS notifications.

The general rule of Service Workers is to pick a file name (such as "/service-worker.js") and never, ever change it.  That's cool, but sometimes you do need to change it.

In particular, I started my push messaging journey with "platinum-push-messaging", a now-defunct web component built by Google as part of the initial Polymer project.  The promise was cool: just slap this HTML element on your page with a few parameters and boom: you have working push notifications.

When it came out, the push messaging spec was young, and no browsers fully supported its encrypted data payloads, so "platinum-push-messaging" did a lot of work to work around that limitation.  As browsers improved to support the VAPID spec, "platinum-push-messaging" (along with all of the other "platinum" elements) were quietly deprecated and archived (around 2017).

This left me with a problem: a rotting push notification system that couldn't keep up with the spec and the latest browsers.  I hacked the code to all hell to support VAPID and keep the element functioning, but I was just punting.

Apple ruined the declarative promise of the Polymer project by refusing to implement HTML imports, so the web components community adopted the NPM distribution model (and introduced a whole bunch of imperative Javascript drama and compilation tools).  Anyway, no modern web components are installed with Bower anymore, so that left me with a deprecated Service Worker in a path that I wanted to get rid of: "bower_components/platinum-push-messaging/service-worker.js"

Here was my problem:

  1. I wanted the push messaging Service Worker under my control at the top level of my application, "/push-service-worker.js".
  2. I had hundreds of users who were receiving push notifications via this system, and the upgrade had to be seamless (users couldn't be forced to take any action).
I ended up solving the problem by essentially performing a switcheroo:
  1. I had my application store the Web Push subscription info in HTML local storage.  This would be necessary later as part of the switcheroo.
  2. I removed "bower_components/platinum-push-messaging/".  Any existing clients would regularly attempt to update the service worker, but it would quietly fail, leaving the existing one running just fine.
  3. I removed all references to "platinum-push-messaging" from my code.  The existing Service Worker would continue to run (because that's what Service Workers do) and receive push messages (and show notifications).
  4. I made my own push-messaging web component with my own service worker living at "/push-service-worker.js".
  5. (This laid the groundwork for performing the switcheroo.)
  6. Upon loading, the part of my application that used to include "platinum-push-messaging" did a migration, if necessary, before loading the new push-messaging component:
    1. It went through all the Service Workers and looked for any legacy ones (these had "$$platinum-push-messaging$$" in the scope).  If it found any, it killed them.

      Note that the "$$platinum-push-messaging$$" in the scope was a cute trick by the web component: a page can only be controlled by one Service Worker, and the scope dictates what that Service Worker can control.  By injecting a bogus "$$platinum-push-messaging$$" at the end of the scope, it ensured that the push-messaging Service Worker couldn't accidentally control any pages and get in the way of a main Service Worker.
    2. Upon finding any legacy Service Workers, it would:
      1. Issue a delete to the web server for the old (legacy) subscription (which was stored in HTML local storage).
      2. Tell the application to auto-enable push notifications.
      3. Resume the normal workflow for the application.
  7. The normal workflow for the application entailed loading the new push-messaging web component once the user was logged in.  If a Service Worker was previously enabled, then it would remain active and enabled.  Otherwise, the application wouldn't try to annoy users by asking them for push notifications.
  8. After the new push-messaging web component was included, it would then check to see if it should be auto-enabled (it would only be auto-enabled as part of a successful migration).
    1. If it was auto-enabled, then it would enable push messaging (the user would have already given permission by virtue of having a legacy push Service Worker running).  When the new push subscription was ready, it would post that information to the web server, and the user would have push messages working again, now using the new Service Worker.  The switcheroo was complete.
That's a bit wordy for a simple switcheroo, but it was very important for me to ensure that my users never lost their push notifications as part of the upgrade.  The simple version is: detect legacy Service Worker, kill legacy Service Worker, delete legacy subscription from web server, enable new Service Worker, and save new subscription to web server.

For any given client, the switcheroo happens only once.  The moment that the legacy Service Worker has been killed, it'll never run again (so there's a chance that if the user killed the page in the milliseconds after the kill but before the save, then they'd lose their push notifications, but I viewed this as extremely unlikely; I could technically have stored a status variable, but it wasn't worth it).  After that, it operates normally.

This means that there are one of two ways for a user to be upgraded:
  1. They open up the application after it has been upgraded.  The application prompts them to reload to upgrade if it detects a new version, but eventually the browser will do this on its own, typically after the device reboots or the browser has been fully closed.
  2. They click on a push notification, which opens up the application (which is #1, above).
So at this point, it's a waiting game.  I have to maintain support for the switcheroo until all existing push subscriptions have been upgraded.  The new ones have a flag set in the database, so I just need to wait until all subscriptions have the flag.  Active users who are receiving push notifications will eventually click on one, so I made a note to revisit this and remove the switcheroo code once all of the legacy subscriptions have been removed.

I'm not certain what causes a new subscription to be generated (different endpoint, etc.), but I suspect that it has to do with the scope of the Service Worker (otherwise, how would it know, since service worker code can change frequently?).  I played it safe and just assumed that the switcheroo would generate an entirely new subscription, so I deleted the legacy one no matter what and saved the new one no matter what.