Wednesday, April 5, 2023

Dealing with KDE "plasmashell" freezing

I've been using KDE for over a decade now, and something started happening in the past year or two (at least on Kubuntu 22.04): my whole screen would mostly freeze.  Generally, I could still alt-tab between windows and interact with them, but I couldn't click on or interact with anything belonging to the window manager (the title bars, the task bar, etc.).

In my case, I'd notice it immediately when I came back to my desk: a notification had obviously fired at some point, and the rendering had gotten all screwed up:


In this image, you can see that the notification toast window has no visible content and instead looks like the KDE background image.  Also, the time is locked at 5:29 PM, which is when this problem happened (I didn't get back to my desk until 8:30 AM the next morning).

The general fix for this is to use a shell (if you have one open, great; if not, press ctrl+alt+F2 to jump to the console) and kill "plasmashell":
killall plasmashell

Once that's done, your window manager should be less broken, but it won't have the taskbar, etc.  From there, you can press alt+F2 to open the "run" window, and type in:
plasmashell --replace


You can also run this from a terminal somewhere, but you need to make sure that your "DISPLAY" environment variable is set up correctly, etc.  I find it easier to do it from the run window (and I don't have to worry about redirecting its output anywhere, since "plasmashell" does generate some logging noise).


Friday, February 10, 2023

Using a dynamic PVC on Kubernetes agents in Jenkins

I recently had to create a Jenkins job that needed to use a lot of disk space. The short version of the story is that the job needed to dump the contents of a Postgres database and upload that to Artifactory, and the "jfrog" command line tool won't let you stream an upload, so the entire dump had to be present on disk in order for it to work.

I run my Jenkins on Kubernetes, and the Kubernetes hosts absolutely didn't have the disk space needed to dump this database, and it was definitely too big to use a memory-based filesystem.

The solution was to use a dynamic Persistent Volume Claim, which is maybe(?) implemented as an ephemeral volume in Kubernetes, but the exact details of what it does under the hood aren't important.  What is important is that, as part of the job running, a new Persistent Volume Claim (PVC) gets created and is available for all of the containers in the pod.  When the job finishes, the PVC gets destroyed.  Perfect.

I couldn't figure out how to create a dynamic PVC as an ordinary volume that would get mounted on all of my containers (it's a thing, but apparently not for a declarative pipeline), but I was able to get the "workspace" dynamic PVC working.

A "workspace" volume is shared across all of the containers in the pod and have the Jenkins workspace mounted.  This has all of the Git contents, including the Jenkinsfile, for the job (I'm assuming that you're using Git-based jobs here).  Since all of the containers share the same workspace volume, any work done in one container is instantly visible in all of the others, without the need for Jenkins stashes or anything.

The biggest problem that I ran into was the permissions on the "workspace" file system.  Each of my containers had a different idea of what the UID of the user running the container would be, and all of the containers had to agree on the permissions around their "workspace" volume.

I ended up cheating and just forcing all of my containers to run as root (UID 0), since (1) everyone could agree on that, and (2) I didn't have to worry about "sudo" not being installed on some of the containers that needed to install packages as part of their setup.

Using "workspace" volumes

To use a "workspace" volume, set workspaceVolume inside the kubernetes block:

kubernetes {
   workspaceVolume dynamicPVC(accessModes: 'ReadWriteOnce', requestsSize: "300Gi")
   yaml '''
---
apiVersion: v1
kind: Pod
spec:
   securityContext:
      fsGroup: 0
      runAsGroup: 0
      runAsUser: 0
   containers:
[...]
   '''
}

In this example, we allocate a 300GiB volume that exists for as long as the job runs.

In addition, you can see that I set the user and group information to 0 (for "root"), which let me work around all the annoying UID mismatches across the containers.  If you only have one container, then obviously you don't have to do this.  Also, if you have full control of your containers, then you can probably set them up with a known user with a fixed UID who can sudo, etc., as necessary.

For more information about using Kubernetes agents in Jenkins, see the official docs, but (at least as of the time of this writing) they're missing a whole lot of information about volume-related things.

Troubleshooting

If you see Jenkins trying to create and then delete pods over and over and over again, you have something else wrong.  In my case, the Kubernetes service account that Jenkins uses didn't have any permissions around "persistentvolumeclaims" objects, so every time the Pod was created, it would fail and try again.

I was only able to see the errors in the Jenkins logs in Kubernetes; they looked something like this:

Caused: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.100.0.1:443/api/v1/namespaces/cicd/persistentvolumeclaims. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. persistentvolumeclaims is forbidden: User "system:serviceaccount:cicd:default" cannot create resource "persistentvolumeclaims" in API group "" in the namespace "cicd".

I didn't have the patience to figure out exactly what was needed, so I just gave it everything:

- verbs:
    - create
    - delete
    - get
    - list
    - patch
    - update
    - watch
  apiGroups:
    - ''
  resources:
    - persistentvolumeclaims


Tuesday, January 31, 2023

Use a custom login page when using Apache to require sign-in

Apache has its own built-in authentication system(s) for providing access control to a site that it's hosting.  You've probably encountered this before using "basic" authentication backed by a flatfile created and edited using the htpasswd command.

If you do this using the common guides on the Internet (for example, this guide from Apache itself), then when you go to your site, you'll be presented with your browser's built-in basic-authentication dialog box asking for a username and password.  If you provide valid credentials, then you'll be moved on through to the main site, and if you don't, then it'll dump you to a plain "401 Unauthorized" page.

This works fine, but it has three main drawbacks:

  1. Password managers (such as LastPass) can't detect this dialog box and autofill it, which is very annoying.
  2. On some mobile browsers, the dialog gets in the way of normal operations.  Even if you have multiple tabs open, whatever tab is trying to get you to log in will get in the way and force you to deal with it.
  3. If you're using Windows authentication, the browser might detect the 401 response and attempt to sign you in using your domain credentials.  If the server expects a different set of credentials, then you can't actually log in, because Windows keeps trying to sign you in automatically.

(And the built-in popup is really ugly, and it submits the password in plaintext, etc., etc.)

Apache "Form" Authentication

To solve this problem, Apache has a type of authentication called "form" that adds an extra step involving an HTML form (that's fully customizable).

The workflow is as follows:

  1. Create a login HTML page (you'll have to provide the page).
  2. Register a handler for that page to POST to (Apache already has the handler).
  3. Update any "Directory" or "Location" blocks in your Apache config to use the "form" authentication type instead of "basic".

You'll also need these modules installed and enabled:
  1. mod_auth_form
  2. mod_request
  3. mod_session
  4. mod_session_cookie

On Ubuntu, I believe that these were all installed out of the box but needed to be enabled separately.  On Red Hat, I had to install the mod_session package, but everything was otherwise already enabled.

Example

If you want to try out "form" authentication, I recommend that you get everything working with "basic" authentication first.  This is especially true if you have multiple directories that need to be configured separately.

For this example, I'm going to use our Nagios server.

There were two directories that needed to be protected: "/usr/local/nagios/sbin" and "/usr/local/nagios/share".  This setup is generally described by this document (although it covers "digest" authentication instead of "basic").

For both directories that already had "AuthType" set up, the changes are simple:

  1. Change AuthType Basic to AuthType Form.
  2. Change AuthBasicProvider to AuthFormProvider.
  3. Add the login redirect: AuthFormLoginRequiredLocation "/login.html"
  4. Enable sessions: Session On
  5. Set a cookie name: SessionCookieName session path=/

I decided to put my login page at "/login.html" because that makes sense, but you could put it anywhere (and even host it on a different server if you specify a full URL instead of just a path).

That page should contain a "form" with two "input" elements: "httpd_username" and "httpd_password".  The form "action" should be set to "/do-login.html" (or whatever handler you want to register with Apache).

At its simplest, "login.html" looks like this:

<form method="POST" action="/do-login.html">
  Username: <input type="text" name="httpd_username" value="" />
  Password: <input type="password" name="httpd_password" value="" />
  <input type="submit" name="login" value="Login" />
</form>

You'll probably want an "html" tag, a title and body and such, maybe some CSS, but this'll get the job done.

The last step is to register the thing that'll actually process the form data: "/do-login.html"

In your Apache config, add a "location" for it:

<Location "/do-login.html">
  SetHandler form-login-handler

  AuthType form
  AuthName "Nagios Access"
  AuthFormProvider file
  AuthUserFile /path/to/your/htpasswd.users

  AuthFormLoginRequiredLocation "/login.html"
  AuthFormLoginSuccessLocation "/nagios/"

  Session On
  SessionCookieName session path=/
</Location>

The key thing here is SetHandler form-login-handler.  This tells Apache to use its built-in form handler to take the values from httpd_username and httpd_password and compare them against your authentication provider(s) (in this example, it's just a flatfile, but you could use LDAP, etc.).

The other two options handle the last bit of navigation.  AuthFormLoginRequiredLocation sends you back to the login page if the username/password combination didn't work (you could potentially have another page here with an error message pre-written).  AuthFormLoginSuccessLocation sends you to the place where you want the user to go after login (I'm sending the user to the main Nagios page, but you could send them anywhere).

Notes

Other Authentication Providers

I've just covered the "file" authentication provider here.  If you use "ldap" and/or any others, then that config will need to be copied to every single place where you have "form" authentication set up, just like you would if you were only using the "file" provider.

I found this to be really annoying, since I had two directories to protect plus the form handler, so that adds another four lines or so to each config section, but what matters is that it works.


Wednesday, October 19, 2022

Watch out for SNI when using an nginx reverse proxy

From time to time, I'll have a use case where some box needs to talk to some website that it can't reach (through networking issues), and the easiest thing to do is to throw an nginx reverse proxy on a network that it can reach (such that the reverse proxy can reach both).

The whole shtick of a reverse proxy is that you can access the reverse proxy directly and it'll forward the request on to the appropriate destination and more or less masquerade itself as if it were the destination.  This is in contrast with a normal HTTP proxy that would be configured separately (if supported by whatever tool you're trying to use).  Sometimes a normal HTTP proxy is the best tool for the job, but sometimes you can cheat with a tweak to /etc/hosts and a reverse proxy and nobody needs to know what happened.

Here, we're focused on the reverse proxy.

In this case, we have the following scenario:

  1. Box 1 wants to connect to site1.example.com.
  2. Box 1 cannot reach site1.example.com.

To cheat using a reverse proxy, we need Box 2, which:
  1. Can be reached by Box 1.
  2. Can reach site1.example.com.

To set up the whole reverse proxy thing, we need to:
  1. Set up nginx on Box 2 to listen on port 443 (HTTPS) and reverse proxy to site1.example.com.
  2. Update /etc/hosts on Box 1 so that site1.example.com points to Box 2's IP address.

At first, I was seeing this error message on the reverse proxy's "nginx/error.log":

connect() to XXXXXX:443 failed (13: Permission denied) while connecting to upstream, client: XXXXXX, server: site1.example.com, request: "GET / HTTP/1.1"

"Permission denied" isn't great, and that told me that it was something OS-related.

Of course, it was an SELinux thing (in /var/log/messages):

SELinux is preventing /usr/sbin/nginx from name_connect access on the tcp_socket port 443.

The workaround was:

setsebool -P nis_enabled 1

This also was suggested by the logs, but it didn't seem to matter:

setsebool -P httpd_can_network_connect 1

After fixing that, I was seeing:

SSL_do_handshake() failed (SSL: error:14094410:SSL routines:ssl3_read_bytes:sslv3 alert handshake failure:SSL alert number 40) while SSL handshaking to upstream, client: XXXXXX, server: site1.example.com, request: "GET / HTTP/1.1"

After tcpdump-ing the traffic from Box 1 and also from another box that could talk to site1.example.com directly, it was clear that the proxied connections to site1.example.com (the ones nginx was making on Box 1's behalf) weren't including SNI (SNI is a TLS extension that passes the host name in plaintext so that proxies and load balancers can properly route name-based requests).
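
If you want to see the SNI difference for yourself without tcpdump, one quick check is to attempt a TLS handshake against the upstream with and without a server name.  Here's a minimal Go sketch (site1.example.com stands in for the real host); a name-based load balancer will typically reject the handshake that omits SNI with exactly this kind of "alert handshake failure":

package main

import (
    "crypto/tls"
    "fmt"
    "net"
)

// handshake attempts a TLS handshake against host:443, either sending the
// SNI extension (ServerName set) or omitting it entirely.
func handshake(host string, sendSNI bool) error {
    raw, err := net.Dial("tcp", host+":443")
    if err != nil {
        return err
    }
    defer raw.Close()

    cfg := &tls.Config{}
    if sendSNI {
        cfg.ServerName = host // this becomes the SNI value in the ClientHello
    } else {
        // With no ServerName, crypto/tls omits SNI; we also have to skip
        // certificate verification since there's no name to verify against.
        cfg.InsecureSkipVerify = true
    }
    return tls.Client(raw, cfg).Handshake()
}

func main() {
    host := "site1.example.com" // placeholder host from the example above
    fmt.Println("with SNI:   ", handshake(host, true))
    fmt.Println("without SNI:", handshake(host, false))
}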

It took way too long for me to find the nginx setting to enable it (I don't know why it's disabled by default), but it's:

proxy_ssl_server_name on;

Anyway, the final nginx config for the reverse proxy on Box 2 was:

server {
  listen 443 ssl;
  server_name site1.example.com;

  ssl_certificate /etc/nginx/ssl/server.crt;
  ssl_certificate_key /etc/nginx/server.key;
  ssl_protocols TLSv1.2;
 
  location / {
    proxy_pass https://site1.example.com;
    proxy_ssl_session_reuse on;
    proxy_ssl_server_name on;
  }
}

As far as Box 1 was concerned, it could connect to site1.example.com with only a small tweak to /etc/hosts.

Wednesday, October 5, 2022

Speed up PNG encoding in Go with NRGBA images

After migrating my application from Google App Engine to Google Cloud Run, I suddenly had a use case for optimizing CPU utilization.

In my analysis of my most CPU-intensive workloads, it turned out that the majority of the time was spent encoding PNG files.

tl;dr Use image.NRGBA when you intend to encode a PNG file.

(For reference, this particular application has a Google Maps overlay that synthesizes data from other sources into tiles to be rendered on the map.  The main synchronization job runs nightly and attempts to build or download new tiles for the various layers based on data from various ArcGIS systems.)

Looking at my code, I couldn't really reduce the number of calls to png.Encode, but that encoder really looked inefficient.  I deleted the callgrind files (sorry), but basically, half of the CPU time in png.Encode was around memory operations and some runtime calls.

I started looking around for maybe some options to pass to the encoder or a more purpose-built implementation.  I ended up finding a package that mentioned a speedup, but only for NRGBA images.  However, that package looked fairly unused, and I wasn't about to turn all of my image processing over to something with 1 commit and no users.

This got me thinking, though: what is NRGBA?

It turns out that there are (at least) two ways of thinking about the whole alpha channel thing in images:

  1. In RGBA, each of the red, green, and blue channels has already been premultiplied by the alpha channel, such that the value of, for example, R can range from 0 to A, but no higher.
  2. In NRGBA, each of the red, green, and blue channels keeps its original value, and the alpha channel merely represents the opacity of the pixel in general.

(For example, a pure-red pixel at 50% opacity is stored as roughly (128, 0, 0, 128) in RGBA but as (255, 0, 0, 128) in NRGBA.)

For my human mind, using various tools and software over the years, when I think of "RGBA", I think of "one channel each for red, green, and blue, and one channel for the opacity of the pixel".  So what this means is that I'm thinking of "NRGBA" (for non-premultiplied RGBA).

(Apparently there are good use cases for both, and when compositing, at some point you'll have to multiply by the alpha value, so "RGBA" already has that done for you.)

Okay, whatever, so what does this have to do with CPU optimization?

In Go, the png.Encode function is optimized for NRGBA images.  There's a tiny little hint about this in the comment for the function:

Any Image may be encoded, but images that are not image.NRGBA might be encoded lossily.

This is corroborated by the PNG rationale document, which explains that

PNG uses "unassociated" or "non-premultiplied" alpha so that images with separate transparency masks can be stored losslessly.

If you want to have the best PNG encoding experience, then you should encode images that use NRGBA already.  In fact, if you open up the code, you'll see that it will convert the image to NRGBA if it's not already in that format.

Coming back to my callgrind analysis, this is where all that CPU time was spent: converting an RGBA image to an NRGBA image.  I certainly thought that it was strange how much work was being done creating a simple PNG file from a mostly-transparent map tile.

Why did I even have RGBA images?  Well, my tiling API has to composite tiles from other systems into a single PNG file, so I simply created that new image with image.NewRGBA.  And why that function?  Because as I mentioned before, I figured "RGBA" meant "RGB with an alpha channel", which is what I wanted so that it would support transparency.  It never occurred to me that "RGBA" was some weird encoding scheme for pixels in contrast to another encoding scheme called "NRGBA"; my use cases had never had me make such a distinction.

Anyway, after switching a few image.NewRGBA calls to image.NewNRGBA (and literally that was it; no other code changed), my code was way more efficient, cutting down on CPU utilization by something like 50-70%.  Those RGBA to NRGBA conversions really hurt.
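
To make the change concrete, here's a minimal, self-contained sketch of the pattern (the 256x256 tiles here are placeholders; in the real code the source images come from other systems):

package main

import (
    "bytes"
    "image"
    "image/draw"
    "image/png"
    "log"
)

func main() {
    // Create the destination image as NRGBA (non-premultiplied) so that
    // png.Encode doesn't have to convert it before writing the file.
    // This used to be image.NewRGBA, which forced a costly conversion.
    dst := image.NewNRGBA(image.Rect(0, 0, 256, 256))

    // A stand-in for a tile fetched from some other system.
    src := image.NewNRGBA(image.Rect(0, 0, 256, 256))

    // Composite the source tile over the destination.
    draw.Draw(dst, dst.Bounds(), src, image.Point{}, draw.Over)

    // Encode to PNG; with an *image.NRGBA input there's no conversion step.
    var buf bytes.Buffer
    if err := png.Encode(&buf, dst); err != nil {
        log.Fatal(err)
    }
    log.Printf("encoded %d bytes", buf.Len())
}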

100% cost savings switching from App Engine to Cloud Run... wild

I have written numerous times about how much I like Google's App Engine platform and about how I've tried to port as many of my applications to it as possible.  The idea of App Engine is still glorious, but I  have now been converted to Cloud Run.

The primary selling point?  It has literally reduced my application's bill to $0.  That's a 100% reduction in cost from App Engine to Cloud Run.  It's absolutely wild, and it's now my new favorite thing.

Pour one out for App Engine because Cloud Run is here to stay.

Aside: a (brief) history of App Engine

Before I explain how Cloud Run manages to make my application free to run, I need to first run you through a brief history of App Engine, why it was great, why it became less great, and how Cloud Run could eat its lunch like that.

App Engine v1

Google App Engine entered the market in 2011 as a one-stop solution to host infinitely scalable applications.  Prior to App Engine, if you wanted to host a web application, you had to either commit to a weird framework and go with that framework's cloud solution, or you had to spin up a bunch of virtual machines (or god help you Lambda functions) to handle your web traffic, your database, your file systems, maybe tie into a CDN, and also set up Let's Encrypt or something similar to handle your TLS certificates.

From the perspective of an application owner or application developer, that's a lot of work that doesn't directly deal with application stuff and ends up just wasting tons of everyone's time.

App Engine was Google's answer to that, something that we would now call "serverless".

The premise was simple: upload a zip file with your application, some YAML files with some metadata (concerning scaling, performance, etc.), and App Engine would take care of all the annoying IT stuff for you.  In fact, they tried to hide as much of it from you as possible—they wanted you to focus on the application logic, not the nitty gritty backend details.

(This allowed Google to do all kinds of tricks in the background to make things better and faster, since the implementation details of their infrastructure were generally out of bounds for your application.)

App Engine's killer feature was that it bundled a whole bunch of services together, things that you'd normally have to set up yourself and pay extra for.  This included, but was not limited to:

  1. The web hosting basics:
    1. File hosting
    2. A CDN service
    3. An execution environment in your language of choice (Java, Python, PHP, etc.)
  2. A cheap No-SQL database to store all your data
  3. A memcached memory store
  4. A scheduler that'll hit any of your web endpoints whenever you want
  5. A task queue for background tasks
  6. A service that would accept e-mail and transform it into an HTTP POST request

All of this came with a generous free tier (truly free) where you could try stuff out and generally host small applications with no cost at all.  If your application were to run up against the free-tier limits, it would just halt those services until the next day when all of the counters reset and your daily limits restarted.

(Did I mention e-mail?  I had so many use cases that dealt with accepting e-mail from other systems and taking some action.  This was a godsend.)

The free tier gave you 28 hours of instance time (that is, 28 hours of a single instance of the lowest class).  For trying stuff out, that was plenty, especially when your application only had a few users and went unused for hours at a time.

(Yes, if you upgraded your instance type one notch that became 14 hours free, but still.)

To use all of the bonus features, you generally had to link against (more or less) the App Engine SDK, which took care of all of the details behind the scenes.

This was everything that you needed to build a scalable web application, and it did its job well.

App Engine v2

Years went by, and Google made some big changes to App Engine.  To me, these changes went against the great promise of App Engine, but they make sense when looking at Cloud Run as the App Engine successor.

App Engine v2 added support for newer versions of the languages that App Engine v1 had supported (which was good), but it also took away a lot of the one-stop-shop power (which was bad).  The stated reason was to remove a lot of the Google Cloud-specific stuff and make the platform a bit more open and compatible with other clouds.

(While that's generally good, the magic of App Engine was that everything was built in, for free.)

Now, an App Engine application could no longer have free memcached data; instead, you could pay for a Memorystore option.  For large applications, it probably didn't matter, but for small ones that cost basically nothing to run, this made memcached untenable.

Similarly, the e-mail service was discontinued, and you were encouraged to move to Twilio SendGrid, which could do something similar.  Google was nice enough to give all SendGrid customers a free tier that just about made up for what was lost.  The big problem was that all of the built-in tools were gone.  Previously, the App Engine local development console had a place where you could type in an e-mail and send it to your application; now you had to write your own tools.

The scheduler and task queue systems were promoted out of App Engine and into Google Cloud, but the App Engine tooling continued to support them, so they required only minimal changes on the application side.

The cheap No-SQL database, Datastore, was also promoted out of App Engine, and there were no changes to make whatsoever.  Yay.

The App Engine SDK was removed, and an App Engine application was simply another application in the cloud, with annoying dependencies and third-party integrations.

(Yes, technically Google added back support for the App Engine SDK years later, but meh.)

Overall, for people who used App Engine exclusively for purpose-built App Engine things, v2 was a major step down from v1.  Yes, it moved App Engine toward a more standards-based approach, but the problem was that there weren't standards around half the stuff that App Engine was doing, and in the push to make it more open and in line with other clouds, it lost a lot of its power (because other clouds weren't doing what App Engine was doing, which is why I was using Google's App Engine).

So, why Cloud Run?

Cloud Run picks up where App Engine v2 left off.  With all of the App Engine-specific things removed, the next logical jump was to give Google a container image instead of a zip file and let it do its thing.  App Engine had already experimented with this with the "Flexible" environment, but its pricing was not favorable compared to the "Standard" offering for the kinds of projects that I had.

From a build perspective, Cloud Run gives you a bit more flexibility than App Engine.  You can give Google a Dockerfile and have it take care of all the compiling and such, or you can do most of that locally and give Google a much simpler Dockerfile with your pre-compiled application files.  I don't want to get too deep into it, but you have options.

Cloud Run obviously runs your application as a container.  Did App Engine?  Maybe?  Maybe not?  I figured that it would have, but that's for Google to decide (the magic of App Engine).

But here's the thing: App Engine was billed as an instance, not a container.

Billing: App Engine vs. Cloud Run

I'm going to focus on the big thing, which is instance versus container billing.  Everything else is comparable (data traffic, etc.).

App Engine billed you on instance time (wall time), normalized to units of the smallest instance class that they support.  To run your application at all, there was a 15-minute minimum cost.  If your application wasn't doing anything for 15 minutes, then App Engine would shut it down and you wouldn't be billed for that time.  But for all other time, anytime that there was at least one instance running, you were billed for that instance (more or less as if you had spun up a VM yourself for the time that it was up).

For a lightweight web application, this kind of billing isn't ideal because most of the work that your application does is a little bit of processing centered around database calls.  Most of the time the application is either not doing anything at all (no requests are coming in) or it's waiting on a response from a database.  The majority of the instance CPU goes unused.

Cloud Run bills on CPU time, that is, the time that the CPU is actually used.  So, for that same application, if it's sitting around doing nothing or waiting on a database response, then it's not being billed.  And there's no 15-minute minimum or anything.  For example, if your application does some request parsing, permissions checks, etc., and then sits around waiting for 1 second for a database query to return, then you'll be billed for something like 0.0001 seconds of CPU time, which is great (because your application did very little actual CPU work in the second it took for the request to complete).

Cloud Run gives you basically 48 hours' worth of vCPU-seconds (that is, a single CPU churning at 100%) free per month.  So for an application that sits around doing nothing most of the time, the odds are that you'll never have to pay for any CPU utilization.  Averaged out, that comes to about 1.5 hours of free CPU time per day.  Yes, you also pay for the amount of memory that your app uses while it's handling requests, but my app uses like 70MB of memory because it's a web application.

(For me, my app has a nightly job that eats up a bunch of CPU, and then it does just about nothing all day.)

Overall, for me, this is what moved me to Cloud Run.  I'm billed for the resources that I'm actually using when I'm actually using them, and the specifics around instances and such are no longer my concern.

This also means that I can use a more microservice-based approach, since I'm billed on what my application does, not what resources Google spins up in the background to support it.  With App Engine, running separate "services" would have been too costly (each with its own instances, etc.).  With Cloud Run, it's perfect, and I can experiment without needing to worry about massive cost changes.

What's the catch?

App Engine's deployment tools are top-notch, even in App Engine v2.  When you deploy your application (once you figure out all the right flags), it'll get your application up and running, deploy your cron jobs, and ensure that your Datastore indexes are pushed.

With Cloud Run, you don't get any of the extra stuff.

I've put together a list of things that you'll have to do to migrate your application from App Engine to Cloud Run, sorted from easiest to hardest.

Easy: Deploy your application

This is essentially gcloud run deploy with some options, but it's extremely simple and straightforward.  Migrating from App Engine will take like 2 seconds.

Easy: Get rid of app.yaml

You won't need app.yaml anymore, so remove it.  If you had environment variables in there, move them to their own file and update your gcloud run deploy command to include --env-vars-file.

In Cloud Run, everything is HTTPS, so you don't have to worry about insecure HTTP endpoints.

There are no "warmup" endpoints or anything, so just make sure that your container does what it needs to when it starts because as soon as it can receive an HTTP request, it will.

If you had configured a custom instance class, you may need to tweak your Cloud Run service with parameters that better match your old settings.  I had only used an F2 class because the F1 class had a limitation on the number of concurrent requests that it could receive, and Cloud Run has no such limitation.  You can also configure the maximum number of concurrent requests much, much higher.

Similarly, all of the scaling parameters are... different, and you can deal with those in Cloud Run as necessary.  Let your app run for a while and see how Cloud Run behaves; it's definitely a different game from App Engine.

Easy: Deploy your task queues

Starting with App Engine v2, you already had to deploy your own task queues (App Engine v1 used to take care of that for you), so you don't have to change a single thing with deployment.  Note that you will have to change how those task queues are used; see below.

Easy: Deploy your Datastore indexes

This is essentially gcloud datastore indexes create index.yaml with some options, and you should already have index.yaml sitting around, so just add this to your deployment code.

Easy: Rip out all "/_ah" endpoints

App Engine liked to have you put special App Engine things under "/_ah", so I stashed all of my cron job and task queue endpoints in there.  Ordinarily, this would be fine, except that Cloud Run quietly treats "/_ah" as special and will fail all requests to any endpoints under it with a permission-denied error.  I wasted way too much time trying to figure out what was going on before I realized that it was Cloud Run doing something sneaky and undocumented in the background.

Move any endpoints under "/_ah" to literally any other path and update any references to them.

Easy: Create a Dockerfile

Create a Dockerfile that's as simple or as complex as you want.  I wanted a minimal set of changes from my current workflow, so I had my Dockerfile just include all of the binaries that get compiled locally as well as any static files.

If you want, you can set up a multi-stage Dockerfile that has separate, short-lived containers for building and compiling and that ultimately ends up with a single, small container for your application.  I'll eventually get there, but actually deploying into Cloud Run took precedence for me over neat CI/CD stuff.

Easy: Detect Cloud Run in production

I have some simple code that detects if it's in Google Cloud or not so that it does smart production-related things if it is.  In particular, logging in production is JSON-based so that StackDriver can pick it up properly.

The old "GAE_*" and "GOOGLE_*" environment variables are gone; you'll only get "K_SERVICE" and friends to tell you a tiny amount of information about your Cloud Run environment.  You'll basically get back the service name and that's it.

If you want your project ID or region, you'll have to pull those from the "computeMetadata/v1" endpoints.  It's super easy; it's just something that you'll have to do.
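
For what it's worth, the whole thing is only a few lines of Go.  This sketch checks for "K_SERVICE" and then pulls the project ID and region from the standard metadata paths (project/project-id and instance/region; double-check those against the guide below):

package main

import (
    "fmt"
    "io"
    "net/http"
    "os"
)

// metadata fetches a value from the GCE/Cloud Run metadata server.
func metadata(path string) (string, error) {
    req, err := http.NewRequest("GET", "http://metadata.google.internal/computeMetadata/v1/"+path, nil)
    if err != nil {
        return "", err
    }
    req.Header.Set("Metadata-Flavor", "Google")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    body, err := io.ReadAll(resp.Body)
    return string(body), err
}

func main() {
    // K_SERVICE is only set when running in Cloud Run.
    if os.Getenv("K_SERVICE") == "" {
        fmt.Println("not running in Cloud Run; using local defaults")
        return
    }
    project, _ := metadata("project/project-id")
    // The region value comes back as "projects/NUMBER/regions/REGION".
    region, _ := metadata("instance/region")
    fmt.Println("project:", project, "region:", region)
}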

For more information on using the "computeMetadata/v1" endpoints, see this guide.

For more information about the environment variables available in Cloud Run, see this document.

Easy: "Unlink" your Datastore

Way back in the day, Datastore was merely a part of App Engine.  At some point, Google spun it out into its own thing, but for older projects, it's still tied to App Engine.  Make sure that you go to your Datastore's admin screen in Google Cloud Console and unlink it from your App Engine project.

This has no practical effects other than:

  1. You can no longer put your Datastore in read-only mode.
  2. You can disable your App Engine project without also disabling your Datastore.

It may take 10-20 minutes for the operation to complete, so just plan on doing this early so it doesn't get in your way.

To learn more about unlinking Datastore from App Engine, see this article.

Medium: Change your task queue code

If you were using App Engine for task queues, then you were using special App Engine tasks.  Since you're not using App Engine anymore, you have to use the more generic HTTP tasks.

In order to do this, you'll need to know the URL for your Cloud Run service.  You can technically cheat and use the information in the HTTP request to generally get a URL that (should) work, but I opted to use the "${region}-run.googleapis.com/apis/serving.knative.dev/v1/namespaces/${project-id}/services/${service-name}" endpoint, which returns the Cloud Run URL, among other things.  I grabbed that value when my container started and used it anytime I had to create a task.

(And yes, you'll need to get an access token from another "computeMetadata/v1" endpoint in order to access that API.)
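
Put together, the lookup is roughly this in Go.  I'm assuming the usual metadata token endpoint and that the response carries the public URL under status.url (the region, project ID, and service name below are placeholders):

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

// metadataGet performs a GET against the instance metadata server.
func metadataGet(path string) (*http.Response, error) {
    req, err := http.NewRequest("GET", "http://metadata.google.internal/computeMetadata/v1/"+path, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("Metadata-Flavor", "Google")
    return http.DefaultClient.Do(req)
}

func main() {
    // 1. Grab an access token for the service account running this container.
    tokResp, err := metadataGet("instance/service-accounts/default/token")
    if err != nil {
        log.Fatal(err)
    }
    var tok struct {
        AccessToken string `json:"access_token"`
    }
    json.NewDecoder(tokResp.Body).Decode(&tok)
    tokResp.Body.Close()

    // 2. Ask the regional Cloud Run endpoint about this service.  The region,
    //    project ID, and service name below are placeholders.
    url := "https://us-central1-run.googleapis.com/apis/serving.knative.dev/v1/namespaces/my-project/services/my-service"
    req, err := http.NewRequest("GET", url, nil)
    if err != nil {
        log.Fatal(err)
    }
    req.Header.Set("Authorization", "Bearer "+tok.AccessToken)
    svcResp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer svcResp.Body.Close()

    // 3. The service object carries its public URL under status.url.
    var svc struct {
        Status struct {
            URL string `json:"url"`
        } `json:"status"`
    }
    json.NewDecoder(svcResp.Body).Decode(&svc)
    fmt.Println("Cloud Run base URL:", svc.Status.URL)
}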

For more information on that endpoint, see this reference.

Once you have the Cloud Run URL, the main difference between an App Engine task and an HTTP task is that App Engine tasks only have the path while the HTTP task has the full URL to hit.  I had to change up my internals a bit so that the application knew its base URL, but it wasn't much work.
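
For reference, creating an HTTP task with the Go client library looks roughly like this; the queue path, URLs, and the exact proto import path (it moved between client-library versions) are assumptions you'll need to adjust:

package main

import (
    "context"
    "log"

    cloudtasks "cloud.google.com/go/cloudtasks/apiv2"
    taskspb "google.golang.org/genproto/googleapis/cloud/tasks/v2"
)

func main() {
    ctx := context.Background()
    client, err := cloudtasks.NewClient(ctx)
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    // Unlike an App Engine task (which only needs a path), an HTTP task needs
    // the full URL -- hence looking up the Cloud Run base URL at startup.
    baseURL := "https://my-service-abc123-uc.a.run.app" // placeholder

    _, err = client.CreateTask(ctx, &taskspb.CreateTaskRequest{
        Parent: "projects/my-project/locations/us-central1/queues/my-queue", // placeholder
        Task: &taskspb.Task{
            MessageType: &taskspb.Task_HttpRequest{
                HttpRequest: &taskspb.HttpRequest{
                    HttpMethod: taskspb.HttpMethod_POST,
                    Url:        baseURL + "/tasks/nightly-sync", // placeholder path
                },
            },
        },
    })
    if err != nil {
        log.Fatal(err)
    }
}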

To learn more about HTTP tasks, see this document.

Medium: Build a custom tool to deploy cron jobs

App Engine used to take care of all the cron job stuff for you, but Cloud Run does not.  At the time of this writing, Cloud Run has a beta command that'll create/update cron jobs from a YAML file, but it's not GA yet.

I built a simple tool that calls gcloud scheduler jobs list, parses the JSON output, and compares that to a custom YAML file that I have that describes my cron jobs.  Because cron jobs no longer refer to an App Engine project, they can reference any HTTP endpoint, including a particular Cloud Run service.  My YAML file has an option for the Cloud Run service name, and my tool looks up the Cloud Run URL for that service and appends the path onto it.
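
The heart of that tool is just shelling out to gcloud and unmarshaling the JSON; a rough sketch of that part (the JSON field names here are my assumption about what gcloud emits, so check the actual output):

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "os/exec"
)

// existingJob is the subset of the Cloud Scheduler job fields we compare
// against the YAML file (assumed field names; adjust to what gcloud emits).
type existingJob struct {
    Name     string `json:"name"`
    Schedule string `json:"schedule"`
}

func main() {
    out, err := exec.Command("gcloud", "scheduler", "jobs", "list", "--format=json").Output()
    if err != nil {
        log.Fatal(err)
    }
    var jobs []existingJob
    if err := json.Unmarshal(out, &jobs); err != nil {
        log.Fatal(err)
    }
    for _, j := range jobs {
        fmt.Printf("%s -> %s\n", j.Name, j.Schedule)
    }
    // From here, compare against the jobs described in your own YAML file and
    // create/update/delete via "gcloud scheduler jobs ..." as needed.
}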

It's not a lot of work, but it's work that I had to do in order to make my workflows make sense again.

Also note that the human-readable cron syntax ("every 24 hours") is gone, and you'll need to use the standard "minute hour day-of-month month day-of-week" syntax that you're used to from most other cron-related things (for example, "0 3 * * *" for 3:00 AM every day).  You can also specify which time zone each cron job should use for interpreting that syntax.

To learn more about the cron syntax, see this document.

Medium: Update your custom domains

At minimum, you will now have a completely different URL for your application.  If you're using custom domains, you can use Cloud Run's custom domains much like you would use App Engine's custom domains.

However, as far as Google is concerned, this is basically removing a custom domain from App Engine and creating a custom domain in Cloud Run.  This means that it'll take anywhere from 20-40 minutes for HTTPS traffic to your domain to work after you make the switch.

Try to schedule this for when you have minimal traffic.  As far as I can tell, there's nothing that you can do to speed it up.  You may wish to consider using a load balancer instead (and giving the load balancer the TLS certificate responsibilities), but I didn't want to pay extra for a service that's basically free and only hurts me when I need to make domain changes.

Medium: Enable compression

Because App Engine ran behind Google's CDN, compression was handled by default.  Cloud Run does no such thing, so if you want compression, you'll have to do it yourself.  (Yes, you could pay for a load balancer, but that doesn't reduce your costs by 100%.)  Most HTTP frameworks have an option for it, and if you want to roll your own, it's fairly straightforward.
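
If you do decide to roll your own, the basic shape is a middleware that checks Accept-Encoding and wraps the ResponseWriter in a gzip writer.  Here's a minimal sketch (it skips the edge cases, like minimum sizes, content types, and Content-Length handling, that a real package deals with):

package main

import (
    "compress/gzip"
    "log"
    "net/http"
    "strings"
)

// gzipResponseWriter sends the response body through a gzip.Writer.
type gzipResponseWriter struct {
    http.ResponseWriter
    gz *gzip.Writer
}

func (w gzipResponseWriter) Write(b []byte) (int, error) {
    return w.gz.Write(b)
}

// withGzip compresses responses for clients that advertise gzip support.
func withGzip(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if !strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
            next.ServeHTTP(w, r)
            return
        }
        w.Header().Set("Content-Encoding", "gzip")
        gz := gzip.NewWriter(w)
        defer gz.Close()
        next.ServeHTTP(gzipResponseWriter{ResponseWriter: w, gz: gz}, r)
    })
}

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte(strings.Repeat("hello world ", 100)))
    })
    log.Fatal(http.ListenAndServe(":8080", withGzip(mux)))
}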

I ended up going with https://github.com/NYTimes/gziphandler.  You can configure a minimum response size before compressing, and if you don't want to compress everything (for example, image files are already compressed internally these days), you have to provide it with a list of MIME types that should be compressed.

Hard: Enable caching

Because App Engine ran behind Google's CDN, caching was handled by default.  If you set a Cache-Control header, the CDN would take care of doing the job of an HTTP cache.  If you did it right, you could reduce the traffic to your application significantly.  Cloud Run does no such thing, and your two practical options are (1) pay for a load balancer (which does not reduce your costs by 100%) or (2) roll your own in some way.

There's nothing stopping you from dropping in a Cloud Run service that is basically an HTTP cache that sits in front of your actual service, but I didn't go that route.  I decided to take advantage of the fact that my application knew more about caching than an external cache ever could.

For example, my application knows that its static files can be cached forever because the moment that I replace the service with a new version, the cache will die with it and a new cache will be created.  For an external cache, you can't tell it, "hey, cache this until the backing service randomly goes away", so you have to set practical cache durations such as "10 minutes" (so that when you do make application changes, you can be reasonably assured that they'll get to your users fairly quickly).

I ended up writing an HTTP handler that parsed the Cache-Control header, cached things that needed to be cached, supported some extensions for the things that my application could know (such as "cache this file forever"), and served those results, so my main application code could still generally operate exactly as if there were a real HTTP cache in front of it.
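
A stripped-down sketch of that idea (not my actual handler; it caches GET responses in memory for a fixed duration, only if they opted in via Cache-Control, and ignores Vary, memory limits, and most of the Cache-Control grammar) looks something like this:

package main

import (
    "log"
    "net/http"
    "net/http/httptest"
    "strings"
    "sync"
    "time"
)

type cachedResponse struct {
    status  int
    header  http.Header
    body    []byte
    expires time.Time
}

// cachingHandler is a very small in-process HTTP cache for GET requests.
type cachingHandler struct {
    next    http.Handler
    ttl     time.Duration
    mu      sync.Mutex
    entries map[string]cachedResponse
}

func (c *cachingHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    if r.Method != http.MethodGet {
        c.next.ServeHTTP(w, r)
        return
    }
    key := r.URL.String()

    c.mu.Lock()
    entry, ok := c.entries[key]
    c.mu.Unlock()
    if ok && time.Now().Before(entry.expires) {
        writeCached(w, entry)
        return
    }

    // Record the downstream response so we can both serve and cache it.
    rec := httptest.NewRecorder()
    c.next.ServeHTTP(rec, r)

    entry = cachedResponse{
        status:  rec.Code,
        header:  rec.Header().Clone(),
        body:    rec.Body.Bytes(),
        expires: time.Now().Add(c.ttl),
    }
    // Only cache successful responses that opted in via Cache-Control.
    if rec.Code == http.StatusOK && strings.Contains(rec.Header().Get("Cache-Control"), "max-age") {
        c.mu.Lock()
        c.entries[key] = entry
        c.mu.Unlock()
    }
    writeCached(w, entry)
}

func writeCached(w http.ResponseWriter, e cachedResponse) {
    for k, vs := range e.header {
        for _, v := range vs {
            w.Header().Add(k, v)
        }
    }
    w.WriteHeader(e.status)
    w.Write(e.body)
}

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/static/", func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Cache-Control", "max-age=600")
        w.Write([]byte("some static content"))
    })
    cache := &cachingHandler{next: mux, ttl: 10 * time.Minute, entries: map[string]cachedResponse{}}
    log.Fatal(http.ListenAndServe(":8080", cache))
}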

Extra: Consider using "alpine" instead of "scratch" for Docker

I use statically compiled binaries (written in Go) for all my stuff, and I wasted a lot of time using "scratch" as the basis for my new Cloud Run Dockerfile.

The tl;dr here is that the Google libraries really, really want to validate the certificates of the Google APIs, and "scratch" does not install any CA certificates or anything.  I was getting nasty errors about things not working, and it took me a long time to realize that Google's packages were upset about not being able to validate the certificates in their HTTPS calls.

If you're using "alpine", just add apk add --no-cache ca-certificates to your build.

Also, if you do any kind of time zone work (all of my application's users and such have a time zone, for example), you'll also want to include the timezone data (otherwise, your date/time calls will fail).

If you're using "alpine", just add apk add --no-cache tzdata to your build.  Also don't forget to set the system timezone to UTC via echo UTC > /etc/timezone.

Extra: You may need your project ID and region for tooling

Once running, your application can trivially find out what its project ID and region are from the "computeMetadata/v1" APIs, but I found that I needed to know these things in advance while running my deployment tools.  Your case may differ from mine, but I needed these in advance.

I have multiple projects (for example, a production one and a staging one), and my tooling needed to know the project ID and region for whatever project it was building/deploying.  I added this to a config file that I could trivially replace with a copy from a directory of such config files.  I just wanted to mention that you might run into something similar.

Extra: Don't forget to disable App Engine

Once everything looks good, disable App Engine in your project.  Make sure that you unlinked your Datastore first, otherwise you'll also disable that when you disable App Engine.

Extra: Wait a month before celebrating

While App Engine's quotas reset every day, Cloud Run's reset every month.  This means that you won't necessarily be able to prove out your cost savings until you have spent an entire month running in Cloud Run.

After a few days, you should be able to eyeball the data and check the billing reports to get a sense of what your final costs will be, but remember: you get 48 hours of free CPU time before they start charging you, and if your application is CPU heavy, it might take a few days or weeks to burn through those free hours.  Even after going through the free tier, the cost of subsequent usage is quite reasonable.  However, the point is that you can't draw a trend line on costs based on the first few days of running.  Give it a full month to be sure.

Extra: Consider optimizing your code for CPU consumption

In an instance world like App Engine, as long as your application isn't using up all of the CPU and forcing App Engine to spawn more instances or otherwise have requests queue up, CPU performance isn't all that important.  Ultimately, it'll cost you the same regardless of whether you can optimize your code to shave off 20% of its CPU consumption.

With Cloud Run, you're charged, in essence, by CPU cycle, so you can drastically reduce your costs by optimizing your code.  And if you can keep your stuff efficient enough to stay within the free tier, then your application is basically free.  And that's wild.

If you're new to CPU optimization or haven't done it in a while, consider enabling something like "pprof" on your application and then capturing a 30 second workload.  You can view the results with a tool like KCachegrind to get a feel for where your application is spending most of its CPU time and see what you can do about cleaning that up.  Maybe it's optimizing some for loops; maybe it's caching; maybe it's using a slightly more efficient data structure for the work that you're doing.  Whatever it is, find it and reduce it.  (And start with the biggest things first.)
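
In Go, for example, wiring up pprof is a couple of lines, and then you can pull a CPU profile from the running service (a sketch; the side port is arbitrary, and the pprof paths are the standard net/http/pprof ones):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
    // Serve the profiling endpoints on a side port (don't expose this publicly).
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // ... the rest of your application ...
    select {}
}

// Then capture a 30-second CPU profile with:
//   go tool pprof -http=:8081 http://localhost:6060/debug/pprof/profile?seconds=30
// or export callgrind output for KCachegrind:
//   go tool pprof -callgrind -output=profile.callgrind http://localhost:6060/debug/pprof/profile?seconds=30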

Conclusion

Welcome to Cloud Run!  I hope your cost savings are as good as mine.

Edit (2022-10-19): added caveats about HTTP compression and caching.

Sunday, July 10, 2022

Publishing a Docker image for a GitHub repo

It's 2022, and if you're making a GitHub project, chances are that you'll need to publish a Docker image at some point.  These days, it's really easy with GitHub CI/CD and their "actions", which generally take care of all of the hard work.

I'm assuming that you already have a working GitHub CI/CD workflow for building whatever it is that you're building, and I'm only going to focus on the Docker-specific changes that need to be made.

You'll want to set up the following workflows:
  1. Upon "pull_request" for the "master" branch, build the Docker image (to make sure that the process works), but don't actually publish it.
  2. Upon "push" for the "master" branch, build the Docker image and publish it as "latest".
  3. Upon "release" for the "master" branch, build the Docker image and publish it with the release's tag.
Before you get started, you'll need to create a Dockerhub "access token" to use for your account.  You can do this under "Account Settings" → "Security" → "Access Tokens".

Workflow updates

Upon "pull_request"

This workflow happens every time a commit is made to a pull request.  While we want to ensure that our Docker image is built so that we know that the process works, we don't actually want to publish that image (at least I don't; you might have a use for it).

To build the Docker image, just add the following to your "steps" section:
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v2
- name: Build
  uses: docker/build-push-action@v3
  with:
    context: .
    push: false
    tags: YOUR_GROUP/YOUR_REPO:latest

Simple enough.

The "docker/setup-buildx-action" action does whatever magic needs to happen for Docker stuff to work in the pipeline, and the "docker/build-push-action" builds the image from your Dockerfile and pushes the image.  Because we're setting "push: false", it won't actually push.

Upon "merge" into "master"

This workflow happens every time a PR is merged into the "master" branch.  In this case, we want to do everything that we did for the "pull_request" case, but we also want to push the image.

The changes here are that we'll set "push: true" and also specify our Dockerhub username and password.

To build and push the Docker image, just add the following to your "steps" section:
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v2
- name: Login to DockerHub
  uses: docker/login-action@v2
  with:
    username: ${{ secrets.DOCKERHUB_USERNAME }}
    password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Build
  uses: docker/build-push-action@v3
  with:
    context: .
    push: true
    tags: YOUR_GROUP/YOUR_REPO:latest

Boom.

The new action "docker/login-action" logs into Dockerhub with your username and password, which is necessary to actually push the image.

Upon "release"

This workflow happens every time a release is created.  This is generally similar to the "merge" case, except that instead of using the "latest" tag, we'll be using the release's tag.

To build and push the Docker image, just add the following to your "steps" section:
- name: Set up Docker Buildx
  uses: docker/setup-buildx-action@v2
- name: Login to DockerHub
  uses: docker/login-action@v2
  with:
    username: ${{ secrets.DOCKERHUB_USERNAME }}
    password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Build
  uses: docker/build-push-action@v3
  with:
    context: .
    push: true
    tags: YOUR_GROUP/YOUR_REPO:${{ github.event.release.tag_name }}

And that's it.  The "github.event.release.tag_name" variable holds the name of the Git tag, which is what we'll use for the Docker tag.