Thursday, March 19, 2026

Signs of a current Chromium crash in GCP Cloud Run when using go-rod

This is mainly here so that I can find it again if it ever comes up, or maybe you'll find it when you're googling a log message.

The tl;dr is that Chromium crashes in GCP Cloud Run (on Alpine Linux if that's relevant) at version 146.0.7680.80-r0.

Why?  I'm not quite sure.  Yesterday, when it was running version 144.0.7559.132-r4, it ran fine.

(We use Chromium under the hood with go-rod to generate PDFs as part of our application.)

In the logs, we saw this:
Connecting to browser at endpoint: ws://127.0.0.1:42590/devtools/browser/f2909d30-39b2-4d59-9fab-e81d79a8f006
Chromium path: /usr/bin/chromium
Recovered from panic: [*net.OpError] write tcp 127.0.0.1:58219->127.0.0.1:42590: use of closed network connection
Uncaught signal: 4, pid=280, tid=280, fault_addr=94169238392146.

As best as I can tell, signal 4 on Linux is SIGILL (illegal instruction), which generally means the CPU was handed an instruction it couldn't execute.  Anyway, the (automatic) Chromium version change was the only thing that could explain the problem, and we rolled back by using "alpine:3.22" instead of "alpine:latest" in our Docker image.

This instantly fixed the problem.  I don't know why it broke, but I know where it broke.  Hopefully they get this fixed up, and now we know that we have to actually test this in GCP if we ever want to un-pin the Alpine version and upgrade Chromium again.
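For reference, the rollback was a one-line change at the top of the Dockerfile (the install line here is a sketch of our setup, not the exact file):

```dockerfile
# Pin the base image so that the chromium package resolves against a
# known Alpine release instead of whatever alpine:latest ships today.
FROM alpine:3.22

# Chromium ends up at /usr/bin/chromium, the path shown in the logs above.
RUN apk add --no-cache chromium
```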

Saturday, August 2, 2025

Troubleshooting a "Killed: 9" error in MacOS for a Go application

We recently switched from building and signing our MacOS build from an annoying MacOS box to a happy Linux container in GitHub.  (We used a Rust-based tool called rcodesign, which, after a bit of trial and error, worked wonderfully.)

Last week, a support ticket came in that the newest version of the application wasn't working on MacOS.  That was strange because I had spent hours personally testing it on a MacOS laptop to make sure that the new build process was flawless.

The customer's symptom: our application wasn't running.

The actual symptom: whenever you ran our application, it was instantly killed and this was the only output:
Killed: 9

Let the troubleshooting begin

The "Killed: 9" thing was suspicious because that's the kind of error that we would get when we didn't sign our MacOS binaries correctly (which is why I was worried that it was related to switching our build process).  However, I confirmed that we were signing them correctly, and the local MacOS tools agreed:
codesign -vvv /Applications/my-app.app/Contents/MacOS/my-app
/Applications/my-app.app/Contents/MacOS/my-app: valid on disk
/Applications/my-app.app/Contents/MacOS/my-app: satisfies its Designated Requirement

The only difference between the last version that worked and the one that didn't was a single dependency upgrade in our go.mod file: github.com/projectdiscovery/nuclei/v3.  Nuclei is used for pentesting, which our application can do as part of its suite of tools.

Our application doesn't even use the nuclei code unless it's specifically requested, and we were getting "Killed: 9" even when we ran it with "--help".

This seemed nuts; how could upgrading a package that we use for a specific scheduled action cause MacOS to instant-kill the application with "Killed: 9"?

What I knew so far:
  1. The application wasn't doing anything interesting (it never printed a single one of our logs).
  2. MacOS was killing it on purpose.

What's the kernel think?

I asked the kernel for its logs while I ran our application with "--help":
sudo log stream --predicate 'sender = "kernel"'

When it ran, the only interesting thing appeared to be this line:
kernel: CODE SIGNING: process 27627[my-app]: rejecting invalid page at address 0x10232000 from offset 0x0 in file "<nil>" (cs_mtime:0.0 == mtime:0.0) (signed:0 validated:0 tainted:0 nx:0 wpmapped:1 dirty:0 depth:0)

So MacOS was doing something related to the signing of our binaries, but I didn't know what.  Remember, asking it about the binary itself resulted in no errors, but I was seeing one now when it ran.

The error did reference a file ("<nil>"), so maybe if I built it with debug symbols it would help.

I also stumbled onto some diagnostic information in /Library/Logs/DiagnosticReports with a series of files called "my-app-XXXX-XX-XX-XXXXXX.ips", where the X's represented a date-time.  Those were JSON files with some information, and one of them appeared to be a stack trace:
  [...]
  "faultingThread": 0,
  "threads": [
    {
      "triggered": true,
      "id": 734760,
      "threadState": {
        [...]
        "trap": {
          "value": 14,
          "description": "(invalid protections for user instruction read)"
        },

Other than that error ("invalid protections for user instruction read") in thread 0, there wasn't much useful information.

I rebuilt the application with debug symbols this time and tried again.

Same deal, but this time, in the stack trace in the ".ips" file, register 8 had the symbol:
        "r8": {
          "value": 133683616,
          "symbolLocation": 0,
          "symbol": "*/jitdec.Decode"
        },

Aha!  A function!  It was calling "jitdec.Decode".

Source code archaeology

"jitdec.Decode" wasn't anything that I had heard of before (it certainly wasn't one of our functions), so I googled.

"jitdec.Decode" is part of the "bytedance/sonic" package, which is a high-performance JSON package for Go.  Nuclei uses sonic instead of the standard JSON package for some reason; I guess someone complained at some point that its JSON operations were too slow or something.

I now knew where it was breaking, but I didn't know (1) why, or (2) why that function was even being called.

Since it was breaking during "--help", I suspected that it had to be an "init" function somewhere.  In Go, an imported package's "init" functions run before "main" does, so nuclei had to be doing something stupid up front.  After a bunch of digging through source code, it turned out that nuclei did have an "init" function where it loaded the local configuration from disk.  The format of that local configuration?  JSON.

Okay, so I knew that on startup, the application would try to load the nuclei config files, and that doing that called the "jitdec.Decode" function, which caused MacOS to kill the application.

Why kill it now all of a sudden?  Nuclei had been using the sonic package for ages.

I diffed the nuclei versions involved (we upgraded from v3.4.5 to v3.4.7); nothing looked all that interesting in terms of actual code that changed, but they did upgrade sonic from v1.12.8 to v1.13.3.

I then diffed those two sonic versions and found some changes in the "jitdec" package.  In particular, there was a "//go:build" constraint that used to exclude Go 1.24 (which we use), and after the upgrade, it only excluded Go 1.25.  So whatever was in those files used to be skipped for our Go version, and now it wasn't.
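The mechanics of that constraint are easy to miss (the tag below is illustrative, not sonic's exact tag): "//go:build !go1.N" means "compile this file only on toolchains older than Go 1.N", so raising N silently pulls the file into builds that used to skip it.

```go
//go:build !go1.99

// The constraint above reads "exclude Go 1.99 and newer", so on any
// current toolchain this file is compiled in. sonic's jitdec files use
// the same mechanism; bumping the excluded version from go1.24 to
// go1.25 is what switched the JIT path on for our Go 1.24 build.

package main

import "fmt"

// includedByConstraint only exists if the build constraint selected
// this file.
const includedByConstraint = true

func main() {
	fmt.Println("selected by build constraint:", includedByConstraint)
}
```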

What was this whole "jitdec" thing, anyway?  Apparently it does some just-in-time (JIT) stuff to somehow decode JSON faster?  Again, this seems like overkill for a project (nuclei) that is not, to the best of my knowledge, bottlenecked by slow JSON encoding and decoding.

But that's the difference: previously, our application didn't include any JIT, and now it did.  And because it read the nuclei config files before anything else, the application was compiling code on the fly to decode all 200 bytes as fast as possible to crush the benchmarks.  MacOS noticed that the application was running code that wasn't signed, and it killed the application.

Now what?

Entitlements

My first instinct was to see if I could just switch the package to the standard JSON package, but I couldn't figure out how to do that.  Some packages let you do a specific import up front to turn on or off their weird, high-performance overrides, but not nuclei.  So we were stuck with JIT.

MacOS has a series of "entitlements" for an application: at signing time, you also bundle in a list of special things that the application is allowed to do.  One of them relates to JIT specifically, and another relates to unsigned executable memory.

rcodesign has a "--entitlements-xml-file" option to specify an entitlements file.  I plugged that in, rebuilt, reinstalled, and tested everything, and the application ran normally.

In this particular case, I had to grant these two entitlements to my application to make MacOS happy about whatever sonic was doing:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
       <key>com.apple.security.cs.allow-jit</key>
       <true/>
       <key>com.apple.security.cs.allow-unsigned-executable-memory</key>
       <true/>
</dict>
</plist>

To verify that the application was properly built with those entitlements, you can run:
codesign -d --entitlements - --xml /Applications/my-app.app/Contents/MacOS/my-app

It should dump out that same entitlements XML file (but all on one line).

Summary

MacOS sucks.  It's horrible to work with and the documentation is bad.

However, these things might help you in the future:
  1. You need to sign your MacOS applications.
  2. The MacOS kernel will kill your application if it doesn't like something about it ("Killed: 9").
  3. Use the "codesign" tool to see what MacOS thinks about your installed application.
  4. Use the "log stream" command to see what the kernel is doing when it kills your application.
  5. Turn on debug symbols during your build process.
  6. Hunt down the ".ips" file for your killed application and see what the stack trace looks like.
  7. There are a whole bunch of entitlements that your application may need to have; if it's a runtime error, there's a chance that it needs one that it doesn't have.

Wednesday, April 5, 2023

Dealing with KDE "plasmashell" freezing

I've been using KDE for over a decade now, and something started happening in the past year or two (at least on Kubuntu 22.04): my whole screen would mostly freeze.  Generally, I'd be able to alt-tab between windows, interact with them, etc., but I couldn't click on or interact with anything related to the window manager (the title bars, the task bar, etc.).

In my case, I'd immediately notice when I came back to my desk and there was obviously a notification at some point, but the rendering got all screwed up:


In this image, you can see that the notification toast window has no visible content and instead looks like the KDE background image.  Also, the time is locked at 5:29 PM, which is when this problem happened (I didn't get back to my desk until 8:30 AM the next morning).

The general fix for this is to use a shell (if you have one open, great; if not, press ctrl+alt+F2 to jump to the console) and kill "plasmashell":
killall plasmashell

Once that's done, your window manager should be less broken, but it won't have the taskbar, etc.  From there, you can press alt+F2 to open the "run" window, and type in:
plasmashell --replace


You can also run this from a terminal somewhere, but you need to make sure that your "DISPLAY" environment variable is set up correctly, etc.  I find it easier to do it from the run window (and I don't have to worry about redirecting its output anywhere, since "plasmashell" does generate some logging noise).


Friday, February 10, 2023

Using a dynamic PVC on Kubernetes agents in Jenkins

I recently had to create a Jenkins job that needed to use a lot of disk space. The short version of the story is that the job needed to dump the contents of a Postgres database and upload that to Artifactory, and the "jfrog" command line tool won't let you stream an upload, so the entire dump had to be present on disk in order for it to work.

I run my Jenkins on Kubernetes, and the Kubernetes hosts absolutely didn't have the disk space needed to dump this database, and it was definitely too big to use a memory-based filesystem.

The solution was to use a dynamic Persistent Volume Claim, which is maybe(?) implemented as an ephemeral volume in Kubernetes, but the exact details of what it does under the hood aren't important.  What is important is that, as part of the job running, a new Persistent Volume Claim (PVC) gets created and is available for all of the containers in the pod.  When the job finishes, the PVC gets destroyed.  Perfect.

I couldn't figure out how to create a dynamic PVC as an ordinary volume that would get mounted on all of my containers (it's a thing, but apparently not for a declarative pipeline), but I was able to get the "workspace" dynamic PVC working.

A "workspace" volume is shared across all of the containers in the pod and have the Jenkins workspace mounted.  This has all of the Git contents, including the Jenkinsfile, for the job (I'm assuming that you're using Git-based jobs here).  Since all of the containers share the same workspace volume, any work done in one container is instantly visible in all of the others, without the need for Jenkins stashes or anything.

The biggest problem that I ran into was the permissions on the "workspace" file system.  Each of my containers had a different idea of what the UID of the user running the container would be, and all of the containers have to agree on the permissions around their "workspace" volume.

I ended up cheating and just forcing all of my containers to run as root (UID 0), since (1) everyone could agree on that, and (2) I didn't have to worry about "sudo" not being installed on some of the containers that needed to install packages as part of their setup.

Using "workspace" volumes

To use a "workspace" volume, set workspaceVolume inside the kubernetes block:

kubernetes {
   workspaceVolume dynamicPVC(accessModes: 'ReadWriteOnce', requestsSize: "300Gi")
   yaml '''
---
apiVersion: v1
kind: Pod
spec:
   securityContext:
      fsGroup: 0
      runAsGroup: 0
      runAsUser: 0
   containers:
[...]

In this example, we allocate a 300GiB volume for the duration of the job running.

In addition, you can see that I set the user and group information to 0 (for "root"), which let me work around all the annoying UID mismatches across the containers.  If you only have one container, then obviously you don't have to do this.  Also, if you have full control of your containers, then you can probably set them up with a known user with a fixed UID who can sudo, etc., as necessary.

For more information about using Kubernetes agents in Jenkins, see the official docs, but (at least as of the time of this writing) they're missing a whole lot of information about volume-related things.

Troubleshooting

If you see Jenkins trying to create and then delete pods over and over and over again, you have something else wrong.  In my case, the Kubernetes service account that Jenkins uses didn't have any permissions around "persistentvolumeclaims" objects, so every time that the Pod was created, it would fail and try again.

I was only able to see the errors in the Jenkins logs in Kubernetes; they looked something like this:

Caused: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.100.0.1:443/api/v1/namespaces/cicd/persistentvolumeclaims. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. persistentvolumeclaims is forbidden: User "system:serviceaccount:cicd:default" cannot create resource "persistentvolumeclaims" in API group "" in the namespace "cicd".

I didn't have the patience to figure out exactly what was needed, so I just gave it everything:

- verbs:
    - create
    - delete
    - get
    - list
    - patch
    - update
    - watch
  apiGroups:
    - ''
  resources:
    - persistentvolumeclaims


Tuesday, January 31, 2023

Use a custom login page when using Apache to require sign-in

Apache has its own built-in authentication system(s) for providing access control to a site that it's hosting.  You've probably encountered this before using "basic" authentication backed by a flatfile created and edited using the htpasswd command.

If you do this using the common guides on the Internet (for example, this guide from Apache itself), then when you go to your site, you'll be presented with your browser's built-in basic-authentication dialog box asking for a username and password.  If you provide valid credentials, then you'll be moved on through to the main site, and if you don't, then it'll dump you to a plain "401 Unauthorized" page.

This works fine, but it has three main drawbacks:

  1. Password managers (such as LastPass) can't detect this dialog box and autofill it, which is very annoying.
  2. On some mobile browsers, the dialog gets in the way of normal operations.  Even if you have multiple tabs open, whatever tab is trying to get you to log in will get in the way and force you to deal with it.
  3. If you're using Windows authentication, the browser might detect the 401 error and attempt to sign you in using your domain credentials.  If the server expects a different set of credentials, then you won't actually be able to log in because Windows keeps trying to sign in automatically.

(And the built-in popup is really ugly, and it submits the password in plaintext, etc., etc.)

Apache "Form" Authentication

To solve this problem, Apache has a type of authentication called "form" that adds an extra step involving an HTML form (that's fully customizable).

The workflow is as follows:

  1. Create a login HTML page (you'll have to provide the page).
  2. Register a handler for that page to POST to (Apache already has the handler).
  3. Update any "Directory" or "Location" blocks in your Apache config to use the "form" authentication type instead of "basic".

You'll also need these modules installed and enabled:
  1. mod_auth_form
  2. mod_request
  3. mod_session
  4. mod_session_cookie

On Ubuntu, I believe that these were all installed out of the box but needed to be enabled separately.  On Red Hat, I had to install the mod_session package, but everything was otherwise already enabled.

Example

If you want to try out "form" authentication, I recommend that you get everything working with "basic" authentication first.  This is especially true if you have multiple directories that need to be configured separately.

For this example, I'm going to use our Nagios server.

There were two directories that needed to be protected: "/usr/local/nagios/sbin" and "/usr/local/nagios/share".  This setup is generally described by this document (although it covers "digest" authentication instead of "basic").

For both directories that already had "AuthType" set up, the changes are simple:

  1. Change AuthType Basic to AuthType Form.
  2. Change AuthBasicProvider to AuthFormProvider.
  3. Add the login redirect: AuthFormLoginRequiredLocation "/login.html"
  4. Enable sessions: Session On
  5. Set a cookie name: SessionCookieName session path=/

I decided to put my login page at "/login.html" because that makes sense, but you could put it anywhere (and even host it on a different server if you specify a full URL instead of just a path).

That page should contain a "form" with two "input" elements: "httpd_username" and "httpd_password".  The form "action" should be set to "/do-login.html" (or whatever handler you want to register with Apache).

At its simplest, "login.html" looks like this:

<form method="POST" action="/do-login.html">
  Username: <input type="text" name="httpd_username" value="" />
  Password: <input type="password" name="httpd_password" value="" />
  <input type="submit" name="login" value="Login" />
</form>

You'll probably want an "html" tag, a title and body and such, maybe some CSS, but this'll get the job done.

The last step is to register the thing that'll actually process the form data: "/do-login.html"

In your Apache config, add a "location" for it:

<Location "/do-login.html">
  SetHandler form-login-handler

  AuthType form
  AuthName "Nagios Access"
  AuthFormProvider file
  AuthUserFile /path/to/your/htpasswd.users

  AuthFormLoginRequiredLocation "/login.html"
  AuthFormLoginSuccessLocation "/nagios/"

  Session On
  SessionCookieName session path=/
</Location>

The key thing here is SetHandler form-login-handler.  This tells Apache to use its built-in form handler to take the values from httpd_username and httpd_password and compare them against your authentication provider(s) (in this example, it's just a flatfile, but you could use LDAP, etc.).

The other two options handle the last bit of navigation.  AuthFormLoginRequiredLocation sends you back to the login page if the username/password combination didn't work (you could potentially have another page here with an error message pre-written).  AuthFormLoginSuccessLocation sends you to the place where you want the user to go after login (I'm sending the user to the main Nagios page, but you could send them anywhere).

Notes

Other Authentication Providers

I've just covered the "file" authentication provider here.  If you use "ldap" and/or any others, then that config will need to be copied to every single place where you have "form" authentication set up, just like you would if you were only using the "file" provider.

I found this to be really annoying, since I had two directories to protect plus the form handler, so that brings over another 4 lines or so to each config section, but what matters is that it works.


Wednesday, October 19, 2022

Watch out for SNI when using an nginx reverse proxy

From time to time, I'll have a use case where some box needs to talk to some website that it can't reach (through networking issues), and the easiest thing to do is to throw an nginx reverse proxy on a network that it can reach (such that the reverse proxy can reach both).

The whole shtick of a reverse proxy is that you can access the reverse proxy directly and it'll forward the request on to the appropriate destination and more or less masquerade itself as if it were the destination.  This is in contrast with a normal HTTP proxy that would be configured separately (if supported by whatever tool you're trying to use).  Sometimes a normal HTTP proxy is the best tool for the job, but sometimes you can cheat with a tweak to /etc/hosts and a reverse proxy and nobody needs to know what happened.

Here, we're focused on the reverse proxy.

In this case, we have the following scenario:

  1. Box 1 wants to connect to site1.example.com.
  2. Box 1 cannot reach site1.example.com.

To cheat using a reverse proxy, we need Box 2, which:
  1. Can be reached by Box 1.
  2. Can reach site1.example.com.

To set up the whole reverse proxy thing, we need to:
  1. Set up nginx on Box 2 to listen on port 443 (HTTPS) and reverse proxy to site1.example.com.
  2. Update /etc/hosts on Box 1 so that site1.example.com points to Box 2's IP address.

At first, I was seeing this error message on the reverse proxy's "nginx/error.log":

connect() to XXXXXX:443 failed (13: Permission denied) while connecting to upstream, client: XXXXXX, server: site1.example.com, request: "GET / HTTP/1.1"

"Permission denied" isn't great, and that told me that it was something OS-related.

Of course, it was an SELinux thing (in /var/log/messages):

SELinux is preventing /usr/sbin/nginx from name_connect access on the tcp_socket port 443.

The workaround was:

setsebool -P nis_enabled 1

This also was suggested by the logs, but it didn't seem to matter:

setsebool -P httpd_can_network_connect 1

After fixing that, I was seeing:

SSL_do_handshake() failed (SSL: error:14094410:SSL routines:ssl3_read_bytes:sslv3 alert handshake failure:SSL alert number 40) while SSL handshaking to upstream, client: XXXXXX, server: site1.example.com, request: "GET / HTTP/1.1"

After tcpdump-ing the traffic from Box 1 and also another box that could directly talk to site1.example.com, it was clear Box 1 was not using SNI in its requests (SNI is a TLS extension that passes the host name in plaintext so that proxies and load balancers can properly route name-based requests).

It took way too long for me to find the nginx setting to enable it (I don't know why it's disabled by default), but it's:

proxy_ssl_server_name on;

Anyway, the final nginx config for the reverse proxy on Box 2 was:

server {
  listen 443 ssl;
  server_name site1.example.com;

  ssl_certificate /etc/nginx/ssl/server.crt;
  ssl_certificate_key /etc/nginx/server.key;
  ssl_protocols TLSv1.2;
 
  location / {
    proxy_pass https://site1.example.com;
    proxy_ssl_session_reuse on;
    proxy_ssl_server_name on;
  }
}

As far as Box 1 was concerned, it could connect to site1.example.com with only a small tweak to /etc/hosts.

Wednesday, October 5, 2022

Speed up PNG encoding in Go with NRGBA images

After migrating my application from Google Compute Engine to Google Cloud Run, I suddenly had a use case for optimizing CPU utilization.

In my analysis of my most CPU-intensive workloads, it turned out that the majority of the time was spent encoding PNG files.

tl;dr Use image.NRGBA when you intend to encode a PNG file.

(For reference, this particular application has a Google Maps overlay that synthesizes data from other sources into tiles to be rendered on the map.  The main synchronization job runs nightly and attempts to build or download new tiles for the various layers based on data from various ArcGIS systems.)

Looking at my code, I couldn't really reduce the number of calls to png.Encode, but that encoder really looked inefficient.  I deleted the callgrind files (sorry), but basically, half of the CPU time in png.Encode was around memory operations and some runtime calls.

I started looking around for maybe some options to pass to the encoder or a more purpose-built implementation.  I ended up finding a package that mentioned a speedup, but only for NRGBA images.  However, that package looked fairly unused, and I wasn't about to hand all of my image processing over to something with 1 commit and no users.

This got me thinking, though: what is NRGBA?

It turns out that there are (at least) two ways of thinking about the whole alpha channel thing in images:

  1. In RGBA, each of the red, green, and blue channels has already been premultiplied by the alpha channel, such that the value of, for example, R can range from 0 to A, but no higher.
  2. In NRGBA, each of the red, green, and blue channels has its original value, and the alpha channel merely represents the opacity of the pixel in general.

For my human mind, using various tools and software over the years, when I think of "RGBA", I think of "one channel each for red, green, and blue, and one channel for the opacity of the pixel".  So what this means is that I'm thinking of "NRGBA" (for non-premultiplied RGBA).

(Apparently there are good use cases for both, and when compositing, at some point you'll have to multiply by the alpha value, so "RGBA" already has that done for you.)

Okay, whatever, so what does this have to do with CPU optimization?

In Go, the png.Encode function is optimized for NRGBA images.  There's a tiny little hint about this in the comment for the function:

Any Image may be encoded, but images that are not image.NRGBA might be encoded lossily.

This is corroborated by the PNG rationale document, which explains that

PNG uses "unassociated" or "non-premultiplied" alpha so that images with separate transparency masks can be stored losslessly.

If you want to have the best PNG encoding experience, then you should encode images that use NRGBA already.  In fact, if you open up the code, you'll see that it will convert the image to NRGBA if it's not already in that format.

Coming back to my callgrind analysis, this is where all that CPU time was spent: converting an RGBA image to an NRGBA image.  I certainly thought that it was strange how much work was being done creating a simple PNG file from a mostly-transparent map tile.

Why did I even have RGBA images?  Well, my tiling API has to composite tiles from other systems into a single PNG file, so I simply created that new image with image.NewRGBA.  And why that function?  Because as I mentioned before, I figured "RGBA" meant "RGB with an alpha channel", which is what I wanted so that it would support transparency.  It never occurred to me that "RGBA" was some weird encoding scheme for pixels in contrast to another encoding scheme called "NRGBA"; my use cases had never had me make such a distinction.

Anyway, after switching a few image.NewRGBA calls to image.NewNRGBA (and literally that was it; no other code changed), my code was way more efficient, cutting down on CPU utilization by something like 50-70%.  Those RGBA to NRGBA conversions really hurt.