Sense Codons: Go

Showing posts with label Go. Show all posts

Thursday, March 19, 2026

Signs of a current Chromium crash in GCP Cloud Run when using go-rod

This is mainly here so that I can find it again if it ever comes up, or maybe you'll find it when you're googling a log message.

The tl;dr is that Chromium crashes in GCP Cloud Run (on Alpine Linux if that's relevant) at version 146.0.7680.80-r0.

Why? I'm not quite sure. Yesterday, when it was running version 144.0.7559.132-r4, it ran fine.

(We use Chromium under the hood with go-rod to generate PDFs as part of our application.)

In the logs, we saw this:

Connecting to browser at endpoint: ws://127.0.0.1:42590/devtools/browser/f2909d30-39b2-4d59-9fab-e81d79a8f006

Chromium path: /usr/bin/chromium
Recovered from panic: [*net.OpError] write tcp 127.0.0.1:58219->127.0.0.1:42590: use of closed network connection
Uncaught signal: 4, pid=280, tid=280, fault_addr=94169238392146.

As best as I can tell, signal 4 on Linux has to do with a processor instruction mismatch of some kind. Anyway, the (automatic) Chromium version change was the only thing that could explain the problem, and we rolled back by using "alpine:3.22" instead of "alpine:latest" in our Docker image.

This instantly fixed the problem. I don't know why it broke, but I know where it broke. Hopefully they get this fixed up, and now we know that we have to actually test this in GCP if we ever want to un-pin the Alpine version and upgrade Chromium again.

Saturday, August 2, 2025

Troubleshooting a "Killed: 9" error in MacOS for a Go application

We recently switched from building and signing our MacOS build from an annoying MacOS box to a happy Linux container in GitHub. (We used a Rust-based tool called rcodesign, which, after a bit of trial and error, worked wonderfully.)

Last week, a support ticket came in that the newest version of the application wasn't working on MacOS. That was strange because I had spent hours personally testing it on a MacOS laptop to make sure that the new build process was flawless.

The customer's symptom: our application wasn't running.

The actual symptom: whenever you ran our application, it was instantly killed and this was the only output:

Killed: 9

Let the troubleshooting begin

The "Killed: 9" thing was suspicious because that's the kind of error that we would get when we didn't sign our MacOS binaries correctly (which is why I was worried that it was related to switching our build process). However, I confirmed that we were signing them correctly, and the local MacOS tools agreed:

codesign -vvv /Applications/my-app.app/Contents/MacOS/my-app

/Applications/my-app.app/Contents/MacOS/my-app: valid on disk

/Applications/my-app.app/Contents/MacOS/my-app: satisfies its Designated Requirement

The only difference between the last version that worked and the one that didn't was a single dependency upgrade in our go.mod file: github.com/projectdiscovery/nuclei/v3. Nuclei is used for pentesting, which our application can do as part of its suite of tools.

Our application doesn't even use the nuclei code unless it's specifically requested, and we were getting "Killed: 9" even when we ran it with "--help".

This seemed nuts; how could upgrading a package that we use for a specific scheduled action cause MacOS to instant-kill the application with "Killed: 9"?

What I knew so far:

The application wasn't doing anything interesting (it never printed a single one of our logs).
MacOS was killing it on purpose.

What's the kernel think?

I asked the kernel for its logs while I ran our application with "--help":

sudo log stream --predicate 'sender = "kernel"'

When it ran, the only interesting thing appeared to be this line:

kernel: CODE SIGNING: process 27627[my-app]: rejecting invalid page at address 0x10232000 from offset 0x0 in file "<nil>" (cs_mtime:0.0 == mtime:0.0) (signed:0 validated:0 tainted:0 nx:0 wpmapped:1 dirty:0 depth:0)

So MacOS was doing something related to the signing of our binaries, but I didn't know what. Remember, asking it about the binary itself resulted in no errors, but I was seeing one now when it ran.

The error did reference a file ("<nil>"), so maybe if I built it with debug symbols it would help.

I also stumbed onto some diagnostic information in /Library/Logs/DiagnosticReports with a series of files called "my-app-XXXX-XX-XX-XXXXXX.ips", where the X's represented a date-time. Those were JSON files with some information, and one of them appeared to be a stack trace:

[...]

"faultingThread": 0,

"threads": [

{

"triggered": true,

"id": 734760,

"threadState": {

[...]

"trap": {

"value": 14,

"description": "(invalid protections for user instruction read)"

Other than that error ("invalid protections for user interaction read") in thread 0, there wasn't much useful information.

I rebuilt the application with debug symbols this time and tried again.

Same deal, but this time, in the stack trace in the ".ips" file, register 8 had the symbol:

"r8": {

"value": 133683616,

"symbolLocation": 0,

"symbol": "*/jitdec.Decode"

Aha! A function! It was calling "jitdec.Decode".

Source code archaeology

"jitdec.Decode" wasn't anything that I had heard of before (it certainly wasn't one of our functions), so I googled.

"jitdec.Decode" is part of the "bytedance/sonic" package, which is a high-performance JSON package for Go. Nuclei uses sonic instead of the standard JSON package for some reason; I guess someone complained at some point that its JSON operations were too slow or something.

I now knew where it was breaking, but I didn't know (1) why, or (2) why that function was even being called.

Since it was breaking during "--help", I suspected that it had to be an "init" function somewhere. In Go, an imported package's "init" function is called before anything else, so nuclei had to be doing something stupid up front. After a bunch of digging through source code, it turned out that nuclei did have an "init" function where it loaded the local configuration from disk. The format of that local configuration? JSON.

Okay, so I knew that on startup, the application would try to load the nuclei config files, and that doing that called the "jitdec.Decode" function, which caused MacOS to kill the application.

Why kill it now all of the sudden? Nuclei had been using the sonic package for ages.

I diffed the nuclei versions involved (we upgraded from v3.4.5 to v3.4.7); nothing looked all that interesting in terms of actual code that changed, but they did upgrade sonic from v1.12.8 to v1.13.3.

I diffed the sonic versions involved (they upgraded from v1.12.8 to v1.13.3) and found some changes in the "jitdec" package. In particular, there was a "//go:build" comment that specifically excluded Go 1.24 (which we use), and after the upgrade, it excluded Go 1.25. So whatever was in those files, it used to be skipped for our Go version, and now it wasn't.

What was this whole "jitdec" thing, anyway? Apparently it does some just-in-time (JIT) stuff to somehow decode JSON faster? Again, this seems like overkill for a project (nuclei) that is not, to the best of my knowledge, bottlenecked by slow JSON encoding and decoding.

But that's the difference: previously, our application didn't include any JIT, and now it did. And because it read the nuclei config files before anything else, the application was compiling code on the fly to decode all 200 bytes as fast as possible to crush the benchmarks. MacOS noticed that the application was running code that wasn't signed, and it killed the application.

Now what?

Entitlements

My first instinct was to see if I could just switch the package to the standard JSON package, but I couldn't figure out how to do that. Some packages let you do a specific import up front to turn on or off their weird, high-performance overrides, but not nuclei. So we were stuck with JIT.

MacOS has a series of "entitlements" for an application: at signing time, you also bundle in a list of special things that the application is allowed to do. One of them is relates to JIT specifically, and one relates to unsigned executable memory.

rcodesign has a "--entitlements-xml-file" option to specify an entitlements file. I plugged that in, rebuilt, reinstalled, and tested everything, and the application ran normally.

In this particular case, I had to grant these two entitlements to my application to make MacOS happy about whatever sonic was doing:

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">

<dict>

<key>com.apple.security.cs.allow-jit</key>

<true/>

<key>com.apple.security.cs.allow-unsigned-executable-memory</key>

<true/>

</dict>

</plist>

To verify that the application was properly built with those entitlements, you can run:

codesign -d --entitlements - --xml /Applications/my-app.app/Contents/MacOS/my-app

It should dump out that same entitlements XML file (but all on one line).

Summary

MacOS sucks. It's horrible to work with and the documentation is bad.

However, these things might help you in the future:

You need to sign your MacOS applications.
The MacOS kernel will kill your application if it doesn't like something about it ("Killed: 9").
Use the "codesign" tool to see what MacOS thinks about your installed application.
Use the "log stream" command to see what the kernel is doing when it kills your application.
Turn on debug symbols during your build process.
Hunt down the ".ips" file for your killed application and see what the stack trace looks like.
There are a whole bunch of entitlements that your application may need to have; if it's a runtime error, there's a chance that it needs one that it doesn't have.

Wednesday, October 5, 2022

Speed up PNG encoding in Go with NRGBA images

After migrating my application from Google Cloud Engine to Google Cloud Run, I suddenly had a use case for optimizing CPU utilization.

In my analysis of my most CPU-intensive workloads, it turned out that the majority of the time was spent encoding PNG files.

tl;dr Use image.NRGBA when you intend to encode a PNG file.

(For reference, this particular application has a Google Maps overlay that synthesizes data from other sources into tiles to be rendered on the map. The main synchronization job runs nightly and attempts to build or download new tiles for the various layers based on data from various ArcGIS systems.)

Looking at my code, I couldn't really reduce the number of calls to png.Encode, but that encoder really looked inefficient. I deleted the callgrind files (sorry), but basically, half of the CPU time in png.Encode was around memory operations and some runtime calls.

I started looking around for maybe some options to pass to the encoder or a more purpose-built implementation. I ended up finding a package that mentioned a speedup, but only for NRGBA images. However, that package looked fairly unused, and I wasn't about to turn all of my image processing to so something with 1 commit and no users.

This got me thinking, though: what is NRGBA?

It turns out that there are (at least) two ways of thinking about the whole alpha channel thing in images:

In RGBA, each of the red, green, and blue channels has already been premultiplied by the alpha channel, such that the value of, for example, R can range from 0 to A, but no higher.
In NRGBA, each of the red, green, and blue channels has its original value, and the alpha channel merely represents the opacity of the pixel in general.

For my human mind, using various tools and software over the years, when I think of "RGBA", I think of "one channel each for red, green, and blue, and one channel for the opacity of the pixel". So what this means is that I'm thinking of "NRGBA" (for non-premultiplied RGBA).

(Apparently there are good use cases for both, and when compositing, at some point you'll have to multiply by the alpha value, so "RGBA" already has that done for you.)

Okay, whatever, so what does this have to do with CPU optimization?

In Go, the png.Encode function is optimized for NRGBA images. There's a tiny little hint about this in the comment for the function:

Any Image may be encoded, but images that are not image.NRGBA might be encoded lossily.

This is corroborated by the PNG rationale document, which explains that

PNG uses "unassociated" or "non-premultiplied" alpha so that images with separate transparency masks can be stored losslessly.

If you want to have the best PNG encoding experience, then you should encode images that use NRGBA already. In fact, if you open up the code, you'll see that it will convert the image to NRGBA if it's not already in that format.

Coming back to my callgrind analysis, this is where all that CPU time was spent: converting an RGBA image to an NRGBA image. I certainly thought that it was strange how much work was being done creating a simple PNG file from a mostly-transparent map tile.

Why did I even have RGBA images? Well, my tiling API has to composite tiles from other systems into a single PNG file, so I simply created that new image with image.NewRGBA. And why that function? Because as I mentioned before, I figured "RGBA" meant "RGB with an alpha channel", which is what I wanted so that it would support transparency. It never occurred to me that "RGBA" was some weird encoding scheme for pixels in contrast to another encoding scheme called "NRGBA"; my use cases had never had me make such a distinction.

Anyway, after switching a few image.NewRGBA calls to image.NewNRGBA (and literally that was it; no other code changed), my code was way more efficient, cutting down on CPU utilization by something like 50-70%. Those RGBA to NRGBA conversions really hurt.

Saturday, April 16, 2022

Golang, http.Client, and "too many open files"

I've been having an issue with my application for a while now, and I finally figured out what the problem was. In this particular case, the application is a web app (so, think REST API written in Go), and one of its nightly routines is to synchronize a whole bunch of data with various third-party ArcGIS systems. The application keeps a cache of the ArcGIS images, and this job updates them so they're only ever a day old. This allows it to show map overlays even if the underlying ArcGIS systems are inaccessible (they're random third-party systems that are frequently down for maintenance).

So, imagine 10 threads constantly making HTTP requests for new map tile images; once a large enough batch is done, the cache is updated, and then the process repeats until the entire cache has been refreshed.

In production, I never noticed a direct problem, but there were times when an ArcGIS system would just completely freak out and start lying about not supporting pagination anymore or otherwise spewing weird errors (but again, it's a third-party system, so what can you do?). In development, I would notice this particular endpoint failing after a while with a "dial" error of "too many open files". Every time that I looked, though, everything seemed fine, and I just forgot about it.

This last time, though, I watched the main application's open sockets ("ss -anp | grep my-application"), and I noticed that the number of connections just kept increasing. This reminded me of my old networking days, and it looked like the TCP connections were just accumulating until the OS felt like closing them due to inactivity.

That's when I found that Go's "http.Client" has a method called "CloseIdleConnections()" that immediately closes any idle connections without waiting for the OS to do it for you.

For reasons that are not relevant here, each request to a third-party ArcGIS system uses its own "http.Client", and because of that, there was no way to reuse any connections between requests, and the application just kept racking up open connections, eventually hitting the default limit of 1024 "open files". I simply added "defer httpClient.CloseIdleConnections()" after creating the "http.Client", and everything magically behaved as I expected: no more than 10 active connections at any time (one for each of the 10 threads running).

So, if your Go application is getting "too many open files" errors when making a lot of HTTP requests, be sure to either (1) re-architect your application to reuse your "http.Client" whenever possible, or (2) be sure to call "CloseIdleConnections()" on your "http.Client" as soon as you're done with it.

I suspect that some of the third-party ArcGIS issues that I was seeing in production might have essentially been DoS errors caused by my application assaulting these poor servers with thousands of connections.

Monday, July 5, 2021

Working around App Engine's bogus file modification times in Go

When an App Engine application is deployed, the files on the filesystem have their modification times "zeroed"; in this case, they are set to Tuesday, January 1, 1980 at 00:00:01 GMT (with a Unix timestamp of "315532801"). Oddly enough, this isn't January 1, 1970 (with a Unix timestamp of "0"), so they're adding 1 year and 1 second for some reason (probably to avoid actually zeroing out the date).

If you found your way here by troubleshooting, you may have seen this for your "Last-Modified" header:

last-modified: Tue, 01 Jan 1980 00:00:01 GMT

There's an issue for this particular problem (currently they're saying that it's working as designed); to follow the issue or make a comment, see issue 168399701.

For App Engine in Go, I've historically bypassed the static files stuff and just had my application serve up the files with "http.FileServer", and I've disabled caching everywhere to play it safe ("Cache-Control: no-cache, no-store, must-revalidate"). Recently, I've begun to experiment with a "max-age" of 1-minute lined up on 1-minute boundaries so that I get a bit of help from the GCP proxy and its caching powers while not shooting myself in the foot allowing stale copies of my files to linger all over the Internet.

This caused me a huge amount of headache recently when my web application wasn't updating in production, despite being pushed for over 24 hours. It turns out that the browser (Chrome) was making a request by including the "If-Modified-Since" header, and my application was responding back with a 304 Not Modified response. No matter how many times my service worker tried to fetch the new data, the server kept telling it that what it had was perfect.

The default HTTP file server in some languages lets you tweak how it responds ("ETag", "Last-Modified", etc.), but not in Go. "http.FileServer" has no configuration options available to it.

What I ended up doing was wrapping "http.FileServer"'s "ServeHTTP" in another function; this function had two main goals:

Set up a weak ETag value using the expiration date (ideally, I'd use a strong value like the MD5 sum of the contents, but I didn't want to have to rewrite "http.FileServer" just for this).
Remove the request headers related to the modification time ("If-Modified-Since" and "If-Unmodified-Since"). "http.FileServer" definitely respects "If-Modified-Since", and because the modification time is bogus in App Engine, I figured that just quietly removing any headers related to that would keep things simple.

Here's what I ended up with:

staticHandler := http.StripPrefix("/", http.FileServer(http.Dir("/path/to/my/files")))

myHandler.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {

// Cache all the static files aligned at the 1-minute boundary.

expirationTime := time.Now().Truncate(1 * time.Minute).Add(1 * time.Minute)

w.Header().Set("Cache-Control", fmt.Sprintf("public, max-age=%0.0f, must-revalidate", time.Until(expirationTime).Seconds()))

w.Header().Set("ETag", fmt.Sprintf("W/\"exp_%d\"", expirationTime.Unix())) // The ETag is weak ("W/" prefix) because it'll be the same tag for all encodings.

// Strip the headers that `http.FileServer` will use that rely on modification time.

// App Engine sets all of the timestamps to January 1, 1980.

r.Header.Del("If-Modified-Since")

r.Header.Del("If-Unmodified-Since")

staticHandler.ServeHTTP(w, r)

})

Anyway, I fought with this for two days before finally realizing what was going on, so hopefully this will let you work around App Engine's bogus file-modification times.

Thursday, April 15, 2021

Using "errors.Is" to detect "connection reset by peer" and work around it

I maintain an application that ties into Emergency Reporting using their REST API. When an item is updated, I have a Google Cloud Task that attempts to publish a change to a web hook, which connects to the Emergency Reporting API and creates a new incident in that system. Because it's in Cloud Tasks, if the task fails for any reason, Cloud Tasks will attempt to retry the task until it succeeds. Cool.

I also have it set up to send any log messages at warning level or higher to a Slack channel. Also cool.

However, in December of 2020, Emergency Reporting switched to some kind of Microsoft-managed authentication system for their API, and this has only brought problems. The most common of which is that the authentication API will frequently fail with a "connection reset by peer" error. My Emergency Reporting wrapper detects this and logs it; my web hook detects a sign-in failure and logs that; and the whole Cloud Task detects that the web hook has failed and logs that. Cloud Tasks automatically retries the task, which makes another post to the web hook, and everything succeeds the second time. But by now, I've accumulated a bunch of warnings in the Slack channel. Not cool.

So here's the thing: the Emergency Reporting API can fail for a lot of reasons, and I'd like to be notified when something important actually happens. But a standard, run-of-the-mill TCP "connection reset by peer" error is not important at all.

Here's an example of the kind of error that Go's http.Client.PostForm returns in this case:

Could not post form: Post https://login.emergencyreporting.com/login.emergencyreporting.com/B2C_1A_PasswordGrant/oauth2/v2.0/token: read tcp [fddf:3978:feb1:d745::c001]:33391->[2620:1ec:29::19]:443: read: connection reset by peer

Looking at the error, it looks like there are 4 layers of error:

The HTTP post
The particular TCP read
A generic "read"
A generic "connection reset by peer"

What I really want to do in this case is detect a generic "connection reset by peer" error and quietly retry the operation, allowing all other errors to be handled as true errors. Doing string-comparison operations on error text is rarely a good idea, so what does that leave us with?

Go 1.13 adds support for "error wrapping", where one error can "wrap" another one, while still allowing programs to make decisions based on the "wrapped" error. You may call "errors.Is" to determine if any error in an error chain matches a particular target.

Fortunately, all of the packages in this particular chain of errors utilize this feature. In particular, the syscall package has a set of distinct Errno errors for each low-level error, including "connection reset by peer" (ECONNRESET).

This lets us do something like this:

tokenResponse, err = client.GenerateToken()
if err != nil {
// If this was a connection-reset error, then continue to the next retry.
if errors.Is(err, syscall.ECONNRESET) {
logrus.Info("Got back a syscall.ECONNRESET from Emergency Reporting.")
// [attempt to retry the operation]
} else {
// This was some other kind of error that we can't handle.
// [log a proper error message and fail]
}
}

Since using "errors.Is" to detect the "connection reset by peer" error, I haven't received a single annoying, pointless error message in my Slack channel. I did have to spend a bit of time trying to figure out what that ultimate, underlying error was, but after that, it's been working flawlessly.

Monday, December 7, 2020

Working with the Google Datastore emulator

I do a good chunk of my business in Google App Engine; you package up your web application, send it to GCP, and then it takes care of scaling and uptime and all that stuff.

When I started out in 2014, I created my main application in Java because that was the least-crappy language that was supported. However, in 2020, there are a whole lot more languages (in particular: Go). I've slowly been working on porting my application from Java 8 to Go 1.14. Along the way, I've run into some really annoying issues.

For today, I'm going to be focusing on the Datastore emulator. In "old" App Engine (Java 8, Go 1.11, Python 2, etc.), they gave you a whole emulator suite. Your application ran inside of that suite, and you had fake Google-based App Engine authentication, inbound e-mail, and a Datastore emulator that also had a web UI that you could use to see your entities and manipulate them. The Datastore emulator's web UI wasn't as good as the current one that you get in production, but it was good enough to use for development.

Well, in "new" App Engine, the emulator suite is gone, and now you have to emulate or mock every aspect of App Engine that you plan on using. It's not a huge deal, but it is a bit inconvenient. In particular, you now have to start your own Datastore emulator.

It's easy to start:

gcloud config set project <your-project-id>;
gcloud beta emulators datastore start;

There are some environment variables that you'll need to export for the various libraries to detect and use instead of the production instance; run this to see them:

gcloud beta emulators datastore env-init;

That part is fine.

There are also two halfway-decent third party web UIs for the Datastore emulator:

I fought for hours trying to figure out why either of those two web UIs didn't work. Neither would show any namespaces (and thus, neither would show any entities).

The short answer is that despite what the Datastore emulator claims it's using for the project ID, the only thing that it actually uses is "dummy-emulator-datastore-project".

I got a hint about it by poking around in the emulator's data file, and I got some confirmation in this file, which is the only thing on the Internet at the time of this writing that references that string: https://code.googlesource.com/gocloud/+/master/datastore/datastore.go

So, if you start the Datastore emulator according to the instructions and either of those two web UIs aren't working, try setting the project ID to "dummy-emulator-datastore-project".

In "google-cloud-gui", you set the project ID in the UI when you hit the "+" button to create a new project.
In "dsui", you set the project ID using the "--projectId" flag.

Saturday, July 18, 2020

Google Cloud Functions Issues Upgrading From Go 1.11 To 1.13

I use Google Cloud Functions with Go. However, I upgraded from Go 1.11 to Go 1.13 (because Go 1.11 is being deprecated) and ran into some annoying, undocumented issues.

Static Files And The Current Working Directory

One of my Cloud Functions acts as a tiny web server; it has a few static HTML files that it serves in addition to its dynamic things.

In Go 1.11, Cloud Functions put the static files (and all the source files, for that matter) in the working directory of the function. This (1) makes sense, and (2) makes testing easy.

However, in Go 1.13, Cloud Functions puts the static files (and all of the source files) is placed in the ./serverless_function_source_code directory. Why? Who knows. All that mattered is that after a simple version upgrade, all of my stuff broke because it couldn't find files that it was able to find before the upgrade.

I found that using a sync.Once to attempt to change the current working directory (if necessary) is a fairly clean backward-compatible way of handling this issue.

Here's an example; it's fairly verbose, but you could rip out most of the logging if you don't want or need it.

// GoogleCloudFunctionSourceDirectory is where Google Cloud will put the source code that was uploaded.

const GoogleCloudFunctionSourceDirectory = "serverless_function_source_code"

// once is an object that will only execute its function one time.

// Because we want to log during our initialization, we need to handle this in a non-standard

// function and keep track of our initialization status.

var once sync.Once

// Initialize initializes the application.

// Primarily, this changes the current working directory.

func Initialize(log *logrus.Logger) {

log.Infof("Initializing the application.")

path, err := os.Getwd()

if err != nil {

log.Warnf("Could not find the current working directory: %v", err)

}

log.Infof("Current working directory: %s", path)

log.Infof("Looking for top-level source directory: %s", GoogleCloudFunctionSourceDirectory)

fileInfo, err := os.Stat(GoogleCloudFunctionSourceDirectory)

if err == nil && fileInfo.IsDir() {

log.Infof("Found top-level source directory: %s", GoogleCloudFunctionSourceDirectory)

err = os.Chdir(GoogleCloudFunctionSourceDirectory)

if err != nil {

log.Warnf("Could not change to directory %q: %v", GoogleCloudFunctionSourceDirectory, err)

}

log.Infof("Initialization complete.")

}

// CloudFunction is an HTTP Cloud Function with a request parameter.

func CloudFunction(w http.ResponseWriter, r *http.Request) {

log := logrus.New()

// Initialize our application if we haven't already.

once.Do(func() { Initialize(log) })

// YOUR CLOUD FUNCTION LOGIC HERE

}

For more information, see the Cloud Functions concepts docs.

Logging And Environment Variables

For whatever reason, Cloud Functions with Go don't log at anything other than the "default" log level; this means that all of my carefully crafted log messages all just get dumped into the logs at the same severity.

I've been using gcfhook with logrus to get around this, but it's not an ideal solution. That combination works by nullifying all output of the application and then adding a logrus hook that connects to the StackDriver API to send proper logs over the network. It works fine, but it's silly to have to make a network connection to a logging API when the application itself can output directly.

As of Go 1.13, Cloud Functions will no longer set the FUNCTION_NAME, FUNCTION_REGION, and GCP_PROJECT environment variables. This is a problem because we need those three pieces of information in order to use the StackDriver API to send the log messages. You could publish those environment variables back as part of your deployment, but I'd prefer not to.

Fortunately, Cloud Functions can now parse (poorly documented) JSON-formatted lines from stdout and stderr, resulting in proper log messages with severities. The Cloud Functions docs refer to this as "structured logging", but the docs don't seem to apply correctly. Cloud Run has a document on how these JSON-formatted lines should look, but it's still a bit hazy.

Anyway, the gcfstructuredlogformatter package introduces a logrus formatter that outputs JSON instead of plain text for logs. This eliminates the need for the extra environment variables and generally simplifies the logging workflow. It should only be a couple of lines of code to sub out gcfhook for gcfstructuredlogformatter.

Here's an example:

// CloudFunction is an HTTP Cloud Function with a request parameter.

func CloudFunction(w http.ResponseWriter, r *http.Request) {

log := logrus.New()

if value := os.Getenv("FUNCTION_TARGET"); value == "" {

log.Infof("FUNCTION_TARGET is not set; falling back to normal logging.")

} else {

formatter := gcfstructuredlogformatter.New()

log.SetFormatter(formatter)

}

log.Infof("This is an info message.")

log.Warnf("This is a warning message.")

log.Errorf("This is an error message.")

// YOUR CLOUD FUNCTION LOGIC HERE

}

Hopefully this stopped you from banging your head against the wall for a few hours like I was doing as I tried to frantically figure out why the upgrade had failed in such weird ways.

Wednesday, January 29, 2020

Unit-testing reCAPTCHA v2 and v3 in Go

I recently worked on a project where we allowed new users to sign up for our system with a form. A new user would need to provide us with her name, her e-mail address, and a password. In order to prevent spamming, we used reCAPTCHA v3, and so that meant that we also submitted a reCAPTCHA token along with the rest of the new-user data.

Unit-testing the sign-up process was fairly simple if we turned off the reCAPTCHA requirement, but the weakest link in the whole process is the one part that we could not control: reCAPTCHA. It would be foolish not to have test coverage around the reCAPTCHA workflow.

So, how do you unit-test reCAPTCHA?

Focus: Server-side testing

For the purposes of this post, I'm going to be focusing on testing reCAPTCHA on the server side. This means that I'm not concerned with validating that users acted like humans fiddling around on a website. Instead, I'm concerned with what our sign-up endpoint does when it receives valid and invalid reCAPTCHA tokens.

reCAPTCHA in Go

There are a variety of Go package to provide reCAPTCHA support; however, only one of them (1) has support for Go modules, and (2) has support for unit testing built in:

https://github.com/tekkamanendless/go-recaptcha

For docs, see:

https://godoc.org/github.com/tekkamanendless/go-recaptcha

When you create a new reCAPTCHA site, you're given public and private keys (short little strings, nothing huge). The public key is used on the client side when you make your connection to the reCAPTCHA API, and a response token is provided back. The private key is used on the server side to connect to the reCAPTCHA API and validate the response token.

Since the client side will likely be a line or two of Javascript that generates a token, our server-side work will be focused on validating that token.

Assuming that the newly generated token is "NEW_TOKEN" and that the private key is "YOUR_PRIVATE_KEY", then this is all you have to do in order to validate that token:

import "github.com/tekkamanendless/go-recaptcha"

// ...

recaptchaVerifier := recaptcha.New("YOUR_PRIVATE_KEY")
success, err := recaptchaVerifier.Verify("NEW_TOKEN")
if err != nil {
// Fail with some 500-level error about not being able to verify the token
}

if !success {
// Fail with some 400-level error about not being a human
}

// The token is valid!

And that's it! All we really care about is whether or not it worked.

Unit testing

tekkamanendless/go-recaptcha includes a package called "recaptchatest" that provides a fake reCAPTCHA API running as an httptest.Server instance. This server simulates enough of the reCAPTCHA API to let you do the kinds of testing that you need to.

Just like the actual reCAPTCHA service, you can create multiple "sites" on the test server. Each site will have a public and private key, and you can call the NewResponseToken method of a site to have that site generate a valid token for that site.

In terms of design, you'll set up the test server, the test site, and the valid token in advance of your test. When you create your Recaptcha instance with the test site's private key, all you have to do is set the VerifyEndpoint property of that instance to point to the test server (otherwise, it would try to talk to the real reCAPTCHA API and fail).

Here's a simple example:

import (

"github.com/tekkamanendless/go-recaptcha"

"github.com/tekkamanendless/go-recaptcha/recaptchatest"

)

// ...

// Create a new reCAPTCHA test server, site, and valid token before the main test.

testServer := recaptchatest.NewServer()

defer testServer.Close()

site := testServer.NewSite()

token := site.NewResponseToken()

// Create the reCAPTCHA verifier with the site's private key.

recaptchaVerifier := recaptcha.New(site.PrivateKey)

// Override the endpoint so that it uses the test server.

recaptchaVerifier.VerifyEndpoint = testServer.VerifyEndpoint()

// Run your test.

// ...

// Validate that the reCAPTCHA token is good.

success, err := recaptchaVerifier.Verify(token)

assert.Nil(t, err)

assert.True(t, success)

The recaptchatest test server doesn't do too much that's fancy, but it will properly return a failure if the same token is verified twice or if the token is too old. It also has some functions to let you tweak the token properties so you don't have to wait around for 2 minutes for a token to age out; you can make one that's already too old (see Site.GenerateToken for more information).