Recovered from panic: [*net.OpError] write tcp 127.0.0.1:58219->127.0.0.1:42590: use of closed network connection
Uncaught signal: 4, pid=280, tid=280, fault_addr=94169238392146.
After migrating my application from Google Cloud Engine to Google Cloud Run, I suddenly had a use case for optimizing CPU utilization.
In my analysis of my most CPU-intensive workloads, it turned out that the majority of the time was spent encoding PNG files.
tl;dr Use image.NRGBA when you intend to encode a PNG file.
(For reference, this particular application has a Google Maps overlay that synthesizes data from other sources into tiles to be rendered on the map. The main synchronization job runs nightly and attempts to build or download new tiles for the various layers based on data from various ArcGIS systems.)
Looking at my code, I couldn't really reduce the number of calls to png.Encode, but that encoder really looked inefficient. I deleted the callgrind files (sorry), but basically, half of the CPU time in png.Encode was around memory operations and some runtime calls.
I started looking around for maybe some options to pass to the encoder or a more purpose-built implementation. I ended up finding a package that mentioned a speedup, but only for NRGBA images. However, that package looked fairly unused, and I wasn't about to turn all of my image processing to so something with 1 commit and no users.
This got me thinking, though: what is NRGBA?
It turns out that there are (at least) two ways of thinking about the whole alpha channel thing in images:
For my human mind, using various tools and software over the years, when I think of "RGBA", I think of "one channel each for red, green, and blue, and one channel for the opacity of the pixel". So what this means is that I'm thinking of "NRGBA" (for non-premultiplied RGBA).
(Apparently there are good use cases for both, and when compositing, at some point you'll have to multiply by the alpha value, so "RGBA" already has that done for you.)
Okay, whatever, so what does this have to do with CPU optimization?
In Go, the png.Encode function is optimized for NRGBA images. There's a tiny little hint about this in the comment for the function:
Any Image may be encoded, but images that are not image.NRGBA might be encoded lossily.
This is corroborated by the PNG rationale document, which explains that
PNG uses "unassociated" or "non-premultiplied" alpha so that images with separate transparency masks can be stored losslessly.
If you want to have the best PNG encoding experience, then you should encode images that use NRGBA already. In fact, if you open up the code, you'll see that it will convert the image to NRGBA if it's not already in that format.
Coming back to my callgrind analysis, this is where all that CPU time was spent: converting an RGBA image to an NRGBA image. I certainly thought that it was strange how much work was being done creating a simple PNG file from a mostly-transparent map tile.
Why did I even have RGBA images? Well, my tiling API has to composite tiles from other systems into a single PNG file, so I simply created that new image with image.NewRGBA. And why that function? Because as I mentioned before, I figured "RGBA" meant "RGB with an alpha channel", which is what I wanted so that it would support transparency. It never occurred to me that "RGBA" was some weird encoding scheme for pixels in contrast to another encoding scheme called "NRGBA"; my use cases had never had me make such a distinction.
Anyway, after switching a few image.NewRGBA calls to image.NewNRGBA (and literally that was it; no other code changed), my code was way more efficient, cutting down on CPU utilization by something like 50-70%. Those RGBA to NRGBA conversions really hurt.
I've been having an issue with my application for a while now, and I finally figured out what the problem was. In this particular case, the application is a web app (so, think REST API written in Go), and one of its nightly routines is to synchronize a whole bunch of data with various third-party ArcGIS systems. The application keeps a cache of the ArcGIS images, and this job updates them so they're only ever a day old. This allows it to show map overlays even if the underlying ArcGIS systems are inaccessible (they're random third-party systems that are frequently down for maintenance).
So, imagine 10 threads constantly making HTTP requests for new map tile images; once a large enough batch is done, the cache is updated, and then the process repeats until the entire cache has been refreshed.
In production, I never noticed a direct problem, but there were times when an ArcGIS system would just completely freak out and start lying about not supporting pagination anymore or otherwise spewing weird errors (but again, it's a third-party system, so what can you do?). In development, I would notice this particular endpoint failing after a while with a "dial" error of "too many open files". Every time that I looked, though, everything seemed fine, and I just forgot about it.
This last time, though, I watched the main application's open sockets ("ss -anp | grep my-application"), and I noticed that the number of connections just kept increasing. This reminded me of my old networking days, and it looked like the TCP connections were just accumulating until the OS felt like closing them due to inactivity.
That's when I found that Go's "http.Client" has a method called "CloseIdleConnections()" that immediately closes any idle connections without waiting for the OS to do it for you.
For reasons that are not relevant here, each request to a third-party ArcGIS system uses its own "http.Client", and because of that, there was no way to reuse any connections between requests, and the application just kept racking up open connections, eventually hitting the default limit of 1024 "open files". I simply added "defer httpClient.CloseIdleConnections()" after creating the "http.Client", and everything magically behaved as I expected: no more than 10 active connections at any time (one for each of the 10 threads running).
So, if your Go application is getting "too many open files" errors when making a lot of HTTP requests, be sure to either (1) re-architect your application to reuse your "http.Client" whenever possible, or (2) be sure to call "CloseIdleConnections()" on your "http.Client" as soon as you're done with it.
I suspect that some of the third-party ArcGIS issues that I was seeing in production might have essentially been DoS errors caused by my application assaulting these poor servers with thousands of connections.
When an App Engine application is deployed, the files on the filesystem have their modification times "zeroed"; in this case, they are set to Tuesday, January 1, 1980 at 00:00:01 GMT (with a Unix timestamp of "315532801"). Oddly enough, this isn't January 1, 1970 (with a Unix timestamp of "0"), so they're adding 1 year and 1 second for some reason (probably to avoid actually zeroing out the date).
If you found your way here by troubleshooting, you may have seen this for your "Last-Modified" header:
last-modified: Tue, 01 Jan 1980 00:00:01 GMT
There's an issue for this particular problem (currently they're saying that it's working as designed); to follow the issue or make a comment, see issue 168399701.
For App Engine in Go, I've historically bypassed the static files stuff and just had my application serve up the files with "http.FileServer", and I've disabled caching everywhere to play it safe ("Cache-Control: no-cache, no-store, must-revalidate"). Recently, I've begun to experiment with a "max-age" of 1-minute lined up on 1-minute boundaries so that I get a bit of help from the GCP proxy and its caching powers while not shooting myself in the foot allowing stale copies of my files to linger all over the Internet.
This caused me a huge amount of headache recently when my web application wasn't updating in production, despite being pushed for over 24 hours. It turns out that the browser (Chrome) was making a request by including the "If-Modified-Since" header, and my application was responding back with a 304 Not Modified response. No matter how many times my service worker tried to fetch the new data, the server kept telling it that what it had was perfect.
The default HTTP file server in some languages lets you tweak how it responds ("ETag", "Last-Modified", etc.), but not in Go. "http.FileServer" has no configuration options available to it.
What I ended up doing was wrapping "http.FileServer"'s "ServeHTTP" in another function; this function had two main goals:
I maintain an application that ties into Emergency Reporting using their REST API. When an item is updated, I have a Google Cloud Task that attempts to publish a change to a web hook, which connects to the Emergency Reporting API and creates a new incident in that system. Because it's in Cloud Tasks, if the task fails for any reason, Cloud Tasks will attempt to retry the task until it succeeds. Cool.
I also have it set up to send any log messages at warning level or higher to a Slack channel. Also cool.
However, in December of 2020, Emergency Reporting switched to some kind of Microsoft-managed authentication system for their API, and this has only brought problems. The most common of which is that the authentication API will frequently fail with a "connection reset by peer" error. My Emergency Reporting wrapper detects this and logs it; my web hook detects a sign-in failure and logs that; and the whole Cloud Task detects that the web hook has failed and logs that. Cloud Tasks automatically retries the task, which makes another post to the web hook, and everything succeeds the second time. But by now, I've accumulated a bunch of warnings in the Slack channel. Not cool.
So here's the thing: the Emergency Reporting API can fail for a lot of reasons, and I'd like to be notified when something important actually happens. But a standard, run-of-the-mill TCP "connection reset by peer" error is not important at all.
Here's an example of the kind of error that Go's http.Client.PostForm returns in this case:
Could not post form: Post https://login.emergencyreporting.com/login.emergencyreporting.com/B2C_1A_PasswordGrant/oauth2/v2.0/token: read tcp [fddf:3978:feb1:d745::c001]:33391->[2620:1ec:29::19]:443: read: connection reset by peer
Looking at the error, it looks like there are 4 layers of error:
Go 1.13 adds support for "error wrapping", where one error can "wrap" another one, while still allowing programs to make decisions based on the "wrapped" error. You may call "errors.Is" to determine if any error in an error chain matches a particular target.
Fortunately, all of the packages in this particular chain of errors utilize this feature. In particular, the syscall package has a set of distinct Errno errors for each low-level error, including "connection reset by peer" (ECONNRESET).
This lets us do something like this:
tokenResponse, err = client.GenerateToken()
if err != nil {
// If this was a connection-reset error, then continue to the next retry.
if errors.Is(err, syscall.ECONNRESET) {
logrus.Info("Got back a syscall.ECONNRESET from Emergency Reporting.")
// [attempt to retry the operation]
} else {
// This was some other kind of error that we can't handle.
// [log a proper error message and fail]
}
}
Since using "errors.Is" to detect the "connection reset by peer" error, I haven't received a single annoying, pointless error message in my Slack channel. I did have to spend a bit of time trying to figure out what that ultimate, underlying error was, but after that, it's been working flawlessly.
I do a good chunk of my business in Google App Engine; you package up your web application, send it to GCP, and then it takes care of scaling and uptime and all that stuff.
When I started out in 2014, I created my main application in Java because that was the least-crappy language that was supported. However, in 2020, there are a whole lot more languages (in particular: Go). I've slowly been working on porting my application from Java 8 to Go 1.14. Along the way, I've run into some really annoying issues.
For today, I'm going to be focusing on the Datastore emulator. In "old" App Engine (Java 8, Go 1.11, Python 2, etc.), they gave you a whole emulator suite. Your application ran inside of that suite, and you had fake Google-based App Engine authentication, inbound e-mail, and a Datastore emulator that also had a web UI that you could use to see your entities and manipulate them. The Datastore emulator's web UI wasn't as good as the current one that you get in production, but it was good enough to use for development.
Well, in "new" App Engine, the emulator suite is gone, and now you have to emulate or mock every aspect of App Engine that you plan on using. It's not a huge deal, but it is a bit inconvenient. In particular, you now have to start your own Datastore emulator.
It's easy to start:
gcloud config set project <your-project-id>;
gcloud beta emulators datastore start;
There are some environment variables that you'll need to export for the various libraries to detect and use instead of the production instance; run this to see them:
gcloud beta emulators datastore env-init;
That part is fine.
There are also two halfway-decent third party web UIs for the Datastore emulator:
I fought for hours trying to figure out why either of those two web UIs didn't work. Neither would show any namespaces (and thus, neither would show any entities).
The short answer is that despite what the Datastore emulator claims it's using for the project ID, the only thing that it actually uses is "dummy-emulator-datastore-project".
I got a hint about it by poking around in the emulator's data file, and I got some confirmation in this file, which is the only thing on the Internet at the time of this writing that references that string: https://code.googlesource.com/gocloud/+/master/datastore/datastore.go
So, if you start the Datastore emulator according to the instructions and either of those two web UIs aren't working, try setting the project ID to "dummy-emulator-datastore-project".