Thursday, April 15, 2021

Using "errors.Is" to detect "connection reset by peer" and work around it

I maintain an application that ties into Emergency Reporting using their REST API.  When an item is updated, I have a Google Cloud Task that attempts to publish a change to a web hook, which connects to the Emergency Reporting API and creates a new incident in that system.  Because it's in Cloud Tasks, if the task fails for any reason, Cloud Tasks will attempt to retry the task until it succeeds.  Cool.

I also have it set up to send any log messages at warning level or higher to a Slack channel.  Also cool.

However, in December of 2020, Emergency Reporting switched to some kind of Microsoft-managed authentication system for their API, and this has only brought problems.  The most common one is that the authentication API frequently fails with a "connection reset by peer" error.  My Emergency Reporting wrapper detects this and logs it; my web hook detects a sign-in failure and logs that; and the whole Cloud Task detects that the web hook has failed and logs that.  Cloud Tasks automatically retries the task, which makes another post to the web hook, and everything succeeds the second time.  But by then, I've accumulated a bunch of warnings in the Slack channel.  Not cool.

So here's the thing: the Emergency Reporting API can fail for a lot of reasons, and I'd like to be notified when something important actually happens.  But a standard, run-of-the-mill TCP "connection reset by peer" error is not important at all.

Here's an example of the kind of error that Go's http.Client.PostForm returns in this case:

Could not post form: Post https://login.emergencyreporting.com/login.emergencyreporting.com/B2C_1A_PasswordGrant/oauth2/v2.0/token: read tcp [fddf:3978:feb1:d745::c001]:33391->[2620:1ec:29::19]:443: read: connection reset by peer

Reading the message from left to right, there appear to be 4 layers of error:

  1. The HTTP post
  2. The particular TCP read
  3. A generic "read"
  4. A generic "connection reset by peer"

What I really want to do in this case is detect a generic "connection reset by peer" error and quietly retry the operation, allowing all other errors to be handled as true errors.  Doing string-comparison operations on error text is rarely a good idea, so what does that leave us with?

Go 1.13 added support for "error wrapping", where one error can "wrap" another one, while still allowing programs to make decisions based on the "wrapped" error.  You can call "errors.Is" to determine whether any error in an error chain matches a particular target.

Fortunately, all of the packages in this particular chain of errors utilize this feature.  In particular, the syscall package has a set of distinct Errno errors for each low-level error, including "connection reset by peer" (ECONNRESET).

This lets us do something like this:

tokenResponse, err = client.GenerateToken()
if err != nil {
   // If this was a connection-reset error, then continue to the next retry.
   if errors.Is(err, syscall.ECONNRESET) {
      logrus.Info("Got back a syscall.ECONNRESET from Emergency Reporting.")
      // [attempt to retry the operation]
   } else {
      // This was some other kind of error that we can't handle.
      // [log a proper error message and fail]
   }
}

Since using "errors.Is" to detect the "connection reset by peer" error, I haven't received a single annoying, pointless error message in my Slack channel.  I did have to spend a bit of time trying to figure out what that ultimate, underlying error was, but after that, it's been working flawlessly.