Saturday, April 16, 2022

Golang, http.Client, and "too many open files"

I've been having an issue with my application for a while now, and I finally figured out what the problem was.  In this particular case, the application is a web app (so, think REST API written in Go), and one of its nightly routines is to synchronize a whole bunch of data with various third-party ArcGIS systems.  The application keeps a cache of the ArcGIS images, and this job updates them so they're only ever a day old.  This allows it to show map overlays even if the underlying ArcGIS systems are inaccessible (they're random third-party systems that are frequently down for maintenance).

So, imagine 10 threads constantly making HTTP requests for new map tile images; once a large enough batch is done, the cache is updated, and then the process repeats until the entire cache has been refreshed.

In production, I never noticed a direct problem, but there were times when an ArcGIS system would just completely freak out and start lying about not supporting pagination anymore or otherwise spewing weird errors (but again, it's a third-party system, so what can you do?).  In development, I would notice this particular endpoint failing after a while with a "dial" error of "too many open files".  Every time that I looked, though, everything seemed fine, and I just forgot about it.

This last time, though, I watched the main application's open sockets ("ss -anp | grep my-application"), and I noticed that the number of connections just kept increasing.  This reminded me of my old networking days, and it looked like the TCP connections were just accumulating until the OS felt like closing them due to inactivity.

That's when I found that Go's "http.Client" has a method called "CloseIdleConnections()" that immediately closes any idle connections without waiting for the OS to do it for you.

For reasons that are not relevant here, each request to a third-party ArcGIS system uses its own "http.Client".  Because of that, there was no way to reuse connections between requests, and the application just kept racking up open connections, eventually hitting the default limit of 1024 open file descriptors.  I simply added "defer httpClient.CloseIdleConnections()" after creating the "http.Client", and everything magically behaved as I expected: no more than 10 active connections at any time (one for each of the 10 threads running).
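
The shape of the fix, as a sketch (the package, function, URL handling, and timeout here are placeholders rather than the real application code):

package tiles

import (
    "io"
    "net/http"
    "time"
)

// fetchTile grabs one map tile using a throwaway client.  Deferring
// CloseIdleConnections tears down the client's idle keep-alive
// connections as soon as we're done, instead of leaving them to linger
// until the OS gives up on them.
func fetchTile(url string) ([]byte, error) {
    httpClient := &http.Client{Timeout: 30 * time.Second}
    defer httpClient.CloseIdleConnections()

    resp, err := httpClient.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    return io.ReadAll(resp.Body)
}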

So, if your Go application is getting "too many open files" errors when making a lot of HTTP requests, be sure to either (1) re-architect your application to reuse your "http.Client" whenever possible, or (2) call "CloseIdleConnections()" on your "http.Client" as soon as you're done with it.

I suspect that some of the third-party ArcGIS issues that I was seeing in production might have essentially been DoS errors caused by my application assaulting these poor servers with thousands of connections.

Saturday, April 2, 2022

Service workers, push notifications, and IndexedDB

I have a pretty simple use case: a user wanted my app to provide a list of "recent" notifications that had been sent to her.  Sometimes a lot of notifications will come through in a relatively short time period, and she wants to be able to look at the list of them to make sure that she's handled them all appropriately.

I ended up having the service worker write the notification to an IndexedDB and then having the UI reload the list of notifications when it receives a "message" event from the service worker.

Before we get there, I'll walk you through my process because it was painful.

Detour: All the mistakes that I made

Since I was already using HTML local storage for other things, I figured that I would just record the list of recent notifications in there.  Every time that the page would receive a "message" event, it would add the event data to a list of notifications in local storage.  That kind of worked as long as I was debugging it.  As long as I was looking at the page, the page was open, and it would receive the "message" event.

However, in the real world, my application is installed as a "home screen" app on Android and is usually closed.  When a notification arrived, there was no open page to receive the "message" event, and it was lost.

I then tried to have the service worker write to HTML local storage instead.  It wouldn't matter which side (page or service worker) actually wrote the data since both sides would detect a change immediately.  Except that's not how it works.  Service workers can't use HTML local storage at all: it's a synchronous, blocking API, and workers are only allowed asynchronous storage.

Anyway, HTML local storage was impossible as a simple communication and storage mechanism.

Because the page was usually not open, MessageChannel and BroadcastChannel also wouldn't work.

I finally settled on using IndexedDB because a service worker is allowed to use it.  The biggest annoyance (in the design) was that there is no way to have a page "listen" for changes to an IndexedDB, so I couldn't just trivially tell my page to update the list of notifications to display when there was a change to the database.

After implementing IndexedDB, I spent a week trying to figure out why it wasn't working half the time, and that leads us to how service workers actually work.

Detour: How service workers work

Service workers are often described as a background process for your page.  The way that you hear about them, they sound like daemons that are always running and process events when they receive them.

But that's not anywhere near correct in terms of how they are implemented.  Service workers are more like "serverless" functions (such as Google Cloud Functions): they generally aren't running, but if a request comes in that they need to handle, one is spun up to handle it, kept around for a few minutes in case any other requests come in, and then shut down.

So my big mistake was thinking that once I initialized something in my service worker, it would be available more or less indefinitely.  The browser knows which events a service worker has registered ("push", "message", etc.) and can spin up a new worker instance whenever it wants, typically to handle such an event, and then shut it down again shortly thereafter.

Service workers have an "install" event that gets run when new service worker code gets downloaded.  This is intended to be run exactly once for that version of the service worker.

There is also an "activate" event that gets run when that new service worker actually takes over, after "install".  Note that it does not run again every time the browser later spins the worker back up to handle an event; only the top-level worker script gets re-evaluated each time.  So any global state that later handlers rely on needs to be cheap to recreate or initialized lazily, because you can't count on it surviving between events.

The "push" event is run when a push message has been received.  Whatever work you need to do should be done in the event's "waitUntil" method as a promise chain that ultimately results in showing a notification to the user.

Detour: How IndexedDB works

IndexedDB was seemingly invented by people who had no concept of Promises in JavaScript.  Its API is entirely insane and based on "onsuccess", "oncomplete", and "onerror" callbacks.  (You can technically also use event listeners, but it's just as insane.)  It's an asynchronous API that doesn't use any of the standard asynchronous syntax that everything else in modern JavaScript uses.  It is what it is.

Here's what you need to know: everything in IndexedDB is callbacks.  Everything.  So, if you want to connect to a database, you'll need to make an IDBRequest and set the "onsuccess" callback.  Once you have the database, you'll need to create a transaction and set the "oncomplete" callback.  Then you can create another IDBRequest for reading or writing data from an object store (essentially a table) and set the "onsuccess" callback.  It's callback hell, but it is what it is.  (Note that there are wrapper libraries that provide Promise-based syntax, but I hate having to wrap a standard feature for no good reason.)

(Also, there's an "onupgradeneeded" callback at the database level that you can use to do any schema- or data-related work if you're changing the database version.)
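
For example, a minimal Promise wrapper around opening a database looks something like this (the database and store names here are placeholders, not the real ones):

function openNotificationDb() {
  return new Promise((resolve, reject) => {
    const request = indexedDB.open('notifications-db', 1);
    request.onupgradeneeded = () => {
      // First run (or a version bump): create the object store.
      request.result.createObjectStore('notifications');
    };
    request.onsuccess = () => resolve(request.result);
    request.onerror = () => reject(request.error);
  });
}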

Putting it all together

I decided that there was no reason to waste cycles opening the IndexedDB on "activate" since there's no guarantee that it'll actually be used.  Instead, I had the "push" event use the previous database connection (if there was one) or create a new connection (if there wasn't).

I put together the following workflow for my service worker (a condensed sketch in code follows these two lists):

  1. Register the "push" event handler ("event.waitUntil(...)"):
    1. (Promise) Connect to the IndexedDB.
      1. If we already have a connection from a previous call, then return that.
      2. Otherwise, connect to the IndexedDB and return that (and also store it for quick access the next time so we don't have to reconnect).
    2. (Promise) Read the list of notifications from the database.
    3. (Promise) Add the new notification to the list and write it back to the database.
    4. (Promise) Fire a "message" event to all active pages and show a notification if no page is currently visible to the user.
And for my page:
  1. Load the list of notifications from the IndexedDB when the page loads.  (This sets our starting point, and any changes will be communicated by a "message" event from the service worker.)
  2. Register the "message" event handler:
    1. Reload the list of notifications from the IndexedDB.  (Remember, there's no way to be notified on changes, so receiving the "message" event and reloading is the best that we can do.)
    2. (Handle the message normally; for me, this shows a little toast with the message details and a link to click on to take the user to the appropriate screen.)
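
Condensed into code, the service-worker side of that looks roughly like this (the helper names — "openNotificationDb", "readNotifications", "writeNotifications" — are just the hypothetical ones from these sketches, with the last two sketched a little further down):

let dbPromise = null; // only survives as long as this worker instance does

function getDb() {
  // Reuse the connection if this worker instance already opened one;
  // otherwise open a new one and cache the promise for next time.
  if (!dbPromise) {
    dbPromise = openNotificationDb();
  }
  return dbPromise;
}

function notifyPages(data) {
  // Fire a "message" at every open page and report whether any of them is visible.
  return self.clients.matchAll({ type: 'window' }).then((clientList) => {
    clientList.forEach((client) => client.postMessage(data));
    return clientList.some((client) => client.visibilityState === 'visible');
  });
}

self.addEventListener('push', (event) => {
  const data = event.data ? event.data.json() : {};
  event.waitUntil(
    getDb()
      .then((db) => readNotifications(db).then((list) => writeNotifications(db, list.concat(data))))
      .catch(() => null) // a database problem must never block the notification
      .then(() => notifyPages(data))
      .then((somePageIsVisible) => {
        if (!somePageIsVisible) {
          return self.registration.showNotification(data.title || 'Notification', { body: data.body });
        }
      })
  );
});

And the page side is essentially the mirror image (again, "reloadNotificationList" and "showToast" are stand-ins for my own code):

navigator.serviceWorker.addEventListener('message', (event) => {
  reloadNotificationList(); // stand-in: re-read the list from IndexedDB and re-render it
  showToast(event.data);    // stand-in: the normal message handling (toast, link, etc.)
});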

For me, the database work is a nice-to-have; the notification is the critical part of the workflow.  So I made sure that every database-related error was handled and the Promises resolved no matter what.  This way, even if there was a completely unexpected database issue, it would just get quietly skipped and the notification could be shown to the user.

In my code, I created some simple functions (to deal with the couple of IndexedDB interactions that I needed) that return Promises so I could operate normally.  You could technically just do a single "new Promise(...)" to cover all of the IndexedDB work if you wanted, or you could use one of those fancy wrapper libraries.  In any case, you must call "event.waitUntil" with a Promise chain that ultimately resolves after doing something with the notification.  How you get there is up to you.

I was also using IndexedDB as asynchronous local storage, so I didn't need fancy keys or sorting or anything.  I just put all of my data under a single key that I could "get" and "put" trivially without having to worry about row counts or any other kind of data management.  There's a single object store with a single row in it.
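
Those two helpers, under the same assumptions (one object store created back in "onupgradeneeded", one well-known key), end up being small:

function readNotifications(db) {
  return new Promise((resolve, reject) => {
    const request = db
      .transaction('notifications', 'readonly')
      .objectStore('notifications')
      .get('recent');
    request.onsuccess = () => resolve(request.result || []);
    request.onerror = () => reject(request.error);
  });
}

function writeNotifications(db, list) {
  return new Promise((resolve, reject) => {
    const tx = db.transaction('notifications', 'readwrite');
    tx.objectStore('notifications').put(list, 'recent');
    tx.oncomplete = () => resolve(list);
    tx.onerror = () => reject(tx.error);
  });
}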

Thursday, March 3, 2022

Dump your SSH settings for quick troubleshooting

I recently had a Jenkins job that would die seemingly at random.  The only thing that really stood out was that it would tend to succeed if the runtime was 14 minutes or less, and it would tend to fail if the runtime was 17 minutes or more.

This job did a bunch of database stuff (through an SSH tunnel; more on that soon), so I first did a whole bunch of troubleshooting on the Postgres client and server configs, but nothing seemed relevant.  It seemed to disconnect ("connection closed by server") on these long queries that would sit there for a long time (maybe around 15 minutes or so) and then come back with a result.  After ruling out the Postgres server (all of the settings looked good, and new sessions had decent timeout configs), I moved on to SSH.

This job connects to a database by way of a forwarded port through an SSH tunnel (don't ask why; just understand that it's the least worst option available in this context).  I figured that maybe the SSH tunnel was failing, since I start it in the background and have it run "sleep infinity" and then never look at it again.  However, when I tested locally, my SSH session would run for multiple days without a problem.

Spoiler alert: the answer ended up being the client config, but how do you actually find that out?

SSH has two really cool options.

On the server side, you can run "sudo sshd -T | sort" to have the SSH daemon read the relevant configs and then print out all of the actual values that it's using.  So, this'll merge in all of the unspecified defaults as well as all of the various options in "/etc/ssh/sshd_config", "/etc/ssh/sshd_config.d", etc.

On the client side, you can run "ssh -G ${user}@${host} | sort", and it'll do the same thing, but for all of the client-side configs for that particular user and host combination (because maybe you have some custom stuff set up in your SSH config, etc.).
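
Both of these dump a lot of settings, so pipe them through grep when you're hunting for something specific.  For a keepalive hunt like mine, something along these lines narrows it right down:

sudo sshd -T | grep -i alive
ssh -G ${user}@${host} | grep -i alive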

Now, in my case, it ended up being a keepalive issue.  So, on the server side, here's what the relevant settings were:

clientalivecountmax 0
clientaliveinterval 900
tcpkeepalive yes

On the client (which would disconnect sometimes), here's what the relevant settings were:

serveralivecountmax 3
serveraliveinterval 0
tcpkeepalive yes

Here, you can see that the client (which is whatever the default Jenkins Kubernetes agent ended up being) had TCP keepalives turned on, but its "serveraliveinterval" was set to "0", which means that it would never send SSH-level keepalive messages to the server at all.

According to the docs, the server should have sent out keepalives every 15 minutes; whatever it was actually doing, the connection would drop right around that 15-minute mark (my guess is that with "clientalivecountmax" at 0, the very first client-alive check was enough for the server to give up on the session).  Setting "serveraliveinterval" to "60" ended up solving my problem and allowed my SSH sessions to stay active indefinitely until the script was done with them.

Little bonus section

My SSH command to set up the tunnel in the background was:

ssh -4 -f -L${localport}:${targetaddress}:${targetport} ${user}@${bastionhost} 'sleep infinity';

"-4" forces it to use an IPv4 address (relevant in my context), and "-f" puts the SSH command into the background before "sleep infinity" gets called, right after all the port forwarding is set up.  "sleep infinity" ensures that the connection never closes on its own; the "sleep" command will do nothing forever.

(Obviously, I had the "-o ServerAliveInterval=60" option in there, too.)
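
So, putting the two together, the tunnel command ends up looking like this:

ssh -4 -f -o ServerAliveInterval=60 -L${localport}:${targetaddress}:${targetport} ${user}@${bastionhost} 'sleep infinity';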

With this, I could trivially have my container create an SSH session that allowed for port-forwarding, and that session would be available for the entirety of the container's lifetime (the entirety of the Jenkins build).

Tuesday, March 1, 2022

QNAP, NFS, and Filesystem ACLs

I recently spent hours banging my head against a wall trying to figure out why my Plex server couldn't find some new media that I put on its volume in my QNAP.

tl;dr QNAP's "Advanced Folder Permissions" setting turns on file access control lists (you'll need the "getfacl" and "setfacl" tools installed on Linux to mess with them).  For more information, see this guide from QNAP.

I must have turned this setting on when I rebuilt my NAS a while back, and it never mattered until I did some file operations with the File Manager or maybe just straight "cp"; I forget which (or both).  Plex refused to see the new file, and I tried debugging the indexer and all that other Plex stuff before realizing that while it could list the file, it couldn't open the file, even though its normal "ls -l" permissions looked fine.

Apparently the file access control list denied it, but I didn't even have "getfacl" or "setfacl" installed on my machine (and I had never even heard of this before), so I had no idea what was going on.  I eventually installed those tools and verified that while the standard Linux permissions looked fine, the ACL permissions did not.
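
(Checking is just a matter of pointing "getfacl" at the file in question; the path here is a placeholder.)

getfacl /path/to/folder/some-file    # prints the owner, group, and any ACL entries that "ls -l" won't show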

"sudo chmod -R +r /path/to/folder" didn't solve my problem, but tearing out the ACL did: "sudo setfacl -b -R /path/to/folder"

Later, I eventually figured out that it was QNAP's "Advanced Folder Permissions" and just disabled that so it wouldn't bother me again.

Sunday, January 9, 2022

Moving (or renaming) a push-notification ServiceWorker

Service Workers (web workers, etc.) are a relatively new concept.  They can do all kinds of cool things (primarily related to network requests), but they are also the mechanism by which a web site can receive push messages (via Web Push) and show them as OS notifications.

The general rule of Service Workers is to pick a file name (such as "/service-worker.js") and never, ever change it.  That's cool, but sometimes you do need to change it.

In particular, I started my push messaging journey with "platinum-push-messaging", a now-defunct web component built by Google as part of the initial Polymer project.  The promise was cool: just slap this HTML element on your page with a few parameters and boom: you have working push notifications.

When it came out, the push messaging spec was young, and no browsers fully supported its encrypted data payloads, so "platinum-push-messaging" did a lot of work to work around that limitation.  As browsers improved to support the VAPID spec, "platinum-push-messaging" (along with all of the other "platinum" elements) was quietly deprecated and archived (around 2017).

This left me with a problem: a rotting push notification system that couldn't keep up with the spec and the latest browsers.  I hacked the code to all hell to support VAPID and keep the element functioning, but I was just punting.

Apple ruined the declarative promise of the Polymer project by refusing to implement HTML imports, so the web components community adopted the NPM distribution model (and introduced a whole bunch of imperative Javascript drama and compilation tools).  Anyway, no modern web components are installed with Bower anymore, so that left me with a deprecated Service Worker in a path that I wanted to get rid of: "bower_components/platinum-push-messaging/service-worker.js"

Here was my problem:

  1. I wanted the push messaging Service Worker under my control at the top level of my application, "/push-service-worker.js".
  2. I had hundreds of users who were receiving push notifications via this system, and the upgrade had to be seamless (users couldn't be forced to take any action).
I ended up solving the problem by essentially performing a switcheroo:
  1. I had my application store the Web Push subscription info in HTML local storage.  This would be necessary later as part of the switcheroo.
  2. I removed "bower_components/platinum-push-messaging/".  Any existing clients would regularly attempt to update the service worker, but it would quietly fail, leaving the existing one running just fine.
  3. I removed all references to "platinum-push-messaging" from my code.  The existing Service Worker would continue to run (because that's what Service Workers do) and receive push messages (and show notifications).
  4. I made my own push-messaging web component with my own service worker living at "/push-service-worker.js".
  5. (This laid the groundwork for performing the switcheroo.)
  6. Upon loading, the part of my application that used to include "platinum-push-messaging" did a migration, if necessary, before loading the new push-messaging component:
    1. It went through all the Service Workers and looked for any legacy ones (these had "$$platinum-push-messaging$$" in the scope).  If it found any, it killed them.

      Note that the "$$platinum-push-messaging$$" in the scope was a cute trick by the web component: a page can only be controlled by one Service Worker, and the scope dictates what that Service Worker can control.  By injecting a bogus "$$platinum-push-messaging$$" at the end of the scope, it ensured that the push-messaging Service Worker couldn't accidentally control any pages and get in the way of a main Service Worker.
    2. Upon finding any legacy Service Workers, it would:
      1. Issue a delete to the web server for the old (legacy) subscription (which was stored in HTML local storage).
      2. Tell the application to auto-enable push notifications.
      3. Resume the normal workflow for the application.
  7. The normal workflow for the application entailed loading the new push-messaging web component once the user was logged in.  If a Service Worker was previously enabled, then it would remain active and enabled.  Otherwise, the application wouldn't try to annoy users by asking them for push notifications.
  8. After the new push-messaging web component was included, it would then check to see if it should be auto-enabled (it would only be auto-enabled as part of a successful migration).
    1. If it was auto-enabled, then it would enable push messaging (the user would have already given permission by virtue of having a legacy push Service Worker running).  When the new push subscription was ready, it would post that information to the web server, and the user would have push messages working again, now using the new Service Worker.  The switcheroo was complete.
That's a bit wordy for a simple switcheroo, but it was very important for me to ensure that my users never lost their push notifications as part of the upgrade.  The simple version is: detect legacy Service Worker, kill legacy Service Worker, delete legacy subscription from web server, enable new Service Worker, and save new subscription to web server.
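
The detection-and-kill part of that, roughly (the "$$platinum-push-messaging$$" scope marker is the real one; "deleteLegacySubscription" and the auto-enable return value are just stand-ins for my own code):

function migrateLegacyPushWorker() {
  return navigator.serviceWorker.getRegistrations().then((registrations) => {
    const legacy = registrations.filter((reg) =>
      reg.scope.includes('$$platinum-push-messaging$$')
    );
    if (legacy.length === 0) {
      return false; // nothing to migrate; carry on normally
    }
    return Promise.all(legacy.map((reg) => reg.unregister()))
      .then(() => deleteLegacySubscription()) // stand-in: DELETE the old subscription from the web server
      .then(() => true);                      // tell the caller to auto-enable the new push component
  });
}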

For any given client, the switcheroo happens only once.  The moment that the legacy Service Worker has been killed, it'll never run again (so there's a chance that if the user killed the page in the milliseconds after the kill but before the save, then they'd lose their push notifications, but I viewed this as extremely unlikely; I could technically have stored a status variable, but it wasn't worth it).  After that, it operates normally.

This means that there are two ways for a user to be upgraded:
  1. They open up the application after it has been upgraded.  The application prompts them to reload to upgrade if it detects a new version, but eventually the browser will do this on its own, typically after the device reboots or the browser has been fully closed.
  2. They click on a push notification, which opens up the application (which is #1, above).
So at this point, it's a waiting game.  I have to maintain support for the switcheroo until all existing push subscriptions have been upgraded.  The new ones have a flag set in the database, so I just need to wait until all subscriptions have the flag.  Active users who are receiving push notifications will eventually click on one, so I made a note to revisit this and remove the switcheroo code once all of the legacy subscriptions have been removed.

I'm not certain what causes a new subscription to be generated (different endpoint, etc.), but I suspect that it has to do with the scope of the Service Worker (otherwise, how would it know, since service worker code can change frequently?).  I played it safe and just assumed that the switcheroo would generate an entirely new subscription, so I deleted the legacy one no matter what and saved the new one no matter what.

Saturday, October 30, 2021

Troubleshooting a weird Nagios NRPE SSL/TLS error

We recently gained limited access to a customer data center in order to monitor some machines that our software is running on.  For historical reasons, we use Nagios as our monitoring tool (yes, I know that it's 2021), and we use NRPE to monitor our Linux boxes (yes, I know that NRPE is deprecated in favor of NCPA).

We had to provide the customer with a list of source IP addresses and target ports (for example, 5666 for NRPE) as part of the process to get the VPN set up.  Foreshadowing: this will become relevant soon.

After getting NRPE installed on all of our machines, we noticed that Nagios was failing to connect to any of them.  The NRPE logs all had the following errors:

Starting up daemon
Server listening on 0.0.0.0 port 5666.
Server listening on :: port 5666.
Warning: Daemon is configured to accept command arguments from clients!
Listening for connections on port 5666
Allowing connections from: 127.0.0.1,::1,[redacted]
Error: Network server getpeername() failure (107: Transport endpoint is not connected)
warning: can't get client address: Connection reset by peer
Error: (!log_opts) Could not complete SSL handshake with [redacted]: 5
warning: can't get client address: Connection reset by peer
Error: Network server getpeername() failure (107: Transport endpoint is not connected)
Error: Network server getpeername() failure (107: Transport endpoint is not connected)
warning: can't get client address: Connection reset by peer
Error: (!log_opts) Could not complete SSL handshake with [redacted]: 5
warning: can't get client address: Connection reset by peer
warning: can't get client address: Connection reset by peer

So, this is obviously an SSL/TLS problem.

However, everyone on the Internet basically says that this is a problem with the NRPE client machine (the Nagios source address isn't listed in "allowed_hosts", it's not set up for SSL correctly, you didn't compile it right, etc.).

After fighting with this for hours, we finally figured out what was wrong.

A hint was the "getpeername() failure"; if you open up the NRPE source code, this runs immediately after the connection is established.  The only way that you could see this error ("Transport endpoint is not connected") is if the socket was closed between that initial connection and "getpeername".
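
To see what each side thought was happening on the wire, a capture along these lines on both machines is enough (the interface and port filter are whatever fits your environment):

sudo tcpdump -nn -i any port 5666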

Running "tcpdump" on both sides yielded the following findings:

On Nagios:

Nagios → NRPE machine: SYN
NRPE machine → Nagios: SYN, ACK
Nagios → NRPE machine: ACK
Nagios → NRPE machine: TLSv1 Client Hello
NRPE machine → Nagios: RST, ACK

On the NRPE machine to be monitored:

Nagios → NRPE machine: SYN
NRPE machine → Nagios: SYN, ACK
Nagios → NRPE machine: ACK
Nagios → NRPE machine: RST, ACK

Both machines agreed on the first 3 packets: the classic TCP handshake.  However, they differed on the subsequent packets.  Nagios sent a TLSv1 "Client Hello" packet and immediately had the connection closed by the NRPE machine.  However, the NRPE machine did not see the TLSv1 "Client Hello" at all; rather, it saw that Nagios immediately closed the connection.

This is indicative of some trickery being done by the customer's equipment (firewall, VPN, etc.).  From what I can tell, it's quietly stripping out any TLS packets and killing the connection when it sees one.  They probably have an incorrect port rule set up for port 5666, but anyway, that's the problem here: the network infrastructure is tearing out the TLS packets and closing the connection.

Saturday, July 10, 2021

Migrating from a static volume to a storage pool in QNAP

I bought a QNAP TS-451+ NAS a number of years ago.  At the time, you could only set up what are now called "static volumes"; these are volumes that are composed of a number of disks in some RAID configuration.  After a firmware update, QNAP introduced "storage pools", which act as a layer in between the RAIDed disks and the volumes on top of them.  Storage pools can do snapshots and some other fancy things, but the important thing here is that QNAP was pushing storage pools now, and I had a static volume.

I wanted to migrate from my old static volume to a new storage pool.  I couldn't really find any examples of anyone who had performed such a migration successfully; most of the advice on the Internet was basically, "back up your stuff and reformat".  Given the fact that my volume was almost full and that QNAP does not support an in-place migration, I figured that if I added on some extra storage in the form of an expansion unit, I could probably pull it off with minimal hassle.

(The official QNAP docs generally agree with this.)

tl;dr It was pretty easy to do, just a bit time-consuming.  I'll also note that this was a lossless process (other than my NFS permissions); I didn't have to reinstall anything or restore any backups.

Here's the general workflow:

  1. Attach the expansion unit.
  2. Add the new disks to the expansion unit.
  3. Create a new storage pool on the expansion unit.
  4. Transfer each folder in the original volume to a new folder on the expansion unit.
  5. Write down the NFS settings for the original volume's folders.
  6. Delete the original volume.
  7. Create a new storage pool with the original disks.
  8. Create a new system volume on the main storage pool.
  9. Create new volumes as desired on the main storage pool.
  10. Transfer each folder from the expansion volume to the main volume.
  11. Re-apply the NFS settings on the folders on the main storage pool's volumes.
  12. Detach the expansion unit.
Some details follow.

QNAP sells expansion units that can act as additional storage pools and volumes, and the QNAP OS integrates them pretty well.  I purchased a TR-004 and connected it to my TS-451+ NAS via USB.  I had some new drives that I was planning to use to replace the drives currently in the NAS, so instead of doing that right away, I put them all in the expansion unit and created a new storage pool (let's call this the expansion storage pool).

I had originally tried using File Station to copy and paste all of my folders to a new volume in the expansion unit, but I would get permission-related errors, and I didn't want to deal with individual files when there were millions to transfer.  QNAP has an application called Hybrid Backup Sync, and one of the things that you can do is a 1-way sync "job" that lets you properly copy everything from one folder on one volume to another folder on another volume.  So I created new top-level folders in the expansion volume and then used Hybrid Backup Sync to copy all of my data from the main volume to the expansion volume (it preserved all the file attributes, etc.).

For more information on how to use Hybrid Backup Sync to do this, see this article from QNAP.

(If you're coming from a static volume and you set up a storage pool on the expansion unit, then QNAP has a feature where you can transfer a folder on a static volume to a new volume in a storage pool, but this only works one way; you can't use this feature to transfer back from storage pool to storage pool, only from static volume to storage pool.)

I then wrote down the NFS settings that I had for my folders on the main unit (it's pretty simple, but I did have some owner and whitelist configuration).

Once I had moved everything of mine onto the expansion volume, I deleted the main (system) volume.  QNAP was okay with this and didn't complain at all.  Some sites that I had read claimed that you'd have to reboot or reformat or something if you did this, but at least on modern QNAP OSes, it's fine with you deleting its system volume.

For more information on deleting a volume, see this article from QNAP.

I created a new storage pool with the main unit's existing disks, and then I created a small, thin volume on it to see what would happen.  QNAP quickly decided that this new volume would be the new "system" volume, and it installed some applications on its own, and then it was done.  My guess is that it installed whatever base config it needs to operate on that new volume and maybe transferred the few applications that I already had to it or something.

(I then rebooted the QNAP just to make sure that everything was working, and it ended up being fine.)

On the expansion unit, I renamed all of the top-level folders to end with "_expansion" so that I'd be able to tell them apart from the ones that I would make on the main unit.

Then I used Hybrid Backup Sync to copy my folders from the expansion volume to the main volume.  Once that was done, I modified the NFS settings on the main volume's folders to match what they had been originally.

I tested the connections from all my machines that use the NAS, and then I detached and powered down the expansion unit.  I restarted the NAS and tested the connections again, and everything was perfect.  Now I had a storage pool with thin-provisioned volumes instead of a single, massive static volume.