Monday, October 14, 2019

Encrypt your /home directory using LUKS and a spare disk

Every year or two, I rotate a drive out of my NAS.  My most recent rotation yielded a spare 1TB SSD.  My main machine only had a 250GB SSD, so I figured that I'd just replace my /home directory with a mountpoint on the new disk, giving me lots of space for video editing and such, since I no longer had the room to deal with my GoPro footage.

My general thought process was as follows:

  1. I don't want to mess too much with my system.
  2. I don't want to clone my whole system onto the new drive.
  3. I want to encrypt my personal data.
  4. I don't really care about encrypting the entire OS.
I had originally looked into some other encryption options, such as encrypting each user's home directory separately, but even in the year 2019 there seemed to be too much drama dealing with that (anytime that I need to make a PAM change, it's a bad day).  Using LUKS, the disk (well, partition) is encrypted, so everything kind of comes for free after that.

If you register the partition in /etc/crypttab, your machine will prompt you for the decryption key when it boots (at least Kubuntu 18.04 does).

One other thing: dealing with encrypted data may be slow if your processor doesn't have hardware AES instructions (AES-NI).  Do a quick check and make sure that "aes" is listed under "Flags":
lscpu;
If "aes" is there, then you're good to go.  If not, then maybe run some tests to see how much CPU overhead disk operations use on LUKS (you can follow this guide, but stop before "Home setup, phase 2", and see if your overhead is acceptable).

The plan

  1. LUKS setup
    1. Format the new disk with a single partition.
    2. Set up LUKS on that partition.
    3. Back up the LUKS header data.
  2. Home setup, phase 1
    1. Copy everything in /home to the new partition.
    2. Update /etc/crypttab.
    3. Update /etc/fstab using a test directory.
    4. Reboot.
    5. Test.
  3. Home setup, phase 2
    1. Update /etc/fstab using the /home directory.
    2. Reboot.
    3. Test.

LUKS setup

Wipe the new disk and make a single partition.  For the remainder of this post, I'll be assuming that the partition is /dev/sdx1.

Install "cryptsetup".
sudo apt install cryptsetup; 
Set up LUKS on the partition.  You'll need to give it a passphrase (I recommend something that's easy to type, like a series of four random words, but you do you).  You'll have to type this passphrase every time that you boot your machine up.
sudo cryptsetup --verify-passphrase luksFormat /dev/sdx1;
Once that's done, you can add more passphrases (LUKS1 volumes have 8 key slots in total).  This may be helpful if you want to let other people access the disk, or if you just want to have some backups, just in case.  If there are multiple passphrases, any one of them will work fine; you don't need to have more than one on hand.
sudo cryptsetup --verify-passphrase luksAddKey /dev/sdx1;
The next step is to "open" the partition.  The last argument ("encrypted-home") is the name to use for the partition that will appear under "/dev/mapper".
sudo cryptsetup luksOpen /dev/sdx1 encrypted-home;
At this point, everything is set up and ready to go.  Confirm that with the "status" command.
sudo cryptsetup status encrypted-home;
Back up the LUKS header data.  If this information gets corrupted on the disk, then there is no way to recover your data.  Note that if you recover data using the header backup, then the passphrases will be the ones in the header backup, not whatever was on the disk at the time of the recovery.
sudo cryptsetup luksHeaderBackup /dev/sdx1 --header-backup-file /root/luks.encrypted-home.header;
I put mine in the /root folder (which will not be on the encrypted home partition), and I also backed it up to Google Drive.  Remember, if you add, change, or delete passphrases, you'll want to make another backup (otherwise, those changes won't be present during a restoration operation).
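For completeness, restoring from that backup later would look something like this (double-check "man cryptsetup" for your version before pointing it at a real disk):
sudo cryptsetup luksHeaderRestore /dev/sdx1 --header-backup-file /root/luks.encrypted-home.header;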

If you're really hardcore, fill up the partition with random data so that no part of it looks special.  Remember, the whole point of encryption is to make whatever you write end up looking random, so writing a bunch of zeros with "dd" will do the trick:
sudo dd if=/dev/zero of=/dev/mapper/encrypted-home bs=1M status=progress;
Before you can do anything with it, you'll need to format the partition.  I used EXT4 because everything else on this machine is EXT4.
sudo mkfs.ext4 /dev/mapper/encrypted-home;

Home setup, phase 1

Once the LUKS partition is all set up, the next set of steps is just a careful copy operation, tweaking a couple /etc files, and verifying that everything worked.

The safest thing to do would be to switch to a live CD here so that you're guaranteed to not be messing with your /home directory, but I just logged out of my window manager and did the next set of steps in the ctrl+alt+f2 terminal.  Again, you do you.

Mount the encrypted home directory somewhere where we can access it.
sudo mkdir /mnt/encrypted-home; 
sudo mount /dev/mapper/encrypted-home /mnt/encrypted-home;
Copy over everything in /home.  This could take a while.
sudo cp -a /home/. /mnt/encrypted-home/;
Make sure that /mnt/encrypted-home contains the home folders of your users.

Set up /etc/crypttab.  The format is:
${/dev/mapper name} UUID="${disk uuid}" none luks
In our case, the /dev/mapper name is going to be "encrypted-home".  To find the UUID, run:
sudo blkid /dev/sdx1;
So, in my particular case, /etc/crypttab looks like:
encrypted-home UUID="5e01cb97-ceed-40da-aec4-5f75b025ed4a" none luks
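If you'd rather not copy the UUID by hand, you can generate the whole line straight from "blkid" (a small sketch; swap in your actual partition):
echo "encrypted-home UUID=\"$( sudo blkid -s UUID -o value /dev/sdx1 )\" none luks";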
Finally, tell /etc/fstab to mount the partition to our /mnt/encrypted-home directory.  We don't want to clobber /home until we know that everything works.

Update /etc/fstab and add:
/dev/mapper/encrypted-home /mnt/encrypted-home ext4 defaults 0 0
Reboot your machine.

When it comes back up, it should ask you for the passphrase for the encrypted-home partition.  Give it one of the passphrases that you set up.

Log in and check /mnt/encrypted-home.  As long as everything's in there that's supposed to be in there (that is, all of your /home data), then phase 1 is complete.

Home setup, phase 2

Now that we know everything works, the next step is to clean up your actual /home directory and then tell /etc/fstab to mount /dev/mapper/encrypted-home at /home.

I didn't want to completely purge my /home directory; instead, I deleted everything large and/or personal in there (leaving my bash profile, some app settings, etc.).  This way, if my new disk failed or if I wanted to use my computer without it for some reason, then I'd at least have normal, functioning user accounts.  Again, you do you.  I've screwed up enough stuff in my time to appreciate having a somewhat nice fallback scenario ready to go.

Update /etc/fstab and change the /dev/mapper/encrypted-home line to mount to /home.
/dev/mapper/encrypted-home /home ext4 defaults 0 0
Reboot.

When it comes back up, it should ask you for the passphrase for the encrypted-home partition.  Give it one of the passphrases that you set up.

Log in.  You should now be using an encrypted home directory.  Yay.

To confirm, check your mountpoints:
mount | grep /home
You should see something like:
/dev/mapper/encrypted-home on /home type ext4 (rw,relatime,data=ordered)
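If you want to see the whole partition-to-mapping-to-mountpoint chain, "lsblk" shows it nicely (here /dev/sdx is the same example disk name used above):
lsblk -o NAME,TYPE,MOUNTPOINT /dev/sdx;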
Now that everything's working, you can get rid of "/mnt/encrypted-home"; we're not using it anymore.
sudo rmdir /mnt/encrypted-home;

Friday, March 23, 2018

Fix for when Chrome stops making screen updates

My desktop environment is KDE, and I use Chrome for my browser.  At any given time, I'll have 2-5 windows with 10-40 tabs each.  However, every once in a while (usually once every week or so), the rendering of Chrome will freeze.  That is, the entire window will remain frozen (visually), but clicks and everything else go through fine (you just can't see the results).  Changing my window focus (switching to a different window, opening the "K" menu, etc.) usually causes a single render, but that doesn't help with actually interacting with a (visually) frozen Chrome window.

Closing Chrome and opening it back up works, but that's really inconvenient.

I'm still not sure why this happens, but I do have a quick (and convenient) fix: change your compositor (and then change it back).  Why does this work?  I'm not sure, but since it's obviously a rendering problem, making a rendering change makes sense.

Step by step:

  1. Open "System Settings".
  2. Open "Display and Monitor".
  3. Go to "Compositor".
  4. Change "Rendering backend" from whatever it is to something else (usually "OpenGL 3.1" to "OpenGL 2.0" or vice versa).
  5. Click "Apply".
This always solves the problem for me.  You can even switch it back to the original value after you hit "Apply" the first time.

Monday, October 16, 2017

Hidden dependencies with Gerrit and Bazel

This is going to be very short.

I was trying to build the "go-import" plugin for Gerrit (because apparently no one wants to build it and publish it), and I spent the past four hours trying to get "bazel build ..." to actually run.

If you're getting errors like these:
root@25f962af2a00:/gerrit# bazel fetch //...
ERROR: /gerrit/lib/jetty/BUILD:1:1: no such package '@jetty_servlet//jar': Argument 0 of execute is neither a path nor a string. and referenced by '//lib/jetty:servlet'
ERROR: /gerrit/lib/codemirror/BUILD:11:1: no such package '@codemirror_minified//jar': Argument 0 of execute is neither a path nor a string. and referenced by '//lib/codemirror:mode_q_r'
ERROR: /gerrit/lib/codemirror/BUILD:11:1: no such package '@codemirror_minified//jar': Argument 0 of execute is neither a path nor a string. and referenced by '//lib/codemirror:theme_solarized_r'
ERROR: /gerrit/lib/codemirror/BUILD:11:1: no such package '@codemirror_original//jar': Argument 0 of execute is neither a path nor a string. and referenced by '//lib/codemirror:theme_elegant'
ERROR: Evaluation of query "deps(//...)" failed: errors were encountered while computing transitive closure
Building: no action running

root@25f962af2a00:/gerrit# bazel build --verbose_failures plugins/go-import:go-import

ERROR: /gerrit/lib/BUILD:176:1: no such package '@jsr305//jar': Argument 0 of execute is neither a path nor a string. and referenced by '//lib:jsr305'
ERROR: Analysis of target '//plugins/go-import:go-import' failed; build aborted

Then you probably also need these packages:

  1. nodejs
  2. npm
  3. zip
I was building on a Docker container, and it didn't have any of those.
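If you're in the same boat (a bare-bones Debian/Ubuntu container), something like this should pull in the missing tools:
apt-get update && apt-get install -y nodejs npm zip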

The "Argument 0 of execute" line appears to be referring to some internal call to "execute" that looks something like "execute("npm",...)" or "execute("zip",...").

Anyway, that instantly fixed my problem.

Friday, March 17, 2017

Shrinking a disk to migrate to a smaller SSD

I have finally migrated all of my storage to solid state drives (SSDs).  For my older machines, the 256GB SSDs were bigger than their normal hard drives, so Clonezilla worked perfectly well.  The very last holdout was my 2TB Proxmox server, since my largest SSD was 480GB.

Since I was only using about 120GB of space on that disk, my plan was to shrink the filesystems, then shrink the partitions, and then clone the disk with Clonezilla.

tl;dr both Clonezilla and grub can be jerks.

Even after shrinking everything down, Clonezilla refused to clone the disk because the SSD was smaller than the original (even though all of the partitions would fit on the SSD with plenty of room to spare).  Clonezilla did let me clone a partition at a time, so I did that, but the disk wouldn't boot.  To make a long story short, you need to make sure that you run "update-grub" and "grub-install" after pulling a disk-switcheroo stunt.  And yes, to do that, you'll need to use a live OS to chroot in (and yes, you'll need to mount "/dev", etc.) and run those commands.

Just that paragraph above would have saved me about six hours of time, so there it is, for the next person who is about to embark on a foolish and painful journey.

Shrink a disk

My Proxmox host came with a 2TB spinner disk, and I foolishly decided to use it instead of putting in a 256GB SSD on day one.  I didn't feel like reinstalling Proxmox and setting all of my stuff up again, and I certainly didn't want to find out that I didn't back up some crucial file or couldn't remember the proper config; I just wanted my Proxmox disk migrated over to SSD, no drama.

Remember, you aren't actually shrinking the disk itself; rather, you're reducing the allocated space on that disk by shrinking your filesystems and partitions (and, if you're using LVM, your logical volumes and physical volumes, too).

So, the goal is: shrink all of your partitions down to a size that they can be copied over to a smaller disk.
Remember, you don't have to shrink all of your partitions, just the big ones.  In my case, I had a 1MB partition, a 500MB partition, and a 1.9TB partition.  Obviously, I left the first two alone and shrunk the third one.

First: Shrink whatever's on the partition

How easy this step is depends on whether you're using LVM or not.  If you are not using LVM and just have a normal filesystem on the partition, then all you need to do is resize the filesystem to a smaller size (obviously, you have to have less space in use than the new size).

If you are using LVM, then your partition is full of LVM stuff (namely, all of the data needed to make the logical volumes (LVs) that reside on that partition (which is the physical volume, or PV)).
Note: you cannot mess with a currently-mounted filesystem.  If you need to resize your root filesystem, then you'll have to boot into a live OS and do it from there.
(From here on, we'll use "sda" as the original disk and "sdb" as the new SSD.)

Option 1: Filesystem

Shrink the filesystem using the appropriate tool for your filesystem.  For example, let's assume that we want to shrink "/dev/sda3" down to 50GB.  You could do this with an EXT4 filesystem via:
resize2fs -p /dev/sda3 50G
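Note that "resize2fs" will typically refuse to shrink a filesystem until you've run a forced check on it (and the filesystem must not be mounted), so the full sequence looks more like:
e2fsck -f /dev/sda3
resize2fs -p /dev/sda3 50G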

Option 2: LVM

Shrink whatever filesystems are on the logical volumes that are on the physical volume that you want to shrink.  For example, let's assume that we want to shrink the "your-lv" volume down to 50GB.
resize2fs -p /dev/your-vg/your-lv 50G

Then, shrink the logical volume to match:
lvresize /dev/your-vg/your-lv --size 50G

Now that the logical volume has been shrunk, you should be able to see free space in its physical volume.  Run "pvs" and look at the "PFree" column; that's how much free space you have.

Now shrink the physical volume:
pvresize /dev/sda3 --setphysicalvolumesize 50G
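It's worth sanity-checking the numbers after each of these steps; a quick look at the volumes (using the same hypothetical "your-vg" naming as above) could be:
sudo pvs -o pv_name,vg_name,pv_size,pv_free
sudo lvs -o lv_name,vg_name,lv_size your-vg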

Second: Shrink the partition

Be very, very sure of what you want the new size of your partition to be.  Very sure.  The surest that you've ever been in your life.

Then, use "parted" or "fdisk" to (1) delete the partition, and then (2) add a new partition that's smaller.  The new partition must start on the exact same spot that the old one did.  Make sure that the partition type is set appropriately ("Linux" or "Linux LVM", etc.).  Basically, the new one should look exactly like the old one, but with the ending position smaller.

Clone a disk

Use Clonezilla (or any other live OS, for that matter) to clone the partitions to the new disk.  With Clonezilla, I still had to make the partitions on the new disk myself, so while you're there, you might as well just "dd" the contents of each partition over.

Create the new partitions and copy the data

On the new disk, create partitions that will correspond with those on the old disk.  You can create them with the same size, or bigger.  I created my tiny partitions with the same sizes (1MB and 500MB), but I created my third partition to take up the rest of the SSD.

Copy the data over using whatever tool makes you happy; I like "dd" because it's easy.  Be very, very careful about which disk is the source disk (input file, or "if") and which is the target disk (output file, or "of").  For example:
dd if=/dev/sda1 of=/dev/sdb1 bs=20M
dd if=/dev/sda2 of=/dev/sdb2 bs=20M
dd if=/dev/sda3 of=/dev/sdb3 bs=20M

If you create larger partitions on the new disk, everything will still work just fine; the filesystems or LVM physical volumes simply will have the original sizes.  Once everything is copied over, you can simply expand the filesystem (or, for LVM, expand the physical volume, then the logical volume, and then the filesystem) to take advantage of the larger partition.

Remove the original disk

Power off and remove the original disk; you're done with it anyway.

If you're using LVM, that original disk will only get in the way, since LVM will see duplicates of everything (physical volumes, volume groups, logical volumes) and won't set up your volumes correctly.

(From here on, we'll use "sda" as the new SSD since we've removed the original disk.)

Bonus: expand your filesystems

Now that everything's copied over to the new disk, you can expand your filesystems to take up the full size of the partition.  You may have to run "fsck" before it'll let you expand the filesystem.

For a normal filesystem, simply use the appropriate tool for your filesystem.  For example, let's assume that we want to grow "/dev/sda3" to take up whatever extra space there might be.  You could do this with an EXT4 filesystem via:
resize2fs /dev/sda3

In the case of LVM, you'll have to expand the physical volume first, then the logical volume, and then the filesystem (essentially, this is the reverse of the process to shrink them).  For example:
pvresize /dev/sda3
lvresize -l +100%FREE /dev/your-vg/your-lv
resize2fs /dev/your-vg/your-lv
Note: the "-l +100%FREE" arguments to "lvresize" tell it to expand to include all ("100%") of the free space available on the physical volume.

Fix grub

I still don't fully grok the changes that UEFI brought to the bootloader, but I do know this: after you pull a disk-switcheroo, grub needs to be updated and reinstalled.

So, using your live OS, mount the root filesystem from your new disk (and any other dependent filesystem, such as "/boot" and "/boot/efi").  Then create bind mounts for "/dev", "/dev/pts", "/sys", "/proc", and "/run", since those are required for grub to install properly.
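For reference, the mount dance from the live OS looks something like this (device names and the separate /boot are illustrative; adjust them to your actual layout, and add /boot/efi if you have one):
sudo mount /dev/sda3 /mnt                  # the root filesystem from the new disk
sudo mount /dev/sda2 /mnt/boot             # only if /boot is a separate partition
for dir in /dev /dev/pts /sys /proc /run; do sudo mount --bind "${dir}" "/mnt${dir}"; done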

Finally, chroot in to the root filesystem and run "update-grub" followed by "grub-install" (and give it the path to the new disk):
update-grub
grub-install /dev/sda

Conclusion

That should do it.  Your new disk should boot properly.

For a decent step-by-step guide on shrinking an LVM physical volume (but with the intent of creating another partition), see this article from centoshelp.org.

Live-updating a Java AppEngine project's HTML with Maven

I recently switched all of my Java AppEngine projects to use Maven once Google updated its Eclipse plugin to support the Google Cloud SDK (instead of the AppEngine SDK); see the getting-started with Maven guide.  Google Cloud SDK is much more convenient for me because it gets regular updates via "apt-get" as the "google-cloud-sdk" package, plus it has a nice CLI for working with an app (great for emergency fixes via SSH).

Maven is relatively new to me; I tried it years ago, but it didn't play well with the AppEngine plugin for Eclipse, so I gave up.  Now that it's fully supported, I am very happy with it.  tl;dr is that it's the Java compiler and toolkit that you always thought existed, but didn't.

There is exactly one problem with the new Eclipse plugin for AppEngine/Maven support: the test server doesn't respect any of the plugin settings in my "pom.xml" file.  This means that it won't run on the port that I tell it to run on, etc.

That's not a huge deal because I can just run "mvn appengine:run" to run it perfectly via the command line (and I love my command line).
Aside: the command is actually slightly more involved since AppEngine needs Java 7 (as opposed to Java 9, which is what my desktop runs).
JAVA_HOME=/usr/lib/jvm/java-7-oracle/ mvn appengine:run
This worked perfectly while I was editing my Java source files.  The default "pom.xml" has a line in it that tells Eclipse to compile its Java code to the appropriate Maven target directory, which means that the development server will reload itself anytime that the Java source code changes.
In case you're wondering, that line is:
<!-- for hot reload of the web application -->
<outputDirectory>${project.build.directory}/${project.build.finalName}/WEB-INF/classes</outputDirectory>
However, when I later had to update the HTML portion of my app, my changes weren't being applied.  I'd change a file, reload a million times, and nothing would work.  Restarting the development server would work, but that was crazy to me: why would the development server support a hot reload for Java source code changes but not for trivial static files (such as the HTML files)?

I spent hours and hours troubleshooting this very thing, and I finally learned how this whole Maven thing works:
  1. Maven compiles the application (including the Java code) to a "target".
  2. For development purposes, the target is a directory that can be scanned for changes; for production purposes, the target is a "jar" or "war" file, for example.
  3. That "outputDirectory" tag in "pom.xml" tells Eclipse to compile its Java changes to the appropriate target directory, allowing a running development server to hot reload the Java changes.
When the target directory is built, all of the static files are copied to it.  When you edit such a file in Eclipse, you're editing the original file in the "src/main/webapp" folder, not the target folder that the development server is running from.

To solve the problem (that is, to add hot reload for static files), you simply need to mirror whatever change you made in the original folder to the copy in the target folder.

I spent too many hours trying to figure out how to do this in Eclipse and eventually settled on a short bash script that continually monitors the source directory using "inotifywait" for changes and then mirrors those changes in the target directory (as I said, I love my command line).

My solution is now to run "mvn appengine:run" in one terminal and this script in another (it uses "pom.xml" to figure out the target directory):

#!/bin/bash

declare -A commands;
commands[inotifywait]=inotify-tools;
commands[xmlstarlet]=xmlstarlet;

for command in "${!commands[@]}"; do
   if ! which "${command}" &>/dev/null; then
      echo "Could not find command: ${command}";
      echo "Please intall it by running:";
      echo "   sudo apt-get install ${commands[$command]}";
      exit 1;
   fi;
done;

artifactId=$( xmlstarlet sel -N my=http://maven.apache.org/POM/4.0.0 --template --value-of /my:project/my:artifactId pom.xml );
version=$( xmlstarlet sel -N my=http://maven.apache.org/POM/4.0.0 --template --value-of /my:project/my:version pom.xml );

sourceDirectory=src/main/webapp/;
sourceDirectoryLength="${#sourceDirectory}";
targetDirectory=target/"${artifactId}-${version}"/;

echo "Source directory: ${sourceDirectory}";
echo "Target directory: ${targetDirectory}";

inotifywait -m -r src/main/webapp/ -e modify -e moved_to -e moved_from -e create -e delete 2>/dev/null | while read -a line; do
   echo "Line: ${line[@]}";
   fullPath="${line[0]}";
   operation="${line[1]}";
   filename="${line[2]}";
   
   path="${fullPath:$sourceDirectoryLength}";

   echo "Operation: ${operation}";
   echo "   ${path} :: ${filename}";

   case "${operation}" in
      CREATE|MODIFY|MOVED_TO)
         cp -v -a -f "${fullPath}${filename}" "${targetDirectory}${path}${filename}";
         ;;
      DELETE|MOVED_FROM)
         rm -v -r -f "${targetDirectory}${path}${filename}";
         ;;
      *)
         echo "Unhandled operation: ${operation}";
         ;;
   esac;
done;

Sunday, January 8, 2017

Running transmission as a user in Ubuntu 16.04

I recently bought a QNAP TS-451+ NAS for my house.  This basically meant that all of my storage would be backed by NFS mounts.  QNAP (currently) only supports NFSv3, so the UIDs and GIDs on any files have to match the UIDs and GIDs on all of my boxes for everything to work.

The "standard" first user (in my case, "codon") in Ubuntu has a UID of 1000 and a GID of 1000 (also called, uncreatively "codon").  I have this user set up on every box in my network, so this means that any "codon" user on any box can read and write to anything that any other "codon" user created.

However, I use transmission to download torrents.  Transmission runs as the "debian-transmission" user, with a UID of 500 and a GID of 500.  Because of this, "codon" had to use "sudo" to change the ownership of any downloaded files before moving them to their final destinations.  In addition, my Plex server couldn't see the files (because the "plex" user was in the "codon" group with GID 1000).

Without jumping through a bunch of hoops, I figured that the simplest solution would be to run transmission as the "codon" user.  This should have been the simplest way, but I ran into a bunch of problems and outdated guides.  Here, I present to you how to run transmission as a user in Ubuntu 16.04.

Background: systemd

Ubuntu 16.04 uses systemd for its "init" system.  Systemd is great, but it's a whole different game than the older SysV init or now-defunct Upstart init systems.

The magic of systemd is that you do not have to modify any system-owned files; this means that you will not get conflicts when you upgrade.  Instead, you simply supplement an existing unit (a "unit" is what systemd operates on, not an "init script") by creating a new file in a particular location.

Step 1: run as a different user

Explanation

The systemd unit for transmission is called "transmission-daemon", and it is defined here:
/lib/systemd/system/transmission-daemon.service
In principle, you could simply modify that file to change the "User=debian-transmission" line, but that's not the systemd way.  Instead, we're going to create a supplement file in "/etc/systemd/" that has a different "User=..." line.  When systemd reads its configuration, it'll read the original "transmission-daemon.service" file, and then it'll apply the changes in our supplement file.

The appropriate place to put these supplement files is the following directory:
/etc/systemd/${service-path}.d/
Here, ${service-path} refers to the path after "/lib/systemd".  In transmission's case, this is "system/transmission-daemon.service".

What to do

Stop transmission (if it's already running).
sudo systemctl stop transmission-daemon;
Create the supplement file directory for transmission.
sudo mkdir -p /etc/systemd/system/transmission-daemon.service.d;
Create a new supplement file called "run-as-user.conf".
sudo vi /etc/systemd/system/transmission-daemon.service.d/run-as-user.conf
and put the following text in it.
[Service]
User=codon
Obviously, use your desired username and not "codon".

Tell systemd to reload its units.
sudo systemctl daemon-reload;
To confirm that the changes went through, ask systemd to print the configuration that it's using for transmission.
sudo systemctl cat transmission-daemon;
You should see something like this:
# /lib/systemd/system/transmission-daemon.service
[Unit]
Description=Transmission BitTorrent Daemon
After=network.target 
[Service]
User=debian-transmission
Type=notify
ExecStart=/usr/bin/transmission-daemon -f --log-error
ExecReload=/bin/kill -s HUP $MAINPID 
[Install]
WantedBy=multi-user.target 
# /etc/systemd/system/transmission-daemon.service.d/run-as-user.conf
[Service]
User=codon
You can see that our supplement file was appended to the end.

Step 2: get the configuration directory setup

Explanation

Transmission loads its configuration from the user's home directory, in particular:
~/.config/transmission-daemon/
Since Ubuntu runs transmission as the "debian-transmission" user, the transmission configuration resides in that user's home directory, which happens to be:
/var/lib/transmission-daemon
And here's where things get strange.

The normal place for system-wide configuration files is "/etc/", and transmission tries to pass itself off as a normal, system-wide daemon.  To get around this, the main configuration file for transmission appears to be "/etc/transmission-daemon/settings.json", but it's actually:
/var/lib/transmission-daemon/.config/transmission-daemon/settings.json
The Ubuntu configuration just happens to symlink that file to "/etc/transmission-daemon/settings.json".
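You can see the relationship for yourself; one of the two paths will show up as a symlink to the other:
ls -l /etc/transmission-daemon/settings.json /var/lib/transmission-daemon/.config/transmission-daemon/settings.json;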

Now that we're going to be running our transmission as our user and not "debian-transmission", the configuration will be loaded from our user's home directory (in my case, "/home/codon").  Thus, the fake-me-out "/etc/transmission-daemon/settings.json" configuration file will not be used.

When transmission starts, if it doesn't have a configuration directory set up already, it will set one up on its own.  So here, we want to start transmission briefly so that it configures this directory structure for us.

What to do

Start transmission and then stop transmission.
sudo systemctl start transmission-daemon;
sudo systemctl stop transmission-daemon;
You should now have the following directory in your user's home directory:
.config/transmission-daemon/

Step 3: change "settings.json" and start transmission

Explanation

Transmission will load its configuration from "~/.config/transmission-daemon/settings.json" when it starts, so just update that file with whatever your configuration is and then start transmission.

Note that transmission will overwrite that file when it stops, so make your changes while transmission is not running.  If you make changes while transmission is running, then simply issue the "reload" command to it and it'll reload the configuration live.
sudo systemctl reload transmission-daemon;

What to do

Update "~/.config/transmission-daemon/settings.json" to suite your needs.

Then, start transmission:
sudo systemctl start transmission-daemon;
Make sure that everything's running okay:
sudo systemctl status transmission-daemon;
You should see something like this:
● transmission-daemon.service - Transmission BitTorrent Daemon
   Loaded: loaded (/lib/systemd/system/transmission-daemon.service; disabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/transmission-daemon.service.d
           └─run-as-user.conf
   Active: active (running) since Sun 2017-01-08 19:28:27 EST; 1min ago
 Main PID: 3271 (transmission-da)
   Status: "Uploading 54.18 KBps, Downloading 0.17 KBps."
   CGroup: /system.slice/transmission-daemon.service
           └─3271 /usr/bin/transmission-daemon -f --log-error

Jan 08 19:28:27 transmission systemd[1]: Starting Transmission BitTorrent Daemon...
Jan 08 19:28:27 transmission systemd[1]: Started Transmission BitTorrent Daemon.

Thursday, April 28, 2016

Troubleshooting PXE boot problems on a Dell server

At work, we ship "appliances"—servers that have a pre-installed operating system and software stack.  The general workflow to build one is:

  1. The box is racked and plugged into the "build" network.
  2. The box PXE boots to a special controller script.
  3. Someone answers a few questions.
  4. The script partitions and formats the disks and installs the OS and software.
  5. The box is powered down, packaged, and shipped.
I recently spent days fighting with a new PXE boot "live OS", encountered error messages so lonely that few sites on the Internet reference them, debugged "initrd", and ultimately solved my problems by feeling really, really stupid.

I'll tell you the story of what happened in case you happen to try to walk down this path in the future, and I'll go into detail on everything as we go.  But first...

The Motivation

When you PXE boot, you get a "live OS", typically a read-only operating system with some tmpfs mounts for write operations.  This OS is often minimal, having just enough tools and packages to accomplish the goal of partitioning, formatting, and copying files.  Ours is done by mounting the root filesystem over NFS, so the smaller the OS, the better.

We had two problems that I wanted to solve:
  1. The current live OS was hand-built on Ubuntu 12.04 with no documentation; and
  2. Ubuntu 12.04's version of "lbzip2" is buggy and can't decompress some files.
My goal was to switch to Ubuntu 14.04 (which has a perfectly working version of "lbzip2") in a way that the live OS would be generated from a script (reliably and repeatably).

(If you're interested in building an Ubuntu filesystem from scratch that you can then use as a live OS, or make into an ISO image, or whatever, see debootstrap.)
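For the curious, the heart of such a script is a single debootstrap call; the suite, target directory, and mirror here are just examples:
sudo debootstrap trusty ./trusty-root http://archive.ubuntu.com/ubuntu/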

The Execution

I created a few-dozen line script to bootstrap Ubuntu 14.04, install a kernel and some packages, and install the tools that it would need to run when it booted.  That part went well.

I copied the new OS to our NFS server, replacing the existing one.  That part went well.

All of our automated virtual-appliance builds continued to work (actually better and faster than before).  That part went well.

I met with the manufacturing team to get the formal sign-off that it would work with the physical Dell servers that we use.  This did not go well.

The Flaw In The Plan

Manufacturing booted up a box onto the build network but it crashed.

Here's what he reported:
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000200

CPU: 4 PID: 1 Comm: init Not tainted 3.13.0-24-generic #47-Ubuntu
Hardware name:    /0JP31P, BIOS 2.5.2 01/28/2015
 ffff880802af0000 ffff8808041f5e48 ffffffff81715ac4 ffffffff81a4c480
 ffff8808041f5ec0 ffffffff8170ecc5 ffffffff00000010 ffff8808041f5ed0
 ffff8808041f5e70 ffffffff81f219e0 0000000000000200 ffff8808041f8398
Call Trace:
 [<ffffffff81715ac4>] dump_stack+0x45/0x56
 [<ffffffff8170ecc5>] panic+0xc8/0x1d7
 [<ffffffff8106a391>] do_exit+0xa41/0xa50
 [<ffffffff8109dd84>] ? vtime_account_user+0x54/0x60
 [<ffffffff8106a41f>] do_group_exit+0x3f/0xa0
 [<ffffffff8106a494>] SyS_exit_group+0x14/0x20
 [<ffffffff817266bf>] tracesys+0xe1/0xe6
------------[ cut here ]------------
WARNING: CPU: 4 PID: 1 at /build/buildd/linux-3.13.0/arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x5d/0x60()
Modules linked in: nfs lockd sunrpc fscache

Okay, a kernel panic.  I did upgrade the kernel from the old Ubuntu 12.04 version.  In fact, the old live OS had a hand-built Gentoo kernel for arcane reasons.  So maybe...

The Red Herring

Well, I did install the Ubuntu 14.04 ".deb" package for the Fusion I/O Drive's "iomemory-vsl" kernel module.  Some of our equipment have I/O Drives, so the live OS needs to be able to talk to them.

The VMs (which don't have I/O Drives) worked fine, and the physical hardware (which do) did not work.

I assumed that there was some mismatch between Fusion I/O's kernel module and the Ubuntu 14.04 kernel.  I compiled the driver (using Fusion I/O's guide) and re-made my live OS.

Same result: kernel panic.

In addition, it turned out that physical hardware that did not have an I/O Drive failed, as well.

The Hint

I took out my phone and recorded the boot process at 120 frames per second to see the text before the kernel panic.  The messages that I'd been able to see were not helpful, and when the kernel panics, I lose the ability to scroll up in the history.  And since the system never gets up all the way, there's no way to access the logs.

Here's what I saw:
systemd-udevd[329]: starting version 204
Begin: Loading essential drivers ... done
Begin: Running /scripts/init-premount ... done
Begin: Mounting root file system ... Begin: Running /scripts/nfs-top ... done
FS-Cache: Loaded
RPC: Registered named UNIX socket transport module.
RPC: Registered udp transport module.
RPC: Registered tcp transport module.
RPC: Registered tcp NFSv4.1 backchannel transport module.
FS-Cache: Netfs 'nfs' registered for caching
ipconfig: no devices to configure
ipconfig: no devices to configure
ipconfig: no devices to configure
ipconfig: no devices to configure
ipconfig: no devices to configure
ipconfig: no devices to configure
ipconfig: no devices to configure
ipconfig: no devices to configure
ipconfig: no devices to configure
ipconfig: no devices to configure
/init: .: line 252: can't open '/run/net-*.conf'
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000200

The short story here is that the "ipconfig" message repeating over and over means that the kernel can't find any network interfaces (at least other than "lo").

The long story is that when booting from network (NFS), the "initrd" init script is going to run some scripts to set up NFS.  One of these calls a function "configure_networking" (since NFS won't work without networking).  That function tries to figure out which device to use for networking.

For a box with multiple interfaces, this could be a bit tricky.  However, if you've turned on the 0x2 bit in your PXE setup using the "IPAPPEND" directive, then you'll get an environment variable called "BOOTIF" that encodes the MAC address of the interface that originally PXE booted.  The "configure_networking" function then looks through the /sys/class/net/* files to see which interface has that MAC address.  From there, it will know the device and can set it up.
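For context, the relevant chunk of a pxelinux config looks something like this (the kernel, initrd, and NFS paths are illustrative; the important part is the "IPAPPEND" line):
LABEL build
  KERNEL vmlinuz
  APPEND initrd=initrd.img root=/dev/nfs nfsroot=192.168.0.10:/srv/liveos ip=dhcp
  IPAPPEND 2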

However, "ipconfig" claims that there are "no devices to configure".  Also, if you debug the "configure_networking" function, you'll see that there's only one file in /sys/class/net, and that's "lo".  So, the system has a loopback interface, but none of the Ethernet interfaces that I expected.

The Problem

Very simply put: the appropriate Ethernet driver (kernel module) was not present in the "initrd" file.

The "initrd" file (short for "initial RAM disk") contains some bootstrap scripts and kernel modules.  For example, for PXE booting, we get the kernel and the "initrd" file from TFTP (or HTTP) from the Ethernet card itself as part of the process.  Thus, it is up to the "initrd" file to get networking up and running so that it can mount the root filesystem over NFS.

I jumped through hoops trying to figure out how to get the "initrd" file to have enough (or the right) modules.  I eventually settled on bucking the trends set in the "Diskless Ubuntu" and "Diskless Debian" guides.

In my /etc/initramfs-tools/initramfs.conf, I set:
MODULES=most
instead of:
MODULES=netboot

The "MODULES" variable tells "update-initramfs" (and "mkinitramfs") which modules to include in the "initrd" file.  Obviously, less modules means less space.  So setting "MODULES" to "netboot" is a special setting that tries to pull in the bare minimum to get networking working.  I figured whatever; more modules is better than less modules, especially since the module for my Ethernet card isn't making it in there.

However, that still didn't help.

(Note: I haven't experimented with "MODULES=netboot" again; the "initrd" file that gets generated is under 20MB, and it takes under 1 second to transfer that on my network, so I'm not too interested in trimming that file down to the bare essentials at this point.)

My Ethernet card was some kind of Broadcom NetXtreme card, and the Internet seemed to think that the appropriate kernel module for it is "tg3".

After banging my head for way too long, I realized that "tg3" was not present in either the "initrd" file or the "/lib/modules/" directory.  I didn't realize that this was a problem because the Ubuntu 12.04 version of the system had compiled-in drivers (remember, I said that it was a Gentoo-built kernel); they were not built as modules (".ko" files).  So all of my grepping and comparing of files and directories seemed to tell me that a "tg3.ko" file was not necessary or relevant.

Once I realized that I needed the "tg3" module (and that it was compiled into the older kernel), I had to find out where to get it from.

The Solution

It turns out that there are two Linux kernel packages in Ubuntu:
  1. linux-image; and
  2. linux-image-extra.
The "linux-image-extra" package was key.  This package dropped in a lot of modules, including "tg3.ko".

So, in addition to installing "linux-image", I installed "linux-image-extra".  When I ran "update-initramfs", all of the modules got copied into the "initrd" file.  So when I booted the box, it successfully found my interfaces (for example, "eth2"), mounted the root NFS filesystem, and continued along its merry way.
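Concretely, inside the live-OS chroot the fix amounted to something like this (the version suffix here matches the 3.13.0-24 kernel from the panic trace; use whichever kernel version you actually installed):
apt-get install linux-image-extra-3.13.0-24-generic
update-initramfs -u -k 3.13.0-24-generic
After that, the boot messages looked much healthier: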

FS-Cache: Netfs 'nfs' registered for caching
[...]
IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
[...]
IP-Config: eth2 hardware address b0:83:XX:XX:XX:XX mtu 1500 DHCP RARP
IP-Config: no response after 2 secs - giving up
tg3 0000:01:00.0 eth2: Link is up at 1000 Mbps, full duplex
tg3 0000:01:00.0 eth2: Flow control is on for TX and on for RX
tg3 0000:01:00.0 eth2: EEE is disabled
IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
IP-Config: eth2 hardware address b0:83:XX:XX:XX:XX mtu 1500 DHCP RARP
IP-Config: no response after 3 secs - giving up
IP-Config: eth2 hardware address b0:83:XX:XX:XX:XX mtu 1500 DHCP RARP
IP-Config: eth2 guessed broadcast address 10.XXX.XX.XXX
IP-Config: eth2 complete (dhcp from 192.168.XX.XX):
 address: 10.XXX.XX.XXX    broadcast: 10.XXX.XX.XXX    netmask: 255.255.252.0
 gateway: 10.XXX.XX.X      dns0     : 192.168.XX.XX    dns1   : 10.XXX.X.XX
 domain : XXX.XXXXXX.com
 rootserver: 10.XXX.XXX.XX rootpath:
 filename: pxelinux.0
Begin: Running /scripts/nfs-premount ... done
Begin: Running /scripts/nfs-bottom ... done
done
Begin: Running /scripts/init-bottom ... done

The Lesson

Well, what did I learn here?
  1. "ipconfig: no devices to configure" means that the kernel doesn't think that you have any Ethernet devices (either there aren't any, or you're missing the driver for it).
  2. "linux-image-extra" has all the drivers that you probably want (it certainly has "tg3").
  3. When I see "Running /scripts/nfs-premount" messages, that file lives in the "initrd" file.  It is totally possible to take apart an "initrd" file (with "cpio"), make changes to the scripts, and then put it back together (again with "cpio"); see the sketch after this list.
  4. If it seems impossible that your I/O Drive kernel module is causing networking-related kernel panics, then you're probably right and you have a networking issue somewhere.
  5. Not being able to set up NFS in a network-booting setup will result in a "kernel panic".  This surprised me; I figured that I'd get to an emergency console or something.  Kernel panics usually point to deep, scary problems, not something as simple as not getting an IP address.
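As a footnote to lesson 3, the take-apart/put-back-together cycle looks roughly like this (a sketch that assumes a gzip-compressed "initrd"; adjust the decompression step if yours differs):
mkdir initrd-work; cd initrd-work;
zcat ../initrd.img | cpio -i -d -m -v;
# ...edit the scripts as needed, then repack:
find . | cpio -o -H newc | gzip > ../initrd.img.new;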