Thursday, April 28, 2016

Troubleshooting PXE boot problems on a Dell server

At work, we ship "appliances"—servers that have a pre-installed operating system and software stack.  The general workflow to build one is:

  1. The box is racked and plugged into the "build" network.
  2. The box PXE boots to a special controller script.
  3. Someone answers a few questions.
  4. The script partitions and formats the disks and installs the OS and software.
  5. The box is powered down, packaged, and shipped.

I recently spent days fighting with a new PXE boot "live OS", encountered error messages so lonely that few sites on the Internet reference them, debugged "initrd", and ultimately solved my problems by feeling really, really stupid.

I'll tell you the story of what happened in case you happen to try to walk down this path in the future, and I'll go into detail on everything as we go.  But first...

The Motivation

When you PXE boot, you get a "live OS", typically a read-only operating system with some tmpfs mounts for write operations.  This OS is often minimal, having just enough tools and packages to accomplish the goal of partitioning, formatting, and copying files.  Ours is done by mounting the root filesystem over NFS, so the smaller the OS, the better.

We had two problems that I wanted to solve:

  1. The current live OS was hand-built on Ubuntu 12.04 with no documentation; and
  2. Ubuntu 12.04's version of "lbzip2" is buggy and can't decompress some files.

My goal was to switch to Ubuntu 14.04 (which has a perfectly working version of "lbzip2") and to generate the live OS from a script, reliably and repeatably.

(If you're interested in building an Ubuntu filesystem from scratch that you can then use as a live OS, or make into an ISO image, or whatever, see debootstrap.)

The Execution

I wrote a script of a few dozen lines to bootstrap Ubuntu 14.04, install a kernel and some packages, and install the tools that the live OS would need to run when it booted.  That part went well.
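
For a rough idea of the shape of such a script, here is a minimal sketch.  It is not the script that I actually used: the target path, the mirror URL, and the package list are all assumptions for illustration, and a real script would also need to handle details like mounting /proc inside the chroot.  It only builds the "debootstrap" and "chroot" command lines and, by default, prints them instead of running them.

```python
import subprocess


def build_commands(target, suite="trusty", arch="amd64",
                   mirror="http://archive.ubuntu.com/ubuntu/",
                   packages=("linux-image-generic",)):
    """Return the command lines for a bare-bones bootstrap of an Ubuntu tree.

    "trusty" is the suite name for Ubuntu 14.04.  The package list here is a
    placeholder, not the list from the original script.
    """
    return [
        # Unpack a minimal Ubuntu system into the target directory.
        ["debootstrap", "--arch=" + arch, suite, target, mirror],
        # Install a kernel (and anything else) inside the new tree.
        ["chroot", target, "apt-get", "update"],
        ["chroot", target, "apt-get", "install", "-y"] + list(packages),
    ]


def run(commands, dry_run=True):
    """Print each command; execute it as well only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.check_call(command)


# Hypothetical target directory; dry run only, so nothing is executed.
run(build_commands("/srv/nfs/live-os"))
```

Keeping the commands as plain lists makes it easy to review exactly what would run before letting the script loose as root.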

I copied the new OS to our NFS server, replacing the existing one.  That part went well.

All of our automated virtual-appliance builds continued to work (actually better and faster than before).  That part went well.

I met with the manufacturing team to get the formal sign-off that it would work with the physical Dell servers that we use.  This did not go well.

The Flaw In The Plan

Manufacturing booted a box on the build network, but it crashed.

Here's what they reported:
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000200

CPU: 4 PID: 1 Comm: init Not tainted 3.13.0-24-generic #47-Ubuntu
Hardware name:    /0JP31P, BIOS 2.5.2 01/28/2015
 ffff880802af0000 ffff8808041f5e48 ffffffff81715ac4 ffffffff81a4c480
 ffff8808041f5ec0 ffffffff8170ecc5 ffffffff00000010 ffff8808041f5ed0
 ffff8808041f5e70 ffffffff81f219e0 0000000000000200 ffff8808041f8398
Call Trace:
 [<ffffffff81715ac4>] dump_stack+0x45/0x56
 [<ffffffff8170ecc5>] panic+0xc8/0x1d7
 [<ffffffff8106a391>] do_exit+0xa41/0xa50
 [<ffffffff8109dd84>] ? vtime_account_user+0x54/0x60
 [<ffffffff8106a41f>] do_group_exit+0x3f/0xa0
 [<ffffffff8106a494>] SyS_exit_group+0x14/0x20
 [<ffffffff817266bf>] tracesys+0xe1/0xe6
------------[ cut here ]------------
WARNING: CPU: 4 PID: 1 at /build/buildd/linux-3.13.0/arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x5d/0x60()
Modules linked in: nfs lockd sunrpc fscache

Okay, a kernel panic.  I did upgrade the kernel from the old Ubuntu 12.04 version.  In fact, the old live OS had a hand-built Gentoo kernel for arcane reasons.  So maybe...

The Red Herring

Well, I did install the Ubuntu 14.04 ".deb" package for the Fusion I/O Drive's "iomemory-vsl" kernel module.  Some of our equipment has I/O Drives, so the live OS needs to be able to talk to them.

The VMs (which don't have I/O Drives) worked fine, and the physical hardware (which does) did not work.

I assumed that there was some mismatch between Fusion I/O's kernel module and the Ubuntu 14.04 kernel.  I compiled the driver (using Fusion I/O's guide) and re-made my live OS.

Same result: kernel panic.

In addition, it turned out that physical hardware that did not have an I/O Drive failed, as well.

The Hint

I took out my phone and recorded the boot process at 120 frames per second to see the text before the kernel panic.  The messages that I'd been able to see were not helpful, and when the kernel panics, I lose the ability to scroll up in the history.  And since the system never gets up all the way, there's no way to access the logs.

Here's what I saw:
systemd-udevd[329]: starting version 204
Begin: Loading essential drivers ... done
Begin: Running /scripts/init-premount ... done
Begin: Mounting root file system ... Begin: Running /scripts/nfs-top ... done
FS-Cache: Loaded
RPC: Registered named UNIX socket transport module.
RPC: Registered udp transport module.
RPC: Registered tcp transport module.
RPC: Registered tcp NFSv4.1 backchannel transport module.
FS-Cache: Netfs 'nfs' registered for caching
ipconfig: no devices to configure
ipconfig: no devices to configure
ipconfig: no devices to configure
ipconfig: no devices to configure
ipconfig: no devices to configure
ipconfig: no devices to configure
ipconfig: no devices to configure
ipconfig: no devices to configure
ipconfig: no devices to configure
ipconfig: no devices to configure
/init: .: line 252: can't open '/run/net-*.conf'
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000200

The short story here is that the "ipconfig" message repeating over and over means that the kernel can't find any network interfaces (at least other than "lo").

The long story is that when booting from the network (NFS), the "initrd" init script runs some scripts to set up NFS.  One of these calls a function named "configure_networking" (since NFS won't work without networking).  That function tries to figure out which device to use for networking.

For a box with multiple interfaces, this could be a bit tricky.  However, if you've set bit 1 (value 2, binary 10) in your PXE setup using the "IPAPPEND" directive, then you'll get an environment variable called "BOOTIF" that encodes the MAC address of the interface that originally PXE booted (in the form "01-xx-xx-xx-xx-xx-xx", where the leading "01" is the hardware type for Ethernet).  The "configure_networking" function then looks through the /sys/class/net/*/address files to see which interface has that MAC address.  From there, it knows the device and can set it up.

However, "ipconfig" claims that there are "no devices to configure".  Also, if you debug the "configure_networking" function, you'll see that there's only one file in /sys/class/net, and that's "lo".  So, the system has a loopback interface, but none of the Ethernet interfaces that I expected.
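
To make that lookup concrete, here is a small sketch of the same idea in Python.  The real logic is a shell function inside the "initrd"; the function names below are mine, and the only things taken from the system are the "BOOTIF" format and the /sys/class/net/<interface>/address layout.

```python
import os


def bootif_to_mac(bootif):
    """Turn a BOOTIF value like '01-b0-83-fe-00-00-01' into 'b0:83:fe:00:00:01'.

    The first field is the hardware type (01 for Ethernet); the remaining six
    are the octets of the MAC address.
    """
    parts = bootif.strip().lower().split("-")
    if len(parts) != 7:
        raise ValueError("expected a hardware type plus six octets: %r" % bootif)
    return ":".join(parts[1:])


def find_interface(mac, sysfs_net="/sys/class/net"):
    """Return the interface whose 'address' file matches mac, or None."""
    for name in sorted(os.listdir(sysfs_net)):
        address_file = os.path.join(sysfs_net, name, "address")
        try:
            with open(address_file) as handle:
                current = handle.read().strip().lower()
        except OSError:
            # Not every entry is an interface directory with an address file.
            continue
        if current == mac.lower():
            return name
    return None
```

On a box where the driver is missing, "find_interface" would walk a directory that contains only "lo" and return None, which is the situation that "ipconfig" was complaining about.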

The Problem

Very simply put: the appropriate Ethernet driver (kernel module) was not present in the "initrd" file.

The "initrd" file (short for "initial RAM disk") contains some bootstrap scripts and kernel modules.  When PXE booting, the Ethernet card's firmware downloads the boot loader, which in turn fetches the kernel and the "initrd" file over TFTP (or HTTP).  Once the kernel starts, though, the firmware's network support is gone and the kernel needs its own driver for the card.  Thus, it is up to the "initrd" file to get networking up and running so that it can mount the root filesystem over NFS.
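
If you want to see which modules actually made it into an "initrd" file, you can list its contents.  The sketch below assumes that the whole file is a single gzip-compressed cpio archive in the "newc" format; images with a different compression, or with an extra uncompressed archive prepended, would need more handling.  The function names are mine.

```python
import gzip

HEADER_LEN = 110  # 6-byte magic followed by thirteen 8-digit hex fields


def _pad4(offset):
    """Round an offset up to the next multiple of four."""
    return (offset + 3) & ~3


def list_newc(data):
    """Return the member names of an uncompressed 'newc' cpio archive (bytes)."""
    names = []
    offset = 0
    while offset + HEADER_LEN <= len(data):
        header = data[offset:offset + HEADER_LEN]
        if header[:6] not in (b"070701", b"070702"):
            raise ValueError("not a newc cpio header at offset %d" % offset)
        fields = [int(header[6 + 8 * i:14 + 8 * i], 16) for i in range(13)]
        file_size, name_size = fields[6], fields[11]
        name_start = offset + HEADER_LEN
        # name_size counts the trailing NUL byte, so drop it.
        raw_name = data[name_start:name_start + name_size - 1]
        name = raw_name.decode("utf-8", "replace")
        if name == "TRAILER!!!":
            break
        names.append(name)
        # Both the name and the file data are padded to 4-byte boundaries.
        data_start = _pad4(name_start + name_size)
        offset = _pad4(data_start + file_size)
    return names


def initrd_modules(path):
    """List the kernel module (.ko) files inside a gzip-compressed initrd."""
    with gzip.open(path, "rb") as handle:
        names = list_newc(handle.read())
    return [name for name in names if name.endswith(".ko")]
```

Calling "initrd_modules" on the generated file and looking for the name of your Ethernet driver is a quick way to confirm (or rule out) a missing module before you ever boot a box.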

I jumped through hoops trying to figure out how to get the "initrd" file to have enough (or the right) modules.  I eventually settled on bucking the trends set in the "Diskless Ubuntu" and "Diskless Debian" guides.

In my /etc/initramfs-tools/initramfs.conf, I set:
MODULES=most
instead of:
MODULES=netboot

The "MODULES" variable tells "update-initramfs" (and "mkinitramfs") which modules to include in the "initrd" file.  Fewer modules means a smaller file, and "netboot" is a special setting that tries to pull in the bare minimum to get networking working.  I figured that more modules is better than fewer, especially since the module for my Ethernet card wasn't making it in there.

However, that still didn't help.

(Note: I haven't experimented with "MODULES=netboot" again; the "initrd" file that gets generated is under 20MB, and it takes under 1 second to transfer that on my network, so I'm not too interested in trimming that file down to the bare essentials at this point.)

My Ethernet card was some kind of Broadcom NetXtreme card, and the Internet seemed to think that the appropriate kernel module for it is "tg3".

After banging my head for way too long, I realized that "tg3" was not present in either the "initrd" file or the "/lib/modules/" directory.  I didn't realize that this was a problem because the Ubuntu 12.04 version of the system had compiled-in drivers (remember, I said that it was a Gentoo-built kernel); they were not built as modules (".ko" files).  So all of my grepping and comparing of files and directories seemed to tell me that a "tg3.ko" file was not necessary or relevant.

Once I realized that I needed the "tg3" module (and that it was compiled into the older kernel), I had to find out where to get it from.
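
In hindsight, a check like the following would have saved me a lot of grepping.  It is a sketch under a couple of assumptions: modules are stored as uncompressed ".ko" files, and the kernel's install step wrote a "modules.builtin" file (a hand-built kernel may not have one, in which case "missing" only means "not found as a module").  The function name is mine.

```python
import os


def find_module(name, modules_dir):
    """Report how a driver is provided under one /lib/modules/<version> tree.

    Returns ("module", path) if <name>.ko exists somewhere in the tree,
    ("builtin", entry) if modules.builtin lists it, or ("missing", None).
    """
    wanted = name + ".ko"
    for root, _dirs, files in os.walk(modules_dir):
        if wanted in files:
            return ("module", os.path.join(root, wanted))
    builtin_list = os.path.join(modules_dir, "modules.builtin")
    try:
        with open(builtin_list) as handle:
            for line in handle:
                entry = line.strip()
                if os.path.basename(entry) == wanted:
                    return ("builtin", entry)
    except OSError:
        # No modules.builtin file: we cannot tell whether it is compiled in.
        pass
    return ("missing", None)
```

For example, "find_module("tg3", "/lib/modules/3.13.0-24-generic")" should come back as "missing" on a system like mine before the fix, and comparing that against the old kernel's tree is what exposes the compiled-in driver.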

The Solution

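It turns out that the Ubuntu 14.04 kernel is split across two packages (the real package names carry a kernel version suffix, such as "linux-image-extra-3.13.0-24-generic"):

  1. linux-image; and
  2. linux-image-extra.

The "linux-image-extra" package was key.  This package dropped in a lot of modules, including "tg3.ko".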

So, in addition to installing "linux-image", I installed "linux-image-extra".  When I ran "update-initramfs", all of the modules got copied into the "initrd" file.  So when I booted the box, it successfully found my interfaces (for example, "eth2"), mounted the root NFS filesystem, and continued along its merry way.

FS-Cache: Netfs 'nfs' registered for caching
[...]
IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
[...]
IP-Config: eth2 hardware address b0:83:XX:XX:XX:XX mtu 1500 DHCP RARP
IP-Config: no response after 2 secs - giving up
tg3 0000:01:00.0 eth2: Link is up at 1000 Mbps, full duplex
tg3 0000:01:00.0 eth2: Flow control is on for TX and on for RX
tg3 0000:01:00.0 eth2: EEE is disabled
IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
IP-Config: eth2 hardware address b0:83:XX:XX:XX:XX mtu 1500 DHCP RARP
IP-Config: no response after 3 secs - giving up
IP-Config: eth2 hardware address b0:83:XX:XX:XX:XX mtu 1500 DHCP RARP
IP-Config: eth2 guessed broadcast address 10.XXX.XX.XXX
IP-Config: eth2 complete (dhcp from 192.168.XX.XX):
 address: 10.XXX.XX.XXX    broadcast: 10.XXX.XX.XXX    netmask: 255.255.252.0
 gateway: 10.XXX.XX.X      dns0     : 192.168.XX.XX    dns1   : 10.XXX.X.XX
 domain : XXX.XXXXXX.com
 rootserver: 10.XXX.XXX.XX rootpath:
 filename: pxelinux.0
Begin: Running /scripts/nfs-premount ... done
Begin: Running /scripts/nfs-bottom ... done
done
Begin: Running /scripts/init-bottom ... done

The Lesson

Well, what did I learn here?
  1. "ipconfig: no devices to configure" means that the kernel doesn't think that you have any Ethernet devices (either there aren't any, or you're missing the drivers for them).
  2. "linux-image-extra" has all the drivers that you probably want (it certainly has "tg3").
  3. When I see "Running /scripts/nfs-premount" messages, that file lives in the "initrd" file.  It is totally possible to take apart an "initrd" file (with "cpio"), make changes to the scripts, and then put it back together (again with "cpio").
  4. If it seems impossible that your I/O Drive kernel module is causing networking-related kernel panics, then you're probably right and you have a networking issue somewhere.
  5. Not being able to set up NFS in a network-booting setup will result in a "kernel panic".  This surprised me; I figured that I'd get to an emergency console or something.  Kernel panics usually point to deep, scary problems, not something as simple as not getting an IP address.
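
On that last point, the "exitcode" in the panic message is a clue that is easy to overlook.  As far as I can tell, it uses the usual wait-status layout (signal number in the low 7 bits, exit status in the next 8 bits), so it can be decoded with a couple of bit operations.  The helper below is my own sketch, not something from the kernel or from "initramfs-tools".

```python
def decode_wait_status(status):
    """Split a wait-status style value into (how the process ended, detail)."""
    signal = status & 0x7F
    if signal:
        return ("killed by signal", signal)
    return ("exited", (status >> 8) & 0xFF)


print(decode_wait_status(0x00000200))  # ('exited', 2)
```

So the "init" process was not killed; it simply exited with status 2, which would be consistent with the /init shell script giving up after it failed to open '/run/net-*.conf'.  The kernel panics because process 1 is never allowed to exit, not because anything deep and scary happened.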