The infamous checksum bug

Another networking issue with an instance. Occasionally, an Ubuntu Precise instance would come up without working networking: it wouldn’t get an IP address. Interestingly, this only seemed to happen when the instance was running on the head node.

(Yes, we sometimes also use our head node as a compute node. It’s a small cluster.)

I could see the DHCPDISCOVER and DHCPOFFER packets by tcpdump’ing vnet0, but the instance never sent out a DHCPREQUEST packet.
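
For reference, a normal DHCP exchange has four messages, alternating between client and server. The short Python sketch below (the `first_missing` helper is hypothetical, not part of any tool) shows why the two messages seen on vnet0 point at the client side: the next message in the sequence is one the guest itself should have sent.

```python
# The four messages of a normal DHCP exchange, in order, with who sends each.
DHCP_SEQUENCE = [
    ("DHCPDISCOVER", "client"),
    ("DHCPOFFER", "server"),
    ("DHCPREQUEST", "client"),
    ("DHCPACK", "server"),
]


def first_missing(seen):
    """Return (message, sender) for the first step not in `seen`, or None."""
    for message, sender in DHCP_SEQUENCE:
        if message not in seen:
            return message, sender
    return None
```

With the capture above, `first_missing(["DHCPDISCOVER", "DHCPOFFER"])` returns `("DHCPREQUEST", "client")`, which is why the next step was to look inside the guest.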

To debug this, I had to log into the instance. The problem was, I didn’t know the password for the “ubuntu” user. This was an image that I had downloaded from Canonical.

I needed to alter the password inside of the instance so that I could log into it using VNC. I got the libvirt domain name with a “nova show <instance>” command.

$ nova show myinstance | grep instance_name
| OS-EXT-SRV-ATTR:instance_name | instance-000000e1 |
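
If this lookup ever needs to be scripted, the value can be pulled out of the table that nova show prints. A minimal Python sketch, assuming the two-column `| key | value |` layout shown above; `nova_show_field` is a made-up helper name, not a nova API:

```python
def nova_show_field(table, key):
    """Return the value cell for `key` from two-column `| key | value |` text."""
    for line in table.splitlines():
        # Drop the outer pipes, then split the remaining text into cells.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) == 2 and cells[0] == key:
            return cells[1]
    raise KeyError(key)
```

Fed the output line above, `nova_show_field(output, "OS-EXT-SRV-ATTR:instance_name")` gives back `instance-000000e1`. Border rows such as `+----+----+` are skipped because they do not split into two cells.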

Then I shut it down:

$ sudo virsh shutdown instance-000000e1

I needed to edit the /etc/shadow file to specify a password for the root account. The problem was that I didn’t know how to generate a password hash in the right format.

It turns out the OpenStack Compute code does this. I whipped up a quick Python script that would output the appropriate hash, and then did this:

$ ./ mypassword

To edit /etc/shadow inside of the guest, I used the virt-edit program from libguestfs:

$ sudo virt-edit -d instance-000000e1 /etc/shadow
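
The edit itself is small: each line of /etc/shadow is a colon-separated record, and the second field holds the password hash. A minimal Python sketch of that substitution, for illustration only; `set_shadow_hash` is a hypothetical helper, and any hash passed to it here is a placeholder, not a real one:

```python
def set_shadow_hash(shadow_text, user, password_hash):
    """Return shadow_text with the password field of `user` replaced."""
    lines = []
    found = False
    for line in shadow_text.splitlines():
        fields = line.split(":")
        # Field 0 is the login name, field 1 is the password hash.
        if len(fields) >= 2 and fields[0] == user:
            fields[1] = password_hash
            found = True
        lines.append(":".join(fields))
    if not found:
        raise KeyError("no shadow entry for %r" % user)
    return "\n".join(lines) + "\n"
```

For example, `set_shadow_hash(text, "root", "$6$placeholdersalt$placeholderhash")` rewrites only root's line and leaves every other account untouched. In virt-edit the same change is made by hand in the editor.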

I added the hash to the root account. Then I started it up paused, so I could run tcpdump and connect to the VNC console before it started to boot:

$ sudo virsh start instance-000000e1 --paused
$ sudo tcpdump -i vnet0 port 67 or port 68

Then I resumed it:

$ sudo virsh resume instance-000000e1

Once it booted, I logged in to the root account using VNC. I ran the DHCP client manually in the foreground, to watch what happened:

# dhclient -d eth0


Ah, the infamous UDP checksum problem!

This fixed it:

iptables -t mangle -A POSTROUTING -p udp -m udp --dport 68 -j CHECKSUM --checksum-fill

I looked at the nova code, and it only adds this rule when running in multi-host mode. So perhaps those checksums get filled in by the networking drivers when the UDP packets leave the network host, but this doesn’t happen if the DHCP server is running on the same machine as the guest (which is always the case in multi-host mode).
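
To make the failure concrete, here is a rough Python sketch of how an IPv4 UDP checksum is computed and verified (per RFC 768). It assumes, as a simplification, that an "unfilled" packet carries only the pseudo-header sum in its checksum field, which is my understanding of how checksum offload usually leaves it. The function names and addresses are illustrative and not taken from nova or dhclient.

```python
import socket
import struct


def ones_complement_sum(data):
    """16-bit one's-complement sum of data (zero-padded to an even length)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def pseudo_header(src_ip, dst_ip, udp_length):
    """IPv4 pseudo-header that is covered by the UDP checksum."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, socket.IPPROTO_UDP, udp_length))


def build_segment(src_ip, dst_ip, sport, dport, payload, fill=True):
    """Build a UDP segment; fill=False mimics a checksum left to the NIC."""
    length = 8 + len(payload)
    pseudo = pseudo_header(src_ip, dst_ip, length)
    if fill:
        zeroed = struct.pack("!HHHH", sport, dport, length, 0) + payload
        field = ~ones_complement_sum(pseudo + zeroed) & 0xFFFF
        field = field or 0xFFFF  # 0 is reserved for "no checksum"
    else:
        # Assumption: only the pseudo-header sum is present (offload case).
        field = ones_complement_sum(pseudo)
    return struct.pack("!HHHH", sport, dport, length, field) + payload


def checksum_ok(src_ip, dst_ip, segment):
    """Receiver-side check, as a strict DHCP client might perform it."""
    (field,) = struct.unpack("!H", segment[6:8])
    if field == 0:
        return True  # sender opted out of checksumming (allowed for IPv4)
    pseudo = pseudo_header(src_ip, dst_ip, len(segment))
    return ones_complement_sum(pseudo + segment) == 0xFFFF
```

A segment built with `fill=True` passes `checksum_ok`, while the same segment built with `fill=False` fails it. That matches the behaviour here: the CHECKSUM target with --checksum-fill computes the checksum in software for packets that would otherwise reach the guest without passing through hardware that completes it.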
