Fun with Windows TCP/IP debugging

Today I learned… In Windows, if a process listens on a port, spawns a child, and then dies, no other process can listen on that port until all of the children have been terminated. So, if you were running, say, PowerShellServer, and a process inside of an SSH session hangs, then you can’t restart the server until you hunt down the hung process.
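
My understanding is that the mechanism behind this is handle inheritance: the child inherits the listening socket, so the port stays bound for as long as the child lives. Here is a minimal Python sketch of the knob involved, assuming that is indeed the cause. The loopback address and port 0 are just placeholders, and this has nothing to do with how PowerShellServer itself is written.

```python
import socket

# Open a listening socket on an arbitrary free loopback port (placeholder values).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)

# Since Python 3.4 (PEP 446), new sockets are non-inheritable by default,
# so a spawned child does not silently keep the port open.
inheritable_by_default = listener.get_inheritable()

# A server that spawns children can make the intent explicit.
listener.set_inheritable(False)
listener.close()
```

A server written in a language where handles are inheritable by default would need to clear the flag itself before spawning anything.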

Thank you Server Fault for the answer, TCPView for telling me that a zombie was listening on the port, and Process Explorer for identifying the orphaned processes.

Posted in Uncategorized | Leave a comment

Test, test

Testing out the new Byword functionality of posting to WordPress.

Posted in Uncategorized | Leave a comment

The infamous checksum bug

Another networking issue with an instance. Occasionally, an Ubuntu Precise instance would come up without working networking: it wouldn’t get an IP address. Interestingly, this only seemed to happen when it was running on the head node.

(Yes, we sometimes also use our head node as a compute node. It’s a small cluster).

I could see the DHCPDISCOVER and DHCPOFFER packets by tcpdump’ing vnet0, but it never sent out a DHCPREQUEST packet.

To debug this, I had to log into the instance. The problem was, I didn’t know the password for the “ubuntu” user. This was an image that I had downloaded from Canonical.

I needed to alter the password inside of the instance so that I could log into it using VNC. I got the libvirt domain name with a “nova show <instance>” command.

$ nova show myinstance | grep instance_name
| OS-EXT-SRV-ATTR:instance_name | instance-000000e1 |

Then I shut it down:

$ sudo virsh shutdown instance-000000e1

I needed to edit the /etc/shadow file to specify a password for the root account. The problem was that I didn’t know how to generate a password hash in the right format.

It turns out the OpenStack Compute code does this. I whipped up a quick Python script that would output the appropriate hash, and then did this:

$ ./mkpasswd.py mypassword
$5$hla.HR1DOHbjcsPK$FvCd7KYZ0SD.9lpA1Iz5u22DamGbh9YFoCH2u8byr/5
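
The script itself isn't shown here, but the output is a standard crypt(3)-style string: a scheme id, a salt, and a digest, separated by dollar signs. A small sketch that takes such a string apart, assuming there is no optional rounds= field; `describe_crypt_hash` is a name I made up for illustration.

```python
# Scheme ids used by crypt(3)-style hashes in /etc/shadow.
CRYPT_SCHEMES = {"1": "MD5", "5": "SHA-256", "6": "SHA-512"}


def describe_crypt_hash(value):
    """Split a '$id$salt$digest' string into its parts (no rounds= field)."""
    _, scheme_id, salt, digest = value.split("$")
    return {
        "scheme": CRYPT_SCHEMES.get(scheme_id, "unknown"),
        "salt": salt,
        "digest_length": len(digest),
    }


info = describe_crypt_hash(
    "$5$hla.HR1DOHbjcsPK$FvCd7KYZ0SD.9lpA1Iz5u22DamGbh9YFoCH2u8byr/5"
)
```

For the hash above, this reports the SHA-256 scheme with a 16-character salt.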

To edit /etc/shadow inside of the guest, I used the virt-edit program from libguestfs:

$ sudo virt-edit -d instance-000000e1 /etc/shadow
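
The edit itself is just a matter of replacing the second colon-separated field on the root line. I did it by hand in the editor, but here is a sketch of the same change in Python. The sample shadow contents are invented, and `set_shadow_hash` is a hypothetical helper, not part of libguestfs.

```python
def set_shadow_hash(shadow_text, user, password_hash):
    """Return shadow_text with the password field for user replaced."""
    lines = []
    for line in shadow_text.splitlines():
        fields = line.split(":")
        # Field 0 is the login name, field 1 is the password hash.
        if fields[0] == user and len(fields) > 1:
            fields[1] = password_hash
        lines.append(":".join(fields))
    return "\n".join(lines) + "\n"


# Invented sample contents, just to show the shape of the file.
sample = "root:!:15600:0:99999:7:::\nubuntu:!:15600:0:99999:7:::\n"
new_hash = "$5$hla.HR1DOHbjcsPK$FvCd7KYZ0SD.9lpA1Iz5u22DamGbh9YFoCH2u8byr/5"
edited = set_shadow_hash(sample, "root", new_hash)
```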

I added the hash to the root account. Then I started it up paused, so I could run tcpdump and connect to the VNC console before it started to boot:

$ sudo virsh start instance-000000e1 --paused
$ sudo tcpdump -i vnet0 port 67 or port 68

Then I resumed it:

$ sudo virsh resume instance-000000e1

Once it booted, I logged in to the root account using VNC. I ran the DHCP client manually in the foreground, to watch what happened:

# dhclient -d eth0

[Screenshot: output of dhclient -d]

Ah, the infamous UDP checksum problem!

This fixed it:

iptables -t mangle -A POSTROUTING -p udp -m udp --dport 68 -j CHECKSUM --checksum-fill

I looked at the nova code, and it only adds this rule when running in multi-host mode. So perhaps those checksums get filled in by the networking drivers when the UDP packets leave the network host, but this doesn’t happen if the DHCP server is running on the same machine as the guest (which is always the case in multi-host mode).
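
For reference, the value that gets filled in is the standard Internet checksum (RFC 1071): a ones' complement sum over 16-bit words. Here is a sketch of just that core arithmetic. A real UDP checksum also covers a pseudo-header with the IP addresses, which I've left out, and the function name is my own.

```python
def internet_checksum(data):
    """Ones' complement checksum over 16-bit big-endian words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries above 16 bits back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# The worked example bytes from RFC 1071.
example = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(example)
# A receiver re-sums the data plus the checksum and expects zero.
verified = internet_checksum(example + checksum.to_bytes(2, "big"))
```

A packet whose checksum field was never filled in fails that final check, which is why the client discards the offer.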

Posted in openstack | Leave a comment

Writing a cinder_manage Ansible module

I was working on a cinder_manage module for Ansible to invoke the equivalent of “cinder-manage db sync” inside of Python code. This is used to initialize the OpenStack Block Storage database. I had already written modules for glance-manage and keystone-manage, but each of the OpenStack projects uses a slightly different internal API for database initialization. This blog post documents my thought process as I went through the Python code to figure out how it works.

I could just do this with the shell, but I prefer to dig into the guts of the cinder internals and do this in pure Python.

(Technically, I do call “cinder-manage db sync” from the shell, but I implement check mode using Python code).

cinder/db/migration.py has db_sync and db_version methods, but they don’t provide a way to pass in the path to the cinder.conf file that has the database connection info, so I can’t call them directly.

They both defer to functions with the same name in cinder/db/sqlalchemy/migration.py.

Looking at cinder/db/sqlalchemy/migration.py:db_version, we see this:

repository = _find_migrate_repo()
try:
    return versioning_api.db_version(get_engine(), repository)

The _find_migrate_repo function looks for the sqlalchemy migration scripts. It looks relative to the directory of the Python script, so there’s no need to mess with that. The connection string is going to be needed by that get_engine() function:

from cinder.db.sqlalchemy.session import get_engine

OK, let’s look at cinder/db/sqlalchemy/session.py:get_engine

def get_engine():
    """Return a SQLAlchemy engine."""
    global _ENGINE
    if _ENGINE is None:
        connection_dict = sqlalchemy.engine.url.make_url(FLAGS.sql_connection)

A-ha, it’s FLAGS.sql_connection. What’s FLAGS?

FLAGS = flags.FLAGS

OK…

import cinder.flags as flags

There we go, it’s cinder.flags.FLAGS

cinder/flags.py:
from cinder.openstack.common import cfg
FLAGS = cfg.CONF

All right, so flags come from cinder/openstack/common/cfg.py

CONF = CommonConfigOpts()

Hmm, CommonConfigOpts doesn’t take any arguments. Let’s look back at cinder/flags.py

def parse_args(argv, default_config_files=None):
    FLAGS.disable_interspersed_args()
    return argv[:1] + FLAGS(argv[1:],
                            project='cinder',
                            default_config_files=default_config_files)

That’s interesting, it’s actually calling FLAGS and adding to it. That’s what we want. Except we don’t really want to call parse_args, because we don’t have an argv. I think we just want to call FLAGS with our arguments.

But is default_config_files going to be set for us already? And what’s that first argument? Recall that FLAGS is of type CommonConfigOpts. Is that callable? Let’s take a look.

Its parent, ConfigOpts, is callable:

def __call__(self, args=None, project=None, prog=None,
             version=None, usage=None, default_config_files=None):

Let’s see if we can test things out. We want to do something like:
CONF(args=[], project='cinder', default_config_files=['/etc/cinder/cinder.conf'])

One way to test this is to check if the value changes from a default.

>>> from cinder.flags import FLAGS
>>> FLAGS.verbose
False
>>> FLAGS(args=[], project='cinder', default_config_files=['/etc/cinder/cinder.conf'])
[]
>>> FLAGS.verbose
True

Here’s another test

>>> from cinder.flags import FLAGS
>>> FLAGS.sql_connection
'sqlite:////usr/lib/python2.7/dist-packages/cinder.sqlite'
>>> FLAGS(args=[], project='cinder', default_config_files=['/etc/cinder/cinder.conf'])
[]
>>> FLAGS.sql_connection
'sqlite:////var/lib/cinder/cinder.sqlite'

Yup, working.
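
If the callable-config pattern looks odd, here is a toy version that behaves the same way from the outside. This is only a sketch: `TinyConfigOpts` is a stand-in I made up, it is not how cfg.ConfigOpts is implemented, and the default connection string and temporary config file are invented for the demo.

```python
import configparser
import os
import tempfile


class TinyConfigOpts:
    """Toy stand-in for a callable config object: defaults until called."""

    def __init__(self, **defaults):
        self._values = dict(defaults)

    def __call__(self, args=None, project=None, default_config_files=None):
        # Overlay [DEFAULT] values from each config file onto the defaults.
        parser = configparser.ConfigParser()
        parser.read(default_config_files or [])
        self._values.update(parser.defaults())
        # Hand back whatever arguments were not consumed (none, in this toy).
        return list(args or [])

    def __getattr__(self, name):
        values = self.__dict__.get("_values", {})
        if name in values:
            return values[name]
        raise AttributeError(name)


FLAGS = TinyConfigOpts(sql_connection="sqlite:///default.sqlite")

# Write a throwaway config file standing in for /etc/cinder/cinder.conf.
with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as handle:
    handle.write("[DEFAULT]\nsql_connection = sqlite:////var/lib/cinder/cinder.sqlite\n")
    conf_path = handle.name

before = FLAGS.sql_connection
leftover = FLAGS(args=[], project="cinder", default_config_files=[conf_path])
after = FLAGS.sql_connection
os.unlink(conf_path)
```

Calling the object is what triggers the config file load, which is why the value differs before and after the call.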

OK, so we should be able to write a function in cinder_manage to load the config file:

from cinder import flags

def load_config_file(conf):
    flags.FLAGS(args=[], project='cinder',
                default_config_files=[conf])

Now we need to figure out the current version and the repo version. Current version is easy:

from cinder.db import migration
current_version = migration.db_version()

How about the repo version? Let’s look back at how it was done in cinder code.

The db_sync method in cinder/db/sqlalchemy/migration.py isn’t too helpful here:

def db_sync(version=None):
    if version is not None:
        try:
            version = int(version)
        except ValueError:
            raise exception.Error(_("version should be an integer"))
    current_version = db_version()
    repository = _find_migrate_repo()
    if version is None or version > current_version:
        return versioning_api.upgrade(get_engine(), repository, version)
    else:
        return versioning_api.downgrade(get_engine(), repository,
                                        version)

It tells us how to find the sqlalchemy repository:

repository = _find_migrate_repo()

But it doesn’t actually retrieve the repo version.

In keystone_manage, we did this:

repo_path = migration._find_migrate_repo()
repo_version = versioning_api.repository.Repository(repo_path).latest

Will that still work? Let’s check on the command-line. We’ll need to do this:

import cinder.db.sqlalchemy.migration
repo_path = cinder.db.sqlalchemy.migration._find_migrate_repo()

It turns out that this returns a repository, not a path:

In [10]: cinder.db.sqlalchemy.migration._find_migrate_repo()
Out[10]: <migrate.versioning.repository.Repository at 0x37fd250>

We need to change the code a little; we can just do this:

from cinder.db.sqlalchemy import migration
repository = migration._find_migrate_repo()
repo_version = repository.latest

Done!

Of course, this uses an internal API, which means it’s likely to change in the next release, but we can just update the Ansible module when that happens.
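
With both version numbers in hand, check mode boils down to a comparison. A minimal sketch, assuming the module reports a changed flag the way Ansible modules usually do; `check_mode_result` is a made-up name, not something from cinder or from the actual module.

```python
def check_mode_result(current_version, repo_version):
    """Report whether a db sync would change anything, without running it."""
    return {
        "changed": current_version != repo_version,
        "current_version": current_version,
        "repo_version": repo_version,
    }


# Invented version numbers, purely for illustration.
behind = check_mode_result(current_version=0, repo_version=2)
up_to_date = check_mode_result(current_version=2, repo_version=2)
```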

Posted in openstack | 2 Comments

Partitioning is hard to do

Configuring Ubuntu preseed files for automatic partitioning is… non-trivial. Especially when you want to boot off of a large disk. Here’s a gist for those interested. I imagine some lines here are superfluous, but this works.

# Use LVM for partitioning
d-i partman-auto/method string lvm
# If one of the disks that are going to be automatically partitioned
# contains an old LVM configuration, the user will normally receive a
# warning. Preseed this away
d-i partman-lvm/device_remove_lvm boolean true
# And the same goes for the confirmation to write the lvm partitions.
d-i partman-lvm/confirm boolean true
# Really, please don't prompt me!
d-i partman-lvm/confirm_nooverwrite boolean true
# partitioning
# Physical partitions:
# 1. BIOS boot partition: 1 MB See https://wiki.archlinux.org/index.php/GRUB2#GUID_Partition_Table_.28GPT.29_specific_instructions
# 2. Boot partition: 250 MB
# 3. LVM, with the following logical volumes
# – Root partition: 250 GB (256000 MB), ext4.
# – Swap: 100% of RAM
# – Data partition: remaining space, XFS
d-i partman-auto/expert_recipe string \
    boot-root :: \
        1 1 1 free method{ biosgrub } . \
        250 250 250 ext2 \
            $primary{ } $bootable{ } \
            method{ format } format{ } \
            use_filesystem{ } filesystem{ ext2 } \
            mountpoint{ /boot } \
        . \
        100% 2048 100% linux-swap \
            lv_name{ swap } \
            method{ swap } format{ } \
            $lvmok{ } \
        . \
        256000 256000 256000 ext4 \
            lv_name{ root } \
            method{ lvm } format{ } \
            use_filesystem{ } filesystem{ ext4 } \
            mountpoint{ / } \
            $lvmok{ } \
        . \
        1024 1024 -1 xfs \
            lv_name{ data } \
            method{ lvm } format{ } \
            use_filesystem{ } filesystem{ xfs } \
            mountpoint{ /data } \
            $lvmok{ } \
        .
# This makes partman automatically partition without confirmation, provided
# that you told it what to do using one of the methods above.
d-i partman-partitioning/confirm_write_new_label boolean true
d-i partman/choose_partition select finish
d-i partman/confirm boolean true
d-i partman/confirm_nooverwrite boolean true

Posted in sysadmin | Leave a comment

Copying the config from a Cisco switch to OSX via TFTP

I wanted to copy the configuration information from a Cisco switch onto my local machine, which runs Mac OS X (Mountain Lion). You can do this via TFTP, since the switch has a TFTP client and OS X comes with a TFTP server.

You probably need to turn off the OS X firewall if you have it running (System Preferences -> Security & Privacy -> Firewall).

To start the TFTP server on OS X:

$ sudo launchctl load -w /System/Library/LaunchDaemons/tftp.plist

The server uses /private/tftpboot as its root directory for the files.

If you want to copy a file to a TFTP server, a file with the same name must already exist at the destination and be world-writeable. Otherwise you’ll get an “Access denied” error, even if the destination directory is world-writeable.

Therefore, you’ll want to do something like this:

$ sudo touch /private/tftpboot/config.txt
$ sudo chmod a+w /private/tftpboot/config.txt
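
If you ever script this, the same preparation can be done from Python. This is only a sketch: `prepare_upload_target` is a name I made up, and the demo points at a temporary directory rather than /private/tftpboot so that it runs without root.

```python
import os
import shutil
import stat
import tempfile


def prepare_upload_target(tftp_root, filename):
    """Create an empty file under tftp_root and make it writable by everyone."""
    path = os.path.join(tftp_root, filename)
    # Equivalent of touch: create the file if it does not exist yet.
    with open(path, "a"):
        pass
    # Equivalent of chmod a+w: add write permission for user, group, and other.
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH)
    return path


# Demo against a throwaway directory standing in for the TFTP root.
demo_root = tempfile.mkdtemp()
demo_path = prepare_upload_target(demo_root, "config.txt")
demo_mode = stat.S_IMODE(os.stat(demo_path).st_mode)
demo_existed = os.path.exists(demo_path)
shutil.rmtree(demo_root)
```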

Next, log in to the switch via ssh and then copy the running config back to your local machine. In this example, my local machine has the IP 192.168.3.2:

# copy running-config tftp://192.168.3.2/config.txt

The /private/tftpboot/config.txt file should now be populated with the configuration info for the switch.

Once you’re done, you can turn off the TFTP server on your local machine:

$ sudo launchctl unload /System/Library/LaunchDaemons/tftp.plist

I tried this with a Cisco Catalyst 2960 switch, but I suspect it will also work for other devices such as Nexus switches and ASAs.

Posted in sysadmin | 4 Comments

It’s a one-line function that returns a constant. What could go wrong?

Here’s a function:

def get_foo():
    return None

Simple, right? I don’t need to unit test that. Except I did, because that method was part of a mixin:

class MyMixin(object):
    def get_foo(self):
        return None

And that mixin was attached to a class, which inherited from another class. And that parent class also inherited from a different mixin, that had a different implementation of get_foo(). So, if I had actually written the unit test against the class in question, I would’ve caught that error right away.
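
Here is a sketch of how that kind of shadowing can happen. The class names and the base-class order are my guesses for illustration (the real classes aren't shown here); the point is that Python's method resolution order decides which get_foo() wins.

```python
class MyMixin(object):
    def get_foo(self):
        return None


class OtherMixin(object):
    def get_foo(self):
        return "other"


class Parent(OtherMixin):
    pass


# Assumed ordering: the parent is listed before the mixin.
class Child(Parent, MyMixin):
    pass


# Listing the mixin first puts its get_foo() ahead in the resolution order.
class FixedChild(MyMixin, Parent):
    pass


mro_names = [cls.__name__ for cls in Child.__mro__]
shadowed = Child().get_foo()
fixed = FixedChild().get_foo()
```

A one-line unit test against Child would have shown immediately that the mixin's version was never being called.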

At least I had the sense to write the unit test when I was trying to figure out why things were failing in the functional testing.

Write the unit test. Just write it.

Posted in Uncategorized | Leave a comment

Sensor problems

Sometimes the problem isn’t in the underlying system, but it’s in the sensor you’re using. When you’re trying something new, you’re more likely to think it’s the new thing you’re trying that’s broken.

I’m using a feature of Ansible I hadn’t used before: limiting hosts when including a playbook. Unfortunately, it didn’t seem to be working right for me. I was testing it out using the --list-hosts flag, which just shows you what hosts are going to run instead of actually running the Ansible tasks on those hosts.

It turns out, --list-hosts doesn’t work when you limit hosts with includes. I thought the feature was broken, but it works fine.

Posted in ansible | Leave a comment

Lead astray by the symptom

I was instantiating a Django model subclass in a test, and it wasn’t setting one of the foreign key fields. We were subclassing an Order class, and I was doing: OurOrder.objects.create(other=other)

When I retrieved the object from the database later, it wasn’t set. I was convinced it was some weirdness with multi-table inheritance. Turned out I had done other=Other(...) instead of other=Other.objects.create(...), so there was no database entry. Whoops…

Posted in django | Leave a comment

Today’s black hole

SpiderMonkey and Unicode escapes.

Posted in Uncategorized | Leave a comment