Saturday, 5 November 2016

A Mini Datacenter for OSX / MacOS using LXD and Xhyve

This post documents my efforts to create a 'portable' mini data center for OSX/MacOS that I can easily pass to friends and colleagues as a single zip file.

The background to this is that over the last few years I've had the pleasure of using VMware vCenter at work on a Cisco UCS. This has been used for our development and staging environments. On my home server, meanwhile, I have been successfully using LXC and now LXD.

I originally tried LXC a number of years back out of frustration with some VMware Workstation pains on a fairly resource-limited server. Thinking I'd give containerisation a try, I discovered that for the workload I was dealing with, LXC gave me an almost 4x speedup: suddenly my IO and memory bottlenecks vanished.

Needless to say I'm fairly sold on containerisation for development and testing.

Now that I've been using it for some time, I've been looking for a way to take this environment away with me on my laptop when I travel. I've also wanted to be able to create a standalone 'distribution' I can give away to share the goodness.

Getting this all set up and running is fairly complex, and as LXD is still fairly new, there is not a lot of documentation around and none that I could find for making it work on the Mac.

For this article I'm going to focus on getting it running on the Mac using Funtoo as the Linux distribution for the LXD host; however, it is certainly possible to replicate this for other distros and hypervisors. For example, another good choice for the host might be Ubuntu or Alpine.

So, enough talk, now on with the details. Firstly, what am I trying to achieve?
  • A mini Datacenter I can use to quickly spin up 10 Linux images to do some platform development and testing on a typical macbook
  • Support for fast snapshotting and an efficient filesystem architecture
  • A datacenter environment that has some of the features of a vCenter or a Public cloud
  • A design that is API enabled and could easily be linked into a continuous integration environment.
  • Something I can zip up to under 500Meg and throw onto my Google Drive to share.
In order to do this on a laptop (let's say my MacBook Air) I'm going to need to use some pretty cool stuff, so let's list the goodies we are about to play with:
  1. LXD as our datacenter control system
  2. ZFS so we can do the fast snapshotting, and because it is just generally awesome
  3. Xhyve so we can use the native (free) hypervisor inside OSX 
  4. Plan 9 filesystem for sharing with the host (there are plenty of other ways however I haven't played with this one before and it seems like an efficient way to do things)
At this point you can continue reading for the details or you can grab the latest zip file from my google drive and then jump to Booting and Configuring the Datacenter below.

Now, let's talk about the choices in a little more detail:

LXD

In order to comfortably run around ten servers in my laptop datacenter I’m going to need to use containerisation (OS-level virtualisation) rather than full virtualisation.

Containerisation means that rather than running a separate kernel for each vm, I can run a single kernel and share it and its memory across each of my containerised virtual machines.

LXD leverages the power of LXC containers by wrapping them under the control of a REST API enabled daemon. It also provides a great command line client for managing machines.

For more details, go check out the main website for LXD.
There is also a great set of articles by Stéphane Graber that is well worth a look.
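
To give a flavour of what that buys us, the same operations are available both from the lxc command line client and from the REST API over LXD's local unix socket (paths can vary between distros, so treat this as illustrative):

lxc launch images:alpine/3.4/amd64 demo   # create and start a container
lxc exec demo -- uname -a                 # run a command inside it
lxc delete --force demo                   # and throw it away again

# the same container list, straight from the REST API:
curl --unix-socket /var/lib/lxd/unix.socket http://lxd/1.0/containers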

ZFS

A truly amazing filesystem, ZFS has massive scalability, built-in integrity, built-in compression, drive pooling and RAID, just to mention a few things.

Go read about it and come back convinced if you aren't already. For our purposes ZFS is already leveraged by LXD to provide near instantaneous snapshotting.

You could also use Btrfs if you prefer. This will get you to the same goal, however I'm biased towards the Sun / Solaris camp on this one.
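
To make the snapshotting point concrete, once everything below is up and running it looks roughly like this (the container and pool names are the ones used later in this post):

lxc snapshot alpine1 clean            # near-instant, courtesy of ZFS
lxc restore alpine1 clean             # roll the container back
zfs list -t snapshot -r lxdpool       # they are ordinary ZFS snapshots underneath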

Xhyve with 9p / VirtFS

Xhyve is a port of bhyve, FreeBSD's native hypervisor, to OSX. The nice thing about this is that it is free and works natively on just about every Mac with a recent OS and CPU. It is also very small and can be easily bundled with what we are building (it is, for example, bundled as part of Docker for Mac).

Go check out Dunkelstern's excellent Braindump on the topic.

Having a look around I found a number of forks of the original xhyve and one in particular caught my attention having already merged in support for the plan9 filesystem.

Here's the Plan

Now that we have talked a little about the technology, lets draw up a plan to pull all the pieces together.
  • Update the build environment and kernel 
    • (since I will copy a working kernel and modules from here)
  • Build ZFS
  • Build a host OS environment for LXD
    • kernel with virtio, plan9 and required CGroup namespace options set
    • LXD
    • network bridges
    • squid + other useful packages
  • Create the raw disk image for Xhyve
  • Build Xhyve with Plan9
  • Wire it all together and create a Zip file to distribute

1. Update the build environment and kernel

Since I already have Funtoo installed on my build machine, my first step is to update everything (especially in light of the recent Dirty COW kernel vulnerability).
So, following my usual upgrade steps:

emerge --update --newuse --deep --with-bdeps=y @system -a
emerge --update --newuse --deep --with-bdeps=y @world -a 
emerge @preserved-rebuild -a
emerge --depclean -a

According to the release notes for the latest ZFS package the newest kernel currently supported is the 4.8 series. I've been using 4.4.19 successfully for a while, so it looks like it is time to take the plunge and update to 4.8.5

The following command lets me update the kernel using the config for the old working kernel as a starting point

genkernel --zfs --menuconfig --kernel-config=/etc/kernels/kernel-config-x86_64-4.4.19 all

Make sure you have virtio, 9p filesystem, all the control groups and namespace options, ip masquerading and bridging turned on. (I've put my final kernel-config in the shared directory)
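
If you want a quick spot check before building, grepping the config for a handful of the important symbols catches the most common omissions (the full list of required options is longer, so treat this as a sanity check rather than an exhaustive test):

grep -E 'CONFIG_(VIRTIO_NET|VIRTIO_BLK|VIRTIO_CONSOLE|NET_9P|NET_9P_VIRTIO|9P_FS|BRIDGE|IP_NF_NAT|CGROUPS|NAMESPACES|USER_NS)=' \
    /usr/src/linux/.config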

2. Build ZFS

On my build environment I'm using 'zfs on root' hence the --zfs in the above genkernel command, so my first priority is to get my build machine working and rebooted safely....

Now that the kernel has built correctly and is linked in via /usr/src/linux, I can emerge in the Solaris Porting Layer and the ZFS kernel mods as follows:

emerge -av spl zfs-kmod

Finally I need to rebuild my zfs package but before I start I need one more trick. In the /etc/make.conf file I set:

FEATURES="buildpkg"

This means that when zfs finishes building I end up with a nice binary Gentoo package sitting under /usr/portage/packages. This binary package comes in handy later when we build the host environment.

emerge -av zfs
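
If buildpkg is doing its job, the binary package should now be sitting in the packages directory, ready for re-use later (the version will match whatever you just built):

ls /usr/portage/packages/sys-fs/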

After doing the usual messing about with grub.cfg I've now rebooted successfully and have a nice running 4.8.5 kernel.

3. Build a host environment for LXD

So far we have only been working on the build environment. My assumption is that if you have got this far, you have a working build machine running Funtoo, Gentoo or another Linux with working ZFS on a recent kernel. To continue we also need a working LXD in the build environment, as we are going to use an LXD container to build the host environment before converting it into a disk image that can be used by Xhyve.

Under Funtoo or Gentoo this is as simple as:

emerge -av lxd

The best test to know whether it is all going to work smoothly is to run lxc-checkconfig and make sure that every item it reports is a green 'enabled'. You can run this once you have rebooted into your brand new kernel, or you can point it at your kernel source and have it check the config of a non-running kernel.
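
For the second case, lxc-checkconfig reads whichever config file the CONFIG environment variable points at, so (assuming your new kernel source is linked at /usr/src/linux) you can check it before rebooting:

CONFIG=/usr/src/linux/.config lxc-checkconfig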

At the time of writing there is no Funtoo in the provided lxd image repo; however, I notice that there is now a Gentoo image, so it should be fairly easy to start from there as an alternative. I've also been keeping my eye on Alpine because of its very low footprint, however my efforts to get lxd to install from the Alpine test repo have so far been unsuccessful.

To get my Funtoo bootstrapped, I started with an older stage3 so the resulting mini datacenter will hopefully run on a wider variety of Macs, including those dating back a few years.

When starting from a stage3 you need to untar it into a rootfs directory and then tar it up again with the appropriate metadata.yaml so that it can be imported into lxd (see the section on manually building an image in Stéphane Graber's article).
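
In outline that looks something like the following sketch (the stage3 filename and the metadata values are placeholders, see Stéphane's post for the full format):

mkdir -p image/rootfs
tar xpf stage3-funtoo-x86_64.tar.xz -C image/rootfs

cat > image/metadata.yaml <<EOF
architecture: x86_64
creation_date: $(date +%s)
properties:
  os: funtoo
  description: Funtoo stage3
EOF

cd image
tar czf ../myfuntoostage3.tar.gz metadata.yaml rootfs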

Once you have imported your image you can use it to 'init' a new container as follows:

lxc image import myfuntoostage3.tar.gz --alias funtoo
lxc init funtoo funtoo

Now we can do a few more handy tricks to link it through to your build host's existing portage and distfiles repositories as follows:

lxc config device add funtoo sharedmem disk source=/dev/shm path=/dev/shm
lxc config device add funtoo portagedir disk source=/usr/portage path=/usr/portage
lxc config device add funtoo distfiles disk source=/data/distfiles path=/var/distfiles
lxc config set funtoo security.privileged true

I've set security.privileged to true since this means we get 'normal' uid:gid values in the exposed filesystem mounted on the build environment. Later we need to use this exposed rootfs to build an image usable by Xhyve.

So at this point if you list your containers, you should see the new funtoo container appear:

zen ~ # lxc list
+------------+---------+----------------------+------+------------+-----------+
|    NAME    |  STATE  |         IPV4         | IPV6 |    TYPE    | SNAPSHOTS |
+------------+---------+----------------------+------+------------+-----------+
| funtoo     | STOPPED |                      |      | PERSISTENT | 0         |
+------------+---------+----------------------+------+------------+-----------+

It is time to start the container, jump inside it and complete the usual steps to get a running, updated OS built from source. You can check out the instructions on Funtoo.org or refer to the notes for your distro of choice.

Once we are updated, then we need to configure the networking:

zen ~ # lxc start funtoo
zen ~ # lxc exec funtoo bash
funtoo ~ # cd /etc/init.d/
funtoo init.d # ln -s netif.tmpl net.eth0
funtoo init.d # ln -s netif.tmpl net.br0
funtoo init.d # ln -s netif.tmpl net.lxdbr0
funtoo init.d # rc-update add net.eth0 default
 * service net.eth0 added to runlevel default
funtoo init.d # rc-update add net.br0 default
 * service net.br0 added to runlevel default
funtoo init.d # rc-update add net.lxdbr0 default
 * service net.lxdbr0 added to runlevel default

This gives us two bridge interfaces in addition to eth0: br0, which we slave to eth0, and lxdbr0 for the LXD guests.

We need to set up the configs for these interfaces such that br0 gets an address via DHCP from Xhyve (it helps if your build host is also running a DHCP server) and lxdbr0 is assigned a separate class C. First, the eth0 interface should not get an IP address of its own:

cat > /etc/conf.d/net.eth0
template="interface-noip"
^D

The br0 interface will get an address via dhcp

cat > /etc/conf.d/net.br0
template="bridge"
stp="on"
forwarding=1
slaves="net.eth0"
^D

The lxdbr0 bridge, as you might suspect, is for the LXD guest network. This one is given a static address, which needs to sit on the same class C that we will hand out to the guests via dhcpd below; I have arbitrarily chosen the 192.168.14.0/24 network here:

cat > /etc/conf.d/net.lxdbr0
template="bridge"
ipaddr="192.168.14.1/24"
nameservers="8.8.8.8"
domain="lxd.local"
stp="on"
forwarding=1
^D

Make sure you add the essential ingredients such as dhcp (which provides the dhcpd server), dhcpcd and squid (if you want to expose a proxy to your guests):

emerge -av dhcp dhcpcd bind-tools inetd dnsmasq squid
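
If you do want to expose squid to the guests, a couple of lines in /etc/squid/squid.conf along these lines is enough to let the guest network through on the default port, and the service needs to be in the default runlevel (a sketch; squid's stock config already contains the deny rule referred to in the comment):

# in /etc/squid/squid.conf, above the final 'http_access deny all':
acl lxdnet src 192.168.14.0/24
http_access allow lxdnet
http_port 3128

rc-update add squid default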

If you are interested in trying pylxd for accessing the api from python, you might also want to add pip and any other useful utility packages that you know and love.

emerge -av dev-python/pip netkit-telnetd netcat

Since we are going to run our own DHCP server (dhcpd) for the LXD guests, we also add this to /etc/dhcpcd.conf so that the system-wide dhcpcd client does not try to configure the interfaces we are managing ourselves:

denyinterfaces eth0 br0

I added the following lines to /etc/dhcp/dhcpd.conf to give us a 192.168.14.0 network for the guests

subnet 192.168.14.0 netmask 255.255.255.0 {
 range 192.168.14.10 192.168.14.200;
 option routers 192.168.14.1;
}

We also need to add the dhcpd server and the dhcpcd client to the default runlevel:

rc-update add dhcpd
rc-update add dhcpcd

At this point you should be able to poweroff and restart the container to observe that the new interfaces correctly come up. (this assumes your buildhost or local network exposes a dhcp server)
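
A quick way to confirm things from inside the container is:

ip addr show br0      # should have picked up a DHCP lease
ip addr show lxdbr0   # should carry the static 192.168.14.1
ip route              # default route should point out via br0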

Having gotten this far, we have a working LXD guest sharing the build host's kernel (which we updated earlier to support LXD and ZFS).

Rather than trying to rebuild the kernel and modules within this guest, we can leverage the hard work we did earlier and re-use the ZFS package and kernel-modules from the build host.

Within the guest we can install ZFS from the binary package as follows:

funtoo ~ # cd /usr/portage/packages/sys-fs/
funtoo sys-fs # emerge --nodeps zfs-0.6.5.8.tbz2

And now add the various zfs services to the default runlevel

rc-update add zfs-import
rc-update add zfs-mount

Finally, edit /etc/inittab: comment out all of the normal terminals and uncomment a serial console (s0), leaving only that serial console for Xhyve to communicate over:

# SERIAL CONSOLES
s0:12345:respawn:/sbin/agetty -L 115200 ttyS0 vt100
#s1:12345:respawn:/sbin/agetty -L 115200 ttyS1 vt100

We also want to force the virtio-net and zfs drivers to load at boot so we need to edit /etc/conf.d/modules as follows:

modules="virtio-net zfs"

For non-privileged containers we need both /etc/subuid and /etc/subgid to look like this:

dnsmasq:100000:65536
lxd:1000000:65536
root:1000000:65536
squid:165536:65536

Optionally we can trim down the image and remove anything we don't think we need. For example:

emerge -C man-pages man-pages-posix debian-sources genkernel postfix
rm -rf /var/spool/postfix/*

For file sharing between the LXD host and the real host, make a directory called /mnt/shared and add the following to /etc/fstab:

host       /mnt/shared 9p defaults,rw,relatime,sync,dirsync,trans=virtio,version=9p2000.L 0 0
/dev/vda1  /           ext4            noatime,rw              0 1

Finally, set a root password and power down the container.

Now that the container is off (and we can still access its filesystem from the host) we can add the host's modules to the filesystem as follows:

cp -a /lib/modules/4.8.5/ /var/lib/lxd/containers/funtoo/rootfs/lib/modules/

It is a good idea to do a quick sanity check and make sure that both zfs.ko and virtio_net.ko were in there somewhere!

Or to keep it smaller, you might want to just copy the modules you need.
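
For example, something along these lines copies just the virtio network driver plus the out-of-tree spl/zfs modules and then rebuilds the module dependency data inside the container's rootfs (the extra/ directory is where the Gentoo zfs-kmod ebuild normally puts its modules, adjust if yours differ):

SRC=/lib/modules/4.8.5
DST=/var/lib/lxd/containers/funtoo/rootfs/lib/modules/4.8.5
mkdir -p $DST/kernel/drivers/net
cp -a $SRC/kernel/drivers/net/virtio_net.ko $DST/kernel/drivers/net/
cp -a $SRC/extra $DST/
cp -a $SRC/modules.* $DST/ 2>/dev/null
depmod -b /var/lib/lxd/containers/funtoo/rootfs 4.8.5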

4. Create the raw disk image for Xhyve

All the hard work we did in the previous section has now left us with the files needed for our new raw host OS mounted at /var/lib/lxd/containers/funtoo/rootfs

To get these into an image we could do it manually, or we can take the shortcut of using virt-make-fs (emerge this into your build environment if you don't already have it):

virt-make-fs --type=ext4 --format=raw --size=+300M --partition -- \
    /var/lib/lxd/containers/funtoo/rootfs funraw.img

This takes about 5 minutes to run on my Intel NUC and at the end of the operation I have a new raw image. (The +300M gives us just a little headroom on the root partition, however the aim is that we will use /mnt/shared or a zfs pool for any new space required within the image)
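
Before copying it over to the Mac it is worth a quick look at what virt-make-fs produced; the companion libguestfs tools can inspect the image without mounting it:

virt-filesystems -a funraw.img --all --long -h
virt-df -a funraw.img -h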

5. Build Xhyve with Plan9 support

Now on your Mac, clone the xhyve repo and build it as follows:

git clone --recursive https://github.com/jceel/xhyve
cd xhyve ; make

(Note: I did this on 10.11 and it worked fine; I notice there are now errors building on MacOS 10.12.1. YMMV, so I've made my binaries available in my shared directory.)

6. Wire it all together and create a Zip file to distribute

Firstly we copy our new funraw.img, kernel and initramfs to our Mac.
I created a separate directory called lxd with the following files in it:

simonmac:lxd simon$ ls -F
funraw.img                              readme.txt                              xhyve.9p*
initramfs-genkernel-x86_64-4.8.5        runxhyve.sh*
kernel-genkernel-x86_64-4.8.5           shared/

There is an empty directory called 'shared' which we will use to share files with the host MacOS

We can start the image for the first time with the following (a sample script is provided in my shared directory):

./runxhyve.sh

This will ask for your MacOS password since it calls sudo. You should then see the OpenRC boot messages being pumped out of the virtual serial device and printed to stdout by Xhyve.
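
If you are building your own script rather than using the sample from my shared directory, the guts of it will look roughly like the sketch below. The memory size, PCI slot numbers and kernel command line here are assumptions, and you will need to add the fork-specific virtio-9p device for ./shared (its exact -s syntax comes from the README of the xhyve fork you built):

#!/bin/sh
KERNEL=kernel-genkernel-x86_64-4.8.5
INITRD=initramfs-genkernel-x86_64-4.8.5
CMDLINE="earlyprintk=serial console=ttyS0 root=/dev/vda1 rw"

# virtio-net needs root for vmnet, hence the sudo and the password prompt
sudo ./xhyve.9p -A -m 2G -c 2 \
    -s 0:0,hostbridge -s 31,lpc \
    -l com1,stdio \
    -s 2:0,virtio-net \
    -s 4,virtio-blk,funraw.img \
    -f kexec,"$KERNEL","$INITRD","$CMDLINE"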

On first boot you will notice a few errors to do with the filesystem not mounting read-write. I suspect this is something to do with the way virt-make-fs constructs the image; it is easily fixed by logging in and doing the following:

localhost ~ # mount / -o rw,remount
localhost ~ # touch /etc/conf.d
localhost ~ # poweroff

Now when we run the xhyve shell script again it should boot normally and you should be able to ping external hosts. You should also have access to the host's shared directory on /mnt/shared.

At this point I have run 'poweroff' and zipped the entire directory into a zip file and uploaded it to my Google Drive.

7. Booting and Configuring the Datacenter

OK, let's run it again (or for the first time if you have downloaded my prebuilt zip file):

./runxhyve.sh

First we need to set up a nice ZFS pool to place all our guests into.
Rather than having to maintain extra space in the datacenter OS partition for this, we now have the advantage of being able to use the host's filesystem.

Let's create a 10GB backing file to start with. We can always grow this later or do other ZFS magic on it if required.

It would be even nicer if MacOS supported sparse files, however for our purposes this file uses up all the space we ask for:

dd if=/dev/zero of=/mnt/shared/pool.zfs bs=$((1024 * 1024)) count=0 seek=10240
zpool create -f -o cachefile= -O compression=on -m none lxdpool /mnt/shared/pool.zfs

Now let's check that with 'zpool list':

localhost ~ # zpool list
NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
lxdpool  9.94G   283K  9.94G         -     0%     0%  1.00x  ONLINE  -

Also it might be nice to allocate some swap space. We can set this up as follows:

zfs create -V 2G -b $(getconf PAGESIZE) \
              -o primarycache=metadata \
              -o com.sun:auto-snapshot=false lxdpool/swap

Make and mount the swap:

mkswap -f /dev/zvol/lxdpool/swap
swapon /dev/zvol/lxdpool/swap

Rather than adding a swap line to /etc/fstab, I've opted for creating an /etc/local.d/lxd.start script instead. This script checks for the swap device before enabling swap, and it also makes sure the shared images directory is in place. Here is a sample:

#!/bin/bash
if [ ! -d /mnt/shared/images ]; then
    mkdir /mnt/shared/images
fi

# If zfs swap is defined lets activate it
if [ -e /dev/zvol/lxdpool/swap ]; then
    swapon /dev/zvol/lxdpool/swap
fi
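
One small gotcha: OpenRC only runs local.d scripts that are executable, so remember to:

chmod +x /etc/local.d/lxd.start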

OK, before we initialise LXD we need to symlink some directories into /mnt/shared so that we don't eat into the limited headroom on the root filesystem that we created with virt-make-fs:

rm -rf /var/lib/lxd/images
mkdir -p /mnt/shared/images
ln -s /mnt/shared/images/ /var/lib/lxd/images

Now let's initialise LXD:

localhost lxd # lxd init
Name of the storage backend to use (dir or zfs) [default=zfs]:
Create a new ZFS pool (yes/no) [default=yes]? no
Name of the existing ZFS pool or dataset: lxdpool
Would you like LXD to be available over the network (yes/no) [default=no]?
Would you like stale cached images to be updated automatically (yes/no) [default=yes]? no
Would you like to create a new network bridge (yes/no) [default=yes]? no
LXD has been successfully configured.

For simplicity I chose not to set up the network control port for this walkthrough.

This is a good point to poweroff the datacenter host and start it again from the ./runxhyve.sh shell script. Everything should come back up again: 'zpool list' should show the lxdpool as ONLINE and 'free' should show your swap space.

8. Testing the Datacenter 

Let's create a couple of virtual machines and then test the networking between them, and from them to the LXD host.

For the virtual machine choice, you can pick any distro you like out of "lxc image list images:". I've chosen a nice small Alpine image for this test.

localhost ~ # lxc launch images:alpine/3.4/amd64 alpine1
Creating alpine1
Retrieving image: 100%
Starting alpine1
localhost ~ # lxc launch images:alpine/3.4/amd64 alpine2
Creating alpine2
Starting alpine2
localhost ~ # lxc list
+---------+---------+-----------------------+------+------------+-----------+
|  NAME   |  STATE  |         IPV4          | IPV6 |    TYPE    | SNAPSHOTS |
+---------+---------+-----------------------+------+------------+-----------+
| alpine1 | RUNNING | 192.168.14.189 (eth0) |      | PERSISTENT | 0         |
+---------+---------+-----------------------+------+------------+-----------+
| alpine2 | RUNNING | 192.168.14.72 (eth0)  |      | PERSISTENT | 0         |
+---------+---------+-----------------------+------+------------+-----------+

You can log into each with "lxc exec alpineN sh"

localhost ~ # lxc exec alpine1 sh
~ # ping -c 1 192.168.14.72
PING 192.168.14.72 (192.168.14.72): 56 data bytes
64 bytes from 192.168.14.72: seq=0 ttl=64 time=0.212 ms

--- 192.168.14.72 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.212/0.212/0.212 ms
~ # ping -c 1 192.168.14.1
PING 192.168.14.1 (192.168.14.1): 56 data bytes
64 bytes from 192.168.14.1: seq=0 ttl=64 time=0.131 ms

--- 192.168.14.1 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.131/0.131/0.131 ms

This shows we can ping the LXD host 192.168.14.1 and the other alpine container on its dynamically assigned 192.168.14.72

If you want to enable the VMs to reach the outside you can either enable natting on the datacenter host (already enabled if you downloaded the zip) or point applications on them to the squid proxy we installed.
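
For reference, NAT on the datacenter host boils down to something like this (a minimal sketch, assuming br0 is the upstream interface and the guests are on the 192.168.14.0/24 network we configured earlier):

echo 1 > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -s 192.168.14.0/24 -o br0 -j MASQUERADE
# to make it stick across reboots: set net.ipv4.ip_forward = 1 in /etc/sysctl.conf
# and save the rules with 'rc-service iptables save'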

How far can this scale, you might ask? After editing the runxhyve.sh script to give the LXD host 6GB of RAM, I have successfully run up 40 CentOS 6 machines (36 Salt minions and 4 Salt masters) all happily talking to each other over ZeroMQ. In order to run that many containers without hitting a 'Too many open files' error, I needed to add the following line to /etc/sysctl.conf:

fs.inotify.max_user_instances = 1024

Conclusion

This has been a long post, but we have walked a long journey and ended up with a ready-to-run datacenter in a zip file. I've shared the final zip and the various key components on my Google Drive. I hope you found this useful; it was certainly fun to put together.

What's next? I'm feeling tempted to see if I can mod this to also run under KVM and Hyper-V.