Wednesday, May 16, 2012

One Million Bugs

Launchpad.net passed one of those interesting milestones with Bug #1000000 today. OK, so it's interesting for those of us who like base 10; you power-of-2 types will have to wait around for Bug #1073741824.

Putting aside the interesting number, let's use it as a reminder of the amazing participation by the community in not only the Ubuntu project, but all the other projects (including many of my favorites at Linaro) that are hosted on Launchpad.  Thanks to the tireless testing and bug tracking/fixing by this entire community, all of these projects are better off.  It should also serve as a reminder that we should persevere and push even harder to find all those issues still waiting to be discovered so we can get them fixed, and further increase the quality of Ubuntu and the other projects.  Finally, thanks to the Launchpad team for their efforts in producing and maintaining a service that is useful to so many of us.

Wednesday, March 14, 2012

Who tests the test system?

Automated testing is common for most development teams these days.  Usually it starts small - make it easy to run a few tests.  Then it grows into storing the results, then deploying the software you want to test.  Automation can quickly explode into huge systems with complex infrastructure requirements that have to be supported underneath it all.  Like any other software, it needs to be tested.  Like any other infrastructure, it needs to be monitored.

One thing we're doing in LAVA now to help with this is something called "health checks".  Basically, we run a known-good job through the system that exercises various deployment scenarios and logs the results.  If any part of the process fails (including uploading the results), the board is taken offline just in case something is wrong with it.  What could go wrong?  There are a lot of possibilities here, but basically they fall into three categories.

Infrastructure problems - This would include a bad SD card (yep, they go bad sometimes), bad recovery image on the board, faulty cabling, loose power connector, bad switch or terminal server port, even external network connectivity problems.  Another one that you might not normally think of here, is something wrong with the server itself.  For instance, if the server load goes too high, it could cause issues with the running job.  This highlights the need for a more distributed system, which unfortunately also leads to greater complexity and more moving parts.

Device problems - problems with the device under test itself.  We test on ARM development boards, and such failures should be pretty rare.  The possibility exists that we might just have a bad board in some cases though.  Memory could fail, there could be bad solder joints, components on the board itself could fail.  We've seen one case recently where the network port on just one board will sometimes disappear, while other boards of the same type are just fine.  Yes, hardware can fail too.  Identifying failed hardware and shutting it down before we continue to trust it for any other tests is an important step in ensuring that it doesn't affect the results of future testing.

Bugs - Automation is hard, assumptions have to be made, and sometimes those assumptions are broken just as with any other software.  Because of the complexity of the underlying infrastructure, it's often hard to really see what's going to happen until new code is rolled out into production.  We can use the same health check system to force a health check on all boards after a new rollout of the software, and offline any (or all) of the boards if the new software causes problems.

It's not enough though.  Simply running these tests creates a flood of data; the data has to be aggregated, and action has to be taken when things go wrong.  Spring and Michael from the LAVA team have done some great work to add the parts needed for all of this.  We have a new view in the scheduler now that shows the device health for all devices.  We store a health check job for each device type in the database, and whenever a board hasn't been checked in the past 24 hours, or comes online after being offline, its health is checked.  In the event of a new code rollout, we can now mark every board's health status as unknown, forcing each board to check itself before running any other real test job.

What we're left with is a clear view of problems as they are spotted, which helps us see what to do next.  It's easier to spot problems quickly, and we err on the side of caution by offlining the board every time a health job fails to produce the expected result.  Now the hard work begins: focusing on the most frequently found problems first, using the data we gather here to find the infrastructure issues that happen in the middle of the night when nobody would otherwise see them, and fixing those problems.  In the end, what we hope to be left with is a system that is self-checking and self-monitoring, one that keeps test jobs off any device we have reason to suspect, so that a test only fails when the thing we're actually trying to test has a problem.
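To make that policy a bit more concrete, here is a minimal sketch in Python of the decision logic.  It is illustrative only, not LAVA's actual scheduler code, and the device attributes and helper names (submit_health_job, take_offline, and so on) are hypothetical.

# Illustrative sketch of the health-check policy -- not LAVA's real code.
# Device attributes and submit_health_job() are hypothetical names.
from datetime import datetime, timedelta

HEALTH_CHECK_INTERVAL = timedelta(hours=24)

def needs_health_check(device, now=None):
    """Re-check a board if its health is unknown (e.g. after a code
    rollout), if it just came back online, or if it hasn't been
    checked in the past 24 hours."""
    now = now or datetime.utcnow()
    if device.health_status == "unknown" or device.just_came_online:
        return True
    return now - device.last_health_check > HEALTH_CHECK_INTERVAL

def run_health_check(device, scheduler):
    job = scheduler.submit_health_job(device)  # known-good job for this device type
    if job.wait() != "pass":
        # Err on the side of caution: any failure offlines the board so
        # real test jobs never land on a suspect device.
        device.take_offline(reason="health check failed: job %s" % job.id)
    else:
        device.health_status = "pass"
        device.last_health_check = datetime.utcnow()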

Friday, February 03, 2012

LAVA at the Linaro Connect

Interested in validation and going to Linaro Connect?  Here are some of the sessions already scheduled on the Validation track that you might be interested in:


Additionally, here are some sessions in other tracks that you may like:


Make sure to check http://summit.linaro.org/lcq1-12/track/linaro-validation/ throughout the week for changes and additions.

If you are around in the afternoons and want to help out on LAVA development, or want help getting started in it, swing by the validation hacking sessions and talk to us.

Tuesday, December 06, 2011

LAVA Deployment Changes

Up to this point, we've supported LAVA releases in a number of ways:

  • Tarballs - not really recommended, but source in its raw form
  • BZR branches - if you are doing development, or just *have* to be on the bleeding edge
  • pypi - convenient, easy to install, updated with monthly release cycles
  • .deb packages in the PPA - convenient, easy to install for Ubuntu, updated with monthly release cycles

Here are the problems:

Packages are fairly convenient to install, but take quite a bit of time at release to update, rebuild, copy to all supported versions, and test.  Because of this, if we have a new feature or important bug fix that we want to roll out before the next release, we have only two choices: 1. hot-fix it on the server and make very sure that we apply the same fix to trunk, or 2. fix it in trunk, test, make a lava-foo-20YY.MM-1 release, repackage, install, etc.  Option 1 is a bit ugly, but fast.  Option 2 is really the right thing to do, but very time consuming.

Another thing we would really like to do is have the ability to host multiple "instances", such as a production and a staging instance.  Using packages, this isn't really possible.  Using VMs is an option of course, but there are downsides and it would consume a lot of extra resources.  Being able to deploy multiple instances is not only useful for production systems, but for development as well.  If you are working on multiple branches and want to test them separately, it's nice to have an easy way to do that.

Finally, as we look for ways to make LAVA more scalable, one of the things we are looking at is celery.  There are other libraries we need as well, so celery is just one of many, but one of the issues we have here is that there are no packages in the archive.  Sure, we could build a package of it and keep it in our PPA, but then we are maintaining that package in addition to all the other LAVA components.  And there will surely be others besides celery too.

As of yesterday, we are now deploying LAVA in the Linaro Validation Lab using a more flexible approach.  Basically, it involves Python virtual environments, with separate tables for each instance, and each instance running under its own userid.  Zygmunt and Michael in particular did a lot of hacking on most of the components to make them aware of the instances and to create upstart jobs that can start/stop/restart components based on the instance ID.  Instances can be assembled from a list of requirements that can pull from pypi, or even bzr branches.  There are scripts (lp:lava-deploy-tool) to help with creation and setup of the instances, and they even support backing up and restoring the data.
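As a rough illustration of what that requirements list can look like, the snippet below mixes pypi releases with a bzr branch.  The file name, version numbers, and branch choice are placeholders, not the lab's actual configuration.

# Hypothetical requirements file for a single instance -- released
# components from pypi alongside a development branch from bzr.
lava-server==2011.11
lava-dispatcher==2011.11
bzr+http://bazaar.launchpad.net/~linaro-validation/lava-test/trunk/#egg=lava-test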

So what will become of the packages?  It was recently announced on the linaro-dev mailing list that we are phasing out packages, at least for the server components.  We feel the new deployment method offers greater flexibility, stable deployment support, easy ways to update to the latest code or even your own branches, and many other benefits.  Try it out and let us know what you think.

Monday, September 26, 2011

Stacks and stacks of PandaBoards

If you watch our scheduler at http://validation.linaro.org/lava-server/scheduler/ you may have noticed that even though we are increasing the number of continuous integration tests for Android and the Linux kernel, the jobs have been clearing out much more quickly in the past few days.  We've added infrastructure and boards and now have 24 PandaBoards in the Linaro Validation Farm!  We've also updated our rack design to pack a lot more boards into less space more efficiently, while keeping them accessible and serviceable.  Here's a picture Dave sent me, showing a bit of what he's put in place there.


We did hit a bit of a snag with one thing, and I anticipated this would be an issue quite a ways back.  We use linaro-media-create to construct the images in the same way anyone else using Linaro images would construct them, but running 30 of these in parallel will pretty much drag the server down to a crawl.  I did some quick tests of multiple processes running linaro-media-create locally, and the completion time for every l-m-c process running in parallel increases significantly with each new process you add.  Combine this with lots of boards, lots of jobs, and other I/O such as database hits, and it can take hours to complete just the image creation, which should only take minutes.  The long term solution is that we are looking at things like celery to distribute big tasks out to other systems.  In the short term, simply serializing the l-m-c processes results in a significant performance increase for all the jobs.
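As a rough sketch of how that short-term fix works, the idea is just to take a host-wide lock around each linaro-media-create invocation so only one image is built at a time.  The lock file path and wrapper function below are hypothetical, not the dispatcher's actual code.

# Illustrative only: serialize linaro-media-create runs with a host-wide
# file lock so parallel image creation doesn't thrash the server's I/O.
import fcntl
import subprocess

LMC_LOCK = "/var/lock/linaro-media-create.lock"   # hypothetical path

def create_image(lmc_args):
    with open(LMC_LOCK, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)          # blocks until it's our turn
        try:
            subprocess.check_call(["linaro-media-create"] + lmc_args)
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)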

Making LAVA more People-Friendly

One of the other new features of LAVA that's worth pointing out is a subtle, but significant step toward making it a little friendlier for those trying to find the results they are looking for.  Internally, LAVA uses things like SHA1 on the bundles, and UUIDs on the test runs to have a unique identifier that can be transferred between systems.  Previously, we displayed this as the name of the link.  If you're looking through a results stream and trying to find the test you just ran on the ubuntu-desktop image with the lt-mx5 hardware pack though, it's not very helpful.  You could, of course, go through the scheduler and link to the results there, but if you just wanted to browse the results in a bundle stream and look at ones that interest you, there was no easy way to do that.

Now, we use the job_name specified in the job you submit to the scheduler to give it a name.  What you set the job_name field to is entirely up to you.  It's all about helping it to mean something to the end user.  In the example above, the stream of results is for daily testing of hardware packs and images, so the hwpack name, hwpack datestamp, image name, and image datestamp are simply used for the job_name.  Kernel CI results, Android CI results, and others will certainly have different names that mean more to them in their context.
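As a hand-wavy example, a submitted job might carry a descriptive name like the one below.  Only job_name is the field being discussed here; the device type and name shown are placeholders, and the usual actions, timeouts, and so on are omitted.

{
  "job_name": "lt-mx5-20110926-0_ubuntu-desktop-20110926-0",
  "device_type": "mx53loco"
}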

Tuesday, September 20, 2011

Configuring LAVA Dispatcher

An important new change will be landing in the release of LAVA Dispatcher this week, and it should be good news to anyone currently deploying the dispatcher. Configuration for your board types and test devices will no longer be in python modules, but in configuration files that you can keep across upgrades.

First off, if you don't have a config, a default will be provided for you. You'll probably want to tell it more about your environment though. If you are configuring it for the whole system, you will probably want to put your configs under /etc/xdg/lava-dispatcher/. If you are developing locally on your machine, you may want to use ~/.config/lava-dispatcher/ instead. 

The main config file is lava-dispatcher.conf.  Here's an example:
#Main LAVA server IP in the boards farm
LAVA_SERVER_IP = 192.168.1.68

#Location for hosting rootfs/boot tarballs extracted from images
LAVA_IMAGE_TMPDIR = /var/www/images/tmp

#URL where LAVA_IMAGE_TMPDIR can be accessed remotely
#PWL - might not be needed
#LAVA_IMAGE_URL_DIR = /images/tmp
LAVA_IMAGE_URL = http://%(LAVA_SERVER_IP)s/images/tmp

#Default test result storage path
LAVA_RESULT_DIR = /lava/results

#Location for caching downloaded artifacts such as hwpacks and images
LAVA_CACHEDIR = /linaro/images/cache

# The URL pointing to the version of lava-test to be installed with pip
LAVA_TEST_URL = bzr+http://bazaar.launchpad.net/~linaro-validation/lava-test/trunk/#egg=lava-test

The big things to change here will be the LAVA_SERVER_IP, which should be set to the address where you are running the dispatcher, and the directories.  LAVA_TEST_URL, by default, will point at the lava-test in the trunk of our bzr branch.  This means you'll always get the latest, bleeding edge version.  If you don't like that, you can point it at a stable tarball, or even your own branch with custom modifications.

Next up is device-defaults.conf.  Look at the example under the lava_dispatcher/default-config branch, because it's a bit longer.  Fortunately, most of this can probably go unchanged. You'll want to specify things like the default network interface, command prompts, and client types here.  For most people using Linaro images, this can just remain as-is.
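To give a rough feel for the kinds of settings that live there, here is an abbreviated, illustrative fragment.  The key names and values below are assumptions based on the description above, so consult the real example in the branch rather than trusting these.

# Illustrative fragment only -- key names are assumptions; see the real
# example under lava_dispatcher/default-config.
client_type = conmux                     # how the dispatcher connects to boards
default_network_interface = eth0         # interface used by the test image
TESTER_STR = root@linaro:                # shell prompt to expect on the test image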

The part you will almost certainly want to customize is in the devices and device-types directories.  First, a device-type:
device-types/panda.conf


boot_cmds = mmc init,
    mmc part 0,
    setenv bootcmd "'fatload mmc 0:3 0x80200000 uImage; fatload mmc
    0:3 0x81600000 uInitrd; bootm 0x80200000 0x81600000'",
    setenv bootargs "' console=tty0 console=ttyO2,115200n8
    root=LABEL=testrootfs rootwait ro earlyprintk fixrtc nocompcache
    vram=48M omapfb.vram=0:24M mem=456M@0x80000000 mem=512M@0xA0000000'",
    boot
type = panda

boot_cmds_android = mmc init,
    mmc part 0,
    setenv bootcmd "'fatload mmc 0:3 0x80200000 uImage;
    fatload mmc 0:3 0x81600000 uInitrd;
    bootm 0x80200000 0x81600000'",
    setenv bootargs "'console=tty0 console=ttyO2,115200n8
    rootwait rw earlyprintk fixrtc nocompcache vram=48M
    omapfb.vram=0:24M,1:24M mem=456M@0x80000000 mem=512M@0xA0000000
    init=/init androidboot.console=ttyO2'",
    boot
If you are using a pandaboard with Linaro images, you can probably just use this as it is.

Now to specify a device we want to test on:
devices/panda01.conf

device_type = panda
hostname = panda01
And that's it. You'll want one of those for each board you have, and a device-type config file for each type of device you have. Many thanks to David Schwarz and Michael Hudson-Doyle for pulling this important change together and getting it merged. Oh, and what else for this release? LOTS! But more than I want to include in a single post. I'll try to hit some of the highlights in other postings around the release though. Enjoy :)