Wednesday, May 16, 2012
Launchpad.net passed one of those interesting milestones with Bug #1000000 today. OK, so it's interesting for those of us who like base 10; you power-of-2 types will have to wait around until Bug #1073741824.
Putting aside the interesting number, let's use it as a reminder of the amazing participation by the community in not only the Ubuntu project, but all the other projects (including many of my favorites at Linaro) that are hosted on Launchpad. Thanks to all the tireless testing and bug tracking/fixing by this entire community, all of these projects are better off. It should also serve as a reminder that we should persevere and push even harder to find all those issues still waiting to be discovered so we can get them fixed, and further increase the quality of Ubuntu and other projects. Finally, thanks to the Launchpad team for their efforts in producing and maintaining a service that is useful to so many of us.
Wednesday, March 14, 2012
Who tests the test system?
Automated testing is common for most development teams these days. Usually it starts small - make it easy to run a few tests. Then it grows into storing the results, then deploying the software you want to test. Automation can quickly explode into huge systems with complex infrastructure requirements that have to be supported underneath it all. Like any other software, it needs to be tested. Like any other infrastructure, it needs to be monitored.
One thing we're doing in LAVA now to help with this is something called "health checks". Basically, we run a known-good job through the system that exercises various deployment scenarios and logs the results. If any part of the process fails (including uploading the results), the board is taken offline just in case something is wrong with it; a rough sketch of this offline-on-failure policy follows the categories below. What could go wrong? There are a lot of possibilities, but basically they fall into three categories.
Infrastructure problems - This would include a bad SD card (yep, they go bad sometimes), a bad recovery image on the board, faulty cabling, a loose power connector, a bad switch or terminal server port, even external network connectivity problems. Another one that you might not normally think of here is something wrong with the server itself. For instance, if the server load climbs too high, it could cause issues with the running job. This highlights the need for a more distributed system, which unfortunately also brings greater complexity and more moving parts.
Device problems - problems with the device you're actually trying to test. We test on ARM development boards, and these failures should be pretty rare, but the possibility exists that we might simply have a bad board. Memory could fail, there could be bad solder joints, or components on the board could die. We've seen one case recently where the network port on just one board sometimes disappears, while other boards of the same type are just fine. Yes, hardware can fail too. Identifying failed hardware and shutting it down before we continue to trust it with other tests is an important step in ensuring it doesn't affect the results of future testing.
Bugs - Automation is hard, assumptions have to be made, and sometimes those assumptions are broken just as with any other software. Because of the complexity of the underlying infrastructure, it's often hard to really see what's going to happen until new code is rolled out into production. We can use the same health check system to force a health check on all boards after a new rollout of the software, and offline any (or all) of the boards if the new software causes problems.
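To make that offline-on-failure policy concrete, here's a minimal sketch in Python of what a health-check run could look like. This is not the real LAVA dispatcher code; the board and job objects and their methods (deploy, boot, run_tests, upload_results, take_offline, mark_healthy) are hypothetical names used purely for illustration.

    # Hypothetical sketch only - not the actual LAVA API.
    # Any failure along the way (deploy, boot, tests, or result upload)
    # takes the board offline so real jobs never land on suspect hardware.
    def run_health_check(board, known_good_job):
        """Run a known-good job on a board; offline the board if anything fails."""
        try:
            board.deploy(known_good_job.image)        # exercise the deployment path
            board.boot()                              # the board has to come up at all
            results = board.run_tests(known_good_job.tests)
            board.upload_results(results)             # uploading results is part of the check
        except Exception as err:                      # treat any failure as suspect
            board.take_offline(reason="health check failed: %s" % err)
            return False
        board.mark_healthy()
        return True

The important design choice is that the result upload sits inside the try block: a board that runs tests but can't report them is just as unusable as one that fails to boot.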
Running health checks isn't enough on its own, though. Simply running these tests creates a flood of data; the data has to be aggregated, and action has to be taken when things go wrong. Spring and Michael from the LAVA team have done some great work to add the pieces needed for all of this. We now have a view in the scheduler that shows the health of every device. We store a health check job for each device type in the database, and a board's health is checked whenever it hasn't been checked in the past 24 hours or whenever it comes online after being offline. In the event of a new code rollout, we can now mark every board's health status as unknown, forcing each board to check itself before running any other real test job. What we're left with is a clear view of problems as they're spotted, which helps us see what to do next. It's easier to catch problems quickly, and we err on the side of caution by offlining a board every time a health job fails to produce the expected result.
Now the hard work begins: focusing on the most frequently seen problems first, using the data we gather here to find infrastructure issues that happen in the middle of the night when nobody would otherwise notice, and fixing them. In the end, what we hope to be left with is a system that is self-checking and self-monitoring, and that keeps test jobs off any device we have reason to suspect might cause failures that aren't caused by the thing we're actually trying to test.
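For the scheduling side, here's a similarly hedged sketch of the rules described above: a health check becomes due when a board hasn't been checked in 24 hours, when it comes back online after being offline, or when its health status has been reset to unknown after a code rollout. Again, the field names (health_status, just_came_online, last_health_check) are made up for illustration and don't reflect the real LAVA scheduler models.

    # Hypothetical sketch only - illustrates the triggers, not the real scheduler.
    from datetime import datetime, timedelta

    HEALTH_CHECK_INTERVAL = timedelta(hours=24)

    def health_check_due(board, now=None):
        """Decide whether a board must run a health job before any real test job."""
        now = now or datetime.utcnow()
        if board.health_status == "unknown":      # e.g. after a new code rollout
            return True
        if board.just_came_online:                # back online after being offline
            return True
        if board.last_health_check is None:       # never been checked at all
            return True
        return now - board.last_health_check > HEALTH_CHECK_INTERVAL

    def mark_all_unknown(boards):
        """After a rollout, force every board to re-check itself before real jobs."""
        for board in boards:
            board.health_status = "unknown"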
Friday, February 03, 2012
LAVA at the Linaro Connect
Interested in validation and going to Linaro Connect? Here are some of the sessions already scheduled on the Validation track that you might be interested in:
- Automated bootloader testing in LAVA
  https://blueprints.launchpad.net/lava-dispatcher/+spec/linaro-validation-q112-bootloader-testing
- Discuss what needs to be done still on celery, and usage
  https://blueprints.launchpad.net/lava-lab/+spec/linaro-validation-q112-lava-celery
- LAVA Server Admin tools
  https://blueprints.launchpad.net/lava-server/+spec/linaro-validation-q112-admin-tool
- Discuss test plans for big.LITTLE
  https://blueprints.launchpad.net/lava-test/+spec/linaro-validation-q112-big-little-testplan
- Monitoring and managing device health in LAVA
  https://blueprints.launchpad.net/lava-lab/+spec/linaro-validation-q112-device-health-monitoring
Additionally, here are some sessions in other tracks that you may like:
- Ubuntu LEB and LAVA: Current status and future planning for proper image testing and validation
  https://blueprints.launchpad.net/linaro-ubuntu/+spec/linaro-platforms-q112-lava-and-ubuntu-leb-testing-validation
- Linaro's Kernel CI Process discussion and Feedback
  https://blueprints.launchpad.net/linaro/+spec/linaro-kernel-ci-q112-discussion
- End to End Demo of Enabling Something for LAVA
  https://blueprints.launchpad.net/linaro/+spec/linaro-training-q112-end-to-end-lava-demo
Make sure to check http://summit.linaro.org/lcq1-12/track/linaro-validation/ throughout the week for changes and additions.
If you are around in the afternoons and want to help out on LAVA development, or want help getting started in it, swing by the validation hacking sessions and talk to us.