Automated testing is common for most development teams these days. Usually it starts small - make it easy to run a few tests. Then it grows into storing the results, then deploying the software you want to test. Automation can quickly explode into huge systems with complex infrastructure requirements that have to be supported underneath it all. Like any other software, it needs to be tested. Like any other infrastructure, it needs to be monitored.
One thing we're doing in LAVA now to help with this is something called "health checks". Basically we run a known-good job through the system that runs through various deployment scenarios, and logs the results. If any part of the process fails (including uploading the results), the board is taken offline just in case something is wrong with it. What could go wrong? There are a lot of possibilities here, but basically they fall into three categories.
Infrastructure problems - This would include a bad SD card (yep, they go bad sometimes), a bad recovery image on the board, faulty cabling, a loose power connector, a bad switch or terminal server port, even external network connectivity problems. Another one that you might not normally think of is something wrong with the server itself. For instance, if the server load goes too high, it could cause issues with the running job. This highlights the need for a more distributed system, which unfortunately also leads to greater complexity and more moving parts.
Device problems - Problems with the thing you're trying to test itself. We test on ARM development boards, and problems of this kind should be pretty rare. The possibility exists that we might just have a bad board in some cases though. Memory could fail, there could be bad solder joints, components on the board itself could fail. We've seen one case recently where the network port on just one board will sometimes disappear. Other boards of the same type are just fine. Yes, hardware can fail too. Identifying failed hardware and shutting it down before we trust it for any other tests is an important step in ensuring that it doesn't affect the results of future testing.
Bugs - Automation is hard, assumptions have to be made, and sometimes those assumptions are broken just as with any other software. Because of the complexity of the underlying infrastructure, it's often hard to really see what's going to happen until new code is rolled out into production. We can use the same health check system to force a health check on all boards after a new rollout of the software, and offline any (or all) of the boards if the new software causes problems.
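The "any failed step takes the board offline" rule can be sketched in a few lines of Python. This is only an illustration and not LAVA's actual code: the stage names, the `Board` class, and `apply_health_result` are all hypothetical, chosen to mirror the description above (including the point that a failed results upload counts as a failure).

```python
from dataclasses import dataclass, field

# Hypothetical stage names for a health check job (assumption, not from LAVA).
STAGES = ("deploy", "boot", "run_tests", "upload_results")


@dataclass
class Board:
    """Hypothetical stand-in for a device record."""
    name: str
    online: bool = True
    log: list = field(default_factory=list)


def apply_health_result(board, stage_results):
    """Take the board offline unless every stage reported success.

    stage_results maps stage name -> bool. A stage that is missing
    from the mapping is treated as a failure, erring on the side of caution.
    Returns the list of failed stage names.
    """
    failed = [stage for stage in STAGES if not stage_results.get(stage, False)]
    if failed:
        board.online = False
        board.log.append("offline: failed " + ", ".join(failed))
    return failed
```

Note that the function does not try to decide which of the three categories caused the failure. It only stops the board from taking more jobs and records which stage failed, leaving the diagnosis to whoever looks at the log.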
Running health checks is not enough on its own, though. Simply running these tests creates a flood of data, but the data has to be aggregated, and action has to be taken when things go wrong. Spring and Michael from the LAVA team have done some great work to add the parts needed for all of this. We have a new view in the scheduler now that shows the device health for all devices. We store a health check job for each device type in the database, and whenever a board hasn't been checked in the past 24 hours, or whenever it comes online after being offline, its health is checked. In the event of a new code rollout, we can now mark every board's health status as unknown, forcing each board to check itself before running any other real test job.
So what we're left with now is a clear view when problems are spotted that helps us see what to do next. It's easier to spot problems quickly, and we err on the side of caution by offlining the board every time a health job fails to produce the expected result.
Now the hard work begins - focusing on the most frequently found problems first, using the data we gather here to find infrastructure issues that happen in the middle of the night and that nobody would see otherwise, and fixing these problems. In the end, what we hope to be left with is a system that is self-checking and self-monitoring, and that keeps jobs off any device we have reason to suspect, so that tests fail only when the thing we're trying to test has a problem.
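The scheduling rule described above (check when the last check is more than 24 hours old, when a board comes back online, or when its health is unknown) might look roughly like the following Python sketch. The `Device` class, the `Health` states, and the function names are hypothetical and are not taken from the LAVA scheduler; only the 24-hour interval and the three triggers come from the description.

```python
from datetime import datetime, timedelta
from enum import Enum


class Health(Enum):
    # Hypothetical health states; "unknown" is the one named in the text.
    UNKNOWN = "unknown"
    PASS = "pass"
    FAIL = "fail"


CHECK_INTERVAL = timedelta(hours=24)  # "in the past 24 hours"


class Device:
    """Hypothetical device record kept by the scheduler."""

    def __init__(self, name):
        self.name = name
        self.health = Health.UNKNOWN
        self.last_checked = None  # datetime of the last health check, if any
        self.online = True


def needs_health_check(device, now, just_came_online=False):
    """Return True if a health job should run before any real test job."""
    if not device.online:
        return False  # offline boards are not scheduled at all
    if just_came_online or device.health is Health.UNKNOWN:
        return True
    if device.last_checked is None:
        return True
    return now - device.last_checked >= CHECK_INTERVAL


def record_result(device, passed, now):
    """Store the outcome; a failed health job takes the board offline."""
    device.last_checked = now
    device.health = Health.PASS if passed else Health.FAIL
    if not passed:
        device.online = False


def mark_all_unknown(devices):
    """After a code rollout, force every board to re-check itself."""
    for device in devices:
        device.health = Health.UNKNOWN
```

In this sketch a scheduler would call `needs_health_check` each time it is about to hand a job to a board, and `mark_all_unknown` once after deploying new code, so the next job on every board is a health job.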