Much is still unknown about the crash of Lion Air Flight 610 in Indonesia, but one potential explanation is that a malfunctioning physical sensor caused a software algorithm to incorrectly intervene in a way that the human pilots were neither aware of nor able to correct in time. Regardless of what investigators ultimately determine as the cause of the crash, the idea of a critical software system being unable to cope with a single malfunctioning sensor reminds us that for all the incredible capability of modern software systems and the promise of technologies like deep learning, their success ultimately hinges on their ability to accurately observe the outside world through software and hardware sensors. When those sensors go wrong, algorithms may not be able to tell the difference between a hardware error and a dangerous external world state that requires immediate intervention. How do we encourage software developers to build more resilient systems?

All computer algorithms that make decisions are ultimately input/output systems, receiving some form of input and generating some form of output. From the simplest hand-coded application of a few lines of code to the most advanced neural network running on a fleet of accelerators, algorithms “see” the world around them through sensors. Sensors may be software-based, as in network traffic analyzers, or hardware-based, as in industrial control systems. What happens when those sensors go wrong?

A core part of the software development cycle involves testing applications against every conceivable kind of faulty input. However, given the nearly infinite number of possible inputs and the enormous complexity of modern software, it is impossible to exhaustively test every possible configuration a given software system might find itself in.
Most importantly, traditional software testing focuses on errors coming through expected channels, such as out-of-bound inputs or complex action sequences that can lead to race conditions or other error states. Testing for most applications does not focus heavily on how a system copes with legitimate inputs that do not accurately reflect the outside world, or on what to do when the underlying computing hardware malfunctions.

Considerable research has gone into creating resilient computing architectures that are capable of detecting, and in some cases mitigating, errors in the underlying execution hardware. From intermittent memory errors to failing processor components, computers are not infallible, and it is very possible for a software program to find itself running on a machine in which memory values spontaneously zero out or a processor starts yielding incorrect numeric results. Modern computing architectures, however, are typically designed with verification checks that can detect these cases and either disable the system or work around the errors.
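The same verification idea can be sketched at the software level. A minimal Python illustration, using a CRC32 checksum as a stand-in for the hardware ECC checks the paragraph above describes (the class and the corruption scenario are invented for illustration):

```python
import zlib

class CheckedValue:
    """Store a payload alongside a CRC32 checksum so that silent
    corruption of the stored bytes can be detected on read.
    Illustrative only; real systems rely on hardware ECC."""

    def __init__(self, payload: bytes):
        self.payload = payload
        self.checksum = zlib.crc32(payload)

    def read(self) -> bytes:
        # Recompute the checksum on every read and compare.
        if zlib.crc32(self.payload) != self.checksum:
            raise RuntimeError("memory corruption detected")
        return self.payload

v = CheckedValue(b"sensor calibration table")
assert v.read() == b"sensor calibration table"

# Simulate a spontaneous bit flip in the stored bytes:
v.payload = b"sensor calibration tablf"
try:
    v.read()
except RuntimeError as e:
    print(e)  # the corruption is caught rather than silently used
```

The point is not the checksum itself but the design stance: the code assumes its own storage can fail and checks for that failure on every access.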

When it comes to the sensory inputs that applications use to interface with the external world, however, far less investment is typically spent on ensuring that software is able to adequately detect when things are going awry.

For example, typical self-driving software assumes that the inputs from its cameras represent specific points in space relative to the vehicle. If an installation error, a short circuit in the wiring, or a remote hack causes camera inputs to be rearranged, with the right and left or front and back cameras swapped, or cameras substantially moved from their expected locations, not all systems will be able to detect that something is wrong and shut down. Similarly, if false imagery is fed into the system through a compromised camera feed, not all systems will recognize that the movement and physics represented by the imagery do not match what would be expected from the control inputs being provided by the vehicle, and disengage accordingly.

Alternatively, imagine that overheating or a system malfunction on one side of a vehicle causes the cameras on that side to reduce their frame rates or yield strange artifacts. The self-driving algorithms must now contend with mismatched visual inputs: a steering command results in the correct visual adjustments on the right, while the left side of the vehicle still appears not to be moving by the appropriate amount.

A human operating a remote-controlled vehicle with such an issue would likely notice immediately that something was amiss and begin testing for malfunctions, such as toggling the lights and noticing that one set of camera feeds noticeably lags the others. Current-generation autonomous vehicle systems do not possess the generalized intelligence required to devise and conduct such diagnostic tests on their own. Instead, they often rely on the large number of different kinds of sensors on a vehicle, identifying discrepancies between what the radar and visual sensors report, for example. However, a fully compromised vehicle in which a remote attacker has taken control over all inputs may not be detectable by current AI systems.
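The lagging-feed check a human would perform by eye can be approximated in software. A minimal sketch, assuming each camera reports the timestamp of its most recent frame (the feed names and the 0.2-second threshold are illustrative assumptions):

```python
from statistics import median

def find_lagging_feeds(last_frame_times: dict, now: float,
                       max_lag: float = 0.2) -> list:
    """Flag camera feeds whose latest frame is substantially older
    than the consensus of the other feeds."""
    ages = {cam: now - t for cam, t in last_frame_times.items()}
    typical = median(ages.values())
    return [cam for cam, age in ages.items() if age - typical > max_lag]

feeds = {"front_left": 10.00, "front_right": 10.01,
         "rear_left": 9.40, "rear_right": 10.00}  # rear_left is stale
print(find_lagging_feeds(feeds, now=10.05))  # ['rear_left']
```

Note that the check only works because there are multiple feeds to compare against one another; with a single camera, or with all feeds compromised in unison, the discrepancy disappears.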

Such issues are separate from, though related to, the highly publicized world of adversarial efforts to alter imagery with customized noise and artifacts that systematically lead algorithms astray. Adding a few markings to a stop sign may cause a computer vision algorithm to no longer recognize it as a stop sign. Sensory error, however, is different in that it represents a class of very subtle errors that can cause algorithms to malfunction even in an unaltered scene.

Autonomous vehicles are unique in that they typically have a large number of sensors that have very different sensing characteristics. Typical software applications, on the other hand, have far fewer input sources from which to triangulate the state of the world around them. A smart thermostat, for example, might have only a single onboard sensor. If that sensor begins to systematically overreport the temperature of the house, it could easily turn off the heat in the middle of the winter. Similarly, if a failed HVAC fan prevents heat from reaching the side of the house where the device is, it might compensate by running the heat at maximum, possibly causing a fire on the other side of the house. Some systems support placing sensors across the house and can detect when one sensor is reporting dramatically different results from the others, but not all systems would be able to adequately cope with a failed HVAC fan that causes half a house to respond very differently than the other half.
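The multi-sensor cross-check described above can be sketched in a few lines. A minimal illustration of flagging a sensor that disagrees with the household consensus, where the room names and the five-degree tolerance are invented for the example:

```python
from statistics import median

def suspect_sensors(readings: dict, tolerance: float = 5.0) -> list:
    """Return sensors whose reading deviates from the median of all
    sensors by more than `tolerance` degrees. A single-sensor system
    cannot perform this check at all: redundancy is what makes
    fault detection possible."""
    mid = median(readings.values())
    return [name for name, temp in readings.items()
            if abs(temp - mid) > tolerance]

rooms = {"living_room": 20.5, "bedroom": 19.8,
         "kitchen": 21.0, "hallway": 38.2}  # hallway sensor overreports
print(suspect_sensors(rooms))  # ['hallway']
```

As the paragraph notes, even this check has limits: a failed HVAC fan that legitimately makes half the house colder produces a real disagreement between sensors, which naive outlier detection would misattribute to sensor failure.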

Underlying all of this is the simple fact that modern software systems do not typically emphasize fault tolerance in the face of inputs that are within the range of expected values but which do not accurately reflect the state of the external world they are supposed to assess. Adding additional independent measurements, triangulating across all inputs, and building internal models of the difference between the expected and observed state, along with the probability that the difference is the result of sensor failure, would all contribute toward more robust software systems. Most importantly, we must encourage resilient, fault-tolerant software design across all domains, building systems that anticipate and expect systematic error and failure in their inputs and are designed to robustly address or correct it. In short, instead of building software for a perfect world, we must start building software for the very imperfect real world in which it must operate.
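The expected-versus-observed comparison sketched above can take a very simple form in code. A hedged illustration: compare each observation against a model's prediction and suspect the sensor only after the residual stays large for several consecutive steps (the threshold, patience value, and speed-sensor scenario are all assumptions made for the example):

```python
class ResidualMonitor:
    """Flag a likely sensor fault when the gap between predicted
    and observed values stays large for several consecutive steps.
    Brief disagreements are tolerated; persistent ones are not."""

    def __init__(self, threshold: float = 2.0, patience: int = 3):
        self.threshold = threshold
        self.patience = patience
        self.streak = 0

    def update(self, predicted: float, observed: float) -> bool:
        if abs(predicted - observed) > self.threshold:
            self.streak += 1
        else:
            self.streak = 0
        return self.streak >= self.patience  # True => suspect the sensor

monitor = ResidualMonitor()
# Model predicts ~30 km/h; the speed sensor then sticks at zero.
for predicted, observed in [(30.0, 29.5), (30.0, 0.0),
                            (31.0, 0.0), (31.0, 0.0)]:
    fault = monitor.update(predicted, observed)
print(fault)  # True: three consecutive large residuals
```

Requiring persistence before declaring a fault is a deliberate trade-off: it reduces false alarms from transient noise at the cost of reacting a few steps later to a genuine failure.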
