
ASIL Decomposition and Functional Safety
In my previous blog, titled A Deeper Dive on Functional Safety (FuSa), I took a closer look at the two key elements of functional safety: systematic fault coverage and random fault coverage. In short, systematic fault coverage ensures that the device in the system is designed, verified, and tested at a level of rigor and robustness that is consistent with the target Automotive Safety Integrity Level (ASIL), to ensure there are no faults that are systematic in nature. (For those not familiar with the terminology being used at this point, it is advised to review the earlier blogs on this topic.)
For those of you old enough to remember the Intel “Pentium Processor Divide Error,” this is a good example of an error that was systematic in nature - i.e. the same set of numbers when divided by one another consistently produced an erroneous result across every Pentium Processor. A key foundation to achieving a given ASIL is that there are no systematic faults that have been designed into the device. If that is not the case, one can imagine a situation where an instruction is issued to a processor to have a car turn right, but instead, it turns left and will do so erroneously every time that instruction is issued. If systematic errors are not wrung out during the design, verification, and testing of the device, it is almost certain that the system will fail or, at a minimum, provide an undesired response.
Random hardware failures, on the other hand, occur unpredictably over the lifetime of a product, but their occurrence rates are probabilistic in nature. These failures are the basis for the term 'probabilistic metric for random hardware failures' (PMHF), and occur for various reasons, such as neutron flux, power supply droop, transients, etc., which are independent of design and quality rigor. Random hardware faults are further characterized by two architectural metrics: the single-point fault metric (SPFM) and the latent fault metric (LFM). As ASILs increase, the acceptable levels of "escaped faults" become more stringent for the system.
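To make the two architectural metrics concrete, here is a minimal sketch of how SPFM and LFM are computed from failure rates (FIT = failures per 10^9 device-hours). The formulas follow the general shape of the ISO 26262 hardware metrics, but the FIT numbers below are invented purely for illustration:

```python
def spfm(total_fit, single_point_fit):
    """Single-point fault metric: the fraction of the total failure
    rate that is NOT a single-point or residual fault."""
    return 1.0 - single_point_fit / total_fit

def lfm(total_fit, single_point_fit, latent_fit):
    """Latent fault metric: the fraction of the remaining failure
    rate that is NOT a latent multiple-point fault."""
    return 1.0 - latent_fit / (total_fit - single_point_fit)

# Hypothetical FIT budget for a safety-related block:
total, single_point, latent = 1000.0, 50.0, 100.0

print(f"SPFM = {spfm(total, single_point):.1%}")         # 95.0%
print(f"LFM  = {lfm(total, single_point, latent):.1%}")  # 89.5%
```

Against the ISO 26262 targets (SPFM >= 90% and LFM >= 60% for ASIL B; >= 99% and >= 90% for ASIL D), this hypothetical block would meet ASIL B but fall short of ASIL D, which is exactly the stringency gap the standard imposes as ASILs increase.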
To remind the reader, functional safety is not the absence of faults - it is the ability to detect or flag a fault when it occurs, so this information can be passed on to the system, and the appropriate corrective action can be taken. This action can range from advising the driver to take the car into the shop for maintenance in a few months, to putting the car into a limp mode and pulling it over to the side of the road. These decisions happen at the system level, but the fault needs to be detected and flagged at the chip level and communicated in order for the system to be able to make those decisions.
While semiconductors employed in critical automotive safety applications support ASIL D systematic fault coverage, the majority of the complex semiconductors employed in a vehicle support only ASIL B random fault coverage, which seems somewhat counterintuitive. That said, the entire system, which comprises multiple devices in the safety path, typically needs to achieve ASIL D at the system level. So, the question is: how is ASIL D random fault coverage achieved at the system level when each device itself supports only ASIL B random fault coverage? The answer is through a technique called decomposition. Before going into some of the details of decomposition, here is another analogy that can be very helpful.
If you have ever glanced into the cockpit of a commercial airliner, it is riddled with an incredible number of gauges, dials, knobs, and switches. One begins to wonder exactly how a pilot would be able to figure out which gauges to read and which knobs and switches to turn. Upon closer inspection—should you be allowed inside the cockpit—you would find that there are actually three gauges measuring the exact same thing, as well as three switches and knobs that control the same aspect of the plane. This is done deliberately, and is referred to as triple modular redundancy (TMR).
The motivation for this approach is that, under normal operating conditions, all three gauges will read the same. In the case of a random fault, however, the probability is that the fault will affect only one of the gauges, not all three. So, if two of the three gauges are reading the same value and one gauge isn't, it almost certainly means that the gauge showing the different result is wrong. By employing TMR, the probability of a random fault going undetected can be driven even lower than the targets defined for ASIL D random fault coverage.
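The two-out-of-three voting described above can be sketched in a few lines. This is a simplified illustration of the TMR principle, not an actual avionics or ASIL-qualified implementation:

```python
def tmr_vote(a, b, c):
    """Majority vote across three redundant channel readings.

    Returns the value agreed on by at least two channels, plus a flag
    indicating whether a disagreement (a suspected fault) was seen.
    """
    if a == b or a == c:
        return a, (a != b or a != c)  # fault flagged if any channel differs
    if b == c:
        return b, True                # channel a disagrees: fault flagged
    return None, True                 # no majority: unrecoverable disagreement

# A random fault in one gauge is outvoted by the other two:
value, fault_detected = tmr_vote(120, 120, 87)  # -> (120, True)
```

The key property is that a single faulty channel is both detected (the flag goes up) and masked (the majority value still passes through), which is exactly why all three gauges in the cockpit matter.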
The price tag associated with solving the problem through TMR also leads to a significant increase in the cost of each protected system. This may not be a problem for a multi-million-dollar aircraft, but it quickly becomes impractical in an eighty-thousand-dollar passenger car. Now, taking a step back for a moment and returning to the original point regarding systematic fault coverage, which requires real rigor and forms the foundation of FuSa, imagine a platform with three Pentium Processors, all carrying the divide error, connected in TMR. When they check each other's results, even though the answers to the division problem are incorrect, they all arrive at the exact same conclusion, which is obviously a problem. This is why I made the point that robust (i.e., ASIL D) systematic fault coverage is a foundational requirement for FuSa.
So, for reasons previously explained, most semiconductors don't simply employ TMR on-chip to achieve ASIL D random fault coverage, because the costs would be exorbitant and prohibitive. Achieving even ASIL B random fault coverage requires subsystems with "lock-step" processors on board, where, rather than three processors, two processors check each other's results and use that information for the safety-critical regions of the device. Typically, these are the processors that either drive actuators (motors, brakes, etc.) or pass messages on the CAN bus - a bus that ultimately may affect the actions of an actuator. I'll stop here, because this topic too can get fairly deep fairly quickly.
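The lock-step idea can be modeled in a few lines: the same computation runs on two cores, and a comparator blocks the result from ever reaching an actuator if the cores disagree. This is a toy sketch of the principle only; the names are illustrative and do not reflect any vendor's actual lock-step hardware:

```python
def lockstep(core_a, core_b, inputs):
    """Run the same computation on two 'cores' and compare the results."""
    a = core_a(inputs)  # primary core
    b = core_b(inputs)  # shadow core, nominally identical
    if a != b:
        # In real hardware this asserts a fault signal, so the
        # actuator command is never issued.
        raise RuntimeError("lockstep mismatch: fault detected")
    return a

# Both cores agree, so the command passes through to the actuator path:
brake_command = lockstep(lambda x: x * 2, lambda x: x * 2, 21)  # -> 42
```

Note that, unlike TMR, a two-core lock-step pair can detect a divergence but cannot vote on which result is correct - which is one reason it delivers ASIL B rather than ASIL D random fault coverage on its own.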
Through a technique called decomposition, multiple devices in the system provide alternative paths to ensure checks and balances are maintained at the system level, similar to the TMR aircraft example. Decomposition is not a topic I can do justice to in a blog, as it becomes complex even faster than the way ASIL B random fault coverage is achieved at the device level. However, the curious reader can certainly use ChatGPT to dig in deeper if interested. Suffice it to say, decomposition, in effect, mirrors TMR in that the results from one device aren't simply taken at face value.
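A back-of-the-envelope calculation shows why checking one device against another pays off so quickly. If two sufficiently independent devices each miss a dangerous random fault with some probability, a cross-check between them is defeated only when both miss it at the same time. The probabilities below are invented for illustration and are not actual ISO 26262 PMHF targets:

```python
# Hypothetical probability that a single ASIL B device fails to
# detect a dangerous random fault during some exposure interval:
p_miss_single = 1e-2

# With two independent devices cross-checking each other, the fault
# escapes only if BOTH miss it simultaneously:
p_miss_both = p_miss_single ** 2  # ~1e-4

# Two weaker channels beat either one alone - the essence of
# decomposing a system-level ASIL D goal into redundant ASIL B paths.
assert p_miss_both < p_miss_single
```

The caveat, and much of the real complexity of decomposition, lies in demonstrating that the two paths are genuinely independent - a common-cause fault (like the Pentium divide error) would defeat both channels at once, just as it defeats TMR.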
Perhaps the key takeaway from this three-part series on FuSa is that designing devices and systems for automotive applications requires significant thought and foresight. One of the key forcing functions to ensure that best practices are employed and ISO 26262-compliant guidelines are followed with rigor is that, in the case of a life-threatening accident, if it becomes clear that these guidelines were not followed, the automotive OEM and Tier 1 (as appropriate) will assume the liabilities.
When a well-known OEM's "stuck accelerator" problem was shown to stem from poor software coding practices, a settlement of $1.2 billion was reached. And oh, yeah, we didn't touch on the topic of FuSa as it applies to software. Maybe another blog?