A Deeper Dive on Functional Safety (FuSa)
Last October, I wrote a blog on Functional Safety, which provided a high-level overview of this critical, complex topic. It proved to be one of my most popular blogs to date. Given the strong interest in FuSa (as defined by ISO 26262), it seemed appropriate to do a deeper dive, with a specific focus on random fault coverage, which the previous blog covered only briefly.
To recap, the key points regarding FuSa are as follows:
- The key objective of any system that affects the safety of the vehicle is to never experience a failure, and this is addressed through design rigor and high-quality devices and processes. However, FuSa anticipates that even with the best designs and quality, failures are still unavoidable. FuSa therefore focuses on detecting failures when they occur and flagging them, so the vehicle can respond accordingly. The point is not that failures never occur; it is that they are detected when they do.
- FuSa also anticipates that the additional hardware added to detect failures (called safety mechanisms) is itself prone to failure.
- The Automotive Safety Integrity Level (ASIL) specifies the acceptable frequency at which failures escape detection by the safety mechanisms. The ASIL is expressed as a letter from A through D, with increasing levels of stringency, where ASIL D is the most stringent. The more critical the impact that a failure has on the control of the vehicle, the higher the required ASIL.
- There are two primary components to FuSa:
  - Systematic fault coverage – Ensures that the processes used to design, document, test, and verify the device have a certain rigor. The level of that rigor is specified by an associated ASIL, A through D.
  - Random fault coverage – Focuses on random hardware failures that can occur unpredictably during the lifetime of a component.
- Typically, devices that have a meaningful impact on the safety of the system are required to be certified to ASIL D/B, which means that the device supports ASIL D systematic fault coverage and ASIL B random fault coverage. Achieving ASIL D random fault coverage at the “system level” (multiple devices working together in one system) typically relies on a concept called decomposition. This is a topic that will be covered in a future blog.
Now that the reader is well versed in FuSa and ready to be a Safety Manager (a real term for the individual responsible for overseeing the safety efforts of the company, a role required for compliance with the specification), we are going to take a more in-depth look at random fault coverage.
Random hardware failures occur unpredictably over the lifetime of a product, but they follow a probability distribution. They are the basis for the term probabilistic metric for random hardware failures (PMHF), and they occur for various reasons that are independent of design and quality rigor. Typically, random failures occur at different rates during three distinct periods over the lifetime of the product:
- Burn-in – Typically referred to as infant mortality, this is the period shortly after manufacturing when devices can exhibit large numbers of failures until they have been fully burned in.
- Useful life – This is the period after burn-in, when the device is actively employed in the end application. Random failure rates are typically low during this period, and it is the period for which random failures are generally calculated and assessed.
- Wearout – This is the period when the device has served its useful life and is now operating beyond the period that the device has been designed to operate – typically referred to as the “mission profile” of the device. Similar to the burn-in period, during the wearout period, the device typically exhibits a significantly higher random failure rate.
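The three periods above trace what is often called the bathtub curve. As a rough sketch only, one common textbook way to picture it is to add a falling Weibull hazard rate (infant mortality), a constant rate (useful life), and a rising Weibull hazard rate (wearout). The function names and every parameter below are invented for illustration and do not describe any real device.

```python
def weibull_hazard(t_hours: float, shape: float, scale_hours: float) -> float:
    """Weibull hazard rate h(t) = (k / s) * (t / s) ** (k - 1), in failures per hour.

    Requires t_hours > 0 when shape < 1.
    """
    return (shape / scale_hours) * (t_hours / scale_hours) ** (shape - 1)


def bathtub_hazard(t_hours: float) -> float:
    """Illustrative total hazard rate: infant mortality + useful life + wearout.

    All parameters are made up for this sketch.
    """
    infant = weibull_hazard(t_hours, shape=0.5, scale_hours=1e12)  # falls over time
    useful = 10e-9                                                 # constant 10 FIT
    wearout = weibull_hazard(t_hours, shape=5.0, scale_hours=1e6)  # rises over time
    return infant + useful + wearout


# Print the hazard rate in FIT at a few points across the product lifetime.
for t in (10, 1_000, 50_000, 400_000):
    print(f"t = {t:>7} h  ->  {bathtub_hazard(t) * 1e9:8.1f} FIT")
```

With these made-up parameters, the rate is high in the first hours, settles close to the constant useful-life rate in the middle, and climbs again late in life.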
As part of the safety analysis of a device, the potential failure modes, including those due to neutron strikes, are thoroughly evaluated. Random failures are measured in failures in time (FIT). One FIT is equal to one failure in 1 billion device operating hours, which works out to roughly 114,000 years of operation. To say these specifications are stringent is perhaps an understatement, but such extremely low failure rates are important when considering that the electronics ultimately have control over the vehicle.
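To make the FIT arithmetic concrete, here is a small sketch. It assumes a constant failure rate (the useful-life period) and a year of 8,760 hours. The fleet size and operating hours in the example are hypothetical numbers chosen only to show the scale.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def fit_to_failures_per_hour(fit: float) -> float:
    """One FIT is one failure per 1e9 device operating hours."""
    return fit / 1e9


def mean_years_between_failures(fit: float) -> float:
    """Mean time between failures in years, assuming a constant failure rate."""
    return 1.0 / fit_to_failures_per_hour(fit) / HOURS_PER_YEAR


def expected_fleet_failures(fit: float, devices: int, hours_per_device: float) -> float:
    """Expected number of failures across a fleet: rate * total device-hours."""
    return fit_to_failures_per_hour(fit) * devices * hours_per_device


# 1 FIT expressed as years between failures for a single device.
print(round(mean_years_between_failures(1)))

# Hypothetical fleet: 1,000,000 devices at 10 FIT, each operated for 8,000 hours.
print(expected_fleet_failures(fit=10, devices=1_000_000, hours_per_device=8_000))
```

The fleet example is the reason such small numbers matter: even at 10 FIT, a million devices accumulating 8,000 hours each would be expected to see about 80 failures.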
In addition to the PMHF of the device, a separate analysis looks at how well a design can withstand a single-point fault. The result is the single-point fault metric (SPFM), which evaluates how effectively the safety mechanisms detect and handle single-point (isolated) faults. In other words, it asks whether there is any case in which a single fault of a specific type can overwhelm the safety mechanisms.
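As a sketch of how the SPFM is usually computed, the commonly cited ISO 26262-5 definition is one minus the share of the total safety-related failure rate that comes from single-point and residual faults. The function name, argument names, and the example failure rates below are my own assumptions for illustration.

```python
def spfm(total_fit: float, single_point_fit: float, residual_fit: float) -> float:
    """Single-point fault metric as a fraction between 0 and 1.

    total_fit:        summed failure rate of all safety-related hardware elements
    single_point_fit: failure rate of faults not covered by any safety mechanism
    residual_fit:     failure rate of faults that slip past an existing safety mechanism
    """
    if total_fit <= 0:
        raise ValueError("total_fit must be positive")
    return 1.0 - (single_point_fit + residual_fit) / total_fit


# Hypothetical device: 500 FIT total, 1 FIT single-point, 3 FIT residual.
print(f"SPFM = {spfm(500.0, 1.0, 3.0):.1%}")  # prints "SPFM = 99.2%"
```

Only the ratio matters here, so the inputs can be in FIT or failures per hour as long as all three use the same unit.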
The final key metric evaluated in the context of achieving a given ASIL is the latent fault metric (LFM). It measures how effectively a system’s safety mechanisms detect faults that could otherwise go undetected for extended periods of time. The required values for the various metrics by ASIL are shown in the table below.
Consistent with the points made earlier, higher ASILs drive more stringent requirements.
And yet again, we have only scratched the surface of this topic. But it is probably easiest to get your arms around it by taking small, bite-sized pieces. There are many other topics to cover in this complex field, which is of extreme importance as growing numbers of increasingly complex semiconductor devices take greater control over the vehicle.
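To show how the metrics and the ASIL line up, here is a hedged sketch. The target values in it are the ones commonly cited from ISO 26262-5, entered from memory rather than taken from this blog, so please confirm them against the standard. The LFM formula follows the usual definition: one minus the latent multiple-point failure rate divided by the failure rate that remains after single-point and residual faults are removed. All names and example numbers are assumptions for illustration.

```python
# Commonly cited ISO 26262-5 targets; confirm against the standard before relying on them.
# Each entry is (minimum SPFM, minimum LFM, maximum PMHF in FIT).
ASIL_TARGETS = {
    "B": (0.90, 0.60, 100.0),
    "C": (0.97, 0.80, 100.0),
    "D": (0.99, 0.90, 10.0),
}


def lfm(total_fit: float, single_point_fit: float,
        residual_fit: float, latent_fit: float) -> float:
    """Latent fault metric as a fraction between 0 and 1."""
    remaining = total_fit - single_point_fit - residual_fit
    if remaining <= 0:
        raise ValueError("failure rate left after single-point and residual faults must be positive")
    return 1.0 - latent_fit / remaining


def meets_targets(asil: str, spfm_value: float, lfm_value: float, pmhf_fit: float) -> bool:
    """Check a device's three metrics against the targets for the given ASIL."""
    min_spfm, min_lfm, max_pmhf = ASIL_TARGETS[asil]
    return spfm_value >= min_spfm and lfm_value >= min_lfm and pmhf_fit < max_pmhf


# Hypothetical device: 500 FIT total, 1 FIT single-point, 3 FIT residual, 40 FIT latent.
device_lfm = lfm(500.0, 1.0, 3.0, 40.0)
print(f"LFM = {device_lfm:.1%}")
print(meets_targets("D", spfm_value=0.992, lfm_value=device_lfm, pmhf_fit=4.0))
```

ASIL A is left out of the dictionary on purpose, since no numeric targets are usually quoted for it.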
In upcoming blogs, we will look at the concept of decomposition, or how to achieve ASIL D random fault coverage at the system level, while employing devices that are only certified to support ASIL B random fault coverage.