Reliability Importance Measures of Components in a Complex System - Identifying the 20% in the 80/20 Rule
[Editor's Note: This article has been updated since its original publication to reflect a more recent version of the software interface.]
When analyzing a system's reliability and availability, measuring the importance of components is often of significant value in prioritizing improvement efforts, performing trade-off analysis in system design, or suggesting the most efficient way to operate and maintain a system. Focusing on the most problematic areas in the system yields the most significant gains. This article presents different ways of assessing the importance of non-repairable and repairable components within a system using BlockSim.
Introduction
In 1906, an Italian economist named Vilfredo Pareto noticed that 20% of the people owned 80% of the wealth. This is often referred to as the 80/20 rule, popularized by the quality management pioneer Dr. Joseph Juran, who, during his work in the 1930s and 40s, recognized a universal principle he called the "vital few and trivial many." Juran described the Pareto concept as distinguishing the vital few issues from the trivial many issues, stating that 20% of the defects cause 80% of the problems. Recently, Microsoft discovered that 80% of the errors and crashes in Windows and Office are caused by 20% of the entire pool of bugs detected [1]. Even if your system does not follow the 80/20 rule exactly, it is useful to prioritize the issues in your system before deciding on a plan of attack.
With modern technology and higher reliability requirements, systems are getting more complicated, so identifying the most problematic components can become difficult. Many systems are repairable, composed of many components that fail and are repaired according to different distributions. With limitations and constraints (such as spare parts availability, repair crew response time, logistic delays, etc.), exact analytical solutions become intractable. In these cases, simulation becomes the tool of choice for modeling repairable systems and identifying weak components and areas where maintainability limitations hinder the availability of the system.
Note: In this article, the cost of improving the reliability of the component is not considered. Cost of improvement is covered in the Reliability Allocation section of the System Analysis Reference.
1. Importance Measures for Non-Repairable Components
In simple systems such as a series system, it is easy to identify the weak components. However, in more complex systems, this becomes quite a difficult task. For complex systems, the analyst needs a mathematical approach that will provide the means of identifying and quantifying the importance of each component in the system.
Using reliability importance (IR) measures is one method of identifying the relative importance of each component in a system with respect to the overall reliability of the system. The reliability importance, IRi, of component i in a system of n components is given by Leemis [2]:
IRi(t) = ∂Rs(t) / ∂Ri(t)        (1)
where:
- Rs(t) is the system reliability at a certain time, t
- Ri(t) is the reliability of component i at a certain time, t
This metric measures the rate of change (at time t) of the system reliability with respect to the component's reliability change. It also measures the probability of a component being responsible for system failure at time t. The value of the reliability importance given by Eqn. (1) depends on both the reliability of a component and its corresponding position in the system.
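To make Eqn. (1) concrete, the partial derivative can be estimated numerically. The following Python sketch computes the reliability importance of each block in a small hypothetical system (one block in series with a parallel pair) using a central finite difference; this illustrative structure and its parameters are chosen for demonstration and are not the article's example diagram.

```python
import math

def weibull_rel(t, beta, eta):
    """Weibull reliability: R(t) = exp(-(t/eta)**beta)."""
    return math.exp(-((t / eta) ** beta))

def system_rel(r):
    """Hypothetical structure: block 0 in series with a parallel pair (blocks 1, 2)."""
    return r[0] * (1.0 - (1.0 - r[1]) * (1.0 - r[2]))

def reliability_importance(rs_func, r, i, h=1e-6):
    """Estimate IRi = dRs/dRi with a central finite difference."""
    hi, lo = list(r), list(r)
    hi[i] += h
    lo[i] -= h
    return (rs_func(hi) - rs_func(lo)) / (2.0 * h)

t = 50.0
r = [
    weibull_rel(t, 1.5, 200),  # series block
    weibull_rel(t, 2.0, 400),  # first parallel block
    weibull_rel(t, 1.7, 400),  # second parallel block
]
for i in range(3):
    print(f"IR of block {i} at t = {t:.0f} hr: {reliability_importance(system_rel, r, i):.4f}")
```

For this structure the finite-difference estimates can be checked analytically, e.g. ∂Rs/∂R0 = 1 − (1 − R1)(1 − R2) for the series block, which is why a series block with modest reliability tends to dominate the importance ranking.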
As an example, let us consider the system shown next.
The failure distributions for the components in the diagram are:
| Block Name | Failure Distribution (hr) |
| --- | --- |
| A | Weibull (β = 1.5, η = 200) |
| B | Weibull (β = 4, η = 1000) |
| C | Exponential (λ = 0.0008) |
| D | Weibull (β = 2, η = 150) |
| E | Weibull (β = 2, η = 400) |
| F | Weibull (β = 1.7, η = 400) |
| G | Weibull (β = 1.5, η = 100) |
| H | Weibull (β = 1.4, η = 800) |
| I | Weibull (β = 1.5, η = 1000) |
The system reliability equation for this configuration can be expressed as:
Hence, according to Eqn. (1), the reliability importance of component A, for example, is:
By varying the time value, t, and obtaining the corresponding reliabilities at t for each of the components in the above equation, we can obtain the reliability importance value for different times. For instance, if t = 50 hours, IRA = 0.936. The same procedure can be applied for every component.
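The component reliabilities at a given time follow directly from the distributions in the table above, using the standard reliability functions R(t) = exp(−(t/η)^β) for the Weibull and R(t) = exp(−λt) for the exponential. A short Python sketch evaluating all nine blocks at t = 50 hours:

```python
import math

def weibull_rel(t, beta, eta):
    """Weibull reliability: R(t) = exp(-(t/eta)**beta)."""
    return math.exp(-((t / eta) ** beta))

def exponential_rel(t, lam):
    """Exponential reliability: R(t) = exp(-lam*t)."""
    return math.exp(-lam * t)

# Failure distributions from the table above
blocks = {
    "A": lambda t: weibull_rel(t, 1.5, 200),
    "B": lambda t: weibull_rel(t, 4.0, 1000),
    "C": lambda t: exponential_rel(t, 0.0008),
    "D": lambda t: weibull_rel(t, 2.0, 150),
    "E": lambda t: weibull_rel(t, 2.0, 400),
    "F": lambda t: weibull_rel(t, 1.7, 400),
    "G": lambda t: weibull_rel(t, 1.5, 100),
    "H": lambda t: weibull_rel(t, 1.4, 800),
    "I": lambda t: weibull_rel(t, 1.5, 1000),
}

t = 50.0
for name, rel in blocks.items():
    print(f"R_{name}({t:.0f}) = {rel(t):.4f}")
```

For example, R_A(50) = exp(−(50/200)^1.5) ≈ 0.8825. Substituting these values into the system reliability equation yields the reliability importance values, such as IRA = 0.936 at t = 50 hours.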
This type of reliability importance measure can be presented graphically in various ways. The following BlockSim plot shows the reliability importance of each block over time.
The next plot is a snapshot of the previous plot at a specific time value (this is called "static reliability importance").
The following plot is also static reliability importance, but is presented as a "square pie chart" that shows the breakdown of the components' reliability importance.
The three plots above show the clear dominance of two of the nine components (roughly 20%), A and I, which are responsible for most of the failures of the system.
2. Importance Measures for Repairable Components
Let us now assume that the system is repairable, with the following repair distributions and preventive maintenance policies.
Table 1: Maintainability characteristics of the system
| Block Name | Repair Distribution (hours) | Preventive Maintenance Policy | Preventive Maintenance Repair Distribution (hours) |
| --- | --- | --- | --- |
| A | Normal (μ = 20, σ = 0.1) | Every 300 hours of block age | Normal (μ = 6, σ = 2) |
| B | Normal (μ = 10, σ = 2) | Every 300 hours of block age | Normal (μ = 6, σ = 2) |
| C | Normal (μ = 0.5, σ = 0.1) | No PM (constant failure rate) | N/A |
| D | Normal (μ = 0.5, σ = 0.01) | Every 300 hours of block age | Normal (μ = 6, σ = 2) |
| E | Normal (μ = 10, σ = 2) | Every 300 hours of block age | Normal (μ = 6, σ = 2) |
| F | Normal (μ = 10, σ = 2) | Every 200 hours of block age | Normal (μ = 6, σ = 2) |
| G | Normal (μ = 1, σ = 0.1) | Every 200 hours of block age | Normal (μ = 6, σ = 2) |
| H | Normal (μ = 10, σ = 2) | Every 200 hours of block age | Normal (μ = 6, σ = 2) |
| I | Normal (μ = 20, σ = 2) | Every 200 hours of block age | Normal (μ = 6, σ = 2) |
When dealing with repairable systems, the system reliability (and system availability) depends not only on the components' failure characteristics but also on other contributory factors such as time-to-repair distributions, maintenance practices, crew and spare part availability, logistic delays, etc.
Through simulation, the system and component histories over time can be captured. The results of the simulation can be used to quantify two other types of reliability importance measures, ReliaSoft's Failure Criticality Index (RS FCI) and ReliaSoft's Downing Event Criticality Index (RS DECI), both available in BlockSim. A discussion of these two metrics is presented next.
2.1. ReliaSoft's Failure Criticality Index (RS FCI)
ReliaSoft's Failure Criticality Index (RS FCI) is a relative index showing the percentage of times that a failure of a component caused a system failure. RS FCI is obtained from:

RS FCIi = (number of system failures caused by a failure of component i) / (total number of system failures) × 100%
This metric considers only failure events and excludes preventive maintenance and inspection events that cause an interruption in the system's operation.
RS FCI reports the percentage of times that a system failure event was caused (triggered) by a failure of a particular component over the simulation time (0,t). Intuitively, this index has the same meaning and the same application as the reliability importance measure, IRi, described in Eqn. (1).
For example, if we simulate the system's operation for 5,000 hours in BlockSim, we obtain the following Block Summary report.
For component A, RS FCI = 75.03%. This implies that 75.03% of the times that the system failed, a component A failure was responsible. Note that the combined RS FCI of A and I is 81.41%. In other words, A and I contributed to about 80% of the system's total downing failures.
The RS FCI results can also be seen in a graphical format.
2.2. ReliaSoft's Downing Event Criticality Index (RS DECI)
ReliaSoft's Downing Event Criticality Index (RS DECI) is a relative index showing the percentage of times that a downing event of a particular component caused the system to go down. This is obtained from:

RS DECIi = (number of system downing events caused by a downing event of component i) / (total number of system downing events) × 100%
This metric considers all downing events (i.e., failures, preventive maintenance and inspection events that cause an interruption in the system's operation).
In the simulation results, we see that for component A, RS DECI = 46.30%. This implies that 46.30% of the times that the system was down were due to component A being down. Note that the combined RS DECI of A and I is 84.69%. Once again we see how the vital few issues, A and I (20% of the components), contributed to about 80% of the system downtime, whereas the trivial many (80% of the components) contributed to only 20% of the downtime.
The RS DECI results can also be seen in a graphical format.
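Both indices are simple ratios over the simulated event history. The sketch below computes RS FCI and RS DECI from a small, entirely hypothetical downing-event log; BlockSim derives these indices from its own simulation results, and the log format here is invented purely for illustration.

```python
from collections import Counter

# Hypothetical downing-event log: (component, event_type) pairs, where
# event_type is "failure", "pm" or "inspection".
events = [
    ("A", "failure"), ("A", "pm"), ("I", "pm"), ("A", "failure"),
    ("I", "failure"), ("A", "pm"), ("G", "failure"), ("I", "pm"),
]

# RS DECI counts every system downing event ...
deci = Counter(comp for comp, _ in events)
# ... while RS FCI keeps only the failure events.
fci = Counter(comp for comp, kind in events if kind == "failure")

for name, counts in (("RS DECI", deci), ("RS FCI", fci)):
    total = sum(counts.values())
    for comp, n in counts.most_common():
        print(f"{name}({comp}) = {100.0 * n / total:.1f}%")
```

Note how a component with frequent PM (such as A in this toy log) can rank higher on RS DECI than its RS FCI alone would suggest, which is exactly the distinction the two indices are designed to expose.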
3. FRED Report
FRED stands for Failure Reporting, Evaluation and Display. This report provides a graphical demonstration of the maintainability/availability characteristics of the components/subsystems in a system and helps to identify areas for improvement (i.e., better reliability and/or better maintainability).
For the repairable system example, the FRED report is shown next.
The FRED report shows the average availability, the MTBF, the MTTR (mean time to repair) and the RS FCI values for each component in the system. In addition, the components are color coded to show the maintainability/availability of each component in relation to the other components (using a color spectrum varying from red for worst to dark green for best). For example, we can conclude from the above FRED report that component G's reliability needs improvement (MTBF = 107.717) and that component A's availability is the lowest (Am = 0.912).
4. What-If Analysis
Another possible way to understand the importance of any element in a system is to perform sensitivity analysis using what-if analysis. This allows for more flexible types of importance measures beyond the standard measures described above. The analyst can vary parameters, take out elements, add resources, change preventive maintenance policies, etc., and assess the effect on a reliability/availability metric of interest. Practically any element in the system can be manipulated to assess its impact on the system's reliability/availability; some common examples are listed next.
4.1. Eliminating Problems
The analyst can study the impact on reliability or availability if a problem (failure mode) is eliminated. This can be done by analyzing the system with the problem and without the problem. The next plot shows the reliability of the system if A is eliminated.
In BlockSim, you can delete a block or set it so that it does not fail; this will eliminate its effect.
4.2. Changing Failure Distribution
Another way of assessing the importance of a component is to vary its failure distribution parameters. For example, you can assess the impact on system reliability of improving a part or switching to a different supplier. The next figures show the difference in the B10 life of the system if the original component C, C1, which follows an exponential distribution with λ1 = 0.0008, is replaced by a similar (but more expensive) component, C2, with λ2 = 0.0004, available from a different supplier.
B10 Life for the System with C1
B10 Life for the System with C2
The above analysis can be used to weigh the gains obtained by switching to a more expensive supplier.
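At the component level, the B10 life (the time by which 10% of units are expected to fail) of an exponential part follows directly from R(t) = exp(−λt): setting R = 0.90 gives t = −ln(0.90)/λ. The sketch below compares C1 and C2 in isolation; the system-level B10 values shown in the figures additionally require the full system reliability model.

```python
import math

def b10_exponential(lam):
    """B10 life of an exponential component: solve R(t) = exp(-lam*t) = 0.90."""
    return -math.log(0.90) / lam

b10_c1 = b10_exponential(0.0008)  # original supplier, lambda1 = 0.0008
b10_c2 = b10_exponential(0.0004)  # alternative supplier, lambda2 = 0.0004
print(f"Component-level B10 of C1: {b10_c1:.1f} hr")  # ~131.7 hr
print(f"Component-level B10 of C2: {b10_c2:.1f} hr")  # ~263.4 hr
```

Halving λ doubles the component's B10 life; whether the resulting system-level gain justifies the higher purchase price is the trade-off being weighed.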
4.3. Changing Maintainability Characteristics
This what-if analysis is done by changing maintainability-related factors and assessing the impact on availability or on the total number of failures. Maintainability-related factors include the repair duration, the frequency of preventive maintenance and inspections, the initial stock level of spare parts, the restocking policy, the logistic delay for obtaining parts (e.g., by choosing a different distributor or delivery company) and the number of repair crews.
The next example shows the impact on availability if preventive maintenance is performed on every component every 500 hours of component age. You can see that availability increases, which indicates that the original schedule of preventive maintenance was far too frequent. By performing the maintenance less often, you will not only increase availability (by reducing planned downtime) but will also save money on the maintenance tasks.
Mean Availability of the System with Original PM Timing
Mean Availability of the System with New PM Timing
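The effect of the PM interval can be explored with a small Monte Carlo model. The sketch below simulates a single Weibull block with age-based PM, with parameters borrowed from block A in the tables above; it is a deliberately simplified stand-in for the full multi-component simulation BlockSim performs, and the direction of the trade-off depends on the failure distribution and the repair/PM durations.

```python
import math
import random

def simulate_availability(pm_interval, mission=5000.0, runs=200, seed=1):
    """Monte Carlo mean availability of a single block with age-based PM.

    Failure times are Weibull; corrective and preventive repair durations
    are normal. This single-block model only illustrates the trade-off.
    """
    rng = random.Random(seed)
    beta, eta = 1.5, 200.0        # failure distribution (block A)
    cm_mu, cm_sigma = 20.0, 0.1   # corrective repair duration
    pm_mu, pm_sigma = 6.0, 2.0    # preventive maintenance duration
    total_up = 0.0
    for _ in range(runs):
        t = 0.0
        while t < mission:
            # Draw a Weibull time to failure by inverting the CDF
            ttf = eta * (-math.log(1.0 - rng.random())) ** (1.0 / beta)
            if ttf < pm_interval:              # the block fails before PM
                up = min(ttf, mission - t)
                down = max(0.0, rng.gauss(cm_mu, cm_sigma))
            else:                              # PM renews the block first
                up = min(pm_interval, mission - t)
                down = max(0.0, rng.gauss(pm_mu, pm_sigma))
            total_up += up
            t += up + down
    return total_up / (runs * mission)

for interval in (200.0, 300.0, 500.0):
    print(f"PM every {interval:.0f} hr -> availability ~ {simulate_availability(interval):.3f}")
```

Running the model across several candidate intervals makes the downtime trade-off visible: frequent PM adds planned downtime, while infrequent PM allows more (and longer) corrective repairs.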
5. Conclusion
Before embarking on a lengthy and expensive development or maintenance restructuring program, it is important to identify the most problematic areas in the system. In addition to various standard metrics available in BlockSim, what-if type analysis can be used to address other types of importance measures.
References
1. Rooney, P., "Microsoft's CEO: 80-20 Rule Applies To Bugs, Not Just Features," CRN, www.crn.com/sections/breakingnews/dailyarchives.jhtml?articleId=18821726, visited on 8/2/2006.
2. Leemis, L.M., Reliability: Probabilistic Models and Statistical Methods, Prentice Hall, Inc., Englewood Cliffs, New Jersey, 1995.
3. Wang, W., Loman, J. and Vassiliou, P., "Reliability Importance of Components in a Complex System," Proceedings of the Annual Reliability & Maintainability Symposium, 2004.