Reliability Engineering
Massimo Lazzaroni, Loredana Cristaldi, Lorenzo Peretto, Paola Rinaldi, and Marcantonio Catelani
Reliability Engineering Basic Concepts and Applications in ICT
Authors
Prof. Dr. PhD. Massimo Lazzaroni, Università degli Studi di Milano, Dipartimento di Tecnologie dell'Informazione, Via Bramante 65, I-26013 Crema, Italy. Email: [email protected]
Dr. PhD. Paola Rinaldi, Alma Mater Studiorum - Università di Bologna, Dipartimento di Elettronica, Informatica e Sistemistica, Viale Risorgimento, 2, I-40136 Bologna, Italy
Prof. Dr. PhD. Loredana Cristaldi Dipartimento di Elettrotecnica Politecnico di Milano Piazza Leonardo da Vinci, 32 I - 20133 Milano Italy
Prof. Dr. Marcantonio Catelani Dipartimento di Elettronica e Telecomunicazioni Università degli Studi di Firenze via S. Marta, 3 I - 50139 Firenze Italy
Prof. Dr. PhD. Lorenzo Peretto Alma Mater Studiorum - Università di Bologna Dipartimento di Ingegneria Elettrica Viale Risorgimento, 2 I - 40136 Bologna Italy
ISBN 978-3-642-20982-6
e-ISBN 978-3-642-20983-3
DOI 10.1007/978-3-642-20983-3
Library of Congress Control Number: 2011928069
© 2011 Springer-Verlag Berlin Heidelberg
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.
Printed on acid-free paper
springer.com
Preface
Nowadays, in many fields of application, it is fundamental to define and fulfil Dependability requirements. For complex equipment in avionics, automotive or transportation applications – to mention only a few examples – we have to take into account the functional requirements of the system and, in addition, its requirements in terms of Reliability, Maintainability, Availability and Safety. In other words, it is fundamental to evaluate how the functional requirements of the equipment under consideration are maintained over time, in specified conditions of use. For these reasons it is fundamental, as a starting point, to focus attention on the correct use of the terminology in this field. To this aim, Chapter 1 proposes an overview of the most important terms related to Dependability. Referring to the International Standard, we can assume Dependability as the collective term used to describe the availability performance and its influencing factors: reliability, maintainability and maintenance support performance. On the basis of this concept, it is evident that dependability gives a general description of the item – a component, a piece of equipment, a system, and so on – in non-quantitative terms. To express its performance in quantitative terms, and thus to describe, measure, improve, guarantee and certify such an item, it is necessary to establish the characteristics of reliability, availability, maintainability, maintenance support performance and safety. Assuming the reliability of an item to be the probability that such an item will adequately perform the specified function for a well-defined time interval in specified environmental conditions, the importance of probability and statistics in both the definition and the evaluation of reliability is clear. Chapter 2 is therefore devoted to introducing some basic concepts of probability and statistics that are necessary for reliability evaluation.
In particular, the statistical point of view is developed and discussed as a first approach to the dependability features of a system under consideration. A brief overview of probability and statistics is given in the first part, and the reliability function is then derived. Furthermore, both the concept and the model of the failure rate are proposed. In Chapter 3 the techniques used to describe the performance of devices in a system are considered. To this aim the system is regarded as a combination of elementary devices – subsystems and elements – that follow a well-defined functional structure. The reliability evaluation of series, parallel and mixed structures is shown and discussed in detail. To this aim, the Reliability Block Diagram is introduced as an essential tool. The theory is developed using many practical examples. The parallel configuration is further developed in order to discuss the different types of redundancy for reliability growth: active, warm and stand-by.
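The series and parallel evaluations outlined above can be previewed with a minimal sketch for independent elements; the component reliability values below are hypothetical examples, not data from the book:

```python
# Illustrative sketch: reliability of series and active-parallel structures
# built from independent elements, as developed in Chapter 3.
# The component reliabilities used here are hypothetical example values.

def series_reliability(rs):
    """Series structure: the system works only if every element works."""
    out = 1.0
    for r in rs:
        out *= r
    return out

def parallel_reliability(rs):
    """Active parallel redundancy: the system fails only if all elements fail."""
    out = 1.0
    for r in rs:
        out *= (1.0 - r)
    return 1.0 - out

r_components = [0.95, 0.95, 0.90]          # hypothetical element reliabilities
print(series_reliability(r_components))    # 0.95 * 0.95 * 0.90 = 0.81225
print(parallel_reliability(r_components))  # 1 - 0.05 * 0.05 * 0.10 = 0.99975
```

Note how the series structure is always weaker than its weakest element, while redundancy raises the system reliability above that of any single element; this asymmetry is what motivates the redundancy techniques discussed in Chapter 3.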
At the end of this chapter, two different redundancy approaches are compared: system redundancy and component redundancy. The results so obtained are fundamental during the design phase of a system, when reliability aspects have to be taken into account. Considering that component reliability is often affected by different external and internal influencing factors – stress, environment, quality, technology, and so on – the operating profile of a component must be taken into account for good reliability predictions. Chapter 4 focuses on such aspects. It is worth remembering that, in many situations, the operating profile changes according to the type of operation of the component: continuous operation, non-continuous operation, or even sporadic operation. Moreover, storage conditions may have a deep impact on the reliability of the component when operating. Obviously, environmental factors need to be taken into account as well. The environment contributes to both aging and failures during the life of devices or systems. To this aim, both the duration and the intensity of environmental stresses must be included in the system operational model. In this chapter, after a brief introduction, the stress factors are analyzed and some aspects concerning component degradation are presented. Some concepts regarding the analysis of failure modes and laboratory tests are also presented. The evaluation of the failure rate of an item often represents a very difficult task. In order to carry out this evaluation for electronic or electromechanical equipment, it is possible to use ad hoc Handbooks (HDBKs) – reliability prediction handbooks. In the first part of Chapter 5, a brief historical overview of such handbooks is given, introducing first generation as well as second and third generation HDBKs.
As practical applications of reliability prediction in the electronic field, the USA military HDBK (MIL) and the FARADA HDBK are presented. In Chapter 6 the concept of Availability is explained and discussed. Availability is defined by the international standards as the ability of an item to perform its required function in given conditions at a given instant of time or during a given time interval, assuming that the required external resources are provided. Availability is thus a concept that refers to repairable systems, that is, systems whose operating life cycles can often be described by a sequence of up (operating) and down (not operating) states. In this case, the important variables to be determined are both the time to failure and the time to repair and/or restore. For a more detailed and exhaustive description of dependability performance, a set of well-known techniques is available in the literature. Such techniques, normally classified into quantitative and qualitative, are methods of analysis used to evaluate the dependability parameters and the failure modes to which a realistically complex system is, or could be, subjected. In Chapter 7 a simplified version of the Markov model is proposed. Other techniques, able to reveal the mechanisms of system failures and to identify all the potential weaknesses of the system under evaluation, are presented in Chapter 8. In particular, we refer to the Failure Modes and Effects Analysis (FMEA) and the Failure Modes, Effects
and Criticality Analysis (FMECA). These methods are able to highlight the failure modes that lead to a negative final effect, also in terms of criticality for the system, the operator and/or the environment. A third method presented here is the Fault Tree Analysis (FTA), a deductive method for the analysis of a top event, represented by a failure condition of the system, as a function of failures of subsystems and components. Massimo Lazzaroni Loredana Cristaldi Lorenzo Peretto Paola Rinaldi Marcantonio Catelani
Contents
1 The Concept of Measurable Quality
   1.1 Introduction
   1.2 Is Conformity Synonymous with Reliability? Some Definitions
   1.3 Failures, Faults and Their Classification
   References
2 The Concept of "Statistical" Reliability
   2.1 Introduction
   2.2 Definition of Probability
      2.2.1 Axioms of Probability
      2.2.2 Law of Large Numbers
   2.3 The Random Variables
   2.4 Probability Distribution
   2.5 The Characteristics of Reliability
      2.5.1 Reliability
      2.5.2 Failure Rate
   2.6 The Frequency Approach (Example 1)
   2.7 Models of Failure Rate (Example 2)
   2.8 Other Laws (Exponential Law, Log-normal Distribution, Weibull Distribution)
   References
3 Reliability Analysis in the Design Phase
   3.1 Introduction
   3.2 Reliability Evaluation of Series, Parallel and Mixed Structures
      3.2.1 The Series Functional Configuration (Examples 1-3)
      3.2.2 The Concept of Redundancy: Parallel Functional Configuration (Examples 4-8)
   3.3 Types of Redundancy
   3.4 Functional Configuration k out of n (Examples 9-16)
   References
4 Experimental Reliability and Laboratory Tests
   4.1 Introduction
   4.2 Stress Factors
   4.3 Component Degradation
   4.4 The Prediction Approach
   4.5 Failure Modes
   4.6 Laboratory Tests on Components and Systems
   References
5 Reliability Prediction Handbooks: Evaluation of the System Failure Rate
   5.1 Introduction
   5.2 Second Generation Handbooks
   5.3 MIL-Handbook-217
   5.4 Failure Rate Data Bank (FARADA)
   5.5 Third Generation Data Banks
   5.6 Calculation of the Failure Rate (Examples 1-2)
   5.7 FIT: A More Recent Unit
   References
6 Repairable Systems and Availability
   6.1 Introduction
   6.2 Mean Time To Repair/Restore (MTTR)
      6.2.1 A Particular Case
   6.3 Mean Time Between Failures (MTBF)
   6.4 The Significance of Availability in the Life Cycle of a Product
   6.5 Instantaneous Availability
   6.6 Dependability: An Evaluation of the "Level of Confidence" in Reference to the Correct Functioning of the System
   6.7 The Prerequisites of Dependability
   References
7 Techniques and Methods to Support Dependability
   7.1 Introduction
   7.2 Introduction to Quantitative Techniques
   7.3 Evaluation of Availability Using Analytical Models
   7.4 Markov Models
   7.5 Transition Matrix and Fundamental Equation
   7.6 Diagrams of State (Case 1: one element, non-repairable and repairable; Case 2: two elements)
   7.7 Evaluation of Reliability
   7.8 Calculation of Reliability, Unreliability and Availability
   7.9 Markov Analysis of a System: Application Example
   7.10 Numerical Resolution of the System
   7.11 Possible Solutions to the Absence of Memory of the Markov Model
   References
8 Qualitative Techniques
   8.1 Introduction
   8.2 Failure Mode and Effects Analysis (FMEA)
      8.2.1 Operative Procedure (Steps 1-11)
      8.2.2 FMEA Typology
      8.2.3 The Concept of Criticality
      8.2.4 Final Considerations on FMEA Analysis (Examples 1-3)
   8.3 Failure Mode, Effects, and Criticality Analysis (FMECA)
      8.3.1 Failure Modes and Their Probability
      8.3.2 Evaluation of Criticality
      8.3.3 FMECA Based on the Concept of Risk
      8.3.4 FMECA Based on the Failure Rate
      8.3.5 Worksheet Examples
   8.4 Fault Tree Analysis (FTA)
      8.4.1 Graphical Construction of a Fault Tree
      8.4.2 Qualitative Analysis of a Fault Tree
      8.4.3 Quantitative Analysis of a Fault Tree
      8.4.4 Advantages and Disadvantages of Fault Tree Analysis (FTA)
   8.5 An Overview Example
   References
Index
Chapter 1
The Concept of Measurable Quality
Abstract. Although the science of Reliability is long established, the various terms used in this field are often misunderstood. This happens in many contexts, including high-technology industrial and practical applications. This Chapter provides an overview of the most important terms related to Dependability, that is, a qualitative performance of an item such as a component, a device, a piece of equipment, and so on. According to the International Standard, Dependability is defined as the collective term used to describe the availability performance and its influencing factors: reliability, maintainability and maintenance support performance. On the basis of this definition, it is evident that dependability gives a general description of the item in non-quantitative terms. To express its performance in quantitative terms, and thus to describe, measure, improve, guarantee and certify such an item, it is necessary to establish the characteristics of reliability, availability, maintainability and maintenance support performance. These terms are defined in this Chapter.
1.1 Introduction

In the current technological context, characterized by ever more sudden and important developments, the concept of availability assumes a role of primary importance, both in the design and in the realization of a product, be it a component or a system. In general terms, we can think of a product as the result of a series of correlated activities, normally developed within a production process. This process transforms incoming elements (inputs), such as raw materials, technology or resources, into the desired product at the end (output) of the process. In order to express a judgment on the qualitative level of the product thus obtained, it is useful to recall the definition of the term Quality given by the EN ISO 9000 Standard [1]: "the totality of characteristics of an entity that bear on its ability to satisfy stated and implicit needs". It is evident that any consideration of the correct design and implementation of a product, and therefore of the consequent verification of its quality level, requires a preliminary definition of its characteristics, which, in general terms, can be classified as qualitative and/or quantitative. Going into more detail, and still following the above-mentioned Standard, the characteristics can be of a physical nature (e.g. mechanical, electrical, chemical, etc.), of a functional nature (the speed of an automobile, the memory capacity of a computer chip, etc.), time dependent
(requisites of reliability, maintainability, availability), and so on. Yet, independently of their nature, and always with the aim of expressing an objective evaluation of the "quality" of the product, such characteristics must be adequately defined in measurable terms. Only in this case is it possible to verify that all requirements have been met, that is, that the expressed or implicit needs manifested by those interested in buying or using the product in question are satisfied. In other terms, it is necessary to measure and keep under control the capabilities that the product must achieve and for which it was designed and realized. This capacity must be measured over time in order to ensure that the product is able to maintain and supply the necessary characteristics whenever requested. This demonstrates the multiplicity and heterogeneity of the characteristics that can distinguish a product, even a technologically simple one, and consequently the importance of correctly identifying such characteristics, in order not only to verify the attainment of the objectives of the project but also, and above all, to undertake any improvements aimed at guaranteeing increasingly higher levels of quality.
1.2 Is Conformity Synonymous with Reliability? Some Definitions

Since the objective of this book regards reliability and its impact, as an essential requirement for proper, modern design oriented towards competitiveness, we must pay particular attention to the time-related characteristics of the product. As noted previously, in addition to reliability, maintainability, availability and safety are also very important. Above all, in certain technological contexts and for particular applications, we speak of RAMS performances (Reliability, Availability, Maintainability and Safety) to define the life cycle of a product. It is necessary, however, not to confuse the concept of conformity with the RAMS prerequisites, which are quite different. In this section we try to give, as far as possible, an exhaustive view of the essential terminology. The reader can consult the bibliography for further details. Let us consider a generic element (entity or item): a component as well as a device, a subsystem, a functional unit, an apparatus or a system. The IEC 50 (191) standards [2] define an element as anything that can be considered individually and that can perform one or more functions. Conformity means the correspondence of the functional parameters of the element (performance) to pre-established values (specifications). Conformity is definable and measurable, for example, in terms of nominal value and tolerance, percentage of defective elements, etc. We assume that an element conforms if it possesses the technical capacity to do what is requested of it and for which it was designed. Under determined and fixed conditions of use, such capacity must be maintained over time. This is commonly referred to as Reliability. In fact, reliability is defined [1] in qualitative terms as "the ability of an item to perform a required
function under given conditions for a given time interval". In this sense, reliability represents one of the characteristics of the element, one which can be expressed quantitatively by means of a probability. After establishing a time interval and assuming that the element is capable of carrying out its required function at the beginning of such an interval (the element conforms, and the conditions of its use at time "zero" have been set), reliability corresponds to the "probability that the element is capable of performing its required function in the established time interval, under established conditions" [2]. Reliability can be determined through mathematical models (laws of reliability) or measured and estimated through statistical parameters, for example the Mean Time To Failure (MTTF) or the Mean Time Between Failures (MTBF). As will become clear in the following, the study of reliability permits not only the evaluation of the conformity of a device over time, but also the comparison of different design solutions with the same functional characteristics; it can also identify, inside an apparatus, the critical subsystems or elements that could cause a failure or malfunction of the apparatus itself, necessitating corrective action. For this reason, reliability has a determining role in modern design and constitutes a competitive element, even in the light of stricter safety requirements. However, a working apparatus or machine, even though reliable, is affected by inevitable degradation that will cause it, more or less rapidly, to modify or lose its technical capacity. It is therefore necessary to restore it to proper working order whenever an interruption occurs or its characteristics have become unacceptable. For such a reason, merely knowing the prerequisites of reliability is not fully sufficient to represent the characteristics of an element during its life cycle. It is also necessary to take into consideration the concept of reconditioning or restoration.
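To make the quantitative definition concrete, here is a minimal sketch assuming a constant failure rate (the exponential law introduced in Chapter 2), under which R(t) = exp(-λt) and MTTF = 1/λ; the failure-rate figure is an illustrative assumption, not a value from the text:

```python
import math

# Sketch under a constant-failure-rate assumption (exponential law):
# R(t) = exp(-lambda * t), MTTF = 1 / lambda.
# The failure-rate value below is a hypothetical example.
failure_rate = 1e-4              # failures per hour (illustrative)
mttf = 1.0 / failure_rate        # mean time to failure: 10,000 hours

def reliability(t_hours, lam=failure_rate):
    """Probability that the item still performs its required function at time t."""
    return math.exp(-lam * t_hours)

print(mttf)                      # 10000.0
print(reliability(1000.0))       # exp(-0.1), about 0.905
# At t = MTTF the exponential model gives R = exp(-1), about 0.368: under this
# law most items fail before their mean life, so MTTF alone does not capture
# reliability over a specific mission interval.
print(reliability(mttf))
```

This is why the definition above insists on a well-defined time interval: the same item has very different reliability figures for a 1,000-hour mission and for a mission as long as its MTTF.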
In this regard, the Standard [1] introduces the concept of dependability as "the collective term used to describe the availability performance and its influencing factors: reliability, maintainability and maintenance support performance". Based on this definition, it is evident that dependability gives a general description of the element in non-quantitative terms. To express its performance in quantitative terms, and thus to describe, anticipate, measure, improve, guarantee and certify such an element, it is necessary to establish, in addition to the reliability already introduced, the characteristics of availability, maintainability and maintenance support performance. They are defined [1] as follows.

Availability: the ability of an item to be in a state to perform a required function under given conditions at a given instant of time or over a given time interval, assuming that the required external resources are provided.

Maintainability: the ability of an item, under given conditions of use, to be retained in, or restored to, a state in which it can perform a required function, when maintenance is performed under given conditions and using stated procedures and resources.

Maintenance Support Performance: the ability of a maintenance organization, under given conditions, to provide upon demand the resources required to maintain an item under a given maintenance policy.
1 The Concept of Measurable Quality
As with reliability, the characteristics of availability and maintainability can be studied with mathematical models and measured by means of statistical parameters, for example the average time for making repairs and the average time for restoring the element to proper working conditions. Other types of parameters can also be used, such as operative availability. For some applications it is fundamental to introduce the concept of safety, meaning the freedom from unacceptable risk of harm, determinable by means of the SIL (Safety Integrity Level). Safety can also be defined as the absence of catastrophic consequences on the user(s) and the environment. It should be noted that in Information Technology (IT) further requirements are mandatory for dependability: confidentiality, integrity and security. Confidentiality is the absence of unauthorized disclosure of information and integrity is the absence of improper system alterations. Finally, security can be defined as the simultaneous existence of availability for authorized users only, confidentiality, and integrity, with “improper” here meaning “unauthorized” [3]. Combining this concept with those defined previously, it is possible to speak, for certain applications and in general terms, of RAMS (Reliability, Availability, Maintainability and Safety) requirements. As the definition shows, safety is connected to the evaluation of risk, i.e. the combination of the probable rate of occurrence of a hazard causing harm and the degree of severity of the harm. Risk analysis can be carried out using FMECA techniques, as presented in Chapter 8. It is therefore evident that availability, reliability, maintainability and safety are in themselves essential characteristics to define, control, maintain and improve the performance of an element over time. This is why they are listed as “key elements” when specifying product prerequisites and must be considered as functional performance requirements, “incoming” information for a correct design.
Their a posteriori evaluation, when a product is already completed, is a clear indication of bad system design and, as such, represents a solution that, from an engineering point of view, cannot be considered valid. Intervening on the availability characteristics of an apparatus already built or, even worse, of a working apparatus can at times bring about unbearable costs for both the user and the company, and it creates a bad impression, in terms of quality, of whoever puts the product on the market. To this can be added the liability for defective products and the consequences, at times legal, that in a user-supplier relationship could arise through misinterpretation or non-compliance with certain requirements stipulated in the contract.
1.3 Failures, Faults and Their Classification

The time interval during which an element functions properly ends when some type of deterioration causes an unacceptable variation in the nominal characteristics of the correct use of such an element. The element ceases to perform its required function and a failure occurs. Failure, therefore, is defined as the transition from a state of proper functioning to a malfunctioning state, which can be total or partial. The time to first failure represents the total duration of operating time of an item from the instant it is first put in an up-state until failure. The time to first failure, or simply the time to failure, represents the random variable that is
manifested in the event of a failure. The objective evidence of a failure is called the failure mode. Some examples of failure modes are an open circuit, the absence of an incoming signal or a valve that remains closed. The circumstances connected to the design, realization or use of an element that have led to a failure represent the cause of failure; the term failure mechanism refers to the chemical, physical or other type of process that has caused the failure. Failures can be classified according to various criteria. An important classification is made as a function of the cause responsible for their occurrence. The most recent terms are defined below according to [2, 3]:

1. Misuse failures, due to the application of stresses during use which exceed the stated capabilities of the item.
2. Primary failures, where the direct or indirect cause of the failure is not due to the failure of another device.
3. Induced failures or secondary failures, generated by the failure of another device.
4. Early life failures, attributable to intrinsic weaknesses in the construction of the element, whose causes are normally identifiable during the manufacturing process and which manifest themselves during the initial use of the device.
5. Random failures, due to uncontrollable factors which occur during the “useful life” of the component, with a probability independent of time.
6. Wear-out failures, generated by chemical-physical degradation phenomena, with a probability of occurrence that increases with the passage of time.

Some of these definitions are useful in understanding particular areas that characterize the temporal progression of the failure rate treated in Chapter 2. A different classification, as a function of the consequences of a failure, is defined below [2, 3]:

1. Critical failures, which can cause, with high probability, damage to persons or unacceptable material damage to other parts of the system.
2.
Failures of primary importance, which, although different from those mentioned previously, can reduce the functionality of a system.
3. Failures of secondary importance, which do not reduce the functionality of a system.

Such a classification is particularly useful for the development of availability analysis techniques, e.g. Failure Modes, Effects and Criticality Analysis (FMECA) and Fault Tree Analysis (FTA), as discussed in Chapter 8. Considering instead the nature of a failure at system level, we can identify the following [2, 3]:

1. Total failure, when the variations in the characteristics of the element are such as to completely compromise its function.
2. Partial failure, when the variations of one or more characteristics of the element do not impede its complete functioning.
3. Intermittent failure, characterized by a generally random succession of periods of normal operation and periods of failure, without any maintenance operations being carried out on the device.
The occurrence of a failure brings the element to a state of fault, characterized by the inability to perform its required function, excluding such inability during preventive maintenance or other planned activities [3]. It is important not to confuse the concept of failure, an event, with the concept of fault, associated with a particular state of a system. Analogously to failures, faults can also be classified according to appropriate criteria, which will not be treated here. It is useful, however, to discuss the meaning of some important activities that can be undertaken when an element is damaged. Such activities differ according to their purpose, and in particular [2, 3]:

Fault diagnosis: actions taken for fault recognition, fault location and cause identification.
Fault recognition: the event of a fault being recognized.
Fault location: actions taken to identify the faulty sub-item or items at the appropriate indenture level. The indenture level signifies the appropriate level of subdivision of the device (in this case, the system) with regard to the maintenance to be carried out [2, 3].
Fault correction: actions taken after fault location to restore the ability of the faulty item to perform a required function.
Restoration: the event in which the item regains the ability to perform a required function.
Repair: corrective maintenance performed on the faulty item.

Other important definitions can be found in [2, 3] and will be discussed in more detail later.
References

[1] ISO 9000:2005, Quality management systems – Fundamentals and vocabulary
[2] IEC 60050-191 ed. 1.0, International Electrotechnical Vocabulary. Chapter 191: Dependability and quality of service. IEC, International Electrotechnical Commission, Geneva (December 31, 1990). Forecast publication date for ed. 2.0 is 2012-06-02.
[3] Avizienis, A., Laprie, J.C., Randell, B., Landwehr, C.: Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing 1(1), 11–33 (2004)
Chapter 2
The Concept of “Statistical” Reliability
Abstract. The reliability of an item or a system can be thought of, as a first approach, as the probability that the device or system will adequately perform the specified function for a well-defined time interval under specified environmental conditions. Starting from this first definition, the importance of probability and statistics in both the definition and the evaluation of reliability is clear. This chapter is devoted to introducing some important probability and statistics concepts necessary for reliability evaluation. In particular, the statistical point of view is developed and discussed as a first approach to the dependability features of a system or device. A brief overview of probability and statistics concepts is given in the first pages of this chapter. In particular, the axioms of probability are given in 2.2.1 and the law of large numbers is discussed in 2.2.2. Random variables are introduced in 2.3 and probability distributions are detailed in 2.4. The reliability function is then derived. Furthermore, the concept of the failure rate model is defined in section 2.7. In 2.8 the most used distribution laws are discussed.
2.1 Introduction

The concept of availability implies that a device or system must remain usable throughout its life cycle, responding to a priori established properties. The specifications can vary as a function of the supplier, the buyer, and/or the user or regulating authorities. Generally speaking, it would be worthwhile for every device or system on the market to come with information on its life cycle. It is therefore necessary to define, based on sufficient and pertinent experimental data, the statistical reliability of a device or system. A statistical study of reliability that must count on the experience derived from collecting experimental data presupposes knowledge of the concept of a random phenomenon and of the meaning of the laws of distribution. It is from here that the definition reported in IEC 60050-191 [1] arises, where reliability is seen in probabilistic terms and such probability characterizes the aptitude expressed by “reliability.” It is necessary to start with the concept of probability and clarify how statistics and the theory of probability work together.
2.2 Definition of Probability

The literature offers three definitions of probability: the axiomatic definition, the definition based on the concept of relative frequency, and the “classic” definition. The axiomatic definition is the foundation of the mathematical theory of probability. In order to give the axiomatic definition, the following definitions are necessary [2].

• Random experiment: an experiment that can result in different outcomes, even if it is repeated in the same manner many times.
• Sample space: the set of all possible outcomes of a random experiment, denoted as S. It is also called the sure event.
• Discrete sample space: when a sample space consists of a finite set of outcomes, the space is called a discrete sample space.
• Event: a subset of the sample space, as defined above, of a random experiment. The letter E is used to denote an event.
• Probability of an event: the probability of an event E, in a discrete sample space, is often written as P(E) and equals the sum of the probabilities of the outcomes in E.
• Union: the union of two or more events is the event that consists of all outcomes that are contained in either of the events. The union is here denoted E1 ∪ E2.
• Intersection: the intersection of two or more events is the event that consists of all outcomes that are contained in all the aforementioned events. The intersection is here denoted E1 ∩ E2.
• Complement: the complement of an event (in a sample space) consists of the outcomes in the sample space that are not in the event. The complement of the event E is here denoted Ē.
• Mutually exclusive events: two events E1 and E2 such that E1 ∩ E2 = ∅ are called mutually exclusive.
The following properties are also valid:

• Commutative law: A ∪ B = B ∪ A and A ∩ B = B ∩ A;
• Associative law: A ∪ (B ∪ C) = (A ∪ B) ∪ C and A ∩ (B ∩ C) = (A ∩ B) ∩ C;
• Distributive law: A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C) and A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C);
• Complement law: A ∩ Ā = ∅ and A ∪ Ā = S;
• Idempotent law: A ∪ A = A and A ∩ A = A;
• De Morgan’s law: the complement of A ∪ B equals Ā ∩ B̄, and the complement of A ∩ B equals Ā ∪ B̄;
• Identity law: the complement of Ā is A, and A ∪ (Ā ∩ B) = A ∪ B.

Now the axioms of probability can be given.
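As an aside, these identities can be checked mechanically on a small finite sample space; a minimal sketch using Python sets (the sample space S and the events A, B, C below are arbitrary illustrations):

```python
# A quick check of the set laws above on a small finite sample space.
# The sample space S and the events A, B, C are arbitrary illustrations.
S = set(range(10))
A = {0, 1, 2, 3}
B = {2, 3, 4, 5}
C = {5, 6, 7}

def comp(E):
    """Complement of event E with respect to the sample space S."""
    return S - E

# Distributive law
assert A | (B & C) == (A | B) & (A | C)
assert A & (B | C) == (A & B) | (A & C)

# De Morgan's law
assert comp(A | B) == comp(A) & comp(B)
assert comp(A & B) == comp(A) | comp(B)

# Identity law: complement of the complement, and A ∪ (Ā ∩ B) = A ∪ B
assert comp(comp(A)) == A
assert A | (comp(A) & B) == A | B

print("all set identities hold")
```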
2.2.1 Axioms of Probability

Probability is the number assigned to each member of a set of events of a random experiment such that the following three very important properties are satisfied. Denoting with S the sample space and with E any event in a random experiment, the following three axioms can be written:

1. P(S) = 1
2. 0 ≤ P(E) ≤ 1
3. For two events E1 and E2 with E1 ∩ E2 = ∅, P(E1 ∪ E2) = P(E1) + P(E2).

The aforementioned axioms lead to considering probabilities as relative frequencies. A relative frequency must be between zero and one, as stated by axiom 2. Axiom 1 is due to the fact that an outcome from the sample space occurs on every trial of an experiment, so the relative frequency of S is one. Finally, axiom 3 implies that if two events have no common outcomes, the relative frequency of the outcomes in E1 ∪ E2 is the sum of the relative frequencies of the outcomes in the two events. The reader should note that:

• P(∅) = 0
• P(Ē) = 1 − P(E)

Starting from these three postulates, we can build the theory necessary for our purposes (deductive approach). The definition based on the concept of relative frequency is founded on experimental results. If the sample space consists of n possible outcomes which are equally likely (i.e. all single-element events have the same probability), then the probability of any event E can be evaluated as:

P(E) = (number of elements of E)/n = nE/n    (2.1)

It appears evident that the preceding formulation is not directly verifiable and is apparently unusable in strict terms; a possible solution comes from the law of large numbers and thus from the axiomatic approach. A brief overview of this law can therefore be very useful.
2.2.2 Law of Large Numbers

The law of large numbers states that, in repeated and independent trials having the same probability p of success in each trial, the percentage of successes approaches the probability of success as the number of trials increases (the probability of an event is seen as the limit of the relative frequency of occurrence of this event in a long series of trial repetitions). More precisely, the probability that the percentage of successes differs from p by more than any fixed positive amount ε > 0 converges to zero as the number of trials n goes to infinity.
Two simple considerations are necessary here:

• The difference between the number of successes and the number of trials times the chance of success in each trial (the expected number of successes) grows, as the number of trials increases, like the square root of the number of trials.
• As n grows, the chance of a large difference between the percentage of successes and the chance of success gets smaller and smaller, even if this difference can be large in some sequences of trials. In fact, affirming that the difference between the percentage of successes and the chance of success tends to zero is not equivalent to affirming that it has a large probability of being arbitrarily close to zero.

Returning to our problem, it can be shown that:

P{ lim(n→∞) nE/n = P(E) } = 1    (2.2)
When n can be considered “large enough”, we can affirm with a certain level of confidence that the relative frequency tends to the probability of occurrence of event E. The classic definition presupposes that P(E) can be evaluated a priori, without having to use experimental data. Under the hypothesis of equally probable outcomes, it is however necessary to count the total number N of possible results of the experiment under consideration. If the event E occurs in NE of them, then:

P(E) = NE/N    (2.3)

As amply demonstrated in the literature, the classic approach requires justifying the premise that the equally plausible events coincide with the equally probable events.
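The convergence of the relative frequency to the classical probability (2.3) can be illustrated with a small simulation; a sketch (the die experiment, the trial counts and the seed are arbitrary choices):

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Event E: "an even face shows" when rolling a fair six-sided die.
# Classical probability (2.3): N_E / N = 3 / 6 = 0.5
p_classical = 3 / 6

freqs = {}
for n in (100, 10_000, 1_000_000):
    # count successes over n independent trials
    n_E = sum(1 for _ in range(n) if random.randint(1, 6) % 2 == 0)
    freqs[n] = n_E / n       # relative frequency n_E / n, as in (2.1)
    print(n, freqs[n])       # approaches 0.5 as n grows (law of large numbers)
```

For small n the relative frequency can differ noticeably from 0.5; for the largest n it settles close to the classical value.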
2.3 The Random Variables

In order to study a random experiment, or better, the results of a random experiment, a single summarizing number can be very useful. In many random experiments the sample space is simply a description of the possible outcomes. In some cases it is very useful to associate a well-defined number with each outcome in the sample space [2]. The particular outcome of the experiment, and hence the resulting value of the considered variable, is not known a priori. Starting from this, the variable that associates the aforementioned number with the outcome of a random experiment is a random variable, as defined in the following.

Random variable: a function that assigns a real number to each outcome in the sample space of a random experiment.
In particular, the set of possible values of a random variable X is called the range of X. Once the experiment has ended, the measured value of the random variable, sometimes known as the outcome, is denoted by x. The measured value is often a real number; a random variable representing such a measurement is said to be a continuous random variable. In other types of experiments the measurements are limited to integers or to fractional numbers, two examples in which the measurements are limited to discrete points on the axis line. In this case the random variable is said to be a discrete random variable. The two following definitions are valid [2].

Continuous random variable: a random variable with an interval of real numbers for its range. Please note that the interval can be finite or infinite.

Discrete random variable: a random variable with a finite (or countably infinite) range.
2.4 Probability Distribution

Often, we are interested in evaluating the probability that a random variable assumes a particular value. The probability that a random variable X assumes a well-defined value x is described and evaluated with the probability distribution. For a discrete random variable the probability distribution is very simple to describe: in this case, in fact, the distribution is a simple table of the possible values, each associated with its probability. In this case the probability distribution can be drawn as depicted in Figure 2.1.

Fig. 2.1 Graphic representation of a discrete probability distribution, with Σ p(xi) = 1.
In other situations it is necessary or advisable to express the probability distribution by an equation. For example, in many types of experiments the quantity of interest can be represented by a continuous random variable. The range of possible values of such a random variable often spans an interval of real numbers: the range includes all values in an interval of real numbers and can thus be seen as a continuum [2].
When a continuous random variable is to be described, a probability density function f(x) can be used to describe the probability distribution of the variable X under consideration. The probability density function f(x) can be depicted as reported in Figure 2.2.
Fig. 2.2 Probability density function.
The probability that X is between a and b is evaluated as the integral of the probability density function f(x) from a to b. This probability is depicted in Figure 2.3.
Fig. 2.3 Probability evaluated from the area under probability density function f(x).
The area under the probability density function f(x) can thus be evaluated with the following formula:

P{a ≤ X ≤ b} = ∫_a^b f(x) dx    (2.4)

Now it is possible to give the following definitions.
Probability density function f(x): for a continuous random variable X, the probability density function is a simple description of the probabilities associated with the random variable, and it is a function such that:

f(x) ≥ 0    (2.5)

∫_−∞^+∞ f(x) dx = 1    (2.6)

P{a ≤ X ≤ b} = ∫_a^b f(x) dx    (2.7)
It is possible to describe the distribution of a continuous random variable in a second way, as shown in the following definition.

Cumulative distribution function: the cumulative distribution function of a continuous random variable X is:

F(x) = P(X ≤ x) = ∫_−∞^x f(ξ) dξ    (2.8)

for −∞ < x < ∞. F(x), also called simply the distribution function, gives the probability that the random variable will assume a value smaller than or equal to x. Moreover, F(x) is a non-decreasing function, with F(−∞) = 0 and F(+∞) = 1. Thus:

∫_−∞^+∞ f(ξ) dξ = 1    (2.9)
as reported in (2.6). The derivative of the cumulative distribution function is the probability density function of the random variable X:

f(x) = dF(x)/dx    (2.10)

as long as the derivative exists. This definition leads to the conclusion that the previous equation (2.7) can be rewritten as (see also Fig. 2.4):

P{a ≤ x ≤ b} = F(b) − F(a) = ∫_a^b f(x) dx    (2.11)

Equation (2.11) can also be derived by reasoning as follows:

P{x ≤ X ≤ x + Δx} = F(x + Δx) − F(x)    (2.12)
14
2 The Concept of “Statistical” Reliability
Fig. 2.4 Cumulative distribution function; example of the relationship between the cumulative distribution function F(t) and the probability density function f(t) for a continuous random variable.
and, finally:

f(x) = lim(Δx→0) P{x ≤ X ≤ x + Δx}/Δx    (2.13)
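Relations (2.6), (2.11) and (2.13) can be checked numerically for a concrete distribution; a sketch using the exponential density f(x) = e^(−x) for x ≥ 0, chosen here only as an example:

```python
import math

# Exponential example distribution: density f and cumulative distribution F
f = lambda x: math.exp(-x) if x >= 0 else 0.0
F = lambda x: 1 - math.exp(-x) if x >= 0 else 0.0

dx = 1e-3  # step for rectangle-rule integration and finite differences

# (2.6): the total area under f is 1 (the tail beyond x = 50 is negligible)
area = sum(f(i * dx) * dx for i in range(int(round(50 / dx))))
assert abs(area - 1) < 1e-2

# (2.11): P{a <= X <= b} = F(b) - F(a) = integral of f from a to b
a, b = 0.5, 2.0
integral = sum(f(a + i * dx) * dx for i in range(int(round((b - a) / dx))))
assert abs(integral - (F(b) - F(a))) < 1e-3

# (2.13): f(x) as the limit of (F(x + dx) - F(x)) / dx
x = 1.0
assert abs((F(x + dx) - F(x)) / dx - f(x)) < 1e-3

print("pdf/cdf relations verified")
```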
2.5 The Characteristics of Reliability

We start by recalling the definition of the reliability of a device or system as “the ability of an item to perform a required function under given conditions for a given time interval”. The preceding definition represents a type of reliability “specification”, for which it is necessary to define a “measurement” that allows a quantitative and comparative evaluation. One notes that the concept of “performing a required function” is complementary to that of a failure: the termination of the ability of an item to perform a required function (clause 3.2, IEC 60812:2006 [3]). Therefore, as with a failure, there is a life span associated with it (in probabilistic terms, the random variable “time to failure” or, more simply, failure time), and the quantitative evaluation of reliability is carried out by way of the evaluation of reliability as a “performance”: the MTTF (Mean Time To Failure), the failure rate λ and the MTBF (Mean Time Between Failures).
2.5.1 Reliability

For convenience, we indicate with tf the random variable failure time, defined in the interval [0, ∞). With a fixed subinterval [0, t] and the operating conditions
established for the device or system under examination, we can state that the item is reliable if it correctly carries out, in such an interval, the function or functions for which it has been designed. In probabilistic terms, we can write [4, 5]:

R(t) = P{tf > t}    (2.14)

In fact, the reliability function is a survival function and it is denoted by R(t). R(t) is the probability that the considered item (device, system and so on) will operate failure-free in [0, t]. It is worth observing that the reliability function, being a probability, is dimensionless. Knowing the probability density function of the failure time, we have:

R(t) = ∫_t^∞ f(t) dt    (2.15)
Assuming that the system can be found in only two states, the state of correct functioning or the state of failure, we can define the unreliability function as complementary to R(t), that is:

F(t) = 1 − R(t) = P{tf ≤ t}    (2.16)

where tf is the random variable. And also:

F(t) = ∫_0^t f(t) dt    (2.17)
It therefore appears evident that the device or system is unreliable when a condition of failure occurs in the considered interval [0, t]. From (2.10) and (2.16) it is possible to extract the failure probability density function as:

f(t) = dF(t)/dt = −dR(t)/dt    (2.18)

from which one determines the probability that the system will break down in the interval (t, t + dt). The relationship

∫_0^∞ f(t) dt = 1    (2.19)
expresses the concept according to which the device is destined to break down with the passage of time. For the item under consideration, we can identify two distinct situations. In the case of a non-repairable device, for example an incandescent lamp or a microprocessor, it is usual to define the Mean Time To Failure (MTTF). Such a one-item non-repairable system is characterized by the (cumulative) distribution function F(t) = P(tf ≤ t), where tf is the failure-free operating time of the item. Time in this case can be considered a continuous random variable. It should be noted that t > 0 and F(0) = 0. The reliability function R(t) is the
probability that no failure occurs in the interval [0, t]. The problem consists in evaluating the mean of the continuous random variable tf with density f(t):

E{tf} = ∫_−∞^+∞ t f(t) dt    (2.20)
if the integral converges absolutely. Time is a positive random variable, so the previous equation reduces to:

E{tf} = ∫_0^+∞ t f(t) dt    (2.21)
It follows that:

E{tf} = ∫_0^+∞ t f(t) dt = ∫_0^+∞ t (dF(t)/dt) dt = −∫_0^+∞ t (dR(t)/dt) dt = −∫_0^+∞ t dR(t)    (2.22)
and, finally, integrating by parts (since t·R(t) → 0 as t → ∞):

E{tf} = −∫_0^+∞ t dR(t) = ∫_0^+∞ R(t) dt    (2.23)
The mean or expected value of the time to failure is denoted Mean Time To Failure (MTTF) and is thus given by:

MTTF = E{tf} = ∫_0^+∞ R(t) dt = ∫_0^+∞ [1 − F(t)] dt    (2.24)
From this we clearly deduce that the mean time to failure of the device under consideration represents the area under the reliability function. Considering instead a system for which, following a failure, correct functioning can be restored, the MTBF (Mean Time Between Failures) can be defined. It is easy to demonstrate that, indicating the reliability of the system with RS(t), the value of the MTBF can be determined as:

MTBF = ∫_0^∞ RS(t) dt    (2.25)

Obviously, both the Mean Time To Failure (MTTF) and the Mean Time Between Failures (MTBF) are expressed in hours.
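Equation (2.24), the MTTF as the area under R(t), can be illustrated numerically; a sketch assuming an exponential reliability function R(t) = e^(−λt), for which the area is exactly 1/λ (the value of λ below is an arbitrary example):

```python
import math

lam = 1e-3                      # failure rate in hours^-1 (arbitrary example value)
R = lambda t: math.exp(-lam * t)   # exponential reliability function

# MTTF = integral of R(t) dt from 0 to infinity (2.24),
# approximated with a rectangle rule; the tail beyond 20/lam is negligible.
dt = 1.0                        # integration step, hours
mttf = sum(R(i * dt) * dt for i in range(int(round(20 / lam))))

print(mttf)                     # close to 1/lam = 1000 hours
assert abs(mttf - 1 / lam) < 1.0
```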
2.5.2 Failure Rate

Less intuitive, but easily deduced, is the expression of the failure rate λ(t). We consider the event “failure of the device in the interval [t, t+dt]”. This event, however, is conditioned by the fact that the system did not fail before time t. Conditional probability is treated in a specific way by probability theory: given an event B with P(B) > 0, the probability of event A conditioned on event B is defined as the ratio of the probability of the joint event (A ∩ B) to the probability of event B:

P(A | B) = P(A ∩ B)/P(B)    (2.26)
The conditional probability of a failure can therefore be expressed as (f = failure):

P{t < tf < t + dt | tf > t} = P{t < tf < t + dt}/P{tf > t}    (2.27)
The failure rate λ(t) of an item can then be defined as:

λ(t) = lim(dt→0) (1/dt) P{t < tf < t + dt | tf > t}    (2.28)
thus:

λ(t) = lim(dt→0) (1/dt) · P(t < tf < t + dt)/P(tf > t)    (2.29)

Recalling now that:

P(tf ≤ t) = F(t)  →  P(tf > t) = 1 − F(t)    (2.30)

and, if F(t) is differentiable:

λ(t) = lim(dt→0) [P(t < tf < t + dt)/dt] · 1/(1 − F(t)) = f(t)/(1 − F(t)) = f(t)/R(t)    (2.31)
with t > 0, F(0) = 0 and R(0) = 1. The ratio between the conditional probability that an item breaks down in the interval [t, t+dt] and the duration dt of the interval is defined as the “instantaneous failure rate”:

λ(t) = f(t)/R(t)    (2.32)
expressed in hours⁻¹. Equation (2.32) can furthermore be expressed as:

λ(t) = −(1/R(t)) · dR(t)/dt = −(d/dt) log R(t)    (2.33)
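The equivalence of definitions (2.32) and (2.33) can be checked for a concrete reliability law; a sketch using a Weibull reliability function, whose hazard rate has a known closed form (the shape and scale parameters below are arbitrary example values):

```python
import math

beta, eta = 2.0, 1000.0   # Weibull shape and scale (arbitrary example values, hours)

R = lambda t: math.exp(-((t / eta) ** beta))                  # reliability function
f = lambda t: (beta / eta) * (t / eta) ** (beta - 1) * R(t)   # failure density
lam_closed = lambda t: (beta / eta) * (t / eta) ** (beta - 1) # known closed-form hazard

for t in (100.0, 500.0, 2000.0):
    lam_232 = f(t) / R(t)                                     # definition (2.32)
    dt = 1e-3
    # definition (2.33): minus the derivative of log R(t), via a forward difference
    lam_233 = -(math.log(R(t + dt)) - math.log(R(t))) / dt
    assert abs(lam_232 - lam_closed(t)) < 1e-12
    assert abs(lam_233 - lam_232) < 1e-6

print("failure-rate definitions (2.32) and (2.33) agree")
```

With beta > 1 the rate increases with t, a wear-out behaviour; with beta = 1 the Weibull law reduces to the exponential case of constant λ.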
2.6 The Frequency Approach

Returning to the concept of probability based on relative frequency (expressed by (2.1)) and to the relation (2.4), we can define an analysis tool, the experimental histogram of relative frequency:

f(x)Δx = n(x)/n    (2.34)

From a practical point of view, this means repeating the experiment n times and counting the number of trials n(x) for which the relationship x ≤ X ≤ x + Δx holds. Thinking in histogram terms, Δx represents the width of the classes which make up the rectangles, whose height is given by:

f(x) = n(x)/(n·Δx)    (2.35)
We will try to arrive at a definition of reliability starting from the “empirical” definition, extracted from the analysis of failure data. We shall consider n identical and statistically independent items that are put into operation under the same conditions at time t = 0. nh(t) indicates the subset of the n elements that have not yet failed at a generic instant of time t (the items still properly working at time t). With a fixed interval Δt = tn − tn−1, we consider the ratio between nh(t) and the total number of devices:

RN(t) = nh(t)/n    (2.36)
Since the definition of probability based on the concept of relative frequency (2.1) determines the probability of an event as the ratio between the number of times a certain event A takes place and the total number of experiments, the function expressed by the ratio RN(t) expresses a probability that we will call empirical reliability. By the law of large numbers (considering n a large number of devices), this ratio therefore approximates the function R(t) defined in paragraph 2.5.1. The extension to the distribution function of the random variable tf, which in empirical terms we will indicate with FN(t), immediately becomes:

FN(t) = 1 − RN(t) = 1 − nh(t)/n = (n − nh(t))/n = nf(t)/n    (2.37)

where nf(t) indicates the number of elements that have failed by time t, considering that nh(t) + nf(t) = n (where nh are the healthy items).
The course of failures reported in Figure 2.5 furthermore suggests the definition of an experimental histogram of relative frequency where Δt (the interval between one breakdown and the next) represents the width of the classes, whose height is given by:

fN(t) = [nf(t + Δt) − nf(t)]/(n·Δt)    (2.38)
From (2.37) and (2.38) we have:

fN(t) = [FN(ti + Δti) − FN(ti)]/Δti,  with ti ≤ t ≤ ti + Δti    (2.39)

Fig. 2.5 Course of failures.
Indicating with t1, t2, ..., tn the times to failure observed for the n elements under consideration (Figure 2.5), it is possible to define the empirical mean time to failure (MTTFN) as:

MTTFN = (t1 + t2 + ... + tn)/n    (2.40)
Having defined in (2.32) the “instantaneous failure rate” as the ratio between the probability of the event and the observation interval, we have, in empirical terms, the possibility of expressing the failure rate as the ratio between the elements that have broken down in the interval (t, t+Δt] and the number of elements nh(t) functioning at time t, that is:

λN(t) = [nf(t + Δt) − nf(t)]/(nh(t)·Δt) = fN(t)·(n/nh(t)) = fN(t)/RN(t)    (2.41)
It is evident that the failure rate has the dimensions of the reciprocal of a time and is usually expressed in hours⁻¹.
Example 1

Considering n = 172 elements in the trial, the failure data reported in Table 2.1 were obtained.

Table 2.1 Failure data of applicatory example 1.

Time interval (h) | Failures found at end of interval
0 – 1000          | 59
1000 – 2000       | 24
2000 – 3000       | 29
3000 – 4000       | 30
4000 – 5000       | 17
5000 – 6000       | 13
Total             | 172
The evaluation of the Reliability function RE(t) is performed as reported in Table 2.2. The functions of empirical reliability and probability density, evaluated on the basis of equations (2.36) and (2.39), show the trends reported in Figures 2.6 and 2.7 (Tables 2.3 and 2.4 report the results).

Table 2.2 Table for RE(t) evaluation.

t (h)    nh     RE(t)
0        172    1
1000     113    0.657
2000     89     0.517
3000     60     0.349
4000     30     0.174
5000     13     0.076
6000     0      0
Fig. 2.6 Trend of empirical function RE(t) relative to the data reported in applicatory example 1.
The failure rate will show a trend as seen in Figure 2.8 (Table 2.5 shows the results).

Fig. 2.7 Empirical density function from data in Example 1.
Table 2.3 Evaluation of empirical functions of reliability and distribution of times to failure.

t (h)    RE
0        1.0
1000     (172 − 59)/172 = 0.657
2000     0.517
3000     0.349
4000     0.174
5000     0.076
6000     0
Table 2.4 Empirical density function.

Time interval (h)   RE      FE(t+Δt)   FE(t+Δt) − FE(t)   fE [10⁻³ h⁻¹]
0 – 1000            0.657   0.343      0.343              0.343
1001 – 2000         0.517   0.483      0.140              0.140
2001 – 3000         0.349   0.651      0.169              0.169
3001 – 4000         0.174   0.826      0.174              0.174
4001 – 5000         0.076   0.924      0.099              0.099
5001 – 6000         0.000   1.000      0.076              0.076
Table 2.5 Failure rate.

Time interval (h)   fE [10⁻³ h⁻¹]   RE(t)   λE [10⁻³ h⁻¹]
0 – 1000            0.343           1.000   0.343
1001 – 2000         0.140           0.657   0.212
2001 – 3000         0.169           0.517   0.326
3001 – 4000         0.174           0.349   0.500
4001 – 5000         0.099           0.174   0.567
5001 – 6000         0.076           0.076   1.000
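The evaluations of Tables 2.2 through 2.5 can be reproduced in a few lines. The following sketch (not part of the original text; variable names are illustrative) starts from the raw failure counts of Table 2.1 and applies Eqs. (2.36), (2.38) and (2.41):

```python
# Reproducing the empirical functions of Example 1 from the raw failure counts.
n = 172                               # items on trial
failures = [59, 24, 29, 30, 17, 13]   # failures per 1000 h interval (Table 2.1)
dt = 1000.0                           # class width in hours

R_E = [1.0]                           # empirical reliability at 0, 1000, ..., 6000 h
survivors = n
for nf in failures:
    survivors -= nf
    R_E.append(survivors / n)

# Empirical density (2.38) and failure rate (2.41), both per hour;
# the failure rate divides by the reliability at the START of each interval.
f_E = [nf / (n * dt) for nf in failures]
lam_E = [f / R for f, R in zip(f_E, R_E[:-1])]
```

Rounding to three decimals recovers the columns of Tables 2.2, 2.4 and 2.5.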
Fig. 2.8 Trend of failure rate.
2.7 Models of Failure Rate

The most widespread and widely known model of the failure rate is the so-called “bathtub” curve. This failure rate is often exhibited when a large population of statistically identical and independent items is considered. The model arises from the actuarial tables (relative to human mortality) of the 1800s and is used in the insurance field. In Figure 2.9 a qualitative drawing of the bathtub curve is depicted. As shown, three phases characterized by different trends can be identified:
Fig. 2.9 The “bathtub” curve (qualitative).
1. The phase immediately following the start of the life cycle of the device is characterized by a high failure rate that decreases rapidly over time. This phase is called the “infant mortality” or “early failure” phase. This course derives from the existence of a “weak” fraction of the population whose defects cause a failure within a short period of time. It should be mentioned that the failure rate does not always decrease as depicted in the figure; it may also oscillate.
2. A phase with an approximately constant failure rate, whose value is determined above all by the level of stress the system is subjected to. Failures are Poisson distributed and often sudden.
3. A third phase, known as wear-out, characterized by long time intervals and a rapidly increasing failure rate. This is the period of “wear-out” failures of the devices. The failures are attributable to wear, aging and fatigue.

Phase 2, where the failure rate is more or less constant, is of particular interest in the evaluation of reliability performance.
Fig. 2.10 Trends assumed by the Reliability (a) and Probability Density (b) functions in the case where λ = constant.
Recalling the expression (2.33)

λ(t) = − (1/R(t)) · dR(t)/dt = − d/dt log(R(t))

and considering as an initial condition that reliability at time 0 is at its maximum, equal to 1, we have:

R(t) = e^(−∫₀ᵗ λ(τ)dτ)    (2.42)
Recalling the bathtub curve and assuming operation in phase 2 (a valid hypothesis above all for electric and electronic equipment), (2.42) is notably simplified, becoming:

R(t) = e^(−λt)    (2.43)
Equations (2.32) and (2.43) lead to the definitions of the Reliability and Probability Density functions as seen in Figures 2.10 (a) and (b). Based on the preceding, it therefore follows that:

MTTF = ∫₀^∞ t · f(t) dt = ∫₀^∞ R(t) dt = 1/λ    (2.44)
Example 2

Figure 2.11 shows the plot of reliability under a constant failure rate for two devices with different values of λ. For example, at time t = 2·10³ h, device 1, for which λ1 = 0.25·10⁻³ h⁻¹, has a probability of functioning correctly of about 61%, while the second device, with a failure rate λ2 = 0.5·10⁻³ h⁻¹, has a value of about 37%. For a selected reliability, for example 0.25, the graph shows that the interval of correct functioning of the first device is double that of the second. The information obtained from analyzing the two curves is extremely important when undertaking preventive measures for the reliability performance of the apparatus, both in terms of optimizing the conditions of use and of the appropriate selection of components to guarantee the required functions of the system, as well as for design revisions. As will be subsequently illustrated, the calculation of the failure rate can be carried out by means of specific handbooks. The exponential law represents, for its simplicity, the most widely used model in the study of reliability. However, it is not the only model that can be used with experimental data to describe failure events. In fact, it will be shown that, for the unreliability function of particular systems, other laws are more appropriate.
Fig. 2.11 Comparison of two devices with different failure rates (panels (a) and (b)).
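The comparison of Example 2 is easy to verify directly. This is a minimal sketch, assuming the λ values quoted in the text; function names are illustrative:

```python
import math

lam1, lam2 = 0.25e-3, 0.5e-3   # failure rates in h^-1 (from Example 2)

def R(lam, t):
    """Reliability under a constant failure rate, Eq. (2.43)."""
    return math.exp(-lam * t)

def t_at(lam, R_target):
    """Time at which the reliability drops to R_target (inverting Eq. 2.43)."""
    return -math.log(R_target) / lam

r1, r2 = R(lam1, 2e3), R(lam2, 2e3)           # ~0.61 and ~0.37
t1, t2 = t_at(lam1, 0.25), t_at(lam2, 0.25)   # t1 is twice t2, since lam2 = 2*lam1
```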
2.8 Other Laws

In this section the following distributions will be presented:

• Exponential law
• Log-Normal distribution
• Weibull distribution

Other distributions can be considered, such as the Gamma, Muth, Uniform, Log-Logistic, Inverse Gaussian, Exponential Power and Pareto distributions. However, these distributions will not be considered in this book.
For the aforementioned distributions the following characteristic functions will be analyzed:

• reliability,
• (failure) probability density function,
• failure rate,
• MTTF.
Exponential Law

This distribution is characterized by a single positive scale parameter λ, which represents the failure rate, i.e. the number of failures per unit time. For this type of distribution the characteristic functions are:

f(t) = λ e^(−λt),  λ > 0, t ≥ 0    (2.45)

R(t) = e^(−λt)    (2.46)

MTTF = 1/λ    (2.47)

σ² = 1/λ²    (2.48)
The functions are plotted in Figures 2.12, 2.13, 2.14 and 2.15.

Fig. 2.12 Exponential law: plot of the (failure) probability density function, for λ = 1, 1.5 and 2.
Fig. 2.13 Exponential law: plot of the Reliability function, for λ = 1, 1.5 and 2.
Fig. 2.14 Exponential law: plot of the failure rate, for λ = 1, 1.5 and 2.
Fig. 2.15 Exponential law: plot of the MTTF, for λ = 1, 1.5 and 2.
The relation MTTF = 1/λ is valid only for the exponential distribution. The probability that the system is still functioning at the instant t = MTTF is

R(MTTF) = 1/e ≅ 0.37    (2.49)
The exponential model describes well the performance of elements after infant mortality, in that the reliability over an interval depends only on the duration of the interval itself and not on its starting moment. In fact, applying the definition of conditional probability, we have

R(s + t | s) = e^(−λ(t+s)) / e^(−λs) = e^(−λt)    (2.50)

This law is often used to model the lifetime of electronic components. Furthermore, the exponential law is appropriate when a used device that has not failed is statistically as good as a new device. Many properties can be recalled for this distribution; the most important is the memoryless property.

Memoryless property: if T ~ exponential(λ), then

P[T ≥ t] = P[T ≥ t + s | T ≥ s],  t ≥ 0, s ≥ 0
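The memoryless property lends itself to a quick Monte Carlo check. The sketch below (illustrative; t, s and the sample size are arbitrary choices, not from the text) compares the conditional and unconditional survival probabilities:

```python
import math
import random

# For T ~ exponential(lam), P[T >= t + s | T >= s] should match P[T >= t].
random.seed(1)
lam, t, s, N = 1.0, 0.7, 1.3, 200_000
samples = [random.expovariate(lam) for _ in range(N)]

p_uncond = sum(x >= t for x in samples) / N
survivors_s = [x for x in samples if x >= s]
p_cond = sum(x >= t + s for x in survivors_s) / len(survivors_s)
# both quantities estimate e^(-lam*t), about 0.497 here
```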
It should be noted that the exponential distribution is the only continuous distribution with the memoryless property.

Log-normal Distribution
The Log-normal characteristic functions are:

f(t) = [1 / (σ t √(2π))] · e^(−(1/2)·((ln t − μ)/σ)²)    (2.51)

R(t) = 1 − ∫₀ᵗ f(x) dx    (2.52)

MTTF = e^(μ + σ²/2)    (2.53)
The Log-normal distribution is used for processing data derived from accelerated life tests, in particular for semiconductor devices. The parameters of the Log-normal law are μ (mean value) and σ (standard deviation of the logarithm of the time to failure). Since the function λ(t) decreases for long times, this distribution is also generally used to describe repair events, in that the lack of spare parts can induce repair times far longer than the average. It is possible to demonstrate that the probability that the system is still functioning at the instant t = MTTF depends only on σ. In fact, with the substitution

y = (ln t − μ)/σ    (2.54)

we have:

R(MTTF) = (1/√(2π)) ∫_(σ/2)^∞ e^(−y²/2) dy    (2.55)
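The claim that R(MTTF) depends only on σ can be checked numerically. In the sketch below (parameter values are illustrative), the integral (2.55) is written in closed form via the complementary error function, and a Monte Carlo estimate with an arbitrary μ confirms that μ drops out:

```python
import math
import random

def R_at_MTTF(sigma):
    # Eq. (2.55): P[Z > sigma/2] for a standard normal Z,
    # expressed with the complementary error function.
    return 0.5 * math.erfc(sigma / (2 * math.sqrt(2)))

# Monte Carlo cross-check: the fraction of log-normal lifetimes
# exceeding the MTTF of Eq. (2.53) must not depend on mu.
random.seed(2)
mu, sigma, N = 3.0, 0.8, 200_000
mttf = math.exp(mu + sigma**2 / 2)            # Eq. (2.53)
frac = sum(random.lognormvariate(mu, sigma) > mttf for _ in range(N)) / N
```

For σ → 0 the value tends to 0.5, and it decreases as σ grows (about 0.345 for σ = 0.8).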
Weibull Distribution

The exponential law is often not applicable: its use is limited by the memoryless property described above, and the underlying assumption of a constant failure rate is often too restrictive and/or inappropriate. The Weibull distribution overcomes these problems. The Weibull distribution can be used as a model to describe infant mortality, and it is a function of three parameters: γ (minimum life, i.e. the period within which failures do not occur), θ (scale parameter, intended as characteristic life) and b (a shape parameter). If γ = 0, we can write the following relations:

f(t) = (b/θ)·(t/θ)^(b−1)·e^(−(t/θ)^b),  b, θ > 0, t ≥ 0    (2.56)

R(t) = e^(−(t/θ)^b)    (2.57)

λ(t) = (b/θ)·(t/θ)^(b−1)    (2.58)

Such a distribution is particularly significant for studying the reliability of systems since it allows the description of failure events for systems characterized by a failure rate that varies over time. In fact, for b > 1 (b < 1), the Weibull distribution describes a system with an increasing (decreasing) failure rate. For b = 1, the Weibull distribution coincides with the exponential law; in such circumstances the failure rate is constant (θ = 1/λ). The characteristic functions of the Weibull distribution are plotted in the following figures.
Fig. 2.16 Weibull law: plot of the failure probability density function for θ = 1 and three different values of the parameter b.

Fig. 2.17 Weibull law: plot of the reliability function for θ = 1 and three different values of the parameter b.

Fig. 2.18 Weibull law: plot of the failure rate function for θ = 1 and three different values of the parameter b. When b = 1 we have the exponential case with constant failure rate.
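Equations (2.56) through (2.58) can be sketched as follows (illustrative code, with γ = 0 and arbitrary parameter values); the key checks are that b = 1 reduces to the exponential law with λ = 1/θ, and that b > 1 gives an increasing failure rate:

```python
import math

def weibull_R(t, b, theta):
    """Reliability, Eq. (2.57), with minimum life gamma = 0."""
    return math.exp(-((t / theta) ** b))

def weibull_hazard(t, b, theta):
    """Failure rate, Eq. (2.58)."""
    return (b / theta) * (t / theta) ** (b - 1)

# b = 1 must reduce to the exponential law with lam = 1/theta
theta = 2.0
r_w = weibull_R(1.5, 1.0, theta)
r_exp = math.exp(-1.5 / theta)

# b > 1: increasing failure rate (wear-out region of the bathtub curve)
h_early = weibull_hazard(0.5, 3.0, 1.0)
h_late = weibull_hazard(2.0, 3.0, 1.0)
```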
References

[1] IEC 60050-191 ed1.0, International Electrotechnical Vocabulary – Chapter 191: Dependability and quality of service. IEC International Electrotechnical Commission, Geneva (December 31, 1990). Forecast publication date for Ed. 2.0: 2012-06-02
[2] Montgomery, D.C., Runger, G.C.: Applied Statistics and Probability for Engineers, 2nd edn. John Wiley & Sons, Chichester. ISBN 0-471-17027-5
[3] IEC 60812:2006 – Analysis techniques for system reliability – Procedure for failure mode and effects analysis (FMEA)
[4] Leemis, L.M.: Reliability: Probabilistic Models and Statistical Methods, 2nd edn. ISBN 978-0-692-00027-4
[5] Birolini, A.: Reliability Engineering – Theory and Practice. Springer, Heidelberg. ISBN 978-3-642-14951-1
Chapter 3
Reliability Analysis in the Design Phase
Abstract. In this chapter the techniques used to describe the performance of devices in a system will be considered. To this aim it is important to consider the system as a combination of elementary devices that follow a well-defined functional structure. After a brief introduction, the reliability evaluation of series, parallel and mixed structures will be shown and discussed. To this aim, the concept of the Reliability Block Diagram is also defined as an essential tool. The theory is developed using many practical examples. The parallel configuration is further developed in order to discuss the different types of redundancy for reliability growth: active, warm and stand-by. At the end of this chapter two different redundancy approaches are compared: system redundancy and component redundancy. The results so obtained are fundamental during the design phase of a system, when reliability aspects have to be taken into account.
3.1 Introduction

In general terms, we can consider a system as a set of elements, subsystems or components, connected among themselves in order to guarantee one or more functional performances. The reliability, and therefore the availability, of such a system depends on the reliability and availability characteristics of the elements which make up the system, and on their interconnections. The study of the relationships between the connections of the subsystems is called Combinatory Analysis and can be visualized in a diagram denoted as the Reliability Block Diagram (RBD). In this chapter we will discuss some of the most common functional configurations, denoted as canonic configurations, whose combinations give rise to mixed configurations. For each functional configuration, not to be confused with the corresponding electric configuration, it will be possible to determine mathematical models of reliability and, consequently, the value of the Mean Time To Failure of the system (MTTFS). As will become more evident, the MTTFS can be determined through a combination of the failure rates of the elements constituting the system.
3.2 Reliability Evaluation of Series, Parallel and Mixed Structures

3.2.1 The Series Functional Configuration

The series functional configuration, whose block diagram is shown in Figure 3.1, represents the simplest and most common reliability model in certain contexts, e.g. in the field of electronics. Considering the system S, composed of n elements Ei, for i = 1, …, n, we say that the system is operative if and only if all the elements Ei are functioning correctly.
Fig. 3.1 Block diagram of reliability for the series functional configuration.
In the simplified hypothesis of independent events, for which we can assume that the performance of every element Ei, in terms of correct functioning or failure, does not depend on the condition assumed by the other elements, the reliability of the system corresponds to the product of the reliabilities of the single blocks, that is:

RS(t) = ∏_(i=1)^n Ri(t)    (3.1)
Assuming the condition of random failure and indicating with λi the constant failure rate associated with the generic element Ei, for which we assume Ri(t) = e^(−λi t), equation (3.1) becomes:

RS(t) = ∏_(i=1)^n Ri(t) = e^(−(∑_(i=1)^n λi)·t) = e^(−λS t)    (3.2)
In the assumed hypotheses, Eq. (3.2) demonstrates an important property of the series functional configuration, according to which the failure rate of the system λS can be determined as the sum of the failure rates of the constituent elements λi, in h⁻¹, that is:

λS = ∑_(i=1)^n λi    (3.3)
Consequently the Mean Time To Failure of the system, in hours, is:

MTTFS = 1/λS = 1 / ∑_(i=1)^n λi    (3.4)
It is therefore sufficient to know the failure rate of each element to determine the value of the MTTFS. For electronic equipment this is called “reliability prediction” and can be carried out by means of particular handbooks, as described in Chapter 5. From the analysis of Eq. (3.2), fundamental considerations can be made for the series configuration:

1. Since reliability is a probability, i.e. a number between 0 and 1 at any fixed time, the system reliability cannot exceed the smallest reliability value among the constituent elements, that is:

RS(t) ≤ min_i {Ri(t)},  i = 1, …, n    (3.5)
2. The probability of the system functioning correctly decreases with an increasing number of constituent elements.

Example 1

To justify Eq. (3.5), we consider a system composed of three elements E1, E2, E3, whose RBD is reported in Figure 3.2.
Fig. 3.2 RBD for a system composed of three elements in a series configuration.
If the values of reliability of each element, at a generic time t, are 0.4, 0.7 and 0.9 respectively, the probability of the system functioning at the same time t is equal to 0.252. Though elementary, this example allows us to make the following practical considerations. First of all, the presence of an intrinsically weak element in the series configuration has a strong negative effect on the system reliability: even improving the performance of the other two elements, the probability of proper functioning of the system remains below 40%. In addition, the probability of the system functioning correctly decreases with an increasing number of constituent elements.
Fig. 3.3 Reliability plot of a system with three elements in series configuration.
Example 2

Figure 3.3 shows the reliability plots for three elements with constant failure rates λ1 < λ2 < λ3. The lower plot, relative to the series, clearly shows how the high failure rate of the third element negatively influences the total reliability which, assuming a value of 1 at time zero, decreases exponentially as a function of the failure rate λS = λ1 + λ2 + λ3.

Example 3

Similar considerations can be made examining the values reported in Table 3.1. Even assuming high values of reliability for a single element, it appears evident that the reliability of a system, at a fixed time t, decreases with the increase in the number of elements which make up the system. If we consider, for example, an RBD with 20 elements in series functional configuration, which for simplicity we consider identical, the probability that the system is functioning correctly at time t will be over 65% only if the reliability of the single element is at least 0.98.

Table 3.1 Influence of the reliability values of the elements on the system reliability.

Number of elements   Element reliability
                     0.8         0.85        0.9         0.95      0.98      0.99
1                    0.8         0.85        0.9         0.95      0.98      0.99
5                    0.32768     0.44370     0.59049     0.77378   0.90392   0.95099
10                   0.10737     0.19687     0.34868     0.59874   0.81707   0.90438
20                   0.01153     0.03876     0.12158     0.35849   0.66761   0.81791
50                   1.47·10⁻⁵   2.96·10⁻⁴   5.15·10⁻³   0.07694   0.36417   0.60501
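The entries of Table 3.1 follow directly from Eq. (3.1) with identical blocks, and can be regenerated with a one-line function (illustrative sketch, not from the text):

```python
# Reliability of n identical elements in series: R**n, Eq. (3.1) with equal blocks.
def series_reliability(r_element, n):
    return r_element ** n

# Regenerate, e.g., the row of Table 3.1 for n = 20
row_20 = [round(series_reliability(r, 20), 5)
          for r in (0.8, 0.85, 0.9, 0.95, 0.98, 0.99)]
```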
3.2.2 The Concept of Redundancy: Parallel Functional Configuration

The parallel functional configuration, also called redundant configuration (or active redundancy), assumes a determining role every time it is necessary to increase the reliability of a system. The RBD for such a configuration is shown in Figure 3.4. The system is operative even if only one component allocated in parallel is operative. Vice versa, the system is not functioning when all the elements are faulty. Considering this last definition, and recalling the hypotheses made for the series configuration according to which the events are independent and with constant failure rates, the unreliability of the system corresponds to the product of the unreliabilities of the constituent elements, that is:

FS(t) = ∏_(i=1)^n Fi(t)    (3.6)

from which we extract the reliability of the system as:

RS(t) = 1 − FS(t) = 1 − ∏_(i=1)^n Fi(t) = 1 − ∏_(i=1)^n (1 − e^(−λi t))    (3.7)

and consequently, the mean time to failure:

MTTFS = ∫₀^(+∞) RS(t) dt    (3.8)

Fig. 3.4 RBD for the parallel configuration.
Considering the example of a system with two independent elements connected in parallel configuration and assuming constant failure rates equal to λ1 and λ2, from Eq. (3.7) we obtain the following expression for the reliability:

RS(t) = e^(−λ1 t) + e^(−λ2 t) − e^(−(λ1+λ2)·t)    (3.9)

and recalling Eq. (3.8):

MTTFS = 1/λ1 + 1/λ2 − 1/(λ1 + λ2)    (3.10)
In the simplified hypothesis according to which the two elements in redundancy are identical:

MTTFS = 3/(2λ)    (3.11)

from which we clearly deduce an increase of 50% of the MTTFS with respect to the case of a single element characterized by the same failure rate. Such a concept is the basis of the allocation of redundancy as the methodology to increase the reliability of a system. Finally, it is necessary to evaluate the failure rate for the allocation of redundancy. This will be obtained considering a two-component parallel block with identical constant failure rate λ. Recalling (2.33) we obtain:

λS(t) = − (1/R(t)) · dR(t)/dt = − [1/(2e^(−λt) − e^(−2λt))] · d/dt {2e^(−λt) − e^(−2λt)}
      = (2λe^(−λt) − 2λe^(−2λt)) / (2e^(−λt) − e^(−2λt))
      = 2λ · e^(−λt)·(1 − e^(−λt)) / [e^(−λt)·(2 − e^(−λt))]
      = 2λ · (1 − e^(−λt)) / (2 − e^(−λt))

Similar considerations can be deduced for the more general case of many blocks connected in parallel with different failure rates. It should be noted that the failure rate of the parallel block is time dependent even if the failure rates of the single blocks are time independent. For the parallel functional configuration it is possible to deduce the following considerations.

1. At a fixed time, the total reliability of the system is superior to the highest reliability value among the constituent elements, for which we can write:

RS(t) ≥ max_i {Ri(t)},  i = 1, …, n    (3.12)
2. The probability of the system functioning increases with the increase in the number of constituent elements.

Example 4

This property can be proved considering a system made up of three elements E1, E2, E3, whose RBD is shown in Figure 3.5. If the values of reliability of each element, at the generic time t, are respectively 0.4, 0.7 and 0.9, the probability of the system functioning at the same time becomes 0.982.

Fig. 3.5 RBD for a system made up of three elements in parallel configuration.
Example 5

The reliability results obtained by connecting up to six elements in parallel configuration are reported in Table 3.2 and in Figure 3.6. For simplicity, we assume, also in this case, identical values of reliability, equal to 0.8 at time t. From the results of the table, we observe that with two elements in parallel we obtain an increase in reliability equal to 20% with respect to the configuration with a single element. This increase remains positive as the number of elements in active redundancy grows but, as was logical to expect, it becomes smaller and smaller, eventually not such as to justify the cost of a design review. The proposed example demonstrates the importance of the parallel configuration as a technique for increasing the system reliability. A further application, reported below, is an example of active redundancy allocation for a series configuration. It is important to remember that the active redundancy discussed in this paragraph must not be confused with stand-by redundancy. In general terms, this last configuration, which will not be discussed here, foresees the functioning of a system A and the activation of a system B whenever A assumes a failure state. A diagnostic block D monitors the correct functioning of A and causes the activation of B when necessary. It is evident that for this configuration the reliability of the whole system depends on the reliability of blocks A, B and D according to a relation of conditioned probability.

Table 3.2 Increase in reliability for a configuration in active redundancy.

Number of elements   System reliability   Increase of reliability (a)   Increase of reliability (b)
1                    0.800000             ---                           ---
2                    0.960000             0.160000                      20.00 %
3                    0.992000             0.032000                      24.00 %
4                    0.998400             0.006400                      24.80 %
5                    0.999680             0.001280                      24.96 %
6                    0.999936             0.000256                      24.99 %

(a) with respect to the configuration of the preceding step; (b) with respect to the initial configuration with a single element.
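Table 3.2 can likewise be regenerated from Eq. (3.7) with identical blocks (illustrative sketch):

```python
# Reliability of n identical elements in active redundancy (parallel), Eq. (3.7).
def parallel_reliability(r_element, n):
    return 1 - (1 - r_element) ** n

rows = [parallel_reliability(0.8, n) for n in range(1, 7)]
# Column (b) of Table 3.2: relative gain over the single-element configuration
gain_vs_single = [(r - rows[0]) / rows[0] for r in rows]
```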
Fig. 3.6 Plot of the increase in reliability for a configuration in active redundancy: (a) system reliability and increase of reliability with respect to the configuration of the preceding step, (b) percentage increase of reliability with respect to the initial configuration with a single element.
Example 6

Consider the RBD for a series configuration (Figure 3.7.a) in which the element Ei has been identified a priori as the one with the highest failure rate. A possible solution for increasing the reliability of the system is the insertion of an active redundancy as reported in Figure 3.7.b. Such a configuration usually takes the name of “allocation of redundancy”. Considering equations (3.1) and (3.7) we can evaluate the expression for the system reliability as:

RS(t) = (2 − Ri(t)) · Rseries(t)    (3.13)

Recalling Example 1 and applying the preceding relation with the active redundancy positioned on the element E1, with a reliability value of 0.4, we obtain a probability of functioning of the entire system equal to 0.4032, with an increase of 60% with respect to the initial value of 0.252.

Fig. 3.7 RBD: (a) series functional configuration; (b) allocation of redundancy.
Example 7

Figure 3.8 compares the configurations with two equal and independent elements functioning in series and in parallel, together with the reliability plot of a single element having the same constant failure rate:

a) two elements in parallel:  RS(t) = 2e^(−λt) − e^(−2λt);  MTTFS = 3/(2λ)
b) single element:            RS(t) = e^(−λt);              MTTFS = 1/λ
c) two elements in series:    RS(t) = e^(−2λt);             MTTFS = 1/(2λ)

Fig. 3.8 Comparison among basic configurations with the same failure rate.
The series and parallel configurations can be opportunely combined, giving rise to the so-called mixed configurations. For these, assuming the same hypotheses and recalling Eqs. (3.1) and (3.7), both the plots of reliability vs. time and the value of MTTFS can be immediately determined.

Example 8

A system consisting of seven subsystems, each characterized by a different constant failure rate, is represented by the RBD in Figure 3.9. We can observe that this is a combination of series and parallel configurations: the elements E1, E2, E3 of the upper path are in series among themselves and in parallel with the series E4, E5 of the lower branch; everything is in series with the elements E6, E7. We obtain therefore:

R(t) = e^(−(λ1+λ2+λ3+λ6+λ7)·t) + e^(−(λ4+λ5+λ6+λ7)·t) − e^(−(λ1+λ2+λ3+λ4+λ5+λ6+λ7)·t)    (3.14)

and consequently:

MTTFS = 1/(λ1+λ2+λ3+λ6+λ7) + 1/(λ4+λ5+λ6+λ7) − 1/(λ1+λ2+λ3+λ4+λ5+λ6+λ7)

Fig. 3.9 Mixed configuration.
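The closed-form MTTFS of Example 8 can be cross-checked by numerical integration of R(t), following Eq. (3.8). The sketch below uses illustrative λ values (not from the text) and a simple trapezoidal rule:

```python
import math

# Illustrative failure rates (h^-1) for the seven elements of Fig. 3.9
lam = {1: 1e-4, 2: 2e-4, 3: 1e-4, 4: 3e-4, 5: 2e-4, 6: 1e-4, 7: 1e-4}

def R(t):
    # Eq. (3.14): upper branch + lower branch - product term
    a = math.exp(-(lam[1] + lam[2] + lam[3] + lam[6] + lam[7]) * t)
    b = math.exp(-(lam[4] + lam[5] + lam[6] + lam[7]) * t)
    c = math.exp(-sum(lam.values()) * t)
    return a + b - c

# Closed form from the text
s1 = lam[1] + lam[2] + lam[3] + lam[6] + lam[7]
s2 = lam[4] + lam[5] + lam[6] + lam[7]
mttf_formula = 1 / s1 + 1 / s2 - 1 / sum(lam.values())

# Trapezoidal integration of R(t) over a horizon long enough for R ~ 0
dt, T = 10.0, 200_000.0
steps = int(T / dt)
mttf_numeric = sum(0.5 * (R(i * dt) + R((i + 1) * dt)) * dt for i in range(steps))
```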
3.3 Types of Redundancy

Redundancy is useful when very high dependability features are mandatory, in particular when high reliability, availability and safety of the equipment are requested. It is important to underline that we deal here with the reliability block diagram: parallel items in the RBD do not automatically and necessarily correspond to a parallel connection in the hardware block diagram. Three different types of redundancy can be defined [6]:

• Active redundancy (also known as parallel or hot redundancy): this is the redundancy discussed so far. The redundant elements are always subject to the same load.
• Warm redundancy: the redundant elements are subject to a lower load until one of the elements fails; load sharing is present.
• Stand-by redundancy (also known as cold redundancy): the redundant elements are subject to no load until one of the operating elements fails; there is no load sharing and, very importantly, the failure rate of the elements with no load (in the reserve state) is assumed equal to zero.
3.4 Functional Configuration k out of n

A particular configuration, previously cited, is represented by a system that is operative when at least k elements out of a total of n elements are functioning normally. The configuration is also called k-out-of-n redundancy, with k ≤ n. For this configuration we can assume a structure in which k elements are in active redundancy and the remaining (n − k) elements are in stand-by. A typical example is a steel cable formed of n strands that can withstand the foreseen stress if at least k strands are intact. To calculate the reliability of this configuration we use the binomial distribution, hypothesizing that the generic element of the system can assume only two conditions: correct functioning or failure. Recalling the definition given for this configuration and indicating with R(t) the reliability of the generic element, the reliability of the system RS(t) can be expressed as:

RS(t) = ∑_(i=k)^n C(n, i) · (R(t))^i · (1 − R(t))^(n−i)    (3.15)

denoting as

C(n, i) = n! / (i!·(n − i)!)    (3.16)

the binomial coefficient, with 0! = 1. Assuming a constant failure rate, we have:

RS(t) = ∑_(i=k)^n C(n, i) · (e^(−λt))^i · (1 − e^(−λt))^(n−i)    (3.17)

from which the mean time to failure of the system can be immediately calculated as:

MTTFS = ∫₀^(+∞) [∑_(i=k)^n C(n, i) · (e^(−λt))^i · (1 − e^(−λt))^(n−i)] dt    (3.18)
It is interesting to observe that for k = 1 this configuration coincides with the parallel configuration, while for k = n it coincides with the series configuration.

Example 9

We wish to determine the probability of functioning of a 2-out-of-3 system at time t = 10⁴ hours, considering the failure rate of the generic element equal to 3·10⁻⁵ h⁻¹. From Eq. (3.17) the reliability is:

RS(t) = ∑_(i=2)^3 C(3, i) · e^(−iλt) · (1 − e^(−λt))^(3−i) = 3e^(−2λt)·(1 − e^(−λt)) + e^(−3λt)

and:

RS(t) = 3e^(−2λt) − 2e^(−3λt)

Considering the assigned value of the failure rate, the reliability of the system at 10⁴ hours is given by 5/6 ≅ 0.833.

Example 10
Table 3.3 and Figure 3.10 compare the results and reliability trends for different k-out-of-n functional configurations, under the hypotheses that led to expressions (3.15) and (3.17).

Table 3.3 Characteristics of reliability for k-out-of-n configurations.

     Configuration     Reliability model                       MTTFS
a    Single element    e^(−λt)                                 1/λ
b    1-out-of-2        2e^(−λt) − e^(−2λt)                     9/(6λ)
c    1-out-of-3        3e^(−λt) − 3e^(−2λt) + e^(−3λt)         11/(6λ)
d    2-out-of-3        3e^(−2λt) − 2e^(−3λt)                   5/(6λ)

Fig. 3.10 Reliability of the system for different functional configurations with elements having the same failure rate.
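Eq. (3.15) with identical elements, together with the numbers of Example 9, can be sketched as follows (illustrative code; the limit cases k = 1 and k = n are also checked against the parallel and series models):

```python
import math

# k-out-of-n reliability with identical, independent elements, Eq. (3.15)
def k_out_of_n_reliability(k, n, r):
    return sum(math.comb(n, i) * r**i * (1 - r) ** (n - i)
               for i in range(k, n + 1))

# Example 9: 2-out-of-3, lam = 3e-5 h^-1, t = 1e4 h
r = math.exp(-3e-5 * 1e4)
rs = k_out_of_n_reliability(2, 3, r)   # about 0.833

# Limit cases: k = 1 is the parallel model, k = n the series model
r_par = k_out_of_n_reliability(1, 3, 0.9)   # 1 - (1 - 0.9)**3
r_ser = k_out_of_n_reliability(3, 3, 0.9)   # 0.9**3
```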
Fig. 3.11 RBD for Example 11 (panels (a), (b) and (c)).
Further overview examples are presented in the following.

Example 11

We are interested in investigating the reliability of an airplane with four propellers, two on the left wing and two on the right wing, as depicted in Figure 3.11.a. The airplane will fly if at least one propeller on each wing functions. In order to draw the RBD, some preliminary considerations are necessary. We denote the four propellers by the letters A, B, C and D: propellers A and B are located on the left wing, whereas propellers C and D are located on the right wing. The statement "the airplane will fly if at least one propeller on each wing functions" leads us to consider the two wings as a series structure: failure of the propulsion function on either wing results in system failure. This structure is depicted in Figure 3.11.b. Moreover, each wing can be modeled as a very simple parallel subsystem composed of two propellers, since only one propeller per wing is required to operate correctly. The complete block diagram is depicted in Figure 3.11.c. It should be noted that in this example the only airplane failures considered are propeller failures; obviously, many other failures are possible, but for the sake of simplicity they are not considered here [7].

Example 12
Many interesting applications can be derived from example 11. For instance, the following situation can be analyzed. We are again interested in the reliability of an airplane with four propellers, two on each wing, as depicted in Figure 3.11.a; now, however, the airplane will fly if at least two propellers function. It should be highlighted that in this case two propellers are sufficient to assure the correct function of the airplane, and the position of these propellers is not important (please note that this is an example; the actual situation can be very different!). The concept can be analyzed as follows: the function is available if 2 out of 4 propellers are working correctly, so this is a k-out-of-n system; in fact, we have a system that is operative when at least k elements out of a total of n are functioning normally. The aforementioned equations (3.15) through (3.18) can be applied.

Example 13
We are now interested in the system with the Reliability Block Diagram (RBD) of Figure 3.12, where the original design does not guarantee the required reliability level.
Fig. 3.12 RBD for example 13: component 1 in series with the parallel combination of components 2 and 3.
A possible solution is to use redundancy in order to improve the reliability. If three further components, identical to the previous ones, are available, two different arrangements are possible:

• component redundancy, as depicted in Figure 3.13.a, and
• system redundancy, as depicted in Figure 3.13.b.
Fig. 3.13 Component redundancy (a) and system redundancy (b).
The difference between the two redundancy configurations is as follows: in component redundancy each component is replicated in parallel at its position in
the system, whereas in system redundancy the whole system is replicated in parallel to itself. For n parallel components the following equation is valid (P = parallel):

R_P(t) = 1 - F_P(t) = 1 - \prod_{i=1}^{n} F_i(t) = 1 - \prod_{i=1}^{n} \left(1 - e^{-\lambda_i t}\right)
and, in the case of two components, with the time dependence no longer written explicitly:

R_P = 1 - F_P = 1 - \prod_{i=1}^{2} F_i = 1 - \prod_{i=1}^{2} (1 - R_i) = 1 - (1 - R_1)(1 - R_2) = R_1 + R_2 - R_1 R_2.
Taking into consideration, for the sake of simplicity, the same reliability value for both components, R = R_1 = R_2, the reliability can be evaluated as follows:

R_P = 2R - R^2
If the reliability of each element at the generic time t is, for example, R = 0.9, we obtain:

R_P = 1.8 - 0.81 = 0.99.
So, for component redundancy the overall reliability can be evaluated with the notation of Figure 3.14, where the reliability at the generic time t is reported inside each block. The reliability is (S = system, CR = component redundancy):

R_S(CR) = 0.99 \cdot \left(2 \cdot 0.99 - 0.99^2\right) = 0.9899.
Fig. 3.14 Component redundancy (each of the three positions has reliability 0.99).
In order to evaluate the reliability of the system redundancy solution, it is first necessary to evaluate the reliability of the original design:

R_S = 0.9 \cdot \left(2 \cdot 0.9 - 0.9^2\right) = 0.891.
With the notation of Figure 3.15, the overall reliability in the case of system redundancy is (S = system, SR = system redundancy):

R_S(SR) = 2R_S - R_S^2 = 2 \cdot 0.891 - 0.891^2 = 0.9881.
Fig. 3.15 System redundancy for example 13.
The result is that, in this example:

R_S(CR) > R_S(SR).
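The two arrangements can be compared numerically; the sketch below (helper names are ours) assumes the structure of Figure 3.12, one component in series with a parallel pair, each with reliability R = 0.9:

```python
def parallel(*rs: float) -> float:
    """Reliability of independent blocks in parallel."""
    q = 1.0
    for r in rs:
        q *= (1.0 - r)
    return 1.0 - q

def original_system(r: float) -> float:
    """Component 1 in series with the parallel pair (2, 3)."""
    return r * parallel(r, r)

R = 0.9
# Component redundancy: every component is doubled in place.
rs_cr = original_system(parallel(R, R))
# System redundancy: the whole system is doubled in parallel.
rs_sr = parallel(original_system(R), original_system(R))
print(rs_cr, rs_sr)  # ~0.9899 vs ~0.9881
```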
This result holds in nearly all situations, according to the following inequality:

R_S(CR) \geq R_S(SR)    (3.19)
Redundancy at the component level is more effective than redundancy at the system level, even if this arrangement can create design problems and can be difficult to realize in many situations. It is obvious that the choice of a specific configuration also requires an evaluation of the economic aspects.

Example 14
The results obtained in the previous example 13 can be further analyzed by taking into account different values of the reliability at a well-defined time instant. In this example the following values of reliability are considered: 0.8, 0.85, 0.9, 0.95, 0.99 and 1. The results are summarized in Table 3.4 and in Figure 3.16.

Table 3.4 Reliability evaluation for example 14.

R        0.8       0.85      0.9       0.95      0.99      1
RS(CR)   0.958464  0.977005  0.989901  0.997494  0.999900  1
RS(SR)   0.946176  0.971397  0.988119  0.997257  0.999898  1
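Table 3.4 can be reproduced with a short sketch (helper names are ours):

```python
def parallel2(r1: float, r2: float) -> float:
    """Reliability of two independent blocks in parallel."""
    return 1.0 - (1.0 - r1) * (1.0 - r2)

def original_system(r: float) -> float:
    # Component 1 in series with the parallel pair (2, 3), as in Figure 3.12.
    return r * parallel2(r, r)

for r in (0.8, 0.85, 0.9, 0.95, 0.99, 1.0):
    rs_cr = original_system(parallel2(r, r))                   # component redundancy
    rs_sr = parallel2(original_system(r), original_system(r))  # system redundancy
    print(f"R={r:<5} RS(CR)={rs_cr:.6f} RS(SR)={rs_sr:.6f}")
```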
Fig. 3.16 Plot of reliability for example 14: system reliability vs component reliability, for component redundancy and system redundancy.
The table and the plot are in compliance with (3.19), here reported for simplicity:

R_S(CR) \geq R_S(SR)

In this example the reliability value equal to 1 has also been considered: for this value the two redundancy arrangements lead to

R_S(CR) = R_S(SR)

whereas for all the other values of reliability the strict inequality holds:

R_S(CR) > R_S(SR).
Example 15
Evaluate the MTTF for the system depicted in Figure 3.17. The components are used during the approximately constant failure rate phase of the bathtub curve, with λ = λ1 = λ2 = λ3.
Fig. 3.17 RBD for example 15.
In order to solve this configuration, some initial considerations are necessary. Recalling Eq. (2.33), here reported for the sake of simplicity:
\lambda(t) = -\frac{1}{R(t)}\,\frac{dR(t)}{dt} = -\frac{d}{dt}\log\left(R(t)\right)
and considering as initial condition the maximum reliability value at time 0, R(0) = 1, it results:

R(t) = e^{-\int_0^t \lambda(\tau)\,d\tau}
Recalling the bathtub curve and hypothesizing phase 2 (useful life, random failures), the previous equation becomes:

R(t) = e^{-\lambda t}

We approach the solution in several steps.

First parallel block

The block is depicted in Figure 3.18.
Fig. 3.18 First parallel block.
The reliability of this parallel block is calculated with Eq. (3.7):

R_{P1}(t) = 1 - F_P(t) = 1 - \prod_{i=1}^{2} F_i(t) = 1 - \prod_{i=1}^{2} \left(1 - e^{-\lambda_i t}\right) = 2e^{-\lambda t} - e^{-2\lambda t}
and consequently, the mean time to failure:

MTTF_{P1} = \int_0^{+\infty} R_{P1}(t)\,dt = \int_0^{+\infty} \left(2e^{-\lambda t} - e^{-2\lambda t}\right)dt = \frac{2}{\lambda} - \frac{1}{2\lambda} = \frac{3}{2\lambda}
as found in (3.11). Now it is necessary to evaluate the failure rate of this block. Recalling Eq. (2.33) we obtain:

\lambda_{P1}(t) = -\frac{1}{R_{P1}(t)}\,\frac{dR_{P1}(t)}{dt} = -\frac{1}{2e^{-\lambda t} - e^{-2\lambda t}}\,\frac{d}{dt}\left(2e^{-\lambda t} - e^{-2\lambda t}\right) = \frac{2\lambda e^{-\lambda t} - 2\lambda e^{-2\lambda t}}{2e^{-\lambda t} - e^{-2\lambda t}} = 2\lambda\,\frac{e^{-\lambda t}\left(1 - e^{-\lambda t}\right)}{e^{-\lambda t}\left(2 - e^{-\lambda t}\right)} = 2\lambda\,\frac{1 - e^{-\lambda t}}{2 - e^{-\lambda t}}
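The closed form just derived can be spot-checked numerically; the sketch below (our code, with an arbitrary λ) compares it with a central finite-difference estimate of λ(t) = −R′(t)/R(t):

```python
from math import exp

lam = 1e-4  # assumed failure rate [1/h]

def reliability_parallel(t: float) -> float:
    """R_P1(t) = 2 e^{-lam t} - e^{-2 lam t}."""
    return 2.0 * exp(-lam * t) - exp(-2.0 * lam * t)

def hazard_numeric(t: float, h: float = 1.0) -> float:
    """lambda(t) = -R'(t)/R(t), with R' estimated by a central difference."""
    d_r = (reliability_parallel(t + h) - reliability_parallel(t - h)) / (2.0 * h)
    return -d_r / reliability_parallel(t)

def hazard_closed(t: float) -> float:
    """2 lam (1 - e^{-lam t}) / (2 - e^{-lam t})."""
    e = exp(-lam * t)
    return 2.0 * lam * (1.0 - e) / (2.0 - e)

t = 5000.0
print(hazard_numeric(t), hazard_closed(t))  # the two estimates agree
```

Note that for large t the hazard rate tends to λ, as expected once one element of the pair has failed.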
Second parallel block

The block is depicted in Figure 3.19. Taking into account the value of the failure rate, it results:

R_{P2}(t) = 2e^{-\lambda t} - e^{-2\lambda t}, \qquad MTTF_{P2} = \frac{3}{2\lambda}, \qquad \lambda_{P2}(t) = 2\lambda\,\frac{1 - e^{-\lambda t}}{2 - e^{-\lambda t}}

as for the previously evaluated block.
Fig. 3.19 Second parallel block.
Third parallel block

The block is depicted in Figure 3.20. Taking into account the value of the failure rate, we have:

R_{P3}(t) = 2e^{-\lambda t} - e^{-2\lambda t}, \qquad MTTF_{P3} = \frac{3}{2\lambda}, \qquad \lambda_{P3}(t) = 2\lambda\,\frac{1 - e^{-\lambda t}}{2 - e^{-\lambda t}}

as for the previously evaluated blocks.
Fig. 3.20 Third parallel block.
Fig. 3.21 Second and third parallel blocks.
Parallel of the second and third blocks

The block is depicted in Figure 3.21. The reliability is:

R_{P23}(t) = 1 - F_{P23}(t) = 1 - \prod_{i=P2}^{P3} \left(1 - e^{-\lambda_i(t)\,t}\right) = 1 - \left(1 - e^{-\lambda_{P2}(t)\,t}\right)\left(1 - e^{-\lambda_{P3}(t)\,t}\right) = e^{-\lambda_{P2}(t)\,t} + e^{-\lambda_{P3}(t)\,t} - e^{-\lambda_{P2}(t)\,t}\,e^{-\lambda_{P3}(t)\,t}.

However, the two blocks are equal, so their failure rates are also equal, \lambda_{P2}(t) = \lambda_{P3}(t) = \lambda_P(t):

R_{P23}(t) = 2e^{-\lambda_P(t)\,t} - e^{-2\lambda_P(t)\,t}

MTTF_{23} = \int_0^{+\infty} R_{P23}(t)\,dt = 2\int_0^{+\infty} e^{-\lambda_P(t)\,t}\,dt - \int_0^{+\infty} e^{-2\lambda_P(t)\,t}\,dt
Finally, the failure rate is evaluated as:

\lambda_{P23}(t) = -\frac{1}{R_{P23}(t)}\,\frac{dR_{P23}(t)}{dt}.
It is easy to see that from this point on the mathematical treatment becomes very difficult. For more complex structures, software tools for reliability evaluation have been developed and are often used. When the device count becomes large, the complexity of the mathematics leads one to consider the use of such software; it should be noted that 4 or 5 devices are already sufficient to make the calculations hard. The question is now: how many devices are present in an automobile, in an airplane, in a PC, etc.?

Example 16
Let us consider a system composed of four identical devices connected as depicted in Figure 3.22. Each item is independent of the other items.
Fig. 3.22 RBD for example 16.
Solving first the two series items from (3.1), with the notation used in Figure 3.23, the following reliability is obtained:

R_S(t) = \prod_{i=1}^{2} R_i(t) = R^2(t)

Now, taking into account the parallel structure:

R_{System}(t) = 2R^2(t) - R^4(t)
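A sketch of this series-parallel evaluation (helper names are ours):

```python
def series(*rs: float) -> float:
    """Reliability of independent blocks in series."""
    p = 1.0
    for r in rs:
        p *= r
    return p

def parallel(*rs: float) -> float:
    """Reliability of independent blocks in parallel."""
    q = 1.0
    for r in rs:
        q *= (1.0 - r)
    return 1.0 - q

def system_reliability(r: float) -> float:
    """Two series pairs (A-A) placed in parallel: 2 R^2 - R^4."""
    branch = series(r, r)
    return parallel(branch, branch)

print(system_reliability(0.9))  # 2*0.81 - 0.6561 = 0.9639
```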
Fig. 3.23 RBD for example 16: two series pairs of identical items A, with reliability RS(t) each, combined in parallel to give RSystem(t).
The obtained results are plotted in Figure 3.24.

Fig. 3.24 Item vs system reliability for the RBD of Figure 3.23. The Switch Reliability Point, where the two curves cross, is marked.
At low values of single-device reliability, the system reliability is lower than the item reliability; for high values of item reliability, the system reliability is higher than the item reliability. It is very important to evaluate the point named Switch Reliability Point in Figure 3.24, at which the system reliability is equal to the item reliability:

2R^2(t) - R^4(t) = R(t)

Solving the equation we obtain

R^4(t) - 2R^2(t) + R(t) = 0
and

R(t)\left[R^3(t) - 2R(t) + 1\right] = 0, \qquad R^3(t) - 2R(t) + 1 = 0
A plot of the previous equation is reported in Figure 3.25. Reliability can only assume non-negative values, so the dashed line depicts the parts of the function that are not valid for reliability.
Fig. 3.25 Plot of R^3(t) − 2R(t) + 1 vs R(t); the root at approximately 0.618 is marked.
In the previous equation, R(t) = 1 is a solution, and with Ruffini's rule the equation can be rewritten in the following form:

\left(R(t) - 1\right)\left(R^2(t) + R(t) - 1\right) = 0.

Finally, the solutions are:

R_1 = 1, \qquad R_2 = -\frac{1}{2} + \frac{\sqrt{5}}{2} \cong 0.618, \qquad R_3 = -\frac{1}{2} - \frac{\sqrt{5}}{2} \cong -1.618
The switch reliability point is therefore about 0.618. The negative solution R_3 is not a valid value for reliability. If an exponential reliability function with a constant failure rate for each block is considered, the system reliability is:

R_{System}(t) = 2e^{-2\lambda t} - e^{-4\lambda t}
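The switch reliability point can be verified numerically as the positive root of R² + R − 1 = 0 obtained from the factorization above (a sketch; the function name is ours):

```python
from math import sqrt

def system_reliability(r: float) -> float:
    """R_System = 2 R^2 - R^4 for the RBD of Figure 3.23."""
    return 2.0 * r**2 - r**4

# Positive root of R^2 + R - 1 = 0, from the Ruffini factorization.
switch_point = (-1.0 + sqrt(5.0)) / 2.0
print(switch_point)                                     # ~0.618
print(system_reliability(switch_point) - switch_point)  # ~0: the curves cross here
```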
As far as the MTTF is concerned, it results:

MTTF_{System} = \int_0^{+\infty} R_{System}(t)\,dt = \frac{1}{\lambda} - \frac{1}{4\lambda} = \frac{3}{4\lambda}
and recalling that for a single device MTTF = 1/\lambda, it is possible to conclude that the MTTF of the system is 75% of the MTTF of the single item:
MTTF_{System} = \frac{3}{4\lambda} = \frac{3}{4}\,MTTF = 0.75 \cdot MTTF
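The 0.75 factor can be checked by numerically integrating the system reliability (a sketch; the λ value is an arbitrary assumption):

```python
from math import exp

lam = 1e-3  # assumed failure rate [1/h]

def r_system(t: float) -> float:
    """R_System(t) = 2 e^{-2 lam t} - e^{-4 lam t}."""
    return 2.0 * exp(-2.0 * lam * t) - exp(-4.0 * lam * t)

# Trapezoidal integration of R_System from 0 up to ~20 mean lives.
dt = 1.0
n = int(20.0 / lam / dt)
mttf = sum(0.5 * (r_system(i * dt) + r_system((i + 1) * dt)) * dt for i in range(n))

print(mttf * lam)  # ~0.75, i.e. MTTF_System = 0.75 * (1 / lam)
```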
Item reliability and system reliability are drawn in Figure 3.26.

Fig. 3.26 Plot of item reliability R(t) and system reliability RSystem(t) vs λt.
References

[1] IEC 50 (191) International Standard: International Electrotechnical Vocabulary – Chapter 191: Dependability and quality of service. IEC International Electrotechnical Commission, Geneva (December 1990)
[2] Garvin, D.A.: Competing on the eight dimensions of quality. Harvard Business Review (1987)
[3] Michelini, R.C., Razzoli, R.P.: Affidabilità e sicurezza del manufatto industriale: la progettazione integrata per lo sviluppo sostenibile. Tecniche Nuove (2000)
[4] Iuculano, G.: Introduzione a probabilità, statistica e processi stocastici. Pitagora Editrice, Bologna (1996)
[5] Garcia-Diaz, A., Phillips, D.T.: Principles of Experimental Design and Analysis. Chapman & Hall, Boca Raton (1995)
[6] Birolini, A.: Reliability Engineering – Theory and Practice. Springer, Heidelberg. ISBN 978-3-642-14951-1
[7] Leemis, L.M.: Reliability, Probabilistic Models and Statistical Methods, 2nd edn. ISBN 978-0-692-00027-4
Chapter 4
Experimental Reliability and Laboratory Tests
Abstract. Component reliability is often affected by different influencing factors. In particular, the operating profile of a component should be taken into account if good reliability predictions are needed. The operating profile changes according to the type of operation of the component: we can have continuous operation, non-continuous operation, or even sporadic operation. Moreover, storage conditions may deeply impact the reliability of the component when operating. Obviously, environmental factors need to be taken into account as well: the environment contributes to both aging and failures during the life of the device or system under consideration. To this aim, both the duration and the intensity of environmental stresses should be included in the system operational model. In this chapter, after a brief introduction, the stress factors are analyzed in 4.2. In 4.3 the component degradation is presented, and in 4.4 a model for aging based on temperature is discussed in depth. The analysis of failure modes (4.5) and laboratory tests (4.6) are finally presented.
4.1 Introduction

For a specific component, physical reliability is based on the analysis of its "life cycle" through the definition of a model. In any context (mechanical as well as electric or electronic, etc.), the definition of such a model depends on the current state of the component or system. The state is in turn influenced by "inputs" (e.g. loading conditions), "influencing factors" (environmental, mechanical, electrical, etc.) and design performances. This brings us to a formalization in which equations contain variables representing the phenomenon, the inputs and the measurements. In the study of physical reliability it is important to remember that any material (and, generalizing, any type of component) is able to store "energy" at the atomic level, originating from the external environment. The capacity of storing energy allows the definition of a critical value beyond which the mechanism of conservation gives way to the mechanism of modification of structural bonds. This leads to the breakdown of the material (component) itself. The anisotropy of materials causes an irregular distribution of energy: this determines a breakdown at stored energy values inferior to the theoretical critical value. Interaction with the external environment therefore cannot be predicted "in a deterministic manner", in that the molecular aspect of materials is also
associated with the method of energy exchange with the external environment: quantity, type and exchange dynamics determine different evolution processes. The critical value can be reached through instantaneous breakdown processes (when the nature and dynamics of the stress exceed the resistance of the material) or through slow processes, associated with the phenomenon of fatigue.
4.2 Stress Factors

The definition of a physical reliability model of a component requires the knowledge of both the failure modes and the failure mechanisms. The latter are connected to the various types of stress applied to the device: the way the device is used in normal functioning conditions, as well as the influencing factors related to the work environment, have to be considered. Depending on the working conditions, the combination of influencing factors that can lead to failures can be of different types. For example, for electronic components the stress factor is often the "work temperature" of the component, while for components in chemical plants it could be the corrosive capacity of the fluids present in the system. The principal influencing factors can be classified into three types:

• climatic factors, where an increase in ambient temperature makes heat dissipation more difficult, which obviously affects the performance of the electronic component; this can also affect other types of components, e.g. the insulation of electrical machines;
• mechanical factors (shock) involved in transportation and installation and, in particular, the vibration the component undergoes when functioning normally;
• electrical factors (e.g. electromagnetic interference) due to the characteristics of the power supply or to mutual interference among machines.

In general, every device is subjected to influencing factors; obviously, the type of component and its application make some factors more predominant than others. Since the way that components and materials are used has a strong impact on system reliability (note the definition given in Chapter 1), the study of models is aided, by means of standards, by defining parameters to use for the selection and qualification of materials and components. The importance of these aspects can be understood by referring to international standards (e.g. ETSI and IEC).
In fact, these standards specify the limits of stress and the trial conditions relative to temperature, humidity, precipitation, radiation, sand, noise, vibrations, and electrical and mechanical shock. The standards are also used to classify different environments, defining for each one the values of specific environmental parameters (temperature, relative humidity, vibrations, etc.). Table 4.1 reports standardized normal conditions. The table makes reference to the temperature as the central point of a diagram that shows the combinations of possible values of air temperature and relative humidity. This is known as a climate plot.
The plot identifies the following areas (e.g. Figure 4.1):

• a more internal area that represents conditions encountered 90% of the time;
• an intermediate area referring to environmental conditions at normal limits;
• a more external zone referring to "exceptional" environmental conditions (e.g. a breakdown in the air conditioning system).

The functioning of the apparatus must be guaranteed in the intermediate area defined by the normal climatic limits (see Fig. 4.1). In the zone included between the "exceptional climatic limits" and the "normal climatic limits", the apparatus is allowed to work with degraded performance, but its functionality must be restored when the conditions return to the "normal" area. It is important to remember that, also when handbooks are used (see Chapter 5), all reliability models include the effects of environmental stresses through the environmental factor πE.

Table 4.1 Example of environmental classification; Ta (°C) is the central point of the climatic plot.

SHELTERED, air-conditioned:
  Standard: 25
  Special: 30
SHELTERED, not air-conditioned:
  With partial temperature control: 25
  Without temperature control, inside walls: 25
  Without temperature control, with greenhouse effect: 30
  Without temperature control, with natural ventilation and without greenhouse effect: 30
  Without temperature control, inside container, for on-line equipment: 30
  Mobile (cockpit or carrier): 30
FREE AIR, unsheltered:
  Cold climate: 15
  Cold climate temperate: 15
  Warm climate temperate: 20
  Warm climate temperate (tropical): 25
  Warm dry climate: 25
  Warm dry climate temperate: 25
One might ask if it is possible to find a relationship between the applied stress and the strength of a component. This would permit the designing of components for which the conditions for failure do not exist. For many components, both the stress (load) and strength follow a statistical distribution. Hypothesizing a normal distribution (in the literature however, there are studies in which the analysis is implemented by analyzing different distributions), we can use statistical
parameters in order to analyze how the load and strength distributions interact in the evaluation of the probability of failure (Figure 4.2). Starting from the mean values and the standard deviations of the strength and stress distributions, it is possible to define the Safety Margin (SM) and the Loading Roughness (LR). Representing the mean values with L for the load distribution and S for the strength distribution (Figure 4.3), and denoting by σS and σL the standard deviations of the strength of the component and of the stress applied to it, respectively [1], SM and LR are defined as:

SM = \frac{S - L}{\sqrt{\sigma_S^2 + \sigma_L^2}}, \qquad LR = \frac{\sigma_L}{\sqrt{\sigma_S^2 + \sigma_L^2}}    (4.1)
The probability that a failure will take place depends on the distance between the two distributions, while the number of components involved in a failure condition depends both on this distance and on the shape of the distributions.
Fig. 4.1 Climatic plot: an example (ETSI 300 019-1-3) [2].
Fig. 4.2 Relationship between stress and resistance (qualitative plot of the probability densities).

Fig. 4.3 Analysis of links [1] (qualitative plot).
As shown in Figure 4.3, (a) represents the ideal condition, while (b) and (c) refer to two possible cases where a failure could arise. In (b), the safety margin is low: although the stress distribution is narrow, the strength distribution is wide. As seen in the figure, the probability of failure involves only a small fraction of the components that respond to such laws. In (c), instead, a greater state of criticality is represented: the probability of failure, due to a more severe stress, involves a larger fraction of components. This last point provides useful information for quality control: in cases where reliability performances are fundamental but it is impossible to check entire production lots (e.g. systems containing a prevalence of electronic components), a system for checking components subjected to overstress can be implemented, and those whose resistance falls at the tail of the distribution curve are eliminated. The above considerations are the basis for implementing screening tests, whose goal is to screen out the fraction of components that are intrinsically weak, that is, the elements which fall into the area of premature failures of the bathtub curve, as discussed in Chapter 2. In order to better understand the previous considerations, we propose as an example the analysis of a regulation valve. The manufacturer's specifications state that the valve can function up to a maximum pressure of 14000 kPa (with a standard deviation of 5%, i.e. 700 kPa). The reliability of the device, utilized when a fluid exerts a pressure of 10000 kPa (with a standard deviation equal to 1300 kPa), can be estimated from the safety margin:
SM = \frac{S - L}{\sqrt{\sigma_S^2 + \sigma_L^2}} = \frac{14000 - 10000}{\sqrt{700^2 + 1300^2}} \cong 2.71    (4.2)
from which, using the standard normal distribution tables, we can evaluate the reliability as:

R = F(SM) = 0.9966    (4.3)

where F denotes the cumulative distribution function of the standard normal distribution.
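The calculation can be sketched as follows (values from the example; the standard normal CDF is computed via the error function):

```python
from math import erf, sqrt

def safety_margin(s_mean: float, l_mean: float, s_std: float, l_std: float) -> float:
    """Safety Margin, Eq. (4.1)."""
    return (s_mean - l_mean) / sqrt(s_std**2 + l_std**2)

def std_normal_cdf(x: float) -> float:
    """Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

sm = safety_margin(14000.0, 10000.0, 700.0, 1300.0)
print(sm, std_normal_cdf(sm))  # ~2.71 and ~0.9966
```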
4.3 Component Degradation

Only in ideal conditions can a device, subjected to various levels or types of stress, maintain its characteristics unaltered. In a real situation, the component undergoes a degradation process and, consequently, its performances change with time. Recalling the hypothesis of normal distributions discussed previously, we can hypothesize the evolution over time of the resistance of a component subjected to external stresses, as represented in Figure 4.4. The trend plotted in Figure 4.4 depends on the knowledge of the device functionality, evaluated by means of analytical models as well as laboratory tests [1].
Fig. 4.4 Degradation of resistance over time.
4.4 The Prediction Approach

The study of both the mechanisms and the types of failure is fundamental for defining reliability models as well as for the evaluation of failure rates. However, since tests are implemented in simulated conditions, it is important to underline that such models represent an estimation based on the best observed data. As a consequence, the use of such models in the reliability field is well founded only if the device or material under examination is used in the same conditions in which the test was implemented. An influencing factor which affects materials, components and processes is the temperature. Since many processes (chemical reactions, diffusion of gases, etc.) are accelerated when the temperature increases, it is possible to define the Arrhenius model for this influencing factor:

R = H \cdot e^{-E_a / (KT)}    (4.4)

where R is the velocity of activation (reaction rate), H is a constant typical of the process, K is the Boltzmann constant (8.617 · 10^-5 eV/K = 1/11605 eV/K), Ea is the activation energy of the degradation process, in eV, depending on the technology, and T is the thermodynamic temperature in kelvin (K), i.e. temperature in °C + 273.15.
This acceleration model is often used to predict the life of a component (item) as a function of temperature. It applies specifically to those failure mechanisms that are temperature related and which are within the range of validity of the model. The model states:

Time\ to\ failure = const \cdot e^{E_a / (KT)}    (4.5)
where Time to failure is a measure of the life of the item under consideration and const represents a parameter evaluated experimentally for the item involved. It should be noted that the electronvolt (eV) is a unit of energy. The value of Ea depends on the considered failure mode; some values of the activation energy Ea for silicon semiconductor failure mechanisms are reported in Table 4.2. Another Arrhenius expression involves the failure rate. Denoting by λ1 the component failure rate (in h^-1) at the temperature T1 (reference temperature, in K), the failure rate λ2 of the same component at the temperature T2 (stress temperature) is given by:

\lambda_2 = \lambda_1 \cdot e^{\frac{E_a}{K}\left(\frac{1}{T_1} - \frac{1}{T_2}\right)}    (4.6)
where λ2 is expressed in h^-1.

Table 4.2 Approximated values of the activation energy, Ea, for different failure mechanisms in silicon semiconductors.

Failure mechanism                  Ea (eV)
Corrosion                          0.3 – 1.1
Assembly defects                   0.5 – 0.7
Electromigration – Al line         0.6
Electromigration – Contact/Via     0.9
Mask defects                       0.7
Photoresist defects                0.7
Contamination                      1.0
Charge injection                   1.3
Dielectric breakdown               0.2 – 1.0
Au-Al intermetallic growth         1.0 – 1.05
Fig. 4.5 Acceleration Factor (AF) evaluated according to (4.7) vs actual temperature t2 = T2 − 273.15, with t1 = T1 − 273.15 = 35 °C. The plot reports the AF curve for different values of Ea.
Taking into consideration the previous equation, the acceleration factor AF is defined as:

AF = e^{\frac{E_a}{K}\left(\frac{1}{T_1} - \frac{1}{T_2}\right)}    (4.7)
A plot of the acceleration factor is depicted in Figure 4.5, where a reference temperature t1 = 35 °C is assumed and the activation energy varies from 0.4 to 1.0 eV. An interesting way to plot the acceleration factor is depicted in Figure 4.6, where an inverse absolute temperature scale is used on the horizontal axis. If degrees Celsius are represented on the horizontal axis, the plot of Figure 4.7 is finally obtained. A straight-line trend on the plots of Figures 4.6 and 4.7 supports the assumption that an Arrhenius relationship holds. The Arrhenius model can be considered a simplified degradation model. In fact, with the progress of technology and, consequently, of the functionality of modern components (transistors, microprocessors, custom devices, and so on), the model of Eq. (4.6) may not be adequate. Figure 4.9 shows a difference between the information contained in an electronic data base (e.g. MIL-HDBK-217, a data base for prediction in electronic equipment [2, 3]) and the actual situation. However, on the basis of the Arrhenius theory, more complex models have been deduced, representing the failure rate for electronic components and considering different influencing factors.
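Equations (4.6) and (4.7) can be sketched as follows (the failure rate and temperature values are illustrative assumptions, not handbook data):

```python
from math import exp

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant [eV/K]

def acceleration_factor(ea_ev: float, t1_c: float, t2_c: float) -> float:
    """AF of Eq. (4.7): e^{(Ea/K)(1/T1 - 1/T2)}, temperatures given in Celsius."""
    t1_k, t2_k = t1_c + 273.15, t2_c + 273.15
    return exp((ea_ev / K_BOLTZMANN_EV) * (1.0 / t1_k - 1.0 / t2_k))

# Failure rate translation, Eq. (4.6): lambda2 = lambda1 * AF
lambda1 = 2e-7                              # assumed failure rate at 35 degC [1/h]
af = acceleration_factor(0.7, 35.0, 100.0)  # Ea = 0.7 eV, stress at 100 degC
lambda2 = lambda1 * af
print(af, lambda2)
```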
Fig. 4.6 Acceleration Factor (AF) evaluated according to (4.7) vs inverse absolute temperature 1/T2, with T1 = 308.15 K. The plot reports the AF curve for different values of Ea.
Fig. 4.7 Acceleration Factor (AF) evaluated according to (4.7) vs temperature t2 in Celsius degrees (on an inverse absolute temperature scale), with t1 = 35 °C. The plot reports the AF curve for different values of Ea.
A generic model able to predict the failure rate in the electronic field is represented by the following equation:

\lambda = \lambda_0 \cdot \pi_E \cdot \pi_S \cdot \pi_T \cdots \pi_K    (4.8)

where λ0 denotes the failure rate in reference conditions, πE is the environmental factor, πS is a factor depending on the electrical stress applied to the component, πT is a temperature-dependent factor, and πK denotes a factor that takes into account complexity, technology and functionality of the component. This formula has been adopted in the US standard MIL-HDBK-217, which will be discussed in Chapter 5 with regard to handbooks. The exponential distribution is assumed as the hypothesis underlying Eq. (4.8).
Fig. 4.8 Arrhenius's model (temperature T, in °C, vs time t, in hours, on a logarithmic scale).
Fig. 4.9 Temperature Vs reliability for electronic components (qualitative).
4.5 Failure Modes

The failure mode describes how a component can fail. The identification of the failure mode is fundamental when, in reliability analysis, it is important to know the consequences of a failure of the system. The previous paragraphs showed how failure mechanisms responsible for such processes as corrosion, wear, vibrations, fractures, oxidation, etc. play a fundamental role in the dynamics of the failure. Note, however, that the causes of a particular breakdown often have to be traced back to the production process or to the context in which the component is working. Generally, three different operating conditions can be identified: continuously active, standby and intermittent activity (components in standby are normally passive). Starting from the operating conditions, it is possible to define two categories for the causes of failures: the first is related to failures that occur when a component is called into service from a standby mode (demand related), and the second is related to failures of components during continuous activity (time related); components which operate in both modes can obviously show both types of failures. In addition to the category of catastrophic failure, characterized by the total loss of functionality, we can also consider degradation failure and incipient failure. A degradation failure describes cases in which there is a loss of function but the component is still capable of performing above minimal requirements. An incipient failure refers to cases where circumstances indicate that, without maintenance or repair, the system or component will undergo a loss of function. Furthermore,
types of breakdown can be distinguished according to the sub-function that is lost. To simplify, for certain devices (e.g. a motor) a failure may occur in the start-up or shutdown phase, but it is obvious that this type of breakdown could evolve in different ways.
4.6 Laboratory Tests on Components and Systems

Recalling the standard IEC 60050 (191), the term test denotes an operation, or series of operations, carried out in order to evaluate, quantify and classify a characteristic or some other property of an item [4]. By item, we normally mean an elementary component, a sub-system or a more complex system. By laboratory test, instead, we mean a compliance test (suitable for verifying a characteristic of an element) or a determination test (carried out to establish a characteristic of an element). These tests are performed in established and controlled conditions which may or may not simulate field conditions. One procedure for determining and measuring the reliability parameters of a family of components in the laboratory is to subject a representative sample of such components to the same stress that they will undergo in operation, both in the type of stress (e.g. temperature, humidity, etc.) and in its level (for temperature: 40 °C, 55 °C, etc.). In this case, the test continues until all or most of the samples have failed; this type of test is commonly referred to as a long-life test. When the test ends before all samples have failed, this is called censoring, and the data analysis can become more difficult. Censoring is often present in lifetime data because in many situations it is impossible to observe the lifetime of all devices under test. This is particularly true for electronic devices, whose low failure rates imply very long lifetimes and therefore very long tests. A way to overcome this problem is the use of accelerated tests, as discussed in the following. It is, however, possible to process data even in the presence of censoring. A censored observation occurs when only a bound on the time of failure is determined. The following figures depict the typical situations. If n is the number of items and nf is the number of observed failures, three main situations are possible [6]:
• Complete data set: the test ends when nf = n, as depicted in Figure 4.10.a.
• Time-censored data set: the test ends at an a priori well-defined time, as shown in Figure 4.10.b. In this type of censoring, the number of failures nf collected during the test is random; Figure 4.10.b depicts a situation with nf = 4.
• Order-statistic-censored data set: the test ends when an a priori well-defined number of failures nf is observed. In this type of censoring, the time necessary to complete the test is random. Figure 4.10.c reports a case with nf = 3: the test ends when the third failure occurs.

The aforementioned types of censored data are the result of right censoring. Other types of censoring are possible, such as left censoring and interval censoring.
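The three situations can be illustrated with a toy simulation; the exponential lifetime model, sample size and stopping values below are illustrative assumptions only.

```python
import random

random.seed(42)
n = 5
# Simulated failure times of n items, in hours (illustrative exponential model).
lifetimes = sorted(random.expovariate(1.0 / 1000.0) for _ in range(n))

# (a) Complete data set: observe all n failures.
complete = list(lifetimes)

# (b) Time-censored: stop at a fixed time t_stop; the number of failures is random.
t_stop = 800.0
time_censored = [t for t in lifetimes if t <= t_stop]

# (c) Order-statistic-censored: stop at the nf-th failure; test duration is random.
nf = 3
order_censored = lifetimes[:nf]
```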
Fig. 4.10 Examples of uncensored data (a) and censored data (b and c); sample number is plotted against time (h). The symbol × denotes a failure and tf(n) denotes the failure instant of sample n [6].
Recalling the "bathtub curve" that characterizes the failure rate of an electrical or electronic component (see Chapter 2), and remembering that the time interval of the useful life (the central part of the bathtub curve) extends for hundreds of thousands of hours, it is evident that, especially for electronic components, this type of test is inadequate: it would furnish information on the behavior of a component over very long periods of time, comparable with its technological obsolescence. We must therefore consider an accelerated life test, that is, a test in which the elements are subjected to higher levels of stress than in normal use. As said in the previous section, we define the acceleration factor as the ratio between the value of the stress applied during the test and the corresponding value that characterizes the conditions of normal use. The aim of this test is to accelerate the degradation phenomena without altering the dominant failure mechanisms (defined in Chapter 1). This allows observing the failure of the components in a shorter time. This category of tests is also useful for carrying out quantitative comparisons among components of the same type but of different origin; for example, components coming from different production lines or from different manufacturers. These tests take into consideration a wide variety of types of stress, both climatic (cold, heat, humidity) and, in general, environmental (vibrations, corrosive conditions).

Table 4.3 Classification of environmental tests (see EN 60068-1 for the exact table).

Test | Environmental stress
A | Cold
B | Dry heat
C | Heat with high humidity (continuous)
D | Heat with high humidity (not continuous, cyclic)
E | Mechanical impacts (bumps and jerks)
F | Vibrations (sinusoidal, random occasional)
G | Constant acceleration
J | Mold
K | Corrosive atmosphere (e.g. salty fog)
L | Powder and sand
M | Atmospheric pressure (high and low)
N | Temperature changes
Q | Hermetic sealing (for liquids and gas)
R | Water (rain, dripping)
S | Radiation (solar, excluding electromagnetic radiation)
T | Welding
U | Sturdiness of terminals (of components)
Table 4.3 reports a classification of tests in electrical and electronic settings, derived from the IEC 60068-1 / EN 60068-1 standard, Environmental testing, Part 1: General and guidance [5]. The tests can be detailed more precisely as a function of the particular type of stress. For example, test U, on the durability of terminals and of devices integrally assembled with a component, can concern traction (Ua1), compression (Ua2), bending (Ub), torsion (Uc), and torque measurement. Independent of the type, level and duration of the stress, laboratory tests, both for conformity and for determination, are normally carried out according to the following sequence:

Phase 1 – Preliminary adjustment: operation performed on the device (or samples) under test in order to eliminate the effects of its preceding states or conditions. For example, this could consist in keeping the elements under test at ambient or laboratory temperature for an established time interval before applying the stress.

Phase 2 – Controls and initial measurements: this phase ascertains that all components to be tested are functioning correctly (conformity assessment). This phase is assumed as the reference condition for measurements on the components under test.

Phase 3 – Treatment: components are subjected to a stress profile according to standards or determined by other experimental criteria. An example could be the application of a temperature (Test B – Dry heat) for a certain time period using an oven, or the application of heat with high humidity (Test D) in a climate-controlled room.

Phase 4 – Readjustment: after the stress is applied, it is necessary to restore the components to the reference conditions of Phase 2 and verify the level of degradation or the occurrence of failure.

A sequence test is represented by the repetition of a cycle of tests characterized by the phases described above. Laboratory tests can also be classified as:
• combined test, in which two or more types of environmental stress act simultaneously on the device (e.g. the combined heat and humidity test);
• compound test, in which two or more types of environmental stress are applied in quick succession (e.g. test Z/AD: compound test (Z) of cold (A) and heat with high humidity in cycles (D));
• sequence test, in which the element under test is subjected successively to two or more types of stress at time intervals that affect the test components (e.g. welding test (T) followed by rapid temperature changes (test Na) and by non-constant acceleration, i.e. impact tests (Ea)).

Table 4.4 lists some of the principal effects of various environmental agents.
Table 4.4 Principal effects of various environmental agents.

Agent: High temperature
Main effects: thermal aging, oxidation, flaws, chemical reactions; softening, fusion, sublimation; reduced viscosity; dilatation.
Resulting failures: wear of mobile parts due to dilatation or loss of lubricating performance.

Agent: High relative humidity
Main effects: adsorption and absorption of humidity; swelling of materials; loss of mechanical resistance; chemical reactions (corrosion, electrolysis); conductivity growth of insulating materials.
Resulting failures: physical breakdown, insulation defects, mechanical failure.

Agent: High pressure
Main effects: compression, deformation.
Resulting failures: mechanical failure, leaks (hermetic defects).

Agent: Solar radiation
Main effects: chemical, physical and photochemical reactions; surface deterioration; discoloration; heating; ozone formation.
Resulting failures: insulation defects.

Agent: Sand or dust
Main effects: abrasion and erosion; seizure; incrustation; loss of thermal conductivity; electrostatic effects.
Resulting failures: increased wear, electrical failure, mechanical failure, overheating.

Agent: Corrosive atmosphere
Main effects: chemical reactions; increase in conductivity; increase in contact resistance; water absorption.
Resulting failures: increased wear, electrical failure, mechanical failure.

Agent: Rain
Main effects: changes of temperature; erosion; corrosion.
Resulting failures: electrical failure, flaws, leaks, surface deterioration.

Agent: Rapid temperature changes
Main effects: changes of temperature; differential heating.
Resulting failures: mechanical failure, fine leaks, seal degradation, cracks.

Agent: Constant acceleration, vibrations, jolts and bumps
Main effects: mechanical stress; fatigue; resonance.
Resulting failures: mechanical failure, increased wear of moving parts, structural deformation.
References

[1] O'Connor, P.D.T.: Practical Reliability Engineering, 4th edn. Wiley, Chichester. ISBN 0-470-84462-0
[2] ETSI ETS 300 019-1-3, Edition 1, 1992-02: Equipment Engineering (EE) – Environmental conditions and environmental tests for telecommunications equipment; Part 1-3: Classification of environmental conditions; Stationary use at weather-protected locations
[3] MIL-HDBK-217F, Reliability Prediction of Electronic Equipment (December 2, 1991), with Notice 1 (10 July 1992) and Notice 2 (28 February 1995)
[4] IEC 60050-191 ed. 1.0, International Electrotechnical Vocabulary, Chapter 191: Dependability and quality of service. IEC, Geneva (December 31, 1990). Forecast publication date for ed. 2.0: 2012-06-02
[5] IEC 60068-1 ed. 6.0, Environmental testing, Part 1: General and guidance (1998). Forecast publication date for the updated version: August 2011
[6] Leemis, L.M.: Reliability: Probabilistic Models and Statistical Methods, 2nd edn. ISBN 978-0-692-00027-4
Chapter 5
Reliability Prediction Handbooks: Evaluation of the System Failure Rate
Abstract. The evaluation of failure rates for devices, components and systems is often a very difficult task. In order to simplify the failure rate evaluation for established electronic and electromechanical devices, it is possible to use ad hoc handbooks: values of failure rates for many devices are given in failure rate handbooks. In this chapter a historical overview of these handbooks is given. Many handbooks are available in which the laws of dependency of the failure rate on different stresses are considered. Section 5.1 gives a brief introduction to the first-generation handbooks. In 5.2 the second-generation handbooks are presented. A US military handbook is presented in 5.3, while 5.4 and 5.5 are devoted to FARADA and the third-generation handbooks, respectively. In 5.6 examples of failure rate evaluation are discussed. Finally, the unit for failure rate, the FIT, is introduced in 5.7.
5.1 Introduction

Modern handbooks originated in the United States from the analysis of data collected in a military environment for systems (equipped with electronic components) utilized during the Second World War. Between 1943 and 1950, the correlation between the frequency of failure in communication and navigation apparatus and the severity of the conditions in which these were required to operate became clear. At that time, particular attention was already being paid to climatic conditions such as temperature and humidity. This drew attention to the decrease in safety levels for troops, also in terms of the "availability" of equipment and the relative maintenance costs. On the basis of such considerations, in 1952 the American government initiated a program known as AGREE (Advisory Group on Reliability of Electronic Equipment), a consulting group that in 1957 published a report on specifications and tests regarding the reliability of such equipment. This gave rise to handbooks to be used as design support, with the objective of furnishing an evaluation of the failure rate with a certain level of confidence. In 1953 the Radio Electronic Television Manufacturers' Association (RETMA), later to be called the Electronic Industries Association (EIA), formed a commission on electronic applications to determine methods and procedures for
the collection, analysis and classification of reliability data. The results of the commission's work were published in the Electronics Applications Reliability Review bulletins. These represented the first rational collection of data from companies and industrial organizations such as RCA (Radio Corporation of America), GE (General Electric) and Motorola, which published the results of the life-cycle tests they performed on their components. A trace of this important work can be seen in the first edition of MIL-HDBK-217 ("Reliability Prediction of Electronic Equipment"), published by the American Department of Defense in 1962. The first attempt to define a handbook reporting information on mechanical and electromechanical components was made in 1959 with the Martin Titan Handbook, also known as "Procedure and Data for Estimating Reliability and Maintainability." The quality of this handbook can be seen in its attempt to present the data following a certain criterion of standardization: the Titan is, in fact, the first collection of data in which failure rates are expressed as a function of working hours and which uses the exponential distribution in its calculations. Although the reported data suffer from a lack of supporting statistical information (the number of components tested, the failures observed and the hours of observation are not reported), they are complemented by information on the failure mode. The Titan was also the first to propose empirical factors ("K factors") which take into account the mode of use and the eventual presence of redundancy.
5.2 Second Generation Handbooks

In the Sixties, in the wake of experience with the Titan Handbook and at the request of the Air Force, programs were initiated for the collection and organization of reliability statistics. This work led to the realization of the following handbooks:

• MIL-Handbook-217
• Failure Rate Data Bank (FARADA)
• RADC Non Electronic Reliability Notebook.
5.3 MIL-Handbook-217

Starting from the characteristics of the Titan Handbook, the Military Handbook 217 (MIL-HDBK-217) classifies components into categories and subcategories connoted by corrective factors. The latest editions of MIL-HDBK-217 furnish some of the most complete collections of data available. Unfortunately, due to the collection and organization methods utilized, the data are not always reliable. The models contained in MIL-HDBK-217 refer only to defects in the production of components which undergo stress connected to their use. Problems correlated to design, transport and mode of usage are not considered in the model. The empirical origin of the data at the basis of the handbook, not associated with an effective analysis of the real origins of breakdowns, is such that these data
cannot be utilized to identify the onset of possible problems, and therefore one cannot assign a statistical confidence to results obtained through such models. In particular, tolerance variations are masked by the use of K factors. Furthermore, the failure rates are treated as fixed measurements of a specific apparatus and not as a general measure over a range of different types of equipment.
5.4 Failure Rate Data Bank (FARADA)

In the Seventies, encouraged by the US Army, a program for the exchange of data on equipment sold in the military environment was initiated. Known as GIDEP (Government/Industry Data Exchange Program), it brought together more than 400 participants, 80% of which were private industrial organizations. Data collected by GIDEP were processed by the first software system for data processing, with the advantage of quick updating and organization in formats useful for statistical processing. The relative handbook, FARADA, also furnished, in addition to failure rates, information regarding the substitution rates of components and, where available, the failure modes. The data come from the field, from accelerated life-cycle tests, and from demonstrative reliability tests. The problem with the statistical analysis adopted by this handbook is that the data come from non-homogeneous populations: though the chi-square distribution is used for defining confidence intervals, the average estimated failure rate is not representative of the subpopulation of samples. One of the problems associated with the various databases is the use of purely statistical techniques when defining confidence intervals. Other manuals, commonly denoted as second-generation manuals, are similar to MIL-HDBK-217 but collect data essentially for components in the field of telecommunications. Among these, we note the RPP manual published by Bellcore since 1984, the HRD manual published by British Telecom and the Italtel IRPH93 manual (produced with the collaboration of both the French CNET and British Telecom). The RPP manual reports data mainly for devices and components used in the telecommunications field, covering five different areas. A common element which characterizes these databases is the hypothesis of components with a constant failure rate.
5.5 Third Generation Data Banks

It appears evident that one of the objectives in the definition of a handbook should be the possibility of utilizing data adequately characterized by their uncertainty. The problem then arises of how to define the uncertainty of reliability measurements. Lacking a single, common methodology for the definition of uncertainty, some handbooks use quartiles (e.g. the Swedish handbook T-BOOK, Reliability Data of Components in Nordic Nuclear Power Plants), others use confidence intervals (the Italian handbook EIREDA, European Industry Reliability Data Handbook), and others omit this information altogether, the typical situation in handbooks used in military environments. With the "third generation" of databases, attention has moved from the military and aerospace industries to intrinsically critical installations such as nuclear power plants, oil rigs and the chemical industry. Modern public-domain reliability databases are the following:

• IEEE-Std-500 (Piscataway, NJ, 1984)
• OREDA (Offshore Reliability Data, Norway, 1984)
• EIREDA (European Industry Reliability Data Handbook, Italy, 1991)
• T-BOOK (Reliability Data of Components in Nordic Nuclear Power Plants, Sweden)
• CCPS (Guidelines of the Center for Chemical Process Safety, New York, 1989)
• NSWC-94/L07, Handbook of Reliability Prediction Procedures for Mechanical Equipment.

In particular, the last handbook cited above, developed by the Naval Surface Warfare Center, Carderock Division, provides failure rates for basic classes of mechanical components (belts, springs, bearings, brakes and clutches, to cite just a few). Its failure rate models take into account the impact of several factors on the reliability of components. To see why, consider that for a spring the most common failure modes are due to fatigue and excessive loads: the reliability of a spring depends on the material of which it is made, the working environment and the way in which the design is carried out. It is obvious that the use of these models requires large amounts of data that may not be known to the user. Another limitation of this database is that its evaluations do not examine a parameter relative to manufacturing defects. It is interesting to observe that there is no unique "profile" of the users of databases. The information collected is obviously useful to the design engineer (interested in failure mechanisms and modes), to the risk analyst (interested in the availability of the system, or rather the probability of a successful mission given the availability of components and their failure rates), and to maintenance experts, ever more attentive to service performance. The critical element in the design of a handbook is the mode in which data are collected and the definition of the attributes which define intervals for measuring reliability.
To better understand this concept, take the example of a very simple and widely used component: the resistor. It is sufficient to consult any catalogue, including on-line catalogues, for acquiring this elementary component to understand how fundamental the definition of the parameters that characterize the resistor is. Under the heading of resistors, many varied types of components and usage applications are listed (e.g. from resistors for printed circuits to traction applications). In second-generation databases, the determination of the correspondence among the attributes is left to the user, while in third-generation databases a hierarchical approach was implemented: it furnishes the user with a guide when knowledge of the attributes is insufficient.
5.6 Calculation of the Failure Rate

This section demonstrates the calculation of the failure rate in an electronic environment following the procedures of MIL-HDBK-217. As made clear by the definition of reliability in Chapter 1, any evaluation of the failure rate, and therefore of reliability, cannot be undertaken without a knowledge of the working environment in which the system must operate. Data banks in electronics, and therefore MIL-HDBK-217, propose a classification of working environments into:

• Fixed protected environment: characterized by a high level of insensitivity to the atmospheric environment in regard to temperature, as well as the control of humidity within defined limits. An example is given by electronic apparatus housed in masonry buildings. MIL-HDBK-217 classifies this as a "ground benign" environment (GB), with controlled temperature and humidity, no mechanical stress, and easy access for maintenance activity.
• Fixed unprotected environment: characterized by thermal and mechanical stress determined directly by natural climatic conditions. MIL-HDBK-217 calls this "ground fixed" (GF), characterized by moderately controlled environmental conditions. This is typical of apparatus installed in the open air, e.g. electronic control units for traffic control, environmental monitoring, telecommunication equipment and radar.
• Mobile environment: characterized by mechanical stress and temperature gradients of a certain severity, typical of portable equipment or equipment mounted on mobile platforms. MIL-HDBK-217 identifies this as GM (ground mobile).

It should be remembered that, in addition to these, MIL-HDBK-217, unlike other data banks in electronics, classifies a further eleven operating environments, including naval (N), aeronautical (A) and space (S), as well as environments characterized by particularly critical conditions with high stress for the component.
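In software, the environment classes map naturally to a lookup table keyed by environment code. A minimal sketch follows; the numeric πE values are assumed placeholders for illustration, not taken from the handbook tables.

```python
# Illustrative pi_E lookup by MIL-HDBK-217 environment code.
# The numeric values are placeholder assumptions, NOT handbook data.
PI_E = {
    "GB": 1.0,  # Ground benign: controlled temperature/humidity, no mechanical stress
    "GF": 2.0,  # Ground fixed: moderately controlled environmental conditions
    "GM": 4.0,  # Ground mobile: mechanical stress and thermal gradients
}

def environmental_factor(env_code):
    """Return the environmental factor pi_E for a given environment code."""
    return PI_E[env_code]
```

The ordering is the point of the sketch: harsher environments carry larger πE, and hence larger predicted failure rates under the multiplicative model.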
In all reliability handbooks, the operating environment is identified by means of the πE factor (environmental factor). The prediction models for electronic components present in the handbooks refer, furthermore, to the following hypotheses: system in functional series configuration, independent failures and constant failure rate. It is assumed, furthermore, that Arrhenius's law, which is the model describing the physical-chemical degradation of a component, relates the time to failure to the level of thermal stress applied. Based on such hypotheses, the failure rate of the system can be immediately calculated using the relations of the series functional configuration reported in Chapter 3.

Example 1

Calculation of the failure rate of a signalling lamp of an aircraft. We assume that the lamp is positioned on the aircraft and constantly functioning at 24 V DC over an inhabited area. The model for this component in MIL-HDBK-217E (section 5.1.17) is:
λp = λb · πu · πA · πE   [failures / 10^6 h]   (5.1)
with: λb = 0.074 · 24^1.29 = 4.5 failures/10^6 h the base failure rate; πu = 1.0 the utilization factor; πA = 3.3 the application factor; πE = 4.0 the environmental factor. Environmental conditions are considered as those for an unprotected device. The value of the failure rate for this component is:
λp = 4.5 · 10^-6 · 1.0 · 3.3 · 4.0 h^-1 = 59 failures / 10^6 h   (5.2)
Consequently, the Mean Time To Failure is:

MTTF = 1 / λp = 1 / (59 · 10^-6) = 1.7 · 10^4 h   (5.3)
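Example 1 can be reproduced numerically with the model and factor values given above:

```python
# Example 1: lamp model lambda_p = lambda_b * pi_u * pi_A * pi_E (MIL-HDBK-217E).
lambda_b = 0.074 * 24 ** 1.29            # base rate, ~4.5 failures / 10^6 h
pi_u, pi_a, pi_e = 1.0, 3.3, 4.0         # utilization, application, environment
lambda_p = lambda_b * pi_u * pi_a * pi_e # ~59 failures / 10^6 h
mttf_hours = 1e6 / lambda_p              # ~1.7e4 h
```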
Example 2
Calculation of the failure rate for a bridge rectifier with four equal diodes. From the point of view of reliability, the diodes have to be considered in series configuration. Recalling MIL-HDBK-217E, the failure rate of a single component (diode) is represented by the following equation:

λp = λb · πE · πQ · πC · πS · πT   [failures / 10^6 h]   (5.4)
Assuming the diode is a "power rectifier, fast recovery" of JAN quality, with a metallic junction, used in a "ground fixed" environment with voltage stress ratio VS = Vdmax/VRRM = 0.78 and junction temperature Tj = 166 °C, we have: πE = 6.0 the environmental factor; πQ = 2.4 the quality factor; πC = 1.0 the construction factor; πS = 0.547 the voltage stress factor; πT = 28 the temperature factor. These values give λp = 15 failures/10^6 h for a single diode. For the bridge rectifier we obtain:
λtotal = 4 · λp = 60 failures / 10^6 h   (5.5)
and the corresponding MTTF for a single diode:

MTTF = 1 / λp = 0.067 · 10^6 h = 6.7 · 10^4 h   (5.6)
and for the system:

MTTFS = 1 / λtotal = 0.017 · 10^6 h = 1.7 · 10^4 h   (5.7)
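The series combination in Example 2 generalizes directly: under the constant-failure-rate hypothesis, the rates of series elements simply add. A short sketch:

```python
# Series-system failure rate: the bridge fails when any one diode fails,
# so under the constant-rate hypothesis the diode rates add.
def series_failure_rate(rates):
    """Sum of constant failure rates (failures per 10^6 h) of series elements."""
    return sum(rates)

lambda_diode = 15.0                                     # failures / 10^6 h, per Example 2
lambda_bridge = series_failure_rate([lambda_diode] * 4) # 60 failures / 10^6 h
mttf_diode = 1e6 / lambda_diode                         # ~6.7e4 h
mttf_bridge = 1e6 / lambda_bridge                       # ~1.7e4 h
```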
5.7 FIT: A More Recent Unit

For electronic components, failure rates λ lie between 10^-10 h^-1 and 10^-7 h^-1. In particular, failure rates of about 10^-10 h^-1 are typical of passive electronic components, while rates of about 10^-7 h^-1 are more frequent for active devices, in particular VLSI ICs. In order to manipulate such data more easily, a new unit has been introduced: the unit 10^-9 h^-1 is designated Failures In Time (FIT), i.e. failures per 10^9 h [1-4]. For example, the results obtained in Example 1 and Example 2 can be rewritten in the following way. Eq. (5.2) can be written as:
λp = 59 failures / 10^6 h = 59 · 10^-6 h^-1 = 59000 · 10^-9 h^-1 = 59000 FIT   (5.8)
and eq. (5.5) becomes:

λtotal = 60 failures / 10^6 h = 60 · 10^-6 h^-1 = 60000 · 10^-9 h^-1 = 60000 FIT   (5.9)
If the FIT value is available, for example λ = 200000 FIT, then:

λ = 200000 FIT = 200000 · 10^-9 h^-1 = 200 · 10^-6 h^-1   (5.10)
References

[1] Birolini, A.: Reliability Engineering: Theory and Practice. Springer, Heidelberg. ISBN 978-3-642-14951-1
[2] U.S.A. Department of Defense: MIL-HDBK-217F, Military Handbook: Reliability Prediction of Electronic Equipment (and later versions) (1991)
[3] IEEE Std: IEEE Guide for Selecting and Using Reliability Predictions Based on IEEE 1413 (February 19, 2003)
[4] U.S.A. Department of Defense: MIL-HDBK-781, Military Handbook for Reliability Test Methods, Plans, and Environments for Engineering, Development, Qualification, and Production, Revision D
Chapter 6
Repairable Systems and Availability
Abstract. In Chapter 6 the concept of availability is explained and discussed. Availability refers to repairable systems. For such systems the time of operation is not continuous, since their operating life cycles are described by a sequence of up and down states: the system operates until it fails, is then repaired, and so returns to its original functioning state. It will fail again after some random time of operation, get repaired again, and this process of failure and repair will repeat; hence the state of the system alternates between an operating state and a repair state. In this case, the important variables to be determined are the times to failure and the times to repair. Availability is defined as "the aptitude of the element to perform its required function in given conditions at a given point in time or during a given time interval, assuming that any eventual external resource is assured." The availability of a machine can also be defined as the percentage of time, with respect to total time, in which the machine is required to function.
6.1 Introduction

When considering repairable systems or components, in addition to defining reliability, the function of availability must also be clarified. As seen in Chapter 1, reliability is defined as the probability that a device performs a specific function up to a specific time interval, in pre-established conditions of use. This concept does not allow for an interruption in service: in cases where maintenance is scheduled, it must be carried out in time intervals when the device or component is not in use. In repairable systems, where the system is unavailable for the time necessary to effect repairs or maintenance, availability implies that the system will not be functioning for a given time interval. Availability is therefore a more general function that takes into account both the reliability of a system and maintenance aspects, that is, the return to normal functioning after a failure. The IEC 60050 (191) standard, also known as the International Electrotechnical Vocabulary (IEV), defines availability as "the ability of an item to be in a state to perform a required function under given conditions at a given instant of time or over a given time interval, assuming that the required external resources are provided" [1]. The availability of a machine can also be defined as the percentage of time, with respect to total time, in which the machine is required to function.
6.2 Mean Time To Repair/Restore (MTTR)

In the case of repairable components, the parameter that expresses the mean time from the onset of a failure to its complete repair is fundamental. This is known as the Mean Time To Repair/Restore (MTTR). Maintainability is a property of repairable systems and is defined as the ease with which a system can be repaired once a malfunction (or failure) manifests itself. Maintainability is the probability M(t) that a malfunctioning system can be restored to correct functioning within time t. This is closely correlated with availability, because the shorter the interval needed to restore the system to proper functioning, the higher the probability of finding a functioning system at a given time. For the extreme value M(0) = 1, the system in question would always be available.
Fig. 6.1 Behavior of a repairable system.
Similar to the MTTF (Mean Time To Failure) that characterizes non-repairable devices, for repairable systems we refer to functions analogous to those already defined for reliability. Together, these functions are called the functions of maintainability.

Table 6.1 Analogy between the functions of maintainability and the functions of reliability.

Functions of maintainability | Analogous functions of reliability
g(t) Probability density of repair | f(t) Probability density of failure
M(t) Probability of repair (maintainability) | F(t) Unreliability
N(t) Probability of non-repair | R(t) Reliability
μ(t) Repair rate (instantaneous) | λ(t) Failure rate (instantaneous)
The MTTR is defined by:

MTTR = Σ_i t_i ⋅ g(t_i) ⋅ Δt_i      (6.1)

For such functions, relationships identical to those of reliability are also valid. Therefore, taking t = 0 as the time at which a failure occurs, we have:

t_i = i-th repair time
g(t)⋅Δt = probability that repair will finish within the interval [t, t+Δt]
M(t) = probability that repair will finish within the interval [0, t]
μ(t)⋅Δt = probability that repair, not completed at time t, will finish within the interval [t, t+Δt].
6.2.1 A Particular Case

If the repair rate is constant, μ(t) = μ, we have:

MTTR = 1/μ      (6.2)
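The relation between (6.1) and (6.2) can be illustrated with a small simulation sketch (not part of the text; the repair rate value is a hypothetical assumption): for exponentially distributed repair times, the sample mean of the repair times approximates the weighted sum of Eq. (6.1) and tends to 1/μ.

```python
import random

# Illustrative sketch: estimating MTTR from simulated repair times.
# The repair rate mu is a hypothetical value, not taken from the text.
mu = 0.5  # repairs per hour, so MTTR = 1/mu = 2 h by Eq. (6.2)
random.seed(1)

# Repair times drawn from the exponential density g(t) = mu * exp(-mu * t)
repair_times = [random.expovariate(mu) for _ in range(100_000)]

# Sample-mean estimate of Eq. (6.1): MTTR = sum_i t_i * g(t_i) * dt_i
mttr = sum(repair_times) / len(repair_times)
print(round(mttr, 2))  # close to 1/mu = 2.0
```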
6.3 Mean Time Between Failures (MTBF)

The mean time between failures (MTBF) can be defined in two ways:
− MTBF is the MTTF of repairable devices;
− MTBF is the sum of the device's MTTF plus the MTTR (Mean Time To Repair/Restore).
Fig. 6.2 Definitions for MTBF (up/down timelines for a non-repairable and a repairable component under the two terminologies: in the first, used in the following, MTBF coincides with the up time MTTF; in the second, MTBF = MTTF + MTTR).
It may seem more logical to use the second definition, since the same terminology can then be maintained for the MTTF regardless of whether the device can be repaired, one case being a theoretical extension of the other. Figure 6.2 graphically illustrates the difference between the two MTBF definitions.
6.4 The Significance of Availability in the Life Cycle of a Product

Figure 6.3 illustrates the time frames for both the functioning and fault phases of the elements used in the analysis of availability. "C" and "P" represent the periods of time attributed, respectively, to corrective maintenance (performed after a failure) and preventive maintenance (carried out before a system failure), often spent waiting for the resources necessary to complete the work. Availability, therefore, is the probability of being able to function correctly at the required moment, independent of any previous failure subsequently repaired, and not up to a determined point in time as asserted in the definition of reliability. This concept implies that the device may be non-functioning at certain times: a system can display high availability levels notwithstanding frequent but short periods of malfunctioning.

Key - TT: Total Time of use; Up Time: functioning time; Down Time: non-functioning time; OT (Operating Time): part of up time when effective use takes place; ST (Stand-by Time): part of up time during which effective use is waiting to begin and the system is assumed to be operating; TMT: Total Maintenance Time; ALDT (Administrative and Logistic Down Time): time often spent waiting for parts and personnel for maintenance; TCM: Total Corrective Maintenance time; TPM: Total Preventive Maintenance time.
Fig. 6.3 Time frame of a repairable system.
Availability is a good parameter for characterizing systems where occasional malfunctioning is acceptable, provided that in most circumstances the system functions correctly. The basic mathematical definition of availability A is:

A = UpTime / TotalTime = UpTime / (UpTime + DownTime)      (6.3)
The actual evaluation of availability is carried out by substituting the temporal elements with other parameters that express the desired function; different formulations thus exist, aimed at specific objectives. Under certain circumstances, it is necessary to define the availability of a repairable system only with regard to effective work time and corrective maintenance time. This is known as inherent availability and is expressed as:

A = MTTF / (MTTF + MTTR) = MTTF / MTBF      (6.4)

In these ideal conditions, waiting time and time associated with preventive maintenance are neglected (MTTR is calculated considering only corrective maintenance time). This quantity (dimensionless and between 0 and 1) takes on a double significance:
1. a posteriori, it measures the "efficiency" of a system for which the parameters MTTF, MTTR and MTBF have been determined;
2. instantaneously, it is the probability that the system is available (not under repair).
The complement to 1 of availability is called Unavailability (U):

U = 1 − A = MTTR / (MTTF + MTTR)      (6.5)
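The arithmetic of (6.4) and (6.5) can be sketched with hypothetical field data (the MTTF and MTTR values below are illustrative assumptions, not from the text):

```python
# Illustrative sketch with hypothetical data for a repairable unit (hours).
mttf = 980.0   # mean time to failure
mttr = 20.0    # mean time to repair (corrective maintenance only)

mtbf = mttf + mttr            # MTBF, second definition of Section 6.3
A = mttf / mtbf               # inherent availability, Eq. (6.4)
U = mttr / mtbf               # unavailability, Eq. (6.5)

print(A, U)                   # 0.98 0.02
assert abs(A + U - 1.0) < 1e-9   # U is the complement to 1 of A
```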
6.5 Instantaneous Availability

Stationary availability, or simply "availability", is the limit value (t → ∞) of a time-dependent quantity called the instantaneous availability A(t); this quantity represents the a priori mean availability estimated at time t. The plot of the instantaneous availability depends on the initial conditions (at the instant t = 0, the system can be "functioning" or "in failure"); in any case, the limit value A(t → ∞) is always A.
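This behavior can be sketched numerically (an illustration not taken from the text; the rates are hypothetical). For a single repairable element with constant failure rate λ and repair rate μ that is functioning at t = 0, the classical closed form is A(t) = μ/(λ+μ) + (λ/(λ+μ))·e^(−(λ+μ)t), so A(0) = 1 and A(t) tends to the stationary value μ/(λ+μ):

```python
import math

def instantaneous_availability(t, lam, mu):
    """A(t) for one repairable element with constant failure rate lam and
    repair rate mu, assumed functioning at t = 0 (classical closed form)."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

lam, mu = 1e-3, 1e-1            # hypothetical rates, per hour
A_limit = mu / (lam + mu)       # stationary availability A = A(t -> infinity)

print(round(instantaneous_availability(0.0, lam, mu), 6))   # 1.0 at t = 0
print(round(instantaneous_availability(1e5, lam, mu), 6))   # close to A_limit
```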
6.6 Dependability: An Evaluation of the "Level of Confidence" in the Correct Functioning of the System

In general, reliable systems are utilized in situations where it is necessary to guarantee a series of performance characteristics, for example, safety or operational
availability. Recently, the word "reliability" has often been replaced by the term "dependability", which corresponds to "faith in the correct functioning of the system." In order to define dependability, it is first necessary to clarify the concepts of service and user. The service furnished by a system is the behavior of the system itself as perceived by its users; the user of a system is another system which interacts with it by way of an interface. The function of a system is what we expect from the system itself, and its description is furnished by the functional specifications. Service is correct if the specified functions of the system are performed.

A system is considered dependable if it has a high probability of successfully carrying out its specified functions. This first presumes that the system is available. Furthermore, in order to completely specify a function of the system, it is necessary to define all the environmental and operative requirements under which the system must provide the desired service. Dependability is therefore a measure of how much faith we have in the service delivered by the system.

The design and implementation of "dependable" systems requires an appropriate methodology for identifying the possible causes of malfunctions, commonly known as "impediments", together with techniques to eliminate, or at least limit, the effects of such causes. Consequently, in order to deal with the problem of dependability, we need to know what impediments may arise and the techniques to avoid their consequences. Systems that utilize such techniques are called fault tolerant.

Impediments to dependability assume three aspects: fault, error and failure. A system is in failure when it does not perform its specified function. A failure is therefore a transition from a state of correct service to a state of incorrect service. The periods of time when a system is not delivering its service are called outage periods.
Inversely, the transition from a period of non-service to a state of correct functioning is the restoration of service (Figure 6.4). Possible system failures can be subdivided into classes of severity with respect to the possible consequences of the failure and its effect on the external environment. A generally used classification separates failures into two categories: benign and catastrophic.

Constructing a dependable system includes the prevention of failures. To attain this, it is necessary to understand the process which may lead to a failure. This process originates from a cause (a fault), inside or outside the system, which may remain dormant for a period of time until its activation. The activation of a fault leads to an error, that part of the state of the system that can cause a subsequent failure. The failure is therefore the externally observable effect of an error in the system. Errors are said to be in a latent state until they become observable and/or lead to a state of failure. Similar failures can correspond to many different errors, just as the same error can cause different failures.
Fig. 6.4 States of a system (a failure causes the transition from correct to incorrect functioning; restoration causes the reverse transition).
Fig. 6.5 Chain fault – error – failure.
Systems are collections of interdependent components (elements, entities) which interact among themselves in accordance with predefined specifications. The chain fault – error – failure presented in Figure 6.5 can therefore be utilized to describe both the failure of a system and the failure of a single component. One fault can lead to successive faults, just as an error, through its propagation, can cause further errors. A system failure is often observed at the end of a chain of propagated errors.
6.7 The Prerequisites of Dependability

In order to measure the level of dependability reached by the system under analysis, it is necessary to evaluate a group of characteristics of the system. These characteristics generally assume a different role and importance with respect to the prerequisites of the system itself.
The principal prerequisites of dependability are Reliability, Maintainability and Availability (defined above), together with:

Safety, defined as the probability S(t) that the system will not malfunction when it is required to operate, or that, even in the presence of a malfunction, the safety of the personnel or machinery related to the system is not compromised (the absence of unacceptable risks, as defined in Chapter 1). In other words, this is a measure both of the capacity of the system to function correctly and of its capacity to fail without generating serious consequences. Note that safety differs from reliability and availability in that the latter regard correct functioning only and do not include the effects derived from malfunctions.

Testability, defined as the ability to identify characteristics of a system by means of tests and/or measurements. This is clearly related to maintainability, since the easier it is to test a malfunctioning system to identify component failures, the shorter the time needed to restore the system to correct functioning.

Performability, P(L,t), a function of time, defined as the probability that the level of functionality of the system at time t is at least equal to L. It is a measure of the capacity of a system to furnish a determined quantity of work during a certain time interval, even in the presence of failures. This plays a fundamental role in the design of systems in which the presence of faults does not imply the loss of all functioning but only a reduction in its level.
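As an illustration of performability, consider the following sketch (not taken from the text: the configuration of n identical, independent, non-repairable servers and all parameter values are hypothetical assumptions). The functionality level is the number of servers still working, and P(L, t) is the probability that at least L of them are up at time t:

```python
import math
from math import comb

# Hypothetical performability sketch: n identical, independent,
# non-repairable servers, each surviving to time t with probability
# exp(-lam * t). P(L, t) = probability that at least L servers are up.
def performability(L, t, n=4, lam=0.01):
    p = math.exp(-lam * t)   # survival probability of one server
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(L, n + 1))

# A degraded level (L = 2 of 4) remains far more probable at t = 100
# than full service (L = 4), which is the point of performability.
print(round(performability(4, 100.0), 4), round(performability(2, 100.0), 4))
```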
References

[1] IEC 60050-191, International Electrotechnical Vocabulary, 1.0 edn., ch. 191: Dependability and quality of service, December 31 (1990); forecast publication date for 2.0 edn.: June 2 (2012). IEC International Electrotechnical Commission, Geneva
Chapter 7
Techniques and Methods to Support Dependability
Abstract. In this chapter some reliability techniques will be considered. These techniques, classified into quantitative and qualitative, are methods of analysis used to evaluate the dependability parameters and the failure modes to which a realistically complex system is or could be subjected. The technique described here is that of Markov models; other techniques, such as FMEA, FMECA and FTA, will be discussed in Chapter 8. Markov models are characterized by a particular representation of availability: a matrix in place of a single index. The matrix representation permits studying the behavior of the system, under different hypotheses, as a stochastic process, and studying its temporal evolution.
7.1 Introduction

In regard to the dependability of a real complex system (dependability is defined in Chapter 1), it is possible to conduct a twofold evaluation:

Probabilistic or Quantitative Evaluation, the objective of which is to estimate the attributes of dependability for a system and/or its components. Some techniques to this aim are discussed in the present chapter.

Qualitative Evaluation, with the aim of understanding how component malfunctioning can lead to a loss of function and performance as well as to system failure, and of fully understanding the possible consequences. This is discussed in Chapter 8.

Quantitative methods have been widely discussed in the literature and numerous well-tested techniques have been developed and utilized with success. Depending on the level of abstraction considered for the system under analysis, analytical (or axiomatic) and experimental techniques can be considered. The evaluation of the behavior of a system can be carried out in the following ways:

• Experimental (empirical): a prototype of the system is used and the parameters are estimated by means of statistical data. The experimental approach is generally characterized by the following drawbacks:
− it is much more expensive and complex than the analytical technique;
− a system prototype is often unavailable;
− dependability is difficult to evaluate because long observation times are necessary.
• Analytical and simulative: the parameters are deduced directly from a mathematical model or graph of the system itself (a model of the system and its components).
7.2 Introduction to Quantitative Techniques

The use of quantitative techniques is based on the analytical description (through equations) or graphic description (through diagrams) of the behavior of the system. Measurements of the dependability performance are obtained as a function of the parameters of the model, which typically includes probability distributions to represent the random phenomena connected to malfunctions.

Systems under analysis are mainly divided into two classes: discrete systems and continuous systems. In discrete systems, one or more quantities of the system change instantaneously, but only at separate time instants. In continuous systems, instead, the quantities can change continuously over time. An example of a discrete system is the queue of data packets at a networked computer: the state of the system changes at separate instants of time (caused by a packet leaving or a new packet arriving). A train represents an example of a continuous system: its speed and its position with respect to a railway station vary continuously in time. Often systems are both discrete and continuous at the same time, depending on the quantity being analyzed.

In analytical models, the components of the system are represented by state variables and parameters, and their interactions are represented through the relationships among these quantities. In simulation models, instead, the dynamic behavior of the system is reproduced over time. The evaluation of these models requires the execution of a program, denoted as a simulator, that reproduces the temporal evolution of the system and furnishes an estimate of the measurements of interest. The most useful distinctions in the classification of models are between static and dynamic models, between deterministic and stochastic models, and between continuous and discrete models.
A static model permits the representation of a system at an established time instant, which can be useful in obtaining information on the static characteristics of the system. A dynamic model, on the other hand, represents the temporal evolution of the system, e.g. through the interaction among the components constituting it. A model is deterministic when it contains no components that present probabilistic behavior; on the contrary, a model is stochastic when it contains one or more components that demonstrate probabilistic behavior. The distinction between continuous and discrete models is similar to that made for continuous and discrete systems. In this case, however, what matters is not whether the quantity of interest is itself continuous, but whether, for the purposes of the analysis, we wish to represent it as a variable that evolves with or without continuity in time. Models used for the analysis of computerized systems (mainly discrete systems) are almost always dynamic and stochastic.
7.3 Evaluation of Availability Using Analytical Models

Complex systems can be analyzed with regard to the calculation of dependability and availability if they are considered as consisting of a group of variously connected entities. In particular, when calculating the above-mentioned quantities, it is first necessary to identify the dependability and availability of the single entities. Then, given the possibility of redundancy, one must identify the configurations which permit the system to function in accordance with its design specifications. Finally, it is necessary to establish the connections between the individual failures of the entities and those of the system as a whole. The entities have, in addition, reliability and availability indexes depending on their quality levels, the maintenance policy of the producer and their own interconnections.

For these reasons, the use of a single technique may not be sufficient in all cases; in general, appropriate combinations of techniques can be used for the construction and solution of a model. The techniques most utilized for constructing models are combinatorial analysis methods and Markov processes. The stochastic methods of combinatorial reliability are the simplest and most intuitive, both in the construction and in the solution of the model, but they are often inadequate in representing the frequently complex dependences among the different components of the system. In addition, they do not permit the representation of repairable systems. Furthermore, such methods assume complete knowledge of the structure of the system and can be applied only when the system is defined in all its constructive details (e.g. series and parallel configurations, as discussed in Chapter 3). Therefore, in place of combinatorial methods, Markov processes can be adopted to model complex systems when the coverage factor and the repair and/or maintenance factors have to be considered.
Markov models are considered for analyzing the dependability and availability of a system constituted of several entities that can fail independently of one another. They are also used to evaluate the safety performance of the system.
7.4 Markov Models

With respect to other techniques, Markov models are characterized by a different representation of availability: a matrix in place of a single index. The matrix representation permits studying the performance of the system, under different hypotheses, as a stochastic process, and studying its time evolution (not possible with other approaches). Furthermore, unlike other techniques, Markov models take into account repairability and the order in which failures occur in the system. A Markov model is based on a graphic representation of the system dependability which makes it possible to observe synthetically the effect of the unreliability of elementary components on the whole system, and to relate the characteristics of single components to the next highest level of complexity, and so on up to the highest level of the system.
The analysis of a system using a Markov model furnishes an analytical, uniform and synthetic method to calculate availability and other associated functions. It is particularly useful in graphically representing situations of failure and repair of component elements. A system constituted by N components can assume 2^N mutually exclusive configurations, denoted as system states. This is due to the fact that each individual device can assume only two mutually exclusive conditions: correct functioning or failure. In other words, the model defines as states of the system those combinations that correspond to the functioning or non-functioning of the system. Analogously, changes in the state of the system (due to changes in single components) are called transitions, and their chronology describes the temporal evolution of the system (path). The procedure requires calculating the probability of finding the system in each of the possible states. The evaluation of the system availability consists in summing the probabilities associated with the states defined as successful, i.e. those in which the system is functioning correctly.

The main problem in the construction of a Markov chain is the exponential growth in the number of states as the complexity of the system increases; handling the resulting stochastic processes of notable dimensions requires very compact representations. Nevertheless, several well-tested algorithms have been developed for such models. In order to apply a Markov model, it is necessary to verify the following hypotheses:

• the process must be stationary: its behavior must be the same at any moment under consideration and, consequently, the probability of a transition between two states must remain the same during the specified time interval.
In the case of availability models, this implies that the rates of transition between the two states (characterized by a failure rate λ and a repair rate μ) must remain constant during the time of observation. As is known, the corresponding probability density of the observed quantity is exponential.

• the process must be without memory: according to this hypothesis, the future random behavior of a system depends only on its actual state, and not on preceding states or on the way in which the actual state was reached. This means that the probability of transition pij depends only on the two states i and j and is completely independent of what occurred before the transition to state i. In some cases, such conditions can prove too simplified.

At the beginning, the state of the system is that in which all of its entities are functioning. To determine the probability that a system is in state s at a given instant, it is necessary to identify only the probabilities of transition from one state to another (if restore intervals are not foreseen, there is a continual decrease in performance). When the process is not stationary, the transition rates become functions of time; such processes are referred to as non-Markovian. Under certain conditions, techniques exist for solving these problems as well.
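The link between the two hypotheses can be checked numerically (an illustrative sketch, with a hypothetical rate): the exponential distribution is the memoryless one, i.e. P(T > s+t | T > s) = P(T > t) for every s and t.

```python
import math

lam = 0.2  # hypothetical constant failure rate

def survival(t, lam):
    # R(t) = exp(-lam * t): survival function of the exponential distribution
    return math.exp(-lam * t)

s, t = 3.0, 5.0
# Memoryless property: P(T > s+t | T > s) = P(T > t)
conditional = survival(s + t, lam) / survival(s, lam)
print(abs(conditional - survival(t, lam)) < 1e-12)  # True
```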
The Markovian approach can be applied both in a continuous-time setting (Markov processes) and in a discrete-time setting (Markov chains). States can be classified according to their characteristics:

• Ergodic group: a group of states such that, once the system has entered it, it is no longer capable of exiting (a process is ergodic if the time averages of the sample functions of the process converge to the corresponding ensemble averages).
• Transitory group: a group characterized by the fact that, once a transition has conducted the system outside the group, the system can no longer reenter it.
• Absorbing state: once a system has reached this state, the state can no longer change (e.g. a state corresponding to non-repairable damage to a component: once the component is in failure, with no possibility of repairing itself, the system is obliged to enter this state and can no longer exit from it).

Markov models are based on the concepts of state and transition:

• State: represents everything that must be known in order to describe the system at any given instant. In the case of a dependability model, a state represents a possible configuration of functioning and failed entities (functioning or non-functioning of the various components of the system). Usually, the starting state of any system is that in which all the components are functioning; a successive state could be the non-functioning of even a single component of the system.
• Transitions: describe the passage from one state to another upon the occurrence of an event: the failure of an entity or the restoration of a failed entity. State transitions are characterized by a probability, e.g. the probability of the failure of an entity or the probability of repair.

Markov models are characterized by a set of transition probabilities pij, each representing the transition from an initial state i to a final state j.
These transition probabilities must obey some initial rules:

1) Defining the hazard function - also known as the failure rate - as z(t):

z(t) = [R(t) − R(t + δt)] / [R(t) ⋅ δt]      (7.1)

the probability of transition from an initial state to a final state relative to the time interval δt is given by the product z(t)δt. In fact:

z(t) ⋅ δt = [R(t) − R(t + δt)] / R(t) = [F(t + δt) − F(t)] / R(t) = P[(t < T ≤ t + δt) | (T > t)]      (7.2)

where T is the time to failure.
2) The probability that in the interval δt two or more transitions will occur is infinitesimal and thus may be ignored.
7.5 Transition Matrix and Fundamental Equation

The transition probabilities pij can be grouped in a matrix P, denoted as the transition matrix, with the row index i representing the initial state and the column index j the state of arrival. The transition matrix has important properties: it is square and its rows are stochastic. This means that the sum of the elements of each row - the probability of remaining in a certain state plus the probabilities of exiting from it - is equal to 1. If we indicate with ρij the rate of transition from state i to state j and with Pij the probability of transition between these same two states in δt, assuming that the time interval δt is infinitesimal and considering as an infinitesimal of higher order the probability that two or more transitions occur in the same interval δt, we have:

ρij Δt = z(t) Δt = λ Δt = Pij      (7.3)

If Pi(t) is the probability of observing the system in state i at time t, the probability of observing the same state at time t + Δt is given by the sum of the probabilities of the two mutually exclusive events:

• the system was in state j at the instant t and passed to state i during Δt;
• the system was in state i at the instant t and did not pass to any other state during Δt:

Pi(t + Δt) = Σ_{j≠i} ρji Δt Pj(t) + [1 − Σ_{j≠i} ρij Δt] Pi(t)      (7.4)
Developing, dividing by Δt and passing to the limit for Δt → 0, we obtain a system of first-order differential equations which, solved for the given initial conditions (usually 1 for the initial state and 0 for the other states, i.e. the system is in the initial state with probability 1), yields the probability of being in each single state at a certain instant. The complexity of the system increases rapidly with the number of states: for N components, 2^N configurations are possible and therefore 2^N states whose probabilities must be studied.
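The construction above can be sketched in discrete time (an illustrative example, not from the text; the rates and the step size are hypothetical): for a single repairable element, the 2×2 transition matrix has stochastic rows, and repeatedly applying p(t+Δt) = p(t)·P propagates the state probabilities.

```python
# Discrete-time sketch of Section 7.5: one repairable element, states
# S0 = functioning, S1 = failed. Hypothetical rates and step size.
lam, mu, dt = 0.01, 0.1, 1.0

# Transition matrix: row i holds the probabilities of leaving (or keeping)
# state i; each row is stochastic, i.e. sums to 1.
P = [[1 - lam * dt, lam * dt],
     [mu * dt, 1 - mu * dt]]
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P)

# Start in S0 with probability 1 and propagate: p(t + dt) = p(t) . P
p = [1.0, 0.0]
for _ in range(2000):
    p = [p[0] * P[0][0] + p[1] * P[1][0],
         p[0] * P[0][1] + p[1] * P[1][1]]

# After many steps p[0] approaches the stationary availability mu/(lam+mu)
print(round(p[0], 4))  # ≈ mu/(lam+mu) = 0.9091
```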
7.6 Diagrams of State

The graphic representation of Markov models is based on the use of graphic symbols to define states and transitions. In general, states are indicated with circles and transitions with directed arcs.

Case 1 – Analysis of a system with one element

a) Non-repairable element
The most elementary situation is that of a system made of only one non-repairable element with a constant failure rate, which assumes only two states: the operative state (S0) and the failure state (S1).
We can define the following quantities:
• P0(t), probability that the component is functioning at time t;
• P1(t), probability that the component is in failure at time t;
• λ, constant failure rate.

The probability that the system is found in state S0 at the instant t + δt is given by the probability that it is found in state S0 at time t multiplied by the probability that it does not fail during the interval δt; if at instant t it is found in state S1, the probability of coming back to state S0 is 0, the component being non-repairable. Such probabilities can be represented by the diagram of state in Figure 7.1, from which we can write the following equations for the probability that the system will be found in S0 or S1 at time t + δt:

P0(t + δt) = P0(t)(1 − λδt) + P1(t) ⋅ 0
P1(t + δt) = P0(t)λδt + P1(t) ⋅ 1      (7.5)

Fig. 7.1 Diagram of state of a system made of only one non-repairable element (S0 loops with probability 1 − λδt, the transition S0 → S1 has probability λδt, and S1 loops with probability 1).
Taking the limit for δt → 0, we obtain the following first-order differential equations:

dP0(t)/dt = −λP0(t)
dP1(t)/dt = λP0(t)      (7.6)

These can be solved with the usual analytical methods after having established the initial conditions. It is generally assumed that at t = 0 the element is functioning, that is, P0(0) = 1 and P1(0) = 0.

b) Repairable element
In the case of a system consisting of a single repairable component with a failure rate λ and a repair rate μ, both constant in time, the diagram of state is modified as indicated in Figure 7.2, where the path from state S1 to state S0 represents the probability of transition μδt from the fault state (S1) to the functional state (S0) after the restoration of the element. From the diagram of state, we obtain the equations for the probability that the system is found in state S0 or S1 at time t + δt:
P0(t + δt) = P0(t)(1 − λδt) + P1(t)μδt
P1(t + δt) = P0(t)λδt + P1(t)(1 − μδt)      (7.7)

Fig. 7.2 Diagram of state of a system composed of one repairable element (S0 loops with probability 1 − λδt, S1 loops with probability 1 − μδt, the transition S0 → S1 has probability λδt and the transition S1 → S0 has probability μδt).
The corresponding differential equations are:

dP0(t)/dt = −λP0(t) + μP1(t)
dP1(t)/dt = λP0(t) − μP1(t)      (7.8)
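The system (7.8) can also be solved numerically. The following sketch (illustrative only; the rates and the integration step are hypothetical assumptions, and forward-Euler is just one simple choice of scheme) shows P0(t) tending to the stationary availability μ/(λ+μ), consistent with Eq. (6.4) since MTTF = 1/λ and MTTR = 1/μ:

```python
# Forward-Euler sketch of the state equations (7.8) for one repairable
# element; lam, mu and the step size are hypothetical values.
lam, mu, dt = 0.02, 0.5, 0.001

p0, p1 = 1.0, 0.0                # initial conditions: functioning at t = 0
for _ in range(int(50 / dt)):    # integrate up to t = 50
    dp0 = -lam * p0 + mu * p1
    dp1 = lam * p0 - mu * p1
    p0, p1 = p0 + dp0 * dt, p1 + dp1 * dt

# P0(t) is the instantaneous availability; for large t it tends to mu/(lam+mu)
print(round(p0, 4))  # ≈ mu/(lam+mu) = 0.9615
```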
Case 2 - Analysis of a system with two elements

Series systems and parallel systems are the easiest to analyze. Moreover, such systems are also those of major interest: in fact, the analysis of more complex systems can frequently be reduced to the analysis of series or parallel systems. One begins by analyzing simple systems consisting of only two devices. Initially, one assumes that the faults of the various components are independent, even if this simplifying hypothesis is not always justified in reality. Consider, for example, the reliability performance of two electrical lines, not too far from each other, positioned in a zone subject to seismic phenomena. An earthquake of sufficient intensity can simultaneously render both lines out of service; for faults connected to earthquakes, it is therefore not possible to assume that the two lines are independent of each other.

If the system is composed of two non-repairable elements there are four possible states, since the two cases in which only one element has failed must also be considered. Indicating with X1 and X2 the two functioning elements and with X̄1, X̄2 the same elements in failure, the four possible states are:

S0 = X1 X2
S1 = X̄1 X2
S2 = X1 X̄2
S3 = X̄1 X̄2      (7.9)
Fig. 7.3 Diagram of state of a system composed of two non-repairable elements, where λij⋅δt represents the probability of transition from state Si to state Sj (transitions: S0 → S1 with λ01δt, S0 → S2 with λ02δt, S1 → S3 with λ13δt, S2 → S3 with λ23δt; S3 is absorbing).
The system of equations becomes:

P0(t + δt) = P0(t)[1 − (λ01 + λ02)δt]
P1(t + δt) = P0(t)λ01δt + P1(t)(1 − λ13δt)
P2(t + δt) = P0(t)λ02δt + P2(t)(1 − λ23δt)
P3(t + δt) = P1(t)λ13δt + P2(t)λ23δt + P3(t)      (7.10)
From which, analogous to the case of a system composed of only one element, we can formulate the relative differential equations.
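The difference equations (7.10) can be stepped directly. In the sketch below the rates are hypothetical, and independence of the two elements is assumed, so that λ13 = λ02 and λ23 = λ01; the numerical P0 is then checked against the closed form e^(−(λ01+λ02)t).

```python
import math

# Sketch: stepping the difference equations (7.10) for two non-repairable
# elements, assuming independent failures (lam13 = lam02, lam23 = lam01).
l01, l02 = 2e-4, 5e-4          # hypothetical failure rates [1/h]
l13, l23 = l02, l01
dt = 0.5                       # time step [h]
P = [1.0, 0.0, 0.0, 0.0]       # system starts in S0

t = 0.0
while t < 4000.0:
    P = [
        P[0] * (1 - (l01 + l02) * dt),
        P[0] * l01 * dt + P[1] * (1 - l13 * dt),
        P[0] * l02 * dt + P[2] * (1 - l23 * dt),
        P[1] * l13 * dt + P[2] * l23 * dt + P[3],
    ]
    t += dt

print(round(P[0], 5), round(math.exp(-(l01 + l02) * t), 5))
```

Probability mass leaving S0 flows entirely into S1 and S2, and mass leaving S1 and S2 flows into the absorbing state S3, so the four probabilities sum to 1 at every step.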
7.7 Evaluation of Reliability

To evaluate the reliability function, the system must contain at least one absorbing state. As t tends to infinity, the probability of such a state tends to 1, while the probabilities of the other states tend to 0. As stated previously, the reliability of the system is given by the sum of the probabilities of the states that assure the proper functioning of the system. Consider the system seen previously, consisting of only one non-repairable element, represented in Figure 7.1. Reliability coincides with P_0(t), which can be deduced from (7.6):

\[
\begin{cases}
\dfrac{dP_0(t)}{dt} = -\lambda P_0(t)\\
\dfrac{dP_1(t)}{dt} = \lambda P_0(t)
\end{cases}
\;\Rightarrow\; R(t) = P_0(t) = e^{-\lambda t}
\]
(7.11)
in accordance with Chapter 2. For a system with two non-repairable elements, it is necessary to establish whether the elements are in series or in parallel. If the system is in series, the only state that represents proper functioning is S_0, and therefore
7 Techniques and Methods to Support Dependability
\[
\begin{cases}
P_0(t+\delta t) = P_0(t)\,[1 - (\lambda_{01} + \lambda_{02})\,\delta t]\\
P_1(t+\delta t) = P_0(t)\,\lambda_{01}\,\delta t + P_1(t)\,(1 - \lambda_{13}\,\delta t)\\
P_2(t+\delta t) = P_0(t)\,\lambda_{02}\,\delta t + P_2(t)\,(1 - \lambda_{23}\,\delta t)\\
P_3(t+\delta t) = P_1(t)\,\lambda_{13}\,\delta t + P_2(t)\,\lambda_{23}\,\delta t + P_3(t)
\end{cases}
\]
(7.12.a)
and

\[
R(t) = P_0(t) = e^{-(\lambda_{01}+\lambda_{02})\,t}
\]
(7.12.b)
Also in this case, this is in accordance with Chapter 2. If the system is considered parallel, the operative states are S0, S1, S2 and, since these are mutually exclusive:
\[
R(t) = P_0(t) + P_1(t) + P_2(t)
\]
(7.13)
and the value of R(t) will be calculated using the other differential equations.
7.8 Calculation of Reliability, Unreliability and Availability

To demonstrate the procedure for calculating the availability of a system, we consider the simple case of a repairable system consisting of only one element, whose state diagram is represented in Figure 7.2. In such a case, the system of differential equations (7.8) can be rewritten in matrix form:
\[
\left[\dfrac{dP_0(t)}{dt}\;\;\dfrac{dP_1(t)}{dt}\right] = \left[P_0(t)\;\;P_1(t)\right]\cdot\begin{bmatrix}-\lambda & \lambda\\ \mu & -\mu\end{bmatrix}
\]
(7.14)
From (7.14) the following two equations are obtained:

\[
\begin{cases}
P_0(t) = \dfrac{\mu}{\mu+\lambda}\,[P_0(0)+P_1(0)] + \dfrac{e^{-(\lambda+\mu)t}}{\mu+\lambda}\,[\lambda P_0(0) - \mu P_1(0)]\\
P_1(t) = \dfrac{\lambda}{\mu+\lambda}\,[P_0(0)+P_1(0)] + \dfrac{e^{-(\lambda+\mu)t}}{\mu+\lambda}\,[-\lambda P_0(0) + \mu P_1(0)]
\end{cases}
\]
(7.15)
Since P_0(0) + P_1(0) = 1, we have:

\[
\begin{cases}
P_0(t) = \dfrac{\mu}{\mu+\lambda} + \dfrac{e^{-(\lambda+\mu)t}}{\mu+\lambda}\,[\lambda P_0(0) - \mu P_1(0)]\\
P_1(t) = \dfrac{\lambda}{\mu+\lambda} + \dfrac{e^{-(\lambda+\mu)t}}{\mu+\lambda}\,[-\lambda P_0(0) + \mu P_1(0)]
\end{cases}
\]
(7.16)
Assuming the system to be functioning at time t = 0, so that P0(0) = 1 and P1(0) = 0, we obtain:
\[
\begin{cases}
P_0(t) = \dfrac{\mu}{\mu+\lambda} + \dfrac{\lambda\,e^{-(\lambda+\mu)t}}{\mu+\lambda}\\
P_1(t) = \dfrac{\lambda}{\mu+\lambda} - \dfrac{\lambda\,e^{-(\lambda+\mu)t}}{\mu+\lambda}
\end{cases}
\]
(7.17)
P0(t) and P1(t) represent the time-dependent probabilities that the system is operating (availability) or in a fault condition (unavailability). It is therefore possible to calculate the asymptotic probability of each state, for t → ∞:
\[
\begin{cases}
P_0(\infty) = \dfrac{\mu}{\mu+\lambda}\\
P_1(\infty) = \dfrac{\lambda}{\mu+\lambda}
\end{cases}
\]
(7.18)
The Availability over time of a repairable system is therefore equal to:
\[
A(t) = P_0(t) = \dfrac{\mu}{\mu+\lambda} + \dfrac{\lambda\,e^{-(\lambda+\mu)t}}{\mu+\lambda}
\]
(7.19)
and for t → ∞ becomes:

\[
A(\infty) = P_0(\infty) = \dfrac{\mu}{\mu+\lambda}
\]
(7.20)
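The closed form (7.19)-(7.20) is immediate to evaluate. A short sketch, with hypothetical rate values:

```python
import math

# Sketch of (7.19)-(7.20): availability of a single repairable element.
def availability(lam, mu, t):
    return mu / (mu + lam) + lam * math.exp(-(lam + mu) * t) / (mu + lam)

lam, mu = 1e-3, 5e-2                  # hypothetical rates [1/h]
a0 = availability(lam, mu, 0.0)       # equals 1: system working at t = 0
a_inf = mu / (mu + lam)               # asymptotic availability, (7.20)
print(a0, round(availability(lam, mu, 1e6), 6), round(a_inf, 6))
```

At t = 0 the two terms sum exactly to 1, and the exponential term dies out on the time scale 1/(λ+μ), leaving the steady-state availability μ/(μ+λ).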
7.9 Markov Analysis of a System: Application Example

With this example, we wish to describe the approach to be followed in order to evaluate the availability of a system using Markov techniques. The study considers a measurement system based on a GPS-based device capable of measuring the initial instant of fast transients superimposed on a sinusoidal voltage at 50 Hz. Referring to Figure 7.4, it is constituted by:

• a voltage transducer (V-VT), with a failure rate λ_TV = 1·10⁻⁴ failures/h;
• an event detector (comparator), with parameters λ_SC = 1·10⁻³ failures/h and μ_SC = 5·10⁻² repairs/h;
• a GPS unit, with λ_GPS = 2·10⁻⁴ failures/h;
• a data processing unit.
The signal is conducted across the voltage transducer V-VT, whose output is sent to the event detector. This, based on a comparator circuit, generates in the presence of a fast transient an output signal in TTL logic, which is detected by the GPS unit as an event with a resolution of 100 ns. For the sake of simplicity, we will consider the subsystem to be made up of the following three components:
Fig. 7.4 GPS based measuring system of fast voltage transients.
1. the voltage transducer (TV);
2. the event detector (comparator or SC);
3. the GPS unit.

According to the Markovian approach, for a system composed of N = 3 elements, the Availability model involves 2^N = 8 states. For illustrative purposes, we suppose that only the event detector is repairable, while a breakdown of the voltage transducer or of the GPS unit causes a failure of the system as a whole. In practice, these hypotheses imply that of the 8 possible states (which would all be reachable if every component were repairable), only 6 are now possible:

State S0: initial state where all devices are functioning. The probability of finding the system in this state at time t = 0 is equal to 1.
State S1: state in which the voltage transducer is in a fault condition. The probability of passing to this state from the initial state depends on λ_TV. Given that the transducer has been assumed to be non-repairable, this state is absorbing.
State S2: in this state the event detector is not working. Unlike the previous state, this is not an absorbing state. Therefore, in addition to the bidirectional transition with the initial state, determined by the failure rate λ_SC and the repair rate μ_SC of the event detector, two other transitions can occur, due to the failure of the V-VT and of the GPS unit.
State S3: state in which the GPS unit is not working. As in the case of the V-VT, this state is absorbing and the probability of entering it is connected to λ_GPS.
State S4: the system reaches this state due to the simultaneous failure of the event detector and the V-VT (this obviously takes place before the comparator can be repaired).
State S5: the system reaches this state when both the GPS unit and the comparator are not operating. Like S4, this is an absorbing state.
The model of the Markov states is represented in Figure 7.5. The probability of remaining in the individual states is not shown. The following transitions can be identified:

Transitions 1, 2, 3: respectively, from the initial state to states S1, S2, S3. They are caused by TV, SC and GPS failures.
Transitions 4, 5: from state S2 to state S4, due to a failure of the VT, and from state S2 to state S5, due to a failure of the GPS unit.
Transition 0: takes the system from S2 back to the initial state. It depends on μ_SC.
Fig. 7.5 State diagram of fast voltage transients detector.
It is useful to recall that states S1, S3, S4, S5 are absorbing states, while S0 and S2 are those in which the system is not considered to be in a fault condition. The sum of the probabilities of these two states, P0(t) + P2(t), gives the availability of the system. To apply the Markov model, we use a system of difference equations that relate the state probabilities over a time interval Δt:

\[
\begin{cases}
P_0(t+\Delta t) = P_0(t)\,[1 - (\lambda_{TV} + \lambda_{SC} + \lambda_{GPS})\,\Delta t] + P_2(t)\,\mu_{SC}\,\Delta t\\
P_1(t+\Delta t) = P_0(t)\,\lambda_{TV}\,\Delta t + P_1(t)\\
P_2(t+\Delta t) = P_0(t)\,\lambda_{SC}\,\Delta t + P_2(t)\,[1 - (\mu_{SC} + \lambda_{TV} + \lambda_{GPS})\,\Delta t]\\
P_3(t+\Delta t) = P_0(t)\,\lambda_{GPS}\,\Delta t + P_3(t)\\
P_4(t+\Delta t) = P_2(t)\,\lambda_{TV}\,\Delta t + P_4(t)\\
P_5(t+\Delta t) = P_2(t)\,\lambda_{GPS}\,\Delta t + P_5(t)
\end{cases}
\]
(7.21)
This is the system that, solved with the appropriate initial conditions, yields the probability associated with each state.
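The six-state system (7.21) can be stepped directly with the failure and repair rates quoted above. The sketch below is an illustrative forward-Euler integration (the step size is an assumption); the availability P0 + P2 that it produces after 10000 hours is on the order of 0.05, consistent with the numerical results discussed in Section 7.10.

```python
# Sketch: forward-Euler stepping of the six-state system (7.21), using the
# rates quoted in Section 7.9 (per hour). The step size dt is an assumption.
l_tv, l_sc, l_gps, m_sc = 1e-4, 1e-3, 2e-4, 5e-2
dt = 0.5                              # integration step [h]
P = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]    # system starts in S0

t = 0.0
while t < 10000.0:
    p0, p1, p2, p3, p4, p5 = P
    P = [
        p0 * (1 - (l_tv + l_sc + l_gps) * dt) + p2 * m_sc * dt,
        p0 * l_tv * dt + p1,
        p0 * l_sc * dt + p2 * (1 - (m_sc + l_tv + l_gps) * dt),
        p0 * l_gps * dt + p3,
        p2 * l_tv * dt + p4,
        p2 * l_gps * dt + p5,
    ]
    t += dt

avail = P[0] + P[2]                   # S0 and S2 are the operating states
print(round(avail, 4))
```

Every unit of probability leaving S0 or S2 enters one of the other states, so the six probabilities always sum to 1; availability is read off as P0 + P2.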
7.10 Numerical Resolution of the System

There are calculation codes able to solve the systems of equations obtained by using Markov models. They provide the numerical value of the probability associated with the various states of the system. In particular, for the example proposed, the system is available as long as it remains in states S0 and S2. Figure 7.6 represents a state diagram in which the transitions with a higher probability have been emphasized with larger arrows. The numerical values indicated in the figure refer to the probability of being in the various states after 10000 hours of operation. The probability of being in state S3 is equal to 65.4%, and that of being in S1 to 32.7%. Table 7.1 shows more detailed results relative to the values of the Availability calculated at various time instants. The trend of A(t) in various time intervals (for t from 0 to 1 h; from 0 to 1000 h; from 0 to 10000 h) is also presented.
Fig. 7.6 State diagram of the system in the example, with emphasis on the transition probabilities.

Table 7.1 Availability values for the example presented.

Time (h)    Reliability characteristic    Result
t = 1       P(State 1)                    0.9987
            P(State 3)                    0.00097
            R(t) ≡ A(t)                   0.9997
t = 10000   P(State 1)                    0.0488
            R(t)                          See Figures 7.7 and 7.8
7.10 Numerical Resolution of the System
107
It is observed that after 10000 hours of operation, the availability of the system has decreased to about 0.05. As the graphs also demonstrate, the probability that the system is still working after one year (8760 hours) is very low (0.7%).
Fig. 7.7 Reliability function R(t) = exp(−3·10⁻⁴·t) of Table 7.1, for t up to 1 h.
Fig. 7.8 Reliability function R(t) = exp(−3·10⁻⁴·t) of Table 7.1, for t up to 10000 h.
7.11 Possible Solutions to the Absence of Memory of the Markov Model

The most restrictive hypothesis in the application and interpretation of the results provided by the Markov model is the lack of memory. As stated previously, the probability of finding the system in a given state depends only on the probability of exiting from the preceding state, and not on the history of the system. In practice, however, this implies disregarding the influence that one fault may have on another. For example, according to Figure 7.9, the rate λ_GPS of the GPS unit is the same both in the transition to the state S3 and in the transition to the state S5, where a failure has already occurred.
Fig. 7.9 State diagram of the system, also considering the simultaneous occurrence of multiple failure events.
Therefore, in the Markov model, the failure of the event detector does not influence the behavior of the GPS unit. However, this may not be true: for instance, a short circuit occurring at the comparator could easily damage both the voltage transducer and the GPS unit. In order to understand more fully the effect produced on the system as a whole by a failure, it is necessary to introduce two further transitions, which go from the initial state, respectively, to the state of simultaneous failure of the event detector and the transducer and to the state of simultaneous failure of the event detector and the GPS unit (Figure 7.9). For the two new transitions, a failure rate derived from the product of the failure rates of the two components is assumed: the transition takes place only if both elements are in a fault condition. In particular, we have:

Transition 7: from state S0 to state S4. The transition depends on λ_SC/VT, which is given by a combination of λ_SC and λ_VT.
Transition 8: from state S0 to state S5. The transition depends on the simultaneous failure of the SC and GPS units. The relevant failure rate λ_SC/GPS is given by a combination of λ_SC and λ_GPS.
References

[1] Birolini, A.: Reliability Engineering – Theory and Practice. Springer, Heidelberg. ISBN 978-3-642-14951-1
[2] Leemis, L.M.: Reliability: Probabilistic Models and Statistical Methods, 2nd edn. ISBN 978-0-692-00027-4
Chapter 8
Qualitative Techniques
Abstract. In the previous chapter, quantitative methods useful for reliability evaluation were presented, in which the behavior of a system is evaluated with analytical and graphical methods. A second way to study the behavior of a system is based on a qualitative approach. Such methods make it possible to understand the mechanisms of system failures and to identify all the potential weaknesses of the system under evaluation. Two main techniques of analysis will be presented in this chapter: Failure modes and effects analysis (FMEA) and Failure modes, effects and criticality analysis (FMECA). These techniques are able to highlight the failure modes leading to a negative final effect. A third method presented here is based on a quite different approach: Fault Tree Analysis (FTA), a deductive method, starts from the final effect and studies the causes of a particular, well-defined failure.
8.1 Introduction

As already pointed out in the preceding chapters, the performance and quality of an industrial product or system also include reliability, availability and maintainability, intrinsic characteristics that are fundamental in defining the requirements of the product: in other words, the dependability of the system or product. Dependability has a large impact on operating costs, on keeping the product in use, and on achieving acceptable costs throughout the life cycle of the product. Once the required dependability characteristics have been defined, a task of the designer, their effective realization must be verified. To accomplish this, methods of reliability analysis were introduced which permit reviewing and forecasting the level of reliability of a product or system. Techniques for analyzing dependability are used, in fact, to review and forecast evaluations of the reliability, availability, maintainability and safety of a system. Reliability analysis takes place mainly during the concept and definition phases, the design and development phases, as well as in the maintenance phases, in order to determine and evaluate the dependability values of a system or installation. To this aim, the most used methods of analysis can be summarized as follows:

1. Failure modes and effects analysis (FMEA);
2. Failure modes, effects and criticality analysis (FMECA);
3. Fault Tree Analysis (FTA);
4. Event Tree Analysis (ETA);
5. Reliability Block Diagram Analysis (RBD);
6. Markov Analysis (MA);
7. Petri Net Analysis;
8. Hazard and operability studies (HAZOP);
9. Reliability forecast through the part count (PC);
10. Analysis of human reliability (HRA);
11. Analysis of force - resistance;
12. Truth table (structure function analysis);
13. Boolean methods;
14. Statistical test and estimation methods for reliability growth.
Only the first three methods will be discussed in the following; further details on the others can be found in the bibliography. Such methods mainly provide qualitative characteristics, even though, as will be seen, some quantitative evaluations are possible. The methods of analysis discussed in the following allow one to evaluate the failure modes to which a system is, or could be, subjected. Furthermore, the results so obtained permit the designer to identify any modification which must be made in order to best improve the RAMS requirements of the product. In general, there are two types of analysis: inductive (bottom-up analyses such as FMEA and FMECA) and deductive (top-down analyses such as FTA). Inductive methods start at the lowest level, for example the failure of a single component (mechanical, electrical etc.) that has an effect on the whole system. In such a case, a detailed knowledge of the system and its structure is required to study, for example, fault conditions and failure propagation. Inductive methods are generally rather stringent and well suited to identifying all the individual failure modes. An analysis of this type gives its best results when conducted in the final planning or design phase, although it can be profitably used in other phases of the design process. Deductive methods, instead, start from the final effect, for example studying the causes of a particular, well-defined failure. The analysis is implemented starting at the system level and progresses little by little towards the lower levels (e.g. to the analysis of a single mechanical, electronic etc. component). Deductive methods are event-oriented and are particularly useful during the first phases of the project, when operational details are still to be defined.
It must be remembered that the aforementioned methods for the analysis of reliability are treated in the technical standards edited by the International Electrotechnical Commission (IEC), and in particular by the Technical Committee IEC TC 56 Dependability. Reference is often made to the applicable standards when present, in the awareness that, in industry, conformity to standards is often mandatory in the relationship between supplier and user. At the end of the chapter there is a brief, though certainly not complete, summary of reference standards. Standards are constantly being updated and it is best to always verify the latest revisions.
8.2 Failure Mode and Effects Analysis (FMEA)

Failure Mode and Effects Analysis (FMEA) is a systematic procedure for analyzing a system with the aim of identifying potential failure modes, their causes and effects on performance and, when applicable, their effects on the safety of the personnel, on the environment and on the system [1, 2]. During the advanced design phase, this analysis technique can draw attention to possible weaknesses in the system, in such a way as to suggest the modifications necessary to improve reliability and, more generally, availability. This analysis can include forecasting and preventive measures to be undertaken during the initial phases of the development of a new product. Often the FMEA, concentrating on these aspects, allows one to know whether the component under examination satisfies, for example, safety requirements. At this point, it is best to clarify that the analysis addresses failure modes rather than the failure itself. By failure mode, we mean the manner in which an item fails. By failure effect, we mean the consequence of a failure mode (measurable in a quantitative or qualitative sense) of a component or part of a system in terms of operation, function or status of the item [1, 2]. FMEA is, essentially, a method for studying the failure of devices manufactured with different technologies (electrical, mechanical, hydraulic etc.) or with a combination of such technologies. FMEA can also be used for studying software performance and function, as well as actions undertaken by the operator. The analysis is conducted starting from the characteristics of the components of a system by means of an inductive process. Hypothesizing the failure of a component, the analysis demonstrates the relationship existing between the failure itself and breakdowns, defects and decreased performance or integrity of the entire system.
FMEA analysis allows a good understanding of the behavior of a component of a system, such as an electronic board or a mechanical device, and of how this influences the functioning of the entire system, particularly in cases of breakdown. In fact, it is always necessary to ensure that the malfunctioning of a part of a system does not lead to dangerous situations for personnel or for the environment. For example, in an industrial installation or in a numerically controlled machine, it is not permissible that the breakdown of an electronic device or control board allows unsafe procedures to be carried out. The electronic board must therefore be designed in such a way that, in the presence of a failure, the system is prevented from functioning or, at least, unsafe operations cannot take place. The solution therefore lies in design, and FMEA analysis can identify all the breakdowns that, if unforeseen, can give rise to dangerous situations; it can also verify that the solutions adopted are effective. This is why it is always useful and advisable to carry out and update an FMEA analysis during the design phase. Moreover, this method, if kept updated, also becomes an instrument for the a posteriori verification of the design and a test of the conformity of the system to the required specifications and standards, as well as to the needs of the user. It is clear that the results of an FMEA analysis establish the priorities for checking the processes and inspections to be performed during construction and installation, as well as rating, acceptance and operative tests. This does not mean to
diminish the importance of an FMEA analysis on an already designed apparatus. In such a case, the analysis is simply conducted as a type of a posteriori verification of what has already been made. This can be regarded as the first useful step in defining design requirements and criteria whenever it is considered opportune to redesign or update the product. An FMEA analysis permits one:
1. to identify failures, including induced failures;
2. to determine the necessity of redundancy, further overdesign and/or simplifications to be made to the design;
3. to determine the necessity of choosing adequate materials, parts, components or devices;
4. to identify the serious consequences of failures, to determine the necessity to revise and/or modify the design, to identify safety risks to personnel and to highlight problems of legal responsibility;
5. to define testing and maintenance procedures, suggesting potential failure modes and determining the parameters that must be recorded during testing, verification and operative phases; it is also useful in drawing up a guide to identifying and investigating failures;
6. to define some software characteristics, if present.
8.2.1 Operative Procedure

In principle, a precise operative procedure for carrying out an FMEA analysis can be defined, but it is not possible to go into detail a priori. This derives from the fact that the design of systems and their applications is characterized by widely variable complexity, and it may therefore be necessary to adopt very specific FMEA procedures adapted to the available information. The classic procedure for implementing an FMEA analysis is the following:

Step 1 – Definition of the System
This phase defines the system, its functional requirements, environmental requirements and the legal ordinances to be respected. In defining functional and operative requirements, it is necessary to identify, in addition to the expected performance, any undesirable operations that can result in failure conditions. Environmental conditions regard the temperature of the work environment of the system (e.g. one must take into consideration the high temperatures that develop when an apparatus is encased, such as in a control box/cabinet), humidity, pressure and the presence of dust, mould, conditions of high salinity and vibrations. Furthermore, all problems connected to electromagnetic compatibility (EMC) must be taken into account, if relevant to the system. Finally, it is always necessary to comply with all legislative requirements, especially in terms of safety and risk evaluation.
Step 2 – Elaboration of Block Diagrams
Block diagrams are elaborated in order to identify the interconnections among subsystems, circuits and components. They show the fundamental elements of the system, the series and parallel relationships present, and the functional interdependence existing among the elements which constitute the system. This step is also denoted as analysis of the hierarchic levels.

Step 3 – Definition of Basic Principles
It is first necessary to establish the most adequate level of analysis. This is chosen by the technicians carrying out the FMEA analysis and is represented by the lowest level for which the necessary information is available. In the FMEA analysis of an electronic system, for instance, this is the level at which the failure modes of single electronic components are analyzed.

Step 4 – Definition of Failure Modes
This step consists in identifying the modes, the causes and the effects of failures, their relative importance and, in particular, their means of propagation. The failure mode is the objective evidence of the presence of a failure or, as defined in the IEC 60812 standard, the manner in which an item fails. It is worth recalling, at this point, the difference between failure and fault. To this aim, the definitions from the IEC 60812 standard are useful:
Failure: termination of the ability of an item to perform a required function (clause 3.2, IEC 60812:2006).
Fault: state of an item characterized by the inability to perform a required function, excluding the inability during preventive maintenance or other planned actions, or due to the lack of external resources. A fault is often the result of a failure of the item itself, but may exist without prior failure (clause 3.3, IEC 60812:2006).
So, failure is defined as the transition from a state of proper functioning to a malfunctioning state (fault).
In the automotive field, for example, failure modes can be that the car will not start, the lights do not function, intermittent functioning, or deflated tires. In this phase it is necessary to identify the critical elements/components of the system and make a list of failures. Failure modes can almost always be classified in one of the categories listed in Tables 8.1.a, 8.1.b and 8.1.c. In particular, Tables 8.1.a and 8.1.b give two different examples of general failure modes (different lists would be required for different types of systems), while Table 8.1.c reports a detailed list of failure modes. Furthermore, typical failure modes for electronic components are, for
example, opens, shorts, drift and functional faults. For mechanical components, instead, examples of failure modes can be brittle rupture, cracking, creep and so on. Table 8.1.d reports the relative frequency of failure modes in some electronic components [3].

Table 8.1a First example of a set of general failure modes (in compliance with IEC 60812).

Id   General Failure Modes
1    Failure during operation
2    Failure to operate at a prescribed time (when necessary or when wanted)
3    Failure to cease operation at a prescribed time (when necessary or when wanted)
4    Premature operation (too early)
Table 8.1b Second example of a set of general failure modes.

Id   General Failure Modes
1    Failure to operate at the proper time
2    Intermittent operation
3    Failure to stop operating at the proper time
4    Loss of output
Table 8.1c Specific failure modes.

Id   Specific Failure Modes
1    Structural breakdown
2    Seizing or jamming
3    Vibrations
4    Does not stay in position
5    Does not open
6    Does not close
7    Remains open
8    Remains closed
9    External loss
10   Internal loss
11   Outside tolerance (+)
12   Outside tolerance (-)
13   Functions at inappropriate times
14   Intermittent function
15   Irregular function
16   Display error
17   Reduced flow
18   Activation error
19   Does not stop
20   Does not start
21   Does not switch over
22   Premature intervention
23   Late intervention
24   Input error (excessive)
25   Input error (insufficient)
26   Output error (excessive)
27   Output error (insufficient)
28   No input
29   No output
30   Short circuit (electrical)
31   Open circuit (electrical)
32   Leakage (electrical)
33   Other conditions of failure according to system characteristics, operating conditions and operative restrictions
Table 8.1d Values for failure modes of electronic devices (%) [3].

Device                                Shorts  Opens  Drift  Functional
Digital Bipolar Integrated Circuits     50      20     –       30
Digital MOS Integrated Circuits         20      60     20      –
Linear Integrated Circuits              –       25     75      –
Bipolar transistor (BJT)                85      15     –       –
Field Effect Transistor (FET)           80      15     –       5
General Purpose Diode (Si)              80      20     –       –
Zener Diode (Si)                        70      20     –       10
Thyristors                              20      20     50      10
Optoelectronic devices                  10      50     40      –
Fixed Resistors                         40      60     –       –
Variable Resistors                      70      20     –       10
Foil Capacitors                         15      80     5       –
Ceramic Capacitors                      70      10     20      –
Tantalum Capacitors                     80      15     5       –
Aluminum Capacitors                     30      30     40      –
Coils                                   20      70     –       10
Relays                                  20      –      –       80
Quartz Crystals                         –       80     –       20
Once the failure mode has been identified, its consequent effects have to be studied. Finally, local or general failure effects are evaluated, as well as the final effect, that is to say the effect at the highest level of the system. In this phase it is important to list all the possible and/or potential failure modes of the system on which the FMEA analysis is being performed. To this end, although worthwhile, it is not always possible to involve the manufacturers of the components or of the equipment used in order to determine the failure modes of their products. It is therefore important to possess the most complete documentation possible. Failure modes are generally deduced in different ways, depending on whether a component is new or has previously been used. In particular:
• If the component or device is new, or is new to the designer, meaning that there is insufficient data regarding the behavior of the component, failure modes can be investigated by looking for similarities with components that have the same function, or by using test results of components operating under particularly harsh conditions. Failure modes can also be deduced from a theoretical study of the component or the system.
• If the component has been used previously, failure modes are generally deduced from maintenance and service reports, user reports, performance reports, failures, laboratory tests and other information available to the company, including information furnished by the manufacturer.
FMEA analysis can study the individual failure modes and their effects on the system. It is less suited to studying combinations of failures and dependence on the sequence of failures; Example 1 demonstrates this concept. Further consideration must be given to so-called common failures, which are quite frequent. These are failures which originate from a single event that causes the simultaneous failure of two or more components.
The most common causes of failure are typically environmental conditions, poor performance or characteristics (due, for example, to design insufficiency), defects introduced during construction, errors in assembly or installation, and human errors, e.g. during operation or maintenance. FMEA analysis permits qualitative evaluations of the aforementioned failures.

Step 5 – Identifying the causes of failures
It is well to indicate the cause of every failure mode; the more severe the effect of the failure mode, the more accurately the cause must be identified and described.

Step 6 – Identifying the effects of failure modes
A failure mode generally has effects in terms of reduced or absent functioning of the system, up to graver situations of harm to personnel and/or the environment. Two types of effects are typically recognized: local and final. Local effects involve the failure of the component under examination; final effects are seen at the level of a subsystem or of the system.
Step 7 – Definition of measures and methods for identifying and isolating failures
This phase defines the ways and methods for identifying and isolating failures, suggesting the procedures to be followed to detect a failure and the means to do so. The objective is to furnish the operator or the maintenance personnel with the information needed to verify the presence or absence of the failure mode under consideration. FMEA analysis can be applied to a process or a product, and also to a design process; in the different cases, the ways and times necessary to identify a failure change. When FMEA analysis is applied to a process, it is necessary to establish where it is most efficient to detect failures, e.g. during the operation of the process by an operator, through Statistical Process Control (SPC), Quality Control (QC), etc. When applied to a design project, maximum attention must be paid to when and where a failure mode can be most easily identified: during design review, the analysis phase or the test phase.

Step 8 – Prevention of undesirable events
In this phase, possible design or operative measures to be taken to prevent undesirable events are identified. The shortest way is to get to the root of the problem, in such a way that undesirable events do not occur. It is possible to use redundancy, alarms and monitoring systems, and to introduce limits to the hypothetical damage.

Step 9 – Classification of the severity of final effects
This classification is carried out taking into account various aspects: the nature of the system under examination, the performance and functional characteristics of the system, contractual and legal requirements, especially in regard to operator safety, and finally, guarantee requirements. An example of this classification is proposed by IEC in [1], as in Table 8.2.

Table 8.2 Example of classification of final effects (*)

Class  Severity level  Consequence to persons or environment
I      Insignificant   A failure mode can be so classified when it could potentially degrade system functions. However, no damage to the system will be caused, and the consequent situation or state of the system does not constitute a threat to life or injury.
II     Marginal        A failure mode can be classified as marginal when it could potentially degrade system performance function(s) without appreciable damage to the system or threat to life or injury.
III    Critical        A failure mode is critical when it could potentially result in the failure of the system's primary functions and therefore causes considerable damage to the system and its environment. However, it is very important to check that the failure mode so classified does not constitute a serious threat to life or injury.
IV     Catastrophic    We have a catastrophic failure mode when it could potentially result in the failure of the system's primary functions. As a result, the failure mode causes serious damage to the system and its environment and/or personal injury.
(*) IEC 60812:2006-01.
8 Qualitative Techniques
Step 10 – Multiple failure modes. This step considers specific combinations of multiple failure modes. It should be noted, at this point, that FTA is generally more suitable for taking multiple failure modes into account. This consideration is very important when selecting which methodology of analysis is preferred. Step 11 – Recommendations. Recommendations are made reporting useful observations to clarify, for example, aspects not completely analyzed, unusual conditions, the effects of failures of redundant elements, critical aspects of the project, references to other data for the analysis of sequential failures, and all observations made upon completing the analysis. Figure 8.1 shows the typical procedure of a FMEA analysis.
8.2.2 FMEA Typology

A FMEA can be applied in various application fields. We can define the following:
• Service FMEA analysis, which identifies and highlights any possible nonconformity (NC) associated with planning a service. With such an analysis, it is possible to study the effects that a NC can generate while providing the service. The objectives are the identification of the critical points of the service and the elimination of NCs.
• System FMEA analysis, where the propagation of the effects of possible failures of a component at the functional levels of the system is analyzed. The objectives are to minimize the effects of such failures and to identify critical points for corrective actions in order to increase the availability performance of the system.
• Project FMEA analysis, carried out during the design phase, before initiating the production process. The objectives are the same as those of a System FMEA analysis.
• Process FMEA analysis, which identifies the critical points of the process and their influence on producing the product (criticality/breakdown/damage to the production process that could lead to NCs of the product). This analysis evaluates the effects that these failures can have on the functionality and safety of the product. The objectives are the identification of the critical points of the process to keep under control, the detection of possible causes of failures so as to minimize their effect on the product and, finally, the improvement of the reliability performance of the process.
Fig. 8.1 Analysis flowchart for FMEA (in compliance with IEC 60812).
8.2.3 The Concept of Criticality

An analysis of criticality is usually the objective of an in-depth FMECA rather than of a FMEA. However, in some cases it is useful to proceed with a qualitative criticality analysis even when conducting a FMEA, especially when a complete analysis proves excessively taxing but an indication of the criticality of events is desired. Criticality is a way to quantify the attention it is opportune to give to a given failure, event or non-conformity, and it depends both on the probability of its occurrence and on the gravity of the consequences it may have. The attention to dedicate to an event depends first of all on the effect it may have on the safety of personnel, on the damage it can cause with subsequent losses, and on its effect on the availability of service. It is rather difficult to define a generally valid criterion for evaluating criticality, because both the seriousness of the consequences and the probability of their occurrence come into play. The level of gravity can vary and can be evaluated differently if the objective, for example, regards the safety of personnel, damage and relative losses, or the availability of service. Criticality is defined by means of a scale of values that permits evaluating the seriousness of consequences as a function of the criteria taken into consideration. The aforementioned Table 8.2 shows a classification with four principal levels of gravity of consequences. Different levels may also be utilized.
8.2.4 Final Considerations on FMEA Analysis

It is clear that when complex systems must be analyzed, a FMEA tends to become ponderous, tedious and filled with possible repetitions. In these cases, the experience of those conducting the analysis plays an important role. Furthermore, some particulars tend to simplify the analysis. It is rare that a system is designed anew in all its parts: it is often a case of revising an already existing system or, in the case of a new project, some subsystems may have already been utilized satisfactorily in other projects. If the working conditions are the same, it may be possible to reuse the previous reliability considerations for the new project. The results of a FMEA can give important information for establishing the priority of statistical process control, incoming sample inspections and qualifications.

Example 1
A brief example of a FMEA analysis of an industrial controller based on a microcontroller and consisting of various types of electronic boards is given here. For simplicity, we will not discuss the initial part of the analysis (definition of the system, drawing up block diagrams and definition of basic principles). The analysis, conducted at the level of the single component, has as its main objective the analysis of failure modes that can lead to unsafe conditions for personnel.
To this end, the parts with safety functions are validated by demonstrating their conformity to the basic principles of safety and by confirming that the specifications, design, implementation and choice of components comply with standards. The behavior with regard to environmental influences has been verified by means of suitable electromagnetic compatibility (EMC) tests. The criteria of analysis are the following: 1) the effects of failures in sequence (the failure of one component leads to failures in other components) are not studied; the first failure and all successive failures are considered as a single failure; 2) common failure modes are considered as single breakdowns; 3) only one independent failure is considered at a time.

8.2.4.1 Analysis of Failure Modes: Discussion and Exclusions
In microprocessor/microcontroller systems, the possibility of a device lock-up must always be considered. The design team evaluates whether to activate a software watchdog function to block the outputs in the case of an infinite loop in the program cycle, or even to use an external watchdog circuit. Memory, which can be internal or external to the device, must be suitably monitored if its failure can lead to the system working in an unsafe manner or can cause damage to the system or to the component being manufactured. A system is generally designed in such a way that the loss of data will not cause dangerous situations, both because the data in memory do not influence safety and because, in the event of a failure, the system responds (or does not respond) in an appropriate way. Given the criticality of the device, it is furthermore necessary to monitor the supply voltage by means of a supervisor circuit. Based on a careful analysis, we can exclude some failure modes due to: 1. low probability of the event; 2. acceptable technical experience; 3. technical requirements deriving from the application and the specific risks under consideration.
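The software watchdog mentioned above can be sketched as follows. This is only a minimal illustration of the pattern, not the controller's actual implementation; the class name, timeout value and safe-state callback are hypothetical:

```python
import threading
import time

class SoftwareWatchdog:
    """Minimal software watchdog sketch (hypothetical names): if the main
    control loop stops kicking the watchdog within `timeout_s` seconds,
    e.g. because it is stuck in an infinite loop, the outputs are forced
    to a safe state via the `on_trip` callback."""

    def __init__(self, timeout_s, on_trip):
        self.timeout_s = timeout_s
        self.on_trip = on_trip            # callback that blocks/safes the outputs
        self._last_kick = time.monotonic()
        self._lock = threading.Lock()
        self.tripped = False

    def kick(self):
        """Called periodically by a healthy control loop."""
        with self._lock:
            self._last_kick = time.monotonic()

    def check(self):
        """Called by an independent supervisor; trips once on timeout."""
        with self._lock:
            if not self.tripped and time.monotonic() - self._last_kick > self.timeout_s:
                self.tripped = True
                self.on_trip()
        return self.tripped
```

In a real controller the check would run on an independent timer interrupt or, better, in an external hardware watchdog circuit, as the text suggests; a software-only watchdog shares the failure modes of the processor it runs on.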
We therefore have a list of a priori excluded failures. The following examples may be useful in understanding this concept. For a safety relay, the following failures are generally excluded: • a short circuit between the three terminals of a relay; • a short circuit between two pairs of contacts and between contacts and a coil; • the simultaneous closure of contacts normally open (NO) and normally closed (NC). The exclusion of such failure modes can be deduced from the documentation furnished with the manufacturer's guarantee1.
1 For example, from the manufacturer's datasheet: "…All Safety relays are fitted with forced guided contacts to ensure the safe switching of a control system. All contacts are linked so that if one contact welds the others remain in their current position with open contacts maintaining a contact gap of > 0.5 mm. These relays meet the requirements of EN 50205".
For contactors, the following failures are excluded:
• Contacts normally open remain closed after being deactivated. This failure mode is avoided, or rendered improbable, thanks to design criteria based on overdimensioning: a component is used with a nominal current equal to double the current in the circuit, with a commutation frequency higher than necessary and a guaranteed total number of commutations ten times higher than expected. The circuit on which the contactor works is suitably protected against short circuits. Furthermore, measures are taken so that the electronic control opens the contactor only when the current is very much reduced. From the point of view of mechanical stress, the installation is such that the foreseen vibrations and shocks are much lower than the maximum values indicated by the manufacturer.
• The possibility of a short circuit between the three terminals of a changeover contact, and between the contacts and the coil, can be excluded based on the manufacturer's guarantee (provided the printed circuit is well made and the components well soldered!).
• Finally, the simultaneous closure of contacts normally open and normally closed can be excluded whenever a safety relay is used.
Further typical cases of failure modes that, under some hypotheses, can generally be excluded are those regarding printed circuits. In fact, a Printed Circuit Board - PCB (double-sided, FR4-74 made with a minimum thickness of 0.8 mm and a minimum copper base of 17 µm) is made with material conforming to IEC Standard 61249, and both creepage distances and clearances are measured in compliance with IEC Standard 60664 with a pollution degree of 2. The circuit is covered with a protective layer, and the case that contains the circuits has a minimum protection index (IP) equal to IP54.
Finally, one notes that a short circuit between adjacent PCB tracks could occur when the electronic board operates in dirty or humid conditions, in the presence of dripping water, or when maintenance is not carried out (frequent in some industrial settings). Substandard soldering can also give rise to both short circuits and open circuits. One must also keep in mind that not all failure modes are always possible, and that some component breakdowns can themselves lead to short circuits or open circuits. In this simplified analysis, this was not taken into consideration.

8.2.4.2 Drafting the FMEA Table
Table 8.3.1 shows the FMEA table. Only the study of the failure modes of a limited number of components is reported.

Example 2
Table 8.3.2 shows a second example of a FMEA analysis. Note that the table of this analysis is partly different from the preceding one.
Table 8.3.1 Extract of FMEA analysis for applicative example 1.

| Component | Ref. | Id. | Failure mode | Possible failure cause | Final effects | Detection method | Compensating provision against failure | Notes |
|---|---|---|---|---|---|---|---|---|
| Push button | P1 | 1.1 | Fail to close | Mechanical failure | The operation is not performed | Simulated | Not necessary | − |
| | | 1.2 | Fail to open | Contacts weld | At the end of the operation it is not possible to start another operation | Simulated | Not necessary | The system works on the transition |
| Relay | K1 | 1.1 | NO contacts always open; NC contacts always closed | Coil breakdown | The operation is not performed | Simulated | Not necessary | − |
| | | 1.2 | NO contacts always closed; NC contacts always open | Mechanical failure and contacts weld | The operation is not performed | Simulated | Not necessary | Not dangerous failure mode |
| | | 1.3 | Three-terminal short circuit | Failure mode not possible (see notes) | − | Analysis of the schematic circuit | − | Safety relay |
| | | 1.4 | Other terminal short circuit | Failure mode not possible (see notes) | − | Analysis of the schematic circuit | − | Safety relay |
| | | 1.5 | No synchronization between contacts (simultaneous NO/NC) | Failure mode not possible (see notes) | − | − | − | Safety relay |
| Resistor | R1 | 5.1 | Open circuit | Thermal stress | Impossibility to activate movement on the X coordinate | Simulated | Not necessary | − |
| | | 5.2 | Short circuit | Thermal stress | The operation is not performed | Simulated | Not necessary | − |
| | | 5.3 | Nominal value modification | Thermal stress | Null in the range −60% to +80% of the nominal value | Simulated, changing the resistor with another one of a different value | Not necessary | Not dangerous failure mode; no malfunction is detectable |
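A worksheet like Table 8.3.1 can also be handled programmatically, which is convenient when the component count grows. The following sketch models a row and filters the modes excluded a priori; the field names are our own illustration, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class FmeaRow:
    """One row of a FMEA worksheet (illustrative field names)."""
    component: str
    ref: str
    mode_id: str
    failure_mode: str
    cause: str
    final_effect: str
    detection: str
    provision: str
    notes: str = ""

# Two rows echoing Table 8.3.1
rows = [
    FmeaRow("Push button", "P1", "1.1", "Fail to close", "Mechanical failure",
            "The operation is not performed", "Simulated", "Not necessary"),
    FmeaRow("Relay", "K1", "1.3", "Three-terminal short circuit",
            "Failure mode not possible", "-",
            "Analysis of the schematic circuit", "-", "Safety relay"),
]

# Modes excluded a priori are marked "not possible" in the cause column
excluded = [r for r in rows if "not possible" in r.cause]
```

Keeping the worksheet as structured data makes it easy to sort, filter or export the analysis, and to re-run checks after each design revision.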
Table 8.3.2 Extract of FMEA analysis for applicative example 2.

| Ref. | Item | Failure Id | Failure mode | Possible failure causes | Symptom detected by | Local effect | Effect on unit output (total effect) | Compensation provision | Severity class (S) | Recommendations and actions taken |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Motor stator | 1.1 | Open circuit | Winding fracture | Low speed roughness | Overload | Low power; trip | Single-phase protection; temperature trip | 4 | Nothing |
| | | 1.2 | Open circuit | Connection fracture | Low speed roughness | Overload | Low power; trip | Single-phase protection; temperature trip | 3 | Nothing |
| | | 1.3 | Isolation breakdown | Persistent high temperature; manufacturing defect | Protection system | Excessive motor temperature | No output | Stator temperature trip | 4 | Nothing |
| | | 1.4 | Thermistor open circuit | Ageing; connection fracture | Protection system | None | No output | Fitted spare; temperature trip | 3 | Recommend a spare connected through to the outside casing |
| | | 1.5 | Thermistor short circuit | − | Protection system | None | No output | Fitted spare | 3 | Recommend a spare connected through to the outside casing |
| 2 | Motor cooling system | 2.1 | Blockage, low diff. pressure | Inadequate cooling | High stator temperature measured by thermistor | Excessive winding temperature | Reduced trip margin; no output if load is high | Annual inspection; temperature trip | 2 | Nothing |
Example 3
The following is a brief discussion of a third example of a FMEA analysis. It regards the standard IEC 60812:2006, which can be consulted for further information. The analysis is conducted on a subsystem of a system composed of a motor and a generator. The study does not take into consideration the effects of breakdowns on the loads fed by the group or on systems connected in any way to the system under examination. The system is first of all subdivided into subsystems, as represented by the block diagram reported in Figure 8.2. Subsystem 2 is further developed to the level of the component, and it is at this level that the FMEA analysis is carried out. The FMEA table is not reported here since, in many respects, it is similar to those already presented. Some aspects, however, merit explicit description. First of all, the block diagram must be suitably developed in such a way that each block is easily identifiable by way of a clear and simple system. In the case under examination, as one goes into greater detail, the blocks are identified by an unambiguous numerical code, e.g. 2.1.2. This is a great help when one must write and interpret the table of the results of the analysis. Furthermore, the FMEA analysis is conducted here on a system where electric, electronic and mechanical aspects coexist. This means that FMEA is useful for the analysis of so-called multi-disciplinary systems in various fields.
Fig. 8.2 Block diagram of the system under examination (IEC 60812:2006). The motor-generator set is subdivided into subsystems (among them the machine structure; the enclosure heating, ventilation and cooling system; the DC machine; the AC machine; and the instrumentation). Subsystem 2, the enclosure heating, ventilation and cooling system, is further developed into 2.1 heater system (with components 2.1.1, 2.1.2, …), 2.2 ventilation and cooling system, 2.3 emergency cooling (inlet/outlet) doors, and 2.4 condensate/cooling-water drain system.
8.3 Failure Mode, Effects, and Criticality Analysis (FMECA)

The evaluation of the criticality of failure modes and their effects has already been discussed, but always on a qualitative basis, founded on experience and knowledge, without going into the detail of a quantitative evaluation. If such an evaluation is imperative, the analysis is called Failure Mode, Effects, and Criticality Analysis (FMECA). An example of a FMECA flowchart is depicted in Fig. 8.3. In general, it is possible to apply a quantitative evaluation of criticality when data on the failure mode rates of the systems or components to be analyzed are available; otherwise a qualitative evaluation is applied. Establishing what is critical, and the probability that a failure mode will occur, is very useful in determining what corrective action must be taken and in defining the line between acceptable and unacceptable risks. Different types of critical failure modes can be identified (each company can define its own categories and classes). A scale of criticality based on the following categories is generally valid:
1. death or injury to the public or company personnel;
2. damage to this or other equipment;
3. economic damage deriving from loss of output or loss of system functions;
4. inability to perform a function due to the inability of the equipment to properly perform its principal function.
An example of such a scale of criticality is seen in Table 8.2. The choice of criticality categories requires careful study and prudence. It is necessary to keep in mind all the factors that have an impact on the evaluation of the system: its performance, costs, schedules, safety and risks.
The FMECA procedure (Fig. 8.3) can be summarized as follows. Start the FMECA; select a level of analysis; select a component of the item, system or subsystem to analyze; identify the failure modes of the selected component; select the failure mode to analyze; identify the immediate effect and the final effect of the selected failure mode; determine the severity of the final effects; identify the potential causes of that failure mode; evaluate the frequency or probability of occurrence of the selected failure mode during the predetermined time period. If severity and/or probability of occurrence warrant the need for action, propose mitigation methods, corrective actions or compensating provisions (design review), identify the actions and the responsible personnel, and document notes, recommendations, actions and remarks. The loop is repeated for any further failure modes of the component and for any further components to be taken into account in the analysis. At the end of the FMECA, fix the next revision date as appropriate.

Fig. 8.3 FMECA flowchart (in compliance with IEC 60812).
8.3.1 Failure Modes and Their Probability

After failure mode identification, it is necessary to evaluate the corresponding probability of occurrence. In a FMECA, this evaluation is carried out analytically. To this aim, it is necessary to access detailed information regarding the reliability of the components/devices, for example their failure rates. Whenever a qualitative analysis is applied, by choice or due to a lack of data, the probability with which a failure mode manifests itself is generally described by discrete levels, for example the following:
• Level A, when the failure mode can occur frequently;
• Level B, when the failure mode is reasonably probable;
• Level C, when the failure mode is occasional;
• Level D, when the failure mode is remote;
• Level E, when the failure mode is extremely rare.
For a quantitative analysis, two indexes, RPN and Cm, are evaluated, as discussed below.
8.3.2 Evaluation of Criticality

This evaluation can be carried out by means of a criticality grid, where the abscissa represents the probability or frequency of failure and the ordinate represents the class of criticality. The failure modes, duly classified after their probability has been evaluated, are inserted into one of the squares of the criticality grid reported in Table 8.9; this will be discussed in more depth later. Obviously, the farther a square is from the origin, the greater the criticality of the failure mode and thus the greater the necessity to adopt appropriate countermeasures. The evaluation of criticality involves quantifying the effects of a failure mode/non-conformity. This operation is not always easy to carry out and often involves brainstorming. Criticality can be measured in various ways, from which different types of FMECA are derived. Here we will present two: the first based on risk and the second based on the failure rate.
8.3.3 FMECA Based on the Concept of Risk

This approach follows the standard IEC 60812 and refers to the concepts of Risk R and Risk Priority Number (RPN). Risk is evaluated by means of a suitable measurement of the severity of the effects and an estimate of the expected probability that the failure mode manifests itself in an a priori determined interval of time. A measurement of potential risk is therefore:

R = S · P    (8.1)
where
• S (Severity) represents an estimate of how strongly the effects of a failure impact the system or the user (personnel or customer, for example). This is the gravity or criticality of the failure and is generally expressed in levels of criticality. Note that S is a dimensionless number.
• P (Probability) denotes the probability of occurrence. This parameter, too, is a dimensionless number.
Additional information concerning failure detection at the system level can be introduced by using a further parameter named D (Detection). The RPN is then given by the following equation:

RPN = S · O · D    (8.2)

where:
• O (Occurrence) is the probability that a failure mode will manifest itself in an established time, which usually coincides with the useful life of the component under examination. It may be defined as a ranking number (or index number) rather than the actual probability of occurrence. Through a design change it is possible to remove or limit one or more failure modes; this is the only way to reduce the occurrence ranking.
• D (Detection) is an estimate of the possibility of identifying/diagnosing and eliminating/preventing the onset of a breakdown before its effects manifest themselves on the system or personnel. This number is usually ranked in reverse order with respect to the severity and occurrence numbers: the higher the detection number D, the less probable the possibility of identifying the failure, and vice versa. It follows that a lower probability of detection leads to a higher RPN, indicating the necessity to resolve the failure mode with maximum priority and speed. Detection capabilities are mainly obtained or planned in the design phase. Typical design controls are design verification or validation activities such as design reviews, road testing in the automotive industry, etc. Detection is thus an assessment of the capability of the design review to detect a potential cause or mechanism or a design weakness.
The level of severity together with the RPN permits establishing on which failure modes it is necessary to concentrate resources in order to mitigate or annul their effects. S, O and D are generally estimated with values from 1 to 4 or 5 and, in some contexts, from 1 to 10, as reported in Tables 8.4, 8.5 and 8.6. Even when referring to such examples, every evaluation establishing the values of S, O and D must be connected with personal experience and with the type of analysis being carried out (on a product, a process or working conditions).
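The RPN of Eq. (8.2) lends itself to a short computational sketch. The ranking function below is generic; the worksheet rows echo the failure modes of Table 8.7, and the idea of sorting by RPN to set priorities is ours, not a prescription of the standard:

```python
def rpn(severity, occurrence, detection):
    """Risk Priority Number per Eq. (8.2): RPN = S * O * D,
    each factor being a ranking on a 1-10 scale."""
    for factor in (severity, occurrence, detection):
        if not 1 <= factor <= 10:
            raise ValueError("S, O and D must be rankings between 1 and 10")
    return severity * occurrence * detection

# Worksheet rows: (failure mode, S, O, D), values as in Table 8.7
modes = [
    ("wrong PCB marking", 8, 6, 10),
    ("ESD damage to PCB", 7, 4, 5),
    ("FIFO not respected", 2, 2, 1),
]

# Rank failure modes by RPN, highest priority first
ranked = sorted(modes, key=lambda m: rpn(m[1], m[2], m[3]), reverse=True)
```

For instance, rpn(8, 6, 10) gives 480, which places the wrong-marking mode at the top of the priority list, as in Table 8.7.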
However, without an accurate evaluation of the RPN, erroneous deductions may occur. In fact, this parameter, as defined, presents some problems:
• Gaps in the range: referring to the values of S, O and D summarized in the tables, the RPN index does not assume 1000 values, as one might expect from multiplying three factors each on a scale of 1-10, but only 120 different values: 88% of the range is empty. Moreover, two identical values of RPN can derive from different values of the parameters S, O and D, and this must be kept in due consideration.
• Duplicate RPNs: different situations can generate identical RPN values.
• Sensitivity to small changes: a small variation in one of the factors implies a notable variation in the RPN value when the other factors are large, but only a minor variation when the other two factors assume lower values.
• Inadequate scaling: the distance between contiguous RPN values is not always the same.
• Inadequate scale of RPN: a difference in RPN values might appear negligible while in fact being significant.
An example is useful to understand the possible situations. Two different classifications might lead to the following values:

Scenario 1: S1 = 5, O1 = 3, D1 = 3 ⇒ RPN1 = S1 · O1 · D1 = 45
Scenario 2: S2 = 5, O2 = 4, D2 = 3 ⇒ RPN2 = S2 · O2 · D2 = 60

It should be noted that RPN2 is not twice RPN1, although the probability of occurrence corresponding to O2 is twice that corresponding to O1: as shown in Table 8.5, an occurrence rating of 3 corresponds to a probability of 5·10-4, whereas a rating of 4 corresponds to 10·10-4. This example shows that RPN values cannot be compared linearly.
Table 8.4 Table for determining the parameter S (according to IEC 60812:2006 and MIL-HDBK-338B). Please note that this is an example of a classification used in the automobile industry.

| S | Criteria | Ranking |
|---|---|---|
| None | No discernible effect is present. | 1 |
| Very minor | Fit and finish / squeak and rattle item does not conform. Defect noticed by discriminating customers/operators (or by less than 25%). | 2 |
| Minor | Fit and finish / squeak and rattle item does not conform. Defect noticed by the average customer or operator (or by 25%-75% of customers/operators). | 3 |
| Very low | Fit and finish / squeak and rattle item does not conform. Defect noticed by most customers or operators (for example, greater than 75%). | 4 |
| Low | The vehicle(s) or item(s) under consideration is (are) operable, but comfort/convenience item(s) operate at a reduced level of performance. Customer is somewhat dissatisfied. | 5 |
| Moderate | The vehicle or item under consideration is operable, but comfort/convenience item(s) is (are) inoperable. Customer is dissatisfied. | 6 |
| High | The vehicle or item under consideration is operable, but at a reduced level of performance. Customer is very dissatisfied. | 7 |
| Very high | Vehicle or item under consideration is inoperable (there is a loss of primary function). | 8 |
| Hazardous with warning | Very high severity ranking when a potential failure mode affects safe vehicle operation and/or involves non-compliance with government regulations and/or mandatory standards. | 9 |
| Hazardous without warning | Very high severity ranking when a potential failure mode affects safe vehicle operation and/or involves non-compliance with government regulations and/or mandatory standards, without warning. | 10 |
Table 8.5 Recurrence of failure modes: frequency and probability (according to IEC 60812:2006 and MIL-HDBK-338B).

| Failure mode description | Definition | Frequency (‰, IEC) | Probability (MIL) | Possible failure rates | Rating, O |
|---|---|---|---|---|---|
| Remote | Failure is unlikely | ≤ 0.010 | ≤ 1·10-5 | ≤ 1 in 1500000 | 1 |
| Low | Relatively few failures | 0.1 | 1·10-4 | 1 in 150000 | 2 |
| | | 0.5 | 5·10-4 | 1 in 15000 | 3 |
| Moderate | Occasional failures | 1 | 1·10-3 | 1 in 2000 | 4 |
| | | 2 | 2·10-3 | 1 in 400 | 5 |
| | | 5 | 5·10-3 | 1 in 80 | 6 |
| High | Repeated failures | 10 | 1·10-2 | 1 in 20 | 7 |
| | | 20 | 2·10-2 | 1 in 8 | 8 |
| Very high | Failure is almost inevitable | 50 | 5·10-2 | 1 in 3 | 9 |
| | | ≥ 100 | ≥ 1·10-1 | ≥ 1 in 2 | 10 |
Table 8.6 Criteria for evaluating parameter D (according to IEC 60812:2006 and MIL-HDBK-338B).

| D | Criteria: likelihood of detection by Design Control | Ranking |
|---|---|---|
| Almost certain | Design Control (or design review) will almost certainly detect a potential cause or mechanism and the subsequent failure mode. | 1 |
| Very high | Very high chance the Design Control (or design review) will detect a potential cause/mechanism and subsequent failure mode. | 2 |
| High | High chance the Design Control (or design review) will detect a potential cause/mechanism and subsequent failure mode. | 3 |
| Moderately high | Moderately high chance the Design Control (or design review) will detect a potential cause/mechanism and subsequent failure mode. | 4 |
| Moderate | Moderate chance the Design Control (or design review) will detect a potential cause/mechanism and subsequent failure mode. | 5 |
| Low | Low chance the Design Control (or design review) will detect a potential cause/mechanism and subsequent failure mode. | 6 |
| Very low | Very low chance the Design Control (or design review) will detect a potential cause/mechanism and subsequent failure mode. | 7 |
| Remote | Remote chance the Design Control (or design review) will detect a potential cause/mechanism and subsequent failure mode. | 8 |
| Very remote | Very remote chance the Design Control (or design review) will detect a potential cause/mechanism and subsequent failure mode. | 9 |
| Absolutely uncertain | Design Control (or design review) will not and/or cannot detect a potential cause/mechanism and subsequent failure mode. This ranking is also selected when there is no design review process. | 10 |
Table 8.7 Evaluation of RPN: an example.

| N | Phase | Failure mode | Cause | Effect on component or system | S | O | D | RPN |
|---|---|---|---|---|---|---|---|---|
| 1 | Incoming inspection | Variations not noted by supplier | Deficiency in information system | Product cannot be sold; delay in delivery | 2 | 1 | 1 | 2 |
| 2 | − | FIFO not respected | Traceability not available | Material does not conform due to ageing or to specifications | 2 | 2 | 1 | 4 |
| 3 | − | Exchange of printed circuit (PCB) | Wrong marking | A non-conforming PCB is sent to production | 8 | 6 | 10 | 480 |
| 4 | − | Damage to PCB from ESD | Product manipulation does not conform | Printed circuit damaged | 7 | 4 | 5 | 140 |
The aforementioned considerations lead us to conclude that conclusions drawn from the study of the RPN must be treated with extreme prudence. Table 8.7 shows some examples of the evaluation of the RPN coefficient.
8.3.4 FMECA Based on the Failure Rate

Estimating the criticality of a failure mode can also be carried out by means of a study of the failure rates of the devices, subsystems and constituent parts of the system. Unfortunately, the failure rates generally traceable in databanks refer to components and not to the failure modes of components. There is also a further complication: usually the available data are valid under well-established environmental and operative conditions. Available failure rates are therefore not immediately usable and cannot be included in the final analysis report as they are. An estimate of the failure rate of a given failure mode is calculated through the following formula:

λm = λc · αm · βm    (8.3)

where
• λm is the failure rate of the single failure mode to be analyzed;
• λc is the failure rate of the item;
• αm is the probability that the item, when it fails, fails in failure mode m; obviously, for an item, Σm αm = 1;
• βm is the conditional probability of the failure effect given the failure mode m, i.e. the probability that, faced with that failure mode, the critical effect under examination is produced. The value of this parameter can be selected according to the information of Table 8.8.

Table 8.8 Criteria for the evaluation of parameter βm.

| Failure mode effect | Value |
|---|---|
| Real loss | βm = 1 |
| Probable loss | 0.1 < βm < 1 |
| Possible loss | 0 < βm ≤ 0.1 |
| No effect | βm = 0 |
This relationship is valid under the hypothesis of a constant failure rate. This is not always true, and it is one of the limits of this approach. It is often useful to refer to a time, for example the useful life of the component, denoted as tc. In this case, we use the coefficient of criticality of the failure mode (sometimes named Failure Mode Criticality Number):

Cm = λm · tc = λc · αm · βm · tc    (8.4)

Note that the time of observation, which often but not always coincides with the useful life of the component, refers to the component and not to the failure mode. For a single component there can often exist several failure modes. If these failure modes are n, then:

Cc = Σ(m=1..n) Cm = Σ(m=1..n) λm · tc = Σ(m=1..n) λc · αm · βm · tc    (8.5)

where Cc is the coefficient of criticality of the component. The probability that a failure mode manifests itself within a certain time interval is:

Pm = 1 − e−Cm    (8.6)
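Equations (8.3)-(8.6) can be sketched numerically as follows. The failure rate, mode split and β values below are invented for illustration and are not taken from any databank:

```python
import math

def mode_criticality(lam_c, alpha_m, beta_m, t_c):
    """Failure mode criticality number, Eq. (8.4):
    Cm = lambda_c * alpha_m * beta_m * t_c."""
    return lam_c * alpha_m * beta_m * t_c

# Hypothetical item: constant failure rate 2e-6 failures/h over a
# useful life of 50 000 h, with three failure modes.
lam_c = 2e-6
t_c = 50_000.0
alphas = [0.5, 0.3, 0.2]   # mode split alpha_m; must sum to 1
betas = [1.0, 0.5, 0.1]    # conditional effect probabilities, per Table 8.8

assert abs(sum(alphas) - 1.0) < 1e-12   # the alpha_m must sum to one

# Per-mode criticality (Eq. 8.4), component criticality (Eq. 8.5)
C_modes = [mode_criticality(lam_c, a, b, t_c) for a, b in zip(alphas, betas)]
C_c = sum(C_modes)

# Probability that each failure mode manifests itself in t_c (Eq. 8.6)
P_modes = [1.0 - math.exp(-Cm) for Cm in C_modes]
```

With these assumed numbers the first mode dominates (Cm = 0.05 out of Cc = 0.067), so it would be the first candidate for corrective action; note also that Pm ≈ Cm only while Cm is small, which is why Eq. (8.6) matters for long observation times.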
It is possible to subdivide the range of Pm into classes, as indicated in Table 8.9. Of two failure modes classified there, one may be more severe while the other has a greater probability of occurring. To decide upon which of the two modes one must initially concentrate, it is necessary to take into consideration how the scales of the two axes were created and, above all, the type of application being dealt with. Table 8.9 Matrix of criticality
Probability
C 5
Pm Pm>0,2
4
0,1≤ Pm