PRODUCT RELIABILITY, MAINTAINABILITY, AND SUPPORTABILITY
HANDBOOK SECOND EDITION
© 2009 by Taylor & Francis Group, LLC
Edited by
MICHAEL PECHT
Boca Raton London New York
CRC Press is an imprint of the Taylor & Francis Group, an informa business
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2009 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1
International Standard Book Number-13: 978-0-8493-9879-7 (Hardcover)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data
Product reliability, maintainability, and supportability handbook / editors, Michael Pecht. -- 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-8493-9879-7 (alk. paper)
1. Electronic apparatus and appliances--Reliability--Handbooks, manuals, etc. 2. Electronic apparatus and appliances--Maintainability--Handbooks, manuals, etc. I. Pecht, Michael. II. Title.
TK7870.P748 2009
658.5’75--dc22
2008044162
Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com
Contents
Preface
Editor
Contributors
Chapter 1 Product Effectiveness and Worth
Harold S. Balaban, Ned Criscimagna, Michael Pecht
Chapter 2 Reliability Concepts
Diganta Das, Michael Pecht
Chapter 3 Statistical Inference Concepts
Jun Ming Hu, Mark Kaminskiy, Igor A. Ushakov
Chapter 4 Practical Probability Distributions for Product Reliability Analysis
Diganta Das, Michael Pecht
Chapter 5 Confidence Intervals
Diganta Das, Michael Pecht
Chapter 6 Hardware Reliability
Abhijit Dasgupta, Jun Ming Hu
Chapter 7 Software Reliability
Richard Kowalski, Carol Smidts
Chapter 8 Failure Modes, Mechanisms, and Effects Analysis
Sony Mathew, Michael Pecht
Chapter 9 Design for Reliability
Diganta Das, Michael Pecht
Chapter 10 System Reliability Modeling
Michael Pecht
Chapter 11 Reliability Analysis of Redundant and Fault-Tolerant Products
Joanne Bechta Dugan
Chapter 12 Reliability Models and Data Analysis for Repairable Products
Harold S. Balaban
Chapter 13 Continuous Reliability Improvement
Walter Tomczykowski
Chapter 14 Logistics Support
Robert M. Hecht
Chapter 15 Product Effectiveness and Cost Analysis
Harold S. Balaban, David Weiss
Chapter 16 Process Capability and Process Control
Diganta Das, Michael Pecht
Preface
To ensure product reliability, an organization must follow certain practices during the product development process. These practices impact reliability through the selection of parts (materials), product design, manufacturing, assembly, shipping and handling, operation, maintenance, and repair. The following practices are described in this book:
• Define realistic product reliability requirements determined by factors including the targeted life-cycle application conditions and performance expectations. The product requirements should consider the customer’s needs and the manufacturer’s capability to meet those needs.
• Define the product life-cycle conditions by assessing relevant manufacturing, assembly, storage, handling, shipping, operating, and maintenance conditions.
• Ensure that supply-chain participants have the capability to produce the parts (materials) and services necessary to meet the final reliability objectives.
• Select the parts (materials) that have sufficient quality and are capable of delivering the expected performance and reliability in the application.
• Identify the potential failure modes, failure sites, and failure mechanisms by which the product can be expected to fail.
• Design the process to capability (i.e., the quality level that can be controlled in manufacturing and assembly), considering the potential failure modes, failure sites, and failure mechanisms obtained from the physics of failure, analysis, and the life-cycle profile.
• Qualify the product to verify the reliability of the product in expected life-cycle conditions. Qualification encompasses all activities that ensure that the nominal design and manufacturing specifications will meet or exceed the reliability goals.
• Ascertain whether manufacturing and assembly processes are capable of producing the product within the statistical process window required by the design. Variability in material properties and manufacturing processes will impact the product’s reliability. Therefore, characteristics of the process must be identified, measured, and monitored.
• Manage the life-cycle usage of the product using closed-loop, root-cause monitoring procedures.
Chapter 1: Product Effectiveness and Worth. This chapter presents a definition of product effectiveness and discusses the relationships between product effectiveness and its related functions (availability, dependability, and capability). The chapter concludes with a discussion of assignment responsibility and product worth.
Chapter 2: Reliability Concepts. This chapter presents the fundamental mathematical theory for reliability. The focus is on reliability and unreliability functions, probability density function, hazard rate, conditional reliability function, and key time-to-failure metrics.
Chapter 3: Statistical Inference Concepts. This chapter introduces statistical inference concepts as ways to analyze probabilistic models from observational data. The chapter discusses basic types of statistical estimation, hypothesis testing, and reliability regression model fitting.
Chapter 4: Practical Probability Distributions for Product Reliability Analysis. In this chapter, basic types of discrete and continuous probability distributions are introduced. Two discrete distributions (binomial and Poisson) and four continuous distributions (Weibull, exponential, normal, and lognormal) commonly used in reliability modeling and hazard rate assessments are presented.
Chapter 5: Confidence Intervals. This chapter presents the concept of confidence interval and its relationship with tolerance, sample size, and confidence levels. Examples of confidence interval calculations and estimations are provided.
Chapter 6: Hardware Reliability. Using failure models and examples, this chapter focuses on reliability assessment and its associated validation techniques for engineering hardware. The chapter provides a case study on wirebond assembly in microelectronic packages to illustrate the implementation of a probabilistic physics-of-failure approach in reliability prediction and modeling.
Chapter 7: Software Reliability. This chapter provides a definition of software, software reliability, software quality, and software safety. It also discusses software development models and techniques for both improving and assessing software reliability.
Chapter 8: Failure Modes, Mechanisms, and Effects Analysis. Knowledge of failure mechanisms that cause product failure is essential in implementation of appropriate design practices for design and development of reliable products. This chapter presents a new methodology called failure modes, mechanisms, and effects analysis (FMMEA) to identify potential failure mechanisms and models for potential failure modes and to prioritize failure mechanisms. FMMEA enhances the value of failure modes and effects analysis (FMEA) and failure modes, effects, and criticality analysis (FMECA) by identifying “high-priority failure mechanisms” to help create an action plan to mitigate their effects. The knowledge about the cause and consequences of mechanisms found through FMMEA helps in efficient and cost-effective product development.
Chapter 9: Design for Reliability. There are steps that must be taken to develop a product that meets reliability objectives. This chapter provides an overview of product requirements and constraints, product life-cycle conditions, parts selection and management, failure modes, mechanisms, and effects analysis, design techniques, qualification, manufacture and assembly, and closed-loop monitoring.
Chapter 10: System Reliability Modeling. This chapter describes how to combine reliability information from parts and subsystems to compute system-level reliability. Reliability block diagrams are used as a means to represent the logical system architecture and develop system reliability models. This chapter also presents fault-tree analysis for system reliability modeling.
Chapter 11: Reliability Analysis of Redundant and Fault-Tolerant Products. A fault-tolerant product is designed to continue operating correctly despite the failure of some constituent components. This chapter presents methods for evaluating reliability in several types of fault-tolerant conditions.
Chapter 12: Reliability Models and Data Analysis for Repairable Products. This chapter describes methods for modeling and analyzing failures of repairable products (particularly nonelectronic equipment) that normally exhibit wearout characteristics. Analytical background and data analysis techniques that describe the reliability behavior of repairable products are provided.
Chapter 13: Continuous Reliability Improvement. Reliability improvement techniques can be applied to a new product that has passed its major hardware and/or software design reviews, to a developed product that the manufacturer wishes to make more competitive, or to an existing product that is not meeting the customer’s expectations of reliability performance. This chapter discusses the principles of reliability growth, accelerated testing, and management of a continuous improvement program.
Chapter 14: Logistics Support. Integrated logistics support (ILS) applied to products constitutes a life-cycle approach to maintenance and support. This chapter discusses the influence of reliability on logistics support requirements, emphasizing how the reliability of product, equipment, or assembly influences the need for spare or repair parts, support equipment, and maintenance personnel.
Chapter 15: Product Effectiveness and Cost Analysis. This chapter shows how reliability and maintainability data can be combined with performance data to assess overall product effectiveness and how cost aspects can be introduced to provide a more complete basis for design decisions.
Chapter 16: Process Capability and Process Control. Quality is a measure of a product’s ability to meet the workmanship criteria of the manufacturer. This chapter introduces the concepts of process capability and the basics of statistical process control techniques. Chapter sections present the concepts of average outgoing quality, process capability, defects calculation, and statistical process control, with examples.
The Audience for This Book
This book is for professionals interested in gaining knowledge of the practical aspects of reliability. It is equally helpful for students interested in pursuing this challenging career in reliability, as well as maintainability and supportability teams.
Editor
Michael Pecht is visiting professor in Electrical Engineering at City University of Hong Kong. He has an MS in Electrical Engineering and an MS and PhD in Engineering Mechanics from the University of Wisconsin at Madison. He is a professional engineer, an IEEE Fellow, an ASME Fellow, and an IMAPS Fellow. He was awarded the highest reliability honor, the IEEE Reliability Society’s Lifetime Achievement Award, in 2008. He served as chief editor for IEEE Transactions on Reliability for eight years and was on the advisory board of IEEE Spectrum. He is chief editor for Microelectronics Reliability and is an associate editor for IEEE Transactions on Components and Packaging Technology. He is the founder of CALCE (Center for Advanced Life Cycle Engineering) at the University of Maryland, College Park, where he is also the George Dieter Chair Professor in Mechanical Engineering and a professor in Applied Mathematics. He has written more than 20 books on electronic products development, use, and supply chain management, and over 400 technical articles. He has been leading a research team in the area of prognostics for the past ten years. He has consulted for over 100 major international electronics companies, providing expertise in strategic planning, design, test, prognostics, IP, and risk assessment of electronic products and systems. He has previously received the European Micro and Nano-Reliability Award for outstanding contributions to reliability research, the 3M Research Award for electronics packaging, and the IMAPS William D. Ashman Memorial Achievement Award for his contributions in electronics reliability analysis.
Contributors Harold S. Balaban has over 40 years experience in developing weapon system models for cost and effectiveness analyses for Department of Defense and other government agencies. He is currently employed by the Institute for Defense Analyses (IDA), where he specializes in applying reliability and maintainability concepts to weapon system life-cycle costs and effectiveness modeling. He has developed a number of models and cost-estimating relationships that enable such work to be accomplished efficiently and accurately—notably the IDA IMEASURE program for maintenance manpower estimation, the airlifter mission capable rate simulation model, and the IDA CER model for estimating depot level reparable and consumables costs. Prior to his work at IDA, Dr. Balaban was employed by ARINC Research Corporation; his last position was director, advanced analysis. He was responsible for developing and applying analytical and simulation models to perform studies of cost, effectiveness, and reliability, maintainability, and availability of military systems. He led the team that developed the highly successful System Testability and Maintenance Program, which was the forerunner of products used today to improve organizational diagnostics of military systems. He was also a major contributor to efforts to introduce long-term warranties and logistic controls in military systems acquisition. Dr. Balaban has presented and published numerous papers on reliability and maintainability, contributed chapters for three textbooks, and taught graduate courses in reliability theory and operations research at George Washington University and at University College, University of Maryland. He holds a PhD degree in mathematical statistics from the George Washington University. He contributed to Chapters 1, 12, and 15 of this book. Ned Criscimagna is president and owner of Criscimagna Consulting, LLC. 
He provides training, program assessment, and related reliability consulting services for industry and government. Prior to starting his own business, Mr. Criscimagna worked at Alion Science & Technology (previously IIT Research Institute, IITRI), where he was a senior science advisor and served in various capacities, including 5 years as the deputy director of the Reliability Analysis Center. Before joining IITRI in 1994, he served in various positions with the ARINC Research Corporation. Prior to his career with private industry, he served 20 years as an Air Force officer in various engineering, maintenance, and staff positions. While on the Air Force staff and the staff of the Air Force Systems Command, he helped develop and implement policies on reliability and maintainability, quality, and system acquisition. He was a member of the Air Force’s repair and maintenance 2000 study team and was involved with the initial efforts to implement the Department of Defense’s total quality management approach to acquisition. He is a member of the American Society of Quality Assurance and the Society of Automotive Engineers, and a senior member of the Society of Logistics Engineers. He holds a BS in mechanical engineering from the University of Nebraska, Lincoln, and an MS in systems engineering–reliability from the Air Force Institute of Technology. He contributed to Chapter 1 of this book. xiii © 2009 by Taylor & Francis Group, LLC
9879_C000g.indd 13
3/4/09 12:04:44 PM
xiv
Contributors
Diganta Das has a PhD in mechanical engineering from the University of Maryland, College Park, and a BTech in manufacturing science and engineering from the Indian Institute of Technology. He is a member of the research staff at the Center for Advanced Life Cycle Engineering. His expertise is in reliability, environmental and operational ratings of electronic parts, uprating, electronic part reprocessing, technology trends in electronic parts, and parts selection and management methodologies. He performs benchmarking processes and organizations of electronics companies for parts selection and management and reliability practices. Dr. Das also assists organizations in design improvements. He has published more than 50 articles on these subjects and has presented his research at international conferences and workshops. He served as technical editor for two IEEE standards and is currently coordinator for two additional IEEE standards. He is an editorial board member for Microelectronics Reliability and the International Journal for Performability Engineering. He is a Six Sigma black belt and is a member of IEEE and IMAPS. He contributed to Chapters 2, 4, 5, 9, and 16 of this book. Abhijit Dasgupta is a faculty member and researcher in the CALCE Electronic Packaging Research Center at the University of Maryland. He received his PhD in theoretical and applied mechanics from the University of Illinois. He conducts research in the area of micromechanical modeling of constitutive and damage behavior of heterogeneous materials and structures, with particular emphasis on fatigue and creep–fatigue interactions. His research also includes associated stress analysis techniques under combined thermomechanical loading, formulating physics-offailure models to evolve guidelines for design, validation testing, and screening and derating for reliable electronic packages. He contributed to Chapter 6 of this book. 
Joanne Bechta Dugan was awarded a BA in mathematics and computer science from La Salle University, Philadelphia, in 1980 and an MS and a PhD in electrical engineering from Duke University, Durham, North Carolina, in 1982 and 1984, respectively. Dr. Dugan is currently associate professor of electrical engineering at Duke University and visiting scientist at the Research Triangle Institute. She has performed and directed research on the development and application of techniques for the analysis of computer systems designed to tolerate hardware and software faults. Her research interests include hardware and software reliability engineering, faulttolerant computing, and mathematical modeling using dynamic fault trees, Markov models, Petri nets, and simulation. Dr. Dugan is a senior member of the IEEE and a member of the Association for Computing Machinery, Eta Kappa Nu, and Phi Beta Kappa. She contributed Chapter 11 of this book. Robert M. Hecht is a senior principal engineer with the ARINC Research Corporation. He specializes in the evaluation of reliability, maintainability, and testability problems of fielded equipment and the planning and management of product improvement programs. He has supported numerous weapon systems programs, including the P-3C, E-2C, bA-6E, EA-6B, ES-3, GUARDRAIL, QUICK FIX, EF-111A, and the M1 Abrams main battle tanks. He has extensive experience in the design for reliability of electronic, electromechanical, and mechanical systems.
© 2009 by Taylor & Francis Group, LLC
9879_C000g.indd 14
3/4/09 12:04:45 PM
Contributors
xv
Prior to joining ARINC Research, Mr. Hecht was a reliability engineer with the Bell Aerospace Company’s New Orleans operation. At Bell, he conducted reliability analysis in support of U.S. Navy surface effect ship and air cushion vehicle programs. While with the U.S. Army, Mr. Hecht managed the reliability and maintainability demonstration testing of general military equipment. He received a BS in aerospace engineering from Pennsylvania State University and an MS in engineering from the University of New Orleans. Mr. Hecht is an ASQC certified reliability engineer. He contributed Chapter 14 of this book. Jun Ming Hu is the managing director of Microsoft Asia Center for Hardware (MACH) in Shenzhen, China. The MACH team is responsible for the design, engineering, testing, and manufacturing of Microsoft hardware products, including mice, keyboards, Webcams, Xbox controllers, gaming text input devices, Zune music accessories, and other hardware products for world markets. The team also provides supporting work for Xbox console manufacturing, testing, and component sourcing and qualification. The MACH team manages many design and manufacturing partners in China. Dr. Hu joined Microsoft Corporation at Redmond in 2000 as the engineering manager of hardware reliability and component engineering. He moved to Shen Zhen in February 2004 to set up the MACH organization. Before joining Microsoft, Dr. Hu worked for Ford Motor Company in Michigan for 8 years as a senior technical specialist and engineering manager of computer-aided design for automotive electronics development. Dr. Hu received a BS in 1982 and an MS in 1985 from Shanghai Jiao-Tong University; he received a PhD from the University of Maryland in 1989. Dr. Hu holds more than 14 U.S. and international patents for electronics products and qualification methods. 
He was the associate editor of IEEE Transactions on Reliability and an editorial board member of Journal of the Institute of Environmental Sciences from 1993–1998. He is a recipient of the Asian American Corporate Achievements Award and two Henry Ford Technology Awards. He contributed to Chapters 3 and 6 of this book. Mark Kaminskiy is the chief statistician at the Center of Technology and Systems Management of the University of Maryland, College Park. He is a researcher and consultant in statistical and probabilistic reliability, life data analysis, and risk analysis of engineering systems. He has performed research and consulting projects funded by government and industrial companies such as the Department of Transportation, Coast Guard, Army Corps of Engineers, the Navy, Nuclear Regulatory Commission, American Society of Mechanical Engineers, Ford Motor, Qualcomm Inc., and several other engineering companies. Dr. Kaminskiy is the author and co-author of over 100 publications in journals, conference proceedings, reports, and books, including “Modeling Population Dynamics for Homeland Security Applications,” co-authored with B. Ayyub, in Wiley Handbook of Science and Technology for Homeland Security, edited by J. G. Voeller (John Wiley & Sons, 2008); Reliability Engineering and Risk Analysis: A Practical Guide, co-authored with M. Modarres and V. Krivtsov (Marcel Dekker, 1999, 2009); “Accelerated Testing” (Chapter 5) in Statistical Reliability Engineering (John Wiley & Sons, 1999); and “Statistical Analysis of Reliability Data” in Encyclopedia of IEEE (John Wiley & Sons, vol. 20, 1999). He received an
© 2009 by Taylor & Francis Group, LLC
9879_C000g.indd 15
3/4/09 12:04:45 PM
xvi
Contributors
MS in nuclear physics at the Polytechnic University of St. Petersburg (Russia) and a PhD in electrical engineering at the Electrotechnical University of St. Petersburg (Russia). He contributed to Chapter 3. Richard Kowalski retired as director, product assurance, from ARINC Incorporated in 2002 after 27 years with the firm. He was responsible for the development and execution of hardware and software quality program policy. Dr. Kowalski was trained in software capability evaluation (SCE), using the capability maturity model and the integrated model developed by the Software Engineering Institute at Carnegie–Mellon, and he conducted SCEs at several U.S. and European companies and for several programs at ARINC. Dr. Kowalski is a member of Sigma Xi and is a life senior member of the Institute of Electrical and Electronic Engineers (IEEE). For more than 20 years, Dr. Kowalski was a member of the IEEE Reliability Society’s Administrative Committee and is a past editor of the IEEE Transactions on Reliability. Dr. Kowalski received a BS in mathematics from Northeastern University and an MS and PhD in mathematics from Case Institute of Technology. He contributed to Chapter 7. Sony Mathew is a faculty research assistant at the Center for Advanced Life Cycle Engineering in the Mechanical Engineering Department of the University of Maryland, College Park. He is also pursuing his PhD in mechanical engineering from the A. James Clark School of Engineering at the University of Maryland. His areas of research are reliability, tin whiskers, and prognostics and health management of electronic products. He earned his MS in mechanical engineering from the University of Maryland in May 2005. He has a BA in mechanical engineering (1997) and an MBA (1999) from Pune University, India. He contributed to Chapter 8. Carol Smidts is an assistant professor in the Department of Materials and Nuclear Engineering, Reliability Program, University of Maryland, College Park. 
She obtained an MS in physics engineering in 1986 from the Universite Libre de Bruxelles, Belgium, and her PhD in physics engineering in 1991 from the same university. Her research has focused mainly on dynamic system reliability, Markovian analysis, and human reliability. Her recent work has been devoted to software reliability. She contributed to Chapter 7. Walter Tomczykowski the director of the Life Cycle Management and Operations Support Department at ARINC Engineering Services, reporting to the vice president of the Advanced Systems Division. He received an MS in reliability engineering from the University of Maryland and a BS in electrical engineering technology from Northeastern University in Boston. For over 25 years he has been leading specialized teams in the areas of reliability, maintainability, life-cycle cost, human factors, counterfeit prevention, and obsolescence management for the Office of the Secretary of Defense, Defense Logistics Agency, and Department of Defense (DoD) programs throughout the services and various federal agencies, such as the Department of Homeland Security and the Department of the Treasury. As the director for the Life Cycle Management and Operations Support Department, Mr. Tomczykowski is responsible for personnel in Boston, Annapolis (including
© 2009 by Taylor & Francis Group, LLC
9879_C000g.indd 16
3/4/09 12:04:45 PM
Contributors
xvii
Patuxent River), Maryland, Dayton, Ohio, San Antonio, Texas, Oklahoma City, and Panama City, Florida. Specifically, for obsolescence management (DMSMS— diminishing manufacturing and material shortages), his teams provide support to the Defense MicroElectronics Activity, Defense Supply Center, Columbus, NAVAIR Aging Aircraft IPT, Coast Guard, AWACS, B-2, USMC H-1, and a variety of other DoD programs. He is a primary author of the DMSMS Cost Factors, the DMSMS Program Manager’s Handbook, and the DMSMS Acquisition Guidelines. He is often requested as a keynote speaker at aging aircraft, DMSMS, and other obsolescence management conferences to share his knowledge of reliability, life-cycle cost, and obsolescence management. His work in reliability has also been published in the Wiley Encyclopedia of Electrical and Electronics Engineering. He contributed Chapter 13. Igor A. Ushakov taught for approximately 15 years at the Moscow Institute of Physics and Technology. In 1989 he was invited to be a distinguished visiting professor at George Washington University in Washington, D.C. Later, he taught at George Mason University and the University of California, San Diego. He has also worked at well-known American companies such as MCI, Qualcomm, Hughes Network Systems, and Mantech. His experience is focused on reliability and effectiveness analysis of large-scale telecommunication systems and mathematical and computer modeling of communication systems. He has been chair of sessions at numerous international conferences (in the United States, Russia, Ukraine, Canada, Japan, Great Britain, France, Italy, Germany, Norway, Poland, Hungary, and Bulgaria). He has authored more than 300 papers in various prestigious international math and engineering journals in operations research, reliability engineering and theory, and telecommunication network modeling. 
Professor Ushakov has written approximately 30 books in Russian, English, German, and Bulgarian, including three published in the United States. Publications include Histories of Scientific Insights (Lulu, Morrisville, North Carolina, 2007), Course on Reliability Theory (Drofa, Moscow, 2007), Statistical Reliability Engineering (John Wiley & Sons, New York, 1999), Probabilistic Reliability Engineering (John Wiley & Sons, New York, 1995), and Handbook of Reliability Engineering (John Wiley & Sons, New York, 1994). A member of Sigma Xi, Omega Rho, and Tau Beta Pi, Professor Ushakov is a founder of the Gnedenko Forum, an informal international association of specialists in probability and statistics. He contributed to Chapter 3. David Weiss is a consultant in the fields of reliability and systems analysis. For 10 years he served as the manager for reliability programs in the Engineering Research Center at the University of Maryland, working with faculty in the creation of a graduate program in reliability engineering. Prior to joining the University of Maryland, he was a reliability manager with General Electric Company and a partner in the consulting firm Booz Allen Hamilton. He contributed to Chapter 15.
CHAPTER 1
Product Effectiveness and Worth
Harold S. Balaban, Ned Criscimagna, Michael Pecht
CONTENTS
1.1 Introduction
1.2 Attributes Affecting Product Effectiveness
1.3 Programmatic Factors Affecting Product Effectiveness
    1.3.1 Product Effectiveness
    1.3.2 Operational Readiness and Availability
    1.3.3 Dependability
    1.3.4 Capability
    1.3.5 Reliability
    1.3.6 Maintainability
    1.3.7 Relationships Among Time Elements
1.4 Assignment of Responsibility
    1.4.1 Administrative Time
    1.4.2 Logistics Time
    1.4.3 Active Repair Time and Operating Time
1.1 INTRODUCTION
The ultimate goal for any product or system is that it perform some intended function as affordably and as well as possible. The function may be described as some output characteristic, such as satisfactory message transmission in a communication system, cargo tonnage for a transportation system, or the accuracy of weather identification for airborne weather radar. The term for the overall capability of a product to meet customer objectives is product effectiveness. If the product is effective, it carries out the intended function well; if it is not effective, deficient attributes must be improved. The term for the overall cost, including purchase price, costs associated with operation, maintenance, and repair, and disposal costs, is product worth.
1.2 ATTRIBUTES AFFECTING PRODUCT EFFECTIVENESS
Product effectiveness is a function of many product attributes and external factors. For an automobile, dependability, safety, ease of repair, and comfort are among the attributes a buyer might cite as important. In terms of product worth, purchase price, economic operation, and good resale value may be additional attributes. A successful blend of these attributes results in a car that is perceived to be of high value. For any specific product, a distinct blend of attributes is needed to achieve high product effectiveness and worth. Good product design and development require that members of the design team (and the customer, if appropriate) evaluate and discuss all pertinent attributes affecting product effectiveness during the appropriate phases of the product’s life cycle: concept formulation, research and development, production, operation, and disposal. For many products—particularly those with a long life—the highest cost is to operate, support, and maintain the product. Many of the tasks and decisions arising early in the life cycle of a product affect the product at later stages and affect costs throughout product life. Table 1.1 shows how the cost of decisions made early in development affects downstream costs. For example, although only 3 to 5% of the total development and production costs may be expended in the concept definition phase, from 40 to 60% of the total cost may be committed as a result of decisions and actions taken during that period. Table 1.2 lists some attributes that affect product effectiveness in terms of performance, availability, and affordability. The term “performance” represents operational, physical, or functional characteristics. “Availability” represents the likelihood of having the product in a usable state; “affordability” relates to the economic consequences associated with product development, purchase, and operation. 
Overall product effectiveness and worth can theoretically be improved by trading off attributes, which is an extremely complex process. For example, an automobile manufacturer wants to maximize profits and may feel this is best done by increasing market share through offering a new car that provides maximum affordability and reliability. Affordability is a function of how cheaply the car can be manufactured; features that would make the car easy to maintain might have to be compromised or eliminated to achieve ease of manufacture. Under the hood of today’s automobiles, manufacturing cost and maintenance trade-offs are apparent compared with the cars of, say, 20 years ago. New design approaches, such as electronic ignition, are more reliable than those of the past, and the use of computer diagnostics balances the repair challenge presented by today’s complex engines and transmissions. A good design team knows that attributes sometimes support each other and are sometimes contradictory and that, consequently, trade-offs become a necessary part of the development process.
Table 1.1 Product Development and Production Costs
                              Percent of Total Costs
Development Process Phase     Incurred      Committed
Concept definition            3–5%          40–60%
Design                        5–8%          60–80%
Testing                       8–10%         80–90%
Process planning              10–15%        90–95%
Production                    15–100%       95–100%
Table 1.2 Example Attributes Affecting Product Effectiveness and Product Worth
Performance
    Operational: range; speed; accuracy; vulnerability; payload; output power
    Physical: volume and density; weight; input power; environment
    Functional: safety; mission success rate
Availability
    Reliability: failure-free operation; redundancy or graceful degradation; mean time to failure
    Maintainability: ease of repair (access, time to repair); required resources (manpower, tools); fault detection and isolation (testability)
    Logistics supportability: sparing; training; facilities
Affordability
    Cost to: develop or buy; own or operate; maintain; dispose
    Time to develop
1.3 PROGRAMMATIC FACTORS AFFECTING PRODUCT EFFECTIVENESS
A typical history of the development of a new product reveals a number of steps in the progression from original concept to an acceptable production model. These steps are particularly marked if the equipment represents a technical innovation—that is, if it pushes the state of the art by introducing entirely new functions or by performing established functions in an entirely new way. The marketplace (or an existing customer base) defines the need for new or improved technical performance. The design and development team executes a multitude of operations leading to accomplishment of program objectives, primarily the production of a system or product that will perform as intended, with minimum breakdowns and rapid repair. This must be done within acceptable development, production, and support budgets and within an established schedule. The three program criteria—performance, cost, and schedule—impose severe pressures on a company. Just as compromises among product attributes are required to achieve desired product effectiveness, compromises are often necessary among program objectives. These compromises begin early in the development process, usually in the basic research and concept validation phases. For example, the time allocated to develop needed technologies or to prove concept feasibility may be curtailed to meet a schedule driven by a competitive challenge.
After preliminary work on a typical product, a prototype model is built; ideally, it represents the final product as closely as possible. Its purpose is to establish the initial feasibility of satisfying critical effectiveness attributes. This model could be a hardware or software prototype, or a computer simulation of the system, key subsystems, or components. The prototype may be crude in appearance, unsuitable for production line manufacturing, subject to frequent failure, or repairable only by skilled technicians using expensive equipment and considerable time. Early attention to manufacture, quality, and reliability can save time and money later in the product development program. As the program moves forward, changes to improve reliability become more difficult and expensive, the schedule becomes more inflexible, and budgets become tighter. Despite increased emphasis on reliability, many new products experience serious growing pains during their first years of operation as designers undertake extraordinary and sometimes frantic efforts to determine causes of failure and to eliminate them through modifications, upgrades, or changes in operating and maintenance procedures. Factors important in the development of a new product (revolutionary change) also apply to modification or development programs integrating proven equipment (evolutionary change). For both revolutionary and evolutionary development, reliability is a key attribute affecting product effectiveness and should be considered from the outset. Figure 1.1 shows the major components of product effectiveness: availability, dependability, and capability. In turn, availability and dependability have reliability, maintainability, and logistics supportability as their major constituent elements.
Figure 1.1 Major components of product effectiveness. [Block diagram; recoverable box annotations: a measure of how well the product does its job; a measure of the product’s condition when first required to perform; a measure of the product’s condition during the performance of its function; a measure of how well the product’s performance meets objectives; maintainability (restoration capability); logistic support (external factors).]
1.3.1 Product Effectiveness
Product effectiveness can be formally defined as the ability of a product to meet an operational demand when operated under specified conditions. Effectiveness is influenced by how a product is used and maintained, as well as by the design and production processes. It can also be influenced by the logistics support system, company policies, regulations and laws governing product use, fiscal constraints, and other administrative policy decisions. For single-use systems, such as missiles, torpedoes, and fuses, operating time and calendar time during the operational phase are relatively unimportant, as is repair of failed units. However, these time elements are critical in determining the effectiveness of a multi-use product, which also may have to accommodate the repair of failures. A product fails if it does not operate when called upon to perform or if it fails to operate successfully (that is, does not complete its function or mission). Both multi-use and one-shot products must be operated and supported under specified conditions defined by the customer or supplier. If a product is pushed to operate at higher stresses for uses unforeseen by the design team, product effectiveness may be decreased. The U.S. Air Force’s experience with the B-52 aircraft exemplifies how a change in usage environment can affect system effectiveness. The B-52 was originally designed as a high-altitude bomber, but changing needs required the Air Force to include low-altitude penetration as one of the aircraft’s missions. Because low-altitude flight imposed higher stresses on the airframe, additional modifications were necessary to strengthen the structure and maintain the desired service life. “Specified conditions” also include whether the product is used in continuous or cyclic operation. In continuous operation, maintenance is performed after a failure occurs, and any failure reduces product effectiveness. 
For products operated cyclically, such as a car or an airplane, maintenance can be performed in windows of time when product operation is not critical. Potential failures can be averted through a planned preventive maintenance program. Removing the product from the readiness state for a portion of each day to perform maintenance may increase effectiveness. However, if the percentage of equipment that becomes inoperable prior to demand for use is insensitive to preventive maintenance, it is best to maintain a continual state of readiness and perform only corrective maintenance. Another influence on product effectiveness is a change in operational requirements (technical performance attributes). For example, the vulnerability of a target may be reduced by a change in target design, such as the addition of armor or electronic countermeasures in a military system. The effectiveness of the system intended to counter the target would then decrease even though no degradation of the system itself had occurred. Consider a race car designed to attain a top speed of 200 miles per hour. If competitors’ cars are able to attain top speeds of 210 miles per hour, all other factors being equal, the effectiveness of the slower car has decreased. The terms design effectiveness and use effectiveness are sometimes used to describe the performance of a product. Design effectiveness measures how well the product meets specific performance requirements under test conditions that minimize
operator, maintenance, and logistic influences. Use effectiveness is at the other end of the effectiveness spectrum. It attempts to assess how well the product meets the demands placed on it, even if such demands exceed specifications. Although a sports car and a station wagon both provide transportation, the use effectiveness of the sports car for cargo carrying is not very high, just as the station wagon may not meet handling or acceleration requirements very well. Product effectiveness measures how well the product does the job for which it was purchased. It is a function of availability (the likelihood that the product is ready to start the job), dependability (likelihood that the product will operate in states that will produce output designed to do the job), and capability (how well the designed outputs actually accomplish the necessary tasks, given the states in which the product operated). Each of these topics—availability, dependability, and capability—will now be addressed, followed by a discussion of the three major components of availability and dependability: reliability, maintainability, and logistics supportability.
1.3.2 Operational Readiness and Availability
The capability of a product to perform its intended function when called upon is its operational readiness or its operational availability.* The difference between readiness and availability is that the latter includes only operational and downtimes, while the former also includes free and storage times—that is, periods when the product is not needed. Operational readiness or availability differs from product effectiveness in several ways. Its emphasis is on the “when called upon” aspect, rather than on the completion of the task or mission. This emphasis focuses on a probability at a point in time rather than over an interval, as is the case with the mission success rate (the percentage of successfully completed missions).
This interval of time can be extremely long, as in the case of a satellite on a long-term mission to another planet; the satellite may be operationally available at launch time, but that does not ensure that it will operate successfully for the duration of its mission. For products that are continually used and are providing useful output, availability is often estimated by calculating the fraction of total “need time” in which the product is operational or capable of providing useful output. Another difference between operational availability and product effectiveness is that the performance attributes of the latter include designed-in capabilities, such as accuracy, power, and weight. Operational availability typically excludes detailed examination of these characteristics by addressing only the product’s readiness to perform its intended function at a particular point in time. Depending on the intended use, one or more performance attributes may apply to availability. The difference between a product’s being operational or not is often a function of the customer’s definition of failure, which depends on the use of the product. If the performance related to a critical attribute is not satisfactory, the customer may consider the product to be “down,” and readiness or availability from that point until the need ends or until the deficiency is corrected is zero. For example, if a radar set has a specified range of 50 miles, should the radar be considered down if it is effective only to 45 miles? If the 50-mile range is the absolute minimum needed to avoid midair collisions, the aircraft on which the radar is installed would be considered unflyable, and the radar would be considered unavailable for the mission. If the 50-mile range is a goal value and 20 miles is the absolute minimum, a 45-mile range might be acceptable. An availability calculation could be based on a definition that includes as uptime all periods for which the range is at least 20 miles.
* Although terms such as availability were at one time closely associated with military systems, they are now more widely used in commercial industry. The availability of an off-shore oil rig, for example, is of extreme importance.
Operational availability and readiness, therefore, relate uptime and downtime to the conditions under which the product will be used. The following definitions are used:
• The operational availability of a system or product is the probability that it is operating satisfactorily at any point in time when used under stated conditions, where the total time considered includes operating time, active repair time, administrative time, and logistic time.
• The operational readiness of a system or product is the probability that, at any point in time, it is either operating satisfactorily or is ready to be placed in operation on demand when used under stated conditions, including allowable warning time. Total calendar time is the basis of computation.
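The two definitions differ mainly in their time base. As a rough sketch, assuming each probability is estimated as a simple fraction of logged time (the estimate mentioned above for continually used products), the following computes both quantities. The function names, the `ready_idle` category (time when the product is operable but not in use), and all hour figures are illustrative assumptions, not data from this chapter.

```python
def operational_availability(operating, active_repair, administrative, logistic):
    """Fraction of needed time in which the product operates satisfactorily.

    The time base is operating time plus the three downtime elements;
    free and storage time are excluded.
    """
    total = operating + active_repair + administrative + logistic
    if total <= 0:
        raise ValueError("total time must be positive")
    return operating / total


def operational_readiness(operating, ready_idle, total_calendar):
    """Fraction of calendar time in which the product is operating or ready.

    ready_idle is time when the product is operable but not in use
    (a hypothetical bookkeeping category used only for this sketch).
    """
    if total_calendar <= 0:
        raise ValueError("total calendar time must be positive")
    return (operating + ready_idle) / total_calendar


# Hypothetical 1,000-hour calendar period (all values in hours).
a_o = operational_availability(operating=700.0, active_repair=20.0,
                               administrative=10.0, logistic=20.0)
r_o = operational_readiness(operating=700.0, ready_idle=230.0,
                            total_calendar=1000.0)
print(f"operational availability: {a_o:.3f}")
print(f"operational readiness:    {r_o:.3f}")
```

Changing the definition of failure, as in the radar example, changes how many hours are logged as operating time and therefore changes both estimates.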
A subset of operational availability is intrinsic or inherent availability. Like the design effectiveness concept, this measure attempts to minimize the effects of external influences by considering only active repair time and required use time. Thus, free time when the product is not needed and downtimes due to logistic and administrative delays are excluded. Intrinsic availability is a built-in capability; thus, the design and production engineers first must address discovered problems, assuming that operating conditions are compatible with design specifications. If these engineers cannot resolve the problem, then the product operations manager may be assigned to reduce administrative or logistics delays or to utilize and maintain the product more efficiently.
1.3.3 Dependability
Most products can be in any one of a number of different states during their operation. Dependability measures the likelihood of each possible product state. If a product contains n identifiable components and each component can be in only one of two states (say, success or failure), then the product can be in any one of 2^n states. For example, a product with 10 components has 1,024 possible states, if each component is either up or down. We do not usually quantify dependability by a single number as we may do for availability, but rather use the dependability concept to quantify effectiveness. However, dependability quantification is possible for simple cases. For example, for our sample product, we may define a subset of the 1,024 states as success states; the
product is considered dependable if it operates within this subset. However, not all of the success states will necessarily result in the same level of acceptable output; in such cases, the capability measure has to be considered, as discussed in the next section. From a more analytical point of view, the dependability concept describes how the product transitions from one state to another. For example, the failure of a component will generally transition the product from its present state to a less capable state. If repair during operation is possible, there may be a transition back to the more productive state. If an item failure brings the product down, then no useful output may be produced until repairs are made.
1.3.4 Capability
Capability measures how well the product accomplishes the task it is assigned. It is normally a state-dependent measure. If the product is not operating, then its capability would normally be zero, but not always. Consider a tank protecting an enclave from rebel troops. The tank may not be able to fire, but if the enemy sees the tank and is unaware of its state, its protective mission may still be accomplished while repairs are undertaken. On the other hand, a product that is operating as it is supposed to may not have the highest capability. An optical aerial camera may not get a desired picture because of cloud cover, even though all components are operating perfectly. Products that have backup or redundant modes of operation will have a number of states that can produce useful output. For each state, a capability measure exists. For example, the speed, range, and fuel consumption of a multi-engine aircraft depend on the number of engines operating. The units of measure of capability depend on the product and its tasks. The capability measure may be directly related to such product output as picture resolution, number of messages delivered, kilowatts of power produced, or the amount of damage to the enemy.
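The state-counting argument of Section 1.3.3 and the state-dependent capability measure described here can be sketched together. The code below enumerates the 2^n up/down states, assigns each a probability under the assumption that components behave independently (an assumption made only for this sketch), and weights a hypothetical per-state capability by that probability. The per-component probabilities and the capability function are illustrative, and this weighting is one possible way to combine the two concepts rather than a method prescribed by the chapter.

```python
from itertools import product


def enumerate_states(n):
    """Return all 2**n up/down combinations for n two-state components."""
    return list(product((True, False), repeat=n))


def state_probability(state, p_up):
    """Probability of one state, assuming independent components."""
    prob = 1.0
    for up, p in zip(state, p_up):
        prob *= p if up else (1.0 - p)
    return prob


def expected_capability(p_up, capability):
    """Sum over all states of P(state) * capability(state)."""
    return sum(state_probability(s, p_up) * capability(s)
               for s in enumerate_states(len(p_up)))


# The 10-component example: each component is either up or down.
print(len(enumerate_states(10)))

# Hypothetical four-engine aircraft: capability is taken to be the
# fraction of engines operating (an illustrative scale only).
engines_up = [0.9, 0.9, 0.9, 0.9]
print(expected_capability(engines_up, lambda s: sum(s) / len(s)))
```

Replacing the capability function with one that returns 1 for success states and 0 otherwise gives the probability that the product operates within the chosen set of success states.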
When it is difficult to define or to quantify such a measure, an ordinal scale may be used—for example, from 0 to 100, with 100 representing the best possible output. A probability measure may also be used in some cases. Each of the possible product states is determined to be either a success or a failure. Then the product capability is the probability that the product operates within the class of success states.
1.3.5 Reliability
A critical attribute determining product effectiveness is reliability, which is a measure of the product’s ability to avoid failure. A reliability deficiency will eventually result in an impaired or lost performance, compromised safety, and the need for such restorative actions as diagnosis, repair, spare replenishment, and maintenance. High-reliability products will operate longer, allowing resources to be focused on improving performance. Within a product effectiveness context, satisfactory operation is normally associated with a defined envelope of satisfactory outputs. If all the product outputs are within this envelope, then the product is operating reliably. Note that reliable operation by this definition does not imply satisfactory results. An optical aerial camera operating in a cloudy environment is an example.
Figure 1.2 A typical reliability function. [Plot of R(t) against time; the vertical axis runs from 0 to 1.0.]
Observed reliability is the ratio of items operating within specifications for the stated period to the total number of items in the sample. A reliability function is this same probability expressed as a function of the time period. Figure 1.2 is an example of a reliability function. Products that have built-in test equipment (BITE) to monitor output and warn the operator when one or more outputs are out of tolerance can be continually assessed for reliability. BITE is now a common design practice for industrial products, and it is becoming more prevalent for consumer products, especially for electronic items. Nevertheless, the assessment of product performance often has to be made by the operator, and distinguishing between a capability and a reliability problem is not always easy. A washing machine user may easily determine that the washing machine has failed because water is flooding the laundry room. A more difficult assessment is determining the reason that the washed clothes do not appear to be as clean as they should. Whether the problem is one of reliability (e.g., failure of the motor to agitate the water properly) or capability (e.g., insufficient motor capacity for the wash load), the usual decision is to assign the problem to reliability unless a specific analysis of effectiveness is conducted. Mission reliability is usually defined as the probability that a product will operate successfully for the duration of the mission, given that it is ready to start the mission when called upon to do so. Mission reliability, therefore, is the probability that no failure will occur during the mission that prevents the mission from being satisfactorily completed. For a one-time operation, this probability is a point on the reliability function curve corresponding to a time equal to the mission time.
If repeated missions are undertaken and wearout may be occurring, adjustments must be made for cumulative operating or stress time following the most recent maintenance or restoration. All the alternative modes of operation required for mission completion must be considered in mission reliability. Alternative modes include operations using redundant or backup units that take over for failed units.
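The ratio definition of observed reliability, and the reading of mission reliability as a point on the reliability curve, can be illustrated with a small sketch. It assumes a hypothetical life test in which every item is observed for the whole test, so the estimate is only meaningful for times within the test duration; the failure times below are invented for illustration.

```python
def observed_reliability(failure_times, t):
    """Fraction of sampled items still within specification at time t.

    failure_times holds one entry per item: the time at which the item
    failed, or None if it did not fail during the test.
    """
    if not failure_times:
        raise ValueError("sample must not be empty")
    survivors = sum(1 for ft in failure_times if ft is None or ft > t)
    return survivors / len(failure_times)


# Hypothetical 1,000-hour test of ten items (hours to failure).
sample = [120, 340, 560, 800, None, None, 950, 410, None, 700]

# Evaluating the ratio at several times traces out an empirical
# reliability function that starts at 1.0 and never increases.
for t in (0, 250, 500, 750, 1000):
    print(t, observed_reliability(sample, t))

# For a one-time 500-hour mission, the estimate is the point at t = 500.
mission_reliability = observed_reliability(sample, 500)
```

This simple ratio makes no allowance for wearout across repeated missions or for redundant operating modes, both of which the text notes must be considered in a fuller analysis.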
Logistic reliability, on the other hand, is concerned not only with mission accomplishment but also with all failures that place a demand on the logistic system, regardless of when the failures occur or whether they affect mission accomplishment, require maintenance, require spare parts, or all three. Thus, component redundancy, which normally increases mission reliability, almost always decreases logistic reliability. A natural measure of logistic reliability is the demand rate, which tracks the demand events occurring when a failure triggers the logistics support system.
1.3.6 Maintainability
Maintainability addresses the ease and economy with which the maintenance actions necessary to restore a failed product to a satisfactory state can be taken. Following a failure, restoration involves isolating the source of the failure, correcting the problem, checking out the product, removing test equipment and tools, securing all access doors and panels, and making the product acceptably available to perform its required function. The statistical average for downtime during restoration actions is called mean downtime (MDT). MDT comprises diagnostic time, active repair time, logistic delay, and administrative delay. The relative ease with which a product can be kept in operational condition or restored to it after failure is typically embodied in the maintainability characteristic. Maintainability, which comprises all active repair time, is a fundamental design attribute; most of the effort to affect this attribute favorably is expended in the design phase. For a product to be highly maintainable, the design should not be complex; equipment should be easy to access, remove, and replace; the types of fasteners should be as uniform as possible; few special tools should be needed; and so forth. Such factors are the responsibility of the design engineer.
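The mean downtime defined earlier in this section is an average over restoration events of the four listed elements. A minimal sketch, assuming a hypothetical maintenance log with one record per restoration (the field names and hour values are invented for illustration), shows the calculation and how much of the total each element contributes.

```python
from statistics import mean

ELEMENTS = ("diagnostic", "active_repair", "logistic_delay", "administrative_delay")


def mean_downtime(events):
    """Average total downtime per restoration event (same units as input)."""
    return mean(sum(event[name] for name in ELEMENTS) for event in events)


def element_shares(events):
    """Fraction of all logged downtime attributable to each element."""
    totals = {name: sum(event[name] for event in events) for name in ELEMENTS}
    grand_total = sum(totals.values())
    return {name: value / grand_total for name, value in totals.items()}


# Hypothetical log of three restorations (hours).
events = [
    {"diagnostic": 0.5, "active_repair": 1.5,
     "logistic_delay": 4.0, "administrative_delay": 1.0},
    {"diagnostic": 0.25, "active_repair": 1.0,
     "logistic_delay": 0.0, "administrative_delay": 0.75},
    {"diagnostic": 1.0, "active_repair": 2.0,
     "logistic_delay": 8.0, "administrative_delay": 1.0},
]

print(mean_downtime(events))
print(element_shares(events))
```

In this invented log most of the downtime is logistic delay rather than active repair, which is the kind of split that separates design-driven maintainability from support-system effects.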
The most general definition of maintainability is the probability that, when maintenance is initiated under stated conditions, a failed product will be restored to operational effectiveness within a given period of time, excluding downtime due to logistic or administrative delay. A subset of maintainability is testability. Testability is defined in terms of failure detection and source isolation. The definition may be expanded by including the rapidity and accuracy of detection and isolation. Ideally, all failures (and only failures) are detected as soon as they occur, allowing the operator to take appropriate action (for example, turning the product off to prevent further damage). Failures can be detected by human observation (for example, the operator may see smoke or an invalid product response) or by the product itself using a built-in test. Similarly, the maintainer can isolate the source of failure and identify the cause using manual or semiautomatic methods to check various components until the failure is found or using automatic built-in tests. In practice, some failures are intermittent and difficult to detect and isolate. Definitions for the time divisions are given in Table 1.3. Time is of fundamental importance for quantifying product or system properties because it permits measurement rather than qualitative description. The usual measures of time—year, month, day, and hour—normally form the basis for the computation of reliability, maintainability, and availability parameters. However, because there are so many ways of
© 2009 by Taylor & Francis Group, LLC
© 2009 by Taylor & Francis Group, LLC
PRODUCT EFFECTIVENESS AND WORTH
Table 1.3
11
Definitions of Time Elements
Time Element Operating time
Downtime r Realization time r Active repair time
r Logistic delay
r Administrative delay Free time
Standby time
Access time
Diagnosis time
Replacement time
Supply delay
Checkout time
Definition
- Time during which the product is operating in a manner acceptable to the operator; this element includes the time when the customer is dissatisfied with the manner of operation, but not so dissatisfied that the product must be shut down for repair or, if repair will not satisfy customer needs, discarded
- Total time during which the product is not in an acceptable operating condition or is not operationally ready
- Time that elapses before the fault condition becomes apparent
- Portion of downtime during which actual maintenance takes place; included is the time to prepare the product for repair, locate the fault, correct the fault, and check out the product
- That portion of downtime during which repair is delayed (waiting time) solely because a part or unit needed to make a repair is not available
- That portion of downtime not covered by active repair or logistic time
- Time during which operational use of the product is not required; this time may or may not be downtime, depending on whether the product is in operable condition; during free time periods, downtime is not included in operational availability calculations
- Time during which the product is operable but is being held as a spare; standby time is the time during which the product is operable but is not being used to perform a useful function; the product can be called upon to operate at any random point of time during the period
- Time from realizing that a fault exists to making contact with displays and test points and commencing fault finding; this does not include travel or preparation; access time reflects the removal of covers and shields and the connection of test equipment and is determined largely by mechanical design
- Fault-finding time, including the adjustment of test equipment (e.g., setting up an oscilloscope or generator), carrying out checks (e.g., examining waveforms for comparison with a handbook), interpreting information (this may be aided by algorithms), verifying conclusions, and deciding upon corrective action
- Time for removing the faulty line replaceable assembly (LRA), followed by connecting and wiring a replacement as appropriate; the LRA is the replaceable item beyond which fault diagnosis does not continue; replacement time is largely dependent on the choice of LRA and on mechanical design features, such as the choice of connectors
- Time required from the point of identifying the need for a maintenance part or assembly (LRA) until that part or assembly is in the hands of the maintenance technician; supply delay can be factored into elements such as time to remove the part from the maintenance technician’s tool kit, time to obtain the part from a supply bin, time to receive the part from a warehouse at another site, or time to procure the part from a manufacturer
- Time of verifying that the fault condition no longer exists and that the product is operational; it may be possible to restore the product to operation before completing the checkout, in which case, although checkout is a repair function, not all of the checkout time constitutes downtime
- Adjustments may be required when a new module is inserted into the product; as in the case of checkout, some or all of the alignment time may fall outside the downtime window
Figure 1.3 Principal divisions of calendar time.
delineating these time intervals, the method adopted in each investigation must be carefully developed in order to provide the desired results. In general, the time interval of interest is the total calendar time during which the product is in use. As shown in Figure 1.3, this interval may be divided into available time and unavailable time. During available time, the product is available for use by the intended user; during unavailable time, the product is being supplied, repaired, or restored and is not available for use. Thus, there are really two time-division criteria: the equipment’s state of operability and the demand for its use. These criteria are outlined as follows:

Criterion 1: product state of operability
- operable/inoperable
- administrative delay
- logistic delay
- realization time
- repair time

Criterion 2: demand for product use
- use required
- use not required
  - storage time
  - free time
  - standby time
Figure 1.4 Relationships among the time elements as they influence product effectiveness.
1.3.7 Relationships Among Time Elements

An examination of the relationships among the various time elements can provide additional insight into the properties of the product effectiveness components. As an aid in doing this, Figure 1.4 shows how various time intervals combine and influence the product effectiveness components. Note that capability is not normally a time-dependent parameter, and therefore no time factors are shown to influence it.
1.4 ASSIGNMENT OF RESPONSIBILITY

Even before discussing quantitative measurement for the concepts displayed in Figure 1.4, it is possible to demonstrate how such measures can be helpful in locating trouble areas and assigning responsibility for remedial action to improve effectiveness. These concepts can provide information for comparative evaluation of competing equipment or systems and for determining the particular characteristics responsible for the differences. The property of the time breakdown that leads to these results is the relationship between the lengths of various time intervals and the responsibilities of various personnel groups. It is apparent that administrative personnel are responsible for controlling free time, storage time, administrative time, and logistic time, while production and design engineers are responsible to a large degree for operating time
(failure frequency) and active repair time. Of course, maintenance and design engineers share the responsibility for active repair time. To achieve the utmost effectiveness, it is necessary to maximize operating time and minimize downtime.

The role of non-use time (free time and storage time) is to serve as a safety valve. Maximum free time means minimum pressure for product use; storage time results from the existence of spares to carry the operational load in case of emergency. Because the deterioration rate during storage may be different from that during use and because, by error, some inoperable equipment may be placed in storage, this time element must be considered in determining operational readiness.

Large amounts of free time result when a product has a short operating time (for example, when equipment is needed relatively infrequently) and there is firm scheduling of the need. It can also happen if working hours are restricted instead of continuous. For example, banks need time locks on safes at night but not during the day. Some communication equipment regularly has free time. Automatic answering services are needed only when the operator is absent. Some television stations have regular hours during which there is no telecast. It is clear that operational readiness can be enhanced by using free time for maintenance, and free time can thus compensate to some extent for poor maintainability and poor reliability.

The important point with respect to free time and storage time is that they provide administrative flexibility to help alleviate the effects of equipment inadequacies and thus to gain operational readiness. However, it is important to note that free time and storage time have no connection with improving poor equipment. They provide an inferior but sometimes necessary alternative to the preferred solution of obtaining better equipment. They are a substitute for quality, but not a way of achieving it.
It follows from the foregoing discussion that the more significant indicators of equipment characteristics are to be found in times other than free time and storage time. Figure 1.4 shows that these other types of time are all involved in the concept of availability, which combines operating time with total downtime, including the three subcategories of downtime: administrative time, logistic time, and active repair time. These subcategories involve both administrative and engineering responsibilities.

1.4.1 Administrative Time
The administrative time category is almost entirely determined by administrative decisions about the processing of records and the personnel policies governing maintenance engineers, technicians, and those engaged in associated clerical activities. Establishing efficient methods of monitoring, processing, and analyzing repair activities is the responsibility of administration. In addition, administrative time has been defined to include wasted time because such time is the responsibility of administration. It is independent of engineering as such and is not the responsibility of the equipment manufacturer.
1.4.2 Logistics Time

Logistic time is the time consumed by delays in repair due to the unavailability of replacement parts. This is a matter largely under the control of administration, although the requirements for replacements are determined by operating conditions and the built-in ability of the equipment to withstand operating stress levels. Policies determined by procurement personnel can, if properly developed, minimize logistic time. Therefore, the responsible administrative officials in this area are likely to be different from those who most directly influence the other time categories. This justifies separate consideration of logistic time.

1.4.3 Active Repair Time and Operating Time

Active repair time and operating time are both determined principally by the built-in characteristics of the equipment, and hence are primarily the responsibility of the equipment manufacturer. Improvement in this area requires action to reduce the frequency of failure, to increase the ease of repair, or both. Operating time and active repair time are associated with the concepts of reliability and repairability, respectively, which are related through the concept of intrinsic availability. Administration can do little to reduce active repair time or increase operating time (i.e., failure-free time). Administrators can influence these time elements to a limited extent by assuring that operating stress levels are within design specifications and that the maintenance shop is supplied with proper tools and adequately trained personnel.

Because products are generally purchased, most customers want to buy products that perform their intended function at the lowest total cost. Cost studies show that the total cost of ownership (including initial and operating costs for the service life of the equipment) can be materially reduced if proper attention is given to reliability and maintainability early in the design of the product.
These considerations lead to the concept of product worth, which is illustrated in Figure 1.5 and relates product effectiveness to total cost, scheduling, and personnel requirements.
Figure 1.5 Concepts associated with product worth.
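The division of downtime into active repair, logistic, and administrative time can be illustrated numerically. The sketch below is not taken from this section: it assumes the common textbook ratios in which intrinsic availability counts only active repair time as downtime, while an operational figure also counts logistic and administrative time. The function name and all hour values are hypothetical.

```python
def availability(operating_time, *downtime_elements):
    """Fraction of (operating time + listed downtime) spent operating.

    Assumed ratio form: operating / (operating + sum of downtime elements).
    """
    total_down = sum(downtime_elements)
    return operating_time / (operating_time + total_down)


# Hypothetical hours over one reporting period (illustrative values only).
operating = 900.0
active_repair = 20.0
logistic = 50.0
administrative = 30.0

# Intrinsic availability: only active repair time counts as downtime.
intrinsic = availability(operating, active_repair)

# Operational view: logistic and administrative delays also count.
operational = availability(operating, active_repair, logistic, administrative)

print(f"intrinsic   = {intrinsic:.3f}")
print(f"operational = {operational:.3f}")
```

With these made-up numbers, most of the gap between the two figures comes from logistic and administrative time, which is the kind of observation the responsibility breakdown in Section 1.4 is meant to support.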
Figure 1.6 Outline of requirements for a technical development plan. [Diagram elements: technical requirements; cost (initial and operating) and schedule requirements; personnel requirements; product design plan; operability plan; dependability (reliability, maintainability) and supportability plan; development, production, delivery, and installation plan; personnel requirements and training plan for system development and utilization; test and evaluation plan.]
To optimize product worth, program managers face the difficult task of striking balances to maximize product effectiveness while minimizing total cost, development time, and personnel requirements (see Chapter 13). Cost, schedule, and personnel are constraints faced by both military and commercial program managers. In the commercial world, time to market and staying up with the competition are additional constraints. The political constraints surrounding most military programs are unique to the military manager. In practice, managers from both communities select, from several alternatives, the most promising product or component for which development effort is required. This selection can be facilitated by forming technical development plans, as outlined in Figure 1.6.

At this point it should be noted that product effectiveness applies to the operation of a product in its use environment and is capable of being measured. However, because the actual use environment is often unknown or beyond the control of the product manufacturer, only certain elements of the product effectiveness concept can be specified for contractual purposes. From a practical point of view, a mission or use analysis must be conducted to determine the required level of intrinsic availability, as well as the needed performance characteristic (design capability).

The problem of specifying product requirements becomes increasingly complex if redundant or multimodal operation is employed. For example, to achieve the required level of availability at the optimum cost level, the product design team may have to consider several alternative approaches. Should several redundant systems be used so that one or more spare systems will always be available and the failed system can be repaired under less pressing circumstances, or should one highly reliable product be developed that can be repaired quickly?
In many cases, improved reliability and repairability by the use of higher quality parts, redundant circuitry, plug-in assemblies, and simplified or semiautomatic fault
isolation devices can do much to improve availability and reduce total downtime. Trade-off analyses, therefore, are often essential in determining the minimum requirements for achieving the required availability with the optimum expenditure of money, personnel, and time, among which trade-offs will also be required.

Historically, significant reliability improvements in electronic systems were made as industries transitioned from vacuum tubes to solid-state components to microchips. New materials and approaches to design and stress analysis contributed to reducing failures. As reliability problems were attacked and alleviated, the need to address maintainability, logistics, and cost issues became more evident. Product effectiveness provides an approach to product operation, support, and performance. For example, reliability engineers working on redundancy in the early 1960s found that, by adding components and a switching mechanism to a functionally duplicate operating component, reliability could theoretically be improved. But, as these designs were implemented, it became clear that penalties were incurred in such areas as power, weight, maintainability, and cost. Of what value is adding a duplicate transmitter circuit if the power requirements of a redundant design cause a significant increase in the failure rate of the power supply? Product effectiveness and product worth, which account for cost and resource usage, provide a conceptual framework for determining how best to direct design and development to achieve the desired performance more efficiently and effectively.
CHAPTER 2
Reliability Concepts
Diganta Das, Michael Pecht

CONTENTS
2.1 Introduction
2.2 Reliability
2.3 Probability Density Function
2.4 Hazard Rate
2.5 Conditional Reliability
2.6 Time to Failure
Homework Problems
2.1 INTRODUCTION

This chapter presents some of the fundamental definitions and mathematical theory for reliability. The focus is on the reliability and unreliability functions, the probability density function, the hazard rate, the conditional reliability function, and some time-to-failure metrics.

2.2 RELIABILITY

For a constant sample size, $n_o$, of identical products that are tested or being monitored, if $n_f$ products have failed and the remaining $n_s$ products are still operating satisfactorily at any time $t$, then

$$n_s(t) + n_f(t) = n_o \tag{2.1}$$
The factor “time” in Equation 2.1 can pertain to age, total time elapsed, operating time, number of cycles, or distance traveled, or it can be replaced by a measured quantity, which ranges from −∞ to ∞. This quantity is called the variate in statistics. Variates may be discrete (e.g., number of cycles) or continuous, when they can take on any value within a certain range. The number of failures of a product (or process or event) that occur up to a given time is a fundamental reliability index. The ratio of failed products per sample size is an estimate of the unreliability, $\hat{Q}(t)$, of the product at any time $t$. That is,

$$\hat{Q}(t) = \frac{n_f(t)}{n_o} \tag{2.2}$$
where the “hat” on the top of the variable indicates that it is an estimate. Similarly, the estimate of reliability, $\hat{R}(t)$, of a product at time $t$ is given by the ratio of operating (not failed) products per sample size:

$$\hat{R}(t) = \frac{n_s(t)}{n_o} = 1 - \hat{Q}(t) \tag{2.3}$$
As fractional numbers, $\hat{R}(t)$ and $\hat{Q}(t)$ range in value from zero to unity; multiplied by 100, they give the probability in the form of percentages.

EXAMPLE 2.1
A semiconductor fabrication plant has an average output of 10 million devices per week. Over the last year, it has been found that 100,000 devices have been rejected in final test. (a) What is the unreliability of the semiconductor devices according to the conducted test? (b) If the tests reject 99% of all defective devices, what is the chance that any device a customer receives will be defective?

Solution:
(a) The total number of devices produced in a year is

$$n_o = 52 \times 10 \times 10^6 = 520 \times 10^6$$

The number of rejects (failures) $n_f$ over the same period is

$$n_f = 1 \times 10^5$$

Therefore, from Equation 2.2, an estimate for device unreliability is

$$\hat{Q}(t) = \frac{n_f(t)}{n_o} = \frac{1 \times 10^5}{520 \times 10^6} \approx 1.92 \times 10^{-4}$$

or 1 chance in 5,200.

(b) If the failed devices represent 99% of all the defective devices produced, then the number of defectives that passed testing is

$$x_d = \left(\frac{1 \times 10^5}{0.99}\right) - (1 \times 10^5) \approx 1010$$

Therefore, the probability of a customer getting a defective device, or the expected unreliability of the supplied devices on first use, is

$$\hat{Q}(t) = \frac{1010}{(520 \times 10^6) - (1 \times 10^5)} \approx 1.94 \times 10^{-6}$$

or 1 chance in 515,000.
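The arithmetic of Example 2.1 can be checked with a few lines of code. This is a minimal sketch using only the figures stated in the example; the variable names are illustrative and not part of the original text.

```python
# Figures stated in Example 2.1.
weekly_output = 10_000_000      # devices per week
weeks_per_year = 52
rejected = 100_000              # devices rejected in final test over the year
test_coverage = 0.99            # fraction of defective devices the test rejects

# (a) Unreliability estimate, Equation 2.2: Q = n_f / n_o.
n_o = weekly_output * weeks_per_year
q_hat = rejected / n_o

# (b) Defective devices that escaped the test, and the resulting
# unreliability of the devices actually shipped to customers.
total_defective = rejected / test_coverage
escaped = total_defective - rejected
shipped = n_o - rejected
q_shipped = escaped / shipped

print(f"(a) Q = {q_hat:.3g}, roughly 1 chance in {1 / q_hat:,.0f}")
print(f"(b) escaped = {escaped:.0f}, Q = {q_shipped:.3g}, "
      f"roughly 1 chance in {1 / q_shipped:,.0f}")
```

Small differences from the rounded figures quoted in the example are due to rounding of intermediate values.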
Reliability estimates obtained by testing or monitoring samples generally exhibit variability. For example, light bulbs designed to last for 10,000 hours of operation that were all installed at the same time in the same room are unlikely to fail at exactly the same time or at exactly 10,000 hours. Variability in the measured product response as well as the time of operation is expected. In fact, product reliability assessment is often associated with the estimation and measurement of this variability.

The accuracy of a reliability estimate at a given time is improved by increasing the sample size $n_o$. The requirement of a large sample is analogous to the conditions required in experimental measurements of probability associated with coin tossing and dice rolling. This implies that the estimates given by Equations 2.2 and 2.3 approach actual values for $R(t)$ and $Q(t)$ as the sample size becomes infinitely large. Thus, the practical meaning of reliability and unreliability is that, in a large number of repetitions, the proportional frequency of occurrence of success or failure will approximately equal the $\hat{R}(t)$ and $\hat{Q}(t)$ estimates, respectively.

The response values for a series of measurements on a certain product parameter can be plotted as a histogram in order to assess the variability. For example, Table 2.1 lists a series of time-to-failure results for 251 samples tested in 11 different groups. These data are summarized as a frequency table in the first two columns of Table 2.2 and a histogram is created from those two columns in Figure 2.1. In the histogram,
Figure 2.1 Frequency histogram or life characteristic curve for data from Table 2.2. [Vertical axis: Number of Failures; horizontal axis: Operating Time (hours), in 10-hour intervals from 0–10 to over 110.]
Table 2.1 Measured Time-to-Failure (Hours) Data for 251 Samples

Series 1: 1 1 2 3 4 5 6 6 8 9 11 13 16 18 20 25 28 32 36 46 58 79 117
Series 2: 1 1 3 3 4 5 6 6 8 9 12 14 16 18 20 25 28 32 37 47 59 83 120
Series 3: 1 2 2 3 4 5 6 7 8 9 12 14 16 18 20 26 29 33 38 48 62 85 125
Series 4: 1 2 2 3 4 5 6 7 8 9 12 14 16 18 21 26 29 33 39 49 64 89 126
Series 5: 1 2 3 3 4 5 6 7 8 10 12 14 17 18 21 27 29 34 41 49 65 93 131
Series 6: 1 2 3 3 4 5 6 7 8 10 12 15 17 18 22 27 29 34 41 51 66 97 131
Series 7: 1 2 3 3 4 5 6 7 8 11 12 15 17 19 22 27 29 35 42 52 67 99 137
Series 8: 1 2 3 4 4 5 6 7 8 11 13 15 17 19 23 28 29 35 42 53 69 105 140
Series 9: 1 2 3 4 4 5 6 7 9 11 13 15 17 19 23 28 30 36 43 54 72 107 142
Series 10: 1 2 3 4 4 5 6 7 9 11 13 15 18 19 24 28 31 36 44 55 76 111
Series 11: 1 2 3 4 4 5 6 7 9 11 13 15 18 20 24 28 31 36 45 56 78 115
Table 2.2 Grouped and Analyzed Data from Table 2.1

Operating Time (hours) | Number of Failures (Δn_f) | Surviving Products (n_s) | Probability Density Function f(t) | Reliability R (n_o = 251) | Average Hazard Rate Estimate (Δt = 10)
0–10 | 105 | 146 | 0.418 | 0.58 | 0.04
11–20 | 52 | 94 | 0.207 | 0.372 | 0.03
21–30 | 28 | 66 | 0.112 | 0.26 | 0.03
31–40 | 17 | 49 | 0.068 | 0.192 | 0.02
41–50 | 12 | 37 | 0.048 | 0.144 | 0.02
51–60 | 8 | 29 | 0.032 | 0.112 | 0.02
61–70 | 6 | 23 | 0.024 | 0.088 | 0.02
71–80 | 4 | 19 | 0.016 | 0.072 | 0.01
81–90 | 3 | 16 | 0.012 | 0.06 | 0.01
91–100 | 3 | 14 | 0.012 | 0.052 | 0.02
101–110 | 2 | 12 | 0.008 | 0.044 | 0.01
Over 110 | 10 | 0 | 0.043 | 0 | -
Figure 2.2 Reliability histogram of data from Table 2.1. [Vertical axis: Reliability, from 0 to 1; horizontal axis: Operating Time (hours), in 10-hour intervals from 0–10 to over 110.]
each rectangular bar represents the number of failures in the interval. This histogram represents the life characteristic curve for the product. The ratios of the number of surviving products to the total number of products (i.e., the reliability at the end of each interval) are calculated in the fifth column of Table 2.2 and are plotted as a histogram in Figure 2.2. As the sample size increases, the intervals of the histogram can be reduced and the plot will approach a smooth curve. For some continuous time-to-failure data, the rectangles are replaced by ordinates to obtain a smooth hazard rate curve.
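The grouping step that turns raw times to failure into a frequency table can be sketched in code. The example below uses only Series 1 of Table 2.1 (23 samples) rather than the full data set, and the helper name `interval_index` is hypothetical. It assumes whole-hour failure times and the interval boundaries used in Table 2.2.

```python
# Series 1 from Table 2.1 (times to failure, in hours).
series_1 = [1, 1, 2, 3, 4, 5, 6, 6, 8, 9, 11, 13, 16, 18, 20,
            25, 28, 32, 36, 46, 58, 79, 117]


def interval_index(hours, width=10, n_bins=12):
    """Map a whole-hour failure time to its interval:
    0-10, 11-20, ..., 101-110, and a final 'over 110' bin."""
    if hours <= width:
        return 0
    return min((hours - 1) // width, n_bins - 1)


# Number of failures in each interval (the frequency table).
counts = [0] * 12
for t in series_1:
    counts[interval_index(t)] += 1

# Survivors and reliability estimate (Equation 2.3) at the end of each interval.
n_o = len(series_1)
survivors = []
remaining = n_o
for c in counts:
    remaining -= c
    survivors.append(remaining)
reliability = [s / n_o for s in survivors]

print(counts)
print(survivors)
print([round(r, 3) for r in reliability])
```

Because only one series is used, the counts are much smaller than those in Table 2.2; the same loop applied to all eleven series would produce the full frequency table.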
2.3 PROBABILITY DENSITY FUNCTION

The ratio of the number of product failures in an interval to the total number of products gives an estimate of the probability density function corresponding to the interval. For the data in Table 2.1, the estimate of the probability density function for each interval is evaluated in the fourth column of Table 2.2. Figure 2.3 shows the estimate of the probability density function for the data of Table 2.1. The sum of all possible values equals unity (values in column four of Table 2.2). The probability density function is given by:

$$f(t) = \frac{1}{n_o}\frac{d[n_f(t)]}{dt} = \frac{d[Q(t)]}{dt} \tag{2.4}$$

Integrating both sides of this equation gives the relation for the unreliability in terms of $f(t)$:

$$Q(t) = \frac{n_f(t)}{n_o} = \int_0^t f(\tau)\,d\tau \tag{2.5}$$
Figure 2.3 Probability density function for data in Table 2.1. [Vertical axis: f(t), from 0 to 0.45; horizontal axis: Operating Time (hours), in 10-hour intervals from 0–10 to over 110.]
where the integral is the probability that a product will fail in the time interval $0 \le \tau \le t$. The integral in Equation 2.5 is the area under the probability density function curve to the left of the time line at some time $t$. The unreliability is also called the cumulative failure probability distribution function for a continuous random variable. Similarly, the percentage of products that have not failed up to time $t$ is represented by the area under the curve to the right of $t$ by

$$R(t) = \int_t^\infty f(\tau)\,d\tau \tag{2.6}$$

Because the total probability of failures must equal one at the end of life for a population, the function $f(t)$ is appropriately normalized. That is,

$$\int_0^\infty f(t)\,dt = 1 \tag{2.7}$$

EXAMPLE 2.2
From the histogram of Figure 2.3, (a) Calculate the unreliability of the product at a time of 30 hours. (b) Calculate the reliability.

Solution: For the discrete data represented in this histogram, the unreliability is the sum of the failure probability density function values from $t = 0$ to $t = 30$. This sum, as a percentage, is 74%. The reliability is the sum of the mass function values from $t = 30$ to $t = \infty$ and is equal to 26%. The sum of reliability and unreliability must always be equal to 100%.
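The sums in Example 2.2 can be reproduced from the probability density column of Table 2.2. The following sketch uses the values as printed in that column; the function names and the use of interval upper edges are illustrative choices, not part of the original text.

```python
# Probability density estimates as printed in the fourth column of Table 2.2.
f_hat = [0.418, 0.207, 0.112, 0.068, 0.048, 0.032,
         0.024, 0.016, 0.012, 0.012, 0.008, 0.043]

# Upper edge (hours) of each interval; the last interval is "over 110".
upper_edges = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, float("inf")]


def unreliability(t):
    """Discrete analogue of Equation 2.5: sum of f over intervals ending by t."""
    return sum(f for f, edge in zip(f_hat, upper_edges) if edge <= t)


def reliability(t):
    """Discrete analogue of Equation 2.6: sum of f over intervals ending after t."""
    return sum(f for f, edge in zip(f_hat, upper_edges) if edge > t)


q_30 = unreliability(30)
r_30 = reliability(30)
print(f"Q(30) = {q_30:.3f}, R(30) = {r_30:.3f}, total = {q_30 + r_30:.3f}")
```

Rounded to whole percentages, these sums correspond to the 74% and 26% quoted in the example.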
2.4 HAZARD RATE

For a nonrepairable product, for a given period (from $t_i$ to $t_i + \Delta t$), the hazard rate is given by the conditional probability that the product will fail in this period, given that the product has survived up to time $t_i$. That is,

$$h(t) = P(t_i \le T \le t_i + \Delta t \mid T \ge t_i) \tag{2.8}$$

Assuming that the product has survived up to time $t_i$, an estimate of the average hazard rate over the time interval $\Delta t$, $\hat{h}$, is mathematically expressed as

$$\hat{h} = \frac{\Delta n_f}{n_{bp}(t_i)}\,\frac{1}{\Delta t} \tag{2.9}$$
where $n_{bp}(t_i)$ is the number of products monitored or tested that have not failed at the beginning of the time period, $\Delta n_f$ is the number of failures in the sampling period, and $\Delta t$ is the sampling interval. As $n_{bp}$ becomes large and as the sampling interval goes to zero, the average hazard rate estimate approaches the instantaneous hazard rate (or just hazard rate) at time $t$. That is, in the limit as $\Delta t$ goes to zero, Equation 2.9 becomes

$$h(t) = \frac{1}{n_s(t)}\frac{d\,n_f(t)}{dt} \tag{2.10}$$
The hazard rate, $h(t)$, is the number of failures per unit time per number of nonfailed products left. It is thus a relative rate of failure, in that it does not depend on sample size. From Equations 2.2, 2.3, and 2.10, a relation for the hazard rate in terms of the reliability is

$$h(t) = -\frac{1}{R(t)}\frac{dR(t)}{dt} \tag{2.11}$$
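The step from Equation 2.10 to Equation 2.11 can be written out explicitly. The following is a sketch that treats the estimates in Equations 2.2 and 2.3 as exact, an assumption that holds only in the large-sample limit:

$$
\begin{aligned}
n_f(t) &= n_o\,Q(t) = n_o\,[1 - R(t)], \qquad n_s(t) = n_o\,R(t) \\
\frac{d\,n_f(t)}{dt} &= -\,n_o\,\frac{dR(t)}{dt} \\
h(t) &= \frac{1}{n_s(t)}\,\frac{d\,n_f(t)}{dt}
      = \frac{1}{n_o\,R(t)}\left(-\,n_o\,\frac{dR(t)}{dt}\right)
      = -\frac{1}{R(t)}\,\frac{dR(t)}{dt}
\end{aligned}
$$

The minus sign reflects the fact that $R(t)$ can only decrease with time, so the hazard rate is nonnegative.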
Integrating Equation 2.11 over an operating time from 0 to $t$, noting that $R(t = 0) = 1$, and taking the exponential of each side gives

$$R(t) = \exp\left(-\int_0^t h(\tau)\,d\tau\right) \tag{2.12}$$

This is the fundamental equation of reliability expressed in terms of the hazard rate. The hazard rate can also be expressed as the ratio of the failure probability density function to the reliability by combining Equations 2.4 and 2.11 with Equation 2.3:

$$h(t) = \frac{f(t)}{R(t)} \tag{2.13}$$
Using the data from Table 2.1 and Equation 2.9, an estimate (over Δt) of the hazard rate is calculated in the last column of Table 2.2. Figure 2.4 is the histogram of hazard rate versus time.
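The average hazard rate estimate of Equation 2.9 can be sketched as a short loop over the interval failure counts. The example below uses the first three failure counts from Table 2.2 with $n_o = 251$ and $\Delta t = 10$ hours; the function name is hypothetical, and the values are left unrounded.

```python
def average_hazard_rates(failures_per_interval, n_o, dt):
    """Equation 2.9 applied interval by interval.

    n_bp is the number of products that have not failed at the beginning
    of each interval; it starts at n_o and drops by the failures observed.
    """
    rates = []
    n_bp = n_o
    for dn_f in failures_per_interval:
        if n_bp == 0:
            break  # no products left at risk, so the estimate is undefined
        rates.append(dn_f / (n_bp * dt))
        n_bp -= dn_f
    return rates


# First three intervals of Table 2.2: 0-10, 11-20, and 21-30 hours.
first_three = [105, 52, 28]
rates = average_hazard_rates(first_three, n_o=251, dt=10)

for label, h in zip(["0-10", "11-20", "21-30"], rates):
    print(f"{label:>6} h: {h:.4f} failures per hour per surviving product")
```

Dividing by the number of products still at risk, rather than by the original sample size, is what distinguishes the hazard rate from the probability density estimate of Section 2.3.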
Figure 2.4 Hazard rate histogram of data from Table 2.1. [Vertical axis: Average Hazard Rate, from 0 to 0.045; horizontal axis: Operating Time (hours), in 10-hour intervals from 0–10 to over 110.]
2.5 CONDITIONAL RELIABILITY

The conditional reliability function $R(t, T)$ is defined as the probability of operating for a time interval $t$, given that the nonrepairable system has operated for a time $T$ prior to the beginning of the interval. The conditional reliability can be expressed as the ratio of the reliability at time $(t + T)$ to the reliability at an operating duration $T$, where $T$ is the “age” of the system at the beginning of a new test or mission. That is,

$$R(t, T) = \frac{R(t + T)}{R(T)} \tag{2.14}$$

For a product with a decreasing hazard rate, the conditional reliability will increase as the age $T$ increases. The conditional reliability will decrease for a product with an increasing hazard rate. The conditional reliability of a product with a constant rate of failure is independent of $T$; that is, the reliability for a mission time $t$ is independent of previous mission times. This suggests that a product with a constant failure rate can be treated as “as good as new” at any time.

EXAMPLE 2.3
The reliability function for a system is assumed to be an exponential distribution and is given by

$$R(t) = e^{-\lambda_0 t}$$

where $\lambda_0$ is a constant (i.e., a constant hazard rate). Calculate the reliability of the system for mission time $t$, given that the system has already been used for 10 years.

Solution: Using Equation 2.14,

$$R(t, 10) = \frac{R(t + 10)}{R(10)} = \frac{e^{-\lambda_0 (t + 10)}}{e^{-\lambda_0 \cdot 10}} = e^{-\lambda_0 t} = R(t)$$

That is, the system reliability is “as good as new,” regardless of the age of the system.
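The age independence shown in Example 2.3 can be checked numerically, and contrasted with a case in which the hazard rate increases. In this sketch the value of `lam0`, the wear-out reliability function, and its 20-year scale are all hypothetical choices made for illustration; only Equation 2.14 comes from the text.

```python
import math


def conditional_reliability(R, t, T):
    """Equation 2.14: probability of surviving a further time t at age T."""
    return R(t + T) / R(T)


lam0 = 0.05  # hypothetical constant hazard rate, per year


def r_exp(t):
    """Exponential reliability function of Example 2.3."""
    return math.exp(-lam0 * t)


def r_wearout(t):
    """Hypothetical wear-out case: hazard rate 2t/400 increases with age."""
    return math.exp(-(t / 20.0) ** 2)


mission = 5.0   # mission time, years
age = 10.0      # prior operating time, years

print(f"exponential: new {r_exp(mission):.4f}, "
      f"aged {conditional_reliability(r_exp, mission, age):.4f}")
print(f"wear-out:    new {r_wearout(mission):.4f}, "
      f"aged {conditional_reliability(r_wearout, mission, age):.4f}")
```

For the exponential case the two figures agree, while for the wear-out case the aged system is less likely to complete the same mission than a new one.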
2.6 TIME TO FAILURE

The median $M$ of the probability distribution is the time at which the area under the distribution is divided in half (i.e., it is the time to reach 50% reliability). That is,

$$\int_0^M f(t)\,dt = 0.5 \tag{2.15}$$

Because $M$ appears as a limit of integration, determining an explicit relation for $M$ can be difficult. As a result, the mean is often the preferred metric. The mean time to failure (MTTF), defined as the expected value of the failure probability density function, is one such parameter:

$$\text{MTTF} = \int_0^\infty t f(t)\,dt \tag{2.16}$$

It can also be shown that Equation 2.16 is equivalent to

$$\text{MTTF} = \int_0^\infty R(t)\,dt \tag{2.17}$$

The MTTF should be used only when the failure distribution function is specified because the value of the reliability function at a given MTTF depends on the probability distribution function used to model the failure data. In fact, different failure distributions can have the same MTTF while having different reliabilities. The first failures that occur in a product or system often have the biggest impact on safety, warranty, and supportability, and consequently on the profitability of the product. Thus, the beginning of the failure distribution is an important concern in reliability. The time to the first 1% or 5% of the failures is often estimated for that reason.
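The point that two distributions can share an MTTF yet differ in reliability can be illustrated numerically with Equation 2.17. The sketch below compares an exponential distribution with a uniform distribution of failure times; both are hypothetical choices with the MTTF set to 100 hours, and the simple trapezoidal integrator and the truncation of the exponential tail at 50 times the MTTF are assumptions of the example.

```python
import math


def integrate(func, a, b, n=100_000):
    """Composite trapezoidal rule on [a, b] with n equal steps."""
    h = (b - a) / n
    total = 0.5 * (func(a) + func(b))
    for i in range(1, n):
        total += func(a + i * h)
    return total * h


mttf = 100.0  # hours, hypothetical


def r_exponential(t):
    """Reliability for a constant hazard rate of 1/mttf."""
    return math.exp(-t / mttf)


def r_uniform(t):
    """Reliability when failure times are uniform on [0, 2 * mttf]."""
    return max(0.0, 1.0 - t / (2.0 * mttf))


# Equation 2.17: MTTF is the area under the reliability curve.
mttf_exp = integrate(r_exponential, 0.0, 50.0 * mttf)  # tail beyond this is negligible
mttf_uni = integrate(r_uniform, 0.0, 2.0 * mttf)

print(f"MTTF: exponential {mttf_exp:.3f} h, uniform {mttf_uni:.3f} h")
print(f"R(MTTF): exponential {r_exponential(mttf):.3f}, "
      f"uniform {r_uniform(mttf):.3f}")
```

Both integrals give essentially the same MTTF, but the reliability evaluated at that time differs between the two models, which is why the MTTF alone does not characterize reliability.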
HOMEWORK PROBLEMS

Problem 2.1 Following the format of Table 2.1, record and calculate the different reliability metrics after bending 30 paper clips 90° back and forth to failure. Plot the life characteristics curve, the estimate of the probability density function, the reliability and unreliability, and the hazard rate. Do you think your results depend on the amount of bend of the paper clip? Explain.

Problem 2.2 Prove that

$$\text{MTTF} = \int_0^\infty t f(t)\,dt = \int_0^\infty R(t)\,dt.$$
Problem 2.3 Calculate the MTTF for a failure probability density function given by

$$f(x) = \frac{\mu^{x} e^{-\mu}}{x!}; \quad x = 0, 1, 2, 3, \ldots$$

Problem 2.4 Calculate the MTTF for a failure probability density function given by

$$f(t) = \begin{cases} 0; & \text{for } t < t_1 \\ \dfrac{1}{t_2 - t_1}; & \text{for } t_1 \le t \le t_2 \\ 0; & \text{for } t > t_2 \end{cases}$$

Problem 2.5 Assume that the system in Example 2.3 is a car. Do the results in the example make sense? Why or why not? Provide some examples of systems where the results may be more appropriate.

Problem 2.6 Given the following:

$$\hat{f} = \frac{1}{n_o}\frac{\Delta n_f}{\Delta t}$$

$$f(t) = \frac{1}{n_o}\frac{d[n_f(t)]}{dt} = \frac{d[Q(t)]}{dt}$$

$$Q(t) = \int_0^t f(\tau)\,d\tau, \qquad R(t) = \int_t^\infty f(\tau)\,d\tau$$

prove

$$\int_0^\infty f(t)\,dt = 1$$

Problem 2.7 Hazard rate: Given the following:

$$\hat{h} = \frac{1}{n_{bp}(t)}\frac{\Delta n_f}{\Delta t}$$

$$h(t) = \frac{1}{n_s(t)}\frac{d[n_f(t)]}{dt} = -\frac{1}{R(t)}\frac{d[R(t)]}{dt}$$

$$R(t) = e^{-\int_0^t h(t)\,dt}, \qquad h(t) = \frac{f(t)}{R(t)}$$

prove the hazard rate equation.

Problem 2.8 Discuss the paper clip (PC) failure data in terms of conditional reliability (e.g., if a PC has survived 12 cycles at 90°, what is the probability that it will survive another 5 cycles?)

$$R(t, T) = \frac{R(t + T)}{R(T)}$$

where $R(t, T)$ is the conditional probability that a product will survive for an additional time $t$, given that it has survived up to time $T$.
Problem 2.9 What does the conditional reliability reduce to, if the hazard rate is a constant? Hint:

$$R(t) = e^{-\int_0^t h(t)\,dt}$$

Problem 2.10 Mean time to failure: prove that

$$\text{MTTF} = \int_0^\infty t f(t)\,dt = \int_0^\infty R(t)\,dt$$
CHAPTER 3
Statistical Inference Concepts
Jun Ming Hu, Mark Kaminskiy, Igor A. Ushakov
CONTENTS
3.1 Introduction
3.2 Statistical Estimation
  3.2.1 Point Estimation
    3.2.1.1 Method of Moments
    3.2.1.2 Method of Maximum Likelihood
  3.2.2 Interval Estimation
3.3 Hypothesis Testing
  3.3.1 Frequency Histogram
  3.3.2 Goodness-of-Fit Tests
    3.3.2.1 The Chi-Square Test
    3.3.2.2 The Kolmogorov–Smirnov Test
    3.3.2.3 Sample Comparison
3.4 Reliability Regression Model Fitting
  3.4.1 Gauss–Markov Theorem and Linear Regression
    3.4.1.1 Regression Analysis
    3.4.1.2 The Gauss–Markov Theorem
    3.4.1.3 Multiple Linear Regression
  3.4.2 Proportional Hazard (PH) and Accelerated Life (AL) Models
    3.4.2.1 Accelerated Life (AL) Model
    3.4.2.2 Proportional Hazard (PH) Model
  3.4.3 Accelerated Life Regression for Constant Stress
  3.4.4 Accelerated Life Regression for Time-Dependent Stress
3.5 Summary
References
3.1 INTRODUCTION
Evaluation in probabilistic terms of the real reliability of products is possible only on the basis of real statistical data obtained from special experiments or practical use. Adequately determining the form or class of the distribution function requires statistical data. Such data also provide the information to confirm prior hypotheses about different probabilistic parameters of the models used in reliability analysis. Techniques for analyzing probabilistic models from observational data are embodied in statistical inference.
3.2 STATISTICAL ESTIMATION

There are two kinds of estimation: point estimation and interval estimation. Point estimation provides a single number from a set of observational data to represent a parameter or other characteristic of the underlying distribution. Point estimation does not give any information about its accuracy. Interval estimation constructs a confidence interval that includes the true value of the parameter with a specified degree of confidence.

3.2.1 Point Estimation
Estimation of a parameter is necessarily based on a set of sample values, $X_1, \ldots, X_n$. If these values are independent and their underlying distribution remains the same from one sample value to another, they yield a random sample of size $n$ from the distribution of the investigated random variable $X$. Let the distribution have a parameter $\theta$. Consider a random variable $t(X_1, \ldots, X_n)$ that is a single-valued function of $X_1, \ldots, X_n$. The random variable $t(X_1, \ldots, X_n)$ is referred to as a statistic. A point estimate is obtained by selecting an appropriate statistic and calculating its value from the sample data. The selected statistic is referred to as an estimator. An estimator, $t(X_1, \ldots, X_n)$, is said to be an unbiased estimator for $\theta$ if $E[t(X_1, \ldots, X_n)] = \theta$ for any value of $\theta$. The bias is the difference between the expected value of an estimate and the parameter value itself; the smaller the bias, the better the estimator. Another desirable property of an estimator, $t(X_1, \ldots, X_n)$, is consistency. An estimator, $t$, is said to be consistent if, for every $\varepsilon > 0$,

$$\lim_{n \to \infty} P(|t - \theta| < \varepsilon) = 1 \tag{3.1}$$

This property implies that, as sample size $n$ increases, the estimator, $t(X_1, \ldots, X_n)$, gets closer to the true value of $\theta$. In some situations, several unbiased estimators can be found. Selecting the best among the unbiased estimators involves choosing the one with the least variance. An unbiased estimator, $t$, of $\theta$, with minimum variance among all unbiased estimators of $\theta$, is called efficient.
Another useful estimation property is sufficiency. An estimator, $t(X_1, \ldots, X_n)$, is said to be a sufficient statistic of the parameter $\theta$ if it contains all the information about $\theta$ that is in the sample $X_1, \ldots, X_n$. Several common methods for point estimation are briefly introduced next.

3.2.1.1 Method of Moments

The method of moments is used to estimate unknown parameters of the distribution function (d.f.) on the basis of empirically estimated moments of the random variable. The sample moments are equated to the corresponding distribution moments, and the solutions of the resulting equations provide the estimators of the distribution parameters. For example, because the mean and variance are the expected values of $X$ and of $(X - \mu)^2$, respectively, the sample mean and sample variance can be defined as the corresponding averages over a sample of size $n$, namely $X_1, \ldots, X_n$, as follows:

$$\bar X = \frac{1}{n} \sum_{i=1}^{n} X_i \tag{3.2}$$

and

$$S^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar X)^2 \tag{3.3}$$

$\bar X$ and $S^2$ are the point estimates of the distribution mean, $\mu$, and variance, $\sigma^2$, respectively. The estimator of variance in Equation 3.3 is biased; however, this bias can be removed by multiplying it by $n/(n - 1)$:

$$S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar X)^2 \tag{3.4}$$

It can be shown that this is the unbiased estimator of variance. Comparison of Equations 3.3 and 3.4 shows that there is little difference between the two estimates for large sample sizes.

EXAMPLE 3.1 The life of a device, $T$, is modeled as a random variable with the exponential distribution

$$f(t) = \lambda e^{-\lambda t} \tag{3.5}$$
The times to failure for accelerated life tests are 22, 24, 31, 41, 52, 63, and 70 hours. To determine the parameter, $\lambda$, of the distribution, the test data are considered as the sample of $t$, with a sample size of seven. Because the exponential distribution is a one-parameter distribution, only the first moment is used:

$$\bar t = \frac{1}{n} \sum_{i=1}^{n} t_i = \frac{1}{7} \sum_{i=1}^{7} t_i = 43.3 \text{ (hours)} \tag{3.6}$$

This yields an estimator of the mean value. The relationship between the mean value and parameter $\lambda$ is

$$\frac{1}{\lambda} = \int_0^\infty t\, \lambda e^{-\lambda t}\, dt \tag{3.7}$$

Therefore, an estimator of $\lambda$ is $\hat\lambda = 1/\bar t = 0.0231$ (1/hours).
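The calculation in Example 3.1 can be reproduced with a short script. The following is a minimal sketch using only the Python standard library; the function name `exponential_rate_by_moments` is an illustrative choice, not something defined in the text. It also evaluates the two variance estimators of Equations 3.3 and 3.4 on the same data so that the effect of the n/(n − 1) correction can be seen for a small sample.

```python
from statistics import mean, pvariance, variance

# Failure times (hours) from Example 3.1.
times = [22, 24, 31, 41, 52, 63, 70]


def exponential_rate_by_moments(sample):
    """Equate the sample mean to the exponential mean 1/lambda (Eqs. 3.6-3.7)."""
    return 1.0 / mean(sample)


t_bar = mean(times)
lam_hat = exponential_rate_by_moments(times)
s2_biased = pvariance(times)    # Eq. 3.3: divides by n
s2_unbiased = variance(times)   # Eq. 3.4: divides by n - 1

print(f"sample mean     = {t_bar:.1f} hours")
print(f"lambda estimate = {lam_hat:.4f} per hour")
print(f"S^2 (Eq. 3.3)   = {s2_biased:.1f} hours^2")
print(f"S^2 (Eq. 3.4)   = {s2_unbiased:.1f} hours^2")
```

For a sample of seven, the two variance estimates differ by the factor 7/6; the gap shrinks as the sample size grows.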
3.2.1.2 Method of Maximum Likelihood

The method of maximum likelihood is one of the most popular methods of estimation. Consider a random variable, $X$, with density function $f(x, \theta_0)$, where $\theta_0$ is the parameter. Using the method of maximum likelihood, one tries to find the value of $\theta_0$ that has the highest probability (or probability density) of producing the particular set of measurements, $X_1, \ldots, X_n$. The likelihood of obtaining this particular set of sample values is proportional to the density function $f(x, \theta_0)$ evaluated at the sample points $X_1, \ldots, X_n$. The likelihood function is introduced as

$$L_f(X_1, \ldots, X_n; \theta_0) = f(x_1, \theta_0)\, f(x_2, \theta_0) \cdots f(x_n, \theta_0) \tag{3.8}$$

The definition of the likelihood function is based on the probability (for a discrete random variable) or the density (for a continuous random variable) of the joint occurrence of $n$ events, $X = X_1, \ldots, X = X_n$. The maximum likelihood estimate, $\hat\theta_0$, is the value of $\theta_0$ that maximizes the likelihood function, $L_f(X_1, \ldots, X_n; \theta_0)$, with respect to $\theta_0$. The usual procedure for maximization with respect to a parameter is to calculate the derivative with respect to this parameter and equate it to zero. This yields

$$\frac{\partial L_f(X_1, \ldots, X_n; \theta_0)}{\partial \theta_0} = 0 \tag{3.9}$$

The solution of the preceding equation for $\theta_0$ gives $\hat\theta_0$, if it can be shown that $\hat\theta_0$ does indeed maximize $L_f(X_1, \ldots, X_n; \theta_0)$. Because of the multiplicative nature of the likelihood function, it is frequently more convenient to maximize the logarithm of the likelihood function instead; that is,

$$\frac{\partial \log L_f(X_1, \ldots, X_n; \theta_0)}{\partial \theta_0} = 0 \tag{3.10}$$

The solution for $\hat\theta_0$ from this equation is the same as the one from Equation 3.9. For a density function with two or more parameters, the likelihood function becomes

$$L_f(X_1, \ldots, X_n; \theta_1, \ldots, \theta_m) = \prod_{i=1}^{n} f(X_i; \theta_1, \ldots, \theta_m) \tag{3.11}$$

where $\theta_1, \ldots, \theta_m$ are the $m$ parameters to be estimated. In this case, the maximum likelihood estimators can be obtained by solving the following $m$ equations:

$$\frac{\partial L_f(X_1, \ldots, X_n; \theta_1, \ldots, \theta_m)}{\partial \theta_j} = 0, \qquad j = 1, \ldots, m \tag{3.12}$$
Under some general conditions, the maximum likelihood estimates are consistent, asymptotically normal, and asymptotically efficient. As an illustration, consider estimating the parameter $p$ of the binomial distribution. In this case,

$$L_f(m \mid n) = \binom{n}{m} p^{m} (1 - p)^{n - m}, \qquad m = 0, 1, \ldots, n \tag{3.13}$$

Then,

$$\frac{\partial \log L_f}{\partial p} = \frac{n}{p(1 - p)} \left( \frac{m}{n} - p \right) \tag{3.14}$$
It follows, therefore, that the maximum likelihood estimator of $p$ is $\hat p = m/n$. It can also be easily checked that the sample mean is the maximum likelihood estimator of the normal distribution mean.

EXAMPLE 3.2 For the life-test data of the device given in Example 3.1, estimate the parameter of the distribution, using the method of maximum likelihood. The likelihood function for this problem is

$$L_f(t_1, \ldots, t_7; \lambda) = \prod_{i=1}^{7} f(t_i, \lambda) = \lambda^{7} e^{-\lambda \sum_{i=1}^{7} t_i} \tag{3.15}$$

Taking the derivative yields

$$\frac{dL_f(t_1, \ldots, t_7; \lambda)}{d\lambda} = \left[ 7\lambda^{6} - \lambda^{7} \sum_{i=1}^{7} t_i \right] e^{-\lambda \sum_{i=1}^{7} t_i} = 0 \tag{3.16}$$

Solving this equation gives

$$\hat\lambda = \frac{7}{\sum_{i=1}^{7} t_i} = 0.0231 \tag{3.17}$$

where $\hat\lambda$ is the estimator of the parameter $\lambda$. In this example, the estimates by the method of moments and the method of maximum likelihood coincide.
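When the likelihood equation has no closed-form solution, the maximum is usually located numerically. The sketch below illustrates the idea on the exponential case of Example 3.2, where the closed-form answer is available for comparison. The golden-section search, the helper names, and the search interval [1e-4, 1.0] are assumptions made for this illustration and are not part of the text's derivation; the search relies on the log-likelihood being unimodal in λ, which holds here because it is concave.

```python
import math

# Failure times (hours) from Example 3.1.
times = [22, 24, 31, 41, 52, 63, 70]


def log_likelihood(lam, sample):
    """Logarithm of Eq. 3.15: n*log(lambda) - lambda * sum(t_i)."""
    return len(sample) * math.log(lam) - lam * sum(sample)


def maximize_unimodal(func, lo, hi, tol=1e-10):
    """Golden-section search for the maximizer of a unimodal function on [lo, hi]."""
    ratio = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if func(c) < func(d):
            a = c   # the maximum lies in [c, b]
        else:
            b = d   # the maximum lies in [a, d]
    return (a + b) / 2


lam_numeric = maximize_unimodal(lambda lam: log_likelihood(lam, times), 1e-4, 1.0)
lam_closed = len(times) / sum(times)   # Eq. 3.17

print(f"numerical estimate   = {lam_numeric:.4f} per hour")
print(f"closed-form estimate = {lam_closed:.4f} per hour")
```

Maximizing the logarithm rather than the likelihood itself avoids the very small products that arise when many density values are multiplied together.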
Another popular estimation method is the least squares method; it will be considered in Section 3.4.1.
3.2.2 Interval Estimation

Let $L(X_1, \ldots, X_n)$ and $U(X_1, \ldots, X_n)$ be two statistics such that the probability that parameter $\theta_0$ lies in the interval between them is

$$P\{L(X_1, \ldots, X_n) < \theta_0 < U(X_1, \ldots, X_n)\} = 1 - \alpha \tag{3.18}$$

The random interval $[L, U]$ is called a $100(1 - \alpha)\%$ confidence interval for the parameter $\theta_0$. The endpoints $L$ and $U$ are referred to as the $100(1 - \alpha)\%$ confidence limits of $\theta_0$; $(1 - \alpha)$ is called the confidence coefficient. The most commonly used values for $\alpha$ are 0.10, 0.05, and 0.01. If $\theta_0 > L$ ($\theta_0 < U$) with probability equal to 1, then $U$ ($L$) is the one-sided upper (lower) confidence limit or confidence bound for $\theta_0$. A $100(1 - \alpha)\%$ confidence interval for an unknown parameter, $\theta_0$, is interpreted as follows: If a series of repetitive experiments yields random samples from the same distribution and the confidence interval for each sample is calculated, then $100(1 - \alpha)\%$ of the constructed intervals will, in the long run, contain the true value of $\theta_0$. The following example illustrates the common principle of confidence limits construction. (The other procedures of interval estimation widely used in reliability data analysis are considered in Section 3.4.4.)

Consider the procedure for constructing confidence intervals for the mean of a normal distribution with known variance. Let $X_1, X_2, \ldots, X_n$ be a random sample from the normal distribution, $N(\mu, \sigma^2)$, in which $\mu$ is an unknown parameter and $\sigma^2$ is assumed to be known. It is easy to show that the sample mean has the normal distribution $N(\mu, \sigma^2/n)$. Thus, $(\bar X - \mu)\sqrt{n}/\sigma$ has the standard normal distribution. This means that

$$P\left( -z_{1-\alpha/2} \le \frac{\bar X - \mu}{\sigma/\sqrt{n}} \le z_{1-\alpha/2} \right) = 1 - \alpha \tag{3.19}$$

where $z_{1-\alpha/2}$ is the $100(1 - \alpha/2)$th percentile of the standard normal distribution $N(0, 1)$. Solving the inequalities inside the parentheses, Equation 3.19 can be rewritten as

$$P\left( \bar X - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar X + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \right) = 1 - \alpha \tag{3.20}$$

Thus, the symmetric $(1 - \alpha)$ confidence interval for the mean, $\mu$, of a normal distribution with known $\sigma^2$ is

$$\left[ \bar X - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}},\; \bar X + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \right] \tag{3.21}$$

The confidence interval is wider for a higher confidence level $(1 - \alpha)$. As $\sigma$ decreases or as $n$ increases, the confidence interval becomes narrower for the same confidence level $(1 - \alpha)$.
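The interval of Equation 3.21 is straightforward to compute once the percentile $z_{1-\alpha/2}$ is available. The following is a minimal sketch using the Python standard library; the function name `normal_mean_ci` and the numerical inputs (a sample mean of 100, a known σ of 10, and n = 25) are illustrative assumptions and do not come from the text.

```python
from math import sqrt
from statistics import NormalDist


def normal_mean_ci(sample_mean, sigma, n, alpha=0.05):
    """Two-sided 100(1 - alpha)% interval of Eq. 3.21 (sigma assumed known)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{1 - alpha/2}
    half_width = z * sigma / sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


# Illustrative inputs only (not taken from the text).
lower, upper = normal_mean_ci(sample_mean=100.0, sigma=10.0, n=25, alpha=0.05)
print(f"95% interval: [{lower:.2f}, {upper:.2f}]")   # 95% interval: [96.08, 103.92]
```

Calling the function again with alpha=0.01, or with a larger n, shows the behavior described above: a higher confidence level widens the interval, and a larger sample narrows it.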
3.3 HYPOTHESIS TESTING

Researchers often need to determine a probability distribution based on available observational data. A histogram for a set of data can yield an idea about the distribution model when visually compared with several hypothesized density functions. Certain statistical tests, known as goodness-of-fit tests, can reject or accept an assumed probability distribution determined empirically or developed theoretically on the basis of prior assumptions. When two or more distributions appear to be plausible probability distribution models, such tests can determine the relative degree of validity of the different distributions. This section illustrates how a distribution can be studied by using a frequency histogram and verifying the result with goodness-of-fit tests. Two of the most commonly used goodness-of-fit tests, the chi-square ($\chi^2$) and Kolmogorov–Smirnov (K–S) tests, are further discussed.

3.3.1 Frequency Histogram
The frequency histogram is a graphic, empirical description of the variability of a random variable. For a specific set of experimental data, a corresponding histogram is constructed as follows:
• From the observed experimental data, select a range sufficient to include the largest and smallest data values.
• Divide this range into intervals of equal length, $\Delta x$ (the lengths can sometimes differ to emphasize particular areas of the data domain).
• Count the number of measurements within each interval and draw vertical bars with heights representing the number of observations in that interval (when the intervals differ in length, the number of measurements must be related to the length of $\Delta x$).

Alternatively, the heights of the bars can be determined as the ratio of the fraction of measurements in the interval (relative frequency) to the length of the interval on the horizontal axis. That is,

$$f_n = \frac{N_{x,\,x+\Delta x} / n}{\Delta x} \tag{3.22}$$
where $N_{x,\,x+\Delta x}$ is the number of measurements in the interval $(x, x + \Delta x)$ and $n$ is the total number of measurements (the sample size). The frequency histogram can be used as an empirical frequency distribution for comparison with the theoretical density function. If the theoretical density function of an assumed distribution has the same general shape as the frequency histogram and the theoretical curve is close to the tops of the bars in the frequency histogram, this distribution might model the phenomenon. Figure 3.1 shows an example of a frequency histogram and density function of life results for a type of monolithic microwave integrated circuit (MMIC) device for which the life distribution of the MMIC devices is assumed to be normal.

Figure 3.1 Frequency diagram and probability density of the life distribution of MMIC devices (MMIC type A, 175°C; probability density versus failure time in hours, showing the test data and the fitting curve).

3.3.2 Goodness-of-Fit Tests

When an assumed theoretical distribution is used to model a random variable, based perhaps on the general shape of the frequency histogram, there is no quantitative measure of how well the data fit the model. A goodness-of-fit test provides a quantitative technique to disprove (or not) the assumed distribution. Two of the most commonly used tests, the chi-square and K–S tests, are discussed next.

3.3.2.1 The Chi-Square Test

Consider a sample of $N$ observed values (measurements) of a random variable. The chi-square goodness-of-fit test compares the observed frequencies, $n_1, n_2, \ldots, n_k$, of $k$ intervals of the random variable with the corresponding frequencies, $e_1, e_2, \ldots, e_k$, from an assumed theoretical distribution, $F_0(x)$. The basis for appraising the goodness of fit is the distribution of the statistic

$$\sum_{i=1}^{k} \frac{(n_i - e_i)^2}{e_i} \tag{3.23}$$
This statistic has a distribution that approaches the chi-square ($\chi^2$) distribution with $f = k - 1$ degrees of freedom as $N \to \infty$. The $\chi^2$ distribution has the following probability density function:

$$f(x) = \frac{1}{2^{f/2}\, \Gamma\!\left( \frac{f}{2} \right)}\, x^{\frac{f}{2} - 1} e^{-\frac{x}{2}}, \qquad x > 0 \tag{3.24}$$

where $f$ is the number of degrees of freedom and $\Gamma(\cdot)$ is the gamma function. The cumulative probability function of the $\chi^2$ distribution is tabulated in most statistics textbooks (see references for this chapter). If the parameters of the theoretical distribution are unknown and are estimated from the data, the preceding distribution remains valid if the number of degrees of freedom is reduced by one for every unknown parameter that must be estimated. On this basis, if an assumed distribution yields a result such that

$$\sum_{i=1}^{k} \frac{(n_i - e_i)^2}{e_i} < \chi^2_{1-\alpha,\, f} \tag{3.25}$$

then the assumed theoretical distribution is not rejected (i.e., the so-called null hypothesis $H_0$: $F(x) = F_0(x)$ is not rejected) at significance level $\alpha$. If inequality 3.25 is not satisfied, the alternative hypothesis, $H_1$: $F(x) \ne F_0(x)$, is accepted. In inequality 3.25, $\chi^2_{1-\alpha,\, f}$ is the value of the $\chi^2$ distribution with $f$ degrees of freedom corresponding to the cumulative probability $(1 - \alpha)$. When employing the $\chi^2$ goodness-of-fit test, it is recommended that at least five intervals be used ($k \ge 5$), with at least five observations per interval ($e_i \ge 5$), to obtain satisfactory results. The steps for conducting the $\chi^2$ test are as follows:
• Divide the range of data into intervals (number of intervals $\ge 5$), with the first and the last intervals infinite, and count $n_i$, the number of measurements in each interval.
• Estimate the parameters of the assumed theoretical distribution, $F_0(x)$, and calculate the theoretical quantity of data in each interval, $e_i$, as follows:

$$e_i = [F_0(x + \Delta x) - F_0(x)] \times [\text{sample size}] \tag{3.26}$$

• Calculate the statistic of Equation 3.23.
• Choose a specified significance level, $\alpha$ (generally, $1 - \alpha = 90$ or 95%), and determine the number of degrees of freedom of the $\chi^2$ distribution:

$$f = k - 1 - [\text{number of parameters of } F_0(x)] \tag{3.27}$$

• Determine $\chi^2_{1-\alpha,\, f}$ from the table and compare it with the obtained value of Equation 3.23. If inequality 3.25 is satisfied, then the assumed theoretical distribution function, $F_0(x)$, is not rejected.
EXAMPLE 3.3 Life tests of a microelectronic device give 138 failure times, grouped in seven time intervals ($k = 7$). The number of failures in each interval is listed in Table 3.1. Determine the relative goodness of fit between the normal and exponential distributions, using the $\chi^2$ test at a significance level of $\alpha = 5\%$. The given data lead to an estimate of the mean value and variance as $\bar T = 4545.3$ hours and $S^2 = 829.2$ hours², using the following formulas:

$$\bar T \approx \frac{\sum_{i=1}^{k} \frac{1}{2} (t_i + t_{i-1})\, n_i}{\sum_{i=1}^{k} n_i} \tag{3.28}$$

$$S^2 = \frac{\sum_{i=1}^{k} \left[ \frac{1}{2} (t_i + t_{i-1}) - \bar T \right]^2 n_i}{\sum_{i=1}^{k} n_i - 1} \tag{3.29}$$

Then, the cumulative distribution function for the normal distribution is

$$F_{T,N}(t) = \Phi\!\left( \frac{t - \bar T}{S} \right) = \Phi\!\left( \frac{t - 4545.3}{29.2} \right) \tag{3.30}$$

After estimating the parameter of the exponential distribution as $\hat\lambda = 1/\bar T = 0.00022$, the following cumulative distribution function results for the exponential distribution:

$$F_{T,E}(t) = 1 - e^{-\hat\lambda t} = 1 - e^{-0.00022\, t} \tag{3.31}$$

Then, the theoretical number of data in each interval, $e_i$, is calculated according to the following formulas:

$$e_{i,N} = 138\, [F_{T,N}(t_i) - F_{T,N}(t_{i-1})] \tag{3.32}$$

$$e_{i,E} = 138\, [F_{T,E}(t_i) - F_{T,E}(t_{i-1})] \tag{3.33}$$
Table 3.1 Chi-Square Testing for Life Distribution (Example 3.3)

Interval  Interval Range        Recorded        Theoretical Frequency, e_i    Σ (n_i − e_i)²/e_i (cumulative)
No.       t_{i−1}–t_i (Hours)   Frequency, n_i  Normal      Exponential       Normal      Exponential
1         2000–3000             1               0.184       54.298            0.042       52.317
2         3000–3500             11              7.314       11.659            1.900       52.354
3         3500–4000             24              15.180      6.856             7.027       95.227
4         4000–4500             33              26.496      6.203             8.621       210.980
5         4500–5000             31              32.430      5.613             8.684       325.800
6         5000–5500             22              28.014      5.079             9.090       382.180
7         5500–6500             16              22.080      6.727             9.341       394.965
The number of degrees of freedom is $f = 7 - 1 - 2 = 4$ for the normal distribution and $f = 7 - 1 - 1 = 5$ for the exponential distribution. At a significance level of 5%, $\chi^2_{95\%,4} = 9.49$ for the normal distribution and $\chi^2_{95\%,5} = 11.1$ for the exponential distribution. Comparing these values with the values of $\sum (n_i - e_i)^2 / e_i$ listed at the bottoms of the sixth and seventh columns of Table 3.1 (9.341 and 394.965, respectively), it is apparent that the normal distribution is not rejected and the exponential distribution is rejected, according to the goodness-of-fit test at a 5% significance level.

The exponential distribution does not seem attractive for these data because, for the exponential distribution, the mean is equal to the standard deviation. In the case considered, the sample mean is about 150 times greater than the sample standard deviation.
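The final comparison for the exponential distribution can be checked with a few lines of code. The sketch below is illustrative only: it takes the recorded frequencies and the exponential theoretical frequencies directly from Table 3.1, recomputes the statistic of Equation 3.23 and the degrees of freedom of Equation 3.27, and compares the result with the critical value quoted above. The function names are assumptions made for this example, and the critical value is entered by hand rather than computed.

```python
# Recorded frequencies and exponential theoretical frequencies copied from Table 3.1.
observed = [1, 11, 24, 33, 31, 22, 16]
expected_exponential = [54.298, 11.659, 6.856, 6.203, 5.613, 5.079, 6.727]


def chi_square_statistic(obs, exp):
    """Sum of (n_i - e_i)^2 / e_i over all intervals (Eq. 3.23)."""
    return sum((n - e) ** 2 / e for n, e in zip(obs, exp))


def degrees_of_freedom(num_intervals, num_estimated_params):
    """f = k - 1 - (number of estimated parameters), Eq. 3.27."""
    return num_intervals - 1 - num_estimated_params


stat = chi_square_statistic(observed, expected_exponential)
dof = degrees_of_freedom(len(observed), num_estimated_params=1)
critical = 11.1               # chi-square value for 95% and f = 5, as quoted in the text
rejected = stat >= critical   # inequality 3.25 is not satisfied

print(f"statistic = {stat:.1f}, f = {dof}, rejected = {rejected}")
```

The first interval alone contributes more than 50 to the statistic, because the exponential model predicts about 54 failures there while only one was recorded.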
3.3.2.2 The Kolmogorov–Smirnov Test

Another widely used goodness-of-fit test is the K–S test. The basic procedure involves comparing the so-called empirical (or sample) distribution function (e.d.f.) with an assumed theoretical d.f. If the maximum discrepancy is large compared with what is anticipated from a given sample size, the assumed theoretical model is rejected. Consider an uncensored sample of $n$ observed values of a random variable. The data are arranged in increasing order: $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$. Using the ordered sample data, the e.d.f., $S_n(x)$, is defined as follows:

$$S_n(x) = \begin{cases} 0, & -\infty < x < X_{(1)} \\ \dfrac{i}{n}, & X_{(i)} \le x < X_{(i+1)}, \quad i = 1, \ldots, n - 1 \\ 1, & X_{(n)} \le x < \infty \end{cases} \tag{3.34}$$

where $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ are the values of the ordered sample data (order statistics). Figure 3.2 shows a plot of $S_n(x)$ and the proposed theoretical distribution function, $F_0(x)$. The law of large numbers shows that the e.d.f. is a consistent estimator for the corresponding d.f. In the K–S test, the maximum difference between $S_n(x)$ and $F_0(x)$ over the entire range of $x$ (the test statistic) is used as the measure of the discrepancy between the theoretical model and the e.d.f. This maximum difference is denoted by

$$D_n = \max_x |F_0(x) - S_n(x)| \tag{3.35}$$
If the null hypothesis is true, the probability distribution of $D_n$ will be the same for every possible continuous $F_0(x)$. Thus, $D_n$ is a random variable whose distribution depends on the sample size, $n$, only. For a specified significance level, $\alpha$, the K–S test compares the observed maximum difference with the critical value $D_n^{\alpha}$, defined by

$$P(D_n \le D_n^{\alpha}) = 1 - \alpha \tag{3.36}$$

Critical values, $D_n^{\alpha}$, at various significance levels, $\alpha$, are tabulated in most statistics textbooks (see references for this chapter). If the observed $D_n$ is less than the critical value $D_n^{\alpha}$, the proposed distribution is not rejected.

Figure 3.2 Empirical and theoretical distribution functions ($S_n(x)$ and $F_0(x)$ plotted against $x$, with the difference $D_n$ marked).

The steps for conducting the K–S test are as follows:
• For each sample datum, calculate $S_n(x_{(i)})$ ($i = 1, \ldots, n$) according to Equation 3.34.
• Estimate the parameters of the assumed theoretical distribution, $F_0(x)$, using another sample, and calculate $F_0(x_{(i)})$ from the assumed d.f.
• Calculate the differences between $S_n(x_{(i)})$ and $F_0(x_{(i)})$ for each sample item and determine the maximum value of the differences according to Equation 3.35.
• Choose a specified significance level, $\alpha$ (generally, $1 - \alpha = 90$ to 95% for all tests), and determine $D_n^{\alpha}$ from the appropriate statistical table (Beyer 1968).
• Compare $D_n$ with $D_n^{\alpha}$; if $D_n$ is less than $D_n^{\alpha}$, the assumed theoretical d.f., $F_0(x)$, is not rejected.
EXAMPLE 3.4 The modulus of rupture of GaAs wafers was tested. The results from 11 wafers are listed in Table 3.2. Use the K–S test to determine if the data are normally distributed at a significance level of $\alpha = 5\%$.
Table 3.2 K–S Testing for Modulus of Rupture (Example 3.4)

Number i   Modulus of Rupture, X_i (MPa)   S_n(x_i)   F_0(x_i)   |F_0(x_i) − S_n(x_i)|
1          67.38                           0.091      0.200      0.109
2          69.96                           0.182      0.249      0.067
3          71.00                           0.273      0.269      0.004
4          73.22                           0.364      0.318      0.046
5          74.75                           0.455      0.352      0.103
6          75.67                           0.545      0.374      0.171
7          80.37                           0.636      0.488      0.148
8          81.64                           0.727      0.518      0.209
9          84.23                           0.818      0.583      0.235
10         85.50                           0.909      0.614      0.295
11         125.71                          1.000      0.997      0.003
The data given in Table 3.2 indicate that the sample mean is $\bar X = 80.86$ MPa and the sample standard deviation is $\sigma_x = 16.02$ MPa. The values $S_n(x_{(i)}) = i/n$ are listed in column 3 of Table 3.2. Then calculate

$$F_0(X_{(i)}) = \Phi\!\left( \frac{X_{(i)} - 80.86}{16.02} \right) \tag{3.37}$$

and $|F_0(X_{(i)}) - S_n(X_{(i)})|$ for each sample element. The results are tabulated in Table 3.2. From these results, the maximum absolute discrepancy between the two functions is $D_n = 0.295$ and occurs at $x = 85.5$ MPa. In this case, there are 11 experimental data points. Hence, the critical value of $D_n^{\alpha}$ at the 5% significance level is found to be $D_n^{0.05} = 0.40$. Because the maximum discrepancy of 0.295 is less than $D_n^{0.05}$, the assumption of a normal distribution for the GaAs modulus of rupture is not rejected at the 5% significance level.
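The calculation in Example 3.4 can be sketched in a few lines of Python. The code below is illustrative rather than a definitive implementation: the function name is an assumption, the normal parameters are estimated from the same 11 values (as the example does), and the discrepancy is evaluated only at the sample points as $|F_0(x_{(i)}) - i/n|$, which is how Table 3.2 is built. The critical value 0.40 is taken from the example and entered by hand.

```python
from statistics import NormalDist, mean, stdev

# Modulus of rupture (MPa) for the 11 GaAs wafers in Table 3.2.
rupture = [67.38, 69.96, 71.00, 73.22, 74.75, 75.67,
           80.37, 81.64, 84.23, 85.50, 125.71]


def ks_statistic_at_sample_points(sample, cdf):
    """Largest |F0(x_(i)) - i/n| over the ordered sample, as tabulated in Table 3.2."""
    ordered = sorted(sample)
    n = len(ordered)
    diffs = [abs(cdf(x) - (i + 1) / n) for i, x in enumerate(ordered)]
    d_n = max(diffs)
    return d_n, ordered[diffs.index(d_n)]


# Normal distribution with mean and standard deviation estimated from the data.
fitted = NormalDist(mean(rupture), stdev(rupture))
d_n, x_at_max = ks_statistic_at_sample_points(rupture, fitted.cdf)
critical = 0.40   # 5% critical value for n = 11, as quoted in the example

print(f"D_n = {d_n:.3f} at x = {x_at_max} MPa; rejected = {d_n >= critical}")
```

A stricter version of the statistic would also compare $F_0(x_{(i)})$ with $(i - 1)/n$, the value of the e.d.f. just below each sample point; for these data that does not change the maximum.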
3.3.2.3 Sample Comparison

In some experimental situations two or more samples must be compared. For example, the failure times of two samples of the same device, tested under two different complex stress conditions, could be compared. The problem is to determine whether the reliability of the device under one stress condition differs from the reliability under another stress condition. It may be unnecessary to know the time to failure (TTF) distribution and the specific values of the reliability function. This type of problem is related to the class of statistical tests known as nonparametric or distribution free. Consider the Mann–Whitney (Wilcoxon) test for two samples as an example of nonparametric tests. Two independent samples of measurements, $(X_1, \ldots, X_m)$ and $(Y_1, \ldots, Y_n)$, from continuous distributions with d.f. $F_X(x)$ and d.f. $F_Y(y)$, respectively, are given. Consider the following hypotheses:
• Each measurement in the first sample has the same distribution as each measurement in the second sample; that is, $F_X(x) = F_Y(x)$ (null hypothesis, $H_0$).
• There exists a constant, $\Delta$, such that each random variable $(Y_i + \Delta)$ has the same distribution as each $X_i$; that is, $F_X(x) = F_Y(x - \Delta)$ (alternative hypothesis, $H_1$).
These hypotheses are written as follows:

$$H_0: F_X(x) = F_Y(x), \qquad -\infty < x < \infty,$$

$$H_1: F_X(x) = F_Y(x - \Delta), \qquad \Delta \ne 0, \quad -\infty < x < \infty.$$

The alternative hypothesis can also be formulated as $\Delta > 0$ or $\Delta < 0$. If the two samples are combined into a single sample, the order statistics of this sample are $Z_{(1)}, Z_{(2)}, \ldots, Z_{(m+n)}$, where $Z_{(1)} < Z_{(2)} < \cdots < Z_{(m+n)}$. The rank of a sample element corresponds to its position in this ordering of $Z_{(i)}$. Thus, the smallest sample element has the rank 1, the second smallest sample element has the rank 2, and so on. Consider the ranks of all $Z_{(i)}$ ($i = 1, 2, \ldots, m + n$) that represent the elements of the first sample, $X_i$, in the pooled sample, $Z$. Let the sum of these ranks be $S$. Because the average of the ranks in the pooled sample is $(1 + m + n)/2$, it is clear that if $H_0$ is true, then

$$E(S) = \frac{m(m + n + 1)}{2} \tag{3.38}$$

It can be shown that

$$\mathrm{Var}(S) = \frac{mn(m + n + 1)}{12} \tag{3.39}$$

Moreover, in this case, for any $m$ and $n$ greater than 8, $S$ is approximately normally distributed with the preceding mean and variance. For any $m$ and $n$ less than 8, special tables must be used (Dixon and Massey 1969). For reliability applications, the most critical interest is in testing the hypothesis $H_0$: $F_X(x) = F_Y(x)$, $-\infty < x < \infty$, against the alternative $H_1$: $F_X(x) = F_Y(x - \theta)$, $\theta > 0$. If $x$ and $y$ are the failure times, then this alternative is equivalent to the hypothesis (in terms of reliability functions) $R_X(x) > R_Y(x)$, which is also equivalent to the hypothesis that the items of the first sample from $F_X(x)$ are more reliable than the items of the second sample from $F_Y(y)$. The hypothesis $H_0$ is rejected if $S > C(m, n, \alpha)$, where $\alpha$ is the probability of rejecting $H_0$ when it is true (significance level). The values of $C(m, n, \alpha)$ are tabulated for small samples ($m$ or $n$ less than eight) (Dixon and Massey 1969). For large samples, the hypothesis is rejected if $C > z_\alpha$, where

$$C = \frac{S - E(S)}{(\mathrm{Var}\, S)^{1/2}} \tag{3.40}$$

and $z_\alpha$ is the upper $100\alpha$ percentage point, that is, the $100(1 - \alpha)$th percentile, of the standard normal distribution $N(0, 1)$.

EXAMPLE 3.5 Assume two TTF samples of a device, obtained under different stress conditions:
Sample 1 (hours): 90, 367, 470, 572, 1307, 1345, 1392, 1603, 2152, 2858; m = 10.
Sample 2 (hours): 37, 150, 154, 319, 373, 433, 538, 571, 751, 1180; n = 10.

The null hypothesis is that the device has the same reliability under both stresses. The alternative hypothesis is that the second stress condition is more severe. The hypothesis is tested at the 5% significance level (i.e., $\alpha = 0.05$). The pooled sample can be ordered as 37(1), 90(2*), 150(3), 154(4), 319(5), 367(6*), 373(7), 433(8), 470(9*), 538(10), 571(11), 572(12*), 751(13), 1180(14), 1307(15*), 1345(16*), 1392(17*), 1603(18*), 2152(19*), 2858(20*). The ranks marked with an asterisk represent the elements of the first sample in the pooled sample $Z$. The sum of these ranks is $S = 134$, $E(S) = 105$, $\mathrm{Var}\, S = 2100/12 = 175$, and $(\mathrm{Var}\, S)^{1/2} = 13.23$, so that $C = (134 - 105)/13.23 = 2.19$. From a table of the normal distribution (Beyer 1968), $z_{0.05} = 1.64$. Because $C > z_{0.05}$, $H_0$ is rejected; the second stress condition is concluded to be worse for reliability.
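The rank-sum arithmetic of Example 3.5 can be reproduced with the short sketch below. It is illustrative only: the function name `rank_sum_test` is an assumption, the code assumes there are no tied values (true for these data), and the critical value 1.64 is the one quoted in the example rather than computed.

```python
from math import sqrt

# Times to failure (hours) from Example 3.5.
sample_1 = [90, 367, 470, 572, 1307, 1345, 1392, 1603, 2152, 2858]
sample_2 = [37, 150, 154, 319, 373, 433, 538, 571, 751, 1180]


def rank_sum_test(first, second):
    """Rank sum S of `first` in the pooled sample and its normal score C (Eqs. 3.38-3.40).

    Assumes no tied values, which holds for the data above.
    """
    pooled = sorted(first + second)
    rank = {value: position for position, value in enumerate(pooled, start=1)}
    m, n = len(first), len(second)
    s = sum(rank[value] for value in first)
    expected = m * (m + n + 1) / 2
    variance = m * n * (m + n + 1) / 12
    return s, expected, variance, (s - expected) / sqrt(variance)


s, expected, variance, c = rank_sum_test(sample_1, sample_2)
z_critical = 1.64   # value quoted in the example for alpha = 0.05

print(f"S = {s}, E(S) = {expected}, Var(S) = {variance}, "
      f"C = {c:.2f}, reject H0: {c > z_critical}")
```

With tied observations, the usual practice is to assign average ranks, which this sketch does not do.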
3.4 RELIABILITY REGRESSION MODEL FITTING

The previous sections dealt mainly with a single random variable. However, reliability problems often require an understanding of the probabilistic relationships among several random variables. For example, the time to failure of a device may depend on applied voltage, environmental temperature, and humidity. The time to failure can be considered as a random variable, $Y$, which is a function of the variables $x_1$ (voltage), $x_2$ (temperature), and $x_3$ (humidity). Such functions inevitably contain different kinds of uncertainties. Therefore, the term model is widely used.

3.4.1 Gauss–Markov Theorem and Linear Regression
3.4.1.1 Regression Analysis

In regression analysis, $Y$ is referred to as the dependent variable or response, and $x_1$, $x_2$, and $x_3$ are the independent variables or factors. For the general case, independent variables $x_1, \ldots, x_k$ might be random or nonrandom variables whose values are known or chosen by the experimenter. The conditional expectation of $Y$ for any given values of $x_1, \ldots, x_k$, $E(Y \mid x_1, \ldots, x_k)$, is known as the regression of $Y$ on $x_1, \ldots, x_k$. If the regression of $Y$ is a linear function of the independent variables $x_1, \ldots, x_k$, then

$$E(Y \mid x_1, \ldots, x_k) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k \tag{3.41}$$

The coefficients $\beta_0, \beta_1, \ldots, \beta_k$ are called regression coefficients or parameters. Insofar as the expectation of $Y$ is a nonrandom variable, the relationship 3.41 is deterministic. The corresponding regression model for the random variable $Y$ can be written in the following form:

$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon \tag{3.42}$$

where $\varepsilon$ is called the random error, assumed to be distributed with mean $E(\varepsilon) = 0$ and finite variance $\sigma^2$. If $\varepsilon$ is normally distributed, one deals with normal regression.
Figure 3.3 Linear regression fitting.
Simple linear regression. Now consider the regression model for the simple deterministic relationship

$$Y = \beta_0 + \beta_1 x \tag{3.43}$$

which is known as simple linear regression. Assume $n$ pairs of measurements $(x_1, y_1), \ldots, (x_n, y_n)$, as shown in Figure 3.3, where $Y$ has a general tendency to increase with increasing values of $x$. Also assume that, for any given value of $x$, the dependent variable $Y$ is related to the independent variable $x$ by

$$Y = \beta_0 + \beta_1 x + \varepsilon \tag{3.44}$$

where $\varepsilon$ is normally distributed with mean 0 and variance $\sigma^2$. The random variable $Y$ has, for a given $x$, a normal distribution with mean $\beta_0 + \beta_1 x$ and variance $\sigma^2$. Thus, the regression model is a location transformation: the random variable $Y$ is formed by adding the nonrandom quantity $\beta_0 + \beta_1 x$ to the random variable $\varepsilon$. Also suppose that, for any given values $x_1, \ldots, x_n$, the random variables $Y_1, \ldots, Y_n$ are independent. For the preceding $n$ pairs of measurements, the joint probability density function of $y_1, \ldots, y_n$ is given by

$$f_n(y \mid x, \beta_0, \beta_1, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2\right] \tag{3.45}$$

The function 3.45 (discussed in Section 3.3.2.1) is the likelihood function of the parameters $\beta_0$ and $\beta_1$. Maximizing this function with respect to $\beta_0$ and $\beta_1$ reduces to the problem of minimizing the sum of squares

$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \tag{3.46}$$

with respect to $\beta_0$ and $\beta_1$.
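Thus, the maximum likelihood estimation of the parameters $\beta_0$ and $\beta_1$ is the estimation by the method of least squares. The properties of least squares estimates are given by the Gauss–Markov theorem and will be discussed later. The values of $\beta_0$ and $\beta_1$ minimizing $S(\beta_0, \beta_1)$ are those for which

$$\frac{\partial S(\beta_0, \beta_1)}{\partial \beta_0} = 0, \qquad \frac{\partial S(\beta_0, \beta_1)}{\partial \beta_1} = 0 \tag{3.47}$$

The solution of these equations yields the least squares estimates of the parameters $\beta_0$ and $\beta_1$ (denoted $\hat\beta_0$ and $\hat\beta_1$) as

$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} \quad \text{and} \quad \hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{3.48}$$

where

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i, \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{3.49}$$

Note that the estimates are linear functions of the measurements $y_i$. The estimate of the dependent variable variance $\sigma^2$ is given by

$$s^2 = \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - 2} \tag{3.50}$$

where

$$\hat{Y}_i = \hat\beta_0 + \hat\beta_1 x_i \tag{3.51}$$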
is the value of the dependent variable predicted by the regression model; $(n - 2)$ is the number of degrees of freedom, and "2" is the number of estimated parameters of the model. It can be shown that the estimates $\hat\beta_0$ and $\hat\beta_1$ are normally distributed random variables with the corresponding means $\beta_0$ and $\beta_1$. The joint distribution of $\hat\beta_0$ and $\hat\beta_1$ is a bivariate normal distribution with the covariance, $\mathrm{Cov}(\hat\beta_0, \hat\beta_1)$, given by

$$\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = -\frac{\bar{x}\, s^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{3.52}$$

Hence, in the general case, the estimates $\hat\beta_0$ and $\hat\beta_1$ are correlated. To avoid this, $x_1, x_2, \ldots, x_n$ must be chosen so that the sample mean, $\bar{x}$, equals 0 in Equation 3.52. This choice is a simple example of design of experiments.
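The least squares formulas above translate directly into code. The following is a minimal sketch, assuming Equations 3.48 through 3.52 as stated (with the covariance taken as $-\bar{x}s^2/\sum(x_i - \bar{x})^2$); the function name and the data are hypothetical and chosen only so that $\bar{x} = 0$.

```python
def fit_simple_linear(xs, ys):
    """Least squares fit of y = b0 + b1*x (illustrative sketch of Eqs. 3.48-3.52)."""
    n = len(xs)
    x_bar = sum(xs) / n                                   # Eq. 3.49
    y_bar = sum(ys) / n                                   # Eq. 3.49
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * y for x, y in zip(xs, ys)) / sxx   # slope, Eq. 3.48
    b0 = y_bar - b1 * x_bar                                   # intercept, Eq. 3.48
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]   # Y_i - Yhat_i
    s2 = sum(r ** 2 for r in residuals) / (n - 2)             # variance estimate, Eq. 3.50
    cov01 = -x_bar * s2 / sxx                                 # estimated covariance, Eq. 3.52
    return b0, b1, s2, cov01


# Hypothetical measurements (not from the handbook), centered so that x_bar = 0.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.7]
b0, b1, s2, cov01 = fit_simple_linear(xs, ys)
print(b0, b1, s2, cov01)
```

Because the $x$ values are centered, the estimated covariance between intercept and slope comes out as zero, which is the design-of-experiments point made in the text.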
Using the obtained estimates $\hat\beta_0$, $\hat\beta_1$, and $s^2$, the following confidence intervals can be constructed. The $(1 - \alpha)$ two-sided confidence interval for $\beta_0$ is given by

$$\hat\beta_0 \pm t_{n-2;\,\alpha/2}\,(s^2)^{1/2} \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right]^{1/2} \tag{3.53}$$

The $(1 - \alpha)$ two-sided confidence interval for $\beta_1$ is

$$\hat\beta_1 \pm t_{n-2;\,\alpha/2}\,(s^2)^{1/2} \left[\frac{1}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right]^{1/2} \tag{3.54}$$

The $(1 - \alpha)$ two-sided confidence interval for the mean value of $Y$ at any given point, $x_0$, is given by

$$\hat{y}(x_0) \pm t_{n-2;\,\alpha/2}\,(s^2)^{1/2} \left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right]^{1/2} \tag{3.55}$$

Based on the distributions of the preceding parameter estimates, several hypotheses can be tested:
• Let $\beta_0^*$ be a given number. Test the hypothesis that the regression parameter, $\beta_0$, is equal to $\beta_0^*$ (null hypothesis) against the alternative hypothesis that it is not equal to $\beta_0^*$; that is, $H_0$: $\beta_0 = \beta_0^*$ against $H_1$: $\beta_0 \neq \beta_0^*$.
• Analogously, test the following hypothesis: $H_0$: $\beta_1 = \beta_1^*$ against $H_1$: $\beta_1 \neq \beta_1^*$.
• Test the hypothesis about both $\beta_0$ and $\beta_1$: $H_0$: $\beta_0 = \beta_0^*$ and $\beta_1 = \beta_1^*$ against $H_1$: the hypothesis $H_0$ is not true.
The correlation coefficient, $\rho$, has an additional meaning in regression analysis. Let the random variables $Y$ and $X$ have a bivariate normal distribution. In this case, the conditional distribution of $Y$ for a given value $X = x$ is the univariate normal distribution with the variance given by

$$\sigma^2 = \sigma_Y^2 (1 - \rho^2) \tag{3.56}$$

where $\sigma_Y^2$ is the unconditional variance of $Y$ (i.e., the variance of $Y$ when the value of $X$ is unknown). From Equation 3.56,

$$\rho^2 = \frac{\sigma_Y^2 - \sigma^2}{\sigma_Y^2} \tag{3.57}$$
This last relationship has a useful interpretation: the squared correlation coefficient is equal to the proportion of the variance of $Y$ that is explained by knowledge of $x$.
3.4.1.2 The Gauss–Markov Theorem
Consider $n$ measurements, $Y_1, \ldots, Y_n$, of the dependent variable. Also suppose that the expectation, $E(Y_i)$, is given by Equation 3.41:

$$E(Y_i) = \beta_0 x_{0i} + \cdots + \beta_k x_{ki}, \qquad i = 1, \ldots, n, \quad n \geq k + 1 \tag{3.58}$$

where $x_{0i}, \ldots, x_{ki}$ are known values of the independent variables obtained in the experiments along with the values of $Y_i$ (as a rule, $x_{0i} \equiv 1$). Thus, each measurement, $Y_i$, can be written as

$$Y_i = \beta_0 x_{0i} + \cdots + \beta_k x_{ki} + \varepsilon_i, \qquad i = 1, \ldots, n, \quad n \geq k + 1 \tag{3.59}$$

where $\varepsilon_i$ are random uncorrelated errors ($\mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$) with $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$, $i, j = 1, \ldots, n$. These assumptions constitute the general linear model. It should be noted that no assumption has been made about a normal distribution of the random errors. As for the simple linear regression, the least squares estimates of $\beta_0, \ldots, \beta_k$ are the values $\hat\beta_0, \ldots, \hat\beta_k$ that minimize the sum of squares:

$$SS(\beta_0, \ldots, \beta_k) = \sum_{i=1}^{n} (Y_i - \beta_0 x_{0i} - \cdots - \beta_k x_{ki})^2 \tag{3.60}$$
Under the general linear model, the least squares estimates are unbiased and have the minimum variance among all unbiased estimates that are linear in the dependent variable measurements. This statement is known as the Gauss–Markov theorem.
3.4.1.3 Multiple Linear Regression
The general linear model can be written in a simple form in matrix notation. Let $Y = (Y_1, \ldots, Y_n)'$, $\beta = (\beta_0, \ldots, \beta_k)'$, $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)'$, and

$$X = \begin{pmatrix} x_{01} & \cdots & x_{k1} \\ x_{02} & \cdots & x_{k2} \\ \vdots & & \vdots \\ x_{0n} & \cdots & x_{kn} \end{pmatrix} \tag{3.61}$$

where $A'$ denotes the transpose of any vector or matrix $A$. Then Equation 3.59 takes the form

$$Y = X\beta + \varepsilon \tag{3.62}$$

It can be shown that the vector of estimates $\hat\beta = (\hat\beta_0, \ldots, \hat\beta_k)'$ is given by

$$\hat\beta = (X'X)^{-1} X'Y \tag{3.63}$$

The estimates, $\hat\beta$, of the coefficients $\beta$ are, in the general case, correlated. The covariance matrix of $\hat\beta$ is

$$\mathrm{Cov}(\hat\beta) = \sigma^2 (X'X)^{-1} \tag{3.64}$$

The matrix $X$ is sometimes called the design matrix of the experiments. This equation is the basis for the optimal design of experiments because practically all of the optimal experiment design criteria are expressed in terms of a covariance matrix. For example, if the design matrix is orthogonal, $X'X$ will be the identity matrix, so all of the estimates, $\hat\beta$, will be uncorrelated random variables with equal variances, $\sigma^2$. Note also that any optimal design of experiments is based on an a priori known form of the model 3.58.
All of these considerations were made within the limits of the general linear model; that is, without an assumption about the normal distribution of the error. If, in addition, the random errors are normally distributed, the least squares estimates will have the smallest variance among all unbiased estimates (including those that are nonlinear functions of $Y_i$). This is the case of multiple linear regression. In this case, it can be shown that the estimates, $\hat\beta$, are normally distributed, so different confidence limits can be constructed and some hypotheses can be tested. Most of them are similar to those used in the simple linear regression, but some of them have multivariate peculiarities. For example, the experimenter can test the hypothesis that one of the independent variables $x_i$ in the model does not have an influence on the dependent variable $Y$; that is, test the hypothesis $H_0$: $\beta_i = 0$ against $H_1$: $\beta_i \neq 0$.
In many practical situations, the experimenter might also be interested in the order of importance of the independent variables in predicting the dependent variable. For example, the experimenter might want to order the factors (load, temperature, humidity) influencing the reliability (dependent variable) of a given device. Testing the hypothesis mentioned earlier ($\beta_i = 0$) for each independent variable, $x_i$, $i = 1, \ldots, k$, does not reveal this ordering. In these situations, applying the so-called stepwise regression method is useful (Draper and Smith 1981).
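To make the matrix form of the least squares estimate concrete, the sketch below solves the normal equations $(X'X)\hat\beta = X'Y$, which is equivalent to Equation 3.63, using only the Python standard library. It is an illustration under stated assumptions, not a recommended numerical method: the helper names and the small design are hypothetical, and the responses are generated without error so that the known coefficients are recovered.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def least_squares(x_rows, y):
    """Return beta_hat from the normal equations (X'X) beta = X'Y."""
    k1 = len(x_rows[0])   # number of coefficients, k + 1
    xtx = [[sum(row[i] * row[j] for row in x_rows) for j in range(k1)]
           for i in range(k1)]
    xty = [sum(row[i] * yi for row, yi in zip(x_rows, y)) for i in range(k1)]
    return solve_linear_system(xtx, xty)


# Hypothetical design: intercept column x0 = 1 plus two factors x1 and x2.
x_rows = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
# Error-free responses generated from y = 1 + 2*x1 - 1*x2.
y = [1.0, 3.0, 0.0, 2.0, 4.0]
beta_hat = least_squares(x_rows, y)
print(beta_hat)
```

With noisy data the same call returns the least squares estimates; for ill-conditioned designs a QR-based solver from a numerical library would be the safer choice.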
3.4.2 Proportional Hazard (PH) and Accelerated Life (AL) Models
A reliability model is usually defined as the relationship between the time-to-failure distribution of a device and stress factors, such as load, cycling rate, temperature, humidity, and voltage. The reliability model can also be considered as a deterministic transformation of the random variable of the time to failure. Two main time transformations exist in life data analysis: the accelerated life (AL) model and the proportional hazard (PH) model.
3.4.2.1 Accelerated Life (AL) Model
Let $F_1(t; z_1)$ and $F_2(t; z_2)$ be time-to-failure cumulative distribution functions (Cdfs) of the device under the constant stress conditions $z_1$ and $z_2$, respectively. Stress condition $z_2$ is more severe than $z_1$ if, for all positive values of $t$,

$$F_2(t; z_2) > F_1(t; z_1) \tag{3.65}$$

This inequality means that a more severe stress condition accelerates the time to failure. Without loss of generality, it may be assumed that $z = 0$ for the normal (use) stress condition. If the failure time Cdf under normal stress conditions is denoted by $F_0(\cdot)$, the AL time transformation is given in terms of $F(t; z)$ and $F_0(\cdot)$ by the following relationship (Cox and Oakes 1984):

$$F(t; z) = F_0[t\,\psi(z, A)] \tag{3.66}$$

where $\psi(z, A)$ is a function connecting time to failure with a vector of stress factors, $z$, and $A$ is a vector of unknown parameters. An increase in $\psi(z, A)$ always corresponds to a decrease in time to failure. For $z = 0$, $\psi(z, A)$ is assumed to be equal to one. The relationship in Equation 3.66 is a scale transformation: a change in stress does not change the shape of the distribution function, only its scale. For the Cdfs $F_1(t; z_1)$ and $F_2(t; z_2)$, if $z_1$ is less severe than $z_2$ and $t_1$ and $t_2$ are the times at which $F_1(t_1; z_1) = F_2(t_2; z_2)$, there exists a function $g$ (for all positive $t_1$ and $t_2$) such that $t_1 = g(t_2)$, so

$$F_2(t_2; z_2) = F_1(g(t_2); z_1) \tag{3.67}$$

Because $F_1(t; z_1) < F_2(t; z_2)$, $g(t)$ must be an increasing function with $g(0) = 0$ and

$$\lim_{t \to \infty} g(t) = \infty \tag{3.68}$$

The function $g(t)$ is called the acceleration or the time transformation function. The assumption in Equation 3.66 that a change of stress condition does not change the shape of the Cdf, but changes its scale only, can be written in terms of the acceleration function as follows:

$$g(t) = \psi(z, A)\,t \tag{3.69}$$
In other words, Equation 3.66 is equivalent to an acceleration function that is linear in time. The relationships for the 100pth percentile of time to failure, $t_p(z)$, and the hazard rate, $\lambda(t; z)$, can be obtained from Equation 3.66 as

$$t_p(z) = t_p^0 / \psi(z, A) \tag{3.70}$$

$$\lambda(t; z) = \psi(z, A)\,\lambda_0[t\,\psi(z, A)] \tag{3.71}$$

where $t_p^0$ and $\lambda_0$ are the 100pth percentile and the hazard rate for the normal stress condition $z = 0$.
3.4.2.2 Proportional Hazard (PH) Model
For the PH model, the basic relationship analogous to Equation 3.66 is given by

$$F(t; z) = 1 - [1 - F_0(t)]^{\psi(z, A)} \tag{3.72}$$

The proportional hazard (Cox) model proper is the relationship for the hazard rate, which can be obtained from Equation 3.72 as

$$\lambda(t; z) = \psi(z, A)\,\lambda_0(t) \tag{3.73}$$

where $\psi(z, A)$ is usually a log-linear function. The PH model time transformation does not normally retain the shape of the Cdf, and the function $\psi(z, A)$ no longer has a simple relationship to the acceleration function. Consequently, the PH model is not as popular in reliability applications as the AL model, although it is widely used in biomedical life data analysis. It can be shown (Cox and Oakes 1984) that the PH model coincides with the AL model only for the Weibull distribution.
3.4.3 Accelerated Life Regression for Constant Stress
Consider the problem of prediction on the basis of AL tests under constant stress conditions. It is assumed that the reliability model, $\eta(z, B)$, for the 100pth percentile, $t_p$, is a given function of the stress factors, $z$, with an unknown vector of parameters, $B$:

$$t_p(z, B) = \eta(z, B) \tag{3.74}$$

and the reliability model is related to Equation 3.70 as

$$\eta(z, B) = t_p^0 / \psi(z, A) \tag{3.75}$$
The most commonly used models for the percentiles (including the median) are log-linear models. Two such models are the power rule model and the Arrhenius reaction rate model. For the power rule model,

$$t_p(x) = a / x^c, \qquad c > 0,\ x > 0 \tag{3.76}$$

where $x$ is a mechanical or electrical stress. For the Arrhenius reaction rate model,

$$t_p(T) = a \exp(E_a / T) \tag{3.77}$$

where $T$ is absolute temperature and $E_a$ is activation energy. The model combining these two models is given by

$$t_p(x, T) = a\,x^{-c} \exp(E_a / T) \tag{3.78}$$

where $a$, $E_a$, and $c$ are the parameters to be estimated. The goal is to obtain an estimate of the vector, $B$, of the model 3.74 and to predict the percentile at the normal (or given) stress condition on the basis of AL tests at different stress conditions, $z_1, \ldots, z_k$, where $k$ is greater than the dimension of vector $B$; that is, $k > \dim B$. It is also assumed that
• the TTF distributions at all the stress conditions are increasing failure rate average (IFRA) distributions with continuous density functions $f(t; z)$; and
• the test results are type II censored samples, where the number of uncensored failure times, $r_i$ ($i = 1, \ldots, k$), and the sample sizes, $n_i$, are large enough to estimate $t_p$ as the sample percentile $\hat{t}_p$:

$$\hat{t}_p = \begin{cases} t_{([np])}, & \text{if } np \text{ is not an integer} \\ \text{any value from the interval } [t_{(np)},\, t_{(np+1)}], & \text{if } np \text{ is an integer} \end{cases} \tag{3.79}$$
where $t_{(r)}$ is the $r$th ordered failure time (order statistic); the sample sizes are large enough that the asymptotic normal distribution of this estimate can be used. Based on the preceding assumptions, the model for Equation 3.74 can be written as

$$\hat{t}_p(z, B) = \eta(z, B)\,\varepsilon \tag{3.80}$$

where

$$\varepsilon \sim N\left\{1,\ \frac{p(1 - p)}{n_i\,\eta^2(z, B)\, f^2[\eta(z, B)]}\right\} \tag{3.81}$$

and $\sim N(a, b)$ means "is normally distributed with mean $a$ and variance $b$."
The multiplicative model 3.80 can be transformed by means of a logarithmic transformation to a model with a normally distributed additive error; that is, to the standard normal regression

$$\ln \hat{t}_p(z, B) = \ln \eta(z, B) + \varepsilon_1, \qquad \varepsilon_1 \sim N(0, \sigma^2) \tag{3.82}$$

where $\sigma^2$ is an unknown constant. This transformation is based on the properties of IFRA distributions and a probabilistic transformation: Let $x \sim N(0, \sigma^2)$ and […] $(\dim A) + 1)$, where the test results are (as in the previous case) type II censored samples and the number of uncensored failure times and sample sizes are large enough to estimate $t_p$ as the sample percentile $\hat{t}_p$. In this situation, the parameter estimates (of the vector $A$ and $t_p^0$) for the reliability model can be obtained using a least squares solution of the following system of integral equations:

$$\int_0^{t_p[z_i(t)]} \psi[z_i(s), A]\,ds = t_p^0, \qquad i = 1, 2, \ldots, k \tag{3.89}$$
EXAMPLE 3.6
Assume a model 3.78 for the 10th percentile of time to failure, $t_{0.1}$, of a ceramic capacitor in the form

$$t_{0.1}(U, T) = a\,U^{-c} \exp(E_a / T) \tag{3.90}$$

where $U$ is applied voltage and $T$ is absolute temperature. Consider a time-step-stress AL test plan using step-stress voltage in conjunction with constant temperature as accelerating stress factors. A test sample starts at a specified low voltage, $U_0$, and is tested for a specified time, $\Delta t$. Then the voltage is increased by $\Delta U$, and the sample is tested at $U_0 + \Delta U$ during $\Delta t$, etc.:

$$U(t) = U_0 + \Delta U\,\mathrm{En}(t / \Delta t) \tag{3.91}$$

where $\mathrm{En}(x)$ means "nearest integer not greater than $x$." The test is terminated after a fraction $p \geq 0.1$ of the items has failed. Thus, the test results are sample percentiles at each voltage–temperature combination. The test plan and results with $\Delta U = 10$ V, $\Delta t = 24$ h are given in Table 3.3.

Table 3.3 Ceramic Capacitors Test Results
Temperature, K    Voltage U0, V    TTF Percentile Estimate, h
398               100              347.9
358               150              1688.5
373               100              989.6
373               63               1078.6

For the example considered, the system of integral equations 3.89 takes the form

$$a = \int_0^{t_{0.1}[U_i(t),\,T_i]} \exp(-E_a / T_i)\,[U_i(s)]^c\,ds, \qquad i = 1, 2, 3, 4 \tag{3.92}$$

Solving this system for the preceding data yields the following estimates for Equation 3.88: $a = 2.227 \times 10^{-8}$ h·V$^{1.885}$; $E_a = 1.321 \times 10^4$ K; $c = 1.885$.
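As a plausibility check on Example 3.6, the quoted estimates can be substituted back into the integral equations. The sketch below assumes the system has the form $a = \int_0^{t_{0.1}} \exp(-E_a/T_i)[U_i(s)]^c\,ds$ with the staircase voltage of Equation 3.91; the function name is hypothetical, and the integral is evaluated exactly as a sum over voltage steps.

```python
import math


def exposure_integral(t_end, u0, temp, c, ea, du=10.0, dt=24.0):
    """Integrate exp(-ea/temp) * U(s)**c over [0, t_end] for the staircase
    voltage U(s) = u0 + du * floor(s / dt)."""
    total = 0.0
    step = 0
    while step * dt < t_end:
        duration = min(dt, t_end - step * dt)   # the last step may be partial
        voltage = u0 + du * step
        total += duration * voltage ** c
        step += 1
    return math.exp(-ea / temp) * total


# Table 3.3 rows: (temperature in K, starting voltage in V, percentile estimate in h)
rows = [(398, 100, 347.9), (358, 150, 1688.5), (373, 100, 989.6), (373, 63, 1078.6)]
# Estimates quoted in the example.
a_est, ea_est, c_est = 2.227e-8, 1.321e4, 1.885

for temp, u0, t01 in rows:
    value = exposure_integral(t01, u0, temp, c_est, ea_est)
    print(temp, u0, f"{value:.3e}", f"ratio to a: {value / a_est:.3f}")
```

If the reconstruction is right, each of the four integrals should land close to the quoted value of $a$; exact agreement is not expected because the estimates come from a least squares fit and are rounded.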
3.5 SUMMARY
Like the previous chapter, this one gives the reader the necessary basic statistical techniques (point and interval estimation, hypothesis testing, basic regression). Simultaneously, it is an introduction to specific reliability techniques (proportional hazard and accelerated life models).
REFERENCES
Beyer, W. 1968. Handbook of tables for probability and statistics. Boca Raton, FL: CRC Press.
Cox, D. R., and D. Oakes. 1984. The analysis of survival data. London: Chapman & Hall.
Dixon, W. J., and F. J. Massey, Jr. 1969. Introduction to statistical analysis, 3rd ed. New York: McGraw-Hill.
Draper, N., and H. Smith. 1981. Applied regression analysis. New York: John Wiley & Sons.
Miner, M. A. 1945. Cumulative damage in fatigue. Journal of Applied Mechanics 12:A159–A164.
CHAPTER 4
Practical Probability Distributions for Product Reliability Analysis
Diganta Das, Michael Pecht
CONTENTS
4.1 Introduction
4.2 Discrete Distributions
4.2.1 Binomial Distribution
4.2.2 Poisson Distribution
4.2.3 Other Discrete Distributions
4.3 Continuous Distributions
4.3.1 Weibull Distribution
4.3.2 Exponential Distribution
4.3.3 The Normal Distribution
4.3.4 The Lognormal Distribution
4.4 Probability Plots
4.1 INTRODUCTION
In reliability engineering, data are often collected from analysis of incoming parts and materials, tests during and after manufacturing, fielded products, warranty returns, and so on. If the collected data can be modeled by a probability distribution, then properties of the distribution can be used to make decisions for product design, manufacture, and reliability assessment. In this chapter, discrete and continuous probability distributions are introduced, along with their key properties. Then, two discrete distributions (binomial and Poisson) and four continuous distributions (Weibull, exponential, normal, and lognormal) commonly used in reliability modeling and hazard rate assessments are presented.
4.2 DISCRETE DISTRIBUTIONS
A discrete random variable, $x$, is a quantity that can be equal to any one of a number of discrete values ($x_0, x_1, x_2, \ldots, x_n$). There is a probability $f(x_i)$ that $x = x_i$:

$$f(x_i) = P\{x = x_i\} \tag{4.1}$$

In Equation 4.1, $f(x_i)$ is called the probability mass function (PMF)* and the cumulative distribution function is written as

$$F(x_i) = P\{x \leq x_i\} \tag{4.2}$$

The mean $\mu$ and variance $\sigma^2$ of a discrete random variable are defined in terms of the probability mass function as

$$\mu = \sum_i x_i f(x_i) \tag{4.3}$$

$$\sigma^2 = \sum_i (x_i - \mu)^2 f(x_i) = \sum_i x_i^2 f(x_i) - \mu^2 \tag{4.4}$$
The binomial and Poisson distributions are distributions of interest to reliability engineers. These distributions are useful in developing product sampling and acceptance plans. They are also useful in assessing product reliability based on the reliability of the parts (materials) that comprise the product.
4.2.1 Binomial Distribution
The binomial distribution is a discrete probability distribution applicable in situations that have only two mutually exclusive outcomes for each trial or test. For example, for a roll of a die, the probability is one-sixth that a specified number will occur (success) and five-sixths that it will not occur (failure). This example, known as a "Bernoulli trial," is a random experiment with only two possible outcomes, denoted by "success" or "failure." What counts as success or failure is defined by the experiment. In some experiments, the result not being a certain number may be defined as a success. The probability mass function $f(k)$ for the binomial distribution gives the probability of exactly $k$ successes in $m$ attempts:

$$f(k) = \binom{m}{k} p^k q^{m-k}, \qquad 0 \leq p \leq 1,\quad q = 1 - p,\quad k = 0, 1, 2, \ldots, m \tag{4.5}$$
* Discrete probability functions are referred to as probability mass functions and continuous probability functions are referred to as probability density functions. When referring to probability functions in generic terms, the term “probability density function” is used to mean both discrete and continuous probability functions.
where $p$ is the probability of the defined success; $q$ (or $1 - p$) is the probability of failure; $m$ is the number of independent trials; $k$ is the number of successes in $m$ trials; and the combinatorial formula is defined by

$$\binom{m}{k} = \frac{m!}{k!\,(m - k)!} \equiv {}^{m}C_k \tag{4.6}$$

where ! is the symbol for factorial. Because $(p + q)$ equals 1, raising both sides to a power $j$ gives

$$(p + q)^j = 1 \tag{4.7}$$

The binomial expansion of the left-hand side of Equation 4.7 gives the probabilities of the possible numbers of successes in $j$ trials, as represented by the binomial distribution. For example, for three components or trials, each with the same probability of success ($p$) and of failure ($q$), Equation 4.7 becomes

$$(p + q)^3 = p^3 + 3p^2 q + 3pq^2 + q^3 = 1 \tag{4.8}$$

which is based on the general equation

$$\sum_{k=0}^{m} f(k) = F(m) = P\{k \leq m\} = (p + q)^m \tag{4.9}$$

The four terms in the expansion of $(p + q)^3$ give the values of the probabilities for getting three, two, one, and no successes, respectively. That is, for $m = 3$ and probability of success $p$, $f(3) = p^3$, $f(2) = 3p^2 q$, $f(1) = 3pq^2$, and $f(0) = q^3$. The binomial expansion is also useful when there are products with different success and failure probabilities. The formula for the binomial expansion in this case is

$$\prod_{i=1}^{m} (p_i + q_i) = 1 \tag{4.10}$$

where $i$ pertains to the $i$th component in a system consisting of $m$ components. For a system of three different components, the expansion takes the following form:

$$(p_1 + q_1)(p_2 + q_2)(p_3 + q_3) = p_1 p_2 p_3 + (p_1 p_2 q_3 + p_1 q_2 p_3 + q_1 p_2 p_3) + (p_1 q_2 q_3 + q_1 p_2 q_3 + q_1 q_2 p_3) + q_1 q_2 q_3 = 1 \tag{4.11}$$
where the first term on the right-hand side of the equation gives the probability of success of all three components; the second term (in parentheses) gives the probability of success of exactly two components; the third term (in parentheses) gives the probability of success of exactly one component; and the last term gives the probability of failure of all components. The cumulative distribution function for a binomial distribution, $F(k)$, gives the probability of $k$ or fewer successes in $m$ trials. It is defined in terms of the discrete PMF by

$$F(k) = \sum_{i=0}^{k} f(i) \tag{4.12}$$

or, using the PMF for the binomial distribution,

$$F(k) = \sum_{i=0}^{k} \binom{m}{i} p^i q^{m-i} \tag{4.13}$$

For a binomial distribution, the mean, $\mu$, is given by

$$\mu = mp \tag{4.14}$$

and the variance is given by

$$\sigma^2 = mp(1 - p) \tag{4.15}$$
EXAMPLE 4.1
An engineer wants to select four capacitors from a large lot of capacitors in which 10% are defective. What is the probability of selecting four capacitors with
(a) zero defective capacitors;
(b) exactly one defective capacitor;
(c) exactly two defective ones; and
(d) two or fewer defective ones?
Solution: Here, success will be defined as "getting a certain number of good capacitors." Therefore, $p = 0.9$, $q = 0.1$, and $m = 4$. Using Equations 4.5 and 4.6, $f(4)$ is the probability of all four being good (i.e., no defectives). That is, there are four trials, each with the same $p$ and $q$:

$$f(4) = \binom{4}{4} (0.9)^4 (0.1)^0 = 0.6561$$

Another way to solve this problem is by defining success as "getting a certain number of defective capacitors" with $p = 0.1$ and thus $q = 0.9$. In this case, $f(0)$ gives the probability that there will be no defectives in the four selected samples. That is,

(a) $f(0) = \binom{4}{0} (0.1)^0 (0.9)^4 = 0.6561$

Continuing with the latter approach, the solutions to problems b, c, and d are

(b) $f(1) = \binom{4}{1} (0.1)^1 (0.9)^3 = 0.2916$
(c) $f(2) = \binom{4}{2} (0.1)^2 (0.9)^2 = 0.0486$
(d) $F(2) = f(0) + f(1) + f(2) = 0.9963$
EXAMPLE 4.2
Consider a product with a probability of failure in a given test of 0.1. If 10 of these products are tested,
(a) What is the expected number of failures that will occur in the test?
(b) What is the variance in the number of failures?
(c) What is the probability that no product will fail?
(d) What is the probability that two or more products will fail?
Solution: Here, $m = 10$ and $p = 0.1$, where $p$ is the probability of failure in a single test.
(a) The expected number of failures is the mean, $\mu = mp = 10 \times 0.1 = 1$.
(b) The variance is $\sigma^2 = mp(1 - p) = 10 \times 0.1 \times (1 - 0.1) = 0.9$.
(c) The probability of having no failures is the PMF with $k = 0$. That is, $f(0) = \binom{10}{0} \times 0.1^0 \times (1 - 0.1)^{10} = 0.349$.
(d) The probability of having two or more failures is one minus the probability of having zero or one failure. It is given by Pr(two or more failures) $= 1 - [f(0) + f(1)] = 1 - [0.349 + 10 \times 0.1 \times (1 - 0.1)^9] = 0.264$.
EXAMPLE 4.3
An electronic automotive control module consists of three identical microprocessors in parallel. The microprocessors are independent of each other and fail independently. For successful operation of the module, it is required that at least two microprocessors operate normally. The probability of success of each microprocessor for the duration of the warranty is 0.95. Determine the failure probability of the control module during warranty.
Solution: The module fails when two or more microprocessors fail. In other words, the module fails when only one or none of the microprocessors is working. Thus, the probability of failure of the module during warranty is given by

Pr(module fails during warranty) $= f(0) + f(1)$

where $m = 3$ is the number of components, $k = 0$ or 1 is the total number of working components, $p = 0.95$, and $q = 0.05$. Therefore,

Pr(module fails during warranty) $= (0.05)^3 + 3 \times 0.95 \times (0.05)^2 = 0.00725$

The possible outcomes are:
Failure: $f(0)$, 0 work, 3 fail; $f(1)$, 1 works, 2 fail
Success: $f(2)$, 2 work, 1 fails; $f(3)$, 3 work, 0 fail
Binomial Distribution Functions in Excel
BINOMDIST(number_s, trials, probability_s, cumulative) returns the binomial probability $f(k)$ or $F(k)$, where
Number_s is the number of successes in trials ($k$).
Trials is the number of independent trials ($m$).
Probability_s is the probability of success for each trial ($p$).
Cumulative is a logical value that determines the form of the function: TRUE returns the CDF, $F(k)$; FALSE returns the PMF, $f(k)$.
CRITBINOM(trials, probability_s, alpha) returns the smallest value of $k$ for which the cumulative binomial distribution is greater than or equal to a criterion value, where
Trials is the number of Bernoulli trials ($m$).
Probability_s is the probability of success on each trial ($p$).
Alpha is the criterion value, which the user selects based on the problem at hand.
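For readers working outside a spreadsheet, the same quantities can be computed with the Python standard library. The sketch below is an illustration, not part of the handbook: the helper names `binom_pmf` and `binom_cdf` are hypothetical, and the calls retrace Examples 4.1 through 4.3 using Equations 4.5 and 4.13.

```python
from math import comb


def binom_pmf(k, m, p):
    """Probability of exactly k successes in m trials (Equation 4.5)."""
    return comb(m, k) * p ** k * (1 - p) ** (m - k)


def binom_cdf(k, m, p):
    """Probability of k or fewer successes in m trials (Equation 4.13)."""
    return sum(binom_pmf(i, m, p) for i in range(k + 1))


# Example 4.1: four capacitors, "success" defined as drawing a defective (p = 0.1).
ex41 = [binom_pmf(0, 4, 0.1), binom_pmf(1, 4, 0.1),
        binom_pmf(2, 4, 0.1), binom_cdf(2, 4, 0.1)]

# Example 4.2: ten products, failure probability 0.1 per test.
ex42_none = binom_pmf(0, 10, 0.1)
ex42_two_or_more = 1 - binom_cdf(1, 10, 0.1)

# Example 4.3: module fails if at most one of three processors works (p = 0.95).
ex43_fail = binom_cdf(1, 3, 0.95)

print([round(v, 4) for v in ex41])
print(round(ex42_none, 3), round(ex42_two_or_more, 3))
print(round(ex43_fail, 5))
```

Here `binom_pmf` plays the role of BINOMDIST with cumulative set to FALSE, and `binom_cdf` the role of BINOMDIST with cumulative set to TRUE.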
4.2.2 Poisson Distribution
In situations where the probability of success ($p$) is very low and the number ($m$) of samples tested (i.e., the number of Bernoulli trials conducted) is large, it is cumbersome to evaluate the binomial coefficients. A Poisson distribution is useful in such cases. The PMF of the Poisson distribution does not involve the number of trials explicitly, and is written as

$$f(k) = \frac{\mu^k}{k!}\, e^{-\mu}, \qquad k = 0, 1, 2, \ldots \tag{4.16}$$

where $\mu$ is the mean and also the variance. For a Poisson distribution for $m$ Bernoulli trials with probability of success in each trial equal to $p$, the mean and the variance are given by

$$\mu = mp \tag{4.17}$$

$$\sigma^2 = mp \tag{4.18}$$
The Poisson distribution is widely used in industrial and quality engineering applications. It is also the foundation of control charts. It is used in various applications, such as determining the number of particles of contamination in a manufacturing environment, the number of power outages, and the number of flaws in rolls of polymers.
EXAMPLE 4.4
Solve Example 4.2 using the Poisson distribution approximation.
Solution: The expected number of failures is the same as the mean, $\mu = (10)(0.1) = 1$. The variance is also equal to one.
The probability of obtaining no failures is the PMF with $k = 0$:

$$f(0) = e^{-\mu} = e^{-1} = 0.3679$$

The probability of getting two or more failures is one minus the probability of obtaining zero or one failure. It is given by

Pr(two or more failures) $= 1 - [f(0) + f(1)] = 1 - (0.3679 + e^{-1}) = 0.2642$

Note the differences compared to Example 4.2, where the corresponding binomial results are 0.349 and 0.264.
Poisson Distribution Function in Excel
POISSON(x, mean, cumulative) returns the Poisson probability, where
x is the number of events.
Mean is the expected numeric value.
Cumulative is a logical value that determines the form of the probability distribution returned: TRUE returns the CDF; FALSE returns the PMF.
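The comparison between the Poisson approximation and the exact binomial result can be reproduced as follows. This is a minimal sketch using only the standard library; the helper name `poisson_pmf` is hypothetical and simply implements Equation 4.16.

```python
from math import exp, factorial


def poisson_pmf(k, mu):
    """Poisson probability of exactly k events with mean mu (Equation 4.16)."""
    return mu ** k * exp(-mu) / factorial(k)


m, p = 10, 0.1
mu = m * p                                   # Poisson mean, Equation 4.17

# Poisson approximation (Example 4.4)
p_none_poisson = poisson_pmf(0, mu)
p_two_plus_poisson = 1 - poisson_pmf(0, mu) - poisson_pmf(1, mu)

# Exact binomial values (Example 4.2) for comparison
p_none_binom = (1 - p) ** m
p_two_plus_binom = 1 - (1 - p) ** m - m * p * (1 - p) ** (m - 1)

print(round(p_none_poisson, 4), round(p_none_binom, 4))
print(round(p_two_plus_poisson, 4), round(p_two_plus_binom, 4))
```

With only 10 trials and $p = 0.1$ the two models differ noticeably in the probability of no failures; the approximation tightens as $m$ grows and $p$ shrinks with $mp$ held fixed.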
4.2.3 Other Discrete Distributions
Other discrete distributions used in reliability analysis include the geometric distribution, the negative binomial distribution, and the hypergeometric distribution. These distributions can usually be modeled as special or limiting cases of the binomial distribution.
With the geometric distribution, the Bernoulli trials are conducted until the first success is obtained. The geometric distribution has the "lack of memory" property, implying that the count of the number of trials can be started at any trial without affecting the underlying distribution. In this regard, this distribution has some similarity to the continuous exponential distribution, which will be described later.
With the negative binomial distribution (a generalization of the geometric distribution), the Bernoulli trials are conducted until a certain number of successes are obtained. It is conceptually different from the binomial distribution because the number of successes is predetermined and the number of trials is random.
With the hypergeometric distribution, testing is conducted without replacement in samples containing more than one kind of product or defect. The hypergeometric distribution differs from the binomial distribution in that the population is finite and the sampling from the population is made without replacement.
4.3 CONTINUOUS DISTRIBUTIONS
If the range of a random variable x extends over an interval (either finite or infinite) of real numbers, then x is a continuous random variable. The cumulative distribution function is given by

$$F(x_i) = P\{x \le x_i\} \tag{4.19}$$
The probability density function (PDF, analogous to the PMF for discrete variables) is f(x), given by

$$f(x) = \frac{d}{dx}F(x) \tag{4.20}$$

which yields

$$F(x) = \int_{-\infty}^{x} f(u)\,du \tag{4.21}$$

The mean $\mu$ and variance $\sigma^2$ of a continuous random variable are defined over the interval from $-\infty$ to $\infty$ in terms of the probability density function as

$$\mu = \int_{-\infty}^{\infty} x f(x)\,dx \tag{4.22}$$

$$\sigma^2 = \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2 \tag{4.23}$$
If f(x) is the failure probability density function (see Equation 2.4), then F(x) can be considered to be the unreliability function Q(x) when the random variable x denotes time t ≥ 0. Thus, Equation 4.21 becomes equivalent to Equation 4.5 and Equation 4.22 becomes equivalent to Equation 4.17.

EXAMPLE 4.5
The PDF for the failure of an appliance as a function of time to failure is given by

$$f(t) = \frac{1}{16}\,t\,e^{-t/4}$$

where t is in years and t > 0.
(a) What is the probability of failure in the first year?
(b) What is the probability of the appliance lasting at least 5 years?
(c) If no more than 5% of the appliances are to require warranty service, what is the maximum number of months for which the appliance should be warranted?

Solution: For the given PDF, the CDF is

$$F(t) = \int_0^t \frac{1}{16}\,\tau\,e^{-\tau/4}\,d\tau = 1 - \left(\frac{t}{4} + 1\right)e^{-t/4}$$

(a) The probability of failure during the first year is F(1) = 0.0265.
(b) The probability of lasting more than 5 years is 1 − F(5) = 1 − 0.3554 = 0.6446.
(c) For this case, $F(t_0)$ has to be less than 0.05, where $t_0$ is the warranty period. Because F(1) = 0.0265 is below 0.05, the warranty period can exceed 1 year. Because F(2) is approximately 0.09, it must be shorter than 2 years. Using trial and error, we find that $F(t_0) = 0.05$ at $t_0 \approx 1.42$ years. Therefore, the warranty should be set at no greater than 17 months.
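The trial-and-error step in part (c) can be replaced by a simple bisection search on the closed-form CDF. The sketch below assumes only the CDF derived in the solution; the helper names `appliance_cdf` and `solve_warranty_years` are illustrative.

```python
import math


def appliance_cdf(t: float) -> float:
    """F(t) = 1 - (t/4 + 1) exp(-t/4), with t in years."""
    return 1.0 - (t / 4.0 + 1.0) * math.exp(-t / 4.0)


def solve_warranty_years(target: float, lo: float = 0.0, hi: float = 10.0,
                         tol: float = 1e-9) -> float:
    """Largest t with F(t) <= target, found by bisection (F is increasing)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if appliance_cdf(mid) < target:
            lo = mid
        else:
            hi = mid
    return lo


p_first_year = appliance_cdf(1.0)           # part (a)
p_survive_5 = 1.0 - appliance_cdf(5.0)      # part (b)
t0 = solve_warranty_years(0.05)             # part (c), in years
warranty_months = math.floor(t0 * 12.0)     # round down to stay within 5%

print(f"F(1) = {p_first_year:.4f}, 1 - F(5) = {p_survive_5:.4f}")
print(f"t0 = {t0:.3f} years, warranty = {warranty_months} months")
```

Rounding the warranty period down to whole months keeps the expected fraction of warranty claims at or below the 5% limit.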
4.3.1 Weibull Distribution
The Weibull distribution is a continuous distribution developed in 1939 by Waloddi Weibull, who was also credited with inventing ball bearings and the electric hammer. The Weibull distribution is widely used for reliability analyses because it can model a wide diversity of hazard rate curves. The distribution can also approximate other distributions under special or limiting conditions. The Weibull distribution has been applied to life distributions for many engineered products and has also been used for material strength and warranty analysis. The probability density function for a three-parameter Weibull distribution is

$$f(t) = \frac{\beta}{\eta^{\beta}}\,(t-\gamma)^{\beta-1}\exp\left[-\left(\frac{t-\gamma}{\eta}\right)^{\beta}\right] \tag{4.24}$$

where $\beta > 0$ is the shape parameter, $\eta > 0$ is the scale parameter, and $\gamma$ is the location, or time delay, parameter. The reliability function is given by

$$R(t) = \int_t^{\infty} f(\tau)\,d\tau = \exp\left[-\left(\frac{t-\gamma}{\eta}\right)^{\beta}\right] \tag{4.25}$$

It can be shown that, for a duration $t = \gamma + \eta$, starting at time t = 0, the reliability R(t) = 36.8%, regardless of the value of $\beta$. Thus, for any Weibull failure probability density function, 36.8% of the products survive for $t = \gamma + \eta$. The time to “failure” of a product with a specified reliability, R, is given by

$$t = \gamma + \eta\,(-\ln R)^{1/\beta} \tag{4.26}$$

The hazard rate function for the Weibull distribution is given by

$$h(t) = \frac{f(t)}{R(t)} = \frac{\beta}{\eta}\left(\frac{t-\gamma}{\eta}\right)^{\beta-1} \tag{4.27}$$

The conditional reliability function is

$$R(t, T) = \frac{R(t+T)}{R(T)} = \exp\left\{-\left[\left(\frac{t+T-\gamma}{\eta}\right)^{\beta} - \left(\frac{T-\gamma}{\eta}\right)^{\beta}\right]\right\} \tag{4.28}$$

Equation 4.28 gives the reliability for a new mission of duration t for which T hours of operation were previously accumulated up to the beginning of this new mission. It is seen that the conditional reliability of the Weibull distribution generally depends on both the age at the beginning of the mission and the mission duration (unless $\beta = 1$). In fact, this is true for most distributions, except for the exponential distribution. Table 4.1 lists the key parameters for a Weibull distribution. The function $\Gamma$ is the gamma function, for which values are available from statistical tables.
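Equations 4.25 through 4.28 translate directly into code. The following is a minimal sketch using only the Python standard library; the function names and the numerical parameter values are hypothetical and chosen for illustration.

```python
import math


def weibull_reliability(t, beta, eta, gamma=0.0):
    """R(t) from Equation 4.25; no failures are possible before t = gamma."""
    if t <= gamma:
        return 1.0
    return math.exp(-(((t - gamma) / eta) ** beta))


def weibull_hazard(t, beta, eta, gamma=0.0):
    """h(t) from Equation 4.27, valid for t > gamma."""
    return (beta / eta) * ((t - gamma) / eta) ** (beta - 1.0)


def time_to_reliability(r, beta, eta, gamma=0.0):
    """Time at which reliability drops to r (Equation 4.26)."""
    return gamma + eta * (-math.log(r)) ** (1.0 / beta)


def conditional_reliability(t, age, beta, eta, gamma=0.0):
    """R(t, T) from Equation 4.28: mission of length t starting at age T."""
    return (weibull_reliability(t + age, beta, eta, gamma)
            / weibull_reliability(age, beta, eta, gamma))


# Hypothetical parameters: shape 2.5, scale 1000 h, failure-free period 100 h.
beta, eta, gamma = 2.5, 1000.0, 100.0
print(weibull_reliability(gamma + eta, beta, eta, gamma))   # about 0.368
print(time_to_reliability(0.90, beta, eta, gamma))          # time at R = 90%
print(conditional_reliability(50.0, 500.0, beta, eta, gamma))
```

Evaluating `weibull_reliability` at `gamma + eta` for several shape values is a quick way to confirm the 36.8% property stated above.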
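Table 4.1 Weibull Distribution Parameters
Parameter | Expression
Location parameter | $\gamma$
Shape parameter | $\beta$
Scale parameter | $\eta$
Mean (arithmetic average) | $\gamma + \eta\,\Gamma(1/\beta + 1)$
Median ($t_{50}$, or time at 50% failure) | $\gamma + \eta\,(\ln 2)^{1/\beta}$
Mode (highest value of f(t)) | $\gamma + \eta\,(1 - 1/\beta)^{1/\beta}$
Standard deviation | $\eta\left[\Gamma(2/\beta + 1) - \Gamma^2(1/\beta + 1)\right]^{1/2}$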
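The gamma-function values in Table 4.1 no longer need to be read from statistical tables; `math.gamma` in the Python standard library evaluates them directly. The sketch below is illustrative: the function name `weibull_summary` is hypothetical, and returning $\gamma$ as the mode when $\beta \le 1$ is an assumption added here, since the tabulated mode expression is not real-valued in that range.

```python
import math


def weibull_summary(beta, eta, gamma=0.0):
    """Mean, median, mode, and standard deviation per Table 4.1."""
    g1 = math.gamma(1.0 / beta + 1.0)
    g2 = math.gamma(2.0 / beta + 1.0)
    if beta > 1.0:
        mode = gamma + eta * (1.0 - 1.0 / beta) ** (1.0 / beta)
    else:
        mode = gamma  # assumption: density is highest at the start for beta <= 1
    return {
        "mean": gamma + eta * g1,
        "median": gamma + eta * math.log(2.0) ** (1.0 / beta),
        "mode": mode,
        "std": eta * math.sqrt(g2 - g1 * g1),
    }


# Hypothetical example: beta = 2, eta = 1, gamma = 0.
summary = weibull_summary(2.0, 1.0)
for name, value in summary.items():
    print(f"{name:>6}: {value:.4f}")
```

With $\beta = 1$ the results reduce to the exponential distribution, for which the mean and standard deviation both equal $\eta$.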
The shape parameter of a Weibull distribution determines the shape of the hazard rate function. With $0 < \beta < 1$, the hazard rate decreases as a function of time and can represent early life (i.e., infant mortality) failures. A $\beta \approx 1$ indicates that the hazard rate is constant and is representative of the “useful life” period in the “idealized” bathtub curve. A $\beta > 1$ indicates that the hazard rate is increasing and can represent wearout failures. Figure 4.1 shows the effects of $\beta$ on the probability density function curve with $\eta = 1$ and $\gamma = 0$. Figure 4.2 shows the effects of $\beta$ on the hazard rate curve with $\eta = 1$ and $\gamma = 0$. The scale parameter $\eta$ has the effect of scaling the time axis. Thus, for $\gamma$ and $\beta$ fixed, an increase in $\eta$ will stretch the distribution to the right while maintaining its starting location and shape (although there will be a decrease in the amplitude because the total area under the probability density function curve must be equal to unity). Figure 4.3 shows the effects of $\eta$ on the probability density function for $\beta = 2$ and $\gamma = 0$. The location parameter estimates the earliest time to failure and locates the distribution along the time axis. For $\gamma = 0$, the distribution starts at t = 0. With $\gamma > 0$, this implies that the product has a failure-free operating period equal to $\gamma$. Figure 4.4 shows the effects of $\gamma$ on the probability density function curve for $\beta = 2$ and $\eta = 1$. Note that if $\gamma$ is positive, the distribution starts to the right of the t = 0 line, or the origin. If $\gamma$ is negative, the distribution starts to the left of the origin and could imply failures had
Figure 4.1 Effects of shape parameter $\beta$ on the probability density function f(t), where $\eta = 1$ and $\gamma = 0$ (curves for $\beta$ = 1, 2, and 3 plotted against operating time).
Figure 4.2 Dependence of hazard rate on shape parameter $\beta$, where $\eta = 1$ and $\gamma = 0$ (curves for $\beta$ = 0.5, 1, and 3 plotted against operating time).
Figure 4.3 Effects of scale parameter $\eta$ on the probability density function of a Weibull distribution, where $\beta = 2$ and $\gamma = 0$ (curves for $\eta$ = 1, 2, and 3 plotted against operating time).
[Figure 4.4: probability density function curves for different values of $\gamma$]

[…] $Z_{0.05} = 1.96$, a renewal process is not appropriate when compared to a monotonic trend. In this case, the interarrival times are decreasing in length, and an increasing trend of failure occurrence appears to be taking place.

Note that, in Table 12.5, the maximum number of inversions is n(n − 1)/2. For values not shown, use symmetry, as described: If the number of observed inversions, I, is greater than the tabular value, then

$$\text{for } n = 8:\; P(I) = 1 - P(29 - I); \qquad \text{for } n = 9:\; P(I) = 1 - P(37 - I) \tag{12.36}$$

To illustrate the symmetry, we show two examples:

$$\begin{aligned}
&(A)\ \text{if } n = 8 \text{ and } I = 18,\quad P(18) = 1 - P(29 - 18) = 1 - P(11)\\
&(B)\ \text{if } n = 9 \text{ and } I = 20,\quad P(20) = 1 - P(37 - 20) = 1 - P(17)
\end{aligned} \tag{12.37}$$

If a significance level is set at, say, P, a trend exists if the probability is less than or equal to P/2 or greater than or equal to 1 − P/2; for example, 0.05 and 0.95 for a 10% significance level. A one-tail test can also be used, depending on whether an increasing or decreasing trend is of interest. A large number of inversions means either that times between failure are increasing with time or that a decreasing rate of failure is occurring.

12.3.3 Test for a Homogeneous Poisson Process
A relatively simple test can be used to determine whether an F-R process is an HPP, as opposed to one with a monotonic trend. This test, called the central limit theorem test or the Laplace test, is described in Cox and Lewis (1966). Two cases are considered:

Case 1: observation stops at time $X'$. Assume that n failures occur during the interval $(0, X')$ at times $X_1, X_2, \ldots, X_n$. Compute the following statistic:

$$U = \frac{\sum_{i=1}^{n} X_i - \dfrac{nX'}{2}}{X'\left(\dfrac{n}{12}\right)^{1/2}} \tag{12.38}$$
Case 2: observation stops at the nth failure. Assume that n failures occur at times $X_1, X_2, \ldots, X_n$. Compute the following:

$$U = \frac{\sum_{i=1}^{n-1} X_i - \dfrac{(n-1)X_n}{2}}{X_n\left[\dfrac{n-1}{12}\right]^{1/2}} \tag{12.39}$$
To test the null hypothesis that no trend exists in the data and that an HPP is in force, compare the statistic U to the standardized normal deviate at the chosen significance level, say $Z_\alpha$. If the absolute value of U is less than the value of $Z_\alpha$, we cannot reject the null hypothesis. If $U \ge Z_\alpha$, then an increasing trend exists; if $U \le -Z_\alpha$, then a decreasing trend exists.

The preceding equations apply to a single product. If data exist on a number, m, of like products, then a test of pooled data should be applied. The following test checks whether each of the m products has an HPP, but with possibly different occurrence rates. Let

$$N_j = \begin{cases} \text{number of failures for item } j & \text{(Case 1)} \\ \text{number of failures} - 1 \text{ for item } j & \text{(Case 2)} \end{cases} \tag{12.40}$$

$$X_j = \begin{cases} \text{observation period for item } j & \text{(Case 1)} \\ \text{last failure time for item } j & \text{(Case 2)} \end{cases} \tag{12.41}$$

Then,

$$U = \frac{S_1 + S_2 + \cdots + S_m - \dfrac{1}{2}\sum_{i=1}^{m} N_i X_i}{\left[\dfrac{1}{12}\sum_{i=1}^{m} N_i X_i^2\right]^{1/2}} \tag{12.42}$$

where

$$S_j = \sum_{i=1}^{N_j} X_{ij} \tag{12.43}$$
$S_j$ represents the sum of the failure times for the jth product. To illustrate the computation, consider a product with observed failure times of 70, 122, 152, 165, and 170 hours. Observation stopped at the last failure time, so this is a single-product, case 2 situation. Therefore,

$$U = \frac{\sum_{i=1}^{n-1} X_i - \dfrac{(n-1)X_n}{2}}{X_n\left[\dfrac{n-1}{12}\right]^{1/2}} = \frac{70 + 122 + 152 + 165 - 4(170/2)}{170\,(4/12)^{1/2}} = 1.72 \tag{12.44}$$

A U value of 1.72 is significant at the 10% level, indicating that a positive trend exists.
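The single-product and pooled statistics can be computed with a short script. This is a sketch under stated assumptions: the function names are illustrative, passing `end_time=None` is a convention chosen here to select Case 2, and the critical value is taken from the standard normal distribution via `statistics.NormalDist` for a two-sided 10% test.

```python
import math
from statistics import NormalDist


def laplace_u(failure_times, end_time=None):
    """Laplace trend statistic for one product.

    end_time given -> Case 1, time truncated (Equation 12.38).
    end_time None  -> Case 2, failure truncated (Equation 12.39).
    """
    times = sorted(failure_times)
    if end_time is None:
        end_time = times[-1]
        times = times[:-1]  # the last failure only defines the period
    n = len(times)
    return (sum(times) - n * end_time / 2.0) / (end_time * math.sqrt(n / 12.0))


def pooled_laplace_u(products):
    """Pooled statistic (Equation 12.42); products is a list of
    (failure_times, end_time_or_None) pairs, one per product."""
    total_s, total_nx, total_nx2 = 0.0, 0.0, 0.0
    for failure_times, end_time in products:
        times = sorted(failure_times)
        if end_time is None:
            end_time = times[-1]
            times = times[:-1]
        n_j = len(times)
        total_s += sum(times)
        total_nx += n_j * end_time
        total_nx2 += n_j * end_time ** 2
    return (total_s - 0.5 * total_nx) / math.sqrt(total_nx2 / 12.0)


u = laplace_u([70, 122, 152, 165, 170])        # worked example, Case 2
z_crit = NormalDist().inv_cdf(0.95)            # two-sided 10% level
print(f"U = {u:.2f}, critical value = {z_crit:.3f}")
print("trend detected" if abs(u) >= z_crit else "no trend detected")
```

For a single product the pooled formula reduces to the single-product statistic, which gives a convenient consistency check.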
RELIABILITY MODELS AND DATA ANALYSIS FOR REPAIRABLE PRODUCTS
12.3.4 Comparison of Two Samples
Establishing whether two samples come from the same population is important in many situations because such a test can be used to determine if a design improvement has occurred, if overhaul is beneficial, if the operating environment has changed, and so forth. The procedure for the Mann-Whitney test is as follows:

• Let $T_1(1), T_1(2), \ldots, T_1(k)$ be the interarrival times for one sample and $T_2(1), T_2(2), \ldots, T_2(m)$ be the interarrival times for the second sample.
• Order the k + m times, smallest to largest, and assign the rank of 1 to the smallest, 2 to the next smallest, and so on. The largest time receives the rank of k + m. If two or more times are tied, use the average of the tied ranks.
• Let S be the sum of the ranks associated with the first sample. That is, if $R[T_1(i)]$ is the rank for the ith failure time of the first product,

$$S = \sum_{i=1}^{k} R[T_1(i)] \tag{12.45}$$

• Compute the test statistic U:

$$U = S - k(k+1)/2 \tag{12.46}$$

• If k or m is less than eight, compare the test statistic U with the critical values of U, such as those found in Conover (1983), to determine if the null hypothesis that the two samples come from the same population is valid. If U is greater than the larger of the two critical values given, then the conclusion is that the first population has greater reliability than the second population; a value of U less than the smaller critical value indicates less reliability for the first population.
• If both k and m are equal to or greater than 8, then a normal approximation can be used; compute

$$Z = \frac{U - \dfrac{1}{2}km}{\left[km(k+m+1)/12\right]^{1/2}} \tag{12.47}$$

and compare to a standardized normal value for the chosen significance level.
To illustrate the test, consider the data taken from operating ship equipment before and after overhaul (see Table 12.6). Interarrival times are then as in Table 12.7. Ordering all the times yields the results shown in Table 12.8. Summing the “before overhaul” ranks gives

$$S = 1 + 3 + 6 + 8 + 9 + 10 = 37 \tag{12.48}$$
Table 12.6 Failure Time Data
Before Overhaul | After Overhaul
327 | 161
1380 | 249
2289 | 590
3080 | 901
3197 |
3261 |

Table 12.7 Interarrival Time Data (Ordered)
Before Overhaul | After Overhaul
327 | 161
1053 | 88
909 | 341
791 | 311
117 |
64 |

Table 12.8 Ranks of Interarrival Time Data
Time | Rank | Sample
64 | 1 | Before
88 | 2 | After
117 | 3 | Before
161 | 4 | After
311 | 5 | After
327 | 6 | Before
341 | 7 | After
791 | 8 | Before
909 | 9 | Before
1053 | 10 | Before
The test statistic with k = 6 is

$$U = 37 - 6(7/2) = 16 \tag{12.49}$$

The critical values at the 10% level are 4 and 26 (Conover 1983). Because U = 16 falls within this range, the conclusion is that reliability performance was unchanged after overhaul.
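The ranking procedure, including average ranks for ties, can be sketched as follows. The function names are illustrative, and the interarrival times are those of Table 12.7; critical values for small samples still have to be taken from published tables.

```python
import math


def average_ranks(values):
    """1-based ranks of values; tied values receive the average tied rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks


def mann_whitney_u(first, second):
    """U = S - k(k+1)/2, where S is the rank sum of the first sample."""
    ranks = average_ranks(list(first) + list(second))
    k = len(first)
    rank_sum = sum(ranks[:k])
    return rank_sum - k * (k + 1) / 2.0


def normal_approx_z(u, k, m):
    """Normal approximation (Equation 12.47), intended for k, m >= 8."""
    return (u - k * m / 2.0) / math.sqrt(k * m * (k + m + 1) / 12.0)


before = [327, 1053, 909, 791, 117, 64]   # interarrival times, Table 12.7
after = [161, 88, 341, 311]
u_stat = mann_whitney_u(before, after)
print(f"U = {u_stat:.0f}")
```

Because both samples here have fewer than eight observations, `normal_approx_z` is shown only for completeness and is not applied to this data set.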
12.3.5 Fitting the Weibull Nonhomogeneous Poisson Process
If a trend is established through application of the trend test discussed in Section 12.3.3, the next step is to see whether the data can be modeled as a nonhomogeneous Poisson process and whether an equation can be established for the process so that relevant characteristics, such as mean time between failures, average number of failures, and probability of failure, can be estimated. Because the process is nonhomogeneous, the rate at which failures occur is time dependent; thus, these characteristics will vary over time, unlike an HPP that, for example, has a single mean time between failures (MTBF). As shown earlier, the NHPP has the defining property that the distribution of the number of failures from age 0 to x is given by

$$P[\text{number of failures in } (0, x) = n] = \frac{e^{-R(x)}\,[R(x)]^{n}}{n!} \tag{12.50}$$

where

$$R(x) = \int_0^x r(t)\,dt$$

and the term r(t) is the intensity function representing the instantaneous rate of failure occurrence. In the case where

$$r(t) = \lambda\beta t^{\beta-1} \tag{12.51}$$

the NHPP is called a Weibull process. A large amount of theory has been developed for such a process, and it can be used as a model for an NHPP.

12.3.5.1 Weibull Process Characteristics
Characteristics of a Weibull process include:

• cumulative MTBF over (0, x):

$$\theta(0, x) = \frac{x^{1-\beta}}{\lambda} \tag{12.52}$$

• instantaneous MTBF at age x:

$$M(x) = \frac{x^{1-\beta}}{\lambda\beta} \tag{12.53}$$

• failure time distribution at time t, with t measured from x:

$$F_x(t) = 1 - e^{-t/M(x)} \tag{12.54}$$

• expected number of failures in $(x_1, x_2)$:

$$E[N(x_1, x_2)] = \lambda\left(x_2^{\beta} - x_1^{\beta}\right) \tag{12.55}$$
• distribution of the number of failures in $(x_1, x_2)$:

$$P[N(x_1, x_2) = n] = \frac{1}{n!}\left[\lambda\left(x_2^{\beta} - x_1^{\beta}\right)\right]^{n} e^{-\lambda\left(x_2^{\beta} - x_1^{\beta}\right)} \tag{12.56}$$

To illustrate the use of these equations, assume that $\lambda = 0.1$ and $\beta = 2$ and that time is measured in months. The following then holds:

• cumulative MTBF over the period 0 to 6 months:

$$\theta(0, 6) = 6^{1-2}/0.1 = 1.67 \text{ months} \tag{12.57}$$

• instantaneous MTBF at 12 months:

$$M(12) = 12^{1-2}/(0.1 \times 2) = 0.42 \text{ months} \tag{12.58}$$

• probability of failing within the next half-month for a product with 12 months of service:

$$F_{12}(0.5) = 1 - e^{-0.5/0.42} = 0.696 \tag{12.59}$$

• expected number of failures from 6 months to 12 months:

$$E[N(6, 12)] = 0.1\left(12^2 - 6^2\right) = 10.8 \text{ failures} \tag{12.60}$$

• probability that eight failures occur from 6 months to 12 months:

$$P[N(6, 12) = 8] = \frac{1}{8!}\left[0.1\left(12^2 - 6^2\right)\right]^{8} e^{-0.1\left(12^2 - 6^2\right)} = 0.094 \tag{12.61}$$
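The characteristics in Equations 12.52 through 12.56 are simple enough to evaluate directly. The sketch below reproduces the illustration with $\lambda = 0.1$ and $\beta = 2$; the function names are illustrative, and small differences from the text arise because the text rounds the instantaneous MTBF to 0.42 before using it.

```python
import math


def cumulative_mtbf(x, lam, beta):
    """Cumulative MTBF over (0, x), Equation 12.52."""
    return x ** (1.0 - beta) / lam


def instantaneous_mtbf(x, lam, beta):
    """Instantaneous MTBF at age x, Equation 12.53."""
    return x ** (1.0 - beta) / (lam * beta)


def prob_fail_within(t, x, lam, beta):
    """Probability of failing within t after age x, Equation 12.54."""
    return 1.0 - math.exp(-t / instantaneous_mtbf(x, lam, beta))


def expected_failures(x1, x2, lam, beta):
    """Expected number of failures in (x1, x2), Equation 12.55."""
    return lam * (x2 ** beta - x1 ** beta)


def prob_n_failures(n, x1, x2, lam, beta):
    """Probability of exactly n failures in (x1, x2), Equation 12.56."""
    mean = expected_failures(x1, x2, lam, beta)
    return mean ** n * math.exp(-mean) / math.factorial(n)


lam, beta = 0.1, 2.0  # values from the illustration; time in months
print(f"cumulative MTBF (0, 6)   = {cumulative_mtbf(6, lam, beta):.2f}")
print(f"instantaneous MTBF (12)  = {instantaneous_mtbf(12, lam, beta):.2f}")
print(f"P(fail within 0.5 at 12) = {prob_fail_within(0.5, 12, lam, beta):.3f}")
print(f"E[N(6, 12)]              = {expected_failures(6, 12, lam, beta):.1f}")
print(f"P[N(6, 12) = 8]          = {prob_n_failures(8, 6, 12, lam, beta):.3f}")
```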
12.3.5.2 Estimation of $\lambda$ and $\beta$
Several data collection possibilities must be considered. The data may be ungrouped (U) so that every failure time on each product is known, or the data may be grouped (G) so that only the total number of failures within fixed time intervals is known. For ungrouped data, observation can stop at some given time (time-truncated [T]) or at the occurrence of a specified number of failures (N). These possibilities then give rise to the following three cases:

• (U-T) ungrouped, time truncated;
• (U-N) ungrouped, failure truncated; and
• (G) grouped.

Estimation for case U-T. For ungrouped data with time-truncated testing, the maximum likelihood estimate (MLE) for $\beta$, given n failure times $x_1, \ldots, x_n$ over $(0, x')$, is

$$\hat{\beta} = \frac{n}{n \ln x' - \sum_{i=1}^{n} \ln x_i} \tag{12.62}$$

For $\lambda$, the MLE is

$$\hat{\lambda} = \frac{n}{(x')^{\hat{\beta}}} \tag{12.63}$$

Estimation for case U-N. For ungrouped data with truncation based on the number of failures, n,

$$\hat{\beta} = \frac{n}{(n-1)\ln x_n - \sum_{i=1}^{n-1} \ln x_i}, \qquad \hat{\lambda} = \frac{n}{x_n^{\hat{\beta}}} \tag{12.64}$$

Estimation for case G. For grouped data, the estimation procedure is somewhat more complicated in that a closed-form equation for $\beta$ does not exist. Assume that there are k intervals with boundaries $x_0 = 0, x_1, \ldots, x_k$, with $n_i$ failures in the ith interval. Then, $\beta$ is estimated as the solution of the equation

$$\sum_{i=1}^{k} n_i \left[\frac{x_i^{\hat{\beta}} \ln x_i - x_{i-1}^{\hat{\beta}} \ln x_{i-1}}{x_i^{\hat{\beta}} - x_{i-1}^{\hat{\beta}}} - \ln x_k\right] = 0 \tag{12.65}$$

where $x_0 \ln x_0$ is defined as equal to zero. Numerical techniques must be employed to solve this equation for $\beta$. Given an estimate for $\beta$, $\lambda$ is estimated by

$$\hat{\lambda} = \frac{\sum_{i=1}^{k} n_i}{x_k^{\hat{\beta}}} \tag{12.66}$$
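One simple numerical technique for the grouped case is bisection on the left side of Equation 12.65. The sketch below is an illustration rather than a prescribed method: the function names, the search bracket of 0.05 to 20, and the interval counts are all assumptions made for the example. The hypothetical counts 1, 3, 5 over the boundaries 0, 1, 2, 3 are proportional to $x_i^2 - x_{i-1}^2$, so the solution should be $\beta = 2$.

```python
import math


def grouped_score(beta, bounds, counts):
    """Left side of Equation 12.65; bounds = [x0 = 0, x1, ..., xk]."""
    ln_xk = math.log(bounds[-1])
    total = 0.0
    for i in range(1, len(bounds)):
        lo, hi = bounds[i - 1], bounds[i]
        lo_pow = lo ** beta if lo > 0 else 0.0
        lo_term = lo_pow * math.log(lo) if lo > 0 else 0.0  # x0 ln x0 := 0
        hi_pow = hi ** beta
        hi_term = hi_pow * math.log(hi)
        total += counts[i - 1] * ((hi_term - lo_term) / (hi_pow - lo_pow) - ln_xk)
    return total


def fit_grouped(bounds, counts, lo=0.05, hi=20.0, tol=1e-10):
    """Solve Equation 12.65 for beta by bisection, then apply Equation 12.66."""
    f_lo = grouped_score(lo, bounds, counts)
    f_hi = grouped_score(hi, bounds, counts)
    if f_lo * f_hi > 0:
        raise ValueError("no sign change in the assumed bracket")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if grouped_score(mid, bounds, counts) * f_lo > 0:
            lo = mid
        else:
            hi = mid
    beta_hat = 0.5 * (lo + hi)
    lam_hat = sum(counts) / bounds[-1] ** beta_hat
    return beta_hat, lam_hat


# Hypothetical grouped data: 1, 3, and 5 failures in (0,1], (1,2], (2,3].
beta_g, lam_g = fit_grouped([0.0, 1.0, 2.0, 3.0], [1, 3, 5])
print(f"beta = {beta_g:.4f}, lambda = {lam_g:.4f}")
```

If the bracket does not contain a sign change the function raises an error instead of returning a misleading estimate, so the bracket may need widening for other data sets.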
Example. Use the data on the pump presented in Section 12.3.2, which were previously shown to have a monotonically increasing trend. If observation stopped at 4,162 hours, the data are ungrouped and time truncated, so the U-T category holds. Therefore,

$$\hat{\beta} = \frac{n}{n \ln x' - \sum_{i=1}^{n} \ln x_{(i)}} = \frac{10}{10(8.3338) - 77.8094} = 1.81 \tag{12.67}$$
Then,

$$\hat{\lambda} = \frac{n}{(x')^{\hat{\beta}}} = \frac{10}{4162^{1.81}} = 2.8 \times 10^{-6} \tag{12.68}$$
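The closed-form estimators for the two ungrouped cases can be sketched as below. The function names are illustrative. Because the individual pump failure times are not repeated in this section, the check uses the summary quantities quoted in Equation 12.67 (n = 10, observation end at 4,162 hours, and a log sum of 77.8094).

```python
import math


def mle_from_sums(n, log_end, sum_logs):
    """U-T estimates from summary statistics (Equations 12.62 and 12.63).

    log_end is ln(x') and sum_logs is the sum of ln(x_i).
    """
    beta_hat = n / (n * log_end - sum_logs)
    lam_hat = n / math.exp(beta_hat * log_end)
    return beta_hat, lam_hat


def fit_time_truncated(times, end_time):
    """U-T case: every failure time known, observation stops at end_time."""
    logs = [math.log(x) for x in times]
    return mle_from_sums(len(times), math.log(end_time), sum(logs))


def fit_failure_truncated(times):
    """U-N case: observation stops at the nth failure (Equation 12.64)."""
    times = sorted(times)
    n, x_n = len(times), times[-1]
    denom = (n - 1) * math.log(x_n) - sum(math.log(x) for x in times[:-1])
    beta_hat = n / denom
    return beta_hat, n / x_n ** beta_hat


# Summary values quoted in the pump example.
beta_hat, lam_hat = mle_from_sums(10, math.log(4162.0), 77.8094)
print(f"beta = {beta_hat:.2f}, lambda = {lam_hat:.2e}")
```

Working from the unrounded shape estimate gives a scale estimate that agrees with the text to the two significant figures shown there.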
Note that the $\beta$ estimate is greater than 1.0, which is consistent with an increasing trend toward greater failure occurrence frequency.

12.3.5.3 Goodness-of-Fit Tests
Goodness-of-fit tests may be used to test whether observed failure data are consistent with a Weibull process. Generally, at least 20 failure times should be observed in order to apply a goodness-of-fit test. The equations for each of the three cases are presented next.

Test for case U-T. Calculate

$$G_{UT} = \frac{1}{12n} + \sum_{i=1}^{n}\left[\left(\frac{x_i}{T}\right)^{\bar{\beta}} - \frac{2i-1}{2n}\right]^2 \tag{12.69}$$

where

$$\bar{\beta} = \frac{(n-1)\hat{\beta}}{n} \tag{12.70}$$

$G_{UT}$ is compared with critical values for the Cramér-von Mises test. If $G_{UT}$ exceeds the value in the table for the selected significance level, then the null hypothesis that the data are consistent with a Weibull process is rejected.

Test for case U-N. Calculate

$$G_{UN} = \frac{1}{12(n-1)} + \sum_{i=1}^{n-1}\left[\left(\frac{x_{(i)}}{T}\right)^{\bar{\beta}} - \frac{2i-1}{2(n-1)}\right]^2 \tag{12.71}$$

where

$$\bar{\beta} = \frac{(n-1)}{n}\hat{\beta} \tag{12.72}$$
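The U-T statistic of Equation 12.69 is straightforward to compute once $\hat{\beta}$ is available. The sketch below is illustrative only: the function name is hypothetical, T is taken to be the time at which observation stopped, and the failure times are made-up values chosen so that each squared term vanishes and the statistic equals its minimum of 1/(12n). Critical values still have to be read from Cramér-von Mises tables.

```python
def cramer_von_mises_ut(times, end_time, beta_hat):
    """Goodness-of-fit statistic for the U-T case (Equations 12.69, 12.70)."""
    xs = sorted(times)
    n = len(xs)
    beta_bar = (n - 1) * beta_hat / n
    total = 1.0 / (12.0 * n)
    for i, x in enumerate(xs, start=1):
        total += ((x / end_time) ** beta_bar - (2 * i - 1) / (2.0 * n)) ** 2
    return total


# Hypothetical data: with beta_hat = 4/3 and n = 4, beta_bar = 1, and these
# times sit exactly on the expected plotting positions (2i - 1)/(2n).
g_ut = cramer_von_mises_ut([0.125, 0.375, 0.625, 0.875], 1.0, 4.0 / 3.0)
print(f"G_UT = {g_ut:.5f}")
```

A larger statistic indicates a poorer fit, which is why the null hypothesis is rejected when the value exceeds the tabulated critical value.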
Test for case G. For each interval, calculate the expected number of failures:

$$e_i = \hat{\lambda}\left(x_i^{\hat{\beta}} - x_{i-1}^{\hat{\beta}}\right), \qquad i = 1, 2, \ldots, k \tag{12.73}$$

Combine adjacent intervals if required so that the expected number of failures in each interval is at least five. Assume that, after such grouping, there are $k'$ intervals. Let $n_i'$ be the number of failures in the adjusted ith interval and let $e_i'$ be the corresponding expected value. Then calculate

$$C = \sum_{i=1}^{k'} \frac{(n_i' - e_i')^2}{e_i'} \tag{12.74}$$

C is approximately distributed as chi-square with $k' - 2$ degrees of freedom. Critical values can be found from tables of the chi-square distribution.
12.3.5.4 Confidence Interval Estimates
This section presents the formulas to calculate confidence limits of Weibull characteristics for ungrouped data. In some cases, these limits are approximations; for example, there is no distinction between the two types of truncation. Many of the limits have factors of the form C = A(n − 1)/n and D = B(n − 1)/n, where n is the number of failures, and A and B depend on the confidence level and n. The C and D factors are tabulated in Crow (1975). For n > 60, they can be approximated as follows:

$$C = \left[1 - \left(\frac{2}{n}\right)^{1/2} z_{\alpha/2}\right](n-1)/n, \qquad D = \left[1 + \left(\frac{2}{n}\right)^{1/2} z_{\alpha/2}\right](n-1)/n \tag{12.75}$$

where $z_{\alpha/2}$ is the $1 - \alpha/2$ percentile of the standard normal distribution. The following confidence formulas can be used:

• intensity function, r(x):

$$\text{LCL: } \hat{r}_L(x) = C\,\hat{r}(x), \qquad \text{UCL: } \hat{r}_U(x) = D\,\hat{r}(x) \tag{12.76}$$

• expected number of failures, $N(x_1, x_2)$:

$$\text{LCL: } N_L(x_1, x_2) = C\,\hat{\lambda}\left(x_2^{\hat{\beta}} - x_1^{\hat{\beta}}\right), \qquad \text{UCL: } N_U(x_1, x_2) = D\,\hat{\lambda}\left(x_2^{\hat{\beta}} - x_1^{\hat{\beta}}\right) \tag{12.77}$$
• cumulative MTBF, $\theta(0, x)$:

$$\text{LCL: } \theta_L(0, x) = \frac{x}{N_U(0, x)}, \qquad \text{UCL: } \theta_U(0, x) = \frac{x}{N_L(0, x)} \tag{12.78}$$

• instantaneous MTBF, M(x):

$$\text{LCL: } M_L(x) = \frac{1}{\hat{r}_U(x)}, \qquad \text{UCL: } M_U(x) = \frac{1}{\hat{r}_L(x)} \tag{12.79}$$

where LCL is a lower confidence limit and UCL is an upper confidence limit. Currently, no confidence formulas appear to be available for grouped data. As a conservative approach, the number of groups can be used for n in the preceding equation, which will provide wider limits than the true values.
12.4
SUMMARY
This chapter presented various methods for formulating and analyzing the reliability of repairable products. The notions of age independence and age persistence were introduced to define the boundaries of maintenance influence and they were related to the well-known renewal and Poisson processes. A basic strategy was then formulated for modeling and analyzing product failure data. Graphical and analytical tests for determining a trend in the rate of occurrence of failure were described. Finally, for the case when a Weibull nonhomogeneous Poisson process is applicable, detailed goodness-of-fit and estimation procedures were developed and illustrated.

REFERENCES
Balaban, H., and N. Singpurwalla. 1984. Stochastic properties of a sequence of interfailure times under minimal repair and under revival. In Reliability theory and models, ed. M. Abdel-Hameed, E. Cinlar, and J. Quinn. New York: Academic Press.
Blumenthal, S., J. Greenwood, and L. Herbach. 1973. The transient reliability behavior of series systems on superimposed renewal processes. Technometrics 15:255.
Conover, W. 1971. Practical nonparametric statistics. New York: John Wiley & Sons.
Cox, D. R., and P. A. W. Lewis. 1966. The statistical analysis of series of events. London: Methuen.
Crow, L. H. 1975. Tracking reliability growth. U.S. Army Materiel Systems Analysis Agency, Aberdeen Proving Ground, MD.
Mann, H. B. 1945. Nonparametric tests against trend. Econometrica 13:245.
CHAPTER 13
Continuous Reliability Improvement
Walter Tomczykowski

CONTENTS
13.1 Introduction
13.2 Reliability Growth Process
13.2.1 Reliability Improvement Program
13.2.2 Failure Classification
13.2.3 Test Optimization
13.2.4 Test Cycles and Environmental Considerations
13.3 Stress Margin Testing
13.3.1 Stressed Life Test (STRIFE)
13.3.2 Highly Accelerated Life Test (HALT)
13.3.3 Inverse Power Law Model and Miner’s Rule
13.4 Continuous Growth Monitoring
13.4.1 Continuous Growth Models
13.4.1.1 Duane Model
13.4.1.2 AMSAA Model
13.4.2 Discrete Models
13.4.2.1 Lloyd and Lipow Model
13.4.2.2 Wolman Model
13.5 Reliability Improvement Effectiveness and Uncertainty
13.5.1 Reliability Improvement Effectiveness
13.5.2 Reliability Improvement Uncertainty
13.6 Summary
References
13.1
INTRODUCTION
Reliability improvement techniques can be applied to a new product that has passed its major hardware and/or software design reviews, to a developed product that the manufacturer wishes to make more competitive, or to an existing product that is not meeting the customer’s expectations of reliability performance. Presumably, the latter case should not occur because the desired level of reliability should be designed into the product before the design is released to full production. The reliability improvement process recognizes that the reliability of the drawing board design of a complex product can be improved and allocates time for that improvement. By operating or testing the product in a manner that will identify deficiencies caused by the design, manufacturing process, and/or operation, deficiencies can be detected and removed, and methods for designing-in reliability can be reevaluated or used to improve reliability. By comparison, reliability qualification testing is intended to demonstrate the ability of the product to perform in its intended environment, and stress environment screening is intended only to precipitate defects. These methods do not result in reliability improvement. A continuous reliability improvement program is cost beneficial over the life cycle. The cost benefits result from reduced warranty repairs for commercial products and reduced maintenance and spares. This chapter discusses the principles of reliability growth, accelerated testing, and management of a continuous improvement program.
13.2 RELIABILITY GROWTH PROCESS
When complex equipment is designed with innovative technology or advanced production methods, the equipment often has unforeseen design, manufacturing, or operating deficiencies that affect reliability. A reliability improvement program seeks to achieve reliability goals by improving product design. The objective of an improvement program is to identify, locate, and correct faulty and weak aspects of the design, manufacturing process, and operating procedures. Reliability improvement is often accomplished through a program employing a test, analyze, and fix (TAAF) philosophy. The product’s reliability improves when corrective actions that remove the faulty and weak aspects of the design are incorporated and then verified with further testing. The TAAF process should be applied to developmental equipment but can be implemented on equipment already fielded. The TAAF process for developmental or prototype equipment is illustrated by a feedback loop, as in Figure 13.1. The use of a feedback loop is the basis for a successful improvement program.

Figure 13.1 The test, analyze, and fix process (flowchart: testing of prototype equipment; detection of failures; analyze and (re)design; decision "Fix verified?" with yes and no branches).

13.2.1 Reliability Improvement Program
The length of a test in a reliability improvement program is a function of the perceived and desired reliability for commercial products; the required reliability for military products; the maturity and complexity of the product or prototype; the number of
products available for test (test units); the effectiveness of the failure reporting, analysis, and corrective action program; and the amount of time available. Because time is needed for troubleshooting, analyzing detected failures, investigating corrective actions, and incorporating changes, the test time is a portion of the total time available. Other factors affecting test time, which could be avoided through planning and training, include operator errors, chamber breakdowns, inadequate spare test units, and poor supervision. Test time divided by total available time (where available time is scheduled calendar time multiplied by the quantity of test units) is known as test efficiency. Experience has shown that most improvement programs have a test efficiency of 50% or lower. The test efficiency can be higher than 50% if additional manpower is used to perform the failure analysis, if spare test units are available and adequate planning and training have been conducted. However, test efficiencies that approach 90–100% are often an indicator that there is no growth. No growth could be caused by environmental stresses that are not severe enough to precipitate further faults, inadequate management of the improvement program, or a deficiency in the detection and reporting system. If calendar time is limited, the quantity of test units must be increased, or accelerated life testing (Section 13.3) must be employed. When adequate program resources and funding are available, calendar time could be minimized by using several test units. When more than one unit is used, the amount of test time should be reasonably distributed among each unit or prototype. Distributing the test time will prevent the accrual of hours only on a “gold-plated” or “hand-built” prototype, while the “real” prototype is down for repair. Testing only a hand-built prototype may support the desire simply to “pass the test.” To keep the test from being biased, each unit must
operate for at least 75% of the average relevant test time of all the units under test. For example, for a test of 2,000 hours on two products, neither product can be tested for less than 1,000 × 0.75, or 750 hours.

An improvement program can also be applied to products whose reliability is not measured in terms of failures per unit time. Measures of duration other than time could be the number of miles before failure, number of flights before failure, number of copies before failure, etc. This chapter discusses growth and test duration in units of test time.

Before products are officially placed under test, several tasks should be completed. One of these is establishing an environmental stress screening (ESS) process, a manufacturing process that uses random vibration and temperature cycling to precipitate part and workmanship defects. ESS will identify infant mortality failures such as manufacturing defects (missing components, wrong components, reversed components), workmanship failures (poor solder joints, bent leads, weak wirebonds), and incorrect part types. Implementing corrective actions to remove these faults reduces production, rework, and life-cycle costs. ESS can be applied to either the subassembly or final product; however, ESS is most cost beneficial when applied to the lowest level of assembly. The maximum and minimum temperature values should not exceed the rating of any of the parts or materials comprising the assembly. Care should be exercised to ensure that the physical response associated with the failure mechanisms of a part or material under stress is large enough to generate an effective screen but does not exceed the capability of the product. The effectiveness of the screen can be continuously monitored by examining the yields and results of higher level tests. If many workmanship or manufacturing failures are discovered by such tests, then the screen at lower levels must be adjusted.
When the yield or fallout data are acceptable (that is, when the majority of the failures are design failures), formal reliability growth testing may begin. Equipment to be used for reliability growth testing must undergo ESS. In addition to ESS, the following five actions should be taken:
• Verify the performance of the field environment simulation instruments (temperature chambers, vibration tables that will be used to test the prototypes, and the test measurement equipment).
• Complete a thermal analysis of the product.
• Complete a failure modes and effects analysis (FMEA).
• Establish a closed loop failure reporting, analysis, and corrective action system (FRACAS); FRACAS is utilized, failure analysis is performed, and corrective action is implemented on all failures occurring during developmental and operational testing, including failures that occur during ESS, rather than just those occurring during the formal improvement program.
• Complete a reliability improvement plan.
Failure analysis and corrective action are the most critical aspects of a reliability improvement program. Failures must be isolated to the root failure mode. Common failure modes for electrical products include cracked solder joints, board delaminations, component failures, software errors, procedure errors, poor board placement, and manufacturing process problems. Common failure modes for mechanical products
are corrosion, binding, and fracture (crack propagation). Specific root causes, including thermal overstress, electrical overstress, contamination, wearout, and mechanical damage must be identified. An accurately completed FMEA will aid the analysis process and save valuable time. The basic failure analysis steps for components are, in order:
1. Identify the part to ensure that the correct part was used.
2. Establish a part history to determine where it failed and how it may have failed in the past.
3. Confirm the failure.
4. Analyze the part, following an ordered failure analysis flow, as shown in Figure 13.2.
5. Identify the failure mode and cause.
6. Take photographs and x-rays.
7. Adopt recommended corrective action to preclude recurrence of the same failure.
8. Produce a concise report summarizing each step.
Figure 13.2 summarizes steps needed to determine the root cause of an electronic component failure. Once the part is identified and previous failure modes and causes are determined, the part should be externally examined for signs of overstress. Electrical testing, using a curve tracer, can then be performed; if the failure is verified, further nondestructive analysis techniques—x-ray, particle impact noise detection (PIND), or leak test—may be used to isolate the root cause of the failure. The component is then cross-sectioned or de-lidded to facilitate an internal inspection. Further fault isolation is accomplished using such techniques as scanning electron microscopy, die probing, or energy-dispersive x-ray analysis. Bond failures can be analyzed by performing destructive bond pull tests. If the failure has not been verified, additional tests for unverified failures (such as temperature cycling, temperature shock, and vibration) are performed until the failure is precipitated. Even after these additional tests, some failure modes may still not be identifiable. If this occurs, management will have to decide what additional resources (such as manpower and cost) to expend in attempting to identify the problem. Once the failure analysis is complete, the corrective action is identified, and the results are documented, the information is entered into FRACAS. The information in FRACAS is used by the manufacturer to incorporate the corrective actions into the product. To ensure that FRACAS is effective, it must be integrated into the reliability improvement plan and procedures. A reliability improvement plan must be completed, approved, and coordinated through the responsible test engineer, design engineer, reliability manager, manufacturing manager, logistics manager, and program manager. 
For military contracts and some commercial contracts (such as improving the reliability of a transformer for a power company), the consumer should ensure that the plan and procedures are coordinated through the procuring activity and a representative of the product user. At a minimum, the plan should address the test schedule, resources, test equipment, manpower, test environment, test procedures, planned growth versus
(Figure 13.2 flowchart. Main path: failure history, external visual, electrical curve tracer, verify failure?, x-ray, P.I.N.D., leak, de-lid, internal visual, fault isolation, destructive part analysis, document results. Unverified path: temperature cycling, vibration, burn-in, each followed by an electrical test; go to the next technique if still unverified.)
Figure 13.2 Failure analysis flow. (From LeStrange, J. 1990. Litton Amecon briefing to the University of Maryland Reliability Engineering Program.)
test time, failure reporting product, and corrective action program. To ensure a successful improvement program, the plan and procedures should thoroughly describe all aspects of the test, including ground rules. Establishing ground rules for the conduct of the test and guidelines for failure classification are critical aspects of the reliability growth test.
13.2.2 Failure Classification
A reliability improvement program with unlimited resources, funding, and time focuses attention on identifying and eliminating deficiencies in the design, rather than on the classification of failures. In the ideal case, corrections are incorporated as the root causes are discovered. However, this ideal usually does not exist due to limited customer funding, schedule demands, or the nature of the test itself (failures have to be counted and/or classified to monitor progress). In some contracts, the producer may designate certain types of failures as nonrelevant or nonchargeable to avoid a costly investigation of their causes. However, the customer may contest these decisions and demand to have the problem identified and corrected. This antagonistic situation must be avoided. To ensure a successful reliability improvement program, all failures should be considered relevant. Every hardware and software failure, including those caused by loose test cables (often the cause of intermittent failures or failures that cannot be duplicated), faulty test equipment, or other test facility problems should be investigated, and corrective action for each should be developed. Until all producers realize the benefit of investigating all failures, controversy will occur in failure classification.
To minimize the problems associated with classifying failures, ground rules should be established prior to the start of the growth test. A standard failure classification methodology is illustrated in Figure 13.3. Any anomaly in prototype behavior is classified and evaluated as either a relevant or nonrelevant failure. (Some contracts require that the root causes of all failures be determined.) Any anomaly in product operating behavior that is not expected to occur in the field is classified as nonrelevant.
Nonrelevant failures are often caused by improper installation, accidental damage or mishandling, failures of the test facility, or failures due to externally induced overstress exceeding the amount approved for testing. To judge failures as nonrelevant, the strength and stress distributions of the product must be understood (see Section 13.3) (Seusy 1987). Understanding stress and strength distributions provides insight into whether the product should have operated successfully under the failure conditions. Relevant failures include all operational anomalies not classified as nonrelevant, regardless of whether the failure is verified or unverified. For example, momentary cessation of equipment function, termed intermittent failure, is a relevant failure. Failures that cannot be duplicated during troubleshooting can also be classified as relevant. Relevant failures must be investigated and may result in design or production modifications.
An anomaly classified as a relevant failure may be further classified as chargeable or nonchargeable. Nonchargeable failures result from another failure; such dependent failures are induced by equipment furnished by the government or the customer or by the failure of parts whose specified life expectancy has been exceeded. Chargeable failures include intermittent failures, failures independent of equipment design, equipment and part manufacturing failures, part design failures, and failures resulting from contractor-furnished equipment (CFE) and contractor-furnished operating, maintenance, or repair procedures. Failures that have the same cause, failure mode, and environmental failure conditions are only counted as chargeable once. Chargeable failures can be used as the basis for tracking reliability growth.
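(Figure 13.3 flowchart. An incident, anomaly, or failure is first judged on whether it could occur in the field, which separates relevant from nonrelevant. A relevant incident is then judged on whether it was a dependent or induced failure, which separates nonchargeable from chargeable.)
Figure 13.3 Failure classification.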
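The two decisions of Figure 13.3, together with the rule that failures sharing the same cause, failure mode, and environmental conditions are counted as chargeable only once, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `Incident` record, its field names, and the sample incidents are hypothetical and do not come from the handbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """One test anomaly. All field names are illustrative assumptions."""
    cause: str
    mode: str
    environment: str
    could_occur_in_field: bool   # False for, e.g., test-facility faults
    dependent_or_induced: bool   # True if caused by another failure


def classify(incident: Incident) -> str:
    """Apply the two decisions of Figure 13.3 in order."""
    if not incident.could_occur_in_field:
        return "nonrelevant"
    if incident.dependent_or_induced:
        return "nonchargeable"
    return "chargeable"


def count_chargeable(incidents) -> int:
    """Count chargeable failures; repeats with the same cause,
    failure mode, and environment are counted only once."""
    unique = {
        (i.cause, i.mode, i.environment)
        for i in incidents
        if classify(i) == "chargeable"
    }
    return len(unique)


# Hypothetical incident log used only to exercise the functions.
incidents = [
    Incident("cracked solder joint", "open circuit", "vibration", True, False),
    Incident("cracked solder joint", "open circuit", "vibration", True, False),
    Incident("chamber controller fault", "overtemperature", "thermal", False, False),
    Incident("customer-furnished supply", "undervoltage", "ambient", True, True),
]
print([classify(i) for i in incidents])
print(count_chargeable(incidents))
```

In this sketch the two identical solder-joint incidents contribute a single chargeable failure. A real program would also record nonrelevant and nonchargeable incidents in FRACAS, since the text recommends investigating every failure regardless of classification.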
Intermittent failures, failures that cannot be duplicated, and failures caused by operator error are usually controversial and difficult to classify. These types of failures place additional burdens on maintenance, support, and logistics personnel. They also frustrate the consumer whose automobile’s classic intermittent stalling cannot be duplicated by the mechanic or whose television requires a brightness adjustment every 30 minutes at home but works perfectly for hours at the repair shop. Intermittent failures caused by external power interruptions, surges, or transients are usually nonchargeable failures. To avoid classifying such an intermittent failure as chargeable, external power monitors should be provided to monitor, regulate, and record the input power during the conduct of the reliability improvement program. If the intermittent failure cannot be associated with an external power interruption, surge, or transient, then it must be investigated as a “cannot duplicate” (CND). CNDs are test incidents that cannot be verified or duplicated by subsequent troubleshooting and maintenance. Common causes of CNDs are intermittent failures,
built-in test inadequacies, operator errors, improper maintenance, management misunderstanding, and incorrect user manuals. CNDs often are overlooked during an improvement program, due to schedule constraints, because time is needed to attempt to duplicate the failure and to verify that the equipment is in good condition. To maintain test efficiency, one approach is to classify CNDs as chargeable if, during troubleshooting, any component was swapped, disconnected and reconnected, or adjusted in place. If only internal built-in test (BIT) equipment or internal self-tests were used during troubleshooting (in other words, the equipment was not tampered with), then the CNDs could be classified as nonchargeable. In either case, all CNDs must be reported on the failure reporting product (FRACAS) and investigated by the cognizant testability and logistics engineers. The engineers can then attempt to remove the cause of the CNDs offline from the reliability improvement program. Even though CNDs could be classified as nonchargeable in terms of reliability, CNDs are always relevant in terms of maintainability or testability. Numerous CNDs and intermittent failures in products that contain software and utilize built-in tests may indicate bugs or errors in the software code and should be evaluated for possible software deficiencies. If the product does not contain software or BITs, the failures may only manifest themselves under certain environmental conditions, such as humidity, temperature, or vibration. Operator errors that cause failure during reliability growth are also controversial; however, during a reliability demonstration test, operator errors are always chargeable. If they could cause loss of life or other catastrophic failures in the field, then they are classified as chargeable. 
In less catastrophic situations, guidelines could be established; for example, if the same operating error were to occur three times, then, at the third occurrence, the failure would be classified as chargeable. Operator errors that repeat constantly could indicate poorly written operating instructions or user manuals. For example, during a growth test on a video cassette recorder (VCR), the test engineer has to verify the operation of the timing feature once every 24 hours. The test engineer carefully follows the operating instructions for the timing feature of the VCR, but it does not always turn on when expected. If this occurred only once, this failure would probably be classified as nonchargeable due to operator error, but if it occurred three or more times, it would be considered chargeable.
Test time is also categorized as relevant or nonrelevant. The test time between failures when the equipment is officially under test is termed relevant test time. Time spent troubleshooting equipment failures and verifying repairs is termed nonrelevant test time. Only the accumulated relevant test time is used to determine the improvement in product reliability. Failures that occur during nonrelevant test time should be investigated and corrected, but they would not be used to gauge reliability.
13.2.3 Test Optimization
To avoid duplication of effort, other types of testing (such as functional, human factor, and safety testing) should be performed concurrently with growth testing. Because design changes stemming from other test results may affect reliability, test data derived from different types of tests are best shared and may, in this way, provide deeper insight into equipment behavior. An important but often overlooked test that
can be conducted during growth testing is BIT false-alarm verification. Many products use BIT to determine when a failure occurs. When BIT indicates a failure, but a failure has not occurred, these BIT indications are considered false alarms. False alarms are a percentage of all failures indicated (usually specified as 1–5%) and are accounted for as a subset of CNDs. To use growth testing for false-alarm verification, external test instrumentation and recorders must monitor performance. During growth testing, if the BIT thresholds are set at the same sensitivities as the fielded equipment, then the BIT data, along with the information from the external test instrumentation, can be used to determine the percentage of false alarms. Even if the BIT thresholds are not the same, the data still provide a gross indication of false-alarm performance.
13.2.4 Test Cycles and Environmental Considerations
Relevant test time consists of a series of cycles that combine the worst-case environmental stresses that the equipment will experience in the field. Accelerated testing, which applies stresses higher than expected field conditions, is discussed in
Figure 13.4 Sample environmental test cycle.
Section 13.3. Depending on the field application, the test cycle may include stresses that are electrical (power line cycling), thermal, moisture induced (humidity), or vibration induced. Typically, the simulation of worst-case environmental stresses may not be required for products such as consumer electronics for home use; however, worst-case stresses must always be simulated for products when safety is an issue. In order to precipitate the occurrence of failures, the environmental conditions in the test are often the worst-case stresses that the equipment would encounter in the field. An example of a typical test plan cycle is illustrated in Figure 13.4. Test chambers provide rapid temperature change and are compatible with electrodynamic or mechanical vibration equipment. Operational checks on equipment are conducted continuously or at regular intervals during each test cycle; performance checks are conducted less frequently. The performance checks are commonly performed at room ambient temperature and consist of the operational check plus additional verification of equipment behavior, such as precision and measurement repeatability. However, performance tests conducted during or immediately following environmental extremes, such as after vibration, provide further insight into equipment behavior.
13.3 STRESS MARGIN TESTING
Accelerated testing is a reliability improvement technique used to identify deficiencies quickly by increasing the product’s normal stresses. Basic conditions for accelerated testing include the following:
• The dominant failure mode under normal stress and under accelerated stress should be the same.
• The engineering properties associated with the failure mechanisms of a material under accelerated stress should be the same before and after the test.
• The shape of the failure probability density function for the failure mechanisms at rated and higher stress levels should be the same.
To determine when these conditions are met, the failure mode (mechanism) has to be identified. The failure mechanism is the process by which the physical, electrical, mechanical, and chemical stresses combine to cause a failure. These stresses are used in the failure model to predict the reliability of the product. When the three basic conditions are met, accelerated life testing can be used to reduce test time and, consequently, test cost. Accelerated testing increases stresses such as temperature cycling, vibration, humidity, and power cycling above the product’s typical operating conditions or specifications. Pecht (1991) provides techniques to perform accelerated testing based on temperature, humidity, voltage, and mechanical stresses. To determine the equivalent amount of test time, accelerated test conditions can be extrapolated back to normal operating conditions. Failures occur only when stress exceeds the strength, as illustrated in Figure 13.5 (Seusy 1987). Product strength generally is broadly distributed and decreases with time, as shown in Figure 13.6. Stress testing simulates aging and amplifies unreliability; Figure 13.7 shows the general physical principle behind accelerated life tests.
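The statement that failures occur only when stress exceeds strength can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that stress and strength are independent and normally distributed; the handbook does not specify the distributions, and the function name and sample values are hypothetical. Under that assumption the unreliability is the probability that a random stress exceeds a random strength.

```python
import math


def unreliability_normal(mean_strength, sd_strength, mean_stress, sd_stress):
    """P(stress > strength), assuming independent normal distributions.

    The difference (strength - stress) is then normal with mean equal to
    the difference of the means and variance equal to the sum of variances.
    """
    margin = mean_strength - mean_stress
    spread = math.sqrt(sd_strength ** 2 + sd_stress ** 2)
    z = margin / spread
    # Standard normal CDF evaluated at -z, written with the error function.
    return 0.5 * (1.0 + math.erf(-z / math.sqrt(2.0)))


# Hypothetical values: strength degrades with time while stress stays fixed.
for mean_strength in (160.0, 130.0, 100.0):
    q = unreliability_normal(mean_strength, 30.0, 100.0, 40.0)
    print(f"mean strength {mean_strength:6.1f} -> unreliability {q:.4f}")
```

The loop mimics the time effect on strength: as the mean strength drops toward the mean stress, the overlap of the two distributions, and therefore the unreliability, grows. Raising the applied stress in an accelerated test has the same effect on the overlap.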
Figure 13.5 Stress vs. strength. (Curves shown: stress distribution and strength distribution; their overlap is labeled unreliability.)
Figure 13.6 Time effect on strength. (Curves shown: stress distribution and strength distribution; their overlap is labeled unreliability.)
The accelerated testing techniques called STRIFE and HALT and accelerated life-testing model tools, such as the power law model and Miner’s rule, are discussed further next (Schinner 1988; Hobbs 1990).
13.3.1 Stressed Life Test (STRIFE)
The stressed life test (STRIFE), developed by the Hewlett Packard Company, uses temperature cycling, power line cycling, and/or frequency variation to accelerate product failures. This is essentially the same as the “normal” improvement program that uses the operating environment during the test. Hewlett Packard enhanced STRIFE testing
Figure 13.7 Stress testing principle. (Curves shown: stress distribution and strength distribution; their overlap is labeled unreliability.)
for printed wiring boards by applying an expanded temperature range, increased temperature change rates, and increased random vibration (board electronic STRIFE testing [B.E.S.T.]). Some of the necessary conditions for B.E.S.T. include the following:
• The temperature of the components on the board must be kept in continuous and rapid transition for 90% of the temperature profile by using a 15°C “overshoot” at both hot and cold extremes. The temperature profile should be tailored to both the product and the test chamber; the duration of the “overshoot” should be such that the components reach at least 90% of the hot and cold extremes.
• The product must be powered on and off to create internal temperature cycles, thus accelerating electronic failures. When power is applied, the temperature of the component increases, based on the power dissipation, thermal mass, and heat transfer rate. Cycling the power on and off will also induce electrical stresses caused by voltage and current transient failures.
• Random vibration should be applied to the two axes that cause the worst-case mechanical stress to identify excessive displacement.
13.3.2 Highly Accelerated Life Test (HALT)
The highly accelerated life test (HALT), developed by Hobbs Engineering Corporation, applies stresses to the product that are higher than the normal operating and nonoperating levels (Hobbs 1990). Common stimuli applied are temperature, vibration, and voltage. HALT uses a stress-step approach to increase the stress levels of a stimulus progressively until an operational or destruct limit is reached. Once a failure occurs, it is investigated, and the product design is changed to compensate for the stress. This process is repeated for each stimulus and then continued by combining stimuli, such as temperature and vibration. HALT is completed only when the desired safety margin above normal operating conditions has been achieved.
Therefore, the length of a HALT program is difficult to predict. To be most cost effective, HALT should be performed prior to production and as early in the design as possible. Hobbs Engineering Corporation summarizes HALT as follows:
• It is an information-gathering tool and design approach to force failures that force product maturity.
• It may rapidly find failure mechanisms.
• It helps to improve the product to the technology limit.
• It will require a functional change for most companies.
HALT is applied to each level of production, from the subassembly to the final product. If performed correctly, HALT should increase customer satisfaction by improving reliability and lowering overall product life-cycle costs.
13.3.3 Inverse Power Law Model and Miner’s Rule
The inverse power law model and Miner’s rule are two tools that can be used for measuring the effects of accelerated life testing (Raheja 1990). The inverse power law model can be used to extrapolate accelerated test conditions back to normal operating conditions. The model states that product life is inversely proportional to the stress raised to the power of $N_a$, where $N_a$ is the acceleration factor derived from the slope of an S–N curve:
$$\text{slope} = -\frac{1}{N_a} \tag{13.1}$$
The inverse power law model can then be written as
$$\frac{\text{Life at normal stress}}{\text{Life at accelerated stress}} = \left(\frac{\text{accelerated stress}}{\text{normal stress}}\right)^{N_a} \tag{13.2}$$
Once the accelerated test is complete, solving for “life at normal stress” will yield the equivalent test time for normal operating conditions. For example, if the accelerated stress were twice as severe as the normal stress, the life at the accelerated stress determined to be 4 hours, and $N_a$ equal to 2, then the equivalent life at normal stress would be $4 \times 2^2 = 16$ hours. To determine the cumulative damage that may have occurred as a result of this test, Miner’s rule can be used. Miner’s rule states that the cumulative damage, CD, is
$$CD = \sum_{i=1}^{k} \frac{C_{Si}}{N_i} \le 1 \tag{13.3}$$
where $C_{Si}$ is equal to the number of cycles applied with a given mean stress, S; $N_i$ is equal to the number of cycles to failure under stress, S; and $k$ is the number of loads. Miner assumes that every part has a useful fatigue life and every cycle uses up a percentage of that life. When CD is equal to one, the cumulative damage should cause a failure.
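Both tools in this subsection reduce to one-line calculations. The following Python sketch reproduces the worked example for the inverse power law (Equation 13.2) and evaluates Miner’s rule (Equation 13.3) for a load history. The function names and the load history are hypothetical values chosen only for illustration.

```python
def life_at_normal_stress(life_accelerated, stress_ratio, n_a):
    """Inverse power law (Equation 13.2) solved for life at normal stress.

    stress_ratio is accelerated stress divided by normal stress.
    """
    return life_accelerated * stress_ratio ** n_a


def cumulative_damage(loads):
    """Miner's rule (Equation 13.3).

    loads is a sequence of (cycles_applied, cycles_to_failure) pairs,
    one pair per stress level.
    """
    return sum(applied / to_failure for applied, to_failure in loads)


# Worked example from the text: stress doubled, 4 hours of life, N_a = 2.
print(life_at_normal_stress(4.0, 2.0, 2))  # 16.0

# Hypothetical load history with two stress levels.
history = [(1_000, 10_000), (500, 2_000)]
damage = cumulative_damage(history)
print(f"cumulative damage = {damage:.2f}, failure expected: {damage >= 1.0}")
```

With this hypothetical history about 35% of the fatigue life is consumed, so no failure is predicted yet. Miner’s rule adds the fractions linearly and ignores the order in which the loads are applied, which is a known simplification of the method.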
13.4 CONTINUOUS GROWTH MONITORING
Product reliability will improve with appropriate design modifications resulting from the TAAF process (or accelerated life testing). The failure data obtained from reliability growth testing are often used to assess the rate of improvement and estimate the continued growth of the present product. The purpose of estimating the potential reliability growth is to aid management in scheduling. In some cases, a reliability growth estimate is used to determine the expected or estimated amount of test time needed to reach a reliability level. The reliability can be assessed at any point during testing to determine whether the product is improving on schedule and whether resources are allocated appropriately. Both continuous and discrete models have been developed to assess reliability growth (Duane 1964; Lloyd and Lipow 1962).
13.4.1 Continuous Growth Models
Continuous growth models were developed for repairable products in which reliability is measured in terms of mean time between failures (MTBF). The MTBF is plotted as a function of test time to illustrate growth. The MTBF is found by dividing the cumulative relevant test time by the cumulative number of relevant equipment failures. This concept was originally implemented by Duane (1964). Another continuous growth model is the Army Materiel Systems Analysis Activity (AMSAA) model (Crow 1974).
13.4.1.1 Duane Model
When Duane was working for General Electric, he recognized a general trend in the improvement of various products under test development in terms of the cumulative failure rate. The products included hydromechanical devices, aircraft generators, and an aircraft jet engine (Duane 1964). The cumulative number of failures, plotted on log–log paper as a function of cumulative operating hours, produced a nearly straight line for all of the products.
The slope of the line showed the rate of MTBF growth and indicated the effectiveness of the reliability improvement program in identifying and correcting design deficiencies. Progressively fewer failures occurred during the test program as design improvements were incorporated into the products. This phenomenon is mathematically modeled as
$$\lambda_\Sigma = \frac{\Sigma F}{t} = K t^{-\alpha_r} \tag{13.4}$$
where $\lambda_\Sigma$ is the cumulative failure rate; $\Sigma F$ is the cumulative number of failures; $t$ is the cumulative operating hours; $K$ is a constant indicating an initial failure rate; and $\alpha_r$ is the growth rate.
The growth rate, $\alpha_r$, must be between zero and one to model a decreasing failure rate. A growth rate that approaches one represents the maximum growth process achievable. A growth rate of 0.4–0.5 is generally accepted as a reasonable value for planning purposes. Upon completion of the reliability improvement program, the anticipated failure rate of production equipment is the instantaneous failure rate. A current or instantaneous failure rate, $\lambda_I$, can be found from the derivative of the cumulative number of failures, $\Sigma F$:
$$\lambda_I = \lim_{\Delta t \to 0} \frac{\Delta(\Sigma F)}{\Delta t} = \frac{d(\Sigma F)}{dt} = (1 - \alpha_r) K t^{-\alpha_r} \tag{13.5}$$
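Because the Duane model is a straight line on log–log axes, the growth rate and the constant K can be estimated from test data by an ordinary least-squares fit of the logarithm of the cumulative failure rate against the logarithm of the cumulative test time. The sketch below is one way to do this under stated assumptions: the function names are hypothetical, and the data are synthetic values generated from the model itself (K = 0.1, growth rate 0.4) so that the fit can be checked, not measurements from any real program.

```python
import math


def fit_duane(cum_hours, cum_failures):
    """Least-squares fit of ln(cumulative failure rate) versus ln(t).

    Returns (K, growth_rate) for the model  sum_F / t = K * t**(-growth_rate).
    """
    xs = [math.log(t) for t in cum_hours]
    ys = [math.log(f / t) for t, f in zip(cum_hours, cum_failures)]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return math.exp(intercept), -slope


def instantaneous_failure_rate(k, growth_rate, t):
    """Equation 13.5: (1 - growth_rate) * K * t**(-growth_rate)."""
    return (1.0 - growth_rate) * k * t ** (-growth_rate)


# Synthetic data generated from the model, used only to check the fit.
true_k, true_alpha = 0.1, 0.4
hours = [100.0, 300.0, 1_000.0, 3_000.0, 10_000.0]
failures = [true_k * t ** (1.0 - true_alpha) for t in hours]

k_hat, alpha_hat = fit_duane(hours, failures)
print(f"K = {k_hat:.4f}, growth rate = {alpha_hat:.4f}")
print(f"instantaneous MTBF at 10,000 h = "
      f"{1.0 / instantaneous_failure_rate(k_hat, alpha_hat, 10_000.0):.1f} h")
```

Real failure data are integer counts with scatter, so a fitted line will not pass through every point, and early points with very few failures can dominate the slope. The fit is a tracking aid, not a guarantee of future growth.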
The instantaneous MTBF could also be determined graphically by plotting the failures on log–log paper, with the point estimate MTBF on the y-axis and the time to failure on the x-axis. A straight line is fitted to the points. The instantaneous MTBF is then determined by drawing a line parallel to and displaced by a factor of $1/(1 - \alpha_r)$ above this cumulative line. The Duane model is also useful for constructing curves for predicted or planned growth in order to map the progress of reliability growth. The steps involved in the construction of a planned growth curve include:
• identifying reliability goals;
• setting the initial reliability on the growth curve based on historical data from similar products or initial test data;
• setting the initial test time, $t_i$, to equal the time when fixes will initially be incorporated into the product (Crow 1986); the initial test time is determined by estimating the probability of obtaining at least one failure by time $t_i$; it is calculated by setting 1 minus the product’s reliability function equal to a failure probability of perhaps 63.2% (at $t$ = MTBF, 63.2% of test products have failed) to 95% and solving for $t$; increasing the failure probability will increase the expected total test time if the product’s reliability function follows an exponential distribution and the probability of at least one failure is 90%;
• determining a growth rate based on the product’s complexity, maturity, and technology, the level of effort and aggressiveness of the failure analysis program, and the amount of supportive attention provided by management; and
• developing a growth curve as a baseline by which reliability growth can be evaluated.
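The initial test time step above can be illustrated numerically. Assuming an exponential reliability function, R(t) = exp(−t/MTBF), the probability of at least one failure by time t is 1 − R(t), and solving for t gives t = −MTBF · ln(1 − p). The Python sketch below applies this; the function name is hypothetical and the exponential assumption is the one mentioned in the text.

```python
import math


def initial_test_time(mtbf, prob_at_least_one_failure):
    """Solve 1 - exp(-t / mtbf) = p for t, assuming exponential reliability."""
    p = prob_at_least_one_failure
    if not 0.0 < p < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return -mtbf * math.log(1.0 - p)


# Initial MTBF of 250 hours, as in the example that follows in the text.
for p in (0.632, 0.90, 0.95):
    print(f"p = {p:.3f} -> initial test time = {initial_test_time(250.0, p):.0f} h")
```

At a failure probability of 63.2% the initial test time is approximately one MTBF (about 250 hours here); at 90% it is roughly 2.3 times the MTBF. A higher chosen probability therefore pushes the starting point of the planned growth curve to the right.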
The point where the instantaneous MTBF crosses the required MTBF line is the expected test time for the reliability growth program. The growth curve serves only as a guideline for assessing progress in terms of the schedule; there is no guarantee that reliability goals will be met. To obtain growth and meet reliability goals, design deficiencies have to be detected and corrective actions must be implemented.
EXAMPLE 13.1
A reliability growth test is planned for an avionic system to improve its current MTBF of 250 hours to 1,000 hours. The initial test time ($t_i$) is 250 hours, based on
the 63.2% probability of having at least one failure by time $t_i$. If similar products at this stage of development have achieved a growth rate of 0.3, determine:
• the test time for cumulative MTBF to reach 1,000 hours;
• the test time for instantaneous MTBF to reach 1,000 hours;
• the required quantity of units under test if 6 months of calendar time are available and testing can run for 24 hours per day; and
• the cumulative MTBF line and the instantaneous MTBF line graphically using log–log paper.
Test time for cumulative MTBF to reach 1,000 hours. Using Equation 13.4, an expression for total cumulative test time and MTBF is determined (General Electric Company 1973). The total cumulative test time, $t_c$, can be derived as
$$t_c = t_i \left(\frac{\theta_R}{\theta_i}\right)^{1/\alpha_r} \tag{13.6}$$
where $t_i$ is equal to the initial test time, $\theta_i$ is equal to the initial MTBF, $\theta_R$ is equal to the cumulative or required MTBF, and $\alpha_r$ is equal to the growth rate. Therefore,
$$t_c = 250 \left(\frac{1000}{250}\right)^{1/0.3} = 25{,}398 \text{ hours} \tag{13.7}$$
Test time for instantaneous MTBF to reach 1,000 hours. Using Equation 13.5, the instantaneous test time $T$ can be derived by first converting the initial failure rate, $K$, to the equivalent instantaneous failure rate. This is derived as
$$\lambda_I = K (1 - \alpha_r) \tag{13.8}$$
where $\lambda_I$ is equal to the instantaneous failure rate and $K$ is the failure rate that will be converted. Therefore,
$$\lambda_I = 0.004 (1 - 0.3) = 0.0028 \tag{13.9}$$
Using a variation of the equation derived in the preceding step,
$$T = 250 \left(\frac{0.0028}{0.001}\right)^{1/0.3} = 7{,}735 \text{ hours} \tag{13.10}$$
where $T$ is equal to the instantaneous test time.
Units under test required. The test is run for 730 hours per month for 6 months, so the total time is 4,380 calendar hours; having 50% test efficiency, one gets 2,190 hours; using the instantaneous test time from Equation 13.10 divided by 2,190 hours, one obtains 3.53, or 4, units.
Cumulative and instantaneous MTBF line. Figure 13.8 provides the graphical solution to this problem.
Figure 13.8 Cumulative and instantaneous MTBF lines. [Log–log plot of MTBF (hours) against test time (hours), showing the cumulative MTBF line (slope = 0.3, with 250 hours marked on the MTBF axis), the instantaneous MTBF line, and the required MTBF.]
13.4.1.2 AMSAA Model
Concepts formulated by Duane were extended in the AMSAA model, which uses Duane's cumulative failure rate but is also based on the Poisson distribution. It is explained as follows. Let $dm_1 < dm_2 < \dots < dm_k$ represent the cumulative test times when design modifications (dm) are made. Between design modifications, the failure rate can be assumed to be constant, as illustrated in Figure 13.9. Let $\lambda_i$ represent the failure rate during the ith time period between modifications, of length $dm_i - dm_{i-1}$. Based on the constant failure rate assumption, the number of failures, $N_i$, during the ith time period has a Poisson distribution with a mean number of failures $\lambda_i (dm_i - dm_{i-1})$. This is mathematically expressed by

$$\text{Prob}[N_i = n] = \frac{[\lambda_i (dm_i - dm_{i-1})]^n \, e^{-\lambda_i (dm_i - dm_{i-1})}}{n!}$$ (13.11)
where n is an integer. Let t represent the cumulative test time and let N(t) represent the total number of product failures by time t. Then N(t) is analogous to the cumulative number of failures, $\Sigma F$, from Duane's model. If t is in the first interval, then N(t) has a Poisson distribution with mean $\lambda_1 t$. If t is in the second interval, then N(t) is the number of failures, $N_1$, in the first interval plus the number of failures in the second interval between $dm_1$ and t. Thus, in the second interval, N(t) has the mean $\Theta(t) = \lambda_1 dm_1 + \lambda_2 (t - dm_1)$. When the failure rate is assumed to be constant ($\lambda_0$) over a test interval (that is, between design modifications), then N(t) is said to follow a homogeneous Poisson process, with a mean of the form $\lambda_0 t$. When the failure rates change with time, as in Figure 13.9, from interval 1 to interval 2, then N(t) is said to follow a nonhomogeneous Poisson process.

Figure 13.9 Constant failure rate between design modifications. [Step plot of failure rate against design modification: constant rates $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ in the intervals ending at DM1, DM2, DM3, and DM4.]

For tracking reliability growth between design modifications, N(t) follows the nonhomogeneous Poisson process, with the mean value function

$$\Theta(t) = \int_0^t \nu_\lambda(x)\, dx$$ (13.12)
where the intensity function $\nu_\lambda(x) = \lambda_i$ for $dm_{i-1} < x < dm_i$. Thus, for any t,

$$\text{Prob}[N(t) = n] = \frac{[\Theta(t)]^n \, e^{-\Theta(t)}}{n!}$$ (13.13)

where n is an integer. As $\Delta t$ approaches zero, $\nu_\lambda(t)\Delta t$ approximates the probability of a product failure in the time interval $(t, t + \Delta t)$. The intensity function is approximated by a continuous parametric function so that test data can be compiled and parameters estimated. The AMSAA model assumes that the intensity function can be approximated by a parametric function defined as

$$\nu_\lambda(t) = \lambda \beta t^{\beta - 1}$$ (13.14)
Here, the intensity function is analogous to the instantaneous failure rate, $\lambda_I$, defined in Equation 13.8. This is also the Weibull hazard rate function, with $\beta > 0$, $\lambda > 0$, and $t > 0$. Because the AMSAA model assumes a Poisson process with a Weibull hazard rate function (not the Weibull distribution), statistical procedures for the Weibull distribution do not apply. Equation 13.14 can model various processes, including reliability growth. The parameter of interest is $\beta$ because $1 - \beta$ is the reliability growth rate. From the parametric assumption, the mean number of failures by time t is defined as

$$\Theta(t) = \lambda t^{\beta}$$ (13.15)

This is not the product MTBF. If no additional modifications are incorporated into the product after the completion of the test, future failures would follow an exponential distribution, and the product MTBF could be obtained by taking the inverse of the intensity function, $\nu_\lambda(t)$. The cumulative failure rate, $\omega(t)$, is defined as

$$\omega(t) = \frac{N(t)}{t}$$ (13.16)
and is analogous to $\lambda_\Sigma$, defined in Equation 13.4. If the cumulative failure rate, $\omega(t)$, is linear with respect to time on a log–log scale, then the growth is analogous to that modeled by Duane. Parameters $\beta$ and $\lambda$ can be found graphically on full logarithmic paper or determined statistically using estimation theory. For graphical estimation, a straight line is fitted to the plot of the cumulative failure rate (or mean number of failures) versus the cumulative test time, using log–log paper. Taking the logarithm of the cumulative failure rate illuminates the relationship between the parameters and the slope and ordinate intercept. For statistical estimation, the maximum likelihood method (discussed in Chapter 3) can provide point estimates of the parameters.

Prior to the use of the AMSAA method, the test data must be analyzed to confirm that they show a significant trend rather than a homogeneous Poisson process. One test used to identify such trends is the central limit theorem test, or Laplace test (Cox and Lewis 1966). When more than one prototype is used (say, m prototypes), the product should be analyzed on a cumulative test duration basis (time, miles, etc.). The failure data on the prototypes are combined and analyzed as if they were a single product. If the period of observation ends with a failure (failure truncated), use the test statistic ($\mu_1$) generated by

$$\mu_1 = \frac{\sum_{i=1}^{M} X_i - M X_N / 2}{X_N \, [M/12]^{0.5}}$$ (13.17)
Table 13.1 Test Statistics

$z_\alpha$ Value    Percent Level of Significance (Two Sided)
1.960               5.0
1.645               10.0
1.282               20.0
where M is the number of failures (N) minus 1, $X_N$ is the time of the last failure, and $X_i$ is the time of the ith failure. If the failure data are time truncated, use the test statistic ($\mu_2$) generated by

$$\mu_2 = \frac{\sum_{i=1}^{N} X_i - N t_0 / 2}{t_0 \, [N/12]^{0.5}}$$ (13.18)
where N is the number of failures and $t_0$ is the total test time. The statistic $\mu$ is compared to the standardized normal deviate at the chosen significance level, $z_\alpha$:
• $\mu \le -z_\alpha$: significant growth is indicated at the chosen significance level and the AMSAA model can be used for estimating parameters $\beta$ and $\lambda$;
• $\mu \ge z_\alpha$: significant reliability decay is indicated at the chosen significance level and further corrective action and design changes are needed; or
• $-z_\alpha < \mu < z_\alpha$: the trend is not significant at the chosen significance level because the data (failure rate) follow a homogeneous Poisson process; additional data should be accumulated.
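The trend test and its three-way decision rule can be sketched in a few lines. This is an illustrative sketch only, assuming the reconstructed forms of Equations 13.17 and 13.18 and using the failure times of Example 13.2; the function names are hypothetical.

```python
import math


def laplace_time_truncated(failure_times, t0):
    """Laplace statistic for a time-truncated test (Eq. 13.18)."""
    n = len(failure_times)
    return (sum(failure_times) - n * t0 / 2.0) / (t0 * math.sqrt(n / 12.0))


def laplace_failure_truncated(failure_times):
    """Laplace statistic for a failure-truncated test (Eq. 13.17).

    Assumes failure_times is sorted, so the last entry is X_N.
    """
    x_n = failure_times[-1]
    m = len(failure_times) - 1
    return (sum(failure_times[:m]) - m * x_n / 2.0) / (x_n * math.sqrt(m / 12.0))


def classify_trend(mu, z_alpha):
    """Apply the decision rule: growth, decay, or no significant trend."""
    if mu <= -z_alpha:
        return "growth"
    if mu >= z_alpha:
        return "decay"
    return "no significant trend"


# Failure times (hours) from Example 13.2; the test is stopped at 2,500 hours
times = [85, 151, 184, 267, 378, 474, 660, 803, 1031,
         1230, 1400, 1589, 1643, 1756, 2122]
mu = laplace_time_truncated(times, t0=2500.0)
trend = classify_trend(mu, z_alpha=1.645)  # 10% two-sided level, Table 13.1
print(f"mu = {mu:.4f}, trend: {trend}")
```

A negative statistic means the failures cluster early in the test, which is why growth corresponds to the lower tail.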
Critical values of the test statistics can be found in the normal distribution tables. Common two-sided significance level test statistics are shown in Table 13.1. If significant growth is indicated, the estimates of $\beta$ and $\lambda$ can be determined by Equations 13.19 through 13.26. Biased estimators are typically used for large samples. However, if the goodness-of-fit test is passed, then the unbiased estimator can be used for both small and large samples. For failure-truncated tests, the biased estimate of $\beta$ is

$$\hat{\beta} = \frac{N}{(N-1) \ln X_N - \sum_{i=1}^{N-1} \ln X_i}$$ (13.19)

The unbiased estimate of $\beta$ can be determined by multiplying the biased estimate by $(N-2)/N$:

$$\bar{\beta} = \frac{N-2}{(N-1) \ln X_N - \sum_{i=1}^{N-1} \ln X_i}$$ (13.20)
For failure-truncated tests, the biased estimate of $\lambda$ is

$$\hat{\lambda} = \frac{N}{X_N^{\hat{\beta}}}$$ (13.21)

The unbiased estimate of $\lambda$ is

$$\bar{\lambda} = \frac{N}{X_N^{\bar{\beta}}}$$ (13.22)
For time-truncated tests, the biased estimate of $\beta$ is

$$\hat{\beta} = \frac{N}{N \ln t_0 - \sum_{i=1}^{N} \ln X_i}$$ (13.23)

The unbiased estimate of $\beta$ can be determined by multiplying the biased estimate by $(N-1)/N$:

$$\bar{\beta} = \frac{N-1}{N \ln t_0 - \sum_{i=1}^{N} \ln X_i}$$ (13.24)
The biased estimate of $\lambda$ is

$$\hat{\lambda} = \frac{N}{t_0^{\hat{\beta}}}$$ (13.25)

The unbiased estimate of $\lambda$ is

$$\bar{\lambda} = \frac{N}{t_0^{\bar{\beta}}}$$ (13.26)
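The time-truncated estimators translate directly into code. The sketch below assumes the reconstructed Equations 13.23 through 13.26 and Equation 13.14, and applies them to the failure times of Example 13.2; the function names are illustrative, not part of the handbook.

```python
import math


def amsaa_time_truncated(failure_times, t0, unbiased=True):
    """Point estimates (beta, lambda) for a time-truncated test.

    unbiased=True uses Eqs. 13.24 and 13.26; unbiased=False uses Eqs. 13.23 and 13.25.
    """
    n = len(failure_times)
    denominator = n * math.log(t0) - sum(math.log(x) for x in failure_times)
    beta = ((n - 1) if unbiased else n) / denominator
    lam = n / t0 ** beta
    return beta, lam


def intensity(lam, beta, t):
    """AMSAA intensity function (Eq. 13.14)."""
    return lam * beta * t ** (beta - 1.0)


# Failure times (hours) from Example 13.2; the test is stopped at 2,500 hours
times = [85, 151, 184, 267, 378, 474, 660, 803, 1031,
         1230, 1400, 1589, 1643, 1756, 2122]
beta, lam = amsaa_time_truncated(times, t0=2500.0)
mtbf = 1.0 / intensity(lam, beta, 2500.0)
print(f"beta = {beta:.3f}, lambda = {lam:.6f}, MTBF at 2,500 h = {mtbf:.0f} hours")
```

Because $\lambda t_0^{\beta} = N$ by construction, the estimated intensity at $t_0$ reduces to $N\beta/t_0$, which gives a quick hand check on the script.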
A goodness-of-fit test is used to determine if the collected data fit the AMSAA model. A popular test used for AMSAA modeling is the Cramer–Von Mises goodness-of-fit test discussed in Chapter 12. From the chosen level of significance, $\alpha$, the critical value of the test statistic, $C_M^2$, is determined from Table 13.2. The value calculated from the observations is then compared to this critical value. If the test is failure truncated, the calculated value is obtained by Equation 72 from Chapter 12, with $M = N - 1$:

$$C_M^2 = \frac{1}{12M} + \sum_{i=1}^{M} \left[ \left( \frac{X_i}{X_N} \right)^{\beta} - \frac{2i-1}{2M} \right]^2$$ (13.27)
Table 13.2 Critical Values of $C_M^2$, Parametric Form of the Cramer–Von Mises Statistic

         Level of Significance, $\alpha$
M        0.20     0.15     0.10     0.05     0.01
2        0.138    0.149    0.162    0.175    0.186
3        0.121    0.135    0.154    0.184    0.231
4        0.121    0.136    0.155    0.191    0.279
5        0.121    0.137    0.160    0.199    0.295
6        0.123    0.139    0.162    0.204    0.307
7        0.124    0.140    0.165    0.208    0.316
8        0.124    0.141    0.165    0.210    0.319
9        0.125    0.142    0.167    0.212    0.323
10       0.125    0.142    0.167    0.212    0.324
15       0.126    0.144    0.169    0.215    0.327
20       0.128    0.146    0.172    0.217    0.333
30       0.128    0.146    0.172    0.218    0.333
60       0.128    0.147    0.173    0.221    0.333
100      0.129    0.147    0.173    0.221    0.336
If the test is time truncated, then the calculated value is obtained by the equation

$$C_M^2 = \frac{1}{12N} + \sum_{i=1}^{N} \left[ \left( \frac{X_i}{t_0} \right)^{\beta} - \frac{2i-1}{2N} \right]^2$$ (13.28)

If the calculated value is greater than the tabulated critical value, then the AMSAA model is rejected. A poor Cramer–Von Mises fit may be caused by program changes that produced jumps or discontinuities in the reliability improvement program. Plotting the data may suggest whether a different model should be used or may indicate where the discontinuities occurred. If there are jumps, the AMSAA model can be applied piecewise; the data prior to and following the jump or discontinuity are treated separately. If the calculated value is less than the tabulated critical value, then the AMSAA model is accepted and the product intensity function (Equation 13.14) may be estimated as a function of time. Once the intensity function is determined, a parametric curve can be drawn to predict the future behavior of the product. If the product is not modified after time $t_0$, then failures are assumed to continue at the constant rate $\lambda_0 = \nu_\lambda(t_0) = \lambda \beta t_0^{\beta - 1}$, according to the exponential distribution. The estimate of the MTBF would then be equal to $1/(\lambda \beta t_0^{\beta - 1})$. Confidence tables developed for the AMSAA model can then be used to determine the lower and upper confidence bounds around this MTBF. In Table 13.2, for $M > 100$, use the values for $M = 100$.
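The accept/reject comparison can be sketched as follows. This is a hedged illustration, assuming the time-truncated form of the statistic (Equation 13.28), the unbiased estimate of $\beta$ from Equation 13.24, and the failure times of Example 13.2. It is not guaranteed to reproduce the statistic printed in that example, which may rest on a convention that cannot be recovered from the text; only the comparison against the tabulated critical value is illustrated. All names are hypothetical.

```python
import math


def cramer_von_mises(failure_times, beta, t_end, m):
    """Cramer-Von Mises statistic for the first m ordered failure times.

    t_end is t0 for a time-truncated test (m = N) or X_N for a
    failure-truncated test (m = N - 1).
    """
    total = 1.0 / (12.0 * m)
    for i, x in enumerate(failure_times[:m], start=1):
        total += ((x / t_end) ** beta - (2 * i - 1) / (2.0 * m)) ** 2
    return total


# Failure times (hours) from Example 13.2; the test is stopped at 2,500 hours
times = [85, 151, 184, 267, 378, 474, 660, 803, 1031,
         1230, 1400, 1589, 1643, 1756, 2122]
t0 = 2500.0
n = len(times)

# Unbiased estimate of beta for a time-truncated test (Eq. 13.24)
beta = (n - 1) / (n * math.log(t0) - sum(math.log(x) for x in times))

c2 = cramer_von_mises(times, beta, t_end=t0, m=n)
critical = 0.169  # Table 13.2, M = 15, level of significance 0.10
accepted = c2 < critical
print(f"C_M^2 = {c2:.4f}, critical = {critical}, AMSAA model accepted: {accepted}")
```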
EXAMPLE 13.2 A reliability growth test has accumulated 2,500 hours of test time, with failures occurring at the following test times: 85, 151, 184, 267, 378, 474, 660, 803, 1,031, 1,230, 1,400, 1,589, 1,643, 1,756, and 2,122 hours. Determine if the AMSAA model is appropriate at the 10% level of significance. If it is, determine the MTBF at 2,500 hours.
The test is time truncated at $t_0 = 2{,}500$ hours with $N = 15$ failures, so the test statistic is

$$\mu_2 = \frac{\sum_{i=1}^{N} X_i - N t_0 / 2}{t_0 \, [N/12]^{0.5}} = -1.7806$$ (13.29)

Because −1.7806 is less than −1.645, significant growth is indicated at the 10% significance level and the AMSAA model can be used for estimating parameters $\beta$ and $\lambda$. The unbiased estimate of $\beta$ is

$$\bar{\beta} = \frac{N-1}{N \ln t_0 - \sum_{i=1}^{N} \ln X_i} = 0.679$$ (13.30)

The unbiased estimate of $\lambda$ is

$$\bar{\lambda} = \frac{N}{t_0^{\bar{\beta}}} = 0.073942$$ (13.31)

The Cramer–Von Mises statistic is

$$C_M^2 = \frac{1}{12M} + \sum_{i=1}^{M} \left[ \left( \frac{X_i}{t_0} \right)^{\bar{\beta}} - \frac{2i-1}{2M} \right]^2 = 0.100305$$ (13.32)

The calculated value, 0.100305, is less than the critical value of 0.169 obtained from the Cramer–Von Mises table. Therefore, the AMSAA model is accepted and the intensity function may be estimated for $t = 2{,}500$ hours:

$$\nu_\lambda(t) = \bar{\lambda} \bar{\beta} t^{\bar{\beta} - 1} = 0.004074$$ (13.33)

The inverse of the intensity function provides an MTBF of 245 hours.
The data in this example can be modified to illustrate the sensitivity of the MTBF calculated by the AMSAA model. If the first two failures occurred very early in the test (at 1 and 3 hours, instead of 85 and 151 hours), then $\beta$ would be equal to 0.483094, $\lambda$ would be equal to 0.342426, and the MTBF at $t = 2{,}500$ hours would be 345 hours. The successful completion of the pretest tasks, such as ESS, burn-in, thermal survey, and so forth, should help assure that failures do not occur at the start of the improvement program.

A technique to estimate reliability growth when data may be missing or some failure times are not known was investigated by Crow (1988), who discussed a methodology useful for this situation over some interval of test time. Generally, for grouped data, the estimation procedure is somewhat more complicated because a closed form equation for $\beta$ does not exist. Assume that there are k intervals with boundaries $x_0 = 0, x_1, \dots, x_k$, and that $n_i$ failures are observed in the ith interval; $\beta$ is estimated as the solution of the following equation:

$$\sum_{i=1}^{k} n_i \left[ \frac{x_i^{\beta} \ln x_i - x_{i-1}^{\beta} \ln x_{i-1}}{x_i^{\beta} - x_{i-1}^{\beta}} - \ln x_k \right] = 0$$ (13.34)

where $x_0^{\beta} \ln x_0$ is defined to equal zero. Numerical techniques must be employed to solve this equation for $\beta$. Given an estimate for $\beta$, $\lambda$ can be estimated by

$$\hat{\lambda} = \frac{\sum_{i=1}^{k} n_i}{x_k^{\hat{\beta}}}$$ (13.35)
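Since Equation 13.34 has no closed-form solution, a root finder is needed. The sketch below uses plain bisection, which is one simple choice and not necessarily the numerical technique the source had in mind. The interval boundaries and failure counts are hypothetical values chosen purely for illustration, and all function names are assumptions.

```python
import math


def _x_pow_ln(x, beta):
    """x**beta * ln(x), with the convention that the product is zero at x = 0."""
    return 0.0 if x == 0 else (x ** beta) * math.log(x)


def grouped_score(beta, bounds, counts):
    """Left-hand side of Eq. 13.34 for boundaries x_0..x_k and counts n_1..n_k."""
    ln_xk = math.log(bounds[-1])
    total = 0.0
    for i, n_i in enumerate(counts, start=1):
        lo, hi = bounds[i - 1], bounds[i]
        numerator = _x_pow_ln(hi, beta) - _x_pow_ln(lo, beta)
        denominator = hi ** beta - lo ** beta
        total += n_i * (numerator / denominator - ln_xk)
    return total


def solve_beta(bounds, counts, low=1e-3, high=10.0, iterations=200):
    """Find a root of Eq. 13.34 by bisection on the bracket [low, high]."""
    f_low = grouped_score(low, bounds, counts)
    f_high = grouped_score(high, bounds, counts)
    if f_low * f_high > 0:
        raise ValueError("no sign change on the bracket; widen [low, high]")
    for _ in range(iterations):
        mid = 0.5 * (low + high)
        f_mid = grouped_score(mid, bounds, counts)
        if f_low * f_mid <= 0:
            high = mid
        else:
            low, f_low = mid, f_mid
    return 0.5 * (low + high)


# Hypothetical grouped data: interval boundaries (hours) and failures per interval
bounds = [0.0, 100.0, 300.0, 600.0, 1000.0]
counts = [8, 6, 5, 4]

beta_hat = solve_beta(bounds, counts)
lam_hat = sum(counts) / bounds[-1] ** beta_hat  # Eq. 13.35
print(f"beta = {beta_hat:.4f}, lambda = {lam_hat:.4f}")
```

Bisection is slow but needs only a sign change on the bracket, which makes it a safe default when nothing is known about the shape of the score function.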
Other continuous reliability growth models have been developed that model products under assumptions tailored to the development program and to specific circumstances. Some of these include models by Cox and Lewis (1966), and Lloyd and Lipow (1962) and software models by Jelinski and Moranda (1972) and Littlewood and Verrall (1973).

13.4.2 Discrete Models
The Duane and AMSAA reliability models are examples of continuous models, developed for repairable products in which reliability is measured over periods of time. Discrete models differ from continuous models because they measure reliability in terms of a go/no-go situation, such as for a missile or rocket. Products that either fail or operate when called into service are modeled by discrete functions. Popular discrete models were developed by Lloyd and Lipow (1962) and Wolman (1963).

13.4.2.1 Lloyd and Lipow Model
Lloyd and Lipow (1962) considered two models. The first assumes the product in the reliability improvement program has only one failure mode. Each trial assumes that the probability that the product will fail if the failure mode has not been previously eliminated is a constant. If the trial is a success and the product does not fail, the next trial is performed. If the product fails, an attempt is made to eliminate the failure mode by a corrective action or design change. The probability of removing this failure mode is also assumed to be constant. Therefore, the product reliability, $R_n$, on the nth trial is

$$R_n = 1 - A e^{-C(n-1)}$$ (13.36)

where A and C are predetermined parameters.

Lloyd and Lipow considered a second improvement program conducted in K stages. On the ith stage, N products are tested. Failures and successes are recorded during each stage, and improvements are not incorporated until the completion of a given stage. The reliability growth function considered is

$$R_i = R_\infty - \frac{\alpha_r}{i}$$ (13.37)
where $R_i$ is the product reliability during the ith stage, $R_\infty$ is the ultimate value of reliability as $i \to \infty$, and $\alpha_r$ is the growth rate ($\alpha_r > 0$). Maximum likelihood estimates and least square estimates are used to determine the values of $R_\infty$ and $\alpha_r$. A lower confidence limit for the reliability determined during the final or Kth stage was also determined.

13.4.2.2 Wolman Model
Wolman (1963) considered a situation where product failures could be classified as either an inherent cause or an assignable cause. For each trial (for instance, a missile launch), the trial is either a success or a failure. If it is a failure, then the cause is determined to be either inherent or assignable. Inherent cause failures reflect the state of the art of the product and cannot be eliminated by corrective action. Assignable cause failures can be eliminated. Wolman assumed, first, that a number of original assignable cause failure modes are known and, second, that when one of these modes causes a product failure, it will be permanently removed from the product. A Markov-chain approach was used to determine the reliability of the product after the nth trial. The model considered was

$$R(n) = \sum_{i=0}^{k} (1 - q_i)(1 - q_0)^{k-i} P_{0,i}^{(n)}$$ (13.38)

where $q_i$ is the probability that the product will fail due to inherent failure modes, $q_0$ is the probability of failure due to assignable cause failure modes, and $P_{0,i}^{(n)}$ is the n-step transition probability. Other discrete reliability growth models have been developed besides the Lloyd and Lipow and Wolman models. These include models by Barlow and Scheuer (1966) and Singpurwalla (1978).
13.5 RELIABILITY IMPROVEMENT EFFECTIVENESS AND UNCERTAINTY
The overall objective of any reliability improvement program is to identify, correct, and eliminate design and manufacturing deficiencies and failure modes. If the reliability improvement program is carried out as planned, design defects should be uncovered and corrected during the growth test instead of occurring in the field. However, two phenomena inhibit this process: reliability growth test effectiveness and uncertainty.
13.5.1 Reliability Improvement Effectiveness
When failures occur during a reliability growth test, they may or may not be corrected. Failure modes that will not be corrected are termed type A failure modes; failure modes that will be corrected are termed type B failure modes. (A similar classification of failures was also discussed by Wolman, 1963.) After the analysis of reliability data using one of the models, careful attention should be paid to the implementation of corrective actions. Ideally, all relevant failure modes should be corrected. However, due to funding limitations or the state of the art of the design, there will be a percentage of type A failures. Experience has shown that of the type B failure modes, an average of 30% will remain in the product, even though they were thought to have been corrected. The proportion of type B failure modes that will be eliminated equals the growth effectiveness factor. The potential growth upon the completion of the test can be determined by

$$\text{System}_{GP} = \frac{1}{\lambda_A + (1 - EF)\, \lambda_B}$$ (13.39)

where $\text{System}_{GP}$ is the product growth potential, $\lambda_A$ is the observed failure rate of type A failure modes, $\lambda_B$ is the observed failure rate of type B failure modes, and EF is the effectiveness factor.

EXAMPLE 13.3 A reliability growth test was stopped at 3,000 hours of test time with 25 failures. Failure analysis determined that 6 failures were type A and the remaining 19 were type B. Experience on comparable products has shown that the effectiveness factor should be 70%. The growth potential for this product is

$$\text{System}_{GP} = \frac{1}{\dfrac{6}{3000} + (1 - 0.7) \times \dfrac{19}{3000}} = 256 \text{ hours}$$ (13.40)
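The growth potential calculation is easy to reproduce. The sketch below assumes the reconstructed form of Equation 13.39 and the figures of Example 13.3; the function name and argument names are illustrative.

```python
def growth_potential(type_a_failures, type_b_failures, test_hours, effectiveness):
    """Growth potential MTBF: type A modes stay, a fraction of type B modes is removed."""
    lambda_a = type_a_failures / test_hours  # observed type A failure rate
    lambda_b = type_b_failures / test_hours  # observed type B failure rate
    return 1.0 / (lambda_a + (1.0 - effectiveness) * lambda_b)


# Example 13.3: 6 type A and 19 type B failures in 3,000 hours, EF = 70%
gp = growth_potential(6, 19, 3000.0, 0.7)
observed_mtbf = 3000.0 / 25
print(f"observed MTBF: {observed_mtbf:.0f} hours, growth potential: {gp:.0f} hours")
```

Comparing the result with the observed cumulative MTBF of 120 hours (3,000 hours divided by 25 failures) shows how much improvement the planned corrective actions are expected to deliver.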
This example illustrates the use of known cumulative test data to determine the growth potential upon the completion of a test. Similar analytical methods could be applied to the Duane model prior to the start of the growth test to determine the amount of test time needed to reach the required MTBF, based on various values of the growth rate.

13.5.2 Reliability Improvement Uncertainty
The Department of Defense conducted a study on the uncertainty of MTBF and growth estimates (U.S. Air Force, Army, and Navy 1989). Earlier, an example of the AMSAA model was given where failures occurring at the start of the test caused a 40% increase in the MTBF (345 vs. 245 hours). Given the numerous models available for analyzing growth test data, the problems associated with failure classification, and growth test effectiveness, it is apparent why there could be large variations in the results. The U.S. Air Force, Army, and Navy conducted Monte Carlo simulations using the AMSAA model to determine the probable uncertainties for MTBF and growth estimates. For each Monte Carlo trial, all failure times up to the 30th failure were recorded, and estimates were made of the growth rate and of the instantaneous (or current) MTBF. Figure 13.10a illustrates the range of data containing 80% of the simulation results. After five failures, 10% of the MTBF estimates would be expected to exceed the true value by a factor of 2.6, and 10% would be less than 0.45 of the true value. The three panels in Figure 13.10b were also developed using Monte Carlo simulations. These panels illustrate that, as the true growth rate increases, the dispersion in estimated growth rate diminishes.

The variations illustrated by these Monte Carlo simulations and the many variables inherent in the planning and conduct of growth testing make it imperative that critical program decisions not be made on the results of growth testing alone. This does not imply that reliability growth testing is not cost effective. Indeed, it can be a cost-effective method of continuously improving the reliability of a product. This does imply that sound engineering judgment should be used to compare the results of the growth program with the results of other development tests and other analyses, such as thermal surveys and reliability predictions. In addition, the reliability determined by the growth test should be bounded by upper and lower confidence limits. In this case, we could say with some confidence (say, 80%) that the true MTBF would be between the lower and upper confidence limits; at 80%, there is a one in five chance that the true MTBF would not be between the bounds.

Figure 13.10a Uncertainty of prospective MTBF estimates. [Monte Carlo 80% band of the ratio of estimated MTBF to true MTBF, plotted against the number of failures at which the estimate is made (5 to 30).]
When performed with the objective to eliminate and remove deficiencies, a reliability growth program will improve the reliability of a product and may eliminate the need for reliability demonstration testing. However, if there is any doubt about the results of the growth test or if major program decisions are needed, then a reliability demonstration or qualification test must be considered.
Figure 13.10b Uncertainty of prospective growth rate estimates. [Three Monte Carlo panels, for true growth rates of 0.1, 0.3, and 0.5, each plotting the estimated growth rate against the number of failures at which the estimate is made (5 to 30).]
13.6 SUMMARY
This chapter discussed continuous reliability improvement techniques that can be applied to products. The reliability improvement process recognizes that the reliability of the drawing board design of a complex product can be improved and time allocated for that improvement. By operating or testing the product in a manner that will identify deficiencies caused by the design, manufacturing process, and/or operation, deficiencies can be detected and removed, and methods for designing-in reliability can be reevaluated or used to improve reliability. Specifically, this chapter discussed the principles of the reliability growth process, stress margin testing, methods for continuous growth monitoring and reliability improvement effectiveness, and uncertainty. Details presented in the reliability growth process section included management decisions required to implement a reliability improvement program, a summary of failure analysis procedures and common failure modes, and techniques to classify failures to ensure a successful reliability improvement program. The section on stress margin testing included accelerated testing techniques and tools to measure the effects of accelerated testing. The section on continuous growth monitoring presented both continuous and discrete growth models. Popular models such as Duane and AMSAA models were discussed and examples presented. The final section provided techniques and examples to judge effectiveness on corrective actions and the uncertainty associated with reliability improvement techniques. Information integrated throughout the chapter will assist both managers and engineers in continuously improving the reliability of a product in a cost-effective manner.
REFERENCES
Barlow, R. E., and E. M. Scheuer. 1966. Reliability growth during a development testing program. Technometrics 8:53.
Cox, D. R., and P. A. W. Lewis. 1966. The statistical analysis of series of events. New York: John Wiley & Sons.
Crow, L. H. 1974. Reliability analysis for complex repairable systems. Technical report #138, U.S. Army Materiel Systems Analysis Activity, Aberdeen Proving Ground, Aberdeen, MD.
Crow, L. H. 1986. On the initial system reliability. Proceedings of the Annual Reliability and Maintainability Symposium, Las Vegas.
Crow, L. H. 1988. Reliability growth estimation with missing data—II. Proceedings of the Annual Reliability and Maintainability Symposium, Los Angeles.
Duane, J. T. 1964. Learning curve approach to reliability monitoring. IEEE Transactions on Aerospace 2 (2): 563.
General Electric Company. 1973. Research study of radar reliability and its impact on lifecycle costs for the APQ-113, -119, -120, -144 radars, Utica, NY.
Hobbs, G. K. 1990. Highly accelerated life tests—HALT. Westminster, CO: Hobbs Engineering Corporation.
Jelinski, Z., and P. B. Moranda. 1972. Software reliability research. In Statistical computer performance evaluation, ed. W. Freiberger. New York: Academic Press.
LeStrange, J. 1990. Failure analysis laboratory. Litton Amecon briefing to the University of Maryland Reliability Engineering Program.
Littlewood, B., and J. L. Verrall. 1973. A Bayesian reliability growth model for computer software. Record IEEE Symposium on Computer Software Reliability, New York.
Lloyd, D. K., and M. Lipow. 1962. Reliability: Management methods and mathematics. Englewood Cliffs, NJ: Prentice Hall.
Pecht, M. 1991. Handbook of electronic package design. New York: Marcel Dekker.
Raheja, D. G. 1990. Assurance technologies: Principles and practices. New York: McGraw–Hill.
Schinner, C. 1988. The board electronic STRIFE test (B.E.S.T.) program. Reliability Review 8:3.
Seusy, C. J. 1987. Achieving phenomenal reliability growth. Proceedings of the ASM Conference on Reliability—Key to Industrial Success, Los Angeles, CA, 1987.
Singpurwalla, N. 1978. Estimating reliability growth (or deterioration) using time series analysis. MIL-HDBK-189, appendix D.
U.S. Air Force, Army, and Navy. 1989. The TAAF process, appendix C—Uncertainty of MTBF and growth estimates. HQ AMC/QA, OASN S&L, HQ USAF/LE-RD, C1–C3.
Wolman, W. 1963. Problems in system reliability analysis. In Statistical theory in reliability. Madison: University of Wisconsin Press.
CHAPTER 14
Logistics Support Robert M. Hecht
CONTENTS
14.1 Introduction
14.2 Logistics Elements
14.3 Influence of Reliability on Logistics Resources
14.3.1 Reliability, Maintenance Rates, and Expected Demand for Logistics Resources
14.3.1.1 False Alarm Rate (FAR)
14.3.1.2 Cannot Duplicate (CND) Rate
14.3.1.3 Probability of Fault Detection (DET)
14.3.1.4 Probability of Fault Isolation (ISO)
14.3.1.5 Maintenance Action Rate (MAR)
14.3.1.6 Demand Rate (DEM)
14.3.1.7 Mean Downtime (MDT)
14.3.2 Supply Support—Provisioning of Repair Parts and Consumables
14.3.2.1 Optimal Reorder Quantity
14.3.2.2 Spares' Availability and Provisioning
14.3.2.3 Provisioning a Product Composed of Replaceable Parts
14.3.2.4 Spares' Optimization
14.3.3 Manpower and Personnel—Staffing Levels
14.3.4 Support and Test Equipment—Utilization and Productivity
14.4 Repair Level Analysis
14.5 Summary
References
14.1 INTRODUCTION
All consumers expect that a purchased product will not fail or otherwise malfunction during its expected service life. Furthermore, if a failure or performance degradation does occur, the typical consumer expects prompt repair or replacement at a reasonable cost when compared to the purchase price of the product. In addition, the consumer wants a product that requires little or no preventive or scheduled maintenance, in order to minimize both the cost of ownership and unavailability of the product. In the civilian world, industry has responded by improving the reliability and durability of products and by eliminating, or at least minimizing, preventive maintenance (PM) requirements. Examples of reduced PM in the automotive field include the use of electronic, pointless ignition products and lubricated-for-life bearings. The service life of many products (e.g., front wheel bearing/constant velocity joint assemblies and exhaust/emission products) has also improved to the point that the consumer may not need to replace these costly products during the typical ownership period. Ultimately, servicing or repairs will be required, and maintenance of some sort must be provided by an automobile dealer or an independent shop. The customer would prefer to wait while the servicing is done; if that is not possible, the automobile should be ready the same day or, at most, the next day. To accomplish this, the dealer or repair shop needs trained personnel and appropriate facilities, tools, test equipment, technical data, and repair parts. For fast turnaround, these logistical assets must be pre-positioned. To achieve this, the manufacturer must have invested time and money during the design and development of the product. The planning, acquisition, and positioning of the resources necessary to effect the repair or replacement of a product are termed logistics support. 
In order to meet consumer needs and expectations, it is important to develop products from a life-cycle perspective. Reliability, maintainability, and effectiveness must be considered by designers and program managers at the initiation of design and development if products are to meet the fundamental performance needs of the consumer cost effectively. However, consumer satisfaction cannot be completely fulfilled without addressing logistics and product support capability and properly integrating them with the application-oriented aspects of the product. Reliability, maintainability, and effectiveness requirements must be applied not only to the prime equipment and applicable software, but also to the acquired resources that comprise the logistics support elements. Integrated logistics support (ILS) applied to products constitutes a life-cycle approach to maintenance and support. ILS is an integral part of all aspects of product planning, design, and development; testing and evaluation; production and construction; utilization; and retirement. Elements of logistics support that concern the developer include the maintenance plan; supply support; product support; packaging, handling, storage, and transportation; manpower and personnel; training and training support; facilities; technical data; computer resources support; and design interface. Blanchard (1992) provides a broad overview of the life-cycle aspects of logistics support. This chapter discusses the influence of reliability on logistics
support requirements, emphasizing how the reliability of a product, equipment, or assembly influences the need for spares or repair parts, support equipment, and maintenance personnel.
14.2 LOGISTICS ELEMENTS
Logistics support encompasses the planning and management, design and development, and acquisition and positioning of the resources necessary to ensure the effective and economical support of a product throughout its programmed life cycle. The elements of logistics support must be integrated with all other segments of the product in order to ensure that both operational and cost requirements are met. There are several major elements of logistics support. The maintenance plan includes all planning and analysis for the overall support of a product throughout its life cycle. This process is formalized through logistics support analysis (LSA) and repair level analysis (RLA)—or level of repair analysis—and documented in the logistics support analysis record. The LSA is an iterative process that depends upon inputs from reliability and maintainability predictions, failure modes, effects, and criticality analysis. The results of these analyses, coupled with design reviews and audits conducted by logistics engineers, are used to develop the maintenance concept for each product, subassembly, and assembly, and to identify the logistics resources required to support the product if it is deemed replaceable, repairable, or both by the RLA. Like a life-cycle cost analysis, the RLA considers all costs associated with supporting a product over its life cycle. The RLA is an economic evaluation of the cost benefit of repairing or discarding the product, or its constituent components, at specified levels of maintenance. The levels typically considered are the organizational level or on-site, by-the-user level (O-level)—at which, typically, assemblies can only be removed and replaced; the intermediate level (I-level), which can be either an on- or off-site repair facility with a limited capability to repair assemblies and subassemblies; and the depot level (D-level) or remanufacturer, which has the capability to rework or refurbish subassemblies. 
At each of these maintenance levels, RLA is used to calculate the costs associated with the repair or discard of the candidate product. The results of the RLA are fed back to the LSA so that the final LSA reflects the maintenance concept and logistics resources that must be developed or procured to support the product over its life cycle. Supply support involves all spares (e.g., units, assemblies, modules), repair parts, consumables, special supplies, and related inventories needed to support prime application-oriented product, software, testing and support of the product, transportation and handling of the product, training equipment, and facilities. Supply support could also encompass provisioning documentation, procurement functions, warehousing, and the distribution of material, as well as the personnel associated with the acquisition and maintenance of spare and repair part inventories at all applicable locations. Considerations include each maintenance level and each geographical location where spare and repair parts are distributed and stocked, spares’ demand rates and
inventory levels, distances between stockage points, procurement lead times, and methods of material distribution. Support and test equipment (STE) includes all tools, special-condition monitoring equipment, diagnostic and checkout equipment, metrology and calibration equipment, maintenance stands, and servicing and product handling required to support scheduled and unscheduled maintenance of the product. Test and product support requirements must be addressed at each level of maintenance, as well as the overall requirements for test traceability to a primary or secondary standard. Test and product support may be classified as peculiar (newly designed or off-the-shelf products unique in the user inventory to the product under development) or common (existing products already in the user inventory). Test and product support is also classified into special purpose (designed specifically to support the product under development) or general purpose (typically, off-the-shelf product tests to support end products, in addition to the product under development, without the need for modification). Packaging, handling, storage, and transportation encompass all special provisions, containers (reusable and disposable), and supplies to support packaging, preservation, storage, handling, or transportation of prime product, test and product support, spares and repair parts, personnel, technical data, and mobile facilities. This element involves both the initial distribution of products and the transportation of personnel and materials for maintenance purposes. Manpower and personnel include the personnel required for the installation, checkout, operation, handling, and sustaining maintenance of the product and its associated test and product support. Personnel requirements are identified in terms of quantity and skill levels for each operation and maintenance function by level of support and geographical location. 
Training and training support involve initial training to familiarize personnel with the product as well as replenishment training to compensate for attrition and the development of replacement personnel. Training is designed to upgrade assigned personnel to the skill levels defined for the product. Training support also includes those aids (e.g., simulators, mock-ups, special products, software) developed to support personnel training operations. Facilities are all physical locations needed to operate the product and to perform maintenance functions at each level: physical plants, real estate, portable buildings, housing, intermediate maintenance shops, calibration laboratories, and special depot repair and overhaul facilities. Capital equipment and utilities (e.g., heat, power, energy requirements, environmental controls, communications) are generally included as part of facilities. Technical data include the installation and checkout procedures, operating and maintenance instructions, inspection and calibration procedures, overhaul procedures, modification instructions, facilities information, drawings, and specifications necessary to perform product operation and maintenance functions. Such data cover not only the prime application equipment, but also testing and support equipment, transportation and handling equipment, training equipment, and facilities.
Computer resources support encompasses all computer equipment and accessories, condition monitoring and maintenance diagnostic aids, software, program tapes, disks, and databases to perform product maintenance functions at each level. Design interface relates logistics design parameters to product readiness, resource requirements, and support cost. These design parameters could include product availability or the attainment of a required product output, compliance with local or national environmental or safety codes or laws, the minimization of the use of energy resources, and the ability of the designed product to be used or easily modified for use in more than one end product.
14.3 INFLUENCE OF RELIABILITY ON LOGISTICS RESOURCES
From the perspective of the logistician, reliability translates into a demand for logistics resources; maintainability translates into the range of logistics resources required to support the operation of the prime product and the length of time during which specific logistics resources (e.g., personnel or product support) are dedicated to a single repair action. The interaction of reliability and maintainability results in the need for logistics assets to maintain a level of operational readiness or availability over the time desired by the user. The equation for operational availability was given by
\[ A_0 = \frac{\text{MTBM}}{\text{MTBM} + \text{MDT}} \tag{14.1} \]
In this chapter, we will further discuss the term MDT (mean downtime) by examining how the mean time to repair (MTTR) and supply response time affect it. We will also examine how reliability affects supply support, provisioning, and the utilization of product support and maintenance personnel.
14.3.1 Reliability, Maintenance Rates, and Expected Demand for Logistics Resources
Low reliability creates an increased demand for logistics resources. The arrival rate is the demand rate that a product or population of products places on its logistics support. For a product for which the reliability is represented by a homogeneous Poisson process (i.e., a constant hazard rate), the general demand rate equation is given by
\[ \text{Demand Rate} = (\text{Number of Units in Use}) \times (\text{Maintenance Actions per Unit Time per Unit}) \tag{14.2} \]
The number of maintenance actions per unit time is a function of the relevant, chargeable failures of the product, as well as removals due to deficient built-in
testing or product monitoring (false alarms and false fault indications), inadequate diagnostic procedures that result in the unnecessary removal of a functional unit, or the removal of one assembly in order to gain access to another (i.e., irrelevant, nonchargeable failures). Figure 14.1 illustrates the potential impact of these incidents.
[Figure 14.1: flow diagram. A detected fault branches into "BIT indication" and "No BIT indication." Each branch separates CNDs (false alarms, FA%) from true failures (1-FA%, TF%); true failures are either detected (DET%) or not detected (1-DET%), and detected failures are either correctly isolated (ISO%) to 1, 2, ..., m RIs or not isolated (1-ISO%).]
Figure 14.1 Relationships of FAs, CNDs, and fault isolation to the maintenance of a replaceable item. Abbreviations: BIT, built-in test; CND, cannot duplicate; ISO, isolation; FA, false alarm; TF, true failures; DET, detection; RI(s), repairable item(s).
Some key terms associated with maintenance actions include:
• false alarm (FA);
• retest okay (RETOK);
• cannot duplicate (CND); and
• no fault found (NFF).
False alarms are indications, usually by built-in test equipment (BITE), that something is wrong with the monitored product, even though the operator does not perceive any degradation in performance. If the operator does report a problem, a maintainer may attempt to duplicate the problem, and the attempt may end as a CND or an NFF. From the reliability engineer’s viewpoint, a failure may not have occurred. However, from the perspective of the logistician, logistics resources have been expended. As shown in Figure 14.1, if the operator detects a fault and requests assistance from a technician (O-level), one of two events may occur. Either the maintainer can duplicate a fault condition and conduct diagnostics to fault-isolate to a “potentially” failed replaceable or repairable item (RI), or the maintainer may be unable to find or duplicate the fault condition. Several outcomes can result from a repair action. The maintainer could complete the repair or removal and replacement and conduct a checkout procedure. The result could be a fully functional product (successful repair) or the recurrence of the original fault condition (if the wrong component was replaced). If the parts used to effect the repair are themselves repairable, the failed parts will be shipped to a higher level maintenance facility (I- or D-level) or, perhaps, to the manufacturer. If two or more repairable items were replaced, it is likely that at least one was in functional working order (a RETOK event). This is especially true for electronics. In some cases, a product shipped to a repair facility may be found to be functional and meet minimum performance requirements, but the maintenance activity may require that it be refurbished to restore or increase the usable service life. To satisfy requirements at the I- or D-level, logistics support must again provide the required resources.
If the O-level maintainer replaces one or more components and the product remains faulty, logistics support must still provide the same resources as though a true failure had occurred. Costs (time and money) were expended at the O-level, and perhaps at the I- and D-levels if repairable assemblies were replaced, but the affected product remains not fully functional. The resolution of the fault may require the use of supplemental diagnostic procedures or more experienced O-level maintainers; the assistance, on-site or remotely, of I- or D-level technicians or manufacturer’s field representatives; or the shipment of the product to a higher maintenance level. Due to differences in defining what incidents or events constitute an FA, a CND, or a RETOK, only two terms, the FA and the CND, are discussed here. The first, the false alarm rate (FA), covers indications by BITE or some other form of remote monitoring that something is wrong with the product, although the operator cannot perceive a problem. The second, the cannot duplicate rate (CND), encompasses all nonoperational and shop maintenance actions during which a fault cannot be identified.
14.3.1.1 False Alarm Rate (FAR)
The FAR is defined as the number of FA actions divided by the total number of fault indications at the user (operator) level for a given product. Generally, FA incidents are considered only at the user level for systems or subsystems that employ some form of BITE or other means of automated or semiautomated monitoring. The FAR can be expressed as a percentage or as a decimal equivalent. Hence,
\[ \text{FAR} = \frac{\text{Number of FA Actions}}{(\text{Number of FA Actions}) + (\text{Number of True Malfunctions})} = \frac{\text{Number of FA Actions}}{\text{Total Number of User Fault Indications}} \tag{14.3} \]
and
\[ 1 - \text{FAR} = \frac{\text{Number of True Malfunctions}}{\text{Total Number of User Fault Indications}} \tag{14.4} \]
Assume that FA incidents do not result in a request for maintenance and thus do not impact logistics support.
14.3.1.2 Cannot Duplicate (CND) Rate
CND incidents can occur at any level of the product hierarchy (i.e., system, subsystem, assembly, subassembly, module) but can only be incurred by maintainers. The CND rate for a product at a given maintenance level is given by
\[ \text{CND} = \frac{\text{Number of Maintenance Actions for which No Fault is Found}}{(\text{Number of Maintenance Actions for which No Fault is Found}) + (\text{Number of True Failures})} = \frac{\text{Number of Maintenance Actions for which No Fault is Found}}{\text{Total Number of Maintenance Actions}} \tag{14.5} \]
and
\[ 1 - \text{CND} = \frac{\text{Number of True Detected Faults}}{\text{Total Number of Maintenance Actions}} \tag{14.6} \]
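The two ratios in Equations 14.3 and 14.5 can be sketched in a few lines of Python. This is only an illustration: the function names and the counts are hypothetical and do not come from the handbook.

```python
def false_alarm_rate(fa_actions: int, true_malfunctions: int) -> float:
    """Equation 14.3: FA actions over all user-level fault indications."""
    total_indications = fa_actions + true_malfunctions
    if total_indications == 0:
        raise ValueError("no fault indications recorded")
    return fa_actions / total_indications


def cannot_duplicate_rate(no_fault_found: int, true_failures: int) -> float:
    """Equation 14.5: no-fault-found actions over all maintenance actions."""
    total_actions = no_fault_found + true_failures
    if total_actions == 0:
        raise ValueError("no maintenance actions recorded")
    return no_fault_found / total_actions


# Hypothetical counts, chosen only to illustrate the arithmetic.
far = false_alarm_rate(fa_actions=12, true_malfunctions=48)
cnd = cannot_duplicate_rate(no_fault_found=9, true_failures=36)
print(f"FAR = {far:.2f}, 1 - FAR = {1 - far:.2f}")
print(f"CND = {cnd:.2f}, 1 - CND = {1 - cnd:.2f}")
```

Keeping the two functions separate mirrors the text: FAR is counted at the user level, whereas CND is counted per maintenance level.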
14.3.1.3 Probability of Fault Detection (DET)
An area of concern to both the user and the logistician is the inability of BITE or the user to detect failures. How is the product logistics support impacted when neither
the operator nor BITE detects all failures? Should the logistician be concerned? If the current and future effects of the failure and subsequent failure of other assemblies or components caused by potential overstress are totally benign, the logistics support system may never plan for or address (repair) failures of this type. If the latent failed product is detected, isolated, and replaced, it will probably be during a maintenance action for an overt failure. Detection and isolation of the latent failed product is a result of the tighter tolerance windows of intermediate or depot support products. For failure modes undetected by BITE that result in undesirable effects, the product logistics support must provide an alternate means of detection and correction. This means that the logistician’s plan must provide the operator with the capability to detect a non-BITE failure mode of the product. In some cases, the logistician may be required to procure and field a special-purpose test product to conduct pre- or postoperational checks. The DET is given by
\[ \text{DET} = \frac{\text{Number of Detected True Failures}}{\text{Total Number of True Failures}} \tag{14.7} \]
14.3.1.4 Probability of Fault Isolation (ISO)
As shown in Figure 14.1, a term called the probability of fault isolation (ISO) is associated with how the system, subsystem, equipment, or assembly design permits fault isolation of one or more lower level elements for a given percentage of maintenance actions. For a given assembly,
\[ \text{ISO}_n = \frac{\text{Number of Maintenance Actions Isolated to } n \text{ Components}}{\text{Total Number of Maintenance Actions}} \tag{14.8} \]
For a given product, ISO can also be stated in terms of the number of removals or maintenance actions on that product due to inclusion in an ambiguity group for which the product did not fail, divided by the total number of maintenance actions:
\[ \text{ISOL}_i = \frac{\text{Number of (Nonfailure) Removals or Maintenance Actions for Item } i \text{ due to Inclusion in an Ambiguity Group}}{\text{Total Number of Maintenance Actions for Item } i} \tag{14.9} \]
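As a rough illustration of Equation 14.8, the sketch below tallies, for a hypothetical maintenance log, the fraction of actions isolated to each ambiguity-group size n. The log values and the function name are assumptions made for this example only.

```python
from collections import Counter


def isolation_fractions(group_sizes):
    """Equation 14.8: fraction of maintenance actions isolated to n items.

    group_sizes holds one entry per maintenance action: the number of
    items to which the fault was isolated in that action.
    """
    if not group_sizes:
        raise ValueError("no maintenance actions recorded")
    counts = Counter(group_sizes)
    total = len(group_sizes)
    return {n: counts[n] / total for n in sorted(counts)}


# Hypothetical log of ten maintenance actions.
log = [1, 1, 1, 1, 1, 1, 2, 2, 2, 3]
iso = isolation_fractions(log)
for n, fraction in iso.items():
    print(f"ISO_{n} = {fraction:.1f}")
```

The fractions sum to one across all group sizes because every maintenance action in the log is counted exactly once.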
In terms of design for testability, ISO and ISOL define the size of ambiguity groups. From the viewpoint of the logistician and maintainer, the maintainer may have to remove one or more subassemblies, incurring longer repair time (and more downtime). Also, the logistician must provide more spare products or assemblies and allow longer utilization of product support and manpower than if the design permitted fault isolation to one product or assembly 100% of the time. The removal and replacement of ambiguity groups may also affect the CND rates of the removed components. For example, assuming that only one component can fail at a time, consider three printed wiring boards (PWBs) identified as possibly
causing the failure of a transmitter. Unless the maintainer or shop has known good subassemblies, replacement PWBs would have to be drawn from supply or ordered from a higher level of maintenance or the manufacturer. The maintainer could order one PWB at a time, remove the existing installed PWB, and install the new PWB to see if it corrected the problem. If not, the maintainer could order the next PWB on the list and either leave the new PWB in place or reinstall the old one. If the repair action was successful, the failed product would be subject to repair or discard, in accordance with the standing maintenance plan. Alternatively, the maintainer could order all three PWBs at the same time and conduct the fault isolation by substitution. In this case, the maintainer would have to send two good PWBs and one bad PWB back to the supply system or manufacturer. Therefore, the removal of a good (nonfailed) product as a result of its inclusion in an ambiguity group may become a CND at the next maintenance level.
14.3.1.5 Maintenance Action Rate (MAR)
The MAR is defined as the number of maintenance actions per operating unit per unit time. In general, the MAR is given by
\[
\begin{aligned}
\text{MAR} &= (\text{Number of True Detected Failures per Unit Time}) + (\text{Number of Cannot Duplicate Actions per Unit Time}) \\
&\quad + (\text{Number of Removal Actions due to Ambiguities per Unit Time}) \\
&= \text{DET} \times \lambda_L + \text{CND} \times \text{MAR} + \text{ISOL} \times \text{MAR} \\
&= \frac{\text{DET} \times \lambda_L}{1 - \text{CND} - \text{ISOL}}
\end{aligned}
\tag{14.10}
\]
where CND is the probability of a maintenance action being a CND, and ISOL is the probability that the maintenance action is a result of the product being part of an ambiguity group and that the product has not failed. In the preceding equation, \(\lambda_L\) is the series or logistics failure rate. False alarms are not considered because the term addresses maintenance rather than failure or fault indications. ISOL actions at one level of maintenance (say, where a subassembly is removed) may become CND actions at the next level of maintenance. Then,
\[ \text{MAR} = \frac{\text{DET} \times \lambda_L}{1 - \text{CND}} \tag{14.11} \]
If all removals or maintenance actions are true failures or, in the case of provisioning, if a product is removed and shipped to the next higher level of maintenance without further checkout to determine whether it is at fault, then the CND becomes equal to zero and the MAR is given by
\[ \text{MAR} = \text{DET} \times \lambda_L \tag{14.12} \]
As defined, the CND and ISOL are not statistically independent and cannot both be equal to one.
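A minimal sketch of the MAR relationship follows, assuming the closed form MAR = DET × λ_L / (1 − CND − ISOL) of Equation 14.10. The function name and every numerical input are hypothetical; they only show how the CND and ISOL terms inflate the maintenance action rate above the detected true-failure rate.

```python
def maintenance_action_rate(det, failure_rate, cnd=0.0, isol=0.0):
    """Equation 14.10 in closed form: DET * lambda_L / (1 - CND - ISOL)."""
    if not 0.0 <= det <= 1.0:
        raise ValueError("DET must be a probability")
    if cnd < 0.0 or isol < 0.0 or cnd + isol >= 1.0:
        raise ValueError("CND and ISOL must be non-negative and sum to less than 1")
    return det * failure_rate / (1.0 - cnd - isol)


# Hypothetical inputs: 95% detection, 200 failures per million operating hours.
det = 0.95
lambda_l = 200e-6

mar_full = maintenance_action_rate(det, lambda_l, cnd=0.15, isol=0.05)
mar_cnd_only = maintenance_action_rate(det, lambda_l, cnd=0.15)  # form of Eq. 14.11
mar_true_only = maintenance_action_rate(det, lambda_l)           # reduces to DET * lambda_L

print(f"MAR with CND and ISOL: {mar_full:.3e} per operating hour")
print(f"MAR with CND only:     {mar_cnd_only:.3e} per operating hour")
print(f"MAR, true failures:    {mar_true_only:.3e} per operating hour")
```

The guard on CND + ISOL reflects the remark that the two probabilities cannot both equal one: the denominator must stay positive for the rate to be defined.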
Redundancy is not normally considered when computing the demand for logistics resources because a failure in a redundant product needs to be corrected at some point, if not immediately. For most systems with redundant products, when one of the redundant products fails, repair is either immediately initiated or done at a time when it does not impede operations. However, for heavily redundant products (m of n must function), repair may be delayed until a fixed number of redundant products fails or a scheduled (periodic) maintenance event occurs. In the latter case, the product logistics support must still provide the same number of parts (at the same time), the same technical data, the same product support, and the same maintainers as for immediate repair. However, in the m of n redundancy case with delayed maintenance, the logistician may be required to restore the product to full functionality within the same time limit as if only one redundant subsystem had failed. This time limit may require the logistician to plan for and provide more maintainers and product support than would be required if the failures were repaired as they occurred.
14.3.1.6 Demand Rate (DEM)
For a given time period (\(T_L\)), the maintenance action rate can be converted into the absolute, expected number of maintenance actions. Following Equation 14.2, the expected number of maintenance actions in time period \(T_L\), DEM, is given by
\[ \text{DEM} = \text{MAR} \times T_L \tag{14.13} \]
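Equation 14.13 is a single multiplication, but a short sketch makes the unit bookkeeping explicit. The function name and the values below are hypothetical; the only assumption is that MAR and the time period are expressed in the same time unit (here, operating hours).

```python
def expected_demand(mar, period):
    """Equation 14.13: DEM = MAR * T_L, with both terms in the same time unit."""
    if mar < 0 or period < 0:
        raise ValueError("MAR and the time period must be non-negative")
    return mar * period


# Hypothetical values: MAR per operating hour, period in operating hours.
mar_per_op_hour = 2.5e-4
t_l_op_hours = 6000.0

dem = expected_demand(mar_per_op_hour, t_l_op_hours)
print(f"Expected maintenance actions in the period: {dem:.2f}")
```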
The units of time, \(T_L\), must be consistent with the units associated with the term MAR. If MAR is stated in terms of failures, removals, or maintenance actions per operating hour, then \(T_L\) must also be in units of operating hours; if MAR is in units of flight hours, \(T_L\) must be also. For example, to determine the expected demand for a consumable part at an operational (organizational) site, the logistician may define \(T_L\) as
\[ T_L = K_u \times N_{\text{sys}} \times \text{OPHRS} \times \text{RESUP} \tag{14.14} \]
where
\(K_u\) is the utilization factor, a conversion factor from one unit of operating time (e.g., flight hours) to another (e.g., operating hours);
\(N_{\text{sys}}\) is the number of products being supported;
OPHRS is the number of operating or flight hours per unit of calendar time (e.g., per day); and
RESUP is the resupply time (e.g., days), the period of time required to order and receive a part from an off-site source.
The utilization factor (\(K_u\)) permits the equation to account for products that may be operated at a different rate from the parent product containing an operating time-recording device (e.g., an elapsed time indicator). For example, in addition to the avionics operating while the aircraft is airborne, they may also be in operation for preflight and postflight checkout. Alternatively, some avionics systems may only be
operated for a brief period during an extended application. The utilization factor can also be used to account for operating time during repair actions and to convert different application scenarios to a common time base.
14.3.1.7 Mean Downtime (MDT)
The concept of a time period associated with the number of demands for logistics resources can be extended to a more generalized expression that will be employed in supply support. The generalized term is commonly referred to as MDT or mean logistics downtime (MLDT); this chapter uses MDT. MDT is the expected time for a response from the product logistics support. For example, the MDT for a product that can be repaired at the user site can be given by
\[ \text{MDT} = P_{os} \times [P_{sos} \times (\text{MTTR}_{os} + \text{MADM}_{os}) + (1 - P_{sos}) \times (\text{MTTR}_{os} + \text{MADM}_{os} + \text{RESUP})] + (1 - P_{os}) \times \text{TAT}_{off} \tag{14.15} \]
where
\(P_{os}\) is the probability that the necessary repair can be accomplished on-site;
\(P_{sos}\) is the probability that the necessary spare parts are on-site, given that a repair could be accomplished on-site;
\(\text{MTTR}_{os}\) is the mean time to repair for on-site repair;
\(\text{MADM}_{os}\) is the mean administrative downtime for on-site repairs, including the time to obtain the repair parts from on-site storage;
RESUP is the average time required to obtain the longest lead time part from an off-site source; and
\(\text{TAT}_{off}\) is the turnaround time to have the product shipped and repaired, using an off-site repair facility.
When the product under consideration cannot be repaired on-site, the MDT equation is given by
\[ \text{MDT} = 0.0 \times [P_{sos} \times (\text{MTTR}_{os} + \text{MADM}_{os}) + (1 - P_{sos}) \times (\text{MTTR}_{os} + \text{MADM}_{os} + \text{RESUP})] + (1 - 0.0) \times \text{TAT}_{off} = \text{TAT}_{off} \tag{14.16} \]
Example 14.1 illustrates the calculation of an MDT for different repair scenarios. As will be discussed in Section 14.3.2.2, the equation for MDT can be tailored for use in provisioning and determining supply support requirements.
EXAMPLE 14.1 MEAN LOGISTICS DOWNTIME
Farmer Brown has a tractor that he repairs himself in order to save money. He can make most repairs to the tractor because he has a fairly well-equipped shop, but due to the high cost of parts, he does not keep many spare parts on hand.
He lives in a very remote area that is far from tractor dealers, so he orders any needed parts by telephone and has the parts shipped. Let the probability that the repair can be accomplished on-site, \(P_{os}\), be 0.90; the probability that the necessary spare parts are on-site, given that a repair could be accomplished on-site, \(P_{sos}\), be 0.40; the mean time to repair for on-site repair, \(\text{MTTR}_{os}\), be 6.5 h; the mean administrative downtime for on-site repairs, \(\text{MADM}_{os}\), be 0.5 h; the average time required to obtain the longest lead time part from tractor dealers, RESUP, be 2.5 days; and the turnaround time to have the tractor shipped and repaired at an off-site repair facility, \(\text{TAT}_{off}\), be 0.5 months. If the tractor malfunctions, what is the average time it will be unavailable for use? Inserting these data into Equation 14.15,
\[
\begin{aligned}
\text{MDT} &= 0.9 \times [0.4 \times (6.5\ \text{h} + 0.5\ \text{h}) + (1 - 0.4) \times (6.5\ \text{h} + 0.5\ \text{h} + 2.5\ \text{days} \times 24\ \text{h/day})] \\
&\quad + (1 - 0.9) \times (0.5\ \text{mo} \times 30\ \text{days/mo} \times 24\ \text{h/day}) \\
&= 74.7\ \text{h} = 3.1\ \text{days}
\end{aligned}
\tag{14.17}
\]
Thus, whenever Farmer Brown’s tractor is malfunctioning, he can expect the tractor to be unavailable for 3.1 days, on average. Let us say that Farmer Brown has the money to buy every part he may ever need to repair the tractor. The term \(P_{sos}\) then becomes equal to 1.00 and the MDT is given by
\[
\begin{aligned}
\text{MDT} &= 0.9 \times [1.0 \times (6.5\ \text{h} + 0.5\ \text{h}) + (1 - 1.00) \times (6.5\ \text{h} + 0.5\ \text{h} + 2.5\ \text{days} \times 24\ \text{h/day})] \\
&\quad + (1 - 0.9) \times (0.5\ \text{mo} \times 30\ \text{days/mo} \times 24\ \text{h/day}) \\
&= 42.3\ \text{h} = 1.8\ \text{days}
\end{aligned}
\tag{14.18}
\]
Instead of the tractor being unavailable for 3 days, it is unavailable for just under 2 days. Is stocking every part a good investment for Farmer Brown? Let us suppose that Farmer Brown’s tractor exhibits a mean time between maintenance (MTBM) of 75 days and an MDT of 1.8 days; what is his operational availability?
\[ A_0 = \frac{\text{MTBM}}{\text{MTBM} + \text{MDT}} = \frac{75\ \text{days}}{75\ \text{days} + 1.8\ \text{days}} = 0.976 \tag{14.19} \]
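The arithmetic of Example 14.1 can be checked with a short script. This is a sketch under stated assumptions: the function and variable names are illustrative, all times are converted to hours, and a month is taken as 30 days, as in Equation 14.17.

```python
def mean_downtime(p_os, p_sos, mttr_os, madm_os, resup, tat_off):
    """Equation 14.15, with every time expressed in the same unit (hours here)."""
    parts_on_hand = mttr_os + madm_os
    parts_ordered = mttr_os + madm_os + resup
    on_site = p_sos * parts_on_hand + (1.0 - p_sos) * parts_ordered
    return p_os * on_site + (1.0 - p_os) * tat_off


def operational_availability(mtbm, mdt):
    """Equation 14.1: A0 = MTBM / (MTBM + MDT)."""
    return mtbm / (mtbm + mdt)


HOURS_PER_DAY = 24.0
resup_h = 2.5 * HOURS_PER_DAY           # 2.5 days of resupply time
tat_off_h = 0.5 * 30.0 * HOURS_PER_DAY  # 0.5 month, assuming 30 days per month

# Few spares on hand (P_sos = 0.40) versus every spare on hand (P_sos = 1.00).
mdt_few_spares = mean_downtime(0.90, 0.40, 6.5, 0.5, resup_h, tat_off_h)
mdt_all_spares = mean_downtime(0.90, 1.00, 6.5, 0.5, resup_h, tat_off_h)

# Availability with MTBM = 75 days, using the unrounded MDT in hours.
a0 = operational_availability(75.0 * HOURS_PER_DAY, mdt_all_spares)

print(f"MDT, few spares: {mdt_few_spares:.1f} h")
print(f"MDT, all spares: {mdt_all_spares:.1f} h")
print(f"A0 with all spares stocked: {a0:.3f}")
```

Because the script carries the unrounded MDT rather than 1.8 days, its availability figure can differ from the hand calculation in the third decimal place.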
14.3.2 Supply Support—Provisioning of Repair Parts and Consumables
Harris (1915) is generally credited with the earliest published derivation of an inventory model. Raymond (1931) published the first textbook dedicated to the subject. World War II prompted the military to support studies in this area in an effort to optimize the procurement and stocking of spare (black boxes) and consumable repair parts, subject to such constraints as cost, weight, volume, or combined factors. Retailers and manufacturers embraced inventory theory in the 1950s as a means to
reduce costs and overstockages while at the same time minimizing outstockages and backorders. In this section, the discussion of inventory models will be limited to economic order quantity (EOQ), subject to no-shortage and shortage-allowed conditions, and to provisioning in terms of modeling a standby redundant product. The EOQ problem uses a deterministic approach for its solution, while the standby model adopts a stochastic approach. The emphasis here will be on the effect of reliability on the demand for supply support. The reader interested in inventory models is referred to Sivazlian and Stanfel (1975), Hillier and Lieberman (1970), and Goldman and Slattery (1967).
14.3.2.1 Optimal Reorder Quantity
One of the simplest inventory models considers a case that assumes that a product is drawn from inventory at a constant demand rate (DEM) and that the inventory stock is replenished periodically in equal amounts (REPLEN). It also assumes that shortages are not allowed. The costs associated with establishing and maintaining the inventory include:
• SETUP: the cost to set up the inventory at the start of a time period;
• UNIT: the unit production cost; and
• HOLD: the inventory holding cost per unit.
The cost associated with placing an order is given by
\[ \text{COSTORD} = \text{SETUP} + \text{UNIT} \times \text{REPLEN} \tag{14.20} \]
The holding cost per time period is given by
\[ \text{HOLDCOST} = \text{HOLD} \times \int_0^{\text{REPLEN}/\text{DEM}} (\text{REPLEN} - \text{DEM} \times T)\, dT = \frac{\text{HOLD} \times \text{REPLEN}^2}{2 \times \text{DEM}} \tag{14.21} \]
The total cost per time period is given by
\[ \text{COSTPER} = \text{SETUP} + \text{UNIT} \times \text{REPLEN} + \frac{\text{HOLD} \times \text{REPLEN}^2}{2 \times \text{DEM}} \tag{14.22} \]
The total cost per unit of time is given by
\[ \text{TOTCOST} = \frac{\text{COSTPER}}{\text{REPLEN}/\text{DEM}} = \frac{\text{DEM} \times \text{SETUP}}{\text{REPLEN}} + \text{UNIT} \times \text{DEM} + \frac{\text{HOLD} \times \text{REPLEN}}{2} \tag{14.23} \]
The optimal order size, REPLEN*, is found by taking the first derivative of the total cost per unit of time with respect to order size and setting the result equal to zero. The optimal order size is thus
\[ \text{REPLEN}^* = \left( \frac{2 \times \text{DEM} \times \text{SETUP}}{\text{HOLD}} \right)^{1/2} \tag{14.24} \]
The average or expected time required to expend the quantity of parts given previously, TIME*, is TIME* REPLEN* /DEM {2 r SETUP/ (HOLD r DEM)}1/ 2
(14.25)
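Consider the case where shortages are allowed. Let STOCK denote the stock on hand at the beginning of a period. The holding cost per period, HOLDCOST, is given by

$$\mathrm{HOLDCOST} = \mathrm{HOLD} \times \mathrm{STOCK}^{2} / (2 \times \mathrm{DEM}) \tag{14.26}$$

For a given shortage penalty cost, SHOR$, the shortage cost per period is given by

$$\mathrm{SHORT} = \mathrm{SHOR\$} \times (\mathrm{REPLEN} - \mathrm{STOCK})^{2} / (2 \times \mathrm{DEM}) \tag{14.27}$$

The total cost per period is found from the following equation:

$$\mathrm{COSTPER} = \mathrm{SETUP} + \mathrm{UNIT} \times \mathrm{REPLEN} + \mathrm{HOLD} \times \mathrm{STOCK}^{2} / (2 \times \mathrm{DEM}) + \mathrm{SHOR\$} \times (\mathrm{REPLEN} - \mathrm{STOCK})^{2} / (2 \times \mathrm{DEM}) \tag{14.28}$$

The total cost per unit of time is obtained by dividing each term by the period length, REPLEN/DEM:

$$\mathrm{TOTCOST} = \mathrm{COSTPER} / (\mathrm{REPLEN}/\mathrm{DEM}) = \mathrm{DEM} \times \mathrm{SETUP}/\mathrm{REPLEN} + \mathrm{UNIT} \times \mathrm{DEM} + \mathrm{HOLD} \times \mathrm{STOCK}^{2} / (2 \times \mathrm{REPLEN}) + \mathrm{SHOR\$} \times (\mathrm{REPLEN} - \mathrm{STOCK})^{2} / (2 \times \mathrm{REPLEN}) \tag{14.29}$$

In order to find the optimal reorder size, REPLEN*, and optimal initial stock, STOCK*, the first partial derivatives of TOTCOST with respect to reorder size and initial stock size must be found. These derivatives are set equal to zero, and the two resulting equations are solved simultaneously to yield the following:

$$\mathrm{REPLEN}^{*} = [2 \times \mathrm{DEM} \times \mathrm{SETUP} \times (\mathrm{SHOR\$} + \mathrm{HOLD}) / (\mathrm{HOLD} \times \mathrm{SHOR\$})]^{1/2}$$

$$\mathrm{STOCK}^{*} = \{2 \times \mathrm{DEM} \times \mathrm{SETUP} \times \mathrm{SHOR\$} / [\mathrm{HOLD} \times (\mathrm{SHOR\$} + \mathrm{HOLD})]\}^{1/2} \tag{14.30}$$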
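The average or expected time to expend the reordered parts is given by

$$\mathrm{TIME}^{*} = \mathrm{REPLEN}^{*}/\mathrm{DEM} = [2 \times \mathrm{SETUP} \times (\mathrm{SHOR\$} + \mathrm{HOLD}) / (\mathrm{DEM} \times \mathrm{HOLD} \times \mathrm{SHOR\$})]^{1/2} \tag{14.31}$$

The shortage-time ratio, SHORTIME*, is given by

$$\mathrm{SHORTIME}^{*} = \mathrm{STOCK}^{*}/\mathrm{REPLEN}^{*} = \mathrm{SHOR\$} / (\mathrm{SHOR\$} + \mathrm{HOLD}) \tag{14.32}$$

Because the stock on hand falls from STOCK* to zero at the rate DEM, the ratio STOCK*/REPLEN* is the average fraction of each reorder cycle during which stock is on hand; the fraction of time during which a shortage may exist is its complement, HOLD/(SHOR$ + HOLD).

Example 14.2 illustrates the use of the EOQ equations for the cases of no shortages allowed and shortages allowed.

EXAMPLE 14.2 ECONOMIC ORDER QUANTITY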
You are the inventory manager and buyer for Good-Gas Petroleum Company. Upper management is unhappy about the delays in making deliveries and the costs associated with the regulator valves installed in delivery trucks. You are tasked to determine the optimal number of valves to maintain in inventory and how many to order to minimize the company’s cost. Based on historical data, you determine the following: SETUP = $100; HOLD = $5 per unit; UNIT = $500 per unit; DEM = five units per month. Using Equation 14.24,

$$\mathrm{REPLEN}^{*} = (2 \times \mathrm{DEM} \times \mathrm{SETUP}/\mathrm{HOLD})^{1/2} = [2 \times (5\ \text{units/month}) \times (\$100) / (\$5\ \text{per unit})]^{1/2} = 14.1\ \text{units} \tag{14.33}$$

The average or expected time required to expend the quantity of parts given earlier is given by Equation 14.25:

$$\mathrm{TIME}^{*} = [2 \times \mathrm{SETUP} / (\mathrm{HOLD} \times \mathrm{DEM})]^{1/2} = \{2 \times (\$100) / [(\$5\ \text{per unit}) \times (5\ \text{units/month})]\}^{1/2} = 2.8\ \text{months} \tag{14.34}$$

Upper management suggests that a few deliveries could be delayed if inventory costs were reduced. A value of $10 per shortage is suggested. Using Equation 14.30,

$$\mathrm{REPLEN}^{*} = [2 \times \mathrm{DEM} \times \mathrm{SETUP} \times (\mathrm{SHOR\$} + \mathrm{HOLD}) / (\mathrm{HOLD} \times \mathrm{SHOR\$})]^{1/2} = \{2 \times (5\ \text{units/month}) \times (\$100) \times [(\$10/\text{shortage}) + (\$5/\text{unit})] / [(\$5/\text{unit}) \times (\$10/\text{shortage})]\}^{1/2} = 17.3\ \text{units}$$

$$\mathrm{STOCK}^{*} = \{2 \times \mathrm{DEM} \times \mathrm{SETUP} \times \mathrm{SHOR\$} / [\mathrm{HOLD} \times (\mathrm{SHOR\$} + \mathrm{HOLD})]\}^{1/2} = \left\{2 \times (5\ \text{units/month}) \times (\$100) \times \frac{\$10/\text{shortage}}{(\$5/\text{unit}) \times [(\$10/\text{shortage}) + (\$5/\text{unit})]}\right\}^{1/2} = 11.5\ \text{units} \tag{14.35}$$
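The average or expected time to expend the reordered parts is given by Equation 14.31:

$$\mathrm{TIME}^{*} = \mathrm{REPLEN}^{*}/\mathrm{DEM} = (17.3\ \text{units}) / (5\ \text{units/month}) = 3.5\ \text{months} \tag{14.36}$$

Equation 14.32 gives the ratio of the initial stock to the reorder quantity:

$$\mathrm{SHORTIME}^{*} = \mathrm{STOCK}^{*}/\mathrm{REPLEN}^{*} = (11.5\ \text{units}) / (17.3\ \text{units}) = 0.67 \tag{14.37}$$

That is, valves are on hand for about two thirds of each reorder cycle, and a shortage may exist during the remaining one third.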
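The arithmetic in Example 14.2 can be checked with a few lines of Python. The following is a minimal sketch, assuming the closed-form results of Equations 14.24, 14.25, 14.30, and 14.31; the function names `eoq_no_shortage` and `eoq_with_shortage` are illustrative and do not come from the handbook.

```python
import math

def eoq_no_shortage(dem, setup, hold):
    """Return (REPLEN*, TIME*) per Equations 14.24 and 14.25."""
    replen = math.sqrt(2 * dem * setup / hold)
    return replen, replen / dem

def eoq_with_shortage(dem, setup, hold, shor):
    """Return (REPLEN*, STOCK*, TIME*) per Equations 14.30 and 14.31."""
    replen = math.sqrt(2 * dem * setup * (shor + hold) / (hold * shor))
    stock = math.sqrt(2 * dem * setup * shor / (hold * (shor + hold)))
    return replen, stock, replen / dem

# Example 14.2 inputs: DEM = 5 units/month, SETUP = $100, HOLD = $5, SHOR$ = $10
replen, cycle = eoq_no_shortage(5, 100, 5)
replen_s, stock_s, cycle_s = eoq_with_shortage(5, 100, 5, 10)
print(round(replen, 1), round(cycle, 1))                          # units, months
print(round(replen_s, 1), round(stock_s, 1), round(cycle_s, 1))   # units, units, months
```

Allowing shortages raises the order quantity and lengthens the reorder cycle, which is the trade-off that upper management proposes in the example.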
14.3.2.2 Spares’ Availability and Provisioning

For some enterprises, shortages in on-hand inventory can be tolerated. The shortage unit cost, SHOR$, can be used to represent either lost profit or a cost factor related to both lost profit and loss of customer goodwill. In some operational environments, the cost of spares’ shortage cannot be easily assessed. For this case, a commonly used approach to provisioning modeling is to consider the problem in the context of a product that incorporates standby redundancy. It was previously shown that the cumulative Poisson equation can be used to model a standby redundant product if it is assumed that the failure rate remains constant with respect to time. This is given by

$$P(X \le x) = \sum_{i=0}^{x} \frac{(\lambda \times \mathrm{TIME})^{i} \times \exp(-\lambda \times \mathrm{TIME})}{i!} \tag{14.38}$$

To serve as a provisioning model, Equation 14.38 is rewritten as follows:

$$P_s(X \le S_{in}) = \sum_{i=0}^{S_{in}} \frac{(\mathrm{DEM}_{in})^{i} \times \exp(-\mathrm{DEM}_{in})}{i!} \tag{14.39}$$
where $P_s(X \le S_{in})$ is the probability that the number of demands ($X$) will be equal to or less than the number of spares in the inventory (that is, the probability that a spare will be available if needed); $S_{in}$ is the number of spares in the inventory; and $\mathrm{DEM}_{in}$ is the expected demand during a given period of time. As shown in Equation 14.13, the expected demand can be stated as simply the failure rate of the product ($\lambda$) multiplied by the number of applications ($N_l$) and the average operating time per application (OPTIME), or

$$\mathrm{DEM}_{in} = N_l \times \lambda \times \mathrm{OPTIME} \tag{14.40}$$

Alternatively, Equation 14.40 can be written as a function of a MAR that may include CNDs or maintenance events other than true failures. Then,

$$\mathrm{DEM}_{in} = N_l \times \mathrm{MAR} \times \mathrm{OPTIME} \tag{14.41}$$

The expected demand can also be stated as a function of the product’s failure or maintenance action rate and the MDT:

$$\mathrm{DEM}_{in} = \mathrm{MAR} \times N_l \times \mathrm{UTIL} \times \mathrm{MDT} \tag{14.42}$$
For the supply support case, the MDT is calculated for a spare product and not the supported system or subsystem:

$$\mathrm{MDT}_i = \{P_{os} \times [P_{sos} \times (\mathrm{MTTR}_{os} + \mathrm{MADM}_{os}) + (1 - P_{sos}) \times (\mathrm{MTTR}_{os} + \mathrm{MADM}_{os} + \mathrm{RESUP})] + (1 - P_{os}) \times (\mathrm{TAT}_{off})\} \tag{14.43}$$

In Equation 14.43, $\mathrm{MTTR}_{os}$ represents the mean time required to repair the failed assembly, subassembly, or module on-site; the term RESUP represents the time to obtain a part to repair the ith product, if the required part is not available on-site; and $\mathrm{TAT}_{off}$ represents the time to obtain a replacement product if the failed product cannot be repaired on-site. Example 14.3 illustrates the use of Equations 14.39, 14.42, and 14.43 as a provisioning model.

EXAMPLE 14.3 BASIC PROVISIONING PROBLEM
You are employed by Lightning-Overnite Delivery. As part of your job, you are tasked with provisioning a new fleet of 150 trucks used for deliveries and pickup of small packages. Because a fixed pickup schedule is not followed, each truck is equipped with a two-way radio so that drivers can be instructed to stop at various locations. If the radio becomes inoperable, the affected truck cannot be used. Management wants to know how many spare radios should be stocked in order to be 95% confident that a truck will leave the company’s facility with an operable radio. Data obtained from the manufacturer of the radio and from historical maintenance data in your company files provide the following information: MAR = 0.0002 removals/ophr; $P_{os}$ = 0.10; $P_{sos}$ = 0.50; $\mathrm{MTTR}_{os}$ = 2.5 h; $\mathrm{MADM}_{os}$ = 0.5 h; RESUP = 3 days; $\mathrm{TAT}_{off}$ = 14 days; UTIL = 10 ophrs/day. The MDT due to radio failure is

$$\mathrm{MDT} = 0.10 \times [0.50 \times (2.5\ \text{hrs} + 0.5\ \text{hrs}) + (1 - 0.50) \times (2.5\ \text{hrs} + 0.5\ \text{hrs} + 3\ \text{days} \times 24\ \text{hrs/day})] + (1 - 0.10) \times (14\ \text{days} \times 24\ \text{hrs/day}) = 306.3\ \text{hrs} \tag{14.44}$$

The expected demand during the period MDT is given by

$$\mathrm{DEM} = (0.0002\ \text{removals/ophr}) \times 150\ \text{trucks} \times 1\ \text{radio/truck} \times (10\ \text{ophrs/day} \times 1\ \text{day}/24\ \text{hrs}) \times (306.3\ \text{hrs}) = 3.8\ \text{radio failures} \tag{14.45}$$
The number of spare radios required is given by

$$P_s(X \le S) = 0.95 \le \sum_{i=0}^{S} \frac{(3.8)^{i} \times \exp(-3.8)}{i!} \tag{14.46}$$

The results of iterating the preceding equation are given in Table 14.1.

Table 14.1 Probability of Having a Required Spare

S     Ps
0     0.02
1     0.10
2     0.26
3     0.47
4     0.66
5     0.81
6     0.91
7     0.96

The analysis indicates that the company can expect fewer than four failures, on average, during the time it takes to repair or remove and replace a failed radio at an off-site repair shop and receive a replacement. However, the company must stock seven units in order to satisfy the confidence requirement imposed by management.
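The iteration behind Table 14.1 is straightforward to automate. The sketch below assumes Equations 14.39, 14.42, and 14.43 with the Example 14.3 inputs; the helper names `poisson_cdf` and `min_spares` are illustrative and are not part of the handbook.

```python
import math

def poisson_cdf(spares, demand):
    """Cumulative Poisson probability P(X <= spares), as in Equation 14.39."""
    return sum(demand**i * math.exp(-demand) / math.factorial(i)
               for i in range(spares + 1))

def min_spares(demand, sufficiency):
    """Smallest stock level whose cumulative probability meets the target."""
    spares = 0
    while poisson_cdf(spares, demand) < sufficiency:
        spares += 1
    return spares

# Example 14.3: MDT in hours (Equation 14.43), then expected demand (Equation 14.42)
mdt = 0.10 * (0.50 * (2.5 + 0.5) + 0.50 * (2.5 + 0.5 + 3 * 24)) + 0.90 * (14 * 24)
demand = 0.0002 * 150 * 1 * (10 / 24) * mdt

print(round(mdt, 1), round(demand, 1), min_spares(demand, 0.95))
for s in range(8):
    print(s, poisson_cdf(s, demand))   # compare with the Ps column of Table 14.1
```

For small expected demands such as this one, direct summation is adequate; very large demands call for a term-by-term recurrence or an approximation.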
14.3.2.3 Provisioning a Product Composed of Replaceable Parts

Example 14.3 is representative of a top-down approach to provisioning. The analyst determines the number of spare products required by assuming a stocking level at the next lower level of assembly (i.e., the repair parts may be subassemblies, modules, products, or a combination of these). In order to provision the subassemblies, modules, and products, the analyst assumes an availability of repair parts and a supply response time. Products at the next lower level of assembly must be provisioned so that the aggregate probability of having a repair part available is equal to or greater than the value used in the model for the next higher level of assembly.

In a bottom-up approach, the logistics analyst determines the inventory levels to be maintained at the lowest level of assembly and works upwards. At each potential stockage point, the calculation of the MDT and stockage quantity is based upon the probability of spares’ sufficiency determined at the previous (lower) level of support. The spares’ sufficiency, spares’ availability, or probability of having a required spare is

$$P_{S,sys} = \prod_{k=1}^{M} P_{s,k} \tag{14.47}$$
where

$$P_{s,k} = \sum_{j=0}^{S_k} \frac{\mathrm{DEM}_k^{\,j} \times \exp(-\mathrm{DEM}_k)}{j!} \tag{14.48}$$

is the spares’ sufficiency for a unit with $S_k$ spares, and $\mathrm{DEM}_k$ is the expected demand for the kth unit at the next lower level of assembly. The probability of having a spare or repair part, $P_{sos}$, for use in determining the MDT is not the same as $P_{s,k}$. The definition of $P_{sos}$ is the probability that a spare can be drawn from an on-site inventory, given that a failure has occurred. To determine the value of $P_{sos}$, the following equation can be used:

$$P_{sos} = P_{sc} \times P_s \tag{14.49}$$
where $P_{sc}$ is the probability that the required part is carried in the on-site inventory, and $P_s$ is the probability that the required part is in stock, given that it is carried in the on-site inventory. Equations for $P_{sc}$ and $P_s$ are given by

$$P_{sc} = \frac{\sum_{i=1}^{k} N_i \times \lambda_i}{\lambda_T} \tag{14.50}$$

$$P_s = \frac{\sum_{i=1}^{k} N_i \times \lambda_i \times P_{s,i}}{\sum_{i=1}^{k} N_i \times \lambda_i} \tag{14.51}$$

where $\lambda_i$ is the failure rate of the ith part that is carried in the on-site inventory; $N_i$ is the number of applications for the ith part; $\lambda_T$ is the total failure rate for the unit, assembly, or subassembly that is being provisioned, considering both carried and not carried parts; $P_{s,i}$ is the probability of the sufficiency of the ith part; and

$$P_{s,i} = \sum_{j=0}^{S_i} \frac{\mathrm{DEM}_i^{\,j} \times \exp(-\mathrm{DEM}_i)}{j!} \tag{14.52}$$

where $S_i \ge 1$. Therefore, $P_{sos}$ can be written as

$$P_{sos} = \frac{\sum_{i=1}^{k} N_i \times \lambda_i \times P_{s,i}}{\lambda_T} \tag{14.53}$$

Example 14.4 illustrates the use of this equation.
EXAMPLE 14.4 PROVISIONING OF A PRODUCT CONSISTING OF REPLACEABLE PARTS

Suppose a product consists of eight assemblies with the failure rates, expected demand, and probability of sparing sufficiency given in Table 14.2. From these data, $\lambda_T = 0.0188$ and $P_{sos} = 0.9466$:

$$P_{sos} = (1/0.0188) \times (0.0007 \times 0.9977 + 0.0028 \times 0.9970 + 0.001 \times 0.9998 + 0.006 \times 0.9769 + 0.0075 \times 0.9927) = 0.9466 \tag{14.54}$$
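The value of $P_{sos}$ in Example 14.4 can be reproduced directly from the failure rates, expected demands, and stock levels. The sketch below assumes Equations 14.52 and 14.53 and one application per assembly ($N_i = 1$), which the example implies but does not state; the variable names are illustrative.

```python
import math

def poisson_cdf(spares, demand):
    """Sparing sufficiency of one part, as in Equation 14.52."""
    return sum(demand**j * math.exp(-demand) / math.factorial(j)
               for j in range(spares + 1))

# (failure rate, expected demand, spares carried) for the eight assemblies
# of Table 14.2; N_i = 1 for every assembly is an assumption of this sketch.
assemblies = [
    (0.0007, 0.07, 1), (0.0028, 0.28, 2), (0.0005, 0.05, 0), (0.0002, 0.02, 0),
    (0.0010, 0.10, 2), (0.0060, 0.60, 2), (0.0001, 0.01, 0), (0.0075, 0.75, 3),
]

lambda_total = sum(rate for rate, _, _ in assemblies)
# Parts that are not carried (S_i = 0) contribute nothing to the numerator.
numerator = sum(rate * poisson_cdf(s, d) for rate, d, s in assemblies if s >= 1)
p_sos = numerator / lambda_total

print(round(lambda_total, 4), round(p_sos, 4))
```

Although every carried part has a sufficiency above 0.97, the three uncarried assemblies hold $P_{sos}$ to about 0.95, because their failures can never be satisfied from on-site stock.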
Many maintenance and supply products are multi-echelon. The level at which maintenance is performed is based on the design for maintainability and testability of the product, doctrine, and economics. The product supply responds to the maintenance concept by providing the range and depth of spare and repair parts needed to maintain the product at each applicable level of repair. For a military product, the depot or an inventory control point acts as the highest level of maintenance and inventory management and physical stocking of spare and repair parts. The depot buys spares and repair parts from either in-house sources or civilian manufacturers, maintains physical inventory control, and distributes required spares and repair parts to lower echelons. A program manager or a designated supply support specialist determines the range and depth of spares and repair parts to be procured, carried in inventory, and distributed to lower echelons for use as on-site inventory.

A provisioning model must be developed for each echelon and used to compute the stockage quantity of each potentially repairable or replaceable part. Either a multi-echelon model or separate provisioning models must be developed for repairable and consumable products. Representative models for repairable and consumable products can be written as follows. MDT for an on-site (intermediate) repairable product is

$$\mathrm{MDT}_{I,i} = \{P_{os,I} \times [P_{sos,I} \times (\mathrm{MTTR}_{ros,I} + \mathrm{MADM}_{os,I}) + (1 - P_{sos,I}) \times (\mathrm{MTTR}_{ros,I} + \mathrm{MADM}_{os,I} + \mathrm{RESUP}_I)] + (1 - P_{os,I}) \times (\mathrm{TAT}_{off,I})\}_i \tag{14.55}$$

MDT for an off-site repairable or consumable product is

$$\mathrm{MDT}_{I,i} = \mathrm{RESUP}_{I,i} \tag{14.56}$$

Table 14.2 Data for Example 14.4

Assembly   λi       Di     Carried   Si   Ps,i (if Si ≥ 1)
1          0.0007   0.07   Yes       1    0.9977
2          0.0028   0.28   Yes       2    0.9970
3          0.0005   0.05   No        0    0.0000
4          0.0002   0.02   No        0    0.0000
5          0.0010   0.10   Yes       2    0.9998
6          0.006    0.6    Yes       2    0.9769
7          0.0001   0.01   No        0    0.0000
8          0.0075   0.75   Yes       3    0.9927

The on-site demand and stocking level are given by

$$\mathrm{DEM}_{I,i} = D_{I,i} = \mathrm{MAR}_i \times N_i \times \mathrm{UTIL}_i \times \mathrm{MDT}_{I,i} \tag{14.57}$$

$$P_{s,I,i} = \sum_{j=0}^{S_{I,i}} \frac{\mathrm{DEM}_{I,i}^{\,j} \times \exp(-\mathrm{DEM}_{I,i})}{j!} \tag{14.58}$$
At the depot or a major inventory control point, $\mathrm{MDT}_{D,i}$ is computed by

$$\mathrm{MDT}_{D,i} = \{(1 - Z_i) \times (\mathrm{MTTR}_{ros,D} + \mathrm{MADM}_{os,D}) + (1 - P_{sos,D}) \times (\mathrm{MTTR}_{ros,D} + \mathrm{MADM}_{os,D} + \mathrm{REORDER}_{com}) + (Z_i) \times (\mathrm{REORDER}_{rep})\}_i \tag{14.59}$$

The MDT for an off-site repairable product or a consumable product is given by

$$\mathrm{MDT}_{D,i} = \mathrm{REORDER}_{com,i} \tag{14.60}$$

where the subscript D represents the depot or manufacturer level of repair, as appropriate. $Z_i$ is the condemnation rate of a repairable product (that is, the probability that a repairable product inducted into the depot cannot be repaired for any reason); $\mathrm{REORDER}_{com}$ is the average time required to order and receive a repair part from a manufacturer; and $\mathrm{REORDER}_{rep}$ is the average time required to order and receive a new repairable product from a manufacturer. The on-site demand and stocking levels are found as follows:

$$\mathrm{DEM}_{D,i} = D_{D,i} = M_{sites} \times \mathrm{MAR}_i \times N_i \times \mathrm{UTIL}_i \times \mathrm{MLDT}_{D,i}$$

$$P_{s,D,i} = \sum_{j=0}^{S_{D,i}} \frac{\mathrm{DEM}_{D,i}^{\,j} \times \exp(-\mathrm{DEM}_{D,i})}{j!} \tag{14.61}$$
where Msites is the number of operating sites being supported by the depot. Examples 14.5 and 14.6 illustrate the use of these equations.
14.3.2.4 Spares’ Optimization

A provisioning list can be very easily optimized for a single variable. A dynamic programming approach can be used by employing the following procedure:

• For each level of repair and for each assembly, subassembly, and product, determine the expected demand ($\mathrm{DEM}_{R,i}$) subject to:
  • level of repair
    − on-site sparing
    − depot or inventory control point
  • level of assembly
    − product or line replaceable unit
    − assembly or shop replaceable unit
    − subassembly or modules
    − products and piece parts
• Determine the unit cost of each product ($C_i$).
• For a specific level of repair and level of assembly (e.g., on-site sparing of a product), compute

$$P_{s,sys} = \prod_{i=1}^{M} \sum_{j=0}^{S_i} \frac{\mathrm{DEM}_i^{\,j} \times \exp(-\mathrm{DEM}_i)}{j!} \tag{14.62}$$

  where $\mathrm{DEM}_i$ is the expected demand for the ith unit at the next lower level of assembly.
• For each product, i, determine the change in shortage risk obtained by incrementing the inventory by one, from $j = S_i$ to $j + 1$ spares:

$$\mathrm{DELS}_i = \frac{\mathrm{DEM}_i^{\,j+1} \times \exp(-\mathrm{DEM}_i)}{(j+1)!} \tag{14.63}$$

• For each product, determine the change in shortage risk per dollar expended:

$$\mathrm{DELS\$}_i = \mathrm{DELS}_i / C_i \tag{14.64}$$

• Select the product with the highest $\mathrm{DELS\$}_i$ to have its inventory incremented by one.
• Continue the process until either a dollar constraint has been reached or the desired level of $P_{S,sys}$ has been attained.
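The selection steps above can be sketched as a simple loop that repeatedly buys the spare with the largest gain per dollar. This is a minimal illustration under stated assumptions, not a definitive implementation: it assumes every stock level starts at zero, that the target sufficiency lies strictly between 0 and 1, and that the function name `optimize_spares` and its `budget` argument are hypothetical.

```python
import math

def poisson_pmf(j, demand):
    return demand**j * math.exp(-demand) / math.factorial(j)

def poisson_cdf(s, demand):
    return sum(poisson_pmf(j, demand) for j in range(s + 1))

def optimize_spares(demands, costs, target, budget=float("inf")):
    """Increment one spare at a time, choosing the best DELS$ (Equation 14.64)."""
    assert 0 < target < 1
    stock = [0] * len(demands)
    spent = 0.0

    def p_sys():
        # System sufficiency, Equation 14.62
        return math.prod(poisson_cdf(s, d) for s, d in zip(stock, demands))

    while p_sys() < target:
        # Gain per dollar from adding one more unit of each product
        gains = [poisson_pmf(s + 1, d) / c
                 for s, d, c in zip(stock, demands, costs)]
        best = max(range(len(demands)), key=gains.__getitem__)
        if spent + costs[best] > budget:
            break                      # dollar constraint reached
        stock[best] += 1
        spent += costs[best]
    return stock, spent, p_sys()

# Two products with expected demands 0.07 and 0.28 and equal unit costs
stock, spent, p = optimize_spares([0.07, 0.28], [1.0, 1.0], 0.95)
print(stock, spent, round(p, 3))
```

With equal unit costs the loop first stocks the higher-demand product and then the lower-demand one, stopping as soon as the product of the individual sufficiencies reaches the target.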
How does product reliability or a change in reliability affect provisioning? Example 14.5 gives the case of a proposed improvement in the reliability of a product and the case in which field reliability is less than predicted.

EXAMPLE 14.5 USING THE PROVISIONING MODEL AS PART OF A RELIABILITY TRADE-OFF ANALYSIS

As a repair and maintenance (R&M) engineer, you have been asked to evaluate an engineering change proposal. The proposal is for the upgrade of products used in a “ruggedized” communications radio. Part of the assigned task is to assess the change in required spare radios due to the improvement in reliability. The mean time between failures (MTBF) of the current product is 1,500 operating hours. A 30% improvement in reliability is expected if the modification is adopted, giving an improved MTBF of 1,950 operating hours. Let

$$\mathrm{MAR} = W \times \lambda_{sys} / (1 - \mathrm{CND}) \tag{14.65}$$
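where MAR is the maintenance action rate; $W = 1.00$; $\lambda_{sys,old} = 1/\mathrm{MTBF} = 1/(1500\ \text{ophr/failure}) = 0.0006667$ failures/ophr; $\lambda_{sys,new} = 1/\mathrm{MTBF} = 1/(1950\ \text{ophr/failure}) = 0.0005128$ failures/ophr; and CND = 0. Therefore, $\mathrm{MAR}_{old} = 0.0006667$ failures/ophr and $\mathrm{MAR}_{new} = 0.0005128$ failures/ophr. Let

$$\mathrm{DEM} = \mathrm{MAR} \times N_{sys} \times \mathrm{OPHRS} \times \mathrm{MLDT} \tag{14.66}$$

and

$$\mathrm{MLDT} = P_{os} \times [P_{sos} \times (\mathrm{MTTR}_{ros} + \mathrm{MADM}_{os}) + (1 - P_{sos}) \times (\mathrm{MTTR}_{ros} + \mathrm{MADM}_{os} + \mathrm{RESUP})] + (1 - P_{os}) \times (\mathrm{TAT}_{off}) \tag{14.67}$$

Let $N_{sys}$ = 30 products; OPHRS = 10 ophr/product/day; $P_{os}$ = 0.90; $P_{sos}$ = 0.90; $\mathrm{MTTR}_{ros}$ = 6.5 h; $\mathrm{MADM}_{os}$ = 0.5 h; RESUP = 2.5 days; $\mathrm{TAT}_{off}$ = 0.5 month. Then,

$$\mathrm{DEM}_{old} = (0.0006667\ \text{failures/ophr}) \times (30\ \text{systems}) \times (10\ \text{ophr/system-day}) \times (1\ \text{day}/24\ \text{hr}) \times \{(0.90) \times [(0.90) \times (6.5\ \text{hr} + 0.5\ \text{hr}) + (1 - 0.9) \times (6.5\ \text{hr} + 0.5\ \text{hr} + (2.5\ \text{day}) \times (24\ \text{hr/day}))] + (1 - 0.9) \times [(0.5\ \text{mo}) \times (30\ \text{days/mo}) \times (24\ \text{hr/day})]\}$$

$$= (0.0006667\ \text{failures/ophr}) \times (596.25\ \text{ophr}) = 0.398\ \text{failures} \tag{14.68}$$

and

$$\mathrm{DEM}_{new} = (0.0005128\ \text{failures/ophr}) \times (596.25\ \text{ophr}) = 0.306 \tag{14.69}$$

The sparing level is determined using the Poisson equation as follows: Assume the desired $P_S$ is 0.95. Then,

$$P_{S,old} = 0.95 \le \sum_{j=0}^{S_{old}} \frac{(\mathrm{DEM}_{old})^{j} \times \exp(-\mathrm{DEM}_{old})}{j!} \tag{14.70}$$

$$0.95 \le 0.672 + 0.267 + 0.053 = 0.992$$

so that $S_{old} = 2$. Similarly,

$$P_{S,new} = 0.95 \le \sum_{j=0}^{S_{new}} \frac{(\mathrm{DEM}_{new})^{j} \times \exp(-\mathrm{DEM}_{new})}{j!} \tag{14.71}$$

$$0.95 \le 0.736 + 0.225 = 0.961$$

and $S_{new} = 1$. The proposed design change would save the cost of one spare. What is the savings if $P_S$ is 0.90? If $P_S$ is equal to 0.90, both the old and the new (improved design) products require one spare.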
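The trade-off in Example 14.5 can be rerun for other MTBF values or sufficiency targets with a short script. The sketch below assumes Equations 14.66 and 14.67 and a 30-day month, as used in the example; the helper names `mldt` and `min_spares` are illustrative only.

```python
import math

def min_spares(demand, sufficiency):
    """Smallest S whose cumulative Poisson probability meets the target."""
    spares, term = 0, math.exp(-demand)
    cumulative = term
    while cumulative < sufficiency:
        spares += 1
        term *= demand / spares        # P(X = j) = P(X = j - 1) * DEM / j
        cumulative += term
    return spares

def mldt(p_os, p_sos, mttr, madm, resup, tat_off):
    """Mean logistics downtime in hours, Equation 14.67."""
    on_site = p_sos * (mttr + madm) + (1 - p_sos) * (mttr + madm + resup)
    return p_os * on_site + (1 - p_os) * tat_off

downtime = mldt(0.90, 0.90, 6.5, 0.5, 2.5 * 24, 0.5 * 30 * 24)   # hours
exposure = 30 * (10 / 24) * downtime   # N_sys x OPHRS (per clock hour) x MLDT

for mtbf in (1500, 1950):
    demand = exposure / mtbf           # Equation 14.66 with MAR = 1/MTBF
    print(mtbf, demand, min_spares(demand, 0.95), min_spares(demand, 0.90))
```

The loop makes it easy to see that the one-spare saving appears only at the 0.95 target; at 0.90 both designs need a single spare.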
EXAMPLE 14.6 SPARING A REPLACEABLE ASSEMBLY

Consider a nonrepairable module with an MTBF of 4,000 operating hours, 15 operating sites, and one depot. Each operating site supports 50 aircraft. Each aircraft uses two modules and flies 100 h per month. The resupply time at the site level is 2 weeks. The lead time to acquire the module from the manufacturer is 15 months. What should the sparing level be at each operating site and at the depot, if the level of sparing sufficiency is 0.90? The expected demand at each operating site is found as follows:

$$\mathrm{DEM}_{opsite} = (1/\mathrm{MTBF}) \times \mathrm{NACFT} \times \mathrm{NUNITS} \times \mathrm{OPHR} \times \mathrm{RESUP} = (1/4000\ \text{ophr/failure}) \times (50\ \text{aircraft/site}) \times (2\ \text{units/aircraft}) \times (100\ \text{ophrs/aircraft-mo}) \times (2\ \text{wks} \times 1\ \text{mo}/4\ \text{wk}) = 1.25\ \text{failures/site} \tag{14.72}$$

Using the cumulative Poisson equation, the required sparing level to achieve a 0.90 sufficiency at an operating site is

$$P_S = 0.90 \le \sum_{j=0}^{S_{onsite}} \frac{(\mathrm{DEM}_{opsite})^{j} \times \exp(-\mathrm{DEM}_{opsite})}{j!} \tag{14.73}$$

$$0.90 \le 0.286 + 0.358 + 0.224 + 0.093 = 0.961$$

and $S_{onsite}$ = 3 spares/site. At the depot, the expected demand is

$$\mathrm{DEM}_{depot} = \mathrm{NSITES} \times (1/\mathrm{MTBF}) \times \mathrm{NACFT} \times \mathrm{NUNITS} \times \mathrm{OPHR} \times \mathrm{REORDER} = (15\ \text{sites}) \times (1/4000\ \text{ophr/failure}) \times (50\ \text{aircraft/site}) \times (2\ \text{units/aircraft}) \times (100\ \text{ophrs/aircraft-mo}) \times (15\ \text{mo}) = 562.5\ \text{failures} \tag{14.74}$$

At the depot level, the module is a consumable or throw-away item. The depot must stock a sufficient quantity of the modules to meet the expected demand over the reorder period. Unless the cumulative Poisson equation has been programmed on a computer, it becomes tedious to compute the sparing level for an expected demand as large as the value in this example. The following equation can be used to derive an approximate value:

$$S = \mathrm{DEM} + z_{\alpha} \times (\mathrm{DEM})^{1/2} + z_{\alpha}^{2}/8 \tag{14.75}$$

where $z_{\alpha}$ is the normal variate for the desired sparing sufficiency. For example, if $P_s$ = 0.90, $z_{\alpha}$ = 1.29, then S = 594 for this example (593.3, rounded up to the next whole unit).
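Where a computer is available, both the approximation and the exact cumulative Poisson calculation take only a few lines. The sketch below assumes Equation 14.75 with the result rounded up to a whole unit; the exact search builds the Poisson terms by recurrence so that large expected demands do not overflow. The function names are illustrative.

```python
import math

def spares_normal_approx(demand, z):
    """Equation 14.75, rounded up to a whole unit."""
    return math.ceil(demand + z * math.sqrt(demand) + z**2 / 8)

def spares_exact(demand, sufficiency):
    """Smallest S with cumulative Poisson >= sufficiency, built term by term."""
    spares, term = 0, math.exp(-demand)
    cumulative = term
    while cumulative < sufficiency:
        spares += 1
        term *= demand / spares        # P(X = j) = P(X = j - 1) * DEM / j
        cumulative += term
    return spares

demand_site = (1 / 4000) * 50 * 2 * 100 * (2 / 4)    # Equation 14.72
demand_depot = 15 * (1 / 4000) * 50 * 2 * 100 * 15   # Equation 14.74

print(demand_site, spares_exact(demand_site, 0.90))
print(demand_depot, spares_normal_approx(demand_depot, 1.29),
      spares_exact(demand_depot, 0.90))
```

Comparing the two depot figures gives a quick check on how conservative the approximation is for a given demand and sufficiency level.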
What is the economic order quantity? Assume the setup cost is $2,500 and the holding cost is $25 per spare. The economic order quantity, using Equation 14.24, is

$$\mathrm{REPLEN}^{*} = (2 \times \mathrm{DEM} \times \mathrm{SETUP}/\mathrm{HOLD})^{1/2} = (2 \times 594 \times \$2{,}500 / \$25)^{1/2} = 345 \tag{14.76}$$

The expected time to expend the on-hand parts (i.e., the time between orders) is found by using Equation 14.25:

$$\mathrm{TIME}^{*} = \mathrm{REPLEN}^{*}/\mathrm{DEM} = 345/594 = 0.58 \tag{14.77}$$

In terms of calendar time,

$$\mathrm{TIME}^{*} = 0.58 \times (15\ \text{mo}) = 8.7\ \text{months} \tag{14.78}$$

The results in the example should be interpreted as follows:

• An initial spares’ pool of 594 units must be procured.
• Immediately, an order should be placed for 345 units scheduled for delivery within 15 months of the initiation of aircraft operations.
• If field demand is the same as predicted demand, after 345 units are used (approximately 8.7 months), another order is placed for 345 units to be delivered within 15 months (month 23.7); if field demand differs significantly from predicted demand, the sparing level and EOQ must be recalculated and the procurement quantity adjusted accordingly.
• The process is repeated whenever 345 units are used or 8.7 months have elapsed since the last order, subject to statistically significant changes in the actual demand rate.
14.3.3 Manpower and Personnel: Staffing Levels

Table 14.3 describes the manpower tasks of a typical, multi-echelon corrective maintenance action. Each block on the diagram represents one or more maintenance, supply, manufacturing or fabrication, quality assurance, or other administrative logistics support tasks that must be accomplished to restore a failed end-product to operation and to repair or replace failed products removed from the end-product. Each action associated with a logistics support task requires a combination of technical skills and experience. The LSA identifies and documents these personnel requirements for organizational, intermediate, and depot maintenance. Using data from the maintainability analysis, the LSA also determines the time (typically clock-hours, but in some instances also man-hours) required to accomplish a given maintenance task. The RLA determines the maintenance tasks that will be accomplished at a given level of maintenance. In the ILS planning process, how does reliability affect manpower and personnel requirements? How can manpower and personnel requirements be estimated?

Table 14.3 Task Flow for a Multi-Echelon Logistics Support Product

Organizational Level
Failures originate at the organizational level and are isolated to a line replaceable unit (LRU). The faulty LRU is removed from the product and replaced with a spare LRU. The product is checked for proper operation. The faulty LRU is sent to the intermediate level base shop for repair.

Intermediate Level
At the intermediate level, the LRU is repaired by isolating to the faulty shop replaceable unit (SRU). The faulty SRU is removed and replaced with a spare SRU. The repaired LRU is checked for proper operation. Once the LRU is repaired, it is sent to the organizational level or to an inventory control or stockage point. If no fault is found, the LRU is also sent to the inventory control or stockage point. Occasionally, the LRU cannot be repaired by the intermediate level and it is sent to the depot for repair.

Depot
At the depot, the SRUs (and sometimes LRUs) are repaired by fault-isolating to components. The faulty component is removed and replaced. The SRU (or LRU) is checked for proper operation. Once the repair is complete, the repaired unit is sent back to the intermediate or depot level inventory control or stockage point.
An estimate of the demand for manpower resources can be developed using an approach similar to that used to determine the expected demand for a spare or repair part. As part of a design and development program, the component, module, subassembly, assembly, unit, and product failure rates are determined and incorporated into the LSA as task frequency of occurrence. Supplemented by the man-hours to complete a maintenance task, task frequency of occurrence yields a man-hour per operating hour demand for specific skills and experience categories. Applying an operating profile (i.e., operating hours per unit per calendar period) and the number of units to be supported, the man-hours per calendar period for specific skill and experience categories can be found. The expected demand in man-hours per calendar period for the specific skill, experience level, or both required to support a given product is given by

$$\mathrm{MANTIME}_{ijkl} = M_{sites} \times \mathrm{MAR}_{ij} \times \mathrm{NUNITS}_i \times (\mathrm{MANTTR}_{ijkl}) \times \mathrm{UTIL}_i \tag{14.79}$$

where
$\mathrm{MANTIME}_{ijkl}$ is man-hours per calendar period of skill level “k” and experience level “l” required to support maintenance action “j” on the ith product;
$M_{sites}$ is the number of operating sites supported by the maintenance facility;
$\mathrm{MAR}_{ij}$ is the maintenance action rate (actions/unit-operating-hour) for action “j” on product “i”;
$\mathrm{NUNITS}_i$ is the number of units of type “i” at each site;
$\mathrm{MANTTR}_{ijkl}$ is the mean man-hours of skill level “k” and experience level “l” expended per maintenance action “j” on the ith product; and
$\mathrm{UTIL}_i$ is the average utilization rate of the ith product, operating hours per unit per calendar period.
Note the similarity of Equation 14.79 to the equations developed for provisioning and supply support. In this case, the MDT term has been replaced by $\mathrm{MANTTR}_{ijkl}$. Example 14.7 illustrates the use of this equation.

EXAMPLE 14.7 MANPOWER REQUIREMENTS

The removal and replacement of a starter motor on a small commuter or business jet requires the skill and experience of a licensed mechanic, a helper, and a quality assurance inspector. The task takes 2.5, 1.25, and 0.5 man-hours per removal action, respectively. The maintenance action rate for the starter is 25.0 removals per million aircraft-unit flight hours (i.e., 50.0 removals per million aircraft operating hours). There is only one maintenance site. Forty aircraft, each with two starters, are supported, for a total of eighty starters. Each aircraft averages 110 flight hours per month. What is the manpower utilization of each skill category per month and per year associated with removal and replacement of the starter? The expected man-hours to be expended for each skill level are given by

$$\mathrm{MANTIME}_{ijkl} = M_{sites} \times \mathrm{MAR}_{ij} \times \mathrm{NUNITS}_i \times (\mathrm{MANTTR}_{ijkl}) \times \mathrm{UTIL}_i$$

$$= (1\ \text{site}) \times (25 \times 10^{-6}\ \text{removals/aircraft-unit flight hour}) \times (40\ \text{aircraft/site} \times 2\ \text{units/aircraft}) \times (2.5 + 1.25 + 0.5) \times (110\ \text{flight hours/month/aircraft})$$

$$= (0.22\ \text{removals per month}) \times (4.25\ \text{man-hours/removal}) \tag{14.80}$$
Table 14.4 provides the resulting MANTIMEijkl for this example. Note that the MAR for the starter was given in removals per aircraft-unit operating hour. If the MAR was given in terms of starter operating cycles or starter-on time, a conversion factor would be required in order to yield consistent units (i.e., removals per calendar time).
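The per-skill breakdown in Example 14.7 follows from applying Equation 14.79 once for each skill category. The sketch below is a minimal illustration using the example’s inputs; the function name `mantime` and the dictionary layout are illustrative choices, not part of the handbook.

```python
def mantime(m_sites, mar, nunits, manttr, util):
    """Equation 14.79: man-hours per calendar period for one skill category."""
    return m_sites * mar * nunits * manttr * util

# Man-hours per removal for each skill category (Example 14.7)
man_hours_per_removal = {"Mechanic": 2.5, "Helper": 1.25, "Inspector": 0.5}

# One site, 25 removals per million unit flight hours, 80 starters,
# 110 flight hours per aircraft per month
monthly = {skill: mantime(1, 25e-6, 40 * 2, hours, 110)
           for skill, hours in man_hours_per_removal.items()}
monthly["Total"] = sum(monthly.values())

for skill, hours in monthly.items():
    print(f"{skill}: {hours:.3f} man-hours/month, {12 * hours:.2f} man-hours/year")
```

Because the removal rate is low, the starter task alone consumes under one man-hour per month; the same function can be summed over many products and maintenance actions to build up a staffing estimate.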
Table 14.4 MANTIMEijkl (man-hours)

Calendar Period   Mechanic   Helper   Inspector   Total
Month             0.55       0.28     0.11        0.94
Year              6.60       3.30     1.32        11.22

Equation 14.79 can be used to calculate the man-hours associated with one skill level in one form of maintenance action on one unit or product. By performing a summation over the indices “i” and “j,” the total man-hours per calendar time for skill and experience levels “k” and “l” can be determined. Summing over the indices “k” and “l” yields the expected total man-hours required to support a product. The example illustrates this process for three skill categories, but for only one maintenance action type.

In general, as with provisioning, spares and people can only be procured in integer quantities. However, unlike provisioning, the employment of part-time personnel and full-time personnel working overtime can offset surge demands. When using maintenance man-hour data extracted from a maintainability or logistics support analysis, the analyst must remember that the data represent the active man-hours required to conduct a specific task. Generally, the maintainability analysis does not include access time, setup, breakdown, or many other peripheral tasks normally associated with maintenance. In addition, manpower efficiency factors must be applied in order to convert the predicted active maintenance man-hours to employable man-hours. Moreover, the man-hours associated with maintenance support personnel (e.g., administrative, quality assurance, and supply) are not normally addressed in a maintainability or logistics support analysis. Logistics analysts or R&M engineers engaged in manpower planning must consider the following factors in order to derive the manpower and personnel requirements for a product:

• A decrease in the man-hours available to perform active maintenance or maintenance support can be expected due to many factors, including:
  • slack time;
  • direct administrative time;
  • direct time associated with peripheral maintenance subtasks not included in the prediction or measured task time;
  • sick leave;
  • holidays;
  • weekends;
  • shifts (hours per day availability);
  • job proficiency;
  • extra or collateral duties; and
  • work-rule restrictions.
• The use of higher percentile (e.g., 2σ or 3σ) values in lieu of expected (mean) values and the application of the algebra of normal variables will provide more confidence that manpower requirements can be met. Consider applying this approach to the following variables: MAR, MANTTR, and OPTIME.
• Excess personnel or overtime may be required to meet surge or other transient circumstances not addressed by the mean value equation.
14.3.4 Support and Test Equipment: Utilization and Productivity

The mean or expected value equation for determining the utilization of support and test equipment (STE) “k” is similar to the equation to determine personnel requirements. Like personnel, STE can only be procured in integer quantities and extra work shifts can accommodate surge requirements. Unlike personnel, indirect and other factors affecting productivity are more defined and predictable. A basic equation for use in determining the utilization of an item of STE attributed to repair action “j” of a prime or lower level of assembly product “i” can be written as follows:

$$\mathrm{SETIME}_{ijk} = M_{sites} \times \mathrm{MAR}_{ij} \times \mathrm{NUNITS}_i \times (\mathrm{MTTR}_{ijk} + \mathrm{SET}_{ijk}) \times \mathrm{UTIL}_i \times (1 + \mathrm{MAR}_k \times \mathrm{MLDT}_k) \times (1 + \mathrm{PMTIME}_k \times \mathrm{PMRAT}_k + \mathrm{CALTIME}_k \times \mathrm{CALRAT}_k) \tag{14.81}$$

where
$\mathrm{SETIME}_{ijk}$ is the STE utilization time for the kth STE attributed to the jth maintenance action on the ith item of prime product;
$\mathrm{MTTR}_{ijk}$ is the active maintenance time using the kth STE for the jth maintenance action on product “i,” STE operating hours;
$\mathrm{SET}_{ijk}$ is any setup or other direct usage time associated with the use of the kth STE supporting the jth maintenance action on the ith prime product;
$\mathrm{MAR}_k$ is the maintenance action rate for the kth STE;
$\mathrm{MLDT}_k$ is the mean logistics downtime per STE maintenance action for the kth STE;
$\mathrm{PMTIME}_k$ is the mean time to conduct preventive maintenance for the kth STE;
$\mathrm{PMRAT}_k$ is the PM action rate of occurrence for the kth STE;
$\mathrm{CALTIME}_k$ is the mean time to conduct calibration for the kth STE; and
$\mathrm{CALRAT}_k$ is the calibration action rate of occurrence for the kth STE.

When Equation 14.81 is used, care must be taken to ensure consistent use of units of time. When summed over the indices “i” and “j,” the preceding equation yields the total utilization for the kth STE. Note the similarity to the demand equations developed for supply support and manpower. In the preceding equation, the terms equivalent to the MDT in the supply support equation define how long the support product will be used per prime product, support product, or calibration maintenance action.

Equation 14.81 assumes that preventive maintenance and calibration of the product support are a function of use and not based upon calendar interval. For many STEs, this is not the case, and PM and calibration are conducted at fixed calendar intervals, irrespective of use during the calendar interval. For this case, the equation can be modified as follows:

$$\mathrm{SETIME}_k = M_{sites} \times \sum_{i=1}^{N_{SE}} \sum_{j=1}^{M_i} [\mathrm{MAR}_{ij} \times \mathrm{NUNITS}_i \times (\mathrm{MTTR}_{ijk} + \mathrm{SET}_{ijk}) \times \mathrm{UTIL}_i \times (1 + \mathrm{MAR}_k \times \mathrm{MLDT}_k)] \times \mathrm{CALEND} + \mathrm{PMTIME}_k \times \mathrm{NPER}_k + \mathrm{CALTIME}_k \times \mathrm{NCAL}_k \tag{14.82}$$
where $N_{SE}$ is the number of prime products supported by the kth STE; $M_i$ is the number of different maintenance actions for the ith prime product that use the kth STE for test support; CALEND is the number of calendar planning periods under consideration; $\mathrm{NPER}_k$ is the number of PM cycles in period CALEND; and $\mathrm{NCAL}_k$ is the number of calibration cycles in period CALEND.

Equation 14.82 provides SE utilization (hours) per calendar planning period for the kth SE. As in manpower planning, when SE utilization hours computed by the mean value equation are used, the analyst must determine the maximum hours during which an SE will be available during the calendar or planning period. For example, if a company typically works one 8-hour shift per day, 5 days per week, 50 weeks per year, each piece of support equipment (SE) will be available for a maximum of 8 × 5 × 50 = 2,000 hours per year. If the results of applying Equation 14.82 indicate that 4,700 STE operating hours will be expended, including downtime for STE maintenance and calibration, then 4,700/2,000 = 2.35 units are required, or three units after rounding up. If two shifts are used on a normal basis, the number of required STEs would be reduced.

As with provisioning and manpower planning, consistency of units must be ensured in order to obtain useful results. This is especially true when variables such as MLDT are encountered in an equation because product downtime, calendar time, and product-available or operating time must be related to a common time scale. Also, the mean value equations given earlier do not account for surge requirements or for meeting turnaround time or operational availability requirements that may be placed on the prime product.

14.4 REPAIR LEVEL ANALYSIS
The repair level analysis (RLA), sometimes also referred to as a level-of-repair analysis, is an economic analysis used to determine whether a product should be repaired or discarded, as well as the maintenance level (e.g., organizational, intermediate, or depot) at which the repair or discard action should be made. The RLA is an iterative analysis that should interact with the design process. The RLA can be used to determine if an assembly can be cost-effectively repaired, given an initial design approach. If the product is deemed a discard, money can be saved by simplifying the design to remove test points. On the other hand, if the initial analysis indicates that a product should be repairable, redesign to add more test points or initialization circuits may be warranted. Table 14.5 relates the use of the RLA to the life cycle of a product.

For many acquisition programs, the RLA is strictly an economic analysis; however, a trade-off analysis of cost versus operational availability/readiness (or some other measure of effectiveness) can be easily conducted. Typically, the RLA determines the nonrecurring and recurring costs associated with each of the 10 logistics elements for both the repair and discard alternatives. Unless the analysis is restricted a priori by noneconomic considerations (e.g., printed-wiring boards can only be
repaired at the depot level, or hermetically sealed hybrid circuits are nonrepairable), the costs are computed for each applicable subsystem, assembly, subassembly, module, and potentially repairable product at each potentially applicable level of maintenance. These cost estimates use data drawn from the logistics support analysis report (LSAR) and mean value models similar to those given in the preceding sections for provisioning, manpower, and SE.

Table 14.5 RLA and the Product Life Cycle

Product life-cycle phase: Program initiation and concept exploration
Function of the RLA:
• Conduct trade-off studies of
  - Maintenance concept: evaluate possible support scenarios
  - Product support: new or existing
• Conduct operational effectiveness analysis to
  - Develop LCC estimate for budgetary planning
  - Identify noneconomic constraints to supportability and level of repair
RLA data and sources:
• R&M and LCC data from existing, fielded products
• Predictions

Product life-cycle phase: Design and development
Function of the RLA:
• Influence design for maintainability and testability
• Identify preliminary quantitative requirements for product support, facilities, personnel, and provisioning of major assemblies
• Make repair and discard decisions
• Evaluate LCC impact of proposed design changes
RLA data and sources:
• R&M predictions
• LSA
• Developer budgetary cost estimate

Product life-cycle phase: Production and initial fielding operations
Function of the RLA:
• Make level of repair decisions
• Determine provisioning requirements to include user/on-site spares and maintenance-site repair-part inventory design changes
• Evaluate LCC impact of proposed design changes
• Review and assess effectiveness of logistics product support
• Update provisioning lists
• Assess the LCC impact of proposed design changes
RLA data and sources:
• LSA
• Test results
• R&M prediction design-change proposals
• Field maintenance and cost data

The conduct of an RLA can become quite complex, given the number of repairable products in the typical weapon, aerospace, or electronics products. Unless prior noneconomic constraints are applied, each assembly, subassembly, or module must be evaluated for repair or discard. For either of these two alternatives, the selected action can be accomplished at one of three levels of maintenance. When conducting an RLA, the following issues, concerns, and considerations must be addressed:

• Sensitivity studies must be conducted in order to assess the potential of changing a decision for repair versus discard or for level of maintenance due to changes in a random variable (e.g., MTBF, mean time between maintenance [MTBM], MAR, MTTR, etc.).
• The cost associated with some logistics resources (e.g., multipurpose test product or repair facilities) must be amortized over all the products that will be supported by that particular resource. If the repair decision for a supported product results in that product no longer needing the resource, the analysis must be iterated to reflect the change.
• The costs associated with the support of the product can be significant. If new or peculiar STE is required, estimates must be developed for this category. This may entail performing an RLA for the STE.
• Although it may be feasible to repair an assembly or subassembly, it is important to consider repair procedure yield and condemnation rates. If the repair yield is low (high condemnation rate), the decision to attempt repair may change.
• Operational requirements and the possibility of product obsolescence or the future unavailability of new consumable assemblies may override economically based decisions. (For example, when a manufacturer decides to no longer make an assembly, but keeps the products comprising the assembly readily available, the availability of products may not be any help if repair procedures, other technical data, and the required SE do not exist.)
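The sensitivity of a repair-versus-discard decision to repair yield can be illustrated with a deliberately simplified sketch. It assumes a single recurring cost per failure and ignores the nonrecurring costs and the other logistics elements that a full RLA would include. The cost model, function names, and dollar values are hypothetical and are not taken from the handbook.

```python
def cost_per_failure_discard(unit_cost):
    """Discard alternative: every failed unit is replaced with a new one."""
    return unit_cost


def cost_per_failure_repair(repair_cost, unit_cost, repair_yield):
    """Repair alternative: every failure incurs a repair attempt, and the
    fraction (1 - repair_yield) that is condemned must still be replaced."""
    if not 0.0 <= repair_yield <= 1.0:
        raise ValueError("repair_yield must lie in [0, 1]")
    return repair_cost + (1.0 - repair_yield) * unit_cost


def preferred_action(repair_cost, unit_cost, repair_yield):
    """Return the cheaper alternative under the simplified model above."""
    repair = cost_per_failure_repair(repair_cost, unit_cost, repair_yield)
    discard = cost_per_failure_discard(unit_cost)
    return "repair" if repair < discard else "discard"


# Hypothetical inputs: a 1,000-unit-cost assembly and a 300-cost repair attempt.
UNIT_COST = 1000.0
REPAIR_COST = 300.0

# Simple sensitivity sweep over repair yield.
for repair_yield in (0.2, 0.4, 0.6, 0.8, 0.95):
    action = preferred_action(REPAIR_COST, UNIT_COST, repair_yield)
    print(f"yield = {repair_yield:.2f} -> {action}")
```

Under these assumptions the decision flips where the repair yield equals the ratio of repair cost to unit cost (0.3 for the values shown), which is the kind of crossover a sensitivity study is meant to expose.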
14.5 SUMMARY
This chapter has discussed how reliability and, to a more limited extent, testability affect the demand for spares and repair parts, personnel, and product support. This was accomplished by developing demand equations based on the assumption of a constant arrival rate and the development of a mean downtime term. The MDT was then tailored to reflect the time during which a product was undergoing maintenance, the time required to repair a spare, the response time of the product supply to deliver a part or spare from an off-site inventory, and the utilization of personnel or product support per maintenance action. The repair level analysis and how this analysis is placed in the life cycle of a product were also discussed.
REFERENCES
Blanchard, B. S. 1992. Logistics engineering and management. Upper Saddle River, NJ: Prentice Hall.
Goldman, A. S., and T. B. Slattery. 1967. Maintainability: A major element of systems effectiveness. New York: John Wiley & Sons.
Harris, F. W. 1915. Operations and cost. New York: A. W. Shaw.
Hillier, F. S., and G. J. Lieberman. 1970. Introduction to operations research. Ann Arbor: Holden-Day, University of Michigan Press.
Raymond, F. E. 1931. Quantity and economy in manufacture. New York: McGraw–Hill.
Sivazlian, B. D., and L. E. Stanfel. 1975. Analysis of systems in operations research. New York: Prentice Hall.
CHAPTER 15
Product Effectiveness and Cost Analysis
Harold S. Balaban, David Weiss
CONTENTS
15.1 Introduction
15.2 A Framework for Product Effectiveness Quantification Using Markov Processes
  15.2.1 A Generalization of the Model for Multifunction Operations
  15.2.2 Effectiveness Evaluation Example: Continuous Performance
  15.2.3 Model Applicability
15.3 Factors to Consider in Analyzing Product Effectiveness
  15.3.1 Phase I: Define Application, Product, and Logistics Support
  15.3.2 Phase II: Select Measures of Effectiveness
  15.3.3 Phase III: Develop the Mathematical Model
  15.3.4 Phase IV: Obtain Data Inputs
  15.3.5 Phase V: Exercise, Interpret, and Refine Model
15.4 Cost-Effectiveness Analysis
  15.4.1 Cost Categorization
  15.4.2 Cost Estimation
  15.4.3 Cost Adjustments
  15.4.4 Cost Uncertainty and Cost Sensitivity
  15.4.5 Combining Effectiveness and Cost
15.5 Summary
Reference
Additional Reading

15.1 INTRODUCTION
This chapter shows how reliability and maintainability data can be combined with performance data to assess the overall effectiveness of a product and how cost aspects can be introduced to provide a more complete basis for design decisions. First, the product effectiveness concept is reviewed. A generalized model for quantifying effectiveness is then developed by first considering single-mode products and then extending the model to multimode cases. This is followed by a general discussion of how product effectiveness is analyzed, and the chapter concludes with a discussion of ways of introducing cost into the decision process.

We noted in Chapter 1 that product effectiveness represents the overall capability of a product to meet customer or user requirements. It can be formally defined as a measure of the extent to which a product may be expected to achieve a set of specific application requirements in terms of availability, dependability, and capability:

• Availability is a measure of the product condition at the start of an application or start of use. It is a function of the relationships among hardware, personnel, and procedures.
• Dependability is a measure of the product condition at one or more points during the application, given the product condition at the start of the application.
• Capability is a measure of the product's ability to achieve the application objectives, given the product condition during the application. Capability specifically accounts for the performance spectrum of the product. The capability measure can take on a number of forms, such as a success probability, a measure relative to maximum performance, or a value in terms of the product output (e.g., megawatts of power) or in terms of its impact (e.g., cargo tonnage hauled).
If we consider a very simple product, one that is either "working" or not and that cannot be repaired while in use, the preceding definitions result in the following set of questions that an effectiveness analysis seeks to answer:

• Availability: is the product working when the user needs to start using it?
• Dependability: will the product work throughout the use period?
• Capability: if the product works throughout the use period, will it perform its functions satisfactorily?
To complete the argument, we will define "working" to mean that product outputs fall within design specifications. It is surprising how many cases can be adequately handled by this simple model, at least for an initial, ballpark estimate. Speaking of ballparks, suppose the product was a television set, the "application" was to watch a World Series game, and the effectiveness measure was the probability of watching the whole game. The effectiveness of the television set is then given by

\[
E_{ff} = P\{\text{set is available at the start of the game}\} \times P\{\text{set is dependable for the duration of the game, given that it is available}\} \times P\{\text{set provides satisfactory picture and sound, given that it is dependable}\} \tag{15.1}
\]

It is not hard to develop takeoffs on this example that would make it more complicated. For example, what if a part exceeded tolerance so that the picture became snowy but still viewable? What if sound was lost but a radio was available? What if the colors turned a ghastly hue but the picture was sharp? These questions focus on the requirement of "satisfactory picture and sound" that is embodied in the capability quantification. It is also easy to generate scenarios that focus on availability and dependability issues, especially if we allow for repair to take place during the game. Thus, although a simple "working or not" scenario may be useful, a more generalized approach may be needed.
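As a minimal sketch of Equation 15.1, the three conditional probabilities can simply be multiplied. The function name and the numerical values below are hypothetical illustrations for the television-set example; they are not data from the text.

```python
def effectiveness(p_available, p_dependable_given_available,
                  p_capable_given_dependable):
    """Multiply the three conditional probabilities of Equation 15.1."""
    factors = (p_available, p_dependable_given_available,
               p_capable_given_dependable)
    for p in factors:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each factor must be a probability in [0, 1]")
    return (p_available
            * p_dependable_given_available
            * p_capable_given_dependable)


# Hypothetical values: the set works at game start, survives the game,
# and gives satisfactory picture and sound.
eff = effectiveness(0.98, 0.995, 0.97)
print(f"Effectiveness = {eff:.4f}")
```

Because the factors multiply, the weakest of the three bounds the result: no amount of dependability can lift effectiveness above the availability or capability term.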
15.2 A FRAMEWORK FOR PRODUCT EFFECTIVENESS QUANTIFICATION USING MARKOV PROCESSES
For relatively simple products with little or no repair capability, availability is not a significant issue, and reliability and dependability models can be formulated. However, today's products are becoming more complex and frequently involve central computers, digital sensors, distributed microprocessor controllers, and built-in test capability. To model such complex repairable products, which really are systems of products, a standard technique is the use of a Markov model.

A Markov process is governed by probabilities that are functions of the immediate past history. A Markov model is a function of the state of the product (e.g., operating, nonoperating) and the time of the observation. It is defined by a set of probabilities ($P_{ij}$) that give the probability of transition from state i to state j. A Poisson process is a special type of Markov process. To formulate a Markov model, we must define all mutually exclusive states of the product. Then, Markov state equations describe the probabilistic transitions from the initial to the final states. For complex products, the number of states in the system model becomes very large and the solution of the state equations can be very computer intensive, generally providing little design insight. To be tractable, it is necessary to reduce the number of states through combination or approximation or to use a Monte Carlo simulation. This section defines a framework for a Markov model. The reader is referred to Shooman (1990) for a detailed discussion of approximation and simplification techniques.

Product states can be used as the basis for product description, and state transitions can then be used to reflect the product reliability and maintainability characteristics. The easiest way, but not necessarily the best way, to describe the product states is to consider each product component as a success or a failure. A product state is then a combination of component successes and failures; if there are n components in the product, this method will define $2^n$ product states. A state transition occurs when the state of a component changes (a successful component fails or a failed component is repaired). With certain simplifying assumptions, state models allow for relatively easy analytical methods using Markov processes. Using the concepts of availability, dependability, and capability, we will describe several forms of this model.
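The component-based state description can be sketched in a few lines: each state is one combination of component successes and failures, and a state transition flips the condition of exactly one component. The helper names and the three-component product are hypothetical and serve only to show how quickly the $2^n$ state space grows.

```python
from itertools import product


def enumerate_states(components):
    """Return every success/failure combination of the named components.

    True means the component is working; False means it has failed.
    """
    return [dict(zip(components, combo))
            for combo in product((True, False), repeat=len(components))]


def single_step_transitions(state):
    """States reachable when exactly one component fails or is repaired."""
    neighbors = []
    for name in state:
        flipped = dict(state)
        flipped[name] = not state[name]
        neighbors.append(flipped)
    return neighbors


# Hypothetical three-component product.
states = enumerate_states(["A", "B", "C"])
print(f"{len(states)} states for 3 components")
for n in (5, 10, 20):
    print(f"n = {n:2d} components -> {2 ** n} states")
```

The growth shown by the loop is the reason the text recommends combining states, approximating, or turning to Monte Carlo simulation for complex products.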
Single operating mode products with no in-use repair. These products can be adequately described as having only a single operating mode, as was assumed for the television set in the previous example, and repair of failures while in use is not possible. In such cases, we have the following generic model for effectiveness:

\[
E_{ff} = A_v \times D_p \times C_{ap} \tag{15.2}
\]

where $A_v$ is the availability, the probability that the product is operating at the beginning of the use period; $D_p$ is the dependability, the probability that the product continues to operate throughout the use period; and $C_{ap}$ is the capability, a measure of product performance, given that it was dependable throughout the use period.

If the application is one in which continuous performance is required over the application length $t_m$, the effectiveness of the product, assuming well-behaved functions, may be quantified as the time average of $E_{ff}(t)$; that is,

\[
E_{ff} = \frac{1}{t_m} \int_0^{t_m} E_{ff}(t)\, dt \tag{15.3}
\]

Note that if, at each performance time, the capability coefficient $c_j$ equals one if state j belongs to the set of satisfactory states and is zero otherwise, the preceding equation for $E_{ff}$ reduces to the expected fraction of the application performance time that the product is in a satisfactory state. If the Markov assumption does not hold, the capability matrix must be written as an N × N matrix (N = number of product states), with an entry for each state transition.

15.2.1 A Generalization of the Model for Multifunction Operations
Consider a product that has to perform f functions during an application where the kth function takes place during the interval $t_k$ to $t_k + \delta_k$, which we will denote by $T_k$. We will call such an interval the kth functional interval. We shall call the period between the two intervals $T_{k-1}$ and $T_k$ (i.e., the period from $t_{k-1} + \delta_{k-1}$ to $t_k$) the kth nonfunctional interval and denote it by $\eta_k$. Figure 15.1 illustrates this symbology using "_ _ _ _" for a functional period and "……." for a nonfunctional period.

Figure 15.1 Product that has to perform f functions during an application where the kth function takes place during the interval $t_k$ to $t_k + \delta_k$.

For simplicity, assume that effectiveness is measured as a probability of success, that there is no overlap of the functional intervals, and that, for success, no transitions can take place during a functional interval. Then, an equation for effectiveness is as follows:

\[
E_{ff} = A_v\, W \left[ \prod_{k=1}^{f-1} D_p(\eta_k)\, P(T_k)\, C_{ap}(T_k) \right] D_p(\eta_f)\, P(T_f)\, C_{ap}(T_f) \tag{15.4}
\]
where

\[
A_v = [\, a_1 \;\; a_2 \;\; \cdots \;\; a_n \,] \tag{15.5}
\]

$a_i$ is the probability that the product is in state i at time 0 or application start.

\[
W = \begin{bmatrix} w_1 & & & 0 \\ & w_2 & & \\ & & \ddots & \\ 0 & & & w_n \end{bmatrix} \tag{15.6}
\]

$w_i$ is the probability that the product will be used given state i at time 0.

\[
D_p(\eta_k) = \begin{bmatrix} d_{11}(\eta_k) & d_{12}(\eta_k) & \cdots & d_{1n}(\eta_k) \\ d_{21}(\eta_k) & d_{22}(\eta_k) & \cdots & d_{2n}(\eta_k) \\ \vdots & \vdots & & \vdots \\ d_{n1}(\eta_k) & d_{n2}(\eta_k) & \cdots & d_{nn}(\eta_k) \end{bmatrix} \tag{15.7}
\]

$d_{ij}(\eta_k)$ is the probability of a transition from state i to state j during the kth nonfunctional interval.

\[
P(T_k) = \begin{bmatrix} p_1(T_k) & & & 0 \\ & p_2(T_k) & & \\ & & \ddots & \\ 0 & & & p_n(T_k) \end{bmatrix} \tag{15.8}
\]

$p_i(T_k)$ is the probability that, given state i at $t_k$, the beginning of the kth functional interval, there is no state transition before time $t_k + \delta_k$.

\[
C_{ap}(T_k) = \begin{bmatrix} c_1(T_k) & & & 0 \\ & c_2(T_k) & & \\ & & \ddots & \\ 0 & & & c_n(T_k) \end{bmatrix}, \quad \text{for } k = 1 \text{ to } f - 1 \tag{15.9}
\]

\[
C_{ap}(T_f) = \begin{bmatrix} c_1(T_f) \\ c_2(T_f) \\ \vdots \\ c_n(T_f) \end{bmatrix} \tag{15.10}
\]
$c_i(T_k)$ is the probability that the ith product state will lead to successful accomplishment of all required functions during the kth performance interval. To illustrate, assume only two functional intervals. Then,

\[
E_{ff} = A_v\, W\, D_p(\eta_1)\, P(T_1)\, C_{ap}(T_1)\, D_p(\eta_2)\, P(T_2)\, C_{ap}(T_2) \tag{15.11}
\]

and a general term is $a_i w_i d_{ij} p_j c_j d_{jk} p_k c_k$. This term represents the probability of starting the application in state i ($a_i w_i$); transitioning to state j by the start of the first functional interval and persisting in that state throughout the interval ($d_{ij} p_j$); successfully performing the first function while in state j ($c_j$); transitioning to state k after the first functional interval and before the second functional interval ($d_{jk}$); persisting in state k during the second functional interval ($p_k$); and successfully performing the second function while in state k ($c_k$). It is, of course, possible to relax one or more of the model restrictions at the cost of more complexity. This is illustrated in the next section through an example.

15.2.2 Effectiveness Evaluation Example: Continuous Performance
One of the criticisms of the previous model is its reliance on discrete point, Markovian performance. Although many products can be cast within that framework, it is possible to use the basic model concepts to handle continuous performance. In some cases, this can be done by "attaching" a capability measure to each state transition. The approach is illustrated with the following example.

Example product definition. Two communications products, A and B, are used simultaneously to transmit information. Should either of the products fail, the remaining one is capable of transmitting alone (A and B performances are statistically independent). Failures in either product are not repaired during a transmission period, but are repaired during a period when the products are normally shut down. A transmission will be started whenever at least one of the products is available. (In other words, it is not necessary that both A and B be in operable condition in order to start a transmission.) The respective mean times between failure, mean repair times, and transmission bit rates for products A and B are given in Table 15.1.
To illustrate the effectiveness evaluation approach, the basic product effectiveness model will be used to answer the question, “What is the effectiveness of A and B combined, if effectiveness is defined as the probability of transmitting at least 800,000 bits during a normal transmission period of 40 minutes?” For the analysis, we shall assume that only one transition is possible. The product
state designations are given in Table 15.2 (a bar above a letter signifies a failed state).

Table 15.1 Mean Times Between Failure, Mean Repair Times, and Transmission Bit Rates for Products A and B

Product   Mean Time Between Failure (min)   Mean Repair Time (min)   Transmission Rate (r) (bits/min)
A         1200 (exponential)                60                       30,000
B         2000 (exponential)                80                       15,000

Table 15.2 The Product State Designation

Configuration      State number
$A B$              1
$A \bar{B}$        2
$\bar{A} B$        3
$\bar{A} \bar{B}$  4

Availability calculations. The availability ($A_v$) of a product is the probability that a product is operating at any point in time and, in the steady-state case, is given by the equation:

\[
A_v = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}} \tag{15.12}
\]

In particular, the availability of subproducts A and B is as follows:

\[
\mathrm{Avail}(A) = \frac{1200}{1200 + 60} = 0.9524, \qquad \mathrm{Avail}(B) = \frac{2000}{2000 + 80} = 0.9615 \tag{15.13}
\]

Definition: $a_i$ = P(state i exists at start of transmission), a function of the availabilities of subproducts A and B:

\[
\begin{aligned}
a_1 &= \mathrm{Avail}(A) \times \mathrm{Avail}(B) = (0.9524)(0.9615) = 0.9158 \\
a_2 &= \mathrm{Avail}(A) \times [1 - \mathrm{Avail}(B)] = (0.9524)(0.0385) = 0.0366 \\
a_3 &= [1 - \mathrm{Avail}(A)] \times \mathrm{Avail}(B) = (0.0476)(0.9615) = 0.0458 \\
a_4 &= [1 - \mathrm{Avail}(A)] \times [1 - \mathrm{Avail}(B)] = (0.0476)(0.0385) = 0.0018
\end{aligned} \tag{15.14}
\]

Then, the availability vector is

\[
A_v = [\, 0.9158 \;\; 0.0366 \;\; 0.0458 \;\; 0.0018 \,] \tag{15.15}
\]
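The availability vector can be reproduced with a short sketch that applies Equation 15.12 to the Table 15.1 data and then forms the four state probabilities of Equation 15.14. The function and variable names are illustrative choices, not part of the original example.

```python
def steady_state_availability(mtbf, mttr):
    """Equation 15.12: MTBF / (MTBF + MTTR), both in the same time unit."""
    return mtbf / (mtbf + mttr)


# Table 15.1 data (minutes).
avail_a = steady_state_availability(1200.0, 60.0)
avail_b = steady_state_availability(2000.0, 80.0)

# State order follows Table 15.2; A and B are statistically independent.
availability_vector = [
    avail_a * avail_b,                # state 1: A and B operating
    avail_a * (1 - avail_b),          # state 2: A operating, B failed
    (1 - avail_a) * avail_b,          # state 3: A failed, B operating
    (1 - avail_a) * (1 - avail_b),    # state 4: A and B failed
]

print([round(a, 4) for a in availability_vector])
```

Because the four states are mutually exclusive and exhaustive, the entries sum to one, which is a convenient check on any hand calculation.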
Dependability calculations. Because there is no in-use repair, dependability is based only on the reliability measures associated with the operation of subproducts A and B. The reliability function is assumed to be exponential and is given by the following equation:

\[
R(t) = e^{-t/\theta} \tag{15.16}
\]

where t is the application time and $\theta$ is the mean time between failures (MTBF). Thus, the reliabilities for subproducts A and B over a 40-minute period are as follows:

\[
R_A(40\ \text{min}) = e^{-40/1200} = 0.9672, \qquad R_B(40\ \text{min}) = e^{-40/2000} = 0.9802 \tag{15.17}
\]

Definition: $d_{ij}$ = P(transition from state i to state j during the 40-minute transmission period). The state transition probabilities are given in Table 15.3.

Table 15.3 State Transition Probabilities

d11 = (0.9672)(0.9802) = 0.9481    d31 = (0)(0.9802) = 0
d12 = (0.9672)(0.0198) = 0.0192    d32 = (0)(0.0198) = 0
d13 = (0.0328)(0.9802) = 0.0321    d33 = (1)(0.9802) = 0.9802
d14 = (0.0328)(0.0198) = 0.0007    d34 = (1)(0.0198) = 0.0198
d21 = (0.9672)(0) = 0              d41 = (0)(0) = 0
d22 = (0.9672)(1) = 0.9672         d42 = (0)(1) = 0
d23 = (0.0328)(0) = 0              d43 = (1)(0) = 0
d24 = (0.0328)(1) = 0.0328         d44 = (1)(1) = 1

Capability calculations. Definition: $c_{ij}$ = P(transmit at least 800,000 bits in 40 minutes while undergoing a transition from state i to state j). We have the following results immediately: $c_{11} = 1$, $c_{12} = 1$, $c_{22} = 1$, $c_{23} = 0$, and $c_{ij} = 0$ for all i > 2. To illustrate the thinking, $c_{11}$, $c_{12}$, and $c_{22}$ are 1 because they represent the capability when product A works throughout the application, and this product alone can meet the transmission requirement (30,000 bits/min × 40 min = 1,200,000 bits). The states for which the c values are indicated to be 0 are those that represent impossible transitions or for which product A is not available at all throughout the application. Because product B cannot transmit 800,000 bits in 40 minutes (15,000 bits/min × 40 min = 600,000 bits), the capability associated with states for which product A starts out failed is 0. For $c_{13}$, $c_{14}$, and $c_{24}$, things get a bit more complicated. We define the following:

• transmission rates per minute: $r_a$ and $r_b$, for products A and B, respectively
• application time: T
• failure rates: $\lambda_a$ and $\lambda_b$
• total bits transmission requirement: B
First, we will consider $c_{24}$, the probability of transmitting at least B bits in time T when starting out in state 2 (A working and B failed), and transitioning to state 4, with both products failed. Clearly, this can only happen if A fails after such time that B bits were delivered; that is, the failure time of A, $t_a$, is no earlier than $B/r_a$. This probability is given by the following equation:

\[
c_{24} = \int_{B/r_a}^{T} \frac{\lambda_a e^{-\lambda_a t_a}}{1 - e^{-\lambda_a T}}\, dt_a = \frac{1}{1 - e^{-\lambda_a T}} \left[ e^{-\lambda_a B / r_a} - e^{-\lambda_a T} \right] \tag{15.18}
\]
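The closed form in Equation 15.18 is easy to evaluate directly. The sketch below uses the Table 15.1 values for product A and the 800,000-bit, 40-minute requirement; the function name and argument names are illustrative assumptions.

```python
import math


def capability_c24(bits_required, rate_a, mtbf_a, period):
    """Closed form of Equation 15.18.

    Probability that A, known to fail within the period, fails only after
    it has already delivered the required number of bits on its own.
    """
    failure_rate = 1.0 / mtbf_a                 # lambda_a, per minute
    earliest_failure = bits_required / rate_a   # B / r_a, minutes
    if earliest_failure >= period:
        return 0.0                              # A alone can never deliver B bits
    p_fail_in_period = 1.0 - math.exp(-failure_rate * period)
    p_fail_late = (math.exp(-failure_rate * earliest_failure)
                   - math.exp(-failure_rate * period))
    return p_fail_late / p_fail_in_period


c24 = capability_c24(bits_required=800_000, rate_a=30_000,
                     mtbf_a=1200.0, period=40.0)
print(round(c24, 4))
```

Here $B/r_a$ is about 26.7 minutes, so A must survive roughly the first two-thirds of the transmission period for the requirement to be met.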
When substituting the example values, we calculate $c_{24} = 0.3296$. The calculation for $c_{13}$ is similar. This case is the transition from both A and B working to A failing. If A fails at time $t_a$, it will have delivered $r_a \times t_a$ bits; thus, the surviving B product will transmit $r_b \times (T - t_a)$ bits over the remaining time $T - t_a$. Because the total bits delivered under this case, $r_a t_a + r_b (T - t_a)$, must be at least B, a lower limit on $t_a$ to meet the criterion is defined and a probability expression can be developed. For the numbers we used, we find that $c_{13} = 0.8478$.

For the $c_{14}$ capability measure, which represents a transition from both A and B working to both failed, we need to use a convolution equation, which represents the probability that the sum of the transmissions of A and B is at least B over time T. This is given by the following equation:

\[
c_{14} = \int_{B/r_a}^{T} \frac{\lambda_a e^{-\lambda_a t_a}}{1 - e^{-\lambda_a T}}\, dt_a + \int_{0}^{B/r_a} \frac{\lambda_a e^{-\lambda_a t_a}}{1 - e^{-\lambda_a T}} \int_{(B - r_a t_a)/r_b}^{T} \frac{\lambda_b e^{-\lambda_b t_b}}{1 - e^{-\lambda_b T}}\, dt_b\, dt_a \tag{15.19}
\]
The first term is the probability that A fails after it has transmitted B bits; the second term represents all the ways in which the sum of bits transmitted by A and B, given that both fail before time T, is at least B. On solving this equation and inserting the numerical values applicable for the example, we find that $c_{14} = 0.5509$. These calculations then lead to the following C matrix:

\[
(C_{ap})_{ij} = \begin{bmatrix} 1 & 1 & 0.8478 & 0.5509 \\ 0 & 1 & 0 & 0.3296 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} \tag{15.20}
\]

Effectiveness is then calculated by the following equation:

\[
E_{ff} = \sum_{i=1}^{4} \sum_{j=1}^{4} a_i\, d_{ij}\, c_{ij} \tag{15.21}
\]
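Equation 15.21 can be evaluated numerically as a cross-check on the hand calculations. The sketch below rebuilds the availability vector and the transition probabilities of Table 15.3 from the Table 15.1 data, and takes the capability matrix of Equation 15.20 as given (its entries are copied from the text rather than recomputed). Function and variable names are illustrative.

```python
import math

PERIOD = 40.0                          # transmission period, minutes
MTBF = {"A": 1200.0, "B": 2000.0}      # Table 15.1
MTTR = {"A": 60.0, "B": 80.0}          # Table 15.1

avail = {u: MTBF[u] / (MTBF[u] + MTTR[u]) for u in MTBF}
rel = {u: math.exp(-PERIOD / MTBF[u]) for u in MTBF}

# States in Table 15.2 order, as (A operating?, B operating?).
STATES = [(True, True), (True, False), (False, True), (False, False)]


def start_probability(state):
    """a_i: product of the unit availabilities or unavailabilities."""
    p = 1.0
    for unit, is_up in zip("AB", state):
        p *= avail[unit] if is_up else 1.0 - avail[unit]
    return p


def transition_probability(start, end):
    """d_ij with no in-use repair: a failed unit stays failed, and a
    working unit survives the period with probability R."""
    p = 1.0
    for unit, was_up, is_up in zip("AB", start, end):
        if was_up:
            p *= rel[unit] if is_up else 1.0 - rel[unit]
        else:
            p *= 0.0 if is_up else 1.0
    return p


a = [start_probability(s) for s in STATES]
d = [[transition_probability(s, e) for e in STATES] for s in STATES]

# Capability matrix as stated in Equation 15.20.
c = [
    [1.0, 1.0, 0.8478, 0.5509],
    [0.0, 1.0, 0.0, 0.3296],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

eff = sum(a[i] * d[i][j] * c[i][j] for i in range(4) for j in range(4))
print(f"E_ff = {eff:.3f}")
```

Only the first two rows contribute, because every transmission that starts with product A failed has zero capability in this example.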
This may be written as follows:

\[
E_{ff} = A_v \begin{bmatrix} \sum_{j=1}^{4} d_{1j} c_{1j} \\ \sum_{j=1}^{4} d_{2j} c_{2j} \\ \sum_{j=1}^{4} d_{3j} c_{3j} \\ \sum_{j=1}^{4} d_{4j} c_{4j} \end{bmatrix} \tag{15.22}
\]

Upon substituting the example values, we have

\[
E_{ff} = [\, 0.9158 \;\; 0.0366 \;\; 0.0458 \;\; 0.0018 \,] \begin{bmatrix} 0.9948 \\ 0.9780 \\ 0 \\ 0 \end{bmatrix} \tag{15.23}
\]

or $E_{ff} = 0.947$.

15.2.3 Model Applicability
It is impossible to support the proposition that a concept as complex as effectiveness can be universally quantified by a single model. As stated earlier, the proposed model provides a conceptual framework for effectiveness analysis. Products have to work to perform, have to be repaired if they fail, and have to do the job if they are operating. That is the essence of the model.

One of the criticisms often made of any effectiveness model is that, for complex products, no one single measure can be used to describe how well the product meets its objectives. For example, in evaluating a communication product, one may want to consider capacity, error rate, security, and a number of other factors. Although it may not be possible to develop a single measure for all these factors, one may develop effectiveness measures for each of the important factors, thus providing a set of measures for evaluation. The proposed model, in fact, has the ability to do this. If the capability vector is transformed to a capability matrix, with each column representing the capability associated with a particular form of output (e.g., capacity, errors, security), then the model will develop the vector solution. However, this is possible only if all the availability and dependability formulations are applicable for all capability measures.
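The idea of replacing the capability vector with a capability matrix can be sketched as a row vector times two matrices, giving one effectiveness value per capability column. The sketch assumes a single use period in which capability depends only on the state at the end of the period; the two-state product, the two measures, and all numbers are hypothetical.

```python
def effectiveness_vector(availability, dependability, capability):
    """Return one effectiveness value per column of the capability matrix.

    availability:  list of n start-state probabilities
    dependability: n x n matrix of state transition probabilities
    capability:    n x m matrix, one column per capability measure
    """
    n = len(availability)
    # Probability of ending the period in each state (row vector times D).
    end_state = [sum(availability[i] * dependability[i][j] for i in range(n))
                 for j in range(n)]
    n_measures = len(capability[0])
    return [sum(end_state[j] * capability[j][m] for j in range(n))
            for m in range(n_measures)]


# Hypothetical two-state product: state 0 = working, state 1 = failed.
availability = [0.9, 0.1]
dependability = [[0.95, 0.05],
                 [0.00, 1.00]]
# Columns: P(capacity requirement met), P(error-rate requirement met).
capability = [[1.00, 0.99],
              [0.00, 0.00]]

eff_capacity, eff_errors = effectiveness_vector(
    availability, dependability, capability)
print(f"capacity: {eff_capacity:.4f}, error rate: {eff_errors:.4f}")
```

Both measures share the same availability and dependability terms here, which is exactly the condition the text places on using a single model for several capability measures.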
15.3 FACTORS TO CONSIDER IN ANALYZING PRODUCT EFFECTIVENESS
The preceding sections provided a basis for quantifying product effectiveness. We will now examine the approaches and factors to consider in evaluating alternative designs or in analyzing fielded products. Figure 15.2 is a flow diagram for a typical program. A product effectiveness analysis is performed at several product and support levels. The first application is usually at the product or major subproduct level (e.g., computer system or data processor) and the associated support levels. Early applications provide decisions on overall design approaches and supply the basis for further analysis at lower hardware and support levels. Thus, analysis at the computer system level helps to define the overall architecture, and analysis at the data processor level will determine how inputs, computations, and outputs are to be handled by the central processing unit and associated hardware.
Figure 15.2 Flow diagram for a typical product effectiveness analysis.
Because data are often limited in the early stages, some decisions will have to be deferred and others made on a contingency basis. Until the design is frozen, the results of each iteration of product effectiveness analysis are used to refine the analytical model and criteria and early design and support decisions. Management must plan on performing the analyses at times consistent with key points in the product design process. These steps and the corresponding points in time of the design process are shown in Figure 15.3. The analysis proceeds to translate product requirements and constraints into requirements and constraints on the parameters of progressively smaller parts of the product, which are then fed to the applicable design groups. As this process continues, the relationships between the product components and the overall effectiveness of the product become better defined so that decisions among alternative approaches can be made on a factual, rather than subjective, basis.

Figure 15.3 Process for model implementation.

15.3.1 Phase I: Define Application, Product, and Logistics Support

Phase I involves a detailed study of those aspects of the application, product, and logistics support that will eventually form the basis for the effectiveness model (Phase III) and the establishment of design-decision criteria. The first task in this phase involves translating general application objectives into quantitative operational requirements suitable for effectiveness analysis. The overall objective of a command and control system (for example, to maintain command after a disaster) must be translated into specific requirements. These include the particular type of enemy attack to be survived, the environment under which the product must operate, the information available to and required by the system, and any real-time data requirements. If more than one application is involved, the first phase will lead to a set of operational requirements for each application. The resulting sets of requirements should be reduced to a composite set that defines a product that will be effective for the primary or most likely application types. Weighting by importance and probability of occurrence can be used in deriving the composite.

The operational requirements are used to define the product and major subproducts with respect to boundaries, functions, and constraints. The subproducts and their functions must be clearly defined through, for example, preliminary specifications, hardware sketches, and functional block diagrams. The interfaces between the product and the operational user, the logistics support functions, any larger product, and the application environment must be analyzed. In the early stages of product development, block diagrams are used to analyze multimodal capabilities, failure effects, and redundancy.
Analyses should also be made of other general reliability and maintainability design considerations, such as the use of on-board testing and modular avionics. Logistics support should be examined in areas such as maintenance levels, available repair facilities, and available maintenance skills. Various approaches for achieving high reliability, maintainability, and readiness must be formulated for further study, using the effectiveness model.

15.3.2 Phase II: Select Measures of Effectiveness

The definitions of product, logistics support, and application developed in Phase I provide a basis for formulating measures of effectiveness, using the factors discussed in Chapter 1 and earlier in this chapter. Some factors, such as reliability, apply to all products. Others are peculiar to a given type of product; maximum gross takeoff weight, for example, is associated only with aircraft. In selecting measures of effectiveness, care must be exercised to avoid limiting design options prematurely. For example, if a low infrared signature is required for an aircraft during supersonic cruise, then the measure of effectiveness should be stated in those terms. If the measure were restated as supersonic cruise without afterburner, the options left to the design team would be reduced.

The frequency with which the product will be used and the criteria for measuring performance must also be defined. Frequency of use is important because product effectiveness is based on meeting operational demands within a specified time period, or satisfying some other applicable use constraint such as one based on miles or number of attempts. Usually, the function performed will indicate the nature of this time or use requirement, but in some cases there are several possible choices. Three types of time or use requirements are described in Table 15.4, using time as the basis.

Table 15.4 Time or Use Requirements Related to Product Operational Demands

Type of Requirement | Description | Example
Instantaneous | Product must meet a given demand at a given instant for a short duration | Retro-rocket package for re-entry of orbiting satellite
Continuous interval | Product must meet a demand continuously for a given interval | Plane on a transport application
Fraction interval | Product must meet a demand for a specified fraction(s) of a given interval | Reconnaissance satellite that takes pictures on demand

Given a time or use requirement, it is then necessary to determine the performance measure(s) to consider, which are generally in terms of the product outputs, and how the performance measure(s) will be reflected in an overall measure of effectiveness. In terms of the model discussed earlier, the effectiveness measure is equivalent to the way capability is quantified. Naturally this measure must be related to the application objectives, but there are often a number of ways to express it. For example, consider the communication product example discussed earlier in this chapter. There we defined the communication product performance by the number of bits transmitted, and we defined effectiveness as the probability that a minimum number of bits would be transmitted during a certain time period. Instead, we could have chosen to define effectiveness as the expected number of bits to be transmitted. Instead of bits transmitted as the performance measure, we might have used some other communication measure such as error rate or transmit delay time.

We will discuss two basic forms of effectiveness measures given that a performance measure has been defined.

The minimum performance criterion. The minimum performance criterion specifies quantitative bounds on the output of the product.
These bounds define the range of acceptable performance; there is no assessment of the degree of acceptability within these bounds. Thus, this criterion leads to the commonly used dichotomous description of product performance: success or failure. For example, a criterion that states that a computer must perform a set of benchmark problems within a certain time period or that a bomb must be dropped within d miles of the center of the target is a minimum performance criterion. The overall performance criterion. The overall performance criterion concerns the complete distribution of output, considered in terms of the actual probability
distribution or of a related statistical measure, such as the expected value. This criterion may be used when the degree of conformance to application or functional requirements is important and when the minimum performance criterion is too artificial. Referring to the computer benchmark example, other effectiveness measures are the amount of output provided in a given time period or the time it took to complete the benchmark. The choice of which form of measure to use is not always clear. The application objectives, product functions, and associated output sometimes dictate the evaluation criterion. If the functional output is dichotomous (e.g., for a detection product, the output might be detection or no detection), the choice would be the minimum performance criterion using success probability as the measure. For a multi- or continuous-output product, the overall performance criterion can be used if success boundaries would be highly artificial or if interest centers on the statistical properties of the output such as the mean and variance. The choice between the minimum performance criterion and the overall performance criterion can be critical. For an effectiveness measure based on a minimum performance criterion, specified bounds on the product output define the region of acceptable performance and lead to the classification of success or failure. In the other method of quantifying effectiveness, using an overall performance criterion, the complete distribution of output is considered and is usually quantified by an appropriate statistical measure, such as the mean output. Products that have outputs for which partial or degraded information return may be of some value could possibly be measured under the overall performance criterion. In summary, the product output and associated effectiveness measures form the basis for evaluating the effectiveness of proposed designs. 
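The two forms of measure can be contrasted on the same output distribution. The sketch below is illustrative only: the discrete distribution of bits transmitted, the 800-bit minimum, and the function names are hypothetical and are not taken from the communication example earlier in the chapter.

```python
def success_probability(distribution, minimum_output):
    """Minimum performance criterion: P(output >= minimum_output)."""
    return sum(p for output, p in distribution.items() if output >= minimum_output)


def expected_output(distribution):
    """Overall performance criterion, summarized here by the mean output."""
    return sum(output * p for output, p in distribution.items())


# Hypothetical distribution of bits transmitted in the period: {output: probability}.
bits_distribution = {0: 0.05, 500: 0.15, 1000: 0.80}

p_success = success_probability(bits_distribution, minimum_output=800)
mean_bits = expected_output(bits_distribution)
print(p_success, mean_bits)
```

Under the minimum performance criterion the 500-bit outcome counts as a failure, whereas under the overall performance criterion it still contributes to the mean, which is why degraded but useful output favors the second form.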
This evaluation is performed through mathematical models developed to represent these measures and the associated costs and burdens involved in achieving the design objectives.

15.3.3 Phase III: Develop the Mathematical Model

Three tasks must be performed in this phase:

• selection of variables that affect product effectiveness;
• definition of existing and imposed economic constraints; and
• development of the mathematical relationships among the variables to express the effectiveness measures.
The variables are the product and logistics support parameters that influence product effectiveness and will appear in the model. Within the defined limitations on these parameters and on the model, the proper selection of these parameter values will yield a product with an optimal level of product effectiveness. Typical are those variables that affect the major components of effectiveness (reliability, maintainability, and performance) and that can be traded off against one another on the basis of some common denominator, which may be called worth.
Variables include complexity, the number of redundant units and the type of redundancy, the number and level of replaceable modules, and the type and frequency of preventive maintenance. As the effectiveness analysis is iterated, more “detailed variables” must be considered: for example, stresses on parts and components, maintenance accessibility factors, and the numbers and types of test points. The associated burden and benefit of each alternative should be determined. For example, increased weight, cost, and complexity constitute the burden of improving reliability through redundancy. The burden and benefit of improving reliability through redundancy can be compared with the burden and benefit of using ultrahigh-reliability parts.

The second task is to determine existing and imposed physical, support, and economic limitations. Imposed physical limitations can be explicitly required (e.g., available space) or implied by operational requirements; existing constraints are defined by the current state of the art and may include achievable failure rate levels, maintenance repair rates, and data transmission speeds. Constraints such as funding, desired date for fielding the product, and available maintenance staffing and skill levels should also be listed. Also to be considered are such economic and logistic limitations as total cost, development time, test product requirements, and maintenance manpower and skill level requirements. These factors must be introduced into the mathematical model to ensure that the product made will meet its operational requirements within the specified constraints, including cost, schedule, and support.

The third task is to develop the mathematical model for (1) estimating the effectiveness of proposed products, (2) evaluating the effectiveness potential of various alternatives, (3) trading off these alternatives, and (4) determining reliability and maintainability requirements at product levels.
A general model is first developed in terms of product states defined by the states of major product elements, as we discussed earlier in this chapter. By assessing the performance of these elements with respect to each subfunction, the effect of design and support decisions on overall product effectiveness can be determined. In this early stage, the capability analysis is a product engineering function, often relying more on basic principles because directly applicable data may be very limited. A complete analysis would provide probabilistic performance indices. For example, a cumulative distribution function of detection probability versus range might be appropriate for evaluating radar performance, and radar range equations could serve as the basis for the engineering analysis. At this point, a cost/effectiveness trade-off should also be conducted. For example, building in high reliability and maintainability increases initial investment costs; however, support costs over the product service life should be reduced. This type of trade-off should be investigated at the major product and support levels to narrow the choice of alternatives early in the development stage. Models should be developed and applied to the analysis of costs with respect to reliability and maintenance characteristics. Product failure triggers the support system and thus determines how often a particular product will consume support resources. The expenditure in terms of manpower and maintenance time is a function of the maintainability characteristics.
The framework of the cost model should provide the following information:

• comparative cost data on the operation of major subordinate organizations;
• the cost of supporting the units; identification of those units that might justify engineering changes;
• information on the comparative costs of making a given repair at different echelons of maintenance and the elasticity of support costs with respect to failure frequency;
• information needed to trade off support savings against increases in fixed investment (e.g., the introduction of new test equipment); and
• a tool such as a simulation model and applicable data with which suggested changes in the support organization can be evaluated before being initiated.
More detailed treatment of cost-effectiveness analysis is provided in the following sections.

15.3.4 Phase IV: Obtain Data Inputs

The required data inputs for the model consist of such items as part or component reliability and maintainability parameters, costs, weight and space estimates, and other data on pertinent physical, engineering, or economic factors. Initially, such inputs may be obtained from past experience and from appropriate estimation techniques. As early design approaches become more definite and component and unit development progresses, these inputs should be refined and the model iteratively run. It is important that policy and procedures be established to ensure that the effectiveness analysis team is aware of the data-generating activities during development and that it has a say on the data to be collected.

15.3.5 Phase V: Exercise, Interpret, and Refine Model

The steps in this phase are essentially:

• designing a product that satisfies constraints;
• computing the values of effectiveness and worth;
• comparing these values with requirements;
• making generalizations concerning appropriate combinations of design support factors;
• revising the factors and rerunning the model; and
• refining the model as additional data become available.
Schematically, this process might be represented as in Figure 15.4. Only those designs that meet physical and economic limitations are actually evaluated by the model. The range of designs, however, is naturally restricted by the customer’s requirements. With the aid of the model, a conceptual design is translated into actual hardware configurations and support plans that, given the constraints under which the product is being developed, should provide the highest level of product effectiveness.
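One pass through this screening-and-selection loop might be sketched as follows. The candidate designs, the cost and weight limits, and the function name are hypothetical; in practice the effectiveness value of each candidate would come from the model developed in Phase III.

```python
def best_feasible_design(candidates, max_cost, max_weight):
    """Screen candidates against the constraints, then pick the most effective."""
    feasible = [
        d for d in candidates
        if d["cost"] <= max_cost and d["weight"] <= max_weight
    ]
    if not feasible:
        return None  # no design qualifies; the design factors must be revised
    return max(feasible, key=lambda d: d["effectiveness"])


# Hypothetical candidate designs (cost in $K, weight in kg).
candidates = [
    {"name": "single string", "cost": 400, "weight": 50, "effectiveness": 0.90},
    {"name": "dual redundant", "cost": 650, "weight": 85, "effectiveness": 0.97},
    {"name": "triple redundant", "cost": 900, "weight": 120, "effectiveness": 0.99},
]

choice = best_feasible_design(candidates, max_cost=700, max_weight=100)
print(choice["name"])
```

The most effective candidate overall is excluded here because it violates both limits, so the selection is made only among designs that satisfy the constraints.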
Figure 15.4 Process for model implementation. [Plot residue: curves labeled R&D, Investment, and Annual operating; axes Cost ($) versus Time.]
15.4 COST-EFFECTIVENESS ANALYSIS
In this section we bring cost into the effectiveness picture in order to describe a more complete approach to choosing among competing designs, operational methods, and logistic structures. We can define cost-effectiveness analysis as the process of comparing alternative solutions for meeting product or application requirements in terms of value received (effectiveness) and resources expended (costs). Note that cost is the measure of resource usage, and effectiveness is the measure of value received. We must recognize that effectiveness may not include all of the value elements of a product and that cost does not embrace all of the resources required. For example, resource requirements in terms of such factors as personnel skills and schedule delays are often difficult to translate to cost measures. Thus, one must ensure that the documentation associated with any cost-effectiveness analysis includes the important elements that are not explicitly considered as cost or effectiveness numbers.

Cost-effectiveness analysis became prominent in the early 1960s in large-scale military development and acquisition projects. It evolved from economic analysis work (termed cost-benefit analysis) done several decades earlier, such as that done on the flood control project in the late 1930s. Another related term is product analysis, which embraces many of the same ideas as cost-effectiveness analysis but is not as definitive in requiring that cost and effectiveness numbers be produced. The analytic output of a cost-effectiveness analysis may then be fed into the higher level system analysis framework in order that the decision maker can act on it in conjunction with his expert judgment and intuition in deciding on the best course of action.

15.4.1 Cost Categorization

Product costs have been categorized in a number of different ways, depending on the product type and applicability of available data.
A cost categorization should focus attention on the major resources that will be consumed during the life of the product. A broad categorization based on program phases includes costs associated with research and development, investments, and operation. Examples of the types of costs in each major category are given in Table 15.5.

• Research and development costs include all the costs necessary to bring a product into the production or procurement phase.
• Initial investment costs comprise all costs incurred in introducing a product into the active inventory. They include production or procurement costs, facility costs, personnel training, installation, and procurement of initial spares.
• Operating costs are all costs necessary to the operation of the product once it has been phased into the operational inventory. Although both research and development (R&D) and investment costs are incurred just once, operating costs continue throughout the lifetime of the product.

Table 15.5 Cost Categorization by Program Phase

Research and Development Costs
  Design and development
    • Preliminary research and design studies
    • Development engineering and hardware fabrication
    • Development instrumentation
    • Industrial facilities
  Product test
    • Test vehicle fabrication
    • Test vehicle spares
    • Test operations
    • Test support product
    • Test facilities
    • Data collection, reduction, analysis, and storage
    • Maintenance, supply, miscellaneous
  Product management and technical direction

Initial Investment Costs
  Equipment
    • Primary application equipment
    • Support equipment
    • Other equipment
  Stocks
    • Application product and product spares
    • Equipment support and part spares
    • Consumables
  Initial training
  Installation
    • Construction of facilities
    • Platform modifications
  Miscellaneous investment
    • Technical data
    • Transportation and travel
    • Administrative and support costs

Operating Costs
  Equipment and installation replacement
    • Primary application equipment
    • Specialized equipment
    • Other equipment
    • Installations
  Maintenance and support
    • Primary application equipment
    • Specialized equipment
    • Other equipment
  Recurrent training
  Inventory management
  Management and technical data
  Facilities
  Operation costs
    • Personnel
    • Fuel
    • Power
    • Other
The curves of Figure 15.5 show a typical distribution of these costs over the life cycle of the product. The term life-cycle cost is often used to represent the sum of the R&D, investment, and operating costs for a period representing the expected lifetime of the product. It is important to note that when alternatives are compared, all costs that will not affect the decision should be excluded. For example, assume that life-cycle costs are to be estimated for several alternatives and they include depot repair. If there is no foreseeable reason for depot management costs to be dependent on the selected alternative, such costs should be excluded in order to simplify the problem. Another cost categorization is shown in Figure 15.6. Each of the eight categories is subdivided. Figure 15.7 shows a breakdown of the development cost category.

Figure 15.5 Distribution of product costs over the life cycle. [Diagram labels: Cost of ownership (0000); Development (1000); Procurement (2000); Installation (3000); Maintenance (4000); Operation (5000); Management and technical services (6000); Modification (7000); Disposal (8000).]

Figure 15.6 Total cost of ownership. [Diagram labels: Development (1000), divided into Prime equipment development (1100) and Special support equipment development (1200); each is divided into Initial research and development (1110, 1210), Hardware for technical evaluation (1120, 1220), Initial technical evaluation (1130, 1230), Operations evaluation (1140, 1240), and Other (1150, 1250); Initial technical evaluation is further divided by performer (contractor, government laboratory, naval shipyard, commercial laboratory; 1131 to 1134 and 1231 to 1234).]

15.4.2 Cost Estimation

Three general methods for cost estimation are bottom up, top down, and analogy with similar products.

Bottom-up method. The bottom-up approach, sometimes called the accounting or grassroots method, attempts to estimate costs by breaking expenditures down into elemental categories called a work breakdown structure (WBS). Estimates are made of required labor and materials for each category; these are then used with standard labor rates and material costs to estimate category costs. These category costs are then aggregated to higher level cost categories to build up the cost estimate. Higher level costs such as facilities are introduced into the buildup process as necessary. Estimates of operation and support (O&S) costs usually involve some form of bottom-up approach, at least with respect to logistics and maintenance costs. As a simple example, if there are 200 products and each operates 40 hours a month and has an MTBF of 1,000 hours, then

$$ \text{Expected number of failures per month} = \frac{200 \times 40}{1000} = 8 \tag{15.24} $$
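The arithmetic of Equation 15.24 generalizes directly to other fleet sizes and usage rates. In the sketch below, the function name and the cost to restore one failed product are hypothetical; the cost figure is included only to show how a per-failure estimate would be aggregated over a year.

```python
def expected_failures(n_products, operating_hours_each, mtbf_hours):
    """Expected failures in a period: total operating hours divided by MTBF."""
    return n_products * operating_hours_each / mtbf_hours


# Values from the example: 200 products, 40 hours a month each, MTBF of 1,000 hours.
failures_per_month = expected_failures(200, 40, 1000)

# Hypothetical cost to restore one failed product (labor, materials, transport).
cost_per_failure = 1500.0
annual_failure_cost = 12 * failures_per_month * cost_per_failure
print(failures_per_month, annual_failure_cost)
```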
Figure 15.7 Development cost. [Plot residue: Cost ($) versus System Characteristic (e.g., Weight), with curves for performance levels 1, 2, and 3 (e.g., Payload) and a line of minimum cost points.]

Estimates of the manpower, materials, and other costs (e.g., transportation costs if the repair facility is not on site) needed to restore the product to operating
condition are used to provide an estimate of the cost of failure, which is then aggregated over the year or time period of interest. In addition to the active repair costs, MTBF and pipeline time determine how many spares should be bought to ensure a required level of product availability. If there is a range of costs that may be applicable for some activities or materials, these can be incorporated into the aggregation methodology to provide a measure of uncertainty in the estimate. As one simplistic approach, by using all the high (low) cost estimates, a bound on the maximum (minimum) cost is provided. If cost distributions for uncertain cost elements are known or can be assumed, then statistical procedures based on sums of random variables can be used for developing a distributional estimate of the aggregated cost. Top-down method. The top-down method, also called the parametric method, uses historical data and statistical techniques in an attempt to find a relationship between high-level costs and a set of product parameters that is applicable to the product under study. The term cost-estimating relationship (CER) is often used for these types of cost-estimating equations. A typical CER development exercise involves collecting data on applicable products of a general class—for example, radar products. Typically, separate CERs are developed for each of the major cost categories (R&D, investment, operation) because factors that influence one cost category may have little influence on another or may even have an opposite effect. For example, a program to produce a product with an ultrahigh reliability may increase development costs over that of a product with a typical reliability level; it is hoped, however, that operational costs may be greatly reduced because of the reduction in the number of failures. Data elements to consider include factors related to product performance, physical attributes, and costs. 
Data may also be collected on factors related to technology and the acquisition environment. For example, in a computer-costing exercise, it may be important to know the amount of large-scale integration in each of the computers in the data set. If some products were bought in a sole-source environment and others were procured competitively, that may have an important bearing on the prices paid. Factors such as reliability requirements or length of warranty coverage could also impact costs. Using such techniques as multiple regression analysis, factors that are highly correlated to the cost numbers of interest are selected by the regression technique as “significant” and are included in the predictive equation. For radar products, the CER may depend on such technical factors as range and sensitivity; for computers, CPU speed and memory capacity may be good cost predictors; for aircraft, range, speed, and carrying capacity are obvious candidates. The statistical processes used to develop a CER also provide measures of the strength of the relationship and can provide measures of uncertainty in the estimate through such techniques as confidence intervals. A great deal of engineering judgment is required in deciding which factors to include in the database, what adjustments must be made to the cost numbers to account for unusual events, and what screening criteria to use to ensure that the resultant equation makes engineering as well as statistical sense.
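A one-variable version of the regression step can be sketched as below. It is a simplification: a practical CER would use multiple regression over several candidate factors together with significance screening, as described above. The data points, the choice of peak power as the single predictor, and the function names are all hypothetical.

```python
import math


def fit_log_linear_cer(parameter_values, costs):
    """Least-squares fit of ln(cost) = b0 + b1 * ln(parameter)."""
    xs = [math.log(v) for v in parameter_values]
    ys = [math.log(c) for c in costs]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def predict_cost(b0, b1, parameter_value):
    """Apply the fitted relationship to a new parameter value."""
    return math.exp(b0 + b1 * math.log(parameter_value))


# Hypothetical historical data: peak power (kW) and development cost ($K).
peak_power = [50, 100, 200, 400, 800]
dev_cost = [310, 420, 650, 880, 1300]

b0, b1 = fit_log_linear_cer(peak_power, dev_cost)
print(round(b1, 3), round(predict_cost(b0, b1, 300)))
```

A prediction at 300 kW is an interpolation within the hypothetical data range; applying the same equation well outside that range would be an extrapolation and carries more uncertainty.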
To illustrate a CER, some years ago Aeronautical Radio, Incorporated (ARINC) developed the following equation to predict radar product development costs:

$$ \ln C_{st} = 0.784 + 0.205 \ln A_{pr} + 0.165\, D_{dev} + 0.151 \ln P_{k} + 0.028\, S + 0.082\, SC + 1.37\, TD \tag{15.25} $$

where $A_{pr}$ is antenna aperture in square feet; $D_{dev}$ is the degree of development (score 0, 1, 2); $P_{k}$ is the peak power in kilowatts; $S$ is the sensitivity in negative dBm; $SC$ is the number of special circuits; $TD$ is the type of development (new = 2; modification = 1); and $C_{st}$ is the cost of development plus cost of first prototype.

Analogy with similar products. The analogy method uses cost data from products with similar characteristics, which are then adjusted to account for known differences between the current products and the one being evaluated. Sources of data may range from price lists to costs incurred under previous procurement contracts. In many cases, the adjustment will involve extrapolation: increased CPU speed for a computer, increased range and better fuel consumption for an engine, or a graphical user interface for a software purchase. Again, experience and good engineering judgment will be required to determine which historical data are relevant and the adjustments to make to account for differences between the past and present. The analogy method clearly will give good results when the products and the acquisition environments of the analogous product and subject product are similar. As the differences increase, the uncertainty about the accuracy of the analogous estimate increases, and the degree of uncertainty is difficult to quantify.

15.4.3 Cost Adjustments

Several “standard” techniques are applied to cost estimates to account for significant cost influences that may not be explicitly included in the initial estimate. Two of the more important relate to economy of scale and discounting the time value of money.

Economy of scale.
When products are made, the production quantity will usually have a significant impact on the unit cost. This phenomenon is known as economy of scale and can be explained by a number of factors (e.g., the ability to buy larger lots of raw material, which is itself an economy of scale; the spreading of fixed costs over a larger number of units; and learning effects). A generalized equation that reflects this phenomenon is

$$K_{CA} = (P^{*}/P)^{a_c} \tag{15.26}$$

where $P^{*}$ is a reference or standard production size; $P$ is the production lot size under consideration; $a_c$ is a positive constant; and $K_{CA}$ is the cost adjustment factor to adjust a cost based on a production lot size of $P^{*}$.

A typical learning curve equation reflecting reduction in time (or some other measure of resource expenditure such as man-hours) as a result of increasing usage or application (i.e., production, maintenance) is given by

$$T(R_C) = A_t R_C^{\,b_c} \tag{15.27}$$

where $T(R_C)$ is the time required for the $R_C$th unit; $A_t$ is the time for the first unit; $R_C$ is the cumulative unit number; and $b_c$ is a negative constant.

If we assume that the resource usage declines by a constant percentage for each doubling of usage, the constant $b_c$ would be determined as follows:

$$b_c = \frac{\ln(\mathit{percent}/100)}{\ln 2} = \frac{\log_{10}(\mathit{percent}) - 2}{\log_{10} 2} \tag{15.28}$$

where percent is the resource usage after a doubling, expressed as a percentage of the usage before the doubling (e.g., 80 for an 80% learning curve).
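As a minimal sketch of Equations 15.26 through 15.28, the following Python fragment evaluates the scale adjustment factor and the learning curve. The function names and all numerical inputs (a first-unit time of 100 hours, an 80% learning curve, and a scale exponent of 0.5) are illustrative assumptions, not values taken from the text.

```python
import math


def scale_adjustment(p_ref: float, p: float, a_c: float) -> float:
    """Equation 15.26: cost adjustment factor K_CA = (P*/P)**a_c."""
    return (p_ref / p) ** a_c


def learning_exponent(percent: float) -> float:
    """Equation 15.28, with the learning percentage given as a number such as 80."""
    return math.log(percent / 100.0) / math.log(2.0)


def unit_time(first_unit_time: float, unit_number: int, b_c: float) -> float:
    """Equation 15.27: time for the unit with cumulative number `unit_number`."""
    return first_unit_time * unit_number ** b_c


# Hypothetical inputs: 80% learning curve, 100 hours for the first unit.
b_c = learning_exponent(80.0)
times = [unit_time(100.0, r, b_c) for r in (1, 2, 4, 8)]

# Hypothetical inputs: reference lot of 100 units, lot of 400 units, a_c = 0.5.
k_ca = scale_adjustment(p_ref=100.0, p=400.0, a_c=0.5)

print(f"b_c = {b_c:.4f}")
print("unit times:", [round(t, 1) for t in times])
print(f"K_CA = {k_ca:.2f}")
```

With these assumed inputs, each doubling of the cumulative unit number multiplies the unit time by 0.8, which is the defining property of an 80% learning curve.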
Discounting. If work is to be performed over a period of time, it should be clear that, setting aside issues such as surety of payment, it would be better to be paid up front than after the job is finished. By being paid up front, one can deposit the money in a bank so that, at the end of the job, one will have the agreed-to amount plus any interest earned. The payer, however, may recognize that, by paying up front, he is losing that potential interest because he must take the money out of the bank to make the payment. He may therefore propose paying a discounted amount that represents his loss of interest. If the discount rate matches the interest rate earned when the money is deposited, the payee will have the full amount at the end of the job and the payer's bank account will be the same as it would be if he had paid at the end of the job. Discounting, therefore, is a process that considers the time value of money. It should be applied to all out-year expenditures so that the costs are all in constant year dollars, which is necessary for an accurate assessment. The discounted amount of an out-year expenditure is called the present value and is computed as follows:

$$PV = \frac{C_{fe}}{(1 + i_d)^{n}} \tag{15.29}$$

where $PV$ is the present value; $n$ is the number of periods in the future; $C_{fe}$ is the future expenditure, $n$ periods in the future; and $i_d$ is the discount rate per period.

Consider Table 15.6 as a simple example of comparing the costs of two products (a 10% discount rate was used).

Table 15.6 Example of Comparing Costs of Two Products

Year    Discount   Project A     Project A          Project B     Project B
        Factor     Expenditure   Discounted Costs   Expenditure   Discounted Costs
1       0.91       1000          910                500           455
2       0.83       500           415                500           415
3       0.75       1000          750                500           375
4       0.68       500           340                1000          680
5       0.62       500           310                1000          620
Total              3500          2725               3500          2545

Both products have the same total expenditure, $3,500, before discounting. However, because the majority of product B's costs are incurred in the last 2 years, its total discounted cost is less than that of product A, where over 70% of the expenditure occurs in the first 3 years. With everything else being equal, product B is the less costly product.

Note that discounting and inflation are two different concepts. Inflation refers to the buying power of a unit of currency when compared to some base year, and discounting refers to the value of having the money. Cost effectiveness analyses must take the time value of money into account, but usually need not deal with inflation directly unless there are some unusual circumstances.

15.4.4 Cost Uncertainty and Cost Sensitivity

In the discussion of the types of cost estimates that can be used, we indicated that uncertainty is an issue that must be considered. There is usually uncertainty in every cost estimate, whether it be the cost of a small piece part or elemental activity in the bottom-up approach, the result of applying a CER in the top-down approach, or the adjustment of the cost of a similar product using the analogy method. The CER method provides the most direct way of dealing with uncertainty because the statistical techniques used in developing the CER usually can provide measures of uncertainty in the form of standard errors or confidence interval factors. But, even here, there may be additional uncertainty concerning the applicability of the historical data to the product and associated environment under consideration, or the prediction may be an extrapolation of the past rather than an interpolation.
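The discounting comparison of Table 15.6 also offers a simple illustration of an uncertain cost input, namely the discount rate. The following Python sketch reproduces the table totals using Equation 15.29 with discount factors rounded to two decimals, as in the table, and then repeats the comparison over a range of discount rates. The function names and the particular rates swept are assumptions chosen for illustration only.

```python
def present_value(cost: float, rate: float, n: int) -> float:
    """Equation 15.29: present value of a cost incurred n periods in the future."""
    return cost / (1.0 + rate) ** n


def discounted_total(expenditures, rate, round_factors=False):
    """Sum of discounted yearly expenditures; year 1 is the first entry."""
    total = 0.0
    for year, cost in enumerate(expenditures, start=1):
        factor = present_value(1.0, rate, year)
        if round_factors:
            factor = round(factor, 2)  # mimic the two-decimal factors of Table 15.6
        total += cost * factor
    return total


# Expenditure streams from Table 15.6 (years 1 through 5).
PROJECT_A = [1000, 500, 1000, 500, 500]
PROJECT_B = [500, 500, 500, 1000, 1000]

table_a = discounted_total(PROJECT_A, 0.10, round_factors=True)
table_b = discounted_total(PROJECT_B, 0.10, round_factors=True)

# Vary the discount rate (assumed range) to see whether the ranking changes.
sweep = {
    rate: (discounted_total(PROJECT_A, rate), discounted_total(PROJECT_B, rate))
    for rate in (0.02, 0.05, 0.10, 0.15, 0.20)
}
b_always_cheaper = all(b < a for a, b in sweep.values())

print(f"Table 15.6 totals: A = {table_a:.0f}, B = {table_b:.0f}")
for rate, (a, b) in sweep.items():
    print(f"rate {rate:.0%}: A = {a:.0f}, B = {b:.0f}")
```

Because product B's expenditures fall later than product A's, B remains the less costly product at every positive rate in the sweep, so in this example the choice does not hinge on the exact discount rate.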
For the other two methods, it is not as easy to develop quantitative measures of uncertainty although, as indicated earlier, statistical theory can be used when summing up uncertain cost numbers in the bottom-up approach in much the same manner as we do in a program (or project) evaluation and review technique (PERT) analysis, where pessimistic, most likely, and optimistic activity times are used to develop a distribution of project time. For all three methods, when the costing exercise involves a large-scale product where life-cycle cost is to be estimated, there is also the concern that all relevant and significant cost elements have been identified.

Whether quantitative measures of uncertainty can be provided or not, it is the responsibility of the cost analyst to provide, as explicitly as possible, cost uncertainty information and to help the decision maker in evaluating the effect of the uncertainty. One way to get further insight into the effects of uncertainty is to perform a sensitivity analysis. By varying a cost variable about which one is uncertain (say, a labor rate), one can determine how sensitive the overall cost is to variations in that variable. If two products have the same effectiveness, but A is less costly than B using best estimates of elemental costs, then this type of exercise provides an indication of how sensitive the decision to select A is to errors in the labor rate. If, for example, the same decision would be made even if the labor rate was, at the extreme, favorable to B, then the labor rate uncertainty becomes much less of a concern.

15.4.5 Combining Effectiveness and Cost

We have now established some methodology for developing effectiveness and cost measures to be used to evaluate candidate products. If we have two products, A and B, we can conceptualize four possibilities when comparing their effectiveness and cost measures, as shown in Table 15.7.

Table 15.7 Comparisons of Effectiveness and Cost Measures of Two Products

          Effectiveness   Cost       Apparent Decision
Case 1    A better        A better   A
Case 2    A better        B better   ?
Case 3    B better        A better   ?
Case 4    B better        B better   B

Cases 1 and 4 show complete dominance of one product over another, whereas the situation is unclear regarding cases 2 and 3. Taking case 2 as an example, if A is much better than B in effectiveness, but almost equal to B in cost, then A might be chosen. But what if cost is the predominant criterion? Even for the dominant cases, the nonquantifiable factors and the uncertainty in the estimates for one product when compared to another may make the decision less obvious than it appears to be.

Another complication concerns the concept of leverage, or the influence a product may have on factors not explicitly considered in the cost or effectiveness models. As an example, suppose the results in Table 15.8 were obtained for two engine types to be used in a new helicopter. On the surface, the $15 million savings in cost of engine A over that of B would make it the logical choice.
However, suppose the design of engine B will allow it, with modification, to be used in an airplane also. If, for example, selecting engine B would reduce the expected development cost for the airplane engine by $30 million, then that amount of leverage may make B the better choice.

Table 15.8 Example of Leverage

                  Powerplant A   Powerplant B
Total cost ($)    60,000,000     75,000,000
Effectiveness     0.95           0.95
Figure 15.8 Equal performance curves: cost versus product characteristic.

Figure 15.9 Cost versus performance trade-off.
Although we cannot provide a decision methodology that can be consistently used when cost and effectiveness values are provided for alternative designs or alternative products, the general agreement is that the decision maker is better off having the cost and effectiveness values than not having them. However, some techniques that can be used to help in the decision process will now be discussed.

Product design studies. We will first discuss the use of effectiveness analysis when alternative designs for a given product class are being considered. One of the desirable outputs is a cost versus effectiveness trade-off curve. Consider, for example, a new transport aircraft in which a particular performance parameter, such as payload, is of interest and some product characteristic, such as engine thrust or aircraft weight, is being analyzed. Figure 15.8 shows the relationship between cost and the product characteristic for several performance levels. In economic terms, these curves are known as isoquants (i.e., equal quantities). If we take the minimum costs for each of the performance levels and plot them on a graph of cost versus performance, we can develop a trade-off curve showing the lowest cost for any performance level. This is illustrated in Figure 15.9. The trade-off curve then becomes a useful tool for the decision maker to consider in selecting a design alternative. If, in fact, all relevant factors are included in the cost and performance measures, the curve provides an optimized solution.

Product comparison studies. These studies involve comparing two or more products designed to accomplish the same application. The products may or may not be similar in design or operation (e.g., trucks or trains can be used to transport material, and both transportation forms are capable of meeting the objectives). If it is possible to develop a cost versus effectiveness curve for each product as we described before, we may have a result similar to Figure 15.10.
Here we see that, for effectiveness values less than $E_0$, product A provides the lower cost solution and that, for the higher effectiveness values, product B is preferred.

Figure 15.10 Cost effectiveness curves: products A and B.

Two possible ways to approach this situation are fixing effectiveness or fixing costs:
• Fixed effectiveness: if a required effectiveness or performance level can be specified, then the product that meets that level at minimum cost is the preferred choice, without considering any leverage effects.
• Fixed cost: if a total cost has been budgeted, then the product that provides the maximum effectiveness for that cost level is the preferred choice, without considering any leverage effects.
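The two selection rules above can be sketched in a few lines of Python. The candidate list is entirely hypothetical (names, effectiveness values, and costs in arbitrary units are invented for illustration); it is arranged so that product A is cheaper at low effectiveness and product B is cheaper at high effectiveness, in the spirit of Figure 15.10.

```python
from typing import NamedTuple, Optional, Sequence


class Option(NamedTuple):
    name: str
    effectiveness: float
    cost: float


def pick_fixed_effectiveness(options: Sequence[Option],
                             required: float) -> Optional[Option]:
    """Fixed effectiveness: cheapest option that meets the required level."""
    feasible = [o for o in options if o.effectiveness >= required]
    return min(feasible, key=lambda o: o.cost) if feasible else None


def pick_fixed_cost(options: Sequence[Option],
                    budget: float) -> Optional[Option]:
    """Fixed cost: most effective option that fits within the budget."""
    feasible = [o for o in options if o.cost <= budget]
    return max(feasible, key=lambda o: o.effectiveness) if feasible else None


# Hypothetical design points for two products (illustrative values only).
OPTIONS = [
    Option("A-low", 0.80, 40.0),
    Option("A-high", 0.92, 70.0),
    Option("B-low", 0.85, 55.0),
    Option("B-high", 0.95, 65.0),
]

print(pick_fixed_effectiveness(OPTIONS, required=0.80))
print(pick_fixed_effectiveness(OPTIONS, required=0.90))
print(pick_fixed_cost(OPTIONS, budget=60.0))
```

Neither rule accounts for leverage effects or nonquantifiable factors, so the result is best read as an input to the decision maker rather than as the decision itself.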
Fixing effectiveness or cost is the more desirable way to proceed if it makes sense to do so. In many cases, it may not be desirable to fix either effectiveness or cost, but they may be constrained. For example, the challenge to the cost-effectiveness analyst is to provide information so that the decision maker can select the best product that has an effectiveness of at least $E_w$ and a total cost of not more than $C_w$. This type of situation is illustrated by the shaded region shown in Figure 15.11. Any combination of cost and performance inside the region represents an acceptable solution. For the example shown, we see that neither A nor B dominates and the choice is still not clear. One approach that is often used is to compute a ratio of effectiveness to cost; the resultant value then has a "bang for the buck" characteristic. In some cases, this is acceptable, but the approach has been criticized as one that "reaches for corners."

Figure 15.11 Cost effectiveness curves: region of acceptability.

The absence of a standard criterion does not diminish the value of cost-effectiveness analysis. It means that as much information as possible must be made available to the decision maker. Although the information may not be amenable to being wrapped up into a neat individual number, it can be displayed in a manner that facilitates its use in conjunction with the decision maker's expert judgments. This challenges the cost-effectiveness analyst to maintain flexibility in modeling the product so that the evaluation is adaptable to changing information requirements.

15.5 SUMMARY
A Markov model framework for combining product reliability, maintainability, and performance characteristics into an overall effectiveness measure was introduced in this chapter. Although this model framework has many simplifying characteristics, it provides a basis for analyzing complex products through appropriate extension, as illustrated by the communications system example in Section 15.2.2. The approaches and factors to consider in evaluating the effectiveness of alternative designs were discussed, with specific focus given to cost issues. The elements of the three major cost categories (research and development, initial investment, and operating) and cost estimating methods were reviewed. Also addressed were the issues of economy of scale, discounting, and cost uncertainty and sensitivity. The last section dealt with the challenge of combining cost and effectiveness values for use by the decision maker.

REFERENCE
Shooman, M. L. 1990. Probabilistic reliability: An engineering approach. Malabar, FL: Robert E. Krieger.

ADDITIONAL READING
ARINC Research Corporation. 1969. Guidebook for systems analysis/cost effectiveness. U.S. Army Electronics Command.
DARCOM P700-6 (Army), NAVMAT P5242 (Navy), AFLCP/AFSCP 800-19 (Air Force). 1977. Joint-design-to-cost guide: Life-cycle cost as a design parameter. Washington, DC: U.S. Department of Defense.
Dhillon, B. S. 1989. Life-cycle costing: Techniques, models and applications. New York: Gordon and Breach, Science.
English, J. M. 1968. Cost effectiveness: The economic evaluation of engineering systems. New York: John Wiley & Sons.
Fabrycky, W. J., and B. S. Blanchard. 1991. Life-cycle cost and economic analysis. Englewood Cliffs, NJ: Prentice Hall.
Goldman, T., ed. 1967. Cost-effectiveness analysis. New York: Frederick A. Praeger.
Michaels, J. V., and W. P. Wood. 1989. Design to cost. New York: John Wiley & Sons.
Ostwald, P. F. 1992. Engineering cost estimating, 3rd ed. Englewood Cliffs, NJ: Prentice Hall.
Quade, E. S., and W. I. Boucher. 1968. Systems analysis and policy planning. New York: American Elsevier.
Taguchi, G. 1992. Taguchi methods: Research and development. Englewood, CO: ASI Press.
CHAPTER 16
Process Capability and Process Control
Diganta Das, Michael Pecht

CONTENTS
16.1 Introduction
16.2 Average Outgoing Quality
16.3 Process Capability
16.4 Statistical Process Control
    16.4.1 Control Charts: Recognizing Sources of Variation
        16.4.1.1 Constructing a Control Chart
16.5 Examples of Control Chart Constants
References
Homework Problems

16.1 INTRODUCTION
Quality is a measure of a product’s ability to meet the workmanship criteria. This chapter introduces the concepts of process capability and the basics of the statistical process control techniques used to attain and maintain part and product quality.
16.2 AVERAGE OUTGOING QUALITY

A measure of product quality is average outgoing quality (AOQ). It is typically defined as the total number of products per million (ppm) that are outside manufacturer specification limits during the final quality control inspection (Ackermann and Fabia 1993). A high AOQ indicates a high defective count and therefore a poor quality level. AOQ is defined in Equation 16.1, referring to Figure 16.1:

$$\mathrm{AOQ} = \frac{\text{Shaded area under the process curve}}{\text{Total area under the process curve}} \times 10^{6} \tag{16.1}$$

Figure 16.1 Visualization of average outgoing quality.

In Figure 16.1, USL is the upper specification limit, LSL is the lower specification limit, and $\mu$ is the process mean.

For example, manufacturers conduct visual, mechanical, and electrical tests to measure AOQ of electronic products. Visual and mechanical tests include dimensions, solderability, and bent leads. Electrical tests include functional and parametric tests at room temperature, high temperature, and low temperature.

The formulae for AOQ calculations may differ among manufacturers. For example, the formula for AOQ based on JEDEC standard JESD 16-A (JEDEC 1995) is

$$\mathrm{AOQ} = P \times LAR \times 10^{6} = \frac{D}{N} \times LAR \times 10^{6} = \frac{D}{N} \times \frac{AL}{TL} \times 10^{6} \tag{16.2}$$

where $P = D/N$ is the fraction of defective products; $D$ is the total number of defective products; $N$ is the total number of products tested; $LAR$ is the lot acceptance rate; $AL$ is the total number of accepted lots; and $TL$ is the total number of lots tested. IDT, a semiconductor manufacturer, provided AOQ based on the following formula:

$$\mathrm{AOQ} = P \times 10^{6} = \frac{D}{N} \times 10^{6} \tag{16.3}$$

where $D$ is the total number of defective products and $N$ is the total number of products tested. Most manufacturers provide data in terms of the number of defective products, $D$, and the sample size, $N$.
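A minimal Python sketch of Equations 16.2 and 16.3 follows. The function names and the sample counts (5 defective out of 10,000 tested, 95 of 100 lots accepted) are hypothetical and serve only to show the arithmetic.

```python
def aoq_ppm(defective: int, tested: int) -> float:
    """Equation 16.3: AOQ in ppm as the fraction defective times 10**6."""
    return defective / tested * 1e6


def aoq_ppm_with_lar(defective: int, tested: int,
                     accepted_lots: int, total_lots: int) -> float:
    """Equation 16.2: AOQ in ppm, weighted by the lot acceptance rate AL/TL."""
    lot_acceptance_rate = accepted_lots / total_lots
    return defective / tested * lot_acceptance_rate * 1e6


# Hypothetical inspection results.
simple = aoq_ppm(defective=5, tested=10_000)
weighted = aoq_ppm_with_lar(defective=5, tested=10_000,
                            accepted_lots=95, total_lots=100)

print(f"AOQ (Eq. 16.3) = {simple:.0f} ppm")
print(f"AOQ (Eq. 16.2) = {weighted:.0f} ppm")
```

The two formulae agree only when every lot is accepted, which is one reason AOQ figures from different manufacturers are not always directly comparable.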
16.3 PROCESS CAPABILITY

AOQ is a measure of the quality of products as they leave the production facility. In contrast, process capability is a measure of conformance to customer requirements and is typically measured at key process steps. A process capability assessment is conducted to determine whether a process, given its natural variation, is capable of meeting established requirements or specifications. It can help to identify changes that have been made in the process and determine the percent of product or service that does not meet the requirements. If the process is incapable of making products that conform to the specifications, then a different process needs to be selected or, in some cases, specifications may have to be changed because they may have been set in an unrealistic manner.

Figure 16.2 shows specification limits of a product; these are usually based solely on the customer requirements and are not meant to reflect on the capability of the process. Specification limits are used to determine if the products will meet the expectations of the customer. Figure 16.2 overlays a normal distribution curve on top of the specification limits.

Figure 16.2 Measuring conformance to the customer requirements (the specification width lies between the lower specification limit, LSL, and the upper specification limit, USL).

To determine the process capability, the first step is to determine the process grand average, $\bar{\bar{X}}$, and the average range, $\bar{R}$. This is followed by determination of the USL and the LSL. The process standard deviation, $\hat{\sigma}$, is then calculated, using the control charts, by

$$\hat{\sigma} = \frac{\bar{R}}{d_2} \quad \text{or} \quad \hat{\sigma} = \frac{\bar{s}}{c_4} \tag{16.4}$$

where $\bar{R}$ and $\bar{s}$ are the averages of the subgroup ranges and standard deviations for a period when the process was known to be in control, and $d_2$ and $c_4$ are the associated constant values based on the subgroup sample sizes. The process average can be estimated by the center-line value of the control chart in use (the grand average for an average chart, the average of the subgroup medians for a median chart, or the average of the individual readings for an individuals chart).

A stable process can be represented by a spread of six standard deviations. Comparing six standard deviations of the process variation to the customer specifications provides a measure of capability. Some measures of capability include $C_p$, $C_r$ (the inverse of $C_p$), $C_{pl}$, $C_{pu}$, and $C_{pk}$. $C_p$ is calculated using the following equation:

$$C_p = \frac{USL - LSL}{6\hat{\sigma}} \tag{16.5}$$

Figure 16.3 $C_p$, simple process capability (three cases relative to LSL and USL: $C_p < 1$, $C_p = 1$, and $C_p > 1$).
The $C_p$ value can predict the reject rate of new products by using normal probability distribution curves. When $C_p < 1$, the process variation exceeds specification and defective products are being made. When $C_p = 1$, the process is just meeting specification. A minimum of about 0.3% defective products will be made in this condition, and more if the process is not centered. When $C_p > 1$, the process variation is less than the specification; however, defective products might be made if the process is not centered on the target value. Figure 16.3 shows three cases of $C_p$ values in their relation to the specification limits.

The indices $C_{pl}$ and $C_{pu}$ (for single-sided specification limits) and $C_{pk}$ (for two-sided specification limits) not only measure the process variation with respect to the specification, but also take into account the location of the process average. Capability describes how well centered the curve is in the specification spread and how tight the variation is. $C_{pk}$ is considered a measure of the process capability and is taken as the smaller of $C_{pl}$ and $C_{pu}$. If the process is near normal and in statistical control, $C_{pk}$ can be used to estimate the expected percentage of defective products.

$$C_{pl} = \frac{\bar{\bar{X}} - LSL}{3\hat{\sigma}}, \qquad C_{pu} = \frac{USL - \bar{\bar{X}}}{3\hat{\sigma}} \tag{16.6}$$

$$C_{pk} = \min\{C_{pu}, C_{pl}\} \tag{16.7}$$

Figure 16.4 Process not capable of meeting parts.
Figure 16.4 shows an example of a process not capable of meeting targets. For the process in this figure, $C_p > 1$, but the inability of the process arises because it is not centered between LSL and USL. If the process is not capable of consistently making products to specification, common causes of the variation in the process must be identified and corrected. Examples of common remedies include assigning another machine to the process, procuring a new piece of equipment, providing additional training to reduce operator variations, and requiring vendors to implement statistical process controls.

EXAMPLE 16.1
In the die-cutting process, a control chart was maintained, producing the following statistics: $\bar{\bar{X}} = 212.5$, $\bar{R} = 1.2$, and $n = 5$. The specification limit for this process is $210 \pm 3$. This means that USL = 213 and LSL = 207. Calculate $C_p$ and $C_{pk}$ for this process. Also, find the number of defects.

Solution:

$$\hat{\sigma} = \frac{\bar{R}}{d_2} = \frac{1.2}{2.326} = 0.516$$

$$C_p = \frac{USL - LSL}{6\hat{\sigma}} = \frac{213 - 207}{6(0.516)} = \frac{6}{3.096} = 1.938$$

$$C_{pl} = \frac{\bar{\bar{X}} - LSL}{3\hat{\sigma}} = \frac{212.5 - 207}{3(0.516)} = \frac{5.5}{1.548} = 3.553$$

$$C_{pu} = \frac{USL - \bar{\bar{X}}}{3\hat{\sigma}} = \frac{213 - 212.5}{3(0.516)} = \frac{0.5}{1.548} = 0.323$$

$$C_{pk} = \min\{C_{pl}, C_{pu}\} = 0.323$$

Because $C_{pk} < 1$, defective material is being made. Figure 16.5 shows the schematic of the problem.

Figure 16.5 Schematic of a process that is not capable (horizontal axis from 207 to 215, with LSL = 207, USL = 213, and the process average $\bar{\bar{X}}$ marked).

Defects calculation: If the process is near normal and in statistical control, the process of calculating $C_{pk}$ can also be used to estimate the expected percent of defective material. The area under the curve outside the specification limits is used to determine the number of defects. To determine the area under the curve, the following factors must be calculated:

$$z_1 = \frac{LSL - \bar{\bar{X}}}{\hat{\sigma}} = \frac{207 - 212.5}{0.516} = -10.66$$

$$z_2 = \frac{USL - \bar{\bar{X}}}{\hat{\sigma}} = \frac{213 - 212.5}{0.516} = 0.969$$

Defects for values of $z$ below LSL $= F(z_1)$; here, $F(z_1) \approx 0$. Defects for values of $z$ above USL $= 1 - F(z_2)$; here, $1 - F(z_2) = 1 - 0.832 = 0.168$. $F(z) = P(Z \le z)$ is the cumulative distribution value for any value of $z$, obtained from the standard normal distribution table shown in Figure 16.6. Total defects $= F(z_1) + [1 - F(z_2)] = 16.8\%$.
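The calculations of Example 16.1 can be checked with a short Python sketch. The function names are illustrative assumptions; the standard normal cumulative distribution is evaluated with `math.erf` rather than read from a table, so the defect fraction it returns (about 0.166) is close to, but not identical with, the table-based 16.8% quoted in the example.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution F(z) = P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def capability(mean: float, sigma: float, lsl: float, usl: float):
    """Return (Cp, Cpl, Cpu, Cpk) per Equations 16.5 through 16.7."""
    cp = (usl - lsl) / (6.0 * sigma)
    cpl = (mean - lsl) / (3.0 * sigma)
    cpu = (usl - mean) / (3.0 * sigma)
    return cp, cpl, cpu, min(cpl, cpu)


def fraction_defective(mean: float, sigma: float, lsl: float, usl: float) -> float:
    """Area of a normal process curve lying outside the specification limits."""
    below = normal_cdf((lsl - mean) / sigma)
    above = 1.0 - normal_cdf((usl - mean) / sigma)
    return below + above


# Data from Example 16.1; d2 = 2.326 is the tabulated constant for n = 5.
D2_N5 = 2.326
sigma_hat = 1.2 / D2_N5
cp, cpl, cpu, cpk = capability(212.5, sigma_hat, lsl=207.0, usl=213.0)
defects = fraction_defective(212.5, sigma_hat, lsl=207.0, usl=213.0)

print(f"sigma_hat = {sigma_hat:.3f}")
print(f"Cp = {cp:.3f}, Cpl = {cpl:.3f}, Cpu = {cpu:.3f}, Cpk = {cpk:.3f}")
print(f"estimated fraction defective = {defects:.3f}")
```

The sketch assumes a normal, in-control process, as the example does; for a process that is not near normal the defect estimate would not be meaningful.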
16.4 STATISTICAL PROCESS CONTROL

Statistical process control (SPC) is a technique that uses a measure of central tendency (the average) and a measure of dispersion (the range) to monitor sampled measurements of the quality characteristics of a process, instead of inspecting results after the process has produced a product.
16.4.1 Control Charts: Recognizing Sources of Variation

A control chart is a type of trend chart with statistically determined upper and lower control limits. It is used to determine if a process is "in control." A process is said to be in control when the variation within the process is consistently random and within predictable (control) limits. Control charts are used to assess process variations and their sources and to monitor, control, and improve process performance over time. Random variation results from the interaction of the steps within a process. When the performance falls outside the control limits, assignable variation may be the cause. Assignable variation can be attributed to special causes. A control chart will help determine what type of variation is present within the process. Using a control chart, one can distinguish special causes of variation from common causes of variation. Control charts can serve as an ongoing control tool and help improve the process to perform consistently and predictably.

16.4.1.1 Constructing a Control Chart

There are many types of control charts. The appropriate control chart depends on the type of data. Figure 16.6 presents the different types of data and the associated control charts. Figure 16.7 shows a guideline to select the control chart based on the information from Figure 16.6. To construct a control chart, follow the steps shown in Figure 16.8. To calculate appropriate statistics, one needs to know the method and the constants for that method. Constants and different formulae that are used in
Figure 16.6 Standard normal cumulative distribution, F(z) = P(Z ≤ z)

z        0        0.02     0.04     0.06     0.08
–4.00    0.0000   0.0000   0.0000   0.0000   0.0000
–3.80    0.0001   0.0001   0.0001   0.0001   0.0001
:        :        :        :        :        :
–3.00    0.0013   0.0014   0.0015   0.0016   0.0018
–2.90    0.0019   0.0020   0.0021   0.0023   0.0024
–2.80    0.0026   0.0027   0.0029   0.0031   0.0033
–2.70    0.0035   0.0037   0.0039   0.0041   0.0044
:        :        :        :        :        :
0.00     0.5000   0.5080   0.5160   0.5239   0.5319
0.10     0.5398   0.5478   0.5557   0.5636   0.5714
:        :        :        :        :        :
0.70     0.7580   0.7642   0.7704   0.7764   0.7823
0.80     0.7881   0.7939   0.7995   0.8051   0.8106
0.90     0.8159   0.8212   0.8264   0.8315   0.8365
1.00     0.8413   0.8461   0.8508   0.8554   0.8599
:        :        :        :        :        :
2.80     0.9974   0.9976   0.9977   0.9979   0.9980
2.90     0.9981   0.9982   0.9984   0.9985   0.9986
3.00     0.9987   0.9987   0.9988   0.9989   0.9990

Figure 16.7 A process to select the appropriate control chart (when the sample size is small, usually 3 to 5, use the X̄ and R chart; for larger samples, use the X̄ and s chart).
constructing control charts are shown in Tables 16.1 and 16.2 for variable and attribute data, respectively. Use Tables 16.3 and 16.4 for the values of the constants in the formulae. While interpreting control charts, one should determine if the process mean (center line) is where it should be relative to specifications or objectives. If the
Figure 16.8 Guidelines for selecting control charts (flowchart: variable, measurable data lead to an X and moving range chart, a median and R chart, an X̄ and s chart, or an X̄ and R chart; attribute data lead to an np or p chart, a p chart, a c or u chart, or a u chart, depending on whether the subgroups are of equal size).
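Table 16.1 Variable Data

Average and range ($\bar{X}$ and $R$ chart); sample size <10, but usually 3–5
    Central line (a): $\bar{\bar{X}} = (\bar{X}_1 + \bar{X}_2 + \cdots + \bar{X}_k)/k$; $\bar{R} = (R_1 + R_2 + \cdots + R_k)/k$
    Control limits: $UCL_{\bar{x}} = \bar{\bar{X}} + A_2\bar{R}$; $LCL_{\bar{x}} = \bar{\bar{X}} - A_2\bar{R}$; $UCL_R = D_4\bar{R}$; $LCL_R = D_3\bar{R}$

Average and standard deviation ($\bar{X}$ and $s$ chart); sample size usually ≥10
    Central line (a): $\bar{\bar{X}} = (\bar{X}_1 + \bar{X}_2 + \cdots + \bar{X}_k)/k$; $\bar{s} = (s_1 + s_2 + \cdots + s_k)/k$
    Control limits: $UCL_{\bar{x}} = \bar{\bar{X}} + A_3\bar{s}$; $LCL_{\bar{x}} = \bar{\bar{X}} - A_3\bar{s}$; $UCL_s = B_4\bar{s}$; $LCL_s = B_3\bar{s}$

Median and range ($\tilde{X}$ and $R$ chart); sample size <10, but usually 3–5
    Central line (a): $\bar{\tilde{X}} = (\tilde{X}_1 + \tilde{X}_2 + \cdots + \tilde{X}_k)/k$; $\bar{R} = (R_1 + R_2 + \cdots + R_k)/k$
    Control limits: $UCL_{\tilde{x}} = \bar{\tilde{X}} + \tilde{A}_2\bar{R}$; $LCL_{\tilde{x}} = \bar{\tilde{X}} - \tilde{A}_2\bar{R}$; $UCL_R = D_4\bar{R}$; $LCL_R = D_3\bar{R}$

Individuals and moving range ($X$ and $R_m$ chart); sample size 1
    Central line (a): $\bar{X} = (X_1 + X_2 + \cdots + X_k)/k$; $R_m = |X_{i+1} - X_i|$; $\bar{R}_m = (R_1 + R_2 + \cdots + R_{k-1})/(k-1)$
    Control limits: $UCL_x = \bar{X} + E_2\bar{R}_m$; $LCL_x = \bar{X} - E_2\bar{R}_m$; $UCL_{R_m} = D_4\bar{R}_m$; $LCL_{R_m} = D_3\bar{R}_m$

(a) $k$ = number of subgroups; $\tilde{X}$ = median value within each subgroup; and $\bar{X} = \sum X_i / n$.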
process mean is not where it should be, then either the process or objectives have changed. To distinguish between common causes and special causes, data relative to control limits must be analyzed. Upper and lower control limits are not specification limits and do not make a value judgment (good, bad, marginal) about a process. To analyze the data, follow the steps in Figure 16.9 and Figure 16.10. To use a control chart as a monitoring tool, all special causes must be eliminated. The chart will show when special causes reemerge. A common cause is defined as a deviation from the mean due to statistical errors such as normal variations in measurements. Parts affected by random causes usually fall within the control limits. A special cause is a deviation due to a process failure such as a malfunctioning machine. Parts
Table 16.2 Attribute Data
Table 16.3 Constants Charts

Sample    X̄ and R Chart              X̄ and s Chart
Size n    A2      D3      D4         A3      B3      B4      c4
2         1.880   0       3.267      2.659   0       3.267   0.7979
3         1.023   0       2.574      1.954   0       2.568   0.8862
4         0.729   0       2.282      1.628   0       2.266   0.9213
5         0.577   0       2.114      1.427   0       2.089   0.9400
6         0.483   0       2.004      1.287   0.030   1.970   0.9515
7         0.419   0.076   1.924      1.182   0.118   1.882   0.9594
8         0.373   0.136   1.864      1.099   0.184   1.815   0.9650
9         0.337   0.184   1.816      1.032   0.239   1.761   0.9693
10        0.308   0.223   1.777      0.975   0.284   1.716   0.9727
Table 16.4 Constants Charts

Sample    Median (X̃) and R Chart     X and Rm Chart
Size n    Ã2      D3      D4         E2      D3      D4      d2
2         —       0       3.267      2.659   0       3.267   1.128
3         1.187   0       2.574      1.772   0       2.574   1.693
4         —       0       2.282      1.457   0       2.282   2.059
5         0.691   0       2.114      1.290   0       2.114   2.326
6         —       0       2.004      1.184   0       2.004   2.534
7         0.509   0.076   1.924      1.109   0.076   1.924   2.704
8         —       0.136   1.864      1.054   0.136   1.864   2.847
9         0.412   0.184   1.816      1.010   0.184   1.816   2.970
10        —       0.223   1.777      0.975   0.223   1.777   3.078
Start
• Select the process to be charted, and allow it to run according to standard procedure.
• Determine the sampling method and plan.
    - How large a sample can be drawn?
    - Can all samples be drawn from the same conditions?
    - Does data shift during different times or due to other factors? (E.g., do traffic patterns change during rush hour?)
    - Can a baseline be developed from historical data?
• Initiate data collection by running the process, gathering data, and recording it properly.
    - Generally, collect 20–25 random samples.
• Calculate the appropriate statistics.
Figure 16.9 Ten steps in control chart construction.

• Calculate the control limits using the appropriate formulas.
• Construct the control chart(s).
    - For attribute data, construct one chart, plotting each subgroup's proportion or number defective, number of defects, or defects per unit.
    - For variable data, construct one chart with each subgroup's mean, median, or individual value, and a second chart with each subgroup's range or standard deviation.
    - On all charts, draw a solid horizontal line showing the process average, and dashed horizontal lines for the upper and lower limits.
Figure 16.10 Data analysis process for control charts.
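The "calculate the control limits" step can be sketched in Python for the average and range chart, using the formulae of Table 16.1 and the constants of Table 16.3. The function name, the dictionary layout, and the four small subgroups of measurements are assumptions made for illustration; a real study would typically use the 20–25 samples recommended above.

```python
from statistics import mean

# (A2, D3, D4) for subgroup sizes 2 through 5, copied from Table 16.3.
XBAR_R_CONSTANTS = {
    2: (1.880, 0.0, 3.267),
    3: (1.023, 0.0, 2.574),
    4: (0.729, 0.0, 2.282),
    5: (0.577, 0.0, 2.114),
}


def xbar_r_limits(subgroups):
    """Center lines and control limits for an average and range chart."""
    n = len(subgroups[0])
    if any(len(g) != n for g in subgroups):
        raise ValueError("all subgroups must have the same size")
    a2, d3, d4 = XBAR_R_CONSTANTS[n]
    grand_mean = mean(mean(g) for g in subgroups)       # X double-bar
    r_bar = mean(max(g) - min(g) for g in subgroups)    # average range
    return {
        "center_x": grand_mean,
        "ucl_x": grand_mean + a2 * r_bar,
        "lcl_x": grand_mean - a2 * r_bar,
        "center_r": r_bar,
        "ucl_r": d4 * r_bar,
        "lcl_r": d3 * r_bar,
    }


# Hypothetical measurements: four subgroups of five readings each.
DATA = [
    [10, 12, 11, 13, 14],
    [11, 11, 12, 13, 13],
    [9, 11, 12, 12, 11],
    [12, 13, 14, 13, 13],
]

limits = xbar_r_limits(DATA)
for name, value in limits.items():
    print(f"{name}: {value:.3f}")
```

These control limits describe what the process is doing; they are not specification limits and say nothing by themselves about whether the output is acceptable to the customer.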
Figure 16.11 Guidelines to distinguish out-of-control process.
Table 16.5 Rules to Detect Out-of-Control Processes
1. One or more points fall outside control limits.
2. Two out of three consecutive points are in zone A.
3. Four out of five consecutive points are in zone A or B.
4. Nine consecutive points are on one side of the average.
5. Six consecutive points are increasing or decreasing.
6. Fourteen consecutive points alternate up and down.
7. Fifteen consecutive points within zone C.
Figure 16.12 Examples of out-of-control processes from Table 16.5. Panels: "One or More Points Fall Outside Control Limits," "Six Consecutive Points Are Increasing or Decreasing," and "Fifteen Consecutive Points Within Zone C." Each panel plots measured values against time period, with zones A, B, and C marked on both sides of the center line between the LCL and the UCL.
Table 16.6 Common Questions for Investigating an Out-of-Control Process
Are there differences in the measurement accuracy of instruments/methods used?
Are there differences in the methods used by different personnel?
Is the process affected by the environment (e.g., temperature and humidity)?
Has there been a significant change in the environment?
Is the process affected by predictable conditions (e.g., tool wear)?
Were any untrained personnel involved in the process at the time?
Has there been a change in the source for input to the process (e.g., raw materials, information)?
Is the process affected by employee fatigue?
Has there been a change in policies or procedures (e.g., maintenance procedures)?
Is the process adjusted frequently?
Did the samples come from different parts of the process? Shifts? Individuals?
Are employees afraid to report “bad news”?
affected by special causes usually fall outside the control limits or demonstrate unusual patterns, such as all points being on the upper control limit (UCL) line. A process is in statistical control if it is not affected by special causes. Statistical control means only that the process is consistent; the process must also be checked to see whether it fits the specification limits. After detecting special causes, one should change the process to fix them. Common causes are a fact of life, and trying to change them may result in worse deviations later. As long as the process does not change, the control limits should not change.
There are seven rules to detect out-of-control processes, as shown in Table 16.5. Examples of the rules are shown graphically in Figure 16.12. After an out-of-control process has been identified, a series of actions must take place to bring the process back in control. Table 16.6 lists common questions to ask during the investigation. A team should address any “yes” answer as a potential source of a special cause.
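Several of the rules in Table 16.5 can be checked mechanically once the center line and the process sigma are known. The following Python sketch is illustrative only: the function name `out_of_control_signals` is hypothetical, it covers rules 1, 4, 5, and 7 only, and it assumes that the control limits sit three sigma from the center line and that zone C is the band within one sigma of the center line.

```python
def out_of_control_signals(points, center, sigma):
    """Return {rule number: [index at which the pattern completes, ...]}.

    Covers rules 1, 4, 5, and 7 of Table 16.5 only.
    Assumption: control limits are center +/- 3*sigma and zone C is the
    band within one sigma of the center line.
    """
    ucl = center + 3 * sigma
    lcl = center - 3 * sigma
    signals = {1: [], 4: [], 5: [], 7: []}

    for i, x in enumerate(points):
        # Rule 1: a point falls outside the control limits.
        if x > ucl or x < lcl:
            signals[1].append(i)

        # Rule 4: nine consecutive points on one side of the average.
        if i >= 8:
            window = points[i - 8:i + 1]
            if all(p > center for p in window) or all(p < center for p in window):
                signals[4].append(i)

        # Rule 5: six consecutive points steadily increasing or decreasing.
        if i >= 5:
            window = points[i - 5:i + 1]
            steps = [b - a for a, b in zip(window, window[1:])]
            if all(s > 0 for s in steps) or all(s < 0 for s in steps):
                signals[5].append(i)

        # Rule 7: fifteen consecutive points within zone C.
        if i >= 14:
            window = points[i - 14:i + 1]
            if all(abs(p - center) < sigma for p in window):
                signals[7].append(i)

    return signals


if __name__ == "__main__":
    # Made-up readings, for illustration only.
    readings = [10.2, 9.8, 13.5, 10.1]
    print(out_of_control_signals(readings, center=10.0, sigma=1.0))
```

Each reported index marks the point at which a pattern is complete, so a long run produces one signal per additional point. Rules 2, 3, and 6 follow the same sliding-window approach.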
16.5 EXAMPLES OF CONTROL CHART CONSTANTS
In this section, we provide examples of several types of control charts, including the X(bar)-chart, which displays the variation in the average of a measurement series; the r-chart, which displays the variation in the range of a measurement series; the c-chart, which displays the variation in the number of defects; the u-chart, which displays the variation in the number of defects per unit; the p-chart, which displays the variation in the fraction of defective units; and the np-chart, which displays the variation in the number of defective units.
Table 16.7 Data for Machine Shop Part

Group No.   A     B     C     D     E     X(bar)   R
1           1.4   1.2   1.3   1.4   1.2
2           1.3   1.2   1.3   1.5   1.3
3           1.7   1.3   1.4   1.2   1.2
4           1.4   1.2   1.3   1.3   1.4
5           1.5   1.1   1.7   1.3   1.3
6           1.8   1.2   1.5   1.5   1.4
7           1.5   1.2   1.3   1.3   1.2
8           1.7   1.7   1.2   1.2   1.1
9           1.8   1.8   1.7   1.8   1.5
10          1.1   1.2   1.8   1.6   1.3
11          1.2   1.3   1.4   1.4   1.4
12          1.3   1.9   1.9   1.5   1.5
13          1.4   1.8   1.7   1.1   1.3
14          1.8   1.9   1.5   1.4   1.4
15          1.1   1.3   1.1   1.8   1.5
16          1.8   1.9   1.7   1.6   1.3
17          1.2   1.4   1.3   1.2   1.4
18          1.1   1.1   1.7   1.2   1.3
19          1.8   1.6   1.5   1.7   1.8
20          1.1   1.3   1.3   1.4   1.3
EXAMPLE 16.2
Analyze the weights of a specific part made in a machine shop using X(bar)- and r-charts. The machine shop sampled the parts at 20 different times (groups), and each group had five measurements (samples). (Data are given in Table 16.7.)

Solution: Because we have variable data with a constant sample size of 5, we choose X(bar)- and r-charts.

Mean and range calculation for each group: The mean ($\bar{X}$) = sum of the samples within the group divided by the group size. For group 1,
$\bar{X}_1 = (1.4 + 1.2 + 1.3 + 1.4 + 1.2)/5 = 1.3$.
The range ($R$) = difference between the largest observation within a group and the smallest observation within that group. For group 1,
$R_1 = 1.4 - 1.2 = 0.2$.
Compute the totals of the $\bar{X}$ and $R$ columns.

Average mean and average range calculation: Overall average ($\bar{\bar{X}}$) = total of the group means/total number of groups = 28.54/20 = 1.43.
Table 16.8 Add Calculated Data to the Chart

Group No.   A     B     C     D     E     X(bar)   R
1           1.4   1.2   1.3   1.4   1.2   1.3      0.2
2           1.3   1.2   1.3   1.5   1.3   1.3      0.3
3           1.7   1.3   1.4   1.2   1.2   1.3      0.5
4           1.4   1.2   1.3   1.3   1.4   1.3      0.2
5           1.5   1.1   1.7   1.3   1.3   1.3      0.6
6           1.8   1.2   1.5   1.5   1.4   1.4      0.6
7           1.5   1.2   1.3   1.3   1.2   1.3      0.3
8           1.7   1.7   1.2   1.2   1.1   1.3      0.6
9           1.8   1.8   1.7   1.8   1.5   1.7      0.3
10          1.1   1.2   1.8   1.6   1.3   1.4      0.7
11          1.2   1.3   1.4   1.4   1.4   1.3      0.2
12          1.3   1.9   1.9   1.5   1.5   1.6      0.6
13          1.4   1.8   1.7   1.1   1.3   1.4      0.7
14          1.8   1.9   1.5   1.4   1.4   1.6      0.5
15          1.1   1.3   1.1   1.8   1.5   1.3      0.7
16          1.8   1.9   1.7   1.6   1.3   1.6      0.6
17          1.2   1.4   1.3   1.2   1.4   1.3      0.2
18          1.1   1.1   1.7   1.2   1.3   1.2      0.6
19          1.8   1.6   1.5   1.7   1.8   1.6      0.3
20          1.1   1.3   1.3   1.4   1.3   1.2      0.3
Total                                     28.54    9.0

Note: The X(bar) entries are truncated to one decimal place; the X(bar) total is the sum of the unrounded group means.
The overall average is also called the grand average. The grand average is used as the center line for the X(bar)-chart. Average of all group ranges ($\bar{R}$) = total R/total number of groups = 9.0/20 = 0.45. It is used as the center line (average) for the range chart (Table 16.8).

Control limits calculation:
$\mathrm{UCL}_{\bar{X}} = \bar{\bar{X}} + A_2 \bar{R} = 1.43 + (0.577 \times 0.45) = 1.69$
$\mathrm{LCL}_{\bar{X}} = \bar{\bar{X}} - A_2 \bar{R} = 1.43 - (0.577 \times 0.45) = 1.17$
About 99.73% (three sigma limits) of the average values should fall between 1.17 and 1.69.
$\mathrm{UCL}_R = D_4 \bar{R} = 2.114 \times 0.45 = 0.951$
$\mathrm{LCL}_R = D_3 \bar{R} = 0 \times 0.45 = 0$
About 99.73% (three sigma limits) of the sample ranges should fall between 0 and 0.951.
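The arithmetic of Example 16.2 can be reproduced with a short script. This is a minimal sketch rather than a definitive implementation: the function name `xbar_r_limits` is hypothetical, the constants A2 = 0.577, D3 = 0, and D4 = 2.114 are the n = 5 values used in the example, and the first reading of group 3 is assumed to be 1.7, the value implied by that group's range of 0.5 and by the total of 28.54.

```python
# Readings for the 20 groups of Example 16.2 (five samples per group).
# Assumption: the first reading of group 3 is 1.7 (see the lead-in).
GROUPS = [
    [1.4, 1.2, 1.3, 1.4, 1.2], [1.3, 1.2, 1.3, 1.5, 1.3],
    [1.7, 1.3, 1.4, 1.2, 1.2], [1.4, 1.2, 1.3, 1.3, 1.4],
    [1.5, 1.1, 1.7, 1.3, 1.3], [1.8, 1.2, 1.5, 1.5, 1.4],
    [1.5, 1.2, 1.3, 1.3, 1.2], [1.7, 1.7, 1.2, 1.2, 1.1],
    [1.8, 1.8, 1.7, 1.8, 1.5], [1.1, 1.2, 1.8, 1.6, 1.3],
    [1.2, 1.3, 1.4, 1.4, 1.4], [1.3, 1.9, 1.9, 1.5, 1.5],
    [1.4, 1.8, 1.7, 1.1, 1.3], [1.8, 1.9, 1.5, 1.4, 1.4],
    [1.1, 1.3, 1.1, 1.8, 1.5], [1.8, 1.9, 1.7, 1.6, 1.3],
    [1.2, 1.4, 1.3, 1.2, 1.4], [1.1, 1.1, 1.7, 1.2, 1.3],
    [1.8, 1.6, 1.5, 1.7, 1.8], [1.1, 1.3, 1.3, 1.4, 1.3],
]

# Control chart constants for a subgroup size of 5, as used in the example.
A2, D3, D4 = 0.577, 0.0, 2.114


def xbar_r_limits(groups, a2, d3, d4):
    """Compute center lines and control limits for X-bar and R charts."""
    means = [sum(g) / len(g) for g in groups]      # X-bar of each group
    ranges = [max(g) - min(g) for g in groups]     # R of each group
    grand_mean = sum(means) / len(means)           # center line, X-bar chart
    r_bar = sum(ranges) / len(ranges)              # center line, R chart
    return {
        "grand_mean": grand_mean,
        "r_bar": r_bar,
        "ucl_x": grand_mean + a2 * r_bar,
        "lcl_x": grand_mean - a2 * r_bar,
        "ucl_r": d4 * r_bar,
        "lcl_r": d3 * r_bar,
    }


LIMITS = xbar_r_limits(GROUPS, A2, D3, D4)

if __name__ == "__main__":
    for name, value in LIMITS.items():
        print(f"{name}: {value:.3f}")
```

Working from unrounded averages, the limits agree with the values in the example after rounding to two decimal places for the X(bar)-chart and three for the range chart.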
EXAMPLE 16.3
The weights of a product made in the machine shop are given in Table 16.9. For speed of production, only one sample was evaluated over an observation period. Analyze the weights of a specific part made in a machine shop using the MR-chart.

Solution: Because we have variable data and there is only one product (unit) in each sample, we choose a moving range chart.

Calculating the MR:
$MR_n = |X_n - X_{n-1}|$,
the absolute value of the difference between consecutive observations. This is also known as the two-sample moving range (the most common form of moving range). There is no range for the first observation. The first MR value works out to $|1.4 - 1.3| = 0.1$. Calculate the totals of the sample (X) and MR columns as shown in Table 16.10.

Average mean and moving range calculation:
Table 16.9 Data for Example 16.3

Observation No.   Sample (X)   MR
1                 1.4
2                 1.3
3                 1.7
4                 1.4
5                 1.5
6                 1.8
7                 1.5
8                 1.7
9                 1.8
10                1.1
11                1.2
12                1.3
13                1.4
14                1.8
15                1.1
16                1.8
17                1.2
18                1.0
19                1.8
20                1.1
Total             28.9
Table 16.10 Add Calculated Data to the Chart

Observation No.   Sample (X)   MR
1                 1.4          N/A
2                 1.3          0.1
3                 1.7          0.4
4                 1.4          0.3
5                 1.5          0.1
6                 1.8          0.3
7                 1.5          0.3
8                 1.7          0.2
9                 1.8          0.1
10                1.1          0.7
11                1.2          0.1
12                1.3          0.1
13                1.4          0.1
14                1.8          0.4
15                1.1          0.7
16                1.8          0.7
17                1.2          0.6
18                1.0          0.2
19                1.8          0.8
20                1.1          0.7
Total             28.9         6.9
The overall average ($\bar{X}$) = sum of the measurements/number of observations = 28.90/20 = 1.45. $\bar{X}$ is also called the grand average and is used as the center line for the X-chart. Average of all moving ranges ($\overline{MR}$) = total MR/number of ranges = 6.9/19 = 0.36. $\overline{MR}$ is used as the center line (average) for the MR-chart.

Determining control limits:
$\mathrm{UCL}_X = \bar{X} + (E_2 \times \overline{MR}) = 1.45 + (2.659 \times 0.36) = 2.41$
$\mathrm{LCL}_X = \bar{X} - (E_2 \times \overline{MR}) = 1.45 - (2.659 \times 0.36) = 0.49$
$\mathrm{UCL}_{MR} = D_4 \times \overline{MR} = 3.267 \times 0.36 = 1.18$
$\mathrm{LCL}_{MR} = D_3 \times \overline{MR} = 0 \times 0.36 = 0$

Note: The sample size used to obtain the values for E2, D3, and D4 is two in this case because we are using a two-sample moving range in this example. If a three-sample moving range is used, the number of ranges will reduce to 18, and the values of the constants used will change accordingly.

Figure 16.13 Mean chart (Weight of Parts: X Bar; X(bar) versus group number, with UCL, CL, and LCL lines).
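The moving range calculation of Example 16.3 can be sketched in Python as follows. The function name `x_mr_limits` is hypothetical, and the constants E2 = 2.659, D3 = 0, and D4 = 3.267 are the two-sample values used in the example. The example rounds the averages to 1.45 and 0.36 before computing the limits, so limits computed from unrounded averages can differ from the printed values in the last digit.

```python
# Individual observations from Table 16.9.
OBSERVATIONS = [1.4, 1.3, 1.7, 1.4, 1.5, 1.8, 1.5, 1.7, 1.8, 1.1,
                1.2, 1.3, 1.4, 1.8, 1.1, 1.8, 1.2, 1.0, 1.8, 1.1]

# Constants for a two-sample moving range, as used in the example.
E2, D3, D4 = 2.659, 0.0, 3.267


def x_mr_limits(xs, e2, d3, d4):
    """Compute center lines and limits for individuals (X) and MR charts."""
    # Two-sample moving range: |X_n - X_(n-1)|; none for the first point.
    moving_ranges = [abs(b - a) for a, b in zip(xs, xs[1:])]
    x_bar = sum(xs) / len(xs)
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    return {
        "moving_ranges": moving_ranges,
        "x_bar": x_bar,
        "mr_bar": mr_bar,
        "ucl_x": x_bar + e2 * mr_bar,
        "lcl_x": x_bar - e2 * mr_bar,
        "ucl_mr": d4 * mr_bar,
        "lcl_mr": d3 * mr_bar,
    }


RESULT = x_mr_limits(OBSERVATIONS, E2, D3, D4)

if __name__ == "__main__":
    for name in ("x_bar", "mr_bar", "ucl_x", "lcl_x", "ucl_mr", "lcl_mr"):
        print(f"{name}: {RESULT[name]:.3f}")
```

With 20 observations there are 19 moving ranges, which is why the example divides the MR total by 19 rather than 20.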
Figure 16.14 Range chart (Weight of Parts: R Chart; R versus group number, with UCL, CL, and LCL lines).

EXAMPLE 16.4
Analyze the weights of a specific part made in a machine shop with the following information. Ten weeks of data on defective parts have been collected, with a sample size of 50. (Data are in Table 16.11.)
Table 16.11 Data for Example 16.4

Week No.   No. Defective
1          9
2          7
3          4
4          2
5          4
6          5
7          2
8          3
9          5
10         5
Total      46
Solution: Because we have attribute data with constant sample size and number of defectives, we use the np-chart.

Determining the averages: The average fraction defective ($\bar{p}$) = total defectives/total sampled.
$\bar{p} = \dfrac{46}{(n)(\text{weeks})} = \dfrac{46}{(50)(10)} = 0.092$
The grand average $n\bar{p}$ (center line) is the sample size times $\bar{p}$, which also equals total defectives/total number of samples.
Figure 16.15 X-chart (Weight of Parts: X Chart; X versus observation number, with UCL, CL, and LCL lines).

Figure 16.16 MR-chart (Weight of Parts: MR Chart; MR versus observation number, with UCL, CL, and LCL lines).
$n\bar{p} = (50)(0.092) = 4.6$
$n\bar{p} = 46/10 = 4.6$

Determining control limits:
$\mathrm{UCL} = n\bar{p} + 3\sqrt{n\bar{p}(1 - \bar{p})} = 4.6 + 3\sqrt{4.6(1 - 0.092)} = 10.731$
$\mathrm{LCL} = n\bar{p} - 3\sqrt{n\bar{p}(1 - \bar{p})} = 4.6 - 3\sqrt{4.6(1 - 0.092)} < 0$, so $\mathrm{LCL} = 0$
Note: Because the lower control limit (LCL) is less than zero, use zero.

Draw the np-chart:

Figure 16.17 np-Chart (Number of Changes versus week, with UCL, CL, and LCL = 0 lines).
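The np-chart limits of Example 16.4 follow directly from the weekly counts. The Python sketch below is one possible way to code the calculation; the function name `np_chart_limits` is hypothetical, and a negative lower limit is clamped to zero, as the note in the example directs.

```python
import math

# Weekly numbers of defectives from Table 16.11; 50 units sampled per week.
DEFECTIVES = [9, 7, 4, 2, 4, 5, 2, 3, 5, 5]
SAMPLE_SIZE = 50


def np_chart_limits(defectives, n):
    """Return (p_bar, center line, LCL, UCL) for an np-chart."""
    p_bar = sum(defectives) / (n * len(defectives))  # average fraction defective
    center = n * p_bar                               # n * p_bar
    half_width = 3 * math.sqrt(center * (1 - p_bar))
    lcl = max(0.0, center - half_width)              # a count cannot be negative
    ucl = center + half_width
    return p_bar, center, lcl, ucl


P_BAR, CENTER, LCL, UCL = np_chart_limits(DEFECTIVES, SAMPLE_SIZE)

if __name__ == "__main__":
    print(f"p_bar={P_BAR:.3f} center={CENTER:.1f} LCL={LCL:.3f} UCL={UCL:.3f}")
```

The unclamped lower limit is negative here, so the chart is drawn with LCL = 0.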
Table 16.12 Data for Example 16.5

Week No.   No. Defective
1          9
2          7
3          4
4          2
5          4
6          15
7          2
8          3
9          5
10         5
Total      56
EXAMPLE 16.5
A company tracks the number of times a specification was changed by either an engineering change proposal (ECP) or a letter from the contracting officer. Attribute data summarize changes to 50 contracts over a 10-week period. Analyze the number of specification changes per week.

Solution: Because we have attribute data with constant sample size and the number of changes is represented by the number of “defects,” we use the c-chart (Table 16.12).

Determining center line ($\bar{c}$) and control limits:
$\bar{c}$ = total defects found/total number of groups = 56/10 = 5.6 (changes per week).
Determine control limits. If LCL < 0, set LCL = 0.
$\mathrm{UCL} = \bar{c} + 3\sqrt{\bar{c}} = 5.6 + 3\sqrt{5.6} = 12.699$
$\mathrm{LCL} = \bar{c} - 3\sqrt{\bar{c}} = 5.6 - 3\sqrt{5.6} < 0$, so $\mathrm{LCL} = 0$
Draw the c-chart.
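As a rough check on the c-chart arithmetic of Example 16.5, the limits can be computed from the weekly counts in Table 16.12. The function name `c_chart_limits` is hypothetical; the sketch also lists the weeks whose counts exceed the upper limit.

```python
import math

# Weekly numbers of specification changes from Table 16.12.
CHANGES = [9, 7, 4, 2, 4, 15, 2, 3, 5, 5]


def c_chart_limits(counts):
    """Return (center line, LCL, UCL) for a c-chart."""
    c_bar = sum(counts) / len(counts)    # average count per group
    half_width = 3 * math.sqrt(c_bar)
    lcl = max(0.0, c_bar - half_width)   # set LCL to 0 if it is negative
    ucl = c_bar + half_width
    return c_bar, lcl, ucl


C_BAR, LCL, UCL = c_chart_limits(CHANGES)

# Weeks (numbered from 1) whose count lies above the upper control limit.
WEEKS_ABOVE_UCL = [week for week, c in enumerate(CHANGES, start=1) if c > UCL]

if __name__ == "__main__":
    print(f"c_bar={C_BAR:.1f} LCL={LCL:.3f} UCL={UCL:.3f}")
    print("weeks above UCL:", WEEKS_ABOVE_UCL)
```

With these data, week 6 (15 changes) lies above the UCL, which matches rule 1 of Table 16.5.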
REFERENCES
Ackermann, C. S., and J. M. Fabia. 1993. Monitoring supplier quality at PPM levels. IEEE Transactions on Semiconductor Manufacturing 6 (2): 189–195.
JEDEC. 1995. Standard JESD16-A. Assessment of average outgoing quality levels in parts per million (PPM). Electronic Industries Association, Alexandria, VA.
HOMEWORK PROBLEMS

Problem 16.1
For each of the data sets given, identify which of the following control charts should be used to plot the data for process control: c-chart, u-chart, p-chart, np-chart, X(bar)-r-chart, or X-Rm chart. For each case, state why you selected the particular chart type.

Data-Set Details | Control Chart Type

An equal number of samples of process output have been monitored each week for the last 5 weeks. Ten defective parts were found the first week, eight the second week, six the third week, nine the fourth week, and seven the fifth week.

Different numbers of samples (between 40 and 60) of process output have been monitored each week for the last 4 weeks. In the first week, 1.2 defects per sample were observed. In the second week, 1.5 defects per sample were observed. In the third week, one defect per sample was observed. In the fourth week, 0.8 of a defect per sample was observed.

The thicknesses of 10 samples were measured each day for a week.

An equal number of samples of process output have been monitored each week for the last 4 weeks. In the first week, eight defects were observed. In the second week, 12 defects were observed. In the third week, 10 defects were observed. In the fourth week, nine defects were observed.

The thickness of a single sample was measured each day for a week.

A process has been observed each week for the last three weeks. The first week, 10% of the parts were found to be defective; 20% were found to be defective the second week, and 15% were found to be defective the third week.
Problem 16.2
The copper content of a plating bath is measured three times per day and the results are reported in parts per million. The X(bar)- and r-values for 10 days are shown in the following table.

Day   X(bar)   r
1     5.45     1.21
2     5.39     0.95
3     6.85     1.43
4     6.74     1.29
5     5.83     1.35
6     7.22     0.88
7     6.39     0.92
8     6.50     1.13
9     7.15     1.25
10    5.92     1.05
a. Determine the upper and lower control limits.
b. Is the process in statistical control?
c. Estimate the Cp and Cpk given that the specification is 6.0 ± 1.0. Is the process capable?

Problem 16.3
Printed circuit boards are assembled by a combination of manual assembly and automation. A reflow soldering process is used to make the mechanical and electrical connections of the leaded components to the board. The boards are run through the solder process continuously, and every hour five boards are selected and inspected for process-control purposes. The number of defects in each sample of five boards is noted. Results for 20 samples are shown in the following table. What type of control chart is appropriate for this case and why? Construct the control chart limits and draw the chart. Is the process in control? Does it need improvement?

Sample   No. of Defects   Sample   No. of Defects
1        6                11       9
2        4                12       15
3        8                13       8
4        10               14       10
5        9                15       8
6        12               16       2
7        16               17       7
8        2                18       1
9        3                19       7
10       10               20       13

Problem 16.4
Twelve parts of the same type are tested for 1,000 hours, and seven failures are observed at 250, 450, 510, 625, 750, 825, and 979 hours. The items are removed at failure without replacement. Calculate the upper and lower one-sided 90% confidence limits on mean time between failures. Also, calculate the two-sided 90% confidence limits on reliability for a 200-hour period.
Problem 16.5
The diameter of a shaft with nominal specifications of 60 ± 3 mm is measured six times each hour and the results are recorded. The X(bar)- and r-values for 8 hours are shown in the following table:

Hour   X(bar)   R
1      62.54    1.95
2      60.23    2.03
3      58.46    1.43
4      59.95    1.29
5      61.58    0.78
6      57.93    1.48
7      61.56    0.86
8      57.34    1.35

(a) Determine the upper and lower control limits.
(b) Determine if the process is in statistical control.
(c) Estimate the Cp and Cpk for the process. Is the process capable?
Problem 16.6
The specification for a shaft diameter is 212 ± 2 mm. Provided below are 30 recorded observations for the diameter of a shaft (in millimeters) taken at 30 different points in time.

212.1(a)   214.2   213.7   212.7   212.5
212.7(b)   212.8   213.0   212.9   212.3
212.5      212.1   211.8   213.5   212.0
213.0      214.5   212.3   212.2   211.9
213.2      212.7   211.9   212.3   212.0
212.8      213.9   212.6   214.0   212.4(c)

(a) First observation. (b) Sixth observation. (c) Thirtieth observation.

(a) Determine the three-sample X(bar)- and MR(bar)-control limits from the data.
(b) Determine from the control charts whether the process is in control or not.
(c) Determine the capability indices (Cp and Cpk) for the process.
(d) Determine the percent defective shafts produced by the process.