STATISTICAL PROCESS CONTROL The Deming Paradigm and Beyond SECOND EDITION
James R. Thompson
Jacek Koronacki
CHAPMAN & HALL/CRC A CRC Press Company Boca Raton London New York Washington, D.C.
Library of Congress Cataloging-in-Publication Data
Thompson, James R., 1938–
Statistical process control: the Deming paradigm and beyond / James R. Thompson, Jacek Koronacki. — 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 1-58488-242-5 (alk. paper)
1. Process control—Statistical methods. 2. Production management—Quality control. I. Koronacki, Jacek. II. Title.
TS156.8 .T55 2001
658.5′62′015195—dc21
2001043990
This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use. Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage or retrieval system, without prior permission in writing from the publisher. The consent of CRC Press LLC does not extend to copying for general distribution, for promotion, for creating new works, or for resale. Specific permission must be obtained in writing from CRC Press LLC for such copying. Direct all inquiries to CRC Press LLC, 2000 N.W. Corporate Blvd., Boca Raton, Florida 33431. Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation, without intent to infringe.
Visit the CRC Press Web site at www.crcpress.com © 2002 by Chapman & Hall/CRC No claim to original U.S. Government works International Standard Book Number 1-58488-242-5 Library of Congress Card Number 2001043990 Printed in the United States of America 1 2 3 4 5 6 7 8 9 0 Printed on acid-free paper
To my wife, Ewa Majewska Thompson James R. Thompson
To my wife, Renata Houwald Koronacka and to my children, Urszula and Krzysztof Jacek Koronacki
Contents
Preface to First Edition
Preface to Second Edition

1 Statistical Process Control: A Brief Overview
1.1. Introduction
1.2. Quality Control: Origins, Misperceptions
1.3. A Case Study in Statistical Process Control
1.4. If Humans Behaved Like Machines
1.5. Pareto's Maxim
1.6. Deming's Fourteen Points
1.7. QC Misconceptions, East and West
1.8. White Balls, Black Balls
1.9. The Basic Paradigm of Statistical Process Control
1.10. Basic Statistical Procedures in Statistical Process Control
1.11. Acceptance Sampling
1.12. The Case for Understanding Variation
1.13. Statistical Coda
References
Problems

2 Acceptance-Rejection SPC
2.1. Introduction
2.2. The Basic Test
2.3. Basic Test with Equal Lot Size
2.4. Testing with Unequal Lot Sizes
2.5. Testing with Open-Ended Count Data
Problems

3 The Development of Mean and Standard Deviation Control Charts
3.1. Introduction
3.2. A Contaminated Production Process
3.3. Estimation of Parameters of the "Norm" Process
3.4. Robust Estimators for Uncontaminated Process Parameters
3.5. A Process with Mean Drift
3.6. A Process with Upward Drift in Variance
3.7. Charts for Individual Measurements
3.8. Process Capability
References
Problems

4 Sequential Approaches
4.1. Introduction
4.2. The Sequential Likelihood Ratio Test
4.3. CUSUM Test for Shift of the Mean
4.4. Shewhart CUSUM Chart
4.5. Performance of CUSUM Tests on Data with Mean Drift
4.6. Sequential Tests for Persistent Shift of the Mean
4.7. CUSUM Performance on Data with Upward Variance Drift
4.8. Acceptance-Rejection CUSUMs
References
Problems

5 Exploratory Techniques for Preliminary Analysis
5.1. Introduction
5.2. The Schematic Plot
5.3. Smoothing by Threes
5.4. Bootstrapping
5.5. Pareto and Ishikawa Diagrams
5.6. A Bayesian Pareto Analysis for System Optimization of the Space Station
5.7. The Management and Planning Tools
References
Problems

6 Optimization Approaches
6.1. Introduction
6.2. A Simplex Algorithm for Optimization
6.3. Selection of Objective Function
6.4. Motivation for Linear Models
6.5. Multivariate Extensions
6.6. Least Squares
6.7. Model "Enrichment"
6.8. Testing for Model "Enrichment"
6.9. 2^p Factorial Designs
6.10. Some Rotatable Quadratic Designs
6.11. Saturated Designs
6.12. A Simulation Based Approach
References
Problems

7 Multivariate Approaches
7.1. Introduction
7.2. Likelihood Ratio Tests for Location
7.3. Compound and Projection Tests
7.4. A Robust Estimate of "In Control" Location
7.5. A Rank Test for Location Slippage
7.6. A Rank Test for Change in Scale and/or Location
References
Problems

Appendix A: A Brief Introduction to Linear Algebra
A.1. Introduction
A.2. Elementary Arithmetic
A.3. Linear Independence of Vectors
A.4. Determinants
A.5. Inverses
A.6. Definiteness of a Matrix
A.7. Eigenvalues and Eigenvectors
A.8. Matrix Square Root
A.9. Gram-Schmidt Orthogonalization

Appendix B: A Brief Introduction to Stochastics
B.1. Introduction
B.2. Conditional Probability
B.3. Random Variables
B.4. Discrete Probability Distributions
B.5. More on Random Variables
B.6. Continuous Probability Distributions
B.7. Laws of Large Numbers
B.8. Moment-Generating Functions
B.9. Central Limit Theorem
B.10. Conditional Density Functions
B.11. Random Vectors
B.12. Poisson Process
B.13. Statistical Inference
B.14. Bayesian Statistics
References

Appendix C: Statistical Tables
C.1. Table of the Normal Distribution
C.2. Table of the Chi-Square Distribution
C.3. Table of Student's t Distribution
C.4. Table of the F Distribution with α = .05
C.5. Table of the F Distribution with α = .01
Preface to the First Edition “May you live in interesting times” can be a curse if one lives in a society perceived to be so perfect that improvement can only be marginal and not worth the trauma of change. If, on the other hand, one lives in a society that is based on constant improvement, living in “interesting times” presents opportunity. For good or ill, it is clear that people these days live in very interesting times indeed. The struggle between the West and the Communist World is over. Yet, with the triumph of the Western system, there comes the challenge of seeing what happens now that the political and military conflict is ended. Enormous residues of intellectual energy are now freed to be focused on peacetime pursuits. It is interesting to note that one of the basic optimization postulates of statistical process control (SPC) was developed by Vilfredo Pareto (18481923), who was trained as an engineer but is best known for his economic and sociological works. According to Pareto’s Maxim, the many failures in a system are the result of a small number of causes. General malaise is seldom the root problem. It follows that in order to improve a system, skilled investigators are required to find and correct the causes of the “Pareto glitches.” We are, in this book, concerned about the orderly process of optimization which is the nuts and bolts of statistical process control. But a few words about the social theory of Pareto are in order, since this theory gives insight to the important part SPC is likely to play in the post Cold War world. Pareto perceived the inevitability of elites in control of society. Extrapolating from Pareto’s works, particularly the massive Mind and Society, the American political scientist, James Burnham, writing in the late 1930s and early 1940s, observed the presence of such elites in Fascist, Communist and Bourgeois Capitalist societies. These elites were based on the expertise to seize and maintain power, rather than on excellence in the arts, sciences and technology. Burnham pointed out that Pareto had started out with views similar to those of his father, who had resisted Bourbon control of Italy in favor of a Jeffersonian republic. In early middle age, Pareto’s point of view had changed fundamentally from a kind of Scottish Enlightenment optimism to one of cynicism when he noted the great mistake of Aristotle. Aristotle assumed that once human beings had understood the Aristotelean logic system, everybody would embrace ix © 2002 by Chapman & Hall/CRC
it eagerly. Yet, Pareto observed that human beings tend to make their most important decisions based on gut instincts, passions, and narrow self and group interests. It is not simply that people are inclined to use personal utility functions as opposed to those of the greater society. Frequently, a boss will attempt to assimilate all the information relevant to making a decision, and then, at the end of the day, make a decision which appears whimsical, without particular relevance to any perceivable utility function. Pareto devoted most of his mature years to trying to understand how bosses arrive at decisions. Today we note reconfirmation of Pareto’s views when we see how seemingly irrational managerial decisions have caused so much havoc in the otherwise enviable industrial model of Japan. In The Managerial Revolution, Burnham observed a movement of the members of nomenklaturas back and forth between top posts in seemingly unrelated areas. So, for example, a university president might be named president of an industrial corporation. A key Party official might be given first a post in the Ministry of Justice and then move on to become a general. There was no “bottom line” in terms of effectiveness within the current post that led to lateral or vertical promotion within the society. Essentially, a new feudalism was developing, with all the stagnation that entails. Upon reflections on Burnham’s book, his fellow former Trotskyist colleague, George Orwell, wrote the profoundly pessimistic 1984 about a society in which all notions of human progress had been sacrificed for the purpose of control by the Inner Party. Skipping forward a few decades, it is interesting to see how accurate Pareto’s perceptions had been. We note, for example, in the 1980s, the promotion to Chief of State of the Soviet Minister of Agriculture, a post in which he had produced no growth in productivity. We note in America how the CEO of a large soft drink company became the CEO of a high tech computer company, even though he was unable to write a computer program. Until fairly recently, Burnham’s predictions appeared to be woefully justified. But a sea change has taken place. The abysmal record of the Soviet system to produce even the logistics necessary for maintenance of its military power has driven that former Minister of Agriculture, Mikhail Gorbachev, from his office as emperor of the Soviet empire, and has, in fact, led to the overthrow of that empire. There is no evidence that Gorbachev’s meekly accepting his loss of power was due to any moral superiority over his old mentor, Yuri Andropov, who had crushed the Hungarian workers in 1956. Ultimately, it seems, there was a bottom line x © 2002 by Chapman & Hall/CRC
for the productivity of the Soviet nomenklatura. Their mismanagement of the various productivity areas of society was no longer acceptable in the high tech world of the 1980s. During this period, the workers in the Soviet empire, from the shipwrights in Gdansk to the coal miners in the Kuzbas, realized that they were the productive members of their societies and that they could present their feudal masters with a choice between mass murder and abdication of their power. They organized strikes. Such strikes had been unsuccessful in the past, but the nomenklatura realized that this time destruction of their skilled workers would leave their already disastrous economies completely unfeasible. Technology had advanced too far to replace skilled workers readily. The new technology has given skilled labor a power most do not yet perceive. In 1984, a despairing Winston Smith writes in his diary If there is hope, it lies with the proles. In the case of the Soviet Union and its empire, Orwell’s hero appears to have been prophetic. Yet, in the United States, the nomenklaturas, composed largely of armies of attorneys and MBAs, continue to dominate much of society. They bankrupt one company, merge another, and slither confidently on. Henry Ford hired his employees for life. Few employees of contemporary American companies can assume their current company will employ them until retirement, or even that their company will survive until they retire. MBAs are trained in marketing and finance, almost never in quality and production. American attorneys, who once upon a time proudly bore the title of “counsellor,” now have a low prestige consistent with their current contributions to society. Workers are enjoined to improve their performance “or else,” but they are not given any inkling of how this improvement is to be achieved. And little wonder, for seldom does American management know enough about their production processes to tell their employees how to improve them. Increasingly free markets appear to be presenting American managers with a bottom line that they can no longer avoid. The slogan of “buy American” begins to ring hollow indeed. A worker engaged in a wellmanaged company has difficulty in seeing why he should use his hard earned money to buy an inferior domestic product at a higher price rather than a superior foreign import. American consumers may be on the verge of creating a situation where the nomenklatura in this country will begin to be replaced with managers, who are life members of their xi © 2002 by Chapman & Hall/CRC
companies, experienced in what they do, willing to learn and explain to their colleagues how to improve the process of production. It is unfortunate for all concerned, however, that the directors of most failing American companies seem to spurn the rather well defined statistical process control protocols which could, very frequently, save their businesses. Rather, we seem to have a situation of natural selection. Companies with managers who have adopted the philosophy of statistical process control tend to survive. Those with managers who do not are likely to fail. But, like dinosaurs in a Darwinian jungle, most managers refuse to adapt to the new world of high quality and the paradigms required to participate in it. There may be better ways to improve the general quality of American production than simply watching those firms imprudently managed perish. But at least the discipline of a nearly free market does keep the quality of surviving firms on a path of improvement. A pity that so many American managers are unwilling to learn statistical process control, for, if they did, they might very well deliver themselves, their workers and their stockholders from the inevitability of disaster. Unfortunately, the organic force for the maintenance of control by nomenklaturas continues. In Poland, most of the factory directors have been pensioned off (handsomely, by Polish standards). But the new managers (following advice from experts from the United Nations, the World Bank, the United States, etc.) have not been recruited from the ranks of the foremen, blue collar supervisors generally very familiar with the running of the plant. Rather, to a significant degree, the new directors have been recruited from the junior league of the old nomenklatura. These are individuals, generally in their 30s and 40s, who have supplemented their training with accelerated American-style MBA programs. Thus, big state run or newly privatized companies are being controlled, in large measure, by persons who are attempting to make the transition from one nomenklatura to another nomenklatura. The thousands of new small companies in Poland, freed from stifling governmental controls and unimpeded by nomenklatura personnel, are thriving. But the big companies are not generally improving; they are, in fact, going bankrupt with astounding regularity. One of the tragedies of the newly free countries of Central and Eastern Europe appears to be that, having been subjected for decades to the depredations of a Soviet-style nomenklatura, they now must suffer the inefficiencies of an American-style one. It is the hope of revolutionaries to leapfrog the failed paradigms of the past. Our hope for the newly free countries of the former Soviet empire is that they will adopt the new style of management, advocated xii © 2002 by Chapman & Hall/CRC
by Professor W. Edwards Deming, sooner rather than later. It is interesting to note a practical consequence (observed by Shewhart and Deming) of Pareto’s Maxim: the failures in systems can be viewed, mathematically, as a problem in contaminated distributions. This fact provides us with a tool that can lead to the replacement of the quasifeudal managerial systems Pareto predicted by that nurturing system of continual improvement, that is the hallmark of statistical process control. Statistical process control has nothing to do with attitude adjustment, slogans or boosterism of any sort. It is based on concepts which, though not as theoretically well understood, are as substantial as Newtonian physics. The basic notion of a few rather than many causes of failure in a system was perceived at least as long ago as Pareto. The profound observation that machines operate in fundamentally different ways than do people goes back at least to Henry Ford. Walter Shewhart perceived in the early 1930s that Pareto’s qualitative observation about causes of failure could be quantitized as a model of mixtures of distributions. In the real world, there appear to be switches in time, which periodically transfer the generating process into a distribution not typical to the distribution when the dominant distribution is the driving force behind the process. These epochs in time exhibit product with differences in average measurement and/or variability of measurement from the product produced during “in control” epochs. The control charts developed by Shewhart enabled him to identify these “out of control” epochs. Then, by backtracking, he was frequently able to discover the systemic cause of a “Pareto glitch” and remove it, thus fixing the system. Based on the observation of Ford that a fixed machine tends to stay fixed, Shewhart was able to build the basic paradigm of SPC, which is essentially a kind of step-wise optimization of a system. Taken at first glance, there is no particular reason to be excited about Shewhart’s paradigm. SPC sounds, at first blush, about as likely as Pyramid Power and Transcendental Meditation to improve a system of production. Pareto’s Maxim is not intuitively obvious to most. But then, Galileo’s observation that objects fall to Earth at a velocity independent of their mass does not sound, at first hearing, obvious. Of course, Galileo’s conjecture admits of relatively simple experimental validation. The verification of the utility of SPC tends to require implementation in a rather large and costly system of production. It would be interesting to undertake a careful historical investigation to determine just how much of the Shewhart algorithm was foreshadowed by the production techniques of Henry Ford and by the German xiii © 2002 by Chapman & Hall/CRC
manufacturer, Ferdinand Porsche, who was strongly influenced by Ford’s work. Although World War II did see some implementation of Shewhart’s paradigm in American war production, there is also significant evidence that the influence of the Soviet style of optimization by slogan and psychology grew in the United States very rapidly during this period. On balance, World War II may very well have left America farther away from the Shewhart paradigm than it had been previously. At any rate, it is clear that Deming’s massive implementation of SPC in Japan after World War II brought Japan quickly into a position first of challenging, then surpassing, American automobile and audio/video production. And the Japanese competitive success vis-`a-vis the Germans has also been clear in these areas. One recent study indicates that the total time spent in producing a Lexus LS 400 is less than the average rework time spent on a competitive German product. Whatever portion of the Shewhart paradigm was presaged by Ford and Porsche, it is hard to avoid the conclusion that it was much less than its potential, and that it was Deming who carried out the equivalent of Galileo’s gravity experiment on a massive scale. Perhaps there is a valid comparison between Shewhart and Adam Smith, who had perceived the power of the free market. But there appears to be no single implementer of the free market who was as important in validating The Wealth of Nations as Deming has been in validating the paradigm of Shewhart. There has never been, in world history, so large scale an experiment to validate a scientific hypothesis as Deming’s Japanese validation and extension of the statistical process control paradigm of Shewhart. Revisionists in quality control abound. If Deming has his Fourteen Points, other quality control “experts” also have theirs. If Deming specifically warns, in his Fourteen Points, against sloganeering and posters, others specifically advocate such devices in theirs. If Deming argues for dumping the productivity-destroying paradigm of “quality assurance” and going to optimization of the production process via SPC, others argue that SPC and QA are just two different tools in the quality arsenal and promise a smooth, painless transition from QA to SPC. Some argue that they have gone far beyond Deming by the implementation of continuous feedback concepts according to the paradigms of classical control theory, thereby demonstrating that they really haven’t much of a clue what Pareto, Deming and Shewhart had discovered. Multibillion dollar corporations in the United States are as likely to consult revisionist gurus as they are to consult those who implement the mixture paradigm of Shewhart. Somehow, American CEOs seem to xiv © 2002 by Chapman & Hall/CRC
think that “all these guys are implementing the quality control system of the Japanese. We need to pick an expert who has a presentation which is well packaged and management friendly.” Such CEOs would probably do their employees and stockholders a considerable service if they would at least do a bit of research to see how revisionist gurus are regarded by the Japanese. Returning to the analogy with Adam Smith, it is certainly possible to argue against the free market for many reasons. Lack of effectiveness is not one of them. We may quarrel with the simple SPC paradigm of Deming, but not on the basis of its record of performance. There is more to the evolution of technology than SPC, just as there is more to the improvement in the living standard of the population than is presented by the free market. But those managers who neglect the system optimization paradigm hinted at by Pareto, postulated by Shewhart, and implemented and validated by Deming, do so at the hazard of their futures and those of their companies. In 1989, James Thompson and Jacek Koronacki began investigations as to how statisticians might assist in the economic development of postMarxist Poland. Koronacki had translated Thompson’s short course notes (used in a number of industrial settings in Texas) into Polish, and it was decided to use these notes first to train instructors, and then as a basis of within factory teaching and consultation. In June of 1991, Kevin McDonald, president of the International Team for Company Assistance, headquartered in Warsaw, provided United Nations funding to hire one dozen Polish Ph.D. statisticians for a period of 6 months, during which period they would go into ten Polish companies recently privatized or on the verge of privatization and introduce the Deming approach. The resulting consortium, entitled the Quality Control Task Force, led by Koronacki, has been functioning since that time. Two colleagues from the Department of Statistics at Rice University, Marek Kimmel and Martin Lawera, together with Thompson, have provided on-site and remote consultation to the group since the summer of 1991. The current book is an evolutionary development, starting with the short course notes developed over a decade of consulting and lecturing on quality improvement, including recent experiences in Poland, and adding a mathematical modeling background not generally employed in industrial courses. The book is organized so that the earlier part of the book can be utilized by persons only interested in the practical implementation of the paradigm of statistical process control. Some of the material in the later part of the book deals with topics that are of xv © 2002 by Chapman & Hall/CRC
ongoing interest in our research and that of our students. By including mathematical and statistical appendices, we have attempted to write a book which, say, a foreman might utilize over time as he wishes to develop both his practical and theoretical insights into statistical process control. Problems are given at the end of each chapter. For university instruction, the book is appropriate both for advanced undergraduate and graduate level courses. Chapter 1 represents an overview of the practical implementation of statistical process control. The intuitive contaminated distribution approach taken in this chapter is appropriate for use in industrial short courses and has been so employed in both Poland and Texas. We consider in Chapter 2, as a beginning, the data available in the present state of quality activity in most American firms, namely that of quality assurance. Though such data are much less desirable than modularized measurement data, it is a natural starting point for quality investigators dealing with the world as they find it, as opposed to how they might wish it to be. Chapter 3 considers in some detail the theory of contaminated distributions which forms the model basis of most statistical process control. The performance of Shewhart Control Charts is considered in a variety of practical situations. Various procedures for robustification of the parameters characterizing the uncontaminated process distribution are considered. In Chapter 4 we examine a variety of sequential procedures favored by some as alternatives to the Shewhart Control Charts. These include CUSUMs, Shewhart CUSUMs, Acceptance-Rejection CUSUMs, Page CUSUMs and Exponentially Weighted Moving Averages. Chapter 5 presents a number of exploratory and graphical techniques frequently useful for troubleshooting in those situations which do not readily lend themselves to standard SPC paradigms. In Chapter 6 we present a number of optimization techniques useful for designing experiments and modifying process conditions for enhanced production performance. Included among these are the simplex algorithm of Nelder and Mead and the rotatable designs of Box, Hunter and Draper. Chapter 7 deals with the subject of examining time indexed multivariate data for clues to quality improvement. A compound test is suggested as an alternative to the generally standard approach of testing one dimension at a time. A robust procedure for estimating the mean of the dominant distribution in a data contamination situation is proposed. A xvi © 2002 by Chapman & Hall/CRC
nonparametric test for shift of location is considered. Appendices A and B give an overview of the linear algebra and mathematical statistics used in the rest of the book. The inclusion of these appendices is an attempt to make the book as self-contained as possible. For reasons of practicality, we have attempted to create a book which does not need an accompanying software diskette. The standard statistical procedures in SPC are not particularly involved. The standard SPC charts can be handled very nicely with an inexpensive hand-held calculator, such as the TI-36X solar, which we tend to introduce into the firms with whom we consult. There are a number of excellent spreadsheet-based statistical packages which would be of use in handling, say, the linear models sections in Chapter 6 as well as dealing more quickly with the problems in the first five chapters. These include the various versions of SYSTAT and Statview. Indeed, simple spreadsheet packages such as Lotus 1-2-3 and Excel can easily be adapted to assist with many of the problems. For the simulations in Chapter 7, we used programs written in C (by Martin Lawera). Lawera’s work with Thompson in the development of the “king of the mountain” algorithm for finding the location of the multivariate “in control” distribution is duly noted. The book was typeset using a Macintosh IIfx, utilizing the Textures LaTex processing program of Blue Sky Research, using graphics from SYSTAT 5.2, MacDraw Professional and MacPaint II. The support of the Army Research Office (Durham) under DAAL03-91-G-0210 is gratefully acknowledged. The support given to Polish members of the Quality Control Task Force by the International Team for Company Assistance has facilitated the introduction of SPC into Polish production. We particularly wish to thank our Rice colleagues Marek Kimmel and Martin Lawera (both originally from Poland) for their valuable work with the QCTF in Poland. The graphs in Chapter 7 and the trimmed mean algorithm therein are due to Martin Lawera, as are the reference tables at the end of the book. We also extend our thanks to Gerald Andersen, Wojciech Bijak, Barry Bodt, Diane Brown, Barbara Burgower, Jagdish Chandra, Gabrielle Cosgriff, Dennis Cox, Miroslaw Dabrowski, Piotr Dembinski, Katherine Ensor, James Gentle, Stuart Hunter, Renata Koronacka, Robert Launer, Kevin McDonald, Charles McGilchrist, Jan Mielniczuk, Marek Musiela, Michael Pearlman, Joseph Petrosky, Paul Pfeiffer, Zdzislaw Piasta, Rick Russell, Michael Sawyers, David Scott, Beatrice Shube, Andrzej Sierocinxvii © 2002 by Chapman & Hall/CRC
ski, Malcolm Taylor, Ewa Thompson, John Tukey, Matt Wand, Geoffrey Watson, Jacek Wesolowski, Edward Williams and Waldemar Wizner. James R. Thompson and Jacek Koronacki Houston and Warsaw, September 1992
Preface to the Second Edition The first edition of this book, Statistical Process Control for Quality Improvement, was published nearly ten years ago. Nevertheless, we find our earlier work a fair attempt (and, as near as we have been able to find, the only book length attempt) at mathematically modeling the Deming paradigm (SPC) for continuous quality improvement. Deming was an optimizer, not a policeman. Philosophically, the Quality Assurance paradigm (QA), which has dominated American manufacturing since World War II, has about as much in common with SPC as a horse and buggy has to a Mercedes. Both of these have wheels, brakes, and an energy source for locomotion, but it would be less than useful to think of them as “essentially the same,” as so many managers still seem to consider QA and SPC to be the same. Both QA and SPC use control charts, but to very different purposes. QA wants to assure that bad units are not shipped. SPC wants to assure that bad units are not created in the first place, and that the units are being produced by a system in a continuing state of improvement. If this point is not understood by the reader of this book, it is not from any lack of trying on the part of the authors. Anyone who has walked into both QA establishments and SPC establishments knows the extreme difference in the sociology of the two types. In the QA establishment, the worker is being watched for poor performance. In the SPC establishment, the worker is a manager, calmly focused on the improvement of the process, and with constant recognition of his contributions. Again, the Deming paradigm must not be confused with the touchyfeely boosterism associated with the “Quality Is Free” movement. Although SPC is one of the best things ever to happen to making the workplace a friendly and fulfilling environment, its goal is to improve the quality of the goods delivered to the customer by a paradigm as process oriented as physics. It was the insight of Deming that has led to the realization that one can use the fact that a lot exhibits a mean well away from the overall mean to indicate that something specific is wrong with the system. Using this technique as a marker, the team members can backtrack in time to see what caused the atypicality and fix the problem. As time progresses, relatively minor problems can be uncovered, once the major causes of jumps in variability have been found and removed. For reasons of user friendliness, Deming advocated the already venerxix © 2002 by Chapman & Hall/CRC
able run charts and control charts as a means of seeking out atypicality. Thus, Deming’s methodology, on the face of it, does appear to be very much the same thing as that advocated by the Quality Assurance folks or the New Age “Quality Is Free” school. Perhaps Deming himself was partly to blame for this, for he never wrote a model-based explanation for his paradigm. Moreover, Deming was advocating the use of old tests familiar to industrial engineers to achieve quite a different purpose than those of the older QA school: namely, Deming used testing to achieve the piecewise optimization of an ill-posed control problem. And, as anyone who has used SPC on real problems will verify, SPC works. This book is an updated and extended version of the first edition, with an increased length of roughly 25%. Criticisms by our colleagues and students of that earlier endeavor have been taken into account when preparing this book, as have been our own new experiences in the fascinating area of quality improvement in the manufacturing, processing and service industries. As in the first edition, we have tried to give examples of real case studies flowing from work we have ourselves undertaken. Consequently, as beyond the “production line” examples, we include an example of problems encountered when a new surgical team was brought into the mix of teams dealing with hip replacement. There is an example showing problems experienced by a company involved in the production of ecologically stable landfills. A look is given at a possible start-up paradigm for dealing with continuous improvement of the International Space Station. Thus, one aim of this book is to convince the reader that CEOs and service industries need SPC at least as much as it is needed on production lines. Deming viewed SPC as a managerial tool for looking at real world systems across a broad spectrum. So do we. Revisions of the former book include discussions, examples and techniques of particular interest for managers. In addition, the new edition includes a new section recapitulating in Chapter 1 how properly to understand and react to variability within a company; new section on process capability in Chapter 3; on the Pareto and cause-and-effect diagrams, as well as on Bayesian techniques, on bootstrapping and on the seven managerial and planning tools (also known as the Japanese seven new tools) in Chapter 5; and on multivariate SPC by principal components in Chapter 7. Usually, Professor Deming discussed methodologies for use with systems rather mature in the application of SPC. In the United States (and more generally) most systems in production, health care, management, etc., are untouched by the SPC paradigm. Consequently, we find it usexx © 2002 by Chapman & Hall/CRC
ful to look at real world examples where SPC is being used on a system for the first time. It does the potential user of SPC no good service to give the impression that he or she will be dealing with “in control” systems. Rather, our experience is quite the contrary. Startup problems are the rule rather than the exception. Exploratory Data Analysis and other minimal assumption methodologies are, accordingly, in order. In this new edition, we introduce Bayesian techniques for the early stages of operation of a complex system. This is done in the context of a real world problem where one of us (Thompson) was asked by NASA to come up with a speculative quality control paradigm for the operation of the International Space Station. NASA, which uses very sophisticated reliability modeling at the design stage, generally does not use SPC in the operation of systems. We show how even a very complex system, untouched by SPC, can be moved toward the Deming Paradigm in its operation by the use of Bayesian techniques. The design and operational problems of optimization are quite different. American companies frequently have excellent design capabilities, but forget that a system, once built, needs continually to be improved. On the other hand, the SPC professional should realize that a horse buggy is not likely to “evolve” into a Lexus. Design and continuous operational optimization, over the long haul, must both be in the arsenal of the successful health care administrator, industrial engineer, and manager. Deming knew how to combine design and operational optimization into one methodologically consistent whole. In this edition, we discuss his unifying approach by referring to the so-called ShewhartDeming Plan-Do-Study-Act cycle and, based on it, spiral of continual improvement. We also elaborate on means to help design an innovation, namely on the so-called seven managerial and planning tools. Because measurement statistics in quality control activities are generally based on averaging, there is a (frequently justifiable) tendency to assume the statistics of reference can be based on normal theory. Rapid computing enables us to use the nonparametric bootstrap technique as a means of putting aside the assumption of normal theory when experience hints that deviation from normality may be serious. Furthermore, rapid computing enables us to deal with multivariable measurement SPC. It is true that most companies would greatly improve their operations if they used even one dimensional testing. Nevertheless, experience shows that multivariate procedures may provide insights difficult to be gleaned by a battery of one dimensional tests. Most SPC today is still being done away from a computer workstation. That is changing. xxi © 2002 by Chapman & Hall/CRC
The support of the Army Research Office (Durham) under DAAL03-91-G-0210, DAAH04-95-1-0665 and DAAD19-99-1-0150 is gratefully acknowledged. We wish to thank Andrzej Blikle, Barry Bodt, Sidney Burrus, Roxy Cramer, Kathy Ensor, Sarah Gonzales, Jørgen Granfeldt, Chris Harris, Marsha Hecht, Richard Heydorn, Olgierd Hryniewicz, Stu Hunter, Renata Koronacka, Marek Kimmel, Vadim Lapidus, Robert Launer, Martin Lawera, Andrea Long, Brian Macpherson, Jan Mielniczuk, Jim Murray, Ken Nishina, Philippe Perier, Rick Russell, Janet Scappini, Bob Stern, Ewa Thompson, Ed Williams and Waldemar Wizner. James R. Thompson and Jacek Koronacki Houston and Warsaw, Christmas 2001
Chapter 1
Statistical Process Control: A Brief Overview

1.1 Introduction
The common conception about quality control is that it is achieved by diligence, a good attitude and hard work. Yet there are many companies where the employees display all these attributes and the quality of the product is poor. An example of this is the construction of the famous Liberty ships of World War II. These were ships hastily constructed to transport supplies to some of America's allies. Largely due to the fact that everyone — designers, welders, shipwrights, painters, engineers, etc. — had a keen sense that they were engaged in an activity which was essential for the survival of the United States, there was strong motivation. Unfortunately, keenness was not enough, and these ships were prone to sinking, sometimes immediately after being launched. Naturally, in the case of a wartime emergency, it could be argued that it is quite reasonable to sacrifice quality in order to increase production. If America had insisted that Liberty ships be perfect, then the war might have been lost. There is some merit in this argument, and to one extent or another the argument can be used in any production setting. There are orders to be filled by a specific date. If a company cannot make the deadline, then it is quite possible that the order will be given to another company. Short range concerns may, in some cases, overwhelm the long range goals of delivering a product of the best possible quality to a customer. In general, however, the intelligent application of the philosophy of
statistical process control will enable us to seek steady improvement in the quality of a product even while dealing with the day-to-day crises which are to one extent or another an unavoidable part of "staying alive" in the highly competitive world of a high tech society. At present, there appears to be a kind of revisionist theory to the effect that there is a smooth transition from the policing-based "quality assurance," which makes up the bulk of "quality control" as practiced in the United States and Europe, to the "statistical process control" paradigm generally associated with the name of W. Edwards Deming. One reads a great deal about "quality assurance" leading naturally to "quality improvement" (under which category statistical process control is supposed to be only one of many techniques). Some of the more successful companies in the lucrative business of teaching "quality control" to the unwary certainly take such a point of view. In reality, however, the statistical process control paradigm is quite different from the end product inspection strategies associated with "quality assurance." Almost anyone who does quality control consulting in real-world settings finds that the QA people are already firmly installed as the resident experts in QC, and that it is their inability to achieve improvement which has given "quality control" a generally bad reputation among management and staff alike. Even worse, we frequently find that the "quality control" group in a factory has already co-opted the language of statistical process control to mean something entirely different from its classical definition, to mean, as often as not, just the same old end inspection stuff, with some "touchy-feely" industrial psychology thrown in for good measure. So, when one talks about using control charts, the QA staff trots out ream upon ream of their own charts dating back, in some cases, decades. That these charts are generally based on arbitrary tolerance levels, that they are not recorded at sensing stations throughout the production process, and that they are used to decide who is OK and who is not rather than for improvement appears to matter little to them. They are doing everything right, marching on as quickly as possible to the nirvana of "zero defects."
Statistical process control, as we understand it, has as its irreducible core three steps:
(1) Flowcharting of the production process.
(2) Random sampling and measurement at regular temporal intervals at numerous stages of the production process.
(3) The use of "Pareto glitches" discovered in this sampling to backtrack in time to discover their causes so that they can be removed.
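To make steps (2) and (3) concrete, here is a minimal numerical sketch (not taken from this book; the data, sampling plan and function name are invented for illustration) of the screening that step (2) produces. Subgroup means recorded at regular intervals are compared against three-sigma limits estimated from a baseline period assumed to represent normal operation; the flagged epochs are the candidates one backtracks on in step (3). For brevity the standard error of the subgroup mean is estimated simply as the average within-subgroup standard deviation divided by the square root of the subgroup size, without the usual small-sample bias correction.

```python
import random
import statistics


def xbar_chart_flags(subgroups, baseline_count=20):
    """Flag subgroups whose means fall outside three-sigma control limits.

    `subgroups` is a list of equal-sized samples taken at regular
    intervals (step 2).  Limits are estimated from the first
    `baseline_count` subgroups, assumed to represent normal operation.
    Flagged epochs are candidates for backtracking (step 3) to find the
    Pareto glitch that produced them.
    """
    n = len(subgroups[0])
    means = [statistics.mean(s) for s in subgroups]
    grand_mean = statistics.mean(means[:baseline_count])
    # Average within-subgroup standard deviation over the baseline period
    # (no bias correction, to keep the sketch short).
    s_bar = statistics.mean(statistics.stdev(s) for s in subgroups[:baseline_count])
    sigma_xbar = s_bar / n ** 0.5
    ucl, lcl = grand_mean + 3 * sigma_xbar, grand_mean - 3 * sigma_xbar
    return [i for i, m in enumerate(means) if m > ucl or m < lcl]


if __name__ == "__main__":
    random.seed(1)
    # Hypothetical process: 30 in-control subgroups of 5, then a shifted epoch.
    data = [[random.gauss(10.0, 0.2) for _ in range(5)] for _ in range(30)]
    data += [[random.gauss(10.6, 0.2) for _ in range(5)] for _ in range(5)]
    print("subgroups to backtrack on:", xbar_chart_flags(data))
```

In practice the sampling stations, subgroup sizes and baseline period would come from the flowchart produced in step (1); the construction of such charts is taken up in detail in Chapter 3.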
Statistical process control is a paradigm for stepwise optimization of the production process. That it works so well is not self-evident, but is empirically true. During the course of this book, we shall attempt to give some insights as to why the SPC paradigm works. The wording of the three core steps of SPC, and of the SPC paradigm in general, has been given here for production processes. Let us emphasize, however, that it applies equally well to service processes and to management processes. An example of a process in a service setting, health care, is discussed in Section 2.4. Examples of management processes are alluded to in later sections of this chapter.
1.2 Quality Control: Origins, Misperceptions

There are a number of paradoxes in quality control. A "paradoxical" subject, of course, ceases to be a paradox once it is correctly perceived. It is not so much our purpose in this chapter to go into great detail concerning control charts and other technical details of the subject as it is to uncover the essence of the evolutionary optimization philosophy, which is always the basis for an effective implementation of quality control. Most of us, including many quality control professionals, regard quality control as a kind of policing function. The quality control professional is perceived as a kind of damage control officer; he tries to keep the quality of a product controlled in the range of market acceptability. The evolutionary optimization function of quality control is frequently overlooked. Yet it is the most important component. The major aspect of much quality control philosophy in most American companies is based on worker motivation and attitudinal adjustment. Workers, it is deemed, are too lazy, too careless, too insensitive to the mission of the enterprise. The mission of the quality control professional is based on the notion of getting workers to be regular, sober, and keen. Such a notion has always been paramount in the Soviet Union. The example of the Hero of Socialist Labor Stakhanov is remembered. Stakhanov was extremely adept at getting coal picked out of a coal seam. Some Soviet bureaucrat seized upon the brilliant idea of setting Stakhanov up as an example to his fellow workers. The idea was that although it might not be reasonable to expect every worker to perform at 100% of Stakhanov's efficiency, there was some fraction of 100% which could be set as a lower bound for satisfactory performance. Of course, some other bureaucrat decided that it might be a good idea to give Stakhanov a bit
of assistance as he was setting the official standard of good performance. Thus Stakhanov was presented with a rich, easily accessible coal seam, and he had two "helpers" as he was picking away. So Stakhanov set a really fine standard, and workers who did not perform at an arbitrary fraction of the standard were docked, and if they fell below an arbitrary floor, they could be sent to Siberia to work without salary in the Gulags. The Stakhanovite program was one of the principal causes of worker unrest in the Soviet Bloc, and was the subject of one of the strongest anti-Communist films (produced in People's Poland during a brief thaw), Man of Marble. Unfortunately, even in contemporary America, it is apparent that variants of the Stakhanovite program are still employed. Time and motion studies, piecework, quotas, etc., are an integral part of all too many businesses. However, in the best companies, the goal has been to work smart rather than hard. By cooperation with management and technological innovation, workers in these firms have brought their per person production well above that of Stakhanov. Statistical process control is a paradigm for achieving quality by which we can continuously learn from our experiences to work smarter rather than harder. To find the exact time when the notion of quality control began is not easy. We might recall the craft guilds of medieval Europe. The function of such guilds was oriented toward the training of apprentices and journeymen so that they might become competent masters of the craft. To become a master of the craft of coppersmithing would typically require ten years or more of learning by doing. Yet, in the sense in which we generally understand quality control, its true beginnings were in 1798 when Eli Whitney developed his musket factory. This was the first time in which a nontrivial production process based on the notion of interchangeable parts was used. It was noted, with concern by some members of Congress, that the fledgling Republic might be drawn into the Napoleonic Wars. At that time, America had few producers of muskets, and a revolution in production was required if the musket shortage was to be solved. This may sound strange, since our textbooks are full of stories about Daniel Boone and other stalwart Indian fighters. But the reality of the situation is that few Americans even in the early days of the Republic were frontiersmen. Most spent their lives peacefully engaged in agriculture and other pursuits. The sizes of the armies engaged in the American Revolution, moreover, had been small when compared with the massive formations engaged in the European conflict. The musket craftsmen of the day turned out weapons
one at a time. The process was lengthy and risky. The failure to achieve appropriate fits of the various components caused many a musket to explode during the test firing. Whitney conceived the idea of making a musket out of approximately 50 key parts. Each of the 50 parts was milled on a machine in such a way that each copy was nearly identical to any other. At the test assembly, Whitney invited members of Congress to pick parts at random from numbered containers. He then assembled the musket, loaded it, and test fired it, while the Congressmen withdrew to a discreet distance. After the process had been carried out several times, it dawned on the Congressional delegation that something rather remarkable had been achieved. Just what would have happened to the United States during the War of 1812 absent Whitney’s development of the first modern assembly line process is a matter of conjecture. A century after Whitney’s musket factory, Henry Ford developed the modern automobile. The key to the success of the Ford was its reliance on the internal combustion engine, with its high power to weight ratio. The Model T had approximately 5,000 parts. It is interesting to note that Ford’s Model T contained approximately 100 times as many parts as Whitney’s musket. This is a true quantum leap in complexity. Many scoffed at the idea that such a complex system could work. Indeed, using anthropomorphic analogies, there is little chance that 5,000 human beings could be relied upon to perform precise tasks at precise times, over and over again millions of times. Ford demonstrated that machines do not perform like people. This fact continues to elude the attention of many, even those who are chief executive officers of large industrial corporations. The importation of Ford technology into Germany created the high technological society for which that nation is rightly known. The Volkswagen created by Ferdinand Porsche was a downsized version of the Ford vehicle, and its name was taken from Ford’s name for the Model T, “the People’s Car.” Unfortunately and to the distress of pacifist Ford, the Germans learned Ford’s methods all too well, applying his techniques to the production of a highly efficient war industry brutally used against mankind. During World War II, the Germans generally maintained higher standards of quality control than did the Americans, who had developed the notion of the assembly line in the first place. It has been argued [6] that in a real sense World War II America moved perceptibly toward Soviet-style production, characterized by grandiose schemes, poor planning, and quality achieved by empty slogans and pres-
6 chapter 1. statistical process control: a brief overview sure on the workers. As we shall see, it is a serious mistake to use workers as well disciplined automatons. In a high technology society, most work is done by machines. The function of a worker is not regularity, anymore than the function of a scientist is regularity. A worker, like a scientist, should be an innovator. To attempt to manage an industrial system by stressing the workers is generally counterproductive. The work is done by machines. Machines, unlike human beings, are incredibly regular and, with some very high tech exceptions, completely unable to innovate. The function of “quality control” is to provide the maximum amount of creativity at all levels to achieve a constantly improving standard of performance. Quality control, correctly perceived, is an orderly system for monitoring how well we are doing so that we can do better. These observations apply to quality control philosophy in general, whether in a production setting or in service and management settings. Only the role of machines in the production setting is played in the two latter settings by procedures to be followed by employees when they process their tasks related to an external customer or report results to management.
1.3 A Case Study in Statistical Process Control
Let us consider a favorite example of the effect of intelligent quality control presented in Quality, Productivity and Competitive Position [1] by W. Edwards Deming. Nashua Corporation, a manufacturer of carbonless paper, would appear to be in an excellent competitive position to market its product in Europe. Located in New Hampshire, Nashua had a ready source of wood pulp for paper; the expensive coating material for the paper was made completely from materials in the United States. Nashua had the very latest in equipment. The manufacture of carbonless paper is rather high tech. Paper is coated with photosensitive material on a web 6 to 8 feet wide traveling at a speed of 1,100 linear feet/minute. A key part of the process is to coat the paper with the minimal amount of photosensitive material consistent with good quality of performance. Parts of the paper with too little material will not reproduce properly. In late 1979, Nashua was using 3.6 pounds per 3,000 square feet. It would appear that Nashua was in a strong competitive position indeed. There was, however, a small problem. The Japanese, without trees for paper and without the minerals for the paper coating, and far
from Europe, were able to sell carbonless paper to the Europeans for less cost than Nashua. Even worse, the quality of the Japanese paper was better (i.e., more uniform and, hence, less likely to jam and better for reproduction) than that of the Nashua product. What was to be done? One immediate answer preferred by some American industrialists and union leaders goes as follows. “It is unfair to ask American companies to compete with low Japanese wage structures. Put a tariff on the Japanese product.” There are problems with this argument and solution. First of all, labor cost differentials are not generally dominant in a highly automated industry. The differential might make up something like 5% of the competitive advantage of the Japanese, an amount clearly not equal to the Nashua advantages of ready raw materials and relative closeness to markets. Furthermore, suppose the Japanese decided to retaliate against an American tariff on carbonless paper by putting a tariff on lumber from Washington and Oregon, purchasing their material rather from British Columbia. Americans who had to buy a more expensive American product would be giving a subsidy to Nashua, raising the cost of production in their own businesses and, hence, marginally making their own products less competitive on a world market. And since a major portion of Nashua’s target market was in Europe rather than in the United States, an American tariff would be an impossibility there. Demanding a tariff on carbonless paper simply was not a feasible option for Nashua. The most likely option in late 1979 appeared to be to go into bankruptcy. Another was to buy a new coating head, at a cost of $700,000 plus loss of production during installation. The solution used by Nashua was to try to find out what was going on in an orderly fashion. It is always a good idea, when a competitor producing the same product with equipment and logistical support inferior to our own, is able to best us both in terms of quality and price, to learn what he is doing better than we are. A modular investigation involves measuring the input and output at each module of the process. In the Nashua case, it turned out that the major problem and its solution were incredibly simple. The output record of the coating machine had been used to achieve instantaneous feedback control. Thus, if the paper was being coated too thickly, an immediate adjustment was made in the coating machine. A first step in quality control involves a modularization of the system of the process. Such a modularization is shown in Figure 1.1.
[Figure 1.1. Typical System Flowchart: a production process decomposed into modules with outputs X_1(t), X_2(t), ..., X_14(t).]

It turned out that the coating machine reached equilibrium only after some time. Instantaneous control, then, was a terrible idea, leading to an institutionalized instability causing high variability in the product and causing a waste of substandard paper and a loss of expensive coating material. The solution to Nashua’s problem was to cease “overcontrolling” the coating process. Letting the coating machine do its job produced immediate dividends. First of all, it was unnecessary to purchase a new coating head; hence, a saving of $700,000 was made possible. More importantly, very quickly Nashua was able to produce a uniform product using a coating rate of 3.6 pounds per 3,000 square feet. This made the Nashua product competitive. But this was not the end of the story. Quality control is a continuous evolutionary optimization process. Noting that the coating costs $100,000 per tenth of pound per year, Nashua began to take advantage of its high tech coating machine. In less than a year, it was able to reduce the coating rate to 2.8 pounds per 3,000 square feet, producing an annual savings of $800,000. By 1985, the coating rate had been reduced to 1.0 pound per 3,000 square feet [4]. The consequences of the above example are clear. An unprofitable American company escaped ruin and became a savvy enterprise which can stand up to competition from any quarter. No recrimination was taken against those who had overcontrolled the system. A superior, more profitable product benefited both the company and its customers. The price paid was not increased stress for the workers. Indeed, the improvements in production eliminated a good deal of stress on the part of everyone involved. The company had simply engaged in that which Americans have typically done well: finding out problems, describing them and solving them. The Nashua example is fairly typical, in that the problem involved
was not a general malaise across the production line, but a failure at one particular point of the process. It is not that everything else except overadjustment of the coating machine was perfect. But the other failings were relatively insignificant when compared to that of overadjustment.
1.4 If Humans Behaved Like Machines
Imagine a room filled with blindfolded people which we would wish to be quiet but is not because of the presence of a number of noise sources. Most of the people in the room are sitting quietly, and contribute only the sounds of their breathing to the noisiness of the room. One individual, however, is firing a machine gun filled with blanks, another is playing a portable radio at full blast, still another is shouting across the room, and, finally, one individual is whispering to the person next to him. Assume that the “director of noise diminution” is blindfolded also. Any attempt to arrange for a quiet room by asking everyone in the room to cut down his noise level 20% would, of course, be ridiculous. The vast majority of the people in the room, who are not engaged in any of the four noise making activities listed, will be annoyed to hear that their breathing noises must be cut 20%. They rightly and intuitively perceive that such a step is unlikely to do any measurable good. Each of the noise sources listed is so much louder than the next down the list that we could not hope to hear, for example, the person shouting until the firing of blanks had stopped and the radio had been turned off. The prudent noise diminution course is to attack the problems sequentially. We first get the person firing the blanks to cease. Then, we will be able to hear the loud radio, which we arrange to have cut off. Then we can hear the shouter and request that he be quiet. Finally, we can hear the whisperer and request that he also stop making noise. If we further have some extraordinary demands for silence, we could begin to seek the breather with the most clogged nasal passages, and so on. But generally speaking, we would arrive, sooner or later, at some level of silence which would be acceptable for our purposes. This intuitively obvious analogy is a simple example of the key notion of quality control. By standards of human psychology, the example is also rather bizarre. Of the noise making individuals, at least two would be deemed sociopathic. We are familiar with the fact that in most gatherings, there will be a kind of uniform buzz. If there is a desire of a master of ceremonies to quieten the audience, it is perfectly reasonable for him to ask everyone please to
be quiet. The fact is that machines and other systems tend to function like the (by human standards) bizarre example and seldom behave like a crowd of civilized human beings. It is our tendency to anthropomorphize systems that makes the effectiveness of statistical process control appear so magical.
1.5 Pareto’s Maxim
A cornerstone of SPC is an empirical observation of the Italian sociologist Vilfredo Pareto: in a system, a relatively few failure causes are responsible for the overwhelming majority of failures. Let us return again to the situation described in Figure 1.1. Each one of the boxes represents some modular task in a production process. It is in the nature of the manufacturing process that it be desirable that the end product output X_14(t) be maintained as constant as possible. For example, a company which is making a particular type of machine bolt will want to have them all the same, for the potential purchasers of the bolt are counting on a particular diameter, length, etc. It is not unheard of for people to pay thousands of dollars for having a portrait painted from a simple photograph. The artist’s ability to capture and embellish some aspect he perceives in the photograph is (quite rightly) highly prized. A second artist would produce, from the same photograph, quite a different portrait. No one would like to see such subjective expression in the production of bolts. If we allowed for such variability, there would be no automobiles, no lathing machines, and no computers. This fact does not negate aesthetic values. Many of the great industrial innovators have also been major patrons of the arts. Modularity demands uniformity. This fact does not diminish the creative force of those who work with modular processes any more than a net interferes with the brilliance of a tennis professional. And few workers in quality control would wish to have poems written by a CRAY computer. The measurement of departures from uniformity, in a setting where uniformity is desired, furnishes a natural means of evolutionary improvement in the process. Now by Pareto’s Maxim, if we see variability in X_14(t), then we should not expect to find the source of that variability distributed uniformly in the modules of the process upstream. Generally, we will find that one of the modules is offering a variability analogous to that of the machine gun firing blanks in an auditorium. The fact that this variability is very
likely intermittent in time offers the quality control investigator a ready clue as to where the problem is. Naturally, there may be significant time lags between X_3(t), say, and X_4(t). The sizes of the lags are usually well known; e.g., we usually know how far back the engine block is installed before final inspection takes place. Thus, if a particular anomaly in the final inspection is observed during a certain time epoch, the prudent quality control worker simply tracks back the process in time and notes variabilities in the modules which could impact on the anomaly in the proper time frame. Although they are useful for this purpose, statistically derived control charts are not absolutely essential. Most of the glitches of the sort demonstrated in Figure 1.2 are readily seen with the naked eye. The Model T Ford, which had roughly 5,000 parts, was successfully monitored without sophisticated statistical charts. We show, then, the primitive run chart in Figure 1.2. Once we have found the difficulty which caused the rather substantial glitch between hours 4 and 8, we will have significantly improved the product, but we need not rest on our laurels. As time proceeds, we continue to observe the run charts.
[Figure 1.2. Run Chart: departure from average (in tenths of inches) plotted against time in hours.]

We observe a similar kind of profile in Figure 1.3 to that in Figure 1.2. However, note that the deviational scale has been refined from tenths to hundredths of inches. Having solved the problem of the machine gun firing blanks, we can now approach that of the loud radio. The detective process goes forward smoothly (albeit slowly) in time. Ultimately, we can produce items which conform to a very high standard of tolerance.
[Figure 1.3. Run Chart of Improved Process: departure from average (in hundredths of inches) plotted against time in hours.]

The run charts shown in Figures 1.2 and 1.3 allude again to an industrial setting. It is easy, however, to construct such charts in other settings too. For example, for accounts payable, one can depict percent of unpaid invoices month by month. Equally easily, one can monitor monthly sales, seasonally adjusted if needed. One can gather numbers of customers serviced in consecutive periods, percents of items returned by customers, numbers of incoming or outgoing telephone calls, energy consumption, etc., etc. It is in fact amazing how often and how many numerical data on our daily activities in a business can be recorded to be later used to advantage. A friend of one of the authors, a British SPC consultant and assistant to Deming in his last years of work, made patients of a cardiologist record their blood pressure data to help the doctor better see, diagnose and react to the glitches thus revealed.

Another Difference between Machines and People. A second paradox in quality control, again due to our tendency to treat machines and other systems as though they were human beings, has to do with the false perception that a source of variability, once eliminated, is very likely to occur again as soon as we turn our attention to something else. In many human activities, there is something like an analogy to a man juggling balls. As he hurls a red ball upward, he notices that he must immediately turn his attention to the yellow ball which is falling. When he hurls the yellow ball upward, he notes that the green ball is falling. By the time he throws the green ball upward, he must deal with the red
ball again. And so on. If a human decides to give up smoking, he is likely to note an increase in his weight. Once he worries about weight control, he may start smoking again. Happily, machines suffer from no such difficulty. One of the authors drives an 11-year-old Volvo. The engine has experienced many millions of revolutions and still functions well. No one is surprised at such reliability with mechanical devices. The ability to execute identical operations many times without fail is in the very nature of machines. Interestingly, there are many systems in which similar reliability is to be expected. In highly automated industrial systems, once a source of variability has been eliminated, it is unlikely to be a problem again unless we are very careless indeed. The in-line, four-cylinder Volvo engine, developed over 40 years ago, still functions very well. A newer, more sophisticated engine may be much less reliable unless and until we have carefully gone through the quality control optimization process for it as well. Innovation is usually matched by new problems of variability not previously experienced in older systems. This does not speak against innovation; rather it warns us that we simply do not reasonably expect to have instantaneous quality control with a new system. Companies and other organizations that expect to go immediately from a good idea to a successful product are doomed to be disappointed. The Basic Paradigm for Quality. The basic paradigms of both Whitney and Ford were essentially the same: (1) Eliminate most potential problems at the design stage. (2) Use extensive pilot study testing to eliminate remaining undiscovered problems. (3) Use testing at the production stage to eliminate remaining glitches as a means of perfecting the product, always remembering that such glitches are generally due to defects in a few submodules of the production process rather than general malaise. Ford’s Four Principles of Manufacturing. Henry Ford codified his ideas in his four principles of manufacturing [3]: (1) An absence of fear of the future or veneration of the past. One who fears failure limits his activities. Failure is only the opportunity to begin again. There is no disgrace in honest failure; there is disgrace in fearing to fail. What is past is useful only as it suggests ways and means for progress. (2) A disregard of competition. Whoever does a thing best ought to be the one to do it. It is criminal to try to get business away from another man – criminal because one is then trying to lower for personal gain the
condition of one’s fellowmen – to rule by force instead of by intelligence. (3) The putting of service before profit. Without a profit, business cannot expand. There is nothing inherently wrong about making a profit. Well-conducted business enterprise cannot fail to return a profit, but profit must and inevitably will come as a reward for good service – it must be the result of service. (4) Manufacturing is not buying low and selling high. It is the process of buying materials fairly and, with the smallest possible addition of cost, transforming those materials into a consumable product and giving it to the consumer. Gambling, speculating and sharp dealing tend only to clog this progression. Ford’s four principles no longer sound very modern. In the context of what we are accustomed to hear from America’s contemporary captains of industry, Ford’s four principles sound not only square but rather bizarre. That is unfortunate, for they would not sound bizarre to a contemporary Japanese, Taiwanese, Korean or German industrialist.
1.6 Deming’s Fourteen Points
The modern paradigm of quality control is perhaps best summarized in the now famous fourteen points of W.E. Deming [1], who is generally regarded as the American apostle of quality control to Japan: (1) Create constancy of purpose toward improvement of product and service, with a plan to become competitive and to stay in business. Decide to whom top management is responsible. (2) Adopt the new philosophy. We are in a new economic age. We can no longer live with commonly accepted levels of delays, mistakes, defective materials and defective workmanship. (3) Cease dependence on mass inspection. Require instead statistical evidence that quality is built in, to eliminate need for inspection on a mass basis. Purchasing managers have a new job and they must learn it. (4) End the practice of awarding business on the basis of price tag. Instead, depend on meaningful measures of quality, along with price. Eliminate suppliers that can not qualify with statistical evidence of good quality. (5) Find problems. It is management’s job to work continually on the system (design, incoming materials, composition of material, maintenance, improvement of machine, training, supervision, and retraining).
(6) Institute modern methods of on the job training. (7) Institute modern methods of supervision of production workers. The responsibility of foremen must be changed from sheer numbers to quality. Improvement of quality will automatically improve productivity. Management must prepare to take immediate action on reports from foremen concerning barriers such as inherited defects, machines not maintained, poor tools, fuzzy operational definitions. (8) Drive out fear, so that everyone may work effectively for the company. (9) Break down barriers between departments. People in research, design, sales, and production must work as a team, to foresee problems of production that may be encountered with various materials and specifications. (10) Eliminate numerical goals, posters and slogans for the work force, asking for new levels of productivity without providing methods for improvement. (11) Eliminate work standards that prescribe numerical quotas. (12) Remove barriers that stand between the hourly worker and his right to pride of workmanship. (13) Institute a vigorous program of education and training. (14) Create a structure in top management which will push every day on the above 13 points. The paradigm of Deming is sufficiently general that it might be interpreted as not being applicable to a specific industry at a particular time. We are all familiar with vacuous statements to the effect that we should somehow do a good job. But we should examine the Deming Fourteen Points carefully before dismissing them. Below we consider the points one at a time. (1) Management is urged to plan for the long haul. We cannot simply lurch from crisis to crisis. We need to consider what we believe the company really is about. What do we plan to be doing next quarter, next year, five years, ten years from now. It goes without saying that our plans will require constant modification. But we need to ask what our direction is. What new technologies will likely affect us. We need to plan a strategy that will make the firm unique. We need to be pointed on a course which will make us the best in the world in what we do. If we have no such course plotted, it is unlikely that we will simply stumble upon it. (2) A common assumption in some American industries is that a significant proportion of the things they do will simply be defective. Once
again, we return to the misconception that machines function as people do. They do not. If we move methodically toward improved production, we need not fear that our improved excellence will be fleeting. A human juggling balls is very different from a production schedule optimally designed. Our goal must be systematic improvement. Once we achieve a certain level of excellence, it is locked in forever. If, on the other hand, we assume that prospective clients will always be satisfied to return defective items or have us repair them, we are due, down the road, for an unpleasant surprise. Sooner or later, a competitor will discover how to produce a product like ours but without the defects. (3) Once we have achieved the middle level of excellence, we will not require extensive inspection. Again, this points to the fact that work done by machines will generally be satisfactory once we have determined how to make them operate properly. One of the major benefits of a well designed system of statistical process control is that it will rather quickly liberate us from carrying out exhaustive (or nearly exhaustive) inspection. Similarly, we should be able to demand from those companies who supply us a steady stream of products of excellent quality. (4) Companies which constantly change their suppliers on the basis of marginally lower prices are playing a lottery with quality. We have the right to expect that those companies to whom we sell will think long and hard before replacing us with a competitor. This is not being polite but simply rational. “If it ain’t broke, don’t fix it” is a good rule. If our supplier is delivering high quality products to us at a reasonable price, then, on the offer by another supplier to deliver us the same products for less, we would probably be wise, as a first step, to meet with our original supplier to see whether there might be some adjustment downward in the price of his product. Even if no such adjustment is offered, we probably will be well advised to stick with the original supplier, particularly if the saving involved in the switch is modest and/or if we have no firm evidence that the product being offered is of quality comparable to that from the current supplier. (5) Once a production process has been brought to the middle level of excellence, where day-to-day crisis management is not the norm, we will have arrived at the region where we can increase our understanding of the process and attempt to improve it. Many American firms always operate in crisis mode. In such a mode, it is not easy to effect even the simplest of improvements, such as rearranging the production line to minimize the necessity of transporting intermediate products from building to building.
(6) In a modern industrial setting, a worker is a manager, perhaps of other workers, perhaps of machines. The worth of a worker to the company is based on the cumulative knowledge and experience of that worker. If the company is beyond the crisis mode of operation, then there is time to enhance worker knowledge by a well thought-out system of on the job training. There are few more profitable investments which the company might make than those spent in expanding the ability of the worker to deal effectively with the challenges of his current job and of those further down his career path. (7) A foreman should not be a taskmaster, looking to improve production by stressing the workers. This is simply not the way to achieve improved performance, as the failure of economic systems based on such a paradigm indicates. A foreman is a manager of other managers, and his job is to coordinate the goals of the section and to entertain suggestions as to how these might best be achieved. (8) So far from increasing productivity, arguments from management along the lines, “You have to increase your output by 10%, or you are likely to lose your job,” are very counterproductive. It is not hard to figure out why. In addition to the natural frustrations associated with any job, management has just added another frustration: a demand that the worker increase his output without being shown how to do so. An upbeat environment in which management provides concrete help for the increase of productivity, without threats, has proved the best way to achieve maximum quality and productivity. (9) Every organization requires some form of structure. Over time, however, changes in products and technology tend to render the boundaries between departments less and less relevant. One solution is to have reorganization of the firm every six months or so. Another, much more appropriate one, is to permit free flow of information and cooperation between departments. Once again, experience shows that such an approach does not impair organizational discipline and does enhance the total quality and productivity of the company. (10) No society has been bigger on goals than that of the Soviet Bloc. No society has been bigger on posters, slogans and the like. The result has been disastrous. A principal duty of management is to show how productivity can been improved. It is unnecessary to emphasize the fact that productivity enhancements are crucial to the life of a firm and to those of the workers. Everyone knows that. (11) This point is a reemphasis of the previous one. It is so important that it deserves special emphasis. Workers need to be shown how to
perform better. Simple demands that they do so are generally counterproductive. (12) No human being wishes to be a cog in a machine. Every worker in an effectively managed firm is important. This is reality, not just a slogan. A major source of improvement in the production process is discoveries by workers as to how things might be done better. Such discoveries will be encouraged if those who make the discoveries are given full credit. (13) Beyond on the job training, a firm is well advised to enhance the ability of workers to improve their skills by in house courses. (14) A quality control program which is treated by top management as a pro forma activity is not likely to be effective. Top management needs to be convinced as to the importance of the program and needs to take as much a lead in it as they do in marketing and administration.
1.7 QC Misconceptions, East and West
Problems in production are seldom solved by general broad spectrum exhortations for the workers to do better. An intelligent manager is more in the vein of Sherlock Holmes than that of Norman Vincent Peale. This rather basic fact has escaped the attention of most members of the industrial community, including practitioners of quality control. To give an example of this fact, we note the following quotation from the Soviet quality control expert Ya. Sorin [5]: Socialist competition is a powerful means of improving the quality of production. Labour unions should see to it that among the tasks voluntarily assumed by brigades and factories ... are included those having to do with improving quality control .... In the course of socialist competition new forms of collective fight for quality control are invented. One such form is a “complex” brigade created in the Gorky car factory and having as its task the removal of shortcomings in the design and production of cars. Such “complex brigades” consist of professionals and qualified workers dealing with various stages of production. The workers who are members of such brigades are not released from their basic quotas. All problems are discussed after work. Often members of such
brigades stay in factories after work hours to decide collectively about pressing problems. In the above, we note the typical sloganeering and boosterism which is the hallmark of a bad quality control philosophy. The emphasis, usually, of bad quality control is to increase production by stressing the workers. The workers are to solve all problems somehow by appropriate attitudinal adjustment. Quality control so perceived is that of a man juggling balls, and is doomed to failure. Of course, Soviet means of production are proverbial for being bad. Surely, such misconceptions are not a part of American quality control? Unfortunately, they are. Let us consider an excerpt from a somewhat definitive publication of the American Management Association’s Zero Defects: Doing It Right the First Time ( [7], pp 3-9): North American’s PRIDE Program is a positive approach to quality that carries beyond the usual system of inspection. It places the responsibility for perfection on the employee concerned with the correctness of his work. Its success depends on the attitude of the individual and his ability to change it. This program calls for a positive approach to employee thinking .... A good promotional campaign will help stimulate interest in the program and will maintain interest as the program progresses. North American began with a “teaser” campaign which was thoughtfully conceived and carried out to whet the interest of employees prior to the kick-off day .... Various things will be needed for that day – banners, posters, speeches, and so forth.... Many of the 1,000 companies having Zero Defects programs give employees small badges or pins for signing the pledge cards and returning them to the program administrator. Usually these badges are well received and worn with pride by employees at all times on the job .... In addition, it is good to plan a series of reminders and interest-joggers and to use promotional techniques such as issuing gummed stickers for tool boxes, packages and briefcases. Coffee cup coasters with printed slogans or slogans printed on vending machine cups are good. Some companies have their letterhead inscribed with slogans or logos which can also be printed on packing cases. There is little doubt that an intelligent worker confronted with the Zero Defects program could think of additional places where manage-
ment could stick their badges, pins and posters. A worker, particularly an American worker, does not need to be tricked into doing his best. In a decently managed company, employee motivation can be taken as a given. But a human being is not a flywheel. The strong point of humans is their intelligence, not their regularity. Machines, on the other hand, can generally be counted on for their regularity. The correct management position in quality control is to treat each human worker as an intelligent, albeit erratic, manager of one or more regular, albeit nonreasoning, subordinates (machines). Thus, human workers should be perceived of as managers, problem finders and solvers. It is a misunderstanding of the proper position of human workers in our high-tech society which puts many industries so much at the hazard.
1.8 White Balls, Black Balls
We begin with a simple example. In Figure 1.4, we show ten lots of four balls each. Each ball in lots 1 through 10 is white. There is no evidence to suggest that the process which produced the balls in any of the lots is different from any of the others.
[Figure 1.4. Perfect White Ball Production: ten lots of four balls each, all white.]

In Figure 1.5, we show ten lots of four balls each. We note that in nine of the lots, the balls are all white. In the tenth lot, all the balls are black. There is no doubt that the balls come from lots which are significantly different in terms of color. The production process for lot 10 appears significantly different from that which produced the balls in lots 1 through 9.
[Figure 1.5. White Ball Production With One Bad Lot: lots 1 through 9 all white, lot 10 all black.]
Next, in Figure 1.6, we show ten lots of four balls. In the tenth lot, one of the balls is black, and it would appear likely that the manufacturing process used to produce the balls is different in lot 10 than in the other lots.
[Figure 1.6. White Ball Production: Trouble In Tenth Lot.]

In Figure 1.7, the situation becomes more ambiguous. Black balls appear in each of the lots except the fourth. White balls appear in each of the lots except for the eighth.
[Figure 1.7. White Ball Production: Troubles Throughout.]

We could, of course, say that the manufacturing process behaved quite differently in lot 4 than it did in lot 8. But if we are attempting to characterize changes across the manufacturing process, without knowing any more than the data before us, we would probably note that out of a total of 40 balls, 13 are black and 27 are white, and that there seems to be a random process at work by which black and white balls are produced in random fashion from lot to lot, with around 13/40 of these black and 27/40 white. In other words, there appears to be a difference in the conclusions we draw in Figures 1.4, 1.5, 1.6 and 1.7. In Figure 1.4, where the balls are all of the same color, we have no reason to doubt that the same production process was used for each of the lots. In Figure 1.5, there appears to be a big difference in the production process for lot 10, where the balls are all black. In Figure 1.6, we note one black ball in the tenth lot, and state, without a great deal of confidence, that we believe the production process was different for lot 10 than for lots 1 through 9. In Figure 1.7 we see such variability in the process that we are inclined to guess that each of the ten lots was produced by the same variable production process.
In the consideration of Figures 1.4 - 1.7, we see an apparent paradox. In Figure 1.4, which produced only white balls, we believe that each of the ten lots was produced by the same process. In Figure 1.7, which exhibited a widely varying number of white and black balls, there appeared no reason to question that the ten lots were each produced by a common process. In Figure 1.5, where one of the lots consisted exclusively of black balls, it appeared that lot 10 was probably produced by a different process than that of the lots 1-9. Figure 1.6 is, perhaps, the most ambiguous. The one black ball in lot 10 may be indicative of a basic change in the production process, or it may simply be the result of a small level of variability across the production process in all the lots. It is interesting to note that the most uniform figure, Figure 1.4, and the most variable, Figure 1.7, seem to indicate that in the two figures, the production process does not change from lot to lot. Figure 1.6, which produced all white balls except for one black ball, is, nonetheless, more likely to have exhibited a change in the production process than the highly variable Figure 1.7. The above example indicates a basic concept of statistical process control. In SPC, we examine the process as it is, rather than as we wish it to be. We might desire to produce only white balls. However, we must deal with the production record before us. In Figure 1.7, the performance is disappointing if our goal is the production of exclusively white balls. But there appears little information in the data as to where the production problem occurs in time. The process is “in control” in the sense that it appears that the process, though variable, is not particularly more variable in one lot than in any other. Let us consider the situation in Figure 1.8. Here, there are four black balls out of 40. There appears no guide, just from the data before us, as to what we can do to improve the situation. In other words, there appears to be no “Pareto glitch” which we can use as a guide to improving the production system.
[Figure 1.8. White Ball Production: Troubles Throughout.]

Recall, however, the situation in Figure 1.5. Here, we also saw only four black balls out of forty. However, the fact that all these balls occurred in lot 10 may well give us some guide to improving the production
process. Figure 1.5 exhibits a “Pareto glitch.” Both Figures 1.5 and 1.8 exhibit the same proportion of black balls (10%). However, Figure 1.5 exhibits a situation where we can ask meaningful questions, such as, “Why is it that each of the nine lots was exclusively white except for lot 10, which exhibited only black balls?” Fortunately, the situation in Figure 1.5 is very common in industrial production. When we seek out the cause for the “Pareto glitch” we will probably be able to fix it. Unfortunately, black balls are frequently produced as well, and each black ball is equally bad. In the real world, things are seldom black and white. Good items are not color coded as white, bad ones not coded as black. Generally, we have to rely on measurements to determine how we are doing, and measurements do not so easily fall into “good” and “bad” categories as did the white and black balls recently considered. The previous example concerning the production of white balls is a study in black and white. We are trying to produce white balls, and any white ball is equally perfect. The real world situation of quality control frequently starts out as a “white ball, black ball” dichotomy. In a medical setting, a patient who survives a heart bypass operation and returns home is a “white ball.” One who dies before getting back home is a “black ball.” But as the process is refined, we will tend to use more sophisticated measurements than life and death. For example, we might consider also heart function one month after surgery as measured against some standard of performance. Length of postoperative stay in hospital is another measure. The refinement of measurement beyond life and death is almost always desired as we go down the road of optimizing treatment for bypass patients. But it brings us into the real world of shades of gray. In an industrial setting, statistical process control will generally have to start with an in place “quality assurance” protocol. Thus, in the manufacturing of bolts, there will be two templates representing the upper and lower limits of “acceptable” diameter. A bolt is satisfactory if it fits into the upper limit template and fails to fit into the lower limit template. As time progresses, however, quality control will move to measuring diameters precisely, so as to provide greater possibility for feedback to improve the production.
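Before leaving the black-and-white setting, it is worth quantifying the intuition behind Figures 1.5 and 1.8. The short Python sketch below is our own addition, not part of the original argument; it computes the chance that four black balls, if they were scattered completely at random among the forty positions, would all land in the same lot of four. The answer is roughly one chance in nine thousand, which is why the pattern of Figure 1.5 points so strongly to something special about lot 10.

from math import comb

total_balls, lot_size, n_lots, n_black = 40, 4, 10, 4

# All ways of placing the 4 black balls among the 40 positions.
all_placements = comb(total_balls, n_black)

# Placements in which every black ball falls inside one particular lot of 4:
# exactly one such placement per lot, hence n_lots of them in all.
same_lot_placements = n_lots * comb(lot_size, n_black)

print(same_lot_placements / all_placements)   # about 0.00011, roughly 1 chance in 9,000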
Table 1.1
Lot   Bolt 1 (smallest)   Bolt 2   Bolt 3   Bolt 4 (largest)
 1          9.93           10.04    10.05        10.09
 2         10.00           10.03    10.05        10.12
 3          9.94           10.06    10.09        10.10
 4          9.90            9.95    10.01        10.02
 5          9.89            9.93    10.03        10.06
 6          9.91           10.01    10.02        10.09
 7          9.89           10.01    10.04        10.09
 8          9.96            9.97    10.00        10.03
 9          9.98            9.99    10.05        10.11
10          9.93           10.02    10.10        10.11
In Table 1.1, we show measurements of thicknesses in centimeters of 40 bolts in ten lots of four each. Our goal is to produce bolts of thickness 10 centimeters. We have arranged the bolts in each of the ten lots from the smallest to the largest. In Figure 1.9, we display the data graphically. There does not appear to be any lot which is obviously worse than the rest.
[Figure 1.9. Thicknesses In Ten Lots Of Bolts: thickness versus lot number, with the four bolts in each lot shown individually.]
Table 1.2
Lot   Bolt 1 (smallest)   Bolt 2   Bolt 3   Bolt 4 (largest)
 1          9.93           10.04    10.05        10.09
 2         10.00           10.03    10.05        10.12
 3          9.94           10.06    10.09        10.10
 4          9.90            9.95    10.01        10.02
 5          9.89            9.93    10.03        10.06
 6          9.91           10.01    10.02        10.09
 7          9.89           10.01    10.04        10.09
 8          9.96            9.97    10.00        10.03
 9          9.98            9.99    10.05        10.11
10         10.43           10.52    10.60        10.61
On the other hand, in Table 1.2 and in Figure 1.10, we note the situation when .500 centimeters are added to each of the four observed measurements in lot 10. Simply by looking at Figure 1.10, most people would say that something about the process must have changed during the production of the tenth lot. This is an analog of the four black balls produced in the tenth lot in Figure 1.5.
[Figure 1.10. Thicknesses In Ten Lots: One Bad Lot. Thickness versus lot number, four bolts per lot.]
Table 1.3
Lot   Bolt 1 (smallest)   Bolt 2   Bolt 3   Bolt 4 (largest)
 1          9.93           10.04    10.05        10.09
 2         10.00           10.03    10.05        10.12
 3          9.94           10.06    10.09        10.10
 4          9.90            9.95    10.01        10.02
 5          9.89            9.93    10.03        10.06
 6          9.91           10.01    10.02        10.09
 7          9.89           10.01    10.04        10.09
 8          9.96            9.97    10.00        10.03
 9          9.98            9.99    10.05        10.11
10          9.93           10.02    10.10        10.61
In Table 1.3 and the corresponding Figure 1.11, we note the situation in which .500 centimeters have been added to only one of the measurements in lot 10. The situation here is not so clear as that in Figure 1.10. The one bad bolt might simply be occurring generally across the entire production process. One cannot be sure whether the process was degraded in lot 10 or not. We recall the situation in Figure 1.6, where there was only one black ball produced.
[Figure 1.11. Thicknesses In Ten Lots: One Bad Bolt. Thickness versus lot number, four bolts per lot.]
Table 1.4
Lot   Bolt 1 (smallest)   Bolt 2   Bolt 3   Bolt 4 (largest)
 1          9.93           10.04    10.05        10.59
 2         10.00           10.03    10.55        10.62
 3          9.94           10.06    10.09        10.60
 4          9.90            9.95    10.01        10.02
 5          9.89            9.93    10.03        10.56
 6          9.91           10.01    10.02        10.59
 7          9.89           10.01    10.04        10.59
 8         10.46           10.47    10.50        10.53
 9          9.98            9.99    10.05        10.61
10          9.93           10.02    10.10        10.61
Next, in Table 1.4 and the corresponding Figure 1.12, we observe a situation where .500 centimeters have been added to one bolt each in lots 1, 3, 5, 6, 7, 9, and 10, to two of the bolts in lot 2, and to four of the bolts in lot 8. Note the similarity to Figure 1.7, where there were so many black balls distributed across the lots that it seemed unlikely that we could be confident in saying the process was performing better one place than another: it was generally bad but, since it was equally bad everywhere, it could not be said to be “out of control.”
[Figure 1.12. Thicknesses: System Generally Variable. Thickness versus lot number, four bolts per lot.]
Table 1.5
Lot   Bolt 1 (smallest)   Bolt 2   Bolt 3   Bolt 4 (largest)
 1          9.93           10.04    10.05        10.09
 2         10.00           10.03    10.05        10.62
 3          9.94           10.06    10.09        10.60
 4          9.90            9.95    10.01        10.02
 5          9.89            9.93    10.03        10.06
 6          9.91           10.01    10.02        10.09
 7          9.89           10.01    10.04        10.59
 8          9.96            9.97    10.00        10.03
 9          9.98            9.99    10.05        10.11
10          9.93           10.02    10.10        10.61
In Table 1.5 and Figure 1.13, we show the situation where .500 centimeters have been added to one bolt each in lots 2,3,7, and 10. We note that unlike the situation in Figure 1.10, where we also had four bad bolts, but all in one lot, there seems to be little guide in Figure 1.13 as to any “Pareto glitch” in the production process.
[Figure 1.13. Thicknesses: Several Wild Bolts. Thickness versus lot number, four bolts per lot.]

The Run Chart. Figures 1.9 - 1.13 are rather crowded. We have elected to show all of the forty thickness measurements. If there were 10 thicknesses measured for each of the ten lots, then the figures would be very crowded indeed. We need to come up with one figure for each lot which gives us a measure of the average thickness for the lot. The most common measure is the sample mean. Returning to Table 1.1, let
us compute the sample mean for each lot. To obtain the average for lot 1, for example, we compute
\bar{x} = \frac{9.93 + 10.04 + 10.05 + 10.09}{4} = 10.028. \qquad (1.1)

Table 1.6
Lot   Bolt 1 (smallest)   Bolt 2   Bolt 3   Bolt 4 (largest)      x̄
 1          9.93           10.04    10.05        10.09         10.028
 2         10.00           10.03    10.05        10.12         10.050
 3          9.94           10.06    10.09        10.10         10.048
 4          9.90            9.95    10.01        10.02          9.970
 5          9.89            9.93    10.03        10.06          9.978
 6          9.91           10.01    10.02        10.09         10.008
 7          9.89           10.01    10.04        10.09         10.008
 8          9.96            9.97    10.00        10.03          9.990
 9          9.98            9.99    10.05        10.11         10.032
10          9.93           10.02    10.10        10.11         10.040
We show the run chart of means of the ten lots of bolts from Table 1.1 (Figure 1.9) in Figure 1.14.
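A sketch of the underlying computation in Python (our own illustration; the thicknesses are those of Table 1.1, and the printed values reproduce the x̄ column of Table 1.6 up to rounding) might look as follows; any plotting tool can then turn the means into the picture of Figure 1.14:

# Bolt thicknesses from Table 1.1, one row per lot (centimeters).
table_1_1 = [
    [9.93, 10.04, 10.05, 10.09], [10.00, 10.03, 10.05, 10.12],
    [9.94, 10.06, 10.09, 10.10], [9.90, 9.95, 10.01, 10.02],
    [9.89, 9.93, 10.03, 10.06], [9.91, 10.01, 10.02, 10.09],
    [9.89, 10.01, 10.04, 10.09], [9.96, 9.97, 10.00, 10.03],
    [9.98, 9.99, 10.05, 10.11], [9.93, 10.02, 10.10, 10.11],
]

# The run chart of means plots each lot's average thickness against the lot number.
lot_means = [sum(lot) / len(lot) for lot in table_1_1]
for lot_number, m in enumerate(lot_means, start=1):
    print(f"lot {lot_number:2d}: mean thickness = {m:.3f}")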
[Figure 1.14. Run Chart of Means: average thickness versus lot number.]
[Figure 1.15. Thicknesses In Ten Lots Of Bolts: run chart of the lot means for the data of Table 1.2.]

We now compare the situation in Figure 1.14 with the run chart in Figure 1.15, based on the data from Table 1.2 graphically displayed in the corresponding Figure 1.10. We note that, simply by using the run chart of means, it is clear that the production process changed dramatically in run 10 of Figure 1.15. In contrast, there does not appear to be any indication of such a dramatic change during any of the runs in Figure 1.14. The concept of the run chart is quite intuitive and predates the modern science of statistical process control by some decades. In many situations, a run chart of lot means will point out the “Pareto glitch” as surely as formal SPC charts. This simple chart should be viewed as the prototype of the more sophisticated techniques of statistical process control. If a firm does not employ anything more complex than run charting, it will be far better off than a firm that does no statistical process control at all. Still there are situations where simply tracking the sample means from lot to lot is insufficient to tell us whether a “Pareto glitch” has occurred. For example, let us reconsider the data in Figure 1.10. If we subtract 2 centimeters from the lowest measurement in each lot, 1 centimeter from the next lowest measurement in each lot, then add 2 centimeters to the highest measurement in each lot, and 1 centimeter to the second highest measurement in each lot, then the lot averages do not change from those in Figure 1.15. On the other hand, if we plot this data, as we do in Figure 1.16, it is no longer so clear that the production process has changed in the tenth lot.
[Figure 1.16. Thicknesses In Ten Lots Of Bolts: thickness versus lot number, four bolts per lot.]
So the two data sets, that in Figure 1.10 and that in Figure 1.16, have the same running mean charts, but the data in Figure 1.10 showed a production change for run 10, whereas the data in Figure 1.16 did not. The mean chart does not always tell the whole story.
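The claim is easy to check directly. The sketch below (our own illustration in Python) applies the described perturbation, minus 2 and minus 1 centimeters to the two smallest measurements and plus 1 and plus 2 to the two largest, to the Table 1.2 data, and confirms that every lot mean is untouched while the within-lot spread balloons:

from statistics import mean, stdev

# Table 1.2 data: lots 1-9 as in Table 1.1, lot 10 shifted upward by 0.500 cm.
table_1_2 = [
    [9.93, 10.04, 10.05, 10.09], [10.00, 10.03, 10.05, 10.12],
    [9.94, 10.06, 10.09, 10.10], [9.90, 9.95, 10.01, 10.02],
    [9.89, 9.93, 10.03, 10.06], [9.91, 10.01, 10.02, 10.09],
    [9.89, 10.01, 10.04, 10.09], [9.96, 9.97, 10.00, 10.03],
    [9.98, 9.99, 10.05, 10.11], [10.43, 10.52, 10.60, 10.61],
]

# Shifts applied to each sorted lot: they sum to zero, so the lot mean cannot change.
shifts = [-2.0, -1.0, 1.0, 2.0]
perturbed = [[x + d for x, d in zip(sorted(lot), shifts)] for lot in table_1_2]

for k, (before, after) in enumerate(zip(table_1_2, perturbed), start=1):
    print(f"lot {k:2d}: mean {mean(before):.3f} -> {mean(after):.3f},"
          f" s {stdev(before):.3f} -> {stdev(after):.3f}")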
Table 1.7
Lot   Bolt 1 (smallest)   Bolt 2   Bolt 3   Bolt 4 (largest)      x̄
 1          9.93           10.04    10.05        10.09         10.028
 2         10.00           10.03    10.05        10.12         10.050
 3          9.94           10.06    10.09        10.10         10.048
 4          9.90            9.95    10.01        10.02          9.970
 5          9.89            9.93    10.03        10.06          9.978
 6          9.91           10.01    10.02        10.09         10.008
 7          9.89           10.01    10.04        10.09         10.008
 8          9.96            9.97    10.00        10.03          9.990
 9          9.98            9.99    10.05        10.11         10.032
10          7.93            9.02    11.10        12.11         10.040
[Figure 1.17. Ten Lots: Variable Thicknesses. Thickness versus lot number, four bolts per lot.]
To give another example where the chart of running means fails to tell us everything, let us return to the data set of Table 1.1. We modify, in Table 1.7, the data for lot 10 by subtracting 2 centimeters from the smallest measurement, 1 centimeter from the second smallest, and then adding 2 centimeters to the largest measurement and 1 centimeter to the second largest. We note that the mean for lot 10 has not changed. However, as we observe the graph of Table 1.7 in Figure 1.17, the variability of the measurements in lot 10 is such that we would be inclined to conjecture that a change in the production process has taken place. Once again, in our example, we have referred to an industrial context. It is clear, however, that the same charts could be obtained (except for the units used) if lots consisted of 4 similar stores in similar locations and weekly sales volumes were measured for each store, or if 4 similar offices of a travel agency were recorded as to their monthly output of a given type, etc.
1.9 The Basic Paradigm of Statistical Process Control
We see that in addition to the average thickness in each lot of bolts, we need also to pay attention to the variability of the production process. A major task in statistical process control is to seek lots which exhibit
behavior which is significantly different from those of the other lots. Once we find such lots, we will then be able to investigate what caused them to deviate from the norm. And it must be pointed out that our notion of “norm” in process control is not some predetermined standard, but rather the measurements associated with the great bulk of the items being produced. Sometimes we will not be trying to “fix” an imperfection of the production process at a given time. For example, we may find that a lot exhibits tensile strength 20% above that of the other lots. In such an instance, we will be trying to find out why the nonstandard lot was so much better than the norm. This is the basic paradigm by which we proceed in SPC:
(1) Find a Pareto glitch (a nonstandard lot);
(2) Discover the causes of the glitch;
(3) Use this information to improve the production process.
This basic paradigm is as effective as it is simple. Most people who hear it explained to them for the first time feel that it is simply too good to be true. The paradigm of SPC works and the fact that it works makes for one of the most cost effective pathways to excellence in modern industrial production. We note again how little statistical process control really has in common with the end inspection schemes of quality assurance. Both paradigms use similar statistical tools. However, the goal of quality assurance is to remove bad items at the end of production, before they get to the customer. In order to be really effective, QA requires 100% inspection. And the effectiveness simply consists in removing the results of flawed procedures, not in correcting them. It is this fundamental difference in the philosophies of quality assurance and quality improvement (i.e., statistical process control) that needs to be grasped if one is to institute modern quality control. The quality control teams in all too many enterprises have the false notion that they are on the cutting edge of QC technique, when, in actuality, they are little more advanced than the industrial commissars of Stakhanov’s day.
1.10 Basic Statistical Procedures in Statistical Process Control
We have seen that the notion of variability is key in our search for Pareto glitches. We have yet to give concrete procedures for dealing with variability. We do so now. There is an obvious rule for deciding which lots to investigate. We could simply investigate them all. Naturally, this is neither practical nor desirable. Recall furthermore that we are not dealing with issues which are as clear as finding black balls when we are attempting to produce white ones. We will decide about whether a lot is unusual or not based on measurements. This means that sometimes we will incorrectly pick a lot as indicating a divergence from the usual production process when, in fact, it really represents no such divergence. In other words, we must face the prospect of “false alarms.” One standard is to pick a rule for declaring a lot to be out of the norm so that the chance of a “false alarm” is roughly one in five hundred. This generally keeps the number of lots whose production is to be investigated within manageable bounds and still gives us the kind of stepwise progression to excellence, which is the pay-off of an effective SPC strategy. Postponing going into mathematical details to Chapter 3, we develop below the essential rules for carrying out statistical process control. We return to the data in Table 1.2 (and Figure 1.10).
Table 1.8
Lot   Bolt 1 (smallest)   Bolt 2   Bolt 3   Bolt 4 (largest)      x̄        s
 1          9.93           10.04    10.05        10.09         10.028    .068
 2         10.00           10.03    10.05        10.12         10.050    .051
 3          9.94           10.06    10.09        10.10         10.048    .074
 4          9.90            9.95    10.01        10.02          9.970    .056
 5          9.89            9.93    10.03        10.06          9.978    .081
 6          9.91           10.01    10.02        10.09         10.008    .074
 7          9.89           10.01    10.04        10.09         10.008    .085
 8          9.96            9.97    10.00        10.03          9.990    .032
 9          9.98            9.99    10.05        10.11         10.032    .060
10         10.43           10.52    10.60        10.61         10.540    .084
We have already explained how the sample mean x̄ is computed. For lot 10, it is

\bar{x} = \frac{10.43 + 10.52 + 10.60 + 10.61}{4}. \qquad (1.2)

Now in order to find an unusual lot, we must find the average for all lots via

\bar{\bar{x}} = \frac{10.028 + 10.05 + \cdots + 10.54}{10}. \qquad (1.3)
The standard deviation, s, is a bit more complicated to obtain. It is a measure of the spread of the observations in a lot about the sample mean. To compute it, we first obtain its square, the sample variance

s^2 = \frac{(10.43 - 10.54)^2 + \cdots + (10.61 - 10.54)^2}{3} = .007056. \qquad (1.4)

(We observe that, whenever we are computing the sample variance, we divide by one less than the number of observations in the lot.) The sample standard deviation is then easily obtained via

s = \sqrt{s^2} = .084. \qquad (1.5)

We need to obtain a value of s which is approximately the average over all the sampled lots. We obtain this value via

\bar{s} = \frac{.068 + .051 + \cdots + .084}{10} = .0665. \qquad (1.6)
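For readers who want to see the arithmetic spelled out, here is a minimal Python sketch of these computations for lot 10 of Table 1.8 (Python is our own choice of vehicle; the numbers are those of the text):

from math import sqrt

lot_10 = [10.43, 10.52, 10.60, 10.61]

x_bar = sum(lot_10) / len(lot_10)                             # sample mean, 10.540

# Sample variance: divide by one less than the number of observations in the lot.
s_squared = sum((x - x_bar) ** 2 for x in lot_10) / (len(lot_10) - 1)
s = sqrt(s_squared)                                           # about .084

# Average of the ten lot standard deviations from Table 1.8.
lot_sds = [.068, .051, .074, .056, .081, .074, .085, .032, .060, .084]
s_bar = sum(lot_sds) / len(lot_sds)                           # .0665

print(round(x_bar, 3), round(s, 3), round(s_bar, 4))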
Now let us recall that our goal is to find a rule for declaring sample values to be untypical if we incorrectly declare a typical value to be untypical one time in 500. For the data in Table 1.2, we will generally achieve roughly this goal if we decide to accept all values between 10.0652 + 1.628(.0665) = 10.173 and 10.0652 − 1.628(.0665) = 9.957. In Figure 1.18, we show the lot means together with the upper and lower control limits. It is seen how lot 10 fails to fall within the limits and hence we say that the production system for this lot is out of control. Thus we should go back to the production time record and see whether we can find out the reason for lot 10 to have gone nonstandard. We note from Table 1.9 below the multiplication factors used in the above computation of the limits. We see that the factor 1.628 comes from the A_3 column for a lot size of 4. More generally, the acceptance interval on the mean is given by

\bar{\bar{x}} \pm A_3(n)\,\bar{s}. \qquad (1.7)
[Figure 1.18. Mean Control Chart: lot averages versus lot number, with the upper and lower control limits.]

This is generally the most important of the quality control charts.

Table 1.9
  n    A_3(n)   B_3(n)   B_4(n)
  2    2.659    0.000    3.267
  3    1.954    0.000    2.568
  4    1.628    0.000    2.266
  5    1.427    0.000    2.089
  6    1.287     .030    1.970
  7    1.182     .118    1.882
  8    1.099     .185    1.815
  9    1.032     .239    1.761
 10     .975     .284    1.716
 15     .789     .428    1.572
 20     .680     .510    1.490
 25     .606     .565    1.435
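The whole mean-chart computation fits in a few lines of Python. The sketch below is our own illustration; it uses the lot summaries of Table 1.8 and the A_3 factor for lots of size 4 from Table 1.9, and flags exactly the lot that Figure 1.18 shows outside the limits:

lot_means = [10.028, 10.050, 10.048, 9.970, 9.978, 10.008, 10.008, 9.990, 10.032, 10.540]
lot_sds = [.068, .051, .074, .056, .081, .074, .085, .032, .060, .084]

x_double_bar = sum(lot_means) / len(lot_means)   # grand mean, about 10.065
s_bar = sum(lot_sds) / len(lot_sds)              # average standard deviation, .0665
A3 = 1.628                                       # Table 1.9 factor for lots of size n = 4

ucl = x_double_bar + A3 * s_bar                  # about 10.173
lcl = x_double_bar - A3 * s_bar                  # about 9.957

for k, m in enumerate(lot_means, start=1):
    flag = "  <-- out of control" if not (lcl <= m <= ucl) else ""
    print(f"lot {k:2d}: mean {m:.3f}{flag}")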
It is not always the case that the mean control chart will detect a system out of control. Let us consider, for example, the data in Table 1.7.
Table 1.7 (enhanced)
Lot   Bolt 1 (smallest)   Bolt 2   Bolt 3   Bolt 4 (largest)      x̄        s
 1          9.93           10.04    10.05        10.09         10.028     .068
 2         10.00           10.03    10.05        10.12         10.050     .051
 3          9.94           10.06    10.09        10.10         10.048     .074
 4          9.90            9.95    10.01        10.02          9.970     .056
 5          9.89            9.93    10.03        10.06          9.978     .081
 6          9.91           10.01    10.02        10.09         10.008     .074
 7          9.89           10.01    10.04        10.09         10.008     .085
 8          9.96            9.97    10.00        10.03          9.990     .032
 9          9.98            9.99    10.05        10.11         10.032     .060
10          7.93            9.02    11.10        12.11         10.040    1.906
We compute the average mean and standard deviation across the table via

\bar{\bar{x}} = \frac{10.028 + 10.05 + \cdots + 10.04}{10} = 10.015 \qquad (1.8)

and

\bar{s} = \frac{.068 + .051 + \cdots + 1.906}{10} = .249 \qquad (1.9)
respectively. The control limits on the mean are given by LCL = 10.015 − 1.628(.249) = 9.610 and UCL = 10.015 + 1.628(.249) = 10.420. Plotting the means and the mean control limits in Figure 1.19, we note that no mean appears to be out of control. We recall from Figure 1.17 that there is a clear glitch in lot 10, which does not show up in the mean control chart in Figure 1.19. We need to develop a control chart which is sensitive to variability in the standard deviation. The control limits on the standard deviation are given by LCL = 0(.249) = 0.0 and UCL = 2.266(.249) = .564.
[Figure 1.19. Mean Control Chart: lot averages versus lot number, with the upper and lower control limits.]
As we note in Figure 1.20, the standard deviation in lot 10 is clearly out of control. More generally, the lower control limit for the standard deviation is given by
Lower Control Limit = B_3(n)\,\bar{s} \qquad (1.10)

and the upper control limit is given by

Upper Control Limit = B_4(n)\,\bar{s}. \qquad (1.11)
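A companion sketch in Python for the standard deviation chart (again our own illustration), applied to the lot standard deviations of Table 1.7 with the B_3 and B_4 factors for n = 4 taken from Table 1.9:

lot_sds = [.068, .051, .074, .056, .081, .074, .085, .032, .060, 1.906]

s_bar = sum(lot_sds) / len(lot_sds)   # about .249
B3, B4 = 0.000, 2.266                 # Table 1.9 factors for lots of size n = 4

lcl = B3 * s_bar                      # 0.0
ucl = B4 * s_bar                      # about .564

for k, s in enumerate(lot_sds, start=1):
    flag = "  <-- out of control" if not (lcl <= s <= ucl) else ""
    print(f"lot {k:2d}: s = {s:.3f}{flag}")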
[Figure 1.20. Standard Deviation Control Chart: lot standard deviations versus lot number, with the upper and lower control limits.]
1.11 Acceptance Sampling
Although it is generally a bad idea to base a quality control program on an acceptance-rejection criterion, in the most primitive stages of quality control such a strategy may be required. Unfortunately, starting with World War II, a preponderance of quality control in the United States has been based on acceptance-rejection. The idea in applying statistical process control to an acceptance-rejection scenario is that, instead of looking at measurements, we simply look at the variability in the proportion of items defective. Let us return to the data in Table 1.2. Suppose that an item is considered defective if it is smaller than 9.92 or greater than 10.08. In order to apply a statistical process control procedure to data which is simply characterized as defective or not, we note the proportion of defectives in each lot. The overall proportion of defectives is our key statistic. For example, in Table 1.2, the overall proportion is given simply by

\hat{\hat{p}} = \frac{.25 + .25 + .50 + \cdots + 1.00}{10} = .375. \qquad (1.12)
40 chapter 1. statistical process control: a brief overview
Table 1.2 (enhanced)
Lot   Bolt 1       Bolt 2   Bolt 3   Bolt 4        p
      (smallest)                     (largest)
 1     9.93        10.04    10.05    10.09        .25
 2    10.00        10.03    10.05    10.12        .25
 3     9.94        10.06    10.09    10.10        .50
 4     9.90         9.95    10.01    10.02        .25
 5     9.89         9.93    10.03    10.06        .25
 6     9.91        10.01    10.02    10.09        .50
 7     9.89        10.01    10.04    10.09        .50
 8     9.96         9.97    10.00    10.03        0.00
 9     9.98         9.99    10.05    10.11        .25
10    10.43        10.52    10.60    10.61        1.00
To obtain the control limits, we use

LCL = \hat{\hat{p}} - 3\sqrt{\frac{\hat{\hat{p}}(1-\hat{\hat{p}})}{n}}    (1.13)

and

UCL = \hat{\hat{p}} + 3\sqrt{\frac{\hat{\hat{p}}(1-\hat{\hat{p}})}{n}},    (1.14)

respectively. It is crucial to remember that n is the size of the lot being tested for typicality. That means that here n = 4. Naturally, the approximation to normality for such a small sample size is very unreliable. Nevertheless, if we perform the standard normal theory computation, we find the following values for the two control limits.
UCL = .375 + 3\sqrt{\frac{.375(1-.375)}{4}} = 1.101,    (1.15)

and

LCL = .375 - 3\sqrt{\frac{.375(1-.375)}{4}} = -.3511.    (1.16)
Thus there simply is no way to reject a lot as untypical, since the proportion of failures must always be between 0 and 1. But what if we use the precise value obtained from the binomial distribution itself? We recall that, if X is the number of failed items in a lot,

P(X) = \binom{n}{X} p^X (1-p)^{n-X}.    (1.17)
In the current example, the probability that all four items fail if p = .375 is almost .02. It might be appropriate under the circumstances to accept an increased possibility of false alarms (declaring the lot untypical when it is, in fact, typical) of .02. Such compromises are frequently necessary when using rejection data. In such a case, we might wish to consider the tenth lot as being “out of control.” The point to be made here is that the use of failure data is a very blunt instrument when compared to the use of measurement data. This is particularly the case when, as frequently happens, the lot sizes are very small.
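As a hedged illustration of this point (a sketch of ours, not part of the text), the following Python lines reproduce both the normal-theory p-chart limits for lots of size 4 and the exact binomial probability that an entire lot fails.

```python
from math import comb, sqrt

p_hat_hat = 0.375    # overall proportion defective from Table 1.2
n = 4                # size of each lot

ucl = p_hat_hat + 3 * sqrt(p_hat_hat * (1 - p_hat_hat) / n)   # about  1.101
lcl = p_hat_hat - 3 * sqrt(p_hat_hat * (1 - p_hat_hat) / n)   # about -0.351
# Both limits lie outside [0, 1], so no lot proportion can ever cross them.

# Exact binomial alternative: probability that all 4 items in a lot fail
# when the underlying defective rate is .375.
p_all_fail = comb(4, 4) * p_hat_hat**4 * (1 - p_hat_hat)**0
print(ucl, lcl, p_all_fail)    # the last value is about 0.0198, i.e., almost .02
```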
1.12  The Case for Understanding Variation
Let us summarize briefly what has been said so far. First of all, one has to agree that managing by focusing on end results or on end inspection has nothing to do with quality improvement. As one outstanding SPC consultant has succinctly put it: such management is like driving along the road by watching the white line in the rear-view mirror. Instead, one has to look upstream, that is, to shift the focus to the source of quality, this source being the design of the product and of the processes which lead to the product. Indeed, there is no other way to influence quality. So understood, quality improvement has to be arranged in an orderly and economically sound way. Not incidentally, the very first guide on the subject, written by Walter Shewhart and still up-to-date although published in 1931, bears the title Economic Control of Quality of Manufactured Product. But then, this amounts to just one more reason that it must be the responsibility of the organization's management, from its CEOs to the foremen on the shop floor, to implement both the overall strategy and the everyday activity for quality improvement. More than that, it is in particular necessary for the CEOs to better their understanding of what process control is and to improve the ways they manage the organization. We shall dwell more on these last issues in the sequel. Having thus agreed that the strategy for continuous improvement should encompass all of the organization and that it has to form an integrated and self-consistent system, one has to decide what the real core of the strategy to be implemented is. We have argued in the preceding sections that it is a wisely arranged reduction of variability or variation within processes and within the organization as a whole. Variation within any process, let alone a system of processes, is inevitable. It can be small, seemingly negligible, or large; it can be easy to
measure or not, but it is always there. If it is large, it adds to the complexity and inefficiency of a system. Indeed, the larger the variability of subprocesses, the more formidable a task it becomes to combine their outputs into one product (or satisfactory service). Under such circumstances, it is particularly transparent that managing by results is a hindrance on the way to improvement, as Deming's Fourteen Points amply demonstrate. On the other hand, reduction of variability within subprocesses, i.e., looking upstream and working on improvement of the sources of quality, does not lose sight of the aim of the whole system. On the contrary, it helps achieve the overall aim and leads to a reduction of the system's complexity. In turn, one has to address the issue of how to arrange for a wise reduction in the variation of a process. Following Walter Shewhart, we have argued that there are two qualitatively different sources of variability: common cause variation and special cause variation. As regards the latter, it is this cause of variation which leads to Pareto glitches. We have already discussed at some length how to detect them and how to react to them, in particular in Sections 1.9 and 1.10. Pareto glitches, also called signals, make the process unpredictable. We often say that the process is then out of statistical control due to assignable causes. The latter term reflects our intent: we want these causes to be assignable indeed, for we want to find and eliminate them in order to improve the process (fortunately, they can most often be found and hence removed). As we already know, a process which has been brought to stability or, in other words, to a state of statistical control, is not subject to special cause variation. It is subject only to common cause or inherent variation, which is always present and cannot be reduced unless the process itself is changed in some way. Such a process is predictable, although it does not have to perform satisfactorily. For instance, both the white ball production as depicted in Figure 1.7 and the bolt production as observed in Table 1.4 and Figure 1.12 are in control processes. Both processes are next to disastrous, but they are such consistently and stably. In each of these two situations, we are faced with a highly variable but consistent production process with no sudden change in its performance. It is possible, e.g., that the bolts are turned on an old and faulty lathe. In the service industry or, e.g., in processing invoices in a company, high common cause variation is often due to the inadequacy of the adopted procedures. In any case, if no signal due to a special cause is revealed when running a process, it is wrong to treat a particular measurement or lot as a signal and waste time and effort on finding its alleged cause. In the situations
mentioned, proper action would be to change the process itself, i.e., the lathe or, respectively, the adopted procedures. Another example of an in control process, which exhibits only natural or inherent variability, seems to be provided by the data from Table 1.1, shown in Figure 1.9 and summarized into the corresponding run chart of means in Figure 1.14 (we urge the Reader to calculate control limits for both the mean and standard deviation control charts for these data to see if the process is indeed stable). This time, the process variation is much smaller than in the case of the data from Table 1.4, which is not to say that further improvement of the process would not be welcome. Assuming that the upper control limit for lot means for the data from Table 1.1 lies above the value of 10.05 (as it most likely does), it would be wrong to treat means larger than, say, 10.04 as atypical signals. However mistaken, a practice of this sort is surprisingly popular among managements. One can hardly deny that executives like to set goals such as “the lot mean thickness of bolts should not be larger than 10.04” (or, more seriously, that the drop in monthly sales should not exceed a%, the amount of inventory should not exceed b units, monthly late payments should amount to less than c%, off-budget expenditures should be less than d%, etc.) with no regard whatsoever to the common cause variation of the process in question. Worst of all is the fact that lots 2 and 3 in Figure 1.14, if flagged as atypical signals, will not only lead to a waste of energy on finding an alleged special cause, but such a cause will sooner or later be declared found. As a rule, this misguided policy brings a result opposite to that intended: after removal of the alleged special cause (which often amounts to policing the workforce), common cause variation becomes larger. Reacting to common causes as if they were special ones is known as tampering or the error of the first kind, in contrast to the error of the second kind, which consists in disregarding signals and their special causes, and treating the latter as if they were common causes. All in all, control charts have been found to be an excellent and in fact irreplaceable means of determining whether a process is in control and of guiding the action of bringing a process to statistical control when needed. If stable, the process reveals only common cause variation, measured by process capability (see Section 3.8), whose reduction requires improvement of the process itself. As is now clearly seen, the two actions, one of bringing a process into stability and another of improving a stable process, are qualitatively very different. When summarizing in Section 1.5 the basic paradigm for quality, as
perceived already by Whitney and Ford, we have combined the two actions mentioned into one whole. Upon closer inspection, one can note that the first two points given there refer to a process at the design stage and the third to that process at the production stage. Actually, while the third point refers to improving an out of control process, the first two can be claimed to pertain to a stable one which is to be redesigned to bring further improvement. The so-called Plan - Do - Study - Act (PDSA) cycle, developed by Shewhart and later refined by Deming, provides a particularly elegant prescription for improvement of a stable process (see Figure 1.21). In order to make a change, one has first to Plan it. It is recommended that such a plan be based, if possible, on a mathematical model of the process under scrutiny. Some optimization techniques, in particular for regression models, as well as issues of design of experiments, are treated in Chapter 6. After the change has been planned, Do it, if possible on a small scale. The next step is to Study the effects of the change. Here one has to remember that conclusions concerning the effects can be drawn only after the process in question has again reached a stable state. Finally, one can, and should, Act accordingly: adopt the change if it has proved successful, re-run the pilot study under different conditions, or abandon the change and try some other. While it sounds simple, the methodology presented is not only most powerful but the only reasonable one. In fact, it is a basis for the spiral of continuous improvement. At each level of an organization, it is to be used repeatedly, as one of the two core steps in an iterative process of quality improvement. Each new process has to be brought to statistical control, and this is one of the two core steps, to be followed by the implementation of the PDSA cycle, which is again to be followed by bringing the improved process to stability. It is often said that the spiral described is obtained by successive turns of the quality wheel. (The Shewhart-Deming cycle is sometimes abbreviated as PDCA, with Check replacing the Study term.)
Figure 1.21. Shewhart-Deming PDSA Cycle (Plan, Do, Study, Act).
Let us conclude this Section by emphasizing once again that without proper understanding of variability one's efforts to improve quality are doomed to failure (even incredible luck can help one once or twice, but this luck cannot change the overall outcome for a company). Generally, without such understanding, one is doomed either to tampering or to committing errors of the second kind. We are all sinful and all like reports on sales, inventory, productivity, stocks, etc., if not in the form of monthly reports for the company's CEOs, then in the press. Such reports give numbers for the current month and the same month one year ago, and, if we are lucky, the previous month. They are next to useless, harmless if we read them for the sake of curiosity only, but counterproductive if they are a basis for business decisions. The fact of the matter is that numbers, seen apart from their context, are meaningless. In the situations hinted at, such context is provided by the time series of data from consecutive months. It is only because the CEO keeps past data, more or less accurately, in memory that his or her judgments as to whether a particular number reflects common or special cause variation are not patently wrong. And let us add one more remark: it is not only that the data should be seen in their context as a time series, but also that the data investigated should mostly refer to processes, not to end results.
1.13  Statistical Coda
Our basic strategy in the detection of Pareto glitches is to attempt to answer the following basic question: Do the items in a given lot come from the same statistical distribution as the items in a previous string of “typical lots”? In order to answer this question, it is possible to use a variety of very powerful mathematical tools, including nonparametric density estimation. However, in SPC, we must always be aware of the fact that it is not essential that we detect every “nontypical” (and hence “out of control”) lot. Consequently, we will frequently find it satisfactory to answer the much simpler question: Is it plausible to believe that the sample mean of the items in a lot is consistent with the average of sample means in the previous string of lots? Now, we recall from the Central Limit Theorem (see Appendix B) that for n large

\frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \approx Z    (1.18)
where Z is a normal random variable with mean 0 and variance 1. Z will lie between 3 and −3 roughly 99.8% of the time. Generally speaking, if
the number of previous lots is lengthy, say 25 or more, the average of the averages of those sample lots, x̄, will give us an excellent estimate of μ. The average s̄ of the sample standard deviations from the previous lots is a good estimate of σ when multiplied by an unbiasing factor a(n) (see (3.36)). So, a natural “in control” interval for x̄ is given by

LCL(\bar{x}) = \bar{x} - 3a(n)\frac{\bar{s}}{\sqrt{n}} \le \bar{x} \le \bar{x} + 3a(n)\frac{\bar{s}}{\sqrt{n}} = UCL(\bar{x}).    (1.19)

The multiplication of the unbiasing term a(n) by 3 and the division by \sqrt{n} gives as the acceptable interval (see Table 1.9)

\bar{x} \pm A_3(n)\bar{s}.    (1.20)
Similarly, a sample standard deviation also has the property that, if n is large,

\frac{s - E(s)}{\mbox{standard deviation of } s} \approx Z    (1.21)

where Z is a normal random variable with mean 0 and variance 1. In (3.58), we show that this enables us to find an interval in which s should fall roughly 99.8% of the time if the items in the lot have the same underlying variance as those in the string of lots from which we estimate s̄. Referring to Table 1.9, this yields us as the “in control” interval for a new lot s where the lot size is n:

LCL(s) = B_3(n)\bar{s} \le s \le B_4(n)\bar{s} = UCL(s).    (1.22)
Finally, for acceptance-rejection data, the Central Limit Theorem also gives us that, for large lot size n, the proportion of defectives in a lot, \hat{p}, satisfies

\frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}} \approx Z    (1.23)

where Z is a normal random variable with mean 0 and variance 1. If the proportion of defectives has been estimated in a lengthy string of prior lots, then their average \hat{\hat{p}} is a good estimate for p. So we have

LCL(\hat{p}) = \hat{\hat{p}} - 3\sqrt{\frac{\hat{\hat{p}}(1-\hat{\hat{p}})}{n}} \le \hat{p} \le \hat{\hat{p}} + 3\sqrt{\frac{\hat{\hat{p}}(1-\hat{\hat{p}})}{n}} = UCL(\hat{p}).    (1.24)
We should never lose sight of the fact that it is the large number of lots in the string prior to the present one which enables us to claim that
x̄ is a good estimate of μ, a(n)s̄ is a good estimate of σ, and \hat{\hat{p}} is a good estimate of p. But the n in the formulae for the control limits is always the size of an individual lot. Consequently, the normal approximation may be far from accurate in determining the control limits to give a one chance in 500 of rejecting an “in control” lot. When we use the control limits indicated above, we expect to reject an in control item at a proportion different from the nominal 0.2% probability which would apply if the CLT were in full force. This is particularly the case in the acceptance-rejection situation. Nevertheless, the plus or minus 3 sigma rule generally serves us very well. Our task is to find lots which are untypical, so that we can backtrack for possible flaws in the system and correct them. To achieve this task, for one dimensional testing, the old plus or minus 3 sigma rule generally is extremely effective.
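The warning about the normal approximation can be made concrete with a short sketch (ours, under the stated assumptions, not part of the text): for acceptance-rejection data the exact false alarm rate of the plus or minus 3 sigma limits is easy to compute from the binomial distribution, and it can be far from the nominal 0.2%, especially for small lots.

```python
from math import comb, sqrt

def false_alarm_rate(n, p):
    """Exact probability that an in-control lot of size n falls outside
    the 3-sigma limits for the proportion defective."""
    half_width = 3 * sqrt(p * (1 - p) / n)
    ucl, lcl = p + half_width, max(0.0, p - half_width)
    return sum(comb(n, x) * p**x * (1 - p)**(n - x)
               for x in range(n + 1)
               if x / n > ucl or x / n < lcl)

for n in (4, 25, 100, 1000):
    print(n, false_alarm_rate(n, 0.05))
# For n = 4 the rate is roughly .014, seven times the nominal .002;
# for n = 1000 it is much closer to the nominal value.
```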
References

[1] Deming, W. E. (1982). Quality, Productivity and Competitive Position. Center for Advanced Engineering Studies, pp. 16-17.
[2] Falcon, W. D. (1965). Zero Defects: Doing It Right the First Time, American Management Association, Inc.
[3] Ford, H. (1926). My Life and Work. Sydney: Cornstalk Press, p. 273.
[4] Mann, N. R. (1985). The Keys to Excellence: the Story of the Deming Philosophy, Los Angeles: Prestwick Books.
[5] Sorin, Ya. (1963). On Quality and Reliability: Notes for Labor Union Activists. Moscow: Profizdat, p. 120.
[6] Thompson, J. R. (1985). “American quality control: what went wrong? What can we do to fix it?” Proceedings of the 1985 Conference on Applied Analysis in Aerospace, Industry and Medical Sciences, Chhikara, Raj, ed., Houston: University of Houston, pp. 247-255.
[7] Todt, H. C. “Employee motivation: fact or fiction,” in Zero Defects: Doing It Right the First Time, Falcon, William, ed., New York: American Management Association, Inc., pp. 3-9.
Problems

Problem 1.1. The following are the data from 24 lots of size 5.

Lot   measurements                       Lot   measurements
 1    995, 997, 1002, 995, 1000          13    1002, 1004, 999, 996, 1000
 2    990, 1002, 997, 1003, 1005         14    1003, 1000, 996, 1000, 1005
 3    1003, 1005, 998, 1004, 995         15    996, 1001, 1006, 1001, 1007
 4    1002, 999, 1003, 995, 1001         16    995, 1003, 1004, 1006, 1008
 5    1001, 996, 999, 1006, 1001         17    1006, 1005, 1006, 1009, 1008
 6    1004, 1001, 998, 1004, 997         18    996, 999, 1001, 1003, 996
 7    1003, 1002, 999, 1003, 1004        19    1001, 1004, 995, 1001, 1003
 8    1001, 1007, 1006, 999, 998         20    1003, 996, 1002, 991, 996
 9    999, 995, 994, 991, 996            21    1004, 991, 993, 997, 1003
10    994, 993, 991, 993, 996            22    1003, 997, 998, 1000, 1001
11    994, 996, 995, 994, 991            23    1006, 1001, 999, 996, 997
12    994, 996, 998, 999, 1001           24    1005, 1000, 1001, 998, 1001
Determine whether the system is in control. Problem 1.2. The following are the data from 19 lots of size 5. Lot 1 2 3 4 5 6 7 8 9 10
831, 829, 838, 844, 826, 841, 816, 841, 831, 830,
measurements 839, 831, 833, 836, 826, 840, 833, 831, 831, 827, 831, 838, 834, 831, 831, 831, 832, 831, 836, 826, 822, 832, 829, 828, 833, 833, 831, 838, 835, 830,
820 831 831 826 831 833 831 828 835 834
Lot 11 12 13 14 15 16 17 18 19
832, 834, 825, 819, 842, 832, 827, 838, 841,
measurements 836, 825, 828, 836, 833, 813, 850, 831, 831, 819, 844, 830, 835, 830, 825, 831, 834, 831, 831, 832, 828, 830, 822, 835, 832, 829, 828,
832 819 832 832 839 833 826 830 828
Determine whether the system is in control.

Problem 1.3. The following are the sample means and sample standard deviations of 24 lots of size 5.

Lot     X̄        s       Lot     X̄        s
 1    90.008   0.040      13    90.007   0.032
 2    90.031   0.062      14    89.996   0.021
 3    89.971   0.060      15    89.546   0.066
 4    90.002   0.047      16    89.627   0.056
 5    89.982   0.038      17    89.875   0.042
 6    89.992   0.031      18    89.800   0.053
 7    89.968   0.026      19    89.925   0.054
 8    90.004   0.057      20    90.060   0.047
 9    90.032   0.019      21    89.999   0.058
10    90.057   0.022      22    90.068   0.032
11    90.030   0.058      23    90.042   0.041
12    90.062   0.061      24    90.073   0.026
Determine whether the system is in control.

Problem 1.4. The following are the sample means and sample standard deviations of 20 lots of size 5.

Lot     X̄        s       Lot     X̄        s
 1    146.21   0.12       11    146.08   0.11
 2    146.18   0.09       12    146.12   0.12
 3    146.22   0.13       13    146.26   0.21
 4    146.31   0.10       14    146.32   0.18
 5    146.20   0.08       15    146.00   0.32
 6    146.15   0.11       16    145.83   0.19
 7    145.93   0.18       17    145.76   0.12
 8    145.96   0.18       18    145.90   0.17
 9    145.88   0.16       19    145.94   0.10
10    145.98   0.21       20    145.97   0.09
Determine whether the system is in control.
Remark: In problems 1.5-1.10, we report on initial stages of implementing quality control of the production process of the piston of a fuel pump for a Diesel engine. The pumps are manufactured by a small subcontractor of a major U.S. company. It has been decided that, initially, four operations, all performed on a digitally controlled multiple-spindle automatic lathe, should be examined. The operations of interest, numbered from 1 to 4, are indicated in Figure 1.22. The examination was started by selecting 28 lots of pistons. The lots, each consisting of five pistons, were taken at each full hour during four consecutive working days.
Figure 1.22. Piston Of A Fuel Pump. Problem 1.5. The table below shows the sample means and sample standard deviations of lengths of the front part of pistons (length 1 in Figure 1.22) for 28 lots of size five. (Measurements are in millimeters.) Lot 1 2 3 4 5 6 7 8 9 10 11 12 13 14
x ¯ 2.960 3.054 3.010 3.032 2.996 2.982 2.912 3.000 3.036 2.954 2.976 2.998 2.982 2.956
s .018 .042 .072 .087 .018 .013 .051 .039 .029 .022 .036 .019 .046 .045
Lot 15 16 17 18 19 20 21 22 23 24 25 26 27 28
x ¯ 3.034 2.994 2.996 3.046 3.006 3.024 3.012 3.066 3.032 3.024 3.020 3.030 2.984 3.054
s .047 .044 .071 .023 .035 .051 .059 .021 .011 .080 .058 .045 .059 .046
Comment on whether the system is in control. Problem 1.6. The table below shows the sample means and sample standard deviations of diameter 2 of pistons (see Figure 1.22) for 28 lots of size five. (Measurements are in millimeters.)
Lot 1 2 3 4 5 6 7 8 9 10 11 12 13 14
x ¯ 7.996 8.028 8.044 8.036 8.008 8.018 7.962 8.022 8.020 7.992 8.018 8.000 7.980 7.952
s .017 .039 .075 .075 .044 .032 .064 .059 .025 .030 .052 .034 .023 .073
Lot 15 16 17 18 19 20 21 22 23 24 25 26 27 28
x ¯ 7.964 7.964 7.944 7.956 7.992 8.032 7.974 7.946 7.978 7.986 8.016 7.988 7.962 8.002
s .039 .045 .060 .035 .019 .044 .080 .029 .038 .058 .059 .074 .041 .024
Comment on whether the system is in control. Problem 1.7. The table below shows the sample means and sample standard deviations of diameter 3 of pistons (see Figure 1.22) for 28 lots of size five. (Measurements are in millimeters.) Lot 1 2 3 4 5 6 7 8 9 10 11 12 13 14
x ¯ 6.794 6.785 6.730 6.750 6.746 6.758 6.683 6.746 6.780 6.758 6.752 6.758 6.760 6.766
s .011 .017 .051 .034 .021 .015 .053 .056 .053 .013 .047 .057 .031 .037
Lot 15 16 17 18 19 20 21 22 23 24 25 26 27 28
x ¯ 6.809 6.773 6.762 6.776 6.782 6.758 6.770 6.788 6.744 6.756 6.736 6.750 6.746 6.736
s .036 .016 .047 .032 .019 .032 .038 .031 .023 .029 .051 .034 .037 .018
Comment on whether the system is in control. Problem 1.8. The table below shows the sample means and sample standard deviations of length 4 of pistons (see Figure 1.22) for 28 lots of size five. (Measurements are in millimeters.) Lot 1 2 3 4 5 6 7 8 9 10 11 12 13 14
x ¯ 59.250 59.240 59.246 59.278 59.272 59.222 59.170 59.220 59.210 59.200 59.220 59.210 59.212 59.214
s .018 .014 .013 .043 .008 .022 .019 .027 .009 .011 .011 .010 .013 .015
Lot 15 16 17 18 19 20 21 22 23 24 25 26 27 28
x ¯ 59.210 59.214 59.200 59.206 59.202 59.210 59.208 59.190 59.200 59.208 59.206 59.196 59.200 59.212
s .010 .018 .007 .011 .008 .023 .019 .010 .007 .019 .022 .011 .007 .011
Comment on whether the system is in control. Problem 1.9. After the control charts from problems 1.5-1.8 were constructed and analyzed, some corrections of the production process were introduced. The next 21 lots of size 5 of pistons were selected. The measurements of diameter 3 and length 4 (see Figure 1.22) are summarized in the following two tables.
Lot 1 2 3 4 5 6 7 8 9 10 11
Lot 1 2 3 4 5 6 7 8 9 10 11
x ¯ 6.756 6.744 6.760 6.726 6.728 6.748 6.742 6.758 6.758 6.726 6.754
Diameter 3 s Lot .027 12 .011 13 .021 14 .010 15 .008 16 .016 17 .022 18 .022 19 .030 20 .016 21 .020
x ¯ 6.734 6.732 6.756 6.764 6.756 6.798 6.776 6.764 6.778 6.774
x ¯ 59.200 59.214 59.210 59.198 59.206 59.220 59.200 59.196 59.200 59.204 59.204
Length 4 s Lot .012 12 .015 13 .014 14 .008 15 .005 16 .010 17 .008 18 .011 19 .007 20 .009 21 .009
x ¯ 59.208 59.200 59.200 59.204 59.202 59.212 59.208 59.210 59.194 59.200
s .021 .008 .017 .021 .020 .024 .020 .020 .024 .020
s .013 .007 .008 .011 .011 .013 .015 .016 .005 .010
Determine whether the two data sets are in control. Compare the control charts obtained with those from problems 1.7 and 1.8, respectively. In particular, are the intervals between the lower and upper control limits narrower than they were before the corrections of the production process? Problem 1.10. In the table below, the measurements of diameter 2 (see Figure 1.22) for lots 50 to 77 are summarized. Before the measurements were taken, some corrections, based on the analysis of earlier data, had been made. The lots’ size is 5. Determine whether the system is in control. Compare the control charts obtained with those from Problem 1.6. Lot 50 51 52 53 54 55 56 57 58 59 60 61 62 63
x ¯ 8.038 7.962 8.002 8.010 8.024 8.014 7.974 7.986 8.040 7.986 7.974 7.994 7.974 7.962
s .044 .039 .008 .019 .034 .023 .038 .028 .029 .028 .028 .011 .023 .030
Lot 64 65 66 67 68 69 70 71 72 73 74 75 76 77
x ¯ 7.984 8.006 8.034 8.044 8.005 7.994 7.958 8.006 8.022 8.004 7.990 8.000 7.994 7.998
s .021 .038 .031 .023 .022 .031 .018 .011 .019 .018 .029 .019 .027 .031
Problem 1.11. In a plant which is a major supplier of shock absorbers for Polish railway cars, the system of statistical process control includes measurements of the inner diameter of a cylinder turned in a metal casting. The diameter is specified as 91.4 − .2 mm. Lots of size 5 are taken. The head of the plant’s testing laboratory verified that the process is in control and that the last control limits are 91.361 and 91.257 for
the lot means and .076 for the lot standard deviations. He decided that the line workers and foreman responsible for the operation monitor the process using the given limits as the control limits for subsequent lots. Any Pareto glitch observed should call for an immediate action by the line. The results for new lots are as follows. Lot 1 2 3 4 5 6 7 8 9 10 11 12 13
x ¯ 91.305 91.259 91.308 91.296 91.290 91.291 91.307 91.350 91.262 91.255 91.259 91.309 91.311
s .076 .038 .048 .046 .066 .039 .065 .043 .060 .020 .031 .030 .033
Lot 14 15 16 17 18 19 20 21 22 23 24 25
x ¯ 91.310 91.337 91.336 91.328 91.320 91.350 91.368 91.308 91.330 91.331 91.348 91.246
s .064 .052 .047 .029 .036 .029 .011 .078 .048 .050 .043 .051
No cause of the glitches on lots 10 and 20 (on the X̄ chart) and on lots 1 and 21 (on the s chart) was found. After detecting still another glitch on lot 25, the foreman asked the plant's foundry to examine the casting moulds. A failure of one of the moulds in use was detected and corrected. The following are the summary statistics of subsequent lots: Lot 26 27 28 29 30 31 32 33 34 35 36 37 38
x ¯ 91.290 91.301 91.271 91.300 91.331 91.333 91.308 91.298 91.337 91.326 91.297 91.320 91.290
s .040 .009 .028 .064 .035 .046 .047 .020 .026 .052 .031 .036 .043
Lot 39 40 41 42 43 44 45 46 47 48 49 50
x ¯ 91.319 91.290 91.320 91.318 91.291 91.290 91.309 91.287 91.309 91.311 91.328 91.275
s .034 .038 .018 .051 .031 .010 .046 .034 .020 .055 .052 .028
a. Since the cause of the Pareto glitches was found, delete lots 1, 10, 20, 21 and 25, and use the remaining lots of the first 25 to calculate control limits. Are the first 25 lots (excluding lots 1, 10, 20, 21 and 25) in control, relative to the limits obtained?
b. Calculate the limits for lots 26 to 50 and verify whether these lots are in control.

Problem 1.12. Comment on whether the following system is in control.

Lot   Lot Size   Number Defectives      Lot   Lot Size   Number Defectives
 1      100              5              10      100              3
 2      100              7              11      100              9
 3      100              3              12      100             14
 4      100              0              13      100              2
 5      100              1              14      100             10
 6      100             13              15      100              2
 7      100              5              16      100              1
 8      100              5              17      100              3
 9      100              2              18      100              5
Chapter 2
Acceptance-Rejection SPC

2.1  Introduction
The paradigm of quality control is largely oriented toward optimization of a process by the use of monitoring charts to discover “Pareto glitches” which can then be backtracked until assignable causes for the glitches can be found and rectified. The goal is simply real and measurable improvement. “Zero defects” is seldom, if ever, a realistic goal of the quality improvement specialist. For example, if we apply the techniques of statistical process control in the management of a cancer care facility, we can certainly expect measurable improvements in the mortality rate of the patients. However, a goal of zero mortality is not a realistic one for achievement by the normal techniques of statistical process control. Such a goal is achievable only by a major medical breakthrough. A manufacturer of micro-chips may realistically expect failure rates measured in failures per million once SPC paradigms have been in place for some time. A manufacturer of automobiles may expect failures measured in failures per thousand. The director of a general health care facility may expect patient death rates measured in deaths per hundred admissions. Each type of process has its natural limit of improvement of the failure rate, absent scientific breakthrough. The use of the paradigm of statistical process control does, very frequently, produce improvements so substantial as to appear miraculous. The current state of Japanese automobiles, and that of electronics compared to what they were 40 years ago, is such an example. But such improvements are obtained by steady monitoring and searching for causes of variation, not by the setting of 53
arbitrary utopian goals. Few things can foil quality improvement more surely than the announcement of unrealistic QC goals without any reasonable road map as to how they are to be achieved. When it becomes obvious that the “pie in the sky” is not forthcoming, workers in the system rightly “turn off” the quality enhancement program as just so much management huffing and puffing. In some failed quality control cultures, there is a tendency to replace unattainable goals on relevant variables by attainable goals on irrelevant variables. Some years ago, a new, highly motivated manager of a newly built wood products factory in the Soviet Union set out to have the best run wood products factory in the empire. He spurred the workers on with promises of bonuses for their enthusiastic cooperation. His plans for efficiency went so far as using the sawdust produced as fuel to power the factory. At the end of his first year, fully expecting to find the government evaluation of the plant to be very high, he was crushed to find his factory ranked dead last among wood products plants in the Soviet Union. The next year, he redoubled his efforts. He apologized to the workers, explaining that while their plant was moving forward, no doubt others were as well. He made new plans for accelerated improvement. At the end of the second year, justifiably pleased with the factory’s progress, he was informed by Moscow that his plant was still last in the Soviet Union and that his productivity had fallen behind that of the first year. No bonuses for the workers and the danger of demotion of the plant manager. Distraught to desperation, the young manager boarded the next train to Moscow. Perhaps his plant was at the bottom of the list, but there was no way the productivity of his plant the second year was less than that of the first. Upon meeting with his bosses in Moscow, he presented graphs and tables to demonstrate how productivity had improved from the first to the second year. Unimpressed, his superiors noted they had also doublechecked their computations and productivity had indeed declined. Desperately, the manager asked what criteria were being used for productivity measurement. He discovered that the criterion was a rather simple one, namely the pounds of waste sawdust per worker taken from the plant each year. The happy ending to the story is that the third year, the bright, but
now somewhat cynical, manager achieved his goal of seeing his wood products factory ranked among the highest in the Soviet Union. Almost equally bizarre situations regularly occur in the United States. For example, as part of the “zero defects” program, employees may be asked to sign cards pledging themselves to the goal of zero defects. When a certain proportion of workers have signed the cards, a day of celebration is decreed to mark this “important event.” Note that it is not the achievement of zero defects which is being celebrated, but rather the somewhat less significant pledging of a commitment to zero defects (whatever that means).
2.2  The Basic Test
As we have pointed out earlier, statistical process control using failure data is a rather blunt instrument. However, at the start of almost any statistical process control program, the data available will frequently be end product failure rate. Whether or not a production item is satisfactory will frequently be readily apparent, if by no other feedback than by the judgment of the end user of the product. Consequently, if we have past records to give us the average proportion of failed products, \hat{\hat{p}}, we can use the information to see whether a new lot of size n and proportion of failures \hat{p} appears to exhibit an atypical proportion of failed items. If the difference is truly significant, then we can use the information to suggest the presence of a Pareto glitch. Looking at the various factors which could have led to the glitch may then provide us the opportunity to find an assignable cause for the glitch. If the assignable cause has led to a significantly higher fraction of failed items (as is usually the case), then we can act to remove it. If the assignable cause has led to a significantly lower fraction of failures, then we may consider whether the cause should be made a part of our standard production protocols. An investigator who wanted never to miss an assignable cause would have to declare a lot to be out of control whatever the measurement results might be. Such an approach could be described as management by constant crisis. Consistent with the old story of the shepherd constantly shouting “Wolf!,” a quality control investigator who is always declaring the system to be out of control will generally be unable to detect an out of control system when it occurs. Contrariwise, an investigator who only wanted to be sure he never
turned in a false alarm would pass every lot as being in control. Such an investigator could (and should) be replaced by a recording saying, “Yea, verily, yea,” over and over. Clearly, we must steer a course between these two extremes. As a matter of fact, experience has shown that we should steer rather closer to the investigator who accepts everything as being in control rather than the one who rejects every lot. We recall that the main function of statistical process control is not as a police action to remove defective items. To continue the analogy, a good detective would be much more interested in finding and destroying the source of a drug ring than simply arresting addicts on the streets. He will be relatively unimpressed with the discovery of casualties of the drug ring except insofar as such discoveries enable him to get to the source of the distribution system. Just so, the statistical process control investigator is looking for a lot which is almost surely not typical of the process as a whole. To err on the side of not crying wolf unless we are nearly certain a lot is out of control (i.e., not typical of the process as a whole), it is customary to set the alarm level so that a lot which really is typical (and thus “in control”) is declared to be out of control roughly only once in 500 times. The exact level is not usually very important. A test may have the false alarm rate as high as one chance in 50 or as low as one chance in 5,000 and still be quite acceptable for our purpose. Let us suppose that past experience indicates that 5% of the items produced are not acceptable. We have a lot of size 100 and find that 7 are not acceptable. 7% is certainly higher than 5%. But, if the lot is typical, we would expect a higher value than 5% half the time. What is the probability that 7 or more defective items will be found in a lot of size 100 drawn from a population where the probability of any item being bad is .05? In accordance with the formula for a binomial distribution (see Appendix B), we simply compute

P(X \ge 7) = 1 - P(X \le 6) = 1 - \sum_{j=0}^{6}\binom{100}{j}(.05)^j(.95)^{100-j} = .234.    (2.1)

If we choose to intervene at such a level, we shall be intervening constantly, so much so that we will waste our efforts. We need a higher level of failure rate before taking notice. Suppose we decide rather to
wait until the number of defectives is 10 or more. Here the probability of calling special attention to a lot which actually has an underlying failure rate of 5% is given by
P(X \ge 10) = 1 - P(X \le 9) = 1 - \sum_{j=0}^{9}\binom{100}{j}(.05)^j(.95)^{100-j} = .02819.    (2.2)

By our predetermined convention of 1 in 500 (or .002), this is still too high, though probably we would not be seriously overchecking at such a rate. The issue in statistical process control is not to have the alarm ring so frequently that we cannot examine the system for assignable causes of apparent Pareto glitches. When we get to 12 defectives, the probability goes to .0043. When we arrive at 13, the probability drops just below the .002 level, namely to .0015. And at 14, the probability is only .00046. It is quite reasonable to set the alarm to ring for any number of defectives past 10 to 14. The “one in 500” rule is purely a matter of convenience. We should note, moreover, that it might be appropriate for the alarm to ring when there were too few failures. When we reach a level of 0 defectives, for example, if the underlying probability of a failure is truly .05, then we would expect to see such a low number of defectives with probability only .95^{100} = .0059. It might well be the case that a statistical process control professional would want to see if perchance some really positive innovation had occurred during this period, so that standard operating procedure could be modified to take advantage of a better way of doing things. In such a case, where we wish to use both the low failure rate and the high failure rate alarms, we could set the low alarm at 0, say, and the high at 12, for a pooled chance of a “false alarm” equal to .0059 + .0043, that is, for a pooled “false alarm” rate of .0102 or roughly one in a hundred. Looking at the size of our quality control staff, we might make the judgment that we simply did not have enough people to check at a level where the process was carefully examined for major defects 1% of the time when good lots were being produced. Then we might properly decide to check the process only when 13 or more of the items in a lot were faulty. Essentially, there is no “right” answer as to where we should set the alarm level.
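A small Python sketch (ours, not from the text) reproduces these binomial tail probabilities and shows how the alarm threshold could be chosen.

```python
from math import comb

def upper_tail(k, n=100, p=0.05):
    """P(X >= k) for a binomial(n, p) number of defectives."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

for k in (7, 10, 12, 13, 14):
    print(k, round(upper_tail(k), 5))
# Roughly .234, .028, .0043, .0015 and .00046: an alarm set anywhere from
# 12 to 14 defectives keeps the false alarm rate near the 1-in-500 target.
```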
2.3  Basic Test with Equal Lot Size
Let us consider the sort of situation encountered in the production of bolts of nominal diameter of 10 millimeters. Our customer has agreed with the factory management that he will accept no bolts with diameter below 9.8 millimeters, none above 10.1 millimeters. The factory has been using the old “quality assurance” paradigm, i.e., simply use 100% end inspection, where a worker has two templates of 9.799 mm and 10.101 mm. Any bolt which fits into the 9.799 template is rejected, as is any which will not fit into the 10.101 template.

Figure 2.1. Templates for Finding Defective Bolts.

This is the kind of paradigm which is essentially avoided by factories which have been using the SPC (statistical process control) approach for some time. But it is the kind of paradigm still used in most factories in the industrial world. When one is starting up an SPC operation, this is the kind of data with which we have to work. And such data generally contains valuable first step SPC information. Let us consider a consecutive sample of 40 lots of 100 bolts each inspected in a classical quality assurance system. Typically, these observations will represent 100% inspection. The lots are indexed by time. In a “quality assurance” program, feedback from the proportion of defectives in the output stream is generally casual. The pattern in Table 2.1 pictured in Figure 2.2 is not untypical. The proportion of defectives hovers around 5%. Starting with lot 11, the proportion drifts upward to 15%. Then, the proportion drifts downward, perhaps because of intervention,
perhaps not. In general, the graph in Figure 2.2 is of marginal utility if it is used only retrospectively after all 40 lots have been examined.
Figure 2.2. Acceptance-Rejection Control Chart. Let us note how such information might be used to improve the production process. We elect to use as our upper control limit a proportion of 13%. This boundary is crossed on lot 15. Suppose that as a result of a point above the upper control limit, we engage in a thorough examination of the production process. We discover that a lubrication fault in the milling machine in the form of intermittent overheating is present. We correct the lubrication problem, and the resulting proportion of defectives is demonstrated in Figure 2.3. The resulting proportion of defectives drops to around 1%. Such an excellent outcome is, interestingly enough, not as untypical as one might suppose. The gains available by time based control charting are far out of proportion to the labor required for their creation. Note that looking at the full 40 points after the entire lot has been collected is unlikely to discover the overheating problem. It is generally difficult to find “assignable causes” from a cold data set. Such a problem will simply generally get worse and worse as time progresses until the bearings burn out and the machine is replaced.
Table 2.1
Lot   Number Defectives   Proportion Defectives
 1            3                   .03
 2            2                   .02
 3            5                   .05
 4            0                   .00
 5            6                   .06
 6            4                   .04
 7            2                   .02
 8            4                   .04
 9            1                   .01
10            2                   .02
11            7                   .07
12            9                   .09
13           11                   .11
14           12                   .12
15           14                   .14
16           15                   .15
17           12                   .12
18           10                   .10
19            8                   .08
20            3                   .03
21            5                   .05
22            6                   .06
23            0                   .00
24            1                   .01
25            3                   .03
26            3                   .03
27            4                   .04
28            6                   .06
29            5                   .05
30            5                   .05
31            3                   .03
32            3                   .03
33            7                   .07
34            8                   .08
35            2                   .02
36            0                   .00
37            6                   .06
38            7                   .07
39            4                   .04
40            4                   .04
Figure 2.3. Run Chart Following Correction.

As we have noted, obtaining an exact significance level is not generally important in SPC. The usual rule of thumb is that the control limits are given by the mean plus and minus 3 times the standard deviation. In the case of a Gaussian variate X with mean μ and standard deviation σ we have

P\left(\left|\frac{X-\mu}{\sigma}\right| \ge 3\right) = .0027.    (2.3)

As we note in the Appendix, for the binomial variate X from a distribution with probability p and sample size n, for a sufficiently large sample, the normalized variate Z defined below is approximately Gaussian with mean 0 and standard deviation 1.

Z = \frac{\frac{X}{n} - p}{\sqrt{\frac{p(1-p)}{n}}} = \frac{X - np}{\sqrt{np(1-p)}}.    (2.4)

To obtain an upper control limit, we can use

\frac{UCL - np}{\sqrt{np(1-p)}} = 3,    (2.5)

giving

UCL = np + 3\sqrt{np(1-p)}.    (2.6)
In the example from Table 2.1, with bolts of lot size 100 and p = .05, we have

UCL = 100(.05) + 3\sqrt{100(.05)(.95)} = 11.538.    (2.7)

So, we would reject a lot with 12 or more defectives as untypical of a population with a proportion of defectives equal to .05. Next, suppose that the lot sizes are much larger, say n = 1000. In that case the upper control limit is given by

UCL = 1000(.05) + 3\sqrt{1000(.05)(.95)} = 70.676.    (2.8)

We note that for the sample size of 100, we have an upper control limit of 12% defectives. For the sample size of 1,000, the upper control limit is 7.1% defectives. We recall that in both cases, we seek a probability of roughly .002 of passing the UCL when the proportion of defectives is 5%. Accordingly, as the sample size increases, an ever smaller departure from 5% will cause us to investigate the process for possible problems. For Acceptance-Rejection SPC, the lower control limit is frequently not used at all. An abnormally low proportion of defectives is less likely to indicate a need for process modification than is a high one. In the case of the lower control limit, for n = 100,

LCL = 100(.05) - 3\sqrt{100(.05)(.95)} = -1.538,    (2.9)

which we naturally truncate to 0. In this case, there is no lower control limit. But for n = 1000,

LCL = 1000(.05) - 3\sqrt{1000(.05)(.95)} = 29.324.    (2.10)

In general, we have that

LCL = \max\{0,\; np - 3\sqrt{np(1-p)}\}.    (2.11)
A control chart for the number of defectives, usually referred to as the np chart, is defined by the limits (2.6) and (2.11). Equivalently, we can construct a control chart for the proportion defectives. Indeed, in our discussion, we freely switched from one chart to another. Formally, the equivalence between the two types of control charts is given by (2.4). The
right hand side of (2.4) refers to the proportion of defectives. Control limits for the chart for the proportion defectives have the form

UCL = p + 3\sqrt{\frac{p(1-p)}{n}}    (2.12)

and

LCL = \max\left\{0,\; p - 3\sqrt{\frac{p(1-p)}{n}}\right\}.    (2.13)
This last chart is usually referred to as the p chart.
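As a sketch of how limits (2.6), (2.11), (2.12) and (2.13) work out numerically (our illustration, not part of the text), consider the following Python lines for p = .05.

```python
from math import sqrt

def np_limits(n, p):
    """(LCL, UCL) of the np chart, counts of defectives per lot."""
    half_width = 3 * sqrt(n * p * (1 - p))
    return max(0.0, n * p - half_width), n * p + half_width

def p_limits(n, p):
    """(LCL, UCL) of the p chart, proportions of defectives per lot."""
    half_width = 3 * sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), p + half_width

print(np_limits(100, 0.05))    # (0.0, 11.54): no lower limit, reject at 12 or more
print(np_limits(1000, 0.05))   # (29.32, 70.68)
print(p_limits(100, 0.05))     # the same information expressed as proportions
```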
2.4  Testing with Unequal Lot Sizes
There are many situations in which it is not reasonable to suppose that the lot sizes are equal. For example, let us suppose that we are looking at the number of patients (on a monthly basis) in a hospital who experience postoperative infections following hip replacement surgery. We consider such a set of data in Table 2.2 for a period of two years. We note that the average proportion of patients who contract infections during their stay in the hospital following hip replacement surgery is .08, which we are told is normal for the hospital. We would like to be able to use the database to improve the operation in the hospital. As in the case of the production of bolts, such information viewed at the end of the two year period is likely to be of marginal value. A Pareto glitch in anything but the immediate past is typically difficult to associate with an assignable cause of that glitch. Of much less difficulty is the minor computational problem of computing the U CL when the number of patients varies from month to month. This is easily dealt with via the formula
UCL = .08 + 3\sqrt{\frac{.08(1-.08)}{n}}.    (2.14)

In general, the UCL for the p chart assumes now the form

UCL = p + 3\sqrt{\frac{p(1-p)}{n_k}},    (2.15)

where k is the lot's number and n_k is the lot's size. If the lot sizes are not equal, the control limit for the proportion defectives has to be calculated for each lot separately.
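A brief Python sketch (ours, not from the text) shows formula (2.15) in action for the hospital data of Table 2.2: with p = .08, the UCL is recomputed from each month's patient count.

```python
from math import sqrt

p = 0.08                           # overall infection proportion
patients = [50, 42, 37, 71, 55]    # n_k for months 1-5 of Table 2.2
for k, n_k in enumerate(patients, start=1):
    ucl_k = p + 3 * sqrt(p * (1 - p) / n_k)
    print(f"month {k}: n = {n_k}, UCL = {ucl_k:.3f}")
# Reproduces the first entries of the UCL column: .195, .206, .214, .177, .190
```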
Table 2.2. Infections Following Hip Replacement.
Month   Patients   Infections   Prop.    UCL    Prop./UCL
  1        50           3       .060     .195      .308
  2        42           2       .048     .206      .232
  3        37           6       .162     .214      .757
  4        71           5       .070     .177      .399
  5        55           6       .109     .190      .575
  6        44           6       .136     .203      .673
  7        38          10       .263     .212     1.241
  8        33           2       .061     .222      .273
  9        41           4       .098     .207      .471
 10        27           1       .037     .237      .157
 11        33           1       .030     .222      .137
 12        49           3       .061     .196      .312
 13        66           8       .121     .180      .673
 14        49           5       .102     .196      .520
 15        55           4       .073     .190      .383
 16        41           2       .049     .207      .236
 17        29           0       .000     .231      .000
 18        40           3       .075     .209      .359
 19        41           2       .049     .207      .236
 20        48           5       .104     .197      .527
 21        52           4       .077     .193      .399
 22        55           6       .109     .190      .575
 23        49           5       .102     .196      .520
 24        60           2       .033     .185      .180
We note that in Table 2.2 and Figure 2.4, in the seventh month, the proportion of patients developing infections went to 1.241 times the UCL. If we are examining the data very long after that month, we are unlikely to find the assignable cause. In the situation here, after much retrospective work, we find that there were three teams (composed of surgeons, anesthetists, surgical nurses, etc.) performing this type of surgery during the period. We shall refer to these as teams A, B, and C. The numbers of patients will be denoted by n, and the numbers of these who developed infections by m. The results of the breakdown are given in Table 2.3.
Figure 2.4. Run Chart of Proportion of Infections Divided by UCL. From Table 2.3, we note that Team C only began operation in month 4. The number of surgeries handled by this team then grew slowly. By the end of the second year, it was handling roughly as many surgeries as the other two teams. Figure 2.5 gives us a fair picture of the progress of Team C. Early on, the proportion of patients developing infections was much higher than that of the other two teams. From month 15 onward, the infection rate of Team C appears comparable to that of the other two teams. Since the data was “cold” by the time we received it, perhaps we shall never know what steps were taken in month 15 which appears to have rid Team C of its earlier comparatively poor performance. Suffice it to say, however, that employment of process control early on might have spared a number of patients the trauma associated with an infection following a joint replacement procedure. Benjamin Franklin’s adage that “Experience keeps a hard school, and a fool will learn by no other” is obviously relevant here. Any start-up procedure is likely to have problems associated with it. But, absent an orderly regular measurement procedure, these start-up problems may take a long time in solving. Here, it took almost a year. And that is a better record than that usually associated with surgical procedures whose records are not regularly monitored for “Pareto glitches.” Frequently, in a medical setting, poor performance by a surgical team may go unnoticed and uncorrected for years, until a personal injury lawyer brings the matter to the attention of the hospital. When this occurs, the hospital gains a quick and expensive demonstration of the principle that modest expense and effort in implementing a
good statistical process control regime generally saves vast sums later on.

Table 2.3. Team Performance Data.
Month   nA   mA   nB   mB   nC   mC
  1     20    1   30    2    0    0
  2     22    2   20    0    0    0
  3     20    2   17    4    0    0
  4     30    2   35    1    6    2
  5     17    2   25    2   13    2
  6     20    1   15    2    9    3
  7     15    2   10    2   13    6
  8     21    1    9    0    3    1
  9     19    1   19    2    3    1
 10     10    0   15    0    2    1
 11     15    1   15    0    3    0
 12     25    1   20    1    4    1
 13     31    2   20    2   15    4
 14     19    1   20    1   10    3
 15     25    1   20    2   10    1
 16     19    2   15    0    7    0
 17     10    0    9    0   10    0
 18     14    1   16    1   10    1
 19     10    1   10    1   21    0
 20     15    1   10    2   23    2
 21     20    1   20    2   12    1
 22     19    2   17    2   19    2
 23     14    1   15    2   20    2
 24     20    1   20    1   20    0

In this discussion, we relied on the p chart. Let us mention that, instead of this chart, one can use a chart for standardized proportion defectives,

Z_k = \frac{p_k - p}{\sqrt{\frac{p(1-p)}{n_k}}},    (2.16)
where k is the lot’s number, nk is the lot’s size and pk is the proportion of defectives in lot k. Since the Zk ’s are approximately normal with mean 0 and standard deviation 1, the upper and lower control limits become 3 and -3, respectively.
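The standardized chart (2.16) is easy to sketch in code as well (our illustration, not part of the text): every lot is reduced to a Z score, so the fixed limits -3 and 3 apply regardless of lot size.

```python
from math import sqrt

def z_score(defectives, n_k, p):
    """Standardized proportion defective for a lot of size n_k."""
    p_k = defectives / n_k
    return (p_k - p) / sqrt(p * (1 - p) / n_k)

# Month 7 of Table 2.2: 10 infections among 38 patients, overall p = .08.
print(z_score(10, 38, 0.08))    # about 4.2, well beyond the upper limit of 3
```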
Figure 2.5. Comparison of Team Performances.
2.5  Testing with Open-Ended Count Data
Worse even than feedback from end inspection is that obtainable from items returned due to unsatisfactory performance. Unfortunately, such information is frequently the major source of information to a firm as to the quality of a product. One problem in dealing with such data is that the potential number of returns is essentially infinite. Under fairly general conditions, e.g., that the probability that there will be an item returned in a small interval of time is proportional to the length of the interval and that the returns in one time interval do not impact the number of returns in a later interval, the number of items X returned per week has, roughly, the Poisson distribution, i.e.,

P(X \ge m) = 1 - \sum_{i=0}^{m-1} e^{-\lambda}\frac{\lambda^i}{i!}    (2.17)

where λ is the average value of X.
Table 2.4. Weekly Returns
Week   Number Returned      Week   Number Returned
  1          22               14         22
  2          13               15         31
  3          28               16         27
  4          17               17         20
  5          22               18         24
  6          29               19         17
  7          32               20         22
  8          17               21         29
  9          19               22         30
 10          27               23         31
 11          48               24         22
 12          53               25         26
 13          31               26         24
In Figure 2.6, we show weekly numbers of items returned for a six month period. The average number of returns per week is computed to be X̄ = 26.27. We recall that for the Poisson distribution, the mean and the variance σ^2 are equal. Accordingly, our estimate for the standard deviation is σ = 5.12. (With all observations included, the sample standard deviation is 8.825, much bigger than the 5.12 we get making the Poisson assumption.) The upper control limit is given by

UCL = 26.27 + 3(5.12) = 41.63.    (2.18)
We note that two weeks have return numbers above the UCL. With these two items removed from the pool, we find X̄ = 24.25, giving √X̄ = 4.92.
The sample standard deviation is now 5.37, in reasonable agreement with 4.92.
Figure 2.6. Run Chart Items Returned per Week.

What was the difficulty causing the unusually high number of returns in weeks 11 and 12? A backtracking of records indicated that 10 of the actual returns in week 11 and 12 of those in week 12 had been erroneously entered twice. Such a sounding off of a “QC alert” by incorrect measurements or erroneous data rather than by a basic flaw in the production system is quite common in quality control. Here, little harm seems to have been done. But again, we must emphasize the desirability of looking at the data before it has become cold. And we must note that, realistically speaking, anyone who starts a statistical process control paradigm of stepwise optimization will be confronted with the fact that there are reams of “useful data” available which, hot or cold, represent some sort of starting point for analysis. Indeed, in the United States, SPC consultants frequently are called in to resolve some crisis in which nothing but rough end product acceptance-rejection data is available. If the consultant is clever enough to use such data in order to solve the immediate crisis, he may find that he is thanked, paid (handsomely) and bid farewell to, until the next crisis occurs. It is a continuing tragedy of American managers that they
tend to lurch from crisis to crisis, lacking the attention span and perception to obtain the vastly superior increases in quality and productivity that would follow if a methodical system of statistical process control were instituted and developed.

Let us conclude this section with a brief discussion of another problem with data whose values cannot be bounded from above. Until now, we have dealt with numbers of defective items. Sometimes, however, we are interested in the number of defects per lot of a fixed size (possibly 1). Depending on the situation, an item can be considered defective if it possesses just one defect of some known sort. In the example of producing bolts, a bolt was determined as being defective if its diameter lay outside tolerance limits. In cases of end inspection of assembled products, an item may usually be defective in many different and well-known ways. Or, say, a molded plastic item can be considered defective if it has at least one scratch on its visible part, or it has at least one flash, or unacceptable flow line, or its bending strength is below a tolerance limit. As a rule, we can assume that the number of defects per lot is a Poisson random variable, that is, that the number of defects in one “region” is independent of the number of defects in a disjoint region, the probability that there will be a single defect in a small region is proportional to the size of the region, and the probability of more than one defect occurring in a small region is negligible. A particular definition of the region depends, of course, on the case under scrutiny. If the number of defects follows the Poisson distribution (2.17), we obtain essentially the same control chart as discussed earlier in this section. Namely,

UCL = \bar{c} + 3\sqrt{\bar{c}}    (2.19)

and

LCL = \max\{0,\; \bar{c} - 3\sqrt{\bar{c}}\},    (2.20)

where c̄ is either equal to the mean number of defects per lot (λ in (2.17)), assumed known from past experience, or is the mean's estimate,

\bar{c} = \frac{1}{N}\sum_{k=1}^{N} c_k,    (2.21)

where N is the number of lots, and c_k is the number of defects in lot k. The control chart presented is often called the c chart. Clearly, the c chart should not be used if lots are of unequal sizes. It is natural then to replace the c chart by a control chart for the number
of defects per unit, or per item. Denoting the number of defects per unit in lot k of size n_k by u_k,

u_k = \frac{c_k}{n_k},     (2.22)

we obtain the following control limits for the kth lot:

UCL = \bar{u} + 3\sqrt{\bar{u}/n_k}     (2.23)

and

LCL = \max\{0,\ \bar{u} - 3\sqrt{\bar{u}/n_k}\},     (2.24)

where

\bar{u} = \frac{\sum_{k=1}^{N} c_k}{\sum_{k=1}^{N} n_k}.     (2.25)
We note that the limits have to be calculated for each lot separately. The chart obtained is often called the u chart. Obviously, one can use the u chart, instead of the c chart, if lot sizes are equal, nk = n for all k. It suffices to replace nk by n in (2.21) to (2.24).
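Both sets of limits are straightforward to compute in practice. Below is a minimal sketch, assuming hypothetical defect counts and lot sizes, that reproduces the c-chart limits of (2.19)–(2.21) and the per-lot u-chart limits of (2.23)–(2.25).

```python
# Sketch: c-chart limits (equal lot sizes) and per-lot u-chart limits (unequal lot sizes).
import math

def c_chart_limits(defect_counts):
    c_bar = sum(defect_counts) / len(defect_counts)                 # (2.21)
    return c_bar + 3 * math.sqrt(c_bar), max(0.0, c_bar - 3 * math.sqrt(c_bar))  # (2.19), (2.20)

def u_chart_limits(defect_counts, lot_sizes):
    u_bar = sum(defect_counts) / sum(lot_sizes)                     # (2.25)
    limits = []
    for n_k in lot_sizes:                                           # limits recomputed for each lot, (2.23)-(2.24)
        half_width = 3 * math.sqrt(u_bar / n_k)
        limits.append((u_bar + half_width, max(0.0, u_bar - half_width)))
    return u_bar, limits

counts = [4, 7, 3, 9, 5]          # hypothetical defect counts
sizes = [50, 80, 40, 100, 60]     # hypothetical lot sizes
print(c_chart_limits(counts))
print(u_chart_limits(counts, sizes))
```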
Problems

Problem 2.1. Consider the production of fuel pumps for Diesel engines. In the early stages of production, before the introduction of the SPC paradigm, 100% end inspection of the pumps' cylinders in lots of 100 was performed.

Lot   Proportion Defectives     Lot   Proportion Defectives
 1    .04                        17    .13
 2    .02                        18    .11
 3    .03                        19    .06
 4    .04                        20    .04
 5    .05                        21    .07
 6    .04                        22    .04
 7    .03                        23    .07
 8    .02                        24    .04
 9    .03                        25    .05
10    .04                        26    .09
11    .03                        27    .06
12    .04                        28    .05
13    .03                        29    .07
14    .08                        30    .08
15    .11                        31    .06
16    .14                        32    .08
Construct the run chart for the data and then add the Upper Control Limit to the chart. Are the data in control? Given the run chart, should one wait until all 32 lots are inspected and have become “cool,” or should one rather construct the control chart for the first 20 lots? Does the control chart based only on the first 20 lots exhibit a Pareto glitch?

Problem 2.2. Consider the production process summarized in Figure 2.3. Suppose that from lot 1 onwards, the process is examined in accordance with the SPC rules. Past experience indicates that 5% of the items produced are not acceptable. The corresponding Upper Control Limit is provided by formula (2.7). The limit is crossed on lot 14, and the cause of the observed Pareto glitch is found and corrected immediately after the production of lot 15 is completed. The subsequent data are given in the table below and in Figure 2.3.
Lot   Number Defectives   Proportion Defectives     Lot   Number Defectives   Proportion Defectives
16     3                   .03                       29     0                   .00
17     1                   .01                       30     0                   .00
18     1                   .01                       31     3                   .03
19     0                   .00                       32     1                   .01
20     0                   .00                       33     1                   .01
21     1                   .01                       34     1                   .01
22     2                   .02                       35     2                   .02
23     0                   .00                       36     0                   .00
24     1                   .01                       37     1                   .01
25     1                   .01                       38     1                   .01
26     1                   .01                       39     0                   .00
27     1                   .01                       40     1                   .01
28     2                   .02
a. Delete lots 14 and 15 and compute the sample mean of the proportion of defectives for the first 13 lots. Using the value obtained, compute the new Upper Control Limit (i.e., that for lots 1 to 13). Use the new limit as a “trial” limit to verify whether lots 16 onwards are in control. Give the motivation for such an approach.
b. Construct the control chart for lots 16 to 40 without any use of the past information on the process under scrutiny. Are the lots in control? Compare the Upper Control Limit obtained with the trial limit.
c. Should one insist on computing trial limits for the future data (as in a) or, rather, wait until sufficiently many data are available for constructing charts independent of the more distant past? If using the trial limits is to be recommended, should one rely on the trial control charts solely?

Problem 2.3. Consider the following data.
Lot   Number Inspected   Number Defectives     Lot   Number Inspected   Number Defectives
 1     531                 25                   14     685                 28
 2    2000                 58                   15    2385                 89
 3    2150                 89                   16    2150                 58
 4    1422                 61                   17    2198                 86
 5    2331                 75                   18    1948                 41
 6    1500                 73                   19    2271                 67
 7    2417                115                   20     848                 30
 8     850                 27                   21    2214                 68
 9    1700                 49                   22    1197                 56
10    2009                 81                   23    2150                 77
11    1393                 62                   24    2394                 82
12    1250                 46                   25     850                 33
13    2549                115
Comment on whether the system is in control.

Problem 2.4. The following data pertain to the inspection of pressure relief valves for autoclaves for food processing. The company producing the autoclaves is furnished with the valves by two subcontractors.
Lot   Lot Size   Number Defectives     Lot   Lot Size   Number Defectives
 1    100         6                     16    150         6
 2    100         5                     17    120         6
 3     90         3                     18    120         5
 4     95         4                     19    110         5
 5    100         6                     20    110         5
 6    105         5                     21    115         9
 7    200         8                     22    100        11
 8    210        11                     23    100         9
 9    155         7                     24     90         7
10    155         7                     25     95         9
11    120         6                     26    110        12
12    120         6                     27    120         9
13    110         5                     28    120        12
14    115         5                     29     95         7
15    120         5                     30     95         8
It is known that the first, more expensive, subcontractor provided the company with the valves for lots 1 to 20. The second subcontractor's valves were used for the remaining 10 lots.
a. Construct the control charts for lots 1 to 20.
b. Compare the quality of valves from both subcontractors. In order to do this, construct the control chart for lots 21 to 30 using the sample mean of the proportions of defectives for lots 1 to 20 (i.e., construct the control chart using the mean based on past experience).

Problem 2.5. A company producing refrigerators keeps records of failures of its products under warranty, reported by authorized dealers and repair workshops. In the following table, the numbers of failures in 26 consecutive weeks are given for a particular type of refrigerator.
Week   Number Repaired     Week   Number Repaired
 1      22                  14      38
 2      16                  15      29
 3      29                  16      25
 4      24                  17      16
 5      18                  18      22
 6      21                  19      26
 7      28                  20      27
 8      24                  21      14
 9      29                  22      19
10      21                  23      29
11      22                  24      17
12      16                  25      27
13      20                  26      25
Determine whether the system is in control. Even if any Pareto glitches are detected, do they help in finding their assignable causes?

Problem 2.6. After the production of a new item has been set up, it was decided that the number of defects in the items should be examined. Three different types of defects are possible. It was decided that lots of size 5 be taken. In the following table, numbers of defects (of whatever type) per lot are given for the first 30 lots.

Lot   Number Defects     Lot   Number Defects
 1    11                 16    12
 2    16                 17    14
 3     8                 18     9
 4    12                 19     7
 5     6                 20     5
 6     6                 21     9
 7     5                 22    12
 8    12                 23    11
 9    14                 24    13
10    11                 25    20
11     9                 26    14
12    11                 27    12
13    12                 28     7
14     7                 29     8
15     6                 30     6
a. Verify that the system is out of control.
b. An investigation was performed which revealed that defects of the first type constituted the majority of all defects. The following are the numbers of defects of the first type per lot.

Lot   Number Defects     Lot   Number Defects
 1     6                 16     8
 2     9                 17     7
 3     5                 18     6
 4    10                 19     4
 5     5                 20     2
 6     4                 21     7
 7     2                 22     9
 8     9                 23     8
 9    10                 24     9
10     7                 25    16
11     8                 26    10
12     7                 27     8
13     9                 28     6
14     3                 29     5
15     2                 30     3

Do the data reveal any Pareto glitches?
Chapter 3
The Development of Mean and Standard Deviation Control Charts

3.1 Introduction
It is generally a good idea to approach a pragmatic paradigm from the standpoint of a conceptual model. Even if the model is flawed and incomplete, provided some care is given to its formulation, a model-based approach is likely to give us a more useful frame of reference than a paradigm of the “just do it” variety. Statistical process control is no exception to this rule. Let us first of all assume that there is a “best of all possible worlds” mechanism at the heart of the process. For example, if we are turning out bolts of 10 cm diameter, we can assume that there will be, in any lot of measurements of diameters, a variable, say X0, with mean 10 and a variance equal to an acceptably small number. When we actually observe a diameter, however, we may not be seeing only X0 but a sum of X0 plus some other variables which are a consequence of flaws in the production process. These are not simply measurement errors but actual parts of the total diameter measurements which depart from the “best of all possible worlds” distribution of diameter as a consequence of imperfections in the production process. One of these imperfections might be excessive lubricant temperature, another bearing vibration, another nonstandard raw materials, and so on. These add-on variables will generally be intermittent in time. A major task in SPC is to find measurements which appear to show “contamination” of the basic production process.
For simplicity's sake, let us assume that the random variables are added. In any lot, indexed by the time, t, of sampling, we will assume that the measured variable can be written as

Y(t) = X_0 + \sum_{i=1}^{k} I_i(t) X_i,     (3.1)

where X_i comes from distribution F_i having mean \mu_i and variance \sigma_i^2, and the indicator

I_i(t) = \begin{cases} 1 & \text{with probability } p_i \\ 0 & \text{with probability } 1 - p_i. \end{cases}     (3.2)
If such a model is appropriate, then, with k assignable causes, there may be in any lot 2^k possible combinations of random variables contributing to Y. We assume that there is sufficient temporal separation from lot to lot that the parameters driving the Y process are independent from lot to lot. Further, we assume that an indicator variable I_i maintains its value (0 or 1) throughout a lot. Let I be a subset of {1, 2, . . . , k}. Then

Y(t) = X_0 + \sum_{i \in I} X_i \quad \text{with probability} \quad \Big[\prod_{i \in I} p_i\Big]\Big[\prod_{i \in I^c} (1 - p_i)\Big].     (3.3)
In the special case where each distribution is normal, the observed variable Y(t) is given by

Y(t) = N\Big(\mu_0 + \sum_{i \in I} \mu_i,\ \ \sigma_0^2 + \sum_{i \in I} \sigma_i^2\Big)     (3.4)

with probability \Big[\prod_{i \in I} p_i\Big]\Big[\prod_{i \in I^c} (1 - p_i)\Big].
Of course, in the real world, we will not know what the assignable causes (and hence the X_i) are, let alone their means and variances or their probabilities of being present as factors in a given sampled lot. A major task of SPC is to identify the variables, other than X_0, making up Y(t) and to take steps to remove them.
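The model (3.1)–(3.2) is easy to simulate, which is often a useful way to build intuition for how contaminated lots behave. Below is a minimal sketch; the particular means, variances, and occurrence probabilities are illustrative choices (loosely matching the bolt example of the next section), not prescriptions.

```python
# Sketch: simulating one lot from the contamination model of (3.1)-(3.2).
import random

def sample_lot(n, mu0, sigma0, causes, rng=random):
    """causes is a list of (p_i, mu_i, sd_i); each indicator I_i is fixed for the whole lot."""
    active = [(mu, sd) for (p, mu, sd) in causes if rng.random() < p]
    lot = []
    for _ in range(n):
        y = rng.gauss(mu0, sigma0)          # the "best of all possible worlds" variable X0
        for mu, sd in active:               # add each active assignable-cause variable X_i
            y += rng.gauss(mu, sd)
        lot.append(y)
    return lot

random.seed(1)
# hypothetical assignable causes: lubricant heating and bearing vibration
causes = [(0.01, 0.4, 0.02 ** 0.5), (0.005, -0.2, 0.08 ** 0.5)]
print(sample_lot(5, 10.0, 0.1, causes))
```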
3.2 A Contaminated Production Process
In the matter of the manufacturing of bolts of 10 cm diameter, let us suppose that the underlying process variable X0 is N (μ = 10, σ 2 = .01). In addition, we have (unbeknownst to us) a variable due to intermittent lubricant heating, say X1 which is N (.4, .02) and another due to bearing vibration, say X2 which is N (−.2, .08) with probabilities of occurrence, respectively, p1 = .01 and p2 = .005. So, for a sampled lot, Y (t) will have the following possibilities: Y (t) = X0 ; probability (1 − .01)(1 − .005) = .98505
(3.5)
Y (t) = X0 + X1 ; probability (.01)(1 − .005) = .00995
(3.6)
Y (t) = X0 + X2 ; probability (1 − .01)(.005) = .00495
(3.7)
Y (t) = X0 + X1 + X2 ; probability (.01).005 = .00005.
(3.8)
Recalling (see Appendix B) that a sum of normal variates is itself normal, with mean and variance equal to the sums of the means and variances of the added variables, we have Y(t) = N(10, .01); probability .98505
(3.9)
Y (t) = N (10.4, .03); probability .00995
(3.10)
Y (t) = N (9.8, .09); probability .00495
(3.11)
Y (t) = N (10.2, .11); probability .00005.
(3.12)
For (3.1) (which is not itself normal), the reader can verify that

E(Y) = \mu_0 + p_1\mu_1 + p_2\mu_2     (3.13)
Var(Y) = \sigma_0^2 + p_1\sigma_1^2 + p_2\sigma_2^2 + p_1(1-p_1)\mu_1^2 + p_2(1-p_2)\mu_2^2.

For the p_1 and p_2 of .01 and .005, respectively, this gives

E(Y) = 10.003, \quad Var(Y) = .0124.     (3.14)

But if we increase p_1 to .10 and p_2 to .05, then we have

E(Y) = 10.03, \quad Var(Y) = .0323.     (3.15)
We note how even a modest amount of contamination can seriously inflate the variance. Now, in actuality, we really need to estimate σ0², not
Var(Y). We cannot precisely achieve this goal, since we do not know a priori which lots are contaminated and which are not. However, we can elect to use the following strategy: instead of estimating variability about the mean of the means of all the lots, we obtain an unbiased estimate of the variance (or, more commonly, of the standard deviation) of the distribution of each lot (here there are four possible distributions) and take the average over all base lots. If we do this, then for the estimate σ̂² we will have

E(\hat{\sigma}^2) = \sigma_0^2 + p_1\sigma_1^2 + p_2\sigma_2^2.     (3.16)
We shall follow the convention of basing our estimate for σ0² (or of σ0) on the average of lot variability estimates throughout this book (except when driven by lot sizes of 1 item per lot). This is our first example of a robust procedure, and, happily, it is followed generally by SPC professionals. Then, with p_1 = .1 and p_2 = .05, if we use the mean of lot sample means and the mean of lot sample variances, we would obtain estimates for μ0 and σ0² very close to

\mu = 10 + .1(.4) + .05(-.2) = 10.03
\sigma^2 = .01 + .1(.02) + .05(.08) = .016.

Suppose we made a test using the (still somewhat flawed by contamination) estimates for μ0 and σ0². We shall assume the lot size is 5. Then for the upper control limit, we would have

\frac{UCL - 10.03}{\sqrt{.016/5}} = 3,     (3.17)

giving

UCL = 10.03 + 3(.05659) = 10.2     (3.18)

and

LCL = 10.03 - 3(.05659) = 9.86.     (3.19)
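The arithmetic of (3.13)–(3.19) is easy to check numerically. The following minimal sketch uses the parameters from the text; the code itself is an illustration, not part of the original exposition.

```python
# Sketch: contaminated-process moments (3.13) and control limits (3.17)-(3.19), lot size 5.
import math

mu0, var0 = 10.0, 0.01
causes = [(0.10, 0.4, 0.02), (0.05, -0.2, 0.08)]   # (p_i, mu_i, sigma_i^2), heavy-contamination case

mean_Y = mu0 + sum(p * mu for p, mu, v in causes)
var_Y = var0 + sum(p * v + p * (1 - p) * mu ** 2 for p, mu, v in causes)
print(round(mean_Y, 3), round(var_Y, 4))           # 10.03 and 0.0323

n = 5
sigma_hat2 = var0 + sum(p * v for p, mu, v in causes)   # expected average within-lot variance, (3.16)
half_width = 3 * math.sqrt(sigma_hat2 / n)
print(round(mean_Y + half_width, 2), round(mean_Y - half_width, 2))   # about 10.2 and 9.86
```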
We note that the means of the three contaminated possible distributions, 10.4, 9.8 and 10.2, each lies outside (or on the boundary of) the acceptable interval 9.86 < x̄ < 10.2. Hence, even for the very high contamination probabilities p1 = .1 and p2 = .05, using the contaminated values 10.03 and .016 for the mean and variance (as opposed to the uncontaminated values 10 and .01), we would identify each of the contaminated lots as atypical with probability .5 or greater using the modest lot size of 5. The contaminated distribution N(10.2, .11) will
be the most difficult to pick as “atypical,” but even here, half the area of the density is to the right of the upper control limit, 10.2. We recall that our goal in statistical process control is to find and correct problems. If we have a testing procedure which rings the alarm with at least 50% chance every time we have an atypical lot, then we are doing well. Contamination becomes more of a problem in the case of tests based on the sample variances of lots. We recall that the definitions of sample mean and variance, based on a sample of size n, are given by

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i     (3.20)

and

s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2.     (3.21)
And, if the sample is from a normal distribution with mean μ and variance σ², we recall that

\chi^2 = \frac{(n-1)s^2}{\sigma^2}     (3.22)

has a Chi-square distribution with n − 1 degrees of freedom. Let us suppose we have estimated the variance to be .016 (as opposed to the variance of the uncontaminated population, .01). Let us take a contaminated lot with a variance as small as possible (and therefore as hard as possible to distinguish from the norm). In the bolt manufacturing example given, this would be the case where the underlying variance is actually .03. Given the model of contamination used here, a one-tailed test on σ² is clearly indicated. For the moment, let us use a one-tailed test with a chance of one in 500 of a false alarm. Using the tables of the Chi-square distribution, we have

P\Big(\frac{(n-1)s^2}{\sigma^2} \ge 16.924\Big) = .002.     (3.23)

Now then, for the estimated value of σ² = .016, we shall reject the typicality of a lot if

(n-1)s^2 \ge 16.924(.016) = .2708.     (3.24)

Now, in the event the lot comes from the distribution with variance equal to .03, the critical statistic becomes

\chi^2 = \frac{(n-1)s^2}{.03} = \frac{.2708}{.03} = 9.026.     (3.25)

And, with n − 1 = 4 degrees of freedom,

P(\chi^2 \ge 9.03) = .0605.     (3.26)

Thus, we could have a chance of as little as 6% of finding a bad lot. Had we used the true norm variance value of .01, the rejection region would be given by

(n-1)s^2 \ge 16.924(.01) = .16924.     (3.27)

So for an estimated variance of .03, we would have

\chi^2 = \frac{.16924}{.03} = 5.6413,     (3.28)

giving

P(\chi^2 > 5.6413) = .23.     (3.29)
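The tail probabilities quoted in (3.23)–(3.29) can be reproduced with any chi-square routine; the following minimal sketch uses scipy purely as an illustration.

```python
# Sketch: the chi-square computations behind (3.23)-(3.29).
from scipy.stats import chi2

df = 4                                     # n - 1 for lots of size 5
crit = chi2.ppf(1 - 0.002, df)             # about 16.92, the 1-in-500 one-tailed critical value
print(round(crit, 3))

# Detection probability for a lot whose true variance is .03:
for sigma2_norm in (0.016, 0.01):          # contaminated vs. uncontaminated estimate of the norm variance
    threshold = crit * sigma2_norm         # rejection region for (n-1)s^2
    p_detect = chi2.sf(threshold / 0.03, df)
    print(sigma2_norm, round(p_detect, 3)) # roughly .06 and .23
```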
Contamination is generally more of a problem in the estimation of the standard for the variance than it is for that of the mean. There are many ways of reducing the problem of contamination in the estimation of the normative mean and variance, and we shall discuss some of these later. In most SPC programs, little allowance is made for contamination in the estimation of the mean and variance of the uncontaminated process. Generally, all the observations are left in for the computation of the overall mean and variance, as long as the assignable cause has not been discovered and removed. We recall that our task in SPC is not to find every atypical lot. Rather, we try to find bad lots (sooner, hopefully, but later will do), track down the assignable causes, and remove them. Shortly, we shall show how the use of median based estimates of the norm process can sometimes enhance our ability to find atypical lots. In Figure 3.1, we see a graphical view of our contamination model. The four different types of epochs, three of them “contaminated,” are shown. We have assumed that the sample we take is sufficiently small that it is unlikely that a sample will be drawn from two different epochs. Once we draw a sample from one of the contaminated epochs of production, we have a chance of running a test that will tell us about the presence of contamination. This, hopefully, will be the clue we need to find an assignable cause of the contamination and remove it.
Figure 3.1. Diagram of Contamination Model.
3.3 Estimation of Parameters of the “Norm” Process
Let us suppose we have been sampling lots for some time since the last assignable cause was found and corrected. Thus, we have at our disposal, for estimating the mean and variance of the uncontaminated process, N sample means and variances. Useful estimates are obtained by simply averaging these statistics:

\bar{\bar{x}} = \frac{1}{N}\sum_{j=1}^{N} \bar{x}_j     (3.30)

and

\bar{s}^2 = \frac{1}{N}\sum_{j=1}^{N} s_j^2.     (3.31)

Now, as is shown in Appendix B, the expectation of x̄ is the population mean μ, and the expectation of s² is the population variance σ². Thus, both estimators are unbiased. These results hold regardless of the underlying distribution of the data. Let us suppose we have an estimator \bar{\hat{\theta}} which is composed of an average of unbiased estimators \{\hat{\theta}_j\} of a parameter θ. Then we immediately have that

E(\bar{\hat{\theta}}) = \frac{1}{N}\sum_{j=1}^{N} E(\hat{\theta}_j) = \frac{1}{N}(N\theta) = \theta.     (3.32)
To develop an estimator for a parameter in statistical process control, it is customary to find an unbiased estimator for the parameter using data
from a lot, and then take the average of such estimators across many lots. This procedure tends to work well if the basic unbiased lot estimator has reasonable properties. As we know from the Central Limit Theorem in Appendix B, the asymptotic distribution of the mean of unbiased parameter estimates is normal with mean equal to the parameter, and variance equal to that of a lot estimator divided by N, the number of lots. To obtain an estimator for the process standard deviation, we can simply take the square root of s̄², whose expectation is, indeed, the population variance. A more commonly used estimator for the process standard deviation is the average of the sample standard deviations (i.e., of the square roots of the sample variances),

\bar{s} = \frac{1}{N}\sum_{j=1}^{N} s_j = \frac{1}{N}\sum_{j=1}^{N} \sqrt{s_j^2}.     (3.33)
But we must note that the expected value of s is not quite σ. To see that this is true, we need only recall that for normal data,

\frac{(n-1)s^2}{\sigma^2} = z = \chi^2(n-1).     (3.34)

Thus,

E(s) = \frac{\sigma}{\sqrt{n-1}}\,\frac{1}{\Gamma(\frac{n-1}{2})}\int_0^{\infty} e^{-z/2}\Big(\frac{z}{2}\Big)^{\frac{n-1}{2}-1} z^{1/2}\,\frac{dz}{2}
     = \frac{\sigma}{\sqrt{2(n-1)}\,\Gamma(\frac{n-1}{2})}\int_0^{\infty} e^{-z/2}\Big(\frac{z}{2}\Big)^{\frac{n}{2}-1}\,dz
     = \sigma\,\frac{\sqrt{2}\,\Gamma(\frac{n}{2})}{\sqrt{n-1}\,\Gamma(\frac{n-1}{2})}.     (3.35)
As an unbiased estimator for σ, then, we have (with a(n) as tabulated in Table 3.1)

\hat{\sigma} = \frac{\sqrt{n-1}\,\Gamma(\frac{n-1}{2})}{\Gamma(\frac{n}{2})\sqrt{2}}\ \bar{s} = a(n)\bar{s}.     (3.36)
© 2002 by Chapman & Hall/CRC
estimation of parameters of the “norm” process
83
common to use a multiple of the average of the sample ranges as an estimate for the population standard deviation. For the j’th sample of size 5, the range Rj is given by Rj = max{xj1 , xj2 , xj3 , xj4 , xj5 } − min{xj1 , xj2 , xj3 , xj4 , xj5 }.
(3.37)
The average of the N ranges is given by N ¯= 1 Rj . R N j=1
(3.38)
It is proved in Appendix B, that for normal data, the expected value of the range is a multiple of the standard deviation. Accordingly, when ¯ to estimate standard deviation σ, we use the formula applying R ¯ σ ˆ = b(n)R,
(3.39)
where bn is given by (B.209). Table 3.1 sample size a(n) b(n) 2 1.253 .8865 3 1.1284 .5907 4 1.0854 .4857 5 1.0638 .4300 6 1.0510 .3946 7 1.0423 .3698 8 1.0363 ..3512 9 1.0317 .3367 10 1.0281 .3249 15 1.0180 .2880 20 1.0133 .2677
c(n) 1.1829 1.0646 1.0374 1.0260 1.0201 1.0161 1.0136 1.0116 1.0103 1.0063 1.0046
In the case where there is no contamination (no removable assignable causes), averages of lots’ sample means, x, converge almost surely to μ0 . Such is not the case where we have contamination. Accordingly, we might decide to employ estimates which would not be affected by sample means far away from the underlying population mean. For example, we might choose as our estimate for μ0 the median of the sample means: ˜¯ = med{x¯j ; j = 1, 2, . . . , N }. x
© 2002 by Chapman & Hall/CRC
(3.40)
84 chapter 3. mean and standard deviation control charts This estimator is more “robust” than that based on taking the average of the sample means. In other words, it will be influenced less by information from contaminating distributions, and reflect better the “typical” uncontaminated distribution. For normal data, the sample means are also normal, hence symmetrical, so that the median of their distribution is also equal to the mean. That means that for the uncontaminated case, E( sample median) = μ0 .
(3.41)
For the contaminated case, the sample median will generally be closer to μ0 than will x. Regardless of the lot size, the x estimator for the mean of the uncontaminated process will not converge to that mean. In ˜¯ estimator does converge stochastically the Appendix, we show that the x (as both n and N go to infinity) to the process mean if the proportion of lots from the uncontaminated process is greater than 50%. More importantly, even for small lot sizes the median of means estimator will tend to be closer to the mean of the uncontaminated distribution than will the mean of means estimator. Clearly, unusually large sample means and unusually small means may be the result of contamination of the production process. In fact, it is our purpose to identify these “outliers” as Pareto glitches and remove their source. If we include them in the computation of our estimate for “the norm” it is possible that they may so distort it that we will be impaired in identifying the contaminants. Similarly, we may find it useful to use a multiple of the median of sample standard deviations to estimate σ0 s˜ = med{sj ; j = 1, 2, . . . , N }.
(3.42)
Given (3.34), we can solve for the median of the distribution of sj by solving the equation .5 =
1 Γ( n−1 2 )
0
(n−1)s2 σ2
e−z z
n−1 −1 2
dz.
(3.43)
We have included in Table 3.1, a column of constants for the use in obtaining the ratio of E(s) and med(s) via E(s) = c(n). s˜ Thus, the estimator for σ based on the median of sample standard deviations assumes the form σ ˜ = a(n)c(n)˜ s.
© 2002 by Chapman & Hall/CRC
(3.44)
estimation of parameters of the “norm” process
85
The disadvantages associated with the use of the median estimation procedure are a slight loss of efficiency in the uncontaminated normal case and a difficulty in having workers on the line enter a long list of sample mean data into a calculator which can then carry out the sorting of the data in order to obtain sample median (i.e., the middle observation in a sort from smallest to largest). The first of these objections is of little practical consequence. Typically, we have rather a long list of sample means. For practical purposes, we can consider it to be infinite. Moreover, a small error in estimating the mean and variance of the uncontaminated process is unlikely to be important. The second objection, concerning the difficulty of requesting line workers to enter data into calculators for sorting, has more validity. However, as a practical matter, it is not a particularly good idea to bother line workers with this degree of technical trivia; they have other more pressing concerns. The estimates for the uncontaminated mean and standard deviation, and hence for the upper and lower control limits, need not be updated at every lot sampling. Daily or less frequently will generally do nicely. The computation of the control limits should be carried out according to a regular standard protocol by a more central computation center than the line. At one of the factories where we have installed an SPC system, the head of the testing laboratories has run data entered into a spreadsheet, which feeds into a standard protocol for plotting and updating control limit estimates. Any crossing of the control limits is a matter of immediate concern and action by the line workers. If they find the cause of a “Pareto glitch” and remove it, this is noted for possible use in deleting data prior to the fix (not a high priority item if the robust median estimation protocol is followed). Each morning, the newest versions of the control charts are put on the desk of the head of the testing laboratories, who then examines them and passes them on to the lines, frequently carrying them himself so that discussions with foremen are possible. Most of the companies with whom we have worked in America and Poland had already some sort of “quality control” or “quality assurance” group. Some used control charts and had strong feelings about using the one rule or another for constructing the mean control chart. Each of these was more or less oriented to the “plus or minus three standard deviations” rule. In showing workers how easy it is to enter data and automatically obtain sample mean and standard deviation data with one of the sturdy, background light powered calculators, such as the Texas Instruments TI36 Solar (cost of around $15 in 1992), most are willing to begin using direct estimates of the standard deviation rather than the range. But if
© 2002 by Chapman & Hall/CRC
86 chapter 3. mean and standard deviation control charts one prefers to use a range based estimate or some other similar statistic for the standard deviation, that is a matter of small concern. Let us consider now 30 days of simulated data (one sample per shift) from the bolt production line discussed earlier. Lot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
x1 9.927 9.862 10.061 9.820 9.737 9.876 9.898 10.001 9.928 9.896 10.011 9.983 9 10.127 10.025 9.953 10.007 10.062 10.168 9.986 9.786 9.957 9.965 9.989 9.983 10.063 9.767 9.933 10.227 10.022 9.845 9.956 9.876 9.932 10.016 9.927 9.952 9.941 10.010 9.848 10.002 10.031 9.990 9.995 9.980 10.058 10.006 10.132 10.012 10.097 10.007 9.967 9.981 9.841 9.992 9.908 10.011 10.064 9.891 9.869 10.016 10.008 10.100 9.904 9.979 9.982
© 2002 by Chapman & Hall/CRC
x2 9.920 10.003 10.089 10.066 9.937 9.957 9.959 10.050 10.234 9.994 10.011 .974 9.935 9.890 10.000 10.005 10.005 10.045 10.041 10.145 9.984 10.011 10.063 9.974 10.075 9.994 9.974 10.517 9.986 9.901 9.921 10.114 9.856 9.990 10.066 10.056 9.964 9.841 9.944 9.452 10.061 9.972 10.056 10.094 9.979 10.221 9.920 10.043 9.894 9.789 9.947 10.053 9.926 9.924 9.894 9.967 10.036 10.055 9.934 9.996 10.157 9.853 9.848 10.008 9.963
x3 10.170 9.829 9.950 10.062 9.928 9.845 9.924 10.263 9.832 10.009 10.090 10.071 9.979 10.002 10.141 9.883 10.070 10.140 9.998 10.012 10.273 9.810 10.148 9.883 9.988 9.935 10.026 10.583 10.152 10.020 10.132 9.938 10.085 10.106 10.038 9.948 9.943 10.031 9.828 9.921 9.943 10.068 10.061 9.988 9.917 9.841 10.094 9.932 10.101 10.015 10.037 9.762 9.892 9.972 10.043 10.204 9.733 10.235 10.216 10.095 9.988 10.067 9.949 9.963 10.061
Table x4 9.976 9.824 9.929 9.897 10.144 9.913 9.989 9.982 10.027 9.835 10.095 10.099 10.014 9.999 10.130 9.941 10.270 9.918 9.992 10.110 10.142 10.057 9.826 10.153 10.071 10.114 9.937 10.501 9.922 9.751 10.016 10.195 10.207 10.039 9.896 9.802 10.085 9.975 9.834 9.602 9.997 9.930 10.016 9.961 9.881 10.115 9.935 10.072 9.959 9.941 9.824 9.920 10.152 9.755 9.903 9.939 9.985 10.064 9.962 10.029 9.926 9.739 9.929 10.132 9.970
3.2 x5 9.899 10.077 9.935 10.013 9.965 9.941 9.987 10.076 10.121 10.162 10.120 9.992 9.876 9.937 10.154 9.990 10.071 9.789 9.961 9.819 10.190 9.737 10.041 10.092 10.096 9.964 10.165 10.293 10.101 10.088 10.109 10.010 10.146 9.948 9.871 9.947 10.049 9.880 10.091 9.995 9.952 10.113 10.044 10.140 9.966 9.964 9.975 9.892 10.040 10.013 9.938 10.107 9.965 9.925 9.842 10.077 9.972 10.092 10.012 10.080 10.008 10.092 9.904 9.924 9.937
x ¯ 9.978 9.919 9.993 9.972 9.942 9.906 9.951 10.074 10.028 9.979 10.065 10.024 9.986 9.971 10.076 9.965 10.096 10.012 9.996 9.974 10.109 9.916 10.013 10.017 10.059 9.955 10.007 10.424 10.034 9.921 10.027 10.027 10.045 10.020 9.960 9.941 9.996 9.947 9.909 9.794 9.997 10.015 10.034 10.033 9.960 10.029 10.011 9.990 10.018 9.953 9.943 9.965 9.955 9.914 9.918 10.040 9.958 10.067 9.999 10.043 10.017 9.970 9.907 10.001 9.983
s 0.111 0.114 0.076 0.109 0.145 0.046 0.040 0.112 0.158 0.125 0.051 0.057 0.094 0.056 0.092 0.053 0.101 0.158 0.029 0.165 0.135 0.137 0.119 0.106 0.041 0.125 0.096 0.154 0.091 0.135 0.092 0.129 0.147 0.059 0.087 0.090 0.066 0.083 0.112 0.252 0.050 0.074 0.028 0.079 0.067 0.145 0.096 0.076 0.090 0.097 0.077 0.134 0.119 0.093 0.075 0.106 0.131 0.122 0.132 0.042 0.085 0.165 0.038 0.079 0.047
R 0.271 0.253 0.160 0.246 0.406 0.112 0.092 0.281 0.402 0.327 0.108 0.125 0.251 0.136 0.201 0.124 0.266 0.379 0.080 0.359 0.316 0.321 0.322 0.270 0.108 0.347 0.232 0.356 0.124 0.337 0.212 0.318 0.352 0.158 0.195 0.254 0.144 0.190 0.262 0.550 0.118 0.183 0.066 0.179 0.176 0.380 0.212 0.179 0.207 0.226 0.213 0.346 0.311 0.236 0.201 0.265 0.330 0.345 0.346 0.099 0.231 0.361 0.101 0.208 0.124
s2 0.012 0.013 0.006 0.012 0.021 0.002 0.002 0.013 0.025 0.016 0.003 0.003 0.009 0.003 0.009 0.003 0.010 0.025 0.001 0.027 0.018 0.019 0.014 0.011 0.002 0.016 0.009 0.024 0.002 0.018 0.009 0.017 0.022 0.003 0.008 0.008 0.004 0.007 0.013 0.064 0.003 0.006 0.001 0.006 0.004 0.021 0.009 0.006 0.008 0.009 0.006 0.018 0.014 0.009 0.006 0.011 0.017 0.015 0.017 0.002 0.007 0.027 0.001 0.006 0.002
87
estimation of parameters of the “norm” process
Lot 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
x1 10.028 9.995 9.936 10.014 10.005 10.116 9.934 9.972 10.014 10.093 9.927 10.177 9.825 10.333 9.972 10.059 9.832 9.958 10.087 10.232 10.066 10.041 9.868 10.084 10.063
x2 10.079 10.029 10.022 10.070 10.044 10.028 10.025 9.855 10.000 9.994 9.832 9.884 10.106 10.280 10.116 9.992 10.075 9.884 9.994 9.966 9.948 10.044 9.955 10.018 10.055
Table 3.2 (continued) x3 x4 x5 x ¯ 9.970 10.087 10.094 10.052 9.991 10.232 10.189 10.087 9.940 10.248 9.948 10.019 9.890 10.137 9.901 10.002 10.016 10.188 10.116 10.074 10.152 10.047 10.040 10.077 10.129 10.054 10.124 10.053 9.931 9.785 9.846 9.878 9.978 10.133 10.100 10.045 10.090 10.079 9.998 10.051 9.806 10.042 9.914 9.904 10.070 9.980 10.089 10.040 9.959 9.901 9.964 9.951 10.509 10.631 10.444 10.439 10.084 10.059 9.914 10.029 9.981 9.800 9.950 9.956 10.111 9.954 9.946 9.984 9.986 10.008 10.113 9.990 9.915 10.023 9.883 9.980 9.991 10.021 9.965 10.035 9.769 10.102 9.932 9.963 10.091 10.031 9.958 10.033 9.769 10.023 9.921 9.907 9.941 10.052 10.026 10.024 10.104 10.080 10.064 10.073
s 0.052 0.115 0.133 0.107 0.077 0.054 0.080 0.074 0.068 0.050 0.093 0.112 0.103 0.140 0.084 0.096 0.112 0.083 0.082 0.112 0.131 0.048 0.096 0.053 0.019
R 0.123 0.241 0.312 0.247 0.183 0.124 0.195 0.187 0.155 0.098 0.236 0.293 0.281 0.204 0.202 0.259 0.279 0.229 0.205 0.324 0.333 0.133 0.254 0.143 0.049
s2 0.003 0.013 0.018 0.011 0.006 0.003 0.006 0.005 0.005 0.003 0.009 0.013 0.011 0.006 0.007 0.009 0.012 0.007 0.007 0.014 0.017 0.002 0.009 0.003 0.001
Let us note what one obtains from this data set as estimates of the uncontaminated mean and standard deviation using the various estimators mentioned. The mean of the 90 sample means is seen to be 10.004. The median of the sample means is 10.000. The mean of the sample standard deviations is seen to be .095, giving (using Table 3.1) as estimate for the standard deviation of the uncontaminated process

\hat{\sigma} = a(5)\bar{s} = (1.0638)(.095) = .1011.     (3.45)
The median of the sample standard deviations is .093, giving as a robust estimate of the standard deviation of the uncontaminated process

\tilde{\sigma} = a(5)c(5)\tilde{s} = (1.0638)(1.026)(.093) = .1015.     (3.46)
The mean of the sample ranges is .232, giving (using Table 3.1) as the estimate for the population standard deviation

\hat{\sigma}_R = b(5)\bar{R} = (.430)(.232) = .0998.     (3.47)
In constructing our control chart, we shall use the most common estimate for the mean, namely the average of the sample means, as our center for the mean control chart, i.e., 10.004. As our estimate for the standard deviation, we will use the most common of the estimates, namely the average of the sample standard deviations multiplied by the appropriate a(n) from Table 3.1. Thus, our control chart for the mean is given by

UCL = \bar{\bar{x}} + 3\,\frac{a(5)\bar{s}}{\sqrt{5}}     (3.48)
    = 10.004 + 3\,\frac{.1011}{\sqrt{5}} = 10.139;     (3.49)

LCL = \bar{\bar{x}} - 3\,\frac{a(5)\bar{s}}{\sqrt{5}}     (3.50)
    = 10.004 - 3\,\frac{.1011}{\sqrt{5}} = 9.868.     (3.51)
We note that lots 28 and 79 are above the upper control limit. Hopefully, we would be able to use this information to identify the source of the N(10.4, .03) contamination and remove it. Similarly, lot 40 has a mean below the lower control limit. Hopefully, we would be able to use this information to find the source of the N(9.8, .09) contamination and remove it.
Figure 3.2. Mean Control Chart.

To summarize, the usual method for obtaining the upper and lower control limits for an SPC sample mean control chart is to find the sample mean of the sample means of the lots of size n, hence the expression \bar{\bar{x}}. We then find the sample mean of the sample standard deviations of the lots, hence the expression \bar{s}. This latter statistic is not quite an unbiased
estimator for σ. We correct this via the formula

\hat{\sigma} = a(n)\bar{s}.     (3.52)

The interval in which we take a lot sample mean to be “typical,” and hence the production process to be “in control,” is given by

\bar{\bar{x}} - 3\frac{\hat{\sigma}}{\sqrt{n}} \le \bar{x} \le \bar{\bar{x}} + 3\frac{\hat{\sigma}}{\sqrt{n}}.     (3.53)

Thus, we have

UCL = \bar{\bar{x}} + 3a(n)\frac{\bar{s}}{\sqrt{n}}     (3.54)

and

LCL = \bar{\bar{x}} - 3a(n)\frac{\bar{s}}{\sqrt{n}}.     (3.55)
Let us now go on to examine the control chart for the standard deviation. We shall follow the convention of estimating the standard deviation using

\hat{\sigma} = a(n)\bar{s} = a(n)\,\frac{1}{N}\sum_{j=1}^{N} s_j.     (3.56)

We need an estimator of the standard deviation of the lot standard deviations. Now, for each lot,

Var(s) = E(s^2) - [E(s)]^2     (3.57)
       = \sigma^2 - \Big[\frac{\sigma}{a(n)}\Big]^2     (3.58)
       = \sigma^2\,\frac{a(n)^2 - 1}{a(n)^2}.     (3.59)

Recalling that our customary estimator for σ is a(n)\bar{s}, we have

UCL = \bar{s} + 3\bar{s}[a(n)^2 - 1]^{1/2}     (3.60)

and

LCL = \bar{s} - 3\bar{s}[a(n)^2 - 1]^{1/2}.     (3.61)

For the data set at hand, this gives us

UCL = .095 + 3[1.06381^2 - 1]^{1/2}(.100) = .2039.     (3.62)

Similarly, we have

LCL = .095 - 3[1.06381^2 - 1]^{1/2}(.100) = -.0139.     (3.63)
Naturally, this latter control limit is truncated to 0. Here, we note that we pick up only one alarm (lot 40) from the 90 standard deviations. Typically, we expect to discover more Pareto glitches from the mean control chart than from the standard deviation control chart. In the real world, if we are doing our job properly, we will not wait for 30 days of data before intervening to improve the system. Indeed, if we do not investigate the causes of glitches immediately, we shall generally not be able to determine an assignable cause.
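The limits of (3.48)–(3.51) and (3.60)–(3.63) depend only on the lot means and lot standard deviations, so they are easy to script. The following is a minimal sketch with hypothetical lot statistics; the a(n) values are taken from Table 3.1.

```python
# Sketch: x-bar and s control limits from lot summary statistics.
import math

A_CONST = {2: 1.253, 3: 1.1284, 4: 1.0854, 5: 1.0638, 6: 1.0510}   # a(n) from Table 3.1

def xbar_s_limits(lot_means, lot_sds, n):
    a_n = A_CONST[n]
    grand_mean = sum(lot_means) / len(lot_means)
    s_bar = sum(lot_sds) / len(lot_sds)
    half = 3 * a_n * s_bar / math.sqrt(n)                 # (3.54)-(3.55)
    mean_limits = (grand_mean - half, grand_mean + half)
    spread = 3 * s_bar * math.sqrt(a_n ** 2 - 1)          # (3.60)-(3.61)
    s_limits = (max(0.0, s_bar - spread), s_bar + spread)
    return mean_limits, s_limits

lot_means = [10.00, 9.98, 10.02, 10.42, 9.97, 10.01]   # hypothetical data
lot_sds = [0.09, 0.11, 0.10, 0.15, 0.08, 0.12]
print(xbar_s_limits(lot_means, lot_sds, n=5))
```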
Figure 3.3. Standard Deviation Control Chart. Generally, whenever we examine variability of measurements within lots, we ourselves rely on standard deviation control charts. In Section 3.7, however, we show how to construct control charts for the lots’ ranges.
3.4 Robust Estimators for Uncontaminated Process Parameters
The contamination of the estimators for the “typical” process by data points from the contaminating distribution(s) can pose problems. Let us consider, for example, the situation where the “typical” process is Gaussian with mean 10 and standard deviation .1. The contaminating distribution is Gaussian with mean 9.8 and standard deviation .3. We
robust estimators for uncontaminated process parameters 91 will assume that with probability .70 a lot comes from N (10,.01) and with probability .3 it comes from N (9.8,.09). Thus, we are considering a case where the contamination of the “typical” process is very high indeed. As previously, we shall assume the lot sample size is 5. Lot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
x1 9.987 9.629 10.122 9.861 9.816 9.925 10.103 10.057 9.956 9.591 10.185 10.051 9.966 9.987 9.926 9.734 9.980 10.012 10.263 9.954 9.762 10.008 10.091 9.795 9.968 9.948 9.864 10.162 9.941 9.791 10.052 10.058 10.144 10.080 9.912 9.973 10.003 10.096 10.023 10.188 10.005 10.018 9.961 9.914 9.969 9.976 9.927 9.934 9.956 10.001 10.097 10.179 9.941 10.089 10.006 9.976 10.003 9.875 9.978 10.057 9.853 9.893 10.105 9.963 9.593
x2 9.921 9.798 9.705 10.033 10.061 9.830 9.921 10.078 10.106 9.996 9.809 9.998 9.748 9.863 9.964 9.267 9.980 9.936 9.040 10.025 10.100 9.884 9.936 9.874 10.140 10.008 9.937 9.925 10.064 9.695 9.993 9.982 10.052 10.047 9.991 9.863 10.087 9.977 10.109 10.173 9.973 9.927 10.021 9.870 9.946 10.005 10.052 10.087 9.824 9.935 9.896 10.018 10.316 9.846 9.883 10.010 9.927 10.093 10.101 9.911 9.595 9.574 10.064 9.562 9.945
© 2002 by Chapman & Hall/CRC
x3 9.969 9.342 9.910 9.806 9.753 9.556 10.066 10.107 10.349 10.047 9.367 9.998 9.501 10.121 10.137 9.803 10.005 10.080 10.058 10.212 9.928 10.067 10.040 9.491 9.855 9.967 9.990 9.986 10.012 9.415 9.970 10.031 10.077 10.113 9.933 9.785 10.032 10.009 9.893 10.057 10.034 10.002 10.014 9.824 10.317 10.019 9.987 9.919 10.003 10.033 9.977 10.126 9.315 9.949 10.116 10.049 10.009 10.116 10.061 9.752 10.358 9.342 10.166 9.337 10.264
Table x4 10.239 9.644 9.537 9.930 10.435 9.448 9.970 10.024 9.818 9.544 10.179 9.957 10.086 10.125 10.116 9.903 9.957 10.023 9.670 9.954 9.986 10.002 10.085 9.472 10.030 10.010 9.991 9.963 9.957 10.155 10.040 10.078 9.932 9.874 9.935 10.015 10.068 10.193 9.881 9.927 10.029 9.829 10.147 10.057 10.043 10.001 10.060 9.567 10.044 10.019 10.040 9.711 9.708 9.421 10.033 10.045 10.012 9.971 9.951 9.879 9.751 9.759 10.101 9.381 9.918
3.3 x5 9.981 10.465 9.968 9.923 10.165 9.654 10.083 9.993 10.146 9.787 9.791 9.886 10.123 10.093 9.925 9.884 9.927 10.025 9.918 9.933 10.156 9.994 10.012 9.702 10.131 9.890 9.980 9.834 9.981 9.540 10.017 10.104 9.999 10.116 10.099 10.007 10.039 9.951 9.873 9.963 9.903 9.910 9.927 9.965 10.163 9.986 10.003 9.893 10.072 10.137 10.021 10.027 9.518 9.395 10.028 10.081 9.961 9.968 9.884 10.196 9.765 10.011 9.885 9.498 9.925
x ¯ 10.020 9.775 9.848 9.910 10.046 9.682 10.029 10.052 10.075 9.793 9.866 9.978 9.885 10.038 10.014 9.718 9.970 10.015 9.790 10.016 9.987 9.991 10.033 9.667 10.024 9.965 9.952 9.974 9.991 9.719 10.014 10.051 10.041 10.046 9.974 9.929 10.046 10.045 9.956 10.062 9.989 9.937 10.014 9.926 10.088 9.997 10.006 9.880 9.980 10.025 10.006 10.012 9.760 9.740 10.013 10.032 9.982 10.005 9.995 9.959 9.864 9.716 10.064 9.548 9.929
s .125 0.419 0.229 0.085 0.276 0.195 0.079 0.044 0.201 0.228 0.338 0.061 0.260 0.113 0.105 0.261 0.029 0.052 0.471 0.115 0.154 0.066 0.063 0.180 0.119 0.049 0.054 0.120 0.048 0.283 0.034 0.047 0.080 0.100 0.076 0.101 0.033 0.099 0.105 0.119 0.054 0.076 0.084 0.090 0.154 0.017 0.054 0.191 0.098 0.073 0.075 0.181 0.388 0.315 0.084 0.040 0.037 0.099 0.087 0.171 0.291 0.265 0.107 0.249 0.237
R .318 1.123 0.585 0.227 0.682 0.476 0.181 0.113 0.532 0.502 0.818 0.165 0.622 0.262 0.212 0.636 0.078 0.144 1.223 0.278 0.394 0.183 0.155 0.402 0.285 0.119 0.127 0.328 0.122 0.740 0.083 0.122 0.213 0.243 0.187 0.230 0.084 0.242 0.236 0.261 0.131 0.189 0.220 0.233 0.371 0.043 0.132 0.520 0.248 0.202 0.201 0.468 1.001 0.693 0.233 0.104 0.085 0.241 0.217 0.444 0.763 0.669 0.281 0.626 0.671
s2 .016 0.176 0.053 0.007 0.076 0.038 0.006 0.002 0.040 0.052 0.115 0.004 0.067 0.013 0.011 0.068 0.001 0.003 0.222 0.013 0.024 0.004 0.004 0.032 0.014 0.002 0.003 0.014 0.002 0.080 0.001 0.002 0.006 0.010 0.006 0.010 0.001 0.010 0.011 0.014 0.003 0.006 0.007 0.008 0.024 0.000 0.003 0.036 0.010 0.005 0.006 0.033 0.150 0.099 0.007 0.002 0.001 0.010 0.008 0.029 0.085 0.070 0.011 0.062 0.056
92 chapter 3. mean and standard deviation control charts Lot 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
x1 9.580 9.975 10.140 10.181 9.414 9.592 9.952 9.839 9.964 9.770 9.503 9.676 10.015 9.799 10.147 10.104 10.082 10.174 9.922 10.208 9.868 10.029 10.053 9.585 9.956
x2 9.389 10.103 9.905 9.860 9.791 9.892 10.171 9.888 10.154 9.729 9.682 9.879 10.177 10.026 9.948 10.060 10.047 10.076 9.957 10.008 9.964 9.512 10.152 10.098 9.899
Table 3.3 (continued) x3 x4 x5 x ¯ 10.070 9.324 10.301 9.733 10.064 10.078 10.057 10.056 9.709 10.386 9.835 9.995 10.027 10.051 9.965 10.017 9.625 9.876 9.704 9.682 9.587 9.872 10.116 9.812 9.971 9.992 10.009 10.019 10.191 9.983 10.024 9.985 9.855 9.962 10.009 9.989 9.829 10.088 9.187 9.721 9.527 9.817 9.964 9.698 9.909 9.969 10.194 9.925 9.756 9.997 9.962 9.981 9.906 10.153 10.014 9.980 10.014 9.734 9.711 9.911 9.989 9.813 9.930 9.979 10.088 9.963 9.932 10.023 10.159 9.961 9.965 10.067 10.053 10.075 10.049 10.011 10.064 10.164 9.889 10.067 9.799 10.004 9.972 9.921 9.608 9.784 10.071 9.801 10.191 9.952 10.045 10.079 9.538 9.441 9.719 9.676 9.947 9.909 10.023 9.947
s 0.432 0.048 0.269 0.118 0.177 0.224 0.088 0.137 0.108 0.329 0.195 0.186 0.151 0.134 0.186 0.114 0.071 0.102 0.067 0.127 0.085 0.248 0.095 0.256 0.049
R 0.978 0.128 0.677 0.321 0.462 0.529 0.219 0.352 0.299 0.900 0.461 0.518 0.421 0.354 0.435 0.291 0.156 0.213 0.153 0.319 0.205 0.560 0.239 0.657 0.124
s2 0.186 0.002 0.072 0.014 0.031 0.050 0.008 0.019 0.012 0.108 0.038 0.035 0.023 0.018 0.035 0.013 0.005 0.010 0.005 0.016 0.007 0.062 0.009 0.066 0.002
Let us now consider using the standard procedure of creating a mean control chart. The sample mean of the 90 lot sample means is 9.939. The sample mean of the 90 lot standard deviations is .146. Thus, our customary control limits would be given by

9.939 - \frac{3(.146)(1.0638)}{\sqrt{5}} \le \bar{x} \le 9.939 + \frac{3(.146)(1.0638)}{\sqrt{5}}.     (3.64)
The resulting control chart is shown in Figure 3.4.
Figure 3.4. Standard Mean Control Chart With 30% Contamination.
Here the control limits seem to bear little relation to the obvious two clusters revealed by the graph. Nevertheless, using the naive sample mean control chart blindly, we would be able to pick out 10 of the 29 lots which come from the contaminating distribution. Now, let us consider, based on the data from Table 3.3, the mean control chart we obtain when we take the median of the lot sample means to estimate the mean of the uncontaminated distribution and the median of the sample standard deviations, appropriately adjusted, to estimate the standard deviation of the uncontaminated distribution. Sorting the 90 lot sample means from smallest to largest, we find that the median of the lot sample means is given by

\tilde{\bar{x}} = 9.984.     (3.65)
The median of the lot standard deviations is given by .110, giving as the estimate for the uncontaminated standard deviation

\tilde{\sigma} = c(5)a(5)\tilde{s} = (1.0638)(1.026)(.110) = .1203.     (3.66)
This gives as a control interval, using robust estimates,

9.984 - \frac{3(.110)(1.0638)}{\sqrt{5}} \le \bar{x} \le 9.984 + \frac{3(.110)(1.0638)}{\sqrt{5}}.     (3.67)
In Figure 3.5, we show that the lot mean control chart, using median estimates for the uncontaminated mean and standard deviation, does a much better job of separating N (9.8,.09) lots from N (10.0,.01) lots.
Figure 3.5. Robust Lot Mean Control Chart With 30% Contamination.
Next, we consider in Figure 3.6 the two UCL values obtained for s, depending on whether one uses the mean of the lot standard deviations or an estimate based on the median of the lot standard deviations. For the 30% contamination example in Table 3.3, the average of the sample standard deviations is s̄ = .146. The median of the sample standard deviations is s̃ = .110. We recall from (3.60) and (3.61) that the upper and lower control limits are given by

UCL = \bar{s} + 3\bar{s}[a(n)^2 - 1]^{1/2} = .3049     (3.68)

and

LCL = \bar{s} - 3\bar{s}[a(n)^2 - 1]^{1/2} = -.013.     (3.69)

For the median of the lot sample standard deviations, we replace \bar{s} by c(5)\tilde{s} = 1.0260(.110) = .1129. This gives us

UCL = c(5)\tilde{s}\,(1 + 3[a(n)^2 - 1]^{1/2}) = .2358     (3.70)

and

LCL = c(5)\tilde{s}\,(1 - 3[a(n)^2 - 1]^{1/2}) = -.01.     (3.71)
Again, as with the s̄-based chart, the lower control limit will be truncated to zero. In Figure 3.6, the dashed upper control limit is the one corresponding to s̄; it identifies only 7 of the 29 points from the contaminating distribution. On the other hand, the solid, s̃-based upper control limit splits the two clusters N(10, .01) and N(9.8, .09) of points much more effectively, successfully identifying 18 of the 29 points from N(9.8, .09). Of course, we stated early on that in statistical process control it is not generally necessary that we identify all the bad lots, since our goal is not removal of bad lots but identification and correction of the production problem causing the bad lots. Nevertheless, there would seem to be little reason not to avail ourselves of the greater robustness of the median-based estimates \tilde{\bar{x}} and \tilde{s}, other than the fact that, in so doing, we will be using a procedure slightly different from the standard approaches based on \bar{\bar{x}} and \bar{s}. We ourselves generally default to the more standard approaches, with which a client is more familiar.
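The comparison just described can be made concrete with a few lines of code. The following minimal sketch contrasts the standard (mean-based) and robust (median-based) mean-chart limits; the lot statistics shown are hypothetical, and the constants a(5) and c(5) are taken from Table 3.1.

```python
# Sketch: mean-based versus median-based ("robust") estimates of the norm process.
import math
from statistics import mean, median

def mean_chart_limits(lot_means, lot_sds, n=5, robust=False, a_n=1.0638, c_n=1.0260):
    if robust:
        center = median(lot_means)
        sigma_hat = a_n * c_n * median(lot_sds)    # sigma-tilde = a(n) c(n) s-tilde, (3.44)
    else:
        center = mean(lot_means)
        sigma_hat = a_n * mean(lot_sds)            # sigma-hat = a(n) s-bar, (3.36)
    half = 3 * sigma_hat / math.sqrt(n)
    return center - half, center + half

# A heavily contaminated set of lot means/SDs (about 30% from a shifted distribution):
lot_means = [10.01, 9.99, 10.02, 9.80, 9.98, 9.78, 10.00, 9.83, 10.03, 9.97]
lot_sds = [0.10, 0.09, 0.11, 0.31, 0.10, 0.28, 0.09, 0.30, 0.11, 0.10]
print("standard:", mean_chart_limits(lot_means, lot_sds, robust=False))
print("robust:  ", mean_chart_limits(lot_means, lot_sds, robust=True))
```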
Figure 3.6. Standard and Robust Control Charts With 30% Contamination.

In the Remark in Section 3.6 we show how to use ranges to construct control charts for the ranges. Thus, if one prefers ranges to standard deviations, one can not only use ranges to obtain a control chart for the mean but also add to it a control chart for the ranges.
3.5 A Process with Mean Drift
Although the contamination process model is probably the most useful for improvement via statistical process control, various other models may prove important in various situations. Let us consider, for example, a model which is related to the case of the juggler in Chapter 1. We recall that the tiring juggler is confronted with the increasingly difficult problem of keeping three balls moving smoothly through the air. It is this sort of anthropomorphic model that most people not familiar with quality improvement assume must be close to production reality. Happily, this is not generally the case. Nevertheless, let us try to model such decaying quality in our example of the production of bolts. Here, we shall assume that a lot sampled at regular time t gives bolts with measured diameters distributed normally with mean μ_t and variance σ². But we shall assume that the mean drifts according to the so-called Markov (one-step memory) model:

\mu_t = \mu_{t-1} + \epsilon_t,
(3.72)
96 chapter 3. mean and standard deviation control charts where t is normally distributed with mean 0 and variance τ 2 . In our simulation of 90 lots, μ1 = 10, σ = .1, and τ = .01. Lot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70
x1 9.927 9.856 10.031 9.850 9.858 10.127 10.131 9.924 9.869 10.132 9.877 10.089 10.074 9.977 10.040 10.049 10.040 9.964 10.058 10.133 10.063 10.040 9.889 10.092 9.948 10.119 10.012 9.943 10.072 10.062 9.988 10.137 10.020 9.870 10.107 10.118 9.928 9.934 9.893 10.120 9.996 10.071 9.949 9.930 9.983 10.086 9.896 9.997 9.987 9.958 10.075 10.099 10.097 10.117 9.779 10.015 9.849 10.010 9.897 9.857 9.887 9.920 9.978 10.126 9.951 10.083 9.869 10.196 10.142 9.988
© 2002 by Chapman & Hall/CRC
x2 10.019 9.985 9.957 10.071 10.159 10.006 10.144 10.008 9.902 9.974 9.970 9.803 9.954 10.011 9.961 10.086 10.039 10.058 9.885 10.049 10.033 9.954 10.049 9.889 10.188 9.949 10.194 9.835 10.019 10.280 10.101 10.031 10.031 10.156 10.053 10.059 9.974 9.870 9.997 10.019 9.796 9.776 9.869 9.998 9.967 9.995 9.999 10.089 9.928 9.999 9.940 10.042 10.133 10.020 10.133 9.885 10.153 9.825 9.975 10.244 9.913 9.996 9.944 9.950 9.964 10.051 10.165 10.018 10.097 10.043
x3 9.963 10.074 10.133 10.000 10.039 9.852 9.964 10.043 9.739 9.993 9.980 9.901 9.957 9.966 9.841 10.056 10.014 9.995 10.081 10.023 10.050 9.955 10.117 9.947 10.014 10.210 9.980 10.036 9.946 9.955 10.034 10.045 10.060 10.166 9.785 10.145 10.120 10.068 9.923 9.946 9.967 9.941 9.925 9.836 10.045 9.955 10.051 9.893 10.006 10.078 10.027 10.017 9.997 9.813 10.086 10.034 9.996 9.883 9.914 9.961 10.021 10.120 9.921 10.107 9.883 10.023 10.095 9.828 9.957 9.953
Table 3.4 x4 x5 10.073 9.911 9.885 9.970 9.877 10.078 10.115 10.011 10.141 10.114 10.095 10.090 10.027 9.995 10.155 9.939 10.171 10.111 9.967 10.058 10.031 9.987 9.925 9.890 9.819 9.943 9.876 9.890 10.251 10.125 10.045 10.113 9.992 9.836 9.940 10.060 10.030 10.011 10.147 9.904 10.065 9.983 10.031 9.940 10.157 9.962 9.975 10.203 9.970 9.975 10.114 9.835 10.057 10.014 9.882 10.160 10.119 10.026 10.028 10.131 9.951 10.150 10.061 9.936 9.899 9.979 10.052 10.022 9.902 9.980 10.149 9.968 10.021 9.959 9.828 9.745 9.891 10.007 10.001 10.055 9.976 9.999 9.904 10.077 10.009 9.920 10.195 10.020 9.993 10.119 9.969 9.830 9.878 9.857 9.968 9.971 9.997 10.108 9.912 10.163 9.800 9.926 9.845 10.061 9.911 9.890 9.934 10.070 9.823 10.103 10.013 10.008 9.896 9.842 9.967 9.930 10.081 9.842 9.940 9.896 9.924 9.897 9.957 10.060 9.864 10.007 10.027 10.020 10.113 9.994 9.918 10.111 9.910 9.978 10.037 9.958 10.062 10.085 10.104 10.018
x ¯ 9.979 9.954 10.015 10.009 10.062 10.034 10.052 10.014 9.958 10.025 9.969 9.922 9.949 9.944 10.044 10.070 9.984 10.003 10.013 10.051 10.039 9.984 10.035 10.021 10.019 10.045 10.051 9.971 10.036 10.091 10.045 10.042 9.998 10.053 9.965 10.088 10.001 9.889 9.942 10.028 9.947 9.954 9.934 9.996 10.021 9.967 9.936 9.984 10.005 10.022 9.954 10.013 10.006 9.991 9.985 9.991 9.947 9.923 9.942 9.980 9.929 10.010 9.943 10.046 9.981 10.037 10.003 10.007 10.069 10.021
s 0.067 0.086 0.101 0.101 0.123 0.111 0.081 0.093 0.179 0.070 0.057 0.104 0.090 0.058 0.156 0.029 0.085 0.054 0.076 0.098 0.034 0.047 0.110 0.125 10.019 0.150 0.084 0.130 0.065 0.123 0.081 0.072 0.062 0.120 0.127 0.076 0.075 0.121 0.056 0.065 0.085 0.126 0.051 0.132 0.062 0.092 0.084 0.070 0.065 0.099 0.106 0.098 0.108 0.121 0.169 0.060 0.131 0.072 0.091 0.153 0.054 0.080 0.055 0.071 0.084 0.074 0.125 0.133 0.069 0.057
Δ -0.021 -0.025 0.061 -0.006 0.053 -0.028 0.018 -0.038 -0.055 0.066 -0.056 -0.047 0.027 -0.005 0.100 0.026 -0.085 0.019 0.010 0.038 -0.012 -0.055 0.051 -0.014 -0.002 0.026 0.006 -0.080 0.065 0.055 -0.047 -0.003 -0.044 0.055 -0.088 0.122 -0.087 -0.112 0.053 0.086 -0.081 0.007 -0.019 0.061 0.026 -0.054 -0.031 0.047 0.021 0.017 -0.068 0.059 -0.007 -0.015 -0.006 0.006 -0.043 -0.025 0.019 0.038 -0.051 0.082 -0.068 0.103 -0.065 0.056 -0.034 0.004 0.061 -0.048
97
a process with mean drift
Lot 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
x1 10.082 9.967 10.040 10.035 10.093 10.066 10.117 10.119 10.000 10.012 10.025 10.055 10.124 9.964 9.971 10.100 10.153 9.893 9.932 9.982
x2 9.855 9.987 10.003 10.075 10.005 9.952 10.107 10.030 10.123 10.010 10.087 10.171 10.121 10.185 10.081 10.137 10.104 9.968 10.168 10.133
Table x3 9.943 9.988 9.964 10.071 10.106 9.972 10.015 10.114 9.974 9.980 10.248 10.121 9.951 10.202 10.152 10.115 10.184 10.133 10.191 9.916
3.4(continued) x4 x5 10.026 10.112 10.080 10.264 9.919 10.029 10.175 10.014 10.025 10.004 10.166 9.907 9.891 9.994 10.152 10.037 9.869 10.027 10.052 10.121 9.943 9.939 10.113 10.162 10.096 10.090 10.195 10.071 10.035 10.131 10.063 10.041 10.080 10.047 9.875 10.160 10.116 9.977 10.155 9.862
x ¯ 10.004 10.057 9.991 10.074 10.047 10.013 10.025 10.090 9.999 10.035 10.048 10.124 10.076 10.123 10.074 10.091 10.114 10.005 10.077 10.010
s 0.105 0.123 0.050 0.062 0.049 0.103 0.092 0.054 0.092 0.055 0.127 0.046 0.072 0.104 0.074 0.039 0.055 0.133 0.116 0.130
Δ -0.017 0.053 -0.066 0.083 -0.027 -0.034 0.012 0.066 -0.092 0.036 0.013 0.076 -0.048 0.047 -0.049 0.017 0.022 -0.108 0.071 -0.067
In Figure 3.7, we note that even though the control limits have not been crossed, it is easy to observe that the data are tending to drift. The control limits are most effective in picking out data from a stationary contaminating distribution. However, we discussed in Chapter 1 the “run” chart developed early on by Ford workers where lot means were simply plotted without any attention to control limits. These charts captured most of the power of the control chart procedure. The reality is that the human visual system can frequently detect pathologies on a control chart even where the conditions are so highly nonstandard that the built-in alarm system given by the control limits is not particularly useful. In Figure 3.8, based on the further continuation of Table 3.4 for another 30 lots, we observe that the means do indeed cross the upper control limit around lot 100. However, the prudent production worker will note trouble in the system long before that.
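A short simulation makes the behavior of such a drifting process easy to explore on one's own run charts. The following sketch generates lot means under the random-walk model (3.72), using the parameters quoted in the text (μ1 = 10, σ = .1, τ = .01); everything else about the code is an illustrative assumption.

```python
# Sketch: simulating the mean-drift model (3.72) and the lot-to-lot differences of means.
import random

def simulate_drifting_lots(n_lots=120, lot_size=5, mu=10.0, sigma=0.1, tau=0.01, seed=0):
    rng = random.Random(seed)
    lot_means = []
    for _ in range(n_lots):
        lot = [rng.gauss(mu, sigma) for _ in range(lot_size)]
        lot_means.append(sum(lot) / lot_size)
        mu += rng.gauss(0.0, tau)           # mu_t = mu_{t-1} + eps_t, the Markov drift step
    return lot_means

means = simulate_drifting_lots()
deltas = [b - a for a, b in zip(means, means[1:])]   # lot-to-lot differences, as charted in Figure 3.9
print(round(means[0], 3), round(means[-1], 3), round(max(deltas), 3))
```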
Figure 3.7. Lot Mean Control Chart For Data With Mean Drift.
98 chapter 3. mean and standard deviation control charts Lot 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 114 115 116 117 118 119 120
x1 10.226 10.199 10.264 10.112 10.126 10.158 10.156 10.257 9.956 10.057 10.161 10.331 10.299 10.209 10.059 9.967 10.276 10.410 10.156 10.209 9.955 10.043 9.881 10.103 10.039 10.237 10.136 10.223 10.096
x2 10.210 9.951 10.186 10.026 10.011 10.155 10.253 10.067 10.203 10.158 10.307 10.109 10.040 10.306 10.171 10.144 10.018 9.947 10.084 10.230 10.052 9.967 10.383 10.118 10.164 9.953 10.265 9.901 10.041
Table x3 10.292 10.032 10.188 10.104 10.066 10.206 10.114 10.167 10.146 10.165 10.210 10.070 10.051 10.104 10.037 10.226 10.206 10.285 10.119 10.162 10.070 10.220 10.363 10.215 10.301 10.216 10.141 10.073 9.961
3.4(continued) x4 x5 10.350 10.339 10.239 10.076 10.264 10.032 10.111 10.217 10.147 10.131 10.101 10.260 10.176 10.150 10.030 10.177 10.044 10.124 10.220 10.144 10.066 10.083 10.228 10.060 10.093 10.074 10.153 10.296 9.954 10.033 10.207 10.114 10.275 10.133 10.020 10.125 10.150 10.263 10.072 10.068 10.195 10.188 10.100 10.171 10.243 10.136 10.125 10.146 10.278 10.080 10.105 10.253 10.070 10.384 10.236 10.096 10.132 10.110
x ¯ 10.284 10.100 10.187 10.114 10.096 10.176 10.170 10.140 10.095 10.149 10.165 10.160 10.111 10.214 10.051 10.132 10.182 10.157 10.154 10.148 10.092 10.069 10.202 10.141 10.172 10.153 10.199 10.106 10.068
s 0.064 0.119 0.095 0.068 0.057 0.060 0.052 0.091 0.096 0.059 0.098 0.117 0.107 0.088 0.078 0.103 0.109 0.190 0.067 0.076 0.101 0.117 0.205 0.044 0.116 0.126 0.125 0.136 0.068
Δ 0.274 -0.184 0.087 -0.073 -0.018 0.079 -0.006 -0.030 -0.045 0.054 0.016 -0.006 -0.048 0.102 -0.163 0.081 0.050 -0.025 -0.003 -0.006 -0.056 -0.034 0.133 -0.060 0.031 -0.020 0.047 -0.094 -0.038
Figure 3.8. Lot Mean Control Chart For Data With Mean Drift.

It is always a good idea, if possible, to come up with a plausible model which explains why the data appear to exhibit “nontypical” lots. In the most common of cases, that of contamination of the means, we could, for example, take the points below the median-based UCL in Figure 3.6 and from them compute a control chart. Then we could take the points
above the UCL and from these compute another control chart. If the resulting two charts showed all points to be "in control," then we might take as a working hypothesis the notion of a mixture of two different distributions as representing the process. In the case of the mean drift data, such a model does not appear to be very useful. But let us note that if we have
$$x_{t,j} = \mu_t + \eta_{t,j}, \qquad (3.73)$$
where j goes from 1 to 5 and η_{t,j} is N(0, σ²), and
$$\mu_{t+1} = \mu_t + \epsilon_{t+1}, \qquad (3.74)$$
where ε_t is N(0, τ²), then we note that the first difference of the (t+1)'th and t'th lot means is given by
$$\Delta = \bar{x}_{t+1} - \bar{x}_t = \epsilon_{t+1} + \frac{1}{5}\sum_{j=1}^{5}\eta_{t+1,j} - \frac{1}{5}\sum_{j=1}^{5}\eta_{t,j}. \qquad (3.75)$$
Thus, by the independence of all the variables making up Δ, we have
$$\sigma_{\Delta}^2 = \tau^2 + \frac{2}{5}\sigma^2, \qquad (3.76)$$
and, of course,
$$E(\Delta) = 0. \qquad (3.77)$$
For the 119 sample differences of means from lot to lot in Table 3.4, we find a mean of differences equal to .0006 and a variance of .0041. A sample estimate for σ is available by noting that the average of the sample standard deviations is .0921, giving us
$$\hat{\sigma} = a(5)\bar{s} = 1.0638(.0921) = .0978. \qquad (3.78)$$
Then we have an estimate of τ² via
$$\hat{\tau}^2 = .0041 - \frac{2}{5}(.0978)^2 = .00026. \qquad (3.79)$$
We compare the values of .0978 and .00026 with the actual values of .1000 and .0001, respectively. Naturally, the true values will be unknown to us in the real world. However, we can construct a control chart for the Δ_t. The "in control" interval is given by
$$.0006 - 3(.0640) = -.1914 \leq \Delta_t \leq .0006 + 3(.0640) = .1926. \qquad (3.80)$$
We plot the 119 sample mean differences in Figure 3.9.
[Figure 3.9 plot: lot mean differences (-0.2 to 0.3) against Lot (0 to 150).]
Figure 3.9. Lot Mean Difference Control Chart For Data With Mean Drift.

All the differences, except for that between lots 91 and 90, appear to be "in control." The difference "out of control" appears to be due to chance. The drift we discussed is, of course, of a more complex nature than those arising from wear and tear of, say, positioning devices in lathes, rolling machines, etc. Such drifts are almost deterministic and monotone. While it is a very good idea to deal with them using control charts, the process is rather routine and obvious, and we shall not dwell on it.
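As an illustration of the procedure just described, the following is a minimal sketch of the Δ chart computation. The lots are simulated from the drift model (3.73)-(3.74) with σ = .1 and τ = .01 as in the text; they stand in for Table 3.4, so the numerical output will not match the book's values.

```python
import numpy as np

# Sketch of the mean-drift model (3.73)-(3.74) and the Delta chart of (3.75)-(3.80).
rng = np.random.default_rng(0)
n_lots, lot_size, sigma, tau = 120, 5, 0.10, 0.01

mu = 10.0 + np.cumsum(rng.normal(0.0, tau, n_lots))          # mu_{t+1} = mu_t + eps_{t+1}
lots = mu[:, None] + rng.normal(0.0, sigma, (n_lots, lot_size))
xbar = lots.mean(axis=1)
s = lots.std(axis=1, ddof=1)

a5 = 1.0638                                                  # a(5), as used in the text
delta = np.diff(xbar)                                        # lot-to-lot differences of means
sigma_hat = a5 * s.mean()                                    # eq. (3.78)
tau2_hat = delta.var(ddof=1) - (2.0 / 5.0) * sigma_hat**2    # eq. (3.79)

center, sd = delta.mean(), delta.std(ddof=1)
lcl, ucl = center - 3 * sd, center + 3 * sd                  # eq. (3.80)
flagged = np.where((delta < lcl) | (delta > ucl))[0] + 2     # second lot of each flagged pair
print(sigma_hat, tau2_hat, lcl, ucl, flagged)
```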
3.6 A Process with Upward Drift in Variance
Another kind of anthropomorphized process is that in which the variability of output randomly increases as time progresses. Again, this is consistent with the performance of an increasingly fatigued juggler, trying to keep the balls moving smoothly through the air. It seldom makes much sense to think of a process which, without any intervention, exhibits declining variance. The Nashua case study mentioned in Chapter 1 appears to give a counterexample in which the process was “brought under control” by simply leaving it alone until the time delayed servomechanisms had done their job. And there are many other examples of variability diminishing by “breaking in,” by workers becoming clever in “tweaking” the process,
etc. Nevertheless, it is no doubt true that wear and tear can cause an increase in variability. Let us consider an example of such a situation. We are still attempting to produce bolts of 10 cm diameter. Here, we shall assume that the output from lot to lot follows a normal distribution with mean 10 and starting standard deviation 0.1. As we begin sampling from each lot, however, there is a probability of 10% that the standard deviation will increase by .01 cm. In Table 3.5, we demonstrate 90 simulated lots of five bolts each.

Lot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
x1 10.036 9.851 9.916 9.829 10.194 10.057 9.875 10.006 10.013 9.957 9.902 9.851 10.022 9.990 10.125 9.858 9.837 10.201 10.052 10.017 9.778 10.056 10.058 9.955 10.001 10.016 10.052 10.005 9.881 9.775 10.042 9.917 9.728 10.032 10.071 10.004 9.859
x2 9.982 10.259 10.019 9.751 9.921 9.972 10.468 9.853 10.015 10.009 9.925 10.248 9.976 10.084 10.161 10.079 10.305 9.925 9.850 9.932 10.144 10.008 9.834 9.946 10.027 10.091 9.959 9.977 10.159 10.003 10.098 10.011 9.886 10.102 9.949 10.102 10.286
Table 3.5 x3 x4 9.996 10.157 10.132 10.038 9.920 10.140 9.955 9.847 10.159 10.003 10.038 9.864 10.017 9.997 10.054 10.009 10.073 10.184 10.057 9.948 9.904 9.913 10.172 10.006 10.104 9.997 10.028 9.973 10.029 9.811 9.976 9.967 9.794 9.868 9.859 10.127 9.840 9.951 10.194 9.954 10.110 10.118 10.027 10.054 9.980 9.895 10.041 9.928 9.932 9.953 9.934 9.948 9.967 10.050 10.092 9.829 9.986 10.090 9.971 9.819 9.988 10.108 10.116 9.970 9.884 10.139 9.997 9.986 9.972 9.976 10.194 10.022 9.963 10.000
x5 10.013 9.930 10.064 10.048 9.886 9.908 9.834 9.913 10.048 9.942 9.991 10.065 10.051 10.184 9.710 10.204 9.948 9.857 9.906 10.143 10.117 10.107 10.082 9.998 9.936 9.946 10.023 10.094 10.093 9.968 9.918 10.078 9.955 9.991 10.223 10.052 9.930
x ¯ 10.037 10.042 10.012 9.886 10.032 9.968 10.038 9.967 10.067 9.983 9.927 10.068 10.030 10.052 9.967 10.017 9.950 9.994 9.920 10.048 10.053 10.050 9.970 9.974 9.970 9.987 10.010 9.999 10.042 9.907 10.031 10.018 9.918 10.022 10.038 10.075 10.007
s 0.070 0.162 0.096 0.116 0.138 0.083 0.252 0.081 0.070 0.049 0.037 0.153 0.050 0.085 0.198 0.130 0.206 0.160 0.087 0.116 0.154 0.038 0.106 0.046 0.042 0.067 0.045 0.109 0.109 0.103 0.079 0.080 0.149 0.049 0.113 0.076 0.164
Lot 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
x1 10.348 10.000 9.926 9.881 10.252 10.162 10.044 9.897 9.916 10.088 10.371 10.192 9.725 10.064 10.056 9.917 9.993 10.015 10.012 9.846 10.016 9.853 9.884 10.018 10.015 9.826 10.121 10.132 9.816 10.160 9.854 9.915 10.062 10.095 10.185 9.963 10.289 10.467 9.995 9.952 9.822 9.869 9.988 9.837 9.877 10.283 10.040 9.761 9.937 10.128 9.962 10.017 9.942
Table 3.5(continued) x2 x3 x4 x5 9.967 9.874 10.057 10.000 10.044 9.811 9.811 10.144 10.305 10.036 9.978 10.039 10.002 9.781 9.872 10.197 10.203 9.845 10.207 10.012 9.779 10.150 9.954 9.950 10.013 10.030 9.850 10.014 9.994 9.857 9.860 9.902 9.924 9.855 9.912 9.980 9.754 10.122 9.951 10.013 10.031 10.203 10.197 10.109 10.076 9.915 10.035 10.021 10.001 10.127 10.051 10.053 9.888 10.082 9.915 9.940 10.131 9.870 9.835 9.894 9.973 10.170 9.919 9.997 9.910 9.938 10.071 9.921 9.867 9.836 9.789 9.748 10.040 10.134 9.926 10.165 9.807 9.824 10.171 9.837 10.046 9.862 9.936 10.040 9.989 10.066 10.455 9.813 10.012 9.894 9.923 10.152 9.661 10.019 9.872 9.685 10.034 9.781 9.829 10.042 10.200 10.029 10.135 9.930 9.863 10.008 10.161 9.894 9.913 10.268 10.070 10.148 9.880 9.733 9.809 9.947 9.843 9.848 10.401 10.123 10.031 9.947 9.980 9.981 9.931 9.982 9.928 10.105 10.147 9.705 10.053 9.869 10.088 10.340 10.014 9.965 9.986 10.038 10.405 10.117 10.162 10.186 9.700 9.751 9.940 9.963 10.200 9.904 9.884 9.952 10.255 10.190 9.811 9.781 10.059 10.277 9.943 9.844 10.159 9.882 9.795 10.041 9.945 10.098 10.136 9.638 9.858 9.591 10.016 9.943 10.167 9.772 9.840 10.184 10.148 10.076 10.260 10.197 9.909 10.044 9.827 9.959 10.172 10.256 9.970 10.165 10.076 9.888 10.010 9.900 10.092 10.088 10.098 10.059 9.709 10.153 9.836 10.179 10.145 10.067 10.070 10.053 10.160 9.886 10.126 10.103 9.807 10.181 9.857 9.919 9.988 9.800
x ¯ 10.049 9.962 10.057 9.946 10.104 9.999 9.990 9.902 9.917 9.986 10.182 10.048 9.992 9.978 9.957 9.995 9.967 9.851 10.055 9.897 9.980 10.035 9.973 9.851 9.940 10.024 10.010 10.106 9.837 10.075 9.959 9.972 9.967 10.100 10.146 9.952 10.059 10.150 9.985 9.956 9.940 9.818 9.977 10.017 10.057 10.099 10.028 9.970 9.991 10.071 10.026 10.047 9.901
s 0.180 0.147 0.146 0.161 0.172 0.160 0.080 0.055 0.045 0.145 0.127 0.100 0.155 0.089 0.129 0.104 0.066 0.102 0.096 0.154 0.079 0.256 0.112 0.173 0.125 0.151 0.133 0.130 0.081 0.235 0.066 0.079 0.178 0.144 0.163 0.225 0.173 0.236 0.202 0.122 0.132 0.217 0.142 0.168 0.170 0.198 0.105 0.140 0.177 0.137 0.105 0.147 0.074
The values of x̄̄ and s̄ are 9.999 and .125, respectively. This gives an x̄ control interval of
$$9.999 - \frac{3(.125)a(5)}{\sqrt{5}} = 9.821 \leq \bar{x} \leq 9.999 + \frac{3(.125)a(5)}{\sqrt{5}} = 10.178. \qquad (3.81)$$
We plot the sample control chart in Figure 3.10. Although two of the points are slightly outside the control interval for the sample means of the process, such an alarm is unlikely to lead us to discover the upward drift in the variance of the production process. The standard control interval for s is given by
$$(1 - 3[a(5)^2 - 1]^{.5})\bar{s} = -.011 \leq s \leq (1 + 3[a(5)^2 - 1]^{.5})\bar{s} = .261. \qquad (3.82)$$
We show the resulting s control chart in Figure 3.11.
[Figure 3.10 plot: Lot Sample Mean (9.8 to 10.2) against Lot (0 to 100).]
Figure 3.10. Lot Mean Control Chart For Data With Variance Drift.
[Figure 3.11 plot: Lot Sample Standard Deviation (0.0 to 0.3) against Lot (0 to 100).]
Figure 3.11. Lot Standard Deviation Control Chart For Data With Variance Drift.

We note that none of the s values is "out of control." Nevertheless, the trend upward in Figure 3.11 is rather clear. Again, we are essentially left with a "run chart." An experienced control operator would note the trend upward and attempt to find the source of the increasing production variability.
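A rough simulation of this kind of process, together with the limits of (3.81)-(3.82), is sketched below. The 10% chance of a .01 increase in σ follows the description above; the simulated numbers will only have the flavor of Table 3.5, not reproduce it.

```python
import numpy as np

# Simulate lots whose standard deviation occasionally creeps upward, then compute
# the xbar and s control limits of (3.81)-(3.82). a(5) = 1.0638 as in the text.
rng = np.random.default_rng(1)
a5, lot_size = 1.0638, 5

sigma, lots = 0.10, []
for _ in range(90):
    if rng.random() < 0.10:          # before each lot, sigma may grow by .01 cm
        sigma += 0.01
    lots.append(rng.normal(10.0, sigma, lot_size))
lots = np.array(lots)

xbar, s = lots.mean(axis=1), lots.std(axis=1, ddof=1)
xbarbar, sbar = xbar.mean(), s.mean()

half_width = 3 * sbar * a5 / np.sqrt(lot_size)
xbar_limits = (xbarbar - half_width, xbarbar + half_width)    # eq. (3.81)
c = 3 * np.sqrt(a5**2 - 1)
s_limits = ((1 - c) * sbar, (1 + c) * sbar)                   # eq. (3.82); a negative LCL acts as 0
print(xbar_limits, s_limits)
```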
3.7 Charts for Individual Measurements
It is obvious that, whenever possible, measurements taken at some nodal point of a production process should be grouped into statistically homogeneous lots. The greater the size of such a lot, the more accurate are the estimates, x̄ and s², of the population mean and variance. In turn, the greater the accuracy of our estimates, the greater the chance of detecting a Pareto glitch. It is the limitations in time and cost, as well as the requirement of homogeneity, that make us content with a lot size of only 4 or 5. It may happen, however, that no grouping of the measurements into lots of size greater than one is possible. Typically, the reason is that either the production rate is too slow or the production is performed under precisely the same conditions over short time intervals. In the first case, items appear at a nodal point too rarely to let us assume safely that two or more of them are governed by the same probability
distribution. Grouping measurements into lots of size greater than one is then likely to lead to hiding, or “smoothing out,” atypical measurements among typical ones. As an example of the second of the two reasons mentioned, let us consider temperature of a chemical reactor. Clearly, the temperature is constant over small time intervals and, therefore, it does not make sense to take more than one measurement of any physical quantity, which is a function of temperature, in one such interval. In order to see how to construct control charts for individual measurements, let us consider 90 observations coming from N (10, .01) with probability .855, from N (10.4, .03) with probability .095, from N (9.8, .09) with probability .045 and from N (10.2, .11) with probability .005. The given probabilities have been obtained as in (3.9) to (3.12), using p1 = .1 and p2 = .05. Actually, the 90 observations are the data x1 of Table 3.2. They are repeated as observations x in Table 3.6.
Lot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
x 9.927 9.862 10.061 9.820 9.737 9.876 9.898 10.001 9.928 9.896 10.011 9.983 10.127 10.025 9.953 10.007 10.062 10.168 9.986 9.786 9.957 9.965 9.989 9.983 10.063
Table 3.6 MR Lot 46 0.065 47 0.199 48 0.241 49 0.083 50 0.139 51 0.022 52 0.103 53 0.073 54 0.032 55 0.115 56 0.028 57 0.144 58 0.102 59 0.072 60 0.054 61 0.055 62 0.106 63 0.182 64 0.200 65 0.171 66 0.008 67 0.024 68 0.006 69 0.080 70
x 10.006 10.132 10.012 10.097 10.007 9.967 9.981 9.841 9.992 9.908 10.011 10.064 9.891 9.869 10.016 10.008 10.100 9.904 9.979 9.982 10.028 9.995 9.936 10.014 10.005
MR 0.052 0.126 0.120 0.085 0.090 0.040 0.014 0.140 0.151 0.084 0.103 0.053 0.173 0.022 0.147 0.008 0.092 0.196 0.075 0.003 0.046 0.033 0.059 0.078 0.009
Lot 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
Table 3.6 (continued) x MR Lot x 9.767 0.296 71 10.116 9.933 0.166 72 9.934 10.227 0.294 73 9.972 10.022 0.205 74 10.014 9.845 0.177 75 10.093 9.956 0.111 76 9.927 9.876 0.080 77 10.177 9.932 0.056 78 9.825 10.016 0.084 79 10.333 9.927 0.089 80 9.972 9.952 0.025 81 10.059 9.941 0.011 82 9.832 10.010 0.069 83 9.958 9.848 0.162 84 10.087 10.002 0.154 85 10.232 10.031 0.029 86 10.066 9.990 0.041 87 10.041 9.995 0.005 88 9.868 9.980 0.015 89 10.084 10.058 0.078 90 10.063
MR 0.111 0.182 0.038 0.042 0.079 0.166 0.250 0.352 0.508 0.361 0.087 0.227 0.126 0.129 0.145 0.166 0.025 0.173 0.216 0.021
An obvious counterpart of the mean, or X̄, control chart, suited to the case of lots of size one, can be obtained in the following way. First, the run chart for X̄'s has to be replaced by that for the X's themselves. In fact, X = X̄ if the sample size is 1. Second, since X = X̄ and there is no variability within lots of size 1, we can calculate the sample mean and variance, X̄ and S², for all 90 observations and use as the upper control limit and lower control limit, respectively,
$$\bar{x} + 3a(N)s \qquad (3.83)$$
and
$$\bar{x} - 3a(N)s, \qquad (3.84)$$
where a(N) is given by (3.36) with n replaced by N and N = 90. In the example considered,
$$UCL = 9.986 + 3(.101) = 10.289 \qquad (3.85)$$
and
$$LCL = 9.986 - 3(.101) = 9.683, \qquad (3.86)$$
since a(90) = 1. The X chart obtained is given in Figure 3.12.
The chart recognizes the Pareto glitch on lot 79, barely fails to detect the atypicality of lot 28 and apparently fails to detect the atypicality of lot 40. This rather poor result, in particular when compared with the results for lots of size 5 from the same distributions (see Figure 3.2), should not surprise us. Now, the distance between the UCL and LCL is, approximately, equal to 6 "norm" standard deviations, while that for lots of size 5 was reduced by the factor √5. Moreover, in Figure 3.2, the "observation" on lot, say, 40 was in fact the mean of 5 data from N(9.8, .09), while in Figure 3.12 it is only the first of those 5 data. So, given the circumstances, we should be happy that the X chart detected the Pareto glitch on lot 79 and hinted at the possibility of another glitch on lot 28.
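A minimal sketch of the individuals-chart computation of (3.83)-(3.86) follows. The array x is placeholder data standing in for the 90 observations of Table 3.6, and a(90) is taken as 1, as in the text.

```python
import numpy as np

# X chart for individual measurements: limits xbar +/- 3 a(N) s, eqs. (3.83)-(3.86).
rng = np.random.default_rng(2)
x = rng.normal(10.0, 0.1, 90)          # placeholder observations, not the table's values

a_N = 1.0                              # a(90) is essentially 1
xbar, s = x.mean(), x.std(ddof=1)
ucl = xbar + 3 * a_N * s               # eq. (3.83)/(3.85)
lcl = xbar - 3 * a_N * s               # eq. (3.84)/(3.86)
flagged = np.where((x > ucl) | (x < lcl))[0] + 1   # 1-based lot numbers out of control
print(lcl, ucl, flagged)
```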
Figure 3.12. X Control Chart.

Lot 1 2 3 4 5 6 7 8 9 10
x 9.987 9.629 10.122 9.861 9.816 9.925 10.103 10.057 9.956 9.591
Table 3.7 MR Lot 46 0.358 47 0.493 48 0.261 49 0.045 50 0.109 51 0.178 52 0.046 53 0.101 54 0.365 55
x 9.976 9.927 9.934 9.956 10.001 10.097 10.179 9.941 10.089 10.006
MR 0.007 0.049 0.007 0.022 0.045 0.096 0.082 0.238 0.148 0.083
Lot 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
Table 3.7 (continued) x MR Lot x 10.185 0.594 56 9.976 10.051 0.134 57 10.003 9.966 0.085 58 9.875 9.987 0.021 59 9.978 9.926 0.061 60 10.057 9.734 0.192 61 9.853 9.980 0.246 62 9.893 10.012 0.032 63 10.105 10.263 0.251 64 9.963 9.954 0.309 65 9.593 9.762 0.192 66 9.580 10.008 0.246 67 9.975 10.091 0.083 68 10.140 9.795 0.296 69 10.181 9.968 0.173 70 9.414 9.948 0.020 71 9.592 9.864 0.084 72 9.952 10.162 0.298 73 9.839 9.941 0.221 74 9.964 9.791 0.150 75 9.770 10.052 0.261 76 9.503 10.058 0.006 77 9.676 10.144 0.086 78 10.015 10.080 0.064 79 9.799 9.912 0.168 80 10.147 9.973 0.061 81 10.104 10.003 0.030 82 10.082 10.096 0.093 83 10.174 10.023 0.073 84 9.922 10.188 0.165 85 10.208 10.005 0.183 86 9.868 10.018 0.013 87 10.029 9.961 0.057 88 10.053 9.914 0.047 89 9.585 9.969 0.055 90 9.956
MR 0.030 0.027 0.128 0.103 0.079 0.204 0.040 0.212 0.142 0.370 0.013 0.395 0.165 0.041 0.767 0.178 0.360 0.113 0.125 0.194 0.267 0.173 0.339 0.216 0.348 0.043 0.022 0.092 0.252 0.286 0.340 0.161 0.024 0.468 0.371
Let us now construct the X chart for the observations x1 of Table 3.3. This time, we have a strong, 30% contamination of the norm N(10, .01) distribution by the N(9.8, .09) distribution. The data are repeated in Table 3.7. We have that
$$UCL = 9.953 + 3(.168) = 10.457 \qquad (3.87)$$
and
$$LCL = 9.953 - 3(.168) = 9.449. \qquad (3.88)$$
The X chart is depicted in Figure 3.13. We note not only that the lower control limit is crossed just once, but also that the two clusters of the data are not as transparent as they are in Figure 3.4. The reason is similar to that which prevents us from constructing lots of statistically heterogeneous data. In fact, both x̄ and, worse, s are now calculated from such a lot of size 90. Since as many as 29 observations come from the contaminating distribution, the value of the norm standard deviation is grossly overestimated. Indeed, s = .168 while σ₀ = .1. Fortunately, in most SPC problems, we are faced with just a few Pareto glitches among many, many more uncontaminated lots (of whatever size).
Figure 3.13. X Control Chart With 30% Contamination.

Until now, we have been using rather automatically the "3σ" control limits. We already know that such limits are well justified as long as lots come from a normal distribution. Let us recall that we have then that
$$P\left(\left|\frac{X - \mu}{\sigma}\right| \geq 3\right) = .0027, \qquad (3.89)$$
where X denotes a normally distributed measurement with mean μ and variance σ², and, hence, we have a chance of one in 1/.0027 ≈ 370 of a false alarm. Moreover, by the Central Limit Theorem, even if individual measurements come from another distribution but a lot is of a "reasonable"
size, the lot's mean is still approximately normally distributed and an equality like (3.89) approximately holds for X̄. This last property allowed us to rely on "normal theory" in justifying the 3σ limits for all X̄ charts, regardless of the form of the parent distribution of a production process. But this is no longer the case when constructing the X chart for lots of size one. The probabilistic argument preserves its validity only if individual measurements are governed by a normal distribution. Otherwise, although the 3σ limits can still be used, we cannot claim that the probability of getting a false alarm has a known value (equal to .0027). It is another matter that the issue of normality should not bother us unless the data come from a distribution which is far from normal, for example a heavily skewed distribution.

In practice, we never know whether measurements come from a normal or approximately normal distribution. Accordingly, before the control chart is drawn, a test of normality of the data should be performed. For our purposes, it suffices to construct a normal probability plot, which is a very simple and yet efficient device provided by any statistical software. One form of the normal probability plot is the following (see Chambers, Cleveland, Kleiner and Tukey [2]). Given random measurements X1, X2, . . . , XN, we sort them from smallest to largest to obtain the order statistics X(1) ≤ X(2) ≤ . . . ≤ X(r) ≤ . . . ≤ X(N). After standardization, the expected value of the rth order statistic can be estimated by the quantile z(r−.5)/N of order (r − .5)/N of the standard normal random variate Z (arguing similarly as in Subsections B.13.6 and B.13.7 of Appendix B, we may note that the constant .5, although quite natural, is somewhat arbitrary). Now, plotting the X(r)'s on the horizontal axis against the corresponding quantiles z(r−.5)/N on the vertical axis, we should obtain, approximately, a straight line if the measurements follow a normal distribution.

The normal probability plot for the measurements of Table 3.6 is given in Figure 3.14, while that for the measurements of Table 3.7 is given in Figure 3.15. Quite rightly, the normality assumption proves justified for the data of Table 3.6 and unwarranted for the data of Table 3.7. In the latter case, we have that a measurement comes from N(10, .01) with probability .7 and from N(9.8, .09) with probability .3. Considered unconditionally, each measurement is, in fact, the following sum of random variables
$$Y = X_0 + I_2 X_2,$$
where X0 comes from N (10, .01), X2 comes from N (−.2, .08), and I2 assumes value 1 with probability .3 and value 0 with probability .7, and all three random variables are mutually independent. Of course, Y is not a normal random variable. The data of Table 3.6 come predominantly from N (10, .01). Strictly speaking, each datum is a random variable of the form X0 + I1 X1 + I2 X2 , where I1 assumes value 1 with probability .1 and value 0 with probability .9, I2 assumes value 1 with probability .05 and value 0 with probability .95, X1 is N (.4, .02) and X0 and X2 are as before. For some purposes, the influence of X1 and X2 can be considered to be “almost” negligible, and the data to be approximately normal.
Figure 3.14. Normal Probability Plot.
Figure 3.15. Normal Probability Plot for Highly Contaminated Data.
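The plotting recipe just described is easy to script. The sketch below uses placeholder data and the (r − .5)/N plotting positions; with real measurements one would look for an approximately straight line.

```python
import numpy as np
from statistics import NormalDist

# Normal probability plot coordinates: sorted data against standard normal
# quantiles of order (r - .5)/N, as described in the text.
rng = np.random.default_rng(3)
x = np.sort(rng.normal(10.0, 0.1, 90))        # placeholder measurements

N = len(x)
probs = (np.arange(1, N + 1) - 0.5) / N
z = np.array([NormalDist().inv_cdf(p) for p in probs])

# A straight-line fit of z on the sorted data; strong curvature of the points
# about this line hints that the normality assumption is questionable.
slope, intercept = np.polyfit(x, z, 1)
print(slope, intercept)
```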
As regards the data set of Table 3.7, the reader experienced in analyzing normal probability plots will find that Figure 3.15 hints at possible skewness of the underlying distribution. The reason may be a strong contamination of a norm process, as Figure 3.13 with its two rather apparent clusters confirms. Let us now examine the set of measurements x1 of Table 3.4, tabulated again in Table 3.8.

Lot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
x 9.927 9.856 10.031 9.850 9.858 10.127 10.131 9.924 9.869 10.132 9.877 10.089 10.074 9.977 10.040 10.049 10.040 9.964 10.058 10.133 10.063 10.040 9.889 10.092 9.948 10.119 10.012 9.943 10.072 10.062 9.988 10.137 10.020 9.870 10.107 10.118 9.928 9.934 9.893 10.120 9.996 10.071 9.949 9.930 9.983
Table 3.8 MR Lot 46 0.071 47 0.175 48 0.181 49 0.008 50 0.269 51 0.004 52 0.207 53 0.055 54 0.263 55 0.255 56 0.212 57 0.015 58 0.097 59 0.063 60 0.009 61 0.009 62 0.076 63 0.094 64 0.075 65 0.070 66 0.023 67 0.151 68 0.203 69 0.144 70 0.171 71 0.107 72 0.069 73 0.129 74 0.010 75 0.074 76 0.149 77 0.117 78 0.150 79 0.237 80 0.011 81 0.190 82 0.006 83 0.041 84 0.227 85 0.124 86 0.075 87 0.122 88 0.019 89 0.053 90
x 10.086 9.896 9.997 9.987 9.958 10.075 10.099 10.097 10.117 9.779 10.015 9.849 10.010 9.897 9.857 9.887 9.920 9.978 10.126 9.951 10.083 9.869 10.196 10.142 9.988 10.082 9.967 10.040 10.035 10.093 10.066 10.117 10.119 10.000 10.012 10.025 10.055 10.124 9.964 9.971 10.100 10.153 9.893 9.932 9.982
MR 0.103 0.190 0.101 0.010 0.029 0.117 0.024 0.002 0.020 0.338 0.236 0.166 0.161 0.113 0.040 0.030 0.033 0.058 0.148 0.175 0.132 0.214 0.327 0.054 0.154 0.094 0.115 0.073 0.005 0.058 0.027 0.051 0.002 0.119 0.012 0.013 0.030 0.069 0.160 0.007 0.129 0.053 0.260 0.039 0.050
The mean of this data set is 10.01 and the standard deviation is .093; hence, the control limits are 10.289 and 9.731, respectively. The X chart is plotted in Figure 3.16. Not only are the data in control, but the chart fails to detect any drift, at least at first glance. The drift present in the data is too small, relative to the population's standard deviation, to be discovered. The normal probability plot, given in Figure 3.17, reveals that the data can hardly be considered to be normally distributed.
Figure 3.16. X Control Chart For Data With Mean Drift.
Figure 3.17. Normal Probability Plot For Data With Mean Drift.

As with the normal probability plot, statistical software packages are equipped with similarly constructed probability plots for other distributions. We
checked the fit of the data of Table 3.8 to uniform, gamma and Weibull distributions with a variety of possible parameters, and we found that no such fit exists. In fact, it follows from the normal probability plot in Figure 3.17 that the underlying distribution is likely to be platykurtic. A conjecture that the data set comes from a contaminated norm process is one possibility. Another, given a weak pattern one can perhaps find in the data upon closer inspection of Figure 3.16, is that of some type of autocorrelation. As we know, a relatively small drift in the process mean is indeed present.

In order to get better insight into the statistical properties of a data set, it is tempting to construct a chart that would give us some information on the process's variability. To achieve this goal, we can form artificial lots of size 2 or 3 and calculate the so-called moving ranges. Given N individual measurements, X1, X2, . . . , XN, we calculate the ith moving range MRi as the difference between the largest and the smallest value in the ith artificial lot formed from the n measurements Xi, . . . , Xi+n−1, where i = 1, 2, . . . , N − n + 1 and n is the lot's size, equal to 2 or 3. Thus, if n = 2, MR1 = |X2 − X1|, MR2 = |X3 − X2|, . . . , MRN−1 = |XN − XN−1|. Clearly, proceeding in this way, we violate the requirement that measurements be considered separately and, hence, any analysis of the moving ranges should be undertaken with extreme caution. Hence, also, n should not be greater than 3. It should be noted that the moving ranges are not statistically independent of one another, since the artificial lots overlap.

Once the moving ranges have been formed, we can plot their run chart and, moreover, we can add control limits in the following way. Just as we calculated the mean of the range (see (B.208) in Appendix B), we can use the same trick to find that E(R²) = D₂σ², where σ² is the population variance, D₂ is a suitable constant depending on the lot size n, and R denotes the range of a lot. Thus,
$$\sigma_R = \sqrt{\mathrm{Var}(R)} = \left(E(R^2) - [E(R)]^2\right)^{1/2} = d\sigma, \qquad (3.90)$$
where d = (D₂ − 1/b²)^{1/2} and b is defined by (B.205) and tabulated in Table 3.1 as bn. Factors d ≡ dn and again bn are given in Table 3.9 (for reasons which will be given later, the two factors are given also for
lots of sizes greater than 3). Using (3.39), we can estimate the standard deviation of the range as follows:
$$\hat{\sigma}_R = b_n d_n \bar{R}. \qquad (3.91)$$
Applying (3.91) to the average of the N − n + 1 moving ranges,
$$\overline{MR} = \frac{1}{N-n+1}\sum_{i=1}^{N-n+1} MR_i, \qquad (3.92)$$
we obtain the upper and lower control limits for the control chart of moving ranges:
$$UCL = (1 + 3b_n d_n)\overline{MR} = D_4(n)\overline{MR} \qquad (3.93)$$
and
$$LCL = \max\{0, 1 - 3b_n d_n\}\overline{MR} = D_3(n)\overline{MR}. \qquad (3.94)$$
Factors D₃(n) and D₄(n) are tabulated in Table 3.9.
Table 3.9
Lot size   bn      dn      D3(n)   D4(n)
2          .8865   .8524   0       3.267
3          .5907   .8888   0       2.575
4          .4857   .8799   0       2.282
5          .4300   .8645   0       2.115
6          .3946   .8480   0       2.004
7          .3698   .8328   .076    1.924
8          .3512   .8199   .136    1.864
9          .3367   .8078   .184    1.816
10         .3249   .7972   .223    1.777
15         .2880   .7546   .348    1.652
20         .2677   .7296   .414    1.586
Remark: Let us note in passing that the above argument enables one to construct the control chart for usual ranges Rj , when lots of size greater than one can be used. Given the run chart for the ranges, it suffices to replace MR in (3.91) and (3.92) by the average of the N ranges, (3.38). Thus, if one prefers ranges to standard deviations, one can use the control chart for the ranges instead of the standard deviation control chart.
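A minimal sketch of the moving-range chart just described, for artificial lots of size n = 2, is given below. The constants D3(2) = 0 and D4(2) = 3.267 are taken from Table 3.9; the array x is placeholder data, not one of the book's tables.

```python
import numpy as np

# Moving-range chart of (3.92)-(3.94) with artificial lots of size 2.
rng = np.random.default_rng(4)
x = rng.normal(10.0, 0.1, 90)           # placeholder individual measurements

mr = np.abs(np.diff(x))                 # MR_i = |X_{i+1} - X_i|, i = 1, ..., N-1
mr_bar = mr.mean()                      # eq. (3.92)
ucl = 3.267 * mr_bar                    # eq. (3.93), D4(2) * average moving range
lcl = 0.0                               # eq. (3.94), D3(2) = 0
print(mr_bar, lcl, ucl, np.where(mr > ucl)[0] + 2)
```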
Figure 3.18. MR Control Chart.
Figure 3.19. MR Control Chart For Data With 30% Contamination.

The upper control limits for the data sets of Tables 3.6 and 3.7 are equal to (.111)(3.267) = .359 and (.167)(3.267) = .546, respectively. The UCL for the data set of Table 3.8 is (.103)(3.267) = .3365. Corresponding MR charts are given in Figures 3.18, 3.19 and 3.20.
Figure 3.20. MR Control Chart For Data With Mean Drift.

For the data of Table 3.6, we note that the UCL is crossed on lots 79 and 80. Along with the X chart, this confirms the existence of a Pareto glitch on lot 79. Note that the detection of this glitch on the MR chart is due to the sudden change in the measurements' mean rather than to a change in the standard deviation. Similarly, the two glitches on the MR chart for the data of Table 3.7 suggest some sort of contamination in the data. Also, the fact that the MR control chart for the data of Table 3.8 reveals a Pareto glitch hints, rather deceptively, at the possibility of some contamination in the data.

Let us turn again to the X charts. We noted that the population standard deviation was considerably overestimated in the case of the strongly contaminated data of Table 3.7. We can try to estimate σ using the average of the moving ranges and thus construct control limits of the following form:
$$\bar{x} \pm 3 b_2 \overline{MR}. \qquad (3.95)$$
With the average moving range equal to .167 and, hence, b₂ times that average equal to .148 as opposed to s = .168, we obtain new control limits, equal to 10.397 and 9.509, respectively. The lower control limit is now crossed twice, on lots 70 and 76. Note that it is the lower control limit that is likely to be crossed by measurements governed by the contaminating distribution. The lower
(and upper) control limit may still be slightly increased by switching from the sample mean x̄ to the sample median of all 90 measurements, since the latter is greater than the former by .021. Unfortunately, this does not lead to detecting any more Pareto glitches. Interestingly, for the data with mean drift (see Table 3.8 and Figure 3.16), the control limits for the X chart provided by the average of the moving ranges are slightly narrower than those based on s. The former are equal to 10.284 and 9.736, while the latter are equal to 10.289 and 9.731. Both types of limits are very close to one another for the data set of Table 3.6, with those based on the average moving range again being minimally narrower. For both sets of data, the sample means are practically equal to the sample medians. These examples show that it is reasonable to base the SPC analysis on all the available control charts as well as on other possible means, such as probability plots. Depending on the case, any chart may prove more informative than another. And it is crucial that charting be accompanied by a thorough visual analysis of the run charts, regardless of what control limits are used.
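In the spirit of the comparison above, the following sketch contrasts X-chart limits based on s with those based on the average moving range, eq. (3.95). The data and the resulting numbers are placeholders only.

```python
import numpy as np

# Compare s-based X-chart limits with the moving-range-based limits of (3.95).
rng = np.random.default_rng(7)
x = rng.normal(10.0, 0.1, 90)

b2 = 0.8865                                    # b_n for n = 2 (Table 3.9)
mr_bar = np.abs(np.diff(x)).mean()

limits_s = (x.mean() - 3 * x.std(ddof=1), x.mean() + 3 * x.std(ddof=1))
limits_mr = (x.mean() - 3 * b2 * mr_bar, x.mean() + 3 * b2 * mr_bar)   # eq. (3.95)
print(limits_s, limits_mr)
```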
3.8 Process Capability
In the context of statistical process control, a production or any other process is examined as it is, whatever its technological specifications. One can reasonably claim that, from the methodological point of view, the technological specifications have nothing to do with SPC narrowly understood. What we listen to is the "voice of the process." However, in due time, the "voice of the customer" can hardly be left unheard. It is the purpose of this section to align the two voices. There are many ways of doing this, and we shall describe just one of them. We shall refer to just some of a multitude of capability indices. The terminology and approach we shall use seem to be prevalent now, although neither can yet be claimed to be well established. In any case, they are consistent, e.g., with those used by "The Big Three" (Chrysler Corporation, Ford Motor Company and General Motors Corporation; see their Statistical Process Control (SPC) Reference Manual [5]). It is our basic prerequisite that, when studying process capability, we deal only with processes in a state of statistical control. Prior to the calculation of any capability indices, a process has to be brought to
stability. Indeed, an out of control process is unpredictable and hence any scrutiny of its capability is useless.

In the "best of all possible worlds," the voice of a process is heard via the process's overall mean x̄ and its inherent variation, measured by an estimate of the population standard deviation, such as σ̂ given by (3.36) or (3.39). One should note here that the estimates of inherent variation are based on within-lot or, as is said more often, within-group variation. On the other hand, the voice of the customer is heard via the nominal value and tolerance (or specification) limits. The two voices are fundamentally different. It is summary statistics for lots that are examined for the purpose of controlling a process, while individual measurements are compared to specifications. Still, the aforementioned statistics needed for charting sample means enable one to readily assess process capability in relation to the technological specifications imposed.

One of the standard measures of process capability is the so-called Cp index,
$$C_p = \frac{USL - LSL}{6\hat{\sigma}}, \qquad (3.96)$$
where USL denotes the upper specification limit, LSL denotes the lower specification limit and σ̂ is an estimate of the process's inherent variation. The index has a very clear interpretation provided the process is not only in statistical control but also the individual data come from an approximately normal distribution. The estimated 6σ range of the process's inherent variation, which appears in the denominator of (3.96) and which itself is often termed the process capability, is the range where (in principle) 99.73% of the underlying probability mass lies. It is easy to see that the 6σ range is closely related to the natural "in control" interval for individual data. The Cp index relates the process's inherent variation to the tolerance width, USL − LSL. For obvious reasons, Cp should be equal to at least 1 for the process capability to be considered acceptable.

If the individual data of an in control process do not follow a normal distribution, the following modification of the Cp index is sometimes postulated:
$$C_p = \frac{USL - LSL}{P_{99.865} - P_{0.135}}, \qquad (3.97)$$
where P0.135 and P99.865 are the 0.135 and 99.865 percentiles of the underlying distribution, respectively. Of course, the two indices coincide for a normal distribution. In a non-normal case, it is then suggested to find
the underlying distribution and estimate the percentiles needed. However, we agree with Wheeler and Chambers [7] that (3.96) can be used to advantage in most practical situations, even when the process distribution is heavily skewed. Although (3.96) is then not equivalent to (3.97), usually, as Wheeler and Chambers amply show, the great bulk of the probability mass of the underlying distribution still remains within the 6σ range. To put it otherwise, as a rule, whether the process distribution is normal or not, if Cp is greater than 1, the process can be considered capable. A safer approach has been proposed by Wetherill and Brown [6]. Namely, if we realize that the 6σ range represents the actual process range (minus, approximately, 0.3% of it) under normality, then, given data from whatever distribution, we can simply calculate their actual range and cut off 0.15% of it at each end.

Process capability, as measured by Cp alone, refers only to the relationship between the process's inherent variation and the tolerance width as specified by the user, irrespective of the process centering. The capability index which also accounts for process centering is the so-called Cpk index,
$$C_{pk} = \min\{CPU, CPL\}, \qquad (3.98)$$
where
$$CPU = \frac{USL - \bar{x}}{3\hat{\sigma}} \quad \text{and} \quad CPL = \frac{\bar{x} - LSL}{3\hat{\sigma}}.$$
CPU and CPL are referred to, respectively, as the upper and lower capability indices. The Cpk index relates process capability to the process mean via its estimate x̄. Clearly, Cpk should also be equal to at least 1 for the process capability to be considered acceptable. For reasons already given for Cp, while it is good to have data distributed normally, the Cpk index can either be used unchanged in, or readily adapted to, cases with non-normal data.

Let us consider the summary statistics for lots 94 to 125 of size 5 of diameter 3 from Figure 1.22 (the problem is described in the Remark above the Figure), given in Table 3.10.
Table 3.10
Lot    x̄      s      Lot    x̄      s
94     6.736  .031   110    6.732  .025
95     6.742  .027   111    6.742  .026
96     6.742  .024   112    6.746  .024
97     6.726  .018   113    6.740  .012
98     6.732  .022   114    6.722  .016
99     6.742  .017   115    6.732  .017
100    6.742  .024   116    6.734  .015
101    6.728  .016   117    6.732  .021
102    6.730  .015   118    6.742  .017
103    6.732  .009   119    6.726  .020
104    6.744  .012   120    6.731  .027
105    6.730  .019   121    6.738  .009
106    6.728  .028   122    6.742  .022
107    6.728  .014   123    6.742  .015
108    6.740  .025   124    6.740  .004
109    6.730  .018   125    6.746  .011
We leave it to the reader to verify that the system is in control (cf. Problem 3.10). The diameter was specified as 6.75 mm with tolerances ±0.1 mm. Assuming approximate normality, let us calculate Cp and Cpk. Clearly, USL − LSL = .2. It follows from the table that s̄ = 0.0187 and x̄ = 6.7356. Thus
$$C_p = \frac{0.2}{6(1.0638)(0.0187)} = 1.67,$$
$$CPU = \frac{6.85 - 6.7356}{0.06} = 1.91, \qquad CPL = \frac{6.7356 - 6.65}{0.06} = 1.43,$$
and Cpk = 1.43. The Cp value shows high capability with respect to variation, while the Cpk value can be considered satisfactory. (It is now rather common to treat Cpk values greater than 1.33 as satisfactory and Cp values greater than 1.6 as showing high capability.) It is another matter that we should not attach too much meaning to the numbers obtained, as they are in fact random. Obviously, centering the process at the nominal value of 6.75 and leaving the inherent variation unchanged would lead to an increase of Cpk (cf. Problem 3.12).

As given by (3.96) and (3.98), both Cp and Cpk apply to cases with two-sided specification limits. The Cp index is meaningless when a one-sided specification limit is given. If only an upper specification limit is given, Cpk can be defined as equal to CPU. Analogously, for a one-sided, lower specification limit, one can take Cpk equal to CPL. Neither of the two indices should be used alone. It is clear that a process with Cp well above 1, say, 1.6, may be producing non-conforming
products from time to time due to being poorly centered. At the same time, high values of Cp show that we can reduce or eliminate the nonconforming product by merely centering the process, and thus increasing Cpk. Comparing, however, the costs of decreasing the process's variation and of better centering the process, it may turn out that both actions should be undertaken to increase Cpk sufficiently at a minimal cost. Generally speaking, knowing the values of both indices may help prioritize the order in which a process should be improved.

We know well that the ultimate aim of SPC is continual improvement. Capability indices help measure that improvement. Repeatedly applying the PDSA cycle to a process results in changes to the process. After the process has been changed, it has to be brought to a state of statistical control and, once that has been achieved, the current process's capability indices are measured and compared with those which resulted from the previous turn of the PDSA wheel.

A process with mean drift, considered in Section 3.5, is one example of a process with non-zero between-lot, or between-group, variation. That is, there may be situations when the process's total variation is not equal to the inherent variation only, as is the case in the best of all possible worlds. Total variation can be a result of the presence of both within-lot, or inherent, variation and between-lot variation. The latter is usually due to special causes, but nevertheless it does not always manifest itself by producing out of control signals, at least for some time. It may therefore be of use to compare Cp and Cpk with their variants which account for total variation. Such variants are obtained by replacing in Cp and Cpk the 6σ range of the process's inherent variation by the 6σ range of the process's total variation. Accordingly, the following performance indices can be introduced:
$$P_p = \frac{USL - LSL}{6\hat{\sigma}_{total}}, \qquad (3.99)$$
$$P_{pk} = \min\left\{\frac{USL - \bar{x}}{3\hat{\sigma}_{total}}, \frac{\bar{x} - LSL}{3\hat{\sigma}_{total}}\right\}, \qquad (3.100)$$
where σ̂_total is an estimate of the process's total variation. The most common such estimate is provided by the sample standard deviation based on all lots pooled into one sample of individual observations. The 6σ range of the process's total variation, estimated by 6σ̂_total, is itself called the process performance. Let us emphasize again that the performance indices should be used only to compare their values with those of Cp and Cpk, respectively, and in this way to help measure continual improvement and prioritize the order in which to improve processes. Let us also
note that differences between corresponding capability and performance indices point to the presence of between-lot variation.
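As a quick numerical illustration, the sketch below checks the capability example above with the summary values taken from Table 3.10, and then evaluates the performance indices (3.99)-(3.100) for an assumed, purely illustrative value of the pooled total standard deviation.

```python
# Capability vs. performance indices; sigma_total below is an assumption for illustration.
usl, lsl = 6.85, 6.65
xbar, sbar, a5 = 6.7356, 0.0187, 1.0638

sigma_within = a5 * sbar                                # inherent (within-lot) variation
cp = (usl - lsl) / (6 * sigma_within)                   # eq. (3.96), about 1.67
cpk = min(usl - xbar, xbar - lsl) / (3 * sigma_within)  # eq. (3.98), about 1.43

sigma_total = 0.025                                     # assumed pooled estimate (not from the book)
pp = (usl - lsl) / (6 * sigma_total)                    # eq. (3.99)
ppk = min(usl - xbar, xbar - lsl) / (3 * sigma_total)   # eq. (3.100)
print(cp, cpk, pp, ppk)   # a Cp/Pp or Cpk/Ppk gap points to between-lot variation
```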
References

[1] Andrews, D.F., Bickel, P.J., Hampel, F.R., Huber, P.J., Rogers, W.H., and Tukey, J.W. (1972). Robust Estimates of Location. Princeton: Princeton University Press.
[2] Chambers, J.M., Cleveland, W.S., Kleiner, B., and Tukey, P.A. (1983). Graphical Methods for Data Analysis. Boston: Duxbury Press.
[3] Huber, P.J. (1977). Robust Statistical Procedures. Philadelphia: Society for Industrial and Applied Mathematics.
[4] Mosteller, F. and Tukey, J.W. (1977). Data Analysis and Regression. Reading: Addison-Wesley.
[5] Statistical Process Control Reference Manual (1995). Chrysler Corporation, Ford Motor Company, and General Motors Corporation.
[6] Wetherill, G.B. and Brown, D.W. (1991). Statistical Process Control. Chapman & Hall.
[7] Wheeler, D.J. and Chambers, D.S. (1992). Understanding Statistical Process Control. SPC Press.
Problems

Remark: Unless otherwise stated, the reader is asked to use a method of his or her choice to estimate unknown population mean and standard deviation.

Problem 3.1. Construct the mean control chart using the sample ranges to estimate the population standard deviation
a. for the data in Problem 1.1,
b. for the data in Problem 1.2.
Compare the charts obtained with the corresponding charts from Problem 1.1 and Problem 1.2, respectively.

Problem 3.2. Construct the mean and standard deviation control charts for the following data from lots of size 5.
Lot   x̄       s       Lot   x̄       s
1     62.028  .0040   13    62.018  .0050
2     62.038  .0030   14    62.026  .0032
3     62.026  .0041   15    62.027  .0031
4     62.025  .0040   16    62.019  .0052
5     62.016  .0030   17    62.025  .0031
6     62.022  .0031   18    62.030  .0020
7     62.027  .0020   19    62.023  .0031
8     62.028  .0040   20    62.025  .0009
9     62.036  .0041   21    62.020  .0030
10    62.026  .0031   22    62.026  .0032
11    62.025  .0040   23    62.023  .0019
12    62.023  .0049   24    62.025  .0030
Assume that assignable causes are found for every Pareto glitch observed and recompute the upper and lower control limits, so that they could be used for examining future data. Remark: Problems 3.3-3.10 are the continuation of Problems 1.5-1.9 from Chapter 1. The operations under examination are indicated in Figure 1.22 in Chapter 1 and, along with the nominal dimensions, in Figure 3.21 below. More precisely, the two lengths were specified as 3.10 mm with tolerance -.2 mm and 59.20 mm with tolerances ±.1 mm, respectively. The two diameters were specified as 8.1 mm with tolerance -.2 mm and 6.75 mm with tolerances ±.1 mm, respectively. It should be noted that, for obvious reasons, technological specifications may require readjustments/resettings in the production process, even if the process is in control. From the point of view of statistical process control for quality improvement, such readjustments/resettings can cause some minor inconvenience but, of course, have to be taken into account. For instance, in the case of operation 3 (see Problems 3.10 and 3.12 below), a foreman may be said to center the actual diameters at the nominal value, 6.75 mm. If so, suitable caution is needed when analyzing future lots on the basis of control limits obtained in the immediate past.
[Figure 3.21 drawing: nominal dimensions 3.10, φ 8.10, φ 6.75 and 59.20 mm.]
Figure 3.21. Piston Of A Fuel Pump.
Problem 3.3. Consider the (standard) control charts for Problem 1.5. Assume that assignable causes for every Pareto glitch observed are found and recompute the control limits. Use the limits obtained to examine the following data for the next 30 lots of size five.

Lot   x̄      s      Lot   x̄      s
1     3.034  .032   16    3.022  .018
2     3.028  .017   17    3.028  .028
3     3.046  .018   18    3.050  .010
4     3.010  .012   19    3.022  .019
5     3.010  .028   20    3.032  .021
6     3.030  .016   21    3.012  .024
7     3.018  .028   22    3.038  .024
8     3.048  .024   23    3.014  .016
9     3.028  .019   24    3.040  .024
10    3.012  .010   25    3.030  .024
11    3.016  .020   26    3.036  .013
12    3.036  .013   27    3.030  .020
13    3.012  .008   28    3.022  .021
14    3.020  .018   29    3.036  .021
15    3.050  .024   30    3.028  .014
Comment on whether the given data are in control without any reference to the past experience (use estimators of the population mean and standard deviation of the same type as previously). Compare both analyses.

Problem 3.4. Consider the data in Problem 1.5. Construct the mean and standard deviation control charts using the estimates of the population mean and standard deviation based on the sample medians. Which of the two types of estimates of the population mean and standard deviation seems to be more reliable? Given the new charts for the data examined (instead of the standard control charts), solve Problem 3.3.

Problem 3.5. Consider the (standard) control charts for Problem 1.7. Assume that assignable causes for every Pareto glitch observed are found and recompute the control limits. Use these new limits to examine the data for the next 21 lots. The data mentioned are given in Problem 1.9. Compare the conclusions obtained with the corresponding conclusions for Problem 1.9.

Problem 3.6. Consider the data in Problem 1.7. Construct the mean and standard deviation control charts using the estimates of the population mean and standard deviation based on the sample medians. Which of the two types of estimates of the population mean and standard deviation seems to be more reliable? Given the new charts for the data examined (instead of the standard control charts), solve Problem 3.5.

Problem 3.7. Consider the (standard) control charts for Problem 1.8. Assume that assignable causes for every Pareto glitch observed are found and recompute the control limits. Use these new limits to examine the data for the next 21 lots. The data mentioned are given in Problem 1.9.
Compare the conclusions obtained with the corresponding conclusions for Problem 1.9.

Problem 3.8. Consider the data in Problem 1.8. Construct the mean and standard deviation control charts using the estimates of the population mean and standard deviation based on the sample medians. Which of the two types of estimates of the population mean and standard deviation seems to be more reliable? Given the new charts for the data examined (instead of the standard control charts), solve Problem 3.7.

Problem 3.9. Consider the data in Problem 1.10. Assume that assignable causes for every Pareto glitch observed (if any) are found and recompute the control limits. Use these new limits to examine the data for lots 78 to 101. The data mentioned are summarized in the table below.

Lot   x̄      s       Lot   x̄      s
78    7.992  .025    90    7.990  .021
79    8.008  .008    91    8.028  .016
80    8.000  .016    92    8.020  .020
81    8.000  .016    93    8.014  .028
82    7.996  .027    94    7.980  .037
83    7.998  .019    95    8.016  .046
84    8.000  .035    96    8.038  .023
85    8.005  .013    97    8.014  .028
86    8.024  .015    98    8.026  .018
87    8.029  .017    99    8.008  .022
88    8.026  .024    100   8.022  .030
89    8.030  .016    101   8.020  .014
Verify also whether the given data are in control without any reference to the past experience. Compare both analyses.

Problem 3.10. In Table 3.10, summary statistics for lots 94 to 125 of size 5 of diameter 3 (see Figure 1.22) are given. Show that the system is in control. Compare the control charts obtained with those from Problems 1.7 and 1.9.

Problem 3.11. The system of statistical process control referred to in Problem 1.11 includes also measurements of another cylinder turned in a metal casting. The inner diameter of the cylinder is specified as 97 − .2 mm. Lots of size 5 are taken. The head of the plant's testing laboratory verified that the process is in control and that the last control limits are 96.907 and 96.857 for the lot means and .037 for the lot standard deviations. He decided that the given limits be used as trial control limits for subsequent lots. After a few new in-control lots were observed, the lots' sample means suddenly crossed the upper control limit and stayed consistently above it until the production process was stopped. All the time, lots' sample standard deviations were well below the limit .037, in fact never crossing the level of .02. A failure of the turning tool's seat was found. The seat was replaced by a new one, but the
means’ values proved first to lie barely below the upper control limit and soon crossed it. Again, the standard deviations stayed well below the control limit. The production process had to be stopped once more. Investigation revealed excessive hardness of castings processed. A cause of the material’s excessive hardness was duly removed. Subsequent lots’ means and standard deviations are given in the following table. Lot 1 2 3 4 5 6 7 8 9 10 11 12
x ¯ 96.868 96.874 96.861 96.872 96.878 96.873 96.875 96.874 96.870 96.872 96.882 96.894
s .010 .014 .011 .019 .018 .008 .012 .009 .079 .008 .014 .021
Lot 13 14 15 16 17 18 19 20 21 22 23 24
x ¯ 96.884 96.877 96.878 96.900 96.877 96.875 96.880 96.879 96.880 96.878 96.876 96.891
s .011 .009 .007 .008 .014 .008 .013 .014 .010 .013 .008 .012
Determine whether the system is in control.

Problem 3.12. Consider the data set in Problem 3.10.
a. Suppose that the process has been centered at its nominal value 6.75 while its standard deviation has remained unchanged (realistically speaking, these assumptions will not be fulfilled immediately after resetting the machine). Calculate the new value of Cpk, using the same σ̂ as in Section 3.8 (see the calculations following Table 3.10) and x̄ = 6.75.
b. Find the value of σ̂ such that Cpk assumes the same value as in a, while x̄ is the same as in Section 3.8 (see the calculations following Table 3.10). That is, show that the process capability can be increased not only by centering a process, but also by reducing its variability.

Problem 3.13. Construct control charts for the ranges
a. for the data set in Problem 1.1,
b. for the data set in Problem 1.2.
Are there any Pareto glitches revealed by the charts obtained? Compare the charts obtained with the standard deviation control charts for the same data.

Problem 3.14. Consider the following data set of simulated individual measurements.
Lot   x        Lot   x
1     -0.529   21    -2.728
2     0.096    22    0.957
3     -0.869   23    0.734
4     -1.404   24    -1.614
5     2.361    25    -0.224
6     -0.097   26    0.417
7     1.350    27    -0.213
8     -1.378   28    -1.645
9     -0.297   29    -2.298
10    0.490    30    0.436
11    -0.184   31    -0.505
12    -0.784   32    1.003
13    0.469    33    -0.144
14    0.101    34    5.063
15    0.162    35    0.559
16    0.693    36    0.994
17    -0.824   37    0.324
18    1.294    38    -0.449
19    0.287    39    -0.275
20    -0.936   40    1.120
The data come from N(0, 1) with probability .9 and from N(3, 1) with probability .1 (actually, only measurements 5, 7 and 34 come from the latter distribution). Determine whether the system is in control. Perform the investigation thrice: construct X charts using s (given that a(40) = 1.006) as well as using moving ranges of artificial lots of size 2 and 3. Interpret the results.

Problem 3.15. Consider the following data set of simulated individual measurements.

Lot   x        Lot   x
1     -0.953   16    4.094
2     -0.880   17    3.261
3     -0.172   18    4.640
4     -0.205   19    4.398
5     2.067    20    1.466
6     -1.016   21    2.754
7     0.398    22    4.267
8     -2.352   23    2.720
9     0.265    24    3.193
10    -1.049   25    4.560
11    4.218    26    4.222
12    2.389    27    3.186
13    3.323    28    3.459
14    2.578    29    2.919
15    2.985    30    1.615
The first 10 measurements come from N(0, 1), while measurements 11 to 30 come from N(3, 1). Determine whether the system is in control. Perform the investigation thrice: construct X charts using s (given that a(30) = 1.009) as well as using moving ranges of artificial lots of size 2 and 3. Interpret the results.
Chapter 4
Sequential Approaches

4.1 Introduction
Statistical Process Control usually deals with data lots indexed in time. If we look at the mean drift model in Section 3.5 as being close to reality, then we might be tempted to look on the production process as wandering, over time, from a region of satisfactory performance, to one which is not satisfactory. If this were the case, then we might want to develop a test where an alarm sounded as we moved outside the region of satisfactory performance so that we could take corrective action. In our view, this is generally a poor way to implement quality control. It is very much in the vein of the “quality assurance” philosophy which has worked rather badly both in the Soviet Union and in the United States. It rests largely on the anthropomorphic fallacy of machines and systems operating as so many tiring jugglers. Nevertheless, since this notion is still so popular in the quality control community, it is appropriate that we address it. In developing the CUSUM testing procedure, we recall the important concept of likelihood ratio tests covered in Appendix B. The argument below follows essentially that of Wilks [7].
4.2 The Sequential Likelihood Ratio Test
If our observations come from a density f(x; θ) and we have a time ordered data set (x1, x2, . . . , xn), we may wish to test whether the true parameter is θ0 or θ1. The logarithm of the ratio of the two densities
gives us a natural criterion for deciding between the two parameters:
$$z_t(\theta) = \ln\left(\frac{f(x_t, \theta_1)}{f(x_t, \theta_0)}\right). \qquad (4.1)$$
We wish to find constants k0 and k1 such that as long as
$$\ln(k_0) < z_1 + z_2 + \ldots + z_t < \ln(k_1),$$
(4.2)
we are in a "region of uncertainty" and continue sampling (i.e., we are in region G^n). But when
$$z_1 + z_2 + \ldots + z_t \geq \ln(k_1), \qquad (4.3)$$
(4.4)
we decide for parameter θ0 (i.e., we are in region Gn 0 ). At the nth stage, assuming we have not previously declared for θ0 or θ1 , the sample (x1 , x2 , . . . , xn ) falls in one of three sets: (x1 , x2 , . . . , xn ) ∈ Gn , continue sampling;
(4.5)
(x1 , x2 , . . . , xn ) ∈ Gn 0 , declare for θ0 ;
(4.6)
(x1 , x2 , . . . , xn ) ∈ Gn 1 , declare for θ1 .
(4.7)
Thus, the probability, if the true parameter is θ, of ever declaring for θ0 is given by L(θ) = P (G1 0 ) + P (G2 0 ) + . . . . (4.8) By the definition of the likelihood ratio test in (4.2), P (Gn 0 |θ1 ) ≤ k0 P (Gn 1 |θ0 ). So L(θ1 ) =
∞ n=1
0
P (Gn |θ1 ) ≤ k0
∞
P (Gn 1 |θ0 ).
(4.9)
(4.10)
n=1
Let us suppose that if θ is truly equal to θ0 , we wish to have L(θ0 ) = 1 − α.
© 2002 by Chapman & Hall/CRC
(4.11)
the sequential likelihood ratio test
131
And if θ is truly equal to θ1 , we wish to have L(θ1 ) = β,
(4.12)
where α and β are customarily referred to as Type I and Type II errors. Then, we must have β = L(θ1 ) ≤ k0 L(θ0 ) = k0 (1 − α). So k0 ≥
β . 1−α
(4.13)
(4.14)
By a similar argument for Gn 1 , we have k1 ≤
1−β . α
(4.15)
Rather than embarking on the rather difficult task of finding precise value for k0 and k1 , let us choose k0 =
β 1−α
(4.16)
and
1−β . (4.17) α What will the resulting actual Type I and Type II errors (say α∗ and ∗ β ) be? Substituting in (4.14) and (4.15), we have k1 =
and
β β∗ = k0 ≥ 1−α 1 − α∗
(4.18)
1−β 1 − β∗ . = k1 ≤ α α∗
(4.19)
So
and
β∗ ≤
β 1−α
(4.20)
α∗ ≤
α . 1−β
(4.21)
In practice, since α and β are usually less than .2, this ad hoc selection of k0 and k1 gives values for the actual Type I and Type II errors which are close to the target values.
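A minimal sketch of the sequential likelihood ratio test just described, for a normal mean with known σ, is given below. The parameter values, data and thresholds are illustrative assumptions only; the thresholds use the ad hoc choices k0 = β/(1 − α) and k1 = (1 − β)/α from (4.16)-(4.17).

```python
import numpy as np

# Sequential likelihood ratio test for theta0 vs. theta1 with observations from
# N(theta, sigma^2); sampling continues while the cumulative log ratio stays
# between ln(k0) and ln(k1).
def sprt(x, theta0, theta1, sigma, alpha=0.01, beta=0.01):
    log_k0 = np.log(beta / (1 - alpha))
    log_k1 = np.log((1 - beta) / alpha)
    z = ((x - theta0) ** 2 - (x - theta1) ** 2) / (2 * sigma ** 2)  # log f1/f0 per observation
    cum = np.cumsum(z)
    for t, value in enumerate(cum, start=1):
        if value >= log_k1:
            return "declare theta1", t
        if value <= log_k0:
            return "declare theta0", t
    return "still uncertain", len(x)

rng = np.random.default_rng(6)
print(sprt(rng.normal(10.02, 0.1, 200), theta0=10.0, theta1=10.05, sigma=0.1))
```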
© 2002 by Chapman & Hall/CRC
132
4.3
chapter 4. sequential approaches
CUSUM Test for Shift of the Mean
Let us now consider a sequential test to be used in the detection of a shift of the mean of a production process from μ0 to some other value μ1. We will assume that the variance, σ², is known and unchanged as we shift from one mean to the other. We carry out the test on the basis of the log likelihood ratio of N sample means, each of size n:
$$ R_1 = \ln\!\left(\frac{\prod_{j=1}^{N}\frac{1}{\sqrt{2\pi}\,\sigma/\sqrt{n}}\exp\!\left(-\frac{(\bar{x}_j-\mu_1)^2}{2\sigma^2/n}\right)}{\prod_{j=1}^{N}\frac{1}{\sqrt{2\pi}\,\sigma/\sqrt{n}}\exp\!\left(-\frac{(\bar{x}_j-\mu_0)^2}{2\sigma^2/n}\right)}\right) \qquad (4.22) $$
$$ = \frac{1}{2\sigma^2/n}\sum_{j=1}^{N}\left[(\bar{x}_j-\mu_0)^2-(\bar{x}_j-\mu_1)^2\right] = \frac{\mu_0-\mu_1}{2\sigma^2/n}\left[N(\mu_1+\mu_0)-2N\bar{\bar{x}}_N\right] = N\,\frac{\mu_1-\mu_0}{\sigma^2/n}\left[\bar{\bar{x}}_N-\frac{\mu_0+\mu_1}{2}\right]. $$
The test statistic is then clearly based on the difference between $\bar{\bar{x}}_N$ and the average of μ0 and μ1. Note that if we considered the case where μ1 was unspecified, then we would replace μ1 in (4.22) by the maximum likelihood estimator for μ, namely $\bar{\bar{x}}_N$, and the test statistic would be simply
$$ R_1 = \frac{N(\bar{\bar{x}}_N-\mu_0)^2}{\sigma^2/n} \qquad (4.23) $$
as shown in (B.168). We note that (4.23) indicates a procedure somewhat similar to that of a control chart for the mean. A major difference is that, at any stage j, we base our test on $\bar{\bar{x}}_j$ rather than on $\bar{x}_j$. Clearly such a cumulative procedure (CUSUM) is not so much oriented to detecting "Pareto glitches," but rather to discovering a fundamental and persistent change in the mean.

Now let us return to the case of two levels for μ0 and μ1. How shall we adapt such a test to process control? We might take the upper limit for acceptable μ and call it μ1. Then we could take the lower acceptable limit for μ and call it μ0. Typically, but not always, the target value is
the midpoint
$$ \mu^* = \frac{\mu_0+\mu_1}{2}. \qquad (4.24) $$
Then the analogue of the acceptance interval in the control chart approach becomes the region of uncertainty
$$ \ln(k_0) < N\,\frac{\mu_1-\mu_0}{\sigma^2/n}\left[\bar{\bar{x}}_N-\frac{\mu_0+\mu_1}{2}\right] < \ln(k_1). \qquad (4.25) $$
Suppose that we observe the first N observations to be equal to μ*. Then our statistic becomes
$$ R_1 = N\,\frac{\mu_1-\mu_0}{\sigma^2/n}\left[\frac{\mu_0+\mu_1}{2}-\frac{\mu_0+\mu_1}{2}\right] = 0. \qquad (4.26) $$
Then, suppose the next observation is μ* + δ. The statistic becomes
$$ R_1 = (N+1)\,\frac{\mu_1-\mu_0}{\sigma^2/n}\left[\frac{\mu_0+\mu_1}{2}+\frac{\delta}{N+1}-\frac{\mu_0+\mu_1}{2}\right] = \frac{\mu_1-\mu_0}{\sigma^2/n}\,\delta. \qquad (4.27) $$
We note that this statistic is independent of N. So, in one sense, the statistic does not have the highly undesirable property that a run of 100 lots, each precisely on target, can mask a 101st bad lot. On the other hand, suppose that the 101st lot gives x̄101 = μ* + δ, and the 102nd lot gives x̄102 = μ* − δ. Then, the pooled test will have returned our statistic to a neutral R1 = 0. If we believe that Pareto glitches both below and above the mean can occur, we note that the sequential test allows (undesirably) for the cancellation of positive departures from the standard by negative ones.

We note an additional problem with the kind of sequential test developed above. We recall that this test identifies "in control" with the sequential "region of uncertainty." Thus the "in control" period is viewed as a region of instability, a kind of lull before the inevitable storm. In Statistical Process Control, we view Pareto glitches not as evidence of inevitable fatigue in the production system. Rather, they are clues to be used to find means of improving the production system. It is the region where items are in control which represents stability.

Let us observe the use of this test on the simulated data set of Table 3.2. We recall that this data came predominantly from the distribution N(10, .01) (with probability .985) but with data also coming from
N(10.4, .03) (with probability .00995) and from N(9.8, .09) (with probability .00495). We need to ask what are the natural choices for α and β so that we might pick values for ln(k0) and ln(k1). We note that, as we have posed the problem, neither α nor β is particularly natural. Suppose we use .01 for both α and β. Then
$$ \ln(k_0) = \ln\!\left(\frac{\beta}{1-\alpha}\right) = \ln\!\left(\frac{.01}{.99}\right) = -4.595. \qquad (4.28) $$
Similarly, we have ln(k1) = 4.595. In Table 4.1, we add to Table 3.2 the indicated sequential test statistic, namely
$$ R_1 = N\,\frac{\mu_1-\mu_0}{\sigma^2/n}\left[\bar{\bar{x}}_N-\frac{\mu_0+\mu_1}{2}\right]. \qquad (4.29) $$
The value of μ1 − μ0 which we shall use is σ. (In practice, the values for −ln(k0) and ln(k1) are generally taken to be in the 4 to 5 range. The value for μ1 − μ0 is commonly taken to be .5σ or 1.0σ.)
Lot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
x1 9.927 9.862 10.061 9.820 9.737 9.876 9.898 10.001 9.928 9.896 10.011 9.983 10.127 10.025 9.953 10.007 10.062 10.168 9.986 9.786 9.957 9.965 9.989 9.983 10.063 9.767 9.933 10.227 10.022 9.845 9.956 9.876 9.932 10.016 9.927 9.952 9.941 10.010 9.848 10.002 10.031 9.990 9.995
x2 9.920 10.003 10.089 10.066 9.937 9.957 9.959 10.050 10.234 9.994 10.011 9.974 9.935 9.890 10.000 10.005 10.005 10.045 10.041 10.145 9.984 10.011 10.063 9.974 10.075 9.994 9.974 10.517 9.986 9.901 9.921 10.114 9.856 9.990 10.066 10.056 9.964 9.841 9.944 9.452 10.061 9.972 10.056
x3 10.170 9.829 9.950 10.062 9.928 9.845 9.924 10.263 9.832 10.009 10.090 10.071 9.979 10.002 10.141 9.883 10.070 10.140 9.998 10.012 10.273 9.810 10.148 9.883 9.988 9.935 10.026 10.583 10.152 10.020 10.132 9.938 10.085 10.106 10.038 9.948 9.943 10.031 9.828 9.921 9.943 10.068 10.061
Table 4.1 x4 x5 9.976 9.899 9.824 10.077 9.929 9.935 9.897 10.013 10.144 9.965 9.913 9.941 9.989 9.987 9.982 10.076 10.027 10.121 9.835 10.162 10.095 10.120 10.099 9.992 10.014 9.876 9.999 9.937 10.130 10.154 9.941 9.990 10.270 10.071 9.918 9.789 9.992 9.961 10.110 9.819 10.142 10.190 10.057 9.737 9.826 10.041 10.153 10.092 10.071 10.096 10.114 9.964 9.937 10.165 10.501 10.293 9.922 10.101 9.751 10.088 10.016 10.109 10.195 10.010 10.207 10.146 10.039 9.948 9.896 9.871 9.802 9.947 10.085 10.049 9.975 9.880 9.834 10.091 9.602 9.995 9.997 9.952 9.930 10.113 10.016 10.044
x ¯ 9.978 9.919 9.993 9.972 9.942 9.906 9.951 10.074 10.028 9.979 10.065 10.024 9.986 9.971 10.076 9.965 10.096 10.012 9.996 9.974 10.109 9.916 10.013 10.017 10.059 9.955 10.007 10.424 10.034 9.921 10.027 10.027 10.045 10.020 9.960 9.941 9.996 9.947 9.909 9.794 9.997 10.015 10.034
¯N x 9.978 9.948 9.963 9.966 9.961 9.952 9.952 9.967 9.974 9.974 9.982 9.986 9.986 9.985 9.991 9.989 9.996 9.996 9.996 9.995 10.001 9.997 9.998 9.998 10.001 9.999 9.999 10.014 10.015 10.012 10.013 10.013 10.014 10.014 10.013 10.011 10.010 10.009 10.006 10.000 10.001 10.001 10.002
R1 -1.100 -5.150 -5.505 -6.900 -9.750 -14.400 -16.800 -13.200 -11.700 -13.000 -9.900 -8.400 -9.100 -10.500 -6.750 -8.800 -3.400 -3.600 -3.800 -5.000 1.050 -3.300 -2.300 -2.400 1.250 -1.300 -1.350 20.300 21.750 18.000 20.150 20.800 23.100 23.800 22.050 19.800 18.500 17.100 11.700 0.000 2.050 2.100 4.300
R2 -0.492 -1.629 -1.421 -1.543 -1.950 -2.629 -2.840 -2.087 -1.744 -1.838 -1.335 -1.084 -1.129 -1.255 -0.779 -0.984 -0.369 -0.379 -0.390 -0.500 0.102 -0.315 -0.214 -0.219 0.112 -0.114 -0.116 1.716 1.806 1.470 1.618 1.644 1.798 1.825 1.667 1.476 1.360 1.241 0.838 0.000 0.143 0.145 0.293
Lot 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
x1 9.980 10.058 10.006 10.132 10.012 10.097 10.007 9.967 9.981 9.841 9.992 9.908 10.011 10.064 9.891 9.869 10.016 10.008 10.100 9.904 9.979 9.982 10.028 9.995 9.936 10.014 10.005 10.116 9.934 9.972 10.014 10.093 9.927 10.177 9.825 10.333 9.972 10.059 9.832 9.958 10.087 10.232 10.066 10.041 9.868 10.084 10.063
x2 10.094 9.979 10.221 9.920 10.043 9.894 9.789 9.947 10.053 9.926 9.924 9.894 9.967 10.036 10.055 9.934 9.996 10.157 9.853 9.848 10.008 9.963 10.079 10.029 10.022 10.070 10.044 10.028 10.025 9.855 10.000 9.994 9.832 9.884 10.106 10.280 10.116 9.992 10.075 9.884 9.994 9.966 9.948 10.044 9.955 10.018 10.055
x3 9.988 9.917 9.841 10.094 9.932 10.101 10.015 10.037 9.762 9.892 9.972 10.043 10.204 9.733 10.235 10.216 10.095 9.988 10.067 9.949 9.963 10.061 9.970 9.991 9.940 9.890 10.016 10.152 10.129 9.931 9.978 10.090 9.806 10.070 9.959 10.509 10.084 9.981 10.111 9.986 9.915 9.991 9.769 10.091 9.769 9.941 10.104
Table 4.1 (continued) x4 x5 x ¯ 9.961 10.140 10.033 9.881 9.966 9.960 10.115 9.964 10.029 9.935 9.975 10.011 10.072 9.892 9.990 9.959 10.040 10.018 9.941 10.013 9.953 9.824 9.938 9.943 9.920 10.107 9.965 10.152 9.965 9.955 9.755 9.925 9.914 9.903 9.842 9.918 9.939 10.077 10.040 9.985 9.972 9.958 10.064 10.092 10.067 9.962 10.012 9.999 10.029 10.080 10.043 9.926 10.008 10.017 9.739 10.092 9.970 9.929 9.904 9.907 10.132 9.924 10.001 9.970 9.937 9.983 10.087 10.094 10.052 10.232 10.189 10.087 10.248 9.948 10.019 10.137 9.901 10.002 10.188 10.116 10.074 10.047 10.040 10.077 10.054 10.124 10.053 9.785 9.846 9.878 10.133 10.100 10.045 10.079 9.998 10.051 10.042 9.914 9.904 9.980 10.089 10.040 9.901 9.964 9.951 10.631 10.444 10.439 10.059 9.914 10.029 9.800 9.950 9.956 9.954 9.946 9.984 10.008 10.113 9.990 10.023 9.883 9.980 10.021 9.965 10.035 10.102 9.932 9.963 10.031 9.958 10.033 10.023 9.921 9.907 10.052 10.026 10.024 10.080 10.064 10.073
¯1 x 10.002 10.001 10.002 10.002 10.002 10.002 10.001 10.000 10.000 9.999 9.997 9.996 9.996 9.996 9.997 9.997 9.998 9.998 9.998 9.996 9.996 9.996 9.997 9.998 9.999 9.999 10.000 10.001 10.002 10.000 10.000 10.001 10.000 10.000 10.000 10.005 10.006 10.005 10.005 10.005 10.004 10.005 10.004 10.004 10.003 10.003 10.004
R1 4.400 2.250 4.600 4.700 4.800 4.900 2.500 0.000 0.000 -2.650 -8.100 -11.000 -11.200 -11.400 -8.700 -8.850 -6.000 -6.100 -6.200 -12.600 -12.800 -13.000 -9.900 -6.700 -3.400 -3.450 0.000 3.550 7.200 0.000 0.000 3.750 0.000 0.000 0.000 19.750 24.000 20.250 20.500 20.750 16.800 21.250 17.200 17.400 13.200 13.350 18.000
R2 0.297 0.150 0.303 0.307 0.310 0.313 0.158 0.000 0.000 -0.163 -0.493 -0.663 -0.669 -0.675 -0.511 -0.515 -0.346 -0.349 -0.352 -0.710 -0.716 -0.721 -0.545 -0.366 -0.184 -0.186 0.000 0.188 0.379 0.000 0.000 0.194 0.000 0.000 0.000 0.994 1.200 1.006 1.012 1.019 0.820 1.031 0.829 0.834 0.629 0.633 0.849
Figure 4.1. Sequential Test (CUSUM) for Mean Shift.
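The running statistic in Figure 4.1 is easy to compute. The following minimal sketch (ours, not from the text) evaluates R1 of (4.29) over a stream of lot means; the function name and arguments are hypothetical, and the alarm boundaries are the ±4.595 values used above.

```python
def cusum_r1(lot_means, mu0, mu1, sigma, n):
    """Running values of R1, eq. (4.29), for successive lots of size n,
    where mu0 and mu1 are the lower and upper acceptable mean levels."""
    r1_values = []
    total = 0.0
    for N, xbar in enumerate(lot_means, start=1):
        total += xbar
        xbarbar = total / N                      # pooled running mean
        r1 = N * (mu1 - mu0) / (sigma ** 2 / n) * (xbarbar - (mu0 + mu1) / 2.0)
        r1_values.append(r1)
    return r1_values

# The test continues while ln(k0) = -4.595 < R1 < ln(k1) = 4.595; crossing
# either boundary triggers a decision, as in Section 4.2.
```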
4.4 Shewhart CUSUM Chart
We display the results of our sequential test graphically in Figure 4.1. We recall that of the 90 lot simulations, all but lots 28, 40 and 79 came from N(10.0, .01). Lots 28 and 79 came from N(10.4, .03). Lot 40 came from N(9.8, .09). Unfortunately, the fact that the first eight lot sample means were, by chance, below 10.0 gives us a false alarm early on. We do see the big jump upward at lot 28, but it is rather clear that this test is not appropriate for detecting Pareto glitches of an intermittent, nonpersistent nature, as with the data set in Table 4.1. We recall, moreover, that each of the three glitches was picked up satisfactorily using the standard mean control chart strategy as employed in Figure 3.2.

A popular empirical alternative to tests based on a formal sequential argument (e.g., the CUSUM Chart) is the "Shewhart CUSUM Chart." Rather than being based on testing each lot mean, as in the case of the control chart for the mean, this test is based on the pooled running mean
$$ \bar{\bar{x}}_N = \frac{1}{N}\sum_{j=1}^{N}\bar{x}_j. \qquad (4.30) $$
If all the lot means are identically and independently distributed with common lot mean μ0 and lot variance σ0²/n, where n is the lot size, then
$$ E(\bar{\bar{x}}_N) = \mu_0 \qquad (4.31) $$
and
$$ \mathrm{Var}(\bar{\bar{x}}_N) = \frac{\mathrm{Var}(\bar{x}_j)}{N} = \frac{\sigma_0^2/n}{N} = \frac{\sigma_0^2}{nN}. \qquad (4.32) $$
Accordingly, in the Shewhart CUSUM procedure for a change from the mean level μ0, we use as the "in control" region
$$ -3\,\frac{\sigma_0}{\sqrt{nN}} \le \bar{\bar{x}}_N - \mu_0 \le 3\,\frac{\sigma_0}{\sqrt{nN}}. $$
P(zN > 0) + P(zN < −6) = 1/2 and the ARL is equal to 2. Analogously, one obtains that the ARLs for the standardized shifts of values 4, 2, 1 and 0 are 1/.8413 = 1.189, 1/.1587 = 6.301, 1/.023 = 43.956 and 1/.0027 = 370.37, respectively. We
note that the last number is the ARL until a false alarm occurs. Computing the ARL’s for the Page CUSUMs is a rather difficult task. This task has been performed by Lucas and Crosier [3] and we shall confine ourselves to citing a few of their results for the Page CUSUMs with k = .5 (again for standardized mean shifts) in Table 4.6.
Table 4.6
Shift   h = 4   h = 5
0       168     465
1       8.38    10.4
2       3.34    4.01
3       2.19    2.57
4       1.71    2.01
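ARLs such as those in Table 4.6 can also be approximated by simulation. The sketch below (ours, not from the text) assumes the usual recursive form of the upper and lower Page cumulative sums, which is what the text's (4.39) and (4.40) refer to; the function names are hypothetical.

```python
import random

def page_run_length(shift, k, h, rng):
    """Lots observed until the two-sided Page CUSUM signals, for standardized
    lot means with the given mean shift (in lot standard deviation units)."""
    sh = sl = 0.0
    n = 0
    while True:
        n += 1
        z = rng.gauss(shift, 1.0)
        sh = max(0.0, sh + z - k)      # upper cumulative sum
        sl = max(0.0, sl - z - k)      # lower cumulative sum
        if sh >= h or sl >= h:
            return n

def average_run_length(shift, k=0.5, h=4.0, reps=10000, seed=1):
    rng = random.Random(seed)
    return sum(page_run_length(shift, k, h, rng) for _ in range(reps)) / reps

# average_run_length(0.0) and average_run_length(1.0) should be roughly
# comparable to the h = 4 column of Table 4.6 (about 168 and 8.4).
```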
Comparing the ARL’s for the standard mean control chart with those for the Page CUSUMs, we find that it may indeed be advantageous to use the latter on data with small, single mean shift. We note also that the Page test with h = 4 is likely to ring a false alarm much earlier than the corresponding test with h = 5. A thorough analysis of the ARL’s for the Page tests reveals that it is indeed reasonable to use k equal to one-half of the absolute value of the mean shift and to choose h equal to 4 or 5. Lucas and Crosier [4] summarize their numerical work stating that “this k value usually gives a control scheme having the shortest ARL(2k) for a given ARL(0), where ARL(D) is the average run length at a mean shift of D.” Moreover, the given values of h have been selected “to give the largest in-control ARL consistent with an adequately small out-of-control ARL.” Until now we have assumed that the variance of a production process is known. In practice, this assumption is neither often fulfilled nor actually needed. Whenever necessary, σ can be replaced by its estimate. We mentioned at the beginning of this section that the Page test is the most widely used, and misused, tool for detecting persistent shifts of the mean.The upper and lower cumulative sums, (4.39) and (4.40), have an intuitively appealing form and can be used more or less automatically on data with a single mean shift. On the other hand, using automatically the CUSUM test of Section 4.3 on such data is likely to lead to worse results (see Problem 4.1). The latter test is in fact the sequential likelihood ratio test which was developed for choosing the true mean of a population out of its two possible values. For the task mentioned, the sequential likelihood ratio test is the best possible in a certain sense. Namely, for
N denoting the number of observations until a decision is taken, the expected value of N is smaller than the number of observations needed by any fixed-sample-size test which has the same values of the Type I and Type II errors. But the problem with detecting the mean shift is that it is a different task and, therefore, no optimality properties of the sequential likelihood ratio test carry over automatically to the case of the CUSUM test of Section 4.3.

In order to suggest a way out of the problem, let us turn again to the data set of Table 4.3. The mean of each of the lots is either μ0 = 10 or μ1 = 11. Now, instead of automatically calculating R1 for successive lots, we should stop the test each time a decision is taken (that is, each time |R1| becomes greater than 4.595) and start it anew at the next lot. In this way, the question of detecting the shift directly is replaced by that of deciding what the current value of the mean is. Results obtained using such a sequence of CUSUM tests can be hoped to be similar to those obtained by means of the Page test (see Problem 4.1).

The relationship between the CUSUM test of this section and that of Section 4.3, which is directly based on the likelihood ratio test, was given for the first time by Johnson [2]. In order to briefly discuss this relationship, let us assume that the in-control process has mean μ0 and we want a test for the mean's shift to μ1, where μ1 > μ0 (of course, the case with μ1 < μ0 is analogous and does not require separate analysis). For the Page test, (4.39) is used to detect a possible upward shift. For the test which is directly based on the likelihood ratio, one can use (4.25) for the shift's detection. The upward shift of the mean to μ1 is claimed to have occurred when, for the first time, $\bar{\bar{x}}_N$ gives
$$ N\,\frac{\mu_1-\mu_0}{\sigma^2/n}\left[\bar{\bar{x}}_N-\frac{\mu_0+\mu_1}{2}\right] \ge \ln(k_1) $$
for k1 suitably defined. The left inequality in (4.25) is disregarded since we are interested only in the upward shift (hence, also, the current interpretation of μ0 differs from that in Section 4.3). To simplify notation, let us switch to standardized units. After simple manipulations, the above inequality assumes the form
$$ \delta\left(\sum_{i=1}^{N} z_i - N\,\frac{\delta}{2}\right) \ge \ln(k_1), $$
where δ = (μ1 − μ0)/(σ/√n) is the mean's shift in standardized units. Thus, it follows from (4.17) that the upper limit of the uncertainty or continuation
region is
$$ \sum_{i=1}^{N} z_i = \frac{1}{\delta}\ln\!\left(\frac{1-\beta}{\alpha}\right) + \frac{\delta}{2}\,N. $$
This limit can be given the following equivalent form: flag the process as out of control when, for the first time, the value of $\widetilde{SH}_N$ becomes greater than or equal to h, where
$$ \widetilde{SH}_N = (z_N - k) + \widetilde{SH}_{N-1}, \qquad k = \frac{\delta}{2} \quad \text{and} \quad h = \frac{1}{\delta}\ln\!\left(\frac{1-\beta}{\alpha}\right), $$
with $\widetilde{SH}_0 = 0$. Now, comparing the decision rule obtained with recursive equation (4.39) and recalling that the Page CUSUM test flags detection of the mean's shift when the cumulative sum SH_N becomes greater than h, we get the relationship between that test and the one directly based on the likelihood ratio. We find that the Page test is a modification of the other test, well suited to the situation tested. We note also that our comparison provides a new interpretation for the parameters k and h of the Page test.

Let us conclude our discussion of the Page test with two remarks. First, Lucas and Crosier [3] have recommended that the upper and lower CUSUMs be given "a head start," that is, that SH_0 and SL_0 in (4.39) and (4.40) be assigned a positive value. They suggest that the head-start value be equal to h/2. This head start does not change considerably the average run length until a false alarm, when there is no shift of the mean. On the other hand, it does seemingly reduce the ARL when there is a shift, but only if the mean shift and the addition of the head start coincide or almost coincide (see Problem 4.6). The CUSUM test described is known as a FIR CUSUM scheme. Our second remark is that, in our view, neither the Page test nor FIR CUSUMs should ever be used alone, without a parallel analysis of the standard mean control chart. Some authors, however, claim that sometimes intermittent Pareto glitches can, or even should, be disregarded. If the reader finds a situation when this last claim is justified, he or she may then use a robust CUSUM, developed by Lucas and Crosier [4] in such a way as to make the procedure insensitive to the presence of contaminated data.

We shall close this section with a brief description of the exponentially weighted moving average (EWMA) charts. Although these charts are not
based on calculating cumulative sums, they perform as well as CUSUM tests on data with small shifts of the mean of a production process. However, they can hardly be claimed to be superior to the CUSUMs, and we mention them only because they are advocated by some members of the quality control community.

The EWMA charts are based on calculating exponentially weighted moving averages of sample means of past lots. Let us assume that a certain number of lots has been observed and that the sample mean across all the lots is $\bar{\bar{x}}$. The exponentially weighted moving average for the Nth lot is computed as
$$ \hat{\bar{x}}_N = r\bar{x}_N + (1-r)\hat{\bar{x}}_{N-1}, \qquad (4.41) $$
where $\hat{\bar{x}}_0$ is most often set equal to $\bar{\bar{x}}$. If the means of the first lots can be safely assumed to be equal to some μ0, we can let $\hat{\bar{x}}_0 = \mu_0$. The weighting factor r is positive and not greater than 1.

We note that no averaging takes place, $\hat{\bar{x}}_N = \bar{x}_N$, if r = 1. Otherwise, the past sample means $\bar{x}_{N-i}$ are included in the moving average $\hat{\bar{x}}_N$ with exponential weight r(1 − r)^i. Clearly, if there is a shift in the lot means, it will likely be reflected in the values of the moving average. Careful analysis shows that, in order that small mean shifts be handily detected, the factor r should lie somewhere between .2 and .5. The most common choices are .25 and .333 (see Robinson and Ho [5], Sweet [6] and Hunter [1]). Control limits for the EWMA chart may be written as
$$ UCL = \bar{\bar{x}} + 3\,\frac{a(n)\bar{s}}{\sqrt{n}}\sqrt{\frac{r}{2-r}} \qquad (4.42) $$
and
$$ LCL = \bar{\bar{x}} - 3\,\frac{a(n)\bar{s}}{\sqrt{n}}\sqrt{\frac{r}{2-r}}, \qquad (4.43) $$
where a(n) is a constant with values given by (3.37) and Table 3.1. The second term on the right-hand sides of the above expressions is an approximation of the exact value of three standard deviations of the moving average $\hat{\bar{x}}_N$. In fact, the standard deviation of $\hat{\bar{x}}_N$ is equal to
$$ \frac{\sigma}{\sqrt{n}}\left(\frac{r\left(1-(1-r)^{2N}\right)}{2-r}\right)^{1/2}, $$
where, as usual, σ is the population standard deviation.

The EWMA chart for the data set of Table 4.3 and $\hat{\bar{x}}_0 = \mu_0 = 10$ is given in Figure 4.17. The values of the exponential moving averages $\hat{\bar{x}}_N$
are given in the last column of Table 4.3. Taking advantage of the fact that we know the true values of both the standard deviation σ and the initial mean μ0, we can construct the following control limits:
$$ UCL = 10 + \frac{3}{\sqrt{5}}\sqrt{\frac{.333}{1.667}} = 10.6 $$
and
$$ LCL = 10 - \frac{3}{\sqrt{5}}\sqrt{\frac{.333}{1.667}} = 9.4. $$

Figure 4.17. The EWMA Chart (μ0 and σ Known).

If the initial value of the mean is unknown and it is therefore reasonable to set $\hat{\bar{x}}_0 = \bar{\bar{x}}$, we would obviously like $\bar{\bar{x}}$ to be a good estimate of μ0. Now, if we can expect a single shift of the mean of a production process, we have to proceed cautiously. On the one hand, $\bar{\bar{x}}$ should be based on as many lots as possible. On the other hand, only the lots preceding the mean shift should be taken into account. In practice, the method works reasonably well if we can use at least 40 data to calculate $\bar{\bar{x}}$, that is, given lots of size 5, if we can use at least 8 lots. In our example, $\bar{\bar{x}}_8 = 10.119$ while, say, $\bar{\bar{x}}_{10} = 10.078$. Using either of these cumulative means does
not change the UCL or LCL considerably. Moreover, since the influence of past moving averages on the current one decreases exponentially fast, setting $\hat{\bar{x}}_0 = \bar{\bar{x}}_{10}$ or even $\hat{\bar{x}}_0 = \bar{\bar{x}}_8$, instead of setting $\hat{\bar{x}}_0 = \mu_0$, leads to slight changes of only a few moving averages (see Problem 4.8). If, however, we used $\bar{\bar{x}}_{40}$, the situation would change dramatically, since $\bar{\bar{x}}_{40}$ then assumes the value 10.479 and, hence, both limits would be shifted upward by .479. This situation is depicted in Figure 4.18. Although the EWMA procedure grasps the directions of the mean trends, it rings a false alarm on lots 13 through 17.
Figure 4.18. The EWMA Chart (Based on $\bar{\bar{x}}_{40}$).
As is the case for the CUSUMs, the EWMA procedure should capture the shifts upward and downward of the means of the lots given in Table 4.2. The EWMA chart for this set of data is demonstrated in Figure 4.19. When calculating the control limits, we set $\hat{\bar{x}}_0 = \mu_0 = 10$. Given that σ = .1, one readily obtains that UCL = 10.06 and LCL = 9.94. The cumulative means given in Table 4.2 show that the chart would not have been essentially changed had we used practically any of them as $\hat{\bar{x}}_0$.
Figure 4.19. The EWMA Chart for Data with Mean Drift.
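The EWMA recursion (4.41) and the limits (4.42)-(4.43) are simple to code. The sketch below (ours, not from the text) uses σ directly in place of the estimate a(n)s̄; the function name and arguments are hypothetical.

```python
import math

def ewma_chart(lot_means, r, sigma, n, mu0=None):
    """EWMA values from recursion (4.41) and 3-sigma limits in the spirit of
    (4.42)-(4.43), for lots of size n with (known or estimated) sigma."""
    center = mu0 if mu0 is not None else sum(lot_means) / len(lot_means)
    half_width = 3.0 * sigma / math.sqrt(n) * math.sqrt(r / (2.0 - r))
    ucl, lcl = center + half_width, center - half_width
    ewma, values = center, []
    for xbar in lot_means:
        ewma = r * xbar + (1.0 - r) * ewma      # recursion (4.41)
        values.append(ewma)
    return values, ucl, lcl

# With mu0 = 10, sigma = 1, n = 5 and r = .333 this reproduces the limits
# UCL = 10.6 and LCL = 9.4 used above for the data of Table 4.3.
```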
4.7 CUSUM Performance on Data with Upward Variance Drift
Again, the performance of CUSUM approaches should improve on data where there is a stochastic drift upward in the variance of the output variable. We recall that if lots of observations of size n are drawn from a normal distribution with mean μ and variance σ², then letting the sample variance be given by
$$ s_j^2 = \frac{1}{n_j-1}\sum_{i=1}^{n_j}(x_{j,i}-\bar{x}_j)^2, \qquad (4.44) $$
we have the fact (from Appendix B) that
$$ z_j = \frac{(n_j-1)}{\sigma^2}\,s_j^2 \qquad (4.45) $$
has the Chi-square distribution with n_j − 1 degrees of freedom. That is,
$$ f(z_j) = \frac{1}{\Gamma\!\left(\frac{n_j-1}{2}\right)2^{\frac{n_j-1}{2}}}\,z_j^{\frac{n_j-1}{2}-1}e^{-\frac{z_j}{2}}. \qquad (4.46) $$
Then, assuming that each lot is of the same size, n, following the argument used in (4.22), we have that, after the Nth lot, the likelihood
ratio interval corresponding to a continuation of sampling is given by ln(k0) < R3 < ln(k1), where
$$ R_3 = \ln\!\left(\frac{\left(\frac{1}{\sigma_1^2}\right)^{N\frac{n-3}{2}}\exp\!\left(-\frac{n-1}{2\sigma_1^2}\sum_{j=1}^{N}s_j^2\right)}{\left(\frac{1}{\sigma_0^2}\right)^{N\frac{n-3}{2}}\exp\!\left(-\frac{n-1}{2\sigma_0^2}\sum_{j=1}^{N}s_j^2\right)}\right). \qquad (4.47) $$
A straightforward Shewhart CUSUM chart can easily be constructed. We recall that for each sample of size n,
$$ s_j^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_{j,i}-\bar{x}_j)^2 \qquad (4.48) $$
is an unbiased estimator for the underlying population variance, σ². Thus, the average of N sample variances is also an unbiased estimator for σ², i.e.,
$$ E\!\left(\frac{1}{N}\sum_{j=1}^{N}s_j^2\right) = \sigma^2. \qquad (4.49) $$
We recall the fact that if the x_{j,i} are independently drawn from a normal distribution with mean μ and variance σ², then
$$ \frac{(n-1)s^2}{\sigma^2} = \chi^2(n-1). \qquad (4.50) $$
For normal data, then, we have, after a little algebra (see Appendix B),
$$ \mathrm{Var}(s^2) = E(s^2-\sigma^2)^2 \qquad (4.51) $$
$$ = \frac{1}{n}\left(\frac{\sigma^4\,4!}{4\cdot 2!} - \frac{n-3}{n-1}\,\sigma^4\right) \qquad (4.52) $$
$$ = \frac{2\sigma^4}{n-1}. \qquad (4.53) $$
So, then, a natural Shewhart CUSUM statistic is given by
$$ R_4 = \frac{\sum_{j=1}^{N}s_j^2 - N\sigma_0^2}{\sigma_0^2\sqrt{N}\sqrt{\frac{2}{n-1}}}. \qquad (4.54) $$
Here, we shall assume that the output from lot to lot follows a normal distribution with mean 10 and starting standard deviation 0.1. As we begin sampling from each lot, however, there is a probability of 10% that the standard deviation will increase by .01 cm. In Table 4.7, we show the data from Table 3.5.
Lot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
x1 10.036 9.851 9.916 9.829 10.194 10.057 9.875 10.006 10.013 9.957 9.902 9.851 10.022 9.990 10.125 9.858 9.837 10.201 10.052 10.017 9.778 10.056 10.058 9.955 10.001 10.016 10.052 10.005 9.881 9.775 10.042 9.917 9.728 10.032 10.071 10.004 9.859 10.348 10.000 9.926 9.881 10.252 10.162 10.044 9.897 9.916 10.088 10.371 10.192 9.725 10.064 10.056 9.917 9.993 10.015 10.012 9.846 10.016 9.853 9.884
x2 9.982 10.259 10.019 9.751 9.921 9.972 10.468 9.853 10.015 10.009 9.925 10.248 9.976 10.084 10.161 10.079 10.305 9.925 9.850 9.932 10.144 10.008 9.834 9.946 10.027 10.091 9.959 9.977 10.159 10.003 10.098 10.011 9.886 10.102 9.949 10.102 10.286 9.967 10.044 10.305 10.002 10.203 9.779 10.013 9.994 9.924 9.754 10.031 10.076 10.001 9.888 10.131 9.973 9.910 9.867 10.040 9.807 10.046 9.989 10.012
x3 9.996 10.132 9.920 9.955 10.159 10.038 10.017 10.054 10.073 10.057 9.904 10.172 10.104 10.028 10.029 9.976 9.794 9.859 9.840 10.194 10.110 10.027 9.980 10.041 9.932 9.934 9.967 10.092 9.986 9.971 9.988 10.116 9.884 9.997 9.972 10.194 9.963 9.874 9.811 10.036 9.781 9.845 10.150 10.030 9.857 9.855 10.122 10.203 9.915 10.127 10.082 9.870 10.170 9.938 9.836 10.134 9.824 9.862 10.066 9.894
x4 10.157 10.038 10.140 9.847 10.003 9.864 9.997 10.009 10.184 9.948 9.913 10.006 9.997 9.973 9.811 9.967 9.868 10.127 9.951 9.954 10.118 10.054 9.895 9.928 9.953 9.948 10.050 9.829 10.090 9.819 10.108 9.970 10.139 9.986 9.976 10.022 10.000 10.057 9.811 9.978 9.872 10.207 9.954 9.850 9.860 9.912 9.951 10.197 10.035 10.051 9.915 9.835 9.919 10.071 9.789 9.926 10.171 9.936 10.455 9.923
Table 4.7 x5 10.013 9.930 10.064 10.048 9.886 9.908 9.834 9.913 10.048 9.942 9.991 10.065 10.051 10.184 9.710 10.204 9.948 9.857 9.906 10.143 10.117 10.107 10.082 9.998 9.936 9.946 10.023 10.094 10.093 9.968 9.918 10.078 9.955 9.991 10.223 10.052 9.930 10.000 10.144 10.039 10.197 10.012 9.950 10.014 9.902 9.980 10.013 10.109 10.021 10.053 9.940 9.894 9.997 9.921 9.748 10.165 9.837 10.040 9.813 10.152
x ¯ 10.037 10.042 10.012 9.886 10.032 9.968 10.038 9.967 10.067 9.983 9.927 10.068 10.030 10.052 9.967 10.017 9.950 9.994 9.920 10.048 10.053 10.050 9.970 9.974 9.970 9.987 10.010 9.999 10.042 9.907 10.031 10.018 9.918 10.022 10.038 10.075 10.007 10.049 9.962 10.057 9.946 10.104 9.999 9.990 9.902 9.917 9.986 10.182 10.048 9.992 9.978 9.957 9.995 9.967 9.851 10.055 9.897 9.980 10.035 9.973
s 0.070 0.162 0.096 0.116 0.138 0.083 0.252 0.081 0.070 0.049 0.037 0.153 0.050 0.085 0.198 0.130 0.206 0.160 0.087 0.116 0.154 0.038 0.106 0.046 0.042 0.067 0.045 0.109 0.109 0.103 0.079 0.080 0.149 0.049 0.113 0.076 0.164 0.180 0.147 0.146 0.161 0.172 0.160 0.080 0.055 0.045 0.145 0.127 0.100 0.155 0.089 0.129 0.104 0.066 0.102 0.096 0.154 0.079 0.256 0.112
sj 2 0.005 0.031 0.040 0.054 0.073 0.080 0.144 0.151 0.156 0.158 0.159 0.183 0.185 0.192 0.231 0.248 0.290 0.316 0.323 0.336 0.360 0.361 0.372 0.374 0.376 0.380 0.382 0.394 0.406 0.417 0.423 0.429 0.451 0.453 0.466 0.472 0.499 0.531 0.553 0.574 0.600 0.629 0.655 0.661 0.664 0.666 0.687 0.703 0.713 0.737 0.745 0.762 0.773 0.777 0.787 0.796 0.820 0.826 0.892 0.905
R3 -0.193 1.713 1.920 2.628 3.835 3.841 9.548 9.555 9.361 8.868 8.276 9.982 9.489 9.496 12.703 13.709 17.216 19.124 19.130 19.737 21.444 20.851 21.257 20.765 20.272 19.978 19.485 19.992 20.498 20.905 20.813 20.720 22.226 21.733 22.340 22.246 24.253 26.761 28.268 29.674 31.581 33.788 35.694 35.601 35.209 34.715 36.122 37.029 37.336 39.042 39.149 40.157 40.563 40.270 40.577 40.784 42.490 42.398 48.305 48.911
R4 -0.707 1.100 0.816 0.990 1.455 1.155 3.955 3.550 3.111 2.594 2.089 2.572 2.157 1.965 2.958 3.111 4.116 4.533 4.315 4.301 4.629 4.251 4.187 3.868 3.564 3.328 3.048 3.047 3.046 3.021 2.870 2.725 2.979 2.741 2.773 2.640 2.999 3.464 3.691 3.891 4.196 4.561 4.852 4.712 4.512 4.295 4.476 4.552 4.505 4.740 4.654 4.746 4.720 4.561 4.519 4.460 4.683 4.568 5.560 5.569
Table 4.7 (continued)
Lot 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
x1 10.018 10.015 9.826 10.121 10.132 9.816 10.160 9.854 9.915 10.062 10.095 10.185 9.963 10.289 10.467 9.995 9.952 9.822 9.869 9.988 9.837 9.877 10.283 10.040 9.761 9.937 10.128 9.962 10.017 9.942
x2 9.661 10.034 10.200 9.863 9.913 9.880 9.843 10.031 9.931 10.147 10.088 9.986 10.162 9.940 9.884 9.811 9.943 9.795 10.136 10.016 9.840 10.260 9.827 9.970 10.010 10.098 9.836 10.070 10.126 9.857
x3 10.019 9.781 10.029 10.008 10.268 9.733 9.848 9.947 9.982 9.705 10.340 10.038 10.186 9.963 9.952 9.781 9.844 10.041 9.638 9.943 10.184 10.197 9.959 10.165 9.900 10.059 10.179 10.053 10.103 9.919
x4 9.872 9.829 10.135 10.161 10.070 9.809 10.401 9.980 9.928 10.053 10.014 10.405 9.700 10.200 10.255 10.059 10.159 9.945 9.858 10.167 10.148 9.909 10.172 10.076 10.092 9.709 10.145 10.160 9.807 9.988
x5 9.685 10.042 9.930 9.894 10.148 9.947 10.123 9.981 10.105 9.869 9.965 10.117 9.751 9.904 10.190 10.277 9.882 10.098 9.591 9.772 10.076 10.044 10.256 9.888 10.088 10.153 10.067 9.886 10.181 9.800
x ¯ 9.851 9.940 10.024 10.010 10.106 9.837 10.075 9.959 9.972 9.967 10.100 10.146 9.952 10.059 10.150 9.985 9.956 9.940 9.818 9.977 10.017 10.057 10.099 10.028 9.970 9.991 10.071 10.026 10.047 9.90
s 0.173 0.125 0.151 0.133 0.130 0.081 0.235 0.066 0.079 0.178 0.144 0.163 0.225 0.173 0.236 0.202 0.122 0.132 0.217 0.142 0.168 0.170 0.198 0.105 0.140 0.177 0.137 0.105 0.147 0.074
sj 2 0.935 0.951 0.974 0.992 1.009 1.015 1.070 1.074 1.080 1.112 1.133 1.160 1.211 1.241 1.297 1.338 1.353 1.371 1.418 1.438 1.466 1.495 1.534 1.545 1.565 1.596 1.615 1.626 1.648 1.653
R3 51.218 52.125 53.731 54.838 55.846 55.753 60.559 60.266 60.173 62.679 64.086 66.094 70.501 72.807 77.714 81.121 81.927 83.034 87.042 88.348 90.455 92.662 95.869 96.275 97.582 99.990 101.196 101.603 103.110 40.534
R4 5.885 5.945 6.129 6.223 6.297 6.180 6.911 6.757 6.640 6.964 7.099 7.333 7.962 8.236 8.932 9.376 9.396 9.464 9.992 10.088 10.308 10.542 10.928 10.878 10.968 11.224 11.296 11.246 11.363 102.917
In Figure 4.20, we display the CUSUM test with σ0² = .01 and σ1² = .02. As with the earlier CUSUM tests, we use α = β = .01, with the consequent ln(k0) = −4.595 and ln(k1) = 4.595. We note how effective the CUSUM procedure is in detecting this kind of random increase in the process variability.
Figure 4.20. CUSUM Test for Data with Variance Drift.

In Figure 4.21, we note that the Shewhart CUSUM test picks up the increase in variance quite handily.
Figure 4.21. Shewhart CUSUM for Data with Variance Drift.

Clearly, statistics R3 and R4 can be used for detecting a persistent shift in the variance from one value to another. As in the case of using statistic R1 to pick up a mean shift, we can replace a single test based on R3 by a sequence of such tests. Starting the test anew whenever a decision is taken enables us to get rid of the inertia of the single test.
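The two variance statistics are straightforward to compute from the lot sample variances. The following sketch (ours, not from the text; function name and arguments are hypothetical) evaluates R3 of (4.47) and R4 of (4.54) lot by lot.

```python
import math

def variance_cusums(sample_vars, sigma0_sq, sigma1_sq, n):
    """Running values of R3 (4.47) and R4 (4.54) from the successive lot
    sample variances s_j^2, each lot being of size n."""
    r3_values, r4_values = [], []
    cum = 0.0                                    # running sum of s_j^2
    for N, s2 in enumerate(sample_vars, start=1):
        cum += s2
        r3 = (N * (n - 3) / 2.0) * math.log(sigma0_sq / sigma1_sq) \
             + ((n - 1) / 2.0) * cum * (1.0 / sigma0_sq - 1.0 / sigma1_sq)
        r4 = (cum - N * sigma0_sq) / (sigma0_sq * math.sqrt(N) * math.sqrt(2.0 / (n - 1)))
        r3_values.append(r3)
        r4_values.append(r4)
    return r3_values, r4_values

# With sigma0_sq = .01, sigma1_sq = .02 and n = 5, the values should agree,
# up to rounding, with the R3 and R4 columns of Table 4.7.
```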
4.8 Acceptance-Rejection CUSUMs
Let us turn now to that area of "quality assurance" which has perhaps made the greatest use of CUSUM procedures. Suppose that we have a production system where the proportion of defective goods at the end of the process is given by p. A manufacturer believes it appropriate to aim for a target of p0 as the proportion of defective goods. When that proportion rises to p1 he proposes to intervene and see what is going wrong. Let us suppose that the size of lot j is nj. Then the likelihood ratio test is given by
$$ R_5 = \ln\!\left(\frac{\prod_{j=1}^{N}\frac{n_j!}{x_j!\,(n_j-x_j)!}\,p_1^{x_j}(1-p_1)^{n_j-x_j}}{\prod_{j=1}^{N}\frac{n_j!}{x_j!\,(n_j-x_j)!}\,p_0^{x_j}(1-p_0)^{n_j-x_j}}\right). \qquad (4.55) $$
Lot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
Defectives 3 2 5 0 6 4 2 4 1 2 7 9 11 12 14 15 12 10 8 3 5 6 0 1 3 3 4 6 5 5 3 3 7 8 2 0 6 7 4 4
Proportion .03 .02 .05 .00 .06 .04 .02 .04 .01 .02 .07 .09 .11 .12 .14 .15 .12 .10 .08 .03 .05 .06 .00 .01 .03 .03 .04 .06 .05 .05 .03 .03 .07 .08 .02 .00 .06 .07 .04 .04
Table 4.8 Cum Defectives 3.000 5.000 10.000 10.000 16.000 20.000 22.000 26.000 27.000 29.000 36.000 45.000 56.000 68.000 82.000 97.000 109.000 119.000 127.000 130.000 135.000 141.000 141.000 142.000 145.000 148.000 152.000 158.000 163.000 168.000 171.000 174.000 181.000 189.000 191.000 191.000 197.000 204.000 208.000 212.000
CUSUM -0.488 -1.509 -0.934 -3.017 -1.910 -1.867 -2.887 -2.844 -4.396 -5.416 -3.778 -1.076 2.689 6.985 12.345 18.236 22.533 25.766 27.936 27.448 28.022 29.129 27.046 25.494 25.005 24.517 24.560 25.667 26.242 26.817 26.328 25.840 27.478 29.648 28.628 26.544 27.651 29.289 29.332 29.376
Shewhart 0.000 -0.415 0.338 -0.586 0.262 0.479 0.222 0.415 0.000 -0.185 0.530 1.523 2.764 4.073 5.600 7.181 8.246 8.981 9.414 9.176 9.210 9.374 8.801 8.376 8.207 8.048 8.010 8.198 8.273 8.348 8.212 8.083 8.368 8.746 8.522 8.109 8.288 8.559 8.542 8.527
This gives, as the "interval of acceptability,"
$$ \ln(k_0) < \ln\!\left(\frac{p_1}{p_0}\right)\sum_{j=1}^{N}x_j + \ln\!\left(\frac{1-p_1}{1-p_0}\right)\sum_{j=1}^{N}(n_j-x_j) < \ln(k_1). \qquad (4.56) $$
As has been our practice, we will use here α = β = .01, giving ln(k0)
= -4.596 and ln(k1 ) = 4.596. Let us recall the defect data in Table 2.1 reproduced here as Table 4.8 with additional columns added for CUSUM and Shewhart CUSUM tests. We recall that the lot sizes were all equal to 100. Let us suppose that the manufacturer has a contract assuring that no more than 7% of the items are defective. Then, we might use .05 for p1 . For p0 we will use .03. We note (Figure 4.22) that we go below the lower control limit here at lot 10. Perhaps it would be appropriate here to see why the performance is better than expected. Generally we will only take interest when we get to lot 14, where the cumulative failure rate becomes unacceptably high. We recall from Figure 2.2 in Chapter 2 that it was at lot 14 that we detected a Pareto glitch.
Lot
Figure 4.22. CUSUM Test for Defect Data. A natural Shewhart CUSUM is available by first noting that under the hypothesis that the production proportion of defectives is p0 . If the number of defectives in lot j is equal to xj , then E(
N
xj ) = p0
j=1
N
nj .
(4.57)
j=1
Thus the appropriate Shewhart CUSUM statistic is N
j=1 xj
R6 =
− p0
N
N j=1 nj p0 (1
© 2002 by Chapman & Hall/CRC
j=0 nj
− p0 )
.
(4.58)
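The following sketch (ours, not from the text; function name and arguments are hypothetical) computes both the acceptance-rejection CUSUM of (4.55)-(4.56) and the Shewhart CUSUM R6 of (4.58) from lot-by-lot defect counts.

```python
import math

def defect_cusums(defects, lot_sizes, p0, p1):
    """Running acceptance-rejection CUSUM (4.56) and Shewhart CUSUM R6 (4.58)
    for counts of defectives x_j in lots of size n_j."""
    log_ratio_def = math.log(p1 / p0)
    log_ratio_ok = math.log((1.0 - p1) / (1.0 - p0))
    cusum_values, r6_values = [], []
    cum_x = cum_n = 0
    for x, n in zip(defects, lot_sizes):
        cum_x += x
        cum_n += n
        cusum = log_ratio_def * cum_x + log_ratio_ok * (cum_n - cum_x)
        r6 = (cum_x - p0 * cum_n) / math.sqrt(cum_n * p0 * (1.0 - p0))
        cusum_values.append(cusum)
        r6_values.append(r6)
    return cusum_values, r6_values

# With p0 = .03, p1 = .05 and lots of size 100, these reproduce, up to
# rounding, the CUSUM and Shewhart columns of Table 4.8.
```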
In Figure 4.23, we note that the cumulative defect level rises above 3 on lot 14.
Figure 4.23. Shewhart CUSUM Test for Defect Data.
References

[1] Hunter, J.S. (1986). "The exponentially weighted moving average," Journal of Quality Technology, 18, pp. 203-210.
[2] Johnson, N.L. (1961). "A simple theoretical approach to cumulative sum charts," Journal of the American Statistical Association, 56, pp. 835-840.
[3] Lucas, J.M. and Crosier, R.B. (1982a). "Fast initial response for CUSUM quality control schemes: Give your CUSUM a head start," Technometrics, 24, pp. 199-205.
[4] Lucas, J.M. and Crosier, R.B. (1982b). "Robust CUSUM: A robustness study for CUSUM quality control schemes," Communications in Statistics — Theory and Methods, 11, pp. 2669-2687.
[5] Robinson, P.B. and Ho, T.Y. (1978). "Average run lengths of geometric moving average charts by numerical methods," Technometrics, 20, pp. 85-93.
[6] Sweet, A.L. (1986). "Control charts using coupled exponentially weighted moving averages," Transactions of the IIE, 18, pp. 26-33.
[7] Wilks, S.S. (1962). Mathematical Statistics. New York: John Wiley & Sons, pp. 472-496.
Problems

Problem 4.1. Perform the standard CUSUM test for shift of the mean for the data set of Table 4.3, setting μ0 = 10, μ1 = 11 and σ = 1. Then, perform a sequence of standard CUSUM tests, starting each test anew whenever a decision is taken (that is, whenever the statistic R1 becomes greater than 4.595 or smaller than −4.595, stop the test, set N = 1 for the next lot, and start the test anew at that lot). Compare the results obtained with those for Page tests with k = 1 and k = .5 (see Section 4.6 for the latter results).

Problem 4.2. Suppose the simulated data set given in the following table is the set of standardized lot means of 50 lots of size 4. Standardization was performed under the assumption that the population variance is 4 and the mean is equal to a goal value μ0. It is conjectured, however, that an upward shift by 1 (in standardized units) is possible. In fact, the first 20 lots come from N(0, 1) while lots 21 to 50 come from N(1, 1).
a. Perform the standard CUSUM test for shift of the mean setting
$$ R_1 = N[\bar{z}_N - .5], $$
since the data is given in standardized units. Then, perform a sequence of standard CUSUM tests, starting each test anew whenever a decision is taken (see Problem 4.1 for explanation).
b. Construct the Shewhart CUSUM chart for standardized lot means.
c. Perform four Page CUSUM tests for standardized lot means using h = 4 in all cases, and k = .25, .5, 1 and 2, respectively.
Comment on the results obtained.
Lot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
z 0.071 0.355 -0.176 -0.043 1.572 0.129 -0.428 -1.495 2.590 1.319 0.377 -0.700 -1.269 -0.843 0.190 -0.795 1.398 0.640 0.869 -1.709 1.064 -0.294 1.714 0.785 0.704
Lot 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
z 2.851 1.523 -0.482 3.770 -0.875 -0.202 0.525 1.867 2.831 0.321 -0.286 2.155 -0.304 0.202 1.476 -0.364 -0.453 0.559 0.145 1.068 1.157 0.386 2.766 0.582 2.299
Problem 4.3. Suppose the simulated data set given in the following table is the set of 50 lots of size 4 from a population with standard deviation σ = .1.
Lot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
x1 4.880 4.952 5.087 5.183 4.932 4.871 5.116 4.870 4.920 5.048 4.864 4.855 4.956 4.914 5.007 5.016 4.939 5.177 4.958 5.130 5.263 5.065 5.133 4.990 5.002
x2 5.085 4.998 4.988 4.992 5.112 5.060 4.957 4.976 4.980 5.186 5.065 5.003 5.085 5.162 5.018 5.043 4.961 4.882 5.238 4.969 5.222 5.060 5.056 5.185 4.865
x3 4.941 4.997 4.889 4.871 5.035 5.163 5.053 4.972 4.967 4.908 5.094 4.930 5.069 4.895 4.879 5.024 4.968 5.058 4.944 4.923 5.135 5.089 5.009 5.028 5.091
x4 5.119 5.042 5.010 4.919 4.883 4.951 5.139 5.101 5.040 5.089 5.057 5.094 5.032 5.057 5.078 5.107 5.024 4.993 5.135 5.052 4.913 5.171 5.175 5.143 5.064
Lot 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
x1 5.001 5.069 5.098 5.013 5.020 5.096 5.071 4.875 5.054 5.031 5.127 4.907 5.128 5.078 4.950 5.182 5.039 5.125 5.127 4.982 4.862 5.052 5.026 4.900 5.024
x2 5.175 5.011 5.009 5.121 5.087 5.061 5.075 4.925 5.061 5.100 4.964 5.045 4.931 4.933 4.969 5.014 4.980 4.986 4.929 4.976 5.034 5.007 5.123 4.845 5.152
x3 5.081 5.144 4.983 5.102 5.060 5.124 5.111 4.936 5.157 4.921 5.018 4.971 5.005 5.043 4.939 4.924 5.037 4.971 5.019 5.289 5.031 5.138 4.993 5.049 4.897
x4 4.976 5.014 5.187 5.166 4.975 5.162 5.028 4.872 5.051 5.084 5.078 5.058 5.030 5.030 5.055 5.007 4.977 5.003 5.062 4.986 5.130 5.073 5.075 5.124 5.204
It is believed that the mean, at least for the first lots, is equal to the nominal value 5. It is conjectured, however, that an upward shift of the population mean by .5σ to σ is possible. In fact, the first 20 lots come from N(5, .01) while lots 21 to 50 come from N(5.05, .01).
a. Perform the standard CUSUM test for shift of the mean twice, setting first μ0 = 5, μ1 = 5.05 and then μ0 = 5, μ1 = 5.1.
b. Perform a sequence of standard CUSUM tests with μ0 = 5 and μ1 = 5.05, starting each test anew whenever a decision is taken (see Problem 4.1 for explanation).
c. Perform a sequence of standard CUSUM tests with μ0 = 5 and μ1 = 5.1, starting each test anew whenever a decision is taken.
d. Construct the Shewhart CUSUM chart for mean shift.
e. Perform the Page CUSUM test for mean shift thrice, using h = 4 in all cases, and k = .25, .5, and 1, respectively.
f. Construct the EWMA chart twice, first for r = .25 and then for r = .333 (use $\hat{\bar{x}}_0 = 5$ in both cases).
Comment on the results obtained.

Problem 4.4. Using statistical tables of the normal distribution, compute
the average run lengths for standard mean control charts for shifts of values 0, 1, 2 and 4 (in standardized units).

Problem 4.5. In the table below, summary statistics for 20 lots immediately preceding those considered in Problem 3.10 are given.

Lot   x̄      s        Lot   x̄      s
74    6.740  .023     84    6.772  .021
75    6.731  .017     85    6.751  .027
76    6.771  .018     86    6.778  .025
77    6.731  .022     87    6.782  .018
78    6.748  .020     88    6.755  .023
79    6.752  .016     89    6.742  .024
80    6.744  .023     90    6.736  .022
81    6.736  .019     91    6.742  .017
82    6.741  .021     92    6.748  .018
83    6.781  .026     93    6.734  .021
Use the above data to estimate the population mean and standard deviation. Given the estimates obtained, use a CUSUM test of your choice to verify if lots 94 to 125 reveal a .5σ mean shift. Interpret your results.

Problem 4.6. Examine the behavior of the FIR CUSUM test on the data set of Table 4.3, assuming μ0 = 10, σ = 1, k = 1, h = 4, SH0 = SL0 = 2 and
a. starting the test on lot 1;
b. starting the test on lot 11;
c. starting the test on lot 21.
Interpret the results.

Problem 4.7. Consider the data set of Table 3.3. Estimate the population mean and standard deviation, and construct the EWMA chart taking r = .333. Repeat the analysis for r = .25. Interpret the results.

Problem 4.8. Perform the EWMA test on the data set of Table 4.3, setting r = .333 and
a. $\hat{\bar{x}}_0 = \bar{\bar{x}}_8$;
b. $\hat{\bar{x}}_0 = \bar{\bar{x}}_{10}$.
Compare the results with the corresponding results in Section 4.6.

Problem 4.9. In the following table, simulated lots 1 to 10 come from N(5, (.3)²) and lots 11 to 40 from N(5, (.6)²).
Lot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
x1 4.802 5.149 4.742 5.189 4.804 4.861 5.191 5.128 5.302 4.701 5.562 4.583 5.416 4.371 4.277 5.145 4.805 5.346 4.667 4.538
x2 5.367 4.650 4.604 5.031 5.092 4.723 4.871 5.079 4.979 4.705 4.895 5.581 5.268 4.604 4.830 4.158 5.034 4.646 5.325 5.310
x3 4.725 5.017 5.323 4.789 5.243 5.265 4.826 5.536 5.141 5.297 4.246 4.924 4.527 4.813 6.435 4.887 5.529 4.658 4.996 4.084
x4 5.095 5.177 5.164 5.637 5.026 4.934 5.190 5.347 4.973 5.072 5.340 5.566 5.194 5.341 5.467 5.639 5.141 4.960 5.809 5.312
Lot 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
x1 5.480 3.722 4.499 3.558 4.281 4.958 5.055 4.798 5.588 4.196 5.598 4.423 4.434 5.649 5.523 5.269 5.176 4.566 5.916 5.696
x2 4.875 4.778 5.345 5.208 3.973 5.540 4.552 5.451 4.422 5.633 5.512 5.236 4.752 4.866 5.248 5.186 5.566 4.599 5.315 5.062
x3 4.687 6.330 5.360 5.644 4.810 5.219 4.474 5.336 4.766 4.164 5.197 3.835 4.578 4.540 6.410 5.032 5.523 4.906 6.270 5.729
x4 5.612 6.099 5.036 5.692 5.324 4.583 5.391 5.493 4.489 4.974 5.359 5.770 5.018 4.133 5.759 4.982 3.106 5.304 4.986 4.989
Use a suitable CUSUM test (or, if necessary, a sequence of tests) to detect the persistent shift of the variance.

Problem 4.10. Consider the data set of Problem 2.4. Suppose the company aims at signing a contract with a supplier of valves who is able to keep the proportion of defectives always below 5% and to maintain, on the average, a 3% level of defectiveness. Use the CUSUM acceptance-rejection test to verify whether the first of the company's subcontractors satisfies these requirements.
Chapter 5

Exploratory Techniques for Preliminary Analysis

5.1 Introduction
Perhaps the most interesting and challenging stage of a quality control investigation is at the very beginning. It is very unusual for an SPC professional to be presented with neat, chronologically recorded, lot data. And, indeed, it is rather unusual for an SPC program to begin as a well thought-out management decision. Much more common is the situation where the quality consultant is called in “to put out fires.” In the United States, these fires may very well be the result of a litigation concerning a real or perceived defect in a production item. It will do no one much good to respond to a potential client whose business is collapsing by offering to implement an orderly program of statistical process control on a production network which is devoid of measuring sensors, which might not even have been charted. A fatal flaw for any consultant is to insist on dealing with the world as he would like to see it rather than as it is. We are accustomed, in both Poland and the United States, to receiving a call for implementing a “comprehensive program of quality control” quickly followed by a statement of the sort “By the way, we have some little problem on which you might care to help.” The “little problem” may be a lawsuit or a cancellation by a major client. A situation where the “problem” is actually a business effect rather than the technological cause of the effect requires a certain amount of exploratory work, perhaps to the point of modeling the process which produced the real or perceived defect. 171
5.2 The Schematic Plot
We will first introduce the schematic plot of John W. Tukey, the founder of Exploratory Data Analysis [11].¹ Let us suppose we have a set of N measurements of the same variable Y. Let us sort the measurements from smallest to largest:
$$ y_1 \le \ldots \le y_{.25N} \le \ldots \le y_{.5N} \le \ldots \le y_{.75N} \le \ldots \le y_N. \qquad (5.1) $$
Here, y_{.25N} and y_{.75N} denote the lower and upper sample quartiles, respectively. (In Tukey's terminology, these are referred to as the lower and upper hinges.) Now we compute the interquartile range
$$ H = y_{.75N} - y_{.25N}. \qquad (5.2) $$
We then compute the size of a
$$ \text{Step} = 1.5H. \qquad (5.3) $$
Next we add one step to the upper hinge to obtain the
$$ \text{Upper Inner Fence} = \text{Upper Hinge} + \text{Step}. \qquad (5.4) $$
Similarly, we obtain the
$$ \text{Lower Inner Fence} = \text{Lower Hinge} - \text{Step}. \qquad (5.5) $$
Finally, we compute the
$$ \text{Upper Outer Fence} = \text{Upper Hinge} + 2\,\text{Step} \qquad (5.6) $$
and the
$$ \text{Lower Outer Fence} = \text{Lower Hinge} - 2\,\text{Step}. \qquad (5.7) $$
Let us suppose that the data are normally distributed. We will also assume that the sample median falls essentially at the population median (and hence at μ). Further, we will assume that the hinges occur essentially at the population values. Then, expressing all variates in standard form, i.e.,
$$ Z = \frac{X - E(X)}{\sqrt{\mathrm{Var}(X)}}, \qquad (5.8) $$

¹ For a concise model-based approach to EDA, see [8].
173
the schematic plot we have the fact that 1 Prob(Z > Upper Hinge) = √ 2π
∞
z2
e− 2 dz = .25.
(5.9)
.6745
Next 1 Prob(Z > Upper Inner Fence) = √ 2π
∞ 2.698
z2
e− 2 dz = .0035.
(5.10)
Then, 1 Prob(Z > Upper Outer Fence) = √ 2π
∞ 4.722
z2
e− 2 dz = 1.2 10−6 . (5.11)
Upper Outer Fence
Upper Inner Fence
Upper Hinge Median Lower Hinge
Lower Inner Fence
Lower Outer Fence
Figure 5.1. Schematic Plot. Thus, if we are dealing with a data set from a single normal distribution, we would expect that an observation would fall outside the hinges half the time. An observation would fall outside the inner fences 7 times in
© 2002 by Chapman & Hall/CRC
174
chapter 5. exploratory techniques
a thousand. Outside the outer fences, an observation would fall with chances only 2 in a million. In a sense, observations inside the inner fences are not particularly suspect as having been generated by other contaminating distributions. An observation outside the inner fences can begin to point to a contamination, to a Pareto glitch. Outside the outer fences, it is clear that something is pretty much the matter with the assumption that the data came from a dominant normal distribution. Some refer to the schematic plot as a “box plot” or a “box and whiskers plot,” but these terms actually refer to slightly less complex, less outlier oriented plots. We show an idealized schematic plot in Figure 5.1, i.e., one in which the boundaries were determined assuming the data are from a single normal distribution. Let us now use this approach with the sample means from the data set of Table 3.2. We first sort that data from the lowest x ¯ to the greatest. Lot 40 73 76 6 63 88 39 54 22 55 2 30 36 5 51 38 7 78 50 26 53 81 57 35 45 86 16 52 62 14 4 20 1 10 84 65 82 13 48 83 3 19 37 41 59
x1 10.002 9.972 9.927 9.876 9.904 9.868 9.848 9.992 9.965 9.908 9.862 9.845 9.952 9.737 9.967 10.010 9.898 9.825 10.007 9.767 9.841 10.059 10.064 9.927 10.058 10.066 10.007 9.981 10.100 10.025 9.820 9.786 9.927 9.896 10.087 9.982 9.832 10.127 10.012 9.958 10.061 9.986 9.941 10.031 9.869
© 2002 by Chapman & Hall/CRC
x2 9.452 9.855 9.832 9.957 9.848 9.955 9.944 9.924 10.011 9.894 10.003 9.901 10.056 9.937 9.947 9.841 9.959 10.106 9.789 9.994 9.926 9.992 10.036 10.066 9.979 9.948 10.005 10.053 9.853 9.890 10.066 10.145 9.920 9.994 9.994 9.963 10.075 9.935 10.043 9.884 10.089 10.041 9.964 10.061 9.934
Table 5.1 x3 x4 9.921 9.602 9.931 9.785 9.806 10.042 9.845 9.913 9.949 9.929 9.769 10.023 9.828 9.834 9.972 9.755 9.810 10.057 10.043 9.903 9.829 9.824 10.020 9.751 9.948 9.802 9.928 10.144 10.037 9.824 10.031 9.975 9.924 9.989 9.959 9.901 10.015 9.941 9.935 10.114 9.892 10.152 9.981 9.800 9.733 9.985 10.038 9.896 9.917 9.881 9.769 10.102 9.883 9.941 9.762 9.920 10.067 9.739 10.002 9.999 10.062 9.897 10.012 10.110 10.170 9.976 10.009 9.835 9.915 10.023 10.061 9.970 10.111 9.954 9.979 10.014 9.932 10.072 9.986 10.008 9.950 9.929 9.998 9.992 9.943 10.085 9.943 9.997 10.216 9.962
x5 9.995 9.846 9.914 9.941 9.904 9.921 10.091 9.925 9.737 9.842 10.077 10.088 9.947 9.965 9.938 9.880 9.987 9.964 10.013 9.964 9.965 9.950 9.972 9.871 9.966 9.932 9.990 10.107 10.092 9.937 10.013 9.819 9.899 10.162 9.883 9.937 9.946 9.876 9.892 10.113 9.935 9.961 10.049 9.952 10.012
x ¯ 9.794 9.878 9.904 9.906 9.907 9.907 9.909 9.914 9.916 9.918 9.919 9.921 9.941 9.942 9.943 9.947 9.951 9.951 9.953 9.955 9.955 9.956 9.958 9.960 9.960 9.963 9.965 9.965 9.970 9.971 9.972 9.974 9.978 9.979 9.980 9.983 9.984 9.986 9.990 9.990 9.993 9.996 9.996 9.997 9.999
Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
175
the schematic plot
Lot 64 69 27 47 18 23 42 24 61 49 68 34 12 89 31 32 9 46 80 44 87 29 43 85 56 77 60 33 74 75 66 72 25 11 58 90 8 70 15 71 67 17 21 28 79
x1 9.979 10.014 9.933 10.132 10.168 9.989 9.990 9.983 10.008 10.097 9.936 10.016 9.983 10.084 9.956 9.876 9.928 10.006 9.972 9.980 10.041 10.022 9.995 10.232 10.011 10.177 10.016 9.932 10.014 10.093 10.028 9.934 10.063 10.011 9.891 10.063 10.001 10.005 9.953 10.116 9.995 10.062 9.957 10.227 10.333
x2 10.008 10.070 9.974 9.920 10.045 10.063 9.972 9.974 10.157 9.894 10.022 9.990 9.974 10.018 9.921 10.114 10.234 10.221 10.116 10.094 10.044 9.986 10.056 9.966 9.967 9.884 9.996 9.856 10.000 9.994 10.079 10.025 10.075 10.011 10.055 10.055 10.050 10.044 10.000 10.028 10.029 10.005 9.984 10.517 10.280
Table 5.1(continued) x3 x4 x5 9.963 10.132 9.924 9.890 10.137 9.901 10.026 9.937 10.165 10.094 9.935 9.975 10.140 9.918 9.789 10.148 9.826 10.041 10.068 9.930 10.113 9.883 10.153 10.092 9.988 9.926 10.008 10.101 9.959 10.040 9.940 10.248 9.948 10.106 10.039 9.948 10.071 10.099 9.992 9.941 10.052 10.026 10.132 10.016 10.109 9.938 10.195 10.010 9.832 10.027 10.121 9.841 10.115 9.964 10.084 10.059 9.914 9.988 9.961 10.140 10.091 10.031 9.958 10.152 9.922 10.101 10.061 10.016 10.044 9.991 10.021 9.965 10.204 9.939 10.077 10.070 9.980 10.089 10.095 10.029 10.080 10.085 10.207 10.146 9.978 10.133 10.100 10.090 10.079 9.998 9.970 10.087 10.094 10.129 10.054 10.124 9.988 10.071 10.096 10.090 10.095 10.120 10.235 10.064 10.092 10.104 10.080 10.064 10.263 9.982 10.076 10.016 10.188 10.116 10.141 10.130 10.154 10.152 10.047 10.040 9.991 10.232 10.189 10.070 10.270 10.071 10.273 10.142 10.190 10.583 10.501 10.293 10.509 10.631 10.444
x ¯ 10.001 10.002 10.007 10.011 10.012 10.013 10.015 10.017 10.017 10.018 10.019 10.020 10.024 10.024 10.027 10.027 10.028 10.029 10.029 10.033 10.033 10.034 10.034 10.035 10.040 10.040 10.043 10.045 10.045 10.051 10.052 10.053 10.059 10.065 10.067 10.073 10.074 10.074 10.076 10.077 10.087 10.096 10.109 10.424 10.439
Rank 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
In constructing the Tukey schematic plot, we first compute the median, in this case, the average between the means ranked 45 and 46, respectively, 9.999 + 10.001 . (5.12) 2 The lower hinge is obtained by going up one fourth of the ranks. In this case the rank is essentially 23, so the lower hinge is 9.958. Similarly, the upper hinge is the observation having rank 68, or 10.034. This then gives us for the step size Median =
Step = 1.5(10.034 − 9.958) = .114.
(5.13)
Lower Inner Fence = 9.958 − .114 = 9.844;
(5.14)
Then we have
© 2002 by Chapman & Hall/CRC
176
chapter 5. exploratory techniques Lower Outer Fence = 9.958 − 2(.114) = 9.730;
(5.15)
Upper Inner Fence = 10.034 + .114 = 10.148;
(5.16)
and Upper Outer Fence = 10.034 + 2(.114) = 10.262.
(5.17)
We have one value between the lower outer and inner fences (denoted by an asterisk) and two values outside the upper outer fence (denoted by hollow circles). We recall that our ordinary control chart analysis identified these three points as being “out of control.” Why, then, one might well ask, should one bother with the schematic plot? The answer is that in the beginning of an SPC investigation, it is very common to have data not regularly indexed on time. In such a case, the schematic plot provides us a quick feel as to whether there appear to be easily identified Pareto glitches. Once we answer this question affirmatively, we can try and backtrack in time to see whether we might be able to identify the assignable cause of the glitch. Even if we cannot do so, we have “seen gold nuggets in the stream” and we can proceed forward with confidence that a procedure of instituting regular control charting is likely to yield big dividends, quickly. 10.5 10.4 10.3
MEAN
10.2 10.1 10.0 9.9 9.8
*
9.7
Figure 5.2. Schematic Plot of 90 Bolt Lot Means.
© 2002 by Chapman & Hall/CRC
smoothing by threes
5.3
177
Smoothing by Threes
It is the basic task of control charting to find the “high frequency” Pareto glitch. Consequently, it might appear that the last thing we would want to consider in this context is smoothing, a device generally oriented to removing the high frequency wiggles in a time indexed data set. A concrete example will show the importance of smoothing to the SPC investigator. A manufacturer producing flexible construction material had a number of samples that were outside the tolerance limits on a particular strength index. (This was a manufacturer practicing “quality assurance” rather than statistical process control.) There were about a dozen such measurement indices which were measured both by the manufacturer and his client. Not a great deal of attention was paid to these indices, which were perhaps only marginally related to performance of the sheets. Rather the manufacturing workers more or less identified good material visually. Nevertheless, the indices were a part of the state approved “code.” Failure to comply with them on the part of the builder might expose him to subsequent lawsuits by an end user. Some weeks after the material had been used, it was noted by a builder, who was the major client of the manufacturer, that a number of samples were outside stated tolerance levels for one of the strength indices. An on site task force was immediately dispatched to sites where the material had been used. Both the manufacturer and the builder agreed that the material was performing well. Extensive examination over a period of some months failed to find a problem. The head of the quality assurance section left the manufacturing firm “to pursue exciting opportunities elsewhere.” Pressure from the builder increased. Veiled threats to changing suppliers were made. At this time, we were called in to find a solution to the problem. Naturally, the first thing to consider was to take past runs and create a schematic plot. We show the plot in Figure 5.3. None of the observations were outside the inner fences. The schematic plot indicated that the data was skewed to the right, not normally distributed. Several transformations of the data were attempted, namely, the square root, the fourth root, the natural logarithm and the logarithm of the logarithm. None of these brought the chart to symmetry.
Figure 5.3. Schematic Plots of Strength Data.
Table 5.2
Lot   x        √x      √√x    ln(x)   ln(ln(x))
1     278.000  16.673  4.083  5.628   1.728
2     181.400  13.468  3.670  5.201   1.649
3     265.400  16.291  4.036  5.581   1.719
4     111.000  10.536  3.246  4.710   1.550
5     187.600  13.697  3.701  5.234   1.655
6     74.400   8.626   2.937  4.309   1.461
7     71.840   8.476   2.911  4.274   1.453
8     258.400  16.075  4.009  5.555   1.715
9     225.700  15.023  3.876  5.419   1.690
10    249.200  15.786  3.973  5.518   1.708
11    86.600   9.306   3.051  4.461   1.495
12    94.480   9.720   3.118  4.548   1.515
13    189.000  13.748  3.708  5.242   1.657
14    264.000  16.248  4.031  5.576   1.718
15    72.600   8.521   2.919  4.285   1.455
16    249.000  15.780  3.972  5.517   1.708
17    66.000   8.124   2.850  4.190   1.433
18    226.200  15.040  3.878  5.421   1.690
19    151.400  12.304  3.508  5.020   1.613
20    36.600   6.050   2.460  3.600   1.281
21    85.400   9.241   3.040  4.447   1.492
22    210.800  14.519  3.810  5.351   1.677
23    84.800   9.209   3.035  4.440   1.491
24    57.400   7.576   2.753  4.050   1.399
25    49.800   7.057   2.656  3.908   1.363
26    54.200   7.362   2.713  3.993   1.384
27    204.400  14.297  3.781  5.320   1.671
28    233.000  15.264  3.907  5.451   1.696
29    210.200  14.498  3.808  5.348   1.677
30    77.550   8.806   2.968  4.351   1.470
31    89.900   9.482   3.079  4.499   1.504
32    112.000  10.583  3.253  4.718   1.551
33    64.000   8.000   2.828  4.159   1.425
34    70.600   8.402   2.899  4.257   1.449
35    59.800   7.733   2.781  4.091   1.409
36    66.800   8.173   2.859  4.202   1.435
37    235.600  15.349  3.918  5.462   1.698
38    90.800   9.529   3.087  4.509   1.506
39    216.000  14.697  3.834  5.375   1.682
40    120.500  10.977  3.313  4.792   1.567
41    239.200  15.466  3.933  5.477   1.701
42    87.000   9.327   3.054  4.466   1.496
43    318.800  17.855  4.226  5.765   1.752
44    44.480   6.669   2.583  3.795   1.334
45    91.840   9.583   3.096  4.520   1.509
46    40.800   6.387   2.527  3.709   1.311
47    232.200  15.238  3.904  5.448   1.695
Figure 5.4. Scatter Plots of Strength Data.
Figure 5.5. Histogram of Strength Data.

Although the data had not been collected in any sort of regular time interval fashion, and we had only one datum per lot, it was possible to order the data by production run, as we demonstrate in Table 5.2 and in Figure 5.4. The split of the data into two groupings is made even more apparent by the histogram of the strength data in Figure 5.5. A standard control chart here is obviously inappropriate. The variability induced by the two groupings causes all the lots to fall inside the control limits. We recall that in the case where we have n samples in a lot, to
obtain an estimator for the population variance on the basis of N lots, we can use

\hat{\sigma}^2 = \frac{1}{N}\sum_{j=1}^{N} s_j^2,   (5.18)

where

s_j^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_{j,i} - \bar{x}_j)^2.   (5.19)

Note that we could have used as an estimator for σ², under the assumption that all the data come from the same normal distribution,

\hat{\hat{\sigma}}^2 = \frac{1}{nN}\sum_{j=1}^{N}\sum_{i=1}^{n}(x_{j,i} - \bar{x})^2.   (5.20)
This estimator is not generally used, because it can be inflated greatly if the mean is not truly constant. But in our example, where there is only one sample per lot, we have little choice except to use $\hat{\hat{\sigma}}^2$ as an estimator for σ².

When we use the estimator on the untransformed "strength" data in Table 5.2, we obtain the inflated estimate for σ² of 6,881.5. The $\bar{x}$ value of 144.398, with three times the square root of our variance estimate, gives, as upper and lower control limits, 393.27 and 0, respectively. A glance at Figure 5.6 shows that all measurements are nominally (and naively) "in control." It turns out that the data appears to be dividing into a group with "strength" measurements in the proper range, and another very much on the low side.

Such a data set is not untypical of the kind confronting statistical process control professionals as soon as they walk in the door. In the United States, in particular, SPC people seem to be called in by desperate managers more frequently than by managers who have coolly decided that this is the year when the company will get serious about quality control. The Deming/Shewhart analysis, which is generally excellent as a procedure once some degree of order has been introduced, frequently needs to be supplemented by nonstandard techniques in the first stages of an SPC implementation.
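The following sketch (our illustration, not part of the original text) redoes the computation just described: with only one measurement per lot, the overall estimator (5.20) is all that is available, and the naive 3σ limits it yields leave every observation "in control." The strength values are those of Table 5.2; the printed results should agree approximately with the figures quoted above.

# With one measurement per lot, only the overall variance estimator is usable;
# the resulting naive control limits leave every lot inside the limits.
import numpy as np

strength = np.array([
    278.0, 181.4, 265.4, 111.0, 187.6, 74.4, 71.84, 258.4, 225.7, 249.2,
    86.6, 94.48, 189.0, 264.0, 72.6, 249.0, 66.0, 226.2, 151.4, 36.6,
    85.4, 210.8, 84.8, 57.4, 49.8, 54.2, 204.4, 233.0, 210.2, 77.55,
    89.9, 112.0, 64.0, 70.6, 59.8, 66.8, 235.6, 90.8, 216.0, 120.5,
    239.2, 87.0, 318.8, 44.48, 91.84, 40.8, 232.2])

xbar = strength.mean()
var_hat = strength.var(ddof=1)              # the inflated estimate of sigma^2
ucl = xbar + 3 * np.sqrt(var_hat)
lcl = max(xbar - 3 * np.sqrt(var_hat), 0.0)
print(xbar, var_hat, ucl, lcl)              # roughly 144.4, 6,880, 393, 0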
Figure 5.6. Naive Control Chart of Strength Data.

The anecdotal observations of field personnel, to the effect that there seemed to be no discernible difference between the material from lots in which the "strength" index was low and that from lots in which it was in the satisfactory range, could indicate many things. For example, the index could simply have nothing to do with anything. In this situation, that seemed not to be the case. The company engineers were confident that suppressed values of the index should, in some cases, lead to material failures, of which there had, in fact, been none. Perusing old records, we found cases in which the laboratory had inadvertently made two measurements from the same lot (naturally, they should always have been replicating their tests, but they had not purposely done so). And, in some cases, it turned out that on the same testing machine, with the same material, normal range as well as low values had been found. So, then, one might suppose that the technicians obtaining the low readings were improperly trained or otherwise deficient. But still further investigation found that the same technician obtained normal and low readings for the same material on two different testing machines (there were three testing machines). So, then, perhaps there was a problem with one of the machines. But further investigation showed that low values had been observed on each of the machines, with approximately the same frequency. At this point, we arranged for the chief technician to take samples
from three different lots and run them on each of the three machines, each lot replicated three times. For each set of material and for each machine, the technician observed both high and low measurements. A look at the testing curves (which had been suppressed from observation, since an automated read-out software package bundled with the testing device was being utilized) revealed the difficulty.
Figure 5.7. Tension Test of Material.

In the case of this material, a sample was stretched by the machine until its elasticity was exceeded. At this time, the tension automatically dropped. But as the material was stretched further, the now deformed material actually increased in strength, until the material broke. The strength measurement which was being recorded was the tension at the time of break of the material. We show such a curve in Figure 5.7. After retrieving the curves typical of the material under examination (see Figure 5.8), a possible candidate for the cause of the difficulty was apparent. Due to the special characteristics of the material, a high frequency jitter was present. If the software in the measuring instrument was not smoothing the data, and if it was attempting to find the "strength" (tension at failure) by looking for the tension the second time the tension began to decrease, then the kind of problem observed could well occur.
Figure 5.8. Tension Test of Nonstandard Material.

We immediately told the client to disconnect the "automatic yield finder" and reconnect the graphics terminal, so the technician could read the actual value as opposed to a bogus one. Then we called the instrumentation vendor to see if the conjecture about the way they wrote the software to find the "strength" could be substantiated. It was.

Next, we indicate a quick fix for such a problem. Again, we rely on an Exploratory Data Analysis algorithm of Tukey's, namely the 3R smooth. In Table 5.3, the "Tension" column shows the raw measurements (of the sort the software was using to find the second local maximum). In the second column, "Tension3," we have used the "3" smooth. To show how this works, let us consider the 13th Tension measurement, which is equal to 93. We look at the observation before and the one after to see the triple {101, 93, 98}. We then replace the 13th observation by the median of the triple formed by the observation itself, the one just before, and the one just after. In other words, we write down 98 to replace 93. We start at the top of the list, and continue all the way through to the end (generally, the first and the last observations are not changed). We denote in boldface those values which have changed from the preceding iteration of the "3" smooth. In this case, three iterations bring us to the point where there are no more changes. "3R" means "3 smooth repeated until no further changes take place." Accordingly, here, the "Tension333" column is the "3R" end result. In Figure 5.9, we show the raw data and the evolution of smoothing as we proceed through the iterations. We note how the smooth tends to eliminate the perception of high frequency jitter as local maxima and minima.
Table 5.3
Time  Tension  Tension3  Tension33  Tension333
1     100      100       100        100
2     107      107       107        107
3     114      114       114        114
4     121      121       121        121
5     125      125       125        125
6     127      125       125        125
7     125      125       125        125
8     120      120       120        120
9     116      116       116        116
10    111      111       111        111
11    107      107       107        107
12    101      101       101        101
13    93       98        98         98
14    98       93        96         96
15    87       96        93         96
16    96       91        96         96
17    91       96        96         96
18    105      100       100        100
19    100      105       102        102
20    108      102       105        105
21    102      108       108        108
22    121      121       121        121
23    125      125       125        125
24    130      128       128        128
25    128      130       130        130
26    139      139       139        139
27    158      158       158        158
28    168      168       168        168
29    175      175       175        175
30    188      188       188        188
31    195      195       195        195
32    205      205       205        205
33    215      215       215        215
34    220      215       215        215
35    182      182       182        182
36    164      164       164        164
37    140      140       140        140
Figure 5.9. Tension Measurements with Smooths.
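A minimal sketch of the 3R smooth just described, in Python (the book itself gives no code). Applied to the raw tension readings, the repeated 3-smooth terminates after three passes and returns the "Tension333" column of Table 5.3.

# Tukey's "3R" smooth: running medians of three, repeated until no change.
def smooth3(y):
    """One pass of the running median of three; endpoints are left unchanged."""
    z = list(y)
    for i in range(1, len(y) - 1):
        z[i] = sorted(y[i - 1:i + 2])[1]   # median of the triple
    return z

def smooth3R(y):
    """Repeat the 3-smooth until the sequence no longer changes."""
    current = list(y)
    while True:
        nxt = smooth3(current)
        if nxt == current:
            return current
        current = nxt

tension = [100, 107, 114, 121, 125, 127, 125, 120, 116, 111, 107, 101, 93,
           98, 87, 96, 91, 105, 100, 108, 102, 121, 125, 130, 128, 139, 158,
           168, 175, 188, 195, 205, 215, 220, 182, 164, 140]

print(smooth3R(tension))   # reproduces the "Tension333" column of Table 5.3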
5.4 Bootstrapping
Some might be curious as to why workers in SPC seldom deal with the t (or "Student's") distribution. After all, it was the turn-of-the-century proto quality control expert W. S. Gosset (aka "Student," who used an alias in his early publishing lest his employer, Guinness Breweries, bear the opprobrium of having used statistical techniques in its manufacturing process) who discovered that the distribution of

t = \frac{\bar{x} - \mu}{s/\sqrt{n}}

was not N(0, 1), as had been supposed, but was something having heavier tails. Why do SPC workers revert to the old assumption that if x1, x2, ..., xn are independent and normally distributed, then, indeed,

t = z = \frac{\bar{x} - \mu}{s/\sqrt{n}}

may be treated as though it were N(0, 1)? The reason is clear enough. We do not, in SPC, estimate the underlying variance from the sample variance of a given lot, but from the pooled average of the sample variances from many lots, say M (typically over 50). The estimate $\bar{s}^2 = \frac{1}{M}\sum_{i=1}^{M} s_i^2$, where for lot i of size $n_i$, $s_i^2 = \frac{1}{n_i-1}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2$,
is very close to σ² because, by the Strong Law of Large Numbers (see Appendix), the average of many independent sample estimates of a parameter tends to the parameter as the number of estimates gets large.

For many, perhaps most, situations in SPC we will not be far off the mark if we simply assume that our data consists of draws from a dominant normal distribution plus occasional draws from contaminating normal distributions (as discussed in Chapter 3). However, we have already seen with the example in Section 5.3 that there are significant exceptions to the rule. That case was particularly hard to deal with, since the "in control" part was not really dominant: the bad observations were as numerous as the good ones.

Most of the standard testing in Statistical Process Control is based on lot means and standard deviations. Because for almost all realistic cases we are dealing with distributions which produce sample lot means which converge to the Gaussian (normal) distribution as the size of lots gets large (say, greater than 20), we frequently assume that the lot means are normally distributed. Yet our lot sizes are generally 10 or less, where the convergence of the sample mean to normality has not yet taken place (unless the underlying distribution is itself rather close to Gaussian). Resampling gives a means for using lot means in such a way that we are not making any assumptions about their being normally distributed. Similarly, we can make tests about lot standard deviations which do not make any assumptions about the underlying normality of the data.

The range of sophistication that can be used in resampling is extensive. But for dealing with low order moments (the mean and the variance), the nonparametric bootstrap (developed in its full glory by Bradley Efron [1], but used much earlier, in the simple form we employ here, by the late Julian Simon [5]) works perfectly well.

Suppose we have a data set of size n and from this data set wish to make a statement about a confidence interval of the mean of the distribution from which the data was taken. Then, we can select with replacement n of the original observations and compute the sample mean. If we carry out, say, 10,000 such resamplings, each of size n, we can order the resulting sample means from smallest to largest. Then $[\bar{X}_{(250)}, \bar{X}_{(9750)}]$ gives us a 95% confidence interval for the true value of the mean of the distribution from which the sample was taken. We are 95% sure the mean μ lies in this interval.
Similarly, let us take that data set of size n and the same set of 10,000 resamplings and compute s, the sample standard deviation, for each of the 10,000 resamplings. Rank order the sample standard deviations. A 95% confidence interval for σ would be $[s_{(250)}, s_{(9750)}]$.
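The recipe above is easily coded. The sketch below is our own illustration, using a hypothetical lot of five measurements, and computes percentile bootstrap intervals for the mean and for σ.

# Nonparametric (percentile) bootstrap confidence intervals for the mean and
# the standard deviation; the lot data here is hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(x, stat, B=10_000, level=0.95):
    """Percentile bootstrap confidence interval for stat(x)."""
    x = np.asarray(x)
    reps = np.array([stat(rng.choice(x, size=len(x), replace=True))
                     for _ in range(B)])
    lo, hi = np.quantile(reps, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi

lot = [10.02, 9.97, 10.05, 9.99, 10.01]                 # hypothetical lot of size 5
print(bootstrap_ci(lot, np.mean))                       # 95% CI for the mean
print(bootstrap_ci(lot, lambda y: np.std(y, ddof=1)))   # 95% CI for sigma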
Of course, we are pretending that the data we have represents all the data that ever we could have and that further examinations will always yield simply a repeating of our original data set, but with some points missing, and others included more than once. This has a certain intuitive appeal, although mathematically a great deal more work is required to put it on solid ground.

Let us return to the data in Table 3.2. We recall that this data was actually the result of a baseline normal distribution N(μ = 10, σ² = .01), contaminated by other normals. The variables in each lot were

Y(t) = N(10, .01) with probability .98505;   (5.21)
Y(t) = N(10.4, .03) with probability .00995;   (5.22)
Y(t) = N(9.8, .09) with probability .00495;   (5.23)
Y(t) = N(10.2, .11) with probability .00005.   (5.24)

The proviso was that Y(t) stayed in the same distribution for all items in the same lot.
Figure 5.10. Bootstrapped Means from Lot 40.
Figure 5.11. Bootstrapped Standard Deviations from Lot 40.

Now we first try a bootstrapping approach where we use the lot histograms to obtain a 99.8% confidence interval for μ and a 99.8% confidence interval for σ. Why 99.8%? Because we have been using 3σ level normal theory tests, and that convention corresponds to 99.8% confidence intervals. We note that for lot 40, the 99.8% confidence interval for the standard deviation is (0, 0.30125). For the 90 lots, the average sample variance is .01054, giving an estimate of σ = .1026. This is well within the confidence interval from lot 40. The 99.8% confidence interval for the resampled means, based on lot 40, is (9.482, 10.001). The overall average of the lot means from the 90 lots is 10.004. Since 10.004 is outside the 99.8% resampled mean confidence interval from lot 40, we reject lot 40 as being out of control. We recall that when we went through our parametric test, we rejected lot 40 on the basis of standard deviation as well as mean. And we should have done so, since we know that lot 40 comes from N(9.8, .09). If we carry through similar testing for lots 28 and 79, we also reject them as being out of control on the basis of the 99.8% mean confidence interval formed from each lot. Again, since we recall that these lots were actually drawn from N(10.4, .03), we see they should have been rejected as contaminated (out of control). On the other hand, we also reject noncontaminated lots 6, 7, 54, 63 and 73 on the basis of the resampled means confidence interval test. These lots were drawn from the in control
distribution N(10.0, .01), and they were not rejected by the parametric normal theory based procedure in Chapter 3. We know that, in SPC, false alarms can really cause havoc when they come at such a high rate. It would appear that for very small lot sizes, resampling confidence intervals based on the individual lots is probably not a very good idea.

What else might we try? Let us make the assumption that all the bolts are drawn from the same in control distribution. We put all 90 × 5 = 450 observations from Table 3.2 into a pool. From this pool we make 10,000 random draws of size 5. The resulting histogram of the 10,000 sample means is shown in Figure 5.12. The 99.8% confidence interval is (9.8614, 10.139). The three contaminated lots, 28, 40 and 79, are rejected as being out of control. No noncontaminated lots are rejected.
Figure 5.12. Bootstrapped "Lot Means."
The histogram of the sample standard deviations of the 10,000 random draws of size 5 is shown in Figure 5.13. The 99.8% confidence interval is (0.01567, 0.20827). Only lot 40 is rejected on the basis of the standard deviation confidence interval. This is exactly what we gleaned from the Gaussian based test in Section 3.3 for the same data. Note that the pooling of all the data into an urn from which we draw lots of size five will tend to inflate estimates of the lot standard deviation.
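A sketch of the pooled resampling scheme is given below. Since Table 3.2 is not reproduced in this chapter, the 90 lots are simulated afresh from the mixture (5.21)-(5.24), so the particular lots flagged will differ from run to run and from the text.

# Pool all 450 observations, resample 10,000 "lots" of size 5, and use the
# 0.1% and 99.9% quantiles of the resampled means and standard deviations as
# 99.8% limits against which the actual lots are compared.
import numpy as np

rng = np.random.default_rng(2)

def simulate_lot(size=5):
    u = rng.random()
    if u < 0.98505:   mu, var = 10.0, 0.01
    elif u < 0.99500: mu, var = 10.4, 0.03
    elif u < 0.99995: mu, var = 9.8, 0.09
    else:             mu, var = 10.2, 0.11
    return rng.normal(mu, np.sqrt(var), size)

lots = np.array([simulate_lot() for _ in range(90)])
pool = lots.ravel()                              # all 450 observations

draws = rng.choice(pool, size=(10_000, 5), replace=True)
mean_lims = np.quantile(draws.mean(axis=1), [0.001, 0.999])
sd_lims = np.quantile(draws.std(axis=1, ddof=1), [0.001, 0.999])

for j, lot in enumerate(lots, start=1):
    m, s = lot.mean(), lot.std(ddof=1)
    if not (mean_lims[0] <= m <= mean_lims[1]) or not (sd_lims[0] <= s <= sd_lims[1]):
        print(f"lot {j}: mean {m:.3f}, sd {s:.3f} outside 99.8% limits")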
Figure 5.13. Bootstrapped "Lot Standard Deviations."

Next, following [10], we could look at the entire set of 90 lots and obtain resampling confidence intervals there against which we could compare the means and standard deviations of each of the 90 lots. In a way, the procedure we shall suggest is closer to the spirit of the normal theory tests of Chapter 3. Although no explicit assumptions of normality are made, the notion that the distribution of the bootstrapped grand means of all 90 lots of 5 is that of a lot of size 5 but with scale changed by 1/√90 allows the central limit theorem induced normality of the grand mean to be implied for the lot means as well. (Of course, if the underlying distribution of the bolts is not too bad, there will already be some moving toward normality by the averaging of five bolts to get the lot mean.) Let us suppose we make the assumption that all the lots come from the "in control" distribution. Then we can take the overall pooled mean of sample means (10.0044), and record all the means from all the lots as differenced from 10.0044, $(\mathrm{diff})_j = \bar{X}_j - 10.0044$. Now, it is an easy matter to use bootstrapping to find a 99.8% confidence interval about the mean of sample means in which the population mean should lie. But, recalling that $Var(\bar{\bar{X}}) = Var(\bar{X}_j)/90$, where j is simply one of the 90 lots, we realize that, under the assumptions given, the standard deviation of the lot means should be taken
to be √90 times that of the mean of lot means. So, to find a confidence interval about 10.0044 in which a lot mean would be expected to fall, with 99.8% chance, if it is truly from the "in control" population, we construct a histogram of the bootstrapped grand differenced (from 10.0044) means multiplied by √90 = 9.4868, with 10.0044 added. The grand differenced means are obtained from the bootstrapped differences defined above. We show this in Figure 5.14.
Figure 5.14. Bootstrapped "Lot Means."

The 99.8% confidence interval (a sort of bootstrapped mean control chart) is given by (9.7751, 10.294). In Figure 5.14, we note that (contaminated) lots 28 and 79 both are out of control. But (contaminated) lot 40 is not recognized as being out of control from the resampled mean test. Next, let us carry out a similar approach for the resampled sample standard deviations. The average sample variance over the entire 90 lots is .01054. The 99.8% confidence interval for σ is (.0045941, 0.20639). Only lot 40 has a standard deviation outside this confidence interval. Between the bootstrapped mean test and the bootstrapped standard deviation test, we have thus found all three out of control lots, and none of the uncontaminated lots is identified as out of control. Consequently, in this case, the bootstrap control chart worked satisfactorily. By using the deviation of lot means from the grand mean,
we have, however, made an assumption which may give us an inflated measure of the in control process variability and cause us to construct confidence intervals which are too wide, thus accepting too many bad lots as being in control. There are other steps we might take. For example, we might use median estimates for the grand mean of sample means and for the overall standard deviation of the in control distribution (see Section 3.3).
Figure 5.15. Bootstrapped “Lot Standard Deviations.”
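For completeness, here is a sketch of the grand-means construction described above, written for a hypothetical vector of 90 lot means: the lot means are differenced from their pooled mean, the grand mean of the differences is bootstrapped, and the result is rescaled by √90 and recentered to give limits against which individual lot means can be compared.

# The "grand means" bootstrap control limits, assuming a 90-vector of lot
# means is available (lot_means below is a hypothetical stand-in).
import numpy as np

rng = np.random.default_rng(3)
lot_means = rng.normal(10.0044, 0.045, size=90)      # stand-in for the 90 lot means

grand = lot_means.mean()
diffs = lot_means - grand

# Bootstrap the mean of the differences, then rescale by sqrt(90) so the
# interval refers to a single lot mean rather than to the grand mean.
boot = np.array([rng.choice(diffs, size=90, replace=True).mean()
                 for _ in range(10_000)])
rescaled = grand + np.sqrt(90) * boot
lo, hi = np.quantile(rescaled, [0.001, 0.999])       # 99.8% "control limits"

flagged = [j + 1 for j, m in enumerate(lot_means) if not (lo <= m <= hi)]
print((lo, hi), flagged)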
5.5 Pareto and Ishikawa Diagrams
In free market economies, we are in a different situation than managers in command economies, for there generally is a “bottom line” in terms of profits. The CEO of an automobile company, for example, will need to explain the dividends paid per dollar value of stock. If it turns out that these dividends are not satisfactory, then he can take “dramatic action” such as having his teams of lobbyists demand higher tariffs on foreign automobiles and instructing his advertising department to launch intimidating “buy American” campaigns. Sometimes, he might take even more dramatic action by trying to build better automobiles (but that is unusual). We note that if the decision is made to improve the quality of
his product then there is the question of defining what it means for one car to be better than another. It is all very well to say that if profits are good, then we probably are doing OK, but a reasonable manager should look to the reasons why his sales should or should not be expected to rise. Uniformity of product is the measure which we will be using to a very large degree in the development of the statistical process control paradigm. But clearly, this is not the whole story. For example, if a manufacturer was turning out automobiles which had the property that they all ran splendidly for 10,000 miles and then the brake system failed, that really would not be satisfactory as an ultimate end result, even though the uniformity was high. But, as we shall see, such a car design might be very close to good if we were able simply to make appropriate modification of the braking system. A fleet of cars which had an average time to major problems of 10,000 miles but with a wide variety of failure reasons and a large variability of time until failure would usually be more difficult to put right. The modern automobile is a complex system with tens of thousands of basic parts. As with most real world problems, a good product is distinguished from a bad one according to an implicit criterion function of high dimensionality. A good car has a reasonable price, “looks good,” has good fuel efficiency, provides safety for riders in the event of an accident, has comfortable seating in both front and rear seats, has low noise levels, reliably starts without mishap, etc., etc. Yet, somehow, consumers manage to distill all this information into a decision as to which car to purchase. Certain criteria seem to be more important than others. For example, market analysts for years have noted that Japanese automobiles seem to owe their edge in large measure to the long periods between major repairs. One hears statements such as, “I just changed the oil and filter every five thousand miles, and the thing drove without any problems for 150,000 miles.” Long time intervals between major repairs make up one very important criterion with American car buyers. Fine. So then, an automotive CEO might simply decide that he will increase his market share by making his cars have long times until major repairs. How to accomplish this? First of all, it should be noted that broad spectrum pep talks are of negative utility. Few things are more discouraging to workers than being told that the company has a problem and it is up to them to solve it without any clue as to how this is to be achieved. A reasonable first step for the CEO would be to examine the relative frequencies of causes of first major repair during a period of, say, three
months. The taxonomy of possible causes must first be broken down into the fifty or so groups. We show in Figure 5.16 only the top five. It is fairly clear that management needs to direct a good deal of its attention to improving transmissions. Clearly, in this case, as is generally true, a few causes of difficulty are dominant. The diagram in Figure 5.16 is sometimes referred to as a Pareto diagram, inasmuch as it is based on Pareto’s Maxim to the effect that the failures in a system are usually the consequence of a few assignable causes rather than the consequence of a general malaise across the system.
Figure 5.16. Failure Pareto Diagram.

What is the appropriate action of a manager who has seen Figure 5.16? At this point, he could call a meeting of the managers in the Transmission Section and tell them to fix the problem. This would not be inappropriate. Certainly, it is much preferable to a general harangue of the entire factory. At least he will not have assigned equal blame to the Engine Section with 203 failures (or the Undercoating Section with no failures) as to the Transmission Section with 27,955 failures. The use of hierarchies is almost inevitable in management. The Pareto diagram tells top management where it is most appropriate to spend resources in finding (and solving) problems. To a large extent, the ball really is in the court of the Transmission Section (though top management would
be well advised to pass through the failure information to the Suspension Section and indeed to all the sections).
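A Pareto diagram of this sort is easy to produce from a table of failure counts. In the sketch below the transmission and engine counts are those cited in the text, while the remaining counts are hypothetical placeholders.

# Build a Pareto diagram: sort the failure counts in decreasing order and bar-plot them.
import matplotlib.pyplot as plt

failures = {"Transmission": 27955, "Brakes": 4200, "Suspension": 2900,
            "Paint": 1800, "Engine": 203}
ranked = sorted(failures.items(), key=lambda kv: kv[1], reverse=True)

labels, counts = zip(*ranked)
plt.bar(labels, counts)
plt.ylabel("Number of Failures")
plt.title("Failure Pareto Diagram (top five categories)")
plt.show()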
Figure 5.17. Transmission Failure Pareto Diagram.

What should be the approach of management in the Transmission Section? The obvious answer is a Pareto diagram (Figure 5.17) on the 27,955 faulty transmissions. That may not be realistic. It is easier to know that a transmission has failed than what was the proximate cause of that failure. We might hope that the on-site mechanics will have correctly diagnosed the problem. Generally speaking, in order to save time, repair diagnostics will be modularized; i.e., there will be a number of subsections of the transmission which will be tested as to whether they are satisfactory or not. Naturally, some of the transmissions will have more than one failed module. Clearly, Module A is causing a great deal of the trouble. It is possible to carry the hierarchy down still another level to find the main difficulty with that module. The problem may be one of poor design, or poor quality of manufacture. Statistical process control generally addresses itself to the second problem. The "cause and effect" or "fishbone" diagram of Ishikawa is favored by some as a tool for finding the ultimate cause of a system failure. Such a diagram might look like Figure 5.18 for the present problem.
Figure 5.18. Fishbone (Ishikawa) Diagram.

The fishbone diagram should not be thought of as a precise flowchart of production. The chart as shown might lead one to suppose that the transmission is the last major component installed in the car. That is not the case. We note that Figure 5.18 allows for free form expression. For example, fishbone diagrams are frequently the product of a discussion where a number of people are making inputs on the blackboard. Each of the paths starting from a box is really a stand-alone entity. We have here developed only one of the paths in detail. We note that in the case of Transmissions, we go down the next level of hierarchy to the modules and then still one more level to the design and quality of manufacturing. In practice, the fishbone diagram will have a number of such paths developed to a high level of hierarchy. Note that each one of the major branches can simply be stuck onto the main stem of the diagram. This enables people in "brainstorming" sessions to submit their candidates for what the problem seems to be by simply sticking a new hierarchy onto the major stem.
5.6 A Bayesian Pareto Analysis for System Optimization of the Space Station

In 1995, one of us (Thompson) was asked to design a theoretical prototype for implementing Statistical Process Control in the construction and operation of the joint American-Russian Space Station.
NASA traditionally has excellent reliability in its design but, at the time, was not engaged in the operational Statistical Process Control paradigm. (Incredible though this may seem, it is not unusual to find excellent prospective engineering design of a system unaccompanied by an orderly process optimization of the created system.) It was an interesting opportunity to design a "start from scratch" operation for a system of incredible complexity.

The foregoing industrial examples bear on system optimization for the Space Station. Yet they differ in important aspects. An industrialist might, if he so chooses, simply allocate optimization resources based on customer complaints. We note that we were dealing with nearly 30,000 cases of transmission complaints alone. We have no such leisure when we consider system optimization of the Space Station. We cannot simply wait, calmly, to build up a database of faulty seals and electrical failures. We must "start running" immediately. Thus, we will require an alternative to a hierarchy of histograms. Yet there are lessons to be learned from the industrial situation.
5.6.1 Hierarchical Structure
First of all, in the case of building a car, we recall that we had a hierarchy of parts of the system to be optimized. We did not simply string out a list of every part in a car. We formed a hierarchy; in the case of a car, we had three levels. In the complexity of the Space Station, we will possibly need to extend the hierarchy beyond three levels, perhaps to as many as six or seven. A top level might consist, say, of structure, fluid transmission, life support, electromechanical function, kinetic considerations and data collection. Again, we note that modern quality control seldom replaces a bolt or a washer. The irreducible level is generally a "module." We would expect such a practice to be utilized with the Space Station also. If we assume that we have a hierarchy of six levels and that there are roughly seven sublevels for each, then we will be dealing with approximately 7⁶ = 117,649 basic module types for consideration. Next, in Figure 5.19 we demonstrate the sort of hierarchical structure we advocate through three levels. Even at three levels, using seven categories at each stage, we would be talking about 7³ = 343 end stages.
Figure 5.19. Three Levels of Hierarchy.
5.6.2 Pareto's Maxim Still Applies
Again, in the case of the Space Station, it would be folly to assume that, at each level of the hierarchy, less than satisfactory performance is equally likely in each category. We do not have experiential histograms to fall back on. Classical flow charting will not be totally satisfactory, at least in the early days of operation. We need an alternative to the (say) six levels of histograms.
5.6.3 A Bayesian Pareto Model
In this section, we follow arguments in [9] with a look back at [2,3,5,10]. Let us suppose that at a given level of hierarchy, the failures (by this we mean any departures from specified performance) due to the k components are distributed independently according to a homogeneous Poisson process. So, if t is the time interval under consideration, and the rate of failure of the ith component is θi, then the number yi of failures in category i is given (see Section B.14) by

f(y_i|\theta_i) = \exp(-\theta_i t)\frac{(\theta_i t)^{y_i}}{y_i!}.   (5.25)

The expected number of failures in category i during an epoch of time length t is given by

E(y_i|\theta_i) = \sum_{y_i=0}^{\infty} y_i e^{-\theta_i t}\frac{(\theta_i t)^{y_i}}{y_i!} = \theta_i t.

Similarly, it is an easy matter to show that the variance of the number of failures in category i during an epoch of time length t is also given, in the case of the Poisson process, by θi t. Prior to the collection of failure data, the distribution of the ith failure rate is given by the prior density:

p(\theta_i) = \frac{\theta_i^{\alpha_i-1}\exp(-\theta_i/\beta_i)}{\Gamma(\alpha_i)\beta_i^{\alpha_i}}.   (5.26)
Then, the joint density of yi and θi is given by taking the product of f(yi|θi) and p(θi):

f(y_i,\theta_i) = \exp(-\theta_i t)\frac{(\theta_i t)^{y_i}}{y_i!}\cdot\frac{\theta_i^{\alpha_i-1}\exp(-\theta_i/\beta_i)}{\Gamma(\alpha_i)\beta_i^{\alpha_i}}.   (5.27)

Then, the marginal distribution of yi is given by

f(y_i) = \frac{t^{y_i}}{y_i!\,\Gamma(\alpha_i)\beta_i^{\alpha_i}}\int_0^\infty \exp\left(-\theta_i\left(t+\frac{1}{\beta_i}\right)\right)\theta_i^{y_i+\alpha_i-1}\,d\theta_i
       = \frac{t^{y_i}\,\Gamma(y_i+\alpha_i)}{y_i!\,\Gamma(\alpha_i)\beta_i^{\alpha_i}(t+1/\beta_i)^{y_i+\alpha_i}}.   (5.28)

Then the posterior density of θi given yi is given by the quotient of f(yi, θi) divided by f(yi):

g(\theta_i|y_i) = \exp[-\theta_i(t+1/\beta_i)]\,\theta_i^{y_i+\alpha_i-1}(t+1/\beta_i)^{y_i+\alpha_i}/\Gamma(y_i+\alpha_i).   (5.29)

Then, looking at all k categories in the level of the hierarchy with which we are currently working, we have for the prior density on the parameters θ1, θ2, ..., θk,

p(\theta_1,\theta_2,\ldots,\theta_k) = \prod_{i=1}^{k}\frac{\theta_i^{\alpha_i-1}\exp(-\theta_i/\beta_i)}{\Gamma(\alpha_i)\beta_i^{\alpha_i}}.   (5.30)

Similarly, after we have recorded over the time interval [0, t], y1, y2, ..., yk failures in each of the modules at the particular level of hierarchy, we will have the posterior distribution of the θi given the yi,

g(\theta_1,\theta_2,\ldots,\theta_k|y_1,y_2,\ldots,y_k) = \prod_{i=1}^{k}\exp[-\theta_i(t+1/\beta_i)]\,\theta_i^{y_i+\alpha_i-1}(t+1/\beta_i)^{y_i+\alpha_i}/\Gamma(y_i+\alpha_i).   (5.31)

It should be observed in (5.26) that our prior assumptions concerning αi had roughly the same effect as adding αi failures at the beginning of the observation period. We note that

E[t\theta_i] = (y_i+\alpha_i)\frac{t}{t+1/\beta_i}.   (5.32)

Furthermore,

Var[t\theta_i] = \left(\frac{t}{t+1/\beta_i}\right)^2(y_i+\alpha_i).   (5.33)
We note that if we rank the expectations from largest to smallest, we may plot E[tθi ] values to obtain a Bayesian Pareto plot very similar to the Pareto plot in Figure 5.16. How shall one utilize expert opinion to obtain reasonable values of the αi and βi ? First of all, we note that equations (5.32) and (5.33) have two unknowns. We are very likely to be able to ask an expert the question, “how many failures do you expect in a time interval of length t?” This will give us the left hand side of equation (5.32). An expression for the variance is generally less clearly dealt with by experts, but there are various ways to obtain nearly equivalent “spread” information. For example, we might ask the expert to give us the number of failures which would be exceeded in a time interval of length t only one time in ten.
5.6.4 An Example
Let us suppose that at the top level of hierarchy, we have seven subcategories. At the beginning of the study, expert opinion leads us to believe that for each of the subcategories, the expected “failure rate” per unit time is 2, and the variance is also 2. This gives us, before any data are collected, αi = 2 and βi = 1. So, for each of the prior densities on θi we have the gamma density shown in Figure 5.20.
Figure 5.20. Priors without Data.
However, after 5 time units have passed, we discover that there have been 100 “failures” in the first module, and 5 in each of the other modules. This gives us the posterior distributions shown in Figure 5.21.
Figure 5.21. Evolving Posterior Distributions.

The posterior on the right (that of the first module) now strongly indicates that the major cause of "failures" is in that first module, and that is where resources should be allocated until examination of the evolutionary path of the posteriors in lower levels of the hierarchy gives us the clue to the cause of the problem(s) in that module, which we then can solve. Perhaps of more practical use to most users would be a Bayesian Pareto Chart, which simply displays the expected number of failures in a time epoch of length t. From (5.32) we note that
E[t\theta_i] = (y_i+\alpha_i)\frac{t}{t+1/\beta_i}.

We show such a chart in Figure 5.22.
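Because the gamma prior is conjugate to the Poisson likelihood, the posterior quantities above reduce to one-line updates. The sketch below (ours, not the book's) runs the example just described, with αi = 2, βi = 1 (shape and scale of the prior as in (5.26)), t = 5, 100 failures in the first module and 5 in each of the others, and prints the posterior means, the E[tθi] values that make up the Bayesian Pareto chart, and the posterior variances.

# Gamma-Poisson update for each module: posterior shape = y + alpha,
# posterior rate = t + 1/beta.  Values are for the worked example above.
alpha, beta, t = 2.0, 1.0, 5.0
counts = [100, 5, 5, 5, 5, 5, 5]             # y_i for the seven modules

for i, y in enumerate(counts, start=1):
    shape = y + alpha                        # posterior gamma shape
    rate = t + 1.0 / beta                    # posterior gamma rate
    mean_rate = shape / rate                 # E[theta_i | y_i]
    var_rate = shape / rate**2               # Var[theta_i | y_i]
    print(f"module {i}: E[theta]={mean_rate:.2f}, "
          f"E[t*theta]={t * mean_rate:.2f}, Var[theta]={var_rate:.3f}")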
Figure 5.22. Bayesian Pareto Chart.

One very valid criticism might have to do with the inappropriateness of the assumption that the rates of failure in each category at a given level of hierarchy are independent. The introduction of dependency in the prior density will not be addressed here, since the study of the independent case allows us conveniently to address the evolution of posterior densities without unnecessarily venturing into a realm of algebraic complexity.
5.6.5 Allowing for the Effect of Elimination of a Problem
It should be noted that when we solve a problem, it is probably unwise to include all the past observations, including data from before the problem was rectified. For example, if we fix the first module in Figure 5.22, then we should discount, in a convenient way, observations which existed prior to the "fix." On the other hand, we need to recognize the possibility that we have not actually repaired the first module. It might be unwise to discount completely those 100 failures in the 5 time units until we are really sure that the problem has been rectified. Even if we did not discount the failures from the time period before the problem was rectified, eventually the posterior distribution would reflect the fact that less attention needs to be given to repairs in the first module. But
"eventually" might be a long time. One way to discount records from the remote past is to use an exponential smoother such as

\hat{z}_i = (1-r)\hat{z}_{i-1} + r z_i,

where a typical value for r is 0.25. Let us consider the data in Table 5.4. Here, a malfunction in the first module was discovered and repaired at the end of the fifth time period. $z_{it}$ represents the number of failures of the ith module in the tth time period, and $z_{i0} = \alpha_i$.
Table 5.4
Module  z_i0  z_i1  z_i2  z_i3  z_i4  z_i5  z_i6  z_i7  z_i8  z_i9  z_i10
1       2     20    18    23    24    15    2     2     0     2     1
2       2     1     1     2     0     1     2     0     1     1     2
3       2     1     2     0     1     1     1     2     1     0     0
4       2     0     2     0     2     1     1     1     1     2     0
5       2     2     1     0     0     2     0     0     2     2     1
6       2     0     2     1     1     1     1     1     1     2     1
7       2     1     1     0     2     1     1     1     1     0     2
Application of the exponential smoother with r = .25 gives the values in Table 5.5.

Table 5.5
Module  ẑ_i0  ẑ_i1  ẑ_i2  ẑ_i3   ẑ_i4   ẑ_i5   ẑ_i6   ẑ_i7  ẑ_i8  ẑ_i9  ẑ_i10
1       2     6.5   9.38  12.78  15.59  15.44  12.08  9.56  7.17  3.29  1.57
2       2     1.75  1.56  1.67   1.25   1.19   1.39   1.04  1.03  1.01  1.75
3       2     1.75  1.81  1.36   1.27   1.20   1.15   1.36  1.27  0.32  0.08
4       2     1.5   1.62  1.22   1.41   1.31   1.23   1.17  1.13  1.78  0.45
5       2     2.00  1.75  1.31   0.98   1.24   0.93   0.70  1.02  1.76  1.19
6       2     1.5   1.62  1.47   1.35   1.26   1.20   1.15  1.11  1.78  1.19
7       2     1.75  1.56  1.17   1.38   1.28   1.21   1.16  1.12  0.28  1.57
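The smoother is a one-line recursion; the sketch below (our own check, not part of the original text) applies it to the counts of Table 5.4 with r = 0.25 and starting value ẑ_i0 = α_i = 2, reproducing, for example, the 6.5, 9.38, 12.78, ... entries for module 1 in Table 5.5.

# Exponential smoothing of the failure counts of Table 5.4 with r = 0.25.
r = 0.25
table_5_4 = {
    1: [20, 18, 23, 24, 15, 2, 2, 0, 2, 1],
    2: [1, 1, 2, 0, 1, 2, 0, 1, 1, 2],
    3: [1, 2, 0, 1, 1, 1, 2, 1, 0, 0],
    4: [0, 2, 0, 2, 1, 1, 1, 1, 2, 0],
    5: [2, 1, 0, 0, 2, 0, 0, 2, 2, 1],
    6: [0, 2, 1, 1, 1, 1, 1, 1, 2, 1],
    7: [1, 1, 0, 2, 1, 1, 1, 1, 0, 2],
}

for module, counts in table_5_4.items():
    zhat = 2.0                               # zhat_{i0} = alpha_i
    smoothed = []
    for z in counts:
        zhat = (1 - r) * zhat + r * z        # zhat_i = (1-r) zhat_{i-1} + r z_i
        smoothed.append(round(zhat, 2))
    print(module, smoothed)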
In Figure 5.23, we show the exponentially weighted Pareto chart at the end of time interval 5 and the exponentially weighted Pareto chart at the end of time interval 10. In Figure 5.24, we show time lapsed exponentially weighted charts for all ten time intervals. It is clear that by the end of the ninth time interval, we should consider relegating module one to a lower level of risk of failures and reallocating inspection resources accordingly. Finally, we might ask whether the Bayesian framework might ever be appropriately replaced, as the Space Station matures, by a more classical control chart strategy applied to a network. The answer is obviously in the affirmative. So far, with the manner in which we have presented
the Pareto histogram, our orientation has really been more to the old-fashioned "quality assurance" paradigm, which is, in practice, utilized much more than Deming control charts. This is the leisurely approach by which managers may attack problems when they are good and ready to do so, and can use long histories of complaints as input. By the incorporation of expert opinion, we have achieved a Bayesian paradigm which allows us to search for problems from the outset. But, as time progresses, it would clearly be in order to move toward control charting.
Figure 5.23. Exponentially Weighted Pareto Charts.
Figure 5.24. Time Lapsed Exponentially Weighted Pareto Charts (one panel for each of time intervals 1 through 10).
5.7 The Management and Planning Tools
All the techniques discussed so far, both those of exploratory character and those directly applicable for quality improvement, are based on data gathered from running processes and are for on-line step-wise optimization of these processes. By far the most important of these techniques are control charts, but the others, i.e., simple run charts, schematic or box plots, histograms, smoothing techniques, Pareto diagrams and cause and effect or Ishikawa diagrams, are indispensable in some situations
too. When looking for correlations or nonlinear interdependencies between pairs of data, scatter diagrams or scattergrams, which depict the pairs in a Cartesian plane, should also be included in the set of such indispensable and simple means (in Chapter 6, we come directly to regression models for bivariate data and hence skip separate treatment of the scattergrams). The seven techniques mentioned, run charts, control charts, schematic plots, histograms, Pareto and Ishikawa diagrams, and scattergrams, are often called the seven SPC tools (some authors prefer to use this term for the set of techniques with schematic plots replaced by data collection sheets; others, ourselves included, consider the latter as tool number zero, since data collection is an obvious and necessary, though nontrivial, prerequisite for any on-line analysis). To these, we have added in Section 5.6 the Bayesian Pareto model which, no doubt, is not so simple, but which is to be used when there is not enough data to begin scrutiny for opportunities for improvement with the standard Pareto diagram.

Within the Shewhart-Deming PDSA cycle, the seven SPC tools form the basic set of techniques for analysis of data coming from the cycle's Study stage. But what about the Plan stage, which is to be performed off-line and whose aim is to plan an innovation? It is wise, when possible, to begin by correctly recognizing what is called the true quality characteristic, or true quality characteristics if more than one are found. Such characteristics, which describe customers' quality requirements, are based on market research, on feedback from customers and on otherwise gathered information on the customers' needs, wants and expectations. As we already know, we can thus call them the voice of the customer, although they go beyond what a customer is able to formulate explicitly, since customers do not have to be well aware of what innovation they will meet with delight, let alone understand the more distant aims of a system which provides the product or service under consideration. As a rule, true quality characteristics are imprecise, being based on impressions and feelings rather than on technical specifications. They are the implicit multidimensional criterion function which we met at the beginning of Section 5.5 in the context of perfecting a car: a car's safety, reliability, fuel efficiency, comfort of seating, etc. (in this example, only fuel efficiency can be considered well defined from the technical point of view). It is the substitute characteristics which give technical flesh to the voice of the customer, as they describe technical requirements to be imposed on a product or service if true quality characteristics are to be met. That
is, they are specifications and functionalities described in technical terms which, when fulfilled, make the voice of the processes conform to the voice of the customer. Since providing substitute characteristics requires the introduction of measures based on some well defined units, it paves the way for a structured and technically feasible step-wise improvement of processes. At least three problems, however, need to be emphasized here. First, as a rule, several substitute characteristics should be considered surrogates for one true quality characteristic (e.g., the comfort of seating must evidently be described by more than one parameter). Second, none of the substitutes is related to its true original in a mathematically rigorous way. Rather, the two are somehow correlated one with another. As a consequence, usually the choice of proper substitute characteristics, as well as their relation to the true characteristic, are established experimentally in a PDSA cycle. And third, the process of translating true quality characteristics into their surrogates is in fact a hierarchical one, most often with many levels. We saw this in a simple example of coming from the requirement for a car to be reliable, and thus have a possibly long time between major repairs, to that of maximizing the time until the first repair. This last requirement can already be considered a substitute and numerically defined characteristic. The cause and effect diagram helped us then to translate that requirement into the requirement to maximize time until first fault in, or in fact to improve, the transmission. Of course, that was not the end but rather a signal to begin the work on finding ways to improve the transmission, by properly stating aims and technical requirements for transmission modules and submodules and, finally, by turning to improving critical processes. A structured method in which customer requirements are translated into appropriate technical requirements for each stage of product development and production, as it is put by the Big Three (Chrysler, Ford Motor and General Motors) in [4], is referred to as quality function deployment (QFD). More specifically, in the document of the Big Three, QFD is decomposed into two dimensions: quality deployment, which consists in translating customer requirements into Product Design Requirements, and function deployment, which amounts to translating design requirements into appropriate Part, Process and Production Requirements. One can readily rephrase the above definition so that it fits the service industry context. In the example of Section 5.5, the situation was rather simple. Hints gained from market research have led to a single most important
substitute characteristic, and a clever use of the cause and effect diagram sufficed to deploy further stages of QFD. Essentially, one was faced with the need to improve existing processes. In general, e.g., when a true innovation or a breakthrough in design is sought, QFD requires more effort. In fact, already defining the true quality characteristics may be far from obvious. In particular, it is a different story to think of improving one of the products or services delivered by a well established company with its rather stable market share and to undertake an effort to gain an edge in the market and substantially enlarge this very company's share in some way. It is also not easy to formulate a strategy for achieving a good and stable position in the competitive market by a newly established company which, in its relentless efforts to gain new customers, has been caught by too many challenges and has fallen into the state of fighting fires which spread everywhere in the company almost without a break (it is still another story that, had the company started its operation from the outset with a sound program for continual improvement, it would not have ended in the state mentioned). Moreover, it is not always easy to see why a seemingly good product or service fails to bring customers, and profit, to a company. What is wrong with a French bakery and confectionery with a café, whose location is good for this type of product, quality of food is very good, prices are OK, no other offer of the same type can be found around, and yet the owners complain that only a few customers come?

There are many ways of providing a precise structure for planning a change in general, when the issue of what to accomplish needs to be worked out from scratch. We shall discuss briefly a planning model as described in [6]. In the model, the Plan stage of the PDSA cycle is decomposed into four phases: issue definition, analysis of action, organization of action and contingency analysis. The plans are developed using the so-called seven management and planning tools, also known as the seven (Japanese) new tools. These are: the affinity diagram, the relations diagram or interrelationship digraph, the tree or systematic diagram, the matrix diagram, the arrow diagram, the process decision program chart (PDPC; sometimes another tool for contingency analysis is considered one of the seven tools) and matrix data analysis (in [6] a glyph is mentioned instead). Needless to say, the Plan stage requires work by multi-disciplined teams if the efforts are to be successful.

In planning phase I, devoted to defining the issue, the affinity diagram and relations diagram are of help. Clearly, at the very start of any
project, there has to be clear and common understanding and agreement of what is to be accomplished. Common agreement is needed as to the issue statement. Given that, information about the situation has to be gathered and organized into logical groupings. This information is mostly if not only verbal but, of course, it is welcome in the form of numerical data whenever such are available. Already the issue statement is not a trivial matter. It should facilitate thorough exposition of all the relevant issues and problems, not immediate answers, let alone proposals for action. First a detailed identification of problems is needed. It is noticed in [6] that good issue statements begin with phrases like: factors that influence ..., elements of ..., what are the issues involved in ...?, what would be the characteristics of success of ...?, what makes ... effective?, what are the barriers/problems involved in ...? Having agreed on the issue statement, a team has to turn to generating and organizing information on that issue. It is the affinity diagram which helps in the matter. By open brainstorming team members come up with as many ideas as possible. The best way is to write them all on separate cards, to be referred to as the idea cards. In an organized manner, the idea cards are in turn grouped into logical clusters. This is an iterative process, with clusters sometimes broken up and recombined whenever appropriate. Within a cluster, subclusters can be built as well. Clusters consisting of just one idea are also a possibility. In turn, clusters are defined, that is, are given headings which encompass the meaning of the cluster (i.e., the common meaning of the idea cards in it). The headings are written down on the so-called header cards, so that the whole diagram could be posted on a wall or placed on a table in a neat and diagrammatic form. If, despite a long discussion, the heading for a cluster cannot be agreed upon, the team should move to another cluster. If, after returning to the cluster later, the team still cannot come up with a heading capturing the meaning of the cluster, the cluster has to be broken up and some logically consistent solution found. When ready, an affinity diagram may have the form like that sketched in Figure 5.25. The affinity diagram, just as the rest of the seven new tools, can be applied to help provide a structured plan for solution of any problem faced by a company, whether it is a strict sense technical problem, that concerning a service segment or an organizational problem. The authors do not know of a company which, after having decided to implement the paradigm of continual improvement, did not turn to problems with internal cooperation within the company.
Figure 5.25. Affinity Diagram.

In one such company, a team has come up with an affinity diagram comprised of the following five clusters (in parentheses, some or all problems which were included in a cluster are given): problems with superior-subordinate relations (lack of respect for a subordinate, hidden behind a mask of courtesy; distaste for meeting and discussion with subordinate; orders without explanation how to achieve the goal set; no cooperation with superiors; fear of asking a superior how to fulfill an order; frequent changes of superior's opinion on people and quality of work; frequent changes of orders and task assignments, seemingly without reason; subordinates' inclination to cheat; and others); problems with flow of information (superiors provide incomplete information; people learn of problems of their concern from peers, not from superiors; people receive incomplete instructions; information reaches rank-and-file with delay); poor organization of a company (a subordinate happens to receive contradictory orders from immediate and higher superiors; people responsible for one process impose their will on those working on other processes; lack of cooperation between departments); problems with peer-to-peer relations (peers treated as intruders; insensitivity to problems communicated by a peer working on the same or related process); personal flaws (blindness to one's own errors; inconsistency between words and deeds; distaste for cooperation; lack of creativity; insistence on repeating the same errors). Interestingly, the team proved unwilling to ascribe lack of creativity to poor relations between superiors and subordinates or some other flaw in the company as a whole and decided to include it in the
© 2002 by Chapman & Hall/CRC
212
chapter 5. exploratory techniques
last cluster. The relations diagram or interrelationship diagraph has been developed to identify and describe cause and effect relationships between components of an issue or a problem. It does not have to but can be, and usually is, obtained from an affinity diagram. If it is, one begins by displaying the issue statement above affinity headings arranged in a circle, as depicted in Figure 5.26.
Issue statement
Clus terheading heading Affinity
Cluster heading Affinity heading
Clus ter heading Affinity
Affinity heading
Cluster heading heading Affinity
Affinity heading
Figure 5.26. Towards Relations Diagram.
Each affinity heading describes some (clustered) idea (to be referred to as a factor). Now, one draws lines connecting related factors, and adds an arrow head to each line, thus indicating the direction from cause to effect. Once all such lines have been drawn, one includes at the bottom of each factor the number of arrows pointing away and pointing toward the factor (i.e., number away/number toward). This is the way to identify ideas which are the key cause factors (these are the factors with the most arrows pointing away from them) and those which are the key effect factors (i.e., the factors with the most arrows pointing toward them). In Figure 5.27, the key cause factors are those with 6 arrows pointing away and the key effect factor is the one with 6 arrows pointing toward it. As a rule, the key cause factors are marked by heavy lines around them.
© 2002 by Chapman & Hall/CRC
the management and planning tools
213
Issue statement
Affinity Clus terheading heading 0/6
Cluster heading Affinity heading 2/4
Cluster heading Affinity heading 1/2
Clus terheading heading Affinity 6/1
Affinity heading 2/3
Cluster heading heading Affinity 6/1
Affinity heading 3/2
Affinity heading 2/3
Figure 5.27. Relations Diagram. We encourage the reader to return to our example with the French bakery and confectionery with a caf´e and create a plausible affinity diagram for finding factors which influence the lack of enough customers. Clearly, one can think of clusters associated with service both in the shop and in the caf´e, with the menu in the caf´e, product variety in the shop, indoor and outdoor outlook, advertising, etc. Given the affinity diagram, one can turn to the relations diagram. It is always a good idea to display the factors from left to right, with the key cause factors at the left end and the key effect factors at the right one. Of course, many variations of the above description of the affinity diagram can be used to advantage. For example, sometimes it may be recommended to begin with all idea cards from the affinity diagram, not only with the header cards. In any case, the obvious purpose of the relations diagram is to help plan priorities of actions. This brings us to the second phase in planning, namely to the analysis of action or to answering the question what to do to accomplish an objective. The tree or systematic diagram is a tool to facilitate the task. If the affinity and relations diagrams were developed earlier, it is the latter which helps select the objective of immediate interest. The former can then be used to start constructing the tree diagram by referring to
© 2002 by Chapman & Hall/CRC
214
chapter 5. exploratory techniques
the ideas from the affinity cluster which has been chosen as the objective to deal with. The tree diagram is begun from the left and goes to the right, from general to specific. The initial (most general) objective is placed leftmost. Given an objective at a certain level of the hierarchy, one asks what actions to take to achieve this objective or what are this objective’s subobjectives. In this way, the ideas displayed in the diagram go not only from general to specific but also from objectives to actions. The tree, as sketched in Figure 5.28, should be complete, that is, it should include all the action items needed to accomplish the main objective.
Figure 5.28. Tree Diagram. In particular, if a rightmost item is an objective, not an action, the branch with this item is not complete — it has to be continued to the right, to answer how to achieve the objective in question. On the other hand, the tree should be pruned if some branches prove, upon final scrutiny, not necessary for achieving the main objective.
© 2002 by Chapman & Hall/CRC
215
the management and planning tools
One should note that the difference between an objective and an action may not be clear from the outset. If, e.g., the rightmost item in a branch is “lessen the working load of truck drivers,” while it sounds like an action, it is usually to be considered an objective – ways how to lessen the load mentioned are the actions which should be put at the end of the branch. It is good to consider actions as a means of achieving some objective. A means for achieving an objective, then, becomes an objective for the next level of the tree, as displayed in Figure 5.29, the process being continued for as long as found desirable.
Objective
Means
Objecti ve
Means
Objecti ve
Means
Figure 5.29. A Way to Develop Tree Diagram.
In turn, once it is agreed what to do, it is necessary to decide how to do what is to be done. Thus, phase III, i.e., organization of action, comes into play. In this context, one can mention the matrix diagram and arrow diagram as tools whose purpose is to help organize action. The matrix diagram has in fact much wider applicability. We shall not dwell on it, but the reader will readily realize that the matrix diagram can, for example, be used to relate true quality characteristics to substitute characteristics or to set priorities. If one aims at relating whats with hows, one can use the so-called L-matrix, given in Figure 5.30, where the symbols used (which can also be numerical scores) describe the strength of relationship between two factors involved. Whats and hows can be considered rather loosely. For instance, an L-matrix can be the responsibility matrix, when whats amount to tasks and hows to people who can have primary responsibility for a given task, secondary responsibility or who should only be kept informed on how the task is being performed (the given levels of responsibility can then be represented by the symbols in the figure).
© 2002 by Chapman & Hall/CRC
216
chapter 5. exploratory techniques hows a
b
c
d
e
A whats
B C D Very strong relationship Strong relationship Weak relationship
Figure 5.30. L-Matrix. A more complex matrix can be used if one wants to see the relationship between two characteristics given a third one. For example, using the socalled T-matrix, one can relate simultaneously problems (factors) of our concern, countermeasures needed (whats) and action taken (hows), as sketched in Figure 5.31. In the figure, for illustration, only one combination of the three variates of interest is described. It follows for this combination that, if we are interested in factor or problem d, countermeasure 1 has a strong influence on this factor and, in turn, this countermeasure can very well be implemented by action E.
Factors/Problems
a b c d e f 1
Countermeasures
2
3
A
Actions taken
B C D E F G
Figure 5.31. T-Matrix.
© 2002 by Chapman & Hall/CRC
4
5
217
the management and planning tools
The Arrow diagram is an extremely efficient tool for providing a schedule for completing a project or process. In fact, it is a flowchart with explicitly displayed time to complete a task (Time in Figure 5.32), the earliest time the task can be started (EST in Figure 5.32) and the latest time the task can be started so that the whole project is finished on schedule (LST in Figure 5.32). If EST=LST for a given task, then there is no slack time for this task. Such tasks form a critical path (see thick line in the diagram). Task 1
18
0 0
Task 2 20
Task 3
18
10
18
18 28
Task 4
Task 12
E ST Time
38 38
LST
Figure 5.32. Arrow Diagram (Initial Part). No project is immune to contingencies. Accordingly, carefully developed plans are needed either to prevent potential problems or to arrange for quick recovery from problems that do arise. The tools used most often to construct such plans are process decision program chart and failure mode and effects analysis. We shall skip their discussion and confine ourselves to referring the reader to [6] for a brief description of the former and to [4] for a comprehensive exposition of the latter (in the context of the automobile industry). The last of the seven managerial and planning tools is the matrix data analysis. By that, one usually means a simple graphical tool which depicts bivariate data in a Cartesian plane. The data can, for example, be two dimensional quality characterizations of shops which are considered by our French bakery and confectionery as potential subcontractors to
© 2002 by Chapman & Hall/CRC
218
chapter 5. exploratory techniques
sell its product in some region. If we want to characterize potential subcontractors by more than two characteristics, we can use, e.g., a glyph. There are several versions of glyphs, and we shall describe only one of them. Let a circle of fixed (small) radius be given. Further, let each characteristic be represented by a ray emanating from the circle, the rays being equally spaced in the plane. Now, to each subcontractor and to each of its characteristics there corresponds some value of this characteristic which can be represented as a point on the corresponding ray. Joining up points representing a subcontractor, one obtains a polygon describing this subcontractor. For the sake of clarity, in Figure 5.33 polygons for only two subcontractors are displayed. Location Better
Decor
Prices
Sales
Service
Figure 5.33. Glyph. A glyph is thus a simple but efficient graphical means to compare objects described by multidimensional quality characteristics. If too many objects are to be displayed around one circle to retain clarity of the picture, each object can be assigned a separate circle (of the same radius) and the polygons obtained for the objects can be juxtaposed in some order. Let us note that the quality characteristics do not have to be numerical. If they are not, some numerical scores have to be assigned to possible “levels” of a characteristic. Whether characteristics are numerical or not, when interpreting a glyph one has to be careful about scales used for different characteristics. It is best when the scales can be made at least roughly equivalent. In our discussion of the seven new Japanese tools we have essentially followed [6], i.e., the way they are taught by the British experts whose approach is inspired by the theories
© 2002 by Chapman & Hall/CRC
references
219
of W. Edwards Deming. Needless to say, there are differences, however minor, in how different authors see and use these tools. For instance, the experts of the Japanese Union of Scientists and Engineers prefer a slightly different construction of the relation diagram. In any case, however brief, the given account of the seven managerial and planning tools already suffices to show the reader their great explanatory and analytic power. No wonder that they have become an indispensable set of means for planning purposes.
References [1] Efron, B. (1979). “Bootstrap methods—another look at the jackknife,” Annals of Statistics, pp. 1-26. [2] Martz, Harry F. and Waller, R. A.(1979). “A Bayesian zero-failures (BAZE) reliability analysis,” Journal of Quality Technology, v.11., pp. 128-138. [3] Martz, H. F. and Waller, R. A. (1982). Bayesian Reliability Analysis. New York: John Wiley & Sons. [4] QS-9000 Quality System Requirements. (1995) Chrysler Corporation, Ford Motor Company, and General Motors. See in particular parts 7 and 4: Advanced Product Quality & Control Plan and Potential Failure Mode Effects Analysis. [5] Simon, J.L.(1990). Resampling Stats. Arlington: Resampling Stats, Inc. [6] The Process Manager. (1996) Process Management International Ltd. [7] Thompson, J. R. (1985). “American quality control: what went wrong? What can we do to fix it?” Proceedings of the 1985 Conference on Applied Analysis in Aerospace, Industry and Medical Sciences, Chhikara, Raj, ed., Houston: University of Houston, pp. 247-255. [8] Thompson, J.R. (1989). Empirical Model Building. New York: John Wiley & Sons, pp. 133-148. [9] Thompson, J.R. and Walsh, R. (1996)“A Bayesian Pareto analysis for system optimization” in Proceedings of the First Annual U.S. Army Conference on Applied Statistics, pp. 71-83.
© 2002 by Chapman & Hall/CRC
220
chapter 5. exploratory techniques
[10] Thompson, J.R. (1999). Simulation: A Modeler’s Approach. New York: John Wiley & Sons, pp. 221-222. [11] Tukey, J.W. (1977). Exploratory Data Analysis. Reading: AddisonWesley. [12] Waller, R. A., Johnson, M.M., Waterman, M.S. and Martz, H.F. (1977). “Gamma Prior Distribution Selection for Bayesian Analysis of Failure Rate and Reliability,” in Nuclear Systems Reliability Engineering and Risk Assessment, Philadelphia: SIAM, pp. 584-606.
Problems Remark: In Problems 5.1-5.3, a statistical software of the reader’s choice has to be used. One of the aims of solving these problems is to gain a better understanding, and appreciation, of random phenomena. In particular, the problems show that statistical inference requires special caution when the size of a sample is small. Whenever a histogram is to be constructed, the reader is asked to use a default bar width first, and then to try using several other widths of his or her choice. Problem 5.1. Generate two independent samples of size 100 each from the standard normal distribution. a. Construct the schematic plots, normal probability plots and histograms of both samples. b. Transform one of the original samples to a sample from N (1, 1), and combine the sample obtained with the other original sample into one sample of size 200. Construct the schematic plot, a normal probability plot and a histogram for the last sample. c. Transform one of the original samples to a sample from N (5, 1), and combine the sample obtained with the other original sample into one sample of size 200. Construct the schematic plot, a normal probability plot and a histogram for the last sample. d. Transform one of the original samples to a sample from N (0, .01), transform the other sample to a sample from N (1, .01) and combine the two samples obtained into one sample of size 200. Construct the schematic plot, a normal probability plot and a histogram for the last sample. e. Compare the results for a through d. Repeat the whole experiment several times to find what is typical of the results and what is rather due
© 2002 by Chapman & Hall/CRC
221
problems
to chance. f. Repeat the whole experiment (a through e) starting with two samples of size 50 from the standard normal distribution. g. Repeat the whole experiment (a through e) starting with two samples of size 25 from the standard normal distribution. h. Construct the schematic plots, normal probability plots and histograms for the data set x1 of Table 3.2, Table 3.3 and Table 4.3. Comment on the results obtained. Problem 5.2. Consider the following sample of random variates drawn from an unknown probability distribution. Lot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
x 0.947 2.468 0.937 1.707 4.592 6.744 0.618 1.564 1.954 0.489 2.701 0.417 0.302 0.763 0.163 0.647 0.365 0.590 0.401 1.258 0.764 0.497 1.246 0.709 2.136
Lot 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
x 0.421 0.108 0.305 1.103 2.683 0.407 0.949 3.686 1.541 0.731 4.281 0.328 2.460 2.086 0.233 2.137 1.046 3.294 1.173 2.011 0.236 4.794 1.435 3.997 3.066
a. Use the schematic plot to verify that the data distribution is skewed. Try to symmetrize the distribution transforming the data to its square root, fourth root, natural logarithm and, possibly, using other transformations. b. Use the normal probability plot to verify whether the data distribution can be readily transformed to an approximately normal distribution.
© 2002 by Chapman & Hall/CRC
222
chapter 5. exploratory techniques
c. Given that the data comes from a lognormal distribution, is it possible to find the transformation that transforms the data to a normally distributed sample? Problem 5.3. Consider the following sample of random variates drawn from an unknown probability distribution. Lot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
x 0.003 0.816 0.004 0.286 2.323 3.643 0.232 0.200 0.449 0.511 0.987 0.764 1.435 0.073 3.293 0.190 1.013 0.278 0.833 0.053 0.072 0.488 0.049 0.118 0.576
Lot 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
x 0.750 4.966 1.410 0.010 0.974 0.806 0.003 1.702 0.187 0.098 2.115 1.239 0.810 0.540 2.124 0.576 0.002 1.421 0.025 0.488 2.084 2.457 0.130 1.919 1.255
a. Use the schematic plot to verify that the data distribution is skewed. Try to symmetrize the distribution attempting several transformations of the data. b. Use the normal probability plot to verify whether the data distribution can be readily transformed to an approximately normal distribution. c. Given that the data comes from a χ2 distribution, is it possible to find the transformation that transforms the data to a normally distributed sample? Problem 5.4. Preliminary analysis of a criterion function of an unknown analytical form has to be performed. Function values can be
© 2002 by Chapman & Hall/CRC
223
problems
observed, but each observation is corrupted by a random error. Below, the set of such “noisy” observations is given for the function’s argument, x, varying from .1 to 3.
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5
f (x) 0.816 0.511 0.561 0.339 0.220 0.345 0.142 -0.108 0.287 -0.188 -0.110 -0.008 0.177 0.343 0.182
x 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0
f (x) 0.231 0.606 0.510 0.730 1.048 1.074 1.295 1.646 1.874 2.257 2.576 2.829 3.417 3.568 4.130
Use the 3R smooth algorithm to remove the jitters. Given that the data constitute noisy observations of the quadratic function (x − 1)2 , does the smooth enable one to better approximate the true shape of the function? Problem 5.5. In the following table, noisy observations of the quadratic function (x − 1)2 are given. For each x, the observation is in fact the sum of the function value and a normally distributed random variate with mean 0 and standard deviation .3. Use the 3R smooth to remove the jitters. Does the smooth enable one to better approximate the function? Given that the noise in Problem 5.4 was normally distributed with mean 0 and standard deviation .1, compare the two results of using the smooth.
© 2002 by Chapman & Hall/CRC
224
chapter 5. exploratory techniques x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5
f (x) 0.831 0.747 0.437 0.347 0.722 0.199 -0.038 -0.408 0.787 0.396 0.123 -0.170 -0.291 -0.093 0.307
x 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0
f (x) 0.121 0.909 0.832 1.071 0.487 0.463 1.304 1.232 2.105 1.767 3.088 2.676 3.673 3.617 3.690
Problem 5.6. Use bootstrap tests for the mean and standard deviation to determine which lots in the highly contaminated data set in Table 3.3 are out of control. Problem 5.7. We have a system consisting of five independent modules which experience failures according to Poissonian flow. We incorporate our prior information about the distribution of θ in the form of Gamma distributions appropriately parameterized. Each of these has prior expectation of 4 and variance also 4 for modules 1−4. For module five, the prior expectation is 2 and the variance is 6.The failure rates are given, for the first eight time intervals by: Module 1 2 3 4 5
zi1 2 0 1 0 4
zi2 1 1 0 0 1
Table 5.6 zi3 zi4 2 24 2 0 0 1 0 2 5 6
zi5 15 1 1 1 7
zi6 2 2 1 1 5
zi7 2 0 2 1 6
zi8 0 1 1 0 8
Create time lapsed exponentially weighted Pareto charts for the first eight time intervals for each module. Problem 5.8. Return to the problem of the French bakery and confectionery with a caf´e from Section 5.7 and create a plausible affinity diagram for finding factors which influence the lack of enough customers. Given the diagram, complete the Plan stage of the PDSA cycle for finding ways out of the problem.
© 2002 by Chapman & Hall/CRC
Chapter 6
Optimization Approaches 6.1
Introduction
Our activities to this point have largely been involved in the utilization of time indexed glitch information to find qualitative causes of excess variation and fix them. Our principal goal has been the reduction of variability. Our modes of intervention might be typically qualitative in nature, such as providing training for a joint replacement team or replacing faulty bearings. In other words, the use of control charts can be looked upon as a way to bring about conformity to the system as it was designed, always keeping an eye open to the possibility that improvements in the system over those suggested by the design are possible. And, speaking realistically, control charts are useful in improving systems which, if truth be told, were never fully designed, for which no complete flow chart exists. Our experiences in the United States and Poland indicate that such systems make up a large fraction of those in operation. But what of “fine tuning”? That is, how shall we decide how to make small quantitative changes in the level of control variables in order to improve a process? And we recall that this improvement may be measured by an infinite number of possible criteria. These could include, say, a decrease in the deviations of the output product from specification. Or, we might use per unit cost of the output product. Or, we might use amount of product produced per unit time. And so on. Typically, we will be looking at several criteria simultaneously. In the best of all possible worlds, optimization by changing continuous control variables would be well posed as a constrained optimization prob225
© 2002 by Chapman & Hall/CRC
226
chapter 6. optimization approaches
lem. For example, we might seek to maximize our total product output subject to the constraint that average squared deviation from product specification was below a specified amount and unit cost was kept below a certain level. Life is usually not so simple. A natural criterion function is generally not available. And it is frequently a mistake to attempt a strategy which pits quantity of output against quality. To view statistical process control as, somehow, a branch of “control theory,” i.e., that field of engineering and applied mathematics which deals with well posed problems in constrained optimization, is a serious mistake, which can lead to disaster. To utilize the “Control Theory” approach most effectively, we need to make a number of assumptions of “givens.” These include but are not limited to: 1. A natural and well posed criterion function to be maximized (or minimized). 2. A natural set of constraints for various output variables. 3. A well posed mathematical model of the mechanism of the process. Our experience indicates that it is seldom the case that any of these “givens” is actually available in other than a crude approximation form. Perhaps that is the reason that control theoretic approaches to SPC have usually proved to be so disappointing. Returning again to the “best of all possible worlds” scenario, let us suppose that we have given to us a function of k variables subject to m constraints: (6.1) Max f (X1 , X2 , . . . , Xk ) subject to gi (X1 , X2 , . . . , Xk ) ≤ 0 ; i=1,2,. . . ,m .
(6.2)
In the rather unusual situation where we explicitly know f and the m constraints, there are a variety of algorithms for finding the value of (X1 , X2 , . . . , Xk ) which maximizes f subject to the constraints (see, e.g., Thompson and Tapia [10], 253-286). In the real world, we will usually not know the functional relationship between f and (X1 , X2 , . . . , Xk ) . In many situations, if we input specific values of the control variables (X1 , X2 , . . . , Xk ), we will not be able to see f explicitly, but rather a value of f corrupted by experimental error. In other words, we will usually not know the function f and will only be
© 2002 by Chapman & Hall/CRC
introduction
227
able experimentally to observe pointwise evaluations of f corrupted by noise. Let us then consider the model Y (X1 , X2 , . . . , Xk )(j) = f (X1 , X2 , . . . , Xk ) + (X1 , . . . , Xk )(j)
(6.3)
where (X1 , X2 , . . . , Xk )(j) is the error at time j at a particular level of (X1 , X2 , . . . , Xk ). We shall assume that the errors have zero averages and a variance which may be dependent on the level of the control variables (X1 , X2 , . . . , Xk ). We shall assume that the lot samplings are sufficiently separated in time that an error at any time j is stochastically independent of an error at any other time i. So, then, the situation is that we can construct an experiment, by setting the levels of (X1 , X2 , . . . , Xk )(j) to particular values and then observing a somewhat degraded version of f (X1 , X2 , . . . , Xk ), namely Y (X1 , X2 , . . . , Xk )(j). We note here that we have avoided assuming as known the functional form of f . One possible f of interest would be the variability of the output process. This is a quantity we would surely wish to minimize. The situation confronting the production staff would be, typically, that they believe they know roughly the value of (X1 , X2 , . . . , Xk ) which gives good performance. The notion that the people in production will be ecstatic over anybody’s notion that we should “fool around” with the process in order to achieve some improvement is naive. We recall that we observed that once a process was “in control,” typically it was not a great deal of trouble to keep it in control. Changing the settings of key variables can be very risky. We have never experienced any production foreman who was keen on departing from the notion that “if it ain’t broke, don’t fix it.” If we knew a functional relationship between process variability and various variables whose levels we are free to control, then the optimization process would simply be numerical. Naturally, we would probably have a function sufficiently complex that we would require a computer to find numerically the value of the control variables which minimized the variance. That means, that to find the value of the function for a given value of the control variates, we would need to input the control variables into a computer. But in the real world we do not generally expect to know a good mathematical model of the variance. So, to find the variance for a given level of the control variables, we actually would need to use the process itself as our computer, namely we would have to run the process at the level of the control variables.
© 2002 by Chapman & Hall/CRC
228
chapter 6. optimization approaches
If the records of the process have been carefully recorded in user friendly fashion, then we may have some feel as to what levels of the control variables are likely to minimize the variability of the output process. There are always glitches in the control variables, times when their values are different from the standard. Perhaps at some of these glitches, it was noted that the variance appeared to be less. If such information is available, then it should be utilized in our search for minimum variance conditions.
6.2
A Simplex Algorithm for Optimization
In the following example, we shall deal with a continuous chemical reaction running in aqueous medium,
M+N⇒O. For this reaction, the input compounds M and N are converted into product O. The conversion is, essentially, complete. But there are small traces of other, nonharmful compounds, which are by-products of the reaction, and these occur in very small amounts. This affects, somewhat, the output concentration of O (target value is 50 parts per thousand). The major factors in the reaction are the amounts of the input compounds introduced, and we shall not be dealing with these in this study. However, it has been suggested by some of the production staff that changes in the temperature and/or acidity (pH) of the solution might produce a less variable output concentration of O. Current operating conditions are 30 degrees Celsius and a pH of 7.2. We have observed, in the past, operating temperatures as low as 20 and as high as 40, and pH values as low as 5 and as high as 8.5. It appears that within these ranges the reaction runs essentially to completion and that satisfactory product is produced. But we would like to see what might be done to minimize lot to lot variation, by selecting “optimal” values for temperature and pH. How shall we proceed to find the optimal value of (T,pH) (if, indeed, there is one)? The first approach we shall use is a bit in the time honored engineering approach of “flailing around” (called “random search” by some). But, rather than using random search, we shall employ the more orderly “simplex method” of Nelder and Mead [6] (not, of course, to be confused with the quite different simplex algorithm of linear programming). In order to proceed, we need to have collected lot output data for one more
© 2002 by Chapman & Hall/CRC
a simplex algorithm for optimization
229
set of settings of the control variables vectors than is the dimension of the control variable. Here, the dimension of the control variable to be optimized is two (Temperature and pH). So we shall use three different settings and collect at each setting, say, ten lot measurements of output in parts per thousand. The (Temperature, pH) settings initially are (30, 7.2), (40, 6.5), and (35, 6.0). We note that, as it has turned out, the highest sample variance corresponds to (30, 7.2), so we shall designate this as the W(orst) point on the graph in Figure 6.1. The second highest variance was obtained in the lots with (Temperature, pH) set to (40, 6.5); so we shall call this Second Worst point, 2W. The lowest variance was obtained for (35, 6.0), which we designate in Figure 6.1 as the B(est) point. Now the Nelder-Mead approach is to replace the W(orst) point, by another, superior set of conditions, which we find in an orderly fashion. So, next, let us find the midpoint of the line segment joining the B and 2W points and call this point the C(entroid) of all points except for the W(worst). We would like to go away from the W(orst) point to something better. So, let us proceed from W to C (with coordinates, here, of (37.5, 6.25) ) and then continue the line to P, which is as far from C as C is from W. This point, P(rojection), with coordinates (45, 5.3) gives the conditions for our next experiment. In Table 6.1, we note that the variance has increased to a level which, not only is not less than that for B, but is actually worse than that for W and 2W. Accordingly, we move to point PP, the midpoint between points W and 2W (and the rest goes as is), carry out ten experiments, and, happily, find that the variance has dropped to a new low. Shortly, we shall describe the Nelder-Mead “enlightened search” algorithm more generally, but here we only make a few observations about it. First of all, we note that the algorithm does not avail itself of the actual values of the variances at W, 2W, B, P and C, only whether the values there are greater or lower when compared internally. Thus, we are sacrificing some information in order to obtain robustness of the algorithm. Those familiar with Newton method approaches will note that we have denied ourselves the possibility of making the giant steps to the optimum associated with Newton methods. We do this, in part, because of the contamination of the underlying function value (the variance) by noise, and also because we do not know the functional relationship between the variance and the temperature and pH, so that derivative information is not available. Next, although we have chosen not to proceed with the algorithm further here, it is clear that, typically, many iterations will be
© 2002 by Chapman & Hall/CRC
230
chapter 6. optimization approaches
required to arrive at a (local) minimum variance condition. The simplex algorithm is one of slow envelopment and collapsing toward an optimum. Recall that we are not evaluating a function on a computer. We are carrying out a real-world experiment. The Nelder-Mead approach is, frequently, a rough and ready method for making a satisfactory process better. W
(30, 7.2)
PP (35, 6.85)
(40, 6.5) 2W C
(37.5, 6.25)
B (35, 6.0)
P (45, 5.3)
Figure 6.1. Nelder-Mead Experimental Optimization.
Points 1 2 3 4 5 6 7 8 9 10 x ¯ s2
© 2002 by Chapman & Hall/CRC
W 46.6 52.5 47.2 52.9 51.8 52.0 50.9 49.5 49.3 46.9 49.960 5.827
Table 6.1 2W B 48.1 49.5 49.1 51.2 47.4 48.3 48.4 49.5 52.3 52.2 51.6 51.8 48.2 49.4 51.9 52.2 50.8 48.1 52.2 47.7 50.000 49.990 3.769 3.001
P 52.8 45.8 51.5 54.1 46.0 46.4 54.3 50.4 51.5 47.2 50.000 11.360
PP 50.6 48.4 50.7 50.8 49.1 51.1 50.7 48.5 48.9 51.3 50.010 1.301
a simplex algorithm for optimization
231
Let us note also that, obviously, the control variables are never allowed to assume all values from minus to plus infinity. Most often, they are confined to finite intervals. The requirement that the Nelder-Mead algorithm never leave an admissible set of values of the control variables can be satisfied in a very simple way. Whenever a function evaluation outside of the admissible set is needed, the function value at such a point is declared to be “worse” (e.g., smaller when function maximization is considered) than the function value at W. Thirty-five years ago, an evolutionary method of operational improvement of various production criteria was proposed by Box and Draper [2] and given the name of Evolutionary Operations (EVOP). It was to be a means of continuing institutionalized optimization of processes within a plant. The production staff would simply expect, all the time, to take processes in control and jiggle them, so that improved strategies could be devised. Although the idea behind EVOP is quite intriguing, in retrospect it should not be surprising that EVOP is underutilized all over the world. Jiggles happen, and information obtained when they do occur should be carefully recorded and utilized. But production staffs are entitled to expect that most production time will be spent in an “in control” situation. To expect production personnel to function almost continuously “on the edge” is not particularly reasonable or desirable. The kind of improvement we have been able to obtain in our brief example, utilizing five settings of (Temperature, pH), contains an essential flaw, which is not uncommon in textbook process optimization examples. Namely, the sample sizes are too small to give realistic recommendations for change in the standard operating values of temperature and acidity. If the observations are normally distributed, then for n lots, (n − 1)s2 /σ 2 has the χ2 distribution with n−1 degrees of freedom. Let us consider the 90% confidence interval for the true value of the variance at the present operating conditions of (30,7.2), σ 2 (30, 7.2). From the tabulated values of the χ2 distribution, we can obtain the values of χ2.05 and χ2.95 in the chain inequality χ2.95 ≤
(n − 1)s2 ≤ χ2.05 . σ2
(6.4)
From the χ2 tables, we then have (n − 1)s2 ≤ 16.919 , σ2 giving us that we are 90% certain that 3.325 ≤
© 2002 by Chapman & Hall/CRC
(6.5)
232
chapter 6. optimization approaches
3.100 ≤ σ 2 (30,7.2) ≤ 15.772 .
(6.6)
The same inequality applied to the sample variance of the point (37.5, 6.25) gives us .692 ≤ σ 2 (37.5, 6.25) ≤ 3.52 .
(6.7)
In other words, the 90% confidence intervals for parts per thousand for the old conditions and the suggested improvement overlap. Without more confidence in the suggestions of our data, we are probably ill advised to change the conditions of temperature and acidity. Optimization of a well specified criterion function uncontaminated by noise is a different matter than optimization of noisy experimental pointwise evaluation. Fortunately, in the kind of problem here, it is very frequently the case that sufficient data is available to overcome the problem of noise. Monitoring of output on a continuous basis is generally available. Moreover, we are well able to analyze very large data sets. In the example under consideration, experience shows that output records separated by one minute are stochastically independent. Thus, during less than one day, over 1,000 output data points are readily available. For the (30,7.2) setting, we show, in Figure 6.2, 1,000 successive output readings.
Parts per Thousand
60
55
50
45
40 0
200
400
600
800
1000
Time Figure 6.2. Output Under Existing Conditions.
© 2002 by Chapman & Hall/CRC
a simplex algorithm for optimization
233
In Figure 6.3, we show the data stream for each set of 1,000 observations under the five conditions considered earlier and recalled in Table 6.2. Table 6.2 W 2W B 30 40 35 7.2 6.5 6.0
Points Temperature pH
P 45 5.3
C 37.5 6.25
Time W(30,7.2) W(24,6.5) B(35,6.0) P(45,5.3) C(37.5,6.25) Figure 6.3. Output Under Varying Conditions. We show summary results of our 5,000 readings in Table 6.3.
Points x ¯ s2
© 2002 by Chapman & Hall/CRC
W 49.943 5.826
Table 6.3 2W B 49.963 49.972 4.109 3.054
P 49.944 12.236
C 50.067 1.345
234
chapter 6. optimization approaches
For n large, the critical values of the χ2 statistic can be readily obtained using the approximation √ (z + 2n − 3)2 2 χ = , (6.8) 2 where z is the standard normal variate corresponding to the desired confidence level (e.g., for 90%, z = ±1.645). Thus, for the W point, we have 90% certainty that σ 2 is between 5.42 and 6.28. For the C point, we have 90% certainty that σ 2 is between 1.25 and 1.45. Hence, our recommendation to change the operating conditions of temperature and acidity to 37.5 and 6.25, respectively can be made with some confidence. In the real world where noise and empiricism replace determinism and exact knowledge of the dynamics of the production system, we must be ever mindful of the dangers we run if we recommend changes based upon what amounts to weak evidence of improvement. Nothing so much undermines the credibility of innovators as a recommendation which turns out to be ill advised and counterproductive. In the example considered here where a modest improvement was obtained as the result of five days of sampling (1,000 points each day) we have really caused production about the same amount of trouble as if we had asked for the 10 observations per condition set. The very fact that the conditions are changed five times will require significant skill in effecting the adjustments. Time will be required for transients to die out after conditions have been modified. Probably, we will wish to make the changes during the same shift each day. Work may be required to port the data out of the system (in many factories, the data stream is continuously recorded on a paper roll, but no arrangements are made to produce the data in computer readable form). We now show a version of the Nelder-Mead algorithm which can be applied rather generally. Although we develop the argument for two control variables, generalization to higher dimensions is rather clear. In a practical EVOP situation, it is unlikely that more than four control variables will be optimized in a given study. Consequently, the rather slow convergence properties of the algorithm for dimensions greater than six are not generally a problem. Although the algorithm is presented here for the two dimensional situation, it may be used generally, if one remembers that the “C(entroid)” point is always taken to be the centroid of all points other than the W(orst). The parameter γE is frequently taken to be 1.0, and γC to be 0.5. There must be one more point than the dimensionality of the problem, and the initial points must be selected such that none is a linear combination of the others.
© 2002 by Chapman & Hall/CRC
a simplex algorithm for optimization A Simplex Algorithm Expansion Mode Let P = C + γE (C - W) If Y (P)< Y (B), then [a] Let PP = P + γE (C - W) If Y (PP)< Y (P), then [c] Replace W with PP as new vertex Else [b] Accept P as new vertex End If Else [b] If Y (P)< Y (2W), then accept P as new vertex Else Contraction Mode If Y (W)< Y (P), then [a*] PP = C + γC (W - B) If Y (PP)< Y (W), then [b*] Replace W with PP as new vertex Else [c*] Total Contraction 2W+B Replace W with W+B 2 and 2W with 2 End If Else Contraction Mode If Y (2W)< Y (P), then [aa] PP = C+ γC (P-B) If Y (PP)< Y (P), then [bb] Replace W with PP as new vertex Else [cc] Total Contraction 2W+B Replace W with W+B 2 and 2W with 2 End If Else Replace W with P End If End If
© 2002 by Chapman & Hall/CRC
235
236
chapter 6. optimization approaches
PP Expans ion P
a
B C
W
2W
P B
b Expan sion
PP 2W
B
c Expan sion
2W
P B
a*
C
2W
W
PP B
b*
Par tial In side Contraction 2W
PP
Figure 6.4. Simplex Algorithm. γE = 1.0, γC = 0.5.
© 2002 by Chapman & Hall/CRC
237
selection of objective function
B
Total Con tr ac tion
c* (B+2W)/2
(B+W)/2
P 2W
W
B
P PP
aa
C
W
2W
Part ial Outsid e Con tra ct ion B
bb
PP
2W Total Con tr act ion B
(B+ 2W)/2
2W
cc
(B+W)/2
W
Figure 6.5. Simplex Algorithm. γE = 1.0, γC = 0.5.
6.3
Selection of Objective Function
As a practical matter the selection of the function to be maximized or minimized (frequently called the “objective” function) is seldom very
© 2002 by Chapman & Hall/CRC
238
chapter 6. optimization approaches
P/N = Profit Per Tractor Produced
objective. There are the relatively rare situations where we want to maximize some explicit dollar profit function. Even these are seldom completely objective. For example, let us suppose that we wish to maximize profit per item produced. Because of “economies of scale” (i.e., within limits, the per unit cost is generally lower as we increase the volume of items produced), the profit per item profile is frequently as shown in Figure 6.6. Past a limit, the situation usually deteriorates as the volume of goods we produce becomes sufficient to depress the profit margin or we get to the capacity of a factory and further volume would cause large, perhaps rather unpredictable costs.
10
20 30 N=Tractors Produced Per Year (thousands)
Figure 6.6. Profitability Profile. If we know the curve in Figure 6.6, then we might decide to use the production level which gives maximum per unit profit. More likely we might choose to maximize total profit: P Total Profit = N . N
(6.9)
As a practical matter, we will not know the P/N curve, and even if we did know it, it would be changing over time. That is the situation
© 2002 by Chapman & Hall/CRC
selection of objective function
239
in most economic scenarios. If we expand capacity in a rising demand period, what shall we do with excess capacity when the market recedes? The point here is that planning is very hard. If it were not so, then everyone who could carry out a numerical optimization would be rich. In statistical process control, we are frequently locked into a product, a plant design, etc. No one should expect us, using the same production process, to “upgrade” our automobile production to airplane production. (Although both SAAB and Mitsubishi produce both cars and airplanes, they do so in processes of production specific to the one or the other.) In the petro-chemical industry (where much of the EVOP philosophy was developed), hardware is not necessarily devised to produce particular products in particular quantities. So, for example, a 100,000 barrel/day petroleum distillation column can produce a wide variety of products from gasoline, to jet fuel, to asphalt, in varying quantities depending on the composition of the input stream and the demands of the market. In such situations, we frequently have optimization problems in which we seek to maximize, say, the quantity of the jet fuel stream, subject to constraints in terms, say, of the quantity and characteristics of all the output streams. In such a case, we probably will not know explicitly the functional form of the objective function or those of the constraints. But we will be in a situation where there is a direct profit related objective function to be maximized. In most SPC situations, such as the manufacturing of a particular car, our task will be to produce a definite product composed of 10,000 definite parts, with specifications designed years earlier. In such situations, our objective functions will not be directly related to profit, but will rather be related to variability of output and conformance to design specifications. Let us consider some commonly used objective functions used in the case where we are seeking to minimize departures from conformity to design specifications. The first to be considered is simply the sample variance, as considered in our example of a continuous chemical reaction: s2 =
n 1 (xi − x ¯)2 . n − 1 i=1
(6.10)
We may find it useful to use the sample mean square error from the target μ0 : x − μ0 )2 . MSE = s2 + (¯
© 2002 by Chapman & Hall/CRC
(6.11)
240
chapter 6. optimization approaches
The second term, the square of the bias, is generally something which we can adjust essentially to zero by shifting operating conditions. As a practical matter, the hard part in lowering the mean square error is lowering the variance. Hence, there is little to choose between s2 and the MSE as an objective function. Sometimes, the argument is made that variance and MSE are not naturally scaled. If there is such a problem, then it can frequently be resolved by dividing the sample standard deviation by the sample mean to give the coefficient of variation: CV =
s . x ¯
(6.12)
As a practical matter, the value of x ¯ will generally change but little as we seek to minimize departures from the target μ0 . Hence, there will be little change whether we use as objective function the sample variance or the square of the CV. Among other objective functions in common use by the SPC professional, we have the “signal to noise” ratios of the Japanese quality control leader Genichi Taguchi [7]. One of these deals with “smaller is better” situations: S = −10log10 ( N
n
i=1 (xi
− μ0 )2
n
).
(6.13)
Another deals with “larger is better” situations, e.g.: S = −10log10 ( N
n
1 i=1 (xi −μ0 )2
n
).
(6.14)
A third objective function of Taguchi is essentially the logarithm of the reciprocal of the square of the coefficient of variation. It is recommended by its creator for “nominal/target is best” situations: x ¯2 S = 10 log10 2 . N s
(6.15)
Some take great care about selection of the appropriate objective function. There has been interest about using, as an alternative to the sample variance or its square root, the sample standard deviation, the mean absolute deviation: n 1 |xi − x ¯|. (6.16) MAD = n i=1
© 2002 by Chapman & Hall/CRC
241
selection of objective function
The MAD objective function has the advantage of not giving a large deviation squared, and, hence perhaps unduly emphasized, importance. In Figure 6.7, we compare Taguchi’s S/N ratio −10 log(MSE) with the MSE for various functional forms of the MSE. Someone who is trying to minimize the MSE reaches a point of diminishing returns, where there is very little difference in the criterion function as we move θ closer to zero. There is no such diminishing returns phenomenon with the Taguchi S/N ratio (which, of course, is to be maximized rather than, as is the case with MSE, minimized). This function goes to infinity for both MSE functions shown here as θ goes to zero. Such an approach appears consistent with a philosophy which seeks to bring the MSE all the way to zero. 30.0
−20λογ(θ)
Objecti ve Functi on
20.0
θ2 10.0
θ
0.0
–10log(θ)
-10.0
− 10log( θ2) -20.0 0.0
1.0
2.0
3.0
4.0
5.0
θ Figure 6.7. A Variety of Objective Functions. Naturally, it could be argued that it is senseless to attempt unlimited lowering of the MSE. Suppose, for example, there is a “wall,” i.e., a boundary below which the error cannot be lessened without fundamental change in design. In Figure 6.8, we consider such situations, using MSE models of the form M SE(θ) = a + θ2 . Here, of course, the constant a serves the function of a “wall.” The Taguchi S/N ratio, as we note, is not completely insensitive to the phe-
© 2002 by Chapman & Hall/CRC
242
chapter 6. optimization approaches
nomenon of diminishing returns, though it is much less so than using simply the MSE.
30.0
Objective Function
20.0
–10 log(.01+θ2) 10.0
–10 log(.1+θ2)
0.0
-10.0
-20.0 0.0
–10 log(1+ θ2) 1.0
2.0
3.0
4.0
5.0
θ Figure 6.8. A Variety of Objective Functions.
6.4
Motivation for Linear Models
The simplex algorithm for optimization is essentially model free. One proceeds sequentially by experimentation to obtain (generally noisy) pointwise evaluations of the function to be optimized. There are advantages to such procedures, but they frequently require a large number of experiments carried out sequentially over time. Most experimental design is carried out according to a different, more batch oriented, approach. Namely, we carry out experiments in an orderly procedure over a predetermined grid and fit a simple function to it. In Figure 6.9, we have 20 points to which we have fitted the quadratic Y = 1 − 2X + X 2 .
© 2002 by Chapman & Hall/CRC
(6.17)
243
motivation for linear models
5 + +
4 +
2
Y=1-2X+X
+
3
Y
+ + +
2
+ +
+ + + +
1
+
+ +
0 -1 -1
+ ++
+ +
0
1
2
3
4
X Figure 6.9. A Quadratic Fit to Data. We can now use the simple optimization approach of taking the derivative of Y with respect to X and setting it equal to zero. This gives us Y = −2 + 2X = 0
(6.18)
with solution X = 1. Some of the advantages of fitting a quadratic to a set of data and then minimizing the fitted quadratic (or maximizing as the situation dictates) are clear. First of all, the procedure is “holistic.” Unlike the simplex approach in which each point was taken as a “stand-alone,” the fitting procedure uses all the points to find one summary curve. Generally, this hanging together of the data tends to lessen the effect of noise. Next, if the fitting model is simple, say linear or quadratic, then the optimization algorithms we can use may be relatively efficient. And, of course, by knowing beforehand precisely how many experiments we are going to make, we can tell the production people that we will be inconveniencing them for so long and no longer, as opposed to the situation with the
© 2002 by Chapman & Hall/CRC
244
chapter 6. optimization approaches
more open-ended simplex strategy. There is no doubt that the “fooling around” strategies, of which the simplex is among the most satisfactory, account for most planned experimentation in industrial systems. But fitted quadratic based approaches are, in general, more appropriate. We recall from calculus Taylor’s formula:
f (X) = f (a) + (X − a)f (a) +
(X − a)2 f (a) + . . . + 2!
(X − a)k−1 (k−1) (a) + Rk f (k − 1)!
(6.19)
where f has continuous derivatives through the kth over the interval a ≤ X ≤ b and
X (X − t)k−1 (k) Rk = (6.20) f (t)dt. (k − 1)! a The remainder formula Rk is generally unobtainable. However, if f has derivatives of all orders, then an infinite series version of Taylor’s formula, one requiring no remainder term is available: f (X) =
∞ f (j) (a)(X − a)j
(6.21)
j!
j=0
provided, for example, that |f (j) (X)| ≤ M j for all X, a ≤ X ≤ b, all j and some constant M . Recall that the use of Taylor’s formula formally requires that we know the precise form of f (X). However, in the case where we do not know f (X) but do have the observations of f (X) for a number of values of X, we might write Taylor’s formula as f (X) ≈
k
Aj X j .
(6.22)
j=0
If we have n observations of f (xi ); i = 1, 2, . . . , n, then we might find the values Aˆj ; j = 1, 2, . . . , n which minimize S(Aˆ1 , Aˆ2 , . . . , Aˆk ) =
n i=1
[f (xi ) −
k
Aˆj xji ]2 .
(6.23)
j=0
Taylor’s formula gives many a sense of overconfidence. (This overconfidence has greatly lessened the usefulness of whole fields of research,
© 2002 by Chapman & Hall/CRC
motivation for linear models
245
for example, econometrics.) They feel that it really is not so important whether they actually understand the underlying process. If only enough terms are included in the Taylor’s approximation, they believe that empirical data can somehow be used to obtain an excellent fit to reality. Unfortunately, this sense of confidence is misplaced whenever, as is almost always the case, the observations have a component of noise. To note the ambiguities encountered in practice, we show in Figure 6.9, over the X range of 0 to 3, graphs of
X = X,
(6.24)
Y = X + .2 sin(20x)
(6.25)
W = X + .2Z
(6.26)
and
where Z is a normal random variable with mean 0 and variance 1. Most observers presented with such data would declare that the underlying mechanism for each was, simply, a straight line, and that the Y and W data had simply been contaminated by noise. In general, we will probably be well advised to attempt to use a relatively simple approximating function rather than to entertain notions of a “rich” model. Naturally, if we understand the mechanism of the process so that we can replace some of our ad hocery by a concrete model, we will be well advised to do so. However, in a statistical process control setting, our main criterion function will generally be a measure of the variability of an output variable. Even in cases where we have a fair idea of the mechanism of the process under consideration, the mechanism of the variation will not be perceived very well. Thus, as a practical matter, we will generally be choosing approximating models for their simplicity, stability and ease of manipulation. Generally speaking, these models will be polynomials of degree one or two.
© 2002 by Chapman & Hall/CRC
246
chapter 6. optimization approaches
X
X=X
Y = X+ .2 sin(20X)
W = X + .2N(0,1)
0
3
Figure 6.10. Deterministic and Stochastic Functions.
Let us consider the simple exponential function and its Taylor’s series representation about a = 0.
eX = 1 + X +
X2 X3 + + ... . 2! 3!
(6.27)
In Figure 6.11, we note linear and quadratic approximations to ex over the interval 0 ≤ X ≤ .50. We note that over the range considered probably the linear and certainly the quadratic approximations will be satisfactory for most purposes. The cubic is indistinguishable from ex at the level of resolution of Figure 6.11. We should observe that ex is an “exploding” function when compared with any polynomial. But still, over a modest range, the low order polynomial approximations work rather well.
© 2002 by Chapman & Hall/CRC
247
motivation for linear models
exp(x) and Some Approximations
1.7
y=exp(x)
1.6
y=1+x+.5x 2
1.6 1.5 1.4
1+x
1.4 1.3 1.2 1.1 1.1 1.0
0.0
0.1
0.2
0.3
0.4
0.5
x
Figure 6.11. Approximations to eX . Note the relatively disastrous quality of the three polynomial approximations to ex in Figure 6.12. Our strategy should be to choose an approximation interval which is sufficiently tight so that our approximation will be of reasonable quality. 21.0
exp(x) 16.0
3
1+x+.5x2 + 1/6
11.0
2
1+x+.5x
6.0 1+x
1.0 0.0
1.0 1.0
2.0 2.0
3.0 3.0
x
Figure 6.12. Approximations to eX . Sometimes, it is clear that a function is growing much faster (or slower)
© 2002 by Chapman & Hall/CRC
248
chapter 6. optimization approaches
in x than a polynomial of degree one or two. For example we may have a data base like that shown in the X, Y columns of Table 6.4 and graphed in Figure 6.13. X 0.10 0.34 0.46 0.51 0.73 0.91 1.06 1.21 1.43 1.59 1.86 2.01 2.16 2.33 2.44 2.71 2.88 2.95 3.21 3.51 3.71 3.99
Y 0.6420 1.2820 1.1672 1.2751 1.9933 1.9022 2.9460 3.2087 4.2880 5.5137 6.9380 9.1154 11.0467 13.4119 15.2298 20.9428 25.7533 27.4802 37.0459 53.4271 68.5668 96.0248
Table √ 6.4 .25 Y Y 0.8013 0.8951 1.1323 1.0641 1.0804 1.0394 1.1292 1.0626 1.4118 1.1882 1.3792 1.1744 1.7164 1.3101 1.7913 1.3384 2.0707 1.4390 2.3481 1.5324 2.6340 1.6230 3.0192 1.7376 3.3237 1.8231 3.6622 1.9137 3.9025 1.9755 4.5763 2.1392 5.0748 2.2527 5.2422 2.2896 6.0865 2.4671 7.3094 2.7036 8.2805 2.8776 9.7992 3.1304
ln(Y ) -0.4431 0.2484 0.1546 0.2431 0.6898 0.6430 1.0805 1.1659 1.4558 1.7072 1.9370 2.2100 2.4021 2.5961 2.7233 3.0418 3.2486 3.3135 3.6122 3.9783 4.2278 4.5646
100
80
60
Y 40
20
0 0
1
2
3
4
X
Figure 6.13. Rapidly Growing Function.
© 2002 by Chapman & Hall/CRC
249
motivation for linear models Let us consider one version of the “transformational ladder.” exp(eY ) eY Y4 Y2 √Y Y Y .25 ln(Y ) ln(ln(Y ))
Functions which are growing faster than linear are transformed by transformations below Y in the ladder. The only function which is readily identified by the eye is the straight line. We will, accordingly, continue going down the ladder until the transformed data is approximately linear. √ As a first try, we show a plot of Y versus X in Figure 6.14. The transformed curve is still growing much faster than a straight line. 10
8
6
Y 4
2
0 0
1
2
3
4
X
Figure 6.14. Square Root Transformation. Continuing to the next rung in the ladder, we plot the fourth root of Y versus X in Figure 6.15. We note that the curve still is distinctly concave upward.
© 2002 by Chapman & Hall/CRC
250
chapter 6. optimization approaches 4
3
2
Y .25 1
0 0
1
2
3
4
X
Figure 6.15. Fourth Root Transformation. Continuing down the ladder to the logarithmic transformation in Figure 6.16, we note that something very close to a straight line has been achieved. 5 4 3 2
ln(Y) 1 0 -1 0
1
2
3
4
X
Figure 6.16. Logarithmic Transformation. Fitting a straight line (by least squares), that is, essentially by (6.23), gives the approximation: ln(Y ) ≈ −.34 + 1.24X.
© 2002 by Chapman & Hall/CRC
(6.28)
motivation for linear models
251
Exponentiating both sides, we have Y ≈ .71e1.24X .
(6.29)
This might suggest that for the data set considered, we might consider in our further work fitting the curve: Y = A0 + A1 Z + A2 Z 2
(6.30)
Z = e1.24X .
(6.31)
where
Returning to the fitting of the straight line to ln(yi ), we recall that, due to its simplicity and nonnegativity, least squares was selected by K.F. Gauss as a natural criterion function whose minimization frequently gives ready estimates of parameters. In the present context, suppose we have n observations of a (generally noisy) variable, {wi }, to which we wish to fit a linear function of x: wi = a + bxi + i .
(6.32)
Here, i denotes the miss between wi and the approximating straight line. (In the immediate example, here, we recall that wi = ln yi .) So we shall find a and b to minimize S(a, b) =
n
[wi − (a + bxi )]2 .
(6.33)
i=1
We recall that a necessary condition for a maximum or minimum of S is ∂S ∂a ∂S ∂b
= 0
(6.34)
= 0.
(6.35)
This gives: n
¯x i=1 xi wi − nw¯ ˆb = n 2 − n¯ 2 x x i=1 i and
© 2002 by Chapman & Hall/CRC
(6.36)
252
chapter 6. optimization approaches
a ˆ=w ¯ − ˆb¯ x.
(6.37)
For the solution to the necessary conditions give a minimum, it is sufficient that D=(
∂2S 2 ∂2S ∂2S > 0. ) − ∂a∂b ∂a2 ∂b2
(6.38)
As we see here D = (2n¯ x)2 − (2n)(−2
n
x2i ) > 0.
(6.39)
i=1
For the particular example at hand, using the data in Table 6.4, we obtain the fit a ˆ = −.34 and ˆb = 1.23.
6.5
Multivariate Extensions
Let us proceed to problems where there are a number of potentially causal (“independent”) variables involved in a function f . Then, assuming we have continuous partial derivatives of all orders, we obtain the Taylor’s series representation of f expanding about the vector (a1 , a2 , . . . , ap )
f (X1 , X2 , . . . , Xp ) = f (a1 , a2 , . . . , ap ) +
p
(Xj − aj )
j=1
+
p p
(Xj − aj )(Xi − ai )
j=1 i=1
∂f |a ∂Xj j
(6.40)
∂2f |a ,a + . . . . 2∂Xj ∂Xi j i
Writing out the first few terms of the Taylor’s expansion in increasing powers, we have
f (X1 , X2 , . . . , Xp ) = A0 + +
p
Aj X j +
j=1 p p p
p j=1
Ajj Xj2
+
p p
Aji Xj Xi
j=1 i=j+1
Ajih Xj Xi Xh + . . . .
(6.41)
j=1 i=j h=i
It is very rare that we will have occasion to use a polynomial of degree higher than three. Two will generally suffice, and very frequently a
© 2002 by Chapman & Hall/CRC
least squares
253
polynomial of only the first degree will do nicely. We recall that we are laboring under several major disadvantages. First of all, we expect that our observations will be degraded by noise. And, of course, if we are using a polynomial model in ad hoc fashion, that implies that we do not actually know the true model. If we had no noise and an infinite data base, then the Taylor’s expansion of appropriately high degree might well be appropriate to consider. As it is, putting too many terms in the expansion will lead to instability in the sense that if we changed one of the data points, we might obtain very different coefficients for our fitting model. We recall that we are not asking miracles of our polynomial fit. We will simply use it for guidance as to how we might gingerly change conditions to lower variance, or to improve purity of product, etc. We will not be enthusiastic in using the fitted polynomial for extrapolations well outside the range of the data base. The particular quadratic, say, fit we obtain will be used simply for operating within the narrow ranges where we believe it to be a satisfactory fit. Having said all this, we must now address the issue of fitting a polynomial to a data set. We shall generally follow orthodox statistical practise, commenting from time to time where the standard assumptions are false to the level where we must be very careful if we are to stay out of trouble.
6.6
Least Squares
We consider the standard “linear statistical model” Y =
k
θj Xj + ,
j=0
where θj , j = 0, 1, . . . , k, are unknown coefficients, Xj , j = 0, 1, . . . , k can be considered as causal variables and as an “error term.” First of all, we note that the Xj can be almost anything. For example, we could let (and generally do) X0 = 1. Then, we could allow X2 to equal X12 . And, we could allow X3 = exp(sin(X1 )). The term “linear” applies to the θj coefficients rather than to the power of the variables. Next, we note that is supposed to serve many convenient functions. First of all, any departures of truth from our fitting model are subsumed in the process, which has always zero expectation and constant variance. These assumptions are generally overly optimistic, and we need to be aware that they are.
© 2002 by Chapman & Hall/CRC
254
chapter 6. optimization approaches
We want estimates for the θj coefficients, given, say, n observations of the random variable Y . Actually, therefore, the linear statistical model assumes the form Yi =
k
θj Xij + i .
(6.42)
j=0
Here i goes from 1 to n, and Yi denotes the ith observation, corresponding to the ith vector of variables (Xi1 , Xi2 , . . . , Xik ). The error term i ensemble has the miraculous property that, for all i, E(i ) = 0 ,
(6.43)
E(2i )
(6.44)
2
= σ ,
E(i l ) = 0, if i = l .
(6.45)
Again we observe that the Xi can be virtually anything. We will generally let X0 = 1. Then, we could allow X2 to equal X19 . We could set X3 = exp(exp(X1 )). The term “linear” applies to the θj coefficients. Generally speaking, we shall not concern ourselves much here with addressing the question of dealing with those situations where model inadequacy becomes confounded with error due simply to noise. If we use the model carefully without attempting to make giant steps outside the region of reasonable approximation, frequently the massive amounts of empiricism, which are an essential part of linear modeling, do us no great harm. Next, we obtain estimates for the θj , which we shall denote by θˆj , by minimizing the sum of squared deviations of fit from reality,
Minθ0 ,...,θk S(θ0 , θ1 , . . . , θk ) = Minθ0 ,...,θk = Minθ0 ,...,θk
n i=1 n
ˆ2i [yi −
i=1
(6.46) k
θj xij ]2 .
j=0
Necessary conditions for existence of a minimum are n k ∂S = −2 xij [yi − θj xij ] = 0; j = 0, 1, . . . , k . ∂θj i=1 j=0
Summarizing in matrix notation, we have
© 2002 by Chapman & Hall/CRC
(6.47)
255
least squares
∂S = 2X (y − XΘ) = 0, ∂Θ ⎡
where
⎢ ⎢
y=⎢ ⎢ ⎣ ⎡ ⎢ ⎢
Θ=⎢ ⎢ ⎣
y1 y2 .. . yn θ0 θ2 .. .
(6.48)
⎤ ⎥ ⎥ ⎥; ⎥ ⎦
(6.49)
⎤ ⎥ ⎥ ⎥; ⎥ ⎦
(6.50)
θk and ⎡ ⎢ ⎢
X=⎢ ⎢ ⎣
x10 x20 .. .
x11 x21 .. .
. . . x1k . . . x2k .. .
⎤ ⎥ ⎥ ⎥. ⎥ ⎦
(6.51)
xn0 xn1 . . . xnk Then we have X y = X XΘ. Assuming
X X
(6.52)
is invertible, we then have ˆ = (X X)−1 X y. Θ
(6.53)
Now we should observe that so far we have not utilized the prescribed conditions of the {} process. We could, formally, use the least squares criterion absent these. And, in general, the least squares procedure frequently works satisfactorily when one or more of the conditions is not satisfied. We show below a very powerful consequence of least squares estimation in conjunction with the indicated conditions on the error process {}. Rewriting these conditions in matrix form, we can write the summarized dispersion matrix as V() = E( ) = σ 2 I,
(6.54)
where I is the n by n identity matrix. Then, substituting the least squares estimate for Θ, we have
© 2002 by Chapman & Hall/CRC
256
chapter 6. optimization approaches
ˆ = (X X)−1 X (XΘ + ) Θ = Θ + (X X)−1 X .
(6.55)
ˆ is Then, we easily see by substitution that the dispersion matrix of Θ given by ˆ = E[(Θ ˆ − Θ)(Θ ˆ − Θ) ] V(Θ) = E[{(X X)−1 X }{(X X)−1 X } ] = (X X)−1 X E( )X(X X)−1 = σ 2 (X X)−1 .
(6.56)
The last step in (6.56) (based upon assumptions in (6.44) and (6.45) ) has important implications in the construction of experimental designs. For example, if we wished the estimates for the components of Θ to be independent of each other, we could simply see to it that the off diagonal components of X X were all zero. That is, we could construct design X matrices with orthogonal columns. As an example of such a design matrix, let us suppose we wish to obtain least squares estimates appropriate for the model Y = θ0 + θ1 X1 + θ2 X2 + .
(6.57)
Here, we will create X0 , a variable which is always equal to 1. Now, if we wish to design an experiment of size 4, we could consider using the design matrix ⎡ ⎤ 1 2 5 ⎢ 1 −2 5 ⎥ ⎢ ⎥ X=⎢ (6.58) ⎥. ⎣ 1 2 −5 ⎦ 1 −2 −5 This gives
⎡
⎤
4 0 0 ⎢ ⎥ X X = ⎣ 0 16 0 ⎦ . 0 0 100
(6.59)
Thus, for such a design, again assuming the rather strong conditions in (6.44) and (6.45) hold, we would have ⎡ ⎢
1 4
ˆ = σ2 ⎣ 0 V(Θ) 0
© 2002 by Chapman & Hall/CRC
0
1 16
0
⎤
0 ⎥ 0 ⎦.
1 100
(6.60)
257
least squares
Thus, we would be assured that there was no correlation between the estimates for each of the θj . Next, let us suppose that we had obtained the least squares estimates for the θj . Suppose then that we wished to obtain estimates for some linear combination of the θj , say −.1θ0 + 27θ1 + 1002θ2 . In the best of all possible worlds we might hope that the optimum estimator would be given by −.1θˆ0 + 27θˆ1 + 1002θˆ2 . We show below, following Kendall and Stuart’s explication ([5], v. II, 79-80) of Plackett’s simplification of the proof of the Gauss-Markov Theorem, that we are able to come close to the “best of all possible worlds” result. Let us suppose that we have a matrix of dimension k + 1 by r, say, C, and we wish to estimate r linear combinations of the unknown parameters, say c = CΘ. (6.61) Suppose, moreover, that we wish to restrict ourselves to estimators which are linear combinations of the n by 1 observation vector y, say t = Ty.
(6.62)
We shall employ Markov’s restriction of unbiasedness, namely that E(Ty) = c = CΘ.
(6.63)
And our criterion shall be to minimize the diagonal elements of the dispersion matrix of t, namely V(t) = E[(t − CΘ)(t − CΘ) ].
(6.64)
Now, the condition of unbiasedness gives us E[Ty] = E[T(XΘ + )] = CΘ.
(6.65)
TX = C.
(6.66)
So then, we have Thus, the dispersion matrix of t becomes V(t) = E(T T ) = σ 2 TT .
(6.67)
Next, we observe Plackett’s clever decomposition of TT , namely TT = [C(X X)−1 X ][C(X X)−1 X ]
−1
+ [T − C(X X)
© 2002 by Chapman & Hall/CRC
(6.68) −1
X ][T − C(X X)
X]
258
chapter 6. optimization approaches
as may easily be verified by multiplication of the right hand side. Now, a matrix of the form MM has only non-negative elements on the main diagonal. Thus, both of the terms on the right hand side of (6.68) can only add to the magnitude of each of the terms on the diagonal of TT . The best that we can hope for is to minimize the amount they add to the dispersion. But we have no control over anything except our choice of T, so there is no way we can do anything about the first term on the right hand side of (6.68). But we have a means of reducing the second term to 0, namely, by letting T = C(X X)−1 X .
(6.69)
So, finally, we have demonstrated that once the least squares estimates for the θ’s have been found, we can readily obtain estimators for any set ˆ of linear functions of the θ’s, say CΘ, by using CΘ. Let us turn now to the situation where the dispersion matrix of the errors is given by σ 2 V, where V is completely general. Then it can be shown that the unbiased minimum variance (Gauss-Markov) estimator of CΘ is given by
and
6.7
t = C(X V−1 X)−1 X V−1 y,
(6.70)
V(t) = σ 2 C(X V−1 X)−1 C .
(6.71)
Model “Enrichment”
One might suppose that if we managed to fit a linear model to a set of operations data, and if, on the basis of that model, it appeared that we should modify current control conditions to a different level of one or more variables, we would be well advised to exercise caution in recommending a change from the current level. The reasons are fairly obvious. Under the current conditions, the production staff has most likely achieved a certain degree of control, and the process probably is, in most senses, satisfactory. We would not have carried out the analysis had we not been willing to make recommendations using it. But we cannot do so on the basis of shaky information. Even to carry out a scheduled experimental study on a production process is generally regarded as an intrusive inconvenience by some production staffs. Thirty years ago, one of us, as a young chemical engineer (Thompson), was carrying out a factorial design on operating conditions
© 2002 by Chapman & Hall/CRC
model “enrichment”
259
for a large petroleum distillation column to determine the feasibility of a change in operating conditions in order to increase by 50% the volume of the medium weight “jet fuel” sidestream, without significant specification degradation to the other fractions of output. To get the unit to steady state after changing conditions required around 36 hours of continuous attention by the instigator (the production staff not being inclined to assist with such foolishness). Steady state having been achieved, the instigator returned briefly to his office for a short nap. One hour later, a frantic call was received from the production staff. The pressure at the top of the column had opened the safety valves, and a fine spray was being diffused from the great height of the column over large sections of Baton Rouge, Louisiana. The instigator drove furiously to the column, pouring over in a fatigued mind what thermodynamic anomalies possibly could have caused this multimillion dollar catastrophe. Anxiously, he gazed unsuccessfully through a dirty windscreen and the hot July haze to see signs of the ecological disaster. Stopping the car and jogging furiously to the operations room, he was greeted with loud guffaws of laughter, from the members of the operations staff, who were having a bit of not so good-natured fun with the smart aleck, who was fixing something that was not “broke.” Naturally, from the standpoint of the company, changing market conditions had made jet fuel very profitable relative to other products—hence the study. But from the standpoint of the production staff, changes in a perfectly satisfactory production paradigm were being made, and they were not overjoyed. From their standpoint, little upside good was likely to come from the whole business. And their little prank indicated that it was just possible that something awful could occur on the downside. It is most important that, before we go to the stage of designing experiments to change “in control” processes, we should establish a thorough familiarity of the more basic notions of statistical process control, such as control charting, with the entire production staff. There is nothing more terrifying to a staff, most of whom have little idea of how to bring a system into control, than purposefully taking an in control system out of control. Assuming that we have a level of confidence on the part of all concerned which enables us to carry out experiments, it is even more important that we not recommend ineffective, let alone disastrous, changes. If we discover from our ad hoc model that we may be able to lower the variability of the output by, say 50%, then such an improvement is probably well worth having. But it is a bad idea to recommend change
© 2002 by Chapman & Hall/CRC
260
chapter 6. optimization approaches
unless we are confident of our ground. An important part of gaining this confidence is making sure our model is sufficiently supported by the data to recommend a change in operating conditions.
6.8
Testing for Model “Enrichment”
We start, following Kendall and Stuart [5], consideration of two alternative hypotheses: (6.72) H0 :yi = i or H1 :yi = xi0 θ0 + xi1 θ1 + . . . + xip θp + i .
(6.73)
In both cases, we assume the j are independent and identically distributed N (0, σ 2 ) random variables. Clearly, there may well be a potential gain in process enhancement if we know that H1 is true rather than H0 . On the other hand, opting for H1 when H0 is true may well cause us to move the process out of control to no good purpose. Clearly, we need to be able to analyze data to distinguish between strong evidence that H1 is true as opposed to a mere statistical fluke. We will now develop a testing procedure for this purpose. First of all, on the basis of a sample {yi , Xi }ni=1 , we shall obtain the least squares estimates of Θ, assuming H1 to be true. Then we shall observe the differences between each yj and its least squares estimate: ˆ = [XΘ + ] − X[(X X)−1 X y] y − XΘ
−1
= [XΘ + ] − X[(X X)
(6.74)
X (XΘ + )].
Canceling the terms in Θ we have ˆ = [In − X(X X)−1 X ]. y − XΘ
(6.75)
Here, In is the n by n identity matrix. Next, multiplying the term in brackets by its transpose, we have [In − X(X X)−1 X ] [In − X(X X)−1 X ] =
−1
In − 2X(X X) Thus,
© 2002 by Chapman & Hall/CRC
−1
X + X(X X)
X
(6.76)
−1
= [In − X(X X)
B = [In − X(X X)−1 X ]
X ].
(6.77)
testing for model “enrichment”
261
is idempotent, and, from Appendix A, we know that its rank is equal to its trace. Recall that we have assumed X X is invertible, and hence it has rank p + 1. Now, tr[In − X(X X)−1 X ] = [tr(In )] − tr[(X X)−1 X X)]
(6.78)
= n − (p + 1). Then, we have ˆ = B ˆ (y − XΘ) (y − XΘ) =
n
bii 2i +
(6.79)
bij i j .
(6.80)
ˆ = σ 2 tr(B) = σ 2 (n − p − 1). ˆ (y − XΘ)] E[(y − XΘ)
(6.81)
i=1
i=j
By the i.i.d. assumption of the j we then have
Thus, an unbiased estimator of σ 2 is given by s2 =
1 ˆ ˆ (y − XΘ). (y − XΘ) n − (p + 1)
(6.82)
At this point, we bring normality into the argument, for n − (p + 1) 2 B s = 2 σ2 σ
(6.83)
is an idempotent quadratic form in n independent N (0, 1) variates. Hence, from Appendix B, we have that (n − p − 1)s2 /σ 2 is a χ2 variate with (n − p − 1) degrees of freedom regardless of the true value of Θ. Next, we note that ˆ (y − XΘ) ˆ + (XΘ) ˆ (XΘ). ˆ y y = (y − XΘ)
(6.84)
The first term on the right, when divided by σ 2 , is χ2 (n−p−1) regardless ˆ ˆ (XΘ). of Θ. Let us investigate (XΘ) ˆ = Θ ˆ X XΘ ˆ ˆ (XΘ) (XΘ) = [(X X)−1 X y] X X(X X)−1 X y = y X(X X)−1 X y = (Θ X + )X(X X)−1 X (XΘ + ).
© 2002 by Chapman & Hall/CRC
(6.85)
262
chapter 6. optimization approaches
Now if Θ = 0, then we have ˆ = X(X X)X . ˆ (XΘ) (XΘ)
(6.86)
Now X(X X)−1 X is idempotent, since [X(X X)−1 X ] X(X X)−1 X = X(X X)−1 X X(X X)−1 X = X(X X)−1 X .
(6.87)
Thus its rank is given by its trace tr[X(X X)−1 X ] = tr[(X X)−1 XX ] = p + 1,
(6.88)
since (X X)−1 is assumed to be invertible. Moreover X(X X)−1 X is χ2 (p + 1), (6.89) σ2 since it is an idempotent quadratic form in standard normal variates. But then, since the ranks n − p − 1 and p + 1 add to n, the rank of the left hand side, we know from Cochran’s Theorem (see Appendix B) that ˆ and X(X X)−1 X are ˆ (y − XΘ) the residual sum of squares (y − XΘ) independently distributed. Indeed, if the null hypothesis is true, all the terms of (6.84) are quadratic forms in . Now we have a means of testing H0 , for if H0 be true, then ˆ (XΘ)/(p ˆ (XΘ) + 1) is distributed as Fp+1,n−(p+1) , ˆ ˆ (y − XΘ) (y − XΘ)/(n − p − 1) (6.90) since we then have the quotient of two χ2 variables with p + 1 and n − (p + 1) degrees of freedom, divided by their respective degrees of freedom. We have seen that whatever be Θ, ˆ = (n − p − 1)σ 2 . ˆ (y − XΘ)] E[(y − XΘ)
(6.91)
But ˆ ˆ (XΘ)] = E[ X(X X)−1 X ] + Θ X XΘ E[(XΘ) = (p + 1)σ 2 + (XΘ) (XΘ).
(6.92)
Thus, if H0 be false and H1 be true, for a fixed significance level α, we should expect that ˆ ˆ (XΘ)/(p + 1) (XΘ) > Fp+1,n−p−1 (α). ˆ (y − XΘ)/(n ˆ (y − XΘ) − p − 1)
© 2002 by Chapman & Hall/CRC
(6.93)
263
testing for model “enrichment”
Next, let us suppose we have been using, with some success, a linear model (6.94) yi = xi0 θ0 + xi1 θ1 + . . . + xip1 θp1 + i , which we wish to consider enhancing to the model yi = xi0 θ0 + xi1 θ1 + . . . + xip1 θp1 + xi,p1 +1 θp1 +1 + . . . + xi,p1 +p2 θp1 +p2 + i . (6.95) Once again we are making the customary assumption that the s are independent and identically distributed as N (0, σ 2 ). We shall write this in shorthand notation for n experiments y = Z1 Θ1 + Z2 Θ2 + .
(6.96)
X = [Z1 0] + [0 Z2 ],
(6.97)
That is, where
⎡
⎢ ⎢ X=⎢ ⎢ ⎣
x1,0 x2,0 .. .
x1,1 x2,1 .. .
xn,0 xn,1
. . . x1,p1 . . . x2,p1 .. .. . . . . . xn,p1
x1,p1 +1 x2,p1 +1 .. . xn,p1 +1
. . . x1,p1 +p2 . . . x2,p1 +p2 .. .. . . . . . xn,p1 +p2
⎤ ⎥ ⎥ ⎥. ⎥ ⎦
(6.98)
So Z1 is the (n, p1 + 1) matrix consisting of the first p1 + 1 columns on the right hand side, and Z2 is the next p2 columns. The null hypothesis which we wish to test here is H0 : θp1 +1 = θp1 +2 = . . . = θp1 +p2 = 0.
(6.99)
We shall first consider the case where Z1 and Z2 are orthogonal, i.e., Z2 Z1 = 0p2 ,p1 +1 .
(6.100)
Then, for the enhanced model, the least squares estimates for Θ are given by ˆ = (X X)−1 X y Θ =
(Z1 Z1 )−1 Z 1 y
(6.101) +
(Z2 Z2 )−1 Z 2 y.
(6.102)
Recalling the argument in (6.79), we have ˆ ˆ (y − XΘ)] = [In − Z1 (Z1 Z1 )−1 Z 1 − Z2 (Z2 Z2 )−1 Z2 ] [(y − XΘ) = [In − Z1 (Z1 Z1 )−1 Z 1 ] − Z2 (Z2 Z2 )−1 Z2 . (6.103)
© 2002 by Chapman & Hall/CRC
264
chapter 6. optimization approaches
Rearranging terms, we have ˆ (y − XΘ)] ˆ + Z2 (Z2 Z2 )−1 Z2 . [In − Z1 (Z1 Z1 )−1 Z 1 ] = [(y − XΘ) (6.104) But from the argument leading to (6.83), we know that the term on the left hand side is χ2 with n − (p1 + 1) degrees of freedom. The first term on the right, regardless of the value of Θ1 or that of Θ2 , we have shown, again in the argument leading to (6.83), is χ2 with n−(p1 +p2 +1) degrees of freedom. Finally, by the argument leading to (6.89) the second term on the right hand side is χ2 with p2 degrees of freedom. So, again by Cochran’s Theorem, since the degrees of freedom of the two quadratic forms on the right hand side add to that of the quadratic form on the left, they are stochastically independent. Thus, if Θ2 = 0, then ˆ 2 ) (Z2 Θ ˆ 2 )/(p2 ) (Z2 Θ ˆ (y − XΘ)/(n ˆ (y − XΘ) − p1 − p2 − 1)
(6.105)
is distributed as Fp2 ,n−(p1 +p2 +1) , since the null hypothesis implies that ˆ (Z2 Θ). ˆ Z2 (Z2 Z2 )−1 Z2 = (Z2 Θ) Next, let us consider the situation where Z2 is not necessarily orthogonal to Z1 . Consider ˆ1 y.1 = y − Z1 Θ
(6.106)
−1
= y − Z1 [(Z1 Z1 )
Z 1 y].
(6.107)
Multiplying by Z1 , we have Z1 y.1 = Z1 y − Z1 Z1 [(Z1 Z1 )−1 Z1 y] = 0.
(6.108) (6.109)
Thus, we have established that Z1 is orthogonal to y.1 . Now, let us “regress” Z2 on Z1 to obtain ˆ 2 = (Z1 Z1 )−1 Z1 Z2 Z1 Z = AZ1 , where
© 2002 by Chapman & Hall/CRC
A = (Z1 Z1 )−1 Z1 Z2 .
(6.110) (6.111) (6.112)
testing for model “enrichment”
265
Z2.1 = [I − Z1 (Z1 Z1 )−1 Z1 ]Z2
(6.113)
Next, let
= Z2 − Z1 A.
(6.114)
We note that Z1 is orthogonal to Z2.1 , for Z1 Z2.1 = Z1 Z2 − (Z1 Z1 )(Z1 Z1 )−1 Z1 Z2 = 0.
(6.115) (6.116)
So, we have, in a sense, reduced the more general situation to the orthogonal case if we consider the decomposition: y = Z1 Θ1 + Z2 Θ2 +
(6.117)
= Z1 (Θ1 + AΘ2 ) + (Z2 − Z1 A)Θ2 + ∗
= Z1 Θ + Z2.1 Θ2 + , where
(6.118) (6.119)
Θ∗ = Θ1 + AΘ2 .
(6.120)
As in (6.104), we have ˆ (y − XΘ)] ˆ [In − Z1 (Z1 Z1 )−1 Z 1 ] = [(y − XΘ)
+
(6.121)
Z2.1 (Z2.1 Z2.1 )−1 Z2.1 .
Now, by the argument in (6.104), the left hand side of (6.121) is χ2 with n − p1 + 1 degrees of freedom and the first term on the right hand side of (6.122) is χ2 with n − p1 − p2 − 1 degrees of freedom regardless of Θ2 . Now if Θ2 = 0, then ˆ 2 ) (Z2.1 Θ ˆ 2 ) = Z2.1 (Z2.1 Z2.1 )−1 Z2.1 , (Z2.1 Θ
(6.122)
so that the second term on the right hand side (6.122) is χ2 with p2 degrees of freedom. Since the degrees of freedom of the quadratic forms on the two sides add to n − p1 + 1, by Cochran’s Theorem, we have that the two on the right are independent of each other. Thus, if Θ2 = 0, then ˆ 2 ) (Z2.1 Θ ˆ 2 )/(p2 ) (Z2.1 Θ is distributed as Fp2 ,n−(p1 +p2 +1) . ˆ (y − XΘ)/(n ˆ (y − XΘ) − p1 − p2 − 1) (6.123)
© 2002 by Chapman & Hall/CRC
266
6.9
chapter 6. optimization approaches
2p Factorial Designs
Let us consider designing an experiment for estimating the regression coefficients in the model: yi = β0 + β1 Xi1 + β2 Xi2 + i
(6.124)
where i is N (0, σ 2 ). We will assume that the X1 and X2 variables, or factors, have been recoded via a simple linear transformation. Suppose that the original variables are W1 and W2 and we wish to sample each at two levels. Then we let W1 − W1 lower −1 W1 upper − W1 lower W2 − W2 lower = 2 − 1. W2 upper − W2 lower
X1 = 2
(6.125)
X2
(6.126)
A convenient two factor design with each recoded factor at two possible levels 1 and −1 has the form Table 6.5 Experiment Number = i X0 1 1 2 1 3 1 4 1
Xi1 1 -1 1 -1
Xi2 1 1 -1 -1
yi y1 y2 y3 y4
The given design is orthogonal in that the columns of the design matrix X, X0 , X1 and X2 , are mutually orthogonal. We note that the design consists of 22 different experimental points. Then, the least squares estimator for (β0 , β1 , β2 )’ is given by ⎡ ⎛ ⎢ 1 ⎢⎝ 1 ⎣ 1
⎞
⎛
1 1 1 1 ⎜ 1 −1 1 −1 ⎠ ⎜ ⎝ 1 1 −1 −1 1 ⎡
4 =⎣ 0 0
© 2002 by Chapman & Hall/CRC
0 4 0
⎞⎤−1 ⎛ 1 1 1 ⎥ −1 1 ⎟ ⎟⎥ ⎝ 1 1 −1 ⎠⎦ 1 −1 −1
⎤−1 ⎛
0 0 ⎦ 4
1 ⎝ 1 1
⎛
⎞ y1 1 1 1 ⎜ y2 ⎟ ⎟ −1 1 −1 ⎠ ⎜ ⎝ y3 ⎠ 1 −1 −1 y4 ⎞
⎛
⎞ y1 1 1 1 ⎜ y2 ⎟ ⎟ −1 1 −1 ⎠ ⎜ ⎝ y3 ⎠ 1 −1 −1 y4 ⎞
2p factorial designs ⎛
1 1 = I⎝ 1 4 1
⎛ ⎞ y1 1 1 1 ⎜ y2 −1 1 −1 ⎠ ⎜ ⎝ y3 1 −1 −1 y4 ⎛ y1 +y2 +y3 +y4 ⎞
=⎝
4 y1 +y3 −y2 −y4 4 y1 +y2 −y3 −y4 4
267 ⎞ ⎟ ⎟ ⎠
(6.127)
⎠.
The geometrical motivation for the estimator is clear. βˆ0 is simply the average of all the observed yj . βˆ1 is the difference of the yj levels at X1 = 1 and X1 = −1. βˆ2 is the difference of the yj levels at X2 = 1 and X2 = −1. y2
1
y1
X2
X1
-1
1
-1
y4
y3
Figure 6.17. Two Factor Orthogonal Design. In a more general setting, if the linear model has p factors, yi = β0 + β1 Xi1 + β2 Xi2 + . . . + βp Xip + i , the orthogonal 2p factorial design is constructed in the following way: - Each column has length 2p .
© 2002 by Chapman & Hall/CRC
268
chapter 6. optimization approaches
- The zeroth column, corresponding to factor X0 , consists of ones. - In the first column, corresponding to factor X1 , the signs alternate in groups of 20 , that is, they alternate each time. - In the second column, corresponding to factor X2 , the signs alternate in groups of 21 , that is, they alternate in pairs. - In the third column, corresponding to factor X3 , the signs alternate in groups of 22 , that is, they alternate in groups of four. - In general, in the kth column, corresponding to factor Xk , k ≥ 1, the signs alternate in groups of 2k−1 . Again, we have assumed here that the factors have been recoded and can take values ±1 only. Now, let us return to the two factor model. Typically, we might wish to test whether the levels of the variables X1 and X2 really affect y. So we could argue that the value of y fluctuates about the constant βˆ0 , which is typically not equal to zero, but any appearance of effect by X1 and X2 is simply due to statistical fluke. So then, the model presently under use is: (6.128) yi = β0 + i . The new model, which may be based on illusion rather than on reality, is yi = β0 + β1 Xi1 + β2 Xi2 + i (6.129) where i is N (0, σ 2 ) for both models. Another way of expressing the above is by stating the null hypothesis H0 : β1 = β2 = 0.
(6.130)
We can now employ equation (6.105), where ⎛ ⎜ ⎜ ⎝
Z1 = ⎜ and
⎛ ⎜ ⎜ ⎝
Z2 = ⎜
ˆ2 = Θ
© 2002 by Chapman & Hall/CRC
⎞ ⎟ ⎟ ⎟ ⎠
(6.131)
⎞
1 1 −1 1 1 −1 −1 −1
and
1 1 1 1
βˆ1 βˆ2
⎟ ⎟ ⎟ ⎠
(6.132)
.
(6.133)
2p factorial designs
269
So, then, if H0 be true ˆ 2 ) (Z2 Θ ˆ 2 )/(p2 ) (Z2 Θ = ˆ (y − XΘ)/(n ˆ (y − XΘ) − p1 − p2 − 1) (4βˆ12 + 4βˆ22 )/2 2 ˆ ˆ ˆ i=1 (yj − β0 − β1 Xi1 − β2 Xi2 )
n
(6.134)
is distributed as F2,1 . Below, we show the results of carrying out such an experiment. Table 6.6. Two Factor Experiment Experiment Number = i X0 Xi1 Xi2 yi 1 1 1 1 7.01984 2 1 -1 1 14.08370 3 1 1 -1. 18.70730 4 1 -1 -1 27.73476 The fitted model is given by yˆ = 16.886 − 4.023X1 − 6.334X2 .
(6.135)
In (6.134), the numerator and denominator are given by 112.621 and .064, respectively. This gives an F statistic of 116.836, which is significant at the .065 level. That means, formally, that there is a chance of only .065 that such a large value would have been expected if the null hypothesis were in fact true. In the example considered, y represents the variation in the output of a process, so it is desirable to minimize it. We note that to achieve this, our model indicates it to be best to increase X1 and X2 in the steepest descent direction indicated by the estimated linear coefficients. Thus we might profitably march away from the origin according to the proportions: 4.023 (6.136) X1 = X2 . 6.334 If we wished to find the logical point on the circle inscribed in the current design (which has unit radius), we could try (X1 , X2 ) = (.536, .844). Or, we might go to the steepest descent point on the circle with radius equal to two, namely, (X1 , X2 ) = (.758, 1.193). There are many ways we might proceed. But in many production systems the question of whether X1 and X2 are of relevance to the output
© 2002 by Chapman & Hall/CRC
270
chapter 6. optimization approaches
has been answered affirmatively beforehand. And in such cases, the conditions may be reasonably close to optimality already. The use of a model which includes only linear terms in X1 and X2 always pushes us to the borders of the design region, and beyond. It might be argued that the use of a model containing quadratic terms in X1 and X2 is appropriate, since the quadratic model may admit of a “bottom of the trough” solution.
6.10
Some Rotatable Quadratic Designs
Let us consider the more complicated model 2 2 yi = β0 + β1 Xi1 + β2 Xi2 + β11 Xi1 + β22 Xi2 + β12 Xi1 Xi2 + i . (6.137)
If we tried to use the experiment already designed to estimate the six coefficients of the quadratic model, we know, intuitively, that we would be in trouble, since we have only four experimental points. Nevertheless, let us consider the resulting design table.
i 1 2 3 4
Table 6.7. Two Factor Experiment 2 2 X0 Xi1 Xi2 Xi1 Xi2 Xi1 Xi2 yi 1 1 1 1 1 1 7.01984 1 -1 1 1 1 -1 14.08370 1 1 -1 1 1 -1 18.70730 1 -1 -1 1 1 1 27.73476
2 and X 2 levels are precisely those of the X We note that the Xi1 0 i2 vector. There is thus no hope for using the existing design to estimate the coefficients of these two terms. With Xi1 Xi2 we have a better chance, though we would then have no possibility of estimating the variance via (see (6.82))
σ ˆ2 =
ˆ (y − Xβ)] ˆ [(y − Xβ) . sample size - number of βs estimated
(6.138)
Clearly, the two level design is insufficient for estimating the coefficients of the quadratic model. We would like to effect modifications to the four level model which would enable estimation of the coefficients of the quadratic model, while retaining the useful property of orthogonality.
© 2002 by Chapman & Hall/CRC
some rotatable quadratic designs
271
Returning to the simple model in (6.124), we recall another useful property of this model. Namely, if we look at the variance of a predictor yˆ at some new (X1 , X2 ) value, we have V ar(ˆ y ) = V ar(βˆ0 ) + V ar(βˆ1 )X12 + V ar(βˆ2 )X22 .
(6.139)
This simple form is due to the orthogonality of the design matrix, XD , which insured the independence of the estimated coefficients, for we recall that V(βˆ0 , βˆ1 , βˆ2 ) = (XD XD )−1 σ 2 1 2 = Iσ . 4
(6.140) (6.141)
We note, then, that the variance of yˆ increases as (X1 , X2 ) moves away from the origin in such a way that it is constant on (hyper)spheres about the origin. Such a property is called rotatability by Box and Hunter, and it is, for the case where only the linear terms are included in the model, a natural consequence of the type of orthogonal design employed here. Rotatability is desirable, inasmuch as is generally the case that the center of a design is the present perceived optimum value of X, and generally we will choose experimental points which we believe may produce approximately the same order of effect on the response variable y. Thus, prior to the experiment, we would like for our confidence in the result to be higher and higher as we approach the prior best guess for the optimum, namely the origin. If we should relax too much approximate rotatability of design, we might encounter the bizarre spectacle of patches of low variability of y away from the origin, paid for by increased variability near the origin. One might suppose that rotatability can be most nearly achieved by placing design points in the X space on concentric (hyper)spheres about the origin. And, indeed, this is the strategy proposed in 1957 by Box and Hunter [3],[4]. They have suggested creating a quadratic design which builds upon the original (±1, ±1, . . . , ±1) design points of the orthogonal design for estimating the coefficients of a model of first degree. In other words, it would be desirable to create a design which might utilize the data from a ±1 model. Naturally, such points, in p dimensions, lie on the surface of a (hyper)sphere of radius r=
12 + 12 + . . . 12 =
√
p.
(6.142)
There are 2p such points. Next, we place several points at the origin, thus on the surface of a hypersphere of radius zero. And, finally,
© 2002 by Chapman & Hall/CRC
272
chapter 6. optimization approaches
we place “star” points on a (hyper)sphere of radius α. These points are placed such that all the X coordinates except one are zero. Thus, they are obtained by going from the origin through each of the faces of the (±1, ±1, . . . , ±1) hypercube a distance of α, i.e., these points are of the form (±α, 0, . . . , 0), (0, ±α, . . . , 0), (0, 0, . . . , ±α). There are 2p such points. Let us consider such a design in Figure 6.18.
*
(0,1.41)
(–1,1)
(1,1)
*(–1.41,0)
* (1.41,0)
(–1,–1)
(1,–1)
*
(0,–1.41)
Figure 6.18. Rotatable Two Factor Design. We examine the enhanced experimental design with two center points and four star points in Table 6.8
i 1 2 3 4 5 6 7 8 9 10
Table 6.8. Two Factor Experiment 2 2 X0 Xi1 Xi2 Xi1 Xi2 Xi1 Xi2 yi 1 1 1 1 1 1 7.01984 1 -1 1 1 1 -1 14.08370 1 1 -1 1 1 -1 18.70730 1 -1 -1 1 1 1 27.73476 1 0 0 0 0 0 15.01984 1 0 0 0 0 14.08370 √0 1 2 0 2 0 0 11.05540 √ 1 − 2 √0 2 0 0 23.36286 1 0 2 0 2 0 8.56584 √ 1 0 − 2 0 2 0 27.28023
© 2002 by Chapman & Hall/CRC
some rotatable quadratic designs
273
The quadratic fitted by least squares is given by yˆ = 14.556 − 4.193X1 − 6.485X2 + 1.159X12 + 1.518X22 + .490X1 X2 . (6.143) We now attempt to see if we have gained anything by including the quadratic terms, i.e., we wish to test the null hypothesis β11 = β22 = β12 = 0. We have to employ now (6.123) with ⎛ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ Z2 = ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝
and
⎞
1 1 1 1 0 0 2 2 0 0
1 1 1 −1 ⎟ ⎟ 1 −1 ⎟ ⎟ ⎟ 1 1 ⎟ ⎟ 0 0 ⎟ ⎟ 0 0 ⎟ ⎟ 0 0 ⎟ ⎟ ⎟ 0 0 ⎟ ⎟ 2 0 ⎠ 2 0
⎛
⎞
βˆ11 ⎜ ˆ ⎟ ˆ Θ2 = ⎝ β22 ⎠ . βˆ12
(6.144)
(6.145)
Then, we have
ˆ 2 ) (Z2.1 Θ ˆ 2) (Z2.1 Θ = 4.491. 3 For the residual from the full quadratic model, we have
(6.146)
10 1 2 2 (yi − βˆ0 − βˆ1 Xi1 − βˆ2 Xi2 − βˆ11 Xi1 − βˆ22 Xi2 − βˆ12 Xi1 Xi2 )2 10 − 6 i=1 (6.147) = .435.
The resulting ratio of these two estimates (under the null hypothesis) for σ 2 is given by 4.491/.435= 10.324. Looking at the F3,4 tables, we find this value to be significant at the .025 level. Thus, we most likely will wish to use the full quadratic model. From (6.143), we may seek an optimum by solving the necessary conditions ∂ yˆ ∂ yˆ = = 0. ∂X1 ∂X2
© 2002 by Chapman & Hall/CRC
(6.148)
274
chapter 6. optimization approaches
This gives, as the supposed minimizer of E(ˆ y ), X1 = 1.405 and X2 = 1.909. In point of fact, the data in this two by two factorial experiment were all generated from the model y = 2 + (X1 − 2)2 + (X2 − 3)2 + ,
(6.149)
where is N (0, 1). In fact, substituting the experimentally determined minimum conditions for y in (6.149), we note that we achieve an average y of 3.544, an improvement well worth having over that at the conditions of X1 = 0 and X2 = 0, namely, E(y) = 15 . For the design at hand, the variance of an estimate at a new (X1 , X2 ) value is given by: Var(ˆ y) = σ2 ⎡ ⎢⎛ ⎢ ⎢⎜ ⎢⎝ ⎢ ⎢ ⎣
1 X1 X2 X12 X22 X1 X2 ⎛
1 1 1 1 1 1
1 −1 1 1 1 −1
1 1 −1 1 1 −1
1 −1 −1 1 1 1
1 0 0 0 0 0
× Indeed,
1 0 0 0 0 0
√1 2 0 2 0 0
√1 − 2 0 2 0 0
1 √0 2 0 2 0
1 √0 2 0 2 0
⎞⎜ ⎜ ⎟⎜ ⎠⎜ ⎜ ⎜ ⎝
1 1 1 1 1 1 1 1 1 1
1 −1 1 −1 0 √0 √2 − 2 0 0
1 X1 X2 X12 X22 X1 X2 ⎛ ⎜ ⎜ ⎜ ˆ ⎜ yˆ = Θ ⎜ ⎜ ⎜ ⎝
1 X1 X2 X22 X22 X1 X2
× 1 1 −1 −1 0 0 0 √0 √2 − 2
(6.150) 1 1 1 1 0 0 2 2 0 0
1 1 1 1 0 0 0 0 2 2
1 −1 −1 1 0 0 0 0 0 0
⎞⎤−1 ⎟⎥ ⎟⎥ ⎟⎥ ⎟⎥ ⎟⎥ ⎟⎥ ⎠⎦
.
⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠
and the required result follows easily from the fact that Var(ˆ y ) = E(ˆ y 2 ) − [E(ˆ y )]2 . After a little algebra, we have that
where
V ar(ˆ y ) = σ 2 [.5 + ρ2 (−.5 + .2188ρ2 )]
(6.151)
ρ2 = X12 + X22 .
(6.152)
We show a plot of V ar(ˆ y )/(σ 2 ) from (6.151) in Figure 6.19.
© 2002 by Chapman & Hall/CRC
275
some rotatable quadratic designs
250.0 200.0
^ σ2 Var(y)/ 150.0 100.0
X
3.0 4.0
1.0 2.0
-2.0 -1.0 0.0
4.0 3.0 2.0 1.0 0.0 -1.0 -2.0 -3.0 X2 -4.0
-4.0 -3.0
50.0
1
Figure 6.19. Variance Profile of Rotatable Design.
A seemingly reasonable alternative to the two factor rotatable design considered above is the two factor design with each X at three levels. We show such a design in Table 6.9.
Table 6.9. 32 Experimental Design 2 2 i X0 Xi1 Xi2 Xi1 Xi2 Xi1 Xi2 1 1 -1 -1 1 1 1 2 1 0 -1 0 1 0 3 1 1 -1 1 1 -1 4 1 -1 0 1 0 0 5 1 0 0 0 0 0 6 1 1 0 1 0 0 7 1 -1 1 1 1 -1 8 1 0 1 0 1 0 9 1 1 1 1 1 1
© 2002 by Chapman & Hall/CRC
276
chapter 6. optimization approaches
8.0
^ /σ2 Var(y)
7.0 6.0 5.0 4.0 1.0 0.5
X2
1.0
0.0
0.5 0.0
-0.5
-0 .5
-1.0 -1.0
X
1
Figure 6.20. Variance Profile of 32 Orthogonal Design. In general, the rotatable designs of Box, Hunter and Draper are very useful when one believes himself to be sufficiently close to the optimal value of the X value that it makes sense to fit a quadratic model. For dimensionality p, we start with the simple orthogonal factorial design having 2p points at the vertices of the (hyper)cube (±1, ±1, . . . , ±1). Then we add 2p star points at (±α, 0, . . . , 0) (0, ±α, 0, . . . , 0), . . ., (0, 0, . . . , 0, ±α). Then, we generally add two points (for, say, p ≤ 5, more for larger dimensionality) at the origin. A sufficient condition for rotatability of the design, i.e., that, as above, V ar(ˆ y ) is a function only of ρ2 = X12 + X22 + . . . + Xp2 ,
(6.153)
can be shown to be [4] that α = (2p ).25 .
(6.154)
In Table 6.10 we show rotatable designs for dimensions 2,3,4, 5 and 6. Dimension 2 3 4 5 6
6.11
Table 6.10. Some Rotatable Designs Num. Cube Points Num. Center Points Num. Star Points 4 2 4 8 2 6 16 2 8 32 2 10 64 2 12
Saturated Designs
Let us return to the two factor experiment in Table 6.6.
© 2002 by Chapman & Hall/CRC
α √ 2 2.75 2 21.25 21.5
277
saturated designs Table 6.11. Saturated Experiment Number = i X0 1 1 2 1 3 1 4 1
Three Factor Design Xi1 Xi2 Xi1 Xi2 = Xi3 1 1 1 -1 1 -1 1 -1 -1 -1 -1 1
yi y1 y2 y3 y4
We note that the Xi1 Xi2 column is orthogonal to the X1 and X2 columns. If we assume that there is no interaction effect between X1 and X2 , then we “saturate” the 22 design by confounding Xi1 Xi2 with a third variable X3 , using the design indicated. Let us extend this notion to the saturation of the 23 design in Table 6.12. Table 6.12. Saturated Seven Factor Design Xi1 1 -1 1 -1 1 -1 1 -1
Xi2 1 1 -1 -1 1 1 -1 -1
Xi3 1 1 1 1 -1 -1 -1 -1
Xi1 Xi2 = Xi4 1 -1 -1 1 1 -1 -1 1
Xi1 Xi3 = Xi5 1 -1 1 -1 -1 1 -1 1
Xi2 Xi3 = Xi6 1 1 -1 -1 -1 -1 1 1
Xi1 Xi2 Xi3 = Xi7 1 -1 1 1 -1 1 1 -1
In the design indicated, we note that each of the columns is orthogonal to the others, so our least squares estimation procedure is unusually simple. Let us consider the simple linear model yi = β0 +
7
βj Xij + i .
(6.155)
j=1
We note, for example, that we will have ⎛ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝
βˆ0 βˆ1 βˆ2 βˆ3 βˆ4 βˆ5 βˆ6 βˆ7
⎞
⎛ 1 1 1 1 1 ⎟ ⎜ 1 −1 ⎟ 1 −1 1 ⎜ ⎟ ⎜ ⎟ ⎜ 1 1 −1 −1 1 ⎟ ⎜ ⎟ 1⎜ 1 1 1 1 −1 ⎟ ⎟= ⎜ ⎟ 1 1 8⎜ ⎜ 1 −1 −1 ⎟ ⎜ ⎟ 1 −1 −1 ⎜ 1 −1 ⎟ ⎜ ⎟ ⎝ 1 1 −1 −1 −1 ⎠
1 −1
1
1 −1 1 −1 −1 1 −1 1 −1 1
1 1 −1 −1 −1 −1 1 1
1 −1 −1 −1 1 1 1 −1
⎞⎛ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎠⎝
y1 y2 y3 y4 y5 y6 y7 y8
⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟. ⎟ ⎟ ⎟ ⎟ ⎟ ⎠
(6.156) Generally speaking, one would seldom wish to be quite so ambitious as to carry out such a massive saturation. In particular, we would have zero degrees of freedom for the purposes of estimating the variance. However,
© 2002 by Chapman & Hall/CRC
278
chapter 6. optimization approaches
some feel that such designs are useful in “fishing expeditions,” situations where we have vague feelings that some variables might just possibly be useful in minimizing, say, the production variability. It is clear how one could construct designs with less than full saturation. For example, we might decide to confound only one variable, say X4 with one of the interaction terms, for example, with the X1 X2 X3 term. All the least squares computations for orthogonal designs can be employed, and then one would have three degrees of freedom remaining for the estimation of the underlying variance.
6.12
A Simulation Based Approach
There are many situations where we have reasonable comprehension, at the microaxiom level, but have little justification for using an ad hoc Taylor’s expansion type model to describe what is going on at the macro aggregate level. Attempts to utilize such ad hocery, to assume that if one throws in enough terms all will be well, has rendered the utility of some fields, such as econometrics, of marginal utility. There is generally a very great problem in getting from the micro axioms to the macro aggregate model (“closed form”) which is its consequence. Happily, rapid computing gives us hope in many cases of estimating the parameters of the micro model from aggregate data without the necessity of explicitly computing, say, a closed form likelihood function. The SIMEST algorithm is explicated more fully elsewhere (e.g., [8], [9], [10]). Here we simply give an indication of its potential in a statistical process control setting. The reader may find it useful to review the Poisson process section in the Appendix. Let us consider the following axiomitization of a plausible mechanism by which system errors are generated and corrected. (1) Following the standard assumptions of an homogeneous Poisson process, the probability that a system error not caused by a prior system error will appear in the infinitesimal interval [t, t + Δt] is given by P1 (system error in [t, t + Δt]) = λΔt. Thus, the cumulative distribution function of time to occurrence of a new system error is given by F1 (t) = 1 − e−λt .
© 2002 by Chapman & Hall/CRC
279
a simulation based approach
(2) At its origin, the “effect” of the system error is 1. As time progresses, it grows exponentially, i.e., s time units after the error is created, its effect is given by E = eαs . (3) The probability that a system error, not discovered previously, is caught and eliminated in the time interval [s, s + Δs] is proportional to the “effect” of the system error, i.e., P2 ( system error caught in [s, s + Δs]) = γeαs Δs. After a little integration and algebra, we see that the cdf of time after its generation until the detection of a system error is given by F2 (sD ) = 1 − P [no detection by sD ] = 1 − exp(−
sD 0
γeατ dτ )
γ = 1 − exp(− [eαsD − 1]). α We note that the time origin of sD is the instant where the new system error came into being. (4) The probability that an existing system error will itself generate a new system error somewhere else in the system during the interval [s, s + Δs] is proportional to the effect of the system error, i.e., P3 (new secondary system error generated in [s, s + Δs]) = φeαs Δs. Hence, the cdf of time until creation of a new (secondary) system error is given by φ F3 (sS,1 ) = 1 − exp(− [eαsS,1 − 1]). α (5) Once generated, each system error has the same underlying kinetic characteristics in terms of effect, spread and detection as any other system error. Now, from the real aggregate world, all that we observe is the discovery and correction times of system errors. In order to take these times, say {T1 , T2 , . . . , TM }, and employ them for the estimation of the characterizing parameters, α, λ, γ, φ, in a “closed form” setting would require the computation of the likelihood function or some such surrogate. Experience [8] indicates that this kind of exercise is exceptionally time consuming. Rapid computing gives us a means of bypassing this step.
© 2002 by Chapman & Hall/CRC
280
chapter 6. optimization approaches
Let us take a time interval in a simulation which is equal to that of the observed data train. Assuming a value for the vector parameter (α, λ, γ, φ), we begin to generate “pseudo” system errors. We note that the task for accomplishing this is relatively easy, since, for any random variable say, X, with increasing (in X) cdf F (X), the random variable F (X) is uniform on the unit interval. To see this, we note that if we look at the cdf G of F (X), we have G(y) = P [F (X) ≤ y] = P [X ≤ F −1 (y)] = F (F −1 (y)) = y. Thus, we can simulate the time of a system error by generating a random uniform variate u and then solving for t in u = F1 (t) = 1 − e−λt .
(6.157)
Using this t as the starting point for the time at which there is a risk of the generation of “secondary” system errors, we can generate these using the relationship that if u is a uniform random observation over the unit interval, then we can solve for sS,1 using φ u = F3 (sS,1 ) = 1 − exp(− [eαsS,1 − 1]). α
(6.158)
We then generate the time of the discovery of the first primary system error by generating the uniform random variable u and then solving for sD from γ (6.159) u = F2 (sD ) = 1 − exp(− [eαsD − 1]). α In the event that (6.160) sD < sS,1 then we will not have to worry that the primary system error has generated a secondary system error. But if sD ≥ sS,1
(6.161)
then a secondary system error will have been generated, and we will have to go through a similar exercise to see what additional errors it might have caused, and so on. (Indeed, in such a case, we will have to generate tertiary system errors which may have been generated by the secondary ones.) For the first of these, clearly we have γ u = F2 (sS,2 ) = 1 − exp(− [eαsS,2 − eαsS,1 ]). α
© 2002 by Chapman & Hall/CRC
(6.162)
281
references
A complete flowcharting of this sort of simulation is straightforward though nontrivial. For further details in dealing with such a simulation, we refer the reader to [8], [9], [10]. But, after we have computed, say 10,000, simulations, we note the average number of pseudo system errors discovered up to time T1 , say n1 , the average number of pseudo system errors after T1 and before T2 , say n2 , etc. Then a natural criterion function for the appropriateness of the parameters assumed might be a goodness of fit type of function, such as χ2 (α, λ, γ, φ) =
M +1
(ni − 1)2 .
(6.163)
i=1
Utilizing a standard optimization routine such as the Nelder-Mead algorithm described in Section 6.2, we then have a straightforward means of moving through the parameter space until we have good concordance between our real world data and that generated assuming a value of (α, λ, γ, φ).
References [1] Adams, B.M. and Woodall, W.H. “An analysis of Taguchi’s on-line process control model under a random-walk model,” Technometrics, 31, pp. 401-413. [2] Box, G.E.P. and Draper, N.R. (1969). Evolutionary Operation. New York: John Wiley & Sons. [3] Box, G.E.P. and Draper, N.R. (1989). Empirical Model-Building and Response Surfaces. New York: John Wiley & Sons. [4] Box, G.E.P. and Hunter, J.S. (1957). “Multifactor experimental designs for exploring response surfaces,” Annals of Statistics, 28, pp. 195241. [5] Kendall, M.G. and Stuart, A.(1958). The Advanced Theory of Statistics, I & II. New York: Hafner. [6] Nelder, J.A. and Mead, R. (1965). “A simplex method for function minimization,” Computational Journal,7, pp. 308-313. [7] Roy, R.(1990). A Primer on the Taguchi Method. New York: Van Nostrand Reinhold.
© 2002 by Chapman & Hall/CRC
282
chapter 6. optimization approaches
[8] Thompson, J.R., Atkinson, E.N. and Brown, B.W. (1987). “SIMEST: An algorithm for simulation-based estimation of parameters characterizing a stochastic process,” Cancer Modeling, Thompson, J.R. and Brown, B.W., eds., New York: Marcel Dekker, pp. 387-415. [9] Thompson, J.R. (1989). Empirical Model Building. New York: John Wiley & Sons. [10] Thompson, J.R. and Tapia, R.A. (1990). Nonparametric Function Estimation, Modeling and Simulation. Philadelphia: Society for Industrial and Applied Mathematics.
Problems Remark: In Problems 6.1-6.3, extrema of functions of one variable are to be found. Such problems can also be solved using the Nelder-Mead simplex algorithm. When compared with problems with two control variables, each step of the algorithm is simplified in that the Best and the Second Worst points coincide. Whatever the dimension of an optimization problem, successful implementation of the simplex algorithm requires that a stopping rule be incorporated into the algorithm, so that it terminates in a finite time. The reader is asked to use stopping rules of his or her own choice. Problem 6.1. Consider the function f (x) = .1(x4 − 20x2 + 5x) over the interval x ∈ [−5, 5]. a. Apply the simplex algorithm to find a minimum of the function (in the given interval). Use two points of your choice from the function domain as the starting points of the procedure. After stopping the algorithm, repeat the search several times, using another starting point. Plot the function using a computer graphics software and compare your results with those given by the plot. b. Repeat a with a noise corrupting the readings of the function values: Whenever a function evaluation is needed, add to the function value a random variable from N (0, .0025), generated by a computer’s random number generator. Random errors corrupting the readings should form a set of independent random variables. Use the same starting points as in a.
© 2002 by Chapman & Hall/CRC
problems
283
c. Repeat b with a random noise from N (0, .01). d. Repeat b with a random noise from N (0, .25). e. Repeat b with a random noise from N (0, 1). Problem 6.2. Consider the function f (x) =
sin(4x) x
over the interval x ∈ [−1, 6]. a. Apply the simplex algorithm to find a maximum of the function (in the given interval). Use two points of your choice from the function domain as the starting points of the procedure. After stopping the algorithm, repeat the search several times, using other starting points. Plot the function using a computer graphics software and compare your results with those given by the plot. b. Repeat a with a noise corrupting the readings of the function values: Whenever a function evaluation is needed, add to the function value a random variable from N (0, .0025), generated by a computer’s random number generator. Random errors corrupting the readings should form a set of independent random variables. Use the same starting points as in a. c. Repeat b with a random noise from N (0, .01). d. Repeat b with a random noise from N (0, .25). e. Repeat b with a random noise from N (0, 1). Problem 6.3. Consider the function f (x) = x sin(x) over the interval x ∈ [0, 12]. a. Apply the simplex algorithm to find a minimum of the function (in the given interval). Use two points of your choice from the function domain as the starting points of the procedure. After stopping the algorithm, repeat the search several times, using other starting points. Plot the function using a computer graphics software and compare your results with those given by the plot. b. Repeat a with a noise corrupting the readings of the function values: Whenever a function evaluation is needed, add to the function value a random variable from N (0, .0025), generated by a computer’s random number generator. Random errors corrupting the readings should form a set of independent random variables. Use the same starting points as in a.
© 2002 by Chapman & Hall/CRC
284
chapter 6. optimization approaches
c. Repeat b with a random noise from N (0, .01). d. Repeat b with a random noise from N (0, .25). e. Repeat b with a random noise from N (0, 1). f. Repeat a to e replacing minimization function by maximization. Problem 6.4. Consider the function f (x1 , x2 ) = x1 sin(x1 ) + x2 sin(x2 ) over the square x1 ∈ [0, 12], x2 ∈ [0, 12]. a. Apply the simplex algorithm to find a minimum of the function (in the given square). Use three points of your choice from the function domain as the starting points of the procedure. After stopping the algorithm, repeat the search several times, using other starting points. Plot the function using a computer graphics software and compare your results with those given by the plot. b. Repeat a with a noise corrupting the readings of the function values: Whenever a function evaluation is needed, add to the function value a random variable from N (0, .0025), generated by a computer’s random number generator. Random errors corrupting the readings should form a set of independent random variables. Use the same starting points as in a. c. Repeat b with a random noise from N (0, .01). d. Repeat b with a random noise from N (0, .25). e. Repeat b with a random noise from N (0, 1). Problem 6.5. Consider the function f (x1 , x2 ) = .5(x21 + x22 ) over the square x1 ∈ [−4, 4], x2 ∈ [−4, 4]. a. Apply the simplex algorithm to find a minimum of the function (in the given square). Use three points of your choice from the function domain as the starting points of the procedure. After stopping the algorithm, repeat the search several times, using other starting points. Plot the function using a computer graphics software and compare your results with those given by the plot. b. Repeat a with a noise corrupting the readings of the function values: Whenever a function evaluation is needed, add to the function value a random variable from N (0, .0025), generated by a computer’s random number generator. Random errors corrupting the readings should form
© 2002 by Chapman & Hall/CRC
285
problems
a set of independent random variables. Use the same starting points as in a. c. Repeat b with a random noise from N (0, .01). d. Repeat b with a random noise from N (0, .25). e. Repeat b with a random noise from N (0, 1). Problem 6.6. Consider the function
f (x1 , x2 ) =
2 sin(2 x21 + x22 )
x21 + x22
over the square x1 ∈ [−5, 5], x2 ∈ [−5, 5] (the function has been borrowed from a manual of the SYSTAT statistical package). a. Apply the simplex algorithm to find a minimum of the function (in the given square). Use three points of your choice from the function domain as the starting points of the procedure. After stopping the algorithm, repeat the search several times, using other starting points. Plot the function using a computer graphics software and compare your results with those given by the plot. b. Repeat a with a noise corrupting the readings of the function values: Whenever a function evaluation is needed, add to the function value a random variable from N (0, .0025), generated by a computer’s random number generator. Random errors corrupting the readings should form a set of independent random variables. Use the same starting points as in a. c. Repeat b with a random noise from N (0, .01). d. Repeat b with a random noise from N (0, .25). e. Repeat b with a random noise from N (0, 1). Problem 6.7. Consider the following data set. X 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5
© 2002 by Chapman & Hall/CRC
Y 1.5 1.022 1.050 1.055 1.079 1.120 1.113 1.184 1.160 1.174 1.174 1.198 1.218 1.218 1.250 1.258
X 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0
Y 1.262 1.277 1.295 1.306 1.323 1.344 1.342 1.352 1.354 1.369 1.383 1.382 1.391 1.393 1.405
286
chapter 6. optimization approaches
Use the transformational ladder to find the transformation that (approximately) linearizes the relationship between the independent variable X and dependent variable Y . Problem 6.8. Consider the following data set. X 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5
Y 0.320 0.587 1.279 2.775 2.805 3.709 8.808 9.917 10.548 13.620 15.847 19.652 23.211 28.933 33.151 39.522
X 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0
Y 45.652 53.058 62.228 70.982 80.754 94.021 104.442 119.730 135.101 150.323 166.241 184.782 208.577 230.051 256.714
Use the transformational ladder to find the transformation that (approximately) linearizes the relationship between the independent variable X and dependent variable Y . Problem 6.9. Perform the following experiment. Tabulate the function Y = ln(X) for X = 3.1, 3.2, 3.3, . . . , 5.8, 5.9, 6.0. a. Add normal noise of mean 0 and standard deviation .01 to the function readings, that is, use a computer’s random number generator to generate a sequence of thirty normal variates from N (0, (.01)2 ) and add these variates to successive readings of the function. Using the transformational ladder, attempt to find the transformation that (approximately) linearizes the relationship between X and Y . b. Repeat a with normal noise of mean 0 and standard deviation .02. c. Repeat a with normal noise of mean 0 and standard deviation .05. d. Repeat a with normal noise of mean 0 and standard deviation .1. Comment on your results. Problem 6.10. Perform the following experiment. Tabulate the function Y = exp(X) for X = 2.1, 2.2, 2.3, . . . , 4.8, 4.9, 5.0. a. Add normal noise of mean 0 and standard deviation .01 to the function readings, that is, use a computer’s random number generator to generate a sequence of thirty normal variates from N (0, (.01)2 ) and add these variates to successive readings of the function. Using the transformational ladder, attempt to find the transformation that (approximately) linearizes the relationship between X and Y . b. Repeat a with normal noise of mean 0 and standard deviation .02.
c. Repeat a with normal noise of mean 0 and standard deviation .05.
d. Repeat a with normal noise of mean 0 and standard deviation .1.
Comment on your results. Compare your conclusions with those from Problem 6.9.
Problem 6.11. It is believed that the variability of a production process is a quadratic function of two decision variables, X1 and X2. The variables have been linearly transformed in such a way that the perceived optimum lies at X1 = X2 = 0. For the given rotatable quadratic design, the following function readings have been obtained.

 j    X0   X1j      X2j      X1j²   X2j²   X1j X2j    yj
 1    1    1        1        1      1       1        3.365
 2    1   -1        1        1      1      -1        3.578
 3    1    1       -1        1      1      -1       12.095
 4    1   -1       -1        1      1       1       11.602
 5    1    0        0        0      0       0        5.699
 6    1    0        0        0      0       0        5.320
 7    1    1.414    0        2      0       0        9.434
 8    1   -1.414    0        2      0       0        8.145
 9    1    0        1.414    0      2       0        0.098
10    1    0       -1.414    0      2       0       12.431
Estimate the coefficients of the quadratic model in a neighborhood of the origin. Test the null hypothesis that the term with X1 X2 is negligible. What should be the next step (or steps) of minimizing the process variability?
Problem 6.12. It is believed that the variability of a production process is a quadratic function of three decision variables, X1, X2 and X3. The variables have been linearly transformed in such a way that the perceived optimum lies at X1 = X2 = X3 = 0. For the given rotatable quadratic design, the following function readings have been obtained.

 j    X0   X1j       X2j       X3j        yj
 1    1    1         1         1         4.912
 2    1   -1         1         1         4.979
 3    1    1        -1         1        13.786
 4    1   -1        -1         1        13.065
 5    1    1         1        -1         4.786
 6    1   -1         1        -1         4.253
 7    1    1        -1        -1        14.295
 8    1   -1        -1        -1        13.659
 9    1    0         0         0         6.188
10    1    0         0         0         5.650
11    1    1.682     0         0        11.023
12    1   -1.682     0         0        11.236
13    1    0         1.682     0         2.196
14    1    0        -1.682     0        15.159
15    1    0         0         1.682     6.699
16    1    0         0        -1.682     6.320
Estimate the coefficients of the quadratic model in a neighborhood of the origin (note that the given design implies the values of X1j², X2j², . . . , X2j X3j). Test the null hypothesis that the terms with X1 X2, X1 X3
and X2 X3 are all negligible. Test the negligibility of each of the terms mentioned separately. Test the null hypothesis that the term with X1² is negligible. Test the null hypothesis that the term with X2² is negligible. Test the null hypothesis that the term with X3² is negligible. Estimate the variance of the error terms. What should be the next step (or steps) of minimizing the process variability?
Problem 6.13. Perform the following experiment. Suppose the problem consists in finding the minimum of the quadratic function

f(x) = 3 + (x1 − 3)² + (x2 − 2)² + 2(x3 − 4)²

when function evaluations are subject to normally distributed random errors with mean 0 and variance 1. Simulate the process of searching for the minimum, starting from a neighborhood of the origin and:
a. estimating the coefficients of the linear approximation of the function, then making one step in the steepest descent direction, and, finally, approximating the function by a quadratic;
b. implementing the Nelder-Mead simplex algorithm;
c. comparing the two approaches.
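For readers who wish to experiment with the noisy searches asked for in Problems 6.6 and 6.13, the following is a minimal sketch (not part of the original text), assuming Python with NumPy and SciPy's Nelder-Mead implementation; the function, the noise level, and the starting point are taken from Problem 6.13, and any other simplex implementation would serve equally well.

```python
# A minimal sketch of a noisy Nelder-Mead search for Problem 6.13.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

def f(x):
    # Quadratic of Problem 6.13 (minimum value 3 at (3, 2, 4)).
    return 3.0 + (x[0] - 3.0)**2 + (x[1] - 2.0)**2 + 2.0*(x[2] - 4.0)**2

def f_noisy(x, sigma=1.0):
    # Each evaluation is corrupted by an independent N(0, sigma^2) error.
    return f(x) + rng.normal(0.0, sigma)

result = minimize(f_noisy, x0=np.zeros(3), method="Nelder-Mead")
print(result.x, f(result.x))   # compare the located point with the true minimum
```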
Chapter 7
Multivariate Approaches

7.1
Introduction
In very many cases, multivariate time indexed data are available for analysis. Yet it is rather standard for investigators to carry out their searches for Pareto glitches by control charting the data base one dimension at a time. A little modeling of the SPC situation gives us an indication that we are dealing with a rather special multivariate scenario. Multivariate tests can easily be constructed which give us the potential of substantial improvement over the one factor at a time approach. We recall that the density function of a variable x of dimension p from a multivariate normal distribution is given by

f(x) = |2πΣ|^{-1/2} exp{−(1/2)(x − μ)′ Σ^{-1}(x − μ)},    (7.1)
where μ is a constant vector and Σ is a constant positive definite matrix. By analogy with the use of control charts to find a change in the distribution of the output and/or the input of a module, we can describe a likely scenario of a process going out of control as one in which the mean suddenly changes from, say, μ_0 to some other value. One approach to discovering the point in time where a glitch occurred would be to find lots where μ ≠ μ_0. Let us note that dealing with multivariate data often requires that the data be standardized prior to an analysis, i.e., that they be mean centered and rescaled to have components with unit variance. In particular, standardization is usually recommended if observation components are measured in different units.
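As a minimal illustration of the standardization step just mentioned, the following sketch (assuming Python with NumPy, which are not part of the text) mean centers the columns of a data matrix and rescales them to unit sample variance.

```python
# A minimal sketch of standardizing multivariate data: rows are observations,
# columns are variables; each column is centered and scaled to unit variance.
import numpy as np

def standardize(X):
    X = np.asarray(X, dtype=float)
    means = X.mean(axis=0)
    stds = X.std(axis=0, ddof=1)      # sample standard deviations
    return (X - means) / stds

# Example with three 2-dimensional observations (illustrative values only).
X = np.array([[90.1, 85.2], [89.9, 85.0], [90.0, 85.1]])
print(standardize(X))
```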
7.2
Likelihood Ratio Tests for Location
First of all, let us consider the situation where we believe that the shift in the mean will not be accompanied by a significant shift in the variance. In this case, if we have a long history of data from a process in control, we can (see Appendix) transform the data so that the covariance matrix is diagonal, i.e.,
f(x; μ) = Π_{i=1}^{p} (2πσ_{ii})^{-1/2} exp{−(1/2)(x_i − μ_i) σ_{ii}^{-1} (x_i − μ_i)}.    (7.2)
We will assume that the variables have been shifted so that in the “in control” situation μ = 0.
(7.3)
Here, the likelihood ratio statistic is given by

λ(x_1, x_2, . . . , x_n) = [Π_{i=1}^{n} f(x_i; x̄)] / [Π_{i=1}^{n} f(x_i; 0)].    (7.4)
But then, after a little algebra, we are left with the test statistic

Q_1 = Σ_{j=1}^{p} ( x̄_j / (σ_j/√n) )².    (7.5)
If, indeed, the process is under control (i.e., μ = 0), then we are left with the sum of squares of p independent normal random variables with means zero and unit variances and the statistic Q_1 is distributed as χ² with p degrees of freedom. In the situation where we believe that the variance structure may change when the mean changes, we can estimate the covariance matrix Σ in the obvious fashion (using the maximum likelihood estimator) via

S = (1/(n − 1)) Σ_{i=1}^{n} (x_i − x̄)(x_i − x̄)′.    (7.6)
We wish to obtain a likelihood ratio test to test the null hypothesis

H_0 : μ = μ_0.    (7.7)
Going through a fair amount of straightforward algebra, we obtain the likelihood ratio statistic

T² = n(x̄ − μ_0)′ S^{-1} (x̄ − μ_0).    (7.8)
Essentially, the likelihood ratio test (Hotelling's T²) is based on comparing the spread of a lot of size n about its sample mean with the spread of the same lot about μ_0. Further algebra demonstrates that we can obtain a probability equal to α of rejecting the null hypothesis when it is true, by using as the critical region those values of T² which satisfy

T² > [p(n − 1)/(n − p)] F_{p,n−p}(α),    (7.9)
where F_{p,n−p}(α) is the upper (100α)th percentile of the F_{p,n−p} distribution. We have assumed that a process goes "out of control" when the normal distribution of the output changes its vector mean from μ_0 to some other value, say μ. Let us define the dispersion matrix (covariance of the set of estimates μ̂),

V_{(p×p)} = [ Var(μ̂_1)         Cov(μ̂_1, μ̂_2)   . . .   Cov(μ̂_1, μ̂_p)
              Cov(μ̂_1, μ̂_2)   Var(μ̂_2)         . . .   Cov(μ̂_2, μ̂_p)
              . . .
              Cov(μ̂_1, μ̂_p)   Cov(μ̂_2, μ̂_p)   . . .   Var(μ̂_p) ].    (7.10)

Here, naturally,

μ̂_j = x̄_j = (1/n) Σ_{i=1}^{n} x_{ij}    (7.11)

for each j. It is of interest to investigate the power (probability of rejection of the null hypothesis) of the Hotelling T² test. This can be done as a function of the noncentrality

λ = (μ − μ_0)′ V_{(p×p)}^{-1} (μ − μ_0).    (7.12)
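A minimal sketch of the test given by (7.8) and (7.9), assuming Python with NumPy and SciPy (which are not part of the text), might read as follows; the lot shown is simulated and purely illustrative.

```python
# A minimal sketch of the T^2 test of (7.8)-(7.9): the statistic for one lot of
# n p-dimensional observations is compared with the F-based critical value.
import numpy as np
from scipy.stats import f

def hotelling_t2(lot, mu0, alpha=0.002):
    X = np.asarray(lot, dtype=float)           # n x p lot
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)                # sample covariance (divisor n - 1)
    diff = xbar - mu0
    t2 = n * diff @ np.linalg.solve(S, diff)   # T^2 = n (xbar - mu0)' S^{-1} (xbar - mu0)
    ucl = p * (n - 1) / (n - p) * f.ppf(1 - alpha, p, n - p)
    return t2, ucl, t2 > ucl

rng = np.random.default_rng(0)
lot = rng.normal(size=(10, 3))                 # a simulated "in control" lot, mu = 0
print(hotelling_t2(lot, mu0=np.zeros(3)))
```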
For n very large we can use an asymptotic (in n) χ² approximation for the power of the Hotelling T² test:

P(λ) = ∫ from ((p + λ)/(p + 2λ)) χ²_α(p) to ∞ of dχ²( p + λ²/(p + 2λ) ),    (7.13)
where dχ²(p) is the differential of the cumulative distribution function of the central χ² distribution with p degrees of freedom and χ²_α(p) its 100(1 − α) per cent point. We demonstrate below power curves for the Hotelling T² test for various sample sizes, n, and various noncentralities λ. The dimensionalities are one through five in Figures 7.1 through 7.5 and ten in Figure 7.6.

Figure 7.1. Power Curves for One Dimensional Data. [Power versus noncentrality λ for several lot sizes n and the asymptotic χ² curve.]
Figure 7.2. Power Curves for Two Dimensional Data. [Power versus noncentrality λ for several lot sizes n and the asymptotic χ² curve.]
Figure 7.3. Power Curves for Three Dimensional Data. [Power versus noncentrality λ for several lot sizes n and the asymptotic χ² curve.]
Figure 7.4. Power Curves for Four Dimensional Data. [Power versus noncentrality λ for several lot sizes n and the asymptotic χ² curve.]
Figure 7.5. Power Curves for Five Dimensional Data. [Power versus noncentrality λ for several lot sizes n and the asymptotic χ² curve.]
Figure 7.6. Power Curves for Ten Dimensional Data. [Power versus noncentrality λ for several lot sizes n and the asymptotic χ² curve.]

Let us consider, next, the situation where we wish to compare the performance of the multidimensional Hotelling T² test with α = probability of Type I error equal to .002 with the one dimension at a time test. We consider, by way of comparison, a situation where

H_0 : λ = (μ − μ_0)′ V_{(p×p)}^{-1} (μ − μ_0) = 0.    (7.14)
We consider the alternative

H_1 : λ > 0.    (7.15)
We are assuming here that the noncentrality is equal in each dimension. Suppose, for example, that we wish to compare the one dimension at a time test with Hotelling's T² in the case of p dimensions. For noncentrality of 0, we wish to have probability of rejection equal to .002. This means that for each one dimensional test, we design a test with α = 1 − (.998)^{1/p}. In Figure 7.7, we show the ratio of noncentrality per dimension for the battery of one dimensional tests required to give the same power as that of the multivariate likelihood test given in (7.5). The efficiency is plotted versus the noncentrality per dimension used in the Hotelling T² test.

Figure 7.7. Relative Efficiency. [Efficiency versus noncentrality per dimension λ/p; curves for p = 2, 3, 4, 5.]

We note that the efficiencies in this case exceed 100%. Let us consider a particular example where the efficiencies of the multivariate test are
even more impressive. Suppose that
H0 : μ = 0.
(7.16)
We shall assume a covariance matrix of

Σ_{(5×5)} = [ 1   .8  .8  .8  .8
              .8  1   .8  .8  .8
              .8  .8  1   .8  .8
              .8  .8  .8  1   .8
              .8  .8  .8  .8  1 ].    (7.17)
We are assuming here an α level of .002 for both the battery of five one dimensional tests and the Hotelling T² five dimensional test. For equal slippage in each dimension, we show the ratio of the sample size required to obtain the power of .9 with the battery of one dimensional tests to that required to obtain a power of .9 with the Hotelling test. The "slippage per dimension" abscissa is the absolute value per dimension of each of the means. In this case, we have assumed that the "slippages" in the first two dimensions are positive; the last three are negative. The irregularities in Figure 7.8 are due to the granularity effects of discrete sample sizes. Let us now consider explicitly the situation when we are given several, say N, lots of size n of p-dimensional data. In order to verify whether the lots are "in control," we can then use (7.8) repeatedly. Whenever the value of the Hotelling's T² statistic is greater than the right hand side of (7.9), the lot is deemed to be "out of control." Thus, the right hand side of (7.9) is in fact the Upper Control Limit for the in-control values of the lots' T² statistics. The Lower Control Limit is simply equal to zero and is of no interest, since the T² statistic is necessarily non-negative. Charting the lots' T² values against the UCL gives us a multidimensional counterpart of the standard X̄ chart for univariate data.
Figure 7.8. Relative Efficiency. [Ratio of sample sizes required for power .9, one dimensional battery versus Hotelling test, plotted against slippage per dimension.]

If the lot size is small, as is frequently the case, the sample dispersion matrix S can sometimes fail to reveal the true correlation structure of data under scrutiny. A way out of the problem consists in constructing a "pooled" sample dispersion matrix, the same for all lots and equal to the average of N lot sample dispersion matrices. Let us consider this last proposal in greater detail. Let S_j denote the sample dispersion matrix of the jth lot and let

S̄ = (1/N) Σ_{j=1}^{N} S_j.
In order to be in full analogy with the one dimensional case, let us also replace μ_0 by its sample counterpart,

x̄ = (1/N) Σ_{j=1}^{N} x̄_j,

where x̄_j is the sample mean for the jth lot. Now the T²-like statistic for the jth lot assumes the form
T_j² = n(x̄_j − x̄)′ S̄^{-1} (x̄_j − x̄),    (7.18)

where j = 1, 2, . . . , N. It can be shown (see Alt, Goode and Wadsworth [5]) that

[(nN − N − p + 1) / (p(n − 1)(N − 1))] T_j²

has F distribution with p and nN − N − p + 1 degrees of freedom. Thus, we consider the jth lot to be out of control if

T_j² > [p(n − 1)(N − 1) / (nN − N − p + 1)] F_{p,nN−N−p+1}(α),    (7.19)

where F_{p,nN−N−p+1}(α) is the upper (100α)th percentile of the F distribution with p and nN − N − p + 1 degrees of freedom. In other words, the right hand side of (7.19) is the UCL for the T_j² statistics. Usually, α is set equal to .002. As in the case of using X̄ charts for detecting Pareto glitches among lots of one dimensional data, the above analysis enables one to examine future multivariate data on the basis of the past. Suppose N_1 lots have already been examined, N of them being "in control." Suppose also that assignable causes for N_1 − N glitches have been found and removed. Thus, we can recompute the sample mean x̄ and the pooled dispersion matrix S̄ for the remaining N lots. Now, the T²-like statistic can be constructed for a new lot with the sample mean x̄_∗,

T_∗² = n(x̄_∗ − x̄)′ S̄^{-1} (x̄_∗ − x̄).    (7.20)

Alt, Goode and Wadsworth have shown that

[(nN − N − p + 1) / (p(n − 1)(N + 1))] T_∗²

has the F distribution with p and nN − N − p + 1 degrees of freedom. Thus, we can consider the new lot to be out of control if

T_∗² > [p(n − 1)(N + 1) / (nN − N − p + 1)] F_{p,nN−N−p+1}(α).    (7.21)
© 2002 by Chapman & Hall/CRC
likelihood ratio tests for location
299
matrix S. If N is sufficiently large, both limits are practically the same. ¯ by μ 0 in (7.18) and (7.20), Similarly, we can replace the sample mean x and still use the UCL’s given by (7.19) and (7.21), provided N is not too small. The above T 2 tests are, of course, for location or the process mean. The first of them, given by (7.19), is for past observations, while that given by (7.21) is for future observations. Both tests are for p-dimensional observations grouped into lots of size greater than 1 (actually, of size n > 1 in our considerations). Given individual (i.e., ungrouped) past observations, it is an easy exercise to construct a Hotelling’s T 2 chart for future observations, but developing such a chart for past individual data requires more effort. The latter chart has been provided by Wierda, and both are described in Wierda [21]; see also a discussion by Ryan [17], Section 9.5. We shall conclude this Section with an illustration why a T 2 test behaves differently from a corresponding battery of one dimensional tests. The reason is in fact rather obvious and can be stated briefly: it is only the former test which takes into account possible correlations between vector components. For the sake of illustration, we shall confine ourselves to the case p = 2. However, before we proceed with the illustration, let us pause for a moment on the problem of proper choice of (common) significance level for one dimensional tests. If the two vector components were stochastically independent, one would require that the two one dimensional tests have significance level β, where 1 − α = (1 − β)2 . Clearly, given that α is small, we would then have 1 − α ≈ 1 − 2β and hence we could use β = α/2. Interestingly, it follows from the celebrated Bonferroni inequality that we should use β = α/2 also when the vector components are not independent. Namely, in its full generality, the Bonferroni inequality states that if we are given p events Ai , i = 1, 2, . . . , p, and if P (Ai ) = 1 − α/p,
© 2002 by Chapman & Hall/CRC
300
chapter 7. multivariate approaches
then
P (A1 , A2 , . . . , Ap ) ≥ 1 − α.
Thus, if we construct p (2 in our case) one dimensional tests, each of significance level α/p, then the significance level for the whole battery of p tests applied simultaneously is not greater than α, whatever the stochastic relationships between the p variables. Returning to our illustration of how a T 2 test differs from a corresponding battery of one dimensional tests, let us assume that we are given a sample of 25 bivariate normally distributed observations with known covariance matrix
Σ=
1 0.7 0.7 1
.
(7.22)
Assume that we want to test hypothesis (7.3) at the α = 0.002 significance level. Since the covariance matrix is known, in lieu of Hotelling’s (7.8), we can use the χ2 test with statistic ¯ 25¯ x Σ−1 x distributed under H0 as χ2 with p (in our case 2) degrees of freedom. Accordingly, we get an elliptical acceptance set with boundary
¯ = 12.429. 25¯ x Σ−1 x
(7.23)
On the other hand, the two one dimensional tests (each at the 0.001 level) lead to the square acceptance set with vertices (±0.658, ±0.658), since 3.29/5 = 0.658.
© 2002 by Chapman & Hall/CRC
likelihood ratio tests for location
301
Out of Control by1-d testing
Figure 7.9. Acceptance Sets for the T² and Simultaneous One Dimensional Tests. [The elliptical acceptance region of the two dimensional test and the square acceptance region of the two one dimensional tests; the regions beyond ±0.658 in either coordinate are labeled "Out of Control by 1-d testing."]

Both sets are given in Figure 7.9. It is clear that due to a rather strong positive dependence between the two variates of an observation (correlation is equal to 0.7) the two dimensional test will ring an alarm if one variate is large while the other is small (see observation (0.4, −0.4) in Figure 7.9). At the same time, each of the variates may well stay within the one dimensional limits, since the one dimensional tests are by definition incapable of taking into account the association between the variates comprising an observation. The two dimensional test and the corresponding battery of one dimensional tests become close to one another when the variates involved are independent and both have the same variance (or both are standardized), since then an ellipse (ellipsoid for p > 2) becomes a circle (sphere or hypersphere for p > 2). Remark: In our exposition of multivariate tests, and hence of control charts, which are all based on the normality assumption for observations, we skip tests and charts for process variability or dispersion. Their use is rather limited and they have rather apparent drawbacks. For their discussion see, e.g., Alt [3] and [4] and Ryan [17]. Let us mention, however, that charts for future observations either draw from known results for likelihood ratio tests for the dispersion matrix Σ or are based on tests for the determinant |S| of S, the sample dispersion matrix. Charts for
302
chapter 7. multivariate approaches
past observations are usually based on the determinant mentioned.
7.3
Compound and Projection Tests
Most SPC professionals analyze their time indexed data one dimension at a time. We propose here a compound test which begins by looking at all the one dimensional control charts. As an example, let us suppose p = 5. Then, if we use the customary α level of .002, we will declare that the null hypothesis appears to be false and that a lot appears to be out of √ control if x ¯j falls outside the interval (μ0,j ±3σj / n) for any j between 1 and 5. In the event that we wish to use an estimate of variance based on the lot variability rather than that of the record of previous “in control” lots, we will reject the null hypothesis if, for any j, x ¯j falls outside the √ interval (μ0,j ± t.998,n−1 sj / n). Here, we recall that μ0 is the vector of “in-control” means. Let us suppose that all five dimensional sample means have outputs stochastically independent of each other. Then, the actual α level of the five one dimensional tests is α = 1 − (1 − .002)5 = .01.
(7.24)
In the proposed test, we next construct all two dimensional T 2 tests of nominal α level .002. There are 10 such tests. Then, we construct all three dimensional T 2 tests. There are 10 such tests. This is followed by all four dimensional T 2 tests. There are 5 such tests. Finally, we construct the 1 five dimensional T 2 test. Altogether, we will construct 25 − 1 tests. If all the tests were independent, we would expect an actual α level of α = 1 − (1 − .002)31 = .06.
(7.25)
A false alarm level of 6 out of 100 lots would generally be rather high for practical quality control work. Fortunately, however, the tests are ¯2 test not independent. The failure of a lot on the two dimensional x ¯1 , x is more likely also to fail the x ¯1 test than if the two dimensional test had been passed. In Table 7.1 we show simulation based estimates (50,000 simulated tests per table entry) of the actual α levels of the proposed compound test for a nominal α level of .002 for each test performed. The second column represents the actual α level if one uses only the one dimensional tests, with stochastic independence between the dimensions.
© 2002 by Chapman & Hall/CRC
303
compound and projection tests
p 2 3 4 5
1-d Tests 0.00400 0.00599 0.00798 0.00996
n=5 0.00434 .00756 0.01126 0.01536
n = 10 0.00466 0.00722 0.01072 0.01582
Table 7.1 n = 15 n = 20 0.00494 0.00508 0.00720 0.00720 0.01098 0.01098 0.01552 0.01544
n = 50 0.00512 0.00704 0.01108 0.01706
n = 100 0.00528 0.00826 0.01136 0.01670
n = 200 0.00546 0.00802 0.01050 0.01728
We observe that the increase in Type I error by using all the T 2 tests as opposed to only the one dimensional tests is rather modest. For example, for the three dimensional case, using the two and three dimensional tests in addition to the one dimensional tests increases the actual α only around a third. And we know that the compound test always has greater power than using only the one dimensional tests. As a practical matter, the general default to the purely one dimensional tests may be due simply to the increased computational complexity associated with higher dimensional T 2 tests. But present generation hand held calculators easily admit of programming the more informative compound test. Indeed, the compound test is extremely useful from the practical point of view. It not only happens that a multidimensional test detects a true Pareto glitch while the one dimensional tests fail to do so, but also the opposite is possible. For example, let us consider two dimensional data that are strongly positively dependent and have correlation close to 1. If, for a few lots, rather large values of one variate are associated with rather small values of the other, the two dimensional test may ring an alarm, since the T 2 -like tests are sensitive to departures from the correlation structure implied by S. If, however, these values are not extremely large, one dimensional tests may fail to ring the alarm. On the other hand, if the two variates “vary together,” the two dimensional test may not ring an alarm when some of these values happen to be unusually large (cf. Figure 7.9). It is the one dimensional test which is more sensitive to this sort of a glitch. Using the compound test to advantage is a very good idea as long as the dimensionality of data is not too large. Clearly, with dimensionality increasing, explanatory power of the approach diminishes very quickly. It is here where statistical projection methods, most notably principal component analysis (PCA), should come in (for an excellent chapter-long introduction to PCA see Krzanowski [12] and for thorough book-long expositions see Jolliffe [10] and Jackson [8]). In order to briefly introduce the reader to PCA concepts, let us return to our illustrative example in Section 7.2 and assume that we are given observations from bivariate normal distribution with mean zero and covariance matrix (7.22). Ellipses like the one given by (7.23) and Figure 7.9 are called the contours of the distribution (note that the normal den-
© 2002 by Chapman & Hall/CRC
304
chapter 7. multivariate approaches
sity is constant on these ellipses). In general, for a p-variate normally distributed random vector x with mean μ and covariance matrix Σ, the contours assume the form of an ellipsoid (x − μ ) Σ−1 (x − μ ) = c2 ,
(7.26)
where c is a positive constant. Let ΓΛΓ be the spectral decomposition of Σ, i.e., Σ = ΓΛΓ , with Λ = diag(λ1 , λ2 , . . . , λp ) and Γ = [γ (1) γ (2) . . . γ (p) ], where γ (i) is the eigenvector corresponding to eigenvalue λi (see Appendix A for the definition of spectral decomposition). We shall assume for later reference that λ1 ≥ λ2 ≥ . . . ≥ λp > 0. The direction cosine vector of the i-th axis of ellipsoid (7.26) is γ (i) . The 1/2
length of the i-th axis is 2cλi (for the ellipse given by (7.23), λ1 = 0.068 and λ2 = 0.012). For i = 1, 2, . . . , p, let us consider now projections of the centered observation x on the ellipsoid’s i-th axis, yi = γ
(i) (x
− μ ).
(7.27)
One can show that Var(yi ) = λi ,
(7.28)
i = 1, 2, . . . , p, and that the yi ’s are independent. Taken together, the given p projections define a one-to-one transformation from x’s to y’s, where y = [y1 y2 . . . yp ] : y = Γ (x − μ ).
(7.29)
In the above analysis, the normality assumption was needed only to have ellipsoids (7.26) be the contours of the underlying distribution and to prove that the yi ’s are independent. In all other considerations, only the variance-covariance structure of the distribution was taken into account. In fact, for any continuous p-dimensional distribution, it can be shown that transformation (7.29), to be called the principal component transformation, which has components (7.27), the i-th of them to be called the i-th principal component of x, has the following properties: (i) no linear combination a x, where a is a vector of length 1, has variance larger than λ1 and, thus, the first principal component, y1 , is that linear combination of the x variables (with the other vector standardized to have length 1) which has the largest variance; (ii) the second principal component, y2 , whose variance is λ2 , is that linear combination of the x variables (with the other vector standardized) which has the second
© 2002 by Chapman & Hall/CRC
a robust estimate of “in control” location
305
greatest variance subject to the constraint that it is uncorrelated with the first principal component; (iii) the j-th principal component, yj , j ≤ p, whose variance is λj , is that linear combination of the x variables (with the other vector standardized) which has the j-th greatest variance subject to the constraint that it is uncorrelated with the first j − 1 principal components. We have noticed already that (7.29) is a one-to-one transformation. However, due to interdependencies between the x variables, it is usually the case that a few of the first λ’s account for a large portion of the sum i λi (put otherwise, usually a few λ’s prove much larger than the remaining majority of λ’s). Now, given that the yi ’a are uncorrelated, it is reasonable to consider the sum i λi a measure of the overall variability hidden in the data and, hence, to claim that, under the condition just mentioned, a few first principal components account for most of the overall variability. Thus, it is then justified to reduce the original pvariate problem to one described by a much smaller number of principal components, as providing a good approximation in few dimensions to the original problem in many more dimensions. It is this reduced problem which is eventually subjected to SPC analysis. It is another problem how to interpret out of control signals for principal components, and how to react to them. For a relatively early but thorough exposition of the problem, see Jackson [8]. However concise, an excellent and more recent survey has been given by MacGregor [14]. In that paper, a PCA approach particularly suited to dealing with batch processes has been also described (see also Nomikos and MacGregor [16]). Extensions of PCA approaches to the so-called dynamic biplots have been dealt with by Sparks et al. [18].
7.4
A Robust Estimate of “In Control” Location
We have seen in Chapter 3 how one might estimate location for one dimensional data in such a way that contaminants do not very much affect this estimate. The primary device used there was the median as an estimator for the center of the dominant “in control” distribution. For higher dimensional data one needs more sophisticated trimming procedures. The following “King of the Mountain” algorithm of Lawera and Thompson [13] appears to be promising:
© 2002 by Chapman & Hall/CRC
306
chapter 7. multivariate pproaches “King of the Mountain” Trimmed Mean Algorithm
1. Set the counter M equal to the historical proportion of bad lots times the number of lots. ¯ i }N . 2. For N lots compute the vector sample means of each lot {X i=1 ¯ 3. Compute the pooled mean of the means X. 4. Find the two sample means farthest apart in the cloud of lot means. ¯ 5. From these two sample means, discard the farthest from X. 6. Let M = M − 1 and N = N − 1. ¯ as 7. If the counter is still positive, go to 1, otherwise exit and print out X ¯ XT .
To examine the algorithm, we examine a mixture distribution of lot means γN (0, I) + (1 − γ)N (μ, I).
(7.30)
Here we assume equal slippage in each dimension, i.e., (μ) = (μ, μ, . . . , μ).
(7.31)
¯ with the customary Let us compare the trimmed mean procedure X T ¯ procedure of using the untrimmed mean X. In Tables 7.2 and 7.3, we show, for dimensions two, three, four and five, the average MSEs of the two estimators when γ = .70, for simulations of size 1,000.
μ 1 2 3 4 5 6 7 8 9 10
d=2 ¯ X T 0.40 0.21 0.07 0.06 0.05 0.06 0.06 0.06 0.06 0.06
© 2002 by Chapman & Hall/CRC
Table d=2 ¯ X 0.94 1.24 1.85 3.01 4.61 6.53 8.94 11.58 14.71 18.13
7.2.MSEs for d=3 d=3 ¯ ¯ X X T 0.43 1.17 0.17 1.62 0.09 2.67 0.09 4.48 0.09 6.88 0.08 9.85 0.09 13.41 0.08 17.46 0.08 22.03 0.08 27.24
50 Lots.γ = .7. d=4 d=4 d=5 ¯ ¯ ¯ X X X T T 0.46 1.42 0.54 0.17 2.00 0.18 0.12 3.49 0.15 0.12 5.93 0.14 0.11 9.16 0.15 0.12 13.19 0.14 0.11 17.93 0.14 0.11 23.27 0.14 0.11 29.40 0.15 0.11 36.07 0.15
d=5 ¯ X 1.72 2.37 4.33 7.41 11.52 16.50 22.33 28.99 36.67 45.26
a robust estimate of “in control” location
μ 1 2 3 4 5 6 7 8 9 10
d=2 ¯ X T 0.28 0.05 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03
Table 7.3. MSEs for d=2 d=3 d=3 ¯ ¯ ¯ X X X T 0.71 0.30 0.94 0.88 0.06 1.30 1.72 0.05 2.58 2.95 0.04 4.41 4.56 0.04 8.25 6.56 0.04 8.31 8.87 0.04 13.28 11.61 0.04 17.32 14.67 0.04 21.98 18.04 0.04 27.12
100 Lots. γ = .7. d=4 d=4 d=5 ¯ ¯ ¯ X X X T T 0.32 1.16 0.34 0.07 1.74 0.08 0.06 3.37 0.07 0.06 5.86 0.07 0.06 9.13 0.08 0.06 13.08 0.07 0.06 17.67 0.07 0.06 23.17 0.07 0.06 29.28 0.07 0.06 36.06 0.07
307
d=5 ¯ X 1.34 2.10 4.25 7.34 11.39 16.35 22.16 28.86 36.61 45.13
In Tables 7.4 and 7.5 we show the MSEs of the trimmed mean and the customary pooled sample mean for the case where γ = .95.
μ 1 2 3 4 5 6 7 8 9 10
Table 7.4. MSEs for 50 Lots. γ = .95. d=2 d=2 d=3 d=3 d=4 d=4 d=5 ¯ ¯ ¯ ¯ ¯ ¯ ¯ X X X X X X X T T T T 0.05 0.10 0.07 0.15 0.09 0.18 0.12 0.05 0.11 0.07 0.16 0.09 0.20 0.11 0.04 0.11 0.06 0.17 0.08 0.22 0.11 0.04 0.13 0.06 0.19 0.08 0.26 0.11 0.04 0.16 0.06 0.24 0.08 0.32 0.11 0.04 0.19 0.06 0.29 0.09 0.40 0.10 0.04 0.24 0.06 0.35 0.08 0.48 0.11 0.04 0.29 0.06 0.42 0.08 0.58 0.10 0.04 0.33 0.06 0.51 0.08 0.68 0.11 0.042 0.39 0.06 0.59 0.08 0.79 0.11
d=5 ¯ X 0.24 0.26 0.28 0.33 0.41 0.49 0.61 0.71 0.86 1.01
μ 1 2 3 4 5 6 7 8 9 10
Table 7.5. MSEs for 100 Lots. γ = .95. d=2 d=2 d=3 d=3 d=4 d=4 d=5 ¯ ¯ ¯ ¯ ¯ ¯ ¯ X X X X X X X T T T T 0.02 0.05 0.04 0.09 0.05 0.12 0.06 0.02 0.07 0.03 0.10 0.05 0.14 0.05 0.02 0.09 0.03 0.14 0.04 0.18 0.05 0.02 0.13 0.03 0.20 0.05 0.27 0.06 0.02 0.18 0.03 0.27 0.04 0.37 0.06 0.02 0.24 0.03 0.38 0.04 0.49 0.05 0.02 0.31 0.03 0.47 0.04 0.62 0.06 0.02 0.40 0.03 0.60 0.04 0.80 0.05 0.02 0.50 0.03 0.74 0.04 1.00 0.05 0.02 0.60 0.03 0.90 0.05 1.20 0.05
d=5 ¯ X 0.14 0.16 0.23 0.34 0.46 0.60 0.80 1.00 1.21 1.50
© 2002 by Chapman & Hall/CRC
308
chapter 7. multivariate approaches
If the level of contamination is substantial (e.g., if 1 − γ = .3), the use of a trimming procedure to find a base estimate of the center of the “in control” distribution contaminated by observations from other distributions may be strongly indicated. For more modest, but still significant levels of contamination (e.g., if 1 − γ=.05), then simply using ¯ may be satisfactory. We note that the trimming procedure considered X here is computer intensive and is not realistic to be performed on the usual hand held scientific calculator. However, it is easily computed on a personal computer or workstation. Since the standards for rejecting the null hypothesis that a lot is in control are generally done by offline analysis on a daily or weekly basis, we do not feel the increase in computational complexity should pose much of a logistical problem.
7.5
A Rank Test for Location Slippage
Let us consider next a test which might be used in detecting changes in location when the assumptions of normality are questionable. If we have a group of N lots, then our procedure for testing whether a new lot is “typical” or not is as follows Rank Test for Location Shift 1. Compute the distances of the sample means of each of the N lots from each other. 2. Compute the average, on a per lot basis, of these distances {Di }N i=1 . 3. Compute the distances of the mean of the new lot from each of the N lots. 4. Compute the average of these distances D0 . 5. Reject the hypothesis that the new lot is “typical” if D0 > M axN i=1 {Di }.
It has been confirmed by extensive simulations that the significance level of this and other tests considered in this and the next section is approximately equal to 1/(N + 1). Let us now compare its performance with that of the parametric likelihood ratio test when we have as the generator of the “in control” data a p-variate normal distribution with mean 0 and covariance matrix I, the identity. We will consider as alternatives “slipped” normal distributions, each with covariance matrix I but with a translated mean each of whose components is equal to the “slippage” μ. In Figure 7.10, using 20,000 simulations of 50 lots of size 5 per slippage
© 2002 by Chapman & Hall/CRC
a rank test for location slippage
309
value to obtain the base information, we compute the efficiency of the rank test to detect a shifted 51st lot relative to that of the likelihood ratio test in (7.5), i.e., the ratio of the power of the rank test to that of the χ2 (p) test (where p is the dimensionality of the data set). In other words, here, we assume that both the base data and the lots to be tested have identity covariance matrix and that this matrix is known. We note that the efficiency of the rank test here, in a situation favorable to the likelihood ratio test, is close to one, with generally improving performance as the dimensionality increases. Here, we have used the critical values from tables of the χ2 distribution. For such a situation, the χ2 is the likelihood ratio test, so in a sense this is a very favorable case for the parametric test. Next, in Figure 7.11, we note the relative efficiency when the data are drawn from a multivariate shifted t distribution with 3 degrees of freedom. We generate such a random variable in p dimensions, say, in the following manner. First, we generate a χ2 variable v with 3 degrees of freedom. Then we generate p independent univariate normal variates X = (X1 , X2 , . . . , Xp ) from a normal distribution with mean 0 and variance 1. If we wish to have a mean vector μ and covariance matrix Σ, we then find a linear transformation 1
Z = AX = Σ 2 X
(7.32)
so that Z has the desired covariance matrix. Then, Z +μ t= v/3
(7.33)
will have a shifted t distribution with 3 degrees of freedom. To find an appropriate linear transformation, we can employ spectral decomposition techniques (see Appendix A) or, less elegantly, simply use Z1 = a11 X1
(7.34)
Z2 = a21 X1 + a22 X2
(7.35)
...
(7.36)
...
Zp = ap1 X1 + ap2 X2 + . . . + app Xp . If Σ is given by
⎛ ⎜ ⎜ ⎝
Σ=⎜
© 2002 by Chapman & Hall/CRC
σ11 σ12 σ12 σ22 ... ... σ1p σ2p
. . . σ1p . . . σ2p ... ... . . . σpp
(7.37)
⎞ ⎟ ⎟ ⎟, ⎠
(7.38)
then, writing down the covariance for each Zj , we note that
σ_{11} = a_{11}²
σ_{12} = a_{11} a_{21}
. . .
σ_{1p} = a_{11} a_{p1}
σ_{22} = a_{21}² + a_{22}²
. . .
σ_{pp} = a_{p1}² + a_{p2}² + . . . + a_{pp}².    (7.39)
Figure 7.10. Simulated Efficiencies. [Efficiency versus slippage per dimension; curves for d = 2, 3, 4, 5.]

Thus, we can simply go down the list in obvious fashion, solving for the a_{ij}. To generate the χ² variate with 3 degrees of freedom, we simply generate three independent N(0, 1) variates, square them, and take their sum.

Figure 7.11. Simulated Efficiencies with t(3) Data. [Efficiency versus slippage per dimension; curves for d = 2, 3, 4, 5.]

In Figure 7.11, we compute the efficiency of the rank test compared with that of a "χ²" test where the critical values have been picked to have the same Type I error as that of the rank test. In other words, we have attempted to be unusually fair to the parametric test. The critical values for the noncentrality are much larger than those under the assumption that the data are normal. At first glance, it appears that the rank test has actually lost utility when compared to the χ² test as we move from normal to more taily data. However, we note that the rank test we have described above always has a Type I error equal to 1/(N + 1), regardless of the form of the underlying distribution. The critical level for the noncentrality to obtain 1/(N + 1) as the critical value using the Hotelling T² test for t(3) data is much greater than that when the data are normal. We note that, as was the case with normal data, the powers of the rank test and that of the parametric test are comparable. But note that we have actually given the parametric test the advantage of determining the critical value assuming the data are t(3). Had we rather used that which would normally be utilized, namely the critical value assuming the
data are normal, we would have obtained a test which very frequently declared the data to be out of control, when, in fact, it was in control. As a practical matter, perhaps the major advantage of the rank test is that we have a critical region determined naturally and independently of the functional form of the underlying density.
7.6
A Rank Test for Change in Scale and/or Location
Having observed some utility in a rather simple "ranking" procedure, we now suggest a somewhat more complex algorithm. Basically, in statistical process control, we are looking for a difference in the distribution of a new lot, anything out of the ordinary. That might seem to indicate a nonparametric density estimation based procedure. But the general ability to look at averages in statistical process control indicates that for most situations, the Central Limit Theorem enables us to use procedures which point to distributions somewhat close to the normal distribution as the standards. In the case where data are truly normal, the functional form of the underlying density can be based exclusively on the mean vector and the covariance matrix. Let us now suppose that we have a base sample of N lots, each of size n, with the dimensionality of the data being given by p. For each of these lots, we compute the sample mean X̄_i and sample covariance matrix S_i. Next, we compute the average of these N sample means, say, X̄, and the average of the sample covariance matrices S̄. Then, we use the transformation

Z = S̄^{-1/2}(X − X̄)    (7.40)

which transforms X̄ into a variate with approximate mean 0 and approximate covariance matrix I. Next, we apply this transformation to each of the N lot means in the base sample. For each of the transformed lots, we compute the transformed mean and covariance matrix, Z̄_i and S_{Z_i}, respectively. For each of these, we apply, respectively, the location norm

||Z̄_i||² = Σ_{j=1}^{p} Z̄_{i,j}²,    (7.41)

and scale norm

||S_i||² = Σ_{j=1}^{p} Σ_{l=1}^{p} S_{i,j,l}².    (7.42)
If a new lot has location norm higher than any of those in the base sample, we flag it as untypical. If its scale norm is greater than those of any lot in the base sample, we flag it as untypical. The Type I error of either test is given by 1/(N + 1); that of the combined test is given very closely by

1 − [1 − 1/(N + 1)]² = (2N + 1)/(N + 1)².    (7.43)
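A minimal sketch of the combined test, assuming Python with NumPy and SciPy (not part of the text), is given below; the matrix square root is simply an implementation convenience for S̄^{-1/2}.

```python
# A minimal sketch of the combined location/scale rank test of (7.40)-(7.42).
import numpy as np
from scipy.linalg import sqrtm

def scale_location_rank_test(base_lots, new_lot):
    base = [np.asarray(L, float) for L in base_lots]
    X_bar = np.mean([L.mean(axis=0) for L in base], axis=0)
    S_bar = np.mean([np.cov(L, rowvar=False) for L in base], axis=0)
    T = np.real(np.linalg.inv(sqrtm(S_bar)))             # S-bar^{-1/2}

    def norms(lot):
        Z = (lot - X_bar) @ T.T                           # Z = S-bar^{-1/2}(X - X-bar), row-wise
        return np.sum(Z.mean(axis=0)**2), np.sum(np.cov(Z, rowvar=False)**2)

    base_norms = np.array([norms(L) for L in base])       # location and scale norms of base lots
    loc0, sc0 = norms(np.asarray(new_lot, float))
    # flag as untypical if either norm exceeds every base-lot value
    return loc0 > base_norms[:, 0].max(), sc0 > base_norms[:, 1].max()
```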
In Figure 7.12, we apply the location test only for the data simulation in Figure 7.10. We note that the performance compares favorably to the parametric test, even for normal data.
Figure 7.12. Simulated Efficiencies for Second Location Test. [Efficiency versus slippage per dimension; curves for several dimensionalities d.]
Next, we consider applying the second rank test for location only to the t(3) data of Figure 7.11.
Figure 7.13. Simulated Efficiencies for t(3) Test. [Efficiency versus slippage per dimension; curves for several dimensionalities d.]

Once again, the rank test performs well when its power is compared to that of the parametric test even though we have computed the critical value for the parametric test assuming the data are known to be t(3), an assumption very unlikely to be valid in the real world. Up to this point, we have been assuming that both the base lots and the new lots were known to have identity covariance matrices. In such a case, the appropriate parametric test is χ² if the data are normal, and, if the data are not, we have employed simulation techniques to find appropriate critical values for the distribution in question. Now, however, we shift to the situation where we believe that the covariance matrices of the new lots to be sampled may not be diagonal. In such a situation, the appropriate test is the Hotelling T² test with degrees of freedom equal to (p, n − p), i.e.,

T² = [p(n − 1)/(n − p)] F_{p,n−p}(α).
We have been assuming that the base lots (each of size 5) are drawn from N(0, I). The sampled (bad) lot is drawn from N(μ, Σ) where

μ = (μ, μ, . . . , μ)′    (7.44)

and

Σ = [ 1   .8  .8  . . .  .8
      .8  1   .8  . . .  .8
      .8  .8  1   . . .  .8
      . . .
      .8  .8  .8  . . .  1 ].    (7.45)
Thus, we are considering the case where the lot comes from a multivariate normal distribution with equal slippage in each dimension and a covariance matrix which has unity marginal variances and covariances .8. In Figure 7.14, we note the relative power of the "location" rank test when compared to that of the Hotelling T² procedure. The very favorable performance of the rank test is largely due to the fact that it picks up not only changes in location but also departures in the covariance matrix of the new lot from that of the base lots.
Figure 7.14. Simulated Efficiencies for Correlated Data. [Efficiency versus slippage per dimension; curves for d = 2, 3, 4.]

The basic setting of statistical process control lends itself very naturally to the utilization of normal distribution theory, since computation of lot averages is so customary. But, as we have seen, for modest lot sizes it is possible to run into difficulty if the underlying distributions have heavy tails. An attractive robust alternative to rank testing is that
of “continuous resampling” based tests via such procedures as SIMDAT [19], [20].
References [1] Alam, K. and Thompson, J.(1973).“On selecting the least probable multinomial event,” Annals of Mathematical Statistics, 43, pp. 19811990. [2] Alam, K. and Thompson, J.(1973).“A problem of ranking and estimation with Poisson processes,” Technometrics, 15, pp. 801-808. [3] Alt, F.B. (1985). “Multivariate quality control,” in Encyclopedia of Statistical Sciences, (vol. 6), eds. S. Kotz and N. Johnson. New York: John Wiley & Sons. [4] Alt, F.B. (1986). “SPC of dispersion for multivariate data,” ASQC Annual Quality Congress Transactions, pp. 248-254. [5] Alt, F.B., Goode, J.J., and Wadsworth, H.M. (1976). Ann. Tech. Conf. Trans., ASQC, pp. 754-759. [6] Andrews, D.F., Bickel, P.J., Hampel, F.R., Huber, P.J., Rogers, W.H., and Tukey, J.W. (1972). Robust Estimates of Location. Princeton: Princeton University Press. [7] Bridges, E., Ensor, K.B., and Thompson, J.R. (1992). “Marketplace competition in the personal computer industry,” Decision Sciences, pp. 467-477. [8] Jackson, J.E. (1991). A User’s Guide to Principal Components. New York: John Wiley & Sons. [9] Johnson, R.A. and Wichern, D.W. (1988). Applied Multivariate Statistical Analysis. Englewood Cliffs: Prentice Hall. [10] Jolliffe, I.T. (1986). Principal Component Analysis. New York: Springer-Verlag. [11] Kendall, M.G. and Stuart, A.(1958). The Advanced Theory of Statistics, II. New York: Hafner. [12] Krzanowski, W.J. (1988). Principles of Multivariate Analysis. Oxford: Clarendon Press.
© 2002 by Chapman & Hall/CRC
problems
317
[13] Lawera, M. and Thompson, J.R. (1993). “Multivariate strategies in statistical process control,” Proceedings of the Thirty-Eighth Conference on the Design of Experiments in Army Research Development and Testing, pp. 99-126 [14] MacGregor, J.F. (1997). “Using on-line process data to improve quality,” Int. Statist. Rev., 65, pp.309-323. [15] Mosteller, F. and Tukey, J.W. (1977). Data Analysis and Regression. Reading: Addison-Wesley. [16] Nomikos, P. and MacGregor, J.F. (1995). “Multivariate SPC charts for monitoring batch processes,” Technometrics, 37, pp.41-59. [17] Ryan, T.P. (1989). Statistical Methods for Quality Improvement. New York: John Wiley & Sons. [18] Sparks, R., Adolphson, A. and Phatak, A. (1997). “Multivariate process monitoring using the dynamic biplot.” Int. Statist. Rev., 65, pp. 325-349. [19] Taylor, M.S. and Thompson, J.R. (1986). “A data based algorithm for the generation of random vectors,” Computational Statistics and Data Analysis, 4, pp. 93-101. [20] Taylor, M.S. and Thompson, J.R. (1992). “A nonparametric based resampling algorithm,” Exploring the Limits of the Bootstrap, Billard, L. and LePage, R., eds., pp. 397-403. New York: John Wiley & Sons. [21] Wierda, S.J. (1994). “Multivariate statistical process control – recent results and directions for future research,” Statistica Neerlandica, 48, pp. 147-168.
Problems Problem 7.1. In the following table, the lot of 20 bivariate measurement data is given. Provided that the process is in control, it can be assumed that the lot comes from a normal distribution with mean μ 0 = (90.0, 85.1) . a. Find the p-value of the Hotelling’s T 2 test of the hypothesis that the process’ mean is equal to μ 0 . b. Find the p-value of the t test (see (B.172) in the Appendix) of the hypothesis that the mean of the first variate, X1 , is 90.
© 2002 by Chapman & Hall/CRC
318
chapter 7. multivariate approaches
c. Find the p-value of the t test of the hypothesis that the mean of the second variate, X2 , is 85.1. ¯ charts to verify whether the two univariate sets of data, d. Use the X X1 and X2 , can be considered (separately) to be in control.
e. Try the Page CUSUM charts to verify whether the two univariate sets of data, X1 and X2 , can be considered (separately) to be in control.
f. Comment on the results.
Lot 1 2 3 4 5 6 7 8 9 10
x1 89.987 90.016 89.962 89.976 90.137 89.993 89.937 89.831 90.239 90.112
x2 85.150 85.073 85.179 85.122 85.229 85.281 85.147 84.930 85.419 85.098
Lot 11 12 13 14 15 16 17 18 19 20
x1 90.018 89.910 89.853 89.896 89.999 89.900 90.120 90.044 90.067 89.809
x2 85.081 85.057 85.113 85.212 85.105 84.992 85.320 85.092 85.144 85.054
Problem 7.2. In the following table, 20 lots of size 4 of bivariate measurement data are given. Provided that the process is in control, it can be assumed that the lots come from a bivariate normal distribution.
© 2002 by Chapman & Hall/CRC
319
problems Lot 1
2
3
4
5
6
7
x1 10.133 10.012 9.774 10.012 9.915 9.790 10.293 10.010 10.023 9.854 9.886 10.028 9.965 9.978 10.118 9.943 10.090 9.953 9.966 10.081 9.908 10.006 10.108 9.930 10.081 10.088 9.942 10.179
x2 10.151 10.106 9.862 10.017 10.065 10.069 10.343 10.087 10.092 9.932 10.087 10.070 10.128 10.128 10.101 10.137 10.140 10.108 10.037 10.101 10.093 9.952 10.116 9.879 10.071 10.157 10.066 10.201
Lot 8
9
10
11
12
13
14
x1 10.047 10.099 9.937 9.877 9.920 9.822 9.872 9.965 9.848 10.100 9.902 9.905 10.251 10.077 9.896 9.908 10.005 10.225 9.932 9.972 10.038 9.947 9.957 9.989 9.955 10.066 10.068 10.039
x2 10.203 10.074 10.127 9.945 9.977 10.053 10.073 10.107 10.014 10.118 10.141 10.116 10.237 10.173 9.984 10.155 10.011 10.269 9.996 10.048 10.059 10.080 10.057 10.161 10.100 10.099 10.127 10.027
Lot 15
16
17
18
19
20
x1 10.092 9.815 10.033 9.918 10.032 10.059 10.055 10.212 10.009 9.978 10.063 10.116 9.991 10.024 10.253 10.073 9.863 9.978 10.095 10.110 9.824 9.854 10.207 10.170
x2 10.168 9.929 10.162 10.080 10.107 10.115 10.179 10.273 9.984 10.149 10.091 10.235 10.025 9.808 9.762 9.821 9.975 10.069 10.196 10.199 10.044 9.950 10.282 10.226
a. Use the suitable version of the Hotelling’s T 2 test to verify the hypothesis that the process’ mean is μ 0 = (10.0, 10.1) . ¯ charts to verify whether the two univariate sets of data, b. Use the X X1 and X2 , can be considered (separately) to be in control. Problem 7.3. It is conjectured that a simplified heat treatment of a metal casting, suggested by a plant’s foundry, may lead to a deterioration of the casting’s quality. One of the cross-sections of the castings should be circular with diameter 285 mm. A trial sample of 20 castings has been made. A test consists in measuring two perpendicular diameters of the cross-section of each of the 20 castings. The data set obtained is given in the following table. Lot 1 2 3 4 5 6 7 8 9 10
x1 284.963 285.057 285.020 284.979 284.936 284.939 284.925 285.023 285.004 284.994
© 2002 by Chapman & Hall/CRC
x2 285.000 285.041 284.979 285.014 284.991 284.948 284.788 284.997 284.992 285.052
Lot 11 12 13 14 15 16 17 18 19 20
x1 285.022 285.058 285.028 285.004 285.011 284.958 285.028 284.992 285.057 284.918
x2 285.049 285.041 285.037 284.938 284.858 284.644 285.039 284.939 284.946 284.972
320
chapter 7. multivariate approaches
Perform the compound test to verify whether the process is in control. Comment on the results. Problem 7.4. Consider Problem 7.2. Delete the out-of-control lots ¯ and S. Verify whether the five new observed (if any), and recompute x lots, given in the following table, are in control. Lot 1
2
3
4
5
x1 9.874 9.987 9.921 9.969 10.239 9.981 10.088 9.943 9.999 9.847 9.948 10.222 10.060 10.107 9.968 10.037 9.912 10.056 9.961 9.861
x2 10.035 9.953 9.995 10.023 10.435 10.091 10.224 10.049 10.251 10.080 10.149 10.285 10.149 10.117 9.994 10.091 10.138 10.211 10.102 10.066
Problem 7.5. The following are 20 measurements of four dimensional data. Provided that the process is in control, it can be assumed that the data set comes from a normal distribution with mean μ 0 = (0, .1, .5, .3) . Perform the compound test to see whether the process is in control. Lot 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
© 2002 by Chapman & Hall/CRC
x1 0.057 0.094 0.032 0.057 0.078 0.107 0.024 -0.007 0.135 0.052 0.102 0.183 0.006 0.115 0.054 -0.070 0.065 0.082 -0.085 -0.004
x2 0.182 0.258 0.125 0.037 0.245 0.173 0.109 0.131 0.193 0.135 0.141 -0.187 0.235 0.220 0.125 0.201 0.214 0.234 0.000 0.087
x3 0.363 0.621 0.625 0.593 0.514 0.426 0.464 0.637 0.616 0.425 0.612 0.478 0.322 0.501 0.534 0.528 0.508 0.480 0.480 0.505
x4 0.331 0.176 0.197 0.263 0.237 0.406 0.338 0.253 0.306 0.471 0.024 0.385 -0.008 0.334 0.308 0.246 0.318 0.493 0.270 0.240
Appendix A
A Brief Introduction to Linear Algebra A.1
Introduction
Frequently, the subject of interest cannot be adequately described by a single number but, rather, by a set of numbers. For example, the standard operational controls of a chemical process may involve not only the temperature, but the pressure as well. In each such case, suitable measurements yield a multivariate characteristic of the subject of interest. Convenient analytical tools for dealing with multivariate data are vectors and matrices. In this appendix, only some basic concepts from vector and matrix algebra are briefly introduced or recalled, namely those which are of particular interest from the point of view of statistical multivariate analysis. An n-tuple, or an array, x of n real numbers (elements) x1 , x2 , . . ., xn arranged in a column, is called a column vector and it is written as ⎡ ⎢ ⎢ x=⎢ ⎢ ⎣
x1 x2 .. .
⎤ ⎥ ⎥ ⎥. ⎥ ⎦
(A.1)
xn Analogously, an n-tuple of real numbers x1 , x2 , . . . , xn arranged in a row is called a row vector and it is written as x = [x1 , x2 , . . . , xn ] , 321
© 2002 by Chapman & Hall/CRC
(A.2)
322
appendix a. a brief introduction to linear algebra
where the prime denotes the operation of transposing a column into a row. The reason that we define the row vector as the transposition of the column vector is that usually in the literature vectors are written as column vectors. Accordingly, whenever we write “vector” we mean a column vector. In the sequel, vectors are denoted by boldface lower case letters. Elements xi , i = 1, . . . , n, of a vector x are often called components of x. A scalar is a vector consisting of just one element, i.e., it is a single real number. A more general concept is that of a matrix . An n × p matrix A is a rectangular array of real numbers (elements) arranged in n rows and p columns, and it is written as ⎡
⎢ ⎢ A(n×p) = ⎢ ⎢ ⎣
a11 a21 .. .
a12 a22 .. .
. . . a1p . . . a2p .. .
⎤
⎥ ⎥ ⎥ = (aij ) ; ⎥ i = 1, . . . , n ⎦
(A.3)
j = 1, . . . , p
an1 an2 . . . anp
aij , referred to as the (i, j)th element of the matrix, denotes the element in the ith row and jth column of the matrix A, i = 1, . . . , n , j = 1, . . . , p. Sometimes, we write (A)ij for aij . If it does not lead to ambiguity, we write A for A(n×p) and (aij )
for
(aij )
i = 1, . . . , n j = 1, . . . , p
.
(A.4)
In what follows, we denote matrices by boldface upper case letters. A matrix, all of whose elements are zeros, is called a zero matrix; otherwise, a matrix is nonzero. If A is an n × p matrix, we say that it is of order (or dimension) n × p. Clearly, vector x is a matrix with n rows and one column, i.e., it is a matrix of order n × 1; for short, n-element vectors are said to be of dimension n. Correspondingly, row vector x is a matrix of order 1 × n. The transpose of an n × p matrix A = (aij ), denoted by A , is the p × n matrix obtained from A by interchanging the rows and columns; i.e., A is the matrix with elements aji , j = 1, . . . , p , i = 1, . . . , n , ⎡
⎢ ⎢ A =⎢ ⎢ ⎣
© 2002 by Chapman & Hall/CRC
a11 a21 . . . an1 a12 a22 . . . an2 .. .. .. . . . a1p a2p . . . anp
⎤
⎥ ⎥ ⎥ = (aji ). ⎥ ⎦
(A.5)
323
elementary arithmetic Example: For A of order 3 × 4 , ⎡
⎤
1 2 3 4 ⎢ ⎥ 2 −1 −7 ⎦ , A=⎣ 0 11 −5 4 1 we have
⎡ ⎢ ⎢ ⎣
A = ⎢
(A.6)
⎤
1 0 11 2 2 −5 ⎥ ⎥ ⎥. 3 −1 4 ⎦ 4 −7 1
(A.7)
Clearly, for all matrices A, we have (A ) = A.
(A.8)
Note that our definition of the row vector is consistent with the general definition of the transpose of a matrix.

A matrix is square if it has the same number of rows and columns. For short, square matrices of order n × n are often said to be of order n. A square matrix A is symmetric if A = A', i.e., if aij = aji for all i and j. It is diagonal if aij = 0 for all i ≠ j, i.e., if it can have nonzero elements on the main diagonal only, where the main diagonal of an n × n matrix A consists of elements a11, a22, . . ., ann. A diagonal matrix A of order n is often written as diag(a1, a2, . . ., an). A diagonal matrix with all elements of the main diagonal equal to one is called the identity matrix, and is denoted by I or In when its order n × n is to be given explicitly,

\[
\mathbf{I} = \begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 1
\end{bmatrix}.
\tag{A.9}
\]
All multivariate data dealt with in this book can be represented by vectors. However, performing suitable transformations of the data requires some matrix algebra.
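Since, as this appendix repeatedly notes, the actual computations are in practice left to a computer, the basic objects defined above can be illustrated with a minimal Python sketch. We assume the NumPy library here and in the sketches that follow; this is our own choice of tool, not something prescribed by the text.

    import numpy as np

    x = np.array([[1.0], [2.0], [3.0]])      # a 3 x 1 column vector as in (A.1)
    print(x.T)                               # the corresponding row vector x' of (A.2)

    A = np.array([[1, 2, 3, 4],
                  [0, 2, -1, -7],
                  [11, -5, 4, 1]])           # the 3 x 4 matrix of (A.6)
    print(A.T)                               # its transpose, as in (A.7)
    print(np.array_equal((A.T).T, A))        # (A')' = A, property (A.8)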
A.2
Elementary Arithmetic
Two matrices A = (aij) and B = (bij) of the same order n × p are said to be equal if and only if aij = bij, i = 1, 2, . . ., n, j = 1, 2, . . ., p. Matrices of the same order can be added. The sum of two matrices A = (aij) and B = (bij) of the same order n × p is the n × p matrix C = (cij), where

\[
c_{ij} = a_{ij} + b_{ij}, \quad i = 1, 2, \ldots, n, \; j = 1, 2, \ldots, p;
\tag{A.10}
\]

we then write C = A + B.

Example:

\[
\begin{bmatrix} 2 & 3 \\ 1 & 2 \\ 3 & 4 \end{bmatrix}
+
\begin{bmatrix} -7 & 8 \\ 2 & 1 \\ -1 & 3 \end{bmatrix}
=
\begin{bmatrix} -5 & 11 \\ 3 & 3 \\ 2 & 7 \end{bmatrix}.
\tag{A.11}
\]
Clearly, for all matrices A, B and C of the same order, we have
(i) A + B = B + A
(ii) A + (B + C) = (A + B) + C
(iii) (A + B)' = A' + B'.

Matrices can be multiplied by a constant. The scalar multiple cA of the matrix A is obtained by multiplying each element of A by the scalar c. Thus, for A = (aij) of order n × p,

\[
c\mathbf{A} = \begin{bmatrix}
ca_{11} & ca_{12} & \cdots & ca_{1p} \\
ca_{21} & ca_{22} & \cdots & ca_{2p} \\
\vdots & \vdots & & \vdots \\
ca_{n1} & ca_{n2} & \cdots & ca_{np}
\end{bmatrix}.
\tag{A.12}
\]

Combining the definitions of matrix addition and scalar multiplication enables us to define matrix subtraction. The difference between two matrices A = (aij) and B = (bij) of the same order n × p is the n × p matrix C = (cij), given by

\[
\mathbf{C} = \mathbf{A} - \mathbf{B} = \mathbf{A} + (-1)\mathbf{B};
\tag{A.13}
\]

i.e., cij = aij + (−1)bij = aij − bij, i = 1, 2, . . ., n, j = 1, 2, . . ., p.

If the number of columns of a matrix A is equal to the number of rows of a matrix B, one can define the product AB of A and B. The product AB of an n × p matrix A and a p × m matrix B is the n × m matrix C whose (i, j)th element is given by

\[
c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}, \quad i = 1, 2, \ldots, n, \; j = 1, 2, \ldots, m.
\tag{A.14}
\]
Example:

\[
\begin{bmatrix} 2 & 1 & 2 \\ 3 & 4 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ 0 & -1 \\ 2 & 4 \end{bmatrix}
=
\begin{bmatrix} 6 & 7 \\ 5 & 0 \end{bmatrix}.
\tag{A.15}
\]
It is not hard to check that, for all matrices A, B and C of orders conforming in a suitable sense,
(i) A(BC) = (AB)C
(ii) A(B + C) = AB + AC
(iii) (B + C)A = BA + CA
(iv) (AB)' = B'A'.

It should be noted that the order of multiplication is important. In order to see this, consider an n × p matrix A and a p × m matrix B. Now, if n is not equal to m, the product AB is well-defined while the product BA does not exist. Further, if n = m but n and m are not equal to p, both products are well-defined but the matrices AB and BA are of different orders (n × n and p × p, respectively). Finally, even if A and B are square matrices, n = m = p, the matrices AB and BA are of the same order but, usually, are different.

Example:

\[
\begin{bmatrix} 2 & 1 \\ 3 & -4 \end{bmatrix}
\begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix}
=
\begin{bmatrix} 2 & 5 \\ 3 & 2 \end{bmatrix}
\tag{A.16}
\]

\[
\begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 2 & 1 \\ 3 & -4 \end{bmatrix}
=
\begin{bmatrix} 8 & -7 \\ 3 & -4 \end{bmatrix}.
\tag{A.17}
\]

Let us observe in passing that, for any square matrix A of order n and the identity matrix In, AIn = InA = A; hence the name of the matrix In. The concept of a matrix can, of course, be viewed as a generalization of that of a real number. In particular, a scalar is a 1 × 1 matrix. Similarly, matrix algebra can be seen as a generalization of the algebra of real numbers. However, as the notion of the matrix product shows, there are important differences between the two algebras. One such difference has already been discussed. Another is that the product of two nonzero matrices need not be a nonzero matrix, whereas for all scalars c and d, whenever cd = 0, then either c = 0 or d = 0.

Example:

\[
\begin{bmatrix} 2 & 3 & -1 \\ 4 & 6 & 2 \\ -6 & -9 & 7 \end{bmatrix}
\begin{bmatrix} 6 \\ -4 \\ 0 \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}.
\tag{A.18}
\]
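Both facts, that AB and BA generally differ and that a product of nonzero matrices can be the zero matrix, are easy to verify numerically. The following Python/NumPy sketch (our own illustration) reproduces the examples above.

    import numpy as np

    P = np.array([[1, 2], [0, 1]])
    Q = np.array([[2, 1], [3, -4]])
    print(Q @ P)        # the product in (A.16)
    print(P @ Q)        # the product in (A.17); clearly QP and PQ differ

    M = np.array([[2, 3, -1], [4, 6, 2], [-6, -9, 7]])
    v = np.array([[6], [-4], [0]])
    print(M @ v)        # nonzero matrix times nonzero vector = zero vector, as in (A.18)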
Let us now introduce the concept of the trace of a square matrix. The trace of an n × n matrix A, denoted by tr(A), is defined as the sum of the diagonal elements of A, tr(A) = a11 + a22 + · · · + ann. It can be shown that, for all matrices A, B, C, D of conforming orders and a scalar γ,
(i) tr(A ± B) = tr(A) ± tr(B)
(ii) tr(γA) = γ tr(A)
(iii) tr(CD) = tr(DC).

Since vectors are special cases of matrices, there is no need to consider separately the sum, scalar multiple and the difference of vectors. It is, however, useful to introduce two types of products for suitably transposed vectors. Let x and y be two vectors of the same dimension. The inner product (or dot product or scalar product) of x and y is defined as x'y. That is, for n-dimensional vectors

\[
\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
\quad\text{and}\quad
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix},
\tag{A.19}
\]

the inner product x'y is given by

\[
x_1 y_1 + x_2 y_2 + \ldots + x_n y_n.
\tag{A.20}
\]

Note that the inner product is always a scalar and that x'y = y'x. The square root of the inner product of x with itself,

\[
\sqrt{\mathbf{x}'\mathbf{x}} = \sqrt{x_1^2 + x_2^2 + \ldots + x_n^2},
\tag{A.21}
\]

is called the length of the vector x. The outer product of x and y is defined as xy'. That is, for the given n-dimensional vectors x and y, the outer product is an n × n matrix

\[
\begin{bmatrix}
x_1 y_1 & x_1 y_2 & \cdots & x_1 y_n \\
x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\
\vdots & \vdots & & \vdots \\
x_n y_1 & x_n y_2 & \cdots & x_n y_n
\end{bmatrix}.
\tag{A.22}
\]

Statistical analysis of multivariate data, in particular when the data are normally distributed, relies heavily on the concepts and calculations of the inverse and the square root of a matrix. The concepts themselves
327
linear independence of vectors
are straightforward but effective calculations of inverses and square roots of matrices require some additional considerations. To be honest, one should add that calculating the inverse and square root can be done on a computer using standard programs, and so the reader who is not interested in algebraic details can come directly to the sections on inverses and square roots, and refer to the skipped sections only to find a few necessary definitions.
A.3
Linear Independence of Vectors
Let x1 , x2 , . . . , xk be k vectors of the same dimension. The vector y = a1 x1 + a2 x2 + . . . + ak xk ,
(A.23)
where a1 , a2 , . . . , ak are any fixed constants, is said to be a linear combination of the vectors x1 , x2 , . . . , xk . A set of vectors x1 , x2 , . . . , xk is said to be linearly independent if it is impossible to write any one of them as a linear combination of the remaining vectors. Otherwise, vectors x1 , x2 , . . . , xk are said to be linearly dependent. It is worthwhile to mention that linear independence may be equivalently defined by first defining linear dependence of vectors. Namely, vectors x1 , x2 , . . . , xk can be said to be linearly dependent if there exist k numbers a1 , a2 , . . . , ak , not all zero, such that a1 x1 + a2 x2 + . . . + ak xk = 0 ,
(A.24)
where 0 denotes the zero vector. Now, if it is not the case, the set of vectors is linearly independent. A particular case of linearly independent vectors is that of the orthogonal vectors. Vectors x and y are said to be orthogonal (or perpendicular ) if x y = 0. It can indeed be shown that a set of mutually orthogonal vectors is the set of linearly independent vectors (but not necessarily conversely!). The concept of linear independence of vectors enables one to define the rank of a matrix. The row rank of a matrix is the maximum number of linearly independent rows of the matrix (the rows of the matrix are considered here as row vectors). Analogously, the column rank of a matrix is the maximum number of linearly independent columns of the matrix (the columns are considered here as the column vectors). It can
© 2002 by Chapman & Hall/CRC
328
appendix a. a brief introduction to linear algebra
be proved that the row rank and the column rank of a matrix are always equal. Thus, we can define either of them to be the rank of a matrix. Let A be a matrix of order n × p and denote the rank of A by r(A). It follows from the definition that r(A) is a nonnegative number not larger than min{n, p} and that r(A) = r(A ). If n = p and r(A) = n, A is said to be of full rank . It can be shown that, for all matrices A, B and C of conforming orders, (i) r(A + B) ≤ r(A) + r(B) (ii) r(AB) ≤ min{r(A) , r(B)} (iii) r(A A) = r(AA ) = r(A) (iv) r(BAC) = r(A) if B and C are of full rank. The concept of orthogonality can be extended to square matrices. A square matrix A is said to be orthogonal if AA = I. In order to see how this last concept is related to the orthogonality of vectors, assume that A is of order n × n and denote the ith row of A by ai , i = 1, 2, . . . , n; hence, ai denotes the ith column of A . Now, the condition AA = I can be written as ⎡ ⎢ ⎢ ⎢ ⎢ ⎣
a1 a2 .. .
an
⎤ ⎥ ⎥ ⎥ a1 ⎥ ⎦
⎡
a2 . . . an
⎢ ⎢
=⎢ ⎢ ⎣
a1 a1 a2 a1 .. .
a1 a2 a2 a2 .. .
. . . a1 an . . . a2 an .. .
an a1 an a2 . . . an an
⎤ ⎥ ⎥ ⎥=I ⎥ ⎦
(A.25) and it follows that the rows of an orthogonal matrix A are mutually orthogonal and have unit lengths (indeed, ai aj = 0 if i = j and ai ai = 1 for i = 1, 2, . . . , n). We shall see in Section A.5 that, provided A is orthogonal, AA = A A = I. Thus, the columns of A are also orthogonal and have unit lengths; to verify this, it suffices to write A as a(1) a(2) . . . a(n) , where a(i) denotes the ith column of A, and use this form of A to write the condition A A = I.
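The rank of a matrix and the defining property of an orthogonal matrix can also be checked on a computer. A minimal Python/NumPy sketch follows; the plane rotation used as an orthogonal example is our own choice, not one taken from the text.

    import numpy as np

    A = np.array([[1, 2, 3, 4],
                  [0, 2, -1, -7],
                  [11, -5, 4, 1]])
    print(np.linalg.matrix_rank(A))          # at most min(3, 4) = 3

    # A plane rotation is a standard example of an orthogonal matrix:
    # its rows (and columns) are mutually orthogonal and of unit length.
    theta = 0.3
    G = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    print(np.allclose(G @ G.T, np.eye(2)))   # G G' = I
    print(np.allclose(G.T @ G, np.eye(2)))   # G'G = I as well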
A.4
Determinants
Important Remark: Throughout the rest of this appendix we consider only square matrices.

The determinant, although seemingly neither intuitive nor operational, is one of the most useful concepts of matrix algebra. In order to avoid possible frustration of the reader, let us announce already here that one can calculate determinants in a simple way, without directly using their rather complicated definition. That simple way of evaluating determinants is given by (A.28)-(A.35) and the Sarrus diagram below. Accordingly, the reader may decide to skip definition (A.26) and come directly to the formulas mentioned. The reader should then consider formulas (A.28)-(A.35) to be the defining properties of the determinant (A.27).

The definition of the determinant has to be preceded by introducing the notion of the inversion of an ordered pair. Let α1, α2, . . ., αn be a given permutation (i.e., arrangement) of the numbers 1, 2, . . ., n, and consider all ordered pairs formed from this permutation: (α1, α2), (α1, α3), . . ., (α1, αn), (α2, α3), (α2, α4), . . ., (α2, αn), . . ., (αn−1, αn). The pair (αi, αk) is said to form an inversion if αi > αk and i < k. The determinant |A| of a square matrix A of order n is the sum

\[
\sum (-1)^{N} a_{1 j_1} a_{2 j_2} \cdots a_{n j_n},
\tag{A.26}
\]

where the summation is taken over all possible permutations j1, j2, . . ., jn of the numbers 1, 2, . . ., n, and N is the total number of inversions of the permutation j1, j2, . . ., jn; for instance, for A of order 4 and the particular permutation 4, 1, 3, 2, N = 4. It is often convenient to write determinants in a "more explicit" form:

\[
|\mathbf{A}| = \begin{vmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{vmatrix}.
\tag{A.27}
\]
Obviously, for A of order 1 × 1, a scalar A = a11 , we have |A| = a11 ,
(A.28)
while for A of order 2 × 2, we have |A| = a11 a22 − a12 a21 .
(A.29)
It is also easy, although a bit tedious, to check that for A of order 3 × 3 we have

\[
|\mathbf{A}| = a_{11}a_{22}a_{33} + a_{12}a_{23}a_{31} + a_{13}a_{21}a_{32}
- a_{13}a_{22}a_{31} - a_{12}a_{21}a_{33} - a_{11}a_{23}a_{32}.
\tag{A.30}
\]

In fact, for n = 3, there are exactly six permutations: 1,2,3; 2,3,1; 3,1,2; 1,3,2; 2,1,3; 3,2,1; and the numbers of their inversions are, respectively, 0, 2, 2, 1, 1, 3. An easy and almost automatic way of calculating determinants of matrices of order 3 is to use the following Sarrus diagram, in which the first two columns of A are copied to the right of the matrix:

\[
\begin{array}{ccc|cc}
a_{11} & a_{12} & a_{13} & a_{11} & a_{12} \\
a_{21} & a_{22} & a_{23} & a_{21} & a_{22} \\
a_{31} & a_{32} & a_{33} & a_{31} & a_{32}
\end{array}
\]

Namely, the determinant is obtained by adding the first two columns on the right-hand side of the matrix, summing the products of elements along the solid (NW-SE) diagonals and subtracting the products along the dashed (NE-SW) diagonals.

With the order of A increasing, computing the determinant becomes an apparently hopeless task. However, one can then use a different method of computation. In order to introduce it, one has to define first the minor and the cofactor of an element of a matrix. The minor of aij, the (i, j)th element of A, is the value of the determinant obtained after deleting the ith row and the jth column of A. The cofactor of aij, denoted by Aij, is equal to the product of (−1)^{i+j} and the minor of aij. For instance, for A of order 3,

\[
A_{11} = \begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix} = a_{22}a_{33} - a_{23}a_{32},
\tag{A.31}
\]
\[
A_{12} = -\begin{vmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{vmatrix} = a_{23}a_{31} - a_{21}a_{33},
\tag{A.32}
\]
\[
A_{13} = \begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix} = a_{21}a_{32} - a_{22}a_{31}.
\tag{A.33}
\]
One can prove that, for A of order n × n not less than 2 × 2, we have

\[
|\mathbf{A}| = \sum_{j=1}^{n} a_{ij} A_{ij}
\tag{A.34}
\]

for any i and

\[
|\mathbf{A}| = \sum_{i=1}^{n} a_{ij} A_{ij}
\tag{A.35}
\]

for any j. For instance, for A of order 3, we can write

\[
|\mathbf{A}| = a_{11}A_{11} + a_{12}A_{12} + a_{13}A_{13}.
\tag{A.36}
\]
Therefore, when dealing with matrices of higher orders, we can use either of the given formulas to reduce successively the orders of the cofactors which have to be computed. Whatever the order of the matrix whose determinant is to be calculated, this procedure reduces the problem to the computation of determinants of arbitrarily low order. And, as we mentioned already, we can let a computer do the whole job.

The following results hold for determinants of any matrices A and B of the same order:
(i) |A| = |A'|
(ii) |cA| = c^n |A|, where c is a scalar and A is of order n
(iii) |AB| = |A| |B|.

Also, if A is a diagonal matrix of order n,

\[
|\mathbf{A}| = a_{11} a_{22} \cdots a_{nn}.
\tag{A.37}
\]

Note that properties (i) and (iii) imply that |A| = ±1 for an orthogonal matrix A. A matrix is called singular if its determinant is zero; otherwise, it is non-singular. Note that a scalar, which is a matrix of order 1, is singular if and only if it is zero. One should be warned, however, that in general a matrix need not be the zero matrix to be singular.

Example:

\[
\begin{vmatrix} 1 & 3 & 2 \\ -2 & 4 & -1 \\ 2 & 6 & 4 \end{vmatrix} = 0.
\tag{A.38}
\]

A matrix is non-singular if and only if it is of full rank.
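As mentioned above, the whole job can be left to a computer. The Python/NumPy sketch below (our own) evaluates the determinant of the singular matrix in (A.38) both by the Sarrus rule and by a library routine.

    import numpy as np

    def sarrus(a):
        # Determinant of a 3 x 3 matrix by the Sarrus rule described above.
        return (a[0,0]*a[1,1]*a[2,2] + a[0,1]*a[1,2]*a[2,0] + a[0,2]*a[1,0]*a[2,1]
                - a[0,2]*a[1,1]*a[2,0] - a[0,1]*a[1,0]*a[2,2] - a[0,0]*a[1,2]*a[2,1])

    B = np.array([[1., 3., 2.],
                  [-2., 4., -1.],
                  [2., 6., 4.]])             # the matrix of (A.38)
    print(sarrus(B))                         # 0: the matrix is singular
    print(np.linalg.det(B))                  # numerically close to 0
    print(np.linalg.matrix_rank(B))          # 2 < 3, i.e., not of full rank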
A.5
Inverses
The inverse of A is the unique matrix A^{-1} such that

\[
\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}.
\tag{A.39}
\]
The inverse exists if and only if A is non-singular or, equivalently, of full rank. We shall show that an orthogonal matrix A has A' as its inverse. By definition, AA' = I. Denote the product A'A by C. Now, ACA' = AA'AA'. Hence, ACA' = AA' and A^{-1}ACA'(A')^{-1} = A^{-1}AA'(A')^{-1}, but this implies that C = I and the required result follows. Thus, in the case of orthogonal matrices computing inverses is easy. In general, the following result holds for non-singular A = (aij):

\[
\mathbf{A}^{-1} = \frac{1}{|\mathbf{A}|}\,(A_{ij})',
\tag{A.40}
\]

where the matrix (Aij)' is the transpose of (Aij), the matrix with the (i, j)th element Aij being the cofactor of aij.

Example: For

\[
\mathbf{A} = \begin{bmatrix} 3 & 2 & 1 \\ 0 & 1 & 2 \\ 2 & 0 & 1 \end{bmatrix},
\tag{A.41}
\]

we have |A| = 9,

\[
(A_{ij}) = \begin{bmatrix} 1 & 4 & -2 \\ -2 & 1 & 4 \\ 3 & -6 & 3 \end{bmatrix}
\tag{A.42}
\]

and, hence,

\[
\mathbf{A}^{-1} = \begin{bmatrix} 1/9 & -2/9 & 1/3 \\ 4/9 & 1/9 & -2/3 \\ -2/9 & 4/9 & 1/3 \end{bmatrix}.
\tag{A.43}
\]
One can show that for all non-singular matrices A and B of the same order and for all nonzero scalars c, we have
(i) |A| |A^{-1}| = 1
(ii) (cA)^{-1} = c^{-1} A^{-1}
(iii) (A^{-1})' = (A')^{-1}
(iv) (AB)^{-1} = B^{-1} A^{-1}.

As was mentioned previously, computations of inverses can be transferred to a computer. One should be warned, however, that if a matrix
is “nearly singular” (i.e., its determinant is close to zero), such a transfer becomes risky: round-off errors may then prove unacceptably large. Hence, it is recommended to check whether the product of the matrix and its obtained inverse is indeed equal to I.
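The example above, together with the recommended check, takes only a few lines in Python/NumPy (again, our own sketch rather than anything prescribed by the text):

    import numpy as np

    A = np.array([[3., 2., 1.],
                  [0., 1., 2.],
                  [2., 0., 1.]])               # the matrix of (A.41)
    A_inv = np.linalg.inv(A)
    print(A_inv)                               # should reproduce (A.43)
    print(np.allclose(A @ A_inv, np.eye(3)))   # the recommended check A A^{-1} = I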
A.6
Definiteness of a Matrix
Important Remark: Throughout the rest of this appendix all matrices are assumed to be symmetric.

Let x be a p-dimensional vector and A be a symmetric matrix of order p × p. The scalar

\[
\mathbf{x}'\mathbf{A}\mathbf{x} = \sum_{i=1}^{p}\sum_{j=1}^{p} a_{ij} x_i x_j
\tag{A.44}
\]

is called a quadratic form in the vector x. A symmetric matrix A (and a quadratic form associated with it) is called
(a) positive definite if x'Ax > 0 for all x ≠ 0;
(b) positive semi-definite if x'Ax ≥ 0 for all x;
(c) negative definite if x'Ax < 0 for all x ≠ 0;
(d) negative semi-definite if x'Ax ≤ 0 for all x;
we then write A > 0, A ≥ 0, A < 0, A ≤ 0, respectively. Otherwise, A is called indefinite. Sometimes, positive semi-definite matrices are called nonnegative definite while negative semi-definite matrices are called nonpositive definite. Note that positive (negative) definiteness of a matrix implies its positive (negative) semi-definiteness. It can be proved that positive- and negative-definite matrices are necessarily non-singular.
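Definiteness is rarely checked from the definition directly; anticipating properties (iv) and (v) of Section A.7, one usually inspects the signs of the eigenvalues. The following Python/NumPy sketch (our own, with illustrative matrices) does exactly that.

    import numpy as np

    def definiteness(A, tol=1e-10):
        # Classify a symmetric matrix by the signs of its eigenvalues
        # (see properties (iv) and (v) in Section A.7).
        lam = np.linalg.eigvalsh(A)
        if np.all(lam > tol):
            return "positive definite"
        if np.all(lam >= -tol):
            return "positive semi-definite"
        if np.all(lam < -tol):
            return "negative definite"
        if np.all(lam <= tol):
            return "negative semi-definite"
        return "indefinite"

    print(definiteness(np.array([[2., 1.], [1., 2.]])))    # positive definite
    print(definiteness(np.array([[1., 1.], [1., 1.]])))    # positive semi-definite
    print(definiteness(np.array([[1., 0.], [0., -1.]])))   # indefinite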
A.7
Eigenvalues and Eigenvectors
Eigenvalues and eigenvectors provide an extremely useful decomposition of symmetric matrices. For a symmetric matrix A of order p and a scalar λ, the determinant |A − λI| (where, of course, I = Ip) is a pth order polynomial in λ. The p roots of the equation

\[
|\mathbf{A} - \lambda\mathbf{I}| = 0
\tag{A.45}
\]

are called the eigenvalues (or characteristic roots or latent roots) of A. The equation itself is called the characteristic equation of A. The characteristic equation may have multiple roots, and then some of the eigenvalues λ1, λ2, . . ., λp are equal. But, as can be shown for any symmetric A, all the λi, i = 1, 2, . . ., p, are real numbers. For each eigenvalue λi, i = 1, 2, . . ., p, there exists a vector xi satisfying the vector equation

\[
\mathbf{A}\mathbf{x}_i = \lambda_i \mathbf{x}_i \quad (\text{or } (\mathbf{A} - \lambda_i\mathbf{I})\mathbf{x}_i = \mathbf{0}).
\tag{A.46}
\]
Any vector satisfying the above equation is called the eigenvector (or characteristic vector or latent vector) associated with the eigenvalue λi. For obvious reasons, any solution of this equation may be standardized (or normalized) so that, after standardization, it has length one, xi'xi = 1.

The following results hold for the eigenvalues and eigenvectors of any symmetric matrix A of order p:
(i) |A| = λ1 λ2 · · · λp;
(ii) tr(A) = λ1 + λ2 + · · · + λp;
(iii) if A is diagonal, A = diag(a1, a2, . . ., ap), then λi = ai for all i, i = 1, 2, . . ., p;
(iv) if A is positive (negative) definite, then all eigenvalues of A are positive (negative);
(v) if A is positive (negative) semi-definite of rank r, then exactly r of the eigenvalues of A are positive (negative) while the remaining eigenvalues are zero;
(vi) the rank of A is equal to the number of non-zero eigenvalues of A;
(vii) if λi is an eigenvalue of A and no other eigenvalue of A equals λi, then the eigenvector associated with λi is unique up to a scalar factor (i.e., it is unique if it is assumed to be standardized);
(viii) eigenvectors associated with distinct eigenvalues of A are necessarily orthogonal one to another;
(ix) eigenvectors of A can always be chosen in such a way that they are mutually orthogonal.

From the point of view of applications, the most important result to be introduced in this section is the following spectral decomposition (or Jordan decomposition) of a symmetric matrix A of order p:

\[
\mathbf{A} = \mathbf{\Gamma}\mathbf{\Lambda}\mathbf{\Gamma}' = \sum_{i=1}^{p} \lambda_i \mathbf{x}_i \mathbf{x}_i',
\tag{A.47}
\]

where Λ is a diagonal matrix of eigenvalues of A, Λ = diag(λ1, λ2, . . ., λp), and Γ is an orthogonal matrix whose columns are standardized eigenvectors of A, Γ = [x1, x2, . . ., xp]. In view of the previously presented results, this decomposition can always be done; it suffices to find the eigenvalues of A and choose corresponding eigenvectors in such a way that they are standardized and mutually orthogonal.

Let us use the concept of the spectral decomposition of a matrix to prove a useful property of the so-called idempotent matrices. A symmetric matrix A is called idempotent if A = A², where A² = AA. By the spectral decomposition of a symmetric matrix A, we have that A = ΓΛΓ' and A² = ΓΛΓ'ΓΛΓ' = ΓΛ²Γ'. If A is idempotent, Λ = Λ² and, hence, λi = 0 or 1 for all i. Thus, by (vi) and (ii) above,

\[
r(\mathbf{A}) = \mathrm{tr}(\mathbf{A})
\tag{A.48}
\]

for any symmetric and idempotent A.

We shall conclude this section with an example of a detailed derivation of the spectral decomposition of a matrix of order 3. Another problem is how to effectively find eigenvalues and eigenvectors of a matrix of order greater than 3. Finding the roots of a third-order polynomial is already cumbersome but, in general and without referring to iterative methods of numerical analysis, the task becomes unworkable if the polynomial is of order greater than four. Moreover, the greater the order of a matrix, the more tedious, although in principle routine, is the computation of the matrix's eigenvectors. No wonder, therefore, that the job is in fact always left to a computer.

Example: Let

\[
\mathbf{A} = \begin{bmatrix} 18 & 6 & 2 \\ 6 & 23 & 3 \\ 2 & 3 & 15 \end{bmatrix}.
\tag{A.49}
\]

Then |A − λI| = −λ³ + 56λ² − 980λ + 5488, and the roots of the characteristic equation are λ1 = 28 and λ2 = λ3 = 14. The eigenvector associated with λ1 = 28 is given by a solution of the following set of three equations, linear in the components x11, x21, x31 of x1 and written below in vector form:

\[
\begin{bmatrix} -10 & 6 & 2 \\ 6 & -5 & 3 \\ 2 & 3 & -13 \end{bmatrix}
\begin{bmatrix} x_{11} \\ x_{21} \\ x_{31} \end{bmatrix} = \mathbf{0}.
\tag{A.50}
\]
It can be easily verified that only two of the three rows of the above matrix of coefficients are linearly independent. In order to do this, it suffices to: i) consider the rows as vectors, multiply the first row by 2, the second by 3, leave the third row unchanged and add the vectors so obtained to get 0; ii) observe that any two of the three rows are linearly independent. Thus, one of the three equations is redundant. We can arbitrarily set x31 = 1 and, say, from the first two equations obtain x11 = 2 and x21 = 3. Finally, x1 = [2, 3, 1]'. As the eigenvectors associated with λ2 = 14 and λ3 = 14, we can readily choose linearly independent solutions of the following sets of equations (of course, this is in fact one set of equations, since the eigenvectors to be found are associated with equal eigenvalues):

\[
\begin{bmatrix} 4 & 6 & 2 \\ 6 & 9 & 3 \\ 2 & 3 & 1 \end{bmatrix}
\begin{bmatrix} x_{12} \\ x_{22} \\ x_{32} \end{bmatrix} = \mathbf{0}
\quad\text{and}\quad
\begin{bmatrix} 4 & 6 & 2 \\ 6 & 9 & 3 \\ 2 & 3 & 1 \end{bmatrix}
\begin{bmatrix} x_{13} \\ x_{23} \\ x_{33} \end{bmatrix} = \mathbf{0}.
\tag{A.51}
\]

It is worth noting that the fact that one can find linearly independent eigenvectors x2 and x3 follows, without looking at the equations, from property (ix) in this Section. (Actually, property (ix) implies more, namely that x2 and x3 may be chosen to be mutually orthogonal.) Indeed, this time two of the three equations are redundant, i.e., the maximum number of linearly independent rows in the matrix of coefficients, or the rank of the matrix, is one. Arbitrarily setting x12 = x22 = 1, we obtain x32 = −5 and, hence, x2 = [1, 1, −5]'. Analogously, setting x13 = 0 and x23 = 1 yields x33 = −3, i.e., x3 = [0, 1, −3]'. One can easily check that, in accordance with property (viii), x1'x2 = 0 and x1'x3 = 0. On the other hand, although linearly independent, the vectors x2 and x3 are not orthogonal. In the last section of this appendix, we give a surprisingly simple procedure for replacing linearly independent eigenvectors of a matrix by vectors which are orthogonal and are still eigenvectors of the matrix under scrutiny. In particular, the procedure mentioned leads to the replacement of vector x3 of the example by vector x̃3 = [−16, 11, −1]'. It is a trivial matter to verify that x̃3 is an eigenvector of A as well, and that x1, x2 and x̃3 are mutually orthogonal.

In order to state the spectral decomposition of A, and thus to conclude the example, it remains to standardize the eigenvectors obtained:

\[
\mathbf{y}_1 = \frac{1}{\sqrt{\mathbf{x}_1'\mathbf{x}_1}}\,\mathbf{x}_1
= \left[\, 2/\sqrt{14},\; 3/\sqrt{14},\; 1/\sqrt{14} \,\right]',
\]
\[
\mathbf{y}_2 = \frac{1}{\sqrt{\mathbf{x}_2'\mathbf{x}_2}}\,\mathbf{x}_2
= \left[\, 1/\sqrt{27},\; 1/\sqrt{27},\; -5/\sqrt{27} \,\right]',
\]
\[
\mathbf{y}_3 = \frac{1}{\sqrt{\tilde{\mathbf{x}}_3'\tilde{\mathbf{x}}_3}}\,\tilde{\mathbf{x}}_3
= \left[\, -16/(3\sqrt{42}),\; 11/(3\sqrt{42}),\; -1/(3\sqrt{42}) \,\right]'.
\]

Hence,

\[
\mathbf{A} = \mathbf{\Gamma}\mathbf{\Lambda}\mathbf{\Gamma}',
\tag{A.52}
\]

where Λ = diag(28, 14, 14) and

\[
\mathbf{\Gamma} = \begin{bmatrix}
2/\sqrt{14} & 1/\sqrt{27} & -16/(3\sqrt{42}) \\
3/\sqrt{14} & 1/\sqrt{27} & 11/(3\sqrt{42}) \\
1/\sqrt{14} & -5/\sqrt{27} & -1/(3\sqrt{42})
\end{bmatrix}.
\tag{A.53}
\]
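The same decomposition is obtained in a few lines of Python/NumPy (our own sketch). Note that, because the eigenvalue 14 is repeated, the computer's choice of the two associated eigenvectors may differ from the one made above; any orthonormal basis of that eigenspace is equally valid.

    import numpy as np

    A = np.array([[18., 6., 2.],
                  [6., 23., 3.],
                  [2., 3., 15.]])                        # the matrix of (A.49)
    lam, Gamma = np.linalg.eigh(A)                       # eigenvalues in ascending order
    print(lam)                                           # approximately [14, 14, 28]
    Lambda = np.diag(lam)
    print(np.allclose(Gamma @ Lambda @ Gamma.T, A))      # the spectral decomposition (A.47)
    print(np.allclose(Gamma.T @ Gamma, np.eye(3)))       # Gamma is orthogonal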
A.8
Matrix Square Root
The matrix square root of a positive semi-definite matrix A of order p, denoted by A^{1/2}, is the matrix satisfying the equality

\[
\mathbf{A} = \mathbf{A}^{1/2}\mathbf{A}^{1/2}.
\tag{A.54}
\]

A^{1/2} can easily be found using the spectral decomposition of A,

\[
\mathbf{A} = \mathbf{\Gamma}\mathbf{\Lambda}\mathbf{\Gamma}',
\tag{A.55}
\]

where Λ is a diagonal matrix of eigenvalues of A and Γ is an orthogonal matrix whose columns are standardized eigenvectors of A. Since A is positive semi-definite, all its eigenvalues λi are nonnegative and, hence, Λ = Λ^{1/2}Λ^{1/2}, where Λ^{1/2} = diag(λ1^{1/2}, λ2^{1/2}, . . ., λp^{1/2}). Thus,

\[
\mathbf{A} = \mathbf{\Gamma}\mathbf{\Lambda}^{1/2}\mathbf{\Lambda}^{1/2}\mathbf{\Gamma}'
= \mathbf{\Gamma}\mathbf{\Lambda}^{1/2}\mathbf{\Gamma}'\,\mathbf{\Gamma}\mathbf{\Lambda}^{1/2}\mathbf{\Gamma}'
\tag{A.56}
\]

since, due to the orthogonality of Γ, Γ'Γ = I. Finally, therefore, we get that

\[
\mathbf{A}^{1/2} = \mathbf{\Gamma}\mathbf{\Lambda}^{1/2}\mathbf{\Gamma}'
\tag{A.57}
\]

for any positive semi-definite A. One can readily show that A^{1/2} is a symmetric matrix with the inverse

\[
(\mathbf{A}^{1/2})^{-1} = \mathbf{\Gamma}\mathbf{\Lambda}^{-1/2}\mathbf{\Gamma}',
\tag{A.58}
\]

where Λ^{-1/2} = diag(λ1^{-1/2}, λ2^{-1/2}, . . ., λp^{-1/2}); of course, this inverse exists only when all the eigenvalues are strictly positive, i.e., when A is positive definite.
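A short Python/NumPy function (our own sketch of formula (A.57)) computes the square root of a positive semi-definite matrix and verifies (A.54):

    import numpy as np

    def sqrtm_psd(A):
        # Matrix square root of a symmetric positive semi-definite A via (A.57).
        lam, Gamma = np.linalg.eigh(A)
        lam = np.clip(lam, 0.0, None)        # guard against tiny negative round-off
        return Gamma @ np.diag(np.sqrt(lam)) @ Gamma.T

    A = np.array([[18., 6., 2.],
                  [6., 23., 3.],
                  [2., 3., 15.]])
    R = sqrtm_psd(A)
    print(np.allclose(R @ R, A))             # A^{1/2} A^{1/2} = A, as in (A.54)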
A.9
Gram-Schmidt Orthogonalization
Let the characteristic equation of a symmetric matrix A of order p have a root of multiplicity k, 1 < k ≤ p. Thus, A has k equal eigenvalues, say, λ1 = λ2 = · · · = λk. Assume that we are given k linearly independent eigenvectors x1, x2, . . ., xk associated with λ1, λ2, . . ., λk, respectively, and note that x1, x2, . . ., xk are orthogonal to the remaining eigenvectors, xk+1, xk+2, . . ., xp, of A (by property (viii) of Section A.7). Replace the vectors x1, x2, . . ., xk by the following vectors:

\[
\begin{aligned}
\tilde{\mathbf{x}}_1 &= \mathbf{x}_1 \\
\tilde{\mathbf{x}}_2 &= \mathbf{x}_2 - \frac{\tilde{\mathbf{x}}_1'\mathbf{x}_2}{\tilde{\mathbf{x}}_1'\tilde{\mathbf{x}}_1}\,\tilde{\mathbf{x}}_1 \\
&\;\;\vdots \\
\tilde{\mathbf{x}}_k &= \mathbf{x}_k - \frac{\tilde{\mathbf{x}}_1'\mathbf{x}_k}{\tilde{\mathbf{x}}_1'\tilde{\mathbf{x}}_1}\,\tilde{\mathbf{x}}_1 - \cdots - \frac{\tilde{\mathbf{x}}_{k-1}'\mathbf{x}_k}{\tilde{\mathbf{x}}_{k-1}'\tilde{\mathbf{x}}_{k-1}}\,\tilde{\mathbf{x}}_{k-1}.
\end{aligned}
\tag{A.59}
\]

It can be proved that x̃1, x̃2, . . ., x̃k, xk+1, . . ., xp form a set of mutually orthogonal eigenvectors of A. The process of orthogonalization given above is known as the Gram-Schmidt procedure.

Example: We shall apply the Gram-Schmidt procedure to the eigenvectors x2 = [1, 1, −5]' and x3 = [0, 1, −3]' of the example from Section A.7. We obtain

\[
\tilde{\mathbf{x}}_2 = \mathbf{x}_2 = \begin{bmatrix} 1 \\ 1 \\ -5 \end{bmatrix}
\quad\text{and}\quad
\tilde{\mathbf{x}}_3 = \mathbf{x}_3 - \frac{16}{27}\,\tilde{\mathbf{x}}_2 = \begin{bmatrix} -16/27 \\ 11/27 \\ -1/27 \end{bmatrix}.
\]

To simplify matters a bit, one can replace the last vector by the vector [−16, 11, −1]', since the latter preserves the required properties of the former.
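The procedure is equally easy to carry out on a computer. The following Python/NumPy sketch (ours) orthogonalizes the two eigenvectors of the example and reproduces the result above.

    import numpy as np

    def gram_schmidt(vectors):
        # Orthogonalize linearly independent vectors as in (A.59).
        ortho = []
        for x in vectors:
            y = x.astype(float)
            for u in ortho:
                y = y - (u @ x) / (u @ u) * u    # subtract the projection of x on u
            ortho.append(y)
        return ortho

    x2 = np.array([1., 1., -5.])
    x3 = np.array([0., 1., -3.])
    t2, t3 = gram_schmidt([x2, x3])
    print(t2)          # [ 1.  1. -5.]
    print(t3)          # [-16/27, 11/27, -1/27], as above
    print(t2 @ t3)     # 0: the vectors are orthogonal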
Appendix B
A Brief Introduction to Stochastics

B.1
Introduction
Probability calculus provides means to deal with chance, or random, phenomena in a mathematically rigorous way. Such phenomena can be either those which we passively observe or those which we purposely create. In the latter case, it is of course natural to speak of random experiments. However, it has become customary in the probabilistic and statistical literature to speak of random experiments in the former case as well; in fact, we may consider ourselves to be passive observers of the "experiments" performed by nature. We shall follow this convention in the sequel.

We say that an experiment is random if it can be performed repeatedly under essentially the same conditions, its outcome cannot be predicted with certainty, and the collection of all possible outcomes is known prior to the experiment's performance. When we say that the conditions under which the experiment is performed are "essentially" the same, we mean that we may assume that these conditions are the same and retain sufficient accuracy of our model of the actual experiment. For example, when tossing repeatedly a die on the beach, we can disregard the fact that some tosses are accompanied by gentle blows of a breeze. Of course, in the case of tossing a fair die, it is reasonable to assert unpredictability of the outcome. The third attribute of randomness is also fulfilled. The collection of all possible outcomes consists of all faces of the die.

The set of all possible outcomes of a random experiment is called a
sample space and it will be denoted by S. An element of the sample space (i.e., a single outcome) is called a sample point or an elementary event. For instance, one can consider the time to the first failure of the hard disk of a computer to be a random experiment with S = [0, ∞), where each real number s ∈ S is a sample point corresponding to the time to the first failure equal to s. One is often interested in the occurrence of any elementary event from a particular subset of S rather than in the occurrence of a particular elementary event. It is definitely so in the case of the time to the first failure. One can hardly be interested that this time be, say, exactly 2,557,440 minutes (or 5 years). At the same time, one can be interested that this time be not smaller than 5 years. In the case of tossing a six-sided die with faces 1 to 6, one can be interested in getting 6, but one can also be interested in getting an even outcome, 2 or 4 or 6. A subset of a sample space is called an event.

It is worth noting that even simple experiments sometimes require introducing quite involved sample spaces. Suppose for example that we are interested in the experiment consisting in tossing a coin until the head occurs for the first time. Denoting the head and tail occurrences by H and T, respectively, we easily find out that the sample space consists of infinitely many elementary events, S = {H, TH, TTH, TTTH, TTTTH, . . .}.

In practice, we are interested not in events alone but, at the same time, in chances, or probabilities, of their occurrence. Although the very term probability has several interpretations, we rarely have problems when assigning probabilities to "simple" events. The common understanding of the notion of probability in engineering sciences is well described by the so-called relative frequency approach. In order to present this approach, let us refer to the simplest possible example of tossing a fair coin once. In this example, S = {H, T} and we ask what is the probability of, say, a head occurring. Imagine an experiment in which we toss the coin N times and take the ratio n/N, where n is the number of heads occurring in N tosses, as an approximation of the probability of a head's occurrence in one toss. Had we let N tend to infinity, a conceivable but actually not possible experiment, we would have observed that the values of the ratio n/N, as N increases, stabilize. At the turn of the last century, Karl Pearson (1857-1936) performed the experiment with N = 24,000 and obtained n/N = .5005. More recently, computers were used to simulate our experiment with N assuming astronomically large values and with
the ratio n/N being always practically equal to 1/2. The problem with this approach is that although it is possible to prove that a limit value of the ratio in question exists, we cannot prove experimentally what this limit value is, since we cannot perform infinitely many experiments in a finite time. Still, it is undoubtedly reasonable to assign value 1/2 to the probability of the head occurring in one toss of a fair coin. By the same token, we assign probability 1/6 to obtaining any particular face in one toss of a fair, that is, balanced, six-sided die. Note that the relative frequency approach imposes the following natural bounds on probabilities of events. The probability of a certain event must be equal to one, while the probability of an impossible event must be equal to zero. Formally, denoting the probability of an event A ⊆ S by P (A), we can write P (S) = 1 and P (∅) = 0,
(B.1)
where ∅ denotes the empty set. Indeed, S is the certain event, since the occurrence of S is equivalent to the occurrence of any of all elementary events, and one of these must of course occur. For example, if we toss a six-sided die, one of its faces must be shown or, if we have a PC, its hard disk must have some time to the first failure, this time being an element of S = [0, ∞). On the other hand, an “empty event,” which corresponds to the empty set, is an impossible event, since the random experiment has been constructed in such a way that some outcome from S must occur. It also follows from the relative frequency approach that, for all events A ⊆ S, probability P (A) is defined to be a real number such that 0 ≤ P (A) ≤ 1. (B.2) In practice, we are most often interested in calculating probabilities of some combinations of events. For example, PC owners can be interested in maximizing the probability that their hard disks will break down either in the first year (i.e., during the warranty period) or after, say, five years from the purchase (practically nobody keeps a PC for more than five years). If we were given the probabilities of the events {time to failure is not greater than 1 year} and {time to failure is not smaller than 5 years}, we could calculate the probability of their union as P ({time to failure ≤ 1 year} ∪ {time to failure ≥ 5 years}) = P (time to failure ≤ 1 year) + P (time to failure ≥ 5 years),
where C ∪ D denotes the union of sets C and D, i.e., the set of elements which belong either to C or to D or to both. Similarly, the probability of obtaining 2 or 4 or 6 in one toss of a fair die with faces 1 to 6 is 1/2. Let us consider one more example. Suppose we are interested in the experiment consisting in two successive tosses of a fair coin, and we wish to find the probability of obtaining a head in the first toss. In this experiment, the sample space contains four elementary events, S = {HH, HT, T H, T T }. Of course, we assume that all the four elementary events are equally likely (i.e., their probabilities are 1/4). We have P (head in the first toss) = P (HH ∪ HT ) = P (HH) + P (HT ) = 1/2, in accordance with what should be expected intuitively. In all the above calculations of probabilities, we have applied the following general rule. If, for any fixed n, the events A1 , A2 , . . . , An are mutually exclusive (i.e., they are such that no two have an element in common, Ai ∩ Aj = ∅ whenever i = j), then P (A1 ∪ A2 ∪ . . . ∪ An ) = P (A1 ) + P (A2 ) + . . . + P (An ).
(B.3)
Conditions (B.1), (B.2) and (B.3) are known as the axioms of probability. Whenever necessary, condition (B.3) can be generalized to the following one: If the events A1 , A2 , A3 , . . . are mutually exclusive, then P (A1 ∪ A2 ∪ A3 ∪ . . .) = P (A1 ) + P (A2 ) + P (A3 ) + . . . . Here, infinitely many mutually exclusive events are taken into account. As we have seen, condition (B.3) enables one to calculate probabilities of mutually exclusive events in a rigorous and consistent way. If events A and B are not exclusive, i.e., they have some elementary events in common, condition (B.3) can still be used after decomposing the union of the events, A ∪ B, into two nonoverlapping sets. Actually, one can then show that, for any two events A and B, P (A ∪ B) = P (A) + P (B) − P (A ∩ B),
(B.4)
where A ∩ B denotes the intersection (i.e., the common part) of A and B. Let us conclude this section with some more problems connected with games of chance. In fact, much of probability calculus in the past centuries was concerned with such problems. Whatever the reason for that historical fact, it is the easiest path to enter the world of probabilistic
ideas. Note that our coin or die tossing examples were problems of this kind. The reader, however, should not get the false impression that probability theory is good for nothing more than dealing with artificial games far removed from reality.

Let us first compute the number of ways that we can arrange in a distinctive order k objects selected without replacement from n, n ≥ k, distinct objects. We easily see that there are n ways of selecting the first object, n − 1 ways of selecting the second object, and so on until we select k − 1 objects and note that the kth object can be selected in n − k + 1 ways. The total number of ways is called the permutation of n objects taken k at a time, P(n, k), and is seen to be given by

\[
P(n, k) = n(n-1)(n-2)\cdots(n-k+1) = \frac{n!}{(n-k)!},
\tag{B.5}
\]

where m! = m(m − 1)(m − 2) · · · 2 × 1 and 0! = 1. In particular, there are n! ways that we can arrange n objects in a distinctive order.

Next, let us compute in how many ways we can select k objects from n objects when we are not concerned with the distinctive order of selection. This number of ways is called the combination of n objects taken k at a time, and is denoted by C(n, k). We can find it by noting that P(n, k) could be first computed by finding C(n, k) and then multiplying it by the number of ways k objects could be distinctly arranged (i.e., k!). So we have

\[
P(n, k) = C(n, k)\,P(k, k) = C(n, k)\,k!
\]

and thus

\[
\binom{n}{k} = C(n, k) = \frac{n!}{(n-k)!\,k!}.
\tag{B.6}
\]

For example, the game of stud poker consists in the drawing of 5 cards from a 52 card deck (4 suits, 13 denominations). The number of possible hands is given by

\[
C(52, 5) = \frac{52!}{47!\,5!} = 2{,}598{,}960.
\]

We are now in a position to compute some basic probabilities which are slightly harder to obtain than, say, those concerning tossing a die. Each of the 2,598,960 possible poker hands is equally likely. To compute the probability of a particular hand, we simply evaluate

\[
P(\text{hand}) = \frac{\text{number of ways of getting the hand}}{\text{number of possible hands}}.
\]
Suppose we wish to find the probability of getting an all-spade hand. There are C(13, 5) ways of selecting 5 spades (without regard to their order) out of 13 spades. Hence,

\[
P(\text{an all-spade hand}) = \frac{C(13, 5)}{C(52, 5)}
= \frac{(9)(10)(11)(12)(13)}{(5!)(2{,}598{,}960)} = .000495.
\]

Finding the probability of getting four cards of a kind (e.g., four aces, four kings) is a bit more complicated. There are C(13, 1) ways of picking a denomination, C(4, 4) ways of selecting all the four cards of the same denomination, and C(48, 1) ways of selecting the remaining card. Thus,

\[
P(\text{four of a kind}) = \frac{C(13, 1)C(4, 4)C(48, 1)}{C(52, 5)}
= \frac{(13)(1)(48)}{2{,}598{,}960} = .00024.
\]

Similarly, to find the probability of getting two pairs, we have

\[
P(\text{two pairs}) = \frac{C(13, 2)C(4, 2)C(4, 2)C(44, 1)}{C(52, 5)}
= \frac{(78)(6)(6)(44)}{2{,}598{,}960} = .0475.
\]
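These combinatorial probabilities are easily checked with a few lines of Python (our own sketch; the standard library function math.comb computes C(n, k)):

    from math import comb

    n_hands = comb(52, 5)
    print(n_hands)                              # 2,598,960 possible hands

    p_all_spades = comb(13, 5) / n_hands
    p_four_of_a_kind = comb(13, 1) * comb(4, 4) * comb(48, 1) / n_hands
    p_two_pairs = comb(13, 2) * comb(4, 2) * comb(4, 2) * comb(44, 1) / n_hands

    print(round(p_all_spades, 6))               # about 0.000495
    print(round(p_four_of_a_kind, 5))           # about 0.00024
    print(round(p_two_pairs, 4))                # about 0.0475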
The next two Sections, as well as Sections B.5 and B.10, provide the reader with necessary elements of the mathematical framework of probability calculus. In Sections B.4, B.6 and B.12, the most widely used probabilistic models are briefly discussed. In Sections B.7-B.9, some specific and particularly powerful tools for dealing with random phenomena are introduced. Section B.11 deals with multivariate random phenomena. In the last Section, a brief exposition of statistical methods is given.
B.2
Conditional Probability
The probability of an event B occurring when it is known that some event A has occurred is called a conditional probability of B given that A has occurred or, shortly, the probability of B given A. The probability of B given A will be denoted by P (B|A). Suppose, for instance, that B
is the event of having a hand of 5 spades in stud poker. It is clear that we should expect the “unconditional” probability P (B) to be different from the conditional probability of having an all-spades hand given that we already know that we have at least four spades in the hand. Actually, we would rather expect P (B|A) to be greater than P (B). The fact that the occurrence of one event may influence the conditional probability of another event occurring is perhaps even more transparent in the following example. Suppose that we are interested in the event that the price of rye is higher tomorrow in Chicago than it is today. Since there is a transfer of grain use between rye and wheat, if we know that the price of wheat will go up, then the probability that the price of rye will go up is greater than if we did not have the information about the increase in the price of wheat. In order to become able to evaluate conditional probabilities effectively, let us look into this concept more thoroughly. First, since event A has occurred, only those outcomes of a random experiment are of interest that are in A. In other words, the sample space S (of which A is a subset) has been reduced to A, and we are now interested in the probability of B occurring, relative to the new sample space A. Formally, we should postulate that P (A|A) = 1: indeed, the probability of A given A has occurred should be one. Second, since A is now the sample space, the only elements of event B that concern us are those that are also in A. In other words, we are now interested in the intersection (or the common part) of A and B. Formally, we should postulate that P (B|A) = P (A ∩ B|A), where A∩B denotes the set consisting of only those elements which belong to A and to B. And finally, upon noticing that P (A ∩ B) and P (A ∩ B|A) are the probabilities of the same event, although computed relative to different sample spaces, we should postulate that P (A∩B) and P (A∩B|A) be proportional. More precisely, the last postulate states that for all events B such that P (A ∩ B) > 0 the ratio P (A ∩ B|A)/P (A ∩ B) is equal to some constant independent of B. It is not hard to show that the three postulates mentioned imply the following formal definition of the conditional probability of B given A (assuming P (A) is not zero): P (B|A) =
\[
\frac{P(A \cap B)}{P(A)}.
\tag{B.7}
\]
The conditional probability of an all-spade hand (event B) given that there are at least four spades in the hand (event A) can now be readily calculated. Note that the event A consists of all hands with four spades
and all hands with five spades and, therefore, that A ∩ B = B. Hence,

\[
P(B \mid A) = \frac{P(B)}{P(A)}
= \frac{\binom{13}{5} \Big/ \binom{52}{5}}
       {\left[\binom{13}{4}\binom{39}{1} + \binom{13}{5}\right] \Big/ \binom{52}{5}}
= .0441,
\]

since, by condition (B.3), P(A) = P(A1 ∪ A2) = P(A1) + P(A2), where A1 and A2 denote the events that there are exactly four and exactly five spades in the hand, respectively. Note that it follows from (B.7) that

\[
P(A \cap B) = P(B \mid A)\,P(A) = P(A \mid B)\,P(B).
\tag{B.8}
\]
These identities are known as the multiplicative rule of probability. It may happen that events A and B have no effect upon each other. Consider, for example, the experiment consisting in two successive tosses of a fair coin. Clearly, the probability of any of the two possible outcomes of the second toss is unaffected by the outcome of the first toss and, conversely, the first toss is independent of the second. In such situations, the information that some event A has occurred does not change the probability of B occurring and, conversely, the information that B has occurred does not change the probability of A occurring. Formally, we have then P (B|A) = P (B) and P (A|B) = P (A). (B.9) By (B.8), property (B.9) is equivalent to the following one: P (A ∩ B) = P (A)P (B).
(B.10)
We say that events A and B are stochastically independent or, shortly, independent if condition (B.10) is fulfilled. We leave it to the reader to find out why mutually exclusive events (of positive probability) cannot be independent.
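The conditional-probability computation above and the independence of the two coin tosses can both be verified numerically; here is a short Python sketch of our own:

    from math import comb

    n_hands = comb(52, 5)
    P_B = comb(13, 5) / n_hands                                  # all-spade hand
    P_A = (comb(13, 4) * comb(39, 1) + comb(13, 5)) / n_hands    # at least four spades
    print(round(P_B / P_A, 4))                                   # P(B|A) = P(B)/P(A), about 0.0441

    # Two tosses of a fair coin: the sample space is {HH, HT, TH, TT}.
    P_head_first = 2 / 4        # {HH, HT}
    P_head_second = 2 / 4       # {HH, TH}
    P_both_heads = 1 / 4        # {HH}
    print(P_both_heads == P_head_first * P_head_second)         # (B.10) holds: independence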
B.3
Random Variables
Examples of random experiments show that elementary events may but do not have to be expressed numerically. However, in order to handle
events defined on a sample space in a mathematically convenient way, it is necessary to attach a numerical value to every elementary event. When the elementary events are themselves numbers, this is, in fact, already done. In general, it is convenient to introduce the concept of a random variable. A random variable is a function that associates a real number with each element in the sample space. The way in which a particular random variable is determined depends on an experimenter's need. For example, if we consider time to the first failure of a PC's hard disk, elementary events are themselves numbers from the interval S = [0, ∞). On the other hand, if we consider tossing a coin once, we may attach, say, value zero to tail occurring and value one to head occurring, thus obtaining the random variable associated with the experiment under scrutiny. Of course, we can, if we wish to, define another random variable on the sample space S = {H, T} and work with that other definition as well; the only problem is that we "remember" our definition and properly interpret the random variable's values.

A random variable, to be abbreviated r.v., is discrete if the range of its values forms a discrete set; that is, if the r.v. can take only values that are either integers or can be put into one-to-one correspondence with some subset of integers. Obviously, r.v.'s associated with coin tossing experiments are discrete. A simple generalization of the coin tossing and die tossing experiments is the following one. Suppose we are given a random variable X which assumes the values x1, x2, . . ., xk (xi ≠ xj whenever i ≠ j) with the same probability 1/k. (From now on, we shall denote random variables by uppercase letters X, Y, Z, etc., and their values by corresponding lowercase letters x, y, z, etc.) We say that X has the (discrete) uniform distribution given by the probability function

\[
P(X = x_i) = \frac{1}{k}, \quad x_i = x_1, x_2, \ldots, x_k.
\tag{B.11}
\]

For the sake of brevity, we shall sometimes denote probability functions by f(x) instead of writing explicitly P(X = x). It is also convenient to define the cumulative distribution function (c.d.f.) F(x) of an r.v. X as the probability of the event that the r.v. X assumes a value not greater than x,

\[
F(x) = P(X \le x), \quad -\infty < x < \infty.
\tag{B.12}
\]

Note that F(x) is defined for all real numbers x. By the definition, it is a nondecreasing function, bounded from below by zero and from above by one.
In the case of a uniformly distributed r.v. X,

\[
F(x) = \sum_{x_i \le x} P(X = x_i) = \sum_{x_i \le x} f(x_i),
\tag{B.13}
\]

a sum over all xi's which are not greater than x, −∞ < x < ∞.

So far, we have discussed only the simplest possible discrete probability distribution. In the next section, we shall discuss four more distributions, namely the hypergeometric, binomial, geometric and Poisson distributions. Already a cursory look at the probability functions of these distributions, given by (B.31), (B.32), (B.33) and (B.37), respectively, makes it evident that a way of summarizing the information provided by the probability functions would be welcome. The problem is that probability functions have to be determined at many different points. In particular, the Poisson probability function is defined for all nonnegative integers! So it is natural to attempt to characterize the most important aspects of a probability distribution using as few numerical characteristics as possible. The most common of such summaries are the mean and variance of an r.v. (or, equivalently, of its distribution). The mean, or expected value, of the r.v. X is defined as

\[
\mu = E(X) = \sum x\,P(X = x) = \sum x\,f(x),
\tag{B.14}
\]

where the summation is over all possible values x of the r.v. X. Thus, the mean is simply the weighted average of the values of X, with the weights being the probabilities of the values' occurrences. For the discrete uniform distribution in its general form, we have

\[
\mu = E(X) = \frac{1}{k}\sum_{i=1}^{k} x_i.
\tag{B.15}
\]
The mean provides some information about the "location," or the "center," of the probability distribution. However, it provides no information about the "spread," or "variability," of the distribution about its mean value. It is the variance of the distribution which measures the expected squared departure from μ. The variance of the r.v. X is given by

\[
\mathrm{Var}(X) = \sigma^2 = E[(X - \mu)^2] = \sum (x - \mu)^2 f(x),
\tag{B.16}
\]

where the summation is over all possible values x of the r.v. X. The square root of the variance is called the standard deviation. For the
discrete uniform distribution in its general form, we have

\[
\mathrm{Var}(X) = \sigma^2 = \frac{1}{k}\sum_{i=1}^{k} (x_i - \mu)^2.
\tag{B.17}
\]
Straightforward calculations show that if X is “rescaled” by a fixed factor a and “shifted” by a constant b, i.e., if it is transformed to the r.v. aX + b, then E(aX + b) = aμ + b and Var(aX + b) = a2 Var(X).
(B.18)
Note that the above results are intuitively appealing: e.g., shifting a random variable changes its "location" but does not affect its "variability." More generally, if we are interested in the mean (or the expectation) of some function g(X) of X, we have

\[
E(g(X)) = \sum_{x} g(x) f(x),
\tag{B.19}
\]

where the summation is over all possible values x of X, and, for any fixed reals a and b, we obtain

\[
E(a\,g(X) + b) = a\,E[g(X)] + b.
\tag{B.20}
\]
Applying (B.20) repeatedly, we obtain Var(X) = E(X 2 ) − [E(X)]2
(B.21)
for any r.v. X.

It is often the case in engineering sciences that random phenomena are functions of more than one random variable. Thus, we have to answer the question how to characterize the joint distribution of several r.v.'s. We shall confine ourselves to the case of two r.v.'s only, since generalizations to more r.v.'s are obvious. For an r.v. X which assumes the values xi, i = 1, 2, . . ., k, and an r.v. Y which assumes the values yj, j = 1, 2, . . ., m, where either k or m or both can be equal to infinity, it is natural to define the joint p.f. of X and Y via

\[
f(x_i, y_j) = P(X = x_i,\, Y = y_j), \quad i = 1, 2, \ldots, k, \; j = 1, 2, \ldots, m
\tag{B.22}
\]

and the joint c.d.f. via

\[
F(x, y) = \sum_{x_i \le x} \sum_{y_j \le y} f(x_i, y_j).
\tag{B.23}
\]
Note that given the joint p.f. f(x, y), we can immediately determine the p.f.'s of X and Y alone:

\[
f_X(x_i) = \sum_{j=1}^{m} f(x_i, y_j), \quad i = 1, 2, \ldots, k
\tag{B.24}
\]

and

\[
f_Y(y_j) = \sum_{i=1}^{k} f(x_i, y_j), \quad j = 1, 2, \ldots, m.
\tag{B.25}
\]

fX(x) and fY(y) are called the marginal p.f.'s of X and Y, respectively. Note that it readily follows from (B.10) that the joint p.f. of two independent r.v.'s X and Y is given by

\[
f(x, y) = f_X(x)\,f_Y(y).
\tag{B.26}
\]

Furthermore, we can calculate the expectation of a function g(x, y) of the r.v.'s X and Y:

\[
E(g(X, Y)) = \sum_{i=1}^{k} \sum_{j=1}^{m} g(x_i, y_j)\,f(x_i, y_j).
\tag{B.27}
\]
In particular, the covariance of X and Y , defined as Cov(X, Y ) = E{[X − E(X)][Y − E(Y )]},
(B.28)
can be seen as some measure of “co-dependence” between random variables. If X and Y “vary together,” i.e., if they tend to assume large or small values simultaneously, then the covariance takes positive values. If, rather, large values of one random variable are associated with small values of the other one, the covariance takes negative values. In the former situation we may speak of “positive,” while in the latter of “negative,” dependence. Moreover, if X and Y are stochastically independent, we have Cov(X, Y) = E[X − E(X)]E[Y − E(Y )] = 0. However, the reader should be warned that, in general, the converse is not true; it is possible to give examples of dependent r.v.’s whose covariance is zero. The covariance, seen as the measure of co-dependence between r.v.’s, has the drawback of not being normalized. Namely, covariances can take
values from −∞ to ∞. A normalized version of this measure is provided by the correlation between X and Y, which is defined as

\[
\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}.
\tag{B.29}
\]

It is easy to prove that the correlation ρ(X, Y) takes values only between −1 and 1. To see this, let a be an arbitrary constant, and consider

\[
0 \le E\{[a(X - E(X)) - (Y - E(Y))]^2\}
= a^2 \sigma_X^2 + \sigma_Y^2 - 2a\,\mathrm{Cov}(X, Y),
\tag{B.30}
\]

where σX² and σY² are the variances of X and Y, respectively. Substituting a = Cov(X, Y)/σX² in (B.30) yields
ρ2 ≤ 1 and the desired result follows. Of course, ρ(X, Y ) preserves other properties of the covariance of X and Y . Moreover, one can readily show that ρ(X, Y ) = 1 if Y = aX+b for any positive a and any b, and ρ(X, Y ) = −1 if Y = aX + b for any negative a and any b. Thus, the correlation between random variables attains its maximum and minimum values when there is a linear relationship between the variables.
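For a concrete discrete example, the covariance and correlation can be computed directly from a joint probability function. In the Python/NumPy sketch below the joint p.f. is a small table of our own invention, used purely for illustration.

    import numpy as np

    x_vals = np.array([0.0, 1.0])
    y_vals = np.array([0.0, 1.0, 2.0])
    f = np.array([[0.10, 0.20, 0.10],           # hypothetical joint p.f. f(x_i, y_j);
                  [0.05, 0.15, 0.40]])          # the entries sum to one

    fX = f.sum(axis=1)                          # marginal p.f. of X, as in (B.24)
    fY = f.sum(axis=0)                          # marginal p.f. of Y, as in (B.25)
    EX = (x_vals * fX).sum()
    EY = (y_vals * fY).sum()
    EXY = (np.outer(x_vals, y_vals) * f).sum()  # E(XY) computed via (B.27)
    cov = EXY - EX * EY                         # equivalent to definition (B.28)
    rho = cov / np.sqrt(((x_vals**2 * fX).sum() - EX**2) *
                        ((y_vals**2 * fY).sum() - EY**2))
    print(cov, rho)                             # rho lies between -1 and 1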
B.4
Discrete Probability Distributions

B.4.1
Hypergeometric Distribution
Suppose we are given a batch of 100 manufactured items. It is known that 5 of them are defective but we do not know which ones. We draw at random and without replacement 10 items. The question is what is the probability that we have drawn no defectives, what is the probability that we have drawn exactly 1 defective, etc. More generally, we can assume that we are given a population of N items, of which M are of type 1 and N − M are of type 2. We draw without replacement n items and ask what is the probability distribution of the number of items of type 1 among the given n items. Now we can define the random variable X which assumes value k if k items of type 1 have been drawn. In order to find the probability distribution of X, let us note first that P (X = k) = 0 if k > M or k > n. Moreover, if the total number of items of type 2, N − M , is smaller than n, then at least n − (N − M ) items of type 1
have to be drawn. Proceeding analogously as in Section B.1, we find, therefore, that

\[
P(X = k) = \frac{C(M, k)\,C(N - M,\, n - k)}{C(N, n)}
\tag{B.31}
\]

if max{n − (N − M), 0} ≤ k ≤ min{n, M}. This is the probability function of the hypergeometric distribution with parameters M, N and n. Of course, if we did not assume that we know the number of defective items, the above problem would be the most typical problem of acceptance sampling in quality control. In practice, we avoid this obstacle using some estimate of the unknown parameter M.
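The batch example that opened this subsection (N = 100 items, M = 5 defectives, n = 10 drawn) can be worked out with a few lines of Python (our own sketch):

    from math import comb

    def hypergeom_pf(k, M, N, n):
        # P(X = k) for the hypergeometric distribution, formula (B.31).
        return comb(M, k) * comb(N - M, n - k) / comb(N, n)

    for k in range(0, 6):                        # k can be at most M = 5 here
        print(k, round(hypergeom_pf(k, M=5, N=100, n=10), 4))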
B.4.2
Binomial and Geometric Distributions
A similar probabilistic model is provided by the following experiment. Consider a single random experiment in which there are only two possible outcomes and assume that we are interested in n independent repetitions of this experiment. Let us call the possible outcomes of each single experiment a success and a failure, respectively. Thus, the whole experiment consists in performing n independent repeated trials, and each trial results in an outcome that may be classified as a success or a failure. Let us assume that the probability of success, p, remains constant from trial to trial (thus, the same can be claimed about the probability of failure, 1 − p). Such experiments are known as binomial experiments. The difference between this experiment and the one that led us to the hypergeometric distribution is that formerly we sampled without replacement, while the binomial experiment corresponds to sampling with replacement. Formerly, the probability that the next item drawn will be defective depended on the number of defectives obtained in the previous draws. In the simpler, binomial experiment, this probability does not change from trial to trial. It is intuitively obvious, however, that both models become practically equivalent if a population from which we draw items is large enough. In fact, it is the binomial model which is much more often used in quality control. Define the random variable X as the number of successes in n independent trials of a binomial experiment. Clearly, X can assume values 0, 1, 2, . . . , n. We can ask what is the probability function (or the probability distribution) of the r.v. X. That is, we want P (X = x), the probability of x successes and (n − x) failures in n trials, where x runs over the values 0, 1, 2, . . . , n. In order to answer the question, fix x and
note that each particular sequence of trials with x successes has the probability of occurring

\[
\underbrace{p\,p \cdots p}_{x\ \text{times}}\;
\underbrace{(1-p)(1-p)\cdots(1-p)}_{n-x\ \text{times}}
= p^x (1-p)^{n-x},
\]

since the trials are independent. Hence, it remains to verify that there are C(n, x) different sequences of trials with x successes. Indeed, this result will follow if we number the trials from 1 to n and ask in how many ways we can select x numbers of trials from all n numbers of trials. Finally, therefore,

\[
P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, 2, \ldots, n.
\tag{B.32}
\]
This is the probability function of the binomial distribution with parameters n and p. The r.v. X is called the binomial random variable. In a natural way, the binomial distribution is used as a basis for acceptance sampling and acceptance-rejection control charts of quality control. For the binomial distribution with parameters n and p, we have μ = E(X) = 0f (0) + 1f (1) + 2f (2) + · · · + nf (n) = np, and Var(X) = σ 2 = (0 − np)2 f (0) + (1 − np)2 f (1) + (2 − np)2 f (2) + · · · + (n − np)2 f (n) = np(1 − p). The above results are not simple when calculated directly, but we shall show in Section B.8 how to obtain them in a surprisingly easy way. For n = 1, the binomial distribution becomes the Bernoulli distribution. The only possible values of the r.v. X are then 0 and 1, and the probability of success, P (X = 1), is equal to p while P (X = 0) = 1 − p. The r.v. X is called the Bernoulli random variable and it corresponds to just one experiment. Note that for any fixed n, a binomial r.v. is in fact the sum of n independent and identically distributed Bernoulli r.v.’s. In other words, a binomial experiment consists in repeating a Bernoulli experiment n times. Sometimes, an experiment consists in repeating Bernoulli experiments until the first success occurs. The number of Bernoulli experiments performed, to be denoted also by X, is then a random variable. The range
© 2002 by Chapman & Hall/CRC
354
appendix b. a brief introduction to stochastics
of X is equal to the set of all nonnegative integers, with the probability of assuming any particular value, x, P (X = x) = p(1 − p)x−1 , x = 1, 2, . . . ,
(B.33)
where p denotes the probability of success. The given distribution is called the geometric distribution and X is called the geometric random variable. The mean and variance of the geometric distribution can be easily obtained using the method of Section B.8. They can also be readily obtained using the following argument. We have, of course, that ∞
1 (1 − p)x−1 = . p x=1 Now, differentiating both sides with respect to p, we obtain ∞
(x − 1)(1 − p)x−2 =
x=1
1 . p2
Hence, E(X) =
∞
1 xp(1 − p)x−1 = . p x=1
Using the same differentiation trick again yields, after a little algebra, E(X 2 ) =
2−p . p2
Thus, since V ar(X) = E(X 2 ) − [E(X)]2 , V ar(X) =
1−p . p2
Let us note that the random variable equal to the number of trials before the first success occurs, Y = X − 1, has the distribution P (Y = y) = p(1 − p)y , y = 0, 1, 2, . . . . Some authors call Y , not X, a geometric r.v.
© 2002 by Chapman & Hall/CRC
355
discrete probability distributions
B.4.3
Poisson Distribution
The last discrete probability distribution we are going to discuss in this section is the distribution named after Simeon Poisson (1781-1840) who was the first to describe rigorously experiments of the following kind. Consider some events which happen at random instants. Let X be the random variable representing the number of events occurring in a given time interval, say, of length T . Assume that the number of events occurring in one time interval is independent of the number of events that occur in any other disjoint time interval, P (k occurrences in [t1 , t2 ] and m occurrences in [t3 , t4 ]) = P (k in [t1 , t2 ])P (m in [t3 , t4 ])
(B.34)
if [t1 , t2 ] ∩ [t3 , t4 ] = ∅. Assume also that the probability that a single event will occur in a short time interval is proportional to the length of the time interval, P (1 occurrence in [t, t + ε]) = λε,
(B.35)
where λ is a proportionality constant. Finally, assume that the probability that more than one event will occur in a short time interval is negligible, P (more than 1 occurrence in [t, t + ε]) = o(ε),
(B.36)
where limε→∞ o(ε)/ε = 0. The experiment which fulfills these assumptions is called a Poisson experiment, and the r.v. X equal to the number of events occurring in a Poisson experiment is called a Poisson random variable. In telecommunication studies, the number of incoming telephone calls in a given time interval is considered to be a Poisson r.v. Another example of a Poisson r.v. is the number of alpha particles that enter a prescribed region in a given time interval. Also, the number of failures of some devices up to a fixed time is assumed to be a Poisson r.v. This last fact may seem counterintuitive since, contrary to the properties of the Poisson experiment, one would rather expect that if, say, no failure has occurred up to an instant t, then the chances for a failure in an adjacent interval (t, t + s] should increase. However, some devices can indeed be assumed to be “memoryless” in the sense that they “do not remember” that they have already been in use for a time t. For instance, many
© 2002 by Chapman & Hall/CRC
356
appendix b. a brief introduction to stochastics
electronic components could be considered practically everlasting were they not subject to damage by external phenomena of random character. The probability function of the Poisson r.v. X, representing the number of events occurring in a given time interval of length T , has the form e−ν ν x P (X = x) = , x = 0, 1, 2, . . . , (B.37) x! where ν = λT is the parameter of the distribution, ν > 0. Note that, under the assumptions stated, the number of events’ occurrences cannot be bounded from above by some fixed positive integer. It can be shown, however, that the mean and variance of the Poisson distribution with parameter ν both have the value ν. In particular, therefore, ν is the average number of events that occur in the given time interval. Parameter λ may be called the average occurrence rate, since it is equal to the average number of events when T = 1. In a wider context, the Poisson p.f. will be derived in Section B.12. Although we have confined ourselves to Poisson experiments which are connected with observing events in a given time interval, it should be clear that we could focus on events that occur in a specified region of size T . Namely, we can also speak of a Poisson experiment if: (i) the number of events occurring in one region is independent of the number of events in a disjoint region; (ii) the probability that a single event will occur in a small region is proportional to the size of the region; (iii) the probability that more than one event will occur in a small region is negligible. Simply, all that applied above to a given time interval applies now to a specified region. An important property of the binomial and Poisson distributions is that the former may be approximated by the latter when the number of trials n is large, the probability p is small and λT = np. Taking all these properties into account, the Poisson distribution can be used in quality control, for example, to control the number of defects on a surface of a certain element or to control the number of nonconforming units in a batch.
B.5
More on Random Variables
Let us now turn to random variables whose range is “continuous” in the sense that it forms an interval, finite or not, of real numbers. For example, the random variable which is defined as the time to the first failure
© 2002 by Chapman & Hall/CRC
more on random variables
357
of a PC’s hard disk has a “continuous” range, [0, ∞). The simplest random variable of this type is a continuous counterpart of the uniformly distributed r.v. described in Section B.3. Consider the experiment consisting in choosing at random a point from the interval [0, 1]. Now, since each outcome is assumed equally likely and there are infinitely many possible outcomes, each particular outcome must have probability zero. Due to the infinite number of possible outcomes, it does not contradict the fact that some outcome must occur in the experiment. For the r.v. X assuming value x whenever point x has occurred, we should have P (0 ≤ X ≤ 1) = 1 since the probability of a certain event is one, and, for any numbers a, b, 0 ≤ a ≤ b ≤ 1, we should have P (a ≤ X ≤ b) =
length of [a, b] interval length of [0, 1] interval
due to the assumed randomness of the choice of a point from [0, 1]. Thus, P (a ≤ X ≤ b) = =
&b
1dx 0 1dx b−a = b − a, 0 ≤ a ≤ b ≤ 1, 1 & a1
in accordance with the intuitive understanding of the experiment. The given probability distribution is called the uniform, or rectangular (continuous), distribution on the interval [0, 1], and will be denoted by U [0, 1]. A random variable is said to be continuous if its range forms an interval, finite or not, and the probability of assuming exactly any of its values is zero. In such cases, it is reasonable to ask only about the probability that an r.v. assumes a value from an interval or some combination of intervals. The given example suggests also a more specific definition of a continuous r.v. Namely, a random variable X is continuous if there exists a function f (x), defined over the set of all reals R, such that (a) f& (x) ≥ 0 for all xR ∞ f (x)dx = 1 (b) −∞ & (c) P (a ≤ X ≤ b) = ab f (x)dx, a ≤ b, a, bR. The function f (x) is called a probability density function (p.d.f.) for the r.v. X.
© 2002 by Chapman & Hall/CRC
358
appendix b. a brief introduction to stochastics
It follows from the definition that, for a continuous r.v. X with p.d.f. f (x) and cumulative distribution function F (x), F (x) = P (X ≤ x), we have
x dF f (t)dt and f (x) = F (x) = (B.38) dx −∞ for each xR. For the uniform distribution on [0, 1], the p.d.f. is equal to one on the interval [0, 1], and is zero elsewhere. (We leave it to the reader to show that the p.d.f. for the uniform distribution on an arbitrary fixed interval [A, B] is equal to 1/(B − A) on that interval, and is zero elsewhere.) For later reference, let us observe also that the c.d.f. F (x) of U [0, 1] is zero for x ≤ 0, F (x) = x for 0 < x ≤ 1 and stays equal to one for all x > 1. When graphed, probability density functions provide an information where and how the “probability mass” of the probability distribution is located on the real line R. In analogy with the discrete case, we define the mean of an r.v. X with p.d.f. f (x) as E(X) = μ =
∞ −∞
xf (x)dx
(B.39)
and the variance as Var(X) = σ 2 =
∞ −∞
(x − μ)2 f (x)dx.
(B.40)
Replacing summation by integration and noting that E[g(X)] =
∞ −∞
g(x)f (x)dx,
(B.41)
we easily obtain that properties (B.18), (B.20) and (B.21) hold in the continuous case as well. In turn, given two continuous random variables X and Y , it is natural to define their joint c.d.f. and p.d.f. via F (x, y) =
x y −∞ −∞
and f (x, y) =
© 2002 by Chapman & Hall/CRC
f (x, y)dxdy
∂ 2 F (x, y) , ∂x∂y
(B.42)
(B.43)
more on random variables
359
respectively. Consequently, for any rectangle [a, b]×[c, d] in the xy plane,
b d
P (X[a, b], Y [c, d]) =
a
f (x, y)dxdy.
(B.44)
c
Of course, computing probabilities over more involved regions in the xy plane is also possible. For instance, if we are interested in the probability distribution of the sum V = X + Y , we can calculate P (X + Y ≤ v) =
f (x, y)dxdy.
(B.45)
x+y≤v
The expectation of a function g(x, y) of the r.v.’s X and Y is given by E(g(X, Y )) =
∞ ∞ −∞ −∞
g(x, y)f (x, y)dxdy,
(B.46)
and, in particular, the covariance of X and Y , defined by (B.27), has the same properties as covariances of discrete random variables. Also, correlation ρ between random variables has the same properties in discrete and continuous cases. Finally, the marginal p.d.f.’s of X and Y are given by fX (x) =
∞ −∞
f (x, y)dy and fY (y) =
∞ −∞
f (x, y)dx,
(B.47)
and, just as in the discrete case, it readily follows from (B.10) and (B.42) that the joint p.d.f. of two independent continuous r.v.’s X and Y is given by (B.48) f (x, y) = fX (x)fY (y). The case of more than two random variables is treated in some detail in Section A.11. Let us only mention here that p random variables X1 , X2 , . . . , Xp are said to be mutually stochastically independent or, shortly, stochastically independent, if and only if f (x1 , x2 , . . . , xp ) = f (x1 )f (x2 ) · · · f (xp ),
(B.49)
where f (x1 , x2 , . . . , xp ) is the joint p.d.f. of X1 , X2 , . . . , Xp and f (xi ) is the marginal p.d.f. of Xi , i = 1, 2, . . . , p. In the next section, we shall present some most widely used continuous probability distributions. Let us, however, conclude this section pointing out the surprising importance of the uniform U [0, 1] distribution. Suppose we want to construct a random variable Y having a given
© 2002 by Chapman & Hall/CRC
360
appendix b. a brief introduction to stochastics
c.d.f. F (y), increasing between its boundary values 0 and 1. In order to achieve this, it suffices to verify that the random variable Y = F −1 (X) has the required distribution if X is distributed uniformly on [0, 1], and F −1 is the inverse of the function F . Indeed, P (Y ≤ y) = P (F −1 (X) ≤ y) = P (X ≤ F (y)) = F (y) as was to be shown (the last equality is a direct consequence of the form of the c.d.f. corresponding to the U [0, 1] distribution). Analogous argument shows that the random variable F (Y ), where F (y) is the c.d.f. of Y , has the uniform distribution on [0, 1]. Thus, given an r.v. V with a c.d.f. Fv (v), we can construct an r.v. Z with an arbitrary c.d.f. Fz (z) by the composite transformation Z = Fz−1 (Fv (V )). The first transformation gives the uniformly distributed random variable, while the second transformation yields the desired result. Of course, computer simulations of random phenomena rely heavily on this property.
B.6
Continuous Probability Distributions
B.6.1
Normal and Related Distributions
Let us consider first the most widely used of all continuous distributions, namely the normal , or Gaussian, distribution. The probability density function of the normal random variable X is f (x) = √
1 2πσ 2
exp −
1 (x − μ)2 , −∞ < x < ∞, 2σ 2
(B.50)
where exp(y) denotes e raised to the power y, and μ and σ 2 are the parameters of the distribution, −∞ < μ < ∞, σ 2 > 0. It can be shown that the constants μ and σ 2 are simply the mean and variance of the r.v. X, respectively (see Section B.8). The normal distribution with mean μ and variance σ 2 will be denoted by N (μ, σ 2 ). It is easy to prove that if X is normally distributed with mean μ and variance σ 2 , then the r.v. Z=
X −μ , σ
(B.51)
√ where σ = σ 2 , is normally distributed with mean 0 and variance 1. Z is called the standard normal r.v. If X1 , X2 , . . . , Xn are independent r.v.’s having normal distributions with means μ1 , μ2 , . . . , μn and variances σ12 , σ22 , . . . , σn2 , respectively, then
© 2002 by Chapman & Hall/CRC
continuous probability distributions
361
the sum has the normal distribution with mean μ1 + μ2 + · · · + μn and variance σ12 + σ22 + · · · + σn2 . More generally, if a1 , a2 , . . . , an are n constants, then the linear combination a1 X1 + a2 X2 + · · · + an Xn has the normal distribution with mean a1 μ1 + a2 μ2 + · · · + an μn and variance a21 σ12 + a22 σ22 + · · · + a2n σn2 . (All the results on linear combinations of independent r.v.’s stated in this Section can easily be proved using the so-called moment-generating functions; see Section B.8.) It is easily seen that the normal curve (i.e., the curve corresponding to the normal p.d.f.) is “bell shaped”: it is positive for all xR, smooth, attains its unique maximum at x = μ, and is symmetric about μ.
Figure B.1. Normal Densities With Zero Mean. It was this shape of the normal curve that made many researchers of the 19th century believe that all continuous random phenomena must be of this type. And, although that was a vast exaggeration, we can indeed relatively often assume that a random phenomenon follows the normal law. The reason for this is twofold. First, we can quite often presume that the outcomes of an experiment are distributed symmetrically about some mean value and that the outcomes close to that mean are
© 2002 by Chapman & Hall/CRC
362
appendix b. a brief introduction to stochastics
more likely than those distant from it. In other words, we can consider the outcomes to be at least approximately normally distributed. And second, it turns out that if the observed outcomes are the averages of some other independent random phenomena governed by an essentially arbitrary probability distribution, then the probability distribution of the outcomes is close to normal. This last result, which is, perhaps, the most amazing result of probability theory, will be discussed in a greater detail in Section B.9. Taking these facts into account, no wonder that the normal distribution is also the most widely used distribution in statistical process control for quality improvement. A distribution which is related to the normal distribution is the lognormal distribution. We say that the random variable X has the lognormal distribution if its logarithmic transformation lnX is normally distributed. The p.d.f. of X is given by f (x) = √
1 1 exp − 2 (lnx − μ)2 2σ 2πσx
for x > 0
(B.52)
and zero elsewhere. The constants μ and σ 2 are, respectively, the mean and variance of the r.v. lnX. The p.d.f. of the lognormal distribution is not symmetric: it is zero for negative x and skewed to the right of its maximum. However, the lognormal curve becomes “almost” symmetric for small values of σ 2 , say, for σ 2 ≤ .01. Sometimes, the lognormal distribution is used in statistical process control for quality improvement and in representing distributions of the useful life (or the lifetime) of various devices. Another distribution related to the normal one is the Cauchy distribution. Let Z1 and Z2 be two independent standard normal random variables. The random variable X = Z1 /Z2 is said to have the Cauchy distribution. Its p.d.f. is given by
f (x) =
1 , −∞ < x < ∞. π(1 + x2 )
(B.53)
It is easy to see that a Cauchy r.v. does not have finite mean value. Indeed, for large values of |x|, the product xf (x) approaches the function 1/(πx) and hence integral (B.39) does not exist. As this example shows, not all random variables have finite means, let alone variances.
© 2002 by Chapman & Hall/CRC
continuous probability distributions
B.6.2
363
Gamma Distributions
In order to introduce an important family of the so-called gamma distributions, let us first define the gamma function: Γ(α) =
∞ 0
xα−1 e−x dx for α > 0.
(B.54)
Integrating by parts, we obtain Γ(α) = (α − 1)Γ(α − 1) for α > 1.
(B.55)
When α = n, with n a positive integer, repeated application of (B.55) yields Γ(n) = (n − 1)!, (B.56) &
since Γ(1) = 0∞ e−x dx = 1. We say that the r.v. X has the gamma distribution with parameters α and β, if its p.d.f. is f (x) =
1 xα−1 e−x/β for x > 0 β α Γ(α)
(B.57)
and zero elsewhere, where both constants, α and β, are positive. The mean of X is αβ and the variance of X is αβ 2 (see Section B.8). The gamma distribution with parameter α = 1 is called the (negative) exponential distribution with parameter β. That is, the exponential r.v. X has the p.d.f. 1 (B.58) f (x) = e−x/β for x > 0 β and zero elsewhere, where β > 0. It also follows from the above that X has the mean β and variance β 2 . If some events are occurring in time, independently one of another and according to a Poisson experiment with the average occurrence rate λ, then the inter-event times have the exponential distribution with mean β = 1/λ. For example, if the incoming telephone calls constitute the Poisson experiment, then the inter-arrival times between successive calls have the exponential distribution. By the same token, times between successive failures of a device and, in particular, time to the first failure (or the lifetime) can sometimes be modelled by the exponential distribution as well. (Poisson experiments will be reexamined within a more general framework in Section B.12.) Let us mention that if X1 , X2 , . . . , Xn are independent r.v.’s having identical exponential distribution with parameter β, then the sum X1 +
© 2002 by Chapman & Hall/CRC
364
appendix b. a brief introduction to stochastics
X2 + · · · + Xn has the gamma distribution with parameters n and β. For example, time between n successive telephone calls can be modelled by this distribution. If β = 1, this distribution is called the Erlang distribution. In turn, the gamma distribution with parameters α = ν/2 and β = 2, where ν is a positive integer, is called the chi-square (χ2ν for short) distribution with ν degrees of freedom. The chi-square r.v. X has the p.d.f.
f (x) =
1 2ν/2 Γ(ν/2)
xν/2−1 e−x/2 for x > 0
(B.59)
and zero elsewhere. The r.v. X has the mean ν and variance 2ν. If X1 , X2 , . . . , Xn are independent r.v.’s having chi-square distributions with ν1 , ν2 , . . . , νn degrees of freedom, respectively, then the r.v. X1 + X2 + · · · + Xn has chi-square distribution with ν1 + ν2 + · · · + νn degrees of freedom. If X1 , X2 , . . . , Xn are independent r.v.’s having identical normal distribution with mean μ and variance σ 2 , then the r.v.
n Xi − μ 2 i=1
σ
has the chi-square distribution with n degrees of freedom. In particular, the square of the standard normal r.v. Z has the chi-square distribution with one degree of freedom.
© 2002 by Chapman & Hall/CRC
continuous probability distributions
365
Figure B.2. Chi-Square Densities. Let us mention in passing a distribution which is closely related to the chi-square distribution, although it does not belong to the gamma family. If X1 , X2 , . . . , Xn are independent r.v.’s having normal distributions with variance 1 and means μ1 , μ2 , . . . , μn , respectively, then the r.v. n
Xi2
i=1
has the noncentral chi-square distribution with n degrees of freedom and noncentrality parameter λ = ni=1 μ2i . The importance of the chi-square distribution (as well as that of the t and F distributions which are discussed in the sequel) follows from the following fact. Given some random data x1 , x2 , . . . , xn , even if we know the type of the probability distribution from which the data are drawn, we very rarely know this distribution exactly. In almost all situations of practical interest, values of the parameters of the distribution are unknown. Hence, the inference about the data begins from estimating summary characteristics, such as mean and variance, of the underlying
© 2002 by Chapman & Hall/CRC
366
appendix b. a brief introduction to stochastics
probability distribution. A natural sample counterpart of the mean is the average, or the sample mean, of the data x ¯=
1 (x1 + x2 + · · · + xn ), n
(B.60)
while that of the variance is the sample variance s2 =
n 1 (xi − x ¯)2 ; n − 1 i=1
(B.61)
the reason for dividing the last sum by n − 1 and not, seemingly more naturally, by n will become apparent later. Note that prior to performing an experiment, i.e., prior to observing the data, the sample mean and sample variance are themselves random variables, since they are functions of random variables. We shall prove in the next section that, whatever the underlying distribution of the random data, the sample mean seen as a random variable itself has the following properties:
¯ =E E(X) and
X1 + X 2 + · · · + Xn n
¯ = Var X1 + X2 + · · · + Xn Var(X) n
=μ
=
σ2 , n
(B.62)
(B.63)
where μ is the mean and σ 2 is the variance of the parent distribution of the data Xi , i = 1, 2, . . . , n. For the sample variance, one can show that E(S 2 ) = σ 2 , as one would wish it to be (hence the factor 1/(n − 1), not 1/n, in the definition of S 2 ). Unfortunately, no result of corresponding generality holds for the variance of S 2 . If, however, the data come from the normal distribution with mean μ and variance σ 2 , then the random variable V =
(n − 1)S 2 σ2
(B.64)
can be shown to have chi-square distribution with n − 1 degrees of freedom. Hence, we have then Var(S 2 ) =
© 2002 by Chapman & Hall/CRC
2σ 4 . n−1
continuous probability distributions
B.6.3
367
t and F Distributions
The other two probability distributions strictly connected with infer¯ and S 2 are the t and F distributions. The (Student’s) t ences about X distribution with ν degrees of freedom is given by the p.d.f.
x2 Γ[(ν + 1)/2] √ 1+ f (x) = Γ(ν/2) πν ν
−(ν+1)/2
, −∞ < x < ∞.
(B.65)
If the r.v. Z is standard normal and the r.v. V is chi-square distributed with ν degrees of freedom, and if Z and V are independent, then the r.v.
Z T = V /ν
(B.66)
has t distribution with ν degrees of freedom. The last result is very useful in the context of random sampling from a normal distribution with unknown mean and variance. Namely, since
Z=
¯ −μ X √ σ/ n
(B.67)
is standard normal and can be shown to be statistically independent of V given by (B.64), it follows from (B.66) that the r.v.
¯ −μ X √ S/ n
(B.68)
has t distribution with n − 1 degrees of freedom. The fact that (B.68) has known distribution enables one to test certain hypotheses about the mean μ (see Section B.13).
© 2002 by Chapman & Hall/CRC
368
appendix b. a brief introduction to stochastics
Figure B.3. t Density With 3 Degrees Of Freedom. It is sometimes the case that we are given two independent sequences of data drawn from normal distributions with unknown means μ1 , μ2 and variances σ12 , σ22 , respectively. The question is whether the underlying distributions are indeed different. The equality of means can be tested using our knowledge of the t distribution. In turn, the equality of variances can be tested using the F distribution with ν1 and ν2 degrees of freedom, which is defined as the distribution of the following ratio: U/ν1 , V /ν2
(B.69)
where U and V are independent chi-square distributed r.v.’s with ν1 and ν2 degrees of freedom, respectively. Now, if S12 and S22 are the sample variances of the first and second sequence of data, respectively, and if the true variances σ12 and σ22 are equal, then, by (B.64), the ratio S12 /S22 is F distributed. More precisely, the ratio S12 /S22 has F distribution with n1 − 1 and n2 − 1 degrees of freedom, where ni , i = 1, 2, denotes the number of data in the ith sequence. The p.d.f. of the F distribution
© 2002 by Chapman & Hall/CRC
continuous probability distributions
369
with ν1 and ν2 degrees of freedom is given by f (x) =
xν1 /2−1 Γ((ν1 + ν2 )/2)(ν1 /ν2 )ν1 /2 Γ(ν1 /2)Γ(ν2 /2) (1 + ν1 x/ν2 )(ν1 +ν2 )/2
(B.70)
for x > 0 and zero elsewhere.
Figure B.4. F Densities With 10, 4 And 10, 10 Degrees of Freedom. If U in the numerator of (B.69) has noncentral chi-square distribution with ν1 degrees of freedom and noncentrality parameter λ, if V is the same as before, and U and V are independent, then the ratio (B.69) is said to have noncentral F distribution with ν1 and ν2 degrees of freedom and noncentrality parameter λ.
B.6.4
Weibull Distribution
It was mentioned in this section that, if certain assumptions are satisfied, the lifetimes (or times to a failure) can be modelled by the exponential distribution. Under different circumstances, for example, models based on the lognormal distribution can be used. However, it happens that,
© 2002 by Chapman & Hall/CRC
370
appendix b. a brief introduction to stochastics
given some observed lifetimes, one cannot decide which of the models is adequate. Such a situation calls for a distribution which for some values of its parameters would look similarly to an exponential curve, and would resemble a lognormal curve for other values of the parameters. This sort of flexibility is provided by the Weibull distribution whose p.d.f. is given by (B.71) f (x) = cβ −1 (x/β)c−1 exp[−(x/β)c ], x > 0 and zero elsewhere, c > 0 and β > 0. Clearly, the Weibull distribution reduces to the exponential when c = 1. For c > 1, it becomes similar to the lognormal distribution. One can show that the random variable Y = (x/β)c is exponential with mean one.
B.7
Laws of Large Numbers
Let us now consider the set of n data drawn from some probability distribution. Prior to the experiment which yields the data, they can be treated as a sequence of n independent and identically distributed (i.i.d.) random variables X1 , X2 , . . . , Xn . Such sequence will be labeled as a random sample of size n. Suppose that the mean and variance of the underlying probability distribution are μ and σ 2 , respectively. Otherwise, the probability distribution is unknown. We shall find the mean and variance of the sample mean (B.60) of the random sample. It is easy to see that E(X1 + X2 + · · · + Xn ) E(X1 ) + E(X2 ) + · · · + E(Xn ) = n n μ + μ + ··· + μ = μ. n
μx¯ = =
In this derivation, we have not used independence or the fact that all the r.v.’s have the same distribution, only the fact that they all have the ¯ is an unbiased estimator of μ. same (finite) mean. We say that X ¯ Next we shall derive the variance of X: ¯ − μ)2 ] σx2¯ = E[(X
= E =
(X1 − μ) (X2 − μ) (Xn − μ) + + ··· + n n n
n E[(Xi − μ)2 ] i=1
© 2002 by Chapman & Hall/CRC
n2
'
2 (
(X1 − μ)(X2 − μ) + terms like E . n2
laws of large numbers
371
Now, by independence, the expectation of the cross-product terms is zero: E[(X1 − μ)(X2 − μ)] =
∞
−∞
(x1 − μ)(x2 − μ)f (x1 )f (x2 )dx1 dx2
= E(X1 − μ)E(X2 − μ) = 0 (the argument for discrete distributions is analogous). Thus, we have σ2 . n We note that in the above derivation the fact that the Xi ’s are identically distributed has been superfluous. Only the facts that the r.v.’s are independent and have the same μ and σ 2 have been needed. The ¯ about the true mean μ decreases as n property that the variability of X increases is of key importance in experimental science. We shall develop this notion further below. Let us begin by stating the celebrated Chebyshev’s inequality. If Y is any r.v. with mean μy and variance σy2 , then for any ε > 0 σx2¯ =
σy2 . (B.72) ε2 As a practical approximation device, it is not a particularly useful inequality. However, as an asymptotic device, it is invaluable. Let us ¯ Then, (B.72) gives us consider the case where Y = X. P (|Y − μy | > ε) ≤
2 ¯ − μ| > ε) ≤ σ , P (|X nε2
(B.73)
or equivalently
2 ¯ − μ| ≤ ε) > 1 − σ . P (|X (B.74) nε2 Equation (B.74) is a form of the weak law of large numbers. The WLLN tells us that if we are willing to take a sufficiently large sample, then ¯ will be arbitrarily we can obtain an arbitrarily large probability that X close to μ. In fact, even a more powerful result, the strong law of large numbers, is available. In order to make the difference between the WLLN and SLLN more transparent, let us denote the sample mean based on a sample of ¯ n , so that the dependence of X ¯ on n be emphasized. Now we size n by X can write the WLLN in the following way
¯ n − μ| ≤ ε) = 1 lim P (|X
n→∞
© 2002 by Chapman & Hall/CRC
(B.75)
372
appendix b. a brief introduction to stochastics
for each positive ε. On the other hand, the SLLN states that ¯ n − μ| = 0) = 1. P ( lim |X n→∞
(B.76)
¯ n being closed to μ Loosely speaking, in the WLLN, the probability of X for only one n at a time is claimed, whereas in the SLLN, the closeness ¯ n to μ for all large n simultaneously is asserted with probability of X one. The rather practical advantage of the SLLN is that if g(x) is some function, then ¯ n ) − g(μ)| = 0) = 1. (B.77) P ( lim |g(X n→∞
The WLLN and SLLN are particular cases of convergence in probability and almost sure convergence of a sequence of r.v.’s, respectively. Let Y1 , Y2 , . . . , Yn , . . . be an infinite sequence of r.v.’s. We say that this sequence of r.v.’s converges in probability or stochastically to a random variable Y if lim P (|Yn − Y | > ε) = 0 n→∞
for each positive ε. We say that the sequence Y1 , Y2 , . . . , Yn , . . . converges almost surely or converges with probability one if P ( lim |Yn − Y | = 0) = 1. n→∞
For brevity, the almost sure convergence is also called the a.s. conver¯ n and gence. In the case of the laws of large numbers, Yn is equal to X Y = μ, that is, the limit is a real number or, equivalently, an r.v. which assumes only one value with probability one.
B.8
Moment-Generating Functions
Just as we defined the mean and variance of an r.v. X, we can define the kth moment of X, E(X k ) for any k = 1, 2, . . . ,
(B.78)
and the kth moment about the mean μ of X, E[(X − μ)k ) for any k = 2, 3, . . . .
(B.79)
For example, the kth moment of a continuous r.v. X with p.d.f. f (x) can be computed as 1 k
E(X ) = 1
∞
−∞
xk f (x)dx.
(B.80)
Note that an r.v. can have no or just a few finite moments. In particular, in the case of a Cauchy r.v., already the first moment does not exist.
© 2002 by Chapman & Hall/CRC
moment-generating functions
373
Of course, moments of higher orders k provide additional information about the r.v. under scrutiny. For instance, it is easily seen that if an r.v. X is symmetric about its mean μ, then E[(X − μ)2r−1 ) = 0 for each integer r (provided these moments exist). It should be mentioned that the kth moment is a well-defined concept for any positive k, not only for integer k’s. However, both in this section and in most of this book, our considerations are confined to kth moments with integer k. Let us now define the moment-generating function MX (t) for an r.v. X via MX (t) = E(etX ). (B.81) Assuming that differentiation with respect to t commutes with expectation operator E, we have (t) = E(XetX ) MX (t) = E(X 2 etX ) MX .. . (k)
MX (t) = E(X (k) etX ). Setting t equal to zero, we see that (k)
MX (0) = E(X k ).
(B.82)
Thus, we see immediately the reason for the name moment-generating function (m.g.f.). Once we have obtained MX (t), we can compute moments of arbitrary order (assuming they exist) by successively differentiating the m.g.f. and setting the argument t equal to zero. As an example of this application, let us consider an r.v. distributed according to the binomial distribution with parameters n and p. Then, MX (t) = =
n 0 n 0
tx
e
n x
n x
px (1 − p)n−x
(pet )x (1 − p)n−x .
Now recalling the binomial identity n 0
n x
ax bn−x = (a + b)n ,
we have MX (t) = [pet + (1 − p)]n .
© 2002 by Chapman & Hall/CRC
(B.83)
374
appendix b. a brief introduction to stochastics
Next, differentiating with respect to t, we have MX (t) = npet [pet + (1 − p)]n−1 .
(B.84)
Then, setting t equal to zero, we have (0) = np. E(X) = MX
(B.85)
Differentiating (B.84) again with respect to t and setting t equal to zero, we have (0) = np + n(n − 1)p2 . (B.86) E(X 2 ) = MX In order to calculate the variance, it suffices to recall that for any r.v. X we have (B.87) Var(X) = E(X 2 ) − [E(X)]2 . Thus, for the binomial X Var(X) = np(1 − p).
(B.88)
Let us find also the m.g.f. of a normal variate with mean μ and variance σ2.
MX (t) = = = =
∞ 1 1 √ etx exp − 2 (x − μ)2 dx 2σ 2πσ −∞
∞ 1 1 √ exp − 2 (x2 − 2μx − 2σ 2 tx + μ2 ) dx 2σ 2πσ −∞
∞ 1 1 √ exp − 2 (x2 − 2x(μ + tσ 2 ) + μ2 ) dx 2σ 2πσ −∞
∞ 1 t2 σ 2 1 ∗ 2 √ exp − 2 (x − μ ) dx exp tμ + , 2σ 2 2πσ −∞
where μ∗ = μ + tσ 2 . But recognizing that the integral is simply equal to √ 2πσ, we see that the m.g.f. of the normal distribution is given by
t2 σ 2 MX (t) = exp tμ + 2
.
(B.89)
It is now easy to verify that the mean and variance of the normal distribution are μ and σ 2 , respectively. The possible mechanical advantages of the m.g.f. are clear. One integration (summation) operation plus k differentiations yield the first k moments of a random variable. However, the moment-generating aspect
© 2002 by Chapman & Hall/CRC
375
moment-generating functions
of the m.g.f. pales in importance to some of its properties relating to the summation of independent random variables. Let us suppose, for example, that we have n independently distributed r.v.’s X1 , X2 , . . . , Xn with m.g.f.’s M1 , M2 , . . . , Mn , respectively. Suppose that we wish to investigate the distribution of the r.v. Y = c1 X1 + c2 X2 + · · · + cn Xn , where c1 , c2 , . . . , cn are fixed constants. Let us consider using the momentgenerating functions to achieve this task. We have MY (t) = E[exp t(c1 X1 + c2 X2 + · · · + cn Xn )]. Using the independence of X1 , X2 , . . . , Xn we may write MY (t) = E[exp tc1 X1 ]E[exp tc2 X2 ] · · · E[exp tcn Xn ] = M1 (c1 t)M2 (c2 t) · · · Mn (cn t).
(B.90)
Given the density (or probability) function, we know what the m.g.f. will be. But it turns out that, under very general conditions, the same is true in the reverse direction; namely, if we know MX (t), we can compute a unique density (probability) function that corresponds to it. The practical implication is that if we find a random variable with an m.g.f. we recognize as corresponding to a particular density (probability) function, we know immediately that the random variable has the corresponding density (probability) function. Thus, in many cases, we are able to use (B.90) to give ourselves immediately the distribution of Y . Consider, for example, the sum Y = X 1 + X2 + · · · + X n of n independent binomially distributed r.v.’s with the same probability of success p and the other parameter being equal to n1 , n2 , . . ., nn , respectively. Thus, the moment-generating function for Y is MY (t) = [pet + (1 − p)]n1 [pet + (1 − p)]n2 · · · [pet + (1 − p)]nn = [pet + (1 − p)]n1 +n2 +···+nn .
We note that, not unexpectedly, this is the m.g.f. of a binomial r.v. with parameters N = n1 + n2 + · · · + nn and p. Given (B.89), it is straightforward to give the corresponding result on the distribution of the sum (or a linear combination) of n independent normal r.v.’s with means μ1 , μ2 , . . . , μn and variances σ12 , σ22 , . . . , σn2 , respectively. Let us
© 2002 by Chapman & Hall/CRC
376
appendix b. a brief introduction to stochastics
consider also the case of gamma distributed random variables. The m.g.f. of a gamma variate with parameters α and β can be computed in the following way
M (t) = =
∞ 1 etx xα−1 e−x/β dx Γ(α)β α 0
∞ 1 xα−1 e−x(1−βt)/β dx; Γ(α)β α 0
now this integral is finite only for t < 1/β and substituting y = x(1 − βt)/β yields
∞ 1 1 y α−1 e−y dy Γ(α) 0 1 − βt
α 1 1 = f or t < , 1 − βt β
α
M (t) =
(B.91)
where the last equality follows from the form of the p.d.f. of the gamma distribution with parameters α and 1. In particular, the m.g.f. for a chi-square r.v. with ν degrees of freedom has the form
M (t) =
1 1 − 2t
ν/2
= (1 − 2t)−ν/2 , t < 1/2.
(B.92)
It readily follows from (B.90) and (B.92) that the sum of n independent r.v.’s having chi-square distributions with ν1 , ν2 , . . . , νn degrees of freedom, respectively, has chi-square distribution with ν1 + ν2 + · · · + νn degrees of freedom.
B.9
Central Limit Theorem
We are now in a position to derive one version of the central limit theorem. Let us suppose we have a sample X1 , X2 , . . . , Xn of independently and identically distributed random variables with mean μ and variance σ 2 . We wish to determine, for n large, the approximate distribution of the sample mean ¯ = X1 + X 2 + · · · + X n . X n We shall examine the distribution of the sample mean when put into the standard form. Let ¯ −μ X −μ X −μ X −μ X √ = 1√ + 2√ + · · · + n√ . Z= σ/ n σ n σ n σ n
© 2002 by Chapman & Hall/CRC
377
conditional density functions
Now, utilizing the independence of the Xi ’s and the fact that they are identically distributed with the same mean and variance, we can write MZ (t) = E(etZ ) = )
=
'
=
'
E exp t
i=1
E exp t +
=
n
X1 − μ √ σ n
Xi − μ √ σ n
(
(*n
,n
X1 − μ t2 (X1 − μ)2 1 E 1+t √ + +o σ n 2 σ2n n
t2 1+ 2n
n
2 /2
→ et
as n → ∞.
(B.93)
But (B.93) is the m.g.f. of a normal distribution with mean zero and variance one. Thus, we have been able to show that the distribution of the sample mean of a random sample of n i.i.d. random variables with mean μ and variance σ 2 becomes “close” to the normal distribution with mean μ and variance σ 2 /n as n becomes large. Clearly, the CLT offers enormous conceptual and, in effect, computational simplifications. First and foremost, if the sample size is not too small, it enables us to approximate the distribution of the sample mean by a normal distribution regardless of the parent distribution of the random sample. Moreover, even if we knew that the parent distribution ¯ is, for instance, lognormal, computation of the exact distribution of X would be an enormous task. No wonder that the CLT is widely used in experimental sciences in general and in statistical process control in particular.
B.10
Conditional Density Functions
Let us return to questions of interdependence between random variables and consider briefly conditional distribution of one random variable given that another random variable has assumed a fixed value. If two random variables X and Y are discrete and have a joint probability function (B.22), then, by (B.7) and (B.25), the conditional probability function of the r.v. X, given that Y = y, has the form f (xi |y) = P (X = xi |Y = y) =
© 2002 by Chapman & Hall/CRC
f (xi , y) , fY (y)
(B.94)
378
appendix b. a brief introduction to stochastics
where xi runs over all possible values of X, and y is a fixed value from the range of Y . It is easy to verify that the conditional p.f. is indeed a probability function, i.e., that f (xi |y) > 0 for all xi and
f (xi |y) = 1.
xi
Let us now suppose that r.v.’s X and Y are continuous and have joint p.d.f. (B.43). When deriving a formula for the conditional density function of X given Y = y, some caution is required since both random variables assume any particular value with probability zero. We shall use a type of a limit argument. Writing the statement of joint probability for small intervals in X and Y , we have by the multiplicative rule P (x < X ≤ x + ε ∩ y < X ≤ y + δ) P (y < Y ≤ y + δ)P (x < X ≤ x + ε|y < Y ≤ y + δ). Now, exploiting the assumption of continuity of the density function, we can write
x+ε y+δ x
f (x, y)dydx =
y
y+δ y
fY (y)dy
x+ε x
fX|y (x)dx
= εδf (x, y) = δfY (y)εfX|y (x). Here, we have used the terms fY and fX|y to denote the marginal density function of Y , and the conditional density function of X given Y = y, respectively. This gives us immediately (provided fY (y) > 0 for the given y) f (x, y) . (B.95) fX|y (x) = fY (y) Note that this is a function of the argument x, whereas y is fixed; y is the value assumed by the random variable Y .
B.11
Random Vectors
B.11.1
Introduction
Sometimes we want to measure a number of quality characteristics of a single object simultaneously. It may be the case that the characteristics
© 2002 by Chapman & Hall/CRC
random vectors
379
of interest are independent one of another, and can, therefore, be considered separately. More often, the characteristics are somehow related one to another, although this relationship cannot be precisely described. What we observe is not a set of independent random variables but a random vector whose elements are random variables somehow correlated one with another. Let X = [X1 , X2 , . . . , Xp ] be such a random vector of any fixed dimension p ≥ 1. By analogy with the univariate and two-variate cases, we define the cumulative distribution function of X as F (X ≤ x) = P (X ≤ x) = P (X1 ≤ x1 , . . . , Xp ≤ xp ),
(B.96)
where x = [x1 , . . . , xp ] . All random vectors considered in this section will be assumed to be of continuous type. We can, therefore, define the (joint) probability density function f (x) via F (x) =
x1 −∞
···
xp −∞
f (u)du1 . . . dup ,
(B.97)
where u = [u1 , . . . , up ] . Of course, f (x) = For any p-dimensional set A,
∂ p F (x) . ∂x1 . . . ∂xp
···
P (XA) =
(B.98)
A
f (x)dx1 . . . dxp ,
(B.99)
where the integration is over the set A. By analogy with the two-variate case, we can define marginal distribution of a subvector of X. Consider the partitioned random vector X = [X1 , X2 ] , where X1 has k elements and X2 has p − k elements, k < p. The function F (x1 ) = P (X1 ≤ x1 ) = F (x1 , . . . , xk , ∞, . . . , ∞),
(B.100)
where x1 = [x1 , . . . , xk ] , is called the marginal c.d.f. of X1 and the function f1 (x1 ) =
∞ "
−∞
··· #$
∞
−∞
p−k times
%
f (x1 , x2 )dxk+1 . . . dxp ,
(B.101)
where f (x) = f (x1 , x2 ), is called the marginal p.d.f. of X1 . Marginal distribution of any other subvector of X can be defined similarly. Also,
© 2002 by Chapman & Hall/CRC
380
appendix b. a brief introduction to stochastics
analogously as in Section B.10, for a given value of X2 , X2 = x2 , the conditional p.d.f. of X1 can be defined as f (x1 |X2 = x2 ) =
f (x1 , x2 ) , f2 (x2 )
(B.102)
where f2 (x2 ) is the marginal p.d.f. of X2 , which is assumed to be positive at x2 . If the random subvectors X1 and X2 are stochastically independent, then f (x) = f1 (x1 )f2 (x2 ). (B.103) Recall that random variables X1 , X2 , . . . , Xp , i.e., the elements of a random vector X, are said to be (mutually) stochastically independent if and only if (B.104) f (x) = f (x1 )f (x2 ) · · · f (xp ), where f (x) is the joint p.d.f. of X and f (xi ) is the marginal p.d.f. of Xi , i = 1, 2, . . . , p.
B.11.2
Moment Generating Functions
For a random vector X of any fixed dimension p, with p.d.f. f (x), the mean (or expectation) of a scalar-valued function g(x) is defined as E[g(X)] =
∞ −∞
···
∞ −∞
g(x)f (x)dx1 . . . dxp .
(B.105)
More generally, the mean of a matrix G(X) = (gij (X)), that is, the mean of the matrix-valued function G(X) of the random vector X, is defined as the matrix E[G(X)] = (E[gij (X)]). In particular, the vector μ = E(X) is the mean vector of X with components μi =
∞
© 2002 by Chapman & Hall/CRC
−∞
···
∞ −∞
xi f (x)dx1 . . . dxp , i = 1, . . . , p.
(B.106)
random vectors
381
The properties of (B.106) are direct generalizations of those for the univariate and two-variate cases. For example, for any matrix of constants A(q×p) and any constant vector b(q×1) , E(AX + b) = AE(X) + b.
(B.107)
The mean vector plays the same role as the mean in the univariate case. It is the “location” parameter of X. The “spread” of X is now characterized by the covariance, or dispersion, matrix which is defined as the matrix (B.108) Σ = V(X) = E[(X − μ )(X − μ ) ], where μ is the mean vector of X. Note that the mean vector and covariance matrix reduce to the mean and variance, respectively, when p = 1. The following properties of the covariance matrix Σ of order p are simple consequences of its definition: σij = Cov(Xi , Xj ) if i = j
(B.109)
σii = Var(Xi ), i = 1, . . . , p,
(B.110)
and where σij is the (i, j)th element of Σ; Σ = E(XX ) − μμ ;
(B.111)
Var(a X) = a Σa
(B.112)
for any constant vector a of dimension p; note that the left hand side of (B.112) cannot be negative and, hence, that Σ is positive semi-definite; V(AX + b) = AΣA
(B.113)
for any constant matrix A(q×p) and any constant vector b(q×1) . If we are given two random vectors, X of dimension p and Y of dimension q, we can define the p × q matrix Cov(X, Y) = E[(X − μ )(Y − ν ) ],
(B.114)
where μ = E(X) and ν = E(Y). Cov(X, Y) is called the covariance between X and Y. The following properties of the covariance between two random vectors can easily be proved: Cov(X, X) = V(X);
© 2002 by Chapman & Hall/CRC
(B.115)
382
appendix b. a brief introduction to stochastics Cov(X, Y) = Cov(Y, X) ;
(B.116)
if X1 and X2 are random vectors of the same dimension, then Cov(X1 + X2 , Y) = Cov(X1 , Y) + Cov(X2 , Y);
(B.117)
if X and Y are of the same dimension, then V(X + Y) = V(X) + Cov(X, Y) + Cov(Y, X) + V(Y);
(B.118)
for any constant matrices A and B of orders r × p and s × q, respectively, Cov(AX, BY) = ACov(X, Y)B ;
(B.119)
finally, if X and Y are independent, then Cov(X, Y) is the zero matrix. All the above properties reduce to known properties of covariances between random variables when p = q = 1. Let X = [X1 , . . . , Xp ] be a random vector with mean vector μ . A moment of order k of the variables Xi1 , Xi2 , . . . , Xim , m ≤ p, is defined as E[(Xi1 − μi1 )j1 (Xi2 − μi2 )j2 · · · (Xim − μim )jm ],
(B.120)
where j1 , j2 , . . . , jm are positive integers such that j1 + j2 + · · · + jm = k. Note that many different moments of the r.v.’s Xi1 , Xi2 , . . . , Xim have the same order k. As in the univariate case, calculations of higher order moments are usually greatly facilitated by using the moment generating functions. The moment generating function MX (t) for X is defined via
MX (t) = E(et X ),
(B.121)
where, as usual, t X denotes the inner product of t and X. The m.g.f. is, thus, a scalar function of the vector t of dimension p. Assuming that differentiation with respect to elements of t commutes with expectation operator E, we easily see that +
E(X1j1 X2j2
· · · Xpjp )
=
,
∂ j1 +j2 +···+jp j
∂tj11 ∂tj22 · · · ∂tpp
MX (t)
(B.122) t=0
when this moment exists (in (B.120), some ji ’s may be equal to zero).
© 2002 by Chapman & Hall/CRC
383
random vectors
B.11.3
Change of Variable Technique
The change of variable technique provides a powerful means for computing the p.d.f. of a transformed random vector given its original probability density. We shall use this technique in the next subsection to prove an important property of multinormal random vectors. Let X be a random vector of dimension p having density f (x), which is positive for x from a set A and is zero elsewhere (in particular, A may be equal to the whole space of vectors x, i.e., f (x) may be positive for all values of the vector x). Let Y = u(X) define a one-to-one transformation that maps set A onto set B, so that the vector equation y = u(x) can be uniquely solved for x in terms of y, say, x = w(y). Then the p.d.f. of Y is f (w(y))J if yB,
(B.123)
and is zero otherwise, where J = absolute value of |J|
(B.124)
and |J| is the determinant of the p × p matrix of partial derivatives of the inverse transformation w(y),
J=
∂xi ∂yj
i = 1, . . . , p j = 1, . . . , p
.
(B.125)
|J| is called the Jacobian of the transformation w(y). The above result is a classical theorem of vector calculus and, although it is not hard to be proved, its proof will be skipped. We shall see with what ease the change of variable technique can be used.
B.11.4
Normal Distribution
The random vector X of dimension p is said to have multivariate normal (or p-dimensional multinormal or p-variate normal ) distribution if its p.d.f. is given by 1 −1 f (x) = |2πΣ|−1/2 exp{− (x − μ ) Σ (x − μ )}, 2
(B.126)
where μ is a constant vector and Σ is a constant positive definite matrix. It can be shown that μ and Σ are the mean vector and covariance matrix of the random vector X, respectively. For short, we write that X
© 2002 by Chapman & Hall/CRC
384
appendix b. a brief introduction to stochastics
is N (μ , Σ) distributed. Comparing (B.50) and (B.126) we see that the latter is a natural multivariate extension of the former. Note that if the covariance matrix Σ is diagonal, Σ = diag(σ11 , σ22 , . . . , σpp ), the p.d.f. (B.126) can be written as the product f (x) =
p
1 (2πσii )−1/2 exp{− (xi − μi )σii−1 (xi − μi )}. 2 i=1
(B.127)
Thus, the elements of X are then mutually independent normal random variables with means μi and variances σii , i = 1, 2, . . . , p, respectively. If the random vector X is multivariate normal, then the property that its elements are uncorrelated one with another (i.e., that Cov(Xi , Xj ) = 0, i = j) implies their mutual independence. Let X be a p-variate normal random vector with mean vector μ and covariance matrix Σ, and let Y = Σ−1/2 (X − μ ),
(B.128)
where Σ−1/2 is the square root of the positive definite matrix Σ−1 . Then Y is the p-variate normal random vector with zero mean vector and covariance matrix equal to the identity matrix, I. In other words, Y is the random vector whose elements Y1 , Y2 , . . . , Yp are mutually independent standard normal random variables. To see that it is indeed the case, observe first that X = Σ1/2 Y + μ . (B.129) Transformation (B.129) defines the inverse transformation to the transformation (B.128) and its Jacobian (B.125) is |J| = J = |Σ1/2 | = |Σ|1/2 ,
(B.130)
where J is defined by (B.124), since Σ is positive definite and |Σ| = |Σ1/2 Σ1/2 | = |Σ1/2 Σ1/2 | (see the properties of determinants). Hence, upon noting that −1
(x − μ ) Σ
(x − μ ) = y y
(B.131)
and using (B.126) and (B.130), the p.d.f. (B.123) assumes the form p i=1
© 2002 by Chapman & Hall/CRC
(2π)−1/2 e−yi /2 . 2
385
random vectors
Thus, the proof is accomplished. Transformation (B.128) is very useful in practice. In principle, it enables one to replace a multivariate problem by a sequence of much simpler univariate problems. Usually, however, we neither know μ nor we know Σ and, hence, some caution is needed here. By (B.128), −1
(X − μ ) Σ
(X − μ ) =
p
Yi2 ,
(B.132)
i=1
and it follows from the properties of the chi-square distribution that the random variable −1 (X − μ ) Σ (X − μ ) (B.133) has chi-square distribution with p degrees of freedom.
B.11.5
Quadratic Forms of Normal Vectors
Quadratic forms of normal random vectors play an important role in statistical inference. In Chapter 6, we use them for testing certain hypotheses about the so-called regression models. Let us state first a version of the Cochran’s theorem. Suppose X = [X1 , X2 , . . . , Xp ] is a vector of p independent N (0, 1) (i.e., standard normal) random variables. Assume that the sum of squares X X =
p
Xi2
i=1
can be decomposed into k quadratic forms, X X =
k
Qj ,
j=1
where Qj is a quadratic form in X with matrix Aj which has rank rj , Qj = X Aj X, j = 1, 2, . . . , k. Then any of the following three conditions implies the other two: (i) the ranks rj , j = 1, 2, . . . , k, add to p; (ii) each of the quadratic forms Qj has chi-square distribution with rj degrees of freedom; (iii) all the quadratic forms Qj are mutually stochastically independent random variables. We shall now prove a closely related result. Namely, if X is a vector of p independent N (0, 1) random variables and A is a symmetric and
© 2002 by Chapman & Hall/CRC
386
appendix b. a brief introduction to stochastics
idempotent matrix of order p, then the quadratic form X AX has χ2r distribution, where r is the rank of A. By the spectral decomposition of A and the fact that all eigenvalues of A, λi , are equal to 0 or 1 (see Section A.7 of the Appendix), X AX =
X γ i γ i X,
i I
where γ i is a standardized eigenvector of A corresponding to λi , iI, and I is the set of indices of those eigenvalues λi which are equal to 1. We can write X AX = Yi2 , i I
X γ
γ i X,
iI. Each Yi is the linear combination of where Yi = i = standard normal r.v.’s and, hence, is a normally distributed r.v. with mean 0 and variance γ i γ i = 1. Furthermore, Cov(Yi Yj ) = E(X γ i X γ j ) = E(γ i XX γ j )
= γ i E(XX )γ j = 0 if i = j,
since E(XX ) = I and γ i γ j = 0 if i = j. Thus, the Yi ’s are uncorrelated one with another and, since they are normally distributed, they are mutually independent. Finally, therefore, X AX is chi-square distributed with the number of degrees of freedom equal to the number of elements of the set I. By (ii) of Section A.7 and (A.48), it follows that this number is equal to the rank of A.
B.11.6
Central Limit Theorem
Let us give the following multivariate extension of the central limit theorem. Let X1 , X2 , . . . , Xn be a random sample of n independent random vectors having identical p-variate distribution with mean vector μ and covariance matrix Σ. Let ¯ = X1 + X 2 + · · · + X n X n
(B.134)
be the sample mean vector, and let Z=n
1/2
−1/2
Σ
¯ − μ) = n−1/2 Σ−1/2 (X
n i=1
© 2002 by Chapman & Hall/CRC
(Xi − μ).
(B.135)
387
poisson process
Then, with n increasing, the distribution of Z approaches the p-variate normal distribution with zero mean vector and covariance matrix I. The proof of this theorem is similar to that for the univariate case.
B.12
Poisson Process
The Poisson experiment, as described in Section B.4, was assumed to take place in a time interval of fixed length T . In fact, there is no need to restrict the experiment to any fixed time interval. To the contrary, already the examples discussed in Subsections B.4.3 and B.6.2 (following (B.58)) indicate that we are interested in continuing the experiment over time and observing the evolution of the process of interest. For example, if the incoming telephone calls constitute the Poisson experiment, we observe the stochastic process X(t) as time t elapses, where X(t) is the number of telephone calls up to current instant t. Also, in the context of quality control, items returned due to unsatisfactory performance usually form a Poisson process. Let X(t) denote the number of occurrences of an event from time 0 to time t, t ≥ 0. Let X(0) = 0. The process X(t) is said to be the Poisson process having rate λ, λ ≥ 0, if the three postulates defining a Poisson experiment, (B.34)-(B.36), are fulfilled for each t ≥ 0, and if P (k in [t1 , t1 + s]) = P (k in [t2 , t2 + s])
(B.136)
for all nonnegative t1 , t2 , and s. The fourth postulate implies that the the rate λ does not change with time. Let P (k, t) = P (k in [0, t]), i.e., P (k, t) be the probability of k events up to time t. Then P (k + 1 in [0, t + ε]) = P (k + 1 in [0, t])P (0 in [t, t + ε]) + P (k in [0, t])P (1 in [t, t + ε]) + o(ε) = P (k + 1, t)(1 − λε) + P (k, t)λε + o(ε). Thus, P (k + 1, t + ε) − P (k + 1, t) o(ε) = λ[P (k, t) − P (k + 1, t)] + . ε ε Taking the limit as ε → 0, we obtain dP (k + 1, t) = λ[P (k, t) − P (k + 1, t)]. dt
© 2002 by Chapman & Hall/CRC
(B.137)
388
appendix b. a brief introduction to stochastics
Now taking k = −1, we have dP (0, t) = −λP (0, t), dt
(B.138)
since it is impossible for a negative number of events to occur. Hence P (0, t) = exp(−λt).
(B.139)
Substituting (B.139) in (B.137) for k = 0, we have dP (1, t) = λ[exp(−λt) − P (1, t)], dt
(B.140)
P (1, t) = exp(−λt)(λt).
(B.141)
and, hence, Continuing in this way for k = 1 and k = 2, we can guess the general formula of the Poisson distribution: P (k, t) =
e−λt (λt)k . k!
(B.142)
In order to verify that (B.142) satisfies (B.137), it suffices to substitute it into both sides of (B.137). Formula (B.139) gives us the probability of no event occurring up to time t. But this is equal to the probability that the first event occurs after time t and, thus, P (time of first occurrence ≤ t) = F (t) = 1 − exp(−λt).
(B.143)
F (t) is the c.d.f. of the random variable defined as the time to the first event. Corresponding p.d.f. is equal to f (t) =
dF (t) = λe−λt , dt
(B.144)
i.e., it is the exponential density with parameter β = 1/λ, as was stated in Subsection B.6.2. A slightly more involved argument proves that the inter-event times are given by the same exponential distribution as well.
B.13
Statistical Inference
B.13.1
Motivation
The most obvious aim of quality inspection is to control the proportion of nonconforming items among all manufactured ones. It is too expensive, and often too time-consuming, to examine all of the items. Thus,
we draw a sample of items from the population of all items and find out experimentally what the proportion of nonconforming items is in that particular sample. Now, the following interpretational problems arise. Assume that we know that, say, all of the 1000 items produced on a particular day were produced under the same conditions and using materials of the same quality. Still, if we drew not one but several different samples of items, we would most likely obtain different proportions of nonconforming items. Which of these experimentally obtained estimates of the true proportion of nonconforming items in the population are more informative, or more accurate, than the others? It is statistical inference which allows us to answer such questions.

Statistical inference also provides an answer to a much more important question. Most of the changes in a production process are unintended and, hence, unknown in advance. As a matter of fact, the main reason for implementing statistical process control is to discover these changes as quickly as possible. That is, we should never assume that no changes have occurred but, on the contrary, we should keep asking whether the samples we obtain for scrutiny do indeed come from the same population, characterized by the same, constant properties. If we are given several samples of items produced during one day, our main task is to verify whether the daily production can be considered as forming one homogeneous population or whether it is formed of at least two populations with different properties, in particular, with different defective rates. We are able to solve this problem using statistical means.

Acceptance sampling is the most primitive stage of quality control. But the problems outlined above are typical for the whole field of statistical process control for quality. The most immediate aim of SPC is to systematically verify that both the mean value of a parameter of interest and the variability of the parameter about this mean do not change in time. In other words, the problem is to verify that samples of the items measured come from one population, described by a fixed mean value and variance. Note that it does not suffice to take samples of the items and measure their sample means and sample variances. From sample to sample, these measurements will give different results even if the production process is stable, that is, if the samples come from the same probability distribution. Thus, given different sample means and variances corresponding to different samples, we have to recognize whether these values are typical for a stable production process or whether they point to a change in the process.

In what follows, we shall focus our attention on some basics of statistical inference about means and variances. We shall begin by getting some better insight into the properties of the sample mean and variance. Most of the considerations will be confined to the case of samples drawn from normal populations. It should be emphasized, however, that our considerations also lay the groundwork for statistical analysis of other parameters and other probability distributions in general.
B.13.2 Maximum Likelihood
Let us consider a random sample of n independent random variables X1, X2, . . . , Xn with common p.f. or p.d.f. f(x). In other words, we can say that the random sample comes from a population characterized by a probability distribution f(x). Constructing, for example, the sample mean X̄ of the sample does not require any prior knowledge about the parent distribution f(x). Moreover, provided only that f(x) has finite mean and variance, we know that the expected value of X̄ is equal to the true mean μ of the probability distribution f(x) and that its variance is n times smaller than that of f(x).

Another justification of using X̄ as an estimator of the unknown μ is provided by the following argument. If we cannot make any a priori assumptions about the parent distribution f(x), the only available information about it is that contained in the random sample X1, X2, . . . , Xn. Given that the following values of the random sample have been observed, X1 = x1, X2 = x2, . . . , Xn = xn, it is natural to identify the unknown probability distribution with the discrete uniform distribution (B.11) with k = n. But the expected value of this distribution is given by (B.15) and, thus, is equal to the sample mean (B.62). It is also worthwhile to note that the variance of the uniform distribution (B.11) with k = n is equal to (B.17) with k = n and, therefore, that it is equal to the sample variance (B.61) multiplied by the factor (n − 1)/n. In this way, the given argument also justifies the use of a slightly modified sample variance (we shall encounter the modified sample variance in this subsection once more).

However, if we want to gain a deeper insight into the problem of estimating μ and σ², we have to make some more assumptions about the parent distribution f(x). Usually, it is reasonable to assume that the parent distribution is of some known type, although the parameters of this distribution are unknown. For example, in the problem of controlling the mean of a production process, we usually assume that the parent distribution is normal with unknown mean μ and unknown variance σ².
In this last example, one is, of course, tempted to use X̄ as the estimator of the unknown μ and the sample variance S² as the estimator of the unknown variance σ². We shall now discuss one possible general approach to the problem of estimating unknown parameters of probability distributions which offers additional justification for using X̄ and S² (or its modification, ((n − 1)/n)S²) in the particular case mentioned.

Each of the n random variables constituting the random sample drawn from a normal distribution has the same p.d.f. given by (B.50). In order to make its dependence on the parameters μ and σ² explicit, let us denote this density by

    f(x; \theta),    (B.145)

where θ denotes the set of parameters; in our case, θ = (μ, σ²). By (B.49), the joint p.d.f. of the whole random sample of n independent random variables is given by the product

    L(x_1, x_2, \ldots, x_n; \theta) = f(x_1; \theta) f(x_2; \theta) \cdots f(x_n; \theta),    (B.146)

where −∞ < x1 < ∞, −∞ < x2 < ∞, . . ., −∞ < xn < ∞. If we have observed n values of the random sample, X1 = x1, X2 = x2, . . . , Xn = xn, the p.d.f. assumes a fixed value, given also by formula (B.146). This value, however fixed, is unknown, since we do not know the true values of the parameters θ of the distribution. In a sense, (B.146) remains a function of the parameters θ.

We can now ask the following question. Given that we have observed data x1, x2, . . . , xn, what values of the parameters μ and σ² are "most likely"? To put it otherwise, we can think of L(x1, x2, . . . , xn; θ) as a function of the unknown θ which measures how "likely" a particular θ is to have given the observed data x1, x2, . . . , xn. It is then natural to consider a θ that maximizes L(x1, x2, . . . , xn; θ) to be the most likely set of parameters of the joint density of the random sample. The function L(x1, x2, . . . , xn; θ), considered as a function of θ, is called the likelihood function. The suggested method of finding estimators of parameters θ is called the maximum likelihood method and the estimators obtained are called the maximum likelihood estimators.

In the case of the normal joint density, (B.146) takes on the form

    L(x_1, x_2, \ldots, x_n; \theta) = (2\pi\sigma^2)^{-n/2} \exp\left( -\frac{\sum_{i=1}^{n}(x_i - \mu)^2}{2\sigma^2} \right),    (B.147)

where −∞ < μ < ∞ and 0 < σ² < ∞. The maximum likelihood estimators of μ and σ² are those which maximize the likelihood function (B.147). In practice, maximization of L(x1, x2, . . . , xn; θ) is replaced by a usually simpler maximization of the logarithm of L(x1, x2, . . . , xn; θ). This can indeed be done, since both functions achieve their maxima for the same θ. The logarithm of a likelihood function is called the loglikelihood function. In our case, we have

    \ln L(x_1, x_2, \ldots, x_n; \theta) = -\frac{n \ln(2\pi\sigma^2)}{2} - \frac{\sum_{i=1}^{n}(x_i - \mu)^2}{2\sigma^2}.    (B.148)

Maximization of (B.148) with respect to μ and σ² yields the following maximum likelihood estimators of these parameters:

    \hat{\mu} = \bar{X} \quad \text{and} \quad \hat{\sigma}^2 = \frac{n-1}{n} S^2,    (B.149)

respectively. Thus, in the case considered, the sample mean has turned out to be the maximum likelihood estimator of μ. Interestingly enough, S² has to be slightly modified if one wishes it to become the maximum likelihood estimator of σ². It follows that the expected value of σ̂² is not equal to the true variance σ². From the practical point of view, however, the modification required is inessential.

The maximum likelihood approach is not confined to the normal case only. In fact, it is the most widely used approach to estimating unknown parameters of both discrete and continuous distributions. Note that in the case of a discrete parent distribution, the likelihood function L(x1, x2, . . . , xn; θ) given by (B.146) has a particularly clear interpretation. For the observed data x1, x2, . . . , xn, it is simply the probability of obtaining this set of data.
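The short Python sketch below is not from the text; it compares the closed-form normal maximum likelihood estimators (B.149) with a direct numerical maximization of the log-likelihood (B.148). The simulated data, sample size, and parameterization are illustrative assumptions.

```python
# Closed-form normal MLEs versus numerical maximization of the log-likelihood.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.normal(loc=10.0, scale=2.0, size=50)
n = x.size

# Closed-form maximum likelihood estimates (B.149).
mu_hat = x.mean()
sigma2_hat = ((n - 1) / n) * x.var(ddof=1)   # = (1/n) * sum((x - mean)^2)

# Numerical maximization: minimize the negative log-likelihood (B.148).
def neg_loglik(params):
    mu, log_sigma2 = params                   # sigma^2 parameterized on the log scale
    sigma2 = np.exp(log_sigma2)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

res = minimize(neg_loglik, x0=[0.0, 0.0])
print("closed form :", mu_hat, sigma2_hat)
print("numerical   :", res.x[0], np.exp(res.x[1]))
```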
B.13.3 Confidence Intervals
The idea of constructing the so-called confidence intervals forms a basis for evaluating control limits for control charts. This is so because the confidence intervals tell us, given the data observed, within which interval "we would expect" the true value of an unknown parameter to lie. From the point of view of statistical process control, the most important confidence intervals are those for the mean and variance of a parent distribution.

Let us consider a random sample of n independent random variables X1, . . . , Xn with common normal p.d.f. given by (B.50). Let, as usual, X̄ denote the sample mean and assume that the variance σ² is known. Then (see (B.67)),

    Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}    (B.150)

is a standard normal random variable. Let z_{α/2} be such a number that

    P(Z > z_{\alpha/2}) = \alpha/2,    (B.151)

where α is fixed, 0 < α < 1. For any given α, the corresponding z_{α/2} can be found in the statistical tables for the standard normal distribution or can be provided by any computer statistical package. Now, since the standard normal p.d.f. is symmetric about zero,

    P(Z < -z_{\alpha/2}) = \alpha/2    (B.152)

and, hence,

    P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) = 1 - \alpha.    (B.153)

Thus,

    P\left(-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right) = 1 - \alpha.    (B.154)

But the last equation can be written in the following form:

    P\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha.    (B.155)

In this way, we have obtained the confidence interval for μ: If x̄ is an observed value of the sample mean of a random sample X1, . . . , Xn from the normal distribution with known variance σ², the (1 − α)100% confidence interval for μ is

    \bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}.    (B.156)

That is, we have a (1 − α)100% chance that the given interval includes the true value of the unknown μ. The number 1 − α is called the confidence coefficient. Usually, we choose the confidence coefficient equal to 0.95.

In practice, the variance of a normal population is rarely known. We have, however, two easy ways out of the trouble. First, we know that the random variable given by (B.68) has the t distribution with n − 1 degrees of freedom. The t distribution is symmetric about zero and, thus, proceeding analogously as before, we obtain the following (1 − α)100% confidence interval for μ if σ² is not known:

    \bar{x} - t_{\alpha/2}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2}\frac{s}{\sqrt{n}},    (B.157)
where s is the observed value of the square root of the sample variance and t_{α/2} is such that P(T > t_{α/2}) = α/2 for a random variable T which has the t distribution with n − 1 degrees of freedom. For any given α, the corresponding t_{α/2} can be found in the statistical tables for the relevant t distribution or can be provided by any computer statistical package.

The second solution is an approximate one. If the sample size n is not too small, we can substitute s for σ in (B.156) and use that former confidence interval. Intuitively, for large n, we can consider s to be a sufficiently accurate estimate of σ. More rigorously, the t distribution tends to the standard normal distribution as n increases.

It is equally straightforward to construct a confidence interval for the unknown variance of a normal population. We know that the random variable V defined by (B.64) has the chi-square distribution with n − 1 degrees of freedom. Let χ²_{1−α/2} and χ²_{α/2} be such numbers that

    P(V > \chi^2_{1-\alpha/2}) = 1 - \alpha/2 \quad \text{and} \quad P(V > \chi^2_{\alpha/2}) = \alpha/2.    (B.158)

For any given α, the corresponding χ²_{1−α/2} and χ²_{α/2} can be found in the statistical tables for the relevant χ² distribution or can be provided by any computer statistical package. Now, we have

    P\left(\chi^2_{1-\alpha/2} \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{\alpha/2}\right) = 1 - \alpha    (B.159)

and, hence,

    P\left(\frac{(n-1)S^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)S^2}{\chi^2_{1-\alpha/2}}\right) = 1 - \alpha.    (B.160)

Thus, if s² is an observed value of the sample variance of a random sample X1, . . . , Xn from a normal distribution, the (1 − α)100% confidence interval for the variance σ² is

    \frac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}.    (B.161)

This time, unlike in the two former cases, we had to choose two different limit points χ²_{1−α/2} and χ²_{α/2}. The choice of only one point would be useless, since the χ² distribution is not symmetric about zero.

Let us conclude this discussion with the following observation. In all the cases, we constructed the "equal tailed" confidence intervals. That is, we required that

    P(Z > z_{\alpha/2}) = P(Z < -z_{\alpha/2}) = \alpha/2,
    P(T > t_{\alpha/2}) = P(T < -t_{\alpha/2}) = \alpha/2,

    P(V > \chi^2_{\alpha/2}) = P(V < \chi^2_{1-\alpha/2}) = \alpha/2,

respectively. Most often, it is indeed most natural to construct the equal tailed confidence intervals. Sometimes, however, it may be recommended to use some other limit values for a confidence interval. The choice of the limit values is up to the experimenter and the only strict requirement is that a predetermined confidence level be achieved. Say, when constructing a confidence interval for the variance of a normal population, we have to require that P(\underline{v} \le V \le \bar{v}) = 1 - \alpha for a given α, and any limit values \underline{v} and \bar{v} of the experimenter's choice.
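The following Python sketch is not part of the text; it simply evaluates the intervals (B.156), (B.157) and (B.161) for a simulated normal sample, using scipy's quantile functions in place of statistical tables. The simulated data and parameter values are assumptions made only for illustration.

```python
# Confidence intervals for the mean (sigma known / unknown) and for the variance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sigma_true = 2.0
x = rng.normal(loc=5.0, scale=sigma_true, size=25)
n, alpha = x.size, 0.05
xbar, s2 = x.mean(), x.var(ddof=1)
s = np.sqrt(s2)

# (B.156): mean, variance known
z = stats.norm.ppf(1 - alpha / 2)
ci_known = (xbar - z * sigma_true / np.sqrt(n), xbar + z * sigma_true / np.sqrt(n))

# (B.157): mean, variance unknown
t = stats.t.ppf(1 - alpha / 2, df=n - 1)
ci_t = (xbar - t * s / np.sqrt(n), xbar + t * s / np.sqrt(n))

# (B.161): variance
chi2_upper = stats.chi2.ppf(1 - alpha / 2, df=n - 1)   # chi^2_{alpha/2} in the text's notation
chi2_lower = stats.chi2.ppf(alpha / 2, df=n - 1)       # chi^2_{1-alpha/2}
ci_var = ((n - 1) * s2 / chi2_upper, (n - 1) * s2 / chi2_lower)

print("95% CI for mu, sigma known  :", ci_known)
print("95% CI for mu, sigma unknown:", ci_t)
print("95% CI for sigma^2          :", ci_var)
```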
B.13.4 Testing Hypotheses
A dual problem to the construction of confidence intervals is that of testing hypotheses. Instead of constructing an interval which includes the true value of a parameter of interest with predetermined confidence, we can ask whether we have sufficient evidence to reject our hypothesis about the true value of the parameter. In the context of SPC, we usually ask whether, given the evidence coming from the observations taken, we should not reject the hypothesis that the true value of the parameter of interest remains unchanged. The parameters which are examined most often are the mean and variance of some measurements of a production process.

Consider a normal population with unknown mean μ and known variance σ². We want to test the null hypothesis that μ is equal to some specified value μ0,

    H_0: \mu = \mu_0,    (B.162)

against the general alternative hypothesis that it is not the case,

    H_1: \mu \ne \mu_0.    (B.163)

The conclusion has to be based on a suitable random experiment. Thus, in order to solve the problem, two tasks have to be accomplished. First, we have to introduce a test statistic whose range could be partitioned into two nonoverlapping sets in such a way that the values from one set would indicate that the null hypothesis should be rejected, whereas the values from the other set would indicate that the null hypothesis should be accepted. Of course, rejection of H0 implies acceptance of the alternative hypothesis, H1. And second, a random sample of observations should be obtained, so that the test statistic could be assigned a value. Due to the random character of the whole experiment, we cannot hope to reach a 100% certain conclusion. It is therefore crucial to propose a test statistic which is "capable of" discerning the two possibilities with maximum possible accuracy.

Let us put our problem into a general framework. Suppose a random sample X1, X2, . . . , Xn is drawn from a population with common p.d.f. or p.f. f(x; θ). We wish to test the null hypothesis H0: θ ∈ Θ0 against the alternative hypothesis H1: θ ∈ Θ1, where Θ0 and Θ1 are some fixed and disjoint subsets of the parameter space Θ, i.e., of the whole range of possible values of the parameters θ. In the case of (B.162) and (B.163), just one parameter is tested, θ = μ, Θ = (−∞, ∞), Θ0 = {μ0}, i.e., Θ0 is the set consisting of only one point μ0, and Θ1 = (−∞, μ0) ∪ (μ0, ∞). If a hypothesis is determined by a set consisting of only one point in the parameter space, the hypothesis is called simple. Otherwise, it is called a composite hypothesis. We shall consider the following, intuitively plausible, test statistic:

    \lambda(x_1, x_2, \ldots, x_n) = \frac{\sup\{L(x_1, x_2, \ldots, x_n; \theta) : \theta \in \Theta_1\}}{\sup\{L(x_1, x_2, \ldots, x_n; \theta) : \theta \in \Theta_0\}},    (B.164)
where L(x1, x2, . . . , xn; θ) is the likelihood function,

    L(x_1, x_2, \ldots, x_n; \theta) = f(x_1; \theta) f(x_2; \theta) \cdots f(x_n; \theta).    (B.165)

In the numerator of (B.164), the supremum of L(x1, x2, . . . , xn; θ) is taken over all θ ∈ Θ1, while in the denominator, the supremum is taken over all θ ∈ Θ0. Note that according to our interpretation of the likelihood function, the test statistic λ(x1, x2, . . . , xn) should assume large values if H1 is true, that is, if the true value of the parameters θ belongs to the set Θ1. Analogously, λ(x1, x2, . . . , xn) should assume small values if H0 is true. Tests based on the statistic (B.164) are called the likelihood ratio tests. If the null hypothesis is simple, as is the case in example (B.162)-(B.163), formula (B.164) simplifies to the following one:

    \lambda(x_1, x_2, \ldots, x_n) = \frac{\sup\{L(x_1, x_2, \ldots, x_n; \theta) : \theta \in \Theta_1\}}{L(x_1, x_2, \ldots, x_n; \theta_0)},    (B.166)

where θ0 is given by the null hypothesis, H0: θ = θ0.
It remains to determine the critical value of λ(x1, x2, . . . , xn), which partitions the set of all possible values of the statistic into two complementary subsets of "small" and "large" values of the statistic. If the observed value of the test statistic (B.164) falls into the subset of "large" values, the null hypothesis is rejected. If the observed value falls into the subset of "small" values, the null hypothesis is accepted. The subset of values of the test statistic leading to rejection of H0 is called the critical region of the test, whereas the other subset is called the acceptance region of the test.

Let us first determine the critical value c of λ(x1, x2, . . . , xn) for the case of a simple null hypothesis, H0: θ = θ0. Prior to the experiment, a random sample X1, X2, . . . , Xn is, of course, a sequence of n random variables. Now, even if the null hypothesis is true, it may still happen that the test statistic will assume an arbitrarily large value. We determine c in such a way that the probability that H0 will be rejected, when it is true, be equal to a predetermined significance level α,

    P_{\theta_0}[\lambda(X_1, X_2, \ldots, X_n) > c] = \alpha,    (B.167)

where P_{θ0} denotes the probability calculated under the assumption that the true value of the parameter is θ0. A simple generalization of this requirement to the case of a composite null hypothesis has the form

    \sup\{P_{\theta}[\lambda(X_1, X_2, \ldots, X_n) > c] : \theta \in \Theta_0\} = \alpha.    (B.168)

The supremum is taken here over all θ ∈ Θ0. In other words, whatever value of θ from the set determining the null hypothesis is true, we require that the probability of rejecting H0 not exceed the significance level α. Rejection of the null hypothesis when it is true is called the type I error. (Acceptance of the null hypothesis when it is false is called the type II error.) Observe that the rejection region has the form

    \{x_1, x_2, \ldots, x_n : \lambda(x_1, x_2, \ldots, x_n) > c\},    (B.169)

that is, it is the set of such points (x1, x2, . . . , xn) in the n-dimensional sample space that the function λ(x1, x2, . . . , xn) assumes values greater than c. In practice, it is often convenient to replace statistic (B.164) by an equivalent statistic, which is in one-to-one correspondence with (B.164). We can thus describe the likelihood ratio test in terms of either of these statistics. In fact, all the test statistics introduced in the sequel are statistics equivalent to those defined by (B.164).

Of course, we are interested in having tests which guarantee small probabilities of committing type II errors. This requirement is most often
investigated using the concept of the power of a test. The power function of a test is the function that associates the probability of rejecting H0 by the test with each θ from the parameter space Θ. Thus, for θ ∈ Θ1, the value of the power function at θ is equal to one minus the probability of committing the type II error when the true value of the parameter is θ. It can sometimes be proved that the likelihood ratio tests are most powerful in a certain sense and, even if such a proof is not available, it can usually be shown that these tests have desirable properties anyway.

Any choice of the significance level of a test is always somewhat arbitrary. Interpretation of results of testing a hypothesis is greatly facilitated by providing the so-called p-value. Given a test and the computed value of the test statistic, the p-value of the test is the smallest level of significance at which the null hypothesis is to be rejected. Thus, if the p-value proves small for a particular case under consideration (say, it is equal to .01), one is certainly inclined to reject H0. We call the data used to compute a test statistic significant if they lead to the rejection of H0. The p-value says "how significant" the data are.

Let us return to problem (B.162)-(B.163). It can be shown that the likelihood ratio test, provided the parent probability distribution is normal and its variance is known, is given by the test statistic

    \frac{|\bar{X} - \mu_0|}{\sigma/\sqrt{n}}.    (B.170)

The critical value c is equal to z_{α/2} given by (B.151). Thus, the null hypothesis is rejected if the given statistic assumes a value which is greater than z_{α/2}. This result is intuitively appealing, in particular in view of (B.155). Sometimes, we want to test (B.162) against the hypothesis

    H_1: \mu > \mu_0.    (B.171)

(For obvious reasons, hypothesis (B.163) is called two-sided, while hypothesis (B.171) is called one-sided.) In view of the preceding considerations, it is not surprising that the likelihood ratio test is given by the test statistic

    \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}    (B.172)

and the critical value c is given by z_α, where z_α is such that

    P(Z > z_\alpha) = \alpha,    (B.173)
and Z is a standard normal random variable. We leave it to the reader to guess the likelihood ratio test for testing (B.162) against the one-sided hypothesis H1: μ < μ0.

If the population is normal and we want to test (B.162) against (B.163), but the population variance σ² is unknown, it can be shown that the likelihood ratio test is given by the test statistic

    \frac{|\bar{X} - \mu_0|}{S/\sqrt{n}},    (B.174)

where S is the square root of the sample variance. The critical value c is equal to t_{α/2} from (B.157). It is worthwhile to note that, in the problem just considered, H0 was a composite hypothesis. Indeed, with σ² unknown, H0 corresponds to the set of parameters Θ0 = {μ0, 0 < σ² < ∞}; i.e., although the mean μ is equal to μ0 under H0, the variance σ² is arbitrary. Tests based on (B.174) are known as t-tests. As in the case of the confidence interval for μ when σ² is unknown, if n is not too small, we can replace t_{α/2} by z_{α/2} given by (B.151). We leave it to the reader to guess the form of the likelihood ratio tests for the one-sided alternative hypotheses H1: μ > μ0 and H1: μ < μ0.

Finally, let us consider the problem of testing the hypothesis that the variance of a normal population is equal to a fixed number σ0² against the alternative hypothesis that the population variance is not equal to σ0². The population mean is unknown. That is, we wish to test

    H_0: \sigma^2 = \sigma_0^2 \quad \text{versus} \quad H_1: \sigma^2 \ne \sigma_0^2.    (B.175)

It can be shown that the likelihood ratio test accepts the null hypothesis if and only if

    c_1 \le \frac{(n-1)S^2}{\sigma_0^2} \le c_2,    (B.176)

where c1 and c2 are such that

    c_1 - c_2 = n \ln(c_1/c_2)    (B.177)

and

    F(c_2) - F(c_1) = 1 - \alpha,    (B.178)

where F is the c.d.f. of the χ² distribution with n − 1 degrees of freedom. This time, perhaps somewhat surprisingly, the likelihood ratio test does not correspond exactly to the equal tailed confidence interval for the dual problem, given by (B.161). However, for sufficiently large n, χ²_{1−α/2} and χ²_{α/2}, given by (B.158), approximately satisfy conditions (B.177) and (B.178). Thus, at least approximately, the given test and the confidence interval (B.161) correspond to one another.

Although we shall not discuss these cases, let us mention that it is easy to construct likelihood ratio tests when we are given two independent normal random samples, and we wish to test either equality of means (with variances known, or unknown but equal) or equality of variances (with means unknown and arbitrary). In the latter case, as hinted in Subsection B.6.3, the test is based on the F statistic.
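The Python sketch below is not from the text; it is a small worked illustration, with simulated data and an assumed shift in the mean, of the two-sided tests based on the statistics (B.170) and (B.174).

```python
# z-test (variance known) and t-test (variance unknown) of H0: mu = mu0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
mu0, sigma = 5.0, 2.0
x = rng.normal(loc=5.6, scale=sigma, size=30)
n, xbar, s = x.size, x.mean(), x.std(ddof=1)
alpha = 0.05

# Statistic (B.170), critical value z_{alpha/2}
z_stat = abs(xbar - mu0) / (sigma / np.sqrt(n))
p_z = 2 * (1 - stats.norm.cdf(z_stat))

# Statistic (B.174), critical value t_{alpha/2}
t_stat = abs(xbar - mu0) / (s / np.sqrt(n))
p_t = 2 * (1 - stats.t.cdf(t_stat, df=n - 1))

print(f"z = {z_stat:.3f}, p-value = {p_z:.4f}, reject H0: {z_stat > stats.norm.ppf(1 - alpha/2)}")
print(f"t = {t_stat:.3f}, p-value = {p_t:.4f}, reject H0: {t_stat > stats.t.ppf(1 - alpha/2, df=n - 1)}")
```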
B.13.5 Goodness-of-Fit Test
Sometimes, we want to verify the hypothesis that a random sample comes from a specified probability distribution which does not have to belong to any family of distributions defined by some parameters. Thus, the problem posed does not reduce to that of testing a hypothesis about the parameters of the hypothesized distribution. A natural way to solve this problem is to somehow fit the data to the hypothesized distribution.

In order to introduce one such approach, developed by Karl Pearson, let us consider the following very simple task. A six-sided die with faces 1 to 6 is tossed 120 times. The null hypothesis is that the die is balanced, i.e.,

    H_0: f(x) = 1/6 \quad \text{for } x = 1, 2, \ldots, 6,    (B.179)

where f(x) denotes the probability function of the outcomes. The alternative hypothesis is that f(x) is not a p.f. of the uniform distribution. If the die is balanced, we would expect each face to occur approximately 20 times. Equivalently, we would expect all the proportions of occurrences of each of the six faces to be equal to 1/6. Having this information about the hypothesized distribution, we can base our inference on the comparison of the observed proportions with the corresponding expected proportions. Suppose the following proportions, denoted by p_i, have been observed:

    face    1      2      3      4      5      6
    p_i     1/15   1/3    1/12   2/15   2/15   1/4
In the sequel, we shall refer to each possible outcome of a discrete-valued experiment as a bin. That is, we have six bins in our example. We shall also say that the outcomes fall into bins.
The criterion for verifying the fit of the data proposed by Pearson is

    \chi^2 = n \sum_{i=1}^{k} \frac{(p_i - \pi_i)^2}{\pi_i},    (B.180)

where π_i is the expected proportion of data in the ith bin, i = 1, 2, . . . , k, k is the number of bins and n is the number of data. In the case above,

    \chi^2 = 120\left[ \frac{(\frac{1}{15} - \frac{1}{6})^2}{\frac{1}{6}} + \frac{(\frac{1}{3} - \frac{1}{6})^2}{\frac{1}{6}} + \frac{(\frac{1}{12} - \frac{1}{6})^2}{\frac{1}{6}} + 2\,\frac{(\frac{2}{15} - \frac{1}{6})^2}{\frac{1}{6}} + \frac{(\frac{1}{4} - \frac{1}{6})^2}{\frac{1}{6}} \right] = 38.8.

Before we present the way of using criterion (B.180) to test the goodness of fit, let us note that the above example shows clearly how to obtain this criterion's value in the case of an arbitrary hypothesized discrete distribution. If a hypothesized probability distribution is continuous, we have first to bin the data and then see if the proportions thus obtained fit the corresponding expected proportions. Bins are formed as adjacent intervals on the real line, and expected proportions are given by

    \pi_i = \int_{i\text{th bin}} f(x)\, dx,    (B.181)

where f(x) is the hypothesized p.d.f. Binning of the data should be done in such a way that all the expected proportions be approximately the same.

Criterion (B.180) can be used effectively to test the goodness of fit due to the following fact. If a random sample is not too small, and if the null hypothesis is true, then χ² given by (B.180) is an observed value of a random variable whose probability distribution is well approximated by the χ² distribution with k − 1 degrees of freedom. In practice, we require that at least 5 data fall in each bin. Of course, if the observed proportions are close to the expected ones, the value of (B.180) is small, indicating good fit. Thus, at the α level of significance, the critical region should have the form

    \chi^2 > \chi^2_\alpha,    (B.182)

where the value of χ²_α is provided by the statistical tables of the χ² distribution with k − 1 degrees of freedom.

The value of χ² in our example was equal to 38.8. For α = 0.05, χ²_{0.05} = 11.07 and, hence, the null hypothesis is rejected at the 0.05 level.
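The short Python check below is not from the text; it recomputes Pearson's statistic (B.180) for the die example from its observed proportions and compares it with the χ² critical value with k − 1 = 5 degrees of freedom.

```python
# Pearson chi-square goodness-of-fit test for the die example.
import numpy as np
from scipy import stats

n = 120
p = np.array([1/15, 1/3, 1/12, 2/15, 2/15, 1/4])   # observed proportions
pi = np.full(6, 1/6)                               # expected proportions under H0

chi2 = n * np.sum((p - pi) ** 2 / pi)
critical = stats.chi2.ppf(0.95, df=5)

print(f"chi-square statistic = {chi2:.2f}")               # about 38.8
print(f"critical value (alpha = 0.05) = {critical:.2f}")  # about 11.07
print("reject H0:", chi2 > critical)

# Equivalently, with observed and expected counts:
print(stats.chisquare(f_obs=p * n, f_exp=pi * n))
```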
B.13.6 Empirical Distribution Functions
We may also wish to estimate the parent probability distribution of a random sample directly, instead of verifying the hypothesis that the random sample comes from a specified distribution. As previously, we do not want to assume that the parent distribution belongs to a family of distributions defined by some parameters. The primary way of accomplishing this general task is to construct the so-called empirical distribution function (e.d.f.).

Let us consider a random sample of n independent random variables X1, X2, . . . , Xn with common c.d.f. F(x). The empirical distribution function of the random sample is defined as

    F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x), \quad -\infty < x < \infty,    (B.183)

where I(Xi ≤ x) is the indicator function of the event {Xi ≤ x},

    I(X_i \le x) = \begin{cases} 1 & \text{if } X_i \le x \\ 0 & \text{if } X_i > x. \end{cases}

Equivalently,

    F_n(x) = \frac{\text{number of sample points satisfying } X_i \le x}{n}.    (B.184)

The e.d.f. can be written in still another form, using the concept of order statistics. By the order statistics of the sample X1, X2, . . . , Xn we mean the sample rearranged in order from least to greatest. The ordered sample is written as X(1), X(2), . . . , X(n), where X(1) ≤ X(2) ≤ . . . ≤ X(n), and X(k) is called the kth order statistic. If we want to emphasize that the kth order statistic comes from the sample of size n, we write Xk:n instead of X(k). Now

    F_n(x) = \begin{cases} 0 & \text{if } X_{1:n} > x \\ k/n & \text{if } X_{k:n} \le x < X_{k+1:n}, \quad k = 1, 2, \ldots, n-1 \\ 1 & \text{if } X_{n:n} \le x. \end{cases}    (B.185)

Thus, as is easily seen from the above definitions, the e.d.f. is a piecewise constant function with jumps at the observed data points X1, X2, . . . , Xn. The jumps are of value 1/n, unless more than one observation Xi assumes the same value (note that for a continuous parent distribution it
happens with probability zero). One can object that, since the e.d.f. is necessarily a nonsmooth function, it is not an "elegant" estimator of any continuous parent c.d.f. F(x). Yet, a closer examination of the properties of Fn(x) shows that it is a good estimator of both discrete and continuous c.d.f.'s.

In order to appreciate the concept of the e.d.f., let us note first that, by (B.183) and for any fixed value of the argument x, Fn(x) is the sample mean of n independent Bernoulli r.v.'s I(X1 ≤ x), . . . , I(Xn ≤ x), each of which assumes the value one with probability P(Xi ≤ x). But P(Xi ≤ x) = F(x), and it follows from the properties of the Bernoulli r.v.'s that E(I(Xi ≤ x)) = F(x) and Var(I(Xi ≤ x)) = F(x)(1 − F(x)) for each i = 1, 2, . . . , n. Hence, by (B.62) and (B.63),

    E(F_n(x)) = F(x) \quad \text{and} \quad \mathrm{Var}(F_n(x)) = \frac{F(x)(1 - F(x))}{n}.    (B.186)

Moreover, the SLLN implies that Fn(x) converges almost surely to F(x) as n tends to infinity, while the CLT implies that, for large n, Fn(x) is approximately normally distributed with mean F(x) and variance F(x)(1 − F(x))/n.

Sometimes, we are interested in estimating not a whole c.d.f. but only the so-called quantiles of a parent distribution. For the sake of simplicity, let us assume for a moment that the random sample is governed by a continuous probability distribution with a c.d.f. F(x) which is increasing between its boundary values 0 and 1. A value xp such that F(xp) = p for any fixed p ∈ (0, 1) is called the quantile of order p of F(x). Quantiles of orders .25 and .75 are called the lower quartile and upper quartile, respectively. The quantile of order .5 is called the median. Often, the quantile of order p is called the (100p)th percentile. It follows from (B.185) that a natural sample counterpart of the quantile of order p can be defined as

    \tilde{x}_p = \begin{cases} X_{np:n} & \text{if } np \text{ is an integer} \\ X_{[np]+1:n} & \text{otherwise,} \end{cases}    (B.187)

where, for any fixed a, [a] denotes the largest integer not greater than a. Indeed,

    F_n(\tilde{x}_p) = p

if np is an integer. If np is a fraction, Fn(x) is never equal to p exactly. In fact, since Fn(x) is a piecewise constant function, defining sample
quantiles is always somewhat arbitrary. For example, as we shall see in the next subsection, a somewhat more natural definition of the sample median is available. The properties discussed make the e.d.f. a widely used tool in statistical inference. A separate question, not dealt with here, is that much work has been done to obtain smooth estimators of continuous c.d.f.'s (and p.d.f.'s).
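The following Python sketch is not from the text; it simply evaluates the e.d.f. (B.183) and the sample quantile (B.187) directly from their definitions, with arbitrarily chosen simulated data.

```python
# The empirical distribution function and the sample quantile of order p.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=15)

def edf(sample, t):
    """F_n(t): fraction of sample points not greater than t (B.183)/(B.184)."""
    return np.mean(sample <= t)

def sample_quantile(sample, p):
    """x_tilde_p from (B.187), expressed through order statistics."""
    n = len(sample)
    xs = np.sort(sample)                  # X_{1:n}, ..., X_{n:n}
    np_ = n * p
    k = int(np_) if float(np_).is_integer() else int(np.floor(np_)) + 1
    return xs[k - 1]                      # k-th order statistic (1-based index)

print("F_n(0)            :", edf(x, 0.0))
print("sample median     :", sample_quantile(x, 0.5))
print("sample .75 quantile:", sample_quantile(x, 0.75))
```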
B.13.7 Nonstandard Estimators of Mean and Variance
We know well from the previous considerations that the sample mean is a natural and good estimator of the population mean. It happens, however, that the sample we have at our disposal does not come exactly from the random phenomenon of our interest but, rather, that the probability distribution of the sample is contaminated by some other distribution. In such circumstances, we would wish to have an estimator which is likely to disregard the elements of the sample that come from the contaminating distribution. The sample mean usually does not perform this task well, while the so-called sample median quite often does. In particular, the concept of the sample median is very useful in the context of statistical process control. This point is briefly discussed below and elaborated at length in Chapter 3. Throughout this subsection, we shall assume that the random sample is governed by a continuous probability distribution (with c.d.f. F (x) and p.d.f. f (x)). For the sake of simplicity, we shall assume also that the c.d.f. F (x) is increasing between its boundary values 0 and 1. The median of the distribution is defined as the number ξ ≡ x.5 for which F (ξ) = 1/2. That is, half the area under the p.d.f. f (x) is to the left of ξ and half is to the right of ξ (note that an additional condition is required to make ξ uniquely defined if F (x) is not assumed to be increasing). It is easily seen that the mean and median are equal if the probability distribution is symmetric, i.e., if F (x) = 1 − F (−x) or, equivalently, f (x) = f (−x) for each x. Indeed, both are then equal to zero. It is also clear that the equality of the mean and median holds for probability distributions which arise from shifting a symmetric distribution by an arbitrary constant. We say that a probability distribution of the sample X1 , X2 , . . . , Xn is symmetric about a constant c if the common distribution of the r.v.’s X1 −c, X2 −c, . . . , Xn −c is symmetric. Of course, the mean and median
are then equal to c. The most important family of such distributions is that of normal distributions. In general, however, the mean and median may differ (take, e.g., a unimodal skewed density).

A natural sample counterpart of the population median, slightly different from that implied by (B.187), is the following sample median:

    \tilde{\mu} = \begin{cases} X_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(X_{(n/2)} + X_{(n/2+1)}\right) & \text{if } n \text{ is even,} \end{cases}    (B.188)

where n is the sample size and X(k) denotes the kth order statistic.

Before we give the properties of the sample median, we have to derive some general results for arbitrary order statistics. In particular, the c.d.f. of the kth order statistic of a sample of size n, to be denoted F(k)(x), has the form

    F_{(k)}(x) = P(X_{(k)} \le x) = P(\text{at least } k \text{ of the } X_i\text{'s are not greater than } x) = \sum_{i=k}^{n} \binom{n}{i} F^i(x)(1 - F(x))^{n-i},    (B.189)

since each term in the sum is the binomial probability that exactly i of X1, X2, . . . , Xn are not greater than x. Hence, differentiating with respect to x, we obtain that the p.d.f. of the kth order statistic is given by

    f_{(k)}(x) = \frac{n!}{(k-1)!(n-k)!}\, f(x)(F(x))^{k-1}(1 - F(x))^{n-k}.    (B.190)

Substituting the p.d.f. of the kth order statistic into (B.39) yields the mean of the statistic:

    \mu_{k:n} = \frac{n!}{(k-1)!(n-k)!} \int_{-\infty}^{\infty} x f(x)(F(x))^{k-1}(1 - F(x))^{n-k}\, dx,    (B.191)

where μ_{k:n} denotes the mean of the kth order statistic X_{k:n} of a sample of size n. After some tedious, and hence skipped, algebra one obtains that the following recurrence relation holds for the means of order statistics of samples from the same parent distribution:

    (n - k)\mu_{k:n} + k\mu_{k+1:n} = n\mu_{k:n-1}, \quad \text{where } k = 1, 2, \ldots, n-1.    (B.192)
Relations (B.191) and (B.192) enable us to prove that the mean of the sample median is equal to the mean of the parent distribution of the sample, provided that the parent distribution of the sample is symmetric about some constant. In the proof, without loss of generality, we shall confine ourselves to the case of c = 0, that is, of symmetric parent distributions; by (B.18) and the definition of a distribution's symmetry about c, the proof will then be accomplished for any c. Now, if the sample size is odd, μ̃ = X_{(n+1)/2:n} and, by (B.191), μ_{(n+1)/2:n} = 0, since the integrand in (B.191) is an odd function, g(x) = −g(−x), where

    g(x) = x f(x)(F(x))^{(n-1)/2}(1 - F(x))^{(n-1)/2};

indeed, x is an odd function while, under the symmetry assumption, f(x) and [F(x)(1 − F(x))]^{(n-1)/2} are even functions. But the mean of a symmetric distribution is necessarily zero and, thus, the required result holds for odd n. For n even and k = n/2, the result also readily follows, since the left hand side of (B.192) is equal to nE(μ̃), while the right hand side is equal to zero by the same argument as previously, but applied to μ_{n/2:n−1}. The proof is therefore concluded.

Given (B.190), it is obvious that calculation of the variance of μ̃ for particular parent distributions is a cumbersome task. We shall content ourselves with citing a more general but only asymptotically valid result, which says that, for large n, μ̃ is approximately normally distributed with mean ξ and variance 1/(4nf²(ξ)), provided the parent p.d.f. is continuous in a neighborhood of ξ. It readily follows from this last result that the sample median converges in probability to the median, that is,

    \lim_{n \to \infty} P(|\tilde{\mu}_n - \xi| > \varepsilon) = 0    (B.193)

for each positive ε, where μ̃_n denotes the sample median of a sample of size n; more explicitly, (B.193) states that for each positive ε and δ there exists an N such that P(|μ̃_n − ξ| > ε) < δ for all n > N. Without going into the technical details of a fully rigorous proof, it suffices to observe that, in the limit, for n = ∞, the sample median is normally distributed with mean ξ and variance 0. In other words, asymptotically, the probability distribution of μ̃_n becomes concentrated at ξ, which implies (B.193).
The above proof of convergence in probability cannot be generalized in such a way as to answer the question of the almost sure convergence of μ̃_n. Since convergence properties form an important ingredient of justifying the usefulness of any estimator, we shall now give another, simple and complete, proof of (B.193), which does not refer to any distributional properties of the sample median. A suitable modification of this latter proof answers the question of the a.s. convergence of μ̃_n as well.

For any fixed ε > 0, we have

    F(\xi - \varepsilon) < F(\xi) = 1/2 < F(\xi + \varepsilon).    (B.194)

Let

    \eta = \min\{1/2 - F(\xi - \varepsilon),\; F(\xi + \varepsilon) - 1/2\}.    (B.195)

Now, since the e.d.f. Fn(x) converges almost surely to F(x) for each x, we have

    P(|F_n(\xi - \varepsilon) - F(\xi - \varepsilon)| < \eta/2) \ge 1 - \delta/2    (B.196)

and

    P(|F_n(\xi + \varepsilon) - F(\xi + \varepsilon)| < \eta/2) \ge 1 - \delta/2    (B.197)

for any fixed δ and all sufficiently large n. Combining (B.195) - (B.197) yields

    P(F_n(\xi - \varepsilon) < 1/2 < F_n(\xi + \varepsilon)) \ge 1 - \delta.    (B.198)

For n odd, it follows from (B.185), (B.198) and the definition of μ̃ that

    P(\xi - \varepsilon < \tilde{\mu} < \xi + \varepsilon) \ge 1 - \delta    (B.199)

for all sufficiently large odd n. A similar but more careful examination of (B.198) shows that property (B.199) holds for n even as well, which proves the desired result. Although we do not give it here, let us mention again that a slightly more refined argument shows that the sample median converges to the median not only in probability but also almost surely.

In statistical process control, the following conceptual model is of interest. We are given a number of independent samples of random variables, each sample having the same size, say, n. Denote the number of samples by N. Within each sample, the random variables are independent and identically distributed. Ideally, the distribution of the random variables is the same for all N samples. Our goal is to estimate the mean of this parent, or "norm," distribution. Provided all N samples are indeed
governed by the norm distribution, the given task is simple. The obvious way to solve it is to calculate the sample mean of all nN random variables. Equivalently, we can calculate the N sample means of all the samples first, and then take the average of the sample means obtained. The average of the sample means can be viewed as the sample mean of the sample means for the particular samples. Ideally, the estimate thus obtained is unbiased and, with N increasing, it converges to the true mean of the norm distribution provided only that the variance of the distribution is finite.

However, whenever we are interested in implementing statistical process control for quality improvement, it is because "sometimes something different from the norm is happening." Namely, the norm distribution of some samples is contaminated by another distribution. In order to take this fact into account, it is reasonable to assume that each sample is governed by the norm distribution only with some probability less than one, say, with probability p, and is governed by some other distribution with probability 1 − p. The other distribution is a mixture of the norm distribution and a contaminating distribution. Assume that the mean of the norm distribution is μ1, while that of the other distribution is μ2, μ1 ≠ μ2. Assume also that the variances of both distributions are finite. Now, the average of the sample means is an unbiased estimator of pμ1 + (1 − p)μ2 and, as N increases, it converges to pμ1 + (1 − p)μ2, although we would wish it to converge to μ1. To make matters worse, it does not help if we can make the size of the samples, n, arbitrarily large. With n increasing, each sample mean converges to the true mean of the sample, which is either equal to μ1 or to μ2, and the average of the sample means still converges to pμ1 + (1 − p)μ2. The reason for this last fact is that, for N large, the sample means obtained are approximately equal to μ1 about pN times and to μ2 about (1 − p)N times.

At the same time, precisely for the same reason, the sample median of the sample means must converge to μ1 as n and N both increase, provided only that p > .5. This important result follows from the fact that pN is greater than (1 − p)N, the sample median is chosen from N data and, in the limit (as n and N both approach infinity), pN data are equal to μ1 while (1 − p)N are equal to μ2. Intuitively, for sufficiently large n and N, more than half the data are concentrated around μ1 and are separated from the data concentrated around μ2 with probability arbitrarily close to one. In view of the definition of the sample median, this implies the required result.

A rigorous argument is slightly more involved. For each fixed n and for each sample, the sample mean X̄ is, of course, a random variable. We shall show first that X̄ can be made arbitrarily close to μ1 with probability greater than .5 by choosing a sufficiently large sample size. Given that a sample is governed by the norm distribution, it follows from Chebyshev's inequality (see (B.74)) with ε = 1/n^{1/4} that the probability that

    \bar{X} \in [\mu_1 - 1/n^{1/4},\; \mu_1 + 1/n^{1/4}]    (B.200)

is greater than 1 − σ₁²/√n, where σ₁² denotes the variance of the norm distribution. Now, since the probability that a sample is governed by the norm distribution is p, the unconditional probability of event (B.200) satisfies the following inequality:

    P(\bar{X} \in [\mu_1 - 1/n^{1/4},\; \mu_1 + 1/n^{1/4}]) > p(1 - \sigma_1^2/\sqrt{n})    (B.201)

and, for n sufficiently large, the right hand side of (B.201) is indeed greater than .5, since p was assumed to be greater than .5. Hence, we obtain by the definition of the median of a probability distribution that, for all such n, the median of the (unconditional) distribution of the sample mean belongs to the interval [μ1 − 1/n^{1/4}, μ1 + 1/n^{1/4}]. But this implies in turn that the sample median of the N sample means converges, as N → ∞, to a value in this interval as well. Finally, therefore, unlike the sample mean, the sample median is indeed insensitive to the model's departures from the norm in the sense described.

The only measure of "spread" or "variability" of a probability distribution discussed so far has been the variance of the distribution. The sample counterpart of the variance is, of course, the sample variance. Another natural candidate for becoming a measure of a sample's spread is the range of the sample, defined as the difference between the last and the first order statistic,

    R = X_{n:n} - X_{1:n}.    (B.202)
No doubt, the range provides the simplest possible way of measuring the spread of the parent probability distribution. The interesting question is what is the relationship between the variance and the range or, since the range is a sample characteristic and, hence, is itself a random variable, between the variance and the mean of the range. Just as we derived the p.d.f. of the kth order statistic, one can solve the more ambitious task of giving the joint p.d.f. of two order statistics of the same sample, Xk:n and Xm:n , say. In particular, one can derive the joint p.d.f. of the first and the last order statistic, X1:n and Xn:n ,
respectively. Using the change-of-variable technique, one can then derive the p.d.f. of the range R. Let us omit all the tedious calculations and state only the form of the p.d.f. of the sample's range:

    f_r(y) = n(n-1) \int_{-\infty}^{\infty} f(x)(F(x+y) - F(x))^{n-2} f(x+y)\, dx,    (B.203)

where, as usual, F(x) and f(x) denote the parent c.d.f. and p.d.f., respectively. Integrating (B.203) yields the c.d.f. of the range:

    F_r(y) = \int_{-\infty}^{y} f_r(t)\, dt = n \int_{-\infty}^{\infty} f(x)(F(x+y) - F(x))^{n-1}\, dx.    (B.204)
We are now in a position to prove that the mean of the range of a sample of normal r.v.'s with variance σ² is linearly related to the square root of σ², i.e., to the standard deviation of the sample,

    E(R) = \bar{R} = \sigma/b,    (B.205)

where b depends on the sample size. It follows that, for normal samples, the square root of the sample variance can be replaced by a suitable multiple of the range.

To prove (B.205), let us consider a sample of n normally distributed r.v.'s with mean μ and variance σ², X1, X2, . . . , Xn. We have

    F_r(y) = P(R \le y) = P(R/\sigma \le y/\sigma) = P(R' \le y/\sigma),    (B.206)

where R is given by (B.202) and R′ = R/σ. Observe that the transformed sample,

    \frac{X_1 - \mu}{\sigma}, \frac{X_2 - \mu}{\sigma}, \ldots, \frac{X_n - \mu}{\sigma},

comes from the standard normal distribution and that R′ is its range. By (B.206), if we want the c.d.f. of the range of the original sample, we can compute P(R′ ≤ y/σ) for the transformed sample. Thus, by (B.204),

    F_r(y) = P(R' \le y/\sigma) = n \int_{-\infty}^{\infty} \varphi(x)(\Phi(x + y/\sigma) - \Phi(x))^{n-1}\, dx,    (B.207)
where ϕ(x) and Φ(x) denote the p.d.f. and c.d.f. of the standard normal distribution, respectively. Upon differentiating (B.207) and substituting u for y/σ, we have

    E(R) = n(n-1)\sigma^{-1} \int_{0}^{\infty}\int_{-\infty}^{\infty} y\, \varphi(x)(\Phi(x + y/\sigma) - \Phi(x))^{n-2}\, \varphi(x + y/\sigma)\, dx\, dy
         = n(n-1)\sigma \int_{0}^{\infty}\int_{-\infty}^{\infty} u\, \varphi(x)(\Phi(x + u) - \Phi(x))^{n-2}\, \varphi(x + u)\, dx\, du.    (B.208)

Therefore, relation (B.205) is proved. Moreover, we have obtained that b is equal to the inverse of

    n(n-1) \int_{0}^{\infty}\int_{-\infty}^{\infty} u\, \varphi(x)(\Phi(x + u) - \Phi(x))^{n-2}\, \varphi(x + u)\, dx\, du.    (B.209)
The values of b for different sample sizes n have to be calculated numerically. They are given in Table 3.1 for sample sizes 2, 3, . . . , 10, 15 and 20. Of course, the sample range is easier to calculate than the sample variance. The price to be paid is that the variance of the former is greater than that of the latter. The two variances can be considered comparable only for small sample sizes, say, up to n = 10.
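The simulation below is not part of the text; it is an illustrative sketch, with arbitrarily chosen values of μ1, μ2, p, n and N, of the contamination model discussed above: the average of the sample means drifts toward pμ1 + (1 − p)μ2, while the median of the sample means stays near μ1 when p > .5.

```python
# Mean versus median of sample means under contamination.
import numpy as np

rng = np.random.default_rng(6)
mu1, mu2, sigma = 10.0, 14.0, 1.0
p, n, N = 0.8, 50, 400          # P(a sample follows the norm distribution) = p

sample_means = np.empty(N)
for j in range(N):
    mu = mu1 if rng.random() < p else mu2   # which distribution governs this sample
    sample_means[j] = rng.normal(mu, sigma, size=n).mean()

print("average of sample means:", sample_means.mean())      # near p*mu1 + (1-p)*mu2 = 10.8
print("median of sample means :", np.median(sample_means))  # near mu1 = 10.0
```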
B.14 Bayesian Statistics

B.14.1 Bayes Theorem
Suppose that the sample space S can be written as the union of disjoint sets: S = A1 ∪ A2 ∪ · · · ∪ An. Let the event H be a subset of S which has non-empty intersections with some of the Ai's. Then

    P(A_i|H) = \frac{P(H|A_i)P(A_i)}{P(H|A_1)P(A_1) + P(H|A_2)P(A_2) + \cdots + P(H|A_n)P(A_n)}.    (B.210)

To explain the conditional probability given by equation (B.210), consider a diagram of the sample space S. Consider that the Ai's represent n disjoint states of nature. The event H intersects some of the Ai's.
Figure B.5. Bayes Venn Diagram.

Then,

    P(H|A_1) = \frac{P(H \cap A_1)}{P(A_1)} = \frac{P(A_1|H)P(H)}{P(A_1)}.

Solving for P(A1|H), we get

    P(A_1|H) = \frac{P(H|A_1)P(A_1)}{P(H)},

and in general,

    P(A_i|H) = \frac{P(H|A_i)P(A_i)}{P(H)}.    (B.211)

Now,

    P(H) = P((H \cap A_1) \cup (H \cap A_2) \cup \cdots \cup (H \cap A_n))
         = \sum_i P(H \cap A_i), \quad \text{since the intersections } (H \cap A_i) \text{ are disjoint}
         = \sum_i P(H|A_i)P(A_i), \quad \text{where } i = 1, 2, \ldots, n.

Thus, with (B.211) and P(H) given as above, we get (B.210).
The formula (B.210) finds the probability that the true state of nature is Ai given that H is observed. Notice that the probabilities P (Ai ) must be known to find P (Ai |H). These probabilities are called prior probabilities because they represent information prior to experimental data. The P (Ai |H) are then posterior probabilities. For each i = 1, 2, . . . , n, P (Ai |H) is the probability that Ai was the state of nature in light of the occurrence of the event H.
B.14.2 A Diagnostic Example
Consider patients being tested for a particular disease. It is known from historical data that 5% of the patients tested have the disease, that 10% of the patients who have the disease test negative for the disease, and that 20% of the patients who do not have the disease test positive for the disease. Denote by D+ the event that the patient has the disease and by D− the event that the patient does not, and denote by T+ the event that the patient tests positive for the disease and by T− the event that the patient tests negative. If a patient tests positive for the disease, what is the probability that the patient actually has the disease?

We seek the conditional probability P(D+|T+). Here, T+ is the observed event, and D+ may be the true state of nature that exists prior to the test. (We trust that the test does not cause the disease.) Using Bayes's theorem,

    P(D+|T+) = \frac{P(T+|D+)P(D+)}{P(T+)}    (B.212)
             = \frac{P(T+|D+)P(D+)}{P(T+|D+)P(D+) + P(T+|D-)P(D-)}
             = \frac{.9 \times .05}{.9 \times .05 + .2 \times .95} = 0.1915.

Thus, there is nearly a 20% chance, given a positive test result, that the patient has the disease. This probability is the posterior probability, and if the patient is tested again, we can use it as the new prior probability. If the patient tests positive once more, we use equation (B.212) with an updated version of P(D+), namely .1915. The posterior probability now is

    P(D+|T+) = \frac{P(T+|D+)P(D+)}{P(T+)}
             = \frac{P(T+|D+)P(D+)}{P(T+|D+)P(D+) + P(T+|D-)P(D-)}
             = \frac{.9 \times .1915}{.9 \times .1915 + .2 \times .8085} = 0.5159.

Twice the patient has tested positive for the disease, and the posterior probability that the patient has the disease is now much higher. As we gather more and more information with further tests, our posterior probabilities will describe the true state of nature better and better.

In order to find the posterior probabilities as we have done, we needed to know the prior probabilities. A major concern in a Bayes application is the choice of priors, a choice which sometimes must be made with very little prior information. One suggestion made by Bayes is to assume that the n states of nature are equally likely (Bayes' Axiom). If we make this assumption in the example above, that is, that P(D+) = P(D−) = .5, then

    P(D+|T+) = \frac{P(T+|D+)P(D+)}{P(T+|D+)P(D+) + P(T+|D-)P(D-)}.

P(D+) and P(D−) cancel, giving

    P(D+|T+) = \frac{P(T+|D+)}{P(T+|D+) + P(T+|D-)} = \frac{.9}{.9 + .2} = 0.8182.

This is much higher than the accurate probability, .1915. Depending upon the type of decisions an analyst has to make, a discrepancy of this magnitude may be very serious indeed. As more information is obtained, however, the effect of the initial choice of priors will become less severe.
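The small Python sketch below is not from the text; it reproduces the sequential updates of the diagnostic example, with the function name and argument names chosen only for illustration: each positive test result turns the current posterior into the next prior.

```python
# Sequential Bayesian updating for the diagnostic test example.
def posterior_positive(prior_d, sens=0.9, false_pos=0.2):
    """P(D+ | T+) given P(D+) = prior_d, P(T+|D+) = sens, P(T+|D-) = false_pos."""
    return sens * prior_d / (sens * prior_d + false_pos * (1.0 - prior_d))

p = 0.05                          # historical (informed) prior P(D+)
for test in (1, 2):
    p = posterior_positive(p)
    print(f"after positive test {test}: P(D+|T+) = {p:.4f}")   # 0.1915, then 0.5159

print("with the uniform prior   :", round(posterior_positive(0.5), 4))  # 0.8182
```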
B.14.3 Prior and Posterior Density Functions
We have defined a parameter of a distribution to be a quantity that describes it. Examples include the mean μ and variance σ 2 of the normal distribution, and the number of trials n and the probability of success p of the binomial distribution. A great deal of statistical inference considers parameters to be fixed quantities, and sample data are used to estimate a parameter that is fixed but unknown. A Bayesian approach, however, supposes that a parameter of a distribution varies according to some
distribution of its own. Observations taken on a random variable may then permit (better) estimates of the underlying true distribution of the parameter.

Suppose that the p.d.f. of X depends on the parameter θ. It is reasonable to write this as the conditional p.d.f. of X given θ: f(x|θ). Suppose further that θ has p.d.f. given by g(θ). The p.d.f. g(θ) is a density function for the prior distribution of θ, since it describes the distribution of θ before any observations on X are known. Bayes Theorem gives us a means to find the conditional density function of θ given X, g(θ|x). This is known as the posterior p.d.f. of θ, since it is derived after the observation X = x is taken. According to Bayes Theorem, we have, for f₁(x) > 0,

    g(\theta|x) = \frac{f(x|\theta)g(\theta)}{f_1(x)},    (B.213)

where

    f_1(x) = \int_{-\infty}^{\infty} f(x|\theta)g(\theta)\, d\theta
is the marginal (unconditional) p.d.f. of X. Note that f (x|θ)g(θ) is the joint distribution of X and θ. If a random sample of size n is observed, the prior distribution on θ can be updated to incorporate the new information gained from the sample. That is, the posterior density function of θ is derived from the sample data as a conditional density of θ given that the sample (x1 , x2 , . . . , xn ) is observed:
    g(\theta|(x_1, x_2, \ldots, x_n)) = \frac{f((x_1, x_2, \ldots, x_n)|\theta)\, g(\theta)}{h(x_1, x_2, \ldots, x_n)},    (B.214)
where the marginal density h(x1, x2, . . . , xn) is found by integrating the numerator of equation (B.214) over θ. If the density of the random variable X depends upon more than one parameter, then we let Θ = (θ1, θ2, . . . , θk) be the vector of parameters governing the distribution of X, and equation (B.214) becomes (with x = (x1, x2, . . . , xn))

    g(\Theta|x) = \frac{f(x|\Theta)g(\Theta)}{h(x)}.

We find the marginal h(x) by

    h(x_1, x_2, \ldots, x_n) = \int_{\theta_1}\int_{\theta_2}\cdots\int_{\theta_k} f(x|\Theta)g(\Theta)\, d\theta_1\, d\theta_2 \ldots d\theta_k.

The apparent difficulty in finding a posterior density from sample data is that the prior density function must be known. This is a problem similar to that addressed in example B.14.2. In that example, we found the posterior probability that a patient who tests positive for a disease actually has the disease. We used prior probabilities that were based on historical data, and so were informed priors. We considered a second case for which no historical data were available, so that noninformed priors had to be selected, and we got a posterior probability very different from that in the original (informed) case. We are confronted by the same problem when looking for posterior densities. We must specify a prior distribution g(θ) that best describes the distribution of θ based on available knowledge. The available knowledge in fact may be only expert opinions and "best" guesses.
B.14.4
Example: Priors for Failure Rate of a Poisson Process
Suppose we have a device which fails at a rate of θ according to a Poisson process. If X is the number of failures in an interval of length t, then X is a Poisson random variable with probability function P [X = x] = f (x) =
e−θt (θt)x . x!
In the Bayesian setting, we write this as the conditional probability function f (x|θ) and suppose that θ has prior density g(θ). What we seek in g(θ) is a density function that uses all of the prior knowledge (or prior belief) of the distribution of the failure rate, θ. If we had a pretty good idea that θ would not exceed some maximum value, say θmax = b, then we might assume that θ is distributed unif ormly over the interval, (0, b). Then 1 for 0 < θ < b b = 0 otherwise.
g(θ) =
© 2002 by Chapman & Hall/CRC
bayesian statistics
417
Using (B.213), (e−θt (θt)x /x!)(1/θmax )
&b
g(θ|x) =
−θt (θt)x )/x!)(1/b)dθ 0 ((e
(e−θt θx )(1/b)
=
(1/b)
&b 0
e−θt θx dθ
(e−θt θx )
&b
=
0
e−θt θx dθ
.
If we make the change of variable y = θt in the denominator, we get =
(e−θt θx )
& bt 0
x
e−y ytx
dy t
=
e−θt θx 1 Γ(x + 1, bt) tx+1
=
tx+1 e−θt θx , Γ(x + 1, bt)
(B.215)
where Γ(x + 1, θt) is an “incomplete” gamma function. (The complete gamma function is Γ(α, ∞) = Γ(α) =
∞ 0
e−y y α−1 dy, for α > 0.)
It is clear from (B.215) that a uniform prior does not yield a uniform posterior. A more widely used prior for θ is the gamma density, G(α, β), introduced in Section B.6.2. If g(θ) is a gamma density function, then we will see that the posterior density of θ given X is also a gamma density function. For this reason, the gamma distribution is considered the “natural” choice for a prior density of θ. If X is a Poisson random variable with parameter ν = θt, then the natural conjugate prior density function of θ is a gamma density function with parameters α, β > 0, G(α, β) =
1 θα−1 e−θ/β . Γ(α)β α
Using (B.213), we find the posterior density g(θ|x) when g(θ) = G(α, β):
© 2002 by Chapman & Hall/CRC
418
appendix b. a brief introduction to stochastics
g(θ|x) =
(e−θt (θt)x /x!)g(θ) (e−θt (θt)x /x!)g(θ)dθ
&∞ 0
−θ/β α−1
=
=
θ (e−θt (θ)x ) e Γ(α)β α
1 Γ(α)β α
&∞ 0
e−θt θx e−θ/β θα−1 dθ
e−θ(t+1/β) θx+α−1 . e−θ(t+1/β) θx+α−1 dθ
&∞ 0
Making the change of variable y = θ(t + β1 ) in the denominator, we get,
∞ 0
e−θ(t+1/β) θx+α−1 dθ = = =
∞ 0
e−y (
y t+
1 β
)x+α−1
∞
1
0
(t + β1 )x+α
dy (t + β1 )
e−y y x+α−1 dy
Γ(x + α) . (t + β1 )x+α
(Be careful to note that this is not quite the marginal distribution of X, since we cancelled common factors before evaluating the denominator.) Now, the posterior density is −θ(t+ 1 )
β θ x+α−1 e g(θ|x) = , Γ(x + α)(t + β1 )−(x+α)
which is a gamma density function with parameters αnew = α + x and βnew = (t + β1 )−1 . In this case, we can see how the data truly update the prior density in the sense that it does not change its functional form. Recall that if θ is G(α, β), then E[θ] = αβ, and V [θ] = αβ 2 . If we have some estimates of the prior mean and prior variance, then the parameters can be chosen by solving the system αβ = (mean estimate) αβ 2 = (variance estimate) .
© 2002 by Chapman & Hall/CRC
419
bayesian statistics
Since it may be difficult to pin down a best guess of the variance, two other methods might be used with better results. These are described fully in Martz and Waller [2], and were first presented in the papers, Martz and Waller [1] and Waller et al. [3]. The first method, from Martz and Waller [2], requires that an expert provide an upper limit (U L) and a lower limit (LL) for the failure rate, θ, such that 1 − po P [θ < LL] = P [θ > U L] = , 2 where po is one of the values, .80, .90, or .95. (These are usual choices, since po = .80 gives 10th and 90th percentiles for LL and U L respectively, po = .90 gives the 5th and 95th percentiles, and po = .95 gives the 2.5th and 97.5th percentiles.) In general, LL is the 50(1-po )th percentile, and U L is the 50(1 + po )th percentile. Then P [LL < θ < U L] =
UL
g(θ; α, β)dθ
LL
=
UL LL
1 θα−1 e−θ/β dθ = po , (B.216) Γ(α)β α
or
LL
g(θ; α, β)dθ =
0
∞
g(θ; α, β)dθ
UL
=
1 − po . 2
By setting LL/β = 1, equation (B.216) can be rewritten as
UL
1 θα−1 e−θ/(LL) dθ = po , Γ(α)(LL)α
LL
and making the change of variable, y = θ/LL, we get the integral
U L/LL 1
1 (LLy)α−1 e−y LLdy = po , Γ(α)(LL)α
which is equal to
U L/LL 1
© 2002 by Chapman & Hall/CRC
1 α−1 −y e dy = po . y Γ(α)
(B.217)
420
appendix b. a brief introduction to stochastics
The equation (B.217) is numerically solved for α. Once a value, αo , is determined, a temporary lower limit, LLo , is set, and equation (B.216) is solved for a temporary value of β, bo . Then since β is a scale parameter, it can be found for any other LL from β LL bo ⇒β= bo . = LL LLo LLo A graph of αo versus log(U L/LL) and a table of bo versus αo and po are given in ([2], pp. 700-705). The second method, presented in [3], requires that any two percentiles, θ1 and θ2 , are specified, so that P [θ < θ1 ] = p1 , and P [θ < θ2 ] = p2 . These give two equations in two unknowns (αo , βo ):
θ1 0
θ2 0
1 e−x/β xα−1 dx = p1 Γ(α)β α 1 e−x/β xα−1 dx = p2 . Γ(α)β α
(B.218)
These are solved simultaneously for (α, β). A pair which simultaneously satisfies (B.218) is found by overlaying graphs of αi and βi for specified values θi and pi , (i = 1, 2). Such graphs are found in [3], and ([2], 707-712).
References [1] Martz, Harry F. and Waller, R. A.(1979). “A Bayesian Zero-Failures (BAZE) Reliability Analysis,” Journal of Quality Technology, v.11., pp. 128-138. [2] Martz, H. F. and Waller, R. A. (1982). Bayesian Reliability Analysis, New York: John Wiley & Sons. [3] Waller, R. A., Johnson, M.M., Waterman, M.S. and Martz, H.F. (1977). “Gamma Prior Distribution Selection for Bayesian Analysis of Failure Rate and Reliability,” in Nuclear Systems Reliability Engineering and Risk Assessment, Philadelphia: SIAM, pp. 584-606.
© 2002 by Chapman & Hall/CRC
Appendix C
Statistical Tables 1. 2. 3. 4. 5.
Table Table Table Table Table
of of of of of
the Normal Distribution the Chi-Square Distribution Student’s t Distribution the F Distribution with α = .05 the F Distribution with α = .01
421
© 2002 by Chapman & Hall/CRC
422
appendix c
C.1. Table of the Normal Distribution
Values of √1
&z
2π
z . 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9
. 0 .50000 .53983 .57926 .61791 .65542 .69146 .72575 .75804 .78814 .81594 .84134 .86433 .88493 .90320 .91924 .93319 .94520 .95543 .96407 .97128 .97725 .98214 .98610 .98928 .99180 .99379 .99534 .99653 .99744 .99813 .99865 .99903 .99931 .99952 .99966 .99977 .99984 .99989 .99993 .99995
.01 .50399 .54380 .58317 .62172 .65910 .69497 .72907 .76115 .79103 .81859 .84375 .86650 .88686 .90490 .92073 .93448 .94630 .95637 .96485 .97193 .97778 .98257 .98645 .98956 .99202 .99396 .99547 .99664 .99752 .99819 .99869 .99906 .99934 .99953 .99968 .99978 .99985 .99990 .99993 .99995
© 2002 by Chapman & Hall/CRC
.02 .50798 .54776 .58706 .62552 .66276 .69847 .73237 .76424 .79389 .82121 .84614 .86864 .88877 .90658 .92220 .93574 .94738 .95728 .96562 .97257 .97831 .98300 .98679 .98983 .99224 .99413 .99560 .99674 .99760 .99825 .99874 .99910 .99936 .99955 .99969 .99978 .99985 .99990 .99993 .99996
.03 .51197 .55172 .59095 .62930 .66640 .70194 .73565 .76730 .79673 .82381 .84849 .87076 .89065 .90824 .92364 .93699 .94845 .95818 .96638 .97320 .97882 .98341 .98713 .99010 .99245 .99430 .99573 .99683 .99767 .99831 .99878 .99913 .99938 .99957 .99970 .99979 .99986 .99990 .99994 .99996
.04 .51595 .55567 .59483 .63307 .67003 .70540 .73891 .77035 .79955 .82639 .85083 .87286 .89251 .90988 .92507 .93822 .94950 .95907 .96712 .97381 .97932 .98382 .98745 .99036 .99266 .99446 .99585 .99693 .99774 .99836 .99882 .99916 .99940 .99958 .99971 .99980 .99986 .99991 .99994 .99996
−∞
2 −t 2
e
.05 .51994 .55962 .59871 .63683 .67364 .70884 .74215 .77337 .80234 .82894 .85314 .87493 .89435 .91149 .92647 .93943 .95053 .95994 .96784 .97441 .97982 .98422 .98778 .99061 .99286 .99461 .99598 .99702 .99781 .99841 .99886 .99918 .99942 .99960 .99972 .99981 .99987 .99991 .99994 .99996
dt .06 .52392 .56356 .60257 .64058 .67724 .71226 .74537 .77637 .80511 .83147 .85543 .87698 .89617 .91309 .92785 .94062 .95154 .96080 .96856 .97500 .98030 .98461 .98809 .99086 .99305 .99477 .99609 .99711 .99788 .99846 .99889 .99921 .99944 .99961 .99973 .99981 .99987 .99992 .99994 .99996
.07 .52790 .56749 .60642 .64431 .68082 .71566 .74857 .77935 .80785 .83398 .85769 .87900 .89796 .91466 .92922 .94179 .95254 .96164 .96926 .97558 .98077 .98500 .98840 .99111 .99324 .99492 .99621 .99720 .99795 .99851 .99893 .99924 .99946 .99962 .99974 .99982 .99988 .99992 .99995 .99996
.08 .53188 .57142 .61026 .64803 .68439 .71904 .75175 .78230 .81057 .83646 .85993 .88100 .89973 .91621 .93056 .94295 .95352 .96246 .96995 .97615 .98124 .98537 .98870 .99134 .99343 .99506 .99632 .99728 .99801 .99856 .99896 .99926 .99948 .99964 .99975 .99983 .99988 .99992 .99995 .99997
.09 .53586 .57535 .61409 .65173 .68793 .72240 .75490 .78524 .81327 .83891 .86214 .88298 .90147 .91774 .93189 .94408 .95449 .96327 .97062 .97670 .98169 .98574 .98899 .99158 .99361 .99520 .99643 .99736 .99807 .99861 .99900 .99929 .99950 .99965 .99976 .99983 .99989 .99992 .99995 .99997
423
statistical tables
C.2. Table of the Chi-Square Distribution
Critical Values of P = ν 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
0.100 0.016 0.211 0.584 1.064 1.610 2.204 2.833 3.490 4.168 4.865 5.578 6.304 7.042 7.790 8.547 9.312 10.085 10.865 11.651 12.443 13.240 14.041 14.848 15.659 16.473 17.292 18.114 18.939 19.768 20.599
0.250 0.102 0.575 1.213 1.923 2.675 3.455 4.255 5.071 5.899 6.737 7.584 8.438 9.299 10.165 11.037 11.912 12.792 13.675 14.562 15.452 16.344 17.240 18.137 19.037 19.939 20.843 21.749 22.657 23.567 24.478
0.500 0.455 1.386 2.366 3.357 4.351 5.348 6.346 7.344 8.343 9.342 10.341 11.340 12.340 13.339 14.339 15.338 16.338 17.338 18.338 19.337 20.337 21.337 22.337 23.337 24.337 25.336 26.336 27.336 28.336 29.336
© 2002 by Chapman & Hall/CRC
0.750 1.323 2.773 4.108 5.385 6.626 7.841 9.037 10.219 11.389 12.549 13.701 14.845 15.984 17.117 18.245 19.369 20.489 21.605 22.718 23.828 24.935 26.039 27.141 28.241 29.339 30.435 31.528 32.620 33.711 34.800
0.900 2.706 4.605 6.251 7.779 9.236 10.645 12.017 13.362 14.684 15.987 17.275 18.549 19.812 21.064 22.307 23.542 24.769 25.989 27.204 28.412 29.615 30.813 32.007 33.196 34.382 35.563 36.741 37.916 39.087 40.256
1
2ν/2 Γ(ν/2) 0.950 3.841 5.991 7.815 9.488 11.070 12.592 14.067 15.507 16.919 18.307 19.675 21.026 22.362 23.685 24.996 26.296 27.587 28.869 30.144 31.410 32.671 33.924 35.172 36.415 37.652 38.885 40.113 41.337 42.557 43.773
& χ2 ν/2−1 −x/2 e dx 0 x
0.975 5.024 7.378 9.348 11.143 12.833 14.449 16.013 17.535 19.023 20.483 21.920 23.337 24.736 26.119 27.488 28.845 30.191 31.526 32.852 34.170 35.479 36.781 38.076 39.364 40.646 41.923 43.195 44.461 45.722 46.979
0.990 6.635 9.210 11.345 13.277 15.086 16.812 18.475 20.090 21.666 23.209 24.725 26.217 27.688 29.141 30.578 32 33.409 34.805 36.191 37.566 38.932 40.289 41.638 42.980 44.314 45.642 46.963 48.278 49.588 50.892
0.995 7.879 10.597 12.838 14.860 16.750 18.548 20.278 21.955 23.589 25.188 26.757 28.300 29.819 31.319 32.801 34.267 35.718 37.156 38.582 39.997 41.401 42.796 44.181 45.559 46.928 48.290 49.645 50.993 52.336 53.672
0.998 9.550 12.429 14.796 16.924 18.907 20.791 22.601 24.352 26.056 27.722 29.354 30.957 32.535 34.091 35.628 37.146 38.648 40.136 41.610 43.072 44.522 45.962 47.391 48.812 50.223 51.627 53.023 54.411 55.792 57.167
0.999 10.828 13.816 16.266 18.467 20.515 22.458 24.322 26.124 27.877 29.588 31.264 32.909 34.528 36.123 37.697 39.252 40.790 42.312 43.820 45.315 46.797 48.268 49.728 51.179 52.620 54.052 55.476 56.892 58.301 59.703
424
appendix c
C.3. Table of Student’s t Distribution
Critical Values of P = ν 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 40 60 120 ∞
0.600 0.325 0.289 0.277 0.271 0.267 0.265 0.263 0.262 0.261 0.260 0.260 0.259 0.259 0.258 0.258 0.258 0.257 0.257 0.257 0.257 0.257 0.256 0.256 0.256 0.256 0.256 0.256 0.256 0.256 0.256 0.255 0.254 0.254 0.253
0.750 1 0.816 0.765 0.741 0.727 0.718 0.711 0.706 0.703 0.700 0.697 0.695 0.694 0.692 0.691 0.690 0.689 0.688 0.688 0.687 0.686 0.686 0.685 0.685 0.684 0.684 0.684 0.683 0.683 0.683 0.681 0.679 0.677 0.675
© 2002 by Chapman & Hall/CRC
0.900 3.078 1.886 1.638 1.533 1.476 1.440 1.415 1.397 1.383 1.372 1.363 1.356 1.350 1.345 1.341 1.337 1.333 1.330 1.328 1.325 1.323 1.321 1.319 1.318 1.316 1.315 1.314 1.313 1.311 1.310 1.303 1.296 1.289 1.282
0.950 6.314 2.920 2.353 2.132 2.015 1.943 1.895 1.860 1.833 1.812 1.796 1.782 1.771 1.761 1.753 1.746 1.740 1.734 1.729 1.725 1.721 1.717 1.714 1.711 1.708 1.706 1.703 1.701 1.699 1.697 1.684 1.671 1.658 1.645
&t 2 Γ[(ν+1)/2] √ (1 + tν )−(ν+1)/2 dt Γ(ν/2) πν −∞ 0.975 0.990 0.995 0.998 0.999 12.706 31.821 63.657 159.153 318.309 4.303 6.965 9.925 15.764 22.327 3.182 4.541 5.841 8.053 10.215 2.776 3.747 4.604 5.951 7.173 2.571 3.365 4.032 5.030 5.893 2.447 3.143 3.707 4.524 5.208 2.365 2.998 3.499 4.207 4.785 2.306 2.896 3.355 3.991 4.501 2.262 2.821 3.250 3.835 4.297 2.228 2.764 3.169 3.716 4.144 2.201 2.718 3.106 3.624 4.025 2.179 2.681 3.055 3.550 3.930 2.160 2.650 3.012 3.489 3.852 2.145 2.624 2.977 3.438 3.787 2.131 2.602 2.947 3.395 3.733 2.120 2.583 2.921 3.358 3.686 2.110 2.567 2.898 3.326 3.646 2.101 2.552 2.878 3.298 3.610 2.093 2.539 2.861 3.273 3.579 2.086 2.528 2.845 3.251 3.552 2.080 2.518 2.831 3.231 3.527 2.074 2.508 2.819 3.214 3.505 2.069 2.500 2.807 3.198 3.485 2.064 2.492 2.797 3.183 3.467 2.060 2.485 2.787 3.170 3.450 2.056 2.479 2.779 3.158 3.435 2.052 2.473 2.771 3.147 3.421 2.048 2.467 2.763 3.136 3.408 2.045 2.462 2.756 3.127 3.396 2.042 2.457 2.750 3.118 3.385 2.021 2.423 2.704 3.055 3.307 2.000 2.390 2.660 2.994 3.232 1.980 2.358 2.617 2.935 3.160 1.960 2.327 2.576 2.879 3.091
C.4. Table of the F Distribution with α = .05 1−α =
ν2 \ν1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 30 40 60 120 ∞
F Γ((ν1 + ν2 )/2)(ν1 /ν2 )ν1 /2 0
Γ(ν1 /2)Γ(ν2 /2)
xν1 /2−1 dx (1 + ν1 x/ν2 )(ν1 +ν2 )/2
Critical Values of the F Distribution when α =.05 1 161.4476 18.5128 10.1280 7.7086 6.6079 5.9874 5.5914 5.3177 5.1174 4.9646 4.8443 4.7472 4.6672 4.6001 4.5431 4.4940 4.4513 4.4139 4.3807 4.3512 4.1709 4.0847 4.0012 3.9201 3.8424
© 2002 by Chapman & Hall/CRC
2 199.5000 19 9.5521 6.9443 5.7861 5.1433 4.7374 4.4590 4.2565 4.1028 3.9823 3.8853 3.8056 3.7389 3.6823 3.6337 3.5915 3.5546 3.5219 3.4928 3.3158 3.2317 3.1504 3.0718 2.9966
3 215.7073 19.1643 9.2766 6.5914 5.4095 4.7571 4.3468 4.0662 3.8625 3.7083 3.5874 3.4903 3.4105 3.3439 3.2874 3.2389 3.1968 3.1599 3.1274 3.0984 2.9223 2.8387 2.7581 2.6802 2.6058
4 224.5832 19.2468 9.1172 6.3882 5.1922 4.5337 4.1203 3.8379 3.6331 3.4780 3.3567 3.2592 3.1791 3.1122 3.0556 3.0069 2.9647 2.9277 2.8951 2.8661 2.6896 2.6060 2.5252 2.4472 2.3728
5 230.1619 19.2964 9.0135 6.2561 5.0503 4.3874 3.9715 3.6875 3.4817 3.3258 3.2039 3.1059 3.0254 2.9582 2.9013 2.8524 2.8100 2.7729 2.7401 2.7109 2.5336 2.4495 2.3683 2.2899 2.2150
6 233.9860 19.3295 8.9406 6.1631 4.9503 4.2839 3.8660 3.5806 3.3738 3.2172 3.0946 2.9961 2.9153 2.8477 2.7905 2.7413 2.6987 2.6613 2.6283 2.5990 2.4205 2.3359 2.2541 2.1750 2.0995
8 238.8827 19.3710 8.8452 6.0410 4.8183 4.1468 3.7257 3.4381 3.2296 3.0717 2.9480 2.8486 2.7669 2.6987 2.6408 2.5911 2.5480 2.5102 2.4768 2.4471 2.2662 2.1802 2.0970 2.0164 1.9393
10 241.8817 19.3959 8.7855 5.9644 4.7351 4.0600 3.6365 3.3472 3.1373 2.9782 2.8536 2.7534 2.6710 2.6022 2.5437 2.4935 2.4499 2.4117 2.3779 2.3479 2.1646 2.0772 1.9926 1.9105 1.8316
20 248.0131 19.4458 8.6602 5.8025 4.5581 3.8742 3.4445 3.1503 2.9365 2.7740 2.6464 2.5436 2.4589 2.3879 2.3275 2.2756 2.2304 2.1906 2.1555 2.1242 1.9317 1.8389 1.7480 1.6587 1.5716
30 250.0951 19.4624 8.6166 5.7459 4.4957 3.8082 3.3758 3.0794 2.8637 2.6996 2.5705 2.4663 2.3803 2.3082 2.2468 2.1938 2.1477 2.1071 2.0712 2.0391 1.8409 1.7444 1.6491 1.5543 1.4602
60 252.1957 19.4791 8.5720 5.6877 4.4314 3.7398 3.3043 3.0053 2.7872 2.6211 2.4901 2.3842 2.2966 2.2229 2.1601 2.1058 2.0584 2.0166 1.9795 1.9464 1.7396 1.6373 1.5343 1.4290 1.3194
∞ 254.3017 19.4956 8.5267 5.6284 4.3654 3.6693 3.2302 2.9281 2.7072 2.5384 2.4050 2.2967 2.2070 2.1313 2.0664 2.0102 1.9610 1.9175 1.8787 1.8438 1.6230 1.5098 1.3903 1.2553 1.0000
C.5. Table of the F Distribution with α = .01 1−α=
ν2 \ν1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 30 40 60 120 ∞
F Γ((ν1 + ν2 )/2)(ν1 /ν2 )ν1 /2
Γ(ν1 /2)Γ(ν2 /2)
0
xν1 /2−1 dx (1 + ν1 x/ν2 )(ν1 +ν2 )/2
Critical Values of the F Distribution when α =.01 1
2
4052 98.5025 34.1162 21.1977 16.2582 13.7450 12.2464 11.2586 10.5614 10.0443 9.6460 9.3302 9.0738 8.8616 8.6831 8.5310 8.3997 8.2854 8.1849 8.0960 7.5625 7.3141 7.0771 6.8509 6.6374
4999 99 30.8165 18.0000 13.2739 10.9248 9.5466 8.6491 8.0215 7.5594 7.2057 6.9266 6.7010 6.5149 6.3589 6.2262 6.1121 6.0129 5.9259 5.8489 5.3903 5.1785 4.9774 4.7865 4.6073
© 2002 by Chapman & Hall/CRC
3 5403 99.1662 29.4567 16.6944 12.0600 9.7795 8.4513 7.5910 6.9919 6.5523 6.2167 5.9525 5.7394 5.5639 5.4170 5.2922 5.1850 5.0919 5.0103 4.9382 4.5097 4.3126 4.1259 3.9491 3.7836
4 5624 99.2494 28.7099 15.9770 11.3919 9.1483 7.8466 7.0061 6.4221 5.9943 5.6683 5.4120 5.2053 5.0354 4.8932 4.7726 4.6690 4.5790 4.5003 4.4307 4.0179 3.8283 3.6490 3.4795 3.3210
5 5763 99.2993 28.2371 15.5219 10.9670 8.7459 7.4604 6.6318 6.0569 5.6363 5.3160 5.0643 4.8616 4.6950 4.5556 4.4374 4.3359 4.2479 4.1708 4.1027 3.6990 3.5138 3.3389 3.1735 3.0191
6 5858 99.3326 27.9107 15.2069 10.6723 8.4661 7.1914 6.3707 5.8018 5.3858 5.0692 4.8206 4.6204 4.4558 4.3183 4.2016 4.1015 4.0146 3.9386 3.8714 3.4735 3.2910 3.1187 2.9559 2.8038
8 5981 99.3742 27.4892 14.7989 10.2893 8.1017 6.8400 6.0289 5.4671 5.0567 4.7445 4.4994 4.3021 4.1399 4.0045 3.8896 3.7910 3.7054 3.6305 3.5644 3.1726 2.9930 2.8233 2.6629 2.5130
10 6055 99.3992 27.2287 14.5459 10.0510 7.8741 6.6201 5.8143 5.2565 4.8491 4.5393 4.2961 4.1003 3.9394 3.8049 3.6909 3.5931 3.5082 3.4338 3.3682 2.9791 2.8005 2.6318 2.4721 2.3227
20 6208 99.4492 26.6898 14.0196 9.5526 7.3958 6.1554 5.3591 4.8080 4.4054 4.0990 3.8584 3.6646 3.5052 3.3719 3.2587 3.1615 3.0771 3.0031 2.9377 2.5487 2.3689 2.1978 2.0346 1.8801
30 6260 99.4658 26.5045 13.8377 9.3793 7.2285 5.9920 5.1981 4.6486 4.2469 3.9411 3.7008 3.5070 3.3476 3.2141 3.1007 3.0032 2.9185 2.8442 2.7785 2.3860 2.2034 2.0285 1.8600 1.6983
60 6313 99.4825 26.3164 13.6522 9.2020 7.0567 5.8236 5.0316 4.4831 4.0819 3.7761 3.5355 3.3413 3.1813 3.0471 2.9330 2.8348 2.7493 2.6742 2.6077 2.2079 2.0194 1.8363 1.6557 1.4752
∞ 6365 99.4991 26.1263 13.4642 9.0215 6.8811 5.6506 4.8599 4.3118 3.9111 3.6062 3.3648 3.1695 3.0080 2.8723 2.7565 2.6565 2.5692 2.4923 2.4240 2.0079 1.8062 1.6023 1.3827 1.0476