Multivariate Statistical Process Control with Industrial Applications
ASA-SIAM Series on Statistics and Applied Probability The ASA-SIAM Series on Statistics and Applied Probability is published jointly by the American Statistical Association and the Society for Industrial and Applied Mathematics. The series consists of a broad spectrum of books on topics in statistics and applied probability. The purpose of the series is to provide inexpensive, quality publications of interest to the intersecting membership of the two societies.
Editorial Board

Robert N. Rodriguez, SAS Institute, Inc., Editor-in-Chief
Janet P. Buckingham, Southwest Research Institute
Richard K. Burdick, Arizona State University
James A. Calvin, Texas A&M University
Katherine Bennett Ensor, Rice University
Robert L. Mason, Southwest Research Institute
Gary C. McDonald, General Motors R&D Center
Jane F. Pendergast, University of Iowa
Alan M. Polansky, Northern Illinois University
Paula Roberson, University of Arkansas for Medical Sciences
Dale L. Zimmerman, University of Iowa
Mason, R. L. and Young, J. C., Multivariate Statistical Process Control with Industrial Applications
Smith, P. L., A Primer for Sampling Solids, Liquids, and Gases: Based on the Seven Sampling Errors of Pierre Gy
Meyer, M. A. and Booker, J. M., Eliciting and Analyzing Expert Judgment: A Practical Guide
Latouche, G. and Ramaswami, V., Introduction to Matrix Analytic Methods in Stochastic Modeling
Peck, R., Haugh, L., and Goodman, A., Statistical Case Studies: A Collaboration Between Academe and Industry, Student Edition
Peck, R., Haugh, L., and Goodman, A., Statistical Case Studies: A Collaboration Between Academe and Industry
Barlow, R., Engineering Reliability
Czitrom, V. and Spagon, P. D., Statistical Case Studies for Industrial Process Improvement
Multivariate Statistical Process Control with Industrial Applications

Robert L. Mason
Southwest Research Institute
San Antonio, Texas

John C. Young
McNeese State University, Lake Charles, Louisiana
InControl Technologies, Inc., Houston, Texas

Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania
American Statistical Association, Alexandria, Virginia
Copyright © 2002 by the American Statistical Association and the Society for Industrial and Applied Mathematics.

10 9 8 7 6 5 4 3 2 1

All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.

Library of Congress Cataloging-in-Publication Data

Mason, Robert L., 1946-
  Multivariate statistical process control with industrial applications / Robert L. Mason, John C. Young.
    p. cm. -- (ASA-SIAM series on statistics and applied probability)
  Includes bibliographical references and index.
  ISBN 0-89871-496-6
  1. Process control--Statistical methods. I. Young, John C., 1942- II. Title. III. Series.
TS156.8 .M348 2001
658.5'62--dc21
2001034145
is a registered trademark. Windows and Windows NT are registered trademarks of Microsoft Corporation. QualStat is a trademark of InControl Technologies, Inc. The materials on the CD-ROM are for demonstration only and expire after 90 days of use. These materials are subject to the same copyright restrictions as hardcopy publications. No warranties, expressed or implied, are made by the publisher, authors, and their employers that the materials contained on the CD-ROM are free of error. You are responsible for reading, understanding, and adhering to the licensing terms and conditions for each software program contained on the CD-ROM. By using this CD-ROM, you agree not to hold any vendor or SIAM responsible, or liable, for any problems that arise from use of a vendor's software.
To Carmen and Pam
Contents

Preface

1  Introduction to the T2 Statistic
   1.1  Introduction
   1.2  Univariate Control Procedures
   1.3  Multivariate Control Procedures
   1.4  Characteristics of a Multivariate Control Procedure
   1.5  Summary

2  Basic Concepts about the T2 Statistic
   2.1  Introduction
   2.2  Statistical Distance
   2.3  T2 and Multivariate Normality
   2.4  Student t versus Hotelling's T2
   2.5  Distributional Properties of the T2
   2.6  Alternative Covariance Estimators
   2.7  Summary
   2.8  Appendix: Matrix Algebra Review
        2.8.1  Vector and Matrix Notation
        2.8.2  Data Matrix
        2.8.3  The Inverse Matrix
        2.8.4  Symmetric Matrix
        2.8.5  Quadratic Form
        2.8.6  Wishart Distribution

3  Checking Assumptions for Using a T2 Statistic
   3.1  Introduction
   3.2  Assessing the Distribution of the T2
   3.3  The T2 and Nonnormal Distributions
   3.4  The Sampling Distribution of the T2 Statistic
   3.5  Validation of the T2 Distribution
   3.6  Transforming Observations to Normality
   3.7  Distribution-Free Procedures
   3.8  Choice of Sample Size
   3.9  Discrete Variables
   3.10 Summary
   3.11 Appendix: Confidence Intervals for UCL

4  Construction of Historical Data Set
   4.1  Introduction
   4.2  Planning
   4.3  Preliminary Data
   4.4  Data Collection Procedures
   4.5  Missing Data
   4.6  Functional Form of Variables
   4.7  Detecting Collinearities
   4.8  Detecting Autocorrelation
   4.9  Example of Autocorrelation Detection Techniques
   4.10 Summary
   4.11 Appendix
        4.11.1  Eigenvalues and Eigenvectors
        4.11.2  Principal Component Analysis

5  Charting the T2 Statistic in Phase I
   5.1  Introduction
   5.2  The Outlier Problem
   5.3  Univariate Outlier Detection
   5.4  Multivariate Outlier Detection
   5.5  Purging Outliers: Unknown Parameter Case
        5.5.1  Temperature Example
        5.5.2  Transformer Example
   5.6  Purging Outliers: Known Parameter Case
   5.7  Unknown T2 Distribution
   5.8  Summary

6  Charting the T2 Statistic in Phase II
   6.1  Introduction
   6.2  Choice of False Alarm Rate
   6.3  T2 Charts with Unknown Parameters
   6.4  T2 Charts with Known Parameters
   6.5  T2 Charts with Subgroup Means
   6.6  Interpretive Features of T2 Charting
   6.7  Average Run Length (Optional)
   6.8  Plotting in Principal Component Space (Optional)
   6.9  Summary

7  Interpretation of T2 Signals for Two Variables
   7.1  Introduction
   7.2  Orthogonal Decompositions
   7.3  The MYT Decomposition
   7.4  Interpretation of a Signal on a T2 Component
   7.5  Regression Perspective
   7.6  Distribution of the T2 Components
   7.7  Data Example
   7.8  Conditional Probability Functions (Optional)
   7.9  Summary
   7.10 Appendix: Principal Component Form of T2

8  Interpretation of T2 Signals for the General Case
   8.1  Introduction
   8.2  The MYT Decomposition
   8.3  Computing the Decomposition Terms
   8.4  Properties of the MYT Decomposition
   8.5  Locating Signaling Variables
   8.6  Interpretation of a Signal on a T2 Component
   8.7  Regression Perspective
   8.8  Computational Scheme (Optional)
   8.9  Case Study
   8.10 Summary

9  Improving the Sensitivity of the T2 Statistic
   9.1  Introduction
   9.2  Alternative Forms of Conditional Terms
   9.3  Improving Sensitivity to Abrupt Process Changes
   9.4  Case Study: Steam Turbine
        9.4.1  The Control Procedure
        9.4.2  Historical Data Set
   9.5  Model Creation Using Expert Knowledge
   9.6  Model Creation Using Data Exploration
   9.7  Improving Sensitivity to Gradual Process Shifts
   9.8  Summary

10 Autocorrelation in T2 Control Charts
   10.1  Introduction
   10.2  Autocorrelation Patterns in T2 Charts
   10.3  Control Procedure for Uniform Decay
   10.4  Example of a Uniform Decay Process
         10.4.1  Detection of Autocorrelation
         10.4.2  Autoregressive Functions
         10.4.3  Estimates
         10.4.4  Examination of New Observations
   10.5  Control Procedure for Stage Decay Processes
   10.6  Summary

11 The T2 Statistic and Batch Processes
   11.1  Introduction
   11.2  Types of Batch Processes
   11.3  Estimation in Batch Processes
   11.4  Outlier Removal for Category 1 Batch Processes
   11.5  Example: Category 1 Batch Process
   11.6  Outlier Removal for Category 2 Batch Processes
   11.7  Example: Category 2 Batch Process
   11.8  Phase II Operation with Batch Processes
   11.9  Example of Phase II Operation
   11.10 Summary

Appendix. Distribution Tables

Bibliography

Index
Preface

Industry continually faces many challenges. Chief among these is the requirement to improve product quality while lowering production costs. In response to this need, much effort has been given to finding new technological tools. One particularly important development has been the advances made in multivariate statistical process control (SPC). Although univariate control procedures are widely used in industry and are likely to be part of a basic industrial training program, they are inadequate when used to control processes that are inherently multivariate. What is needed is a methodology that allows one to monitor the relationships existing among and between the process variables. The T2 statistic provides such a procedure.

Unfortunately, the area of multivariate SPC can be confusing and complicated for the practitioner who is unfamiliar with multivariate statistical techniques. Limited help comes from journal articles on the subject, as they usually include only theoretical developments and a limited number of data examples. Thus, the practitioner is not well prepared to face the problems encountered when applying a multivariate procedure to a real process situation. These problems are further compounded by the lack of adequate computer software to do the required complex computations.

The motivation for this book came from facing these problems in our data consulting and finding only a limited array of solutions. We soon decided that there was a strong need for an applied text on the practical development and application of multivariate control techniques. We also felt that limiting discussions to strategies based on Hotelling's T2 statistic would be of most benefit to practitioners. In accomplishing this goal, we decided to minimize the theoretical results associated with the T2 statistic, as well as the distributional properties that describe its behavior. These results can be found in the many excellent texts that exist on the theory of multivariate analysis and in the numerous published papers pertaining to multivariate SPC. Instead, our major intent is to present to the practitioner a modern and comprehensive overview on how to establish and operate an applied multivariate control procedure based on our conceptual view of Hotelling's T2 statistic.

The intended audience for this book is professionals or students involved with multivariate quality control. We have assumed the reader is knowledgeable about univariate statistical estimation and control procedures (such as Shewhart charts) and is familiar with certain probability functions, such as the normal, chi-square, t, and F distributions. Some exposure to regression analysis also would be helpful.
Although an understanding of matrix algebra is a prerequisite in studying any area of multivariate analysis, we have purposely downplayed this requirement. Instead, appendices are included in various chapters in order to provide the minimal material on matrix algebra needed for our presentation of the T2 statistic.

As might be expected, the T2 control procedure requires the use of advanced statistical software to perform the numerous computations. All the T2 charts presented in this text were generated using the QualStat™ software package, which is a product of InControl Technologies, Inc. On the inside back cover of the book we have included a free demonstration version of this software. You will find that the full-licensed version of the package is easy to apply and provides an extended array of graphical and statistical summaries. It also contains modules for use with most of the procedures discussed in this book.

This text contains 11 chapters. These have been designed to progress in the same chronological order as one might expect to follow when actually constructing a multivariate control procedure. Each chapter has numerous data examples and applications to assist the reader in understanding how to apply the methodology. A brief description of each chapter is given below.

Chapter 1 provides the incentive and intuitive grasp of statistical distance and presents an overview of the T2 as the ideal control statistic for multivariate processes. Chapter 2 supplements this development by providing the distributional properties of the T2 statistic as they apply to multivariate SPC. Distributional results are stated, and data examples are given that illustrate their use when applied to control strategy. Chapter 3 provides methods for checking the distributional assumptions pertaining to the use of the T2 as a control statistic. When distributional assumptions cannot be satisfied, alternative procedures are introduced for determining the empirical distribution of the T2 statistic. Chapters 4 and 5 discuss the construction of the historical data set and T2 charting procedures for a Phase I operation. This includes obtaining the preliminary data, analyzing data problems such as collinearity and autocorrelation, and purging outliers. Chapter 6 addresses T2 charting procedures and signal detection for a Phase II operation. Various forms of the T2 statistic also are considered. Signal interpretation, based on the MYT (Mason-Young-Tracy) decomposition, is presented in Chapter 7 for the bivariate case. We show how a signal can be isolated to a particular process variable or to a group of variables. In Chapter 8 these procedures are extended to cases involving two or more variables. Procedures for increasing the sensitivity of the T2 statistic to small consistent process changes are covered in Chapter 9. A T2 control procedure for autocorrelated observations is developed in Chapter 10, and the concluding chapter, Chapter 11, addresses methods for monitoring batch processes using the T2 statistic.

We would like to express our sincere thanks to PPG Industries, Inc., especially the Chemicals and Glass Divisions, for providing numerous applications of the T2 control procedure for use in this book. From PPG in Lake Charles, LA, we thank Joe Hutchins of the Telphram Development Project; Chuck Stewart and Tommy Hampton from Power Generation; John Carpenter and Walter Oglesby from Vinyl; Brian O'Rourke from Engineering; and Tom Hatfield, Plant Quality Coordinator.
The many conversations with Bob Jacobi and Tom Jeffery (retired) were most
helpful in the initial stages of development of a T2 control procedure. A special thanks also is due to Cathy Moyer and Dr. Chuck Edge of PPG's Glass Technical Center in Harmarville, PA; Frank Larmon, ABB Industrial Systems, Inc.; Bob Smith, LA Pigment, Inc.; and Stan Martin (retired), Center for Disease Control.

Professor Youn-Min Chou of the University of Texas at San Antonio, and Professor Nola Tracey McDaniel of McNeese State University, our academic colleagues, have been most helpful in contributing to the development of the T2 control procedure. We also wish to acknowledge Mike Marcon and Dr. James Callahan of InControl Technologies, Inc., whose contributions to the application of the T2 statistic have been immeasurable.

We ask ourselves, where did it all begin? In our case, the inspiration can be traced to the same individual, Professor Anant M. Kshirsagar. It is not possible for us to think about a T2 statistic without incurring the fond memories of his multivariate analysis classes while we were in graduate school together at Southern Methodist University many years ago.

Finally, we wish to thank our spouses, Carmen and Pam. A project of this magnitude could not have been completed without their continued love and support.

Robert L. Mason
John C. Young
Chapter 1
Introduction to the T2 Statistic
The Saga of Old Blue

Imagine that you have recently been promoted to the position of performance engineer. You welcome the change, since you have spent the past few years as one of the process engineers in charge of a reliable processing unit labeled "Old Blue." You know every "nook and cranny" of the processing unit and especially what to do to unclog a feed line. Each valve is like a pet to you, and the furnace is your "baby." You know all the shift operators, having taught many and learned from others. This operational experience formed the basis for your recent promotion, since in order to be a performance engineer, one needs a thorough understanding of the processing unit. You are confident that your experience will serve you well.

With this promotion, you soon discover that your job responsibilities have changed. No longer are you in charge of meeting daily production quotas, assigning shifts to the operators, solving personnel problems, and fighting for your share of the maintenance budget. Your new position demands that you adopt the skills of a detective and search for methods to improve unit performance. This is great, since over time you have developed several ideas that should lead to process improvement. One of these is the recently installed electronic data collector that instantly provides observations on all variables associated with the processing unit. Some other areas of responsibility for you include identifying the causes of upset conditions and advising operations. When upsets occur, quick solutions must be found to return the unit to normal operational conditions. With your understanding of the unit, you can quickly and efficiently address all such problems. The newly created position of performance engineer fulfills your every dream.

But suddenly your expectations of success are shattered. The boss anxiously confronts you and explains that during the weekend, upset conditions occurred with Old Blue. He gives you a diskette containing process data,
retrieved from the data net for both a good-run time period and the upset time period. He states that the operations staff is demanding that the source of the problem be identified. You immediately empathize with them. Having lived through your share of unit upsets, you know no one associated with the unit will be happy until production is restored and the problem is resolved. There is an entire megabyte of data stored on the diskette, and you must decide how to analyze it to solve this problem. What are your options?

You import the data file to your favorite spreadsheet and observe that there are 10,000 observations on 35 variables. These variables include characteristics of the feedstock, as well as observations on the process, production, and quality variables. The electronic data collector has definitely done its job. You remember a previous upset condition on the unit that was caused by a significant change in the feedstock. Could this be the problem? You scan the 10,000 observations, but there are too many numbers and variables to see any patterns. You cannot decipher anything.

The thought strikes you that a picture might be worth 1,000 observations. Thus, you begin constructing graphs of the observations on each variable plotted against time. Is this the answer? Changes in the observations on a variable should be evident in its time-sequence graph. With 35 variables and 10,000 observations, this may involve a considerable time investment, but it should be worthwhile. You readily recall that your college statistics professor used to emphasize that graphical procedures were an excellent technique for gaining data insight.

You initially construct graphs of the feedstock characteristics. Success eludes you, however, and nothing is noted in the examination of these plots. All the input components are consistent over the entire data set, including over both the prior good-run period and the upset period. From this analysis, you conclude that the problem must be associated with the 35 process variables. However, the new advanced process control (APC) system was working well when you left the unit. The multivariable system keeps all operational variables within their prescribed operational range. If a variable exceeded the range, an alarm would have signaled this and the operator would have taken corrective action. How could the problem be associated with the process when all the variables are within their operational ranges?

Having no other options, you decide to go ahead and examine the process variables. You recall from working with the control engineers in the installation of the APC system that they had been concerned with how the process variables vary together. They had emphasized studying and understanding the correlation structure of these variables, and they had noted that the variables did not move independently of one another, but as a group. You decide to examine scatter plots of the variables as well as time-sequence plots. Again, you recall the emphasis placed on graphical techniques by that old statistics professor. What was his name?

You begin the laborious task, soon realizing the enormity of the job. From experience, it is easy to identify the most important control variables and the
fundamental relationships existing between and among the variables. Perhaps scatter plots of the most influential variables will suffice in locating the source of the problem. However, you realize that if you do not discover the right combination of variables, you will never find the source of the problem.

You are interrupted from your work by the reappearance of your boss, who inquires about your progress. The boss states he needs an immediate success story to justify your newly created position of performance engineer. There is a staff meeting in a few days, and he would like to present the results of this work as the success story. More pressure to succeed. You decide that you cannot disappoint your friends in the processing unit, nor your boss. You feel panic creeping close to the edge of your consciousness. A cup of coffee restores calm. You reevaluate your position.

How does one locate the source of the problem? There must be a quicker, easier way than the present approach. We have available a set of data consisting of 10,000 observations on 35 variables. The solution must lie with the use of statistics. You slowly begin to recall the courses you took in college, which included basic courses in statistical procedures and statistical process control (SPC). Would those work here? Yes, we can compare the data from the good-run period to the data from the upset period. How is this done for a group of 35 variables? This was the same comparison made in SPC. The good-run period data served as a baseline and the other operational data were compared to it. Signals occurred when present operational data did not agree with the baseline data. Your excitement increases as you remember more. What was that professor's name, old Dr. or Dr. Old?

Your coursework in SPC covered only situations involving 1 variable. You need a procedure that considers all 35 related variables at one time and indicates which variable or group of variables contributes to the signal. A procedure such as this would offer a solution to your problem. You rush to the research center to look for a book that will instruct you on how to solve problems in multivariate SPC.
1.1 Introduction
The problem confronting the young engineer in the above situation is common in industry. Many dollars have been invested in electronic data collectors because of the realization that the answer to most industrial problems is contained in the observations. More money has been spent on multivariable control or APC systems. These units are developed and installed to ensure the containment of process variables within prescribed operational ranges. They do an excellent job in reducing overall system variation, as they restrict the operational range of the variables. However, an APC system does not guarantee that a process will satisfy a set of baseline conditions, and it cannot be used to determine causes of system upsets. As our young engineer will soon realize, a multivariate SPC procedure is needed to work in unison with the electronic data collector and the APC system. Such a
Figure 1.1: Shewhart chart of a process variable.
procedure will signal process upsets and, in many cases, can be used to pinpoint precursors of the upset condition before control is lost. When signals are identified, the procedure allows for the decomposition of the signal in terms of the variables that contributed to it. Such a system is the main subject of this book.
1.2 Univariate Control Procedures
Walter A. Shewhart, in a Bell Telephone Laboratories memorandum dated May 16, 1924, presented the first sketch of a univariate control chart (e.g., see Duncan (1986)). Although his initial chart was for monitoring the percentage defective in a production process, he later extended his idea to control charts for the average and standard deviation of a process. Figure 1.1 shows an example of a Shewhart chart designed to monitor the mean, X̄, of a group of process observations taken on a process variable at the same time point. Drawn on the chart are the upper control limit (UCL) and the lower control limit (LCL).

Shewhart charts are often used in detecting unusual changes in variables that are independent and thus not influenced by the behavior of other variables. These changes occur frequently in industrial settings. For example, consider the main laboratory of a major chemical industry. Many duties are assigned to this facility. These may range from research and development to maintaining the quality of production. Many of the necessary chemical determinations are made using various types of equipment. How do we monitor the accuracy (i.e., closeness to a target value) of the determination made by the equipment? Often a Shewhart chart, constructed from a set of baseline data, is utilized. Suppose a measurement on a sample of known concentration is taken. If the result of the sample falls within the control limits of the Shewhart chart, it is assumed that the equipment is performing normally.
Figure 1.2: Model of a production unit.

Otherwise, the equipment is fixed and recalibrated and a new Shewhart chart is established. It may be argued that a specific chemical determination is dependent on other factors such as room temperature and humidity. Although these factors can influence certain types of equipment, compensation is achieved by using temperature and humidity controls. Thus, this influence becomes negligible and determinations are treated as observations on an independent variable.
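A minimal sketch in Python of the kind of Shewhart-style check just described: baseline subgroups set the limits, and a new subgroup mean is compared with them. The data are simulated, and the 3-sigma limit convention and the rough within-subgroup sigma estimate are assumptions of this sketch, not prescriptions from the text.

```python
# Minimal Shewhart X-bar chart sketch: baseline subgroups set the limits,
# and a new subgroup mean is flagged when it falls outside them.
import numpy as np

rng = np.random.default_rng(0)
baseline = rng.normal(loc=75.0, scale=2.0, size=(20, 4))   # 20 subgroups of size 4

n = baseline.shape[1]
grand_mean = baseline.mean()                       # center line
sigma_hat = baseline.std(axis=1, ddof=1).mean()    # rough within-subgroup sigma estimate
ucl = grand_mean + 3 * sigma_hat / np.sqrt(n)      # upper control limit
lcl = grand_mean - 3 * sigma_hat / np.sqrt(n)      # lower control limit

new_subgroup = np.array([78.2, 79.5, 80.3, 79.1])  # hypothetical new sample to check
xbar = new_subgroup.mean()
print(f"center={grand_mean:.2f}  LCL={lcl:.2f}  UCL={ucl:.2f}  new mean={xbar:.2f}")
print("signal" if not (lcl <= xbar <= ucl) else "in control")
```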
1.3 Multivariate Control Procedures
There are many industrial settings where process performance is based on the behavior of a set of interrelated variables. Production units, such as the one illustrated in Figure 1.2, are excellent examples. They are designed to change an input to some specific form of output. For example, we may wish to change natural gas, a form of energy, to an alternate state such as steam or electricity. Or, we may wish to convert brine (salt water) to caustic soda and chlorine gas; sand to silica or glass; or hydrochloric acid and ethylene to ethylene dichloride, which in turn is changed to vinyl chloride. Our interest lies in the development of a control procedure that will detect unusual occurrences in such variables. Why not use univariate control procedures for these situations? To answer this question, we first need to describe the differences between univariate and multivariate processes. Although the biggest distinction that is evident to the practitioner is the number of variables, there are more important differences. For example, the characteristics or variables of a multivariate process often are interrelated and form a correlated set. Since the variables do not behave independently of one another, they must be examined together as a group and not separately. Multivariate processes are inherent to many industries, such as the chemical industry, where input is being chemically altered to produce a particular output. A good example is the production of chlorine gas and caustic soda. The input variable is saturated brine (water saturated with salt). Under proper conditions, some of the brine is decomposed by electrolysis to chlorine gas; caustic soda is formed within the brine and is later separated. The variables of interest are the components produced by the electrolysis process. All are related to the performance of the process. In
addition, many of the variables follow certain mathematical relationships and form a highly correlated set.

The correlation among the variables of a multivariate system may be due to either association or causation. Correlation due to association in a production unit often occurs because of the effects of some unobservable variable. For example, the blades of a gas or steam turbine will become contaminated (dirty) from use over time. Although the accumulation of dirt is not measurable, megawatt production will show a negative correlation with the length of time from the last cleaning of the turbine. The correlation between megawatt production and length of time since last cleaning is one of association. An example of a correlation due to causation is the relationship between temperature and pressure since an increase in the temperature will produce a pressure change. Such correlation inhibits examining each variable by univariate procedures unless we take into account the influence of the other variable.

Multivariate process control is a methodology, based on control charts, that is used to monitor the stability of a multivariate process. Stability is achieved when the means, variances, and covariances of the process variables remain stable over rational subgroups of the observations. The analysis involved in the development of multivariate control procedures requires one to examine the variables relative to the relationships that exist among them. To understand how this is done, consider the following example.

Suppose we are analyzing data consisting of four sets of temperature and pressure readings. The coordinates of the points are given as
where the first coordinate value is the temperature and the second value is the pressure. These four data points, as well as the mean point of (175, 75), are plotted in the scatter plot given in Figure 1.3. There also is a line fitted through the points and two circles of varying sizes about the mean point.

If the mean point is considered to be typical of the sample data, one form of analysis consists of calculating the distance each point is from the mean point. The distance, say D, between any two points, (a1, a2) and (b1, b2), is given by the formula

D = sqrt[(a1 − b1)² + (a2 − b2)²].

This type of distance measure is known as Euclidean, or straight-line, distance. The distance that each of our four example points is from the mean point (in order of occurrence) is computed as

D1 = 3.16,  D2 = 7.07,  D3 = 7.07,  D4 = 3.16.
From these calculations, it is seen that points 1 and 4 are located an equal distance from the mean point on a circle centered at the mean point and having a radius of 3.16. Similarly, points 2 and 3 are located at an equal distance from the mean but on a larger circle with a radius of 7.07.
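The short sketch below repeats this straight-line-distance calculation. The four temperature-pressure readings are hypothetical stand-ins (the book's actual data values are not reproduced in this excerpt), chosen only so that their distances from the mean point match the 3.16 and 7.07 values quoted above.

```python
# Straight-line (Euclidean) distance of each bivariate point from the mean
# point, as in the temperature-pressure illustration.  The four readings
# below are hypothetical stand-ins, not the book's data.
import numpy as np

points = np.array([[172.0, 74.0],
                   [180.0, 80.0],
                   [170.0, 70.0],
                   [178.0, 76.0]])
mean_point = points.mean(axis=0)          # (175, 75)

euclid = np.sqrt(((points - mean_point) ** 2).sum(axis=1))
for i, d in enumerate(euclid, start=1):
    print(f"point {i}: D = {d:.2f}")      # 3.16, 7.07, 7.07, 3.16
```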
Figure 1.3: Scatter plot illustrating straight-line distance.

There are two major criticisms of this analysis. First, the variation in the two variables has been completely ignored. From Figure 1.3, it appears that the temperature readings contain more variation than the pressure readings, but this could be due to the difference in scale between the two variables. However, in this particular case the temperature readings do contain more variation. The second criticism of this analysis is that the covariation between temperature and pressure has been ignored. It is generally expected that as temperature increases, the pressure will increase. The straight line given in Figure 1.3 depicts this relationship. Observe that as the temperature increases along the horizontal axis, the corresponding value of the pressure increases along the vertical axis.

This poses an interesting question. Can a measure of the distance between two points be devised that accounts for the presence of a linear relationship between the corresponding variables and the difference in the variation of the variables? The answer is yes; however, the distance is statistical rather than Euclidean and is not as easy to compute. To calculate statistical distance (SD), a measure of the correlation between the variables of interest must be obtained. This is generally expressed in terms of the covariance between the variables, as covariance provides a measure of how variables vary together. For our example data, the sample covariance between temperature and pressure, denoted as s12, is computed using the formula

s12 = Σ (x1 − x̄1)(x2 − x̄2) / (n − 1),
where x1 represents the temperature component of the observation vector and x2 represents the pressure component. The number of sample points is given by n. The value of the sample covariance as computed from the temperature-pressure data set is 18.75. Also needed in the computation of the statistical distance is the sample variance of the individual variables. The sample variance of a variable, x, is given by

s² = Σ (x − x̄)² / (n − 1).

Figure 1.4: Scatter plot illustrating statistical distance.
The sample variances for temperature and pressure as determined from the example data are 22.67 and 17.33, respectively. Using the value of the covariance, and the values of the sample variances and the sample means of the variables, the squared statistical distance, (SD)², is computed using the formula

(SD)² = [ ((x1 − x̄1)/s1)² − 2R((x1 − x̄1)/s1)((x2 − x̄2)/s2) + ((x2 − x̄2)/s2)² ] / (1 − R²),    (1.2)

where R = s12/(s1·s2) is the sample correlation coefficient. The actual SD value is obtained by taking the principal square root of both sides of (1.2). Since (1.2) is the formula for an ellipse, the SD is sometimes referred to as elliptical distance (in contrast to straight-line distance). It also has been labeled Mahalanobis's distance, or Hotelling's T2, or simply T2. The concept of statistical distance is explored in more detail in Chapter 2. Calculating the (SD)² for each of the four points in our temperature-pressure sample produces the following results:
From this analysis it is concluded that our four data points are the same statistical distance from the mean point. This result is illustrated graphically in Figure 1.4. All four points satisfy the equation of the ellipse superimposed on the plot. From a visual perspective, this result appears to be unreasonable. It is obvious that points 1 and 4 are closer to the mean point in Euclidean distance than points 2 and 3. However, when the differences in the variation of the variables and the
relationships between the variables are considered, the statistical distances are the same. The multivariate control procedures presented in this book are developed using methods based on the above concept of statistical distance.
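As a sketch of the computation just described, the code below evaluates (SD)² for each point, once with the bivariate formula (1.2) and once with the equivalent matrix quadratic form using the inverse of the sample covariance matrix. The four points are the same hypothetical stand-ins used earlier, not the book's data, so the printed values only illustrate the idea that all four points can sit at the same statistical distance despite different Euclidean distances.

```python
# Squared statistical distance (SD)^2 of each point from the mean, computed
# two equivalent ways: with the bivariate formula (1.2) and with the matrix
# quadratic form (x - xbar)' S^{-1} (x - xbar).  Data are hypothetical.
import numpy as np

points = np.array([[172.0, 74.0],
                   [180.0, 80.0],
                   [170.0, 70.0],
                   [178.0, 76.0]])
xbar = points.mean(axis=0)
S = np.cov(points, rowvar=False)          # sample variances and covariance
s1, s2 = np.sqrt(np.diag(S))
r = S[0, 1] / (s1 * s2)                   # sample correlation coefficient

d = points - xbar
sd2_formula = ((d[:, 0] / s1) ** 2
               - 2 * r * (d[:, 0] / s1) * (d[:, 1] / s2)
               + (d[:, 1] / s2) ** 2) / (1 - r ** 2)

S_inv = np.linalg.inv(S)
sd2_matrix = np.einsum('ij,jk,ik->i', d, S_inv, d)

print(np.round(sd2_formula, 3))           # all four values are equal
print(np.round(sd2_matrix, 3))            # matrix form gives the same numbers
```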
1.4 Characteristics of a Multivariate Control Procedure
There are at least five desirable characteristics of a multivariate control procedure. These include the following:

1. The monitoring statistic should be easy to chart and helpful in identifying process trends.
2. When out-of-control points occur, it must be easy to determine the cause in terms of the contributing variables.
3. The procedure must be flexible in application.
4. The procedure needs to be sensitive to small but consistent process changes.
5. The procedure should be capable of monitoring the process both on-line as well as off-line.

A good charting method not only allows for quick signal detection but also helps in identifying process trends. By examining the plotted values of the charting statistic over time, process behavior is observed and upset conditions are identified in advance of a signal, i.e., an out-of-control point. For a clear understanding of the control procedure, interpretation needs to be global and not isolated to a particular data set. The same must be true for the detection of variable conditions that are precursors to upset or chaotic situations. A control procedure having this ability is a valuable asset.

A control procedure should work with both independent and time-dependent process observations and be applicable to both continuous processes and batch processes. It should be flexible enough for use with various forms of the control statistic, such as a sample mean or an individual observation, and work with different estimators of the internal structure of the variables.

Most industries are volume-oriented. Small changes in efficiency can be the difference in creating a profit or generating a loss. Sensitivity to small process changes is a necessary component of any multivariate control procedure.

Multivariate procedures are computationally intense. This is a necessary component since industrial processes are steadily moving toward total computerization. Recent technological advances in industrial control procedures have greatly improved the quantity and quality of available data. The use of computing hardware, such as electronic data collectors, facilitates the collection of data on a multitude of variables from all phases of production. In many situations, one may be working with a very large number of variables and thousands of observations. These are the data available to a statistical control procedure. Any control procedure must be programmable and able to interface and react with such collected online data.
Charting with the T2 Statistic

Although many different multivariate control procedures exist, it is our belief that a control procedure built on the T2 statistic possesses all the above characteristics. Like many multivariate charting statistics, the T2 is a univariate statistic. This is true regardless of the number of process variables used in computing it. However, because of its similarity to a univariate Shewhart chart, the T2 control chart is sometimes referred to as a multivariate Shewhart chart. This relationship to common univariate charting procedures facilitates the understanding of this charting method.

Signal interpretation requires a procedure for isolating the contribution of each variable and/or a particular group of variables. As with univariate control, out-of-control situations can be attributed to individual variables being outside their allowable operational range; e.g., the temperature is too high. A second cause of a multivariate signal may be attributed to a fouled relationship between two or more variables; e.g., the pressure is not where it should be for a given temperature reading. The signal interpretation procedure covered in this text is capable of separating a T2 value into independent components. One type of component determines the contribution of the individual variables to a signaling observation, while the other components check the relationships among groups of variables. This procedure is global in nature and not isolated to a particular data set or type of industry.

The T2 statistic is one of the more flexible multivariate statistics. It gives excellent performance when used to monitor independent observations from a steady-state continuous process. It also can be based on either a single observation or the mean of a subgroup of n observations. Minor adjustments in the statistic and its distribution allow the movement from one form to the other. Many industrial processes produce observations containing a time dependency. For example, process units with a decaying cycle often produce observations that can be modeled by some type of time-series function. The T2 statistic can be readily adapted to these situations and can be used to produce a time-adjusted statistic. The T2 statistic also is applicable to situations where the time correlation behaves as a step function.

We have experienced no problems in applying the T2 statistic to batch or semibatch processes with targets specified or unspecified. In the case of target specification, the T2 statistic measures the statistical distance the observed value is from the specified target. In cases where rework is possible, such as blending, components of the T2 decomposition can be used in determining the blending process.

Sensitivity to small process change is achieved with univariate control procedures, such as Shewhart charts, through applications of zonal charts with run rules. Small, consistent process changes in a T2 chart can be detected by using certain components of the decomposition of a T2 statistic. This is achieved by monitoring the residual error inherent to these terms. The detection of small process shifts is so important that a whole chapter of the text is devoted to this procedure.

An added benefit of the T2 charting procedure is the potential to do on-line experimentation that can lead to local optimization. Because of the demand of
production quotas, the creation of dangerous and hazardous conditions, extended upset recovery periods, and numerous other reasons, the use of experimental design is limited in most production units. However, one can tweak the process. Monitoring of the appropriate residual terms allows one to observe the effect of this type of experimentation almost instantaneously. In addition, the monetary value of process changes, due to new equipment or operational procedures, can be quickly determined. This aspect of a T2 control procedure has proved invaluable in many applications. Numerous software computer programs are available for performing a variety of univariate SPC procedures. However, computer packages for doing multivariate SPC are few in number. Some, such as SAS™, can be useful but require individual programming. Others, such as JMP™, a product of SAS, Inc., provide only limited multivariate SPC. The program QualStat™, a product of InControl Technologies, Inc., contains a set of procedures based entirely on the T2 statistic. This program is used extensively in this book to generate the T2 graphs and perform the T2 analyses.
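As a bare-bones illustration of the kind of charting described above, the sketch below estimates a mean vector and covariance matrix from simulated baseline data, computes the T2 value of each new observation, and compares it with an upper control limit. The data, and especially the UCL value, are placeholders chosen for the example; the appropriate limits come from the distributional results developed in the following chapters, and nothing here is the QualStat implementation.

```python
# Bare-bones T^2 signal check: estimate the mean vector and covariance matrix
# from baseline (in-control) data, then compute T^2 for each new observation
# and compare it with an upper control limit.  The data and the UCL are
# arbitrary placeholders; later chapters give the proper limits.
import numpy as np

rng = np.random.default_rng(1)
baseline = rng.multivariate_normal(mean=[170.0, 75.0],
                                   cov=[[22.7, 18.8], [18.8, 17.3]],
                                   size=100)
xbar = baseline.mean(axis=0)
S_inv = np.linalg.inv(np.cov(baseline, rowvar=False))

def t2(x):
    d = x - xbar
    return float(d @ S_inv @ d)

ucl = 10.6                                     # placeholder limit for illustration
for x in [np.array([171.0, 76.0]), np.array([180.0, 68.0])]:
    stat = t2(x)
    print(f"T2 = {stat:6.2f}  ->  {'signal' if stat > ucl else 'in control'}")
```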
1.5 Summary
Industrial process control generally involves monitoring a set of correlated variables. Such correlation confounds the interpretation of univariate procedures run on individual variables. One method of overcoming this problem is to use a Hotelling's T2 statistic. As demonstrated in our discussion, this statistic is based on the concept of statistical distance. It consolidates the information contained in a multivariate observation to a single value, namely, the statistical distance the observation is from the mean point. Desirable characteristics for a multivariate control chart include ease of application, adequate signal interpretation, flexibility, sensitivity to small process changes, and available software to use it. One multivariate charting procedure that possesses all these characteristics is the method based on the T2 statistic. In the following chapters of this book, we explore the various properties of the T2 charting procedure and demonstrate its value.
Chapter 2
Basic Concepts about the T2 Statistic
2.1 Introduction
Some fundamental concepts about the T2 statistic must be presented before we can discuss its usage in constructing a multivariate control chart. We begin with a discussion of statistical distance and how it is related to the T2 statistic. How statistical distance differs from straight-line or Euclidean distance is an important part of the coverage. Included also is a discussion of the relationship between the univariate Student t statistic and its multivariate analogue, the T2 statistic. The results lead naturally to the understanding of the probability functions used to describe the T2 statistic under a variety of different circumstances. Having knowledge of these distributions aids in determining the UCL value for a T2 chart, as well as the corresponding false alarm rate.
2.2 Statistical Distance
Hotelling (1947), in a paper on using multivariate procedures to analyze bombsight data, was among the first to examine the problem of analyzing correlated variables from a statistical control perspective. His control procedure was based on a charting statistic that he had introduced in an earlier paper (i.e., Hotelling (1931)) on the generalization of the Student t statistic. The statistic later was named in his honor as Hotelling's T2. Slightly prior to 1931, Mahalanobis (1930) proposed the use of a similar statistic, which would later become known as Mahalanobis's distance measure, for use in measuring the squared distance between two populations. Although the two statistics differ only by a constant value, the T2 form is the most popular in multivariate process control and is the main subject of this text. The following discussion provides insight into how the concept of statistical distance, as defined with the T2 statistic, is used in the development of multivariate
control procedures. The reader unfamiliar with vectors and matrices may find the definitions and details given in this chapter's appendix (section 2.8) to be helpful in understanding these results.

Suppose we denote a multivariate observation on p variables in vector form as X' = (x1, x2, ..., xp). Our main concern is in processing the information available on these p variables. One approach is to use graphical techniques, which are usually excellent for this task, but plotting points in a p-dimensional space (p > 3) is severely limited. This restriction inhibits overall viewing of the multivariate situation. Another method for examining information provided in a p-dimensional observation is to reduce the multivariate data vector to a single univariate statistic. If the resulting statistic contains information on all p variables, it can be interpreted and used in making decisions as to the status of a process. There are numerous procedures for achieving this result, and we demonstrate two of them below.

Suppose a process generates uncorrelated bivariate observations, (x1, x2), and it is desired to represent them graphically. It is common to construct a two-dimensional scatter plot of the points. Also, suppose there is interest in determining the distance a particular point is from the mean point. The distance between two points is always measured as a single number or value. This is true regardless of how many dimensions (variables) are involved in the problem. The usual straight-line (Euclidean) distance measures the distance between two points by the number of units that separate them. The squared straight-line distance, say D², between a point (x1, x2) and the population mean point (μ1, μ2) is defined as

D² = (x1 − μ1)² + (x2 − μ2)².

Note that we have taken the bivariate observation, (x1, x2), and converted it to a single number D, the distance the observation is from the mean point. If this distance, D, is fixed, all points that are the same distance from the mean point can be represented as a circle with center at the mean point and a radius of D (i.e., see Figure 2.1). Also, any point located inside the circle has a distance to the mean point less than D.

Unfortunately, the Euclidean distance measure is unsatisfactory for most statistical work (e.g., see Johnson and Wichern (1998)). Although each coordinate of an observation contributes equally to determining the straight-line distance, no consideration is given to differences in the variation of the two variables as measured by their variances, σ1² and σ2², respectively. To correct this deficiency, consider the standardized values

(x1 − μ1)/σ1   and   (x2 − μ2)/σ2,
and all points satisfying the relationship

(SD)² = ((x1 − μ1)/σ1)² + ((x2 − μ2)/σ2)².    (2.1)

The value SD, the square root of (SD)² in (2.1), is known as statistical distance. For a fixed value of SD, all points satisfying (2.1) are the same statistical distance
Figure 2.1: Region of same straight-line distance.
from the mean point. The graph of such a group of points forms an ellipse, as is illustrated in the example given in Figure 2.2. Any point inside the ellipse will have a statistical distance less than SD, while any point located outside the ellipse will have a statistical distance greater than SD. In comparing statistical distance to straight-line distance, there are some major differences to be noted. First, since standardized variables are utilized, the statistical distance is dimensionless. This is a useful property in a multivariate process since many of the variables may be measured in different units. Second, any two points on the ellipse in Figure 2.2 have the same SD but could have possibly different Euclidean distances from the mean point. If the two variables have equal variances and are uncorrelated, the statistical and Euclidean distance, apart from a constant multiplier, will be the same; otherwise, they will differ. The major difference between statistical and Euclidean distance in Figure 2.2 is that the two variables used in statistical distance are weighted inversely by their standard deviations, while both variables are equally weighted in the straight-line distance. Thus, a change in a variable with a small standard deviation will contribute more to statistical distance than a change in a variable with a large standard deviation. In other words, statistical distance is a weighted straight-line distance where more importance is placed on the variable with the smaller standard deviation to compensate for its size relative to its mean. It was assumed that the two variables in the above discussion are uncorrelated. Suppose this is not the case and that the two variables are correlated. A scatter plot of two positively correlated variables is presented in Figure 2.3. To construct a statistical distance measure to the mean of these data requires a generalization of (2.1).
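A small numerical sketch of the uncorrelated case in (2.1): two hypothetical points that are very different in straight-line distance from the mean turn out to have the same statistical distance once each coordinate is scaled by its own standard deviation. The means and standard deviations below are invented for the illustration.

```python
# Statistical distance of (2.1) for uncorrelated variables: each coordinate is
# standardized by its own standard deviation before the distance is formed.
# The means and standard deviations are hypothetical.
import numpy as np

mu = np.array([0.0, 0.0])
sigma = np.array([5.0, 1.0])            # x1 varies much more than x2

points = np.array([[5.0, 0.0],          # far in x1, but x1 has a large sigma
                   [0.0, 1.0]])         # near in x2, but x2 has a small sigma

euclid = np.sqrt(((points - mu) ** 2).sum(axis=1))
sd = np.sqrt((((points - mu) / sigma) ** 2).sum(axis=1))

print("Euclidean distances:", euclid)   # 5.0 and 1.0 -- very different
print("Statistical distances:", sd)     # both 1.0 -- the same statistical distance
```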
Figure 2.2: Region of same statistical distance.
Figure 2.3: Scatter plot of correlated variables.

From analytical geometry, the general equation of an ellipse is given by

a11(x1 − μ1)² + a12(x1 − μ1)(x2 − μ2) + a22(x2 − μ2)² = c,    (2.2)

where the aij are specified constants satisfying the relationship (a12² − 4a11a22) < 0, and c is a fixed value. By properly choosing the aij in (2.2), we can rotate the
Figure 2.4: Elliptical region encompassing data points.
ellipse while keeping the scatter of the two variables fixed, until a proper alignment is obtained. For example, the ellipse given in Figure 2.4 is centered at the mean of the two variables yet rotated to reflect the correlation between them.
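One concrete way to choose the constants in (2.2) so that the ellipse tilts with correlated data, in the spirit of Figure 2.4, is to take the quadratic-form matrix from the inverse of the covariance matrix; this is in effect the choice made by the statistical distance measures of this chapter. The covariance values below are hypothetical, and the mapping to a11, a12, a22 assumes (2.2) is written with a single a12 multiplying the cross-product term.

```python
# Choosing the constants of the ellipse (2.2) from the inverse covariance
# matrix so that the contour tilts with the correlation, as in Figure 2.4.
# The covariance values are hypothetical.
import numpy as np

sigma1, sigma2, rho = 4.0, 2.0, 0.7
Sigma = np.array([[sigma1**2,             rho * sigma1 * sigma2],
                  [rho * sigma1 * sigma2, sigma2**2]])
A = np.linalg.inv(Sigma)                 # quadratic-form matrix

a11, a12, a22 = A[0, 0], 2 * A[0, 1], A[1, 1]   # a12 multiplies the cross-product term
print(f"a11={a11:.4f}, a12={a12:.4f}, a22={a22:.4f}")
print("ellipse check (a12^2 - 4*a11*a22 < 0):", a12**2 - 4 * a11 * a22 < 0)
```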
2.3 T2 and Multivariate Normality
Suppose (x1, x2) can be described jointly by a bivariate normal distribution. Under this assumption, the statistical distance between this point and the mean vector (μ1, μ2) is the value of the variable part of the exponent of the bivariate normal probability function

f(x1, x2) = [1 / (2πσ1σ2·sqrt(1 − ρ²))] · exp[−(1/2)(SD)²],    (2.3)

where −∞ < xi < ∞ for i = 1, 2, and σi > 0 represents the standard deviation of xi. The value of (SD)² is given by

(SD)² = [1 / (1 − ρ²)] · [ ((x1 − μ1)/σ1)² − 2ρ((x1 − μ1)/σ1)((x2 − μ2)/σ2) + ((x2 − μ2)/σ2)² ],    (2.4)

where ρ represents the correlation between the two variables, with −1 < ρ < 1. The cross-product term between x1 and x2 in (2.4) accounts for the fact that the two variables vary together and are dependent. When x1 and x2 are correlated, the major and minor axes of the resulting ellipse differ from that of the variable space (x1, x2). If the correlation is positive, the ellipse will tilt upward to the right, and if the correlation is negative, the ellipse will tilt downward to the right. This
Figure 2.5: Correlation and the ellipse.
is illustrated in Figure 2.5. If ρ = 0, so that there is no correlation between x1 and x2, the ellipse will be oriented similar to the one given in Figure 2.2. Equation (2.4) can be expressed in matrix notation (see section 2.8.5) as

(SD)² = (X − μ)' Σ⁻¹ (X − μ),

where X' = (x1, x2), μ' = (μ1, μ2), and Σ⁻¹ is the inverse of the matrix

Σ = [ σ1²   σ12 ]
    [ σ12   σ2² ],

with σ12 = ρσ1σ2 denoting the covariance between x1 and x2.

Using (3.3), the kurtosis of this bivariate nonnormal distribution is 12.8. In comparison, the kurtosis of a bivariate normal distribution is 8. Thus, this distribution is heavier in the tails than a bivariate normal. However, suppose we keep adding another independent uniform variate to the above nonnormal distribution and observe the change in the kurtosis value. The results are provided in Table 3.1. As the number of uniform variables increases in the multivariate nonnormal distribution, the corresponding kurtosis value of the MVN distribution approaches and then exceeds the kurtosis value of the nonnormal distribution. Equivalence of the two kurtosis values occurs at the combination of five uniform variables and one exponential variable. For this combination, the tails of the joint nonnormal distribution are similar in shape to the tails of the corresponding normal. The result also implies that the T2 statistic based on this particular joint nonnormal distribution will have the same variance as the T2 statistic based on the MVN distribution. This example indicates that there do exist combinations of many (independent) univariate nonnormal distributions with the same kurtosis value that is achieved under an MVN assumption. For these cases, the mean and variance of the T2 statistic based on the nonnormal data are the same as for the T2 statistic based on the corresponding normal data. This result does not guarantee a perfect fit of the T2 sampling distribution to a beta (or chi-square or F) distribution, as this would require that all (higher) moments of the sampling distribution of the T2 statistic be identical to those of the corresponding distribution. However, such agreement of
the lower moments suggests that, in data analysis using a multivariate nonnormal distribution, it may be beneficial to determine if the sampling distribution of the T2 statistic fits a beta (or chi-square or F) distribution. If such a fit is obtained, the data can then be analyzed as if the MVN assumption were true.
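The kurtosis comparison above can be checked by simulation. The sketch below assumes that the measure in (3.3) is the usual multivariate (Mardia-type) kurtosis, E[((X − μ)'Σ⁻¹(X − μ))²], which equals p(p + 2) for a p-variate normal; that assumption is consistent with the values quoted in the text (8 for a bivariate normal, 12.8 for one uniform plus one exponential variate, and equality at five uniforms plus one exponential), but it is an assumption of this sketch rather than a statement from the book.

```python
# Monte Carlo check of the multivariate kurtosis values quoted in the text,
# assuming (3.3) is the Mardia-type kurtosis E[((X - mu)' Sigma^{-1} (X - mu))^2].
# For a p-variate normal this equals p(p + 2); the nonnormal example combines
# independent uniform variates with one exponential variate.
import numpy as np

def mv_kurtosis(X):
    d = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    q = np.einsum('ij,jk,ik->i', d, S_inv, d)
    return np.mean(q ** 2)

rng = np.random.default_rng(7)
N = 200_000

for k in range(1, 6):                                    # k uniforms + 1 exponential
    X = np.column_stack([rng.uniform(size=(N, k)), rng.exponential(size=N)])
    p = k + 1
    print(f"{k} uniform + 1 exponential: kurtosis ~ {mv_kurtosis(X):.2f}, "
          f"normal value p(p+2) = {p * (p + 2)}")
```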
3.5 Validation of the T2 Distribution
A popular graphical procedure that is helpful in assessing if a set of data represents a reference distribution is a quantile-quantile (Q-Q) plot (e.g., see Gnanadesikan (1977) or Sharma (1995)). Thus, this technique can be used in assessing the sampling distribution of the T2 statistic. We emphasize that the Q-Q plot is not a formal test procedure but simply a visual aid for determining if a set of data can be approximated by a known distribution. Alternatively, goodness-of-fit tests can also be used for such determinations. For example, with baseline data, we could construct a Q-Q plot of the ordered sample values, denoted by x(j) = [n/(n − 1)²]·T²(j), against the corresponding quantiles,