PITMAN'S MEASURE OF CLOSENESS
A Comparison of Statistical Estimators
Society for Industrial and Applied Mathematics Philadelphia 1993
Library of Congress Cataloging-in-Publication Data

Keating, Jerome P.
Pitman's measure of closeness: a comparison of statistical estimators / Jerome P. Keating, Robert L. Mason, Pranab K. Sen.
p. cm.
Includes bibliographical references and index.
ISBN 0-89871-308-0
1. Pitman's measure of closeness. I. Mason, Robert Lee, 1946- . II. Sen, Pranab Kumar, 1937- . III. Title.
QA276.8.K43 1993
519.5'44—dc20
93-3059
Copyright 1993 by the Society for Industrial and Applied Mathematics. All rights reserved. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the Publisher. For information, write the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, Pennsylvania 19104-2688.
To our esteemed colleague, teacher, and mentor, C. Radhakrishna Rao on the occasion of his seventy-second birthday.
Contents

Foreword

Preface

1 Introduction
  1.1 Evolution of estimation theory
    1.1.1 Least squares
    1.1.2 Method of moments
    1.1.3 Maximum likelihood
    1.1.4 Uniformly minimum variance unbiased estimation
    1.1.5 Biased estimation
    1.1.6 Bayes and empirical Bayes
    1.1.7 Influence functions and resampling techniques
    1.1.8 Future directions
  1.2 PMC comes of age
    1.2.1 PMC: A product of controversy
    1.2.2 PMC as an intuitive criterion
  1.3 The scope of the book
    1.3.1 The history, motivation, and controversy of PMC
    1.3.2 A unified development of PMC

2 Development of Pitman's Measure of Closeness
  2.1 The intrinsic appeal of PMC
    2.1.1 Use of MSE
    2.1.2 Historical development of PMC
    2.1.3 Convenience store example
  2.2 The concept of risk
    2.2.1 Renyi's decomposition of risk
    2.2.2 How do we understand risk?
  2.3 Weaknesses in the use of risk
    2.3.1 When MSE does not exist
    2.3.2 Sensitivity to the choice of the loss function
    2.3.3 The golden standard
  2.4 Joint versus marginal information
    2.4.1 Comparing estimators with an absolute ideal
    2.4.2 Comparing estimators with one another
  2.5 Concordance of PMC with MSE and MAD

3 Anomalies with PMC
  3.1 Living in an intransitive world
    3.1.1 Round-robin competition
    3.1.2 Voting preferences
    3.1.3 Transitiveness
  3.2 Paradoxes Among Choice
    3.2.1 The pairwise-worst simultaneous-best paradox
    3.2.2 The pairwise-best simultaneous-worst paradox
    3.2.3 Politics: The choice of extremes
  3.3 Rao's phenomenon
  3.4 The question of ties
    3.4.1 Equal probability of ties
    3.4.2 Correcting the Pitman criterion
    3.4.3 A randomized estimator
  3.5 The Rao-Berkson controversy
    3.5.1 Minimum chi-square and maximum likelihood
    3.5.2 Model inconsistency
  3.6 Remarks

4 Pairwise Comparisons
  4.1 Geary-Rao Theorem
  4.2 Applications of the Geary-Rao Theorem
  4.3 Karlin's Corollary
  4.4 A special case of the Geary-Rao Theorem
    4.4.1 Surjective estimators
    4.4.2 The MLR property
  4.5 Applications of the special case
  4.6 Transitiveness
    4.6.1 Transitiveness Theorem
    4.6.2 Another extension of Karlin's Corollary

5 Pitman-Closest Estimators
  5.1 Estimation of location parameters
  5.2 Estimators of scale
  5.3 Generalization via topological groups
  5.4 Posterior Pitman closeness
  5.5 Linear combinations
  5.6 Estimation by order statistics

6 Asymptotics and PMC
  6.1 Pitman closeness of BAN estimators
    6.1.1 Modes of convergence
    6.1.2 Fisher information
    6.1.3 BAN estimates are Pitman closest
  6.2 PMC by asymptotic representations
    6.2.1 A general proposition
  6.3 Robust estimation of a location parameter
    6.3.1 L-estimators
    6.3.2 M-estimators
    6.3.3 R-estimators
  6.4 APC characterizations of other estimators
    6.4.1 Pitman estimators
    6.4.2 Examples of Pitman estimators
    6.4.3 PMC equivalence
    6.4.4 Bayes estimators
  6.5 Second-order efficiency and PMC
    6.5.1 Asymptotic efficiencies
    6.5.2 Asymptotic median unbiasedness
    6.5.3 Higher-order PMC

Bibliography

Index
Foreword

I have great pleasure in writing a foreword to "Pitman's Measure of Closeness (PMC): A Comparison of Statistical Estimators" by Keating, Mason, and Sen, for many reasons. It is the result of a fruitful collaboration by three major research workers on PMC. It is a comprehensive survey of recent contributions to the subject. It discusses the merits and deficiencies of PMC, throws light on recent controversies, and formulates new problems for further research. Finally, there is a need for such a book, as PMC is not generally discussed in statistical texts. Its role in estimation theory and its usefulness to the decision maker are not well known. Since 1980, I have expressed the belief that the decision maker would benefit from examining the performance of any given estimator under different criteria (or loss functions). I suggested the use of PMC as one of the criteria to be seriously considered because of its intuitive appeal. I am glad to see that during the last ten years, PMC has been a topic of active research. The contributions by the authors of this book have been especially illuminating in resolving some of the controversies surrounding PMC. The authors deserve to be congratulated for their excellent effort in putting together much useful material for the benefit of statistical theorists and practitioners.
C.R. Rao Eberly Professor of Statistics
Preface

This book presents a unified development of the origins, nature, methods, and applications of Pitman's measure of closeness (PMC) as a criterion in estimation. PMC is based on the probabilities of the closeness of competing estimators to an unknown parameter. Although there had been limited exploration of the PMC methodology, renewed interest has been sparked in the last twenty years, especially in the last ten years after Rao (1981) pointed out its use as an alternative criterion to minimum variance. Since 1975 over 100 research articles, authored by many prominent statisticians, have appeared on this criterion. With this renewed interest has come better understanding of this method of comparison and its usefulness. Posed as an alternative to the concept of MSE (mean squared error), PMC has been extensively explored through theorems and examples. The goal of this monograph is to acquaint readers with this information in a single comprehensive source. We refer researchers and practitioners in multivariate analysis to Sen (1991) and (1992a) for a comprehensive review of the known results about PMC in the multivariate and multiparameter cases. The recent proliferation of published results on Pitman's measure of closeness makes it difficult to provide the readership with a relatively current monograph. To do so, we make some restrictions. We have, for example, written the book at the level of a graduate student who has completed a traditional two-semester course in mathematical statistics. We hope that the holistic presentation of the known results about PMC presented under a common notation will accelerate the integration of the beneficial features of this criterion into the mainstream of statistical thought. The intended audience for this book consists of two groups. The first group includes statisticians and mathematicians who are research oriented and would like to know more about a new and growing area of statistics. It also includes those who have some formal training in the theory of statistics or interest in the estimation field, such as practicing statisticians who work
in areas of application where estimation problems need solutions. The second group for whom this book is intended includes graduate students in statistics courses in colleges and universities. This book is appropriate for courses in which statistical inference or estimation techniques are the main topics. It also would be useful in courses on research topics in statistics. It thus is appropriate for graduate-level courses devoted to the theoretical foundations of estimation techniques. The topics selected for inclusion in this book constitute a compilation of the many research papers on PMC. In all decisions regarding topics, we were guided by our collective experiences in working in this field and by our desire to produce a book that would be informative, readable, and for which the techniques discussed would be understandable to individuals trained in statistics. The book contains six chapters. The first chapter begins with a philosophical perspective, includes the least amount of technical detail, and focuses on basic results. The book then gradually increases in mathematical complexity with each succeeding chapter. Throughout the book a serious effort has been made to present both the merits and the drawbacks of PMC. More technical multivariate results are treated only briefly, to allow a thorough discussion of univariate procedures and methodologies. The first three chapters relate the history and philosophy of PMC. The material is illustrated through realistic estimation problems and presented with a limited degree of technical difficulty. The Introduction in Chapter 1 presents the motivation for exploring PMC and the notation to be used throughout the book. Chapter 2 contains discussions on the development of PMC, including its history, the concept of risk, the competition with MSE, and the role of the loss function. Chapter 3 explores the operational issues involved in adopting PMC as a criterion. Discussions are given on the issues of intransitiveness and probability paradoxes as well as on a useful methodology for resolving ties in probability. The last three chapters present a unified development of the extensive theoretical and mathematical research on PMC. Taken together, they serve as a single comprehensive source on this important topic. We unify notation, denote overlap, and present a common foundation from which the known results are deduced. The text is highly referenced, allowing researchers to readily access the original articles. Many new findings not yet published also are presented. Chapter 4 establishes the fundamental results of pairwise comparisons based on PMC. Chapter 5 connects results of PMC with well-accepted notions in statistical inference such as robustness, equivari-
ance, and median unbiased estimation. The last chapter contains results on asymptotics, including the optimality of BAN estimators according to the Pitman closeness criterion and general equivalence results of Pitman, Bayes, and maximum likelihood estimators. The examples used throughout the second half of the book are often important and practical applications in the physical and engineering sciences. This element should strengthen the appeal of the book to many statisticians and mathematicians. The last three chapters also could serve as a useful reference for an advanced course in statistical inference. We are indebted to many individuals for contributing to this work. C. Radhakrishna Rao, Eberly Professor of Statistics (Pennsylvania State University), has been a constant source of guidance and inspiration for over a decade and his work has been a chief motivator in this venture. Colin Blyth, Professor Emeritus (Queen's University), reviewed several earlier versions of this manuscript and has provided many valuable suggestions on content and validity of the results. We gratefully acknowledge the extensive editorial review of this book by Norman L. Johnson, Professor Emeritus (University of North Carolina). We also acknowledge the influence of Malay Ghosh (University of Florida) not only on the many technical results which bear his name but also on his foresight to recommend a special issue of Communications in Statistics devoted to recent developments in Pitman's closeness criterion. We thank the reviewers, H. T. David (Iowa State University), R. L. Fountain (Portland State University), and T. K. Nayak (George Washington University) for their careful reading of and commentary on this manuscript. These reviews helped us produce an improved presentation of the fundamental issues and results associated with Pitman's measure of closeness. We are also grateful to Vickie Kearn, the SIAM book editor, her assistant, Susan Ciambrano, and our editor, Laura Helfrich, for their editorial support of our work. Finally, we extend thanks to Catherine Check, Shelly Eberly, Nata Kolton, and Alan Polansky for the excellent typing support they provided for the project. J. P. Keating R. L. Mason P. K. Sen September 10, 1992
Chapter 1
Introduction

There are many different ways to estimate unknown parameters. Such estimation procedures as least squares, method of moments, maximum likelihood, unbiasedness with minimum variance, skillfully chosen biased estimation, and Bayes and empirical Bayes are only a few of the many techniques that are currently available. Most of these techniques have been developed in response to specific practical problems for which no useful or existing solution was available, or to provide improvements to well-accepted estimation procedures. Certainly it is fortunate that there are many different ways to estimate unknown parameters, but which one should we use? It seems circular to use the criterion by which a best estimator was obtained as a basis for determining which estimator is best. What is clearly lacking is a systematic treatment of the comparison of such estimators based on criteria other than the currently popular mean squared error (MSE), which we will define in §1.1.4.

The use of mean squared error as a criterion in the comparison of estimators dates back to Karl Friedrich Gauss. As a simple illustration, consider the location (or intercept) model with an additive error component. This model is expressed in the following equation:

$$Y_i = \theta + \varepsilon_i, \qquad i = 1, \ldots, n,$$

where $Y_i$ is the $i$th observed value of $n$ observations taken to measure an unknown parameter $\theta$, and $\varepsilon_i$ is the additive experimental error. For example, a scientist may try to reconcile differences in observations of the length $Y_i$ of a chord of the earth along its axis of rotation, where $\theta$ is the true distance from the North Pole to the South Pole of the Earth.
Gauss took as his estimate of $\theta$ the value $\hat\theta$ that minimizes the sum of squares for errors (SSE),

$$\mathrm{SSE}(\hat\theta) = \sum_{i=1}^{n} (Y_i - \hat\theta)^2.$$
The value $\hat\theta$, termed the least squares estimator, is given by

$$\hat\theta = \frac{1}{n}\sum_{i=1}^{n} Y_i = \bar{Y}.$$
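To see that the minimizer is indeed the sample mean, set the derivative of the SSE with respect to $\hat\theta$ equal to zero:

$$\frac{d}{d\hat\theta}\sum_{i=1}^{n}(Y_i - \hat\theta)^2 = -2\sum_{i=1}^{n}(Y_i - \hat\theta) = 0 \quad\Longrightarrow\quad \hat\theta = \frac{1}{n}\sum_{i=1}^{n} Y_i = \bar{Y}.$$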
Gauss tried to support his mathematical result through probabilistic methods but he (1821)
found out soon that determination of the most probable value of an unknown quantity is impossible unless the probability distribution of errors is known explicitly. Since the error component $\varepsilon_i$ in the location model had to have a completely specified distribution, Gauss reasoned that the process might be made simpler if he transposed the ordering. Gauss (1821) intended to determine the distribution of the experimental errors, $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$, contained in a system of observations that in the simplest case would result in the rule, generally accepted as good, that the arithmetic mean of several values for the same unknown quantity obtained from equally reliable observations shall be considered the most probable value. By this inverted process, Gauss found the desired density function of experimental errors to be that of a normal distribution with mean zero and variance $\sigma^2$; i.e.,

$$f(\varepsilon) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{\varepsilon^2}{2\sigma^2}\right), \qquad -\infty < \varepsilon < \infty.$$
Using the criterion of minimizing the SSE and assuming normality for the distribution of experimental errors in the system of observations, Gauss found that the sample mean $\bar Y$ became the most probable estimate of $\theta$. Gauss (1821) was able to separate his construction of the least squares estimator from assumptions about the distribution of experimental errors.
To obtain this separation, Gauss modified his proposed estimation criterion and chose the value $\hat\theta$ that minimizes

$$E\left[\sum_{i=1}^{n} (Y_i - \hat\theta)^2\right].$$
This Gaussian modification is the precursor to MSE. When $\hat\theta = \bar Y$, the sum of squares for error measures the total sum of squares in the system of observations, whereas if $\hat\theta = E(Y)$, the MSE becomes $n$ times the population variance. In 1774 Pierre Simon Laplace proposed as an estimate of $\theta$ the value $\hat\theta$ that minimizes the expected sum of absolute deviations,

$$E\left[\sum_{i=1}^{n} \left|Y_i - \hat\theta\right|\right].$$
This method was the precursor of the mean absolute deviation (MAD) criterion. In the location model, Laplace showed that the sample median $\tilde Y$ was the most probable estimate of $\theta$ whenever the experimental errors are independent and identically distributed with a common double-exponential density function given by

$$f(\varepsilon) = \frac{1}{2\lambda}\exp\left(-\frac{|\varepsilon|}{\lambda}\right), \qquad -\infty < \varepsilon < \infty,$$
where $\lambda$ is a scale parameter that measures dispersion in the experimental errors. He proved that the Gaussian procedure of minimizing the expected sum of squares and his own procedure of minimizing the expected sum of absolute deviations produce the same estimate $\bar Y$ if and only if the experimental errors were normally distributed. During this period of the birth of estimation theory, the chosen loss function was clearly at the center of its development. The Gaussian selection of normally distributed experimental errors was a direct consequence of his conjugate choice of quadratic loss. Likewise, the Laplacian selection of double-exponentially distributed experimental errors stemmed from his conjugate choice of absolute deviation. We can add to these examples the distribution named after Augustin Louis Cauchy, who postulated for the experimental errors in the location model a density function given by

$$f(\varepsilon) = \frac{\lambda}{\pi(\lambda^2 + \varepsilon^2)}, \qquad -\infty < \varepsilon < \infty,$$
where $\lambda$ is a scale parameter that measures the scatter in the system of observations. Under the Cauchy law for the distribution of errors, the criteria of Gauss and Laplace were inappropriate since neither expectation exists. With this background, estimation theory, as we know it, was born. Gauss and Laplace set forth the two most important loss functions, based on MSE and MAD, and clearly stated that what was "best" depended upon the distribution for experimental errors. Cauchy's contribution was his example that minimizing the "expected" dispersion in a system of observations could be flawed. These contradictory results produced two important questions: (i) Is there an estimation criterion that is insensitive (i.e., robust) to the experimenter's choice for measuring loss? (ii) Can this criterion be applied without focusing only on the expectation of the experimental error? In response to these questions, we present in this book a systematic treatment of pairwise comparisons of statistical estimators based on an alternative criterion termed Pitman's measure of closeness (PMC). We consider other criteria as a basis of comparison, but for the most part we focus on the criterion developed by Pitman (1937). It is defined as follows.
Definition 1.0.1 Let $\hat\theta_1$ and $\hat\theta_2$ be two real-valued estimators of the real parameter $\theta$. Then Pitman's measure of closeness of these two competing estimators is denoted by $\mathbb{P}(\hat\theta_1, \hat\theta_2 \mid \theta)$ and defined by

$$\mathbb{P}(\hat\theta_1, \hat\theta_2 \mid \theta) = \Pr\left(\,|\hat\theta_1 - \theta| < |\hat\theta_2 - \theta|\,\right). \tag{1.1}$$
Whenever the probability in (1.1) does not depend upon $\theta$, we will suppress the conditioning notation and simply write $\mathbb{P}(\hat\theta_1, \hat\theta_2)$. When the value of $\mathbb{P}(\hat\theta_1, \hat\theta_2 \mid \theta)$ depends upon the value of $\theta$, PMC varies over the parameter space $\Omega$. In its simplest form PMC is the relative frequency with which the estimator $\hat\theta_1$ will be closer than its competitor $\hat\theta_2$ to the true but unknown value of the parameter $\theta$. However, $\mathbb{P}(\hat\theta_1, \hat\theta_2 \mid \theta)$ does not measure how much closer $\hat\theta_1$ is to $\theta$ than is $\hat\theta_2$. Pitman made his original definition under the assumption that

$$\Pr\left(\,|\hat\theta_1 - \theta| = |\hat\theta_2 - \theta|\,\right) = 0. \tag{1.2}$$
In such cases, the ratio $\mathbb{P}(\hat\theta_1, \hat\theta_2 \mid \theta)/\mathbb{P}(\hat\theta_2, \hat\theta_1 \mid \theta)$ is the odds that $\hat\theta_1$ will produce an estimate closer to $\theta$ than $\hat\theta_2$. In §3.4, we extend Definition 1.0.1
to the interesting and important comparison when the probability in (1.2) is not necessarily zero. Our viewpoint in this book is intentionally limited to PMC. Chapters 1-3 contain our reasons for this narrowness. In the space of this modest monograph we cannot hope to sort through all the known competing criteria for estimation. Savage (1954) has presented an analysis of pairwise preferences between estimators derived under seven different criteria. We propose PMC as an alternative to other ways of comparing the relative value of two or more statistical estimators in a variety of situations, some of which have no previously known solution. The similarities between PMC and MSE will also be thoroughly explored. We begin this chapter with a brief discussion of the evolution of estimation theory. The discussion will help us place in perspective the different estimation procedures that we will compare in this book and the reason for the emergence of Pitman's measure of closeness as a viable alternative to MSE.
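To make Definition 1.0.1 concrete, the following minimal Python sketch estimates $\mathbb{P}(\hat\theta_1, \hat\theta_2 \mid \theta)$ by straightforward Monte Carlo simulation. It is offered only as an illustration of the definition; the choice of estimators (sample mean versus sample median for a normal sample), the sample size, and the replication count are arbitrary illustrative values.

```python
import numpy as np

def pitman_closeness(est1, est2, sampler, theta, n, reps=100_000, seed=0):
    """Monte Carlo estimate of P(|est1(X) - theta| < |est2(X) - theta|)."""
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(reps):
        x = sampler(rng, theta, n)
        wins += abs(est1(x) - theta) < abs(est2(x) - theta)
    return wins / reps

# Example: sample mean versus sample median for a N(theta, 1) sample of size 11.
normal_sampler = lambda rng, theta, n: rng.normal(theta, 1.0, size=n)
pmc = pitman_closeness(np.mean, np.median, normal_sampler, theta=0.0, n=11)
print(f"Estimated PMC(mean, median | theta) = {pmc:.3f}")  # exceeds 0.5: the mean is Pitman-closer for normal data
```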
1.1 Evolution of estimation theory

The evolution of estimation theory has been driven by a need to solve real problems. It has progressed from least squares to the method of moments, to maximum likelihood, to Bayes and empirical Bayes procedures, to risk-reduction approaches, to robustness, and to resampling techniques. The following subsections provide brief discussions of these historical events in estimation theory. They also set the stage for our later discussions on comparisons of estimators.
1.1.1 Least squares

One of the most widely used estimation procedures is the method of least squares (LS). It was first derived in the beginning of the nineteenth century and its origins can be traced to independent work by Gauss and Legendre, and later input by Laplace. Plackett (1972) provides extensive details on the discovery of this method and the controversy of its origins. Its use of MSE as an estimation criterion is well known and is briefly discussed in §2.1. Consider a simple linear regression model in which

$$E[Y_i \mid x_i; \theta] = \theta x_i,$$
where $E[Y_i \mid x_i; \theta]$ is the conditional mean of $Y_i$, $\theta$ is an unknown parameter, and the $x_i$'s are known values. Assume that the conditional variances of the $Y_i$'s are equal. To find the least squares estimator of $\theta$ for $n$ pairs of observations $(y_i, x_i)$, we determine the value of $\hat\theta$ that minimizes

$$\sum_{i=1}^{n} (y_i - \hat\theta x_i)^2.$$
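For the one-parameter model written above (a sketch assuming the through-the-origin form $E[Y_i \mid x_i; \theta] = \theta x_i$), setting the derivative of this sum equal to zero yields the closed-form least squares estimator:

$$-2\sum_{i=1}^{n} x_i\,(y_i - \hat\theta x_i) = 0 \quad\Longrightarrow\quad \hat\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}.$$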
The least squares estimators have several useful properties that add to their appeal. For example, when $Y_1, \ldots, Y_n$ have independent normal distributions, the LS estimators maximize the likelihood function, are functions of sufficient statistics, and are uniformly minimum-variance unbiased estimators of regression parameters. They also are the best linear unbiased estimators in the class of estimators that are linear functions of $Y$, in the sense of having the minimum MSE or convex risk. However, the LSEs are not invariant under arbitrary monotone (such as logarithmic or power) transformations on the observations—a case that often arises in practice, especially in biomedical studies. During the last fifty years many new developments have occurred in least squares estimation. In each situation new methodologies were introduced to improve the performance of older techniques. For example, regression diagnostics have progressed from simple residual plots and correlation matrices to such sophisticated techniques as partial leverage plots, condition numbers, Cook's distance measure, and dfbetas (e.g., see Belsley, Kuh, and Welsch (1990)). Similar advancements have been made in such new areas as robust regression analysis, influence functions, quasi-least squares analysis, weighted least squares, inverse regression, and generalized linear models. In each of these settings, new estimation schemes were developed to remedy past problems. For example, robust regression and influence functions are useful improvements in accommodating data with outliers or influential observations. Weighted least squares is an improvement for addressing the problem of heteroscedasticity of the error variance, and inverse regression is a helpful solution for use in calibration problems. The wide popularity of regression in data analysis is a tribute to its ability to integrate these new estimation techniques into its overall approach to parameter estimation.
1.1.2 Method of moments
Karl Pearson introduced the method of moments (MM) at the beginning of the twentieth century. His empirical procedure involved equating the first
$k$ moments of the sample with the first $k$ moments of the population, when there are $k$ $(\geq 1)$ unknown parameters associated with the population. This method results in $k$ equations in $k$ unknowns. Although the MM procedure is very much an ad hoc technique, its simplicity and optimality in some specific cases made it useful in the absence of theoretically well-founded alternatives. For distributions with one unknown parameter, say, $\theta = h(\mu)$ (i.e., a function of the population mean $\mu$), the method of moments estimator (MME) is given by

$$\hat\theta_M = h(\bar X),$$
where $\bar X$ is the sample mean taken from a random sample of size $n$ from the distribution of interest. When the distribution has two parameters, namely, the population mean $\mu$ and variance $\sigma^2$, the MMEs are given by

$$\hat\mu = \bar X \qquad\text{and}\qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar X)^2.$$
Pearson realized that his procedure might focus too closely on the first two moments of the distribution without regard to higher-order effects such as skewness and kurtosis. In such cases, alternative strategies were developed in which $\mu$ and $\sigma$ were estimated based on equating third and fourth population moments with the third and fourth sample moments. In addition, Pearson developed a family of distributions to allow for incorporation of these higher-order moments. This extension introduced to the underlying distribution additional parameters that were estimated as before. The MM procedure survived in part because of its ability to estimate reasonably well the parameters in the normal distribution. Indeed, the method of moments was shown by Blyth (1970) to be a special case of the method of least squares under relatively general conditions. However, MM estimation has obvious shortcomings. For example, it cannot be used with a Cauchy distribution under its present formulation since no moments exist (N.B., the MM can be modified in the Cauchy to estimate its scale parameter by using fractional moments). Finally, in distributions with complicated moments such as the Weibull, solving for the unknown parameters produces nonlinear equations that can only be solved numerically. This drawback is also shared by LS estimation for nonlinear models. Although the MME has several undesirable features, it seems reasonable to question how it compares with its competitors. To answer this query we
must decide on the criterion we will use for the basis of the comparison. Suppose we select as our criterion the bias $B(\hat\theta, \theta)$ of an estimator, which is given by

$$B(\hat\theta, \theta) = E(\hat\theta) - \theta.$$
To illustrate the effect of this choice, let us consider the MME of $\theta$ in the uniform distribution.

Example 1.1.1 Consider a random sample $X_1, X_2, \ldots, X_n$ chosen from the uniform distribution

$$f(x; \theta) = \frac{1}{\theta}\, I_{(0,\theta)}(x), \tag{1.6}$$
where $I_{(a,b)}(x)$ is the indicator function on the open interval $(a,b)$. The MME is given by

$$\hat\theta_M = 2\bar X, \tag{1.7}$$
which does not depend solely on the value of the sufficient statistic $X_{n:n}$, the largest order statistic. Since $\hat\theta_M$ in (1.7) is formed from the sample mean of a randomly chosen sample of size $n$, then $E(\hat\theta_M) = 2E(\bar X) = 2\cdot\frac{\theta}{2} = \theta$. Thus the MME produces an unbiased estimator of $\theta$. However, it is easy to show that many other estimators have zero bias (e.g., twice any convex linear combination of the unordered $X_i$'s is unbiased). Thus, the bias by itself does not have enough discriminating power, and we may need additional criteria to compare these estimators.
1.1.3 Maximum likelihood

Sir Ronald A. Fisher in the 1920s introduced the method of maximization of the likelihood function. He examined the joint distribution of the observed sample values $x_1, \ldots, x_n$ as a function of the unknown parameters. The likelihood function is given by

$$\ell(\theta \mid x_1, \ldots, x_n) = f(x_1, \ldots, x_n; \theta)$$
and reduces to

$$\ell(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i; \theta)$$
whenever the $x_1, \ldots, x_n$ are independent and identically distributed with common density function $f(x; \theta)$. In the ex-post-facto sense, after the data are collected, the likelihood function is only a function of $\theta$. Interpretations of the likelihood function can lead to wide philosophical differences, which we wish to avoid. Readers interested in further details of these differences will benefit from reading Efron (1982). It suffices to say that the maximum likelihood estimator (MLE) is the value of $\theta$ that produces the absolute maximum of $\ell(\theta \mid x_1, \ldots, x_n)$ and is a function of $x_1, \ldots, x_n$. However, it may not always be possible to express this estimator in closed form as a function of $x_1, \ldots, x_n$, and in some cases, the very existence of an absolute maximum may be difficult to establish. Most practitioners of statistics will frequently choose the MLE technique because they can ordinarily obtain reasonable estimators of the unknown parameters. While the exact distributional properties of the MLEs are unknown except for distributions having sufficient statistics, the asymptotic properties remain appealing features. The asymptotic unbiasedness, normality, consistency, sufficiency, and efficiency of these estimators make analysis of large data sets, such as consumer preference surveys and clinical trials, straightforward. This set of properties forms the condition of an estimator being best asymptotically normal (BAN). We will discuss such estimators in Chapter 6. For a general class of models, the concept of transformational invariance is a property shared by the ML and MM estimators. For location and scale parameter families of distributions, the consequent MLEs are equivariant, which is a condition we will use throughout Chapter 5. MLEs have the additional desirable aspect that whenever a sufficient statistic $T(X)$ exists for an estimation problem, the MLE, by virtue of its origins from the joint density function, will be a function of that sufficient statistic. These features of unbiasedness, normality, consistency, invariance, and efficiency have become the principal properties sought in estimators. In the example of the uniform distribution given in (1.6) the MLE of $\theta$ is

$$\hat\theta_L = X_{n:n}.$$
Since $X_{n:n}/\theta$ has a beta distribution with parameters $\alpha = n$ and $\beta = 1$, then $E(\hat\theta_L) = \theta \cdot E(X_{n:n}/\theta) = n\theta/(n+1)$. Thus the MLE produces a biased estimator of $\theta$ whereas the MME yields an unbiased estimator. However, the MLE is asymptotically unbiased and has a smaller variance since

$$\mathrm{Var}(\hat\theta_L) = \frac{n\theta^2}{(n+1)^2(n+2)} < \frac{\theta^2}{3n} = \mathrm{Var}(\hat\theta_M).$$
The uniform distribution is an interesting case because it exemplifies the Hodges-LeCam superefficiency property, primarily because a regularity condition pertaining to the Frechet-Cramer-Rao inequality does not hold. To be more specific, $\partial f(x;\theta)/\partial\theta$, where $f(x;\theta)$ is defined in (1.6), does not exist for $x = \theta$ and moreover, since the essential range of $X$ depends on $\theta$, differentiation under the integral may no longer be valid. Note that since $\mathrm{Var}(\hat\theta_M)$ is proportional to $1/n$, the rate of convergence of $\mathrm{Var}(\hat\theta_M)$ to zero is consistent with the optimum rate spelled out by the Frechet-Cramer-Rao inequality (whenever it is applicable) and the Central Limit Theorem. However, the reader can readily verify that $\mathrm{Var}(\hat\theta_L)$ is proportional to $1/n^2$ and its subsequent rate of convergence to zero is denoted $O(1/n^2)$. The method of moments and the maximum likelihood estimators are consistent in that for each $\epsilon > 0$

$$\lim_{n\to\infty} \Pr\left(|\hat\theta_M - \theta| > \epsilon\right) = 0 \qquad\text{and}\qquad \lim_{n\to\infty} \Pr\left(|\hat\theta_L - \theta| > \epsilon\right) = 0.$$
We may want to discriminate between $\hat\theta_M$ and $\hat\theta_L$ as consistent estimators of $\theta$ based on their rates of convergence. In the above example, the two rates of convergence are different (i.e., $1/n$ and $1/n^2$, respectively), and hence, in an asymptotic framework we can compare them. However, one might be concerned that the MLE underestimates $\theta$ (a.e.), whereas the MME has a distribution that is symmetric about $\theta$. If we graph the corresponding marginal distributions, we have an excellent example of the trade-off between an unbiased estimator with a reasonable variance compared to a biased estimator with a very small variance. It is exactly these types of differences that make estimator comparisons a difficult task and illustrate the need for useful comparison criteria. In the next subsection we introduce MSE as one way to reconcile variance and bias considerations and illustrate this reconciliation by constructing an unbiased estimator from the MLE in the uniform distribution.
1.1.4 Uniformly minimum variance unbiased estimation (UMVUE)
The notion of MSE was introduced as

$$\mathrm{MSE}(\hat\theta) = E\left[(\hat\theta - \theta)^2\right] = \mathrm{Var}(\hat\theta) + B^2(\hat\theta, \theta).$$
Among the class of unbiased estimators (for which $E(\hat\theta) = \theta$), the estimator with minimum variance also has minimum MSE. This observation motivated the construction of unbiased estimators with minimum variance. The subsequent procedure produced small sample estimators that had two of the three appealing asymptotic properties of MLEs. This procedure was especially embraced by decision theorists because it could be developed from squared error loss functions. The examples of the MME and the MLE of $\theta$ in the uniform distribution (see §§1.1.2 and 1.1.3) illustrate the trade-off between minimizing the squared bias and reducing the variance of an estimator. Many researchers have been intrigued with the mathematical elegance of the derivation of the Frechet-Cramer-Rao inequality, the Rao-Blackwell Theorem, and the Lehmann-Scheffe Theorem. From a practical perspective, the last two theorems provided practitioners with a method for the construction of unbiased estimators with minimum variance. We need only to find the conditional expectation of any unbiased estimator of the target parameter, given the sufficient statistic. For a unique best estimator in this sense, we may need the concept of completeness. The MSE of $\hat\theta_M$ is given by

$$\mathrm{MSE}(\hat\theta_M) = \frac{\theta^2}{3n}$$
and the MSE of $\hat\theta_L$ is given by

$$\mathrm{MSE}(\hat\theta_L) = \frac{2\theta^2}{(n+1)(n+2)}.$$
Using MSE as a criterion, the biased estimator $\hat\theta_L$ is preferred to the unbiased estimator $\hat\theta_M$. Since $E[\hat\theta_L] = n\theta/(n+1)$, we construct the following unbiased estimator:

$$\hat\theta_U = \frac{n+1}{n}\, X_{n:n}.$$
The MSE of $\hat\theta_U$ is given by

$$\mathrm{MSE}(\hat\theta_U) = \frac{\theta^2}{n(n+2)}$$
and is smaller than the MSE of $\hat\theta_L$ whenever $n > 1$. The Frechet-Cramer-Rao inequality does not apply to the uniform distribution because the essential range of the uniform distribution depends upon $\theta$.
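As an illustrative numerical check, the following minimal Python sketch compares the empirical mean squared errors of $\hat\theta_M = 2\bar X$, $\hat\theta_L = X_{n:n}$, and $\hat\theta_U = \frac{n+1}{n}X_{n:n}$ with the three formulas above; the true $\theta$, the sample size, and the replication count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 5.0, 10, 200_000

x = rng.uniform(0.0, theta, size=(reps, n))
mme   = 2.0 * x.mean(axis=1)             # theta_M = 2 * sample mean
mle   = x.max(axis=1)                    # theta_L = largest order statistic
umvue = (n + 1) / n * x.max(axis=1)      # theta_U = unbiased correction of the MLE

for name, est, exact in [
    ("MME",   mme,   theta**2 / (3 * n)),
    ("MLE",   mle,   2 * theta**2 / ((n + 1) * (n + 2))),
    ("UMVUE", umvue, theta**2 / (n * (n + 2))),
]:
    emp = np.mean((est - theta) ** 2)
    print(f"{name:5s}  empirical MSE = {emp:.4f}   formula = {exact:.4f}")
```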
In general, we may not be able to reduce the bias of the MLE as simply as we did in the uniform distribution. In more complex models, we may need to reduce the bias by more sophisticated techniques such as jackknifing the MLE. The jackknifing process reduces the size of the bias through higher-order expansions of the likelihood function.
1.1.5 Biased estimation

After World War II, researchers began to question the conventional methods of estimation detailed above and more closely scrutinized the resultant estimators. These individuals recognized that the MSE of an estimator can be made small by manipulating the bias to reduce the variance of the estimator. For example, estimators that are constants have a zero variance but are almost always biased; under the MSE criterion, they are always admissible but have no practical value. Within a short span, statisticians began producing biased estimators with uniformly smaller mean squared error than that attained by the UMVUE. The Hodges-Lehmann (1951) admissibility result for the sample mean in random samples chosen from a normal distribution helped to dilute the influence of these biased estimators. Its utility in large samples through the Central Limit Theorem aided the cause for unbiased estimators. However, Charles Stein in 1956 began to produce biased estimators in three-or-higher dimensions which performed better than the traditional MLE, even for the multivariate normal distribution. Estimation rules with smaller total risk than the MLE were shown to exist for estimating each of the means of the sets of data. Stein's paradox generated much debate and caused many researchers to consider radical departures from traditional statistical thinking. The Stein effect helps illustrate the usefulness of Pitman's measure of closeness in one dimension in the following example.

Example 1.1.2 Efron (1975) proposed the following problem. Let $X_1, \ldots, X_n$ be a random sample of size $n$ from a normal distribution with a mean of $\theta$ and unit variance (i.e., $\mathcal{N}(\theta, 1)$). As an alternative to the unbiased sample mean, $\hat\theta_1 = \bar X$, Efron proposed the following estimator $\hat\theta_2 = \bar X - \Delta_n(\bar X)$, where

$$\Delta_n(\bar X) = \frac{1}{2}\,\mathrm{sgn}(\bar X)\,\min\!\left(|\bar X|,\; \frac{\Phi(-\sqrt{n}\,|\bar X|)}{\sqrt{n}}\right) \tag{1.12}$$
is an odd function, and $\Phi(t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{t} e^{-s^2/2}\,ds$ is the standard normal distribution function. Figure 1.1 illustrates the relationship between $\hat\theta_1$ and $\hat\theta_2$ for $n = 1$ and $\bar X > 0$. When $\bar X$ is between $-c/\sqrt{n}$ and $c/\sqrt{n}$, $\hat\theta_2 = \bar X/2$; when $|\bar X| > c/\sqrt{n}$, then

$$\hat\theta_2 = \bar X - \mathrm{sgn}(\bar X)\,\frac{\Phi(-\sqrt{n}\,|\bar X|)}{2\sqrt{n}},$$
where $c$ is the unique zero of $q(x) = x - \Phi(-x)$. The numerical value of $c$ is 0.35962.
Figure 1.1: A graph of the unbiased estimator and the Efron estimator. To paraphrase Efron, #2 has none of the nice properties of 9\. It is not unbiased, not invariant, not minimax, nor even admissible in the sense of MSE. Yet 02 = B2(X) will more often be closer than 0i = $i(X) to the true value of 0 regardless of the value of 9\ In the sense of PMC, X is inadmissible to #2 ? which illustrates the presence of the Stein effect in the univariate case. Thus the sanctuary afforded by the Hodges-Lehmann landmark admissibility result for the sample mean does not extend to the criterion of PMC. The term A n (X) in (1.12) is known as a shrinkage factor since 0% shrinks X toward zero. The Efron estimator #2(X) is not unique in that A n (X) may
be defined in many ways so that &i(X) is inadmissible to the associated estimator O^X) in the PMC sense. This form of inadmissibility was also presented by Salem and David in 1973. The form of An(X) presented by them can be found in Salem (1974, p. 16). David and Salem (1991) and Yoo and David (1992) show that the inadmissibility surfaces in the general location parameter case or in the general scale parameter case. Moreover, Yoo and David (1992) present the proof that 0i pC) is inadmissibile in the PMC sense to any estimator 9$(X) that satisfies the conditions that #3(0) = 0 and, for all X ^ 0,
This latter condition implies that any estimator, which lies strictly between the two estimators in Figure 1.1, is also better than 9\(X} in the sense of PMC. Thus the "Stein effect" surfaces under the Pitman closeness criterion even in one dimension. Whereas many statisticians would not see this as sufficient evidence to abandon the unbiased estimator 0i, Efron (1975) uses Example 1.1.2 to raise ... a serious point: in more complicated estimation situations involving the estimation of several parameters at the same time, statisticians have begun to realize that biased estimation rules have definite advantages over the usual unbiased estimators. This represents a radical departure from the traditional unbiased estimation which has dominated statistical thinking since Gauss' development of the least squares method. The remarks with regard to Gauss are especially important and are addressed in §2.1.1.
1.1.6 Bayes and empirical Bayes
Bayesian analysis can be used by practitioners for situations where scientists have a priori information about the values of the parameters to be estimated. This information is formalized into a prior distribution on the parameter, and estimators are formed from the posterior distribution of the parameter given the data. This approach contains as a special case the fiducial procedure, which was used by Fisher. Our intent in this book is to present Pitman's concept of closeness under Bayesian, fiducial, and frequentist assumptions. Whether the prior distri-
bution should be subjectively based or empirically constructed is an issue of considerable debate among Bayesians. The fiducial approach of Fisher essentially uses noninformative priors and the resulting estimates often have a relevant classical interpretation. In the Bayesian sense, the Pitman estimator (see Pitman (1939)) has an interesting interpretation. With respect to a uniform weight function (see Strasser (1985)), the Pitman estimator becomes the mean of the posterior distribution and as such is the Bayes estimator for quadratic loss. For Pitman's concept of closeness, the differences between frequentists and Bayesians are mainly philosophical, since the estimator that is "best" in the sense of PMC is the same for the different groups for a large class of estimation problems. Bayesians could certainly use PMC whenever two estimates of a parameter are derived from different measures of central tendency of the posterior distribution. For example, the Bayes estimator under a squared error loss function is the mean of the posterior distribution of $(\theta \mid x_1, \ldots, x_n)$, whereas the Bayes estimator under an absolute loss function is the median of the posterior distribution. Hence, if we are uncertain about the true error incurred, we may want to use a posterior version of the PMC definition. This comparison has been completed by Ghosh and Sen (1991) and is contained in §5.4 (see Definition 5.4.1).
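The following minimal Python sketch illustrates how a posterior version of the closeness comparison between the posterior mean and the posterior median might be computed. The gamma-exponential conjugate model and all numerical values are illustrative assumptions chosen only to produce a skewed posterior; they are not taken from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
a0, b0 = 2.0, 1.0                                   # gamma prior on the exponential rate lambda
data = rng.exponential(scale=1 / 0.8, size=15)      # simulated lifetimes (illustrative)
a_post, b_post = a0 + data.size, b0 + data.sum()    # conjugate gamma posterior

post = stats.gamma(a_post, scale=1 / b_post)
post_mean, post_median = post.mean(), post.median()

# Posterior probability that the median is closer to lambda than the mean.
lam = post.rvs(size=200_000, random_state=rng)
p = np.mean(np.abs(post_median - lam) < np.abs(post_mean - lam))
print(f"posterior mean = {post_mean:.3f}, posterior median = {post_median:.3f}")
print(f"P_posterior(|median - lambda| < |mean - lambda|) ~ {p:.3f}")
```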
1.1.7 Influence functions and resampling techniques
John Tukey in the early 1960s began a minor but much needed revolution in estimation theory with his Exploratory Data Analysis (EDA). Tukey believed that before we start using sophisticated estimation theory, we should explore the data under investigation through a variety of techniques for depicting its true underlying distribution. Moreover, as early as 1922, Fisher recognized the poor MSE performance of the sample mean and sample variance except in a highly confined subset of Pearson curves concentrated about the normal distribution. Two significant theoretical spin-offs that came from EDA were Huber's concept of robustness together with Hampel's formulation of influence functions and Efron's renovation of some resampling techniques. Huber, through the use of influence functions, developed methods that were robust (i.e., insensitive) to families of distributions within a neighborhood of the assumed distribution. Efron, with the novel bootstrap resampling technique, introduced a valuable procedure that offered an alternative to the jackknife estimator, although the latter may have smaller MSE.
Student          Consistent   Overconfident   Procrastinator
Exam 1               72             96              60
Exam 2               75             94              65
Exam 3               74             12              59
Exam 4               73             92             110
Average             73.5           73.5            73.5
Median              73.5           93              62.5
To illustrate the concerns of Tukey, consider the following example. Example 1.1.3 The data in the table above represent the scores on four examinations for three different students: one who is consistent, one who is overconfident, and one who is a procrastinator. The question concerns what grade should be assigned to each student. To those who advocate sample averages, it is easy to see that each student has an average exam score of 73.5 and should therefore receive the same grade. However, if we remember that the purpose of a grade is to provide the best indicator of a student's performance, we will find ourselves in a dilemma. Clearly the overconfident student is capable of better work than a 12 on Exam 3, but that is precisely his score. The grade of 12 is considered an outlying observation that is not representative of the student's performance. Researchers in influence functions have produced estimation procedures which are robust to one's assumption about the true unknown distribution of the grades. A simplistic interpretation of their research is that sample medians tend to be better indicators of true performance because our knowledge of the distribution of grades is at best imprecise. Using the sample median, the consistent student's grade remains at 73.5, but the overconfident student's grade increases to 93. The contrast in the estimated grades for the latter student is remarkable. However, before students begin demanding a robust estimator of their grade, they should consider the grade of the procrastinator. This student has precisely the same average as the other two students, but the median grade is substantially lower. The question surfaces: what grade should the procrastinating student receive? Is the sample median a Pitman-closer estimator than the sample mean? Eisenhart, Deming, and Martin (1963) initiated such inquiries via simulation of the comparison of the sample mean with the sample median for the Gaussian, Laplace, uniform, Cauchy, hyperbolic
secant, and hyperbolic secant-squared distributions. This numerical study strongly suggests that the sample median may be a more robust choice across a broad spectrum of distributions. This kind of example shows the impossibility of describing a student's performance on a set of tests by any one number. Similarly, random loss cannot be identified with any single value (see §3.1.2).
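In the spirit of the simulation studies cited above, the following minimal Python sketch estimates the probability that the sample median is closer to the location parameter than the sample mean for a few distributions; the sample size, replication count, and the particular distributions are arbitrary illustrative choices and are not the Eisenhart, Deming, and Martin (1963) computations themselves.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 15, 100_000
samplers = {
    "normal":  lambda size: rng.normal(0.0, 1.0, size=size),
    "Laplace": lambda size: rng.laplace(0.0, 1.0, size=size),
    "Cauchy":  lambda size: rng.standard_cauchy(size=size),
}
for name, draw in samplers.items():
    x = draw((reps, n))                         # location parameter theta = 0 in every case
    med, mean = np.median(x, axis=1), np.mean(x, axis=1)
    p = np.mean(np.abs(med) < np.abs(mean))     # P(|median - theta| < |mean - theta|)
    print(f"{name:8s}: P(median closer than mean) ~ {p:.3f}")
```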
1.1.8 Future directions

Many other estimation techniques exist in addition to the major ones discussed above, and ongoing research continues to advance the evolution of estimation theory. While our concern is primarily directed toward useful criteria for the comparison of these estimators, it is worthwhile to reflect on some comments by Efron (1978) on these statistical foundations. He states:

The field of statistics continues to flourish despite, and partly because of its foundational controversies. Literally millions of statistical analyses have been performed in the past 50 years, certainly enough to make it abundantly clear that common statistical methods give trustworthy answers when used carefully. In my own consulting work I am constantly reminded of the power of the standard methods to dissect and explain formidable data sets from diverse scientific disciplines. In a way this is the most important belief of all, cutting across the frequentist-Bayesians divisions: that there do exist more or less universal techniques for extracting information from noisy data, adaptable to almost every field of inquiry. In other words, statisticians believe that statistics exists as a discipline in its own right, even if they can't agree on its exact nature.

In Chapters 5 and 6, we introduce optimum estimators in the PMC sense. We show that these estimators have not only a frequentist interpretation but a Bayesian one as well. Hence, this inquiry into the Pitman closeness criterion produces a "best" estimator, which transcends the frequentist-Bayesian division. Moreover, these methods are shown to be of practical value in such diverse disciplines as political science, economics, marketing, quantum physics, engineering, clinical sciences, and bioassay.
1.2 PMC comes of age
Pitman's measure of closeness has existed as a criterion in statistical estimation for more than fifty years. However, its development has progressed at a much slower rate than the discipline itself. In §2.1, we will see that this delayed development is due primarily to researchers' need to develop inferential procedures that had superior mathematical properties to their predecessors or provided reasonable estimators in a wider variety of estimation problems. A brief discussion of the origins of PMC will help us better understand the role it plays for the statistics community.
1.2.1 PMC: A product of controversy
In the 1930s, an extensive controversy arose between Karl Pearson and Ronald Fisher concerning the merits of the minimum chi-square and the maximum likelihood estimators. In a 1936 Biometrika paper, Pearson defended the minimum chi-square procedure and noted that Fisher's MLE did not give "better" results in an example where a distribution is fitted to a set of data in two separate ways: where the parameters are estimated by minimum chi-square and MLE. The fitted distributions were then compared using a chi-square goodness-of-fit test. Pearson questioned the value of the MLE and Fisher's definition of the "best" estimate of a parameter in the following quotation: But the main point in this paper is to suggest to Professor Fisher that he should state the method of likelihood has taken hold of so many of the younger generation of mathematical statisticians, wherein he conceives it to give "better" results than the method of minimum chi-square. The latter has a firmer basis and gives the practical statistician what he requires, the "best" fitting curve to his data according to chi-square. But what does the maximum likelihood give him? If it gives the "best" parameters to the frequency curves, how does he define "best"? It cannot be because it gives the minimum [obviously maximum meant] likelihood, for that would be arguing in a circle. He must define what he means by "best," before he can prove that the principle of likelihood provides it. Pearson suggests that Fisher admit that he has converted many younger statisticians into users of maximum likelihood with the promise that it gives
better results than the method of minimum chi-square. It is ironic that while Pearson warned Fisher against circular reasoning, Pearson's explicit assumption of what is best was equally circular in nature. For the purpose of our book, Pearson's quotation is essential because it is the prime mover in Pitman's endeavor to provide an impartial basis of comparison of the method of minimum chi-square with that of maximum likelihood. Pitman proposed PMC as a criterion that is intuitive, natural, and easily understood. Pearson's article is cited in the first sentence of Pitman's cornerstone article that appeared in 1937. Without this controversy, Pitman might not have proposed the criterion. Over forty years after the Pearson-Fisher controversy, Berkson and Rao engaged in a more refined debate over the usefulness of minimum chi-square as a criterion in estimation. Berkson claimed that by any traditional criterion, the procedure of minimum chi-square estimation would produce a "better" estimator than the MLE. Just as the Pearson-Fisher controversy precipitated Pitman's suggestion of PMC as a basis for the comparison of estimators, the Rao-Berkson controversy precipitated Rao's appeal to other criteria (including PMC) as methods for the comparison of estimators. In §3.5, we provide empirical evidence from Berkson's 1955 bioassay problem that the outcome of this comparison via the Pitman closeness criterion is mixed over the parameter space. What have we learned from these controversies? We, as well as many other researchers, have found answers in a multitude of desirable properties associated with PMC. This book is written to share these findings with a broader spectrum of the scientific community. The methods given here provide reasonable estimators in practical problems which lack well-founded estimation techniques. The development of Pitman-closest estimators is an earnest effort to present the "best" of what the Pitman closeness criterion has to offer the practitioner.
1.2.2 PMC as an intuitive criterion
Pitman's criterion has been studied by many generations of statisticians. Geary (1944), Scheffe (1945), Johnson (1950), Zacks (1965), Blyth (1972), David (1973), Efron (1975), and Rao (1980) are the principal scholars who kept the criterion alive until the wealth of information about PMC was sufficient for it to be regarded on its own merits. Although several developments in estimation theory, especially in the 1940s (see §2.1), rightfully overshadowed the concept of PMC, many articles
have been written about it in the past twenty years. The relevance of PMC, as discussed in Efron's (1975) article, is illustrated by the experience of a reliability engineer at Bell Helicopter Textron in 1976. The junior engineer was asked by the Chief Engineer of Research and Development to discuss different methods for estimation of the mission reliability of a critical AH-1J (the Cobra) attack helicopter component. The data in question came from an experiment that had to be curtailed due to cost overruns. As such the data were considered as Type II censored data from an exponential failure model. The reliability engineer presented discussions of the MLE and the UMVUE of the mission reliability under this exponential assumption. The MLE procedure was part of MIL-STD 781, which was developed in the 1950s. Thus, the reliability engineer's suggestion that the UMVUE might perform better in small samples was met with criticism from the project engineers. Nonetheless, they raised a multitude of valid questions about his choice. They objected to the fact that the UMVUE could provide an estimated mission reliability of zero even though every component in the sample had a positive chance of completing the mission. They disliked the fact that if they took the estimated mission reliability and back-solved for the estimated failure rate, the result disagreed with the reliability engineer's best estimate of the failure rate obtained by the same procedure. These practical questions raised by the engineers illustrated the point that the UMVUE was not necessarily range-preserving, nor was it invariant under transformations. The engineers did like the fact that the UMVUE gave a higher estimate of the mission reliability than the MLE for the AH-1J data. Faced with these conflicting views, the Chief Engineer of Research and Development asked the reliability engineer to determine which of the two estimators gave the "better" (i.e., closer) estimate more often. Although his question was stated simply, the solution was not evident and could not be provided for several months.

Example 1.2.1 To present the more technical details of this example, let $X_1, \ldots, X_n$ be $n$ independent and identically distributed exponential random variables with common density function

$$f(x; \lambda) = \lambda e^{-\lambda x}, \qquad x > 0.$$
The parameter $\lambda$ is known as the failure rate of the process. The values $X_1, \ldots, X_n$ are the lifetimes of the $n$ components in the study. The time $t_0$ is the mission time for which the component is to be used. The reliability at the time $t_0$ is given by

$$R(t_0) = \Pr(X > t_0) = e^{-\lambda t_0}.$$
The maximum likelihood estimator $\hat R_L(t_0)$ is given by

$$\hat R_L(t_0) = e^{-t_0/\bar X},$$
where $\bar X$ is the average lifetime of the $n$ observations in the sample. The uniformly minimum variance unbiased estimator is given by

$$\hat R_U(t_0) = \left(1 - \frac{t_0}{n\bar X}\right)^{n-1} \qquad\text{for } \bar X \geq t_0/n.$$
If $\bar X < t_0/n$, then $\hat R_U(t_0) = 0$. However, in the asymptotic setting, we can show that $\hat R_U(t_0)$ and $\hat R_L(t_0)$ are asymptotically equivalent. The reliability engineer calculated the values of $\mathbb{P}(\hat R_L(t_0), \hat R_U(t_0) \mid R(t_0))$, which can be found in Dyer, Keating, and Hensley (1979). However, he was not content with the outcome that the comparison was mixed over the parameter space. After completing the requested pairwise comparison, the junior engineer found that the 50% lower confidence bound on $R(t_0)$ uniformly outperformed the MLE and UMVUE in the sense of Pitman's closeness criterion. Also, the 50% lower confidence bound (which is a median unbiased estimator in this case) was range preserving, and the back calculation of the estimated failure rate produced a 50% upper confidence bound on $\lambda$. Thus, this estimator was better than the standard ones and did not have the counterintuitive complications raised by the engineers. Due to the success of his techniques the junior engineer was moved from Reliability Engineering into Research and Development. He continued adding distributions to the list for which the 50% confidence bound outperformed the MLE and UMVUE under the Pitman closeness criterion.

PMC remains a reasonable alternative to MSE because of persistent problems in conventional estimation procedures and because among these "good" available estimators it provides a simple, readily understood way of choosing between two alternatives. The Pearson-Fisher example, the Rao-Berkson example, and the reliability engineer's example arise because of specific problems inherent to the application (see §§3.5 and 4.4, respectively, for detailed discussions). In irregular cases, such as the uniform distribution in §1.1, we would find it to be an appealing alternative. However, PMC
has arrived because of the compelling result that PMC gives rise to an optimal estimator (within a broad class) related to the intrinsic property of median unbiased estimation. The importance of this result is reflected in the fact that we devote Chapter 5 to synthesizing the major works on optimality by Ghosh and Sen (1989) and (1991), Kubokawa (1991), and Nayak (1990). PMC's existence does not rely on moments of the underlying family of distributions and it has origins in diverse areas of estimation such as Bayesian, influence-function-based, and conditional estimation procedures. These results, which are given in Chapter 5, were more readily accepted due to PMC's ability to explain the Stein effect in multivariate analysis (see Sen, Kubokawa, and Saleh (1989)).
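Returning to Example 1.2.1, the following minimal Python sketch estimates the pairwise Pitman closeness of the MLE, the UMVUE, and a median-unbiased estimator of the exponential mission reliability. The median-unbiased form used below, built from the chi-square pivot $2\lambda n\bar X \sim \chi^2_{2n}$, is an assumed construction of the 50% lower confidence bound rather than a formula quoted in the text, and the sample size, mission time, and failure rates are arbitrary illustrative values.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
n, t0, reps = 10, 1.0, 200_000
chi2_median = chi2.median(2 * n)            # median of the chi-square pivot with 2n degrees of freedom

for lam in [0.2, 0.5, 1.0, 2.0]:
    xbar = rng.exponential(scale=1 / lam, size=(reps, n)).mean(axis=1)
    R = np.exp(-lam * t0)                                                  # true reliability
    R_mle = np.exp(-t0 / xbar)                                             # MLE
    R_umvue = np.where(xbar > t0 / n, (1 - t0 / (n * xbar)) ** (n - 1), 0) # UMVUE
    R_med = np.exp(-t0 * chi2_median / (2 * n * xbar))                     # assumed median-unbiased estimator
    print(f"lambda = {lam:3.1f}: "
          f"P(MLE closer than UMVUE) ~ {np.mean(np.abs(R_mle - R) < np.abs(R_umvue - R)):.3f}, "
          f"P(median-unbiased closer than MLE) ~ {np.mean(np.abs(R_med - R) < np.abs(R_mle - R)):.3f}")
```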
1.3 The scope of the book
When several different estimators are available for comparison, it is natural to ask which one is best. The answer depends upon many items such as the use to be made of the chosen estimator and the loss incurred in selecting the wrong one. Judging the "goodness" of an estimator requires criteria for making this determination. Some of these criteria, such as MSE, are very popular among researchers due to their experience in using them. Others, such as PMC, are less frequently used due to a lack of general familiarity with the criterion and its associated properties.
1.3.1 The history, motivation, and controversy of PMC
The first three chapters of this book focus on the basic concepts, history, controversy, paradoxes, and examples associated with the PMC criterion. The attributes as well as the shortcomings of PMC are discussed in an attempt to impart a balanced perspective. Chapters 2 and 3 contain illustrations of the use of PMC in practical estimation problems, which arise in such diverse disciplines as political science, economics, athletics, engineering, sociology, marketing, and game theory. Chapter 2 contains discussions on the intuitive, intrinsic, and simplistic characteristics of PMC. We begin with a discussion of the history of PMC and introduce the concepts of risk, the competition between PMC and MSE, and the role of loss functions in the comparison of estimators. These results are illustrated through realistic estimation problems and presented with a limited amount of technical detail.
However, the PMC criterion is not presented as a panacea to be used in all estimation problems. Throughout Chapter 3, we discuss the shortcomings of PMC and provide a sort of Surgeon General's Warning about its use (as suggested by Efron (1975)). The lack of transitiveness of PMC is discussed and the paradoxical consequences are illustrated. We compare preferences obtained by the use of PMC with those obtained by other methods whenever the comparison provides insight into the estimation process or clear weaknesses in other criteria. Through simple practical problems the reader can discover the prevalence of the usefulness of PMC and the caveats associated with its limitations. Chapter 3 explores the many controversies and paradoxes associated with PMC and emerges with a unified conclusion that echoes the sentiments of Johnson and Rao about the usefulness of different criteria in estimation.
1.3.2 A unified development of PMC
The last three chapters of this book contain a current account of theoretical and mathematical research in PMC. The examples used to illustrate and support these theoretical findings are ones related either to the controversies put forth in the first part of the book or are important practical examples from the physical and engineering sciences. In Chapter 4, the role of PMC in pairwise comparisons is developed through a sequence of successive generalizations that expand the class of estimation problems for which PMC can be computed. An extremely useful result is given whenever competing estimators are functions of the same univariate statistic, which happens frequently in the presence of a sufficient statistic. Chapter 5 contains a general development of Pitman-closest estimators through the use of topological groups and group invariant structures. The results are unified through the works of Ghosh and Sen (1989), Nayak (1990), and Kubokawa (1991) and result in a Rao-Blackwell-type theorem for PMC. An alternative posterior Bayesian interpretation of PMC produces a posterior Pitman-closest estimator from the work of Ghosh and Sen (1991). Linear estimation is a rich source of estimators and special methods are developed to allow us to compare different linear estimators. An optimal linear estimation procedure is given which relates the Kagan-Linnik-Rao form of the Pitman estimators to medians obtained by conditioning on ancillary statistics. Chapter 6 contains the important asymptotic properties that can be obtained for PMC and Pitman-closest estimators. BAN estimators are shown
to be Pitman-closest within a broad class of consistent asymptotic normal (CAN) estimators. The Pitman-closest estimators derived in Chapter 5 are examined in Chapter 6 but in an asymptotic framework. Under suitable regularity conditions these estimators are shown to be not only first-order efficient in the sense of Fisher but also second-order Pitman-closest. These latter results are made possible by a first-order asymptotic normal representation for the Pitman-closest estimators (Sen (1992b)). This representation is also useful in the asymptotic comparison of Pitman estimators, linear estimators, and MLEs. We conclude this chapter with a brief remark on multiparameter estimation problems, which have received the basic impact of PMC during the past six years. The role of a loss function is far more complex in a multiparameter estimation problem, and the mathematical complexities involved in the study of optimal properties of estimators make it difficult for us to include these developments in this volume. In the multiparameter context, Stein (1956) presented the paradox that even if each coordinate of a multivariate estimator is marginally optimal for the corresponding coordinate of the parameter, the vector-valued estimator may not be jointly optimal. For the normal mean vector, James and Stein (1962) constructed a shrinkage estimator that dominates the MLE in quadratic risk. This remarkable finding led to a steady growth of results in multiparameter estimation and the Pitman closeness criterion has contributed as a viable criterion in this context. Sen, Kubokawa, and Saleh (1989) and Sen (1992a) contain useful accounts of these related developments. We hope that readers interested in studying these more advanced topics will consult the references at the end of this book.
Chapter 2
Development of Pitman's Measure of Closeness

The concepts underlying Pitman's measure of closeness (PMC) are tied to both an intuitive and a philosophical understanding of risk. These issues can become complicated as we consider pairwise and multiple comparisons of estimators. Nevertheless, the basic ideas are easy to understand and are tied closely to comparison examples that occur daily in our lives. This chapter begins with a discussion that traces the historical development of PMC and the motivation for its use as an alternative to MSE. We consider the more popular criterion of minimum mean squared error (MSE) in order to place PMC in its proper perspective. This is particularly pertinent in view of the strong emphasis given to MSE in the statistical literature. We explore the concept of risk as well as our understanding of it. Risk is shown to be sensitive to the choice of the underlying loss function. Illustrations are given of situations where MSE does not exist and where PMC is a helpful alternative. Finally, we discuss when an estimator is compared to both an absolute ideal as well as to other estimators to show the usefulness of joint versus marginal information.
2.1 The intrinsic appeal of PMC
An exploration of the historical development of Pitman's measure of closeness will be useful in our understanding of its merits. To do this, however, we will need to look at the major statistical criterion used in comparing estimators, namely, minimum mean squared error. Then these two procedures, MSE and PMC, will be compared and contrasted.
2.1.1 Use of MSE
The comparison technique based on minimizing the mean squared error has an extensive history. It was first discussed by Gauss, who stated the following in a paper that was read to the Royal Society of Gottingen, Germany, on February 15, 1821 (see Farebrother (1986)):

From the value of the integral $\int_{-\infty}^{\infty} x\phi(x)\,dx$, i.e., the average value of $x$ (defined as deviation in the estimator from the true value of the parameter) we learn the existence or non-existence of a constant error as well as the value of this error; similarly, the integral $\int_{-\infty}^{\infty} x^2\phi(x)\,dx$, i.e., the average value of $x^2$, seems very suitable for defining and measuring, in a general way, the uncertainty of a system of observations. ... If one objects that this convention is arbitrary and does not appear necessary, we readily agree. The question which concerns us here has something vague about it from its very nature, and cannot be made really precise except by some principle which is arbitrary to a certain degree. ... It is clear to begin with that the loss should not be proportional to the error committed, for under this hypothesis, since a positive error would be considered as a loss, a negative error would be considered as a gain; the magnitude of loss ought, on the contrary, to be evaluated by a function of the error whose value is always positive. Among the infinite number of functions satisfying this condition, it seems natural to choose the simplest, which is, without doubt, the square of the error, and is the way proposed above.

The MSE methodology stemmed directly from Laplace's earlier work in 1774 on estimation based on minimization of the mean absolute error of estimation. An MMSE estimator is defined as follows.

Definition 2.1.1 (Minimum Mean Squared Error (MMSE)). If $\theta$ is a parameter to be estimated and $\hat\theta_1, \ldots, \hat\theta_k$ are competing estimators, the estimator with the smallest MSE is found by selecting $\hat\theta_j$ such that

$E(\hat\theta_j - \theta)^2 = \min_{1 \le i \le k} E(\hat\theta_i - \theta)^2.$
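Definition 2.1.1 can be explored empirically when exact mean squared errors are awkward to derive. The following minimal sketch, in which the function name and the normal mean-versus-median example are our own illustrative choices rather than anything from the text, approximates each candidate's MSE by Monte Carlo and reports the apparent MMSE estimator.

```python
import numpy as np

def empirical_mse(estimators, sampler, theta, reps=50_000, seed=0):
    """Approximate the MSE of each candidate estimator by simulation."""
    rng = np.random.default_rng(seed)
    sums = {name: 0.0 for name in estimators}
    for _ in range(reps):
        x = sampler(rng)
        for name, est in estimators.items():
            sums[name] += (est(x) - theta) ** 2
    return {name: s / reps for name, s in sums.items()}

# Illustration: sample mean versus sample median for a normal sample of size 5.
theta = 0.0
cands = {"mean": np.mean, "median": np.median}
mses = empirical_mse(cands, lambda rng: rng.normal(theta, 1.0, size=5), theta)
print(mses, "MMSE choice:", min(mses, key=mses.get))
```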
The basis of the MMSE estimator in decision theoretic settings is its minimization under a squared error loss function. Frequently, the MSE comparison does not produce one estimator that is uniformly better for all $\theta$. Indeed, a uniformly MMSE estimator is impossible within an unrestricted class of estimation rules. It is very popular, however, less because of its practical relevance and more because of its tradition, mathematical convenience, and simplicity, to summarize Gauss. This fact has been noted by several authors, including Karlin (1958) who gave the following explanation of its merits:

The justification for the quadratic loss as a measure of the discrepancy of an estimate derives from the following two characteristics: (i) in the case where $a(x)$ represents an unbiased estimate of $h(\omega)$, MSE may be interpreted as the variance of $a(x)$ and, of course, fluctuation as measured by the variance is very traditional in the domain of classical estimation; (ii) from a technical and mathematical viewpoint square error lends itself most easily to manipulation and computation.

Due to its many attractive features, the criterion of MSE has been ingrained in statistical education. Although numerous estimation techniques such as maximum likelihood and minimum chi-square exist, few are as readily advocated and well accepted as the MMSE estimator. Another popular attribute of the MSE criterion can be developed if we consider a smooth loss function in a neighborhood of the parameter. Then for consistent estimators, MSE is a natural risk function in large samples. To illustrate this point, consider the following loss function
$\mathcal{L}(\hat\theta, \theta) = \rho(\hat\theta - \theta),$

with the properties:

(i) $\rho(0) = 0$ (i.e., whenever $\hat\theta = \theta$ no loss is incurred);
(ii) $\rho'(0) = 0$ (i.e., the loss function is smooth in a neighborhood of 0);
(iii) $\rho(x)$ is nondecreasing for $x > 0$ and nonincreasing for $x < 0$;
(iv) $\rho^{(k)}(0)$ exists for each $k = 0, 1, 2, \ldots$.

Then the MacLaurin series expansion of $\rho(\hat\theta - \theta)$ is given by

$\rho(\hat\theta - \theta) = \rho(0) + \rho'(0)(\hat\theta - \theta) + \tfrac{1}{2}\rho''(0)(\hat\theta - \theta)^2 + \tfrac{1}{6}\rho'''(0)(\hat\theta - \theta)^3 + \cdots,$
where the first two terms vanish due to properties (i) and (ii) of the loss function. If $\hat\theta$ is a consistent estimator, $\hat\theta$ tends to be close to $\theta$ in large samples; consequently, the cubic, quartic, and subsequent terms of the above series become negligible. This yields the result that for large sample sizes, a smooth loss function for the consistent estimator $\hat\theta$ in a neighborhood of $\theta$ is given by

$\rho(\hat\theta - \theta) \approx \tfrac{1}{2}\rho''(0)(\hat\theta - \theta)^2,$
which is the quadratic loss function. (Note that $\rho''(0) > 0$ is not guaranteed by property (iii) alone but requires that $\rho(x)$ be concave upward in a neighborhood of the origin.) This approximation certainly provides the asymptotic appeal for MSE. In the asymptotic setting, comparisons made using the Pitman closeness criterion will be shown to be concordant with those obtained by MSE (see §6.2). With small sample sizes, however, the higher-order terms in the above series are not necessarily negligible and thus the quadratic loss function may not be appropriate. Condition (iv) can be relaxed so that a broader class of loss functions may be included in this appeal to MSE. If we replace condition (iv) with the condition that $\rho''(x)$ exists and is continuous for all $x$ then the second-order Taylor expansion of $\rho(\hat\theta - \theta)$ is given by

$\rho(\hat\theta - \theta) = \tfrac{1}{2}\rho''\bigl(h(\hat\theta - \theta)\bigr)(\hat\theta - \theta)^2,$
where $0 < h < 1$. With small sample sizes, however, setting $a = h(\hat\theta - \theta)$, $\rho''(a) - \rho''(0)$ may not be negligible. This modification allows us to include loss functions of the form $\rho(y) = |y|^r$ where $r = 2 + \delta$ for some $0 < \delta < 1$. Another popular loss function is absolute error loss, defined by

$\mathcal{L}(\hat\theta, \theta) = |\hat\theta - \theta|,$
and the estimator for which the expected value of $|\hat\theta - \theta|$ is smallest among the competing estimators is known as the minimum mean absolute deviation (MMAD) estimator. Note that this loss function does not satisfy property
(ii) since $\rho'(0)$ does not exist. Hence the MacLaurin expansion is not applicable in this case. In a similar way loss functions defined as powers of the absolute loss may violate property (iv) as well, thus making the MacLaurin expansion even less plausible. Another problem with MSE is that, even in large samples, since MSE may heavily weight large errors, it will not necessarily provide for high probabilities that the estimator is near the true value of the parameter. This result has been recognized by many authors (e.g., Rao (1981)) and has led to much controversy about the use of MSE as a criterion. The following simplistic example illustrates one such criticism by Rao (1981) that quadratic and higher-order loss functions may place "... undue emphasis on large deviations which occur with small probability."

Example 2.1.2 Suppose a tour bus company conducts excursions for school children who must pay the driver $2.00 for each trip. Two different tours are available. The first driver, an experienced but somewhat lazy person, knows that the schools always send groups in multiples of ten students. Rather than count each student, he mainly watches the loading process and allows students to deposit their own $2.00 fares in the token box. Since he knows the bus holds exactly 50 students, he counts the number of apparently empty seats using his overhead mirror. He then subtracts the count from 50 and rounds to the nearest 10 to obtain an estimate of the number of students on the bus. His count is made inaccurate by children whose heads are not seen, who move during the count, who sit more than two to a seat, and so forth. In contrast, a second driver meticulously collects each fare and individually counts the students as they pay their money. However, he always pockets $2.00 before depositing the collected fares in the token box. He then provides an estimated count one less than the true number. Let $\hat\theta_1$ represent the estimate given by the first bus driver and $\hat\theta_2$ be the estimate given by the second bus driver of the unknown student count $\theta$. Suppose $\hat\theta_1$ equals $\theta$ with probability .99 and equals $\theta \pm 10$ with probability .01. Thus $\hat\theta_1$ errs (in absolute value) by 10 with a frequency of only 1%. Suppose also that $\hat\theta_2$ equals $\theta - 1$ with probability 1. Define the risk $R_1(q)$ associated with the first driver's estimate as

$R_1(q) = E|\hat\theta_1 - \theta|^q = (.01)10^q,$
whereas the risk $R_2(q)$ in using the second driver's estimate is

$R_2(q) = E|\hat\theta_2 - \theta|^q = 1,$
where $q > 0$. If one is certain about the "cost" of erring and therefore knows $q$, then the choice of which estimator to use is clear. When $q = 1$, $R_1(1) = 0.1$ and $R_2(1) = 1$, implying that with respect to absolute risk, $\hat\theta_1$ is the preferred estimator as the risk associated with $\hat\theta_1$ is the smaller of the two. However, when $q = 2$, the "infrequent" but large error associated with $\hat\theta_1$ (a result of the first bus driver rounding the count to the nearest 10 students) is squared so that $R_1(2) = 1$ and $R_2(2) = 1$. The estimates now are equivalent with respect to MSE. With higher-order loss functions, such as $q = 3$, a reversal of preference occurs and $\hat\theta_2$ is the choice. As can be seen from this example, MSE and higher-order loss functions place heavy emphasis on the error in estimation made by the first bus driver. Yet such an error has a frequency rate of only 1%! In protecting against large errors, the MSE criterion fails to choose the estimator which is more frequently closer to the true count $\theta$. To find this probability we compute

$\Pr(|\hat\theta_1 - \theta| < |\hat\theta_2 - \theta|) = \Pr(\hat\theta_1 = \theta) = .99.$
Therefore, $\hat\theta_1$ provides a closer estimate of $\theta$ than $\hat\theta_2$ with a frequency of 99%. More importantly, this choice remains the same regardless of the form of the loss function. Example 2.1.2 is counterintuitive for the mean squared error criterion in that $\hat\theta_1$ is exactly correct with 99% frequency, whereas $\hat\theta_2$ always errs by exactly 1. From a practical point of view, if one worked for a company and had such drastic odds, which estimator would one use to estimate an unknown parameter, $\theta$, such as the fraction defective? For a realistic example of this simplified situation, the reader is referred to the problem of estimation of the fraction defective in the normal distribution in Example 4.5.2. Consider this comparison when the true fraction defective is .05. The outcome of Example 2.1.2 should not be overemphasized. Consider two pharmaceutical machines that fill a medication, such as Accutane, Digitalis, or Lithium, which can be toxic if taken in excessive amounts. The first machine produces a 40 mg capsule of Accutane with 99% frequency but produces a 60 mg capsule with 1% frequency. The latter dose can be toxic with such severe side effects as blindness and irreversible liver damage. The second filling machine always produces a capsule with 39 mg of Accutane. In this case the first machine produces a capsule closer to the doctor's specifications with a relative frequency of 99%. However, the consequences of an overdose are so severe that even a 99% better fill rate does not justify use of the first filling machine.
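The arithmetic behind Example 2.1.2 is easy to reproduce. The short sketch below (the function name is a hypothetical choice of ours) evaluates the two risks for several values of q and shows that, unlike the risk ranking, the closeness comparison never changes.

```python
def risks(q):
    """Risks of the two drivers' estimates under the loss |error|**q."""
    r1 = 0.99 * 0 ** q + 0.01 * 10 ** q   # driver 1: exact 99% of the time, off by 10 otherwise
    r2 = 1.0 ** q                          # driver 2: always off by exactly 1
    return r1, r2

for q in (1, 2, 3):
    print(q, risks(q))        # (0.1, 1.0), (1.0, 1.0), (10.0, 1.0)

# The Pitman comparison does not involve q at all: driver 1 is strictly
# closer exactly when his count is correct, i.e. with probability 0.99.
```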
The following example by Robert, Hwang, and Strawderman (1993) exhibits a much different problem than that suggested by Rao in that the estimator $\hat\theta_1$ has large deviations from the target parameter $\theta$ but the probability of occurrence, which equals .49, is reasonably large.

Example 2.1.3 Let $\hat\theta_1$ and $\hat\theta_2$ be two estimators of $\theta \in \mathbb{R}$ such that $\hat\theta_1$ equals $\theta$ with probability .51 and $\theta \pm 10^{23}$ with probability .49, while $\hat\theta_2 = \theta \pm .1$ with probability 1. Then it follows that

$\Pr(|\hat\theta_1 - \theta| < |\hat\theta_2 - \theta|) = .51.$
Therefore, $\hat\theta_1$ is Pitman-closer than $\hat\theta_2$ in estimation of $\theta$ but "errs badly almost half the time." Whenever $p > \tfrac{1}{24}\log_{10}(2)$,

$E|\hat\theta_2 - \theta|^p < E|\hat\theta_1 - \theta|^p.$
This shows that $\hat\theta_2$ is preferred to $\hat\theta_1$ for almost any $\ell_p$-loss function because $\hat\theta_1$ errs "badly" with probability .49. Note that, even in this example (despite the enormous deviation given to the Pitman-closer estimator in cases where it is not preferred), there exist values of $p$ $(< \tfrac{1}{24}\log_{10}(2))$ for which the estimator preferred under Pitman's criterion coincides with that obtained through risk. Nonetheless, Robert et al. (1993) illustrate that the appeal for the Pitman closeness criterion is its basis as a majority preference rule. The importance of this observation and the subsequent role of the Pitman closeness criterion in game theory are discussed in §3.2. This example illustrates that when we have firm knowledge of the relative seriousness of the possible errors, our use of PMC may be inappropriate. Like other criteria, PMC is not universally applicable. The context of these examples provides no overwhelming endorsement of either criterion but can serve to manifest complications with both. It does serve to demonstrate that the choice of the loss function should be neither capricious nor automatic. Many statisticians concur with this conclusion, which seems consistent with the stance taken by Rao (1991).
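Because every example in this section turns on the probability that one estimator is closer to θ than another, a generic Monte Carlo helper is often the quickest way to explore such comparisons. The sketch below is a minimal illustration of our own; the half-weight given to ties and the mean-versus-median normal example are assumptions made here for concreteness, not conventions taken from the text.

```python
import numpy as np

def pmc(est1, est2, sampler, theta, reps=100_000, seed=0):
    """Monte Carlo estimate of Pr(|est1(X) - theta| < |est2(X) - theta|),
    giving half weight to exact ties."""
    rng = np.random.default_rng(seed)
    wins = ties = 0
    for _ in range(reps):
        x = sampler(rng)
        d1, d2 = abs(est1(x) - theta), abs(est2(x) - theta)
        wins += d1 < d2
        ties += d1 == d2
    return (wins + 0.5 * ties) / reps

# Illustration: sample mean versus sample median for a normal sample of size 5.
theta = 0.0
print(pmc(np.mean, np.median, lambda rng: rng.normal(theta, 1.0, size=5), theta))
```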
2.1.2 Historical development of PMC
Pitman's (1937) measure of closeness is based on the probabilities of the relative closeness of competing estimators to an unknown parameter. For example, a confidence interval is a measure of the concentration (closeness)
of an estimator about the true value of the parameter. PMC can be used to define a Pitman-closer estimator and a Pitman-closest estimator within a class $D$.

Definition 2.1.4 (Pitman-Closer Estimator). If $\hat\theta_1$ and $\hat\theta_2$ are estimators of a common parameter $\theta$, then $\hat\theta_1$ is said to be a Pitman-closer estimate than $\hat\theta_2$ if

$I\!P(\hat\theta_1, \hat\theta_2 \mid \theta) \ge \tfrac{1}{2}$
for all values of $\theta$, with strict inequality for at least one $\theta \in \Omega$.

Definition 2.1.5 (Pitman-Closest Estimator). Let $D$ be a nonempty class of estimators of a common parameter $\theta$. Then $\hat\theta^*$ is Pitman-closest among the estimators in $D$ provided, for every $\hat\theta \in D$ such that $\hat\theta \ne \hat\theta^*$ (a.e.),

$I\!P(\hat\theta^*, \hat\theta \mid \theta) \ge \tfrac{1}{2}$
for all values of $\theta$, with strict inequality for at least one value of $\theta$. For a variety of technical reasons, there has been a general lack of interest in PMC. Pitman originally showed that the measure was intransitive. Blyth (1972) also noted this problem, as well as some paradoxes illustrating the inconsistency of PMC in selecting estimators based on pairwise or simultaneous comparisons. There were difficulties in evaluating the probability statement in Definition 1.0.1 and the accompanying theory was not of a sufficiently general nature to make the measure useful. The controversy between Berkson (1980) and Rao (1980) as mentioned in §1.2.1 over the use of MSE as a criterion in estimation sparked new interest in PMC as an alternative measure. Rao (1981) successfully argued for PMC as an intrinsic measure of acceptability and presented many diverse univariate examples in which shrinking the MSE of an unbiased estimator to a MMSE estimator did not yield a better estimator in the sense of PMC. Keating (1983) observed a similar phenomenon for percentile estimators under absolute error risk and he later framed Rao's observations in the context of a risk unbiased estimator. He showed that the phenomenon holds uniformly for certain estimation problems under absolute error risk. With this renewed interest in PMC has come more understanding of the methodology. Keating and Mason (1985b) provide some intuitive examples that illustrate and clarify the paradoxes raised by Blyth. These same
authors (1988b) use PMC to give an alternative perspective to James-Stein estimation. Sen, Kubokawa, and Saleh (1989) shed light on Stein's paradox in the sense of PMC. Many other illustrations of the expanded attention being given to PMC can be seen by scanning the titles of the references at the end of this book. Readers will find Pitman's (1937) seminal paper to be essential reading in the discussions that follow. It has been conveniently reprinted in a 1991 special issue of Communications in Statistics-Theory and Methods, A20(ll), devoted to Pitman's measure of closeness. Pitman defined the criterion, presented examples where the MLE might be misleading, articulated the central role played by the median in his criterion, introduced the notion that we know as equivariance, and recognized the intransitive nature of pairwise comparisons under his criterion. In the latter regard the reader will undoubtedly be surprised at Pitman's declaration: "In any case, this (intransitiveness) does not detract from the definition of closest." Although Pitman gave no rationale for this declaration, he was clearly convinced of the lack of necessity for transitiveness whenever a Pitman-closest estimator existed. His ability to segregate the concept of a Pitman-closest estimator among a class from that of a Pitman-closer estimator between a pair reflects an intuition which places his understanding of the criterion years ahead of his colleagues. In this regard, it suffices to say that more than five decades passed before the mathematical foundations of all the concepts introduced by Pitman were completely established. In the historical discussion which follows, we trace various ideas of Pitman to their ultimate solution in mathematical rather than chronological order. The ingenious nature of Pitman's original work was given little attention in its day. Shortly after the appearance of Pitman's work, Germany precipitated World War II with its invasion of Poland in 1939. Obviously many statistical pursuits were diverted to support the war effort. More importantly, between 1940 and 1950 many theoretical advances in decision theory were made, including the Frechet-Cramer-Rao inequality, the RaoBlackwell Theorem, and the Lehmann-Scheffe Theorem. These hallmark results based on the criterion of mean squared error clearly dominated the statistical literature at the time and for decades to follow. A review of Pitman's personal research shows he derived the Pitman (1939) estimators (PE) of location and scale based on squared error loss. Later, we shall see that his adherence to quadratic risk probably prevented him from recognizing the connection of his closeness criterion with Pitman estimators. This connection will be made in §§5.1 and 5.2. The relationship
between Pitman-closest estimators and Pitman estimators was no doubt stimulated by the work of Nayak (1990), who explained the role of location and scale invariance in the construction of Pitman-closest estimators. However, Ghosh and Sen (1991) explain the connection via a Bayesian perspective and Kubokawa (1991) extends these results to a general group invariant structure. Although Kubokawa's discussion is quite brief, a full development is given in §5.3. In 1944 Geary made a significant contribution in the determination of the closeness probability for unbiased estimators of a common mean. His approach will be used in a generalization known as the Geary-Rao Theorem in §4.2. His example of the comparison of the sample mean to the sample median in random samples chosen from a normal distribution is presented asymptotically (see Example 4.6.8). The asymptotic nature of Geary's example prompted Scheffe (1945) to advocate the Pitman closeness criterion as a procedure for discrimination among consistent estimators of a common parameter. This idea of Scheffe's is discussed in §6.2 and has connections with Rao's (1961) concept of second-order efficiency. The formal investigation of the asymptotic nature of the Pitman closeness criterion was started by Sen (1986a) with his study of best asymptotic normal (BAN) estimators. Its full development is the subject of Chapter 6. Much of the research from 1945-1975 involved the tedious comparison of estimators for known distributions. The importance of these comparisons (e.g., Landau (1947), Johnson (1950), Eisenhart et al. (1963), Zacks and Even (1966), and Maynard and Chow (1972)) should not be overlooked because they gave reasonable examples in which mean unbiased estimators were not necessarily Pitman-closer. Johnson (1950) generalized the results of Geary and Landau by dropping the condition of unbiasedness with respect to estimation of the mean in a normal distribution (see Example 4.2.1). In the uniform distribution, Johnson showed that the MLE is inadmissible under the Pitman closeness criterion and, like Pitman, he constructed a Pitman-closest estimator (see Example 3.3.2). An exception to these articles was given by Blyth (1972), who discussed several paradoxes surrounding the Pitman closeness criterion. The first paradox was related to the criterion's intransitive nature. The issue of transitiveness was given a partial solution by Ghosh and Sen (1989) and Nayak (1990) through restrictions of invariance. Ghosh and Sen (1991) also furthered this cause by showing that the posterior Pitman closeness criterion produced transitive comparisons. Hence, in the posterior setting established by Ghosh and Sen (1991), the paradox vanishes. Blyth also introduced the "pairwise best-simultaneous
worst paradox," a useful discussion of which is given in §3.2. Later Keating and Gupta (1984) and Keating and Mason (1985a) developed techniques for these simultaneous evaluations based on PMC for a specific class of estimators. Blyth (1951) extended the well-known centerpiece of estimation theory in showing that the sample mean in random samples chosen from a normal distribution was admissible in the sense of mean squared error across the class of all decision rules. He stated:

For classes of decision rules unrestricted by unbiasedness or invariance, it is clear that no decision rule is uniformly best in the sense of MSE.

Blyth's generalization of the Hodges-Lehmann admissibility result allayed the apprehension among researchers about biased estimators that might have smaller MSE than their unbiased counterparts. Its importance to much of statistical inference arises from large sample application through the Central Limit Theorem. Note that constant estimators are also admissible under the MSE criterion but admissibility alone does not make them useful in any circumstance. Within a decade, Stein (1956) showed that in three or more dimensions, the MLE is an inadmissible estimator of the normal mean vector. He also constructed biased estimators, known as Stein-rule or shrinkage estimators, which dominate the MLE in terms of quadratic risk. Efron (1975) showed (see Example 1.1.2) that the Stein effect occurred in the one-dimensional case for the Pitman criterion. Robert et al. (1993) presented a class of decision rules for which one-dimensional UMVUEs are inadmissible in the Pitman sense. Keating and Mason (1988b) discussed the importance which the Pitman closeness criterion had on the subsequent MSE comparisons in two dimensions. This started a new but fragmented age of research in the Pitman closeness criterion. From 1979-1981, Dyer and Keating, in a sequence of articles, began to derive some general results for the determination of the Pitman closeness criterion whenever the estimators were functions of a common (usually sufficient) statistic. Within restricted classes, they derived Pitman-closest estimators of percentiles and scale parameters. Keating (1983), (1985) extended the earlier results for Pitman-closest estimators to location and scale parameter families. An extensive method for the determination of the Pitman closeness criterion was given by Rao, Keating, and Mason (1986), and
the results of Geary (1944) were shown to be a special case of the Geary-Rao Theorem. Much of the work completed between 1945-1985 resulted in Pitman-closest estimators which were median unbiased. Recall that an estimator, $\hat\theta$, is median unbiased for a parameter $\theta$ provided

$\Pr(\hat\theta \le \theta) \ge \tfrac{1}{2} \quad \text{and} \quad \Pr(\hat\theta \ge \theta) \ge \tfrac{1}{2} \quad \text{for all } \theta.$
This observation had already been made by Pitman, but the class of examples was now sufficiently large to produce a general result. Ghosh and Sen (1989) gave conditions under which a median-unbiased estimator (MUE) would be Pitman-closest within a class. Nayak (1990) actually provided a technique for the construction of Pitman-closest estimators of location and scale which were median unbiased (see §§5.1 and 5.2, respectively). Kubokawa (1991) generalized this median-unbiased property for a general group invariant structure (see §5.3). However, the renewed interest in the study of the Pitman closeness criterion was begun by Rao (1980), (1981). He offered examples in which shrinking an unbiased estimator to a minimum mean squared error estimator within a class did not make the latter Pitman-closer. His approach was directed at the indiscriminate use of the mean squared error criterion. He proposed that, in the comparison of "improved" estimators derived under various conditions, the process should improve such natural properties as the Pitman closeness criterion and stochastic domination with respect to the classical estimates such as UMVUE and MLE. He questioned the usefulness of the improvement and clearly opposed Berkson's (1980) stance on minimum chi-square estimation. Keating (1985) discussed this phenomenon in the estimation of scale parameters. Keating and Mason (1985b), at the suggestion of Rao, discussed many practical examples in which the Pitman closeness criterion may be more useful than MSE. These practical papers made it easier for the general readership to understand the questions raised by Rao and one example is given in the following subsection. From the reliability estimation problem contained in Example 1.2.1, the reader can begin to see why Rao called PMC an intrinsic criterion. It arises naturally out of the estimation problem and can be interpreted simply by engineers and scientists with some formal training in estimation theory. Moreover, the intrinsic nature dominates some estimation problems where emphasis is placed on closeness to the target value. We present a relevant example due to Keating and Mason (1985b) illustrating a situation where closeness probability is a meaningful criterion for comparing estimators.
2.1.3 Convenience store example
Suppose there are $k$ competing convenience stores located in the town of Pitmanville. The loci of the stores will be denoted by the points $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$ in the plane of Pitmanville. We will assume that it is unknown whether or not the grid system of the town is rectangular, and in the absence of this knowledge will choose the Mahalanobis distance to measure the distance between points. Each $\mathbf{x}_i$ is thus the rectangular coordinate pair associated with the location of the $i$th convenience store. The two coordinate axes are, respectively, east-west and north-south, and the origin (0,0) is the location of the town square. The population distribution over Pitmanville will be represented by $f(\mathbf{x})$, the joint density function of $\mathbf{X}$ that is the coordinate location of an arbitrary individual. Of concern is the location of this individual relative to the $i$th store. We will represent the weighted squared distance between the individual and the $i$th store by

$(\mathbf{X} - \mathbf{x}_i)'\Sigma^{-1}(\mathbf{X} - \mathbf{x}_i), \qquad (2.3)$
where $\Sigma$ is the covariance matrix of $\mathbf{X}$. The expectation of the value in (2.3) is the bivariate analogue of MSE and is referred to as quadratic risk. Individuals usually patronize the convenience store that is closest to their location (all other considerations being the same) rather than the one that has the smallest weighted squared distance (i.e., MSE). To illustrate this consider Figure 2.1, which locates three stores (denoted by $\mathbf{x}_1$, $\mathbf{x}_2$, and $\mathbf{x}_3$) in the plane of Pitmanville. The town has been subdivided into three regions (denoted $V_1$, $V_2$, and $V_3$) in order that the persons located in a given region $V_i$ are closer to the $i$th store than to any other. If the three $\mathbf{x}_i$ are not collinear, the perpendicular bisectors of the line segments joining each pair of $\mathbf{x}$'s determine the boundaries of these regions. The simultaneous closeness probability is simply the probability of $V_i$, denoted $\Pr(V_i)$. Suppose that the coordinate location of the population density has a bivariate normal distribution with mean vector $\mathbf{0}$ and covariance matrix $\Sigma$. With this assumption the weighted squared distance in (2.3) has a noncentral chi-squared distribution with two degrees of freedom and a noncentrality parameter given by $\mathbf{x}_i'\Sigma^{-1}\mathbf{x}_i$. Using this result it follows that the MSE is given as

$E\bigl[(\mathbf{X} - \mathbf{x}_i)'\Sigma^{-1}(\mathbf{X} - \mathbf{x}_i)\bigr] = 2 + \mathbf{x}_i'\Sigma^{-1}\mathbf{x}_i.$
Figure 2.1: The Voronoi tessellation of three convenience stores.
Figure 2.2: Example of stores distributed on the unit circle.
The weighted MSE is at a minimum when $\mathbf{x}_i$ is at the origin $\mathbf{0}$. Thus MSE places a value on each store without considering the location of competing stores, while the closeness probability accounts for the relative positions of the competing stores through its determination of the size of the region $V_i$. Figure 2.2 illustrates three stores located on the unit circle at coordinates given by $(-\sqrt{2}/2, \sqrt{2}/2)$, $(1/2, \sqrt{3}/2)$, and $(0, -1)$. In this example we will assume that $\Sigma$ is the $2 \times 2$ identity matrix. Note that each store would be closest to exactly 50% of the town population in any of the three pairwise comparisons and that all three stores have the same weighted MSE value of 3. The conclusion at this stage would be that store location should not be a factor in deciding which store to patronize. However, consider what happens when the closeness probability is used as a criterion of comparison. The probability for $V_i$ represents the fraction of the population that is located closer to $\mathbf{x}_i$ than any other store. These probabilities can be calculated using the probability content of sectors of the bivariate normal. A simpler geometric approach has been described by Farebrother (1986) and is given below. Note in Figure 2.2 that the three stores are located at the same distance from the town square. One is in the direction N 30° E; the other at N 45° W; and the third located due S. The angles between these directions are 75°, 135°, and 150°, so the first store has a sector with a central angle of 37.5 + 75 = 112.5°; the second store a sector of 37.5 + 67.5 = 105°; and the third store a sector of 67.5 + 75 = 142.5°. Dividing by 360°, we find that the stores attract the following proportions of the population:

Store   Pr(V_i)
1       .3125
2       .2917
3       .3958

Based on these calculations, a larger proportion of the population in Pitmanville will live closer to the location of store 3 than to the location of the two competing stores. In general, all the stores on the same contour of constant population density will have the same weighted MSE but they will not necessarily share the same fraction of customers located in Pitmanville. The above example illustrates another useful aspect of PMC as an alternative criterion to MSE in comparing estimators. In this case, we illustrate the importance not of pairwise comparisons but rather simultaneous ones (see §3.2). This example, however, is not a universal result, as similar examples could be constructed to show that MSE is often preferred to PMC.
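These sector probabilities are easy to confirm by simulation: with Σ equal to the identity matrix the bivariate normal is rotationally symmetric, so each Voronoi region of the three unit-circle stores captures a probability equal to its central angle divided by 360°. The sketch below (the store ordering, seed, and sample size are our own choices) assigns simulated residents to their nearest store.

```python
import numpy as np

rng = np.random.default_rng(2)
# Stores 1, 2, 3 in the directions N 30 E, N 45 W, and due S on the unit circle.
stores = np.array([[0.5,             np.sqrt(3) / 2],
                   [-np.sqrt(2) / 2,  np.sqrt(2) / 2],
                   [0.0,             -1.0]])

residents = rng.standard_normal((500_000, 2))            # standard bivariate normal town
dist2 = ((residents[:, None, :] - stores[None, :, :]) ** 2).sum(axis=2)
nearest = dist2.argmin(axis=1)
print(np.bincount(nearest) / len(residents))              # approximately .3125, .2917, .3958
```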
What should be gained from this discourse is the need to carefully consider the merits and limitations of other criteria besides MSE when comparing competing estimators. For example, when we decide to purchase a gallon of milk at 11:00 pm from the local convenience store we usually do not stop and compute the MSE of the population density of each store. Instead, we probably go to the store that is closest in travel time to our homes. While the placement of new stores according to MSE would always be at the town center, the placement using the PMC would be based on the proportion of the population closest to the store. We note that the partition, $V_1$, $V_2$, and $V_3$, is the Voronoi tessellation of $\mathbb{R}^2$ determined by the points $\mathbf{x}_1$, $\mathbf{x}_2$, and $\mathbf{x}_3$. Nearest neighbor algorithms have been developed to determine the regions whenever many points or stores are involved. Keating and Mason (1988b) discuss the extensive impact on the Voronoi tessellation created by changing to the $\ell_1$ metric, which is frequently used in robotics. For example, the Voronoi tessellation becomes random if the coordinates $\mathbf{x}_1$, $\mathbf{x}_2$, and $\mathbf{x}_3$ represent the locations of three police patrol cars in Pitmanville (see Stoyan, Kendall, and Mecke (1987)). The convenience store example also serves to illustrate the important connection of PMC with Bayesian estimation (e.g., see Keating and Mason (1988b)). Suppose we have a bivariate parameter $\theta = (\theta_1, \theta_2)$ and that $g(\theta)$ is a bivariate prior distribution defined over the parameter space. Using different estimation criteria, such as MSE and MAD, certain estimators $\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_k$ are optimal based on the measures of central tendency of the posterior distribution. These estimators are fixed given the data and the parameter(s) in the prior distribution. In the posterior case, $\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_k$ play the same role as the points representing convenience stores. The regions $V_1, \ldots, V_k$, based on the underlying Mahalanobis metric, partition Pitmanville into $k$ sets in which the posterior loss due to estimator $\hat\theta_i$ is less than the posterior loss due to any other estimator. The calculation of $\Pr(V_i)$ is the value of posterior PMC and the Mahalanobis metric serves as the loss function. Thus, as in Bayesian estimation theory, the individual posterior PMCs can be compared to select the estimator of choice. The concept of posterior Pitman closeness is developed in §5.4. A similar bivariate example involving election data is given by Romer and Rosenthal (1984), where the electorate is jointly distributed on public spending on parks and police. The candidates assume fixed positions, $\mathbf{x}_1, \ldots, \mathbf{x}_k$, on spending on parks and police.
2.2 The concept of risk
The language of decision theory is often used in evaluating different measures of closeness. We want to know how close two estimators are to the true value of a parameter, but we lack a methodology for determining the error we make in choosing each alternative. Since an estimator helps make a decision on the value of the parameter, we term the estimator a decision function. Associated with each decision function is a measure of the error in not estimating the correct parameter value. This error is formally called a loss function. We have seen in the previous subsection that loss functions can take a variety of forms. With the MSE criterion the loss function is based on squared error, while with the PMC criterion the loss function is expressed as a probability function of the absolute error. Selection of an appropriate loss function is neither an arbitrary nor a simple matter. But once chosen the problem is to select the estimator that makes the loss small. Doing this across many different samples of data can be difficult as some errors will be small while others will be large. To remove the dependence of the loss on the sample values, we usually choose estimators that minimize the average loss, the risk function. Choosing estimators with the smallest loss requires us to have an understanding of the concept of risk. This is particularly important in our study of PMC.
2.2.1 Renyi's decomposition of risk
One useful result that serves as an aid in understanding the risk associated with PMC is contained in a theorem given by Renyi (1970).

Theorem 2.2.1 (Theorem of Total Expectation). Let $\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_k$ ($k \ge 2$) be $k$ real-valued estimators of the real parameter $\theta$. Denote the loss of the $i$th estimator by $\mathcal{L}_i$ where $\mathcal{L}_i = \rho(|\hat\theta_i - \theta|)$ for $i = 1, \ldots, k$ and $\rho(y)$ is a strictly increasing and nonnegative-valued function of $y$ for $y > 0$. The mean absolute error using the closeness partitions $\mathcal{L}_1 < \mathcal{L}_2$, $\mathcal{L}_1 = \mathcal{L}_2$, and $\mathcal{L}_1 > \mathcal{L}_2$ is as follows:

$E(\mathcal{L}_1) = \Pr(\mathcal{L}_1 < \mathcal{L}_2)\,E(\mathcal{L}_1 \mid \mathcal{L}_1 < \mathcal{L}_2) + \Pr(\mathcal{L}_1 = \mathcal{L}_2)\,E(\mathcal{L}_1 \mid \mathcal{L}_1 = \mathcal{L}_2) + \Pr(\mathcal{L}_1 > \mathcal{L}_2)\,E(\mathcal{L}_1 \mid \mathcal{L}_1 > \mathcal{L}_2).$
Using the above theorem, Dyer and Keating (1979b) determined a general solution for the decomposition components and introduced PMC as an interpretable element of the total risk, $E(\mathcal{L}_1)$. The probability statements in the first two terms of the preceding equation are the values of $I\!P(\hat\theta_1, \hat\theta_2 \mid \theta)$ and $\Pr(\mathcal{L}_1 = \mathcal{L}_2)$, while the term $E(\mathcal{L}_i \mid \mathcal{L}_j = \min(\mathcal{L}_1, \mathcal{L}_2))$ for $i, j = 1, 2$, is the conditional risk incurred in estimating $\theta$ with $\hat\theta_i$, given that $\hat\theta_j$ is closer to $\theta$. Thus the pairwise Pitman probability has a unique role in the decomposition and evaluation of risk. In this regard Johnson (1950) remarked that

It must be admitted that the closeness criterion does provide information not given by the mean square error. In particular, it is always possible to use the closeness criterion, regardless of the existence or non-existence of the first and second moments of the estimators. Ideally, of course, it would be desirable to use both criteria.
The closeness partition defined by £f < C\ in (2.4) is the same as the closeness partition defined by C\ < £2- Note that while varying the value of q does not alter the Pitman closeness probabilities, it does arbitrarily increase or decrease the associated conditional risks. The reader can verify this with Examples 2.1.2 and 2.1.3. Thus, the Pitman closeness probabilities may be more important than the conditional risks in evaluating (2.4) because they are not changed by the arbitrary choice of the loss function within this class of £g-norm, q > 0. This is another natural property underlying the appeal of PMC which is not generally shared by MSE.
However, the invariance of PMC does not remain true when we include an asymmetric loss function such as the entropy loss defined by

$\mathcal{L}(\hat\theta, \theta) = \frac{\hat\theta}{\theta} - \ln\!\left(\frac{\hat\theta}{\theta}\right) - 1,$
which is a useful loss function in the comparison of estimators of a scale parameter.
2.2.2 How do we understand risk?
Prevalent in our discussions of PMC is the issue of risk (e.g., see Mason (1991)). It is easy enough to define risk as the average loss, but it is not always as simple to express our understanding of it. Consider, for example, the risk involved in gambling. If we were to bet money on a particular outcome, we often think of the risk we incur as a function of the amount we bet as well as the odds of winning. Our major concerns are with the loss we would suffer if we were to lose the bet or the amount we would gain if we were to win the bet. But the true risk is a function of both of these elements and combines the respective probabilities of winning and losing with the amount wagered. Suppose this example is extended to a situation in which two alternative methods of betting are available to us. How do we proceed to compare them? One approach might be to try each method for a fixed number of bets using the same amount of money for the bets. The "better" method would be the one that yielded the smallest total loss, or, equivalently, the smaller average loss. We would say that it had the smaller risk. Pitman, in his 1937 paper, addresses the problem of such pairwise comparisons in discussing the limitations of PMC. He states that

... from the practical point of view, what is best depends upon the use we make of our estimates, and ultimately upon how we pay for our mistakes ...

A key word in this phrase is pay. If we define clearly the use to be made of an estimate and the loss of an incorrect decision, then finding the proper criterion to use in picking the "best" estimator would be easy. An alternative approach would be to apply Renyi's Theorem of Total Expectation. To understand this concept of loss, consider two competing estimators, say $\hat\theta_1$ and $\hat\theta_2$, for an unknown parameter $\theta$. Denote the loss function for the first estimator as $\mathcal{L}_1 = \mathcal{L}(\hat\theta_1, \theta)$ and the loss function for the
second estimator as $\mathcal{L}_2 = \mathcal{L}(\hat\theta_2, \theta)$. The difference in the risks, $\Delta R$, which we incur in using $\hat\theta_2$ instead of $\hat\theta_1$, can be expressed as follows:

$\Delta R = E(\mathcal{L}_2) - E(\mathcal{L}_1) = E(\mathcal{L}_2 - \mathcal{L}_1 \mid \mathcal{L}_1 < \mathcal{L}_2)\Pr(\mathcal{L}_1 < \mathcal{L}_2) + E(\mathcal{L}_2 - \mathcal{L}_1 \mid \mathcal{L}_1 > \mathcal{L}_2)\Pr(\mathcal{L}_1 > \mathcal{L}_2).$
This difference in risks is a function of the average amount we would expect to lose (in using $\hat\theta_2$ when $\hat\theta_1$ is preferred) times the probability of losing, plus the average amount we would expect to win (when $\hat\theta_2$ is preferred) times the probability of winning. In this setting, ties become irrelevant to the choice of estimator since they occur when the two risks are the same. The equation shows that the associated conditional risks are weighted by the closeness probabilities: $\Pr(\mathcal{L}_1 < \mathcal{L}_2)$ and $\Pr(\mathcal{L}_1 > \mathcal{L}_2)$. This result is stated in a more mathematical way in the form of the following characterization of preference based on risk.

Theorem 2.2.3 Let $\hat\theta_1$ and $\hat\theta_2$ be real-valued estimators of the real-valued parameter $\theta$ and let $\mathcal{L}(\hat\theta, \theta)$ be the chosen loss function. Then it follows that
If the conditional risks are the same, then it is the odds of winning or losing that determine which estimator is preferred. Thus the risk is highly sensitive to the value of PMC. These ideas are expressed in the use of the word pay in the preceding quotation by Pitman. In practical situations we do not always have enough information to evaluate the average loss. Risk then is based on our prior beliefs about winning and losing as well as our expectations for loss. When comparing two estimators, PMC is an interpretable component of the associated risk. It expresses our belief that one method will have a smaller loss than the other. When these probabilities are combined with the conditional average losses, we are then able to determine which estimator has the smallest risk.
2.3 Weaknesses in the use of risk
There are inherent weaknesses in relying totally on risk as a criterion in comparing two estimators. One concern is that it may not always be possible to determine the risk. This can occur when the losses associated with incorrect decisions are unknown or cannot be formulated. The risk is also very sensitive to the choice of the loss function. As noted earlier, an arbitrary choice for the loss function can definitely influence the risk but would have no effect on PMC across monotone functions of $|\hat\theta - \theta|$ and a small effect in classes including both entropy and $\ell_q$ loss functions. Another problem, discovered by Rao (1981), is that shrinking the risk of an unbiased estimator does not necessarily produce an estimator which is Pitman-closer than the unbiased one. However, some shrinkage estimators, such as the Stein rule, are Pitman-closer estimators than the classical MLE, even under less restrictive conditions than with quadratic or other risks (see Sen, Kubokawa, and Saleh (1989) or Sengupta and Sen (1991)). Finally, when comparing an estimator to a fixed standard, as in calibration studies, risk can be a poor criterion. The following subsections contain discussions on these various situations. They explore in more detail some of the inherent weaknesses associated with the use of risk and illustrate settings where PMC is a viable alternative. As has been stressed throughout this book, one should not arbitrarily choose risk or PMC as the sole criterion in comparing estimators. Concern must also be given to the use of the estimator as well as the loss in making an incorrect decision.
2.3.1 When MSE does not exist
An interesting aspect of risk is that it may not always exist. By definition the risk function is the expected value of the loss; i.e.,

$R(\hat\theta, \theta) = E\bigl[\mathcal{L}(\hat\theta, \theta)\bigr].$
Unfortunately, as mentioned by Johnson (1950), there are situations where MSE does not exist or cannot be determined for a variety of reasons. This subsection discusses some commonly encountered examples, which are used to demonstrate the value of PMC as an alternative criterion for the comparison of estimators.
Example 2.3.1 (Divergent Estimators). We consider a random sample $X_1, \ldots, X_n$ of size $n$ from a Bernoulli distribution with probability function $f(x; \theta) = \theta^x(1 - \theta)^{1-x}$, $x = 0, 1$, where $\theta$ is the probability that a component is defective. A complete sufficient statistic for estimation of $\theta$ is given by $Y = n\bar{X}$, which has a binomial distribution under the usual assumptions. From a frequentist's perspective, $\theta$ is the frequency with which defective components are produced. Its reciprocal $\tau = 1/\theta$ is the period or expected number of produced components between defects. From the invariance property of MLEs, $\hat\tau_1 = n/Y = 1/\bar{X}$ is the MLE of $\tau$. However, we can see that $\hat\tau_1$ is undefined with positive probability $(1 - \theta)^n$. MLEs have the convenient property that if $\hat\theta_L$ is the MLE of $\theta$, then $g(\hat\theta_L)$ is the MLE of $g(\theta)$. However, this can be a great weakness in that the MLE may be an inferior estimator for many $\theta$'s, even if $\hat\theta_L$ happens to be a good estimator for $\theta$. Further consideration of the MLE would be fruitless due to its singularity at $Y = n\bar{X} = 0$. This is clearly the Achilles' heel of the logistic regression problem posed by Berkson (1980). He suggests the ad hoc procedure of adding $1/(2n)$ to $\bar{X}$ whenever $\bar{X} = 0$. Consider the estimator $\hat\tau_2$ of $\tau$ motivated by a 50% confidence interval on $\theta$,
where $b_{.50}(\alpha, \beta)$ is the median of a beta random variable with parameters $\alpha$ and $\beta$. The motivation for this estimator stems from the relationship between cumulative binomial probabilities and values of the beta distribution function. Since the denominator never vanishes, $\hat\tau_2(\bar{X})$ is well defined over the sample space. Moreover
For example, if 1,000 components are inspected without defect then
which is a reasonable estimate of $\tau$ when $\bar{X} = 0$. Even the ad hoc Berkson estimator produces an estimate of $\tau$ of 2,000. MSE can be calculated for
$\hat\tau_2(\bar{X})$ whereas it cannot be for $\hat\tau_1(\bar{X})$. The estimator $\hat\tau_2(\bar{X})$ arises out of a Bayesian development which is discussed in §5.4. MSE may fail to exist due to problems inherent to the underlying distribution such as the lack of moments of any order. To illustrate this problem consider the example of the Cauchy distribution.

Example 2.3.2 (The Cauchy Distribution). Suppose $X_1$ and $X_2$ are a random sample of size two from the Cauchy probability density given by

$f(x; \theta) = \frac{1}{\pi\bigl[1 + (x - \theta)^2\bigr]}, \qquad -\infty < x < \infty,$
where $\theta$ is an unknown parameter. Although the above density function is symmetric about the parameter $\theta$, its mean and higher-order moments do not exist. Consider the following two estimators of $\theta$:

$\hat\theta_1 = X_1 \quad \text{and} \quad \hat\theta_2 = \frac{X_1 + X_2}{2}.$
Blyth and Pathak (1985) show that $\hat\theta_2$ is Pitman-closer than $\hat\theta_1$ to $\theta$ in spite of their marginal distributions being identical. While moments of any positive order diverge, both estimators underestimate $\theta$ with probability 0.50. If we want to determine which of these estimators is closer on the average to the unknown parameter and use MSE as the criterion, we cannot solve the problem. This results from the fact that the expectations of $X_1$ and $X_2$ do not exist. Thus we must use some alternative estimation criterion. If PMC is selected, we need to evaluate the probability $\Pr(|\hat\theta_1 - \theta| < |\hat\theta_2 - \theta|)$. Consider the following transformation which parallels the argument in Blyth (1986). Let $\alpha = \arctan(X_1 - \theta)$ and $\beta = \arctan(X_2 - \theta)$. Thus, it follows that

$\hat\theta_1 - \theta = \tan\alpha \quad \text{and} \quad \hat\theta_2 - \theta = \frac{\tan\alpha + \tan\beta}{2}.$
The random vector $(\alpha, \beta)$ is uniformly distributed over a square with side length $\pi$ and centered at the origin. With this transformation, PMC is given by

$I\!P(\hat\theta_1, \hat\theta_2 \mid \theta) = \Pr\bigl(2|\tan\alpha| < |\tan\alpha + \tan\beta|\bigr).$
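This probability can also be approximated directly by simulating Cauchy pairs, which sidesteps the numerical integration; the code below is a minimal sketch with arbitrary seed and replication choices.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, reps = 0.0, 1_000_000
x1 = theta + rng.standard_cauchy(reps)
x2 = theta + rng.standard_cauchy(reps)
t1 = x1                      # a single observation
t2 = 0.5 * (x1 + x2)         # the mean of the two observations
print(np.mean(np.abs(t1 - theta) < np.abs(t2 - theta)))   # comes out below 0.5
```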
Figure 2.3: Preference region for $X_1$ in the Cauchy distribution.
An illustration of this region of preference in the $\alpha$-$\beta$ plane is shown in the shaded part of Figure 2.3. Using numerical integration and the rectangular symmetry of $f(\alpha, \beta)$ we find that PMC reduces to
The probability is less than .50, indicating that $\hat\theta_2$, the sample mean, is Pitman-closer to $\theta$ than $\hat\theta_1$, the individual sample value. Thus, although we were unable to compare the two estimators with the MSE criterion, we were able to discriminate between them using the PMC criterion.

Example 2.3.3 (The Calibration Problem). A calibration experiment is one in which a response variable $Y$ is measured as a function of a known set of predictor-variable values $X$, at least two of which must be distinct. A calibration line is fitted to the data and then used to estimate unknown
predictor-variable values from corresponding measured values of the response variable. For example, consider the problem of calibrating a measuring instrument such as a gas pressure gauge. To calibrate the gauge, measurements are taken at several known gas pressures. Some technique, such as least squares regression, is used to fit a calibration line to the gauge measurements as a function of the known gas pressures. In the future, when the gauge is used to take measurements at unknown gas pressures, the calibration line can be employed to estimate these pressures. Estimating an unknown X, associated with a known Y value, can be achieved by classical regression methods. Suppose the responses are obtained from the model
$Y_i = \alpha + \beta X_i + \varepsilon_i,$

and the unknown parameters $\alpha$ and $\beta$ are estimated by least squares as $\hat\alpha$ and $\hat\beta$. Given a known value of $Y$, say $y$, the estimate of $X$ is given by

$\hat{X}_1 = \frac{y - \hat\alpha}{\hat\beta}.$
This estimator is supported by the fact that it is also the maximum likelihood estimator when the errors, $\varepsilon_i$, are normally distributed. Alternatively, one could use an inverse regression procedure to estimate $X$. In this approach instead of regressing $Y$ on $X$ we do the "inverse" and regress $X$ on $Y$. The resultant prediction equation yields the estimate

$\hat{X}_2 = \hat\alpha^* + \hat\beta^* y$
for the known value $y$, where $\hat\alpha^*$ and $\hat\beta^*$ are the least squares estimates of the unknown parameters $\alpha^*$ and $\beta^*$ in the model

$X_i = \alpha^* + \beta^* Y_i + \varepsilon_i^*.$
There has been much debate in the statistical literature over which of these two estimators is superior in estimating the unknown $X$ value (e.g., see Krutchkoff (1971); Halperin (1970); Vecchia, Iyer, and Chapman (1989)). Much of the controversy stems from the fact that the MSE of $\hat{X}_1$ is infinite so that a random drawing from any distribution with a finite variance would produce a better estimate than this one under the MSE criterion. In short, MSE is an inappropriate criterion in this situation since it does not exist for one of the two estimators.
PMC has been introduced as an effective alternative criterion to solve this problem (Halperin (1970); Krutchkoff (1971); Keating and Mason (1991)). The relative closeness of X̂₁ and X̂₂ to the unknown X value can be found by evaluating the probability

Pr(|X̂₁ − X| < |X̂₂ − X|).
This quantity cannot be evaluated directly because of the complicated integration that is required, but it can be approximated by considering the appropriate asymptotic distributions and using the fact that the variables are normally distributed. When this is done, it can be shown that neither estimate dominates the other, although the classical estimator appears to be far superior in practice. This example, like the previous one involving the Cauchy distribution, demonstrates a situation where the MSE criterion is ineffective due to its lack of existence. In these settings alternative criteria such as PMC have a meaningful role to play in the comparison of competing estimators. However, even when MSE does exist, the PMC criterion can be very helpful. The advantage in using PMC is that, while it is an interpretable component of the associated risk, its existence is independent of the arbitrary choice of the loss function from the class of monotone functions of |θ̂ − θ|. The entropy loss function provides an example showing that we may be unable to extend this result to a much larger class of loss functions. Computing probabilities can pose as many problems as are encountered in evaluating risk functions, but at least PMC is unchanged by our choice of loss function.
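One way to get a feel for this comparison is a direct simulation of the calibration experiment. The sketch below is an added illustration only; the design points, error variance, and true value x0 are invented for the purpose and are not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# assumed illustrative values (not from the text)
alpha, beta_, sigma = 2.0, 1.5, 0.5
x_design = np.array([0., 1., 2., 3., 4.])   # known predictor values
x0 = 2.5                                     # unknown X to be estimated

def one_trial():
    y = alpha + beta_ * x_design + rng.normal(0, sigma, x_design.size)
    b1, b0 = np.polyfit(x_design, y, 1)      # regress Y on X (classical fit)
    d1, d0 = np.polyfit(y, x_design, 1)      # regress X on Y (inverse fit)
    y0 = alpha + beta_ * x0 + rng.normal(0, sigma)  # new response at x0
    x_hat_classical = (y0 - b0) / b1
    x_hat_inverse = d0 + d1 * y0
    return abs(x_hat_classical - x0) < abs(x_hat_inverse - x0)

trials = 20_000
pmc = np.mean([one_trial() for _ in range(trials)])
print(pmc)   # estimated Pr(classical estimator closer) for this design
```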
2.3.2 Sensitivity to the choice of the loss function
The choice of the loss function determines the risk. A squared error loss function such as

L(θ̂; θ) = (θ̂ − θ)²

has the feature that estimates within a unit of the true parameter θ produce smaller losses than those obtained under absolute error loss. However, the squared error loss function may give undue influence to large errors, while an absolute error loss function such as

L(θ̂; θ) = |θ̂ − θ|

does not down-weight small errors. With an arbitrary loss function the risk can be highly sensitive to the selection.
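A small numerical illustration, added here with invented error distributions, makes the sensitivity concrete: the ranking of two estimators can reverse when squared error loss is replaced by absolute error loss.

```python
import numpy as np

# estimator A: error 0 with prob .96, error +/-10 with prob .02 each
errs_a = np.array([0.0, 10.0, -10.0])
probs_a = np.array([0.96, 0.02, 0.02])
# estimator B: error uniform on (-1, 1)

mse_a = np.sum(probs_a * errs_a**2)            # = 4.0
mad_a = np.sum(probs_a * np.abs(errs_a))       # = 0.4
mse_b = 1.0 / 3.0                              # E[U^2], U ~ Uniform(-1, 1)
mad_b = 0.5                                    # E|U|

print(mse_a, mse_b)   # squared error loss prefers B
print(mad_a, mad_b)   # absolute error loss prefers A
```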
This apparent sensitivity is one that has encouraged many researchers to look to other criteria in evaluating estimators. The convenience store example and the electorate example of Romer and Rosenthal (1984) are practical examples where PMC is more relevant than a squared error loss function. Halperin's inverse regression problem provided an example in which PMC is a useful alternative for a situation where the risk associated with a squared error loss function cannot be evaluated. In each of these situations the risk was highly sensitive to the choice of the loss function. The selection of an appropriate loss function has been addressed by Laplace, Gauss, Pitman, Karlin, and Rao. The problems associated with inappropriate choices are documented throughout Chapters 2 and 3 of this book. We raise the need for estimators that are insensitive, that is, robust, to such choices. The importance of this issue has been cited by Ghosh, Keating, and Sen (1993):

Robustness of estimators with respect to the loss function is an issue of utmost concern to the decision theorist. With a few exceptions such as Hwang (1985), (1988) or Brown and Hwang (1989), this issue is usually inadequately addressed in the decision theory literature and one can safely assert that this robustness issue is yet to be satisfactorily resolved. In this regard, an appealing feature of the Pitman closeness criterion is its invariance with regard to choice of loss functions from the ℒ and larger classes.

The last sentence is made in light of Theorem 2.2.2, and the conditions for the larger class are specified by Keating (1991). The previous discussions raised the issue of the choice of a loss function. Indiscriminate choices may well prove to be disastrous. Many recent advances in estimation theory have centered on the concept of robustness of estimation techniques to departures of observed values from the underlying distribution. A thorough discussion of robust techniques is given in Hampel et al. (1986). Their focus is based on influence functions, which provide formalizations of the bias created by an outlier. Among the popular robust procedures are those based on Huber's influence function

ψ_c(x) = x for |x| ≤ c,   ψ_c(x) = c · sign(x) for |x| > c,

where c is labeled the breakdown point. A graph of ψ_c(x) is given in Figure 2.4. Note that at x = ±c, this function changes from a linear influence to
a constant influence. If we let L′(x) = ψ_c(x), the corresponding loss function uses quadratic loss in the interval [−c, c] and an absolute loss function elsewhere.

Figure 2.4: The Huber influence function with breakdown point c.

The appeal of the loss function generated by the Huber influence function is that it incorporates the desirable features of absolute and squared error loss. However, controversy still centers on the selection of the breakdown point c. It should be noted that, due to Nayak's (1991) observation, Pitman's measure of closeness is invariant under any loss function defined by H(|x − θ|), where H(0) = 0 and H(x) is strictly increasing for x > 0. Thus, the result of invariance in Theorem 2.2.2 applies to loss functions constructed out of such breakdown functions. A very popular influence function is the one associated with redescending M-estimators, which we will discuss in Chapter 6. Consider the following "normal" loss function given by

L_N(θ̂; θ) = 1 − exp{−(θ̂ − θ)²/2}.
This loss function is graphed in Figure 2.5 and its complement has the same shape as a normal density function. All losses are bounded below by zero and above by one. In the sense of robust statistics, it has the desired
Figure 2.5: The "normal" loss function.

property of down-weighting extreme observations. A convenient feature of this loss is that if one defines ψ_N(x) = L′_N(x), the influence function does not descend too rapidly. The influence function defined by ψ_N(x) is given in Figure 2.6 and is very similar to that of the tanh-estimator under the normal distribution. Consider now the influence function ψ* associated with a median-type hyperbolic tangent estimator.
Its constants c and β_c can be determined from Hampel et al. (1986). This influence function is depicted in Figure 2.7 and differs from the one in Figure 2.6 in that it goes to zero at ±c, and in the neighborhood of x = 0, ψ* has a jump discontinuity. However, ψ_N(x) does not descend to zero for all x such that |x| > c, as does ψ*(x). It also has the desirable feature that the influence function is nearly linear in a large neighborhood of the origin. Moreover, L_N is a loss of the form H(|x|) such that H(0) = 0 and H(x) is strictly increasing for x > 0. Hence the results of Theorem 2.2.2 apply, and a Pitman-closest estimator derived under absolute loss will also be Pitman-closest under this "normal" loss. Furthermore, the optimal estimator under an influence function is frequently related to medians, which is also the case for Pitman-closest estimators (see §5.3).
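The pieces discussed above are simple to write down explicitly. The sketch below is an added illustration implementing the Huber influence function ψ_c, the loss it generates, and the bounded "normal" loss as written above; the tuning constant c = 1.345 is a conventional choice, not one fixed by the text.

```python
import numpy as np

def huber_psi(x, c=1.345):
    """Huber influence function: linear on [-c, c], constant beyond."""
    return np.clip(x, -c, c)

def huber_loss(x, c=1.345):
    """Loss whose derivative is huber_psi: quadratic inside, absolute outside."""
    x = np.abs(x)
    return np.where(x <= c, 0.5 * x**2, c * (x - 0.5 * c))

def normal_loss(x):
    """Bounded 'normal' loss: its complement has the shape of a normal density."""
    return 1.0 - np.exp(-0.5 * x**2)

def normal_psi(x):
    """Derivative of normal_loss: a smoothly redescending influence function."""
    return x * np.exp(-0.5 * x**2)

x = np.linspace(-4, 4, 9)
print(huber_psi(x))
print(normal_loss(x))
```

All of these losses are increasing functions of the absolute error, so a Pitman closeness comparison is the same under any of them.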
Figure 2.6: The influence function of the "normal" loss.
Figure 2.7: The influence function for a median-type tanh-estimator.
2.3.3 The golden standard

The concept of risk, as defined in (2.6), assesses penalties for departures from the truth (i.e., the golden standard). Certainly it is within the domain of the statistician to question by what authority one has chosen, for example, quadratic risk. Some would contend that the user with a real problem knows the potential loss. If one does not know the loss function, should one universally apply a squared error loss function? It is clear that one can artificially create large departures from the truth by an imprudent choice of loss. In the evolution of science, competing theories are compared on how well they explain observed phenomena. For example, for most of recorded history it was commonly believed that the earth was flat. This belief was held despite the brilliant experimental work of Eratosthenes (c. 230 B.C.) associated with the summer solstice at Syene and Alexandria. This same kind of "common sense" was used via the MacLaurin series to support the motivation of MSE. In layman's terms, the world is "locally flat" or "locally Euclidean." The great truth that Riemannian geometry models the geometry on the surface of a sphere has done little to diminish the value of the Euclidean approximation, but it should make us more prudent in applying it. Another example of competing theories is seen in the study of gravitational attraction, which initially was well described by Sir Isaac Newton. However, with the advent of Albert Einstein's special theory of relativity, it became clear that Newtonian laws were simply inadequate to explain the behavior of relativistic particles. This superior explanation of Einstein's does not diminish the fact that much of human endeavor is well explained by a world which is "locally Newtonian." It seems consistent with the history and philosophy of science that Einstein's theory of relativity is a better approximation to an "ideal" world that we do not fully understand. In using Pitman's measure as a criterion of estimation we also compare two competing theories (in the form of estimators) with a true ideal (the unknown parameter). The choice of θ̂₁ over θ̂₂ in estimating θ based on PMC allows us to quantify "how frequently" we have chosen better. In any given sample the statistician does not know whether his choice was closer, but he does know how frequently he can expect to choose the better of the two theories. The simple philosophy of Pitman's measure was coined centuries ago (see Rao (1989)) in a discourse by Descartes: "...when it is not in our power to know what is true, we ought to do what is more probable." Descartes' idealized view must be considered in light of the fact that comparisons based
on PMC may be mixed over the parameter space. Our probabilities depend upon an unknown parameter, and what is more probable may depend on this parameter, which we do not know.
2.4 Joint versus marginal information
In comparing estimators using the MSE or PMC criterion, an issue that often arises concerns the use of marginal versus joint information. PMC depends on the joint distribution of the competing estimators while MSE is based on their separate marginal distributions. Which is better depends on the particular situation under study. Several different authors have addressed this topic and have provided useful examples of the value of both types of information. Savage began the debate in his 1954 book on the foundations of statistics. After reviewing Pitman's (1937) original paper on PMC he concluded that
On any ordinary interpretation of estimation known to me, it can be argued (as it was in Criterion 3) that no criterion need depend on more than the separate distributions.

A careful reading of Criterion 3, stochastic domination, reveals that Savage states that if the concentration functions of two estimators are identical, then there is nothing to choose between them since they are marginally identical. Savage does not give any further argument in support of this statement. However, Blyth (1972), with accompanying discussants, and Blyth and Pathak (1985) offer a series of examples where knowledge of the joint distribution can be more helpful. These examples are essential in establishing the role of PMC in estimation theory. Robert, Hwang, and Strawderman (1993) criticize the Pitman closeness criterion because its value depends on the joint distribution of the two competing estimators. The basic reason given is that only one estimator is used in practice. They note that the Pitman closeness criterion depends upon the value of the correlation between the estimators. An informative example in this regard is given by Peddada and Khattree (1986), who compare two unbiased but correlated estimators of the common mean of two normal distributions. Sarkar (1991) provides a generalization of their comparison to several populations. In the asymptotic setup of Chapter 6, we show that the MSE and PMC criteria depend on a common correlation coefficient which has its origin in the Fisher efficiency.
Example 2.4.1 UMVU estimation is certainly one of the key procedures in estimation theory and its origins in decision theory are well accepted. Consider two (mean) unbiased estimators θ̂₁ and θ̂₂ of a common parameter θ such that both have finite second moments, and θ̂₂ is the UMVUE of θ attaining the Frechet-Cramer-Rao information bound. Then the mean squared error relative efficiency of θ̂₁ to θ̂₂ is

eff(θ̂₁, θ̂₂) = Var(θ̂₂)/Var(θ̂₁) = ρ²,
where ρ is the correlation coefficient between θ̂₁ and θ̂₂. Whenever the family of distributions is complete, we have that Var(θ̂₂) = Cov(θ̂₁, θ̂₂) by the Lehmann-Scheffe Theorem, since θ̂₂, the UMVUE, is uncorrelated with all unbiased estimators of zero, in particular θ̂₁ − θ̂₂. When the family of distributions is not complete, Var(θ̂₂) = Cov(θ̂₁, θ̂₂) still holds, since Var[αθ̂₁ + (1 − α)θ̂₂] ≥ Var(θ̂₂) for any α ∈ [0, 1] and

Var[αθ̂₁ + (1 − α)θ̂₂] = α²Var(θ̂₁) + 2α(1 − α)Cov(θ̂₁, θ̂₂) + (1 − α)²Var(θ̂₂).
Since any convex linear combination of θ̂₁ and θ̂₂ is unbiased, its variance cannot be smaller than that of the UMVUE. Thus, even in the classical context of unbiased estimation, the correlation coefficient provides a quantitative value for the mean squared error relative efficiency of an unbiased estimator to the UMVUE! This connection between an element of the class of unbiased estimators and the UMVUE, pointed out by Fisher (1938), is seldom accentuated in texts on statistical inference, although it plays a natural role in finding the Fisher efficiency. In the absence of a complete sufficient statistic, the joint distribution of two estimators may contain more information than either of the marginals. The Pitman closeness criterion extracts such additional information in a natural way. In Example 2.3.2 on the Cauchy distribution, θ̂ = (θ̂₁, θ̂₂) forms a sufficient statistic for the estimation of θ. The Pitman closeness criterion, based on the joint distribution of θ̂₁ and θ̂₂, contains this sufficient statistic through a linear transformation and should intuitively be more informative. This example of the Cauchy distribution has been widely discussed by Blyth (1972), Mood et al. (1974), Blyth and Pathak (1985), and Robert et al. (1993). It has been proposed as an unintuitive aspect of the Pitman closeness criterion. We interpret this result very positively, since larger
samples provide sample averages that are more frequently closer to the parameter of interest, which is indeed a very reasonable conclusion that cannot be obtained using conventional procedures. The sample mean, considered as a convex linear combination of independent Cauchy random variables, has a Cauchy distribution with median θ and a unit scale parameter (see Kagan, Linnik, and Rao (1973)). Consequently, this phenomenon persists for increasing sample sizes in the Cauchy distribution and is not simply a special result for samples of size 2. Decision theorists could reasonably contend that absolute or quadratic loss is inappropriate for such distributions as the Cauchy. We illustrate loss functions for which the risk would exist even in the Cauchy distribution. Consider the "normal" loss function defined previously by

L_N(θ̂; θ) = 1 − exp{−(θ̂ − θ)²/2},
which provides a bounded loss in the interval [0, 1). Since θ̂₁ and θ̂₂ have the same marginal distributions, if their risks exist they will be identical in value. In this case, both risks do exist.
Although the risks exist in this case, they produce the same value for both candidate estimators, and thus we would conclude that the two estimators are equivalent. Again, we reiterate that the Pitman closeness criterion produces the intuitive result that θ̂₂ is Pitman-closer than θ̂₁. The following subsections contain discussions of situations where joint information is more helpful than marginal information. Initially the concept of comparing estimators to the true value of the parameter being estimated (as used in the MSE criterion) is discussed. It is then shown that, by determining which of the estimators is closer in a relativistic sense (as in the PMC criterion), more useful results frequently can be attained.
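The claim that the two Cauchy-based estimators have identical, finite risk under the bounded "normal" loss, yet are separated by PMC, can be verified numerically; the sketch below is our own check, with the loss written as above.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n_rep = 0.0, 1_000_000

x1 = theta + rng.standard_cauchy(n_rep)
x2 = theta + rng.standard_cauchy(n_rep)
t1, t2 = x1, (x1 + x2) / 2.0          # single observation vs. mean of two

normal_loss = lambda e: 1.0 - np.exp(-0.5 * e**2)

risk1 = np.mean(normal_loss(t1 - theta))   # finite, bounded by 1
risk2 = np.mean(normal_loss(t2 - theta))   # essentially equal to risk1
pmc21 = np.mean(np.abs(t2 - theta) < np.abs(t1 - theta))  # exceeds .50

print(risk1, risk2, pmc21)
```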
2.4.1 Comparing estimators with an absolute ideal
Use of the MSE criterion in comparing estimators is based on examining only the marginal information available on each estimator. This is the common procedure in statistical estimation theory. There exists a true but unknown value of the parameter. We believe in the existence of this ideal value but we do not know it and cannot determine it with probability one. Each available
estimator is to be compared with this ideal using MSE as the criterion. The estimators thus stand alone and we are forced to use only their separate distributions. The popular procedures of statistical inference, such as maximum likelihood estimation, minimum variance unbiased estimation, and Bayes estimation (using noninformative priors), produce estimators that are optimal within a class of decision rules under a specific criterion. Thus with respect to a chosen criterion, a best estimator (i.e., an MLE, a UMVUE, or a Bayes estimator) of the ideal value exists. In the previous section, we discussed examples in which one or all of these optimal decision rules may not exist. In such estimation problems, in the absence of optimum estimators, we are left with only the ability to make relative comparisons in pairs. Pitman's closeness criterion provides an alternative technique for comparing available estimators of the true but unknown value of the parameter. Even in estimation problems where all the optimal procedures exist, the question surfaces as to how these different procedures should be compared. Each technique has its own set of favorable properties, but if there is disagreement over the criteria then the Pitman closeness criterion offers an impartial method for comparing the optimal decision rules. Some authors (e.g., Rao (1980) and Severini (1991)), when comparing optimum estimators derived from loss functions with those that are not (such as the maximum likelihood estimator), have advocated the use of the Pitman closeness criterion as an impartial procedure for the pairwise comparison. Keating (1985) also advocated the use of Pitman's closeness criterion in comparing the best estimators derived under different loss functions, such as absolute and quadratic loss. This point accentuates our view that the sensitivity of the optimal estimator in the decision theoretic framework is highly conditioned on the presumed loss function. It is simple to overlook the use of other loss functions because quadratic loss is employed so pervasively throughout the discipline. Further, in estimation theory, when the class of estimators is unrestricted, optimum estimators with respect to traditional but different criteria fail to exist (i.e., we are left with at best a set of admissible rules). It therefore seems judicious to use some impartial criterion such as the Pitman closeness criterion to compare estimators. Contrast this situation with the setting where we use PMC as the criterion. Here we take a relativistic approach in that our interest is in which estimator is more frequently closer to the ideal value. We do not really care about the size of the error as much as we do about the estimator that has
the smaller error. For example, in comparing the times of runners at a track meet, we are not so concerned with how fast each preliminary winner runs a particular race as we are with which runner wins the final race. It would be helpful to have the marginal results of each runner for individual races, but we are mainly concerned with which one wins in head-to-head competition. This comparison requires knowledge of the joint distribution of the two race times, which in general need not be independent.
2.4.2 Comparing estimators with one another
Consider the following example to demonstrate the usefulness of joint versus marginal information in the comparison of one estimator to another. Consider two mechanical instruments that drill holes on some metallic strip. It is desired that the holes be drilled at the location 0 on a measurement scale. Let us assume that the two instruments are subject to error and hence do not always drill at the location of interest. Suppose the joint distribution of the locations of the holes drilled by the two instruments is given as follows:

              θ̂₂ = .001    θ̂₂ = 1.000
  θ̂₁ = 0        .495          .495
  θ̂₁ = 10       .005          .005
In the context of estimation, θ̂₁ and θ̂₂ are two competing estimators of the known parameter θ = 0. We want to choose the more appropriate estimator (the drilling instrument). Note that

Pr(|θ̂₁ − 0| < |θ̂₂ − 0|) = .495 + .495 = .99,
so that θ̂₁ is superior to θ̂₂ in terms of PMC. Consider now the marginal distributions of these two estimators. These are as follows:

  θ̂₁:         0     10
  Pr(θ̂₁):   .99    .01

  θ̂₂:       .001    1.0
  Pr(θ̂₂):    .5     .5
Using these distributions, the MSEs of the two estimators are

MSE(θ̂₁) = (0 − 0)²(.99) + (10 − 0)²(.01) = 1.0

and

MSE(θ̂₂) = (.001 − 0)²(.5) + (1.0 − 0)²(.5) ≈ .50.
Hence, the estimator θ̂₂ is preferable to θ̂₁ since it has the smaller MSE. We are confronted with the conflicting results that θ̂₁ is preferable using the PMC criterion and θ̂₂ is superior under the MSE criterion. This gives rise to the question as to which estimator is of more practical value. Observe, from the marginal distributions, that θ̂₁ drills at the correct location 99% of the time and at the wrong location 1% of the time. In contrast, θ̂₂ never drills at the location of interest although, in terms of MSE, it is closer to the true parameter value than θ̂₁. Based on these results, our choice should be clear: select the first instrument, as it is better at drilling at the preferred location. The above examples demonstrate that a user of PMC can defend the value of joint information in many types of estimation problems. We will discuss this point in more detail in Chapter 5, where we will introduce the appropriate roles of ancillarity, equivariance, sufficiency, etc. They also add an important link in the historical development of PMC, as they relate some of the concerns that needed to be addressed to make it a viable alternative. Fortunately, researchers have been able to provide opposing views to procedures based only on examining marginal information. These arguments raised other issues of concern that in turn opened new paths of development.
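The drilling example involves only a few lines of arithmetic; the sketch below simply recomputes the PMC and the two MSEs from the joint table given above.

```python
import numpy as np

# joint distribution of (theta1_hat, theta2_hat); the target location is 0
vals1 = np.array([0.0, 0.0, 10.0, 10.0])
vals2 = np.array([0.001, 1.0, 0.001, 1.0])
prob = np.array([0.495, 0.495, 0.005, 0.005])

pmc = np.sum(prob * (np.abs(vals1) < np.abs(vals2)))          # 0.99
mse1 = np.sum(prob * vals1**2)                                 # 1.0
mse2 = np.sum(prob * vals2**2)                                 # about 0.5

print(pmc, mse1, mse2)   # PMC favors the first instrument, MSE the second
```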
2.5 Concordance of PMC with MSE and MAD
In the preceding sections of this chapter, we presented examples in which PMC may be a better criterion than the traditional criterion of mean squared error for the comparison of competing estimators. Many of these examples were nonregular cases that pose difficulties for most criteria. In many regular cases, there is considerable agreement of PMC with MSE on the better estimator in a pairwise comparison. To present a more balanced perspective of the interplay of these criteria, we discuss here, in very general terms, the
relative behavior of PMC, MSE, and MAD. We refer the reader to examples in subsequent chapters for illustration of these observations. In the presentation of a more balanced perspective of the general agreement between PMC and MSE in regular cases, certain questions arise naturally. For example, given cases where MSE performs well, what can we say about the Pitman closeness criterion, and vice versa? At this stage it is not possible to provide a complete answer to such a broad question. Nevertheless, we provide some guidelines in this section to motivate the more technical deliberations which follow in Chapters 4-6. Throughout Chapter 4 we provide many examples in which PMC has an influence on the comparison of two estimators based on MSE and MAD. In estimator comparisons that are mixed over the parameter space Ω, we can often determine the point in the parameter space where MSE or MAD change preference by noting dramatic shifts in the numerical values of the Pitman closeness probabilities. This observation is illustrated in Example 4.5.2 on the proportion defective in the normal distribution. This example illustrates the effect that PMC can have on risk-based comparisons. However, it does not demonstrate complete agreement between the criteria. In §4.2.2, we engage the problem as to whether an unbiased estimator with known variance is Pitman-closer than another unbiased estimator with a larger variance. If the estimators have a bivariate normal distribution, then we prove that PMC and MSE are equivalent. Khattree and Peddada (1987) extend this concordance result to estimators having elliptically symmetric distributions. Through these results we lay the foundation for the asymptotic comparison of estimators. A preliminary example of an asymptotic PMC comparison comes from the comparison of the sample mean and sample median of random samples from a normal distribution in Example 4.6.8. The theoretical asymptotic results are thoroughly studied throughout Chapter 6. In §§6.1 and 6.2, the theory of BAN estimators is shown to support the concordance of PMC and MSE within a broad class of estimators having asymptotically normal (AN) distributions. A direct consequence of these asymptotically equivalent criteria is that PMC is asymptotically transitive. Thus, PMC can asymptotically share the same standing as the MSE criterion. Moreover, PMC's flexibility in small sample sizes may often make it a more appealing criterion than MSE or MAD. In §§5.1, 5.2, and 5.3, we construct small-sample Pitman-closest estimators via the properties of equivariance and ancillarity. Under some reasonable conditions, these Pitman-closest estimators coincide with the median
unbiased estimator having the smallest MAD within the equivariant class. Once again, PMC is transitive when one compares estimators from an equivariant class. It is important to observe that for Pitman's measure of closeness, median unbiasedness replaces unbiasedness from the MSE perspective, and the properties of equivariance and ancillarity replace sufficiency and completeness as prerequisites. In §5.4 we present a Bayesian interpretation of Pitman's measure known as posterior Pitman closeness. We mentioned this concept at the end of the convenience store example in §2.1.3. Posterior Pitman-closest estimators are frequently Pitman estimators obtained under absolute error loss. We show that posterior Pitman closeness is transitive and that the posterior Pitman-closest estimator is the one which minimizes posterior MAD. In §6.4, we unify the convergent equivalent nature of the MLE, the Pitman estimator, and the Bayes estimator under mild regularity conditions. These unifying results manifest a remarkable concordance of PMC with MSE. This important but little-known result is established under the suitable regularity conditions needed for BAN estimators. However, the regularity conditions needed for PMC are less restrictive than those generally assumed.
Chapter 3
Anomalies with PMC

We have seen in the last chapter that there are many different reasons for preferring PMC as a criterion in parameter estimation. These arguments should help solidify our understanding of this procedure and how it relates to the concept of risk. It is now helpful to turn our attention to certain operational aspects of PMC and when it is appropriate to adopt it as a criterion. Most of these issues center on probability properties and apply to many facets of our daily lives. Understanding them will help us recognize estimation problems where PMC is a useful alternative to other estimation criteria and when it is not. For example, in many types of consumer preference tests, it is common to ask each sampled individual to make pairwise choices among several brands, say A, B, and C. One could then determine the proportion of consumers who preferred A to B, B to C, and C to A. If all three proportions exceed 50%, it is difficult to choose the best alternative as any one appears to be preferred over exactly one of the other two brands. Yet such an event is entirely possible. The above example illustrates a key criticism associated with PMC: the fact that it lacks the transitive property. By this we mean that Pr(X < Y) > .50 and Pr(Y < Z) > .50 do not guarantee that Pr(X < Z) > .50. Blyth (1991), in a discussion of Pitman's original paper on PMC, counters these criticisms with arguments that transitiveness may not describe reality, particularly when making social preferences. We live in an intransitive world in which there are many situations, such as athletic competitions, political races, or chess matches, where the transitive property is irrelevant. The aspect of intransitiveness is explored in this chapter and several examples of practical settings in which it occurs are described.
A discussion is also given of the probability paradoxes in choice that may occur in using a pairwise comparison procedure. Some of these are similar to problems of the comparison of multiple treatment means in a one-way ANOVA. In using PMC it may be possible to find an estimator that is worst in pairwise comparisons but best in a simultaneous comparison. Likewise it is possible to have a pairwise-best estimator that is simultaneous-worst. Such paradoxes are given careful study to aid in understanding PMC. The problems of intransitiveness and paradoxes among choices can be partially resolved by the manner in which probability is assigned to ties between two estimators. A useful resolution for this assignment is discussed and leads to judicious treatment of adaptive estimators under PMC. The chapter ends with a discussion of the Rao-Berkson controversy concerning the comparison of minimum chi-square estimators with maximum likelihood estimators which are equal with positive probability.
3.1 Living in an intransitive world
Many daily decisions are made which produce intransitive preferences among the available choices. For example, in the long history of the Southwest Athletic Conference (SWC) only the season of 1959 produced a three-way tie for the conference football championship. In that season, three teams, Texas Christian University (TCU), the University of Arkansas, and the University of Texas, finished conference play with identical won-lost records of 5-1. Moreover, in head-to-head competition, TCU defeated Texas 14-9, Texas defeated Arkansas 13-12, and Arkansas defeated TCU 3-0. This outcome produced the type of intransitiveness that Savage (1954) so well characterized as the feeling of being caught in a contradiction. The final scores of that season are given in Table 3.1 for each tri-champion. Since final scores are available we can compute a total-point differential to select a champion. In head-to-head competition, TCU had a +2 point differential, Arkansas had a +2 point differential, and Texas had a —4 point differential. Unfortunately, this comparison still leaves TCU and Arkansas tied for the championship. Based on all the conference opponents, TCU had a +70 point differential, Arkansas had a +30 point differential, and Texas had a +43 point differential. This comparison would declare TCU the conference champion. However, if a minimax criterion is used then Arkansas would be the champion, since its lone defeat was by a single point. It appears that a partisan fan could probably find an established statistical procedure to support the right of each of these three teams to the championship. To those
Table 3.1: Final scores of the SWC tri-champions.

  Opponent     TCU      Arkansas    Texas
  TCU          —        3-0         9-14
  Arkansas     0-3      —           13-12
  Texas        14-9     12-13       —
  A&M          14-8     12-7        20-17
  Baylor       14-0     23-7        13-12
  Rice         35-6     14-10       28-6
  SMU          19-0     17-14       21-0

Taken from "The Official Southwest Athletic Conference Football Roster-Records Book," 1960, Volume XL.
of us who are sports fans such paradoxical outcomes are indeed fascinating. One should be impressed by the very close outcomes and low scores of most of these games. As a partial explanation, one should recall that at this time most of the participants played both offense and defense. The SWC employs a unique rule in the event of a tie for the football championship: the team with the least recent appearance in the Cotton Bowl becomes the conference representative. TCU had won championships in 1955 and 1958, Arkansas had won a championship in 1954, but Texas had neither won nor shared a championship since 1953. Therefore, the University of Texas was chosen to represent the SWC in the 1960 Cotton Bowl against Syracuse University. Syracuse defeated Texas 23-14 in the bowl game and also won the national title. Usually when several alternatives are available, we use pairwise comparisons to help guide our selection but, as shown in this example, intransitiveness among choices may pose a problem. Examples where this paradox can occur range from round-robin competitions to politics and even to psychology, such as in the study of the notion of rationality in thinking patterns. Understanding these phenomena is important in our use of PMC, due to its emphasis on preference of choice between two alternatives. Before beginning this section we need a formal definition of intransitiveness as it applies to PMC. An excellent source is David (1988), who defines it as follows.

Definition 3.1.1 For any three real-valued random variables A, B, and C, stochastic intransitiveness occurs whenever Pr(A < B), Pr(B < C), and Pr(C < A) all exceed .50.
Suppose that we wish to compare three estimators, θ̂₁, θ̂₂, and θ̂₃, of a common parameter θ. Then it is possible for P(θ̂₁, θ̂₂|θ), P(θ̂₂, θ̂₃|θ), and P(θ̂₃, θ̂₁|θ) to all exceed .50. Such circularities were dubbed circular triads by Sir Maurice Kendall more than half a century ago. David's (1988) monograph on paired comparisons is essential reading on this topic. Each pairwise comparison, θ̂₁ vs. θ̂₂, θ̂₂ vs. θ̂₃, and θ̂₁ vs. θ̂₃, has two possible outcomes based on PMC (ignoring ties). Hence, there are 2³ possible outcomes for the three pairwise comparisons. Based on PMC, the three pairwise comparisons may not have independent outcomes. Among these eight results, two possibilities produce circular triads. These triads can be represented by means of a directed graph where θ̂ᵢ → θ̂ⱼ symbolizes that θ̂ᵢ is Pitman-closer than θ̂ⱼ to θ, as depicted in Figure 3.1. Whereas Savage lamented the contradictory nature of a circular triad, David (1988) provided a decidedly different perspective: "It is a valuable feature of the method of paired comparisons that it allows such contradictions to show themselves ...."
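Detecting a circular triad from the three pairwise Pitman probabilities is mechanical; the small helper below is an added illustration of the condition in Definition 3.1.1 applied to P(θ̂₁, θ̂₂|θ), P(θ̂₂, θ̂₃|θ), and P(θ̂₃, θ̂₁|θ).

```python
def is_circular_triad(p12, p23, p31):
    """True if all three pairwise preferences exceed .50 (a circular triad),
    or if all three fall below .50 (the same triad traversed in reverse)."""
    return (p12 > 0.5 and p23 > 0.5 and p31 > 0.5) or \
           (p12 < 0.5 and p23 < 0.5 and p31 < 0.5)

# example: each estimator beats exactly one of the others
print(is_circular_triad(0.6, 0.55, 0.7))   # True  -> intransitive
print(is_circular_triad(0.6, 0.55, 0.3))   # False -> a consistent ordering
```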
3.1.1 Round-robin competition
The simplest example of intransitiveness can be seen in round-robin competitions. For example, in a round-robin tournament involving three soccer teams from Argentina, Brazil, and Italy, Argentina may be preferred over Brazil, Brazil may have a better chance of beating Italy, but Italy may be the choice over Argentina. Which team should we choose? Clearly our choice is difficult as no one team appears to be preferred over all the others. Blyth (1972) gives an example of an athletic event where the times T_A, T_B, and T_C to run a particular course are recorded for three runners, labeled A, B, and C. If Pr(T_A < T_B), Pr(T_B < T_C), and Pr(T_C < T_A) are all greater
Figure 3.1: A directed graph of a circular triad based on PMC.
than .50, then these three probabilities indicate that there is better than a 50% chance that each of the following events occurs: A beats B, B beats C, and C beats A. In many Olympic team competitions, such as basketball, water polo, or ice hockey, this same property arises in comparisons. Several national teams are placed together in a league and then compete in round-robin competition to see who advances to the medal round. During the competition, pairwise comparisons between teams are made by the various sports writers. When enough information is gathered to assess the necessary probabilities of winning, the newscasters select the teams with the best chance to make it to the final round. Unfortunately, the intransitiveness property often arises, making it difficult to predict the winner. In the Games of the XVI Winter Olympiad (1992), a circular triad occurred in the preliminary round of the ice hockey competition. The teams from Canada, Czechoslovakia, and the Unified Team of the former Soviet Union completed the preliminary round-robin with identical records of four wins and one defeat. Czechoslovakia defeated the Unified Team 4-3, the Unified Team defeated Canada 5-4, and Canada defeated Czechoslovakia 5-1. In terms of team placement, the Olympic committee uses the goal differential in head-to-head competition to break ties. Canada had a +3 goal differential, the Unified Team had a 0 goal differential, and Czechoslovakia had a −3 goal differential. Accordingly, the teams were ranked first, second, and third, respectively, for the medal round. The ice hockey medal round uses a single-elimination approach in the determination of the gold medal winner. Thus placement can definitely affect the results. For example, although the Unified Team lost to Czechoslovakia in the preliminary round, it did not face Czechoslovakia in the medal round, as the Czech team was eliminated in its first medal-round game. As was predicted by the placement order, the Unified Team won the gold medal, the Canadian team won the silver medal, and the Czech team won the bronze medal. Intransitiveness is not necessarily a negative aspect, as transitiveness may be an artificial requirement. For example, in chess it is entirely possible that with high consistency player A wins over player B, B wins over C, and C wins over A. This does not make the rules of chess inappropriate but instead increases interest in the game. This is why round-robin events are popular. A dominant team will usually excel, but when no team has a clear advantage over all the others the intransitiveness leads to interesting and often exciting results.
3.1.2 Voting preferences
The issue of intransitiveness is most prevalent in discussions on voting preferences. For example, Blyth (1972) presents the problem where voters are asked to make pairwise comparisons between alternatives A, B, and C. A majority (70%) of the voters prefer A over B, a majority (60%) prefer B over C, and a majority (70%) prefer C over A. So which alternative is most popular? Election data often result in intransitive popularity contests. For example, prior to the 1980 Presidential elections certain political opinion polls revealed that in head-to-head competition, President Jimmy Carter would defeat Senator Edward Kennedy, California Governor Ronald Reagan would defeat President Carter, but Senator Kennedy would defeat Governor Reagan. The presence of such intransitiveness in elections is common and should not cause us to ignore this information. Instead, it should help us to understand that elections are based on closeness probabilities. Hoffman (1988) describes a preference test concerning America's favorite fast food, the hamburger. The choices are to buy a hamburger from McDonald's, Burger King, or Wendy's. He notes that society's preference may be paradoxically intransitive in that McDonald's may be preferred to Burger King, Burger King may be preferred to Wendy's, but Wendy's may be preferred to McDonald's. With three individuals, such a selection would occur if each chain is ranked first by only one person, second by only one person, and third by only one of the individuals. Another example described by Blyth (1972) concerns the choice of pie that brings an individual the most satisfaction. Suppose there are three alternatives: apple, blueberry, and cherry. It is entirely conceivable that the individual prefers apple to blueberry, blueberry to cherry pie, and cherry to apple pie. This preference would imply a probability exceeding .50 of receiving more satisfaction from one pie as compared to the other. Here the preference within an individual, as opposed to across individuals, is intransitive.
3.1.3 Transitiveness
The lack of transitiveness is clearly evident in these examples and requires extensive treatment and discussion. We are concerned, although to a lesser extent than decision theorists, over the possible intransitiveness of Pitman's closeness criterion. Our diminished concern over transitiveness is based on a
multitude of results. In Theorem 5.4.5, we demonstrate that transitiveness exists among competing estimators in the Bayesian setting known as posterior Pitman closeness (see Ghosh and Sen (1991)), to which we previously referred in the convenience store example. From Ghosh and Sen (1991), the posterior median is posterior Pitman-closest, and as such is unique, whenever the posterior distribution is continuous. Likewise, if we use absolute loss, the estimator that minimizes the Bayes risk (using a noninformative prior) coincides with that obtained from the Pitman closeness criterion. If one restricts the class of estimators under consideration to those that are equivariant under a suitable group of transformations (as in Ghosh and Sen (1989) or Nayak (1990)), the Pitman closeness criterion produces a median-unbiased estimator as the Pitman-closest within the class. More importantly, among these estimators transitiveness holds (see Nayak (1990) and §5.3). Restriction of the class of estimators by unbiasedness or equivariance is virtually inescapable in classical analysis and there is no reason not to invoke such "common" restrictions in using the Pitman closeness criterion. Likewise, the Pitman-closest estimator under such restrictions is the Pitman estimator under absolute risk. In decision theory the lack of transitiveness among competing estimators under Pitman's criterion is usually unavoidable. This happenstance is a consequence of the fact that majority preference alone is sufficient to guarantee the status of being Pitman-closer. Game theorists will recognize that as a majority preference rule the Pitman closeness criterion parallels the democratic voting process in the United States. If 51% of the citizens favor a candidate, it does not matter how much he or she is disliked by the 49% who voted for the opposition. Neither does it matter how close the decision was for the 51% who favored the winner. This criterion follows closely the well-known paradoxes in elections or preference polls. In this regard, the example given in §3.2.3 is very similar to the vast literature of Brams (1985). As an illustration, consider a primary election in which there are three candidates, A, B, and C. Let us suppose that each voter orders the candidates, with only the following outcomes possible:

A > B > C (32%),   B > C > A (33%),   C > A > B (35%).
The ability to order makes individual preference transitive. (Note that this situation differs from the pie example discussed earlier because in that case individual choices were intransitive.) Let us suppose that the electorate
supports each of the three permissible outcomes with the stated percentages. Even though each member of the electorate has ordered the candidates, the following pairwise elections would result:

A ≫ B (67% to 33%),   B ≫ C (65% to 35%),   C ≫ A (68% to 32%),
where X ≫ Y denotes that candidate X defeats candidate Y. Hence if one candidate withdraws from the election, the consequences will be disastrous for one of the two remaining individuals. With three candidates in the race, the polls provide the picture of a very competitive election. In this example the two candidates B and C will be chosen as the top two winners. Even though candidate B avoided elimination by only 1% of the electorate, in the following run-off election B will overwhelmingly defeat C by a landslide margin of 30%. The axiom of von Neumann and Morgenstern (1944, p. 27), that individual preference is transitive, is often called plausibility. However, these authors later remark that preference by society or by n participants in a game, even with the postulate of plausibility,

... is indeed not transitive. This lack of transitivity may appear to be annoying and it may even seem to be desirable to make an effort to get rid of the theory of it. Yet ... [it is] a most typical phenomenon in all social organizations. Cyclical dominations — y over x and z over y and x over z — is indeed one of the most characteristic difficulties that a theory of these phenomena must face.

This incoherence (i.e., intransitiveness) of Pitman's closeness criterion is well known among social scientists through Arrow's (1951) Impossibility Theorem for the collective rationality of social decision making. Blyth (1991) discusses the context of this point and its relation to the Pitman closeness criterion. Following von Neumann and Morgenstern, Arrow (1951) hypothesizes an electorate of voters whose preferences are "rational," meaning that they satisfy the following pair of axioms:

I. Either x ≥ y or y ≥ x (Decidability);
II. If x ≥ y and y ≥ z, then x ≥ z (Transitiveness).

In the mathematical sense, Axiom I applied to the real number line is very much an outcome of the principle of trichotomy (i.e., x > 0, x = 0, or x < 0).
Arrow constructs five more conditions to avoid trivial cases and proves the Impossibility Theorem: that no social ordering satisfies the pair of axioms and the quintet of conditions. Reflecting upon his discovery, Arrow (1951) remarks that

... the only part of the conditions that seems to me at all in dispute is the assumption of rationality. The consequences of dropping this assumption are so radical that it seems worthwhile to maintain it and consider restrictions on individual preferences.

Readers who are interested in a complete presentation of the two axioms and the five conditions are referred to the presentation by Thompson (1989). He also provides a proof of Arrow's Impossibility Theorem. In fact, Savage (1954) constructed his Foundations of Statistics on an axiomatic treatment of pairwise preferences. He makes the analogy that the occurrence of an intransitive comparison creates the same sensation as when one is confronted with the fact that some of his beliefs are logically contradictory. With the Pitman closeness criterion and its subsequent results, we are often better equipped to address many of the pressing issues of politics. An example of this can be seen in the usage of the system of approval voting, a method which is used for the election of certain officers in scientific and engineering societies (see Brams and Fishburn (1992)). In an approval vote involving three candidates, each voter selects one or two candidates. For example, in our previous primary election discussion with candidates A, B, and C, A would appear on 67% of the ballots, B on 65%, and C on 68%. Approval voting allows us to rank the candidates; thus, to some extent our degree of preference is reflected in the approval ballot. Haunsperger (1992) and Esty (1992) very recently suggested different ways to use these simultaneous comparisons to find a Condorcet (1785) candidate, who would defeat all competitors in a two-candidate election. Haunsperger (1992) suggests an innovative use of the Kruskal-Wallis test in her procedure, whereas Esty employs a maximum likelihood-based approach. These recent approaches arising from practical problems in diverse disciplines illustrate the realism and necessity of the PMC criterion. In our example of the 1959 SWC football season, approval voting could be used to determine the conference champion by ranking the three teams by their margin of victory in conference games. In this approach suggested by Haunsperger (see Table 3.2), TCU receives three first, three second, and one third place ranks, whereas Arkansas and Texas each receive two first, two
Table 3.2: Ranks of tri-champions by magnitude of victory.

  Opponent    Rank 1      Rank 2      Rank 3
  TCU         Arkansas    TCU         Texas
  Arkansas    Texas       Arkansas    TCU
  Texas       TCU         Texas       Arkansas
  A&M         TCU         Arkansas    Texas
  Baylor      Arkansas    TCU         Texas
  Rice        TCU         Texas       Arkansas
  SMU         Texas       TCU         Arkansas
second, and three third place ranks. Hence, as illustrated in Table 3.2, TCU wins the approval voting based on ranks ordered by the magnitude of victory. This nonparametric rank sum is known among social scientists as a Borda count (see Brams (1985)). Along the lines of Arrow's axioms, the Possibility Theorem of Sen (1966) also has application in our discussion of transitiveness. Keating's (1991) proof (see Theorem 4.6.1) that the Pitman closeness criterion is transitive among a class of ordered estimators is a special case of Sen's more general result applied to the comparison of estimators. The intransitiveness of the Pitman closeness criterion in estimation theory is certainly to be expected given the vast research on majority preference in the larger domain of game theory. Moreover, Pitman's (1937) closing remarks that intransitiveness "... does not detract from the definition of closest" challenge the necessity of transitiveness among all pairwise comparisons. In other words, if we can exhibit a decision rule, a Condorcet (1785) estimator, that is Pitman-closer than all its competitors on a pairwise basis, is transitiveness among the defeated a realistic concern? The mathematical significance of Pitman's intuitive remark is discussed in §3.4. If transitiveness can only be guaranteed by restricting the class of candidate estimators, then a natural question arises as to how frequently transitiveness occurs. Brams (1985, p. 66) gives us some limited insight into the relevance of intransitiveness, based on the number of estimators under consideration, through calculations of the probability of a Condorcet candidate, which corresponds to a Pitman-closest estimator. For three competing estimators, Brams calculates that approximately 92% of the comparisons would produce a Pitman-closest estimator due to chance alone. Moreover, in asymptotic setups, we will show in Chapter 6 that transitiveness holds under fairly general regularity conditions.
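The Borda count behind Table 3.2 is easy to reproduce; the sketch below is an added illustration that scores 2 points for each first-place rank, 1 for each second, and 0 for each third, one standard way of tallying a Borda count for three alternatives.

```python
# rows of Table 3.2: (rank 1, rank 2, rank 3) for each opponent
table_3_2 = [
    ("Arkansas", "TCU", "Texas"),     # vs TCU
    ("Texas", "Arkansas", "TCU"),     # vs Arkansas
    ("TCU", "Texas", "Arkansas"),     # vs Texas
    ("TCU", "Arkansas", "Texas"),     # vs A&M
    ("Arkansas", "TCU", "Texas"),     # vs Baylor
    ("TCU", "Texas", "Arkansas"),     # vs Rice
    ("Texas", "TCU", "Arkansas"),     # vs SMU
]

# Borda count: 2 points for a first-place rank, 1 for second, 0 for third
scores = {"TCU": 0, "Arkansas": 0, "Texas": 0}
for first, second, third in table_3_2:
    scores[first] += 2
    scores[second] += 1

print(scores)   # TCU: 9, Arkansas: 6, Texas: 6 -> TCU wins the Borda count
```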
3.2 Paradoxes among choice
Intransitiveness is only one of several paradoxes associated with the concept of probability of choice. Even if a pairwise criterion such as PMC were transitive, the preferred alternative may be the pairwise-worst but simultaneous-best. Likewise, the choice could be pairwise-best but simultaneous-worst. These paradoxes need to be understood when we use PMC. An excellent example from politics is used to illustrate these concepts.
3.2.1 The pairwise-worst simultaneous-best paradox
A formal description of the pairwise-worst simultaneous-best paradox is given by Blyth (1972). It is defined as follows.

Definition 3.2.1 For real-valued random variables X₁, X₂, and X₃, it is possible for Pr(Xᵢ = min{X₁, X₂, X₃}) to be the largest for i = 3, even though Pr(X₁ < X₃) and Pr(X₂ < X₃) exceed .50. Thus X₃ is preferred over both X₁ and X₂.

When a = 1, we obtain the maximum likelihood estimator θ̂ = X_{n:n}. From the statement following (1.10) we know that θ̂/θ has a beta distribution, with a cumulative distribution function given by F_n(t) = (t/θ)ⁿ, 0 ≤ t ≤ θ.

By choosing θ̂₂(1/3) > 1/3 and θ̂₂(2/3) < 2/3 we guarantee that, for all values of θ, θ̂₂(X) will be preferred over θ̂₁ for either [X = 1/3] or [X = 2/3]. Since Pr[θ̂₁(X) = 1/3] = 3θ(1 − θ)² and Pr[θ̂₁(X) = 2/3] = 3θ²(1 − θ),
Thus, by counting ties we focus on the values of the parameter space for which the two estimators are equal and assure that θ̂₂ is as close as θ̂₁ with probability 1. If the probability of ties were equally divided between the two estimators, our conclusion changes drastically and θ̂₂ is only slightly preferred over θ̂₁. The median plays a central role throughout this discourse. Its culmination comes in Chapter 5, in which Pitman-closest estimators among an equivariant class are shown to be median-unbiased estimators. Then θ̂₁(X) = X is the MLE of the unknown proportion of successes θ. It is also the UMVUE of θ and is efficient in that its variance attains the Frechet-Cramer-Rao lower bound on the variance of an unbiased estimator. Without modification of the Pitman closeness criterion, X may be inadmissible in this sense. However, the example can be modified to construct an alternative estimator that is not confounded with the problem of ties. This alternative estimator has its origins in Bayesian analysis and has the form

θ̃(X) = b.50(nX + 1, n(1 − X) + 1),
where b.50(α, β) is the median of a beta random variable with parameters α and β. The Bayesian origins of this median-rank estimator of θ are explained
Table 3.6: Numerical values of estimators of the binomial proportion.
  X       θ̂₁       θ̂₂       θ̃
  0       0.000     0.000     0.159
  1/3     0.333     0.356     0.386
  2/3     0.667     0.644     0.614
  1       1.000     1.000     0.841
in Example 5.4.8. It should be noted that the values of θ̃(X) coincide with the median ranks of the order statistics taken from a random sample of size four. Table 3.6 contains the numerical values of the three competing estimators. These estimators are graphed in Figure 3.4. In the earlier discussion, the issue of ties was brought forth to clarify that the disparity between θ̂₁ and θ̂₂ was not nearly as drastic as first thought. The table provides insight into the superiority of θ̂₂ in the middle of the parameter space in that θ̂₂(1/3) > θ̂₁(1/3) and θ̂₂(2/3) < θ̂₁(2/3). Hence for θ between .3445 and .6555, the values X = 1/3 and X = 2/3 will always contribute to the Pitman closeness of θ̂₂ over θ̂₁. By slightly increasing the estimate at X = 1/3 by ε and decreasing it at X = 2/3 by ε, we produce an estimator which is preferred over a subset of the parameter space. This strategy of improving traditional estimators based on the Pitman closeness criterion was first suggested by Keating and Mason (1985a). The modification obtained by using θ̃ applies the same strategy to θ̂₁ for all values of X. All three estimators are increasing functions of X and as such they will be discussed in §4.4. By connecting the estimates at each value of X with straight lines as depicted in Figure 3.4, the estimators can be made continuous. However, the median-rank estimator θ̃ is Pitman-closer to θ than θ̂₁ with probability 1 whenever .3707 < θ < .6293. This unusual result is obtained whether probabilities of ties are split or not, because θ̃ uses the same ε-strategy on the extreme values of X. This can be seen from Figure 3.4, where no value of X produces an estimate under θ̂₁ that is closer to θ than the one given by θ̃ over the restricted parameter space. In essence, θ̃ shrinks θ̂₁ toward .50 much like a shrinkage or Stein-rule estimator. Whereas one may lack sufficient knowledge to place a prior distribution on θ, it may well be known that θ lies in the interval (.3707, .6293). In such
Figure 3.4: Three estimators of the success probability.

cases, one would be certain that θ̃ is closer to θ than θ̂₁. We would reasonably counter that, with such prior information, an estimator superior to θ̃ could be constructed. The importance of the result is that MSE comparisons over the same subset of the parameter space do not disclose this complete preference. A second practical problem that may occur frequently is that if the true parameter space is (0, 1), then the estimates at X = 0 and X = 1 fall in the closure of the parameter space, but not in the parameter space itself. This issue, that estimators should map the data space into the parameter space, was very much a central theme in the work of Hoeffding (1984).
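The median-rank column of Table 3.6 can be reproduced with a beta median; the sketch below is an added check, using the reconstruction of the estimator given above, for n = 3 successes out of X, and lists θ̂₁ = X alongside θ̃.

```python
from scipy.stats import beta

n = 3
for k in range(n + 1):                        # k = number of successes
    x = k / n                                 # theta1_hat = X
    tilde = beta(k + 1, n - k + 1).median()   # median-rank estimate
    print(k, round(x, 3), round(tilde, 3))
# the last column comes out approximately 0.159, 0.386, 0.614, 0.841
```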
3.4.2 Correcting the Pitman criterion
In the above example we see that PMC, as given in Definition 1.0.1, results in a comparison criterion that is not only intransitive but also neither reflexive nor symmetric. While we cannot easily correct the problem of intransitiveness, we can solve the latter two problems. To do this, define the following weighted function:

J(L₁, L₂) = 1 if L₁ < L₂,   = ½ if L₁ = L₂,   = 0 if L₁ > L₂,   (3.2)
where L₁ is the loss associated with estimator θ̂₁ and L₂ is the loss associated with θ̂₂. If L₂ < L₁ there is no penalty; if the losses are equal there is a penalty of ½; and if L₂ > L₁ the loss is 1. Using these results, let us set forth a weaker definition, denoted by JP*, as the expectation of the weighted function in (3.2). It can be expressed as follows:

JP*(θ̂₁, θ̂₂ | θ) = E[J(L₁, L₂)] = Pr(L₁ < L₂) + ½ Pr(L₁ = L₂).
Thus the weight function J serves to split the probability assigned to events that produce ties between the loss functions. The weaker definition of Keating and Mason (1988b) has the following properties:

JP*(θ̂, θ̂ | θ) = ½   and   JP*(θ̂₂, θ̂₁ | θ) = 1 − JP*(θ̂₁, θ̂₂ | θ).
Thus, the corrected Pitman relation is reflexive and skew-symmetric but still intransitive. To illustrate the usefulness of the corrected Pitman criterion, reconsider the special case of the two observations from a uniform distribution mentioned at the beginning of §3.4.1. In this special case, we have that

P(θ̂₁, θ̂₂ | θ) = 0   and   P(θ̂₂, θ̂₁ | θ) = .50,
because Pr(θ̂₁ = θ̂₂) = .50 (i.e., ties should occur in half the cases due to chance alone). Using the corrected Pitman criterion, we have that

JP*(θ̂₁, θ̂₂ | θ) = 0 + ½(.50) = .25
and JP*(02) 0i) = 1 — JP*(0\, #2) = -75. These corrected values give us a more reasonable picture of this pairwise comparison than the perplexing values determined from Definition 1.0.1.
The problem of ties can occur with many adaptive estimators, such as in comparing θ̂₁ = X̄ and
Since these two estimators agree over a large part of the sample space, there is a problem with ties. By assigning equal probability to the ties, this problem can be avoided. More importantly, use of this procedure produces a Pitman relation that is more nearly an equivalence relation. There remains only the problem of the lack of transitiveness. From a mathematical perspective, if we define a new relation r*(θ̂₁, θ̂₂ | θ) = |r(θ̂₁, θ̂₂ | θ)|, then r*(x, y | θ) is symmetric. In the language of probabilistic metric spaces, our questions about transitiveness are then converted into questions about whether the triangle inequality holds.
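The tie-splitting convention is straightforward to apply in practice. The following sketch (not from the text) is a minimal Monte Carlo helper, under absolute loss, for the corrected criterion P*: strict wins for the first estimator count in full and exact ties count at half weight.

```python
import numpy as np

def corrected_pmc(est1, est2, theta):
    """Monte Carlo estimate of P*(est1, est2 | theta) under absolute loss.

    est1, est2 : arrays of realized estimates from repeated samples.
    Ties are split equally, so the resulting relation is skew-symmetric.
    """
    loss1 = np.abs(est1 - theta)
    loss2 = np.abs(est2 - theta)
    return np.mean((loss1 < loss2) + 0.5 * (loss1 == loss2))
```

For estimators that coincide on part of the sample space, exact ties occur with positive probability and would otherwise be credited entirely to one competitor.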
3.4.3 A randomized estimator

Randomized estimators are frequently used in estimation theory, especially for the estimation of parameters arising from discrete distributions. The process of randomization is often used to produce an exact test of a hypothesis when the distribution of the test statistic is discrete (see Lehmann (1986)). To assess the capacity of the corrected Pitman criterion to treat randomization judiciously, let θ̂₁ and θ̂₂ be two real-valued estimators of the real parameter θ. Consider the random variable U, having a uniform distribution on (0,1), to be stochastically independent of θ̂₁ and θ̂₂. Traditionally, we obtain values of U externally by means of a random number generator. In this context, define the randomized estimator θ̂_R as
Let us suppose that θ̂₁ is Pitman-closer to θ than θ̂₂ (by Definition 2.1.4), and compare θ̂_R with θ̂₂ (i.e., the worse of θ̂₁ and θ̂₂), using
Note that whenever U > 1/2, the comparison results in a tie, θ̂_R = θ̂₂. Consequently, by the independence of U,
According to Pitman's original definition, randomization does not produce an estimator that is Pitman-closer than the worse of the two competing estimators. Let us reconsider this same comparison in light of the corrected Pitman criterion.
The second term on the right-hand side of the preceding equation is a consequence of the tie between θ̂_R and θ̂₂ whenever U > 1/2. So we have
Therefore, the randomized estimator is corrected Pitman-preferred to θ̂₂. Using a similar argument, we can show that the estimator θ̂₁ is corrected Pitman-preferred to θ̂_R. This provides a transitive ordering among θ̂₁, θ̂_R, and θ̂₂ under the corrected Pitman criterion. One might reasonably conjecture that the corrected Pitman criterion is artificially difficult, and modify Definition 1.0.1 as follows:
Note that under this definition
By using ≤ in Definition 1.0.1, we would produce a criterion under which randomizing always yields an estimator that is Pitman-closer than θ̂₁ or θ̂₂. This result emphasizes the judicious way in which the corrected Pitman criterion handles ties. Pitman's original definition assigns all the probability of ties to the second estimator. The less-than-or-equal-to modification assigns all the probability of ties to the first estimator. Randomization is a convenient means for illustrating these pitfalls and the practicality of the corrected Pitman criterion.
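The contrast between the original and corrected conventions is easy to reproduce by simulation. In the sketch below, the competing estimators (a normal sample mean and a deliberately biased version of it) are hypothetical choices made for illustration; they are not the estimators of the preceding derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

def pmc(a, b, theta):
    # Original criterion: only strict wins for the first estimator count.
    return np.mean(np.abs(a - theta) < np.abs(b - theta))

def corrected_pmc(a, b, theta):
    # Corrected criterion: exact ties are split equally.
    la, lb = np.abs(a - theta), np.abs(b - theta)
    return np.mean((la < lb) + 0.5 * (la == lb))

theta, n, reps = 0.0, 5, 200_000
xbar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
t1 = xbar                        # hypothetical Pitman-closer estimator
t2 = xbar + 0.5                  # hypothetical, deliberately worse competitor
u = rng.uniform(size=reps)
tr = np.where(u <= 0.5, t1, t2)  # randomized estimator: t1 if U <= 1/2, else t2

print(pmc(tr, t2, theta))            # about 0.36: not preferred under the original rule
print(corrected_pmc(tr, t2, theta))  # about 0.61: preferred once ties are split
```

The randomized estimator copies t2 exactly whenever U > 1/2, so the tie event has probability about one-half, and how those ties are credited decides which convention declares a preference.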
3.5 The Rao-Berkson controversy
A controversy important to the understanding of PMC is one stimulated by Berkson (1980), who questions the sovereignty of maximum likelihood estimation. Berkson's preference is for minimum chi-square estimation, where
... a chi-square function is defined as any function of the observed frequencies and their expectations (or estimations of their expectations) that is asymptotically distributed in the tabular chi-square distribution.

Such a professed choice stirred much discussion and debate and ultimately led to a resurgence of interest in PMC. Rao (1980), in his discussion of Berkson's results, questioned Berkson's reliance on the principles of mean squared error. Rao noted some anomalies in using minimum mean squared error as a criterion and proposed that estimation procedures be reexamined using other criteria such as PMC. He also clarified the concepts associated with his criterion of second-order efficiency (SOE). A discussion of the Rao-Berkson controversy and of the principles involved in the usage of SOE is contained in this section.
3.5.1 Minimum chi-square and maximum likelihood

In categorical data models, there are many functions that are asymptotically distributed as a chi-square random variable. Each could be used as a criterion in estimation. Berkson (1980) lists five such functions, one of which is the likelihood chi-square denoted by
where O is the observed frequency in n trials and E is the expectation of the corresponding frequency and depends on the unknown parameters. Minimizing this chi-square with respect to the unknown parameter yields the estimate of interest, labeled the minimum chi-square estimate (MCE). Asymptotically equivalent estimates can be obtained by minimizing any of the other chi-square functions. These procedures belong to a general class of BAN estimators, which we will consider in Chapter 6. Berkson (1980) argues that minimizing the likelihood chi-square yields the same estimate as would be obtained using the maximum likelihood procedure, and thus the MCE should be preferred over the maximum likelihood
estimate (MLE). The maximum likelihood procedure selects the parameter estimate that maximizes the likelihood function or, equivalently, maximizes the log-likelihood function associated with the sample data. Since the MLE can be derived as an MCE, it is argued that minimum chi-square is the primary principle of estimation. To illustrate the controversy aroused by Berkson, we discuss a simplified variant of his 1955 bioassay problem in logistic regression. We assume that two independent groups, consisting of n patients each, are exposed to different dose levels of an experimental medication. The first n patients receive the (lower) dose level d₁ = −1, and the second n receive the (higher) dose level d₂ = +1. Each patient in the experiment then constitutes a Bernoulli trial as in Example 2.3.1 and Example 3.4.1. Fountain, Keating, and Rao (1991) discuss this bioassay problem under different criteria. Let nX₁ and nX₂ be the numbers of survivors under treatments 1 and 2, respectively. Under the Bernoulli conditions described, we know that nX₁ ~ B(n, θ₁) and nX₂ ~ B(n, θ₂). In this simplified logistic regression problem, the survival proportions θ₁ and θ₂ for the two dose levels are functions of a common parameter β, given by

    θᵢ = exp(βdᵢ)/[1 + exp(βdᵢ)],  i = 1, 2.
The parameter space Ω for (θ₁, θ₂) is defined as the open square lying in the first quadrant with sides of unit length and a vertex at the origin. The logit is defined as πᵢ = ln[θᵢ/(1 − θᵢ)] = βdᵢ, for i = 1, 2. Berkson suggested estimating β by minimizing a weighted least squares expression of the following form (which is the basis of logistic regression):
where the estimated logit is π̂ᵢ = ln[Xᵢ/(1 − Xᵢ)] and C(β | X₁, X₂)

For each ordered pair (θ̂₁, θ̂₂) ∈ ℛ₃, we have that θ̂₁ + θ̂₂ < 2θ and −θ̂₁ + θ̂₂ < 0. By adding these inequalities, we obtain that θ̂₂ < θ. Likewise, the two inequalities jointly imply that −(θ − θ̂₂) < θ̂₁ − θ < (θ − θ̂₂), or equivalently that |θ̂₁ − θ| < |θ̂₂ − θ|. In a similar way we can show that in ℛ₂ and ℛ₄, |θ̂₂ − θ| < |θ̂₁ − θ|. Hence, it follows that P(θ̂₁, θ̂₂ | θ) = Pr(ℛ₁) + Pr(ℛ₃). Note that ℛ₁ and ℛ₃ are the first and third quadrants in a coordinate system whose origin is located at the point (θ, θ) and whose axes are rotated through an angle of 45° from the original. We have been careful in this theorem to state Pitman's measure of closeness in terms of the strict inequality to avoid the problem of ties discussed in Chapter 3. If the line of equality and the switching line are events with probability zero, then both variants of PMC yield the same result (i.e., the comparison is the same whether one defines PMC according to (1.1) or as in the Pitman relation in §3.4). The result of the Geary-Rao Theorem can be stated in a rectangular form whenever the line of equality and the switching line have zero probability. Define the bivariate random vector θ̂ by θ̂ = [θ̂₁ θ̂₂]′, the 2 × 1 vector of ones as 1, and the matrix A by
These definitions give rise to the transformed axes depicted as dashed lines in Figure 4.1 and defined by the bivariate random vector U as follows:
The advantage obtained in using this transformation resides in the simplicity with which Pitman's measure of closeness can be determined. This simplification is illustrated in the following theorem.

Theorem 4.1.4 (Rectangular Form) Let θ̂₁ and θ̂₂ be two univariate estimators of the real parameter θ and let f(u₁, u₂) be the joint density, where U₁ and U₂ are defined according to (4.3). Then it follows that
The proof of the rectangular-form theorem is an obvious consequence of the Geary-Rao Theorem. Also, note that since the transformation from the coordinate system defined on the bivariate random vector θ̂ to U is unitary, the joint density function of U can be found from that of θ̂ in a straightforward way. The rectangular-form theorem is also valid when the joint distribution of θ̂ is discrete. In this case the integrals in (4.4) are replaced with summations and the joint density functions are replaced with joint probability functions. From the PMC inequality, however, the indices of summation cannot be initialized at zero but rather must be initialized at some minimum value.

Theorem 4.1.5 (Discrete Rectangular Form) Let θ̂₁ and θ̂₂ be two discrete univariate estimators of the real parameter θ and let p(u₁, u₂)

there exists η > 0 such that for all ε, 0 < ε < η, and for the function 𝒦(t) = θ̂₁(t) + θ̂₂(t) − 2θ, it follows that 𝒦(s + ε)𝒦(s − ε) < 0. A switching point s is a change point of the first kind for the function 𝒦(t). Hence, we restate the Geary-Rao Theorem in the context given by Rao et al. (1986).

Corollary 4.3.3 (Karlin's Corollary) Let θ̂₁ and θ̂₂ be univariate estimators of the real parameter θ and continuous functions, a.e., of the statistic T. Let A = {t₁, ..., t_m} be the set of m crossing points of θ̂₁ and θ̂₂, let B = {s₁, ..., s_n} be the set of n switching points of θ̂₁ and θ̂₂ at θ, and denote AΔB = {y₁, ..., y_j}, where j ≤ m + n, as the symmetric difference, (A ∪ B) − (A ∩ B), of the sets A and B. Define the ordered points of the closeness partition as x₀ = −∞, x₁ = min{y₁, ..., y_j}, ..., x_j = max{y₁, ..., y_j}, and x_{j+1} = ∞. If |θ̂₁ − θ| < |θ̂₂ − θ| in (x₀, x₁), then
where the double brackets in the upper index of summation in (4.11) denote the greatest integer function. Some assumptions given in Karlin's Corollary require explanation. He assumes that the sets of crossing points and switching points are finite. Later we shall demonstrate that the corollary is still true for infinite sets, which must be at most countable. A second assumption in Karlin's approach is that the essential range I of the random variable T is an interval. We will show that the essential range need only be an open set of the reals. From Zacks (1971), we use the following notation, which is presented there in a Bayesian context. Let ℒ₀(X, θ) be the absolute loss function
defined on the Cartesian product of the essential range I and the parameter space Ω, and define
where T is a statistic that possesses the monotone likelihood ratio property for the parameter θ, and d₁(T) and d₂(T) are two decision rules (or estimators). In the Bayesian context, ℒ₀[dᵢ(T), θ] is the loss due to choosing action dᵢ(T) when the true state of nature is θ. Using Karlin's concepts applied to preference in terms of Pitman's measure, we can see that preference (i.e., choice of the one with smaller loss) between d₁(T) and d₂(T) will change according to the changes in sign of the function N(T | θ), that is,
where 𝒦(T) and 𝒢(T) are defined using Definitions 4.3.1 and 4.3.2. Define Z(N) as the set of all zeros of N(T | θ) and C(N) as the set of all changes in sign of N(T | θ). From the previous discussion we can see that N(T | θ) will change sign according to the changes in sign of 𝒦(T) or 𝒢(T), except when these functions change sign simultaneously. Karlin (1957) and Zacks (1971) observe this in the Bayesian context for testing two hypotheses (i.e., hypothesis Aᵢ : θ ∈ Ωᵢ, for i = 1, 2). The set over which θ̂₁ is closer to θ than θ̂₂ corresponds to the acceptance region determined by the test function for the first hypothesis (i.e., A₁ : θ ∈ Ω₁) in the Bayesian framework. Denote the indicator function on the positive real number line by I⁺(x). Then I⁺[−N(T | θ)] plays the corresponding (but classical) role of the Bayes test function for the above hypothesis (or action). It is defined as
We see in the continuous case that
The numerical value of Pitman's measure becomes the expected value of the classical analogue of the Bayes test function. The partition defined in Karlin's Corollary consists of at most j + 1 intervals, and on these, θ̂₁ has smaller loss than θ̂₂ on exactly ⌊j/2⌋ of the intervals. The estimator with smaller loss alternates between the two competitors over adjacent intervals defined by the partition. We do not emphasize the assumptions used in Karlin's Corollary, since they can be made less restrictive without affecting the truth of the corollary. However, some remarks about the assumptions are noteworthy. The definitions of crossing, switching, and change points allow N(T | θ) to be discontinuous at such points. Consequently, every change point of N(T | θ) need not be a zero of N(T | θ). Moreover, elements of Z(N) need not be change points of N(T | θ). Previously, it had been assumed that Z(N) must be a subset of C(N). In fact, it is conceivable that Z(N) ∩ C(N) = ∅. Earlier research in this area had required that all zeros be modal (i.e., every zero of N(T | θ) had to also be a change point). However, the use of the symmetric difference of the set of crossing points A = C(𝒢) and the set of switching points B = C(𝒦) precludes this requirement. The previous researchers in this area were aware of the unnecessary nature of this condition, and they circumvented the obstacle in another way by requiring that the nonmodal zeros of N(T | θ) be counted twice. By Theorem 2.2.2, the corollary remains true under a wide class of transformations on ℒ₀. Thus it follows that the sets A and B suffice for all loss functions in the family. Ferguson (1967) made the same observation in the context of Bayesian hypothesis testing. The importance of this result to Bayesian decision theory lies in the explicit statement given for C(N) = C(𝒦)ΔC(𝒢). We can be more articulate in this case since we have a limited family of loss functions. In practice, the Bayesian can start by solving for the value(s) of θ that satisfy d₁ + d₂ = 2θ. Earlier we claimed that Karlin's Corollary is true even in the event that the number of changes of sign is countably infinite, and that we could drop the condition that the statistic T must satisfy the monotone likelihood ratio (MLR) property. Toward this end, we present an extension of Karlin's Corollary given by Keating (1991).
Corollary 4.3.4 (Karlin's Extended Corollary) Using the definition given in (4.12), suppose that N(U | θ) is a piecewise continuous function of the statistic U, whose essential range I contains at most a countable number
of discontinuities in the set D. Then there exists at most a countable number of disjoint open intervals, I₁, I₂, ..., over which |θ̂₁(U) − θ| < |θ̂₂(U) − θ|. Consequently, we have that
Proof: By assumption, N is a continuous (a.e.) function such that N : I − D → ℝ. Now the interval (−∞, 0) is an open set in ℝ under the Euclidean metric. So from the continuity of N, the inverse image N⁻¹[(−∞, 0) | θ] is a relatively open set in I − D, which is an open subset of ℝ. For the real number line, every open set can be written as at most a countable union of disjoint open intervals, which are mutually exclusive events defined on the random variable U. Thus it follows that
such that Iⱼ ∩ Iₖ = ∅ for j ≠ k. From the definition of N,
The subsequent result in (4.18) is a direct consequence of Kolmogorov's axiom of countable additivity. Note that Karlin used the MLR property to obtain a complete (or essentially complete) class of decision rules for the two hypotheses. The endpoints of the intervals given above satisfy either Definition 4.3.1 or Definition 4.3.2, except that we must add the condition "less than or equal to" zero as opposed to "strictly less than" zero.
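When both competitors are functions of a single statistic T with a known distribution, the probability content of the preference region {N(T | θ) < 0} can be accumulated directly, in the spirit of the corollary. The sketch below is illustrative only: the statistic (a normal sample mean), the two estimators (the identity and a hypothetical affine shrinkage toward .5), and the grid-based accumulation are all assumptions made for this example.

```python
import numpy as np
from scipy import stats

# Hypothetical setting: T = sample mean of n normal observations;
# est1(T) = T and est2(T) = 0.8*T + 0.1 (shrinkage toward 0.5) are the competitors.
n, sigma, theta = 10, 1.0, 0.3
est1 = lambda t: t
est2 = lambda t: 0.8 * t + 0.1

def N(t):
    # N(t | theta) < 0 exactly when |est1(t) - theta| < |est2(t) - theta|.
    return (est1(t) - est2(t)) * (est1(t) + est2(t) - 2.0 * theta)

dist = stats.norm(loc=theta, scale=sigma / np.sqrt(n))

# Accumulate Pr{N(T | theta) < 0} cell by cell over a fine grid of T values;
# the cells with N < 0 form a finite union of intervals bounded by crossing
# and switching points.
edges = np.linspace(theta - 8 * sigma, theta + 8 * sigma, 40001)
mid = 0.5 * (edges[:-1] + edges[1:])
mass = np.diff(dist.cdf(edges))
print(round(mass[N(mid) < 0.0].sum(), 4))   # about 0.26 at theta = 0.3
```

Here the preference region for est1 is the single interval between the switching point (where est1 + est2 = 2θ) and the crossing point (where est1 = est2), so the answer is simply the probability that T falls in that interval.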
4.4 A special case of the Geary-Rao Theorem
The calculation of Pitman's measure of closeness can be greatly simplified for a special case arising out of the Geary-Rao Theorem. Before presenting
the special case, we define a class of estimators that can be compared based on PMC via the simplification.
4.4.1 Surjective estimators
Definition 4.4.1 An estimator θ̂(T), which is a function of the statistic T, is a nondecreasing full-range estimator of the real parameter θ, provided:
(i) θ̂ : I → Ω is surjective (i.e., the estimator maps the essential range of the random variable onto the parameter space Ω);
(ii) θ̂(T) is a nondecreasing function of T.

Property (i) of Definition 4.4.1 (i.e., surjective estimators) has concerned statisticians since Halmos (1946), Lehmann (1951), and, more recently, Hoeffding (1984) and Bar-Lev and Enis (1988). The simple intent of the condition is that the set of possible values that the estimator can assume is a subset of the parameter space Ω. An estimator should not assume values that are not possible for the parameter θ. However, the history of statistics contains many estimation procedures that do not satisfy the surjective condition. One such example involves estimators that can provide negative estimates of variance components, such as Rao's (1971) MINQUE estimator of a linear combination of variances. If Property (ii) is strengthened from nondecreasing to strictly increasing, the estimator becomes injective in the topological sense. Of course, if an estimator is both injective and surjective, it becomes bijective, or a homeomorphism. If θ̂₁(T) and θ̂₂(T) are two estimators that satisfy Definition 4.4.1 and are not both constant on the same subset C (a.e.) of the domain I, then θ̂(T) = [θ̂₁(T) + θ̂₂(T)]/2 will be a homeomorphism on I. By the intermediate value theorem there exists at least one switching point s₁ in I, and by the strictly increasing nature of the average of the two estimators, s₁ is also unique. These observations produce the following corollary, which was derived by Dyer and Keating (1979a).

Corollary 4.4.2 Let θ̂₁(T) and θ̂₂(T) be nondecreasing full-range estimators of θ such that both are not constant (a.e.) on I. Let s₁ be the unique switching point at θ, and suppose that θ̂₁(T) and θ̂₂(T) have at most a finite number of crossing points, say t₁, ..., t_m. If θ̂₁(t) > θ̂₂(t) whenever t is in (−∞, min(t₁, ..., t_m)), then
where x₀ = −∞, x₁ = min(s₁, t₁, ..., t_m), ..., x_{m+1} = max(s₁, t₁, ..., t_m), x_{m+2} = ∞, and j = m + 1. In this statement, we assume that the unique switching point is not a crossing point. If it is, we can derive a simpler result through the symmetric difference in Karlin's Corollary, since Corollary 4.4.2 is a special case of Karlin's Corollary. The wide applicability of Corollary 4.4.2 is demonstrated in the following subsection.
4.4.2 The MLR property
Earlier in this chapter, we cited Karlin's use of the MLR property to establish a complete class of decision rules. In this subsection, we begin the process of disclosing results that the MLR property supports in classical estimation theory.

Definition 4.4.3 The family f(x, θ) of probability functions indexed by the parameter θ is said to have the monotone likelihood ratio (MLR) property if there exists a real-valued function T(x) such that, for any θ′ < θ″, the ratio f(x, θ″)/f(x, θ′) is a nondecreasing function of T(x).

Φ⁻¹(P) is more frequently referred to as the standard normal deviate associated with the 100Pth percentile of the distribution of X. Let the statistic T be defined as follows:
where n is the sample size, X̄ is the sample mean, and S is the sample standard deviation (i.e., S² = Σᵢ₌₁ⁿ (Xᵢ − X̄)²/(n − 1)). It follows that
where t(f, λ) represents a noncentral t-random variable with f degrees of freedom and noncentrality parameter λ. Let
where c₁ = √(2π)/[Γ(f/2)2^((f−2)/2)], denote the distribution function of a noncentral t-random variable having f degrees of freedom and noncentrality parameter λ. The classical estimators of the survival proportion P(x₀), maximum likelihood and minimum variance unbiased, are functions of the complete sufficient statistic (X̄, S). From the invariance property of maximum likelihood, the MLE of P(x₀) is given by
The uniformly minimum variance unbiased estimator (UMVUE) of P(x₀) can be determined through the Rao-Blackwell Theorem as
where I_x(α, β) is the incomplete beta function ratio. Thus it is obvious that P̂_L(T) and P̂_U(T) can be expressed as functions of the common statistic T, and the two estimators satisfy Definition 4.4.1, which is necessary in Corollary 4.4.2, in that each is a continuous, nondecreasing function of T. The range of P̂_L(T) is (0, 1), whereas the range of P̂_U(T) is [0, 1]. Graphs of P̂_L(T) and P̂_U(T) are given in Figures 4.2 and 4.3 for n = 3 and 8, respectively. These estimators have been compared by Zacks and Milton (1972) and Brown and Rutemiller (1973) on the basis of MSE, and by Boullion, Cascio, and Keating (1985) in terms of absolute risk.
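The two estimators and the positive crossing point z₁ can also be examined numerically. The sketch below is illustrative and does not reproduce the book's computations: it assumes T = √n(X̄ − x₀)/S, uses P̂_L(t) = Φ(t/√(n − 1)) as in the discussion that follows, takes the UMVUE in the common incomplete-beta form with argument 1/2 + t/[2(n − 1)] clipped to [0, 1], and substitutes SciPy's brentq root finder for whatever numerical procedure was actually employed.

```python
import numpy as np
from scipy import stats, special, optimize

n = 8  # sample size, as in Figure 4.3

def p_mle(t):
    # MLE of the survival proportion, written as a function of T.
    return stats.norm.cdf(t / np.sqrt(n - 1))

def p_umvue(t):
    # Assumed incomplete-beta form of the UMVUE, clipped to [0, 1].
    x = np.clip(0.5 + t / (2.0 * (n - 1)), 0.0, 1.0)
    a = (n - 2) / 2.0
    return special.betainc(a, a, x)

# Both estimators equal 1/2 at t = 0; the unique positive crossing point z1
# solves p_mle(t) = p_umvue(t) on (0, n - 1).
z1 = optimize.brentq(lambda t: p_mle(t) - p_umvue(t), 1e-6, n - 1.0 - 1e-6)
print(round(z1, 4), round(p_mle(z1), 4))
```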
Figure 4.2: Estimators of the proportion when the sample size is 3.

To apply Corollary 4.4.2, we must determine the crossing points, which are the points of intersection of P̂_L(T) and P̂_U(T). In this section, we note that P̂_L(T) and P̂_U(T) have precisely three crossing points. First observe that at t = 0, we have P̂_L(0) = Φ(0) = 1/2 and likewise that P̂_U(0) = I_{1/2}[(n − 2)/2, (n − 2)/2] = 1/2.
Figure 4.3: Estimators of the proportion when the sample size is 8.

Consequently, t = 0 is a point of intersection for P̂_L(t) and P̂_U(t). Since
P̂_L(t) = Φ(t/√(n − 1)) = 1 − Φ(−t/√(n − 1)) = 1 − P̂_L(−t),
the function P̂_L(t) − 1/2 is an odd function of t. Furthermore, for |t| < n − 1, we have that
Therefore, the function P̂_U(t) − 1/2 is an odd function of t. This property of symmetry about (0, 1/2) implies that for each positive crossing point of the graphs of P̂_L(t) and P̂_U(t), a negative crossing point exists whose t-coordinate is equal in absolute value. The fact that there are exactly three points of intersection can be established by proving that there is a unique positive point of intersection (see Keating (1980) for the mathematical details). Using numerical procedures, we determined the positive crossing point z₁ of the MLE and UMVUE for n from 3 to 10. From a numerical perspective, for estimators that satisfy Definition 4.4.1, the associated switching point s₁ must be bounded by Φ⁻¹(P)√(n − 1)
and (n − 1){2I_P⁻¹[(n − 2)/2, (n − 2)/2] − 1}. The first value is the value of t for which P̂_L(t) = P, and the latter is the value for which P̂_U(t) = P. Hence, we can use the secant method to solve for s₁ for each value of P and n through IMSL subroutines for the inverse functions of the standard normal and the beta distribution functions. We shall present results for P ≥ .50 and note that the PMC is symmetric about P = .50. Observe that P̂_L(t) > P̂_U(t) whenever t ∈ (−∞, −z₁), so that Corollary 4.4.2 implies that the ordered endpoints are given by x₀ = −∞, x₁ = −z₁, x₂ = 0, x₃ = min(s₁, z₁), x₄ = max(s₁, z₁), and x₅ = ∞, which yields
which is a linear combination of noncentral t-probabilities, where
Table 4.3 contains the closeness probabilities of the MLE to the UMVUE of the survival function in the normal distribution for values of n = 3, 4, 6, 7, 9, and 10 and P = .50(.05).95. We note that the closeness probabilities approach extremes as P approaches .50 and .95. As the survival proportion approaches .50, P̂_U(T) is highly preferred to the MLE in the sense of Pitman's measure of closeness, whereas the MLE is highly preferred in the same sense when the survival proportion approaches .95. These extreme values of Pitman's measure are produced by the same geometric phenomenon. When P = .50, the crossing point at t = 0 also becomes a switching point of the two estimators and thus, according to Karlin's Corollary, is deleted from the symmetric difference of the sets of crossing points and switching points. At this proportion of survivors, the UMVUE is closer to P = .50 for all values of T from −z₁ to z₁, and hence the subsequent dominance of P̂_U in terms of Pitman's closeness criterion is readily understood. Likewise, when the survival proportion approaches .95, the switching point s₁ approaches the positive crossing point z₁. Consequently, the values of s₁ and z₁ are deleted from the symmetric difference (or, more exactly, the probability content of that interval approaches zero) of the sets of crossing points and switching points. In this way the MLE is closer to a value of P
Table 4.3: PMC of the MLE to the UMVUE.

                     Sample size n
  P        3        4        6        7        9       10
 .50    .1939    .0804    .0158    .0073    .0016    .0007
 .55    .2801    .1778    .1326    .1328    .1433    .1499
 .60    .3766    .2819    .2462    .2505    .2692    .2797
 .65    .4802    .3887    .3502    .3525    .3682    .3772
 .70    .5882    .4957    .4425    .4368    .4388    .4417
 .75    .6973    .6016    .5260    .5084    .4903    .4844
 .80    .8035    .7071    .6093    .5797    .5411    .5257
 .85    .9004    .8141    .7066    .6697    .6176    .5941
 .90    .9766    .9247    .8368    .8036    .7551    .7295
 .95    .9824    .9660    .9813    .9903    .9988    .9871
near .95 for all T ∈ (−∞, −z₁) ∪ (0, ∞). Of course, for P near .95, the latter set in this union contains most of the probability. One should also observe that the magnitude of the preference of the MLE in terms of Pitman's measure of closeness becomes a more local phenomenon in the parameter space as the sample size increases. For sample sizes near 10, the closeness probability of the MLE relative to the UMVUE drops in value by about .30 when P moves from .95 down to .90. From Table 4.3, we see that for small sample sizes, the decided preference of the MLE in terms of PMC is spread over a larger subset of the parameter space [0, 1]. A median unbiased estimator P̃(T) can be determined from medians of the noncentral t-distribution. Let t.50(f, δ) denote the median of a noncentral t-distribution with f degrees of freedom and noncentrality parameter δ. For a fixed number of degrees of freedom, t.50(f, δ) is an increasing function of δ. Hence, there exists a unique value δ = h.50,f(x) such that
The median unbiased estimator is then determined by solving the following equation for P̃(t):
Consequently, the median unbiased estimator of P is given by
In Chapter 5, we will show that median unbiased estimators (which are functions of a complete sufficient statistic) are Pitman-closest within a restricted class of estimators.

Example 4.5.3 (The Efron Rule Revisited) Another illustration of Corollary 4.4.2 arises from Example 1.1.2. In Chapter 1, we merely stated that Efron's rule was Pitman-closer than the sample mean X̄. Efron (1975) does not detail his evaluation of PMC for these estimators. We note that for θ = 0, P(θ̂₁, θ̂₂ | θ) = 0, which is obvious from Figure 1.1. We present a proof for the case in which θ > 0 and note that PMC in this example is an even function of θ. From Corollary 4.4.2, we have
where s is the unique switching point of θ̂₁ and θ̂₂, and 0 is obviously the sole crossing point. Two cases must be considered, depending upon the two possible definitions of θ̂₂(X). The first case occurs whenever 0 < θ < c/(2√n) and can be readily handled. In the second case, where θ > c/(2√n), we have by Definition 4.1.2
which implies that

Moreover, since s > θ > c/(2√n), we have
The expression on the right-hand side of this inequality is a strictly increasing function of θ and is bounded above by one-half.
4.6 Transitiveness
Although Pitman's measure of closeness for three competing estimators need not be transitive (see §3.1), there are certain sufficient conditions that produce transitiveness. Let θ̂₁(X), θ̂₂(X), and θ̂₃(X) be three competing estimators of the real parameter θ based on a common random vector X. If the three estimators are ordered for all X in the sample space ℝⁿ, then Pitman's measure of closeness is transitive.
4.6.1 Transitiveness Theorem
Theorem 4.6.1 (Transitiveness Theorem) Let θ̂₁(X), θ̂₂(X), and θ̂₃(X) be real-valued estimators of the real parameter θ based on the data contained in the n-dimensional random vector X. If the three estimators are functionally ordered for all X, then Pitman's measure of closeness is transitive.

Proof: Assume that θ̂₁(X) ≥ θ̂₂(X) ≥ θ̂₃(X). Since the estimators are ordered, we can conclude from the Geary-Rao Theorem that for i = 1, 2; j = 2, 3; and i < j,
Moreover, by the assumed ordering we know that
Since these random variables are functionally ordered according to the ordering given in inequality (4.35) and the common essential range E of the three estimators is an open subset of ℝⁿ, we have that
for all x ∈ (−∞, ∞). In particular, when x = 2θ, we have from inequalities (4.34) and (4.36) that
Case 1. Suppose that P(θ̂₁, θ̂₂ | θ) > .50. Then P(θ̂₂, θ̂₃ | θ) > .50 and P(θ̂₁, θ̂₃ | θ) > .50. In this case, we assumed that P(θ̂₁, θ̂₂ | θ) > .50 (i.e., θ̂₁ is
better in the sense of PMC than θ̂₂), and consequently the two conclusions follow from the transitive property of real numbers and inequality (4.37). The two conclusions are that θ̂₂ is Pitman-closer to θ than θ̂₃ and that θ̂₁ is Pitman-closer to θ than θ̂₃.

Case 2. Suppose that P(θ̂₁, θ̂₂ | θ) < .50 and P(θ̂₁, θ̂₃ | θ) > .50. Then P(θ̂₂, θ̂₃ | θ) > .50. From the first assumption in Case 2, we have Pr(θ̂₁ + θ̂₂ < 2θ) < .50, and from the second that Pr(θ̂₁ + θ̂₃ < 2θ) > .50. From these assumptions, we have that θ̂₂ is Pitman-closer to θ than θ̂₁, and θ̂₁ is Pitman-closer to θ than θ̂₃. However, by inequality (4.36) and the transitive property of real numbers, we have that
The transitiveness of the Pitman closeness criterion is now established in this second case; i.e., θ̂₂ is preferred over θ̂₃.

Case 3. Suppose that P(θ̂₂, θ̂₃ | θ) < .50. Then P(θ̂₁, θ̂₂ | θ) < .50 and P(θ̂₁, θ̂₃ | θ) < .50. Notice that this assumption presumes that θ̂₃ is Pitman-closer to θ than θ̂₂. This case is easily established by inequality (4.36). Since P(θ̂₂, θ̂₃ | θ) < .50, then by the transitiveness of real numbers we have that P(θ̂₁, θ̂₂ | θ) < P(θ̂₁, θ̂₃ | θ) < .50. The transitiveness of Pitman's measure of closeness is now established in this last case.
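The content of the theorem is easy to see in a small simulation. The three estimators below (a normal sample mean shifted by three nonnegative constants) are hypothetical choices that satisfy θ̂₁(X) ≥ θ̂₂(X) ≥ θ̂₃(X) for every sample; the estimated pairwise closeness probabilities then line up monotonically, and no intransitive cycle can occur.

```python
import numpy as np

rng = np.random.default_rng(1)

def pmc(a, b, theta):
    # Pr(|a - theta| < |b - theta|), estimated by Monte Carlo.
    return np.mean(np.abs(a - theta) < np.abs(b - theta))

theta, n, reps = 0.0, 10, 200_000
xbar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)

# Functionally ordered competitors: t1 >= t2 >= t3 for every sample.
t1, t2, t3 = xbar + 0.6, xbar + 0.3, xbar

p12, p13, p23 = pmc(t1, t2, theta), pmc(t1, t3, theta), pmc(t2, t3, theta)
print(round(p12, 3), round(p13, 3), round(p23, 3))   # p12 <= p13 <= p23
```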
4.6.2 Another extension of Karlin's Corollary
Remark 4.6.2 Note that (4.34) includes an intervening equation that states the following result:
where these functions are defined, respectively, after Definitions 4.3.1 and 4.3.2, except that the domains of definition are extended to ℝⁿ. We can now denote the inverse image of a set A ⊂ ℝ for the function 𝒢(·) as 𝒢⁻¹(A), which is a subset of ℝⁿ, as we did previously. If the product of 𝒢(·) and 𝒦(·) is negative, then the regions of preference defined in the Geary-Rao Theorem are U₁ = 𝒦⁻¹[(0, ∞)] ∩ 𝒢⁻¹[(−∞, 0)] and U₂ = 𝒦⁻¹[(−∞, 0)] ∩ 𝒢⁻¹[(0, ∞)]. Therefore, it follows that
The regions given in (4.39) are subsets of ℝⁿ, and consequently in the continuous case these results have expressions in terms of n-fold integrals,
Let E be the essential range of the random variable X and let D be the set of discontinuities of the function N. If 𝒦 and 𝒢 are restricted continuous maps from E − D → ℝ¹, so that the restricted domain, E − D, of N is an open subset of ℝⁿ, then the pre-images must be open in ℝⁿ since the sets (−∞, 0) and (0, ∞) are open in ℝ¹. Since open sets in this separable metric space (ℝⁿ, ρ) satisfy the Lindelöf property, the preference region, which is an open subset of ℝⁿ, can be written as a countable union of open sets, O₁, O₂, ..., in ℝⁿ. However, the open subset of ℝⁿ cannot be written as a countable union of "disjoint" open cubes (see Dugundji (1966, p. 95)). Since we have assumed that E − D is also an open subset of ℝⁿ, the pre-image under the continuous restricted map N can be written as the union of at most a countable number of nonoverlapping "closed" cubes, C₁, C₂, ...

If θ̂₁(X) ≥ θ̂₂(X), then Pitman's measure of closeness can be viewed as a risk function defined on the estimator that is the average of θ̂₁(X) and θ̂₂(X). Define the simple loss function as ℒ*(θ̂, θ) = I⁺(θ̂ − θ). Then from (4.34), we have
Theorem 4.6.4 (Median Unbiased Estimation) Let θ̂₂(X) be a median unbiased estimator of the real parameter θ. Suppose that θ̂₁(X) ≥ θ̂₂(X) ≥ θ̂₃(X) for all X ∈ ℝⁿ. Then P(θ̂₁, θ̂₂ | θ) ≤ .50 and P(θ̂₂, θ̂₃ | θ) ≥ .50.

Proof: By the assumed ordering, it follows that θ̂₁(X) + θ̂₂(X) ≥ 2θ̂₂(X). From inequality (4.34), we conclude that
Once again, from the assumed ordering, 2θ̂₂(X) ≥ θ̂₂(X) + θ̂₃(X), and by inequality (4.34), we observe that
We conclude that the median unbiased estimator (MUE) is optimal in the sense of Pitman's measure of closeness over a class of ordered estimators.

Example 4.6.5 (Estimation of the Characteristic Life) In Example 4.5.1, we discussed the estimation of the characteristic life θ in an exponential failure model, and we considered Type II censored data from this distribution. The maximum likelihood estimator θ̂_L satisfied the criteria of unbiasedness, maximizing the likelihood function, minimum variance, consistency, and efficiency, and was uniformly preferred in terms of Pitman's measure over the minimum absolute risk estimator in Q. In the following discussion, we show that the MLE is inadmissible relative to the median unbiased estimator in the sense of Pitman's measure of closeness. Recall from the remarks following (4.22) that the estimators in Q are functionally ordered for a specified value of r. Then from the previous optimality theorem for the MUE, we know that all other estimators in Q will be inadmissible relative to the MUE in Q, provided it exists. Since Pr(2T/θ ≤ m₂ᵣ) = .50, it follows that θ̂_U = 2T/m₂ᵣ. The median unbiased estimator is as likely to overestimate θ as it is to underestimate θ. From another aspect, the median of the statistic θ̂_U is the true but unknown value of θ. Brown, Cohen, and Strawderman (1976) also established that, among the median unbiased estimators, the one that is a function of the sufficient statistic will have the smallest absolute risk.
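The preference of the median unbiased estimator over the MLE is easy to check numerically. The sketch below simplifies the Type II censored setting to a complete exponential sample of size r, takes T to be the total time on test with the MLE equal to T/r, and uses the chi-square median m₂ᵣ; these are assumptions made for illustration rather than details quoted from Example 4.5.1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
theta, r, reps = 2.0, 5, 200_000

# Complete exponential samples of size r; T = total time on test, so 2T/theta ~ chi2(2r).
t = rng.exponential(theta, size=(reps, r)).sum(axis=1)

m2r = stats.chi2.median(2 * r)   # median of a chi-square with 2r degrees of freedom
mle = t / r                      # assumed MLE of the characteristic life
mue = 2.0 * t / m2r              # median unbiased estimator: Pr(mue <= theta) = 1/2

print(np.mean(np.abs(mue - theta) < np.abs(mle - theta)))   # exceeds .50: the MUE is Pitman-closer
```

Because m₂ᵣ is smaller than 2r, the two estimators are functionally ordered (mue > mle for every sample), so Theorem 4.6.4 applies directly.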
Remark 4.6.6 Suppose that the sample size of a random sample, chosen from a population with continuous distribution function F_θ(x), is odd. Then the sample median X̃ is a median unbiased estimator of the population median θ.
Let F_m(x) denote the distribution function of the sample median, which is given by X_{m:2m−1}. Obviously, we are assuming a sample size of 2m − 1. Then it follows that
The simplification results from this identity for each j = 1, ..., m. So the conclusion follows that the sample median is a median unbiased estimator of the population median. For even sample sizes, if we define X̃ = (X_{m:2m} + X_{m+1:2m})/2 and use the fact that F_θ is symmetric about θ, then X̃ is median unbiased for θ.

Remark 4.6.7 Reconsider Example 2.3.2, but with a random sample of size 3. Let X₁, X₂, X₃ be three independent observations from the Cauchy distribution whose density function is
where θ is the location parameter. It is quite natural to take the sample mean X̄ as an estimator of θ. Note that X̄ has precisely the same Cauchy density, so that E(X̄) and Var(X̄) diverge. As an alternative, we consider the sample median X̃ = X_{2:3}, which has a symmetric distribution around θ, and moreover E(X̃) = θ, so that X̃ is both median unbiased and unbiased (while X̄ is only median unbiased). However, since Var(X̃) does not exist, we cannot use MSE in comparing X̄ and X̃. Since E(X̄) does not exist either, the MMAD criterion would present the same dilemma as in the calibration problem, Example 2.3.3. However, there is no problem in incorporating the PMC as a valid means for comparing the performance of X̄ and X̃. In §1.1.2, we made the parenthetical note that fractional moments could be used in the method of moments to obtain estimates of the scale parameter in the Cauchy distribution. Along these lines, one may also consider a loss
ℒ_q = |θ̂ − θ|^q as given in Theorem 2.2.2, where 0 < q < 1. Such choices allow for the existence of E[ℒ_q(X̄, θ)] and E[ℒ_q(X̃, θ)]. Two problems arise: (i) for q < 1, this loss function is not convex; and (ii) there is not a natural choice for q (0 < q < 1). What would be a natural interpretation of X̄ or X̃ being better than the other under a nonconvex loss function? While some answers to these queries are presented in §2.3, fortunately, PMC rescues us from such unnecessary controversies and complications by Theorem 2.2.2. Not surprisingly, more than forty years ago, Johnson (1950), being aware of these controversies, suggested the Pitman closeness criterion for this Cauchy population! From the preceding remark it follows that the sample median is also a consistent estimator of the population median. In reference to estimation of the Cauchy location parameter, we see that the sample median is a consistent estimator of θ whereas the sample mean is not, although both are median unbiased estimators of θ.

Example 4.6.8 (Asymptotically Unbiased Estimators) In the normal distribution, for odd sample sizes 2m − 1, the sample mean X̄ and the sample median X̃ = X_{m:2m−1} are median unbiased estimators of the population median μ. Notice that we choose an odd sample size to evade the problems of randomization required with even sample sizes. Hence, these two median unbiased estimators of μ must cross. Note that both are convex linear combinations of the order statistics X_{1:2m−1}, ..., X_{2m−1:2m−1}. For example, Geary (1944) compared the sample mean and the sample median as median unbiased estimators of the population mean μ in the normal distribution. From asymptotic theory, we know that the sample mean X̄ and the sample median X̃ are mean unbiased estimators having a bivariate normal distribution. The respective asymptotic variances (see Geary (1944)) are given by

for all x > 0), so that a symmetric loss function is produced, in accord with the suggestion of Efron (1978), as

Let G(Ω) and G(D) denote the associated groups of transformations defined on Ω and D, the class of all estimators of θ. For any g ∈ G, let ḡ and g̃ be the corresponding transformations defined on Ω and D, respectively.
This transformation arises naturally in the estimation of location parameters. This translation induces the associated transformations The class D of estimators that satisfy gc() in the preceding equation are known as location invariant estimators. For the symmetric loss function £ note that
This example illustrates what is meant by a loss function being invariant under a group, G(En), of transformations. In a more mathematical vein, a loss function is invariant under a group of transformations provided
and In this context an estimator 0 is said to be invariant under the group of transformations G, provided 0[#(x)] =