ROBUST STATISTICAL METHODS with R
Jana Jurečková, Jan Picek
Published in 2006 by Chapman & Hall/CRC Taylor & Francis Group 6000 Broken Sound Parkway NW, Suite 300 Boca Raton, FL 33487-2742 © 2006 by Taylor & Francis Group, LLC Chapman & Hall/CRC is an imprint of Taylor & Francis Group No claim to original U.S. Government works Printed in the United States of America on acid-free paper 10 9 8 7 6 5 4 3 2 1 International Standard Book Number-10: 1-58488-454-1 (Hardcover) International Standard Book Number-13: 978-1-58488-454-5 (Hardcover) Library of Congress Card Number 2005053192 This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use. No part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers. For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC) 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged. Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data
Jureckova, Jana, 1940-
Robust statistical methods with R / Jana Jureckova, Jan Picek.
p. cm.
Includes bibliographical references and indexes.
ISBN-13: 978-1-58488-454-5 (acid-free paper)
ISBN-10: 1-58488-454-1 (acid-free paper)
1. Robust statistics. 2. R (Computer program language)--Statistical methods. I. Picek, Jan, 1965- II. Title.
QA276.J868 2006
519.5--dc22 2005053192
Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com
Taylor & Francis Group is the Academic Division of Informa plc.
Contents
Preface
Authors
Introduction
1 Mathematical tools of robustness
1.1 Statistical model
1.2 Illustration on statistical estimation
1.3 Statistical functional
1.4 Fisher consistency
1.5 Some distances of probability measures
1.6 Relations between distances
1.7 Differentiable statistical functionals
1.8 Gâteau derivative
1.9 Fréchet derivative
1.10 Hadamard (compact) derivative
1.11 Large sample distribution of empirical functional
1.12 Computation and software notes
1.13 Problems and complements
2 Basic characteristics of robustness
2.1 Influence function
2.2 Discretized form of influence function
2.3 Qualitative robustness
2.4 Quantitative characteristics of robustness based on influence function
2.5 Maximum bias
2.6 Breakdown point
2.7 Tail-behavior measure of a statistical estimator
2.8 Variance of asymptotic normal distribution
2.9 Problems and complements
3 Robust estimators of real parameter
3.1 Introduction
3.2 M-estimators
3.3 M-estimator of location parameter
3.4 Finite sample minimax property of M-estimator
3.5 Moment convergence of M-estimators
3.6 Studentized M-estimators
3.7 L-estimators
3.8 Moment convergence of L-estimators
3.9 Sequential M- and L-estimators
3.10 R-estimators
3.11 Numerical illustration
3.12 Computation and software notes
3.13 Problems and complements
4 Robust estimators in linear model
4.1 Introduction
4.2 Least squares method
4.3 M-estimators
4.4 GM-estimators
4.5 S-estimators and MM-estimators
4.6 L-estimators, regression quantiles
4.7 Regression rank scores
4.8 Robust scale statistics
4.9 Estimators with high breakdown points
4.10 One-step versions of estimators
4.11 Numerical illustrations
4.12 Computation and software notes
4.13 Problems and complements
5 Multivariate location model
5.1 Introduction
5.2 Multivariate M-estimators of location and scatter
5.3 High breakdown estimators of multivariate location and scatter
5.4 Admissibility and shrinkage
5.5 Numerical illustrations and software notes
5.6 Problems and complements
6 Some large sample properties of robust procedures
6.1 Introduction
6.2 M-estimators
6.3 L-estimators
6.4 R-estimators
6.5 Interrelationships of M-, L- and R-estimators
6.6 Minimaximally robust estimators
6.7 Problems and complements
7 Some goodness-of-fit tests
7.1 Introduction
7.2 Tests of normality of the Shapiro-Wilk type with nuisance regression and scale parameters
7.3 Goodness-of-fit tests for general distribution with nuisance regression and scale
7.4 Numerical illustration
7.5 Computation and software notes
Appendix A: R system
A.1 Brief R overview
References
Subject index
Author index
Preface
Robust statistical procedures have become a part of the general statistical consciousness. Yet students first learn descriptive statistics and the classical statistical procedures. Only later do students and practicing statisticians hear that the classical procedures should be used with great caution, and that their favorite and simple least squares estimator and other procedures should be replaced with robust or nonparametric alternatives. To be convinced to use robust methods, one needs a reasonable motivation, but everybody needs a motivation of their own: a mathematically oriented person demands a theoretical proof, while a practitioner prefers to see numerical results. Both aspects are important.

The robust statistical procedures became known to the Prague statistical society by the end of the 1960s, thanks to Jaroslav Hájek and his contacts with Peter J. Huber, Peter J. Bickel and other outstanding statisticians. Frank Hampel presented his “Some small sample asymptotics,” also touching on M-estimates, at Hájek’s conference in Prague in 1973, and published his paper in the Proceedings. Thus, we had our own experience with the first skepticism toward the robust methods, but by 1980 we started organizing regular workshops for applied statisticians called ROBUST. The 14th ROBUST will be in January 2006. On this occasion, we express our appreciation to Jaromír Antoch, the main organizer.

The course “Robust Statistical Methods” is now a part of the master study of statistics at Charles University in Prague and is followed by all statistical students. The present book draws on experience obtained during these courses. We supplement the theory with examples and computational procedures in the system R. We chose R as a suitable tool because R seems to be one of the best statistical environments available. It is also free and the R project is an open source project. The code you are using is always available for you to see.
Detailed information about the system R and the R project is available from http://www.r-project.org/. The prepared procedures and dataset, not available from the public resource, can be found on website: http://www.fp.vslib.cz/kap/picek/robust/.
We acknowledge the support of the Czech Republic Grant 201/05/2340, the Research Projects MSM 0021620839 of Charles University in Prague, and MSM 467488501 of Technical University in Liberec. The authors would also like to thank the editors and anonymous referees who contributed considerably to the readability of the text. Our gratitude also belongs to our families for their support and patience.
Jana Jurečková
Jan Picek
Prague and Liberec
Authors
Jana Jurečková is a professor of statistics and probability at Charles University in Prague, Czech Republic. She is a coauthor of Robust Statistical Inference: Asymptotics and Inter-Relations (with P.K. Sen, John Wiley & Sons, 1996) and of Adaptive Regression (with Y. Dodge, Springer, 2000). She is a fellow of the Institute of Mathematical Statistics and an elected member of the International Statistical Institute. She was an associate editor of the Annals of Statistics for six years, and previously for two years in the 1980s, and is now an associate editor of the Journal of the American Statistical Association. Jurečková participates in extensive international cooperation, mainly with statisticians of Belgium, Canada, France, Switzerland and the United States.

Jan Picek is an associate professor of applied mathematics at Technical University in Liberec, Czech Republic.
Introduction
If we analyze data with the aid of classical statistical procedures based on parametric models, we usually tacitly assume that the regression is linear, that the observations are independent and homoscedastic, and that the errors are normally distributed. However, now that we can simulate data from any probability distribution and from various models on high-speed computers and follow the graphics, which was not possible before, we observe that these assumptions are often violated. Then we are mainly interested in the following two questions:

a) When should we still use the classical statistical procedures, and when are they still optimal?
b) Are there other statistical procedures that are not so closely connected with special models and conditions?

The classical procedures are typically parametric: the model is fully specified up to the values of several scalar or vector parameters. These parameters typically correspond to the probability distributions of the random errors of the model. If we succeed in estimating these parameters or in testing a hypothesis on their domain, we can use our data and reach a definite conclusion. However, this conclusion is correct only under the validity of our model.

An opposite approach is to use nonparametric procedures. These procedures are independent of, or only weakly dependent on, the special shape of the underlying probability distribution, and they behave reasonably well (though not exactly optimally) for a broad class of distribution functions, e.g., that of distribution functions with densities, possibly symmetric. The discrete probability distributions do not create a serious problem, because their forms usually follow the character of the experiment.
A typical representative of nonparametric statistical procedures is the class of the rank tests of statistical hypotheses: the “null distributions” of the test criterion (i.e., the distributions under the hypothesis H0 of interest) coincide under all continuous probability distribution functions of the observations. Unlike the parametric models with scalar or vector parameters, the nonparametric models consider the whole density function or the regression function as an unknown parameter of infinite dimension. If this functional parameter is only a nuisance, i.e., if our conclusions concern other entities of the model,
then we try to avoid its estimation, if possible. The statistical procedures that consider the unknown density, regression function or influence function as a nuisance parameter, while the inference concerns something else, are known as semiparametric procedures; they were developed mainly during the past 30 years. On the other hand, if the unknown density, regression function, etc. are themselves our main interest, then we try to find their best possible estimates or tests of hypotheses on their shapes (goodness-of-fit tests).

Unlike the nonparametric procedures, the robust statistical procedures do not necessarily try to behave well for a broad class of models, but they are optimal in some way in a neighborhood of some probability distribution, e.g., normal. Starting in the 1940s, statisticians gradually observed that even small deviations from the normal distribution can be harmful and can strongly impact the quality of the popular least squares estimator, the classical F-test and other classical methods. Hence, the robust statistical procedures were developed as modifications of the classical procedures that do not fail under small deviations from the assumed conditions. They are optimal, in a specified sense, in a special neighborhood of a fixed probability distribution, defined with respect to a specified distance of probability measures. As such, the robust procedures are more efficient than the nonparametric ones, which pay for their universality by some loss of efficiency.

When we speak of robust statistical procedures, we usually have estimation procedures in mind. There are also robust tests of statistical hypotheses, namely tests of the Wald type, based on the robust estimators; but whenever possible we recommend using the rank tests instead, mainly for their simplicity and high efficiency.
In the list of various statistical procedures, we cannot omit the adaptive procedures that tend to the optimal parametric estimator or test with an increasing number of observations, either in probability or almost surely. Hence, these procedures adapt themselves to the pertaining parametric model with an increasing number of observations, which would be highly desirable. However, this convergence is typically very slow and the optimality is attained only under an unrealistic number of observations. Partially adaptive procedures also exist that successively tend to the best of a prescribed finite set of possible decisions; the choice of the prescribed finite set naturally determines the success of our inference.

The adaptive, nonparametric, robust and semiparametric methods developed successively, mainly since the 1940s, and they continue to develop, as do the robust procedures in multivariate statistical models. As such, they are not disjoint from each other, there are no sharp boundaries between these classes, and some concepts and aspects appear in all of them. This book, too, though oriented mainly to robust statistical procedures, often touches on other statistical methods. Our ultimate goal is to show and demonstrate which alternative procedures to apply when we are not sure of our model.
Mathematically, we consider the robust procedures as statistical functionals, defined on the space of distribution functions, and we are interested in their behavior in a neighborhood of a specific distribution or model. This neighborhood is defined with respect to a specified distance; hence, we should first consider possible distances on the space of distribution functions and the pertaining basic characteristics of statistical functionals, such as their continuity and derivatives. This is the theoretical background of the robust statistical procedures. However, robust procedures were developed as an alternative to the practical statistical procedures, and they should be applied to practical problems. Keeping this in mind, we also must pay great attention to the computational aspects, and refer to the computational programs that are available or provide our own. As such, we hope that the readers will use, with understanding, robust procedures to solve their problems.
CHAPTER 1
Mathematical tools of robustness
1.1 Statistical model

A random experiment leads to some observed values; denote them $X_1, \ldots, X_n$. To make a formal analysis of the experiment and its results, we include everything in the frame of a statistical model. The classical statistical model assumes that the vector $X = (X_1, \ldots, X_n)$ can attain values in a sample space $\mathcal{X}$ (or $\mathcal{X}_n$) and that the subsets of $\mathcal{X}$ are the random events of interest. If $\mathcal{X}$ is finite, then there is no problem working with the family of all its subsets. However, some spaces $\mathcal{X}$ can be too rich, as, e.g., the $n$-dimensional Euclidean space; then we do not consider all its subsets, but restrict our considerations only to some properly selected subsets/events. In order to describe the experiments and the events mathematically, we consider a family of events that forms a σ-field, i.e., that is closed with respect to countable unions and complements of its elements. Let us denote it as $\mathcal{B}$ (or $\mathcal{B}_n$). The probabilistic behavior of the random vector $X$ is described by the probability distribution $P$, which is a set function defined on $\mathcal{B}$.

The classical statistical model is a family $\mathcal{P} = \{P_\theta, \theta \in \Theta\}$ of probability distributions, to which our specific distribution $P$ also belongs. While we can observe $X_1, \ldots, X_n$, the parameter $\theta$ is unobservable. It is a real number or a vector and can take on any value in the parametric space $\Theta \subseteq \mathbb{R}^p$, where $p$ is a positive integer. The triple $\{\mathcal{X}, \mathcal{B}, P_\theta : \theta \in \Theta\}$ is the parametric statistical model. The components $X_1, \ldots, X_n$ are often independent copies of a random variable $X$, but they can also form a segment of a time series. The model is multivariate when the components $X_i$, $i = 1, \ldots, n$, are themselves vectors in $\mathbb{R}^k$ with some positive integer $k$. The type of experiment often fully determines the character of the parametric model.
We easily recognize special discrete probability distributions, such as the Bernoulli (alternative), binomial, multinomial, Poisson and hypergeometric distributions.

Example 1.1 The binomial random variable is the number of successful trials among $n$ independent Bernoulli trials: the $i$-th Bernoulli trial results either in a success with probability $p$, and then we put $X_i = 1$, or in a failure with probability $1 - p$, and then we put $X_i = 0$. In the case of $n$ trials, the binomial random variable $X$ is equal to $X = \sum_{i=1}^n X_i$ and can take all integer values $0, 1, \ldots, n$; specifically,

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k = 0, \ldots, n, \quad 0 \le p \le 1$$

Example 1.2 Let us have $n$ independent trials, each of them leading to exactly one of $k$ different outcomes, to the $i$-th one with probability $p_i$, $i = 1, \ldots, k$, $\sum_{i=1}^k p_i = 1$. Then the $i$-th component $X_i$ of the multinomial random vector $X$ is the number of trials leading to the outcome $i$, and

$$P(X_1 = x_1, \ldots, X_k = x_k) = \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}$$

for any vector $(x_1, \ldots, x_k)$ of integers, $0 \le x_i \le n$, $i = 1, \ldots, k$, satisfying $\sum_{i=1}^k x_i = n$.

Example 1.3 The Poisson random variable $X$ is, e.g., the number of clients arriving in the system in a unit interval, the number of electrons emitted from the cathode in a unit interval, etc. Then $X$ can take on all nonnegative integers, and if $\lambda$ is the intensity of arrivals, emission, etc., then

$$P(X = k) = e^{-\lambda} \frac{\lambda^k}{k!}, \quad k = 0, 1, \ldots$$

Example 1.4 The hypergeometric random variable $X$ is, e.g., the number of defective items in a sample of size $n$ taken from a finite set of products. If there are $M$ defectives in the set of $N$ products, then

$$P(X = k) = \frac{\binom{M}{k} \binom{N-M}{n-k}}{\binom{N}{n}}$$

for all integers $k$ satisfying $0 \le k \le M$ and $0 \le n - k \le N - M$; this probability is equal to 0 otherwise.

Among the continuous probability distributions, characterized by their densities, we most easily identify the asymmetric distributions concentrated on a halfline. For instance, the waiting time or the duration of a service can be characterized by the gamma distribution.

Example 1.5 The gamma random variable $X$ has density function (see Figure 1.1)

$$f(x) = \begin{cases} \dfrac{a^b x^{b-1} e^{-ax}}{\Gamma(b)} & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}$$

where $a$ and $b$ are positive constants. The special case $b = 1$ is the exponential distribution.
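The closed-form probabilities of Examples 1.1 to 1.5 can be checked against the distribution functions that ship with base R. The following is only a minimal sketch: all numerical values of the parameters (`n`, `p`, `lambda`, `N`, `M`, `m`, and so on) are hypothetical choices made for illustration, and the sample size of Example 1.4 is called `m` here to avoid a clash with the binomial `n`.

```r
# Example 1.1: binomial probabilities, written out and via dbinom()
n <- 10; p <- 0.3; k <- 0:n
binom_manual <- choose(n, k) * p^k * (1 - p)^(n - k)
binom_ok <- isTRUE(all.equal(binom_manual, dbinom(k, size = n, prob = p)))

# Example 1.2: multinomial probability of one outcome vector
x <- c(2, 3, 5); prob <- c(0.2, 0.3, 0.5)
multi_manual <- factorial(sum(x)) / prod(factorial(x)) * prod(prob^x)
multi_ok <- isTRUE(all.equal(multi_manual, dmultinom(x, prob = prob)))

# Example 1.3: Poisson probabilities with intensity lambda
lambda <- 2.5; j <- 0:20
pois_manual <- exp(-lambda) * lambda^j / factorial(j)
pois_ok <- isTRUE(all.equal(pois_manual, dpois(j, lambda)))

# Example 1.4: M defectives among N products, sample of size m
N <- 50; M <- 8; m <- 10; h <- 0:min(M, m)
hyper_manual <- choose(M, h) * choose(N - M, m - h) / choose(N, m)
hyper_ok <- isTRUE(all.equal(hyper_manual, dhyper(h, M, N - M, m)))

# Example 1.5: gamma density with b = 3, a = 1 (the setting of Figure 1.1)
a <- 1; b <- 3; xs <- seq(0.1, 8, by = 0.1)
gamma_manual <- a^b * xs^(b - 1) * exp(-a * xs) / gamma(b)
gamma_ok <- isTRUE(all.equal(gamma_manual, dgamma(xs, shape = b, rate = a)))

c(binom_ok, multi_ok, pois_ok, hyper_ok, gamma_ok)
```

Note that `dgamma()` is called with `shape = b` and `rate = a`, which matches the parametrization of the density in Example 1.5.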
Figure 1.1 The density function of gamma distribution with b = 3 and a = 1.
In technical practice we can find many other similar examples. However, the symmetric distributions with continuous probability densities are hard to distinguish from each other by simply looking at the data. The same problem arises with asymmetric distributions extended over the whole real line. We should either test a hypothesis on their shape or, lacking knowledge of the distribution shape, use a robust or nonparametric method of inference.

Most of the statistical procedures elaborated in the past were derived under the normality assumption, i.e., under the condition that the observed data come from a population with the Gaussian/normal distribution. People believed that every symmetric probability distribution described by a density is approximately normal. The procedures based on the normality assumption usually have a simple algebraic structure, thus one is tempted to use them in all situations in which the data can take on symmetrically all real values, forgetting the original normality assumption. For instance, the most popular least squares estimator (LSE) of regression or other parameters, though seemingly universal, is closely connected with the normal distribution of the measurement errors. That itself would not matter, but the LSE fails when even a small fraction of data comes from another population whose distribution has heavier tails than the normal one, or when the dataset is contaminated by some outliers.

At present, these facts can be easily demonstrated numerically with simulated data, while this was not possible before the era of high-speed computers. But these facts are not only verified with computers; the close connection of least squares and the normal distribution is also supported by strong theoretical arguments, based on the characterizations of the normal distribution by means of properties of estimators and other procedures. For instance, Kagan, Linnik, and Rao (1973) proved that the least squares estimator (LSE) of the regression parameter in the linear regression model with a continuous distribution of the errors is admissible with respect to the quadratic risk (i.e., there is no other estimator with uniformly better quadratic risk), if and only if the distribution of the measurement errors is normal.

The Student t-test, the Snedecor F-test and the test of the linear hypothesis were derived under the normality assumption. While the t-test is relatively robust to deviations from normality, the F-test is very sensitive in this sense and should be replaced with a rank test, unless the normal distribution is taken for granted.

If we are not sure of the parametric form of the model, we can use either of the following alternative procedures:

a) Nonparametric approach: We give up a parametrization of $P_\theta$ by a real or vector parameter $\theta$, and replace the family $\{P_\theta : \theta \in \Theta\}$ with a broader family of probability distributions.
b) Robust approach: We introduce an appropriate measure of distance of statistical procedures made on the sample space $\mathcal{X}$, and study the stability of the classical procedures, optimal for the model $P_\theta$, under small deviations from this model. At the same time, we try to modify the classical procedures slightly (i.e., to find robust procedures) to reduce their sensitivity.
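The sensitivity of the least squares estimator to a small fraction of outliers can be seen in the simplest case, the location model, where the LSE is the sample mean. The sketch below is an illustration under stated assumptions, not a procedure taken from the text: the "sample" consists of idealized normal quantiles so that the output is reproducible without a random seed, and the contaminating value 50 and the 5% fraction are hypothetical.

```r
# Idealized N(0, 1) sample: the quantiles qnorm((i - 0.5) / n), so the
# result is reproducible without a random seed (an illustrative choice).
n <- 100
x <- qnorm(ppoints(n))

# Replace 5% of the observations by a gross outlier (hypothetical value 50).
x_out <- x
x_out[1:5] <- 50

shift_mean <- mean(x_out) - mean(x)        # LSE of location moves by about 2.6
shift_median <- median(x_out) - median(x)  # median moves by about 0.13
c(mean = shift_mean, median = shift_median)
```

The mean is dragged toward the outliers by an amount proportional to their size, while the median only moves by a few order statistics.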
1.2 Illustration on statistical estimation

Let $X_1, \ldots, X_n$ be independent observations, identically distributed with some probability distribution $P_\theta$, where $\theta$ is an unobservable parameter, $\theta \in \Theta \subseteq \mathbb{R}^p$. Let $F(x, \theta)$ be the distribution function of $P_\theta$. Our problem is to estimate the unobservable parameter $\theta$. We have several possibilities, for instance

(1) maximal likelihood method

(2) moment method

(3) method of $\chi^2$-minimum, or another method minimizing another distance of the empirical and the true distributions

(4) method based on the sufficient statistics (Rao-Blackwell Theorem) and on the complete sufficient statistics (Lehmann-Scheffé Theorem)

In the context of sufficient statistics, remember the very useful fact in nonparametric models that the ordered sample (the vector of order statistics) $X_{n:1} \le X_{n:2} \le \ldots \le X_{n:n}$ is a complete sufficient statistic for the family of probability distributions with densities $\prod_{i=1}^n f(x_i)$, where $f$ is an arbitrary continuous density. This corresponds to the model in which the observations form an independent random sample from an arbitrary continuous distribution. If $\theta$ is a one-dimensional parameter, thus a real number, we are intuitively led to the class of L-estimators of the type

$$T_n = \sum_{i=1}^n c_{ni}\, h(X_{n:i})$$

based on order statistics, with suitable coefficients $c_{ni}$, $i = 1, \ldots, n$, and a suitable function $h(\cdot)$.

(5) Minimization of some (criterion) function of observations and of $\theta$: e.g., the minimization

$$\sum_{i=1}^n \rho(X_i, \theta) := \min, \quad \theta \in \Theta$$

with a suitable non-constant function $\rho(\cdot, \cdot)$. As an example we can consider $\rho(x, \theta) = -\log f(x, \theta)$, leading to the maximal likelihood estimator $\hat{\theta}_n$. The estimators of this type are called M-estimators, or estimators of the maximum likelihood type.

(6) An inversion of the rank tests of the shift in location, of the significance of regression, etc. leads to the class of R-estimators, based on the ranks of the observations or of their residuals.

It is the M-, L- and R-estimators, together with some other robust methods, that form the main subject of this book.
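The L- and M-estimators of items (4) and (5) can be sketched in a few lines of R. This is only an illustration under assumptions that are not in the text: the data are hypothetical, the L-estimator is the trimmed mean (with $h(x) = x$), and the M-estimator is restricted to the location form $\rho(x, \theta) = \rho(x - \theta)$ with either the quadratic function or the Huber function with a hypothetical tuning constant `c_h = 1` and no scale estimate. The helper names `m_est`, `rho_ls` and `rho_huber` are introduced here for illustration.

```r
x <- c(2.1, 2.4, 2.5, 2.7, 2.9, 3.0, 3.2, 9.8)  # hypothetical data, one outlier
n <- length(x)

# L-estimator T_n = sum_i c_ni * h(X_{n:i}) with h(x) = x:
# weights of the trimmed mean dropping k_trim order statistics on each side
k_trim <- 2
c_ni <- c(rep(0, k_trim), rep(1 / (n - 2 * k_trim), n - 2 * k_trim), rep(0, k_trim))
T_L <- sum(c_ni * sort(x))

# M-estimator of location: minimize sum_i rho(X_i - theta) over theta
m_est <- function(x, rho) {
  optimize(function(theta) sum(rho(x - theta)),
           interval = range(x), tol = 1e-10)$minimum
}
rho_ls <- function(u) u^2                       # least squares: gives the mean
rho_huber <- function(u, c_h = 1) {             # Huber function, constant c_h
  ifelse(abs(u) <= c_h, u^2 / 2, c_h * abs(u) - c_h^2 / 2)
}
T_ls <- m_est(x, rho_ls)
T_huber <- m_est(x, rho_huber)

c(trimmed = T_L, least_squares = T_ls, huber = T_huber)
```

With these data the quadratic criterion reproduces the sample mean, which the outlier 9.8 pulls upward, while the trimmed mean and the Huber estimate stay close to the bulk of the observations.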
1.3 Statistical functional

Consider a random variable $X$ with probability distribution $P_\theta$ with distribution function $F$, where $P_\theta \in \mathcal{P} = \{P_\theta : \theta \in \Theta \subseteq \mathbb{R}^p\}$. Then in many cases $\theta$ can be looked at as a functional $\theta = T(P)$ defined on $\mathcal{P}$; we can also write $\theta = T(F)$. Intuitively, a natural estimator of $\theta$, based on observations $X_1, \ldots, X_n$, is $T(P_n)$, where $P_n$ is the empirical probability distribution of the vector $(X_1, \ldots, X_n)$, i.e.,

$$P_n(A) = \frac{1}{n} \sum_{i=1}^n I[X_i \in A], \quad A \in \mathcal{B} \tag{1.1}$$

In other words, $P_n$ is the uniform distribution on the set $\{X_1, \ldots, X_n\}$, because $P_n(\{X_i\}) = \frac{1}{n}$, $i = 1, \ldots, n$. The distribution function pertaining to $P_n$ is the empirical distribution function

$$F_n(x) = P_n((-\infty, x]) = \frac{1}{n} \sum_{i=1}^n I[X_i \le x], \quad x \in \mathbb{R} \tag{1.2}$$
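Formulas (1.1) and (1.2) translate directly into R, where averaging a logical vector computes the proportion of indicators equal to one. The sample below is hypothetical, and the names `Fn_manual` and `Pn_interval` are introduced only for this sketch; `ecdf()` is the base R implementation of the empirical distribution function.

```r
x <- c(3.1, 1.2, 4.8, 1.2, 2.5)   # hypothetical sample

# (1.2): F_n(t) = (1/n) * sum_i I[X_i <= t]
Fn_manual <- function(t) mean(x <= t)

# (1.1) for an interval A = (lower, upper]
Pn_interval <- function(lower, upper) mean(x > lower & x <= upper)

Fn <- ecdf(x)                      # base R empirical distribution function
pts <- c(0, 1.2, 2.5, 5)
rbind(manual = sapply(pts, Fn_manual), ecdf = Fn(pts))
```

Both rows agree; the tied value 1.2 simply receives mass 2/5.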
Example 1.6

(1) Expected value:

$$T(P) = \int_{\mathbb{R}} x\,dP = EX, \qquad T(P_n) = \int_{\mathbb{R}} x\,dP_n = \bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$$

(2) Variance:

$$T(P) = \operatorname{var} X = \int_{\mathbb{R}} x^2\,dP - (EX)^2, \qquad T(P_n) = \frac{1}{n} \sum_{i=1}^n X_i^2 - \bar{X}_n^2$$

(3) If $T(P) = \int_{\mathbb{R}} h(x)\,dP$, where $h$ is an arbitrary $P$-integrable function, then an empirical counterpart of $T(P)$ is

$$T(P_n) = \frac{1}{n} \sum_{i=1}^n h(X_i)$$

(4) Conversely, we can find a statistical functional corresponding to a given statistical estimator: for instance, the geometric mean of observations $X_1, \ldots, X_n$ is defined as

$$T(P_n) = G_n = \left( \prod_{i=1}^n X_i \right)^{1/n}, \qquad \log G_n = \frac{1}{n} \sum_{i=1}^n \log X_i = \int_{\mathbb{R}} \log x\,dP_n$$

hence the corresponding statistical functional has the form

$$T(P) = \exp\left\{ \int_{\mathbb{R}} \log x\,dP \right\}$$

Similarly, the harmonic mean $T(P_n) = H_n$ of observations $X_1, \ldots, X_n$ is defined as

$$\frac{1}{H_n} = \frac{1}{n} \sum_{i=1}^n \frac{1}{X_i}$$

and the corresponding statistical functional has the form

$$T(P) = H = \left( \int_{\mathbb{R}} \frac{1}{x}\,dP \right)^{-1}$$

Statistical functionals were first considered by von Mises (1947). The estimator $T(P_n)$ should tend to $T(P)$, as $n \to \infty$, with respect to some type of convergence defined on the space of probability measures. Mostly, it is convergence in probability and in distribution, or almost sure convergence; but an important characteristic also is the large sample bias of the estimator $T(P_n)$, i.e., $\lim_{n \to \infty} |E[T(P_n) - T(P)]|$, which corresponds to convergence in the mean. Because we need to study the behavior of $T(P_n)$ also in a neighborhood of $P$, we consider an expansion of the functional $(T(P_n) - T(P))$ of the Taylor type. To do it, we need some concepts of functional analysis, such as various distances between $P_n$ and $P$ and their relations, and the continuity and differentiability of the functional $T$ with respect to the considered distance.
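The plug-in estimators $T(P_n)$ of Example 1.6 are all of the form "average a function of the observations, then transform", which is easy to mimic in R. The sketch below uses a hypothetical positive sample and a helper `plug_in()` that is introduced here for illustration only.

```r
x <- c(1, 2, 4, 8)   # hypothetical positive observations

# Item (3): T(P_n) = (1/n) * sum_i h(X_i) for an arbitrary function h
plug_in <- function(x, h) mean(h(x))

T_mean <- plug_in(x, identity)                    # item (1)
T_var  <- plug_in(x, function(u) u^2) - T_mean^2  # item (2)
T_geo  <- exp(plug_in(x, log))                    # geometric mean G_n
T_har  <- 1 / plug_in(x, function(u) 1 / u)       # harmonic mean H_n

c(mean = T_mean, var = T_var, geometric = T_geo, harmonic = T_har)
```

Note that `T_var` divides by $n$, so it differs from R's `var(x)`, which divides by $n - 1$.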
1.4 Fisher consistency

A reasonable statistical estimator should have the natural property of Fisher consistency, introduced by R. A. Fisher (1922). We say that the estimator $\hat{\theta}_n$, based on observations $X_1, \ldots, X_n$ with probability distribution $P$, is a Fisher consistent estimator of the parameter $\theta$ if, written as a functional $\hat{\theta}_n = T(P_n)$ of the empirical probability distribution of the vector $(X_1, \ldots, X_n)$, $n = 1, 2, \ldots$, it satisfies $T(P) = \theta$. The following example shows that this condition is not always automatically satisfied.

Example 1.7 Let $\theta = \operatorname{var} X = T(P) = \int_{\mathbb{R}} x^2\,dP - \left( \int_{\mathbb{R}} x\,dP \right)^2$ be the variance of $P$. Then the sample variance $\hat{\theta}_n = T(P_n) = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^2$ is Fisher consistent; but it is biased, because $E\hat{\theta}_n = \left( 1 - \frac{1}{n} \right) \theta$. On the other hand, the unbiased estimator of the variance $S_n^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}_n)^2$ is not a Fisher consistent estimator of $\theta$, because

$$S_n^2 = \frac{n}{n-1}\, T(P_n) \quad \text{and} \quad \frac{n}{n-1}\, T(P) \ne T(P)$$

From the robustness point of view, the natural property of Fisher consistency of an estimator is more important than its unbiasedness; hence it should be first checked on a statistical functional.
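The two expectations in Example 1.7 can be verified exactly for a small discrete distribution by enumerating all possible samples. The sketch assumes a hypothetical $P$, the uniform distribution on $\{0, 1, 2\}$, and sample size $n = 3$, so that all $3^3 = 27$ samples are equally likely.

```r
support <- c(0, 1, 2)                          # hypothetical P: uniform on {0, 1, 2}
theta <- mean(support^2) - mean(support)^2     # T(P) = var X = 2/3

n <- 3
samples <- as.matrix(expand.grid(rep(list(support), n)))  # all 27 samples

# Plug-in variance T(P_n) and unbiased variance S_n^2 for every sample
T_Pn <- apply(samples, 1, function(s) mean(s^2) - mean(s)^2)
S2   <- apply(samples, 1, var)

E_T_Pn <- mean(T_Pn)   # equals (1 - 1/n) * theta: biased, but Fisher consistent
E_S2   <- mean(S2)     # equals theta: unbiased, but n/(n-1) * T(P) differs from theta
c(theta = theta, E_T_Pn = E_T_Pn, E_S2 = E_S2)
```

The enumeration confirms the bias factor $1 - 1/n$ of the plug-in variance, while Fisher consistency itself is a statement about the functional evaluated at $P$, namely $T(P) = \theta$.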
1.5 Some distances of probability measures

Let $\mathcal{X}$ be a metric space with metric $d$, separable and complete, and denote by $\mathcal{B}$ the σ-field of its Borel subsets. Furthermore, let $\mathcal{P}$ be the system of all probability measures on the space $(\mathcal{X}, \mathcal{B})$. Then $\mathcal{P}$ is a convex set, and on $\mathcal{P}$ we can introduce various distances of its two elements $P, Q \in \mathcal{P}$. Let us briefly describe some such distances, mostly used in mathematical statistics. For those who want to learn more about these and other distances and related topics, we refer to the literature on functional analysis and probability theory, e.g., Billingsley (1998) or Fabian et al. (2001).

(1) The Prochorov distance:

$$d_P(P, Q) = \inf\{\varepsilon > 0 : P(A) \le Q(A^\varepsilon) + \varepsilon \ \ \forall A \in \mathcal{B},\ A \ne \emptyset\}$$

where $A^\varepsilon = \{x \in \mathcal{X} : \inf_{y \in A} d(x, y) \le \varepsilon\}$ is a closed $\varepsilon$-neighborhood of a non-empty set $A$.

(2) The Lévy distance: $\mathcal{X} = \mathbb{R}$ is the real line; let $F, G$ be the distribution functions of probability measures $P, Q$, then

$$d_L(F, G) = \inf\{\varepsilon > 0 : F(x - \varepsilon) - \varepsilon \le G(x) \le F(x + \varepsilon) + \varepsilon \ \ \forall x \in \mathbb{R}\}$$

(3) The total variation:

$$d_V(P, Q) = \sup_{A \in \mathcal{B}} |P(A) - Q(A)|$$

We easily verify that

$$2\, d_V(P, Q) = \int_{\mathcal{X}} |dP - dQ|$$

(4) The Kolmogorov distance: $\mathcal{X} = \mathbb{R}$ is the real line and $F, G$ are the distribution functions of probability measures $P, Q$, then

$$d_K(F, G) = \sup_{x \in \mathbb{R}} |F(x) - G(x)|$$

(5) The Hellinger distance:

$$d_H(P, Q) = \left\{ \int_{\mathcal{X}} \left( \sqrt{dP} - \sqrt{dQ} \right)^2 \right\}^{1/2}$$

If $f = \frac{dP}{d\mu}$ and $g = \frac{dQ}{d\mu}$ are densities of $P, Q$ with respect to some measure $\mu$, then the Hellinger distance can be rewritten in the form

$$(d_H(P, Q))^2 = \int_{\mathcal{X}} \left( \sqrt{f} - \sqrt{g} \right)^2 d\mu = 2 \left( 1 - \int_{\mathcal{X}} \sqrt{fg}\,d\mu \right)$$

(6) The Lipschitz distance: Assume that $d(x, y) \le 1$ $\forall x, y \in \mathcal{X}$ (we take the metric $d' = \frac{d}{1+d}$ otherwise), then

$$d_{Li}(P, Q) = \sup_{\psi \in L} \left| \int_{\mathcal{X}} \psi\,dP - \int_{\mathcal{X}} \psi\,dQ \right|$$

where $L = \{\psi : \mathcal{X} \to \mathbb{R} : |\psi(x) - \psi(y)| \le d(x, y)\}$ is the set of the Lipschitz functions.

(7) The Kullback-Leibler divergence: Let $p, q$ be the densities of probability distributions $P, Q$ with respect to a measure $\mu$ (the Lebesgue measure on the real line or the counting measure), then

$$d_{KL}(Q, P) = \int q(x) \ln \frac{q(x)}{p(x)}\,d\mu(x)$$

The Kullback-Leibler divergence is not a metric, because it is not symmetric in $P, Q$ and does not satisfy the triangle inequality.

More on distances of probability measures can be found in Gibbs and Su (2002), Liese and Vajda (1987), Rachev (1991), Reiss (1989) and Zolotarev (1983), among others.
1.6 Relations between distances

The family $\mathcal{P}$ of all probability measures on $(\mathcal{X}, \mathcal{B})$ is a metric space with respect to each of the metrics described above. On this metric space we can study the continuity and other properties of the statistical functional $T(P)$. Because we are interested in the behavior of the functional not only at the distribution $P$ but also in its neighborhood, we come to the question of which distance is more sensitive to small deviations of $P$. The following inequalities show which distances dominate the others and illustrate the relations between them. We leave their verification as an exercise:
$$\begin{aligned}
d_H^2(P, Q) &\le 2 d_V(P, Q) \le 2 d_H(P, Q) \\
d_P^2(P, Q) &\le d_{Li}(P, Q) \le 2 d_P(P, Q) \qquad \forall\, P, Q \in \mathcal{P} \\
\tfrac{1}{2} d_V^2(P, Q) &\le d_{KL}(P, Q)
\end{aligned} \tag{1.3}$$
If $\mathcal{X} = \mathbb{R}$, then it further holds:
$$\begin{aligned}
d_L(P, Q) &\le d_P(P, Q) \le d_V(P, Q) \\
d_L(P, Q) &\le d_K(P, Q) \le d_V(P, Q) \qquad \forall\, P, Q \in \mathcal{P}
\end{aligned} \tag{1.4}$$
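The inequalities (1.3) and (1.4) can be checked numerically on a concrete pair of distributions before attempting a proof. The following is only a sketch for one discrete example, P = Bi(100, 0.01) and Q = Po(1); the variable names are ours, and a numerical check on a single pair is of course no substitute for the verification left as an exercise.

```r
# Discrete example: P = Bi(100, 0.01), Q = Po(1), both supported on k = 0, 1, 2, ...
k <- 0:100
p <- dbinom(k, 100, 0.01)
q <- dpois(k, 1)
tail_q <- ppois(100, 1, lower.tail = FALSE)  # Poisson mass beyond k = 100 (P has none there)

d_V  <- 0.5 * (sum(abs(p - q)) + tail_q)               # total variation, 2 d_V = sum |p - q|
d_H  <- sqrt(sum((sqrt(p) - sqrt(q))^2) + tail_q)      # Hellinger distance
d_K  <- max(abs(pbinom(k, 100, 0.01) - ppois(k, 1)))   # Kolmogorov distance
d_KL <- sum(p * log(p / q))                            # sum of p ln(p/q)

# each comparison below should print TRUE
c(d_H^2 <= 2 * d_V,
  2 * d_V <= 2 * d_H,
  0.5 * d_V^2 <= d_KL,
  d_K <= d_V)
```

The Prokhorov, Lévy and Lipschitz distances are left out of the sketch because they have no equally simple closed form for this pair.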
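Example 1.8 Let $P$ be the exponential distribution with density
$$f(x) = \begin{cases} e^{-x} & x \ge 0 \\ 0 & x < 0 \end{cases}$$
and let $Q$ be the uniform distribution $R(0, 1)$ with density
$$g(x) = \begin{cases} 1 & 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$
Then
$$2 d_V(P, Q) = \int_0^1 \left( 1 - e^{-x} \right) dx + \int_1^\infty e^{-x}\, dx = 1 + \frac{1}{e} - 1 + \frac{1}{e} = \frac{2}{e}$$
hence $d_V(\exp, R(0, 1)) \approx 0.3679$. Furthermore,
$$d_K(P, Q) = \sup_{x \ge 0} \left| 1 - e^{-x} - x I[0 \le x \le 1] - I[x > 1] \right| = e^{-1} \approx 0.3679$$
where the supremum is attained at $x = 1$, and
$$d_H^2(\exp, R(0, 1)) = 2 \left( 1 - \int_0^1 \sqrt{e^{-x}}\, dx \right) = 2 \left( \frac{2}{\sqrt{e}} - 1 \right)$$
thus $d_H(\exp, R(0, 1)) \approx 0.6528$. Finally
$$d_{KL}(R(0, 1), \exp) = \int_0^1 \ln \frac{1}{e^{-x}}\, dx = \frac{1}{2}$$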
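The four values in Example 1.8 can be reproduced numerically. The sketch below uses only base R (`integrate`, `pexp`, `punif`); the function and variable names are ours, and the Kolmogorov distance is approximated on a grid rather than computed exactly.

```r
f <- function(x) exp(-x)                   # density of P (exponential) for x >= 0
g <- function(x) as.numeric(x <= 1)        # density of Q = R(0, 1), used on [0, 1] only

# total variation: half of the L1 distance between the densities
d_V <- 0.5 * (integrate(function(x) abs(f(x) - g(x)), 0, 1)$value +
              integrate(f, 1, Inf)$value)

# Kolmogorov: sup |F - G| on a fine grid of [0, 1]; beyond x = 1 the gap e^{-x} only shrinks
x   <- seq(0, 1, by = 1e-4)
d_K <- max(abs(pexp(x) - punif(x)))

# Hellinger: d_H^2 = 2 * (1 - integral of sqrt(f * g)), and g = 1 on [0, 1]
d_H <- sqrt(2 * (1 - integrate(function(x) sqrt(f(x)), 0, 1)$value))

# Kullback-Leibler d_KL(R(0,1), exp): integral over [0, 1] of g * ln(g / f)
d_KL <- integrate(function(x) log(1 / f(x)), 0, 1)$value

round(c(d_V = d_V, d_K = d_K, d_H = d_H, d_KL = d_KL), 4)
```

On the grid, the largest gap between the two distribution functions occurs at x = 1, where |F(1) - G(1)| = e^{-1}.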
1.7 Differentiable statistical functionals

Let again $\mathcal{P}$ be the family of all probability measures on the space $(\mathcal{X}, \mathcal{B}, \mu)$, and assume that $\mathcal{X}$ is a complete separable metric space with metric $d$ and that $\mathcal{B}$ is the system of the Borel subsets of $\mathcal{X}$. Choose some distance $\delta$ on $\mathcal{P}$ and consider the statistical functional $T(\cdot)$ defined on $\mathcal{P}$. If we want to analyze an expansion of $T(\cdot)$ around $P$, analogous to the Taylor expansion, we must introduce the concept of a derivative of a statistical functional. There are several possible definitions of the derivative, and we shall consider three of them: the Gâteau derivative, the Fréchet derivative and the Hadamard derivative, and compare their properties from the statistical point of view.

Definition 1.1 Let $P, Q \in \mathcal{P}$ and let $t \in [0, 1]$. Then the probability distribution
$$P_t(Q) = (1 - t)P + tQ \tag{1.5}$$
is called the contamination of $P$ by $Q$ in ratio $t$.

Remark 1.1 $P_t(Q)$ is a probability distribution, because $\mathcal{P}$ is convex. $P_0(Q) = P$ means an absence of the contamination, while $P_1(Q) = Q$ means the full contamination.
1.8 Gâteau derivative

Fix two distributions $P, Q \in \mathcal{P}$ and denote $\varphi(t) = T((1 - t)P + tQ)$, $0 \le t \le 1$. Suppose that the function $\varphi(t)$ has a finite $n$-th derivative $\varphi^{(n)}$, that the derivatives $\varphi^{(k)}$ are continuous in the interval $(0, 1)$, and that the right-hand derivatives $\varphi_+^{(k)}$ are right-continuous at $t = 0$, $k = 1, \ldots, n - 1$. Then we can consider the Taylor expansion around $u \in (0, 1)$
$$\varphi(t) = \varphi(u) + \sum_{k=1}^{n-1} \frac{\varphi^{(k)}(u)}{k!} (t - u)^k + \frac{\varphi^{(n)}(v)}{n!} (t - u)^n, \quad v \in [u, t] \tag{1.6}$$
We are mostly interested in the expansion to the right of $u = 0$, which corresponds to a small contamination of $P$. For that we replace the derivatives $\varphi^{(k)}(0)$ with the right-hand derivatives $\varphi_+^{(k)}(0)$. The derivative $\varphi'_+(0)$ is called the Gâteau derivative of the functional $T$ in $P$ in direction $Q$.

Definition 1.2 We say that the functional $T$ is differentiable in the Gâteau sense in $P$ in direction $Q$, if there exists the limit
$$T'_Q(P) = \lim_{t \to 0+} \frac{T(P + t(Q - P)) - T(P)}{t} \tag{1.7}$$
$T'_Q(P)$ is called the Gâteau derivative of $T$ in $P$ in direction $Q$.

Remark 1.2
a) The Gâteau derivative $T'_Q(P)$ of the functional $T$ is equal to the ordinary right derivative of the function $\varphi$ at the point 0, i.e.,
$$T'_Q(P) = \varphi'(0+)$$
b) The Gâteau derivative of order $k$ is defined similarly:
$$T_Q^{(k)}(P) = \left[ \frac{d^k}{dt^k} T(P + t(Q - P)) \right]_{t=0+} = \varphi^{(k)}(0+)$$
c) In the special case when $Q$ is the Dirac probability measure $Q = \delta_x$ assigning probability 1 to the one-point set $\{x\}$, we shall use the simpler notation
$$T'_{\delta_x}(P) = T'_x(P)$$
In the special case $t = 1$, $u = 0$ the Taylor expansion (1.6) reduces to the form
$$T(Q) - T(P) = \sum_{k=1}^{n-1} \frac{T_Q^{(k)}(P)}{k!} + \frac{1}{n!} \left[ \frac{d^n}{dt^n} T(P + t(Q - P)) \right]_{t=t^*} \tag{1.8}$$
where $0 \le t^* \le 1$.
Example 1.9
(a) Expected value:
$$T(P) = \int_{\mathcal{X}} x\, dP = E_P X$$
$$\varphi(t) = \int_{\mathcal{X}} x\, d((1 - t)P + tQ) = (1 - t) E_P X + t E_Q X$$
$$\Longrightarrow \varphi'(t) = E_Q X - E_P X$$
$$T'_Q(P) = \varphi'(0+) = E_Q X - E_P X$$
Finally we obtain for $Q = \delta_x$
$$T'_x(P) = x - E_P X$$
(b) Variance:
$$T(P) = \mathrm{var}_P X = E_P(X^2) - (E_P X)^2$$
$$T((1 - t)P + tQ) = \int_{\mathcal{X}} x^2\, d((1 - t)P + tQ) - \left( \int_{\mathcal{X}} x\, d((1 - t)P + tQ) \right)^2$$
$$\Longrightarrow \varphi(t) = (1 - t) E_P X^2 + t E_Q X^2 - (1 - t)^2 (E_P X)^2 - t^2 (E_Q X)^2 - 2t(1 - t) E_P X \cdot E_Q X$$
$$\varphi'(t) = -E_P X^2 + E_Q X^2 + 2(1 - t)(E_P X)^2 - 2t (E_Q X)^2 - 2(1 - 2t) E_P X \cdot E_Q X$$
This further implies
$$\lim_{t \to 0+} \varphi'(t) = T'_Q(P) = E_Q X^2 - E_P X^2 - 2 E_P X \cdot E_Q X + 2 (E_P X)^2$$
and finally we obtain for $Q = \delta_x$
$$T'_x(P) = x^2 - E_P X^2 - 2x E_P X + 2 (E_P X)^2 = (x - E_P X)^2 - \mathrm{var}_P X$$
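The closed forms of Example 1.9(b) can be compared with a finite-difference approximation of the limit (1.7). The sketch below assumes two hypothetical discrete distributions P and Q on a common four-point support; the support, the probabilities, the step `h` and all names are illustrative choices, not taken from the text.

```r
# Two discrete distributions on the same support (hypothetical numbers)
x <- c(-1, 0, 2, 5)
p <- c(0.2, 0.5, 0.2, 0.1)     # P
q <- c(0.1, 0.1, 0.3, 0.5)     # Q

# variance functional T evaluated at a probability vector w on the support x
T_var <- function(w) sum(w * x^2) - sum(w * x)^2

# finite-difference approximation of the right derivative of phi at t = 0
h  <- 1e-6
fd <- (T_var((1 - h) * p + h * q) - T_var(p)) / h

# closed form of the Gateau derivative in direction Q, Example 1.9(b)
EP <- sum(p * x); EQ <- sum(q * x)
closed <- sum(q * x^2) - sum(p * x^2) - 2 * EP * EQ + 2 * EP^2

# direction Q = delta_x at the support point x0 = 5
x0       <- 5
delta    <- as.numeric(x == x0)
fd_x     <- (T_var((1 - h) * p + h * delta) - T_var(p)) / h
closed_x <- (x0 - EP)^2 - T_var(p)

rbind(direction_Q = c(finite_diff = fd,   closed_form = closed),
      direction_x = c(finite_diff = fd_x, closed_form = closed_x))
```

Because the function phi is quadratic in t for the variance functional, the finite difference differs from the derivative only by a term of order `h`.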
1.9 Fréchet derivative

Definition 1.3 We say that the functional $T$ is differentiable in $P$ in the Fréchet sense, if there exists a linear functional $L_P(Q - P)$ such that
$$\lim_{t \to 0} \frac{T(P + t(Q - P)) - T(P)}{t} = L_P(Q - P) \tag{1.9}$$
uniformly in $Q \in \mathcal{P}$, $\delta(P, Q) \le C$, for any fixed $C \in (0, \infty)$.
The linear functional $L_P(Q - P)$ is called the Fréchet derivative of the functional $T$ in $P$ in direction $Q$.

Remark 1.3
a) Because $L_P$ is a linear functional, there exists a function $g : \mathcal{X} \to \mathbb{R}$ such that
$$L_P(Q - P) = \int_{\mathcal{X}} g\, d(Q - P) \tag{1.10}$$
b) If $T$ is differentiable in the Fréchet sense, then it is differentiable in the Gâteau sense, too, i.e., there exists $T'_Q(P)$ for all $Q \in \mathcal{P}$, and it holds
$$T'_Q(P) = L_P(Q - P) \quad \forall Q \in \mathcal{P} \tag{1.11}$$
Especially,
$$T'_x(P) = L_P(\delta_x - P) = g(x) - \int_{\mathcal{X}} g\, dP \tag{1.12}$$
and this further implies
$$E_P(T'_x(P)) = \int_{\mathcal{X}} T'_x(P)\, dP = 0. \tag{1.13}$$
c) Let $P_n$ be the empirical probability distribution of the vector $(X_1, \ldots, X_n)$. Then $P_n - P = \frac{1}{n} \sum_{i=1}^n (\delta_{X_i} - P)$. Hence, because $L_P$ is a linear functional,
$$L_P(P_n - P) = \frac{1}{n} \sum_{i=1}^n L_P(\delta_{X_i} - P) = \frac{1}{n} \sum_{i=1}^n T'_{X_i}(P) = T'_{P_n}(P) \tag{1.14}$$
Proof of (1.11): Because $L_P(\cdot)$ is a linear functional, we get by (1.9)
$$\begin{aligned}
T'_Q(P) &= \lim_{t \to 0+} \frac{T(P + t(Q - P)) - T(P)}{t} \\
&= \lim_{t \to 0+} \left[ \frac{T(P + t(Q - P)) - T(P)}{t} - L_P(Q - P) \right] + L_P(Q - P) \\
&= 0 + L_P(Q - P) = L_P(Q - P) \qquad \Box
\end{aligned}$$
1.10 Hadamard (compact) derivative

If there exists a linear functional $L(Q - P)$ such that the convergence (1.9) is uniform, not necessarily over bounded subsets of the metric space $(\mathcal{P}, \delta)$ containing $P$ (i.e., over all $Q$ satisfying $\delta(P, Q) \le C$, $0 < C < \infty$), but only over $Q$ from any fixed compact set $K \subset \mathcal{P}$ containing $P$, then we say that the functional $T$ is differentiable in the Hadamard sense, and we call the functional $L(Q - P)$ the Hadamard (compact) derivative of $T$. A Fréchet differentiable functional is obviously also Hadamard differentiable, and a Hadamard differentiable functional is, in turn, also Gâteau differentiable, similarly as in Remark 1.3. We refer to Fernholz (1983) and to Fabian et al. (2001) for more properties of differentiable functionals.

Fréchet differentiability imposes rather restrictive conditions on the functional that are not satisfied, in particular, by the robust functionals. On the other hand, when we have a Fréchet differentiable functional, we can easily derive the large sample (normal) distribution of its empirical counterpart as the number $n$ of observations increases to infinity. If the functional is not sufficiently smooth, we can sometimes derive the large sample normal distribution of its empirical counterpart with the aid of the Hadamard derivative. If we only want to prove that $T(P_n)$ is a consistent estimator of $T(P)$, then it suffices to consider the continuity of $T(P)$. The Gâteau derivative $T'_x(P)$, called the influence function of the functional $T$, is one of the most important characteristics of its robustness and will be studied in detail in Chapter 2.
1.11 Large sample distribution of empirical functional

Consider again the metric space $(\mathcal{P}, \delta)$ of all probability distributions on $(\mathcal{X}, \mathcal{B})$, with metric $\delta$ satisfying
$$\sqrt{n}\, \delta(P_n, P) = O_p(1) \quad \text{as } n \to \infty, \tag{1.15}$$
where $P_n$ is the empirical probability distribution of the random sample $(X_1, \ldots, X_n)$, $n = 1, 2, \ldots$. The convergence (1.15) holds, e.g., for the Kolmogorov distance of the empirical distribution function from the true one, which is the most important for statistical applications; but it holds also for other distances. As an illustration of the use of the functional derivatives, let us show that Fréchet differentiability, together with the classical central limit theorem, always gives the large sample (asymptotic) distribution of the empirical functional $T(P_n)$.

Theorem 1.1 Let $T$ be a statistical functional, Fréchet differentiable in $P$, and assume that the empirical probability distribution $P_n$ of the random sample $(X_1, \ldots, X_n)$ satisfies the condition (1.15) as $n \to \infty$. If the variance of the Gâteau derivative $T'_{X_1}(P)$ is positive, $\mathrm{var}_P T'_{X_1}(P) > 0$, then the sequence $\sqrt{n}(T(P_n) - T(P))$ is asymptotically normally distributed as $n \to \infty$, namely
$$\mathcal{L}\left\{ \sqrt{n}\,(T(P_n) - T(P)) \right\} \longrightarrow N\left( 0, \mathrm{var}_P T'_{X_1}(P) \right) \tag{1.16}$$
Proof. By (1.14), $T'_{P_n}(P) = \frac{1}{n} \sum_{i=1}^n T'_{X_i}(P)$, and further by (1.8) and condition (1.15) we obtain
$$\begin{aligned}
\sqrt{n}\,(T(P_n) - T(P)) &= \frac{1}{\sqrt{n}} \sum_{i=1}^n T'_{X_i}(P) + R_n \\
&= \sqrt{n}\, L_P(P_n - P) + \sqrt{n}\, o(\delta(P_n, P)) \\
&= \frac{1}{\sqrt{n}} \sum_{i=1}^n T'_{X_i}(P) + o_p(1)
\end{aligned} \tag{1.17}$$
If the common variance $\mathrm{var}_P T'_{X_i}(P) = \mathrm{var}_P T'_{X_1}(P)$, $i = 1, \ldots, n$, is finite, then (1.16) follows from (1.17) and from the classical central limit theorem. $\Box$
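Example 1.10 Let $T(P) = \mathrm{var}_P X = \sigma^2$, then
$$T(P_n) = S_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)^2$$
and, by Example 1.9b),
$$T'_x(P) = (x - E_P X)^2 - \mathrm{var}_P X$$
hence
$$\mathrm{var}_P T'_X(P) = E_P (X - E_P X)^4 - E_P^2 (X - E_P X)^2 = \mu_4 - \mu_2^2$$
and by Theorem 1.1 we get the large sample distribution of the sample variance
$$\mathcal{L}\left\{ \sqrt{n}\,(S_n^2 - \sigma^2) \right\} \longrightarrow N\left( 0, \mu_4 - \mu_2^2 \right)$$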
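The limit in Example 1.10 can be illustrated by simulation. The sketch below assumes samples from the standard exponential distribution, for which the central moments are mu_2 = 1 and mu_4 = 9, so the asymptotic variance is 8; the seed, the sample size `n` and the number of replications `reps` are arbitrary illustrative choices.

```r
set.seed(1)        # arbitrary seed, for reproducibility only
n    <- 200        # sample size (assumed for illustration)
reps <- 5000       # number of simulated samples

# S_n^2 with divisor n, as in Example 1.10
s2 <- replicate(reps, {
  x <- rexp(n)
  mean((x - mean(x))^2)
})

# n * var(S_n^2) should be close to mu_4 - mu_2^2 = 9 - 1 = 8 for Exp(1)
c(simulated = n * var(s2), theoretical = 9 - 1^2)
```

For finite n the simulated value is only approximately 8: it carries both Monte Carlo error and a bias of order 1/n.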
1.12 Computation and software notes

We chose R (a language and environment for statistical computing and graphics) as a suitable tool for numerical illustration and computation. R seems to us to be one of the best statistical environments available. It is also free, and the R project is an open source project. The code you are using is always available for you to view. Detailed information about R and the R project is available from http://www.r-project.org/.
Chapter 1 focuses mainly on the theoretical background. However, Section 1.1 mentioned some specific distribution shapes. The R system has built-in functions to compute the density, distribution function and quantile function for many standard distributions, including the ones mentioned in Section 1.1 (see Table 1.1).

Table 1.1 R function names and parameters for selected probability distributions.

Distribution     R name   Parameters
binomial         binom    size, prob
exponential      exp      rate
gamma            gamma    shape, scale
hypergeometric   hyper    m, n, k
normal           norm     mean, sd
Poisson          pois     lambda
uniform          unif     min, max

The first letter of the name of the R function indicates the function: dXXXX, pXXXX, qXXXX are, respectively, the density, distribution and quantile functions. The first argument is the point at which the function is evaluated (x for densities, q for distribution functions), and the probability p for quantile functions. Additional arguments specify the parameters. These functions can be used in place of statistical tables. Here are some examples:

• P(X = 3) - binomial distribution with n = 20, p = 0.1
> dbinom(3,20,0.1)
[1] 0.1901199

• P(X ≤ 5) - Poisson distribution with λ = 2
> ppois(5,2)
[1] 0.9834364

• 95% quantile of the standard normal distribution
> qnorm(0.95)
[1] 1.644854

• The density function of the gamma distribution with b = 3 and a = 1 (see Figure 1.1).
> plot(seq(0,8,by=0.01),dgamma(seq(0,8,by=0.01),3,1),
+ type="l", ylab="",xlab="")
System R enables the generation of random data. The corresponding functions have the prefix r and the first argument n, the size of the sample required. For example, we can generate a sample of size 1000 from the gamma distribution (b = 3; a = 1) by
> rgamma(1000,3,1)
R has a function hist to plot histograms.
> hist(rgamma(1000,3,1), prob=TRUE,nclass=16)
We obtain Figure 1.2; compare with Figure 1.1.
[Figure 1.2 The histogram of the simulated sample from gamma distribution with b = 3 and a = 1. Plot title: Histogram of rgamma(1000, 3, 1); horizontal axis: rgamma(1000, 3, 1); vertical axis: Density.]
In Section 1.5, some distances of probability measures were introduced and illustrated in Example 1.8. There we need to compute an integral. We can also use R function integrate to solve that problem. Compare the following example with the calculation in Example 1.8.
> integrate(function(x) {1-exp(-x)},0,1)
0.3678794 with absolute error < 4.1e-15
> integrate(function(x) {exp(-x)},1, Inf)
0.3678794 with absolute error < 2.1e-05
R can also help us in the case of a discrete distribution. Let $P$ be the binomial distribution with parameters $n = 100$, $p = 0.01$. Let $Q$ be the Poisson distribution with parameter $\lambda = np = 1$. Then
$$2 d_V(P, Q) = \sum_{k=0}^{100} \left| \binom{100}{k} 0.01^k\, 0.99^{100-k} - \frac{e^{-1}}{k!} \right| + \sum_{k=101}^{\infty} \frac{e^{-1}}{k!}$$
> sum(abs(choose(100,0:100)*0.01^(0:100)*(0.99)^(100:0)
+ -exp(-1)/factorial(0:100)))+1-sum(exp(-1)/factorial(0:100))
[1] 0.005550589
> ## or also
> sum(abs(dbinom(0:100,100,0.01)-dpois(0:100,1)))+1-ppois(100,1)
[1] 0.005550589

Thus $d_V(Bi(100, 0.01), Po(1)) \approx 0.0028$. Similarly, $d_K(Bi(100, 0.01), Po(1)) \approx 0.0018$, $d_H(Bi(100, 0.01), Po(1)) \approx 0.0036$, $d_{KL}(Bi(100, 0.01), Po(1)) \approx 0.000025$ because
> ### Kolmogorov distance
> max(abs(pbinom(0:100,100,0.01)-ppois(0:100,1)))
[1] 0.0018471
> ### Hellinger distance
> sqrt(sum((sqrt(dbinom(0:100,100,0.01))
+ -sqrt(dpois(0:100,1)))^2))
[1] 0.003562329
> ### Kullback-Leibler divergence (Q,P)
> sum(dpois(0:100,1)*log(dpois(0:100,1)/
+ dbinom(0:100,100,0.01)))
[1] 2.551112e-05
> ### Kullback-Leibler divergence (P,Q)
> sum(dbinom(0:100,100,0.01)*log(dbinom(0:100,100,0.01)/
+ dpois(0:100,1)))
[1] 2.525253e-05
1.13 Problems and complements

1.1 Let $Q$ be the binomial distribution with parameters $n, p$ and let $P$ be the Poisson distribution with parameter $\lambda = np$, then
$$\frac{1}{2} (d_V(Q, P))^2 \le d_{KL}(Q, P)$$
1 np3 p 1 p2 ≤ dKL (Q, P ) ≤ + + + p2 4 4 3 2 4n
$$\frac{1}{16} \min(p, np^2) \le d_V(Q, P) \le 2p \left( 1 - e^{-np} \right)$$
$$d_{KL}(Q, P) \le \frac{p^2}{2(1 - p)}$$
$$\lim_{n \to \infty} n^2 d_{KL}(Q, P) = \frac{\lambda^2}{4}$$
See Barbour and Hall (1984), Csiszár (1967), Harremoës and Ruzankin (2004), Kontoyannis et al. (2005), Pinsker (1960) for demonstrations.

1.2 Wasserstein-Kantorovich distance of distribution functions $F, G$ of random variables $X, Y$:
• $L_1$-distance on $\mathcal{F}_1 = \{F : \int_{-\infty}^{\infty} |x|\, dF(x) < \infty\}$:
$$d_W^{(1)}(F, G) = \int_0^1 |F^{-1}(t) - G^{-1}(t)|\, dt$$
Show that $d_W^{(1)}(F, G) = \int_{-\infty}^{\infty} |F(x) - G(x)|\, dx$ (Dobrushin (1970)).
Show that $d_W^{(1)}(F, G) = \inf\{E|X - Y|\}$, where the infimum is over all jointly distributed $X$ and $Y$ with respective marginals $F$ and $G$.
• $L_2$-distance on $\mathcal{F}_2 = \{F : \int_{-\infty}^{\infty} x^2\, dF(x) < \infty\}$:
$$d_W^{(2)}(F, G) = \int_0^1 [F^{-1}(t) - G^{-1}(t)]^2\, dt$$
Show that $d_W^{(2)}(F, G) = \inf\{E(X - Y)^2\}$, where the infimum is over all jointly distributed $X$ and $Y$ with respective marginals $F$ and $G$ (Mallows (1972)).
• Weighted $L_1$-distance:
$$d_W^{(1)}(F, G) = \int_0^1 |F^{-1}(t) - G^{-1}(t)|\, w(t)\, dt, \qquad \int_0^1 w(t)\, dt = 1$$
1.3 Show that $d_P(F, G) \le d_W^{(1)}(F, G)$ (Dobrushin (1970)).

1.4 Let $(X_1, \ldots, X_n)$ and $(Y_1, \ldots, Y_n)$ be two independent random samples from distribution functions $F, G$ such that $\int_{-\infty}^{\infty} x\, dF(x) = \int_{-\infty}^{\infty} y\, dG(y) = 0$. Let $F_n, G_n$ be the distribution functions of $n^{-1/2} \sum_{i=1}^n X_i$ and $n^{-1/2} \sum_{i=1}^n Y_i$, respectively, then
$$d_W^{(2)}(F_n, G_n) \le d_W^{(2)}(F, G)$$
(Mallows (1972)).

1.5 $\chi^2$-distance: Let $p, q$ be the densities of probability distributions $P, Q$ with respect to a measure $\mu$ ($\mu$ can be a counting measure). Then $d_{\chi^2}(P, Q)$ is defined as
$$d_{\chi^2}(P, Q) = \int_{x \in \mathcal{X} :\, p(x), q(x) > 0} \frac{(p(x) - q(x))^2}{q(x)}\, d\mu(x)$$
Then $0 \le d_{\chi^2}(P, Q) \le \infty$ and $d_{\chi^2}$ is independent of the choice of the dominating measure. It is not a metric, because it is not symmetric in $P, Q$. The distance $d_{\chi^2}$ dates back to Pearson in the 1930s and has many applications in statistical inference. The following relations hold between $d_{\chi^2}$ and other distances:
(i) $d_H(P, Q) \le \sqrt{2}\, (d_{\chi^2}(P, Q))^{1/4}$
(ii) If the sample space $\mathcal{X}$ is countable, then $d_V(P, Q) \le \frac{1}{2} \sqrt{d_{\chi^2}(P, Q)}$
(iii) $d_{KL}(P, Q) \le d_{\chi^2}(P, Q)$

1.6 Let $P$ be the exponential distribution and let $Q$ be the uniform distribution (see Example 1.8). Then
$$d_{\chi^2}(Q, P) = \int_0^1 \frac{(1 - e^{-x})^2}{e^{-x}}\, dx = e - e^{-1} - 2$$
hence $d_{\chi^2}(R(0, 1), \exp) \approx 0.350402$. Furthermore,
$$d_{\chi^2}(P, Q) = \int_0^1 \left( e^{-x} - 1 \right)^2 dx = -\frac{1}{2} e^{-2} + 2 e^{-1} - \frac{1}{2}$$
hence $d_{\chi^2}(\exp, R(0, 1)) \approx 0.168091$.

1.7 Bhattacharyya distance: Let $p, q$ be the densities of probability distributions $P, Q$ with respect to a measure $\mu$. Then $d_B(P, Q)$ is defined as
$$d_B(P, Q) = \log \left( \int_{x \in \mathcal{X} :\, p(x), q(x) > 0} \sqrt{p(x)}\, \sqrt{q(x)}\, d\mu(x) \right)^{-1}$$
(Bhattacharyya (1943)). Furthermore, for a comparison
$$d_B(\exp, R(0, 1)) = \log \left( \int_0^1 \sqrt{e^{-x}}\, dx \right)^{-1} = -\log\left( 2 - \frac{2}{\sqrt{e}} \right) \approx 0.239605$$

1.8 Verify $2 d_V(P, Q) = \int_{\mathcal{X}} |dP - dQ|$.

1.9 Check the inequalities (1.3).

1.10 Check the inequalities (1.4).

1.11 Compute the Wasserstein-Kantorovich distances $d_W^{(1)}(F, G)$ and $d_W^{(2)}(F, G)$ for the exponential distribution and the uniform distribution (as in Example 1.8).
CHAPTER 2
Basic characteristics of robustness
2.1 Influence function

Expansion (1.17) of the difference $T(P_n) - T(P)$ says that
$$T(P_n) - T(P) = \frac{1}{n} \sum_{i=1}^n T'_{X_i}(P) + n^{-1/2} R_n \tag{2.1}$$
where the remainder term is asymptotically negligible, $n^{-1/2} R_n = o_p(n^{-1/2})$ as $n \to \infty$. Then we can consider $\frac{1}{n} \sum_{i=1}^n T'_{X_i}(P)$ as an error of estimating $T(P)$ by $T(P_n)$, and $T'_{X_i}(P)$ as a contribution of $X_i$ to this error, or as an influence of $X_i$ on this error. From this point of view, a natural interpretation of the Gâteau derivative $T'_x(P)$, $x \in \mathcal{X}$, is to call it an influence function of the functional $T(P)$.

Definition 2.1 The Gâteau derivative of the functional $T$ in distribution $P$ in the direction of the Dirac distribution $\delta_x$, $x \in \mathcal{X}$, is called the influence function of $T$ in $P$; thus
$$IF(x; T, P) = T'_x(P) = \lim_{t \to 0+} \frac{T(P_t(\delta_x)) - T(P)}{t} \tag{2.2}$$
where $P_t(\delta_x) = (1 - t)P + t\delta_x$.

As the first main properties of IF, let us mention:
a) $E_P(IF(x; T, P)) = \int_{\mathcal{X}} T'_x(P)\, dP = 0$, hence the average influence of all points $x$ on the estimation error is zero.
b) If $T$ is a Fréchet differentiable functional satisfying condition (1.15), and
$$\mathrm{var}_P(IF(x; T, P)) = E_P(IF(x; T, P))^2 > 0$$
then
$$\mathcal{L}\left\{ \sqrt{n}\,(T(P_n) - T(P)) \right\} \longrightarrow N\left( 0, \mathrm{var}_P(IF(x; T, P)) \right)$$
Example 2.1
(a) Expected value:
$T(P) = E_P(X) = m_P$, then
$$\begin{aligned}
T(P_n) &= \bar{X}_n \\
IF(x; T, P) &= T'_x(P) = x - m_P \\
E_P(IF(x; T, P)) &= 0 \\
\mathrm{var}_P(IF(x; T, P)) &= \mathrm{var}_P X = \sigma_P^2 \\
E_Q(IF(x; T, P)) &= m_Q - m_P \quad \text{for } Q \ne P \\
\mathcal{L}\left\{ \sqrt{n}\,(\bar{X}_n - m_P) \right\} &\longrightarrow N(0, \sigma_P^2)
\end{aligned}$$
provided $P$ is the true probability distribution of the random sample $(X_1, \ldots, X_n)$.
(b) Variance:
$T(P) = \mathrm{var}_P X = \sigma_P^2$, then
$$\begin{aligned}
IF(x; T, P) &= (x - m_P)^2 - \sigma_P^2 \\
E_P(IF(x; T, P)) &= 0 \\
\mathrm{var}_P(IF(x; T, P)) &= \mu_4 - \mu_2^2 = \mu_4 - \sigma_P^4 \\
E_Q(IF(x; T, P)) &= E_Q(X - m_P)^2 - \sigma_P^2 \\
&= \sigma_Q^2 + (m_Q - m_P)^2 + 2 E_Q (X - m_Q)(m_Q - m_P) - \sigma_P^2 \\
&= \sigma_Q^2 - \sigma_P^2 + (m_Q - m_P)^2
\end{aligned}$$
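The properties of the influence functions in Example 2.1 can be checked on data by letting the empirical distribution P_n play the role of P. The following sketch assumes a hypothetical gamma sample; the seed, the sample size and the variable names are illustrative only.

```r
set.seed(2)                 # arbitrary seed
x <- rgamma(500, 3, 1)      # hypothetical sample; P_n plays the role of P

m  <- mean(x)
s2 <- mean((x - m)^2)       # variance functional at P_n (divisor n)

IF_mean <- x - m            # IF(x; T, P_n) for the expected value
IF_var  <- (x - m)^2 - s2   # IF(x; T, P_n) for the variance

# property a): both influence functions average to zero under P_n
c(mean(IF_mean), mean(IF_var))

# the variance of the IF of the variance functional equals m4 - m2^2 under P_n
m4 <- mean((x - m)^4)
c(mean(IF_var^2), m4 - s2^2)
```

Both pairs agree up to rounding error, since the identities hold exactly for any distribution with finite fourth moment, including P_n.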
2.2 Discretized form of influence function
Let $(X_1, \ldots, X_n)$ be the vector of observations and denote $T_n = T(P_n) = T_n(X_1, \ldots, X_n)$ as its empirical functional. Consider what happens if we add another observation $Y$ to $X_1, \ldots, X_n$. The influence of $Y$ on $T_n$ is characterized by the difference
$$T_{n+1}(X_1, \ldots, X_n, Y) - T_n(X_1, \ldots, X_n) := I(T_n, Y)$$
Because
$$\begin{aligned}
P_n &= \frac{1}{n} \sum_{i=1}^n \delta_{X_i} \\
P_{n+1} &= \frac{1}{n+1} \left( \sum_{i=1}^n \delta_{X_i} + \delta_Y \right) = \frac{n}{n+1} P_n + \frac{1}{n+1} \delta_Y \\
&= \left( 1 - \frac{1}{n+1} \right) P_n + \frac{1}{n+1} \delta_Y
\end{aligned} \tag{2.3}$$
we can say that $P_{n+1}$ arose from $P_n$ by its contamination by the one-point distribution $\delta_Y$ in ratio $\frac{1}{n+1}$, hence
$$I(T_n, Y) = T\left[ \left( 1 - \frac{1}{n+1} \right) P_n + \frac{1}{n+1} \delta_Y \right] - T(P_n)$$
Because
$$\lim_{n \to \infty} (n + 1) I(T_n, Y) = \lim_{n \to \infty} \frac{T\left[ \left( 1 - \frac{1}{n+1} \right) P_n + \frac{1}{n+1} \delta_Y \right] - T(P_n)}{\frac{1}{n+1}} = IF(Y; T, P) \tag{2.4}$$
$(n + 1) I(T_n, Y)$ can be considered as a discretized form of the influence function. The supremum of $|I(T_n, Y)|$ over $Y$ then represents a measure of sensitivity of the empirical functional $T_n$ with respect to an additional observation, under fixed $X_1, \ldots, X_n$.

Definition 2.2 The number
$$S(T_n) = \sup_Y |I(T_n(X_1, \ldots, X_n), Y)| \tag{2.5}$$
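The difference I(T_n, Y) is easy to evaluate on data for any statistic. The sketch below computes it for the sample mean and the sample median on a hypothetical normal sample of size 21 and a handful of candidate values of Y; the helper `I_add`, the grid `y_grid` and the seed are our own illustrative choices. The bound `cap` used for the median is the one derived in Example 2.2(b).

```r
set.seed(3)                        # arbitrary seed
x <- rnorm(21)                     # hypothetical sample, n = 2m + 1 with m = 10

# I(T_n, Y): change of the statistic when one observation y is added
I_add <- function(stat, x, y) stat(c(x, y)) - stat(x)

y_grid   <- c(-1e6, -100, -1, 0, 1, 100, 1e6)   # candidate additional observations
I_mean   <- sapply(y_grid, function(y) I_add(mean,   x, y))
I_median <- sapply(y_grid, function(y) I_add(median, x, y))

# for the median, |I| is bounded by half the larger gap around the middle order statistic
xs  <- sort(x); m <- 10
cap <- max(xs[m + 1] - xs[m], xs[m + 2] - xs[m + 1]) / 2

rbind(y = y_grid, mean = I_mean, median = I_median)
```

The row for the mean grows linearly in y, while the row for the median never exceeds `cap` in absolute value, however extreme y is.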
is called the sensitivity of the functional $T_n(X_1, \ldots, X_n)$ to an additional observation.

Example 2.2
(a) Expected value: $T(P) = E_P X$, $T_n = \bar{X}_n$, $T_{n+1} = \bar{X}_{n+1}$
$$\Longrightarrow T_{n+1} = \frac{1}{n+1} (n \bar{X}_n + Y)$$
$$I(T_n, Y) = \left( \frac{n}{n+1} - 1 \right) \bar{X}_n + \frac{1}{n+1} Y = \frac{1}{n+1} (Y - \bar{X}_n)$$
$$\Longrightarrow (n + 1) I(T_n, Y) = Y - \bar{X}_n \stackrel{P}{\longrightarrow} Y - E_P X \quad \text{as } n \to \infty$$
$$\Longrightarrow S(\bar{X}_n) = \frac{1}{n+1} \sup_Y |Y - \bar{X}_n| = \infty$$
Thus, the sample mean has an infinite sensitivity to an additional observation.
(b) Median: Let $n = 2m + 1$ and let $X_{(1)} \le \ldots \le X_{(n)}$ be the observations ordered in increasing magnitude. Then $T_n = T_n(X_1, \ldots, X_n) = X_{(m+1)}$ and $T_{n+1} = T_{n+1}(X_1, \ldots, X_n, Y)$ takes on the following values, depending on the position of $Y$ among the other observations:
$$T_{n+1} = \begin{cases}
\frac{X_{(m)} + X_{(m+1)}}{2} & Y \le X_{(m)} \\
\frac{X_{(m+1)} + X_{(m+2)}}{2} & Y \ge X_{(m+2)} \\
\frac{Y + X_{(m+1)}}{2} & X_{(m)} \le Y \le X_{(m+2)}
\end{cases}$$
Hence, the influence of adding $Y$ to $X_1, \ldots, X_n$ on the median is measured by
$$I(T_n, Y) = \begin{cases}
\frac{X_{(m)} - X_{(m+1)}}{2} & Y \le X_{(m)} \\
\frac{X_{(m+2)} - X_{(m+1)}}{2} & Y \ge X_{(m+2)} \\
\frac{Y - X_{(m+1)}}{2} & X_{(m)} \le Y \le X_{(m+2)}
\end{cases}$$
Among the three possible values of $|I(T_n, Y)|$, the value $|\frac{1}{2}(Y - X_{(m+1)})|$ is the smallest; thus the sensitivity of the median to an additional observation is equal to
$$S(T_n) = \max\left\{ \frac{1}{2} (X_{(m+1)} - X_{(m)}),\ \frac{1}{2} (X_{(m+2)} - X_{(m+1)}) \right\}$$
and it is finite under any fixed $X_1, \ldots, X_n$.

2.3 Qualitative robustness

As we have seen in Example 2.1, the influence functions of the expectation and variance are unbounded and can assume arbitrarily large values. Moreover, Example 2.2 shows that adding one more observation can cause a breakdown of the sample mean. The least squares estimator (LSE) behaves analogously (in fact, the mean is a special form of the least squares estimator). Remember the Kagan, Linnik and Rao theorem, mentioned in Section 1.1, that illustrates a large sensitivity of the LSE to deviations from the normal distribution of errors. Intuitively it means that the least squares estimator (and the mean) are very non-robust. How can we mathematically express this intuitive non-robustness property, and how shall we define the concept of robustness? Historically, this concept has been developing over a rather long period, since many statisticians observed a sensitivity of statistical procedures to deviations from assumed models, and analyzed it from various points of view. It is interesting that the physicists and astronomers, who tried to determine values of various physical, geophysical and astronomic parameters by means of an average of several measurements, were the first to notice the sensitivity of the mean and the variance to outlying observations. This interesting part of the statistical history is nicely described in the book by Stigler (1986). The history goes back to 1757, when R. J.
Boskovich, analyzing his experiments aiming at a characterization of the shape of the globe, proposed an estimation
method alternative to the least squares. E. S. Pearson noticed the sensitivity of the classical analysis of variance procedures to deviations from normality in 1931. J. W. Tukey and his Princeton group started a systematic study of possible alternatives to the least squares in the 1940s. The name “robust” was first used by Box in 1953. Box and Anderson (1955) characterized as robust a statistical procedure that is only slightly sensitive to changes of the nuisance or unimportant parameters, while it is sensitive (efficient) to its parameter of interest.

When we speak about robustness of a statistical procedure, we usually mean its robustness with respect to deviations from the assumed distribution of errors. However, other types of robustness are also important, such as robustness with respect to the assumed independence of observations, an assumption that is often violated in practice.

The first mathematical definition of robustness was formulated by Hampel (1968, 1971), who based the concept of robustness of a statistical functional on its continuity in a neighborhood of the considered probability distribution. The continuity and neighborhood were considered with respect to the Prohorov metric on the space $\mathcal{P}$.

Let a random variable (or random vector) $X$ take on values in the sample space $(\mathcal{X}, \mathcal{B})$; denote $P$ as its probability distribution. We shall try to characterize mathematically the robustness of the functional $T(P) = T(X)$. This functional is estimated with the aid of observations $X_1, \ldots, X_n$, which are independent copies of $X$. More precisely, we estimate $T$ by the empirical functional $T_n(P_n) = T_n(X_1, \ldots, X_n)$, based on the empirical distribution $P_n$ of $X_1, \ldots, X_n$. The empirical functional $T_n$ is often also called the (sample) statistic. Hampel’s definition of the (qualitative) robustness is based on the Prohorov metric $d_P$ on the system $\mathcal{P}$ of probability measures on the sample space.
Definition 2.3 We say that the sequence of statistics (empirical functionals) $\{T_n\}$ is qualitatively robust for the probability distribution $P$, if to any $\varepsilon > 0$ there exist a $\delta > 0$ and a positive integer $n_0$ such that, for all $Q \in \mathcal{P}$ and $n \ge n_0$,
$$d_P(P, Q) < \delta \Longrightarrow d_P(\mathcal{L}_P(T_n), \mathcal{L}_Q(T_n)) < \varepsilon \tag{2.6}$$
where $\mathcal{L}_P(T_n)$ and $\mathcal{L}_Q(T_n)$ denote the probability distributions of $T_n$ under $P$ and $Q$, respectively.

This robustness is only qualitative: it says only whether the functional is or is not robust, but it does not numerically measure the level of this characteristic. Because such robustness concerns only the behavior of the functional in a small neighborhood of $P$, it is in fact infinitesimal. We can obviously replace the Prohorov metric with another suitable metric on the space $\mathcal{P}$, e.g., the Lévy metric. However, we do not only want to see whether $T$ is or is not robust. We want to compare the various functionals with each other and see which is more robust
than the other. To do this, we must characterize the robustness with some quantitative measure. There are many possible quantifications of the robustness. However, when using such quantitative measures, be aware that replacing a complicated concept with just one number can cause a bias and suppress important information.
2.4 Quantitative characteristics of robustness based on influence function

The influence function is one of the most important characteristics of the statistical functional/estimator. The value $IF(x; T, P)$ measures the effect of a contamination of the functional $T$ by a single value $x$. Hence, a robust functional $T$ should have a bounded influence function. However, even the fact that $T$ is a qualitatively robust functional does not automatically mean that its influence function $IF(x; T, P)$ is bounded. As we see later, an example of such a functional is the R-estimator of the shift parameter, which is an inversion of the van der Waerden rank test; while it is qualitatively robust, its influence function is unbounded.

The most popular quantitative characteristics of robustness of the functional $T$, based on the influence function, are its global and local sensitivities:
a) The global sensitivity of the functional $T$ under distribution $P$ is the maximum absolute value of the influence function in $x$ under $P$, i.e.,
$$\gamma^* = \sup_{x \in \mathcal{X}} |IF(x; T, P)| \tag{2.7}$$
b) The local sensitivity of the functional $T$ under distribution $P$ is the value
$$\lambda^* = \sup_{x, y;\ x \ne y} \left| \frac{IF(y; T, P) - IF(x; T, P)}{y - x} \right| \tag{2.8}$$
that indicates the effect of the replacement of the value $x$ by the value $y$ on the functional $T$.

The following example illustrates the difference between the global and local sensitivities.

Example 2.3
(a) Mean: $T(P) = E_P(X)$, $IF(x; T, P) = x - E_P X \Longrightarrow \gamma^* = \infty$, $\lambda^* = 1$; the mean is not robust, but it is not sensitive to the local changes.
(b) Variance: $T(P) = \mathrm{var}_P X = \sigma_P^2$,
$$IF(x; T, P) = (x - E_P(X))^2 - \sigma_P^2, \qquad \gamma^* = \infty$$
$$\begin{aligned}
\lambda^* &= \sup_{y \ne x} \left| \frac{(x - E_P(X))^2 - (y - E_P(X))^2}{x - y} \right| \\
&= \sup_{y \ne x} \left| \frac{x^2 - y^2 - 2(x - y) E_P X}{x - y} \right| = \sup_{y \ne x} |x + y - 2 E_P X| = \infty
\end{aligned}$$
hence the variance is non-robust both to large as well as to small (local) changes.
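The contrast between the two local sensitivities in Example 2.3 can be seen numerically by maximizing the difference quotient in (2.8) over grids on [-R, R] of growing range. This is only a sketch: it assumes a distribution with E_P X = 0 and variance 1, and the helper `local_sens` and the grid size are hypothetical choices.

```r
IF_mean <- function(x, mu = 0) x - mu                      # Example 2.3(a), E_P X = mu
IF_var  <- function(x, mu = 0, s2 = 1) (x - mu)^2 - s2     # Example 2.3(b)

# largest difference quotient |IF(y) - IF(x)| / |y - x| over a grid on [-R, R]
local_sens <- function(IF, R) {
  g <- seq(-R, R, length.out = 101)
  d <- abs(outer(g, g, function(x, y) IF(y) - IF(x))) / abs(outer(g, g, "-"))
  max(d[is.finite(d)])            # drops the diagonal x = y, where the quotient is 0/0
}

R <- c(5, 50, 500)
rbind(R        = R,
      mean     = sapply(R, function(r) local_sens(IF_mean, r)),
      variance = sapply(R, function(r) local_sens(IF_var,  r)))
```

For the mean the quotient stays at 1 whatever the range, while for the variance it grows roughly like 2R, in line with the conclusion that the local sensitivity of the variance is infinite.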
2.5 Maximum bias

Assume that the true distribution function $F_0$ lies in some family $\mathcal{F}$. Another natural measure of robustness of the functional $T$ is its maximal bias (maxbias) over $\mathcal{F}$,
$$b(\mathcal{F}) = \sup_{F \in \mathcal{F}} \{|T(F) - T(F_0)|\} \tag{2.9}$$
The family $\mathcal{F}$ can have various forms; for example, it can be a neighborhood of a fixed distribution $F_0$ with respect to some distance described in Section 1.5. In the robustness analysis, $\mathcal{F}$ is often the $\varepsilon$-contaminated neighborhood of a fixed distribution function $F_0$, which has the form
$$\mathcal{F}_{F_0, \varepsilon} = \{F : F = (1 - \varepsilon) F_0 + \varepsilon G,\ G \text{ unknown distribution function}\} \tag{2.10}$$
The value $\varepsilon$ of the contamination ratio is considered as known, and so is the central distribution function $F_0$.

When estimating the location parameter $\theta$ of $F(x - \theta)$, where $F$ is an unknown member of $\mathcal{F}$, the central distribution $F_0$ is usually taken as symmetric around zero and unimodal, while the contaminating distribution $G$ can run either over symmetric or asymmetric distribution functions. We then speak about symmetric or asymmetric contaminations.

Many statistical functionals are monotone with respect to the stochastic ordering of distribution functions (or random variables), defined in the following way: A random variable $X$ with distribution function $F$ is stochastically smaller than a random variable $Y$ with distribution function $G$, if
$$F(x) \ge G(x) \quad \forall x \in \mathbb{R}$$
The monotone statistical functional thus attains its maxbias either at the stochastically largest member $F_\infty$ or at the stochastically smallest member $F_{-\infty}$ of $\mathcal{F}_{F_0, \varepsilon}$, hence
$$b(\mathcal{F}_{F_0, \varepsilon}) = \max\{|T(F_\infty) - T(F_0)|,\ |T(F_{-\infty}) - T(F_0)|\} \tag{2.11}$$
The following example illustrates the role of the maximal bias well; it shows that while the mean is non-robust, the median is universally robust with respect to the maxbias criterion.
Example 2.4
(i) Mean: $T(F) = E_F(X)$; if $F_0$ is symmetric around zero and so are all contaminating distributions $G$, all having finite first moments, then $T(F)$ is unbiased for all $F \in \mathcal{F}_{F_0, \varepsilon}$, hence $b(\mathcal{F}_{F_0, \varepsilon}) = 0$. However, under an asymmetric contamination, $b(\mathcal{F}_{F_0, \varepsilon}) = |E(F_\infty) - E(F_0)| = \infty$, where $F_\infty = (1 - \varepsilon) F_0 + \varepsilon \delta_\infty$ is the stochastically largest member of $\mathcal{F}_{F_0, \varepsilon}$.
(ii) Median: Because the median is nondecreasing with respect to the stochastic ordering of distributions, its maximum absolute bias over an asymmetric $\varepsilon$-contaminated neighborhood of a symmetric distribution function $F_0$ is attained either at $F_\infty = (1 - \varepsilon) F_0 + \varepsilon \delta_\infty$ (the stochastically largest distribution of $\mathcal{F}_{F_0, \varepsilon}$), or at $F_{-\infty} = (1 - \varepsilon) F_0 + \varepsilon \delta_{-\infty}$ (the stochastically smallest distribution of $\mathcal{F}_{F_0, \varepsilon}$). The median of $F_\infty$ is attained at $x_0$ satisfying
$$(1 - \varepsilon) F_0(x_0) = \frac{1}{2} \Longrightarrow x_0 = F_0^{-1}\left( \frac{1}{2(1 - \varepsilon)} \right)$$
while the median of $F_{-\infty}$ is $x_0^-$ such that
$$(1 - \varepsilon) F_0(x_0^-) + \varepsilon = \frac{1}{2} \Longrightarrow x_0^- = F_0^{-1}\left( 1 - \frac{1}{2(1 - \varepsilon)} \right) = -x_0$$
hence the maxbias of the median is equal to $x_0$.

Let $T(F)$ be any other functional such that its estimate $T(F_n) = T(X_1, \ldots, X_n)$, based on the empirical distribution function $F_n$, is translation equivariant, i.e., $T(X_1 + c, \ldots, X_n + c) = T(X_1, \ldots, X_n) + c$ for any $c \in \mathbb{R}$. Then obviously $T(F(\cdot - c)) = T(F(\cdot)) + c$. We shall show that the maxbias of $T$ cannot be smaller than $x_0$. Consider two contaminations of $F_0$,
$$F_+ = (1 - \varepsilon) F_0 + \varepsilon G_+, \qquad F_- = (1 - \varepsilon) F_0 + \varepsilon G_-$$
where
$$G_+(x) = \begin{cases}
0 & x \le x_0 \\
\frac{1}{\varepsilon} \{1 - (1 - \varepsilon)[F_0(x) + F_0(2x_0 - x)]\} & x > x_0
\end{cases}$$
and
$$G_-(x) = \begin{cases}
\frac{1}{\varepsilon} \{(1 - \varepsilon)[F_0(x + 2x_0) - F_0(x)]\} & x < -x_0 \\
1 & x \ge -x_0
\end{cases}$$
Notice that $F_-(x - x_0) = F_+(x + x_0)$, hence $T(F_-) + x_0 = T(F_+) - x_0$ and $T(F_+) - T(F_-) = 2x_0$; thus the maxbias of $T$ at $F_0$ cannot be smaller than $x_0$. It shows that the median has the smallest maxbias among all translation equivariant functionals.
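The maxbias of the median from Example 2.4(ii) is a one-line computation once the central distribution is fixed. The sketch below takes F_0 to be the standard normal distribution, which is our assumption for illustration only; the function name `maxbias_median` and the chosen contamination ratios are likewise hypothetical.

```r
# maxbias of the median over the eps-contaminated neighbourhood of F0,
# x0 = F0^{-1}(1 / (2 (1 - eps))); F0 = N(0, 1) is an assumption made here
maxbias_median <- function(eps, qF0 = qnorm) {
  stopifnot(all(eps >= 0), all(eps < 0.5))
  qF0(1 / (2 * (1 - eps)))
}

eps <- c(0, 0.01, 0.05, 0.1, 0.25, 0.45)
round(maxbias_median(eps), 4)

# cross-check for eps = 0.1: the median of (1 - eps) F0 + eps * delta_y, with y far
# to the right, solves (1 - eps) * F0(x) = 1/2
root <- uniroot(function(x) (1 - 0.1) * pnorm(x) - 0.5, c(-5, 5), tol = 1e-10)$root
c(formula = maxbias_median(0.1), root = root)
```

The bias is zero without contamination, increases with eps, and stays finite for every eps below 1/2.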
If $T(F)$ is a nonlinear functional, or if it is defined implicitly as a solution of a minimization or of a system of equations, then it is difficult to calculate (2.11) precisely. Then we consider the maximum asymptotic bias of $T(F)$ over a neighborhood $\mathcal{F}$ of $F_0$. More precisely, let $X_1, \ldots, X_n$ be independent identically distributed observations, distributed according to a distribution function $F \in \mathcal{F}$, and let $F_n$ be the empirical distribution function. Assume that, as the number of observations increases to infinity, $T(F_n)$ has an asymptotically normal distribution for every $F \in \mathcal{F}$ in the sense that
$$P\left( \sqrt{n}\,(T(F_n) - T(F)) \le x \right) \to \Phi\left( \frac{x}{\sigma(T, F)} \right) \quad \text{as } n \to \infty$$
with variance $\sigma^2(T, F)$ depending on $T$ and $F$. Then the maximal asymptotic bias (asymptotic maxbias) of $T$ over $\mathcal{F}$ is defined as
$$\sup\{|T(F) - T(F_0)| : F \in \mathcal{F}\} \tag{2.12}$$
We shall return to the asymptotic maxbias later in the context of some robust estimators that are either nonlinear or defined implicitly as a solution of a minimization or a system of equations.
2.6 Breakdown point

The breakdown point, introduced by Donoho and Huber in 1983, is a very popular quantitative characteristic of robustness. To describe this characteristic, start from a random sample $x^{(0)} = (x_1, \ldots, x_n)$ and consider the corresponding value $T_n(x^{(0)})$ of an estimator of functional $T$. Imagine that in this "initial" sample we can replace any $m$ components by arbitrary values, possibly very unfavorable, even infinite. Denote the new sample after the replacement by $x^{(m)}$, and let $T_n(x^{(m)})$ be the pertaining value of the estimator. The breakdown point of estimator $T_n$ for sample $x^{(0)}$ is the number

$$\varepsilon_n^*(T_n, x^{(0)}) = \frac{m^*(x^{(0)})}{n}$$

where $m^*(x^{(0)})$ is the smallest integer $m$ for which

$$\sup_{x^{(m)}} \left|T_n(x^{(m)}) - T_n(x^{(0)})\right| = \infty$$

i.e., the smallest part of the observations that, being replaced with arbitrary values, can lead $T_n$ up to infinity. Some estimators have a universal breakdown point, when $m^*$ is independent of the initial sample $x^{(0)}$. Then we can also calculate the limit $\varepsilon^* = \lim_{n\to\infty} \varepsilon_n^*$, which is often also called the breakdown point. We can modify the breakdown point in such a way that, instead of replacing $m$ components, we extend the sample by some $m$ (unfavorable) values.
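Example 2.5

(a) The average $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$:

$$\varepsilon_n^*(\bar{X}_n, x^{(0)}) = \frac{1}{n}, \quad \text{hence} \quad \lim_{n\to\infty} \varepsilon_n^*(\bar{X}_n, x^{(0)}) = 0 \quad \text{for any initial sample } x^{(0)}$$

(b) Median $\tilde{X}_n = X_{\left(\frac{n+1}{2}\right)}$ (consider $n$ odd, for simplicity):

$$\varepsilon_n^*(\tilde{X}_n, x^{(0)}) = \frac{n+1}{2n}, \quad \text{thus} \quad \lim_{n\to\infty} \varepsilon_n^*(\tilde{X}_n, x^{(0)}) = \frac{1}{2} \quad \text{for any initial sample } x^{(0)}$$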
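The two breakdown points in Example 2.5 can be checked empirically. The following sketch is only an illustration under simplifying assumptions: a very large finite value stands in for an "infinite" replacement, only one contamination pattern is tried (replacing the largest observations), and the data and the helper names `replace_worst` and `smallest_breaking_m` are made up for this example.

```python
from statistics import mean, median


def replace_worst(sample, m, bad=1e12):
    """Replace the m largest observations by a huge value (one adversarial pattern)."""
    s = sorted(sample)
    return s[: len(s) - m] + [bad] * m


def smallest_breaking_m(estimator, sample, bad=1e12, threshold=1e6):
    """Smallest m whose replacement moves the estimate by more than `threshold`."""
    base = estimator(sample)
    for m in range(1, len(sample) + 1):
        if abs(estimator(replace_worst(sample, m, bad)) - base) > threshold:
            return m
    return None


x = [2.1, -0.4, 0.3, 1.7, -1.2, 0.8, 0.0, -0.6, 1.1]  # n = 9, illustrative data
n = len(x)
m_mean = smallest_breaking_m(mean, x)
m_median = smallest_breaking_m(median, x)

if __name__ == "__main__":
    print("mean:   m* =", m_mean, " eps*_n =", m_mean / n)
    print("median: m* =", m_median, " eps*_n =", m_median / n)
```

For the mean a single replaced observation suffices, while for the median more than half of the sample has to be replaced, in agreement with $1/n$ and $(n+1)/(2n)$.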
2.7 Tail-behavior measure of a statistical estimator

The tail-behavior measure is surprisingly intuitive, mainly in estimating the shift and regression parameters. We will first illustrate this measure on the shift parameter, and then return to regression at a suitable place. Let $(X_1, \ldots, X_n)$ be a random sample from a population with continuous distribution function $F(x - \theta)$, $\theta \in \mathbb{R}$. The problem of interest is that of estimating parameter $\theta$. A reasonable estimator of the shift parameter should be translation equivariant: $T_n$ is translation equivariant if

$$T_n(X_1 + c, \ldots, X_n + c) = T_n(X_1, \ldots, X_n) + c \quad \forall c \in \mathbb{R} \text{ and } \forall X_1, \ldots, X_n$$

The performance of such an estimator can be characterized by the probabilities

$$P_\theta(|T_n - \theta| > a)$$

analyzed either under fixed $a > 0$ and $n \to \infty$, or under fixed $n$ and $a \to \infty$. Indeed, if $\{T_n\}$ is a consistent estimator of $\theta$, then $\lim_{n\to\infty} P_\theta(|T_n - \theta| > a) = 0$ under any fixed $a > 0$. Such a characteristic was studied, e.g., by Bahadur (1967), Fu (1975, 1980) and Sievers (1978), who suggested the limit

$$\lim_{n\to\infty}\left\{-\frac{1}{n} \ln P_\theta(|T_n - \theta| > a)\right\} \quad \text{under fixed } a > 0$$

(provided it exists) as a measure of efficiency of estimator $T_n$, and compared estimators from this point of view. On the other hand, a good estimator $T_n$ also verifies the convergence

$$\lim_{a\to\infty} P_\theta(|T_n - \theta| > a) = \lim_{a\to\infty} P_0(|T_n| > a) = 0 \tag{2.13}$$
while this convergence is as fast as possible. The probabilities Pθ (Tn − θ > a) and Pθ (Tn − θ < −a), for a sufficiently large, are called the right and the left tails, respectively, of the probability distribution of Tn . If Tn is symmetrically distributed around θ, then both its tails are characterized by probability (2.13). This probability should rapidly tend to zero. However, the speed of this convergence cannot be arbitrarily high. We shall show that the rate of
convergence of tails of a translation equivariant estimator is bounded, and that its upper bound depends on the behavior of $1 - F(a)$ and $F(-a)$ for large $a > 0$. Let us illustrate this upper bound on a model with symmetric distribution function satisfying $F(-x) = 1 - F(x)$ $\forall x \in \mathbb{R}$. Jurečková (1981) introduced the following tail-behavior measure of an equivariant estimator $T_n$:

$$B(T_n; a) = \frac{-\ln P_\theta(|T_n - \theta| > a)}{-\ln(1 - F(a))} = \frac{-\ln P_0(|T_n| > a)}{-\ln(1 - F(a))}, \quad a > 0 \tag{2.14}$$
The values $B(T_n; a)$ for $a \gg 0$ show how many times faster the probability $P_0(|T_n| > a)$ tends to 0 than $1 - F(a)$, as $a \to \infty$. The best estimator $T_n$ is the one with the largest possible values of $B(T_n; a)$ for $a \gg 0$. The lower and upper bounds for $B(T_n; a)$, and thus for the rate of convergence of its tails, are formulated in the following lemma:

Lemma 2.1 Let $X_1, \ldots, X_n$ be a random sample from a population with distribution function $F(x - \theta)$, $0 < F(x) < 1$, $F(-x) = 1 - F(x)$, $x, \theta \in \mathbb{R}$. Let $T_n$ be an equivariant estimator of $\theta$ such that, for any fixed $n$,

$$\min_{1 \le i \le n} X_i > 0 \implies T_n(X_1, \ldots, X_n) > 0, \qquad \max_{1 \le i \le n} X_i < 0 \implies T_n(X_1, \ldots, X_n) < 0 \tag{2.15}$$

Then, under any fixed $n$,

$$1 \le \liminf_{a\to\infty} B(T_n; a) \le \limsup_{a\to\infty} B(T_n; a) \le n \tag{2.16}$$
Proof. Indeed, if $T_n$ is equivariant, then

$$\begin{aligned} P_0(|T_n(X_1, \ldots, X_n)| > a) &= P_0(T_n(X_1, \ldots, X_n) > a) + P_0(T_n(X_1, \ldots, X_n) < -a) \\ &= P_0(T_n(X_1 - a, \ldots, X_n - a) > 0) + P_0(T_n(X_1 + a, \ldots, X_n + a) < 0) \\ &\ge P_0\Big(\min_{1 \le i \le n} X_i > a\Big) + P_0\Big(\max_{1 \le i \le n} X_i < -a\Big) \\ &= (1 - F(a))^n + (F(-a))^n \end{aligned}$$

hence

$$-\ln P_0(|T_n(X_1, \ldots, X_n)| > a) \le -\ln 2 - n \ln(1 - F(a)) \implies \limsup_{a\to\infty} \frac{-\ln P_0(|T_n| > a)}{-\ln(1 - F(a))} \le n$$

Similarly,

$$\begin{aligned} P_0(|T_n(X_1, \ldots, X_n)| > a) &\le P_0\Big(\min_{1 \le i \le n} X_i \le -a\Big) + P_0\Big(\max_{1 \le i \le n} X_i \ge a\Big) \\ &= 1 - (1 - F(-a))^n + 1 - (F(a))^n = 2\left\{1 - (F(a))^n\right\} \\ &= 2(1 - F(a))\left\{1 + F(a) + \ldots + (F(a))^{n-1}\right\} \le 2n(1 - F(a)) \end{aligned}$$

hence

$$-\ln P_0(|T_n(X_1, \ldots, X_n)| > a) \ge -\ln 2 - \ln n - \ln(1 - F(a)) \implies \liminf_{a\to\infty} \frac{-\ln P_0(|T_n| > a)}{-\ln(1 - F(a))} \ge 1 \qquad \Box$$
If $T_n$ attains the upper bound in (2.16), then it is obviously optimal for distribution function $F$, because its tails tend to zero $n$ times faster than $1 - F(a)$, which is the upper bound. However, we still have the following questions:

• Is the upper bound attainable, and for which $T_n$ and $F$?
• Is there any estimator $T_n$ attaining high values of $B(T_n; a)$ robustly for a broad class of distribution functions?

It turns out that the sample mean $\bar{X}_n$ can attain both the lower and the upper bound in (2.16); namely, it attains the upper bound under the normal distribution and under an exponentially tailed distribution, while it attains the lower bound for the Cauchy distribution and for the heavy-tailed distributions. This demonstrates a high non-robustness of $\bar{X}_n$ even from the tail-behavior aspect. On the other hand, the sample median $\tilde{X}_n$ is robust even with respect to tails: $\tilde{X}_n$ does not attain the upper bound in (2.16); on the contrary, the limit $\lim_{a\to\infty} B(\tilde{X}_n; a)$ is always in the middle of the range between 1 and $n$ for a broad class of distribution functions. These conclusions are in good concordance with the robustness concepts. The following theorem gives them a mathematical form.

Theorem 2.1 Let $X_1, \ldots, X_n$ be a random sample from a population with distribution function $F(x - \theta)$, $0 < F(x) < 1$, $F(-x) = 1 - F(x)$, $x, \theta \in \mathbb{R}$.

(i) Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ be the sample mean. If $F$ has exponential tails, i.e.,

$$\lim_{a\to\infty} \frac{-\ln(1 - F(a))}{b a^r} = 1 \quad \text{for some } b > 0,\ r \ge 1 \tag{2.17}$$

then

$$\lim_{a\to\infty} B(\bar{X}_n; a) = n \tag{2.18}$$

(ii) If $F$ has heavy tails in the sense that

$$\lim_{a\to\infty} \frac{-\ln(1 - F(a))}{m \ln a} = 1 \quad \text{for some } m > 0 \tag{2.19}$$

then

$$\lim_{a\to\infty} B(\bar{X}_n; a) = 1 \tag{2.20}$$

(iii) Let $\tilde{X}_n$ be the sample median. Then for $F$ satisfying either (2.17) or (2.19),

$$\frac{n}{2} \le \liminf_{a\to\infty} B(\tilde{X}_n; a) \le \limsup_{a\to\infty} B(\tilde{X}_n; a) \le \frac{n}{2} + 1 \quad \text{for } n \text{ even} \tag{2.21}$$

and

$$\lim_{a\to\infty} B(\tilde{X}_n; a) = \frac{n+1}{2} \quad \text{for } n \text{ odd} \tag{2.22}$$

Remark 2.1 The distribution functions with exponential tails, satisfying (2.17), will be briefly called type I. This class includes the normal distribution ($r = 2$) and the logistic and Laplace distributions ($r = 1$). The distribution functions with heavy tails, satisfying (2.19), will be called type II. The Cauchy distribution ($m = 1$) and the t-distribution with $m > 1$ degrees of freedom belong here.

Proof of Theorem 2.1. (i) It is sufficient to prove that the exponentially tailed $F$ has a finite expected value

$$E_\varepsilon = E_0\left[\exp\left\{n(1 - \varepsilon) b |\bar{X}_n|^r\right\}\right] < \infty \tag{2.23}$$

for arbitrary $\varepsilon \in (0, 1)$. Indeed, then we conclude from the Markov inequality that

$$P_0(|\bar{X}_n| > a) \le E_\varepsilon \cdot \exp\{-n(1 - \varepsilon) b a^r\}$$

$$\implies \liminf_{a\to\infty} \frac{-\ln P_0(|\bar{X}_n| > a)}{b a^r} \ge \lim_{a\to\infty} \frac{n(1 - \varepsilon) b a^r - \ln E_\varepsilon}{b a^r} = n(1 - \varepsilon)$$

and, because $\varepsilon$ is arbitrary, we arrive at proposition (2.18).
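The finite expectation (2.23) follows from the Hölder inequality:

$$\begin{aligned} E_0 \exp\left\{n(1 - \varepsilon) b |\bar{X}_n|^r\right\} &\le E_0 \exp\left\{(1 - \varepsilon) b \sum_{i=1}^n |X_i|^r\right\} \\ &\le \left(E_0\left[\exp\left\{(1 - \varepsilon) b |X_1|^r\right\}\right]\right)^n = 2^n \left(\int_0^\infty \exp\left\{(1 - \varepsilon) b x^r\right\} dF(x)\right)^n \end{aligned} \tag{2.24}$$

It follows from (2.17) that, given $\varepsilon > 0$, there exists an $A_\varepsilon > 0$ such that

$$1 - F(a) < \exp\left\{-\left(1 - \frac{\varepsilon}{2}\right) b a^r\right\}$$

holds for any $a \ge A_\varepsilon$.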
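The last integral in (2.24) can be successively rewritten in the following way:

$$\begin{aligned} \int_0^\infty \exp\{(1 - \varepsilon) b x^r\}\, dF(x) &= \int_0^{A_\varepsilon} \exp\{(1 - \varepsilon) b x^r\}\, dF(x) - \int_{A_\varepsilon}^\infty \exp\{(1 - \varepsilon) b x^r\}\, d(1 - F(x)) \\ &= \int_0^{A_\varepsilon} \exp\{(1 - \varepsilon) b x^r\}\, dF(x) + (1 - F(A_\varepsilon)) \cdot \exp\{(1 - \varepsilon) b A_\varepsilon^r\} \\ &\quad + \int_{A_\varepsilon}^\infty (1 - F(x))(1 - \varepsilon) b r x^{r-1} \cdot \exp\{(1 - \varepsilon) b x^r\}\, dx \\ &\le \int_0^{A_\varepsilon} \exp\{(1 - \varepsilon) b x^r\}\, dF(x) + \exp\left\{-\frac{\varepsilon}{2} b A_\varepsilon^r\right\} \\ &\quad + \int_{A_\varepsilon}^\infty (1 - \varepsilon) b r x^{r-1} \cdot \exp\left\{-\frac{\varepsilon}{2} b x^r\right\} dx < \infty \end{aligned}$$

and that leads to proposition (i).

(ii) If $F$ has heavy tails, then

$$\begin{aligned} P_0(|\bar{X}_n| > a) &= P_0(\bar{X}_n > a) + P_0(\bar{X}_n < -a) \\ &\ge P_0\left(X_1 > -a, \ldots, X_{n-1} > -a,\ X_n > (2n - 1)a\right) \\ &\quad + P_0\left(X_1 < a, \ldots, X_{n-1} < a,\ X_n < -(2n - 1)a\right) \\ &= 2(F(a))^{n-1}\left[1 - F((2n - 1)a)\right] \end{aligned}$$

hence

$$\limsup_{a\to\infty} B(\bar{X}_n; a) \le \limsup_{a\to\infty} \frac{-\ln\left[1 - F((2n - 1)a)\right]}{m \ln a} = \lim_{a\to\infty} \frac{-\ln\left[1 - F((2n - 1)a)\right]}{m \ln((2n - 1)a)} = 1$$

Combined with the lower bound in (2.16), this gives (2.20).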
(iii) Let $\tilde{X}_n$ be the sample median and let $n$ be odd. Then $\tilde{X}_n$ is the middle order statistic of the sample $X_1, \ldots, X_n$, i.e., $\tilde{X}_n = X_{(m)}$, $m = \frac{n+1}{2}$, and $F(\tilde{X}_n) = U_{(m)}$ has the beta distribution. Then

$$\begin{aligned} P_0(|\tilde{X}_n| > a) &= P_0(\tilde{X}_n > a) + P_0(\tilde{X}_n < -a) \\ &= 2n \binom{n-1}{m-1} \int_{F(a)}^1 u^{m-1}(1 - u)^{m-1}\, du \le 2n \binom{n-1}{m-1} (1 - F(a))^m \end{aligned}$$

and similarly

$$P_0(|\tilde{X}_n| > a) \ge \frac{2n}{m} \binom{n-1}{m-1} (F(a))^{m-1} (1 - F(a))^m$$

which leads to (2.22) after taking logarithms. We proceed analogously for $n$ even. $\Box$
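Theorem 2.1 can be illustrated by evaluating $B(T_n; a)$ exactly in a few cases where the distribution of the estimator is known in closed form. The sketch below is only an illustration: it uses the facts that the mean of $n$ standard normal variables is $N(0, 1/n)$, that the mean of $n$ standard Cauchy variables is again standard Cauchy, and that for odd $n$ the median exceeds $a$ exactly when at least $m = (n+1)/2$ observations exceed $a$. All function names are hypothetical. Since the convergence in $a$ is slow, the printed values only approach the limits $n$, $1$ and $(n+1)/2$.

```python
import math


def normal_sf(x):
    """1 - Phi(x) for the standard normal distribution, accurate in the far tail."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def cauchy_sf(x):
    """1 - F(x) for the standard Cauchy distribution, x > 0, accurate in the far tail."""
    return math.atan(1.0 / x) / math.pi


def B_mean_normal(n, a):
    """B(mean; a) under N(0,1): the sample mean is N(0, 1/n)."""
    return math.log(2.0 * normal_sf(a * math.sqrt(n))) / math.log(normal_sf(a))


def B_mean_cauchy(n, a):
    """B(mean; a) under Cauchy: the sample mean is again standard Cauchy (any n)."""
    return math.log(2.0 * cauchy_sf(a)) / math.log(cauchy_sf(a))


def B_median(n, a, sf):
    """B(median; a) for odd n and a symmetric F with survival function sf."""
    m = (n + 1) // 2
    p = sf(a)
    # P(median > a) = P(at least m of the n observations exceed a)
    tail = sum(math.comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(m, n + 1))
    return math.log(2.0 * tail) / math.log(p)


if __name__ == "__main__":
    n = 5
    print("mean, normal   (limit n = 5):", B_mean_normal(n, 6.0))
    print("mean, Cauchy   (limit 1):    ", B_mean_cauchy(n, 1e12))
    print("median, normal (limit 3):    ", B_median(n, 6.0, normal_sf))
    print("median, Cauchy (limit 3):    ", B_median(n, 1e12, cauchy_sf))
```

The mean moves between the two extremes of (2.16) depending on the tails of $F$, while the median stays near $(n+1)/2$ for both types of distributions.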
2.8 Variance of asymptotic normal distribution

If estimator $T_n$ of functional $T(\cdot)$ is asymptotically normally distributed as $n \to \infty$,

$$\mathcal{L}_P\left\{\sqrt{n}\,(T_n - T(P))\right\} \to N\left(0, V^2(P, T)\right)$$

then another possible robustness measure of $T$ is the supremum of the variance $V^2(P, T)$,

$$\sigma^2(T) = \sup_{P \in \mathcal{P}_0} V^2(P, T)$$

over a neighborhood $\mathcal{P}_0 \subset \mathcal{P}$ of the assumed model. The estimator minimizing $\sup_{P \in \mathcal{P}_0} V^2(P, T)$ over a specified class $\mathcal{T}$ of estimators of parameter $\theta$ is called minimaximally robust in the class $\mathcal{T}$. We shall show in the sequel that the classes of M-estimators, L-estimators and R-estimators all contain a minimaximally robust estimator of the shift and regression parameters in a class of contaminated normal distributions.
2.9 Problems and complements

2.1 Show that both the sample mean and the sample median of the random sample $X_1, \ldots, X_n$ are nondecreasing in each argument $X_i$, $i = 1, \ldots, n$.

2.2 Characterize distributions satisfying

$$\lim_{a\to\infty} \frac{-\ln(1 - F(a + c))}{-\ln(1 - F(a))} = 1 \tag{2.25}$$

for any fixed $c \in \mathbb{R}$. Show that this class contains distributions of type I and type II.

2.3 Let $X_1, \ldots, X_n$ be a random sample from a population with distribution function $F(x - \theta)$, where $F$ is symmetric, absolutely continuous, $0 < F(x) < 1$ for $x \in \mathbb{R}$, and satisfying (2.25). Let $T_n(X_1, \ldots, X_n)$ be a translation equivariant estimator of $\theta$, nondecreasing in each argument $X_i$, $i = 1, \ldots, n$. Then $T_n$ has a universal breakdown point $m_n^* = m_n^*(T_n)$ and there exists a constant $A$ such that

$$X_{n:m_n^*} - A \le T_n(X_1, \ldots, X_n) \le X_{n:n-m_n^*+1} + A$$

where $X_{n:1} \le X_{n:2} \le \ldots \le X_{n:n}$ are the order statistics of the sample $X_1, \ldots, X_n$. (Hint: see He et al. (1990).)

2.4 Let $T_n(X_1, \ldots, X_n)$ be a translation equivariant estimator of $\theta$, nondecreasing in each argument $X_i$, $i = 1, \ldots, n$. Then, under the conditions of Problem 2.2,

$$m_n^* \le \liminf_{a\to\infty} B(T_n; a) \le \limsup_{a\to\infty} B(T_n; a) \le n - m_n^* + 1 \tag{2.26}$$

Illustrate it on the sample median. (Hint: see He et al. (1990).)

2.5 Let $X_1, \ldots, X_n$ be a random sample from a population with distribution function $F(x - \theta)$. Compute the breakdown point of

$$T_n = \frac{1}{2}(X_{n:1} + X_{n:n})$$

This estimator is called the midrange (see the next chapter).

2.6 Show that the midrange (Problem 2.5) of the random sample $X_1, \ldots, X_n$ is nondecreasing in each argument $X_i$, $i = 1, \ldots, n$. Illustrate (2.26) for this estimator.

2.7 Determine whether the gamma distribution (Example 1.5) has exponential or heavy tails.
CHAPTER 3
Robust estimators of real parameter
3.1 Introduction

Let $X_1, \ldots, X_n$ be a random sample from a population with probability distribution $P$; the distribution is generally unknown, and we only assume that its distribution function $F$ belongs to some class $\mathcal{F}$ of distribution functions. We look for an appropriate estimator of parameter $\theta$ that can be expressed as a functional $T(P)$ of $P$. The same parameter $\theta$ can be characterized by several functionals; e.g., the center of symmetry is simultaneously the expected value, the median and the mode of the distribution, and it admits other characterizations as well. Some functionals $T(P)$ are characterized implicitly as a root of an equation (or of a system of equations) or as a solution of a minimization (maximization) problem: such are the maximum likelihood estimator, the moment estimator, etc. An estimator of parameter $\theta$ is obtained as an empirical functional, i.e., when one replaces $P$ in the functional $T(\cdot)$ with the empirical distribution corresponding to the vector of observations $X_1, \ldots, X_n$. We shall mainly deal with three broad classes of robust estimators of the real parameter: M-estimators, L-estimators, and R-estimators. We shall later extend these classes to other models, mainly to the linear regression model.
3.2 M-estimators

The class of M-estimators was introduced by P. J. Huber (1964); the properties of M-estimators are studied in his book (Huber (1981)), and also in the books by Andrews et al. (1972), Antoch et al. (1998), Bunke and Bunke (1986), Dodge and Jurečková (2000), Hampel et al. (1986), Jurečková and Sen (1996), Lecoutre and Tassi (1987), Rieder (1994), Rousseeuw and Leroy (1987), Staudte and Sheather (1990), and others. M-estimator $T_n$ is defined as a solution of the minimization problem

$$\sum_{i=1}^n \rho(X_i, \theta) := \min \quad \text{with respect to } \theta \in \Theta$$

or

$$E_{P_n}[\rho(X, \theta)] = \min, \quad \theta \in \Theta \tag{3.1}$$
where $\rho(\cdot, \cdot)$ is a properly chosen function. The class of M-estimators also covers the maximum likelihood estimator (MLE) of parameter $\theta$ in the parametric model $\mathcal{P} = \{P_\theta, \theta \in \Theta\}$; if $f(x, \theta)$ is the density function of $P_\theta$, then the MLE is a solution of the minimization

$$\sum_{i=1}^n \left(-\log f(X_i, \theta)\right) = \min, \quad \theta \in \Theta$$

If $\rho$ in (3.1) is differentiable in $\theta$ with a continuous derivative $\psi(\cdot, \theta) = \frac{\partial}{\partial\theta}\rho(\cdot, \theta)$, then $T_n$ is a root (or one of the roots) of the equation

$$\sum_{i=1}^n \psi(X_i, \theta) = 0, \quad \theta \in \Theta \tag{3.2}$$

hence

$$\frac{1}{n}\sum_{i=1}^n \psi(X_i, T_n) = E_{P_n}[\psi(X, T_n)] = 0, \quad T_n \in \Theta \tag{3.3}$$

We see from (3.1) and (3.3) that the M-functional, the statistical functional corresponding to $T_n$, is defined as a solution of the minimization

$$\int_{\mathcal{X}} \rho(x, T(P))\, dP(x) = E_P[\rho(X, T(P))] := \min, \quad T(P) \in \Theta \tag{3.4}$$

or as a solution of the equation

$$\int_{\mathcal{X}} \psi(x, T(P))\, dP(x) = E_P[\psi(X, T(P))] = 0, \quad T(P) \in \Theta \tag{3.5}$$

The functional $T(P)$ is Fisher consistent if the solutions of (3.4) and (3.5) are uniquely determined.
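3.2.1 Influence function of M-estimator

Assume that $\rho(\cdot, \theta)$ is differentiable, that its derivative $\psi(\cdot, \theta)$ is absolutely continuous with respect to $\theta$, and that the equation (3.5) has a unique solution $T(P)$. Let $P_t = (1 - t)P + t\delta_x$; then $T(P_t)$ solves the equation

$$\int_{\mathcal{X}} \psi(y, T(P_t))\, d\left((1 - t)P + t\delta_x\right) = 0$$

hence

$$(1 - t)\int_{\mathcal{X}} \psi(y, T(P_t))\, dP(y) + t\,\psi(x, T(P_t)) = 0 \tag{3.6}$$

Differentiating (3.6) in $t$, we obtain

$$\begin{aligned} &-\int_{\mathcal{X}} \psi(y, T(P_t))\, dP(y) + \psi(x, T(P_t)) + (1 - t)\frac{dT(P_t)}{dt} \int_{\mathcal{X}} \left[\frac{\partial}{\partial\theta}\psi(y, \theta)\right]_{\theta = T(P_t)} dP(y) \\ &\quad + t\,\frac{dT(P_t)}{dt}\left[\frac{\partial}{\partial\theta}\psi(x, \theta)\right]_{\theta = T(P_t)} = 0 \end{aligned}$$

We obtain the influence function of $T(P)$ if $t \downarrow 0$:

$$IF(x; T, P) = \frac{\psi(x, T(P))}{-\int_{\mathcal{X}} \dot\psi(y, T(P))\, dP(y)} \tag{3.7}$$

where $\dot\psi(y, T(P)) = \left[\frac{\partial}{\partial\theta}\psi(y, \theta)\right]_{\theta = T(P)}$.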
3.3 M-estimator of location parameter

An important special case is the model with the shift parameter $\theta$, where $X_1, \ldots, X_n$ are independent observations with the same distribution function $F(x - \theta)$, $\theta \in \mathbb{R}$; the distribution function $F$ is generally unknown. M-estimator $T_n$ is defined as a solution of the minimization

$$\sum_{i=1}^n \rho(X_i - \theta) := \min \tag{3.8}$$

and if $\rho(\cdot)$ is differentiable with absolutely continuous derivative $\psi(\cdot)$, then $T_n$ solves the equation

$$\sum_{i=1}^n \psi(X_i - \theta) = 0 \tag{3.9}$$

The corresponding M-functional $T(F)$ is Fisher consistent, provided the minimization $\int_{\mathcal{X}} \rho(x - \theta)\, dP(x) := \min$ has a unique solution $\theta = 0$. The influence function of $T(F)$ is then

$$IF(x; T, P) = \frac{\psi(x - T(P))}{\int_{\mathcal{X}} \psi'(y)\, dP(y)} \tag{3.10}$$

We see from the minimization (3.8) and from the equation (3.9) that $T_n$ is translation equivariant, i.e., that it satisfies

$$T_n(X_1 + c, \ldots, X_n + c) = T_n(X_1, \ldots, X_n) + c \quad \forall c \in \mathbb{R} \tag{3.11}$$

However, $T_n$ generally is not scale equivariant: the scale equivariance of $T_n$ means that

$$T_n(cX_1, \ldots, cX_n) = c\,T_n(X_1, \ldots, X_n) \quad \text{for } c > 0$$

If the model is symmetric, i.e., we have a reason to assume the symmetry of $F$ around 0, we should choose $\rho$ symmetric around 0 ($\psi$ would then be an odd function). If $\rho(x)$ is strictly convex (and thus $\psi(x)$ strictly increasing), then $\sum_{i=1}^n \rho(X_i - \theta)$ is strictly convex in $\theta$, and the M-estimator is uniquely determined. If $\rho(x)$ is linear in some interval $[a, b]$, then $\psi(\cdot)$ is constant in $[a, b]$, and the equation $\sum_{i=1}^n \psi(X_i - \theta) = 0$ can have more roots. There are many possible rules for choosing one among these roots; one possibility to obtain a unique solution is to define $T_n$ in the following way:

$$T_n = \frac{1}{2}(T_n^+ + T_n^-), \qquad T_n^- = \sup\Big\{t : \sum_{i=1}^n \psi(X_i - t) > 0\Big\}, \qquad T_n^+ = \inf\Big\{t : \sum_{i=1}^n \psi(X_i - t) < 0\Big\} \tag{3.12}$$
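The rule (3.12) translates directly into a numerical procedure when $\psi$ is nondecreasing, because $t \mapsto \sum_i \psi(X_i - t)$ is then nonincreasing and both $T_n^-$ and $T_n^+$ can be located by bisection. The following sketch is one possible implementation under that assumption; it uses a clipped linear $\psi$ with a fixed constant $k$ (no scale estimate), and the names `psi_clip` and `m_location` are hypothetical.

```python
def psi_clip(x, k=1.345):
    """Nondecreasing, odd psi: linear on [-k, k] and constant outside."""
    return max(-k, min(k, x))


def m_location(xs, psi=psi_clip, tol=1e-10):
    """M-estimate of location via (3.12): Tn = (Tn^- + Tn^+) / 2.

    Assumes psi is nondecreasing with psi(x) > 0 for x > 0 and psi(x) < 0 for x < 0,
    so that the score t -> sum psi(x_i - t) is nonincreasing and changes sign.
    """
    def score(t):
        return sum(psi(x - t) for x in xs)

    def boundary(strict):
        # strict=True  -> sup{t : score(t) > 0}  (Tn^-)
        # strict=False -> inf{t : score(t) < 0}  (Tn^+)
        lo, hi = min(xs) - 1.0, max(xs) + 1.0  # score(lo) > 0 > score(hi)
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            s = score(mid)
            if (s > 0) if strict else (s >= 0):
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    return 0.5 * (boundary(True) + boundary(False))


if __name__ == "__main__":
    print(m_location([-1.0, 0.0, 1.0]))          # symmetric sample
    print(m_location([-1.0, 0.0, 1.0, 1000.0]))  # one gross outlier
    print(m_location([0.0, 10.0], psi=lambda x: psi_clip(x, 1.0)))  # flat score on [1, 9]
```

In the last call the score vanishes on a whole interval, so the equation (3.9) has infinitely many roots and the midpoint rule (3.12) picks the center of that interval.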
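Similarly, we determine the M-estimator in the case of nondecreasing $\psi$ with jump discontinuities. If $\psi(\cdot)$ is nondecreasing, either continuous or having jump discontinuities, then the M-estimator $T_n$ obviously satisfies, for any $a \in \mathbb{R}$,

$$\begin{aligned} P_\theta\Big(\sum_{i=1}^n \psi(X_i - a) > 0\Big) &\le P_\theta(T_n > a) \le P_\theta(T_n \ge a) \le P_\theta\Big(\sum_{i=1}^n \psi(X_i - a) \ge 0\Big) \\ &= P_\theta\Big(\sum_{i=1}^n \psi(X_i - a) > 0\Big) + P_\theta\Big(\sum_{i=1}^n \psi(X_i - a) = 0\Big) \end{aligned} \tag{3.13}$$

The inequalities in (3.13) become equalities if $P_\theta\left\{\sum_{i=1}^n \psi(X_i - a) = 0\right\} = 0$. This further implies that, for any $y \in \mathbb{R}$,

$$\begin{aligned} P_0\Big(n^{-\frac{1}{2}} \sum_{i=1}^n \psi\Big(X_i - \frac{y}{\sqrt{n}}\Big) < 0\Big) &\le P_\theta\left(\sqrt{n}(T_n - \theta) < y\right) \\ &\le P_\theta\left(\sqrt{n}(T_n - \theta) \le y\right) \le P_0\Big(n^{-\frac{1}{2}} \sum_{i=1}^n \psi\Big(X_i - \frac{y}{\sqrt{n}}\Big) \le 0\Big) \end{aligned} \tag{3.14}$$

Because $\frac{1}{\sqrt{n}} \sum_{i=1}^n \psi\left(X_i - \frac{y}{\sqrt{n}}\right)$ is a standardized sum of independent identically distributed random variables, the asymptotic probability distribution of $\sqrt{n}(T_n - \theta)$ can be derived from the central limit theorem by means of (3.14).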
3.3.1 Breakdown point of M-estimator of location parameter

If $M_n$ estimates the center of symmetry $\theta$ of $F(x - \theta)$, then its breakdown point follows from Section 2.6:

$$\varepsilon^* = \lim_{n\to\infty} \varepsilon_n^* = 0 \quad \text{if } \psi(\cdot) \text{ is an unbounded function}$$

$$\varepsilon^* = \lim_{n\to\infty} \varepsilon_n^* = \frac{1}{2} \quad \text{if } \psi \text{ is odd and bounded}$$

Hence, the class of M-estimators contains robust as well as non-robust elements.

Example 3.1

(a) Expected value: The expected value $\theta = E_P X$ is an M-functional with the criterion function $\rho(x) = x^2$, $\psi(x) = 2x$ and $\psi'(x) = 2$. Its influence function follows from (3.10):

$$IF(x; T, P) = \frac{2(x - E_P(X))}{\int_{\mathbb{R}} 2\, dP} = x - E_P(X)$$

The corresponding M-estimator is the sample (arithmetic) mean $\bar{X}_n$; its breakdown point $\varepsilon^* = \lim_{n\to\infty} \varepsilon_n^*$ is equal to 0, and its global sensitivity is $\gamma^* = +\infty$.

(b) Median: The median $\tilde{X} = F^{-1}(\frac{1}{2})$ can be considered as an M-functional with the criterion function $\rho(x) = |x|$, and the sample median $T_n = \tilde{X}_n$ is a solution of the minimization

$$\sum_{i=1}^n |X_i - \theta| := \min, \quad \theta \in \mathbb{R}$$

To derive the influence function of the median, assume that the probability distribution $P$ has a continuous distribution function $F$, strictly increasing in an interval $(a, b)$, $-\infty \le a < b \le \infty$, and differentiable in a neighborhood of $\tilde{X}$. Let $F_t$ be the distribution function of the contaminated distribution $P_t = (1 - t)P + t\delta_x$. The median $T(P_t)$ is a solution of the equation $F_t(u) = \frac{1}{2}$, i.e.,

$$(1 - t)F(T(P_t)) + t\, I[x < T(P_t)] = \frac{1}{2}$$

which leads to

$$T(P_t) = \begin{cases} F^{-1}\left(\frac{1}{2(1-t)}\right) & \ldots\ x > T(P_t) \\ F^{-1}\left(\frac{1-2t}{2(1-t)}\right) & \ldots\ x \le T(P_t) \end{cases}$$

The function $T(P_t)$ is continuous at $t = 0$, because $T(P_t) \to \tilde{X} = T(P)$ as $t \to 0$; using the following expansions around $t = 0$,

$$\frac{1}{2(1-t)} = \frac{1}{2} + \frac{t}{2} + O(t^2) \quad \text{and} \quad \frac{1-2t}{2(1-t)} = \frac{1}{2} - \frac{t}{2} + O(t^2)$$

we obtain

$$\lim_{t\to 0} \frac{1}{t}\left[T(P_t) - F^{-1}\left(\tfrac{1}{2}\right)\right] = \frac{1}{2}\, \mathrm{sign}\left(x - F^{-1}\left(\tfrac{1}{2}\right)\right) \left[\frac{dF^{-1}(u)}{du}\right]_{u = \frac{1}{2}}$$

and this, in turn, leads to the influence function of the median

$$IF(x; \tilde{X}, F) = \frac{\mathrm{sign}(x - \tilde{X})}{2f(\tilde{X})} \tag{3.15}$$
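The influence function of the median is bounded, hence the median is robust, while the expected value is non-robust. The breakdown point of the median is $\varepsilon^* = \frac{1}{2}$, and its global sensitivity is $\gamma^* = \frac{1}{2f(\tilde{X})}$ ($\gamma^* = 1.253$ for the standard normal distribution $N(0, 1)$). By (3.15), $E\left(IF(x; \tilde{X}, P)\right)^2 = \frac{1}{4f^2(\tilde{X})} = \text{const}$, and we can show that the sequence $\sqrt{n}(\tilde{X}_n - \tilde{X})$ is asymptotically normally distributed,

$$\mathcal{L}\left\{\sqrt{n}(\tilde{X}_n - \tilde{X})\right\} \to N\left(0, \frac{1}{4f^2(\tilde{X})}\right)$$

as $n \to \infty$. In particular, if $F$ is the distribution function of the normal distribution $N(\mu, \sigma^2)$, then $f^2(\tilde{X}) = f^2(\mu) = \frac{1}{2\pi\sigma^2}$ and

$$\mathcal{L}\left\{\sqrt{n}(\tilde{X}_n - \tilde{X})\right\} \to N\left(0, \frac{\pi\sigma^2}{2}\right)$$

(c) Maximum likelihood estimator of parameter $\theta$ of the probability distribution with density $f(x, \theta)$:

$$\rho(x, T(P)) = -\log f(x, T(P))$$

$$\psi(x, T(P)) = -\left[\frac{\partial}{\partial\theta} \log f(x, \theta)\right]_{\theta = T(P)}$$

$$IF(x; T, P) = \frac{1}{I_f(T(P))} \cdot \frac{\dot{f}(x, T(P))}{f(x, T(P))}$$

where

$$\dot{f}(x, T(P)) = \left[\frac{\partial}{\partial\theta} f(x, \theta)\right]_{\theta = T(P)}$$

and

$$I_f(T(P)) = \int_{\mathcal{X}} \left(\left[\frac{\partial}{\partial\theta} \log f(x, \theta)\right]_{\theta = T(P)}\right)^2 f(x, T(P))\, dx$$

is the Fisher information of distribution $f$ at the point $\theta = T(P)$.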
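The constants quoted for the median in part (b) of Example 3.1 are easy to reproduce. The sketch below evaluates the global sensitivity $\gamma^* = 1/(2f(\tilde{X}))$ and the asymptotic variance $1/(4f^2(\tilde{X}))$ for a normal model; the helper names are hypothetical, and the comparison with the variance $\sigma^2$ of the sample mean is added only as an illustration.

```python
import math
from statistics import NormalDist


def median_sensitivity(dist):
    """Global sensitivity of the median: gamma* = 1 / (2 f(median))."""
    return 1.0 / (2.0 * dist.pdf(dist.median))


def median_asymptotic_variance(dist):
    """Variance of the limiting normal law of sqrt(n) (median_n - median)."""
    return 1.0 / (4.0 * dist.pdf(dist.median) ** 2)


if __name__ == "__main__":
    std = NormalDist(0.0, 1.0)
    print("gamma* for N(0,1):", round(median_sensitivity(std), 3))

    model = NormalDist(3.0, 2.0)  # N(mu, sigma^2) with sigma = 2
    v = median_asymptotic_variance(model)
    print("asymptotic variance:", v, " pi*sigma^2/2 =", math.pi * model.variance / 2)
    # Ratio of the asymptotic variance of the mean (sigma^2) to that of the median.
    print("variance ratio mean/median:", model.variance / v)
```

Under normality the ratio equals $2/\pi \approx 0.64$, which quantifies the price paid for the bounded influence function of the median.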
3.3.2 Choice of function ψ

The M-estimator is determined by the choice of the criterion function $\rho$ or of its derivative $\psi$. If the location parameter coincides with the center of symmetry of the distribution, we choose $\rho$ symmetric around zero and hence $\psi$ odd. The influence function of an M-estimator is proportional to $\psi(x - T(P))$ (see (3.10)). Hence, a robust M-estimator should be generated by a bounded $\psi$. Let us describe the best-known and most popular types of functions $\psi$ (and $\rho$) that we can find in the literature.
The expected value is an M-functional generated by a linear, and hence unbounded, function $\psi$. The corresponding M-estimator is the sample mean $\bar{X}_n$, which is the maximum likelihood estimator of the location parameter of the normal distribution. However, this functional is closely connected with the normal distribution and is highly non-robust. If we look for an M-estimator of the location parameter of a distribution not very far from the normal distribution, but possibly containing an $\varepsilon$ ratio of nonnormal data, more precisely, a distribution that belongs to the family

$$\mathcal{F} = \{F : F = (1 - \varepsilon)\Phi + \varepsilon H\}$$

where $H$ runs over symmetric distribution functions, we should use the function $\psi$ proposed and motivated by P. J. Huber (1964). This function is linear in a bounded segment $[-k, k]$ and constant outside this segment, see Figure 3.1:

$$\psi_H(x) = \begin{cases} x & \ldots\ |x| \le k \\ k\, \mathrm{sign}\, x & \ldots\ |x| > k \end{cases} \tag{3.16}$$

where $k > 0$ is a fixed constant, connected with $\varepsilon$ through the following identity:

$$2\Phi(k) - 1 + \frac{2\Phi'(k)}{k} = \frac{1}{1 - \varepsilon} \tag{3.17}$$

The corresponding M-estimator is very popular and is often called the Huber estimator in the literature. It has a bounded influence function proportional to $\psi_H$ (following from (3.10)), the breakdown point $\varepsilon^* = \frac{1}{2}$, the global sensitivity $\gamma^* = \frac{k}{2F(k) - 1}$, and the tail-behavior measure $\lim_{a\to\infty} B(a, T_n, F) = \frac{1}{2}$ both for distributions with exponential and heavy tails. Thus, it is a robust estimator of the center of symmetry, insensitive to extreme and outlying observations. As Huber proved in 1964, an estimator generated by the function (3.16) is minimaximally robust for a contaminated normal distribution, while the value $k$ depends on the contamination ratio. An interesting and natural question is whether there exists a distribution $F$ such that the Huber M-estimator is the maximum likelihood estimator of $\theta$ for $F(x - \theta)$, i.e., such that $\psi$ is the likelihood function for $F$. Such a distribution really exists, and its density is normal in the interval $[-k, k]$ and exponential outside.
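The identity (3.17) determines $k$ only implicitly, but its left-hand side is strictly decreasing in $k$ (its derivative is $-2\Phi'(k)/k^2$), so the root can be found by bisection. The sketch below is one way to do this with the standard library; the function names and the bracketing interval are assumptions made for this example.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal: cdf is Phi, pdf is Phi'


def huber_lhs(k):
    """Left-hand side of (3.17): 2 Phi(k) - 1 + 2 Phi'(k) / k."""
    return 2.0 * _STD.cdf(k) - 1.0 + 2.0 * _STD.pdf(k) / k


def huber_k(eps, lo=1e-8, hi=40.0, tol=1e-12):
    """Solve (3.17) for k by bisection; the left-hand side decreases from +inf to 1."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    target = 1.0 / (1.0 - eps)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if huber_lhs(mid) > target:
            lo = mid  # still left of the root
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for eps in (0.01, 0.05, 0.10, 0.20):
        print(f"eps = {eps:4.2f}   k = {huber_k(eps):.3f}")
```

A larger contamination ratio leads to a smaller $k$, i.e., to a $\psi_H$ that clips observations earlier and behaves more like the median.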
3.3.3 Other choices of ψ

Some authors recommend reducing the effect of outliers even more and choosing a redescending function $\psi(x)$, tending to 0 as $x \to \pm\infty$, possibly vanishing outside a bounded interval containing 0. Such is the likelihood function of the Cauchy distribution, see Figure 3.2:

$$\psi_C(x) = -\frac{f'(x)}{f(x)} = \frac{2x}{1 + x^2} \tag{3.18}$$

where $f(x) = \frac{1}{\pi(1 + x^2)}$ is the density of the Cauchy distribution.

Another example is the Tukey biweight function (Figure 3.3),

$$\psi_T(x) = \begin{cases} x\left(1 - \left(\frac{x}{k}\right)^2\right)^2 & \ldots\ |x| \le k \\ 0 & \ldots\ |x| > k \end{cases} \tag{3.19}$$

and the Andrews sinus function (Figure 3.4),

$$\psi_A(x) = \begin{cases} \sin\frac{x}{k} & \ldots\ |x| \le k\pi \\ 0 & \ldots\ |x| > k\pi \end{cases} \tag{3.20}$$

Hampel (1974) proposed a continuous, piecewise linear function $\psi$ (see Figure 3.5), vanishing outside a bounded interval:

$$\psi_{HA}(x) = \begin{cases} |x|\, \mathrm{sign}\, x & \ldots\ |x| < a \\ a\, \mathrm{sign}\, x & \ldots\ a \le |x| < b \\ a\, \frac{c - |x|}{c - b}\, \mathrm{sign}\, x & \ldots\ b \le |x| < c \\ 0 & \ldots\ |x| \ge c \end{cases} \tag{3.21}$$

In the robustness literature we can also find the skipped mean, generated by the function (Figure 3.6)

$$\psi^*(x) = \begin{cases} x & \ldots\ |x| \le k \\ 0 & \ldots\ |x| > k \end{cases} \tag{3.22}$$

or the skipped median, generated by the function (Figure 3.7)

$$\tilde\psi(x) = \begin{cases} -1 & \ldots\ -k \le x < 0 \\ 0 & \ldots\ |x| > k \\ 1 & \ldots\ 0 \le x \le k \end{cases} \tag{3.23}$$

The redescending functions are not monotone, and their corresponding primitive functions $\rho$ are not convex. Besides the global minimum, the function $\sum_{i=1}^n \rho(X_i - \theta)$ can have local extremes, inducing further roots of the equation $\sum_{i=1}^n \psi(X_i - \theta) = 0$. Moreover, the functions $\psi$ generating the skipped mean and the skipped median have jump discontinuities, and hence the equation $\sum_{i=1}^n \psi(X_i - \theta) = 0$ generally has no solution; the corresponding M-estimator must be calculated as a global minimum of the function $\sum_{i=1}^n \rho(X_i - \theta)$.
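For experimentation, the score functions (3.16) and (3.18) to (3.22) can be written as plain Python functions. This is a sketch rather than a reference implementation: the default constants are those quoted in the figure captions, the biweight is taken in its usual squared form, the Hampel function is set to 0 at $|x| = c$ by continuity, and all function names are hypothetical.

```python
import math


def sign(x):
    """Sign of x as -1, 0 or 1."""
    return (x > 0) - (x < 0)


def psi_huber(x, k=1.345):
    """(3.16): linear on [-k, k], constant outside."""
    return x if abs(x) <= k else k * sign(x)


def psi_cauchy(x):
    """(3.18): likelihood score of the Cauchy distribution."""
    return 2.0 * x / (1.0 + x * x)


def psi_tukey(x, k=1.345):
    """(3.19): biweight, vanishing outside [-k, k]."""
    return x * (1.0 - (x / k) ** 2) ** 2 if abs(x) <= k else 0.0


def psi_andrews(x, k=1.339):
    """(3.20): sine wave on [-k*pi, k*pi], zero outside."""
    return math.sin(x / k) if abs(x) <= k * math.pi else 0.0


def psi_hampel(x, a=2.0, b=4.0, c=8.0):
    """(3.21): three-part redescending, piecewise linear function."""
    ax = abs(x)
    if ax < a:
        return x
    if ax < b:
        return a * sign(x)
    if ax < c:
        return a * (c - ax) / (c - b) * sign(x)
    return 0.0


def psi_skipped_mean(x, k=1.345):
    """(3.22): identity on [-k, k], zero outside (jump at |x| = k)."""
    return x if abs(x) <= k else 0.0


if __name__ == "__main__":
    for x in (-10.0, -3.0, -1.0, 0.0, 1.0, 3.0, 10.0):
        print(x, psi_huber(x), round(psi_cauchy(x), 3), round(psi_tukey(x), 3),
              round(psi_andrews(x), 3), psi_hampel(x), psi_skipped_mean(x))
```

Evaluating these functions on a grid shows the qualitative difference discussed above: `psi_huber` stays at $\pm k$ for large $|x|$, whereas the redescending functions return to 0, so distant observations have no influence at all.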
Figure 3.1 Huber function $\psi_H$ with $k = 1.345$.

Figure 3.2 Cauchy function $\psi_C$.

Figure 3.3 Tukey biweight function $\psi_T$ with $k = 1.345$.

Figure 3.4 Andrews sinus function $\psi_A$ with $k = 1.339$.

Figure 3.5 Hampel function $\psi_{HA}$ with $a = 2$, $b = 4$, $c = 8$.

Figure 3.6 Skipped means function $\psi^*$ with $k = 1.345$.

Figure 3.7 Skipped medians function $\tilde\psi$ with $k = 1.345$.
3.4 Finite sample minimax property of M-estimator

The Huber estimator is asymptotically minimax over the family of contaminated normal distributions. We shall now illustrate another, finite sample, minimax property of the Huber M-estimator, proved by Huber in 1968. Consider a random sample from a population with distribution function $F(x - \theta)$, where both $F$ and $\theta$ are unknown, and assume that $F$ belongs to the Kolmogorov $\varepsilon$-neighborhood of the standard normal distribution, i.e.,

$$F \in \mathcal{F} = \Big\{F : \sup_{x \in \mathbb{R}} |F(x) - \Phi(x)| \le \varepsilon\Big\} \tag{3.24}$$

where $\Phi$ is the standard normal distribution function. Fix an $a > 0$ and consider the inaccuracy measure of an estimator $T$ of $\theta$:

$$\sup_{F \in \mathcal{F},\, \theta \in \mathbb{R}} P_\theta(|T - \theta| > a) \tag{3.25}$$

Let $T_H$ be a slightly modified, randomized Huber estimator:

$$T_H = \begin{cases} T^* & \text{with probability } \frac{1}{2} \\ T^{**} & \text{with probability } \frac{1}{2} \end{cases} \tag{3.26}$$

where

$$T^* = \sup\Big\{t : \sum_{i=1}^n \psi_H(X_i - t) \ge 0\Big\}, \qquad T^{**} = \inf\Big\{t : \sum_{i=1}^n \psi_H(X_i - t) \le 0\Big\} \tag{3.27}$$

$\psi_H$ is the Huber function (3.16), and the randomization does not depend on $X_1, \ldots, X_n$. Then $T_H$ is translation equivariant, i.e., it satisfies (3.11). We shall show that $T_H$ minimizes the inaccuracy (3.25) among all translation equivariant estimators of $\theta$. To be more precise, let us formulate it as a theorem. The sketch of the proof of Theorem 3.1 can be omitted on the first reading.

Theorem 3.1 Assume that the bound $k$ in (3.16) is connected with $\varepsilon$ and with $a > 0$ in (3.25) through the following identity:

$$e^{-2ak}\left[\Phi(a - k) - \varepsilon\right] + \Phi(a + k) - \varepsilon = 1 \tag{3.28}$$

where $\Phi$ is the standard normal distribution function. Then the estimator $T_H$ defined in (3.26) and (3.27) minimizes the inaccuracy (3.25) in the family of translation equivariant estimators of $\theta$.

Sketch of the proof: The main idea is to construct a minimax test of the hypothesis that the parameter equals $-a$ against the alternative that it equals $+a$. The estimator $T_H$ will be an inversion of this minimax test. Let $\Phi$ be the standard normal distribution function and $\varphi(x)$, $x \in \mathbb{R}$, be its density. Moreover, denote by

$$p_-(x) = \varphi(x - a), \qquad p_+(x) = \varphi(x + a), \quad x \in \mathbb{R}$$

the shifted normal densities, by $\Phi_-(x) = \Phi(x - a)$ and $\Phi_+(x) = \Phi(x + a)$ their distribution functions, and by $P_-$ and $P_+$ the corresponding probability distributions. Then $\Phi_-(x) < \Phi_+(x)$ $\forall x$, and the likelihood ratio

$$\frac{p_-(x)}{p_+(x)} = e^{2ax} \tag{3.29}$$

is strictly increasing in $x$. Introduce two families of distribution functions:

$$\begin{aligned} \mathcal{F}_- &= \{G \in \mathcal{F} : G(x) \le \Phi(x - a) + \varepsilon\ \ \forall x \in \mathbb{R}\} \\ \mathcal{F}_+ &= \{G \in \mathcal{F} : G(x) \ge \Phi(x + a) - \varepsilon\ \ \forall x \in \mathbb{R}\} \end{aligned} \tag{3.30}$$

We can assume that $\mathcal{F}_- \cap \mathcal{F}_+ = \emptyset$, which is true for sufficiently small $\varepsilon$. We shall look for the minimax test of the hypothesis $H : F \in \mathcal{F}_-$ against the alternative $K : F \in \mathcal{F}_+$. This test will be the likelihood ratio test of two least favorable distributions of the families $\mathcal{F}_-$ and $\mathcal{F}_+$, respectively. We shall show that the least favorable distributions have the densities

$$g_-(x) = \begin{cases} [p_+(x) + p_-(x)](1 + e^{2ak})^{-1} & \ldots\ x < -k \\ p_-(x) & \ldots\ |x| \le k \\ [p_+(x) + p_-(x)](1 + e^{-2ak})^{-1} & \ldots\ x > k \end{cases} \tag{3.31}$$

and

$$g_+(x) = \begin{cases} [p_+(x) + p_-(x)](1 + e^{-2ak})^{-1} & \ldots\ x < -k \\ p_+(x) & \ldots\ |x| \le k \\ [p_+(x) + p_-(x)](1 + e^{2ak})^{-1} & \ldots\ x > k \end{cases} \tag{3.32}$$

Denote by $G_-$, $G_+$ the distribution functions and by $Q_-$, $Q_+$ the probability distributions corresponding to the densities $g_-$, $g_+$, respectively. We can easily verify that the log-likelihood ratio of $g_-$ and $g_+$ is connected with $\psi_H$ in the following way:

$$\ln \prod_{i=1}^n \frac{g_-(X_i)}{g_+(X_i)} = 2a \sum_{i=1}^n \psi_H(X_i)$$

and the likelihood ratio test of the hypothesis that the true distribution is $g_+$ against $g_-$ with minimax risk $\alpha$ rejects the hypothesis for large values of the likelihood ratio, i.e., when $\sum_{i=1}^n \psi_H(X_i) > K$ for a suitable $K$. Mathematically, such a test is characterized by a test function $\zeta(x)$ that is the probability that the test rejects the hypothesis under observation $x$:

$$\zeta(x) = \begin{cases} 1 & \ldots\ \sum_{i=1}^n \psi_H(X_i) > K \\ \gamma & \ldots\ \sum_{i=1}^n \psi_H(X_i) = K \\ 0 & \ldots\ \sum_{i=1}^n \psi_H(X_i) < K \end{cases} \tag{3.33}$$

where $K$ is determined so that

$$Q_+\Big(\sum_{i=1}^n \psi_H(X_i) > K\Big) = \alpha, \qquad Q_-\Big(\sum_{i=1}^n \psi_H(X_i) > K\Big) = 1 - \alpha \tag{3.34}$$

and $\alpha \in (0, \frac{1}{2})$ is the minimax risk; from the symmetry we conclude that $K = 0$ and $\gamma = \frac{1}{2}$. It remains to show that $G_-$, $G_+$ is really the least favorable pair of distribution functions for the families (3.30), and that the test (3.33) is minimax.
But it follows from the inequalities

$$Q'_-\left(\frac{g_-(X)}{g_+(X)} > t\right) \ge Q_-\left(\frac{g_-(X)}{g_+(X)} > t\right), \qquad Q'_+\left(\frac{g_-(X)}{g_+(X)} > t\right) \le Q_+\left(\frac{g_-(X)}{g_+(X)} > t\right) \tag{3.35}$$

that hold for all $Q'_- \in \mathcal{F}_-$ and $Q'_+ \in \mathcal{F}_+$ and $t > 0$. Indeed, it is trivially true for $\frac{1}{2a}\ln t < -k$ and $\frac{1}{2a}\ln t > k$, and for $-k \le \frac{1}{2a}\ln t \le k$ it follows from (3.30), (3.31) and (3.32).

If the distribution $P$ of $X_i$ belongs to $\mathcal{F}_-$, $i = 1, \ldots, n$, then it follows from (3.35) that the likelihood ratio $\prod_{i=1}^n \frac{g_-(X_i)}{g_+(X_i)}$ is the stochastically smallest provided the $X_i$ are identically distributed with density $g_-$. Analogously, if the distribution $P$ of $X_i$ belongs to $\mathcal{F}_+$, $i = 1, \ldots, n$, then the likelihood ratio $\prod_{i=1}^n \frac{g_-(X_i)}{g_+(X_i)}$ is the stochastically largest provided the $X_i$ are identically distributed with density $g_+$. Thus the test (3.33) minimizes

$$\max\Big\{ \sup_{G' \in \mathcal{F}_+} E_{G'}(\zeta),\ \sup_{G' \in \mathcal{F}_-} E_{G'}(1 - \zeta) \Big\} \tag{3.36}$$

hence it is really minimax.

If the distribution of $X - \theta$ belongs to $\mathcal{F}$, then that of $X - \theta - a$ and that of $X - \theta + a$ belongs to $\mathcal{F}_+$ and $\mathcal{F}_-$, respectively, and

$$\begin{aligned}
P_\theta(T_H(X) > \theta + a) &= P_0(T_H(X) > a) = \tfrac12 P_0(T^*(X) > a) + \tfrac12 P_0(T^{**}(X) > a) \\
&= \tfrac12 P_0\big(T^*(X_1 - a, \ldots, X_n - a) > 0\big) + \tfrac12 P_0\big(T^{**}(X_1 - a, \ldots, X_n - a) > 0\big) \\
&\le \tfrac12 P_0\Big(\sum_{i=1}^n \psi_H(X_i - a) > 0\Big) + \tfrac12 P_0\Big(\sum_{i=1}^n \psi_H(X_i - a) \ge 0\Big) \\
&= E_{P_0}\big(\zeta(X_1 - a, \ldots, X_n - a)\big) \le E_{Q_+}\big(\zeta(X_1 - a, \ldots, X_n - a)\big) = \alpha
\end{aligned}$$

as it follows from (3.34) and (3.36). Similarly we verify that

$$P_\theta(T_H(X) < \theta - a) \le \alpha$$

Let $T$ now be a translation equivariant estimator. Because the distributions $Q_+$ and $Q_-$ are absolutely continuous, $T$ has a continuous distribution function both under $Q_+$ and $Q_-$ (see Problem 3.4), hence

$$Q_+(T(X) = 0) = Q_-(T(X) = 0) = 0.$$
Then $T$ induces a test of $\mathcal{F}_+$ against $\mathcal{F}_-$ rejecting when $T(X) > 0$, and because the test based on the Huber estimator is minimax with the minimax risk $\alpha$, we conclude that

$$\sup_{Q'_+ \in \mathcal{F}_+,\ Q'_- \in \mathcal{F}_-} \max\big\{ Q'_+(T > 0),\ Q'_-(T < 0) \big\} \ge \alpha$$

hence no equivariant estimator can be better than $T_H$. □

3.5 Moment convergence of M-estimators
Summarizing the conditions imposed on a good estimator, it is desirable to have an M-estimator $T_n$ with a bounded influence function and with breakdown point 1/2, which estimates $\theta$ consistently with the rate of consistency $\sqrt{n}$, and such that $\sqrt{n}(T_n - \theta)$ has an asymptotic normal distribution. The asymptotic distribution naturally has finite moments; however, we also wish $T_n$ to have finite moments tending to the moments of the asymptotic distribution. In other words, we would welcome the uniform integrability of the sequence $\sqrt{n}(T_n - \theta)$ and its powers. Indeed, we can prove the moment convergence of M-estimators for a broad class of bounded $\psi$-functions and under some conditions on the density $f$. For an illustration, we shall prove it under the following conditions (A.1) and (A.2). The conditions can still be weakened, but (A.1) and (A.2) already cover a broad class of M-estimators with a bounded influence function. This was first proved by Jurečková and Sen (1982).

(A.1) $X_1, \ldots, X_n$ is a random sample from a distribution with density $f(x - \theta)$, where $f$ is positive, symmetric, absolutely continuous and nonincreasing for $x \ge 0$; we assume that $f$ has positive and finite Fisher information,

$$0 < I(f) = \int_{\mathbb{R}} \left(\frac{f'(x)}{f(x)}\right)^2 dF(x) < \infty$$

and that there exists a positive number $\ell$ (not necessarily an integer or $\ge 1$) such that

$$E|X_1|^\ell = \int_{\mathbb{R}} |x|^\ell\, dF(x) < \infty$$

(A.2) $\psi$ is nondecreasing and skew-symmetric, $\psi(x) = -\psi(-x)$, $x \in \mathbb{R}$, and

$$\psi(x) = \psi(c) \cdot \operatorname{sign} x \quad \text{for } |x| > c, \quad c > 0$$

Moreover, $\psi$ can be decomposed into absolutely continuous and step components, i.e., $\psi(x) = \psi_1(x) + \psi_2(x)$, $x \in \mathbb{R}$, where $\psi_1$ is absolutely continuous inside $(-c, c)$ and $\psi_2$ is a step function that has a finite number of jumps inside $(-c, c)$, i.e.,

$$\psi_2(x) = b_j \ \ldots\ d_{j-1} < x < d_j, \quad j = 1, \ldots, m + 1, \qquad d_0 = -c,\ d_{m+1} = c$$
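Theorem 3.2 For every $r > 0$, there exists $n_r < \infty$ such that, under conditions (A.1) and (A.2),

$$E_\theta\big[ n^{r/2} |T_n - \theta|^r \big] < \infty, \quad \text{uniformly in } n \ge n_r \tag{3.37}$$

Moreover,

$$\lim_{n \to \infty} E_\theta \big( \sqrt{n}\,|T_n - \theta| \big)^r = \nu^r \int_{\mathbb{R}} |x|^r\, d\Phi(x) \tag{3.38}$$

and, especially,

$$\lim_{n \to \infty} E_\theta \big( \sqrt{n}\,|T_n - \theta| \big)^{2r} = \nu^{2r}\, \frac{(2r)!}{2^r\, r!} \tag{3.39}$$

for $r = 1, 2, \ldots$, where

$$\nu^2 = \frac{\sigma^2}{\gamma^2}, \qquad \sigma^2 = \int_{\mathbb{R}} \psi^2(x)\, dF(x), \qquad \gamma = \int_{\mathbb{R}} \psi_1'(x)\, dF(x) + \sum_{j=1}^m (b_j - b_{j-1}) f(d_j) \ (> 0) \tag{3.40}$$

and $\Phi$ is the standard normal distribution function. Furthermore, $n^{1/2}(T_n - \theta)$ is asymptotically normally distributed,

$$\mathcal{L}\{ n^{1/2}(T_n - \theta) \} \to N(0, \nu^2) \tag{3.41}$$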
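The constant in (3.39) is the $2r$-th absolute moment of the standard normal distribution. A minimal sketch of a numerical check in R follows; the helper names `normal_abs_moment` and `even_moment_closed` are illustrative and not part of any package.

```r
# Absolute moments of N(0, 1) by numerical integration of |x|^r against dnorm.
normal_abs_moment <- function(r) {
  integrate(function(x) abs(x)^r * dnorm(x), lower = -Inf, upper = Inf)$value
}

# Closed form (2r)! / (2^r r!) appearing on the right-hand side of (3.39).
even_moment_closed <- function(r) factorial(2 * r) / (2^r * factorial(r))

r_values <- 1:4
numeric_moments <- sapply(2 * r_values, normal_abs_moment)
closed_moments <- sapply(r_values, even_moment_closed)
print(cbind(r = r_values, numeric = numeric_moments, closed = closed_moments))
```

The two columns should agree up to the tolerance of `integrate`; multiplying by $\nu^{2r}$ then gives the limiting moments of $\sqrt{n}\,|T_n - \theta|$.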
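Sketch of the proof. We can put $\theta = 0$, without loss of generality. First, because $F$ has a finite $\ell$-th absolute moment,

$$\max_{x \in \mathbb{R}} \big\{ |x|^\ell F(x)(1 - F(x)) \big\} = C_\ell < \infty$$

and

$$\int_{\mathbb{R}} [F(x)(1 - F(x))]^\lambda\, dx < \infty \quad \forall \lambda > \frac{1}{\ell} > 0 \tag{3.42}$$

Let $a_1 > c > 0$, where $c$ comes from condition (A.2). Then

$$E\big( n^{r/2} |T_n|^r \big) = \int_0^\infty r t^{r-1} P(\sqrt{n}\,|T_n| > t)\, dt = \left( \int_0^{a_1\sqrt{n}} + \int_{a_1\sqrt{n}}^\infty \right) r t^{r-1} P(\sqrt{n}\,|T_n| > t)\, dt = I_{n1} + I_{n2} \tag{3.43}$$

We shall first estimate the probability

$$P(\sqrt{n}\,|T_n| > t) = 2P(\sqrt{n}\, T_n > t) \le 2P\left( \frac1n \sum_{i=1}^n \psi(X_i - t n^{-1/2}) - \frac1n \sum_{i=1}^n E\psi(X_i - t n^{-1/2}) \ge -E[\psi(X_1 - t n^{-1/2})] \right) \tag{3.44}$$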
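where we use the inequalities

$$\begin{aligned}
-E[\psi(X_1 - t n^{-1/2})] &= -E[\psi(X_1 - t n^{-1/2}) - \psi(X_1)] \\
&= \int_{-c}^{c} [F(x + t n^{-1/2}) - F(x)]\, d\psi(x) \\
&= \int_0^{c} [F(x + t n^{-1/2}) - F(x - t n^{-1/2})]\, d\psi(x) \\
&\ge 2 t n^{-1/2} f(c + t n^{-1/2}) [\psi(c) - \psi(0)] \\
&\ge 2 t n^{-1/2} f(c + a_1) \psi(c), \qquad \forall t \in (0, a_1\sqrt{n})
\end{aligned} \tag{3.45}$$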
based on the facts that $f(x) \downarrow 0$ as $x \to \infty$ and $\psi(x) + \psi(-x) = 0$, $F(x) + F(-x) = 1$ $\forall x \in \mathbb{R}$. Hence, by (3.44) and (3.45), for $0 < t < a_1\sqrt{n}$,

$$P(\sqrt{n}\,|T_n| > t) \le 2P\left( \frac1n \sum_{i=1}^n Z_{ni} \ge 2 t n^{-1/2} f(c + a_1) \psi(c) \right)$$

where

$$Z_{ni} = \psi\big(X_i - t n^{-1/2}\big) - E\psi\big(X_i - t n^{-1/2}\big), \quad i = 1, \ldots, n$$

are independent random variables with means 0, bounded by $2\psi(c)$. Thus we can use the Hoeffding inequality (Theorem 2 in Hoeffding (1963)) and obtain for $0 < t < a_1\sqrt{n}$

$$P(\sqrt{n}\,|T_n| > t) \le 2\exp\{-a_2 t^2\} \tag{3.46}$$

where $a_2 \ge 2 f^2(c + a_1) \psi^2(c)$. Hence,

$$I_{n1} \le 2r \int_0^{a_1\sqrt{n}} \exp\{-a_2 t^2\}\, t^{r-1}\, dt \le 2r \int_0^\infty \exp\{-a_2 t^2\}\, t^{r-1}\, dt < \infty \tag{3.47}$$
On the other hand, if $t \ge a_1\sqrt{n}$, then for $n = 2m \ge 2$,

$$P(\sqrt{n}\,|T_n| > t) = 2P(\sqrt{n}\, T_n > t) \le 2P\big(X_{n:m+1} \ge -c + t n^{-1/2}\big) \tag{3.48}$$

$$\le 2n \binom{n-1}{m-1} \int_{F(-c + t n^{-1/2})}^{1} u^m (1 - u)^{n-m-1}\, du \le 2\big[q\big(F(-c + t n^{-1/2})\big)\big]^n$$

where $X_{n:1} \le \ldots \le X_{n:n}$ are the order statistics and

$$q(u) = 4u(1 - u) \le 1, \quad 0 \le u \le 1$$

Actually, $F(-c + t n^{-1/2}) > \frac12$ for $t \ge a_1\sqrt{n}$, and

$$2n \binom{n-1}{m-1} \int_A^1 u^m (1 - u)^{n-m-1}\, du \le 2[q(A)]^n \quad \text{for } A > \tfrac12$$

that can be proved again with the aid of the Hoeffding inequality. If $n = 2m + 1$, we similarly get

$$P(\sqrt{n}\,|T_n| > t) \le 2n \binom{n-1}{m} \int_{F(-c + t n^{-1/2})}^{1} u^m (1 - u)^{n-m}\, du \le 2\big[q\big(F(-c + t n^{-1/2})\big)\big]^n \tag{3.49}$$
Finally, using (3.42)–(3.44), (3.46), (3.48)–(3.49), we obtain r r−1 − 12 n 2 In2 ≤ 2r t [q(F (−c + tn ))] dt = 2rn xr−1 [q(F (−c + x))]n dx √ a1 n a1 r r−1 r [4F (y)(1 − F (y))]λ dy < ∞ ≤ 2r(C ) n 2 [q(F (−c + a1 ))]n−[ ]−1−λ a1 −c
for n ≥ (3.37).
[ r ]
+ 1, and limn→∞ In2 = 0. This, combined with (3.47), proves
It remains to prove the moment convergence (3.39). Under the conditions (A.1)–(A.2), the M-estimator admits the asymptotic representation

$$\gamma \sqrt{n}(T_n - \theta) = n^{-1/2} \sum_{i=1}^n \psi(X_i - \theta) + O_p(n^{-1/4})$$

proved by Jurečková (1980). Since $E_\theta \psi(X_1 - \theta) = 0$ and $\psi$ is bounded, all moments of $\psi(X_1 - \theta)$ exist. Hence, the von Bahr (1965) theorem on moment convergence of sums of independent random variables applies to $n^{-1/2} \sum_{i=1}^n \psi(X_i - \theta)$, and this further implies (3.40) for any positive integer $r$, in view of the uniform integrability (3.37). It further extends to any positive real $r$, because $\big( n^{-1/2} \sum_{i=1}^n \psi(X_i - \theta) \big)^s$ is uniformly integrable for any $s \in [2r - 2, 2r]$. The asymptotic normality then follows from the central limit theorem. □
3.6 Studentized M-estimators

The M-estimator of the shift parameter is translation equivariant but generally it is not scale equivariant (see (3.11)). This shortcoming can be overcome by using either of the following two methods:

• We estimate the scale simultaneously with the location parameter: e.g., Huber (1981) proposed estimating the scale parameter $\sigma$ simultaneously with the location parameter $\theta$ as a solution of the following system of equations:

$$\sum_{i=1}^n \psi_H\left(\frac{X_i - \theta}{\sigma}\right) = 0 \tag{3.50}$$

$$\sum_{i=1}^n \chi\left(\frac{X_i - \theta}{\sigma}\right) = 0 \tag{3.51}$$

where $\chi(x) = \psi_H^2(x) - \int_{\mathbb{R}} \psi_H^2(y)\, d\Phi(y)$, $\psi_H$ is the Huber function (3.16), and $\Phi$ is the distribution function of the standard normal distribution.

• We can obtain a translation and scale equivariant estimator of $\theta$, if we studentize the M-estimator by a convenient scale statistic $S_n(X_1, \ldots, X_n)$ and solve the following minimization:

$$\sum_{i=1}^n \rho\left(\frac{X_i - \theta}{S_n}\right) := \min, \quad \theta \in \mathbb{R} \tag{3.52}$$

However, to guarantee the translation and scale equivariance of the solution of (3.52), our scale statistic should satisfy the following conditions:

(a) $S_n(x) > 0$ a.e. for $x \in \mathbb{R}^n$
(b) $S_n(x_1 + c, \ldots, x_n + c) = S_n(x_1, \ldots, x_n)$, $c \in \mathbb{R}$, $x \in \mathbb{R}^n$ (translation invariance)
(c) $S_n(cx_1, \ldots, cx_n) = cS_n(x_1, \ldots, x_n)$, $c > 0$, $x \in \mathbb{R}^n$ (scale equivariance)
Moreover, it is convenient if $S_n$ consistently estimates a statistical functional $S(F)$, so that

$$\sqrt{n}(S_n - S(F)) = O_p(1) \quad \text{as } n \to \infty \tag{3.53}$$

Indeed, the estimator defined as in (3.52) is translation and scale equivariant, and the pertaining statistical functional $T(F)$ is defined implicitly as a solution of the minimization

$$\int_{\mathcal{X}} \rho\left(\frac{x - t}{S(F)}\right) dF(x) := \min, \quad t \in \mathbb{R} \tag{3.54}$$

The functional is Fisher consistent, provided the solution of the minimization (3.54) is unique. If $\rho$ has a continuous derivative $\psi$, then the estimator equals a root of the equation

$$\sum_{i=1}^n \psi\left(\frac{X_i - \theta}{S_n}\right) = 0 \tag{3.55}$$

If $\rho$ is convex and hence $\psi$ is nondecreasing, but discontinuous at some points or constant on some intervals, we obtain a unique studentized estimator analogously as in (3.12), namely

$$T_n = \tfrac12 (T_n^+ + T_n^-), \qquad T_n^- = \sup\Big\{ t : \sum_{i=1}^n \psi\left(\frac{X_i - t}{S_n}\right) > 0 \Big\}, \qquad T_n^+ = \inf\Big\{ t : \sum_{i=1}^n \psi\left(\frac{X_i - t}{S_n}\right) < 0 \Big\} \tag{3.56}$$
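The studentized estimating equation (3.55) can be illustrated with a short R sketch. It assumes the Huber $\psi$ with tuning constant $k = 1.5$ and takes the unscaled MAD as $S_n$; both choices, the function names `psi_huber` and `studentized_m`, and the small data vector are illustrative assumptions, not the book's implementation.

```r
# Huber's psi function with tuning constant k (k = 1.5 is an assumed default).
psi_huber <- function(x, k = 1.5) pmax(-k, pmin(k, x))

# Studentized M-estimator: root in t of sum(psi((x - t) / S_n)) = 0, cf. (3.55),
# with S_n taken as the unscaled median absolute deviation.
studentized_m <- function(x, k = 1.5) {
  s <- mad(x, constant = 1)
  score <- function(t) sum(psi_huber((x - t) / s, k))
  # score is positive at min(x) and negative at max(x), so a root is bracketed.
  uniroot(score, interval = range(x), tol = 1e-10)$root
}

x <- c(46.34, 50.34, 48.35, 53.74, 52.06, 49.45, 49.90, 51.25, 49.38, 49.31)
t0 <- studentized_m(x)

# Translation and scale equivariance, up to the root-finding tolerance.
equivariance_check <- c(shift = studentized_m(x + 3) - (t0 + 3),
                        scale = studentized_m(2.5 * x) - 2.5 * t0)
print(t0)
print(equivariance_check)
```

Both entries of `equivariance_check` should be numerically zero, because the MAD satisfies conditions (b) and (c) above.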
There is a variety of possible choices of Sn , because there is no universal scale functional. Let us mention some of the most popular choices of the scale statistic Sn :
• Sample standard deviation:

$$S_n = \left( \frac1n \sum_{i=1}^n (X_i - \bar{X}_n)^2 \right)^{1/2}, \qquad S(F) = (\operatorname{var}_F(X))^{1/2}$$

This functional, being highly non-robust, is used for studentization only in special cases, as in the Student t-test under normality.

• Inter-quartile range:

$$S_n = X_{n:[\frac34 n]} - X_{n:[\frac14 n]}$$

where $X_{n:[np]}$, $0 < p < 1$, is the empirical $p$-quantile of the ordered sample $X_{n:1} \le \ldots \le X_{n:n}$. The corresponding functional has the form

$$S(F) = F^{-1}(\tfrac34) - F^{-1}(\tfrac14)$$

• Median absolute deviation (MAD):

$$S_n = \operatorname{med}_{1 \le i \le n} |X_i - \tilde{X}_n|$$

The corresponding statistical functional $S(F)$ is a solution of the equation

$$F\big(S(F) + F^{-1}(\tfrac12)\big) - F\big(-S(F) + F^{-1}(\tfrac12)\big) = \tfrac12$$

and $S(F) = F^{-1}(\tfrac34)$ provided the distribution function $F$ is symmetric around 0 and $F^{-1}(\tfrac12) = 0$.

The influence function of the studentized M-functional in the symmetric model satisfying $F(-x) = 1 - F(x)$, $\rho(-x) = \rho(x)$, with absolutely continuous $\psi$ satisfying $\psi(-x) = -\psi(x)$, $x \in \mathbb{R}$, has the form

$$IF(x, T, F) = \frac{S(F)}{\gamma(F)}\, \psi\left(\frac{x - T(F)}{S(F)}\right)$$

where $\gamma(F) = \int_{\mathbb{R}} \psi'\left(\frac{y}{S(F)}\right) dF(y)$. Hence, the influence function of $T(F)$ in the symmetric model depends on the value of $S(F)$, but not on the influence function of the functional $S(F)$.
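The three scale statistics listed above can be computed directly in R. The sketch below uses the 20 measurements of Section 3.11 and contrasts the textbook definitions with the base R functions `sd`, `IQR` and `mad`, whose conventions differ slightly (divisor $n - 1$, interpolated quantiles, and a consistency constant, respectively). The variable names are illustrative.

```r
x <- c(46.34, 50.34, 48.35, 53.74, 52.06, 49.45, 49.90, 51.25, 49.38, 49.31,
       50.62, 48.82, 46.90, 49.46, 51.17, 50.36, 52.18, 50.11, 52.49, 48.67)
n <- length(x)
xs <- sort(x)

sd_n <- sqrt(mean((x - mean(x))^2))                  # divisor n, as in the text
iqr_text <- xs[floor(3 * n / 4)] - xs[floor(n / 4)]  # X_{n:[3n/4]} - X_{n:[n/4]}
mad_raw <- median(abs(x - median(x)))                # MAD without rescaling

scale_stats <- c(sd_n = sd_n, sd_R = sd(x),
                 iqr_text = iqr_text, iqr_R = IQR(x),
                 mad_raw = mad_raw, mad_R = mad(x))
print(round(scale_stats, 3))
```

Here `sd(x)` equals `sd_n * sqrt(n / (n - 1))`, and `mad(x)` equals `1.4826 * mad_raw`; calling `mad(x, constant = 1)` reproduces the unscaled definition.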
3.7 L-estimators

L-estimators are based on the ordered observations (order statistics) $X_{n:1} \le \ldots \le X_{n:n}$ of the random sample $X_1, \ldots, X_n$. The general L-estimator can be written in the form

$$T_n = \sum_{i=1}^n c_{ni}\, h(X_{n:i}) + \sum_{j=1}^k a_j\, h^*(X_{n:[np_j]+1}) \tag{3.57}$$

where $c_{n1}, \ldots, c_{nn}$ and $a_1, \ldots, a_k$ are given coefficients, $0 < p_1 < \ldots < p_k < 1$ and $h(\cdot)$ and $h^*(\cdot)$ are given functions. The coefficients $c_{ni}$, $1 \le i \le n$, are generated by a bounded weight function $J : [0, 1] \to \mathbb{R}$ in the following way: either

$$c_{ni} = \int_{\frac{i-1}{n}}^{\frac{i}{n}} J(s)\, ds, \quad i = 1, \ldots, n \tag{3.58}$$

or approximately

$$c_{ni} = \frac1n J\left(\frac{i}{n+1}\right), \quad i = 1, \ldots, n \tag{3.59}$$
The first component of the L-estimator (3.57) generally involves all order statistics, while the second component is a linear combination of several (finitely many) sample quantiles. Many L-estimators have just the form either of the first or of the second component in (3.57); we speak about L-estimators of type I or II, respectively.

The simplest examples of L-estimators of location are the sample median $\tilde{X}_n$ and the midrange

$$T_n = \tfrac12 (X_{n:1} + X_{n:n})$$

The popular L-estimators of scale are the sample range

$$R_n = X_{n:n} - X_{n:1}$$

and the Gini mean difference

$$G_n = \frac{1}{n(n-1)} \sum_{i,j=1}^n |X_i - X_j| = \frac{2}{n(n-1)} \sum_{i=1}^n (2i - n - 1) X_{n:i}$$
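The two expressions for the Gini mean difference can be compared numerically. The following R sketch (with illustrative function names) evaluates both on a tiny sample, together with the midrange and the sample range.

```r
# Gini mean difference as the average absolute difference over ordered pairs.
gini_pairwise <- function(x) {
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (n * (n - 1))
}

# Equivalent L-estimator form with weights (2i - n - 1) on the order statistics.
gini_order <- function(x) {
  n <- length(x)
  2 * sum((2 * seq_len(n) - n - 1) * sort(x)) / (n * (n - 1))
}

x <- c(4, 1, 2)
l_stats <- c(gini_pairwise = gini_pairwise(x), gini_order = gini_order(x),
             midrange = mean(range(x)), range = diff(range(x)))
print(l_stats)
```

For this sample both forms of $G_n$ give 2, the midrange is 2.5 and the range is 3.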
The L-estimators of type I are more important for applications. Let us consider some of their main characteristics. Let the L-estimator $T_n$ have an integrable weight function $J$ such that $\int_0^1 J(u)\, du = 1$. Its corresponding statistical functional is based on the empirical quantile function

$$Q_n(t) = F_n^{-1}(t) = \inf\{x : F_n(x) \ge t\}, \quad 0 < t < 1$$

that is the empirical counterpart of the quantile function $Q(t) = F^{-1}(t) = \inf\{x : F(x) \ge t\}$, $0 < t < 1$, and is equal to

$$Q_n(t) = \begin{cases} X_{n:i} & \ldots\ \frac{i-1}{n} < t \le \frac{i}{n}, \quad i = 1, \ldots, n - 1 \\ X_{n:n} & \ldots\ \frac{n-1}{n} < t \le 1 \end{cases} \tag{3.60}$$
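$$\frac{dF_t^{-1}(u)}{dt} = \begin{cases} \dfrac{u}{(1-t)^2} \cdot \dfrac{1}{f\big(F^{-1}\big(\frac{u}{1-t}\big)\big)} & \ldots\ u < (1 - t)F(x) \\[2ex] \dfrac{u-1}{(1-t)^2} \cdot \dfrac{1}{f\big(F^{-1}\big(\frac{u-t}{1-t}\big)\big)} & \ldots\ u > (1 - t)F(x) + t \end{cases}$$

This implies that

$$\begin{aligned}
\frac{dT(F_t)}{dt} &= \int_0^1 J(u)\, h'\big(F_t^{-1}(u)\big) \cdot \frac{dF_t^{-1}(u)}{dt}\, du \\
&= \int_0^{F_t(x)} \frac{u}{(1-t)^2} \cdot \frac{h'\big(F^{-1}\big(\frac{u}{1-t}\big)\big)}{f\big(F^{-1}\big(\frac{u}{1-t}\big)\big)}\, J(u)\, du + \int_{F_t(x)}^1 \frac{u-1}{(1-t)^2} \cdot \frac{h'\big(F^{-1}\big(\frac{u-t}{1-t}\big)\big)}{f\big(F^{-1}\big(\frac{u-t}{1-t}\big)\big)}\, J(u)\, du
\end{aligned}$$

and $t \to 0_+$ leads to the influence function of the functional (3.62):

$$\begin{aligned}
\frac{dT(F_t)}{dt}\bigg|_{t=0} &= \int_0^{F(x)} u \cdot \frac{h'(F^{-1}(u))}{f(F^{-1}(u))}\, J(u)\, du + \int_{F(x)}^1 (u - 1) \cdot \frac{h'(F^{-1}(u))}{f(F^{-1}(u))}\, J(u)\, du \\
&= \int_0^1 u \cdot \frac{h'(F^{-1}(u))}{f(F^{-1}(u))}\, J(u)\, du - \int_{F(x)}^1 \frac{h'(F^{-1}(u))}{f(F^{-1}(u))}\, J(u)\, du
\end{aligned}$$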
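Hence, the influence function of $T(F)$ is equal to

$$IF(x, T, F) = \int_{-\infty}^{\infty} F(y)\, h'(y)\, J(F(y))\, dy - \int_x^{\infty} h'(y)\, J(F(y))\, dy \tag{3.63}$$

Notice that it satisfies the identity

$$\frac{d}{dx} IF(x, T, F) = h'(x)\, J(F(x))$$

If $h(x) \equiv x$, $x \in \mathbb{R}$, $J(u) = J(1 - u)$, $0 < u < 1$, and $F$ is symmetric around 0, the influence function simplifies:

$$\begin{aligned}
IF(x, T, F) &= \int_{-\infty}^{\infty} F(y) J(F(y))\, dy - \int_x^{\infty} J(F(y))\, dy \\
&= \int_0^{\infty} F(y) J(F(y))\, dy + \int_{-\infty}^{0} (1 - F(-y)) J(1 - F(-y))\, dy - \int_x^{\infty} J(F(y))\, dy \\
&= \int_0^{\infty} F(y) J(F(y))\, dy + \int_0^{\infty} (1 - F(y)) J(F(y))\, dy - \int_x^{\infty} J(F(y))\, dy \\
&= \int_0^{\infty} J(F(y))\, dy - \int_x^{\infty} J(F(y))\, dy
\end{aligned}$$

and hence

$$IF(x, T, F) = \int_0^x J(F(y))\, dy \ \ldots\ x \ge 0, \qquad IF(-x, T, F) = -IF(x, T, F) \ \ldots\ x \in \mathbb{R} \tag{3.64}$$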
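Remark 3.1 If the M-estimator $M_n$ of the center of symmetry is generated by an absolutely continuous function $\psi$, and $L_n$ is the L-estimator with the weight function $J(u) = c\,\psi'(F^{-1}(u))$, then the influence functions of $M_n$ and $L_n$ coincide.

3.7.2 Breakdown point of L-estimator

If the L-estimator $T_n$ is trimmed in the sense that its weight function satisfies $J(u) = 0$ for $0 < u \le \alpha$ and $1 - \alpha \le u < 1$, and $\varepsilon_n^* = \frac{m_n}{n}$ is its breakdown point, then $\lim_{n \to \infty} \varepsilon_n^* = \alpha$.

Example 3.2 The $\alpha$-trimmed mean ($0 < \alpha < \frac12$) is the average of the central quantiles:

$$\bar{X}_{n\alpha} = \frac{1}{n - 2[n\alpha]} \sum_{i=[n\alpha]+1}^{n-[n\alpha]} X_{n:i}$$

hence

$$c_{ni} = \begin{cases} \dfrac{1}{n - 2[n\alpha]} & \ldots\ [n\alpha] + 1 \le i \le n - [n\alpha] \\ 0 & \ldots\ \text{otherwise} \end{cases}$$

$$J(u) = \frac{1}{1 - 2\alpha}\, I[\alpha \le u \le 1 - \alpha]$$

$$T_n = T(F_n) = \frac{1}{1 - 2\alpha} \int_\alpha^{1-\alpha} F_n^{-1}(u)\, du, \qquad T(F) = \frac{1}{1 - 2\alpha} \int_\alpha^{1-\alpha} F^{-1}(u)\, du$$

The influence function of $\bar{X}_{n\alpha}$ follows from (3.63):

$$IF(x, T, F) = \int_{\mathbb{R}} F(y) J(F(y))\, dy - \int_x^{\infty} J(F(y))\, dy = \frac{1}{1 - 2\alpha} \left[ \int_{F^{-1}(\alpha)}^{F^{-1}(1-\alpha)} F(y)\, dy - \int_x^{\infty} I[\alpha < F(y) < 1 - \alpha]\, dy \right]$$

hence

$$\begin{aligned}
IF(x, T, F) + \mu_\alpha = {} & -\frac{1}{1 - 2\alpha} \big[ \alpha F^{-1}(1 - \alpha) - (1 - \alpha) F^{-1}(\alpha) \big]\, I\big[x < F^{-1}(\alpha)\big] \\
& + \frac{1}{1 - 2\alpha} \big[ x - \alpha F^{-1}(\alpha) - \alpha F^{-1}(1 - \alpha) \big]\, I\big[F^{-1}(\alpha) \le x \le F^{-1}(1 - \alpha)\big] \\
& + \frac{1}{1 - 2\alpha} \big[ -\alpha F^{-1}(\alpha) + (1 - \alpha) F^{-1}(1 - \alpha) \big]\, I\big[x > F^{-1}(1 - \alpha)\big]
\end{aligned}$$

where

$$\mu_\alpha = \frac{1}{1 - 2\alpha} \int_\alpha^{1-\alpha} F^{-1}(u)\, du = \frac{1}{1 - 2\alpha} \int_{F^{-1}(\alpha)}^{F^{-1}(1-\alpha)} y\, dF(y)$$

If $F$ is symmetric, then $F^{-1}(u) = -F^{-1}(1 - u)$, $0 < u < 1$, and $\mu_\alpha = 0$; then

$$IF(x, T, F) = \begin{cases} -\dfrac{F^{-1}(1-\alpha)}{1 - 2\alpha} & \ldots\ x < -F^{-1}(1 - \alpha) \\[1.5ex] \dfrac{x}{1 - 2\alpha} & \ldots\ -F^{-1}(1 - \alpha) \le x \le F^{-1}(1 - \alpha) \\[1.5ex] \dfrac{F^{-1}(1-\alpha)}{1 - 2\alpha} & \ldots\ x > F^{-1}(1 - \alpha) \end{cases}$$

The global sensitivity of the trimmed mean is

$$\gamma^* = \frac{F^{-1}(1 - \alpha)}{1 - 2\alpha}$$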
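The definition of the $\alpha$-trimmed mean in Example 3.2 and its influence function in the symmetric case can be sketched in R as follows. The function names are illustrative, and the influence function is evaluated under the assumption that $F$ is the standard normal distribution.

```r
# alpha-trimmed mean from the order statistics: drop [n * alpha] points per tail.
trimmed_mean <- function(x, alpha) {
  n <- length(x)
  g <- floor(n * alpha)
  mean(sort(x)[(g + 1):(n - g)])
}

# Influence function for symmetric F; F = N(0, 1) is an assumption made here.
if_trimmed_normal <- function(x, alpha) {
  q <- qnorm(1 - alpha)
  pmax(-q, pmin(q, x)) / (1 - 2 * alpha)
}

x <- c(2.1, -0.4, 0.3, 15, 1.2, 0.8, -1.1, 0.5, 0.9, -7.5)
tm <- c(manual = trimmed_mean(x, 0.1), builtin = mean(x, trim = 0.1))
print(tm)

if_values <- if_trimmed_normal(c(-10, 0.5, 10), alpha = 0.1)
print(if_values)
```

The manual version and `mean(x, trim = 0.1)` coincide here, and the influence function is linear in the central region and constant (bounded by the global sensitivity $\gamma^*$) outside it.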
Remark 3.2 If Mn is the Huber estimator of the center of symmetry θ of F (x−θ), generated by the Huber function ψH with k = F −1 (1 − α) (see (3.16)), then the influence ¯ nα coincide. functions of Mn and X Remark 3.3 ¯ nα . Then (i) Let ε∗n = mnn be the breakdown point of the α-trimmed mean X ∗ limn→∞ εn = α. ¯ n,αn ; a) be the tail-behavior measure of (ii) Let αn = [k/n], n ≥ 3 and let B(X ¯ n,αn , defined in (2.14). Then, if F has exponential tails (2.17), X ¯ n,αn ; a) ≤ lima→∞ B(X ¯n,αn ; a) ≤ n − k n − 2k ≤ limand→∞ B(X and if F has heavy tails (2.19) and k
F −1 (1 − α)]
The global sensitivity of the Winsorized mean is

$$\gamma^* = F^{-1}(\alpha) + \frac{\alpha}{f(F^{-1}(1 - \alpha))}$$

and the limiting breakdown point of $\bar{W}_{n\alpha}$ is $\varepsilon^* = \alpha$. The influence function of the Winsorized mean has jump points at $F^{-1}(\alpha)$ and $F^{-1}(1 - \alpha)$, while the influence function of the $\alpha$-trimmed mean is continuous.

Example 3.4
(i) Sen's weighted mean (Sen (1964)):

$$T_{n,k} = \binom{n}{2k+1}^{-1} \sum_{i=1}^n \binom{i-1}{k} \binom{n-i}{k} X_{n:i}$$

where $0 < k < \frac{n-1}{2}$. Notice that $T_{n,0} = \bar{X}_n$ and $T_{n,k}$ is the sample median if either $n$ is even and $k = \frac{n}{2} - 1$ or $n$ is odd and $k = \frac{n-1}{2}$.

(ii) The Harrell-Davis estimator of the $p$-quantile (Harrell and Davis (1982)):

$$T_n = \sum_{i=1}^n c_{ni} X_{n:i}, \qquad c_{ni} = \frac{\Gamma(n + 1)}{\Gamma(k)\Gamma(n - k + 1)} \int_{(i-1)/n}^{i/n} u^{k-1} (1 - u)^{n-k}\, du$$

$i = 1, \ldots, n$, where $k = [np]$, $0 < p < 1$.

(iii) BLUE (asymptotically best linear unbiased estimator) of the location parameter (more properties described by Blom (1956) and Jung (1955, 1962)). Let $X_1, X_2, \ldots$ be independent observations with the distribution function $F(x - \theta)$, where $F$ has an absolutely continuous density $f$ with derivative $f'$. Then the BLUE is the L-estimator with the weight function

$$T_n = \sum_{i=1}^n c_{ni} X_{n:i}, \qquad c_{ni} = \frac1n J\left(\frac{i}{n+1}\right), \quad i = 1, \ldots, n$$

$$J(F(x)) = \psi_f'(x), \qquad \psi_f(x) = -\frac{f'(x)}{f(x)}, \quad x \in \mathbb{R}$$
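Parts (i) and (ii) of Example 3.4 translate directly into R. The sketch below is an illustration under the formulas as stated above (in particular $k = [np]$ for the Harrell-Davis weights, which requires $k \ge 1$); the function names are hypothetical, and the integral in $c_{ni}$ is evaluated through the Beta distribution function `pbeta`.

```r
# Sen's weighted mean T_{n,k}: weights choose(i-1, k) * choose(n-i, k) / choose(n, 2k+1).
sen_weighted_mean <- function(x, k) {
  n <- length(x)
  i <- seq_len(n)
  w <- choose(i - 1, k) * choose(n - i, k) / choose(n, 2 * k + 1)
  sum(w * sort(x))
}

# Harrell-Davis weights: increments of the Beta(k, n - k + 1) distribution function.
harrell_davis <- function(x, p) {
  n <- length(x)
  k <- floor(n * p)                 # k = [np] as in the text; needs k >= 1
  i <- seq_len(n)
  w <- pbeta(i / n, k, n - k + 1) - pbeta((i - 1) / n, k, n - k + 1)
  sum(w * sort(x))
}

x <- c(3.1, 0.2, 5.7, 1.4, 2.2, 9.0, 4.4)
l_examples <- c(sen_k0 = sen_weighted_mean(x, 0), mean = mean(x),
                sen_k3 = sen_weighted_mean(x, 3), median = median(x),
                hd_median = harrell_davis(x, 0.5))
print(l_examples)
```

With $n = 7$, the choice $k = 0$ reproduces the sample mean and $k = (n-1)/2 = 3$ reproduces the sample median, in agreement with the remark in part (i).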
3.8 Moment convergence of L-estimators

Similarly as in the case of M-estimators, we can prove the moment convergence of L-estimators for a broad class of bounded $J$-functions and under some conditions on the density $f$. Consider the L-estimator

$$T_n = \sum_{i=1}^n c_{ni} X_{n:i} \tag{3.68}$$

where $X_{n:1} \le X_{n:2} \le \ldots \le X_{n:n}$ are the order statistics corresponding to observations $X_1, \ldots, X_n$ and

$$c_{ni} = c_{n,n-i+1} \ge 0, \quad i = 1, \ldots, n, \quad \text{and} \quad \sum_{i=1}^n c_{ni} = 1 \tag{3.69}$$

and

$$c_{ni} = c_{n,n-i+1} = 0 \ \text{ for } i \le k_n, \quad \text{where } \lim_{n \to \infty} \frac{k_n}{n} = \alpha_0$$

for some $\alpha_0$, $0 < \alpha_0 < \frac12$. Assume that the independent observations $X_1, \ldots, X_n$ are identically distributed with a distribution function $F(x - \theta)$ such that $F$ has a symmetric density $f(x)$, $f(-x) = f(x)$, $x \in \mathbb{R}$, and $f$ is monotonically decreasing in $x$ for $x \ge 0$, and

$$\sup_{F^{-1}(\alpha_0) \le x \le F^{-1}(1 - \alpha_0)} |f'(x)| < \infty \tag{3.70}$$

Denote

$$J_n(t) = n c_{ni} \quad \text{for } \frac{i-1}{n} < t \le \frac{i}{n}, \quad i = 1, \ldots, n \tag{3.71}$$

and assume that

$$J_n(t) \to J(t) \ \text{a.s.} \quad \forall t \in (0, 1) \tag{3.72}$$

where $J : [0, 1] \to [0, \infty)$ is a symmetric and integrable function, $J(t) = J(1 - t) \ge 0$, $0 \le t \le 1$, and $\int_0^1 J(t)\, dt = 1$. Then (see Huber (1981)), the sequence $\sqrt{n}(T_n - \theta)$ has an asymptotic normal distribution $N(0, \sigma_L^2)$, where

$$\sigma_L^2 = \int_{\mathbb{R}} \int_{\mathbb{R}} [F(x \wedge y) - F(x)F(y)]\, J(F(x))\, J(F(y))\, dx\, dy \tag{3.73}$$
Theorem 3.3 Under the conditions (3.69)–(3.73), for any positive integer $r$,

$$\lim_{n \to \infty} E_\theta \big[ \sqrt{n}(T_n - \theta) \big]^{2r} = \sigma_L^{2r}\, \frac{(2r)!}{2^r\, r!} \tag{3.74}$$
We shall only sketch the basic steps of the proof; the detailed proof can be found in Jureˇckov´ a and Sen (1982). We shall use the following lemma, that follows from the results of Cs¨ org˝ o and R´ev´esz (1978): Lemma 3.1 Under the conditions of Theorem 3.3, for any n ≥ n0 , there exists a sequence of random variables {Yni }n+1 i=1 , independent and normally distributed N (0, 1), such that n+1 √ 1 1 n(Tn − θ) − √ anj Ynj = O(n− 2 log n) a.s. (3.75) n + 1 j=1 as n → ∞, where anj
n
n+1 1 = bni − bni , n + 1 i=1 i=j
bni =
cni , i = 1, . . . , n i −1 f F n+1
Proof of Lemma 3.1: Put θ = 0 without a loss of generality. Using Theorem 6 of Cs¨org˝ o and R´ev´esz (1978), we conclude that there exists a sequence of Brownian bridges {Bn (t) : 0 ≤ t ≤ 1} such that i i √ i −1 n Xn:i − F −1 max − B f F n kn +1≤i≤n−kn n+1 n+1 n+1 1 = O n− 2 log n a.s. as n → ∞ n √ i 1 bni Bni = O n− 2 log n nTn − n+1 i=1 t : t ≥ 0 is a standard Wiener process Wn on The process (t + 1)Bn t+1 [0, ∞), thus Wn (k) = ki=1 Yni , k = 1, 2, . . . , where the Yni are independent random variables with N (0, 1) distributions. 2 n i = 0, Sketch of the proof of Theorem 3.3. Because i=1 cni F −1 n+1 we get by the Jensen inequality (put θ = 0) n i 2r √ √ 2r −1 ( n|Tn |) ≤ n Xni − F n+1
hence
i=1
72 ≤
ROBUST ESTIMATORS OF REAL PARAMETER i 2r √ cni n Xni − F −1 n+1
n−k n
i=kn +1
hence
n−k i 2r n √ √ 2r −1 E0 ( n|Tn |) ≤ cni E n Xni − F 0 be the price of one observation and let the global loss be L(Tn , θ, c) = a(Tn − θ)2 + cn
(3.76)
where $a > 0$ is a constant. The corresponding risk is

$$R_n(T_n, \theta, c) = a E_\theta (T_n - \theta)^2 + cn \tag{3.77}$$

Our goal is to find the sample size $n$ minimizing the risk (3.77). Let us first consider the situation that $F$ is known and that $\sigma_n^2 = n E_\theta (T_n - \theta)^2$ exists for $n \ge n_0$, and that

$$\lim_{n \to \infty} \sigma_n^2 = \sigma^2(F), \quad 0 < \sigma^2(F) < \infty \tag{3.78}$$

Hence, we want to minimize $\frac1n a \sigma_n^2 + cn$ with respect to $n$, and if we use the approximation (3.78), the approximate solution $n_0(c)$ has the form

$$n_0(c) \approx \sigma(F) \sqrt{\frac{a}{c}} \tag{3.79}$$

and for the minimum risk we obtain

$$R_n(T_{n_0(c)}, \theta, c) \approx 2\sigma(F) \sqrt{ac} \tag{3.80}$$

where $p(c) \approx q(c)$ means that $\lim_{c \downarrow 0} \frac{q(c)}{p(c)} = 1$. Then obviously $n_0(c) \uparrow \infty$ as $c \downarrow 0$.
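The approximations (3.79) and (3.80) can be checked numerically by minimizing $a\sigma^2/n + cn$ over integer $n$. The R sketch below uses arbitrary illustrative values $a = 1$, $c = 10^{-4}$ and $\sigma(F) = 2$; these numbers and the variable names are assumptions chosen only for the illustration.

```r
a <- 1          # weight of the squared error in the loss
c_cost <- 1e-4  # price of one observation (named c_cost to avoid masking c())
sigma <- 2      # assumed known value of sigma(F)

# Approximate risk a * sigma^2 / n + c * n as a function of the sample size n.
risk <- function(n) a * sigma^2 / n + c_cost * n

n_grid <- 1:2000
n_best <- n_grid[which.min(risk(n_grid))]

optimum <- c(n_best = n_best,
             n_approx = sigma * sqrt(a / c_cost),
             risk_best = risk(n_best),
             risk_approx = 2 * sigma * sqrt(a * c_cost))
print(optimum)
```

For these values the grid minimum is attained at $n = 200$ with risk 0.04, matching $\sigma(F)\sqrt{a/c}$ and $2\sigma(F)\sqrt{ac}$.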
If the distribution function $F$ is unknown, we cannot know $\sigma^2(F)$ either. But we can still solve the problem sequentially, if there is a sequence $\hat{\sigma}_n$ of estimators of $\sigma(F)$. We set the random sample size (stopping rule) $N_c$, defined as

$$N_c = \min\left\{ n \ge n^0 : n \ge \sqrt{\frac{a}{c}} \big( \hat{\sigma}_n + n^{-\nu} \big) \right\}, \quad c > 0 \tag{3.81}$$
SEQUENTIAL M - AND L-ESTIMATORS
where $n^0$ is an initial sample size and $\nu > 0$ is a chosen number. Then $N_c \uparrow \infty$ with probability 1 as $c \downarrow 0$. The resulting estimator of $\theta$ is $T_{N_c}$, based on $N_c$ observations $X_1, \ldots, X_{N_c}$. The corresponding risk is

$$R^*(T_n, \theta, c) = a E (T_{N_c} - \theta)^2 + c E N_c$$

We shall show that, if $T_n$ is either a suitable M-estimator or an L-estimator, then

$$\frac{R^*(T_n, \theta, c)}{R_n(T_{n_0(c)}, \theta, c)} \to 1 \quad \text{as } c \downarrow 0 \tag{3.82}$$

An interpretation of the convergence (3.82) is that the sequential estimator $T_{N_c}$ is asymptotically (in the sense that $c \downarrow 0$) equally risk-efficient as the optimal non-sequential estimator $T_{n_0(c)}$ corresponding to the case that $\sigma^2(F)$ is known. Such a problem was first considered by Ghosh and Mukhopadhyay (1979) and later by Chow and Yu (1981) under weaker conditions; they proved (3.82) for $T_n$ as the sample mean and $\hat{\sigma}_n^2$ as the sample variance. Sen (1980) solved the problem for a class of R- (rank-based) estimators of $\theta$.

Let $T_n$ be the M-estimator of $\theta$ generated by a nondecreasing and skew-symmetric function $\psi$, and assume that $\psi$ and $F$ satisfy conditions (A.1) and (A.2) of Section 3.5; put

$$S_n(t) = \sum_{i=1}^n \psi(X_i - t)$$
Then Tn is defined by (3.12) and tributed N (0, σ 2 (ψ, F )), where
√ n(Tn − θ) is asymptotically normally dis
ψ 2 (x)dF (x) 2 f (x)dψ(x) R
σ (ψ, F ) = R 2
(see Huber (1981)), put 1 2 ψ (Xi − Tn ) n i=1 n
s2n = Choose α ∈ (0, 1) and put
α 1 Mn− = sup t : n− 2 Sn (t) > s2n Φ−1 1 − α 2 1 Mn+ = sup t : n− 2 Sn (t) < s2n Φ−1 2 dn = Mn+ − Mn−
√ nd p
n −→ σ 2 (ψ, F ) as n → ∞ 2Φ−1 1 − α2 is proved, e.g., in Jureˇckov´ a (1977).
Then
σ ˆn2 =
If $N_c$ is the stopping rule defined in (3.81) with $\hat{\sigma}_n^2$, then $T_{N_c}$ is a risk-efficient M-estimator of $\theta$.
Let now $T_n$ be an L-estimator of $\theta$, defined in (3.68), trimmed at $\alpha$ and $1 - \alpha$, satisfying (3.69)–(3.72). Then $\sqrt{n}(T_n - \theta)$ is asymptotically normal with variance $\sigma^2(J, F)$ given in (3.73). Sen (1978) proposed the following estimator of $\sigma^2(J, F)$: $\hat{\sigma}_n^2 =$
n−1 n−1
cni cnj [F (x ∧ y) − F (x)F (y)]J(F (x))J(F (y))dxdy
i=1 j=1
and proved that $\hat{\sigma}_n^2 \to \sigma^2(J, F)$ a.s. as $n \to \infty$. Again, if $N_c$ is the stopping rule defined in (3.81) with $\hat{\sigma}_n^2$, then $T_{N_c}$ is a risk-efficient L-estimator of $\theta$.
3.10 R-estimators

Let $R_i$ be the rank of $X_i$ among $X_1, \ldots, X_n$, $i = 1, \ldots, n$, where $X_1, \ldots, X_n$ is a random sample from a population with a continuous distribution function. The rank $R_i$ can be formally expressed as

$$R_i = \sum_{j=1}^n I[X_j \le X_i], \quad i = 1, \ldots, n \tag{3.83}$$

and thus $R_i = nF_n(X_i)$, $i = 1, \ldots, n$, where $F_n$ is the empirical distribution function of $X_1, \ldots, X_n$. The ranks are invariant with respect to the class of monotone transformations of observations, and the tests based on ranks have many advantages: the most important among them is that the distribution of the test criterion under the hypothesis of randomness (i.e., if $X_1, \ldots, X_n$ are independent and identically distributed with a continuous distribution function) is independent of the distribution of observations.

Hodges and Lehmann (1963) proposed a class of estimators, called R-estimators, that are obtained by an inversion of the rank tests. The R-estimators can be defined for many models, practically for all where the rank tests make sense, and the test criterion is symmetric about a known center or has another suitable property under the null hypothesis. We shall describe the R-estimators of the center of symmetry of an (unknown) continuous distribution function, and the R-estimators in the linear regression model in the sequel.

Let $X_1, \ldots, X_n$ be independent random observations with continuous distribution function $F(x - \theta)$, symmetric about $\theta$. The hypothesis

$$H_0 : \theta = \theta_0$$

on the value of the center of symmetry is tested with the aid of the signed rank test (or one-sample rank test), based on the statistic

$$S_n(\theta_0) = \sum_{i=1}^n \operatorname{sign}(X_i - \theta_0)\, a_n(R_{ni}^+(\theta_0)) \tag{3.84}$$

where $R_{ni}^+(\theta_0)$ is the rank of $|X_i - \theta_0|$ among $|X_1 - \theta_0|, \ldots, |X_n - \theta_0|$ and $a_n(1) \le \ldots \le a_n(n)$ are given scores, generated by a nondecreasing score function $\varphi^+ : [0, 1) \to \mathbb{R}^+$, $\varphi^+(0) = 0$, in the following way: $a_n(i) = \varphi^+\big(\frac{i}{n+1}\big)$, $i = 1, \ldots, n$. For example, the linear score function $\varphi^+(u) = u$, $0 \le u \le 1$, generates the Wilcoxon one-sample test. If $\theta_0$ is the true center of symmetry, then $\operatorname{sign}(X_i - \theta_0)$ and $R_{ni}^+(\theta_0)$ are stochastically independent, $i = 1, \ldots, n$, and $S_n(t)$ is a nonincreasing step function of $t$ with probability 1 (Problem 3.10). This implies that $E_{\theta_0} S_n(\theta_0) = 0$ and the distribution of $S_n(\theta_0)$ under $H_0$ is symmetric around 0. As an estimator of $\theta_0$ we propose the value of $t$ which solves the equation $S_n(t) = 0$. Because $S_n(t)$ is discontinuous, such an equation may have no solution; then we define the R-estimator similarly as the M-estimator and put

$$T_n = \tfrac12 (T_n^- + T_n^+), \qquad T_n^- = \sup\{t : S_n(t) > 0\}, \qquad T_n^+ = \inf\{t : S_n(t) < 0\} \tag{3.85}$$

$T_n$ coincides with the sample median if $a_n(i) = 1$, $i = 1, \ldots, n$. The estimator corresponding to the one-sample Wilcoxon test with the scores $a_n(i) = \frac{i}{n+1}$, $i = 1, \ldots, n$, is known as the Hodges-Lehmann estimator:

$$T_n^H = \operatorname{med}\left\{ \frac{X_i + X_j}{2} : 1 \le i \le j \le n \right\} \tag{3.86}$$

Other R-estimators should be computed by an iterative procedure. Unlike the M-estimators, the R-estimators are not only translation, but also scale equivariant, i.e.,

$$T_n(X_1 + c, \ldots, X_n + c) = T_n(X_1, \ldots, X_n) + c, \quad c \in \mathbb{R}$$
$$T_n(cX_1, \ldots, cX_n) = cT_n(X_1, \ldots, X_n), \quad c > 0 \tag{3.87}$$

The distribution function of the statistic $S_n(\theta)$ is discontinuous, even if $X_1, \ldots, X_n$ have a continuous distribution function $F(x - \theta)$. On the other hand, if $\theta$ is the actual center of symmetry, then the distribution function of the statistic $S_n(\theta)$ does not depend on $F$. If we denote

$$0 \le p_n = P_\theta(S_n(\theta) = 0) = P_0(S_n(0) = 0) < 1$$

then $\lim_{n \to \infty} p_n = 0$ and

$$\tfrac12 (1 - p_n) \le P_\theta(T_n < \theta) \le P_\theta(T_n \le \theta) \le \tfrac12 (1 + p_n) \tag{3.88}$$

This means that if $F$ is symmetric around zero, $T_n$ is an asymptotically median unbiased estimator of $\theta$. Using (3.83) in statistic (3.84) with linear scores, we arrive at an alternative form of the Hodges-Lehmann estimator $T_n$ as a solution of the equation

$$\int_{-\infty}^{\infty} [F_n(y) - F_n(2T_n - y)]\, dF_n(y) = 0 \tag{3.89}$$
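The Hodges-Lehmann estimator (3.86) and its relation to the signed-rank statistic (3.84) can be illustrated in R. The sketch below computes the median of all Walsh averages and evaluates the Wilcoxon-score statistic slightly to the left and to the right of the estimate; the function names and the offset 0.0013 are illustrative choices.

```r
# Hodges-Lehmann estimator: median of the averages (x_i + x_j) / 2 with i <= j.
hodges_lehmann <- function(x) {
  walsh <- outer(x, x, "+") / 2
  median(walsh[upper.tri(walsh, diag = TRUE)])
}

# Signed-rank statistic (3.84) with Wilcoxon scores a_n(i) = i / (n + 1).
signed_rank_stat <- function(x, t) {
  n <- length(x)
  sum(sign(x - t) * rank(abs(x - t)) / (n + 1))
}

x <- c(46.34, 50.34, 48.35, 53.74, 52.06, 49.45, 49.90, 51.25, 49.38, 49.31)
hl <- hodges_lehmann(x)

hl_check <- c(hl = hl,
              S_left = signed_rank_stat(x, hl - 0.0013),
              S_right = signed_rank_stat(x, hl + 0.0013),
              shift = hodges_lehmann(x + 3) - (hl + 3),
              scale = hodges_lehmann(2 * x) - 2 * hl)
print(hl_check)
```

The statistic is positive just below the estimate and negative just above it, in line with (3.85), and the last two entries illustrate the translation and scale equivariance (3.87).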
Similarly, the R-estimator generated by the score function $\varphi^+$ can be expressed as a solution of the equation

$$\int_{-\infty}^{\infty} \varphi\big( F_n(y) - F_n(2T_n - y) \big)\, dF_n(y) = 0 \tag{3.90}$$

where $\varphi(u) = \operatorname{sign}(u - \frac12)\, \varphi^+(2u - 1)$, $0 < u < 1$. Hence, the corresponding statistical functional is a solution of the equation

$$\int_{-\infty}^{\infty} \varphi\big[ F(y) - F(2T(F) - y) \big]\, dF(y) = \int_0^1 \varphi\big( u - F(2T(F) - F^{-1}(u)) \big)\, du = 0 \tag{3.91}$$

The influence function of the R-estimator can be derived similarly as that of the L-estimator, and in the case of symmetric $F$ with an absolutely continuous density $f$ it equals

$$IF(x, T, F) = \frac{\varphi(F(x))}{\int_{\mathbb{R}} \varphi(F(y)) (-f'(y))\, dy} \tag{3.92}$$
Example 3.5 The breakdown point of Hodges-Lehmann estimator TnH : the estimator can break down if at least half of sum 12 (Xi + Xj ) for 1 ≤ i ≤ j ≤ n is replaced. Assume a sample is corrupted by replacement of m outliers and
n + n2 is even, then the estimator TnH breaks down for m satisfying n−m 1 n 1 n−m+ > n+ 2 2 2 2 Therefore 2m2 − m(4n + 2) + n + n2 > 0, 1 ≤ m ≤ n
(3.93)
We look for the smallest integer m satisfying (3.93). Thus the breakdown point m∗ ε∗n = nn , where √ 2n + 1 − 2n2 + 2n + 1 ∗ +1 m = 2 where · is the ceiling function,
that is, x is the smallest integer no smaller than x. Analogously, for n + n2 odd √ 2 + 2n + 5 2n 2n + 1 − +1 m∗ = 2 . Finally, limn→∞ ε∗n (TnH ) = 0.293. Remark 3.4 If ψ(x) = cϕ(F (x)), x ∈ R, then the influence function of the M -estimator generated by ψ coincides with the influence function of the Restimator generated by ϕ.
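The limiting value 0.293 in Example 3.5 can be illustrated by direct counting in R. The sketch below assumes, following the first sentence of the example, that breakdown occurs as soon as at least half of the $n(n+1)/2$ averages involve a replaced observation; it does not use the closed-form expressions for $m^*$, and a different convention at the boundary would change the count by at most one. The function name is hypothetical.

```r
# Smallest number m of replaced observations such that at least half of the
# Walsh averages (x_i + x_j) / 2, i <= j, involve a replaced observation.
hl_breakdown_count <- function(n) {
  total <- n * (n + 1) / 2                 # number of all averages
  m <- 1:n
  clean <- (n - m) * (n - m + 1) / 2       # averages of two unreplaced points
  min(m[total - clean >= total / 2])
}

n_values <- c(10, 100, 1000, 10000)
eps <- sapply(n_values, function(n) hl_breakdown_count(n) / n)
print(rbind(n = n_values, eps = eps, limit = 1 - 1 / sqrt(2)))
```

As $n$ grows, the ratio approaches $1 - 1/\sqrt{2} \approx 0.293$.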
Jaeckel (1969) proposed an equivalent definition of the R-estimator, more convenient for the calculation. Consider the

$$\frac{n(n+1)}{2} = n + \binom{n}{2}$$

averages $\frac12 (X_{n:j} + X_{n:k})$, $1 \le j \le k \le n$, including the cases $j = k$. Let $\varphi : (0, 1) \to \mathbb{R}$ be a nondecreasing score function, skew-symmetric on $(0, 1)$ in the sense that $\varphi(1 - u) = -\varphi(u)$, $0 < u < 1$, and put

$$d_{in} = \varphi\left(\frac{i+1}{2n+1}\right) - \varphi\left(\frac{i}{2n+1}\right), \quad i = 1, \ldots, n$$

Then define the weights $c_{jk}$, $1 \le j \le k \le n$, in the following way:

$$c_{jk} = \frac{d_{n-k+j}}{\sum_{i=1}^n i\, d_i}; \qquad \sum_{1 \le j \le k \le n} c_{jk} = 1$$

The R-estimator is defined as the median of the discrete distribution that assigns the probability $c_{jk}$ to each average $\frac12 (X_{n:j} + X_{n:k})$, $1 \le j \le k \le n$. If the score function $\varphi$ is linear and hence $d_{n1} = \ldots = d_{nn} = \frac1n$, then the weights $c_{jk}$ are all equal and the estimator is just the median of all averages, thus the Hodges-Lehmann estimator. However, Jaeckel's definition of the R-estimator is applicable to more general signed-rank tests, such as the one-sample van der Waerden test and others.
3.11 Numerical illustration

Assume that the following data are independent measurements of a physical entity $\theta$: 46.34, 50.34, 48.35, 53.74, 52.06, 49.45, 49.90, 51.25, 49.38, 49.31, 50.62, 48.82, 46.90, 49.46, 51.17, 50.36, 52.18, 50.11, 52.49, 48.67. In other words, we have the measurements

$$X_i = \theta + e_i, \quad i = 1, \ldots, 20$$

and we want to determine the unknown value $\theta$, assuming that the errors $e_1, \ldots, e_{20}$ are independent and identically distributed, symmetrically around zero. The first column of Table 3.2 provides the values of the estimates of $\theta$ and the scale characteristics from Table 3.1, based on $X_1, \ldots, X_{20}$. We see that the values in column I are rather close to each other, and that the data seem to be roughly symmetric around 50. Let us now consider what happens if some observations are slightly changed. The effects of some changes can be seen in columns II-V of Table 3.2. Columns II and III illustrate the effects of a change of solely one observation, caused by a mistake in the decimal point: column II corresponds to the fact that the last
Table 3.1 Estimates of the location and the scale characteristics.

Location
  mean                                      $\bar{X}_n$
  median                                    $\tilde{X}_n$
  5%-trimmed mean                           $\bar{X}_{.05}$
  10%-trimmed mean                          $\bar{X}_{.10}$
  5%-Winsorized mean                        $\bar{W}_{.05}$
  10%-Winsorized mean                       $\bar{W}_{.10}$
  Huber M-estimator                         $M_H$
  Hodges-Lehmann estimator                  $HL$
  Sen's weighted mean, $k_1 = [0.05n]$      $S_{k_1}$
  Sen's weighted mean, $k_2 = [0.1n]$       $S_{k_2}$
  midrange                                  $R_m$

Scale
  standard deviation                        $S_n$
  inter-quartile range                      $R_I$
  median absolute deviation                 $MAD$
  Gini mean difference                      $G_n$
value in the dataset, 48.67, was replaced by 486.7, while column III gives the result of a replacement of 48.67 by 4.867. These changes considerably affected the mean $\bar{X}_n$, the standard deviation $S_n$, and the midrange $R_m$, which is in correspondence with the theoretical conclusions that $\bar{X}_n$, $S_n$ and $R_m$ are highly non-robust. Columns IV and V show the changes in the estimates when the last five observations in the dataset were replaced by the values 79.45, 76.80, 80.73, 76.10, 87.01, or by the values 1.92, 0.71, 1.26, 0.32, -1.71, respectively.

When we wish to obtain a picture of the behavior of an estimator under various models, we usually simulate the model and look at the resulting values of the estimator of interest. For example, 200 observations were simulated from the following probability distributions:

• Normal distribution $N(0, 1)$ and $N(10, 2)$ with the density

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \quad \mu = 0, 10, \quad \sigma^2 = 1, 2, \quad x \in \mathbb{R}$$
Table 3.2 Effects of changes in the dataset on the estimates.

estimator                                   I       II      III     IV      V
mean $\bar{X}_n$                            50.04   71.95   47.85   57.36   37.48
median $\tilde{X}_n$                        50.00   50.22   50.00   50.48   49.34
5%-trimmed mean $\bar{X}_{.05}$             50.05   50.33   49.92   56.32   38.75
10%-trimmed mean $\bar{X}_{.10}$            50.09   50.33   49.98   55.39   40.32
5%-Winsorized mean $\bar{W}_{.05}$          50.01   50.33   49.87   57.07   37.50
10%-Winsorized mean $\bar{W}_{.10}$         50.12   50.35   49.89   57.09   37.46
Huber M-estimator $M_H$                     50.07   50.33   49.94   56.59   37.62
Hodges-Lehmann estimator $HL$               50.02   50.31   49.94   51.31   48.18
Sen's weighted mean $S_{k_1}$               50.04   50.29   49.98   54.19   42.49
Sen's weighted mean $S_{k_2}$               50.02   50.25   50.00   52.44   45.60
midrange $R_m$                              50.04   266.52  29.30   66.68   26.02

sample standard deviation $S_n$             1.82    97.64   10.28   13.67   21.97
inter-quartile range $R_I$                  2.00    2.09    2.00    9.97    15.18
median absolute deviation $MAD$             1.18    0.98    1.18    1.62    1.86
Gini mean difference $G_n$                  2.09    45.55   6.42    13.39   20.74
(symmetric and exponentially-tailed distribution).

• Exponential Exp(5) distribution with the density
$$f(x) = 5e^{-5x}, \; x \ge 0, \qquad f(x) = 0, \; x < 0$$
(skewed and exponentially-tailed distribution).

• Cauchy with the density
$$f(x) = \frac{1}{\pi(1 + x^2)}, \quad x \in \mathbb{R}$$
(symmetric and heavy-tailed distribution).

• Pareto with the density
$$f(x) = \frac{1}{(1 + x)^2}, \; x \ge 1, \qquad f(x) = 0, \; x < 1$$
(skewed and heavy-tailed distribution).

The values of various estimates under the above distributions are given in Table 3.3.
Table 3.3 Values of estimates under various models.

estimator                                  N(0,1)  N(10,2)  Exp(5)  Cauchy  Pareto
mean $\bar X_n$                              0.06    9.92    4.49    1.77   12.19
median $\tilde X_n$                         -0.01    9.73    2.92   -0.25    2.10
5%-trimmed mean $\bar X_{.05}$               0.05    9.89    4.01   -0.23    3.42
10%-trimmed mean $\bar X_{.10}$              0.04    9.88    3.75   -0.29    2.84
5%-Winsorized mean $\bar W_{.05}$            0.07    9.92    4.30   -0.10    4.17
10%-Winsorized mean $\bar W_{.10}$           0.05    9.91    4.05   -0.30    3.32
Huber M-estimator $M_H$                      0.05    9.89    3.89   -0.29    2.87
Hodges-Lehmann est. $HL$                     0.04    9.87    3.73   -0.24    2.73
Sen's weighted mean $S_{k_1}$                0.02    9.75    3.16   -0.22    2.24
Sen's weighted mean $S_{k_2}$                0.02    9.73    3.10   -0.21    2.17
midrange $R_m$                               0.05   10.37   11.85  146.94  525.34

sample standard deviation $S_n$              1.07    2.03    4.63   13.06   78.48
inter-quartile range $R_I$                   1.26    3.05    5.53    2.45    2.83
median abs. deviation $MAD$                  0.63    1.58    2.02    1.24    0.93
Gini mean difference $G_n$                   1.18    2.30    4.78    7.18   20.23
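A simulation of the kind summarized in Table 3.3 can be sketched in a few lines of R. This is only an illustrative sketch, not the code used for the table: the seed, the helper name `estimates` and the selection of estimators are assumptions, the exponential sample uses rate 5 as in the printed density, and the Pareto sample is left out because base R has no generator for it.

```r
set.seed(1)                        # arbitrary seed, only for reproducibility
n <- 200

# Hypothetical helper: some location and scale estimates available in base R
estimates <- function(x) {
  c(mean     = mean(x),
    median   = median(x),
    trim05   = mean(x, trim = 0.05),
    trim10   = mean(x, trim = 0.10),
    midrange = mean(range(x)),
    sd       = sd(x),
    IQR      = IQR(x),
    MAD      = mad(x, constant = 1))   # raw MAD, without the normal-consistency factor
}

samples <- list(
  "N(0,1)"  = rnorm(n),
  "N(10,2)" = rnorm(n, mean = 10, sd = sqrt(2)),
  "Exp(5)"  = rexp(n, rate = 5),        # rate 5, following the density as printed
  "Cauchy"  = rcauchy(n)
)

# One column per model, one row per estimator
result <- round(sapply(samples, estimates), 2)
print(result)
```

Repeating the run with other seeds shows the pattern of the table: the mean, midrange and standard deviation vary wildly under the Cauchy model, while the median and MAD stay stable.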
3.12 Computation and software notes

The system R includes functions for the computation of the location and scale characteristics mentioned above. The standard ones are mean, median and var; further, the function sd computes the sample standard deviation, IQR the inter-quartile range and mad the median absolute deviation. The summary function returns the minimum, quartiles, median, mean and maximum. The slightly different function fivenum returns Tukey's five-number summary (minimum, lower hinge, median, upper hinge, maximum). The hinges are versions of the first and third quartiles. They are equal to the quartiles for odd n and differ for even n; for details see the R reference manual. There are also the standard functions quantile, max, min and range.

The function mean has the argument trim, the fraction of observations to be trimmed from each end before the mean is computed. It means that we can use mean(x, trim=alpha) to compute the α-trimmed mean. The function mad computes the median of the absolute deviations from the center (by default the median), multiplied by a constant. The default value of the constant, 1.4826, ensures consistency under the normal distribution for large n.

The procedures for Huber M-estimation of location are also incorporated into the system R. The function huber finds the Huber M-estimator of location with MAD scale, and the function hubers finds the Huber M-estimator for location with scale specified, scale with location specified, or both if neither is specified. Both functions are stored in the package MASS.
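The base R functions described above can be tried on a small made-up sample. The data vector below is purely illustrative (it is not from the book); it contains one gross outlier so that the effect of trimming and of the `constant` argument of `mad` is visible.

```r
x <- c(2, 4, 4, 5, 6, 7, 9, 11, 14, 100)   # made-up data with one outlier

m.plain <- mean(x)                  # ordinary mean, pulled up by the outlier
m.trim  <- mean(x, trim = 0.10)     # drops floor(0.1 * 10) = 1 value from each end
med     <- median(x)

mad.raw <- mad(x, constant = 1)     # raw median absolute deviation
mad.std <- mad(x)                   # the same value multiplied by 1.4826

five <- fivenum(x)                  # minimum, lower hinge, median, upper hinge, maximum
qrt  <- quantile(x, c(0.25, 0.75))  # default quartiles; may differ from hinges for even n

print(c(mean = m.plain, trimmed = m.trim, median = med))
print(c(mad.raw = mad.raw, mad.std = mad.std))
print(five)
print(qrt)
```

Here n is even, so the hinges returned by fivenum and the default quartiles returned by quantile need not coincide.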
We do not find functions for the Hodges-Lehmann estimator, Sen's weighted mean, the Winsorized mean, the midrange and the Gini mean difference in R. It is easy to write such functions, and they can also be found on the website http://www.fp.vslib.cz/kap/picek/robust/. For example, we can create the function for Gini mean difference and Sen's weighted mean as follows:

gini.mean.difference

> fivenum(chem)
[1]  2.200  2.750  3.385  3.700 28.950
> mean(chem)
[1] 4.280417
> median(chem)
[1] 3.385
> mean(chem, trim=0.05)
[1] 3.253636
> mean(chem, trim=0.10)
[1] 3.205
> winsorized.mean(chem, trim=0.05)
[1] 3.294167
> winsorized.mean(chem, trim=0.10)
[1] 3.185
> huber(chem)
$mu
[1] 3.206724
$s
[1] 0.526323
> hubers(chem)
$mu
[1] 3.205498
$s
[1] 0.673652
> hodges.lehmann(chem)
[1] 3.225
> sen.weight.mean(chem, trunc(length(chem)*0.05))
[1] 3.241265
> sen.weight.mean(chem, trunc(length(chem)*0.10))
[1] 3.251903
> midrange(chem)
[1] 15.575
> sd(chem)
[1] 5.297396
> IQR(chem)
[1] 0.925
> mad(chem)
[1] 0.526323
> mad(chem,co=1)
[1] 0.355
> gini.mean.difference(chem)
[1] 2.830906
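The user-written functions called in the session above are not part of base R. The following is a minimal sketch of how they might be written; the function names follow the session, but the bodies are assumptions and not necessarily the authors' code. In particular, the Hodges-Lehmann estimator is taken here as the median of the Walsh averages over pairs i ≤ j, and Sen's weighted mean is assumed to use the weights $\binom{i-1}{k}\binom{n-i}{k}\big/\binom{n}{2k+1}$ on the order statistics.

```r
# Hypothetical definitions; the names follow the session above.

midrange <- function(x) mean(range(x))

winsorized.mean <- function(x, trim = 0.05) {
  n <- length(x)
  k <- floor(n * trim)                     # number of values replaced at each end
  xs <- sort(x)
  if (k > 0) {
    xs[1:k] <- xs[k + 1]                   # pull the k smallest values up
    xs[(n - k + 1):n] <- xs[n - k]         # pull the k largest values down
  }
  mean(xs)
}

gini.mean.difference <- function(x) {
  n <- length(x)
  d <- abs(outer(x, x, "-"))               # all |x_i - x_j|
  sum(d) / (n * (n - 1))                   # average over pairs i != j
}

hodges.lehmann <- function(x) {
  w <- outer(x, x, "+") / 2                # Walsh averages (x_i + x_j) / 2
  median(w[upper.tri(w, diag = TRUE)])     # assumption: pairs with i <= j
}

sen.weight.mean <- function(x, k) {
  n <- length(x)
  i <- seq_len(n)
  w <- choose(i - 1, k) * choose(n - i, k) # weight of the i-th order statistic
  sum(w * sort(x)) / choose(n, 2 * k + 1)
}
```

With k = 0 the last function reduces to the sample mean, which gives a quick sanity check of the weights.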
3.13 Problems and complements

3.1 Let $X_1, \dots, X_n$ be a sample from the distribution with the density
$$f(x) = \begin{cases} 1 & \text{if } |x| \le \frac14 \\[2pt] \dfrac{1}{32|x|^3} & \text{if } |x| > \frac14 \end{cases}$$
and let $\bar X = \frac1n \sum_{i=1}^n X_i$ be the sample mean. Then $\operatorname{var} \bar X = \infty$.

3.2 The α-interquantile range (0 < α < 1): $S_\alpha = F^{-1}(1-\alpha) - F^{-1}(\alpha)$. The influence function of $S_\alpha$ equals
$$IF(x; F, S_\alpha) = \begin{cases} \dfrac{1-\alpha}{f(a_1)} - \dfrac{\alpha}{f(a_2)} & \ldots\; x < a_1 \\[6pt] -\alpha\left(\dfrac{1}{f(a_1)} + \dfrac{1}{f(a_2)}\right) & \ldots\; a_1 < x < a_2 \\[6pt] \dfrac{1-\alpha}{f(a_2)} - \dfrac{\alpha}{f(a_1)} & \ldots\; x > a_2 \end{cases}$$
where $a_1 = F^{-1}(\alpha)$ and $a_2 = F^{-1}(1-\alpha)$ and $f(x) = \frac{dF(x)}{dx}$; the derivative should exist in neighborhoods of $a_1$ and $a_2$.

3.3 The symmetrized α-interquantile range (0 < α < 1) (Collins (2000)):
$$\tilde S_\alpha(F) = S_\alpha(\tilde F) = \tilde F^{-1}(1-\alpha) - \tilde F^{-1}(\alpha)$$
where
$$\tilde F(x) = \frac12\left[F(x) + 1 - F\left(2F^{-1}(\tfrac12) - x\right)\right] \quad \text{for } F \text{ continuous}$$
$\tilde S_{1/4}$ coincides with MAD. Calculate the influence function of $\tilde S_\alpha$.

3.4 Let $X_1, \dots, X_n$ be an independent sample from a population with density $f(x - \theta)$ and let $T_n = T(X_1, \dots, X_n)$ be a translation equivariant estimator of θ; then $T_n$ has a continuous distribution function.
Hint: $T(x_1, \dots, x_n) = t$ if and only if $x_1 = t - T(0, x_2 - x_1, \dots, x_n - x_1)$. Hence, given $(X_2 - X_1, \dots, X_n - X_1) = (y_2, \dots, y_n)$ and $t \in \mathbb{R}$, there is exactly one point x for which T(x) = t. Hence, $P\{T(\mathbf{X}) = t \mid X_2 - X_1 = y_2, \dots, X_n - X_1 = y_n\} = 0$ for every $(y_2, \dots, y_n)$ and t, thus $P\{T(\mathbf{X}) = t\} = 0 \;\forall t$.

3.5 Let $X_1, \dots, X_n$ be independent observations with distribution function $F(x - \theta)$, and let $T_n = \sum_{i=1}^n c_{ni} X_{n:i}$ be an L-estimator of θ. Then
(a) If $\sum_{i=1}^n c_{ni} = 1$, then $T_n$ is translation equivariant.
(b) If F is symmetric about zero, $\sum_{i=1}^n c_{ni} = 1$ and $c_{ni} = c_{n,n-i+1}$, i = 1, ..., n, then the distribution of $T_n$ is symmetric about θ.

3.6 Tukey (1960) proposed the model of the normal distribution with variance 1 contaminated by the normal distribution with variance $\tau^2 > 1$, i.e., that of the distribution function F of the form
$$F(x) = (1 - \varepsilon)\Phi(x) + \varepsilon\,\Phi\left(\frac{x}{\tau}\right) \tag{3.94}$$
where Φ is the standard normal distribution function. Compare the asymptotic variances of the sample mean and the sample variance under (3.94).

3.7 Let $X_1, \dots, X_n$ be a sample from the Cauchy distribution C(ξ, σ) with the density
$$f(x) = \frac{\sigma}{\pi}\,\frac{1}{\sigma^2 + (x - \xi)^2}$$
Then the distribution of $\bar X_n$ is again C(ξ, σ).

3.8 Let X, $-\frac{\pi}{2} \le X \le \frac{\pi}{2}$, be a random angle with the uniform distribution on the unit circle. Then tan X has the Cauchy distribution C(0, 1).

3.9 Consider the equation
$$\sum_{i=1}^n \psi_C(X_i - \theta) = 0$$
where $\psi_C$ is the Cauchy likelihood function (3.18). Denote by $K_n$ the number of its roots. If $X_1, \dots, X_n$ are independent, identically distributed with the Cauchy C(0, 1) distribution, then $K_n - 1$ has asymptotically the Poisson distribution with parameter $\frac{1}{\pi}$, as n → ∞. (See Barnett (1966) or Reeds (1985).)

3.10 (a) If X is a random variable with a continuous distribution function, symmetric about zero, then sign X and |X| are stochastically independent.
(b) Prove that the linear signed rank statistic (3.84) is a nondecreasing step function of θ with probability 1, provided the score function $\varphi^+$ is nondecreasing on (0, 1) and $\varphi^+(0) = 0$. (See van Eeden (1972) or Jurečková and Sen (1996), Section 6.4.)

3.11 Generate samples from different distributions and apply the described methods to these data.

3.12 Write an R procedure that computes the Hodges-Lehmann estimator.
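The statements of Problems 3.7 and 3.8 can be made plausible by a small simulation before they are proved. The sketch below is only an empirical check under assumed settings (seed, sample sizes and the use of quartiles as summaries are choices made here, not part of the problems).

```r
set.seed(2)                           # arbitrary seed for reproducibility

# Problem 3.8: tangent of an angle uniform on (-pi/2, pi/2)
angle <- runif(10000, -pi / 2, pi / 2)
tan.quartiles <- quantile(tan(angle), c(0.25, 0.75))   # C(0,1) has quartiles -1 and 1

# Problem 3.7: means of Cauchy samples are as dispersed as single observations
means <- replicate(2000, mean(rcauchy(50)))
iqr.means  <- IQR(means)              # inter-quartile range of C(0,1) is 2
iqr.single <- IQR(rcauchy(2000))

print(tan.quartiles)
print(c(iqr.means = iqr.means, iqr.single = iqr.single))
```

Averaging 50 Cauchy observations does not shrink the spread at all, in sharp contrast to what happens for distributions with finite variance.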
CHAPTER 4
Robust estimators in linear model
4.1 Introduction

Consider the linear regression model
$$Y_i = \mathbf{x}_i^\top \boldsymbol\beta + U_i, \quad i = 1, \dots, n \tag{4.1}$$
with observations $Y_1, \dots, Y_n$, unknown and unobservable parameter $\boldsymbol\beta \in \mathbb{R}^p$, where $\mathbf{x}_i \in \mathbb{R}^p$, i = 1, ..., n are either given deterministic vectors or observable random vectors (regressors) and $U_1, \dots, U_n$ are independent errors with a common distribution function F. Often we consider the model in which the first component $\beta_1$ of $\boldsymbol\beta$ is an intercept: it means that $x_{i1} = 1$, i = 1, ..., n. The distribution function F is generally unknown; we only assume that it belongs to some family $\mathcal{F}$ of distribution functions. Denoting
$$\mathbf{Y} = (Y_1, \dots, Y_n)^\top, \qquad \mathbf{X} = \mathbf{X}_n = \begin{pmatrix} \mathbf{x}_1^\top \\ \vdots \\ \mathbf{x}_n^\top \end{pmatrix}, \qquad \mathbf{U} = (U_1, \dots, U_n)^\top$$
we can rewrite (4.1) in the matrix form
$$\mathbf{Y} = \mathbf{X}\boldsymbol\beta + \mathbf{U} \tag{4.2}$$
The most popular estimator of $\boldsymbol\beta$ is the classical least squares estimator (LSE) $\hat{\boldsymbol\beta}$. If X is of rank p, then $\hat{\boldsymbol\beta}$ is equal to
$$\hat{\boldsymbol\beta} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{Y} \tag{4.3}$$
As follows from the Gauss-Markov theorem, $\hat{\boldsymbol\beta}$ is the best linear unbiased estimator of $\boldsymbol\beta$, provided the errors $U_1, \dots, U_n$ have a finite second moment. Moreover, $\hat{\boldsymbol\beta}$ is the maximum likelihood estimator of $\boldsymbol\beta$ if $U_1, \dots, U_n$ are normally distributed.

The least squares estimator $\hat{\boldsymbol\beta}$ is an extension of the sample mean to the linear regression model. Then, naturally, it has similar properties: it is highly non-robust and sensitive to the outliers and to the gross errors in $Y_i$, and to the deviations from the normal distribution of errors. It fails if the distribution of the $U_i$ is heavy-tailed. But above all, $\hat{\boldsymbol\beta}$ is heavily affected by the regression matrix X; namely, it is sensitive to the outliers among its elements. Violating some conditions in the linear model can have more serious consequences than in the location model. This can have a serious impact in econometrics, but also in many other applications. Hence, we must look for robust alternatives to the classical procedures in linear models.
Example 4.1 Figure 4.1 illustrates the effect of an outlier in the x-direction (leverage point) on the least squares estimator.

Figure 4.1 Data with 27 points and the corresponding least squares regression line (top) and the sensitivity of least squares regression to an outlier in the x-direction (bottom). Both panels plot y against x.
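The effect shown in Figure 4.1 is easy to reproduce on simulated data. The following sketch does not use the data of the figure; the 27 points, the true line and the position of the moved point are made-up assumptions chosen only to mimic the situation.

```r
set.seed(3)                               # arbitrary seed
n <- 27
x <- runif(n, -5, -3)                     # made-up regressor values
y <- 2 + 1.5 * x + rnorm(n, sd = 0.5)     # made-up line: intercept 2, slope 1.5

fit.clean <- lm(y ~ x)

# Move one point far out in the x-direction, keeping its y value
x.out <- x
x.out[n] <- 1
fit.lev <- lm(y ~ x.out)

slopes <- c(clean    = unname(coef(fit.clean)[2]),
            leverage = unname(coef(fit.lev)[2]))
h.max <- max(hatvalues(fit.lev))          # largest diagonal element of the hat matrix

print(slopes)
print(h.max)
```

A single point displaced in x is enough to tilt the fitted line, and its diagonal element of the hat matrix is by far the largest in the sample.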
Before we start describing the robust statistical procedures, we shall try to illustrate how seriously the outliers in X can affect the performance of the estimator $\hat{\boldsymbol\beta}$.
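4.2 Least squares method

If we estimate $\boldsymbol\beta$ by the least squares method, then the set $\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol\beta}$ is a hyperplane passing through the points $(\mathbf{x}_i, \hat Y_i)$, i = 1, ..., n, where
$$\hat Y_i = \mathbf{x}_i^\top\hat{\boldsymbol\beta} = \mathbf{h}_i^\top\mathbf{Y}, \quad i = 1, \dots, n$$
and $\mathbf{h}_i^\top$ is the ith row of the projection (hat) matrix $\hat{\mathbf{H}} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$. Hence, $\hat{\mathbf{Y}} = \hat{\mathbf{H}}\mathbf{Y}$ is the projection of vector Y onto the space spanned by the columns of matrix X. Because $\hat{\mathbf{H}}$ is a projection matrix, $\mathbf{h}_i^\top\mathbf{h}_j = h_{ij}$, i, j = 1, ..., n, and thus
$$\begin{aligned} 0 \le \sum_{k \ne i} h_{ik}^2 = h_{ii}(1 - h_{ii}) &\;\Longrightarrow\; 0 \le h_{ii} \le 1, \quad i = 1, \dots, n \\ &\;\Longrightarrow\; |h_{ij}| \le \|\mathbf{h}_i\| \cdot \|\mathbf{h}_j\| = (h_{ii}h_{jj})^{\frac12} \le 1, \quad i, j = 1, \dots, n \end{aligned} \tag{4.4}$$
The matrix $\hat{\mathbf{H}}$ is of order n × n and of rank p; its diagonal elements lie in the interval $0 \le h_{ii} \le 1$, i = 1, ..., n and $\operatorname{trace} \hat{\mathbf{H}} = \sum_{i=1}^n h_{ii} = p$.

In the extreme situation we can imagine that $h_{ii} = 1$ for some i; then
$$1 = h_{ii} = \|\mathbf{h}_i\|^2 = \sum_{k=1}^n h_{ik}^2 = 1 + \sum_{j \ne i} h_{ij}^2 \;\Longrightarrow\; h_{ij} = 0 \text{ for } j \ne i$$
which means that
$$\hat Y_i = \mathbf{x}_i^\top\hat{\boldsymbol\beta} = \mathbf{h}_i^\top\mathbf{Y} = h_{ii}Y_i = Y_i$$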
and the regression hyperplane passes through $(\mathbf{x}_i, Y_i)$, regardless of the values of other observations. The value $h_{ii} = 1$ is an extreme case, but it illustrates that a high value of the diagonal element $h_{ii}$ causes the regression hyperplane to pass near the point $(\mathbf{x}_i, Y_i)$. This point is called a leverage point of the dataset. There are different opinions about which value of $h_{ii}$ can be considered high. Huber (1981, p. 162) considers $\mathbf{x}_i$ a leverage point if $h_{ii} > .5$.

It is well known that if $EU_i = 0$ and $0 < \sigma^2 = EU_i^2 < \infty$, i = 1, ..., n, then
$$\lim_{n\to\infty} \max_{1 \le i \le n} h_{ii} = 0$$
is a necessary and sufficient condition for the convergence
$$E_{\boldsymbol\beta}\|\hat{\boldsymbol\beta}_n - \boldsymbol\beta\|^2 \to 0$$
$$\mathcal{L}\left\{(\mathbf{X}^\top\mathbf{X})^{\frac12}(\hat{\boldsymbol\beta}_n - \boldsymbol\beta)\right\} \to N\left(\mathbf{0}, \sigma^2\mathbf{I}_p\right)$$
as n → ∞, where $\mathbf{I}_p$ is the identity matrix of order p (see, e.g., Huber (1981)).

It is intuitively clear that large values of the residuals $|\hat Y_i - Y_i|$ are caused by a large maximal diagonal element of $\hat{\mathbf{H}}$. Consider this relation in more detail. Assume that the distribution function F has nondegenerate tails, i.e., 0 < F(x) < 1, x ∈ R; moreover, for the sake of simplicity, assume that it is symmetric around zero, i.e., F(x) + F(−x) = 1, x ∈ R. One possible characteristic of the tail behavior of the estimator $\hat{\boldsymbol\beta}$ is the following measure:
$$B(a, \hat{\boldsymbol\beta}) = \frac{-\log P_{\boldsymbol\beta}\left(\max_i |\mathbf{x}_i^\top(\hat{\boldsymbol\beta} - \boldsymbol\beta)| > a\right)}{-\log(1 - F(a))} \tag{4.5}$$
We naturally expect that
$$\lim_{a\to\infty} P_{\boldsymbol\beta}\left(\max_i |\mathbf{x}_i^\top(\hat{\boldsymbol\beta} - \boldsymbol\beta)| > a\right) = 0 \tag{4.6}$$
and we are interested in how fast this convergence can be, and under which conditions it is faster. Faster convergence leads to larger values of (4.5) for large a. Denote
$$\tilde h = \max_{1 \le i \le n} h_{ii}, \qquad h_{ii} = \mathbf{x}_i^\top(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{x}_i, \quad i = 1, \dots, n \tag{4.7}$$
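The properties of the hat matrix collected in this section, and the quantity $\tilde h$ of (4.7), can be verified numerically. The sketch below uses a made-up design with an intercept; the seed, the dimensions and the variable names are assumptions for illustration only.

```r
set.seed(4)                                    # arbitrary seed
n <- 20; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))   # made-up design with an intercept
y <- rnorm(n)                                  # made-up response

H <- X %*% solve(crossprod(X)) %*% t(X)        # hat matrix X (X'X)^{-1} X'
h <- diag(H)
h.tilde <- max(h)                              # maximal diagonal element, as in (4.7)

idempotent <- isTRUE(all.equal(H %*% H, H))    # projection: H H = H
trace.H <- sum(h)                              # should equal p
same.as.lm <- isTRUE(all.equal(unname(hatvalues(lm(y ~ X - 1))), h))

print(c(trace = trace.H, h.tilde = h.tilde, lower.bound = p / n))
```

The built-in function hatvalues returns the same diagonal elements for a fitted lm object, so the explicit matrix product is needed only for illustration.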
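The influence of $\tilde h$ on the limit behavior of $B(a, \hat{\boldsymbol\beta})$ is described in the following theorem:

Theorem 4.1 Let $\hat{\boldsymbol\beta}$ be the least squares estimator of $\boldsymbol\beta$ in model (4.1) with a nonrandom matrix X.

(i) If F has exponential tails, i.e.,
$$\lim_{a\to\infty} \frac{-\log(1 - F(a))}{ba} = 1, \quad b > 0,$$
then
$$\frac{1}{\sqrt{\tilde h}} \le \liminf_{a\to\infty} B(a, \hat{\boldsymbol\beta}) \le \limsup_{a\to\infty} B(a, \hat{\boldsymbol\beta}) \le \frac{1}{\tilde h} \tag{4.8}$$

(ii) If F has exponential tails with exponent r, i.e.,
$$\lim_{a\to\infty} \frac{-\log(1 - F(a))}{ba^r} = 1, \quad b > 0 \text{ and } r \in (1, 2],$$
then
$$\tilde h^{-r+1} \le \liminf_{a\to\infty} B(a, \hat{\boldsymbol\beta}) \le \limsup_{a\to\infty} B(a, \hat{\boldsymbol\beta}) \le \tilde h^{-r} \tag{4.9}$$

(iii) If F is a normal distribution function, then
$$\lim_{a\to\infty} B(a, \hat{\boldsymbol\beta}) = \frac{1}{\tilde h} \tag{4.10}$$

(iv) If F is heavy-tailed, i.e.,
$$\lim_{a\to\infty} \frac{-\log(1 - F(a))}{m \log a} = 1, \quad m > 0,$$
then
$$\lim_{a\to\infty} B(a, \hat{\boldsymbol\beta}) = 1 \tag{4.11}$$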
Theorem 4.1 shows that if the maximal diagonal element $\tilde h$ of matrix $\hat{\mathbf{H}}$ is large, then the probability $P_{\boldsymbol\beta}(\max_i |\mathbf{x}_i^\top(\hat{\boldsymbol\beta} - \boldsymbol\beta)| > a)$ decreases slowly to 0 with increasing a; this is the case even when F is normal and the number n of observations is large. The upper bound of $B(a, \hat{\boldsymbol\beta})$ under normal F is
$$\limsup_{a\to\infty} B(a, \hat{\boldsymbol\beta}) \le \frac{n}{p} \tag{4.12}$$
with equality under a balanced design with the diagonal $h_{ii} = \frac{p}{n}$, i = 1, ..., n.
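The bound (4.12) follows in one step from (4.10) and the trace identity of the hat matrix; the short derivation below is a sketch that uses only symbols already introduced in this section.

```latex
\sum_{i=1}^{n} h_{ii} = \operatorname{trace}\hat{\mathbf{H}} = p
\;\Longrightarrow\;
\tilde h = \max_{1 \le i \le n} h_{ii} \;\ge\; \frac{1}{n}\sum_{i=1}^{n} h_{ii} = \frac{p}{n}
\;\Longrightarrow\;
\lim_{a\to\infty} B(a, \hat{\boldsymbol\beta}) = \frac{1}{\tilde h} \;\le\; \frac{n}{p}
```

The maximum equals the average exactly when all $h_{ii}$ are equal, which is the balanced design mentioned above.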
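Proof of Theorem 4.1. Let us assume, without loss of generality, that $\tilde h = h_{11}$. Because $0 < \tilde h \le 1$ and $\hat Y_i = \mathbf{x}_i^\top\hat{\boldsymbol\beta} = \mathbf{h}_i^\top\mathbf{Y}$, we can write
$$\begin{aligned} P_{\boldsymbol\beta}\left(\max_i |\mathbf{x}_i^\top(\hat{\boldsymbol\beta} - \boldsymbol\beta)| > a\right) &= P_0\left(\max_i |\mathbf{h}_i^\top\mathbf{Y}| > a\right) \ge P_0(\mathbf{h}_1^\top\mathbf{Y} > a) \\ &\ge P_0(\tilde h Y_1 > a,\; h_{12}Y_2 \ge 0, \dots, h_{1n}Y_n \ge 0) \\ &\ge P_0(Y_1 > a/\tilde h)\left(\tfrac12\right)^{n-1} = (1 - F(a/\tilde h))\left(\tfrac12\right)^{n-1} \end{aligned}$$
This implies that
$$\limsup_{a\to\infty} B(a, \hat{\boldsymbol\beta}) \le \limsup_{a\to\infty} \frac{-\log(1 - F(a/\tilde h))}{-\log(1 - F(a))} \tag{4.13}$$
If F has exponential tails with index r, then it further follows from (4.13) that
$$\limsup_{a\to\infty} B(a, \hat{\boldsymbol\beta}) \le \limsup_{a\to\infty} \frac{b(a/\tilde h)^r}{ba^r} = \tilde h^{-r}$$
which gives the upper bounds in (i) and (ii). For a heavy-tailed F, it follows from (4.13) that
$$\limsup_{a\to\infty} B(a, \hat{\boldsymbol\beta}) \le \limsup_{a\to\infty} \frac{m\log(a/\tilde h)}{m\log a} = 1$$
and it gives (iv) because $\hat{\boldsymbol\beta}$ has both positive and negative residuals and $\liminf_{a\to\infty} B(a, \hat{\boldsymbol\beta}) \ge 1$. It remains to verify the lower bounds in (ii) and (iii). If F has exponential tails with exponent r, 1 < r ≤ 2, then, using the Markov inequality, we can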
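write for any ε ∈ (0, 1) that
$$P_{\boldsymbol\beta}\left(\max_i |\mathbf{x}_i^\top(\hat{\boldsymbol\beta} - \boldsymbol\beta)| > a\right) \le \frac{E_0\left[\exp\{(1-\varepsilon)b\tilde h^{1-r}(\max_i |\hat Y_i|)^r\}\right]}{\exp\{(1-\varepsilon)b\tilde h^{1-r}a^r\}}$$
Hence, if we can verify that
$$E_0\left[\exp\{(1-\varepsilon)b\tilde h^{1-r}(\max_i |\hat Y_i|)^r\}\right] \le C_r < \infty \tag{4.14}$$
we can claim that
$$-\log P_0\left(\max_i |\hat Y_i| > a\right) \ge -\log C_r + (1-\varepsilon)b\tilde h^{1-r}a^r$$
and this would give the lower bound in (ii), and in fact also the lower bound for the normal distribution in (iii). Thus, it remains to prove that the expected value in (4.14) is finite. Denote $\|\mathbf{x}\|_s = \left(\sum_{i=1}^n |x_i|^s\right)^{1/s}$, s > 0, and put $s = \frac{r}{r-1}$ (≥ 2). Then, regarding the relation $\sum_{k=1}^n h_{ik}^2 = h_{ii}$, we conclude
$$\begin{aligned} (\max_i |\hat Y_i|)^r &= \max_i |\mathbf{h}_i^\top\mathbf{Y}|^r \le \max_i \left(\|\mathbf{h}_i\|_s \|\mathbf{Y}\|_r\right)^r \\ &\le \max_i \Big(\sum_{k=1}^n h_{ik}^2\Big)^{r/s} \sum_{k=1}^n |Y_k|^r \le \tilde h^{r-1}\sum_{k=1}^n |Y_k|^r \end{aligned}$$
and hence
$$E_0\exp\{(1-\varepsilon)b\tilde h^{1-r}(\max_i |\hat Y_i|)^r\} \le E_0\exp\Big\{(1-\varepsilon)b\sum_{k=1}^n |Y_k|^r\Big\} \le \left(E_0\exp\{(1-\varepsilon)b|Y_1|^r\}\right)^n$$
For exponentially-tailed F with exponent r, there exists K > 0 such that $1 - F(x) \le \exp\{-(1 - \frac{\varepsilon}{2})bx^r\}$ for x > K. Integrating by parts, we obtain
$$\begin{aligned} 0 < E_0\left[\exp\{(1-\varepsilon)b|Y_1|^r\}\right] &= -2\int_0^\infty \exp\{(1-\varepsilon)by^r\}\,d(1 - F(y)) \\ &\le 2\int_0^K \exp\{(1-\varepsilon)by^r\}\,dF(y) + 2\exp\{(1-\varepsilon)bK^r\}(1 - F(K)) \\ &\quad + 2\int_K^\infty r(1-\varepsilon)by^{r-1}(1 - F(y))\exp\{(1-\varepsilon)by^r\}\,dy \\ &\le 2\int_0^K \exp\{(1-\varepsilon)by^r\}\,dF(y) + 2(1 - F(K))\exp\{(1-\varepsilon)bK^r\} \\ &\quad + 2\int_K^\infty r(1-\varepsilon)by^{r-1}\exp\{-\tfrac{\varepsilon}{2}by^r\}\,dy \le C_\varepsilon < \infty \end{aligned} \tag{4.15}$$
So we have proved (4.14) for 1 < r ≤ 2. If r = 1, we proceed as follows: (4.4)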
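implies that $|h_{ij}| \le \sqrt{h_{ii}}$, i, j = 1, ..., n; thus
$$\max_i |\hat Y_i| = \max_i |\mathbf{h}_i^\top\mathbf{Y}| = \max_i \Big|\sum_{j=1}^n h_{ij}Y_j\Big| \le \max_{i,j} |h_{ij}| \sum_{j=1}^n |Y_j| \le \tilde h^{\frac12}\sum_{j=1}^n |Y_j|$$
Using the Markov inequality, we obtain
$$P_0\left(\max_i |\hat Y_i| > a\right) \le \frac{E_0\exp\{(1-\varepsilon)b\tilde h^{-\frac12}\max_i |\hat Y_i|\}}{\exp\{(1-\varepsilon)b\tilde h^{-\frac12}a\}} \le \frac{\left(E_0\exp\{(1-\varepsilon)b|Y_1|\}\right)^n}{\exp\{(1-\varepsilon)b\tilde h^{-\frac12}a\}}$$
and it further follows from (4.15) that $E_0\exp\{(1-\varepsilon)b|Y_1|\} < \infty$; this gives the lower bound in (i).

If F is the normal distribution function of N(0, σ²), then $\hat{\mathbf{Y}} - \mathbf{X}\boldsymbol\beta$ has the n-dimensional normal distribution $N_n(\mathbf{0}, \sigma^2\hat{\mathbf{H}})$, hence
$$P_0\left(\max_i |\hat Y_i| > a\right) \ge P_0(\mathbf{h}_1^\top\mathbf{Y} > a) = 1 - \Phi(a\sigma^{-1}\tilde h^{-\frac12})$$
and $\limsup_{a\to\infty} B(a, \hat{\boldsymbol\beta}) \le \tilde h^{-1}$. □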
Proposition (iii) of Theorem 4.1 shows that the performance of the LSE can be poor even under the normal distribution of errors, provided the design is fixed and contains leverage points leading to large $\tilde h$. The rate of convergence in (4.6) does not improve even if the number of observations increases. In the extreme case of $\tilde h = 1$, the convergence (4.6) to zero is equally slow for an arbitrarily large number n of observations as if there were only one observation (just the leverage one). On the other hand, if the design is balanced with diagonal elements $h_{ii} = \frac{p}{n}$, i = 1, ..., n, then $\lim_{a\to\infty} B(a, \hat{\boldsymbol\beta}) = \frac{n}{p}$; hence the rate of convergence in (4.6) improves with n.

Up to now, we have considered a fixed number n of observations. The situation can change if n → ∞ and $a = a_n \to \infty$ at an appropriate rate. An interesting choice of the sequence $\{a_n\}$ is the population analogue of the extreme error among $U_1, \dots, U_n$, i.e., $a = a_n = F^{-1}\left(1 - \frac1n\right)$. For the normal distribution N(0, σ²), this population extreme is approximately
$$a_n = \sigma\Phi^{-1}\left(1 - \frac1n\right) \approx \sigma\sqrt{2\log n} \tag{4.16}$$
Namely, for the normal distribution of errors we shall derive the lower and upper bounds of $B(a_n, \hat{\boldsymbol\beta}_n)$ under this choice of $\{a_n\}$. These bounds are more
optimistic for the least squares estimator, because they both improve with increasing n. This is true for both fixed and random designs, and in the latter case even when the $\mathbf{x}_i$ are random vectors with a heavy-tailed distribution G, i = 1, ..., n.

Theorem 4.2 Consider the linear regression model
$$Y_i = \mathbf{x}_i^\top\boldsymbol\beta + e_i, \quad i = 1, \dots, n \tag{4.17}$$
with the observations $Y_i$, i = 1, ..., n and with independent errors $e_1, \dots, e_n$, normally distributed as N(0, σ²). Assume that the matrix $\mathbf{X}_n = [\mathbf{x}_1, \dots, \mathbf{x}_n]^\top$ is either fixed and of rank p for $n \ge n_0$, or that $\mathbf{x}_1, \dots, \mathbf{x}_n$ are independent p-dimensional random vectors, independent of the errors, identically distributed with distribution function G. Then
$$\limsup_{n\to\infty} \frac{B(\sigma\sqrt{2\log n}, \hat{\boldsymbol\beta}_n)}{\frac{(n-1)\log 2}{\log n} + \left(\frac{n}{p}\right)^2} \le 1 \tag{4.18}$$
and
$$\liminf_{n\to\infty} \frac{B(\sigma\sqrt{2\log n}, \hat{\boldsymbol\beta}_n)}{\frac{n}{2}\left(1 - \frac{1 + \log 2}{\log n} - \frac{\log\log n}{\log n}\right)} \ge 1 \tag{4.19}$$
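The sequence $a_n = \sigma\sqrt{2\log n}$ used in Theorem 4.2 rests on the approximation (4.16). How close the two sides of (4.16) are for moderate n can be checked numerically; the sketch below assumes σ = 1 and an arbitrary grid of sample sizes.

```r
sigma <- 1                                  # assumed error standard deviation
n <- c(10, 100, 1000, 1e4, 1e6)

exact  <- sigma * qnorm(1 - 1 / n)          # population extreme F^{-1}(1 - 1/n)
approx <- sigma * sqrt(2 * log(n))          # right-hand side of (4.16)

comparison <- data.frame(n = n, exact = exact, approx = approx,
                         ratio = exact / approx)
print(comparison)
```

The ratio increases towards 1 only slowly, so (4.16) should be read as a first-order approximation rather than a numerically sharp one.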
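Remark 4.1 The bounds (4.18) and (4.19) can be rewritten in the form of the following asymptotic inequalities that are true for the LSE under normal F, as n → ∞ and for any ε > 0:
$$\begin{aligned} \left(\frac{n}{p}\right)^2 \log n &\ge -\log P_0\left(\max_i \mathbf{x}_i^\top\hat{\boldsymbol\beta}_n \ge \sigma\sqrt{2\log n}\right) \\ &\ge \frac{n}{2}\left(1 - \frac{1 + \log 2}{\log n} - \frac{\log\log n}{\log n}\right) \ge \frac{n(1 - \varepsilon)}{2} \end{aligned} \tag{4.20}$$

Proof of Theorem 4.2. Let us first consider the upper bound. Note that $\tilde h_n \ge \frac{p}{n}$ because $\operatorname{trace} \hat{\mathbf{H}}_n = p$. For each j such that $h_{jj} > 0$ we can write
$$\begin{aligned} P_0\left(\max_{1 \le i \le n} (\mathbf{x}_i^\top\hat{\boldsymbol\beta}_n) \ge a \,\Big|\, \mathbf{x}_1, \dots, \mathbf{x}_n\right) &= P_0\left(\max_i (\mathbf{h}_i^\top\mathbf{Y}) \ge a \,\Big|\, \mathbf{x}_1, \dots, \mathbf{x}_n\right) \ge P_0\left(\mathbf{h}_j^\top\mathbf{Y} \ge a \,\big|\, \mathbf{x}_1, \dots, \mathbf{x}_n\right) \\ &\ge P_0\left(h_{jj}Y_j \ge a,\; h_{jk}Y_k \ge 0,\; k \ne j \,\big|\, \mathbf{x}_1, \dots, \mathbf{x}_n\right) \\ &\ge P_0\left(Y_j \ge \frac{a}{h_{jj}} \,\Big|\, \mathbf{x}_1, \dots, \mathbf{x}_n\right)\left(\frac12\right)^{n-1} = \left(1 - F\left(\frac{a}{h_{jj}}\right)\right)\left(\frac12\right)^{n-1} \end{aligned} \tag{4.21}$$
This holds for each j such that $h_{jj} > 0$; hence also
$$P_0\left(\max_{1 \le i \le n} (\mathbf{x}_i^\top\hat{\boldsymbol\beta}_n) \ge a \,\Big|\, \mathbf{x}_1, \dots, \mathbf{x}_n\right) \ge \left(1 - F\left(\frac{a}{\tilde h}\right)\right)\left(\frac12\right)^{n-1}$$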
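Because (− log) is a convex function, we may apply the Jensen inequality, and (4.21) then implies
$$\begin{aligned} -\log E_G P_0\left(\max_i (\mathbf{x}_i^\top\hat{\boldsymbol\beta}_n) \ge a \,\Big|\, \mathbf{x}_1, \dots, \mathbf{x}_n\right) &\le (n-1)\log 2 - \log E_G\left(1 - F\left(\frac{a}{\tilde h}\right)\right) \\ &\le (n-1)\log 2 + E_G\left(-\log\left(1 - F\left(\frac{a}{\tilde h}\right)\right)\right) \end{aligned} \tag{4.22}$$
If F is of exponential type (4.9) with 1 ≤ r ≤ 2, then
$$a = a_n = \left(b^{-1}\log n\right)^{\frac1r} \to \infty \quad \text{as } n \to \infty$$
and (4.22) gives the following asymptotic inequality, as n → ∞, for any G:
$$B(a_n, \hat{\boldsymbol\beta}_n) \le E_G\left(\frac{1}{\tilde h_n}\right)^r + \frac{(n-1)\log 2}{\log n} \le \left(\frac{n}{p}\right)^r + \frac{(n-1)\log 2}{\log n}$$
A more precise form of the above inequality is
$$\limsup_{n\to\infty} \frac{B(a_n, \hat{\boldsymbol\beta}_n)}{\left(\frac{n}{p}\right)^r + \frac{(n-1)\log 2}{\log n}} \le 1$$
and this gives (4.18).

The lower bound: Since N(0, σ²) has exponential tails,
$$\lim_{a\to\infty} \frac{-\log(1 - \Phi(a/\sigma))}{a^2/2\sigma^2} = 1$$
and we can write
$$\begin{aligned} P_0\left(\max_{1 \le i \le n} |\mathbf{x}_i^\top\hat{\boldsymbol\beta}_n| \ge a \,\Big|\, \mathbf{x}_1, \dots, \mathbf{x}_n\right) &= P_0\left(\max_{1 \le i \le n} |\mathbf{h}_i^\top\mathbf{Y}| \ge a \,\Big|\, \mathbf{x}_1, \dots, \mathbf{x}_n\right) \\ &\le P_0\left(\tilde h^{\frac12}\|\mathbf{Y}\| \ge a \,\Big|\, \mathbf{x}_1, \dots, \mathbf{x}_n\right) \\ &= P_0\left(\frac{\|\mathbf{Y}\|^2}{\sigma^2} \ge \frac{a^2}{\tilde h\sigma^2} \,\Big|\, \mathbf{x}_1, \dots, \mathbf{x}_n\right) = 1 - F_n\left(\frac{a^2}{\tilde h\sigma^2}\right) \end{aligned} \tag{4.23}$$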
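where $F_n$ is the χ² distribution function with n degrees of freedom. Because of (4.23), it holds (see Csörgő and Révész (1981) or Parzen (1975))
$$\sup_{x \in \mathbb{R}} F_n(x)(1 - F_n(x))\frac{|f_n'(x)|}{f_n^2(x)} \le 1$$
hence, because $\tilde h_n \le 1$,
$$1 - F_n\left(\frac{a^2}{\tilde h\sigma^2}\right) \le 1 - F_n\left(\frac{a^2}{\sigma^2}\right) \le \frac{\left(a^2/\sigma^2\right)^{\frac{n}{2} - 1}}{e^{a^2/2\sigma^2}\, 2^{n/2}\,\Gamma(n/2)\left[\frac12 - \left(\frac{n}{2} - 1\right)\frac{\sigma^2}{a^2}\right]} \tag{4.24}$$
Inserting $a_n = \sigma\sqrt{2\log n}$ in (4.24), we obtain
$$\frac{-\log(1 - F_n(2\log n))}{\log n} \ge \frac{n}{2}\left(1 - \frac{1 + \log 2}{\log n} - \frac{\log\log n}{\log n}\right)$$
□

4.3 M-estimators

The M-estimator of parameter $\boldsymbol\beta$ in model (4.1) is defined as a solution $\mathbf{M}_n$ of the minimization
$$\sum_{i=1}^n \rho(Y_i - \mathbf{x}_i^\top\mathbf{t}) := \min \tag{4.25}$$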
with respect to $\mathbf{t} \in \mathbb{R}^p$, where $\rho: \mathbb{R}^1 \to \mathbb{R}^1$ is an absolutely continuous, usually convex function with derivative ψ. Such $\mathbf{M}_n$ is obviously regression equivariant, i.e.,
$$\mathbf{M}_n(\mathbf{Y} + \mathbf{X}\mathbf{b}) = \mathbf{M}_n(\mathbf{Y}) + \mathbf{b} \quad \forall \mathbf{b} \in \mathbb{R}^p \tag{4.26}$$
but $\mathbf{M}_n$ is generally not scale equivariant: generally, it does not hold that
$$\mathbf{M}_n(c\mathbf{Y}) = c\mathbf{M}_n(\mathbf{Y}) \quad \text{for } c > 0 \tag{4.27}$$
A scale equivariant M-estimator is obtained either by studentization or by estimating the scale simultaneously with the regression parameter. The studentized M-estimator is a solution of the minimization
$$\sum_{i=1}^n \rho\left(\frac{Y_i - \mathbf{x}_i^\top\mathbf{t}}{S_n}\right) := \min \tag{4.28}$$
where $S_n = S_n(\mathbf{Y}) \ge 0$ is an appropriate scale statistic. To obtain $\mathbf{M}_n$ both regression and scale equivariant, our scale statistic $S_n$ must be scale equivariant and invariant to the regression, i.e.,
$$S_n(c(\mathbf{Y} + \mathbf{X}\mathbf{b})) = cS_n(\mathbf{Y}) \quad \forall \mathbf{b} \in \mathbb{R}^p \text{ and } c > 0 \tag{4.29}$$
Such is, e.g., the root of the residual sum of squares,
$$S_n(\mathbf{Y}) = \left[(\mathbf{Y} - \hat{\mathbf{Y}})^\top(\mathbf{Y} - \hat{\mathbf{Y}})\right]^{\frac12} = \left[\mathbf{Y}^\top(\mathbf{I}_n - \hat{\mathbf{H}})\mathbf{Y}\right]^{\frac12}$$
but this is closely connected with the least squares estimator, and thus highly non-robust. Robust scale statistics can be based on the regression quantiles or on the regression rank scores, which will be considered later. The minimization (4.28) should be supplemented with a rule on how to define
$\mathbf{M}_n$ in case $S_n(\mathbf{Y}) = 0$; but this mostly happens with probability 0. Moreover, a specific form of the rule does not affect the asymptotic behavior of $\mathbf{M}_n$. If $\psi(x) = \frac{d\rho(x)}{dx}$ is continuous, then $\mathbf{M}_n$ is a root of the system of equations
$$\sum_{i=1}^n \mathbf{x}_i\,\psi\left(\frac{Y_i - \mathbf{x}_i^\top\mathbf{t}}{S_n}\right) = \mathbf{0} \tag{4.30}$$
This system can have more roots, while only one leads to the global minimum of (4.28). Under general conditions, there always exists at least one root of the system (4.30) which is a $\sqrt{n}$-consistent estimator of $\boldsymbol\beta$ (see Jurečková and Sen (1996)).

Another important case is that ψ is a nondecreasing step function, hence ρ is a convex, piecewise linear function. Then $\mathbf{M}_n$ is a point of minimum of the convex function $\sum_{i=1}^n \rho\left((Y_i - \mathbf{x}_i^\top\mathbf{t})/S_n\right)$ over $\mathbf{t} \in \mathbb{R}^p$. In this case, too, we can prove its consistency and asymptotic normality.

If we want to estimate the scale simultaneously with the regression parameter, we can proceed in various ways. One possibility is to consider $(\mathbf{M}_n, \hat\sigma)$ as a solution of the minimization
$$\sum_{i=1}^n \sigma\rho\left(\sigma^{-1}(Y_i - \mathbf{x}_i^\top\mathbf{t})\right) + a\sigma := \min, \quad \mathbf{t} \in \mathbb{R}^p,\; \sigma > 0 \tag{4.31}$$
with an appropriate constant a > 0. We arrive at the system of p + 1 equations
$$\sum_{i=1}^n \mathbf{x}_i\,\psi\left(\frac{Y_i - \mathbf{x}_i^\top\mathbf{t}}{\sigma}\right) = \mathbf{0}, \qquad \sum_{i=1}^n \chi\left(\frac{Y_i - \mathbf{x}_i^\top\mathbf{t}}{\sigma}\right) = a$$
where
$$\chi(x) = x\psi(x) - \rho(x) \quad \text{and} \quad a = \int_{\mathbb{R}} \chi(x)\,d\Phi(x) \tag{4.32}$$
and Φ is the standard normal distribution function. The usual choice of ψ is the Huber function (3.16).

The matrix X can be random, fixed or a mixture of random and fixed elements. If the matrix X is random, then its rows are usually independent random vectors, identically distributed, hence they are an independent sample from some multivariate distribution. The influence function of $\mathbf{M}_n$ in this situation depends on two arguments, on x and y. Similarly, the possible breakdown and the value of the breakdown point of $\mathbf{M}_n$ should be considered not only with respect to changes in observations y, but also with respect to those of x.
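A studentized Huber M-estimator of regression, i.e., a root of (4.30), can be approximated by iteratively reweighted least squares. The sketch below is one common way to do this and is not taken from the book: the function name `huber.regression`, the tuning constant 1.345, the use of the MAD of the current residuals as the scale $S_n$, and the simulated data are all assumptions made for illustration.

```r
# Huber psi function; the tuning constant k = 1.345 is an assumed, commonly used value
psi.huber <- function(u, k = 1.345) pmax(-k, pmin(k, u))

# Hypothetical helper: Huber M-estimator by iteratively reweighted least squares
huber.regression <- function(X, y, k = 1.345, tol = 1e-8, maxit = 100) {
  beta <- qr.solve(X, y)                          # least squares starting value
  for (it in seq_len(maxit)) {
    r <- as.vector(y - X %*% beta)                # current residuals
    s <- mad(r)                                   # assumed scale: MAD of residuals
    u <- r / s
    w <- ifelse(u == 0, 1, psi.huber(u, k) / u)   # weights psi(u) / u
    beta.new <- qr.solve(sqrt(w) * X, sqrt(w) * y)  # weighted least squares step
    if (max(abs(beta.new - beta)) < tol) break
    beta <- beta.new
  }
  list(coefficients = as.vector(beta.new), scale = s, iterations = it)
}

set.seed(5)                                       # arbitrary seed
n <- 50
X <- cbind(1, rnorm(n))                           # made-up design with an intercept
y <- as.vector(X %*% c(1, 2) + rnorm(n))          # made-up model: intercept 1, slope 2
y[1:5] <- y[1:5] + 20                             # gross errors in the response

fit <- huber.regression(X, y)
print(rbind(LSE = qr.solve(X, y), Huber = fit$coefficients))
```

At a fixed point the weighted normal equations $\sum_i \mathbf{x}_i w_i r_i = \mathbf{0}$ with $w_i = \psi(u_i)/u_i$ coincide with (4.30). The outliers here are only in the response; as discussed next, bounding ψ gives no protection against outliers in x.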
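4.3.1 Influence function of M-estimator with random matrix

Consider the model (4.1) with random matrix X, where $(\mathbf{x}_i^\top, Y_i)^\top$, i = 1, ..., n are independent random vectors with values in $\mathbb{R}^p \times \mathbb{R}^1$, identically distributed with distribution function P(x, y). If ρ has an absolutely continuous derivative ψ, then the statistical functional T(P), corresponding to the estimator (4.25), is a solution of the system of p equations
$$\int_{\mathbb{R}^{p+1}} \mathbf{x}\,\psi(y - \mathbf{x}^\top\mathbf{T}(P))\,dP(\mathbf{x}, y) = \mathbf{0} \tag{4.33}$$
Let $P_t$ denote the contaminated distribution
$$P_t = (1 - t)P + t\delta(\mathbf{x}_0, y_0), \quad 0 \le t \le 1, \quad (\mathbf{x}_0, y_0) \in \mathbb{R}^p \times \mathbb{R}$$
where $\delta(\mathbf{x}_0, y_0)$ is a degenerate distribution with the probability mass concentrated at the point $(\mathbf{x}_0, y_0)$. Then the functional $\mathbf{T}(P_t)$ solves the system of equations
$$(1 - t)\int_{\mathbb{R}^{p+1}} \mathbf{x}\,\psi(y - \mathbf{x}^\top\mathbf{T}(P_t))\,dP(\mathbf{x}, y) + t\,\mathbf{x}_0\,\psi(y_0 - \mathbf{x}_0^\top\mathbf{T}(P_t)) = \mathbf{0}.$$
Differentiating by t we obtain
$$\begin{aligned} &-\int_{\mathbb{R}^{p+1}} \mathbf{x}\,\psi(y - \mathbf{x}^\top\mathbf{T}(P_t))\,dP(\mathbf{x}, y) + \mathbf{x}_0\,\psi(y_0 - \mathbf{x}_0^\top\mathbf{T}(P_t)) \\ &\quad - (1 - t)\int_{\mathbb{R}^{p+1}} \mathbf{x}\mathbf{x}^\top\frac{d\mathbf{T}(P_t)}{dt}\,\psi'(y - \mathbf{x}^\top\mathbf{T}(P_t))\,dP(\mathbf{x}, y) \\ &\quad - t\,\mathbf{x}_0\mathbf{x}_0^\top\frac{d\mathbf{T}(P_t)}{dt}\,\psi'(y_0 - \mathbf{x}_0^\top\mathbf{T}(P_t)) = \mathbf{0} \end{aligned}$$
and we get the influence function
$$\mathbf{IF}(\mathbf{x}_0, y_0; \mathbf{T}, P) = \frac{d\mathbf{T}(P_t)}{dt}\bigg|_{t=0}$$
if we put t = 0 and notice that, on account of (4.33),
$$\int_{\mathbb{R}^{p+1}} \mathbf{x}\,\psi(y - \mathbf{x}^\top\mathbf{T}(P))\,dP(\mathbf{x}, y) = \mathbf{0};$$
then
$$\left[\int_{\mathbb{R}^{p+1}} \mathbf{x}\mathbf{x}^\top\psi'(y - \mathbf{x}^\top\mathbf{T}(P))\,dP(\mathbf{x}, y)\right]\mathbf{IF}(\mathbf{x}_0, y_0; \mathbf{T}, P) = \mathbf{x}_0\,\psi(y_0 - \mathbf{x}_0^\top\mathbf{T}(P))$$
Hence, the influence function of the M-estimator is of the form
$$\mathbf{IF}(\mathbf{x}_0, y_0; \mathbf{T}, P) = \mathbf{B}^{-1}\mathbf{x}_0\,\psi(y_0 - \mathbf{x}_0^\top\mathbf{T}(P)) \tag{4.34}$$
where
$$\mathbf{B} = \int_{\mathbb{R}^{p+1}} \mathbf{x}\mathbf{x}^\top\psi'(y - \mathbf{x}^\top\mathbf{T}(P))\,dP(\mathbf{x}, y) \tag{4.35}$$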
Observe that the influence function (4.34) is bounded in $y_0$, provided ψ is bounded; however, it is unbounded in $\mathbf{x}_0$, and thus the M-estimator is non-robust with respect to X. Many authors tried to overcome this shortcoming and introduced various generalized M-estimators, called GM-estimators, that reduce the effect of outliers in x with the aid of properly chosen weights.
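As a simple check of (4.34) and (4.35), consider the least squares case. Taking $\rho(u) = u^2/2$, so that $\psi(u) = u$ and $\psi'(u) = 1$, the formulas reduce as sketched below; only symbols from this subsection are used.

```latex
\mathbf{B} = \int_{\mathbb{R}^{p+1}} \mathbf{x}\mathbf{x}^\top \, dP(\mathbf{x}, y) = E_P\left[\mathbf{x}\mathbf{x}^\top\right],
\qquad
\mathbf{IF}(\mathbf{x}_0, y_0; \mathbf{T}, P)
  = \left(E_P\left[\mathbf{x}\mathbf{x}^\top\right]\right)^{-1} \mathbf{x}_0 \left(y_0 - \mathbf{x}_0^\top \mathbf{T}(P)\right)
```

For least squares the influence function is thus unbounded in both $y_0$ and $\mathbf{x}_0$. Replacing ψ by a function bounded by k caps the residual factor at k, but the factor $\mathbf{B}^{-1}\mathbf{x}_0$ still grows linearly with $\mathbf{x}_0$.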
4.3.2 Large sample distribution of the M-estimator with nonrandom matrix

Because the M-estimator is nonlinear and defined implicitly as a solution of a minimization, it would be very difficult to derive its distribution function under a finite number of observations. Moreover, even if we were able to derive this distribution function, its complicated form would not give the right picture of the estimator. This applies also to other robust estimators. In this situation, we take recourse to the limiting (asymptotic) distributions of estimators, which are typically normal, and their covariance matrices fortunately often have a compact form. Deriving the asymptotic distribution is not easy and we should use various, sometimes nontraditional methods that have an interest of their own, but their details go beyond this text. In this context, we refer to the monographs cited in the literature, such as Huber (1981), Hampel et al. (1986), Rieder (1994), Jurečková and Sen (1996), among others.

Let us start with the asymptotic properties of the non-studentized estimator with a nonrandom matrix X. Assume that the distribution function F of errors $U_i$ in model (4.1) is symmetric around zero. Consider the M-estimator $\mathbf{M}_n$ as a solution of the minimization (4.25) with an odd, absolutely continuous function $\psi = \rho'$ such that $E_F\psi^2(U_1) < \infty$. The matrix $\mathbf{X} = \mathbf{X}_n$ is supposed to be of rank p and $\max_{1 \le i \le n} h_{ii}^{(n)} \to 0$ as n → ∞, i.e., the maximal diagonal element of the projection (hat) matrix $\hat{\mathbf{H}}_n = \mathbf{X}_n(\mathbf{X}_n^\top\mathbf{X}_n)^{-1}\mathbf{X}_n^\top$ tends to zero. Then
$$\mathbf{M}_n \stackrel{p}{\longrightarrow} \boldsymbol\beta$$
$$\mathcal{L}\left\{(\mathbf{X}_n^\top\mathbf{X}_n)^{\frac12}(\mathbf{M}_n - \boldsymbol\beta)\right\} \to N_p\left(\mathbf{0}, \sigma^2(\psi, F)\mathbf{I}_p\right) \tag{4.36}$$
as n → ∞, where
$$\sigma^2(\psi, F) = \frac{E_F\psi^2(U_1)}{(E_F\psi'(U_1))^2}$$
If, moreover, $\frac1n\mathbf{X}_n^\top\mathbf{X}_n \to \mathbf{Q}$, where Q is a positive definite matrix of order p × p, then
$$\sqrt{n}(\mathbf{M}_n - \boldsymbol\beta) \stackrel{\mathcal{L}}{\to} N_p\left(\mathbf{0}, \sigma^2(\psi, F)\mathbf{Q}^{-1}\right)$$
If ψ has jump points, but is nondecreasing, and F is absolutely continuous with density f, then (4.36) is still true with the only difference being that
$$\sigma^2(\psi, F) = \frac{E_F\psi^2(U_1)}{\left(\int_{\mathbb{R}} f(x)\,d\psi(x)\right)^2}$$
Notice that $\sigma^2(\psi, F)$ is the same as in the asymptotic distribution of the M-estimator of location.

If the M-estimator $\mathbf{M}_n$ is studentized by the scale statistic $S_n$ such that $S_n \stackrel{p}{\longrightarrow} S(F)$ as n → ∞, then the asymptotic covariance matrix of $\mathbf{M}_n$ depends on S(F). In the models with intercepts, only the first component of the estimator is asymptotically affected by S(F).
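The variance factor $\sigma^2(\psi, F)$ in (4.36) can be evaluated numerically for a concrete ψ and F. The sketch below assumes the Huber ψ function with tuning constant k and standard normal errors; the grid of k values and the function names are illustrative choices, not part of the text.

```r
# Huber psi with tuning constant k (k is an assumed illustration parameter)
psi <- function(u, k) pmax(-k, pmin(k, u))

# sigma^2(psi, F) = E psi^2(U) / (E psi'(U))^2 for F = N(0, 1)
sigma2.huber <- function(k) {
  e.psi2 <- integrate(function(u) psi(u, k)^2 * dnorm(u), -Inf, Inf)$value
  e.dpsi <- pnorm(k) - pnorm(-k)            # psi'(u) = 1 on (-k, k) and 0 outside
  e.psi2 / e.dpsi^2
}

k <- c(0.5, 1, 1.345, 2, 5)
efficiency <- 1 / sapply(k, sigma2.huber)   # relative to psi(u) = u, whose factor is 1
print(round(cbind(k, efficiency), 3))
```

Under normal errors the efficiency relative to least squares increases with k and approaches 1 as the Huber estimator approaches the LSE; the price of a small k is paid only at the normal model, while the gain appears under heavier tails.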
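4.3.3 Large sample properties of the M-estimator with random matrix

If the system of equations
$$E_P\left[\mathbf{x}\,\psi(y - \mathbf{x}^\top\mathbf{t})\right] = \mathbf{0}$$
has a unique root $\mathbf{T}(P) = \boldsymbol\beta$, then $\mathbf{T}(P_n) \to \mathbf{T}(P)$ as n → ∞, where $P_n$ is the empirical distribution pertaining to observations $((\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n))$. The functional $\mathbf{T}(P_n)$ admits, under some conditions on the probability distribution P, the following asymptotic representation
$$\mathbf{T}(P_n) = \mathbf{T}(P) + \frac1n\sum_{i=1}^n \mathbf{IF}(\mathbf{x}_i, y_i; \mathbf{T}, P) + o_p(n^{-\frac12}) \quad \text{as } n \to \infty$$
If $E_P\|\mathbf{IF}(\mathbf{x}, y; \mathbf{T}, P)\|^2 < \infty$, the above representation further leads to the asymptotic distribution of $\mathbf{T}(P_n)$:
$$\sqrt{n}(\mathbf{T}(P_n) - \mathbf{T}(P)) \stackrel{\mathcal{L}}{\to} N_p(\mathbf{0}, \boldsymbol\Sigma) \tag{4.37}$$
where
$$\boldsymbol\Sigma = E_P\left\{[\mathbf{IF}(\mathbf{x}, y; \mathbf{T}, P)][\mathbf{IF}(\mathbf{x}, y; \mathbf{T}, P)]^\top\right\} = \mathbf{B}^{-1}\mathbf{A}\mathbf{B}^{-1}$$
B is the matrix defined in (4.35) and
$$\mathbf{A} = \int_{\mathbb{R}^{p+1}} \mathbf{x}\mathbf{x}^\top\psi^2(y - \mathbf{x}^\top\mathbf{T}(P))\,dP(\mathbf{x}, y)$$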
4.4 GM-estimators

The influence function (4.34) of the M-estimator is unbounded in x, thus the M-estimator is sensitive to possible leverage points in matrix X. The choice of function ψ has no effect on this phenomenon. Some authors proposed to supplement the definition of the M-estimator by suitable weights w that reduce the influence of the gross values of $x_{ij}$.

Mallows (1973, 1975) proposed the generalized M-estimator as a solution of the minimization
$$\sum_{i=1}^n \sigma w(\mathbf{x}_i)\rho\left(\frac{Y_i - \mathbf{x}_i^\top\mathbf{t}}{\sigma}\right) := \min, \quad \mathbf{t} \in \mathbb{R}^p,\; \sigma > 0 \tag{4.38}$$
If $\psi = \rho'$ is continuous, then the generalized M-estimator solves the equation
$$\sum_{i=1}^n \mathbf{x}_i w(\mathbf{x}_i)\psi\left(\frac{Y_i - \mathbf{x}_i^\top\mathbf{t}}{\sigma}\right) = \mathbf{0} \tag{4.39}$$
The influence function of the pertaining functional T(P) then equals
$$\mathbf{IF}(\mathbf{x}, y; \mathbf{T}, P) = \mathbf{B}^{-1}\mathbf{x}w(\mathbf{x})\psi\left(\frac{y - \mathbf{x}^\top\mathbf{T}(P)}{S(P)}\right) \tag{4.40}$$
where S(P) is the functional corresponding to the solution σ in the minimization (4.38). We obtain a bounded influence function if we take w leading to bounded $\mathbf{x}w(\mathbf{x})$.

Such an estimator is a special case of the following GM-estimator, which solves the equations
$$\sum_{i=1}^n \mathbf{x}_i\,\eta\left(\mathbf{x}_i, \frac{Y_i - \mathbf{x}_i^\top\mathbf{t}}{\sigma}\right) = \mathbf{0}, \qquad \sum_{i=1}^n \chi\left(\frac{Y_i - \mathbf{x}_i^\top\mathbf{t}}{\sigma}\right) = 0 \tag{4.41}$$
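with functions η, χ, where $\eta: \mathbb{R}^p \times \mathbb{R} \to \mathbb{R}$ and $\chi: \mathbb{R} \to \mathbb{R}$. If we take η(x, u) = u and χ(u) = u² − 1, we obtain the least squares estimator; the choice η(x, u) = ψ(u) leads to the M-estimator, and η(x, u) = w(x)ψ(u) leads to the Mallows GM-estimator. The usual choice of the function η is η(x, u) = ψ1x(x) ψ(u), where ψ is, e.g., the Huber function. The choice of function χ usually coincides with (4.32). The statistical functionals T(P) and S(P) corresponding to $\mathbf{M}_n$ and $\sigma_n$ are defined implicitly as a solution of the system of equations:
$$\begin{aligned} \int_{\mathbb{R}^{p+1}} \mathbf{x}\,\eta\left(\mathbf{x}, \frac{y - \mathbf{x}^\top\mathbf{T}(P)}{S(P)}\right)dP(\mathbf{x}, y) &= \mathbf{0} \\ \int_{\mathbb{R}^{p+1}} \chi\left(\frac{y - \mathbf{x}^\top\mathbf{T}(P)}{S(P)}\right)dP(\mathbf{x}, y) &= 0 \end{aligned} \tag{4.42}$$
The influence function of functional T(P) in the special case σ = 1 has the form
$$\mathbf{IF}(\mathbf{x}, y; \mathbf{T}, P) = \mathbf{B}^{-1}\mathbf{x}\,\eta(\mathbf{x}, y - \mathbf{x}^\top\mathbf{T}(P)),$$
where
$$\mathbf{B} = \int_{\mathbb{R}^{p+1}} \mathbf{x}\mathbf{x}^\top\left[\frac{\partial}{\partial u}\eta(\mathbf{x}, u)\right]_{u = y - \mathbf{x}^\top\mathbf{T}(P)} dP(\mathbf{x}, y)$$
The asymptotic properties of GM-estimators were studied by Maronna and Yohai (1981), among others. Under some regularity conditions, the GM-estimators are strongly consistent and $\sqrt{n}(\mathbf{T}(P_n) - \mathbf{T}(P))$ has asymptotic p-dimensional normal distribution $N_p(\mathbf{0}, \boldsymbol\Sigma)$ with covariance matrix $\boldsymbol\Sigma = \mathbf{B}^{-1}\mathbf{A}\mathbf{B}^{-1}$, where
$$\mathbf{A} = \int_{\mathbb{R}^{p+1}} \mathbf{x}\mathbf{x}^\top\eta^2(\mathbf{x}, y - \mathbf{x}^\top\mathbf{T}(P))\,dP(\mathbf{x}, y)$$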
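Krasker and Welsch (1982) proposed a GM-estimator as a solution of the system of equations
$$\sum_{i=1}^{n} \mathbf{x}_i\, w_i\, \frac{Y_i - \mathbf{x}_i^\top \mathbf{t}}{\sigma} = \mathbf{0}$$
with weights $w_i = w(\mathbf{x}_i, Y_i, \mathbf{t}) > 0$ determined so that they maximize the asymptotic efficiency of the estimator (with respect to the asymptotic covariance matrix $\boldsymbol{\Sigma}$) under the constraint $\gamma^* \le a < \infty$, where $\gamma^*$ is the global sensitivity of the functional $\mathbf{T}$ under the distribution $P$, i.e.,
$$\gamma^* = \sup_{\mathbf{x}, y} \left[ (\mathrm{IF}(\mathbf{x}, y; \mathbf{T}, P))^\top \boldsymbol{\Sigma}^{-1} (\mathrm{IF}(\mathbf{x}, y; \mathbf{T}, P)) \right]^{1/2}$$
As a solution, we obtain weights of the form
$$w(\mathbf{x}, y, \mathbf{t}) = \min\left\{ 1, \frac{a}{\left| \frac{y - \mathbf{x}^\top \mathbf{t}}{\sigma} \right| (\mathbf{x}^\top \mathbf{A} \mathbf{x})^{1/2}} \right\}$$
where
$$\mathbf{A} = \int_{\mathbb{R}^{p+1}} \mathbf{x}\mathbf{x}^\top\, \frac{y - \mathbf{x}^\top \mathbf{t}}{\sigma}\, w^2(\mathbf{x}, y, \mathbf{t})\, dP(\mathbf{x}, y)$$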
The Krasker-Welsch estimator has a bounded influence function, but it must be computed iteratively, because the matrix A depends on the weights w.
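The weight formula can be sketched in a few lines of Python. The sketch evaluates the weight for one observation, taking the matrix A as given; it follows the formula as printed above and does not attempt the iteration that determines A. The function names, the convention of returning weight 1 when the denominator vanishes, and the example values are assumptions made for illustration.

```python
import math

def quad_form(x, A):
    """x' A x for a vector x and a square matrix A given as nested lists."""
    p = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(p) for j in range(p))

def kw_weight(x, y, t, A, a, sigma=1.0):
    """min{1, a / (|(y - x't)/sigma| * sqrt(x' A x))}.

    A is assumed positive semidefinite so that the square root is defined.
    A zero denominator is treated as 'no downweighting' (an assumed convention).
    """
    u = (y - sum(xj * tj for xj, tj in zip(x, t))) / sigma
    denom = abs(u) * math.sqrt(quad_form(x, A))
    if denom == 0.0:
        return 1.0
    return min(1.0, a / denom)

# Hypothetical inputs: identity matrix for A, bound a = 2, t = (0, 0).
A = [[1.0, 0.0], [0.0, 1.0]]
t = (0.0, 0.0)
w_small = kw_weight((1.0, 0.0), 1.0, t, A, a=2.0)    # small residual, low leverage
w_resid = kw_weight((1.0, 0.0), 4.0, t, A, a=2.0)    # large residual
w_lever = kw_weight((3.0, 4.0), 10.0, t, A, a=2.0)   # large residual and high leverage
```

With these inputs the first observation keeps full weight, while the other two are downweighted to 0.5 and 0.04, so that the product of the weight, the absolute standardized residual and $(\mathbf{x}^\top \mathbf{A} \mathbf{x})^{1/2}$ never exceeds $a$. In an actual computation the weights and A would be updated alternately until they stabilize.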
4.5 S-estimators and MM-estimators

The S-estimator, proposed by Rousseeuw and Yohai (1984), minimizes an estimator of scale,
$$\tilde{\boldsymbol{\beta}}_n = \arg\min\, \tilde{\sigma}_n(\boldsymbol{\beta})$$
and the estimator of scale $\tilde{\sigma}_n(\boldsymbol{\beta})$ solves the equation
$$\frac{1}{n} \sum_{i=1}^{n} \rho\left(\frac{Y_i - \mathbf{x}_i^\top \boldsymbol{\beta}}{\tilde{\sigma}_n(\boldsymbol{\beta})}\right) = b \quad \text{for each fixed } \boldsymbol{\beta}$$
where $\rho(x)$ is a symmetric, continuous function, nondecreasing in $|x|$, and $b = \int \rho(x)\, d\Phi(x)$, with $\Phi$ being the standard normal distribution function.

After calculating the S-estimator $\tilde{\boldsymbol{\beta}}_n$ and the scale $\tilde{\sigma}_n$, we calculate the MM-estimator of $\boldsymbol{\beta}$, proposed by Yohai (1987), as a solution of the minimization
$$\sum_{i=1}^{n} \rho\left(\frac{Y_i - \mathbf{x}_i^\top \boldsymbol{\beta}}{\tilde{\sigma}_n(\tilde{\boldsymbol{\beta}}_n)}\right) = \min, \quad \boldsymbol{\beta} \in \mathbb{R}^p$$

4.6 L-estimators, regression quantiles

L-estimators of a location parameter, defined as linear combinations of order statistics or of functions of order statistics, are highly appealing and intuitive, because they are formulated explicitly, not as solutions of minimization problems or of systems of equations. Their calculation is therefore much easier. Naturally, many statisticians tried to extend L-estimators to the linear regression model. Surprisingly, this extension was not easy, because it was difficult to find a natural and intuitive extension of the empirical (sample) quantile to the regression model. A successful extension of the sample quantile appeared only in 1978, when Koenker and Bassett introduced the regression $\alpha$-quantile $\boldsymbol{\beta}(\alpha)$ for model (4.1). It is more illustrative in a model with an intercept; hence let us assume that $\beta_1$ is an intercept and that the matrix $\mathbf{X}$ satisfies the condition
$$x_{i1} = 1, \quad i = 1, \ldots, n \tag{4.43}$$
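The regression $\alpha$-quantile $\boldsymbol{\beta}(\alpha)$, $0 < \alpha < 1$, is defined as a solution of the minimization
$$\sum_{i=1}^{n} \rho_\alpha(Y_i - \mathbf{x}_i^\top \mathbf{t}) := \min, \quad \mathbf{t} \in \mathbb{R}^p \tag{4.44}$$
with the criterion function
$$\rho_\alpha(x) = |x| \{ \alpha I[x > 0] + (1 - \alpha) I[x < 0] \}, \quad x \in \mathbb{R} \tag{4.45}$$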
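In the simplest case $p = 1$ with $x_{i1} = 1$, the minimization (4.44) reduces to a location problem and its solution is a sample $\alpha$-quantile. The following Python sketch illustrates this on a small data set. The function names and the data are hypothetical, and the search over the observed values relies on the fact that a piecewise linear convex function with kinks at the observations attains its minimum at one of them.

```python
def rho_alpha(x, alpha):
    """Criterion function (4.45): alpha * x for x > 0, (1 - alpha) * |x| for x < 0."""
    if x > 0:
        return alpha * x
    if x < 0:
        return (1.0 - alpha) * (-x)
    return 0.0

def objective(t, ys, alpha):
    """Objective of (4.44) in the location model, where x_i' t reduces to the scalar t."""
    return sum(rho_alpha(y - t, alpha) for y in ys)

def location_quantile(ys, alpha):
    """Minimize the objective over the observed values (a minimizer lies among them)."""
    return min(sorted(ys), key=lambda t: objective(t, ys, alpha))

ys = [3.0, 1.0, 4.0, 1.5, 9.0, 2.6, 5.0]
q50 = location_quantile(ys, 0.5)    # sample median
q25 = location_quantile(ys, 0.25)   # lower quartile
```

For these seven observations the minimizer at $\alpha = 0.5$ is the median 3.0, and at $\alpha = 0.25$ it is the second smallest observation 1.5. The asymmetric weights $\alpha$ and $1 - \alpha$ are what shift the minimizer away from the median.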
The function $\rho_\alpha$ is piecewise linear and convex; hence we can intuitively expect that the minimization (4.44) can be solved by some modification of the simplex algorithm of linear programming. Indeed, Koenker and Bassett (1978) proposed to calculate $\boldsymbol{\beta}(\alpha)$ as the component $\boldsymbol{\beta}$ of the optimal solution $(\boldsymbol{\beta}, \mathbf{r}^+, \mathbf{r}^-)$ of the parametric linear programming problem
$$\alpha \sum_{i=1}^{n} r_i^+ + (1 - \alpha) \sum_{i=1}^{n} r_i^- := \min \tag{4.46}$$
under the constraints
$$\sum_{j=1}^{p} x_{ij} \beta_j + r_i^+ - r_i^- = Y_i, \quad i = 1, \ldots, n$$
$\beta_j \in \mathbb{R}^1$, $j = 1, \ldots, p$, $r_i^+, r_i^- \ge 0$, $i = 1, \ldots, n$, 0