Springer Series in Statistics
Jeffrey D. Hart
Springer
Springer Series in Statistics Advisors: P. Bickel, P. Diggle, S. Fienberg, K. Krickeberg, I. Olkin, N. Wermuth, S. Zeger
Springer New York Berlin Heidelberg Barcelona Budapest Hong Kong London Milan Paris Santa Clara Singapore Tokyo
For Michelle and Kayley
Jeffrey D. Hart Department of Statistics Texas A&M University College Station, TX 77843-3143 USA
Library of Congress Cataloging-in-Publication Data
Hart, Jeffrey D.
Nonparametric smoothing and lack-of-fit tests / Jeffrey D. Hart
p. cm. - (Springer series in statistics)
Includes bibliographical references and indexes.
ISBN 0-387-94980-1 (hardcover: alk. paper)
1. Smoothing (Statistics) 2. Nonparametric statistics. 3. Goodness-of-fit tests. I. Title. II. Series
QA278.H357 1997 519.5-dc21 97-10931

Printed on acid-free paper.

© 1997 Springer-Verlag New York, Inc. All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer-Verlag New York, Inc., 175 Fifth Avenue, New York, NY 10010, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use of general descriptive names, trade names, trademarks, etc., in this publication, even if the former are not especially identified, is not to be taken as a sign that such names, as understood by the Trade Marks and Merchandise Marks Act, may accordingly be used freely by anyone.
Production managed by Steven Pisano; manufacturing supervised by Joe Quatela. Photocomposed pages prepared from the author's LaTeX files. Printed and bound by Maple-Vail Book Manufacturing Group, York, PA. Printed in the United States of America.
ISBN 0-387-94980-1 Springer-Verlag New York Berlin Heidelberg SPIN 10568296
Preface
The primary aim of this book is to explore the use of nonparametric regression (i.e., smoothing) methodology in testing the fit of parametric regression models. It is anticipated that the book will be of interest to an audience of graduate students, researchers and practitioners who study or use smoothing methodology. Chapters 2-4 serve as a general introduction to smoothing in the case of a single design variable. The emphasis in these chapters is on estimation of regression curves, with hardly any mention of the lack-of-fit problem. As such, Chapters 2-4 could be used as the foundation of a graduate-level statistics course on nonparametric regression. The purpose of Chapter 2 is to convey some important basic principles of smoothing in a nontechnical way. It should be of interest to practitioners who are new to smoothing and want to learn some fundamentals without having to sift through a lot of mathematics. Chapter 3 deals with statistical properties of smoothers and is somewhat more theoretical than Chapter 2. Chapter 4 describes the principal methods of smoothing parameter selection and investigates their large-sample properties. The remainder of the book explores the problem of testing the fit of probability models. The emphasis is on testing the fit of parametric regression models, but other types of models are also considered (in Chapter 9). Chapter 5 is a review of classical lack-of-fit tests, including likelihood ratio tests, the reduction method from linear models, and some nonparametric tests. The subject of Chapter 6 is the earliest work on using linear smoothers to test the fit of models. These tests assume that a statistic's smoothing parameter is nonstochastic, which entails a certain degree of arbitrariness in performing a test. The real heart of this book is Chapters 7 through 10, in which lack-of-fit tests based on data-driven smoothing parameters are studied.
It is my opinion that such tests will have the greatest appeal to both practitioners and researchers. Chapters 7 and 8 are a careful treatment of distributional properties of various "data-driven" test statistics. Chapter 9 shows that many of the ideas learned in Chapters 7 and 8 have immediate applications in more general settings, including multiple regression, spectral analysis and
testing the goodness of fit of a probability distribution. Applications are illustrated in Chapter 10 by means of several real-data examples. There are a number of people who in various ways have had an influence on this book (many of whom would probably just as soon not take any credit). I'd like to thank Scott Berry, Jim Calvin, Chien-Feng Chen, Ray Chen, Cherng-Luen Lee, Geung-Hee Lee, Fred Lombard, Manny Parzen, Seongbaek Yi and two anonymous reviewers for reading portions of the book and making valuable comments, criticisms and suggestions. I also want to thank Andy Liaw for sharing his expertise in graphics and the finer points of TEX. To the extent that there are any new ideas in this book, I have to share much of the credit with the many colleagues, smoothers and nonsmoothers alike, who have taught me so much over the years. In particular, I want to express my gratitude to Randy Eubank, Buddy Gray, and Bill Schucany, whose ideas, encouragement and friendship have profoundly affected my career. Finally, my biggest thanks go to my wife Michelle and my daughter Kayley. Without your love and understanding, finishing this book would have been impossible. Jeffrey D. Hart
Contents

Preface

1. Introduction

2. Some Basic Ideas of Smoothing
   2.1. Introduction
   2.2. Local Averaging
   2.3. Kernel Smoothing
      2.3.1 Fundamentals
      2.3.2 Variable Bandwidths
      2.3.3 Transformations of x
   2.4. Fourier Series Estimators
   2.5. Dealing with Edge Effects
      2.5.1 Kernel Smoothers
      2.5.2 Fourier Series Estimators
   2.6. Other Smoothing Methods
      2.6.1 The Duality of Approximation and Estimation
      2.6.2 Local Polynomials
      2.6.3 Smoothing Splines
      2.6.4 Rational Functions
      2.6.5 Wavelets

3. Statistical Properties of Smoothers
   3.1. Introduction
   3.2. Mean Squared Error of Gasser-Müller Estimators
      3.2.1 Mean Squared Error at an Interior Point
      3.2.2 Mean Squared Error in the Boundary Region
      3.2.3 Mean Integrated Squared Error
      3.2.4 Higher Order Kernels
      3.2.5 Variable Bandwidth Estimators
      3.2.6 Estimating Derivatives
   3.3. MISE of Trigonometric Series Estimators
      3.3.1 The Simple Truncated Series Estimator
      3.3.2 Smoothness Adaptability of Simple Series Estimators
      3.3.3 The Rogosinski Series Estimator
   3.4. Asymptotic Distribution Theory
   3.5. Large-Sample Confidence Intervals

4. Data-Driven Choice of Smoothing Parameters
   4.1. Introduction
   4.2. Description of Methods
      4.2.1 Cross-Validation
      4.2.2 Risk Estimation
      4.2.3 Plug-in Rules
      4.2.4 The Hall-Johnstone Efficient Method
      4.2.5 One-Sided Cross-Validation
      4.2.6 A Data Analysis
   4.3. Theoretical Properties of Data-Driven Smoothers
      4.3.1 Asymptotics for Cross-Validation, Plug-In and Hall-Johnstone Methods
      4.3.2 One-Sided Cross-Validation
      4.3.3 Fourier Series Estimators
   4.4. A Simulation Study
   4.5. Discussion

5. Classical Lack-of-Fit Tests
   5.1. Introduction
   5.2. Likelihood Ratio Tests
      5.2.1 The General Case
      5.2.2 Gaussian Errors
   5.3. Pure Experimental Error and Lack of Fit
   5.4. Testing the Fit of Linear Models
      5.4.1 The Reduction Method
      5.4.2 Unspecified Alternatives
      5.4.3 Non-Gaussian Errors
   5.5. Nonparametric Lack-of-Fit Tests
      5.5.1 The von Neumann Test
      5.5.2 A Cusum Test
      5.5.3 Von Neumann and Cusum Tests as Weighted Sums of Squared Fourier Coefficients
      5.5.4 Large Sample Power
   5.6. Neyman Smooth Tests

6. Lack-of-Fit Tests Based on Linear Smoothers
   6.1. Introduction
   6.2. Two Basic Approaches
      6.2.1 Smoothing Residuals
      6.2.2 Comparing Parametric and Nonparametric Models
      6.2.3 A Case for Smoothing Residuals
   6.3. Testing the Fit of a Linear Model
      6.3.1 Ratios of Quadratic Forms
      6.3.2 Orthogonal Series
      6.3.3 Asymptotic Distribution Theory
   6.4. The Effect of Smoothing Parameter
      6.4.1 Power
      6.4.2 The Significance Trace
   6.5. Historical and Bibliographical Notes

7. Testing for Association via Automated Order Selection
   7.1. Introduction
   7.2. Distributional Properties of Sample Fourier Coefficients
   7.3. The Order Selection Test
   7.4. Equivalent Forms of the Order Selection Test
      7.4.1 A Continuous-Valued Test Statistic
      7.4.2 A Graphical Test
   7.5. Small-Sample Null Distribution of Tn
      7.5.1 Gaussian Errors with Known Variance
      7.5.2 Gaussian Errors with Unknown Variance
      7.5.3 Non-Gaussian Errors and the Bootstrap
      7.5.4 A Distribution-Free Test
   7.6. Variations on the Order Selection Theme
      7.6.1 Data-Driven Neyman Smooth Tests
      7.6.2 F-Ratio with Random Degrees of Freedom
      7.6.3 Maximum Value of Estimated Risk
      7.6.4 Test Based on Rogosinski Series Estimate
      7.6.5 A Bayes Test
   7.7. Power Properties
      7.7.1 Consistency
      7.7.2 Power of Order Selection, Neyman Smooth and Cusum Tests
      7.7.3 Local Alternatives
      7.7.4 A Best Test?
   7.8. Choosing an Orthogonal Basis
   7.9. Historical and Bibliographical Notes

8. Data-Driven Lack-of-Fit Tests for General Parametric Models
   8.1. Introduction
   8.2. Testing the Fit of Linear Models
      8.2.1 Basis Functions Orthogonal to Linear Model
      8.2.2 Basis Functions Not Orthogonal to Linear Model
      8.2.3 Special Tests for Checking the Fit of a Polynomial
   8.3. Testing the Fit of a Nonlinear Model
      8.3.1 Large-Sample Distribution of Test Statistics
      8.3.2 A Bootstrap Test
   8.4. Power Properties
      8.4.1 Consistency
      8.4.2 Comparison of Power for Two Types of Tests

9. Extending the Scope of Application
   9.1. Introduction
   9.2. Random x's
   9.3. Multiple Regression
   9.4. Testing for Additivity
   9.5. Testing Homoscedasticity
   9.6. Comparing Curves
   9.7. Goodness of Fit
   9.8. Tests for White Noise
   9.9. Time Series Trend Detection

10. Some Examples
   10.1. Introduction
   10.2. Babinet Data
      10.2.1 Testing for Linearity
      10.2.2 Model Selection
      10.2.3 Residual Analysis
   10.3. Comparing Spectra
   10.4. Testing for Association Among Several Pairs of Variables
   10.5. Testing for Additivity

Appendix
   A.1. Error in Approximation of F_OS(t)
   A.2. Bounds for the Distribution of T_cusum

References

Index
1 Introduction
The estimation of functions is a pervasive statistical problem in scientific endeavors. This book provides an introduction to some nonparametric methods of function estimation, and shows how they can be used to test the adequacy of parametric function estimates. The settings in which function estimation has been studied are many, and include probability density estimation, time series spectrum estimation, and estimation of regression functions. The present treatment will deal primarily with regression, but many of the ideas and methods to be discussed have applications in other areas as well. The basic purpose of a regression analysis is to study how a variable Y responds to changes in another variable X. The relationship between X and Y may be expressed as

(1.1)    Y = r(X) + ε,
where r is a mathematical function, called the regression function, and ε is an error term that allows for deviations from a purely deterministic relationship. A researcher is often able to collect data (X1, Y1), ..., (Xn, Yn) that contain information about the function r. From these data, one may compute various guesses, or estimates, of r. If little is known about the nature of r, a nonparametric estimation approach is desirable. Nonparametric methods impose a minimum of structure on the regression function. This is paraphrased in the now banal statement that "nonparametric methods let the data speak for themselves." In order for nonparametric methods to yield reasonable estimates of r, it is only necessary that r possess some degree of smoothness. Typically, continuity of r is enough to ensure that an appropriate estimator converges to the truth as the amount of data increases without bound. Additional smoothness, such as the existence of derivatives, allows more efficient estimation. In contrast to nonparametric methods are the parametric ones that have dominated much of classical statistics. Suppose the variable X is known to lie in the interval [0, 1]. A simple example of a parametric model for r in
(1.1) is the straight line

r(x) = θ0 + θ1 x,    0 ≤ x ≤ 1,
where θ0 and θ1 are unknown constants. More generally, one might assume that r has the linear structure

r(x) = Σ_{i=0}^p θi ri(x),    0 ≤ x ≤ 1,
where r0, ..., rp are known functions and θ0, ..., θp are unknown constants. Parametric models are attractive for a number of reasons. First of all, the parameters of a model often have important interpretations to a subject matter specialist. Indeed, in the regression context the parameters may be of more interest than the function values themselves. Another attractive aspect of parametric models is their statistical simplicity; estimation of the entire regression function boils down to inferring a few parameter values. Also, if our assumption of a parametric model is justified, the regression function can be estimated more efficiently than it can be by a nonparametric method. If the assumed parametric model is incorrect, the result can be misleading inferences about the regression function. Thus, it is important to have methods for checking how well a parametric model fits the observed data. The ultimate aim of this book is to show that various nonparametric, or smoothing, methods provide a very useful means of diagnosing lack of fit of parametric models. It is by now widely acknowledged that smoothing is an extremely useful means of estimating functions; we intend to show that smoothing is also valuable in testing problems. The next chapter is intended to be an expository introduction to some of the basic methods of nonparametric regression. The methods given our greatest attention are the so-called kernel method and Fourier series. Kernel methods are perhaps the most fundamental means of smoothing data and thus provide a natural starting point for the study of nonparametric function estimation. Our reason for focusing on Fourier series is that they are a central part of some simple and effective testing methodology that is treated later in the book. Other useful methods of nonparametric regression, including splines and local polynomials, are discussed briefly in Chapter 2 but receive less attention in the remainder of the book than do Fourier series.
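To make the parametric side concrete, fitting such a linear structure by ordinary least squares can be sketched as follows. This is an illustrative sketch only: the basis functions, the true coefficient values, the noise level and the sample size below are invented for the example and are not taken from the text.

```python
import numpy as np

def fit_linear_model(x, y, basis):
    # Build the design matrix with one column per known basis function
    # r_0, ..., r_p, then solve for the theta_i by ordinary least squares.
    X = np.column_stack([r(x) for r in basis])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

# Hypothetical example: straight-line model r(x) = theta_0 + theta_1 * x on [0, 1].
rng = np.random.default_rng(0)
x = np.linspace(0.01, 0.99, 50)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=x.size)
theta = fit_linear_model(x, y, basis=[np.ones_like, lambda t: t])
```

With the assumed model correct, the two estimated coefficients should land close to the generating values 1 and 2, illustrating how "estimation of the entire regression function boils down to inferring a few parameter values."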
Chapter 3 studies some of the statistical properties of kernel and Fourier series estimators. This chapter is much more theoretical than Chapter 2 and is not altogether necessary for appreciating subsequent chapters that deal with testing problems. Chapter 4 deals with the important practical problem of choosing an estimator's smoothing parameter. An introduction to several methods of data-driven smoothing and an account of their theoretical properties are given. The lack-of-fit tests focused upon in Chapters 7-10 are based on data-driven choice of smoothing parameters. Hence, although
not crucial for an understanding of later material, Chapter 4 provides the reader with more understanding of how subsequent testing methodology is connected with smoothing ideas. Chapter 5 introduces the lack-of-fit problem by reviewing some classical testing procedures. The procedures considered include likelihood ratio tests, the reduction method and von Neumann's test. Chapter 6 considers more recently proposed lack-of-fit tests based on nonparametric, linear smoothers. Such tests use fixed smoothing parameters and are thus inherently different from tests based on data-driven smoothing parameters. Chapter 7 introduces the latter tests in the simple case of testing the "no-effect" hypothesis, i.e., the hypothesis that the function r is identical to a constant. This chapter deals almost exclusively with trigonometric series methods. Chapters 8 and 9 show that the type of tests introduced in Chapter 7 can be applied in a much wider range of settings than the simple no-effect problem, whereas Chapter 10 provides illustrations of these tests on some actual sets of data.
2 Some Basic Ideas of Smoothing
2.1 Introduction

In its broadest sense, smoothing is the very essence of statistics. To smooth is to sand away the rough edges from a set of data. More precisely, the aim of smoothing is to remove data variability that has no assignable cause and to thereby make systematic features of the data more apparent. In recent years the term smoothing has taken on a somewhat more specialized meaning in the statistical literature. Smoothing has become synonymous with a variety of nonparametric methods used in the estimation of functions, and it is in this sense that we shall use the term. Of course, a primary aim of smoothing in this latter sense is still to reveal interesting data features. Some major accounts of smoothing methods in various contexts may be found in Priestley (1981), Devroye and Györfi (1985), Silverman (1986), Eubank (1988), Härdle (1990), Wahba (1990), Scott (1992), Tarter and Lock (1993), Green and Silverman (1994), Wand and Jones (1995) and Fan and Gijbels (1996). Throughout this chapter we shall make use of a canonical regression model. The scenario of interest is as follows: a data analyst wishes to study how a variable Y responds to changes in a design variable x. Data Y1, ..., Yn are observed at the fixed design points x1, ..., xn, respectively. (For convenience we suppose that 0 < x1 < x2 < ... < xn < 1.) The data are assumed to follow the model

(2.1)    Yj = r(xj) + εj,    j = 1, ..., n,
where r is a function defined on [0, 1] and ε1, ..., εn are unobserved random variables representing error terms. Initially we assume that the error terms are uncorrelated and that E(εi) = 0 and Var(εi) = σ², i = 1, ..., n. The data analyst's ultimate goal is to infer the regression function r at each x in [0, 1]. The purpose of this chapter is twofold. First, we wish to introduce a variety of nonparametric smoothing methods for estimating regression functions, and secondly, we want to point out some of the basic issues that arise
when applying such methods. We begin by considering the fundamental notion of local averaging.
2.2 Local Averaging

Perhaps the simplest and most obvious nonparametric method of estimating the regression function is to use the idea of local averaging. Suppose we wish to estimate the function value r(x) for some x ∈ [0, 1]. If r is indeed continuous, then function values at xi's near x should be fairly close to r(x). This suggests that averaging Yi's corresponding to xi's near x will yield an approximately unbiased estimator of r(x). Averaging has the beneficial effect of reducing the variability arising from the error terms. Local averaging is illustrated in Figures 2.1 and 2.2. The fifty data points in Figure 2.1 were simulated from the model

Yj = r(xj) + εj,    j = 1, ..., 50,

where

r(x) = (1 - (2x - 1)²),    0 ≤ x ≤ 1,

xj = (j - .5)/50, j = 1, ..., 50, and the εj's are independent and identically distributed as N(0, (.125)²). (N(μ, σ²) denotes the normal distribution with mean μ and variance σ².)
FIGURE 2.1. Windows Centered at .20 and .60.
For each x, consider the interval [x - h, x + h], where h is a small positive number. Imagine forming a "window" by means of two lines that are parallel to the y-axis and hit the x-axis at x - h and x + h (see Figure 2.1). The window is that part of the (x, y) plane that lies between these two lines. Now consider the pairs (xj, Yj) that lie within this window, and average all the Yj's from these pairs. This average is the estimate of the function value r(x). The window can be moved to the left or right to compute an estimate at any point. The resulting estimate of r is sometimes called a window estimate or a moving average. In the middle panel of Figure 2.2, we see the window estimate of r corresponding to the window of width .188 shown in Figure 2.1. The top and bottom panels show estimates resulting from smaller and larger window widths, respectively. Smoothing is well illustrated in these pictures. The top estimate tracks the data well, but is much too rough. The estimate becomes more and more smooth as its window is opened wider. Of course, there is a price to be paid for widening the window too much, since then the estimate does not fit the data well. Parenthetically we note that Figure 2.2 provides, with a single data set, a nice way of conveying the notion that variability of an average decreases as the number of data points increases. The decrease in variability is depicted by the increasing smoothness in the estimated curves. This is an example of how smoothing ideas can be pedagogically useful in a general study of statistics.
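The moving-window idea is simple enough to sketch in a few lines of code. In this minimal illustration the regression curve, noise seed and grid are hypothetical stand-ins chosen for the example; they are not the curve or data behind Figures 2.1 and 2.2.

```python
import numpy as np

def window_estimate(x0, x, y, h):
    # Average the Y's whose design points fall in the window [x0 - h, x0 + h];
    # return NaN if the window happens to contain no data.
    in_window = np.abs(x - x0) <= h
    return y[in_window].mean() if in_window.any() else float("nan")

# Simulated data in the spirit of the chapter's example (hypothetical curve).
rng = np.random.default_rng(1)
x = (np.arange(1, 51) - 0.5) / 50          # the fixed design x_j = (j - .5)/50
y = np.sin(np.pi * x) + rng.normal(0.0, 0.125, size=50)
grid = np.linspace(0.0, 1.0, 101)
fit = np.array([window_estimate(g, x, y, h=0.094) for g in grid])  # width 2h = .188
```

Sliding the window across the grid traces out the moving-average curve; shrinking h makes the fit rougher and widening it makes the fit smoother, exactly the trade-off shown in Figure 2.2.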
2.3 Kernel Smoothing

2.3.1 Fundamentals

A somewhat more sophisticated version of local averaging involves the use of so-called kernel estimators, also referred to as kernel smoothers. Here, the simple average discussed in Section 2.2 is replaced by a weighted sum. Typically, larger weights are given to Yi's whose xi's are closer to the point of estimation x. There are at least three versions of the kernel estimator to which we shall refer from time to time. We first consider the Nadaraya-Watson type of estimate (Nadaraya, 1964 and Watson, 1964). Define

(2.2)    r̂_h^NW(x) = [Σ_{i=1}^n Yi K((x - xi)/h)] / [Σ_{i=1}^n K((x - xi)/h)],    0 ≤ x ≤ 1,

where K is a function called the kernel. The quantity h is called the bandwidth or smoothing parameter and controls the smoothness of r̂_h^NW in the same way that window width controls the smoothness of a moving average. In fact, the window estimate of Section 2.2 is a special case of (2.2) with
FIGURE 2.2. Window Estimates. The dashed line is the true curve. The window widths of the estimates are, from top to bottom, .042, .188 and .60.
K_R(u) = (1/2) I_(-1,1)(u),

where I_A denotes the indicator function for the set A, i.e., I_A(x) = 1 for x ∈ A and I_A(x) = 0 for x ∉ A.
The kernel K_R is called the rectangular kernel. A popular choice for K in (2.2) is the Gaussian kernel, i.e.,

K_G(x) = (1/√(2π)) exp(-x²/2).
A qualitative advantage of using the Gaussian kernel as opposed to the more naive rectangular one is illustrated in Figure 2.3. Here, a Nadaraya-Watson estimate has been applied to the same data as in our first example. The bandwidths used are comparable to their corresponding window widths in Figure 2.2. We see in each case that the Gaussian kernel estimate is smoother than the corresponding window estimate, which is obviously due to the fact that K_G is smooth, whereas K_R is discontinuous at -1 and 1. An estimate that is guaranteed to be smooth is an advantage when one envisions the underlying function r as being smooth. At least two other types of kernel smoothers are worth considering. One, introduced by Priestley and Chao (1972), is defined by

r̂_h^PC(x) = (1/h) Σ_{i=1}^n (xi - xi-1) Yi K((x - xi)/h).

A similar type smoother, usually known as the Gasser-Müller (1979) estimator, is
r̂_h^GM(x) = (1/h) Σ_{i=1}^n Yi ∫_{s_{i-1}}^{s_i} K((x - u)/h) du,

where s0 = 0, si = (xi + xi+1)/2, i = 1, ..., n - 1, and sn = 1. The Gasser-Müller estimator may also be written as

(2.3)    r̂_h^GM(x) = (1/h) ∫_0^1 Yn(u) K((x - u)/h) du,
where Yn(·) is the piecewise constant function

Yn(u) = Σ_{i=1}^n Yi I_[s_{i-1}, s_i)(u).
In other words, the Gasser-Müller estimate is the convolution of Yn(·) with K(·/h)/h. This representation suggests that one could convolve K(·/h)/h with other "rough" functions besides Yn(·). Clark (1977) proposed a version
FIGURE 2.3. Nadaraya-Watson Type Kernel Estimates Using a Gaussian Kernel. The dashed line is the true curve. The bandwidths of the estimates are, from top to bottom, .0135, .051 and .20.
of (2.3) in which Yn(·) is replaced by a continuous, piecewise linear function that equals Yi at xi, i = 1, ..., n. An appealing consequence of (2.3) is that the Gasser-Müller estimate tends to the function Yn(·) as the bandwidth tends to 0. By contrast, the Nadaraya-Watson estimate is not even well defined for sufficiently small h when K has finite support. As a result the latter estimate tends to be much more unstable for small h than does the Gasser-Müller estimate.
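The Gasser-Müller estimate itself can be sketched by computing, for each Yi, the kernel mass assigned to the interval [s_{i-1}, s_i]. A Gaussian kernel is used here purely for convenience, because its integral over an interval has a closed form in terms of the normal CDF; in practice finite-support kernels are the more common companion to this estimator. The code below is our sketch, not the book's implementation.

```python
import numpy as np
from math import erf, sqrt

def normal_cdf(z):
    # Standard normal CDF, written in terms of the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def gasser_muller(x0, x, y, h):
    # Estimate (2.3) with a Gaussian kernel: Y_i receives the weight
    # (1/h) * integral of K((x0 - u)/h) over [s_{i-1}, s_i], which equals
    # Phi((x0 - s_{i-1})/h) - Phi((x0 - s_i)/h) for the Gaussian kernel.
    s = np.concatenate(([0.0], (x[:-1] + x[1:]) / 2.0, [1.0]))  # s_0 = 0, s_n = 1
    w = np.array([normal_cdf((x0 - s[i]) / h) - normal_cdf((x0 - s[i + 1]) / h)
                  for i in range(len(x))])
    return float(np.sum(w * y))
```

For an interior x0 with h small relative to min(x0, 1 - x0), the interval weights sum to nearly 1, so constant data are reproduced almost exactly; near the boundary the weight sum falls below 1, a first glimpse of the edge effects taken up in Section 2.5.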
where si-l :S x; :S si, i = 1, ... , n. When the design points are at least approximately evenly spaced, there is very little difference between the evaluation and convolution type estimators. However, as Chu and Marron (1991) have said, "when the design points are not equally spaced, or when they are iid random variables, there are very substantial and important differences in these estimators." Having made this point, there are, nonetheless, certain basic principles that both estimator types obey. In the balance of Section 2.3 we will discuss these principles without making any distinction between evaluation and convolution type estimators. We reiterate, though, that there are nontrivial differences between the estimators, an appreciation of which can be gained from the articles of Chu and Marron (1991) and Jones, Davies and Park (1994). A great deal has been written about making appropriate choices for the kernel K and the bandwidth h. At this point we will discuss only the more fundamental aspects of these choices, postponing the details until Chapters 3 and 4. Note that each of the kernel estimators we have discussed may be written in the form n
r(x)
=
L Wi,n(x)Yi i=l
for a particular weight function wi,n(x). To ensure consistency of f(x) it is necessary that I:~=l Wi,n(x) = 1. For the Nadaraya-Watson estimator this condition is guaranteed for each x by the way in which f{;W (x) is constructed. Let us investigate the sum of weights for the Gasser-Muller
estimator. (The Priestley-Chao case is essentially the same.) We have

Σ_{i=1}^n (1/h) ∫_{s_{i-1}}^{s_i} K((x - u)/h) du = (1/h) ∫_0^1 K((x - u)/h) du = ∫_{(x-1)/h}^{x/h} K(v) dv.
This suggests that we take K to be a function that integrates to 1. By doing so, the sum of kernel weights will be approximately 1 as long as h is small relative to min(x, 1 - x). If ∫ K(u) du = 1 and K vanishes outside (-1, 1), then the sum of kernel weights is exactly 1 whenever h ≤ min(x, 1 - x). This explains why finite support kernels are generally used with Gasser-Müller and Priestley-Chao estimators. When x is in [0, h) or (1 - h, 1], note that the weights do not sum to 1 even if K has support (-1, 1). This is our first indication of so-called edge effects, which will be discussed in Section
2.5. In the absence of prior information about the regression function, it seems intuitive that K should be symmetric and have a unique maximum at 0. A popular way of ensuring these two conditions and also ∫ K(u) du = 1 is to take K to be a probability density function (pdf) that is unimodal and symmetric about 0. Doing so also guarantees a positive regression estimate for positive data, an attractive property when it is known that r ≥ 0. On the other hand, there are some very useful kernel functions that take on negative values, as we will see in Section 2.4. To ensure that a kernel estimator has attractive mean squared error properties, it turns out to be important to choose K so that

(2.4)    ∫ uK(u) du = 0,    ∫ K²(u) du < ∞    and    ∫ u²K(u) du < ∞.
Note that conditions (2.4) are satisfied by K_R and K_G, and, in fact, by any bounded, finite variance pdf that is symmetric about 0. The necessity of (2.4) will become clear when we discuss mean squared error in Chapter 3. It is widely accepted that kernel choice is not nearly so critical as choice of bandwidth. A common practice is to pick a reasonable kernel, such as the Gaussian, and use that same kernel on each data set encountered. The choice of bandwidth is another matter. We saw in Figure 2.3 how much an estimate can change when its bandwidth is varied. We now address how bandwidth affects statistical properties of a kernel estimator. Generally speaking, the bias of a kernel estimator becomes smaller in magnitude as the bandwidth is made smaller. Unfortunately, decreasing the bandwidth also has the effect of increasing the estimator's variance. A principal goal in kernel estimation is to find a bandwidth that affords a satisfactory compromise between the competing forces of bias and variance. For a given sample size n, the interaction of three main factors dictates the value of this "optimal" bandwidth. These are
2. Some Basic Ideas of Smoothing
• the smoothness of the regression function,
• the distribution of the design points, and
• the amount of variability among the errors ε₁, ..., εₙ.

The effect of smoothness can be illustrated by a kernel estimator's tendency to underestimate a function at peaks and to overestimate it at valleys. This tendency is evident in Figure 2.4, in which data were simulated from the model
Y_j = 9.9 + .3 sin(2πx_j) + ε_j,   j = 1, ..., 50,
where x_j = (j − .5)/50, j = 1, ..., 50, and the ε_j's are independent and identically distributed as N(0, (.06)²). The estimate in this case is of Gasser-Müller type with kernel K(u) = .75(1 − u²)I_(−1,1)(u). The bandwidth of .15 yields an estimate with one peak and one valley that are located at nearly the same points as they are for r. However, the kernel estimate is too low at the peak and too high at the valley. The problem is that at x ≈ .25 (for example) the estimate tends to be pulled down by data values (x_j, Y_j) for which r(x_j) is smaller than the peak. All other factors being equal, the tendency to under- or overestimate will be stronger the sharper the peak or valley. Said another way, the bias of a kernel estimator is smallest where the function is most nearly linear. The bias at a peak or valley can be lessened by choosing the bandwidth smaller. However, doing so also has its price. In Figure 2.5 we see the same
FIGURE 2.4. The Tendency of a Kernel Smoother to Undershoot Peaks and Overshoot Valleys. The solid line is the true curve and the dotted line a Gasser-Muller kernel smooth.
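The Gasser-Müller construction behind Figures 2.4 and 2.5 can be sketched in a few lines. The data here are freshly simulated, so the numbers will not match the figures exactly; the closed-form antiderivative of the Epanechnikov kernel makes the weight integrals exact.

```python
import math, random

def epan_cdf(t):
    """Antiderivative of the Epanechnikov kernel, clipped to its support."""
    t = max(-1.0, min(1.0, t))
    return 0.75 * t - 0.25 * t ** 3 + 0.5

def gasser_muller(x, xs, ys, h):
    """r-hat(x) = sum_i Y_i * integral over (s_{i-1}, s_i] of K_h(x - u) du."""
    n = len(xs)
    s = [0.0] + [(xs[i] + xs[i + 1]) / 2 for i in range(n - 1)] + [1.0]
    return sum(ys[i] * (epan_cdf((x - s[i]) / h) - epan_cdf((x - s[i + 1]) / h))
               for i in range(n))

random.seed(1)
xs = [(j - 0.5) / 50 for j in range(1, 51)]
ys = [9.9 + 0.3 * math.sin(2 * math.pi * x) + random.gauss(0, 0.06) for x in xs]

est_peak = gasser_muller(0.25, xs, ys, h=0.15)   # near the peak, r(.25) = 10.2
est_mid = gasser_muller(0.5, xs, ys, h=0.15)     # r(.5) = 9.9
print(round(est_peak, 3), round(est_mid, 3))
```

The estimate at the peak will typically fall slightly below r(.25) = 10.2, which is exactly the undershooting behavior Figure 2.4 illustrates.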
2.3. Kernel Smoothing
FIGURE 2.5. Kernel Smooth with a Small Bandwidth Based on the Same Data as in Figure 2.4.
data and the same type of kernel estimate as in Figure 2.4, except now the bandwidth is much smaller. Although there is no longer an obvious bias at the peak and valley, the overall estimate has become very wiggly, a feature not shared by the true curve. Figure 2.5 illustrates the fact that the variance of a kernel estimator tends to increase when its bandwidth is decreased. This is not surprising, since a smaller value of h means that effectively fewer Y_j's are being averaged.

To gain some insight as to how design affects a good choice for h, consider estimating r at two different peaks of comparable sharpness. A good choice for h will be smaller at the x near which the design points are more highly concentrated. At such a point, we may decrease the size of h (and hence bias) while still retaining a relatively stable estimator. Since bias tends to be largest at points with a lot of curvature, this suggests that a good design will have the highest concentration of points at x's where r(x) is sharply peaked. This is borne out by the optimal design theory of Müller (1984).

The variance, σ², of each error term affects a good choice for h in a fairly obvious way. All other things being equal, an increase in σ² calls for an increase in h. If σ² increases, the tendency is for estimator variance to increase, and the only way to counteract this is to average more points, i.e., take h larger.

The trade-off between stability of an estimator and how well the estimator tracks the data is a basic principle of smoothing. In choosing a smoothing parameter that provides a good compromise between these two
properties, it is helpful to have an objective criterion by which to judge an estimator. One such criterion is integrated squared error, or ISE. For any two functions f and g on [0, 1], define the ISE by
I(f, g) = ∫₀¹ (f(x) − g(x))² dx.
For a given set of data, it seems sensible to consider as optimal a value of h that minimizes I(r̂_h, r). For the set of data in Figure 2.3, Figure 2.6 shows an ISE plot for a Nadaraya-Watson estimate with a Gaussian kernel. The middle estimate in Figure 2.3 uses the bandwidth of .051 that minimizes the ISE. For the two data sets considered thus far, the function r was known, which allows one to compute the ISE curve. Of course, the whole point of using kernel smoothers is to have a means of estimating r when it is unknown. In practice, then, we will be unable to compute ISE, or any other functional of r. We shall see that one of smoothing's greatest challenges is choosing a smoothing parameter that affords a good trade-off between variance and bias when the only knowledge about the unknown function comes from the observed data. This challenge is a major theme of this book and will be considered in detail for the first time in Chapter 4.
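When r is known, choosing h by minimizing ISE can be sketched as follows, in the spirit of Figure 2.6. This uses a Nadaraya-Watson estimate with a Gaussian kernel; the noise level, grids, and seed are illustrative assumptions, so the minimizing bandwidth will not be exactly .051.

```python
import math, random

def r(x):
    """Regression curve used with Figures 2.2-2.3 (see Section 2.5)."""
    return (1.0 - (2.0 * x - 1.0) ** 2) ** 2

def nw(x, xs, ys, h):
    """Nadaraya-Watson estimate with a Gaussian kernel."""
    w = [math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

random.seed(2)
xs = [(j - 0.5) / 50 for j in range(1, 51)]
ys = [r(x) + random.gauss(0, 0.1) for x in xs]   # illustrative noise level

grid = [i / 200 for i in range(201)]
def ise(h):
    """Riemann-sum approximation of I(r-hat_h, r)."""
    return sum((nw(g, xs, ys, h) - r(g)) ** 2 for g in grid) / len(grid)

hs = [k / 100 for k in range(2, 31)]
best_h = min(hs, key=ise)
print(best_h)
```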
2.3.2 Variable Bandwidths

It was tacitly assumed in Section 2.3.1 that our kernel smoother used the same bandwidth at every value of x. For the data in Figures 2.3 and 2.4, doing so produces reasonable estimates. In Figure 2.7 we see the function r(x) = 1 + (24x)³ exp(−24x), 0 ≤ x ≤ 1, which has a sharp peak at x = .125 but is nearly flat for .5 ≤ x ≤ 1. The result of smoothing noisy data from this curve using a constant bandwidth smoother is also
FIGURE 2.6. Plot of log(ISE) for the Data in Figure 2.3.
FIGURE 2.7. The Effect of Regression Curvature on a Constant Bandwidth Estimator. The top graph is the true curve along with noisy data generated from that curve. The middle and bottom graphs show Gasser-Müller estimates with the Epanechnikov kernel and respective bandwidths of h = .05 and h = .20. The lesson here is that the same bandwidth is not always adequate over the whole range of x's.
shown in Figure 2.7. The estimates in the middle and lower graphs are of Gasser-Müller type with K(u) = .75(1 − u²)I_(−1,1)(u), which is the so-called Epanechnikov kernel. The bandwidth used in the middle graph is appropriate for estimating the function at its peak, whereas the bandwidth in the lower graph is more appropriate for estimating the curve where it is flat. Neither estimate is satisfactory, the former being too wiggly for x > .3 and the latter having a large bias at the peak.

This example illustrates that a constant bandwidth estimate is not always desirable. Values of x where the function has a lot of curvature call for relatively small bandwidths, whereas x's in nearly flat regions require larger bandwidths. The latter point is best illustrated by imagining the case where all the data have a common mean. Here, it is best to estimate the underlying "curve" at all points by Ȳ, the sample mean. It is not difficult to argue that a Nadaraya-Watson estimate tends to Ȳ as h tends to infinity. An obvious way of dealing with the problem exhibited in Figure 2.7 is to use an estimator whose bandwidth varies with x. We have done so in Figure 2.8 by using an h(x) of the form shown in the top graph of that figure. The smoothness of the estimate is preserved by defining h(x) so that it changes smoothly from h = .05 up to h = .5.
2.3.3 Transformations of x

Constant bandwidth estimators are appealing because of their simplicity, which makes for computational convenience. In some cases where it seems that a variable bandwidth estimate is called for, a transformation of the x-variable can make a constant bandwidth estimate suitable. Let t be a strictly monotone transformation, and define s_i = (t(x_i) + t(x_{i+1}))/2, i = 1, ..., n − 1, s₀ = t(0), s_n = t(1) and

μ̂_h(z) = (1/h) Σ_{i=1}^n Y_i ∫_{s_{i−1}}^{s_i} K((z − u)/h) du,   t(0) ≤ z ≤ t(1).
Inasmuch as r̂_h^GM(x) estimates r(x), μ̂_h(z) estimates r(t⁻¹(z)). Therefore, an alternative estimator of r(x) is μ̂_h(t(x)). The key idea behind this approach is that the function r(t⁻¹(·)) may be more amenable to estimation with a constant bandwidth estimate than is r itself.

This idea is illustrated by reconsidering the data in Figure 2.7. The top graph in Figure 2.9 shows a scatter plot of (x_i^{1/4}, Y_i), i = 1, ..., n, and also a plot of r(x) versus x^{1/4}. Considered on this scale, a constant bandwidth estimate does not seem too unreasonable, since the peak is now relatively wide in comparison to the flat spot in the right-hand tail. The
FIGURE 2.8. A Variable Bandwidth Kernel Estimate and Bandwidth Function. The bottom graph is a variable bandwidth Gasser-Müller estimate computed using the same data as in Figure 2.7. The top graph is h(x), the bandwidth function used to compute the estimate.
estimate μ̂_h(t(x)) (t(x) = x^{1/4}), shown in the bottom graph of Figure 2.9, is very similar to the variable bandwidth estimate of Figure 2.8.

The use of an x-transformation is further illustrated in Figures 2.10 and 2.11 (pp. 19-20). The data in the top graph of Figure 2.10 are jawbone lengths for thirty rabbits of varying ages. Note that jawbone length increases rapidly up to a certain age and then asymptotes when the rabbits reach maturity. Accordingly, the experimenter has used a judicious design in that more young than old rabbits have been measured.
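The transformed-scale estimator μ̂_h(t(x)) can be sketched as follows, with t(x) = x^{1/4} and data simulated from the curve of Figure 2.7. The sample size and noise level here are illustrative assumptions.

```python
import math, random

def epan_cdf(t):
    """Antiderivative of the Epanechnikov kernel, clipped to its support."""
    t = max(-1.0, min(1.0, t))
    return 0.75 * t - 0.25 * t ** 3 + 0.5

def gm_on_transformed_scale(x, xs, ys, h, t):
    """mu-hat_h(t(x)): constant-bandwidth Gasser-Muller smoothing on the t scale."""
    n = len(xs)
    tx = [t(v) for v in xs]
    s = [t(0.0)] + [(tx[i] + tx[i + 1]) / 2 for i in range(n - 1)] + [t(1.0)]
    z = t(x)
    return sum(ys[i] * (epan_cdf((z - s[i]) / h) - epan_cdf((z - s[i + 1]) / h))
               for i in range(n))

random.seed(3)
r = lambda x: 1.0 + (24 * x) ** 3 * math.exp(-24 * x)   # curve of Figure 2.7
xs = [(j - 0.5) / 100 for j in range(1, 101)]
ys = [r(x) + random.gauss(0, 0.2) for x in xs]          # illustrative noise level

t = lambda x: x ** 0.25
at_peak = gm_on_transformed_scale(0.125, xs, ys, 0.07, t)
at_flat = gm_on_transformed_scale(0.8, xs, ys, 0.07, t)
print(round(at_peak, 3), round(at_flat, 3))
```

A single bandwidth of .07 on the x^{1/4} scale tracks both the sharp peak (r(.125) ≈ 2.34) and the flat right-hand tail (r(.8) ≈ 1) reasonably well.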
FIGURE 2.9. Gasser-Müller Estimate Based on a Power Transformation of the Independent Variable. The top graph shows the same data and function as in Figure 2.8, but plotted against t(x) = x^{1/4}. The dashed line in the bottom graph is the variable bandwidth estimate from Figure 2.8, while the solid line is the estimate μ̂_h(x^{1/4}) with h = .07.
A Gasser-Müller estimate with Epanechnikov kernel and a bandwidth of .19 is shown in Figure 2.10 along with a residual plot. This estimate does not fit the data well around days 20 to 40. Now suppose that we apply a square root transformation to the x_i's. The resulting estimate μ̂_h(√x) and its residual plot are shown in Figure 2.11. Transforming the x's has obviously led to a better fitting estimate in this example.
FIGURE 2.10. Ordinary Gasser-Müller Kernel Estimate Applied to Rabbit Jawbone Data. The top graph depicts the jawbone lengths of thirty rabbits of varying ages along with a Gasser-Müller estimate with bandwidth equal to 114. The bottom graph shows the corresponding residual plot.
2.4 Fourier Series Estimators

Another class of nonparametric regression estimators makes use of ideas from orthogonal series. Many different sets of orthogonal basis functions could be, and have been, used in the estimation of functions. These include orthogonal polynomials and wavelets. Here we shall introduce the notion of a series estimator by focusing on trigonometric, or Fourier, series. We reiterate that much of the testing methodology in Chapters 7-10 makes use
FIGURE 2.11. Gasser-Müller Estimate Computed After Using a Square Root Transformation of the Independent Variable. The top graph shows the rabbit jawbone data with a Gasser-Müller estimate based on a square root transformation of the x's. The bandwidth on the transformed scale is 5.4. The bottom graph is the corresponding residual plot.
of Fourier series; hence, the present section is perhaps the most important one in this chapter.

Consider the system C = {1, cos(πx), cos(2πx), ...} of cosine functions. The elements of C are orthogonal to each other over the interval (0, 1) in the sense that
(2.5)   ∫₀¹ cos(πjx) cos(πkx) dx = 0,   j ≠ k,  j, k = 0, 1, ....
2.4. Fourier Series Estimators
21
For any function r that is absolutely integrable on (0,1), define its Fourier coefficients by
φ_j = ∫₀¹ r(x) cos(πjx) dx,   j = 0, 1, ....
The system C is complete for the class C[0, 1] of functions that are continuous on [0, 1]. In other words, for any function r in C[0, 1], the series

(2.6)   r(x; m) = φ₀ + 2 Σ_{j=1}^m φ_j cos(πjx),   0 ≤ x ≤ 1,
converges to r in mean square as m → ∞ (see, e.g., Tolstov, 1962). Convergence in mean square means that the integrated squared error I(r(·; m), r) converges to 0 as m → ∞. The system C is said to form an orthogonal basis for C[0, 1] since it satisfies the orthogonality properties (2.5) and is complete. The practical significance of C being an orthogonal basis is its implication that any continuous function may be well approximated on [0, 1] by a finite linear combination of elements of C. In addition to being continuous, suppose that r has a continuous derivative on [0, 1]. Then the series r(·; m) converges uniformly on [0, 1] to r as m → ∞ (Tolstov 1962, p. 81). Often the φ_j's converge quickly to 0, implying that there is a small value of m such that
r(x; m) ≈ r(x),   ∀ x ∈ [0, 1].
This is especially relevant in statistical applications, where it is important that the number of estimated parameters be relatively small compared to the number of observations.

Let us now return to our statistical problem in which we have data from model (2.1) and wish to estimate r. Considering series (2.6) and the discussion above, we could use Y₁, ..., Y_n to estimate φ₀, ..., φ_m and thereby obtain an estimate of r. Since the series r(·; m) is linear in cosine functions, one obvious way of estimating the φ_j's is to use least squares. Another possibility is to use a "quadrature" type estimate that parallels the definition of φ_j as an integral. Define φ̂_j by
φ̂_j = Σ_{i=1}^n Y_i ∫_{s_{i−1}}^{s_i} cos(πju) du,   j = 0, 1, ...,

where s₀ = 0, s_i = (x_i + x_{i+1})/2, i = 1, ..., n − 1, and s_n = 1. Now define an estimator r̂(x; m) of r(x) by
r̂(x; m) = φ̂₀ + 2 Σ_{j=1}^m φ̂_j cos(πjx),   0 ≤ x ≤ 1.
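A minimal sketch of this estimator: the quadrature coefficients φ̂_j have a closed form since ∫ cos(πju) du = sin(πju)/(πj), and the truncated cosine series is then summed directly. The simulated curve and noise level below are illustrative.

```python
import math, random

def phi_hat(j, xs, ys):
    """phi-hat_j = sum_i Y_i * integral over (s_{i-1}, s_i] of cos(pi j u) du."""
    n = len(xs)
    s = [0.0] + [(xs[i] + xs[i + 1]) / 2 for i in range(n - 1)] + [1.0]
    if j == 0:
        return sum(ys[i] * (s[i + 1] - s[i]) for i in range(n))
    return sum(
        ys[i] * (math.sin(math.pi * j * s[i + 1]) - math.sin(math.pi * j * s[i]))
        / (math.pi * j)
        for i in range(n)
    )

def series_estimate(x, xs, ys, m):
    """Truncated cosine-series estimate r-hat(x; m)."""
    return phi_hat(0, xs, ys) + 2 * sum(
        phi_hat(j, xs, ys) * math.cos(math.pi * j * x) for j in range(1, m + 1)
    )

random.seed(4)
r = lambda x: (1.0 - (2.0 * x - 1.0) ** 2) ** 2
xs = [(j - 0.5) / 50 for j in range(1, 51)]
ys = [r(x) + random.gauss(0, 0.1) for x in xs]
print(round(series_estimate(0.5, xs, ys, m=2), 3))   # r(.5) = 1
```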
To those used to parametric statistical methods, the estimator r̂(x; m) probably seems more familiar than those in Sections 2.2 and 2.3. In one
sense the Fourier series approach to estimating r is simply an application of linear models. We could just as well have used orthogonal polynomials as our basis functions (rather than C), and then the series estimation scheme would simply be polynomial regression. What sets a nonparametric approach apart from traditional linear models is an emphasis on the idea that the m in r̂(x; m) is a crucial parameter that must be inferred from the data. The quantity m plays the role of smoothing parameter in the smoother r̂(·; m). We shall sometimes refer to r̂(·; m) as a truncated series estimator and to m as a truncation point.

In Figure 2.12 we see three Fourier series estimates computed from the data previously encountered in Figures 2.2 and 2.3. Note how the smoothness of these estimates decreases as the truncation point increases. The middle estimate (with m = 2) minimizes I(r̂(·; m), r) with respect to m.

At first glance it seems that the estimator r̂(·; m) is completely different from the kernel smoothers of Section 2.3. On closer inspection, though, the two types of estimators are not so dissimilar. It is easily verified that
r̂(x; m) = Σ_{i=1}^n Y_i ∫_{s_{i−1}}^{s_i} K_m(x, u) du,   0 ≤ x ≤ 1,

where

K_m(u, v) = 1 + 2 Σ_{j=1}^m cos(πju) cos(πjv) = D_m(u − v) + D_m(u + v)

and D_m is the Dirichlet kernel, i.e.,

D_m(t) = sin[(2m + 1)πt/2] / [2 sin(πt/2)].
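The kernel identity above can be checked numerically. (At t = 0, ±2, ... the Dirichlet formula has a removable singularity with limit m + 1/2.)

```python
import math

def dirichlet(t, m):
    """Dirichlet kernel D_m(t) = sin((2m+1)*pi*t/2) / (2*sin(pi*t/2))."""
    if abs(math.sin(math.pi * t / 2)) < 1e-12:   # removable singularity
        return m + 0.5
    return math.sin((2 * m + 1) * math.pi * t / 2) / (2 * math.sin(math.pi * t / 2))

def K(u, v, m):
    """K_m(u, v) = 1 + 2 * sum_j cos(pi j u) cos(pi j v)."""
    return 1 + 2 * sum(math.cos(math.pi * j * u) * math.cos(math.pi * j * v)
                       for j in range(1, m + 1))

m, u, v = 5, 0.3, 0.71
lhs = K(u, v, m)
rhs = dirichlet(u - v, m) + dirichlet(u + v, m)
print(abs(lhs - rhs) < 1e-9)  # True
```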
We see, then, that r̂(·; m) has the same form as a Gasser-Müller type estimator with kernel K_m(u, v). The main difference between r̂(·; m) and the estimators in Section 2.3 is with respect to the type of kernel used. Note that K_m(x, u) depends on both x and u, rather than only on x − u. The truncation point m, roughly speaking, is inversely proportional to the bandwidth of a kernel smoother. Figure 2.13 gives us an idea about the nature of K_m and the effect of increasing m. Each of the three graphs in a given panel provides the kernel weights used at a point, x, where the curve is to be estimated. Notice that the weights are largest in absolute value near x and that the kernel becomes more concentrated at x as m increases. A key difference between the kernel K_m and, for example, the Gaussian kernel is the presence of high-frequency wiggles, or side lobes, in the tails of K_m, with the number of wiggles being an increasing function of m. To many practitioners the oscillatory nature of the kernel K_m is an unsavory aspect of the series estimator r̂(·; m). One way of alleviating this
FIGURE 2.12. Fourier Series Estimates. The dashed line is the true curve. The truncation points of the estimates are, from top to bottom, 27, 2 and 1.
FIGURE 2.13. Kernels Corresponding to the Truncated Series Estimate r̂(x; m), for m = 3, 8 and 16.

problem is to taper the Fourier coefficients. Define a tapered series estimator r̂(x; w_λ) of r(x) by

(2.7)   r̂(x; w_λ) = φ̂₀ + 2 Σ_{j=1}^{n−1} w_λ(j) φ̂_j cos(πjx),   0 ≤ x ≤ 1,
where the taper {w_λ(1), w_λ(2), ...} is a sequence of constants depending on a smoothing parameter λ. Usually the taper is such that 1 ≥ w_λ(1) ≥ w_λ(2) ≥ ··· ≥ w_λ(n − 1) ≥ 0. The estimator (2.7) may also be written as the kernel estimator
with kernel

K_n(u, v; w_λ) = 1 + 2 Σ_{j=1}^{n−1} w_λ(j) cos(πjv) cos(πju).
By appropriate choice of w_λ, one can obtain a kernel K(x, ·; w_λ) that looks much like a Gaussian curve for x not too close to 0 or 1. We shall discuss how to do so later in this section.

Tapering has its roots in the mathematical theory of Fourier series, where it has been used to induce convergence of an otherwise divergent series. For example, define, for each positive integer m, w_m(j) = 1 − (j − 1)/m, j = 1, ..., m. These are the so-called Fejér weights and correspond to forming the sequence of arithmetic means of a series; i.e.,
(2.8)   φ₀ + 2 Σ_{j=1}^m (1 − (j − 1)/m) φ_j cos(πjx) = (1/m) Σ_{j=1}^m r(x; j).
The series (2.8) has the following remarkable properties (see, e.g., Tolstov, 1962): (i) it converges to the same limit as r(x; m) whenever the latter converges, and (ii) it converges uniformly to r(x) on [0, 1] for any function r in C[0, 1]. Property (ii) is all the more remarkable considering that there exist continuous functions r such that r(x; m) actually diverges at certain points x. (The last statement may seem to contradict the fact that the Fourier series of any r ∈ C[0, 1] converges in mean square to r; but in fact it does not, since mean square convergence is weaker than pointwise convergence.) The kernel corresponding to the Fejér weights has the desirable property of being positive, but is still quite oscillatory. Another possible set of weights is
w_m(j) = cos(πj/(2m + 1)),   j = 1, ..., m.
The kernel corresponding to these weights may be written in terms of the Rogosinski kernel R_m (Butzer and Nessel, 1971):

K_m^R(u, v) = R_m(u − v) + R_m(u + v),

where

R_m(t) = (1/2)[D_m(t − 1/(2m + 1)) + D_m(t + 1/(2m + 1))].
The corresponding trigonometric series estimator is

r̂_R(x; m) = φ̂₀ + 2 Σ_{j=1}^m cos(πj/(2m + 1)) φ̂_j cos(πjx).

Now take the taper w_λ in the series estimator (2.7) to be

w_λ(j) = φ_K(λπj),   j = 1, 2, ...,

where φ_K denotes the Fourier transform of a symmetric kernel K.
For large n, the kernel of this estimator is well approximated by

K(u, v; λ) = 1 + 2 Σ_{j=1}^∞ φ_K(λπj) cos(πju) cos(πjv) = K_w(u + v; λ) + K_w(u − v; λ),   0 ≤ u, v ≤ 1,
FIGURE 2.14. Kernels Corresponding to the Rogosinski Series Estimate r̂_R(x; m). For a given m, the graphs show, from left to right, the data weights K_m^R(x, v) used in estimating the curve at x = .5, x = .75 and x = 1.
FIGURE 2.15. Gamma Type Function and Series Approximators. The solid line is r(x) = (20x)² exp(−20x), the dashed line is r(x; 8) and the dots are r̂_R(x; 14), a Rogosinski type series.
where K_w(y; λ) is a "wrapped" version of K(y/λ)/λ, i.e.,

K_w(y; λ) = (1/λ) Σ_{j=−∞}^∞ K((y − 2j)/λ),   ∀ y.
Note that K_w(·; λ) is periodic with period 2 and integrates to 1 over any interval of length 2. Now, let u and v be any numbers in (0, 1); then if K is sufficiently light-tailed,

K_w(u + v; λ) + K_w(u − v; λ) ∼ (1/λ) K((u − v)/λ)
as λ → 0. (Here and subsequently in the book, the notation a_λ ∼ b_λ means that a_λ/b_λ tends to 1 as λ approaches a limit.) This is true, for example, when K is the standard normal density and whenever K has finite support. So, except near 0 and 1, the Gasser-Müller type kernel smoother (with appropriate kernel K) is essentially the same as the series estimator (2.7) with taper φ_K(λπj). We can now see that kernel smoothers and series estimators are just two sides of the same coin. Having two representations for the same estimator provides some theoretical insight and can also be useful from a computational standpoint.
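A numerical check of this duality for the Gaussian case, where φ_K(ω) = exp(−ω²/2): the tapered cosine series, the wrapped kernel, and the ordinary (unwrapped) Gaussian kernel weight all but coincide for interior u and v. The truncation limits below are numerical conveniences.

```python
import math

lam = 0.1
norm = lambda t: math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)

def K_w(y):
    """Wrapped kernel: (1/lam) * sum_j K((y - 2j)/lam), truncated to |j| <= 5."""
    return sum(norm((y - 2 * j) / lam) for j in range(-5, 6)) / lam

def K_series(u, v, terms=60):
    """1 + 2 * sum_j phi_K(lam*pi*j) cos(pi j u) cos(pi j v)."""
    return 1 + 2 * sum(
        math.exp(-0.5 * (lam * math.pi * j) ** 2)
        * math.cos(math.pi * j * u) * math.cos(math.pi * j * v)
        for j in range(1, terms + 1)
    )

u, v = 0.5, 0.43
a = K_series(u, v)
b = K_w(u + v) + K_w(u - v)
c = norm((u - v) / lam) / lam          # ordinary Gaussian kernel weight
print(abs(a - b) < 1e-8, abs(a - c) < 1e-6)
```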
2.5 Dealing with Edge Effects

So-called edge or boundary effects are a fundamental difficulty in smoothing. The basic problem is that the efficiency of an estimator of r(x) tends to decrease as x nears either edge, i.e., boundary, of an interval containing all the design points. Edge effects are due mostly to the fact that fewer data
are available near the boundaries and in part to properties of the particular smoother being used. In this section we discuss how various smoothers deal with edge effects and how the performance of smoothers can be enhanced by appropriate modification within the boundary region.
2.5.1 Kernel Smoothers
Priestley-Chao and Gasser-Müller type kernel estimators, as defined in Section 2.3, can be very adversely affected when x is near an edge of the estimation interval. Consider a Gasser-Müller estimator whose kernel is a pdf with support (−1, 1). If the estimation interval is [0, 1], then for x ∈ [h, 1 − h] the kernel weights of r̂_h^GM(x) add to 1, and so the estimate is a weighted average of Y_i's. However, if x is in [0, h) or (1 − h, 1], the weights add to less than 1 and the bias of r̂_h^GM(x) tends to be larger than it is when x is in [h, 1 − h]. In fact, when x is 0 or 1, we have E(r̂_h^GM(x)) ≈ r(x)/2, implying that the Gasser-Müller estimator is not even consistent for r(0) and r(1) (unless r(0) = 0 = r(1)).

A very simple way of alleviating the majority of the boundary problems experienced by r̂^PC and r̂^GM is to use the "cut-and-normalize" method. We define a normalized Gasser-Müller estimator by

(2.9)   r̃_h^GM(x) = r̂_h^GM(x) / ∫₀¹ (1/h) K((x − u)/h) du.

By dividing by the sum of the kernel weights we guarantee that the estimate is a weighted average of Y_i's at each x. Of course, for x ∈ [h, 1 − h] the normalized estimate is identical to r̂_h^GM(x). It is worth noting that the Nadaraya-Watson kernel estimator is normalized by definition.

It turns out that even after normalizing the Priestley-Chao and Gasser-Müller estimators, their bias still tends to be largest when x is in the boundary region. As we will see in Chapter 3, for the types of kernels most often used in practice, this phenomenon occurs when r has two continuous derivatives throughout [0, 1] and r′(0+) ≠ 0 (or r′(1−) ≠ 0). Intuitively we may explain the boundary problems of the normalized Gasser-Müller estimator in the following way. Suppose we have data Y₁, ..., Y_n from model (2.1) in which r has two continuous derivatives with r′(0+) < 0. The function r is to be estimated by (2.9), where K is symmetric about 0. Now consider data Z_{−n}, ..., Z_{−1}, Z₁, ..., Z_n from the model
Z_i = m(u_i) + η_i,   |i| = 1, ..., n,

where E(η_i) = 0, −u_{−i} = u_i = x_i, i = 1, ..., n, and m(−x) = m(x) = r(x) for x ≥ 0. Using the Z_i's we may estimate m(x) by an ordinary Gasser-Müller estimator, call it m̂_Z(x), using the same kernel and bandwidth h as in the normalized estimator r̃_h^GM(x) of r(x). Since r′(0+) < 0, the function m has a cusp at x = 0, implying that the bias |E m̂_Z(x) − m(x)| will be
larger at x = 0 than at x ∈ [h, 1 − h]. A bit of thought reveals that

E m̂_Z(x) − m(x) = E r̃_h^GM(x) − r(x),   for x = 0 and x ∈ [h, 1 − h].
So, r̃_h^GM(x) has a large boundary bias owing to the fact that m̂_Z(x) has a large bias at the cusp x = 0. Figure 2.16 may help to clarify this explanation. In this example we have r(x) = exp(−x), 0 ≤ x ≤ 1, and m(x) = exp(−|x|), −1 ≤ x ≤ 1.

Suppose in our discussion above that r′(0+) had been 0, implying that m′(0) = 0. In this case, m would be just as smooth at 0 as at other points; hence, the bias of m̂_Z(0) would not be especially large. This, in turn, implies that the normalized estimator (2.9) would not exhibit the edge effects that occur when r′(0+) ≠ 0. This point is well illustrated by the window and kernel estimates in the middle panels of Figures 2.2 and 2.3. Notice that these (normalized) estimates show no tendency to deteriorate near 0 or 1. This is because the regression function r(x) = [1 − (2x − 1)²]² I_(0,1)(x) generating the data is smooth near the boundary in the sense that r′(0) = 0 = r′(1).

Consider again the Gasser-Müller estimator r̂_h^GM having kernel K with support (−1, 1). If K is chosen to be symmetric, then ∫_{−1}^1 u K(u) du = 0, which, as we will see in Chapter 3, is beneficial in terms of estimator bias. This benefit only takes place when r(x) is estimated at x ∈ [h, 1 − h]. To appreciate why, suppose we estimate r(x) for x in the boundary region, say at x = qh, 0 ≤ q < 1. In this case the kernel used by r̂_h^GM(x) is
FIGURE 2.16. The function r(x) = exp(−x) and its symmetric extension m(x) = exp(−|x|).
In Figure 2.17, data were generated from a curve with r′(1−) > 0. Each of the two estimates shown uses the Epanechnikov kernel and bandwidth .2. The lower of the two estimates at x = 1 is a normalized Gasser-Müller estimate, and the other is a Gasser-Müller estimate using boundary kernels
K_q(u) = .75(a_q − b_q u)(1 − u²) I_(−q,1)(u), with a_q and b_q defined as in (2.11). The behavior of the normalized estimate at 1 is typical. Normalized kernel estimates tend to underestimate r(1) whenever r′(1−) > 0 and to overestimate r(1) when r′(1−) < 0. At x = 0, though, where the function is nearly flat, the normalized and boundary kernel estimators behave much the same.

Rice (1984a) has proposed boundary modifications for Nadaraya-Watson type kernel estimators. Although he motivates his method in terms of a numerical analysis technique, the method turns out to be similar to that
FIGURE 2.17. Boundary Kernels vs. Normalizing. The solid line is the true curve from which the data were generated. The dotted line is a Gasser-Müller estimate that uses boundary kernels, and the dashed line is a normalized Gasser-Müller estimate.
of Gasser and Müller in that it produces boundary kernels that integrate to 1 and have first moments equal to 0.
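Since the closed-form expressions of (2.11) are not reproduced here, the following sketch recovers boundary-kernel coefficients a_q and b_q numerically from the two stated moment conditions — ∫ K_q = 1 and ∫ u K_q = 0 — for the kernel form K_q(u) = .75(a_q − b_q u)(1 − u²) on (−q, 1). At q = 1 it should return the interior Epanechnikov kernel (a = 1, b = 0).

```python
def moment(k, q, n=50_000):
    """Midpoint-rule integral of .75 * u^k * (1 - u^2) over (-q, 1)."""
    a, b = -q, 1.0
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        u = a + (i + 0.5) * h
        total += 0.75 * u ** k * (1.0 - u * u)
    return total * h

def boundary_coeffs(q):
    """Solve a*A - b*B = 1 and a*B - b*C = 0 for (a_q, b_q)."""
    A, B, C = moment(0, q), moment(1, q), moment(2, q)
    det = B * B - A * C
    return -C / det, -B / det

a1, b1 = boundary_coeffs(1.0)
print(round(a1, 4), round(b1, 4))   # 1.0 0.0 (interior Epanechnikov case)
```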
2.5.2 Fourier Series Estimators

In this section we examine how Fourier series estimators of the form

r̂(x; w_λ) = φ̂₀ + 2 Σ_{j=1}^{n−1} w_λ(j) φ̂_j cos(πjx)
are affected when x is near the boundary. Recall that such estimators include the simple truncated series r̂(x; m) and the Rogosinski estimator r̂_R(x; m) as special cases. We noted in Section 2.4 that any series estimator r̂(x; w_λ) is also a kernel estimator of the form
r̂(x; w_λ) = Σ_{i=1}^n Y_i ∫_{s_{i−1}}^{s_i} K_n(x, u; w_λ) du,
where

K_n(x, u; w_λ) = 1 + 2 Σ_{j=1}^{n−1} w_λ(j) cos(πju) cos(πjx).
For each x ∈ [0, 1] we have

Σ_{i=1}^n ∫_{s_{i−1}}^{s_i} K_n(x, u; w_λ) du = ∫₀¹ K_n(x, u; w_λ) du = 1.
Since the sum of the kernel weights is always 1, we can expect the boundary performance of the series estimator r̂(·; w_λ) to be at least as good as that of Nadaraya-Watson or normalized Gasser-Müller estimators.

Figures 2.13 and 2.14 show that the boundary adjustments implicit in our series estimators are not simply normalizations. Especially in the top panels of those two figures, we see that the kernel changes shape as the point of estimation moves from x = 1/2 to 1. Another example of this behavior is seen in Figure 2.18, where kernels for the estimate r̂(x; w_λ) with w_λ(j) = exp(−.5(.1πj)²) are shown at x = .5, .7, .9 and 1. At x = .5 and .7 the kernel is essentially a Gaussian density with standard deviation .1, but at x = .9 the kernel has a shoulder near 0. Right at x = 1 the kernel is essentially a half-normal density.

To further investigate boundary effects, it is convenient to express the series estimate in yet another way. Define an extended data set as follows: Y_{−i+1} = Y_i, i = 1, ..., n, and s_{−i} = −s_i for i = 0, ..., n. In other words, we create a new data set of size 2n by simply reflecting the data Y₁, ..., Y_n about the y-axis. It is now easy to verify that the series estimate r̂(x; w_λ) is identical to
(2.12)   Σ_{i=−n+1}^n Y_i ∫_{s_{i−1}}^{s_i} K_n(x − u; w_λ) du,
for each x ∈ [0, 1], where

K_n(v; w_λ) = 1/2 + Σ_{j=1}^{n−1} w_λ(j) cos(πjv),   ∀ v.
In particular, the simple series estimator r̂(·; m) and the Rogosinski series r̂_R(·; m) are equivalent to kernel estimator (2.12) with K_n(·; w_λ) identical to, respectively, the Dirichlet kernel D_m and the Rogosinski kernel R_m for all n. The representation (2.12) suggests that tapered series estimators will be subject to the same type of edge effects that bother normalized kernel estimators. Note that reflecting the data about 0 will produce a cusp as in Figure 2.16, at least when r′(0+) ≠ 0. This lack of smoothness will tend to make the bias of estimator (2.12) relatively large at x = 0. The same type of edge effects will occur at x = 1. Since the kernel K_n(·; w_λ) is periodic
FIGURE 2.18. Kernels Corresponding to the Tapered Series Estimate with Taper Equal to a Gaussian Characteristic Function. Each graph is a kernel at a different point of estimation x. The bandwidth λ in each case is .1.
with period 2, the estimate (2.12) at x = 1 is equal to the same type of estimate applied to data that are reflected about x = 1, rather than 0.

A simple and effective way of correcting the type of edge effects just discussed was proposed by Eubank and Speckman (1990). Their proposal, called polynomial-trigonometric regression, is to estimate r(x) by

(2.13)   r̂_m(x) = â₁x + â₂x² + Σ_{j=0}^m b̂_j cos(πjx),   0 ≤ x ≤ 1,
where, for a given m, the estimates â₁, â₂, b̂_j, j = 0, ..., m, are the ordinary least squares estimates, i.e., they minimize

Σ_{i=1}^n (Y_i − a₁x_i − a₂x_i² − Σ_{j=0}^m b_j cos(πjx_i))²

with respect to a₁, a₂, b₀, ..., b_m, respectively.
The fact that this method tends to alleviate edge effects may be explained heuristically as follows. Define r′(0+) = d₁ and r′(1−) = d₂, and note that r(x) may be written as

r(x) = d₁x + (1/2)(d₂ − d₁)x² + g(x),

where g is a function satisfying g′(0+) = 0 = g′(1−). In essence, the quadratic terms in (2.13) model d₁x + (1/2)(d₂ − d₁)x² and the cosine terms model g. But, since g is smooth near 0 and 1, it may be estimated by a cosine series without boundary effects.

Figure 2.19 illustrates the kind of improvement that can be obtained by including quadratic terms with cosines. The data in this example are fifty observations from a model of the form (2.1) with

r(x) = (1 + .01x²)(9.9 + .3 sin(2πx)).
Edge effects are clear in the top graph, where both a simple truncated series and a Rogosinski estimate have been utilized. These two estimates use five and eight cosine terms, respectively, and the estimate in the bottom graph uses quadratic terms and the cosine terms cos(πx), cos(2πx) and cos(3πx). So, the quadratic-trigonometric estimate provides an overall better fit than the other two estimates, and is at least as parsimonious.
2.6 Other Smoothing Methods
2.6.1 The Duality of Approximation and Estimation

We have discussed two basic approaches to nonparametric function estimation: kernel methods and Fourier series. Both of these methods have counterparts in the field of approximation theory, where the goal is to approximate a function r over an interval [a, b] given values r(x₁), ..., r(x_n) at the points a ≤ x₁ < ··· < x_n ≤ b. Conversely, any approximation method can be applied in an analogous statistical estimation problem where one observes not the actual function values, but "function values + noise." The isomorphism of function approximation and estimation has been noted and used to advantage by Gray (1988). Define r_i = r(x_i), i = 1, ..., n, and, for each x ∈ [a, b], let r(x; r₁, ..., r_n) be an approximation to r(x) depending on r₁, ..., r_n. In a statistical estimation problem where Y_i = r(x_i) + ε_i, i = 1, ..., n, we may estimate r(x) by
r̂_n(x) = r(x; Y₁, ..., Y_n).

Many approximation schemes are linear in the sense that

(2.14) r(x; r₁, ..., r_n) = Σ_{i=1}^{n} w_i(x; n) r_i.
"i
.••iiee-
ci
-------~ ..
/
,'/
• '/J'
·~
•;,l' • 0
ci
>-
-·
.. -··
·.··!
. /," ,,
00
,•/
o)
.:-.•·
\
The minimizer ĝ_λ of E_λ(g) turns out to be a spline. In particular, ĝ_λ is a spline with the following properties:

1. It has knots at x₁, ..., x_n.
2. It is a cubic polynomial on each interval [x_{i−1}, x_i], i = 2, ..., n.
3. It has two continuous derivatives.

The function ĝ_λ is called the cubic smoothing spline estimator of r. It is worthwhile to consider the extreme cases λ = 0 and λ = ∞. It turns out that if one evaluates ĝ_λ at λ = 0, the result is the (unique) minimizer of ∫₀¹ [g''(x)]² dx subject to the constraint that g(x_i) = Y_i, i = 1, ..., n. This spline is a so-called natural spline interpolant of Y₁, ..., Y_n. At the other extreme, lim_{λ→∞} ĝ_λ is simply the least squares straight line fit to the data (x₁, Y₁), ..., (x_n, Y_n). The cases λ = 0 and λ = ∞ help to illustrate that λ plays the role of smoothing parameter in the smoothing spline estimator. Varying λ between 0 and ∞ yields estimates of r with varying degrees of smoothness and fidelity to the data. An advantage that the smoothing spline estimator has over local linear and some kernel estimators is its interpretability in the extreme cases of λ = 0 and ∞. When based on a finite support kernel K, Nadaraya-Watson kernel and local linear estimators are not even well defined as h → 0. Even if one restricts h to be at least as large as the smallest value of x_i − x_{i−1}, these estimates still have extremely erratic behavior for small h. They do, however, approach meaningful functions as h becomes large. The Nadaraya-Watson and local linear estimates approach the constant function Ȳ and the least squares straight line, respectively, as h → ∞. The Gasser-Müller estimator has a nice interpretation in both extreme cases. The case h → 0 was discussed in Section 2.3.1, and Hart and Wehrly (1992) show that upon appropriately defining boundary kernels the Gasser-Müller estimate tends to a straight line as h → ∞.
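The large-bandwidth limit just described is easy to check numerically. The sketch below (hypothetical helper name; Gaussian kernel) shows the Nadaraya-Watson estimate collapsing to Ȳ for a huge bandwidth and chasing the nearest observation for a tiny one:

```python
import math

def nadaraya_watson(xs, ys, x, h):
    """Nadaraya-Watson estimate at x: a weighted average of the Y_i
    with Gaussian kernel weights and bandwidth h."""
    w = [math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
```

As h grows, all weights become equal and the estimate tends to the sample mean; as h shrinks toward the design spacing, the estimate follows individual observations.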
2.6.4 Rational Functions

A well-known technique of approximation theory is that based on ratios of functions. In particular, ratios of polynomials and ratios of trigonometric polynomials are often used to represent an unknown function. One advantage of this method is that it sometimes leads to an approximation that is more parsimonious, i.e., uses fewer parameters, than other approximation methods. Here we consider the regression analog of a method introduced in probability density estimation by Hart and Gray (1985) and Hart (1988). Consider approximating r by a function of the form

(2.17) r_{p,q}(x) = (β₀ + 2 Σ_{j=1}^{q} β_j cos(πjx)) / |1 + α₁ exp(πix) + ··· + α_p exp(πipx)|²,  0 ≤ x ≤ 1,
where the α_j's and β_k's are all real constants. If p = 0, r_{p,q} is simply a truncated cosine series approximation as discussed in Section 2.4. Those familiar with time series analysis will recognize r_{p,q} as having the same form as the spectrum of an autoregressive, moving average process of order (p, q). It is often assumed that the spectrum of an observed time series has exactly the form (2.17). Here, though, we impose no such structure on the regression function r, but instead consider functions r_{p,q} as approximations to r, in the same sense that a function need not have a finite Fourier series in order for the function to be well approximated by a truncated cosine series. The representation (2.17) is especially useful in cases where the regression function has sharp peaks. Consider a function of the form

(2.18) g(x; ρ, θ) = |1 + 2ρ cos(θ) exp(πix) + ρ² exp(2πix)|⁻²,  0 ≤ x ≤ 1,

where 0 < ρ < 1 and 0 ≤ θ ≤ π. This function has a single peak in the interval [0, 1]. When

arccos(2ρ/(1 + ρ²)) ≤ θ ≤ arccos(−2ρ/(1 + ρ²)),

the peak occurs in (0, 1) at x = π⁻¹ arccos{−cos(θ)(1 + ρ²)/(2ρ)}; otherwise it occurs at 0 or 1. The sharpness of the peak is controlled by ρ; the closer ρ is to 1, the sharper the peak. Based on these observations, a rough rule of thumb is that one should use an approximator of the form r_{2k,q} when approximating a function that has k sharp peaks in (0, 1). One may ask what advantage there is to using a rational function approximator when sharp peaks can be modeled by a truncated Fourier series. After all, so long as the function is continuous, it may be approximated arbitrarily well by such a series. In cases where r has sharp peaks, the advantage of (2.17) is one of parsimony. The Fourier series of functions with sharp peaks tend to converge relatively slowly. This means that an adequate Fourier series approximation to r may require a large number of Fourier coefficients. This can be problematic in statistical applications where it is always desirable to estimate as few parameters as possible. By using an approximator of the form r_{p,q}, one can often obtain a good approximation to r by using far fewer parameters than are required by a truncated cosine series. The notion of a "better approximation" can be made precise by comparing the integrated squared error of a truncated cosine series with that of r_{p,q}. Consider, for example, approximating r by the function, call it r_m^*, which has the form r_{2,m−2} and minimizes the integrated squared error I(r_{2,m−2}, r) with respect to β₀, ..., β_{m−2}, α₁ and α₂. Then under a variety of conditions one may show that

lim_{m→∞} I(r_m^*, r) / I(r(·; m), r) = c < 1,

where r(·; m) is the truncated cosine series from Section 2.4. Results of this type are proven rigorously in Hart (1988) for the case where p is 1 rather than 2. An example of the improvement that is possible with the approximator r_{2,q} is shown in Figure 2.21. The function being approximated is
(2.19) r(x) = (11⁵⁵/10⁵⁰) x⁵(1 − x)⁵⁰,  0 ≤ x ≤ 1,

which was constructed so as to have a maximum of 1. The approximations in the left and right graphs of the bottom panel are truncated cosine series based on truncation points of 7 and 30, respectively. The approximator in the top left graph is one of the form r_{2,5}, and that in the top right is r_{2,8}. The two left-hand graphs are based on the same number of parameters, but obviously r_{2,5} yields a far better approximation than does r(·; 7). The significance of the truncation point m = 30 is that this is the smallest value
FIGURE 2.21. Rational and Truncated Cosine Series Approximators. In each graph the solid line is the true curve and the dashed one the approximator. The top graphs depict rational function approximators of the form r_{2,5} and r_{2,8} on the left and right, respectively. The bottom graphs show truncated cosine series with truncation points of 7 and 30 on the left and right, respectively.
of m for which I(r(·; m), r) < I(r_{2,5}, r). We also have I(r_{2,8}, r) ≈ .5 I(r(·; 30), r), which is quite impressive considering that r_{2,8} uses only one-third the number of parameters that r(·; 30) does. In practice one needs a means of fitting a function of the form r_{p,q} to the observed data. An obvious method is to use least squares. To illustrate a least squares algorithm, consider the case p = 2, and define g(x; ρ, θ) as in (2.18). For given values of ρ and θ, the model

r(x) ≈ g(x; ρ, θ) {β₀ + 2 Σ_{j=1}^{q} β_j cos(πjx)}

is linear in β₀, ..., β_q, and hence one may use a linear routine to find the least squares estimates β̂₀(ρ, θ), ..., β̂_q(ρ, θ) that are conditional on ρ and θ. A Gauss-Newton algorithm can then be used to approximate the values of ρ and θ that minimize

Σ_{i=1}^{n} [Y_i − g(x_i; ρ, θ) {β̂₀(ρ, θ) + 2 Σ_{j=1}^{q} β̂_j(ρ, θ) cos(πjx_i)}]².
This algorithm generalizes in an obvious way to cases where p > 2. Usually it is sufficient to take p to be fairly small, since, as we noted earlier, p/2 corresponds roughly to the number of peaks that r has. Even when r has several peaks, an estimator of the form r_{2,q} will often be more efficient than a truncated cosine series. In particular, this is true when the rate of decay of r's Fourier coefficients is dictated by one peak that is sharper than the others.
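The conditional-linear structure of the p = 2 fit can be sketched as follows; a crude grid search over (ρ, θ) stands in for the Gauss-Newton step, and all function names are illustrative rather than from the text:

```python
import cmath
import math

def g(x, rho, theta):
    """The factor (2.18): |1 + 2 rho cos(theta) e^{pi i x} + rho^2 e^{2 pi i x}|^{-2}."""
    z = cmath.exp(1j * math.pi * x)
    return abs(1 + 2 * rho * math.cos(theta) * z + rho ** 2 * z * z) ** (-2)

def gauss_solve(A, b):
    """Solve A beta = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        beta[r] = (M[r][n] - sum(M[r][c] * beta[c] for c in range(r + 1, n))) / M[r][r]
    return beta

def conditional_fit(xs, ys, rho, theta, q):
    """For fixed (rho, theta) the model is linear in beta_0, ..., beta_q;
    solve the normal equations and return the residual sum of squares."""
    rows = [[g(x, rho, theta) * (1.0 if j == 0 else 2.0 * math.cos(math.pi * j * x))
             for j in range(q + 1)] for x in xs]
    p = q + 1
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    c = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    beta = gauss_solve(A, c)
    fitted = (sum(b * ri for b, ri in zip(beta, row)) for row in rows)
    return sum((y - f) ** 2 for y, f in zip(ys, fitted))

def grid_fit(xs, ys, q, rhos, thetas):
    """Minimize the conditional RSS over a grid of (rho, theta) values."""
    return min((conditional_fit(xs, ys, r, t, q), r, t)
               for r in rhos for t in thetas)
```

A Gauss-Newton or other gradient-based routine would replace the grid search in a serious implementation; the conditional linear solve is the same in either case.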
2.6.5 Wavelets

Relative newcomers to the nonparametric function estimation scene are smoothers based on wavelet approximations. Wavelets have received a tremendous amount of attention in recent years from mathematicians, engineers and statisticians (see, e.g., Chui, 1992). A wavelet approximation to a function defined on the real line makes use of an orthogonal series representation for members of L₂(ℝ), the collection of functions that are square integrable over the real line. (Throughout the book ℝ and ℝᵏ denote the real number line and k-dimensional Euclidean space, respectively.) What makes wavelets so attractive is their tremendous ability to adapt to local features of curves. One situation of particular interest is when the underlying function has jump discontinuities. Without special modification, kernel, cosine series and local polynomial estimators behave quite poorly when jumps are present. By contrast, wavelets have no problem adapting
to jump discontinuities. In addition, wavelets are good at data compression, in that they can often approximate nonsmooth functions using far fewer parameters than would be required by comparable Fourier series approximators. In this way wavelets have a similar motivation to the rational functions discussed in Section 2.6.4. Wavelet approximators of functions are orthogonal series expansions based on dilation and translation of a wavelet function ψ. Given a function r that is square integrable over the real line, this expansion takes the form

(2.20) r(x) = Σ_{j,k=−∞}^{∞} c_{j,k} 2^{j/2} ψ(2ʲx − k),

where

c_{j,k} = 2^{j/2} ∫_{−∞}^{∞} r(x) ψ(2ʲx − k) dx.
The function ψ is called an orthogonal wavelet if the collection of functions {2^{j/2} ψ(2ʲx − k)} is an orthonormal basis for L₂(ℝ). In a statistical context, sample wavelet coefficients ĉ_{j,k} are computed from noisy data and used in place of c_{j,k} in a truncated version of the infinite series (2.20). The adaptability of wavelets comes from the fact that the orthogonal functions in the expansion (2.20) make use of dilation and translation of a single function. By contrast, the trigonometric series discussed in Section 2.4 use only dilation, i.e., scaling, of the function cos(πx). To see more clearly the effect of translation, consider using the simple Haar wavelet to represent a square integrable function r on the interval [0, 1]. The Haar wavelet is

ψ_H(x) = 1 if 0 ≤ x < 1/2, −1 if 1/2 ≤ x < 1, and 0 otherwise.
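As a small illustration, Haar coefficients can be computed by numerical integration; since ψ_H is itself the (j, k) = (0, 0) basis function, its expansion has a single nonzero coefficient (helper names are illustrative):

```python
def haar(x):
    """The Haar wavelet psi_H."""
    if 0.0 <= x < 0.5:
        return 1.0
    if 0.5 <= x < 1.0:
        return -1.0
    return 0.0

def haar_coef(r, j, k, n=4096):
    """c_{j,k} = 2^{j/2} * integral over [0, 1] of r(x) psi_H(2^j x - k) dx,
    approximated by the midpoint rule with n subintervals."""
    scale = 2.0 ** (j / 2.0)
    total = sum(r((i + 0.5) / n) * haar(2 ** j * ((i + 0.5) / n) - k)
                for i in range(n))
    return scale * total / n
```

For a function with a jump, only the handful of coefficients whose dyadic support intervals straddle the jump are large; this is the localization property that truncated cosine series lack.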
The wavelet estimate shown in Figure 2.23 has a smaller integrated squared error than the truncated Fourier series r(·; 11), whose ISE is .003783. We shall have more to say about the relative merits of thresholding and truncation in the testing context of Chapter 7.
3 Statistical Properties of Smoothers
3.1 Introduction

The smoothers discussed in Chapter 2 provide very useful descriptions of regression data. However, when we use smoothers to formally estimate a regression function, it becomes important to understand their statistical properties. In this chapter we discuss issues such as mean squared error and sampling distribution of an estimator, and using smoothers to obtain confidence intervals for values of the regression function. We will consider two types of smoothers: Gasser-Müller type kernel estimators and tapered Fourier series. We choose to focus on Gasser-Müller rather than Nadaraya-Watson type kernel smoothers since the latter have a more complicated bias representation. Among Fourier series estimators the emphasis will be on truncated series estimators, since there are certain practical and theoretical advantages to using an estimator with a discrete smoothing parameter. Throughout most of this chapter we will assume that the data Y₁, ..., Y_n are generated from a model of the form
Y_i = r(x_i) + ε_i,  i = 1, ..., n,
in which the error terms are uncorrelated and have mean 0 and common variance σ². In addition we will assume that the design points are generated by a positive, Lipschitz continuous density function f in the sense that, for all n,

x_i = Q((i − 1/2)/n),  i = 1, ..., n,

where

Q(u) = F⁻¹(u)

and

F(x) = ∫₀ˣ f(u) du,  0 ≤ x ≤ 1.
(A function g is said to be Lipschitz continuous over the interval [a, b] if there exists a constant C such that

|g(u) − g(v)| ≤ C|u − v|,  for all u, v ∈ [a, b].)
Additional conditions on the regression function r and the error terms will be imposed as we proceed throughout the chapter.
3.2 Mean Squared Error of Gasser-Müller Estimators

To simplify notation we shall denote the Gasser-Müller estimator r̂_h^{GM}(x) by r̂_h(x). Until further notice we assume the following conditions on the kernel K:

1. K has support (−1, 1).
2. K is Lipschitz continuous.
3. ∫_{−1}^{1} K(u) du = 1.
4. ∫_{−1}^{1} uK(u) du = 0.

For future reference we also define the constants J_K and σ_K² by

J_K = ∫_{−1}^{1} K²(u) du  and  σ_K² = ∫_{−1}^{1} u²K(u) du.

The second moment of K is denoted σ_K² to conform with usual notation, but note that σ_K² could be negative since conditions 1-4 allow K to take on negative values. The results to be presented in this section were first developed by Gasser and Müller (1979).
3.2.1 Mean Squared Error at an Interior Point

As a means of measuring how close the estimator r̂_h(x) is, on average, to r(x), we consider the mean squared error, i.e.,

M(x; h) = E(r̂_h(x) − r(x))² = Var(r̂_h(x)) + (E(r̂_h(x)) − r(x))².

There are two categories of factors affecting the size of the mean squared error: those associated with the probability model from which the data arise, and those associated with the estimator. Relevant model factors are the smoothness of the regression function r near x, the error variance σ² and the size of the design density f near x. When using the Gasser-Müller smoother, the bandwidth h and kernel K are the estimator factors affecting the size of mean squared error. To gain insight about the effect of each of these factors, we will investigate the behavior of M(x; h) as the sample size n tends to ∞. It turns out that in order for r̂_h(x) to be mean squared error consistent as n → ∞, it is necessary, in general, for the bandwidth h to tend to 0, but slowly enough that nh → ∞. In this section we will avoid boundary issues by assuming
that x is an "interior point," i.e., a point that is in the open interval (0, 1) and such that h is smaller than min(x, 1 − x). We first consider the variance of r̂_h(x), which is

V(x; h) = Var(r̂_h(x)) = σ² Σ_{i=1}^{n} ( ∫_{s_{i−1}}^{s_i} (1/h) K((x − u)/h) du )² = (σ²/h²) Σ_{i=1}^{n} (s_i − s_{i−1})² K²((x − x̄_i)/h),

where x̄_i ∈ [s_{i−1}, s_i], i = 1, ..., n. This expression for the variance provides a starting point, but does not make clear the effect of the design, the bandwidth and the kernel. The following theorem provides more insight by giving a simple form for the limiting variance as n → ∞.
Theorem 3.1. Assume that the regression model and design conditions of Section 3.1 hold, and let x be in the open interval (0, 1). Then the variance V(x; h) of the Gasser-Müller estimator r̂_h(x) has the following form as n → ∞, h → 0 and nh → ∞:

V(x; h) = (σ²/(nh)) (1/f(x)) J_K + O(n⁻¹) + O((nh)⁻²).

PROOF. Defining Q[(n + 1/2)/n] = 1, we have

V(x; h) = σ²h⁻² Σ_{i=1}^{n} (s_i − s_{i−1})² K²((x − x̄_i)/h) = V_{n,h} + E_{n,h},

where V_{n,h} = σ²h⁻² Σ_{i=1}^{n} (s_i − s_{i−1})² K²((x − x_i)/h) and E_{n,h} is the error committed in replacing each x̄_i by x_i. By the mean value theorem

Q((i + 1/2)/n) − Q((i − 1/2)/n) = Q'(u_i) n⁻¹,  i = 1, ..., n,

where (i − 1/2)/n ≤ u_i ≤ (i + 1/2)/n, i = 1, ..., n. Using the identity Q'(u) = 1/f[Q(u)] for all u, we then have

s_i − s_{i−1} = (1/f[Q(u_i)]) (1/n) = (1/f(x̃_i)) (1/n),

where x_i ≤ x̃_i ≤ x_{i+1}. It follows that

V_{n,h} = (σ²/(n²h²)) Σ_{i=1}^{n} K²((x − x_i)/h) / f²(x_i) + R_{n,h},

where R_{n,h} is the error committed in replacing each f(x̃_i) by f(x_i). Now consider the Riemann sum approximation

(σ²/(n²h²)) Σ_{i=1}^{n} K²((x − x_i)/h) / f²(x_i) = (σ²/(nh²)) ∫₀¹ K²((x − t)/h) (1/f(t)) dt + R*_{n,h},

where R*_{n,h} is the error of the Riemann sum approximation. Making the change of variable z = (x − t)/h in the above integral, we have

V_{n,h} = (σ²/(nh)) ∫_{(x−1)/h}^{x/h} (1/f(x − hz)) K²(z) dz + R_{n,h} + R*_{n,h}
       = (σ²/(nh)) ∫_{−1}^{1} (1/f(x − hz)) K²(z) dz + R_{n,h} + R*_{n,h}
       = (σ²/(nh)) (1/f(x)) ∫_{−1}^{1} K²(z) dz + O(n⁻¹) + R_{n,h} + R*_{n,h}.

Combining all the previous work yields

V(x; h) = (σ²/(nh)) (1/f(x)) J_K + O(n⁻¹) + R_{n,h} + R*_{n,h} + E_{n,h}.

We still need to show that the remainder terms R_{n,h}, R*_{n,h} and E_{n,h} are negligible relative to (nh)⁻¹. Doing so involves similar techniques for each term, so we look only at R_{n,h}.
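The approximation of Theorem 3.1 is easy to check for a uniform design, where f ≡ 1 and the exact variance is a closed-form sum of squared weights. A sketch (Epanechnikov kernel, for which J_K = 3/5; the function names are illustrative):

```python
def K(u):
    """Epanechnikov kernel."""
    return 0.75 * (1.0 - u * u) if -1.0 < u < 1.0 else 0.0

def gm_weight(x, a, b, h, m=200):
    """integral over (a, b) of (1/h) K((x - u)/h) du, midpoint rule."""
    step = (b - a) / m
    return sum(K((x - (a + (i + 0.5) * step)) / h) / h for i in range(m)) * step

def gm_variance(x, h, n, sigma2=1.0):
    """Exact variance of the Gasser-Muller estimate at x under the
    uniform design x_i = (i - 1/2)/n with s_i = i/n."""
    return sigma2 * sum(gm_weight(x, (i - 1.0) / n, i / n, h) ** 2
                        for i in range(1, n + 1))
```

With n = 400 and h = 0.1, the exact variance at the interior point x = 0.5 agrees with σ²J_K/(nh) = 0.6/40 = 0.015 to within a fraction of a percent, as the theorem predicts.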
If σ_K² > 0, then at peaks (r''(x) < 0) the kernel estimator will tend to underestimate r(x), whereas at valleys (r''(x) > 0), overestimation occurs. Furthermore, the bias will be largest in magnitude where |r''(x)| is largest; in other words, the sharper the peak or valley, the larger the bias. A final, very noteworthy, aspect of Theorem 3.2 is that the dominant term in the bias expansion does not depend on the design density. By contrast, the bias of a Nadaraya-Watson estimator does depend on the design (see, e.g., Fan, 1992). The asymptotic design independence of the Gasser-Müller bias is an attractive feature, owing mainly to its simplicity. It is worth noting that the local linear estimator (Section 2.6.2) also has bias that is design-independent to first order. Combining Theorems 3.1 and 3.2 leads to the following corollary concerning the mean squared error of r̂_h(x).
Corollary 3.1. Assume that the regression model and design conditions of Section 3.1 hold, and let x be in the open interval (0, 1). Suppose in addition that r has two continuous derivatives. Then, as n → ∞, h → 0 and nh → ∞, the mean squared error of the Gasser-Müller estimator is M(x; h) = M̄(x; h) + R(n, h), where

(3.1) M̄(x; h) = (σ²/(nh)) (1/f(x)) J_K + (h⁴/4) σ_K⁴ [r''(x)]²

and R(n, h) = o(h⁴) + O(n⁻¹) + O((nh)⁻²).
The competing criteria of stability and fidelity are illustrated succinctly in the asymptotic form of the mean squared error. When we make the bandwidth smaller, the squared bias decreases, but the variance increases. To balance the two criteria we could choose h to minimize the mean squared error M(x; h). By considering only the dominant term M̄(x; h) we can find a sequence {h_n} of bandwidths such that

lim sup_{n→∞} M̄(x; h_n)/M̄(x; h_n') ≤ 1

for any other sequence {h_n'}. We find h_n by solving the equation

(d/dh) M̄(x; h) = 0

for h. This leads to

(3.2) h_n = ( σ²J_K / (f(x)[r''(x)]²σ_K⁴) )^{1/5} n^{−1/5}

and

(3.3) M̄(x; h_n) ~ 1.25 J_K^{4/5} (σ_K²)^{2/5} |r''(x)|^{2/5} (σ²/f(x))^{4/5} n^{−4/5}.
The form of h_n makes more precise some notions discussed in Chapter 2. Since h_n is proportional to σ^{2/5}, the more noisy the data, the larger the optimal bandwidth. We also see that the optimal bandwidth is smaller at points x where r has a lot of curvature, i.e., where |r''(x)| is large. These considerations point out that variance and curvature are competing factors. Our ability to detect micro features of r, such as small bumps, is lessened the more noisy the data are. Such features require a small bandwidth, but a lot of noise in the data can make the optimal bandwidth so large that a micro feature will be completely smoothed away. Of course, sample size is also an important factor in our ability to detect fine curve features. In addition to the obvious fact that curve features occurring between adjacent design points will be lost, expression (3.2) shows that the optimal bandwidth is decreasing in n. Hence, given that the bandwidth must be sufficiently small to detect certain curve features, there exists a sample size n below which detection of those features will be impossible when using an "optimal" bandwidth. The design affects h_n in the way one would expect; where the design is sparse the optimal bandwidth is relatively large and where dense the bandwidth is relatively small. For cases where the experimenter actually
has control over the choice of design points, expression (3.3) provides insight on how to distribute them. Clearly it is advisable to have the highest concentration of design points near the x at which |r''(x)| is largest. Of course, r'' will generally be unknown, but presumably a vague knowledge of where r'' is large will lead to better placement of design points than no knowledge whatsoever. An important facet of the asymptotic mean squared error in (3.3) is the rate at which it tends to 0 as n → ∞. In parametric problems the rate of decrease of mean squared error is typically n⁻¹, but in our nonparametric problem the rate is n^{−4/5}. It is not surprising that one must pay a price in efficiency for allowing an extremely general form for r. Under the conditions we have placed on our problem, Corollary 3.1 quantifies this price. Having chosen the bandwidth optimally for a given kernel, it makes sense to try to find a best kernel. Expression (3.3) shows that, in terms of asymptotic mean squared error, the optimal kernel problem can be solved independently of the model factors r, f and σ. One problem of interest is to find a kernel K that minimizes J_K^{4/5}(σ_K²)^{2/5} subject to the constraint that K be a density with finite support and zero first moment. This calculus of variations problem was solved by Epanechnikov (1969), who showed that the solution is

K_E(u) = (3/4)(1 − u²) I_{(−1,1)}(u).

The function K_E is often referred to as the Epanechnikov kernel. In spite of the optimality of K_E there is a large number of kernels that are nearly optimal. The efficiencies of several kernels relative to K_E are given in Table 3.1. The quartic and triangle kernels are, respectively,

K_Q(u) = (15/16)(1 − u²)² I_{(−1,1)}(u)  and  K_T(u) = (1 − |u|) I_{(−1,1)}(u).

One can show that expression (3.3) is valid for the Gaussian kernel even though it does not have finite support. The fact that the relative efficiencies for the quartic, triangle and Gaussian kernels are so close to 1 explains
TABLE 3.1. Asymptotically Optimal Mean Squared Error of Various Kernels Relative to the Epanechnikov

Kernel         J_K         σ_K²        Relative efficiency
Epanechnikov   3/5         1/5         1.0000
Quartic        .7142857    .1428571    1.0049
Triangle       2/3         1/6         1.0114
Gaussian       .2820948    1           1.0408
Rogosinski     5/4         .0367611    .9136
the oft-quoted maxim that "kernel choice is not nearly as important as bandwidth choice." The kernel termed "Rogosinski" in Table 3.1 is defined by

K_R(u) = (.5 + cos(π/5) cos(πu) + cos(2π/5) cos(2πu)) I_{(−1,1)}(u).

This kernel satisfies conditions 1-4 at the beginning of Section 3.2 but differs from the other kernels in Table 3.1 in that it takes on negative values. On [−1, 1], K_R is proportional to the Rogosinski kernel at x = 0 that was defined in Section 2.4. It is interesting that K_R has smaller asymptotic mean squared error than does K_E, which is not a contradiction since K_E is optimal among nonnegative kernels. We shall have occasion to recall this property of K_R in Section 3.3 when we discuss properties of the Rogosinski series estimator.
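The entries of Table 3.1 can be reproduced numerically from (3.3), since the asymptotic MSE depends on the kernel only through J_K^{4/5}(σ_K²)^{2/5}. A sketch (names illustrative; the Gaussian kernel is integrated over (−8, 8)):

```python
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson rule with n (even) subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3.0

# kernel function and half-width of its integration range
KERNELS = {
    "Epanechnikov": (lambda u: 0.75 * (1 - u * u), 1.0),
    "Quartic": (lambda u: 15.0 / 16.0 * (1 - u * u) ** 2, 1.0),
    "Triangle": (lambda u: 1.0 - abs(u), 1.0),
    "Gaussian": (lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi), 8.0),
    "Rogosinski": (lambda u: 0.5 + math.cos(math.pi / 5) * math.cos(math.pi * u)
                   + math.cos(2 * math.pi / 5) * math.cos(2 * math.pi * u), 1.0),
}

def constants(name):
    """J_K and sigma_K^2 for a kernel in the table."""
    K, a = KERNELS[name]
    return (simpson(lambda u: K(u) ** 2, -a, a),
            simpson(lambda u: u * u * K(u), -a, a))

def efficiency(name):
    """Asymptotic MSE relative to the Epanechnikov kernel, from (3.3):
    (J_K^4 (sigma_K^2)^2 / (J_E^4 (sigma_E^2)^2))^(1/5)."""
    JE, sE = constants("Epanechnikov")
    JK, sK = constants(name)
    return ((JK ** 4 * sK ** 2) / (JE ** 4 * sE ** 2)) ** 0.2
```

Running `efficiency` over the five kernels recovers the last column of Table 3.1 to four decimal places.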
3.2.2 Mean Squared Error in the Boundary Region

Boundary effects can be quantified by considering the mean squared error of an estimator at a point qh, where q ∈ [0, 1) is fixed. Notice that the point of estimation changes as h → 0, but maintains the same relative position within the boundary region [0, h). Consider first the mean squared error of a normalized Gasser-Müller estimator r̂_h^N(qh) with kernel

K_{N,q}(u) = K(u) I_{(−1,q)}(u) / ∫_{−1}^{q} K(v) dv.

Using the same method of proof as in Theorems 3.1 and 3.2, it is straightforward to show that

(3.4) E(r̂_h^N(qh) − r(qh))² ~ (σ²/(nh)) (1/f(0)) ∫_{−1}^{q} K_{N,q}²(u) du + h² [r'(0+)]² ( ∫_{−1}^{q} u K_{N,q}(u) du )².

Expression (3.4) confirms theoretically what had already been pointed out in Section 2.5.1. The main difference between (3.4) and the corresponding mean squared error at interior points is in the squared bias. The squared bias of the normalized estimator within the boundary region is of order h², rather than h⁴. Also, the effect of r on the bias is felt through the first rather than the second derivative. Minimizing (3.4) with respect to h shows that, in the boundary region, the optimal rate of convergence for the mean squared error of a normalized estimator is n^{−2/3}, at least when r'(0+) ≠ 0. If r'(0+) = 0 and r has two continuous derivatives on [0, 1], then one can show that the squared bias of r̂_h^N(qh) is of order h⁴. Suppose now that we employ boundary kernels as described in Section 2.5.1. The main difference between this approach and the normalized estimator is that the boundary kernel K_q satisfies the same moment conditions
as does K; i.e.,

(3.5) ∫_{−1}^{q} K_q(u) du = 1  and  ∫_{−1}^{q} u K_q(u) du = 0.

Under the same conditions as in Corollary 3.1, the following expansion holds for the mean squared error of a boundary kernel estimator r̂_{q,h}(qh):

(3.6) E(r̂_{q,h}(qh) − r(qh))² = (σ²/(nh)) (1/f(0)) ∫_{−1}^{q} K_q²(u) du + (h⁴/4) [r''(0+)]² [ ∫_{−1}^{q} u² K_q(u) du ]² + o(h⁴) + O(n⁻¹) + O((nh)⁻².

In spite of the similarity of expressions (3.1) and (3.6), there is still a price to be paid in the boundary region. Typically the integral ∫_{−1}^{q} K_q²(u) du will be larger than ∫_{−1}^{1} K²(u) du. This implies that the asymptotic variance of r̂_{q,h}(qh) will be larger than the variance of r̂_h(x) at an interior point x for which f(x) = f(0). It is not surprising that the variance of one's estimator tends to be larger in the boundary, since the number of data in (x − h, x + h) (h ≤ x ≤ 1 − h) is asymptotically larger than the number in (0, qh + h) when f(x) = f(0). Of course, one remedy for larger variance in the boundary is to put extra design points near 0 and 1. One means of constructing boundary kernels was described in Section 2.5.1. Müller (1991) pursues the idea of finding optimal boundary kernels. Among a certain smoothness class of boundary kernels with support (−1, q), Müller defines as optimum the kernel which minimizes the asymptotic variance of the mth derivative of the estimator (m ≥ 0). For example, if m = 1 the optimum kernel turns out to be
(3.7) K_q(u) = 6(1 + u)(q − u) (1/(1 + q)³) {1 + 5((1 − q)/(1 + q))² + 10((1 − q)/(1 + q)²) u} I_{(−1,q)}(u).
At q = 1, K_q(u) becomes simply the Epanechnikov kernel (3/4)(1 − u²)I_{(−1,1)}(u). To ensure a smooth estimate, it would thus be sensible to use the Epanechnikov kernel at interior points x ∈ [h, 1 − h] and the kernel (3.7) at boundary points. Boundary problems near x = 1 are handled in an analogous way. For x ∈ (1 − h, 1] one uses the estimate

Σ_{i=1}^{n} Y_i (1/h) ∫_{s_{i−1}}^{s_i} K_q((u − x)/h) du,

where q = (1 − x)/h and K_q is the same kernel used at the left-hand boundary.
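The moment conditions (3.5) for the kernel (3.7), and its reduction to the Epanechnikov kernel at q = 1, can be verified numerically (function names are illustrative):

```python
def boundary_kernel(u, q):
    """The optimal (m = 1) boundary kernel (3.7), supported on (-1, q)."""
    if not -1.0 < u < q:
        return 0.0
    r = (1.0 - q) / (1.0 + q)
    return (6.0 * (1.0 + u) * (q - u) / (1.0 + q) ** 3
            * (1.0 + 5.0 * r * r + 10.0 * (1.0 - q) / (1.0 + q) ** 2 * u))

def moment(q, p, n=4000):
    """integral over (-1, q) of u^p K_q(u) du, midpoint rule."""
    width = q + 1.0
    total = 0.0
    for i in range(n):
        u = -1.0 + (i + 0.5) * width / n
        total += u ** p * boundary_kernel(u, q)
    return total * width / n
```

For every q the zeroth moment is 1 and the first moment is 0, so the boundary estimator retains the h⁴ squared-bias rate of the interior.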
3.2.3 Mean Integrated Squared Error

To this point we have talked only about local properties of kernel estimators. A means of judging the overall error of an estimator is to compute the global criterion of integrated mean squared error, which is

J(r̂_h, r) = ∫₀¹ E(r̂_h(x) − r(x))² dx.

The quantity J(r̂_h, r) may also be thought of as mean integrated squared error (MISE) since

J(r̂_h, r) = E I(r̂_h, r) = E ∫₀¹ (r̂_h(x) − r(x))² dx.
Boundary effects assert themselves dramatically in global criteria such as MISE. Suppose that we use a normalized Gasser-Müller estimator in the boundary region. Then, under the conditions of Corollary 3.1, if either r'(0+) or r'(1−) is nonzero, the integrated squared bias of the Gasser-Müller estimator is dominated by boundary bias. Let r̂_h denote a Gasser-Müller estimator that uses kernel K at interior points and the normalized version of K, K_{N,q}, in the boundary. It can be shown that

∫₀¹ Var(r̂_h(x)) dx ~ (σ²/(nh)) ∫₀¹ (1/f(x)) dx ∫_{−1}^{1} K²(u) du,

which means that the boundary has no asymptotic effect on integrated variance. Consider, though, the integrated squared bias, which may be written as B₁ + B₂, where

B₁ = ∫₀ʰ B²(x; h) dx + ∫_{1−h}^{1} B²(x; h) dx

and

B₂ = ∫_{h}^{1−h} B²(x; h) dx.

Since the squared bias of r̂_h is of order h² (as h → 0) in the boundary, the integral B₁ is of order h³ (unless r'(0+) = 0 = r'(1−)). Now, B²(x; h) is of order h⁴ for x ∈ (h, 1 − h), and so B₂ is negligible relative to B₁. It follows that the integrated squared bias over (0, 1) is of order h³, and the resulting MISE has asymptotic form

C₁/(nh) + C₂h³,

which will not converge to zero faster than n^{−3/4}. In this sense edge effects dominate the MISE of a normalized Gasser-Müller estimator. The boundary has no asymptotic effect on MISE if one uses boundary kernels. Under the conditions of Corollary 3.1, a Gasser-Müller estimator using boundary kernels has MISE that is asymptotic to the integral of M̄(x; h) over the interval (0, 1). This implies that the MISE converges to 0 at the rate n^{−4/5} and that the asymptotic minimizer of MISE is

(3.8) h_n = ( σ²J_K ∫₀¹ [f(x)]⁻¹ dx / (σ_K⁴ ∫₀¹ (r''(x))² dx) )^{1/5} n^{−1/5}.
3.2.4 Higher Order Kernels

If one assumes that the regression function has more than just two continuous derivatives, then it is possible to construct kernels for which the bias converges to 0 at a faster rate than h². To show how this is done, we first define a kth order kernel K to be one that satisfies

∫_{−1}^{1} K(u) du = 1,  ∫_{−1}^{1} uʲK(u) du = 0, j = 1, ..., k − 1,  and  ∫_{−1}^{1} uᵏK(u) du ≠ 0.

The kernels so far considered have been second order kernels. Notice that kernels of order 3 or more must take on negative values. Suppose that r has k continuous derivatives on [0, 1] and that we estimate r(x) at an interior point x using a Gasser-Müller estimator r̂_h(x) with kth order kernel K. Using a Taylor series expansion exactly as in Theorem 3.2, it follows that

(3.9) E(r̂_h(x)) − r(x) = ((−h)ᵏ/k!) r⁽ᵏ⁾(x) ∫_{−1}^{1} uᵏK(u) du + o(hᵏ).

Assuming that K is square integrable, the asymptotic variance of a kth order kernel estimator has the same form as in Theorem 3.1. Combining this fact with (3.9), it follows that when r has k derivatives, the mean squared error of a kth order kernel estimator satisfies

M(x; h) ≈ (σ²/(nh)) (1/f(x)) J_K + (h²ᵏ/(k!)²) [r⁽ᵏ⁾(x)]² ( ∫_{−1}^{1} uᵏK(u) du )².

Choosing h to minimize the last expression shows that the optimal bandwidth h_n has the form

h_n ~ c n^{−1/(2k+1)}.

The corresponding minimum mean squared error of the kth order kernel estimator converges to 0 at the rate n^{−2k/(2k+1)}. Theoretically, the bias reduction that can be achieved by using higher order kernels seems quite attractive. However, some practitioners are reluctant to use a kernel that takes on negative values, since the associated estimate no longer has the intuitively appealing property of being a weighted average. Also, the integral ∫_{−1}^{1} K²(u) du tends to be larger for higher order kernels than for second order ones. In small samples where asymptotics have not yet "kicked in," this can make a higher order kernel estimator have mean squared error comparable to that of a second order one (Marron and Wand, 1992).
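As a concrete illustration of the moment conditions (this example kernel is not from the text), the polynomial K₄(u) = (15/32)(3 − 10u² + 7u⁴) on (−1, 1) is a fourth-order kernel: it integrates to 1, its first three moments vanish, its fourth moment is −1/21, and it necessarily takes negative values:

```python
def K4(u):
    """A fourth-order polynomial kernel on (-1, 1)."""
    return 15.0 / 32.0 * (3.0 - 10.0 * u * u + 7.0 * u ** 4) if -1.0 < u < 1.0 else 0.0

def moment(p, n=4000):
    """integral over (-1, 1) of u^p K4(u) du, midpoint rule."""
    total = 0.0
    for i in range(n):
        u = -1.0 + (i + 0.5) * 2.0 / n
        total += u ** p * K4(u)
    return total * 2.0 / n
```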
3.2.5 Variable Bandwidth Estimators
The estimators considered in Section 3.2.3 were constant bandwidth estimators, i.e., they employed the same value of h at each x. The form of the optimal bandwidth for estimating r(x) suggests that it would be better to let the bandwidth vary with x. To minimize the pointwise mean squared error, we should let h (as a function of x) be inversely proportional to

{ f(x) [r''(x)]² }^{1/5}.
Use of the pointwise optimal bandwidth leads to MISE that is asymptotically smaller than that of the constant bandwidth estimator of Section 3.2.3. Let r̂_{1n} be the variable bandwidth kernel estimator that uses bandwidth (3.2) at each x, and let r̂_{2n} be the constant bandwidth estimator with h equal to (3.8). Then it is easily verified that

lim_{n→∞} J(r̂_{1n}, r) / J(r̂_{2n}, r) = ∫₀¹ ( |r''(x)|^{2/5} / [f(x)]^{4/5} ) dx × [ ∫₀¹ dx/f(x) ]^{-4/5} { ∫₀¹ [r''(x)]² dx }^{-1/5}.

Examples in Muller and Stadtmuller (1987) show that this limiting ratio can be as small as 1/2. As a practical matter, knowing that the optimal h has the form (3.2) is not of immediate value since it depends on the unknown function r. Whether one uses a constant or variable bandwidth estimator, it will be necessary to estimate r'' in order to infer a mean squared error optimal bandwidth. We will discuss a method of estimating derivatives of regression functions in Section 3.2.6. Muller and Stadtmuller (1987) proposed a
3.2. Mean Squared Error of Gasser-Muller Estimators
3. Statistical Properties of Smoothers
method for estimating (3.8) and showed by means of simulation that their data-based variable bandwidth estimator can yield smaller MISE than a data-based constant bandwidth estimator. The dependence of MISE on the design density f raises the question of optimal design. Muller (1984) addresses this issue and derives design densities that asymptotically optimize MISE for constant and variable bandwidth Gasser-Muller estimators. Interestingly, the optimal design density of a constant bandwidth estimator does not depend on local features of r, whereas that of a variable bandwidth estimator does.
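As an illustration of the limiting MISE ratio given above, the following snippet evaluates it for the hypothetical case r(x) = sin(2πx) with a uniform design density (choices made purely for illustration, not taken from the text):

```python
import numpy as np

# Hypothetical example: r(x) = sin(2*pi*x), uniform design density f = 1.
x = np.linspace(0.0, 1.0, 100001)
r2 = -4.0 * np.pi**2 * np.sin(2.0 * np.pi * x)   # r''(x)
f = np.ones_like(x)                              # design density f(x)

def integral(g):
    # trapezoid rule on the grid x
    return float(np.sum(0.5 * (g[:-1] + g[1:]) * np.diff(x)))

# variable-bandwidth MISE over constant-bandwidth MISE, in the limit
num = integral(np.abs(r2) ** 0.4 / f ** 0.8)
den = integral(1.0 / f) ** 0.8 * integral(r2 ** 2) ** 0.2
ratio = num / den   # roughly 0.92 here; the ratio is always <= 1
```

For this smooth, slowly varying r the gain from a variable bandwidth is modest; the ratio approaches 1/2 only for functions whose curvature varies much more sharply.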
3.2.6 Estimating Derivatives
We have seen that estimation of r'' is necessary if one wishes to infer a mean squared error optimal bandwidth for estimating r. Also, in some applications derivatives of the regression function are of at least as much interest as the function itself. For example, in growth studies the derivative of height or weight is important in determining growth spurts and times at which height or weight are changing rapidly. An interesting example of the use of kernel methods in growth studies may be found in Gasser et al. (1984). A Gasser-Muller type kernel estimator of r^{(ν)}

Φ(z_{α/2} − b) + Φ(z_{α/2} + b) − 1, which is less than the nominal 1 − α. A number of different approaches have been proposed to deal with the problems inherent in the naive interval (3.22). The most obvious fix is to select h in such a way that B_{nh} → 0. Doing so leads to an interval with asymptotic coverage probability equal to the nominal 1 − α. The only problem with selecting h in this way is that it will undersmooth relative to a bandwidth that minimizes mean, or mean integrated, squared error. This will lead to a confidence interval whose length is greater than that of interval (3.22) for all n sufficiently large. Furthermore, we are forced into the awkward position of centering our confidence interval at a different and less efficient estimator than the one we would ideally use as our point estimator of r(x). Another possibility is to estimate the quantity B_{nh} and account for it explicitly in constructing the confidence interval. Suppose we are willing to assume the conditions of Corollary 3.1. Then by taking h to be of the form Cn^{-1/5}, B_{nh} is asymptotic to
a constant B_{c,n} (depending on C, σ, r''(x) and f(x)), and

( r̂_h(x) − r(x) ) / √Var[r̂_h(x)] − B_{c,n} →_D N(0, 1).
Estimation of B_{c,n} requires estimation of r''(x), which may be done using a kernel estimate as in Section 3.2.6. If B̂_{c,n} has the same form as B_{c,n} but with r''(x) and σ replaced by consistent estimators, then

(3.23)  r̂_h(x) − B̂_{c,n} √Var[r̂_h(x)] ± z_{α/2} √Var[r̂_h(x)]
is an asymptotically valid (1 − α)100% confidence interval. Härdle and Bowman (1988) provide details of a bootstrap approach to obtaining a bias-adjusted interval as in (3.23). Such methods have the practical difficulty of requiring another choice of smoothing parameter, i.e., for r''(x). More fundamentally, the interval (3.23) has been criticized via the following question: "If one can really estimate r''(x), then why not adjust r̂_h(x) for bias and thereby obtain a better estimate of r(x)?" A less common, but nonetheless sensible, approach is to simply admit that E[r̂_h(x)], call it r_{nh}(x), is the estimable part of r(x), and to treat r_{nh}(x) as the parameter of interest. Theorem 3.6 may then be used directly to obtain an asymptotically valid confidence interval for r_{nh}(x). Typically, r_{nh} is a "shrunken" version of r, i.e., a version of r that has more rounded peaks and less dramatic valleys than r itself. It follows that if one is more interested in the shape of r, as opposed to the actual size of the function values r(x), then treating r_{nh} as the function of interest is not an unreasonable thing to do. Using ideas of total positivity one can make more precise the correspondence between r and r_{nh}. When r is piecewise smooth, the expected value of the Gasser-Muller kernel estimator is

(3.24)  E[r̂_h(x)] = r_h(x) + O(n^{-1}),

where

r_h(x) = ∫₀¹ (1/h) K((x − u)/h) r(u) du.
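The undercoverage caused by ignoring bias, noted earlier in this section, is easy to compute: if the standardized bias is b, the naive interval's asymptotic coverage is Φ(z_{α/2} − b) + Φ(z_{α/2} + b) − 1. A quick numerical check (illustrative Python):

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

z = 1.959964  # z_{alpha/2} for alpha = 0.05
# coverage of the naive interval when the standardized bias equals b
coverages = {b: Phi(z - b) + Phi(z + b) - 1.0 for b in (0.0, 0.25, 0.5, 1.0)}
# b = 0 gives the nominal 0.95; any b > 0 gives strictly less
```

Already at b = 1 the actual coverage drops to roughly 0.83, which is why the bias term cannot simply be ignored.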
Whenever nh → ∞ and h → 0, expression (3.24) implies that the naive interval (3.22) is asymptotically valid for r_h(x). Now, the function r_h is the convolution of r with h^{-1}K(·/h). Karlin (1968, p. 326) shows that if K is strictly totally positive, then r_h has no more modes than r. This in turn means that, for all h sufficiently small, r_h and r have the same number of modes. These considerations provide some assurance, at least for large n, that interval (3.22) is valid for a function having features similar to those of r. The Gaussian density is an example of a totally positive kernel. Silverman (1981) exploited this property of the Gaussian kernel in proposing a test for the number of modes of a probability density function. Other ideas related to inferring the number of peaks of a function are considered in Hart (1984), Donoho (1988) and Terrell and Scott (1985). Bayesian motivated confidence intervals with desirable frequentist properties have been proposed by Wahba (1983). Wahba's intervals have the form r̂(x) ± z_{α/2} w(x), where r̂(x) is a smoothing spline and w²(x) is a statistic that tends to be closer to the mean squared error of r̂(x) than to
3.5. Large-Sample Confidence Intervals
Var(r̂(x)). The latter property implies that Wahba's interval tends to have higher coverage probability than does a naive interval as in (3.22). Nychka (1988) established another interesting frequentist property of Wahba's intervals. Suppose one computes these intervals at each of the n design points, using the same nominal error probability of α at each x_i. Nychka shows that the average coverage probability of these n intervals tends to be quite close to 1 − α. Perhaps of more interest than confidence intervals for selected values r(x) are simultaneous confidence bands for the entire function r. A number of methods have been proposed for constructing such bands. These include the proposals of Knafl, Sacks and Ylvisaker (1985), Hall and Titterington (1988), Härdle and Bowman (1988), Li (1989), Härdle and Marron (1991) and Eubank and Speckman (1993). The same issue arises in construction of confidence bands as was encountered in pointwise confidence intervals; namely, one must take into account the bias of a nonparametric smoother in order to guarantee validity of the corresponding interval(s). Each of the previous references provides a means of dealing with this issue. A situation in which the bias problem can be fairly easily dealt with is when one wishes to use probability bands to test the adequacy of a parametric model. When a parametric model holds, the bias of a nonparametric smoother depends at most upon the parameters of the model, and hence can be readily estimated. We now illustrate how probability bands can be constructed under a parametric model by means of an example. Suppose that we wish to test the adequacy of a straight line model for the regression function r. In other words, the null hypothesis of interest is
H₀ : r(x) = θ₀ + θ₁x,  0 ≤ x ≤ 1,
for unknown parameters θ₀ and θ₁. Define

δ̂(x) = r̂(x) − θ̂₀ − θ̂₁x,  0 ≤ x ≤ 1,

where r̂ is, say, a boundary-corrected Gasser-Muller smooth and θ̂₀ and θ̂₁ are the least squares estimates of θ₀ and θ₁, respectively. The variance of δ̂(x) has the form σ²s²(x) for a known function s(x). We may determine a constant c_α such that

P_{H₀}( max_{1≤i≤n} |δ̂(x_i)| / (σ̂ s(x_i)) ≤ c_α ) = 1 − α,
where σ̂² is an estimator of σ² = E(ε₁²) and P_{H₀} denotes that the probability is computed under H₀. The constants −c_α and c_α form simultaneous 1 − α probability bounds for the statistics δ̂(x_i)/(σ̂ s(x_i)), i = 1, ..., n. We may thus reject H₀ at level of significance α if any observed value strays outside these bounds. Let us suppose that the error terms ε_i are i.i.d. N(0, σ²). We may then approximate a P-value for the test described above by using simulation. When H₀ is true and r̂ is a Gasser-Muller smooth, it turns out that, to
first order, r̂(x) − θ̂₀ − θ̂₁x depends only upon the ε_i's and not upon θ₀ or θ₁ (Hart and Wehrly, 1992). In conducting a simulation one may thus take θ₀ = θ₁ = 0 without loss of generality. Letting m_n denote the observed value of max_{1≤i≤n} |δ̂(x_i)|/(σ̂ s(x_i)),

FIGURE 3.1. Testing the Fit of a Straight Line Model by Simulating Probability Bands.
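The simulation just described can be sketched as follows (illustrative Python, not the author's code; a Nadaraya-Watson smooth stands in for the boundary-corrected Gasser-Muller smooth, and the known function s(x) is computed from the linear weights of the smoother and of the least squares fit):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = (np.arange(1, n + 1) - 0.5) / n

# Smoother matrix S of a Nadaraya-Watson smooth (Epanechnikov kernel,
# h = 0.2) -- an illustrative stand-in for the smoother in the text.
h = 0.2
U = (x[:, None] - x[None, :]) / h
Kmat = np.where(np.abs(U) <= 1.0, 0.75 * (1.0 - U**2), 0.0)
S = Kmat / Kmat.sum(axis=1, keepdims=True)

# Hat matrix H of the least squares straight-line fit
X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T

A = S - H                            # delta-hat(x_i) = (A @ y)_i
s = np.sqrt(np.sum(A * A, axis=1))   # Var[(A @ y)_i] = sigma^2 * s_i^2

def band_stat(y):
    sigma2 = np.sum((y - H @ y) ** 2) / (n - 2)
    return np.max(np.abs(A @ y) / (np.sqrt(sigma2) * s))

# Null distribution by simulation; theta_0 = theta_1 = 0 w.l.o.g.
# (to first order the statistic depends only on the errors)
null_stats = np.array([band_stat(rng.standard_normal(n)) for _ in range(1000)])
c_alpha = np.quantile(null_stats, 0.95)

# Apply the test to data from a hypothetical quadratic departure
y_obs = 2.0 * (x - 0.5) ** 2 + rng.normal(0.0, 0.1, n)
p_value = np.mean(null_stats >= band_stat(y_obs))
```

The statistic is scale-free, so the null simulation can use unit-variance errors regardless of the true σ.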
3.5. Large-Sample Confidence Intervals
83
on an Epanechnikov kernel and a bandwidth of .2. In the lower panel of Figure 3.1 are empirical 95% probability bands for δ̂(x)/σ̂ that were obtained by simulating data from the null model with θ₀ = θ₁ = 0. The variance estimator used was σ̂² = Σ_{i=1}^{n} (Y_i − θ̂₀ − θ̂₁x_i)² / (n − 2). The
bands are such that only fifty of one thousand simulated curves strayed outside the shaded region in the graph. The line in the bottom panel is δ̂(x)/σ̂ for the data in the top panel. The value of max_{1≤i≤n} |δ̂(x_i)|/(σ̂ s(x_i))

FIGURE 4.1. Maternal Serum Alphafetoprotein Data. Each curve is a local linear smooth. The smoothing parameter of the solid curve was chosen by OSCV, and that of the dotted curve by the plug-in method. The dashed curve's smoothing parameter was chosen by each of CV, GCV and estimated MASE.
estimates are similar in the middle of the data, but the plug-in estimate is somewhat undersmoothed near the edges of the x-interval. Furthermore, the plug-in method requires the choice of h₀ and b and appears to be quite sensitive to changes in b. By contrast, OSCV requires only a choice for m, and the author has found the method to be insensitive to this choice. Of course, the CV, GCV and estimated MASE bandwidths are also more objective than the plug-in method, but, as we will see in the next section, they are more variable than the OSCV bandwidth.
4.3 Theoretical Properties of Data-Driven Smoothers
Our most detailed treatment of the theory of data-driven smoothing parameters is in the setting of kernel smoothing (Sections 4.3.1 and 4.3.2).
4. Data-Driven Choice of Smoothing Parameters
FIGURE 4.2. Cross-Validation Curves for Maternal Serum Alphafetoprotein Data.
Data-driven choice of the truncation point for Fourier series estimators will be discussed in Section 4.3.3.
4.3.1 Asymptotics for Cross-Validation, Plug-In and Hall-Johnstone Methods
Seminal work in the kernel smoothing case has been done by Rice (1984b) and Härdle, Hall and Marron (HHM) (1988). In this section we shall describe the results of HHM, since doing so will facilitate our theoretical discussion of the methods encountered in Section 4.2. Let r̂_h be a kernel estimator of Priestley-Chao type. Some important insight is gained by investigating how data-driven bandwidths behave relative to ĥ₀, the minimizer of the average squared error

ASE(h) = n^{-1} Σ_{i=1}^{n} ( r̂_h(x_i) − r(x_i) )².
In the parlance of decision theory, ASE is a loss function and MASE the corresponding risk function. For a specific set of data it seems more desirable to use the bandwidth that actually minimizes ASE, rather than ASE on the average. This point of view is tantamount to the Bayesian principle that says it is more sensible to minimize posterior risk than frequentist risk.
See Jones (1991) for a more comprehensive discussion of the MASE versus ASE controversy. HHM provide results on the asymptotic distribution of ĥ − ĥ₀, where ĥ is a data-driven choice for the bandwidth of r̂_h. The assumptions made by HHM are summarized as follows:
1. The design points in model (2.1) are x_i = i/n, i = 1, ..., n.
2. The regression function r has a uniformly continuous, integrable second derivative.
3. The error terms ε_i are i.i.d. with mean 0 and all moments finite.
4. The kernel K of r̂_h is a compactly supported probability density that is symmetric about 0 and has a Hölder continuous second derivative.
In addition we tacitly assume that boundary kernels are used to correct edge effects (Hall and Wehrly, 1991). Otherwise we would have to incorporate a taper function into our definition of the cross-validation and ASE curves to downweight the edge effects. Let ĥ_CV be the minimizer of the cross-validation curve over an interval of bandwidths of the form H_n = [n^{-1+δ}, n], δ > 0. Also, denote by h₀ the minimizer of MASE(h) for h ∈ H_n. Under conditions 1-4 HHM prove the following results:

(4.3)  n^{3/10}(ĥ_CV − ĥ₀) →_D N(0, σ₁²)

and

(4.4)  n^{3/10}(h₀ − ĥ₀) →_D N(0, σ₂²)

as n → ∞, where σ₁² and σ₂² are positive. Results (4.3) and (4.4) have a number of interesting consequences. First, recall from Chapter 3 that h₀ ~ C₀n^{-1/5}. This fact and results (4.3) and (4.4) imply that

(4.5)  ĥ_CV/ĥ₀ − 1 = O_p(n^{-1/10})  and  h₀/ĥ₀ − 1 = O_p(n^{-1/10}).
A remarkable aspect of (4.5) is the extremely slow rate, n^{-1/10}, at which ĥ_CV/ĥ₀ and h₀/ĥ₀ tend to 1. In parametric problems we are used to the much faster rate of n^{-1/2}. As discussed above, it is arguable that the distance |ĥ_CV − ĥ₀| is more relevant than |ĥ_CV − h₀|. With this in mind, an interesting aspect of (4.5) is that the cross-validation bandwidth and the MASE optimal bandwidth differ from ĥ₀ by the same order in n. Hence, perfect knowledge of the MASE optimal bandwidth gets one no closer to ĥ₀ (in rate of convergence terms) than does the cross-validation bandwidth, which is data driven! If one adopts ASE rather than MASE as an optimality criterion, this makes one wonder if the extremely slow rate of n^{-1/10} is an inherent part of the bandwidth selection problem. In fact, Hall and Johnstone (1992) show that,
in a minimax sense, the quantity

(ĥ − ĥ₀)/ĥ₀

never converges to 0 faster than n^{-1/10}, where ĥ is any statistic. Knowing that (ĥ_CV − ĥ₀)/ĥ₀ converges to 0 at the optimal rate, it is natural to consider how E(ĥ_CV − ĥ₀)² compares with the analogous quantity for other data-driven bandwidths that also converge at the best rate. For commonly used kernels HHM point out that σ₁ ≈ 2σ₂, implying that h₀ tends to be closer to ĥ₀ in absolute terms than does ĥ_CV. This suggests the intriguing possibility that a sufficiently good estimator of h₀ will usually be closer to ĥ₀ than is ĥ_CV. Let us now consider the GKK (1991) plug-in bandwidth ĥ_PI, which is founded on estimation of h₀. We have

ĥ_PI − ĥ₀ = (ĥ_PI − h₀) + (h₀ − ĥ₀),

implying that ĥ_PI − ĥ₀ will have the same asymptotic distribution as h₀ − ĥ₀ as long as ĥ_PI − h₀ is o_p(n^{-3/10}). GKK show that

ĥ_PI − h₀ = O_p(n^{-2/5}) = o_p(n^{-3/10}),

and hence

n^{3/10}(ĥ_PI − ĥ₀) →_D N(0, σ₂²).
Asymptotically, then, the plug-in bandwidth of GKK performs better than the cross-validated one in the sense that E(ĥ_PI − ĥ₀)² ≈ .25 E(ĥ_CV − ĥ₀)² for commonly used kernels and all n sufficiently large. One way of explaining the behavior of ĥ_CV − ĥ₀ is to consider the representation

ĥ_CV − ĥ₀ = (ĥ_CV − h₀) − (ĥ₀ − h₀).

Rice (1984b) was the first to show that n^{3/10}(ĥ_CV − h₀) →_D N(0, σ²_CV) for σ²_CV > 0. It follows that, asymptotically, ĥ_CV has infinitely larger mean squared error in estimating h₀ than does ĥ_PI. Furthermore, (4.4) and (4.5) imply that

(4.6)  E(ĥ_CV − ĥ₀)² ≈ Var(ĥ_CV) + Var(ĥ₀) − 2 Cov(ĥ_CV, ĥ₀).

Expression (4.6) entails that a major factor in the large variability of ĥ_CV − ĥ₀ is the fact that ĥ_CV and ĥ₀ are negatively correlated (Hall and Johnstone, 1992). In other words, ĥ_CV has the following diabolical property: For data sets that require more (respectively, less) smoothing than average, cross-validation tends to indicate that less (respectively, more) smoothing is required.
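For reference, the ordinary leave-one-out cross-validation criterion under discussion can be sketched as follows (illustrative Python; a Nadaraya-Watson smooth stands in for the Priestley-Chao estimator, and the data are simulated under hypothetical choices of r and σ):

```python
import numpy as np

def cv_score(h, x, y):
    """Leave-one-out cross-validation score for a Nadaraya-Watson smooth
    with the Epanechnikov kernel (an illustrative stand-in for the
    Priestley-Chao estimator discussed in the text)."""
    U = (x[:, None] - x[None, :]) / h
    W = np.where(np.abs(U) <= 1.0, 0.75 * (1.0 - U**2), 0.0)
    np.fill_diagonal(W, 0.0)        # delete (x_i, Y_i) when predicting at x_i
    denom = W.sum(axis=1)
    pred = (W @ y) / np.where(denom > 0.0, denom, np.nan)
    return float(np.nanmean((y - pred) ** 2))

rng = np.random.default_rng(2)
n = 100
x = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.3, n)

grid = np.linspace(0.02, 0.5, 49)
scores = [cv_score(h, x, y) for h in grid]
h_cv = float(grid[int(np.argmin(scores))])   # the cross-validation bandwidth
```

Rerunning this with fresh noise shows directly how variable ĥ_CV is from sample to sample.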
An obvious question at this point is, "Can we find a data-driven bandwidth, say h̃, for which E(h̃ − ĥ₀)² < E(ĥ_PI − ĥ₀)²?" The answer is yes, at least under sufficient regularity conditions. Hall and Johnstone (1992) find a lower bound on the limit of

n^{6/10} E(ĥ − ĥ₀)²,

where ĥ is any statistic. Let ĥ_E be the bandwidth (4.1) with an efficient estimator Ĵ of J; Hall and Johnstone (1992) show that lim_{n→∞} n^{6/10} E(ĥ_E − ĥ₀)² equals the lower bound. Purely from the standpoint of asymptotic mean squared error theory, this ends the search for the ideal bandwidth selector; however, we shall have more to say on the notion of "ideal" in Section 4.5. To this point we have not discussed any theoretical properties of bandwidths, ĥ_R, selected by the risk estimation method of Section 4.2.2. HHM show that the asymptotic distribution of ĥ_R − ĥ₀ is the same as that of ĥ_CV − ĥ₀; hence, all the conclusions we have drawn about large sample behavior of cross-validation are also valid for risk estimation. Of course, asymptotics are not always an accurate indicator of what happens in finite-sized samples. Rice (1984b) shows by simulation that various asymptotically equivalent bandwidth selectors behave quite differently in small samples. It is important to point out that to first order the asymptotic ASEs of all the methods discussed in this section are the same. In other words, if ĥ is any of the bandwidth selectors discussed, we have
ASE(ĥ)/ASE(ĥ₀) →_p 1

as n → ∞. The results discussed in this section nonetheless have relevance for second order terms in the ASE. Note that

ASE(ĥ) ≈ ASE(ĥ₀) + (1/2)(ĥ − ĥ₀)² ASE''(ĥ₀),

where we have used the fact that ASE'(ĥ₀) = 0. Hall and Johnstone (1992) define the risk regret by E[ASE(ĥ)] − E[ASE(ĥ₀)] and show that

E[ASE(ĥ)] − E[ASE(ĥ₀)] = (1/2) MASE''(h₀) E(ĥ − ĥ₀)² + r_n,

where r_n is negligible relative to MASE''(h₀) E(ĥ − ĥ₀)². The ratio of risk regrets, or relative risk regret, for two bandwidth selectors ĥ₁ and ĥ₂ is thus asymptotic to

E(ĥ₁ − ĥ₀)² / E(ĥ₂ − ĥ₀)².
In this way we see that results on E(ĥ − ĥ₀)² relate directly to the question of how well the corresponding data-driven smoother estimates the underlying regression function. Hall and Johnstone (1992) provide some numerical results on risk regret for cross-validation, plug-in and their efficient method.
4.3.2 One-Sided Cross-Validation
A detailed theoretical analysis of OSCV has been carried out by Yi (1996). Here we shall only summarize some salient aspects of the theory. Our main purpose in this section is to show that dramatic reductions in bandwidth variance are attainable with one-sided cross-validation. Following Chiu (1990), we assume that x_i = (i − 1)/n, i = 1, ..., n, and use a "circular" design in which the data are extended periodically, i.e., for i = 1, ..., n, x_{−(i−1)} = −i/n, x_{n+i} = 1 + (i − 1)/n, Y_{−(i−1)} = Y_{n−i+1} and Y_{n+i} = Y_i. The results in this section pertain to kernel estimators that are applied to the extended data set of size 3n. In the notation of Section 4.2.5, the estimator r̂_h is

r̂_h(x) = (1/(nh)) Σ_{i=−n+1}^{2n} K((x − x_i)/h) Y_i,  0 ≤ x ≤ 1,
where 0 < h ≤ 1 and K is a second order kernel with support (−1, 1). For the estimator r̂_b we use

r̂_b(x) = (1/(nb)) Σ_{i=−n+1}^{2n} L((x − x_i)/b) Y_i,  0 ≤ x ≤ 1,
where 0 < b ≤ 1 and L is a second order kernel with support (0, 1). Note that the estimator r̂_b(x) uses only data for which x_i ≤ x. Use of the circular design, along with the assumption that r(0) = r(1) and r'(0+) = r'(1−), eliminates boundary effects. Near the end of this section we will indicate why the forthcoming theoretical results appear to be relevant for certain local linear estimators as well. We begin by defining some notation. Let

MASE(h) = n^{-1} Σ_{i=1}^{n} E( r̂_h(x_i) − r(x_i) )²,

define h₀ to be the minimizer of MASE(h), and let b₀ denote the minimizer of the analogous MASE of r̂_b.
The bandwidths ĥ_CV and b̂_CV minimize the cross-validation curves for the estimators r̂_h and r̂_b, respectively, and

ĥ_OSCV = (C_K/C_L) b̂_CV,

where, for a given function f,

C_f = [ ∫ f²(u) du / ( ∫ u² f(u) du )² ]^{1/5}.

Define also the functionals J_f = ∫ f²(u) du and B_f = ∫ u² f(u) du (when they exist).
Finally, define U^L_{jn}(b) and U^K_{jn}(h) by

U^L_{jn}(b) = (1/(nb)) Σ_r L( r/(nb) ) exp( −2πijr/n ),  j = 1, ..., [n/2],

and

U^K_{jn}(h) = (1/(nh)) Σ_r K( r/(nh) ) cos( 2πjr/n ),  j = 1, ..., [n/2].
Throughout this section we assume that the following conditions hold. (These are the same assumptions as in Chiu, 1990, plus conditions on L.)
1. The errors ε₁, ε₂, ... are independent random variables with mean 0, variance σ² and finite cumulants of all orders.
2. The function r is such that r(0) = r(1), r'(0+) = r'(1−) and r'' satisfies a Lipschitz condition of order greater than 1/2.
3. The kernel K is a symmetric probability density function with support (−1, 1) and K'' is of bounded variation.
4. The kernel L is a second order kernel with support (0, 1). In addition, L satisfies the following:
   • L and L' are continuous on (0, 1),
   • L(0) and L'(0+) are finite, and
   • L'' is of bounded variation on [0, 1], where L''(0) is defined to be L''(0+).
5. The ordinary and one-sided cross-validation curves are minimized over an interval of bandwidths of the form [C^{-1}n^{-1/5}, Cn^{-1/5}], where C is arbitrarily large but fixed.
Chiu (1990) obtains the following representation for ĥ_CV:

(4.7)  n^{3/10}(ĥ_CV − h₀) = −n^{3/10} B_K C³_{r,σ} Σ_{j=1}^{[n/2]} (V_j − 2) W^K_{jn}(h₀) + o_p(1),

where V₁, V₂, ... are i.i.d. χ²₂ random variables,

C_{r,σ} = ( σ² / ∫₀¹ r''(x)² dx )^{1/5}

and

W^K_{jn}(h) = ∂/∂h [1 − U^K_{jn}(h)]²,  j = 1, ..., [n/2].
Similarly, Yi (1996) has shown that

(4.8)  n^{3/10}(ĥ_OSCV − h₀) = −n^{3/10} (C_K/C_L) B_L C³_{r,σ} Σ_{j=1}^{[n/2]} (V_j − 2) W^L_{jn}(b₀) + o_p(1),

where

W^L_{jn}(b) = ∂/∂b |1 − U^L_{jn}(b)|²,  j = 1, ..., [n/2].
Hence, both ĥ_CV and ĥ_OSCV are approximately linear combinations of independent χ²₂ random variables. It is worth pointing out that the only reason (4.8) is not an immediate consequence of Chiu's work is that the kernel L does not satisfy Chiu's conditions of being continuous and symmetric about 0. We wish L to have support (0, 1) and to be discontinuous at 0, since such kernels are ones we have found to work well in practice. The theoretical development of Chiu (1990) relies upon the cross-validation curve being differentiable. Fortunately, differentiability of the OSCV curve is guaranteed when L is differentiable on (0, 1]; the fact that L is discontinuous at 0 does not affect the smoothness of the OSCV curve. It turns out, then, that Chiu's approach may be applied to n^{3/10}(ĥ_OSCV − h₀) without too many modifications. The main difference in analyzing the cross-validation and OSCV bandwidths lies in the fact that, unlike U^K_{jn}(h), the Fourier transform U^L_{jn}(b) is complex-valued. Representations (4.7) and (4.8) allow one to compare the asymptotic variances of ĥ_CV and ĥ_OSCV. Define the following asymptotic relative efficiency:

ARE(K, L) = lim_{n→∞} E(ĥ_OSCV − h₀)² / E(ĥ_CV − h₀)².
Expressions (4.7) and (4.8) imply that ARE(K, L) = lim_{n→∞} ARE_n(K, L), where ARE_n(K, L) denotes the corresponding ratio of the variances of the two linear representations at sample size n. The ratio ARE_n(K, L) has been computed for several values of n using the quartic kernel for K and the following choices for L, each of which has support (0, 1):
L₁(u) = 140u³(1 − u)³(10 − 18u),
L₂(u) = 30u²(1 − u)²(8 − 14u),
L₃(u) = 6u(1 − u)(6 − 10u),
L₄(u) = (5.925926 − 12.96296u)(1 − u²)²,
L₅(u) = (1 − u²)(6.923077 − 23.076923u + 16.153846u²).
It turns out that the limit of ARE_n is independent of the regression function r, and so the values of h₀ and b₀ were taken to be n^{-1/5} and (C_L/C_K)n^{-1/5}, respectively. The results are given in Table 4.1. The most interesting aspect of Table 4.1 is the dramatic reduction in bandwidth variation that results from using kernels L₄ and L₅. Use of L₅ leads to an almost twenty-fold reduction in asymptotic variance as compared to ordinary cross-validation. Another interesting result is that the relative efficiencies decrease as the kernel L becomes less smooth at 0. Better efficiency is obtained from using the two kernels that have L(0) > 0. The relative efficiencies are smallest for L₅, which is such that L₅'(0+) = −23.08 < −12.96 = L₄'(0+). The other three choices for L are shown by Müller (1991) to be smooth, "optimum" boundary kernels. Each of these three is continuous at 0 (i.e., L(0) = 0). The kernel L₂ is smoother than L₃ in the sense that L₃'(0+) ≠ 0 while L₂'(0) = 0. Kernel L₁ is smoother still since it has L₁'(0) = L₁''(0) = 0.
TABLE 4.1. Relative Efficiencies of One-Sided to Ordinary Cross-Validation. Each number in the body of the table is a value of ARE_n(K, L) for K equal to the quartic kernel.

  n      L1     L2     L3     L4     L5
  50     1.732  1.296  1.303  .1719  .1039
  150    2.165  1.899  1.811  .0469  .0389
  300    2.197  1.936  1.667  .1089  .0456
  600    2.202  1.939  1.811  .1001  .0627
  1200   2.202  1.940  1.755  .1004  .0561
  2400   2.202  1.939  1.768  .1006  .0558
The relative efficiencies in Table 4.1 suggest the possibility of further improvements in efficiency. For a given K and under general conditions on L, Yi (1996) has shown that

lim_{n→∞} n^{3/5} Var(ĥ_OSCV) = C²_K C⁶_{r,σ} F_L,
where F_L is a functional of L defined in terms of

A_L(u) = ∫₀¹ L(x) cos(2πux) dx  and  B_L(u) = ∫₀¹ L(x) sin(2πux) dx.
Subject to the constraint that L is a second order kernel, one could use calculus of variations to determine an L that minimizes F_L. Note that the asymptotically optimal L does not depend on K. Another, perhaps more relevant, optimality problem would be to find the L that minimizes

V(K, L) = lim_{n→∞} n^{3/5} E(ĥ_OSCV − ĥ_{0K})²,

in which ĥ_{0K} is the minimizer of ASE(h) = n^{-1} Σ_{i=1}^{n} ( r̂_h(x_i) − r(x_i) )². Let V(K, K) denote lim_{n→∞} n^{3/5} E(ĥ_CV − ĥ_{0K})², where ĥ_CV is the ordinary cross-validation bandwidth for r̂_h. Yi (1996) has shown that V(K, L) < V(K, K) for various choices of K and L. As before, one could employ calculus of variations to try to determine an L that minimizes V(K, L). It turns out in this case that the optimal choice for L depends on K. It seems clear that representations paralleling (4.7) and (4.8) can be established for local linear smoothers. Suppose that r̂_h is a local linear smoother that uses the quartic kernel. Apart from boundary effects and assuming that the x_i's are fixed and evenly spaced, this estimator is essentially the same as a Priestley-Chao type quartic-kernel estimator. Likewise, the one-sided local linear estimator using a quartic kernel is essentially the same as the kernel estimator with kernel L(u) = (5.926 − 12.963u)(1 − u²)² I_{(0,1)}(u) (Fan, 1992). It is thus anticipated that the relative efficiencies in the "L₄" column of Table 4.1 will closely match those for quartic-kernel local linear estimators. Some insight as to why OSCV works better than ordinary cross-validation is gained by considering MASE curves. In a number of cases the author has noticed that the MASE curve for a one-sided estimator tends to have a more well-defined minimum than the MASE of an ordinary, or two-sided, estimator. This is illustrated in Figure 4.3, where we have plotted MASE curves of ordinary and one-sided local linear estimators that use an Epanechnikov kernel. Letting b denote the bandwidth of the one-sided estimator, that estimator's MASE is plotted against h = Cb, where C is such
FIGURE 4.6. Scatter Plots of ASE From Simulation Study. Each plot is of ASEs for data-driven bandwidths vs. ASEs of optimal bandwidths in the case r = r₂, σ = 1/512 and n = 150. The top, middle and bottom graphs correspond to ordinary cross-validation, OSCV and plug-in, respectively.
4.5. Discussion
FIGURE 4.7. MASE Curves for One-Sided Local Linear Estimators. The solid line is the MASE curve for a one-sided local linear estimator (with Epanechnikov kernel) in the case r = r₅, σ = 1/128 and n = 50. The dotted line is the MASE curve in the same case except σ = 1/512. When σ is larger it is clear that it is harder to distinguish between optimal and oversmoothed estimates.
upon in terms of asymptotic efficiency. The version of one-sided cross-validation used in the simulation of Section 4.4 lies in between ordinary cross-validation and the plug-in method in terms of efficiency. Although OSCV is somewhat less efficient than plug-in and Hall-Johnstone, it has the advantage of being more rough and ready and more objective than those two methods. OSCV does not require estimation of any derivatives, nor does it depend on parameters that must be fixed in an arbitrary way. Furthermore, as noted in Section 4.3.2, there exists the possibility that some version of OSCV will be both objective and fully efficient. We have discussed the application of data-driven smoothing methods only in the case of regression with but a single independent variable. Both cross-validation and plug-in methods can be applied in more complicated settings. For example, Vieu (1991) has applied a local version of cross-validation to choose the bandwidth function of a variable bandwidth kernel estimator, and Müller and Stadtmüller (1987) have proposed a plug-in rule for the same problem. Fan and Gijbels (1995) propose a data-driven method for choosing a variable bandwidth local polynomial estimator. Ruppert,
116
4. Data-Driven Choice of Smoothing Parameters
Sheather and Wand (1995) consider plug-in bandwidth rules that may be used for multiple regression as well as the univariate-x case. Use of plug-in rules for estimating additive regression models has been investigated by Opsomer and Ruppert (1996).
5
Classical Lack-of-Fit Tests
5.1 Introduction We now turn our attention to the problem of testing the fit of a parametric regression model. Ultimately, our purpose is to show how the nonparametric smoothing methods encountered in the previous three chapters can be useful in this regard. We begin, however, by considering some classical methods for checking model fit. This is done to provide some historical perspective and also to facilitate comparisons between smoothing-based and classical methods. We shall continue to use the following scenario as our canonical regression model:
(5.1)    Y_i = r(x_i) + ε_i,    i = 1, ..., n,
where the x_i's are fixed design points with 0 < x_1 < ··· < x_n < 1. We assume that ε_1, ..., ε_n are independent random variables with E(ε_i) = 0 and Var(ε_i) = σ² < ∞, i = 1, ..., n. Sometimes we will add the assumption that each ε_i has a Gaussian distribution. For our purposes the principal aim in analyzing the data (x_1, Y_1), ..., (x_n, Y_n) is to learn about the relationship between x and Y as it is expressed through the regression function r. In a parametric approach to inferring r, one assumes that

(5.2)    r ∈ S_θ = {r(· ; θ) : θ ∈ Θ},
where Θ is some subset of p-dimensional Euclidean space with p finite, and, for each θ ∈ Θ, r(· ; θ) is a function with domain [0, 1]. A nonparametric model is of the same form as (5.2) with the exception that the set Θ is infinite dimensional and has an infinite number of elements.
The term nonparametric is something of a misnomer, since nonparametric models actually have infinitely more parameters than parametric ones. Nonetheless, the term fits in the sense that a nonparametric analysis places less importance on inferring individual parameters, since no finite number of parameters characterizes the model. Of course, the real distinguishing feature of a nonparametric model is that it allows for a much wider range of possibilities than does a parametric one. A couple of examples will serve to contrast the two types of function classes. Consider quadratic functions defined on the interval [0, 1]:
The class

Q = {r : r(x) = θ_0 + θ_1 x + θ_2 x², 0 ≤ x ≤ 1, (θ_0, θ_1, θ_2) ∈ ℝ³}
is an example of a parametric class of functions. If we assume that r ∈ Q, then we know everything there is to know about r as soon as we know the values of the three parameters θ_0, θ_1 and θ_2. An example of a nonparametric class of functions is L²(0, 1), the class of square integrable functions on (0, 1). A given function r is in L² if and only if

Σ_{j=0}^∞ φ_r²(j) < ∞,

where φ_r(j) = ∫_0^1 cos(πjx) r(x) dx. We may thus define the parameter space Θ for L²(0, 1) to be the set of all sequences (θ_0, θ_1, ...) such that

Σ_{j=0}^∞ θ_j² < ∞.
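The square-summability criterion can be checked numerically for a concrete function. The sketch below is illustrative only (the choice r(x) = x, the grid size and the trapezoidal quadrature are mine, not the book's): it approximates the cosine coefficients φ_r(j) = ∫_0^1 cos(πjx) r(x) dx and confirms that the squared coefficients decay fast enough to be summable.

```python
import numpy as np

# Illustrative sketch: approximate phi_r(j) = integral_0^1 cos(pi*j*x) r(x) dx
# for the example function r(x) = x, and check numerically that
# sum_j phi_r(j)^2 converges, so that r belongs to L2(0, 1).
x = np.linspace(0.0, 1.0, 100001)
r = x.copy()          # example regression function r(x) = x
dx = x[1] - x[0]

def phi(j):
    # trapezoidal approximation to the integral
    f = np.cos(np.pi * j * x) * r
    return dx * (f[0] / 2 + f[1:-1].sum() + f[-1] / 2)

coefs = np.array([phi(j) for j in range(60)])
# For r(x) = x, integration by parts gives phi_r(0) = 1/2 and
# phi_r(j) = ((-1)**j - 1) / (pi*j)**2, so coefficients decay like j**-2.
tail = np.sum(coefs[30:] ** 2)   # the tail of the sum is already negligible
```

By contrast, a coefficient sequence decaying like j^{−1/2} would fail the criterion, so not every sequence (θ_0, θ_1, ...) corresponds to an element of L²(0, 1).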
In some cases the data analyst may be virtually certain that r is in a parametric class S_θ. For example, model (5.2) might be a consequence of the physical system from which the data are generated. The statement "r is in S_θ" is thus correct to the extent that the data analyst is correct in his assessment of the physical system. Whenever a parametric model is "known" to hold, the statistical problem boils down to inferring the parameter θ, which is the only aspect of r that is unknown.
Perhaps a more typical scenario is one in which the data analyst merely entertains a model of form (5.2). He has doubts about whether a particular model S_θ is correct and would like to have a statistical test of the null hypothesis that r is in S_θ. In the context of regression, such a test is usually referred to as a lack-of-fit test. An analogous procedure for testing whether a set of i.i.d. observations arises from a given class of probability distributions is more often called a goodness-of-fit test. In the remainder of this chapter we consider some classical lack-of-fit tests.
5.2 Likelihood Ratio Tests A fundamental approach to testing statistical hypotheses stems from the seminal work of Neyman and Pearson (1933) on testing a simple null hypothesis versus a simple alternative. Neyman and Pearson showed that in the simple vs. simple case the most powerful test of a given size rejects the null hypothesis for small values of a likelihood ratio. This result has led to the use of likelihood ratio tests in more general settings where the model is parametric and one or both of the hypotheses are composite.
5.2.1 The General Case
In order to apply a likelihood ratio test it is necessary that we know the joint distribution of the observations up to a finite number of unknown parameters. In our setting this entails that we have a model not only for the mean function r but also for the distribution of the errors ε_1, ..., ε_n. We might, for example, suppose that ε_1, ..., ε_n are independent and identically distributed from a density g(· ; φ), where g is known up to the vector of parameters φ. A familiar special case of this scenario is when the ε_i's are i.i.d. N(0, σ²), in which case φ consists of the single unknown parameter σ. More generally, one could simply assume that ε_1, ..., ε_n have some joint density g(u_1, ..., u_n; φ) depending on the unknown vector of parameters φ. If we also assume that r is in a parametric family S_θ (as in (5.2)), then the likelihood function of the data Y = (Y_1, ..., Y_n) is

L(θ, φ | Y) = g(Y_1 − r(x_1; θ), ..., Y_n − r(x_n; θ); φ).
Our interest is in testing the hypotheses

(5.3)    H_0 : θ ∈ Θ_0    vs.    H_a : θ ∈ Θ − Θ_0,

where Θ_0 is some subset of Θ. Suppose that the parameter φ is assumed to lie in a set Φ. Then the likelihood ratio test of hypotheses (5.3) rejects H_0 for small values of the test statistic

(5.4)    Λ_n = sup_{θ ∈ Θ_0, φ ∈ Φ} L(θ, φ | Y) / sup_{θ ∈ Θ, φ ∈ Φ} L(θ, φ | Y).
In order to carry out a test that has a specified level of significance, one must know the probability distribution of Λ_n when the null hypothesis is true. It is well known that when H_0 is true and the probability model satisfies certain regularity conditions, the statistic −2 log(Λ_n) converges in distribution to a random variable having the χ² distribution with p − p_0 degrees of freedom, where p_0 and p are the numbers of free parameters associated with Θ_0 and Θ, respectively. This suggests that one use as a
large sample test with nominal size α the test that rejects H_0 when

(5.5)    −2 log(Λ_n) ≥ χ²_{p−p_0, α},

where χ²_{p−p_0, α} is the (1 − α)100th percentile of the χ²_{p−p_0} distribution. Kendall and Stuart (1979) and Chernoff (1954) provide sufficient conditions such that the test with rejection region (5.5) is asymptotically valid.
In principle, the likelihood ratio test is applicable in a wide variety of cases. An elementary case is one in which the null hypothesis is simple. For example, one might assume that r has the form
r(x; θ) = θ_0 + θ_1 x,    0 ≤ x ≤ 1,

with Θ = {(θ_0, θ_1) : −∞ < θ_0, θ_1 < ∞}. If the null hypothesis of interest is

H_0 : r(· ; θ) is identical to 0,
then Θ_0 is the singleton set {(0, 0)}.
Nested regression models can also be dealt with via likelihood ratio tests. Suppose, for example, that our most general model for r is

(5.6)    r(x; θ) = exp(θ_0 + θ_1 x + θ_2 x²),    0 ≤ x ≤ 1,

with Θ = ℝ³. The hypotheses of interest might be

H_0 : log(r(· ; θ)) is a straight line

versus

H_a : log(r(· ; θ)) is a quadratic.

Here the straight line model is nested within the quadratic, and Θ_0 is the set of all 3-tuples of the form (θ_0, θ_1, 0).
A third setting to which the likelihood ratio test is applicable is when we wish to test one parametric family of functions against another, and neither family is nested within the other. One may be interested in, say, testing the null hypothesis that r is some quadratic against the alternative that r has the form (5.6). In this case we may write the regression function as

r(x; θ) = (θ_0 + θ_1 x + θ_2 x²) I_{{0}}(θ_3) + exp(θ_0 + θ_1 x + θ_2 x²) I_{{1}}(θ_3),

where the parameter space is

Θ = {(θ_0, θ_1, θ_2, θ_3) : θ_0, θ_1, θ_2 ∈ ℝ, θ_3 ∈ {0, 1}}

and Θ_0 = {(θ_0, θ_1, θ_2, θ_3) : θ_3 = 0}.
The non-nested case is an example of where the χ²_{p−p_0} approximation to the distribution of −2 log(Λ_n) is typically not valid. Cox (1962) proposed a slightly modified version of the likelihood ratio test for the non-nested case,
and White (1982) provided conditions under which the modified statistic is asymptotically normal. A treatment of subsequent developments in testing non-nested models may be found in Pace and Salvan (1990). The likelihood ratio test is obviously designed for situations where we wish to compare how well two specific, parametric models fit the data. By contrast, we might wish to test the fit of a given model without having in mind any particular alternative model. For example, we might simply like to know if the data provide any evidence that r deviates from a straight line, whatever kind of deviation that might be. We shall see later that nonparametric tests are often better suited for such cases than are likelihood ratio tests or other types of parametric tests.
5.2.2 Gaussian Errors
It is to our benefit to consider in some detail the likelihood ratio test in the case where the error terms ε_1, ..., ε_n are independent and identically distributed N(0, σ²) random variables. This situation is important not only because of the central role it has played in classical statistics but also because it leads to lack-of-fit tests that are very useful for non-normal data. When the ε_i's are i.i.d. N(0, σ²) (where σ is an unknown parameter), the likelihood ratio (5.4) is

Λ_n = sup_{θ ∈ Θ_0, σ > 0} σ^{−n} exp{−Σ_{i=1}^n (Y_i − r(x_i; θ))² / (2σ²)} / sup_{θ ∈ Θ, σ > 0} σ^{−n} exp{−Σ_{i=1}^n (Y_i − r(x_i; θ))² / (2σ²)}.
Let θ̂_0 and θ̂ be respectively the restricted and unrestricted least squares estimators of θ; in other words, θ̂_0 minimizes

σ̂²(θ) = (1/n) Σ_{i=1}^n (Y_i − r(x_i; θ))²

for θ ∈ Θ_0, while θ̂ minimizes σ̂²(θ) over θ ∈ Θ. The likelihood ratio may now be expressed as

Λ_n = [σ̂²(θ̂) / σ̂²(θ̂_0)]^{n/2},

and hence the likelihood ratio test rejects H_0 for large values of the variance ratio
(5.7)    σ̂²(θ̂_0) / σ̂²(θ̂).
The quantity (5.7) is exemplary of a wide class of variance-ratio statistics that are useful in testing lack of fit, regardless of whether the data are Gaussian or not. Many of the statistics to be encountered in this and later
chapters are special cases of the following general approach. Suppose that two estimators of variance are constructed, and call them σ̂²_1 and σ̂²_2. The estimator σ̂²_1 is derived on the assumption that the null model is correct. It is an unbiased estimator of σ² under H_0 and tends to overestimate σ² under the alternative hypothesis. The estimator σ̂²_2 is constructed so as to be less model dependent than σ̂²_1, in the sense that σ̂²_2 is at least approximately unbiased for σ² under both null and alternative hypotheses. It follows that the ratio σ̂²_1 / σ̂²_2 contains information about model fit. Only when the ratio is significantly larger than 1 is there compelling evidence that the data are inconsistent with the null hypothesis.
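The two-estimator idea can be illustrated in a few lines of code. The sketch below is mine, not the book's: the straight-line null model, the sample size, the noise level and the sinusoidal departure are all arbitrary choices, and the model-free denominator is a first-difference estimator of the kind introduced in Section 5.3.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = (np.arange(1, n + 1) - 0.5) / n
sigma = 0.5

def variance_ratio(y):
    # sigma2_1: model-based estimator, unbiased only if the null
    # (straight-line) model is correct
    X = np.column_stack([np.ones(n), x])
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2_1 = np.sum(resid ** 2) / (n - 2)
    # sigma2_2: difference-based estimator, approximately unbiased for
    # sigma^2 under both hypotheses when r is smooth
    sigma2_2 = np.sum(np.diff(y) ** 2) / (2 * (n - 1))
    return sigma2_1 / sigma2_2

# Under the null model the ratio should be near 1 ...
ratio_null = variance_ratio(1 + 2 * x + rng.normal(0, sigma, n))
# ... and inflated when r departs smoothly from a straight line.
ratio_alt = variance_ratio(1 + 2 * x + np.sin(4 * np.pi * x)
                           + rng.normal(0, sigma, n))
```

For the departure used here the numerator picks up the signal's variance about the fitted line while the denominator does not, so the ratio moves well above 1.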
5.3 Pure Experimental Error and Lack of Fit
An ideal scenario for detecting lack of fit of a regression model is when more than one replication is available at each of several design points. In this case the data may be written

Y_ij = r(x_i) + ε_ij,    j = 1, ..., n_i,  i = 1, ..., n,

where we assume that the ε_ij's have a common variance for all i and j. For such data we may assess the pure experimental error by computing the statistic

SSE_P = Σ_{i=1}^n Σ_{j=1}^{n_i} (Y_ij − Ȳ_i)²,
where Ȳ_i is the sample mean of the data at design point x_i. Defining N = Σ_{i=1}^n n_i, if at least one n_i is more than 1, then σ̂²_P = SSE_P/(N − n) is an unbiased estimator of the error variance σ². This is an example of a model-free variance estimator, in that its construction does not require a model for r. From a model-checking point of view, there is obviously a great advantage to having replicates at at least some of the design points.
Suppose that r̂(·) is an estimate of the regression function r and that we wish to assess the fit of r̂(·). Define r̂_i = r̂(x_i), i = 1, ..., n, and consider the residuals

e_ij = Y_ij − r̂_i = (Y_ij − Ȳ_i) + (Ȳ_i − r̂_i),    j = 1, ..., n_i,  i = 1, ..., n.

Defining the model sum of squares SSE_M by

SSE_M = Σ_{i=1}^n n_i (Ȳ_i − r̂_i)²,
the sum of squared residuals SSE is

SSE = Σ_{i=1}^n Σ_{j=1}^{n_i} e_ij² = SSE_P + SSE_M.

A model-based estimator of variance is σ̂²_M = SSE_M/n. Generally speaking, this estimator will be a "good" estimator so long as the fitted regression model is adequate. However, if the regression function r differs substantially from the fitted model, then σ̂²_M will tend to be larger than σ², since Ȳ_i − r̂_i will contain a systematic component due to the discrepancy between r and the fitted model. A formal test of model fit could be based on the statistic
(5.8)    σ̂²_M / σ̂²_P,

which is an example of the variance ratio discussed in Section 5.2.2. The distributional properties of σ̂²_M / σ̂²_P depend upon several factors, including the distribution of the errors and the type of regression model fitted to the data. A special case of interest is when the null model is linear in p unknown parameters and the parameters are estimated by least squares. Here, nσ̂²_M/(n − p) is an unbiased estimator of σ² under the null hypothesis that the linear model is correct. If in addition the errors are Gaussian, the statistic

F = [SSE_M/(n − p)] / σ̂²_P

has, under the null model, the F distribution with degrees of freedom n − p and N − n. When H_0 is false, F will tend to be larger than it is under H_0; hence the appropriate size-α test is to reject H_0 when F exceeds the (1 − α)100th percentile of the F_{(n−p),(N−n)} distribution.
When there are no replicates (i.e., n_i = 1 for each i), it is still possible to obtain an estimator that approximates the notion of pure experimental error. The idea is to treat the observations at neighboring design points as "near" replicates. If the regression function is sufficiently smooth, then the difference Y_i − Y_{i−1} will be approximately ε_i − ε_{i−1}; hence differences of Y's can be used to estimate the variance of the ε_i's. Gasser, Sroka and Jennen-Steinmetz (1986) refer to Y_i − Y_{i−1}, i = 2, ..., n, as pseudo-residuals. Other candidates for pseudo-residuals are
ẽ_i = [(x_{i+1} − x_i) Y_{i−1} + (x_i − x_{i−1}) Y_{i+1}] / (x_{i+1} − x_{i−1}) − Y_i,    i = 2, ..., n − 1,

which are the result of joining Y_{i+1} and Y_{i−1} by a straight line and taking the difference between this line and Y_i. Variance estimators based on these
two types of pseudo-residuals are

σ̂²_a = [1/(2(n − 1))] Σ_{i=2}^n (Y_i − Y_{i−1})²

and, for equally spaced designs,

σ̂²_b = [2/(3(n − 2))] Σ_{i=2}^{n−1} ẽ_i².
Either of the estimators σ̂²_a or σ̂²_b could be used in place of σ̂²_P in (5.8) to obtain a lack-of-fit statistic that approximates the notion of comparing a model's residual error with a measure of pure experimental error. Of course, in order to conduct a formal test it is necessary to know the probability distribution of the variance ratio under the null hypothesis. We defer discussion of this issue until the next section.
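The two pseudo-residual variance estimators take only a few lines to compute. The sketch below is illustrative (the regression function, noise level and equally spaced design are my choices): for an equally spaced design the straight-line pseudo-residuals reduce to (Y_{i−1} + Y_{i+1})/2 − Y_i, whose mean square is (3/2)σ² when r is smooth, which is where the 2/3 normalization comes from.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = (np.arange(1, n + 1) - 0.5) / n
sigma = 0.4
y = np.sin(2 * np.pi * x) + rng.normal(0, sigma, n)  # smooth r, no replicates

# sigma2_a: based on the first-difference pseudo-residuals Y_i - Y_{i-1}
sigma2_a = np.sum(np.diff(y) ** 2) / (2 * (n - 1))

# sigma2_b: based on straight-line pseudo-residuals; for equally spaced
# x_i these are (Y_{i-1} + Y_{i+1})/2 - Y_i, with mean square (3/2) sigma^2
# under a smooth r, hence the factor 2/3
pseudo = (y[:-2] + y[2:]) / 2 - y[1:-1]
sigma2_b = 2.0 * np.sum(pseudo ** 2) / (3.0 * (n - 2))
```

Both values should land close to the true σ² = 0.16 here: the smooth trend contributes only a term of order n^{−2} to either estimator.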
5.4 Testing the Fit of Linear Models
Suppose that the model under consideration has the linear form

r(x) = Σ_{j=1}^p θ_j r_j(x),    0 ≤ x ≤ 1,
where r_1, ..., r_p are known functions and θ_1, ..., θ_p are unknown parameters. We shall refer to such models as linear models, which of course have played a prominent role in the theory and practice of regression analysis. In this section we consider methods that have been used to test how well such models fit the observed data. In addition to their historical significance, linear models are of interest to us because most smoothing methods are linear in the data and hence have close ties to methods used in the analysis of linear models. The link between linear models and smoothing ideas is explored in Eubank (1988).
Initially we will assume that the error terms in model (5.1) are independent and identically distributed as N(0, σ²). This assumption is in keeping with the classical treatment of linear models, as in Rao (1973). However, in Section 5.4.3 we will discuss ways of approximating the distribution of test statistics when the Gaussian assumption is untenable.
5.4.1 The Reduction Method The reduction method is an elegant model-checking technique for the case where one has a particular, linear alternative hypothesis in mind and the
null hypothesis is nested within the alternative. Suppose the hypotheses of interest have the form

(5.9)    H_0 : r(x) = Σ_{j=1}^p θ_{j0} r_j(x),    0 ≤ x ≤ 1,

and

(5.10)    H_a : r(x) = Σ_{j=1}^{p+k} θ_{ja} r_j(x),    0 ≤ x ≤ 1,

where k ≥ 1. In the reduction method one determines how much the error sum of squares is reduced by fitting the alternative model having p + k terms. Let SSE_0 and SSE_a be the sums of squared residuals obtained by fitting the null and alternative models, respectively, by least squares, and define the test statistic
F_R = [(SSE_0 − SSE_a)/k] / [SSE_a/(n − p − k)].

Under H_0, F_R has an F distribution with degrees of freedom k and n − p − k. The denominator SSE_a/(n − p − k) is an unbiased estimator of σ² under both H_0 and H_a, whereas the numerator (SSE_0 − SSE_a)/k is unbiased for σ² only under H_0. The expected value of the numerator is larger than σ² when H_a is true, and so H_0 is rejected for large values of F_R. Obviously F_R is another lack-of-fit statistic that is a ratio of variance estimates. In fact, the test based on F_R is equivalent to the Gaussian-errors likelihood ratio test.
A situation where it is natural to use the reduction method is in polynomial regression. To decide if a polynomial of degree higher than p is required one may apply the reduction method with the null model corresponding to a pth degree polynomial, and the alternative model to a (p + k)th degree polynomial, k ≥ 1. Indeed, one means of choosing an appropriate degree for a polynomial is to apply a series of such reduction tests. One tests hypotheses of the form
H_0^p : r(x) = Σ_{j=1}^{p−1} θ_{j0} x^{j−1}    vs.    H_a^p : r(x) = Σ_{j=1}^{p+k−1} θ_{ja} x^{j−1}
for p = 2, 3, ... and takes the polynomial to be of order p̂, where p̂ is the smallest p for which H_0^p is not rejected. The reduction method can also be used in the same way to test the fit of a trigonometric series model for r.
Lehmann (1959) shows that, among a class of invariant tests, the reduction test is uniformly most powerful for testing (5.9) vs. (5.10). Hence, for alternatives of the form (5.10), one cannot realistically expect to improve upon the reduction test in terms of power. Considering the problem from a larger perspective though, it is of interest to ask how well the reduction
test performs when H_0 fails to hold, but the regression function r is not of the form (5.10). In such cases the reduction method sometimes has very poor power.
As an example, suppose the data are Y_i = r(i/n) + ε_i, i = 1, ..., n, and the reduction method is used to test

H_0 : r(x) = θ_10 + θ_20 x,    0 ≤ x ≤ 1,

versus

H_a : r(x) = θ_1a + θ_2a x + θ_3a x²,    0 ≤ x ≤ 1.

In many cases where r is neither a line nor a quadratic, the reduction method based on the quadratic alternative will still have good power. Suppose, however, that r is a cubic polynomial r_a(x) = Σ_{i=0}^3 γ_i x^i with the properties that γ_3 ≠ 0 and

(5.11)    ∫_0^1 r_a(x) x^j dx = 0,    j = 0, 1, 2.

Obviously, r_a is not a straight line, and yet when a quadratic is fitted to the data using least squares, the estimated coefficients will each be close to 0, owing to condition (5.11). The result will be a reduction test with essentially no power. Figure 5.1 shows an example of r_a.
The previous example points out a fundamental property of parametric tests. Although such tests will generally be powerful against the parametric alternative for which they were designed, they can have very poor power for other types of alternatives. In our example, the problem is that r_a is orthogonal to quadratic functions, and consequently the model fitted under H_a ends up looking just like a function included in the null hypothesis. Our example is extreme in that a competent data analyst could obviously tell from a plot of the data that the regression function is neither linear nor quadratic. Nonetheless, the example hints that parametric tests will not always be a satisfactory means of detecting departures from a hypothesized model.
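This failure is easy to reproduce numerically. The sketch below is my own construction, not the book's simulation: it takes r_a proportional to the degree-3 shifted Legendre polynomial, which is orthogonal in L²[0, 1] to every quadratic, so the least squares quadratic coefficients come out near zero and the line-versus-quadratic reduction statistic stays near its null level despite the pronounced cubic trend.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x = (np.arange(1, n + 1) - 0.5) / n

# a cubic orthogonal to all quadratics on [0, 1]: the shifted Legendre P_3
r_a = 20 * x**3 - 30 * x**2 + 12 * x - 1
y = r_a + rng.normal(0, 0.5, n)

def sse(y, order):
    # residual sum of squares for a polynomial fit with `order` terms
    X = np.vander(x, order, increasing=True)
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum(resid ** 2)

# least squares quadratic fit: all three estimated coefficients are near 0
X2 = np.column_stack([np.ones(n), x, x**2])
quad_coef = np.linalg.lstsq(X2, y, rcond=None)[0]

# reduction statistic for straight line (2 terms) vs. quadratic (1 extra
# term): behaves as under the null hypothesis, i.e., has essentially no power
F_R = (sse(y, 2) - sse(y, 3)) / (sse(y, 3) / (n - 3))
```

Adding a cubic term, by contrast, reduces the residual sum of squares substantially, which is what an omnibus test would need to detect.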
5.4.2 Unspecified Alternatives The example in the last section suggests that it is desirable to have a method for testing lack of fit that is free of any specific alternative model. We consider such a method in this section, and in the process we introduce a technique for obtaining the probability distribution of the ratio of two quadratic forms. This technique will come in handy in our subsequent study of smoothing-based lack-of-fit tests.
FIGURE 5.1. Cubic Polynomial That Foils a Reduction Method Lack-of-Fit Test. The 1000 data values were generated from the cubic (solid line). The dotted line is the least squares quadratic fit.
We wish to test whether the data are consistent with the linear model in (5.9). To this end, define the n × p design matrix R by

R = [ r_1(x_1)  ···  r_p(x_1)
      r_1(x_2)  ···  r_p(x_2)
        ···     ···    ···
      r_1(x_n)  ···  r_p(x_n) ].

We assume throughout the rest of Section 5.4.2 that R has full column rank. This condition ensures unique least squares estimates of the coefficients θ_1, ..., θ_p. The least squares estimates will be denoted θ̂_1, ..., θ̂_p.
The test we shall consider is a generalization of the von Neumann (1941) test to be described in Section 5.5.1 and is also closely related to a test proposed by Munson and Jernigan (1989). Define the ith component of the vector e of residuals by e_i = Y_i − Σ_{j=1}^p θ̂_j r_j(x_i), i = 1, ..., n. It is well known that

e = [I_n − R(R′R)^{−1}R′] Y,
where Y = (Y_1, ..., Y_n)′ and I_n is the n × n identity matrix. A model-based estimator of variance is

σ̂²_M = [1/(n − p)] Σ_{i=1}^n e_i² = [1/(n − p)] Y′[I_n − R(R′R)^{−1}R′] Y.
We now desire an estimator of variance that will be reasonable whether or not the linear model (5.9) holds. Consider

σ̂² = (1/a_n) Σ_{i=2}^n (e_i − e_{i−1})² = (1/a_n) e′He,

where a_n = 2(n − 1) − trace(HR(R′R)^{−1}R′) and H is the n × n tridiagonal matrix

H = [  1  −1   0  ···   0   0
      −1   2  −1  ···   0   0
       0  −1   2  ···   0   0
      ···  ··· ··· ···  ··· ···
       0   0   0  ···   2  −1
       0   0   0  ···  −1   1 ].
This estimator of variance is unbiased for σ² when the linear model holds and is consistent for σ² as long as the linear function in (5.9) and the underlying regression function r are both piecewise smooth. We now take as our statistic the variance ratio

V_n = σ̂²_M / σ̂².
Other possible denominators for the test statistic are the estimators σ̂²_a and σ̂²_b, as defined in Section 5.3. An argument for using σ̂² is that it is completely free of the underlying regression function under H_0. Furthermore, it will typically have smaller bias than σ̂²_a when the linear model is reasonably close to the true function. Of course, one could also form an analog of σ̂²_b based on the residuals from the linear model.
Let us now consider the probability distribution of the statistic V_n when the linear model holds. First observe that V_n is the following ratio of quadratic forms:

V_n = Y′AY / Y′BY,

where

A = [1/(n − p)] [I_n − R(R′R)^{−1}R′]    and    B = [(n − p)²/a_n] AHA.
When the linear model holds, AY = Aε, where ε is the column vector of the ε_i's, and hence

V_n = ε′Aε / ε′Bε.

Note that the distribution of ε′Aε/ε′Bε is invariant to σ, and so at this point we assume without loss of generality that σ = 1. We have

P(V_n ≥ u) = P[ε′(A − uB)ε ≥ 0].

Theorem 2.1 of Box (1954) implies that the last probability is equal to

(5.12)    P( Σ_{j=1}^r λ_jn(u) χ_j² ≥ 0 ),

where r = rank(A − uB), λ_1n(u), ..., λ_rn(u) are the real nonzero eigenvalues of A − uB and χ_1², ..., χ_r² are i.i.d. single-degree-of-freedom χ² random variables. Given an observed value, u, of the statistic V_n, one may numerically determine the eigenvalues of A − uB and thereby obtain an approximation to the P-value of the test. Simulation or a numerical method as in Davies (1980), Buckley and Eagleson (1988), Wood (1989) or Farebrother (1990) can be used to approximate P(Σ_{j=1}^r λ_jn(u) χ_j² ≥ 0).
This same technique can be applied to any random variable that is a ratio of quadratic forms in a Gaussian random vector. For example, the null distributions of the statistics σ̂²_M/σ̂²_a and σ̂²_M/σ̂²_b defined in Section 5.3 can be obtained in this way, assuming of course that ε_1, ..., ε_n are i.i.d. Gaussian random variables. Several of the nonparametric test statistics to be discussed in this and subsequent chapters are ratios of quadratic forms and hence amenable to this technique.
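The eigenvalue computation is straightforward to carry out. The following sketch is illustrative only: the straight-line null model and the dimensions are my choices, and plain Monte Carlo on the chi-square mixture stands in for the algorithms of Davies or Farebrother. It evaluates P(V_n ≥ u) for the statistic of Section 5.4.2.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 2
x = (np.arange(1, n + 1) - 0.5) / n
R = np.column_stack([np.ones(n), x])          # straight-line null model
Rinv = np.linalg.solve(R.T @ R, R.T)
M = np.eye(n) - R @ Rinv                      # I_n - R(R'R)^{-1}R'

# tridiagonal H: 1, 2, ..., 2, 1 on the diagonal, -1 off the diagonal
H = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
H[0, 0] = H[-1, -1] = 1.0

a_n = 2 * (n - 1) - np.trace(H @ R @ Rinv)
A = M / (n - p)
B = ((n - p) ** 2 / a_n) * (A @ H @ A)

def p_value(u, nsim=20000):
    # real nonzero eigenvalues of the symmetric matrix A - uB
    lam = np.linalg.eigvalsh(A - u * B)
    lam = lam[np.abs(lam) > 1e-10]
    # Monte Carlo version of (5.12): P(sum_j lam_j * chi2_1j >= 0)
    chi2 = rng.chisquare(1, size=(nsim, lam.size))
    return np.mean(chi2 @ lam >= 0)

pv_mid = p_value(1.0)   # observed ratio near 1: no evidence of lack of fit
pv_big = p_value(2.0)   # observed ratio well above 1: strong evidence
```

The same recipe applies to any statistic expressible as a ratio of quadratic forms in a Gaussian vector; only A and B change.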
5.4.3 Non-Gaussian Errors
To this point in Section 5.4 we have assumed that the errors have a Gaussian distribution, which has allowed us to derive the null distribution of each of the test statistics considered. Of course, in practice one will often not know whether the errors are normally distributed; hence it behooves us to consider the effect of non-Gaussian errors.
An important initial observation is that, whether or not the errors are Gaussian, the null distribution of each test statistic we have considered is completely free of the unknown regression coefficients θ_1, ..., θ_p. This is a consequence of the linearity of the null model; typically the distribution of lack-of-fit statistics will depend upon unknown parameters when the null model is nonlinear. Furthermore, if we assume that ε_1, ..., ε_n are i.i.d.
with cumulative distribution function G_0(x/σ), then the null distribution of each statistic is invariant to the values of both θ and σ.
We will discuss two methods of dealing with non-Gaussian data: large sample tests and the bootstrap. As a representative of the tests so far discussed, we shall consider the test of fit based on the statistic V_n of Section 5.4.2. The following theorem provides conditions under which V_n has an asymptotically normal distribution.

Theorem 5.1. Suppose model (5.1) holds with r having the linear form of (5.9) and ε_1, ..., ε_n being independent random variables with common variance σ² and E|ε_i|^{2+δ} < M for all i and positive constants δ and M. If θ̂_1, ..., θ̂_p satisfy

E(θ̂_j − θ_j)² = O(1/n),    j = 1, ..., p,

and |r_j(x)| is bounded by a constant for each j and all x ∈ [0, 1], then the statistic V_n of Section 5.4.2 is such that

√n (V_n − 1) →_D N(0, 1)    as n → ∞.
The numerator in the first term on the right-hand side of this expression is

2 Σ_{i=2}^n e_i e_{i−1} + e_1² + e_n² = 2 Σ_{i=2}^n e_i e_{i−1} + O_p(1).

Now,

Σ_{i=2}^n e_i e_{i−1} = Σ_{i=2}^n ε_i ε_{i−1} + R_n,

where R_n is the sum of three terms, one of which is

R_{n1} = Σ_{j=1}^p (θ̂_j − θ_j) P_j,

with P_j = Σ_{i=2}^n ε_{i−1} r_j(x_i), j = 1, ..., p. It follows that R_{n1} = O_p(1) since E P_j² = O(n), E(θ̂_j − θ_j)² = O(n^{−1}) and p is finite. The other two terms in R_n can be handled in the same way, and we thus have R_n = O_p(1).
5.5. Nonparametric Lack-of-Fit Tests
Combining previous results and the fact that 1 − 2(n − p)/a_n = O(n^{−1}) yields

[2(n − p)/a_n] √n (V_n − 1) = (1/√n) Σ_{i=2}^n ε_i ε_{i−1} / σ² + O_p(1/√n).
The result now follows immediately upon using the central limit theorem for m-dependent random variables of Hoeffding and Robbins (1948). □
Under the conditions of Theorem 5.1, an asymptotically valid level-α test of (5.9) rejects the null hypothesis when

V_n ≥ 1 + z_α/√n.

If V_n is observed to be u, then an approximate P-value is 1 − Φ[√n(u − 1)]. An alternative approximation to the P-value is the probability (5.12). Theorem 5.1 implies that these two approximations agree asymptotically. However, in the author's experience the approximation (5.12) is usually superior to the normal approximation since it allows for skewness in the sampling distribution of the statistic.
Another means of dealing with non-Gaussian errors is to use the bootstrap. Let F(· ; G) denote the cumulative distribution of √n(V_n − 1) when the regression function has the linear form in (5.9) and the errors ε_1, ..., ε_n are i.i.d. from distribution G. Given data (x_1, Y_1), ..., (x_n, Y_n), we can approximate F(· ; G) by F(· ; Ĝ), where Ĝ is the empirical distribution of the residuals e_i = Y_i − Σ_{j=1}^p θ̂_j r_j(x_i), i = 1, ..., n. The distribution F(· ; Ĝ) can be approximated arbitrarily well by using simulation and drawing a sufficient number of bootstrap samples of size n, with replacement, from e_1, ..., e_n. Using arguments as in Hall and Hart (1990), it is possible to derive conditions under which the bootstrap approach yields a better approximation to the sampling distribution of √n(V_n − 1) than does the normal approximation.
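A minimal residual-bootstrap sketch follows. It is my own construction, not the book's implementation: the skewed error distribution, the simplified normalization a_n = 2(n − 1) (omitting the trace correction), and the use of only 500 resamples are all illustrative shortcuts.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 2
x = (np.arange(1, n + 1) - 0.5) / n
R = np.column_stack([np.ones(n), x])
M = np.eye(n) - R @ np.linalg.solve(R.T @ R, R.T)

def vn(y):
    e = M @ y                                    # residuals from the linear fit
    s2_model = np.sum(e ** 2) / (n - p)
    s2_diff = np.sum(np.diff(e) ** 2) / (2 * (n - 1))   # simplified a_n
    return s2_model / s2_diff

# data from the null model with skewed, non-Gaussian errors
y = 1 + 2 * x + (rng.exponential(1.0, n) - 1.0)
stat = np.sqrt(n) * (vn(y) - 1)

# bootstrap: resample centered residuals and rebuild Y* around the fitted line
fitted = y - M @ y
resid = M @ y - np.mean(M @ y)
boot = np.array([np.sqrt(n) * (vn(fitted + rng.choice(resid, n, replace=True)) - 1)
                 for _ in range(500)])
p_boot = np.mean(boot >= stat)                   # bootstrap P-value
```

Because the bootstrap resamples the empirical residual distribution, the resulting reference distribution inherits the skewness of the errors that the normal approximation ignores.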
The notion of maximal rate provides us with one way of comparing tests. If, for a given g, one test has a larger maximal rate than another, then the former test will have higher power than the latter for all n sufficiently large and all multiples of g that are sufficiently hard to detect (i.e., close to the null case). It is worth noting that in parametric problems maximal rates are usually 1/2, which is a consequence of the √n convergence rate of most parametric estimators.
At this point we investigate the limiting distribution of the von Neumann and Buckley statistics under model (5.18). Doing so will allow us to determine the maximal rate of each test and also establish that each test is consistent against a general class of fixed alternatives. To simplify presentation of certain results, we assume throughout the remainder of Section 5.5.3 that x_i = (i − 1/2)/n, i = 1, ..., n.
We first state a theorem concerning the limit distribution of F_N. When g ≡ 0, this theorem is a special case of Theorem 5.1. A proof for the case where ∫g² > 0 is given in Eubank and Hart (1993).

Theorem 5.2. The maximal rate of the von Neumann test under model (5.18) is 1/4 for any g such that 0 < ‖g‖ < ∞, where ‖g‖² = ∫_0^1 g²(x) dx. Furthermore, suppose that model (5.18) holds with γ = 1/4, and let g be square integrable on (0, 1). Then
An almost immediate consequence of Theorem 5.2 is that the von Neumann test is consistent against any fixed alternative r_0 for which ‖r_0 − r̄_0‖ > 0, where r̄_0 = ∫_0^1 r_0(x) dx. Another interesting result is that F_N has a maximal rate of 1/4, which is less than the maximal rate of 1/2 usually associated with parametric tests. For example, suppose that H_0 : r ≡ C is tested by means of the reduction method with a pth order polynomial as alternative model (p ≥ 1). For many functions g, even ones that are not polynomials, this reduction test has a maximal rate of 1/2. On the other hand, there exist functions g with 0 < ‖g‖ < ∞ such that the limiting power of a size-α reduction test is no more than α. (Such functions can be constructed as in the example depicted in Figure 5.1.)
The difference between the von Neumann and reduction tests is characteristic of the general difference between parametric and omnibus nonparametric tests. A parametric test is very good in certain cases but
very poor in others, whereas the nonparametric test could be described as jack-of-all-trades but master of none.
Turning to Buckley's cusum-based test, we have the following theorem, whose proof is omitted.

Theorem 5.3. The maximal rate of the no-effect test based on T_B is 1/2 under model (5.18) for any g such that 0 < ‖g‖ < ∞. Furthermore, if model (5.18) holds with γ = 1/2 and g square integrable on (0, 1), then

T_B →_D (1/π²) Σ_{j=1}^∞ (Z_j + √2 a_j/σ)² / j²,

where Z_1, Z_2, ... are i.i.d. as N(0, 1) and a_j = ∫_0^1 cos(πjx) g(x) dx.
Theorem 5.3 entails that Buckley's test is consistent against any fixed alternative r₀ for which ||r₀ − r̄₀|| > 0. So, the von Neumann and Buckley tests are both consistent against any nonconstant, piecewise smooth member of L²[0, 1]. Theorem 5.3 also tells us that the cusum test has maximal rate equal to that commonly associated with parametric tests. In a certain sense, then, the cusum test is superior to the von Neumann test, since the latter has a smaller maximal rate of 1/4. This means that for any given square integrable g, if we take γ = 1/2 in (5.18), then there exists an n_g such that the power of the Buckley test is strictly larger than that of the same-size von Neumann test for all n > n_g. As impressive as it may sound, this result certainly has its limitations. To appreciate why, we now compare the powers of the two tests in a maximin fashion. Let model (5.18) hold with γ = 1/4. If we compare the power of the two tests for any specific g, then Buckley's test is asymptotically preferable to the von Neumann test, since

    lim_{n→∞} P(T_B ≥ τ_n(α)) = 1,

where τ_n(α) and v_n(α) are the (1 − α)100th percentiles of the null distributions of T_B and F_N, respectively. Alternatively, suppose that for γ = 1/4 in model (5.18) we compute the smallest power of each test over a given class of functions. Consider, for example, the sequence of classes G_n:
where β_n → β > 0 as n → ∞. It is straightforward to see by examining the proof of Theorem 5.2 that the von Neumann test satisfies (5.19).
Now, one element of G_n is the function

    g_n(x; k) = √2 β_n^{1/2} cos(πkx),  0 ≤ x ≤ 1.

Obviously, (5.20) holds.
As indicated in Section 5.5.3,

(5.21)    T_B = (1/σ̂²) Σ_{j=1}^{n−1} 2n φ̂_j² / [2n sin(jπ/(2n))]²,
from which one can establish that if k_n > n^{1/4} log n, then (5.22) holds as n → ∞. Along with (5.19) and (5.20), (5.22) implies that

    lim_{n→∞} inf_{g∈G_n} P(T_B ≥ τ_n(α)) ≤ α < lim_{n→∞} inf_{g∈G_n} P(F_N ≥ v_n(α)).
In words, the very last expression says that certain high frequency alternatives that are easily detected by the von Neumann test will be undetectable by Buckley's test.

The previous calculations show in a precise way the fundamental difference between the von Neumann and Buckley tests. Expression (5.21) implies that, for large n,

    T_B ≈ (1/π²) Σ_{j=1}^{n−1} j^{−2} (2n φ̂_j²)/σ̂²,

which shows that T_B will have difficulty in detecting anything but very low frequency type functions, i.e., functions r for which ||r − r̄||² is nearly the sum of the first one or two squared Fourier coefficients. By contrast, the von Neumann statistic weights all the squared Fourier coefficients equally, and so the power of the von Neumann test is just as good for high frequency functions as for low frequency ones.
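This contrast between the two weighting schemes is easy to see numerically. The sketch below (my own illustration in Python; the function name and the choice n = 100 are not from the text) computes the weights that (5.21) places on the squared Fourier coefficients and compares them with the von Neumann statistic's equal weights.

```python
import numpy as np

def buckley_weights(n):
    """Weights on the squared Fourier coefficients implied by (5.21)."""
    j = np.arange(1, n)
    return 2 * n / (2 * n * np.sin(j * np.pi / (2 * n))) ** 2

n = 100
w = buckley_weights(n)          # decays roughly like (pi * j) ** -2
w_vn = np.ones(n - 1)           # von Neumann: equal weight at every frequency

# Weight on frequency j = 1 relative to j = 10; the small-angle
# approximation sin(u) ~ u predicts a ratio of about 10^2 = 100.
ratio = w[0] / w[9]
approx = (np.pi * 1) ** -2 / (np.pi * 10) ** -2
```

With these weights, the first couple of frequencies dominate T_B by roughly two orders of magnitude, while the von Neumann statistic treats all frequencies alike.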
5.6 Neyman Smooth Tests

Neyman smooth tests are a good point of departure as we near our treatment of smoothing-based lack-of-fit tests. Indeed, they are a special case of certain statistics that will be discussed in Chapter 6. Like the von Neumann and Buckley statistics, Neyman smooth statistics are weighted sums of squared Fourier coefficients. The only way in which they differ substantially from these two tests is through the particular weighting scheme they employ. We shall see that Neyman smooth tests are a sort of compromise between the von Neumann and Buckley tests.
Neyman (1937) proposed his smooth tests in the goodness-of-fit context. Suppose X₁, ..., X_n are independent and identically distributed observations having common, absolutely continuous distribution function F. For a completely specified distribution F₀, it is desired to test

    H₀ : F(x) = F₀(x)  ∀ x,
which is equivalent to hypothesizing that F₀(X₁) has a uniform distribution on the interval (0, 1). Neyman suggested the following smooth alternative of order k to H₀:

    g(u) = exp(θ₀ + Σ_{i=1}^k θᵢ φᵢ(u)),  0 < u < 1,

where g is the density of F₀(X₁), φ₁, ..., φ_k are Legendre polynomials transformed linearly so as to be orthonormal on (0, 1) and θ₀ is a normalizing constant. In this formulation the null hypothesis of interest is

    H₀ : θ₁ = ⋯ = θ_k = 0.
The test statistic proposed by Neyman is

    Ψ_k² = Σ_{i=1}^k V_i²,  with  V_i = n^{−1/2} Σ_{j=1}^n φᵢ(F₀(X_j)).
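To make the construction concrete, here is a small sketch (my own, not from the text) that computes Ψ_k² with shifted Legendre polynomials φᵢ(u) = √(2i + 1) Pᵢ(2u − 1), which are orthonormal on (0, 1); the sample, the order k = 4 and the helper names are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import legendre

def trap(y, x):
    """Simple trapezoidal rule (avoids version-specific numpy helpers)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

def phi(i, u):
    """i-th shifted Legendre polynomial, orthonormal on (0, 1)."""
    c = np.zeros(i + 1)
    c[i] = 1.0
    return np.sqrt(2 * i + 1) * legendre.legval(2 * u - 1, c)

def neyman_psi2(x, F0, k=4):
    """Neyman's smooth statistic of order k for H0: X ~ F0."""
    u = F0(np.asarray(x))
    n = len(u)
    V = np.array([phi(i, u).sum() / np.sqrt(n) for i in range(1, k + 1)])
    return float(np.sum(V ** 2))        # approximately chi^2_k under H0

rng = np.random.default_rng(0)
x = rng.uniform(size=200)                   # data truly from F0
psi2 = neyman_psi2(x, F0=lambda t: t, k=4)  # uniform(0, 1) null

grid = np.linspace(0.0, 1.0, 20001)
norm1 = trap(phi(1, grid) ** 2, grid)       # orthonormality check: close to 1
```

Large values of psi2 relative to the χ_k² distribution indicate a departure from F₀.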
Under H₀, Ψ_k² is asymptotically distributed as χ² with k degrees of freedom. The null hypothesis is thus rejected at level α when Ψ_k² exceeds the (1 − α)100th percentile of the χ_k² distribution. Neyman referred to his test as a smooth test since the order k alternatives differ smoothly from the flat density on (0, 1). His test was constructed in such a way that, to a first order approximation, its power function depends on θ₁, ..., θ_k only through λ² = Σ_{i=1}^k θᵢ². Among all tests with this property, Neyman (1937) argued that his smooth test is asymptotically uniformly most powerful against order k alternatives for which λ is small.

We now consider Neyman's idea in the context of regression. Suppose that in model (5.1) we wish to test the no-effect hypothesis that r is identical to a constant. In analogy to Neyman's smooth order k alternatives, we could consider alternatives of the form
(5.23)    r(x) = θ₀ + Σ_{i=1}^k θᵢ φ_{i,n}(x),
where φ_{1,n}, ..., φ_{k,n} are functions that are orthonormal over the design points in the sense that, for any 0 ≤ i, j ≤ k,

(5.24)    n^{−1} Σ_{l=1}^n φ_{i,n}(x_l) φ_{j,n}(x_l) = δ_{ij},
and φ_{0,n} ≡ 1. The least squares estimators of θ₀, ..., θ_k are simply

    θ̂_j = n^{−1} Σ_{i=1}^n Y_i φ_{j,n}(x_i),  j = 0, ..., k.
If the errors are Gaussian, then under H₀, √n(θ̂₁, ..., θ̂_k)/σ has a k-variate normal distribution with mean 0 and identity covariance matrix. This statement remains true in an asymptotic sense if the errors are merely independent with common variance σ². Define T_{N,k} by

    T_{N,k} = n Σ_{j=1}^k θ̂_j² / σ̂²,

where σ̂² is some estimator of σ². An apparently reasonable test of the no-effect hypothesis is to conclude that there is an effect if and only if T_{N,k} ≥ c. We shall take the liberty of calling this test a "Neyman smooth test," due to its obvious similarity to Neyman's smooth goodness-of-fit test.

Not surprisingly, the reduction method of testing the no-effect hypothesis against (5.23) is equivalent to a Neyman smooth test. Due to the orthogonality conditions (5.24), the statistic F_R from Section 5.4.1 can be expressed as a monotone increasing function of T^R_{N,k}, where T^R_{N,k} is the version of T_{N,k} with σ̂² = n^{−1} Σ_{i=1}^n (Y_i − Ȳ)². It follows that the reduction test is equivalent to a Neyman smooth test.

Let us now suppose that the errors are i.i.d. as N(0, σ²). Then F_R has the F distribution with degrees of freedom k and n − k − 1. Hence, an exact size-α Neyman smooth test has rejection region of the form
(5.25)    T^R_{N,k} ≥ n F_{k,n−k−1,α} / [F_{k,n−k−1,α} + (n − k − 1)/k].
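The equivalence between F_R and T^R_{N,k} can be verified numerically. In the sketch below (my own, assuming the cosine basis φ_{j,n}(x) = √2 cos(πjx) and midpoint design points, which are introduced only later in this section), the shortcut formula based on the orthonormality conditions (5.24) reproduces the F statistic from an explicit least squares fit.

```python
import numpy as np

n, k = 60, 4
x = (np.arange(1, n + 1) - 0.5) / n
rng = np.random.default_rng(1)
y = 1.0 + 0.5 * np.sqrt(2) * np.cos(np.pi * x) + rng.normal(scale=0.7, size=n)

# cosine basis, exactly orthonormal over the midpoint design
Phi = np.sqrt(2) * np.cos(np.pi * np.outer(np.arange(1, k + 1), x))  # k x n
theta = Phi @ y / n                       # least squares coefficients

sse0 = np.sum((y - y.mean()) ** 2)
T = n * np.sum(theta ** 2) / (sse0 / n)   # T^R_{N,k}, sigma^2-hat = sse0 / n
FR_short = ((n - k - 1) / k) * T / (n - T)

# the same F statistic from an explicit least squares fit of the full model
X = np.column_stack([np.ones(n), Phi.T])
fit, *_ = np.linalg.lstsq(X, y, rcond=None)
ssea = np.sum((y - X @ fit) ** 2)
FR_ls = ((sse0 - ssea) / k) / (ssea / (n - k - 1))
```

Because F_R is monotone increasing in T^R_{N,k}, rejecting when F_R ≥ F_{k,n−k−1,α} is the same as using the rejection region (5.25).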
Results in Lehmann (1959, Chapter 7) imply that when the errors are Gaussian, the test (5.25) has a power function that depends only on ψ² = Σ_{i=1}^k θᵢ²/σ². More importantly, test (5.25) is uniformly most powerful for alternatives (5.23) among all tests whose power functions depend only on ψ². In light of our discussion in Section 5.5.4, it is significant that the power of the smooth test (5.25) depends on the θᵢ's only through ψ². This implies, for example, that for k = 4, the power of (5.25) would be the same for the two cases θ = (1, 0, 0, 0) and θ = (0, 0, 0, 1). By contrast, Buckley's test
tends to have good power only for low frequency alternatives, as discussed in Section 5.5.4.

When the errors are merely independent with common variance σ²,

    Σ_{j=1}^k n(θ̂_j − θ_j)²/σ² →_d χ_k²

as n → ∞. It follows that if σ̂² in T_{N,k} is any consistent estimator of σ², then T_{N,k} is asymptotically distributed χ_k² under the no-effect hypothesis. This fact allows one to construct a valid large sample Neyman smooth test.

It is easy to verify that for any order k alternative the order k Neyman smooth test has maximal rate 1/2. Furthermore, an asymptotic version of the uniformly most powerful property holds for the Neyman smooth test. Consider local alternatives of order k having the form

    r_n(x) = θ₀ + n^{−1/2} Σ_{i=1}^k θᵢ φ_{i,n}(x).
Under these alternatives the order k Neyman smooth test has a limiting power function that is uniformly higher than that of any test whose limiting power depends only on Σ_{i=1}^k θᵢ²/σ².

Suppose the design points are x_i = (i − 1/2)/n, i = 1, ..., n, and that the Neyman smooth test uses the cosine functions

    φ_{j,n}(x) = √2 cos(πjx),  j = 1, ..., k.
Then the Neyman statistic is a weighted sum of squared Fourier coefficients as in (5.17) with

    w_{j,n} = 1 for 1 ≤ j ≤ k,  and  w_{j,n} = 0 for k < j < n.

A Neyman smooth test with 2 < k ≪ n may thus be viewed as a compromise between the von Neumann and Buckley test statistics. If one is uncertain about the way in which r deviates from constancy but still expects the deviation to be relatively low frequency, then a Neyman smooth test of fairly small order, say 5, would be a good test. An order 5 Neyman test will usually be better than either the von Neumann or Buckley test when most of the "energy" in r is concentrated in the third, fourth and fifth Fourier coefficients. In Chapter 7 we will introduce tests that may be regarded as adaptive versions of the Neyman smooth test. These tests tend to be more powerful than either the von Neumann or Buckley test but do not require specification of an alternative as do Neyman smooth tests.
6 Lack-of-Fit Tests Based on Linear Smoothers
6.1 Introduction

We are now in a position to begin our study of lack-of-fit tests based on smoothing methodology. We continue to assume that observations Y₁, ..., Y_n are generated from the model
(6.1)    Y_i = r(x_i) + ε_i,  i = 1, ..., n,
in which ε₁, ..., ε_n are mean 0, independent random variables with common variance σ² < ∞. We also assume that the design points satisfy x_i = F^{−1}[(i − 1/2)/n], i = 1, ..., n, where F is a cumulative distribution function with continuous derivative f that is bounded away from 0 on [0, 1]. This assumption on the design is made to allow a concise description of certain theoretical properties of tests, but it is not necessary in order for those tests to be either valid or powerful. In this chapter we focus attention on the use of linear smoothers based on fixed smoothing parameters. By a linear smoother, we mean one that is linear in either Y₁, ..., Y_n or a set of residuals e₁, ..., e_n. If applied to residuals, a linear smoother has the form
(6.2)    ĝ(x; S) = Σ_{i=1}^n w_i(x; S) e_i,
where the weights w_i(x; S), i = 1, ..., n, are constants that do not depend on the data Y₁, ..., Y_n or any unknown parameters, and S denotes the value of a smoothing parameter. A smoother that we do not consider linear is ĝ(x; Ŝ), where Ŝ is a nonconstant statistic. Kernel estimators, Fourier series, local polynomials, smoothing splines and wavelets are all linear in the Y_i's as long as their smoothing parameters are fixed rather than data driven.

Our interest is in testing the null hypothesis that r is in some parametric class of functions S_θ against the general alternative that r is not in S_θ. The basic idea behind all the methods in this chapter is that one computes
a smooth and compares it with a curve that is "expected" under the null hypothesis. If the smooth differs sufficiently from the expected curve, then there is evidence that the null hypothesis is false. Smoothing-based tests turn out to be advantageous in a number of ways:
• They are omnibus in the sense of being consistent against each member of a very large class of alternative hypotheses.
• They tend to be more powerful than some of the well-known omnibus tests discussed in Chapter 5.
• They come complete with a smoother.
The last advantage is perhaps the most attractive feature of smoothing-based tests. The omnibus tests of Chapter 5 do not provide any insight about the underlying regression function in the event that the null hypothesis is rejected. In contrast, by plotting the smoother associated with a lack-of-fit test, one obtains much more information about the model than is contained in a simple "accept-reject" decision. Our study begins in the next section with a look at two fundamental smoothing-based approaches to testing lack of fit.
6.2 Two Basic Approaches

Two fundamental testing approaches are introduced in this section. These will be referred to as (i) smoothing residuals and (ii) comparing parametric and nonparametric models. Sometimes these two methods are equivalent, but when they are not, the former method is arguably preferable. The two approaches are described in Sections 6.2.1 and 6.2.2, and a case for smoothing residuals is made in Section 6.2.3.
6.2.1 Smoothing Residuals

For a parametric model S_θ, which could be either linear or nonlinear in the unknown parameters, we wish to test the null hypothesis

    H₀ : r ∈ S_θ = {r(·; θ) : θ ∈ Θ}.
Let θ̂ be a consistent estimator of θ assuming that the null hypothesis is true. Ideally θ̂ will also be efficient, although numerical considerations and the nature of the parametric family S_θ may preclude this. Define residuals e₁, ..., e_n by

    e_i = Y_i − r(x_i; θ̂),  i = 1, ..., n.
If the null hypothesis is true, these residuals should behave more or less like a batch of zero mean, uncorrelated random variables. Hence, when H₀ is true, a linear smooth ĝ as in (6.2) will tend to be relatively flat and centered about 0. A useful subjective diagnostic is to plot the estimate ĝ(·; S) and
see how much it differs from the zero function. Often a pattern will emerge in the smooth that was not evident in a plot of residuals. Of course, looks can be deceiving, and so it is also important to have a statistic that more objectively measures the discrepancy of ĝ(·; S) from 0. An obvious way of testing H₀ is to use a test statistic of the form

    T = ||ĝ(·; S)||² / σ̂²,
where ||g|| is a quantity that measures the "size" of the function g and σ̂² is a model-free estimator of the error variance σ². Examples of ||g|| are

    {∫₀¹ g²(x) f(x) dx}^{1/2},  {∫₀¹ g²(x) dx}^{1/2},  ∫₀¹ |g(x)| dx,  and  sup_{0≤x≤1} |g(x)|,
where f is the design density. The measure involving f puts more weight on |g(x)| at points x where there is a higher concentration of design points. A convenient approximation to ∫₀¹ ĝ²(x; S) f(x) dx is n^{−1} Σ_{i=1}^n ĝ²(x_i; S), which leads to the lack-of-fit statistic

(6.3)    R_n = n^{−1} Σ_{i=1}^n ĝ²(x_i; S) / σ̂².
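A minimal sketch of R_n for the no-effect hypothesis follows (my own illustration; the Gaussian kernel, the bandwidth h = 0.1 and the difference-based variance estimator are assumed choices, not prescribed by the text).

```python
import numpy as np

def smooth_residuals(e, x, h):
    """Priestley-Chao type kernel smooth of residuals, evaluated at the x_i."""
    d = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * d ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel
    return K @ e / (len(x) * h)

def Rn(y, x, h):
    e = y - y.mean()                      # residuals from the no-effect fit
    g = smooth_residuals(e, x, h)
    # difference-based (model-free) estimator of the error variance
    sig2 = np.sum(np.diff(y) ** 2) / (2 * (len(y) - 1))
    return np.mean(g ** 2) / sig2

n = 100
x = (np.arange(1, n + 1) - 0.5) / n
rng = np.random.default_rng(2)
eps = rng.normal(scale=0.5, size=n)
r_null = Rn(eps, x, h=0.1)                              # H0 true
r_alt = Rn(2 * np.sin(2 * np.pi * x) + eps, x, h=0.1)   # H0 false
```

Under the alternative the smooth of the residuals tracks the deviation g, so R_n is far larger than under the null.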
We now argue heuristically that R_n is essentially a variance ratio, as discussed in Section 5.2.2. The residuals are

    e_i = r(x_i) − r(x_i; θ̂) + ε_i,  i = 1, ..., n.

Typically, whether or not H₀ is true, the statistic θ̂ will converge in probability to some quantity, call it θ₀, as n → ∞. When H₀ is true, θ₀ is the true parameter value, whereas under the alternative, r(·; θ₀) is some member of S_θ that differs from the true regression function r. It follows that for large n

(6.4)    e_i ≈ g(x_i) + ε_i,

where g(x) = r(x) − r(x; θ₀), and g is identically 0 if and only if H₀ is true. In essence, then, R_n has the variance-ratio form of Section 5.2.2. A sensible test would reject H₀ for large values of R_n.

It is enlightening to compare R_n with the statistic V_n of Section 5.4.2. Typically a limiting version of a linear smoother interpolates the data. When smoothing residuals this means that

(6.5)    ĝ(x_i; S) → e_i,  i = 1, ..., n,

as S corresponds to less and less smoothing. For smoothers satisfying (6.5) it follows that V_n is a limiting version of R_n. We may thus think of V_n as a "nonsmooth" special case of the smooth statistic R_n. Later in this chapter we provide evidence that the smooth statistic usually has higher power than its nonsmooth counterpart.
6.2.2 Comparing Parametric and Nonparametric Models

Suppose that r̂(·; S) is a nonparametric estimate of r based on a linear smooth of Y₁, ..., Y_n. As in the previous section, θ̂ denotes our estimate of θ on the assumption that the null model is true. As our lack-of-fit statistic consider

    C_n = ||r̂(·; S) − r(·; θ̂)||² / σ̂²,

where ||h|| is some measure of the size of h, as in the previous section. We will refer to the statistic C_n as a comparison of parametric and nonparametric models. In general, the statistic C_n will be the same as R_n only when

    r̂(·; S) − r(·; θ̂) = ĝ(·; S).
We have

    r̂(x; S) − r(x; θ̂) = Σ_{i=1}^n {Y_i − r(x_i; θ̂)} w_i(x; S) + Σ_{i=1}^n r(x_i; θ̂) w_i(x; S) − r(x; θ̂)

(6.6)    = ĝ(x; S) + Bias{r̂(x; S), θ̂},

where Bias{r̂(x; S), θ} denotes the bias of r̂(x; S) when H₀ holds and θ is the true parameter value. It follows that smoothing residuals and comparing parametric and nonparametric fits will be equivalent only when the smoother r̂(x; S) is, for each x ∈ [0, 1], an unbiased estimator of r(x) under the null hypothesis. From Chapter 3 we know that smoothers are generally biased estimators of the underlying regression function, and so for the most part R_n and C_n will be the same only in special circumstances.
Below are a couple of examples where the two methods are equivalent.

EXAMPLE 6.1: TESTING FOR NO-EFFECT. Consider testing the no-effect hypothesis H₀ : r = constant, wherein the estimate of the null model is simply Ȳ and the residuals are e_i = Y_i − Ȳ, i = 1, ..., n. The two methods are equivalent in this case whenever the smoother Σ_{i=1}^n Y_i w_i(·; S) is unbiased for constant functions. This is true so long as

    Σ_{i=1}^n w_i(x; S) = 1 for each x ∈ [0, 1].

We saw in Chapter 2 that many smoothers have weights that sum to 1, including trigonometric series and local linear estimates, as well as Nadaraya-Watson and boundary modified Gasser-Müller kernel estimates.
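The weight condition is easy to confirm numerically. Below is a sketch (my own, with a Nadaraya-Watson smoother and a Gaussian kernel) showing that weights summing to one make the smoother reproduce constant functions exactly:

```python
import numpy as np

def nw_weights(x0, x, h):
    """Nadaraya-Watson weights w_i(x0; h); they sum to 1 by construction."""
    k = np.exp(-0.5 * ((x0 - x) / h) ** 2)
    return k / k.sum()

n = 50
x = (np.arange(1, n + 1) - 0.5) / n
grid = np.linspace(0, 1, 21)
W = np.array([nw_weights(x0, x, h=0.15) for x0 in grid])

const_fit = W @ np.full(n, 3.7)   # smoothing the constant function 3.7
```

Because every row of W sums to one, the smooth of a constant is that constant, so smoothing residuals and comparing fits coincide for the no-effect hypothesis.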
EXAMPLE 6.2: TESTING THE STRAIGHT LINE HYPOTHESIS. Consider testing the null hypothesis

    H₀ : r(x) = θ₀ + θ₁x,  0 ≤ x ≤ 1.

It is easily checked that, independent of the choice of smoothing parameter, local linear estimators and cubic smoothing splines are unbiased for straight lines. It follows that comparing a fitted line with either a local linear estimate or a cubic smoothing spline is equivalent to smoothing residuals.

For a given smoother the two basic methods will generally be equivalent for just one specific type of null hypothesis. For second order kernel smoothers without any boundary correction, the no-effect hypothesis is the only case where the equivalence occurs. This follows from the bias expansion that we derived in Section 3.2.2. If one uses a kth order kernel and an appropriate boundary correction, then smoothing residuals will be equivalent to the comparison method when testing the null hypothesis that r is a (k − 1)st degree polynomial. We will explore this situation further in Section 8.2.3.
6.2.3 A Case for Smoothing Residuals

When constructing a statistical hypothesis test, the first thing one must do is ensure that the test is valid in the sense that its type I error probability is no more than its nominal size. The validity of a test based on C_n can be difficult to guarantee since, by (6.6), C_n's distribution will usually depend upon the unknown parameter θ through the quantity Bias{r̂(x; S), θ}. By contrast, bias is usually not a problem when smoothing residuals. Suppose the null model is linear and that we use least squares to estimate the regression coefficients θ₁, ..., θ_p. Then the distribution of the residuals e₁, ..., e_n is completely free of θ₁, ..., θ_p, and hence so is the distribution of ||ĝ(·; S)||. Furthermore,

    E(e_i) = 0,  i = 1, ..., n,

and so any linear smooth ĝ(·; S) of the residuals has null expectation 0 for each x and for every S.

The bias-free nature of smoothed residuals is our main reason for preferring the statistic R_n to C_n. Intuitively, we may argue in the following way. Imagine two graphs: one of ĝ(x; S) versus x, and the other of r̂(x; S) − r(x; θ̂) versus x. A systematic pattern in the second graph would not be unusual even if H₀ were true, due to the bias in the smoother r̂(·; S). On the other hand, a pattern in the graph of smoothed residuals is not expected unless the regression function actually differs from the null model. Of course, S could be chosen so that Bias{r̂(x; S), θ} is negligible. However, in this case (6.6) implies that R_n and C_n are essentially equivalent. When Bias{r̂(x; S), θ} is not negligible, an obvious remedy is to center r̂(x; S) − r(x; θ̂) by subtracting from it the statistic

    Σ_{i=1}^n r(x_i; θ̂) w_i(x; S) − r(x; θ̂).
In doing so we are left with just the smooth ĝ(x; S), and all distinction between the two methods vanishes. Probably the only reason for adjusting C_n differently is that doing so might lead to a more powerful test. Rather than pursuing this possibility, we will use the more straightforward approach of smoothing residuals in the remainder of this chapter.
6.3 Testing the Fit of a Linear Model

We now consider using linear smoothers to test the fit of a linear model, in which case the null hypothesis is

(6.7)    H₀ : r(x) = Σ_{j=1}^p θ_j r_j(x).

We assume that the design matrix R defined in Section 5.4.2 is of full column rank and that the parameters θ₁, ..., θ_p are estimated by the method of least squares.
6.3.1 Ratios of Quadratic Forms

We begin with a treatment of statistics that are ratios of quadratic forms in a vector of residuals. Define e to be the column vector of residuals, and
suppose that our test statistic has the form

    R_S = n^{−1} Σ_{i=1}^n ĝ²(x_i; S) / σ̂²,

where σ̂² = e′Ce for some matrix C not depending on the data, and ĝ(x_i; S) is of the form (6.2). The vector of smoothed residuals is denoted g and is expressible as

    g = We = W(I_n − R(R′R)^{−1}R′)Y,

where W is the n × n smoother matrix with ijth element w_j(x_i). The statistic R_S has the form

    R_S = Y′AY / Y′BY,

where

    A = n^{−1} (I_n − R(R′R)^{−1}R′) W′W (I_n − R(R′R)^{−1}R′)

and

    B = (I_n − R(R′R)^{−1}R′) C (I_n − R(R′R)^{−1}R′).

When H₀ is true

    R_S = ε′Aε / ε′Bε;

hence for i.i.d. Gaussian errors the null distribution of R_S can be approximated using the technique introduced in Section 5.4.2.

The bootstrap is often an effective means of dealing with the problem of non-normal errors. To the extent that an empirical distribution better approximates the underlying error distribution than does a Gaussian, one has confidence that the bootstrap will yield the better approximation to R_S's sampling distribution. Even when n is very large there is a compelling reason to use the bootstrap. As will be shown in Section 6.3.3, the asymptotic null distribution of R_S is fairly sensitive to the choice of smoothing parameter. For most smoothers there are three distinct approximations to the large sample distribution of R_S. These approximations correspond to small, intermediate and large values of the smoothing parameter. In practice it will seldom be clear which of the three large sample tests is the most appropriate. The bootstrap is thus attractive even for large n since it automatically accounts for the effect of S on the distribution of R_S. To use the bootstrap to carry out a smoother-based test of (6.7), one may employ exactly the same bootstrap algorithm described in Section 5.4.3. An example in Section 6.4.2 illustrates this technique.

Another means of dealing with non-normality and/or small n is to use a permutation test. Raz (1990) uses this approach to obtain P-values for a test of no-effect based on nonparametric smoothers. The idea is to obtain a distribution by computing the statistic for all n! possible ways in which
the Y_i's may be assigned to the x_i's. In a simulation Raz (1990) shows that this approach does a good job of maintaining the test's nominal level for non-normal errors and n as small as 10.
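A sketch of the permutation idea follows (my own implementation; Raz's statistic differs in detail, and only a random subset of the n! assignments is used, which is the usual Monte Carlo shortcut):

```python
import numpy as np

def rn_no_effect(y, x, h):
    """Smoothing-residuals statistic for the no-effect hypothesis."""
    e = y - y.mean()
    d = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * d ** 2)            # Gaussian kernel; constants cancel
    g = K @ e / K.sum(axis=1)
    return np.mean(g ** 2)

def permutation_pvalue(y, x, h=0.1, B=199, seed=0):
    """Monte Carlo permutation P-value: reassign the Y_i to the x_i at random."""
    rng = np.random.default_rng(seed)
    obs = rn_no_effect(y, x, h)
    count = sum(rn_no_effect(rng.permutation(y), x, h) >= obs for _ in range(B))
    return (1 + count) / (B + 1)

n = 60
x = (np.arange(1, n + 1) - 0.5) / n
rng = np.random.default_rng(3)
y = 5 * np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
p = permutation_pvalue(y, x)
```

Under H₀ the assignment of the Y's to the x's is exchangeable, so the proportion of permuted statistics exceeding the observed one is a valid P-value.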
6.3.2 Orthogonal Series

Suppose that each of the functions r₁, ..., r_p in the null model is in L²(0, 1), and let v₁*, v₂*, ... be an orthogonal basis for L²(0, 1). Define {v₁, v₂, ...} to be the collection of all v_j*'s that are not linear combinations of r₁, ..., r_p, and consider series estimators r̂_m(x) of r(x) having the form

    r̂_m(x) = Σ_{j=1}^p θ̂_{jm} r_j(x) + Σ_{j=1}^m b̂_{jm} v_j(x),  m = 0, 1, 2, ...,

where, for each m, θ̂_{1m}, ..., θ̂_{pm}, b̂_{1m}, ..., b̂_{mm} are the least squares estimators of θ₁, ..., θ_p, b₁, ..., b_m in the linear model

(6.8)    Y_i = Σ_{j=1}^p θ_j r_j(x_i) + Σ_{j=1}^m b_j v_j(x_i) + ε_i.

We may regard r̂_m(x) as a nonparametric estimator of r(x) whose smoothing parameter is m. For a given m ≥ 1, we could apply the reduction method (Section 5.4.1) to test the hypothesis (6.7) against the alternative model (6.8). The estimators r̂₀ and r̂_m would correspond to the reduced and full models, respectively. We will now argue that the reduction method is equivalent to a test based on a statistic of the form (6.3) with ĝ(·; S) an orthogonal series smooth of the residuals e₁, ..., e_n from the null model.

Using Gram-Schmidt orthogonalization (Rao, 1973, p. 10), we may construct linear combinations u₁, ..., u_{n−p} of r₁, ..., r_p, v₁, ..., v_{n−p} that satisfy the following orthogonality conditions:
    Σ_{i=1}^n r_j(x_i) u_k(x_i) = 0,  1 ≤ j ≤ p, 1 ≤ k ≤ n − p,

and

    n^{−1} Σ_{i=1}^n u_j(x_i) u_k(x_i) = δ_{jk},  1 ≤ j, k ≤ n − p.

These conditions imply that the least squares estimators of a₁, ..., a_m in the model
    Y_i = Σ_{j=1}^p θ_j r_j(x_i) + Σ_{j=1}^m a_j u_j(x_i) + ε_i
are

    â_j = n^{−1} Σ_{i=1}^n Y_i u_j(x_i),  j = 1, ..., n − p.
Let SSE₀ and SSE_a be as defined in Section 5.4.1 when (6.7) is tested by applying the reduction method to r₁, ..., r_p, u₁, ..., u_m. It follows that

    SSE₀ − SSE_a = n Σ_{j=1}^m â_j²

and that

    F_R = [(n − p − m)/m] · R_{mn} / (n − p − R_{mn}),
where

    R_{mn} = n Σ_{j=1}^m â_j² / σ̂²

and σ̂² = SSE₀/(n − p). Again using the orthogonality conditions, one can verify that

    R_{mn} = Σ_{i=1}^n ĝ²(x_i; m) / σ̂²,

where

    ĝ(x_i; m) = Σ_{j=1}^m â_j u_j(x_i).

Since â_j has the form

    â_j = n^{−1} Σ_{i=1}^n (e_i + Σ_{k=1}^p θ̂_k r_k(x_i)) u_j(x_i) = n^{−1} Σ_{i=1}^n e_i u_j(x_i),
we see that ĝ(x_i; m) is just a smooth of the residuals. So, the reduction test is a smoothing type test with the same general form as the statistic R_n of Section 6.2.1. Furthermore, recalling Section 5.6, we see that R_{mn} has the form of a Neyman smooth statistic in which the orthogonal functions are constructed to be orthogonal to the functions comprising the null model. Note that the reduction test uses a model-dependent estimator of variance, namely σ̂² = SSE₀/(n − p). By using different variance estimators in the denominator of R_{mn}, one can generate other tests.
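The construction can be checked numerically. In this sketch (my own, taking r₁ ≡ 1, r₂(x) = x and cosine v_j's as an assumed example), the u_j are obtained from a QR decomposition, which performs the Gram-Schmidt step, and the â_j computed from Y coincide with those computed from the null-model residuals:

```python
import numpy as np

n, p, m = 80, 2, 3
x = (np.arange(1, n + 1) - 0.5) / n
R = np.column_stack([np.ones(n), x])                    # null model: a line
V = np.sqrt(2) * np.cos(np.pi * np.outer(x, np.arange(2, 2 + m)))  # v_j's

# Gram-Schmidt via QR: columns of U are orthogonal to those of R and are
# orthonormal in the sense n^{-1} sum_i u_j(x_i) u_k(x_i) = delta_jk.
Q, _ = np.linalg.qr(np.column_stack([R, V]))
U = np.sqrt(n) * Q[:, p:]

rng = np.random.default_rng(4)
y = 1 + 2 * x + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(R, y, rcond=None)
e = y - R @ beta                     # residuals from the null model

a_from_y = U.T @ y / n               # a_j = n^{-1} sum_i Y_i u_j(x_i)
a_from_e = U.T @ e / n               # the same quantity from the residuals
g_smooth = U @ a_from_y              # g(x_i; m), a smooth of the residuals
```

The agreement of the two coefficient vectors is exactly the identity displayed above: since the u_j are orthogonal to the null-model functions, fitting them to Y or to the residuals makes no difference.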
6.3.3 Asymptotic Distribution Theory

In this section we will study the limiting distribution of statistics of the form R_n (Section 6.2.1). The behavior of R_n under both the null hypothesis and
local alternatives will be studied. We consider a particular linear model and a particular smoother in our asymptotic analysis. Since our results generalize in a straightforward way to more general models and smoothers, the results of this section provide quite a bit of insight.

Let r₁ be a known, twice continuously differentiable function on [0, 1] such that ∫₀¹ r₁(x) dx = 0, and consider testing the null hypothesis

(6.9)    H₀ : r(x) = θ₀ + θ₁ r₁(x),  0 ≤ x ≤ 1.

We will test H₀ using statistics based on the Priestley-Chao kernel smoother. For convenience we assume that x_i = (i − 1/2)/n, i = 1, ..., n. Let e₁, ..., e_n be the residuals from the least squares fit θ̂₀ + θ̂₁ r₁(x), and define the smoother

    ĝ_h(x) = (nh)^{−1} Σ_{i=1}^n e_i K((x − x_i)/h),

where the kernel K has support (−1, 1). We will study a test statistic of the following form:
    R_{n,h} = (nσ̂²)^{−1} Σ_{i=[nh]+1}^{n−[nh]} ĝ_h²(x_i).

The variance estimator σ̂² is any estimator that is consistent for σ² under H₀. The sum in R_{n,h} is restricted to avoid the complication of boundary effects.

We first investigate the behavior of the estimator ĝ_h when the null hypothesis (6.9) is true. We have
(6.10)    ĝ_h(x) = (nh)^{−1} Σ_{i=1}^n ε_i K((x − x_i)/h) + (θ₀ − θ̂₀)(nh)^{−1} Σ_{i=1}^n K((x − x_i)/h) + (θ₁ − θ̂₁)(nh)^{−1} Σ_{i=1}^n r₁(x_i) K((x − x_i)/h).

It follows that when H₀ is true, E{ĝ_h(x)} = 0 for each x. The quantities θ̂₀ − θ₀ and θ̂₁ − θ₁ are each O_p(n^{−1/2}), and so when h → 0 and nh → ∞, the dominant term in ĝ_h(x) is

    (nh)^{−1} Σ_{i=1}^n ε_i K((x − x_i)/h).

It follows immediately from results in Chapter 3 that, under H₀ and for each x ∈ (0, 1), ĝ_h(x) is a consistent estimator of 0 as h → 0 and nh → ∞. Consistency for 0 is also true when h is fixed as n → ∞, although in that case each of the three terms on the right-hand side of (6.10) is O_p(n^{−1/2}) and must be accounted for in describing the asymptotic behavior of ĝ_h(x).
Clearly, ĝ_h(x) estimates 0 more efficiently when h is fixed, a fact which is pertinent when considering local alternatives to H₀. The limiting distribution of R_{n,h} will be derived under the following local alternatives model:

(6.11)    r(x) = θ₀ + θ₁ r₁(x) + n^{−γ} g(x),

where 0 < γ ≤ 1/2 and ∫₀¹ g(x) dx = 0. We first consider the limiting distribution of R_{n,h} under model (6.11) when h → 0 and nh → ∞. A proof of the following theorem may be found in King (1988).
Theorem 6.1. Suppose that in model (6.11) ε₁, ..., ε_n are i.i.d. random variables having finite fourth moments. Assume that K is continuous everywhere and Lipschitz continuous on [−1, 1]. If h ~ Cn^{−a} for some a ∈ (0, 1), then for γ > (1/2)(1 − a/2) in (6.11)

    (nh R_{n,h} − B₁) / √(h B₂) →_d N(0, 1),

where

    B₁ = ∫_{−1}^1 K²(u) du  and  B₂ = 2 ∫_{−2}^2 (∫ K(u) K(u + z) du)² dz.

Furthermore, when γ = (1/2)(1 − a/2), a nominal size α test of (6.9) based on R_{n,h} has limiting power larger than α.
Theorem 6.1 shows that there is a direct link between the rate at which h tends to 0 and the maximal rate associated with R_{n,h}. When the alternative converges to the null relatively quickly (γ > (1/2)(1 − a/2)), the limiting power of the test based on R_{n,h} is nil, inasmuch as it equals the limiting level. Theorem 6.1 shows that the maximal rate of R_{n,h} is (1/2)(1 − a/2), in which case the limiting power is larger than the nominal level. By letting h tend to 0 arbitrarily slowly (i.e., a arbitrarily close to 0) the maximal rate can be made arbitrarily close to the parametric rate of 1/2. Of particular interest is the maximal rate in the case h ~ Cn^{−1/5}, which is the form of a mean squared error optimal bandwidth for twice differentiable functions. Here, the maximal rate is (1/2)(1 − 1/10) = 9/20. Theorem 6.1 also implies that the maximal rate for R_{n,h} is always at least 1/4. This is to be expected since R_{n,h} tends to the von Neumann type statistic V_n of Section 5.4.2 when h → 0 with n fixed. A slight extension of Theorem 5.2 shows that the maximal rate of V_n in the setting of the current section is 1/4.
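The constants B₁ and B₂ of Theorem 6.1 are easily evaluated by quadrature for any given kernel. The sketch below (my own, using the Epanechnikov kernel, for which ∫K² = 3/5) approximates both:

```python
import numpy as np

def trap(y, x):
    """Trapezoidal rule."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

def K(u):
    """Epanechnikov kernel, supported on (-1, 1)."""
    return np.where(np.abs(u) < 1, 0.75 * (1 - u ** 2), 0.0)

u = np.linspace(-1, 1, 4001)
B1 = trap(K(u) ** 2, u)                # B1 = int K^2(u) du  (= 3/5 here)

z = np.linspace(-2, 2, 801)
conv = np.array([trap(K(u) * K(u + zz), u) for zz in z])   # (K*K)(z)
B2 = 2 * trap(conv ** 2, z)            # B2 = 2 int (K*K)^2(z) dz
```

Since K is supported on (−1, 1), the convolution K*K is supported on (−2, 2), which fixes the integration range for B₂.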
The following theorem shows that alternatives converging to H₀ at the parametric rate of n^{−1/2} can be detected by a test based on R_{n,h} with a fixed value of h.

Theorem 6.2. Suppose that model (6.11) holds in which γ = 1/2, g is piecewise smooth, r₁(x) = x for all x, and the εᵢ's are independent random variables satisfying

    E(εᵢ) = 0 and Var(εᵢ) = σ², i = 1, 2, ...,

and for some constant M and some ν > 2

    E|εᵢ|^ν ≤ M, i = 1, 2, ....

Let K satisfy conditions 1-4 of Section 3.2, and define the constant I_g = 12 ∫₀¹ g(u)(u − 1/2) du. Define also the function Δ_h by

    Δ_h(s) = h^{−1} ∫_{−1}^1 K(u) K(u − s/h) du,  ∀ s.

If R_{n,h} corresponds to a test of H₀ : r(x) = θ₀ + θ₁x, it follows that for each h ∈ (0, 1/2)

    n R_{n,h} →_d σ^{−2} ∫_h^{1−h} W_g²(t) dt,

where {W_g(t) : 0 ≤ t ≤ 1} is a Gaussian process with mean function

    μ(t) = h^{−1} ∫₀¹ [g(u) − (u − 1/2) I_g] K((t − u)/h) du,  0 ≤ t ≤ 1,

and covariance kernel

    L(s, t) = σ² (Δ_h(s − t) − 12(s − 1/2)(t − 1/2) − 1),  0 ≤ s, t ≤ 1.
PROOF. Using Theorems 8.1 and 12.3 of Billingsley (1968) it is enough to (i) show that (√n ĝ_h(t₁), ..., √n ĝ_h(t_m)) converges in distribution to the appropriate multivariate normal distribution for any finite collection of t₁, ..., t_m in (0, 1) and (ii) verify the tightness criteria on p. 95 of Billingsley (1968). Defining ḡ_n = Σ_{i=1}^n g(x_i)/n and ε̄ = Σ_{i=1}^n εᵢ/n, it is easy to show that, up to asymptotically negligible terms,

    √n ĝ_h(t) = (√n h)^{−1} Σ_{i=1}^n εᵢ K((t − x_i)/h) − ε̄ (√n h)^{−1} Σ_{i=1}^n K((t − x_i)/h)
                − √n θ̃ (nh)^{−1} Σ_{i=1}^n (x_i − 1/2) K((t − x_i)/h)
                + (nh)^{−1} Σ_{i=1}^n [g(x_i) − ḡ_n − (x_i − 1/2) I_g] K((t − x_i)/h),

where θ̃ = 12 n^{−1} Σ_{j=1}^n (x_j − 1/2) ε_j. The deterministic piece of the last expression is clearly E[√n ĝ_h(t)]. By the piecewise smoothness of g and the Lipschitz continuity of K,

    E[√n ĝ_h(t)] = μ(t) + R_n(t),

where |R_n(t)| is bounded by (a constant) · n^{−1} for all t ∈ (0, 1). Straightforward calculations show that

    lim_{n→∞} Cov(√n ĝ_h(s), √n ĝ_h(t)) = L(s, t)  ∀ s, t ∈ [h, 1 − h].

The asymptotic normality of (√n ĝ_h(t₁), ..., √n ĝ_h(t_m)) may be demonstrated by using the Cramér-Wold device (Serfling, 1980, p. 18) in conjunction with a triangular array analog of the Lindeberg-Feller theorem and the moment conditions imposed on the εᵢ's. Having established the asymptotic normality of finite dimensional distributions, Theorems 8.1 and 12.3 of Billingsley (1968) imply that the process {√n ĝ_h(t) : h ≤ t ≤ 1 − h} converges in distribution to {W_g(t) : h ≤ t ≤ 1 − h} so long as the tightness criteria on p. 95 of Billingsley (1968) hold. These criteria are satisfied in our case if the sequence {√n ĝ_h(h) : n = 1, 2, ...} is tight and if, for all n,

(6.12)    E|√n ĝ_h(s) − √n ĝ_h(t)|² ≤ B(s − t)²  ∀ s, t ∈ [h, 1 − h],

where B is some positive constant. The tightness of {√n ĝ_h(h) : n = 1, 2, ...} can easily be proven using the fact that the mean and variance of √n ĝ_h(h) converge to finite constants as n → ∞. The bound in (6.12) is also easily established by using the boundedness of the function g and the Lipschitz continuity of K. □
Theorem 6.2 implies that whenever μ is not identically 0 on (h, 1 − h), the power of a size-α test based on R_{n,h} converges to a number larger than α. The mean function μ is a convolution of the kernel with the difference between g and its best linear approximation. Whenever g is not identical to a line, there exists an h such that μ is not identically 0. Hence, there exists an h such that the R_{n,h}-based test has a maximal rate of 1/2, meaning that R_{n,h} can detect alternatives that converge to H₀ at the parametric rate of n^{−1/2}.
It is sometimes difficult to know what, if anything, asymptotic results tell us about the practical setting in which we have a single set of data. It is tempting to draw conclusions from Theorems 6.1 and 6.2 about the size of bandwidth that maximizes power. Faced with a given set of data, though, it is probably best to keep an open mind about the value of h that is "best." The optimal bandwidth question will be considered more closely in Section 6.4. We can be somewhat more definitive about the practical implications of Theorems 6.1 and 6.2 concerning test validity. These theorems imply that the limiting distribution of statistics of the type Rn can be very sensitive to the choice of smoothing parameter. For example, the asymptotic distribution of R_{n,h} has three distinct forms depending on the size of h. The three forms correspond to very large, very small and intermediate-sized bandwidths. When h is fixed as n → ∞, R_{n,h} converges in distribution to a functional of a continuous time Gaussian process, as described in Theorem 6.2. When h → 0 and nh → ∞, R_{n,h} is asymptotically normal (Theorem 6.1), whereas if h = 0, R_{n,h} is asymptotically normal but with norming constants of a different form than in the case nh → ∞ (Theorem 5.1). Practically speaking, these three distinct limit distributions suggest that we should use a method of approximating the null distribution that "works" regardless of the size of h. This was our motivation for advocating the bootstrap to approximate critical values of R_S in Section 6.3.1. It is worthwhile to point out that the conclusions reached in this section extend in a fairly obvious way to more general linear hypotheses and more general linear smoothers. Broadly speaking, the only way a smoother can attain a maximal rate of 1/2 is by fixing its smoothing parameter as n → ∞. In other words, when an estimator's smoothing parameter S is chosen to be mean squared error optimal, the maximal rate of the corresponding test based on R_S will generally be less than 1/2.
6.4 The Effect of Smoothing Parameter

The tests discussed in this chapter depend upon a smoothing parameter. To obtain a test with a prescribed level of significance, the smoothing parameter should be fixed before the data are examined. If several tests corresponding to different smoothing parameters are conducted, one runs into the same sort of problem encountered in multiple comparisons. If the null hypothesis is to be rejected when at least one of the test statistics is "significant," then the significance levels of the individual tests will have to be adjusted so that the overall probability of a type I error is equal to the prescribed value. By using the bootstrap one can ensure approximate validity of any test based on a single smoothing parameter value. The key issue then is the effect that choice of smoothing parameter has on power. In Section 6.4.1
we compute power as a function of bandwidth in some special cases to provide insight on how the type of regression curve affects the bandwidth maximizing power. In practice the insight of Section 6.4.1 will not be useful unless one has some knowledge about the true regression function. For cases where such knowledge is unavailable, it is important to have a data-based method of choosing the smoothing parameter. Section 6.4.2 introduces a device known as the significance trace that provides at least a partial solution to this problem.
6.4.1 Power

Here we shall get an idea of how curve shape affects the bandwidth that maximizes power. Consider testing the null hypothesis that r is a constant function, in which case the residuals are simply eᵢ = Yᵢ − Ȳ, i = 1, …, n. We assume that model (6.1) holds with xᵢ = (i − 1/2)/n, i = 1, …, n, and investigate power of the test based on the statistic

$$R_n(h) = \sum_{i=1}^{n} \hat g_h^2(x_i),$$

where ĝ_h is a local linear smooth (of residuals) that uses an Epanechnikov kernel and bandwidth h. Simulation was used to approximate the power of this test against the alternative functions

$$r_1(x) = 20\left[(x/2)^2(1 - x/2)^2 - 1/30\right], \quad 0 \le x \le 1,$$

and

$$r_3(x) = \begin{cases} .557\left[50(2x-1)(2x)^{10} + (x - 1/2)\right], & 0 \le x < 1/2 \\ .557\left[50(2x-1)(2-2x)^{10} + (x - 1/2)\right], & 1/2 \le x \le 1. \end{cases}$$
These functions are such that ∫₀¹ rᵢ(x) dx = 0 and ∫₀¹ rᵢ²(x) dx ≈ .19, i = 1, 2, 3. The sample size n was taken to be fifty and ε₁, …, εₙ were i.i.d. standard normal random variables. Ten thousand replications were used to approximate the .05 level critical value for R_n(h) at each of fifty evenly spaced values of h between .04 and 2. One thousand replications were then used to approximate power as a function of h for each of r₁, r₂ and r₃. The three regression functions and the corresponding empirical power curves are shown in Figure 6.1. In each case the solid and dashed vertical lines indicate, respectively, the maximizer of power and the minimizer of mean average squared error for the local linear estimator.

FIGURE 6.1. Functions and Corresponding Power Curves as a Function of Bandwidth.

The agreement between the two optimal bandwidths is remarkable. In the case of r₁, power is maximized at the largest bandwidth since the regression function is almost linear and the local linear smoother is a straight line for large h. For r₂, power is maximized at h = .32 and then decreases monotonically for larger values of h. Since r₂ is symmetric about .5, the local linear smoother is an unbiased estimator of a flat line for large h, implying that the power of the test based on R_n(h) will be close to its level when h is large. By contrast, r₃ contains an overall upward trend, and so here the power at
large bandwidths is much larger than .05. In fact, the power at h = 2 is larger than it is for a range of intermediate bandwidths. The two peaks of r₃ induce maximum power at a smaller bandwidth of about .13. The previous study is consistent with what intuition would suggest about the bandwidth that maximizes power. Roughly speaking, one would expect the size of the optimal bandwidth to be proportional to the smoothness of the underlying function. In other words, very smooth functions require larger bandwidths than do less smooth functions, all other factors being equal. The examples are also consistent with our claim in Section 6.2.1 that "smooth" statistics usually have more power than ones based on no smoothing. It is unclear whether or not the bandwidths maximizing power and minimizing MASE tend to agree closely in general, as they did in the above study. Agreement of the two bandwidths would suggest that one estimate an optimal testing bandwidth by using one of the methods discussed in Chapter 4, each of which provides an estimator of the MASE minimizer. The resulting test statistic would have the form Rn(ĥ), where ĥ is a statistic. Tests of this general flavor will be the topic of Chapters 7 and 8. It is important to note at this point that the randomness of ĥ can have a profound influence on the sampling distribution of Rn(ĥ). It is therefore not advisable to approximate critical values for Rn(ĥ) by pretending that ĥ is fixed. An alternative way of avoiding an arbitrary choice of bandwidth is the subject of the next section.
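The simulation study described above can be sketched in code. The following is a minimal version: it assumes the statistic is R_n(h) = Σᵢ ĝ_h²(xᵢ) (the exact normalization of R_n(h) is an assumption of this sketch), and uses far fewer Monte Carlo replications than the book's study.

```python
import numpy as np

def epanechnikov(u):
    # Epanechnikov kernel, supported on [-1, 1]
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def local_linear_smooth(x, y, h):
    # Local linear smooth evaluated at the design points themselves
    D = x[None, :] - x[:, None]            # D[k, i] = x_i - t_k
    W = epanechnikov(D / h)
    S1 = (W * D).sum(axis=1, keepdims=True)
    S2 = (W * D**2).sum(axis=1, keepdims=True)
    B = W * (S2 - D * S1)                  # standard local linear weights
    return (B * y).sum(axis=1) / B.sum(axis=1)

def Rn(y, x, h):
    e = y - y.mean()                       # residuals from the constant fit
    return np.sum(local_linear_smooth(x, e, h)**2)

rng = np.random.default_rng(0)
n, h = 50, 0.5
x = (np.arange(1, n + 1) - 0.5) / n
null_stats = np.sort([Rn(rng.standard_normal(n), x, h) for _ in range(1000)])
crit = null_stats[int(0.95 * len(null_stats))]   # Monte Carlo .05 critical value

r1 = 20 * ((x / 2)**2 * (1 - x / 2)**2 - 1 / 30)
power = np.mean([Rn(r1 + rng.standard_normal(n), x, h) > crit for _ in range(400)])
```

Repeating the last two lines over a grid of h values yields an empirical power curve of the kind summarized in Figure 6.1.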
6.4.2 The Significance Trace

King, Hart and Wehrly (1991) proposed a partial means of circumventing the bandwidth selection dilemma in testing problems. They proposed that one compute P-values corresponding to several different choices of the smoothing parameter. The question of bandwidth selection becomes moot if all P-values are less than or all greater than the prescribed level of significance. This idea was proposed independently by Young and Bowman (1995), who termed a plot of P-values versus bandwidth a significance trace. We illustrate the use of a significance trace using data borrowed from Cleveland (1993). The data consist of 355 observations and come from an experiment at the University of Seville (Bellver, 1987) on the scattering of sunlight in the atmosphere. The Y-variable is Babinet point, the scattering angle at which the polarization of sunlight vanishes, and the x-variable is the cube root of particulate concentration in the air. The local linear smooth in Figure 6.2 seems to indicate some curvature in the relationship between average Babinet point and cube root of particulate concentration. Suppose that we test the null hypothesis that the regression function is a straight line by using a statistic Rn based on a local linear smoother.
FIGURE 6.2. The Babinet Data and Local Linear Smooth.
Figure 6.3 shows significance traces computed from three sets of data. From top to bottom, the graphs correspond respectively to 75, 100 and 200 observations randomly selected from the full set of 355. In each case the bootstrap was used to approximate P-values. Five hundred bootstrap samples were generated from each of the three data sets, and Rn was computed at twenty different values of the smoothing parameter h for each bootstrap sample. For a significance level of .05, the graphs illustrate the three cases that arise in using the significance trace. The top and bottom cases are definitive since regardless of the choice of h, the statistic Rn would lead to nonrejection of H₀ in the former case and rejection in the latter. The middle graph is ambiguous in that H₀ would be rejected for large values of the smoothing parameter but not for smaller ones. Interestingly, though, each of the graphs is consistent with the insight obtained in Section 6.4.1. Figure 6.2 suggests that the ostensible departure from linearity is low frequency; hence the tests based on less smoothing should be less powerful than those based on more smoothing.
FIGURE 6.3. Significance Traces for Babinet Data. From top to bottom the graphs correspond to sample sizes of 75, 100 and 200.
6.5 Historical and Bibliographical Notes

The roots of lack-of-fit tests based on nonparametric smoothers exist in the parallel goodness-of-fit problem. As discussed in Section 5.6, smoothing-based goodness-of-fit tests can be traced at least as far back as Neyman (1937). The explicit connection, though, between Neyman smooth tests and tests based on nonparametric function estimation ideas seems not to have been made until quite recently. The use of components of omnibus goodness-of-fit tests (Durbin and Knott, 1972) is closely related to Neyman's idea of smooth tests. Eubank, LaRiccia and Rosenstein (1987) studied the components-based approach and refer to the "intimate relationship between (Fourier series) density estimation and the problem of goodness of fit." A comprehensive treatment of smooth goodness-of-fit tests may be found in Rayner and Best (1989) and a review of work on the subject in Rayner and Best (1990). Two early references on the use of kernel smoothers in testing goodness of fit are Bickel and Rosenblatt (1973) and Rosenblatt (1975). A more recent article on the same subject is that of Ghosh and Huang (1991). In the regression setting the first published paper on testing model fit via nonparametric smoothers appears to be Yanagimoto and Yanagimoto (1987), who test the fit of a straight line model by using cubic spline smoothers. Ironically, this first paper makes use of data-driven smoothing parameters, whereas most of the papers that followed dealt with the conceptually simpler case of linear smoothers, as discussed in this chapter. Tests utilizing splines with fixed smoothing parameters have been proposed by Cox, Koh, Wahba and Yandell (1988), Cox and Koh (1989), Eubank and Spiegelman (1990) and Chen (1994a, 1994b). Early work on the use of kernel smoothers in testing for lack of fit includes that of Azzalini, Bowman and Härdle (1989), Hall and Hart (1990), Raz (1990), King, Hart and Wehrly (1991), Müller (1992) and Härdle and Mammen (1993).

Cleveland and Devlin (1988) proposed diagnostics and tests of model fit in the context of local linear estimation. Smoothing-based tests that use local likelihood ideas have been investigated by Firth, Glosup and Hinkley (1991) and Staniswalis and Severini (1991). A survey of smoothing-based tests is provided in Eubank, Hart and LaRiccia (1993), and Eubank and Hart (1993) demonstrate the commonality of some classical and smooth tests.
7 Testing for Association via Automated Order Selection
7.1 Introduction

The tests in Chapter 6 assumed a fixed smoothing parameter. In Chapters 7 and 8 we will discuss tests based on data-driven smoothing parameters. The current chapter deals with testing the "no-effect" hypothesis, and Chapter 8 treats more general parametric hypotheses. The methodology proposed in Chapter 7 makes use of an orthogonal series representation for r. In principle any series representation could be used, but for now we consider only trigonometric series. This is done for the sake of clarity and to make the ideas less abstract. Section 7.8 discusses the use of other types of orthogonal series. Our interest is in testing the null hypothesis

(7.1)   H₀ : r(x) = C for each x ∈ [0, 1],
where C is an unknown constant. This is the most basic example of the lack-of-fit scenario, wherein the model whose fit is to be tested is simply "r = C." Hypothesis (7.1) will be referred to as "the no-effect hypothesis," since under our canonical regression model it entails that x has no effect on Y. The simplicity of (7.1) will yield a good deal of insight that would be harder to attain were we to begin with a more general case. We note in passing that the methodology in this chapter can be used to test any hypothesis of the form H₀ : r(x) = C + r₀(x), where r₀ is a completely specified function. This is done by applying any of the tests in this chapter to the data Zᵢ = Yᵢ − r₀(xᵢ), i = 1, …, n, rather than to Y₁, …, Yₙ. We assume a model of the form
(7.2)   $$Y_j = r(x_j) + \epsilon_j, \quad j = 1, \ldots, n,$$
where xⱼ = (j − 1/2)/n, j = 1, …, n, and ε₁, …, εₙ are independent and identically distributed random variables with E(ε₁) = 0 and Var(ε₁) = σ². Assuming the design points to be evenly spaced is often reasonable for purposes of testing (7.1), as we now argue. Consider unevenly spaced design points x₁′, …, xₙ′ that nonetheless satisfy xⱼ′ = Q[(j − 1/2)/n],
j = 1, …, n, for some monotone increasing quantile function Q that maps [0, 1] onto [0, 1]. Then

$$Y_j = r\!\left[Q\!\left(\frac{j - 1/2}{n}\right)\right] + \epsilon_j, \quad j = 1, \ldots, n,$$
and r(x) = C for all x if and only if r[Q(u)] = C for all u. Therefore, we can test r for constancy by testing r[Q(·)] for constancy; but r[Q(·)] can be estimated by regressing Y₁, …, Yₙ on the evenly spaced design points (j − 1/2)/n, j = 1, …, n. Parzen (1981) refers to r[Q(·)] as the regression quantile function. If we assume that r is piecewise smooth on [0, 1], then at all points of continuity x, it can be represented as the Fourier series

$$r(x) = C + 2\sum_{j=1}^{\infty} \phi_j \cos(\pi j x),$$
with Fourier coefficients

(7.3)   $$\phi_j = \int_0^1 r(x)\cos(\pi j x)\,dx, \quad j = 1, 2, \ldots.$$
For piecewise smooth functions, then, hypothesis (7.1) is equivalent to φ₁ = φ₂ = ⋯ = 0; therefore, it is reasonable to consider tests of (7.1) that are sensitive to nonzero Fourier coefficients. The test statistics to be considered are functions of sample Fourier coefficients. We shall take as our estimator of φⱼ

(7.4)   $$\hat\phi_j = \frac{1}{n}\sum_{i=1}^{n} Y_i \cos(\pi j x_i), \quad j = 1, \ldots, n-1.$$
This definition of φ̂ⱼ is different from that in Section 2.4; however, for evenly spaced designs the two estimators are practically identical. For our design xᵢ = (i − 1/2)/n, i = 1, …, n, definition (7.4) is the least squares estimator of φⱼ. We may estimate r(x) by the simple truncated series

(7.5)   $$\hat r(x; m) = \hat C + 2\sum_{j=1}^{m} \hat\phi_j \cos(\pi j x), \quad x \in [0, 1],$$

where Ĉ = Σᵢ Yᵢ/n and the truncation point m is some non-negative integer less than n. Clearly, if H₀ is true, the best choice for m is 0, whereas under the alternative that |φⱼ| > 0 for some j, the best choice (for all n sufficiently large) is at least 1. In Chapter 4 we discussed a data-driven truncation point m̂ that estimates a "best" choice for m. It makes sense that if m̂ is 0, then there is little evidence in support of the alternative hypothesis, whereas if m̂ ≥ 1 the data tend to favor the alternative. This simple observation motivates all the tests to be defined in this chapter.
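The estimators (7.4) and (7.5) are straightforward to compute; the sketch below is one direct, vectorized translation (the function names are inventions of this sketch):

```python
import numpy as np

def fourier_coefficients(y):
    # phi_hat_j = (1/n) sum_i Y_i cos(pi j x_i), j = 1, ..., n-1, as in (7.4)
    n = len(y)
    x = (np.arange(1, n + 1) - 0.5) / n
    j = np.arange(1, n)
    return np.cos(np.pi * np.outer(j, x)) @ y / n

def truncated_series(y, m, tgrid):
    # r_hat(t; m) = C_hat + 2 sum_{j<=m} phi_hat_j cos(pi j t), as in (7.5)
    phi = fourier_coefficients(y)
    j = np.arange(1, m + 1)
    return y.mean() + 2 * np.cos(np.pi * np.outer(tgrid, j)) @ phi[:m]

# orthogonality check on the design: y = cos(3 pi x) has phi_hat_3 = 1/2 exactly
n = 50
x = (np.arange(1, n + 1) - 0.5) / n
y = np.cos(3 * np.pi * x)
phi = fourier_coefficients(y)
```

Because Σᵢ cos²(πjxᵢ)/n = 1/2 and the cross terms vanish on this design (see Section 7.2), φ̂₃ = 1/2 exactly and the m = 3 truncated series reproduces y.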
From one perspective, the series r̂( · ; m) is simply a nonparametric estimator of the regression function r. However, we may also think of functions of the form

$$C + 2\sum_{j=1}^{m} a_j \cos(\pi j x), \quad 0 \le x \le 1,$$

as a model for r, wherein the quantity m represents model dimension. This is an important observation since the discussion in the previous paragraph suggests a very general way of testing the fit of a model. If model dimensions 0 and d > 0 correspond respectively to null and alternative hypotheses, and if a statistic d̂ is available for estimating model dimension, then it seems reasonable to base a test of the null hypothesis on d̂. Many modeling problems fall into the general framework of testing d = 0 versus d > 0; examples include problems for which the reduction method is appropriate, testing whether a time series is white noise against the alternative that it is autoregressive of order d > 0, or any other setting where one considers a collection of nested models. Recall the MISE-based criterion Ĵ_m introduced in Section 4.2.2:

$$\hat J_0 = 0, \qquad \hat J_m = \sum_{j=1}^{m} \frac{2n\hat\phi_j^2}{\hat\sigma^2} - 2m, \quad m = 1, \ldots, n-1.$$
The statistic m̂ is the maximizer of Ĵ_m over m = 0, 1, …, n − 1. A number of different tests have been inspired by the criterion Ĵ_m. These will be discussed in Sections 7.3 and 7.6. For now we mention just two. One possible test rejects H₀ for large values of m̂. It turns out that the limiting null distribution (as n → ∞) of m̂ has support {0, 1, 2, …}, with lim_{n→∞} P(m̂ = 0) ≈ .712, lim_{n→∞} P(0 ≤ m̂ ≤ 4) ≈ .938 and lim_{n→∞} P(0 ≤ m̂ ≤ 5) ≈ .954. This knowledge allows one to construct an asymptotically valid test of H₀ of any desired size. In particular, a test of size .05 would reject H₀ if and only if m̂ ≥ 6. A second possible test rejects H₀ for large values of Ĵ_m̂, a statistic that will be discussed in Section 7.6.3.
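The criterion Ĵ_m and the order estimate m̂ can be sketched as follows. The variance estimator below, built from the upper half of the squared coefficients (which satisfy E(2nφ̂ⱼ²) = σ² under H₀), is an assumption of this sketch rather than the book's own choice:

```python
import numpy as np

def mhat(y):
    # maximize J_m = sum_{j<=m} 2n phi_j^2 / sigma2 - 2m over m = 0, ..., n-1
    n = len(y)
    x = (np.arange(1, n + 1) - 0.5) / n
    j = np.arange(1, n)
    phi = np.cos(np.pi * np.outer(j, x)) @ y / n
    sigma2 = np.mean(2 * n * phi[(n - 1) // 2:]**2)   # assumed variance estimate
    J = np.concatenate([[0.0], np.cumsum(2 * n * phi**2 / sigma2 - 2)])
    return int(np.argmax(J))

rng = np.random.default_rng(5)
n = 100
x = (np.arange(1, n + 1) - 0.5) / n
m_alt = mhat(np.cos(np.pi * x) + 0.2 * rng.standard_normal(n))   # strong signal
m_null = mhat(rng.standard_normal(n))                            # pure noise
# the asymptotic size-.05 order selection test rejects H0 iff mhat >= 6
```

Under the strong one-frequency alternative, Ĵ₁ is very large, so m̂ ≥ 1 essentially always; under the null, m̂ = 0 with limiting probability about .712.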
7.2 Distributional Properties of Sample Fourier Coefficients

In order to derive the distribution of subsequent test statistics, it is necessary to understand distributional properties of the sample Fourier coefficients φ̂₁, …, φ̂_{n−1}. Our main concern is with the null distribution, and so in this section we assume that the null hypothesis (7.1) is true. More general properties of sample Fourier coefficients were discussed in Section 3.3.1.
First of all, E(φ̂ⱼ) = 0 for each j ≥ 1 and, for i, j ≥ 1,

$$\mathrm{Cov}(\hat\phi_i, \hat\phi_j) = \begin{cases} \sigma^2/(2n), & i = j \\ 0, & i \ne j. \end{cases}$$

These facts derive from the orthogonality properties

$$\sum_{i=1}^{n} \cos(\pi j x_i)\cos(\pi k x_i) = 0, \quad j, k = 0, 1, \ldots, n-1,\ j \ne k,$$

and from

$$\frac{1}{n}\sum_{i=1}^{n} \cos^2(\pi j x_i) = \frac{1}{2}, \quad j = 1, \ldots, n-1.$$
When the Ei's are i.i.d. Gaussian, it follows that ¢ 1, ... , :Pn-1 are i.i.d. Gaussian with mean 0 and variance CY 2 / (2n). More generally, we may use the Lindeberg-Feller theorem and the Cramer-Wold device to establish that, for fixed m, vn(¢1, '¢m) converges in distribution to an m-variate normal distribution with mean vector 0 and variance matrix (CY 2/2)Im, where Im is the m X m identity. Define the normalized sample Fourier coefficients ¢N,1, ... , J;N,n-1 by 0
;;,
0 -
'1-'N,• -
0
0
v'2n¢i ff ,
i = 1, ... , n- 1,
where ff is any weakly consistent estimator of CY. Consider a test statistic S that is a function of ¢N,1, ... , J;N,m, i.e., S = S(J;N,1, ... , ¢N,m)· Then, if S is a continuous function, m is fixed and the null hypothesis (7.1) holds, S converges in distribution to S(Z1, ... , Zm), where Z1, ... , Zm are i.i.d. N(O, 1) random variables. An important example of the last statement is the Neyman smooth statistic m
S = ""A2 ~c/JN,j> j=1
x;;,.
whose limiting distribution under (7.1) is To obtain the limiting distributions of some other statistics, such as m and Jm,, it is not enough to know the limiting distribution of ¢ 1, ... , ¢m for a fixed m. The following theorem is an important tool for the case where fixing m does not suffice.
Theorem 7.1. Suppose that in model (7.2) r is constant and the εᵢ's are independent and identically distributed with finite fourth moments. For each m ≥ 1, let B_m denote the collection of all Borel subsets of ℝ^m, and for any A ∈ B_m define P_{mn}(A) and P_m(A) by

$$P_{mn}(A) = P\left[\left(\sqrt{2n}\,\hat\phi_1/\sigma, \ldots, \sqrt{2n}\,\hat\phi_m/\sigma\right) \in A\right]$$

and

$$P_m(A) = P\left[(Z_1, \ldots, Z_m) \in A\right],$$

where Z₁, …, Z_m are i.i.d. standard normal random variables. Then for all m and n

$$\sup_{A \in B_m} \left|P_{mn}(A) - P_m(A)\right| \le \frac{a(m)\,m^2}{\sqrt{n}},$$

where a(m) is a constant that depends only on m. Theorem 7.1 is essentially an application of a multivariate Berry–Esseen theorem of Bhattacharya and Ranga Rao (1976) (Theorem 13.3, p. 118). To approximate the distribution of the sample Fourier coefficients by that of i.i.d. normal random variables, we wish for the bound in Theorem 7.1 to tend to zero as n and m tend to ∞. Since a(m) tends to increase with m, it is clear that m will have to increase more slowly than n^{1/4}. Fortunately, in order to establish the limiting distribution of the statistics of interest, it suffices to allow m to increase at an arbitrarily slow rate with n. Clearly, there exists an increasing, unbounded sequence of integers {m_n : n = 1, 2, …} such that a(m_n)m_n²/√n → 0 as n → ∞.
7.3 The Order Selection Test

The no-effect hypothesis says that r is identical to a constant. The nonparametric regression estimate r̂( · ; m̂) is nonconstant if and only if m̂ > 0. These facts lead us to investigate tests of no-effect that are based on the statistic m̂. The form of Ĵ_m along with Theorem 7.1 suggest that, as n → ∞, m̂ converges in distribution to m̃, the maximizer of the random walk {S(m) : m = 0, 1, …}, where

$$S(0) = 0, \qquad S(m) = \sum_{j=1}^{m} Z_j^2 - 2m, \quad m = 1, 2, \ldots,$$

and Z₁, Z₂, … are i.i.d. standard normal random variables. Theorem 7.2 below provides conditions under which this result holds. The ensuing proof is somewhat more concise than the proof of a more general result in Eubank and Hart (1992).
Theorem 7.2. Suppose that in model (7.2) r is constant and the εᵢ's are independent and identically distributed with finite fourth moments. Let Z₁, Z₂, … be a sequence of independent and identically distributed standard normal random variables, and define m̃ to be the maximizer of S(m) with respect to m, where S(0) = 0 and S(m) = Σⱼ₌₁ᵐ (Zⱼ² − 2), m ≥ 1. It follows that the statistic m̂ converges in distribution to m̃ as n → ∞.

PROOF. For any non-negative integer m we must show that P(m̂ = m) → P(m̃ = m) as n → ∞. Define, for any positive random variable a, the event E_m(a) by

E_m(a) = …

For γ > 1, E(Zⱼ²) − γ < 0 and so S(m; γ) → −∞ with probability 1 as m → ∞, guaranteeing that the process has a maximum. The level of test (7.10) is simply 1 − P(m̂_γ = 0). If m̃_γ is the maximizer of S(m; γ) over 0, 1, …, then

$$\lim_{n\to\infty} P(\hat m_\gamma = 0) = P(\tilde m_\gamma = 0).$$
A remarkable formula due to Spitzer (1956, p. 335) allows us to obtain an arbitrarily good approximation to P(m̃_γ = 0). His formula implies that

(7.11)   $$P(\tilde m_\gamma = 0) = \exp\left\{ -\sum_{j=1}^{\infty} \frac{1}{j}\, P\!\left(\chi_j^2 > j\gamma\right) \right\} \overset{\text{def}}{=} F_{OS}(\gamma),$$

where χ²ⱼ is a random variable having the χ² distribution with j degrees of freedom, and the subscript OS stands for "order selection." If one desires a test having asymptotic level α, one simply sets 1 − α equal to F_OS(γ) and solves numerically for γ. It is not difficult to see that F_OS is increasing in γ > 1, and hence the solution to the equation 1 − α = F_OS(γ) is unique. In fact, F_OS is the cumulative distribution function of an absolutely continuous random variable having support (1, ∞), a fact to be exploited in the next section.
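Formula (7.11) is easy to evaluate numerically by truncating the series. A sketch (the truncation point M = 200 is an arbitrary choice of this sketch, and scipy's `chi2.sf` supplies the tail probabilities):

```python
import numpy as np
from scipy.stats import chi2

def F_os(gamma, M=200):
    # F_OS(gamma) = exp{-sum_{j>=1} P(chi2_j > j*gamma)/j}, truncated at M terms
    j = np.arange(1, M + 1)
    return float(np.exp(-np.sum(chi2.sf(j * gamma, j) / j)))

p0 = F_os(2.0)   # limiting P(mhat = 0) for the penalty gamma = 2; about .712

# solve F_os(gamma) = .95 by bisection (F_os is increasing in gamma > 1)
lo, hi = 1.01, 10.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if F_os(mid) < 0.95 else (lo, mid)
gamma_05 = 0.5 * (lo + hi)   # critical gamma for an asymptotic .05-level test
```

The value `p0` matches the limit lim P(m̂ = 0) ≈ .712 quoted in Section 7.1, which corresponds to taking γ = 2, the penalty appearing in Ĵ_m.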
7.4 Equivalent Forms of the Order Selection Test
7.4.1 A Continuous-Valued Test Statistic

Data analysts often desire a P-value to accompany a test of hypothesis, the P-value being the smallest level of significance for which H₀ would be rejected given the observed data. Owing to the discrete nature of the statistic m̂_γ, finding a P-value for the order selection test by using its definition in Section 7.3 is awkward. However, an equivalent form of the test makes computation of the P-value relatively straightforward. This alternative form is
also helpful for understanding other aspects of the test, such as power, as we will see in Section 7.7. Note that m̂_γ equals 0 if and only if

$$\frac{1}{m}\sum_{j=1}^{m} \frac{2n\hat\phi_j^2}{\hat\sigma^2} \le \gamma, \quad m = 1, \ldots, n-1,$$

which is equivalent to

$$T_n \overset{\text{def}}{=} \max_{1 \le m \le n-1} \frac{1}{m}\sum_{j=1}^{m} \frac{2n\hat\phi_j^2}{\hat\sigma^2} \le \gamma.$$
Therefore the order selection test is equivalent to the test that rejects the no-effect hypothesis for large values of the statistic T_n. If the observed value of T_n is t, then the P-value is 1 − F_n(t), where F_n is the cdf of T_n under the null hypothesis. A large sample approximation to the P-value is 1 − F_OS(t). Note that F_OS is the cdf of the random variable T, where

$$T = \sup_{m \ge 1} \frac{1}{m}\sum_{j=1}^{m} Z_j^2$$

and Z₁, Z₂, … are i.i.d. standard normal random variables. The support of T is (1, ∞), since the strong law of large numbers entails that m⁻¹ Σⱼ₌₁ᵐ Zⱼ² → 1, almost surely, as m → ∞. It is shown in the Appendix that
$$\left|F_{OS}(t) - F(t; M)\right| \le \frac{(M+1)^{-1}\,\theta_t^{M+1}}{1 - \theta_t},$$

where

$$F(t; M) = \exp\left\{ -\sum_{j=1}^{M} \frac{1}{j}\, P\!\left(\chi_j^2 > jt\right) \right\}$$

and θ_t = exp(−[(t − 1) − log t]/2). This allows one to determine F_OS(t) to any desired accuracy. For example, if the statistic T_n was observed to be 3, we can approximate the P-value by 1 − F(3; 15) = .119, which agrees with 1 − F_OS(3) to the number of decimal places shown. A graph of F_OS is shown in Figure 7.2.
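In the same way, T_n and its large-sample P-value 1 − F(T_n; M) can be computed directly (a sketch; σ̂² is treated as given, and the function names are inventions of this sketch):

```python
import numpy as np
from scipy.stats import chi2

def F_trunc(t, M=15):
    # F(t; M) = exp{-sum_{j<=M} P(chi2_j > j*t)/j}
    j = np.arange(1, M + 1)
    return float(np.exp(-np.sum(chi2.sf(j * t, j) / j)))

def Tn(y, sigma2):
    # T_n = max_{1<=m<=n-1} (1/m) sum_{j<=m} 2n phi_hat_j^2 / sigma2
    n = len(y)
    x = (np.arange(1, n + 1) - 0.5) / n
    j = np.arange(1, n)
    phi = np.cos(np.pi * np.outer(j, x)) @ y / n
    return float(np.max(np.cumsum(2 * n * phi**2 / sigma2) / j))

pval_at_3 = 1 - F_trunc(3.0)   # .119, as in the text

rng = np.random.default_rng(7)
n = 100
x = (np.arange(1, n + 1) - 0.5) / n
t_obs = Tn(np.cos(np.pi * x) + rng.standard_normal(n), sigma2=1.0)
pval = 1 - F_trunc(t_obs)      # tiny P-value under a strong alternative
```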
7.4.2 A Graphical Test

The outcome of the order selection test can be related graphically in a rather appealing way. Consider the Fourier series estimator r̂(x; m̂_γ), where γ is chosen so that the order selection test has a desired size. The hypothesis of a constant regression function is rejected if and only if the smoother r̂(x; m̂_γ) is nonconstant. This follows from two facts: (i) when m̂_γ = 0,
0 for some j ;::: 1. In this way the no-effect hypothesis is true if and only if m = 0.
The sample Fourier coefficients φ̂₀, φ̂₁, …, φ̂_{n−1} are sufficient statistics, and for i.i.d. Gaussian errors the likelihood function is f(φ̂₀, …, φ̂_{n−1} | φ₀, …, φ_{n−1}, σ), σ > 0. The distribution (7.21) simplifies to

$$\pi(\phi_1, \ldots, \phi_m \mid m) = \frac{\Gamma((m+1)/2)}{\Gamma(1/2)} \left(\frac{1}{2\pi}\right)^{m/2} \left(\prod_{i=1}^{m} a_{mi}\right)^{-1} \left(1 + \frac{1}{2}\sum_{i=1}^{m} \frac{\phi_i^2}{a_{mi}^2}\right)^{-(m+1)/2},$$

which is an m-variate t distribution. In the form (7.21) we see that this prior amounts to assuming that, conditional on s and m, φ₁, …, φ_m are independent with φᵢ distributed N(0, s²a²ₘᵢ), i = 1, …, m, and taking s to have prior g. Note that g is proper but has infinite mean. At this point we do not specify a prior for m, since doing so is not necessary for deriving the form of the posterior distribution. One possibility would be to use Rissanen's (1983) noninformative prior for the number of parameters in a statistical model. A simple convenience prior is the geometric distribution

(7.22)   π(m) = pᵐ(1 − p),   m = 0, 1, …,
for some p in (0, 1). An advantage of (7.22) is that it allows one complete freedom in specifying the prior probability of the no-effect hypothesis. That hypothesis is true if and only if m = 0; hence we should take p = 1 − π₀ if our prior probability for the null hypothesis is π₀. A "proper" Bayesian test of the null hypothesis would be to compute the posterior probability that m = 0 and to reject the no-effect hypothesis if this posterior probability is sufficiently low. The posterior distribution of m is found by integrating the posterior probability function with respect to φ₀, …, φ_m and σ. The posterior distribution π(φ₀, …, φ_m, σ, m | data) is proportional to

$$(\text{likelihood function}) \times \frac{1}{\sigma}\,\pi(\phi_1, \ldots, \phi_m \mid m)\,\pi(m).$$

Integrating out φ₁, …, φ_m and σ, and ignoring a term that is negligible for large n, we have

(7.23)   $$\pi(m \mid \text{data}) = \frac{b_m}{\sum_{j=0}^{n-2} b_j}, \quad m = 0, 1, \ldots, n-2,$$

where
and

$$\hat\sigma_m^2 = 2\sum_{j=m+1}^{n-1} \hat\phi_j^2.$$

(For m = 0, an empty sum is taken to be 0.)

FIGURE 7.3. Plots of Risk Criteria and Posterior Probabilities. Each plot corresponds to a set of data generated from the model (7.25). The smaller points are values of (Ĵ_m − min_k Ĵ_k)/(max_k Ĵ_k − min_k Ĵ_k), whereas the larger ones are posterior probabilities corresponding to a prior probability of .5 for m = 0.
N(0, .7²). The posterior probabilities (7.23) were computed for each data set with all aₘᵢ = 1 and π(m) = .5^{m+1}, m = 0, 1, …. The MISE-based quantity Ĵ_m was also computed for each data set. The posterior probability function was maximized at m = 6 in each of the four cases, whereas Ĵ_m was maximized at 6 in two cases and at 7 and 9 in the other two. Notice that the posterior probability function leaves little doubt as to which value of m is the most likely a posteriori, whereas the risk criterion tends to be much flatter near its maximum. For the data set where Ĵ_m was maximized at 9, the series estimates with truncation points 6 and 9 are shown in Figure 7.4. The Bayes criterion has chosen a better estimate in this case in the sense that it has the same features as r.
FIGURE 7.4. Data-Driven Series Estimates. The solid and dotted curves are series estimates with m = 6 and m = 9, respectively. The dashed curve is the true function.
Use of π(m | data) as an order selection tool and/or as a means of testing the no-effect hypothesis is a topic that appears to merit serious consideration, although we do not pursue it further here.
7.7 Power Properties

Our discussion to this point has focused on properties of order selection and related tests under the null hypothesis of a constant regression function. Such properties are important for purposes of verifying validity of the test. Once we are confident that a test is valid, our interest naturally turns to how powerful it is. In this section we consider power properties of the order selection and other tests. We first establish the general consistency of the order selection test and the test based on Ĵ_m̂. Next we present some exact small-sample power results for the order selection, Neyman smooth and cusum tests. We then study asymptotic power of the order selection test against a sequence of local alternatives, which facilitates comparisons with other tests, both omnibus and parametric. Finally, we discuss some results from the literature that compare the power of various smoothing-based lack-of-fit tests.
7. Testing for Association via Automated Order Selection
7.7.1 Consistency

In Sections 7.3 and 7.4 we encountered three equivalent versions of the order selection test: one based on the data-driven truncation point m̂, one a graphical test and the other based on the statistic T_n. In studying power of the order selection test, it will be more convenient to consider the version based on T_n. For a given level of significance α, we will show consistency of the large-sample test that rejects H_0 when

T_n ≥ t_α,

where t_α is such that F_OS(t_α) = 1 − α. Consistency of a small-sample test with rejection region of the form

T_n ≥ t_{n,α}

follows immediately whenever t_{n,α} → t_α as n → ∞, which occurs under the conditions of Theorem 7.2. Concerning the variance estimator σ̂² used in T_n, we assume only that it converges in probability to a constant.

The simplest and most important situation in which the order selection test is consistent is where the regression function r has at least one nonzero Fourier coefficient φ_j and the sample coefficient φ̂_j converges in probability to φ_j. The following theorem establishes consistency under somewhat weaker conditions.

Theorem 7.8. Suppose the regression function r is such that, for some j, the following condition holds:

(7.26)  lim_{n→∞} P(|φ̂_j| ≥ δ) = 1 for some δ > 0.

Then the order selection test is consistent in that

lim_{n→∞} P(T_n ≥ t_α) = 1.

PROOF. We have, for all n ≥ j + 1,

P(T_n ≥ t_α) ≥ P( (1/j) Σ_{i=1}^{j} 2n φ̂_i²/σ̂² ≥ t_α ).

The result now follows immediately upon using condition (7.26) and the fact that σ̂² converges in probability to a positive constant. □

Note that condition (7.26) does not even require existence of Fourier coefficients of r, nor does it assume consistency of φ̂_j in the event that
φ_j exists. From a practical standpoint, though, it is probably sufficient to envision the case where Fourier coefficients exist and the sample Fourier coefficients are consistent estimators of them. For example, suppose that r is piecewise smooth. Then φ_j exists for all j, and φ̂_j is a consistent estimator of φ_j so long as σ² < ∞. The class of all piecewise smooth functions is sufficient in most applications, since it does not place restrictions on the shape of r and allows for discontinuities in both r and r′.

Each of the tests discussed in Section 7.6 is consistent under very general conditions. The proof of consistency is similar in each case, and so here we consider only the test from Section 7.6.3 based on Ĵ_m̂.

Theorem 7.9. Let model (7.2) and the conditions of Theorem 7.2 hold, and suppose that r is absolutely integrable and such that ∫₀¹ r(x) cos(πjx) dx is nonzero for some j. Then the power of test (7.19) tends to 1 as n → ∞.

PROOF. Let k ≥ 1 be the smallest integer such that φ_k ≠ 0. The event {Ĵ_m̂ ≥ a} is implied by {Ĵ_k ≥ a}, which occurs whenever {2n φ̂_k²/σ̂² ≥ 2k + a}. The law of large numbers implies that φ̂_k²/σ̂² converges in probability to φ_k²/σ², which is positive. It is now immediate that P(2n φ̂_k²/σ̂² ≥ 2k + a) → 1 as n → ∞, thus proving the result. □
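For readers who want to experiment, the statistic T_n = max_{1≤m} (1/m) Σ_{j=1}^{m} 2n φ̂_j²/σ̂² is easy to compute directly. The following sketch is ours, not the book's: it assumes an evenly spaced design, a known error variance, and cosine coefficients φ̂_j = n⁻¹ Σ_k Y_k cos(πjx_k).

```python
import numpy as np

def sample_fourier_coefs(y, x, m_max):
    # phi_hat_j = (1/n) sum_k y_k cos(pi j x_k)
    return np.array([np.mean(y * np.cos(np.pi * j * x))
                     for j in range(1, m_max + 1)])

def order_selection_statistic(y, x, sigma2, m_max=None):
    # T_n = max_{1<=m<=m_max} (1/m) sum_{j<=m} 2 n phi_hat_j^2 / sigma2
    n = len(y)
    if m_max is None:
        m_max = n - 1
    phi = sample_fourier_coefs(y, x, m_max)
    terms = 2 * n * phi ** 2 / sigma2
    return float((np.cumsum(terms) / np.arange(1, m_max + 1)).max())

rng = np.random.default_rng(0)
n = 100
x = (np.arange(1, n + 1) - 0.5) / n        # evenly spaced design (assumption)
y = 1.0 + rng.normal(0.0, 1.0, n)          # H0: constant regression function
Tn = order_selection_statistic(y, x, sigma2=1.0)
print(Tn)
```

Under a nonconstant r with a nonzero Fourier coefficient, the corresponding term 2n φ̂_j²/σ̂² grows linearly in n, which is the consistency argument of Theorem 7.8 in computational form.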
7.7.2 Power of Order Selection, Neyman Smooth and Cusum Tests

Like the property of admissibility in decision theory, consistency is only a minimal sort of optimality property. Consistency provides a meaningful comparison of two tests only when one of the tests is consistent and the other is not. In this section we compare the finite-sample power of the order selection test with that of Neyman smooth tests and the cusum-based test of Section 5.5.2. To simplify comparisons we shall assume that σ² is known and that the errors are normally distributed. A slightly modified version of the cusum test rejects H_0 for large values of

T_cusum = Σ_{j=1}^{n−1} n φ̂_j² / (σ² j²).

We have pointed out previously how this test downweights the influence of higher degree sample Fourier coefficients. As a result the cusum test will have relatively poor power unless the first couple of Fourier coefficients are "large."
198
7. Testing for Association via Automated Order Selection
An mth order Neyman smooth test of H_0 : r = constant rejects H_0 for large values of

S(m) = Σ_{j=1}^{m} 2n φ̂_j²/σ²,

which has a χ²_m distribution when H_0 is true. The power of a smooth test depends fairly crucially on making a good choice of m. Indeed, the test based on S(m) will have power equal to its level in cases where φ_1 = ··· = φ_m = 0 and φ_j ≠ 0 for some j > m. By contrast, the order selection test is consistent whenever at least one Fourier coefficient is nonzero (Section 7.7.1). Whereas the order selection test does not require knowledge of r in order to be consistent, it is clear that some advantage will accrue from partial knowledge of r. To quantify this advantage, consider functions of the form
(7.27)  r(x) = φ_0 + 2 Σ_{j=1}^{m_0} φ_j cos(πjx),  0 ≤ x ≤ 1.
If m_0 is known, then an apparently reasonable smooth test is the one based on S(m_0). The power of this test is simply

(7.28)  P(S(m_0) > t_α) = P( Σ_{j=1}^{m_0} (Z_j + √(2n) φ_j/σ)² > t_α ),

where Z_1, Z_2, ... is a sequence of independent standard normal random variables.
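The power expression (7.28) is a probability about a sum of shifted squared normals and can be evaluated by Monte Carlo. This sketch is our own illustration; the critical value 3.84 is approximately the .95 quantile of χ²₁, so the example applies to m_0 = 1 with known σ²:

```python
import numpy as np

def smooth_test_power_mc(phi, n, sigma2, t_alpha, reps=200_000, seed=1):
    # Monte Carlo version of (7.28): S(m0) = sum_j (Z_j + sqrt(2n) phi_j/sigma)^2
    rng = np.random.default_rng(seed)
    phi = np.asarray(phi, dtype=float)
    shift = np.sqrt(2 * n) * phi / np.sqrt(sigma2)
    Z = rng.normal(size=(reps, len(phi)))
    S = ((Z + shift) ** 2).sum(axis=1)
    return float((S > t_alpha).mean())

# m0 = 1; 3.84 is roughly the .95 quantile of chi-squared with 1 df
print(smooth_test_power_mc([0.0], 100, 1.0, 3.84))   # under H0: about .05
print(smooth_test_power_mc([0.3], 100, 1.0, 3.84))   # phi_1 = .3: high power
```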
The sample Fourier coefficients may be expressed as

φ̂_j = φ̃_j + (1/√n) a_jn,  j = 1, ..., n − 1,

where

φ̃_j = (1/n) Σ_{k=1}^{n} ε_k cos(πjx_k),  j = 1, ..., n − 1,

and

a_jn = (1/n) Σ_{k=1}^{n} g(x_k) cos(πjx_k),  j = 1, ..., n − 1.
It follows that

2n φ̂_j² = [√(2n) φ̃_j + √2 a_j + √2 (a_jn − a_j)]²
         = (√(2n) φ̃_j + √2 a_j)² + 4(√n φ̃_j + a_j)(a_jn − a_j) + 2(a_jn − a_j)².
To prove the result, then, it is sufficient to show that

max_{1≤m≤n−p} (1/m) Σ_{j=1}^{m} σ^{−2} (√(2n) φ̃_j + √2 a_j)²

converges in distribution to the appropriate limiting random variable.

where a_j, j = 1, ..., m, are the Fourier coefficients that would result from carrying out the Gram–Schmidt procedure. The statistic S_n may thus be expressed in terms of the a_j's, and so the simplest ordinary least squares software suffices for carrying out an order selection test. An advantage of constructing u_1, ..., u_{n−p} is that one need not fit a large number of linear models to obtain the sums Σ_{j=1}^{m} a_j². Furthermore, a number of the v_j's may be highly correlated with r_1, ..., r_p, a condition known as collinearity. Use of Gram–Schmidt can avoid numerical problems caused by collinearity.
8.2.2 Basis Functions Not Orthogonal to Linear Model

Given residuals e_1, ..., e_n from a fitted model and basis functions v_1, v_2, ..., consider Fourier coefficients b̂_1, b̂_2, ... computed from the residuals.
8. Data-Driven Lack-of-Fit Tests for General Parametric Models
Even though v_1, v_2, ... may not be orthogonal to the functions comprising the fitted model, the b̂_j's nonetheless contain information about model fit. Suppose we use a test statistic of the form

(8.9)  S_n = max_{1≤m≤k_n} (1/m) Σ_{j=1}^{m} Z_jn².

Then, as n → ∞, S_n converges in distribution to the random variable

S = sup_{k≥1} (1/k) Σ_{j=1}^{k} Z_j².
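Critical values of the limit law S can be approximated by simulation, truncating the supremum at a large K; the truncation point and replication count below are our own choices, not the book's:

```python
import numpy as np

def simulate_S(reps=50_000, K=100, seed=0):
    # S = sup_{k>=1} (1/k) sum_{j<=k} Z_j^2, truncated at k = K
    rng = np.random.default_rng(seed)
    Z2 = rng.normal(size=(reps, K)) ** 2
    means = np.cumsum(Z2, axis=1) / np.arange(1, K + 1)
    return means.max(axis=1)

S = simulate_S()
# estimated upper-tail critical values of the truncated limit distribution
print(np.quantile(S, [0.95, 0.99]))
```

The running averages concentrate near 1 as k grows, so truncating at a moderate K changes the upper quantiles very little.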
PROOF. By Slutsky's theorem, we may assume that σ is known. The support of the random variable S is (1, ∞), since Σ_{j=1}^{k} Z_j²/k converges almost surely to 1 as k → ∞ (Serfling, 1980, p. 27). In considering P(S_n ≤ x), we thus take x > 1. (It is easy to establish that P(S_n ≤ x) → 0 for x < 1.) Notice that Z_jn = √(2n) b̂_j/σ may be written
(8.11)  Z_jn = (1/√n) Σ_{i=1}^{n} w_ijn ε_i,  j = 1, ..., n,

where w_ijn, i = 1, ..., n, are nonrandom weights.
We have

P(S_n ≤ x) = P( max_{1≤k≤k_n} (1/k) Σ_{j=1}^{k} Z_jn² ≤ x ).
S(w)Kn(w, u; W>-.) du,
where Kn(w, u; W>-.) is defined as in Section 2.4. A huge literature exists on spectral estimators of the form (9.14); see Priestley (1981) and Newton (1988) for more discussion and references.
9. Extending the Scope of Application
A fundamental problem in time series analysis is establishing that the observed data are indeed correlated across time. In the parlance of signal processing, uncorrelated data are referred to as "white noise." The hypothesis of white noise is equivalent to

γ(j) = 0,  j = 1, 2, ...,
which in turn is equivalent to the spectrum S being constant on [0, .5]. Existing omnibus tests for white noise include Bartlett's test (Bartlett, 1955) and the portmanteau test of Box and Pierce (1970). Alternative tests of white noise may be constructed after noting an isomorphism between the regression problem of Chapter 7 and the spectral analysis problem. The sample spectrum and Ŝ(w; W_λ) are analogous to regression data Y_1, ..., Y_n and a Fourier series smooth, respectively. Furthermore, the white noise hypothesis is analogous to the no-effect hypothesis of Chapter 7. The former hypothesis may be tested using statistics as in Chapter 7 with 2φ̂_j²/σ̂² replaced by ρ̂²(j), j = 1, ..., n − 1. The isomorphism of the two problems is even more compelling upon realizing that, under the white noise hypothesis, √n ρ̂(1), ..., √n ρ̂(m) (m fixed) are approximately independent and identically distributed N(0, 1) random variables (Priestley, 1981, p. 333). The portmanteau test of white noise (also called the Q test) is based on the statistic
Q(m) = (n + 2) Σ_{j=1}^{m} ρ̂²(j)/(1 − j/n).
Q(m) is analogous to the Neyman smooth statistic discussed in Chapter 5. Indeed, the limiting distribution of Q(m) is χ²_m when the data are white noise. Newton (1988) notes that a difficulty in using the Q test is the need to choose m. To circumvent this problem, one may use a data-driven version of Q(m) analogous to the statistic T_M in Section 7.6.1. Define

Q̂(m̂) = (n + 2) Σ_{j=1}^{m̂} ρ̂²(j)/(1 − j/n),

where m̂ is the maximizer of

R̂(m) = 0,  m = 0,
      = Σ_{j=1}^{m} n ρ̂²(j) − 2m,  m = 1, ..., n − 1.
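In code, the criterion R̂ and the data-driven portmanteau statistic might be computed as follows (a sketch; the function names are ours, and ρ̂(j) is taken to be the usual lag-j sample autocorrelation):

```python
import numpy as np

def sample_acf(x, max_lag):
    # usual lag-j sample autocorrelations rho_hat(1), ..., rho_hat(max_lag)
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.sum(xc ** 2)
    return np.array([np.sum(xc[j:] * xc[:n - j]) / denom
                     for j in range(1, max_lag + 1)])

def data_driven_Q(x, max_lag=20):
    n = len(x)
    rho = sample_acf(x, max_lag)
    # R_hat(m) = sum_{j<=m} n rho^2(j) - 2m, with R_hat(0) = 0
    R = np.concatenate(([0.0], np.cumsum(n * rho ** 2 - 2.0)))
    m_hat = int(np.argmax(R))
    if m_hat == 0:
        return 0.0, 0
    j = np.arange(1, m_hat + 1)
    Q = (n + 2) * np.sum(rho[:m_hat] ** 2 / (1 - j / n))
    return float(Q), m_hat

rng = np.random.default_rng(2)
white = rng.normal(size=400)
print(data_driven_Q(white))
```

For genuinely autocorrelated data, n ρ̂²(1) dominates the 2m penalty, so m̂ ≥ 1 and Q̂(m̂) is large.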
Under appropriate regularity conditions, Q̂(m̂) will converge in distribution to the random variable T defined in Theorem 7.3. The order selection criterion R̂(m) is one means of choosing the order m of a spectrum estimate Ŝ_m, where

Ŝ_m(w) = γ̂(0) + 2 Σ_{j=1}^{m} γ̂(j) cos(2πjw).
Such a spectrum estimate corresponds to approximating the observed time series by a moving average process. The series X_1, X_2, ... is said to be a moving average process of order q if

X_t = Σ_{j=0}^{q} θ_j Z_{t−j},  t = 1, 2, ...,

where θ_0 = 1, θ_q ≠ 0 and {Z_t : t = 0, ±1, ±2, ...} is a white noise sequence. Such a process satisfies ρ(j) = 0 for j > q, and hence has a spectrum of the form

S(w) = γ(0) + 2 Σ_{j=1}^{q} γ(j) cos(2πjw).
A reasonable spectrum estimate for a qth order moving average process would be Ŝ_q. Estimation of S by Ŝ_m̂ raises the question of whether R̂ is the most appropriate order selection criterion for approximating a covariance stationary process by a moving average. Suppose we define an optimal m to be that which minimizes

(9.15)  E ∫₀^{1/2} (Ŝ_m(w) − S(w))² dw.
The minimizer of this mean integrated squared error is the same as the maximizer of a criterion we denote by C(m). Priestley (1981) shows under general conditions that

Var[ρ̂(j)] ≈ (1/n)(1 + 2 Σ_{j=1}^{∞} ρ²(j)) = (1/n) C_ρ.
Using these approximations one may construct an approximately unbiased estimator of the risk criterion C(m). We have

E[ Σ_{j=1}^{m} ((n + j) ρ̂²(j)/C_ρ − 2)/(1 − j/n) ] ≈ (1/(C_ρ n)) C(m) = D(m).
A possible estimator Ĉ_ρ of C_ρ replaces the ρ(j) in C_ρ by sample autocorrelations. Now define
Z²_jn = (n + j) ρ̂²(j)/Ĉ_ρ,  j = 1, ..., n − 1,

and

D̂(m) = 0,  m = 0,
      = Σ_{j=1}^{m} (Z²_jn − 2)/(1 − j/n),  m = 1, ..., n − 1.
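A sketch of the criterion D̂ follows. The truncated estimate of C_ρ used here (1 + 2 Σ ρ̂²(j) over a fixed number of lags) is our own assumption about the form of Ĉ_ρ, which the text leaves unspecified:

```python
import numpy as np

def D_hat(rho, n, C_hat):
    # Z2_jn = (n + j) rho^2(j)/C_hat; D_hat(m) = sum_{j<=m} (Z2_jn - 2)/(1 - j/n)
    j = np.arange(1, len(rho) + 1)
    Z2 = (n + j) * rho ** 2 / C_hat
    terms = (Z2 - 2.0) / (1.0 - j / n)
    return np.concatenate(([0.0], np.cumsum(terms)))   # D_hat(0), ..., D_hat(m_max)

rng = np.random.default_rng(3)
x = rng.normal(size=300)
n = len(x)
xc = x - x.mean()
rho = np.array([np.sum(xc[j:] * xc[:n - j]) for j in range(1, 21)]) / np.sum(xc ** 2)
C_hat = 1.0 + 2.0 * np.sum(rho ** 2)   # truncated estimate of C_rho (assumption)
D = D_hat(rho, n, C_hat)
m_tilde = int(np.argmax(D))
print(m_tilde, D[m_tilde])
```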
Letting m̃ be the maximizer of D̂(m), the null hypothesis of white noise may be tested using the data-driven portmanteau statistic Q̂(m̃). When the data really are white noise, R̂ and D̂ will be approximately the same, since Ĉ_ρ estimates 1 in that case. Under appropriate regularity conditions the null distribution of Q̂(m̃) will be asymptotically the same as that of Q̂(m̂). Any advantage of Q̂(m̃) will most likely appear under the alternative hypothesis, since then m̃ will tend to better estimate the minimizer of (9.15) than will m̂. Another well-known test of white noise is Bartlett's test, which may be considered as a time series analog of the Kolmogorov–Smirnov test. Bartlett's test rejects the white noise hypothesis for large values of the statistic
where N = [n/2] + 1.
The idea behind the test is that the integrated spectrum of a covariance stationary time series is proportional to w if and only if the series is white noise. Bartlett's test is analogous to a test of curve equality proposed by Delgado (1993). Because of the isomorphism of the no-effect and white noise testing problems, Bartlett's test has the same power limitations as Delgado's test. Specifically, when the observed time series is such that ρ(1) and ρ(2) are small relative to autocorrelations at higher lags, then the power of Bartlett's test will be considerably lower than that of tests based on Q̂(m̂) or Q̂(m̃). Order selection criteria of a different sort than R̂ and D̂ may also be used to test the white noise hypothesis. The Q test implicitly approximates the observed series by a moving average process. Suppose instead that the series is approximated by a stationary autoregressive process of order p, which
has the form

X_t = Σ_{i=1}^{p} φ_i X_{t−i} + Z_t,  t = 0, ±1, ±2, ...,
where φ_1, ..., φ_p are constants such that the zeros of 1 − φ_1 z − φ_2 z² − ··· − φ_p z^p are outside the unit circle in the complex plane, and {Z_t} is a white noise process. (The Z_t's are sometimes referred to as innovations.) An autoregressive process is white noise if and only if its order is 0. Hence, given a data-driven method for selecting the order of the process, it seems sensible to reject the white noise hypothesis if and only if the selected order is greater than 0. This is precisely what Parzen (1977) proposed in conjunction with his criterion autoregressive transfer (CAT) function that selects autoregressive order. In fact, it appears that Parzen was the first person in any area of statistics to propose an order selection criterion as a test of model fit. Before discussing CAT we introduce another popular criterion for selecting autoregressive order, namely Akaike's Information Criterion (AIC) (1974), defined by

AIC(k) = log σ̂²(k) + 2k/n,  k = 0, 1, ...,
where σ̂²(k) is the Yule–Walker estimate (Newton, 1988) of the innovations variance for a kth order autoregressive (AR) model. One may perform a test of white noise with AIC just as with the CAT function. The pioneering paper of Shibata (1976) on AIC implies that when the process is actually white noise, the minimizer of AIC(k) occurs at 0 about 71% of the time. Hence, a white noise test based on AIC has type I error probability of about .29 for large n. It is no accident that this probability matches that discussed in Section 7.3. Examining the work of Shibata (1976) reveals that AIC and the MISE criterion Ĵ_m are probabilistically isomorphic, at least in an asymptotic sense. Parzen's original version of CAT has the form

CAT(k) = 1 − ((n − k)/n)(σ̂²_∞/σ̂²(k)) + k/n,  k = 0, 1, ...,
where σ̂²_∞ is a model-free estimate of innovations variance. The estimated AR order is taken to be the value of k that minimizes CAT(k). A test of white noise based on this version of CAT turns out also to have an asymptotic level of .29. Again this is no mere coincidence, since AIC and CAT turn out to be asymptotically equivalent (Bhansali, 1986a). Bhansali (1986b) proposed the CAT_a criterion, defined by

CAT_a(k) = 1 − σ̂²_∞/σ̂²(k) + ak/n,  k = 0, 1, ...,
for some constant a > 1. Bhansali (1986a) has shown that his CAT_2 criterion is asymptotically equivalent to CAT. Choices of a other than 2 allow one to place more or less importance on overfitting than does CAT. There is an obvious resemblance of CAT_a to the regression criterion J(m; γ) of Section 7.3. In analogy to the regression setting, one may use CAT_a to perform a test of white noise. In fact, the asymptotic percentiles from Table 7.1 can be used in CAT_a exactly as they are in J(m; γ) to produce a valid large sample test of white noise. For example, for a test with type I error probability equal to .05, one should use CAT_{4.18}. Parzen proposed the following modified CAT function:

CAT*(k) = −(1 + 1/n)/σ̃²(0),  k = 0,
        = (1/n) Σ_{j=1}^{k} 1/σ̃²(j) − 1/σ̃²(k),  k = 1, 2, ..., K_n,
where σ̃²(k) = n(n − k)^{−1} σ̂²(k), k = 0, 1, ..., n − 1, and K_n is such that K_n²/n → 0 as n → ∞. Parzen proposes a test that rejects the white noise hypothesis if and only if the minimizer of CAT* is greater than 0. We shall call this white noise test Parzen's test. The "natural" definition of CAT*(0) is −1/σ̃²(0); Parzen proposed the modification −(1 + n^{−1})/σ̃²(0) in order to decrease the significance level of the white noise test from .29. A simulation study in Newton (1988, p. 277) suggests that the level of Parzen's test is about .17. We shall provide a theoretical justification for this probability and also show that Parzen's test is closely related to the regression test proposed in Section 7.6.3. We may write

1 + σ̂²_∞ CAT*(k) = CAT_2(k) + R_{n,k},  k ≥ 1,
where

R_{n,k} = (k/n)(σ̂²_∞/σ̃²(k) − 1) + k(k + 1)/n² + (1/n) Σ_{j=1}^{k} (1 − j/n)(σ̂²_∞/σ̃²(j) − 1).
If the white noise hypothesis is true, σ̂²_∞ and σ̂²(j), j = 0, ..., K_n, all estimate Var(X_t), and it follows that the terms R_{n,k} are negligible in comparison to CAT_2(k). (This can be shown rigorously using arguments as in Bhansali, 1986a.) So, when the data are white noise, the properties of Parzen's test are asymptotically the same as for the analogous test based on a CAT*_2 function, where CAT*_2(k) = CAT_2(k), k ≥ 1, and

CAT*_2(0) = 1 − (1 + 1/n) σ̂²_∞/σ̃²(0).
The significance level of the CAT*_2 white noise test is

1 − P( ∩_{k=1}^{K_n} { CAT*_2(0) ≤ CAT_2(k) } ) = 1 − P( ∩_{k=1}^{K_n} { σ̂²(k)/σ̂²(0) > 1 − (2k + 1)/n + R̃_{n,k} } ),

where the terms R̃_{n,k} are remainders analogous to the R_{n,k}. In the white noise case the terms R̃_{n,k} are asymptotically negligible, and the limiting level of the test is

(9.16)  1 − lim_{n→∞} P( ∩_{k=1}^{K_n} { log(σ̂²(k)/σ̂²(0)) ≥ log(1 − (2k + 1)/n) } ).
The last expression is quite illuminating in that it shows Parzen's test to be asymptotically equivalent to a white noise test that rejects H_0 when the value of the AIC criterion at its minimum is less than log σ̂²(0) − 1/n. This suggests that Parzen's test is analogous to the test in Section 7.6.3 based on the maximum of an estimated risk criterion. Arguing as in Shibata (1976),

−n log( σ̂²(k)/σ̂²(0) ) = n Σ_{j=1}^{k} φ̂_j²(j) + o_p(1),  k ≥ 1,

where, for any fixed K, √n φ̂_1(1), ..., √n φ̂_K(K) have a limiting multivariate normal distribution with mean vector 0 and identity covariance matrix. This fact and (9.16) imply that the limiting level of Parzen's test is
1 − P( ∩_{k≥1} { Σ_{j=1}^{k} Z_j² ≤ 2k + 1 } ),

in which Z_1, Z_2, ... are i.i.d. standard normal random variables. The last expression may be written

1 − P( max_{k≥1} Σ_{j=1}^{k} (Z_j² − 2) ≤ 1 ).

Note that the random variable max_{k≥1} Σ_{j=1}^{k} (Z_j² − 2) is precisely the same as the one appearing in Theorem 7.6, which confirms that Parzen's test is analogous to the regression test based on the maximum of an estimated risk. By means of simulation it has been confirmed that

(9.17)  1 − P( max_{k≥1} Σ_{j=1}^{k} (Z_j² − 2) ≤ 1 ) ≈ .18.

TABLE 9.1. Approximate Values of q_α for Parzen's Test

α      .29   .18   .10    .05    .01
q_α     0     1    2.50   4.23   7.87

The estimated values of q_α were obtained from 10,000 replications of the process Σ_{j=1}^{k} (Z_j² − 2), k = 1, ..., 50, where Z_1, ..., Z_50 are i.i.d. N(0, 1).
The argument leading to (9.17) also shows that if CAT*(0) is defined to be −(1 + q/n)/σ̃²(0) for a constant q, then the limiting level of the corresponding white noise test is

1 − P( max_{k≥1} Σ_{j=1}^{k} (Z_j² − 2) ≤ q ).
One can thus obtain any desired level of significance α by using a version of Parzen's test in which CAT*(0) is −(1 + q_α/n)/σ̃²(0) for an appropriate q_α. Simulation was used to obtain approximate values of q_α for large-sample tests of various sizes; see Table 9.1. It is worth noting that the values of q_α in Table 9.1 are also valid large-sample percentiles for the regression test of Section 7.6.3.
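The entries of Table 9.1 are straightforward to reproduce by simulation, mirroring the setup described in the table footnote:

```python
import numpy as np

def parzen_limit_draws(reps=100_000, K=50, seed=4):
    # max over k = 1..K of sum_{j<=k} (Z_j^2 - 2), Z_j i.i.d. N(0, 1)
    rng = np.random.default_rng(seed)
    Z2 = rng.normal(size=(reps, K)) ** 2
    return np.cumsum(Z2 - 2.0, axis=1).max(axis=1)

M = parzen_limit_draws()
for q in [0.0, 1.0, 2.50, 4.23, 7.87]:
    print(q, round(float((M > q).mean()), 3))   # compare with Table 9.1
```

The estimated tail probabilities should land close to .29, .18, .10, .05 and .01, up to Monte Carlo error.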
9.9 Time Series Trend Detection

In time series analysis it is often of interest to test whether or not a series of observations has a common mean. One setting where such a test is desirable is in quality control applications. Consider a series of observations X_1, ..., X_n made at evenly spaced time points and let μ_i = E(X_i), i = 1, ..., n. Furthermore, let us assume that the process Y_i = X_i − μ_i, i = 1, 2, ..., is covariance stationary, as defined in the previous section. The hypothesis H_0 : μ_i = μ, i = 1, 2, ..., is simply a no-effect hypothesis as in Chapter 7. What makes the time series setting different is that the
covariance structure of the data must be taken into account in order to properly test the hypothesis of constant means. Accounting for the covariance is by no means a minor problem. Indeed, the possibility of covariance fundamentally changes the problem of detecting nonconstancy of the mean. To see part of the difficulty, consider the two data sets in Figure 9.1, which seem to display similar characteristics. The data in the top panel were generated from the autoregressive model

X_i = .95 X_{i−1} + ε_i,  i = 1, ..., 50,

where E(X_0) = 0 and E(ε_i) = 0, i = 1, ..., 50, implying that each observation has mean 0.

FIGURE 9.1. Simulated Data from Different Models. In the top graph are data generated from an autoregressive process, while the other data were generated from a regression model with i.i.d. errors. In each graph the line is the least squares line for the corresponding data.

The second data set was generated from
X_i = 3.5 − 2.6(i − 1/2)/50 + ε_i,  i = 1, ..., 50,

where the ε_i's are i.i.d. N(0, 1). The apparent downward trend in each data set is due to different causes. The "trend" in the first data set is anomalous in the sense that it is induced by correlation; if the data set were larger the observations would eventually begin to drift upward. The downward drift in the second data set is "real" in that it is due to the deterministic factor E(X_i) = 3.5 − 2.6(i − 1/2)/50. At the very least this example shows that it is important to recognize the possibility of correlation, or else one could erroneously ascribe structure in the data to a nonconstant mean. More fundamentally, the example suggests the possibility of a formal identifiability problem in which it would be impossible to differentiate between two disparate models on the basis of a data analysis alone. The more a priori knowledge one has of the underlying covariance structure, the more feasible it will be to devise a valid and consistent test of the constant mean hypothesis. Let us consider the case where {X_i − μ_i} follows a Gaussian first order autoregressive process. What would happen if we applied, say, a test as in Section 7.3 to test for a constant mean? For simplicity suppose we use a statistic of the form
S_n = max_{1≤m≤M} (1/m) Σ_{j=1}^{m} 2n φ̂_j²/σ̂²,

where M is a constant and σ̂² = Σ_{i=2}^{n} (X_i − X_{i−1})²/(2n). Let ρ be the first lag autocorrelation of the process {X_i − μ_i}. It is not difficult to argue that when the μ_i's are all the same, the random vector

(2n/σ̂²)(φ̂_1², ..., φ̂_M²)

converges in distribution to

((1 + ρ)/(1 − ρ)²)(Z_1², ..., Z_M²),

where Z_1, ..., Z_M are i.i.d. standard normal random variables. Therefore, S_n converges in distribution to the random variable

((1 + ρ)/(1 − ρ)²) S_M,

where S_M has the distribution function F_OS,M defined in Section 7.5.1. Now, suppose we conduct a level-α test of constant mean as we would assuming the data were independent. The asymptotic level of this test under the first order autoregressive model is
1 − F_OS,M( (1 − ρ)² t_α/(1 + ρ) ),
where 1 − F_OS,M(t_α) = α. Obviously, then, this order selection test will be invalid when the data are positively correlated in that the level of the test will be larger than α when ρ > 0. In fact, the level can be made arbitrarily close to 1 by taking ρ sufficiently close to 1. This justifies our earlier comment to the effect that one may erroneously conclude that the means are nonconstant if correlation is ignored. When the first lag autocorrelation is negative, the test will be valid but less powerful than it is when the data are independent. The problem just outlined has an apparently easy fix. Suppose that ρ̂ is a consistent estimator of ρ. The statistic
T_n = ((1 − ρ̂)²/(1 + ρ̂)) S_n
has the same limit distribution as S_M, implying that we may compare T_n with the independent-data critical values and still have an asymptotically valid test. The only problem with this proposal is obtaining an estimator of ρ that yields a powerful test. Probably the first estimator, ρ̂_1 say, that comes to mind is the ordinary lag-one sample autocorrelation,
which is the estimator used when the process mean is assumed to be constant. This estimator is fine for ensuring asymptotic validity of the test but can be very detrimental from a power standpoint. The problem is that if the μ_i's vary slowly over time, the estimator ρ̂_1 can be quite close to 1, causing T_n to be relatively small. An alternative estimator ρ̂ of ρ that addresses the problem inherent in ρ̂_1 is one based on the detrended observations X_i − μ̂_i,
where μ̂_i is a nonparametric estimator of μ_i. Kim (1994) has studied the order selection test in the context of dependent data and in doing so considered various candidates for the μ̂_i in ρ̂. A test utilizing ρ̂ is problematic in that it requires the choice of a smoothing parameter for μ̂_i. A main motivation for using an order selection test is that it circumvents any arbitrary choice of smoothing parameter. It would thus be desirable to have a test that avoided explicit estimation of the μ_i's. This would be possible were a method available that simultaneously selects a smoothing parameter and estimates ρ. The time series cross-validation (TSCV) criterion of Hart (1994) is one such method. TSCV was proposed as a means of selecting the bandwidth of a kernel smoother when the observed data are autocorrelated. It can be viewed as a generalization of the one-sided cross-validation method discussed in Section 4.2.5. To test for constancy of the means, we may proceed as in Section 8.2.3.
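Given any estimate ρ̂, the correlation adjustment is a one-line computation (a sketch; `Sn` stands for the unadjusted order selection statistic, assumed already computed):

```python
def adjusted_statistic(Sn, rho_hat):
    # T_n = ((1 - rho_hat)^2 / (1 + rho_hat)) * S_n: deflates the statistic
    # under positive autocorrelation, inflates it under negative
    return (1 - rho_hat) ** 2 / (1 + rho_hat) * Sn

print(adjusted_statistic(5.0, 0.0))   # independent data: unchanged
print(adjusted_statistic(5.0, 0.5))   # positive correlation: deflated
```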
The appropriate estimator of means is a Nadaraya–Watson smoother, since that smoother is a flat line for large bandwidths. The bandwidth of the Nadaraya–Watson estimator is chosen by TSCV assuming that the process {X_i − μ_i} is first order autoregressive. The hypothesis of constant means is rejected if the data-driven bandwidth is sufficiently small. To obtain a valid test based on a data-driven bandwidth ĥ, it is necessary to know the probability distribution of ĥ when the means are constant and the data follow the prescribed process. One method of approximating the distribution of ĥ is to use a bootstrap procedure. TSCV yields both a bandwidth ĥ and an estimate ρ̂ of ρ. One can show that under the null hypothesis of constant means the TSCV criterion is invariant to the scale of the X_i's. It is thus sensible to generate bootstrap data as follows:

X*_i = ρ̂ X*_{i−1} + ε*_i,  i = 1, ..., n,

where X*_0 = 0 and ε*_1, ..., ε*_n is a random sample (with replacement) from the residuals e_2, ..., e_n.
When H_0 is true, e_2, ..., e_n will be the "correct" residuals. If the means are nonconstant, e_2, ..., e_n will tend to have larger variance than will the error terms in the underlying model, but this is irrelevant since the TSCV criterion is invariant to scale when applied to the bootstrap data. So long as ρ̂ is close to ρ we can expect this bootstrap scheme to work reasonably well. Having obtained a bootstrap sample one may compute ĥ* from this sample in the same way ĥ was computed from the original data. The sampling distribution of ĥ can then be approximated by generating a large number of bootstrap samples and computing ĥ* on each one. The assumption that {X_i − μ_i} follows a first order autoregressive process plays no important role in the above development. The test based on TSCV may be applied whenever {X_i − μ_i} is covariance stationary and has a prescribed parametric structure. Presumably one may show that this test is asymptotically valid and consistent under general conditions on the means μ_1, ..., μ_n. A fascinating question is the following: What are the weakest conditions on the process {X_i − μ_i} and the means μ_i under which a valid and consistent test of equal means may be constructed? Suppose, for example, that we assume {X_i − μ_i} is an autoregressive process of unknown order p. Hart (1996) has proposed a generalization of TSCV that allows simultaneous estimation of bandwidth, p and autoregressive parameters. Is it possible to construct valid and consistent tests of constant means based on this generalization?
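The bootstrap scheme just described can be sketched as follows. Computing ĥ* by TSCV is beyond this sketch, so only the resampling step is shown; `rho_hat` and `residuals` are toy stand-ins for quantities TSCV would deliver on real data:

```python
import numpy as np

def ar1_bootstrap_sample(rho_hat, residuals, n, rng):
    # X*_i = rho_hat X*_{i-1} + eps*_i, with X*_0 = 0 and eps*
    # resampled with replacement from the residuals
    eps_star = rng.choice(residuals, size=n, replace=True)
    x_star = np.empty(n)
    prev = 0.0
    for i in range(n):
        prev = rho_hat * prev + eps_star[i]
        x_star[i] = prev
    return x_star

rng = np.random.default_rng(5)
rho_hat = 0.6                       # toy stand-in for the TSCV estimate
residuals = rng.normal(size=100)    # toy stand-in for the fitted residuals
boot = [ar1_bootstrap_sample(rho_hat, residuals, 100, rng) for _ in range(200)]
print(len(boot), boot[0].shape)
```

Each bootstrap series would then be fed back through the same bandwidth-selection procedure to build up the null distribution of ĥ.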
10 Some Examples
10.1 Introduction

In this final chapter we make use of order selection tests in analyzing some actual sets of data. In Section 10.2 tests of linearity are performed on the Babinet data, which were encountered in Section 6.4.2. We also consider the problem of selecting a good model for these data, and perform a test of homoscedasticity. In Section 10.3 order selection tests are used in an analysis of hormone level spectra. Section 10.4 shows how the order selection test can enhance the scatter plots corresponding to a set of multivariate data. Finally, in Section 10.5, the order selection test is used to test whether a multiple regression model has an additive structure.
10.2 Babinet Data
10.2.1 Testing for Linearity

In Section 6.4.2 we used the Babinet data to illustrate the notion of significance trace. Here we use the same data as an example of checking linearity of a regression function via an order selection test. A scatter plot of the data is shown in Figure 10.1. (The x-variable was rescaled to the interval (0, 1).) There is some visual evidence that a straight line is not an adequate model. Does an order selection test agree with the visual impression? Two test statistics were computed using the methodology of Section 8.2.1: one using a cosine basis and the other a polynomial basis. The difference-based variance estimate σ̂²_1 (Section 7.5.2) was used in each case. The values of the two test statistics and their associated large-sample P-values are given in Table 10.1. The P-values are simply 1 − F_OS(S_n), where F_OS is defined in Section 7.3. Table 10.1 displays strong evidence that the regression function is something other than a straight line. It is interesting that, although both P-values are quite small, the one corresponding to the polynomial basis
FIGURE 10.1. Smooths of Babinet Data. The solid and dotted lines are quadratic and second order cosine models, respectively. The dashed line is a local linear smooth chosen by OSCV.
is extremely small, 1.1 × 10⁻⁶. This is a hint that some bases will be more powerful than others in detecting departures from a particular parametric model. This point will be made even more dramatically in the next section. The P-values in Table 10.1 are based on a large-sample approximation. The sample size in this case is reasonably large, n = 355. Nonetheless, it is interesting to see what happens when the bootstrap is used to approximate a P-value. After fitting the least squares line, residuals e_1, ..., e_355 were obtained. A random sample e*_1, ..., e*_355 is drawn (with replacement) from these residuals and bootstrap data obtained as follows:

Y*_i = β̂_0 + β̂_1 x_i + e*_i,  i = 1, ..., 355,
TABLE 10.1. Values of Statistic S_n (Section 8.2.1) and Large-Sample P-Values

Basis        S_n     P-value
Cosine       10.06   .0015
Polynomial   23.68   .0000011
where β̂_0 and β̂_1 are the least squares estimates from the original data. A cosine-based statistic S*_n was then computed from (x_1, Y*_1), ..., (x_n, Y*_n) in exactly the same way S_n was computed from the original data. This process was repeated independently 10,000 times. A comparison of the resulting empirical distribution of S*_n with the large-sample distribution F_OS is shown in Figure 10.2. The two cdfs are only plotted for probabilities of at least .80, since the tail regions are of the most interest. The agreement between the two distributions is remarkable. The conclusion that the true regression curve is not simply a line appears to be well founded.
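The residual bootstrap used above can be sketched generically. The statistic below is a toy stand-in (a single scaled cosine coefficient of the residuals), not the full order selection statistic S_n of Section 8.2.1:

```python
import numpy as np

def bootstrap_pvalue(x, y, statistic, n_boot=300, seed=6):
    # fit the least squares line, resample its residuals, and recompute
    # the statistic to approximate its null distribution
    rng = np.random.default_rng(seed)
    b1, b0 = np.polyfit(x, y, 1)
    fitted = b0 + b1 * x
    resid = y - fitted
    s_obs = statistic(x, y)
    s_star = np.empty(n_boot)
    for b in range(n_boot):
        y_star = fitted + rng.choice(resid, size=len(y), replace=True)
        s_star[b] = statistic(x, y_star)
    return float(np.mean(s_star >= s_obs))

def toy_stat(x, y):
    # scaled squared cosine coefficient (j = 2) of the residuals
    b1, b0 = np.polyfit(x, y, 1)
    e = y - (b0 + b1 * x)
    return len(y) * np.mean(e * np.cos(2 * np.pi * x)) ** 2 / np.var(e)

rng = np.random.default_rng(7)
x = np.linspace(0.0, 1.0, 100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, 100)
print(bootstrap_pvalue(x, y, toy_stat))
```

With a genuinely nonlinear regression function the observed statistic dwarfs its bootstrap replicates and the approximate P-value collapses to 0.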
10.2.2 Model Selection

Having rejected the hypothesis of linearity, we turn our attention to obtaining a good estimate of the underlying regression function. One method of doing so is to use a kernel or local linear estimate. The dashed line in Figure 10.1 is a local linear smooth (with Epanechnikov kernel) whose smoothing parameter was chosen by the one-sided cross-validation method of Chapter 4. In agreement with the analysis in Section 10.2.1, the smooth shows evidence of nonlinearity.

The order selection test is significant at level of significance $\alpha$ if and only if a particular risk criterion chooses a model order greater than 0. Which model(s) are preferred by such criteria for the Babinet data? Figure 10.3 provides plots of risk criteria of the form $J(m; \gamma)$ for two values of $\gamma$, 2 and 4.18. The criterion using $\gamma = 2$ corresponds to unbiased estimation of MASE and chooses a cosine series with over twenty terms. This exemplifies the undersmoothing that often occurs with MASE-based risk criteria.
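The equivalence between the order selection test and a penalized risk criterion can be made concrete in code. The sketch below is ours and uses a generic criterion of the form $J(m;\gamma) = \gamma m \hat\sigma^2/n - \sum_{j\le m} \hat b_j^2$ (the exact constants depend on how the sample Fourier coefficients $\hat b_j$ are normalized, so this is an illustration, not the book's exact $J$): the criterion picks $\hat m > 0$ exactly when $\max_m m^{-1}\sum_{j\le m} n\hat b_j^2/\hat\sigma^2$ exceeds $\gamma$.

```python
def chosen_order(bhat, sigma2, n, gamma):
    """Order minimizing J(m; gamma) = gamma*m*sigma2/n - sum_{j<=m} bhat_j^2,
    with J(0; gamma) = 0 (a generic penalized-risk criterion)."""
    best_m, best_val, val = 0, 0.0, 0.0
    for m, b in enumerate(bhat, start=1):
        val += gamma * sigma2 / n - b * b   # running value of J(m; gamma)
        if val < best_val:
            best_m, best_val = m, val
    return best_m

def order_selection_stat(bhat, sigma2, n):
    """max_m (1/m) sum_{j<=m} n*bhat_j^2/sigma2, one common form of the
    order selection statistic (cf. Section 8.2.1)."""
    acc, best = 0.0, 0.0
    for m, b in enumerate(bhat, start=1):
        acc += n * b * b / sigma2
        best = max(best, acc / m)
    return best
```

With $\gamma = 4.18$ (the .95 quantile scale used in Figure 10.3) the criterion selects order 0 precisely when the test fails to reject at that level; $\gamma = 2$ penalizes less and so selects larger orders.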
FIGURE 10.2. $F_{OS}$ and Bootstrap Distribution. The solid line is $F_{OS}$.
FIGURE 10.7. Diabetes Data and Local Linear Smooths.
which uses up only 11 degrees of freedom, leaving us 32 degrees of freedom with which to assess possible interactions of $x_1$ and $x_2$. Let $SSE_{null}$ denote the error sum of squares corresponding to a least squares estimate of model (10.2), and let $SSE_{r,s}$ be the error sum of squares for the model

$$Y = \beta_0 + \sum_{j=1}^{5}\beta_{1j}\cos(\pi j x_1) + \sum_{k=1}^{5}\beta_{2k}\cos(\pi k x_2) + \sum_{j=1}^{r}\sum_{k=1}^{s}\gamma_{jk}\cos(\pi j x_1)\cos(\pi k x_2) + \epsilon,$$

where $r, s \ge 1$. Our test statistic for the null hypothesis that the regression function has an additive form is

(10.3) $$\max_{1\le r,s\le 5}\,\frac{SSE_{null} - SSE_{r,s}}{rs\,\hat\sigma^2}.$$

The variance estimate $\hat\sigma^2$ was taken to be $SSE_{null}/(43 - 11)$. If anything, this choice for $\hat\sigma^2$ will tend to make the test less powerful, since an interaction between $x_1$ and $x_2$ will inflate the null variance estimate. The value of statistic (10.3) for the diabetes data is 5.594. Is this significant evidence of an interaction between $x_1$ and $x_2$? Under the null hypothesis, and assuming that the errors in our model are i.i.d. Gaussian, the distribution of (10.3) is well approximated by that of

$$\max_{1\le r,s\le 5}\,\frac{1}{rs}\sum_{j=1}^{r}\sum_{k=1}^{s} Z_{jk}^2,$$

where the $Z_{jk}$ are i.i.d. standard normal random variables.
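The limiting distribution above is easy to simulate, which gives an approximate P-value for the observed value 5.594. A Monte Carlo sketch (the function name and simulation sizes are ours):

```python
import random

def interaction_null_sim(t_obs, reps=20000, seed=2):
    """Monte Carlo estimate of P( max_{1<=r,s<=5} (1/(rs)) * sum_{j<=r,k<=s} Z_jk^2 >= t_obs ),
    the large-sample null distribution of statistic (10.3)."""
    rng = random.Random(seed)
    count = 0
    for _ in range(reps):
        z2 = [[rng.gauss(0.0, 1.0) ** 2 for _ in range(5)] for _ in range(5)]
        best = 0.0
        col = [0.0] * 5                 # col[k-1] = sum_{j<=r} z2[j][k]
        for r in range(1, 6):
            acc = 0.0                   # acc = sum_{j<=r, k<=s} z2[j][k]
            for s in range(1, 6):
                col[s - 1] += z2[r - 1][s - 1]
                acc += col[s - 1]
                best = max(best, acc / (r * s))
        if best >= t_obs:
            count += 1
    return count / reps

print(interaction_null_sim(5.594))
```

Calling `interaction_null_sim(5.594)` estimates the P-value of the observed interaction statistic.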
$$F_{OS}(t) - F(t; M) = -\exp(-a_M)\sum_{j=M+1}^{\infty} j^{-1}\,P(\chi_j^2 > jt),$$

where $a_M$ is between $S_M(t)$ and $S_\infty(t)$. Obviously then,

(A.1) $$|F_{OS}(t) - F(t; M)| \le \sum_{j=M+1}^{\infty} j^{-1}\,P(\chi_j^2 > jt).$$
Appendix
The next step is to obtain an exponential bound for $P(\chi_j^2 > jt)$ using Markov's inequality. For any number $a$ such that $0 < a < 1/2$, we have

$$P(\chi_j^2 > jt) = P\bigl(\exp(a\chi_j^2) > \exp(ajt)\bigr) \le (1-2a)^{-j/2}\exp(-ajt).$$

Using this inequality and (A.1), it follows that

(A.2) $$|F_{OS}(t) - F(t; M)| \le \sum_{j=M+1}^{\infty} j^{-1}(1-2a)^{-j/2}\exp(-ajt) = \sum_{j=M+1}^{\infty} j^{-1}\exp\{-j f_t(a)\},$$

where $f_t(a) = at + (1/2)\log(1-2a)$. Obviously we want to choose $a$ so that $f_t(a) > 0$. Simple analysis shows that $f_t[(1-t^{-1})/2] > 0$ for any $t > 1$. Since we also have $0 < (1-t^{-1})/2 < 1/2$ for $t > 1$, (A.2) implies that

$$|F_{OS}(t) - F(t; M)| \le \sum_{j=M+1}^{\infty} j^{-1}\exp\{-(j/2)[(t-1) - \log t]\} \le (M+1)^{-1}\sum_{j=M+1}^{\infty}\theta_t^{\,j} = (M+1)^{-1}\,\frac{\theta_t^{M+1}}{1-\theta_t},$$

where $\theta_t = \exp\{-[(t-1) - \log t]/2\}$, thus proving the result.
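The final bound can be checked numerically. The sketch below is ours: it compares a truncated version of the tail sum $\sum_{j>M} j^{-1}P(\chi_j^2 > jt)$ (using an exact $\chi^2$ survival function for integer degrees of freedom) with the geometric bound $(M+1)^{-1}\theta_t^{M+1}/(1-\theta_t)$.

```python
import math

def chi2_sf(df, x):
    """Exact P(chi-squared_df > x) for integer df; increments are computed in
    log space so large degrees of freedom do not overflow."""
    if df % 2 == 0:
        q, k = math.exp(-x / 2.0), 2
    else:
        q, k = math.erfc(math.sqrt(x / 2.0)), 1
    while k < df:
        q += math.exp((k / 2.0) * math.log(x / 2.0) - x / 2.0 - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return q

def tail_sum(t, M, terms=300):
    """Truncated tail sum_{j=M+1}^{M+terms} j^{-1} P(chi2_j > jt)."""
    return sum(chi2_sf(j, j * t) / j for j in range(M + 1, M + terms + 1))

def geometric_bound(t, M):
    """(M+1)^{-1} theta_t^{M+1} / (1 - theta_t), theta_t = exp{-[(t-1) - log t]/2}."""
    theta = math.exp(-((t - 1.0) - math.log(t)) / 2.0)
    return theta ** (M + 1) / ((M + 1) * (1.0 - theta))
```

For any $t > 1$ the tail sum indeed sits below the geometric bound, term by term.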
A.2 Bounds for the Distribution of $T_{cusum}$

Here we derive bounds for $P(T_{cusum} \ge t)$, where $T_{cusum}$ is defined in Section 7.7.2. We assume that model (7.1) holds with the errors i.i.d. $N(0, \sigma^2)$ and $r(x) = 2\phi\cos(\pi m_0 x)$. Define $\lambda^2 = n\phi^2/\sigma^2$, and let $Z_1, Z_2, \ldots$ denote a sequence of i.i.d. standard normal random variables. For any $n > m_0$, a lower bound is available in terms of the probability

$$P_1 = P\Bigl(\sum_{j=1}^{n-1} \frac{Z_j^2}{j^2} \cdots\Bigr).$$

To obtain an upper bound on $P(T_{cusum} \ge t)$, note that

$$P(T_{cusum} \ge t) \le P\left(\sum_{j \ne m_0} \frac{Z_j^2}{j^2} + \frac{(Z_{m_0} + \sqrt{2}\lambda)^2}{m_0^2} \ge t\right),$$

where the sum extends from 1 to $\infty$, excluding $m_0$. The very last probability may be written as an integral against the standard normal density $\phi(\cdot)$, by conditioning on $Z_{m_0} = u$. It follows that

$$P(T_{cusum} \ge t) \le \int_{(u+\sqrt{2}\lambda)^2 \ge m_0^2 t} \phi(u)\,du + \int_{(u+\sqrt{2}\lambda)^2 < m_0^2 t} H_t(u)\phi(u)\,du,$$

where

$$H_t(u) = P\left(\sum_{j=1}^{\infty} \frac{Z_j^2}{j^2} \ge t - \frac{(u+\sqrt{2}\lambda)^2}{m_0^2}\right).$$

The integral involving $H_t(u)$ may be written as a sum of integrals, each of which may be bounded by using the monotonicity of $H_t$ and values for the cdf of $\sum_{j=1}^{\infty} Z_j^2/j^2$ (Anderson and Darling, 1952).
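The quantity on the right side of the first upper bound can also be approximated by direct simulation, which gives a quick numerical check on tabled values. A Monte Carlo sketch (names, truncation level $J$, and simulation sizes are ours):

```python
import math
import random

def tcusum_upper_tail(t, lam, m0, J=200, reps=5000, seed=3):
    """Monte Carlo estimate of
        P( sum_{j != m0, j <= J} Z_j^2/j^2 + (Z_{m0} + sqrt(2)*lam)^2/m0^2 >= t ),
    the quantity bounding P(T_cusum >= t) from above, with the infinite sum
    truncated at J terms."""
    rng = random.Random(seed)
    count = 0
    for _ in range(reps):
        # noncentral term at frequency m0, then the remaining weighted chi-squares
        s = (rng.gauss(0.0, 1.0) + math.sqrt(2.0) * lam) ** 2 / m0 ** 2
        for j in range(1, J + 1):
            if j != m0:
                s += rng.gauss(0.0, 1.0) ** 2 / j ** 2
        if s >= t:
            count += 1
    return count / reps
```

Because the weights $j^{-2}$ are summable, a moderate truncation level $J$ already gives a stable estimate.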
References
Akaike, H. (1974). A new look at statistical model identification. IEEE Trans. Auto. Control 19, 716-723. Anderson, T. W. and Darling, D. A. (1952). Asymptotic theory of certain "goodness of fit" criteria based on stochastic processes. Ann. Math. Statist. 23, 193-212. Azzalini, A., Bowman, A. W. and Härdle, W. (1989). On the use of nonparametric regression for model checking. Biometrika 76, 1-11. Barry, D. (1993). Testing for additivity of a regression function. Ann. Statist. 21, 235-254. Barry, D. and Hartigan, J. A. (1990). An omnibus test for departures from constant mean. Ann. Statist. 18, 1340-1357. Bartlett, M. S. (1955). An Introduction to Stochastic Processes with Special Reference to Methods and Applications. Cambridge University Press, London. Bellver, C. (1987). Influence of particulate pollution on the positions of neutral points in the sky in Seville (Spain). Atmos. Environ. 21, 699-702. Bhansali, R. J. (1986a). Asymptotically efficient selection of the order by the criterion autoregressive transfer function. Ann. Statist. 14, 315-325. Bhansali, R. J. (1986b). The criterion autoregressive transfer function of Parzen. J. Time Series Anal. 7, 315-325. Bhattacharya, P. K. (1974). Convergence of sample paths of normalized sums of induced order statistics. Ann. Statist. 2, 1034-1039. Bhattacharya, R. N. and Ranga Rao, R. (1976). Normal Approximation and Asymptotic Expansions. John Wiley & Sons, New York. Bickel, P. J. and Ritov, Y. (1992). Testing for goodness of fit: a new approach. Nonparametric Statistics and Related Topics (A. K. Md. E. Saleh, ed.), North-Holland, Amsterdam, pp. 51-57. Bickel, P. J. and Rosenblatt, M. (1973). On some global measures of the deviation of density function estimates. Ann. Statist. 1, 1071-1095. Billingsley, P. (1968). Convergence of Probability Measures. John Wiley & Sons, New York. Box, G. E. P. (1954). Some theorems on quadratic forms applied in the study of analysis of variance problems, I.
Effect of inequality of variance in the one-way classification. Ann. Math. Statist. 25, 290-302.
Box, G. E. P. and Pierce, D. A. (1970). Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. J. Amer. Statist. Assoc. 65, 1509-1526. Buckley, M. J. (1991). Detecting a smooth signal: optimality of cusum based procedures. Biometrika 78, 253-262. Buckley, M. J. and Eagleson, G. K. (1988). An approximation to the distribution of quadratic forms in normal random variables. Austral. J. Statist. 30A, 150-159. Butzer, P. L. and Nessel, R. J. (1971). Fourier Analysis and Approximation. Academic Press, New York. Carroll, R. J. and Ruppert, D. (1988). Transformation and Weighting in Regression. Chapman & Hall, New York. Chen, J.-C. (1994a). Testing for no effect in nonparametric regression via spline smoothing techniques. Ann. Inst. Statist. Math. 46, 251-265. Chen, J.-C. (1994b). Testing goodness of fit of polynomial models via spline smoothing techniques. Statist. Probab. Lett. 19, 65-76. Chernoff, H. (1954). On the distribution of the likelihood ratio. Ann. Math. Statist. 25, 573-578. Chernoff, H. (1973). The use of faces to represent points in k-dimensional space graphically. J. Amer. Statist. Assoc. 68, 361-368. Chiu, S.-T. (1990). On the asymptotic distributions of bandwidth estimates. Ann. Statist. 18, 1696-1711. Chiu, S.-T. and Marron, J. S. (1990). The negative correlations between data-determined bandwidths and the optimal bandwidth. Statist. Probab. Lett. 10, 173-180.
Chu, C. K. and Marron, J. S. (1991). Choosing a kernel regression estimator. Statist. Sci. 6, 425-427. Chui, C. K. (1992). An Introduction to Wavelets. Academic Press, San Diego, CA. Chung, K. L. (1974). A Course in Probability Theory. Academic Press, New York. Clark, R. M. (1977). Nonparametric estimation of a smooth regression function. J. Roy. Statist. Soc. Ser. B 39, 107-113. Cleveland, W. S. (1993). Visualizing Data. Hobart Press, Summit, NJ. Cleveland, W. S. and Devlin, S. J. (1988). Locally weighted regression: an approach to regression analysis by local fitting. J. Amer. Statist. Assoc. 83, 596-610.
Conover, W. J. (1980). Practical Nonparametric Statistics. John Wiley & Sons, New York. Cook, R. D. and Weisberg, S. (1983). Diagnostics for heteroscedasticity in regression. Biometrika 70, 1-10. Cox, D. and Koh, E. (1989). A smoothing spline based test of model adequacy in polynomial regression. Ann. Inst. Statist. Math. 41, 383-400. Cox, D., Koh, E., Wahba, G. and Yandell, B. (1988). Testing the (parametric) null model hypothesis in (semiparametric) partial and generalized spline models. Ann. Statist. 16, 113-119. Cox, D. R. (1962). Further results on tests of separate families of hypotheses. J. Roy. Statist. Soc. Ser. B 24, 406-424.
D'Agostino, R. B. and Stephens, M. A. (1986). Goodness-of-Fit Techniques. Marcel Dekker, New York. Daubechies, I. (1988). Orthonormal bases of compactly supported wavelets. Comm. Pure Appl. Math. 41, 909-996. Davies, R. B. (1980). The distribution of a linear combination of χ² random variables. Appl. Statist. 29, 323-333. Delgado, M. A. (1993). Testing the equality of nonparametric regression curves. Statist. Probab. Lett. 17, 199-204. Devroye, L. and Gyorfi, L. (1985). Nonparametric Density Estimation: The L1 View. John Wiley & Sons, New York. Diggle, P. (1990). Time Series: A Biostatistical Introduction. Oxford University Press, Oxford. Donoho, D. L. (1988). One-sided inference about functionals of a density. Ann. Statist. 16, 1390-1420. Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation via wavelet shrinkage. Biometrika 81, 425-455. Durbin, J. and Knott, M. (1972). Components of Cramer-von Mises statistics, I. J. Roy. Statist. Soc. Ser. B 34, 290-307. Durbin, J. and Watson, G. S. (1950). Testing for serial correlation in least squares regression I. Biometrika 37, 409-428. Epanechnikov, V. A. (1969). Nonparametric estimates of a multivariate probability density. Theory Probab. Appl. 14, 153-158. Eubank, R. L. (1988). Spline Smoothing and Nonparametric Regression. Marcel Dekker, New York. Eubank, R. L. (1995). On testing for no effect in nonparametric regression. Unpublished manuscript. Eubank, R. L. and Hart, J. D. (1992). Testing goodness-of-fit in regression via order selection criteria. Ann. Statist. 20, 1412-1425. Eubank, R. L. and Hart, J. D. (1993). Commonality of cusum, von Neumann and smoothing-based goodness-of-fit tests. Biometrika 80, 89-98. Eubank, R. L., Hart, J. D. and LaRiccia, V. N. (1993). Testing goodness of fit via nonparametric function estimation techniques. Comm. Statist. - Theory Methods 22, 3327-3354. Eubank, R. L., Hart, J. D., Simpson, D. G. and Stefanski, L. A. (1995). Testing for additivity in nonparametric regression.
Ann. Statist. 23, 1896-1920. Eubank, R. L., Hart, J.D. and Speckman, P. (1990). Trigonometric series regression estimators with an application to partly linear models. J. Multivar. Anal. 32, 70-83. Eubank, R. L., LaRiccia, V. N. and Rosenstein, R. (1987). Test statistics derived as components of Pearson's phi-squared distance measure. J. Amer. Statist. Assoc. 82, 816-825. Eubank, R. L. and Speckman, P. (1990). Curve fitting by polynomial-trigonometric regression. Biometrika 77, 1-9. Eubank, R. L. and Speckman, P. (1993). Confidence bands in nonparametric regression. J. Amer. Statist. Assoc. 88, 1287-1301. Eubank, R. L. and Spiegelman, C. (1990). Testing the goodness-of-fit of a linear model via nonparametric regression techniques. J. Amer. Statist. Assoc. 85, 387-392.
Fan, J. (1992). Design-adaptive nonparametric regression. J. Amer. Statist. Assoc. 87, 998-1004. Fan, J. (1996). Test of significance based on wavelet threshholding and Neyman's truncation. J. Amer. Statist. Assoc. 91, 674-688. Fan, J. and Gijbels, I. (1995). Data-driven bandwidth selection in local polynomial fitting: variable bandwidth and spatial adaptation. J. Roy. Statist. Soc. Ser. B 57, 371-394. Fan, J. and Gijbels, I. (1996). Local Polynomial Modeling and its Applications. Chapman & Hall, London. Farebrother, R. W. (1990). The distribution of a quadratic form in normal variables. Appl. Statist. 39, 294-309. Firth, D., Glosup, J. and Hinkley, D. V. (1991). Model checking with nonparametric curves. Biometrika 78, 245-252. Gasser, Th., Kneip, A. and Kohler, W. (1991). A flexible and fast method for automatic smoothing. J. Amer. Statist. Assoc. 86, 643-652. Gasser, Th. and Muller, H.-G. (1979). Kernel estimation of regression functions. Smoothing Techniques for Curve Estimation (Th. Gasser and M. Rosenblatt, eds.), Springer Lecture Notes in Mathematics No. 757, Springer-Verlag, Berlin, pp. 23-68. Gasser, Th., Muller, H.-G., Kohler, W., Molinari, L. and Prader, A. (1984). Nonparametric regression analysis of growth curves. Ann. Statist. 12, 210-229. Gasser, Th., Muller, H.-G. and Mammitzsch, V. (1985). Kernels for nonparametric curve estimation. J. Roy. Statist. Soc. Ser. B 47, 238-252. Gasser, Th., Sroka, L. and Jennen-Steinmetz, C. (1986). Residual variance and residual pattern in nonlinear regression. Biometrika 73, 625-633. Ghosh, B. K. and Huang, W. (1991). The power and optimal kernel of the BickelRosenblatt test for goodness of fit. Ann. Statist. 19, 999-1009. Gibbons, J. D. (1971). Nonparametric Statistical Inference. McGraw-Hill, New York. Gray, H. L. (1988). On a unification of bias reduction and numerical approximation. Essays in Honor of Franklin A. Graybill (J. N. Srivastava, ed.), Elsevier Science Publishers B.V. (North-Holland), Amsterdam, pp. 
105-116. Green, P. J. and Silverman, B. W. (1994). Nonparametric Regression and Generalized Linear Models. Chapman & Hall, London. Grenander, U. and Rosenblatt, M. (1957). Statistical Analysis of Stationary Time Series. John Wiley & Sons, New York. Gu, C. (1992). Diagnostics for nonparametric regression models with additive terms. J. Amer. Statist. Assoc. 87, 1051-1058. Hall, P. (1983). Measuring the efficiency of trigonometric series estimates of a density. J. Multivar. Anal. 13, 234-256. Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer-Verlag, New York. Hall, P. and Hart, J. D. (1990). Bootstrap test for difference between means in nonparametric regression. J. Amer. Statist. Assoc. 85, 1039-1049. Hall, P. and Johnstone, I. (1992). Empirical functionals and efficient smoothing parameter selection (with discussion). J. Roy. Statist. Soc. Ser. B 54, 475-530.
Hall, P., Kay, J. W. and Titterington, D. M. (1990). Asymptotically optimal difference based estimation of variance in nonparametric regression. Biometrika 77, 521-528. Hall, P. and Marron, J. S. (1990). On variance estimation in nonparametric regression. Biometrika 77, 415-419. Hall, P. and Titterington, D. M. (1988). On confidence bands in nonparametric density estimation and regression. J. Multivar. Anal. 27, 228-254. Hall, P. and Wehrly, T. E. (1991). A geometrical method for removing edge effects from kernel-type nonparametric regression estimators. J. Amer. Statist. Assoc. 86, 665-672. Hall, P. and Wilson, S. R. (1991). Two guidelines for bootstrap hypothesis testing. Biometrics 47, 757-762. Hannan, E. J. and Quinn, B. G. (1979). The determination of the order of an autoregression. J. Roy. Statist. Soc. Ser. B 41, 190-195. Härdle, W. (1990). Applied Nonparametric Regression. Cambridge University Press, Cambridge. Härdle, W. and Bowman, A. W. (1988). Bootstrapping in nonparametric regression: local adaptive smoothing and confidence bands. J. Amer. Statist. Assoc. 83, 102-110. Härdle, W., Hall, P. and Marron, J. S. (1988). How far are automatically chosen regression smoothing parameters from their optimum? (with discussion). J. Amer. Statist. Assoc. 83, 86-99. Härdle, W. and Mammen, E. (1993). Comparing nonparametric versus parametric regression fits. Ann. Statist. 21, 1926-1947. Härdle, W. and Marron, J. S. (1990). Semiparametric comparison of regression curves. Ann. Statist. 18, 63-89. Härdle, W. and Marron, J. S. (1991). Bootstrap simultaneous error bars for nonparametric regression. Ann. Statist. 19, 778-796. Hart, J. D. (1984). On the modal resolution of kernel density estimators. Statist. Probab. Lett. 2, 363-369. Hart, J. D. (1988). An ARMA type probability density estimator. Ann. Statist. 16, 842-855. Hart, J. D. (1994). Automated kernel smoothing of dependent data by using time series cross-validation. J. Roy. Statist. Soc. Ser. B 56, 529-542.
Hart, J. D. (1996). Some automated methods of smoothing time-dependent data. J. Nonparam. Statist. 6, 115-142. Hart, J. D. and Gray, H. L. (1985). The ARMA method of approximating probability density functions. J. Statist. Plan. Inference 12, 137-152. Hart, J.D. and Wehrly, T. E. (1992). Kernel regression when the boundary region is large, with an application to testing the adequacy of polynomial models. J. Amer. Statist. Assoc. 87, 1018-1024. Hart, J. D. and Yi, S. (1996). One-sided cross-validation. Technical Report No. 249, Department of Statistics, Texas A&M University. Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models. Chapman & Hall, London. Hoeffding, W. and Robbins, H. (1948). The central limit theorem for dependent random variables. Duke Math. J. 15, 773-780.
Hurvich, C. M. and Tsai, C.-L. (1995). Relative rates of convergence for efficient model selection criteria in linear regression. Biometrika 82, 418-425. Jayasuriya, B. R. (1996). Testing for polynomial regression using nonparametric regression techniques. J. Amer. Statist. Assoc. 91, 1626-1631. Jones, M. C. (1991). The roles of ISE and MISE in density estimation. Statist. Probab. Lett. 12, 51-56. Jones, M. C., Davies, S. J. and Park, B. U. (1994). Versions of kernel-type regression estimators. J. Amer. Statist. Assoc. 89, 825-832. Kallenberg, W. C. M. and Ledwina, T. (1995). Consistency and Monte Carlo simulation of a data driven version of smooth goodness-of-fit tests. Ann. Statist. 23, 1594-1608. Karlin, S. (1968). Total Positivity. Stanford University Press, Stanford, CA. Kendall, M. and Stuart, A. (1979). The Advanced Theory of Statistics. Charles Griffin & Company Ltd, New York. Kim, J. (1994). Test for change in a mean function when data are dependent. Ph.D. dissertation, Department of Statistics, Texas A&M University. Kim, J.-T. (1992). Testing goodness-of-fit via order selection criteria. Ph.D. dissertation, Department of Statistics, Texas A&M University. King, E. C. (1988). A test for the equality of two regression curves based on kernel smoothers. Ph.D. dissertation, Department of Statistics, Texas A&M University. King, E., Hart, J. D. and Wehrly, T. E. (1991). Testing the equality of two regression curves using linear smoothers. Statist. Probab. Lett. 12, 239-247. Knafl, G., Sacks, J. and Ylvisaker, D. (1985). Confidence bands for regression functions. J. Amer. Statist. Assoc. 80, 683-691. Kuchibhatla, M. and Hart, J. D. (1996). Smoothing-based lack-of-fit tests: variations on a theme. J. Nonparam. Statist. 7, 1-22. Ledwina, T. (1994). Data-driven version of Neyman's smooth test of fit. J. Amer. Statist. Assoc. 89, 1000-1005. Lee, G.-H. (1996). A statistical wavelet approach to model selection and data-driven Neyman smooth tests. Ph.D. dissertation, Department of Statistics, Texas A&M University. Lehmann, E. (1959). Testing Statistical Hypotheses. John Wiley & Sons, New York. Li, K.-C. (1987). Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: Discrete index set. Ann. Statist. 15, 958-975. Li, K.-C. (1989). Honest confidence regions for nonparametric regression. Ann. Statist. 17, 1001-1008. Liaw, A. (1997). An application of Fourier series smoothing to a diagnostic test of heteroscedasticity. Ph.D. dissertation, Department of Statistics, Texas A&M University. Mallat, S. (1989). Multiresolution approximations and wavelet orthonormal bases of L²(ℝ). Trans. Amer. Math. Soc. 315, 69-87. Mallows, C. L. (1973). Some comments on Cp. Technometrics 15, 661-675. Mammen, E. (1993). Bootstrap and wild bootstrap for high dimensional linear models. Ann. Statist. 21, 255-285. Marron, J. S. and Wand, M. P. (1992). Exact mean integrated squared error. Ann. Statist. 20, 712-736.
Muller, H.-G. (1984). Optimal designs for nonparametric kernel regression. Statist. Probab. Lett. 2, 285-290. Muller, H.-G. (1991). Smooth optimum kernel estimators near endpoints. Biometrika 78, 521-530. Muller, H.-G. (1992). Goodness-of-fit diagnostics for regression models. Scand. J. Statist. 19, 157-172. Muller, H.-G. and Stadtmuller, U. (1987). Variable bandwidth estimators of regression curves. Ann. Statist. 15, 182-201. Munson, P. J. and Jernigan, R. W. (1989). A cubic spline extension of the Durbin-Watson test. Biometrika 76, 39-47. Nadaraya, E. A. (1964). On estimating regression. Theory Probab. Appl. 9, 141-142. Nair, V. N. (1986). On testing against ordered alternatives in analysis of variance models. Biometrika 73, 493-499. Newton, H. J. (1988). Timeslab: A Time Series Analysis Laboratory. Wadsworth & Brooks/Cole, Belmont, CA. Neyman, J. (1937). 'Smooth' test for goodness of fit. Skandinavisk Aktuarietidskrift 20, 149-199. Neyman, J. and Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Phil. Trans. Roy. Soc. Ser. A 231, 289-337. Noether, G. E. (1955). On a theorem of Pitman. Ann. Math. Statist. 26, 64-68. Nychka, D. (1988). Bayesian confidence intervals for smoothing splines. J. Amer. Statist. Assoc. 83, 1134-1143. Opsomer, J. and Ruppert, D. (1996). A fully automated bandwidth selection method for fitting additive models. Unpublished manuscript. Pace, L. and Salvan, A. (1990). Best conditional tests of separate families of hypotheses. J. Roy. Statist. Soc. Ser. B 52, 125-134. Page, E. S. (1954). Continuous inspection schemes. Biometrika 41, 100-115. Park, B. U. and Marron, J. S. (1990). Comparison of data-driven bandwidth selectors. J. Amer. Statist. Assoc. 85, 66-72. Parzen, E. (1977). Multiple time series: determining the order of approximating autoregressive schemes. Multivariate Analysis - IV (P. Khrishnaiah, ed.), North-Holland, Amsterdam, pp. 283-295. Parzen, E. (1981).
Nonparametric statistical data science: a unified approach based on density estimation and testing for "white noise." Technical Report, Department of Statistics, Texas A&M University. Priestley, M. B. (1981). Spectral Analysis and Time Series. Academic Press, London. Priestley, M. B. and Chao, M. T. (1972). Non-parametric function fitting. J. Roy. Statist. Soc. Ser. B 34, 385-392. Ramachandran, M. (1992). Testing for goodness of fit using nonparametric techniques. Ph.D. dissertation, Department of Statistics, Texas A&M University. Rao, C. R. (1973). Linear Statistical Inference and its Applications. John Wiley & Sons, New York. Rayner, J. C. W. and Best, D. J. (1989). Smooth Tests of Goodness of Fit. Oxford University Press, New York.
Rayner, J. C. W. and Best, D. J. (1990). Smooth tests of goodness of fit: an overview. Int. Statist. Rev. 58, 9-17. Raz, J. (1990). Testing for no effect when estimating a smooth function by nonparametric regression: a randomization approach. J. Amer. Statist. Assoc. 85, 132-138. Rice, J. (1984a). Boundary modification for kernel regression. Comm. Statist. Theory Methods 13, 893-900. Rice, J. (1984b). Bandwidth choice for nonparametric regression. Ann. Statist. 12, 1215-1230. Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. Ann. Statist. 11, 416-431. Rosenblatt, M. (1975). A quadratic measure of deviation of two-dimensional density estimates and a test of independence. Ann. Statist. 3, 1-14. Ruppert, D., Sheather, S. J. and Wand, M. P. (1995). An effective bandwidth selector for local least squares regression. J. Amer. Statist. Assoc. 90, 1257-1270. Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461-464. Scott, D. W. (1992). Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, New York. Serfling, R. J. (1970). Moment inequalities for the maximum cumulative sum. Ann. Math. Statist. 41, 1227-1234. Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. John Wiley & Sons, New York. Shibata, R. (1976). Selection of the order of an autoregressive model by Akaike's information criterion. Biometrika 63, 117-126. Silverman, B. W. (1981). Using kernel density estimates to investigate multimodality. J. Roy. Statist. Soc. Ser. B 43, 97-99. Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall, London. Sockett, E. B., Daneman, D., Clarson, C. and Ehrich, R. M. (1987). Factors affecting and patterns of residual insulin secretion during the first year of type I (insulin dependent) diabetes mellitus in children. Diabetes 30, 453-459. Spiegelman, C. and Wang, C. Y. (1994).
Detecting interactions using low dimensional searches in high dimensional data. Chemometr. Intell. Lab. Syst. 23, 293-299. Spitzer, F. (1956). A combinatorial lemma and its applications to probability theory. Trans. Amer. Math. Soc. 82, 323-339. Staniswalis, J. and Severini, T. A. (1991). Diagnostics for assessing regression models. J. Amer. Statist. Assoc. 86, 684-692. Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist. 10, 1040-1053. Stone, C. J. (1985). Additive regression and other nonparametric models. Ann. Statist. 13, 689-705. Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). J. Roy. Statist. Soc. Ser. B 36, 111-147. Tarter, M. E. and Lock, M. D. (1993). Model-Free Curve Estimation. Chapman & Hall, New York.
Terrell, G. R. and Scott, D. W. (1985). Oversmoothed nonparametric density estimates. J. Amer. Statist. Assoc. 80, 209-214. Tolstov, G. P. (1962). Fourier Series. Dover, New York. van Es, B. (1992). Asymptotics for least squares cross-validation bandwidths in nonsmooth cases. Ann. Statist. 20, 1647-1657. Vieu, P. (1991). Nonparametric regression: Optimal local bandwidth choice. J. Roy. Statist. Soc. Ser. B 53, 453-464. von Neumann, J. (1941). Distribution of the ratio of the mean squared successive difference to the variance. Ann. Math. Statist. 12, 367-395. Wahba, G. (1983). Bayesian 'confidence intervals' for the cross-validated smoothing spline. J. Roy. Statist. Soc. Ser. B 45, 133-150. Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia. Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman & Hall, New York. Watson, G. S. (1964). Smooth regression analysis. Sankhya Ser. A 26, 359-372. Wells, M. (1990). The relative efficiency of goodness-of-fit statistics in the simple and composite hypothesis-testing problem. J. Amer. Statist. Assoc. 85, 459-463. White, H. (1982). Regularity conditions for Cox's test of non-nested hypotheses. J. Econometr. 19, 301-318. Wood, A. T. A. (1989). An F approximation to the distribution of a linear combination of chi-squared variables. Comm. Statist. - Simul. Comput. 18, 1439-1456. Woodroofe, M. (1982). On model selection and the arc sine laws. Ann. Statist. 10, 1182-1194. Yanagimoto, T. and Yanagimoto, M. (1987). The use of marginal likelihood for a diagnostic test for the goodness of fit of the simple linear regression model. Technometrics 29, 95-101. Yi, S. (1996). On one-sided cross-validation in nonparametric regression. Ph.D. dissertation, Department of Statistics, Texas A&M University. Yin, Y. and Carroll, R. J. (1990). A diagnostic for heteroscedasticity based on the Spearman rank correlation. Statist. Probab. Lett. 10, 69-76. Young, S. G. and Bowman, A. W. (1995).
Non-parametric analysis of covariance. Biometrics 51, 920-931. Zhang, P. (1994). On the distributional properties of model selection criteria. J. Amer. Statist. Assoc. 87, 732-737.
Index
additive model, 116, 232-234, 253, 263-266 Akaike, H., 245, 271 Akaike's Information Criterion (AIC), 245 Anderson, T. W., 269, 271 approximation of functions, 21, 25-26, 35-49 approximator, 28, 42, 43, 45, 46, 48 arithmetic means, 25 asymptotic distribution of cusum statistic, 139 data-driven bandwidths, 94-105 data-driven Neyman smooth statistic, 185-187 Gasser-Muller estimator, 76-78 kernel-smoother-based test statistic, 154-156 Neyman smooth statistic, 141 order selection statistic, 168-175, 210-217 truncated series estimator, 76-78 von Neumann statistic, 138 von Neumann statistic applied to residuals, 130-131 asymptotic normality, 76, 95-96, 121, 130, 156, 157, 217 asymptotic relative efficiency, 76, 100, 137, 184 autocorrelation, 240, 241, 244, 250, 251 autoregressive process, 107, 166, 172, 244, 245, 249, 250, 252 autoregressive, moving average process, 42
average squared error, 108 Azzalini, A., 163, 271 Babinet data, 160-161, 253-258 Babinet point, 160 bandwidth, basic issues in choosing, 11-14 definition of, 6 illustrated effect of, 9 optimal, 11, 14, 57-58, 63-65, 79, 88, 89, 92, 95, 108, 109, 111, 113, 114, 154, 157, 159, 160, 218 variable, 14-18, 63-64, 115 Barry, D., 207, 218, 233, 271 Bartlett's test, 242, 244 Bartlett, M. S., 242, 271 Bayes factor, 192 Bayes information criterion (BIC), 106 Bayesian methods, 80-81, 134-136, 189-195, 256-257 Bellver, C., 160, 271 Berry, Scott, viii Berry-Esseen theorem, 168 Best, D. J., 163, 239, 277, 278 Bhansali, R. J., 245, 246, 271 Bhattacharya, P. K., 227, 271 Bhattacharya, R. N., 168, 216, 271 bias of Gasser-Muller estimator, 55-56, 62-63 of sample Fourier coefficient, 67-68 bias reduction, 31, 37, 39, 63 Bickel, P. J., 163, 207, 271
Billingsley, P., 155, 156, 271 binomial distribution, 179 bootstrap, 80, 83, 130, 131, 133, 150, 157, 161, 181, 182, 188, 189, 208, 219, 221-223, 227, 228, 236, 237, 252, 254, 255, 259 double, 134, 221-223 wild, 228 Bowman, A. W., 80, 81, 160, 163, 237, 271, 275, 279 Box, G. E. P., 129, 242, 271, 272 Buckley, M. J., 129, 134, 135, 272 Butzer, P. L., 26, 65, 272 C-peptide concentration, 263 calculus of variations, 58, 102 Calvin, Jim, viii Carroll, R. J., 219, 235, 272, 279 central limit theorem, 76, 131 Chao, M. T., 8, 277 Chen, Chien-Feng, viii Chen, J.-C., 163, 218, 272 Chen, Ray, viii Chernoff, H., 120, 261, 272 Chiu, S.-T., 91, 98-100, 104, 272 Chu, C. K., 10, 272 Chui, C. K., 44, 272 Chung, K. L., 77, 272 circular design, 98, 105 Clark, R. M., 8, 272 Clarson, C., 263, 278 Cleveland, W. S., 160, 163, 272 collinearity, 213 comparing parametric and nonparametric models, 145, 147-148 complete class, 21 components-based test, 163 concurvity, 233 confidence bands, 81 confidence intervals, 50, 76-83, 179-181 Conover, W. J., 179, 272 consistency of cusum test, 139 of maximum risk test, 197 of order selection test, 195-197, 223-224 of von Neumann test, 138
consistent test, definition of, 131 continuous smoothing parameter, 26, 106 convergence in mean square, 21 convolution, 8, 10, 65, 66, 80, 156 Cook, R. D., 235, 272 cosine series, 35, 42-46, 48, 207, 255, 257, 258 covariance stationary, 240, 243, 244, 248, 252 Cox, D., 136, 163, 218, 272 Cox, D. R., 120, 272 Cramer-von Mises test, 208, 225, 240 Cramer-Wold device, 156, 167 criterion autoregressive transfer (CAT), 245 cross-validation, 84-86, 90-92, 94-103, 107-115, 218, 219, 237, 256 one-sided, 84, 90-92, 98-105, 108, 115, 218, 219, 251, 255 curse of dimensionality, 232, 233 cusp, 29, 30, 33 cusum, 134-137, 139, 195, 197-201, 268-269 cut-and-normalize method, 29 D'Agostino, R. B., 238, 273 Daneman, D., 263, 278 Darling, D. A., 269, 271 data-reflection, 75 Daubechies, I., 46, 273 Daubechies wavelet, 46 Davies, R. B., 129, 273 Davies, S. J., 10, 40, 276 Delgado, M. A., 238, 244, 261, 273 derivatives, estimation of, 63-65, 115 design and kernel estimators, 10, 13, 39-40, 54-58, 60, 64, 238 and local linear estimators, 39-40, 56, 238 and Rogosinski estimators, 72-75 and truncated series estimators, 67-69 design density, 50, 51, 54, 56, 64, 67, 68, 72, 146, 238 design, optimal, 13, 64 design, random, 40, 226-228
design-independent bias, 56, 238
Devlin, S. J., 163, 272
Devroye, L., 4, 273
diabetes data, 263-266
difference-based variance estimator, 86, 179
Diggle, P., 258, 261, 273
discrete smoothing parameter, 50, 65, 106
distribution-free test, 183-184
divergent series, 25
Donoho, D. L., 48, 80, 206, 273
Durbin, J., 133, 163, 273
Eagleson, G. K., 129, 272
Ehrich, R. M., 263, 278
eigenvalues, 129
empirical distribution, 83, 131, 133, 150, 182, 222, 227, 228, 255
Epanechnikov, V. A., 58, 273
Eubank, R., 273
Eubank, R. L., viii, 4, 34, 40, 75, 81, 124, 133, 138, 163, 168, 204, 205, 207, 216, 233, 273
Fan, J., 4, 38, 40, 56, 92, 102, 115, 206, 274
Farebrother, R. W., 129, 274
Fejér series, 26
Fejér weights, 25
Firth, D., 163, 274
Fourier coefficients, definition of, 21
Fourier coefficients, sample, 67-68, 136, 165-168, 189, 190, 196, 197, 201, 206, 210, 213, 214, 217, 219, 221, 224, 225, 227, 235, 256
frequentist, 80, 81, 94, 192
full model, 151, 234, 263
Gasser, Th., 8, 31, 51, 64, 65, 89, 96, 108, 123, 274
Gauss-Newton algorithm, 44
Gaussian process, 155, 157, 215, 217, 250
generalized cross-validation, 87, 92
Ghosh, B. K., 163, 274
Gibbons, J. D., 184, 274
Gijbels, I., 4, 115, 274
Glosup, J., 163, 274
goodness-of-fit test, defined, 118
Gram-Schmidt procedure, 151, 213, 227, 229, 231, 256
graphical test, 175-176, 189, 196
Gray, Buddy, viii
Gray, H. L., 35, 41, 274, 275
Green, P. J., 4, 274
Grenander, U., 261, 274
Gu, C., 233, 274
Györfi, L., 4, 273
Haar wavelet, 45-48
half normal density, 33
Hall, P., 75, 76, 81, 84, 87, 90, 91, 94-98, 131, 163, 181, 221, 222, 227, 237, 274, 275
Hannan, E. J., 107, 275
Härdle, W., 4, 80, 81, 94, 163, 228, 238, 271, 275
Hart, J. D., 41, 43, 75, 80, 82, 90, 131, 133, 138, 160, 163, 168, 181, 189, 204, 207, 216, 218, 219, 233, 237, 238, 251, 252, 258, 273-276
Hartigan, J. A., 207, 218, 271
Hastie, T. J., 233, 263, 275
heteroscedasticity, 258
Hinkley, D. V., 163, 274
Hoeffding, W., 131, 275
homoscedasticity, 227, 234-236, 253, 258, 263
Huang, W., 163, 274
Hurvich, C. M., 106, 107, 276
Integrated squared error, 14, 21, 42, 46, 49, 66, 72
invariant test, 125, 136
Jayasuriya, B. R., 218, 276
Jennen-Steinmetz, C., 123, 274
Jernigan, R. W., 127, 133, 277
Johnstone, I. M., 48, 84, 90, 91, 95-98, 206, 273, 274
Jones, M. C., 4, 10, 40, 95, 276, 279
Kallenberg, W. C. M., 187, 207, 238, 239, 276
Karlin, S., 80, 276
Kay, J. W., 87, 275
Kayley, v, viii
Kendall, M., 120, 276
kernel estimator
  convolution type, 10, 66
  evaluation type, 10
  Gasser-Müller, 8, 10-12, 15, 17-20, 22, 28-33, 40, 41, 50-52, 55, 56, 59, 61, 62, 64, 65, 76-78, 80-82, 85, 88, 92, 148, 217-219, 237, 238
  Nadaraya-Watson, 6, 8-10, 14, 16, 29, 31, 33, 38-41, 50, 56, 106, 148, 230, 238, 252
  Priestley-Chao, 8, 10, 11, 29, 94, 102, 153
  variable bandwidth, 14-18, 63-64, 115
kernel
  boundary, 31, 32, 39, 41, 59, 60, 62, 95, 101
  Dirichlet, 22, 33, 65, 71, 76
  Epanechnikov, 15, 16, 18, 31, 39, 58, 60, 76, 83, 92, 102, 108, 115, 158, 255
  Fejér-Korovkin, 65
  finite support, 10, 11, 28, 41, 51, 58, 95
  Gaussian, 8, 9, 11, 14, 22, 58, 80
  higher order, 62-63
  quartic, 58, 101, 102, 104
  rectangular, 8
  Rogosinski, 26, 27, 33, 58, 59, 65, 71, 76, 204
  second order, 62, 63, 65, 75, 76, 98, 99, 102, 148
  triangle, 58
Kim, J., 251, 276
Kim, J.-T., 207, 238, 239, 276
King, E. C., 154, 160, 163, 238, 276
Knafl, G., 81, 276
Kneip, A., 89, 96, 108, 274
knots, 40, 41
Knott, M., 163, 273
Koh, E., 136, 163, 218, 272
Köhler, W., 64, 89, 96, 108, 274
Kolmogorov-Smirnov test, 208, 240, 244
Kuchibhatla, M., 189, 204, 258, 276
Lack-of-fit test, defined, 118
LaRiccia, V. N., 163, 273
least squares, 21, 34, 37, 38, 41, 44, 81-83, 87, 89, 121, 123, 125-127, 142, 148, 149, 151, 153, 165, 187, 208-210, 213, 214, 217-219, 224, 225, 227, 229-231, 234, 249, 254, 255, 264
Ledwina, T., 107, 186, 187, 207, 238, 239, 276
Lee, Cherng-Luen, viii
Lee, G.-H., viii, 204, 206, 276
Legendre polynomials, 141
Lehmann, E., 125, 142, 276
luteinizing hormone level, 258
Li, K.-C., 81, 106, 276
Liapounov central limit theorem, 187
Liapounov condition, 77, 78
Liaw, A., viii, 235, 236, 276
likelihood ratio test, 3, 118-122, 125
Lindeberg-Feller theorem, 156, 167, 211
Lipschitz continuous, defined, 50
local alternatives, 137, 143, 153-157, 195, 201-203, 205, 236
local likelihood, 163
local linear estimator, 37-41, 56, 91-93, 98, 102, 103, 107, 108, 115, 148, 158-161, 163, 238, 254, 255, 257, 263
local polynomial estimator, 2, 37-40, 44, 115, 144, 217, 218
local quadratic estimator, 38
locally most powerful, 134, 136
Lock, M. D., 4, 278
Lombard, Fred, viii
loss function, 94, 105
Mallat, S., 46, 276
Mallows, C. L., 87, 276
Mammen, E., 163, 228, 275, 276
Mammitzsch, V., 65, 274
Markov's inequality, 268
Marron, J. S., 10, 63, 81, 88, 91, 94, 181, 228, 238, 272, 275-277
maximal rate, 137-139, 143, 154, 156, 157, 201, 203, 204
mean average squared error, 88, 159
mean integrated squared error
  of Gasser-Müller estimator, 61-62
  of Rogosinski estimator, 71-76
  of truncated series estimator, 68-71
mean square convergence, 25
mean squared error
  of Gasser-Müller estimator, 40, 51-61
  of local linear estimator, 40
  of local polynomial estimator, 38
mean value theorem, 52
Michelle, v, viii
mineral assay data, 261-263
model selection, 255
Molinari, L., 64, 274
moving average, 6
moving average process, 243, 244
Müller, H.-G., 8, 13, 31, 51, 60, 63-65, 101, 115, 163, 274, 277
multivariate normal distribution, 155, 216, 217, 220, 221, 247
Munson, P. J., 127, 133, 277
Nadaraya, E. A., 6, 277
Nair, V. N., 136, 277
natural spline interpolant, 41, 133
Nessel, R. J., 26, 65, 272
nested models, 120, 125
Newton, H. J., 240-242, 245, 246, 277
Neyman smooth test, 140-143, 152, 163, 167, 185-187, 195, 197-201, 203-205, 207, 242, 259
Neyman, J., 119, 141, 163, 185, 277
no-effect hypothesis, 3, 132-135, 141-143, 148, 164, 168, 172, 175, 176, 178, 185, 187-189, 191, 195, 227, 238, 239, 242, 248
Noether, G. E., 137, 277
non-Gaussian (non-normal), 121, 129-131, 150, 151, 177, 181-183, 211
nonlinear model, 129, 133, 145, 208, 219-223
non-nested models, 120, 121
normalized estimator, 29-33, 59, 61, 62, 70
Nychka, D., 81, 277
Omnibus test, 131, 138, 145, 163, 195, 203, 242
Opsomer, J., 116, 277
orthogonal basis, 19, 21, 46, 151, 205, 213, 256
orthogonal polynomials, 19, 22, 212
orthogonal wavelet, 45, 46
orthonormal, 141, 205, 224, 256
orthonormal basis, 45
Pace, L., 121, 277
Page, E. J., 134, 277
Park, B. U., 10, 40, 88, 91, 276, 277
Parseval's formula, 66, 136, 181
parsimony, 35, 41, 42
Parzen, E., 165, 188, 206, 245, 277
Parzen, Manny, viii
Pearson, E. S., 119, 277
permutation test, 150
piecewise constant function, 8
piecewise linear function, 10
piecewise smooth function, definition of, 67
Pierce, D. A., 242, 272
Pitman relative efficiency, 137
pivotal quantity, 221, 223
plug-in, 84, 88-90, 92-98, 107-110, 112-116
pointwise convergence, 25, 231
polynomial regression, 22, 125
polynomial-trigonometric regression, 34
polynomials, testing the fit of, 217-219
portmanteau test, 242, 244
posterior distribution, 191, 193
posterior probability, 191, 192, 194, 257
posterior risk, 94
power
  and smoothing parameters, 158-161
  of cusum test, 197-201
  of Neyman smooth test, 197-201
  of order selection test, 197-203
power transformation, 18, 20
Prader, A., 64, 274
Priestley, M. B., 4, 8, 241-243, 259, 277
prior distribution, 135, 189-192, 206, 257
prior probability, 191, 194, 257
prior, convenience, 191
prior, noninformative, 190, 191, 257
probability bands, 81-83
probability density estimation, 1, 41, 76, 107, 163
pseudo-residuals, 123, 124
pure experimental error, 122-124
P-value, 81-83, 174, 175, 253, 254, 258, 259, 261, 264
Quadratic form, 126, 128, 129, 135, 136, 149-151
quadrature, 21, 67
quadrature bias, 67
Quinn, B. G., 107, 275
Rabbit jawbone data, 17-18
Ramachandran, M., 189, 277
random walk, 168, 174
Ranga Rao, R., 168, 216, 271
rank test, 183-184, 235, 236
Rao, C. R., 124, 151, 216, 277
rational functions, 41-45
Rayner, J. C. W., 163, 239, 277, 278
Raz, J., 150, 151, 163, 278
reduced model, 151
reduction method, 3, 124-126, 138, 142, 151, 152, 166
regression quantile function, 165, 261
relative efficiency, 58, 76, 137
residual analysis, 17-18, 234, 257
residuals, 83, 85, 122, 123, 125, 127, 128, 131, 133-135, 144-153, 158, 179, 181, 182, 208-210, 213, 217, 219, 222-224, 228-231, 234-236, 252, 254, 257-259
resolution level, 46, 47
Rice, J., 31, 94, 96, 97, 278
risk estimation, 84, 86-88, 97, 106, 243, 247, 255, 256, 258
risk function, 86, 94
risk regret, 97, 98
Rissanen, J., 191, 257, 278
Ritov, Y., 207, 271
Robbins, H., 131, 275
Rogosinski series estimator, 27, 32, 33, 35, 36, 59, 70-76, 188-189, 259, 261
Rosenblatt, M., 163, 261, 271, 274, 278
Rosenstein, R., 163, 273
Ruppert, D., 88, 89, 115, 116, 219, 272, 277, 278
Sacks, J., 81, 276
Salvan, A., 121, 277
sampling distribution, 50, 76, 84, 104, 108, 131, 150, 160, 183, 219, 227, 228, 231, 252
Schucany, Bill, viii
Schwarz, G., 106, 278
Scott, D. W., 4, 80, 278, 279
Serfling, R. J., 156, 171, 215, 278
serum alphafetoprotein data, 92
Severini, T. A., 163, 278
Sheather, S. J., 88, 89, 116, 278
Shibata, R., 107, 172, 245, 247, 278
side lobes, 22, 26, 71
significance trace, 158, 160-161, 238, 253
Silverman, B. W., 4, 80, 274, 278
Simpson, D. G., 204, 207, 233, 273
simulation, 64, 81-83, 97, 104, 107-113, 115, 129, 131, 151, 158, 179, 186-189, 199, 204-206, 218, 219, 222, 232, 236, 237, 246, 248, 264
Slutsky's theorem, 211, 215
smoother matrix, 150
smoothing, definition of, 4
smoothing parameter, definition of, 6
smoothing residuals, 145-149
smoothing splines, 40-41, 80, 144, 148, 163, 207, 217, 218
Sockett, E. B., 263, 278
Speckman, P., 34, 75, 81, 273
spectrum (spectra), 1, 42, 240-244, 253, 258, 259, 261
Spiegelman, C., 163, 233, 273, 278
Spitzer, F., 171, 174, 177, 278
Sroka, L., 123, 274
Stadtmüller, U., 63, 115, 277
Staniswalis, J., 163, 278
Stefanski, L. A., 204, 207, 233, 273
Stephens, M. A., 238, 273
Stone, C. J., 232, 233, 278
Stone, M., 85, 278
straight line, 2, 37, 41, 82, 83, 120, 121, 123, 126, 159, 219
straight lines, testing the fit of, 81-83, 148, 160-161, 163, 207, 253-255
strong law of large numbers, 175, 182
Stuart, A., 120, 276
Taper, 25, 26, 28, 33, 34, 50, 65, 70, 71, 75, 76, 95
Tarter, M. E., 4, 278
Taylor series, 55, 56, 62
Terrell, G. R., 80, 279
thresholding, 46, 48, 49, 206
Tibshirani, R. J., 233, 263, 275
tightness, 155, 156
time series, 1, 42, 166, 240-252, 259
Titterington, D. M., 81, 87, 275
Tolstov, G. P., 21, 25, 231, 279
total positivity, 80
transformation, 16-18, 20, 90, 91, 231
trigonometric series, 3, 19, 26, 37, 45, 65, 69, 125, 148, 164, 189, 208
truncated series estimator, 22, 32, 35, 50, 65-71, 76, 77, 87, 105, 106, 165
truncation bias, 67, 69, 71, 73, 75
truncation point, 22, 23, 26, 43, 65, 75, 76, 94, 107, 165, 172, 185, 189, 194, 196
Tsai, C.-L., 106, 107, 276
type I error probability, 148, 157, 173, 174, 231, 245, 246, 261, 263
Undersmooth, 79, 92, 93, 103, 255
uniform convergence, 21, 25, 26
uniformly most powerful, 125, 141-143
University of Seville, 160
van Es, B., 106, 279
variance-ratio, 121, 123, 147
Vieu, P., 115, 279
von Neumann, J., 127, 132, 279
Wahba, G., 4, 80, 136, 163, 272, 279
Wand, M. P., 4, 63, 88, 89, 116, 276, 278, 279
Wang, C. Y., 233, 278
Watson, G. S., 6, 133, 273, 279
wavelets, 19, 37, 44-49, 144, 206
Wehrly, T. E., 41, 75, 82, 95, 160, 163, 207, 218, 219, 238, 275, 276
Weisberg, S., 235, 272
Wells, M., 225, 279
White, H., 121, 279
Wilson, S. R., 221, 227, 275
window estimate, 6-8, 30
Wood, A. T. A., 129, 279
Woodroofe, M., 207, 279
wrapped distribution, 28
Yanagimoto, M., 163, 206, 207, 218, 279
Yanagimoto, T., 163, 206, 207, 218, 279
Yandell, B., 136, 163, 272
Yi, S., viii, 90, 98, 100, 102, 275, 279
Yin, Y., 235, 279
Ylvisaker, D., 81, 276
Young, S. G., 160, 237, 279
Zhang, P., 207, 279