Cover photo credit: Copyright © 1999 Dynamic Graphics, Inc.
This book is printed on acid-free paper.
Copyright © 2001, 1984 by Academic Press. All Rights Reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher.

Requests for permission to make copies of any part of the work should be mailed to: Permissions Department, Harcourt Inc., 6277 Sea Harbor Drive, Orlando, Florida 32887-6777.
Academic Press
A Harcourt Science and Technology Company
525 B Street, Suite 1900, San Diego, California 92101-4495, USA
http://www.academicpress.com

Academic Press
Harcourt Place, 32 Jamestown Road, London NW1 7BY, UK
http://www.academicpress.com

Library of Congress Catalog Card Number: 00-107735
International Standard Book Number: 0-12-746652-5

PRINTED IN THE UNITED STATES OF AMERICA
00 01 02 03 04 05 QW 9 8 7 6 5 4 3 2 1
Contents

Preface to the First Edition
Preface to the Revised Edition
    References

1  The Linear Model and Instrumental Variables Estimators
    References
    For Further Reading

2  Consistency
    2.1  Limits
    2.2  Almost Sure Convergence
    2.3  Convergence in Probability
    2.4  Convergence in rth Mean
    References

3  Laws of Large Numbers
    3.1  Independent Identically Distributed Observations
    3.2  Independent Heterogeneously Distributed Observations
    3.3  Dependent Identically Distributed Observations
    3.4  Dependent Heterogeneously Distributed Observations
    3.5  Martingale Difference Sequences
    References

4  Asymptotic Normality
    4.1  Convergence in Distribution
    4.2  Hypothesis Testing
    4.3  Asymptotic Efficiency
    References

5  Central Limit Theory
    5.1  Independent Identically Distributed Observations
    5.2  Independent Heterogeneously Distributed Observations
    5.3  Dependent Identically Distributed Observations
    5.4  Dependent Heterogeneously Distributed Observations
    5.5  Martingale Difference Sequences
    References

6  Estimating Asymptotic Covariance Matrices
    6.1  General Structure of V̂_n
    6.2  Case 1: {Z_t ε_t} Uncorrelated
    6.3  Case 2: {Z_t ε_t} Finitely Correlated
    6.4  Case 3: {Z_t ε_t} Asymptotically Uncorrelated
    References

7  Functional Central Limit Theory and Applications
    7.1  Random Walks and Wiener Processes
    7.2  Weak Convergence
    7.3  Functional Central Limit Theorems
    7.4  Regression with a Unit Root
    7.5  Spurious Regression and Multivariate FCLTs
    7.6  Cointegration and Stochastic Integrals
    References

8  Directions for Further Study
    8.1  Extending the Data Generating Process
    8.2  Nonlinear Models
    8.3  Other Estimation Techniques
    8.4  Model Misspecification
    References

Solution Set
References
Index
Preface to the First Edition
Within the framework of the classical linear model it is a fairly straightforward matter to establish the properties of the ordinary least squares (OLS) and generalized least squares (GLS) estimators for samples of any size. Although the classical linear model is an excellent framework for developing a feel for the statistical techniques of estimation and inference that are central to econometrics, it is not particularly well adapted to the study of economic phenomena, because economists usually cannot conduct controlled experiments. Instead, the data usually exist as the outcome of a stochastic process outside the control of the investigator. For this reason, both the dependent and the explanatory variables may be stochastic, and equation disturbances may exhibit nonnormality or heteroskedasticity and serial correlation of unknown form, so that the classical assumptions are violated. Over the years a variety of useful techniques has evolved to deal with these difficulties. Many of these amount to straightforward modifications or extensions of the OLS techniques (e.g., the Cochrane-Orcutt technique, two-stage least squares, and three-stage least squares). However, the finite sample properties of these statistics are rarely easy to establish outside of somewhat limited special cases. Instead, their usefulness is justified primarily on the basis of their properties in large samples, because these properties can be fairly easily established using the powerful tools provided by laws of large numbers and central limit theory. Despite the importance of large sample theory, it has usually received fairly cursory treatment in even the best econometrics textbooks. This is
really no fault of the textbooks, however, because the field of asymptotic theory has been developing rapidly. It is only recently that econometricians have discovered or established methods for treating adequately and comprehensively the many different techniques available for dealing with the difficulties posed by economic data. This book is intended to provide a somewhat more comprehensive and unified treatment of large sample theory than has been available previously and to relate the fundamental tools of asymptotic theory directly to many of the estimators of interest to econometricians. In addition, because economic data are generated in a variety of different contexts (time series, cross sections, time series-cross sections), we pay particular attention to the similarities and differences in the techniques appropriate to each of these contexts. That it is possible to present our results in a fairly unified manner highlights the similarities among a variety of different techniques. It also allows us in specific instances to establish results that are somewhat more general than those previously available. We thus include some new results in addition to those that are better known. This book is intended for use both as a reference and as a textbook for graduate students taking courses in econometrics beyond the introductory level. It is therefore assumed that the reader is familiar with the basic concepts of probability and statistics as well as with calculus and linear algebra and that the reader also has a good understanding of the classical linear model. Because our goal here is to deal primarily with asymptotic theory, we do not consider in detail the meaning and scope of econometric models per se. Therefore, the material in this book can be usefully supplemented by standard econometrics texts, particularly any of those listed at the end of Chapter 1. I would like to express my appreciation to all those who have helped in the evolution of this work.
In particular, I would like to thank Charles Bates, Ian Domowitz, Rob Engle, Clive Granger, Lars Hansen, David Hendry, and Murray Rosenblatt. Particular thanks are due Jeff Wooldridge for his work in producing the solution set for the exercises. I also thank the students in various graduate classes at UCSD, who have served as unwitting and indispensable guinea pigs in the development of this material. I am deeply grateful to Annetta Whiteman, who typed this difficult manuscript with incredible swiftness and accuracy. Finally, I would like to thank the National Science Foundation for providing financial support for this work under grant SES81-07552.
Preface to the Revised Edition
It is a gratifying experience to be asked to revise and update a book written over fifteen years previously. Certainly, this request would be unnecessary had the book not exhibited an unusual tenacity in serving its purpose. Such tenacity had been my fond hope for this book, and it is always gratifying to see fond hopes realized. It is also humbling and occasionally embarrassing to perform such a revision. Certain errors and omissions become painfully obvious. Thoughts of "How could I have thought that?" or "How could I have done that?" arise with regularity. Nevertheless, the opportunity is at hand to put things right, and it is satisfying to believe that one has succeeded in this. (I know, of course, that errors still lurk, but I hope that this time they are more benign or buried more deeply, or preferably both.) Thus, the reader of this edition will find numerous instances where definitions have been corrected or clarified and where statements of results have been corrected or made more precise or complete. The exposition, too, has been polished in the hope of aiding clarity. Not only is a revision of this sort an opportunity to fix prior shortcomings, but it is also an opportunity to bring the material covered up-to-date. In retrospect, the first edition of this book was more ambitious than originally intended. The fundamental research necessary to achieve the intended scope and cohesiveness of the overall vision for the work was by no means complete at the time the first edition was written. For example, the central limit theory for heterogeneous mixing processes had still not developed to
the desired point at that time, nor had the theories of optimal instrumental variables estimation or asymptotic covariance estimation. Indeed, the attempt made in writing the first edition to achieve its intended scope and coherence revealed a host of areas where work was needed, thus providing fuel for a great deal of my own research and (I like to think) that of others. In the years intervening, the efforts of the econometrics research community have succeeded wonderfully in delivering results in the areas needed and much more. Thus, the ambitions not realized in the first edition can now be achieved. If the theoretical vision presented here has not achieved a much better degree of unity, it can no longer be attributed to a lack of development of the field, but is now clearly identifiable as the author's own responsibility. As a result of these developments, the reader of this second edition will now find much updated material, particularly with regard to central limit theory, asymptotically efficient instrumental variables estimation, and estimation of asymptotic covariance matrices. In particular, the original Chapter 7 (concerning efficient estimation with estimated error covariance matrices) and an entire section of Chapter 4 concerning efficient IV estimation have been removed and replaced with much more accessible and coherent results on efficient IV estimation, now appearing in Chapter 4. There is also the progress of the field to contend with. When the first edition was written, cointegration was a subject in its infancy, and the tools needed to study the asymptotic behavior of estimators for models of cointegrated processes were years away from fruition. Indeed, results of DeJong and Davidson (2000) essential to placing estimation for cointegrated processes cohesively in place with the theory contained in the first six chapters of this book became available only months before work on this edition began. 
Consequently, this second edition contains a completely new Chapter 7 devoted to functional central limit theory and its applications, specifically unit root regression, spurious regression, and regression with cointegrated processes. Given the explosive growth in this area, we cannot here achieve a broad treatment of cointegration. Nevertheless, in the new Chapter 7 the reader should find all the basic tools necessary for entrée into this fascinating area. The comments, suggestions, and influence of numerous colleagues over the years have had effects both subtle and patent on the material presented here. With sincere apologies to anyone inadvertently omitted, I acknowledge with keen appreciation the direct and indirect contributions to the present state of this book by Takeshi Amemiya, Donald W. K. Andrews, Charles Bates, Herman Bierens, James Davidson, Robert DeJong,
Ian Domowitz, Graham Elliott, Robert Engle, A. Ronald Gallant, Arthur Goldberger, Clive W. J. Granger, James Hamilton, Bruce Hansen, Lars Hansen, Jerry Hausman, David Hendry, Søren Johansen, Edward Leamer, James MacKinnon, Whitney Newey, Peter C. B. Phillips, Eugene Savin, Chris Sims, Maxwell Stinchcombe, James Stock, Mark Watson, Kenneth West, and Jeffrey Wooldridge. Special thanks are due Mark Salmon, who originally suggested writing this book. UCSD graduate students who helped with the revision include Jin Seo Cho, Raffaella Giacomini, Andrew Patton, Sivan Ritz, Kevin Sheppard, Liangjun Su, and Nada Wasi. I also thank sincerely Peter Reinhard Hansen, who has assisted invaluably with the creation of this revised edition, acting as electronic amanuensis and editor, and who is responsible for preparation of the revised set of solutions to the exercises. Finally, I thank Michael J. Bacci for his invaluable logistical support and the National Science Foundation for providing financial support under grant SBR-9811562.
Del Mar, CA July, 2000
References

DeJong, R. M. and J. Davidson (2000). "The functional central limit theorem and weak convergence to stochastic integrals I: Weakly dependent processes." Econometric Theory, 16, forthcoming.
CHAPTER 1
The Linear Model and Instrumental Variables Estimators
The purpose of this book is to provide the reader with the tools and concepts needed to study the behavior of econometric estimators and test statistics in large samples. Throughout, attention will be directed to estimation and inference in the framework of a linear stochastic relationship such as
    Y_t = X_t'β_o + ε_t,    t = 1, ..., n,
where we have n observations on the scalar dependent variable Y_t and the vector of explanatory variables X_t = (X_{t1}, X_{t2}, ..., X_{tk})'. The scalar stochastic disturbance ε_t is unobserved, and β_o is an unknown k × 1 vector of coefficients that we are interested in learning about, either through estimation or through hypothesis testing. In matrix notation this relationship is written as

    Y = Xβ_o + ε,

where Y is an n × 1 vector, X is an n × k matrix with rows X_t', and ε is an n × 1 vector with elements ε_t. (Our notation embodies a convention we follow throughout: scalars will be represented in standard type, while vectors and matrices will be represented in boldface. Throughout, all vectors are column vectors.) Most econometric estimators can be viewed as solutions to an optimization problem. For example, the ordinary least squares estimator is the value
for β that minimizes the sum of squared residuals

    SSR(β) ≡ (Y − Xβ)'(Y − Xβ) = Σ_{t=1}^n (Y_t − X_t'β)².

The first-order conditions for a minimum are

    ∂SSR(β)/∂β = −2X'(Y − Xβ) = −2 Σ_{t=1}^n X_t(Y_t − X_t'β) = 0.

If X'X = Σ_{t=1}^n X_t X_t' is nonsingular, this system of k equations in k unknowns can be uniquely solved for the ordinary least squares (OLS) estimator

    β̂_n = (X'X)^{-1} X'Y = (Σ_{t=1}^n X_t X_t')^{-1} Σ_{t=1}^n X_t Y_t.
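The sum form of the estimator lends itself to a quick numerical check. The following sketch (not from the text; the data-generating values are our own illustrative choices) computes the OLS estimator for the special case of a single regressor (k = 1), where (X'X)^{-1} X'Y reduces to Σ_t X_t Y_t / Σ_t X_t².

```python
import random

random.seed(0)
beta_o = 2.0                    # "true" coefficient (illustrative choice)
n = 1_000
X = [random.gauss(0.0, 1.0) for _ in range(n)]
Y = [beta_o * x + random.gauss(0.0, 0.5) for x in X]

sxx = sum(x * x for x in X)                 # X'X (a scalar when k = 1)
sxy = sum(x * y for x, y in zip(X, Y))      # X'Y
beta_hat = sxy / sxx                        # (X'X)^{-1} X'Y
print(abs(beta_hat - beta_o))               # estimation error; small here
```

With n = 1000 draws the estimate lands close to β_o, anticipating the large-sample results developed below.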
Our interest centers on the behavior of estimators such as β̂_n as n grows larger and larger. We seek conditions that will allow us to draw conclusions about the behavior of β̂_n; for example, that β̂_n has a particular distribution or certain first and second moments. The assumptions of the classical linear model allow us to draw such conclusions for any n. These conditions and results can be formally stated as the following theorem.

Theorem 1.1 The following are the assumptions of the classical linear model.

(i) The data are generated as Y_t = X_t'β_o + ε_t, t = 1, ..., n;

(ii) X is a nonstochastic and finite n × k matrix, n > k;

(iii) X'X is nonsingular;

(iv) E(ε) = 0;

(v) ε ~ N(0, σ_o² I).
κ ≡ max[λ, μ].

(ii) If a_n = o(n^λ) and b_n = o(n^μ), then a_n b_n = o(n^{λ+μ}) and a_n + b_n = o(n^κ).

(iii) If a_n = O(n^λ) and b_n = o(n^μ), then a_n b_n = o(n^{λ+μ}) and a_n + b_n = O(n^κ).

Proof. (i) Since a_n = O(n^λ) and b_n = O(n^μ), there exist a finite Δ > 0 and N ∈ ℕ such that, for all n > N, |n^{-λ} a_n| < Δ and |n^{-μ} b_n| < Δ. Consider a_n b_n. Now |n^{-λ-μ} a_n b_n| = |n^{-λ} a_n n^{-μ} b_n| = |n^{-λ} a_n| · |n^{-μ} b_n| < Δ² for all n > N. Hence a_n b_n = O(n^{λ+μ}). Consider a_n + b_n. Now |n^{-κ}(a_n + b_n)| = |n^{-κ} a_n + n^{-κ} b_n| ≤ |n^{-κ} a_n| + |n^{-κ} b_n| by the triangle inequality. Since κ ≥ λ and κ ≥ μ, |n^{-κ}(a_n + b_n)| ≤ |n^{-κ} a_n| + |n^{-κ} b_n| ≤ |n^{-λ} a_n| + |n^{-μ} b_n| < 2Δ for all n > N. Hence a_n + b_n = O(n^κ), κ = max[λ, μ].

(ii) The proof is identical to that of (i), replacing Δ with every δ > 0 and N with N(δ).

(iii) Since a_n = O(n^λ), there exist a finite Δ > 0 and N' ∈ ℕ such that for all n > N', |n^{-λ} a_n| < Δ. Given δ > 0, let δ'' = δ/Δ. Then since b_n = o(n^μ) there exists N''(δ'') such that |n^{-μ} b_n| < δ'' for n > N''(δ''). Now |n^{-λ-μ} a_n b_n| = |n^{-λ} a_n n^{-μ} b_n| = |n^{-λ} a_n| · |n^{-μ} b_n| < Δδ'' = δ for n > N = max(N', N''(δ'')). Hence a_n b_n = o(n^{λ+μ}). Since b_n = o(n^μ), it is also O(n^μ). That a_n + b_n = O(n^κ) follows from (i). ∎

A particularly important special case is illustrated by the following exercise.
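The order-of-magnitude algebra above can be checked numerically. A small sketch (our own illustration; the sequences a_n = 3n² + 5n and b_n = n are arbitrary choices, not from the text):

```python
def a(n):
    # a_n = 3n^2 + 5n, so a_n = O(n^2): n^{-2} a_n = 3 + 5/n is bounded
    return 3 * n**2 + 5 * n

def b(n):
    # b_n = n, so b_n = o(n^2): n^{-2} b_n = 1/n -> 0
    return n

# a_n = O(n^2): the scaled sequence stays below a fixed Delta for all n >= 1.
Delta = 9.0
assert all(abs(a(n) / n**2) < Delta for n in range(1, 10_000))

# b_n = o(n^2): the scaled sequence eventually falls below any delta > 0.
for delta in (1.0, 0.1, 0.01):
    assert abs(b(10_000) / 10_000**2) < delta

# Part (iii): a_n * b_n = o(n^{2+2}), so the n^{-4}-scaled product vanishes.
ratio = a(10_000) * b(10_000) / 10_000**4
print(ratio)    # shrinks like 3/n as n grows
```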
Exercise 2.8 Let A_n be a k × k matrix and let b_n be a k × 1 vector. If A_n = O(1) and b_n = O(1), verify that A_n b_n = O(1).

For the most part, econometrics is concerned not simply with sequences of real numbers, but rather with sequences of real-valued random scalars or vectors. Very often these are either averages, for example, Z̄_n = Σ_{t=1}^n Z_t/n, or functions of averages, such as Z̄_n², where {Z_t} is, for example, a sequence of random scalars. Since the Z_t's are random variables, we have to allow for a possibility that would not otherwise occur, that is, that different realizations of the sequence {Z_t} can lead to different limits for Z̄_n. Convergence to a particular value must now be considered as a random event, and our interest centers on cases in which nonconvergence occurs only rarely in some appropriately defined sense.
2.2 Almost Sure Convergence
The stochastic convergence concept most closely related to the limit notions previously discussed is that of almost sure convergence. Sequences that converge almost surely can be manipulated in almost exactly the same ways as nonrandom sequences.

Random variables are best viewed as functions from an underlying space Ω to the real line. Thus, when discussing a real-valued random variable b_n, we are in fact talking about a mapping b_n : Ω → ℝ. We let ω be a typical element of Ω and call the real number b_n(ω) a realization of the random variable. Subsets of Ω, for example {ω ∈ Ω : b_n(ω) < a}, are events, and we will assign a probability to these, e.g., P{ω ∈ Ω : b_n(ω) < a}. We write P[b_n < a] as a shorthand notation. There are additional details that we will consider more carefully in subsequent chapters, but this understanding will suffice for now. Interest will often center on averages such as

    b_n(·) = n^{-1} Σ_{t=1}^n Z_t(·).

We write the parentheses with dummy argument (·) to emphasize that b_n and Z_t are functions.

Definition 2.9 Let {b_n(·)} be a sequence of real-valued random variables. We say that b_n(·) converges almost surely to b, written b_n(·) → b a.s., if there exists a real number b such that P{ω : b_n(ω) → b} = 1.
The probability measure P determines the joint distribution of the entire sequence {Z_t}. A sequence b_n converges almost surely if the probability of obtaining a realization of the sequence {Z_t} for which convergence to b occurs is unity. Equivalently, the probability of observing a realization of {Z_t} for which convergence to b does not occur is zero. Failure to converge is possible but will almost never happen under this definition. Obviously, then, nonstochastic convergence implies almost sure convergence. Because the set of ω's for which b_n(ω) → b has probability one, b_n is sometimes said to converge to b with probability 1 (w.p.1). Other common terminology is that b_n converges almost everywhere (a.e.) or that b_n is strongly consistent for b. When no ambiguity is possible, we drop the notation (·) and simply write b_n → b a.s. instead of b_n(·) → b a.s.

Example 2.10 Let Z̄_n ≡ n^{-1} Σ_{t=1}^n Z_t, where {Z_t} is a sequence of independent identically distributed (i.i.d.) random variables with μ ≡ E(Z_t) < ∞. Then Z̄_n → μ a.s., by the Kolmogorov strong law of large numbers (Theorem 3.1).

The almost sure convergence of the sample mean illustrated by this example occurs under a wide variety of conditions on the sequence {Z_t}. A discussion of these conditions is the subject of the next chapter. As with nonstochastic limits, the almost sure convergence concept extends immediately to vectors and matrices of finite dimension. Almost sure convergence element by element suffices for almost sure convergence of vectors and matrices. The behavior of continuous functions of almost surely convergent sequences is analogous to the nonstochastic case.
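Example 2.10 can be visualized with a small simulation (our own illustration; the uniform(0, 1) distribution and the sample sizes are arbitrary choices): along a single realization of the sequence, the sample mean settles down to μ as n grows.

```python
import random

random.seed(1)
mu = 0.5                                    # E(Z_t) for uniform(0, 1) draws
Z = [random.random() for _ in range(200_000)]

# Track the sample mean Z_bar_n at a few checkpoints along one realization.
means = {}
running = 0.0
for t, z in enumerate(Z, start=1):
    running += z
    if t in (100, 10_000, 200_000):
        means[t] = running / t

for n in (100, 10_000, 200_000):
    print(n, abs(means[n] - mu))            # deviation from mu shrinks
```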
Proposition 2.11 Given g : ℝ^k → ℝ^l (k, l ∈ ℕ) and any sequence of random k × 1 vectors {b_n} such that b_n → b a.s., where b is k × 1, if g is continuous at b, then g(b_n) → g(b) a.s.

Proof. Since b_n(ω) → b implies g(b_n(ω)) → g(b),

    {ω : b_n(ω) → b} ⊂ {ω : g(b_n(ω)) → g(b)}.

Hence

    1 = P{ω : b_n(ω) → b} ≤ P{ω : g(b_n(ω)) → g(b)} ≤ 1,

so that g(b_n) → g(b) a.s. ∎

This result is one of the most important in this book, because consistency results for many of our estimators follow by simply applying Proposition 2.11.
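A small simulation sketch of Proposition 2.11 in action (our own illustration; g(x) = √x and b = 4 are arbitrary choices): since the sample mean b_n converges almost surely to b, and g is continuous at b, g(b_n) converges to g(b).

```python
import math
import random

random.seed(2)
b = 4.0
n = 100_000
# b_n: sample mean of i.i.d. N(b, 1) draws, so b_n -> b a.s. (SLLN)
b_n = sum(random.gauss(b, 1.0) for _ in range(n)) / n
g_b_n = math.sqrt(b_n)                      # g continuous at b = 4
print(abs(g_b_n - math.sqrt(b)))            # close to 0 for large n
```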
Theorem 2.12 Suppose

(i) Y_t = X_t'β_o + ε_t, t = 1, 2, ..., β_o ∈ ℝ^k;

(ii) X'ε/n → 0 a.s.;

(iii) X'X/n → M a.s., finite and positive definite.

Then β̂_n exists for all n sufficiently large a.s., and β̂_n → β_o a.s.

Proof. Since X'X/n → M a.s., it follows from Proposition 2.11 that det(X'X/n) → det(M) a.s. Because M is positive definite by (iii), det(M) > 0. It follows that for all n sufficiently large det(X'X/n) > 0 a.s., so (X'X/n)^{-1} exists for all n sufficiently large a.s. Hence β̂_n ≡ (X'X/n)^{-1} X'Y/n exists for all n sufficiently large a.s. Now β̂_n = β_o + (X'X/n)^{-1} X'ε/n by (i). It follows from Proposition 2.11 that β̂_n → β_o + M^{-1} · 0 = β_o a.s., given (ii) and (iii). ∎

In the proof, we refer to events that occur a.s. Any event that occurs with probability one is said to occur almost surely (a.s.) (e.g., convergence to a limit or existence of the inverse). Theorem 2.12 is a fundamental consistency result for least squares estimation in many commonly encountered situations. Whether this result applies in a given situation depends on the nature of the data. For example, if our observations are randomly drawn from a population, as in a pure cross section, they may be taken to be i.i.d. The conditions of Theorem 2.12 hold for i.i.d. observations provided E(X_t X_t') = M, finite and positive definite, and E(X_t ε_t) = 0, since Kolmogorov's strong law of large numbers (Example 2.10) ensures that X'X/n = n^{-1} Σ_{t=1}^n X_t X_t' → M a.s. and X'ε/n = n^{-1} Σ_{t=1}^n X_t ε_t → 0 a.s. If the observations are dependent (as in a time series), different laws of large numbers must be applied to guarantee that the appropriate conditions hold. These are given in the next chapter. A result for the IV estimator can be proven analogously.

Exercise 2.13 Prove the following result. Suppose

(i) Y_t = X_t'β_o + ε_t, t = 1, 2, ..., β_o ∈ ℝ^k;

(ii) Z'ε/n → 0 a.s.;

(iii) (a) Z'X/n → Q a.s., finite with full column rank; (b) P̂_n → P a.s., finite and positive definite.
Then β̃_n exists for all n sufficiently large a.s., and β̃_n → β_o a.s.
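The role of the instruments can be illustrated with a simulation (not from the text; the data-generating process and all parameter values are our own illustrative choices). With one endogenous regressor and one instrument (l = k = 1), the IV estimator reduces to Σ_t Z_t Y_t / Σ_t Z_t X_t; OLS is inconsistent here because X_t is correlated with ε_t, while the instrument is not.

```python
import random

random.seed(3)
beta_o = 1.0
n = 50_000
Z = [random.gauss(0, 1) for _ in range(n)]         # instrument, unrelated to eps
V = [random.gauss(0, 1) for _ in range(n)]
eps = [0.8 * v + random.gauss(0, 0.5) for v in V]  # eps correlated with V
X = [z + v for z, v in zip(Z, V)]                  # X endogenous, related to Z
Y = [beta_o * x + e for x, e in zip(X, eps)]

beta_ols = sum(x * y for x, y in zip(X, Y)) / sum(x * x for x in X)
beta_iv = sum(z * y for z, y in zip(Z, Y)) / sum(z * x for z, x in zip(Z, X))
print(beta_ols, beta_iv)    # OLS drifts away from beta_o; IV stays close
```

Replacing Z with X in the last estimator recovers the OLS formula, which is exactly why the endogeneity bias appears in the first number printed.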
This consistency result for the IV estimator precisely specifies the conditions that must be satisfied for a sequence of random vectors {Z_t} to act as a set of instrumental variables. They must be unrelated to the errors, as specified by assumption (ii), and they must be closely enough related to the explanatory variables that Z'X/n converges to a matrix with full column rank, as required by assumption (iii.a). Note that a necessary condition for this is that the order condition for identification holds (see Fisher, 1966, Chapter 2); that is, that l ≥ k. (Recall that Z is n × l and X is n × k.) For now, we simply treat the instrumental variables as given. In Chapter 4 we see how the instrumental variables may be chosen optimally. A potentially restrictive aspect of the consistency results just given for the least squares and IV estimators is that the matrices X'X/n, Z'X/n, and P̂_n are each required to converge to a fixed limiting value. When the observations are not identically distributed (as in a stratified cross section, a panel, or certain time-series cases), these matrices need not converge, and the results of Theorem 2.12 and Exercise 2.13 do not necessarily apply. Nevertheless, it is possible to obtain more general versions of these results that do not require the convergence of X'X/n, Z'X/n, or P̂_n by generalizing Proposition 2.11. To do this we make use of the notion of uniform continuity.
Definition 2.14 Given g : ℝ^k → ℝ^l (k, l ∈ ℕ), we say that g is uniformly continuous on a set B ⊂ ℝ^k if for each ε > 0 there is a δ(ε) > 0 such that if a and b belong to B and |a_i − b_i| < δ(ε), i = 1, ..., k, then |g_j(a) − g_j(b)| < ε, j = 1, ..., l.

Note that uniform continuity implies continuity on B but that continuity on B does not imply uniform continuity. The essential aspect of uniform continuity that distinguishes it from continuity is that δ depends only on ε and not on b. However, when B is compact, continuity does imply uniform continuity, as formally stated in the next result.
Theorem 2.15 (Uniform continuity theorem) Suppose g : ℝ^k → ℝ^l is a continuous function on C ⊂ ℝ^k. If C is compact, then g is uniformly continuous on C.

Proof. See Bartle (1976, p. 160). ∎

Now we extend Proposition 2.11 to cover situations where a random sequence {b_n} does not necessarily converge to a fixed point but instead "follows" a nonrandom sequence {c_n}, in the sense that b_n − c_n → 0 a.s., where the sequence of real numbers {c_n} does not necessarily converge.
Proposition 2.16 Let g : ℝ^k → ℝ^l be continuous on a compact set C ⊂ ℝ^k. Suppose that {b_n} is a sequence of random k × 1 vectors and {c_n} is a sequence of k × 1 vectors such that b_n(·) − c_n → 0 a.s. and there exists η > 0 such that for all n sufficiently large {c : |c_i − c_{ni}| < η, i = 1, ..., k} ⊂ C, i.e., for all n sufficiently large, c_n is interior to C uniformly in n. Then g(b_n(·)) − g(c_n) → 0 a.s.

Proof. Let g_j be the jth element of g. Since C is compact, g_j is uniformly continuous on C by Theorem 2.15. Let F = {ω : b_n(ω) − c_n → 0}; then P(F) = 1 since b_n − c_n → 0 a.s. Choose ω ∈ F. Since c_n is interior to C for all n sufficiently large uniformly in n and b_n(ω) − c_n → 0, b_n(ω) is also interior to C for all n sufficiently large. By uniform continuity, for any ε > 0 there exists δ(ε) > 0 such that if |b_{ni}(ω) − c_{ni}| < δ(ε), i = 1, ..., k, then |g_j(b_n(ω)) − g_j(c_n)| < ε. Hence g(b_n(ω)) − g(c_n) → 0. Since this is true for any ω ∈ F and P(F) = 1, then g(b_n) − g(c_n) → 0 a.s. ∎

To state the results for the OLS and IV estimators below concisely, we define the following concepts, as given by White (1982, pp. 484–485).
Definition 2.17 A sequence of k × k matrices {A_n} is said to be uniformly nonsingular if for some δ > 0 and all n sufficiently large |det(A_n)| > δ. If {A_n} is a sequence of positive semidefinite matrices, then {A_n} is uniformly positive definite if {A_n} is uniformly nonsingular. If {A_n} is a sequence of l × k matrices, then {A_n} has uniformly full column rank if there exists a sequence of k × k submatrices {A_n*} which is uniformly nonsingular.

If a sequence of matrices is uniformly nonsingular, the elements of the sequence are prevented from getting "too close" to singularity. Similarly, if a sequence of matrices has uniformly full column rank, the elements of the sequence are prevented from getting "too close" to a matrix with less than full column rank. Next we state the desired extensions of Theorem 2.12 and Exercise 2.13.
Theorem 2.18 Suppose

(i) Y_t = X_t'β_o + ε_t, t = 1, 2, ..., β_o ∈ ℝ^k;

(ii) X'ε/n → 0 a.s.;

(iii) X'X/n − M_n → 0 a.s., where M_n = O(1) and is uniformly positive definite.

Then β̂_n exists for all n sufficiently large a.s., and β̂_n → β_o a.s.
Proof. Because M_n = O(1), it is bounded for all n sufficiently large, and it follows from Proposition 2.16 that det(X'X/n) − det(M_n) → 0 a.s. Since det(M_n) > δ > 0 for all n sufficiently large by Definition 2.17, it follows that det(X'X/n) > δ/2 > 0 for all n sufficiently large a.s., so that (X'X/n)^{-1} exists for all n sufficiently large a.s. Hence β̂_n ≡ (X'X/n)^{-1} X'Y/n exists for all n sufficiently large a.s. Now β̂_n = β_o + (X'X/n)^{-1} X'ε/n by (i). It follows from Proposition 2.16 that β̂_n − (β_o + M_n^{-1} · 0) → 0 a.s., or β̂_n → β_o a.s., given (ii) and (iii). ∎

Compared with Theorem 2.12, the present result relaxes the requirement that X'X/n → M a.s. and instead requires that X'X/n − M_n → 0 a.s., allowing for the possibility that X'X/n may not converge to a fixed limit. Note that the requirement det(M_n) > δ > 0 ensures the uniform continuity of the matrix inverse function. The proof of the IV result requires a demonstration that {Q_n' P_n Q_n} is uniformly positive definite under appropriate conditions. These conditions are provided by the following result.

Lemma 2.19 If {A_n} is an O(1) sequence of l × k matrices with uniformly full column rank and {B_n} is an O(1) sequence of uniformly positive definite l × l matrices, then {A_n' B_n A_n} and {A_n' B_n^{-1} A_n} are O(1) sequences of uniformly positive definite k × k matrices.
Proof. See White (1982, Lemma A.3). ∎

Exercise 2.20 Prove the following result. Suppose

(i) Y_t = X_t'β_o + ε_t, t = 1, 2, ..., β_o ∈ ℝ^k;

(ii) Z'ε/n → 0 a.s.;

(iii) (a) Z'X/n − Q_n → 0 a.s., where Q_n = O(1) and has uniformly full column rank; (b) P̂_n − P_n → 0 a.s., where P_n = O(1) and is symmetric and uniformly positive definite.

Then β̃_n exists for all n sufficiently large a.s., and β̃_n → β_o a.s.

The notion of orders of magnitude extends to almost surely convergent sequences in a straightforward way.
Definition 2.21 (i) The random sequence {b_n} is at most of order n^λ almost surely, denoted b_n = O_a.s.(n^λ), if there exist Δ < ∞ and N < ∞ such that P[|n^{-λ} b_n| < Δ for all n > N] = 1. (ii) The sequence {b_n} is of order smaller than n^λ almost surely, denoted b_n = o_a.s.(n^λ), if n^{-λ} b_n → 0 a.s.
A sufficient condition that b_n = O_a.s.(n^λ) is that n^{-λ} b_n − a_n → 0 a.s., where a_n = O(1). The algebra of O_a.s. and o_a.s. is analogous to that for O and o.

Exercise 2.22 Prove the following. Let a_n and b_n be random scalars.

(i) If a_n = O_a.s.(n^λ) and b_n = O_a.s.(n^μ), then a_n b_n = O_a.s.(n^{λ+μ}) and {a_n + b_n} is O_a.s.(n^κ), κ = max[λ, μ].

(ii) If a_n = o_a.s.(n^λ) and b_n = o_a.s.(n^μ), then a_n b_n = o_a.s.(n^{λ+μ}) and a_n + b_n = o_a.s.(n^κ).

(iii) If a_n = O_a.s.(n^λ) and b_n = o_a.s.(n^μ), then a_n b_n = o_a.s.(n^{λ+μ}) and {a_n + b_n} is O_a.s.(n^κ).
2.3 Convergence in Probability
A weaker stochastic convergence concept is that of convergence in probability.

Definition 2.23 Let {b_n} be a sequence of real-valued random variables. If there exists a real number b such that for every ε > 0, P(ω : |b_n(ω) − b| < ε) → 1 as n → ∞, then b_n converges in probability to b, written b_n →p b.

With almost sure convergence, the probability measure P takes into account the joint distribution of the entire sequence {Z_t}, but with convergence in probability, we only need concern ourselves sequentially with the joint distribution of the elements of {Z_t} that actually appear in b_n, typically the first n. When a sequence converges in probability, it becomes less and less likely that an element of the sequence lies beyond any specified distance ε from b as n increases. The constant b is called the probability limit of b_n. A common notation is plim b_n = b. Convergence in probability is also referred to as weak consistency, and since this has been the most familiar stochastic convergence concept in econometrics, the word "weak" is often simply dropped. The relationship between convergence in probability and almost sure convergence is specified by the following result.

Theorem 2.24 Let {b_n} be a sequence of random variables. If b_n → b a.s., then b_n →p b. If b_n →p b, then there exists a subsequence {b_{n_j}} such that b_{n_j} → b a.s.

Proof. See Lukacs (1975, p. 480). ∎

Thus, almost sure convergence implies convergence in probability, but the converse does not hold. Nevertheless, a sequence that converges in probability always contains a subsequence that converges almost surely. Essentially,
convergence in probability allows more erratic behavior in the converging sequence than almost sure convergence, and by simply disregarding the erratic elements of the sequence we can obtain an almost surely convergent subsequence. For an example of a sequence that converges in probability but not almost surely, see Lukacs (1975, pp. 34-35).

Example 2.25 Let Z̄_n ≡ n^(−1) Σ_{t=1}^n Z_t, where {Z_t} is a sequence of random variables such that E(Z_t) = μ, var(Z_t) = σ² < ∞ for all t and cov(Z_t, Z_τ) = 0 for t ≠ τ. Then Z̄_n →p μ by the Chebyshev weak law of large numbers (see Rao, 1973, p. 112).

Note that, in contrast to Example 2.10, the random variables here are not assumed either to be independent (simply uncorrelated) or identically distributed (except for having identical mean and variance). However, second moments are restricted by the present result, whereas they are completely unrestricted in Example 2.10. Note also that, under the conditions of Example 2.10, convergence in probability follows immediately from the almost sure convergence. In general, most weak consistency results have strong consistency analogs that hold under identical or closely related conditions. For example, strong consistency also obtains under the conditions of Example 2.25. These analogs typically require somewhat more sophisticated techniques for their proof.

Vectors and matrices are said to converge in probability provided each element converges in probability. To show that continuous functions of weakly consistent sequences converge to the functions evaluated at the probability limit, we use the following result.

Proposition 2.26 (The implication rule) Consider events E and F_i, i = 1, …, k, such that (∩_{i=1}^k F_i) ⊂ E. Then Σ_{i=1}^k P(F_i^c) ≥ P(E^c).
Proof. See Lukacs (1975, p. 7). ∎

Proposition 2.27 Given g : ℝ^k → ℝ^l and any sequence {b_n} of k × 1 random vectors such that b_n →p b, where b is a k × 1 vector, if g is continuous at b, then g(b_n) →p g(b).

Proof. Let g_j be an element of g. For every ε > 0, the continuity of g implies that there exists δ(ε) > 0 such that if |b_ni(ω) − b_i| < δ(ε), i = 1, …, k, then |g_j(b_n(ω)) − g_j(b)| < ε. Define the events F_ni ≡ {ω : |b_ni(ω) − b_i| < δ(ε)} and E_n ≡ {ω : |g_j(b_n(ω)) − g_j(b)| < ε}. Then (∩_{i=1}^k F_ni) ⊂ E_n. By the implication rule, Σ_{i=1}^k P(F_ni^c) ≥ P(E_n^c). Since
b_n →p b, for arbitrary η > 0 and all n sufficiently large, P(F_ni^c) < η. Hence P(E_n^c) < kη, or P(E_n) ≥ 1 − kη. Since P(E_n) ≤ 1 and η is arbitrary, P(E_n) → 1 as n → ∞; hence g_j(b_n) →p g_j(b). Since this holds for all j = 1, …, l, g(b_n) →p g(b). ∎

This result allows us to establish direct analogs of Theorem 2.12 and Exercise 2.13.
Theorem 2.28 Suppose

(i) Y_t = X_t′β_0 + ε_t, t = 1, 2, …, β_0 ∈ ℝ^k;
(ii) X′ε/n →p 0;
(iii) X′X/n →p M, finite and positive definite.

Then β̂_n exists in probability, and β̂_n →p β_0.
Proof. The proof is identical to that of Theorem 2.12 except that Proposition 2.27 is used instead of Proposition 2.11 and convergence in probability replaces convergence almost surely. ∎

The statement that β̂_n "exists in probability" is understood to mean that there exists a subsequence {β̂_nj} such that β̂_nj exists for all n_j sufficiently large a.s., by Theorem 2.24. In other words, X′X/n can converge to M in such a way that X′X/n does not have an inverse for each n, so that β̂_n may fail to exist for particular values of n. However, a subsequence of {X′X/n} converges almost surely, and for that subsequence, β̂_nj will exist for all n_j sufficiently large, almost surely.

Exercise 2.29 Prove the following result. Suppose

(i) Y_t = X_t′β_0 + ε_t, t = 1, 2, …, β_0 ∈ ℝ^k;
(ii) Z′ε/n →p 0;
(iii) (a) Z′X/n →p Q, finite with full column rank;
(b) P̂_n →p P, finite, symmetric, and positive definite.

Then β̃_n exists in probability, and β̃_n →p β_0.

Whether or not these results apply in particular situations depends on the nature of the data. As we mentioned before, for certain kinds of data it is restrictive to assume that X′X/n, Z′X/n, and P̂_n converge to constant limits. We can relax this restriction by using an analog of Proposition 2.16. This result is also used heavily in later chapters.
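The content of Theorem 2.28 can be seen in a simulation sketch (the scalar model, distributions, and sample sizes below are illustrative choices, not part of the formal development). With X_t and ε_t drawn i.i.d. N(0, 1), X′ε/n →p 0 and X′X/n →p 1, so the OLS estimate collapses on β_0:

```python
import random

random.seed(0)

def ols_slope(n, beta0=2.0):
    """Draw n observations of Y_t = beta0 * X_t + eps_t with X_t, eps_t
    independent N(0,1), and return the OLS slope (X'Y)/(X'X)."""
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    es = [random.gauss(0.0, 1.0) for _ in range(n)]
    ys = [beta0 * x + e for x, e in zip(xs, es)]
    sxx = sum(x * x for x in xs)               # X'X
    sxy = sum(x * y for x, y in zip(xs, ys))   # X'Y
    return sxy / sxx

# The estimate tightens around beta0 = 2 as n grows.
beta_small = ols_slope(50)
beta_large = ols_slope(200_000)
```

Repeating the draw at increasing n shows the probability mass of β̂_n concentrating around β_0, which is exactly the statement β̂_n →p β_0.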
Proposition 2.30 Let g : ℝ^k → ℝ^l be continuous on a compact set C ⊂ ℝ^k. Suppose that {b_n} is a sequence of random k × 1 vectors and {c_n} is a sequence of k × 1 vectors such that b_n − c_n →p 0, and for all n sufficiently large, c_n is interior to C, uniformly in n. Then g(b_n) − g(c_n) →p 0.

Proof. Let g_j be an element of g. Since C is compact, g_j is uniformly continuous by Theorem 2.15, so that for every ε > 0 there exists δ(ε) > 0 such that if |b_ni − c_ni| < δ(ε), i = 1, …, k, then |g_j(b_n) − g_j(c_n)| < ε. Define the events F_ni ≡ {ω : |b_ni(ω) − c_ni| < δ(ε)} and E_n ≡ {ω : |g_j(b_n(ω)) − g_j(c_n)| < ε}. Then (∩_{i=1}^k F_ni) ⊂ E_n. By the implication rule, Σ_{i=1}^k P(F_ni^c) ≥ P(E_n^c). Since b_n − c_n →p 0, for arbitrary η > 0 and all n sufficiently large, P(F_ni^c) < η. Hence P(E_n^c) < kη, or P(E_n) ≥ 1 − kη. Since P(E_n) ≤ 1 and η is arbitrary, P(E_n) → 1 as n → ∞; hence g_j(b_n) − g_j(c_n) →p 0. As this holds for all j = 1, …, l, g(b_n) − g(c_n) →p 0. ∎
Theorem 2.31 Suppose

(i) Y_t = X_t′β_0 + ε_t, t = 1, 2, …, β_0 ∈ ℝ^k;
(ii) X′ε/n →p 0;
(iii) X′X/n − M_n →p 0, where M_n = O(1) and is uniformly positive definite.

Then β̂_n exists in probability, and β̂_n →p β_0.
Proof. The proof is identical to that of Theorem 2.18 except that Proposition 2.30 is used instead of Proposition 2.16 and convergence in probability replaces convergence almost surely. ∎

Exercise 2.32 Prove the following result. Suppose

(i) Y_t = X_t′β_0 + ε_t, t = 1, 2, …, β_0 ∈ ℝ^k;
(ii) Z′ε/n →p 0;
(iii) (a) Z′X/n − Q_n →p 0, where Q_n = O(1) and has uniformly full column rank;
(b) P̂_n − P_n →p 0, where P_n = O(1) and is symmetric and uniformly positive definite.

Then β̃_n exists in probability, and β̃_n →p β_0.
As with convergence almost surely, the notion of orders of magnitude extends directly to convergence in probability.

Definition 2.33 (i) The sequence {b_n} is at most of order n^λ in probability, denoted b_n = O_p(n^λ), if for every ε > 0 there exist a finite Δ_ε > 0 and N_ε ∈ ℕ such that P{ω : |n^(−λ) b_n(ω)| > Δ_ε} < ε for all n ≥ N_ε. (ii) The sequence {b_n} is of order smaller than n^λ in probability, denoted b_n = o_p(n^λ), if n^(−λ) b_n →p 0.

When b_n = O_p(1), we say {b_n} is bounded in probability, and when b_n = o_p(1) we have b_n →p 0.

Example 2.34 Let b_n = Z_n, where {Z_n} is a sequence of identically distributed N(0, 1) random variables. Then P(ω : |b_n(ω)| > Δ) = P(|Z_n| > Δ) = 2Φ(−Δ) for all n ≥ 1, where Φ is the standard normal cumulative distribution function (c.d.f.). By making Δ sufficiently large, we have 2Φ(−Δ) < δ for arbitrary δ > 0. Hence, b_n = Z_n = O_p(1). Note that in this example Φ can be replaced by any c.d.f. F and the result still holds, i.e., any random variable Z with c.d.f. F is O_p(1).

Exercise 2.35 Prove the following. Let a_n and b_n be random scalars.

(i) If a_n = O_p(n^λ) and b_n = O_p(n^μ), then a_n b_n = O_p(n^(λ+μ)) and a_n + b_n = O_p(n^κ), κ = max(λ, μ).
(ii) If a_n = o_p(n^λ) and b_n = o_p(n^μ), then a_n b_n = o_p(n^(λ+μ)) and a_n + b_n = o_p(n^κ).
(iii) If a_n = O_p(n^λ) and b_n = o_p(n^μ), then a_n b_n = o_p(n^(λ+μ)) and a_n + b_n = O_p(n^κ). (Hint: Apply Proposition 2.30.)

One of the most useful results in this chapter is the following corollary to this exercise, which is applied frequently in obtaining the asymptotic normality results of Chapter 4.

Corollary 2.36 (Product rule) Let A_n be l × k and let b_n be k × 1. If A_n = o_p(1) and b_n = O_p(1), then A_n b_n = o_p(1).

Proof. Let a_n ≡ A_n b_n with A_n = [A_nij]. Then a_ni = Σ_{j=1}^k A_nij b_nj. As A_nij = o_p(1) and b_nj = O_p(1), A_nij b_nj = o_p(1) by Exercise 2.35 (iii). Hence, a_ni = o_p(1), since it is the sum of k terms each of which is o_p(1). It follows that a_n = A_n b_n = o_p(1). ∎
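The product rule can be checked by Monte Carlo (the particular sequences below are illustrative choices, not from the text): A_n = Z_1/√n is o_p(1) and b_n = Z_2 is O_p(1) when Z_1, Z_2 are independent N(0, 1), and the tail probability of the product shrinks toward zero as n grows:

```python
import random

random.seed(1)

def tail_prob(n, eps=0.1, reps=4000):
    """Monte Carlo estimate of P(|A_n * b_n| > eps), where A_n = Z1/sqrt(n)
    is o_p(1) and b_n = Z2 has a fixed N(0,1) distribution, so b_n = O_p(1)."""
    hits = 0
    for _ in range(reps):
        a_n = random.gauss(0, 1) / n ** 0.5
        b_n = random.gauss(0, 1)
        if abs(a_n * b_n) > eps:
            hits += 1
    return hits / reps

p_small = tail_prob(10)      # noticeable tail mass at small n
p_large = tail_prob(10_000)  # essentially no tail mass at large n
```

For any fixed ε > 0 the estimated P(|A_n b_n| > ε) falls toward zero, which is exactly the statement A_n b_n = o_p(1).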
2.4 Convergence in rth Mean
The convergence notions of limits, almost sure limits, and probability limits are those most frequently encountered in econometrics, and most of the
results in the literature are stated in these terms. Another convergence concept often encountered in the context of time series data is that of convergence in the rth mean.

Definition 2.37 Let {b_n} be a sequence of real-valued random variables such that for some r > 0, E|b_n|^r < ∞. If there exists a real number b such that E(|b_n − b|^r) → 0 as n → ∞, then b_n converges in the rth mean to b, written b_n →r.m. b.
The most commonly encountered situation is that in which r = 2, in which case convergence is said to occur in quadratic mean, denoted b_n →q.m. b. Alternatively, b is said to be the limit in mean square of b_n, denoted l.i.m. b_n = b. A useful property of convergence in the rth mean is that it implies convergence in the sth mean for s < r. To prove this, we use Jensen's inequality, which we now state.

Proposition 2.38 (Jensen's inequality) Let g : ℝ → ℝ be a convex function on an interval B ⊂ ℝ and let Z be a random variable such that P(Z ∈ B) = 1. Then g(E(Z)) ≤ E(g(Z)). If g instead is concave on B, then g(E(Z)) ≥ E(g(Z)).

Proof. See Rao (1973, pp. 57-58). ∎

Example 2.39 Let g(z) = |z|. It follows from Jensen's inequality that |E(Z)| ≤ E(|Z|). Let g(z) = z². It follows from Jensen's inequality that (E(Z))² ≤ E(Z²).
Theorem 2.40 If b_n →r.m. b and r > s, then b_n →s.m. b.

Proof. Let g(z) = z^q, 0 < q ≤ 1, z ≥ 0. Then g is concave. Set z = |b_n − b|^r and q = s/r. From Jensen's inequality,

E(|b_n − b|^s) = E({|b_n − b|^r}^q) ≤ {E(|b_n − b|^r)}^q.

Since E(|b_n − b|^r) → 0, it follows that E({|b_n − b|^r}^q) = E(|b_n − b|^s) → 0 and hence b_n →s.m. b. ∎

Convergence in the rth mean is a stronger convergence concept than convergence in probability, and in fact implies convergence in probability. To show this, we use the generalized Chebyshev inequality.

Proposition 2.41 (Generalized Chebyshev inequality) Let Z be a random variable such that E|Z|^r < ∞, r > 0. Then for every ε > 0,

P(ω : |Z(ω)| ≥ ε) ≤ E|Z|^r / ε^r.
Proof. See Lukacs (1975, pp. 8-9). ∎
When r = 1 we have Markov's inequality, and when r = 2 we have the familiar Chebyshev inequality.

Theorem 2.42 If b_n →r.m. b for some r > 0, then b_n →p b.

Proof. Since E(|b_n − b|^r) → 0 as n → ∞, E(|b_n − b|^r) < ∞ for all n sufficiently large. It follows from the generalized Chebyshev inequality that, for every ε > 0,

P(ω : |b_n(ω) − b| ≥ ε) ≤ E(|b_n − b|^r)/ε^r.

Hence P(ω : |b_n(ω) − b| < ε) ≥ 1 − E(|b_n − b|^r)/ε^r → 1 as n → ∞, since b_n →r.m. b. It follows that b_n →p b. ∎

Without further conditions, no necessary relationship holds between convergence in the rth mean and almost sure convergence. For further discussion, see Lukacs (1975, Ch. 2). Since convergence in the rth mean will be used primarily in specifying conditions for later results rather than in stating their conclusions, we provide no analogs to the previous consistency results for the least squares or IV estimators.
References

Bartle, R. G. (1976). The Elements of Real Analysis. Wiley, New York.
Fisher, F. M. (1966). The Identification Problem in Econometrics. McGraw-Hill, New York.
Lukacs, E. (1975). Stochastic Convergence. Academic Press, New York.
Rao, C. R. (1973). Linear Statistical Inference and Its Applications. Wiley, New York.
White, H. (1982). "Instrumental variables regression with independent observations." Econometrica, 50, 483-500.
CHAPTER 3
Laws of Large Numbers
In this chapter we study laws of large numbers, which provide conditions guaranteeing the stochastic convergence (e.g., of Z′X/n and Z′ε/n) required for the consistency results of the previous chapter. Since different conditions will apply to different kinds of economic data (e.g., time series or cross section), we shall pay particular attention to the kinds of data these conditions allow. Only strong consistency results will be stated explicitly, since strong consistency implies convergence in probability (by Theorem 2.24). The laws of large numbers we consider are all of the following form.
Proposition 3.0 Given restrictions on the dependence, heterogeneity, and moments of a sequence of random variables {Z_t}, Z̄_n − μ̄_n →a.s. 0, where Z̄_n ≡ n^(−1) Σ_{t=1}^n Z_t and μ̄_n ≡ E(Z̄_n).
The results that follow specify precisely which restrictions on the dependence, heterogeneity (i.e., the extent to which the distributions of the Z_t may differ across t), and moments are sufficient to allow the conclusion Z̄_n − E(Z̄_n) →a.s. 0 to hold. As we shall see, there are sometimes trade-offs among these restrictions; for example, relaxing dependence or heterogeneity restrictions may require strengthening moment restrictions.
3.1 Independent Identically Distributed Observations
The simplest case is that of independent identically distributed (i.i.d.) random variables.

Theorem 3.1 (Kolmogorov) Let {Z_t} be a sequence of i.i.d. random variables. Then Z̄_n →a.s. μ if and only if E|Z_t| < ∞ and E(Z_t) = μ.

Proof. See Rao (1973, p. 115). ∎

An interesting feature of this result is that the condition given is sufficient as well as necessary for Z̄_n →a.s. μ. Also note that since {Z_t} is i.i.d., E(Z̄_n) = μ. To apply this result to econometric estimators we have to know that the summands of Z′X/n = n^(−1) Σ_{t=1}^n Z_t X_t′ and Z′ε/n = n^(−1) Σ_{t=1}^n Z_t ε_t are i.i.d. This occurs when the elements of {(Z_t′, X_t′, ε_t)} are i.i.d. To prove this, we use the following result.

Proposition 3.2 Let g : ℝ^k → ℝ^l be a continuous¹ function. (i) Let Z_t and Z_τ be identically distributed. Then g(Z_t) and g(Z_τ) are identically distributed. (ii) Let Z_t and Z_τ be independent. Then g(Z_t) and g(Z_τ) are independent.

Proof. (i) Let Y_t = g(Z_t), Y_τ = g(Z_τ). Let A = [z : g(z) ≤ a]. Then

F_t(a) ≡ P[Y_t ≤ a] = P[Z_t ∈ A] = P[Z_τ ∈ A] = P[Y_τ ≤ a] ≡ F_τ(a)

for all a ∈ ℝ^l. Hence g(Z_t) and g(Z_τ) are identically distributed. (ii) Let A_1 = [z : g(z) ≤ a_1], A_2 = [z : g(z) ≤ a_2]. Then

F_tτ(a_1, a_2) ≡ P[Y_t ≤ a_1, Y_τ ≤ a_2] = P[Z_t ∈ A_1, Z_τ ∈ A_2] = P[Z_t ∈ A_1]P[Z_τ ∈ A_2] = P[Y_t ≤ a_1]P[Y_τ ≤ a_2] = F_t(a_1)F_τ(a_2)

for all a_1, a_2 ∈ ℝ^l. Hence g(Z_t) and g(Z_τ) are independent. ∎

Proposition 3.3 If {(Z_t′, X_t′, ε_t)} is an i.i.d. random sequence, then {X_t X_t′}, {X_t ε_t}, {Z_t X_t′}, {Z_t ε_t}, and {Z_t Z_t′} are also i.i.d. sequences.

Proof. Immediate from Proposition 3.2 (i) and (ii). ∎

To write the moment conditions on the explanatory variables in compact form, we make use of the Cauchy-Schwarz inequality, which follows as a corollary to the following result.

¹ This result also holds for measurable functions, defined in Definition 3.21.
Proposition 3.4 (Hölder's inequality) If p > 1 and 1/p + 1/q = 1, and if E|Y|^p < ∞ and E|Z|^q < ∞, then E|YZ| ≤ [E|Y|^p]^(1/p) [E|Z|^q]^(1/q).

Proof. See Lukacs (1975, p. 11). ∎

If p = q = 2, we have the Cauchy-Schwarz inequality,

E|YZ| ≤ [E|Y|²]^(1/2) [E|Z|²]^(1/2).
The i,jth element of X_t X_t′ is given by Σ_{h=1}^p X_thi X_thj, and it follows from the triangle inequality that

|Σ_{h=1}^p X_thi X_thj| ≤ Σ_{h=1}^p |X_thi X_thj|.

Hence,

E|Σ_{h=1}^p X_thi X_thj| ≤ Σ_{h=1}^p E|X_thi X_thj| ≤ Σ_{h=1}^p (E|X_thi|²)^(1/2) (E|X_thj|²)^(1/2)

by the Cauchy-Schwarz inequality. It follows that the elements of X_t X_t′ will have E|Σ_{h=1}^p X_thi X_thj| < ∞ (as we require to apply Kolmogorov's law of large numbers), provided simply that E|X_thi|² < ∞ for all h and i. Combining Theorems 3.1 and 2.12, we have the following OLS consistency result for i.i.d. observations.
Theorem 3.5 Suppose

(i) Y_t = X_t′β_0 + ε_t, t = 1, 2, …, β_0 ∈ ℝ^k;
(ii) {(X_t′, ε_t)} is an i.i.d. sequence;
(iii) (a) E(X_t ε_t) = 0;
(b) E|X_thi ε_th| < ∞, all h = 1, …, p, i = 1, …, k;
(iv) (a) E|X_thi|² < ∞, all h = 1, …, p, i = 1, …, k;
(b) M ≡ E(X_t X_t′) is positive definite.

Then β̂_n exists for all n sufficiently large a.s., and β̂_n →a.s. β_0.

Proposition 3.8 (c_r inequality) Let Y and Z be random variables with E|Y|^r < ∞ and E|Z|^r < ∞, r > 0. Then E|Y + Z|^r ≤ c_r(E|Y|^r + E|Z|^r), where c_r = 1 if r ≤ 1 and c_r = 2^(r−1) if r ≥ 1.

Proof. See Lukacs (1975, p. 13). ∎

Corollary 3.9 Let {Z_t} be a sequence of independent random variables such that E|Z_t|^(1+δ) ≤ Δ < ∞ for some δ > 0 and all t. Then Z̄_n − μ̄_n →a.s. 0.

Proof. By Proposition 3.8,
E|Z_t − μ_t|^(1+δ) ≤ c_(1+δ)(E|Z_t|^(1+δ) + |μ_t|^(1+δ)).

By assuming that E|Z_t|^(1+δ) ≤ Δ and using Jensen's inequality,

|μ_t|^(1+δ) = |E(Z_t)|^(1+δ) ≤ E|Z_t|^(1+δ) ≤ Δ.

It follows that for all t,

E|Z_t − μ_t|^(1+δ) ≤ 2^δ(Δ + Δ) = 2^(1+δ) Δ.

Hence, for all t, E|Z_t − μ_t|^(1+δ) is bounded uniformly in t, and applying the condition of Theorem 3.7, we have

Σ_{t=1}^∞ E|Z_t − μ_t|^(1+δ)/t^(1+δ) ≤ 2^(1+δ) Δ Σ_{t=1}^∞ t^(−(1+δ)) < ∞,

so Z̄_n − μ̄_n →a.s. 0. ∎

Proposition 3.11 (Minkowski's inequality) If q ≥ 1 and E|Y|^q < ∞ and E|Z|^q < ∞, then (E|Y + Z|^q)^(1/q) ≤ (E|Y|^q)^(1/q) + (E|Z|^q)^(1/q).

Corollary 3.12 Let E|X²_thi|^(1+δ) ≤ Δ < ∞ for some δ > 0, all h = 1, …, p, i = 1, …, k, and all t. Then each element of X_t X_t′ satisfies E|Σ_{h=1}^p X_thi X_thj|^(1+δ) ≤ Δ′ < ∞ for some δ > 0, all i, j = 1, …, k, and all t, where Δ′ ≡ p^(1+δ) Δ.
Proof. By Minkowski's inequality,

(E|Σ_{h=1}^p X_thi X_thj|^(1+δ))^(1/(1+δ)) ≤ Σ_{h=1}^p (E|X_thi X_thj|^(1+δ))^(1/(1+δ)).

By the Cauchy-Schwarz inequality,

E|X_thi X_thj|^(1+δ) ≤ [E|X²_thi|^(1+δ)]^(1/2) [E|X²_thj|^(1+δ)]^(1/2).

Since E|X²_thi|^(1+δ) ≤ Δ < ∞, h = 1, …, p, i = 1, …, k, it follows that for all h = 1, …, p and i, j = 1, …, k,

E|X_thi X_thj|^(1+δ) ≤ Δ,

so that E|Σ_{h=1}^p X_thi X_thj|^(1+δ) ≤ p^(1+δ) Δ ≡ Δ′. ∎
Theorem 3.14 Suppose

(i) Y_t = X_t′β_0 + ε_t, t = 1, 2, …, β_0 ∈ ℝ^k;
(ii) {(X_t′, ε_t)} is an independent sequence;
(iii) (a) E(X_t ε_t) = 0, t = 1, 2, …;
(b) E|X_thi ε_th|^(1+δ) ≤ Δ < ∞ for some δ > 0, all h = 1, …, p, i = 1, …, k, and all t;
(iv) (a) E|X²_thi|^(1+δ) ≤ Δ < ∞ for some δ > 0, all h = 1, …, p, i = 1, …, k, and all t;
(b) M_n ≡ E(X′X/n) is uniformly positive definite.

Then β̂_n exists for all n sufficiently large a.s., and β̂_n →a.s. β_0.
Compared with Theorem 3.5, we have relaxed the identical distribution assumption at the expense of imposing slightly greater moment restrictions in (iii.b) and (iv.a). Also note that (iv.a) implies that M_n = O(1). (Why?) The extra generality we have gained now allows treatment of situations with fixed regressors, or observations from a stratified cross section, and also applies to models with (unconditionally) heteroskedastic errors. None of these cases is covered by Theorem 3.5. The result for the IV estimator is analogous.
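A simulation sketch of the situation covered by Corollary 3.9 (the specific means and variances below are illustrative choices, not from the text): the Z_t are independent but not identically distributed, with moments bounded uniformly in t, and Z̄_n − μ̄_n still vanishes:

```python
import math
import random

random.seed(4)

def mean_gap(n):
    """Independent, heterogeneous draws: Z_t ~ N(mu_t, sigma_t^2) with
    mu_t = (-1)^t and sigma_t = 1 + 0.5*sin(t), so the Z_t are not
    identically distributed but have moments bounded uniformly in t.
    Returns |Zbar_n - mubar_n|."""
    total, mu_total = 0.0, 0.0
    for t in range(1, n + 1):
        mu_t = (-1.0) ** t
        sigma_t = 1.0 + 0.5 * math.sin(t)
        total += random.gauss(mu_t, sigma_t)
        mu_total += mu_t
    return abs(total / n - mu_total / n)

gap_small = mean_gap(100)
gap_large = mean_gap(200_000)
```

Note that Z̄_n itself has no fixed limit here (μ̄_n oscillates), which is why the law of large numbers is stated for the difference Z̄_n − μ̄_n rather than for Z̄_n alone.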
Theorem 3.15 Suppose

(i) Y_t = X_t′β_0 + ε_t, t = 1, 2, …, β_0 ∈ ℝ^k;
(ii) {(Z_t′, X_t′, ε_t)} is an independent sequence;
(iii) (a) E(Z_t ε_t) = 0, t = 1, 2, …;
(b) E|Z_thi ε_th|^(1+δ) ≤ Δ < ∞ for some δ > 0, all h = 1, …, p, i = 1, …, l, and all t;
(iv) (a) E|Z_thi X_thj|^(1+δ) ≤ Δ < ∞ for some δ > 0, all h = 1, …, p, i = 1, …, l, j = 1, …, k, and all t;
(b) Q_n ≡ E(Z′X/n) has uniformly full column rank;
(c) P̂_n − P_n →a.s. 0, where P_n = O(1) and is symmetric and uniformly positive definite.

Then β̃_n exists for all n sufficiently large a.s., and β̃_n →a.s. β_0.

Proof. By Proposition 3.10, {Z_t ε_t} and {Z_t X_t′} are independent sequences with elements satisfying the moment condition of Corollary 3.9, given (iii.b) and (iv.a), by arguments analogous to those of Corollary 3.12 and Exercise 3.13. It follows from Corollary 3.9 that Z′ε/n →a.s. 0 and Z′X/n − Q_n →a.s. 0, where Q_n = O(1) given (iv.a) as a consequence of Jensen's inequality. Hence, the conditions of Exercise 2.20 are satisfied and the results follow. ∎
3.3 Dependent Identically Distributed Observations
The assumption of independence is inappropriate for economic time series, which typically exhibit considerable dependence. To cover these cases, we need laws of large numbers that allow the random variables to be dependent. To speak precisely about the kinds of dependence allowed, we need to make explicit some fundamental notions of probability theory that we have so far used implicitly.
Definition 3.16 A family (collection) F of subsets of a set Ω is a σ-field (σ-algebra) provided

(i) ∅ and Ω belong to F;
(ii) if F belongs to F, then F^c (the complement of F in Ω) belongs to F;
(iii) if {F_i} is a sequence of sets in F, then ∪_{i=1}^∞ F_i belongs to F.

For example, let F = {∅, Ω}. Then F is easily verified to be a σ-field (try it!). Or let F be a subset of Ω and put F = {∅, Ω, F, F^c}. Again F is easily verified to be a σ-field. We consider further examples below. The pair (Ω, F) is called a measurable space when F is a σ-field of subsets of Ω. The sets in a σ-field F are sets for which it is possible to assign well-defined probabilities. Thus, we can think of the sets in F as events. We are now in a position to give a formal definition of the concept of a probability or, more formally, a probability measure.
Definition 3.17 Let (Ω, F) be a measurable space. A mapping P : F → [0, 1] is a probability measure on (Ω, F) provided that:

(i) P(∅) = 0.
(ii) For any F ∈ F, P(F^c) = 1 − P(F).
(iii) For any disjoint sequence {F_i} of sets in F (i.e., F_i ∩ F_j = ∅ for all i ≠ j), P(∪_{i=1}^∞ F_i) = Σ_{i=1}^∞ P(F_i).

When (Ω, F) is a measurable space and P is a probability measure on (Ω, F), we call the triple (Ω, F, P) a probability space. When the underlying measurable space is clear, we just call P a probability measure. Thus, a probability measure P assigns a number between zero and one to every event (F ∈ F) in a way that coincides with our intuitive notion of how probabilities should behave. This powerful way of understanding probabilities is one of Kolmogorov's many important contributions (Kolmogorov 1933).
Now that we have formally defined probability measures, let us return our attention to the various collections of events (σ-fields) that are relevant for econometrics. Recall that an open set is a set containing only interior points, where a point x in the set B is interior provided that all points in a sufficiently small neighborhood of x ({y : |y − x| < ε} for some ε > 0) are also in B. Thus (a, b) is open, while (a, b] is not.
Definition 3.18 The Borel σ-field B is the smallest collection of sets (called the Borel sets) that includes

(i) all open sets of ℝ;
(ii) the complement B^c of any set B in B;
(iii) the union ∪_{i=1}^∞ B_i of any sequence {B_i} of sets in B.
The Borel sets of ℝ just defined are said to be generated by the open sets of ℝ. The same Borel sets would be generated by all the open half-lines of ℝ, all the closed half-lines of ℝ, all the open intervals of ℝ, or all the closed intervals of ℝ. The Borel sets are a "rich" collection of events for which probabilities can be defined. Nevertheless, there do exist subsets of the real line not in B for which probabilities are not defined; constructing such sets is very complicated (see Billingsley, 1979, p. 37). Thus, we can think of the Borel σ-field as consisting of all the events on the real line to which we can assign a probability. Sets not in B will not define events. The Borel σ-field just defined relates to real-valued random variables. A simple extension covers vector-valued random variables.
Definition 3.19 The Borel σ-field B^q, q < ∞, is the smallest collection of sets that includes

(i) all open sets of ℝ^q;
(ii) the complement B^c of any set B in B^q;
(iii) the union ∪_{i=1}^∞ B_i of any sequence {B_i} in B^q.
In our notation, B and B¹ mean the same thing. Generally, we are interested in infinite sequences {(Z_t′, X_t′, ε_t)}. If p = 1, this is a sequence of random 1 × (l + k + 1) vectors, whereas if p > 1, this is a sequence of p × (l + k + 1) matrices. Nevertheless, we can convert these matrices into vectors by simply stacking the columns of a matrix, one on top of the other, to yield a p(l + k + 1) × 1 vector, denoted vec((Z_t′, X_t′, ε_t)). (In what follows, we drop the vec operator and understand that it is implicit in this context.) Generally, then, we are interested in infinite sequences of q-dimensional random vectors, where q = p(l + k + 1). Corresponding to these are the Borel sets of ℝ^q_∞, defined as the Cartesian product of a countable infinity of copies of ℝ^q, ℝ^q_∞ ≡ ℝ^q × ℝ^q × ⋯. In what follows we can think of ω taking its values in Ω = ℝ^q_∞. The events in which we are interested are the Borel sets of ℝ^q_∞, which we define as follows.
Definition 3.20 The Borel sets of ℝ^q_∞, denoted B^q_∞, are the smallest collection of sets that includes

(i) all sets of the form ×_{i=1}^∞ B_i, where each B_i ∈ B^q and B_i = ℝ^q except for finitely many i;
(ii) the complement F^c of any set F in B^q_∞;
(iii) the union ∪_{i=1}^∞ F_i of any sequence {F_i} in B^q_∞.

A set of the form specified by (i) is called a measurable finite-dimensional product cylinder, so B^q_∞ is the Borel σ-field generated by all the measurable finite-dimensional product cylinders. When (ℝ^q_∞, B^q_∞) is the specific measurable space, a probability measure P on (ℝ^q_∞, B^q_∞) will govern the behavior of events involving infinite sequences of finite-dimensional vectors, just as we require. In particular, when q = 1, the elements Z_t(·) of the sequence {Z_t} can be thought of as functions from Ω = ℝ¹_∞ to the real line ℝ that simply pick off the tth coordinate of ω ∈ Ω; with ω = {z_t}, Z_t(ω) = z_t. When q > 1, Z_t(·) maps Ω = ℝ^q_∞ into ℝ^q. The following definition plays a key role in our analysis.
Definition 3.21 A function g on Ω to ℝ is F-measurable if for every real number a the set [ω : g(ω) ≤ a] ∈ F.

Example 3.22 Let (Ω, F) = (ℝ^q_∞, B^q_∞) and set q = 1. Then Z_t(·) as just defined is B¹_∞-measurable because [ω : Z_t(ω) ≤ a] = [z_1, …, z_{t−1}, z_t, z_{t+1}, … : z_1 < ∞, …, z_{t−1} < ∞, z_t ≤ a, z_{t+1} < ∞, …] ∈ B¹_∞ for any a ∈ ℝ.

When a function is F-measurable, it means that we can express the probability of an event, say [Z_t ≤ a], in terms of the probability of an event in F, say [ω : Z_t(ω) ≤ a]. In fact, a random variable is precisely an F-measurable function from Ω to ℝ. In Definition 3.21, when the σ-field is taken to be B^q_∞, the Borel sets of ℝ^q_∞, we shall drop explicit reference to B^q_∞ and simply say that the function g is measurable. Otherwise, the relevant σ-field will be explicitly identified.
Proposition 3.23 Let f and g be F-measurable real-valued functions, and let c be a real number. Then the functions cf, f + g, fg, and |f| are also F-measurable.
Proof. See Bartle (1966, Lemma 2.6). ∎

Example 3.24 If Z_t is measurable, then Z_t/n is measurable, so that Z̄_n = Σ_{t=1}^n Z_t/n is measurable.

A function from Ω to ℝ^q is measurable if and only if each component of the vector-valued function is measurable. The notion of measurability extends to transformations from Ω to Ω in the following way.
Definition 3.25 Let (Ω, F) be a measurable space. A one-to-one transformation² T : Ω → Ω is measurable provided that T^(−1)(F) ⊂ F.

In other words, the transformation T is measurable provided that any set taken by the transformation (or its inverse) into F is itself a set in F. This ensures that sets that are not events cannot be transformed into events, nor can events be transformed into sets that are not events.
Example 3.26 For any ω = (…, z_{t−2}, z_{t−1}, z_t, z_{t+1}, z_{t+2}, …) let ω′ = Tω = (…, z_{t−1}, z_t, z_{t+1}, z_{t+2}, z_{t+3}, …), so that T transforms ω by shifting each of its coordinates back one location. Then T is measurable since T(F) is in F and T^(−1)(F) is in F, for all F ∈ F.

The transformation of this example is often called the shift, or the backshift operator. By using such transformations, it is possible to define a corresponding transformation of a random variable. For example, set Z_1(ω) = Z(ω), where Z is a measurable function from Ω to ℝ; then we can define the random variables Z_2(ω) = Z(Tω), Z_3(ω) = Z(T²ω), and so on, provided that T is a measurable transformation. The random variables constructed in this way are said to be random variables induced by a measurable transformation.
Definition 3.27 Let (Ω, F, P) be a probability space. The transformation T : Ω → Ω is measure preserving if it is measurable and if P(T^(−1)F) = P(F) for all F in F.

The random variables induced by measure-preserving transformations then have the property that P[Z_1 ≤ a] = P[ω : Z(ω) ≤ a] = P[ω : Z(Tω) ≤ a] = P[Z_2 ≤ a]; that is, they are identically distributed. In fact, such random variables have an even stronger property. We use the following definition.

² The transformation T maps an element of Ω, say ω, into another element of Ω, say ω′ = Tω. When T operates on a set F, it should be understood as operating on each element of F. Similarly, when T operates on a family F, it should be understood as operating on each set in the family.
Definition 3.28 Let G_1 be the joint distribution function of the sequence {Z_1, Z_2, …}, where Z_t is a q × 1 vector, and let G_{τ+1} be the joint distribution function of the sequence {Z_{τ+1}, Z_{τ+2}, …}. The sequence {Z_t} is stationary if G_1 = G_{τ+1} for each τ ≥ 1.

In other words, a sequence is stationary if the joint distribution of the variables in the sequence is identical, regardless of the date of the first observation.
Proposition 3.29 Let Z be a random variable (i.e., Z is a measurable function) and let T be a measure-preserving transformation. Let Z_1(ω) = Z(ω), Z_2(ω) = Z(Tω), …, Z_n(ω) = Z(T^(n−1)ω) for each ω in Ω. Then {Z_t} is a stationary sequence.

Proof. See Stout (1974, p. 169). ∎

A converse to this result is also available.

Proposition 3.30 Let {Z_t} be a stationary sequence. Then there exists a measure-preserving transformation T defined on (Ω, F, P) such that Z_1(ω) = Z_1(ω), Z_2(ω) = Z_1(Tω), Z_3(ω) = Z_1(T²ω), …, Z_n(ω) = Z_1(T^(n−1)ω) for all ω in Ω.

Proof. See Stout (1974, p. 170). ∎

Example 3.31 Let {Z_t} be a sequence of i.i.d. N(0, 1) random variables. Then {Z_t} is stationary.

The independence imposed in this example is crucial. If the elements of {Z_t} are simply identically distributed N(0, 1), the sequence is not necessarily stationary, because it is possible to construct different joint distributions that all have normal marginal distributions. By changing the joint distributions with t, we could violate the stationarity condition while preserving marginal normality. Thus stationarity is a strengthening of the identical distribution assumption, since it applies to joint and not simply marginal distributions. On the other hand, stationarity is weaker than the i.i.d. assumption, since i.i.d. sequences are stationary, but stationary sequences do not have to be independent. Does a version of the law of large numbers, Theorem 3.1, hold if the i.i.d. assumption is simply replaced by the stationarity assumption? The answer is no, unless additional restrictions are imposed.
Example 3.32 Let {U_t} be a sequence of i.i.d. random variables uniformly distributed on [0, 1], and let Z be N(0, 1), independent of U_t, t = 1, 2, …. Define Y_t ≡ Z + U_t. Then {Y_t} is stationary (why?), but Ȳ_n = Σ_{t=1}^n Y_t/n does not converge to E(Y_t) = 1/2. Instead, Ȳ_n − Z →a.s. 1/2.

In this example, Ȳ_n converges to a random variable, Z + 1/2, rather than to a constant. The problem is that there is too much dependence in the sequence {Y_t}. No matter how far into the future we take an observation on Y_t, the initial value Y_1 still determines to some extent what Y_t will be, as a result of the common component Z. In fact, the correlation between Y_1 and Y_t is always positive for any value of t. To obtain a law of large numbers, we have to impose a restriction on the dependence or "memory" of the sequence. One such restriction is the concept of ergodicity.
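Example 3.32 is easy to reproduce by simulation (the sample sizes below are arbitrary). Each simulated path of {Y_t} has a sample mean that settles near its own Z + 1/2 rather than near E(Y_t) = 1/2:

```python
import random

random.seed(5)

def path_mean(n):
    """One path of Y_t = Z + U_t: Z ~ N(0,1) is drawn once per path and
    U_t are i.i.d. Uniform[0,1]. Returns (Z, Ybar_n) for that path."""
    z = random.gauss(0, 1)
    ybar = sum(z + random.random() for _ in range(n)) / n
    return z, ybar

# On each path the sample mean settles at Z + 1/2, a *random* limit:
# the common component Z never averages out.
z1, m1 = path_mean(100_000)
z2, m2 = path_mean(100_000)
```

Different paths produce different limits because each carries its own realization of Z, which is precisely the failure of the law of large numbers for this stationary but excessively dependent sequence.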
Definition 3.33 Let (Ω, F, P) be a probability space. Let {Z_t} be a stationary sequence and let T be the measure-preserving transformation of Proposition 3.30. Then {Z_t} is ergodic if

lim_{n→∞} n^(−1) Σ_{t=1}^n P(F ∩ T^t G) = P(F)P(G)

for all events F, G ∈ F.

If F and G were independent, then we would have P(F ∩ G) = P(F)P(G). We can think of T^t G as being the event G shifted t periods into the future, and since P(T^t G) = P(G) when T is measure preserving, this definition says that an ergodic process (sequence) is one such that for any events F and G, F and T^t G are independent on average in the limit. Thus ergodicity can be thought of as a form of "average asymptotic independence." For more on measure-preserving transformations, stationarity, and ergodicity the reader may consult Doob (1953, pp. 167-185) and Rosenblatt (1978). The desired law of large numbers can now be stated.
Theorem 3.34 (Ergodic theorem) Let {Z_t} be a stationary ergodic scalar sequence with E|Z_t| < ∞. Then Z̄_n →a.s. μ ≡ E(Z_t).

Proof. See Stout (1974, p. 181). ∎

To apply this result, we make use of the following theorem.

Theorem 3.35 Let g be an F-measurable function into ℝ^k and define Y_t ≡ g(…, Z_{t−1}, Z_t, Z_{t+1}, …), where Z_t is q × 1. (i) If {Z_t} is stationary, then {Y_t} is stationary. (ii) If {Z_t} is stationary and ergodic, then {Y_t} is stationary and ergodic.
Proof. See Stout (1974, pp. 170, 182). ∎

Note that g depends on the present and infinite past and future of the sequence {Z_t}. As stated by Stout, g depends only on the present and future of Z_t, but the result is valid as given here.

Proposition 3.36 If {(Z_t′, X_t′, ε_t)} is a stationary ergodic sequence, then {X_t X_t′}, {X_t ε_t}, {Z_t X_t′}, {Z_t ε_t}, and {Z_t Z_t′} are stationary ergodic sequences.

Proof. Immediate from Theorem 3.35 and Proposition 3.23. ∎

Now we can state a result applicable to time-series data.

Theorem 3.37 Suppose
(i) Yt = X't β₀ + εt, t = 1, 2, ..., β₀ ∈ ℝᵏ;
(ii) {(X't, εt)} is a stationary ergodic sequence;
(iii) (a) E(Xt εt) = 0; (b) E|Xthi εth| < ∞;
let B^∞_{n+m} ≡ σ(Zn+m, ...) be the smallest collection of subsets of Ω that contains the union of the σ-fields σ(Zn+m, ..., Zn+m+τ) as τ → ∞. Intuitively, we can think of B^n_{−∞} as representing all the information contained in the past of the sequence {Zt} up to time n, whereas B^∞_{n+m} represents all the information contained in the future of the sequence {Zt} from time n + m on. We now define two measures of dependence between σ-fields.
Definition 3.41 Let G and H be σ-fields and define
φ(G, H) ≡ sup {G ∈ G, H ∈ H: P(G) > 0} |P(H|G) − P(H)|,
α(G, H) ≡ sup {G ∈ G, H ∈ H} |P(G ∩ H) − P(G)P(H)|.
Intuitively, φ and α measure the dependence of the events in H on those in G in terms of how much the probability of the joint occurrence of an event in each σ-algebra differs from the product of the probabilities of each event occurring. The events in G and H are independent if and only if φ(G, H) and α(G, H) are zero. The function α provides an absolute measure of dependence and φ a relative measure of dependence.
Definition 3.42 For a sequence of random vectors {Zt}, with B^n_{−∞} and B^∞_{n+m} as in Definition 3.40, define the mixing coefficients

φ(m) ≡ sup_n φ(B^n_{−∞}, B^∞_{n+m}) and α(m) ≡ sup_n α(B^n_{−∞}, B^∞_{n+m}).

If, for the sequence {Zt}, φ(m) → 0 as m → ∞, {Zt} is called φ-mixing. If, for the sequence {Zt}, α(m) → 0 as m → ∞, {Zt} is called α-mixing.
The quantities φ(m) and α(m) measure how much dependence exists between events separated by at least m time periods. Hence, if φ(m) = 0 or α(m) = 0 for some m, events m periods apart are independent. By allowing φ(m) or α(m) to approach zero as m → ∞, we allow consideration of situations in which events are independent asymptotically. In the probability literature, φ-mixing sequences are also called uniform mixing (see Iosifescu and Theodorescu, 1969), whereas α-mixing sequences are called strong mixing (see Rosenblatt, 1956). Because φ(m) ≥ α(m), φ-mixing implies α-mixing.
Example 3.43 (i) Let {Zt} be a γ-dependent sequence (i.e., Zt is independent of Zt−τ for all τ > γ). Then φ(m) = α(m) = 0 for all m > γ. (ii) Let {Zt} be a nonstochastic sequence. Then it is an independent sequence, so φ(m) = α(m) = 0 for all m > 0. (iii) Let Zt = ρ₀Zt−1 + εt, t = 1, ..., n, where |ρ₀| < 1 and εt is i.i.d. N(0, 1). (This is called the Gaussian AR(1) process.) Then α(m) → 0 as m → ∞, although φ(m) ↛ 0 as m → ∞ (Ibragimov and Linnik, 1971, pp. 312-313).
The concept of mixing has a meaningful physical interpretation. Adapting an example due to Halmos (1956), we imagine a dry martini initially poured so that 99% is gin and 1% is vermouth (placed in a layer at the top). The martini is steadily stirred by a swizzle stick; t increments with each stir. We observe the proportions of gin and vermouth in any measurable set (i.e., volume of martini). If these proportions tend to 99% and 1% after many stirs, regardless of which volume we observe, then the process is mixing. In this example, the stochastic process corresponds to the position of a given particle at each point in time, which can be represented as a sequence of three-dimensional vectors {Zt}. The notion of mixing is a stronger memory requirement than that of ergodicity for stationary sequences, since given stationarity, mixing implies ergodicity, as the next result makes precise.
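The exponential memory decay of the Gaussian AR(1) of Example 3.43(iii) can be seen in a small simulation (an illustrative sketch assuming NumPy; the value ρ₀ = 0.8 and the seed are arbitrary choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.8, 200_000

# Simulate the Gaussian AR(1): Z_t = rho * Z_{t-1} + eps_t, started from
# its stationary distribution N(0, 1 / (1 - rho**2)).
eps = rng.standard_normal(n)
z = np.empty(n)
z[0] = eps[0] / np.sqrt(1.0 - rho**2)
for t in range(1, n):
    z[t] = rho * z[t - 1] + eps[t]

def sample_corr(x, m):
    """Sample correlation between observations m periods apart."""
    return np.corrcoef(x[:-m], x[m:])[0, 1]

# Dependence between observations m periods apart dies out geometrically, like rho**m.
for m in (1, 5, 10):
    print(m, sample_corr(z, m), rho**m)
```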
Proposition 3.44 Let {Zt} be a stationary sequence. If α(m) → 0 as m → ∞, then {Zt} is ergodic.
Proof. See Rosenblatt (1978). •
Note that if φ(m) → 0 as m → ∞, then α(m) → 0 as m → ∞, so that φ-mixing processes are also ergodic. Ergodic processes are not necessarily mixing, however. On the other hand, mixing is defined for sequences that are not necessarily strictly stationary, so mixing is more general in this sense. For more on mixing and ergodicity, see Rosenblatt (1972, 1978).
To state the law of large numbers for mixing sequences we use the following definition.
Definition 3.45 Let a ∈ ℝ. (i) If φ(m) = O(m^{−a−ε}) for some ε > 0, then φ is of size −a. (ii) If α(m) = O(m^{−a−ε}) for some ε > 0, then α is of size −a.
This definition allows precise statements about the memory of a random sequence that we shall relate to moment conditions expressed in terms of a. As a gets smaller, the sequence exhibits more and more dependence, while as a → ∞, the sequence exhibits less dependence.
Example 3.46 (i) Let {Zt} be independent with Zt ∼ N(0, σ²t). Then {Zt} has φ of size −1. (This is not the smallest size that could be invoked.) (ii) Let Zt be a Gaussian AR(1) process. It can be shown that {Zt} has α of size −a for any a ∈ ℝ, since α(m) decreases exponentially with m. The result of this example extends to many finite autoregressive moving average (ARMA) processes. Under general conditions, finite ARMA processes have exponentially decaying memories.
Using the concept of mixing, we can state a law of large numbers, due to McLeish (1975), which applies to heterogeneous dependent sequences.
Theorem 3.47 (McLeish) Let {Zt} be a sequence of scalars with finite means μt ≡ E(Zt) and suppose that Σ_{t=1}^∞ (E|Zt − μt|^{r+δ}/t^{r+δ})^{1/r} < ∞ for some δ, 0 < δ ≤ r, where r ≥ 1. If φ is of size −r/(2r − 1) or α is of size −r/(r − 1), r > 1, then Z̄n − μ̄n →a.s. 0.
Proof. See McLeish (1975, Theorem 2.10). •
This result generalizes the Markov law of large numbers, Theorem 3.7. (There we have r = 1.) Using an argument analogous to that used in obtaining Corollary 3.9, we obtain the following corollary.
Corollary 3.48 Let {Zt} be a sequence with φ of size −r/(2r − 1), r > 1, or α of size −r/(r − 1), r > 1, such that E|Zt|^{r+δ} ≤ Δ < ∞ for some δ > 0 and all t. Then Z̄n − μ̄n →a.s. 0.
Setting r arbitrarily close to unity yields a generalization of Corollary 3.9 that would apply to sequences with exponential memory decay. For sequences with longer memories, r is greater, and the moment restrictions increase accordingly. Here we have a clear trade-off between the amount of allowable dependence and the sufficient moment restrictions. To apply this result, we use the following theorem.
Theorem 3.49 Let g be a measurable function into ℝᵏ and define Yt ≡ g(Zt, Zt+1, ..., Zt+τ), where τ is finite. If the sequence of q × 1 vectors {Zt} is φ-mixing (α-mixing) of size −a, a > 0, then {Yt} is φ-mixing (α-mixing) of size −a.
Proof. See White and Domowitz (1984, Lemma 2.1). •
In other words, measurable functions of mixing processes are mixing and of the same size. Note that whereas functions of ergodic processes retain ergodicity for any τ, finite or infinite, mixing is guaranteed only for finite τ.
Proposition 3.50 If {(Z't, X't, εt)} is a mixing sequence of size −a, then {XtX't}, {Xtεt}, {ZtX't}, {Ztεt}, and {ZtZ't} are mixing sequences of size −a.
Proof. Immediate from Theorem 3.49 and Proposition 3.23. • Now we can generalize the results of Exercise 3.14 to allow for dependence as well as heterogeneity.
Exercise 3.51 Prove the following result. Suppose
(i) Yt = X't β₀ + εt, t = 1, 2, ..., β₀ ∈ ℝᵏ;
(ii) {(X't, εt)} is a mixing sequence with φ of size −r/(2r − 1), r > 1, or α of size −r/(r − 1), r > 1;
(iii) (a) E(Xt εt) = 0, t = 1, 2, ...; (b) E|Xthi εth|^{r+δ} ≤ Δ < ∞, for some δ > 0, h = 1, ..., p, i = 1, ..., k and for all t;
(iv) (a) E|X²thi|^{r+δ} ≤ Δ < ∞, for some δ > 0, h = 1, ..., p, i = 1, ..., k and for all t; (b) Mn ≡ E(X'X/n) is uniformly positive definite.
Then β̂n exists for all n sufficiently large a.s., and β̂n →a.s. β₀.
From this result, we can obtain the result of Exercise 3.14 as a direct corollary by setting r = 1. Compared to our first consistency result, Theorem 3.5, we have relaxed the independence and identical distribution assumptions, but strengthened the moment requirements somewhat. Among the many different possibilities that this result allows, we can have lagged dependent variables and nonstochastic variables both appearing in the explanatory variables Xt. The regression errors εt may be heteroskedastic or may be serially correlated.
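As an informal illustration of the scope of Exercise 3.51 (a sketch assuming NumPy; the model Yt = 0.5 Yt−1 + 2 Xt + εt and all parameter values are invented), OLS remains consistent even with a lagged dependent variable among the regressors:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha0, beta0, n = 0.5, 2.0, 200_000

# Yt = alpha0 * Y_{t-1} + beta0 * Xt + eps_t: a dependent (but mixing) sequence.
x = rng.standard_normal(n)
eps = rng.standard_normal(n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = alpha0 * y[t - 1] + beta0 * x[t] + eps[t]

# OLS of Yt on (Y_{t-1}, Xt): independence across observations fails, yet the
# estimates settle near (alpha0, beta0) as n grows.
X = np.column_stack([y[:-1], x[1:]])
bhat = np.linalg.lstsq(X, y[1:], rcond=None)[0]
print(bhat)
```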
In fact, Exercise 3.51 is a powerful result that can be applied to a wide range of situations faced by economists. For further discussion of linear models with mixing observations, see Domowitz (1983). Applications of Exercise 3.51 often use the following result, which allows the interchange of expectation and infinite sums.
Proposition 3.52 Let {Zt} be a sequence of random variables such that Σ_{t=1}^∞ E|Zt| < ∞. Then Σ_{t=1}^∞ Zt converges a.s. and

E(Σ_{t=1}^∞ Zt) = Σ_{t=1}^∞ E(Zt) < ∞.
Proof. See Billingsley (1979, p. 181) . •
This result is useful in verifying the conditions of Exercise 3.51 for the following exercise.
Exercise 3.53 (i) State conditions that are sufficient to ensure the consistency of the OLS estimator when Yt = α₀Yt−1 + β₀Xt + εt, where Yt, Xt and εt are scalars. (Hint: The Minkowski inequality applies to infinite sums; that is, given {Zt} such that Σ_{t=1}^∞ (E|Zt|^p)^{1/p} < ∞ with p ≥ 1, then E|Σ_{t=1}^∞ Zt|^p ≤ (Σ_{t=1}^∞ (E|Zt|^p)^{1/p})^p.) (ii) Find a simple example to which Exercise 3.51 does not apply.
Conditions for the consistency of the IV estimator are given by the next result.
Theorem 3.54 Suppose
(i) Yt = X't β₀ + εt, t = 1, 2, ..., β₀ ∈ ℝᵏ;
(ii) {(Z't, X't, εt)} is a mixing sequence with φ of size −r/(2r − 1), r > 1, or α of size −r/(r − 1), r > 1;
(iii) (a) E(Zt εt) = 0, t = 1, 2, ...; (b) E|Zthi εth|^{r+δ} ≤ Δ < ∞ for some δ > 0, all h = 1, ..., p, i = 1, ..., l, and all t;
(iv) (a) E|Zthi Xthj|^{r+δ} ≤ Δ < ∞ for some δ > 0, all h = 1, ..., p, i = 1, ..., l, j = 1, ..., k and all t; (b) Qn ≡ E(Z'X/n) has uniformly full column rank; (c) P̂n − Pn →a.s. 0, where Pn = O(1) and is symmetric and uniformly positive definite.
Then β̂n exists for all n sufficiently large a.s., and β̂n →a.s. β₀.
Proof. By Proposition 3.50, {Ztεt} and {ZtX't} are mixing sequences with elements satisfying the conditions of Corollary 3.48 (given (iii.b) and (iv.a)). It follows from Corollary 3.48 that Z'ε/n →a.s. 0 and Z'X/n − Qn →a.s. 0, where Qn = O(1), given (iv.a) as a consequence of Jensen's inequality. Hence the conditions of Exercise 2.20 are satisfied and the results follow. •
Although mixing is an appealing dependence concept, it shares with ergodicity the property that it can be somewhat difficult to verify theoretically and is impossible to verify empirically. An alternative dependence concept that is easier to verify theoretically is a form of asymptotic noncorrelation.
Definition 3.55 The scalar sequence {Zt} has asymptotically uncorrelated elements (or is asymptotically uncorrelated) if there exist constants {ρτ, τ ≥ 0} such that 0 ≤ ρτ ≤ 1, Σ_{τ=0}^∞ ρτ < ∞ and cov(Zt, Zt+τ) ≤ ρτ (var(Zt) var(Zt+τ))^{1/2} for all τ > 0, where var(Zt) < ∞ for all t.
Note that ρτ is only an upper bound on the correlation between Zt and Zt+τ and that the actual correlation may depend on t. Further, only positive correlation matters, so that if Zt and Zt+τ are negatively correlated, we can set ρτ = 0. Also note that for Σ_{τ=0}^∞ ρτ < ∞, it is necessary that ρτ → 0 as τ → ∞, and it is sufficient that for all τ sufficiently large, ρτ ≤ τ^{−1−δ} for some δ > 0.
Example 3.56 Let Zt = ρ₀Zt−1 + εt, where εt is i.i.d., E(εt) = 0, var(εt) = σ²ε, E(Zt−1 εt) = 0. Then corr(Zt, Zt+τ) = ρ₀^τ. If 0 < ρ₀ < 1, Σ_{τ=0}^∞ ρ₀^τ = 1/(1 − ρ₀) < ∞, so the sequence {Zt} is asymptotically uncorrelated.
If a sequence has constant finite variance and has covariances that depend only on the time lag between Zt and Zt+τ, the sequence is said to be covariance stationary. (This is implied by stationarity but is weaker because a sequence can be covariance stationary without being stationary.)
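A quick arithmetic check of Example 3.56 (illustrative only; ρ₀ = 0.9 is an arbitrary choice): the geometric bounds ρτ = ρ₀^τ are summable, which is what asymptotic uncorrelation requires.

```python
# Partial sums of rho0**tau converge to 1 / (1 - rho0), here 10.0, so the
# AR(1) correlation bounds satisfy the summability condition of Definition 3.55.
rho0 = 0.9
partial = sum(rho0**tau for tau in range(200))
print(partial, 1.0 / (1.0 - rho0))
```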
Verifying that a covariance stationary sequence has asymptotically uncorrelated elements is straightforward when the process has a finite ARMA representation (see Granger and Newbold, 1977, Ch. 1). In this case, ρτ can be determined from well-known formulas (see, e.g., Granger and Newbold, 1977, Ch. 1) and the condition Σ_{τ=0}^∞ ρτ < ∞ can be directly evaluated. Thus, covariance stationary sequences as well as stationary ergodic sequences can often be shown to be asymptotically uncorrelated, although an asymptotically uncorrelated sequence need not be stationary and ergodic or covariance stationary. Under general conditions on the size of φ
or α, mixing processes can be shown to be asymptotically uncorrelated. Asymptotically uncorrelated sequences need not be mixing, however. A law of large numbers for asymptotically uncorrelated sequences is the following.
Theorem 3.57 Let {Zt} be a scalar sequence with asymptotically uncorrelated elements with means μt ≡ E(Zt) and σ²t ≡ var(Zt) ≤ Δ < ∞. Then Z̄n − μ̄n →a.s. 0.
Proof. Immediate from Stout (1974, Theorem 3.7.2). •
Compared with Corollary 3.48, we have relaxed the dependence restriction from asymptotic independence (mixing) to asymptotic uncorrelation, but we have altered the moment requirements from restrictions on moments of order r + δ (r > 1, δ > 0) to second moments. Typically, this is a strengthening of the moment restrictions. Since taking functions of random variables alters their correlation properties, there is no simple analog of Proposition 3.2, Theorem 3.35, or Theorem 3.49. To obtain consistency results for the OLS or IV estimators, one must directly assume that all the appropriate sequences are asymptotically uncorrelated so that the almost sure convergence assumed in Theorem 2.18 or Exercise 2.20 holds. Since asymptotically uncorrelated sequences will not play an important role in the rest of this book, we omit stating and proving results for such sequences.
3.5 Martingale Difference Sequences
In all of the consistency results obtained so far, there has been the explicit requirement that either E(Xtεt) = 0 or E(Ztεt) = 0. Economic theory can play a key role in justifying this assumption. In fact, it often occurs that economic theory is used to justify the stronger assumption that E(εt|Xt) = 0 or E(εt|Zt) = 0, which then implies E(Xtεt) = 0 or E(Ztεt) = 0. In particular, this occurs when X't β₀ is viewed as the value of Yt we expect to observe when Xt occurs, so that X't β₀ is the conditional expectation of Yt given Xt, i.e., E(Yt|Xt) = X't β₀. Then we define εt ≡ Yt − E(Yt|Xt) = Yt − X't β₀. Using the algebra of conditional expectations given below, it is straightforward to show that E(εt|Xt) = 0.
One of the more powerful economic theories (powerful in the sense of imposing a great deal of structure on the way in which the data behave) is the theory of rational expectations. Often this theory can be used not only
to justify the assumption that E(εt|Xt) = 0 but also that

E(εt | εt−1, εt−2, ..., ε1, Xt, Xt−1, ..., X1) = 0,

i.e., that the conditional expectation of εt, given the entire past history of the errors εt and the current and past values of the explanatory variables Xt, is zero. This assumption allows us to apply laws of large numbers for martingale difference sequences that are convenient and powerful. To define what martingale difference sequences are and to state the associated results, we need to provide a more complete background on the properties of conditional expectations. So far we have relied on the reader's intuitive understanding of what a conditional expectation is. A precise definition is the following.
Definition 3.58 Let Y be an F-measurable random variable, E(|Y|) < ∞, and let G be a σ-field contained in F. Then there exists a random variable E(Y|G), called the conditional expectation of Y given G, such that
(i) E(Y|G) is G-measurable and E(|E(Y|G)|)
r ≥ 1 and all t. Then Z̄n →a.s. 0.
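As an informal check of the martingale-difference law of large numbers (a sketch assuming NumPy, not from the text): with ηt i.i.d. N(0, 1), the products εt = ηt ηt−1 form a martingale difference sequence that is dependent (adjacent terms share a factor) yet still averages to zero.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
eta = rng.standard_normal(n + 1)

# eps_t = eta_t * eta_{t-1}: E(eps_t | past) = eta_{t-1} * E(eta_t) = 0,
# so {eps_t} is a martingale difference sequence, though not independent.
eps = eta[1:] * eta[:-1]

print(eps.mean())  # near zero, as the MDS law of large numbers predicts
print(np.corrcoef(eps[:-1]**2, eps[1:]**2)[0, 1])  # positive: dependence remains
```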
Using this result and the law of large numbers for mixing sequences, we can state the following consistency result for the OLS estimator.
Theorem 3.78 Suppose
(i) Yt = X't β₀ + εt, t = 1, 2, ..., β₀ ∈ ℝᵏ;
(ii) {Xt} is a sequence of mixing random variables with φ of size −r/(2r − 1), r > 1, or α of size −r/(r − 1), r > 1;
(iii) (a) {Xthi εth, Ft} is a martingale difference sequence, h = 1, ..., p, i = 1, ..., k; (b) E|Xthi εth|^{2r} ≤ Δ < ∞, for all h = 1, ..., p, i = 1, ..., k and t;
(iv) (a) E|X²thi|^{r+δ} ≤ Δ < ∞, for some δ > 0 and all h = 1, ..., p, i = 1, ..., k and t; (b) Mn ≡ E(X'X/n) is uniformly positive definite.
Then β̂n exists for all n sufficiently large a.s., and β̂n →a.s. β₀.
Proof. To verify that the conditions of Theorem 2.18 hold, we note first that X'ε/n = Σ_{h=1}^p X'h εh/n, where Xh is the n × k matrix with rows X'th and εh is the n × 1 vector with elements εth. By assumption (iii.a), {Xthi εth, Ft} is a martingale difference sequence. Because the moment conditions of Exercise 3.77 are satisfied by (iii.b), we have n⁻¹ Σ Xthi εth →a.s. 0, h = 1, ..., p, i = 1, ..., k, so X'ε/n →a.s. 0 by Proposition 2.11. Next, Proposition 3.50 ensures that {XtX't} is a mixing sequence (given (ii)) that satisfies the conditions of Corollary 3.48 (given (iv.a)). It follows that X'X/n − Mn →a.s. 0, and Mn = O(1) (given (iv.a)) by Jensen's inequality. Hence the conditions of Theorem 2.18 are satisfied and the result follows. •
Note that the conditions placed on Xt by (ii) and (iv.a) ensure that X'X/n − Mn →a.s. 0 and that these conditions can be replaced by any other conditions that ensure the same conclusion. A result for the IV estimator can be obtained analogously.
Exercise 3.79 Prove the following result. Given
(i) Yt = X't β₀ + εt, t = 1, 2, ..., β₀ ∈ ℝᵏ;
(ii) {(Z't, X't, εt)} is a mixing sequence with φ of size −r/(2r − 1), r > 1, or α of size −r/(r − 1), r > 1;
(iii) (a) {Zthi εth, Ft} is a martingale difference sequence, h = 1, ..., p, i = 1, ..., l; (b) E|Zthi εth|^{2r} ≤ Δ < ∞, for all h = 1, ..., p, i = 1, ..., l and t;
(iv) (a) E|Zthi Xthj|^{r+δ} ≤ Δ < ∞, for some δ > 0 and all h = 1, ..., p, i = 1, ..., l, j = 1, ..., k and t; (b) Qn ≡ E(Z'X/n) has uniformly full column rank; (c) P̂n − Pn →a.s. 0, where Pn = O(1) and is symmetric and uniformly positive definite.
Then β̂n exists for all n sufficiently large a.s., and β̂n →a.s. β₀.
As with results for the OLS estimator, (ii) and (iv.a) can be replaced by any other conditions that ensure Z'X/n − Qn →a.s. 0. Note that assumption (ii) is stronger than absolutely necessary here. Instead, it suffices that {(Z't, X't)} is appropriately mixing. However, assumption (ii) is used later to ensure the consistency of estimated covariance matrices.
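As an informal aside on why the IV results matter (a sketch assuming NumPy, with invented coefficients and, for simplicity, i.i.d. rather than mixing data): when E(Xt εt) ≠ 0 but E(Zt εt) = 0, OLS is inconsistent while IV recovers β₀.

```python
import numpy as np

rng = np.random.default_rng(7)
n, beta0 = 200_000, 1.0

# Endogenous regressor: Xt = Zt + vt with eps_t correlated with vt,
# so E(Xt * eps_t) = 0.8 while the instrument satisfies E(Zt * eps_t) = 0.
z = rng.standard_normal(n)
v = rng.standard_normal(n)
eps = 0.8 * v + 0.6 * rng.standard_normal(n)
x = z + v
y = beta0 * x + eps

beta_ols = (x @ y) / (x @ x)  # plim is beta0 + 0.8 / 2 = 1.4: inconsistent
beta_iv = (z @ y) / (z @ x)   # consistent for beta0 = 1.0
print(beta_ols, beta_iv)
```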
References
Bartle, R. G. (1966). The Elements of Integration. Wiley, New York.
Billingsley, P. (1979). Probability and Measure. Wiley, New York.
Chung, K. L. (1974). A Course in Probability Theory. Harcourt, New York.
Corradi, V., N. Swanson, and H. White (2000). "Testing for stationarity-ergodicity and for comovement between nonlinear discrete time Markov processes." Journal of Econometrics, 96, 39-73.
Domowitz, I. (1983). "The linear model with stochastic regressors and heteroskedastic dependent errors." Discussion Paper, Center for Mathematical Studies in Economics and Management Sciences, Northwestern University, Evanston, Illinois.
——— and M. A. El-Gamal (1993). "A consistent test of stationary-ergodicity." Econometric Theory, 9, 589-601.
Doob, J. L. (1953). Stochastic Processes. Wiley, New York.
Granger, C. W. J. and P. Newbold (1977). Forecasting Economic Time Series. Academic Press, New York.
Halmos, P. R. (1956). Lectures in Ergodic Theory. Chelsea Publishing, New York.
Ibragimov, I. A. and Y. V. Linnik (1971). Independent and Stationary Sequences of Random Variables. Wolters-Noordhoff, The Netherlands.
Iosifescu, M. and R. Theodorescu (1969). Random Processes and Learning. Springer-Verlag, New York.
Kolmogorov, A. N. (1933). Grundbegriffe der Wahrscheinlichkeitsrechnung. Ergebnisse der Mathematik und ihrer Grenzgebiete, Vol. 2, No. 3. Springer-Verlag, Berlin.
Lukacs, E. (1975). Stochastic Convergence. Academic Press, New York.
McLeish, D. L. (1975). "A maximal inequality and dependent strong laws." Annals of Probability, 3, 826-836.
Rao, C. R. (1973). Linear Statistical Inference and Its Applications. Wiley, New York.
Rosenblatt, M. (1956). "A central limit theorem and a strong mixing condition." Proc. Nat. Acad. Sci. U.S.A., 42, 43-47.
——— (1972). "Uniform ergodicity and strong mixing." Z. Wahrsch. Verw. Gebiete, 24, 79-84.
——— (1978). "Dependence and asymptotic independence for random processes." In Studies in Probability Theory, M. Rosenblatt, ed., Mathematical Association of America, Washington, D.C.
Samuelson, P. (1965). "Proof that properly anticipated prices fluctuate randomly." Industrial Management Review, 6, 41-49.
Stout, W. F. (1974). Almost Sure Convergence. Academic Press, New York.
White, H. and I. Domowitz (1984). "Nonlinear regression with dependent observations." Econometrica, 52, 143-162.
CHAPTER 4
Asymptotic Normality
In the classical linear model with fixed regressors and normally distributed i.i.d. errors, the least squares estimator β̂n is distributed as a multivariate normal with E(β̂n) = β₀ and var(β̂n) = σ²₀(X'X)⁻¹ for any sample size n. This fact forms the basis for statistical tests of hypotheses, based typically on t- and F-statistics. When the sample size is large, econometric estimators such as β̂n have a distribution that is approximately normal under much more general conditions, and this fact forms the basis for large sample statistical tests of hypotheses. In this chapter we study the tools used in determining the asymptotic distribution of β̂n, how this asymptotic distribution can be used to test hypotheses in large samples, and how asymptotic efficiency can be obtained.
4.1 Convergence in Distribution
The most fundamental concept is that of convergence in distribution.
Definition 4.1 Let {bn} be a sequence of random finite-dimensional vectors with joint distribution functions {Fn}. If Fn(z) → F(z) as n → ∞ for every continuity point z, where F is the distribution function of a random variable Z, then bn converges in distribution to the random variable Z, denoted bn →d Z.
Heuristically, the distribution of bn gets closer and closer to that of the random variable Z, so the distribution F can be used as an approximation to the distribution of bn. When bn →d Z, we also say that bn converges in law to Z (written bn →L Z), or that bn is asymptotically distributed as F, denoted bn ∼A F. Then F is called the limiting distribution of bn. Note that the convergence specified by this definition is pointwise and only has to occur at points z where F is continuous (the "continuity points").
Example 4.2 Let {bn} be a sequence of i.i.d. random variables with distribution function F. Then (trivially) F is the limiting distribution of bn. This illustrates the fact that convergence in distribution is a very weak convergence concept and by itself implies nothing about the convergence of the sequence of random variables.
Example 4.3 Let {Zt} be i.i.d. random variables with mean μ and finite variance σ² > 0. Define

bn ≡ √n (Z̄n − μ)/σ.

Then by the Lindeberg-Levy central limit theorem (Theorem 5.2), bn ∼A N(0, 1).
In other words, the sample mean Z̄n of i.i.d. observations, when standardized, has a distribution that approaches the standard normal distribution. This result actually holds under rather general conditions on the sequence {Zt}. The conditions under which this convergence occurs are studied at length in the next chapter. In this chapter, we simply assume that such conditions are satisfied, so convergence in distribution is guaranteed. Convergence in distribution is meaningful even when the limiting distribution is that of a degenerate random variable.
Lemma 4.4 Suppose bn →p b (a constant). Then bn ∼A Fb, where Fb is the distribution function of a random variable Z that takes the value b with probability one (i.e., bn →d b). Also, if bn ∼A Fb, then bn →p b.
Proof. See Rao (1973, p. 120) . •
In other words, convergence in probability to a constant implies convergence in distribution to that constant. The converse is also true. A useful implication of convergence in distribution is the following lemma.
Lemma 4.5 If bn →d Z, then bn is Op(1).
Proof. Recall that bn is Op(1) if, given any δ > 0, P[|bn| > Δδ] < δ for some Δδ < ∞ and all n ≥ Nδ. Because bn →d Z, P[|bn| > Δδ] → P[|Z| > Δδ], provided (without loss of generality) that Δδ and −Δδ are continuity points of the distribution of Z. Thus |P[|bn| > Δδ] − P[|Z| > Δδ]| < δ for all n ≥ Nδ, so that P[|bn| > Δδ] < δ + P[|Z| > Δδ] for all n ≥ Nδ. Because P[|Z| > Δδ] < δ for Δδ sufficiently large, we have P[|bn| > Δδ] < 2δ for Δδ sufficiently large and all n ≥ Nδ. •
This allows us to establish the next useful lemma.
Lemma 4.6 (Product rule) Recall from Corollary 2.36 that if An = op(1) and bn = Op(1), then An bn = op(1). Hence, if An →p 0 and bn →d Z, then An bn →p 0.
In turn, this result is often used in conjunction with the following result, which is one of the most useful of those relating convergence in probability and convergence in distribution.
Lemma 4.7 (Asymptotic equivalence) Let {an} and {bn} be two sequences of random vectors. If an − bn →p 0 and bn →d Z, then an →d Z.
Proof. See Rao (1973, p. 123). •
This result is helpful in situations in which we wish to find the asymptotic distribution of an but cannot easily do so directly. Often, however, it is easy to find a bn that has a known asymptotic distribution and that satisfies an − bn →p 0. Lemma 4.7 then ensures that an has the same limiting distribution as bn, and we say that an is "asymptotically equivalent" to bn. The joint use of Lemmas 4.6 and 4.7 is the key to the proof of the asymptotic normality results for the OLS and IV estimators.
Another useful tool in the study of convergence in distribution is the characteristic function.
Definition 4.8 Let Z be a k × 1 random vector with distribution function F. The characteristic function of Z is defined as f(λ) ≡ E(exp(iλ'Z)), where i² = −1 and λ is a k × 1 real vector.
Example 4.9 Let Z be a nonstochastic real number, Z = c. Then f(λ) = E(exp(iλZ)) = E(exp(iλc)) = exp(iλc).
Example 4.10 (i) Let Z ∼ N(μ, σ²). Then f(λ) = exp(iλμ − λ²σ²/2). (ii) Let Z ∼ N(μ, Σ), where μ is k × 1 and Σ is k × k. Then f(λ) = exp(iλ'μ − λ'Σλ/2).
A useful table of characteristic functions is given by Lukacs (1970, p. 18). Because the characteristic function is the Fourier transform of the probability density function, it has the property that any characteristic function uniquely determines a distribution function, as formally expressed by the next result.
Theorem 4.11 (Uniqueness theorem) Two distribution functions are identical if and only if their characteristic functions are identical.
Proof. See Lukacs (1974, p. 14). •
Thus the behavior of a random variable can be studied either through its distribution function or its characteristic function, whichever is more convenient.
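The uniqueness theorem invites a numerical sanity check (an illustrative sketch assuming NumPy; the evaluation points are arbitrary): the empirical characteristic function of a large N(0, 1) sample is close to exp(−λ²/2), the formula of Example 4.10(i).

```python
import numpy as np

rng = np.random.default_rng(5)
z = rng.standard_normal(200_000)

# Empirical characteristic function f_n(lam) = average of exp(i * lam * Z)
# versus the N(0, 1) characteristic function exp(-lam**2 / 2).
for lam in (0.5, 1.0, 2.0):
    emp = np.exp(1j * lam * z).mean()
    print(lam, emp.real, np.exp(-lam**2 / 2.0))
```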
Example 4.12 The distribution of a linear transformation of a random variable is easily found using the characteristic function. Consider Y = AZ, where A is a q × k matrix and Z is a random k-vector. Let θ be q × 1. Then

fY(θ) = E(exp(iθ'Y)) = E(exp(iθ'AZ)) = E(exp(iλ'Z)) = fZ(λ),

defining λ ≡ A'θ. Hence if Z ∼ N(μ, Σ), fZ(λ) = exp(iλ'μ − λ'Σλ/2) and

fY(θ) = exp(iθ'Aμ − θ'AΣA'θ/2),

so that Y ∼ N(Aμ, AΣA') by the uniqueness theorem.
Other useful facts regarding characteristic functions are the following.
Proposition 4.13 Let Y = aZ + b, a, b ∈ ℝ. Then

fY(λ) = fZ(aλ) exp(iλb).

Proof. fY(λ) = E(exp(iλY)) = E(exp(iλ(aZ + b))) = E(exp(iλaZ) exp(iλb)) = E(exp(iλaZ)) exp(iλb) = fZ(aλ) exp(iλb). •
Proposition 4.14 Let Y and Z be independent. Then if X ≡ Y + Z,

fX(λ) = fY(λ) fZ(λ).

Proof. fX(λ) = E(exp(iλX)) = E(exp(iλ(Y + Z))) = E(exp(iλY) exp(iλZ)) = E(exp(iλY)) E(exp(iλZ)) by independence. Hence fX(λ) = fY(λ) fZ(λ). •
Proposition 4.15 If the kth moment μk of a distribution function F exists, then the characteristic function f of F can be differentiated k times and f⁽ᵏ⁾(0) = iᵏμk, where f⁽ᵏ⁾ is the kth derivative of f.
Proof. This is an immediate corollary of Lukacs (1970, Corollary 3 to Theorem 2.3.1, p. 22). •
Example 4.16 Suppose that Z ∼ N(0, σ²). Then f′(0) = 0, f″(0) = −σ², f‴(0) = 0, etc.
The main result of use in studying convergence in distribution is the following.
Theorem 4.17 (Continuity theorem) Let {bn} be a sequence of random k × 1 vectors with characteristic functions {fn(λ)}. If bn →d Z, then for every λ, fn(λ) → f(λ), where f(λ) = E(exp(iλ'Z)). Further, if for every λ, fn(λ) → f(λ) and f is continuous at λ = 0, then bn →d Z, where f(λ) = E(exp(iλ'Z)).
Proof. See Lukacs (1970, pp. 49-50). •
This result essentially says that convergence in distribution is equivalent to convergence of characteristic functions. The usefulness of the result is that often it is much easier to study the limiting behavior of characteristic functions than distribution functions. If the sequence of characteristic functions fn converges to a function f that is continuous at λ = 0, this theorem guarantees that the function f is a characteristic function and that the limiting distribution F of bn is that corresponding to the characteristic function f(λ).
In all the cases that follow, the limiting distribution F will be either that of a degenerate random variable (following from convergence in probability to a constant) or a multivariate normal distribution (following from an
appropriate central limit theorem). In the latter case, it is often convenient to standardize the random variables so that the asymptotic distribution is unit multivariate normal. To do this we can use the matrix square root.
Exercise 4.18 Prove the following. Let V be a positive (semi)definite symmetric matrix. Then there exists a positive (semi)definite symmetric matrix square root V^{1/2} such that the elements of V^{1/2} are continuous functions of V and V^{1/2}V^{1/2} = V. (Hint: Express V as V = Q'DQ, where Q is an orthogonal matrix and D is diagonal with the eigenvalues of V along the diagonal.)
Exercise 4.19 Show that if Z ∼ N(0, V), then V^{−1/2}Z ∼ N(0, I), provided V is positive definite, where V^{−1/2} ≡ (V^{1/2})⁻¹.
Definition 4.20 Let {bn} be a sequence of random vectors. If there exists a sequence of matrices {Vn} such that Vn is nonsingular for all n sufficiently large and Vn^{−1/2}bn →d N(0, I), then Vn is called the asymptotic covariance matrix of bn, denoted avar(bn).
When var(bn) is finite, we can usually define Vn = var(bn). Note that the behavior of bn is not restricted to require that Vn converge to any limit, although it may. Generally, however, we will at least require that the smallest eigenvalues of Vn and Vn⁻¹ are uniformly bounded away from zero for all n sufficiently large. Even when var(bn) is not finite, the asymptotic covariance matrix can exist, although in such cases we cannot set Vn = var(bn).
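Exercises 4.18 and 4.19 can be sketched numerically (assuming NumPy; the matrix V below is an invented example): the eigendecomposition of the hint yields a symmetric square root, and V^{−1/2} standardizes an N(0, V) draw.

```python
import numpy as np

rng = np.random.default_rng(6)

# Symmetric square root via the eigendecomposition V = Q D Q' from the hint.
V = np.array([[4.0, 1.0],
              [1.0, 2.0]])
w, Q = np.linalg.eigh(V)
V_half = Q @ np.diag(np.sqrt(w)) @ Q.T
print(np.allclose(V_half @ V_half, V))  # True: V^{1/2} V^{1/2} = V

# Exercise 4.19: if Z ~ N(0, V), then V^{-1/2} Z ~ N(0, I).
Z = rng.multivariate_normal(np.zeros(2), V, size=200_000)
S = np.linalg.inv(V_half) @ Z.T
print(np.cov(S))  # close to the 2 x 2 identity
```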
Example 4.21 Define bn = Z + Y/n, where Z ∼ N(0, 1) and Y is Cauchy, independent of Z. Then var(bn) is infinite for every n, but bn →d N(0, 1) as a consequence of Lemma 4.7. Hence avar(bn) = 1.
Given a sequence {Vn^{−1/2}bn} that converges in distribution, we shall often be interested in the behavior of linear combinations of bn, say, {An bn}, where An, like Vn^{−1/2}, is not required to converge to a particular limit. We can use characteristic functions to study the behavior of these sequences by making use of the following corollary to the continuity theorem.
Corollary 4.22 If λ ∈ ℝᵏ and a sequence {fn(λ)} of characteristic functions converges to a characteristic function f(λ), then the convergence is uniform in every compact subset of ℝᵏ.
Proof. This is a straightforward extension of Lukacs (1970, p. 50). •
4.1 Convergence in Distribution
This result says that in any compact subset of $\mathbb{R}^k$, a bound on the distance between $f_n(\lambda)$ and $f(\lambda)$ can be chosen that depends only on $n$ and not on $\lambda$. This fact is crucial to establishing the next result.

Lemma 4.23 Let $\{b_n\}$ be a sequence of random $k\times 1$ vectors with characteristic functions $\{f_n(\lambda)\}$, and suppose $f_n(\lambda)\to f(\lambda)$. If $\{A_n\}$ is any sequence of $q\times k$ nonstochastic matrices such that $A_n = O(1)$, then the sequence $\{A_n b_n\}$ has characteristic functions $\{\bar f_n(\theta)\}$, where $\theta$ is $q\times 1$, such that for every $\theta$, $\bar f_n(\theta) - f(A_n'\theta)\to 0$.
Proof. From Example 4.12, $\bar f_n(\theta) = f_n(A_n'\theta)$. For fixed $\theta$, $\lambda_n \equiv A_n'\theta$ takes values in a compact region of $\mathbb{R}^k$, say $N_0$, for all $n$ sufficiently large, because $A_n = O(1)$. Because $f_n(\lambda)\to f(\lambda)$, we have $f_n(\lambda_n) - f(\lambda_n)\to 0$ uniformly for all $\lambda_n$ in $N_0$, by Corollary 4.22. Hence for fixed $\theta$, $f_n(A_n'\theta) - f(A_n'\theta) = \bar f_n(\theta) - f(A_n'\theta)\to 0$ for any $O(1)$ sequence $\{A_n\}$. Because $\theta$ is arbitrary, the result follows. ∎
The following consequence of this result is used many times below.

Corollary 4.24 Let $\{b_n\}$ be a sequence of random $k\times 1$ vectors such that $V_n^{-1/2} b_n \stackrel{A}{\sim} N(0,I)$, where $\{V_n\}$ and $\{V_n^{-1}\}$ are $O(1)$. Let $\{A_n\}$ be an $O(1)$ sequence of (nonstochastic) $q\times k$ matrices with full row rank $q$ for all $n$ sufficiently large, uniformly in $n$. Then the sequence $\{A_n b_n\}$ is such that $\Gamma_n^{-1/2} A_n b_n \stackrel{A}{\sim} N(0,I)$, where $\Gamma_n \equiv A_n V_n A_n'$, and $\Gamma_n$ and $\Gamma_n^{-1}$ are $O(1)$.

Proof. $\Gamma_n = O(1)$ by Lemma 2.19. $\Gamma_n^{-1} = O(1)$ because $\Gamma_n = O(1)$ and $\det(\Gamma_n) > \delta > 0$ for all $n$ sufficiently large, given the conditions on $\{A_n\}$ and $\{V_n\}$. Let $\bar f_n(\theta)$ be the characteristic function of $\Gamma_n^{-1/2} A_n b_n = \Gamma_n^{-1/2} A_n V_n^{1/2} V_n^{-1/2} b_n$. Because $\Gamma_n^{-1/2} A_n V_n^{1/2} = O(1)$, Lemma 4.23 applies, implying $\bar f_n(\theta) - f(V_n^{1/2} A_n'\Gamma_n^{-1/2}\theta) \to 0$, where $f(\lambda) = \exp(-\lambda'\lambda/2)$ is the limiting characteristic function of $V_n^{-1/2} b_n$. Now $f(V_n^{1/2} A_n'\Gamma_n^{-1/2}\theta) = \exp(-\theta'\Gamma_n^{-1/2} A_n V_n A_n'\Gamma_n^{-1/2}\theta/2) = \exp(-\theta'\theta/2)$ by definition of $\Gamma_n^{-1/2}$. Hence $\bar f_n(\theta) - \exp(-\theta'\theta/2)\to 0$, so $\Gamma_n^{-1/2} A_n b_n \stackrel{A}{\sim} N(0,I)$ by the continuity theorem (Theorem 4.17). ∎

This result allows us to complete the proof of the following general asymptotic normality result for the least squares estimator.

Theorem 4.25 Given
(i) $Y_t = X_t'\beta_0 + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_0 \in \mathbb{R}^k$;

(ii) $V_n^{-1/2} n^{-1/2} X'\varepsilon \stackrel{A}{\sim} N(0,I)$, where $V_n \equiv \mathrm{var}(n^{-1/2} X'\varepsilon)$ is $O(1)$ and uniformly positive definite;
(iii) $X'X/n - M_n \stackrel{p}{\to} 0$, where $M_n \equiv E(X'X/n)$ is $O(1)$ and uniformly positive definite.

Then $D_n^{-1/2}\sqrt{n}(\hat\beta_n - \beta_0) \stackrel{A}{\sim} N(0,I)$, where $D_n \equiv M_n^{-1} V_n M_n^{-1}$, and $D_n$ and $D_n^{-1}$ are $O(1)$. Suppose in addition that

(iv) there exists a matrix $\hat V_n$, positive semidefinite and symmetric, such that $\hat V_n - V_n \stackrel{p}{\to} 0$. Then $\hat D_n - D_n \stackrel{p}{\to} 0$, where $\hat D_n \equiv (X'X/n)^{-1}\hat V_n (X'X/n)^{-1}$.
Proof. Because $X'X/n - M_n \stackrel{p}{\to} 0$ and $M_n$ is finite and nonsingular by (iii), $(X'X/n)^{-1}$ and $\hat\beta_n$ exist in probability. Given (i) and the existence of $(X'X/n)^{-1}$,

$$\sqrt{n}(\hat\beta_n - \beta_0) = (X'X/n)^{-1} n^{-1/2} X'\varepsilon.$$

Hence, given (ii),

$$\sqrt{n}(\hat\beta_n - \beta_0) - M_n^{-1} n^{-1/2} X'\varepsilon = \big[(X'X/n)^{-1} - M_n^{-1}\big] V_n^{1/2} V_n^{-1/2} n^{-1/2} X'\varepsilon,$$

or, premultiplying by $D_n^{-1/2}$,

$$D_n^{-1/2}\sqrt{n}(\hat\beta_n - \beta_0) - D_n^{-1/2} M_n^{-1} n^{-1/2} X'\varepsilon = D_n^{-1/2}\big[(X'X/n)^{-1} - M_n^{-1}\big] V_n^{1/2} V_n^{-1/2} n^{-1/2} X'\varepsilon.$$

The desired result will follow by applying the product rule (Lemma 4.6) to the line immediately above and the asymptotic equivalence lemma (Lemma 4.7) to the preceding line. Now $V_n^{-1/2} n^{-1/2} X'\varepsilon \stackrel{A}{\sim} N(0,I)$ by (ii); further, $D_n^{-1/2}[(X'X/n)^{-1} - M_n^{-1}] V_n^{1/2}$ is $o_p(1)$ because $D_n^{-1/2}$ and $V_n^{1/2}$ are $O(1)$ given (ii) and (iii), and $[(X'X/n)^{-1} - M_n^{-1}]$ is $o_p(1)$ by Proposition 2.30 given (iii). Hence, by Lemma 4.5,

$$D_n^{-1/2}\sqrt{n}(\hat\beta_n - \beta_0) - D_n^{-1/2} M_n^{-1} n^{-1/2} X'\varepsilon \stackrel{p}{\to} 0.$$

By Lemma 4.7, the asymptotic distribution of $D_n^{-1/2}\sqrt{n}(\hat\beta_n - \beta_0)$ is the same as that of $D_n^{-1/2} M_n^{-1} n^{-1/2} X'\varepsilon$. We find the asymptotic distribution
of this random variable by applying Corollary 4.24, which immediately yields $D_n^{-1/2} M_n^{-1} n^{-1/2} X'\varepsilon \stackrel{A}{\sim} N(0,I)$. Because (ii), (iii), and (iv) hold, $\hat D_n - D_n \stackrel{p}{\to} 0$ is an immediate consequence of Proposition 2.30. ∎
The structure of this result is very straightforward. Given the linear data generating process, we require only that $(X'X/n)$ and $(X'X/n)^{-1}$ are $O_p(1)$ and that $n^{-1/2} X'\varepsilon$ is asymptotically unit normal after standardizing by the inverse square root of its asymptotic covariance matrix. The asymptotic covariance (dispersion) matrix of $\sqrt{n}(\hat\beta_n - \beta_0)$ is $D_n$, which can be consistently estimated by $\hat D_n$. Note that this result allows the regressors to be stochastic and imposes no restriction on the serial correlation or heteroskedasticity of $\varepsilon_t$, except that needed to ensure that (ii) holds. As we shall see in the next chapter, only mild restrictions need to be imposed in guaranteeing (ii).

In special cases, it may be known that $V_n$ has a special form. For example, when $\varepsilon_t$ is an i.i.d. scalar with $E(\varepsilon_t) = 0$, $E(\varepsilon_t^2) = \sigma_0^2$, and $X_t$ is nonstochastic, then $V_n = \sigma_0^2\, X'X/n$. Finding a consistent estimator for $V_n$ then requires no more than finding a consistent estimator for $\sigma_0^2$. In more general cases considered below it is often possible to write

$$V_n = E(X'\varepsilon\varepsilon'X/n) = E(X'\Omega_n X/n).$$
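As an illustration of Theorem 4.25, the sketch below simulates a hypothetical heteroskedastic regression and forms the sample analogue $\hat V_n = n^{-1}\sum_t X_tX_t'\hat\varepsilon_t^2$ of $V_n = E(X'\varepsilon\varepsilon'X/n)$. The consistency of this estimator is the subject of Chapter 6, so this is a preview under assumed conditions, not a formal construction; the design and seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta0 = np.array([1.0, 2.0])
e = rng.standard_normal(n) * np.sqrt(0.5 + X[:, 1] ** 2)  # heteroskedastic errors
Y = X @ beta0 + e

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta_hat

# Estimate V_n = E(X' ee' X / n) by n^{-1} sum_t X_t X_t' e_t^2, then
# D_n = M_n^{-1} V_n M_n^{-1} by its sample analogue.
M_inv = np.linalg.inv(X.T @ X / n)
V_hat = (X * resid[:, None] ** 2).T @ X / n
D_hat = M_inv @ V_hat @ M_inv
se = np.sqrt(np.diag(D_hat) / n)  # standard errors for beta_hat
print(beta_hat, se)
```

Nothing about the serial independence or the particular variance function is essential; only condition (ii) and the consistency of $\hat V_n$ matter for the asymptotics.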
Finding a consistent estimator for $V_n$ in these cases is made easier by knowledge of the structure of $\Omega_n$. However, even when $\Omega_n$ is unknown, it turns out that consistent estimators for $V_n$ are generally available. The conditions under which $V_n$ can be consistently estimated are treated in Chapter 6.

A result analogous to Theorem 4.25 is available for instrumental variables estimators. Because the proof follows that of Theorem 4.25 very closely, proof of the following result is left as an exercise.

Exercise 4.26 Prove the following result. Given
(i) $Y_t = X_t'\beta_0 + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_0 \in \mathbb{R}^k$;

(ii) $V_n^{-1/2} n^{-1/2} Z'\varepsilon \stackrel{A}{\sim} N(0,I)$, where $V_n \equiv \mathrm{var}(n^{-1/2} Z'\varepsilon)$ is $O(1)$ and uniformly positive definite;
(iii) (a) $Z'X/n - Q_n \stackrel{p}{\to} 0$, where $Q_n \equiv E(Z'X/n)$ is $O(1)$ with uniformly full column rank;

(b) there exists $\hat P_n$ such that $\hat P_n - P_n \stackrel{p}{\to} 0$, where $P_n$ is $O(1)$, symmetric, and uniformly positive definite.
Then $D_n^{-1/2}\sqrt{n}(\bar\beta_n - \beta_0) \stackrel{A}{\sim} N(0,I)$, where

$$D_n \equiv (Q_n'P_nQ_n)^{-1}Q_n'P_nV_nP_nQ_n(Q_n'P_nQ_n)^{-1},$$

and $D_n$ and $D_n^{-1}$ are $O(1)$. Suppose in addition that

(iv) there exists a matrix $\hat V_n$, positive semidefinite and symmetric, such that $\hat V_n - V_n \stackrel{p}{\to} 0$. Then $\hat D_n - D_n \stackrel{p}{\to} 0$, where $\hat D_n$ is $D_n$ with $Z'X/n$, $\hat P_n$, and $\hat V_n$ in place of $Q_n$, $P_n$, and $V_n$.

4.2 Hypothesis Testing
A direct and very important use of the asymptotic normality of a given estimator is in hypothesis testing. Often, hypotheses of interest can be expressed in terms of linear combinations of the parameters as

$$R\beta_0 = r,$$

where $R$ is a given $q\times k$ matrix and $r$ is a given $q\times 1$ vector that, through $R\beta_0 = r$, specify the hypotheses of interest. For example, if the hypothesis is that the elements of $\beta_0$ sum to unity, $R = [1, \ldots, 1]$ and $r = 1$.

Several different approaches can be taken in computing a statistic to test the null hypothesis $R\beta_0 = r$ versus the alternative $R\beta_0 \neq r$. The methods that we consider here involve the use of Wald, Lagrange multiplier, and quasi-likelihood ratio statistics. Although the approaches to forming the test statistics differ, the way that we determine their asymptotic distributions is the same. In each case we exploit an underlying asymptotic normality property to obtain a statistic distributed asymptotically as chi-squared ($\chi^2$). To do this we use the following results.
Lemma 4.27 Let $g: \mathbb{R}^k \to \mathbb{R}^l$ be continuous on $\mathbb{R}^k$ and let $b_n \stackrel{d}{\to} Z$, a $k\times 1$ random vector. Then $g(b_n) \stackrel{d}{\to} g(Z)$.

Proof. See Rao (1973, p. 124). ∎

Corollary 4.28 Let $V_n^{-1/2} b_n \stackrel{A}{\sim} N(0, I_k)$. Then

$$b_n' V_n^{-1} b_n = b_n' V_n^{-1/2} V_n^{-1/2} b_n \stackrel{A}{\sim} \chi_k^2,$$

where $\chi_k^2$ is a chi-squared random variable with $k$ degrees of freedom.
Proof. By hypothesis, $V_n^{-1/2} b_n \stackrel{d}{\to} Z \sim N(0, I_k)$. The function $g(z) = z'z$ is continuous on $\mathbb{R}^k$. Hence $b_n' V_n^{-1} b_n \stackrel{d}{\to} Z'Z \sim \chi_k^2$. ∎

Typically, $V_n$ will be unknown, but there will be a consistent estimator $\hat V_n$ such that $\hat V_n - V_n \stackrel{p}{\to} 0$. To replace $V_n$ in Corollary 4.28 with $\hat V_n$, we use the following result.
Lemma 4.29 Let $g: \mathbb{R}^k \to \mathbb{R}^l$ be continuous on $\mathbb{R}^k$. If $a_n - b_n \stackrel{p}{\to} 0$ and $b_n \stackrel{d}{\to} Z$, then $g(a_n) - g(b_n) \stackrel{p}{\to} 0$ and $g(a_n) \stackrel{d}{\to} g(Z)$.

Proof. Rao (1973, p. 124) proves that $g(a_n) - g(b_n) \stackrel{p}{\to} 0$. That $g(a_n) \stackrel{d}{\to} g(Z)$ follows from Lemmas 4.7 and 4.27. ∎

Now we can prove the result that is the basis for finding the asymptotic distribution of the Wald, Lagrange multiplier, and quasi-likelihood ratio tests.

Theorem 4.30 Let $V_n^{-1/2} b_n \stackrel{A}{\sim} N(0, I_k)$, and suppose there exists $\hat V_n$, positive semidefinite and symmetric, such that $\hat V_n - V_n \stackrel{p}{\to} 0$, where $V_n$ is $O(1)$ and, for all $n$ sufficiently large, $\det(V_n) > \delta > 0$. Then $b_n'\hat V_n^{-1} b_n \stackrel{A}{\sim} \chi_k^2$.
Proof. We apply Lemma 4.29. Consider $\hat V_n^{-1/2} b_n - V_n^{-1/2} b_n$, where $\hat V_n^{-1/2}$ exists in probability for all $n$ sufficiently large. Now

$$\hat V_n^{-1/2} b_n - V_n^{-1/2} b_n = (\hat V_n^{-1/2} V_n^{1/2} - I)\, V_n^{-1/2} b_n.$$

By hypothesis $V_n^{-1/2} b_n \stackrel{A}{\sim} N(0, I_k)$, and $\hat V_n^{-1/2} V_n^{1/2} - I \stackrel{p}{\to} 0$ by Proposition 2.30. It follows from the product rule (Lemma 4.6) that $\hat V_n^{-1/2} b_n - V_n^{-1/2} b_n \stackrel{p}{\to} 0$. Because $V_n^{-1/2} b_n \stackrel{d}{\to} Z \sim N(0, I_k)$, it follows from Lemma 4.29 that $b_n'\hat V_n^{-1} b_n \stackrel{A}{\sim} \chi_k^2$. ∎

The Wald statistic allows the simplest analysis, although it may or may not be the easiest statistic to compute in a given situation. The motivation for the Wald statistic is that when the null hypothesis is correct, $R\hat\beta_n$ should be close to $R\beta_0 = r$, so a value of $R\hat\beta_n - r$ far from zero is evidence against the null hypothesis. To tell how far from zero $R\hat\beta_n - r$ must be before we reject the null hypothesis, we need to determine its asymptotic distribution.
Theorem 4.31 (Wald test) Let the conditions of Theorem 4.25 hold and let $\mathrm{rank}(R) = q < k$. Then under $H_0\colon R\beta_0 = r$,

(i) $\Gamma_n^{-1/2}\sqrt{n}(R\hat\beta_n - r) \stackrel{A}{\sim} N(0,I)$, where $\Gamma_n \equiv R D_n R'$;

(ii) the Wald statistic $\mathcal{W}_n \equiv n(R\hat\beta_n - r)'\hat\Gamma_n^{-1}(R\hat\beta_n - r) \stackrel{A}{\sim} \chi_q^2$, where

$$\hat\Gamma_n \equiv R\hat D_n R' = R(X'X/n)^{-1}\hat V_n(X'X/n)^{-1}R'.$$

Proof. (i) Under $H_0$, $R\hat\beta_n - r = R(\hat\beta_n - \beta_0)$, so

$$\Gamma_n^{-1/2}\sqrt{n}(R\hat\beta_n - r) = \Gamma_n^{-1/2} R D_n^{1/2} D_n^{-1/2}\sqrt{n}(\hat\beta_n - \beta_0).$$

It follows from Corollary 4.24 that $\Gamma_n^{-1/2}\sqrt{n}(R\hat\beta_n - r) \stackrel{A}{\sim} N(0,I)$.

(ii) Because $\hat D_n - D_n \stackrel{p}{\to} 0$ from Theorem 4.25, it follows from Proposition 2.30 that $\hat\Gamma_n - \Gamma_n \stackrel{p}{\to} 0$. Given the result in (i), (ii) follows from Theorem 4.30. ∎
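A simulation sketch of the robust Wald test just proved (the data generating process, sample sizes, and seed are hypothetical choices for illustration). Under $H_0$, rejecting when $\mathcal{W}_n$ exceeds 3.8415, the 95% point of $\chi_1^2$, should occur about 5% of the time:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, rejections = 2000, 2000, 0
R, r = np.array([[0.0, 1.0]]), np.array([1.0])  # H0: second coefficient = 1
for _ in range(reps):
    x = rng.standard_normal(n)
    X = np.column_stack([np.ones(n), x])
    e = rng.standard_normal(n) * np.sqrt(0.5 + x ** 2)  # heteroskedastic errors
    Y = X @ np.array([0.0, 1.0]) + e                    # H0 is true
    b = np.linalg.solve(X.T @ X, X.T @ Y)
    u = Y - X @ b
    M_inv = np.linalg.inv(X.T @ X / n)
    V_hat = (X * u[:, None] ** 2).T @ X / n             # robust estimate of V_n
    D_hat = M_inv @ V_hat @ M_inv                       # estimate of D_n
    G_hat = R @ D_hat @ R.T                             # estimate of Gamma_n
    w = n * (R @ b - r) @ np.linalg.solve(G_hat, R @ b - r)
    rejections += w > 3.8415                            # 95% point of chi^2_1
rate = rejections / reps
print(rate)
```

The empirical size stays near the nominal 5% despite the heteroskedasticity precisely because $\hat\Gamma_n$ is built from the robust $\hat V_n$; the discussion below explains why the classical $\hat\sigma_n^2(X'X/n)$ version would not be valid here.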
This version of the Wald statistic is useful regardless of the presence of heteroskedasticity or serial correlation in the error terms because a consistent estimator ($\hat V_n$) for $V_n$ is used in computing $\hat\Gamma_n$. In the special case when $V_n$ can be consistently estimated by $\hat\sigma_n^2(X'X/n)$, the Wald test has the form

$$\mathcal{W}_n = n(R\hat\beta_n - r)'\big[R(X'X/n)^{-1}R'\big]^{-1}(R\hat\beta_n - r)/\hat\sigma_n^2,$$

which is simply $q$ times the standard F-statistic for testing the hypothesis $R\beta_0 = r$. The validity of the asymptotic $\chi_q^2$ distribution for this statistic depends crucially on the consistency of $\hat V_n = \hat\sigma_n^2(X'X/n)$ for $V_n$; if this $\hat V_n$ is not consistent for $V_n$, the asymptotic distribution of this form for $\mathcal{W}_n$ is not $\chi_q^2$ in general.

The Wald statistic is most convenient in situations in which the restrictions $R\beta_0 = r$ are not easy to impose in estimating $\beta_0$. When these restrictions are easily imposed (say, $R\beta_0 = r$ specifies that the last element of $\beta_0$ is zero), the Lagrange multiplier statistic is more easily computed. The motivation for the Lagrange multiplier statistic is that a constrained least squares estimator can be obtained by solving the problem

$$\min_\beta\ (Y - X\beta)'(Y - X\beta)/n \quad \text{s.t. } R\beta = r,$$
which is equivalent to finding the saddle point of the Lagrangian

$$\mathcal{L} = (Y - X\beta)'(Y - X\beta)/n + (R\beta - r)'\lambda.$$

The Lagrange multipliers $\lambda$ can be thought of as giving the shadow price of the constraint and should therefore be small when the constraint is valid and large otherwise. (See Engle, 1981, for a general discussion.) The Lagrange multiplier test can be thought of as testing the hypothesis that $\lambda = 0$. The first order conditions are

$$\partial\mathcal{L}/\partial\beta = 2(X'X/n)\beta - 2X'Y/n + R'\lambda = 0,$$
$$\partial\mathcal{L}/\partial\lambda = R\beta - r = 0.$$

To solve for the estimate of the Lagrange multiplier, premultiply the first equation by $R(X'X/n)^{-1}$ and set $R\beta = r$. This yields
$$\hat\lambda_n = 2\big(R(X'X/n)^{-1}R'\big)^{-1}(R\hat\beta_n - r), \qquad \ddot\beta_n = \hat\beta_n - (X'X/n)^{-1}R'\hat\lambda_n/2,$$

where $\ddot\beta_n$ is the constrained least squares estimator (which automatically satisfies $R\ddot\beta_n = r$). In this form, $\hat\lambda_n$ is simply a nonsingular transformation of $R\hat\beta_n - r$. This allows the following result to be proved very simply.

Theorem 4.32 (Lagrange multiplier test) Let the conditions of Theorem 4.25 hold and let $\mathrm{rank}(R) = q < k$. Then under $H_0\colon R\beta_0 = r$,

(i) $\Lambda_n^{-1/2}\sqrt{n}\,\hat\lambda_n \stackrel{A}{\sim} N(0,I)$, where $\Lambda_n \equiv 4(RM_n^{-1}R')^{-1}\Gamma_n(RM_n^{-1}R')^{-1}$ and $\Gamma_n$ is as defined in Theorem 4.31;
(ii) the Lagrange multiplier statistic $\mathcal{LM}_n \equiv n\hat\lambda_n'\hat\Lambda_n^{-1}\hat\lambda_n \stackrel{A}{\sim} \chi_q^2$, where

$$\hat\Lambda_n \equiv 4\big(R(X'X/n)^{-1}R'\big)^{-1}R(X'X/n)^{-1}\ddot V_n(X'X/n)^{-1}R'\big(R(X'X/n)^{-1}R'\big)^{-1}$$

and $\ddot V_n$ is computed from the constrained regression such that $\ddot V_n - V_n \stackrel{p}{\to} 0$ under $H_0$.
Proof. (i) Consider the difference

$$\Lambda_n^{-1/2}\sqrt{n}\,\hat\lambda_n - 2\Lambda_n^{-1/2}(RM_n^{-1}R')^{-1}\sqrt{n}(R\hat\beta_n - r) = 2\Lambda_n^{-1/2}\big[(R(X'X/n)^{-1}R')^{-1} - (RM_n^{-1}R')^{-1}\big]\Gamma_n^{1/2}\Gamma_n^{-1/2}\sqrt{n}(R\hat\beta_n - r).$$
From Theorem 4.31, $\Gamma_n^{-1/2}\sqrt{n}(R\hat\beta_n - r) \stackrel{A}{\sim} N(0,I)$. Because $(X'X/n) - M_n \stackrel{p}{\to} 0$, it follows from Proposition 2.30 and the fact that $\Lambda_n^{-1/2}$ and $\Gamma_n^{1/2}$ are $O(1)$ that $\Lambda_n^{-1/2}[(R(X'X/n)^{-1}R')^{-1} - (RM_n^{-1}R')^{-1}]\Gamma_n^{1/2} \stackrel{p}{\to} 0$. Hence, by the product rule (Lemma 4.6), $\Lambda_n^{-1/2}\sqrt{n}\,\hat\lambda_n - 2\Lambda_n^{-1/2}(RM_n^{-1}R')^{-1}\sqrt{n}(R\hat\beta_n - r) \stackrel{p}{\to} 0$. It follows from Lemma 4.7 that $\Lambda_n^{-1/2}\sqrt{n}\,\hat\lambda_n$ has the same asymptotic distribution as $2\Lambda_n^{-1/2}(RM_n^{-1}R')^{-1}\sqrt{n}(R\hat\beta_n - r)$. It follows immediately from Corollary 4.24 that $2\Lambda_n^{-1/2}(RM_n^{-1}R')^{-1}\sqrt{n}(R\hat\beta_n - r) \stackrel{A}{\sim} N(0,I)$; hence $\Lambda_n^{-1/2}\sqrt{n}\,\hat\lambda_n \stackrel{A}{\sim} N(0,I)$.

(ii) Because $\ddot V_n - V_n \stackrel{p}{\to} 0$ by hypothesis and $(X'X/n) - M_n \stackrel{p}{\to} 0$, $\hat\Lambda_n - \Lambda_n \stackrel{p}{\to} 0$ by Proposition 2.30. Given the result in (i), (ii) follows from Theorem 4.30. ∎

Note that the Wald and Lagrange multiplier statistics would be identical if $\ddot V_n$ were used in place of $\hat V_n$. This suggests that the two statistics should be asymptotically equivalent.

Exercise 4.33 Prove that under the conditions of Theorems 4.31 and 4.32, $\mathcal{W}_n - \mathcal{LM}_n \stackrel{p}{\to} 0$.

Although the fact that $\hat\lambda_n$ is a linear combination of $R\hat\beta_n - r$ simplifies the proof of Theorem 4.32, the whole point of using the Lagrange multiplier statistic is to avoid computing $\hat\beta_n$ and to compute only the simpler $\ddot\beta_n$. Computation of $\ddot\beta_n$ is particularly simple when the data are generated as $Y = X_1\beta_1 + X_2\beta_2 + \varepsilon$ and $H_0$ specifies that $\beta_2$ (a $q\times 1$ vector) is zero. Then
$$R = [\,0 : I_q\,], \qquad r = 0,$$

where $0$ is $q\times(k-q)$, $I_q$ is $q\times q$, and $r$ is $q\times 1$; and $\ddot\beta_n = (\ddot\beta_{1n}', 0')'$, where $\ddot\beta_{1n} = (X_1'X_1)^{-1}X_1'Y$.
Exercise 4.34 Define $\ddot\varepsilon \equiv Y - X_1\ddot\beta_{1n}$. Show that under $H_0\colon \beta_2 = 0$,

$$\hat\lambda_n = 2X_2'\big(I - X_1(X_1'X_1)^{-1}X_1'\big)\varepsilon/n = 2X_2'\ddot\varepsilon/n.$$

(Hint: $R\hat\beta_n - r = R(X'X/n)^{-1}X'(Y - X\ddot\beta_n)/n$.)

By applying the particular form of $R$ to the result of Theorem 4.32(ii), we obtain

$$\mathcal{LM}_n = n\hat\lambda_n'\big[(-X_2'X_1(X_1'X_1)^{-1} : I_q)\,\ddot V_n\,(-X_2'X_1(X_1'X_1)^{-1} : I_q)'\big]^{-1}\hat\lambda_n/4.$$
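In the homoskedastic special case developed next, this statistic reduces to $n$ times the uncentered $R^2$ from regressing the constrained residuals on the full regressor matrix. A sketch with a hypothetical design (seed and coefficients are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
X1 = np.column_stack([np.ones(n), rng.standard_normal(n)])
X2 = rng.standard_normal((n, 1))                        # q = 1 candidate regressor
Y = X1 @ np.array([1.0, 0.5]) + rng.standard_normal(n)  # beta_2 = 0 holds

# Constrained regression: Y on X1 alone; keep the residuals.
b1 = np.linalg.solve(X1.T @ X1, X1.T @ Y)
u = Y - X1 @ b1

# Regress the constrained residuals on the full X = (X1, X2);
# LM_n = n * (uncentered R^2) of this auxiliary regression.
X = np.column_stack([X1, X2])
coef = np.linalg.solve(X.T @ X, X.T @ u)
fitted = X @ coef
R2 = (fitted @ fitted) / (u @ u)
LM = n * R2
print(LM)  # compare with 3.8415, the 95% point of chi^2_1
```

Only the two short regressions are needed; the unconstrained estimator is never computed, which is the practical appeal of the Lagrange multiplier approach.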
When $V_n$ can be consistently estimated by $\ddot V_n = \ddot\sigma_n^2(X'X/n)$, where $\ddot\sigma_n^2 \equiv \ddot\varepsilon'\ddot\varepsilon/n$, the $\mathcal{LM}_n$ statistic simplifies even further.

Exercise 4.35 If $\ddot\sigma_n^2(X'X/n) - V_n \stackrel{p}{\to} 0$ and $\beta_2 = 0$, show that $\mathcal{LM}_n = n\,\ddot\varepsilon'X(X'X)^{-1}X'\ddot\varepsilon/(\ddot\varepsilon'\ddot\varepsilon)$, which is $n$ times the simple $R^2$ of the regression of $\ddot\varepsilon$ on $X$.

The result of this exercise implies a very simple procedure for testing $\beta_2 = 0$ when $V_n = \sigma_0^2 M_n$. First, regress $Y$ on $X_1$ and form the constrained residuals $\ddot\varepsilon$. Then regress $\ddot\varepsilon$ on $X$. The product of the sample size $n$ and the simple $R^2$ (i.e., without adjustment for the presence of a constant in the regression) from this regression is the $\mathcal{LM}_n$ test statistic, which has the $\chi_q^2$ distribution asymptotically. As Engle (1981) showed, many interesting diagnostic statistics can be computed in this way.

When the errors $\varepsilon_t$ are scalar i.i.d. $N(0, \sigma_0^2)$ random variables, the OLS estimator is also the maximum likelihood estimator (MLE) because $\hat\beta_n$ solves the problem
$$\max_{\beta,\sigma}\ \mathcal{L}(\beta,\sigma; Y) = \exp\Big[-n\log\sqrt{2\pi} - n\log\sigma - \tfrac{1}{2}\sum_{t=1}^n (Y_t - X_t'\beta)^2/\sigma^2\Big],$$

where $\mathcal{L}(\beta,\sigma; Y)$ is the sample likelihood based on the normality assumption. When $\varepsilon_t$ is not i.i.d. $N(0,\sigma_0^2)$, $\hat\beta_n$ is said to be a quasi-maximum likelihood estimator (QMLE). When $\hat\beta_n$ is the MLE, hypothesis tests can be based on the log-likelihood ratio
$$\mathcal{LR}_n \equiv \log\Big[\max_{\beta,\sigma:\,R\beta = r}\mathcal{L}(\beta,\sigma; Y)\Big/\max_{\beta,\sigma}\mathcal{L}(\beta,\sigma; Y)\Big].$$

It is easy to show that $\ddot\beta_n$ is the constrained OLS estimator and $\ddot\sigma_n^2 = \ddot\varepsilon'\ddot\varepsilon/n$ as before. The likelihood ratio is nonnegative and always less than or equal to 1. Simple algebra yields
$$\mathcal{LR}_n = (n/2)\log(\hat\sigma_n^2/\ddot\sigma_n^2).$$

Because $\ddot\sigma_n^2 = \hat\sigma_n^2 + (\hat\beta_n - \ddot\beta_n)'(X'X/n)(\hat\beta_n - \ddot\beta_n)$ (verify this),

$$\mathcal{LR}_n = -(n/2)\log\big[1 + (\hat\beta_n - \ddot\beta_n)'(X'X/n)(\hat\beta_n - \ddot\beta_n)/\hat\sigma_n^2\big].$$
To find the asymptotic distribution of this statistic, we make use of the mean value theorem of calculus.

Theorem 4.36 (Mean value theorem) Let $s: \mathbb{R}^k \to \mathbb{R}$ be defined on an open convex set $\Theta \subset \mathbb{R}^k$ such that $s$ is continuously differentiable on $\Theta$ with $k\times 1$ gradient $\nabla s$. Then for any points $\theta$ and $\theta_0$ in $\Theta$ there exists $\bar\theta$ on the segment connecting $\theta$ and $\theta_0$ such that $s(\theta) = s(\theta_0) + \nabla s(\bar\theta)'(\theta - \theta_0)$.

Proof. See Bartle (1976, p. 365). ∎

For the present application, we choose $s(\theta) = \log(1+\theta)$. If we also choose $\theta_0 = 0$, we have $s(\theta) = \log(1) + (1/(1+\bar\theta))\theta = \theta/(1+\bar\theta)$, where $\bar\theta$ lies between $\theta$ and zero. Let $\theta_n = (\hat\beta_n - \ddot\beta_n)'(X'X/n)(\hat\beta_n - \ddot\beta_n)/\hat\sigma_n^2$, so that under $H_0$, $|\bar\theta_n| \le |\theta_n| \stackrel{p}{\to} 0$; hence $\bar\theta_n \stackrel{p}{\to} 0$. Applying the mean value theorem now gives

$$-2\mathcal{LR}_n = n\theta_n/(1+\bar\theta_n).$$

Because $(1+\bar\theta_n)^{-1} \stackrel{p}{\to} 1$, it follows from Lemma 4.6 that

$$-2\mathcal{LR}_n - n\theta_n \stackrel{p}{\to} 0,$$

provided the second term has a limiting distribution. Now, substituting $\hat\beta_n - \ddot\beta_n = (X'X/n)^{-1}R'\hat\lambda_n/2$ and the formula for $\hat\lambda_n$,

$$n\theta_n = n(R\hat\beta_n - r)'\big[R(X'X/n)^{-1}R'\big]^{-1}(R\hat\beta_n - r)/\hat\sigma_n^2.$$

Thus

$$-2\mathcal{LR}_n - n(R\hat\beta_n - r)'\big[R(X'X/n)^{-1}R'\big]^{-1}(R\hat\beta_n - r)/\hat\sigma_n^2 \stackrel{p}{\to} 0.$$

This second term is the Wald statistic formed with $\hat V_n = \hat\sigma_n^2(X'X/n)$, so $-2\mathcal{LR}_n$ is asymptotically equivalent to the Wald statistic and has the $\chi_q^2$ distribution asymptotically, provided $\hat\sigma_n^2(X'X/n)$ is a consistent estimator for $V_n$. If this is not true, then $-2\mathcal{LR}_n$ does not in general have the $\chi_q^2$ distribution asymptotically. It does have a limiting distribution, but not a simple one that has been tabulated or is easily computable (see White, 1994, Ch. 6, for further details). Note that it is not violation of the normality assumption per se, but the failure of $V_n$ to equal $\sigma_0^2 M_n$ that results in $-2\mathcal{LR}_n$ not having the $\chi_q^2$ distribution asymptotically.

The formal statement of the result for the $\mathcal{LR}_n$ statistic is the following.

Theorem 4.37 (Likelihood ratio test) Let the conditions of Theorem 4.25 hold, let $\mathrm{rank}(R) = q < k$, and let $\hat\sigma_n^2(X'X/n) - V_n \stackrel{p}{\to} 0$. Then under $H_0\colon R\beta_0 = r$, $-2\mathcal{LR}_n \stackrel{A}{\sim} \chi_q^2$.
Proof. Set $\hat V_n$ in Theorem 4.31 to $\hat V_n = \hat\sigma_n^2(X'X/n)$. Then from the argument preceding the theorem, $-2\mathcal{LR}_n - \mathcal{W}_n \stackrel{p}{\to} 0$. Because $\mathcal{W}_n \stackrel{A}{\sim} \chi_q^2$, it follows from Lemma 4.7 that $-2\mathcal{LR}_n \stackrel{A}{\sim} \chi_q^2$. ∎

The mean value theorem just introduced provides a convenient way to find the asymptotic distribution of statistics used to test nonlinear hypotheses. In general, nonlinear hypotheses can be conveniently represented as

$$H_0\colon s(\beta_0) = 0,$$

where $s: \mathbb{R}^k \to \mathbb{R}^q$ is a continuously differentiable function of $\beta$.
Example 4.38 Suppose $Y = X_1\beta_1 + X_2\beta_2 + X_3\beta_3 + \varepsilon$, where $X_1$, $X_2$, and $X_3$ are $n\times 1$ and $\beta_1$, $\beta_2$, and $\beta_3$ are scalars. Further, suppose we hypothesize that $\beta_3 = \beta_1\beta_2$. Then $s(\beta_0) = \beta_3 - \beta_1\beta_2 = 0$ expresses the null hypothesis.

Just as with linear restrictions, we can construct a Wald test based on the asymptotic distribution of $s(\hat\beta_n)$; we can construct a Lagrange multiplier test based on the Lagrange multipliers derived from minimizing the least squares (or other estimation) objective function subject to the constraint; or we can form a log-likelihood ratio. To illustrate the approach, consider the Wald test based on $s(\hat\beta_n)$. As before, a value of $s(\hat\beta_n)$ far from zero is evidence against $H_0$. To tell how far $s(\hat\beta_n)$ must be from zero to reject $H_0$, we need to determine its asymptotic distribution. This is provided by the next result.

Theorem 4.39 (Wald test) Let the conditions of Theorem 4.25 hold and let $\mathrm{rank}(\nabla s(\beta_0)) = q < k$, where $\nabla s$ is the $k\times q$ gradient matrix of $s$. Then under $H_0\colon s(\beta_0) = 0$,

(i) $\Gamma_n^{-1/2}\sqrt{n}\,s(\hat\beta_n) \stackrel{A}{\sim} N(0,I)$, where $\Gamma_n \equiv \nabla s(\beta_0)' D_n \nabla s(\beta_0)$;

(ii) the Wald statistic $\mathcal{W}_n \equiv n\,s(\hat\beta_n)'\hat\Gamma_n^{-1}s(\hat\beta_n) \stackrel{A}{\sim} \chi_q^2$, where

$$\hat\Gamma_n \equiv \nabla s(\hat\beta_n)'\hat D_n\nabla s(\hat\beta_n) = \nabla s(\hat\beta_n)'(X'X/n)^{-1}\hat V_n(X'X/n)^{-1}\nabla s(\hat\beta_n).$$
Proof. (i) Because $s(\beta)$ is a vector function, we apply the mean value theorem to each element $s_i(\beta)$, $i = 1, \ldots, q$, to get

$$s_i(\hat\beta_n) = s_i(\beta_0) + \nabla s_i(\bar\beta_n^{(i)})'(\hat\beta_n - \beta_0),$$

where $\bar\beta_n^{(i)}$ is a $k\times 1$ vector lying on the segment connecting $\hat\beta_n$ and $\beta_0$. The superscript $(i)$ reflects the fact that the mean value may be different for each element $s_i(\beta)$ of $s(\beta)$. Under $H_0$, $s_i(\beta_0) = 0$, $i = 1, \ldots, q$, so

$$s_i(\hat\beta_n) = \nabla s_i(\bar\beta_n^{(i)})'(\hat\beta_n - \beta_0).$$

This suggests considering the difference

$$\sqrt{n}\,s_i(\hat\beta_n) - \nabla s_i(\beta_0)'\sqrt{n}(\hat\beta_n - \beta_0) = \big(\nabla s_i(\bar\beta_n^{(i)}) - \nabla s_i(\beta_0)\big)'D_n^{1/2}D_n^{-1/2}\sqrt{n}(\hat\beta_n - \beta_0).$$

By Theorem 4.25, $D_n^{-1/2}\sqrt{n}(\hat\beta_n - \beta_0) \stackrel{A}{\sim} N(0,I)$. Because $\hat\beta_n \stackrel{p}{\to} \beta_0$, it follows that $\bar\beta_n^{(i)} \stackrel{p}{\to} \beta_0$, so $\nabla s_i(\bar\beta_n^{(i)}) - \nabla s_i(\beta_0) \stackrel{p}{\to} 0$ by Proposition 2.27. Because $D_n^{1/2}$ is $O(1)$, we have $(\nabla s_i(\bar\beta_n^{(i)}) - \nabla s_i(\beta_0))'D_n^{1/2} \stackrel{p}{\to} 0$. It follows from Lemma 4.6 that

$$\sqrt{n}\,s_i(\hat\beta_n) - \nabla s_i(\beta_0)'\sqrt{n}(\hat\beta_n - \beta_0) \stackrel{p}{\to} 0.$$

In vector form this becomes

$$\sqrt{n}\,s(\hat\beta_n) - \nabla s(\beta_0)'\sqrt{n}(\hat\beta_n - \beta_0) \stackrel{p}{\to} 0,$$

and because $\Gamma_n^{-1/2}$ is $O(1)$,

$$\Gamma_n^{-1/2}\sqrt{n}\,s(\hat\beta_n) - \Gamma_n^{-1/2}\nabla s(\beta_0)'\sqrt{n}(\hat\beta_n - \beta_0) \stackrel{p}{\to} 0.$$

Corollary 4.24 immediately yields

$$\Gamma_n^{-1/2}\nabla s(\beta_0)'\sqrt{n}(\hat\beta_n - \beta_0) \stackrel{A}{\sim} N(0,I),$$

so by Lemma 4.7, $\Gamma_n^{-1/2}\sqrt{n}\,s(\hat\beta_n) \stackrel{A}{\sim} N(0,I)$.

(ii) Because $\hat D_n - D_n \stackrel{p}{\to} 0$ from Theorem 4.25 and $\nabla s(\hat\beta_n) - \nabla s(\beta_0) \stackrel{p}{\to} 0$ by Proposition 2.27, $\hat\Gamma_n - \Gamma_n \stackrel{p}{\to} 0$ by Proposition 2.30. Given the result in (i), (ii) follows from Theorem 4.30. ∎
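A sketch of this delta-method Wald test for Example 4.38's restriction $s(\beta) = \beta_3 - \beta_1\beta_2$ (the data generating process and seed are hypothetical, chosen so that the restriction holds):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000
beta0 = np.array([0.5, 2.0, 1.0])       # beta_3 = beta_1 * beta_2 holds
X = rng.standard_normal((n, 3))
Y = X @ beta0 + rng.standard_normal(n)

b = np.linalg.solve(X.T @ X, X.T @ Y)
u = Y - X @ b
M_inv = np.linalg.inv(X.T @ X / n)
V_hat = (X * u[:, None] ** 2).T @ X / n
D_hat = M_inv @ V_hat @ M_inv

s_hat = b[2] - b[0] * b[1]              # s(beta_hat)
grad = np.array([-b[1], -b[0], 1.0])    # gradient of s evaluated at beta_hat
G_hat = grad @ D_hat @ grad             # Gamma_n estimate (q = 1, a scalar)
W = n * s_hat ** 2 / G_hat
print(W)  # approximately chi^2_1 under H0
```

Note that the gradient is evaluated at $\hat\beta_n$, exactly as in part (ii) of the theorem, so no knowledge of $\beta_0$ is needed to compute $\hat\Gamma_n$.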
Note the similarity of this result to Theorem 4.31, which gives the Wald test for the linear hypothesis $R\beta_0 = r$. In the present context, $s(\beta_0)$ plays the role of $R\beta_0 - r$, whereas $\nabla s(\beta_0)'$ plays the role of $R$ in computing the covariance matrix.
Exercise 4.40 Write down the Wald statistic for testing the hypothesis of Example 4.38.

Exercise 4.41 Give the Lagrange multiplier statistic for testing the hypothesis $H_0\colon s(\beta_0) = 0$ versus $H_1\colon s(\beta_0) \neq 0$, and derive its limiting distribution under the conditions of Theorem 4.25.

Exercise 4.42 Give the Wald and Lagrange multiplier statistics for testing the hypotheses $R\beta_0 = r$ and $s(\beta_0) = 0$ on the basis of the IV estimator $\bar\beta_n$, and derive their limiting distributions under the conditions of Exercise 4.26.
4.3 Asymptotic Efficiency
Given a class of estimators (e.g., the class of instrumental variables estimators), it is desirable to choose that member of the class that has the smallest asymptotic covariance matrix (assuming that this member exists and can be computed). The reason for this is that such estimators are obviously more precise and in general allow construction of more powerful test statistics. In what follows, we shall abuse notation slightly and write $\mathrm{avar}(\bar\beta_n)$ instead of $\mathrm{avar}(\sqrt{n}(\bar\beta_n - \beta_0))$. A definition of asymptotic efficiency adequate for our purposes here is the following.
Definition 4.43 Let $\mathcal{P}$ be a set of data generating processes such that each data generating process $P^o \in \mathcal{P}$ has a corresponding coefficient vector $\beta^o \in \mathbb{R}^k$, and let $\mathcal{E}$ be a class of estimators $\{\bar\beta_n\}$ such that for each $P^o \in \mathcal{P}$, $(D_n^o)^{-1/2}\sqrt{n}(\bar\beta_n - \beta^o) \stackrel{A}{\sim} N(0,I)$ for $\{\mathrm{avar}^o(\bar\beta_n) \equiv D_n^o\}$ nonstochastic, $O(1)$, and uniformly nonsingular. Then $\{\beta_n^*\} \in \mathcal{E}$ is asymptotically efficient relative to $\{\bar\beta_n\} \in \mathcal{E}$ for $\mathcal{P}$ if for every $P^o \in \mathcal{P}$ the matrix $\mathrm{avar}^o(\bar\beta_n) - \mathrm{avar}^o(\beta_n^*)$ is positive semidefinite for all $n$ sufficiently large. The estimator $\{\beta_n^*\} \in \mathcal{E}$ is asymptotically efficient in $\mathcal{E}$ for $\mathcal{P}$ if it is asymptotically efficient relative to every other member of its class, $\mathcal{E}$, for $\mathcal{P}$.

We write $\beta^o$ instead of $\beta_0$ to emphasize that $\beta^o$ corresponds to (at least) one possible data generating process ($P^o$) among many ($\mathcal{P}$).
The class of estimators we consider is the class $\mathcal{E}$ of instrumental variables estimators

$$\bar\beta_n = (X'ZP_nZ'X)^{-1}X'ZP_nZ'Y,$$

where different members of the class are defined by the different possible choices for $P_n$ and $Z$. For any data generating process $P^o$ satisfying the conditions of Exercise 4.26, we have that the asymptotic covariance matrix of $\bar\beta_n$ is

$$D_n = (Q_n'P_nQ_n)^{-1}Q_n'P_nV_nP_nQ_n(Q_n'P_nQ_n)^{-1}.$$

We leave the dependence of $D_n$ (and other quantities) on $P^o$ implicit for now and begin our analysis by considering how to choose $P_n$ to make $D_n$ as small as possible. Until now, we have let $P_n$ be any positive definite matrix. It turns out, however, that by choosing $P_n = V_n^{-1}$, one obtains an asymptotically efficient estimator for the class of IV estimators with given instrumental variables $Z$. To prove this, we make use of the following proposition.

Proposition 4.44 Let $A$ and $B$ be positive definite matrices of order $k$. Then $A - B$ is positive semidefinite if and only if $B^{-1} - A^{-1}$ is positive semidefinite.

Proof. This follows from Goldberger (1964, Theorem 1.7.21, p. 38). ∎

This result is useful because in the cases of interest to us it will often be much easier to determine whether $B^{-1} - A^{-1}$ is positive semidefinite than to examine the positive semidefiniteness of $A - B$ directly.

Proposition 4.45 Given instrumental variables $Z$, let $\mathcal{P}$ be the collection of probability measures satisfying the conditions of Exercise 4.26. Then the choice $P_n = V_n^{-1}$ gives the IV estimator

$$\beta_n^* = (X'ZV_n^{-1}Z'X)^{-1}X'ZV_n^{-1}Z'Y,$$

which is asymptotically efficient for $\mathcal{P}$ within the class $\mathcal{E}$ of instrumental variables estimators of the form

$$\bar\beta_n = (X'ZP_nZ'X)^{-1}X'ZP_nZ'Y.$$

Proof. From Exercise 4.26, we have

$$\mathrm{avar}(\bar\beta_n) = (Q_n'P_nQ_n)^{-1}Q_n'P_nV_nP_nQ_n(Q_n'P_nQ_n)^{-1}, \qquad \mathrm{avar}(\beta_n^*) = (Q_n'V_n^{-1}Q_n)^{-1}.$$
From Proposition 4.44, $\mathrm{avar}(\bar\beta_n) - \mathrm{avar}(\beta_n^*)$ is p.s.d. if and only if $(\mathrm{avar}(\beta_n^*))^{-1} - (\mathrm{avar}(\bar\beta_n))^{-1}$ is p.s.d. Now for all $n$ sufficiently large,

$$(\mathrm{avar}(\beta_n^*))^{-1} - (\mathrm{avar}(\bar\beta_n))^{-1} = Q_n'V_n^{-1}Q_n - Q_n'P_nQ_n(Q_n'P_nV_nP_nQ_n)^{-1}Q_n'P_nQ_n$$
$$= Q_n'V_n^{-1/2}\big(I - V_n^{1/2}P_nQ_n(Q_n'P_nV_n^{1/2}V_n^{1/2}P_nQ_n)^{-1}Q_n'P_nV_n^{1/2}\big)V_n^{-1/2}Q_n = Q_n'V_n^{-1/2}\big(I - G_n(G_n'G_n)^{-1}G_n'\big)V_n^{-1/2}Q_n,$$

where $G_n \equiv V_n^{1/2}P_nQ_n$. This is a quadratic form in an idempotent matrix and is therefore p.s.d. As this holds for all $P^o$ in $\mathcal{P}$, the result follows. ∎

Hansen (1982) considers estimators for coefficients $\beta_0$ implicitly defined by moment conditions of the form $E(g(X_t, Y_t, Z_t, \beta_0)) = 0$, where $g$ is an $l\times 1$ vector-valued function. Analogous to our construction of the IV estimator, Hansen's estimator is constructed by attempting to make the sample moment

$$n^{-1}\sum_{t=1}^n g(X_t, Y_t, Z_t, \beta)$$

as close to zero as possible by solving the problem

$$\min_\beta\ \Big[n^{-1}\sum_{t=1}^n g(X_t, Y_t, Z_t, \beta)\Big]'\hat P_n\Big[n^{-1}\sum_{t=1}^n g(X_t, Y_t, Z_t, \beta)\Big].$$
as close to zero as possible by solving the problem
Hansen establishes that the resulting "method of moments" estimator is consistent and asymptotically normal and that the choice Pn = V;;-l, P Yn-Yn - > 0, where A
n
y
n
2 = var(n- / 1
L g(Xt, Y t , Zt, 130)) t=l
delivers the asymptotically efficient estimator in the class of method of moment estimators. Hansen calls the method of moments estimator with Pn = V~l the generalized method of moments (GMM) estimator. Thus, the estimator 13~ of Proposition 4.45 is the GMM estimator for the case in which
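The linear case is easy to sketch numerically. The design below is hypothetical: a regressor made endogenous by a shared error component, with two valid instruments; the IV estimator with $P_n = (Z'Z/n)^{-1}$ (two-stage least squares) recovers the coefficient that least squares cannot:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
# Endogenous regressor: e and x share the component v; z1, z2 are
# valid instruments (correlated with x, independent of e).
z = rng.standard_normal((n, 2))
v = rng.standard_normal(n)
e = rng.standard_normal(n) + v
x = z @ np.array([1.0, 0.5]) + v
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
Y = X @ np.array([1.0, 2.0]) + e

b_ols = np.linalg.solve(X.T @ X, X.T @ Y)   # inconsistent here

# Linear GMM / IV with P_n = (Z'Z/n)^{-1}: two-stage least squares.
Pn = np.linalg.inv(Z.T @ Z / n)
A = X.T @ Z @ Pn @ Z.T @ X
b_2sls = np.linalg.solve(A, X.T @ Z @ Pn @ Z.T @ Y)
print(b_ols[1], b_2sls[1])
```

The least squares slope is biased upward by the correlation between $x$ and $\varepsilon$, while the IV slope settles near the true value 2.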
We call $\beta_n^*$ a "linear" GMM estimator because the defining moment condition $E(Z_t(Y_t - X_t'\beta_0)) = 0$ is linear in $\beta_0$.

Exercise 4.46 Given instrumental variables $X$, suppose that

$$V_n^{-1/2}n^{-1/2}X'\varepsilon \stackrel{A}{\sim} N(0,I),$$

where $V_n = \sigma_0^2 M_n$. Show that the asymptotically efficient IV estimator is the least squares estimator $\hat\beta_n$, according to Proposition 4.45.

Exercise 4.47 Given instrumental variables $Z$, suppose that

$$V_n^{-1/2}n^{-1/2}Z'\varepsilon \stackrel{A}{\sim} N(0,I),$$

where $V_n = \sigma_0^2 L_n$ and $L_n \equiv E(Z'Z/n)$. Show that the asymptotically efficient IV estimator is the two-stage least squares estimator

$$\hat\beta_{2SLS} = (X'Z(Z'Z)^{-1}Z'X)^{-1}X'Z(Z'Z)^{-1}Z'Y,$$
according to Proposition 4.45.

Note that the value of $\sigma_0^2$ plays no role in either Exercise 4.46 or Exercise 4.47 as long as it is finite. In what follows, we shall simply ignore $\sigma_0^2$ and proceed as if $\sigma_0^2 = 1$.

If, instead of testing the restrictions $s(\beta_0) = 0$ as considered in the previous section, we believe or know these restrictions to be true (because, for example, they are dictated by economic theory), then, as we discuss next, we can further improve asymptotic efficiency by imposing these restrictions. (Of course, for this to really work, the restrictions must in fact be true.) Thus, suppose we are given constraints $s(\beta_0) = 0$, where $s: \mathbb{R}^k \to \mathbb{R}^q$ is a known continuously differentiable function such that $\mathrm{rank}(\nabla s(\beta_0)) = q$ and $\nabla s(\beta_0)$ ($k\times q$) is finite. The constrained instrumental variables estimator can be found as the solution to the problem

$$\min_\beta\ (Y - X\beta)'ZP_nZ'(Y - X\beta) \quad \text{s.t. } s(\beta) = 0,$$

which is equivalent to finding the saddle point of the Lagrangian
$$\mathcal{L} = (Y - X\beta)'ZP_nZ'(Y - X\beta) + s(\beta)'\lambda.$$

The first-order conditions are

$$\partial\mathcal{L}/\partial\beta = 2(X'ZP_nZ'X)\beta - 2X'ZP_nZ'Y + \nabla s(\beta)\lambda = 0,$$
$$\partial\mathcal{L}/\partial\lambda = s(\beta) = 0.$$
Setting $\bar\beta_n = (X'ZP_nZ'X)^{-1}X'ZP_nZ'Y$ and taking a mean value expansion of $s(\beta)$ around $\bar\beta_n$ yields the equations

$$\partial\mathcal{L}/\partial\beta = 2(X'ZP_nZ'X)(\beta - \bar\beta_n) + \nabla s(\beta)\lambda = 0,$$
$$\partial\mathcal{L}/\partial\lambda = s(\bar\beta_n) + \bar\nabla s'(\beta - \bar\beta_n) = 0,$$

where $\bar\nabla s'$ is the $q\times k$ Jacobian matrix with $i$th row evaluated at a mean value $\bar\beta_n^{(i)}$. To solve for $\lambda$ in the first equation, premultiply by $\bar\nabla s'(X'ZP_nZ'X)^{-1}$ to get

$$2\bar\nabla s'(\beta - \bar\beta_n) + \bar\nabla s'(X'ZP_nZ'X)^{-1}\nabla s(\beta)\lambda = 0,$$

substitute $-s(\bar\beta_n) = \bar\nabla s'(\beta - \bar\beta_n)$, and invert $\bar\nabla s'(X'ZP_nZ'X)^{-1}\nabla s(\beta)$ to obtain

$$\lambda = 2\big[\bar\nabla s'(X'ZP_nZ'X)^{-1}\nabla s(\beta)\big]^{-1}s(\bar\beta_n).$$

The expression for $\partial\mathcal{L}/\partial\beta$ above yields

$$\beta = \bar\beta_n - (X'ZP_nZ'X)^{-1}\nabla s(\beta)\lambda/2,$$

so we obtain the solution for $\beta$ by substituting for $\lambda$:

$$\beta = \bar\beta_n - (X'ZP_nZ'X)^{-1}\nabla s(\beta)\big[\bar\nabla s'(X'ZP_nZ'X)^{-1}\nabla s(\beta)\big]^{-1}s(\bar\beta_n).$$

The difficulty with this solution is that it is not in closed form, because the unknown $\beta$ appears on both sides of the equation. Further, appearing in this expression is $\bar\nabla s$, which has $q$ rows each of which depends on a mean value lying between $\beta$ and $\bar\beta_n$. Nevertheless, a computationally practical and asymptotically equivalent result can be obtained by replacing $\bar\nabla s$ and $\nabla s(\beta)$ by $\nabla s(\bar\beta_n)$ on the right-hand side of the expression above, which yields

$$\beta_n^* = \bar\beta_n - (X'ZP_nZ'X)^{-1}\nabla s(\bar\beta_n)\big[\nabla s(\bar\beta_n)'(X'ZP_nZ'X)^{-1}\nabla s(\bar\beta_n)\big]^{-1}s(\bar\beta_n).$$

This gives us a convenient way of computing a constrained IV estimator. First we compute the unconstrained estimator, and then we "impose" the constraints by subtracting a "correction factor"

$$(X'ZP_nZ'X)^{-1}\nabla s(\bar\beta_n)\big[\nabla s(\bar\beta_n)'(X'ZP_nZ'X)^{-1}\nabla s(\bar\beta_n)\big]^{-1}s(\bar\beta_n)$$
from the unconstrained estimator. We say "impose" because $\beta_n^*$ will not satisfy the constraints exactly for any finite $n$. However, an estimator that does satisfy the constraints to any desired degree of accuracy can be obtained by iterating the procedure just described, that is, by replacing $\bar\beta_n$ by $\beta_n^*$ in the formula above to get a second round estimator, say $\beta_n^{**}$. This process could continue until the change in the resulting estimator was sufficiently small. Nevertheless, this iteration process has no effect on the asymptotic covariance matrix of the resulting estimator.
Exercise 4.48 Define

$$\beta_n^{**} \equiv \beta_n^* - (X'ZP_nZ'X)^{-1}\nabla s(\beta_n^*)\big[\nabla s(\beta_n^*)'(X'ZP_nZ'X)^{-1}\nabla s(\beta_n^*)\big]^{-1}s(\beta_n^*).$$

Show that under the conditions of Exercise 4.26 and the conditions on $s$, $\sqrt{n}(\beta_n^{**} - \beta_n^*) \stackrel{p}{\to} 0$, so that $\sqrt{n}(\beta_n^* - \beta_0)$ has the same asymptotic distribution as $\sqrt{n}(\beta_n^{**} - \beta_0)$. (Hint: Show that $\sqrt{n}\,s(\beta_n^*) \stackrel{p}{\to} 0$.)
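A sketch of the correction-factor computation, specialized to $Z = X$ and $P_n = I$ (so the unconstrained estimator reduces to least squares) and to a linear constraint, for which a single step already satisfies the constraint exactly; the design and coefficients are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1000
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
Y = X @ np.array([1.0, 0.7, 0.3]) + rng.standard_normal(n)

# Unconstrained estimator (Z = X, P_n = I collapses IV to least squares).
b = np.linalg.solve(X.T @ X, X.T @ Y)

# Constraint s(beta) = beta_2 + beta_3 - 1 = 0, with k x q gradient grad_s.
def s(beta):
    return np.array([beta[1] + beta[2] - 1.0])

grad_s = np.array([[0.0], [1.0], [1.0]])

# One-step constrained estimator: unconstrained minus the correction factor.
A_inv = np.linalg.inv(X.T @ X)                     # (X'Z P_n Z'X)^{-1} here
middle = np.linalg.inv(grad_s.T @ A_inv @ grad_s)  # q x q inverse
b_star = b - A_inv @ grad_s @ middle @ s(b)
print(s(b_star))  # zero up to rounding, because s is linear here
```

With a nonlinear $s$, the same step leaves a small residual in the constraint, and iterating it as in Exercise 4.48 drives that residual toward zero without changing the asymptotic covariance matrix.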
Thus, in considering the effect of imposing the restriction $s(\beta_0) = 0$, we can just compare the asymptotic behavior of $\beta_n^*$ to $\bar\beta_n$. As we saw in Proposition 4.45, the asymptotically efficient IV estimator takes $P_n = V_n^{-1}$. We therefore consider the effect of imposing constraints on the IV estimator with $P_n = V_n^{-1}$.
Theorem 4.49 Suppose the conditions of Exercise 4.26 hold for $P_n = V_n^{-1}$ and that $s: \mathbb{R}^k \to \mathbb{R}^q$ is a continuously differentiable function such that $s(\beta_0) = 0$, $\nabla s(\beta_0)$ is finite, and $\mathrm{rank}(\nabla s(\beta_0)) = q$. Define $\bar\beta_n = (X'ZV_n^{-1}Z'X)^{-1}X'ZV_n^{-1}Z'Y$ and define $\beta_n^*$ as the constrained IV estimator above with $P_n = V_n^{-1}$. Then

$$\mathrm{avar}(\bar\beta_n) - \mathrm{avar}(\beta_n^*) = \mathrm{avar}(\bar\beta_n)\nabla s(\beta_0)\big[\nabla s(\beta_0)'\,\mathrm{avar}(\bar\beta_n)\nabla s(\beta_0)\big]^{-1}\nabla s(\beta_0)'\,\mathrm{avar}(\bar\beta_n),$$

which is a positive semidefinite matrix.
Proof. From Exercise 4.26 it follows that $\mathrm{avar}(\bar\beta_n) = (Q_n'V_n^{-1}Q_n)^{-1}$. Taking a mean value expansion of $s(\bar\beta_n)$ around $\beta_0$ gives $s(\bar\beta_n) = s(\beta_0) + \bar\nabla s'(\bar\beta_n - \beta_0)$, and because $s(\beta_0) = 0$, we have $s(\bar\beta_n) = \bar\nabla s'(\bar\beta_n - \beta_0)$. Substituting this in the formula for $\beta_n^*$ allows us to write
$$\sqrt{n}(\beta_n^* - \beta_0) = \hat A_n\sqrt{n}(\bar\beta_n - \beta_0),$$

where

$$\hat A_n \equiv I - (X'ZV_n^{-1}Z'X)^{-1}\nabla s(\bar\beta_n)\big[\nabla s(\bar\beta_n)'(X'ZV_n^{-1}Z'X)^{-1}\nabla s(\bar\beta_n)\big]^{-1}\bar\nabla s'.$$
Vri({3~ - (30) - AnVri(,Bn - (30) = (An - An)Vri(,Bn - (30) ~ 0, because vn(,Bn - (30) is Op(l) as a consequence of Exercise 4.26. From the asymptotic equivalence Lemma 4.7 it follows that vn({3~ - (30) has the same asymptotic distribution as Anvn(,Bn - (30). It follows from Lemma 4.23 that vn({3~ - (30) is asymptotically normal with mean zero and
avar({3~) =
rn=
Anavar(,Bn)A~.
Straightforward algebra yields avar({3~)
-
-
avar({3n) - avar({3n) V's({3 0) x [V's({3 0)' avar(,Bn)V's({3 0) r
1V's({3 0)' avar(,Bn),
and the result follows immediately. ∎

This result guarantees that imposing correct a priori restrictions leads to an efficiency improvement over the efficient IV estimator that does not impose these restrictions. Interestingly, imposing the restrictions using the formula for $\beta_n^*$ with an inefficient estimator $\bar\beta_n$ for given instrumental variables may or may not lead to efficiency gains relative to $\bar\beta_n$.

A feature of this result worth noting is that the asymptotic distribution of the constrained estimator $\beta_n^*$ is concentrated in a $(k-q)$-dimensional subspace of $\mathbb{R}^k$, so that $\mathrm{avar}(\beta_n^*)$ will not be nonsingular. Instead, it has rank $k - q$. In particular, observe that
$$\sqrt{n}(\ddot\beta_n - \beta_0) = A_n\sqrt{n}(\hat\beta_n - \beta_0).$$

But

$$\nabla s(\beta_0)'A_n = \nabla s(\beta_0)'\big(I - \mathrm{avar}(\hat\beta_n)\,\nabla s(\beta_0)\,[\nabla s(\beta_0)'\,\mathrm{avar}(\hat\beta_n)\,\nabla s(\beta_0)]^{-1}\,\nabla s(\beta_0)'\big) = \nabla s(\beta_0)' - \nabla s(\beta_0)' = 0.$$
4. Asymptotic Normality
Consequently, we have

$$\nabla s(\beta_0)'\sqrt{n}(\ddot\beta_n - \beta_0) = \nabla s(\beta_0)'A_n\sqrt{n}(\hat\beta_n - \beta_0) = 0.$$
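The algebra behind Theorem 4.49 and the rank deficiency just noted can be checked numerically. The following Python sketch (not from the text; the matrix sizes and the random stand-ins for $\mathrm{avar}(\hat\beta_n)$ and $\nabla s(\beta_0)$ are arbitrary illustrative choices) verifies that the efficiency gain is positive semidefinite, that the constrained asymptotic covariance has rank $k - q$, and that $\nabla s(\beta_0)'$ annihilates it:

```python
import numpy as np

rng = np.random.default_rng(0)
k, q = 4, 2

# Random stand-ins: V plays the role of avar(beta_hat) (k x k, p.d.),
# R plays the role of grad s(beta_0) (k x q, full column rank).
M = rng.normal(size=(k, k))
V = M @ M.T + k * np.eye(k)
R = rng.normal(size=(k, q))

# Theorem 4.49: avar(beta_hat) - avar(beta_constrained)
#             = V R (R' V R)^{-1} R' V
gain = V @ R @ np.linalg.inv(R.T @ V @ R) @ R.T @ V
avar_constrained = V - gain

eigs = np.linalg.eigvalsh(gain)                    # all >= 0: gain is p.s.d.
rank = np.linalg.matrix_rank(avar_constrained, tol=1e-8)
annihilated = R.T @ avar_constrained @ R           # = 0: grad s' kills avar
print(eigs.min(), rank)
```

The rank printed is $k - q = 2$, matching the claim that the constrained estimator's limiting distribution lives in a $(k - q)$-dimensional subspace.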
An alternative way to improve the asymptotic efficiency of the IV estimator is to include additional valid instrumental variables. To establish this, we make use of the formula for the inverse of a partitioned matrix, which we now state for convenience.

Proposition 4.50 Define the $k \times k$ nonsingular symmetric matrix

$$A = \begin{bmatrix} B & C' \\ C & D \end{bmatrix},$$

where $B$ is $k_1 \times k_1$, $C$ is $k_2 \times k_1$ and $D$ is $k_2 \times k_2$. Then, defining $E \equiv D - CB^{-1}C'$,

$$A^{-1} = \begin{bmatrix} B^{-1} + B^{-1}C'E^{-1}CB^{-1} & -B^{-1}C'E^{-1} \\ -E^{-1}CB^{-1} & E^{-1} \end{bmatrix}.$$
Proof. See Goldberger (1964, p. 27). $\blacksquare$

Proposition 4.51 Partition $Z$ as $Z = (Z_1, Z_2)$ and let $\mathcal{P}$ be the collection of all probability measures such that the conditions of Exercise 4.26 hold for both $Z_1$ and $Z_2$. Define $V_{1n} = E(Z_1'\varepsilon\varepsilon'Z_1/n)$, and the estimators

$$\hat\beta_n = (X'Z_1\hat V_{1n}^{-1}Z_1'X)^{-1}X'Z_1\hat V_{1n}^{-1}Z_1'Y,$$
$$\ddot\beta_n = (X'Z\hat V_n^{-1}Z'X)^{-1}X'Z\hat V_n^{-1}Z'Y.$$

Then for each $P^o$ in $\mathcal{P}$, $\mathrm{avar}(\hat\beta_n) - \mathrm{avar}(\ddot\beta_n)$ is a positive semidefinite matrix for all $n$ sufficiently large.
Proof. Partition $Q_n$ as $Q_n' = (Q_{1n}', Q_{2n}')$, where $Q_{1n} = E(Z_1'X/n)$, $Q_{2n} = E(Z_2'X/n)$, and partition $V_n$ as

$$V_n = \begin{bmatrix} V_{1n} & V_{12n} \\ V_{21n} & V_{2n} \end{bmatrix}.$$

The partitioned inverse formula gives

$$V_n^{-1} = \begin{bmatrix} V_{1n}^{-1} + V_{1n}^{-1}V_{12n}E_n^{-1}V_{21n}V_{1n}^{-1} & -V_{1n}^{-1}V_{12n}E_n^{-1} \\ -E_n^{-1}V_{21n}V_{1n}^{-1} & E_n^{-1} \end{bmatrix},$$
where $E_n = V_{2n} - V_{21n}V_{1n}^{-1}V_{12n}$. From Exercise 4.26, we have

$$\mathrm{avar}(\hat\beta_n) = (Q_{1n}'V_{1n}^{-1}Q_{1n})^{-1}, \qquad \mathrm{avar}(\ddot\beta_n) = (Q_n'V_n^{-1}Q_n)^{-1}.$$
We apply Proposition 4.44 and consider

$$\begin{aligned}
(\mathrm{avar}(\ddot\beta_n))^{-1} - (\mathrm{avar}(\hat\beta_n))^{-1}
&= Q_n'V_n^{-1}Q_n - Q_{1n}'V_{1n}^{-1}Q_{1n} \\
&= [Q_{1n}', Q_{2n}']\,V_n^{-1}\,[Q_{1n}', Q_{2n}']' - Q_{1n}'V_{1n}^{-1}Q_{1n} \\
&= Q_{1n}'\big(V_{1n}^{-1} + V_{1n}^{-1}V_{12n}E_n^{-1}V_{21n}V_{1n}^{-1}\big)Q_{1n} \\
&\qquad - Q_{2n}'E_n^{-1}V_{21n}V_{1n}^{-1}Q_{1n} - Q_{1n}'V_{1n}^{-1}V_{12n}E_n^{-1}Q_{2n} \\
&\qquad + Q_{2n}'E_n^{-1}Q_{2n} - Q_{1n}'V_{1n}^{-1}Q_{1n} \\
&= \big(Q_{1n}'V_{1n}^{-1}V_{12n} - Q_{2n}'\big)\,E_n^{-1}\,\big(V_{21n}V_{1n}^{-1}Q_{1n} - Q_{2n}\big).
\end{aligned}$$
Because $E_n^{-1}$ is a symmetric positive definite matrix (why?), we can write $E_n^{-1} = E_n^{-1/2}E_n^{-1/2}$, so that

$$(\mathrm{avar}(\ddot\beta_n))^{-1} - (\mathrm{avar}(\hat\beta_n))^{-1} = \big(Q_{1n}'V_{1n}^{-1}V_{12n} - Q_{2n}'\big)\,E_n^{-1/2}E_n^{-1/2}\,\big(V_{21n}V_{1n}^{-1}Q_{1n} - Q_{2n}\big).$$

Because this is the product of a matrix and its transpose, we immediately have that $(\mathrm{avar}(\ddot\beta_n))^{-1} - (\mathrm{avar}(\hat\beta_n))^{-1}$ is p.s.d. As this holds for all $P^o$ in $\mathcal{P}$, the result follows, because by Proposition 4.44 this implies that $\mathrm{avar}(\hat\beta_n) - \mathrm{avar}(\ddot\beta_n)$ is p.s.d. $\blacksquare$

This result states essentially that the asymptotic precision of the IV estimator cannot be worsened by including additional instrumental variables. We can be more specific, however, and specify situations in which $\mathrm{avar}(\ddot\beta_n) = \mathrm{avar}(\hat\beta_n)$, so that nothing is gained by adding an extra instrumental variable.

Proposition 4.52 Let the conditions of Proposition 4.51 hold. Then we have $\mathrm{avar}(\hat\beta_n) = \mathrm{avar}(\ddot\beta_n)$ if and only if

$$E(X'Z_1/n)\,E(Z_1'\varepsilon\varepsilon'Z_1/n)^{-1}\,E(Z_1'\varepsilon\varepsilon'Z_2/n) - E(X'Z_2/n) = 0.$$
Proof. Immediate from the final line of the proof of Proposition 4.51. •
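The content of Proposition 4.51 — that the asymptotic variance bound can only shrink when instruments are added — holds for any conformable moment matrices, and a quick numerical check is instructive. The sketch below (illustrative random stand-ins for $Q_n$ and $V_n$; not from the text) compares the asymptotic covariance using $Z_1$ alone against that using $(Z_1, Z_2)$:

```python
import numpy as np

rng = np.random.default_rng(1)
k, l1, l2 = 2, 3, 2
l = l1 + l2

# Random stand-ins: Q for E(Z'X/n) (l x k, full column rank) and
# V for E(Z'ee'Z/n) (l x l, p.d.), partitioned conformably with Z = (Z1, Z2).
Q = rng.normal(size=(l, k))
S = rng.normal(size=(l, l))
V = S @ S.T + np.eye(l)
Q1, V1 = Q[:l1], V[:l1, :l1]

avar_z1 = np.linalg.inv(Q1.T @ np.linalg.inv(V1) @ Q1)   # instruments Z1 only
avar_full = np.linalg.inv(Q.T @ np.linalg.inv(V) @ Q)    # instruments (Z1, Z2)
diff = avar_z1 - avar_full

print(np.linalg.eigvalsh(diff).min())  # >= 0: extra instruments cannot hurt
```

For generic random draws the difference is strictly positive definite; equality requires exactly the condition displayed in Proposition 4.52.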
To interpret this condition, consider the special case in which $E(Z'\varepsilon\varepsilon'Z/n) = E(Z'Z/n)$. In this case the difference in Proposition 4.52 can be consistently estimated by

$$X'Z_1(Z_1'Z_1)^{-1}Z_1'Z_2/n - X'Z_2/n.$$

This quantity is recognizable as the cross product of $Z_2$ and the projection of $X$ onto the space orthogonal to that spanned by the columns of $Z_1$. If we write $\tilde X = (Z_1(Z_1'Z_1)^{-1}Z_1' - I)X$, the difference in Proposition 4.52 is consistently estimated by $\tilde X'Z_2/n$, so that $\mathrm{avar}(\hat\beta_n) = \mathrm{avar}(\ddot\beta_n)$ if and only if $\tilde X'Z_2/n \xrightarrow{p} 0$, which can be interpreted as saying that adding $Z_2$ to the list of instrumental variables is of no use if it is uncorrelated with $\tilde X$, the matrix of residuals of the regression of $X$ on $Z_1$.

One of the interesting consequences of Propositions 4.51 and 4.52 is that in the presence of heteroskedasticity or serial correlation of unknown form, there may exist estimators for the linear model more efficient than OLS. This result has been obtained independently by Cragg (1983) and Chamberlain (1982). To construct these estimators, it is necessary to find additional instrumental variables uncorrelated with $\varepsilon_t$. If $E(\varepsilon_t|X_t) = 0$, such instrumental variables are easily found, because any measurable function of $X_t$ will be uncorrelated with $\varepsilon_t$. Hence, we can set $Z_t = (X_t', z(X_t)')'$, where $z(X_t)$ is an $(l - k) \times 1$ vector of measurable functions of $X_t$.

Example 4.53 Let $p = k = 1$, so $Y_t = X_t\beta_0 + \varepsilon_t$, where $Y_t$, $X_t$ and $\varepsilon_t$ are scalars. Suppose that $X_t$ is nonstochastic, and for convenience suppose $M_n \equiv n^{-1}\sum_{t=1}^n X_t \to 1$. Let $\varepsilon_t$ be independent heterogeneously distributed such that $E(\varepsilon_t) = 0$ and $E(\varepsilon_t^2) = \sigma_t^2$. Further, suppose $X_t > \delta > 0$ for all $t$, and take $z(X_t) = X_t^{-1}$, so that $Z_t = (X_t, X_t^{-1})'$. We consider
$$\hat\beta_n = (X'X)^{-1}X'Y \quad\text{and}\quad \ddot\beta_n = (X'Z\hat V_n^{-1}Z'X)^{-1}X'Z\hat V_n^{-1}Z'Y,$$

and suppose that sufficient other assumptions guarantee that the result of Exercise 4.26 holds for both estimators. By Propositions 4.51 and 4.52, it follows that $\mathrm{avar}(\hat\beta_n) > \mathrm{avar}(\ddot\beta_n)$ if and only if

$$n^{-1}\sum_{t=1}^n \sigma_t^2 X_t \neq n^{-1}\sum_{t=1}^n \sigma_t^2.$$

This would certainly occur if $\sigma_t^2 = X_t^{-1}$. (Verify this using Jensen's inequality.)
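The Jensen's-inequality step in Example 4.53 is easy to verify numerically. In this sketch (illustrative values chosen by us, not from the text), $\sigma_t^2 = X_t^{-1}$, and after normalizing so that $n^{-1}\sum_t X_t = 1$, the two sides of the efficiency condition differ unless the $X_t$ are constant:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

X = 0.5 + rng.uniform(size=n)      # X_t bounded away from zero
X = X / X.mean()                   # normalize so n^{-1} sum X_t = 1 exactly
sigma2 = 1.0 / X                   # the variance pattern sigma_t^2 = 1 / X_t

lhs = np.mean(sigma2 * X)          # n^{-1} sum sigma_t^2 X_t = 1
rhs = np.mean(sigma2)              # n^{-1} sum 1 / X_t

# Jensen: mean(1/X) > 1 / mean(X) = 1, with strict inequality unless
# the X_t are all equal, so lhs != rhs and the IV estimator beats OLS.
print(lhs, rhs)
```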
It also follows from Propositions 4.51 and 4.52 that when $V_n \neq L_n$, there may exist estimators more efficient than two-stage least squares. If $l > k$, additional instrumental variables are not necessarily required to improve efficiency over 2SLS (see White, 1982); but as the result of Proposition 4.51 indicates, additional instrumental variables (e.g., functions of $Z_t$) can nevertheless generate further improvements. The results so far suggest that in particular situations efficiency may or may not be improved by using additional instrumental variables. This leads us to ask whether there exists an optimal set of instrumental variables, that is, a set of instrumental variables that one could not improve upon by using additional instrumental variables. If so, we could in principle definitively know whether or not a given set of instrumental variables is best. To address this question, we introduce an approach developed by Bates and White (1993) that permits direct construction of the optimal instrumental variables. This is in sharp contrast to the approach taken when we considered the best choice for $\hat P_n$. There, we seemingly pulled the best choice, $\hat V_n^{-1}$, out of a hat (as if by magic!), and simply verified its optimality. There was no hint of how we arrived at this choice. (This is a common feature of the efficiency literature, a feature that has endowed it with no little mystery.) With the Bates-White method, however, we will be able to see how to construct the optimal (most efficient) instrumental variables estimators, step by step.

Bates and White consider classes of estimators $\mathcal{E}$ indexed by a parameter $\gamma$ such that for each data generating process $P^o$ in a set $\mathcal{P}$,
$$\sqrt{n}(\hat\theta_n(\gamma) - \theta^o) = H_n^o(\gamma)^{-1}\sqrt{n}\,s_n^o(\gamma) + o_{P^o}(1), \tag{4.1}$$

where: $\hat\theta_n(\gamma)$ is an estimator indexed by $\gamma$ taking values in a set $\Gamma$; $\theta^o = \theta(P^o)$ is the $k \times 1$ parameter vector corresponding to the data generating process $P^o$; $H_n^o(\gamma)$ is a nonstochastic nonsingular $k \times k$ matrix depending on $\gamma$ and $P^o$; $s_n^o(\gamma)$ is a random $k \times 1$ vector depending on $\gamma$ and $P^o$ such that $I_n^o(\gamma)^{-1/2}\sqrt{n}\,s_n^o(\gamma) \xrightarrow{d^o} N(0, I)$, where $I_n^o(\gamma) \equiv \mathrm{var}^o(\sqrt{n}\,s_n^o(\gamma))$, $\xrightarrow{d^o}$ denotes convergence in distribution under $P^o$, and $\mathrm{var}^o$ denotes variance under $P^o$; and $o_{P^o}(1)$ denotes terms vanishing in probability-$P^o$. Bates and White consider how to choose $\gamma$ in $\Gamma$ to obtain an asymptotically efficient estimator. Bates and White's method is actually formulated for a more general setting, but (4.1) will suffice for us here.

To relate the Bates-White method to our instrumental variables estimation framework, we begin by writing the process generating the dependent
variables in a way that makes explicit the role of $P^o$. Specifically, we write

$$Y_t^o = X_t^{o\prime}\beta^o + \varepsilon_t, \qquad t = 1, 2, \ldots.$$

The rationale for this is as follows. The data generating process $P^o$ governs the probabilistic behavior of all of the random variables (measurable mappings from the underlying measurable space $(\Omega, \mathcal{F})$) in our system. Given the way our data are generated, the probabilistic behavior of the mapping $\varepsilon_t$ is governed by $P^o$, but the mapping itself is not determined by $P^o$. We thus do not give a $^o$ superscript to $\varepsilon_t$. On the other hand, different values $\beta^o$ (corresponding to $\theta^o$ in the Bates-White setting) result in different mappings for the dependent variables; we acknowledge this by writing $Y_t^o$, and we view $\beta^o$ as determined by $P^o$, i.e., $\beta^o = \beta(P^o)$. Further, because lags of $Y_t^o$ or elements of $Y_t^o$ itself (recall the simultaneous equations framework) may appear as elements of the explanatory variable matrix, different values for $\beta^o$ may also result in different mappings for the explanatory variables. We acknowledge this by writing $X_t^o$.

Next, consider the instrumental variables. These also are measurable mappings from $(\Omega, \mathcal{F})$ whose probabilistic behavior is governed by $P^o$. As we will see, it is useful to permit these mappings to depend on $P^o$ as well. Consequently, we write $Z_t^o$. Because our goal is the choice of optimal instrumental variables, we let $Z_t^o$ depend on $\gamma$, and write $Z_t^o(\gamma)$ to emphasize this dependence. The optimal choice of $\gamma$ will then deliver the optimal choice of instrumental variables. Later, we will revisit and extend this interpretation of $\gamma$.

In Exercise 4.26 (ii), we assume $V_n^{-1/2}n^{-1/2}Z'\varepsilon \xrightarrow{d} N(0, I)$, where $V_n = \mathrm{var}(n^{-1/2}Z'\varepsilon)$. For the Bates-White framework with $Z_t^o(\gamma)$, this becomes $V_n^o(\gamma)^{-1/2}n^{-1/2}Z^o(\gamma)'\varepsilon \xrightarrow{d^o} N(0, I)$, where $V_n^o(\gamma) \equiv \mathrm{var}^o(n^{-1/2}Z^o(\gamma)'\varepsilon)$, and $Z^o(\gamma)$ is the $np \times l$ matrix with blocks $Z_t^o(\gamma)'$. In (iii.a) we assume $Z'X/n - Q_n = o_p(1)$, where $Q_n \equiv E(Z'X/n)$ is $O(1)$ with uniformly full column rank. This becomes $Z^o(\gamma)'X^o/n - Q_n^o(\gamma) = o_{P^o}(1)$, where $Q_n^o(\gamma) \equiv E^o(Z^o(\gamma)'X^o/n)$ is $O(1)$ with uniformly full column rank; $E^o(\cdot)$ denotes expectation with respect to $P^o$. In (iii.b) we assume $\hat P_n - P_n = o_p(1)$, where $P_n$ is $O(1)$ and uniformly positive definite. This becomes $\hat P_n(\gamma) - P_n^o(\gamma) = o_{P^o}(1)$, where $P_n^o(\gamma)$ is $O(1)$ and uniformly positive definite. Finally, the instrumental variables estimator has the form

$$\hat\beta_n(\gamma) = \big(X^{o\prime}Z^o(\gamma)\hat P_n(\gamma)Z^o(\gamma)'X^o\big)^{-1}X^{o\prime}Z^o(\gamma)\hat P_n(\gamma)Z^o(\gamma)'Y^o.$$

We will use the following further extension of Exercise 4.26.
Theorem 4.54 Let $\mathcal{P}$ be a set of probability measures $P^o$ on $(\Omega, \mathcal{F})$, and let $\Gamma$ be a set with elements $\gamma$. Suppose that for each $P^o$ in $\mathcal{P}$

(i) $Y_t^o = X_t^{o\prime}\beta^o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta^o \in \mathbb{R}^k$.

For each $\gamma$ in $\Gamma$ and each $P^o$ in $\mathcal{P}$, let $\{Z_{nt}^o(\gamma)\}$ be a double array of random $l \times p$ matrices such that

(ii) $n^{-1/2}\sum_{t=1}^n Z_{nt}^o(\gamma)\varepsilon_t - n^{1/2}m_n^o(\gamma) = o_{P^o}(1)$, where

$$V_n^o(\gamma)^{-1/2}n^{1/2}m_n^o(\gamma) \xrightarrow{d^o} N(0, I),$$

and $V_n^o(\gamma) \equiv \mathrm{var}^o(n^{1/2}m_n^o(\gamma))$ is $O(1)$ and uniformly positive definite;

(iii) (a) $n^{-1}\sum_{t=1}^n Z_{nt}^o(\gamma)X_t^{o\prime} - Q_n^o(\gamma) = o_{P^o}(1)$, where $Q_n^o(\gamma)$ is $O(1)$ with uniformly full column rank;

(b) there exists $\hat P_n(\gamma)$ such that $\hat P_n(\gamma) - P_n^o(\gamma) = o_{P^o}(1)$, and $P_n^o(\gamma)$ is $O(1)$ and is symmetric and uniformly positive definite.

Then for each $P^o$ in $\mathcal{P}$ and $\gamma$ in $\Gamma$,

$$D_n^o(\gamma)^{-1/2}\sqrt{n}(\hat\beta_n(\gamma) - \beta^o) \xrightarrow{d^o} N(0, I),$$

where

$$D_n^o(\gamma) \equiv \big(Q_n^o(\gamma)'P_n^o(\gamma)Q_n^o(\gamma)\big)^{-1}Q_n^o(\gamma)'P_n^o(\gamma)V_n^o(\gamma)P_n^o(\gamma)Q_n^o(\gamma)\big(Q_n^o(\gamma)'P_n^o(\gamma)Q_n^o(\gamma)\big)^{-1},$$

and $D_n^o(\gamma)$ and $D_n^o(\gamma)^{-1}$ are $O(1)$.

Proof. Apply the proof of Exercise 4.26 for each $P^o$ in $\mathcal{P}$ and $\gamma$ in $\Gamma$. $\blacksquare$
This result is stated with features that permit fairly general applications, discussed further next. In particular, we let $\{Z_{nt}^o(\gamma)\}$ be a double array of matrices. In some cases a single array (sequence) $\{Z_t^o(\gamma)\}$ suffices. In such cases we can simply put $m_n^o(\gamma) = n^{-1}Z^o(\gamma)'\varepsilon$, so that $V_n^o(\gamma) \equiv \mathrm{var}^o(n^{-1/2}Z^o(\gamma)'\varepsilon)$, parallel to Exercise 4.26. In this case we also have $Q_n^o(\gamma) = E^o(Z^o(\gamma)'X^o/n)$. These identifications will suffice for now, but the additional flexibility permitted in Theorem 4.54 will become useful shortly. Recall that the proof of Exercise 4.26 gives

$$\sqrt{n}(\hat\beta_n(\gamma) - \beta^o) = \big(Q_n^o(\gamma)'V_n^o(\gamma)^{-1}Q_n^o(\gamma)\big)^{-1}Q_n^o(\gamma)'V_n^o(\gamma)^{-1}n^{-1/2}Z^o(\gamma)'\varepsilon + o_{P^o}(1)$$
when we choose $\hat P_n(\gamma)$ so that $P_n^o(\gamma) = V_n^o(\gamma)^{-1}$, the asymptotically efficient choice. Comparing this expression to (4.1), we see that $\hat\beta_n(\gamma)$ corresponds to $\hat\theta_n(\gamma)$, $\mathcal{E}$ is the set of estimators $\mathcal{E} \equiv \{\{\hat\beta_n(\gamma)\}, \gamma \in \Gamma\}$, $\beta^o$ corresponds to $\theta^o$, and we can put

$$H_n^o(\gamma) = Q_n^o(\gamma)'V_n^o(\gamma)^{-1}Q_n^o(\gamma),$$
$$s_n^o(\gamma) = Q_n^o(\gamma)'V_n^o(\gamma)^{-1}n^{-1}Z^o(\gamma)'\varepsilon,$$
$$I_n^o(\gamma) = H_n^o(\gamma).$$
Our consideration of the construction of the optimal instrumental variables estimator makes essential use of these correspondences. The key to applying the Bates-White constructive method here is the following result, a version of Theorem 2.6 of Bates and White (1993).

Proposition 4.55 Let $\hat\theta_n(\gamma)$ satisfy (4.1) for all $\gamma$ in a nonempty set $\Gamma$ and all $P^o$ in $\mathcal{P}$, and suppose there exists $\gamma^*$ in $\Gamma$ such that

$$H_n^o(\gamma) = \mathrm{cov}^o\big(\sqrt{n}\,s_n^o(\gamma), \sqrt{n}\,s_n^o(\gamma^*)\big) \tag{4.2}$$

for all $\gamma \in \Gamma$ and all $P^o$ in $\mathcal{P}$. Then for all $\gamma \in \Gamma$ and all $P^o$ in $\mathcal{P}$,

$$\mathrm{avar}^o(\hat\theta_n(\gamma)) - \mathrm{avar}^o(\hat\theta_n(\gamma^*))$$

is positive semidefinite for all $n$ sufficiently large.
Proof. Given (4.1), we have that for all $\gamma$ in $\Gamma$, all $P^o$ in $\mathcal{P}$, and all $n$ sufficiently large,

$$\mathrm{avar}^o(\hat\theta_n(\gamma)) = H_n^o(\gamma)^{-1}I_n^o(\gamma)H_n^o(\gamma)^{-1},$$
$$\mathrm{avar}^o(\hat\theta_n(\gamma^*)) = H_n^o(\gamma^*)^{-1}I_n^o(\gamma^*)H_n^o(\gamma^*)^{-1} = H_n^o(\gamma^*)^{-1},$$

so

$$\mathrm{avar}^o(\hat\theta_n(\gamma)) - \mathrm{avar}^o(\hat\theta_n(\gamma^*)) = H_n^o(\gamma)^{-1}I_n^o(\gamma)H_n^o(\gamma)^{-1} - H_n^o(\gamma^*)^{-1}.$$

We now show that this quantity equals

$$\mathrm{var}^o\big(H_n^o(\gamma)^{-1}\sqrt{n}\,s_n^o(\gamma) - H_n^o(\gamma^*)^{-1}\sqrt{n}\,s_n^o(\gamma^*)\big).$$

Note that (4.2) implies

$$\mathrm{var}^o\begin{pmatrix} \sqrt{n}\,s_n^o(\gamma) \\ \sqrt{n}\,s_n^o(\gamma^*) \end{pmatrix} = \begin{pmatrix} I_n^o(\gamma) & H_n^o(\gamma) \\ H_n^o(\gamma)' & H_n^o(\gamma^*) \end{pmatrix},$$

so the variance of $H_n^o(\gamma)^{-1}\sqrt{n}\,s_n^o(\gamma) - H_n^o(\gamma^*)^{-1}\sqrt{n}\,s_n^o(\gamma^*)$ is given by

$$H_n^o(\gamma)^{-1}I_n^o(\gamma)H_n^o(\gamma)^{-1} + H_n^o(\gamma^*)^{-1}H_n^o(\gamma^*)H_n^o(\gamma^*)^{-1} - H_n^o(\gamma)^{-1}H_n^o(\gamma)H_n^o(\gamma^*)^{-1} - H_n^o(\gamma^*)^{-1}H_n^o(\gamma)'H_n^o(\gamma)^{-1}$$
$$= H_n^o(\gamma)^{-1}I_n^o(\gamma)H_n^o(\gamma)^{-1} - H_n^o(\gamma^*)^{-1},$$
as claimed. Since this quantity is a variance-covariance matrix, it is positive semidefinite, and the proof is complete. $\blacksquare$

The key to finding the asymptotically efficient estimator in a given class is thus to find $\gamma^*$ such that for all $P^o$ in $\mathcal{P}$ and $\gamma$ in $\Gamma$,

$$H_n^o(\gamma) = \mathrm{cov}^o\big(\sqrt{n}\,s_n^o(\gamma), \sqrt{n}\,s_n^o(\gamma^*)\big).$$
Finding the optimal instrumental variables turns out to be somewhat involved, but key insight can be gained by initially considering $Z_t^o(\gamma)$ as previously suggested and by requiring that the errors $\{\varepsilon_t\}$ be sufficiently well behaved. Thus, consider applying (4.2) with $Z_t^o(\gamma)$. The correspondences previously identified give

$$Q_n^o(\gamma)'V_n^o(\gamma)^{-1}Q_n^o(\gamma) = \mathrm{cov}^o\big(Q_n^o(\gamma)'V_n^o(\gamma)^{-1}n^{-1/2}Z^o(\gamma)'\varepsilon,\; Q_n^o(\gamma^*)'V_n^o(\gamma^*)^{-1}n^{-1/2}Z^o(\gamma^*)'\varepsilon\big).$$

Defining $C_n^o(\gamma, \gamma^*) \equiv \mathrm{cov}^o\big(n^{-1/2}Z^o(\gamma)'\varepsilon,\; n^{-1/2}Z^o(\gamma^*)'\varepsilon\big)$, we have the simpler expression

$$Q_n^o(\gamma)'V_n^o(\gamma)^{-1}Q_n^o(\gamma) = Q_n^o(\gamma)'V_n^o(\gamma)^{-1}C_n^o(\gamma, \gamma^*)V_n^o(\gamma^*)^{-1}Q_n^o(\gamma^*).$$

It therefore suffices to find $\gamma^*$ such that for all $\gamma$

$$Q_n^o(\gamma) = C_n^o(\gamma, \gamma^*)V_n^o(\gamma^*)^{-1}Q_n^o(\gamma^*).$$

To make this choice tractable, we impose plausible structure on $\varepsilon_t$. Specifically, we suppose there is an increasing sequence of $\sigma$-fields $\{\mathcal{F}_t\}$ adapted to $\{\varepsilon_t\}$ such that for each $P^o$ in $\mathcal{P}$, $\{\varepsilon_t, \mathcal{F}_t\}$ is a martingale difference sequence, that is, $E^o(\varepsilon_t|\mathcal{F}_{t-1}) = 0$. Recall that the instrumental variables $\{Z_t\}$ must be uncorrelated with $\varepsilon_t$. It is therefore natural to restrict attention to instrumental variables $Z_t^o(\gamma)$ that are measurable-$\mathcal{F}_{t-1}$, as then

$$E^o\big(Z_t^o(\gamma)\varepsilon_t\big) = E^o\big(E^o(Z_t^o(\gamma)\varepsilon_t|\mathcal{F}_{t-1})\big) = E^o\big(Z_t^o(\gamma)E^o(\varepsilon_t|\mathcal{F}_{t-1})\big) = 0.$$
Imposing these requirements on $\{\varepsilon_t\}$ and $\{Z_t^o(\gamma)\}$, we have that

$$\begin{aligned}
C_n^o(\gamma, \gamma^*) &= n^{-1}\sum_{t=1}^n\sum_{\tau=1}^n E^o\big(Z_t^o(\gamma)\varepsilon_t\varepsilon_\tau'Z_\tau^o(\gamma^*)'\big) \\
&= n^{-1}\sum_{t=1}^n E^o\big(Z_t^o(\gamma)\varepsilon_t\varepsilon_t'Z_t^o(\gamma^*)'\big) \\
&= n^{-1}\sum_{t=1}^n E^o\big(Z_t^o(\gamma)E^o(\varepsilon_t\varepsilon_t'|\mathcal{F}_{t-1})Z_t^o(\gamma^*)'\big) \\
&= n^{-1}\sum_{t=1}^n E^o\big(Z_t^o(\gamma)\,\Omega_t^o\,Z_t^o(\gamma^*)'\big),
\end{aligned}$$
where $\Omega_t^o \equiv E^o(\varepsilon_t\varepsilon_t'|\mathcal{F}_{t-1})$. The first equality is by definition, the second follows from the martingale difference property of $\varepsilon_t$ (the cross terms vanish), the third uses the law of iterated expectations, and the fourth applies Proposition 3.63. Consequently, we seek $\gamma^*$ such that

$$n^{-1}\sum_{t=1}^n E^o\big(Z_t^o(\gamma)X_t^{o\prime}\big) = n^{-1}\sum_{t=1}^n E^o\big(Z_t^o(\gamma)\,\Omega_t^o\,Z_t^o(\gamma^*)'\big)V_n^o(\gamma^*)^{-1}Q_n^o(\gamma^*),$$

or, by the law of iterated expectations,

$$\begin{aligned}
n^{-1}\sum_{t=1}^n E^o\big(Z_t^o(\gamma)X_t^{o\prime}\big) &= n^{-1}\sum_{t=1}^n E^o\big(E^o(Z_t^o(\gamma)X_t^{o\prime}|\mathcal{F}_{t-1})\big) \\
&= n^{-1}\sum_{t=1}^n E^o\big(Z_t^o(\gamma)E^o(X_t^{o\prime}|\mathcal{F}_{t-1})\big) \\
&= n^{-1}\sum_{t=1}^n E^o\big(Z_t^o(\gamma)\bar X_t^{o\prime}\big) \\
&= n^{-1}\sum_{t=1}^n E^o\big(Z_t^o(\gamma)\,\Omega_t^o\,Z_t^o(\gamma^*)'\big)V_n^o(\gamma^*)^{-1}Q_n^o(\gamma^*),
\end{aligned}$$

where we define $\bar X_t^o \equiv E^o(X_t^o|\mathcal{F}_{t-1})$. Inspecting the last equation, we see that equality holds provided we choose $\gamma^*$ such that

$$\Omega_t^o\,Z_t^o(\gamma^*)'V_n^o(\gamma^*)^{-1}Q_n^o(\gamma^*) = \bar X_t^{o\prime}.$$
If $\Omega_t^o = E^o(\varepsilon_t\varepsilon_t'|\mathcal{F}_{t-1})$ is nonsingular, as is plausible, this becomes

$$Q_n^o(\gamma^*)'V_n^o(\gamma^*)^{-1}Z_t^o(\gamma^*) = \bar X_t^o\,\Omega_t^{o-1}, \tag{4.3}$$

which looks promising, as both sides are measurable-$\mathcal{F}_{t-1}$. We must therefore choose $Z_t^o(\gamma^*)$ so that a particular linear combination of $Z_t^o(\gamma^*)$ equals $\bar X_t^o\,\Omega_t^{o-1}$. This still looks challenging, however, because $\gamma^*$ also appears in $Q_n^o(\gamma^*)$ and $V_n^o(\gamma^*)$. Nevertheless, consider the simplest possible choice for $Z_t^o(\gamma^*)$:

$$Z_t^o(\gamma^*) = \bar X_t^o\,\Omega_t^{o-1}.$$

For this to be valid, it must happen that $Q_n^o(\gamma^*)' = V_n^o(\gamma^*)$, so that $Q_n^o(\gamma^*)'V_n^o(\gamma^*)^{-1} = I$. Now with this choice for $Z_t^o(\gamma^*)$,
$$\begin{aligned}
V_n^o(\gamma^*) &= \mathrm{var}^o\Big(n^{-1/2}\sum_{t=1}^n Z_t^o(\gamma^*)\varepsilon_t\Big) \\
&= n^{-1}\sum_{t=1}^n E^o\big(Z_t^o(\gamma^*)\varepsilon_t\varepsilon_t'Z_t^o(\gamma^*)'\big) \\
&= n^{-1}\sum_{t=1}^n E^o\big(\bar X_t^o\,\Omega_t^{o-1}\Omega_t^o\,\Omega_t^{o-1}\bar X_t^{o\prime}\big) \\
&= n^{-1}\sum_{t=1}^n E^o\big(\bar X_t^o\,\Omega_t^{o-1}\bar X_t^{o\prime}\big) \\
&= n^{-1}\sum_{t=1}^n E^o\big(Z_t^o(\gamma^*)\bar X_t^{o\prime}\big) \\
&= n^{-1}\sum_{t=1}^n E^o\big(Z_t^o(\gamma^*)X_t^{o\prime}\big) \\
&= Q_n^o(\gamma^*).
\end{aligned}$$

The choice $Z_t^o(\gamma^*) = \bar X_t^o\,\Omega_t^{o-1}$ thus satisfies the sufficient condition (4.2), and therefore delivers the optimal instrumental variables. Nevertheless, using these optimal instrumental variables presents apparent obstacles, in that neither $\Omega_t^o = E^o(\varepsilon_t\varepsilon_t'|\mathcal{F}_{t-1})$ nor $\bar X_t^o = E^o(X_t^o|\mathcal{F}_{t-1})$
are necessarily known or observable. We shall see that these obstacles can be overcome under reasonable assumptions. Before dealing with the general case, however, we consider two important special cases in which these obstacles are removed. This provides insight useful in dealing with the general case, and leads us naturally to procedures having general applicability. Thus, suppose for the moment

$$E^o(\varepsilon_t\varepsilon_t'|\mathcal{F}_{t-1}) = \Omega_t, \qquad t = 1, 2, \ldots,$$

where $\Omega_t$ is a known $p \times p$ matrix. (We consider the consequences of not knowing $\Omega_t$ later.) Although this restricts $\mathcal{P}$, it does not restrict it completely; different choices for $P^o$ can imply different distributions for $\varepsilon_t$, even though $\Omega_t$ is identical. If we then also require that $X_t^o$ is measurable-$\mathcal{F}_{t-1}$, it follows that

$$\bar X_t^o \equiv E^o(X_t^o|\mathcal{F}_{t-1}) = X_t^o,$$

so that $\bar X_t^o = X_t^o$ is observable. Note that this also implies

(i') $E^o(Y_t^o|\mathcal{F}_{t-1}) = X_t^{o\prime}\beta^o$,
so that the conditional mean of $Y_t^o$ still depends on $P^o$ through $\beta^o$, but we can view (i') as defining a relation useful for prediction. This requirement rules out a system of simultaneous equations, but still permits $X_t^o$ to contain lags of elements of $Y_t^o$. With these restrictions on $\mathcal{P}$, we have that $Z_t^o(\gamma^*) = \bar X_t^o\,\Omega_t^{o-1} = X_t^o\Omega_t^{-1}$ delivers a usable set of optimal instrumental variables. Observe that this clearly illustrates the way in which the existence of an optimal IV estimator depends on the collection $\mathcal{P}$ of candidate data generating processes. Let us now examine the estimator resulting from choosing the optimal instrumental variables in this setting.

Exercise 4.56 Let $X_t^o$ be $\mathcal{F}_{t-1}$-measurable and suppose that $\Omega_t = E^o(\varepsilon_t\varepsilon_t'|\mathcal{F}_{t-1})$ for all $P^o$ in $\mathcal{P}$. Show that with the choice $Z_t^o(\gamma^*) = X_t^o\Omega_t^{-1}$ we have $V_n^o(\gamma^*) = E\big(n^{-1}\sum_{t=1}^n X_t^o\Omega_t^{-1}X_t^{o\prime}\big)$, and show that with $\hat V_n(\gamma^*) = n^{-1}\sum_{t=1}^n X_t^o\Omega_t^{-1}X_t^{o\prime}$ this yields the GLS estimator
$$\ddot\beta_n = \Big(n^{-1}\sum_{t=1}^n X_t^o\Omega_t^{-1}X_t^{o\prime}\Big)^{-1}n^{-1}\sum_{t=1}^n X_t^o\Omega_t^{-1}Y_t^o = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}Y,$$
where $\Omega$ is the $np \times np$ block diagonal matrix with $p \times p$ diagonal blocks $\Omega_t$, $t = 1, \ldots, n$, and zeroes elsewhere, and we drop the $^o$ superscript in writing $X$ and $Y$.
This is the generalized least squares estimator whose finite sample efficiency properties we discussed in Chapter 1. Our current discussion shows that these finite sample properties have large sample analogs in a setting that neither requires normality for the errors $\varepsilon_t$ nor requires that the regressors $X_t^o$ be nonstochastic. In fact, we have now proven part (a) of the following asymptotic efficiency result for the generalized least squares estimator.
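As a concrete illustration of Exercise 4.56, the following Python sketch (simulated data; the design and variance pattern are arbitrary illustrative choices, not from the text) computes the GLS estimator for the scalar case $p = 1$, where $\Omega$ reduces to a diagonal matrix and GLS is weighted least squares:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

X = np.column_stack([np.ones(n), rng.normal(size=n)])   # rows X_t', k = 2
beta0 = np.array([1.0, -0.5])

omega = 0.1 + rng.uniform(size=n)            # known Omega_t > 0 (p = 1)
eps = rng.normal(size=n) * np.sqrt(omega)    # E(e_t^2 | F_{t-1}) = Omega_t
Y = X @ beta0 + eps

# GLS: (X' Omega^{-1} X)^{-1} X' Omega^{-1} Y; with diagonal Omega this is
# weighted least squares with weights 1 / Omega_t.
w = 1.0 / omega
XtWX = X.T @ (w[:, None] * X)
beta_gls = np.linalg.solve(XtWX, X.T @ (w * Y))

# Consistent estimate of avar/n per the result below (Theorem 4.57 (b)).
avar_hat_over_n = np.linalg.inv(XtWX)
print(beta_gls)
```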
Theorem 4.57 (Efficiency of GLS) Suppose the conditions of Theorem 4.54 hold and that in addition $\mathcal{P}$ is such that for each $P^o$ in $\mathcal{P}$

(i) $Y_t^o = X_t^{o\prime}\beta^o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta^o \in \mathbb{R}^k$, where $\{(X_{t+1}^o, \varepsilon_t), \mathcal{F}_t\}$ is an adapted stochastic sequence with $E^o(\varepsilon_t|\mathcal{F}_{t-1}) = 0$, $t = 1, 2, \ldots$;

(ii) $E^o(\varepsilon_t\varepsilon_t'|\mathcal{F}_{t-1}) = \Omega_t$ is a known finite and nonsingular $p \times p$ matrix, $t = 1, 2, \ldots$;

(iii) for each $\gamma \in \Gamma$, $Z_t^o(\gamma)$ is $\mathcal{F}_{t-1}$-measurable, $t = 1, 2, \ldots$, and $Q_n^o(\gamma) = n^{-1}\sum_{t=1}^n E(Z_t^o(\gamma)X_t^{o\prime})$.

Then

(a) $Z_t^o(\gamma^*) = X_t^o\Omega_t^{-1}$ delivers

$$\ddot\beta_n = \Big(\sum_{t=1}^n X_t^o\Omega_t^{-1}X_t^{o\prime}\Big)^{-1}\sum_{t=1}^n X_t^o\Omega_t^{-1}Y_t^o,$$

which is asymptotically efficient in $\mathcal{E}$ for $\mathcal{P}$.

(b) Further, the asymptotic covariance matrix of $\ddot\beta_n$ is given by

$$\mathrm{avar}^o(\ddot\beta_n) = \Big[n^{-1}\sum_{t=1}^n E^o\big(X_t^o\Omega_t^{-1}X_t^{o\prime}\big)\Big]^{-1},$$

which is consistently estimated by

$$\hat D_n = \Big[n^{-1}\sum_{t=1}^n X_t^o\Omega_t^{-1}X_t^{o\prime}\Big]^{-1},$$

i.e., $\hat D_n - \mathrm{avar}^o(\ddot\beta_n) \xrightarrow{P^o} 0$.
We leave the proof of (b) as an exercise:

Exercise 4.58 Prove Theorem 4.57 (b).

As we remarked in Chapter 1, the GLS estimator typically is not feasible, because the conditional covariances $\{\Omega_t\}$ are usually unknown. Nevertheless, we might be willing to model $\Omega_t$ based on an assumed data generating process for $\varepsilon_t\varepsilon_t'$, say

$$\Omega_t^o = h(W_t^o, \alpha^o),$$

where $h$ is a known function mapping $\mathcal{F}_{t-1}$-measurable random variables $W_t^o$ and unknown parameters $\alpha^o$ into $p \times p$ matrices. Note that different $P^o$'s can now be accommodated by different values for $\alpha^o$, and that $W_t^o$ may depend on $P^o$. An example is the ARCH($q$) data generating process of Engle (1982), in which $h(W_t^o, \alpha^o) = \alpha_0^o + \alpha_1^o\varepsilon_{t-1}^2 + \cdots + \alpha_q^o\varepsilon_{t-q}^2$ for the scalar ($p = 1$) case. Here, $W_t^o = (\varepsilon_{t-1}, \ldots, \varepsilon_{t-q})$. The popular GARCH(1,1) DGP of Bollerslev (1986) can be similarly treated by viewing it as an ARCH($\infty$) DGP, in which case $W_t^o = (\varepsilon_{t-1}, \varepsilon_{t-2}, \ldots)$. If we can obtain an estimator, say $\hat\alpha_n$, consistent for $\alpha^o$ regardless of which $P^o$ in $\mathcal{P}$ generated the data, then we can hope that replacing the unknown $\Omega_t^o = h(W_t^o, \alpha^o)$ with the estimator $\hat\Omega_{nt} = h(W_t^o, \hat\alpha_n)$ might still deliver an efficient estimator of the form

$$\ddot\beta_n = \Big(\sum_{t=1}^n X_t^o\hat\Omega_{nt}^{-1}X_t^{o\prime}\Big)^{-1}\sum_{t=1}^n X_t^o\hat\Omega_{nt}^{-1}Y_t^o.$$
Such estimators are called "feasible" GLS (FGLS) estimators, because they can feasibly be computed, whereas, in the absence of knowledge of $\Omega_t$, GLS cannot. Inspecting the FGLS estimator, we see that it is also an IV estimator with

$$Z_{nt}^o = X_t^o\hat\Omega_{nt}^{-1}.$$

Observe that $Z_{nt}^o = X_t^o\hat\Omega_{nt}^{-1}$ is now double subscripted to make explicit its dependence on both $n$ and $t$. This double subscripting is explicitly allowed for in Theorem 4.54, and we can now see the usefulness of the flexibility
this affords. To accommodate this choice as well as other similar choices, we now elaborate our specifications of $Z_{nt}^o$ and $\gamma$. Specifically, we put

$$Z_{nt}^o(\gamma) = Z_{t2}^o(\gamma_2)\,\zeta(Z_{t1}^o(\gamma_1), \gamma_{3n}).$$

Now $\gamma$ has three components: $\gamma = (\gamma_1, \gamma_2, \gamma_3)$, where $\gamma_3 \equiv \{\gamma_{3n}\}$ is a sequence of $\mathcal{F}$-measurable mappings, $\zeta$ is a known function such that $\zeta(Z_{t1}^o(\gamma_1), \gamma_{3n})$ is a $p \times p$ matrix, and $Z_{t1}^o(\gamma_1)$ and $Z_{t2}^o(\gamma_2)$ are $\mathcal{F}_{t-1}$-measurable for all $\gamma_1$, $\gamma_2$, where $Z_{t2}^o(\gamma_2)$ is an $l \times p$ random matrix. By putting $Z_{t2}^o(\gamma_2) = X_t^o$, $Z_{t1}^o(\gamma_1) = W_t^o$, $\gamma_{3n} = \hat\alpha_n$, and $\zeta = (h)^{-1}$, we obtain

$$Z_{nt}^o(\gamma) = X_t^o\,[h(W_t^o, \hat\alpha_n)]^{-1} = X_t^o\hat\Omega_{nt}^{-1}.$$

Many other choices that yield consistent asymptotically normal IV estimators are also permitted. The requirement that $\zeta$ be known is not restrictive; in particular, we can choose $\zeta$ such that $\zeta(z_1, \gamma_{3n}) = \gamma_{3n}(z_1)$ (the evaluation functional), which allows consideration of nonparametric estimators of the inverse covariance matrix, $\Omega_t^{o-1}$, as well as parametric estimators such as ARCH (set $\gamma_{3n}(\cdot) = [h(\cdot, \hat\alpha_n)]^{-1}$). To see what choices for $\gamma$ are permitted, it suffices to apply Theorem 4.54 with $Z^o(\gamma)$ interpreted as the $np \times l$ matrix with blocks $Z_{nt}^o(\gamma)'$. We expand $\mathcal{P}$ to include data generating processes such that $\Omega_t^{o-1}$ can be sufficiently well estimated by $\zeta(Z_{t1}^o(\gamma_1), \gamma_{3n})$ for some choice of $(\gamma_1, \gamma_3)$. (What is meant by "sufficiently well" is that conditions (ii) and (iii) of Theorem 4.54 hold.) It is here that the generality afforded by Theorem 4.54 (ii) is required, as we can no longer necessarily take $m_n^o(\gamma) = n^{-1}\sum_{t=1}^n Z_{nt}^o(\gamma)\varepsilon_t$. The presence of $\gamma_{3n}$ and the nature of the function $\zeta$ may cause $n^{-1/2}\sum_{t=1}^n Z_{nt}^o(\gamma)\varepsilon_t$ to fail to have a finite second moment, even though $n^{-1/2}\sum_{t=1}^n Z_{nt}^o(\gamma)\varepsilon_t - n^{1/2}m_n^o(\gamma) = o_{P^o}(1)$ for well behaved $m_n^o(\gamma)$. Similarly, there is no guarantee that $E^o(Z_{nt}^o(\gamma)X_t^{o\prime})$ exists, although the sample average $n^{-1}\sum_{t=1}^n Z_{nt}^o(\gamma)X_t^{o\prime}$ may have a well defined probability limit $Q_n^o(\gamma)$.

Consider first the form of $m_n^o$. For notational economy, let $\gamma_{t1}^o \equiv Z_{t1}^o(\gamma_1)$ and $\gamma_{t2}^o \equiv Z_{t2}^o(\gamma_2)$. Then with $Z_{nt}^o(\gamma) = \gamma_{t2}^o\,\zeta(\gamma_{t1}^o, \gamma_{3n})$,
we have

$$\begin{aligned}
n^{-1/2}\sum_{t=1}^n Z_{nt}^o(\gamma)\varepsilon_t
&= n^{-1/2}\sum_{t=1}^n \gamma_{t2}^o\,\zeta(\gamma_{t1}^o, \gamma_{3n})\,\varepsilon_t \\
&= n^{-1/2}\sum_{t=1}^n \gamma_{t2}^o\,\zeta(\gamma_{t1}^o, \gamma_3^o)\,\varepsilon_t + n^{-1/2}\sum_{t=1}^n \gamma_{t2}^o\,\big[\zeta(\gamma_{t1}^o, \gamma_{3n}) - \zeta(\gamma_{t1}^o, \gamma_3^o)\big]\varepsilon_t,
\end{aligned}$$

where we view $\gamma_3^o$ as the probability limit of $\gamma_{3n}$ under $P^o$. If sufficient regularity is available to ensure that the second term vanishes in probability, i.e.,

$$n^{-1/2}\sum_{t=1}^n \gamma_{t2}^o\,\big[\zeta(\gamma_{t1}^o, \gamma_{3n}) - \zeta(\gamma_{t1}^o, \gamma_3^o)\big]\varepsilon_t = o_{P^o}(1),$$

then we can take

$$n^{1/2}m_n^o(\gamma) = n^{-1/2}\sum_{t=1}^n \gamma_{t2}^o\,\zeta(\gamma_{t1}^o, \gamma_3^o)\,\varepsilon_t.$$

We ensure "sufficient regularity" simply by assuming that $\Gamma$ and $\mathcal{P}$ are chosen so that the desired convergence holds for all $\gamma \in \Gamma$ and $P^o$ in $\mathcal{P}$. The form of $Q_n^o$ follows similarly. With $Z_{nt}^o(\gamma) = \gamma_{t2}^o\,\zeta(\gamma_{t1}^o, \gamma_{3n})$ we have

$$\begin{aligned}
n^{-1}\sum_{t=1}^n Z_{nt}^o(\gamma)X_t^{o\prime}
&= n^{-1}\sum_{t=1}^n \gamma_{t2}^o\,\zeta(\gamma_{t1}^o, \gamma_{3n})\,X_t^{o\prime} \\
&= n^{-1}\sum_{t=1}^n \gamma_{t2}^o\,\zeta(\gamma_{t1}^o, \gamma_3^o)\,X_t^{o\prime} + n^{-1}\sum_{t=1}^n \gamma_{t2}^o\,\big[\zeta(\gamma_{t1}^o, \gamma_{3n}) - \zeta(\gamma_{t1}^o, \gamma_3^o)\big]X_t^{o\prime}.
\end{aligned}$$

If $\Gamma$ and $\mathcal{P}$ are assumed chosen such that

$$n^{-1}\sum_{t=1}^n \gamma_{t2}^o\,\zeta(\gamma_{t1}^o, \gamma_3^o)\,X_t^{o\prime} - n^{-1}\sum_{t=1}^n E^o\big[\gamma_{t2}^o\,\zeta(\gamma_{t1}^o, \gamma_3^o)\,X_t^{o\prime}\big] = o_{P^o}(1)$$

and

$$n^{-1}\sum_{t=1}^n \gamma_{t2}^o\,\big[\zeta(\gamma_{t1}^o, \gamma_{3n}) - \zeta(\gamma_{t1}^o, \gamma_3^o)\big]X_t^{o\prime} = o_{P^o}(1),$$
then we can take

$$Q_n^o(\gamma) = n^{-1}\sum_{t=1}^n E^o\big[\gamma_{t2}^o\,\zeta(\gamma_{t1}^o, \gamma_3^o)\,X_t^{o\prime}\big].$$
With $V_n^o(\gamma) \equiv \mathrm{var}^o(n^{1/2}m_n^o(\gamma))$, the Bates-White correspondences are now

$$H_n^o(\gamma) = Q_n^o(\gamma)'V_n^o(\gamma)^{-1}Q_n^o(\gamma),$$
$$s_n^o(\gamma) = Q_n^o(\gamma)'V_n^o(\gamma)^{-1}m_n^o(\gamma),$$
$$I_n^o(\gamma) = H_n^o(\gamma).$$

(As before, we set $P_n^o = V_n^{o-1}$.) By argument parallel to that following Proposition 4.55, the sufficient condition for efficiency is

$$Q_n^o(\gamma) = C_n^o(\gamma, \gamma^*)\,V_n^o(\gamma^*)^{-1}Q_n^o(\gamma^*),$$

where now $C_n^o(\gamma, \gamma^*) \equiv \mathrm{cov}^o\big(n^{1/2}m_n^o(\gamma), n^{1/2}m_n^o(\gamma^*)\big)$. Following arguments exactly parallel to those leading to Equation (4.3), we find that the sufficient condition holds if

$$Q_n^o(\gamma^*)'V_n^o(\gamma^*)^{-1}Z_{t2}^o(\gamma_2^*)\,\zeta(Z_{t1}^o(\gamma_1^*), \gamma_3^o)\,\Omega_t^o = X_t^o.$$

Choosing $Z_{t2}^o(\gamma_2^*) = X_t^o$ and $\zeta(Z_{t1}^o(\gamma_1^*), \gamma_3^o) = \Omega_t^{o-1}$ can be verified to ensure $Q_n^o(\gamma^*)' = V_n^o(\gamma^*)$ as before, so optimality follows. Thus, the optimal choice for $Z_{nt}^o(\gamma)$ is

$$Z_{nt}^o(\gamma^*) = X_t^o\hat\Omega_{nt}^{-1},$$

where $\hat\Omega_{nt}^{-1} \equiv \zeta(Z_{t1}^o(\gamma_1^*), \gamma_{3n})$ is such that $\zeta(Z_{t1}^o(\gamma_1^*), \gamma_3^o) = \Omega_t^{o-1}$. In particular, if $\Omega_t^o = h(W_t^o, \alpha^o)$, then taking $\zeta(z_1, \gamma_{3n}) = \gamma_{3n}(z_1)$, $Z_{t1}^o(\gamma_1^*) = W_t^o$ and $\gamma_3^o = [h(\cdot, \alpha^o)]^{-1}$ gives the optimal choice of instrumental variables. We formalize this result as follows.
Theorem 4.59 (Efficiency of FGLS) Suppose the conditions of Theorem 4.54 hold and that in addition $\mathcal{P}$ is such that for each $P^o$ in $\mathcal{P}$

(i) $Y_t^o = X_t^{o\prime}\beta^o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta^o \in \mathbb{R}^k$, where $\{(X_{t+1}^o, \varepsilon_t), \mathcal{F}_t\}$ is an adapted stochastic sequence with $E^o(\varepsilon_t|\mathcal{F}_{t-1}) = 0$, $t = 1, 2, \ldots$;

(ii) $E^o(\varepsilon_t\varepsilon_t'|\mathcal{F}_{t-1}) = \Omega_t^o$ is an unknown finite and nonsingular $p \times p$ matrix, $t = 1, 2, \ldots$;
(iii) for each $\gamma \in \Gamma$,

$$Z_{nt}^o(\gamma) = Z_{t2}^o(\gamma_2)\,\gamma_{3n}(Z_{t1}^o(\gamma_1)),$$

where $\gamma \equiv (\gamma_1, \gamma_2, \gamma_3)$, and $\gamma_3 \equiv \{\gamma_{3n}\}$ is such that $\gamma_{3n}$ is $\mathcal{F}$-measurable for all $n = 1, 2, \ldots$, and $Z_{t1}^o(\gamma_1)$ and $Z_{t2}^o(\gamma_2)$ are $\mathcal{F}_{t-1}$-measurable for all $\gamma_1$ and $\gamma_2$, $t = 1, 2, \ldots$; and there exists $\gamma_3^o$ such that

$$m_n^o(\gamma) = n^{-1}\sum_{t=1}^n Z_{t2}^o(\gamma_2)\,\gamma_3^o(Z_{t1}^o(\gamma_1))\,\varepsilon_t$$

and

$$Q_n^o(\gamma) = n^{-1}\sum_{t=1}^n E\big(Z_{t2}^o(\gamma_2)\,\gamma_3^o(Z_{t1}^o(\gamma_1))\,X_t^{o\prime}\big);$$

(iv) for some $\gamma_1^*$ and $\gamma_3$, $\hat\Omega_{nt}^{-1} = \gamma_{3n}(Z_{t1}^o(\gamma_1^*))$, where $\gamma_3^o(Z_{t1}^o(\gamma_1^*)) = \Omega_t^{o-1}$.
Then

(a) $Z_{nt}^o(\gamma^*) = X_t^o\hat\Omega_{nt}^{-1}$ delivers

$$\ddot\beta_n = \Big(\sum_{t=1}^n X_t^o\hat\Omega_{nt}^{-1}X_t^{o\prime}\Big)^{-1}\sum_{t=1}^n X_t^o\hat\Omega_{nt}^{-1}Y_t^o,$$

which is asymptotically efficient in $\mathcal{E}$ for $\mathcal{P}$;

(b) the asymptotic covariance matrix of $\ddot\beta_n$ is given by

$$\mathrm{avar}^o(\ddot\beta_n) = \Big[n^{-1}\sum_{t=1}^n E^o\big(X_t^o\Omega_t^{o-1}X_t^{o\prime}\big)\Big]^{-1},$$

which is consistently estimated by

$$\hat D_n = \Big[n^{-1}\sum_{t=1}^n X_t^o\hat\Omega_{nt}^{-1}X_t^{o\prime}\Big]^{-1},$$

i.e., $\hat D_n - \mathrm{avar}^o(\ddot\beta_n) \xrightarrow{P^o} 0$.
Proof. (a) Apply Proposition 4.55 and the argument preceding the statement of Theorem 4.59. We leave the proof of part (b) as an exercise. $\blacksquare$
Exercise 4.60 Prove part (b) of Theorem 4.59.
The arguments needed to verify that

$$n^{-1/2}\sum_{t=1}^n \gamma_{t2}^o\,\big[\gamma_{3n}(\gamma_{t1}^o) - \gamma_3^o(\gamma_{t1}^o)\big]\varepsilon_t = o_{P^o}(1)$$

and

$$n^{-1}\sum_{t=1}^n \gamma_{t2}^o\,\big[\gamma_{3n}(\gamma_{t1}^o) - \gamma_3^o(\gamma_{t1}^o)\big]X_t^{o\prime} = o_{P^o}(1)$$

depend on what is known about $\Omega_t^o$ and how it is estimated. The next exercise illustrates one possibility.
Exercise 4.61 Suppose that conditions 4.59 (i) and 4.59 (ii) hold with $p = 1$ and that

$$\Omega_t^o = \alpha_1^o + \alpha_2^o W_t^o, \qquad t = 1, 2, \ldots,$$

where $W_t^o = 1$ during recessions and $W_t^o = 0$ otherwise.

(i) Let $\hat\varepsilon_t = Y_t^o - X_t^{o\prime}\hat\beta_n$, where $\hat\beta_n$ is the OLS estimator, and let $\hat\alpha_n = (\hat\alpha_{1n}, \hat\alpha_{2n})'$ be the OLS estimator from the regression of $\hat\varepsilon_t^2$ on a constant and $W_t^o$. Prove that $\hat\alpha_n = \alpha^o + o_{P^o}(1)$, where $\alpha^o = (\alpha_1^o, \alpha_2^o)'$, providing any needed additional conditions.

(ii) Next, verify that conditions 4.59 (iii) and (iv) hold for

$$Z_{nt}^o(\gamma^*) = X_t^o\,(\hat\alpha_{1n} + \hat\alpha_{2n}W_t^o)^{-1}.$$
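Exercise 4.61 can be mimicked in a small simulation. The sketch below (all numerical values are illustrative assumptions of ours, not from the text) runs the two steps: OLS residuals feed a regression of $\hat\varepsilon_t^2$ on $(1, W_t^o)$ to obtain $\hat\alpha_n$, and the fitted variances then weight the second-step estimator:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2_000

# Scalar model Y_t = X_t * beta0 + e_t with Omega_t = a1 + a2 * W_t,
# W_t = 1 in "recessions" (here a random indicator), 0 otherwise.
X = 1.0 + rng.uniform(size=n)
W = (rng.uniform(size=n) < 0.3).astype(float)
a1, a2, beta0 = 0.5, 2.0, 1.0
e = rng.normal(size=n) * np.sqrt(a1 + a2 * W)
Y = X * beta0 + e

# Step 1: OLS, then regress squared residuals on (1, W_t) for (a1, a2).
beta_ols = (X @ Y) / (X @ X)
ehat2 = (Y - X * beta_ols) ** 2
D = np.column_stack([np.ones(n), W])
a_hat = np.linalg.lstsq(D, ehat2, rcond=None)[0]   # (a1_hat, a2_hat)

# Step 2: FGLS with Z_nt = X_t / (a1_hat + a2_hat * W_t).
omega_hat = a_hat[0] + a_hat[1] * W
beta_fgls = np.sum(X * Y / omega_hat) / np.sum(X * X / omega_hat)
print(a_hat, beta_fgls)
```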
When less is known about the structure of $\Omega_t^o$, nonparametric estimators can be used. See, for example, White and Stinchcombe (1991). The notion of stochastic equicontinuity (see Andrews, 1994a, b) plays a key role in such situations, but will not be explored here. We are now in a position to see how to treat the general situation in which $\bar X_t^o \equiv E^o(X_t^o|\mathcal{F}_{t-1})$ appears instead of $X_t^o$. By taking

$$Z_{nt}^o(\gamma) = \gamma_{4n}(Z_{t2}^o(\gamma_2))\,\gamma_{3n}(Z_{t1}^o(\gamma_1)),$$

we can accommodate the choice

$$Z_{nt}^o(\gamma^*) = \hat X_{nt}\hat\Omega_{nt}^{-1},$$
where $\hat X_{nt} = \gamma_{4n}(Z_{t2}^o(\gamma_2)) = \gamma_{4n}(\gamma_{t2}^o)$ is an estimator of $\bar X_t^o$. As we might now expect, this choice delivers the optimal IV estimator. To see this, we examine $m_n^o$ and $Q_n^o$. We have

$$\begin{aligned}
n^{-1/2}\sum_{t=1}^n Z_{nt}^o(\gamma)\varepsilon_t
&= n^{-1/2}\sum_{t=1}^n \gamma_{4n}(\gamma_{t2}^o)\,\gamma_{3n}(\gamma_{t1}^o)\,\varepsilon_t \\
&= n^{-1/2}\sum_{t=1}^n \gamma_4^o(\gamma_{t2}^o)\,\gamma_3^o(\gamma_{t1}^o)\,\varepsilon_t + o_{P^o}(1) \\
&= n^{1/2}m_n^o(\gamma) + o_{P^o}(1),
\end{aligned}$$

provided we set $m_n^o(\gamma) = n^{-1}\sum_{t=1}^n \gamma_4^o(\gamma_{t2}^o)\,\gamma_3^o(\gamma_{t1}^o)\,\varepsilon_t$ and

$$n^{-1/2}\sum_{t=1}^n \big[\gamma_{4n}(\gamma_{t2}^o)\,\gamma_{3n}(\gamma_{t1}^o) - \gamma_4^o(\gamma_{t2}^o)\,\gamma_3^o(\gamma_{t1}^o)\big]\varepsilon_t = o_{P^o}(1).$$

Similarly,

$$\begin{aligned}
n^{-1}\sum_{t=1}^n Z_{nt}^o(\gamma)X_t^{o\prime}
&= n^{-1}\sum_{t=1}^n \gamma_{4n}(\gamma_{t2}^o)\,\gamma_{3n}(\gamma_{t1}^o)\,X_t^{o\prime} \\
&= n^{-1}\sum_{t=1}^n \gamma_4^o(\gamma_{t2}^o)\,\gamma_3^o(\gamma_{t1}^o)\,X_t^{o\prime} + n^{-1}\sum_{t=1}^n \big[\gamma_{4n}(\gamma_{t2}^o)\,\gamma_{3n}(\gamma_{t1}^o) - \gamma_4^o(\gamma_{t2}^o)\,\gamma_3^o(\gamma_{t1}^o)\big]X_t^{o\prime}.
\end{aligned}$$

Provided the second term vanishes in probability-$P^o$ and a law of large numbers applies to the first term, we can take

$$Q_n^o(\gamma) = n^{-1}\sum_{t=1}^n E^o\big(\gamma_4^o(\gamma_{t2}^o)\,\gamma_3^o(\gamma_{t1}^o)\,X_t^{o\prime}\big).$$
With $V_n^o(\gamma)$ again $\equiv \mathrm{var}^o(n^{1/2}m_n^o(\gamma))$, the Bates-White correspondences are

$$H_n^o(\gamma) = Q_n^o(\gamma)'V_n^o(\gamma)^{-1}Q_n^o(\gamma),$$
$$s_n^o(\gamma) = Q_n^o(\gamma)'V_n^o(\gamma)^{-1}m_n^o(\gamma),$$
$$I_n^o(\gamma) = H_n^o(\gamma),$$
and the sufficient condition for asymptotic efficiency is again

$$Q_n^o(\gamma) = C_n^o(\gamma, \gamma^*)\,V_n^o(\gamma^*)^{-1}Q_n^o(\gamma^*).$$

Arguments parallel to those leading to Equation (4.3) show that the sufficient condition holds if

$$Q_n^o(\gamma^*)'V_n^o(\gamma^*)^{-1}\gamma_4^o(Z_{t2}^o(\gamma_2^*))\,\gamma_3^o(Z_{t1}^o(\gamma_1^*))\,\Omega_t^o = \bar X_t^o.$$

It suffices that $\gamma_3^o(Z_{t1}^o(\gamma_1^*)) = \Omega_t^{o-1}$ and $\gamma_4^o(Z_{t2}^o(\gamma_2^*)) = \bar X_t^o$, as this ensures $Q_n^o(\gamma^*)' = V_n^o(\gamma^*)$. We can now state our asymptotic efficiency result for the IV estimator.

Theorem 4.62 (Efficient IV) Suppose the conditions of Theorem 4.54 hold and that in addition $\mathcal{P}$ is such that for each $P^o$ in $\mathcal{P}$
(i) $Y_t^o = X_t^{o\prime}\beta^o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta^o \in \mathbb{R}^k$, where $\{\varepsilon_t, \mathcal{F}_t\}$ is an adapted stochastic sequence with $E^o(\varepsilon_t|\mathcal{F}_{t-1}) = 0$, $t = 1, 2, \ldots$;

(ii) $E^o(\varepsilon_t\varepsilon_t'|\mathcal{F}_{t-1}) = \Omega_t^o$ is an unknown finite and nonsingular $p \times p$ matrix, $t = 1, 2, \ldots$;

(iii) for each $\gamma \in \Gamma$,

$$Z_{nt}^o(\gamma) = \gamma_{4n}(Z_{t2}^o(\gamma_2))\,\gamma_{3n}(Z_{t1}^o(\gamma_1)),$$

where $\gamma = (\gamma_1, \gamma_2, \gamma_3, \gamma_4)$, and $\gamma_3 \equiv \{\gamma_{3n}\}$ and $\gamma_4 \equiv \{\gamma_{4n}\}$ are such that $\gamma_{3n}$ and $\gamma_{4n}$ are $\mathcal{F}$-measurable for all $n = 1, 2, \ldots$, and $Z_{t1}^o(\gamma_1)$ and $Z_{t2}^o(\gamma_2)$ are $\mathcal{F}_{t-1}$-measurable for all $\gamma_1$ and $\gamma_2$, $t = 1, 2, \ldots$; and there exist $\gamma_3^o$ and $\gamma_4^o$ such that

$$m_n^o(\gamma) = n^{-1}\sum_{t=1}^n \gamma_4^o(Z_{t2}^o(\gamma_2))\,\gamma_3^o(Z_{t1}^o(\gamma_1))\,\varepsilon_t$$

and

$$Q_n^o(\gamma) = n^{-1}\sum_{t=1}^n E^o\big(\gamma_4^o(Z_{t2}^o(\gamma_2))\,\gamma_3^o(Z_{t1}^o(\gamma_1))\,X_t^{o\prime}\big);$$

(iv) for some $\gamma_1^*$ and $\gamma_3$, $\hat\Omega_{nt}^{-1} = \gamma_{3n}(Z_{t1}^o(\gamma_1^*))$, where $\gamma_3^o(Z_{t1}^o(\gamma_1^*)) = \Omega_t^{o-1}$; and for some $\gamma_2^*$ and $\gamma_4$, $\hat X_{nt} = \gamma_{4n}(Z_{t2}^o(\gamma_2^*))$, where $\gamma_4^o(Z_{t2}^o(\gamma_2^*)) = \bar X_t^o \equiv E^o(X_t^o|\mathcal{F}_{t-1})$.

Then
(a) $Z_{nt}^o(\gamma^*) = \hat X_{nt}\hat\Omega_{nt}^{-1}$ delivers

$$\ddot\beta_n = \Big(\sum_{t=1}^n \hat X_{nt}\hat\Omega_{nt}^{-1}X_t^{o\prime}\Big)^{-1}\sum_{t=1}^n \hat X_{nt}\hat\Omega_{nt}^{-1}Y_t^o,$$

which is asymptotically efficient in $\mathcal{E}$ for $\mathcal{P}$;

(b) the asymptotic covariance matrix of $\ddot\beta_n$ is given by

$$\mathrm{avar}^o(\ddot\beta_n) = \Big[n^{-1}\sum_{t=1}^n E^o\big(\bar X_t^o\,\Omega_t^{o-1}X_t^{o\prime}\big)\Big]^{-1},$$

which is consistently estimated by

$$\hat D_n = \Big[n^{-1}\sum_{t=1}^n \hat X_{nt}\hat\Omega_{nt}^{-1}X_t^{o\prime}\Big]^{-1}.$$
Proof. (a) Apply Proposition 4.55 and the argument preceding the statement of Theorem 4.62. We leave the proof of part (b) as an exercise. $\blacksquare$

Exercise 4.63 Prove Theorem 4.62 (b).
Other consistent estimators for $\mathrm{avar}^o(\ddot\beta_n)$ are available, for example

$$\tilde D_n = \Big[n^{-1}\sum_{t=1}^n \hat X_{nt}\hat\Omega_{nt}^{-1}\hat X_{nt}'\Big]^{-1}.$$

This requires additional conditions, however. The estimator $\hat D_n$ above is consistent without further assumptions.
Exercise 4.64 Suppose conditions 4.62 (i) and (ii) hold and that

$$\bar X_t^{o\prime} = Z_t^{o\prime}\pi^o, \qquad t = 1, 2, \ldots,$$

where $\pi^o$ is an $l \times k$ matrix.

(i) (a) Let $\hat\pi_n = \big(n^{-1}\sum_{t=1}^n Z_t^o Z_t^{o\prime}\big)^{-1}n^{-1}\sum_{t=1}^n Z_t^o X_t^{o\prime}$. Give simple conditions under which $\hat\pi_n = \pi^o + o_{P^o}(1)$.
(b) Suppose in addition that $\Omega_t^o = \Sigma^o$, an unknown constant matrix, and let $\hat\varepsilon_t = Y_t^o - X_t^{o\prime}\hat\beta_n$, where $\hat\beta_n$ is the IV estimator using instruments $\hat\pi_n'Z_t^o$ (the 2SLS estimator), and $\hat\Sigma_n \equiv n^{-1}\sum_{t=1}^n \hat\varepsilon_t\hat\varepsilon_t'$. Give simple conditions under which $\hat\Sigma_n = \Sigma^o + o_{P^o}(1)$.

(ii) Next, verify that conditions 4.62 (iii) and 4.62 (iv) hold for $Z_{nt}^o(\gamma^*) = \hat\pi_n'Z_t^o\hat\Sigma_n^{-1}$. The resulting estimator is the three stage least squares (3SLS) estimator.
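The first-stage construction in Exercise 4.64 is the familiar 2SLS recipe, which a short simulation makes concrete. In this self-contained sketch (a hypothetical data generating process of our choosing, not from the text), one endogenous regressor satisfies $\bar X_t^o{}' = Z_t^o{}'\pi^o$, OLS is inconsistent, and the instrument-based estimator recovers $\beta^o$:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5_000

# One endogenous regressor, two instruments: X_t = Z_t' pi0 + v_t,
# with e_t correlated with v_t, so OLS is inconsistent but IV is not.
Z = rng.normal(size=(n, 2))
pi0 = np.array([1.0, 0.5])
common = rng.normal(size=n)
v = common + 0.5 * rng.normal(size=n)
e = common + 0.5 * rng.normal(size=n)     # corr(e, v) > 0: endogeneity
X = Z @ pi0 + v
beta0 = 2.0
Y = X * beta0 + e

# First stage: pi_hat = (Z'Z)^{-1} Z'X; second stage: use X_hat = Z pi_hat
# as the instrument for X.
pi_hat = np.linalg.solve(Z.T @ Z, Z.T @ X)
X_hat = Z @ pi_hat                        # estimate of E(X_t | Z_t)
beta_2sls = (X_hat @ Y) / (X_hat @ X)
beta_ols = (X @ Y) / (X @ X)
print(beta_ols, beta_2sls)                # OLS biased up, 2SLS near 2.0
```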
References

Andrews, D. W. K. (1994a). "Asymptotics for semiparametric econometric models via stochastic equicontinuity." Econometrica, 62, 43-72.

——— (1994b). "Empirical process methods in econometrics." In Handbook of Econometrics, 4, R. F. Engle and D. L. McFadden, eds., pp. 2247-2294. North Holland, Amsterdam.

Bartle, R. G. (1976). The Elements of Real Analysis. Wiley, New York.

Bates, C. E. and H. White (1993). "Determination of estimators with minimum asymptotic covariance matrices." Econometric Theory, 9, 633-648.

Bollerslev, T. (1986). "Generalized autoregressive conditional heteroskedasticity." Journal of Econometrics, 31, 307-327.

Chamberlain, G. (1982). "Multivariate regression models for panel data." Journal of Econometrics, 18, 5-46.

Cragg, J. (1983). "More efficient estimation in the presence of heteroskedasticity of unknown form." Econometrica, 51, 751-764.

Engle, R. F. (1981). "Wald, likelihood ratio, and Lagrange multiplier tests in econometrics." In Handbook of Econometrics, 2, Z. Griliches and M. Intriligator, eds., North Holland, Amsterdam.

——— (1982). "Autoregressive conditional heteroskedasticity with estimates of the variance of United Kingdom inflation." Econometrica, 50, 987-1008.

Goldberger, A. S. (1964). Econometric Theory. Wiley, New York.

Hansen, L. P. (1982). "Large sample properties of generalized method of moments estimators." Econometrica, 50, 1029-1054.

Lukacs, E. (1970). Characteristic Functions. Griffin, London.

——— (1975). Stochastic Convergence. Academic Press, New York.

Rao, C. R. (1973). Linear Statistical Inference and Its Applications. Wiley, New York.
4. Asymptotic Normality
White, H. (1982). "Instrumental variables regression with independent observations." Econometrica, 50, 483-500.
- - - (1994). Estimation, Inference and Specification Analysis. Cambridge University Press, New York.
- - - and M. Stinchcombe (1991). "Adaptive efficient weighted least squares with dependent observations." In Directions in Robust Statistics and Diagnostics, W. Stahel and S. Weisberg, eds., pp. 337-364. IMA Volumes in Mathematics and Its Applications. Springer-Verlag, New York.
CHAPTER 5
Central Limit Theory
In this chapter we study different versions of the central limit theorem that provide conditions guaranteeing the asymptotic normality of $n^{-1/2}X'\varepsilon$ or $n^{-1/2}Z'\varepsilon$ required for the results of the previous chapter. As with laws of large numbers, different conditions will apply to different kinds of economic data. Central limit results are generally available for each of the situations considered in Chapter 3, and we shall pay particular attention to the parallels involved.

The central limit theorems we consider are all of the following form.

Proposition 5.0 Given restrictions on the moments, dependence, and heterogeneity of a scalar sequence $\{Z_t\}$,

$$(\bar{Z}_n - \bar{\mu}_n)/(\bar{\sigma}_n/\sqrt{n}) = \sqrt{n}(\bar{Z}_n - \bar{\mu}_n)/\bar{\sigma}_n \stackrel{A}{\sim} N(0,1),$$

where $\bar{\mu}_n \equiv E(\bar{Z}_n)$ and $\bar{\sigma}_n^2/n \equiv \mathrm{var}(\bar{Z}_n)$.

In other words, under general conditions the sample average of a sequence has a limiting unit normal distribution when appropriately standardized. The results that follow specify precisely the restrictions that are sufficient to imply asymptotic normality. As with the laws of large numbers, there are natural trade-offs among these restrictions. Typically, greater dependence or heterogeneity is allowed at the expense of imposing more stringent moment requirements.

Although the results of the preceding chapter imposed the asymptotic normality requirement on the joint distribution of vectors such as $n^{-1/2}X'\varepsilon$ or $n^{-1/2}Z'\varepsilon$, it is actually only necessary to study central limit theory for sequences of scalars. This simplicity is a consequence of the following result.
Proposition 5.1 (Cramér-Wold device) Let $\{b_n\}$ be a sequence of random $k \times 1$ vectors and suppose that for any real $k \times 1$ vector $\lambda$ such that $\lambda'\lambda = 1$, $\lambda'b_n \stackrel{A}{\sim} \lambda'Z$, where $Z$ is a $k \times 1$ vector with joint distribution function $F$. Then the limiting distribution function of $b_n$ exists and equals $F$.

Proof. See Rao (1973, p. 123). ∎

We shall apply this result by showing that, under general conditions,

$$n^{-1/2}\sum_{t=1}^n \lambda'V_n^{-1/2}X_t\varepsilon_t \stackrel{A}{\sim} \lambda'Z \quad \text{or} \quad n^{-1/2}\sum_{t=1}^n \lambda'V_n^{-1/2}Z_t\varepsilon_t \stackrel{A}{\sim} \lambda'Z,$$

where $Z \sim N(0, I)$, which, by Proposition 5.1, allows us to obtain the desired conclusion, i.e.,

$$V_n^{-1/2}n^{-1/2}X'\varepsilon \stackrel{A}{\sim} N(0, I) \quad \text{or} \quad V_n^{-1/2}n^{-1/2}Z'\varepsilon \stackrel{A}{\sim} N(0, I).$$

When used in this context below, the vector $\lambda$ will always be understood to have unit norm, i.e., $\lambda'\lambda = 1$.
5.1 Independent Identically Distributed Observations
As with laws of large numbers, the case of independent identically distributed observations is the simplest.

Theorem 5.2 (Lindeberg-Lévy) Let $\{Z_t\}$ be a sequence of i.i.d. random scalars, with $\mu \equiv E(Z_t)$ and $\sigma^2 \equiv \mathrm{var}(Z_t) < \infty$. If $\sigma^2 \neq 0$, then

$$n^{-1/2}\sum_{t=1}^n (Z_t - \mu)/\sigma \stackrel{A}{\sim} N(0, 1).$$
Proof. Let $f(\lambda)$ be the characteristic function of $Z_t - \mu$ and let $f_n(\lambda)$ be the characteristic function of $\sqrt{n}(\bar{Z}_n - \bar{\mu}_n)/\bar{\sigma}_n = n^{-1/2}\sum_{t=1}^n (Z_t - \mu)/\sigma$. From Propositions 4.13 and 4.14 we have

$$\log f_n(\lambda) = n \log f(\lambda/(\sigma\sqrt{n})).$$

Taking a Taylor expansion of $f(\lambda)$ around $\lambda = 0$ gives $f(\lambda) = 1 - \sigma^2\lambda^2/2 + o(\lambda^2)$ since $\sigma^2 < \infty$, by Proposition 4.15. Hence

$$\log f_n(\lambda) = n\log[1 - \lambda^2/(2n) + o(\lambda^2/n)] \to -\lambda^2/2$$

as $n \to \infty$. Hence $f_n(\lambda) \to \exp(-\lambda^2/2)$. Since this is continuous at zero, it follows from the continuity theorem (Theorem 4.17), the uniqueness theorem (Theorem 4.11), and Example 4.10(i) that

$$\sqrt{n}(\bar{Z}_n - \bar{\mu}_n)/\bar{\sigma}_n \stackrel{A}{\sim} N(0, 1). \qquad ∎$$
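The content of Theorem 5.2 is easy to see in simulation: standardized sample means of i.i.d. draws from a decidedly non-normal distribution are approximately standard normal. The sketch below is our own illustration (the exponential distribution, sample size, and replication count are our choices, not the text's); it checks the first two moments and one tail probability of $\sqrt{n}(\bar{Z}_n - \mu)/\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 20000
mu, sigma = 1.0, 1.0                 # exponential(1) has mean 1 and variance 1

# reps independent samples of size n from a skewed distribution
Z = rng.exponential(scale=1.0, size=(reps, n))

# root-n standardized sample means, one per replication
stat = np.sqrt(n) * (Z.mean(axis=1) - mu) / sigma

print(stat.mean(), stat.std())       # both should be near 0 and 1
print((stat < 1.96).mean())          # should be near Phi(1.96), about 0.975
```

Despite the strong skewness of the exponential parent, the standardized mean is already close to $N(0,1)$ at $n = 500$, as the theorem predicts.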
Compared with the law of large numbers for i.i.d. observations, we impose a single additional requirement, i.e., that $\sigma^2 \equiv \mathrm{var}(Z_t) < \infty$. Note that this implies $E|Z_t| < \infty$. (Why?) Also note that without loss of generality, we can set $E(Z_t) = 0$.

We can apply Theorem 5.2 to give conditions that ensure that the conditions of Theorem 4.25 and Exercise 4.26 are satisfied.

Theorem 5.3 Given

(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(X_t', \varepsilon_t)\}$ is an i.i.d. sequence;
(iii) (a) $E(X_t\varepsilon_t) = 0$; (b) $E|X_{thi}\varepsilon_{th}|^2 < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, k$; (c) $V_n \equiv \mathrm{var}(n^{-1/2}X'\varepsilon) = V$ is positive definite;
(iv) (a) $E|X_{thi}|^2 < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, k$; (b) $M \equiv E(X_tX_t')$ is positive definite.

Then $D^{-1/2}\sqrt{n}(\hat{\beta}_n - \beta_o) \stackrel{A}{\sim} N(0, I)$, where $D \equiv M^{-1}VM^{-1}$.

Suppose in addition that

(v) there exists $\hat{V}_n$ symmetric and positive semidefinite such that $\hat{V}_n - V \stackrel{P}{\to} 0$.

Then $\hat{D}_n - D \stackrel{P}{\to} 0$, where $\hat{D}_n \equiv (X'X/n)^{-1}\hat{V}_n(X'X/n)^{-1}$.
Proof. We verify the conditions of Theorem 4.25. To apply Theorem 5.2, set $Z_t = \lambda'V^{-1/2}X_t\varepsilon_t$. The summands $\lambda'V^{-1/2}X_t\varepsilon_t$ are i.i.d. given (ii), with $E(Z_t) = 0$ given (iii.a) and $\mathrm{var}(Z_t) = 1$ given (iii.b) and (iii.c). Hence

$$n^{-1/2}\sum_{t=1}^n Z_t = n^{-1/2}\sum_{t=1}^n \lambda'V^{-1/2}X_t\varepsilon_t \stackrel{A}{\sim} N(0, 1)$$

by the Lindeberg-Lévy Theorem 5.2. It follows from Proposition 5.1 that $V^{-1/2}n^{-1/2}X'\varepsilon \stackrel{A}{\sim} N(0, I)$, where $V$ is $O(1)$ given (iii.b) and positive definite given (iii.c). It follows from Kolmogorov's strong law of large numbers, Theorem 3.1, and from Theorem 2.24 that $X'X/n - M \stackrel{P}{\to} 0$ given (ii) and (iv). Since the rest of the conditions of Theorem 4.25 are satisfied by assumption, the result follows. ∎

In many cases $V$ may simplify. For example, it may be known that $E(\varepsilon_t^2|X_t) = \sigma_o^2$ ($p = 1$). If so,

$$V = E(X_t\varepsilon_t\varepsilon_tX_t') = E(\varepsilon_t^2X_tX_t') = E(E(\varepsilon_t^2X_tX_t'|X_t)) = E(E(\varepsilon_t^2|X_t)X_tX_t') = \sigma_o^2E(X_tX_t') = \sigma_o^2M.$$

The obvious estimator for $V$ is then $\hat{V}_n = \hat{\sigma}_n^2(X'X/n)$, where $\hat{\sigma}_n^2$ is consistent for $\sigma_o^2$. A similar result holds for systems of equations in which it is known that $E(\varepsilon_t\varepsilon_t'|X_t) = I$ (after suitable transformation of an underlying DGP). Then $V = M$ and a consistent estimator is $\hat{V}_n = (X'X/n)$. Consistency results for more general cases are studied in the next chapter.

In comparison with the consistency result for the OLS estimator, we have obtained the asymptotic normality result by imposing the additional second moment conditions of (iii.b) and (iii.c). Otherwise, the conditions are identical. A similar result holds for the IV estimator.
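Both estimators of $V$ just discussed are easy to compute. The sketch below is our own illustration ($p = 1$, simulated data, our parameter choices): it compares the homoskedasticity-based estimator $\hat{\sigma}_n^2(X'X/n)$ with the estimator $n^{-1}\sum_t \hat{\varepsilon}_t^2 X_tX_t'$, which remains consistent for $V$ under heteroskedasticity. Under homoskedastic errors, as here, the two should agree closely.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # constant + one regressor
beta_o = np.array([1.0, 2.0])
eps = rng.normal(size=n)                                # homoskedastic errors
y = X @ beta_o + eps

b = np.linalg.solve(X.T @ X, X.T @ y)                   # OLS
e = y - X @ b
sig2 = e @ e / n                                        # sigma-hat^2

V_hom = sig2 * (X.T @ X) / n                            # sigma^2 * (X'X/n)
Xe = X * e[:, None]
V_rob = Xe.T @ Xe / n                                   # n^{-1} sum e_t^2 x_t x_t'

print(np.round(V_hom, 2))
print(np.round(V_rob, 2))
```

With heteroskedastic errors (e.g., error variance depending on the regressor), the two matrices would diverge, and only the second remains consistent; that case is taken up in Chapter 6.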
Exercise 5.4 Prove the following result. Given

(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is an i.i.d. sequence;
(iii) (a) $E(Z_t\varepsilon_t) = 0$; (b) $E|Z_{thi}\varepsilon_{th}|^2 < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, l$; (c) $V_n \equiv \mathrm{var}(n^{-1/2}Z'\varepsilon) = V$ is positive definite;
(iv) (a) $E|Z_{thi}X_{thj}| < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, l$, and $j = 1, \ldots, k$; (b) $Q \equiv E(Z_tX_t')$ has full column rank; (c) $\hat{P}_n \stackrel{P}{\to} P$, finite, symmetric, and positive definite.

Then $D^{-1/2}\sqrt{n}(\tilde{\beta}_n - \beta_o) \stackrel{A}{\sim} N(0, I)$, where

$$D \equiv (Q'PQ)^{-1}Q'PVPQ(Q'PQ)^{-1}.$$

Suppose further that

(v) there exists $\hat{V}_n$ symmetric and positive semidefinite such that $\hat{V}_n - V \stackrel{P}{\to} 0$.

Then $\hat{D}_n - D \stackrel{P}{\to} 0$, where

$$\hat{D}_n \equiv (X'Z\hat{P}_nZ'X/n^2)^{-1}(X'Z/n)\hat{P}_n\hat{V}_n\hat{P}_n(Z'X/n)(X'Z\hat{P}_nZ'X/n^2)^{-1}.$$
Exercise 5.5 If $p = 1$ and $E(\varepsilon_t^2|Z_t) = \sigma_o^2$, what is the efficient IV estimator? What is the natural estimator for $V$? What additional conditions ensure that $\hat{P}_n - P \stackrel{P}{\to} 0$ and $\hat{V}_n - V \stackrel{P}{\to} 0$?

These results apply to observations from a random sample. However, they do not apply to situations such as the standard regression model with fixed regressors, or to stratified cross sections, because in these situations the elements of the sum $n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t$ are no longer identically distributed. For example, with $X_t$ fixed and $E(\varepsilon_t^2) = \sigma_o^2$, $\mathrm{var}(X_t\varepsilon_t) = \sigma_o^2X_tX_t'$, which depends on $X_tX_t'$ and hence differs from observation to observation. For these cases we need to relax the identical distribution assumption.
5.2 Independent Heterogeneously Distributed Observations
Several different central limit theorems are available for the case in which our observations are not identically distributed. The most general result is in fact the centerpiece of all asymptotic distribution theory.

Theorem 5.6 (Lindeberg-Feller) Let $\{Z_t\}$ be a sequence of independent random scalars with $\mu_t \equiv E(Z_t)$, $\sigma_t^2 \equiv \mathrm{var}(Z_t) < \infty$, $\sigma_t^2 \neq 0$, and distribution functions $F_t$, $t = 1, 2, \ldots$. Then

$$\sqrt{n}(\bar{Z}_n - \bar{\mu}_n)/\bar{\sigma}_n \stackrel{A}{\sim} N(0, 1) \quad \text{and} \quad \max_{1 \le t \le n}\, \sigma_t^2/(n\bar{\sigma}_n^2) \to 0$$

if and only if for every $\varepsilon > 0$,

$$\lim_{n \to \infty}\, \bar{\sigma}_n^{-2}\, n^{-1}\sum_{t=1}^n \int_{(z - \mu_t)^2 > \varepsilon n\bar{\sigma}_n^2} (z - \mu_t)^2\, dF_t(z) = 0.$$

Proof. See Loève (1977, pp. 292-294). ∎
The last condition of this result is called the Lindeberg condition. It essentially requires the average contribution of the extreme tails to the variance of $Z_t$ to be zero in the limit. When the Lindeberg condition holds, not only does asymptotic normality follow, but so does the uniform asymptotic negligibility condition $\max_{1 \le t \le n}\sigma_t^2/(n\bar{\sigma}_n^2) \to 0$, which ensures that no single observation dominates the behavior of the sum.

Because the Lindeberg condition can be difficult to verify directly, it is convenient to have a simpler sufficient moment condition available.

Theorem 5.10 (Liapounov)¹ Let $\{Z_t\}$ be a sequence of independent random scalars with $\mu_t \equiv E(Z_t)$, $\sigma_t^2 \equiv \mathrm{var}(Z_t)$, and $E|Z_t - \mu_t|^{2+\delta} < \Delta < \infty$ for some $\delta > 0$ and all $t$. If $\bar{\sigma}_n^2 > \delta' > 0$ for all $n$ sufficiently large, then

$$\sqrt{n}(\bar{Z}_n - \bar{\mu}_n)/\bar{\sigma}_n \stackrel{A}{\sim} N(0, 1).$$

¹As stated, this result is actually a corollary to Liapounov's original theorem. See Loève (1977, p. 287).
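The Lindeberg condition can also be evaluated numerically. The sketch below is our own construction (the cycling variances and the value of $\varepsilon$ are illustrative assumptions): for independent $Z_t \sim N(0, \sigma_t^2)$ with bounded heterogeneous variances, the Lindeberg sum is computed in closed form for normal tails and shrinks rapidly as $n$ grows, as the discussion above leads one to expect.

```python
import math

def tail_second_moment(sigma2, c):
    # E[Z^2 * 1{|Z| > c}] for Z ~ N(0, sigma2):
    # 2 * integral_a^inf z^2 phi(z) dz = 2*a*phi(a) + 2*(1 - Phi(a)), a = c/sigma
    a = c / math.sqrt(sigma2)
    phi = math.exp(-a * a / 2.0) / math.sqrt(2.0 * math.pi)
    return sigma2 * (2.0 * a * phi + math.erfc(a / math.sqrt(2.0)))

def lindeberg_sum(n, eps=0.1):
    # variances cycle through {1, 4}: heterogeneous but bounded
    sig2 = [1.0 if t % 2 == 0 else 4.0 for t in range(n)]
    var_bar = sum(sig2) / n                      # sigma-bar_n^2
    cut = math.sqrt(eps * n * var_bar)           # truncation point: z^2 > eps*n*var_bar
    return sum(tail_second_moment(s, cut) for s in sig2) / (n * var_bar)

for n in (10, 100, 1000):
    print(n, lindeberg_sum(n))                   # decreases toward zero
```

Bounded variances alone do not make the sum small for fixed $n$; it is the growing truncation point $\varepsilon n \bar{\sigma}_n^2$ that drives the tail contribution to zero.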
Proof. We verify that the Lindeberg condition is satisfied. Define $A = \{z : (z - \mu_t)^2 > \varepsilon n\bar{\sigma}_n^2\}$; then

$$\int_A (z - \mu_t)^2\, dF_t(z) = \int_A |z - \mu_t|^{\delta}\,|z - \mu_t|^{-\delta}(z - \mu_t)^2\, dF_t(z).$$

Whenever $(z - \mu_t)^2 > \varepsilon n\bar{\sigma}_n^2$, it follows that $|z - \mu_t|^{-\delta} < (\varepsilon n\bar{\sigma}_n^2)^{-\delta/2}$, so

$$\int_A (z - \mu_t)^2\, dF_t(z) \le (\varepsilon n\bar{\sigma}_n^2)^{-\delta/2}\int_A |z - \mu_t|^{2+\delta}\, dF_t(z) \le (\varepsilon n\bar{\sigma}_n^2)^{-\delta/2}\,\Delta.$$

Hence

$$\bar{\sigma}_n^{-2}\, n^{-1}\sum_{t=1}^n \int_A (z - \mu_t)^2\, dF_t(z) \le \bar{\sigma}_n^{-2}(\varepsilon n\bar{\sigma}_n^2)^{-\delta/2}\,\Delta \to 0$$

as $n \to \infty$, given $\bar{\sigma}_n^2 > \delta' > 0$. The result then follows from the Lindeberg-Feller theorem. ∎

5.3 Dependent Identically Distributed Observations

... for every $\varepsilon > 0$ there exists $M(\varepsilon)$ such that $0 \le E(|E(Z_t|\mathcal{F}_{t-m})|) < \varepsilon$ for all $m > M(\varepsilon)$. By Jensen's inequality, $|E[E(Z_t|\mathcal{F}_{t-m})]| \le E(|E(Z_t|\mathcal{F}_{t-m})|)$, so $0 \le |E[E(Z_t|\mathcal{F}_{t-m})]| < \varepsilon$ for all $m > M(\varepsilon)$. But by the law of iterated expectations, $E(Z_t) = E[E(Z_t|\mathcal{F}_{t-m})]$, so $0 \le |E(Z_t)| < \varepsilon$. Since $\varepsilon$ is arbitrary, it follows that $E(Z_t) = 0$. ∎

In establishing the central limit result, it is necessary to have $\bar{\sigma}_n^2 = \mathrm{var}(\sqrt{n}\bar{Z}_n)$ finite. However, for this it does not suffice simply to have $\sigma^2 = \mathrm{var}(Z_t)$ finite. Inspecting $\bar{\sigma}_n^2$, we see that
$$\bar{\sigma}_n^2 = n\,\mathrm{var}(\bar{Z}_n) = nE\left[\left(n^{-1}\sum_{t=1}^n Z_t\right)^2\right] = n^{-1}\sum_{t=1}^n E(Z_t^2) + 2n^{-1}\sum_{\tau=1}^{n-1}\sum_{t=\tau+1}^n E(Z_tZ_{t-\tau}).$$

When $Z_t$ is stationary, $\rho_\tau \equiv E(Z_tZ_{t-\tau})/\sigma^2$ does not depend on $t$. Hence,

$$\bar{\sigma}_n^2 = \sigma^2 + 2\sigma^2 n^{-1}\sum_{\tau=1}^{n-1}(n-\tau)\rho_\tau = \sigma^2 + 2\sigma^2\sum_{\tau=1}^{n-1}\rho_\tau(1 - \tau/n).$$

This last term contains a growing number of terms as $n \to \infty$, and without further conditions is not guaranteed to converge. It turns out that the condition

$$\sum_{m=1}^{\infty}\left(E\left[E(Z_0|\mathcal{F}_{-m})^2\right]\right)^{1/2} < \infty$$

suffices to guarantee the needed convergence, and it motivates the following definition. $\{Z_t, \mathcal{F}_t\}$ is an adapted mixingale of size $-a$ if there exist nonnegative constants $\{c_t\}$ and $\{\psi_m\}$, with $\psi_m = O(m^{-a-\varepsilon})$ for some $\varepsilon > 0$, such that $(E[E(Z_t|\mathcal{F}_{t-m})^2])^{1/2} \le c_t\psi_m$ for all $t$ and $m \ge 0$, so that $\psi_m \to 0$.
The notion of a mixingale is due to McLeish (1974). Note that in this definition {Zt} need not be stationary or ergodic, but may be heterogeneous. In the mixingale definition of McLeish, F t need not be adapted to Zt. Nevertheless, we impose this because it simplifies matters nicely and is sufficient for all our applications. As the name is intended to suggest, mixingale processes have attributes of both mixing processes and martingale difference processes. They can be thought of as processes that behave "asymptotically" like martingale difference processes, analogous to mixing processes, which behave "asymptotically" like independent processes.
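As a concrete illustration of the variance expression $\bar{\sigma}_n^2 = \sigma^2 + 2\sigma^2\sum_{\tau=1}^{n-1}\rho_\tau(1 - \tau/n)$, take the stationary AR(1) process $Z_t = \rho Z_{t-1} + u_t$ with $u_t$ i.i.d. $N(0,1)$, so that $\rho_\tau = \rho^\tau$. The sketch below is our own illustration (parameter choices are ours): it compares the simulated $\mathrm{var}(\sqrt{n}\bar{Z}_n)$ with the finite-$n$ formula and its limit $\sigma^2(1+\rho)/(1-\rho)$.

```python
import numpy as np

rng = np.random.default_rng(2)
rho, n, reps, burn = 0.5, 200, 20000, 200
sigma2 = 1.0 / (1.0 - rho**2)            # var(Z_t) for AR(1) with unit innovations

u = rng.normal(size=(reps, burn + n))
Z = np.zeros((reps, burn + n))
for t in range(1, burn + n):
    Z[:, t] = rho * Z[:, t - 1] + u[:, t]
Z = Z[:, burn:]                          # drop burn-in so the process is ~stationary

sim = n * Z.mean(axis=1).var()           # simulated var(sqrt(n) * Zbar_n)

taus = np.arange(1, n)
formula = sigma2 * (1 + 2 * np.sum(rho**taus * (1 - taus / n)))
limit = sigma2 * (1 + rho) / (1 - rho)   # long-run variance, n -> infinity

print(sim, formula, limit)
```

With $\rho = 0.5$ the long-run variance is three times $\mathrm{var}(Z_t)$, so ignoring the covariance terms would badly understate $\bar{\sigma}_n^2$.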
Theorem 5.16 (Scott) Let $\{Z_t, \mathcal{F}_t\}$ be a stationary ergodic adapted mixingale with $\psi_m$ of size $-1$. Then $\bar{\sigma}_n^2 \equiv \mathrm{var}(n^{-1/2}\sum_{t=1}^n Z_t) \to \bar{\sigma}^2 < \infty$ as $n \to \infty$, and if $\bar{\sigma}^2 > 0$, then $n^{1/2}\bar{Z}_n/\bar{\sigma} \stackrel{A}{\sim} N(0, 1)$.

Proof. Under the conditions given, the result follows from Theorem 3 of Scott (1973), provided that

$$\sum_{m=1}^{\infty}\left\{\left(E\left[E(Z_0|\mathcal{F}_{-m})^2\right]\right)^{1/2} + \left(E\left[Z_0 - E(Z_0|\mathcal{F}_m)\right]^2\right)^{1/2}\right\} < \infty.$$

Because $\mathcal{F}_t$ is adapted to $Z_t$, for $m \ge 1$, $E(Z_0|\mathcal{F}_m) = Z_0$ and the second term in the summation vanishes. It suffices then that $\sum_{m=1}^{\infty}(E[E(Z_0|\mathcal{F}_{-m})^2])^{1/2} < \infty$. Applying the mixingale and stationarity conditions, we have

$$\sum_{m=1}^{\infty}\left(E\left[E(Z_0|\mathcal{F}_{-m})^2\right]\right)^{1/2} \le \sum_{m=1}^{\infty} c\,\psi_m < \infty,$$

since $\psi_m$ is of size $-1$. ∎
The summands are

$$Z_t = \lambda'V^{-1/2}X_t\varepsilon_t = \sum_{h=1}^{p}\sum_{i=1}^{k}\bar{\lambda}_iX_{thi}\varepsilon_{th},$$

where $\bar{\lambda}_i$ is the $i$th element of the $k \times 1$ vector $\bar{\lambda} \equiv V^{-1/2}\lambda$. By definition of $\lambda$ and $V$, there exists $\Delta < \infty$ such that $|\bar{\lambda}_i| < \Delta$ for all $i$. It follows from Minkowski's inequality that $E(Z_t^2) < \infty$, and $V_n \to V$, which is positive definite given (iii.c). Then $\bar{\sigma}^2 = \lambda'V^{-1/2}VV^{-1/2}\lambda = 1$. It then follows from Theorem 5.16 that $n^{-1/2}\sum_{t=1}^n\lambda'V^{-1/2}X_t\varepsilon_t \stackrel{A}{\sim} N(0, 1)$. Since this holds for every $\lambda$ such that $\lambda'\lambda = 1$, it follows from Proposition 5.1 that $V^{-1/2}n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t \stackrel{A}{\sim} N(0, I)$. Now

$$V_n^{-1/2}n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t - V^{-1/2}n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t = (V_n^{-1/2}V^{1/2} - I)\,V^{-1/2}n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t \stackrel{P}{\to} 0,$$

since $V_n^{-1/2}V^{1/2} - I$ is $o(1)$ by Definition 2.5 and $V^{-1/2}n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t \stackrel{A}{\sim} N(0, I)$, which allows application of Lemma 4.6. Hence by Lemma 4.7, $V_n^{-1/2}n^{-1/2}X'\varepsilon \stackrel{A}{\sim} N(0, I)$.

Next, $X'X/n - M \stackrel{P}{\to} 0$ by the ergodic theorem (Theorem 3.34) and Theorem 2.24 given (ii) and (iv), where $M$ is finite and positive definite. Since the conditions of Theorem 4.25 are satisfied, it follows that $D_n^{-1/2}\sqrt{n}(\hat{\beta}_n - \beta_o) \stackrel{A}{\sim} N(0, I)$, where $D_n \equiv M^{-1}V_nM^{-1}$. Because $D_n - D \to 0$ as $n \to \infty$, it follows that

$$D^{-1/2}\sqrt{n}(\hat{\beta}_n - \beta_o) - D_n^{-1/2}\sqrt{n}(\hat{\beta}_n - \beta_o) = (D^{-1/2}D_n^{1/2} - I)\,D_n^{-1/2}\sqrt{n}(\hat{\beta}_n - \beta_o) \stackrel{P}{\to} 0$$

by Lemma 4.6. Hence, by Lemma 4.7, $D^{-1/2}\sqrt{n}(\hat{\beta}_n - \beta_o) \stackrel{A}{\sim} N(0, I)$. ∎

Comparing this result with the OLS result in Theorem 5.3 for i.i.d. regressors, we have replaced the i.i.d. assumption with stationarity, ergodicity, and the mixingale requirements of (iii.a). Because these conditions are always satisfied for i.i.d. sequences, Theorem 5.3 is in fact a direct corollary of Theorem 5.17. (Condition (iii.a) is satisfied because for i.i.d. sequences
$E(X_{0hi}\varepsilon_{0h}|\mathcal{F}_{-m}) = 0$ for all $m > 0$.) Note that these conditions impose the restrictions placed on $Z_t$ in Theorem 5.16 for each regressor-error cross product $X_{thi}\varepsilon_{th}$.

Although the present result now allows for the possibility that $X_t$ contains lagged dependent variables $Y_{t-1}, Y_{t-2}, \ldots$, it does not allow $\varepsilon_t$ to be serially correlated at the same time. This is ruled out by (iii.a), which implies $E(X_t\varepsilon_t) = 0$ by Lemma 5.14. This condition will be violated if lagged dependent variables are present when $\varepsilon_t$ is serially correlated. Also note that if lagged dependent variables are present in $X_t$, condition (iv.a) requires that $E(Y_t^2)$ is finite. This in turn places restrictions on the possible values allowed for $\beta_o$.

Exercise 5.18 Suppose that the data are generated as $Y_t = \beta_{o1}Y_{t-1} + \beta_{o2}Y_{t-2} + \varepsilon_t$. State general conditions on $\{Y_t\}$ and $(\beta_{o1}, \beta_{o2})$ that ensure the consistency and asymptotic normality of the OLS estimator for $\beta_{o1}$ and $\beta_{o2}$.

As just mentioned, OLS is inappropriate when the model contains lagged dependent variables in the presence of serially correlated errors. However, useful instrumental variables estimators are often available.
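Returning to Exercise 5.18 for a moment, the AR(2) case is easy to examine numerically. The sketch below is our own illustration (the coefficient values are our choices, picked inside the stationary region, so that the exercise's conditions can plausibly hold): OLS on the two lags recovers the true coefficients.

```python
import numpy as np

rng = np.random.default_rng(3)
b1, b2 = 0.5, -0.3            # chosen so the AR(2) process is stationary
n, burn = 20000, 500

eps = rng.normal(size=burn + n)
y = np.zeros(burn + n)
for t in range(2, burn + n):
    y[t] = b1 * y[t - 1] + b2 * y[t - 2] + eps[t]
y = y[burn:]                  # discard burn-in

# regress Y_t on (Y_{t-1}, Y_{t-2})
X = np.column_stack([y[1:-1], y[:-2]])
Y = y[2:]
bhat = np.linalg.solve(X.T @ X, X.T @ Y)
print(bhat)                   # close to (0.5, -0.3)
```

Had the errors been serially correlated, the lagged dependent variables would be correlated with the errors and OLS would be inconsistent, which is exactly the situation the IV results below address.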
Exercise 5.19 Prove the following result. Given

(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is a stationary ergodic sequence;
(iii) (a) $\{Z_{thi}\varepsilon_{th}, \mathcal{F}_t\}$ is an adapted mixingale of size $-1$, $h = 1, \ldots, p$, $i = 1, \ldots, l$; (b) $E|Z_{thi}\varepsilon_{th}|^2 < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, l$; (c) $V_n = \mathrm{var}(n^{-1/2}Z'\varepsilon)$ is uniformly positive definite;
(iv) (a) $E|Z_{thi}X_{thj}| < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, l$, and $j = 1, \ldots, k$; (b) $Q \equiv E(Z_tX_t')$ has full column rank; (c) $\hat{P}_n \stackrel{P}{\to} P$, finite, symmetric, and positive definite.

Then $V_n \to V$, finite and positive definite as $n \to \infty$, and $D^{-1/2}\sqrt{n}(\tilde{\beta}_n - \beta_o) \stackrel{A}{\sim} N(0, I)$, where

$$D \equiv (Q'PQ)^{-1}Q'PVPQ(Q'PQ)^{-1}.$$

Suppose further that

(v) there exists $\hat{V}_n$ symmetric and positive semidefinite such that $\hat{V}_n - V \stackrel{P}{\to} 0$.
Then $\hat{D}_n - D \stackrel{P}{\to} 0$, where

$$\hat{D}_n \equiv (X'Z\hat{P}_nZ'X/n^2)^{-1}(X'Z/n)\hat{P}_n\hat{V}_n\hat{P}_n(Z'X/n)(X'Z\hat{P}_nZ'X/n^2)^{-1}.$$

This result follows as a corollary to a more general theorem for nonlinear equations given by Hansen (1982). However, all the essential features of his assumptions are illustrated in the present result.

Since the results of this section are based on a stationarity assumption, unconditional heteroskedasticity is explicitly ruled out. However, conditional heteroskedasticity is nevertheless a possibility, so efficiency improvements along the lines of Theorem 4.55 may be obtained by accounting for conditional heteroskedasticity.
5.4 Dependent Heterogeneously Distributed Observations
To allow for situations in which the errors exhibit unconditional heteroskedasticity, or the explanatory variables contain fixed as well as lagged dependent variables, we apply central limit results for sequences of mixing random variables. A convenient version of the Liapounov theorem for mixing processes is the following.

Theorem 5.20 (Wooldridge-White) Let $\{Z_{nt}\}$ be a double array of scalars with $\mu_{nt} \equiv E(Z_{nt}) = 0$ and $\sigma_{nt}^2 \equiv \mathrm{var}(Z_{nt})$ such that $E|Z_{nt}|^r < \Delta < \infty$ for some $r > 2$ and all $n$ and $t$, and having mixing coefficients $\phi$ of size $-r/2(r-1)$ or $\alpha$ of size $-r/(r-2)$, $r > 2$. If $\bar{\sigma}_n^2 \equiv \mathrm{var}(n^{-1/2}\sum_{t=1}^n Z_{nt}) > \delta > 0$ for all $n$ sufficiently large, then

$$\sqrt{n}(\bar{Z}_n - \bar{\mu}_n)/\bar{\sigma}_n \stackrel{A}{\sim} N(0, 1).$$

Proof. The result follows from Corollary 3.1 of Wooldridge and White (1988) applied to the random variables $\tilde{Z}_{nt} \equiv Z_{nt}/\bar{\sigma}_n$. See also Wooldridge (1986, Ch. 3, Corollary 4.4). ∎

Compared with Theorem 5.11, the moment requirements are now potentially stronger to allow for considerably more dependence in $Z_t$. Note, however, that if $\phi(m)$ or $\alpha(m)$ decrease exponentially in $m$, we can set $r$ arbitrarily close to two, implying essentially the same moment restrictions as in the independent case. The analog to Exercise 5.12 is as follows.
Exercise 5.21 Prove the following result. Given

(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(X_t', \varepsilon_t)\}$ is a mixing sequence with either $\phi$ of size $-r/2(r-1)$, $r > 2$, or $\alpha$ of size $-r/(r-2)$, $r > 2$;
(iii) (a) $E(X_t\varepsilon_t) = 0$, $t = 1, 2, \ldots$; (b) $E|X_{thi}\varepsilon_{th}|^r < \Delta < \infty$ for $h = 1, \ldots, p$, $i = 1, \ldots, k$, and all $t$; (c) $V_n \equiv \mathrm{var}(n^{-1/2}\sum_{t=1}^n X_t\varepsilon_t)$ is uniformly positive definite;
(iv) (a) $E|X_{thi}^2|^{(r/2)+\delta} < \Delta < \infty$ for some $\delta > 0$ and all $h = 1, \ldots, p$, $i = 1, \ldots, k$, and $t$; (b) $M_n \equiv E(X'X/n)$ is uniformly positive definite.

Then $D_n^{-1/2}\sqrt{n}(\hat{\beta}_n - \beta_o) \stackrel{A}{\sim} N(0, I)$, where $D_n \equiv M_n^{-1}V_nM_n^{-1}$.

Suppose in addition that

(v) there exists $\hat{V}_n$ symmetric and positive semidefinite such that $\hat{V}_n - V_n \stackrel{P}{\to} 0$.

Then $\hat{D}_n - D_n \stackrel{P}{\to} 0$, where $\hat{D}_n \equiv (X'X/n)^{-1}\hat{V}_n(X'X/n)^{-1}$.

Compared with Exercise 5.12, we have relaxed the memory requirement from independence to mixing (asymptotic independence). Depending on the amount of dependence the observations exhibit, the moment conditions may or may not be stronger than those of Exercise 5.12. The flexibility gained by dispensing with the stationarity assumption of Theorem 5.17 is that the present result can accommodate the inclusion of fixed regressors as well as lagged dependent variables in the explanatory variables of the model. The price paid is an increase in the moment restrictions, as well as an increase in the strength of the memory conditions.
Exercise 5.22 Suppose the data are generated as $Y_t = \beta_{o1}Y_{t-1} + \beta_{o2}W_t + \varepsilon_t$, where $W_t$ is a fixed scalar. Let $X_t = (Y_{t-1}, W_t)$ and provide conditions on $\{(X_t, \varepsilon_t)\}$ and $(\beta_{o1}, \beta_{o2})$ that ensure that the OLS estimator of $\beta_{o1}$ and $\beta_{o2}$ is consistent and asymptotically normal.

The result for the instrumental variables estimator is the following.
Theorem 5.23 Given

(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;
(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is a mixing sequence with either $\phi$ of size $-r/2(r-1)$, $r > 2$, or $\alpha$ of size $-r/(r-2)$, $r > 2$;
(iii) (a) $E(Z_t\varepsilon_t) = 0$, $t = 1, 2, \ldots$; (b) $E|Z_{thi}\varepsilon_{th}|^r < \Delta < \infty$ for $h = 1, \ldots, p$, $i = 1, \ldots, l$, and all $t$; (c) $V_n \equiv \mathrm{var}(n^{-1/2}\sum_{t=1}^n Z_t\varepsilon_t)$ is uniformly positive definite;
(iv) (a) $E|Z_{thi}X_{thj}|^{(r/2)+\delta} < \Delta < \infty$ for some $\delta > 0$ and all $h = 1, \ldots, p$, $i = 1, \ldots, l$, $j = 1, \ldots, k$, and $t$; (b) $Q_n \equiv E(Z'X/n)$ has uniformly full column rank; (c) $\hat{P}_n - P_n \stackrel{P}{\to} 0$, where $P_n = O(1)$ and is symmetric and uniformly positive definite.

Then $D_n^{-1/2}\sqrt{n}(\tilde{\beta}_n - \beta_o) \stackrel{A}{\sim} N(0, I)$, where

$$D_n \equiv (Q_n'P_nQ_n)^{-1}Q_n'P_nV_nP_nQ_n(Q_n'P_nQ_n)^{-1}.$$

Suppose further that

(v) there exists $\hat{V}_n$ symmetric and positive semidefinite such that $\hat{V}_n - V_n \stackrel{P}{\to} 0$.

Then $\hat{D}_n - D_n \stackrel{P}{\to} 0$, where

$$\hat{D}_n \equiv (X'Z\hat{P}_nZ'X/n^2)^{-1}(X'Z/n)\hat{P}_n\hat{V}_n\hat{P}_n(Z'X/n)(X'Z\hat{P}_nZ'X/n^2)^{-1}.$$
Proof. We verify that the conditions of Exercise 4.26 hold. First we apply Proposition 5.1 to show $V_n^{-1/2}n^{-1/2}Z'\varepsilon \stackrel{A}{\sim} N(0, I)$. Consider

$$n^{-1/2}\sum_{t=1}^n \lambda'V_n^{-1/2}Z_t\varepsilon_t.$$

By Theorem 3.49, $\lambda'V_n^{-1/2}Z_t\varepsilon_t$ is a sequence of mixing random variables with either $\phi$ of size $-r/2(r-1)$, $r > 2$, or $\alpha$ of size $-r/(r-2)$, $r > 2$, given (ii). Further, $E(\lambda'V_n^{-1/2}Z_t\varepsilon_t) = 0$ given (iii.a), $E|\lambda'V_n^{-1/2}Z_t\varepsilon_t|^r < \Delta' < \infty$ for all $t$ given (iii.b), and

$$\bar{\sigma}_n^2 \equiv \mathrm{var}\left(n^{-1/2}\sum_{t=1}^n \lambda'V_n^{-1/2}Z_t\varepsilon_t\right) = \lambda'V_n^{-1/2}V_nV_n^{-1/2}\lambda = 1.$$

It follows from Theorem 5.20 that $n^{-1/2}\sum_{t=1}^n \lambda'V_n^{-1/2}Z_t\varepsilon_t \stackrel{A}{\sim} N(0, 1)$. Since this holds for every $\lambda$, $\lambda'\lambda = 1$, it follows from Proposition 5.1 that $V_n^{-1/2}n^{-1/2}\sum_{t=1}^n Z_t\varepsilon_t \stackrel{A}{\sim} N(0, I)$.

Next, $Z'X/n - Q_n \stackrel{P}{\to} 0$ by Corollary 3.48 given (iv.a), where $\{Q_n\}$ is $O(1)$ and has uniformly full column rank by (iv.a) and (iv.b). Since (iv.c) also holds, the desired result now follows from Exercise 4.26. ∎

This result is in a sense the most general of all the results that we have obtained, because it contains so many special cases. Specifically, it covers every situation previously considered (i.i.d., i.h.d., and d.i.d. observations), although at the explicit cost of imposing slightly stronger conditions in various respects. Note, too, that this result applies to systems of equations or panel data, since we can choose $p > 1$.
5.5 Martingale Difference Sequences
In Chapter 3 we discussed laws of large numbers for martingale difference sequences and mentioned that economic theory is often used to justify the martingale difference assumption. If the martingale difference assumption is valid, then it often allows us to simplify or weaken some of the other conditions imposed in establishing the asymptotic normality of our estimators.

There are a variety of central limit theorems available for martingale difference sequences. One version that is relatively convenient is an extension of the Lindeberg-Feller theorem (Theorem 5.6). In stating it, we consider sequences of random variables $\{Z_{nt}\}$ and associated $\sigma$-algebras $\{\mathcal{F}_{nt}, 1 \le t \le n\}$, where $\mathcal{F}_{n,t-1} \subset \mathcal{F}_{nt}$ and $Z_{nt}$ is measurable with respect to $\mathcal{F}_{nt}$. We can think of $\mathcal{F}_{nt}$ as being the $\sigma$-field generated by the current and past of $Z_{nt}$ as well as any other relevant random variables.

Theorem 5.24 Let $\{Z_{nt}, \mathcal{F}_{nt}\}$ be a martingale difference sequence such that $\sigma_{nt}^2 \equiv E(Z_{nt}^2) < \infty$, $\sigma_{nt}^2 \neq 0$, and let $F_{nt}$ be the distribution function of $Z_{nt}$. Define $\bar{Z}_n \equiv n^{-1}\sum_{t=1}^n Z_{nt}$ and $\bar{\sigma}_n^2 \equiv \mathrm{var}(\sqrt{n}\bar{Z}_n) = n^{-1}\sum_{t=1}^n \sigma_{nt}^2$. If for every $\varepsilon > 0$

$$\lim_{n \to \infty}\, \bar{\sigma}_n^{-2}\, n^{-1}\sum_{t=1}^n \int_{z^2 > \varepsilon n\bar{\sigma}_n^2} z^2\, dF_{nt}(z) = 0,$$

and

$$n^{-1}\sum_{t=1}^n Z_{nt}^2/\bar{\sigma}_n^2 - 1 \stackrel{P}{\to} 0,$$

then $\sqrt{n}\bar{Z}_n/\bar{\sigma}_n \stackrel{A}{\sim} N(0, 1)$.
Proof. This follows immediately as a corollary to Theorem 2.3 of McLeish (1974). ∎

Comparing this result with the Lindeberg-Feller theorem, we see that both impose the Lindeberg condition, whereas the independence assumption has here been weakened to the martingale difference assumption. The present result also imposes a condition not explicit in the Lindeberg-Feller theorem, i.e., essentially that the sample variance $n^{-1}\sum_{t=1}^n Z_{nt}^2$ is a consistent estimator for $\bar{\sigma}_n^2$. This condition is unnecessary in the independent case because it is implied there by the Lindeberg condition. Without independence, we make use of additional conditions, e.g., stationarity and ergodicity or mixing, to ensure that the sample variance is indeed consistent for $\bar{\sigma}_n^2$.

To illustrate how use of the martingale difference assumption allows us to simplify our results, consider the IV estimator in the case of stationary observations. We have the following result.
Theorem 5.25 Suppose conditions (i), (ii), (iv), and (v) of Exercise 5.19 hold, and replace condition (iii) with

(iii') (a) $E(Z_{thi}\varepsilon_{th}|\mathcal{F}_{t-1}) = 0$ for all $t$, where $\{\mathcal{F}_t\}$ is adapted to $\{Z_{thi}\varepsilon_{th}\}$, $h = 1, \ldots, p$, $i = 1, \ldots, l$; (b) $E|Z_{thi}\varepsilon_{th}|^2 < \infty$, $h = 1, \ldots, p$, $i = 1, \ldots, l$; (c) $V_n \equiv \mathrm{var}(n^{-1/2}Z'\varepsilon) = \mathrm{var}(Z_t\varepsilon_t) \equiv V$ is nonsingular.

Then the conclusions of Exercise 5.19 hold.
Proof. One way to prove this is to show that (iii') implies (iii). This is direct, and it is left to the reader to verify. Alternatively, we can apply Proposition 5.1 and Theorem 5.24 to verify that $V_n^{-1/2}n^{-1/2}Z'\varepsilon \stackrel{A}{\sim} N(0, I)$. Since $\{Z_t\varepsilon_t\}$ is a stationary martingale difference sequence,

$$\mathrm{var}(n^{-1/2}Z'\varepsilon) = n^{-1}\sum_{t=1}^n E(Z_t\varepsilon_t\varepsilon_t'Z_t') = V,$$

finite by (iii'.b) and positive definite by (iii'.c). Hence, consider

$$n^{-1/2}\sum_{t=1}^n \lambda'V^{-1/2}Z_t\varepsilon_t.$$

By Proposition 3.23, $\lambda'V^{-1/2}Z_t\varepsilon_t$ is measurable with respect to $\mathcal{F}_t$ given (iii'.a). Writing $\lambda'V^{-1/2}Z_t\varepsilon_t = \sum_{h=1}^p\sum_{i=1}^l \bar{\lambda}_iZ_{thi}\varepsilon_{th}$, it follows from the linearity of conditional expectations that

$$E(\lambda'V^{-1/2}Z_t\varepsilon_t|\mathcal{F}_{t-1}) = \sum_{h=1}^p\sum_{i=1}^l \bar{\lambda}_iE(Z_{thi}\varepsilon_{th}|\mathcal{F}_{t-1}) = 0$$

given (iii'.a). Hence $\{\lambda'V^{-1/2}Z_t\varepsilon_t, \mathcal{F}_t\}$ is a martingale difference sequence. As a consequence of stationarity, $\mathrm{var}(\lambda'V^{-1/2}Z_t\varepsilon_t) = \lambda'V^{-1/2}VV^{-1/2}\lambda = 1$ for all $t$, and for all $t$, $F_{nt} = F$, the distribution function of $\lambda'V^{-1/2}Z_t\varepsilon_t$. It follows from Exercise 5.9 that the Lindeberg condition is satisfied. Since $\{\lambda'V^{-1/2}Z_t\varepsilon_t\varepsilon_t'Z_t'V^{-1/2}\lambda\}$ is a stationary and ergodic sequence by Proposition 3.30 with finite expected absolute values given (iii'.b) and (iii'.c), the ergodic theorem (Theorem 3.34) and Theorem 2.24 imply

$$n^{-1}\sum_{t=1}^n \lambda'V^{-1/2}Z_t\varepsilon_t\varepsilon_t'Z_t'V^{-1/2}\lambda - \lambda'V^{-1/2}VV^{-1/2}\lambda = n^{-1}\sum_{t=1}^n \lambda'V^{-1/2}Z_t\varepsilon_t\varepsilon_t'Z_t'V^{-1/2}\lambda - 1 \stackrel{P}{\to} 0.$$

Hence, by Theorem 5.24, $n^{-1/2}\sum_{t=1}^n \lambda'V^{-1/2}Z_t\varepsilon_t \stackrel{A}{\sim} N(0, 1)$. It follows from Proposition 5.1 that $V^{-1/2}n^{-1/2}Z'\varepsilon \stackrel{A}{\sim} N(0, I)$, and since $V = V_n$, $V_n^{-1/2}n^{-1/2}Z'\varepsilon \stackrel{A}{\sim} N(0, I)$. The rest of the results follow as before. ∎

Whereas use of the martingale difference assumption allows us to state simpler conditions for stationary ergodic processes, it also allows us to state weaker conditions on certain aspects of the behavior of mixing processes. To do this conveniently, we apply a Liapounov-like corollary to the central limit theorem just given.

Corollary 5.26 Let $\{Z_{nt}, \mathcal{F}_{nt}\}$ be a martingale difference sequence such that $E|Z_{nt}|^{2+\delta} < \Delta < \infty$ for some $\delta > 0$ and all $n$ and $t$. If $\bar{\sigma}_n^2 > \delta' > 0$ for all $n$ sufficiently large and $n^{-1}\sum_{t=1}^n Z_{nt}^2 - \bar{\sigma}_n^2 \stackrel{P}{\to} 0$, then $\sqrt{n}\bar{Z}_n/\bar{\sigma}_n \stackrel{A}{\sim} N(0, 1)$.

Proof. Given $E|Z_{nt}|^{2+\delta} < \Delta < \infty$, the Lindeberg condition holds as shown in the proof of Theorem 5.10. Since $\bar{\sigma}_n^2 > \delta' > 0$, $\bar{\sigma}_n^{-2}$ is $O(1)$, so $n^{-1}\sum_{t=1}^n Z_{nt}^2/\bar{\sigma}_n^2 - 1 = \bar{\sigma}_n^{-2}(n^{-1}\sum_{t=1}^n Z_{nt}^2 - \bar{\sigma}_n^2) \stackrel{P}{\to} 0$ by Exercise 2.35. The conditions of Theorem 5.24 hold and the result follows. ∎

We use this result to obtain an analog to Theorem 5.25.
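A martingale difference sequence need not be independent, and the ARCH-type process $Z_t = u_t(a_0 + a_1Z_{t-1}^2)^{1/2}$ with $u_t$ i.i.d. $N(0,1)$ is a standard example of a dependent MDS. The sketch below is our own illustration (parameters are our choices, with $a_1 < 1$ so the unconditional variance $a_0/(1-a_1)$ exists): the standardized mean is still approximately $N(0,1)$, and the sample variance $n^{-1}\sum_t Z_t^2$ is close to its population value, as Corollary 5.26 requires.

```python
import numpy as np

rng = np.random.default_rng(4)
a0, a1 = 1.0, 0.3
n, reps, burn = 2000, 5000, 200
sigma2 = a0 / (1 - a1)                     # unconditional variance of Z_t

u = rng.normal(size=(reps, burn + n))
Z = np.zeros((reps, burn + n))
for t in range(1, burn + n):
    # conditional variance depends on the past, but E(Z_t | past) = 0
    Z[:, t] = u[:, t] * np.sqrt(a0 + a1 * Z[:, t - 1] ** 2)
Z = Z[:, burn:]

stat = np.sqrt(n) * Z.mean(axis=1) / np.sqrt(sigma2)
samp_var = (Z ** 2).mean(axis=1)           # n^{-1} sum Z_t^2, per replication

print(stat.mean(), stat.std())             # near 0 and 1
print(samp_var.mean())                     # near a0/(1 - a1)
```

Because $\{Z_t\}$ is an MDS, the summands are uncorrelated even though they are strongly dependent through their conditional variances, which is exactly the structure the martingale difference CLTs exploit.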
Exercise 5.27 Prove the following. Suppose conditions (i), (ii), (iv), and (v) of Theorem 5.23 hold, and replace (iii) with

(iii') (a) $E(Z_{thi}\varepsilon_{th}|\mathcal{F}_{t-1}) = 0$ for all $t$, where $\{\mathcal{F}_t\}$ is adapted to $\{Z_{thi}\varepsilon_{th}\}$, $h = 1, \ldots, p$, $i = 1, \ldots, l$; (b) $E|Z_{thi}\varepsilon_{th}|^r < \Delta < \infty$ for all $h = 1, \ldots, p$, $i = 1, \ldots, l$, and all $t$; (c) $V_n \equiv \mathrm{var}(n^{-1/2}Z'\varepsilon)$ is uniformly positive definite.

Then the conclusions of Theorem 5.23 hold.

Note that although assumption (iii.a) has been strengthened from $E(Z_t\varepsilon_t) = 0$ to the martingale difference assumption, we have maintained the moment requirements of (iii.b).
References

Hansen, L. P. (1982). "Large sample properties of generalized method of moments estimators." Econometrica, 50, 1029-1054.
Heyde, C. C. (1975). "On the central limit theorem and iterated logarithm law for stationary processes." Bulletin of the Australian Mathematical Society, 12, 1-8.
Loève, M. (1977). Probability Theory. Springer-Verlag, New York.
McLeish, D. L. (1974). "Dependent central limit theorems and invariance principles." Annals of Probability, 2, 620-628.
Rao, C. R. (1973). Linear Statistical Inference and Its Applications. Wiley, New York.
Scott, D. J. (1973). "Central limit theorems for martingales and for processes with stationary increments using a Skorokhod representation approach." Advances in Applied Probability, 5, 119-137.
Wooldridge, J. M. (1986). Asymptotic Properties of Econometric Estimators. Unpublished Ph.D. dissertation, University of California, San Diego.
- - - and H. White (1988). "Some invariance principles and central limit theorems for dependent heterogeneous processes." Econometric Theory, 4, 210-230.
CHAPTER 6
Estimating Asymptotic Covariance Matrices
In all the preceding chapters, we defined $V_n \equiv \mathrm{var}(n^{-1/2}X'\varepsilon)$ or $V_n \equiv \mathrm{var}(n^{-1/2}Z'\varepsilon)$ and assumed that a consistent estimator $\hat{V}_n$ for $V_n$ is available. In this chapter we obtain conditions that allow us to find convenient consistent estimators $\hat{V}_n$. Because the theory of estimating $\mathrm{var}(n^{-1/2}X'\varepsilon)$ is identical to that of estimating $\mathrm{var}(n^{-1/2}Z'\varepsilon)$, we consider only the latter. Further, because the optimal choice for $P_n$ is $V_n^{-1}$, as we saw in Chapter 4, conditions that permit consistent estimation of $V_n$ will also permit consistent estimation of $P_n = V_n^{-1}$ by $\hat{P}_n = \hat{V}_n^{-1}$.
6.1 General Structure of $V_n$

Before proceeding to look at special cases, it is helpful to examine the general form of $V_n$. We have

$$V_n \equiv \mathrm{var}(n^{-1/2}Z'\varepsilon) = E(Z'\varepsilon\varepsilon'Z/n),$$

because we assume that $E(n^{-1/2}Z'\varepsilon) = 0$. In terms of individual observations, this can be expressed as

$$V_n = n^{-1}\sum_{t=1}^n\sum_{\tau=1}^n E(Z_t\varepsilon_t\varepsilon_\tau'Z_\tau').$$
An equivalent way of writing the summation on the right is helpful in obtaining further insight. We can also write

$$V_n = n^{-1}\sum_{t=1}^n E(Z_t\varepsilon_t\varepsilon_t'Z_t') + n^{-1}\sum_{\tau=1}^{n-1}\sum_{t=\tau+1}^n E(Z_t\varepsilon_t\varepsilon_{t-\tau}'Z_{t-\tau}' + Z_{t-\tau}\varepsilon_{t-\tau}\varepsilon_t'Z_t')$$
$$= n^{-1}\sum_{t=1}^n \mathrm{var}(Z_t\varepsilon_t) + n^{-1}\sum_{\tau=1}^{n-1}\sum_{t=\tau+1}^n \left[\mathrm{cov}(Z_t\varepsilon_t, Z_{t-\tau}\varepsilon_{t-\tau}) + \mathrm{cov}(Z_{t-\tau}\varepsilon_{t-\tau}, Z_t\varepsilon_t)\right].$$
The last expression reveals that $V_n$ is the average of the variances of $Z_t\varepsilon_t$ plus a term that takes into account the covariances between $Z_t\varepsilon_t$ and $Z_{t-\tau}\varepsilon_{t-\tau}$ for all $t$ and $\tau$. We consider three important special cases.

Case 1. The first case we consider occurs when $\{Z_t\varepsilon_t\}$ is uncorrelated, so that $\mathrm{cov}(Z_t\varepsilon_t, Z_\tau\varepsilon_\tau) = 0$ for all $t \neq \tau$, so

$$V_n = n^{-1}\sum_{t=1}^n E(Z_t\varepsilon_t\varepsilon_t'Z_t').$$

This occurs when $\{(Z_t', \varepsilon_t)\}$ is an independent sequence or when $\{Z_t\varepsilon_t, \mathcal{F}_t\}$ is a martingale difference sequence for some adapted $\sigma$-fields $\mathcal{F}_t$.

Case 2. The next case we consider occurs when $\{Z_t\varepsilon_t\}$ is "finitely correlated," so that $\mathrm{cov}(Z_t\varepsilon_t, Z_{t-\tau}\varepsilon_{t-\tau}) = 0$ for all $\tau > m$, $1 \le m < \infty$, implying

$$V_n = n^{-1}\sum_{t=1}^n E(Z_t\varepsilon_t\varepsilon_t'Z_t') + n^{-1}\sum_{\tau=1}^{m}\sum_{t=\tau+1}^n \left[\mathrm{cov}(Z_t\varepsilon_t, Z_{t-\tau}\varepsilon_{t-\tau}) + \mathrm{cov}(Z_{t-\tau}\varepsilon_{t-\tau}, Z_t\varepsilon_t)\right].$$

Case 3. The last case that we consider occurs when $\{Z_t\varepsilon_t\}$ is an asymptotically uncorrelated sequence, so that the covariance terms die out as $\tau \to \infty$. Rather than making direct use of the assumption that $\{(Z_t', \varepsilon_t)\}$ is asymptotically uncorrelated, we shall assume that $\{(Z_t', \varepsilon_t)\}$ is a mixing sequence, which will suffice for asymptotic uncorrelatedness.

In what follows, we typically assume that nothing more is known about the correlation structure of $\{Z_t\varepsilon_t\}$ beyond that it falls in one of these three cases. If additional knowledge were available, then it could be used to estimate $V_n$; more importantly, however, it could be used to obtain efficient estimators as discussed in Chapter 4. The analysis of this chapter is thus relevant in the common situation in which this knowledge is absent.
6.2 Case 1: $\{Z_t\varepsilon_t\}$ Uncorrelated

In this section, we treat the case in which

$$V_n = n^{-1}\sum_{t=1}^n E(Z_t\varepsilon_t\varepsilon_t'Z_t').$$

A special case of major importance arises when

$$E(Z_t\varepsilon_t\varepsilon_t'Z_t') = \sigma_o^2E(Z_tZ_t'),$$
so that

$$V_n = n^{-1}\sum_{t=1}^n \sigma_o^2E(Z_tZ_t') = \sigma_o^2L_n.$$

Our first result applies to this case.

Theorem 6.1 Suppose $V_n = \sigma_o^2L_n$, where $\sigma_o^2 < \infty$ and $L_n$ is $O(1)$. If there exists $\hat{\sigma}_n^2$ such that $\hat{\sigma}_n^2 - \sigma_o^2 \stackrel{P}{\to} 0$, and if $Z'Z/n - L_n \stackrel{P}{\to} 0$, then $\hat{V}_n \equiv \hat{\sigma}_n^2Z'Z/n$ is such that $\hat{V}_n - V_n \stackrel{P}{\to} 0$.

Proof. Immediate from Proposition 2.30. ∎

Exercise 6.2 Using Exercise 3.79, find conditions that ensure that $\hat{\sigma}_n^2 - \sigma_o^2 \stackrel{P}{\to} 0$ and $Z'Z/n - L_n \stackrel{P}{\to} 0$, where $\hat{\sigma}_n^2 = (Y - X\hat{\beta}_n)'(Y - X\hat{\beta}_n)/(np)$.
Conditions under which $\hat{\sigma}_n^2 - \sigma_o^2 \stackrel{P}{\to} 0$ and $Z'Z/n - L_n \stackrel{P}{\to} 0$ are easily found from the results of Chapter 3.

In the remainder of this section we consider the cases in which the sequence $\{(Z_t', X_t', \varepsilon_t)\}$ is stationary or $\{(Z_t', X_t', \varepsilon_t)\}$ is a heterogeneous sequence. We invoke the martingale difference assumption in each case, which allows results for independent observations to follow as direct corollaries.

The results that we obtain next are motivated by the following considerations. We are interested in estimating

$$V_n = n^{-1}\sum_{t=1}^n E(Z_t\varepsilon_t\varepsilon_t'Z_t').$$

If both $Z_t$ and $\varepsilon_t$ were observable, a consistent estimator is easily available from the results of Chapter 3, say,

$$\bar{V}_n = n^{-1}\sum_{t=1}^n Z_t\varepsilon_t\varepsilon_t'Z_t'.$$

For example, if $\{(Z_t', \varepsilon_t)\}$ were a stationary ergodic sequence, then as long as the elements of $Z_t\varepsilon_t\varepsilon_t'Z_t'$ have finite expected absolute value, it follows from the ergodic theorem that $\bar{V}_n - V_n \stackrel{P}{\to} 0$. Of course, $\varepsilon_t$ is not observable. However, it can be estimated by
$$\hat{\varepsilon}_t = Y_t - X_t'\hat{\beta}_n,$$

where $\hat{\beta}_n$ is consistent for $\beta_o$. This leads us to consider estimators of the form

$$\hat{V}_n = n^{-1}\sum_{t=1}^n Z_t\hat{\varepsilon}_t\hat{\varepsilon}_t'Z_t'.$$

As we prove next, replacing $\varepsilon_t$ with $\hat{\varepsilon}_t$ makes no difference asymptotically under general conditions, so $\hat{V}_n - V_n \stackrel{P}{\to} 0$. These conditions are precisely specified for stationary sequences by the next result.
Theorem 6.3 Suppose that

(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is a stationary ergodic sequence;

(iii) (a) $\{Z_t\varepsilon_t, \mathcal{F}_t\}$ is a martingale difference sequence;

(b) $E|Z_{thi}\varepsilon_{th}|^2 < \infty$;

...

$\hat P_n - P_n \xrightarrow{p} 0$, where $P_n = O(1)$ is symmetric and uniformly positive definite.
Then $\hat V_n - V_n \xrightarrow{p} 0$ and $\hat V_n^{-1} - V_n^{-1} \xrightarrow{p} 0$.
(Hint: be careful to handle $\tau$ properly to ensure the proper mixing size requirements.)
Exercise 6.13 Prove the following result. Suppose conditions (i)–(iv) of Exercise 6.12 hold. Then $D_n^{-1/2}\sqrt{n}(\hat\beta_n - \beta_o) \xrightarrow{A} N(0, I)$, where
$$D_n = (Q_n'P_nQ_n)^{-1}Q_n'P_nV_nP_nQ_n(Q_n'P_nQ_n)^{-1}.$$
Further, $\hat D_n - D_n \xrightarrow{p} 0$, where
$$\hat D_n = (X'Z\hat P_nZ'X/n^2)^{-1}(X'Z/n)\hat P_n\hat V_n\hat P_n(Z'X/n)(X'Z\hat P_nZ'X/n^2)^{-1}.$$
(Hint: apply Theorem 5.23.)

Exercise 6.14 Prove the following result. Suppose
(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1,2,\ldots$, $\beta_o \in \mathbb{R}^k$;

(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is a mixing sequence with either $\phi$ of size $-r/(2r-2)$, $r \ge 2$, or $\alpha$ of size $-r/(r-2)$, $r > 2$;

(iii) (a) $E(Z_t\varepsilon_t \mid \mathcal{F}_{t-\tau}) = 0$ for $\tau = m+1$ and $t = 1,2,\ldots$;

(b) $E|Z_{thi}\varepsilon_{th}|^{2(r+\delta)} < \Delta < \infty$ for some $\delta > 0$ and all $h = 1,\ldots,p$, $i = 1,\ldots,l$, and $t = 1,2,\ldots$;

(c) $V_n \equiv \mathrm{var}(n^{-1/2}\sum_{t=1}^n Z_t\varepsilon_t)$ is uniformly positive definite;

(iv) (a) $E|Z_{thi}X_{thj}|^{r+\delta} < \Delta < \infty$ for some $\delta > 0$ and all $h = 1,\ldots,p$, $i = 1,\ldots,l$, $j = 1,\ldots,k$, and $t = 1,2,\ldots$; (b) $Q_n \equiv E(Z'X/n)$ has uniformly full column rank; (c) $L_n \equiv E(Z'Z/n)$ is uniformly positive definite.

Define
$$\hat V_n = n^{-1}\sum_{t=1}^n Z_t\hat\varepsilon_t\hat\varepsilon_t'Z_t' + n^{-1}\sum_{\tau=1}^{m}\sum_{t=\tau+1}^n\left(Z_t\hat\varepsilon_t\hat\varepsilon_{t-\tau}'Z_{t-\tau}' + Z_{t-\tau}\hat\varepsilon_{t-\tau}\hat\varepsilon_t'Z_t'\right),$$
where $\hat\varepsilon_t = Y_t - X_t'\hat\beta_n$ and $\hat\beta_n \equiv (X'Z(Z'Z)^{-1}Z'X)^{-1}X'Z(Z'Z)^{-1}Z'Y$, and define
$$\beta_n^* \equiv (X'Z\hat V_n^{-1}Z'X)^{-1}X'Z\hat V_n^{-1}Z'Y.$$
6.3 Case 2: $\{Z_t\varepsilon_t\}$ Finitely Correlated

Then $D_n^{-1/2}\sqrt{n}(\beta_n^* - \beta_o) \xrightarrow{A} N(0, I)$, where $D_n = (Q_n'V_n^{-1}Q_n)^{-1}$. Further, $\hat D_n - D_n \xrightarrow{p} 0$, where
$$\hat D_n = (X'Z\hat V_n^{-1}Z'X/n^2)^{-1}.$$
Because $\hat V_n - V_n \xrightarrow{p} 0$ and $V_n$ is assumed uniformly positive definite, it follows that $\hat V_n$ is positive definite with probability approaching one as $n \to \infty$. In fact, as the proofs of the theorems demonstrate, the convergence is almost sure (provided $\hat P_n - P_n \xrightarrow{a.s.} 0$), so $\hat V_n$ will then be positive definite for all $n$ sufficiently large almost surely. Nevertheless, for given $n$ and particular samples, $\hat V_n$ as just analyzed can fail to be positive definite. In fact, not only can it be positive semidefinite, but it can be indefinite. This is clearly inconvenient, as negative variance estimates can lead to results that are utterly useless for testing hypotheses or for any other purpose where variance estimates are required. What can be done? A simple but effective strategy is to consider the following weighted version of $\hat V_n$:
$$\tilde V_n = n^{-1}\sum_{t=1}^n Z_t\hat\varepsilon_t\hat\varepsilon_t'Z_t' + n^{-1}\sum_{\tau=1}^{m} w_{n\tau}\sum_{t=\tau+1}^n\left(Z_t\hat\varepsilon_t\hat\varepsilon_{t-\tau}'Z_{t-\tau}' + Z_{t-\tau}\hat\varepsilon_{t-\tau}\hat\varepsilon_t'Z_t'\right).$$
The estimator $\hat V_n$ obtains when $w_{n\tau} = 1$ for all $n$ and $\tau$. Moreover, consistency of $\tilde V_n$ is straightforward to ensure, as the following exercise asks you to verify.
Exercise 6.15 Show that under the conditions of Theorem 6.9 or Exercise 6.12, a sufficient condition to ensure $\tilde V_n - V_n \xrightarrow{p} 0$ is that $w_{n\tau} \xrightarrow{p} 1$ for each $\tau = 1, \ldots, m$.

Subject to this requirement, we can then manipulate $w_{n\tau}$ to ensure the positive definiteness of $\tilde V_n$ in finite samples. One natural way to proceed is to let the properties of $\hat V_n$ suggest how to choose $w_{n\tau}$. In particular, if $\hat V_n$ is positive definite (as evidenced by having all positive eigenvalues), then set $w_{n\tau} = 1$. (As is trivially verified, this ensures $w_{n\tau} \xrightarrow{p} 1$.) Otherwise, choose $w_{n\tau}$ to enforce positive definiteness in some principled way. For example, choose $w_{n\tau} = f(\tau, \theta_n)$ and adjust $\theta_n$ to the extent needed to achieve a positive definite $\tilde V_n$. For example, one might choose $w_{n\tau} = 1 - \theta_n\tau$ ($\theta_n > 0$), $w_{n\tau} = \theta_n^\tau$ ($\theta_n < 1$), or $w_{n\tau} = \exp(-\theta_n\tau)$ ($\theta_n > 0$). More sophisticated methods are available, but as a practical matter, this simple approach will often suffice.
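A minimal sketch (our own; the function name, sample sizes, and weight choices are illustrative, not from the text) of the weighted estimator $\tilde V_n$, including the geometric weights $w_{n\tau} = \theta_n^\tau$ just suggested:

```python
import numpy as np

def weighted_V(Z, e, m, w):
    """Weighted estimator: lag-0 term plus w[tau-1]-weighted cross terms
    (Z_t e_t e_{t-tau}' Z_{t-tau}' and its transpose) for tau = 1..m."""
    n = len(e)
    S = Z * e[:, None]                # row t is (Z_t e_t)'
    V = S.T @ S / n                   # n^{-1} sum_t Z_t e_t^2 Z_t'
    for tau in range(1, m + 1):
        C = S[tau:].T @ S[:-tau] / n  # n^{-1} sum_{t>tau} (Z_t e_t)(Z_{t-tau} e_{t-tau})'
        V += w[tau - 1] * (C + C.T)
    return V

rng = np.random.default_rng(1)
n, l, m = 400, 2, 4
Z = rng.normal(size=(n, l))
e = rng.normal(size=n)

V_unweighted = weighted_V(Z, e, m, np.ones(m))                 # w_ntau = 1
theta = 0.5
V_damped = weighted_V(Z, e, m, theta ** np.arange(1, m + 1))   # w_ntau = theta^tau
```

Setting all weights to one recovers the unweighted $\hat V_n$; shrinking them toward zero damps the cross-lag terms that can push the estimate away from positive definiteness.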
6.4 Case 3: $\{Z_t\varepsilon_t\}$ Asymptotically Uncorrelated

In this section we consider the general case in which
$$V_n = n^{-1}\sum_{t=1}^n E(Z_t\varepsilon_t\varepsilon_t'Z_t') + n^{-1}\sum_{\tau=1}^{n-1}\sum_{t=\tau+1}^n\left(E(Z_t\varepsilon_t\varepsilon_{t-\tau}'Z_{t-\tau}') + E(Z_{t-\tau}\varepsilon_{t-\tau}\varepsilon_t'Z_t')\right).$$
The essential restriction we impose is that as $\tau \to \infty$ the covariance (correlation) between $Z_t\varepsilon_t$ and $Z_{t-\tau}\varepsilon_{t-\tau}$ goes to zero. We ensure this behavior by assuming that $\{(Z_t', X_t', \varepsilon_t)\}$ is a mixing sequence. In the stationary case, we replace ergodicity with mixing, which, as we saw in Chapter 3, implies ergodicity. Mixing is not the weakest possible requirement. Mixingale conditions can also be used in the present context. Nevertheless, to keep our analysis tractable, we restrict attention to mixing processes. The fact that mixing sequences are asymptotically uncorrelated is a consequence of the following lemma.
Lemma 6.16 Let $Z$ be a random variable measurable with respect to $\mathcal{F}^{+\infty}_{n+\tau}$, $0 \le \tau < \infty$, such that $\|Z\|_q \equiv [E|Z|^q]^{1/q} < \infty$ for some $q \ge 1$, and let $1 \le r \le q$. Then
$$\left\|E(Z \mid \mathcal{F}^n_{-\infty}) - E(Z)\right\|_r \le 2\,\phi(\tau)^{1-1/q}\,\|Z\|_q$$
and
$$\left\|E(Z \mid \mathcal{F}^n_{-\infty}) - E(Z)\right\|_r \le 2\left(2^{1/r}+1\right)\alpha(\tau)^{1/r-1/q}\,\|Z\|_q.$$

Note that $\phi(\tau) \to 0$ and $\alpha(\tau) \to 0$ as $\tau \to \infty$, so this result imposes bounds on the rate at which the conditional expectation of $Z$, given the past up to period $n$, converges to the unconditional expectation as the time separation $\tau$ gets larger and larger. By setting $r = 2$, we obtain the following result.
Corollary 6.17 Let $E(Z_n) = E(Z_{n+\tau}) = 0$ and suppose $\mathrm{var}(Z_n) < \infty$ and, for some $q > 2$, $E|Z_{n+\tau}|^q < \infty$, $0 \le \tau < \infty$. Then
$$|E(Z_nZ_{n+\tau})| \le 2\,\phi(\tau)^{1-1/q}\left(\mathrm{var}(Z_n)\right)^{1/2}\|Z_{n+\tau}\|_q$$
and
$$|E(Z_nZ_{n+\tau})| \le 2\left(2^{1/2}+1\right)\alpha(\tau)^{1/2-1/q}\left(\mathrm{var}(Z_n)\right)^{1/2}\|Z_{n+\tau}\|_q.$$

Proof. Put $\mathcal{F}^n_{-\infty} = \sigma(\ldots, Z_n)$ and $\mathcal{F}^{+\infty}_{n+\tau} = \sigma(Z_{n+\tau}, \ldots)$. By the law of iterated expectations,
$$E(Z_nZ_{n+\tau}) = E\left(E(Z_nZ_{n+\tau} \mid \mathcal{F}^n_{-\infty})\right) = E\left(Z_n E(Z_{n+\tau} \mid \mathcal{F}^n_{-\infty})\right)$$
by Proposition 3.65. It follows from the Cauchy–Schwarz inequality and Jensen's inequality that
$$|E(Z_nZ_{n+\tau})| \le \left(\mathrm{var}(Z_n)\right)^{1/2}\left\|E(Z_{n+\tau} \mid \mathcal{F}^n_{-\infty})\right\|_2.$$
By Lemma 6.16, we have
$$\left\|E(Z_{n+\tau} \mid \mathcal{F}^n_{-\infty})\right\|_2 \le 2\,\phi(\tau)^{1-1/q}\,\|Z_{n+\tau}\|_q$$
and
$$\left\|E(Z_{n+\tau} \mid \mathcal{F}^n_{-\infty})\right\|_2 \le 2\left(2^{1/2}+1\right)\alpha(\tau)^{1/2-1/q}\,\|Z_{n+\tau}\|_q,$$
where we set $r = 2$ in Lemma 6.16. Combining these inequalities yields the final results. ∎
The direct implication of this result is that mixing sequences are asymptotically uncorrelated, because $\phi(\tau) \to 0$ ($q > 2$) or $\alpha(\tau) \to 0$ ($q > 2$) implies $|E(Z_nZ_{n+\tau})| \to 0$ as $\tau \to \infty$. For mixing sequences, it follows that $V_n$ might be well approximated by
$$\bar V_n = w_{n0}\,n^{-1}\sum_{t=1}^n E(Z_t\varepsilon_t\varepsilon_t'Z_t') + n^{-1}\sum_{\tau=1}^{m} w_{n\tau}\sum_{t=\tau+1}^n\left(E(Z_t\varepsilon_t\varepsilon_{t-\tau}'Z_{t-\tau}') + E(Z_{t-\tau}\varepsilon_{t-\tau}\varepsilon_t'Z_t')\right)$$
for some value $m$, because the neglected terms (those with $m < \tau < n$) will be small in absolute value if $m$ is sufficiently large. Note, however, that if $m$ is simply kept fixed as $n$ grows, the number of neglected terms grows, and may grow in such a way that the sum of the neglected terms does not remain negligible. This suggests that $m$ will have to grow with $n$, so that the terms in $V_n$ ignored by $\bar V_n$ remain negligible. We will write $m_n$ to make this dependence explicit. Note that in $\bar V_n$ we have introduced weights $w_{n\tau}$ analogous to those appearing in $\tilde V_n$ of the previous section. This facilitates our analysis of an extension of that estimator, which we now define as
$$\hat V_n = w_{n0}\,n^{-1}\sum_{t=1}^n Z_t\hat\varepsilon_t\hat\varepsilon_t'Z_t' + n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n\left(Z_t\hat\varepsilon_t\hat\varepsilon_{t-\tau}'Z_{t-\tau}' + Z_{t-\tau}\hat\varepsilon_{t-\tau}\hat\varepsilon_t'Z_t'\right).$$
We establish the consistency of $\hat V_n$ for $V_n$ by first showing that under suitable conditions $\bar V_n - V_n \to 0$ and then showing that $\hat V_n - \bar V_n \xrightarrow{p} 0$. Now, however, it will not be enough for consistency just to require $w_{n\tau} \xrightarrow{p} 1$, as we also have to treat $m_n \to \infty$ properly. For simplicity, we consider only nonstochastic weights $w_{n\tau}$ in what follows. Stochastic weights can be treated straightforwardly in a similar manner. Our next result provides conditions under which $\bar V_n - V_n \to 0$.

Lemma 6.18 Let $\{Z_{nt}\}$ be a double array of random $k \times 1$ vectors such that $E(|Z_{nt}'Z_{nt}|^{r/2}) \le \Delta < \infty$ for some $r > 2$, $E(Z_{nt}) = 0$, $n, t = 1, 2, \ldots$, and $\{Z_{nt}\}$ is mixing with $\phi$ of size $-r/(r-1)$ or $\alpha$ of size $-2r/(r-2)$. Define
$$V_n \equiv \mathrm{var}\left(n^{-1/2}\sum_{t=1}^n Z_{nt}\right),$$
and for any sequence $\{m_n\}$ of integers and any triangular array $\{w_{n\tau} : n = 1, 2, \ldots;\ \tau = 1, \ldots, m_n\}$ define
$$\bar V_n = w_{n0}\,n^{-1}\sum_{t=1}^n E(Z_{nt}Z_{nt}') + n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n\left(E(Z_{nt}Z_{n,t-\tau}') + E(Z_{n,t-\tau}Z_{nt}')\right).$$
If $m_n \to \infty$ as $n \to \infty$, if $|w_{n\tau}| \le \Delta$, $n = 1, 2, \ldots$, $\tau = 1, \ldots, m_n$, and if for each $\tau$, $w_{n\tau} \to 1$ as $n \to \infty$, then $\bar V_n - V_n \to 0$.

Proof. See Gallant and White (1988, Lemma 6.6). ∎

Here we see the explicit requirements that $m_n \to \infty$ and $w_{n\tau} \to 1$ for each $\tau$.
Our next result is an intermediate lemma related to Lemma 6.19 of White (1984). That result, however, contained an error, as pointed out by Phillips (1985) and Newey and West (1987), that resulted in an incorrect rate for $m_n$. The following result gives a correct rate.

Lemma 6.19 Let $\{Z_{nt}\}$ be a double array of random $k \times 1$ vectors such that $E(|Z_{nt}'Z_{nt}|^r) \le \Delta < \infty$ for some $r > 2$, $E(Z_{nt}) = 0$, $n, t = 1, 2, \ldots$, and $\{Z_{nt}\}$ is mixing with $\phi$ of size $-r/(r-1)$ or $\alpha$ of size $-2r/(r-2)$. Define
$$\zeta_{nt\tau} \equiv Z_{nti}Z_{n,t-\tau,j} - E(Z_{nti}Z_{n,t-\tau,j}).$$
If $m_n = o(n^{1/4})$ and $|w_{n\tau}| \le \Delta$, $n = 1, 2, \ldots$, $\tau = 1, \ldots, m_n$, then
$$n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n \zeta_{nt\tau} \xrightarrow{p} 0.$$

Theorem 6.20 Let $\{Z_{nt}\}$ be a double array of random $k \times 1$ vectors such that $E(|Z_{nt}'Z_{nt}|^r) \le \Delta < \infty$ for some $r > 2$, $E(Z_{nt}) = 0$, $n, t = 1, 2, \ldots$, and $\{Z_{nt}\}$ is mixing with $\phi$ of size $-r/(r-1)$ or $\alpha$ of size $-2r/(r-2)$. Define
$$V_n = \mathrm{var}\left(n^{-1/2}\sum_{t=1}^n Z_{nt}\right),$$
and for any sequence $\{m_n\}$ of integers and any triangular array $\{w_{n\tau} : n = 1, 2, \ldots;\ \tau = 1, \ldots, m_n\}$ define
$$\tilde V_n = w_{n0}\,n^{-1}\sum_{t=1}^n Z_{nt}Z_{nt}' + n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n\left(Z_{nt}Z_{n,t-\tau}' + Z_{n,t-\tau}Z_{nt}'\right).$$
If $m_n \to \infty$ as $n \to \infty$, $m_n = o(n^{1/4})$, and if $|w_{n\tau}| \le \Delta$, $n = 1, 2, \ldots$, $\tau = 1, \ldots, m_n$, and if for each $\tau$, $w_{n\tau} \to 1$ as $n \to \infty$, then $\tilde V_n - V_n \xrightarrow{p} 0$.
Proof. For $\bar V_n$ as defined in Lemma 6.18, we have $\bar V_n - V_n \to 0$ by Lemma 6.18, so the result holds if $\tilde V_n - \bar V_n \xrightarrow{p} 0$. Now
$$\tilde V_n - \bar V_n = w_{n0}\,n^{-1}\sum_{t=1}^n\left[Z_{nt}Z_{nt}' - E(Z_{nt}Z_{nt}')\right] + n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n\left[Z_{nt}Z_{n,t-\tau}' - E(Z_{nt}Z_{n,t-\tau}')\right] + n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n\left[Z_{n,t-\tau}Z_{nt}' - E(Z_{n,t-\tau}Z_{nt}')\right].$$
The first term converges almost surely to zero by the mixing law of large numbers, Corollary 3.48, while the next term, which has $i,j$ element
$$n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n\left[Z_{nti}Z_{n,t-\tau,j} - E(Z_{nti}Z_{n,t-\tau,j})\right],$$
converges in probability to zero by Lemma 6.19. The final term is the transpose of the preceding term, so it too vanishes in probability, and the result follows. ∎

With this result, we can establish the consistency of $\hat V_n$ for the instrumental variables estimator by setting $Z_{nt} = Z_t\varepsilon_t$ and handling the additional terms that arise from the presence of $\hat\varepsilon_t$ in place of $\varepsilon_t$ with an application of Lemma 6.19.
Because the results for stationary mixing sequences follow as corollaries to the results for general mixing sequences, we state only the results for general mixing sequences.
Theorem 6.21 Suppose

(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;

(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is a mixing sequence with either $\phi$ of size $-r/(r-1)$ or $\alpha$ of size $-2r/(r-2)$, $r > 2$;

(iii) (a) $E(Z_t\varepsilon_t) = 0$, $t = 1, 2, \ldots$;
(b) $E|Z_{thi}\varepsilon_{th}|^{2(r+\delta)} < \Delta < \infty$ for some $\delta > 0$, $h = 1, \ldots, p$, $i = 1, \ldots, l$, and all $t$;
(c) $V_n \equiv \mathrm{var}(n^{-1/2}\sum_{t=1}^n Z_t\varepsilon_t)$ is uniformly positive definite;

(iv) (a) $E|Z_{thi}X_{thj}|^{2(r+\delta)} < \Delta < \infty$ for some $\delta > 0$ and all $h = 1, \ldots, p$, $i = 1, \ldots, l$, $j = 1, \ldots, k$, and $t = 1, 2, \ldots$;
(b) $Q_n \equiv E(Z'X/n)$ has uniformly full column rank;
(c) $\hat P_n - P_n \xrightarrow{p} 0$, where $P_n = O(1)$ and is symmetric and uniformly positive definite.

Define $\hat V_n$ as above. If $m_n \to \infty$ as $n \to \infty$ such that $m_n = o(n^{1/4})$, and if $|w_{n\tau}| \le \Delta$, $n = 1, 2, \ldots$, $\tau = 1, \ldots, m_n$, such that $w_{n\tau} \to 1$ as $n \to \infty$ for each $\tau$, then $\hat V_n - V_n \xrightarrow{p} 0$ and $\hat V_n^{-1} - V_n^{-1} \xrightarrow{p} 0$.

Proof. By definition,
$$\hat V_n - V_n = w_{n0}\,n^{-1}\sum_{t=1}^n Z_t\hat\varepsilon_t\hat\varepsilon_t'Z_t' + n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n\left(Z_t\hat\varepsilon_t\hat\varepsilon_{t-\tau}'Z_{t-\tau}' + Z_{t-\tau}\hat\varepsilon_{t-\tau}\hat\varepsilon_t'Z_t'\right) - V_n.$$
Substituting $\hat\varepsilon_t = \varepsilon_t - X_t'(\hat\beta_n - \beta_o)$ gives
$$\hat V_n - V_n = w_{n0}\,n^{-1}\sum_{t=1}^n Z_t\varepsilon_t\varepsilon_t'Z_t' + n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n\left(Z_t\varepsilon_t\varepsilon_{t-\tau}'Z_{t-\tau}' + Z_{t-\tau}\varepsilon_{t-\tau}\varepsilon_t'Z_t'\right) - V_n + A_n,$$
where
$$\begin{aligned}
A_n ={}& -w_{n0}\,n^{-1}\sum_{t=1}^n Z_t\varepsilon_t(\hat\beta_n-\beta_o)'X_tZ_t' - w_{n0}\,n^{-1}\sum_{t=1}^n Z_tX_t'(\hat\beta_n-\beta_o)\varepsilon_t'Z_t'\\
&+ w_{n0}\,n^{-1}\sum_{t=1}^n Z_tX_t'(\hat\beta_n-\beta_o)(\hat\beta_n-\beta_o)'X_tZ_t'\\
&- n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n Z_t\varepsilon_t(\hat\beta_n-\beta_o)'X_{t-\tau}Z_{t-\tau}' - n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n Z_tX_t'(\hat\beta_n-\beta_o)\varepsilon_{t-\tau}'Z_{t-\tau}'\\
&+ n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n Z_tX_t'(\hat\beta_n-\beta_o)(\hat\beta_n-\beta_o)'X_{t-\tau}Z_{t-\tau}'\\
&- n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n Z_{t-\tau}\varepsilon_{t-\tau}(\hat\beta_n-\beta_o)'X_tZ_t' - n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n Z_{t-\tau}X_{t-\tau}'(\hat\beta_n-\beta_o)\varepsilon_t'Z_t'\\
&+ n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n Z_{t-\tau}X_{t-\tau}'(\hat\beta_n-\beta_o)(\hat\beta_n-\beta_o)'X_tZ_t'.
\end{aligned}$$
The conditions of the theorem ensure that the conditions of Theorem 6.20 hold for $Z_{nt} = Z_t\varepsilon_t$, so that
$$w_{n0}\,n^{-1}\sum_{t=1}^n Z_t\varepsilon_t\varepsilon_t'Z_t' + n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n\left(Z_t\varepsilon_t\varepsilon_{t-\tau}'Z_{t-\tau}' + Z_{t-\tau}\varepsilon_{t-\tau}\varepsilon_t'Z_t'\right) - V_n \xrightarrow{p} 0.$$
The desired result then follows provided that all remaining terms (in $A_n$) vanish in probability. The mixing and moment conditions imposed are more than enough to ensure that the terms involving $w_{n0}$ vanish in probability, by arguments identical to those used in the proof of Exercise 6.6. Thus, consider the term
$$n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n\left(Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t\right)\mathrm{vec}((\hat\beta_n-\beta_o)')$$
$$= n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n\left[(Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t) - E(Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t)\right]\mathrm{vec}((\hat\beta_n-\beta_o)') + n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n E\left(Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t\right)\mathrm{vec}((\hat\beta_n-\beta_o)').$$
Applying Lemma 6.19 with $Z_{nt} = \mathrm{vec}(Z_tX_t' \otimes Z_t\varepsilon_t)$, we have that
$$n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n\left(Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t\right) - E\left(Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t\right) \xrightarrow{p} 0.$$
This together with the fact that $\hat\beta_n - \beta_o \xrightarrow{p} 0$ ensures that the first of the two terms above vanishes. It therefore suffices to show that
$$n^{-1}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n E\left(Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t\right)\mathrm{vec}((\hat\beta_n-\beta_o)')$$
vanishes in probability. Under the conditions of the theorem, we have $\sqrt{n}(\hat\beta_n - \beta_o) = O_p(1)$ (by asymptotic normality, Theorem 5.23), so it will suffice that
$$n^{-3/2}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n E\left(Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t\right) = o(1).$$
Applying the triangle inequality and Jensen's inequality, we have
$$\left|n^{-3/2}\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n E\left(Z_{t-\tau}X_{t-\tau}' \otimes Z_t\varepsilon_t\right)\right| = o(1),$$
as we needed. The same argument works for all of the other terms under the conditions given, so the result follows. ∎

Comparing the conditions of this result to the asymptotic normality result, Theorem 5.23, we see that the memory conditions here are twice as strong as in Theorem 5.23. The moment conditions on $Z_t\varepsilon_t$ are roughly twice as strong as in Theorem 5.23, while those on $Z_tX_t'$ are roughly four times as strong. The rate $m_n = o(n^{1/4})$ is not necessarily the optimal rate, and other methods of proof can deliver faster rates for $m_n$. (See, e.g., Andrews, 1991.) For discussion of methods relevant for choosing $m_n$, see Den Haan and Levin (1997). Gallant and White (1988, Lemma 6.5) provide the following conditions on $w_{n\tau}$ guaranteeing the positive definiteness of $\tilde V_n$.

Lemma 6.22 Let $\{Z_{nt}\}$ be an arbitrary double array and let $\{a_{ni}\}$, $n = 1, 2, \ldots$, $i = 1, \ldots, m_n+1$, be a triangular array of real numbers. Then for any triangular array of weights
$$w_{n\tau} = \sum_{i=\tau+1}^{m_n+1} a_{ni}a_{n,i-\tau}, \qquad n = 1, 2, \ldots, \quad \tau = 1, \ldots, m_n,$$
we have
$$\tilde V_n = w_{n0}\sum_{t=1}^n Z_{nt}^2 + 2\sum_{\tau=1}^{m_n} w_{n\tau}\sum_{t=\tau+1}^n Z_{nt}Z_{n,t-\tau} \ge 0.$$

Proof. See Gallant and White (1988, Lemma 6.5). ∎
For example, choosing $a_{ni} = (m_n + 1)^{-1/2}$ for all $i = 1, \ldots, m_n+1$ delivers
$$w_{n\tau} = 1 - \tau/(m_n+1), \qquad \tau = 1, \ldots, m_n,$$
which are the Bartlett (1950) weights given by Newey and West (1987). Other choices for $a_{ni}$ lead to other choices of weights that arise in the problem of estimating the spectrum of a time series at zero frequency. (See Anderson (1971, Ch. 8) for further discussion.) This is quite natural, as $V_n$ can be interpreted precisely as the value of the spectrum of $Z_t\varepsilon_t$ at zero frequency.

Corollary 6.23 Suppose the conditions of Theorem 6.21 hold. Then
$$D_n^{-1/2}\sqrt{n}(\hat\beta_n - \beta_o) \xrightarrow{A} N(0, I),$$
where
$$D_n = (Q_n'P_nQ_n)^{-1}Q_n'P_nV_nP_nQ_n(Q_n'P_nQ_n)^{-1}.$$
Further, $\hat D_n - D_n \xrightarrow{p} 0$, where
$$\hat D_n = (X'Z\hat P_nZ'X/n^2)^{-1}(X'Z/n)\hat P_n\hat V_n\hat P_n(Z'X/n)(X'Z\hat P_nZ'X/n^2)^{-1}.$$

Proof. Immediate from Theorem 5.23 and Theorem 6.21. ∎
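The construction of Lemma 6.22 is easy to check numerically. The sketch below (our own) verifies that $a_{ni} = (m_n+1)^{-1/2}$ yields the Bartlett weights and that the resulting scalar quadratic form is nonnegative:

```python
import numpy as np

m = 6
# a_ni = (m+1)^{-1/2} for i = 1, ..., m+1 (the Lemma 6.22 construction).
a = np.full(m + 1, (m + 1) ** -0.5)

# w_tau = sum_{i=tau+1}^{m+1} a_i a_{i-tau}; should equal 1 - tau/(m+1).
w = np.array([a[tau:] @ a[: m + 1 - tau] for tau in range(m + 1)])
bartlett = 1 - np.arange(m + 1) / (m + 1)

# Scalar version of the lemma's conclusion:
# V = w_0 sum z_t^2 + 2 sum_tau w_tau sum_{t>tau} z_t z_{t-tau} >= 0.
rng = np.random.default_rng(2)
z = rng.normal(size=200)
V = w[0] * (z @ z) + 2 * sum(
    w[tau] * (z[tau:] @ z[:-tau]) for tau in range(1, m + 1)
)
```

Nonnegativity holds by the lemma for any data, which is precisely why Bartlett-type weights are attractive in finite samples.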
This result contains versions of all preceding asymptotic normality results as special cases while making minimal assumptions on the error covariance structure. Finally, we state a general result for the 2SIV estimator.

Corollary 6.24 Suppose

(i) $Y_t = X_t'\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, $\beta_o \in \mathbb{R}^k$;

(ii) $\{(Z_t', X_t', \varepsilon_t)\}$ is a mixing sequence with either $\phi$ of size $-r/(r-1)$ or $\alpha$ of size $-2r/(r-2)$, $r > 2$;

(iii) (a) $E(Z_t\varepsilon_t) = 0$, $t = 1, 2, \ldots$;
(b) $E|Z_{thi}\varepsilon_{th}|^{2(r+\delta)} < \Delta < \infty$ for some $\delta > 0$ and all $h = 1, \ldots, p$, $i = 1, \ldots, l$, and $t = 1, 2, \ldots$;
(c) $V_n \equiv \mathrm{var}(n^{-1/2}\sum_{t=1}^n Z_t\varepsilon_t)$ is uniformly positive definite;

(iv) (a) $E|Z_{thi}X_{thj}|^{2(r+\delta)} < \Delta < \infty$ and $E|Z_{thi}|^{r+\delta} < \Delta < \infty$ for some $\delta > 0$ and all $h = 1, \ldots, p$, $i = 1, \ldots, l$, $j = 1, \ldots, k$, and $t = 1, 2, \ldots$;
(b) $Q_n \equiv E(Z'X/n)$ has uniformly full column rank;
(c) $L_n \equiv E(Z'Z/n)$ is uniformly positive definite.

Define $\hat V_n$ as above, where $\hat\varepsilon_t = Y_t - X_t'\hat\beta_n$ and $\hat\beta_n \equiv (X'Z(Z'Z)^{-1}Z'X)^{-1}X'Z(Z'Z)^{-1}Z'Y$, and define
$$\beta_n^* \equiv (X'Z\hat V_n^{-1}Z'X)^{-1}X'Z\hat V_n^{-1}Z'Y.$$
If $m_n \to \infty$ as $n \to \infty$ such that $m_n = o(n^{1/4})$ and if $|w_{n\tau}| \le \Delta$, $n = 1, 2, \ldots$, $\tau = 1, \ldots, m_n$, such that $w_{n\tau} \to 1$ as $n \to \infty$ for each $\tau$, then $D_n^{-1/2}\sqrt{n}(\beta_n^* - \beta_o) \xrightarrow{A} N(0, I)$, where
$$D_n = (Q_n'V_n^{-1}Q_n)^{-1}.$$
Further, $\hat D_n - D_n \xrightarrow{p} 0$, where
$$\hat D_n = (X'Z\hat V_n^{-1}Z'X/n^2)^{-1}.$$

Proof. Conditions (i)–(iv) ensure that Theorem 6.21 holds for $\hat\beta_n$. Set $\hat P_n = \hat V_n^{-1}$ in Corollary 6.23. Then $P_n = V_n^{-1}$, and the result follows. ∎
References

Anderson, T. W. (1971). The Statistical Analysis of Time Series. Wiley, New York.

Andrews, D. W. K. (1991). "Heteroskedasticity and autocorrelation consistent covariance matrix estimation." Econometrica, 59, 817–858.

Bartlett, M. S. (1950). "Periodogram analysis and continuous spectra." Biometrika, 37, 1–16.

Den Haan, W. J. and A. T. Levin (1997). "A practitioner's guide to robust covariance matrix estimation." In Handbook of Statistics, vol. 15, G. S. Maddala and C. R. Rao, eds., pp. 309–327. Elsevier, Amsterdam.

Gallant, A. R. and H. White (1988). A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models. Basil Blackwell, Oxford.

McLeish, D. L. (1975). "A maximal inequality and dependent strong laws." Annals of Probability, 3, 826–836.

Newey, W. and K. West (1987). "A simple positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix." Econometrica, 55, 703–708.

Phillips, P. C. B. (1985). Personal communication.

White, H. (1980). "A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity." Econometrica, 48, 817–838.

White, H. (1982). "Instrumental variables regression with independent observations." Econometrica, 50, 483–500.

White, H. (1984). Asymptotic Theory for Econometricians. Academic Press, Orlando.
CHAPTER 7
Functional Central Limit Theory and Applications
The conditions imposed on the random elements of $X_t$, $Y_t$, and $Z_t$ appearing in the previous chapters have required that certain moments be bounded, e.g., $E|X_{thi}^2|^{1+\delta} \le \Delta < \infty$ for some $\delta > 0$ and for all $t$ (Exercise 5.12 (iv.a)). In fact, many economic time-series processes, especially those relevant for macroeconomics or finance, violate this restriction. In this chapter, we develop tools to handle such processes.
7.1 Random Walks and Wiener Processes

We begin by considering the random walk, defined as follows.

Definition 7.1 (Random Walk) Let $\{X_t\}$ be generated according to $X_t = X_{t-1} + Z_t$, $t = 1, 2, \ldots$, where $X_0 = 0$ and $\{Z_t\}$ is i.i.d. with $E(Z_t) = 0$ and $0 < \sigma^2 \equiv \mathrm{var}(Z_t) < \infty$. Then $\{X_t\}$ is a random walk.

By repeated substitution we have
$$X_t = X_{t-1} + Z_t = X_{t-2} + Z_{t-1} + Z_t = \cdots = X_0 + \sum_{s=1}^t Z_s,$$
as we have assumed $X_0 = 0$. It is straightforward to establish the following fact.

Exercise 7.2 If $\{X_t\}$ is a random walk, then $E(X_t) = 0$ and $\mathrm{var}(X_t) = t\sigma^2$, $t = 1, 2, \ldots$.

Because $E(X_t^2)^{1/2} \le E(|X_t|^r)^{1/r}$ for all $r \ge 2$ (Jensen's inequality), a random walk $X_t$ cannot satisfy $E|X_t^2|^{1+\delta} \le \Delta < \infty$ for all $t$ (set $r = 2(1+\delta)$). Thus when $X_t$, $Y_t$, or $Z_t$ contain random walks, the results of the previous chapters do not apply. One way to handle such situations is to transform the process so that it does satisfy conditions of the sort previously imposed. For example, we can take the "first difference" of a random walk to get $Z_t = X_t - X_{t-1}$ (which is i.i.d.) and base subsequent analysis on $Z_t$. Nevertheless, it is frequently of interest to examine the behavior of estimators that do not make use of such transformations, but that instead directly involve random walks or similar processes. For example, consider the least squares estimator $\hat\beta_n = (X'X)^{-1}X'Y$, where $X$ is $n \times 1$ with $X_t = Y_{t-1}$, and $Y_t$ is a random walk. To study the behavior of such estimators, we make use of Functional Central Limit Theorems (FCLTs), which extend the Central Limit Theorems (CLTs) studied previously in just the right way. Before we can study the behavior of estimators based on random walks or similar processes, we must understand in more detail the behavior of the processes themselves. This also directly enables us to understand the way in which the FCLT extends the CLT. Thus, consider the random walk $\{X_t\}$. We can write
$$X_n = \sum_{t=1}^n Z_t.$$
Rescaling, we have
$$n^{-1/2}X_n/\sigma = n^{-1/2}\sum_{t=1}^n Z_t/\sigma.$$
According to the Lindeberg–Lévy central limit theorem, we have
$$n^{-1/2}X_n/\sigma \xrightarrow{d} N(0, 1).$$
Thus, when $n$ is large, outcomes of the random walk process are drawn from a distribution that is approximately normally distributed.
Next, consider the behavior of the partial sum
$$X_{[an]} = \sum_{t=1}^{[an]} Z_t,$$
where $1/n \le a < \infty$ and $[an]$ represents the largest integer less than or equal to $an$. For $0 \le a < 1/n$, define $X_{[an]} = X_0 = 0$, so the partial sum is now defined for all $0 \le a < \infty$. Applying the same rescaling, we define
$$W_n(a) \equiv n^{-1/2}X_{[an]}/\sigma = n^{-1/2}\sum_{t=1}^{[an]} Z_t/\sigma.$$
Now
$$W_n(a) = n^{-1/2}[an]^{1/2}\left\{[an]^{-1/2}\sum_{t=1}^{[an]} Z_t/\sigma\right\},$$
and for given $a$, the term in the brackets $\{\cdot\}$ again obeys the CLT and converges in distribution to $N(0, 1)$, whereas $n^{-1/2}[an]^{1/2}$ converges to $a^{1/2}$. It follows from now standard arguments that $W_n(a)$ converges in distribution to $N(0, a)$. We have written $W_n(a)$ so that it is clear that $W_n$ can be considered to be a function of $a$. Also, because $W_n(a)$ depends on the $Z_t$'s, it is random. Therefore, we can think of $W_n(a)$ as defining a random function of $a$, which we write $W_n$. Just as the CLT provides conditions ensuring that the rescaled random walk $n^{-1/2}X_n/\sigma$ (which we can now write as $W_n(1)$) converges, as $n$ becomes large, to a well-defined limiting random variable (the standard normal), the FCLT provides conditions ensuring that the random function $W_n$ converges, as $n$ becomes large, to a well-defined limiting random function, say $W$. The word "Functional" in Functional Central Limit Theorem appears because this limit is a function of $a$. The limiting random function specified by the FCLT is, as we should expect, a generalization of the standard normal random variable. This limit is called a Wiener process or a Brownian motion in honor of Norbert Wiener (1923, 1924), who provided the mathematical foundation for the theory of random motions observed and described by nineteenth-century botanist Robert Brown in 1827. Of further historical interest is the fact that in his dissertation, Bachelier (1900) proposed the Brownian motion as a model for stock prices.
Before we formally characterize the Wiener process, we note some further properties of random walks, suitably rescaled.

Exercise 7.3 If $\{X_t\}$ is a random walk, then $X_{t_4} - X_{t_3}$ is independent of $X_{t_2} - X_{t_1}$ for all $t_1 < t_2 \le t_3 < t_4$. Consequently, $W_n(a_4) - W_n(a_3)$ is independent of $W_n(a_2) - W_n(a_1)$ for all $a_i$ such that $[a_i n] = t_i$, $i = 1, \ldots, 4$.

Exercise 7.4 For given $n$, ...

Proof. See the proof given by Billingsley (1968, Theorem 16.1, pp. 137–138). ∎

Because pointwise convergence in distribution $W_n(a, \cdot) \xrightarrow{d} W(a, \cdot)$ for each $a \in [0, 1]$ is necessary (but not sufficient) for weak convergence ($W_n \Rightarrow W$), the Lindeberg–Lévy central limit theorem ($W_n(1, \cdot) \xrightarrow{d} W(1, \cdot)$) follows immediately from Donsker's theorem. Donsker's theorem is strictly stronger than Lindeberg–Lévy, however, as both use identical assumptions, but Donsker's theorem delivers a much stronger conclusion. Donsker called his result an invariance principle. Consequently, the FCLT is often referred to as an invariance principle. So far, we have assumed that the sequence $\{Z_t\}$ used to construct $W_n$ is i.i.d. Nevertheless, just as we can obtain central limit theorems when $\{Z_t\}$ is not necessarily i.i.d., so also can we obtain functional central limit theorems when $\{Z_t\}$ is not necessarily i.i.d. In fact, versions of the FCLT hold for each CLT previously given. To make the statements of our FCLTs less cumbersome than they would otherwise be, we use the following condition.
7.3 Functional Central Limit Theorems

Definition 7.14 (Global covariance stationarity) Let $\{Z_t\}$ be a sequence of $k \times 1$ random vectors such that $E(Z_t'Z_t) < \infty$, $t = 1, 2, \ldots$, and define $\Sigma_n \equiv \mathrm{var}(n^{-1/2}\sum_{t=1}^n Z_t)$. If $\Sigma \equiv \lim_{n\to\infty}\Sigma_n$ exists and is finite, then we say that $\{Z_t\}$ is globally covariance stationary. We call $\Sigma$ the global covariance matrix.
Theorem 7.15 (Lindeberg-Feller i.n.i.d. FeLT) Suppose {Zd satisfies the conditions of the Lindeberg-Feller central limit theorem (Theorem 5.6), and suppose that {Zd is globally covariance stationary. If a 2 = limn->oo var(Wn (l)) > 0, then Wn =} W. Proof. This is a straightforward consequence of Theorem 15.4 of Billingsley (1968) . • Theorem 7.16 (Liapounov i.n.i.d. FeLT) Suppose {Zd satisfies the conditions of the Liapounov central limit theorem (Theorem 5.10) and suppose that {Zt} is globally covariance stationary. Ifa 2 -limn->oo var(Wn (l)) > 0, then Wn =} W. Proof. This follows immediately from Theorem 7.15, as the Liapounov moment condition implies the Lindeberg-Feller condition. • Theorem 7.17 (Stationary ergodic FeLT) Suppose {Zd satisfies the conditions of the stationary ergodic CLT (Theorem 5.16). If a 2 limn->oo var(Wn (l)) > 0, then Wn =} W. Proof. This follows from Theorem 3 of Scott (1973) . • Note that the conditions of the stationary ergodic CLT already impose global covariance stationarity, so we do not require an explicit statement of this condition.
Theorem 7.18 (Heterogeneous mixing FeLT) Suppose {Zd satisfies the conditions of the heterogeneous mixing CLT (Theorem 5.20) and suppose that {Zt} is globally covariance stationary. If a 2 - limn->oo var(Wn (1)) > 0, then Wn =} W.
Proof. See Wooldridge and White (1988, Theorem 2.11). ∎

Theorem 7.19 (Martingale difference FCLT) Suppose $\{Z_t\}$ satisfies the conditions of the martingale difference central limit theorem (Theorem 5.24 or Corollary 5.26) and suppose that $\{Z_t\}$ is globally covariance stationary. If $\sigma^2 \equiv \lim_{n\to\infty}\mathrm{var}(W_n(1)) > 0$, then $W_n \Rightarrow W$.

Proof. See McLeish (1974). See also Hall (1977). ∎

Whenever $\{Z_t\}$ satisfies conditions sufficient to ensure $W_n \Rightarrow W$, we say $\{Z_t\}$ obeys the FCLT. The dependence and moment conditions under which $\{Z_t\}$ obeys the FCLT can be relaxed further. For example, Wooldridge and White (1988) give results that do not require global covariance stationarity and that apply to infinite histories of mixing sequences with possibly trending moments. Davidson (1994) contains an excellent exposition of these and related results.
7.4 Regression with a Unit Root

The FCLT and the following extension of Lemma 4.27 give us the tools needed to study regression for "unit root" processes. Our treatment here follows that of Phillips (1987). We begin with an extension of Lemma 4.27.

Theorem 7.20 (Continuous mapping theorem) Let the pair $(S, \mathcal{S})$ be a metrized measurable space and let $\mu$, $\mu_n$ be probability measures on $(S, \mathcal{S})$ corresponding to $V$, $V_n$, random elements of $S$, $n = 1, 2, \ldots$.
(i) Let $h : S \to \mathbb{R}$ be a continuous mapping. If $V_n \Rightarrow V$, then $h(V_n) \Rightarrow h(V)$.
(ii) Let $h : S \to \mathbb{R}$ be a mapping such that the set of discontinuities of $h$, $D_h \equiv \{s \in S : \lim_{r \to s} h(r) \ne h(s)\}$, has $\mu(D_h) = 0$. If $V_n \Rightarrow V$, then $h(V_n) \Rightarrow h(V)$.

Proof. See Billingsley (1968, pp. 29–31). ∎

Using this result, we can now prove an asymptotic distribution result for the least squares estimator with unit root regressors, analogous to Theorem 4.25.
Theorem 7.21 (Unit root regression) Suppose

(i) $Y_t = X_t\beta_o + \varepsilon_t$, $t = 1, 2, \ldots$, where $X_t = Y_{t-1}$, $\beta_o = 1$, and $Y_0 = 0$;

(ii) $W_n \Rightarrow W$, where $W_n$ is defined by $W_n(a) \equiv n^{-1/2}\sum_{t=1}^{[an]}\varepsilon_t/\sigma$, $0 \le a \le 1$, where $\sigma^2 \equiv \lim_{n\to\infty}\mathrm{var}(n^{-1/2}\sum_{t=1}^n\varepsilon_t)$ is finite and nonzero.

Then

(a) $n^{-2}\sum_{t=1}^n Y_{t-1}^2 \Rightarrow \sigma^2\int_0^1 W(a)^2\,da$.

If in addition

(iii) $n^{-1}\sum_{t=1}^n \varepsilon_t^2 \xrightarrow{p} \tau^2$, $0 < \tau^2 < \infty$,

then

(b) $n^{-1}\sum_{t=1}^n Y_{t-1}\varepsilon_t \Rightarrow (\sigma^2/2)\left(W(1)^2 - \tau^2/\sigma^2\right)$;

(c) $n(\hat\beta_n - 1) \Rightarrow \left(\int_0^1 W(a)^2\,da\right)^{-1}(1/2)\left(W(1)^2 - \tau^2/\sigma^2\right)$, where $\hat\beta_n = \left(\sum_{t=1}^n Y_{t-1}^2\right)^{-1}\sum_{t=1}^n Y_{t-1}Y_t$;

(d) $\hat\beta_n - 1 \xrightarrow{p} 0$.

Proof. (b) Because $Y_t = Y_{t-1} + \varepsilon_t$, we have $Y_t^2 = Y_{t-1}^2 + 2Y_{t-1}\varepsilon_t + \varepsilon_t^2$, so that
$$n^{-1}\sum_{t=1}^n Y_{t-1}\varepsilon_t = (1/2)\left(n^{-1}Y_n^2 - n^{-1}\sum_{t=1}^n\varepsilon_t^2\right) = (1/2)\left(\sigma^2 W_n(1)^2 - n^{-1}\sum_{t=1}^n\varepsilon_t^2\right).$$
By condition 7.21(ii) and the continuous mapping theorem, $W_n(1)^2 \Rightarrow W(1)^2$; by condition 7.21(iii) we have $n^{-1}\sum_{t=1}^n\varepsilon_t^2 \xrightarrow{p} \tau^2$. It follows by Lemma 4.27 that
$$n^{-1}\sum_{t=1}^n Y_{t-1}\varepsilon_t \Rightarrow (\sigma^2/2)W(1)^2 - \tau^2/2.$$
(c) We have
$$n(\hat\beta_n - 1) = n\left[\left(\sum_{t=1}^n Y_{t-1}^2\right)^{-1}\sum_{t=1}^n Y_{t-1}Y_t - 1\right] = n\left(\sum_{t=1}^n Y_{t-1}^2\right)^{-1}\sum_{t=1}^n Y_{t-1}(Y_t - Y_{t-1}) = \left(n^{-2}\sum_{t=1}^n Y_{t-1}^2\right)^{-1} n^{-1}\sum_{t=1}^n Y_{t-1}\varepsilon_t.$$
Conclusions (a) and (b) and Lemma 4.27 then give the desired result:
$$n(\hat\beta_n - 1) \Rightarrow \left(\int_0^1 W(a)^2\,da\right)^{-1}(1/2)\left(W(1)^2 - \tau^2/\sigma^2\right).$$
(d) From (c), we have $n(\hat\beta_n - 1) = O_p(1)$. Thus, $\hat\beta_n - 1 = n^{-1}\left(n(\hat\beta_n - 1)\right) = o(1)O_p(1) = o_p(1)$, as claimed. ∎
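Conclusions (c) and (d) are easy to visualize in a small Monte Carlo (our own sketch; with i.i.d. standard normal errors, $\tau^2 = \sigma^2$, so the limit in (c) is $(1/2)(W(1)^2 - 1)/\int_0^1 W(a)^2\,da$):

```python
import numpy as np

rng = np.random.default_rng(4)
reps, n = 3000, 200

stats = np.empty(reps)
for i in range(reps):
    eps = rng.normal(size=n)
    Y = np.cumsum(eps)                         # random walk, Y_0 = 0, beta_o = 1
    Ylag = np.concatenate(([0.0], Y[:-1]))     # Y_{t-1}
    beta_hat = (Ylag @ Y) / (Ylag @ Ylag)      # least squares slope
    stats[i] = n * (beta_hat - 1.0)            # n(beta_hat - 1): O_p(1)

# The draws of n(beta_hat - 1) have a nondegenerate, left-skewed
# (Dickey-Fuller-type) distribution rather than a normal one.
```

The left skew (most draws below zero) is the hallmark of the unit root limiting distribution, in contrast to the symmetric normal limits of earlier chapters.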
In the proof of (a), we appealed to the continuous mapping theorem to establish that $\int_0^1 W_n(a)^2\,da \Rightarrow \int_0^1 W(a)^2\,da$. The next exercise asks you to verify this and then gives a situation where the continuous mapping theorem does not apply.
Exercise 7.22 For $U, V \in D$, put $d_U(U, V) = \sup_{a\in[0,1]}|U(a) - V(a)|$, and consider the metrized measurable spaces $(D, d_U)$ and $(\mathbb{R}, |\cdot|)$, the mapping $M_1(U) = U^2$, and the two functionals $M_2(U) = \int_0^1 U(a)^2\,da$ and $M_3(U) = \int_0^1 \log|U(a)|\,da$.
(i) Show that $M_1 : (D, d_U) \to (D, d_U)$ is continuous at $U$ for any $U \in C$, but not everywhere in $D$.
(ii) Show that $M_2 : (D, d_U) \to (\mathbb{R}, |\cdot|)$ is continuous at $U$ for any $U \in C$, so that $\int_0^1 W_n(a)^2\,da \Rightarrow \int_0^1 W(a)^2\,da$.
(iii) Show that $M_3 : (D, d_U) \to (\mathbb{R}, |\cdot|)$ is not continuous everywhere in $C$, and $\int_0^1 \log(|W_n(a)|)\,da \not\Rightarrow \int_0^1 \log(|W(a)|)\,da$.
Exercise 7.23 Use Theorem 7.18 and a corresponding law of large numbers to state conditions on {cd ensuring that 7.21 (ii) and 7.21 (iii) hold, with conclusions 7.21 (a) - (d) holding in consequence. What conditions suffice for 7 2 = a 2 ? Phillips also gives a limiting distribution for the standard t-statistic centered appropriately for testing the hypothesis {30 = 1 (the "unit root hypothesis"). Not surprisingly, this statistic no longer has the Student-t distribution nor is it asymptotically normal; instead, the t-statistic limiting distribution is a function of W, similar in form to that for n(/3 n - 1). Although the nonstandard distribution of the t-statistic makes it awkward to use to test the unit root hypothesis (Ho: (30 = 1), a simple rearrangement leads to a convenient XI statistic, first given by Phillips and Durlauf (1986, Lemma 3.1 (d)). From 7.21 (b) we have that n
n- 1L:Yt-1 ct t=l
(a 2/2) (W(1)2 -
7
2
/a 2 )
.
Thus when $\beta_o = 1$,
$$\left(2n^{-1}\sum_{t=1}^n Y_{t-1}\varepsilon_t + \tau^2\right)\Big/\sigma^2 \Rightarrow W(1)^2 \sim \chi^2_1.$$
This statistic depends on the unknown quantities $\sigma^2$ and $\tau^2$, but estimators $\hat\sigma_n^2$ and $\hat\tau_n^2$ consistent for $\sigma^2$ and $\tau^2$, respectively, under the unit root null hypothesis can be straightforwardly constructed using $\hat\varepsilon_t = Y_t - Y_{t-1}$. In particular, set $\hat\tau_n^2 = n^{-1}\sum_{t=1}^n \hat\varepsilon_t^2$ and form $\hat\sigma_n^2$ using Theorem 6.20. We then have
$$\left(2n^{-1}\sum_{t=1}^n Y_{t-1}\hat\varepsilon_t + \hat\tau_n^2\right)\Big/\hat\sigma_n^2 \xrightarrow{d} \chi^2_1$$
under the unit root null hypothesis, providing a simple unit root test procedure. Despite the convenience of the Phillips–Durlauf statistic, it turns out to have disappointing power properties, even though it can be shown to represent the locally best invariant test (see Tanaka, 1996, pp. 324–336). The difficulty is that its optimality is typically highly localized. In fact, as Elliott, Rothenberg, and Stock (1996) note, there is no uniformly most powerful test in the present context, in sharp contrast to the typical situation of earlier chapters, where we could rely on asymptotic normality. Here, the limiting distribution is nonnormal. In consequence, there is a plethora of plausible unit root tests in the literature, and our discussion in this section has done no more than scratch the surface. Nevertheless, the basic results just given should assist the reader interested in exploring this literature. An overview of the literature is given by Phillips and Xiao (1998); of particular interest are the articles by Dickey and Fuller (1979, 1981), Elliott et al. (1996), Johansen (1988, 1991), Phillips (1987), Phillips and Durlauf (1986), and Stock (1994, 1999).
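A quick simulation (our own sketch) of the rearranged statistic under the null; with i.i.d. errors we take $\hat\sigma_n^2 = \hat\tau_n^2$ for simplicity, rather than forming $\hat\sigma_n^2$ via Theorem 6.20:

```python
import numpy as np

rng = np.random.default_rng(5)
reps, n = 4000, 300

stats = np.empty(reps)
for i in range(reps):
    eps = rng.normal(size=n)
    Y = np.cumsum(eps)                              # unit root null: Y_t = Y_{t-1} + eps_t
    e_hat = np.diff(np.concatenate(([0.0], Y)))     # e_hat_t = Y_t - Y_{t-1}
    tau2 = e_hat @ e_hat / n                        # tau_n^2
    sigma2 = tau2                                   # i.i.d. errors: sigma^2 = tau^2
    Ylag = np.concatenate(([0.0], Y[:-1]))
    stats[i] = (2 * (Ylag @ e_hat) / n + tau2) / sigma2

# Telescoping gives 2*sum Y_{t-1} e_t + sum e_t^2 = Y_n^2, so each draw equals
# Y_n^2 / (n * sigma2), i.e., W_n(1)^2; its distribution should be close to
# a chi-squared with one degree of freedom.
```

This makes concrete why the rearranged statistic is convenient: its null distribution is the familiar $\chi^2_1$, so ordinary critical values apply.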
7.5
Spurious Regression, Multivariate Wiener Processes, and Multivariate FeLTs
Now consider what happens when we regress a unit root process Yt = Yt-l + Ct not on its own lagged value Yt-l' but on another unit root process, say X t = X t - 1 + 7]t. For simplicity, assume that {7]t} is i.i.d. and independent of the i.i.d. sequence {cd so that {Yt} and {Xd are independent random
7.5 Spurious Regression and Multivariate FeLTs
185
walks, and as before we set Xo = Yo = O. We can write a regression equation for yt in terms of X t formally as
where (30 = 0 and Ut = yt, reflecting the lack of any relation between yt and Xt. We then ask how the ordinary least squares estimator
behaves as n becomes large. As we will see, iJn is not consistent for (30 = o but instead converges to a particular random varia~le. Because there is truly no relation between yt and Xt, and because (3n is incapable of revealing this, we call this a case of "spurious regression." This situation was first considered by Yule (1926), and the dangers of spurious regression were forcefully brought to the attention of economists by the Monte Carlo studies of Granger and Newbold (1974). To proceed, we write t
n-1/2Xt/al =n- 1/ 2 L'T]8/ a1 , 8=1
t
n- 1/ 2yt/a2 = n- 1/ 2 LE 8/a2, 8=1
"n
)
"n
- l'Im ->= var (n -1/2 L.d=1 'T]t an d a 22 = - l'lm n ->= var (n -1/2 L.d=1 a 21 = n Ed, and aAt = tin as befon::. Substituting for X t and yt and, for convenience, treating (3n-l instead of (3n we can write W h ere
(3n-l
(i?i-,) -, i?'-'y,-, )-1 ai (a,fa,) (n-' i;W,n(a,-d 2) -'n-' f;W,n(a,-dW2n(a,-d. n
(
n ~ W 1n (at_J)2
n
ala2 n ~ W1n(at-J)W2n(at-d
186
7. Functional Central Limit Theory and Applications
From the proof of Theorem 7.21 (a), we have that
where WI is a Wiener process. We also have
because W In (a)W2n (a) is constant for (t - l)ln < a < tin. Thus
n- I
n
I
L W In (at-dW2n(at-l) = r W In (a)W2n(a)da. io
t=1
J;
We might expect this to converge in distribution to WI(a)W2(a)da, where WI and W2 are independent Wiener processes. (Why?) If so, then we have
which is a nondegenerate random variable. f3 n is then not consistent for f3 0 = 0, so the regression is "spurious." Nevertheless, we do not yet have all the tools needed to draw this conclusion formally. In particular, we need to extend the notion of a univariate Wiener process to that of a multivariate Wiener process, and for this we need to extend the spaces of functions we consider from C[O, (0) or D[O,oo) to Cartesian product spaces Ck[O,oo) - xj=IC[O,oo) and Dk[O, (0) xj=ID[O, (0).
Definition 7.24 (Multivariate Wiener process) W = (WI, ... , Wk)' is a multivariate (k-dimensional) Wiener process ijWI , ... , Wk are independent (lR-valued) Wiener processes. The multivariate Wiener process exists (e.g., Breiman, 1992, Ch. 12), has independent increments, and has increments W(a,') - W(b,.) distributed as multivariate normal N(O, (a - b)I), < b < a < 00, as is straightforwardly shown. Further, there is a version such that for all wED, W(O,w) = 0 and W(',w) : [0,(0) ---> IRk is continuous, a consequence of Proposition 7.7 and the fact that W is continuous if and only if its
°
7.5 Spurious Regression and Multivariate FeLTs
187
components are continuous. Thus, there is a version of the multivariate Wiener process taking values in C k [0,00), so that W is a random element of C k [0, 00). As in the univariate case, when we speak of a multivariate Wiener process, we shall have in mind one with continuous sample paths. When dealing with multivariate FCLTs it will suffice here to restrict attention to functions defined on [0, 1J. We thus write C k xj=lC[O,lJ and Dk = xj=lD[O,lJ. Analogous to the univariate case, we can define a multivariate random walk as follows.
=
Definition 7.25 (Multivariate random walk) Let X t = X t - 1 + Zt, t = 1,2, ... , where Xo = 0 (k x 1) and {Zd is a sequence of i.i.d. k x 1 vectors Zt = (Ztl, ... , Ztk)' such that E(Zt) = 0 and E(ZtZ~) = :E, a finite positive definite matrix. Then {X d is a multivariate (k-dimensional) random walk. We form the rescaled partial sums as [an]
Wn(a)
=
:E- 1/ 2 n- l / 2
LZt. t=l
The components of Wn are the individual partial sums [an]
Wnj(a) = n
-1/2
~
-
~Ztj,
j = 1, ... ,k,
t=l
where Ztj is the jth element of :E- l / 2 Zt. For given w, the components Wnj(·,w) are piecewise constant, so Wn is a random element of Dk. We expect that a multivariate version of Donsker's theorem should hold, analogous to the multivariate CLT, so that W n =::} W. To establish this formally, we study the weak convergence of the measures f.Ln, defined by f.Ln(A)
= P{w : W
n (·, w) E A},
where now A E 8(Dk, d) for a suitable choice of metric d on Dk. For example, we can choose d(x,y) = L~=l dB(xj,Yj) for x = (Xl, ... ,Xk)', Y = (Yl>·· . ,Yk)', Xj, Yj ED with dB the Billingsley metric on D. The multivariate FCLT provides conditions under which f.Ln converges to the multivariate Wiener measure f.Lw defined by f.Lw (A) = P{w : W(-,w) E An Ck},
188
7. Functional Central Limit Theory and Applications
where A E B(Dk, d). When J.L n =} J.Lw we say that Wn =} Wand that Wn obeys the multivariate FCLT, or that {Zt} obeys the multivariate FCLT. To establish the multivariate FCLT it is often convenient to apply an analog of the Cramer-Wold device (e.g., Wooldridge and White, 1988). Proposition 7.26 (Functional Cranter-Wold device) Let {V n } be a sequence of random elements of Dk and let V be a random element of Dk (not necessarily the Wiener process). Then V n =} V if and only if A'Vn =} A'V for all A E ~k, A'A = 1. Proof. See Wooldridge and White (1988), proof of Proposition 4.1. • Applying the functional Cramer-Wold device makes it straightforward to establish multivariate versions of Theorems 7.13 and 7.15 through 7.19. To complete our discussion of spurious regression, we state the multivariate Donsker Theorem. Theorem 7.27 (Multivariate Donsker) Let {Zt} be a sequence ofi.i.d. k x 1 vectors Zt = (Ztl, ... ,Ztk)' such that E(Zt) = 0 and E(ZtZ~) = ~, a finite nonsingular matrix. Then W n =} W. Proof. Fix A E ~k, A'A = 1. Then A'Wn(a) = n- 1/ 2 L~:J A'~-1/2 Zt, where {A'~-1/2 Zt} is i.i.d. with E(A'~-1/2 Zt) = 0 and var(A'~-1/2 Zt}
A'~-1/2 E(ZtZ~)~-1/2 A A'~-1/2~~-1/2 A
= A' A =
1.
The conditions of Donsker's Theorem 7.13 hold, so A'Wn =} W, the univariate Wiener process. Now W = A'W and this holds for all A E ~k, A' A = 1. The result then follows by the functional Cramer-Wold device. • At last we have what we need to state a formal result for spurious regression. We leave the proof as an exercise. Exercise 7.28 (Spurious regression) Let {Xt} and {Yt} be independent random walks, X t = X t - 1 + 'T}t and Yt = Yt-l + tt, where O'~ - E('T}Z) 1 ] -1 1 and O'~ E(ct)· Then 13n =} (0'2/0'1) Wl(a)2da Wl(a)W2(a)da, where W = (WI, W2)' is a bivariate Wiener process. (Hint: apply the multivariate Donsker Theorem and the continuous mapping theorem with S = D2.)
=
A
[
Io
Io
Not only does iJn not converge to zero (the true value of 13 0 ) in this case, but it can be shown that the usual t-statistic tends to 00, giving the
7.5 Spurious Regression and Multivariate FeLTs
189
misleading impression that iJn is highly statistically significant. See Watson (1994, p. 2863) for a nice discussion and Phillips (1986) for further details. For each univariate FCLT, there is an analogous multivariate FCLT. This follows from the following result. Theorem 7.29 (Multivariate FCLT) Suppose that {Zd is globally covariance stationary with mean zero and nonsingular global covariance matrix ~ such that for each A E jRk, A' A = 1, {A'~-1/2 Zd obeys the FeLT. Then Wn =} W. Proof. Fix A E jRk, A' A = 1. Then A'Wn(a) = n- 1 / 2 2::~:'l A'~-1/2 Zt. Now E(A'~-1/2 Zt) = 0 and var(A'W n (l» ~ 1 as n ~ 00. Because {A'~-1/2 Zd satisfies the FCLT, we have A'Wn =} W = A'W. This follows for all A E jRk, A' A = 1, so the conclusion follows by the functional Cramer-Wold device. _ Quite general multivariate FCLTs are available. See Wooldridge and White (1988) and Davidson (1994, Ch. 29) for some further results and discussion. For example, here is a useful multivariate FeLT for heterogeneous mixing processes, which follows as an immediate corollary to Wooldridge and White (1988, Corollary 4.2). Theorem 7.30 (Heterogeneous mixing multivariate FCLT) Let {Znd be a double array of k x 1 random vectors Znt = (Zntl, ... ,Zntk)' such that {Znd is mixing with ¢ of size -r /(2r- 2) or a of size -r /(r-2), r > 2. Suppose further that EIZntjl < ~ < 00, E(Zntj) = 0, n, t = 1,2, ... , j = 1, ... ,k. If {Znd is globally covariance stationary with nonsingular global covariance matrix ~ = lim n -+ oo var(n- 1/ 2 2::;-1 Znt}, then Wn =}
W. Proof. Under the conditions given, A'~-1/2 Znt obeys the heterogeneous mixing FCLT, Theorem 7.18. As Znt is globally covariance stationary, the result follows from Theorem 7.29. See also Wooldridge and White (1988, Corollary 4.2). _ Our next exercise is an application of this result. Exercise 7.31 Determine the behavior of the least squares estimator iJ n when the unit root process Yt = Yt-1 + Ct is regressed on the unit root process X t = X t - 1 + TJt, where {TJt} is independent of {cd, and {TJt,cd satisfies the conditions of Theorem 7.30.
190
7.6
7. Functional Central Limit Theory and Applications
Cointegration and Stochastic Integrals
Continuing with the spurious regression setup in which {Xd and {yt} are independent random walks, consider what happens if we take a nontrivial linear combination of X t and yt:
where al and a2 are not both zero. We can write this as
where Zt = aIyt + a2Xt and Vt = alCt + a2"lt. Thus, Zt is again a random walk process, as {Vt} is i.i.d. with mean zero and finite variance, given that {cd and {1Jt} each are Li.d. with mean zero and finite variance. No matter what coefficients al and a2 we choose, the resulting linear combination is again a random walk, hence an integrated or unit root process. Now consider what happens when {Xd is a random walk (hence integrated) as before, but {yt} is instead generated according to yt = Xt/3 0 +Ct, with {Ct} again i.i.d. By itself, {yt} is an integrated process, because
yt - Yt-I
= (Xt -
X t -d(3o + et - et-I
so that
yt
yt-I + 1Jt(3o + Ct - Ct-I yt-I+(t,
where (t = 1Jt(3o + Ct - ct-I is readily verified to obey the FCLT. Despite the fact that both {Xt } and {yt} are integrated processes, the situation is very different from that considered at the outset of this section. Here, there is indeed a linear combination of X t and yt that is not an integrated process: putting al = 1 and a2 = -(30 we have
which is LLd. This is an example of a pair {Xt, yt} of cointegrated processes.
Definition 7.32 (Cointegrated processes) Let X t = (Xtl , ... , Xtk)' be a vector of integrated processes. If there exists a k x 1 vector a such that the linear combination {Zt = a' Xd obeys the FeLT, then X t is a vector of cointegrated processes.
7.6 Cointegration and Stochastic Integrals
191
Cointegrated processes were introduced by Granger (1981). This paper and that of Engle and Granger (1987) have had a major impact on modern econometrics, and there is now a voluminous literature on the theory and application of cointegration. Excellent treatments can be found in Johansen (1988, 1991, 1996), Phillips (1991), Sims, Stock, and Watson (1990) and Stock and Watson (1993). Our purpose here is not to pursue the various aspects of the theory of cointegration. Instead, we restrict our attention to developing the tools necessary to analyze the behavior of the least squares estimator (3n = (L:~=l XtXD- I L:~-l X t Y t , when (X t , Yd is a vector of cointegrated processes. For the case of scalar Xt, n
n
t=l
t=l n
(30
n
+ (LX;)-l LXtet. t=l
t=l
For now, we maintain the assumption that {Xd is a random walk and that {ed is i.i.d., independent of {1]t} (which underlies {Xd). We later relax these requirements. , From the expression above, we see that the behavior of (3n is determined by that of L:~=l xl and L:~=l Xtet, so we consider each of these in turn. First consider L:~=l From Theorem 7.21 (a) we know that
X;.
where ai = E(1];). Next, consider L:~=l Xtet. Substituting X t = X t - l n
n
n- l L Xtet = n- l L Xt-Iet t=l t=l
+ 1]t
we have
n
+ n- l L 1]t et· t=l
Under our assumptions, {1]ted obeys a law of large numbers, so the last term converges to E(1]ted = O. We thus focus on n- l L:~=l Xt-Iet. If we
192
7. Functional Central Limit Theory and Applications
write
8=1
t
n- 1/ 2 Le 8 /O'2, 8=1
where, as before, we set at = t/n, we have X t - 1 = n1/2O'lWln(at_d and et = n 1/ 2O'2 (W2n(at) - W2n(at-t}). Substituting these expressions we obtain n
n- 1 L
n 1/ 2O'l W 1n (at-d
t=1 x n 1/ 2O'2 (W2n(at) - W2n(at-d) n
0'10'2 L W 1n (at-d (W2n(at} - W2n(at-t}). t=1 The summation appearing in the expression immediately above has a prototypical form that plays a central role not only in the estimation of the parameters of cointegration models, but in a range of other contexts as well. The analysis of such expressions is facilitated by defining a stochastic integral as 1
1
n
Wln dW2n - LW1n (at-l) (W2n(at) - W2n(at-J)). o t=1
Writing the expression in this way and noting that Wn = (WIn, W2n)' W = (WI, W2)' suggest that we might expect that
11 W 1n dW2n
=}
11
=}
W1dW2,
where the integral on the right, involving the stochastic differential dW2 , will have to be given a suitable meaning, as this expression does not correspond to any of the standard integrals familiar from elementary calculus. Under certain conditions this convergence does hold. Chan and Wei (1988) were the first to establish a result of this sort. Nevertheless, this result is not generally valid. Instead, as we describe in more detail below, a recentering may be needed to accommodate dependence in the increments
193
7.6 Cointegration and Stochastic Integrals
of WIn or W2n not present in the increments of WI and W2. As it turns out, the generic result is that
where An performs the recentering. We proceed by first making sense of the integral on the right, using the phenomenally successful theory of stochastic integration developed by Ito in the 1940's; see Ito (1944). We then discuss the issues involved in 1 establishing the limiting behavior of Jo W 1n dW2n . To ensure that the stochastic integrals defined below are properly behaved, we make use of the notion of the .filtration generated by W. Definition 7.33 (Filtration generated by W) Let W be a standard multivariate Wiener process. The filtration generated by W is the sequence of a-fields {Ft, t E [0, oo)}, where F t = a(W(a), A, a finite matrix. Then
Proof. This is an immediate corollary of Theorem 4.1 of DeJong and Davidson (2000). • Theorem 7.46 (Cointegrating regression with mixing innovations) Suppose
(i) {(t = ("1~,EtY} is a globally covariance stationary mixing process of (k + 1) x 1 vectors with ¢ of size -r/(2r - 2) or a of size -r/(r - 2),
202
7. Functional Central Limit Theory and Applications
r > 2, such that E((t) = 0, E(I(tin < ~ < 00, i = 1, ... ,k + 1 and for all t. Suppose further that ~ = limn->= var(n- 1 / 2 L~=l (t) is finite and nonsingular. (ii) yt = X~,8o+tt, t = 1,2, ... , where X t = X t - 1 +7Jt, t Xo = 0, with,8o E IRk.
= 1,2, ...
, and
Let W denote a standard multivariate Wiener process, let D~ be the k x (k + 1) selection matrix such that D~ (t = 7Jt, let D~ be the 1 x (k + 1) selection vector such that D~(t = tt, let n
A
t
= lim ~-1/2n-1 ' " ' " E((8(~)~-1/2', ~~
n--+oo
t=28=1
=
and let r limn->= n- 1 L~-l E((ttt). Then A and r are finite and 1 (a) n- 2 L~ 1 XtX~ :::} D~~1/2 [fo W(a)W(a)'da] ~1/2'D1'
(b) n- 1 L~ 1 Xtt t :::} D~~1/2 [fo WdW' 1
+ A] ~1/2'D2 + r.
(c) n(/3n - ,80)
:::}
[D~~1/2 [1 x
[D~~1/2
1
W(a)W(a)'da]
[1
1
WdW'
~1/2'D1r1
+ A] ~1/2'D2 +
r] .
(d) /3n ~ ,80'
Proof. The mixing and moment conditions ensure the finiteness of A and r. (a) Under the mixing conditions of (i) {et = ~-1/2(t} obeys the heterogeneous mixing multivariate FeLT, so the result follows from the argument preceding the statement of the theorem. (b) As we wrote above, n
n- 1 I:Xtt t = t=1
1
D~~1/21 WndW~ ~1/2D2 0
n
+ n- 1 I : 7Jttt· t=1
Under the mixing conditions of (i) {7J~, td obeys the law of large numbers, so n- 1 L~=1 7Jttt ~ r. Theorem 7.45 applies given (i) to ensure f~ WndW~ :::} f~ WdW' + A. The result follows by Lemma 4.27.
7.6 Cointegration and Stochastic Integrals
203
(c) The results of (a) and (b), together with Lemma 4.27 deliver the · XX,)-l ,n X tEt· coneIUSlOn, as n ((3A n - (3)-( 0 n -2",n L.d=l t t n - l "L...t=l (d) i3 n
-
(30 = n- 1 (n(i3n - (30)) = o(l)Op(l) = op(l) . •
Observe that the conclusions here extend those of Exercise 7.44 in a natural way. The statement of the results can be somewhat simplified by making use of the notion of a Brownian motion with covariance ~, defined as B = ~1/2W, denoted BM(~), where W is a standard Wiener process (Brownian motion). With this notation we have
n(i3n - (30)
=}
1 [D~ [1
B(a)B(a)'da]
x
[D~
[1
1
BdB'
D1f1
+ ~1/2A~1/2']
D2
+
r] .
In conclusion (c), we observe the presence of the term
The component involving A arises from serial correlation in D.~) < P(I(n-Aa n + n-fLbn)1 > D.~) < P(I(n-Aanl + In-fLbnl > D.~) < P(I(n-Aanl > D.a,c) + P(ln-fLbnl > D.b,c)
P(ln-"(a n + bn)1
N, which proves that an
+ bn =
Op(n").
(ii) We have n-Aa n ~ 0, n-fLbn ~ O. It follows from Proposition 2.30 that n-Aann-fLb n = n-(A+fL)anbn ~ 0 so that anb n = op(nA+fL). Consider {an + bn }. Since {an} and {b n } are op(n"), we again apply Proposition 2.30 and obtain n-"a n + n-"b n = n-"(a n + bn ) ~ O.
(iii) We need to prove that for any c > 0, there exist Dc > 0 and Ne
E
N
such that
for all n > N. By definition there exist Na,e E Nand D.a,e < 00 such that P(w : In-Aanl > D.a,e) < c/2 for all n > Na,e' Define Db,e -
216
Solution Set
8£ / ~a,£ > 0; then by the definition of convergence in probability, there exists Nb,£ such that P(ln-f.Lbnl > 8b,£) < c/2 for all n > Nb,£. As in (i) we have that
Since this holds for arbitrary choice of c > 0, we have shown that anb n = op(n>'+f.L). Consider {an + bn }. Since {b n } is also Op(nf.L) , it follows from (i) that an + bn = Op(nl 1, or a of size -r j(r -1), r > 1, by Proposition 3.50. Given (iii.b) and (iv.a) the elements of {Xtcd and {XtXD satisfy the moment condition of Corollary 3.48 by Minkowski's inequality and the Cauchy-Schwarz inequality. It follows that X'cjn ~ 0 and X'Xjn - Mn ~ O. Mn = 0(1) by Jensen's inequality given (iv.a). Hence the conditions of Theorem 2.18 are satisfied and the result follows. • Exercise 3.53 Proof. (i) The following conditions are sufficient:
(i) rt = aort-l + 13 0 X t + Ct, 100 0 1 < 1, 1130 1 < 00; (ii) {(rt,Xt )} is a mixing sequence with ¢ of size -rj(2r -1), r > 1, or a of size -rj(r -1), r > 1; (iii) (a) E(Xt-jct) = 0, j = 0,1, ... and all t; (b) E(ctct-j) = 0, j = 1,2, ... and all t; (iv) (a)
ElxllrH < b. < 00 and ElcW H < b. < 00 for some 0 < 8 < r
and all t; (b) Mn E(X'Xjn) has det(Mn) > 'Y > 0 for all n sufficiently large, where X t = (rt-l,X t )'.
=
First, we verify the hint (see Laha and Rohatgi, 1979, p. 53). We are given that n
L(EIZtIP)l/p < t=l
00.
219
Solution Set
By Minkowski's inequality for finite sums,
for all n > 1. Hence P
n
. =
0
o.
Setting /3 n = (X'X/n)-l X'Y /n and taking a mean value expansion of s(,8) around /3n gives
01:/0,8 01:/0)..
2(X'X/n)(,8 - /3n) + \7s(,8)' >. s(,8n) + \7s (,8 - ,8n) = 0, A
,
= 0
A
where \7s' is the q x k Jacobian of s with ith row evaluated at a mean value j3~). Premultiplying the first equation by \7s'(X'X/n)-l and substituting -s(/3n) = \7s' (,8 - /3n) gives
An
= 2 [\7s' (X'X/n) -1 \7s(/3n)r1s(/3n).
Thus, following the procedures of Theorems 4.32 and 4.39 we could propose the statistic
Solution Set
227
where
An
4(\7s' (X'X/n)-l\7s(,Bn))-l\7s(,Bn)' (X'X/n)-l Vn
X(X'X/n )-1 \7s(,Bn) (\7s' (X'X/n) -1 \7s(,Bn) )-1. The preceding statistic, however, is not very useful because it depends on generally unknown mean values and also on the unconstrained estimate ,Bn. An asymptotically equivalent statistic replaces \7s' by \7s(,Bn)' and ,Bn by
,Bn:
where
and
An
4(\7s(,Bn)' (X'X/n )-1 \7S(,Bn)) -1 \7s(,Bn)' (X'Xjn)-l
xVn(X'Xjn )-1 \7s(,Bn) (\7s(,Bn)' (X'X/n )-1 \7s(,Bn) )-1. To show that £Mn ~ X~ under Ha we note that £Mn differs from Wn only ••
..............
....
p
in that V n is used in place of V nand f3n replaces f3n· Since f3n - f3n ---> 0 and Vn - Vn ~ 0 under Ha given the conditions of Theorem 4.25, it follows from Proposition 2.30 that
given that \7s is continuous. Hence £Mn ~ X~ by Lemma 4.7 . • Exercise 4.42 Proof. First consider testing the hypothesis Rf3 0 = r. Analogous to Theorem 4.31 the Wald statistic is -
A-I
-
Wn - n(Rf3n - r)'r n (Rf3n - r)
d
--->
2
Xq
under H a , where rn
RDnR'
R(X'ZP n Z'Xjn 2 )-1(X'Zjn)P nV nPn(Z'Xjn) x (X'ZP n Z'X/n 2 )-lR'.
228
Solution Set
To prove Wn has an asymptotic X~ distribution under H a, we note that
so _ -1/2 r n-1/2 y'n(R,(3n - r) - r n Ry'n(,(3n -
,(30)'
where
It follows from Corollary 4.24 that r~1/2RVn(.Bn - ,(30) ~ N(O, I) so that
r~1/2Vn(R.Bn - r) ~ N(O,I). Since Dn - Dn ~ 0 from Exercise 4.26 it follows that An - An ~ 0 by Proposition 2.30. Hence Wn ~ X~ by Theorem 4.30. We can derive the Lagrange multiplier statistic from the constrained minimization problem: min(Y - X,(3)'ZPnZ'(Y - X,(3)/n 2 S.t. R,(3 = r. (3
The first-order conditions are
2(X'ZP n Z'X/n 2),(3 - 2(X'Z/n)P n Z'Y /n R,(3 - r = 0,
OL/O,(3 OL/OA
+ RA =
0
where A is the vector of Lagrange multipliers. It follows that
An
2(R(X'ZP n Z'X/n 2 )-1R')-1(R,Bn - r)
,Bn
.Bn - (X'ZPnZ'X/n2)-lR'~n/2. ··'~-1··
A
Hence, analogous to Theorem 4.32, LMn - nAnAn An ~ X~ under H a, where
An
-
4(R(X'ZP nZ'X/n 2 )-lR')-lR(X'ZP n Z'X/n2 )-1 (X'Z/n)Pn \TnPn(Z'X/n)(X'ZPnZ'X/n2)-1 xR' (R(X'ZP nZ'X/n2)-1 R')-l X
and \Tn is computed under the constrained regression such that \Tn p p V n - 7 0 under Ha. If we can show that LMn - Wn - 7 0, then we can apply Lemma 4.7 to conclude LMn ~ X~. Note that LMn differs from ~
229
Solution Set ••
,.,..
,.,
p
Wn in that V n is used in place of V n. Since V n - V n ------t 0 under Ho, it follows from Proposition 2.30 that CMn - Wn ~ o. Next, consider the nonlinear hypothesis s(f3o) = o. The Wald statistic is easily seen to be
where
and Dn is given in Exercise 4.26. The proof that Wn ~ X~ under Ho is identical to that of Theorem 4.39 except that 13n replaces i3n, Dn is appropriately defined, and the results of Exercise 4.26 are used in place of those of Theorem 4.25. The Lagrange Multiplier statistic can be derived in a manner analogous to Exercise 4.41, and the result is that the Lagrange multiplier statistic has the form of the Wald statistic with the constrained estimates f3n and Vn replacing i3n and Vn. Thus
where
and
An
4[V's(f3n)' (X'ZP nZ'Xjn2) -1 V'S(f3 n )]-l XV'S(f3n)' (X'ZP nZ'Xjn2)-1(X'Zjn)P nVnP n(Z'Xjn) (X'ZP nZ'Xjn2) -1 V'S(f3n) [V'S(f3n)' 2 X (X'ZP n Z'Xjn )-lV'S(f3n)]-1. X
••
p
••
A
P
Now V n - V n ------t 0 and V's(f3n) - V's(f3n) ------t 0 given that V's(f3) is continuous. It follows from Proposition 2.30 that CMn - Wn ~ 0, so that CMn ~ X~ by Lemma 4.7 . •
230
Solution Set
Exercise 4.46 Proof. Under the conditions of Exercise 4.26, Proposition 4.45 tells us that the asymptotically efficient estimator is
{3~
(X'XV: 1X'X) -1 X'XV: 1X'y 1 (X'X)-lV n(X'X)-l X'XV: X'Y (X'X)-lX'Y = i3n,
where we substituted X'X
= Vn . •
Exercise 4.47 Proof. We assume that the conditions of Exercise 4.26 are satisfied. In addition, we assume a~(Z'Z/n) - a~Ln ~ 0 so that V n = a~(Z'Z/n). (This follows from a law of large numbers.) Then from Proposition 4.45 it follows that the asymptotically efficient estimator is {3~
(X'Z(a~ (Z'Z/n)) -lZ'X)-l X'Z( a~ (Z'Z/n)) -1 Z'Y
(X'Z(Z'Z)-lZ'X)-l X'Z(Z'Z)-l Z'y {32SLS·
• Exercise 4.48 Proof. From the definition Vn({3~* - {3~)
(X'ZPnZ'X/n2)-1V's({3~)
=
x [V's({3~)' (X'ZP nZ'X/n2) -1 V's({3~) ]-1 Vns({3~) it follows that Vn({3~* - {3~) ~ 0 provided that (X'ZP n Z'X/n 2)-1 = Op(1), V's({3~) = Op(1), and yTis({3~) = opel). The first two are easy to show, so we just show that yTis({3~) = opel). Taking a mean value expansion and substituting the definition of {3~ gives
Vn s(/3 n ) + VnV's~ ({3~ - /3 n ) Vns(/3 n ) - V's~(X'ZPnZ'X)-lV's(/3n) X
[V's(/3n)' (X'ZP nZ'X)-l V's(/3n)t 1Vns(/3 n )
(I - V's~(X'ZP nZ'X)-lV's(/3n)
t1) Vns((3n).
x [V's(/3n)' (X'ZP nZ'X) -1 V's(/3 n)
231
Solution Set
Now
since \7s n - \7s(i3n) = op(l), and y'Tis(i3 n ) = y'Tis(13 o ) + \7s~y'Ti(i3n 13 o ) = 0 + Op(1)Op(l) = Op(l) by Lemma 4.6, and the result follows . • Exercise 4.56 Proof. Since Xf is Ft_l-measurable then so is Zf (-y*) brevity, we write Zf = Zf(-y*) in the following. We have n
n
EO(n- 1 LEO(ZfetIFt - 1)) t=1
EO(n- 1 LZfet) t=1
n
t=1 0, and by the martingale difference property n
Vn
varo(n-
L Zfet)
1
t=1 n
EO(n- 1 L Zfete~Z~O) t=1 n
EO(n- 1 L
Zf EO(ete~IFt-dZ~O)
t=1 n
EO(n- 1 L
ZfntZ~O)
t=1 n
EO(n- 1 LXfntlX~O). t=1
Se t V n = n -1 A
""n x on L..d=1 t £t ~
1 X'o
t·
232
With
Solution Set
Zf
=
Xfn t 1 we have
• Exercise 4.58 Proof. We have
f3~
(tXfntlxt) -1 t Xfnt 1Yf t=1
t=1
(i;x;n"Xf) -, tJGn,'(JG' f30
+ (tXfntlxt) -1 t t=1
{3" + e,)
Xfnt le t.
t=1
Set m~(-y*) = n- 1
n
2:xfnt1et t=1
and we have
v'n(f3~ -
(30) = (n- 1
tXfntlxt) -1 nl/2m~(-y*) t=1
= Q~(-y*)-lnl/2m~(-y*)
+ [ (n-' ~JGn"x;')
-, -Q~(-y')-'l n'/2m
:,(1")
= Q~(-y*)-lnl/2m~(-y*)
using Theorem 4.54 (iii.a) and (ii). So avarO(f3~) = Q~(-y*)-IV~(-y*)Q~(-y*)-I,
+ opo(l)
233
Solution Set
where
V~(-y*) n
l g g'n- 1Xo,) n- 1 "" L...t Eo(Xontt ttt t t=l n
n- 1
L EO(Xf n i l EO(gtg~IFt_1)nilXn t=l
n- 1
n
L EO(Xf n i 1Xn t=l
using the law of iterated expectations. By assumption , QO(-v*) = n- 1 L ..d=l Eo(Xon-1XO') n f t t t , so
"n
avarO(,13~) = [n- 1t
E O(Xf n i 1xnj-1
t=l
Consistency of Dn follows given Theorem 4.54 (iii.a) and the assumption that Q~(-Y*) = n- 1 2:~=1 EO(X~nilXn by the continuity of the inverse function and given that Q~(-Y*) = V~(-y*) is uniformly positive definite. _
Exercise 4.60 Proof. Similar to Exercise 4.58 we have that
Solution Set
234
n
xn- 1 / 2
L Xrnr- 1C
t
t=l
+
(n- 1tEo(xrnr-1xn)-1 t=l n
x[n- 1 / 2
n
LXrn:t1 Ct -
n- 1 / 2
t=l
LXrnr-1Ctl t=l
+I(n-' i;x;n:,'xr') -, -(n-' t,E"(Xrn;-'xn) -'I L Xrn:t1Ct n
1 2 X [n- /
n
n- 1 / 2
t=l
L Xrnf-1 Ct l t=l
(n-' i?"(J(fn;-' xn)-, n-'/2 i;x;n;-'e, +op,,(I)Op,,(I)
+ O(l)op,,(l) + opn(l)op,,(l)
using Theorem 4.54 conditions (iii.a) and (ii) on the second, third, and fourth terms. It follows that
which is consistently estimated by
given Theorem 4.54 (iii.a). _
235
Solution Set
Exercise 4.61 Proof. (i) Let Wt' = (1, Wt)'; then
(n- i=Wfwt) -1 n- i=wfe;
On
1
1
t=1
(
n- 1
t=1
i=wrwt)-1 n- 1 i=Wf(c t t=1 t=1 n
(
)
n-1"8WfWt
-1
1
-
Xf'({3n - (30))2
n
n- ~WfcZ
-2 (n- 1 i=Wfwt) -1 n-1 i= w f ct X t({3n _ (30) t=1 t=1
+ (n- 1 i=Wfwt) -1 n- 1 i= Wf(Xf'{{3n _ (30))2 t=1 (
t=1
n- 1 'tWfWt)-1 n- 1 'tWfWfoo + opo(l)
t=1 00 + opo(l),
+ opo(l)
t=1
provided that
n
n-
1
z=WfetXt t=1
n
n- 1 z=Wf(xt ®Xt)
t=1 and ({3n - (30) = opo(l). The third condition enables us to verify n
n
n- 1 Z=Wf(xt({3n - (30)) = n- 1 Z=WfXf'({3n 1=1 t=1
-
(3°)({3n -
(3°)'Xf
236
Solution Set n
n- LW~(xt t=1 1
(>9
Xnvec((,Bn - (3°)(,Bn - (30)')
The first condition holds quite generally since Wt is bounded. Set W,'; n- 1 1 Wt then
2:;
n-1tw~wt= t=1
1
( WO n
n
WO L ...t=1
wto2
-1 ",nn
)
=
.
If this matrix is_uniformly nonsingular, its inverse is Opo(l), and for that it suffices that W,'; is bounded away from zero and one.
(ii) Consider n
n
n- 1 / 2 Lx~n;~lct - n- 1 / 2 Lx~n~-1Ct
t=1
t=1
n
n-1/2LX~(n~l- n~-1)Ct t=l n
n- 1/ 2 LX~((Wt6:n)-1 - (WtaO)-1)Ct t=l n
t=1 n
+n- 1/ 2 LX~((Ct1n + Ct2n)-1 - (Q~ + Q~)-1)ct1[Wt = 1] t=1 n
((Ct1n)-1 - (Qn-
1
)n- 1 / 2
LX~ct1[Wt
= 0]
t=l n
+((Ctln + Ct2n)-1 - (Q~ + Q~)-1)n-l/2 LX~ct1[Wt
=
1]
t=l
To obtain the final equality, we apply the martingale difference central limit theorem to the terms n- 1 / 2 2:;=1 Xt'ct1[wt = j], j = 0, 1, and require that Q)' and Q1 + Q2 are different from zero.
Solution Set
237
Similarly n
n
1xo1 _ n- 1"'xono-lxol n- 1"'xonL.-t t nt t L.-t t t t t=1 t=1 n 1 1 ((aln)-1 - (an- )n- LX~Xt1[Wt = 0] t=1 n
+((aln + a2n)-1 - (al' + a~D-l)n-l LX~Xt1[Wt t=1
= 1]
• Exercise 4.63 Proof. Similar to Exercise 4.60 we have
..;n((3~ - (30) =
(
1/2 ~ - --1 1 ~ - - -1 - ' ) -1 n~x~nnt x~ n~x~nnt et (
n-1 tEocx~n~-lxn) -1 n-1/2 tx~n~-let t=1 t=1
+ [(n-l t
x~n:tlxt)
(n-1 t t=1
-1 _
t=1
Eo(x~n~-lxn) -1]
n
X
n-l/2Lx~n~-let
t=1
+ (n- 1 tEo(x~n~-lxn)-1 t=1 n
1 2 X [n- /
Lx~n~tl et t=1
n
-
n- 1 / 2
Lx~n~-let] t=1
+ [(n-l tx~n:tlX~/)-1 _ (n- 1 tEo(x~n~-IX~/))-I] t=1
t=1 X
n
n
t=1
t=l
[n-l/2Lx~n:tlet -n-l/2Lx~n~-let]
238
=
Solution Set
(
n- 1
L EO(X~n~-IXn n
t=1
) -1
n- 1/ 2
L x~n~-let n
t=1 + opo(l)Opo(l) + O(l)opo(l)
+ opo(l)opo(l)
applying Theorem 4.54 conditions (iii.a) and (ii) to the second, third, and fourth terms. Since
EO (EO (X~n~-1
Xn IF
t-
d
EO(X~n~-IXn
by the law of iterated expectations, we have
To show that the matrix is consistently estimated by ... 1n' " ... -1 I 1 n) ... -1... 1 I Dn=2 n- Lx~nntX~ +n- Lx~nntX~ , ( t=1 t=l
it suffices that f>~1 - avaro(j3~)-l
= opo(l).
We have
f>~1 - avaro(j3~)-1 n
n
t=1
t=l
1 "xon- 1Xo!) + n(l/2)(n- 1 "xOn-lXo! ~ t nt t ~ t nt t
n
_n- 1 L EO(X~n~-lx~!) t=l n
(l/2)(n- Lx~n:tlXr - n- 1 1
t=l
n
L EO(X~n~-IX~!)) t=l
n
n
+(l/2)(n- Lx~n:tlXr - n- LEO(X~n~-lX~!)), 1
1
t=l
t=l
again using the law of iterated expectations. Now n
n
1 n- 1 "xon-1Xo! ~ t nt t - n- "~ EO(XonO-lXO!) t t t
t=1
t=1
L
0
239
Solution Set
by Theorem 4.54 condition (iii) and Theorem 4.62 condition (iii) and the result follows. • Exercise 4.64 Proof. (i)(a) Set Vt
= X~o -
E(X~ol.Ft_d
so that X~o
= zt7l"0 + Vt.
Then
(n-' i;Z;zt) -, n-' tz;xt
;cn
n
71"0
+ (n-
1
n
1
ZfZf,)-ln- L ZfVt
L t=l
71" 0
t=l
+ opo(1)
given n- 1 L~ 1 zrzt = Opo(l) and n- 1 L~=l ZrVt = opo(l); the latter follows by the martingale difference weak law of large numbers. (b) Substituting
we have that n
:En
n-lL€t€~ t=l
n
n
1
+n- Let(.Bo -13n)'Xf + nt=l
1
L X n,6° -13n)e~ t=l
n
+n- 1 L X t(,6° -13n)(,6° -13n)'Xf· t=l
n
n-
1
~zo'Zo ~
t
t
t=l
so that (,60 -13n) = opo(l), it follows that the last three terms vanish in probability. To see this, apply the rule vec(ABG) = (G' ®A)vec(B) to each
240
Solution Set
of terms, for example: n
n
1
vec(n- LetC13° -
!3n )'Xf)
n-
1
L(xt ® et)vec((J3° -
!3n)')
t=l
t=l
o
~
Given n- 1 L~=l ete~ ~ ~o it then follows that ~n = ~o + opo (1). (ii) To verify condition (iii) of Theorem 4.62 we write n
n- 1 /
n
2"
~
fr'n zoi;-l e tnt
n- 1 /
2"
~
t=l
t=l n
fr'n
n- 1 /
7r'° zo~o-l,," t "'t n
2"
n- 1 /
zoi;-l e - 7r'°
~tnt
t=l
Zo~o-le t
2"
~t
t=l
n
fr'n n- 1 / 2 "~ ZO(i;-l _~o-l)e t n t t=l n
1 2 Zo~o-le + fr'n n- / "~t t
n
_
7r'° n- 1 / 2 "~t Zo~o-le t
t=l
t=l
n
fr~n-l/2 L(e~ ® ZnveC(i;:l _~o-l) t=l n
+(fr~ - 7r~)n-l/2
L Z~~o-let
t=l Opo (1 )Opo (1 )Opo (1) + Opo (1 )Opo (1) = Opo (1). Similarly, to verify condition (vi) of Theorem 4.62 we write n
n- 1
n
"fr' zoi;-lyo ~ n t n' 0 or equivalently, if E(~2) > E(IYiYi-ll). Now
E(~2)
=
0"~(1- ,802)[(1
E(YiYi-l)
=
O"~,8ol[(l
+ ,802)(1 - ,802) -
+ ,802)(1 -
,8~lrl
,802) - ,8~ll-l,
(see Granger and Newbold, 1977, Ch. 1) so E(~2) > E(IYiYi-ll) holds if and only if (1- ,802) > 1,8011. This is ensured by (i'.b), and the condition holds.
Solution Set
246
(iv.a) EI~~il
0 it follows that
D- 1/ 2,Jn(i3n - f3 o) - D;;1/2,Jn(i3 n - f3o) I)D;;1/2,Jn(i3n - f3o) ~ 0
= (D-1/2D;/2 -
-
by Lemma 4.6. Therefore, by Lemma 4.7, D- 1/ 2y'n(f3n - f3o)
•
A rv
N(O, I) .
Exercise 5.21 Proof. We verify the conditions of Theorem 4.25. First we apply Theorem 5.20 and Proposition 5.1 to show that y;;1/2 n - 1/ 2X 'C; ~ N(O, I). Consider n
).,'y-1/2 ).,'y-1/2X c; n -1/2X' c; = n- 1 / 2 ~ n ~ n t t· t=l
Solution Set
247
By Theorem 3.49, {A'V;;-1/2 Xte t} is a mixing sequence, with either ¢ of size -r/(2r - 1), r > 2, or a of size -r/(r - 2), r > 2, given (ii). Further, E(A'V;;-1/2 Xtet ) = 0 given (iii.a), and application of Minkowski's inequality gives
E(IA'V;;-1/2Xtetlr+c5) < L:!.
0 and for all t given (iii.b). It follows from Theorem 5.20 that for all A, A'A = 1, we have n
1 2 e = A'V- 1/ 2n- 1/ 2X'e ~ N(O 1) n- 1/ 2 "~ A'Vn / X tt n ,. t=l Hence, by Proposition 5.1, V;;-1/2 n - 1/2X'e ~ N(O, I), so Theorem 4.25 (ii) holds. Next, X'X/n - Mn ~ 0 by Corollary 3.48 and Theorem 2.24 given (iv.a). Given (iv.a) Mn = 0(1) and det(Mn) > 8> 0 for all n sufficiently large given (iv.b). Hence, the conditions of Theorem 4.25 are satisfied and the result follows . • Exercise 5.22 Proof. The following conditions are sufficient for asymptotic normality:
(i') (a) Y_t = β₀₁ Y_{t-1} + β₀₂ W_t + ε_t; (b) |β₀₁| < 1, |β₀₂| < ∞;

(ii') {Y_t} is a mixing sequence with either φ of size −r/(2(r − 1)), r > 2, or α of size −r/(r − 2), r > 2; {W_t} is a bounded nonstochastic sequence;

(iii') (a) E(ε_t | F_{t-1}) = 0, t = 1, 2, ..., where F_t = σ(Y_t, Y_{t-1}, ...); (b) E|ε_t|^{2r} ≤ Δ < ∞, t = 1, 2, ...; (c) V_n ≡ var(n^{-1/2} ∑_{t=1}^n X_t ε_t) is uniformly positive definite, where X_t ≡ (Y_{t-1}, W_t)';

(iv') (a) M_n ≡ E(X'X/n) has det(M_n) > δ > 0 for all n sufficiently large.

We verify the conditions of Exercise 5.21. Since ε_t = Y_t − β₀₁ Y_{t-1} − β₀₂ W_t, it follows from Theorem 3.49 that {(X_t, ε_t)} = {(Y_{t-1}, W_t, ε_t)} is mixing with either φ of size −r/(2(r − 1)), r > 2, or α of size −r/(r − 2), r > 2, given (ii'). Thus, condition (ii) of Exercise 5.21 is satisfied. Next consider condition (iii). Now E(W_t ε_t) = W_t E(ε_t) = 0 given (iii'.a). Also, by repeated substitution we can express Y_t as

    Y_t = β₀₂ ∑_{j=0}^∞ β₀₁^j W_{t-j} + ∑_{j=0}^∞ β₀₁^j ε_{t-j}.
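As a quick numerical check of this repeated-substitution step, the sketch below simulates the recursion and compares its last value with a truncated version of the MA(∞) representation above (all parameter values here are illustrative choices, not from the text):

```python
import numpy as np

# Illustrative check that iterating Y_t = b1*Y_{t-1} + b2*W_t + e_t
# reproduces Y_t = b2*sum_j b1^j W_{t-j} + sum_j b1^j e_{t-j}.
# b1, b2, and the bounded sequence W_t are arbitrary illustrative choices.
rng = np.random.default_rng(0)
b1, b2, T = 0.6, 1.5, 400
W = np.cos(0.1 * np.arange(T))      # a bounded nonstochastic sequence
e = rng.standard_normal(T)

Y = np.zeros(T)                     # recursion started at Y_0 = 0
for s in range(1, T):
    Y[s] = b1 * Y[s - 1] + b2 * W[s] + e[s]

# truncated MA representation at the last date; b1^200 is negligible
t = T - 1
ma = sum(b1**j * (b2 * W[t - j] + e[t - j]) for j in range(200))
print(abs(Y[t] - ma))               # essentially zero
```

With |b1| < 1 the truncation error is of order b1^200, so the two constructions agree to machine precision.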
By Proposition 3.52 it follows that E(Y_{t-1} ε_t) = 0 given (iii'.a) and (iii'.c). Hence, E(X_t ε_t) = (E(Y_{t-1} ε_t), E(W_t ε_t))' = 0, so that (iii.a) is satisfied. Turning to condition (iii.b), we have E|W_t ε_t|^r = |W_t|^r E|ε_t|^r ≤ Δ'^r Δ < ∞ given (iii'.b), where |W_t| ≤ Δ' < ∞. Also,

    E|Y_{t-1} ε_t|^r ≤ (E|Y_{t-1}|^{2r})^{1/2} (E|ε_t|^{2r})^{1/2}

by the Cauchy-Schwarz inequality. Since E|ε_t|^{2r} ≤ Δ < ∞ given (iii'.b), it remains to be shown that E|Y_t|^{2r} < ∞. Applying Minkowski's inequality (see Exercise 3.53),

    (E|Y_t|^{2r})^{1/2r} = (E|β₀₂ ∑_{j=0}^∞ β₀₁^j W_{t-j} + ∑_{j=0}^∞ β₀₁^j ε_{t-j}|^{2r})^{1/2r} ≤ ∑_{j=0}^∞ |β₀₁|^j (|β₀₂| Δ' + Δ^{1/2r}) < ∞,

since |β₀₁| < 1.
(iv) (a) E|Z_{thi}|^{r+δ} ≤ Δ < ∞ and E|X_{thj}|^{r+δ} ≤ Δ < ∞ for some δ > 0, h = 1, ..., p, i = 1, ..., l, j = 1, ..., k, and all t; (b) Q_n ≡ E(Z'X/n) has full column rank uniformly in n for all n sufficiently large; (c) L_n ≡ E(Z'Z/n) has det(L_n) > δ > 0 for all n sufficiently large.

Given conditions (i)-(iv), the asymptotically efficient estimator is given by Exercise 4.47. First, consider Z'Z/n. Now {Z_t Z_t'} is a mixing sequence with the same size as {(Z_t', X_t', ε_t)} by Proposition 3.50. Hence, by Corollary 3.48, Z'Z/n − L_n = n^{-1} ∑_{t=1}^n Z_t Z_t' − n^{-1} ∑_{t=1}^n E(Z_t Z_t') → 0 a.s. given (iv.a), and Z'Z/n − L_n →p 0 by Theorem 2.24. Next consider

    σ̂_n² = (np)^{-1} (Y − Xβ̂_n)'(Y − Xβ̂_n)
         = (ε − X(β̂_n − β_o))'(ε − X(β̂_n − β_o))/(np)
         = ε'ε/(np) − 2(β̂_n − β_o)'X'ε/(np) + (β̂_n − β_o)'(X'X/n)(β̂_n − β_o)/p.

As the conditions of Exercise 3.79 are satisfied, it follows that β̂_n − β_o → 0 a.s. Also, X'ε/n = O_{a.s.}(1) by Corollary 3.48 given (ii), (iii.b), and (iv.a). Hence (β̂_n − β_o)'X'ε/n → 0 a.s. by Exercise 2.22 and Theorem 2.24. Similarly, {X_t X_t'} is a mixing sequence with size given in (ii) with elements satisfying
the moment condition of Corollary 3.48 given (iv.a), so that X'X/n = O_{a.s.}(1), and therefore (β̂_n − β_o)'(X'X/n)(β̂_n − β_o) →p 0. Finally, consider

    ε'ε/(np) = p^{-1} ∑_{h=1}^p n^{-1} ∑_{t=1}^n ε_{th}².
Now for any h = 1, ..., p, {ε_{th}²} is a mixing sequence with φ of size −r/(2(r − 1)), r > 2, or α of size −r/(r − 2), r > 2. As {ε_{th}²} satisfies the moment condition of Corollary 3.48 given (iii.b) and E(ε_{th}²) = σ_o² given (iii.c), it follows that

    n^{-1} ∑_{t=1}^n ε_{th}² − n^{-1} ∑_{t=1}^n E(ε_{th}²) = n^{-1} ∑_{t=1}^n ε_{th}² − σ_o² → 0 a.s.

Hence, ε'ε/(np) → σ_o² a.s., and it follows that σ̂_n² → σ_o² a.s. by Exercise 2.35. ∎
Exercise 6.6 Proof. The proof is analogous to that of Theorem 6.3. Consider for simplicity the case p = 1. We decompose V̂_n − V_n as follows:

    V̂_n − V_n = n^{-1} ∑_{t=1}^n ε_t² Z_t Z_t' − n^{-1} ∑_{t=1}^n E(ε_t² Z_t Z_t')
        − 2 n^{-1} ∑_{t=1}^n (β̂_n − β_o)' X_t ε_t Z_t Z_t'
        + n^{-1} ∑_{t=1}^n (β̂_n − β_o)' X_t X_t' (β̂_n − β_o) Z_t Z_t'.

Now {ε_t² Z_t Z_t'} is a mixing sequence with either φ of size −r/(2r − 1), r > 1, or α of size −r/(r − 1), r > 1, given (ii), with elements satisfying the moment condition of Corollary 3.48 given (iii.b). Hence

    n^{-1} ∑_{t=1}^n ε_t² Z_t Z_t' − n^{-1} ∑_{t=1}^n E(ε_t² Z_t Z_t') →p 0.

The remaining terms converge to zero in probability as in Theorem 6.3, where we now use results on mixing sequences in place of results on stationary, ergodic sequences. For example, by the Cauchy-Schwarz inequality,
the summands satisfy the required moment bound given (iii.b) and (iv.a). As {X_{tκ} Z_{ti} Z_{tj} ε_t} is a mixing sequence with size given in (ii) and it satisfies the moment condition of Corollary 3.48, it follows that

    n^{-1} ∑_{t=1}^n X_{tκ} Z_{ti} Z_{tj} ε_t − n^{-1} ∑_{t=1}^n E(X_{tκ} Z_{ti} Z_{tj} ε_t) →p 0.

Since β̂_n − β_o →p 0 under the conditions given, we have

    n^{-1} ∑_{t=1}^n (β̂_n − β_o)' X_t ε_t Z_t Z_t' →p 0

by Exercise 2.35. Finally, consider the third term. The Cauchy-Schwarz inequality gives E|X_{tκ} X_{tλ} Z_{ti} Z_{tj}|^{r+δ} < ∞, so that

    n^{-1} ∑_{t=1}^n X_{tκ} X_{tλ} Z_{ti} Z_{tj} − n^{-1} ∑_{t=1}^n E(X_{tκ} X_{tλ} Z_{ti} Z_{tj}) →p 0
by Corollary 3.48. Thus the third term vanishes in probability, and application of Exercise 2.35 yields V̂_n − V_n →p 0. ∎

Exercise 6.7 Proof. The proof is immediate from Exercise 5.27 and Exercise 6.6. ∎

Exercise 6.8 Proof. Conditions (i)-(iv) ensure that Exercise 6.6 holds for β̈_n and V̈_n − V_n →p 0. Next set P̂_n = V̈_n^{-1} in Exercise 6.7. Then P_n = V_n^{-1} and the result follows. ∎

Exercise 6.12 Proof. Exactly as in the proof of Theorem 6.9, all the terms

    (n − τ)^{-1} ∑_{t=τ+1}^n [Z_t ε_t ε_{t-τ} Z_{t-τ}' − E(Z_t ε_t ε_{t-τ} Z_{t-τ}')]
    − (n − τ)^{-1} ∑_{t=τ+1}^n Z_t X_t' (β̂_n − β_o) ε_{t-τ} Z_{t-τ}'
    − (n − τ)^{-1} ∑_{t=τ+1}^n Z_t ε_t (β̂_n − β_o)' X_{t-τ} Z_{t-τ}'
    + (n − τ)^{-1} ∑_{t=τ+1}^n Z_t X_t' (β̂_n − β_o)(β̂_n − β_o)' X_{t-τ} Z_{t-τ}'
converge to zero in probability. Note that Theorem 3.49 is invoked to guarantee that the summands are mixing sequences with size given in (ii), and the Cauchy-Schwarz inequality is used to verify the moment condition of Corollary 3.48. For example, given (ii), {Z_t ε_t ε_{t-τ} Z_{t-τ}'} is a mixing sequence with either φ of size −r/(2r − 2), r > 2, or α of size −r/(r − 2), r > 2, with elements satisfying the moment condition of Corollary 3.48 given (iii.b). Hence

    (n − τ)^{-1} ∑_{t=τ+1}^n [Z_t ε_t ε_{t-τ} Z_{t-τ}' − E(Z_t ε_t ε_{t-τ} Z_{t-τ}')] →p 0.

The remaining terms can be shown to converge to zero in probability in a manner similar to Theorem 6.3. ∎

Exercise 6.13 Proof. Immediate from Theorem 5.22 and Exercise 6.12. ∎

Exercise 6.14 Proof. Conditions (i)-(iv) ensure that Exercise 6.12 holds for β̈_n and V̈_n − V_n →p 0. Next set P̂_n = V̈_n^{-1} in Exercise 6.13. Then P_n = V_n^{-1} and the result follows. ∎

Exercise 6.15 Proof. Under the conditions of Theorem 6.9 or Exercise 6.12 it holds that V̂_n − V_n →p 0, so the result will follow if also V̄_n − V̂_n →p 0. Now

    V̄_n − V̂_n = ∑_{τ=1}^m (w_{nτ} − 1) n^{-1} ∑_{t=τ+1}^n (Ẑ_t ε̂_t ε̂_{t-τ} Ẑ_{t-τ}' + Ẑ_{t-τ} ε̂_{t-τ} ε̂_t Ẑ_t').

Since for each τ = 1, ..., m we have

    n^{-1} ∑_{t=τ+1}^n (Ẑ_t ε̂_t ε̂_{t-τ} Ẑ_{t-τ}' + Ẑ_{t-τ} ε̂_{t-τ} ε̂_t Ẑ_t') = O_p(1),

it follows that

    V̄_n − V̂_n = ∑_{τ=1}^m o_p(1) O_p(1) = o_p(1),

where we used that w_{nτ} →p 1. ∎
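Exercise 6.15 turns on the fact that the weights w_{nτ} tend to 1, so the weighted and unweighted estimators merge. A minimal scalar sketch of a Bartlett-weighted ("Newey-West") long-run variance estimator is below; the weight formula w_{nτ} = 1 − τ/(m + 1) and the MA(1) example are standard illustrations and are assumptions of this sketch, not taken from the text:

```python
import numpy as np

# Bartlett/Newey-West long-run variance estimator for a scalar series u_t:
# vhat = gamma_0 + 2 * sum_{tau=1}^{m} (1 - tau/(m+1)) * gamma_tau,
# where gamma_tau is the sample autocovariance at lag tau.
def newey_west(u, m):
    n = len(u)
    v = u @ u / n                        # lag-0 term
    for tau in range(1, m + 1):
        w = 1.0 - tau / (m + 1)          # Bartlett weight, -> 1 as m grows
        v += 2.0 * w * (u[tau:] @ u[:-tau]) / n
    return v

rng = np.random.default_rng(5)
e = rng.standard_normal(50_000)
u = e[1:] + 0.5 * e[:-1]                 # MA(1); long-run variance = 2.25
v = newey_west(u, m=20)
print(v)                                 # close to (1 + 0.5)^2 = 2.25
```

With a fixed lag m the Bartlett weights slightly downweight the lag-1 autocovariance, which is why the estimator guarantees positive semidefiniteness at the price of a small bias.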
Exercise 7.2 Proof. Since X_t = X_0 + ∑_{s=1}^t Z_s, we have

    E(X_t) = E(X_0) + ∑_{s=1}^t E(Z_s) = 0 + ∑_{s=1}^t E(Z_s) = 0,

and

    var(X_t) = var(X_0 + ∑_{s=1}^t Z_s) = 0 + ∑_{s=1}^t var(Z_s) = tσ²,

using the independence between Z_r and Z_s for r ≠ s. ∎
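These two moments are easy to confirm by simulation; the sketch below (all values are illustrative choices, not from the text) draws many replications of a Gaussian random walk with X_0 = 0:

```python
import numpy as np

# Monte Carlo check that a driftless random walk X_t = sum_{s<=t} Z_s with
# i.i.d. Z_s ~ N(0, sigma^2) has E(X_t) = 0 and var(X_t) = t*sigma^2.
# sigma^2, t, and the replication count are illustrative choices.
rng = np.random.default_rng(1)
sigma2, t, reps = 2.0, 50, 100_000
Z = rng.normal(0.0, np.sqrt(sigma2), size=(reps, t))
X_t = Z.sum(axis=1)                     # one draw of X_t per replication
print(X_t.mean(), X_t.var())            # close to 0 and t*sigma2 = 100
```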
Exercise 7.3 Proof. Note that

    X_{t4} − X_{t3} = Z_{t4} + Z_{t4-1} + ... + Z_{t3+1},
    X_{t2} − X_{t1} = Z_{t2} + Z_{t2-1} + ... + Z_{t1+1}.

Since (Z_{t1+1}, ..., Z_{t2}) is independent of (Z_{t3+1}, ..., Z_{t4}), it follows from Proposition 3.2 (ii) that X_{t4} − X_{t3} and X_{t2} − X_{t1} are independent. ∎
Exercise 7.4 Proof. By definition,

    W_n(b) − W_n(a) = n^{-1/2} ∑_{t=[na]+1}^{[nb]} Z_t = n^{-1/2} ([nb] − [na])^{1/2} × ([nb] − [na])^{-1/2} ∑_{t=[na]+1}^{[nb]} Z_t.

The last term ([nb] − [na])^{-1/2} ∑_{t=[na]+1}^{[nb]} Z_t ∼A N(0, 1) by the central limit theorem, and

    n^{-1/2} ([nb] − [na])^{1/2} = (([nb] − [na])/n)^{1/2} → (b − a)^{1/2} as n → ∞.

Hence W_n(b) − W_n(a) →d N(0, b − a). ∎
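The N(0, b − a) increment can be seen directly in simulation; the following sketch (illustrative values for n, a, b, not from the text) builds W_n from standard normal Z_t:

```python
import numpy as np

# W_n(a) = n^{-1/2} * sum_{t <= [na]} Z_t for i.i.d. standard normal Z_t;
# the increment W_n(b) - W_n(a) should be approximately N(0, b - a).
# n, a, b, and the replication count are illustrative choices.
rng = np.random.default_rng(2)
n, a, b, reps = 400, 0.25, 0.75, 20_000
Z = rng.standard_normal((reps, n))
S = np.cumsum(Z, axis=1) / np.sqrt(n)   # S[:, t-1] = W_n(t/n)
incr = S[:, int(n * b) - 1] - S[:, int(n * a) - 1]
print(incr.mean(), incr.var())          # close to 0 and b - a = 0.5
```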
Exercise 7.22 Proof. (i) Let U ∈ C, and let U_n be a sequence of mappings in D such that U_n → U, i.e., δ_n ≡ d_U(U_n, U) = sup_{a∈[0,1]} |U_n(a) − U(a)| → 0 as n → ∞. We have

    d_U(U_n², U²) = sup_{a∈[0,1]} |U_n(a)² − U(a)²|
        = sup_{a∈[0,1]} |U_n(a) − U(a)| |U_n(a) + U(a)|
        ≤ δ_n sup_{a∈[0,1]} |U_n(a) + U(a)|.

The boundedness of U and the fact that δ_n → 0 imply that for all n sufficiently large, U_n is bounded on [0, 1]. Hence sup_{a∈[0,1]} |U_n(a) + U(a)| = O(1), and it follows that d_U(U_n², U²) → 0, which proves that M₁ is continuous at U. Now consider the function on [0, 1] given by

    V(a) = int(1/(1 − a)) for 0 ≤ a < 1, and V(1) = 0.

For 0 ≤ a < 1/2 this function is 1; it jumps to 2 at a = 1/2, to 3 at a = 2/3, and so forth. Clearly V(a) is continuous from the right and has left limits everywhere, so V ∈ D = D[0, 1]. Next, define the sequence of functions V_n(a) = V(a) + c/n for some c > 0. Then V_n ∈ D and V_n → V. Since

    sup_{a∈[0,1]} |V_n(a)² − V(a)²| = sup_{a∈[0,1]} |(V_n(a) − V(a))(V_n(a) + V(a))| = (c/n) sup_{a∈[0,1]} |V_n(a) + V(a)|

is infinite for all n, we conclude that V_n² does not converge to V². Hence, M₁ is not continuous everywhere in D.

(ii) Let U ∈ C; then U is bounded and ∫₀¹ |U(a)| da < ∞. Let {U_n} be a sequence of functions in D such that U_n → U. Define δ_n ≡ d_U(U_n, U); then δ_n → 0 and ∫₀¹ |U_n(a)| da ≤ ∫₀¹ |U(a)| da + δ_n. Since

    |∫₀¹ U_n(a)² da − ∫₀¹ U(a)² da| ≤ ∫₀¹ |U_n(a)² − U(a)²| da ≤ δ_n ∫₀¹ |U_n(a) + U(a)| da → 0,

M₂ is continuous at U ∈ C.

(iii) Let V(a) = a for all a ∈ [0, 1], and let V_n(a) = 0 for 0 ≤ a < 1/n and V_n(a) = V(a) for a ≥ 1/n. Thus d_U(V, V_n) = 1/n and V_n → V; however, ∫₀¹ log(|V_n(a)|) da does not converge to ∫₀¹ log(|V(a)|) da, since ∫₀¹ log(|V(a)|) da = −1 whereas ∫₀¹ log(|V_n(a)|) da = −∞ for all n. Hence, it follows that M₃ is not continuous. ∎
Exercise 7.23 Proof. If {ε_t} satisfies the conditions of the heterogeneous mixing CLT and is globally covariance stationary, then (ii) of Theorem 7.21 follows directly from Theorem 7.18. Since {ε_t} is assumed to be mixing, {ε_t²} is also mixing of the same size (see Proposition 3.50). By Corollary 3.48 it then follows that n^{-1} ∑_{t=1}^n ε_t² − n^{-1} ∑_{t=1}^n E(ε_t²) = o_p(1). So if n^{-1} ∑_{t=1}^n E(ε_t²) converges as n → ∞, then (iii) of Theorem 7.21 holds. For σ² = τ² it is required that lim n^{-1} ∑_{t=2}^n ∑_{s=1}^{t-1} E(ε_{t-s} ε_t) = 0. It suffices that {ε_t} is independent or that {ε_t, F_t} is a martingale difference sequence. ∎

Exercise 7.28 Proof. From Donsker's theorem and the continuous mapping theorem we have that n^{-2} ∑_{t=1}^n x_t² ⇒ σ₁² ∫₀¹ W₁(a)² da. The multivariate version of Donsker's theorem states that (W_{1n}, W_{2n}) ⇒ (W₁, W₂). Applying the continuous mapping theorem to the mapping

    (x, y) ↦ ∫₀¹ x(a) y(a) da,

we have that

    n^{-1} ∑_{t=1}^n σ₁ W_{1n}(a_{t-1}) σ₂ W_{2n}(a_{t-1}) = σ₁σ₂ ∫₀¹ W_{1n}(a) W_{2n}(a) da ⇒ σ₁σ₂ ∫₀¹ W₁(a) W₂(a) da.

Hence

    β̂_n = (n^{-2} ∑_{t=1}^n x_t²)^{-1} n^{-2} ∑_{t=1}^n x_t y_t
        ⇒ (σ₁² ∫₀¹ W₁(a)² da)^{-1} σ₁σ₂ ∫₀¹ W₁(a) W₂(a) da
        = (σ₂/σ₁) (∫₀¹ W₁(a)² da)^{-1} ∫₀¹ W₁(a) W₂(a) da. ∎
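The limit here is a nondegenerate random variable, which is the spurious-regression phenomenon: the slope from regressing one random walk on an independent one does not settle down as n grows. A small simulation sketch (sample sizes and replication counts are illustrative choices, not from the text):

```python
import numpy as np

# Regress one driftless random walk on an independent one (no intercept).
# beta_hat converges in distribution to (sigma2/sigma1) *
# (int W1^2)^{-1} int W1*W2, a nondegenerate limit, so its dispersion
# across replications does not shrink as n grows.
rng = np.random.default_rng(3)

def beta_hat(n, reps):
    x = np.cumsum(rng.standard_normal((reps, n)), axis=1)
    y = np.cumsum(rng.standard_normal((reps, n)), axis=1)
    return (x * y).sum(axis=1) / (x * x).sum(axis=1)

sd_small = beta_hat(200, 2000).std()
sd_big = beta_hat(2000, 2000).std()
print(sd_small, sd_big)    # comparable magnitudes, bounded away from 0
```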
Exercise 7.31 Proof. Since {η_t, ε_t} satisfies the conditions of Theorem 7.30, it holds that

    (n^{-1/2} X_{[na]}, n^{-1/2} Y_{[na]}) = (σ₁ W_{1n}(a), σ₂ W_{2n}(a)) ⇒ (σ₁ W₁(a), σ₂ W₂(a)),

where σ₁² and σ₂² are the diagonal elements of Σ (the off-diagonal elements are zero, due to the independence of η_t and ε_t). Applying the continuous mapping theorem, we find β̂_n to have the same limiting distribution as we found in Exercise 7.28. ∎

Exercise 7.43 Proof. (i) If {η_t} and {ε_t} are independent, then E(η_s ε_t') = 0 for all s and t. So

    Λ_n = Σ₁^{-1/2} n^{-1} ∑_{t=2}^n ∑_{s=1}^{t-1} E(η_s ε_t') Σ₂^{-1/2}' = 0,

and clearly Λ_n → Λ = 0.
0, and dearly An -) A = O.
258
Solution Set
(ii) If TIt
= Ct-1,
then n
:E~1/2n-1
An
t-1
L L E(TlsC~):E21/21 t=2 s=1 n t-1
~-1/2 n -1 '"' )~-1/21 . ,u1 L '"' L E( cs-1 CIt,u1
t=28=1
If the sequence {cd is covariance stationary such that Ph well defined, then n
n-
1
t-1
L L E(cs-1c~) t=2 s=1
n
n- 1
=
E(CtC~_h) is
t
L L E(ct_sc~) t=2 s=2
t=2 s=2
~(n+1-s) I L n Ps, s=2
so that 1/ 2 An = :E1
(Ln (n + n1 - s) Ps ) :E1
1 / 21
A
-t,
s=2
where in this situation we have ~ ,u1
~ = 1·1m = ,u2 n---+
CXJ
L L Pt-s = Po + 1·1m L n
n -1
n
t=1s=1
n-1 (
n ---+ <Xl
s=1
)
n - s ( P s + P') s . n
Notice that when {cd is a sequence of independent variables such that Ps = 0 for s > 1, then var(Ct) = Po = :E1 and A = o. This remains true for the case where TIt = Ct, which is assumed in the likelihood analysis of cointegrated processes by Johansen (1988, 1991) . • Exercise 7.44 Proof.
(a) This follows from Theorem 7.21 (a), or we have directly ...

(b) Similarly, ... by Theorem 7.42 and using that Λ = 0 since {ε_t} and {η_t} are independent (see Exercise 7.43).

(c)

    n(β̂_n − β₀) = (n^{-2} ∑_{t=1}^n x_t²)^{-1} (n^{-1} ∑_{t=1}^n x_t ε_t) ⇒ (σ₂/σ₁) [∫₀¹ W₁(a)² da]^{-1} ∫₀¹ W₁(a) dW₂(a).

(d) From (c) we have that n(β̂_n − β₀) = O_p(1), and by Exercise 2.35 we have that β̂_n − β₀ = n^{-1} · n(β̂_n − β₀) = o_p(1) O_p(1) = o_p(1). So β̂_n →p β₀. ∎
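Part (d) reflects the superconsistency of β̂_n: the error is O_p(n^{-1}) rather than the usual O_p(n^{-1/2}). A simulation sketch of this rate (all values are illustrative choices; x_t is a driftless random walk independent of the regression error):

```python
import numpy as np

# y_t = beta0*x_t + eps_t with x_t a random walk independent of eps_t.
# beta_hat - beta0 = (sum x_t^2)^{-1} * sum x_t*eps_t shrinks at rate 1/n,
# so multiplying the sample size by 10 shrinks the spread by about 10.
rng = np.random.default_rng(4)
beta0, reps = 2.0, 2000

def err(n):
    x = np.cumsum(rng.standard_normal((reps, n)), axis=1)
    eps = rng.standard_normal((reps, n))
    y = beta0 * x + eps
    return (x * y).sum(axis=1) / (x * x).sum(axis=1) - beta0

e1, e2 = err(100), err(1000)
print(e1.std(), e2.std())   # the second is roughly ten times smaller
```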
References

Dhrymes, P. (1980). Econometrics. Springer-Verlag, New York.

Granger, C. W. J. and P. Newbold (1977). Forecasting Economic Time Series. Academic Press, New York.

Johansen, S. (1988). "Statistical analysis of cointegration vectors." Journal of Economic Dynamics and Control, 12, 231-254.

Johansen, S. (1991). "Estimation and hypothesis testing of cointegration vectors in Gaussian vector autoregressive models." Econometrica, 59, 1551-1580.

Laha, R. G. and V. K. Rohatgi (1979). Probability Theory. Wiley, New York.

Rao, C. R. (1973). Linear Statistical Inference and Its Applications. Wiley, New York.
Index

Adapted mixingale, 124
Adapted stochastic sequence, 58
Approximatable stochastic function, 194
AR(1), 48, 49
ARCH(q), 102
ARMA, 49
Asymptotic covariance matrix, 70, 73
  estimator, 137
    heteroskedasticity consistent, 139-146
    heteroskedasticity/autocorrelation consistent, 154-164
    heteroskedasticity/moving average consistent, 147-154
    Newey-West estimator, 163
Asymptotic distribution, 66
Asymptotic efficiency, 83
  Bates and White framework, 93
  IV estimator, 84
  two-stage least squares, 86
Asymptotic equivalence, 67
Asymptotic normality, 71
  of β̂_n, 71
  of β̃_n, 73
Asymptotically uncorrelated process, 52, 139, 154
avar, 70, 83
Backshift operator, 42
Best linear unbiased estimator, 3
Borel
  sets, 41
  σ-field, 40
Boundary of a set, 173
Bounded in probability, 28
Brownian motion, 169
  BM(Σ), 203
  definition, 170
C[0,1], 175
C[0,∞), 172
C^k[0,1], 187
C^k[0,∞), 186
Cadlag function, 172
Cauchy-Schwarz inequality, 33
Central limit theorem, 113
  functional, 175
  Liapounov, 118
  Lindeberg-Feller, 117
  Lindeberg-Levy, 114
  martingale difference, 133
  mixing, 130
  stationary ergodic adapted mixingale, 125
  Wooldridge-White, 130
Characteristic function, 67-71
  continuity theorem, 69
  uniqueness theorem, 68
Chebyshev inequality, 30
  generalized, 29
Classical assumptions of least squares, 2
  generalized, 5
Classical linear model, 1-8
Cointegrated processes, 190
  likelihood analysis, 258
Cointegrated regression, 199, 201
Conditional expectation, 54-59
  linearity of, 55
Consistency
  strong, 19
  super, 180
  weak, 24
Continuity
  uniform, 21
Continuity set, 173
Continuous function, 16
Continuous mapping theorem, 178
Convergence
  almost sure, 18
  in distribution, 65
  in probability, 24
  in rth mean, 28
  of stochastic integral, 199, 201
  weak, 171
Covariance stationary process, 52
  globally, 177
Cramer-Wold device, 114
  functional, 188
D[0,1], 175
D[0,∞), 172
D^k[0,1], 187
D^k[0,∞), 186
Data generating process, 94, 207
Dickey-Fuller distribution, 180
Dispersion matrix, 73
Donsker's theorem, 176
  multivariate, 188
Ergodic process, 44
  theorem, 44
Estimator
  asymptotic covariance matrix, 137
    heteroskedasticity consistent, 139-146
    heteroskedasticity/autocorrelation consistent, 154-164
    heteroskedasticity/moving average consistent, 147-154
  generalized least squares, 5, 100
    efficiency, 101
    feasible, 102
    feasible and efficient, 105
  generalized method of moments, 85
  instrumental variables, 9
    constrained, 86
    efficient, 109
  M-, 210
  maximum likelihood, 3, 210
  method of moments, 85
  ordinary least squares, 2
  three-stage least squares, 111
  two-stage instrumental variables, 144, 163
  two-stage least squares, 10, 144
    asymptotically efficient, 86
    efficient, 111
Events, 39
Filtration, 58, 193
Finitely correlated process, 138, 147
Fractionally integrated process, 208
Functional central limit theorem, 175
  Donsker, 176
  heterogeneous mixing, 177
    multivariate, 189
  Liapounov, 177
  Lindeberg-Feller, 177
  martingale difference, 178
    multivariate, 189
  stationary ergodic, 177
GARCH(1,1), 102
Gaussian AR(1), 48
Generalized least squares, 5
  estimator, 5
    efficiency, 101
  feasible estimator, 102
    efficient, 105
Generalized method of moments, 85
Heteroskedasticity, 3, 38, 143
  ARCH, 102
  conditional, 130
  GARCH, 102
  unconditional, 130
Holder inequality, 33
Hypothesis testing, 74-83
Implication rule, 25
Inequalities
  Cauchy-Schwarz, 33
  Chebyshev, 30
  C_r, 35
  Holder, 33
  Jensen, 29
    conditional, 56
  Markov, 30
  Minkowski, 36
Instrumental variables, 8
  estimator, 9
    asymptotic normality, 74, 143, 145, 150, 152
    consistency, 21, 23, 26, 27, 34, 38, 46, 51, 61
    constrained, 86
    definition, 9
    efficient, 109
Integrated process, 179
  fractionally, 208
Invariance principle, 176
Ito stochastic integral, 194
  for random step functions, 193
Jensen inequality, 29
  conditional, 56
Lagged dependent variable, 7
Lagrange multiplier test, 77
Law of iterated expectations, 57
Law of large numbers
  martingale difference, 60
Laws of large numbers, 31
  ergodic theorem, 44
  i.i.d., 32
  i.n.i.d., 35
  mixing, 49
Levy device, 58
Likelihood analysis, 210
  cointegrated processes, 258
Likelihood ratio test, 80
Limit, 15
  almost sure, 19
  in mean square, 29
  in probability, 24
Limited dependent variable, 209
Limiting distribution, 66
Lindeberg condition, 117
M-estimator, 210
Markov condition, 35
Markov inequality, 30
Martingale difference sequence, 53, 58
Maximum likelihood estimator, 210
Mean value theorem, 80
Measurable
  function, 41
  one-to-one transformation, 42
  space, 39
Measure preserving transformation, 42
Measurement errors, 6
Method of moments estimator, 9
Metric, 173
  space, 173
Metrized measurable space, 174
Metrized probability space, 174
Minkowski inequality, 36
Misspecification, 207
Mixing, 47
  coefficients, 47
  conditions, 46
  size, 49
  strong (α-mixing), 47
  uniform (φ-mixing), 47
Mixingale, 125
Nonlinear model, 209
O(n^λ), 16
o(n^λ), 16
O_a.s.(n^λ), 23
o_a.s.(n^λ), 23
O_p(n^λ), 28
o_p(n^λ), 28
Ordinary least squares, 2
  estimator, 2
    asymptotic normality, 72
    consistency, 20, 22, 26, 27, 33, 38, 45, 50, 61
    definition, 2
Panel data, 11
Partial sum, 169
Probability
  measure, 39
  space, 39
Product rule, 28
Quasi-maximum likelihood estimator, 79, 210
Random function, 169
  step function, 193
Random walk, 167
  multivariate, 187
Rcll function, 172
Serially correlated errors, 7
Shift operator, 42
σ-algebra, 39
σ-field, 39
  Borel, 40
σ-fields
  increasing sequence, 58
Simultaneous systems, 11
  for panel data, 12
Spurious regression, 185, 188
Stationary process, 43
  covariance stationary, 52
Stochastic function
  approximatable, 194
  square integrable, 194
Stochastic integral, 192
Strong mixing, 47
Sum of squared residuals, 2
Superconsistency, 180
Test statistics
  Lagrange multiplier, 77
  likelihood ratio, 80
  Wald, 76, 81
Three-stage least squares estimator, 111
Two-stage instrumental variables estimator, 144, 163
  asymptotic normality, 146, 151, 153, 164
Two-stage least squares, 10
  estimator, 144
    efficient, 111
Uncorrelated process, 138, 139
  asymptotically, 139, 154
Uniform asymptotic negligibility, 118
Uniform continuity, 21
  theorem, 21
Uniform mixing, 47
Uniform nonsingularity, 22
Uniform positive definiteness, 22
Uniformly full column rank, 22
Uniqueness theorem, 68
Unit root, 179
  regression, 178
Vec operator, 142
Wald test, 76, 81
Weak convergence, 171
  definition, 174
Wiener measure, 176
Wiener process, 170
  definition, 170
  multivariate, 186
  sample path, 171