Chapter 36
LARGE SAMPLE ESTIMATION TESTING* WHITNEY
AND HYPOTHESIS
K. NEWEY
Massachusetts Institute of Technology DANIEL
MCFADDEN
University of California, Berkeley
Contents
2113 2113 2120
Abstract 1. Introduction 2. Consistency
3.
2.1.
The basic consistency
2.2.
Identification
2121
theorem
2124
2.2.1.
The maximum
2.2.2.
Nonlinear
likelihood
2.2.3.
Generalized Classical
method
minimum
2.3.
Uniform
convergence
2.4.
Consistency
of maximum
2.5.
Consistency
of GMM
2.6.
Consistency
without
2.1.
Stochastic
2.8.
Least absolute
Maximum
2128 2129 2131
likelihood
2132 2133
compactness and uniform
deviations
Censored
2126
of moments distance
and continuity
equicontinuity
2.8.2.
2124 2125
least squares
2.2.4.
2.8.1.
estimator
2136
convergence
2138
examples
2138
score least absolute
2140
deviations
2141
Asymptotic normality
2143
3.1.
The basic results
3.2.
Asymptotic
normality
for MLE
2146
3.3.
Asymptotic
normality
for GMM
2148
*We are grateful to the NSF for financial support P. Ruud, and T. Stoker for helpful comments.
and to Y. Ait-Sahalia,
J. Porter, J. Powell, J. Robins,
Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden 0 1994 Elsevier Science B.V. All rights reserved
Ch. 36: Large Sample Estimation and Hypothesis Testing
2113
Abstract Asymptotic distribution theory is the primary method used to examine the properties of econometric estimators and tests. We present conditions for obtaining consistency and asymptotic normality of a very general class of estimators (extremum estimators). Consistent asymptotic variance estimators are given to enable approximation of the asymptotic distribution. Asymptotic efficiency is another desirable property then considered. Throughout the chapter, the general results are also specialized to common econometric estimators (e.g. MLE and GMM), and in specific examples we work through the conditions for the various results in detail. The results are also extended to two-step estimators (with finite-dimensional parameter estimation in the first step), estimators derived from nonsmooth objective functions, and semiparametric two-step estimators (with nonparametric estimation of an infinite-dimensional parameter in the first step). Finally, the trinity of test statistics is considered within the quite general setting of GMM estimation, and numerous examples are given. 1.
Introduction
Large sample distribution theory is the cornerstone of statistical inference for econometric models. The limiting distribution of a statistic gives approximate distributional results that are often straightforward to derive, even in complicated econometric models. These distributions are useful for approximate inference, including constructing approximate confidence intervals and test statistics. Also, the location and dispersion of the limiting distribution provides criteria for choosing between different estimators. Of course, asymptotic results are sensitive to the accuracy of the large sample approximation, but the approximation has been found to be quite good in many cases and asymptotic distribution results are an important starting point for further improvements, such as the bootstrap. Also, exact distribution theory is often difficult to derive in econometric models, and may not apply to models with unspecified distributions, which are important in econometrics. Because asymptotic theory is so useful for econometric models, it is important to have general results with conditions that can be interpreted and applied to particular estimators as easily as possible. The purpose of this chapter is the presentation of such results. Consistency and asymptotic normality are the two fundamental large sample properties of estimators considered in this chapter. A consistent estimator 6 is one that converges in probability to the true value Q,,, i.e. 6% 8,, as the sample size n goes to infinity, for all possible true values.’ This is a mild property, only requiring ‘This property is sometimes referred to as weak consistency, with strong consistency holding when(j converges almost surely to the true value. Throughout the chapter we focus on weak consistency, although we also show how strong consistency can be proven.
W.K. Newey and D. McFadden
2114
that the estimator is close to the truth when the number of observations is nearly infinite. Thus, an estimator that is not even consistent is usually considered inadequate. Also, consistency is useful because it means that the asymptotic distribution of an estimator is determined by its limiting behavior near the true parameter. An asymptotically normal estimator 6is one where there is an increasing function v(n) such that the distribution function of v(n)(8- 0,) converges to the Gaussian distribution function with mean zero and variance V, i.e. v(n)(8 - 6,) A N(0, V). The variance I/ of the limiting distribution is referred to as the asymptotic variance of @. The estimator &-consistent
is ,,/&-consistent
if v(n) = 6.
case, so that unless otherwise
noted,
This chapter asymptotic
focuses normality
on the will be
taken to include ,,&-consistency. Asymptotic normality and a consistent estimator of the asymptotic variance can be used to construct approximate confidence intervals. In particular, for an esti1 - CY mator c of V and for pori2satisfying Prob[N(O, 1) > gn,J = 42, an asymptotic confidence interval is
Cal-@=
ce-g,,2(m”2,e+f,,2(3/n)“2].
If P is a consistent estimator of I/ and I/ > 0, then asymptotic normality of 6 will imply that Prob(B,EY1 -,)1 - a as n+ co. 2 Here asymptotic theory is important for econometric practice, where consistent standard errors can be used for approximate confidence interval construction. Thus, it is useful to know that estimators are asymptotically normal and to know how to form consistent standard errors in applications. In addition, the magnitude of asymptotic variances for different estimators helps choose between estimators in practice. If one estimator has a smaller asymptotic variance, then an asymptotic confidence interval, as above, will be shorter for that estimator in large samples, suggesting preference for its use in applications. A prime example is generalized least squares with estimated disturbance variance matrix, which has smaller asymptotic variance than ordinary least squares, and is often used in practice. Many estimators share a common structure that is useful in showing consistency and asymptotic normality, and in deriving the asymptotic variance. The benefit of using this structure is that it distills the asymptotic theory to a few essential ingredients. The cost is that applying general results to particular estimators often requires thought and calculation. In our opinion, the benefits outweigh the costs, and so in these notes we focus on general structures, illustrating their application with examples. One general structure, or framework, is the class of estimators that maximize some objective function that depends on data and sample size, referred to as extremum estimators. An estimator 8 is an extremum estimator if there is an ‘The proof of this result is an exercise in convergence states that Y. 5 Y, and Z, %C implies Z, Y, &Y,.
in distribution
and the Slutzky theorem,
which
Ch. 36: Large Sample Estimation and Hypothesis
objective
function
o^maximizes
Testing
2115
o,(0) such that o,(Q) subject to HE 0,
(1.1)’
where 0 is the set of possible parameter values. In the notation, dependence of H^ on n and of i? and o,,(G) on the data is suppressed for convenience. This estimator is the maximizer of some objective function that depends on the data, hence the term “extremum estimator”.3 R.A. Fisher (1921, 1925), Wald (1949) Huber (1967) Jennrich (1969), and Malinvaud (1970) developed consistency and asymptotic normality results for various special cases of extremum estimators, and Amemiya (1973, 1985) formulated the general class of estimators and gave some useful results. A prime example of an extremum estimator is the maximum likelihood (MLE). Let the data (z,, , z,) be i.i.d. with p.d.f. f(zl0,) equal to some member of a family of p.d.f.‘s f(zI0). Throughout, we will take the p.d.f. f(zl0) to mean a probability function where z is discrete, and to possibly be conditioned on part of the observation z.~ The MLE satisfies eq. (1.1) with Q,(0) = nP ’ i
(1.2)
lnf(ziI 0).
i=l
Here o,(0) is the normalized log-likelihood. Of course, the monotonic transformation of taking the log of the likelihood and normalizing by n will not typically affect the estimator, but it is a convenient normalization in the theory. Asymptotic theory for the MLE was outlined by R.A. Fisher (192 1, 1925), and Wald’s (1949) consistency theorem is the prototype result for extremum estimators. Also, Huber (1967) gave weak conditions for consistency and asymptotic normality of the MLE and other extremum estimators that maximize a sample average.5 A second example is the nonlinear least squares (NLS), where for data zi = (yi, xi) with E[Y Ix] = h(x, d,), the estimator solves eq. (1.1) with
k(Q)= - n- l i
[yi- h(Xi,
!!I)]*.
(1.3)
i=l
Here maximizing o,(H) is the same as minimizing the sum of squared residuals. The asymptotic normality theorem of Jennrich (1969) is the prototype for many modern results on asymptotic normality of extremum estimators. 3“Extremum” rather than “maximum” appears here because minimizers are also special cases, with objective function equal to the negative of the minimand. 4More precisely, flzIH) is the density (Radon-Nikodym derivative) of the probability measure for z with respect to some measure that may assign measure 1 to some singleton’s, allowing for discrete variables, and for z = (y, x) may be the product of some measure for ~1with the marginal distribution of X, allowing f(z)O) to be a conditional density given X. 5Estimators that maximize a sample average, i.e. where o,(H) = n- ‘I:= 1q(z,,O),are often referred to as m-estimators, where the “m” means “maximum-likelihood-like”.
W.K. Nrwuy
2116
and D. McFuddrn
A third example is the generalized method of moments (GMM). Suppose that there is a “moment function” vector g(z, H) such that the population moments satisfy E[g(z, 0,)] = 0. A GMM estimator is one that minimizes a squared Euclidean distance of sample moments from their population counterpart of zero. Let ii/ be a positive semi-definite matrix, so that (m’@m) ‘P is a measure of the distance of m from zero. A GMM estimator is one that solves eq. (1.1) with
&I) = -
[n-l izln
Ytzi,
O)
1
‘*[ n-l it1 e)]. Ytzi3
(1.4)
This class includes linear instrumental variables estimators, where g(z, 0) =x’ ( y - Y’O),x is a vector of instrumental variables, y is a left-hand-side dependent variable, and Y are right-hand-side variables. In this case the population moment condition E[g(z, (!I,)] = 0 is the same as the product of instrumental variables x and the disturbance y - Y’8, having mean zero. By varying I% one can construct a variety of instrumental variables estimators, including two-stage least squares for k%= (n-‘~;=Ixix;)-‘.” The GMM class also includes nonlinear instrumental variables estimators, where g(z, 0) = x.p(z, Q)for a residual p(z, Q),satisfying E[x*p(z, (!I,)] = 0. Nonlinear instrumental variable estimators were developed and analyzed by Sargan (1959) and Amemiya (1974). Also, the GMM class was formulated and general results on asymptotic properties given in Burguete et al. (1982) and Hansen (1982). The GMM class is general enough to also include MLE and NLS when those estimators are viewed as solutions to their first-order conditions. In this case the derivatives of Inf(zI 0) or - [y - h(x, H)12 become the moment functions, and there are exactly as many moment functions as parameters. Thinking of GMM as including MLE, NLS, and many other estimators is quite useful for analyzing their asymptotic distribution, but not for showing consistency, as further discussed below. A fourth example is classical minimum distance estimation (CMD). Suppose that there is a vector of estimators fi A x0 and a vector of functions h(8) with 7c,,= II( The idea is that 71consists of “reduced form” parameters, 0 consists of “structural” parameters, and h(0) gives the mapping from structure to reduced form. An estimator of 0 can be constructed by solving eq. (1.1) with
&@I)= -
[72-
h(U)]‘ci+t-
h(U)],
(1.5)
where k? is a positive semi-definite matrix. This class of estimators includes classical minimum chi-square methods for discrete data, as well as estimators for simultaneous equations models in Rothenberg (1973) and panel data in Chamberlain (1982). Its asymptotic properties were developed by Chiang (1956) and Ferguson (1958). A different framework that is sometimes useful is minimum distance estimation. “The l/n normalization in @does not affect the estimator, but, by the law oflarge numbers, that W converges in probability to a constant matrix, a condition imposed below.
will imply
Ch. 36: Large Sample Estimation and Hypothesis
Testing
2117
a class of estimators that solve eq. (1.1) for Q,,(d) = - &,(@‘@/g,(@, where d,(d) is a vector of the data and parameters such that 9,(8,) LO and I@ is positive semidefinite. Both GMM and CMD are special cases of minimum distance, with g,,(H) = n- l XI= 1 g(zi, 0) for GMM and g,(0) = 72- h(0) for CMD.’ This framework is useful for analyzing asymptotic normality of GMM and CMD, because (once) differentiability of J,(0) is a sufficient smoothness condition, while twice differentiability is often assumed for the objective function of an extremum estimator [see, e.g. Amemiya (1985)]. Indeed, as discussed in Section 3, asymptotic normality of an extremum estimator with a twice differentiable objective function Q,(e) is actually a special case 0, asymptotic normality of a minimum distance estimator, with d,(0) = V,&(0) and W equal to an identity matrix, where V, denotes the partial derivative. The idea here is that when analyzing asymptotic normality, an extremum estimator can be viewed as a solution to the first-order conditions V,&(Q) = 0, and in this form is a minimum distance estimator. For consistency, it can be a bad idea to treat an extremum estimator as a solution to first-order conditions rather than a global maximum of an objective function, because the first-order condition can have multiple roots even when the objective function has a unique maximum. Thus, the first-order conditions may not identify the parameters, even when there is a unique maximum to the objective function. Also, it is often easier to specify primitive conditions for a unique maximum than for a unique root of the first-order conditions. A classic example is the MLE for the Cauchy location-scale model, where z is a scalar, p is a location parameter, 0 a scale parameter, and f(z 10) = Ca- ‘( 1 + [(z - ~)/cJ]*)- 1 for a constant C. It is well known that, even in large samples, there are many roots to the first-order conditions for the location parameter ~1,although there is a global maximum to the likelihood function; see Example 1 below. Econometric examples tend to be somewhat less extreme, but can still have multiple roots. An example is the censored least absolute deviations estimator of Powell (1984). This estimator solves eq. (1.1) for Q,,(O) = -n-‘~;=,Jyimax (0, xi0) 1,where yi = max (0, ~18, + si}, and si has conditional median zero. A global maximum of this function over any compact set containing the true parameter will be consistent, under certain conditions, but the gradient has extraneous roots at any point where xi0 < 0 for all i (e.g. which can occur if xi is bounded). The importance for consistency of an extremum estimator being a global maximum has practical implications. Many iterative maximization procedures (e.g. Newton Raphson) may converge only to a local maximum, but consistency results only apply to the global maximum. Thus, it is often important to search for a global maximum. One approach to this problem is to try different starting values for iterative procedures, and pick the estimator that maximizes the objective from among the converged values. AS long as the extremum estimator is consistent and the true parameter is an element of the interior of the parameter set 0, an extremum estimator will be ‘For
GMM.
the law of large numbers
implies cj.(fI,) 50.
W.K. Newey und D. McFadden
2118
a root of the first-order conditions asymptotically, and hence will be included among the local maxima. Also, this procedure can avoid extraneous boundary maxima, e.g. those that can occur in maximum likelihood estimation of mixture models. Figure 1 shows a schematic, illustrating the relationships between the various types of estimators introduced so far: The name or mnemonic for each type of estimator (e.g. MLE for maximum likelihood) is given, along with objective function being maximized, except for GMM and CMD where the form of d,(0) is given. The solid arrows indicate inclusion in a class of estimators. For example, MLE is included in the class of extremum estimators and GMM is a minimum distance estimator. The broken arrows indicate inclusion in the class when the estimator is viewed as a solution to first-order conditions. In particular, the first-order conditions for an extremum estimator are V,&(Q) = 0, making it a minimum distance estimator with g,,(0) = V,&(e) and I%‘= I. Similarly, the first-order conditions for MLE make it a GMM estimator with y(z, 0) = VBIn f(zl0) and those for NLS a GMM estimator with g(z, 0) = - 2[y - h(x, B)]V,h(x, 0). As discussed above, these broken arrows are useful for analyzing the asymptotic distribution, but not for consistency. Also, as further discussed in Section 7, the broken arrows are not very useful when the objective function o,(0) is not smooth. The broad outline of the chapter is to treat consistency, asymptotic normality, consistent asymptotic variance estimation, and asymptotic efficiency in that order. The general results will be organized hierarchically across sections, with the asymptotic normality results assuming consistency and the asymptotic efficiency results assuming asymptotic normality. In each section, some illustrative, self-contained examples will be given. Two-step estimators will be discussed in a separate section, partly as an illustration of how the general frameworks discussed here can be applied and partly because of their intrinsic importance in econometric applications. Two later sections deal with more advanced topics. Section 7 considers asymptotic normality when the objective function o,(0) is not smooth. Section 8 develops some asymptotic theory when @ depends on a nonparametric estimator (e.g. a kernel regression, see Chapter 39). This chapter is designed to provide an introduction to asymptotic theory for nonlinear models, as well as a guide to recent developments. For this purpose,
Extremum
O.@) /
i$,{yi - 4~
/ MLE
@l’/n
Distance
-AW~cm
\
NLS
-
Minimum
------_---__*
\ CMD
GMM
iglsh
i In f(dWn ,=I
L-_________l___________T Figure
1
Q/n
{A(@))
3 - WI)
Ch. 36: Lurge Sample Estimation und Hypothesis
Testing
2119
Sections 226 have been organized in such a way that the more basic material is collected in the first part of each section. In particular, Sections 2.1-2.5, 3.1-3.4, 4.1-4.3, 5.1, and 5.2, might be used as text for part of a second-year graduate econometrics course, possibly also including some examples from the other parts of this chapter. The results for extremum and minimum distance estimators are general enough to cover data that is a stationary stochastic process, but the regularity conditions for GMM, MLE, and the more specific examples are restricted to i.i.d. data. Modeling data as i.i.d. is satisfactory in many cross-section and panel data applications. Chapter 37 gives results for dependent observations. This chapter assumes some familiarity with elementary concepts from analysis (e.g. compact sets, continuous functions, etc.) and with probability theory. More detailed familiarity with convergence concepts, laws of large numbers, and central limit theorems is assumed, e.g. as in Chapter 3 of Amemiya (1985), although some particularly important or potentially unfamiliar results will be cited in footnotes. The most technical explanations, including measurability concerns, will be reserved to footnotes. Three basic examples will be used to illustrate the general results of this chapter. Example 1.I (Cauchy location-scale) In this example z is a scalar random variable, 0 = (11,c)’ is a two-dimensional vector, and z is continuously distributed with p.d.f. f(zId,), where f(zl@ = C-a- ’ { 1 + [(z - ~)/a]~} -i and C is a constant. In this example p is a location parameter and 0 a scale parameter. This example is interesting because the MLE will be consistent, in spite of the first-order conditions having many roots and the nonexistence of moments of z (e.g. so the sample mean is not a consistent estimator of 0,). Example 1.2 (Probit) Probit is an MLE example where z = (y, x’) for a binary variable y, y~(0, l}, and a q x 1 vector of regressors x, and the conditional probability of y given x is f(zl0,) for f(zl0) = @(x’@~[ 1 - @(x’Q)]’ -y. Here f(z ItI,) is a p.d.f. with respect to integration that sums over the two different values of y and integrates over the distribution of x, i.e. where the integral of any function a(y, x) is !a(~, x) dz = E[a( 1, x)] + Epu(O,x)]. This example illustrates how regressors can be allowed for, and is a model that is often applied. Example 1.3 (Hansen-Singleton) This is a GMM (nonlinear instrumental variables) example, where g(z, 0) = x*p(z, 0) for p(z, 0) = p*w*yy - 1. The functional form here is from Hansen and Singleton (1982), where p is a rate of time preference, y a risk aversion parameter, w an asset return, y a consumption ratio for adjacent time periods, and x consists of variables
Ch. 36: Large Sample Estimation and Hypothesis
2121
Testing
lead to the estimator being close to one of the maxima, which does not give consistency (because one of the maxima will not be the true value of the parameter). The condition that QO(0) have a unique maximum at the true parameter is related to identification. The discussion so far only allows for a compact parameter set. In theory compactness requires that one know bounds on the true parameter value, although this constraint is often ignored in practice. It is possible to drop this assumption if the function Q,(0) cannot rise “too much” as 8 becomes unbounded, as further discussed below. Uniform convergence and continuity of the limiting function are also important. Uniform convergence corresponds to the feature of the graph that Q,(e) was in the “sleeve” for all values of 0E 0. Conditions for uniform convergence are given below. The rest of this section develops this descriptive discussion into precise results on consistency of extremum estimators. Section 2.1 presents the basic consistency theorem. Sections 2.222.5 give simple but general sufficient conditions for consistency, including results for MLE and GMM. More advanced and/or technical material is contained in Sections 2.662.8.
2.1.
The basic consistency
theorem
To state a theorem it is necessary probability, as follows:
to define
Uniform convergence_in
o,(d) converges
probability:
precisely
uniform
uniformly
convergence
in
in probability
to
Qd@ meanssu~~~~l Q,(e) - Qd@ 30. The following is the fundamental consistency is similar to Lemma 3 of Amemiya (1973).
result for extremum
estimators,
and
Theorem 2.1 If there is a function QO(0) such that (i)&(8) IS uniquely maximized at 8,; (ii) 0 is compact; (iii) QO(0) is continuous; (iv) Q,,(e) converges uniformly in probability to Q,(0), then i?p.
19,.
Proof For any E > 0 we have wit_h propability 43 by eq. (1.1); (b)
approaching
one (w.p.a.1) (a) Q,(g) > Q,(O,) -
Qd@ > Q.(o) - e/3 by (iv); (4 Q,&J > Qd&J - 43 by W9
‘The probability statements in this proof are only well defined if each of k&(8),, and &8,) are measurable. The measurability issue can be bypassed by defining consistency and uniform convergence in terms of outer measure. The outer measure of a (possibly nonmeasurable) event E is the infimum of E[ Y] over all random variables Y with Y 2 l(8), where l(d) is the indicator function for the event 6.
W.K. Newey and D. McFadden
2122
Therefore,
w.p.a. 1,
(b) Q,(e, > Q,(o^, - J3?
Q&J
- 2E,3(? Qo(&J - E.
Thus, for any a > 0, Q,(Q) > Qe(0,) - E w.p.a.1. Let .,Ir be any open subset of 0 containing fI,. By 0 n.4”’ compact, (i), and (iii), SU~~~~~,-~Q~(~) = Qo(8*) < Qo(0,) for some 0*~ 0 n Jt”. Thus, choosing E = Qo_(fIo)- supBE .,,flCQ0(8), it follows that Q.E.D. w.p.a.1 Q,(6) > SU~~~~~,~~Q,,(H), and hence (3~~4”. The conditions of this theorem are slightly stronger than necessary. It is not necessary to assume that 8 actually maximi_zes_the objectiv_e function. This assumption can be replaced by the hypothesis that Q,(e) 3 supBE @Q,,(d)+ o,(l). This replacement has no effect on the proof, in particular on part (a), so that the conclusion remains true. These modifications are useful for analyzing some estimators in econometrics, such as the maximum score estimator of Manski (1975) and the simulated moment estimators of Pakes (1986) and McFadden (1989). These modifications are not given in the statement of the consistency result in order to keep that result simple, but will be used later. Some of the other conditions can also be weakened. Assumption (iii) can be changed to upper semi-continuity of Q,,(e) and (iv) to Q,,(e,) A Q,(fI,) and for all E > 0, Q,(0) < Q,(e) + E for all 19~0 with probability approaching one.” Under these weaker conditions the conclusion still is satisfied, with exactly the same proof. Theorem 2.1 is a weak consistency result, i.e. it shows I!?3 8,. A corresponding strong consistency result, i.e. H^Z Ho, can be obtained by assuming that supBE eJ Q,(0) - Qo(0) 1% 0 holds in place of uniform convergence in probability. The proof is exactly the same as that above, except that “as. for large enough n” replaces “with probability approaching one”. This and other results are stated here for convergence in probability because it suffices for the asymptotic distribution theory. This result is quite general, applying to any topological space. Hence, it allows for 0 to be infinite-dimensional, i.e. for 19to be a function, as would be of interest for nonparametric estimation of (say) a density or regression function. However, the compactness of the parameter space is difficult to check or implausible in many cases where B is infinite-dimensional. To use this result to show consistency of a particular estimator it must be possible to check the conditions. For this purpose it is important to have primitive conditions, where the word “primitive” here is used synonymously with the phrase “easy to interpret”. The compactness condition is primitive but the others are not, so that it is important to discuss more primitive conditions, as will be done in the following subsections. I0 Uppersemi-continuity means that for any OE 0 and t: > 0 there is an open subset. 0 such that Q”(P) < Q,(0) + E for all U’EA’.
V of 0 containing
Ch. 36: Large Sample Estimation and Hypothesis
Testing
2123
Condition (i) is the identification condition discussed above, (ii) the boundedness condition on the parameter set, and (iii) and (iv) the continuity and uniform convergence conditions. These can be loosely grouped into “substantive” and “regularity” conditions. The identification condition (i) is substantive. There are well known examples where this condition fails, e.g. linear instrumental variables estimation with fewer instruments than parameters. Thus, it is particularly important to be able to specify primitive hypotheses for QO(@ to have a unique maximum. The compactness condition (ii) is also substantive, with eOe 0 requiring that bounds on the parameters be known. However, in applications the compactness restriction is often ignored. This practice is justified for estimators where compactness can be dropped without affecting consistency of estimators. Some of these estimators are discussed in Section 2.6. Uniform convergence and continuity are the hypotheses that are often referred to as “the standard regularity conditions” for consistency. They will typically be satisfied when moments of certain functions exist and there is some continuity in Q,(O) or in the distribution of the data. Moment existence assumptions are needed to use the law of large numbers to show convergence of Q,(0) to its limit Q,,(0). Continuity of the limit QO(0) is quite a weak condition. It can even be true when Q,(0) is not continuous, because continuity of the distribution of the data can “smooth out” the discontinuities in the sample objective function. Primitive regularity conditions for uniform convergence and continuity are given in Section 2.3. Also, Section 2.7 relates uniform convergence to stochastic equicontinuity, a property that is necessary and sufficient for uniform convergence, and gives more sufficient conditions for uniform convergence. To formulate primitive conditions for consistency of an extremum estimator, it is necessary to first find Q0(f9). Usually it is straightforward to calculate QO(@ as the probability limit of Q,(0) for any 0, a necessary condition for (iii) to be satisfied. This calculation can be accomplished by applying the law of large numbers, or hypotheses about convergence of certain components. For example, the law of large numbers implies that for MLE the limit of Q,(0) is QO(0) = E[lnf(zI 0)] and for NLS QO(0) = - E[ {y - h(x, @}‘I. Note the role played here by the normalization of the log-likelihood and sum of squared residuals, that leads to the objective function converging to a nonzero limit. Similar calculations give the limit for GMM and CMD, as further discussed below. Once this limit has been found, the consistency will follow from the conditions of Theorem 2.1. One device that may allow for consistency under weaker conditions is to treat 8 as a maximum of Q,(e) - Q,(e,) rather than just Q,(d). This is a magnitude normalization that sometimes makes it possible to weaken hypotheses on existence of moments. In the censored least absolute deviations example, where Q,,(e) = -n-rC;=,lJ$max (0, xi0) (, an assumption on existence of the expectation of y is useful for applying a law of large numbers to show convergence of Q,(0). In contrast Q,,(d) - Q,,(&) = -n- ’ X1= 1[ (yi -max{O, x:6} I- (yi --ax (0, XI@,}I] is a bounded function of yi, so that no such assumption is needed.
2124
2.2.
W.K. Newey end D. McFadden
Ident$cution
The identification condition for consistency of an extremum estimator is that the limit of the objective function has a unique maximum at the truth.” This condition is related to identification in the usual sense, which is that the distribution of the data at the true parameter is different than that at any other possible parameter value. To be precise, identification is a necessary condition for the limiting objective function to have a unique maximum, but it is not in general sufficient.” This section focuses on identification conditions for MLE, NLS, GMM, and CMD, in order to illustrate the kinds of results that are available. 2.2.1.
The maximum
likelihood estimator
An important feature of maximum likelihood is that identification is also sufficient for a unique maximum. Let Y, # Y2 for random variables mean Prob({ Y1 # Y,})>O. Lemma 2.2 (Information
inequality)
If 8, is identified [tI # 0, and 0~ 0 implies f(z 10)# f(z 1O,)] and E[ 1In f(z 10)I] < cc for all 0 then QO(tl) = E[lnf(zI@] has a unique maximum at 8,. Proof By the strict dom variable
version of Jensen’s inequality, for any nonconstant, positive Y, - ln(E[Y]) < E[ - ln(Y)].r3 Then for a = f(zIfI)/f(zI0,)
ranand
~~~,,Q,~~,~-Q,~~~=~C~-~~Cf~~I~~lf~~I~,~l~l~-~n~C~f(zl~)lf(zl~~)~l= Q.E.D. - In [i.f(z (B)dz] = 0. The term “information inequality” refers to an interpretation of QO(0) as an information measure. This result means that MLE has the very nice feature that uniqueness of the maximum of the limiting objective function occurs under the very weakest possible condition of identification of 8,. Conditions for identification in particular models are specific to those models. It
‘i If the set of maximands .1 of the objective function has more than one element, then this set does not distinguish between the true parameter and other values. In this case further restrictions are needed for identification. These restrictions are sometimes referred to as normalizations. Alternatively, one could work with convergence in probability to a set .,*/R,but imposing normalization restrictions is more practical, and is needed for asymptotic normality. “If Or, is not identified, then there will be some o# 0, such that the distribution of the data is the same when 0 is the true parameter value>s when 0, is the true parameter value. Therefore, Q*(O) will also be limiting objective function when 0 is the true parameter, and hence the requirement that Q,,(O) be maximized at the true parameter implies that Q,,(O) has at least two maxima, flo and 0. i3The strict version of Jensen’s inequality states that if a(y) is a strictly concave function [e.g. a(y) = In(y)] and Y is a nonconstant random variable, then a(E[Y]) > E[a(Y)].
Ch. 36:
Large
Samplr
Estimation
and Hypothesis
Testing
is often possible to specify them in a way that is easy to interpret way), as in the Cauchy example. Exampk
2125 (i.e. in a “primitive”
1.1 continued
It will follow from Lemma 2.2 that E[ln,f(z10)] has a unique maximum at the true parameter. Existence of E [I In f(z I@[] for all 0 follows from Ilnf(zIO)I d C, + ln(l+a-2~~-~~2) 0. Thus, by the information inequality, E [ln f(z I O)] has a unique maximum at OO.This example illustrates that it can be quite easy to show that the expected log-likelihood has a unique maximum, even when the first-order conditions for the MLE do not have unique roots. Example
I .2 continued
Throughout the probit example, the identification and regularity conditions will be combined in the assumption that the second-moment matrix E[xx’] exists and is nonsingular. This assumption implies identification. To see why, note that nonsingularity of E[xx’] implies that it is positive definite. Let 0 # O,, so that E[{x’(O - O,)}“] = (0 - O,)‘E[xx’](O - 0,) > 0, implying that ~‘(0 - 0,) # 0, and hence x’0 # x’OO, where as before “not equals” means “not equal on a set of positive probability”. Both Q(u) and @( - u) are strictly monotonic, so that x’0 # ~‘0, implies both @(x’O) # @(x’O,) and 1 - @(X’S) # 1 - @(x’O,), and hence that f(z I 0) = @(x’O)Y[1 - @(x’O)] l py # f(z IO,). Existence of E[xx’] also implies that E[ Ilnf(zlO)l] < co. It is well known that the derivative d In @(u)/du = %(u)= ~(U)/@(U) [for 4(u) = V,@(u)], is convex and asymptotes to - u as u -+ - cc, and to zero as u + co. Therefore, a mean-value expansion around 0 = 0 gives Iln @(x’O)l = Iln @(O) + ~(x’8”)x’O1d Iln Q(O)\ + i(x’@)lx’OI
~I~~~~~~I+~~~+I~‘~l~l~‘~Idl~~~(~~I+C(~+IIxII lIOIl)llxlI IlOll. Since 1 -@(u)=@(-u)andyis bounded, (lnf(zIO)Id2[Iln@(O)I+C(l + 11x/I x II 0 II )II x /III 0 II 1, so existence of second moments of x implies that E[ Ilnf(z1 O)/] is finite. This part of the probit example illustrates the detailed work that may be needed to verify that moment existence assumptions like that of Lemma 2.2 are satisfied. 2.2.2.
Nonlinear
least squares
The identification condition for NLS is that the mean square error E[ { y - h(x,O)l’] = - QJO) have a unique minimum at OO.As is easily shown, the mean square error
W.K. Newey
2126
und D. McFudden
has a unique minimum at the conditional mean. I4 Since h(x,O,) = E[ylx] is the conditional mean, the identification condition for NLS is that h(x, 0) # h(x, 0,) if 0 # 8,, i.e. that h(x, 0) is not the conditional mean when 8 # 0,. This is a natural “conditional mean” identification condition for NLS. In some cases identification will not be sufficient for conditional mean identification. Intuitively, only parameters that affect the first conditional moment of y given x can be identified by NLS. For example, if 8 includes conditional variance parameters, or parameters of other higher-order moments, then these parameters may not be identified from the conditional mean. As for identification, it is often easy to give primitive hypotheses for conditional mean identification. For example, in the linear model h(x, 19)= x’d conditional mean identification holds if E[xx’] is nonsingular, for then 6 # 0, implies ~‘6’ # x’O,,, as shown in the probit example. For another example, suppose x is a positive scalar and h(x, 6) = c( + bxy. As long as both PO and y0 are nonzero, the regression curve for a different value of 6 intersects the true curve at most at three x points. Thus, for identification it is sufficient that x have positive density over any interval, or that x have more than three points that have positive probability. 2.2.3.
Generalized
method
of moments
For generalized method of moments the limit cated than for MLE or NLS, but is still easy g,(O) L g,,(O) = E[g(z, O)], so that if 6’ A W W, then by continuity of multiplication, Q,(d) tion has a maximum of zero at 8,, so 8, will 0 # 00. Lemma
2.3 (GMM
function QO(fI)is a little more complito find. By the law of large numbers, for some positive semi-definite matrix 3 Q,JO) = - go(O) Wg,(B). This funcbe identified if it is less than zero for
identification)
If W is positive semi-definite and, for go(Q) = E[g(z, S)], gO(O,) = 0 and Wg,(8) for 0 # 8, then QJfI) = - g0(0)‘Wg,(8) has a unique maximum at 8,.
# 0
Proof
Let R be such that R’R = W. If 6’# (I,, then 0 # Wg,(8) = R’RgJB) implies Rg,(O) #O and hence QO(@ = - [RgO(0)]‘[Rgo(fl)] < QO(fl,) = 0 for 8 # Be. Q.E.D. The GMM identification condition is that if 8 # 8, then go(O) is not in the null space of W, which for nonsingular W reduces to go(B) being nonzero if 8 # 0,. A necessary order condition for GMM identification is that there be at least as many moment
“‘For ECOI
m(x)= E[ylx]
and
a(x) any
-a(~))~1 = ECOI -m(4)2l + ~JX{Y
with strict inequality
if a(x) #m(x).
function -m(4Hm(x)
with
finite
-&)}I
variance,
iterated
expectations
gives
+ EC~m(x)-~(x)}~l~ EC{y-m(x)}‘],
Ch. 36: Large Sumplr
Esrimution
and Hypothesis
Testing
2121
functions as parameters. If there are fewer moments than parameters, then there will typically be many solutions to ~~(8) = 0. If the moment functions are linear, say y(z, Q) = g(z) + G(z)0, then the necessary and sufficient rank condition for GMM identification is that the rank of WE[G(z)J is equal to the number of columns. For example, consider a linear instrumental variables estimator, where g(z, 19)= x.(y - Y’Q) for a residual y - Y’B and a vector of instrumental variables x. The two-stage least squares estimator of 8 is a GMM estimator with W = (C!‘= 1xixi/n)- ‘. Suppose that E[xx’] exists and is nonsingular, so that W = (E[xx’])- i by the law of large numbers. Then the rank condition for GMM identification is E[xY’] has full column rank, the well known instrumental variables identification condition. If E[Y’lx] = x’rt then this condition reduces to 7~having full column rank, a version of the single equation identification condition [see F.M. Fisher (1976) Theorem 2.7.11. More generally, E[xY’] = E[xE[Y’jx]], so that GMM identification is the same as x having “full rank covariance” with
-uYlxl.
If E[g(z, 0)] is nonlinear in 0, then specifying primitive conditions for identification becomes quite difficult. Here conditions for identification are like conditions for unique solutions of nonlinear equations (as in E[g(z, e)] = 0), which are known to be difficult. This difficulty is another reason to avoid formulating 8 as the solution to the first-order condition when analyzing consistency, e.g. to avoid interpreting MLE as a GMM estimator with g(z, 0) = V, In f(z 119). In some cases this difficulty is unavoidable, as for instrumental variables estimators of nonlinear simultaneous equations models.’ 5 Local identification analysis may be useful when it is difficult to find primitive conditions for (global) identification. If g(z,@ is continuously differentiable and VOE[g(z, 0)] = E[V,g(z, Q)], then by Rothenberg (1971), a sufficient condition for a unique solution of WE[g(z, 8)] = 0 in a (small enough) neighborhood of 0, is that WEIVOg(z,Bo)] have full column rank. This condition is also necessary for local identification, and hence provides a necessary condition for global identification, when E[V,g(z, Q)] has constant rank in a neighborhood of 8, [i.e. in Rothenberg’s (1971) “regular” case]. For example, for nonlinear 2SLS, where p(z, e) is a residual and g(z, 0) = x.p(z, 8), the rank condition for local identification is that E[x.V,p(z, f&J’] has rank equal to its number of columns. A practical “solution” to the problem of global GMM identification, that has often been adopted, is to simply assume identification. This practice is reasonable, given the difficulty of formulating primitive conditions, but it is important to check that it is not a vacuous assumption whenever possible, by showing identification in some special cases. In simple models it may be possible to show identification under particular forms for conditional distributions. The Hansen-Singleton model provides one example. “There are some useful results on identification (1983) and Roehrig remains difficult.
(1989), although
global
of nonlinear simultaneous equations models in Brown identification analysis of instrumental variables estimators
W.K. Newey and D. McFadden
2128
Example
I .3 continued
Suppose that l? = (n-l C;= 1x,x;), so that the GMM estimator is nonlinear twostage least squares. By the law of large numbers, if E[xx’] exists and is nonsingular, Then the l?’ will converge in probability to W = (E[xx’])~‘, which is nonsingular. GMM identification condition is that there is a unique solution to E[xp(z, 0)] = 0 at 0 = H,, where p(z, 0) = {/?wy’ - 1). Quite primitive conditions for identification can be formulated in a special log-linear case. Suppose that w = exp[a(x) + u] and y = exp[b(x) + u], where (u, u) is independent of x, that a(x) + y,b(x) is constant, and that rl(0,) = 1 for ~(0) = exp[a(x) + y,b(x)]aE[exp(u + yv)]. Suppose also that the first element is a constant, so that the other elements can be assumed to have mean zero (by “demeaning” if necessary, which is a nonsingular linear transformation, and so does not affect the identification analysis). Let CI(X,y)=exp[(Y-yJb(x)]. Then E[p(z, @lx] = a(x, y)v](@- 1, which is zero for 0 = BO,and hence E[y(z, O,)] = 0. For 8 # B,, E[g(z, 0)] = {E[cr(x, y)]q(8) - 1, Cov [x’, a(x, y)]q(O)}‘. This expression is nonzero if Cov[x, a(x, y)] is nonzero, because then the second term is nonzero if r](B) is nonzero and the first term is nonzero if ~(8) = 0. Furthermore, if Cov [x, a(x, y)] = 0 for some y, then all of the elements of E[y(z, 0)] are zero for all /J and one can choose /I > 0 so the first element is zero. Thus, Cov[x, c((x, y)] # 0 for y # y0 is a necessary and sufficient condition for identification. In other words, the identification condition is that for all y in the parameter set, some coefficient of a nonconstant variable in the regression of a(x, y) on x is nonzero. This is a relatively primitive condition, because we have some intuition about when regression coefficients are zero, although it does depend on the form of b(x) and the distribution of x in a complicated way. If b(x) is a nonconstant, monotonic function of a linear combination of x, then this covariance will be nonzero. l6 Thus, in this example it is found that the assumption of GMM identification is not vacuous, that there are some nice special cases where identification does hold. 2.2.4.
Classical minimum distance
The analysis
of CMD
identification
is very similar
to that for GMM.
If AL
r-r0
and %‘I W, W positive semi-definite, then Q(0) = - [72- h(B)]‘@72 - h(6)] -% - [rco - h(0)]’ W[q, - h(O)] = Q,(O). The condition for Qo(8) to have a unique maximum (of zero) at 0, is that h(8,) = rcOand h(B) - h(0,) is not in the null space of W if 0 # Be, which reduces to h(B) # h(B,) if W is nonsingular. If h(8) is linear in 8 then there is a readily interpretable rank condition for identification, but otherwise the analysis of global identification is difficult. A rank condition for local identification is that the rank of W*V,h(O,) equals the number of components of 0.
“It is well known variable x.
that Cov[.x,J(x)]
# 0 for any monotonic,
nonconstant
function
,f(x) of a random
Ch. 36: Laryr Sample Estimation and Hypothesis
2.3.
Unform
convergence
2129
Testing
and continuity
Once conditions for identification have been found and compactness of the parameter set has been assumed, the only other primitive conditions for consistency required by Theorem 2.1 are those for uniform convergence in probability and continuity of the limiting objective function. This subsection gives primitive hypotheses for these conditions that, when combined with identification, lead to primitive conditions for consistency of particular estimators. For many estimators, results on uniform convergence of sample averages, known as uniform laws oflarge numbers, can be used to specify primitive regularity conditions. Examples include MLE, NLS, and GMM, each of which depends on sample averages. The following uniform law of large numbers is useful for these estimators. Let a(z, 6) be a matrix of functions of an observation z and the parameter 0, and for a matrix A = [aj,], let 11 A 11= (&&)“’ be the Euclidean norm. Lemma
2.4
If the data are i.i.d., @is compact, a(~,, 0) is continuous at each 0~ 0 with probability one, and there is d(z) with 11 a(z,d)ll d d(z) for all 8~0 and E[d(z)] < co, then E[a(z, e)] is continuous
and supeto /In- ‘x1= i a(~,, 0) - E[a(z, 0)] I/ 3
0.
The conditions of this result are similar to assumptions of Wald’s (1949) consistency proof, and it is implied by Lemma 1 of Tauchen (1985). The conditions of this result are quite weak. In particular, they allow for a(~,@ this result is useful to not be continuous on all of 0 for given z.l’ Consequently, even when the objective function is not continuous, as for Manski’s (1975) maximum score estimator and the simulation-based estimators of Pakes (1986) and McFadden (1989). Also, this result can be extended to dependent data. The conclusion remains true if the i.i.d. hypothesis is changed to strict stationarity and ergodicity of zi.i8 The two conditions imposed on a(z, 0) are a continuity condition and a moment existence condition. These conditions are very primitive. The continuity condition can often be verified by inspection. The moment existence hypothesis just requires a data-dependent upper bound on IIa(z, 0) II that has finite expectation. This condition is sometimes referred to as a “dominance condition”, where d(z) is the dominating function. Because it only requires that certain moments exist, it is a “regularity condition” rather than a “substantive restriction”. It is often quite easy to see that the continuity condition is satisfied and to specify moment hypotheses for the dominance condition, as in the examples.
r
'The conditions of Lemma 2.4 are not sufficient but are sufficient for convergence of the supremum sufficient for consistency of the estimator in terms objective function is not continuous, as previously “Strict stationarity means that the distribution and ergodicity implies that n- ‘I:= ,a(zJ + E[a(zJ]
for measurability of the supremum in the conclusion, in outer measure. Convergence in outer measure is of outer measure, a result that is useful when the noted, of (zi, zi + ,, , z.,+,) does not depend on i for any tn, for (measurable) functions a(z) with E[ la(z)l] < CO.
Ch. 36: Large Sample Estimation and Hypothesis Testing
2.4.
Consistency
of maximum
2131
likelihood
The conditions for identification in Section 2.2 and the uniform convergence result of Lemma 2.4, allow specification of primitive regularity conditions for particular kinds of estimators. A consistency result for MLE can be formulated as follows: Theorem 2.5
Suppose that zi, (i = 1,2,. . .), are i.i.d. with p.d.f. f(zJ0,) and (i) if 8 f8, then f(zi18) #f(zilO,); (ii) B,E@, which is compact; (iii) In f(z,le) is continuous at each 8~0 with probability one; (iv) E[supe,oIlnf(~18)1] < co. Then &Lo,. Proof
Proceed by verifying the conditions of Theorem 2.1. Condition 2.1(i) follows by 2.5(i) and (iv) and Lemma 2.2. Condition 2.l(ii) holds by 2S(ii). Conditions 2.l(iii) and (iv) Q.E.D. follow by Lemma 2.4. The conditions of this result are quite primitive and also quite weak. The conclusion is consistency of the MLE. Thus, a particular MLE can be shown to be consistent by checking the conditions of this result, which are identification, compactness, continuity of the log-likelihood at particular points, and a dominance condition for the log-likelihood. Often it is easy to specify conditions for identification, continuity holds by inspection, and the dominance condition can be shown to hold with a little algebra. The Cauchy location-scale model is an example. Example 1 .l continued
To show consistency of the Cauchy MLE, one can proceed to verify the hypotheses of Theorem 2.5. Condition (i) was shown in Section 2.2.1. Conditions (iii) and (iv) were shown in Section 2.3. Then the conditions of Theorem 2.5 imply that when 0 is any compact set containing 8,, the Cauchy MLE is consistent. A similar result can be stated for probit (i.e. Example 1.2). It is not given here because it is possible to drop the compactness hypothesis of Theorem 2.5. The probit log-likelihood turns out to be concave in parameters, leading to a simple consistency result without a compact parameter space. This result is discussed in Section 2.6. Theorem 2.5 remains true if the i.i.d. assumption is replaced with the condition thatz,,~,,... is stationary and ergodic with (marginal) p.d.f. of zi given byf(z IO,). This relaxation of the i.i.d. assumption is possible because the limit function remains unchanged (so the information inequality still applies) and, as noted in Section 2.3, uniform convergence and continuity of the limit still hold. A similar consistency result for NLS could be formulated by combining conditional mean identification, compactness of the parameter space, h(x, 13)being conti-
2132
W.K. Nrwey and D. McFadden
nuous at each H with probability such a result is left as an exercise.
Consistency
2.5.
A consistency Theorem
one, and a dominance
condition.
Formulating
ofGMM
result for GMM
can be formulated
as follows:
2.6
Suppose that zi, (i = 1,2,. .), are i.i.d., I%’% W, and (i) W is positive semi-definite and WE[g(z, t3)] = 0 only if (I = 8,; (ii) tIO~0, which is compact; (iii) g(z, 0) is continuous at each QE 0 with probability one; (iv) E[sup~,~ I/g(z, 0) I/] < co. Then 6% (so. ProQf
Proceed by verifying the hypotheses of Theorem 2.1. Condition 2.1(i) follows by 2.6(i) and Lemma 2.3. Condition 2.l(ii) holds by 2.6(ii). By Lemma 2.4 applied to a(z, 0) = g(z, g), for g,(e) = n- ‘x:1= ,g(zi, 0) and go(g) = E[g(z, g)], one has supBEe I(g,(8) - go(g) II30 and go(d) is continuous. Thus, 2.l(iii) holds by QO(0) = - go(g) WY,(Q) continuous. By 0 compact, go(e) is bounded on 0, and by the triangle and Cauchy-Schwartz inequalities,
I!A(@- Qo@) I
G IICM@ - Yov4II2II + II + 2 IIso(@) II IId,(@- s,(@ II II @ II + llSo(~N2 II @- WII, so that sup,,,lQ,(g)
- Q,Jg)I AO,
and 2.l(iv) holds.
Q.E.D.
The conditions of this result are quite weak, allowing for discontinuity in the moment functions.’ 9 Consequently, this result is general enough to cover the simulated moment estimators of Pakes (1986) and McFadden (1989), or the interval moment estimator of Newey (1988). To use this result to show consistency of a GMM estimator, one proceeds to check the conditions, as in the Hansen-Singleton example.
19Measurability of the estimator becomes an issue in this case, although working with outer measure, as previously noted.
this can be finessed
by
2133
Ch. 36: Large Sample Estimation and Hypothesis Testing Example
1.3 continued
‘. For hypothesis (i), simply Assume that E[xx’] < a, so that I% A W = (E[xx’])assume that E[y(z, 0)] = 0 has a unique solution at 0, among all PIE0. Unfortunately, as discussed in Section 2.2, it is difficult to give more primitive assumptions for this identification condition. Also, assume that @is compact, so that (ii) holds. Then (iii) holds by inspection, and as discussed in Section 2.3, (iv) holds as long as the moment existence conditions given there are satisfied. Thus, under these assumptions, the estimator will be consistent.
Theorem 2.6 remains true if the i.i.d. assumption is replaced with the condition that zlr z2,. . is stationary and ergodic. Also, a similar consistency result could be formulated for CMD, by combining uniqueness of the solution to 7c,,= h(8) with compactness of the parameter space and continuity of h(O). Details are left as an exercise. 2.6.
Consistency
without compactness
The compactness assumption is restrictive, because it implicitly requires that there be known bounds on the true parameter value. It is useful in practice to be able to drop this restriction, so that conditions for consistency without compactness are of interest. One nice result is available when the objective function is concave. Intuitively, concavity prevents the objective function from “turning up” as the parameter moves far away from the truth. A precise result based on this intuition is the following one: Theorem
2.7
If there is a function QO(0) such that (i) QO(0) 1s uniquely maximized at 0,; (ii) B0 is an element of the interior of a convex set 0 and o,,(e) is concave; and (iii) o,(e) L QO(0) for all 8~0,
then fin exists with probability
approaching
one and 8,,-%te,.
Proof Let %?be a closed sphere of radius 2~ around 8, that is contained in the interior of 0 and let %?!be its boundary. Concavity is preserved by pointwise limits, so that QO(0) is also concave. A concave function is continuous on the interior of its domain, so that QO(0) is continuous on V?. Also, by Theorem 10.8 of Rockafellar (1970), pointwise convergence of concave functions on a dense subset of an open set implies uniform convergence on any compact subset of the open set. It then follows as in Andersen and Gill (1982) that o,(e) converges to QO(fI) in probability uniformly on any compact subset of 0, and in particular on %Y.Hence, by Theorem 2.1, the maximand f!?!of o,,(e) on % is consistent for 0,. Then the event that g,, is within c of fIO, so that Q,(g,,) 3 max,&,(@, occurs with probability approaching one. In this event, for any 0 outside W, there is a linear convex combination ,J$” + (1 - ,I)0
W.K. Newry and D. McFadden
2134
that lies in g (with A < l), so that_ Q,(g,,) 3 Q,[ng,, + (1 - i)U]. By concavity, Q.[ng,,_+ (1 - i)O] 3 ,$,(g,,) + (1 - E_)_Q,(e). Putting these inequalities together, Q.E.D. (1 - i)Q,(@ > (1 - i)Q,(0), implying 8, is the maximand over 0. This theorem is similar to Corollary II.2 of Andersen and Gill (1982) and Lemma A of Newey and Powell (1987). In addition to allowing for noncompact 0, it only requires pointwise convergence. This weaker hypothesis is possible because pointwise convergence of concave functions implies uniform con_vergence (see the proof). This result also contains the additional conclusion that 0 exists with probability approaching one, which is needed because of noncompactness of 0. This theorem leads to simple conditions for consistency without compactness for both MLE and GMM. For MLE, if in Theorem 2.5, (ii)are replaced by 0 convex, In f(z 10)concave in 0 (with probability one), and E[ 1In f’(z 10)I] < 03 for all 0, then the law of large numbers and Theorem 2.7 give consistency. In other words, with concavity the conditions of Lemma 2.2 are sufficient for consistency of the MLE. Probit is an example. Example
1.2 continued
It was shown in Section 2.2.1 that the conditions of Lemma 2.2 are satisfied. Thus, to show consistency of the probit MLE it suffices to show concavity of the loglikelihood, which will be implied by concavity of In @(x’@)and In @( - ~‘0). Since ~‘8 is linear in H, it suffices to show concavity of In a(u) in u. This concavity follows from the well known fact that d In @(u)/du = ~(U)/@(U) is monotonic decreasing [as well as the general Pratt (1981) result discussed below]. For GMM, if y(z, 0) is linear in 0 and I?f is positive semi-definite then the objective function is concave, so if in Theorem 2.6, (ii)are replaced by the requirement that E[ /Ig(z, 0) 111< n3 for all tj~ 0, the conclusion of Theorem 2.7 will give consistency of GMM. This linear moment function case includes linear instrumental variables estimators, where compactness is well known to not be essential. This result can easily be generalized to estimators with objective functions that are concave after reparametrization. If conditions (i) and (iii) are satisfied and there is a one-to-one mapping r(0) with continuous inverse such that &-‘(I.)] is concave_ on^ r(O) and $0,) is an element of the interior of r( O), then the maximizing value i of Q.[r - ‘(J”)] will be consistent for i, = s(d,) by Theorem 2.7 and invariance of a maxima to one-to-one reparametrization, and i? = r- ‘(I) will be consistent for 8, = z-~(&) by continuity of the inverse. An important class of estimators with objective functions that are concave after reparametrization are univariate continuous/discrete regression models with logconcave densities, as discussed in Olsen (1978) and Pratt (1981). To describe this class, first consider a continuous regression model y = x’& + cOc, where E is independent of x with p.d.f. g(s). In this case the (conditional on x) log-likelihood is - In 0 + In sCa_ ‘(y - x’fi)] for (B’, C)E 0 = @x(0, co). If In g(E) is concave, then this
Ch. 36: Large Sample Estimation and Hypothesis
Testing
2135
log-likelihood need not be concave, but the likelihood In ‘/ + ln Y(YY- ~‘6) is concave in the one-to-one reparametrization y = Q- ’ and 6 = /~‘/a. Thus, the average loglikelihood is also concave in these parameters, so that the above generalization of Theorem 2.7 implies consistency of the MLE estimators of fi and r~ when the maximization takes place over 0 = Rkx(O, a), if In g(c) is concave. There are many log-concave densities, including those proportional to exp( - Ixl”) for CI3 1 (including the Gaussian), logistic, and the gamma and beta when the p.d.f. is bounded, so this concavity property is shared by many models of interest. The reparametrized log-likelihood is also concave when y is only partially observed. As shown by Pratt (1981), concavity of lng(a) also implies concavity of ln[G(u)G(w)] in u and w, for the CDF G(u)=~“~~(E)~E.~~ That is, the logprobability of an interval will be concave in the endpoints. Consequently, the log-likelihood for partial observability will be concave in the parameters when each of the endpoints is a linear function of the parameters. Thus, the MLE will be consistent without compactness in partially observed regression models with logconcave densities, which includes probit, logit, Tobit, and ordered probit with unknown censoring points. There are many other estimators with concave objective functions, where some version of Theorem 2.7 has been used to show consistency without compactness. These include the estimators in Andersen and Gill (1982), Newey and Powell (1987), and Honort (1992). It is also possible to relax compactness with some nonconcave objective functions. Indeed, the original Wald (1949) MLE consistency theorem allowed for noncompactness, and Huber (1967) has given similar results for other estimators. The basic idea is to bound the objective function above uniformly in parameters that are far enough away from the truth. For example, consider the MLE. Suppose that there is a compact set % such that E[supBtOnMc In f(z 1d)] < E[ln f(z) fl,)]. Then by the law of large numbers, with probability approaching one, supBtOnXc&(0) d n-l x In f(zil@) < n-‘Cy= I In f(zl do), and the maximum must lie in %‘. c;= 1 suPoE@n’fjc Once the maximum is known to be in a compact set with probability approaching one, Theorem 2.1 applies to give consistency. Unfortunately, the Wald idea does not work in regression models, which are quite common in econometrics. The problem is that the likelihood depends on regression parameters 8 through linear combinations of the form ~‘9, so that for given x changing 8 along the null-space of x’ does not change the likelihood. Some results that do allow for regressors are given in McDonald and Newey (1988), where it is shown how compactness on 0 can be dropped when the objective takes the form Q,(e) = n- ’ xy= 1 a(Zi, X:O) an d a (z, u) goes to - co as u becomes unbounded. It would be useful to have other results that apply to regression models with nonconcave objective functions. “‘Pratt (1981) also showed that concavity to be concave over all v and w.
of In g(c) is necessary
as well as sufficient for ln[G(u) ~ G(w)]
W.K. Newey and D. McFadden
2136
Compactness is essential for consistency of some extremum estimators. For example, consider the MLE in a model where z is a mixture of normals, having likelihood f(z 1Q)= pea-‘~+!$a-‘(z-p)] +(I -p)y~‘f$Cy~l(z-~)l for8=(p,a,6y)‘, some 0 < p < 1, and the standard normal p.d.f. d(c) = (271) 1’2e-E2’2. An interpretation of this model is that z is drawn from N(p, a2) with probability p and from N(cc, r2) with probability (1 - p). The problem with noncompactness for the MLE in this model is that for certain p (and u) values, the average log-likelihood becomes unbounded as g (or y) goes to zero. Thus, for existence and consistency of the MLE it is necessary to bound 0 (and y) away from zero. To be specific, suppose that p = Zi as o+o, for some i. Then f(z,lfI) = ~.a ~‘@(O)$(l -p)y-lf$cy~l(zi-cc)]+co and assuming that zj # zi for all j # i, cs occurs with probability one, f(zj/U)+ (1 -p)y-l~[y-l(zj-@]>O. Hence, Q,,(e)= n-‘Cy=r lnf(zilO) becomes unbounded as (T+O for p = zi. In spite of this fact, if the parameter set is assumed to be compact, so that (Tand y are bounded away from zero, then Theorem 2.5 gives consistency of the MLE. In particular, it is straightforward to show that (I is identified, so that, by the information inequality, E[ln f(zl@] has a unique maximum at Be. The problem here is that the convergence of the sample objective function is not uniform over small values of fr. This example is extreme, but there are interesting econometric examples that have this feature. One of these is the disequilibrium model without observed regime of Fair and Jaffee (1972), where y = min{x’p, + G,,E,~‘6, + you}, E and u are standard normal and independent of each other and of x and w, and the regressors include constants. This model also has an unbounded average log-likelihood as 0 -+ 0 for a certain values of /I, but the MLE over any compact set containing the truth will be consistent under the conditions of Theorem 2.5. Unfortunately, as a practical matter one may not be sure about lower bounds on variances, and even if one were sure, extraneous maxima can appear at the lower bounds in small samples. An approach to this problem is to search among local maxima that satisfy the first-order conditions for the one that maximizes the likelihood. This approach may work in the normal mixture and disequilibrium models, but might not give a consistent estimator when the true value lies on the boundary (and the first-order conditions are not satisfied on the boundary).
2.7.
Stochastic
equicontinuity
and uniform
convergence
Stochastic equicontinuity is important in recent developments in asymptotic distribution theory, as described in the chapter by Andrews in this handbook. This concept is also important for uniform convergence, as can be illustrated by the nonstochastic case. Consider a sequence of continuous, nonstochastic functions {Q,(0)},“= 1. For nonrandom functions, equicontinuity means that the “gap” between Q,(0) and Q,(6) can be made small uniformly in n by making g be close enough to 0, i.e. a sequence of functions is equicontinuous if they are continuous uniformly in
Ch. 36: Lurqr
Sample Estimation
and Hypothesis
Testing
2137
More precisely, equicontinuity holds if for each 8, c > 0 there exists 6 > 0 with 1Q,(8) ~ Q,(e)1 < E for all Jj6 0 11< 6 and all 11.~~ It is well known that if Q,(0) converges to Q,J0) pointwise, i.e. for all UE 0, and @is compact, then equicontinuity is a necessary and sufficient condition for uniform convergence [e.g. see Rudin (1976)]. The ideas behind it being a necessary and sufficient condition for uniform convergence is that pointwise convergence is the same as uniform covergence on any finite grid of points, and a finite grid of points can approximately cover a compact set, so that uniform convergence means that the functions cannot vary too much as 0 moves off the grid. To apply the same ideas to uniform convergence in probability it is necessary to define an “in probability” version of equicontinuity. The following version is formulated in Newey (1991 a). n.
Stochastic_equicontinuity: For every c, n > 0 there exists a sequence of random variables d, and a sample size no such that for n > n,, Prob( 1d^,1> E) < q and for each 0 there is an open set JV containing 8 with
Here t_he function d^, acts like a “random epsilon”, bounding the effect of changing 0 on Q,(e). Consequently, similar reasoning to the nonstochastic case can be used to show that stochastic equicontinuity is an essential condition for uniform convergence, as stated in the following result: Lemma 2.8 Suppose 0 is compact and Qo(B) is continuous. Then ~up~,~lQ,(~) - Qo(@ 30 if and only if Q,(0) L Qo(e) for all 9~ @and Q,(O) is stochastically equicontinuous. The proof of this result is given in Newey (1991a). It is also possible to state an almost sure convergence version of this result, although this does not seem to produce the variety of conditions for uniform convergence that stochastic equicontinuity does; see Andrews (1992). One useful sufficient condition for uniform convergence that is motivated by the form of the stochastic equicontinuity property is a global, “in probability” Lipschitz condition, as in the hypotheses of the following result. Let O,(l) denote a sequence of random variables that is bounded in probability.22
” One can allow for discontinuity in the functions by allowing the difference to be less than I: only for n > fi, where fi depends on E, but not on H. This modification is closer to the stochastic equicontinuity condition given here, which does allow for discontinuity. ” Y” is bounded in probability if for every E > 0 there exists ii and q such that Prob(l Y,l > ‘1)< E for n > ii.
W.K. Newey and D. McFadden
2138
Lemma 2.9 %QO(0) for all 0~0, and there is If 0 is compact, QO(0) is contmuous,_Q,,(0) OL,then cr>O and B,=O,(l) such that for all 0, HE 0, 1o,(8) - Q^,(O)ld k,, I/g- 0 11 su~~lto
I Q,(@ - QdfO 5 0.
Prooj By Lemma 2.8 it suffices to show stochastic equicontinuity. Pick E, ye> 0. By B,n = o,(l) there is M such that Prob( IB,I > M) < r] for all n large enough. Let M) Q.E.D. and for all 0, ~E.~V, IQ,,(o) - Q,,(0)1 < 6,,Il& 8 lla < 2,. This result is useful in formulating the uniform law of large numbers given in Wooldridge’s chapter in this volume. It is also useful when the objective function Q,(e) is not a simple function of sample averages (i.e. where uniform laws of large numbers do not apply). Further examples and discussion are given in Newey (1991a).
2.8.
Least ubsolute deviations examples
Estimators that minimize a sum of absolute deviations provide interesting examples. The objective function that these estimators minimize is not differentiable, so that weak regularity conditions are needed for verifying consistency and asymptotic normality. Also, these estimators have certain robustness properties that make them interesting in their own right. In linear models the least absolute deviations estimator is known to be more asymptotically more efficient than least squares for thick-tailed distributions. In the binary choice and censored regression models the least absolute deviations estimator is consistent without any functional form assumptions on the distribution of the disturbance. The linear model has been much discussed in the statistics and economics literature [e.g. see Bloomfeld and Steiger (1983)], so it seems more interesting to consider here other cases. To this end two examples are given: maximum score, which applies to the binary choice model, and censored least absolute deviations. 2.8.1.
Maximum
score
The maximum score estimator of Manski (I 975) is an interesting example because it has a noncontinuous objective function, where the weak regularity conditions of Lemma 2.4 are essential, and because it is a distribution-free estimator for binary choice. Maximum score is used to estimate 8, in the model y = I(x’B, + E > 0), where l(.s&‘)denotes the indicator for the event .d (equal to one if d occurs and zero
Ch. 36: Lurye Sumple Estimation and Hypothesis
Testing
otherwise), and E is a disturbance term with a conditional The estimator solves eq. (1.1) for
!A(@=-H-It i=l
lyi-
2139
median (given x) ofzero.
l(x;H>o)/.
A scale normalization is necessary (as usual for binary choice), and a convenient one here is to restrict all elements of 0 to satisfy //0 /I = 1. To show consistency of the maximum score estimator, one can use conditions for identification and Lemma 2.4 to directly verify all the hypotheses of Theorem 2.1. By the law of large numbers, Q,(e) will have probability limit Qe(0) = - EC/y - l(x’U > O)l]. To show that this limiting objective has a unique maximum at fIO,one can use the well known result that for any random variable Y, the expected absolute deviation E[ 1Y - a(x)I] is strictly minimized at any median of the conditional distribution of Y given x. For a binary variable such as y, the median is unique when Prob(y = 1 Ix) # +, equal to one when the conditional probability is more than i and equal to zero when it is less than i. Assume that 0 is the unique conditional median of E given x and that Prob(x’B, = 0) = 0. Then Prob(y = 1 Ix) > ( < ) 3 if and only if ~‘0, > ( < ) 0, so Prob(y = 1 Ix) = i occurs with probability zero, and hence l(x’t), > 0) is the unique median of y given x. Thus, it suffices to show that l(x’B > 0) # l(x’B, > 0) if 0 # 19,. For this purpose, suppose that there are corresponding partitions 8 = (or, fl;,’ and x = (x,, x;)’ such that x&S = 0 only if 6 = 0; also assume that the conditional distribution of x1 given x2 is continuous with a p.d.f. that is positive on R, and the coefficient O,, of x1 is nonzero. Under these conditions, if 0 # 8, then l(x’B > 0) # l(x’B, > 0), the idea being that the continuous distribution of x1 means that it is allowed that there is a region of x1 values where the sign of x’8 is different. Also, under this condition, ~‘8, = 0 with zero probability, so y has a unique conditional median of l(x’8, > 0) that differs from i(x’8 > 0) when 0 # fI,,, so that QO(@ has a unique maximum at 0,. For uniform convergence it is enough to assume that x’0 is continuously distributed for each 0. For example, if the coefficient of x1 is nonzero for all 0~0 then this condition will hold. Then, l(x’B > 0) will be continuous at each tI with probability one, and by y and l(x’B > 0) bounded, the dominance condition will be satisfied, so the conclusion of Lemma 2.4 gives continuity of Qo(0) and uniform convergence of Q,,(e) to Qe(@. The following result summarizes these conditions: Theorem
2.10
If y = l(x’B, + E > 0) and (i) the conditional median at I: = 0; (ii) there are corresponding
distribution of E given x has a unique partitions x = (x,, xi)’ and 8 = (e,, pZ)’
13A median of the distribution and Prob(y < m) 2 +.
Y is the set of values m SUCKthat Prob( Y 2 m) > f
of a random
variable
W.K. Nrwey
2140
and D. McFadden
such that Prob(x;G # 0) > 0 for 6 # 0 and the conditional distribution of xi given x2 is continuous with support R; and (iii) ~‘8 is continuously distributed for all 0~0= (H:lIHIl = l}; then 850,. 2.8.2.
Censored leust ubsolute deviations
Censored least absolute deviations is used to estimate B0 in the model y = max{O, ~‘0, + F} where c has a unique conditional median at zero. It is obtained by solvingeq.(l.l)forQ,(0)= -n-‘~~=i (lyi- max{O,x~~}~-~yi-max{O,xj~,}~)= Q,(U) - Q,(0,). Consistency of 8 can be shown by using Lemma 2.4 to verify the conditions of Theorem 2.1. The function Iyi - max (0, xi0) 1- Iyi - max {0, xi@,} I is continuous in 8 by inspection, and by the triangle inequality its absolute value is bounded above by Imax{O,x~H}I + Imax{O,xI8,}I d lIxJ( 118ll + IId,ll), so that if E[ 11 x II] < cc the dominance condition is satisfied. Then by the conclusion of Lemma 2.4, Q,(0) converges uniformly in probability to QO(@= E[ ly - max{O,x’8} Ily - max{O, ~‘8,) I]. Thus, for the normalized objective function, uniform convergence does not require any moments of y to exist, as promised in Section 2.1. Identification will follow from the fact that the conditional median minimizes the expected absolute deviation. Suppose that P(x’B, > 0) and P(x’6 # Olx’8, > 0) > 0 median at zero, y has a unique if 6 # 0. 24 By E having a uniqu e conditional conditional median at max{O, x’o,}. Therefore, to show identification it suffices to show that max{O, x’d} # max{O, x’BO} if 8 # 0,. There are two cases to consider. In case one, l(x’U > 0) # 1(x’@, > 0), implying max{O,x’B,} # max{O,x’@}. In case two, 1(x’@> 0) = l(x’0, > 0), so that max 10, x’(9) - max 10, x’BO}= l(x’B, > O)x’(H- 0,) # 0 by the identifying assumption. Thus, QO(0) has a unique maximum over all of R4 at BO. Summarizing these conditions leads to the following result: Theorem 2.11 If (i) y = max{O, ~‘8, + a}, the conditional distribution of E given x has a unique median at E = 0; (ii) Prob(x’B, > 0) > 0, Prob(x’G # Olx’0, > 0) > 0; (iii) E[li x 111< a; and (iv) 0 is any compact set containing BO, then 8 3 8,. As previously promised, this result shows that no assumption on the existence of moments of y is needed for consistency of censored least absolute deviations. Also, it shows that in spite of the first-order conditions being identically zero over all 0 where xi0 < 0 for all the observations, the global maximum of the least absolute deviations estimator, over any compact set containing the true parameter, will be consistent. It is not known whether the compactness restriction can be relaxed for this estimator; the objective function is not concave, and it is not known whether some other approach can be used to get rid of compactness.
241t suffices for the second condition
that E[l(u’U,
> 0)x.x’] is nonsingular.
2141
Ch. 36: Large Sample Estimation and Hypothesis Testiny
3.
Asymptotic
normality
Before giving precise conditions for asymptotic normality, it is helpful to sketch the main ideas. The key idea is that in large samples estimators are approximately equal to linear combinations of sample averages, so that the central limit theorem gives asymptotic normality. This idea can be illustrated by describing the approximation for the MLE. When the log-likelihood is differentiable and 8 is in the interior of the parameter set 0, the first-order condition 0 = n ‘x1= 1V, In f(zi I$) will be satisfied. Assuming twice continuous differentiability of the log-likelihood, the mean-value theorem applied to each element of the right-hand side of this first-order condition gives
(3.1)
where t?is a mean value on the line joining i? and 19~and V,, denotes the Hessian matrix of second derivatives. ’ 5 Let J = E[V, In f(z (0,) (V, In f(z 1tl,)}‘] be the information matrix and H = E[V,, In f(z 1O,)] the expected Hessian. Multiplying through by Jn
and solving for &(e^ - 6,) gives
p
I
(Hessian Conv.)
d
(Inverse Cont.)
1 NO.
H-1
(CLT)
(3.2)
J)
By the well known zero-mean property of the score V,ln ,f(z/Q,) and the central limit theorem, the second term will converge in distribution to N(0, .I). Also, since eis between 6 and 8,, it will be consistent if 8 is, so that by a law of large numbers that is uniform in 0 converging to 8, the Hessian term converges in probability to H. Then the inverse Hessian converges in probability to H-’ by continuity of the inverse at a nonsingular matrix. It then follows from the Slutzky theorem that &(6-
0,) % N(0, Hm 1JH-‘).26
Furthermore,
by the information
matrix equality
25The mean-value theorem only applies to individual elements of the partial derivatives, so that 0 actually differs from element to element of the vector equation (3.1). Measurability of these mean values holds because they minimize the absolute value of the remainder term, setting it equal to zero, and thus are extremum estimators; see Jennrich (1969). *“The Slutzky theorem
is Y, 5
Y, and Z, Ac*Z,Y,
’ -WY,.
W’,K. Newey
2142
und D. McFadden
H = -J, the asymptotic variance will have the usual inverse information matrix form J-l. This expansion shows that the maximum likelihood estimator is approximately equal to a linear combination of the average score in large samples, so that asymptotic normality follows by the central limit theorem applied to the score. This result is the prototype for many other asymptotic normality results. It has several components, including a first-order condition that is expanded around the truth, convergence of an inverse Hessian, and a score that follows the central limit theorem. Each of these components is important to the result. The first-order condition is a consequence of the estimator being in the interior of the parameter space.27 If the estimator remains on the boundary asymptotically, then it may not be asymptotically normal, as further discussed below. Also, if the inverse Hessian does not converge to a constant or the average score does not satisfy a central limit theorem, then the estimator may not be asymptotically normal. An example like this is least squares estimation of an autoregressive model with a unit root, as further discussed in Chapter 2. One condition that is not essential to asymptotic normality is the information matrix equality. If the distribution is misspecified [i.e. is not f’(zI fI,)] then the MLE may still be consistent and asymptotically normal. For example, for certain exponential family densities, such as the normal, conditional mean parameters will be consistently estimated even though the likelihood is misspecified; e.g. see Gourieroux et al. (1984). However, the distribution misspecification will result in a more complicated form H- 'JH-' for the asymptotic variance. This more complicated form must be allowed for to construct a consistent asymptotic variance estimator under misspecification. As described above, asymptotic normality results from convergence in probability of the Hessian, convergence in distribution of the average score, and the Slutzky theorem. There is another way to describe the asymptotic normality results that is often used. Consider an estimator 6, and suppose that there is a function G(z) such that
fi(e-
0,) = t
$(zi)/$
+ o,(l),
EC$(Z)l = 0,
~%$(z)lc/(ZYl exists,
(3.3)
i=l
where o,(l) denote: a random vector that converges in probability to zero. Asymptotic normality of 6’then results from the central limit theorem applied to Cy= 1$(zi)/ ,,h, with asymptotic variance given by the variance of I/I(Z).An estimator satisfying this equation is referred to as asymptotically lineur. The function II/(z) is referred to as the influence function, motivated by the fact that it gives the effect of a single “It is sufficient that the estimator be in the “relative interior” of 0, allowing for equality restrictions to be imposed on 0, such as 0 = r(g) for smooth ~b) and the true )’ being in an open ball. The first-order condition does rule out inequality restrictions that are asymptotically binding.
Ch. 36: Lurge Sumplr Estimation and Hypothesis
2143
Testing
observation on the estimator, up to the o,(l) remainder term. This description is useful because all the information about the asymptotic variance is summarized in the influence function. Also, the influence function is important in determining the robustness properties of the estimator; e.g. see Huber (1964). The MLE is an example of an asymptotically linear estimator, with influence function $(z) = - H ‘V, In ,f(z IO,). In this example the remainder term is, for the mean value a, - [(n ‘C;= 1V,,,,In f(zi 1g))- ’ - H - ‘In- li2Cr= ,V, In f(zil e,), which converges in probability to zero because the inverse Hessian converges in probability to H and the $I times the average score converges in distribution. Each of NLS and GMM is also asymptotically linear, with influence functions that will be described below. In general the CMD estimator need not be asymptotically linear, because its asymptotic properties depend only on the reduced form estimator fi. However, if the reduced form estimator 72is asymptotically linear the CMD will also be. The idea of approximating an estimator by a sample average and applying the central limit theorem can be used to state rigorous asymptotic normality results for extremum estimators. In Section 3.1 precise results are given for cases where the objective function is “sufficiently smooth”, allowing a Taylor expansion like that of eq. (3.1). Asymptotic normality for nonsmooth objective functions is discussed in Section 7.
3.1.
The husic results
For asymptotic normality, two basic results are useful, one for an extremum estimator and one for a minimum distance estimator. The relationship between these results will be discussed below. The first theorem is for an extremum estimator. Theorem
3.1
Suppose
that 8 satisfies eq. (l.l),
@A O,, and (i) o,Einterior(O);
(ii) o,(e) is twice
continuously differentiable in a neighborhood Jf of Be; (iii) &V,&,(0,,) % N(0, Z); (iv) there is H(Q) that is continuous at 8, and supBEN IIV,,&(@ - H(d)11 30; (v) H = H(H,) is nonsingular.
Then J&(8 - 0,) % N(0, H
l,?ZH- ‘).
Proqf A sketch of a proof is given here, with full details described in Section 3.5. Conditions (i)-(iii) imply that V,&(8) = 0 with probability approaching one. Expanding around B0 and solving for ,,&(8 - 0,) = - I?(e)- ’ $V,&(0,),
where E?(B) = V,,&(0)
and f?is a mean value, located between Band 8,. By ep. Be and (iv), with probability approaching - one, I/fi(q - H /I< /IE?(g) - H(g) II + )IH(g) - H II d supBEell fi(O) H(B) /I + /IH(0) - H/I 3 0. Then by continuity of matrix inversion, - f?(g)- l 3 -H-l. The conclusion then follows by the Slutzky theorem. Q.E.D.
2144
W.K. Newey and D. McFuddun
The asymptotic variance matrix in the conclusion of this result has a complicated form, being equal to the product H -'EH- '.In the case of maximum likelihood matrix, because of the this form simplifies to J- ‘, the inverse of the information information matrix equality. An analogous simplification occurs for some other estimators, such as NLS where Var(ylx) is constant (i.e. under homoskedasticity). As further discussed in Section 5, a simplified asymptotic variance matrix is a feature of an efficient estimator in some class. The true parameter being interior to the parameter set, condition (i), is essential to asymptotic normality. If 0 imposes inequality restrictions on 0 that are asymptotically binding, then the estimator may not be asymptotically normal. For example, consider estimation of the mean of a normal distribution that is constrained to be nonnegative, i.e. f(z 1H) = (271~~)- ’ exp [ - (z - ~)~/20~], 8 = (p, 02), and 0 = [0, co) x (0, acj). It is straightforward to check that the MLE of ~1 is ii = Z,Z > 0, fi = 0 otherwise. If PO = 0, violating condition (ii), then Prob(P = 0) = i and Jnfi is N(O,o’) conditional on fi > 0. Therefore, for every n (and hence also asymptotically), the distribution of &(flpO) is a mixture of a spike at zero with probability i and the positive half normal distribution. Thus, the conclusion of Theorem 3.1 is not true. This example illustrates that asymptotic normality can fail when the maximum occurs on the boundary. The general theory for the boundary case is quite complicated, and an account will not be given in this chapter. Condition (ii), on twice differentiability of Q,(s), can be considerably weakened without affecting the result. In particular, for GMM and CMD, asymptotic normality can easily be shown when the moment functions only have first derivatives. With considerably more work, it is possible to obtain asymptotic normality when Q,,(e) is not even once differentiable, as discussed in Section 7. Condition (iii) is analogous to asymptotic normality of the scores. It -11 often follow from a central limit theorem for the sample averages that make up V,Q,(0,). Condition (iv) is uniform convergence of the Hessian over a neighborhood of the true parameter and continuity of the limiting function. This same type of condition (on the objective function) is important for consistency of the estimator, and was discussed in Section 2. Consequently, the results of Section 2 can be applied to give primitive hypotheses for condition (iv). In particular, when the Hessian is a sample average, or depends on sample averages, Lemma 2.4 can be applied. If the average is continuous in the parameters, as will typically be implied by condition (iv), and a dominance condition is satisfied, then the conclusion of Lemma 2.4 will give uniform convergence. Using Lemma 2.4 in this way will be illustrated for MLE and GMM. Condition (v) can be interpreted as a strict local identification condition, because H = V,,Q,(H,) (under regularity conditions that allow interchange of the limiting and differentiation operations.) Thus, nonsingularity of H is the sufficient (secondorder) condition for there to be a unique local maximum at 0,. Furthermore, if V,,QO(0) is “regular”, in the sense of Rothenberg (1971) that it has constant rank in a neighborhood of 8,, then nonsingularity of H follows from Qa(0) having a unique
Ch. 36:
Large
Sample Estimation
and ffypothesis
2145
Testing
maximum at fIO.A local identification condition in these cases is that His nonsingular. As stated above, asymptotic normality of GMM and CMD can be shown under once differentiability, rather than twice differentiability. The following asymptotic normality result for general minimum distance estimators is useful for this purpose. Theorem
3.2
Suppose that H^satisfies eq. (1.1) for Q,(0) = - 4,(0)‘ii/g,,(e) where ii/ 3 W, W is and (i) .Q,Einterior(O); (ii) g,(e) is continuously positive semi-definite, @Lo,, differentiable in a neighborhood JV’ of 8,; (iii) $9,(8,) 5 N(O,n); (iv) there is G(8) that is continuous at 0, and supBE y /(V&,,(e) - G(U) II A 0; (v) for G = G(e,), G’ WC is nonsingular.
Then $(8-
0,) bI[O,(G’WG)-‘G’Wf2WG(G’WG)-‘1.
The argument is similar to the proof of Theorem 3.1. By (i) and (ii), with probability approaching one the first-order conditions G(@t@@,($ = 0 are satisfied, for G(0) = V&,,(0). Expanding
d,(8) around
I?%@)] - 1G^(@I&“$,(&,),
B0 and
solving
gives Jn(e^-
e,,) = - [G(@ x
w h ere t?is a mean value. By (iv) and similar reasoning
as
for Theorem 3.1, G(8) A G and G(g) A G. Then by(v), - [G(@‘@‘G(@]-16(e),%~ - (G’WG)- 'G'W, so the conclusion follows by (iii) and the Slutzky theorem. Q.E.D. When W = Q - ‘, the asymptotic variance of a minimum distance estimator simplifies to (G’Q - ‘G)) ‘. As is discussed in Section 5, the value W = L2 _ ’ corresponds to an efficient weighting matrix, so as for the MLE the simpler asymptotic variance matrix is associated with an efficient estimator. Conditions (i)-(v) of Theorem 3.2 are analogous to the corresponding conditions of Theorem 3.1, and most of the discussion given there also applies in the minimum distance case. In particular, the differentiability condition for g,(e) can be weakened, as discussed in Section 7. For analyzing asymptotic normality, extremum estimators can be thought of as a special case of minimum distance estimators, with V&,(e) = d,(0) and t?f = I = W. The_ first-order conditions for extremum estimators imply that o,(tI)‘@g,(fI) = V,Q,(0)‘V,Q,(@ has a minimum (of zero) at 0 = 8. Then the G and n of Theorem 3.2 are the H and Z of Theorem 3.1, respectively, and the asymptotic variance of the extremum estimator is that of the minimum distance estimator, with (G’WG)-’ x G’Wf2WG(G’WG)p1 =(H’H)-‘H’L’H(H’H)m’ = H-‘ZHpl. Thus, minimum distance estimation provides a general framework for analyzing asymptotic normality, although, as previously discussed, it is better to work directly with the maximum, rather than the first-order conditions, when analyzing consistency.28 18This generality suggests that Theorem 3.1 could be formulated as a special case of Theorem 3.2. The results are not organLed in this way because it seems easier to apply Theorem 3.1 directly to particular extremum estimators.
W.K. Newey und D. McFadden
2146
3.2.
Asymptotic
normality
jbr MLE
The conditions for asymptotic to give a result for MLE. Theorem
normality
of an extremum
estimator
can be specialized
3.3
Suppose that zl,. . . , z, are i.i.d., the hypotheses of Theorem 2.5 are satisfied and (i) d,Einterior(O); (ii) f(zl0) is twice continuously differentiable and f(zl0) > 0 in a neighborhood ,X of 8,; (iii) {suP~~,~- 11 V,f(zl B) //dz < co, jsupe._, IIV,,f(zl@ I)dz < m;; VBHx (iv) J = ECVBln f(z I 4,) PO In f(z I 6Ji’l exists and is nonsingular; (v) E[suP~~_,~ 11 lnf(z~8)~l] 0. Since this conclusion is true for any CI# 0, J must be nonsingular. Example
1.2 continued
Existence and nonsingularity of E[xx’] are sufficient for asymptotic normality of the probit MLE. Consistency of 8 was shown in Section 2.6, so that only conditions (i)-(v) of Theorem 3.3 are needed (as noted following Theorem 3.3). Condition (i) holds because 0 = Rq is an open set. Condition (ii) holds by inspection of f’(z 10) = y@(x’O) + (1 - y)@( - x’(9). For condition (iii), it is well known that 4(u) and 4”(u) are uniformly bounded, implying V&z /0) = (1 - 2y)4(x’H)x and V,,f(z 10)= (1 - 2y) x ~,(x’@xx’ are bounded by C( 1 + I/x 11 2, for some constant C. Also, integration over dz is the sum over y and the expectation over x {i.e. ja(y, x)dz = E[a(O, x) + a( 1, x)] }, so that i( 1 + 11 x I/2)dz = 2 + 2E[ //x 11’1< GC. For (iv), it can be shown that J = E[i.(x’0&( - x’d,)xx’], for j(u) = ~(U)/@(U). Existence of J follows by E.(u)i.(- ~1) bounded, and nonsingularity by %(u)A(- u) bounded away from zero on any open interval.29 Condition (v) follows from V,, In ,f’(z IQ,,)= [&.(x’B,)y + &,( - x’tI,)( 1 - y)]xx’ 291t can be shown that Z(u)i.( - a) is bounded using l’H8pital’s rule. Also, for any Ir>O, J 2 E[l(lx’H,I < fi)i(x’fI,)n( -x’tI,)xx’] 2 CE[ l(lx’O,I < C)x.x’] in the positive semi-definite sense, the last term is positive definite for large enough V by nonsingularity of E[xx’].
W.K. Newey and D. McFuddm
2148
and boundedness of I_,(u). This example illustrates how conditions on existence of moments may be useful regularity conditions for consistency and asymptotic normality of an MLE, and how detailed work may be needed to check the conditions.
3.3.
Asymptotic
normulity for GMM
The conditions on asymptotic normality specialized to give a result for GMM. Theorem
of minimum
distance
estimators
can be
3.4
Suppose that the hypotheses ofTheorem 2.6 are satisfied, r;i/ A W, and (i) 0,Einterior of 0; (ii) g(z,O) is continuously differentiable in a neighborhood _t‘ of 0,, with probability approaching one; (iii) E[g(z, fl,)] = 0 and E[ I/g(z, 0,) I/‘1 is finite; (iv) E[su~,,~ Ij V&z, 0) 111< co;(v) G’WG is nonsingular for G = E[V,g(z, fl,)]. Then for 0 = ECg(z, @,Jg(z, Hd’l,$(@
- 0,) ~N[O,(G’WG)G’WBWG(G’WG)~‘].
Proof
The proof will be sketched, although a complete proof like that of Theorem 3.1 given in Section 3.5 could be given. By (i), (ii), and (iii), the first-order condition 2G,,(@%~,(8) = 0 is satisfied with probability approaching one, for G,(e) = V&,,(0). Expanding
J,,(g) around
fI,, multiplying
through
by $,
and solving gives (3.5)
where 0 is the mean [G,(~))‘~~,(8)]-‘~,(~))‘ii/ Slutzky theorem.
value.
By (iv), G,,(8) LG and G,(g) 3 G, so that by (v), The conclusion then follows by the Q.E.D.
~(G’WG)~‘G’W.
asymptotic variance formula simplifies to (G’R ‘G)- ’ when W = in Hansen (1982) and further discussed in Section 5, this value for W is optimal in the sense that it minimizes the asymptotic variance matrix of the GMM estimator. The hypotheses of Theorem 2.6 are only used to make sure that I!?L BO, so that they can be replaced by any other conditions that imply consistency. For example, the conditions that 8, is identified, g(z, 0) is linear in 8, and E[ /Ig(z, II) 111< cc for all 8 can be used as replacements for Theorem 2.6, because Theorem 2.7 then gives 830,. More generally, a GMM estimator will be asymptotically normal if it is consistent and the other conditions (i))(v) of Theorem 3.4 are satisfied. The complicated
R- ‘. As shown
2149
Ch. 36: Large Sample Estimation and Hypothesis Testing
It is straightforward
to derive
a corresponding
result
for classical
minimum
distance, under the conditions that 6 is consistent, &[72 - h(e,)] L N(0, fl) for some R, h(8) is continuously differentiable in a neighborhood of Be, and G’WG is nonsingular for G = V&(0,). The statement of a theorem is left as an exercise for the interested reader. The resulting asymptotic variance for CMD will have the same form as given in the conclusion of Theorem 3.4. By expanding the GMM first-order conditions, as in eq. (3.5), it is straightforward to show that GMM is asymptotically linear with influence function $(z) = - (G’ WC) - ‘G’ Wg(z, 0,).
(3.6)
In general CMD need not be asymptotically linear, but will be if the reduced form estimator 72 is asymptotically linear. Expanding the first-order conditions for 6 around
the truth gives $(e^-
0,) = - (G’WG)-‘6’6’&(72
G = V,@(8), and @is the mean value. Then &(fi and(~‘~G)-‘~‘ii/‘~(G’WG)-‘G’W. W&(72
- x0), where G = V&(8),
- rra) converging
implies that &(8-
in distribution
0,) = - (G’WG)-‘G’
x
- TC,J+ o,(l). Therefore,
ll/“(z), the CMD estimator t&z) = - (G’WG)-
if 72is asymptotically linear with influence function will also be asymptotically linear with influence function
‘G’W$“(z).
The Hansen-Singleton example provides of Theorem 3.4 can be verified.
(3.7) a useful illustration
of how the conditions
Example 1.3 continued
It was shown
in Section 2 that
sufficient conditions for consistency are that solution at 0eE 0 = [Be, /3,]x[yl, y,], and that E[llx(l] E) d E[A.(z)]/c -+ 0 for all E>O, giving n-‘Cy= IA,, 3 0. By Khintchine’s law of large numbers, n- ‘XI= 1u x
Ch. 36: Large Sample Estimation and Hypothesis (zi, fI,) %
E[a(z,
O,,)].
Also, with pro_bability
np’Cr=
la(Zi,8,)li &,)‘I- W,,(x, 4,) CY - hk 6,) } I}
= 2ECk9b,4$,(x, &)‘I, where h, denotes
the gradient,
h,, the Hessian
of h(x, O), and the second
equality
W.K. Newry and D. McFadden
2160
follows by the law of iterated expectations. Therefore, H can be estimated by s = 2n- ‘C:= ,h,(xi, @h&xi, @, which is convenient because it only depends on first derivatives, rather than first and second derivatives. Under homoskedasticity the matrix Z also simplifies, to 4f~~E[h,(x, 8,)h,(x, e,)‘] for cr2 F E[ { y - h(x, U,)}2], which can be estimated by 2e2H for e2 = nP ‘Cy= i { y - h(xi, d)}“. Combining this estimator of Z with the one for H gives an asymptotic variance estimator of the form ? = fi“TfiP1 = 262fim ‘. Consistency of this estimator can be shown by applying the conditions of Lemma 4.3 to both u(z, 6) = {y - h(x, 19))’ and a(z, 8) = h,(x, @h&x, e)‘, which is left as an exercise. If there is heteroskedasticity then the variance of y does not factor out of Z, so that one must use the estimator z= 4n-‘Cr, ,h,(xi, @h&xi, @‘{ yi - h(xi, 8)}2. Also, if the conditional expectation is misspecified, then second derivatives of the regression function do not disappea_r from the Hessian (except in the linear case), so that one must use the estimator H = 2n- ‘x1= 1 [h&x,, @h&xi, i$ + h&xi, @{ yi - h(xi, @}I. A variance estimator for NLS that is consistent in spite of heteroskedasticity or misspecification is fi-‘&-‘, as discussed in White (1982b). One could formulate consistency conditions for this estimator by applying Lemma 4.3. The details are left as an exercise.
4.3.
Asymptotic
vuriance estimation,for
GMM
The asymptotic variance of a GMM estimator is (G’WG))‘G’~~l&‘G(G’~G)-‘, which can be estimated by substituting estimators for each of G, W and 0. As p_reviously discussed,_estima_tors of G and Ware readily available, and are given by G = n- ‘x1= 1VOy(zi, e) and W, where k@is the original weighting matrix. To estimate R = E[g(z, H&z, 0,)‘], one can replace the population moment by a sample average and the true parameter by an estimator, to form fi = n- ’ Cy= r g(zi, @)g(z,, I!?)‘,as in eq. (4.2). The estimator of the asymptotic variance is then given by e = (G’I%‘G)-’ x G,I@r2 I?G@l?G,_l. Consistencyof Sz will follow from Lemma 4.3 with a(z, 8) = g(z, B)g(z, 0)‘, so that consistency of F’will hold under the conditions of Theorem 4.2, as applied to GMM. A result that summarizes these conditions is the following one: Theorem
4.5
If the hypotheses of Theorem 3.4 are satisfied, g(z,@ is continuous at B0 with probability_ one,a_nd_for ^a neighborhood JV of 8,, E[su~~,~ I/ g(z, 0) 11 2] < co, then ?. . ^ V=(&$‘G)-‘G’WRWG(G’WG)-
’ -(’
G’WG)-‘G’WRWG(G’WG)-‘.
Proof
By Lemma 4.3 applied to a(z, 0) = g(z, H)g(z, 6)‘, fiL a. Also, the proof of Theorem 3.4 shows that the hypotheses of Theorem 3.2 are satisfied, so the conclusion follows by Theorem 4.2. Q.E.D.
Ch. 36: Large Sample Estimation
and Hypothesis
Testing
2161
If @‘is a consistent estimator of a-‘, i.e. the probability limit W of @is equal to n-l, then a simpler estimator of the asymptotic variance can be formed as p = (@k&l. Alternatively, one could form &as in eq. (4.2) and use v = ((?fi-‘& ‘. Little seems to be known about the relative merits of these two procedures in small samples, i.e. which (if either) of the initial I%’or the final d-l gives more accurate or shorter confidence intervals. The asymptotic variance estimator c is very general, in that it does not require that the second moment matrix a= E[g(z,B,)g(z,8,)‘] be restricted in any way. Consequently, consistency of ? does not require substantive distributional restrictions other than E[g(z, Q,)] = 0.33 For example, in the context of least squares estimation, where y(z, 0) = x( y - x’d), l?f = I, and (? = - C;= 1x,xi/n, this GMM variance estimator is P = k’[n-‘C~= lxixi(yi - x$)‘]&‘, the Eicker (1967) and White (1980) heteroskedasticity consistent variance estimator. Furthermore, the GMM variance estimator includes many heteroskedasticity-robust IV variance estimators, as discussed in Hansen (1982). When there is more information about the model than just the moment restrictions, it may improve the asymptotic confidence interval approximation to try to use this information in estimation of the asymptotic variance. An example is least squares, where the usual estimator under homoskedasticity is n(Cr, 1xix:)- ‘C( yi - x@‘/ (n - K), where K is the dimension of x. It is well known that under homoskedasticity this estimator gives more accurate confidence intervals than the heteroskedasticity consistent one, e.g. leading to exact confidence intervals from the t-distribution under normality.
Example
1.3 continued
The nonlinear two-stage least squares estimator for the Hansen-Singleton example is a GMM estimator with g(z, 0) = x{bwyY - 1) and @= x1= 1x,xi/n, so that an asymptotic variance estimator can be formed by applying the general GMM formula to this case. Here an estimator of the variance of the moment functions can be formed as described above, with 8= n-‘~~=,x,x,{&viyf - l}‘. The Jacobian estimator is G^= n- ‘Cr= 1xi(wi yly^, Bwi In ( yi)yr). The corresponding asymptotic variance estimator then comes from the general GMM formula (~f~~)-‘~~~~~~(~f~~)~ ‘. Consistency of this estimator will follow under the conditions of Theorem 4.5. It was previously shown that all of these conditions are satisfied except the additional moment assumption stated in Theorem 4.5. For this assumption, it suffices that the upper and lower limits on y, namely yr and y,, satisfy E[~~x~/*~w~~(I~~*~’ + Iyl*‘“)] < co. This condition requires that slightly more moments exist than the previous conditions that were imposed.
331f this restriction is not satisfied, then a GMM estimator may still be asymptotically normal, but the asymptotic variance is much more complicated; see Maasoumi and Phillips (1982) for the instrumental variables case.
W.K. Newey and D. McFudden
2162
5.
Asymptotic
efficiency
Asymptotically normal estimators can be compared on the basis of their asymptotic variances, with one being asymptotically efficient relative to another if it has at least as small an asymptotic variance for all possible true parameter values. Asymptotic efficiency is desirable because an efficient estimator will be closer to the true parameter value in large samples; if o^is asymptotically efficient relative to 8 then for all constants K, Prob() e- O,I d K/&) > Prob( (8- 8,I < K/J%) for all n large enough. Efficiency is important in practice, because it results in smaller asymptotic confidence intervals, as discussed in the introduction. This section discusses general results on asymptotic efficiency within a class of estimators, and application of these results to important estimation environments, both old and new. In focusing on efficiency within a class of estimators, we follow much of the econometrics and statistics literature. 34 Also, this efficiency framework allows one to derive results on efficiency within classes of “limited information” estimators (such as single equation estimators in a simultaneous system), which are of interest because they are relatively insensitive to misspecification and easier to compute. An alternative approach to efficiency analysis, that also allows for limited information estimators, is through semiparametric efficiency bounds, e.g. see Newey (1990). The approach taken here, focusing on classes of estimators, is simpler and more directly linked to the rest of this chapter. Two of the most important and famous efficiency results are efficiency of maximum likelihood and the form of an optimal weighting matrix for minimum distance estimation. Other useful results are efficiency of heteroskedasticity-corrected generalized least squares in the class of weighted least squares estimators and two-stage least squares as an efficient instrumental variables estimator. All of these results share a common structure that is useful in understanding them and deriving new ones. To motivate this structure, and focus attention on the most important results, we first consider separately maximum likelihood and minimum distance estimation.
5.1.
Eficiency
of maximum
likelihood estimation
Efficiency of maximum likelihood is a central proposition of statistics that dates from the work of R.A. Fisher (1921). Although maximum likelihood is not efficient in the class of all asymptotically normal estimators, because of “superefficient” estimators, it is efficient in quite general classes of estimators.35 One such general class is the 341n particular, one of the precise results on efficiency of MLE is the HajekkLeCam representation theory, which shows efficiency in a class of reyular estimators. See, e.g. Newey (1990) for a discussion of regularity. 35The word “superefficient” refers to a certain type ofestimator, attributed to Hodges, that is used to show tha?there does not exist an efficient estimator in the class of all asymptotically normal estimators. Suppose 0 is asymptotically normal, and for some numb_er t( and 0 ^( p < i, suppose that 0 ha_s positive asympiotic variance when the trueparameter is rx. Let B = e if nalU - al > 1 and 0 = a if nPIO - a( < 1. Then 6’ is superefficient relative to 8, having the same asymptotic variance when the true parameter is not cxbut having a smaller asymptotic variance, of zero, when the true parameter is X.
Ch. 36: Large Sample Estimation and Hypothesis
2163
Testing
class of GMM estimators, which includes method of moments, least squares, instrumental variables, and other estimators. Because this class includes so many estimators of interest, efficiency in this class is a useful way of thinking about MLE efficiency. Asymptotic efficiency of MLE among GMM estimators is shown by comparing asymptotic variances. The asymptotic variance of the MLE is (E[ss’])-‘, where s = V, In f(zl0,) is the score, with the z and 8 arguments suppressed for notational convenience. The asymptotic variance of a GMM estimator can be written as m-%l)r ‘Jm~‘l(~b;l)l where m, = (E[Veg(z, (3,)])‘WV0g(z, 0,) and m = (E[V,g(z,8,)])‘Wg(z, 0,). At this point the relationship between the GMM and MLE variances is not clear. It turns out that a relationship can be derived from an interpretation of E[me] as the covariance of m with the score. To obtain this interpretation, consider the GMM moment condition jg(z, 19)f(z ItI) dz = 0. This condition is typically an identity over the parameter space that is necessary for consistency of a GMM estimator. If it did not hold at a parameter value, then the GMM estimator may not converge to the parameter at that point, and hence would not be consistent.36 Differentiating this identity, assuming differentiation under the integral is allowed, gives
s
0 = Vo s(z,W(zl@dzle=e, =
Cvodz,@lf(z I@dz + & ‘3CV,f(z I@I’ dz B=B” s
= ECVddz, 4Jl + %dz, &JVoInf(z IWI,
(5.1)
where the last equality follows by multiplying and dividing V, f(z IO,) by f(z IO,). This is the generalized information matrix equality, including the information matrix equality as a special case, where g(z, 0) = V,ln f(~l8).~’ It implies that E[m,] + E[ms’] = 0, i.e. that E[ms] = - E[ms’]. Then the difference of the GMM and MLE asymptotic variances can be written as (E[mJ-‘E[mm’](E[m~])-’
-(E[ss’])~’
= (E[ms’])-‘E[mm’](E[sm’])p’ = (E[ms’])-‘{E[mm’]
- (E[ss’])-’
- E[ms’](E[ss’])-‘E[sm’]}(E[sm’])-’
= (E[ms’])-‘E[UU’](E[sm’])-‘,
U = m - E[ms’] (E[ss’])-
1 s.
(5.2)
3hRecall that consistency means that the estimator converges in probability to the true parameter for all oossible true oarameter values. ‘;A similar eq’uality, used to derive the Cramer-Rao bound for the variance of unbiased estimators, is obtained by differentiating the identity 0 = JOdF,, where F’,, is the distribution of the data when 0 is the true parameter value.
W.K. Newey and D. McFadden
2164
Since E[UU’] is positive semi-definite, the difference of the respective variance matrices is also positive semi-definite, and hence the MLE is asymptotically efficient in the class of GMM estimators. To give a precise result it is necessary to specify regularity conditions for the generalized information matrix equality of eq. (5.1). Conditions can be formulated by imposing smoothness on the square root of the likelihood, f(zl@“‘, similar to the regularity conditions for MLE efficiency of LeCam (1956) and Hajek (1970). A precise result on efficiency of MLE in the class of GMM estimators can then be stated as: Theorem 5.1 If the conditions of Theorem O,, J is nonsingular, and for f(z I 0) dz and s ~upij~.~~ I/V&z (G’WG)- ‘G’WRWG(G’WG)
3.4 are satisfied,f(zl 0)1’2 is continuously differentiable at all 8 in a neighborhood JY of BO,JsuP~~,~ 11g(z, g) /I2 x Ir$1’2 11 2 dz are bounded and Jg(z, @f(z 10)dz = 0, then - JP1 is positive semi-definite.
The proof is postponed until Section 5.6. This result states that J-’ is a lower bound on the asymptotic variance of a GMM estimator. Asymptotic efficiency of MLE among GMM estimators then follows from Theorem 3.4, because the MLE will have J ’ for its asymptotic variance.38
5.2.
Optimal minimum distance estimation
The asymptotic variance of a minimum distance estimator depends on the limit W of the weighting matrix
[email protected] W = a-‘, the asymptotic variance of a minimum distance estimator is (G’R-‘G)-‘. It turns out that this estimator is efficient in the class of minimum distance estimators. To show this result, let Z be any random vector such that a= E[ZZ’], and let m = G’WZ and fi = G’K’Z. Then by G’ WC = E[mfi’]
and G’R- ‘G = E[riifi’],
(G’WG)~‘G’WL!WG(G’WG)~l-(G’~nlG)-’ = (G’WG)PIEIUU’](G’WG)-‘,
U = m - E[mfi’](E[tirii’])-‘6.
Since E[UU’] is positive semi-definite, the difference of the asymptotic positive semi-definite. This proves the following result:
(5.3) variances
is
38 It is possible to show this result under the weaker condition that f(zlO)“’ is mean-square differentiable, which allows for f(zlO) to not be continuously differentiable. This condition is further discussed in Section 5.5.
2165
Ch. 36: Large Sample Estimation and Hypothesis Testing
Theorem 5.2 If f2 is nonsingular, a minimum distance estimator with W = plim(@‘) = R-r asymptotically efficient in the class of minimum distance estimators.
is
This type of result is familiar from efficiency theory for CMD and GMM estimation. For example, in minimum chi-square estimation, where b(Q) = 72- $0) the efficient weighting matrix W is the inverse of the asymptotic variance of fi, a result given by Chiang (1956) and Ferguson (1958). For GMM, where Q(H)= x1=, g(Zi, d)/n, the efficient weighting matrix is the inverse of the variance of g(zi, fI,), a result derived by Hansen (1982). Each of these results is a special case of Theorem 5.2. Construction of an efficient minimum distance estimator is quite simple, because the weighting matrix affects the asymptotic distrib_ution only t_hrough its probability limit. All that is required is a consistent estimator R, for then W = fX ’ will converge in probability to rZP ‘. Since an estimator of R is needed for asymptotic variance estimation, very little additional effort is required to form an efficient weighting matrix. An efficient minimum distance estimator can then be constructed by minimizing d(O)‘& ‘g(O). Alternatively, the one-step estimator r?= &(@a‘c)) ’ x eh- ‘g(6) will also be efficient, because it is asymptotically equivalent to the fully iterated minimum distance estimator. The condition that W = fl- ’ is sufficient but not necessary for efficiency. A necessary and sufficient condition can be obtained by further examination of eq. (5.3). A minimum distance estimator will be efficient if and only if the random vector U is zero. This vector is the residual from a population regression of m on I+&and so will be zero if and only if m is a linear combination of fi, i.e. there is a constant matrix C such that G’WZ = CG’R-‘2. Since 2 has nonsingular variance matrix, this condition is the same as G’W = CG’O-‘. This is the necessary estimator.
5.3.
(5.4) and sufficient
condition
for efficiency of a minimum
distance
A general eficiency framework
The maximum likelihood and minimum distance efficiency results have a similar structure, as can be seen by comparing eqs. (5.2) and (5.3). This structure can be exploited to construct an eficiency framework that includes these and other important results, and is useful for finding efficient estimators. To describe this framework one needs notation for the asymptotic variance associated with an estimator. To this end, let r denote an “index” for the asymptotic variance of an estimator in some
W.K. Newey and D. McFadden
2166
class, where r is an element of some abstract set. A completely general form for z would be the sequence of functions of the data that is the sequence of estimators. However, since r is only needed to index the asymptotic variance, a simpler specification will often suffice. For example, in the class of minimum distance estimators with given g,(O), the asymptotic variance depends only on W = plim(I@, so that it suffices to specify that z = W. The framework considered here is one where there is a random vector Z such that for each r (corresponding to an estimator), there is D(z) and m(Z, z) with the asymptotic variance V(r) satisfying V(7) = D(z)_ l E[m(Z, T)rn(Z, T)‘]D(T)-
l’.
(5.5)
Note that the random vector Z is held fixed as t varies. The function m(Z, z) can often be interpreted as a score or moment function, and the matrix D(z) as a Jacobian matrix for the parameters. For example, the asymptotic variances of the class of GMM estimators satisfy this formula, with z being [g(z, 8&G, W], Z = z being a single observation, m(Z, r) = G’Wg(z, tl,), and D(r) = G’WG. Another example is minimum distance estimators, where Z is any random vector with mean zero and variance 0, z = W, m(Z, z) = G’WZ, and D(T) = G’ WC. In this framework, there is an interesting and useful characterization of an efficient estimator. Theorem 5.3 If Z satisfies D(z) = E[m(Z, z)m(Z, ?)‘I for all z then any estimator with variance L’(f) is efficient. Furthermore, suppose that for any ri, r2, and constant square matrices C,, C, such that C,D(z,) + C&r,) is nonsingular, there is z3 with (i) (linearity of the moment function set) m(Z,r,) = C,m(Z,z,) + C,m(Z,z,); (ii) (linearity of D) D(r,) = C,D(t,) + C,D(z,). If there is an efficient estimator with E[m(Z, z)m(Z, z)‘] nonsingular then there is an efficient estimator with index F such that D(z) = E[m(Z, z)m(Z, f)‘] for all z. Proof If r and S satisfy D(z) = E[m(Z,r)m(Z,?)‘] then the difference asymptotic variances satisfies, for m = m(Z, z) and 6 = m(Z, ?), V(7) -
V(f) =
(E[m~‘])~‘E[mm’](E[fim’])~’
= (E[mti’])-‘E[UU’](E[tim’])p CJ= m-E[mrii’](E[@iti’])-‘ti,
- (E[tid])~
of the respective
l
‘, (5.6)
so the first conclusion follows by E[UU’] positive semi-definite. To show the second conclusion, let ll/(Z, t) = D(T)- ‘m(Z, T), so that V(7) = E[$(Z, z)$(Z, s)‘]. Consider
Ch. 36: Large Sample Estimation
and Hypothesis
2167
Testing
any constant matrix B, and for 7r and T* let C, = BD(7,))’ and C, = (I - B)D(T~)-’ note that C,D(z,) + C,D(z,) = I is nonsingular, so by (i) and (ii) there is 73 such that Bl+b(Z,7,)+(Z-B)II/(Z,7,)
=
c,m(z,7,)+C,m(Z,7,)=m(Z,7,)
=
I-‘m(Z,z,)
=
[C,D(t,) + C,D(t,)]‘m(Z, z3) = D(z,)- 'm(Z, 7j) = I/I(Z, 73). Thus, the set ($(Z, 7)} is affine, in the sense that B$(Z, tl) + (I - B)$(Z, z2) is in this set for any 71, z2 and constant matrix B. Let $(Z,?) correspond to an efficient estimator. Suppose that there is 7 with E[($ - $)$‘I # 0 for $ = $(Z, 7) and & = $(Z, 5). Then $ - 6 # 0, so there exists a constant matrix F such that e = F($ - $) has nonsingular variance andE[e~]#O.LetB=-E[~e’](E[ee’])-’Fandu=~+B(~-~)=(Z-B)~+B~. By the affine property of {rj(Z,z)} there is z”such that k’(f) = E[uu’] = E[$$‘] E[$e’](E[ee’])-‘E[e$‘] = V(T) - E[$e’](E[ee’])-‘E[e$‘], which is smaller than V(S) in the positive semi-definite sense. This conclusion contradicts the assumed - -, efficiency of Z, so that the assumption that E[($ - $)tj ] # 0 contradicts efficiency. Thus, it follows that E[($ - $)I+?‘]= 0 for all 7, i.e. that for all 7,
D(t)-
‘E[m(Z,r)m(Z,f)‘]D(?)-”
= D(t)-'E[m(Z,~)m(Z,~)']D(~)-
“.
(5.7)
By the assumed nonsingularity of E[m(Z, T)m(Z, Z)‘], this equation can be solved for D(7) to give D(7) = E[m(Z, z)m(Z, T)‘](E[m(Z, f)m(Z, 2)‘])- ‘D(f). Since C = D(f)‘(E[m(Z, f)m(Z, ?)‘I)- ’ is a nonsingular matrix it follows by (i) and (ii) that there exists ? with m(Z, ?) = Cm(Z, Y). Furthermore, by linearity of D(7) it follows that V(?)= V(Z), so that the estimator corresponding to z”is efficient. The second conQ.E.D. clusion then follows from D(7) = E[m(Z, z)m(Z, S)‘] for all 7. This result states that D(7) =
E[m(Z, t)m(Z,
Z)‘],
for all 7,
(5.8)
is sufficient for Z to correspond to an efficient estimator and is necessary for some efficient estimator if the set of moment functions is linear and the Jacobian is a linear function of the scores. This equality is a generalization of the information matrix equality. Hansen (1985a) formulated and used this condition to derive efficient instrumental variables estimators, and gave more primitive hypotheses for conditions (i) and (ii) of Theorem 5.3. Also, the framework here is a modified version of that of Bates and White (1992) for general classes of estimators. The sufficiency part of Theorem 5.3 appears in both of these papers. The necessity part of Theorem 5.3 appears to be new, but is closely related to R.A. Fisher’s (1925) necessary condition for an efficient statistic, as further discussed below. One interpretation of eq. (5.8) is that the asymptotic covariance between an efficient estimator and any other estimator is the variance of the efficient estimator. This characterization of an efficient estimator was discussed in R.A. Fisher (1925),
W.K.NeweyandD.McFaddvn
2168
and is useful in constructing Hausman (1978) specification tests. It is derived by assuming that the asymptotic covariance between two estimators in the class takes as can usually be verified by “stacking” theform D(r,)~'E[m(Z,z,)m(Z,s,)']D(z,)-", the two estimators and deriving theirjoint asymptotic variance (and hence asymptotic covariance). For example, consider two different GMM estimators 8, and g2, with two different moment functions g,(z, 6) and g2(z, @, and r = q for simplicity. The vector y*= (@,, &)’ can be considered a joint GMM estimator with moment vector g(z, y) = [gr(z, H,)‘, gz(z, @,)‘I’. The Jacobian matrix of the stacked moment vector will be block diagonal, and hence so will its inverse, so that the asymptotic covariance between 6, and 6, will be {E[V,g,(z, e,)]} _ ‘E[g,(z, d0)g2(z, O,)‘] x {ECV,g,(z, &,)I) - l’. Th’ISISexactly of the form D(T,)- ‘E[m(Z, tl)m(Z, TJ']O(T~)-I', where Z = z, m(Z, TV)= g,(z,O,),etc. When the covariance takes this form, the covariance between any estimator and one satisfying eq. (5.8) will be D(T)-' x E[m(Z,z)m(Z,~)l]D(~)-“=I~D(~)~“=D(~)-’E[m(Z,t)m(Z,~)‘]D(~)-” = V(t), the variance of the efficient estimator. R.A. Fisher (1925) showed that this covariance condition is sufficient for efficiency, and that it is also necessary if the class of statistics is linear, in a certain sense. The role of conditions (i) and (ii) is to guarantee that R.A. Fisher’s (1925) linearity condition is satisfied. Another interpretation ofeq. (5.8) is that the variance of any estimator in the class can be written as the sum of the efficient variance and the variance of a “noise term”. to Let u(Z)= D(T)-'m(Z,T)-D(f)-'m(Z,f), and note that U(Z) is orthogonal D(5)_ ‘m(Z, Z) by eq. (5.8). Thus, V(T)= V(Z)+ E[CI(Z)U(Z)‘]. This interpretation is a second-moment version of the Hajek and LeCam efficiency results.
5.4.
Solving fir the smallest asymptotic variance
The characterization of an efficient estimator given in Theorem 5.3 is very useful for finding efficient estimators. Equation (5.8) can often be used to solve for Z, by following two steps: (1) specify the class of estimators so that conditions (i) and (ii) of Theorem 5.3 are satisfied, i.e. so the set of moment functions is linear and the Jacobian D is linear in the moment functions; (2) look for Z such that D(T) = E[m(Z, s)m(Z, Z)‘]. The importance of step (1) is that the linearity conditions guarantee that a solution to eq. (5.8) exists when there is an efficient estimator [with the variance of m(Z, t) nonsingular], so that the effort of solving eq. (5.8) will not be in vain. Although for some classes of estimators the linearity conditions are not met, it often seems to be possible to enlarge the class of estimators so that the linearity conditions are met without affecting the efficient estimator. An example is weighted least squares estimation, as further discussed below. Using eq. (5.8) to solve for an efficient estimator can be illustrated with several examples, both old and new. Consider first minimum distance estimators. The asymptotic variance has the form given in eq. (5.5) for the score G’WZ and the Jacobian term G’ WC. The equation for the efficient W is then 0 = G’ WC - G’Wf26’G =
Ch. 36: Large Sample Estimation and Hypothesis Testing
2169
G’W(I - flW)G, which holds if fll?f= I, i.e. w = R- ‘. Thus, in this example one can solve directly for the optimal weight matrix. Another example is provided by the problem of deriving the efficient instruments for a nonlinear instrumental variables estimator. Let p(z, (3)denote an s x 1 residual vector, and suppose that there is a vector of variables x such that a conditional moment restriction,
ma
&I)Ixl = 0,
(5.9)
is satisfied. Here p(z, 0) can be thought of as a vector of residuals and x as a vector of instrumental variables. A simple example is a nonlinear regression model y = ,f(x, (3,) + E, @&lx] = 0, where the residual p(z, 0) = y - f(x, 0) will satisfy the conditional moment restriction in eq. (5.9) by E having conditional mean zero. Another familiar example is a single equation of a simultaneous equations system, where p(z, 0) = y - Y’8 and Y are the right-hand-side endogenous variables. An important class of estimators are instrumental variable, or GMM estimators, based on eq. (5.9). This conditional moment restriction implies the unconditional moment restriction that E[A(x)p(z, e,)] = 0 for any q x s matrix of functions A(x). Thus, a GMM estimator can be based on the moment functions g(z, 0) = A(x)p(z, 0). Noting that V&z, 0) = A(x)V,p(z, Q), it follows by Theorem 3.4 that the asymptotic variance of such a GMM estimator will be
WV = {~%4x)Vep(z> 441) ‘~C44dz, ~JP(z,&J’44’1 {~C44Ve~k441 > “2 (5.10) where no weighting matrix is present because g(z, Q) = A(x)p(z,B) has the same number of components as 0. This asymptotic variance satisfies eq. (5.5), where T = A(-) indexes the asymptotic variance. By choosing p(z, 0) and A(x) in certain ways, this class of asymptotic variances can be set up to include all weighted least squares estimators, all single equation instrumental variables estimators, or all system instrumental variables estimators. In particular, cases with more instrumental variables than parameters can be included by specifying A(x) to be a linear combination of all the instrumental variables, with linear combination coefficients given by the probability limit of corresponding sample values. For example, suppose the residual is a scalar p(z,@ = y- Y’B, and consider the 2SLS estimator with instrumental variables x. Its asymptotic variance has the form given in eq. (5.10) for A(x) = E[ Yx’](E[x~‘])~‘x. In this example, the probability limit of the linear combination coefficients is E[Yx’](E[xx’])-‘. For system instrumental variables estimators these coefficients could also depend on the residual variance, e.g. allowing for 3SLS. The asymptotic variance in eq. (5.10) satisfies eq. (5.5) for Z=z, D(r)= E[A(x) x V&z, Q,)], and m(Z, r) = A(x)p(Z, 0,). Furthermore, both m(Z, r) and D(r) are linear in A(x), so that conditions (i) and (ii) should be satisfied if the set of functions {A(x)}
W.K. Newey and D. McFadden
2170
is linear. To be specific, consider the class of all A(x) such that E[A(x)V&z, O,)] and E[ )1.4(x)11 2 /)p(z, 0,) I/2] exist. Then conditions (i) and (ii) are satisfied with TV= A3(*) = CIA,(.) _t C,A,(V).~~ Thus, by Theorem 5.3, if an efficient choice of instruments exist there will be one that solves eq. (5.8). To find such a solution, let G(x) = E[V,p(z, 0,)j x] and 0(x) = E[p(z, Qp(z, 0,)’ (xl, so that by iterated expectations eq. (5.8) is 0 = E[A(x)(G(x) - Q(x)A(x)‘}]. This equation will be satisfied if G(x) - Q(x),?(x)’ = 0, i.e. if
A(x) = G(x)'O(x)- ‘.
(5.11)
Consequently, this function minimizes the asymptotic variance. Also, the asymptotic variance is invariant to nonsingular linear transformations, so that A(x) = CG(x)‘n(x)-’ will also minimize the asymptotic variance for any nonsingular constant matrix C. This efficient instrument formula includes many important efficiency results as special cases. For example, for nonlinear weighted least squares it shows that the optimal weight is the inverse of the conditional variance of the residual: For G,(0) = - n- 1C;= 1w(xi)[ yi - h(xi, O)]“, the conclusion of Theorem 3.1 will give an asymptotic variance in eq. (5.10) with A(x) = w(x)h,(x, S,), and the efficient estimator has A(x) = {E[a2 1x] } - ‘h,(x, Q,), corresponding to weighting by the inverse of the conditional variance. This example also illustrates how efficiency in a class that does not satisfy assumptions (i) and (ii) of Theorem 5.3 (i.e. the linearity conditions), can be shown by enlarging the class: the set of scores (or moments) for weighted least squares estimators is not linear in the sense of assumption (i), but by also including variances for “instrumental variable” estimators, based on the moment conditions y(z, 19)= A(x)[y - h(x, tI)], one obtains a class that includes weighted least squares, satisfies linearity, and has an efficient member given by a weighted least squares estimator. Of course, in a simple example like this one it is not necessary to check linearity, but in using eq. (5.8) to derive new efficiency results, it is a good idea to set up the class of estimators so that the linearity hypothesis is satisfied, and hence some solution to eq. (5.8) exists (when there is an efficient estimator). Another example of optimal instrument variables is the well known result on efficiency of 2SLS in the class of instrumental variables estimators with possibly nonlinear instruments: If p(z, 0) = y - Y’O, E[ Yjx] = 17x, and c2 = E[p(z, B,J2 1x] is constant, then G(x) = - Ii’x and 0(x) = 02, and the 2SLS instruments are E[ Yx’](E[xx’])lx = 17x = - 02&x), a nonsingular linear combination of A(x). As noted above, for efficiency it suffices that the instruments are a nonsingular linear combination of A(x), implying efficiency of 2SLS. This general form A(x) for the optimal instruments has been previously derived in Chamberlain (1987), but here it serves to illustrate how eq. (5.8) can be used to “Existence of the asymptotic Cauchy-Schwartz inequalities.
variance
matrix
corresponding
to 53 follows
by the triangle
and
Ch. 36: Large Sample Estimation and Hypothesis Testing
2171
derive the form of an optimal estimator. In this example, an optimal choice of estimator follows immediately from the form of eq. (5.8) and there is no need to guess what form the optimal instruments might take.
5.5.
Feasible
efficient estimation
In general, an efficient estimator can depend on nuisance parameters or functions. For example, in minimum distance estimation the efficient weighting matrix is a nuisance parameter that is unknown. Often there is a nuisance function, i.e. an infinite-dimensional nuisance parameter, such as the optimal instruments discussed in Section 5.4. The true value of these nuisance parameters is generally unknown, so that it is not feasible to use the true value to construct an efficient estimator. One feasible approach to efficient estimation is to use estimates in place of true nuisance parameters, i.e. to “plug-in” consistent nuisance parameter estimates, in the construction of the estimator. For example, an approach to feasible, optimal weighted least squares estimator is to maximize - n-l x1= r G(xi)[yi - h(xi, 8)12, where a’(x) is an estimator of 1/E[.a2 1x]. This approach will give an efficient estimator, if the estimation of the nuisance parameters does not affect the asymptotic variance of 6. It has already been shown, in Section 5.2, that this approach works for minimum distance estimation, where it suffices for efficiency that the weight matrix converges in probability to R - ‘. More generally, a result developed in Section 6, on two-step estimators, suggests that estimation of the nuisance parameters should not affect efficiency. One can think of the “plug-in” approach to efficient estimation as a two-step estimator, where the first step is estimating the nuisance parameter or function, and the second is construction of &. According to a principle developed in the next section, the first-step estimation has no effect on the second-step estimator if consistency of the first-step estimator does not affect consistency of the second. This principle generally applies to efficient estimators, where nuisance parameter estimates that converge to wrong values do not affect consistency of the estimator of parameters of interest. For example, consistency of the weighted least squares estimator is not affected by the form of the weights (as long as they satisfy certain regularity conditions). Thus, results on two-step estimation suggest that the “plug-in” approach should usually yield an efficient estimator. The plug-in approach is often easy to implement when there are a finite number of nuisance parameters or when one is willing to assume that the nuisance function can be parametrized by a finite number of parameters. Finding a consistent estimator of the true nuisance parameters to be used in the estimator is often straightforward. A well known example is the efficient linear combination matrix Z7= E[Yx’](E[xx’])’ for an instrumental variables estimator, which is consistently estimated by the 2SLS coefficients fi= xy= r Yix:(Cy= ,x,x~)-‘. Another example is the optimal weight for nonlinear least squares. If the conditional variance is parametrized as a’(~, y), then
W.K. Newey and D. McFadden
2172
the true y can be consistently estimated from the nonlinear least squares regression of $ on aZ(xi, y), where Ei = yi - h(xi, I$ (i = 1,. . , n), are the residuals from a preliminary consistent estimator (7. Of course, regularity conditions are, useful for showing that estimation of the nuisance parameters does not affect the asymptotic variance of the estimator. To give a precise statement it is helpful to be more specific about the nature of the estimator. A quite general type of “plug-in” estimator is a GMM estimator that depends on preliminary estimates of some parameters. Let g(z, 19,y) denote a q x 1 vector of functions of the parameters of interest and nuisance parameters y, and let y*be a first-step estimator. Consider an estimator e that, with probability approaching one. solves
n-
l f
cJ(Zi,f&y*)= 0.
(5.12)
i=l
This class is quite general, because eq. (5.12) can often be interpreted as the firstorder conditions for an estimator. For example, it includes weighted least squares estimators with an estimated weight w(x,y*), for which eq. (5.12) is the first-order condition with g(z, 8, y) = w(x, y&(x, 0)[ y - h(x, 8)]. One type of estimator not included is CMD, but the main result of interest here is efficient choice of weighting matrix, as already discussed in Section 5.2. Suppose also that y*is a GMM estimator, satisfying n-l x1= i m(zi, y) = 0. If this equation is “stacked” with eq. (5.12), the pair (6, $) becomes a joint GMM estimator, so that regularity conditions for asymptotic efficiency can be obtained from the assumptions for Theorem 3.4. This result, and its application to more general types of two-step estimators, is described in Section 6. In particular, Theorem 6.1 can be applied to show that 6’from eq. (5.12) is efficient. If the hypotheses of that result are satisfied and G, = E[V,g(z, B,, yO)] = 0 then 8 will be asymptotically normal with asymptotic variance the same as if 7 = yO. As further discussed in Section 6, the condition G, = 0 is related to the requirement that consistency of ji not affect consistency of 8. As noted above, this condition is a useful one for determining whether the estimation of the nuisance parameters affects the asymptotic variance of the feasible estimator 6. To show how to analyze particular feasible estimators, it is useful to give an example. Linear regression with linear heteroskedusticity: Consider a linear model where &lx] = ~‘8, and C?(X) = Var( y Jx) = w’c(~for some w = w(x) that is a function of x. As noted above, the efficient estimator among those that solve n-‘Cy= i A(xi) x [ yi - x:(3] = 0 has A(x) = A(x) = (~‘a,))’ x. A feasible efficient estimator can be constructed by using a squared residual regression to form an estimator oi for Q, and plugging this estimator into the first-order conditions. More precisely, let p be the least squares estimator from a regression of y on x and & the least squares
Ch. 36: Large Sample Estimation and Hypothesis
2173
Testing
estimator from a regression of (y - x/j?)’ on w. Suppose that w’aO is bounded below and let r(u) be a positive function that is continuously differentiable with bounded derivative and z(u) = u for u greater than the lower bound on w’cx,,.~’ Consider 8 obtained from solving CT= i r(w’&)) ‘xi(yi - xi@ = 0. This estimator is a two-step GMM estimator like that given above with y = (cc’,fl’)‘, m(z, y) =
[( y -
x’P)x’,
{( y - x’py - w’cr}w’]‘,
g(z,8, y) = T(W’cI)- ‘x( y - de). It is straightforward to verify that the vector of moment functions [m(z, y)‘, g(z, 8, y)‘]’ satisfies the conditions of Theorem 6.1 if w is bounded, x and y have finite fourth moments, and E[xx’] and E[ww’] are nonsingular. Furthermore, E[V,g(z, do, yo)] = - E[~(w’a~)-~(y - x’~,)xw’] = 0, so that this feasible estimator will be efficient. In many cases the efficiency of a “plug-in” estimator may be adversely affected if the parametrization of the nuisance functions is incorrect. For example, if in a linear model, heteroskedasticity is specified as exponential, but the true conditional variance takes another form, then the weighted least squares estimator based on an exponential variance function will not be efficient. Consistency will generally not be affected, and there will be only a little loss in efficiency if the parametrization is approximately correct, but there could be big efficiency losses if the parametrized functional form is far from the true one. This potential problem with efficiency suggests that one might want to use nonparametric nuisance function estimators, that do not impose any restrictions on functional form. For the same reasons discussed above, one would expect that estimation of the nuisance function does not affect the limiting distribution, so that the resulting feasible estimators would be efficient. Examples of this type of approach are Stone (197.3 Bickel (1982), and Carroll (1982). These estimators are quite complicated, so an account is not given here, except to say that similar estimators are discussed in Section 8.
5.4.
Technicalities
It is possible to show the generalized information matrix equality in eq. (5.1) under a condition that allows for f(zl @‘I2 to not be continuously differentiable and g(z, 0) to not be continuous. For the root-density, this condition is “mean-square”differentiability at fIO with respect to integration over z, meaning that there is 6(z) with l /l&z) )I2dz < co such that J[f(zI @‘I2 - f(zl Qo)1’2- 6(z)‘(H- 0,)12 dz = o( II8 - ,!?,I)2, ““The T(U)function is a “trimming” device similar to those used in the semiparametric estimation literature. This specification requires knowing a lower bound on the conditional variance. It is also possible to allow T(U)to approach the identity for all u > 0 as the sample size grows, but this would complicate the analysis.
W.K. Newey and D. McFadden
2174
that as O-+9,. As shown in Bickel et al. (1992) it will suffice for this condition ,f(zl0) is continuously differentiable in 0 (for almost all z) and that J(0) = jV, In ,f(zlO) x and continuous in 8. Here 6(z) is the derivative {VOln~(zIfO>‘,!“(zI@d z ISnonsingular off(z I w2, so by V,f‘(z 141/2= +f(z l 0)1/2V0 In f(z Id), the expression for the information matrix in terms of 6(z) is J = 4J”6(z)&z) dz. A precise result on efficiency of MLE in the class of GMM estimators can then be stated as: Lemma 5.4 If(i) ,f(~l@r/~ is mean-square differentiable at 0, with derivative 6(z); (ii) E[g(z, Q)] is differentiable at 0, with derivative G; (iii) g(z, 0) is continuous at B,, with probability one; (iv) there is a neighborhood _N of 6, and a function d(z) such that IIg(z, 0) II d d(z) and Srl(z)‘f(z IO)dz is bounded for BEN; then lg(z, Q)f(z 10)dz is differentiable at B0 with derivative G + 2jg(z, Q,)~(z)f(~lQ,)‘~~ dz. Proof The proof is similar to that of Lemma 7.2 of Ibragimov and Has’minskii (1981). Let r(0) = f(z IQi”, g(e) = g(z, 0), 6 = 6(z), and A(B) = r(0) - r(Q,) - iY(d - Q,), suppressing the z argument for notational convenience. Also, let m(8,8) = ~g(@r(Q2dz and M = jg(b’,)&(ll,) dz. By (ii), m(0, d,) - m(B,, ~9,)- G(B - 0,) = o( Ij 0 - do I/). Also, by the triangle inequality, I/m(0,@ - m(B,, 0,) - (G + 2M)(0 - 0,) 11< I/m(e,e,) m(fl,, 0,) - G(8 - 0,) /I + 11 m(6,6) - m(@,d,) - 2M(6’ - 0,) 11,so that to show the conclusion it suffices to show IIm(d, 0) - m(d, 0,) - 2M(B - 0,) II = o( 110- B. /I). To show this, note by the triangle inequality,
IIde, 4 - MO,&4 - 2M(8 - 0,) I/ = d
IIs
g(d)[r(d)’
- r(0,J2] dz - 2~(0 - 0,)
!I
Cd4 - g(~o)lr(80)6’dz II8 - 8, II s(QCr@)+ r(bJl44 dz + IIS II IIS II +
Cs(e)r(e)-s(e,)r(e,)lsdz II~-~,ll=~,+~,ll~-8,l/+R,ll8-~,ll.
Therefore, it suffices to show that R, =o( /I0-0, II), R, -0, By (iv) and the triangle and Cauchy-Schwartz inequalities,
R, < { [ [g(Q2r(R,’
d.z]li2 + [ Ig(Q2r(&J2
Also, by (iii) and (iv) and the dominated
dzy
convergence
and R, +O as O+ fl,,.
“)[p(B)’
dz]12
theorem, E[ I(g(e) - g(0,) \I’] +O,
Ch. 36: Large Sample Estimation
and Hypothesis
2175
Testing
so by the Cauchy-Schwartz inequality, R, 0,
s
/Icd@It Ir(@ - 44,) I II6 II dz d
d
s
44 Ir@) - 4%) I II6 II dz
4.4 Ir(Q)- r&J I II6 IIdz + K Ir(Q)- 4%)l II6 II dz s d(z)>.4 s
I is l/2
0 and choose K so jdtz,a K ((6 ((2 dz < 3~. Then by the last term is less than +E for 0 close enough to 0,, implying that j 1)g(0) I/Ir(8) - r(Q,)l 116I/ dz < E for 0 close enough to (IO. The conclusion then follows by the triangle inequality. Q.E.D. Proof of Theorem
5.1
By condition (iv) of Theorem 3.4 and Lemma 3.5, g(z, e) is continuous on a neighborhood of 8, and E[g(z, 0)] is differentiable at B0 with derivative G = E[Vsy(z, (II,)]. Also, f(z I0)“’ is mean-square differentiable by the dominance condition in Theorem 5.1, as can be shown by the usual mean-value expansion argument. Also, by the conditions of Theorem 5.1, the derivative is equal to $1 [f(zj0,)>O]f(z(0,)-“2 x V,f(z) 0,) on a set of full measure, so that the derivative in the conclusion of Lemma 5.4 is G + %(z,
j442.fW)dz
WV0 ln
bounded,
f(zl&Jl. Also, IIdz, 0)II d 44 = su~~~.~- IIAZ, 0)II Has
so that u = g(z, 0,) + GJ- ‘V,in f(zIB,), (G’WG)- ‘G’B’~WG(G’WG)-
the conclusion
of Lemma
5.4 holds.
Then
for
1 - J-l
=(G’WG)-‘G’W(~uu’dz)WG(G’WG)-‘, so the conclusion
6.
follows by i UU’dz positive
semi-definite.
Q.E.D.
Two-step estimators
A two-step estimator is one that depends on some preliminary, “first-step” estimator of a parameter vector. They provide a useful illustration of how the previous results
W.K. Newey and D. McFadden
2116
can be applied, even to complicated estimators. In particular, it is shown in this section that two-step estimators can be fit into the GMM framework. Two-step estimators are also of interest in their own right. As discussed in Section 5, feasible efficient estimators often are two-step estimators, with the first step being the estimation of nuisance parameters that affect efficiency. Also, they provide a simpler alternative to complicated joint estimators. Examples of two-step estimators in econometrics are the Heckman (1976) sample selection estimator and the Barro (1977) estimator for linear models that depend on expectations and/or corresponding residuals. Their properties have been analyzed by Newey (1984) and Pagan (1984,1986), among others. An important question for two-step estimators is whether the estimation of the first step affects the asymptotic variance of the second, and if so, what effect does the first step have. Ignoring the first step can lead to inconsistent standard error estimates, and hence confidence intervals that are not even asymptotically valid. This section develops a simple condition for whether the first step affects the second, which is that an effect is present if and only if consistency of the first-step estimator affects consistency of the second-step estimator. This condition is useful because one can often see by inspection whether first-step inconsistency leads to the secondstep inconsistency. This section also describes conditions for ignoring the first step to lead to either an underestimate or an overestimate of the standard errors. When the variance of the second step is affected by the estimation in the first step, asymptotically valid standard errors for the second step require a correction for the first-step estimation. This section derives consistent standard error estimators by applying the general GMM formula. The results are illustrated by a sample selection model. The efficiency results of Section 5 can also be applied, to characterize efficient members of some class of two-step estimators. For brevity these results are given in Newey (1993) rather than here.
6.1.
Two-step
estimators
as joint GMM
estimators
The class of GMM estimators is sufficiently general to include two-step estimators where moment functions from the first step and the second step can be “stacked” to form a vector of moment conditions. Theorem 3.4 can then be applied to specify regularity conditions for asymptotic normality, and the conclusion of Theorem 3.4 will provide the asymptotic variance, which can then be analyzed to derive the results described above. Previous results can also be used to show consistency, which is an assumption for the asymptotic normality results, but to focus attention on the most interesting features of two-step estimators, consistency will just be assumed in this section.
Ch. 36: Large Sample Estimation and Hypothesis
Testing
A general type of estimator 8 that has as special cases most examples is one that, with probability approaching one, solves an equation
n-
’
i$l
dzi,
8,
2117
of interest
(6.1)
y*)= O,
where g(z,B,y) is a vector of functions with the same dimension as 0 and y*is a first-step estimator. This equation is exactly the same as eq. (5.12), but here the purpose is analyzing the asymptotic distribution of gin general rather than specifying regularity conditions for $ to have no effect. The estimator can be treated as part of a joint GMM estimator if y^also satisfies a moment condition of the form, with probability approaching one, n-l
i
m(z,,y)=O,
(6.2)
i=l
where m(z,y) is a vector with the same dimension as y. If g(z, 0,~) and m(z,r) are “stacked” to form J(z, 8, y) = [m(z, O)‘,g(z, 8, y)‘]‘, then eqs. (6.1) and (6.2) are simply the two components of the joint moment equation n-i C;= 1 g(zi, 8,y*) = 0.Thus, the two-step estimator from eq. (6.1) can be viewed as a GMM estimator. An interesting example of a two-step estimator that fits into this framework is Heckman’s (1976) sample selection estimator. Sample selection example: In this example the first step +$is a probit estimator with regressors x. The second step is least squares regression in the subsample where the probit-dependent variable is one, i.e. in the selected sample, with regressors given by w and i(x’y^) for n(o) = ~(U)/@(U). Let d be the probit-dependent variable, that is equal to either zero or one. This estimator is useful when y is only observed if d = 1, e.g. where y is wages and d is labor force participation. The idea is that joint normality of the regression y = w’/& + u and the probit equation leads to E[yl w, d = 1, x] = w’p,, + cc,A(x’y,), where a, is nonzero if the probit- and regression-dependent variables are not independent. Thus, %(x’cr,) can be thought of as an additional regressor that corrects for the endogenous subsample. This two-step estimator will satisfy eqs. (6.1) and (6.2) for
Y(4 8,Y)= d m(z,y)
=
[
A(&1 CY-w'B-~wr)l~
Il(x’y)a=-‘( -x’y)x[d-
@(x’y)],
(6.3)
where 8 = (/Y, a)‘. Then eq. (6.1) becomes the first-order condition for least squares on the selected sample and eq. (6.2) the first-order condition for probit.
W.K. Newley and D. McFadden
2178
Regularity conditions for asymptotic normality can be formulated by applying the asymptotic normality result for GMM, i.e. Theorem 3.4, to the stacked vector of moment conditions. Also, the conclusion of Theorem 3.4 and partitioned inversion can then be used to calculate the asymptotic variance of 8, as in the following result. Let
G, = ECV,g(z,‘&>YO)I> Y(Z)= dz, &, Yoh
G, = W,dZ> Q,>ro)l, M = ECV,mk ~o)l, Theorem
I,@) = - M
‘m(z, y,,).
(6.4)
6.1
Ifeqs. (6.1) and (6.2) are satisfied with probability approaching one, 8% 8,, y*3 ye, and g(z, 8, y) satisfies conditions (i)-(v) of Theorem 3.4, then 8 and 9 are asymptotically normal
and $(&
0,) 4
N(0, V) where
I/ = G; ‘EC {g(z) + G,$(z)}(g(z)
+
G,WJ’1G,“. Proof
By eqs. (6.1) and (6.2), with probability approaching one (8, y*)is a GMM estimator with moment function g”(z,_B, y) = [m(z,y)‘,g(z, e,y)‘]’ and I? equal to an identity the asymptotic variance of the estimator is matrix. By (~?‘1@‘6’= G-‘, (W‘Z(‘IE[#(z, do, y&(z, 8,, y,)‘]zz;(~~zz1)- l = CT-‘E[ij(z, 8,, y&(z, o,, yJ]G- l’. Also, the expected Jacobian matrix and its inverse are given by
(6.5) that the first row of G- ’ is G; ’ [I, - GYM - ‘1 and that [I, - G,M- ‘1 x variance of 8, which is the upper left block Q.E.D. of the joint variance matrix, follows by partitioned matrix multiplication. Noting
g(z, BO,yO) = g(z) + G&(z), the asymptotic
An alternative
approach
to deriving
the asymptotic
distribution
of two-step
esti-
mators is to work directly from eq. (6. l), expanding in 6’to solve for &(e^ - 6,) and then expanding the result around the true yO. To describe this approach, first note that 9 is an asymptotically linear estimator with influence function $(z) = - M- ‘m(zi, ye), where fi(y* - yO) = Cr= 1 $(zi)/$ + op(1). Then left-hand side of eq. (6.1) around B0 and solving gives:
Jj2(8-8,)= -
a-1 t
[ =-[& t i=l
i=l
Vog(z.
1)
@) ?
1
-l iFl n
1-l
V,g(z,,8,y^)
Stzi,
eO,Y”V&
expanding
the
2179
Ch. 36: Large Sample Estimation and Hypothesis Testing
x = -
ii$l
g(zi)l&
+
[,-l
i, i=l
vyCl(zi,
V,]\;;(9 - YOJ]
eO,
GB1 t {g(zi)+ Gyti(zJ}lJn + up, i=l
where (? and 7 are mean values and the y^and the mean values and the conclusion by applying the central limit theorem to One advantage of this approach is
third equality follows by convergence of of Lemma 2.4. The conclusion then follows the term following the last equality. that it only uses the influence function
representation &($ - ye) = x1= 1 tj(z,)/& + o,(l) for 9, and not the GMM formula in eq. (6.2). This generalization is useful when y*is not a GMM estimator. The GMM approach has been adopted here because it leads to straightforward primitive conditions, while an influence representation for y*is not a very primitive condition. Also the GMM approach can be generalized to allow y*to be a two-step, or even multistep, estimator by stacking moment conditions for estimators that affect 3 with the moment conditions for 0 and y.
6.2.
The efect
ofjrst-step
estimation
on second-step
standard
errors
One important feature of two-step estimators is that ignoring the first step in calculating standard errors can lead to inconsistent standard errors for the second step. The asymptotic variance for the estimator solving eq. (6.1) with y*= yO, i.e. the asymptotic variance ignoring the presence of y*in the first stage, is G; ’ E[g(z)g(z)‘]G; l’. In general, this matrix differs from the asymptotic variance given in the conclusion of Theorem 6.1, because it does not account for the presence of the first-step estimators. Ignoring the first step will be valid if G, = 0. Also, if G, # 0, then ignoring the first step will generally be invalid, leading to an incorrect asymptotic variance formula, because nonzero G, means that, except for unusual cases, E[g(z)g(z)‘] will not equal E[ (g(z) + G&(z)} {g(z) + G&(z)}‘]. Thus, the condition for estimation of the first step to have no effect on the second-step asymptotic variance is G, = 0. A nonzero G, can be interpreted as meaning that inconsistency in the first-step estimator leads to inconsistency in the second-step estimator. This interpretation is useful, because it gives a comparatively simple criterion for determining if first-stage estimation has to be accounted for. To derive this interpretation, consider the solution 8(y) to E[g(z, B(y), y)] = 0. Because 8 satisfies the sample version of this condition, B(y) should be the probability limit of the second-step estimator when J? converges to y (under appropriate regularity conditions, such as those of Section 2). Assuming differentiation inside the expectation is allowed, the implicit function theorem gives V$(y,)
= - G; ‘Gy.
(6.7)
W.K. Newey and D. McFadden
2180
By nonsingularity of G,, the necessary and sufficient condition for G, = 0 is that V,H(yJ = 0. Since H(y,) = H,, the condition that V,B(y,J = 0 is a local, first-order condition that inconsistency in y*does not affect consistency of 8. The following result adds regularity conditions for this first-order condition to be interpreted as a consistency condition. Theorem 6.2 Suppose that the conditions of Theorem 6.1 are satisfied and g(z, 0, y) satisfies the conditions of Lemma 2.4 for the parameter vector (H’,y’). If &A 8, even when ‘j-y # yO, then G, = 0. Also suppose that E[V,g(z, 8,, y)] has constant rank on a neighborhood of yO. If for any neighborhood of y0 there is y in that neighborhood such that 8 does not converge in probability to H, when $ L y, then G, # 0. Proof By Lemma 2.4, 8 3 8, and y*3 y imply that Cy= r g(zi, 8, y^)/n -% E[g(z, 8,, y)]. The sample moment conditions (6.1) thus imply E[g(z, BO,y)] = 0. Differentiating this identity with respect to y at y = y0 gives G, = 0.41 To show the second conclusion, let H(y) denote the limit of e when 9 L y. By the previous argument, E[g(z, 8(y), y)] = 0. Also, by the implicit function theorem 0(y) is continuous at yO, with @ye) = BO.By the conditions of Theorem 6.1, G&8, y) = E[V,g(z, 0, y)] is continuous in a neighborhood of B0 and yO, and so will be nonsingular on a small enough neighborhood by G, nonsingular. Consider a small enough convex neighborhood where this nonsingularity condition holds and E[V,g(z, 8,, y)] has constant rank. A mean-value expansion gives E[g(z, 8,, ?)I.= E[g(z, B(y), y)] + G,(& y)[e, - 0(y)] ~0. Another expansion then gives E[g(z, Be, y)] = E[V,g(z, O,, -$](y - y,,) # 0, implying E[V,g(z, do, v)] # 0, and hence G, # 0 (by the derivative having constant rank). Q.E.D. This results states that, under certain regularity conditions, the first-step estimator affects second-step standard errors, i.e. G, # 0, if and only if inconsistency in the first step leads to inconsistency in the second step. The sample selection estimator provides an example of how this criterion can be applied. Sample selection continued: The second-step estimator is a regression where some of the regressors depend on y. In general, including the wrong regressors leads to inconsistency, so that, by Theorem 6.2, the second-step standard errors will be affected by the first step. One special case where the estimator will still be consistent is if q, = 0, because including a regressor that does not belong does not affect consistency. Thus, by Theorem 6.2, no adjustment is needed (i.e. G, = 0) if c(~ = 0. This result is useful for constructing tests of whether these regressors belong, because 41Differentiation
inside the expectation
is allowed
by Lemma
3.6.
Ch. 36: Large Sample Estimation and Hypothesis
2181
Testing
it means that under the null hypothesis the test that ignores the first stage will have asymptotically correct size. These results can be confirmed by calculating
where n,(o) = di(v)/dv. a, = 0.
By inspection
this matrix is generally
nonzero,
but is zero if
This criterion can also be applied to subsets of the second-step coefficients. Let S denote a selection matrix such that SA is a matrix of rows of A, so that Se is a subvector of the second-step coefficients. Then the asymptotic variance of Se is SC, ’ E[ {g(z) + G&(z)} {g(z) + G,$(z)}‘]G; ‘S’, while the asymptotic variance that ignores the first step is SC; ‘E[g(z)g(z)‘]G; 1S’. The general condition for equality of these two matrices is 0= -
SC,' G, = SV,B(y,) = V,[SB(y,)],
where the second equality follows statement that asymptotic variance only if consistency of the first step could be made precise by modifying is not given here.
(6.8)
by eq. (6.7). This is a first-order version of the of Skis affected by the first-step estimator if and affects consistency of the second. This condition Theorem 6.2, but for simplicity this modification
Sample selection continued: As is well known, if the correct and incorrect regressors are independent of the other regressors then including the wrong regressor only affects consistency of the coefficient of the constant. Thus, the second-step standard errors of the coefficients of nonconstant variables in w will not be affected by the first-step estimation if w and x are independent. One can also derive conditions for the correct asymptotic variance to be larger or smaller than the one that ignores the first step. A condition for the correct asymptotic variance to be larger, given in Newey (1984), is that the first- and second-step moment conditions are uncorrelated, i.e.
Gdz, &, xJm(z,YJI = 0.
(6.9)
In this case E[g(z)$(z)‘]
= 0, so the correct
G; ‘~,WW$WlG;~, the one G; ‘E[g(z)g(z)‘]G;
“2which is larger, in the positive semi-definite
variance
” that ignores first-step
is G; ’ E[g(z)g(z)‘]G; I’ + sense, than estimation.
W.K. Newey and D. McFadden
2182
continued: In this example, E[y - w’fiO - cr,i(x’y,)l w, d = 1, x] = 0, which implies (6.9). Thus, the standard error formula that ignores the first-step estimation will understate the asymptotic standard error. Sump/e selection
A condition for the correct asymptotic variance to be smaller ignores the first step, given by Pierce (1982), is that
than the one that
(6.10)
m(z) = m(z, yO) = V, ln f(z I Q,, yd
In this case, the identities Sm(z, ~)f(zl O,, y) dz = 0 and lg(z, 0,, y)f(z Id,, y) dz = 0 can be differentiated to obtain the generalized information matrix equalities M = - E[s(z)s(z)‘] and G, = - E[g(z)s(z)‘]. It then follows that G, = - E[g(z)m(z)‘] = variance is - J%d4w’l I~c$wwl> - l> so that the correct asymptotic
G; 1~Cg(4g(4’lG;’ - G; 'ECgWWl { ~C+WW’l> - ‘-f%WsM’lG; “. This variance is smaller, in the positive semi-definite sense, than the one that ignores the first step. Equation (6.10) is a useful condition, because it implies that conservative asymptotic confidence intervals can be constructed by ignoring the first stage. Unfortunately, the cases where it is satisfied are somewhat rare. A necessary condition for eq. (6.10) is that the information matrix for Q and y be block diagonal, because eq. (6.10) implies that the asymptotic variance of y*is {E[m(z)m(z)‘]} - ‘, which is only obtainable when the information matrix is block diagonal. Consequently, if g(z, 8, y) were the score for 8, then G, = 0 by the information matrix equality, and hence estimation of 9 would have no effect on the second-stage variance. Thus, eq. (6.10) only leads to a lowering of the variance when g(z, 8, y) is not the score, i.e. 8 is not an efficient estimator. One case where eq. (6.10) holds is if there is a factorization of the likelihood f(z ItI, y) = fl(z IB)f,(z Iy) and y^is the MLE of y. In particular, if fi (z 10)is a conditional likelihood and f,(zl y) = fi(x 17) a marginal likelihood of variables x, i.e. x are ancillary to 8, then eq. (6.8) is satisfied when y*is an efficient estimator of yO.
6.3.
Consistent
asymptotic
variance
estimation
for two-step
estimators
The interpretation of a two-step estimator as a joint GMM estimator can be used to construct a consistent estimator of the asymptotic variance when G, # 0, by applying the general GMM formula. The Jacobian terms can be estimated by sample Jacobians, i.e. as
60~n-l t v,g(ziy8,9), Gy= 6’ i=l
The second-moment
t V,g(Z,,BJ),ii = n-l i V,m(z,,y*). i=l
matrix
can be estimated
i=l
by a sample second-moment
matrix
Ch. 36: Larye Sample Estimation
and Hypothesis
Testing
2183
di = y(zi, 8, y*)and Ai = m(z,, f), of the form fi= n- ‘x1= ,(& &i)‘(& &I). An estimator of the joint asymptotic variance of 8 and 7 is then given by
An estimator of the asymptotic variance of the second step 8 can be extracted from the upper left block of this matrix. A convenient expression, corresponding to that in Theorem 6.1, can be obtained by letting $i = - & l&z,, so that the upper left block of ? is
(6.11)
If the moment functions are uncorrelated as in eq. (6.9) so that the first-step estimation increases the second-step variance, then for ?? = n- ‘Cy= 1JitJ:y an asymptotic variance estimator for 8 is
(6.12) This estimator is quite convenient, because most of its pieces can be recovered from standard output of computer programs. The first of the two terms being summed is a variance estimate that ignores the first step, as often provided by computer output (possibly in a different form than here). An estimated variance FYis also often provided by standard output from the first step. In many cases 6;’ can also be recovered from the first step. Thus, often the only part of this variance estimator requiring application-specific calculation is eY. This simplification is only possible under eq. (6.9). If the first- and second-step moment conditions are correlated then one will need the individual observations Gi, in order to properly account for the covariance between the first- and second-step moments. A consistency result for these asymptotic variance estimators can be obtained by applying the results of Section 4 to these joint moment conditions. It will suffice to assume that the joint moment vector g(z, 0, y) = [m(z, y)‘, y(z, 0, r)‘]’ satisfies the conditions of Theorem 4.5. Because it is such a direct application of previous results a formal statement is not given here. In some cases it may be possible to simplify PO by using restrictions on the form of Jacobians and variance matrices that are implied by a model. The use of such restrictions in the general formula can be illustrated by deriving a consistent asymptotic variance estimator for the example.
W.K. Newey and D. McFadden
2184
Sumple selection example continued: Let Wi = di[wI, /z(xIyo)]’ and %i = di[wI, i(.$j)]‘. Note that by the residual having conditional mean zero given w, d = 1, and x, it is the case that G, = - E[diWiWJ and G, = - a,E[di~,,(xlyo)WiX11, where terms involving second derivatives have dropped out by the residual having conditional mean zero. Estimates of these matrices are given by ee = - x1= 1ki/iA~/~ and G, = -oily= II.,(x~j)ii/,x~/n. Applying eq. (6.12) to this case, for ii = yi - W#‘, 3i)‘, then gives
(6.13)
where pY is a probit estimator of the asymp_totic_variance of &(y - yO), e.g. as provided by a canned computer program, and 17~ G; ‘Gy is the matrix of coefficients from a multivariate regression of c?%,(x~y*)xi on Wi. This estimator is the sum of the White (1980) variance matrix for least squares and a correction term for the firststage estimation.42 It will be a consistent estimator of the asymptotic variance of JII@ - do).43
7.
Asymptotic
normality with nonsmooth objective functions
The previous asymptotic normality results for MLE and GMM require that the log-likelihood be twice differentiable and that the moment functions be once differentiable. There are many examples of estimators where these functions are not that smooth. These include Koenker and Bassett (1978), Powell’s (1984, 1986) censored least absolute deviations and symmetrically trimmed estimators, Newey and Powell’s (1987) asymmetric least squares estimator, and the simulated moment estimators, of Pakes (1986) and McFadden (1989). Therefore, it is important to have asymptotic normality results that allow for nonsmooth objective functions. Asymptotic normality results for nonsmooth functions were developed by Daniels (1961), Huber (1967), Pollard (1985), and Pakes and Pollard (1989). The basic insight of these papers is that smoothness of the objective function can be replaced by smoothness of the limit if certain remainder terms are small. This insight is useful because the limiting objective functions are often expectations that are smoother than their sample counterparts. 4*Contrary to a statement given in Amemiya (1985), the correction term is needed here. 43The normalization by the total sample size means that one can obtain asymptotic confidence intervals as described in Section 1, with the n given there equal to the total sample size. This procedure is equivalent to ignoring the n divisor in Section 1and dropping the n from the probit asymptotic variance estimator (as is usually done in canned programs) and from the lead term in eq. (6.13).
2185
Ch. 36: Large Sample Estimation and Hypothesis Testing
To illustrate how this approach works it is useful to give a heuristic The basic idea is the approximation
&@)- e^,&)r &e - &J + Qo(4 E&e
description.
Qo(4J
- (3,) + (0 - O,)H(B - 8,)/2, (7.1)
where 6, is a derivative, or approximate derivative, of Q,,(e) at B,,, H = V,,Q,(B,), and the second approximate equality uses the first-order condition V,QO(e,) = 0 in a second-order expansion of QO(0). This is an approximation of Q,(e) by a quadratic function. Assuming that the approximation error is of the right order, the maximum of the approximation should be close to the true maximum, and the maximum of the approxi_mation is 8 = B0 - H- ‘fi,,. This random yariable will be asymptotically normal if D, is, so that asymptotic normality of 0 will follow from asymptotic normality of its approximate value 8.
7.1.
The basic results
In order to make the previous argument precise the approximation error in eq. (7.1) has to be small enough. Indeed, the reason that eq. (7.1) is used, rather than some other expansion, is because it leads to approximation errors of just the right size. Suppose for discussion Purposes that 6,, = V&(6,), where the derivative exists with probability one. Then Q,(e) - Q,(e,) - 6;(0 - 0,) goes to zero faster than 118- doI/ does, by the definition of a derivative. Similarly, QO(e) - QO(O,) goes to zero faster than ((8 - 0, (([since V,Q,(B,) = 01. Also, assuming ded in probability for each 8, as would typically
that J%@,,(e) - Qo(@] is bounbe the case when Q,(e) is made
up of sample averages, and noting that $0, bounded in probability asymptotic normality, it follows that the remainder term,
k(e)
= JtrcOm
- O,w
- 6,te - 0,) - mv3
- ade,w
follows by
Ii 8 - e. II,
(7.2)
is bounded in probability for each 0. Then, the combination of these two properties suggests that l?,(e) goes to zero as the sample size grows and 8 goes to BO,a stochastic equicontinuity property. If so, then the remainder term in eq. (7.1) will be of order oP( I/0 - 8, I//& + II8 - 8, /I*). The next result shows that a slightly weaker condition is sufficient for the approximation in eq. (7.1) to lead to asymptotic normality of 8. Theorem
7.1
Suppose that Q.(8) 2 supti&(@ - o&r- ‘), 8 A 8,, and (i) QO(0) is maximized on @ at 8,; (ii) 8, is an interior point of 0, (iii) Qe(0) is twice differentiable at 8,
W.K. Newey
2186
with nonsingular
second
s~p~~~-~,,,,,~~R,(e)/[l
derivative
+ JnllO
H; (iv) &fi
- ~,I111 LO.
Then
5
N(O,Q;
&(e-
and D. McFadden
(v) for any 6, +O,
&J ~N(O,H-‘~H-‘).
The
proof of this result is given in Section 7.4. This result is essentially a version of Theorem 2 of Pollard (1985) that applies to any objective function rather than just a sample average, with an analogous method of proof. The key remainder condition is assumption (v), which is referred to by Pollard as stochastic diflerentiability. It is slightly weaker than k,(O) converging to zero, because of the presence of the denominator term (1 + & /I8 - 8, II)- ‘, which is similar to a term Huber (1967) used. In several cases the presence of this denominator term is quite useful, because it leads to a weaker condition on the remainder without affecting the conclusion. Although assumption (v) is quite complicated, primitive conditions for it are available, as further discussed below. The other conditions are more straightforward._Consistency can be shown using Theorem 2.1, or the generalization that allows for 8 to be an approximate maximum, as suggested in the text following Theorem 2.1. Assumptions (ii) and (iii) are quite primitive, although verifying assumption (iii) may require substantial detailed work. Assumption (iv) will follow from a central limit theorem in the usual case where 6, is equal to a sample average. There are several examples of GMM estimators in econometrics where the moments are not continuous in the parameters, including the simulated moment estimators of Pakes (1986) and McFadden (1989). For these estimators it is useful to have more specific conditions than those given in Theorem 7.1. One way such conditions can be formulated is in an asymptotic normality result for minimum distance estimators where g,(e) is allowed to be discontinuous. The following is such a result. Theorem
7.2
Suppose that $,,(@I?o.(@ < info,0Q.(8)‘i@&(8) + o,(n-‘), 8-% 8,, and I? L W, W is positive semi-definite, where there is go(e) such that (i) gO(O,) = 0; (ii) g,,(d) is differentiable at B0 with derivative G such that G’WG is nonsingular; (iii) 8, is an interior point of 0; (iv) +g,(e,) $,(e,)-g&III/[1 WZWG
L
+fiIIe-e,II]
NO, z3; (v) for any 6, + LO.
Then
0, supllO- OolI $6,&
II8,u4 -
,/“(k@wV[O,(G’WG)-‘G’R
(G’WG)-‘1.
The proof is given in Section 7.4. For the case where Q,(e) has the same number of elements as 8, this result is similar to Huber’s (1967), and in the general case is like Pakes and Pollard’s (1989), although the method of proof is different than either of these papers’. The conditions of this result are similar to those for Theorem 7.1. The function go(e) should be thought of as the limit of d,(e), as in Section 3. Most of the conditions are straightforward to interpret, except for assumption (v). This assumption is a “stochastic equicontinuity” assumption analogous to the condition (v) of Theorem 7.1. Stochastic equicontinuity is the appropriate term here because when go(e) is the pointwise
limit of $,,(e), i.e. d,(e) Ago(B)
for all 0, then for all
Ch. 36: Laryr Sample Estimation and Hypothesis
Testing
2187
8 # 8,, & 11Q,(O) - &,(8,) - go(H) II/[ 1 + Ji )I0 - B. II] AO. Thus, condition (v) can be thought of as an additional requirement that this convergence be uniform over any shrinking neighborhood of BO.As discussed in Section 2, stochastic equicontinuity is an essential condition for uniform convergence. Theorem 7.2 is a special case of Theorem 7.1, in the sense that the proof proceeds by showing that the conditions of Theorem 7.1 are satisfied. Thus, in the nonsmooth case, asymptotic normality for minimum distance is a special case of asymptotic normality for an extremum estimator, in contrast to the results of Section 3. This relationship is the natural one when the conditions are sufficiently weak, because a minimum distance estimator is a special case of a general extremum estimator. For some extremum estimators where V,&,(0) exists with probability one it is possible to, use Theorem 7.2 to show asymptotic normality, by setting i,,(e) equal to V,Q,(@. An example is censored least absolute deviations, where V,&(0) = n - l C;= 1xil(xj8 > 0)[ 1 - 2.l(y < x’e)]. However, when this is done there is an additional condition that has to be checked, namely that )/V,Q,(0) )/* d
inf,, 8 11 V,&(e) II2 + o,(n- ‘), for which it suffices to show that J&V&,(@ L 0. This is an “asymptotic first-order condition” for nonsmooth objective functions that generally has to be verified by direct calculations. Theorem 7.1 does not take this assumption to be one of its hypotheses, so that the task of checking the asymptotic first-order condition can be bypassed by working directly with the extremum estimator as in Theorem 7.1. In terms of the literature, this means that Huber’s (1967) asymptotic first-order condition can be bypassed by working directly with the extremum formulation of the estimator, as in Pollard (1985). The cost of doing this is that the remainder in condition (v) of Theorem 7.1 tends to be more complicated than the remainder in condition (v) of Theorem 7.2, making that regularity condition more difficult to check. The most complicated regularity condition in Theorems 7.1 and 7.2 is assumption (v). This condition is difficult to check in the form given, but there are more primitive conditions available. In particular, for Q,(0) = n ‘Cy= 1 q(z,, 8), where the objective function is a sample average, Pollard (1985) has given primitive conditions for stochastic differentiability. Also, for GMM where J,(0) = C;= i g(z, 0)/n and go(B) = E[g(z, 0)], primitive conditions for stochastic equicontinuity are given in Andrews’ (1994) chapter of this handbook. Andrews (1994) actually gives conditions for a stronger result, that s~p,,~_~,,, da./% )Id,(0) - .&(0,) - go(e) 1)L 0, i.e. for (v) of Theorem 7.2 without the denominator term. The conditions described in Pollard (1985) and Andrews (1994) allow for very weak conditions on g(z, 0), e.g. it can even be discontinuous in 8. Because there is a wide variety of such conditions, we do not attempt to describe them here, but instead refer the reader to Pollard (1985) and Andrews (1994). There is a primitive condition for stochastic equicontinuity that is not covered in these other papers, that allows for g(z, 8) to be Lipschitz at 0, and differentiable with probability one, rather than continuously differentiable. This condition is simple but has a number of applications, as we discuss next.
W.K. Newey and D. McFadden
2188
7.2.
Stochastic
equicontinuity
for Lipschitz
moment,functions
The following result gives a primitive condition for the stochastic equicontinuity hypothesis of Theorem 7.2 for GMM, where Q,(e) = nP ‘Cy= 1g(Zi, 0) and go(O)=
ECg(z, @I. Theorem
7.3
Suppose that E[g(z, O,)] = 0 and there are d(z) and E > 0 such that with probability r(z,B)]
one,
IIdz, Q)- & 0,) - W(fl - 44 II/IIQ- 0, I/+ 0 as Q+ oo,~C~W,,,-,~,, Ccx
r(z, d) =
< a,
Theorem
and n- ‘Cr= 1d(zi) LE[d(z)]. 7.2 are satisfied for G = E[d (z)].
Then
assumptions
(ii) and
(v) of
Proof
one r(z, E) + 0 as For any E > 0, let r(z,E) = sup, o-00, BEIIr(z, 0) 11.With probability E+ 0, so by the dominated convergence theorem, E[r(z, E)] + 0 as E+ 0. Then for 0 + 0, and s = IIQ- 4, II, IIad@- sd4) - (30 - 0,) I/= IIEC&, 0)- g(z,0,) - 44 x (0 - O,)]11 d E[r(z, E)] II0 - 0, /I+O, giving assumption (iii). For assumption (v), note that for all (5’with /I8 - 0, I/ < 6,, by the definition
of r(z, E) and the Markov
inequality,
II4,(@- &(Ho)- go(@I//Cl + fi II0 - 0, II1 d Jn CIICY=1{d(zi) - EC&)1 } x (0- f&)/nII + {C1=Ir(zi, Wn + ECr(z, S.)l > II0 - 00IIl/(1 + Jn II0 - 00II1d IICy=1 Q.E.D. j A(zJ - J%A(z)lj/n II + ~,@Cr(z,%)I) JS 0. Jn
The condition on r(z, Cl) in this result was formulated by Hansen et al. (1992). The requirement that r(z, 0) --f 0 as 8 + B0 means that, with probability one, g(z, 19)is differentiable with derivative A(z) at BO.The dominance condition further restricts this remainder to be well behaved uniformly near the true parameter. This uniformity property requires that g(z, e) be Lipschitz at B0 with an integrable Lipschitz constant.44 A useful aspect of this result is that the hypotheses only require that Cr= 1A(zi) 3 E[A(z)], and place no other restriction on the dependence of the observations. This result will be quite useful in the time series context, as it is used in Hansen et al. (1992). Another useful feature is that the conclusion includes differentiability of go(e) at B,, a “bonus” resulting from the dominance condition on the remainder. The conditions of Theorem 7.3 are strictly weaker than the requirement of Section 3 that g(z, 0) be continuously differentiable in a neighborhood of B0 with derivative that is dominated by an integrable function, as can be shown in a straightforward way. An example of a function that satisfies Theorem 7.3, but not the stronger continuous differentiability condition, is the moment conditions corresponding to Huber’s (1964) robust location estimator.
44For44
= SUPI~~~~,,~ < &tiz, 01, the triangle and Cauchy-Schwarz
Ill&) II + &)I II0 - 6, Il.
inequalities
imply
1)~(z,o)- g(~,0,) I/
so differentiating for J(z)=
expectations
implies that
s
vW,CyIx1.0x Iv) dx = E,Cv(x)yl,
gives V,j G[z, y(v)] dF, = V,E,[v(x)y]
= E[v(x)ySJ
= E[G(z)SJ, (8.13)
V(X)Y- ECv(x)~l.
For example, for the consumer surplus estimator, by eq. (8.8), one has v(x) = and y = (l,q), so that 6(z) = l(a 6 x d b)f,(x)-’ x l(U~X~b)f~(x)-lC-h~(x),ll c4- Mx)l. With a candidate for 6(z) in hand, it is easier to find the integral representation for assumption (iii) of Theorem 8.1. Partition z as z = (x, w), where w are the components of z other than x. By a change of variables, 1 K,(x - xi) dx = j K(u) du = 1, so that
s
v(x)y,(x)dx=n-’
G(z,y^-y,)dF,=
i, i=l
- E[v(x)y]
= n
”
1 izl
s
V(X)yiK,(X
-
Xi)
dx
f
J 6(X,WJKg(X- xi) dx = Jd(z)d’, (8.14)
where the integral of a function a(z) over d$ is equal to np ’ x1= I sa(x, wi)K,(x - xi)dx. The integral here will be the expectation over a distribution when K(u) 2 0, but when K(u) can be negative, as for higher-order kernels, then the integral cannot be interpreted as an expectation. The final condition of Theorem 8.1, i.e. assumption (iv), will follow under straightforward conditions. To verify assumption (iv) of Theorem 8.1, it is useful to note that the integral in eq. (8.14) is close to the empirical measure, the main difference being that the empirical distribution of x has been replaced by a smoothed version with density n- ’ x1= 1K,(x - xi) [for K(u) 3 01. Consequently, the difference between the two integrals can be interpreted as a smoothing bias term, with
b(z)diSd(z)dF=K’
By Chebyshev’s in probability
inequality,
$r
[ Sv(x)K,(X-xi)dX-V(Xi)]Yi.
sufficient conditions
to zero are that JnE[y,{
‘CIIY~II*IISV(X)K,(X-~~)~X-VV(X~)II~I~O.A and smoothness
parts of Assumptions
for Jn
(8.15)
times this term to converge
jv(x)K,(x - xi)dx - V(Xi)}] -0 and that s s.h own below, the bias-reducing kernel 8.1-8.3 are useful in showing that the first
Ch. 36: Large Sample Estimation and Hypothesis
condition holds, while continuity the second. In particular, one can even when v(x) is discontinuous, Putting together the various asymptotic Theorem
normality
2209
Testing
of v(x) at “most points” of v(x) is useful for showing show that the remainder term in eq. (8.15) is small, as is important in the consumer surplus example. arguments described above leads to a result on
of the “score” x1= r g(zi, y”)/&.
8.11
Suppose that Assumptions 8.1-8.3 are satisfied, E[g(z, yO)] = 0, E[ I/g(z, y,,) )(‘1 < a, X is a compact set, cr = o(n) with na2’+4d/(ln n)2 -+ cc and na2m + 0, and there is a vector of functionals G(z, y) that is linear in y such that (i) for ll y - y. I/ small
IIdz, y)- dz, yo)- W, Y- yo) II 6 W IIY-y. II2,ECWI < ~0;(3 IIW, Y) II d c(z) 1)y 1)and E[c(z)‘] < co; (iii) there is v(x) with 1 G(z, y) dF,(z) = lv(x)y(x) dx for all /ly 11< co; (iv) v(x) is continuous almost everywhere, 111v(x) 11dx < co, and there
enough,
is E > 0 such that E[sup,,,,, GE(Iv(x + u) II41 < co. Then for 6(z) = v(x)y - E[v(x)y], Cr= 1S(zi, Y^)l& 5
N(0, Var [g(z, yo) + 6(Z)]}.
Proof The proof proceeds
by verifying
the conditions
of Theorem
8.1. To show assump-
2 30 which follows by the rate conditions tion (i) it suffices to show fi /Iy*- y. 11 on 0 and Lemma 8.10. To show assumption iii), note that by K(u) having bounded derivatives of order d and bounded support, (IG[z, yK,(. - x)] 11d o-‘c(z) IIy )I. It then follows by Lemma 8.4 that the remainder term of eq. (8.11) is O,,(n- ‘a-’ x {E[c(z,)/( y, II] + (E[c(z~)~ IIy, 112])1’2})= o,(l) by n-‘a-‘+O. Also, the rate conditions imply 0 --f 0, so that E[ I(G(z, 7 - yo) )I2] d E[c(z)~] 117- y. 1)2 + 0, so that the other remainder term for assumption (ii) also goes to zero, as discussed following eq. (8.11). Assumption (iii) was verified in the text, with dF as described there. To show assumption (iv), note that
v(x)K,(x
- xi) dx - v(xi) yi I
v(x)K(u)y,(x
Cyo(x < &
s
II v(x) II
iI1
Ill
- au) dudx -
- 0~)-
yo(x)PW
du
[y,(x - au) - y,,(x)lK(u) du 11dx < 1 Ca” jll
v(x)IIdx,
(8.16)
W.K. Newey and D. McFadden
2210
for some constant C. Therefore, //J&[ { Sv(x)K,(x-xi) dx - V(Xi)}yi] 11d CJjza”‘+O. Also, by almost everywhere continuity of v(x), v(x + au) + v(x) for almost all x and U. Also, on the bounded support of K(u), for small enough 0, v(x + W) d SU~~~~~~ S ,v(x + o), so by the dominated convergence theorem, j v(x + au)K(u) du + j v(x)K(u) du = v(x) for almost all x. Another application of the dominated convergence theorem, using boundedness of K(u) gives E[ 11 j v(x)K,(x - xi) dx - v(xi) 114]-0, so by the CauchySchwartz inequality, E[ 11yi I/2 11j v(x)K,(x - xi) dx - v(xi) II2] + 0. Condition (iv) then follows from the Chebyshev inequality, since the mean and variance of Q.E.D. II- l” C;= 1[I v(x)K,(x - xi) dx - v(x,)]y, go to zero. The assumptions of Theorem 8.11 can be combined of the Jacobian to obtain an asymptotic normality estimator. As before, let R = Var [g(z, y,,) + 6(z)]. Theorem
with conditions for convergence result with a first-step kernel
8.12
Suppose that e -% 00~ interior(O), the assumptions of Theorem 8.11 are satisfied, E(g(z, yO)] = 0 and E[ 11g(z, ye) 11 2] < co, for 11 y - y. II small enough, g(z, 0, y) is continuously differentiable in 0 on a neighborhood _# of O,, there are b(z), s > 0 with
EC&)1< ~0, IIV~s(z,~,y)-V,g(z,~,,y,)/I d&)Cl/Q-4ll”+ E[V,g(z, Oo,yo)] exists and is nonsingular.
Then $(&
0,) 3
IIY-Y~II~~~ and N(0, G; ‘L2G; I’).
Proof
It follows similarly to the proof of Theorem 8.2 that 6; ’ 3 G; ‘, so the conclusion follows from Theorem 8.11 similarly to the proof of Theorem 8.2. Q.E.D. As previously discussed, the asymptotic variance can be estimatedby G,,‘86, I’, C;= 1 Vsg(zi, e,y*) and 8= n- ’ x1= lliiti; for ai = g(zi, 0, $) + 6(zi). The whereG,=n-’ main question here is how to construct an estimator of 6(z). Typically, the form of 6(z) will be known from assumption (iii) of Theorem 8.11, with 6(z) = 6(z, 8,, yo) for some known function 6(z, 0, y). An estimator of 6(z) can then be formed by substituting 8 and $3for 8, and y. to form
8(z)= 6(z, 6,jq.
(8.17)
The following result gives regularity asymptotic variance estimator. Theorem
conditions
for consistency
of the corresponding
8.13
Suppose that the assumptions of Theorem 8.12 are satisfied and there are b(z), s > 0, such that E[~(z)~] < cc and for /Iy - y. /I small enough, IIg(z, 19,y)-g(z, do, y)II d h(z) x
CIIQ-Q,ll”+ /I~-~~ll~l and 11~~~,~,~~-~6(~,~~,~~~ll dWCII~-~oI/“+ /IY-Y~II”~. Then 6; ’ 86;
l’ L
G; ‘RG;
“.
Ch. 36: Large Sample Estimation and Hypothesis
2211
Testing
Proof
It suffices to show that the assumptions of Theorem 8.3 are satisfied. By the conditions of Theorem 8.12, I/t? - 0, I/ 3 0 and /I9 - y0 I/ 5 0, so with probability approaching one,
because n- ’ x1= 1 b(zi) 2 is bounded in probability follows similarly that Cr= 1 11 8(zi) - 6(Zi) II“/ n 30, Theorem 8.3.
by the Markov so the conclusion
inequality. It follows by Q.E.D.
In some cases 6(z, 0, y) may be complex and difficult to calculate, making it hard to form the estimator 6(z, e,?). There is an alternative estimator, recently developed in Newey (1992b), that does not have these problems. It uses only the form of g(z, 6,~) and the kernel to calculate the estimator. For a scalar [ the estimator is given by
i(zi)=v, n-l [
j$l
C71zj38,y*
+
(8.18)
i,K,(‘yxil}]~
i=O’
This estimator can be thought of as the influence of the ith observation through the kernel estimator. It can be calculated by either analytical or numerical differentiation. Consistency of the corresponding asymptotic variance estimator is shown in Newey (1992b). It is helpful to consider some examples to illustrate how these results for first-step kernel estimates can be used. Nonparametric consumer surplus continued: To show asymptotic normality, one can first check the conditions of Theorem 8.11. This estimator has g(z, yO) = Jib,(x) dx 8, = 0, so the first two conditions are automatically satisfied. Let X = [a, b], which is a compact set, and suppose that Assumptions 8.1-8.3 are satisfied with m = 2, d = 0, and p = 4, so that the norm IIy 1) is just a supremum norm, involving no derivatives. Note that m = 2 only requires that JuK(u)du = 0, which is satisfied by many kernels. This condition also requires that fO(x) and fO(x)E[q Ix] have versions that are twice continuously differentiable on an open set containing [a, b], and that q have a fourth moment. Suppose that no2/(ln n)‘+ CC and no4 +O, giving the bandwidth conditions of Theorem 8.11, with r = 1 (here x is a scalar) and d = 0. Suppose that f,,(x) is bounded away from zero on [a, b]. Then, as previously shown in eq. (8.9), assumption (i) is satisfied, with b(z) equal to a constant and G(z, y) = (ii) holds by inspection by fO(x)-’ and !,bfo(x)- ’ C- Mx), lldx) dx. Assumption h,(x) bounded. As previously noted, assumption (iii) holds with v(x) = l(a < x < b) x fO(x)- ’ [ - h,(x), 11. This function is continuous except at the points x = a and x = b,
W.K. Newey and D. McFadden
2212
and is bounded, so that assumption Theorem 8.11 it follows that i;(x) - 0,
>
LV(O,
(iv) is satisfied.
Then
E[l(a ,< x d 4f,(x)~‘{q
-
by the conclusion
hI(x))21)>
of
(8.19)
an asymptotic normality result for a nonparametric consumer surplus estimator. To estimate the asymptotic variance, note that in this example, 6(z) = l(a d x d b) x Then f&)- ’ [I4- Mx)l = &z,h) for h(z,Y)= 1(a d x d b)y,(~)~’[q - y1(x)-1y2(x)]. for 6(z) = 6(z, y^),an asymptotic variance estimator will be
‘= f 8(Zi)2/n = n-l i=l
i$l l(U <Xi
< b)f(Xi)p2[qi-
&(Xi)]2.
(8.20)
By the density bounded away from zero on 3 = [a, b], for /Iy - y. /I small enough that yr (x) is also bounded away from zero on .oll‘,16(zi, y) - 6(zi, yO)1d C( 1 + qi) 11y - y0 1) for some constant C, so that the conditions of Theorem 8.13 are satisfied, implying consistency of d. Weighted average derivative estimation: There are many examples of models where there is a dependent variable with E[qlx] = T(X’ /3,Jfor a parameter vector /IO, as discussed in Powell’s chapter of this handbook. When the conditional expectation satisfies this “index” restriction, then V,E[ql.x] = s,(x’~,,)~~, where r,(v) = dr(v)/dv. Consequently, for any bounded function w(x), E[w(x)V,E[q(x]] = E[w(x)r,(x’/3,)]&,, i.e. the weighted average derivative E[w(x)V,E[qlx]] is equal to a scale multiple of the coefficients /I,,. Consequently, an estimate of /I0 that is consistent up to scale can be formed as
B=n-'
t W(Xi)V,L(Xi), C(X)= i i=l
i=l
qiK,(X-Xi)/i
K,(X-Xi).
(8.21)
i=l
This is a weighted average derivative estimator. This estimator takes the form given above where yIO(x) = f,Jx), yIO(x) = fO(x) x ECq Ixl,
and
Yk 0, v) = %47,cY2(4lY,(~)l
- 8.
(8.22)
The weight w(x) is useful as a “fixed trimming” device, that will allow the application of Theorem 8.11 even though there is a denominator term in g(z, 0, y). For this purpose, let 3 be a compact set, and suppose that w(x) is zero outside % and bounded. Also impose the condition that fe(x) = yIO(x) is bounded away from zero on I%^.Suppose that Assumptions 8.1-8.3 are satisfied, n~?‘+~/(ln ~)~+co and &“+O.
Ch. 36: Large Sample Estimation and Hypothesis
2213
Testing
These conditions will require that m > r + 2, so that the kernel must be of the higher-order type, and yO(x) must be differentiable of higher order than the dimension of the regressors plus 2. Then it is straightforward to verify that assumption (i) of Theorem 8.11 is satisfied where the norm (/y )I includes the first derivative, i.e. where d = 1, with a linear term given by
G(z,Y)= w(x)Cdx)~(x)+ V,r(xMx)l, %b4 = .I-&)- l c- &Ax) + kl(xb(x), - SWI,
Md = .foW l c- Mx), II’> (8.23)
where an x subscript denotes a vector of partial derivatives, and s(x) = fO,.(x)/fO(x) is the score for the density of x. This result follows from expanding the ratio V,[y,(x)/y,(x)] at each given point for x, using arguments similar to those in the previous example. Assumption (ii) also holds by inspection, by fO(x) bounded away from zero. To obtain assumption (iii) in this example, an additional step is required. In particular, the derivatives V,y(x) have to be transformed to the function values y(x) in order to obtain the representation in assumption (iii). The way this is done is by integration by parts, as in
HwW,~Wd41=
=-s
s
w(x)fo(x)b,(x~CV,~(x)l dx
V,Cw(x)fo(x)~o(x)l’~O dx>
v,Cww-ow,(x)l’
= w,(x)C- Mx), II+ w(x)c- 4&d, 01
It then follows that 1 G(z, y) dF, = j v(x)y(x) dx, for
44 = - w,(x)C- w4 11- w(x)II- &&4,01 + wb)a,(x) = - {WAX) + w(x)s(x)> c- Mx), 11= 04 t(x) = - w,(x) -- w(x)s(x).
c-
h_l(x),11, (8.24)
By the assumption that fO(x) is bounded away from zero on .!Zand that 9” is compact, the function a(~)[ - h,(x), l] is bounded, continuous, and zero outside a compact set, so that condition (iv) of Theorem 8.11 is satisfied. Noting that 6(z) = C(x)[q - h,(x)], the conclusion of Theorem 8.11 then gives
1L
w(xi)V,&xi) - 80
W,Var{w(x)V,h,(x) + QX)[q - &(x)]}). (8.25)
W.K. Newey
2214
The asymptotic
&!=n-’
i,
variance I,?$,
of this estimator
can be estimated
pi = W(Xi)V,~(Xi) - H^+ ~(Xi)[qi -
I],
and D. McFadden
as (8.26)
i=l
where z(x) = - w,(x) - w(x)fJx)/f^(x) for T(X) = n- ’ C;= 1 K(x - xi). Consistency of this asymptotic variance estimator will follow analogously to the consumer surplus example. One cautionary note due to Stoker (1991) is that the kernel weighted average derivative estimators tend to have large small sample biases. Stoker (1991) suggests a corrected estimate of - [n-l Cy= 1e^(x,)x~]- ‘8, and shows that this correction tends to reduce bias 8 and does not affect the asymptotic variance. Newey et al. and show that (1992) suggest an alternative estimator o^+ n- ’ C;= 1 &xi) [qi - &)I, this also tends to have smaller bias than 6. Newey et al. (1992) also show how to extend this correction to any two-step semiparametric estimator with a first-step kernel.
8.4.
Technicalities
Proof of Lemma 8.4 Let mij = m(zi, zj), 61,.= m,(z,), and fi., = m2(zi). Note that E[ /Im, 1 - p 111d E[ 11 m, I 111 +(E[ I~m,,~~2])1’2 and (E[I~m,, -p(12])1’2 .
.,/LO)
=
fi
f’k&xtBtL
1=1
and is easily estimated. For example, F normal yields binary probits, and F logistic yields binary logits. A Lagrange multiplier test for 6 = 0 will detect the presence of unobserved heterogeneity across cases. Assume a sample of n cases, drawn randomly from the population. The LM test statistic is
[
LM=
n C (Vd2/n-
i
(VAVJYln
Cw
I[
12
1 (VpW&Yln 1
where e is the log-likelihood of the case, V,/ = (VP,/, , V,,Z?), and all the derivatives are evaluated at 6 = 0 and the Bernoulli model estimates of /I. The j? derivatives are straightforward, e,t = d,x,f(d,x,P,)/F(d,x,B,), where f is the density I’Hopital’s rule:
la=;
~-
of F. The 6 derivative
is more delicate,
f(4xtBJ’ + i 4f(4x,B,) 2 t=I F(d,x,&) II ’ W,x,BJ2 I[
requiring
use of
Ch. 36: Large Sample Estimation
and Hypothesis
2239
Testing
The reason for introducing 6 in the form above, so J&J appeared in the probability, was to get a statistic where C V& was not identically zero. The alternative would have been to develop the test statistic in terms of the first non-identically zero higher derivative; see Lee and Chesher (1986). The LM statistic can be calculated by regressing the constant 1 on V& and V,,P, . . . ) V,“e, where all these derivatives are evaluated at 6 = 0 and the Bernoulli model estimates, and then forming the sum of squares of the fitted values. Note that the LM statistic is independent of the shape of the heterogeneity distribution h(v), and is thus a “robust” test against heterogeneity of any form.
9.8.
Technicalities
Some test statistics are conveniently defined using generalized inverses. This section gives a constructive definition of a generalized inverse, and lists some of its properties. A matrix ,& is a Moore-Penrose generalized inverse of a matrix ,A, k if it has three properties: (i) AA-A = A, (ii) A-AA= A-, (iii) AA _ and A -A are symmetric. There are other generalized inverse definitions that have some, but not all, of these properties; in particular A + will denote any matrix that satisfies (i). First, a method for constructing a generalized inverse is described, and then some of the implications of the definition are developed. The construction is called the singular value decomposition (SVD) of a matrix, and is of independent interest as a tool for finding the eigenvalues and eigenvectors of a symmetric matrix, and for calculation of inverses of moment matrices of data with high multicollinearity; see Press et al. (1986) for computational algorithms and programs. Lemma 9.4
Every real m x k matrix A of rank r can be decomposed
into a product
A=UDV mxk
mxrrxrrxk’
where D is a diagonal matrix with positive diagonal, and U and V are column-orthonormal;
nonincreasing elements i.e. U’U = I, = V’V.
down
the
Proof
The m x m matrix AA’ is symmetric and positive semi-definite. Then, there exists an m x m orthonormal matrix W, partitioned W = [W, W,] with WI of dimension m x r, such that w;(AA’)W, = G is diagonal with positive, nonincreasing diagonal
WK. Newey and D. McFadden
2240
elements, and W;(AA’)W, = 0, implying A’W, = 0. Define D from G by replacing the diagonal elements of G by their positive square roots. Note that W' W = I = W W’ = W, W; + W, W;. Define U = W, and V’ = D-l U’A. Then, U’U = I, and V’V = D~‘U’AUD~‘=D-‘GD-‘=I,.Further,A=(Z,-W,W;)A=UU’A=UD1/‘.This Q.E.D. establishes the decomposition. Note that if A is symmetric, then U is the array of eigenvectors of A corresponding to the nonzero roots, so that A’U = UD,, with D, the r x r diagonal matrix with the nonzero eigenvalues in descending magnitude down the diagonal. In this case, V = A’UD-’ = UD,D-‘. Since the elements of D, and D are identical except possibly for sign, the columns of U and V are either equal (for positive roots) or reversed in sign (for negative roots). Lemma 9.5 The Moore-Penrose generalized inverse of an m x k matrix A is the matrix A- = V D-l U’ Let A’ denote any matrix, including A-, that satisfies AA+A = A. kxr
rxr
rxm
These matrices satisfy: (1) A+ = A- ’ if A is square and nonsingular. (2) The system of equations Ax = y has a solution if and only if y = AA+y, and the linear subspace of all solutions is the set ofvectors x = A+y + [Z - A+A]z for all ZERk.
(3) AA+ and A+A are idempotent. (4) If A is idempotent, then A = A-. (5) If A = BCD with B and D nonsingular, A+ = D-‘C+B-’ satisfies AA+A = A.
then A- = D-‘C-B-‘,
and any matrix
Proof Elementary;
see Pringle
and Rayner
(1971).
Lemma 9.6 If A is square, symmetric, and positive semi-definite of rank r, then (1) There exist Q positive definite and R idempotent of rank r such that A = QRQ and A- = Q-‘RQ-‘. (2) There exists kt, column-orthonormal such that U’AU = D is nonsingular diagonal and A- = U(U’AU)- ’ U’. (3) A has a symmetric square root B = A”‘, and A- = B-B-. Proof Let W = [U W,] diagonal W’,R=
be an orthogonal
matrix
matrix of positive eigenvalues, 1, 0 W [ 00
1
w’ and B = UD’i2U’. ’
diagonalizing
A. Then,
U’AU = D, a
ID:1:_.
and A W, = 0. Define Q = W
Q.E.D.
Ch. 36: Large Sample Estimation and
2241
Hypothesis Testing
Lemma 9.7 If y - N(A,I, A), with A of rank I, and A+ is any symmetric matrix satisfying AA+A = A, then y’A+y is noncentral chi-square distributed with I degrees of freedom and noncentrality parameter ,l’Al.
Proof Let W = [U W,] be an orthonormal matrix that diagonalizes A, as in the proof of Lemma 9.6, with U’AU = D, a positive diagonal r x r matrix, and W’AW, = 0, implying
A W, = 0. Then, the nonsingular
mean [ Dm1’F’A’]
and covariance
transformation
z=
matrix
buted N(D- “2U’A2,Z,), z2 = W,y = 0, implying w’y = [Dli2z, 01. It is standard that z’z has a noncentral chi-square distribution with r degrees of freedom and noncentrality parameter A’AUD-‘U’AA = 2’A;1. The condition A = AA+A implies U’AU = U’AWW’A+ WW’AU, or D = [DO]W’A+
W[DO]‘=
Hence, U’A+U = D-l. y’A+y = y’WW’A+
D(U’A+U)D.
Then WW’y = [z;D”~O](W’A+
= z;D”~(U’A+U)D~‘~Z~
W)[D”2z;
01’
= z;zl. Q.E.D.
References Ait-Sahalia, Y. (1993) “Asymptotic Theory for Functionals of Kernel Estimators”, MIT Ph.D. thesis. Amemiya, T. (1973) “Regression Analysis When the Dependent Variable is Truncated Normal”. Econometrica, 41, 997-1016. Amemiya, T. (1974) “The Nonlinear Two-Stage Least-Squares Estimator”, Journal of Econometrics, 2, 105-l 10. Amemiya, T. (1985) Advanced Econometrics, Cambridge, MA: Harvard University Press. Andersen, P.K. and R.D. Gill (1982) “Cox’s Regression Model for Counting Processes: A Large Sample Study”, The Annals of Statistics, 19, 1100-1120. Andrews, D.W.K. (1990) “Asymptotics for Semiparametric Econometric Models: I. Estimation and Testing”, Cowles Foundation Discussion Paper No. 908R. Andrews, D.W.K. (1992) “Generic Uniform Convergence”, Econometric Theory, 8,241-257. Andrews, D.W.K. (1994) “Empirical Process Methods in Econometrics”, in: R. Engle and D. McFadden, eds., Handbook ofEconometrics, Vol. 4, Amsterdam: North-Holland. Barro, R.J. (1977) “Unanticipated Money Growth and Unemployment in the United States”, American Economic Reoiew, 67, 101-115.
2242
W.K. Newey and D. McFadden
Bartle, R.G. (1966) The Elements oflntegration, New York: John Wiley and Sons. Bates, C.E. and H. White (1992) “Determination of Estimators with Minimum Asymptotic Covariance Matrices”, preprint, University of California, San Diego. Berndt, E.R., B.H. Hall, R.E. Hall and J.A. Hausman (1974) “Estimation and Inference in Nonlinear Structural Models”, Annals of Economic and Social Measurement, 3,653-666. Bickel, P. (1982) “On Adaptive Estimation,” Annals of Statistics, 10, 6477671. Bickel, P., C.A.J. Klaassen, Y. Ritov and J.A. Wellner (1992) “Efficient and Adaptive Inference in Semiparametric Models” Forthcoming monograph, Baltimore, MD: Johns Hopkins University Press. Billingsley, P. (1968) Convergence ofProbability Measures, New York: Wiley. Bloomfeld, P. and W.L. Steiger (1983) Least Absolute Deviations: Theory, Applications, and Algorithms, Boston: Birkhauser. Brown, B.W. (1983) “The Identification Problem in Systems Nonlinear in the Variables”, Econometrica, 51, 175-196. Burguete, J., A.R. Gallant and G. Souza (1982) “On the Unification of the Asymptotic Theory of Nonlinear Econometric Models”, Econometric Reviews, 1, 151-190. Carroll, R.J. (1982) “Adapting for Heteroskedasticity in Linear Models”, Annals of Statistics, 10,1224&1233. Chamberlain, G. (1982) “Multivariate Regression Models for Panel Data”, Journal of Econometrics, 18, 5-46. Chamberlain, G. (1987) “Asymptotic Efficiency in Estimation with Conditional Moment Restrictions”, Journal of Econometrics, 34, 305-334. Chesher, A. (1984) “Testing for Neglected Heterogeneity”, Econometrica, 52, 865-872. Chiang, C.L. (1956) “On Regular Best Asymptotically Normal Estimates”, Annals of Mathematical Statistics, 27, 336-351. Daniels, H.E. (1961) “The Asymptotic Efficiency of a Maximum Likelihood Estimator”, in: Fourth Berkeley Symposium on Mathematical Statistics and Probability, pp. 151-163, Berkeley: University of California Press. Davidson, R. and J. MacKinnon (1984) “Convenient Tests for Probit and Logit Models”, Journal of Econometrics, 25, 241-262. Eichenbaum, M.S., L.P. Hansen and K.J. Singleton (1988) “A Time Series Analysis of Representative Agent Models of Consumption and Leisure Choice Under Uncertainty”, Quarterly Journal of Economics, 103, 5 l-78. Eicker, F. (1967) “Limit Theorems for Regressions with Unequal and Dependent Errors”, in: L.M. LeCam and J. Neyman, eds., Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley: University of California Press. Fair, R.C. and D.M. Jaffee (1972) “Methods of Estimation for Markets in Disequilibrium”, Econometrica, 40,497-514. Ferguson, T.S. (1958) “A Method of Generating Best Asymptotically Normal Estimates with Application to the Estimation of Bacterial Densities”, Annals of Mathematical Statistics, 29, 1046-1062. Fisher, F.M. (1976) The Identification Problem in Econometrics, New York: Krieger. Fisher, R.A. (1921) “On the Mathematical Foundations of Theoretical Statistics”, Philosophical Transactions, A, 222, 309-368. Fisher, R.A. (1925) “Theory of Statistical Estimation”, Proceedings of the Cambridge Philosophical Society, 22, 700-725. Gourieroux, C., A. Monfort and A. Trognon (1983) “Testing Nested or Nonnested Hypotheses”, Journal of Econometrics, 21, 83-l 15. Gourieroux, C., A. Monfort and A. Trognon (1984) “Psuedo Maximum Likelihood Methods: Theory”, Econometrica, 52, 68 l-700. Hajek, J. (1970) “A Characterization of Limiting Distributions of Regular Estimates”, Z. Wahrscheinlichkeitstheorie uerw. Geb., 14, 323-330. Hansen, L.P. 
(1982) “Large Sample Properties of Generalized Method of Moments Estimators”, Econometrica, 50, 1029-1054.
Ch. 36: Large Sample Estimation
and Hypothesis
Testing
2243
Hansen, L.P. (1985a) “A Method for Calculating Bounds on the Asymptotic Covariance Matrices of Generalized Method of Moments Estimators”, Journal ofEconometrics, 30, 203-238. Discussion, December meetings of the Hansen, L.P. (1985b) “Notes on Two Step GMM Estimators”, Econometric Society. Hansen, L.P. and K.J. Singleton (1982) “Generalized Instrumental Variable Estimation of Nonlinear Rational Expectations Models”, Econometrica, 50, 1269-1286. Hansen, L.P., J. Heaton and R. Jagannathan (I 992) “Econometric Evaluation of Intertemporal Asset Pricing Models Using Volatility Bounds”, mimeo, University of Chicago. Hardle, W. (1990) Applied Nonparametric Regression, Cambridge: Cambridge University Press. Hiirdle, W. and 0. Linton (1994) “Nonparametric Regression”, in: R. Engle and D. McFadden, eds., Handbook of Econometrics, Vol. 4, Amsterdam: North-Holland. Hausman, J.A. (1978) “Specification Tests in Econometrics”, Econometrica, 46, 1251-1271. Hausman, J.A. and D. McFadden (1984) “Specification Tests for the Multinomial Logit Model”, Econometrica, 52, I2 19-l 240. Heckman, J.J. (1976) “The Common Structure of Statistical Models of Truncation, Sample Selection, and Limited Dependent Variables and a Simple Estimator for Such Models”, Annals ofEconomic and Social Measurement, 5,475-492. Honor&, B.E. (1992) “Timmed LAD and Least Squares Estimation of Truncated and Censored Models with Fixed Effects”, Econometrica, 60, 533-565. Honor& B.E. and J.L. Powell (1992) “Pairwise Difference Estimators of Linear, Censored, and Truncated Regression Models”, mimeo, Northwestern University. Huber, P.J. (1964) “Robust Estimation of a Location Parameter”, Annals ofMathematical Statistics, 35, 73-101. Huber, P. (1967) “The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions”, in: L.M. LeCam and J. Neyman, eds., Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley: University of California Press. Huber, P. (1981) Robust Statistics, New York: Wiley. Ibragimov, LA. and R.Z. Has’minskii (1981) Statistical Estimation: Asymptotic Theory, New York: Springer-Verlag. Jennrich (1969), “Asymptotic Properties of Nonlinear Least Squares Estimators”, Annals of Mathematical Statistics, 20, 633-643. Koenker, R. and G. Bassett (1978) “Regression Quantiles”, Econometrica, 46, 33-50. LeCam, L. (1956) “On the Asymptotic Theory of Estimation and Testing Hypotheses”, in: L.M. LeCam and J. Neyman, eds., Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 129-156, Berkeley: University of California Press. Lee, L. F. and A. Chesher (1986) “Specification Testing when the Score Statistics are Identically Zero”, Journal ofEconometrics, 31, 121-149. Maasoumi, E. and P.C.B. Phillips (1982) “On the Behavior of Inconsistent Instrumental Variables Estimators”, Journal ofEconometrics, 19, 183-201. Malinvaud, E. (1970) “The Consistency of Nonlinear Regressions”, Annals of Mathematical Statistics, 41,956-969. Manski, C. (1975) “Maximum Score Estimation of the Stochastic Utility Model of Choice”, Journal of Econometrics, 3, 205-228. McDonald, J.B. and W.K. Newey (1988) “Partially Adaptive Estimation of Regression Models Via the Generalized T Distribution”, Econometric Theory, 4, 428-457. McFadden, D. (1987) “Regression-Based Specification Tests for the Multinomial Logit Model”, Journal of Econometrics, 34, 63-82. McFadden, D. 
(1989) “A Method of Simulated Moments for Estimation of Multinomial Discrete Response Models Without Numerical Integration”, Econometricu, 57, 995-1026. McFadden, D. (1990) “An Introduction to Asymptotic Theory: Lecture Notes for 14.381”, mimeo, MIT.
W.K. Newey and D. McFadden
2244
Newey, W.K. (1984) “A Method of Moments Interpretation of Sequential Estimators”, Economics Letters, 14, 201-206. Newey, W.K. (1985) “Generalized Method of Moments Specification Testing”, Journal ofEconometrics, 29,229-256. Newey, W.K. (1987) “Asymptotic Properties of a One-Step Estimator Obtained from an Optimal Step Size”, Econometric Theory, 3, 305. Newey, W.K. (1988) “Interval Moment Estimation of the Truncated Regression Model”, mimeo, Department of Economics, MIT. Newey, W.K. (1989) “Locally Efficient, Residual-Based Estimation of Nonlinear Simultaneous Equations Models”, mimeo, Department of Economics, Princeton University. Newey, W.K. (1990) “Semiparametric Efficiency Bounds”, Journal of Applied Econometrics, 5,99-l 35. Newey, W.K. (1991a) “Uniform Convergence in Probability and Stochastic Equicontinuity”, Econometrica, 59, 1161-l 167. Newey, W.K. (1991b) “Efficient Estimation of Tobit Models Under Conditional Symmetry”, in: W. Barnett, J. Powell and G. Tauchen, eds., Semiparametric and Nonparametric Methods in Statistics and Econometrics, Cambridge: Cambridge University Press. Newey, W.K. (1992a) “The Asymptotic Variance of Semiparametric Estimators”, MIT Working Paper. Newey, W.K. (1992b) “Partial Means, Kernel Estimation, and a General Asymptotic Variance Estimator”, mimeo, MIT. Newey, W.K. (1993) “Efficient Two-Step Instrumental Variables Estimation”, mimeo, MIT. Newey, W.K. and J.L. Powell (1987) “Asymmetric Least Squares Estimation and Testing”, Econometrica, 55,819-847. Newey, W.K. and K. West (1988) “Hypothesis Testing with Efficient Method of Moments Estimation”, International Economic Review, 28, 777-787. Newey, W.K., F. Hsieh and J. Robins (1992) “Bias Corrected Semiparametric Estimation”, mimeo, MIT. Olsen, R.J. (1978) “Note on the Uniqueness Econometrica, 46, 1211~1216.
of the Maximum
Likelihood
Estimator
for the Tobit Model”,
Pagan, A.R. (1984) “Econometric Issues in the Analysis of Regressions with Generated Regressors”, International Economic Review, 25,221-247. Pagan, A.R. (1986) “Two Stage and Related Estimators and Their Applications”, Reuiew of Economic Studies, 53, 517-538. Pakes, A. (1986) “Patents as Options: Some Estimates of the Value of Holding European Patent Stocks”, Econometrica, 54, 755-785. Pakes, A. and D. Pollard (1989) “Simulation metrica, 57, 1027-1057.
and the Asymptotics
of Optimization
Estimators”,
Econo-
Pierce, D.A. (1982) “The Asymptotic Effect of Substituting Estimators for Parameters in Certain Types of Statistics”, Annals ofStatistics, IO, 475-478. Pollard, D. (1985) “New Ways to Prove Central Limit Theorems”, Econometric Theory, 1, 295-314. Pollard, D. (1989) Empirical Processes: Theory and Applications, CBMS/NSF Regional Conference Series Lecture Notes. Powell, J.L. (1984) “Least Absolute ofEconometrics, 25, 303-325. Powell, J.L. (1986) “Symmetrically 54.1435-1460.
Deviations Trimmed
Powell, J.L., J.H. Stock and T.M. Stoker Econometrica, 57, 1403-1430.
Pratt,J.W. (1981) “Concavity
Estimation
for the Censored
Least Squares Estimation (1989) “Semiparametric
of the Log Likelihood”,
Regression
Model”, Journal
for Tobit Models”, Econometrica, Estimation
of Index Coefficients”,
Journal ofthe American Statistical Association, 76,
103%106. Press, W.H., B.P. Flannery, University Press.
S.A. Tenkolsky
and W.T. Vetterling
(1986) Numerical Recipes, Cambridge
Ch. 36: Large Sample Estimation and Hypothesis
Testing
2245
Pringle, R. and A. Rayner (1971) Generalized Inverse Matrices, London: Griffin. Robins, J. (1991) “Estimation with Missing Data”, preprint, Epidemiology Department, Harvard School of Public Health. Robinson, P.M. (1988a) “The Stochastic Difference Between Econometric Statistics”, Econometrica, 56, 531-548. Robinson, P. (1988b) “Root-N-Consistent Semiparametric Regression”, Econometrica, 56, 931-954. Rockafellar, T. (1970) Convex Analysis, Princeton: Princeton University Press. Roehrig, C.S. (1989) “Conditions for Identification in Nonparametric and Parametric Models”, Econometrica, 56, 433-447. Rothenberg, T.J. (1971) “Identification in Parametric Models”, Econometrica, 39, 577-592. Rothenberg, T. J. (1973) Eficient Estimation with a priori Ir$ormation, Cowles Foundation Monograph 23, New Haven: Yale University Press. Rothenberg, T.J. (1984) “Approximating the Distributions of Econometric Estimators and Test Statistics”, Ch. 15 in: Z. Griliches and M.D. Intriligator, eds., Handbook of Econometrics, Vol 2, Amsterdam, North-Holland. Rudin, W. (1976) Principles ofMathematical Analysis, New York: McGraw-Hill. Sargan, J.D. (1959) “The Estimation of Relationships with Autocorrelated Residuals by the Use of Instrumental Variables”, Journal of the Royal Statistical Society Series B, 21, 91-105. Serfling, R.J. (1980) Approximation Theorems of MathematicalStatistics, New York: Wiley. Stoker, T. (1991) “Smoothing Bias in the Measurement of Marginal Effects”, MIT Sloan School Working Paper, WP3377-91-ESA. Stone, C. (1975) “Adaptive Maximum Likelihood Estimators of a Location Parameter”, Annals of Statistics, 3, 267-284. Tauchen, G.E. (1985) “Diagnostic Testing and Evaluation of Maximum Likelihood Models”, Journal of Econometrics, 30, 4 155443. Van der Vaart, A. (1991) “On Differentiable Functionals”, Annals ofStatistics, 19, 178204. Wald (1949) “Note on the Consistency of the Maximum Likelihood Estimate”, Annals ofMathematical Statistcs, 20, 595-601. White, H. (1980) “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity”, Econometrica, 48, 8177838. White, H. (1982a)“Maximum Likelihood Estimation ofMisspecified Models”, Econometrica, 50, l-25. White, H. (1982b) “Consequences and Detection of Misspecified Linear Regression Models”, Journal of the American Statistical Association, 76, 419-433.
Chapter 37
EMPIRICAL PROCESS IN ECONOMETRICS DONALD Co&s
METHODS
W.K. ANDREWS’
Foundation Yale University
Contents Abstract 1. Introduction 2. Weak convergence 3. Applications
4.
and stochastic
3.1.
Review of applications
3.2.
Parametric
Tests when a nuisance
3.4.
Semiparametric
Stochastic Primitive
4.2.
Examples
5. Stochastic 6. Conclusion Appendix References
based on non-differentiable parameter
is present
criterion
functions
only under the alternative
conditions
2267
via symmetrization
for stochastic
equicontinuity
2255 2259 2263
estimation
equicontinuity
4.1.
equicontinuity
2253
M-estimators
3.3.
2248 2248 2249 2253
2267
equicontinuity
2273
2276 2283 2284 2292
via bracketing
‘This paper is a substantial revision of the first part of the paper Andrews (1989). I thank D. McFadden for comments and suggestions concerning this revision. I gratefully acknowledge research support from the Alfred P. Sloan Foundation and the National Science Foundation through a Research Fellowship and grant nos. SES-8618617, SES-8821021, and SES-9121914 respectively.
Handbook of Econometrics, Volume IV, Edited by R.F. En& 0 1994 Elsevier Science B. V. All rights reserved
and D.L. McFadden
D. W.K. Andrew
2248
Abstract
This paper provides an introduction to the use of empirical process methods in econometrics. These methods can be used to establish the large sample properties of econometric estimators and test statistics. In the first part of the paper, key terminology and results are introduced and discussed heuristically. Applications in the econometrics literature are briefly reviewed. A select set of three classes of applications is discussed in more detail. The second part of the paper shows how one can verify a key property called stochastic equicontinuity. The paper takes several stochastic equicontinuity results from the probability literature, which rely on entropy conditions of one sort or another, and provides primitive sufficient conditions under which the entropy conditions hold. This yields stochastic equicontinuity results that are readily applicable in a variety of contexts. Examples are provided.
1.
Introduction
This paper discusses the use of empirical process methods in econometrics. It begins by defining, and discussing heuristically, empirical processes, weak convergence, and stochastic equicontinuity. The paper then provides a brief review of the use of empirical process methods in the econometrics literature. Their use is primarily in the establishment of the asymptotic distributions of various estimators and test statistics. Next, the paper discusses three classes of applications of empirical process methods in more detail. The first is the establishment of asymptotic normality of parametric M-estimators that are based on non-differentiable criterion functions. This includes least absolute deviations and method of simulated moments estimators, among others. The second is the establishment of asymptotic normality of semiparametric estimators that depend on preliminary nonparametric estimators. This includes weighted least squares estimators of partially linear regression models and semiparametric generalized method of moments estimators of parameters defined by conditional moment restrictions, among others. The third is the establishment of the asymptotic null distributions of several test statistics that apply in the nonstandard testing scenario in which a nuisance parameter appears under the alternative hypothesis, but not under the null. Examples of such testing problems include tests of variable relevance in certain nonlinear models, such as models with BoxCox transformed variables, and tests of cross-sectional constancy in regression models. As shown in the first part of the paper, the verification of stochastic equicontinuity in a given application is the key step in utilizing empirical process results. The
Ch. 37: Empirical Process
Methods
2249
in Econometrics
second part of the paper provides methods for verifying stochastic equicontinuity. Numerous results are available in the probability literature concerning sufficient conditions for stochastic equicontinuity (references are given below). Most of these results rely on some sort of entropy condition. For application to specific estimation and testing problems, such entropy conditions are not sufficiently primitive. The second part of the paper provides an array of primitive conditions under which such entropy conditions hold, and hence, under which stochastic equicontinuity obtains. The primitive conditions considered here include: differentiability conditions, Lipschitz conditions, LP continuity conditions, Vapnikkcervonenkis conditions, and combinations thereof. Applications discussed in the first part of the paper are employed to exemplify the use of these primitive conditions. The empirical process results discussed here apply only to random variables (rv’s) that are independent or m-dependent (i.e. independent beyond lags of length m). There is a growing literature on empirical processes with more general forms of temporal dependence. See Andrews (1993) for a review of this literature. The remainder of this paper is organized as follows: Section 2 defines and discusses empirical processes, weak convergence, and stochastic equicontinuity. Section 3 gives a brief review of the use of empirical process methods in the econometrics literature and discusses three classes of applications in more detail. Sections 4 and 5 provide stochastic equicontinuity results of the paper. Section 6 provides a brief conclusion. An Appendix contains proofs of results stated in Sections 4 and 5.
2.
Weak convergence and stochastic equicontinuity
We begin by introducing some notation. Let ( Wr,: t G T, T 2 l} be a triangular array of w-valued rv’s defined on a probability space (0, d, P), where w is a (Bore1 measurable) subset of Rk. For notational simplicity, we abbreviate W,, by W, below. Let .Y be a pseudometric space with pseudometric p.* Let A! = {m(~,z):z&~-) be a class of R”-valued functions empirical process vT(.) by
VT(~)= it Jr1
(2.1) defined
Cm(W,, r) - Em( W,, 7)]
on -ly- and indexed
for
r~r-,
by KEY. Define
an
(2.2)
‘That is, F is a metric space except that p(~, , TV)= 0 does not necessarily imply that r1 = r2. For example, the class of square integrable functions on [0, 11 with p(s,,r,) = [lA(T,(W) - T2(W))Zdw]1’2.is a pseudometric space, but not a metric space. The reason is that if rr(w) equals T?(W) for all w except one point, say, then ~(5,. T2) = 0, but TV # TV. In order to handle sets Y that are function spaces of the above type, we allow F to be a pseudometric space rather than a (more restrictive) metric space.
D.W.K.
2250
Andrew
where CT abbreviates xF= i. The empirical process vT(.) is a particular type of stochastic process. If Y = [0,11, then vT(.) is a stochastic process on [0,11. For parametric applications of empirical process theory, Y is usually a subset of RP. For semiparametric and nonparametric,applications, Y is often a class of functions. In some other applications, such as chi-square diagnostic test applications, .q is a class of subsets of RP. We now define weak convergence of the sequence of empirical processes {vT(.): T 2 l} to some stochastic process v(.) indexed by elements z of Y. (v(.) may or may not be defined on the same probability space (a,,&‘, P) as vT(.) VT> 1.) Let * denote weak convergence of stochastic processes, as defined below. Let % denote convergence in distribution of some sequence of rv’s. Let 1).1)denote the Euclidean norm. All limits below are taken as T-+ 00. Definition of weak convergence
v~(.)=-v(.)
if
E*f(v,(.))+Ef(v(.))
VfWB(F_)),
where B(Y) is the class of bounded R”-valued functions on Y (which includes all realizations of vr(.) and v(.) by assumption), d is the uniform metric on B(Y) (i.e. d(b,, b2) = sup,,r 11 b,(z) - b2(7) II), and @(B(S)) is the class of all bounded uniformly continuous (with respect to the metric d) real functions on B(Y). In the definition, E* denotes outer expectation. Correspondingly, P* denotes outer probability below. (It is used because it is desirable not to require vr(.) to be a measurable random element of the metric space (B(Y), d) with its Bore1 o-field, since measurability in this context can be too restrictive. For example, if (B(Y), d) is the space of functions D[O, l] with the uniform metric, then the standard empirical distribution function is not measurable with respect to its Bore1 a-field. The limit stochastic process v(.), on the other hand, is sufficiently well-behaved in applications that it is assumed to be measurable in the definition.) The above definition is due to HoffmanJorgensen. It is widely used in the recent probability literature, e.g. see Pollard (1990, Section 9). Weak convergence is a useful concept for econometrics, because it can be used to establish the asymptotic distributions of estimators and test statistics. Section 3 below illustrates how. For now, we consider sufficient conditions for weak convergence. In many applications of interest, the limit process v(.) is (uniformly p) continuous in t with probability one. In such cases, a property of the sequence of empirical processes {vr(.): T 2 11, called stochastic equicontinuity, is a key member of a set of sufficient conditions for weak convergence. It also is implied by weak convergence (if the limit process v(.) is as above).
Ch. 37:
Empirical
Dejnition
Process
Methods
of stochastic
equicontinuity
{I+(.): T> l} IS t s t oc h as t’KU11y equicontinuous
lim P -[
T+m
2251
in Econometrics
SUP T,.?2E~:&J(r,,rZ) 0 and q > 0,36 > 0 such that
>II
1
(2.3)
-Cc.
Basically, a sequence of empirical processes iv,(.): T > l} is stochastically equicontinuous if vT(.) is continuous in z uniformly over Y at least with high probability and for T large. Thus, stochastic equicontinuity is a probabilistic and asymptotic generalization of the uniform continuity of a function. The concept of stochastic equicontinuity is quite old and appears in the literature under various guises. For example, it appears in Theorem 8.2 of Billingsley (1968, p. 55), which is attributed to Prohorov (1956), for the case of 9 = [O, 11. Moreover, a non-asymptotic analogue of stochastic equicontinuity arises in the even older literature on the existence of stochastic processes with continuous sample paths. The concept of stochastic equicontinuity is important for two reasons. First, as mentioned above, stochastic equicontinuity is a key member of a set of sufficient conditions for weak convergence. These conditions are specified immediately below. Second, in many applications it is not necessary to establish a full functional limit (i.e. weak convergence) result to obtain the desired result - it suffices to establish just stochastic equicontinuity. Examples of this are given in Section 3 below. Sufficient conditions for weak convergence are given in the following widely used result. A proof of the result can be found in Pollard (1990, Section 10) (but the basic result has been around for some time). Recall that a pseudometric space is said to be totally bounded if it can be covered by a finite number of c-balls VE> 0. (For example, a subset of Euclidean space is totally bounded if and only if it is bounded.) Proposition If (i) (Y,p) is a totally bounded pseudometric space, (ii) finite dimensional (fidi) convergence holds: V finite subsets (z,, . . . , T_,)of Y-, (v,(z,)‘, . . . , ~~(7,)‘)’ converges in distribution, and (iii) {v*(.): T 3 l} is stochastically equicontinuous, then there exists a (Borel-measurable with respect to d) B(F)-valued stochastic process. v(.), whose sample paths are uniformly p continuous with probability one, such that VT(.)JV(.). Conversely, if v=(.)*v(.) (ii) and (iii) hold.
for v(.) with the properties
above
Condition (ii) of the proposition typically is verified by central limit theorem (CLT) (or a univariate CLT coupled device, see Billingsley (1968)). There are numerous CLTs in different configurations of non-identical distributions and
and (i) holds, then
applying a multivariate with the Cramer-Wold the literature that cover temporal dependence.
D. W.K. Andrews
2252
Condition (i) of the proposition is straightforward to verify if Y is a subset of Euclidean space and is typically a by-product of the verification of stochastic equicontinuity in other cases. In consequence, the verification of stochastic equicontinuity is the key step in verifying weak convergence (and, as mentioned above, is often the desired end in its own right). For these reasons, we provide further discussion of the stochastic equicontinuity condition here and we provide methods for verifying it in several sections below. Two equivalent definitions of stochastic equicontinuity are the following: (i) {v,(.): T 3 1) is stochastically equicontinuous if for every sequence of constants (6,) that converges to zero, we have SUP~(,~,~~)~~~IV~(Z~)- vT(rZ)l 30 where “A” denotes convergence in probability, and (ii) {vT(.): vT 3 l} is stochastically equicontinuous if for all sequences of random elements {Z^iT} and {tZT} that satisfy p(z^,,,f,,) LO, we have v,(Q,,) - v,(z^,,) L 0. The latter characterization of stochastic equicontinuity reflects its use in the semiparametric examples below. Allowing {QiT} and {tZT} to be random in the latter characterization is crucial. If only fixed sequences were considered, then the property would be substantially weaker-it would not deliver the result that vT(z*,,)- vr.(fZT) 30 ~ and its proof would be substantially simpler - the property would follow directly from Chebyshev’s inequality. To demonstrate the plausibility of the stochastic equicontinuity property, suppose JZ contains only linear functions, i.e. ~2’ = {g: g(w) = w’t for some FERN} and p is the Euclidean metric. In this simple linear case,
< E,
(2.4)
where the first inequality holds by the CauchyySchwarz inequality and the second inequality holds for 6 sufficiently small provided (l/J?)x T( W, - E IV,) = O,( 1). Thus, Iv,(.): T 3 l} is stochastically equicontinuous in this case if the rv’s { W, - E W,: t < T, T 2 l} satisfy an ordinary CLT. For classes of nonlinear functions, the stochastic equicontinuity property is substantially more difficult to verify than for linear functions. Indeed, it is not difficult to demonstrate that it does not hold for all classes of functions J?‘. Some restrictions on .k are necessary ~ ~2! cannot be too complex/large. To see this, suppose { W,: t d T, T 3 l} are iid with distribution P, that is absolutely continuous with respect to Lebesgue measure and J$? is the class of indicator
Ch. 37: Empirical Process Methods in Econometrics
2253
functions of all Bore1 sets in %‘“.Let z denote a Bore1 set in w and let Y denote the collection of all such sets. Then, m(w, t) = l(w~r). Take p(r,, z2) = (J(m(w, ri) m(w, rz))*dPl(w)) ‘I* . For any two sets zl, r2 in Y that have finite numbers of elements, v,(zj) = (l/$)C~l(W,~t~) and p(r1,z2) = 0, since P1(WI~tj) = 0 forj = 1,2. Given any T 2 1 and any realization o~Q, there exist finite sets tlTo and rZTwin Y such that W,(o)~r,,~ and IVJo)$r,rwVt d T, where W,(o) denotes the value of W, when o is realized. This yields vr-(riTw) = @, v~(~*~J = 0, and supP(rl,Q), l}, one obtains
(For example, if the rv’s a[Em( W,, z,)]/at’ continuous
fi(z^- TO)= (M (Here, o,(l) denotes
-
fitiT
1+ op(l)@%*,(t).
a term that converges
Now, the asymptotic process methods
distribution
to determine
(3.5) in probability
of fi(s*
the asymptotic
to zero as T + co.)
- re) is obtained distribution
of fitiT(
by using empirical We write
= [J’rrn,(Q- @ii;(t)] - JTrn,(t*) = tvT@) -
The term able sum
W, are identically distributed, it suffices to have in r at r,,.) Thus, provided M is nonsingular, one has
vT(TO)) +
vTh,) -
fifi,(t).
(3.6)
third term on the right hand side (rhs) of (3.6) is o,(l) by (3.1). The second on the rhs of (3.6) is asymptotically normal by an ordinary CLT under suitmoment and temporal dependence assumptions, since vr(t,,) is a normalized of mean zero rv’s. That is, we have
(3.7) where S = lim,, m varC(lIJT)CTm(w,,~,)l. F or example, if the rv’s W, are independent and identically distributed (iid), it suffices to have S = Em( W,, z,)m( W,, to) well-defined.) Next, the first term on the rhs of (3.6) is o,(l) provided {vT(.): T 2 l> is stochastically equicontinuous and Z Lro. This follows because given any q > 0 and E > 0, there exists a 6 > 0 such that
Ch. 37: Empirical Process Methods
2257
in Econometrics
lim P(I vT(f)- vTh)l > d
T-ra,
d lim P( 1VT(t) - vT(ro)l > q, p(t, To) d 6) + lim P(p(z*, rO) > 6) T+OZ
T-rm
d
lim P -(
T-CC
(3.8)
6
where the second inequality uses z^A t0 and the third uses stochastic Combining (3.5))(3.8) yields the desired result that
JT(r*-
zo) L N(O,M-'S(M-I)')
as T+
equicontinuity.
co.
(3.9)
It remains to show how one can verify the stochastic equicontinuity of (VT(.): T 2 l}. This is done in Sections 4 and 5 below. Before doing so, we consider several examples. Example 1 M-estimators for standard, censored and truncated linear regression model. In the models considered here, {(K, X,): t d T} are observed rv’s and {(Y:, XT): t d T} are latent rv’s. The models are defined by Yf = linear
xye, + u,, regression
censored truncated
t=l,...,T,
(LR):
regression regression
(YE X,) = (Y:, X:)7 (Y,, X,) = (Y: 1(Y:
(CR): (TR):
Depending upon the context, assumptions such as constant about zero for all t. We need We consider M-estimators
2 Cl), Xf),
(q, X,) = (Y: 1(Y: 2 0), XT 1(Y:
2 0)).
(3.10)
the errors (U,} may satisfy any one of a number of conditional mean or quantile for all t or symmetry not be specific for present purposes. ? of r,, that satisfy the equations
(3.11)
O=~ci/l(r,-X;r*)~,(w,,~)X* with probability + 1 as T-, co, where W, = (Y,, Xi,‘. Such estimators framework of (3.1)-(3.2) with m(w, r) = 11/i(y - x’~)$~(w, r)x,
where w = (y, x’)‘.
fit the general
(3.12)
D. WK. Andrews
2258
Examples of such M-estimators in the literature include the following: (a) LR model: Let $r(z) = sgn(z) and tiz = 1 to obtain the least absolute deviations (LAD) estimator. Let $r(z) = q - l(y - x’~ < 0) and $* = 1 to obtain Koenker and Bassett’s (1978) regression quantile estimator for quantile qE(O, 1). Let rc/1(z) = (z A c) v (- c) (where A and v are the min and max operators respectively) and $z = 1 to obtain Huber’s (1973) M-estimator with truncation at + c. Let $t (z) = 1q - 1(y - x’t < O)l and $z(w, r) = y - x’s to obtain Newey and Powell’s (1987) asymmetric LS estimator. (b) CR model: Let $r(z) = q - 1(y - x’r < 0) and tjz(w, r) = l(x’r > 0) to obtain Powell’s (1984, 1986a) censored regression quantile estimator for quantile qE(O, 1). Let $r = 1 and tjz(w, r) = 1(x? > O)[(y - x’r) A x’r] to obtain Powell’s (1986b) symmetrically trimmed LS estimator. (c) TR model: Let $r = 1 and $z(w, r) = l(y < 2x’t)(y - x’r) to obtain Powell’s (1986b) symmetrically trimmed LS estimator. (Note that for the Huber M-estimator of the LR model one would usually simultaneously estimate a scale parameter for the errors U,. For brevity, we omit this above.) Example
2
Method of simulated moments (MSM) estimator for multinomial probit. The model and estimator considered here are as in McFadden (1989) and Pakes and Pollard (1989). We consider a discrete response model with r possible responses. Let D, be an observed response vector that takes values in {ei: i = 1,. . . , I}, where ei=(O ,..., O,l,O ,..., 0)’ is the ith elementary r-vector. Let Zli denote an observed b-vector of covariates - one for each possible response i = 1,. , r. Let Z, = Z;J’. The model is defined such that cZ:r’Z:2’...’ D, = e,
if
(Zti - Z,,)‘(j3(s0) + A(r,)U,)
3 0
Vl = 1,. . . , r,
(3.13)
where U, N N(O,Z,) is an unobserved normal rv, /3(.) and A(.) are known RbX ‘and RbX ‘-valued functions of an unknown parameter rOey c RP. McFadden’s MSM estimator of r0 is constructed using s independent simulated N(0, I,) rv’s (Y,, , . . . , Y,,)’ and a matrix of instruments g(Z,, r), where g(., .) is a known R” b-valued function. The MSM estimator is an example of the estimator of (3.1)-(3.2) with W, L (D,, Z,, Ytl,. . . , Y,,) and m(“‘, r) = g(z, r)
(
d - 1 ,gI HCz(P(r) + A(z)Yj)I 2
where w = (d, z, y,, , . . , y,). Here, H[.] is of the form
nl 1=1
CCzi
-
(3.14)
>
J
is a (0, I}‘-valued
zlY(B(t)+ A(z)Yj) 3 Ol.
function
whose ith element
(3.15)
Ch. 37: Empirical Process Methods in Econometrics
3.3.
Tests when a nuisance parameter
2259
is present only under the alternative
In this section we consider a class of testing problems for which empirical process limit theory can be usefully exploited. The testing problems considered are ones for which a nuisance parameter is present under the alternative hypothesis, but not under the null hypothesis. Such testing problems are non-standard. In consequence, the usual asymptotic distributional and optimality properties of likelihood ratio (LR), Lagrange multiplier (LM), and Wald (W) tests do not apply. Consider a parametric model with parameters 8 and T, where & 0 c R”, TEF c R”. Let 0 = (/I’, S’)‘, where BERN, and FERN, and s = p + q. The null and alternative hypotheses of interest are H,:
/I=0
H,:
pzo.
and
(3.16)
Under the null hypothesis, the distribution parameter r by assumption. Under the examples are the following. Example
of the data does not depend on the alternative hypothesis, it does. Two
3
This example variable/vector
is a test for variable relevance. We want to test whether a regressor Z, belongs in a nonlinear regression model. This model is
Y,=dX,,4)+LWZ,,z)+
U,,
u,-N(o,d,),
t= l,...,~.
(3.17)
The functions g and h are assumed known. The parameters (/?,bl,fi2, r) are unknown. The regressors (X,,Z,) and/or the errors U, are presumed to exhibit some sort of asymptotically weak temporal dependence. As an example, the term h(Z,,r) might be of the Box-Cox form (Z: - 1)/r. Under the null hypothesis H,: /I = 0, Z, does not enter the regression function and the parameter r is not present. Example 4 This example is a test of cross-sectional constancy in a nonlinear regression model. A parameter r (ERR) partitions the sample space of some observed variable Z, (E R’) into two regions. In one region the regression parameter is 6, (ERR) and in the other region it is 6, + /I. A test of cross-sectional constancy of the regression parameters corresponds to a test of the null hypothesis H,: p = 0. The parameter r is present only under the alternative. To be concrete, the model is
for Wt34>0 for
for
t=l
T ,...,
h(Z,,z) 6 0
3
(3.18)
D. W.K. Andrew
2260
where the errors CJ, N iid N(O,6,), the regressors X, and the rv Z, are m-dependent and identically distributed, and g(.;) and h(.;) are known real functions, For example, h(Z,,t) could equal Z, - r, where the real rv Z, is an element of X,, an element of Xt_d for some integer d 2 1, or Y,_, for some integer d > 1. The model could be generalized to allow for more regions than two. Problems of the sort considered above were first treated in a general way by Davies (1977, 1987). Davies proposed using the LR test. Let LR(r) denote the LR test statistic (i.e. minus two times the log likelihood ratio) when t is specified under the alternative. For given r, LR(r) has standard asymptotic properties (under standard regularity conditions). In particular, it converges in distribution under the null to a random variable X2(r) that has a xi distribution. When r is not given, but is allowed to take any value in y, the LR statistic is (3.19)
sup LR(r). rsf
This statistic has power against a much wider variety of alternatives than the statistic LR(r) for some fixed value of r. To mount a test based on SUP,,~ LR(r), one needs to determine its asymptotic null distribution. This can be achieved by establishing that the stochastic process LR(r), viewed as a random function indexed by r, converges weakly to a stochastic process X’(r). Then, it is easy to show that the asymptotic null distribution of SUP,,~ LR(t) is that of the supremum of the chi-square process X’(r). The methods discussed below can be used to provide a rigorous justification of this type of argument. Hansen (1991) extended Davies’ results to non-likelihood testing scenarios, considered LM versions of the test, and pointed out a variety of applications of such tests in econometrics. A drawback of the supLR test statistic is that it does not possess standard asymptotic optimality properties. Andrews and Ploberger (1994) derived a class of tests that do. They considered a weighted average power criterion that is similar to that considered by Wald (1943). Optimal tests turn out to be average exponential tests:
Exp-LR = (1 + c)-~‘~ ]exp(
k&
LRo)dJW.
(3.20)
where J(.) is a specified weight function over r~9 and c is a scalar parameter that indexes whether one is directing power against close or distant alternatives (i.e. against b small or /I large). Let Exp-LM and Exp-W denote the test statistic defined as in (3.20) but with LR(t) replaced by LM(7) and W(7), respectively, where the latter are defined analogously to LR(7). The three statistics Exp-LR,
Ch. 37: Empirical Process Methods
2261
in Econometrics
Exp-LM, and Exp-W each have asymptotic optimality properties. Using empirical process results, each can be shown to have an asymptotic null distribution that is a function of the stochastic process X”(z) discussed above. First, we introduce some notation. Let I,(B,r) denote a criterion function that is used to estimate the parameters 6’ and r. The leading case is when l,(Q, r) is the log likelihood function for the sample of size T. Let D&.(8, r) denote the s-vector of partial derivatives of I,(Q,r) with respect to 8. Let 8, denote the true value of 8 under the null hypothesis H,, i.e. B0 = (0,s;)‘. (Note that D1,(8,, r) depends on z in general even though I,(B,,s) does not.) By some manipulations (e.g. see Andrews and Ploberger (1994)), one can show that the test statistics SUP~~,~LR(r), Exp-LR, Exp-LM, and Exp-W equal a continuous real function of the normalized score process {D/,(0,, r)/,,@: try-) plus an op( 1) term under H,. In view of the continuous mapping theorem (e.g. see Pollard (1984, Chapter 111.2)), the asymptotic null distributions of these statistics are given by the same functions of the limit process More specifically, let
VT(T)=
as T-r co of {D1,(8,, r)/fi:
AN,(A,,7).
reF_).
(3.21)
Jr
(Note that EDIr(BO,r) = 0 under Ho, since these are the population first order conditions for the estimator.) Then, for some continuous function g of v,(.), we have sup LR(r) = g(vT(.)) + o,(l) re.!7
under
H,.
(3.22)
(Here, continuity is defined with respect to the uniform metric d on the space of bounded R”-valued functions on Y-, i.e. B(Y).) If vr.(.)* v(.), then
;,“,p LR(r) 5
g(v(.))
under
H,,
(3.23)
which is the desired result. The distribution of g(v(.)) yields asymptotic critical values for the test statistic SUP,,,~ LR(z). The results are analogous for Exp-LR, Exp-LM, and Exp-W. In conclusion, if one can establish the weak convergence result, v=(.)*v(.) as T-t co, then one can obtain the asymptotic distribution of the test statistics of interest. As discussed in Section 2, the key condition for weak convergence is stochastic equicontinuity. The verification of stochastic equicontinuity for Examples 3 and 4 is discussed in Sections 4 and 5 below. Here, we specify the form of v=(z) in these examples.
2242
Examples
D. WK.
Andrews
the assumption
of iid
3 (continued)
In this example, normal errors:
1,(O,r) is the log likelihood
function
under
and
VT(Z)
=
1
D1,(0,,
7)
(3.24)
=
fi
Since 7 only appears in the first term, it suffices to show that { (l/fl)xTU,h(Z,, T 3 l} is stochastically equicontinuous.
.):
Example 4 (continued) In this cross-sectional constancy example, I(& 7) is the log likelihood the assumption of iid normal innovations:
function
under
Since 7 only appears in the first term, it suffices to show that {(I/fi)CTU, a[g(X,, 8,,)]/&S, 1(h(Z,;) d 0): T 3 I} is stochastically equicontinuous.
x
MO,
7) =
-
+gZnd,
-+f:
[r, 2
-
dx,,
61
+
B)
l(W,,
-
g(X,,6J
l(h(Z,,z)
> 0)
1 7)
d
(31’
and
L
D&-(0,,
7) =
Jr
2263
Ch. 37: Empirical Process Methods in Econometrics
3.4.
Semiparametric
estimation
We now consider the application of stochastic equicontinuity results to semiparametric estimation problems. The approach that is discussed below is given in more detail in Andrews (1994a). Other approaches are referenced in Section 3.1 above. Consider a two-stage estimator e of a finite dimensional parameter 0e~ 0 c R’. In the first stage, an infinite dimensional parameter estimator z*is computed, such as a nonparametric regression or density estimator or its derivative. In the second stage, the estimator 8 of 8, is obtained from a set of estimating equations that depend on the preliminary estimator t^. Many semiparametric estimators in the literature can be defined in this way. By linearizing the estimating equations, one can show that the asymptotic distribution of ,/?((8- 19,)depends on an empirical process vr(t), evaluated at the preliminary estimator f. That is, it depends on vr(?). To obtain the asymptotic distribution of 8, then, one needs to obtain that of vr(?). If r*converges in probability to some t0 (under a suitable pseudometric) and vT(r) is stochastically equicontinuous, then one can show that v=(f) - Q(Q) 50 and the asymptotic behavior of ,/?(e^- 19,) depends on that of v&& which is obtained straightforwardly from an ordinary CLT. Thus, one can effectively utilize empirical process stochastic equicontinuity results in establishing the asymptotic distributions of semiparametric estimators. We now provide some more details of the argument sketched above. Let the data consist of {W,: t Q T}. Consider a system of p estimating equations ti,(B, f) = f
$m(e, f),
(3.26)
where m(0, r) =, m( W,, 8, z) and m(., ., *) is an RP-valued known function. Suppose the estimator 0 solves the equations
J%iT(& f)
= 0
(3.27)
(at least with probability that goes to one as T+ CO). These equations might be the first order conditions from some minimization problem. We suppose consistency of 8 has already been established, i.e. e-%0, (see Andrews (1994:) for sufficient conditions). We wish to determine the asymptotic distribution of 8. When m( W,, 8, t) is a smooth function of 8, the following approach can be used. Element by element mean value expansions stacked yield
o,(l) = w& 4 = .JTm,(e,,f) + a[rii,(e*,
f)yaelfi@-
e,),
(3.28)
where 8* lies between 6 and 0, (and 0* may differ from row to row in
D.W.K. Andrews
2264
a[fi,(O*,
z*)],W’). Under
suitable
conditions,
(3.29) Thus, JT(e^-
l + o,(l))Jrrn,(O,,~)
0,) = -(A!_
= - (M- 1 + o,(l))CJr(m,(e,,
t*) - m;(e,,z*)) + @ii;(8,,
?)I, (3.30)
where ti*,(O,z) = (l/T)CTEm(W,, 8,~). Again under suitable conditions, either
for some covariance Let
VT(4=
matrix
JeMe,,
z) -
A, see Andrews
(1994a).
fqe,, t)).
(3.32)
Note that v=(.) is a stochastic process indexed by an infinite dimensional parameter in this case. This differs from the other examples in this section for which r is finite dimensional. Under standard conditions, one can establish that
%-bo)5 N(O,S) for some covariance can show that
(3.33)
matrix
S, by applying
an ordinary
CLT. If, in addition,
VT(z*) - VT(%)J+0,
one
(3.34)
then we obtain
JT(&
e,) = -(M-l
= - M5 which is the desired
+
‘CVTh)+ JTmge,,f)]
N(O, M - ‘(S + A)(M result,
@qe,,?)I
%U))CVT(Q) +
‘)‘),
+ O,(l) (3.35)
2265
Ch. 37: Empirical Process Methods in Econometrics
To prove (3.34), we can use the stochastic is stochastically (i) {v,(.): T 2 1) and pseudometric p on r-, (ii) P(QEF)+ 1, and (iii) p(?, tO) J+ 0,
equicontinuity equicontinuous
property.
Suppose
for some choice of F
(3.36)
then (3.34) holds (as shown below). Note that there exist tradeoffs between conditions (i), (ii), and (iii) of l(3.36) in terms of the difficulty of verification and the strength of the regularity conditions needed. For example, a larger set Y makes it more difficult to verify (i), but easier to verify (ii). A stronger pseudometric p makes it easier to verify (i), but more difficult to verify (iii). Since the sufficiency of (3.36) for (3.34) is the key to the approach considered here, we provide a proof of this simple result. We have: V E > 0, V n > 0,3 6 > 0 such that
lim P(I vT(z*) - vT(d > rl)
T-30
< lim P(
I VT(?)
-
vT(zo)
1>
q, QEF,
p(t,
zo)
d
6)
T-CC +
lim P(2#Y
or
p(t^,r,) > 6)
T+CC
1vT(z)
sup re.F:
‘1
< d
(3.37)
E,
where the term on the third line of (3.37) is zero by (ii) and (iii) and the last inequality holds by (i). Since E > 0 is arbitrary, (3.34) follows. To conclude, one can establish the fi-consistency and asymptotic normality of the semiparametric estimator 6 if one can establish, among other things, that {v,(.): T 2 l} is stochastically equicontinuous. Next, we consider the application of this approach to two examples and illustrate the form of vT(.) in these examples. In Sections 4 and 5, we discuss the verification of stochastic equicontinuity when “M = {m(., t): ZEY} is an infinite dimensional class of functions. Example
5
This example considers a weighted least squares (WLS) estimator linear regression (PLR) model. The PLR model is given by Y, = X:6’, + g(Z,) + U,
and
E( U,I X,, Z,) = 0
a.s.
of the partially
(3.38)
D. W.K. Andrew
2266
W, = (Y,,X:,Z:)’ is iid or for t= l,..., T, where the real function g(.) is unknown, m-dependent and identically distributed, Y,, U,eR, X,, tl,ERP and Z,eRka. This model is also discussed by Hlrdle and Linton (1994) in this handbook. The WLS estimator is defined for the case where the conditional variance of U, given (X,, Z,) depends only on Z,. This estimator is a weighted version of Robinson’s (1988) semiparametric LS estimator. The PLR model with heteroskedasticity of the above form can be generated by a sample selection model with nonparametric selection equation (e.g. see Andrews (1994a)). Let rlO(Z,) = E(Y,IZ,),r,,(Z,) = E(X,IZ,), r3JZt) = E(U: IZ,) and r. = (riO, rio, rzo)). Let fj(.) be an estimator of tjO(.) for j = 1,2,3. The semiparametric WLS estimator of the PLR model is given by -1
e=[ 51 5wt)(X, -
Z^2(Zt))(Xt - ~*m)‘/~,m
1
x i 5wJw, - z^2(Zt))(yt- ~lm)/~,(z,),
(3.39)
1
where r( W,) = l(Z,~f%“*) is a trimming function and 5?* is a bounded Rka. This estimator is of the form (3.16)-(3.17) with m(K, 8, f) = S(K)Cr,
- %(Z,) - (X, - z^,(Z,))‘ei LX, - e,(Z,)l/t3(Z,).
subset
of
(3.40)
To establish the asymptotic normality of z^using the approach above, one needs to establish stochastic equicontinuity for the empirical process vr(.) when the class of functions JJ’ is given by J? = {m(., Bo, t):
ZEF}
where
m(w, eo, r) =
1
of
(3.42)
Ch. 37: Empirical
Process Methods
2261
in Econometrics
for some specified R”-valued function in econometrics are quite numerous,
Ic/(., .), where X,eRkn. Examples of this model see Chamberlain (1987) and Newey (1990).
Let %(X,) = E($(Z,, t%)$(z,, &)‘lX,), d&X,) = ECaC$(z,, 4Jll~@I~,l and to(X,) = d,(X,)‘R, ‘(X,). By assumption, a,(.), A,(.), and rO(.) do not depend on t. Let fi(.) and A(.) be nonparametric estimators of a,(.) and A,(.). Let t*(.) = d^(.)‘lt;,- ‘(.). Let W, = (Z;, Xi)‘. A GMM estimator 6 of B,, minimizes
over
0~ 0 c RP’,
(3.43)
where 9 is a data-dependent weight matrix. To obtain the asymptotic distribution of this estimator using the approach above, we need to establish a stochastic equicontinuity result for the empirical process vT(.) when the class of functions J? is given by M = {m(., do, 5): TEL?-},
where
m(w, &, r) = r(x)lcI(z, 6,) = A(x)‘nw = (z’, x’) and Y is defined
4. 4.1.
Stochastic
equicontinuity
‘(x)$(z,
&,),
(3.44)
below.
via symmetrization
Primitive conditions for stochastic
equicontinuity
In this section we provide primitive conditions for stochastic equicontinuity. These conditions are applied to some of the examples of Section 3 in Section 4.2 below. We utilize an empirical process result of Pollard (1990) altered to encompass m-dependent rather than independent rv’s and reduced in generality somewhat to achieve a simplification of the conditions. This result depends on a condition, which we refer to as Pollard’s entropy condition, that is based on how well the functions in JV can be approximated by a finite number of functions, where the distance between functions is measured by the largest L’(Q) distance over all distributions Q that have finite support. The main purpose of this section is to establish primitive conditions under which the entropy condition holds. Following this, a number of examples are provided to illustrate the ease of verification of the entropy condition. First, we note that stochastic equicontinuity of a vector-valued empirical process (i.e. s > 1) follows from the stochastic equicontinuity of each element of the empirical process. In consequence, we focus attention on real-valued empirical processes (s = 1).
D. W.K. Andrews
2268
The pseudometric
p on Y is defined
E(m(W,,
71)
-
in this section
m(W,,
by
1’2.3
72))2
(4.1)
>
Let Q denote a probability measure on W. For a real function f on W, let Qf 2 = 1% f*(w)dQ(w). Let 9 be a class of functions in c(Q). The L2(Q) cover numbers of 9 are defined as follows: Definition For any E > 0, the cover number N*(E, Q, F) is the smallest value of n for which in 4 such that minj, ,(Q(f - fj)*)li2 < EVlf~p. there exist functions fI, . . ,f,, N2(&, Q, 9) = co if no such n exists. The log of N2(&,Q,S7 is referred to as the L*(Q) &-entropy of 9. Let 2 denote the class of all probability measures Q on W that concentrate on a finite set. The following entropy/cover number condition was introduced in Pollard (1982). Definition A class F of real functions
defined
on W satisfies Pollard’s entropy condition
if
1
sup [log N*(E(QF~)“~, Q, F)]1’2 de < co, s o QE~
(4.2)
where F is some envelope function for 9, i.e. F is a real function on W” for which If(.)1 < F(.)V’fEF. As ~10, the cover number N2(&(QF2)“*, Q,p) increases. Pollard’s entropy condition requires that it cannot increase too quickly as ~10. This restricts the complexity/size of 9 and does so in a way that is sufficient for stochastic equicontinuity given suitable moment and temporal dependence assumptions. In particular, the following three assumptions are sufficient for stochastic equicontinuity. Assumption
A
JZZ satisfies Pollard’s Assumption
entropy
condition
with some envelope
ti.
B
lim T_ 3. ( l/T)CTEti2
“(IV,) < CCfor some 6 > 0, where M is as in Assumption
A.
3The pseudometric p(., .) is defined here using a dummy variable N (rather than T) to avoid confusion when we consider objects such as plim T_rcp(Q,so). Note that p(.;) is taken to be independent of the sample size T.
Ch. 37: Empirical Process
Assumption
Methods
C
( W,: t < T, T 2 1j is an m-dependent Theorem
2269
in Econometrics
triangular
array of rv’s.
1 (Pollard)
Under Assumptions given by (4.1).
A-C,
{vT(.): T > l} is stochastically
equicontinuous
with p
Comments (1) Theorem 1 is proved using a symmetrization argument. In particular, one obtains a maximal inequality for vT(r) by showing that SUP,,,~ 1vT(t)j is less variable 1o,m( W,, z)l, where (6,: t d T} are iid rv’s that are indepenthan suproY l(l/fi)CT dent of { W,: t < T) and have Rudemacher distribution (i.e. r~( equals + 1 or - 1, each with probability i). Conditional on { W,} one performs a chaining argument that relies on Hoeffding’s inequality for tail probabilities of sums of bounded, mean zero, independent rv’s. The bound in this case is small when the average sum of squares of the bounds on the individual rv’s is small. In the present case, the latter ultimately is applied to the is just (lIT)Clm T ’ ( W t, z). The maximal inequality empirical measure constructed from differences of the form m( W,, zl) - m( W,, r2) rather than to just m(W,, z). In consequence, the measure of distance between m(.,z,) and m(.,z,) that makes the bound effective is an L2(P,) pseudometric, where P, denotes the empirical distribution of (W,: t d Tj. This pseudometric is random and depends on T, but is conveniently dominated by the largest L2(Q) pseudometric over all distributions Q with finite support. This explains the appearance of the latter in the definition of Pollard’s entropy condition. To see why Pollard’s entropy condition takes the precise form given above, one has to inspect the details of the chaining argument. The interested reader can do so, see Pollard (1990, Section 3). (2) When Assumptions A-C hold, F is totally bounded under the pseudometric p provided p is equivalent to the pseudometric p* defined by p*(z,,z2) = zi) - m(W,, T2))2]1’2. By equivalent, we mean that &,+ co[(l/N)CyE(m(W,, p*(~, , TV)2 Cp(z,, z2) V tl, Z*EF for some C > 0. (p*(~i, z2) < p(r,, ZJ holds automatically.) Of course, p equals p* if the rv’s W, are identically distributed. The proof of total boundedness is analogous to that given in the proof of Theorem 10.7 in Pollard (1990). Combinatorial arguments have been used to establish that certain classes of functions, often referred to as Vapnik-Cervonenkis (VC) classes of one sort or another, satisfy Pollard’s entropy condition, see Pollard (1984, Chapter 2; 1990, Section 4) and Dudley (1987). Here we consider the most important of these VC classes for applications (type I classes below) and we show that several other classes of functions satisfy Pollard’s entropy condition. These include Lipschitz functions
D.W.K. Andrew
2270
indexed by finite dimensional parameters (type II classes) and infinite dimensional classes of smooth functions (type III classes). The latter are important for applications to semiparametric and nonparametric problems because they cover realizations of nonparametric estimators (under suitable assumptions). Having established that Pollard’s entropy condition holds for several useful classes of functions, we proceed below to show that functions from these classes can be “mixed and matched”, e.g. by addition, multiplication and division, to obtain new classes that satisfy Pollard’s entropy condition. In consequence, one can routinely build up fairly complicated classes of functions that satisfy Pollard’s entropy condition. In particular, one can build up classes of functions that are suitable for use in the examples above. The first class of functions we consider are applicable in the non-differentiable M-estimator Examples 1 and 2 (see Section 3.2 above). Dejinition A class F of real functions on W is called a type I class if it is of the form (a) 8 = {f:f(w) = ~‘4 V w~-Iy- for some 5~ Y c Rk} or (b) 9 = {f:f(w) = h(w’t) V w~.q for some <E Y c Rk, hi V,}, where V, is some set of functions from R to R each with total variation less than or equal to K < co. Common choices for h in (b) include the indicator function, the sign function, and Huber $-functions, among others. For the more knowledgeable reader (concerning empirical processes), we note that it is sometimes useful to extend the definition of type I classes of functions to include various classes of functions called VC classes. By definition, such classes include (i) classes of indicator functions of VC sets, (ii) VC major classes of uniformly bounded functions, (iii) VC hull classes, (iv) VC subgraph classes, and (v) VC subgraph hull classes, where each of these classes is as defined in Dudley (1987) (but without the restriction that f > 0 V’~EF). For brevity and simplicity, we do not discuss all of these classes here. The second class of functions we consider contains functions that are indexed by a finite dimensional parameter and are Lipschitz with respect to that parameter: Dejinition A class F of real functions on W is called a type II class if each function f in F satisfies: f(.) = f(., t) for some re5-, where Y is some bounded subset of Euclidean space and f(., r) is Lipschitz in r, i.e.,
Lf(~*~l)-f(~>~2)1 k,/2, f~9 has partial derivatives of order [q] on W”* = {weW: wa~W~}; (b) the [q]th order partial derivatives of f satisfy a Lipschitz condition with exponent q - [qJ and some Lipschitz constant C* that does not depend on f V f ~9; and (c) W”,* is a convex compact set. The envelope of a type III class 9 can be taken to be a constant function, since the functions in 9 are uniformly bounded in absolute value over WEW and f~9. Type III classes can be extended to allow Wa to be a finite union of connected compact subsets of Rkm.In this case, (4.4) only needs to hold V wgW and w + hEY+‘” such that w, and w, + h, are in the same connected set in W,*.
2212
D.W.K. Andrews
In applications, type III classes of functions typically are classes of realizations of nonparametric function estimates. Since these realizations usually depend on only a subvector W,, of W, = (Wb,, Wb,)‘, it is advantageous to define type III classes to contain functions that may depend on only part of W,. By “mixing and matching” functions of type III with functions of types I and II (see below), classes of functions are obtained that depend on all of w. In applications where the subvector W,, of W, is a bounded rv, one may have YV,*= W,. In applications where W,, is an unbounded rv, vV~ must be a proper subset of wa for 9 to be a type III class. A common case where the latter arises in the examples of Andrews (1994a) is when W,, is an unbounded rv, all the observations are used to estimate a nonparametric function I for w,EYV~, and the semiparametric estimator only uses observations W, such that W,, is in a bounded set -WT. In this case, one sets the nonparametric estimator of rO(w,) equal to zero outside YV,*and the realizations of this trimmed estimator form a type III class if they satisfy the smoothness condition (ii) for w,E%‘“~. Theorem 2 If g is a class of functions of type I, II, or III, then Pollard’s entropy condition (4.2) (i.e. Assumption A) holds with envelope F(.) given by 1 v SUP~~,~If(. 1 v su~r~.~ If(.)1 v B(.), or 1 v su~~~~~ If( .) 1,respectively, where v is the maximum operator. Comment For type I classes, the result of Theorem 2 follows from results in the literature such as Pollard (1984, Chapter II) and Dudley (1987) (see the Appendix for details). For type II classes, Theorem 2 is established directly. It is similar to Lemma 2.13 of Pakes and Pollard (1989). For type III classes, Theorem 2 is established using uniform metric entropy results of Kolmogorov and Tihomirov (1961). We now show how one can “mix and match” functions of types I, II, and III to obtain a wide variety of classes that satisfy Pollard’s entropy condition (Assumption A). Let 3 and g* be classes of I x s matrix-valued functions defined on -Iy- with scalar envelopes G and G*, respectively (i.e. G: -ly- + R and Igij(.) I < G( .) V i = 1>..., r,vj= 1, . . . , s, V g&J). Let g and g* denote generic elements of 3 and g*. Let Z be defined as 3 is, but with s x u-valued functions. Let h denote a generic element of Z. We say that a class of matrix-valued functions 3, ?J*, or 2 satisfies Pollard’s entropy condition or is of type I, II, or III if that is the case element by element for each of the rs or su elements of its functions. Let~~O*={g+g*}(={g+g*:g~~,g*~~}),~~={gh},4ev~*=(gvg*}, 9~Y*={gr\g*} and Igl={lgl}, w h ere v, A, and 1.1 denote the element by element maximum, minimum, and absolute value operators respectively. If I = s and g(w) is non-singular V w~-ly- and VgM, let 3-i = {g-i}. Let ,&(.) denote the smallest eigenvalue of the matrix.
Ch. 37: Empirical Process
Theorem
Methods
2213
in Econometrics
3
If g, F?*, and 9 satisfy Pollard’s entropy condition with envelopes G, G*, and H, respectively, then so do each of the following classes (with envelopes given in parentheses): %ug* (G v G*), g@O* (G + G*), Y.8 ((G v l)(H v l)), $9 v 9* (G v G*), 9 A Y* (G v G*), and 191 (G). If in addition r = s and 3-l has a finite envelope c, and 9-i also satisfies Pollard’s entropy condition (with envelope (G v l)@‘). Comments
(1) The stability properties of Pollard’s entropy condition quite similar to stability properties of packing numbers
given in Theorem 3 are considered in Pollard
(1990).
(2) If r = s and infgsY infwEw &,(g(w)) uniformly bounded by a finite constant.
4.2.
> 0, then 9-i
has an envelope
that
is
Examples
We now show how Theorems l-3 can be applied obtain stochastic equicontinuity of vT(.). Example
of Section
3 to
1 (continued)
By Theorems l-3, the following conditions of vr(.) in this example.
(4
in the examples
are sufficient for stochastic equicontinuity
{(Y,, X,): t > l> is an m-dependent
(ii) ~~~~~~IlX~ll
2+60. Pollard’s
rE3 T-r, [ some 6 > 0. (iv) 11/r(.) is a function of bounded
entropy
condition
(IlX,l12+s+ l)s~pI$,(W’,,r)~~+~ re.F
variation.
1 0). When tj2(w, t) = y - X'T,I+b2(W, t) = l(X’T > o)[(y - X’T) A X’T], or $*(w, r) - 1(y < 2x’r)(y - x’r), condition (iii) is satisfied provided Y is bounded and 1 r lim ~~C[EIU,12+6+EilX(i14+b
+E~lU,X,~~2+b]0.
T-GPTI
This follows from Theorem 3, since {1(x’s > 0): reY}, {y - X'T:TEF}, {x'T:TE.?} and (1 (y < ~x’T): TELT} are type I classes with envelopes 1, Iu I + I\x (I supIGg 11 r - ?. 11, respectively, where u = y x’~e. IIx IIsu~,,.~ IIT II and1, Example
2 (continued)
In the method of simulated moments example, sufficient for stochastic equicontinuity of vT(.).
the following
conditions
are
is an m-dependent sequence of rv’s. is a type IT class of functions with Lipschitz function
B(.)
(9 (U&Z,, ytl,..., Y,,): t 3 l} (ii) {g(., t): reY_)
EB*+“(Z,)
that satisfies
+ Esup
))g(Z,,r)))2+6
re.Y for some
< co >
6 > 0.
Note that condition open, and
(ii) holds if g(w,r) is differentiable
(4.6) in ZV w~-ly-,Vr~~-,~
is
Sufficiency is established as follows. Classes of functions of the form { l((Zi-Zl)'(fl(z)+ A(z)yj)> 0): rsY c RP} are type I classes with envelopes equal to 1 (by including products ziyj and z,yj as additional elements of w) and hence satisfy Pollard’s entropy condition by Theorem 2. {g(.,r):rEY} also satisfies Pollard’s entropy condition with envelope 1 v supres 1)g(‘, t) II v B(.) by condition (ii) and Theorem 2. The 9% result of Theorem 3 now implies that A satisfies Pollard’s entropy condition with envelope 1 v SUP,,,~ IIg(‘, r) II v B(.). Stochastic equicontinuity now follows by Theorem 1. Example
5 (continued)
By applying Theorems stochastic equicontinuity
l-3, we find the following conditions are sufficient for of vT(.) in the WLS/PLR example. With some abuse of
Ch. 37: Empirical Process Methods in Econometrics
2275
notation, let rj(w) denote a function on W that depends on w only through the k,-subvector z and equals tj(z) above for j = 1,2,3. The sufficient conditions are:
(4 {(K,X,,Z,):t2 1) is an m-dependent
identically distributed sequence of rv’s. (ii) El1 Yt-Xle,I12+“+EIIX,I12+S+E/I(Y,-XX:B,)X,l)2+6< cc for some 6 > 0. (iii) F={t:r=(tl,t2, tJ),tjEFj for j = 1,2,3}. Fj is a type III class of RPj-valued functions on W c Rk that depend on w =(y, x’, z’)’ only through the k,-vector z for j = 1,2,3, where pi = 1, p2 =p and p3 = 1, and C 1 y-3 = tj: inf lr3(w)l 2 E for some E > 0. (4.7) wsll i 1’ The set W,* in the definition of the type III class Fj equals g* in this example for j = 1,2,3. Since g* is bounded by condition (iii), conditions (i)-(iii) can be satisfied without trimming only if the rv’s {Z,: t > l} are bounded. Sufficiency of conditions (i)-(iii) for stochastic equicontinuity is established as follows. Let h,(w) = y - ~‘0, and h2(w) = x. By Theorem 2, {c}, (hi}, {h2} and Fj satisfy Pollard’s entropy condition with envelopes 1, Ih, 1,Ih, I and Cj, respectively, for some constant C,E[~, co), for j = 1,2,3. By the 9-l result of Theorem 3, so 2 2. By the F?% and $!?@?J* results of does {l/r,:rj~Fj} with envelope CJE Theorem 3 applied several times, .&! satisfies Pollard’s entropy condition with envelope (lh,l v l)C,+(lh,I v l)C,+(lh,I v l)((h,( v l)C, for some finite constants C,, C,, and C,. Hence, Theorem 1 yields the stochastic equicontinuity of v,(.), since (ii) suffices for Assumption B. Next, we consider the conditions P(Z*EF)+ 1 and ? Are of (3.36). Suppose (i) fj(z) is a nonparametric estimator of rjO(z) that is trimmed outside T* to equal zero for j = 1,2 and one for j = 3, (ii) %* is a finite union of convex compact subsets of Rka, (iii) fj(z) and its partial derivatives of order d [q] + 1 are uniformly consistent over ZEN* for Tag and its corresponding partial derivatives, for j = 1,2,3, for some q > k,/2, and (iv) the partial derivatives of order [q] + 1 of Tag are uniformly bounded over ZEN!‘* and infiea* Ewmin(~&z)) > 0. Then, the realizations of fj(z), viewed as functions of w, lie in a type III class of functions with probability -+ 1 for j = 1,2,3 and t L T,, uniformly over 5?’ (where zjO(z) is defined for ZEN - %* to equal zero for j = 1,2 and one for j = 3). Hence, the above conditions plus (i) and (ii) of (4.7) imply that conditions (i)-(iii) of (3.36) hold. If fj(z) is a kernel regression estimator for j = 1,2,3, then sufficient conditions for the above uniform consistency properties are given in Andrews (1994b).
2276
5.
D. W.K. Andrew
Stochastic equicontinuity
via bracketing
This section provides an alternative set of sufficient conditions for stochastic equicontinuity to those considered in Section 4. We utilize a bracketing result of Ossiander (1987) for iid rv’s altered to encompass m-dependent rather than independent rv’s and extended as in Pollard (1989) to allow for non-identically distributed rv’s. This result depends on a condition, that we refer to as Ossiander’s entropy condition, that is based on how well the functions in JZ can be approximated by a finite number of functions that “bracket” each of the functions in A. The bracketing error is measured by the largest L’(P,) distance over all distributions P, of IV, for t d T, T 3 1. The main purpose of this section is to give primitive conditions under which Ossiander’s entropy condition holds. The results given here are particularly useful in three contexts. The first context is when r is finite dimensional and m(W,, t) is a non-smooth function of some nonlinear function of t and W,. For example, the rn(W,,~) function for the LAD estimator of a nonlinear regression model is of this form. In this case, it is difficult to verify Pollard’s entropy condition, so Theorems l-3 are difficult to apply. The second context concerns semiparametric and nonparametric applications in which the parameter r is infinite dimensional and is a bounded smooth function with an unbounded domain. Realizations of smooth nonparametric estimators are sometimes of this form. Theorem 2 above does not apply in this case. The third context concerns semiparametric and nonparametric applications in which r is infinite dimensional, is a bounded smooth function on one set out of a countable collection of sets and is constant outside this set. For example, realizations of trimmed nonparametric estimators with variable trimming sets are sometimes of this form. The pseudometric p on r that is used in this section is defined by
p(rl,~2) = We adopt
sup (W(W, tl) - WK,T~))~)“~. ti N.N> 1
the following
notational
convention:
~,@lf(K)IJYP = supwsr- If(w)1 if P = 00. An entropy condition analogous to Pollard’s bracketing cover numbers.
For
(5.1) any real function
is defined
using
f on
the following
Dejnition For any E > 0 and p~[2, m], the Lp bracketing cover number N:(e, P,,F)is the smallest value of n for which there exist real functions a,, . . . ,a, and b,, ,b, on YV such that for each f~9 one has If - ajl < bj for some j < II and maxjG n supt< r r> l (Eb$‘( Wr))lIpd E, where { W,: t d T, T > l} has distribution determined by PF ’ ’ The log of N~(E, P,F) is referred to as the Lp bracketing E-entropy of F. The following entropy condition was introduced by Ossiander (1987) (for the case p = 2).
Ch. 37: Empirical Process Methods
2271
in Econometrics
Definition A class F of real functions p~[2, co] if
s
on ?Y satisfies Ossiander’s Lp entropy condition for some
1
(log N;(E, P, F))“2
d& < a3.
(5.2)
0
As with Pollard’s entropy condition, Ossiander’s entropy condition restricts the complexity/size of F by restricting the rate ofincrease of the cover numbers as ~10. Often our interest in Ossiander’s Lp entropy condition is limited to the case where p = 2, as in Ossiander (1987) and Pollard (1989). To show that Ossiander’s Lp entropy condition holds for p = 2 for a class of products of functions 32, however, we need to consider the case p > 2. The latter situation arises quite frequently in applications of interest. Assumption
D
_k’ satisfies Ossiander’s
Lp entropy
condition
with p = 2 and has envelope
I&
Theorem 4 Under Assumptions B-D (with M in Assumption B given by Assumption D rather than Assumption A), {vT(.): T > l} is stochastically equicontinuous with p given by (5.1) and F is totally bounded under p. Comments 1. The proof of this theorem follows easily from Theorem 2 of Pollard (1989) (as shown in the Appendix). Pollard’s result is based on methods introduced by Ossiander (1987). Ossiander’s result, in turn, in an extension of work by Dudley (1978). 2. As in Section 4, one establishes stochastic equicontinuity here via maximal inequalities. With the bracketing approach, however, one applies a chaining argument directly to the empirical measure rather than to a symmetrized version of it. The chaining argument relies on the Bernstein inequality for the tail probabilities of a sum of mean zero, independent rv’s. The upper bound in Bernstein’s inequality is small when the L2(P,) norms of the underlying rv’s are small, where P, denotes the distribution of the tth underlying rv. The bound ultimately is applied with the underlying rv’s given by the centered difference between an arbitrary function in _&’and one of the functions from a finite set of approximating functions, each evaluated at W,. In consequence, these functions need to be close in an L2(P,) sense for all t < T for the bound to be effective, where P, denotes the distribution of W,. This explains the appearance of the supremum L2(P,) norm as the measure of approximation error in Ossiander’s L2 entropy condition.
D. W.K. Andrew
2278
We now provide primitive conditions under which Ossiander’s entropy condition is satisfied. The method is analogous to that used for Pollard’s entropy condition. First, we show that several useful classes of functions satisfy the condition. Then, we show how functions from these classes can be mixed and matched to obtain more general classes that satisfy the condition. Dejinition A class 9 of real functions on w is called a type IV class under P with index p~[2, CO] if each function f in F satisfies f(.) = f(., r) for some Roy-, where F is some bounded subset of Euclidean space, and l/P
V r~r and V 6 > 0 in a neighborhood of 0, for some finite positive constants C and I,+,where { W,: t d T, T b l} has distribution determined by P.4 Condition (5.3) is an Lp continuity condition that weakens the Lipschitz condition (4.3) of type II classes (provided suptG r,r> l(EBp(W,))“p < 00). The Lp continuity condition allows for discontinuous functions such as sign and indicator functions. For example, for the LAD estimator of a nonlinear regression model one takes f( W,, z) = sgn (Y, - g(X,, z))a[g(X,, z)]/hj for different elements rj of r. Under appropriate conditions on (Y,, X,) and on the regression function g(., .), the resultant class of functions can be shown to be of type IV under P with index p. Example 3 (continued) In this test of variable relevance the following condition: sup EU: SUP IW,,~,) r> I ?,:~I?,-?/~ <s
example,
-
h(Z,,z)l’
J& is a type IV class with p = 2 under
d
Cd*
(5.4)
for all thy, for all 6 > 0, and for some finite positive constants C and $. Condition (5.4) is easy to verify if h(Z,,t) is differentiable in r. By a mean value expansion, (5.4) holds if supt, 1 E II II, supTGF a[h(z,, z)]/ik II2 < 00 and r is bounded. On the other hand, condition (5.4) can be verified even if h(Z,,z) is discontinuous in r. For example, suppose h(Z,, z) = l(h*(Z,, r) d 0) for some real differentiable function h*(Z,, z). In this case, it can be shown that condition (5.4) holds if supta 1 El U,)2+6 < CO for some 6 > 0, sup*> 1 SUP,,,~ (Ia[h*(Z,, z)yar Ii d C, < cc a.s. for some constant C,, and h*(Z,, t) has a (Lebesgue) density that is bounded above uniformly over ZEF. 41f need be, the bound in (5.3) can be replaced i. > 1 and Theorem 5 still goes through.
by CIlog61-”
for arbitrary
constants
CE(~, co) and
2219
Ch. 37: Empirical Process Methods in Econometrics
Example 4 (continued) Jl is a type IV class with p = 2 in this cross-sectional constancy example under the same conditions as in Example 3 with U, of Example 3 replaced by U,a[g(X,,s,,)]/a8, and with h(Z,,z) taken to be of the non-differentiable form 1(h*(Z,, t) d 0) discussed above. Note that the conditions placed on a type IV class of functions are weaker in several respects than those placed on the functions in Huber’s (1967, Lemma 3, p 227) stochastic equicontinuity result. (Huber’s conditions N-2, N-3(i), and N-3(ii) are not used here, nor is his independence assumption on { W,}.)Huber’s result has been used extensively in the literature on M-estimators. Next we consider an analogue of type III classes that allows for uniformly bounded functions that are smooth on an unbounded domain. (Recall that the functions of type III are smooth only on a bounded domain and equal a constant elsewhere.) The class considered here can be applied to the WLS/PLR Example 5 or the GMM/CMR Example 6. Define wU as in Section 4 and lel w = (wb, wb)‘, h = (hb, hb)‘, and W, = (W;,, Wb,)‘. Dejinition A class 9
of real functions
on w
is called
a type I/ class under P with index
001,if
PER
(i) each fin F depends on w only through a subvector w, of dimension k, d k, (ii) wb is such that w0 n {w,ER~=: I/w, I/ < r} is a connected compact set V r > 0, (iii) for some real number q > k,/2 and some finite constants C,, . . . , Clql, C,, each f EF satisfies the smoothness condition V w~-llr and w + hew, f
(w+ h)=
vro y’!B,(k,,
w,) + W,, w,),
R(h,>w,) G C, IIh, I?, and
IB,(h,, w,)l 6 C,
IIh, II” for v = 0,. . , Cd, (5.5)
where B,(h,, w,) is homogeneous of degree v in h, and (q, C,, . . . , C,) do not depend on f,w,or h, (iv) suPtg T,Ta 1 E I/ W,, Iii < co for some [ > pqkJ(2q - k,) under P. In condition (iv) above, the condition [ > co, which arises when p = co, is taken to hold if [ = 00. Condition (ii) above holds, for example, if “IIT,= Rka. As with type III classes, the expansion of f(w + h) in (5.5) is typically a Taylor expansion and B,(h,, w,) is usually the vth differential of f at w. In this case, the third condition of (5.5) holds if the partial derivatives of f of order k,/2, each fEF has partial derivatives of order [q] on YF that are bounded uniformly over W~YY and f EF, (b) the [q]th order partial derivatives off satisfy
D. W.K. Andrews
2280
a Lipschitz condition with exponent q - [q] and some Lipschitz constant C, that does not depend on f, and (c) Y+$ is a convex set. The envelope of a type V class 9 can be taken to be a constant function, since the functions in 9 are uniformly bounded over wcw and f EF:. Type V classes can be extended to allow wO to be such that _wbn{w,~RI’~: 11w, 11d r} is a finite union of connected sets V r > 0. In this case, (5.5) only needs to hold V w~-llr and w + hE-IY_ such that w, and h, are in the same connected set in “wb n {w,: IIw, II d r} for some r > 0. In applications, the functions in type V classes usually are the realizations of nonparametric function estimates. For example, nonparametric kernel density estimates for bounded and unbounded rv’s satisfy the uniform smoothness conditions of type V classes under suitable assumptions. In addition, kernel regression estimates for bounded and unbounded regressor variables satisfy the uniform smoothness conditions if they are trimmed to equal a constant outside a suitable bounded set and then smoothed (e.g. by convolution with another kernel). The bounded set in this case may depend on T. In some cases one may wish to consider nonparametric estimates that are trimmed (i.e. set equal to a constant outside some set), but not subsequently smoothed. Realizations of such estimates do not comprise a type V class because the trimming procedure creates a discontinuity. The following class of functions is designed for this scenario. It can be used with the WLS/PLR Example 5 and the GMMjCMR Example 6. The trimming sets are restricted to come from a countably infinite number of sets {wOj: j 3 l}. (This can be restrictive in practice.) Definition
A class 9
of real functions
on w
is called a type
VI class
under
P with index
PECK, 001,if (i) each f in F depends on w only through a subvector w, of w of dimension k, d k, (ii) for some real number q > k, 12, some sequence {wOj: j 2 1) of connected compact subsets of Rka that lie in wO, some sequence {Kj:j 3 l} of constants that satisfy supja 1llyjl < co, and some finite constants C,, . . , CLql,C,, each f~9- satisfies the smoothness condition: for some integer J, (a) f(w) = K, V WE%/ for which w,+!~~~ and (b) V w~YY and w + hEW for which w,E~~~ and w, + huEdyb,, f(w
+ h) = .rO ,I;B,(h.,
R(hm wJ d C,
wu) + R(h,, w,),
IIh, 114,and IMh,, w,)l d C, /Ih, II” for v = 0,.
where B,(h,, w.) is homogeneous do not depend on f, w, or h.
. . , [q],
(5.6)
of degree v in h, and (q, (Woj: j >, l}, C,, . . . , C,)
Ch. 37: Empirical
Process
Methods
in Econometrics
2281
supti . T1T> I 1 E (1IV,, lli < cc for some iy> pqk,/(2q - k,) under P, (iv) n(r) < K, exp(K,rr) for some 5 < 2[/p and some finite constants K 1, K,, where n(r) is the number of sets Waj in the sequence {Waj: j 3 l} that do not include
(iii)
(W&“K:
IIw, II G 4.
Conditions (i)-(iii) in the definition of a type VI class are quite similar to conditions used above to define type III and type V classes. The difference is that with a type VI class, the set on which the functions are smooth is not a single set, but may vary from one function to the next among a countably infinite number of sets. Condition (iv) restricts the number of ^whj sets that may be of a given radius or less. Sufficient conditions for condition (iv) are the following. Suppose Wuj 3 or allj sufficiently large, where II(.) is a nondecreasing real Cw,E”Wb: IIw,II d r?(j)) f function on the positive integers that diverges to infinity as j-+ a3. For example, {Waj: j 3 l} could contain spheres, ellipses, and/or rectangles whose “radii” are large for large j. If q(j)3 D*(log j)lir
(5.7)
for some positive finite constant D *, then condition (iv) holds. Thus, the “radii” of the sets {~~j: j > 1) are only required to increase logarithmically for condition (iv). This condition is not too restrictive, given that the number of trimming sets {Waj} is countable. More restrictive is the latter condition that the number of trimming sets {-Wbj} is countable. As with type III and type V classes, the envelope of a type VI class of functions can be taken to be a constant function. The trimmed kernel regression estimators discussed in Andrews (1994b) provide examples of nonparametric function estimates for which type VI classes are applicable. For suitable trimming sets {WGj: j 2 l} and suitable smoothness conditions on the true regression function, one can specify a type VI class that contains all of the realizations of such kernel estimators in a set whose probability -1. The following result establishes Ossiander’s Lp entropy condition for classes of type II-VI. Theorem 5 Let p~[2,00]. If Y is a class of functions of type II with supt< T T, I (EBp(Wt))“p < co, of type III, or of type IV, V, or VI under P with index i, then Ossiander’s Lp entropy condition (5.2) holds (with envelope F(.) given by supltF If(.)\). Comments (1) To obtain Assumption D for any of the classes of functions considered above, one only needs to consider p = 2 in Theorem 5. To obtain Assumption D for a
D. W.K. Andrew
2282
class of the form 3&F’, where 9 and 2 are classes of types II, III, IV, V or VI, however, one needs to apply Theorem 5 to 9 and 2 for values of p greater than 2, see Theorem 6 below. (2) Theorem 5 covers classes containing a finite number of functions, because such functions are of type IV under any distribution P and for any index PE[~, co]. In particular, this is true for classes containing a single function. This observation is useful when establishing Ossiander’s Lp entropy condition for classes of functions that can be obtained by mixing and matching functions from several classes, see below. We now show how one can “mix and match” functions of types II-VI. Let 9?,9*, Y?, 9 @ %*, etc., be as defined in Section 4. We say that a class of matrixvalued functions 3,9*, or H satisfies Ossiander’s Lp entropy condition or is of type II, III, IV, V or VI if it does so, or if it is, element by element for each of the IS or su elements of its functions. We adopt the convention that &/(A + p) = ~E(O, co] if A = co and vice versa.
Theorem 6 (a) If 3 and 3* satisfy Ossiander’s Lp entropy condition for some p~[2, co], with envelopes G and G*, respectively, then so do each of the following classes (with envelopes given in parentheses): 9 u 3* (G v G*), 9 0 9* (G + G*), 3’ v Y* (G v G*), 9 A 3* (G v G*), and IF?\(G). If in addition r = s and inf,,, inf,,,,,- A,,,(g(w)) = A., for some A.,,> 0, then 9-i also satisfies Ossiander’s Lp entropy condition (with envelope r/E,,). (b) The class 3% satisfies Ossiander’s Lp entropy condition with p equal to cr~[2, co] and envelope sGH, if(i) 3 and A? satisfy Ossiander’s Lp entropy condition with p equal to k(cc, co] and p equal to ,ULE(CL, co], respectively, (ii) +/(A + p) 3 CI, and (iii) the envelopes G and H of Y and YP satisfy sup,< T,Ta 1(EG”(W,))“’ < cc and suptG T,Ta ,(EH”(K))“”
< 00.
Example 6 (continued) Theorems 4-6 can be used to verify stochastic equicontinuity of vT(.) and total boundedness of F in the GMMjCMR example. With some abuse of notation, let d(w) and n(w) denote functions on -w^ whose values depend on w only through the k,-vector x and equal A(x) and Q(x) respectively. Similarly, let $(w, 0,) denote the function on -w^ that depends on w only through z and equals ll/(z,e,). The following conditions are sufficient.
(i) {(Z,,X,):t> l} is an m-dependent
sequence
of rv’s.
(ii) ;;y E II$(Z,, &J II6 < ~0. (iii) $ = {r: r = A’R-’ for some AE~ and a~&‘}, where $3 and s4 are type V or type VI classes of functions on FY c Rk with index p = 6 whose functions
Ch. 37: Empirical Process
Methods
depend on w only through for some E > 0.
2283
in Econometrics
the k,-vector x, and .d c
R: inf &,(fl(w))> we*
E (5.8)
Note that condition (iii) of (5.8) includes a moment condition on X,:supta 1 E I/X, lir< co for some i > 6qk,/(2q - k,). Sufficiency of conditions (i))(iii) for stochastic equicontinuity and total boundedness is established as follows. By Theorem 5, {$(., (!I,)}, LS and d satisfy Ossiander’s Lp entropy condition with p = 6 and with envelopes I$(.,tI,)l, C, and C,, respectively, for some finite constants C,, C,. By the 9-l result of Theorem 6, so does J4-’ with some constant envelope C, < co. By the 32 result of Theorem 6 applied with c1= 3 and 1, = p = 6, SS&-’ satisfies Ossiander’s Lp entropy condition with p = 3 and some constant envelope C, < co. By this result, condition (ii), and the 9%’ result of Theorem 6 applied with c1= 2, 2 = 3, p = 6, 9 = g&-r, and Y? = ($(.,e,)}, JY satisfies Ossiander’s Lp entropy condition with p = 2 and envelope C, I$(., Q,)l for some constant C, < co. Theorem 4 now yields stochastic equicontinuity, since condition (ii) is sufficient for Assumption B. Condition (iii) above covers the case where the domain of the nonparametric functions is unbounded and the nonparametric estimators A and fi are not trimmed to equal zero outside a single fixed bounded set, as is required when the symmetrization results of Section 4 are applied. As discussed above, nonparametric kernel regression estimators that are trimmed and smoothed or trimmed on variable sets provide examples where condition (iii) holds under suitable assumptions for realizations of the estimators that lie in a set whose probability + 1. For example, Andrews (1994b) provides uniform consistency on expanding sets and LQconsistency results for such estimators, as are required to establish that P(Z*EY) -+ 1 and z^3 z0 (the first and second parts of (3.36)) when stochastic equicontinuity is established using conditions (i)-(iii) above.
6.
Conclusion
This paper illustrates how empirical process methods can be utilized to find the asymptotic distributions of econometric estimators and test statistics. The concepts of empirical processes, weak convergence, and stochastic equicontinuity are introduced. Primitive sufficient conditions for the key stochastic equicontinuity property are outlined. Applications of empirical process methods in the econometrics literature are reviewed briefly. More detailed discussion is given for three classes of applications: M-estimators based on non-differentiable criterion functions; tests of hypotheses for which a nuisance parameter is present only under the alternative hypothesis; and semiparametric estimators that utilize preliminary nonparametric estimators.
D. W.K. Andrew
2284
Appendix
Proof of Theorem 1 Write vT(.) as the sum of m empirical processes {vrj(.): T 3 l} forj = 1,. . , m, where vTj(.) is based on the independent summands {m(W,, .): t = j + sm, s = 1,2,. .}. By standard inequalities is suffices to prove the stochastic equicontinuity of {vTj(.): T3 l} for each j. The latter can be proved using Pollard’s (1990) proof of stochastic equicontinuity for his functional CLT (Theorem 10.7). We take his functions &(w, t) to be of the form m( IV,, r)/fl We alter his pseudometric from lim,, m [ (l/N)xyE 11 m( W,, zl) m(W,, t2) 11 2]“2 to that given in (3.1). Pollard’s proof of stochastic equicontinuity relies on conditions (i) and (iii)-(v) of his Theorem 10.7. Condition (ii) of Theorem 10.7 is used only for obtaining convergence of the finite dimensional distributions, which we do not need, and for ensuring that his pseudometric is well-defined. Our pseudometric does not rely on this condition. Inspection of Pollard’s proof shows that any pseudometric can be used for his stochastic equicontinuity result (although not for his total boundedness result) provided his condition (v) holds. Thus, it suffices to verify his conditions (i) and (iii)-(v). Condition “manageable.” satisfy
(i) requires that the functions {m(W,, t)/fi: t d T, T > l} are This holds under Assumption A because Pollard’s packing numbers
sup
D(s
Ia0 F”.(w)I,a0 Pn,) d sup N,(.s/2, Q, A).
Conditions (iii) and (iv) are implied by Assumption matically given our choice of pseudometric.
B. Condition
(A.1) (v) holds autoQ.E.D.
Proof of Theorem 2 Type I classes of form (a) satisfy Pollard’s entropy condition by Lemmas II.28 and 11.36(ii) of Pollard (1984, pp 30 and 34). Type I classes of form (b) satisfy Pollard’s entropy condition because (i) they are contained in VC hull classes by the proof of Proposition 4.4 of Dudley (1987) and the fact that {f: f (w)= w' 5’}.3 Examples 1 and 2 illustrate such probabilities. It is these probabilities, the discrete components of the p.d.f., that pose computational obstacles to classical estimation. One must carry out multivariate integration and differentiation in (2))(5) to obtain the likelihood for the observed data - see the following example for a clear illustration of this problem. Because accurate numerical approximations are unavailable, this integration is often handled by such general purpose numerical methods as quadrature. But the speed and accuracy of quadrature are inadequate to make the computation of the MLE practical except in special cases. Example
1.
Multinomial
probit
The multinomial probit model is a leading illustration of the computational difficulties of classical estimation methods for LDV models, which require the repeated evaluation of (2))(5). This model is based on the work of Thurstone (1927) and was first analyzed by Bock and Jones (1968). For a multinomial model with J = M possible outcomes, the latent y* is N(p, f2) where p is a J x 1 vector of means and R is a J x J symmetric positive definite covariance matrix. The observed y is often represented as a vector of indicator functions for the maximal element of y*: r(y*) = [l(yf=max,y*}; j= l,..., 51. Therefore, the sampling space B of y is the set of orthonormal elementary unit vectors, whose elements are all zero except for a unique element that equals one:
B= {(l,O,O ,..., O),(O,l,O,O ,..., 0) ,..., (0,O ,..., O,l)}. The probability function for y can be written as an integral over J - 1 dimensions after noting that the event {yj = 1, yi = 0, i # j} is equivalent to {y; - y* 3 0, i = 1,. . . ,
3The height of the discontinuity
is denoted
by
F(O; Y) - F(B; Y - 0) = lim [F(f$ Y) - F(0; Y - E)]. 40
Ch. 40: CIassical Estimation
Methods
2389
for LDV Models Using Simulation
J}. By creating the first-difference vector zj = [y: - y*, i = 1,. . . ,J, i #j] = AjY* and denoting its mean and covariance by pj = Ajp and fij = AjflA> respectively, F(B;y) and ~“(0; y) are both functions of multivariate normal negative orthant integrals of the general form
ss 0
0
@(p,L?) =
4(x + P, Wx.
..’
-co
-m
We obtain
F(e;Y)= i
l{Yj3
l}@(-Pj,.Rj)
j=l
and
f
ng= r @( - ~j, Rj)Y’
(0;Y) =
0
When J = 2, this reduces the introduction:
f
(0;Y) =
ifyE&
i
WC12
-PI,
@(
p’,
-
(6)
otherwise. to the familiar
l)y’@(P1
-
P2,
1)’ _Y’@($, 1)Y’
binomial
probit
likelihood
mentioned
in
1Y2
(7)
where ji = pL1- pLzand y’ = y,. If J > 5, then the likelihood function (6) is difficult to compute using conventional expansions without special restrictions on the covariance matrix or without adopting other distributions that imply closed-form expressions. Examples of the former approach are the factor-analytic structures for R analyzed in Heckman (1981), Bolduc (1991) and Bolduc and Kaci (1991), and the diagonal R discussed in Hausman and Wise (1978), p. 310. An example of the latter is the i.i.d. extreme-value distribution which, as McFadden (1973) shows, yields the analytically tractable multinomial logit model. See also Lerman and Manski (1981), p. 224, McFadden (198 1) and McFadden (1986) for further discussions on this issue.
Example
2.
Tohit
The tobit or censored regression model4 is a simple example of a mixed distribution with discrete and continuous components. This model has a univariate latent
4Tobin
(I 95X).
V.A. Hajivassiliou
2390
and P.A. Ruud
rule is also similar: structure like probit: y* - N(p, a’). The observation l{y* >O}.y* which leads to the sample space B = {yeRly 2 0} and c.d.f. 0
F(B; Y) =
+(y*-p,~)dy*=@(Y-~(,a*) /
s (Y’
0
i
if Y < 0, ifY30.
< Y)
The p.d.f. is mixed, containing
f(@Y)=
r(y*) =
discrete and continuous
terms:
if Y < 0,
@(--~,a’)
ifY=O,
f$(Y-~,a’)
ifY>O.
(8)
The discrete jump in F at Y = 0 corresponds to the nonzero probability of { Y = 0}, just as in binomial probit. F is differentiable for Y > 0 so that the p.d.f. is obtained by differentiation. Just as in the extension of binomial to multinomial probit, multivariate tobit models present multivariate integrals that are difficult to compute. Example 3.
Nonrandom sample selection
The nonrandom sample selection model provides a final example of partial observability which generalizes the tobit model.’ In the simplest version, the latent y* consists of two elements drawn from a bivariate normal distribution where
n(a) =
l
[ al2
The observation
*
al2
a2 1
rule is
so that the first element of y observation on yf when y, = is identically zero. That is, the B = { (O,O)} u { (l,y,), ~,ER}.
‘See
is a binomial variable and the second element is an 1; otherwise, there is no observation of yf because y, sampling space of y is the union of two disjoint sets: Thus, two cases capture the nonzero regions of the
Gronau (1974), Heckman (1974), Heckman (1979), Lewis (1974), Lee (1978) and Lee (1979).
Ch. 40: Chwical
Estimation
Merhods
fir
LDV
Models
c.d.f. of y. First of all, the c.d.f. is constant
F(B; Y)=
s
2391
Using Simuhfion
on B, = [0, 1) x [0, co):
YEB,
ddy* - p,fWy* = @(- ~l,lX
rv:
if Y, = 0,
1)
1 - 0:,/0:).4(Y,
- ~,,a:)
if Y, = 1.
is often more complicated, with several causes of the latent yT is a vector with each element associated observation. The latent yT is observed only if all the are J = M - 1) are positive so that the observation
where 1 {y: 2 0) is an (M - 1) x 1 vector of indicator
variables. The sampling space is
M-l
B=
JJER~IYM=O,fl Yj=l,YjE{O,l},
j<M
j= 1 M-l
JJER~IYM=O,
JJ Yj=O,YjE{O,l}
9
j=l
and the likelihood sions of yy.
function
contains
multivariate
integrals
over the M - 1 dimen-
Other types of nonrandom sample selection lead to general discrete/continuous models and models of switching regressions with known sample separation. Such
V.A. Hajiwssiliou
2392
models are discussed extensively in Dubin and McFadden Lee (1978) Maddala (1983) and Amemiya (1984).
2.3.
and P.A. Ruud
(1984), Hanemann
(1984)
Truncation
When it is represented as a partial observation, censored latent variable. Another mechanism variables is truncation, which refers to dropping tion goes unrecorded. Dejinition
2.
Truncated
a limited dependent variable is a for generating limited dependent observations so that their realiza-
random variables
Let F(Y) be the c.d.f. of y* and let D be a proper subset of the support its complement such that Pr {y*E DC} > 0. The function G(Y)=
F(Y)/Pr{YED} 0
is the c.d.f. of a truncated
of F and DC
if YED, if YED’.
y*.
One can generate a sample of truncated random variables with the c.d.f. G by drawing a random sample of y* and removing the realizations that are not members of D. This is typically the way truncation arises in practice. To draw a single realization of the truncated random variable, one can draw y*‘s until a realization falls into D. The term “truncation” derives from the visual effect dropping the set DC has on the original distribution when DC is a tail region: the tail of the p.d.f. is cut off or truncated. To incorporate truncation, we expand the observation rule to
Y=
4Y*)
ify*ED,
unobserved
otherwise,
where D is an “acceptance region”. This situation differs from that of the nonrandom sample selection model in which an observation is still partially observed: at least, every realization is recorded. In the presence of truncation, the observed likelihood requires normalization relative to the latent likelihood:
(10)
Ch. 40: Classical
Estimation Methodsfor
The normalization with an upper bound Example
4.
by a probability of one.
Truncated
2393
LDV Models Using Simulation
in the denominator
makes the c.d.f. proper,
normal regression
Ify* - N(p, a’) and y is an observation of y* when y* > 0, the model is a truncated normal regression. Setting D = (y~f?I y > 0) makes B = D so that the c.d.f. and p.d.f. of y are
F(6;
Y)=
4 y
P, 4
NY*
0
sm 0
f(& Y) =
I
dy*
= @(y-PY~2)-@(-I44
@(Y* -p,ddy* 0
&Y-
if Y < 0,
0
r
1-
@(-p,c?)
if y>.
2
if Y d 0, K 0)
! 1 -@(-&a”)
ifY>O,
As in the tobit model, a normal integral appears in the likelihood function. However, this integral enters in a nonlinear fashion, in the denominator of a ratio. Clearly, multivariate forms of truncation lead to multivariate integrals in the denominator. To accommodate both censored and truncated models, in the remainder of this chapter we will often denote the general log-likelihood function for LDV models with a two-part function:
ln f@ Y) = ln ft (8 Y) - ln fd@ Y)
(11)
where fi represents the normalizing probability Pr {y* E D ] = SDdF(0; y*). In models with only censoring, f2E 1. But in general, both fi and f2 will require numerical approximation. Note that in this general form, the log-likelihood function can be viewed as the difference between two log-likelihood functions for models with censoring. For example, the log-likelihood of the truncated regression in Example 4 is the difference between the log-likelihoods of the tobit regression in Example 2 and the binomial probit model mentioned in the introduction and Example 1 (see equations (7) and (S))?
6Note that scale information about y* is available in the censored and truncated normal regression models which is not in the case of binary response, so that u2 is now identifiable. Hence, the normalization c’_= 1 is not necessary, as it is in the binary probit model where only the discrete information 1{ Y> 0) is available.
V.A. Hajivassiliou
2394
1 {y >
L!9kEL [ 1 -@(-p,a2)
0) ln
und P.A. Ruud
1
=l{Y>O}ln[C$(Y-/J,a)]+1{Y=0}@(-,U,0*) -[l(Y>O}ln[l-@(--p,c*)] +l(Y=o}@(-~,a*)].
2.4.
Mixtures
LDV models have limited dependent an analytical trait generally contains Definition
3.
come to variables. with the discrete
include a family of models that do not necessarily have This family, containing densities called mixtures, shares LDV models that we have already reviewed: the p.d.f. probability terms.
Mixtures
Let F(8; Y) be the c.d.f. of y* depending Then the c.d.f.
on a parameter
8 and H(8) another
c.d.f.
F(B; Y) dH(8)
G(Y) =
s is a mixture. Possible ways in which mixtures arise in econometric models are unobservable heterogeneity in the underlying data generating process (see, for example, Heckman (198 1)) and “short-side” rationing rules (Quandt (1972), Goldfeld and Quandt (1975), Laroque and Salanit: (1989)). Laroque and Salanitt (1989) discuss simulation estimation methods for the analysis of this type of model. Example 5.
Mixture
A cousin of the nonrandom by an underlying trivariate
The observation written as
sample selection model is the mixture normal distribution, where
rule maps a three-dimensional
model generated
vector into a scalar; the rule can be
Ch. 40: Cla.ssical Estimation
Mefhods,fiw
LDV
Models
2395
Usiny Simulation
An indicator function determines whether yT or y: is observed. An important difference with sample selection is that the indicator itself is not observed. Thus, y is a “mixture” of yz’s and yz’s. As a result, such mixtures have qualitatively distinct c.d.f’s, compared to the other LDV models we have discussed. In the present case,
F(8; Y) =
s s o(
3
4(y*
- PL,f4
0.y; G Y)u (Y: < 0-y;Q Yl 4(y*
- PL,Qdy*
+
dy*,
s
$(Y* -
PL,4
dy*,
{Y:< 0.y: s Y)
(Y:a o,y: s Yl
and
where, for j = {2,3}, PlljeE(Y:lYj*=
Llllj
E
y)=Pl
+alj(YTPj)/af9
Iqyyyj* = Y) = 1 - 0+;
are conditional moments. The p.d.f. particularly demonstrates the weighted nature of the distribution: the marginal distributions of yz and ys are mixed together by probability weights.
2.5.
Time series models
LDV models are not typically applied to time series data sets but short time series have played an important role in the analysis of panel or longitudinal data sets. Such time series are another source of high-dimensional integrals in likelihood functions. Here we expand our introductory example. Example
6.
Multiperiod
binary probit model
A random sample of N economic agents is followed over time, with agent n being observed for T periods. The latent variable y,*, = pnt + E,, measures the net benefit to the agent characterizing an action in period t. Typically, pnt is a linear index
V.A. Hajiwssiliou
2396
und P.A. Ruud
function of a k x 1 vector of exogenous explanatory variables x,,~,i.e., pL,t= xk#. The agent chooses one of two actions in each period, denoted by y,,,~jO, l}, depending upon the value of y,*,:
r(y*) =
y,, = 1 i y,, = 0
ify,*, > 0,
(12)
t= l,...,T.
if yz* d 0, I
Hence, the sample space for r(y*) is B = x T= I (0, l}, i.e., all possible (2r) sequences of length T, with 0 and 1 as the possible realizations in each period. normal given in Let the distribution of y,* = (y,*,, . . , y,*,)’ be the multivariate equation (3). Then, for individual n the LDV vector {y,,}, t = 1,. , T, has the discrete p.d.f. where p = x’b
f(fi, 0; Y) = @(S,,, SOS),
and S = diag{2y - 1).
This is a special case of the multinomial probit model of Example 1, with J = 2r alternatives and a typically highly restricted 52, reflecting the assumed serial correlation in the {snt}T=1 sequence. By way of illustration, let us consider the specific covariance structure, found very useful in applied work:7 E”*= rn + in,,
IPI < 1
&I*= Pi,,* - 1 + “nt,
and v and yeindependent.
This implies that p
p2
...
pT-’
P
1
p
...
pT-2
p2
p
1
pT-l
(13)
pT-2
...
i
...
1
P
...
/,
1
+ a;.J,.
The variance parameters 0: and of cannot both be identified, so the normalization a~+~,2=1isused.’ The probability of the observed sequence of choices of individual n is
s b&n)
Pr (y,; 8, XJ =
4(~,*
- A, f&J dy:,
MYn)
‘See Hajivassiliou and McFadden (1990), BGrsch-Supan et al. (1992) and Hajivassiliou *This is the structure assumed in the introductory example see equation (1) above.
(1993a).
Ch. 40: Classical
Estimation Methods/iv
LD V Models Using Simulation
2397
with 0 = (/3, of, p) and 0
%t =
if y,, = 1,
i -YE
ify,, = 0,
Note that the likelihood of this example is another member of the family of censored models. Time series models like this do not present a new analytical problem. Indeed, such time series models are more tractable for estimation because classical methods do provide consistent, though statistically inefficient, estimators (see Poirier and Ruud (1988), Hajivassiliou (1986) and Avery et al. (1983)).9 Keane (1993) discusses extensively special issues in the estimation by simulation of panel data models and Miihleisen (1991) compares the performance of alternative simulation estimators for such models. Studies of dynamic discrete behavior using simulation techniques are Berkovec and Stern (1991), Bloemen and Kapteyn ($991), Hajivassiliou and Ioannides (1991), Hotz and Miller (1989), Hotz and Sanders (1991), Hotz et al. (1991), Pakes (1992) and Rust (1992). In this chapter, we do not analyze the estimation by simulation of “long” time series models. We refer the reader to Lee and Ingram (1991), Duffie and Singleton (1993), Laroque and Salanie (1990) and Gourieroux and Monfort (1990) for results on this topic.
2.6.
Score functions
For models with censoring, the score for 8 can be written in two ways which we will use to motivate two approaches to approximation of the score by simulation.
V,lnf(Q;y)
=z =
(14)
~JW~lnf(~;~*)l~l
(15)
where V0 is an operator that represents partial differentiation with respect to the elements of 0. The ratio (14) is simply the derivative of the log-likelihood and ‘Panel data sets in which each agent is observed for the same number of time periods T are called balanced, while sets with T. # T for some n = 1,. , N are known as unhalancrd. As long as the determination of T. is not endogenous to the economic model at hand, balanced and unbalanced sets can be analyzed using the same techniques. There exists, however, the interesting case in which T,, is determined endogenously through an economic decision, which leads to a multiperiod sample selection problem. See Hausman and Wise (1979) for a discussion of this case.
V.A. Hajivassiliou
2398
and P.A. Ruud
simulation can be applied to the numerator and denominator separately. The second expression (15), the conditional expectation of the score of the latent loglikelihood, can be simulated as a single expectation if V, lnf(8; y*) is tractable. Ruud (1986), van Praag and Hop (1987), Hajivassiliou and McFadden (1990) and Hajivassiliou (1992) have noted alternative ways of writing score functions for the purpose of estimation by simulation. Here is the derivation of (15). Let F(8; y* 1y) denote the conditional c.d.f. of y* given that r(y*) = y.” We let
ECNY*)IYI =
r(y*)dF(&y*Iy)
(16)
s denote the expectation of a random variable c.d.f. F(B; y* 1y) of y* given r(y*) = y. Then
vtm Y) _
1
.mY)
f(R Y) =
s
s
V,dF(&y*)
(y*,r(y*)
‘y)
v&m
(y'lr(y*)'Y)
t(y*) with respect to the conditional
Y*)
fvt Y*)
= -V,WV;Y*)I~(Y*)
fvt Y*) fvt Y)
dy*
= ~1
since
f(8; y*)dy* is the p.d.f. of the truncated We; Y*)ifvt Y) = fv3 Y*YJ(~.,~(~*)=~~ distribution {y* 1r(y*) = y}. This formula for the score leads to the following general equations for normal LDV models when y* has the multivariate normal p.d.f. given in (3):
v,ln f(e; Y) = a- 1[E(Y*
IY) - ~1,
V,lnf(e;Y)=~~-l{~(Y*lY)+c~(y*i~(y*)=y)-~i x [WY* using the standard
Ir(Y*) = Y) - PI’ - fl>fl-
derivatives
for the log-likelihood
I0 Formally, F(O; Y* 1T(y*) =
y)
E
lim 610
Pr{y*< y*,y-E0, then @(e^,o,
- e^,,,) -% 0.
2428
V.A. Hajiuassiliou
and P.A. Ruud
For any given residual and instrumental variables, there generally exist optimal weights among MOM estimators, and the same holds for MSM as well. In what is essentially an asymptotic counterpart to the GaussMarkov theorem, if H = ZMSM then the MSM estimator is optimal (Hansen (1982)). To construct an MSM estimator that satisfies this restriction, one normalizes the simulated residual by its variance and makes the instrumental variables the partial derivatives of the conditional expectation of the simulated moment with respect to the unknown parameters:
One can approximate these functions using simulations that are independent of the moment simulations with R fixed, but efficiency will require increasing R with sample size. If /? is differentiable in t3, then independent simulations of the V,,ii are unbiased simulators of the instruments. Otherwise, discrete numerical derivatives can be employed. The covariance matrix can be estimated using the sample variance of p and the simulated variance of y. Inefficiency in simulated instruments constructed in this way has two sources: the simulation noise and the bias in the inverse of an estimated variance. Both sources disappear asymptotically if R approaches infinity with N. While it is critical that the simulations of w be independent of the simulations of b, there is no obvious advantage to simulating the individual components of w independently. In some cases, for example simulating a ratio, it appears that independent simulation may be inferior.23
4.4.
Simulation
of the score function
Interest in the efficiency of estimators naturally leads to attempts to construct an efficient MSM estimator. The obvious way to do this is to simulate the score function as a set of simulated moment equations. Within the LDV framework, however, unbiased simulation of the score with a finite number of operations is not possible with simple censored simulators. The efficient weights are nonlinear functions of the objects that require simulation. Nevertheless, it may be possible with the aid of simulation to construct good approximations that offer improvements in efficiency over simpler MSM estimators. There is an alternative approach based on truncated simulation. We showed in Section 2 that every score function can be expressed as the expectation of the score of a latent data generating process taken conditional on the observed data. In the particular case of normal LDV models, this conditional expectation is taken over a truncated multivariate normal distribution and the latent score is the score of an untruncated multivariate normal distribution. Simulations from the truncated nor23A Taylor series expansion suggests that positive correlation between the numerator of a ratio can yield a smaller variance than independent simulation.
and denominator
Ch. 40: Classical
Estimation
Methods for LDV Models Using
Simulation
2429
ma1 distribution can replace the expectation operator to obtain unbiased simulators of the score function. In order to include both the censored and truncated approaches to simulating the score function, we define the method of simulated scores as follows.24 Dejinition 8.
Method
of simulated scores
Let the log-likelihood function for the unknown parameter vector (3 given the sample of observations (y,, n = 1,. . . , N) be I,(0) = C,“= 1In f(& y,). Let fi(Q; y,, w,) = (l/R)Cr=, ~(&y,,,lo,J be an asymptotically (in R) unbiased simulator of the score function ~(0;y) = Vlnf(B; y) where o is a simulated random variable. The method of simulated scores estimator is &,s, E arg min,, J/ 5,(e) (1 where .YN(0)3 (l/N)Cr= ,b(@, y,, 0,) for some sequence {on}. Our definition includes all MSL estimators as MSS estimators, because they implicitly simulate the score with a bias that disappears asymptotically with the number of replications R. But there are also MSS estimators without simulation bias for fixed R. These estimators rely on simulation from the truncated conditional distribution of the latent y* given y. We turn to such estimators first. 4.4.1.
Truncated
simulation of the score
The truncated simulation methods described in Section 3.3 provide unbiased simulators of the LDV score (17), which is composed of elements of the form (24). Such simulation would be ideal, because R can be held fixed, thus leading to fast estimation procedures. The problem is that these truncated simulation methods pose new problems for the MSS estimators that use them. The first truncated simulation scheme, discussed in Section 3.3.1 above, is the A/R method. This provides simulations that are discontinuous in the parameters, a property shared with the CMC. A/R simulation delivers the first element in a simulated sequence that falls into a region which depends on the parameters under estimation. As a result, changes in the parameter values cause discrete changes in which element in the sequence is accepted. An example of this phenomenon is to suppose that one is drawing a sequence of normal random variables {ql} IV N(0, I,) in order to obtain truncated multivariate normal random variables for rank ordered probit estimation. Given the observation y, one seeks a simulation from D(y), as defined in Example 8. Let the simulation of y* be jjl(pl, r,) 3 ,~i + T1qr at the parameter values (pi, r,). At neighboring parameter values where two elements of the vector j,(p, r) are equal, the A/R simulation is at the point of jumping from the value j& r) to another point in the sequence {J,(p, r)}. See Hajivassiliou and McFadden (1990) and McFadden and Ruud (1992) for treatments of the special
24The term was coined by Hajivassiliou
and McFadden
(1990)
V.A. Hajivassiliou
2430
and P.A. Ruud
asymptotic distribution theory for such simulation estimators. Briefly described, this distribution theory requires a degree of smoothness in the estimator with respect to the parameters that permits such discontinuities but allows familiar linear approximations in the limit. See Ruud (1991) for an illustrative application. The second truncated simulation scheme we discussed above was the Gibbs resampling simulation method; see Section 3.3.2. This method is continuous in the parameters provided that one uses a continuous univariate truncated normal simulation scheme. But this simulation method also has a drawback: Strictly applied, each simulation requires an infinite number of resampling rounds. In practice, Gibbs resampling is truncated and applied as an approximation. The limited Monte Carlo evidence that we have seen suggests that such approximation is reliable. Simulation of the efficient score fits naturally with the EM algorithm for computing the MLE derived by Dempster et al. (1977). The EM algorithm includes a step in which one computes an expectation with respect to the truncated distribution of y* conditional on y. Ruud (1991) suggested that a simulated EM (SEM) algorithm could be based on simulation of the required expectation.25 This substitution provides a computational algorithm for solving the simulated score of MSS estimators. Dejinition 9.
EM algorithm
The EM algorithm is an iterative process for computing the MLE of a censored data model. On the ith iteration, the EM algorithm solves 0 i+ ’ = arg max Q(0, 0';y), where the function
(39)
Q is
Q(O1,OO;~)~E,oClnf(O1;~*)l~l, where EOo[. Iy] indicates
an expectation
(40) measured
with respect to ~(0”; y* 1y).
If Q is continuous in both 0 arguments, then (39) is a contraction mapping converges to a root of the normal equations; as Ruud (1991) points out, 0 = 0’ = 0’ *V,,
Q(O’, 0’; y) = VBIn F(0; y),
that
(41)
so that the first-order conditions for an iteration of (39) and the normal equations for ML are intimately related. Unlike the log-likelihood function, this Q can be simulated without bias for LDV models because the latent likelihood f(0; y*) is tractable and Q is linear in In f(0; y*) 25van Pragg et al. (1989) and van Praag et al. (1991) also investigated a study of the Dutch labor market.
this approach
and applied
it in
Ch. 40: Classical
Estimution Methodsfor
2431
LDV Models Using Simulation
(see equation (40)). According to (41). unbiased simulation of Q implies a means for unbiased simulation of the score. Although it is not guaranteed, an unbiased simulator of Q usually yields a contraction mapping to a stationary point. For LDV models based on a latent multivariate normal distribution, the iteration in (39) is quite simple to compute, given Q or a simulation of Q. Iff(e;y*) = 4(y*
-
P; W,
then
and
IR’ = k $ n
&o[(y:
- P’)(Y,* - P’),\Y~],
(42)
1
which are analogous to the equations for the MLE using the latent data. This algorithm is often quite slow, however, in a neighborhood of the stationary point of (39). Any normalizations necessary for identification of 0 can be imposed at convergence. See Ruud (1991) for a discussion of these points. Example 13.
SEM estimation
In this example, we apply the SEM procedure to the rank ordered probit model of our previous examples. We simulated an (approximately) unbiased 0 of Q by drawing simulations of y: from its truncated normal distribution conditional on y, using the Gibbs resampling method truncated to 10 rounds. The support of this truncated distribution is specified as D(y) in Example 8. The simulated estimators were computed according to (42), after replacing the expectations with the averages of independent simulations. The usual Monte Carlo results for 500 experiments with J = 6 ranked alternatives are reported in Table 8 for data sets containing 100 observations and R = 5 simulations per observation. These statistics are comparable to those in Table 5 for the MSL estimator of the same model with the same number of simulation replications. The biases for the true parameter values appear to be appreciably smaller in the SEM estimator, while the sampling variances are larger. We cannot judge either estimator as an approximation to the MLE, because the latter is prohibitively difficult to compute.
Sample statistics
Parameter
01 0, 0, 0, 0,
Population value - 0.4000 -0.4000 - 0.4ooo - 0.4000 - 0.4000
for rank ordered
Table 8 probit SEM using Gibbs simulation
(J = 6, R = 5).
Mean
Standard deviation
Lower quartile
Median
Upper quartile
- 0.3827 -0.4570 -0.4237 - 0.4268 -0.4300
0.1558 0.3271 0.2262 0.2710 0.2622
- 0.4907 -0.5992 -0.5351 -0.5319 -0.5535
- 0.3848 -0.4089 -0.3756 -0.3891 - 0.3794
-0.2757 -0.2455 - 0.2766 -0.2580 -0.2521
V.A. Hajivassiliou
2432
and P.A. Ruud
Although truncated simulation is generally more costly, the SEM estimator remains a promising general approach to combining simulation with relatively efficient estimation. It is the only method that combines unbiased simulation of the score with optimization of an objective function and the latter property appears to offer substantial computational advantages. 4.4.2.
Censored simulation of ratios
The censored simulation methods in Section 3.2 can also be applied to approximating the efficient score. These simulation methods tend to be much faster computationally than the truncated simulation methods, but censored simulations introduce simulation bias in much the same way as in the MSL. Censored simulation can be applied to discrete LDV models by noting that the score function of an LDV model with observation rule y = z(y*) can generally be written in the ratio form:
s s
VtJf(0;Y)
V, dF(e;Y*)
(Y'IHY')'Yl
.0&Y) = ____
WA Y*)
(Y*lr(Y*) =Y)
= W,WRY*)I$Y*)
= Y)
%y*lT(y*) = Y>
’
where F(0; y* 1y) is the conditional c.d.f. of y given r(y*) = y. See Section 2.6 for more details. Van Praag and Hop (1987), McFadden (1989) and Hajivassiliou and McFadden (1990) note that this form of the score function offers the potential of by simulating estimation by simulation. 26 An MSS estimator can be constructed separately the numerator and denominator of the score expressions:
s,(e)=-
1 N @;y,,q,) 1 N
n = I P(e;
(43)
Y,,, ~2n)’
where 2(&y,, win) = (l/R,)CF: 1&e; y ,,,winr) is an unbiased simulator of the derivative function V,f(0, y) and p(0; y,, oz.) = (l/R,)CFZ 1d(& y,, mznr) is an unbiased function of the probability expression f(0; y,). Hajivassiliou and McFadden (1990) prove that when the approximation of the scores in ratio form is carried out using the GHK simulator, the resulting MSS estimator is consistent and asymptotically normal when N -+ 00 and R,/,,h-t co. The number of simulations for the numerator expression, R,, affects the efficiency of the resulting MSS estimator. Because the unbiased simulator p(&y, 02) of f(e; y) does not yield an unbiased simulator of
26See Hajivassiliou models.
(1993~) for a survey of the development
of simulation
estimarion
methods
for LDV
Ch. 40: Classical Estimation Methodsfor
LDV Models Using Simulation
2433
the reciprocal l/f(& y) in the simulator l/p(0; y, w,), R2 must increase with sample size to obtain a consistent estimator. This is analogous to simulation in MSL. In fact, this simulation scheme is equivalent to MSL when ur = o2 and d’= V,p. McFadden and Ruud (1992) note that MSM techniques can also be used generally to remove the simulation bias in such MSS estimators. In discrete LDV models, where y has a sampling space B that is countable and finite, we can always write y as a vector of dummy variables for each of the possible outcomes so that E,(y,) = Pr{yi = 1; 6} = f(Q; Y)
if
Yi = 1, Yj = 0, j # i.
Thus, v,f(e;Y)
E
8 [
f@Y)
=o= 1
Y) 1 f@; Y).- VfLf(~; --~ YEB f (0; Y)
and the score can be written
(44) Provided that the “residual” l{y = Y} - f(0; Y) and the “instrumental variables” V,f(e; Y)/f(0; Y) are simulated independently, equation (44) provides a moment function for the MSM. In this form, the instrumental variables ratio can be simulated with bias as in (43) because the residual term is independently distributed and possesses a marginal expectation equal to zero at the population parameter value. For example, we can alter (43) to
ue)=X$1 YEB c
CRY,=
”
&@Y,%“)
~1-iw; Y,O,A-,
iv; y,%“I
(45)
where wr and o2 are independent pseudo-random variables. While such bias does not introduce inconsistency into the MSM estimator, the simulation bias does introduce inefficiency because the moment function is not an unbiased simulator of the score function. This general approach underlies the estimation method for multinomial probit originally proposed by McFadden (1989). 4.4.3.
MSM
versus MSS
MSM and MSS are natural competitors in estimation with simulation because each has a comparative advantage. MSM uses censored simulations that are cheap to
V.A. Hajivassiliou
2434
and P.A. Ruud
compute, but it cannot simulate the score without bias within a finite number of calculations. MSS uses truncated simulations that are expensive to compute (and introduce jumps in the objective function with A/R simulations), but simulates the score (virtually) without bias. McFadden and Ruud (1992) make a general comparison of the asymptotic covariance matrices that suggests when one method is preferable to the other. Consider the special MSS case in which the simulations P*(e; Y,w) are drawn from the latent conditional distribution and the exact latent score V,l* is available so that &(O)=
R,’
2 V,I*[8;
?*(Q; Y,w)].
r=l
Then Z,, the contribution has a useful interpretation:
of simulation
to the covariance
matrix
of the estimator,
where Z, = E,{V,I*(@; Y*)[VBI*(O; Y*)]‘} is th e information matrix of the latent log-likelihood. The simulation noise is proportional to the information loss due to partial observability. In the simplest applications of censored simulation to the MSM, the simulations are independent of sample outcomes and their contribution to the moment function is additively separable from the contribution of the data. Thus we can write &,(0) = g(& Y,o&g(O; wi,~J(see(45)). In that case, &simplifies In general, the simulation process makes R independent tions {ml; r = 1,. , R}, so that
to V{,/%[~“(~,;O,,O,)]}. replications of the simula-
and & = R- ’ V,&j(d,; ml, oJ]. In an important special case of censored simulation, the simulation process makes R independent replications of the modeled data generating process, { F(L);w,,); r = 1,. . . , R}, so that
ae;al,
4
= R; 1
de; t(e; ml,), w2]
and Z, = R ’ V[g(B,; Y, co’)] = Z,/R. Then the MSM covariance matrix equals (1 + l/R) times the classical MOM covariance matrix without simulation G-‘Z,(G’))‘. Now let us specialize to simulation of the score. For simplicity, suppose that the simulated moment functions are unbiased simulations of the score: E[S,,,(B)I Y] = V,1(8; Y). Of course in most cases, the MSM estimator will have a simulation bias
Ch. 40: Classical
for the score. The asymptotic Z,
2435
Estimation Methods for LDV Models Using Simulation
= lim,,, = lim,,
I/(&,(0,)
variance
of the MSM estimator
is
- V,1(8,; Y)]
m ~chmlw4J)l +
&l
= z, + E,, where 22, = Z’,/R and Z, holds additional variation attributable to the simulation of the score. If the MSS and MSM estimators use the same number of simulation replications, we can make a simple comparison of the relative efficiency of the two methods. The difference between the asymptotic covariance matrices is R-‘Z,‘[&
+
(R+
l)Z,
- (Z, - &)]Z,‘.
This expression gives guidance about the conditions under which censored simulation is likely to dominate truncated. It is already obvious that if Z, is high, so that censored simulation is inefficient due to a poor approximation of the score, then truncated simulation is likely to dominate. On the other hand, if Z:, is low, because partial observability causes a large loss in information, then estimation with censored simulation is likely to dominate truncated. Thus, we might expect that the censored simulation method will dominate the truncated one for the multinomial probit model, particularly if Z, = 0. That, however, is a special case in which a more efficient truncated simulation estimator can be constructed from the censored simulation estimator. Because E[E(Q)I Y] = VfJ(R Y),
ma y,02) - m 01, w2)l -E[g(e;ol,o,)]
=
= E{g[e;
VfJ(RVI
~(e;o),d]} =o t/e.
The bias correction is obviously unnecessary and only increases the variance of the MSM estimator. But an MSM estimator based on g(& Y, o) is a truncated simulation MSM estimator; only simulation for the particular Y observed is required. We conclude that the censored method can outperform the truncated method only by choosing E,[e(B)] # V,1(8; Y) in such a way that the loss in efficiency in Z, is offset by low Zc, and low Z,.27
4.5.
Bias corrections
In this section, we interpret estimation with simulation as a general method for removing bias from approximate parametric moment functions, following McFadden “The actual difference in asymptotic however, because G # EM # &.
covariance
matrices
is more complicated
than the formula
above
V.A. Hajicawiliou
2436
and P.A. Ruud
and Ruud (1992). The approximation of the efficient score is the leading problem in estimation with simulation. In a comparison of the MSM and MSS approximations, we have just described a simple trade-off. On the one hand, the simulated term in the residual of (45) that replaces the expectation in (44) is clearly redundant when the instrumental variables are V,f(Q; Y)/f(& Y). The expectation of the simulated terms multiplied by the instruments is identically zero for all parameter values so that the simulation merely adds noise to the score and the resulting estimator. On the other hand, the simulated residual is clearly necessary when the instruments are not ideal. Without the simulation, the moment equation is invalid and the resultant estimators are inconsistent. This trade-off motivates a general structure of simulated moments estimators. We can interpret the extra simulation term as a bias correction to an approximation of the score. For example, one can view the substitution of non-ideal weights into the original score function as an approximation to the score, chosen for its computational feasibility. Because the approximation introduces bias, the bias is removed by simulating the (generally) unknown expectation of the approximate score. Suppose the moment restrictions have a general form
Hs(&; y, Xl Ix 1 = 0. When the moment function s is computationally burdensome, an approximation g(fl; y, X, o) becomes a feasible alternative. The additional argument o represents an ancillary statistic containing the “coefficients” of the approximation. In general, such approximation will introduce inefficiency and bias into MOM estimators constructed from g. Simulation of g over the distribution of y produces an approximate bias correction Q(8; X, w’, o), where o’ represents the simulated component. Thus, we consider estimators 6 that satisfy g(& y, x, w) MSM estimators too. 4.5.1.
lj(8;x, co’,w)
= 0.
have this general form; and feasible MSS estimators
(47) generally
do,
A score test for estimator bias
The appeal of simulation estimators without bias correction is substantial. Although the simulation of moments or scores overcomes a substantial computational difficulty in the estimation of LDV models, there may remain practical difficulties in solving the simulated moment functions for the estimators. Whereas maximum likelihood possesses a powerful relationship between the normal equations and the likelihood function, moment equations generally do not satisfy such “integrability” conditions. As a result, there is not even a guarantee that a root of the estimating
Ch. 40: Classicul Esfimation
Methodsfor
LDV Models Using Simulution
2437
equations
exists. Bias correction can introduce a significant amount of simulation noise to estimators. For these reasons, the approximation of the log-likelihood function itself through simulation still offers an important opportunity to construct feasible and relatively efficient estimators. MSS, and particularly MSL, estimators can be used without bias correction if the bias is negligible relative to the sampling error of the estimator and the magnitude of the true parameter. A simple score test for significant bias can be developed and implemented easily. Conditional on the MSS estimator, the expectation of the simulated bias in the approximate score should be zero. The conditional distribution of the elements of the bias correction are i.n.i.d. random variables to which a central limit theorem can be applied. In addition, the White-Eicker estimator of the covariance matrix of the bias elements is consistent so that the usual Wald statistic, measuring the statistical significance of the bias term, can be computed (see Engle (1984)). As an alternative to testing the significance of this statistic, the bias correction term can be used to compute a local approximate confidence region for the biases in the moment function or the estimated parameters. This has the advantage of providing a way to assess whether the biases are important for the purposes of inference.
5.
Conclusion
In this chapter, we have described the use of simulation methods to overcome the difficulties in computing the likelihood and moment functions of LDV models. These functions contain multivariate ‘integrals that cannot be easily approximated by series expansions. However, unbiased simulators of these integrals can be computed easily. We began by reviewing the ways in which LDV models arise, describing the differences and similarities in censored and truncated data generating processes. Censoring and truncation give rise to the troublesome multivariate integrals. Following the LDV models, we described various simulation methods for evaluating such integrals. Naturally, censoring and truncation play roles in simulation as well. Finally, estimation methods that rely on simulation were described in the final section of this chapter. We organized these methods into three broad groups: MSL, MSM, and MSS. These are not mutually exclusive groups. But each group has a different motivation: MSL focuses on the log-likelihood function, the MSM on moment functions, and the MSS on the score function. The MSS is a combination of ideas from MSL and MSM, treating the efficient score of the log-likelihood function as a moment function. Software for implementing these methods is not yet widely available. But as such tools spread, and as improvements in the simulators themselves are developed, simulation methods will surely become a familiar tool in the applied econometrician’s workshop.
V.A. Hajiuassiliou and P.A. Ruud
2438
6.
Acknowledgements
We would like to thank John Geweke and Daniel McFadden for very helpful comments. John Wald provided expert research assistance. We are grateful to the National Science Foundation for partial financial support, under grants SES929411913 (Hajivassiliou) and SES-9122283 (Ruud).
Amemiya, T. (1984) “Tobit Models: A Survey”, Journal of Econometrics, 24, 3-61. Avery, R., Hansen, L. and Hotz, V. (1983) “Multiperiod Probit Models and Orthogonality Condition Estiplation”, International Economic Review, 24, 21-35. Bauwens, L. (1984) Bayesian Full Information Analysis ofSimultaneous Equation Models using Integration by Monte Carlo. Berlin: Springer. Beggs, S., Cardell, S. and Hausman, J. (1981) “Assessing the Potential Demand for Electric Cars”, Journal of Econometrics, 17, l-20. Berkovec, J. and Stern, S. (1991) “Job Exit Behavior of Older Men”, Econometrica, 59, 189-210. Bloemen, H. and Kapteyn, A. (1991) The Joint Estimation of a Non-linear Labour Supply Function and a Wage Equation Using Simulated Response Probabilities. Tilburg University, mimeo. Bock, R.D. and Jones, L.V. (1968) The Measurement and Prediction of Judgement and Choice. San Francisco: Holden-Day. Bolduc, D. (1992) “Generalized Autoregressive Errors in the Multinomial Probit Model”, Transportation Research B - Methodological, 26B(2), 155- 170. Bolduc, D. and Kaci, M. (1991) Multinomial Probit Models with Factor-Based Autoregressioe Errors: A Computationally EJ’icient Estimation Approach. Universite Lava], mimeo. Borsch-Supan, A. and Hajivassiliou, V. (1993) “Smooth Unbiased Multivariate Probability Simulators for Maximum Likelihood Estimation of Limited Dependent Variable Models”, Journal of Econometrics, 58(3), 347-368. Biirsch-Supan, A., Hajivassiliou, V., Kotlikoff, L. and Morris, J. (1992) Health, Children and Elderly Living Arrangements: A Multi-Period Multinomial Probit Model with Unobserved Heterogeneity and Autocorrelated Errors pp. 79-108, in: D. Wise, ed., Topics in the Economics of Aging. Chicago: University of Chicago Press. Chib, S. (1993) “Bayes Regression with Autoregressive Errors: A Gibbs Sampling Approach”, Journal of Econometrics, 58(3), 275-294. Clark, C. (1961) “The Greatest of a Finite Set of Random Variables”, Operations Research, 9, 145-162. Daganzo, C. (1980) Multinomial Probit. New York: Academic Press. Daganzo, C., Bouthelier, F. and Sheffi, Y. (1977) “Multinomial Probit and Qualitative Choice: A Computationally Efficient Algorithm”, Transportation Science, 11,338-358. Davis, P. and Rabinowitz, P. (1984) Methods of Numerical Integration. New York: Academic Press. Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) “Maximum Likelihood from Incomplete Data via the EM Algorithm”, Journal of the Royal Statistical Society, Series B, 39, l-38. Devroye, L. (1986) Non-Uniform Random Variate Generation. New York: Springer. Dubin, J. and McFadden, D. (1984) “An Econometric Analysis of Residential Electric Appliance Holdings and Consumption”, Econometrica, 52(2), 345-362. Duffie, D. and Singleton, K. (1993) “Simulated Moments Estimation of Markov Models of Asset Prices”, Econometrica, 61(4), 929-952. Dutt, J. (1973) “A Representation of Multivariate Normal Probability Integrals by Integral Transforms”, Biometrika, 60, 637-645. Dutt, J. (1976) “Numerical Aspects of Multivariate Normal Probabilities in Econometric Models”, Annals of Economic and Social Measurement, 5, 547-562. Engle, R. (1984) Wald, Likelihood Ratio, and Lagrange Multiplier Tests in Econometrics, pp. 776-826, in: Z. Griliches and M. Intriligator, eds., Handbook of Econometrics, Vol. 2. Amsterdam: NorthHolland. Feller, W. (1971) An Introduction to Probability Theory and its Applications. New York: Wiley, Fishman, G. (1973) Concepts and Methods of Digital Simulation. New York: Wiley.
Ch. 40: Classical Estimufion
Methodsfor
LD V Models Using Simulation
2439
Geman, S. and Geman, D. (1984) “Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images”,IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721~741. &w&e, J. (1989) “Bayesian Inference in Econometric Models Using Monte Carlo Integration”, Econometrica, 57, 1317-1340. Geweke, J. (1992)Efficient Simulation from the Multivariate Normal and Student-t Distributions Subject to Linear Constraints. Computing Science and Statistics: Proceedings qf the Twenty-Third Symposium, 571-578. Goldfeld, S. and Quandt, R. (1975) “Estimation in a Disequilibrium Model and the Value of Information”, Journal ofEconometrics, 3(3), 325-348. Gourieroux, C. and Monfort, A. (1990) Simulation Based Inference in Models with Heterogeneity. INSEE, mimeo. Gourieroux, C., Monfort, A., Renault, E. and Trognon, A. (1984a) “Pseudo Maximum Likelihood Methods: Theory”, Econometrica, 52, 681-700. Gourieroux, C., Monfort, A., Renault, E. and Trognon, A. (1984b) “Pseudo Maximum Likelihood Methods: Applications to Poisson Models”, Econometrica, 52, 701-720. Gronau, R. (1974) “The Effect of Children on the Housewife’s Value of Time”, Journal of Political Economy”, 81, 168-199. Hajivassiliou, V. (1986) Serial Correlation in Limited Dependent Variable Models: Theoretical and Monte Carlo Results. Cowles Foundation Discussion Paper No. 803. Hajivassiliou, V. (1992) The Method ofSimulated Scores: A Presentation and Comparative Evaluation. Cowles Foundation Discussion Paper, Yale University. Hajivassiliou, V. (1993a) Estimation by Simulation of the External Debt Repayment Problems. Cowles Foundation Discussion Paper, Yale University. Published in the Journal of Applied Econometrics, 9(2) (1994) 109-132. Hajivassiliou, V. (1993b) “Simulating Normal Rectangle Probabilities and Their Derivatives: The effects of Vectorization”. International Journal ofSupercomputer Applications, 7(3), 231-253. Hajivassiliou, V. (1993~) Simulation Estimation Methods for Limited Dependent Variable Models. pp. 519-543, in: G.S. Maddala, C.R. Rao and H.D. Vinod, eds., Handbook ofstatistics (Econometrics), Vol. 11. Amsterdam: North-Holland. Hajivassiliou, V. and Ioannides, Y. (1991) Switching Regressions Models of the Euler Equation: Consumption Labor Supply and Liquidity Constraints. Cowles Foundation for Research in Economics, Yale University, mimeo. Hajivassiliou, V. and McFadden, D. (1990). The Method ofsimulated Scores, with Application to Models of External Debt Crises. Cowles Foundation Discussion Paper No. 967. Hajivassiliou, V., McFadden, D. and Ruud, P. (1992) “Simulation of Multivariate Normal Orthant Probabilities: Methods and Programs”, Journal of Econometrics, forthcoming. Hammersley, J. and Handscomb, D. (1964) Monte Carlo Methods. London: Methuen. Hanemann, M. (1984) “Discrete/Continuous Models of Consumer Demand”, Econometrica, 52(3), 541562. Hansen, L.P. (1982) “Large Sample Properties of Generalized Method of Moments Estimators” Econometrica, 50, 1029-1054. Hausman, J. and Wise, D. (1978) “A Conditional Probit Model for Qualitative Choice: Discrete Decisions Recognizing Interdependence and Heterogeneous Preferences”, Econometrica, 46, 403-426. Hausman, J. and Wise, D. (1979) “Attrition Bias in Experimental and Panel Data: The Gary Negative Income Maintenance Experiment”, Econometrica, 47(2), 445-473. Heckman, J. (1974) “Shadow Prices, Market Wages, and Labor Supply”, Econometrica, 42, 679-694. Heckman, J. (1979) “Sample Selection Bias as a Specification Error”, Econometrica, 47, 153-161. Heckman, J. 
(1981) Dynamic Discrete Models. pp. 179-195,_in C. Manski and D. McFadden, eds., Structural Analysis ofDiscrete Data with Econometric Applications. Cambridge: MIT Press. Hendry, D. (1984) Monte Carlo Experimentation in Econometrics, pp. 937-976 in: Z. Griliches and M. Intriligator, eds., Handbook ofEconometrics, Vol. 2. Amsterdam: North-Holland. Horowitz, J., Sparmonn, J. and Daganzo, C. (1981) “An Investigation of the Accuracy of the Clark Approximation for the Multinomial Probit Model”, Transportation Science, 16, 382-401. Hotz, V.J. and Miller, R. (1989) Condirional Choice Probabilities and the Estimation of Dynamic Programming Models. GSIA Working Paper 88-89-10. Hotz, V.J. and Sanders, S. (1991) The Estimation ofDynamic Discrete Choice Models by the Method of Simulated Moments. NORC, University of Chicago.
2440
V.A. Hujiuussiliou
and P.A. Ruud
Hotz, V.J., Miller, R., Sanders, S. and Smith, J. (1991) A Simulation Estimatorfin Dynamic Discrete Choice Models. NORC, University of Chicago, mimeo. Keane, M. (1990) A Computationully &ficient Practical Simulation Estimator ,Jor Panel Data with Applications to Estimating Temporal Dependence in Employment and Wqes. University of Minnesota, mimeo. Keane, M. (1993) Simulation Estimation Methods for Panel Data Limited Dependent Variable Models, in: G.S. Maddala, C.R. Rao and H.D. Vinod, eds., Handbook of Statistics (Econometrics), Vol. 11. Amsterdam: North-Holland. Kloek, T. and van Dijk, H. (1978) “Bayesian Estimates of Equation System Parameters: An Application of Integration by Monte Carlo”, Econometrica, 46, l-20. Laroque, G. and Salanie, B. (1989) “Estimation of Multi-Market Disequilibrium Fix-Price Models: An Application of Pseudo Maximum Likelihood Methods”, Econometrica, 57(4), 83 t-860. Laroque, G. and Salanie, B. (1990) The Properties of Simulated Pseudo-Maximum Likelihood Methods: The Case ofthe Canonical Disequilibrium Model. Working Paper No. 9005, CREST-Departement de la Recherche, INSEE. Lee, B.-S. and Ingram, B. (1991) “Simulation Estimation of Time-Series Models”, Journal of Econometrics, 47, 197-205. Lee, L.-F. (1978) “Unionism and Wage Rates: A Simultaneous Equation Model with Qualitative and Limited Denendent Variables”. international Economic Review, 19,415-433. Lee, L.-F. (1979) “Identification ‘and Estimation in Binary Choice Models with Limited (Censored) Dependent Variables”, Econometrica, 47, 977-996. Lee, L.-F. (1992) “On the Efficiency of Methods of Simulated Moments and Maximum Simulated Likelihood Estimation of Discrete Response Models”, Econometric Theory, 8(4), 518-552. Lerman, S. and Manski, C. (1981) On the Use of Simulated Frequencies to Aproximate Choice Probabilities, pp. 305-319, in: C. Manski and D. McFadden, eds., Structural Analysis ofDiscrete Data with Econometric Applications. Cambridge: MIT Press. Lewis, H.G. (1974) “Comments on Selectivity Biases in Wage Comparisons”, Journal of Political Economy, 82(6), 114551155. Maddala, G.S. (1983) Limited Dependent and Qualitative Variables in Econometrics. Cambridge: Cambridge University Press. McCulloch, R. and Rossi, P.E. (1993) An Exact Likelihood Analysis of the Multinomial Probit Model. Working Paper 91-102, Graduate School of Business, University of Chicago. McFadden, D. (1973) Conditional Logit Analysis of Qualitative Choice Behavior, pp. 105-142, in: P. Zarembka, ed., Frontiers in Econometrics. New York: Academic Press. McFadden, D. (1981) Econometric Models of Probabilistic Choice, pp. 1988272, in: C. Manski and D. McFadden, eds., Structural Analysis ofDiscreteData with Econometric Applications. Cambridge: MIT Press. McFadden, D. (1986) Econometric Analysis of Qualitative Response Models, pp. 1395-1457, in: Z. Griliches and M. Intriligator, eds., Handbook ofEconometrics, Vol. 2, Amsterdam: North-Holland. McFadden, D. (1989) “A Method of Simulated Moments for Estimation of Discrete Response Models without Numerical Integration”, Econometrica, 57, 99551026. McFadden, D. and Ruud, P. (1992) Estimation by Simulation. University of California at Berkeley, working paper. Moran, P. (1984) “The Monte Carlo Evaluation of Orthant Probabilities for Multivariate Normal Distributions”, Australian Journal of’ Statistics, 26, 39-44. Miihleisen, M. (1991) On the Use of Simulated Estimators for Panel Models with Limited-Dependent Variables. University of Munich, mimeo. Newey, W.K. and McFadden, D.L. 
(1994) Estimation in Large Samples, in: R. Engle and D. McFadden, eds., Handbook ofEconometrics, Vol. 4. Amsterdam: North-Holland. Owen, D. (1956) “Tables for Computing Bivariate Normal Probabilities”, Annals of Mathematical Statistics, 27, 1075-1090. Pakes, A. (1992) Estimation of Dynamic Structural Models: Problems and Prospects Part II: Mixed ContinuoussDiscrete Controls and Market Interactions. Yale University, mimeo. Pakes, A. and Pollard, D. (1989) “Simulation and the Asymptotics of Optimization Estimators”, Econometrica, 57, 1027-1057. Poirier, D. and Ruud, P.A. (1988) “Probit with Dependent Observations”, Review of Economic Studies, 55,5933614.
Ch. 40: CIassicaf Estimation Methods for LDV Models Using Simulation
2441
Quandt, R. (1972) “A New Approach to Estimating Switching Regressions”, Journal of the American Statistical Association, 67, 306-310. Quandt, R. (1986) Computational Problems in Econometrics, pp. 1395-1457, in: Z. Griliches and M. Intriligator, eds., Handbook ofEconometrics, Vol. 1. Amsterdam: North-Holland. Rubinstein, R. (1981) Simulation and the Monte Carlo Method. New York: Wiley. Rust, J. (1992) Estimation of Dynamic Structural Models: Problems and Prospects Part II: Discrete Decision Processes. SSRI Working Paper #9106, University of Wisconsin at Madison. Ruud, P. (1986) On the Method ofsimulated Moments for the Estimation of Limited Dependent Variable Models. University of California at Berkeley, mimeo. Ruud, P. (1991) “Extensions of Estimation Methods Using the EM Algorithm”, Journal ofEconometrics, 49,305-341. Stroud, A. (1971) Approximate Calculation ofMultiple Integrals. New York: Prentice-Hall. Thisted, R. (1988) Elements ofStatistical Computing. New York: Chapman and Hall. Thurstone, L. (1927) “A Law of Comparative Judgement”, Psychological Review, 34,273-286. Tierny, L. (1992). Markov Chainsfor Exploring Posterior Distributions. University of Minnesota, working paper. Tobin, J. (1958) “Estimation of Relationships for Limited Department Variables”, Econometrica, 26, 24-36. van Dijk, H.K. (1987) Some Advances in Bayesian Estimation Methods Using Monte Carlo Integration, pp. 205-261, in: T.B. Fomby and G.F. Rhodes, eds., Aduances in Econometrics, Vol. 6, Greenwich, CT: JAI Press. van Praag, B.M.S. and Hop, J.P. (1987) Estimation of Continuous Models on the Basis of Set-Valued Observations. Erasmus University Working Paper, presented at the ESEM Copenhagen. van Praag, B.M.S., Hop, J.P. and Eggink, E. (1989) A Symmetric Approach to the Labor Market by Means of the Simulated Moments Method with an Application to Married Females. Erasmus University Working Paper, presented at the EEA Augsburg. van Praag, B.M.S., Hop, J.P. and Eggink, E. (1991) A Symmetric Approach to the Labor Market by Means of the Simulated EM-Algorithm with an Application to Married Females. Erasmus University Working Paper, presented at the ESEM Cambridge. West, M. (1990) Bnyesian Computations: Monte-Carlo Density Estimation. Duke University, Discussion Paper 90-AlO.
Chapter 41
ESTIMATION MODELS* JAMES
OF SEMIPARAMETRIC
L. POWELL
Princeton Unioersity
Contents 2444 2444
Abstract 1. Introduction
2.
3.
2444
1.1.
Overview
1.2.
Definition
of “semiparametric”
1.3.
Stochastic
restrictions
1.4.
Objectives
and techniques
Stochastic
and structural
2449 2452
models
of asymptotic
2460
theory
2465
restrictions
2.1.
Conditional
mean restriction
2466
2.2.
Conditional
quantile
2469
2.3.
Conditional
symmetry
2.4.
Independence
2.5.
Exclusion
Structural
restrictions
2416
restrictions
2482
and index restrictions
2487
models
3.1.
Discrete
3.2.
Transformation
3.3.
Censored
and truncated
3.4.
Selection
models
3.5.
Nonlinear
4. Summary References
2414
restrictions
response
2487
models
2492
models regression
2500
models
2506 2511
panel data models
2513 2514
and conclusions
*This work was supported by NSF Grants 91-96185 and 92-10101 to Princeton University. I am grateful to Hyungtaik Ahn, Moshe Buchinsky, Gary Chamberlain, Songnian Chen, Gregory Chow, Angus Deaton, Bo Honor&, Joel Horowitz, Oliver Linton, Robin Lumsdaine, Chuck Manski, Rosa Ma&kin, Dan McFadden, Whitney Newey, Paul Ruud, and Tom Stoker for their helpful suggestions, which were generally adopted except when they were mutually contradictory or required a lot of extra work.
Handbook of Econometrics, Volume IV, Edited by R.F. En& 0 1994 Elseuier Science B.V. All rights reserved
and D.L. McFadden
J.L. Powell
2444
Abstract
A semiparametric model for observational data combines a parametric form for some component of the data generating process (usually the behavioral relation between the dependent and explanatory variables) with weak nonparametric restrictions on the remainder of the model (usually the distribution of the unobservable errors). This chapter surveys some of the recent literature on semiparametric methods, emphasizing microeconometric applications using limited dependent variable models. An introductory section defines semiparametric models more precisely and reviews the techniques used to derive the large-sample properties of the corresponding estimation methods. The next section describes a number of weak restrictions on error distributions ~ conditional mean, conditional quantile, conditional symmetry, independence, and index restrictions - and show how they can be used to derive identifying restrictions on the distributions of observables. This general discussion is followed by a survey of a number of specific estimators proposed for particular econometric models, and the chapter concludes with a brief account of applications of these methods in practice.
1.
1.l.
Introduction Overview
Semiparametric modelling is, as its name suggests, a hybrid of the parametric and nonparametric approaches to construction, fitting, and validation of statistical models. To place semiparametric methods in context, it is useful to review the way these other approaches are used to address a generic microeconometric problem ~ namely, determination of the relationship of a dependent variable (or variables) y to a set of conditioning variables x given a random sample {zi = (yi, Xi), i = 1,. . . , N} of observations on y and x. This would be considered a “micro’‘-econometric problem because the observations are mutually independent and the dimension of the conditioning variables x is finite and fixed. In a “macro’‘-econometric application using time series data, the analysis must also account for possible serial dependence in the observations, which is usually straightforward, and a growing or infinite number of conditioning variables, e.g. past values of the dependent variable y, which may be more difficult to accommodate. Even for microeconometric analyses of cross-sectional data, distributional heterogeneity and dependence due to clustering and stratification must often be considered; still, while the random sampling assumption may not be typical, it is a useful simplification, and adaptation of statistical methods to non-random sampling is usually straightforward. In the classical parametric approach to this problem, it is typically assumed that the dependent variable is functionally dependent on the conditioning variables
2445
Ch. 41: Estimation of Semiparametric Models
(“regressors”) and unobservable of the form
“errors” according to a fixed structural relation (1.1)
Y = g(x, @o,s),
where the structural function g(.) is known but the finite-dimensional parameter vector a,~Iwp and the error term E are unobserved. The form of g(.) is chosen to give a class of simple and interpretable data generating mechanisms which embody the relevant restrictions imposed by the characteristics of the data (e.g. g(‘) is dichotomous if y is binary) and/or economic theory (monotonicity, homotheticity, etc.). The error terms E are introduced to account for the lack of perfect fit of (1.1) for any fixed value of c1eand a, and are variously interpreted as expectational or optimization errors, measurement errors, unobserved differences in tastes or technology, or other omitted or unobserved conditioning variables; their interpretation influences the way they are incorporated into the structural function 9(.). To prevent (1.1) from holding tautologically for any value of ao, the stochastic behavior of the error terms must be restricted. The parametric approach takes the error distribution to belong to a finite-dimensional family of distributions, a Pr{s d nix} = f,(a Ix, ~0) dl,, (1.2) s -CO where f(.) is a known density (with respect to the dominating measure p,) except for an unknown, finite-dimensional “nuisance” parameter ‘lo. Given the assumed structural model (1.1) and the conditional error distribution (1.2), the conditional distribution of y given x can be derived,
$$\Pr\{y \le \lambda \mid x\} = \int 1\{g(x, \alpha_0, e) \le \lambda\}\, f(e \mid x, \eta_0)\, d\mu_\varepsilon.\qquad(1.3)$$
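To fix ideas, the following sketch simulates data from one fully parametric special case of (1.1)-(1.2): a binary response model with normal errors. The particular specification (two regressors, Gaussian ε with scale η₀) is an illustrative assumption of the sketch, not part of the chapter's general framework.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical parametric special case of (1.1)-(1.2):
# y = 1{x'alpha0 + eps > 0}, with eps | x ~ N(0, eta0^2).
rng = np.random.default_rng(0)
N = 1000
alpha0, eta0 = np.array([1.0, -0.5]), 1.0

x = rng.normal(size=(N, 2))                 # conditioning variables
eps = eta0 * rng.normal(size=N)             # errors from the parametric family
y = (x @ alpha0 + eps > 0).astype(float)    # structural relation g(x, alpha0, eps)

# Under (1.2)-(1.3), Pr{y = 1 | x} follows from the known error density:
p = norm.cdf(x @ alpha0 / eta0)
print(y.mean(), p.mean())                   # sample vs. model-implied frequencies
```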
for some "index" function v(x) with dim{v(x)} < dim{x}; a weak or mean index restriction asserts a similar property only for the conditional expectation,

$$E[\varepsilon \mid x] = E[\varepsilon \mid v(x)].\qquad(1.12)$$
For different structural models, the index function v(x) might be assumed to be a known function of x, or known up to a finite number of unknown parameters (e.g. v(x) = x'β₀), or an unknown function of known dimensionality (in which case some extra restriction(s) will be needed to identify the index). The function v(x) may be trivial, in which case (1.11) and (1.12) reduce to the independence and conditional mean restrictions; more generally, v(x) might be a known subvector x₁ of the regressors x, in which case (1.11) and (1.12) are strong and weak forms of an exclusion restriction, otherwise known as conditional independence and conditional mean independence of ε and x given x₁, respectively. When the index function is unknown, it is often assumed to be linear in the regressors, with coefficients that are related to unknown parameters of interest in the structural model. The following diagram summarizes the hierarchy of the stochastic restrictions to be discussed in the following sections of this chapter, with declining level of generality from top to bottom:
[Hierarchy diagram: nonparametric restrictions at the top, followed by independence, conditional symmetry, and the conditional location (mean and median) restrictions, with parametric restrictions at the bottom.]
Turning now to a description of some structural models treated in the semiparametric literature, an important class of parametric forms for the structural functions is the class of linear latent variable models, in which the dependent variable y is assumed to be generated as some transformation

$$y = t(y^*; \lambda_0, \tau_0(\cdot))\qquad(1.13)$$

of some unobservable variable y*, which itself has a linear regression representation

$$y^* = x'\beta_0 + \varepsilon.\qquad(1.14)$$
Here the regression coefficients β₀ and the finite-dimensional parameters λ₀ of the transformation function are the parameters of interest, while the error distribution and any nonparametric component τ₀(·) of the transformation make up the nonparametric component of the model. In general y and y* may be vector-valued, and restrictions on the coefficient matrix β₀ may be imposed to ensure identification of the remaining parameters. This class of models, which includes the classical linear model as a special case, might be broadened to permit a nonlinear (but parametric) regression function for the latent variable y*, as long as the additivity of the error terms in (1.14) is maintained. One category of latent variable models, parametric transformation models, takes the transformation function t(y*; λ₀) to have no nonparametric nuisance component τ₀(·) and to be invertible in y* for all possible values of λ₀. A well-known example of a parametric transformation model is the Box-Cox regression model (Box and Cox (1964)), which has y = t(x'β₀ + ε; λ₀) for

$$t^{-1}(y; \lambda) = \frac{y^\lambda - 1}{\lambda}\,1\{\lambda \neq 0\} + \ln(y)\,1\{\lambda = 0\}.$$
This transformation, which includes linear and log-linear (in y) regression models as special cases, requires the support of the latent variable y* to be bounded from below (by −1/λ₀) for noninteger values of λ₀, but has been extended by Bickel and Doksum (1981) to unbounded y*. Since the error term ε can be expressed as a known function of the observable variables and unknown parameters for these models, a stochastic restriction on ε (like a conditional mean restriction, defined below) translates directly into a restriction on y, x, β₀, and λ₀, which can be used to construct estimators. Another category, limited dependent variable models, includes latent variable models in which the transformation function t(y*) does not depend upon unknown parameters but is noninvertible, mapping intervals of possible y* values into single values of y. Scalar versions of these models have received much of the attention in the econometric literature on semiparametric estimation, owing to their relative simplicity and the fact that parametric methods generally yield inconsistent estimators for β₀ when the functional form of the error distribution is misspecified. The simplest nontrivial transformation in this category is
an indicator for positivity of the latent variable y*, which yields the binary response model

$$y = 1\{x'\beta_0 + \varepsilon > 0\},\qquad(1.15)$$
which is commonly used in econometric applications to model dichotomous choice problems. For this model, in which the parameters can be identified at most up to a scale normalization on β₀ or ε, the only point of variation of the function t(y*) occurs at y* = 0, which makes identification of β₀ particularly difficult. A model which shares much of the structure of the binary response model is the ordered response model, in which the latent variable y* is only known to fall in one of J + 1 ordered intervals {(−∞, c₁], (c₁, c₂], …, (c_J, ∞)}; that is,

$$y = \sum_{j=1}^{J} 1\{x'\beta_0 + \varepsilon > c_j\}.\qquad(1.16)$$
Here the thresholds {c_j} are assumed unknown (apart from a normalization like c₁ = 0), and must be estimated along with β₀. The grouped dependent variable model is a variation with known values of {c_j}, where the values of y might correspond to prespecified income intervals. A structural function for which the transformation function is more "informative" about β₀ is the censored regression model, also known in econometrics as the censored Tobit model (after Tobin (1958)). Here the observable dependent variable is assumed to be subject to a nonnegativity constraint, so that

$$y = \max\{0,\ x'\beta_0 + \varepsilon\};\qquad(1.17)$$

this structural function is often used as a model of individual demand or supply for some good when a fraction of individuals do not participate in that market. A variation on this model, the accelerated failure time model with fixed censoring, can be used as a model for duration data when some durations are incomplete. Here

$$y = \min\{x_1'\beta_0 + \varepsilon,\ x_2\},\qquad(1.18)$$
where y is the logarithm of the observable duration time (e.g. an unemployment spell), and x2 is the logarithm of the duration of the experiment (following which the time to completion for any ongoing spells is unobserved); the “fixed” qualifier denotes models in which both x1 and x2 are observable (and may be functionally related). These univariate limited dependent variable models have multivariate analogues which have also been considered in the semiparametric literature. One multivariate generalization of the binary response model is the multinomial response
model, for which the dependent variable is a J-dimensional vector of indicators, y = vec{y_j, j = 1, …, J}, with

$$y_j = 1\{y_j^* \ge y_k^*\ \text{for all}\ k \neq j\},\qquad(1.19)$$

and with each latent variable y_j^* generated by a linear model

$$y_j^* = x'\beta_{j0} + \varepsilon_j,\qquad \beta_0 \equiv (\beta_{10}', \ldots, \beta_{J0}')'.\qquad(1.20)$$
That is, y_j = 1 if and only if its latent variable y_j^* is the largest across alternatives. Another bivariate model, which combines the binary response and censored regression models, is the censored sample selection model, which has one binary response variable y₁ and one quantitative dependent variable y₂ which is observed only when y₁ = 1:

$$y_1 = 1\{x_1'\beta_1^0 + \varepsilon_1 > 0\}\qquad(1.21)$$

and

$$y_2 = y_1\cdot(x_2'\beta_2^0 + \varepsilon_2).\qquad(1.22)$$
This model includes the censored regression model as a special case, with β₁⁰ = β₂⁰ ≡ β₀ and ε₁ = ε₂ = ε. A closely related model is the disequilibrium regression model with observed regime, for which only the smaller of two latent variables is observed, and it is known which variable is observed:

$$y_1 = 1\{x_1'\beta_1^0 + \varepsilon_1 < x_2'\beta_2^0 + \varepsilon_2\}\qquad(1.23)$$

and

$$y_2 = \min\{x_1'\beta_1^0 + \varepsilon_1,\ x_2'\beta_2^0 + \varepsilon_2\}.\qquad(1.24)$$
A special case of this model, the randomly censored regression model, imposes the restriction β₁⁰ = 0, and is a variant of the duration model (1.18) in which the observable censoring threshold x₂ is replaced by a random threshold ε₂ which is unobserved for completed spells. A class of limited dependent variable models which does not neatly fit into the foregoing latent variable framework is the class of truncated dependent variable models, which includes the truncated regression and truncated sample selection models. In these models, an observable dependent variable y is constructed from latent variables drawn from a particular subset of their support. For the truncated regression model, the dependent variable y has the distribution of y* = x'β₀ + ε
conditional on y* > 0:

$$y = x'\beta_0 + u,\qquad \text{with } u \text{ distributed as } \varepsilon \text{ given } \varepsilon > -x'\beta_0.\qquad(1.25)$$

For panel data versions of the latent variable models, with two observations per individual and an individual-specific effect γ, the observed dependent variables may be written as

$$y_t = t(x_t'\beta_0 + \gamma + \varepsilon_t),\qquad t = 1, 2,\qquad(1.27)$$
where t(·) is any of the transformation functions discussed above and γ is an unobservable error term which is constant across time periods (unlike the time-specific errors ε₁ and ε₂) but may depend in an arbitrary way on the regressors x₁ and x₂. Consistent estimation of the parameters of interest β₀ for such models is a very challenging problem; while "time-differencing" or "deviation from cell means" eliminates the fixed effect for linear models, these techniques are not applicable to nonlinear models, except in certain special cases (as discussed by Chamberlain (1984)). Even when the joint distribution of the error terms ε₁ and ε₂ is known parametrically, maximum likelihood estimators for β₀, λ₀ and the distributional parameters will be inconsistent in general if the unknown values of γ are treated as individual-specific intercept terms (as noted by Heckman and MaCurdy (1980)), so semiparametric methods will be useful even when the distribution of the fixed effects is the only nuisance parameter of the model. The structural functions considered so far have been assumed known up to a finite-dimensional parameter. This is not the case for the generalized regression
model, which has

$$y = t_0(x'\beta_0 + \varepsilon),\qquad(1.28)$$
for some transformation function t₀(·) which is of unknown parametric form, but which is restricted either to be monotonic (as assumed by Han (1987a)) or smooth (or both). Formally, this model includes the univariate limited dependent variable and parametric transformation models as special cases; however, it is generally easier to identify and estimate the parameters of interest when the form of the transformation function t(·) is (parametrically) known. Another model which at first glance has a nonparametric component in the structural function is the partially linear or semilinear regression model proposed by Engle et al. (1986), who labelled it the "semiparametric regression model"; estimation of this model was also considered by Robinson (1988). Here the regression function is a nonparametric function of a subset x₁ of the regressors, and a linear function of the rest:

$$y = x_2'\beta_0 + \lambda_0(x_1) + \varepsilon,\qquad(1.29)$$
where λ₀(·) is unknown but smooth. By defining a new error term ε* = λ₀(x₁) + ε, a constant conditional mean assumption on the original error term ε translates into a mean exclusion restriction on the error terms in an otherwise-standard linear model. Yet another class of models with a nonparametric component are generated regressor models, in which the regressors x appear in the structural equation for y indirectly, through the conditional mean of some other observable variable w given x:
$$y = h(E[w\mid x], \alpha_0, \varepsilon) \equiv g(x, \alpha_0, \delta_0(\cdot), \varepsilon),\qquad(1.30)$$
with δ₀(x) ≡ E[w|x]. These models arise when modelling individual behavior under uncertainty, when actions depend upon predictions (here, conditional expectations) of unobserved outcomes, as in the large literature on "rational expectations". Formally, the nonparametric component in the structural function can be absorbed into an unobservable error term satisfying a conditional mean restriction; that is, defining η ≡ w − E[w|x] (so that E[η|x] ≡ 0), the model (1.30) with nonparametrically-generated regressors can be rewritten as y = h(w − η, α₀, ε), with a conditional mean restriction on the extra error term η. In practice, this alternative representation is difficult to manipulate unless h(·) is linear, and estimators are more easily constructed using the original formulation (1.30). Although the models described above have received much of the attention in the econometric literature on semiparametrics, they by no means exhaust the set of models with parametric and nonparametric components which are used in
econometric applications. One group of semiparametric models not considered here includes the proportional hazards model proposed and analyzed by Cox (1972, 1975) for duration data, and duration models more generally; these are discussed by Lancaster (1990), among many others. Another class of semiparametric models which is not considered here are choice-based or response-based sampling models; these are similar to truncated sampling models, in that the observations are drawn from sub-populations with restricted ranges of the dependent variable, eliminating the ancillarity of the regressors x. These models are discussed by Manski and McFadden (1981) and, more recently, by Imbens (1992).
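The observation rules (1.15)-(1.18) are simple enough to state directly in code; the following minimal Python sketch collects them as functions of the latent index, purely as a summary device (names and signatures are ad hoc).

```python
import numpy as np

# Minimal sketches of the limited dependent variable transformations
# (1.15)-(1.18); ystar is the latent index x'beta0 + eps (a numpy array).

def binary_response(ystar):
    # (1.15): indicator for positivity of the latent variable
    return (ystar > 0).astype(float)

def ordered_response(ystar, cutoffs):
    # (1.16): count of thresholds c_j exceeded by the latent variable
    return sum((ystar > c).astype(float) for c in cutoffs)

def censored_regression(ystar):
    # (1.17): censored (Tobit) observation rule
    return np.maximum(0.0, ystar)

def fixed_censoring(ystar, log_censor_time):
    # (1.18): accelerated failure time with fixed censoring
    return np.minimum(ystar, log_censor_time)
```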
1.4. Objectives and techniques of asymptotic theory
Because of the generality of the restrictions imposed on the error terms for semiparametric models, it is very difficult to obtain finite-sample results for the distribution of estimators except in special cases. Therefore, analysis of semiparametric models is based on large-sample theory, using classical limit theorems to approximate the sampling distribution of estimators. The goals and methods of this asymptotic distribution theory, briefly described here, are discussed in much more detail in the chapter by Newey and McFadden in this volume. As mentioned earlier, the first step in the statistical analysis of a semiparametric model is to demonstrate identification of the parameters α₀ of interest; though logically distinct, identification is often the first step in construction of an estimator of α₀. To identify α₀, at least one functional T(·) must be found that yields T(F₀) = α₀, where F₀ is the true joint distribution function of z = (y, x) (as in (1.3) above). This functional may be implicit: for example, α₀ may be shown to uniquely solve some functional equation T(F₀, α₀) = 0 (e.g. E[m(y, x, α₀)] = 0, for some m(·)). Given the functional T(·) and a random sample {z_i = (y_i, x_i), i = 1, …, N} of observations on the data vector z, a natural estimator of α₀ is

$$\hat{\alpha} = T(\hat{F}),\qquad(1.31)$$
where F̂ is a suitable estimator of the joint distribution function F₀. Consistency of α̂ (i.e. α̂ → α₀ in probability as N → ∞) is often demonstrated by invoking a law of large numbers after approximating the estimator as a sample average:

$$\hat{\alpha} = \frac{1}{N}\sum_{i=1}^{N} \varphi_N(y_i, x_i) + o_p(1),\qquad(1.32)$$
where E[φ_N(y, x)] → α₀. In other settings, consistency is demonstrated by showing that the estimator maximizes a random function which converges uniformly and almost surely to a limiting function with a unique maximum at the true value α₀. As noted below, establishing (1.31) can be difficult if construction of α̂ involves
explicit nonparametric estimators (through smoothing of the empirical distribution function). Once consistency of the estimator is established, the next step is to determine its rate of convergence, i.e. the steepest function h(N) such that h(N)(α̂ − α₀) = O_p(1). For regular parametric models, h(N) = √N, and this is a maximal rate under weaker semiparametric restrictions. If the estimator α̂ has h(N) = √N (in which case it is said to be root-N-consistent), then it is usually possible to find conditions under which the estimator has an asymptotically linear representation:
$$\hat{\alpha} = \alpha_0 + \frac{1}{N}\sum_{i=1}^{N} \psi(y_i, x_i) + o_p(1/\sqrt{N}),\qquad(1.33)$$

where the "influence function" ψ(·) has E[ψ(y, x)] = 0 and finite second moments. The Lindeberg-Levy central limit theorem then yields asymptotic normality of the estimator,

$$\sqrt{N}(\hat{\alpha} - \alpha_0) \xrightarrow{d} \mathcal{N}(0, V_0),\qquad(1.34)$$
where V₀ = E{ψ(y, x)[ψ(y, x)]'}. With a consistent estimator of V₀ (formed as the sample covariance matrix of some consistent estimator ψ̂(y_i, x_i) of the influence function), confidence regions and test statistics can be constructed with coverage/rejection probabilities which are approximately correct in large samples. For semiparametric models, as defined above, there will be other functionals T*(F) which can be used to construct estimators of the parameters of interest. The asymptotic efficiency of a particular estimator α̂ can be established by showing that its asymptotic covariance matrix V₀ in (1.34) is equal to the semiparametric analogue of the Cramér-Rao bound for estimation of α₀. This semiparametric efficiency bound is obtained as the smallest of all efficiency bounds for parametric models which satisfy the semiparametric restrictions. The representation α₀ = T*(F₀) which yields an efficient estimator generally depends on some component δ₀(·) of the unknown, infinite-dimensional nuisance parameter η₀(·), i.e. T*(·) = T*(·, δ₀), so construction of an efficient estimator requires explicit nonparametric estimation of some characteristics of the nuisance parameter. Demonstration of (root-N) consistency and asymptotic normality of an estimator depends on the complexity of the asymptotically linear representation (1.33), which in turn depends on the complexity of the estimator. In the simplest case, where the estimator can be written in closed form as a smooth function of sample averages,
$$\hat{\alpha} = a\left(\frac{1}{N}\sum_{i=1}^{N} m(y_i, x_i)\right),\qquad(1.35)$$
the so-called "delta method" yields an influence function ψ of the form

$$\psi(y, x) = [\partial a(\mu_0)/\partial\mu]\,[m(y, x) - \mu_0],\qquad(1.36)$$
where μ₀ ≡ E[m(y, x)]. Unfortunately, except for the classical linear model with a conditional mean restriction, estimators for semiparametric models are not of this simple form. Some estimators for models with weak index or exclusion restrictions on the errors can be written in closed form as functions of bivariate U-statistics,

$$\hat{\alpha} = a\left(\binom{N}{2}^{-1}\sum_{i<j} p_N(z_i, z_j)\right),\qquad(1.37)$$
with "kernel" function p_N that has p_N(z_i, z_j) = p_N(z_j, z_i) for z_i = (y_i, x_i); under conditions given by Powell et al. (1989), the representation (1.33) for such an estimator has influence function ψ of the same form as in (1.36), where now
$$m(y, x) = \lim_{N\to\infty} E[p_N(z_i, z_j)\mid z_i = (y, x)],\qquad \mu_0 = E[m(y, x)].\qquad(1.38)$$
A consistent estimator of the asymptotic covariance matrix of α̂ of (1.37) is the sample second moment matrix of

$$\hat{\psi}(z_i) = [\partial a(\hat{\mu})/\partial\mu]\left[\frac{1}{N-1}\sum_{j\neq i} p_N(z_i, z_j) - \hat{\mu}\right],\qquad \hat{\mu} = \binom{N}{2}^{-1}\sum_{i<j} p_N(z_i, z_j).\qquad(1.39)$$
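For the simplest case (1.35)-(1.36), the delta-method influence function can be checked numerically. The sketch below uses the illustrative choice a(μ) = ln μ with exponentially distributed m(y_i, x_i); all specifics are assumptions of the example.

```python
import numpy as np

# Schematic check of the "delta method" influence function (1.36) for an
# estimator of the form (1.35): a(mu) = log(mu) applied to a sample mean.
rng = np.random.default_rng(1)
N = 200_000
m = rng.exponential(scale=2.0, size=N)   # m(y_i, x_i), with mu0 = 2

mu_hat = m.mean()
alpha_hat = np.log(mu_hat)               # alpha_hat = a(sample average)

# Influence function psi = [da(mu0)/dmu][m - mu0] = (m - 2) / 2, so the
# asymptotic variance V0 = E[psi^2] = Var(m)/4 = 1 here.
psi = (m - 2.0) / 2.0
print(alpha_hat, np.log(2.0))            # estimator vs. true value
print(psi.var())                         # close to V0 = 1
```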
In most cases, the estimator α̂ will not have a closed-form expression like (1.35) or (1.37), but instead will be defined implicitly as a minimizer of some sample criterion function or a solution of estimating equations. Some (generally inefficient) estimators based on conditional location or symmetry restrictions are "M-estimators", defined as minimizers of an empirical loss function
$$\hat{\alpha} = \underset{\alpha\in\Theta}{\operatorname{argmin}}\ \frac{1}{N}\sum_{i=1}^{N} p(y_i, x_i, \alpha) \equiv \underset{\alpha\in\Theta}{\operatorname{argmin}}\ S_N(\alpha)\qquad(1.40)$$

and/or solutions of estimating equations

$$0 = \frac{1}{N}\sum_{i=1}^{N} m(y_i, x_i, \hat{\alpha}) \equiv \hat{m}_N(\hat{\alpha}),\qquad(1.41)$$
for some functions p(·) and m(·), with dim{m(·)} = dim(α). When p(y, x, α) (or m(y, x, α)) is a uniformly continuous function in the parameters over the entire parameter space Θ (with probability one), a standard uniform law of large numbers can be used to ensure that normalized versions of these criteria converge to their
where the kernel p_N(·) has the same symmetry property as stated for (1.37) above; such estimators arise for models with independence or index restrictions on the error terms. Results by Nolan and Pollard (1987, 1988), Sherman (1993) and Honoré and Powell (1991) can be used to establish the consistency and asymptotic normality of this estimator, which will have an influence function of the form (1.42) when

$$m(y, x, \alpha) = \lim_{N\to\infty} \frac{\partial E[p_N(z_i, z_j, \alpha)\mid y_i = y, x_i = x]}{\partial\alpha}.\qquad(1.47)$$
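A concrete instance of the M-estimator definition (1.40) is least absolute deviations for a linear model, with p(y, x, α) = |y − x'α|; a minimal numerical sketch (simulated data, derivative-free minimization as a stand-in for specialized algorithms) follows.

```python
import numpy as np
from scipy.optimize import minimize

# A minimal M-estimator in the sense of (1.40): least absolute deviations
# for a linear model. All specification choices here are illustrative.
rng = np.random.default_rng(2)
N = 500
x = np.column_stack([np.ones(N), rng.normal(size=N)])
alpha0 = np.array([0.5, 1.0])
y = x @ alpha0 + rng.standard_t(df=3, size=N)  # errors with conditional median 0

S_N = lambda a: np.mean(np.abs(y - x @ a))     # sample criterion S_N(alpha)
alpha_hat = minimize(S_N, x0=np.zeros(2), method="Nelder-Mead").x
print(alpha_hat)                                # near (0.5, 1.0)
```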
A more difficult class of estimators to analyze are those termed "semiparametric M-estimators" by Horowitz (1988a), for which the estimating equations in (1.41) also depend upon an estimator of a nonparametric component δ₀(·); that is, α̂ solves
$$0 = \frac{1}{N}\sum_{i=1}^{N} m(y_i, x_i, \hat{\alpha}, \hat{\delta}(\cdot)) \equiv \hat{m}_N(\hat{\alpha}, \hat{\delta}(\cdot))\qquad(1.48)$$
for some nonparametric estimator δ̂ of δ₀. This condition might arise as a first-order condition for minimization of an empirical loss function that depends on δ̂,
$$\hat{\alpha} = \underset{\alpha\in\Theta}{\operatorname{argmin}}\ \frac{1}{N}\sum_{i=1}^{N} p(y_i, x_i, \alpha, \hat{\delta}(\cdot)),\qquad(1.49)$$
as considered by Andrews (1990a, b). As noted above, an efficient estimator for any semiparametric model is generally of this form, and estimators for models with independence or index restrictions are often in this class. To derive the influence function for an estimator satisfying (1.48), a functional mean-value expansion of m̂_N(α̂, δ̂) around δ̂ = δ₀ can be used to determine the effect on α̂ of estimation of δ₀. Formally, condition (1.48) yields
$$0 = \hat{m}_N(\hat{\alpha}, \hat{\delta}(\cdot)) = \hat{m}_N(\hat{\alpha}, \delta_0(\cdot)) + L_N(\hat{\delta}(\cdot) - \delta_0(\cdot)) + o_p(1/\sqrt{N})\qquad(1.50)$$

for some linear functional L_N; then, with an influence function representation of this second term,

$$L_N(\hat{\delta}(\cdot) - \delta_0(\cdot)) = \frac{1}{N}\sum_{i=1}^{N} \xi(y_i, x_i) + o_p(1/\sqrt{N})\qquad(1.51)$$
(with E[ξ(y, x)] = 0), the form of the influence function for a semiparametric M-estimator is

$$\psi(y, x) = -\left[\partial E\{m(y, x, \alpha, \delta_0)\}/\partial\alpha'\big|_{\alpha=\alpha_0}\right]^{-1}\left[m(y, x, \alpha_0, \delta_0) + \xi(y, x)\right].\qquad(1.52)$$
To illustrate, suppose δ₀ is finite-dimensional, δ₀ ∈ ℝᵏ; then the linear functional in (1.50) would be a matrix product,

$$L_N(\hat{\delta} - \delta_0) = L_0\cdot(\hat{\delta} - \delta_0) = \left[\partial E\{m(y, x, \alpha, \delta)\}/\partial\delta'\big|_{\delta=\delta_0,\,\alpha=\alpha_0}\right](\hat{\delta} - \delta_0),\qquad(1.53)$$
and the additional component ξ of the influence function in (1.52) would be the product of the matrix L₀ with the influence function of the preliminary estimator δ̂. When δ₀ is infinite-dimensional, calculation of the linear functional L_N and the associated influence function ξ depends on the nature of the nuisance parameter δ₀ and how it enters the moment function m(y, x, α, δ). One important case has δ₀ equal to the conditional expectation of some function s(y, x) of the data given some other function v(x) of the regressors, with m(·) a function only of the fitted values of this expectation; that is,
$$\delta_0 = \delta_0(v(x)) = E[s(y, x)\mid v(x)]\qquad(1.54)$$
and

$$m(y, x, \alpha, \delta) = m(y, x, \alpha, \delta(v(x))),\qquad(1.55)$$

with ∂m/∂δ well-defined. For instance, this is the structure of efficient estimators for conditional location restrictions. For this case, Newey (1991) has shown that the adjustment term ξ(y, x) to the influence function of a semiparametric M-estimator α̂ is of the form

$$\xi(y, x) = \left[E\{\partial m(y, x, \alpha_0, \delta)/\partial\delta\mid v(x)\}\big|_{\delta=\delta_0}\right]\left[s(y, x) - \delta_0(v(x))\right].\qquad(1.56)$$
In some cases the leading matrix in this expression is identically zero, so the asymptotic distribution of the semiparametric M-estimator is the same as if δ₀(·) were known; Andrews (1990a, b) considered this and other settings for which the adjustment term ξ is identically zero, giving regularity conditions for validity of the expansion (1.50) in such cases. General formulae for the influence functions of more complicated semiparametric M-estimators are derived by Newey (1991) and are summarized in Andrews' and Newey and McFadden's chapters in this volume.
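A stylized numerical illustration of a two-step estimator of the form (1.48) is given below for the generated regressor model (1.30) with linear h(·): the nuisance function δ₀(x) = E[w|x] is replaced by a kernel regression estimate before the parametric moment condition is solved. The construction (Gaussian kernel, fixed bandwidth, no trimming, helper name kernel_regression) is an ad hoc sketch rather than any of the specific estimators cited above.

```python
import numpy as np

# Two-step "semiparametric M-estimator" in the spirit of (1.48) for a
# generated regressor model: y = a0 * E[w|x] + eps, delta0(x) = E[w|x].
rng = np.random.default_rng(3)
N = 2000
x = rng.uniform(-1, 1, size=N)
delta0 = np.sin(np.pi * x)                      # true conditional mean E[w|x]
w = delta0 + rng.normal(size=N)
a0 = 2.0
y = a0 * delta0 + rng.normal(size=N)

def kernel_regression(x0, x, z, h=0.1):
    # Nadaraya-Watson estimate of E[z | x = x0] with a Gaussian kernel
    k = np.exp(-0.5 * ((x0[:, None] - x[None, :]) / h) ** 2)
    return (k @ z) / k.sum(axis=1)

delta_hat = kernel_regression(x, x, w)          # first step: nonparametric delta
# Second step: solve the sample moment 0 = sum(delta_hat * (y - a * delta_hat))
a_hat = (delta_hat @ y) / (delta_hat @ delta_hat)
print(a_hat)                                     # near a0 = 2
```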
2. Stochastic restrictions
This section discusses how various combinations of structural equations and stochastic restrictions on the unobservable errors imply restrictions on the joint distribution of the observable data, and presents general estimation methods for the parameters of interest which exploit these restrictions on observables. The classification scheme here is the same as introduced in the monograph by Manski
(1988b) (and also in Manski's chapter in this volume), although the discussion here puts more emphasis on estimation techniques and properties. Readers who are familiar with this material, or who are interested in a particular structural form, may wish to skip ahead to Section 3 (which reviews the literature for particular models), referring back to this section when necessary.
2.1. Conditional mean restrictions
As discussed in Section 1.3 above, the class of constant conditional location restrictions for the error distribution assert constancy of

$$\nu_0 = \underset{b}{\operatorname{argmin}}\ E[r(\varepsilon - b)\mid x],\qquad(2.1)$$
for some function r(·) which is nonincreasing for negative arguments and nondecreasing for positive arguments; this implies a moment condition E[q(ε − ν₀)|x] = 0, for q(u) = ∂r(u)/∂u. When the loss function of (2.1) is taken to be quadratic, r(u) = u'u, the corresponding conditional location restriction imposes constancy of the conditional mean of the error terms,

$$E[\varepsilon\mid x] = \mu_0\qquad(2.2)$$
for some μ₀. By appropriate definition of the dependent variable(s) y and "exogenous" variables x, this restriction may be applied to models with "endogenous" regressors (that is, some components of x may be excluded from the restriction (2.2)). This restriction is useful for identification of the parameters of interest for structural functions g(x, α, ε) that are invertible in the error terms ε; that is, y = g(x, α₀, ε) implies ε = e(y, x, α₀) for some function e(·), so that the mean restriction (2.2) can be rewritten

$$E[e(y, x, \alpha_0)\mid x] = \mu_0 = 0,\qquad(2.3)$$
where the latter equality imposes the normalization μ₀ ≡ 0 (i.e., the mean μ₀ is appended to the vector α₀ of parameters of interest). Conditional mean restrictions are useful for some models that are not completely specified, that is, for models in which some components of the structural function g(·) are unknown or unspecified. In many cases it is more natural to specify the function e(·) characterizing a subset of the error terms than the structural function g(·) for the dependent variable; for example, the parameters of interest may be coefficients of a single equation from a simultaneous equations system, and it is
often possible to specify the function e(·) without specifying the remaining equations of the model. However, conditional mean restrictions generally are insufficient to identify the parameters of interest in noninvertible limited dependent variable models, as Manski (1988a) illustrates for the binary response model. The conditional moment condition (2.3) immediately yields an unconditional moment equation of the form
$$0 = E[d(x)\, e(y, x, \alpha_0)],\qquad(2.4)$$
where d(x) is some conformable matrix with at least as many rows as the dimension of α₀. For a given function d(·), the sample analogue of the right-hand side of (2.4) can be used to construct a method-of-moments or generalized method-of-moments estimator, as described in Section 1.4; the columns of the matrix d(x) are "instrumental variables" for the corresponding rows of the error vector ε. More generally, the function d(·) may depend on the parameters of interest α₀ and a (possibly) infinite-dimensional nuisance parameter δ₀(·), so a semiparametric M-estimator for α₀ may be defined to solve
$$0 = \frac{1}{N}\sum_{i=1}^{N} d(x_i, \hat{\alpha}, \hat{\delta})\, e(y_i, x_i, \hat{\alpha}),\qquad(2.5)$$
where dim(d(·)) = dim(α) × dim(ε) and δ̂ = δ̂(·) is a consistent estimator of the nuisance function δ₀(·). For example, these sample moment equations arise as the first-order conditions for the GMM minimization given in (1.43), where the moment functions take the form m(y, x, α) = c(x) e(y, x, α), for a matrix c(x) of fixed functions of x with number of rows greater than or equal to the number of components of α. Then, assuming differentiability of e(·), the GMM estimator solves (2.5) with
$$d(x, \hat{\alpha}, \hat{\delta}) = \left[\frac{1}{N}\sum_{i=1}^{N}\big[\partial e(y_i, x_i, \hat{\alpha})/\partial\alpha\big]'\,[c(x_i)]'\right] A_N\, c(x),\qquad(2.6)$$
where A_N is the weight matrix given in (1.43). Since the function d(·) depends on the data only through the conditioning variable x, it is simple to derive the form of the asymptotic distribution for the estimator α̂ which solves (2.5), using the results stated in Section 1.4:

$$\sqrt{N}(\hat{\alpha} - \alpha_0) \xrightarrow{d} \mathcal{N}(0,\ M_0^{-1}V_0(M_0')^{-1}),\qquad(2.7)$$

where

$$M_0 = E\big[d(x, \alpha_0, \delta_0)\,\partial e(y, x, \alpha_0)/\partial\alpha'\big]$$
and

$$V_0 = E[d(x, \alpha_0, \delta_0)\, e(y, x, \alpha_0)\, e'(y, x, \alpha_0)\, d'(x, \alpha_0, \delta_0)] = E[d(x, \alpha_0, \delta_0)\,\Sigma(x)\, d'(x, \alpha_0, \delta_0)].$$

In this expression, Σ(x) is the conditional covariance matrix of the error terms,

$$\Sigma(x) \equiv E[e(y, x, \alpha_0)\, e'(y, x, \alpha_0)\mid x] = E[\varepsilon\varepsilon'\mid x].$$
Also, the expectation and differentiation in the definition of M₀ can often be interchanged, but the order given above is often well-defined even if d(·) or e(·) is not smooth in α. A simple extension of the Gauss-Markov argument can be used to show that an efficient choice of instrumental variable matrix d*(x) is of the form
$$d^*(x) = d^*(x, \alpha_0, \delta_0) = \left[\partial E[e(y, x, \alpha)\mid x]/\partial\alpha'\big|_{\alpha=\alpha_0}\right]'[\Sigma(x)]^{-1};\qquad(2.8)$$

the resulting efficient estimator α̂* will have

$$\sqrt{N}(\hat{\alpha}^* - \alpha_0) \xrightarrow{d} \mathcal{N}(0, V^*),$$

with

$$V^* = \left\{E\big[d^*(x)\,\Sigma(x)\,[d^*(x)]'\big]\right\}^{-1},\qquad(2.9)$$

under suitable regularity conditions. Chamberlain (1987) showed that V* is the semiparametric efficiency bound for any "regular" estimator of α₀ when only the conditional moment restriction (2.3) is imposed. Of course, the optimal matrix d*(x) of instrumental variables depends upon the conditional distribution of y given x, an infinite-dimensional nuisance parameter, so direct substitution of d*(x) in (2.5) is not feasible. Construction of a feasible efficient estimator for α₀ generally uses nonparametric regression and a preliminary inefficient GMM estimator of α₀ to construct estimates of the components of d*(x), the conditional mean of ∂e(y, x, α₀)/∂α and the conditional covariance matrix of e(y, x, α₀). This is the approach taken by Carroll (1982), Robinson (1987), Newey (1990b), Linton (1992) and Delgado (1992), among others. Alternatively, a "nearly" efficient sequence of estimators can be generated as a sequence of GMM estimators with moment functions of the form m(y, x, α) = c(x) e(y, x, α), where the number of rows of c(x) (i.e. the number of "instrumental variables") increases slowly as the sample size increases; Newey (1988a) shows that if linear combinations of c(x) can be used to approximate d*(x) to an arbitrarily high degree as the size of c(x) increases, then the asymptotic variance of the corresponding sequence of GMM estimators equals V*.
For the linear model y = x'β₀ + ε with scalar dependent variable y, the form of the optimal instrumental variable matrix d*(x) simplifies to the vector

$$d^*(x) = [\sigma^2(x)]^{-1}\, x,$$

where σ²(x) is the conditional variance of the error term ε. As noted in Section 1.2 above, an efficient estimator for β₀ would be a weighted least squares estimator, with weights proportional to a nonparametric estimator of [σ²(x)]⁻¹, as considered by Robinson (1987).
2.2. Conditional quantile restrictions
In its most general form, the conditional πth quantile of a scalar error term ε is defined to be any function η(x; π) for which the conditional distribution of ε has at least probability π to the left and probability 1 − π to the right of η(x; π):

$$\Pr\{\varepsilon \le \eta(x;\pi)\mid x\} \ge \pi\quad\text{and}\quad \Pr\{\varepsilon \ge \eta(x;\pi)\mid x\} \ge 1 - \pi.\qquad(2.10)$$

A conditional quantile restriction is the assumption that, for some π ∈ (0, 1), this conditional quantile is independent of x,

$$\eta(x;\pi) = \eta_0(\pi) \equiv \eta_0\quad\text{a.s.}\qquad(2.11)$$

Usually the conditional distribution of ε is further restricted to have no point mass at its conditional quantile (Pr{ε = η₀} = 0), which with (2.10) implies the conditional moment restriction

$$E[\pi - 1\{\varepsilon \le \eta_0\}\mid x] = 0\quad\text{a.s.}$$

Under this restriction, the true parameter value α₀ minimizes the expected weighted "check" function criterion

$$Q(\alpha; w(\cdot), \pi) \equiv E\big[w(x)\,\rho_\pi(y - g(x, \alpha, 0))\big],\qquad \rho_\pi(u) = u[\pi - 1(u < 0)],\qquad(2.18)$$

over the parameter space, where w(x) is any scalar, nonnegative function of x which has E[w(x)·|g(x, α, 0)|] < ∞. For a particular structural function g(·), then, the unknown parameters will be identified if conditions on the error distribution, regressors, and weight function w(x) are imposed which ensure the uniqueness of the minimizer of Q(α; w(·), π) in (2.18). Sufficient conditions are uniqueness of the πth conditional quantile η₀ = 0 of the error distribution and Pr{w(x) > 0, g(x, α, 0) ≠ g(x, α₀, 0)} > 0 whenever α ≠ α₀. Given a sample {(y_i, x_i), i = 1, …, N} of observations on y and x, the sample analogue of the minimand in (2.18) is
$$Q_N(\alpha; w(\cdot), \pi) = \frac{1}{N}\sum_{i=1}^{N} w(x_i)\,\rho_\pi(y_i - g(x_i, \alpha, 0)),\qquad(2.19)$$

where an additive constant which does not affect the minimization problem has been deleted. In general, the weight function w(x) may be allowed to depend upon
nuisance parameters, w(x) ≡ w(x, δ₀), so a feasible weighted quantile estimator of α₀ might be defined to minimize Q_N(α; ŵ(·), π), with ŵ(x) = w(x, δ̂) for some preliminary estimator δ̂ of δ₀. In the special case of a conditional median restriction (π = ½), minimization of Q_N is equivalent to minimization of a weighted sum of absolute deviations criterion,

$$\frac{1}{N}\sum_{i=1}^{N} w(x_i)\,|y_i - g(x_i, \alpha, 0)|,\qquad(2.20)$$
which, with w(x) ≡ 1, is the usual starting point for estimation of the particular models considered in the literature cited below. When the structural function g(·) is of the latent variable form (g(x, α, ε) = t(x'β + ε, τ)), the estimator α̂ which minimizes Q_N(α; ŵ, π) will typically solve an approximate first-order condition,
$$\frac{1}{N}\sum_{i=1}^{N} \hat{w}(x_i)\,[\pi - 1(y_i < g(x_i, \hat{\alpha}, 0))]\, b(x_i, \hat{\alpha})\,\frac{\partial g(x_i, \hat{\alpha}, 0)}{\partial\alpha} \cong 0,\qquad(2.21)$$
where b(x, α) is defined in (2.15) and ∂g(·)/∂α denotes the vector of left derivatives. (The equality is only approximate due to the nondifferentiability of ρ_π(u) at zero and possible nondifferentiability of g(·) at α̂; the symbol "≅" in (2.21) means the left-hand side converges in probability to zero at an appropriate rate.) These equations are of the form

$$0 \cong \frac{1}{N}\sum_{i=1}^{N} m(y_i, x_i, \hat{\alpha})\, d(x_i, \hat{\alpha}, \hat{\delta}),$$

where the moment function m(·) is defined in (2.16) and

$$d(x_i, \hat{\alpha}, \hat{\delta}) \equiv \hat{w}(x_i)\, b(x_i, \hat{\alpha})\,\frac{\partial g(x_i, \hat{\alpha}, 0)}{\partial\alpha}.$$
Thus the quantile minimization problem yields an analogue to the unconditional moment restriction E[m(y, x, α₀) d(x, α₀, δ₀)] = 0, which follows from (2.16). As outlined in Section 1.4 above, under certain regularity conditions (given by Powell (1991)) the quantile estimator α̂ will be asymptotically normal,
$$\sqrt{N}(\hat{\alpha} - \alpha_0) \xrightarrow{d} \mathcal{N}(0,\ M_0^{-1}V_0(M_0')^{-1}),\qquad(2.22)$$
where now

$$M_0 = E\left[f(0\mid x)\, w(x, \delta_0)\, b(x, \alpha_0)\,\frac{\partial g(x, \alpha_0, 0)}{\partial\alpha}\frac{\partial g(x, \alpha_0, 0)}{\partial\alpha'}\right]$$

and

$$V_0 = \pi(1-\pi)\, E\left[w^2(x, \delta_0)\, b(x, \alpha_0)\,\frac{\partial g(x, \alpha_0, 0)}{\partial\alpha}\frac{\partial g(x, \alpha_0, 0)}{\partial\alpha'}\right],$$
for f(0|x) being the conditional density of the "residual" y − g(x, α₀, 0) at zero (which appears from the differentiation of the expectation of the indicator function in (2.21)). The "regularity" conditions include invertibility of the matrix M₀, which is identically zero for the binary and ordered response models; as shown by Kim and Pollard (1990), the rate of convergence of the estimator α̂ is slower than √N for these models. When (2.22) holds, an efficient choice of weight function w(x) for this problem is
$$w^*(x) \equiv f(0\mid x),\qquad(2.23)$$
for which the corresponding estimator α̂* has

$$\sqrt{N}(\hat{\alpha}^* - \alpha_0) \xrightarrow{d} \mathcal{N}(0, V^*),\qquad(2.24)$$
with

$$V^* = \pi(1-\pi)\left\{E\left[f^2(0\mid x)\, b(x, \alpha_0)\,\frac{\partial g(x, \alpha_0, 0)}{\partial\alpha}\frac{\partial g(x, \alpha_0, 0)}{\partial\alpha'}\right]\right\}^{-1}.$$
The matrix V* was shown by Newey and Powell (1990) to be the semiparametric efficiency bound for the linear and censored regression models with a conditional quantile restriction, and this is likely to be the case for a more general class of structural models. For the linear regression model g(x, α₀, ε) ≡ x'β₀ + ε, estimation of the true coefficients β₀ using a least absolute deviations criterion dates from Laplace (1793); the extension to other quantile restrictions was proposed by Koenker and Bassett (1978). In this case b(x, α) = 1 and ∂g(x, α, ε)/∂α = x, which simplifies the asymptotic variance formulae. In the special case in which the conditional density of ε = y − x'β₀ at zero is constant, f(0|x) = f₀, the asymptotic covariance matrix of the quantile estimator β̂ further simplifies to

$$V^* = \pi(1-\pi)\,[f_0]^{-2}\,\{E[xx']\}^{-1}.$$

(Of course, imposition of the additional restriction of a constant conditional density at zero may affect the semiparametric information bound for estimation of β₀.) The monograph by Bloomfield and Steiger (1983) gives a detailed discussion of the theory and computation of least absolute deviations estimators.
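For the linear model with w(x) ≡ 1, the criterion (2.19) can be minimized directly; the sketch below traces out several quantile fits on simulated data. The use of a generic derivative-free optimizer is purely illustrative (linear programming methods are standard for this problem).

```python
import numpy as np
from scipy.optimize import minimize

# Quantile criterion (2.19) for the linear model with w(x) = 1, using the
# "check" function rho_pi(u) = u * (pi - 1{u < 0}).
def rho(u, pi):
    return u * (pi - (u < 0))

rng = np.random.default_rng(5)
N = 1000
x = np.column_stack([np.ones(N), rng.normal(size=N)])
beta0 = np.array([1.0, -2.0])
y = x @ beta0 + rng.normal(size=N)   # symmetric errors: parallel quantile lines

for pi in (0.25, 0.5, 0.75):
    Q_N = lambda b: np.mean(rho(y - x @ b, pi))
    b_hat = minimize(Q_N, x0=np.zeros(2), method="Nelder-Mead").x
    print(pi, b_hat)                 # intercept shifts with pi; slope near -2
```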
for some h(·) and all possible x, α and ε. Then the random function h(y, x, α) = h(g(x, α₀, ε), x, α) will also be symmetrically distributed about zero when α = α₀, implying the conditional moment restriction
$$E[h(y, x, \alpha_0)\mid x] = E[h(g(x, \alpha_0, \varepsilon), x, \alpha_0)\mid x] = 0.\qquad(2.27)$$
As with the previous restrictions, the conditional moment restriction can be used to generate an unconditional moment equation of the form E[d(x) h(y, x, α₀)] = 0, with d(x) a conformable matrix of instruments with a number of rows equal to the number of components of α₀. In general, the function d(x) can be a function of α and nuisance parameters δ (possibly infinite-dimensional), so a semiparametric M-estimator α̂ of α₀ can be constructed to solve the sample moment equations
$$0 = \frac{1}{N}\sum_{i=1}^{N} d(x_i, \hat{\alpha}, \hat{\delta})\, h(y_i, x_i, \hat{\alpha}),\qquad(2.28)$$
for δ̂ an estimator of some nuisance parameters δ₀. For structural functions g(x, α, ε) which are invertible in the error terms, it is straightforward to find a transformation satisfying condition (2.26). Since ε = e(y, x, α₀) is (trivially) an odd function of ε, h(·) can be chosen as this inverse function e(·). Even for noninvertible structural functions, it is still sometimes possible to find a "trimming" function h(·) which counteracts the asymmetry induced in the conditional distribution of y by the nonlinear transformation g(·). Examples discussed below include the censored and truncated regression models and a particular selectivity bias model. As with the quantile estimators described in a preceding section, the moment condition (2.27) is sometimes insufficient to identify the parameters α₀, since the "trimming" transformation h(·) may be identically zero when evaluated at certain values of α in the parameter space. For example, the symmetrically censored least squares estimator proposed by Powell (1986b) for the censored regression model satisfies condition (2.27) with a function h(·) which is nonzero only when the fitted regression function x'β exceeds the censoring point (zero), so that the sample moment equation (2.28) will be trivially satisfied if β is chosen so that x'β is nonpositive for all observations. In this case, the estimator β̂ was defined not only as a solution to a sample moment condition of the form (2.28), but in terms of a particular minimization problem β̂ = argmin_β Q_N(β) which yields (2.28) as a first-order condition. The limiting minimand was shown to have a unique minimizer at β₀, even though the limiting first-order conditions have multiple solutions; thus, this further restriction on the acceptable solutions to the first-order condition was enough to ensure consistency of the estimator β̂ for β₀. Construction of an analogous minimization problem might be necessary to fully exploit the symmetry restriction for other structural models. Once consistency of a particular estimator α̂ satisfying (2.28) is established, the asymptotic distribution theory immediately follows from the GMM formulae pre-
sented in Section 2.1 above. For a particular choice of h(·), the form of the sample moment condition (2.28) is the same as condition (2.6) of Section 2.1 above, replacing the inverse transformation "e(·)" with the more general "h(·)" here; thus, the form of the asymptotically normal distribution of α̂ satisfying (2.28) is given by (2.7) of Section 2.1, again replacing "e(·)" with "h(·)". Of course, the choice of the symmetrizing transformation h(·) is not unique: given any h(·) satisfying (2.26), another transformation h*(y, x, α) = l(h(y, x, α), x, α) will also satisfy (2.26) if l(u, x, α) is an odd function of u for all x and α. This multiplicity of possible symmetrizing transformations complicates the derivation of the semiparametric efficiency bounds for estimation of α₀ under the symmetry restriction, which are typically derived on a case-by-case basis. For example, Newey (1991) derived the semiparametric efficiency bounds for the censored and truncated regression models under the conditional symmetry restriction (2.25), and indicated how efficient estimators for these models might be constructed. For the linear regression model g(x, α₀, ε) ≡ x'β₀ + ε, the efficient symmetrizing transformation h(y, x, β) is the derivative of the log-density of ε given x, evaluated at the residual y − x'β, with optimal instruments equal to the regressors x:

$$h^*(y, x, \beta) = \partial\ln f_{\varepsilon|x}(y - x'\beta\mid x)/\partial\varepsilon,\qquad d^*(x, \beta, \delta) = x.$$
Here an efficient estimator might be constructed using a nonparametric estimator of the conditional density of ε given x, itself based on residuals ε̃ = y − x'β̃ from a preliminary fit of the model. Alternatively, as proposed by Cragg (1983) and Newey (1988a), an efficient estimator might be constructed as a sequence of GMM estimators, based on a growing number of transformation functions h(·) and instrument sets d(·), which are chosen to ensure that the sequence of GMM influence functions can approximate the influence function for the optimal estimator arbitrarily well. In either case, the efficient estimator would be "adaptive" for the linear model, since it would be asymptotically equivalent to the maximum likelihood estimator with known error density.
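As a numerical illustration of the multiplicity of symmetrizing transformations, the following ad hoc sketch stacks the moment conditions generated by two odd functions of the residual, u and u³, with the regressors as instruments, and solves the overidentified system by least squares on the moments; it is not an efficient estimator.

```python
import numpy as np
from scipy.optimize import least_squares

# Estimation under conditional symmetry for the invertible linear model:
# any odd transformation of the residual yields a valid moment condition
# of the form (2.27)/(2.28). This stacks h1(u) = u and h2(u) = u^3.
rng = np.random.default_rng(6)
N = 2000
x = np.column_stack([np.ones(N), rng.normal(size=N)])
beta0 = np.array([0.0, 1.5])
eps = rng.standard_t(df=5, size=N)       # symmetric about zero given x
y = x @ beta0 + eps

def moments(b):
    u = y - x @ b
    g1 = x.T @ u / N                     # sample analogue of E[x * h1(u)]
    g2 = x.T @ (u ** 3) / N              # sample analogue of E[x * h2(u)]
    return np.concatenate([g1, g2])

b_hat = least_squares(moments, x0=np.zeros(2)).x
print(b_hat)                             # near (0, 1.5)
```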
2.4. Independence restrictions
Perhaps the most commonly-imposed semiparametric restriction is the assumption of independence of the error terms and the regressors,

$$\Pr\{\varepsilon_i \le \lambda\mid x_i\} = \Pr\{\varepsilon_i \le \lambda\}\quad\text{for all real }\lambda,\ \text{w.p.1.}\qquad(2.29)$$
Like conditional symmetry restrictions, this condition implies constancy of the conditional mean and median (as well as the conditional mode), so estimators which are consistent under these weaker restrictions are equally applicable here. In fact, for models which are invertible in the errors (ε ≡ e(y, x, α₀) for some e(·)), a large
class of GMM estimators is available, based upon the general moment condition

$$E\{d(x)\,[l(e(y, x, \alpha_0)) - \nu_0]\} = 0\qquad(2.30)$$
for any conformable functions d(·) and l(·) for which the moment in (2.30) is well-defined, with ν₀ ≡ E[l(ε)]. (MaCurdy (1982) and Newey (1988a) discuss how to exploit these restrictions to obtain more efficient estimators of linear regression coefficients.) Independence restrictions are also stronger than the index and exclusion restrictions to be discussed in the next section, so estimation approaches based upon those restrictions will be relevant here. In addition to estimation approaches based on these weaker implied stochastic restrictions, certain approaches specific to independence restrictions have been proposed. One strategy to estimate the unknown parameters involves maximization of a "feasible" version of the log-likelihood function, in which the unknown distribution function of the errors is replaced by a (preliminary or concomitant) nonparametric estimator. For some structural functions (in particular, discrete response models), the conditional likelihood function for the observable data depends only on the cumulative distribution function F_ε(·) of the error terms, and not its derivative (density). Since cumulative distribution functions are bounded and satisfy certain monotonicity restrictions, the set of possible c.d.f.'s will be compact with respect to an appropriately chosen topology, so in such cases an estimator of the parameters of interest α₀ can be defined by maximization of the log-likelihood simultaneously over the finite-dimensional parameter α and the infinite-dimensional nuisance parameter F_ε(·). That is, if f(y|x, α, F_ε(·)) is the conditional density of y given x and the unknown parameters α₀ and F_ε (with respect to a fixed measure μ_y), a nonparametric maximum likelihood (NPML) estimator for the parameters can be defined as
$$(\hat{\alpha}, \hat{F}) = \underset{\alpha\in\Theta,\,F\in\mathcal{F}}{\operatorname{argmax}}\ \frac{1}{N}\sum_{i=1}^{N}\ln f(y_i\mid x_i, \alpha, F(\cdot)),\qquad(2.31)$$
where 𝓕 is the space of admissible c.d.f.'s. Such estimators were proposed by, e.g., Cosslett (1983) for the binary response model and Heckman and Singer (1984) for a duration model with unobserved heterogeneity. Consistency of α̂ can be established by verification of the Kiefer and Wolfowitz (1956) conditions for consistency of NPML estimation; however, an asymptotic distribution theory for such estimators has not yet been developed, so the form of the influence function for α̂ (if it exists) has not yet been rigorously established. When the likelihood function of the dependent variable y depends, at least for some observations, on the density function f_ε(e) = dF_ε(e)/de of the error terms, the joint maximization problem given in (2.31) can be ill-posed: spurious maxima (at infinity) can be obtained by sending the (unbounded) density estimator f̂_ε to infinity at particular points (depending on α and the data). In such cases, nonparametric density estimation techniques are sometimes used to obtain a preliminary estimator
A related estimation strategy exploits the fact that differences of independent and identically distributed random variables are symmetrically distributed about zero. For a particular structural model y = g(x, α, ε), the first step in the construction of a pairwise difference estimator is to find some transformation e(z_i, z_j, α) ≡ e_ij(α) of pairs of observations (z_i, z_j) ≡ ((y_i, x_i), (y_j, x_j)) and the parameter vector so that, conditional on the regressors x_i and x_j, the transformations e_ij(α₀) and e_ji(α₀) are identically distributed, i.e.

$$\mathcal{L}(e_{ij}(\alpha_0)\mid x_i, x_j) = \mathcal{L}(e_{ji}(\alpha_0)\mid x_i, x_j)\quad\text{a.s.},\qquad(2.35)$$
where 𝓛(·|·) denotes the conditional sampling distribution of the random variable. In order for the parameter α₀ to be identified using this transformation, it must also be true that 𝓛(e_ij(α₁)|x_i, x_j) ≠ 𝓛(e_ji(α₁)|x_i, x_j) with positive probability if α₁ ≠ α₀, which implies that observations i and j cannot enter symmetrically in the function e(z_i, z_j, α). Since ε_i and ε_j are assumed to be mutually independent given x_i and x_j, e_ij(α) and e_ji(α) will be conditionally independent given x_i and x_j; thus, if (2.35) is satisfied, then the difference e_ij(α) − e_ji(α) will be symmetrically distributed about zero, conditionally on x_i and x_j, when evaluated at α = α₀. Given an odd function ζ(·) (which, in general, might depend on x_i and x_j), the conditional symmetry of e_ij(α) − e_ji(α) implies the conditional moment restriction
$$E[\zeta(e_{ij}(\alpha_0) - e_{ji}(\alpha_0))\mid x_i, x_j] = 0\quad\text{a.s.},\qquad(2.36)$$
provided this expectation exists, and α₀ will be identified using this restriction if it fails to hold when α ≠ α₀. When ζ(·) is taken to be the identity mapping ζ(d) = d, the restriction that e_ij(α₀) and e_ji(α₀) have identical conditional distributions can be weakened to the restriction that they have identical conditional means,
Xjl
=
ECeji(ao)lXi,
Xjl a.s.,
(2.37)
which may not require independence of the errors ε_i and regressors x_i, depending on the form of the transformation e(·). Given an appropriate (integrable) vector l(x_i, x_j, α) of functions of the regressors and parameter vector, this yields the unconditional moment restrictions

$$E[\zeta(e_{ij}(\alpha_0) - e_{ji}(\alpha_0))\, l(x_i, x_j, \alpha_0)] = 0,\qquad(2.38)$$

which can be used as a basis for estimation. If l(·) is chosen to have the same dimension as α, a method-of-moments estimator α̂ of α₀ can be defined as the solution to the sample analogue of this population moment condition, namely,
$$\binom{N}{2}^{-1}\sum_{i<j} \zeta(e_{ij}(\hat{\alpha}) - e_{ji}(\hat{\alpha}))\, l(x_i, x_j, \hat{\alpha}) = 0\qquad(2.39)$$
(which may only approximately hold if ζ(e_ij(α) − e_ji(α)) is discontinuous in α). For many models (e.g. those depending on a latent variable y* ≡ g(x_i, α) + ε_i), it is possible to construct some minimization problem which has this sample moment condition as a first-order condition, i.e. for some function s(z_i, z_j, α) with
$$\frac{\partial s(z_i, z_j, \alpha)}{\partial\alpha} = \zeta(e_{ij}(\alpha) - e_{ji}(\alpha))\, l(x_i, x_j, \alpha),$$
the estimator α̂ might alternatively be defined as

$$\hat{\alpha} = \underset{\alpha\in\Theta}{\operatorname{argmin}}\ \binom{N}{2}^{-1}\sum_{i<j} s(z_i, z_j, \alpha).\qquad(2.40)$$
A simple example of a model which is amenable to the pairwise differencing approach is the linear model, y_i = x_i'β₀ + ε_i, where ε_i and x_i are assumed to be independent. For this case, one transformation function which satisfies the requirements above is

$$e(y_i, x_i, x_j, \alpha) = y_i - x_i'\beta,$$

which does not depend on x_j. Choosing l(x_i, x_j, α) = x_i − x_j, a pairwise difference estimator of β₀ can be defined to solve
$$\binom{N}{2}^{-1}\sum_{i<j} \zeta\big((y_i - y_j) - (x_i - x_j)'\hat{\beta}\big)\,(x_i - x_j) \cong 0,$$
or, if Z(·) is the antiderivative of ζ(·), to minimize

$$\hat{S}_N(\beta) = \binom{N}{2}^{-1}\sum_{i<j} Z\big((y_i - y_j) - (x_i - x_j)'\beta\big).$$
When ζ(d) = d, the estimator β̂ is algebraically equal to the slope coefficient estimator of a classical least squares regression of y_i on x_i and a constant (unless some normalization on the location of the distribution of ε_i is imposed, a constant term is not identified by the independence restriction). When ζ(d) = sgn(d), β̂ is a rank regression estimator which sets the sample covariance of the regressors x_i with the ranks of the residuals y_i − x_i'β̂ (approximately) equal to zero (Jurečková (1971), Jaeckel (1972)). The same general approach has been used to construct estimators for discrete response models and censored and truncated regression models. In all of these cases, the pairwise difference estimator α̂ is defined as a minimizer of a second-order U-statistic of the form

$$\hat{S}_N(\alpha) = \binom{N}{2}^{-1}\sum_{i<j} p(z_i, z_j, \alpha)$$
(with z_i ≡ (y_i, x_i)), and will solve an approximate first-order condition

$$\binom{N}{2}^{-1}\sum_{i<j} q(z_i, z_j, \hat{\alpha}) = o_p(N^{-1/2}),$$
where q(·) = ∂p(·)/∂α when this derivative is well-defined. As described in Section 1.4 above, the asymptotic normal distribution of the estimator α̂ can be derived from the asymptotically linear representation
$$\hat{\alpha} = \alpha_0 - \frac{2}{N}\sum_{i=1}^{N} H_0^{-1}\, r(z_i, \alpha_0) + o_p(N^{-1/2}),\qquad(2.41)$$
where r(z_j, α) ≡ E[q(z_i, z_j, α)|z_j] and H₀ is the probability limit of the derivative matrix ∂E[q(z_i, z_j, α)]/∂α'|_{α=α₀}.
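The linear-model example above is easy to reproduce numerically; the sketch below computes the pairwise difference estimator for both choices of odd function, ζ(d) = d (least squares on differences) and ζ(d) = sgn(d) (a rank-type estimator), using a crude grid search in place of the U-statistic minimization (2.40).

```python
import numpy as np

# Pairwise difference estimators for the linear model under independence.
rng = np.random.default_rng(7)
N = 300
x = rng.normal(size=N)
beta0 = 1.0
y = beta0 * x + rng.laplace(size=N)

i, j = np.triu_indices(N, k=1)          # all pairs i < j
dy, dx = y[i] - y[j], x[i] - x[j]

b_ls = (dx @ dy) / (dx @ dx)            # zeta(d) = d: LS slope on differences

# zeta(d) = sgn(d): minimize sum |dy - dx*b| over a grid (illustrative only)
grid = np.linspace(0.0, 2.0, 401)
b_rank = grid[np.argmin([np.abs(dy - dx * b).sum() for b in grid])]
print(b_ls, b_rank)                     # both near beta0 = 1
```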
The pairwise comparison approach is also useful for construction of estimators for certain nonlinear panel data models. In this setting functions of pairs of observations are constructed, not across individuals, but over time for each individual. In the simplest case, where only two observations across time are available for each individual, a moment condition analogous to (2.36) is

$$E[\zeta(e_{12,i}(\alpha_0) - e_{21,i}(\alpha_0))\mid x_{i1}, x_{i2}] = 0\quad\text{a.s.},\qquad(2.42)$$
where now e_{12,i}(α) ≡ e(z_{i1}, z_{i2}, α) for the same types of transformation functions e(·) described above, and where the second subscripts on the random variables denote the respective time periods. To obtain the restriction (2.42), it is not necessary for the error terms ε_i = (ε_{i1}, ε_{i2}) to be independent of the regressors x_i = (x_{i1}, x_{i2}) across individuals i; it suffices that the components ε_{i1} and ε_{i2} are mutually independent and identically distributed across time, given the regressors x_i. The pairwise differencing approach, when it is applicable to panel data, has the added advantage that it automatically adjusts for the presence of individual-specific fixed effects, since ε_{i1} + γ_i and ε_{i2} + γ_i will be identically distributed if ε_{i1} and ε_{i2} are. A familiar example is the estimation of the coefficients β₀ in the linear fixed-effects model
=
XIrbO
+
Yi
+
&it,
t=
where setting the transformation in the moment condition
1,2, e12Jcl) = yi, - xi1 /I and 5(u) = u in (2.42) results
which is the basis for the traditional least squares fixed effects estimator. As described in Section 3.5 below, this idea has been exploited to construct estimators for panel data versions of the binary response and censored and truncated regression models which are semiparametric with respect to both the error distribution and the distribution of the fixed effects.
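The removal of the fixed effect by time-differencing can be verified in a few lines; in the sketch below the effect γ_i is deliberately constructed to be correlated with the regressors, so that pooled least squares would be inconsistent while the differenced estimator is not.

```python
import numpy as np

# Two-period linear fixed-effects example: time-differencing removes the
# individual effect gamma_i, and least squares on differenced data
# implements the moment condition above.
rng = np.random.default_rng(8)
N = 1000
x1, x2 = rng.normal(size=N), rng.normal(size=N)
gamma = 2.0 * (x1 + x2) + rng.normal(size=N)   # fixed effect correlated with x
beta0 = 1.5
y1 = beta0 * x1 + gamma + rng.normal(size=N)
y2 = beta0 * x2 + gamma + rng.normal(size=N)

dx, dy = x1 - x2, y1 - y2
print((dx @ dy) / (dx @ dx))                   # near beta0 = 1.5
```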
2.5. Exclusion and index restrictions
Construction of estimators based on index restrictions can proceed in a variety of ways, depending upon whether the index function v(x) is completely known or depends upon (finite- or infinite-dimensional) unknown parameters, and whether the index sufficiency condition is of the "weak" (affecting only the conditional mean or median) or "strong" (applying to the entire error distribution) form. Estimators of the parameters of interest under mean index restrictions exploit modified forms of the moment conditions implied by the stronger constant conditional mean restrictions, just as estimators under distributional index restrictions use modifications of estimation strategies for independence restrictions. Perhaps the simplest version of the restrictions to analyze are mean exclusion restrictions, for which the index function is a subvector of the regressors (i.e. v(x) ≡ x₁, where x = (x₁', x₂')'), so that the restriction is

$$E[\varepsilon\mid x] = E[\varepsilon\mid x_1]\quad\text{a.s.}\qquad(2.43)$$
As for conditional mean restrictions, this condition can be used to identify the parameters of interest α₀ for structural functions y = g(x, α₀, ε) which are invertible in the error terms (ε = e(y, x, α₀)), so that the exclusion restriction (2.43) can be rewritten as
$$E[e(y, x, \alpha_0)\mid x] - E[e(y, x, \alpha_0)\mid x_1] = 0.\qquad(2.44)$$

By iterated expectations, this implies an unconditional moment restriction which is analogous to condition (2.4) of Section 2.1, namely,

$$0 = E[\tilde{d}(x)\, e(y, x, \alpha_0)],\qquad(2.45)$$

where now

$$\tilde{d}(x) \equiv d(x) - E[d(x)\mid x_1]\,(E[A(x)\mid x_1])^{-1} A(x),\qquad(2.46)$$

for any conformable matrix d(x) and square matrix A(x) of functions of the regressors for which the relevant expectations and inverses exist. (Note that, by construction, E[d̃(x)|x₁] = 0 almost surely.) Alternatively, estimation might be based on the
condition

$$0 = E[\tilde{d}(x)\,\tilde{e}(y, x, \alpha_0)],\qquad(2.47)$$

where, analogously to (2.46),

$$\tilde{e}(y, x, \alpha_0) \equiv e(y, x, \alpha_0) - E[e(y, x, \alpha_0)\mid x_1].$$
Given a particular nonparametric method for estimation of conditional means given x₁ (denoted Ê[·|x₁]), a semiparametric M-estimator α̂ of the structural coefficients α₀ can be defined as the solution to a sample analogue of (2.45),

$$0 = \frac{1}{N}\sum_{i=1}^{N}\left\{d(x_i, \alpha, \hat{\delta}) - \hat{E}[d(x_i, \alpha, \hat{\delta})\mid x_{i1}]\,\big(\hat{E}[A(x_i)\mid x_{i1}]\big)^{-1} A(x_i)\right\} e(y_i, x_i, \alpha),\qquad(2.48)$$
where the instrumental variable matrix d(x) is permitted to depend upon α and a preliminary nuisance parameter estimator δ̂, as in Section 2.1. Formally, the asymptotic distribution of this estimator is given by the same expression (2.7) as for estimation with conditional mean restrictions, replacing d with d̃ throughout. However, rigorous verification of the consistency and asymptotic normality of α̂ is technically difficult, and the estimating equation (2.48) must often be modified to "trim" (i.e. delete) observations where the nonparametric regression estimator Ê[·|x₁] is imprecise. A bound on the attainable efficiency of estimators of α₀ under condition (2.44) was derived by Chamberlain (1992), who showed that an optimal instrumental variable matrix d̃*(x) of the form (2.46) is related to the corresponding optimal instrument matrix d*(x) for the constant conditional moment restrictions of Section 2.1 by the formula

$$\tilde{d}^*(x) = d^*(x) - E[d^*(x)\mid x_1]\,\big[E\{[\Sigma(x)]^{-1}\mid x_1\}\big]^{-1}\,[\Sigma(x)]^{-1},\qquad(2.49)$$
where d*(x) is defined in (2.8) above and Σ(x) is the conditional covariance matrix of the errors ε given the regressors x. This formula directly generalizes to the case in which the subvector x₁ is replaced by a more general (but known) index function v(x). For a linear model y = x₂'β₀ + ε, the mean exclusion restriction (2.43) yields the semilinear model considered by Robinson (1988):

$$y = x_2'\beta_0 + \theta(x_1) + \eta,$$

where θ(x₁) ≡ E[ε|x₁] and E[η|x] = E[ε − θ(x₁)|x] = 0. Defining e(y, x, α) ≡ y − x₂'β, d(x) ≡ x₂, and A ≡ I, the moment condition (2.47) becomes

$$0 = E\big[(x_2 - E[x_2\mid x_1])\,\big(y - x_2'\beta_0 - E[y - x_2'\beta_0\mid x_1]\big)\big],$$

which can be solved for β₀:

$$\beta_0 = \left\{E\big[(x_2 - E[x_2\mid x_1])(x_2 - E[x_2\mid x_1])'\big]\right\}^{-1} E\big[(x_2 - E[x_2\mid x_1])(y - E[y\mid x_1])\big].$$
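A sample analogue of this partialling-out formula, with kernel regression standing in for the conditional expectations E[·|x₁] (and with the trimming required by the theory omitted), can be sketched as follows; all tuning choices and the helper name nw are illustrative.

```python
import numpy as np

# Partialling-out estimator for the semilinear model y = x2'b0 + theta(x1) + eta.
rng = np.random.default_rng(9)
N = 2000
x1 = rng.uniform(-1, 1, size=N)
x2 = x1 + rng.normal(size=N)                 # x2 correlated with x1
beta0 = 1.0
y = beta0 * x2 + np.sin(np.pi * x1) + rng.normal(size=N)

def nw(x0, x, z, h=0.1):
    # Nadaraya-Watson estimate of E[z | x1 = x0] with a Gaussian kernel
    k = np.exp(-0.5 * ((x0[:, None] - x[None, :]) / h) ** 2)
    return (k @ z) / k.sum(axis=1)

x2_t = x2 - nw(x1, x1, x2)                   # x2 - E[x2 | x1]
y_t = y - nw(x1, x1, y)                      # y  - E[y  | x1]
print((x2_t @ y_t) / (x2_t @ x2_t))          # near beta0 = 1
```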
Robinson (1988) proposed an estimator of β₀ constructed from a sample analogue to (2.47), using kernel regression to nonparametrically estimate the conditional expectations and "trimming" observations where a nonparametric estimator of the density of x₁ (assumed continuously distributed) is close to zero, and gave conditions under which the resulting estimator was root-N-consistent and asymptotically normal. Linton (1992) constructs higher-order approximations to the distribution of this estimator. Strengthening the mean exclusion restriction to a distributional exclusion condition widens the class of moment restrictions which can be exploited when the structural function is invertible in the errors. Imposing

$$\Pr\{\varepsilon \le u\mid x\} = \Pr\{\varepsilon \le u\mid x_1\}\qquad(2.50)$$

for all possible values of u yields the general moment conditions

$$0 = E[\tilde{d}(x)\, l(e(y, x, \alpha_0))]\qquad(2.51)$$

for any square-integrable function l(ε) of the errors, which includes (2.45) as a special case. As with independence restrictions, precision of estimators of α₀ can be improved by judicious choice of the transformation l(·). Even for noninvertible structural functions, the pairwise comparison approach considered for independence restrictions can be modified to apply to distributional exclusion (or known index) restrictions. For any pair of observations z_i and z_j which have the same value of the index function, v(x_i) = v(x_j), the corresponding error terms ε_i and ε_j will be independently and identically distributed, given the regressors x_i and x_j, under the distributional index restriction

$$\Pr\{\varepsilon \le u\mid x\} = \Pr\{\varepsilon \le u\mid v(x)\}.\qquad(2.52)$$
Given the pairwise transformation function e(z_i, z_j, α) ≡ e_ij(α) described in the previous section, an analogue to restriction (2.35) holds under this additional restriction of equality of index functions:

$$\mathcal{L}(e_{ij}(\alpha_0)\mid x_i, x_j) = \mathcal{L}(e_{ji}(\alpha_0)\mid x_i, x_j)\quad\text{a.s. if}\ v(x_i) = v(x_j).\qquad(2.53)$$

As for independence restrictions, (2.53) implies the weaker conditional mean restriction

$$E[e_{ij}(\alpha_0)\mid x_i, x_j] = E[e_{ji}(\alpha_0)\mid x_i, x_j]\quad\text{a.s. if}\ v(x_i) = v(x_j),\qquad(2.54)$$
which is relevant for invertible structural functions (with e_ij(α) equated with the inverse function e(y_i, x_i, α) in this case). These restrictions suggest estimation of α₀ by modifying the estimating equation (2.39) or the minimization problem (2.40) of the preceding subsection to exclude pairs of observations for which v(x_i) ≠ v(x_j). However, in general v(x_i) − v(x_j) may be continuously distributed around zero, so direct imposition of this restriction would exclude all pairs of observations. Still, if the sampling distributions 𝓛(e_ij(α₀)|x_i, x_j, v(x_i) − v(x_j) = c) or conditional expectations E[e_ij(α₀)|x_i, x_j, v(x_i) − v(x_j) = c] are smooth functions of c at c = 0, the restrictions (2.53) or (2.54) will approximately hold if v(x_i) − v(x_j) is close to zero. Then appropriate modifications of the estimating equations (2.39) and minimization problem (2.40) are
$$0 \cong \binom{N}{2}^{-1}\sum_{i<j} \zeta(e_{ij}(\hat{\alpha}) - e_{ji}(\hat{\alpha}))\, l(x_i, x_j, \hat{\alpha})\, w_N(v(x_i) - v(x_j))\qquad(2.55)$$
and

$$\hat{\alpha} = \underset{\alpha\in\Theta}{\operatorname{argmin}}\ \binom{N}{2}^{-1}\sum_{i<j} s(z_i, z_j, \alpha)\, w_N(v(x_i) - v(x_j)),\qquad(2.56)$$
for some weighting function wN(.) which tends to zero as the magnitude of its argument increases and, at a faster rate, as the sample size N increases (so that, ultimately, only observations with u(xi) - u(xj) very close to zero are included in the summations). Returning to the semilinear regression model y = xi& + 0(x,) + 4, E[qlx] = 0, the same transformation as used in the previous subsection can be used to construct a pairwise difference, provided the nonparametric components B(xil) and /3(x,J are equal for the two observations; that is, if e( yi, xi, Xi, CI)= eij(a) = yi - xQ and u(xi) = xii, then
if u(xi) = D(Xj). Provided B(x,r) is a smooth (continuous and differentiable) function, relation (2.36) will hold approximately if xi1 E xjI. Defining the weight function w,,&.) to be a traditional kernel weight, WN(d)= k(h, l d),
k(O)>O,k(ll)+Oas
IIAII+oO,hN+OasN+cO,
(2.57)
and+ taking l(x,, xj, CC) = xiZ - xj2 and t(d) = d, a pairwise difference estimator of PO using either (2.55) or (2.56) reduces to a weighted least squares regression of the distinct differences (yi - yj) in dependent variables on the differences (Xi2 - xj2) in regressors, using k(h,‘(xi, - xjI)) as weights (as proposed by Powell (1987)).
Consistency of the resulting estimator a requires only the weak exclusion restriction (2.43); when the strong exclusion restriction (2.53) is imposed, other choices of odd function t(d) besides the identity function are permissible in (2.55). Thus, an estimator of Do using t(d) = sgn(d) might solve N
0 2
’ iTj sgn((yi - yj) - (xi1 - xjl)‘g)(xil
- Xjl)k((Xi2 - Xj2)lhN) E 0.
This is the first-order condition of a “smoothed” problem defining the rank regression estimator,
b = argmin : 0 a
version
(2.5’)
of the minimization
- ’ iTj I(Yi - Yj) ~ (xi - xj)‘B Ik((xi, - Xjz)/hN),
(2.59)
which is a “robust” alternative to estimators proposed by Robinson (1988b) and Powell (1987) for the semilinear model. Although the asymptotic theory for such estimators has yet to be developed, it is likely that reasonable conditions can be found to ensure their root-N-consistency and asymptotic normality. So far, the discussion has been limited to models with known index functions u(x). When the index function depends upon unknown parameters 6, which are functionally unrelated to the parameters of interest rxe,and when preliminary consistent estimators s^ of 6, are available, the estimators described above are easily adapted to use an estimated index function O(x) = u(x, 8). The asymptotic distribution theory for the resulting estimator must properly account for the variability of the preliminary estimator $. When 6, is related to a,, and that relation is exploited in the construction of an estimator of CI~, the foregoing estimation theory requires more substantial modification, both conceptually and technically. A leading special case occurs when the index governing the conditional error distribution appears in the same form in the structural function for the dependent variable y. For example, suppose the structural function has a linear latent variable form, Y=
57(x, %I, 4 = WPo + 4,
(2.60)
and index u(x) is the latent linear regression Pr(s d ulx} = Pr{s
P,) > 0 only if x$,, > x&. Various estimators based upon these conditions have been proposed for the monotone regression model, as discussed in Section 3.2 below. More complicated examples involve multiple indices, with some indices depending upon parameters of interest and others depending upon unrelated nuisance parameters, as for some of the proposed estimators for selectivity bias models. The methods of estimation of the structural parameters ~1~vary across the particular models but generally involve nonparametric estimation of regression or density functions involving the index u(x).
3. 3.1.
Structural models Discrete response models
The parameters
of the binary
y = 1(x’& + E > 0)
response
model (3.1)
J.L. Powell
2488
are traditionally
estimated
by maximization
of the average log-likelihood
function
(3.2)
where the error term E is assumed to be distributed independently of x with known distribution function F(.) (typically standard normal or logistic). Estimators for semiparametric versions of the binary response model usually involve maximization of a modified form of this log-likelihood, one which does not presuppose knowledge of the distribution of the errors. For the more general multinomial response model, in which J indicator variables { yj, j = 1,. . . , J} are generated as yj=l{x’fl~+~j>x’&++Ek the average log-likelihood
~N(P,..., BJ;
F, = i
forall
k#j},
has the analogous
itl
j$l
YijlnCFj(x$‘,
(3.3)
form
. . . , XipJ)],
(3.4)
where Fj(.) is the conditional probability that yj = 1 given the regressors x. This form easily specializes to the ordered response or grouped dependent variable models, replacing Fj(.) with F(x& - cj) - F(x$,, - cj_ r), where the {cj} are the (known or unknown) group boundaries. The earliest example of a semiparametric approach for estimation of a limited dependent variable model in econometrics is the maximum score estimation method proposed by Manski (1975). For the binary response mode, Manski suggested that PO be estimated by maximizing the number_of correct predictions of y by the sign of the latent regression function x’p; that is, /I was defined to maximize the predictive score function
(3.5)
i=l
over a suitable parameter space 0 (e.g. the unit sphere). The error terms E were restricted to have conditional median zero to ensure consistency of the estimator. A later interpretation of the estimator (Manski (1985)) characterized the maximum score estimator p^as a least absolute deviations estimator, since the estimator solved the minimization problem
b = arg:in
A i$r I Yi - 1 tX:B >
Ol1.
(3.6)
Ch. 41: Estimation
of Semiparametric
Models
2489
This led to the extension of the maximum score idea to more general quantile estimation of /?,,, under the assumption that the corresponding conditional quantile of the error terms was constant (Manski (1985)). The maximum score approach was also applied to the multinomial response model by Manski (1975); in this case, the score criterion becomes
and its consistency was established under the stronger condition of mutual independence of the alternative specific errors (ej}. M. Lee (1992) used conditional median restrictions to define a least absolute deviations estimator of the parameters of the ordered response model along the same lines. Although consistency of the maximum score estimator for binary response was rigorously established by Manski (1985) and Amemiya (1985), its asymptotic distribution cannot be established by the methods described in Section 2.2 above, because of lack of continuity of the median regression function 1{x’j?, > 0} of the dependent variable y. More importantly, because this median regression function is flat except at its discontinuity points, the estimator is not root-N-consistent under standard regularity conditions on the errors and regressors. Kim and Pollard (1990) found that the rate of convergence of the maximum score estimator to j?,, under such conditions is N1/3, with a nonstandard asymptotic distribution (involving the distribution of the maximum value of a particular Gaussian process with quadratic drift). This result was confirmed for finite samples by the simulation study of Manski and Thompson (1986). Chamberlain (1986) showed that this slow rate of convergence of the maximum score estimator was not particular to the estimation method, but a general consequence of estimation of the binary response model with a conditional median restriction. Chamberlain showed that the semiparametric version of the information matrix for this model is identically zero, so that no regular root-N-consistent estimator of /I?,,exists in this case. An extension by Zheng (1992) derived the same result - a zero semiparametric information matrix - even if the conditional median restriction is strengthened to an assumption of conditional symmetry of the error distribution. Still, consistency of the maximum score estimator fi illustrates the fact that the parameters flc,of the binary response model are identified under conditional quantile or symmetry assumptions on the error terms, which is not the case if the errors are restricted only to have constant conditional mean. If additional smoothness restrictions on the distribution of the errors and regressors are imposed, the maximum score (quantile) approach can be modified to obtain estimators which converge to the true parameters at a faster rate than N113. Nawata (1992) proposed an estimator which, in essence, estimates f10by maximizing the fit of an estimator of the conditional median function 1(x’& > 0) of the binary variable to a nonparametric estimator of the conditional median of y given x. In a
J.L.Powell
2490
first stage, the observations are grouped by a partition of the space of regressors, and the median value of the dependent variable y is calculated for each of these regressor bins. These group medians, along with the average value of the regression vector in each group, are treated as raw data in a second-stage fit of the binary response model using the likelihood function (3.2) with a standard normal cumulative and a correction for heteroskedasticity induced by the grouping scheme. Nawata (1992) gives conditions under which the rate of convergence of the resulting estimator is N2’5, and indicates how the estimator and regularity conditions can be modified to achieve a rate of convergence arbitrarily close to N”‘. Horowitz (1992) used a different approach, but similar strengthening of the regularity conditions, to obtain a median estimator for binary response with a faster convergence rate. Horowitz modifies the score function of (3.5) by replacing the conditional median function l{x’/I > 0} by a “smoothed” version, so that an estimator of /I,, is defined as a minimizer of the criterion
s,*(P)=
iilYi K(x:B/hN)+ t1 - Yi) Cl -
K(x~B/hN)l~
(3.8)
where K(.) is a smooth function in [0, l] with K(u)+0 or 1 as U+ - co or co, and h, is a sequence of bandwidths which tends to zero as the sample size increases (so that K(x’&/h,) approaches the binary median 1(x’& > 0) as N + co). With particular conditions on the function K(.) and the smoothness of the regressor distribution and with the conditional density of the errors at the median being zero, Horowitz (1992) shows how the rate of convergence of the minimizer of S;G(fi) over 0 can be made at least N2” and arbitrarily close to N”2; moreover, asymptotic normality of the resulting estimator is shown (and consistent estimators of asymptotic bias and covariance terms are provided), so that normal sampling theory can be used to construct confidence regions and hypothesis tests in large samples. When the error terms in the binary response model are assumed to satisfy the stronger assumption of independence of the errors and regressors, Cosslett (1987) showed that the semiparametric information matrix for estimation of fiO in (3.1) (once a suitable normalization is imposed) is generally nonsingular, a necessary condition for existence of a regular root-N-consistent estimator. Its form is analogous to the parametric information matrix when the distribution function F(.) of the errors is known, except that the regressors x are replaced by deviations from their conditional means given the latent regression function x’&; that is, the best attainable asymptotic covariance matrix for a regular estimator of &, when E is independent of x with unknown distribution function F(.) is Cf (x’Bo)12
wm where f(u) = dF(u)/du
- wm1
[5Z- E(:lx’&)]
and Z?is the subvector
II
(3.9)
x which eliminates
the
[Z - E(i(x’&,)]’
of regressors
-l,
Ch. 41: Estimution
~JSemipurametric
2491
Models
last component (whose coefficient is assumed normalized to unity to pin down the scale of /IO). Existence of the inverse in (3.9) implies that a constant term is excluded from the regression vector, and the corresponding intercept term is absorbed into the definition of the error cumulative F(.). For the binary response model under an index restriction, Cosslett (1983) proposed a nonparametric maximum likelihood estimator (NPMLE) of j3e through maximization of the average log-likelihood function _Y,,@‘;F) simultaneously over BE 0 and FEN, where g is the space of possible cumulative distributions (monotonic functions on [0,11). Computationally, given a particular trial value b of fi, an estimator of F is obtained by monotonic regression of the indicator y on x’b, using the pool adjacent violators algorithm of isotonic regression; this estimator F^ of F is then substituted into the likelihood function, and the concentrated criterion SY,(b; F) is maximized over bE O= {/I: )//31)= 1 }. Cosslett (1983) establishes consistency of the resulting estimators of j?, and F(.) through verification of the Kiefer-Wolfowitz (1956) conditions for the consistency of NPMLE, constructing a topology which ensures compactness of the parameter space B of possible nuisance functions F(.). As noted in Section 2.4 above, an asymptotic distribution for NMPLE has not yet been established. Instead of the monotonic regression estimator F(.) of F(.) implicit in the construction of the NPMLE, the same estimation approach can be based upon other nonparametric estimators of the error cumulative. The resulting projle likelihood estimator of /IO, maximizing ZP,(b; F) of (3.2) using a kernel regression estimator F, was considered by Severini and Wong (1987a) (for a single parameter) and Klein and Spady (1993). Because kernel regression does not impose monotonicity of the function estimator, this profile likelihood estimator is valid under a weaker index restriction on the error distribution Pr{.s < u/x} = Pr{& < u[x’&,}, which implies that E[ ~1x1 = F(x’/?,) for some (not necessarily monotone) function F(.). Theoretically, the form of the profile likelihood TN(b;@ is modified by Klein and Spady (1993) to “trim” observations with imprecise estimators of F(.) in order to show root-N-consistency and asymptotic normality of the resulting estimator p. Klein and Spady show that this estimator is asymptotically efficient under the assumption of independence of the errors and regressors, since its asymptotic covariance matrix equals the best attainable value V* of (3.9) under this restriction. Other estimators of the parameters of the binary response model have been proposed which do not exploit the particular structure of the binary response model, but instead are based upon general properties of transformation models. If indepen-
dence of the errors and regressors is assumed, the monotonicity function (3.1) in E can be used to define a pairwise comparison Imposition
of a weaker index restriction
ECYIXI= WA,)
of the structural
estimator of Do. Pr{s < u (x] = Pr{s < ~1x’p,} implies that (3.10)
for some unknown function G(.), so any estimator which is based on this restriction
J.L. Powell
2492
is applicable to the binary response model. A number of estimators proposed for this more general setup are discussed in the following section on transformation models. Estimation of the multinomial response model (3.3) under independence and index restrictions can be based on natural extensions of the methods for the binary response model. In addition to the maximum score estimator defined by minimizing (3.7), Thompson (1989a, b) considered identification and estimation of the parameters in (3.3) assuming independence of the errors and regressors; Thompson showed how consistent estimators of (/?A,. . . , /I”,) could be constructed using a least squares criterion even if only a single element yj of the vector of choice indicators (y,, . . . , yj) is observed. L. Lee (1991) extended profile likelihood estimation to the multinomial response model, and obtained a similar efficiency result to Klein and Spady’s (1993) result for binary response under index restrictions on the error terms. And, as for the binary response model, various pairwise comparison or index restriction estimators for multiple index models are applicable to the multinomial response model; these estimators are reviewed in the next section.
3.2.
Transformation models
In Section 1.3 above, two general classes of transformation models were distinguished. Parametric transformation models, in which the relation between the latent and observed dependent variables is invertible and of known parametric form, are traditionally estimated assuming the errors are independent of the regressors with density function f(.;r) of known parametric form. In this setting, the average conditional log-likelihood function for the dependent variable y = t(x’&
+
E;
&JO& = t - l (Y; &)
- x’Po= 4x x, PO,2,)
k.z (InCf(e(Yi, 1
B,4; r)l - ln CladYi,
(3.11)
is
ThdP,A ? f) =
xi,
Xi,
B,2yay I]),
I
(3.12) which is maximized over 8 = (B, ;1,r) to obtain estimators of the parameters /IO and 2, of interest. Given both the monotonicity of the transformation t(.) in the latent variable and the explicit representation function e(.) for the errors in terms of the observable variables and unknown parameters, these models are amenable to estimation under most of the semiparametric restrictions on the error distribution discussed in Section 2. For example, Amemiya and Powell (1981) considered nonlinear twostage least squares (method-of-moments) estimation of /IO and A,, for the Box-Cox
Ch. 41: Estimation
of Semiparametric
Models
2493
transformation under a conditional mean restriction on the errors E given the regressors x, and showed how this estimator could greatly outperform (in a meansquared-error sense) a misspecified Gaussian ML estimator over some ranges of the transformation parameter &. Carroll and Ruppert (1984) and Powell (1991) discuss least absolute deviations and quantile estimators of the Box-Cox regression model, imposing independence or constant quantile restrictions on the errors. Han (1987b) also assumes independence of the errors and regressors, and constructs a pairwise difference estimator of the transformation parameter 2, and the slope coefficients &, which involves maximization of a fourth-order U-statistic; this approach is a natural generalization of the maximum rank correlation estimation method described below. Newey (1989~) constructs efficient method-of-moments estimators for the BoxxCox regression model under conditional mean, symmetry, and independence restrictions on the error terms. Though not yet considered in the econometric literature, it would be straightforward to extend the general estimation strategies described in Section 2.5 above to estimate the parameters of interest in a semilinear variant of the BoxxCox regression model. When the form of the transformation function t(.) in (3.11) is not parametrically specified (i.e. the transformation itself is an infinite-dimensional nuisance parameter), estimation of &, becomes more problematic, since some of the semiparametric restrictions on the errors no longer suffice to identify /I,, (which is, at most, uniquely determined up to a scale normalization). For instance, since a special case is the binary response model, it is clear from the discussion of the previous section that a conditional mean restriction on E is insufficient to identify the parameters of interest. Conversely, any dependent variable generated from an unknown (nonconstant and monotonic) transformation can be further transformed to a binary response model, so that identification of the parameters of a binary response model generally implies identification of the parameters of an analogous transformation model. Under the assumption of independence of the errors and regressors, Han (1987a) proposed a pairwise comparison estimator, termed the maximum rank correlation estimator, for the model (3.11) with t(.) unknown but nondecreasing. Han actually considered a generalization of (3.1 l), the generalized regression model, with structural function Y =
tCs(x’Bo, 41,
(3.13)
with t[.] a monotone (but possibly roninvertible) function and s(.) smooth and invertible in both of its arguments; with continuity and unbounded support of the error distribution, this construction ensures that the support of y will not depend upon the unknown parameters &,. Though the discussion below focusses on the special case s(x’ /I, E) = x’fi + E, the same arguments apply to this, more general, setup. For model (3.11), with t(.) unknown and E and x assumed independent, Han proposed estimation of /I,, by maximization of
J.L. Powell
2494
=0“1
-IN-1
RN(b)
N ’
x;P)+ l(Yi
Yj)-
l(xlP>x;B)I.
(3.15) In terms of the pairwise
eij(B)
E
difference
estimators
l(Yi Z Yj)%nCl(Yi > Yj)-
identification of & using the maximum conditional symmetry of
=
of Section 2.4, defining
l(x$>xJB)l~ rank correlation
2 l(yi # Yj)Sgn[l((Xi-Xj)‘Bo
criterion
> &j-&i)-
is related to the
l((xi-xjYBO
‘“)l
about zero given xi and xj. The maximum rank correlation estimator defined in (3.15) does not solve a sample moment condition like (2.39) of Section 2.4 (though such estimators could easily be constructed), because the derivative of RN(B) is zero wherever it is well-defined; still, the estimator b is motivated by the same general pairwise comparison approach described in Section 2.4. Han (1987a) gave regularity conditions under which fl is consistent for & these included continuity of the error distribution and compact support for the regressors. Under similar conditions Sherman (1993) demonstrated the root-N-consistency and asymptotic normality of the maximum rank estimator; writing the estimator as the minimizer of a second-order U-process,
j? = argmax 0
N 0 2
-r ‘jj’ i=l
5
P(ziYzjt8)>
(3.16)
j=i+l
Sherman showed that the asymptotic distribution of B is the same as that for an M-estimator based on N/2 observations which maximizes the sample average of the conditional expectation r(zi, /I) = E[ p(zi, zj, /I) 1zi] over the parameter space 0,
y*. Greene (1981, 1983) derives similar results for classical least squares estimates in the special case of a censored dependent variable. Brillinger (1983) shows consistency of classical least squares estimates for the general transformation model when the regressors are jointly normally distributed, which implies that the conditional distribution of the regressors x given the index x’BO has the linear form
Cxl X’BOI= PO + vo(X’BO)
(3.20)
for some p. and vo. Ruud (1983) noted that condition (3.20) (with a full-rank condition on the distribution of the regressors) was sufficient for consistency (up to scale) of a misspecified maximum likelihood estimator of PO in a binary response model with independence of the errors and regressors; this result was extended by Ruud (1986) to include all misspecified maximum likelihood estimators for latent variable models when (3.1 l), (3.20) and independence of the errors and regressors are assumed. Li and Duan (1989) have recently noted this result, emphasizing the importance of convexity of the assumed likelihood function (which ensures uniqueness of the minimizer rcfio of the limiting objective function). As Ruud points out, all of these results use the fact that the least squares or misspecified ML estimators 6i and y^of the intercept term and slope coefficients satisfy a sample moment condition of the form
5
O= i=l r(.Yi,6i +
1 Xiy*) [3xi
(3.21)
for some “quasi-residual” function I(.). Letting F(x’Bo, ~1+ x’y) = E[r(y, c1+ x’y) 1x] and imposing condition (3.20), the value y* = rcpo will solve the corresponding population moment condition if K and the intercept CIare chosen to satisfy the two conditions
0 = W(x’Bo, a + dx’Do))l = -w(x’&, a + K(X’PO))(x’A41, since the population
analogue
of condition
(3.21) then becomes
under the restriction (3.20). (An analogous argument works for condition (3.19), replacing x’fio withy* where appropriate; in this case, the index restriction _Y(yI x) = _Y(yIx’p,) is not necessary, though this condition may not be as easily verified as (3.20).) Conditions (3.19) and (3.20) are strong restrictions which seem unlikely to hold for observational data, but the consistency results may be useful in experimental design settihgs (where the distribution of the regressors can be chosen to satisfy
2497
Ch. 41: Estimation of Semiparametric Models
(3.20)), and the results suggest that the inconsistency of traditional maximum likelihood estimators may be small when the index restriction holds and (3.19) or (3.20) is approximately satisfied. If the regressors are assumed to be jointly continuously distributed with known density function fX(x), modifications of least squares estimators can yield consistent estimators of /I0 (up to scale) even if neither (3.19) nor (3.20) holds. Ruud (1986) proposed estimation of & by weighted least squares, &
(d4xi)lfx(xi))(xi - a)(Yi - Jib
(3.22) where 4(x) is any density function for a random vector satisfying (for example, a multivariate normal density function) and
condition
(3.20)
(3.23)
with an analogous definition for 9. This reweighting ensures that the probability limit for the weighted least squares estimator in (3.22) is the same as the probability limit for an unweighted least squares estimator with regressors having marginal density 4(x); since this density is assumed to satisfy (3.20), the resulting estimator will be consistent for /I,, (up to scale) by the results cited above. A different approach to use of a known regressor density was taken by Stoker (1986), who used the mean index restriction E[y 1x] = E[y 1x’/I,,] = G(x’/?,) implied by the transformation model with a strong index restriction on the errors. If the nuisance function G(.) is assumed to be smooth, an average of the derivative of E[ylx] with respect to the regressors x will be proportional to PO:
EEaE~ylxll~xl= ECWx’P,)lWP,)l PO= K*Po.
(3.24)
Furthermore, if the regressor density f,(x) declines smoothly to zero on the boundary of its support (which is most plausible when the support is unbounded), an integrationby-parts argument yields
huh = - EC9 lnCfx(41/ax)~
(3.25)
which implies that PO can be consistently estimated (up to scale) by the sample average ofy, times the derivative of the log-density of the regressors, a ln[fX(xi)]/ax. Also, using the facts that
-W lnCfxb)l/ax) = 0,
E{(a Ufx(41/WX’) = - 1,
(3.26)
J.L. Powell
2498
Stoker proposed an alternative estimator of K*& as the slope coefficients of an instrumental variables fit ofyi on xi using the log-density derivatives a ln[fJxJ]/ax, and a constant as instruments. This estimator, as well as Ruud’s density-weighted least squares estimator, is easily generalized to include models which have regressor density f,(x; rO) of known parametric form, by substitution of a preliminary estimator + for the unknown distribution parameters and accounting for the variability of this preliminary estimator in the asymptotic covariance matrix formulae, using formula (1.53) in Section 1.4 above. When the regressors are continuously distributed with density function f,(x) of unknown form, nonparametric (kernel) estimators of this density function (and its derivatives) can be substituted into the formulae for the foregoing estimators. Although the nonparametrically-estimated components necessarily converge at a rate slower than N1’2, the corresponding density-weighted LS and average derivative estimators will be root-IV-consistent under appropriate conditions, because they involve averages of these nonparametric components across the data. Newey and Ruud (1991) give conditions which ensure that the density-weighted LS estimator (defined in (3.22) and (3.23)) is root-iV-consistent and asymptotically normal when f,.(x) is replaced by a kernel estimator_?Jx). These conditions include the requirement that the reweighting density 4(x) is nonzero only inside a compact set which has f,(x) bounded above zero, to guarantee that the reciprocal of the corresponding nonparametric estimator f,(x) is well-behaved. Hlrdle and Stoker (1989) and Stoker (1991) considered substitution of the derivative of a kernel estimator of the logdensity, a ln[~.Jx)]fix into a sample analogue of condition (3.26) (which deletes observations for which a ln[TX(xi)]/Z x is small), and gave conditions for root-l\rconsistency and asymptotic normality of the resulting estimator. A “density-weighted” variant on the average derivative estimator was proposed by Powell et al. (1989), using the fact that
where the last inequality follows from a similar integration-by-parts used to derive (3.25). The resulting estimator s^of 6, = K+&,
argument
as
(3.28) was shown to have Ith component
of the form
(3.29)
with weights c+,(xi - xi) which tend to zero as 11 Xi - Xj 11increases,
and, for fixed
J.L.
2500
Powell
separately from the index x’f10 in that formula, is replaced by the deviation of the regressors from their conditional mean given the index, x - E[x) x’&J. Newey and Stoker (1993) derived the semiparametric efficiency bound for estimation of PO (up to a scale normalization on one coefficient) under condition (3.32), which has a similar form to the semiparametric efficiency bound for estimation under exclusion restrictions given by Chamberlain (1992) as described in Section 2.5 above.
3.3.
Censored
and truncated regression
models
A general notation for censored regression models which covers fixed and random censoring takes the dependent variable y and an observable indicator variable d to be generated as y = min {x’& + E,u},
d=
(3.34)
l{y x’fl,Jx} = Pr{x’& = Pr{x’P, = H(x’B,)(f
< u
and
E > 01x)
< ulx} Pr{s > 01x) - xx
(3.42)
where H(c) = Pr{u > c} is the survivor function of the random variable u. The unknown function H(.) can be consistently estimated using the Kaplan and Meier (1958) product-limit estimator for the distribution function for censored data. The resulting consistent estimator H(.) uses only the dependent variables (y,} and the
Ch. 41: Estimation
of‘Semiparametric
censoring indicators solution to estimating OZb,$
2503
Models
(di). Ying et al. (1991) define equations of the form
a quantile
estimator
[ as a
(3.43)
[[E?(x~~)]-‘l{yi>X~~-(l-n)]Xi, I 1
based on the conditional moment restriction (3.42) and give conditions for the root-N-consistency and asymptotic normality of this estimator. Since H(x’/?,) = &x/&J = 1 {x’p O< uO} when the censoring points ui are constant at some value uO with probability one, these equations are not well-defined for fixed censoring (say, at zero) except in the special case Pr{x’& < uO> = 1. A modification of the sample moment conditions defined in (3.43), 0 ~ ~ ,~ [l{ri I
>
XIP}
-
[~(XlB)](l
-
71)]Xi,
1
would allow a constant censoring value, and when n = i would reduce to the subgradient condition for the minimization problem (3.41) in this case. Unfortunately, this condition may have a continuum of inconsistent roots, if B can be chosen so that x$ > ui for all observations. It is not immediately clear whether an antiderivative of the right-hand side of (3.44) would yield a minimand which could be used to consistently estimate PO under random censoring, as it does (yielding (3.41) for z = i) for fixed censoring. Because the conditional median (and other quantiles) of the dependent variable y depend explicitly on the error distribution when the dependent variable is truncated, quantile restrictions are not helpful in identifying p,, for truncated samples. With a stronger restriction of conditional symmetry of the errors about a constant (zero), the “symmetric trimming” idea mentioned in Section 2.3 can be used to construct consistent estimators for both censored and truncated samples. Powell (1986b) proposed a symmetrically truncated least squares estimator of PO for a truncated sample. The estimator exploited the moment condition E[l{y>2x’~,-u}(y-x’~o)~x,~X’~o-u}F(X,E 0, constructed estimators for truncated and censored samples based on the moment conditions
(3.48) and
~C1~~-x’80~w}min(/~-x’ll,I,w}sgn{y-x’Bo}Ixl (3.49)
=E[l{u-x’~o>w}min{Isl,w}sgn{~}Ix]=O,
respectively. Newey (1989a) derives the semiparametric efficiency bounds for estimation of f10 under conditional symmetry with censored and truncated samples, noting that the symmetrically truncated least squares estimator attains that efficiency bound in the special case where the unknown error distribution is, in fact, Gaussian (the analogous result does not hold, though, for the symmetrically censored estimator). As described at the end of Section 2.2, conditional mode restrictions can be used to identify PO for truncated data, and an estimator proposed by M. Lee (1992) exploits this restriction. This estimator solves a sample analogue to the characterization of /I0 as the solution to the minimization problem B0 = argmin
Pr{ Iy - minju,
+ 0,
x;h})
>
w},
as long as the modal interval of length 201 for the untruncated error distribution is assumed to be centered at zero. M. Lee (1992) showed the N”3-consistency of this estimator and considered its robustness properties. Most of the literature on semiparametric estimation for censored and truncated regression in both statistics and econometrics has been based upon independence restrictions. Early estimators of /I0 for random censoring models which relaxed the assumed parametric form of the error distribution (but maintained independence
J.L.Powell
2506
Pairwise difference estimators for the censored and truncated regression models have also been constructed by Honor& and Powell (1991). For model (3.34) with fixed censoring, and using the notation of Section 2.4, these estimators were based upon the transformation eij(0) = e(z,, zj, fl) = min{y, - Xi/I,
Ui - Xi/?),
(3.54)
which satisfies eij(Bo) = min(min{e,,
ui - xi&}, Ui - xi/&} = min{ei, ui - X$e, Ui - x)be},
so that eij(Q,) and eji(Qo) are clearly independently and identically distributed given xi and xi. Again choosing /(xi, xj, 0) = xi - xj, the pairwise difference estimator for the censored regression model was given as a solution to the sample moment condition (2.39) of Section 2.4 above. These estimating equations were shown to have a unique solution, since they correspond to first-order conditions for a convex minimization problem. Honor& and Powell (1991) also considered estimation of the truncated regression model, in which yi and xi are observed only if yi is positive; that is, ify, = xi/&, + vi, where ui has the conditional distribution ofei given si > - x&, then 6p(Ui 1xi) = di”(q 1Xi, q > - xi&). Again assuming the untruncated errors si are i.i.d. and independent of the regressors xi, a pairwise difference estimator of &, was defined using the transformation e(zi, Zj, p)
E
When evaluated
(Yi - Xib) l(Yi - Xi/? > - Xip) l(Yj - X;fi > - X:/3).
(3.55)
at the true value /IO, the difference
eij(fio) - eji(fio) = (Vi- Uj) l(Ui > - X;p)l(Oj > - Xifl)
(3.56)
is symmetrically distributed around zero given xi and xj. As for the censored case, the estimator B for this model was defined using &xi, xj, 0) = (xi - xj) and (2.39) through (2.40) above. When the function c(d) = sgn(d), the solution to (2.39) for this model was proposed by Bhattacharya et al. (1983) as an estimator of /?,, for this model under the assumption that xi is a scalar. The general theory derived for minimizers of mth-order U-statistics (discussed in Section 1.3) was applied to show root-N-consistency and to obtain the large-sample distributions of the pairwise difference estimators for the censored and truncated regression models.
3.4.
Selection models
Rewriting
the censored
selection
model of (1.21) and (1.22) as
d = 1(x’@, + v > 0}, Y = dCx;Bo +
~1
(3.57)
Ch. 41:
Estimation
of Semiparametric
2507
Models
(for y, E d, y, E y, /IA = 6,, and & E /IO), a fully parametric model would specify the functional form of the joint density f(s, q; tO) of the error terms. Then the maximization of the average log-likelihood function
m -x;{s cc m [s s
f(_Yi- X;ip, q; 5)dq
+(l
-d,)ln
.I-(&,rl; r) dq ds
-m
-x;,a
1 11
(3.58)
over fi, 6, and r in the parameter space. An alternative estimation method, proposed by Heckman (1976), can be based upon the conditional mean of y given x and with d= 1: 00 E[ylx,d
= l] =x;&
m
+ m
rl; r,,) drl ds
[S -a, s -xi60 00
X
-x;ao 1 E f(.s,
[S -00 s
1 -1
0,
q; q-,)
dq ds
= x;/?~ + 1(x;&,; rO).
(3.59)
When the “selection correction function” A(x;~;z) is linear in the distributional parameters r (as is the case for bivariate Gaussian densities), a two-step estimator of &, can be constructed using linear least squares, after inserting a consistent first-step estimator $ of 6, (using the indicator d and regressors x1 in the binary log-likelihood of (3.2)) into the selection correction function. Alternatively, a nonlinear least squares estimator of the parameters can be constructed using (3.59), which is also applicable for truncated data (i.e. for y and x being observed conditional on d = 1). To date, semiparametric modelling of the selection model (3.57) has imposed independence or index restrictions on the error terms (&,I]). Chamberlain (1986a) derived the semiparametric efficiency bound for estimation of /IO and 6, in (3.57) when the errors are independent of the regressors with unknown error density. The form of the efficiency bound is a simple modification of the parametric efficiency bound for this problem when the error density is known, with the regression vectors x1 and x2 being replaced by their deviations from their conditional means, given the selection index, x1 - E[x, 1xi S,] and x2 - E[x, 1x;d,], except for terms which involve the index ~~6,. Chamberlain notes that, in general, nonsingularity of the semiparametric information matrix will require an exclusion restriction on x2 (i.e. some component of x1 with nonzero coefficient in 6,, is excluded from x,), as well as a normalization restriction on 6,. The efficiency bound, which was derived imposing independence of the errors and regressors, apparently holds more generally when the joint distribution of the errors in (3.57), given the regressors, depends only upon the index xi&, appearing in the selection equation.
J.L. Powell
2508
Under this index restriction, the conditional mean of y given d = 1 and x will have the same form as in (3.59), but with a selection correction function of unknown form. More generally, conditional on d = 1, the dependent variable y has the linear representation y = x’&, + E, where E satisfies the distributional index restriction dP(s(d=
l,x)=Z’(sId=
1,~~6,)
as.,
so that other estimation methods for distributional Section 2.5) are applicable here. So far, though, exploited only the weaker mean index restriction E(sld = 1,x) = E(cJd = 1,x’,&).
(3.60) index restrictions (discussed in the econometric literature has
(3.61)
A semiparametric analogue of Heckman’s two-step estimator was constructed by Cosslett (1991), assuming independence of the errors and regressors. In the first step of this approach, a consistent estimator of the selectivity parameter 6, is obtained using Cosslett’s (1983) NPMLE for the binary response model, described in Section 3.1 above. In this first step, the concomitant estimator F(.) of the marginal c.d.f. of the selection error ‘1is a step function, constant on a finite number J of intervals {~-(~j-,,~j),j= l,..., J> with cc, = - cc and cJ = co. The second-step estimator of PO approximates the selection correction function A(.) by a piecewise-constant function on those intervals. That is, writing
y = x;&) + i Aj 1 {QOEllj} j=l
+ t?,
(3.62)
the estimator B is constructed from a linear least squares regression of y on x2 and the J indicator variables {l(x;z~T~}}. Cosslett (1991) showed consistency of the resulting estimator, using the fact that the number of intervals, J, increases slowly to infinity as the sample size increases so that the piecewise linear function could approximate the true selection function A(.) to an arbitrary degree. An important identifying assumption was the requirement that some component of the regression vector xi for the selection equation was excluded from the regressors x2 in the equation for y, as discussed by Chamberlain (1986a). Although independence of the errors and regressors was imposed by Cosslett (1991), this was primarily used to ensure consistency of the NPML estimator of the selection coefficient vector 6,. The same approach to approximation of the selection correction function will work under an index restriction on the errors, provided the first-step estimator of 6, only requires this index restriction. In a parametric context, L. Lee (1982) proposed estimation of /I0 using a flexible parametrization of the selection correction function A(.) in (3.59). For the semiparametric model Newey (1988) proposed a similar two-step estimator, which in the second step used a series
2509
Ch. 41: Estimation
of Semiparametric
approximation
to the selection correction
Y ZE x;PO+
,$,
AjPj(x;61J)
Models
+
function
to obtain the approximate
model (3.63)
e9
which was estimated (substituting a preliminary estimator Jfor 6,) by least squares to obtain an estimator of Do. Here the functions {pj(.)} were a series of functions whose linear combination could be used to approximate (in a mean-squared-error sense) the function A(.) arbitrarily well as J --f co. Newey (1988) gave conditions (including a particula_ rate of growth of the number J of series components) under which the estimator p of &, was root-IV-consistent and asymptotically normal, and also discussed how efficient estimators of the parameters could be constructed. As discussed in Section 2.5, weighted versions of the pairwise-difference estimation approach can be used under the index restriction of (3.61). Assuming a preliminary, root-N-consistent estimator s^ of 6, is available, Powell (1987) considers a pairwise-difference estimator of the form (2.55) when t(d) = d, eij(B) = yi - x&/3 and I(x,, xj, 0) = xi2 - xjz, yielding the explicit estimator
PC C
[
WN((Xil
- Xjl)l&(Xi,
-
Xjz)(Xi,
-
Xj2)'
icj
I
x iTj wN((xil
-
xjl)18J(xi2
-
xj*)(Yi2
-
1-’ 1’
Yj2)
(3.64)
Conditions were given in Powell (1987) on thedata generating process, the weighting functions w,(.), and the preliminary estimator 6 which ensured the root-IV-consistency and asymptotic normality of p. The dependence of this asymptotic distribution on the large-sample behavior of s^ was explicitly derived, along with a consistent estimator of the asymptotic covariance matrix. The approach was also extended to permit endogeneity of some components of xi2 using an instrumental variables version of the estimator. L. Lee (1991) considers system identification of semiparametric selection models with endogenous regressors and proposes efficient estimators of the unknown parameters under an independence assumption on the errors. When the errors in (3.57) are assumed independent of the regressors, and the support of the selection error q is the entire real line, the assumption of a known parametric form ~~6, of the regression function in the selection equation can be relaxed. In this case, the dependent variable y given d = 1 has the linear representation yi = x$‘,, + ci, where the error term E satisfies the distributional index restriction Y(EId = 1,x) = 2(&/d = l,p(x,))
a.s.,
where now the single index p(xl) is the “propensity
(3.65) score” (Rosenbaum
and Rubin
J.L. Powell
2510
(1983)), defined
as (3.66)
Given a nonparametric estimator @(xi) of the conditional mean p(xi) of the selection indicator, it is straightforward to modify the estimation methods above to accommodate this new index restriction, by replacing the estimated linear index x;J by the nonparametric index @(xi) throughout. Choi (1990) proposed a series estimator of /3,, based on (3.63) with this substitution, while Ahn and Powell (1993) modified the weighted pairwise-difference estimator in (3.64) along these lines. Both papers used a nonparametric kernel estimator to construct @(xi), and both gave conditions on the model, the first-step nonparametric estimator and the degree of smoothing in the second step which guaranteed root-N-consistency and asymptotic normality of the resulting estimators of &. The influence functions for these estimators depend upon the conditional variability of the errors E and the deviations of the selection indicator from its conditional mean, d - p(xi). Newey and Powell (1993) calculate the semiparametric efficiency bounds for &, under the distributional index restriction (3.65) and its mean index analogue, while Newey and Powell (1991) discuss construction of semiparametric M-estimators which will attain these efficiency bounds. For the truncated selection model (sampling from (3.57) conditional on d = l), identification and estimation of the unknown parameters is much more difficult. Ichimura and Lee (1991) consider a semiparametric version of a nonlinear least squares estimator using the form of the truncated conditional mean function
ECylx,d = 11 =x;&
+ 2(x;&)
(3.67)
from (3.59) with A(.) unknown, following the definition of Ichimura’s (1992) estimator in (3.33) above. Besides giving conditions for identification of the parameters and root-N-consistency of their estimators, Ichimura and Lee (1991) consider a generalization of this model in which the nonparametric component depends upon several linear indices. If the linear index restriction (3.61) is replaced by the nonparametric index restriction (3.65), identification and consistent estimation of &, requires the functional independence of xi and x2, in which case the estimator proposed by Robinson (1988), discussed in Section 2.5 above, will be applicable. Chamberlain (1992) derives the efficiency bound for estimation of the parameters of the truncated regression model under the index restriction (3.65). Just as eliminating the information provided by the selection variable d makes identification and estimation of fi,, harder, a strengthening of the information in the selection variable makes estimation easier, and permits identification using other semiparametric restrictions on the errors. Honore et al. (1992) consider a model in which the binary selection variable d is replaced by a censored dependent variable
Ch. 41: Estimation
of Semiparametric Models
2511
y,, so that the model becomes
(3.68) Yz = l{Y, >O)cx;Po
+&I.
This model is called the “Type 3 Tobit” model by Amemiya (1985). Assuming conditional symmetry of the errors (E,‘I) about zero given x (as defined in Section 2.3), the authors note that 6, can be consistently estimated using the quantile or symmetric trimming estimators for censored regression models discussed in Section 3.3, and, furthermore, by symmetrically trimming the dependent variable y, using the trimming function ~(Yl,Y2,Xl,XZ,4B)-
(3.69)
p=Yl