The Econometrics Journal (2010), volume 13, pp. 1–39. doi: 10.1111/j.1368-423X.2009.00301.x
Heterogeneity in dynamic discrete choice models

Martin Browning† and Jesus M. Carro‡

†Department of Economics, University of Oxford, Manor Road, Oxford OX1 3UQ, UK
E-mail: [email protected]
‡Departamento de Economia, Universidad Carlos III de Madrid, Calle Madrid, 126, 28903 Getafe, Madrid, Spain
E-mail: [email protected]

First version received: November 2008; final version accepted: September 2009
Summary  We consider dynamic discrete choice models with heterogeneity in both the levels parameter and the state dependence parameter. We first present an empirical analysis that motivates the theoretical analysis which follows. The theoretical analysis considers a simple two-state, first-order Markov chain model without covariates in which both transition probabilities are heterogeneous. Using such a model we are able to derive exact small sample results for bias and mean squared error (MSE). We discuss the maximum likelihood approach and derive two novel estimators. The first is a bias corrected version of the Maximum Likelihood Estimator (MLE), while the second, which we term MIMSE, minimizes the integrated mean square error. The MIMSE estimator is always well defined, has a closed-form expression and inherits the desirable large sample properties of the MLE. Our main finding is that in almost all short panel contexts the MIMSE significantly outperforms the other two estimators in terms of MSE. A final section extends the MIMSE estimator to allow for exogenous covariates.

Keywords: Binary choice, Fixed effects, Heterogeneous slopes, Panel data, Unobserved heterogeneity.
© The Author(s). Journal compilation © Royal Economic Society 2010. Published by Blackwell Publishing Ltd, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA.

1. INTRODUCTION

Heterogeneity is an important factor to take into account when making inference based on microdata. A significant part of the literature on binary choice models in recent years has been about estimating dynamic models accounting for permanent unobserved heterogeneity in a robust way. Honoré and Kyriazidou (2000) and Carro (2007) are two examples; surveys of this literature can be found in Arellano and Honoré (2001) and Arellano (2003a). Unobserved heterogeneity in dynamic discrete choice models is usually allowed for only through an individual-specific constant term, the so-called individual effect. In this paper, we consider that there may be more unobserved heterogeneity than is usually allowed for. In particular, we investigate whether the state dependence parameter in dynamic binary choice models is also individual specific. In Browning and Carro (2006) we presented two principal objections to allowing for only this limited form of heterogeneity. The first is that it rules out, a priori, some interesting structural models. The
second objection is that whenever we have sufficiently long panels to allow for heterogeneity in slope parameters, we usually find it. In Section 2, we complement the latter analysis with an illustration using consumer milk-type choice from a consumer panel data set. The sample used contains more than 100 periods for each household, so we have a panel with large T. This allows us to overcome the incidental parameters problem and use the standard Maximum Likelihood Estimator (MLE) to test for the presence of permanent unobserved heterogeneity both in the intercept and in the coefficient on the lag of the endogenous variable, against a model where only the intercept is heterogeneous.1 A likelihood ratio test overwhelmingly rejects the restricted model. Furthermore, the estimates of the parameters of interest are very different when we allow for the more general form of heterogeneity. This illustration serves to further motivate the subsequent theoretical analysis. Micropanels with a large number of periods are rare. Therefore, we need to find a way to estimate the model with two sources of heterogeneity when the number of periods is small. Furthermore, we want to do this without imposing any restriction on the conditional distribution of the heterogeneous parameters. There are not many examples in the literature where more than one source of heterogeneity is allowed in dynamic models, even for linear models. For example, the surveys of dynamic linear models in Arellano and Honoré (2001), Wooldridge (2002, ch. 11) and (in the statistics literature) Diggle et al. (2002) do not consider the possibility of allowing for heterogeneity other than in the ‘intercept’. When we consider dynamic discrete choice models, even less is known than for the linear model.
Given this relative ignorance, we begin by concentrating attention on the simplest possible model and providing a thorough analysis of different estimators with respect to their tractability, bias, mean squared error (MSE) and the power of tests based on them. Thus we consider the model in which a lag of the endogenous variable is the only explanatory variable and both the slope and the intercept are individual specific with an unknown joint distribution. This simple two-state, first-order Markov chain model allows us to make a fully non-parametric analysis and to derive exact analytical expressions for the bias and MSE of the estimators we consider. We show how to use the analytical expression for the bias, for fixed T, to correct the MLE and obtain a Non-linear Bias Corrected (NBC) estimator. We find that both MLE and NBC perform poorly in MSE terms. This leads us to suggest a third estimator which minimizes the integrated MSE; we term this the MIMSE estimator. This is an attractive estimator since it performs much better than the other two for small values of T but converges to MLE as T becomes large. Moreover, it is computationally very simple. After a thorough examination of the simple case with no covariates, we provide an extension of the MIMSE estimator to the case in which we have exogenous covariates. The structure of the rest of the paper is outlined in the next paragraphs. We regard the positive suggestions below as a first step toward incorporating more heterogeneity in dynamic discrete outcome models than is usually allowed for. Much of the analysis presented is frankly exploratory and leads to inconclusive or even negative results. For example, the exact analytical results for the MLE and NBC with small T indicate that bias reduction techniques are unlikely to lead to useful estimators in these cases.
Section 2 presents the empirical milk analysis that illustrates the need for multiple sources of heterogeneity.

1 Others have suggested panel data tests for heterogeneous slopes when the time dimension is small; see Pesaran and Yamagata (2008) for a review of these tests and a novel test. Our emphasis in this paper is on allowing for slope heterogeneity rather than simply testing for it.
In Section 3, we study the basic model without covariates and with four observations per unit (including the initial observation). Although four observations per unit may seem excessively parsimonious, this analysis allows us to display almost all of the features of interest in a transparent way. We show that there is no unbiased estimator. Following this we derive the bias of the MLE. An important finding in this respect is that the bias of the MLE estimator of the marginal dynamic effect is always negative; this is the non-linear analogue of the Nickell bias result for linear dynamic models (see Arellano, 2003b). Based on this derivation we define a one-step bias corrected estimator, which we term non-linear bias corrected (NBC). We calculate the exact bias and MSE of the MLE and the NBC. We show that whilst NBC reduces the bias, it is sometimes worse than MLE in terms of MSE. The relatively poor performance of the NBC, together with the result on the non-existence of an unbiased estimator, sets limits on the bias correction route as a solution to the estimation problem for dynamic discrete outcome models. To address the fact that both the MLE and the NBC display high MSE, in Section 4 we present a new estimator that minimizes the integrated MSE (MIMSE). We derive the closed-form expression for the estimator. We also derive the Bayesian posterior assuming a uniform prior over the two transition probabilities and relate our estimators to it. Section 5 compares the exact finite sample properties of the three estimators (MLE, NBC and MIMSE) for T > 3. There are two main conclusions. First, for most of the possible values the NBC is best in terms of bias, both in levels for very small T and in the convergence of the bias to zero as T becomes large. Second, MIMSE almost always dominates MLE and NBC on the MSE criterion. We show the exact areas of dominance for MLE; these include most cases that we would ever be interested in.
In Section 6, we shift perspective and consider estimating the distribution of the parameters of interest in the population of households. In a small-T context this is a natural shift, given that there are severe limits on how much we can learn about individual parameters with small T. We consider estimators based on the three estimators already considered (MLE, NBC and MIMSE). Using both analytical and simulation analysis, we conclude that MIMSE dominates both other estimators and gives less biased estimates of both the location and the dispersion of the distribution. This is the case both as the number of cross-section units becomes large and when it is fixed at the value we have in our empirical application in Section 2. The broad conclusion is that if we are interested in population outcomes then MIMSE performs well relative to the other two estimators. In Section 7, we extend the MIMSE estimator to allow for exogenous covariates. We propose to use the equivalence between MIMSE and the mean of the posterior distribution with flat priors; this way it can easily be computed using MCMC techniques. Our analysis suggests that MIMSE is a credible and feasible candidate for estimating dynamic discrete choice models. Section 8 concludes; proofs are given in the Appendix.
2. RESULTS FOR A LARGE T PANEL 2.1. Incorporating heterogeneity In this section, we present results for a dynamic discrete choice analysis from a long panel. Specifically, we estimate the patterns of buying full-fat milk (rather than low-fat milk) on a
Danish consumer panel that gives weekly individual purchases by households for more than 100 weeks.2 Although the results have substantive interest, we present the analysis here mainly to motivate the subsequent econometric theory. A conventional treatment would take

y_it = 1{α y_i,t−1 + x_it'β + η_i + v_it ≥ 0}  (t = 0, . . . , T; i = 1, . . . , N),  (2.1)

where y_it takes the value 1 if household i purchases full-fat milk in week t, and zero otherwise. The parameter η_i reflects unobserved differences in tastes that are constant over time. The parameter α accounts for state dependence in individual choices due to habits. The x_it variables are other covariates that affect the demand for full-fat milk. In our empirical analysis these are the presence of a child aged less than 7, quarterly dummies and a time trend. Since the relative prices of different varieties of milk are very stable across our sample period, it is reasonable to assume that the time trend picks up both price effects and common taste changes. A more flexible specification of model (2.1) that we shall also consider is a model with interactions between the lagged dependent variable and the observables:

y_it = 1{α y_i,t−1 + x_it'β + (y_i,t−1 x_it)'γ + η_i + v_it ≥ 0}  (t = 0, . . . , T; i = 1, . . . , N).  (2.2)

This allows the state dependence to depend on observables, but the only latent factor is still the individual specific parameter. It is conventional to allow for a ‘fixed effect’ η_i as in (2.1). The primary focus of this paper is on whether this makes sufficient allowance for heterogeneity. In particular, we examine whether it is also necessary to allow the state dependence parameter to vary across households and, if it does, how we should estimate it when we have a short panel. Thus we take the following extended binary choice model:

y_it = 1{α_i y_i,t−1 + x_it'β + η_i + v_it ≥ 0}  (t = 0, . . . , T; i = 1, . . . , N).  (2.3)

In model (2.3), we allow both the intercept and the state dependence parameter to be heterogeneous, but the effects of the covariates are assumed to be common across households.3 The values of the parameters of (2.3) are not usually of primary interest; rather, they can be used to generate other ‘outcomes of interest’. There are several candidates. In this paper, we focus on the dependence of the current probability of y being unity on the lagged value of y; this is the marginal dynamic effect:

m_i(x) = Pr(y_it = 1 | y_i,t−1 = 1, x) − Pr(y_it = 1 | y_i,t−1 = 0, x).  (2.4)

Another important outcome of interest is the long-run proportion of time that y_it is unity, given a particular fixed x vector. Using standard results from Markov chain theory this is given by

Pr(y_it = 1 | y_i,t−1 = 0, x) / [Pr(y_it = 1 | y_i,t−1 = 0, x) + Pr(y_it = 0 | y_i,t−1 = 1, x)].  (2.5)
In this paper, we shall only concern ourselves with the marginal dynamic effect; this is simply to limit what is already a long paper. In this empirical section we assume that the unobserved 2 In Denmark during the first four years of our sample period there were three levels of fat content in milk: skimmed (0.01%), medium (1.5%) and high (3.5%). In the final year another low-fat (0.5%) milk was introduced. The 3.5% milk is what we call full-fat milk. 3 We could extend the following empirical analysis to allow for heterogeneous effects of these covariates (and would certainly do so if our main concern was to analyse milk expenditure patterns) but for our purposes here it suffices to consider only heterogeneity in (η, α).
random shock v_it is i.i.d. standard Normal; in the analysis in the following sections we consider the non-parametric case in which the distribution of v_it is not known. For the Normal case the marginal dynamic effect is given by

m_i(x) = Φ(α_i + x'β + η_i) − Φ(x'β + η_i),  (2.6)

where Φ(·) is the standard Normal cdf.

2.2. The Danish consumer panel

We have a Danish consumer panel that follows the same households for up to five years (with most households exiting the survey before the end of the five-year period) from January 1997 to December 2001. This panel provides data on all grocery purchases during the survey period and some characteristics of the household. Respondents provide detailed information on every item bought. For example, for milk they record the volume and price paid, the store where it was purchased, the fat content and other characteristics of that specific purchase. We aggregate purchases of milk to the weekly level (in Denmark households only consume fresh milk, so that taking weekly averages gives positive purchases of milk in every week) and set the full-fat indicator for that week/household to unity if the household buys any full-fat milk in that week; this does not exclude the possibility that they also buy low-fat milk in the same week. Our strategy in this empirical section is to estimate the parameters of (2.1), (2.2) and (2.3) without imposing any restriction on the joint distribution of α_i and η_i. We thus select a subsample of the data in which the household is observed for at least 100 weeks, so that we are in a large-T context. We assume that this selection is exogenous to the milk-buying decision. We also select on households whose number of changes of decision relative to the previous period is greater than 10% of the number of periods; without this, the parameters for a particular household may not be estimable or may be estimated very imprecisely.4 We take up this issue in more detail in the next section. This sample selection gives us 371 households who are observed for between 100 and 260 weeks. We then use a standard Probit to estimate each model; this is a consistent estimator under the assumptions made.
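Under the Normality assumption, the two outcomes of interest, (2.5) and (2.6), are straightforward to compute. The following stdlib-only Python sketch is our own illustration, not part of the original analysis; the parameter values used are purely illustrative:

```python
from math import erf, sqrt

def Phi(z):
    """Standard Normal cdf, built from math.erf."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def marginal_dynamic_effect(alpha_i, eta_i, xb=0.0):
    """m_i(x) = Phi(alpha_i + x'beta + eta_i) - Phi(x'beta + eta_i), eq. (2.6)."""
    return Phi(alpha_i + xb + eta_i) - Phi(xb + eta_i)

def long_run_share(alpha_i, eta_i, xb=0.0):
    """Long-run proportion of periods with y = 1, eq. (2.5): G / (G + 1 - H)."""
    G = Phi(xb + eta_i)            # Pr(y_t = 1 | y_{t-1} = 0, x)
    H = Phi(alpha_i + xb + eta_i)  # Pr(y_t = 1 | y_{t-1} = 1, x)
    return G / (G + 1.0 - H)

# Illustrative household: a positive alpha_i gives a positive dynamic effect.
m = marginal_dynamic_effect(0.81, -0.72)
s = long_run_share(0.81, -0.72)
```

Note that when α_i = 0 the two conditional probabilities coincide, so the marginal dynamic effect is zero and the long-run share reduces to G.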
If we did not include covariates with common effects (β_i = β), then this estimation strategy would be the same as treating each household as a separate time series and estimating α_i and η_i (and β_i) for each household separately. Given the length of our panel, we invoke standard large-T results.

2.3. Missing observations

Some weeks are missing for some households. This seems to be mainly because households are disinclined to keep complete records in that week or because they are on holiday.5 We shall take these missing weeks to be ‘missing at random’ in the sense that their occurrence is independent of the taste for full-fat milk. There are then two options for dealing with missing
4 It should be noted that excluding households that never change their decision implies no bias in the estimation of models (2.1) and (2.2), because the contribution of these observations to the log-likelihood is zero. To see this, notice that the MLE estimate of η_i will be ±∞, so the likelihood (log-likelihood) of the observations of those households i that never change will be one (zero) at the estimated value of η_i, regardless of the value of the other variables and parameters.
5 We emphasize again that we are here presenting an illustration. For a substantive study of fat consumption we would need to model explicitly the possibility of some purchases not being recorded.
Table 1. Estimates.
Model                        (2.1)      (2.2)      (2.3)
α                             0.81       0.71        –
mean(α)                        –          –         0.70
SD(α)                          –          –         0.76
mean(η)                      −0.72      −0.70      −0.73
SD(η)                         0.60       0.61       0.70
corr(η, α)                     –          –        −0.31
Child present                 0.47       0.14       0.38
Quarter 2                    −0.04      −0.05      −0.05
Quarter 3                    −0.06      −0.08      −0.06
Quarter 4                     0.08       0.13       0.09
Trend (×100)                 −0.015     −0.014     −0.014
y_i,t−1 × Child present        –         0.76        –
y_i,t−1 × Quarter 4            –        −0.14        –
Log-likelihood              −27,905    −27,659    −26,376
weeks. Suppose, for example, that week t − 1 is missing but we observe weeks t − 2, t and t + 1. The first option is to use the probability Pr(y_it = 1 | y_i,t−2 = 1, x_it, x_i,t−1) in the likelihood. This assumes that we can impute x_i,t−1, which is not problematic in our case (for example, the presence of a child aged less than 7 or the season). The alternative procedure, which we adopt, is to drop observation t and to start again at period t + 1. When we do this we of course keep (η_i, α_i) constant for each household. Using the latter procedure causes a small loss of efficiency but is much simpler. The proportion of missing observations is about 14% of the total number of observations.

2.4. Results for the long panel

Table 1 contains the estimates of models (2.1), (2.2) and (2.3) by maximum likelihood (MLE). The model with observable variation in the state dependence parameter, (2.2), fits significantly better than the most restricted model, (2.1) (a likelihood ratio statistic of 492 with 5 degrees of freedom), but much worse than the general model (2.3). The likelihood ratio test statistic for model (2.1) against (2.3) is 3058 with 370 degrees of freedom, and 2566 with 365 degrees of freedom for testing model (2.2) against (2.3). This represents a decisive rejection of the conventional model, which only allows for a single ‘fixed effect’. Figure 1 shows the marginal distributions of the two parameters, α_i and η_i; as can be clearly seen, the state dependence parameter varies quite widely across households. Restricting the state dependence parameter to be common across households gives significant bias in the mean of the state dependence and in the impact of children. It also gives a value for the variability of η that is too low. For the general model we find a significant negative correlation between the two parameters; obviously the standard model (2.1) is not able to capture this.
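The likelihood ratio statistics just quoted follow directly from the maximized log-likelihoods in Table 1; a minimal check in Python (our own sketch):

```python
# Maximized log-likelihoods from Table 1.
ll = {"(2.1)": -27905, "(2.2)": -27659, "(2.3)": -26376}

def lr_stat(ll_restricted, ll_unrestricted):
    """Likelihood ratio statistic 2(logL_u - logL_r); compare with a
    chi-square critical value with df = number of restrictions tested."""
    return 2 * (ll_unrestricted - ll_restricted)

print(lr_stat(ll["(2.1)"], ll["(2.2)"]))  # 492  (5 df)
print(lr_stat(ll["(2.1)"], ll["(2.3)"]))  # 3058 (370 df)
print(lr_stat(ll["(2.2)"], ll["(2.3)"]))  # 2566 (365 df)
```

Each statistic is far beyond any conventional chi-square critical value for its degrees of freedom, which is the basis for the rejections reported above.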
Figure 1. Marginal densities of α_i and η_i.

Figure 2. Estimated state dependence parameter, α̂_i.

Figure 3. Estimates of marginal dynamic effects.

Figure 2 plots the estimated state dependence parameter (α̂_i) and its 95% confidence interval for each of our 371 households, sorted from the smallest value of α̂_i to the largest.6 The darker horizontal line is the value of α = 0.81 estimated from model (2.1). The proportion of households whose confidence interval for α̂_i contains this value is 59%. Thus for 41% of our sample the estimated α parameter using the model with more heterogeneity, (2.3), is statistically different from the value using model (2.1). We can also consider the marginal effect, which is of more interest than the parameters that are directly estimated. For both models the marginal effect is different for each household, but the variation in the magnitude of the marginal effect among households is greater in model (2.3) than in model (2.1). This is shown in Figure 3; to plot this we set the quarterly dummies and time trend to zero and the child variable to the mode for the household. The x-axis values are sorted according to the values of the marginal effect for the general model (2.3). The flatter (variable) line is for model (2.1) and the increasing curve is the value for model (2.3) (with 95% confidence bands). In this case 46% of households have a marginal effect that is significantly different from that implied by model (2.1), and 52% have a marginal dynamic effect that is not significantly different from zero (at a 5% significance level). The differences between the implications of the two models for the outcome of interest (the marginal dynamic effect) can be seen even more dramatically in Figure 4 and Table 2, which present the estimated distribution of the marginal dynamic effect for the three estimated models for those households with the child variable equal to zero. We plot a grid on this figure to facilitate comparisons across the three sets of estimates. Once again we see that the extended model gives much more variation across households in the marginal effects.
But there are also other strong differences; for example, for the conventional models ((2.1) and (2.2)) all households are estimated to have a positive marginal dynamic effect, whereas for the unrestricted model about 18% have a negative effect (although most are not
6 In the confidence intervals of Figures 2 and 3 we are ignoring the sampling variability in the estimation of model (2.1) because it is negligible in comparison to the sampling variability in estimating model (2.3).
Figure 4. Distribution of the dynamic marginal effect.

Table 2. Distribution of the marginal dynamic effect.
Model               (2.1)     (2.2)     (2.3)
Minimum              0.10      0.08     −0.29
First quartile       0.22      0.19      0.03
Median               0.26      0.23      0.15
Third quartile       0.30      0.26      0.32
Maximum              0.31      0.28      0.80
Mean                 0.26      0.22      0.19
SD                   0.05      0.04      0.23
‘significantly’ different from zero; see Figure 3). Moreover, the mean and median are lower for the extended model. This empirical analysis serves to illustrate our contention that there is probably more heterogeneity in dynamic models than is allowed for by conventional schemes that only allow ‘intercepts’ to vary across households. We turn now to a consideration of estimation when we do not have the luxury of observing households for very many periods. One option is to formulate a (random effects) parametric model for the conditional joint distribution of (α, η | x, y_0) and then to estimate the parameters by, say, maximum likelihood. This parametric model would have to accommodate the bimodalities and fat tails displayed by the distributions shown in Figure 1. In
this paper, we consider the alternative of estimating non-parametric models which do not restrict the joint distribution of the latent factors.
3. EXACT BIAS AND MSE ANALYSIS, T = 3

3.1. A simple model with a lagged dependent variable

The empirical analysis above suggested strongly that we need to allow for heterogeneity in both the intercept and the state dependence parameter when we consider dynamic models. Since relatively little is known about the behaviour of dynamic non-linear panel data estimators even in the simpler case in which we only allow for heterogeneity in the ‘intercept’ (see, for example, Arellano and Honoré, 2001, sec. 8), we necessarily have to be modest in our aims here. Consequently we restrict attention to the simple model with no covariates, in which case we can dispense with parametric formulations such as (2.3) and focus directly on the two transition parameters:

G_i = Pr(y_it = 1 | y_i,t−1 = 0),  (3.1)
H_i = Pr(y_it = 1 | y_i,t−1 = 1).  (3.2)

This is a two-state, first-order stationary Markov model with a marginal dynamic effect given by

M_i = H_i − G_i.  (3.3)
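This Markov chain is easy to simulate; the stdlib-only sketch below (ours, with illustrative values of G and H) checks that the long-run fraction of ones approaches the stationary probability G/(G + 1 − H) from (2.5):

```python
import random

def simulate(G, H, y0=1, T=100, rng=None):
    """Draw y_1..y_T from the two-state first-order Markov chain with
    Pr(y_t = 1 | y_{t-1} = 0) = G and Pr(y_t = 1 | y_{t-1} = 1) = H."""
    rng = rng or random.Random()
    y = [y0]
    for _ in range(T):
        p = H if y[-1] == 1 else G
        y.append(1 if rng.random() < p else 0)
    return y

G, H = 0.3, 0.8                     # marginal dynamic effect M = H - G = 0.5
path = simulate(G, H, T=200_000, rng=random.Random(7))
share = sum(path) / len(path)       # should be near G / (G + 1 - H) = 0.6
```

With a long simulated series the empirical share of ones settles close to 0.6, the stationary probability implied by these transition parameters.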
There is a large literature on the estimation of Markov models considering such issues as testing for stationarity or for the order of the process; the classic reference is Anderson and Goodman (1957), who consider the case in which all agents have the same transition matrix. In general, most investigators assume less heterogeneity than we do here. Exceptions include Billard and Meshkani (1995) and Cole et al. (1995), who both use an empirical Bayes approach,7 and Albert and Waclawiw (1998), who adopt a quasi-likelihood approach to estimate the first two moments of the joint distribution of the transition probabilities. The distributions plotted in Figure 1 suggest that this may miss important features of the joint distribution. There are two primary virtues of considering the simplest model of a first-order stationary Markov chain without covariates. The first is that we can derive exact analytical finite sample results and discuss estimation and bias reduction without recourse to simulation. This allows us, for example, to sign the bias of particular estimators for any value of (G, H) and not just for particular values as in Monte Carlo studies. The second advantage is that the analysis here is fully non-parametric and does not require assumptions concerning functional forms. Thus the basic case serves as a general benchmark which we can examine in great and exact detail. We shall only consider estimation conditional on the observed initial value y_i0.8 We start with an exhaustive account of the case in which T = 3 and, with no loss of generality, we only consider paths that start with y_i0 = 1. This very simple case is instructive and leads us to reject some possibilities and also suggests general results. In a later section, we consider the general fixed-T case. If we take a parametric formulation with an arbitrary cdf F(·) then we have

G_i = F(η_i),  H_i = F(α_i + η_i).  (3.4)

Observing this allows us to derive a restriction that is analogous to the usual model (2.1) with a homogeneous state dependence parameter. Assuming that F(·) is everywhere strictly increasing, we can invert both equations to give

α_i = F^{−1}(H_i) − F^{−1}(G_i).  (3.5)

Then the usual homogeneity restriction, α_i = α, gives the restriction

H_i = F(α + F^{−1}(G_i)).  (3.6)

7 This is essentially a random coefficients model.
8 If we are willing to make assumptions concerning the initially observed value (for example, that it is drawn from the long-run distribution) then there may be a considerable gain in efficiency when T is small. We do not explore this here, to avoid potential biases caused by misspecification of the distribution of the initial value. For recent results on taking account of the initial conditions problem, see Honoré and Tamer (2006).
It is important to note that this restriction is parametric and depends on the chosen cdf. That is, an assumption of a homogeneous state dependence parameter for one distribution implicitly assumes that the state dependence parameter is heterogeneous for any other distribution, unless α is zero. This emphasizes the arbitrariness of the usual homogeneity assumption, since there is no reason why the homogeneity of the state dependence parameter α_i should be linked to the distribution F(·). Given this arbitrariness, we see as more natural the hypothesis that the marginal dynamic effect is the same for everyone:

M_i = M ⇒ H_i = M + G_i.  (3.7)
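The dependence of restriction (3.6) on F can be seen numerically. In the sketch below (ours; the α and η values are illustrative), two households share a common probit state dependence parameter, yet the logit parameters implied by (3.4)-(3.5) differ:

```python
from math import log
from statistics import NormalDist

Phi = NormalDist().cdf                  # probit F

def logistic_alpha(G, H):
    """F^{-1}(H) - F^{-1}(G) for logistic F, as in eq. (3.5)."""
    return log(H / (1.0 - H)) - log(G / (1.0 - G))

alpha = 1.0                             # common probit state dependence
implied = []
for eta in (-1.0, 0.5):                 # two illustrative households
    G, H = Phi(eta), Phi(alpha + eta)   # eq. (3.4) with F = Phi
    implied.append(logistic_alpha(G, H))
# The two implied logit parameters differ: homogeneity under the probit
# is heterogeneity under the logit.
```

The difference is not large for these values, but it is non-zero, which is all the argument requires.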
We shall return to testing for this in Section 3.6 below. When there are no covariates we can treat each household as an individual (albeit short) time series and drop the i subscript. Table 3 gives the outcomes for the case with T = 3 (that is, four observations including period 0) and y_0 = 1. The first column gives the name we have given to each case, the second column gives the observed path and the next four columns give the frequencies of the observed pairs of outcomes 00, 01, 10 and 11, respectively. The final column gives the probability of observing the path (conditional on y_0 = 1), which we denote by p_a, p_b, . . . , p_h, respectively.

Table 3. Outcomes for T = 3.
Case   Path   n00   n01   n10   n11   Probability of case j, p_j
a      1000    2     0     1     0    (1 − H)(1 − G)(1 − G)
b      1001    1     1     1     0    (1 − H)(1 − G)G
c      1010    0     1     2     0    (1 − H)G(1 − H)
d      1011    0     1     1     1    (1 − H)GH
e      1100    1     0     1     1    H(1 − H)(1 − G)
f      1101    0     1     1     1    H(1 − H)G
g      1110    0     0     1     2    HH(1 − H)
h      1111    0     0     0     3    HHH

This is given by

p_j = G^{n01^j} (1 − G)^{n00^j} H^{n11^j} (1 − H)^{n10^j},  (3.8)
where n01^j is the number of 0 → 1 transitions for case j, etc. We now consider the choice of an estimator for this scenario.

3.2. All estimators are biased

An estimator (Ĝ, Ĥ) assigns values to G and H for each case a, b, . . . , h:

{Ĝ, Ĥ} : {a, b, c, d, e, f, g, h} → P([0, 1]²),  (3.9)

where P(X) denotes the power set of X. An estimator (Ĝ, Ĥ) is the correspondence (3.9) evaluated at the random indicator for the paths a to h. For the marginal dynamic effect the correspondence is given by

M̂ = Ĥ − Ĝ : {a, b, c, d, e, f, g, h} → P([−1, 1]).  (3.10)

If the values given by the estimator are unique for each case then the corresponding parameter is point estimated; otherwise the estimator is partially defined. For example, as we shall see in the next subsection, maximum likelihood is point defined for H but only partially defined for G. Before considering particular estimators we show analytically that there is no unbiased estimator of G and H.

PROPOSITION 3.1. All estimators of (G, H) are biased.

This is a useful result since it shows that there is no point in searching for an unbiased estimator; we consequently have to seek estimators that have low bias or low MSE. An alternative way to state this result is that for any estimator of (G, H) we can find an alternative estimator and some values of (G, H) that give a lower bias. Thus we will always be in the situation in which we are making trade-offs, even when we restrict attention to bias.

3.3. Maximum likelihood estimator

In the current context in which the probabilities are given, the most natural estimator is maximum likelihood. The MLE {Ĝ_j^MLE, Ĥ_j^MLE}_{j=a,...,h} gives the values of G and H that maximize the probabilities for each case. It is convenient to give the results for any fixed T (≥3) at this point. From (3.8) it is easily seen that the log-likelihood is maximized at

Ĝ_j^MLE = n01^j / (n00^j + n01^j),  (3.11)

Ĥ_j^MLE = n11^j / (n10^j + n11^j).  (3.12)
If this mapping exists then the parameter is point estimated. Since we condition on y0 = 1 we j j always have (n10 + n11 ) = 0 so that Hˆ jMLE is always defined. The MLE estimator for G does not C The Author(s). Journal compilation C Royal Economic Society 2010.
13
Heterogeneity in dynamic discrete choice models
Table 4. Outcomes conditioning on point estimation.

                                     Maximum likelihood      Non-linear bias corrected
Case   Adjusted probability p̃       Ĝ^MLE     Ĥ^MLE         Ĝ^NBC     Ĥ^NBC
a      (1 − G)²/(1 + H)              0         0              0         0
b      (1 − G)G/(1 + H)              1/2       0              3/8       0
c      G(1 − H)/(1 + H)              1         0              1         0
d      GH/(1 + H)                    1         1/2            1         2/3
e      H(1 − G)/(1 + H)              0         1/2            0         5/6
f      HG/(1 + H)                    1         1/2            1         2/3
exist if we observe y_t = 1 for t = 1, 2, ..., T − 1 (so that n_{00}^j + n_{01}^j = 0). The probability of this is given by:

Pr(non-existence | y_0 = 1) = H^{T−1}.    (3.13)

Thus there is always a positive probability of non-existence (so long as H > 0), but it goes to zero as T becomes large (so long as H < 1). Even for modest T it is small, unless H is very close to 1. Moreover, for the 'non-existence' case where n_{10}^j = 0, i.e. y_t = 1 for t = 1, 2, ..., T (case h in Table 3), the contribution to the log-likelihood is zero since Ĥ_j^MLE = 1 for this case. For the other 'non-existence' case the contribution is not zero, but it is close to zero and goes to zero as T becomes large, since Ĥ_j^MLE = (T − 1)/T. For these reasons, most investigators ignore the bias introduced by selecting out the non-identifying paths.

We thus have two distinct classes of estimator. In the first, we exclude any observation with n_{00}^j + n_{01}^j = 0. In this case, both G and H are point estimated. When we analyse this case in finite samples, we have to correct the probabilities for sample selection by dividing the given probabilities by (1 − H^{T−1}) (we use p̃ to denote adjusted probabilities). The second class of estimator uses all the observed paths, but then Ĝ^MLE is only partially defined. We concentrate attention on the former, point-estimated, case and do not consider the partially defined estimator. 9

Table 4 gives the relevant details for the point-estimated context for T = 3, in which we exclude cases g and h. The second column gives the probabilities adjusted for the sample selection, and the next two columns give the maximum likelihood estimators for (G, H). These estimators are calculated without taking into account that we select out cases g and h (that is, they are based on the unadjusted probabilities given in Table 3). This is largely to conform with current practice, which does not adjust probabilities when calculating maximum likelihood estimators, for the reasons given in the previous paragraph. The alternative is to use the adjusted probabilities when calculating the MLE; this is perfectly legitimate (and may even be considered better), but it is not common practice and it leads to estimators that look 'non-standard', so we choose to analyse only the MLE estimator using the unadjusted probabilities. In all the analysis we always use the adjusted probabilities when calculating biases and MSEs, as previously explained.

9 The proof that there is no unbiased estimator was given for the no-selection case. It is easy to show by the same methods that there is no unbiased estimator of the pair (G, H) for the class in which we select out cases g and h.
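As an illustration, the MLE in (3.11)–(3.12) amounts to counting transitions along a path. A minimal sketch of our own (the function names are ours, not the authors'):

```python
from fractions import Fraction

def transition_counts(path):
    """Count n00, n01, n10, n11 transitions in a 0/1 path (including y0)."""
    n = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for s, t in zip(path, path[1:]):
        n[(s, t)] += 1
    return n

def mle(path):
    """Return (G_hat, H_hat) from (3.11)-(3.12); G_hat is None when the MLE
    does not exist, i.e. when the path never makes a transition out of state 0."""
    n = transition_counts(path)
    g = (Fraction(n[(0, 1)], n[(0, 0)] + n[(0, 1)])
         if n[(0, 0)] + n[(0, 1)] else None)
    h = Fraction(n[(1, 1)], n[(1, 0)] + n[(1, 1)])
    return g, h
```

For case b of Table 4, the path 1001 gives Ĝ = 1/2 and Ĥ = 0, while the all-ones path 1111 (case h) returns no estimate for G.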
The result of the previous subsection tells us that MLE is biased. Since we have an exact probability model we can go further than this and give the exact bias (using the notation Ĝ_j^MLE to denote the jth element of Ĝ^MLE):

bias(Ĝ^MLE) = E(Ĝ^MLE) − G = p̃_a Ĝ_a^MLE + ··· + p̃_f Ĝ_f^MLE − G = (1/2) (1 − G)G / (1 + H) ≥ 0,    (3.14)

bias(Ĥ^MLE) = E(Ĥ^MLE) − H = p̃_a Ĥ_a^MLE + ··· + p̃_f Ĥ_f^MLE − H = (1/2) (G − 2H − 1)H / (1 + H) ≤ 0,    (3.15)

bias(M̂^MLE) = bias(Ĥ^MLE) − bias(Ĝ^MLE) = (1/2) (G² − G + GH − H − 2H²) / (1 + H) ≤ 0.    (3.16)

Although the exact bias depends on the unobserved probabilities G and H, its sign does not. As can be seen, Ĝ^MLE is always biased upwards, and Ĥ^MLE and M̂^MLE always have a negative bias. In particular, the bias of the MLE estimate of the marginal dynamic effect, M̂^MLE, is always negative for interior values of (G, H). This is the analogue of the signable Nickell bias for the linear autoregressive model (see, for example, Arellano, 2003b). 10 We shall return to this in the section in which we consider T > 3. The bias of G is maximized at (G, H) = (0.5, 0), and the absolute values of the biases of H and M are both maximized at (G, H) = (0, 1).

Knowing the sign of the bias is sometimes useful since it allows us to put bounds on the possible values of the parameters and the marginal effect. For example, for the marginal effect for case j we have the bounds [Ĥ_j^MLE − Ĝ_j^MLE, 1]. Admittedly these are not very tight bounds (particularly for case c), but we should not expect tight bounds if we only observe a household for four periods. One view of the choice of an estimator is then that it reduces to finding an estimator that has the smallest expected bounds. The biasedness result of the previous subsection then implies that no estimator gives uniformly tight bounds (that is, smallest bounds independent of the true parameter values).

3.4. Bias corrected estimators

Since we have an exact and explicit form for the bias, one improvement that immediately suggests itself is to use these expressions for the bias, with the ML estimates substituted in, to define a new (bias corrected) estimator. We define the NBC estimator, which we denote (Ĝ^NBC, Ĥ^NBC), as the MLE estimate minus the estimated bias of the latter. 11 We denote the probability of case k using
10 We have the same pattern of signs for the bias when we consider the case in which y_0 = 0, so this is a general result.
11 The terminology here is to distinguish our correction from linear bias correction estimators as in McKinnon and Smith (1998).
the estimates from observing case j by p_k(Ĝ_j^MLE, Ĥ_j^MLE), and define the new estimator by

Ĝ_j^NBC = Ĝ_j^MLE − [ Σ_{k=a}^{f} p̃_k(Ĝ_j^MLE, Ĥ_j^MLE) Ĝ_k^MLE − Ĝ_j^MLE ]
        = 2Ĝ_j^MLE − Σ_{k=a}^{f} p̃_k(Ĝ_j^MLE, Ĥ_j^MLE) Ĝ_k^MLE,    (3.17)

Ĥ_j^NBC = Ĥ_j^MLE − [ Σ_{k=a}^{f} p̃_k(Ĝ_j^MLE, Ĥ_j^MLE) Ĥ_k^MLE − Ĥ_j^MLE ]
        = 2Ĥ_j^MLE − Σ_{k=a}^{f} p̃_k(Ĝ_j^MLE, Ĥ_j^MLE) Ĥ_k^MLE.    (3.18)
The values for these are given in the NBC columns of Table 4. We can also derive the biases for the NBC estimator:

bias(Ĝ^NBC) = E(Ĝ^NBC) − G = (3/8) (1 − G)G / (1 + H) ≥ 0,    (3.19)

bias(Ĥ^NBC) = E(Ĥ^NBC) − H = (1/6) (3G − 6H − 1)H / (1 + H) ≶ 0,    (3.20)

bias(M̂^NBC) = (1/24) (9G² − 9G + 12GH − 4H − 24H²) / (1 + H) ≶ 0.    (3.21)
Note that the bias for H and M is now not necessarily negative. Nevertheless, the situation in which bias(Ĥ^NBC) and bias(M̂^NBC) are not negative is an extreme case of 'negative autocorrelation', in that it implies that both Pr(y_it = 1 | y_i,t−1 = 1) and Pr(y_it = 0 | y_i,t−1 = 0) are small. The bias for H is positive if the following two conditions are both satisfied: H < 1/3 and G > 1/3 + 2H. If we restrict attention to values of (G, H) such that M = H − G > −0.5, then we can show that the bias of M̂^NBC is negative. Comparing the bias for Ĝ^NBC with equation (3.14), we see immediately that the NBC estimator always has a smaller bias for G than MLE. Moreover, if we again restrict attention to M = H − G > −0.5, then we can show that the absolute values of the biases of H and M are lower for NBC than for MLE; for M this actually holds for all M > −0.8. Thus, for T = 3 and 'reasonable' values of (G, H), the bias correction does indeed lead to a reduction in the bias, although there are some extreme cases for which bias correcting actually increases the bias of the estimator.

The definitions in (3.17) and (3.18) suggest a recursion in which we take the new bias corrected estimator and adjust the bias again. This leads to a second round estimator in which some estimated probabilities exceed unity. If we continue iterating, then the estimator does not converge. Formally, we can show that there does not exist a limit estimator (see the Appendix), and numerically we find that the iterated estimator does not converge for one case. This may happen when dealing with non-linear transformations such as these. Even if a limit estimator had existed, it would still be biased, since we proved there is no unbiased estimator in Proposition 3.1. Given
Table 5. Mean squared errors for estimators.

              MSE of Ĝ                                         MSE of Ĥ
MLE           (1/4)(5 − 4G + 4H)(1 − G)G / (1 + H)             (1/4)(4H² − 4GH + G + 1)H / (1 + H)
  (Mean)      (0.138)                                          (0.159)
NBC           (1/64)(73 − 48G + 64H)(1 − G)G / (1 + H)         (1/36)(36H² − 36GH + 7G − 24H + 25)H / (1 + H)
  (Mean)      (0.140)                                          (0.158)

Note: Values in parentheses are means assuming (G, H) uniform over [0, 1]².
this we consider only two candidate estimators: maximum likelihood and the (one-step) non-linear bias corrected estimator.

3.5. Mean squared error of the estimators

The results above have focused on the bias of the maximum likelihood estimators (Ĝ^MLE, Ĥ^MLE) and the NBC estimators (Ĝ^NBC, Ĥ^NBC). However, the MSE can increase even if the bias is reduced. Thus we also need to consider the MSE of our candidate estimators. The MSE for any estimator is given by

MSE(Ĝ) = E(Ĝ − G)² = Σ_{j=a}^{f} p̃_j(G, H) (Ĝ_j − G)²,    (3.22)

MSE(Ĥ) = E(Ĥ − H)² = Σ_{j=a}^{f} p̃_j(G, H) (Ĥ_j − H)².    (3.23)
Table 5 gives the exact MSEs for the two estimators; the values given are not symmetric in G and H since we consider only the case with y_0 = 1. Given these expressions, it is easy to show that neither estimator dominates the other in terms of MSE. For example, if we take (G, H) = (0.5, 0.5), then the MSEs of the ML estimators of G and H are lower, whereas for (G, H) = (0.25, 0.75) the NBC estimator has the lower MSE. Given that we have exact expressions for the MSE, we can find the mean for each of our estimators if we assume a distribution for (G, H). The values in parentheses in Table 5 give the means assuming a uniform distribution over [0, 1]². As can be seen, the two estimators of G and H are quite similar in this regard. We shall return to the MSE analysis in later sections.

3.6. Inference

The final consideration for the two estimators is their performance in hypothesis testing. In the current context, the most important hypothesis we would wish to test is that the marginal dynamic effect is zero: G = H. Table 6 gives the probabilities for the six possible paths under H_0: G = H
Table 6. Outcomes for no marginal dynamic effect.

Case   Path   Prob, given G = H     Ĝ^MLE    Ĝ^NBC
a      1000   (1 − G)²/(1 + G)      0        0
b      1001   (1 − G)G/(1 + G)      1/3      7/18
c      1010   (1 − G)G/(1 + G)      1/3      7/18
d      1011   G²/(1 + G)            2/3      38/45
e      1100   (1 − G)G/(1 + G)      1/3      7/18
f      1101   G²/(1 + G)            2/3      38/45
Figure 5. Inference for MLE and NBC.
and the corresponding ML and NBC estimators under the null. To consider inference we have to specify a decision process that leads us either to reject or not to reject H_0 on observing one of the cases a, b, ..., f. We consider symmetric two-sided procedures in which we reject H_0 if |M̂| > τ, where τ is a cut-off value between zero and unity. The top panel of Figure 5 shows the probabilities of rejecting the null when it is true for values of τ ∈ (0, 1) and G = H = 0.5 when T = 3. This shows that neither estimator dominates the other in terms of size. What of the converse: the probability of rejecting H_0 when it is false? The bottom panel of
Figure 5 shows the case with G = 0.25, H = 0.75 (that is, with reasonably strong positive state dependence). Once again, neither estimator dominates the other.

3.7. Where does this leave us?

A number of conclusions arise from a consideration of the simple model with T = 3 and y_0 = 1:

• There is no unbiased estimator in either the point-identified case or the partially identified case.
• MLE gives an upwards biased estimator of G = Pr(y_it = 1 | y_i,t−1 = 0) and downwards biased estimators of H = Pr(y_it = 1 | y_i,t−1 = 1) and of the marginal dynamic effect M = H − G.
• We can calculate the bias of the MLE and consequently define a one-step bias corrected estimator (NBC).
• The bias corrected estimator reduces the absolute value of the bias for G, as compared to MLE. For values of M > −0.8 the NBC estimator of M also gives a lower bias in absolute terms than MLE, but not for values of M close to −1.
• NBC does not dominate MLE on an MSE criterion. In fact, the mean MSEs of the two estimators are very close if we assume that (G, H) are uniformly distributed.
• Neither of the two estimators dominates the other in terms of making inferences.

Most of these conclusions apply in the T > 3 case; before considering that explicitly, we present a new estimator that is designed to address the relatively poor performance of MLE and NBC in terms of MSE.
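The exact T = 3 bias results summarized above can be checked by direct enumeration of the six selected paths. A sketch of our own (not the authors' code; the per-path MLE values follow Table 4):

```python
# T = 3, y0 = 1: MLE estimates (G_hat, H_hat) for the six identified
# cases a-f, keyed by the observed path.
CASES = {
    (1, 0, 0, 0): (0.0, 0.0), (1, 0, 0, 1): (0.5, 0.0), (1, 0, 1, 0): (1.0, 0.0),
    (1, 0, 1, 1): (1.0, 0.5), (1, 1, 0, 0): (0.0, 0.5), (1, 1, 0, 1): (1.0, 0.5),
}

def adj_prob(path, G, H):
    """Path probability adjusted for selecting out cases g and h (T = 3)."""
    p = 1.0
    for s, t in zip(path, path[1:]):
        p *= [[1 - G, G], [1 - H, H]][s][t]
    return p / (1 - H**2)

def mle_bias(G, H):
    """Exact bias of (G_hat, H_hat) by enumeration, as in (3.14)-(3.15)."""
    bg = sum(adj_prob(k, G, H) * g for k, (g, h) in CASES.items()) - G
    bh = sum(adj_prob(k, G, H) * h for k, (g, h) in CASES.items()) - H
    return bg, bh

bg, bh = mle_bias(0.3, 0.6)
```

At (G, H) = (0.3, 0.6) the enumerated biases agree with the closed forms (3.14) and (3.15): the bias for G is positive and the bias for H negative.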
4. MINIMIZING THE INTEGRATED MSE

4.1. Minimum integrated MSE estimator of M

The two estimators developed so far are based on MLE, but the case for using MLE is not very compelling if we have small samples (see Berkson, 1980, and the discussion following that paper). As we have seen, we can make small sample corrections for the bias to come up with an estimator that is less biased, but our investigations reveal that this is not necessarily better on the MSE criterion. Given that we use the latter as our principal criterion, it is worth investigating alternative estimators that take the MSE into account directly. To focus our discussion, we concentrate on the estimator for the marginal effect M = H − G. The MSE for an estimator M̂_j (where j refers to an observed path of zeros and ones) is given by:

λ(M̂; G, H) = Σ_{j=1}^{J} p_j (M̂_j − (H − G))².    (4.1)

As with the bias, we can show that there is no estimator that minimizes the MSE for all values of (G, H), so we have to settle for finding the minimum for some choice of a prior distribution of (G, H). Given that we are looking at the general case in which we have no idea of the context, the obvious choice is the uniform distribution on [0, 1]². This gives the
integrated MSE:

ψ = ∫₀¹ ∫₀¹ λ(M̂; G, H) dG dH = ∫₀¹ ∫₀¹ Σ_{j=1}^{J} p_j (M̂_j − (H − G))² dG dH
  = ∫₀¹ ∫₀¹ Σ_{j=1}^{J} G^{n_{01}^j} (1 − G)^{n_{00}^j} H^{n_{11}^j} (1 − H)^{n_{10}^j} (M̂_j − (H − G))² dG dH,    (4.2)

where we have substituted for p_j from Table 3. The criterion (4.2) is additive in functions of M̂_1, M̂_2, ..., M̂_J, so that we can find minimizing values of the estimator considering each case in isolation. Differentiating (4.2) with respect to M̂_j, setting the result to zero and solving for M̂_j gives:

M̂_j = ∫₀¹ ∫₀¹ p_j (H − G) dG dH / ∫₀¹ ∫₀¹ p_j dG dH    (4.3)
    = ∫₀¹ H^{n_{11}^j + 1} (1 − H)^{n_{10}^j} dH / ∫₀¹ H^{n_{11}^j} (1 − H)^{n_{10}^j} dH
      − ∫₀¹ G^{n_{01}^j + 1} (1 − G)^{n_{00}^j} dG / ∫₀¹ G^{n_{01}^j} (1 − G)^{n_{00}^j} dG.    (4.4)

Using the result that for integers x and z we have

∫₀¹ Y^x (1 − Y)^z dY = Γ(x + 1) Γ(z + 1) / Γ(x + z + 2) = x! z! / (x + z + 1)!    (4.5)

(where Γ(·) is the gamma function), we have the following closed form for the minimum integrated MSE (MIMSE) estimator:

M̂_j^MIMSE = (n_{11}^j + 1)! (n_{10}^j + n_{11}^j + 1)! / [(n_{10}^j + n_{11}^j + 2)! n_{11}^j!]
            − (n_{01}^j + 1)! (n_{00}^j + n_{01}^j + 1)! / [(n_{00}^j + n_{01}^j + 2)! n_{01}^j!]
          = (n_{11}^j + 1) / (n_{10}^j + n_{11}^j + 2) − (n_{01}^j + 1) / (n_{00}^j + n_{01}^j + 2).    (4.6)

As can be seen, the MIMSE estimator is simply the MLE estimator with n_{st}^j + 1 replacing n_{st}^j everywhere. The first term on the right-hand side of equation (4.6) is Ĥ_j^MIMSE, and the second term is Ĝ_j^MIMSE.

The MIMSE point estimates the values of the parameters in cases where the MLE does not exist. Moreover, the MIMSE estimate is always in the interior of the parameter space (that is, M̂_j^MIMSE ∈ (−1, 1)). In terms of computational difficulty, the MIMSE estimator is as easy to compute as the MLE estimator and somewhat easier to compute than the NBC estimator. In particular, we only require observation of the sufficient statistics {n_{00}^j, n_{01}^j, n_{10}^j, n_{11}^j} to compute the estimator M̂_j^MIMSE. Of most importance is that as each n_{st}^j → ∞ (which would follow from n → ∞ and the transition probabilities being interior), the MIMSE estimator converges to the MLE. Convergence to MLE is a considerable virtue, since then MIMSE inherits all of the desirable asymptotic properties (consistency and asymptotic efficiency) of MLE.
4.2. A Bayesian perspective

The use of a uniform distribution in the derivation of the MIMSE estimator suggests extending to a Bayesian analysis (see Billard and Meshkani, 1995, Cole et al., 1995). Suppose we have a sample Y and parameters (G, H) ∈ [0, 1]². The posterior distribution of the parameters is given by

P(G, H | Y) = P(Y | G, H) P(G, H) / P(Y) = P(Y | G, H) P(G, H) / ∫∫ P(Y | G, H) P(G, H) dG dH,    (4.7)

where P(Y | G, H) is the likelihood of the data and P(G, H) is the prior distribution. In our case:

P(Y | G, H) = G^{n_{01}} (1 − G)^{n_{00}} H^{n_{11}} (1 − H)^{n_{10}}    (4.8)

and we take a uniform prior, P(G, H) = 1. Then, using the same results used to obtain the closed form for the MIMSE, we have

P(Y) = ∫₀¹ ∫₀¹ G^{n_{01}} (1 − G)^{n_{00}} H^{n_{11}} (1 − H)^{n_{10}} dG dH
     = [n_{11}! n_{10}! / (n_{10} + n_{11} + 1)!] [n_{01}! n_{00}! / (n_{00} + n_{01} + 1)!].    (4.9)

The posterior distribution is given by

P(G, H | Y) = G^{n_{01}} (1 − G)^{n_{00}} H^{n_{11}} (1 − H)^{n_{10}} (n_{00} + n_{01} + 1)! (n_{10} + n_{11} + 1)! / [n_{11}! n_{10}! n_{01}! n_{00}!].    (4.10)

For a Bayesian analysis this provides all that is required from the data for subsequent analysis of, say, the Bayesian risk for the marginal dynamic effect M = H − G. Our interest here is in how this relates to our estimators. To link to estimators, we consider the marginal posterior of G:

P(G | Y) = ∫₀¹ P(G, H | Y) dH
         = [(n_{00} + n_{01} + 1)! (n_{10} + n_{11} + 1)! / (n_{11}! n_{10}! n_{01}! n_{00}!)] G^{n_{01}} (1 − G)^{n_{00}} ∫₀¹ H^{n_{11}} (1 − H)^{n_{10}} dH
         = [(n_{00} + n_{01} + 1)! / (n_{00}! n_{01}!)] G^{n_{01}} (1 − G)^{n_{00}},

where we have used (4.10). A standard result is that the MLE is the mode of the posterior distribution assuming a flat prior. In the current context, taking the derivative of this expression with respect to G, setting it equal to zero and solving for G gives the maximum likelihood estimator in equation (3.11). To show the link to the MIMSE, we have that the conditional mean
Table 7. Estimates of marginal effect for three estimators.

Case   Path   M̂^MLE   M̂^NBC   M̂^MIMSE
a      1000   0        0        1/12
b      1001   −1/2     −3/8     −1/6
c      1010   −1       −1       −5/12
d      1011   −1/2     −1/3     −1/6
e      1100   1/2      5/6      1/6
f      1101   −1/2     −1/3     −1/6
g      1110   –        –        1/10
h      1111   –        –        3/10
of G is given by

E(G | Y) = ∫₀¹ G P(G | Y) dG
         = [(n_{00} + n_{01} + 1)! / (n_{00}! n_{01}!)] ∫₀¹ G^{n_{01} + 1} (1 − G)^{n_{00}} dG
         = [(n_{00} + n_{01} + 1)! / (n_{00}! n_{01}!)] [(n_{01} + 1)! n_{00}! / (n_{00} + n_{01} + 2)!]
         = (n_{01} + 1) / (n_{00} + n_{01} + 2),

which is the MIMSE estimator for G (see the second term on the right-hand side of equation (4.6)). Thus, under a flat prior, the MLE is the posterior mode and the MIMSE is the posterior mean.

4.3. Comparing the MIMSE estimator with MLE and NBC, T = 3

We now consider how the MIMSE estimator compares to MLE and NBC in terms of finite sample bias and MSE. In the interests of comparability, we shall only consider the estimates for cases a to f and exclude the cases for which MLE does not exist. In doing this, we use the adjusted probabilities given in Table 4 that take into account the sample selection. This is consistent with our earlier decision to consider the MLE derived using the uncorrected probabilities, but to use the corrected probabilities when considering bias and MSE. Note that the MIMSE estimator does not minimize the integrated MSE for the corrected probabilities, so these comparisons are relatively unfavourable to MIMSE. In Table 7 we give the three sets of values for the estimator of M. As can be seen, the estimates of M for MIMSE range from −5/12 to 3/10. Figure 6 shows the comparisons of bias and MSE for values of G = 0.2, 0.5, 0.8 and H ∈ [0, 1] when T = 3. The left-hand panels give the bias and the right-hand panels give the MSE. As can be seen for the bias, sometimes MIMSE is worse than NBC and sometimes it is better. In particular, since the bias of the MIMSE estimator can be positive or negative, we can have zero bias for some parameter values (for example, at (G, H) = (0.5, 0.366)). Turning to the right-hand panels for the MSE, we see that the MIMSE estimator does better than MLE and NBC unless there is strong negative state dependence, and
Figure 6. Bias and MSE for three estimators, T = 3.
sometimes does very much better. For example, for (G, H ) = (0.5, 0.8) (which implies moderate positive state dependence with M = 0.3) we have values for the MSE of 0.49, 0.41 and 0.17 for MLE, NBC and MIMSE, respectively.
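The MSE values just quoted can be reproduced by weighting each case's estimate of M by its adjusted probability. A sketch of our own, with the per-path estimates taken from Table 7:

```python
# Estimates of M per path for cases a-f: (MLE, NBC, MIMSE), from Table 7.
EST = {
    (1, 0, 0, 0): (0.0, 0.0, 1/12),   (1, 0, 0, 1): (-1/2, -3/8, -1/6),
    (1, 0, 1, 0): (-1.0, -1.0, -5/12), (1, 0, 1, 1): (-1/2, -1/3, -1/6),
    (1, 1, 0, 0): (1/2, 5/6, 1/6),    (1, 1, 0, 1): (-1/2, -1/3, -1/6),
}

def mse_of_M(G, H):
    """Exact MSE of M-hat for MLE, NBC, MIMSE over the selected paths, T = 3."""
    M = H - G
    out = [0.0, 0.0, 0.0]
    for path, ests in EST.items():
        p = 1.0
        for s, t in zip(path, path[1:]):
            p *= [[1 - G, G], [1 - H, H]][s][t]
        p /= 1 - H**2  # selection adjustment
        for i, m in enumerate(ests):
            out[i] += p * (m - M)**2
    return out

mse_mle, mse_nbc, mse_mimse = mse_of_M(0.5, 0.8)
```

For (G, H) = (0.5, 0.8) this returns approximately 0.49, 0.41 and 0.17, matching the values in the text.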
5. EXACT BIAS AND MSE ANALYSIS FOR FIXED T > 3

As before, we shall only consider sequences that start with y_0 = 1. When considering T = 3 we could write down all eight possible cases and show explicit expressions for the bias and MSE. For larger values of T, tables such as Table 3 become impractical. For the observed sequence {1, y_1, y_2, ..., y_T} there are 2^T possible distinct paths; for convenience we denote 2^T by Ω. An estimator for G and H is given by a mapping from the outcomes to values for Ĝ and Ĥ. Given (3.11) and (3.12), the bias of the MLE estimators is given by

bias(Ĝ^MLE) = [1 / (1 − H^{T−1})] Σ_{j=1}^{Ω−2} p_j (n_{01}^j / (n_{00}^j + n_{01}^j)) − G,    (5.1)

bias(Ĥ^MLE) = [1 / (1 − H^{T−1})] Σ_{j=1}^{Ω−2} p_j (n_{11}^j / (n_{10}^j + n_{11}^j)) − H.    (5.2)

Note that the summation runs from 1 to (Ω − 2), since the last two cases are selected out. The MSEs for the MLE are given by

MSE(Ĝ) = [1 / (1 − H^{T−1})] Σ_{j=1}^{Ω−2} p_j (n_{01}^j / (n_{00}^j + n_{01}^j) − G)²,
MSE(Ĥ) = [1 / (1 − H^{T−1})] Σ_{j=1}^{Ω−2} p_j (n_{11}^j / (n_{10}^j + n_{11}^j) − H)².    (5.3)

These are exact analytical expressions for the bias and MSE. We cannot derive closed form expressions for them (mainly because we cannot display closed form expressions for the n_{st}^j), but we can compute the exact values numerically using these formulas. We postpone presenting these until after we define the bias corrected estimator. As before, we can define a new estimator by taking the bias of the MLE estimator, assuming that the values of G and H are the estimated values, and then bias correcting. This gives 12

Ĝ_j^NBC = 2Ĝ_j^MLE − Σ_{k=1}^{Ω−2} p̃_k(Ĝ_j^MLE, Ĥ_j^MLE) Ĝ_k^MLE,
Ĥ_j^NBC = 2Ĥ_j^MLE − Σ_{k=1}^{Ω−2} p̃_k(Ĝ_j^MLE, Ĥ_j^MLE) Ĥ_k^MLE.    (5.4)
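The one-step correction (5.4) is straightforward to implement by enumerating the identified paths. A sketch of our own (not the authors' code), checked against the T = 3 NBC values in Table 4:

```python
from itertools import product

def counts(path):
    n = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for s, t in zip(path, path[1:]):
        n[(s, t)] += 1
    return n

def mle(path):
    n = counts(path)
    return (n[(0, 1)] / (n[(0, 0)] + n[(0, 1)]),
            n[(1, 1)] / (n[(1, 0)] + n[(1, 1)]))

def adj_prob(path, G, H, T):
    p = 1.0
    for s, t in zip(path, path[1:]):
        p *= [[1 - G, G], [1 - H, H]][s][t]
    return p / (1 - H**(T - 1))  # selection adjustment

def nbc(path, T):
    """Equation (5.4): e.g. G_NBC = 2*G_MLE - sum_k p~_k(G_MLE, H_MLE)*G_MLE_k."""
    # identified paths: y0 = 1 and a visit to state 0 before period T
    paths = [(1,) + q for q in product((0, 1), repeat=T) if 0 in q[:-1]]
    g, h = mle(path)
    eg = sum(adj_prob(k, g, h, T) * mle(k)[0] for k in paths)
    eh = sum(adj_prob(k, g, h, T) * mle(k)[1] for k in paths)
    return 2 * g - eg, 2 * h - eh
```

For T = 3 and case b (path 1001) this gives (3/8, 0), and for case d (path 1011) it gives (1, 2/3), as in Table 4.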
Finally, the MIMSE estimator is given by (4.6).

We turn now to the performance of our three estimators as we increase T from 3 to 12. We consider three cases: (G, H) = (0.75, 0.25), (0.5, 0.5) and (0.25, 0.75). The first of these cases is somewhat extreme in that M = −0.5 and the y variable has a high probability of changing from period to period. For most contexts (for example, state dependence due to habit formation, as in our empirical example) this would never be considered. Nonetheless, there may be circumstances, such as the purchase of a particular small durable, when we see this sort of behaviour. Figure 7 shows the results. The left-hand panels give the bias against T and the right-hand panels give the MSEs against T; note that the y-axis scales vary from panel to panel.

Figure 7. Bias and MSE for estimators of marginal dynamic effect.

We consider first the (absolute) biases. There are two aspects to this. First, how big is the bias for very small T? And, second, how quickly does the bias converge to zero (if it does) as T increases (see, for example, Carro, 2007, and Hahn and Newey, 2004)? Since we gave an exact analysis of the former for T = 3 in the previous section, we concentrate here on the second issue. For all three cases shown in Figure 7, the NBC estimator usually has the smallest bias (in absolute value) and appears to converge to zero fastest. 13 Taking the values shown in the figure, we can actually be more precise than this. To a very high order of approximation, the bias of estimator e, which we denote b_e, is polynomial in T with the following dominant term:

b_e ≃ κ_e T^{δ_e},    (5.5)

where κ_e and δ_e are functions of (G, H). For the case G = H = 0.5 and the MLE estimator, this is an exact relationship with κ_MLE = −1 and δ_MLE = −1, so that the bias is always exactly −1/T. Regressions, for the three cases we consider, of the log (absolute) bias on log T give values of δ_MLE ≃ −0.9, δ_NBC ≃ −2 and δ_MIMSE ≃ −0.6. Thus the bias disappears fastest for NBC and slowest for MIMSE. The rates for MLE and NBC are close to the expected orders of O(T^{−1}) and O(T^{−2}), respectively. Given that the bias of NBC is also usually lowest for T = 3, this corroborates what the figure suggests, namely that NBC is superior to MLE and MIMSE in terms of bias.

In addition to the dependence on T, the bias also depends on the values of G and H; that is, on the number of transitions we observe. Given that we are analysing paths that start with 1, the bias, for any given T, is higher the closer H is to 1. That is, the bias is higher the higher the probability of paths with no changes. This effect of H on the bias can be seen in Figure 6, and in Figure 7, where for any given T the bias of the MLE is always highest in the third panel, which has the highest H (H = 0.75).

12 Since we have to sum over all 2^T − 2 cases to calculate the NBC, the computation time increases with T. However, for any T < 24 a regular PC takes less than a minute to compute the NBC. We have not tried higher T, because most micropanels found in practice have fewer than 25 periods.
13 It is worth noting that the biases for G and H are not so regular and are not even always monotone decreasing in T. Despite this, the difference, M, is well behaved.
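The exact computations behind Figure 7 are just finite sums over the identified paths, as in (5.1)–(5.3). A sketch of our own; at G = H = 0.5 it reproduces the exact −1/T bias of the MLE noted above:

```python
from itertools import product

def exact_bias_M(G, H, T):
    """Exact bias of the MLE of M = H - G over the identified paths,
    using the selection-adjusted probabilities of Section 5."""
    M = H - G
    bias = 0.0
    for q in product((0, 1), repeat=T):
        if 0 not in q[:-1]:      # G-MLE undefined: path selected out
            continue
        path = (1,) + q
        n = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
        p = 1.0
        for s, t in zip(path, path[1:]):
            n[(s, t)] += 1
            p *= [[1 - G, G], [1 - H, H]][s][t]
        p /= 1 - H**(T - 1)      # selection adjustment
        m_hat = (n[(1, 1)] / (n[(1, 0)] + n[(1, 1)])
                 - n[(0, 1)] / (n[(0, 0)] + n[(0, 1)]))
        bias += p * (m_hat - M)
    return bias
```

For T = 3 this agrees with the closed form (3.16), and at G = H = 0.5 it returns −1/T.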
[Figure: the regions of the (G, H) unit square in which the MSE of MIMSE is smaller than the MSE of MLE, shown for T = 3, 4 and 5.]
Figure 8. MIMSE and MLE of M in terms of MSE.
Turning to the MSEs, a radically different pattern emerges. MIMSE has the lowest MSE in all cases, and MLE is almost always better than NBC. One feature to note is that, although the MLE MSE is clearly converging towards the MIMSE MSE (as we expect theoretically), it is still significantly higher even when we have T = 10. The figures for these three cases suggest that MIMSE is usually best in MSE terms. In Figure 8, we display the values of G and H in the unit square for which MIMSE is MSE-better than MLE, for values of T = 3, 4, 5. Note that the sets for the different values of T are not nested. The MIMSE estimator performs worse only for extreme values of G and H, particularly those that imply very negative state dependence.
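The comparison underlying Figure 8 can be approximated on a grid, reusing the exact T = 3 probabilities and the per-path estimates of M from Table 7. A sketch of our own:

```python
# Compare the exact MSE of M-hat for MLE and MIMSE on a (G, H) grid, T = 3.
MLE_M   = {(1, 0, 0, 0): 0.0,  (1, 0, 0, 1): -1/2, (1, 0, 1, 0): -1.0,
           (1, 0, 1, 1): -1/2, (1, 1, 0, 0): 1/2,  (1, 1, 0, 1): -1/2}
MIMSE_M = {(1, 0, 0, 0): 1/12, (1, 0, 0, 1): -1/6, (1, 0, 1, 0): -5/12,
           (1, 0, 1, 1): -1/6, (1, 1, 0, 0): 1/6,  (1, 1, 0, 1): -1/6}

def mse(est, G, H):
    total = 0.0
    for path, m in est.items():
        p = 1.0
        for s, t in zip(path, path[1:]):
            p *= [[1 - G, G], [1 - H, H]][s][t]
        total += p / (1 - H**2) * (m - (H - G))**2
    return total

grid = [i / 20 for i in range(1, 20)]
mimse_wins = sum(mse(MIMSE_M, G, H) < mse(MLE_M, G, H)
                 for G in grid for H in grid)
share = mimse_wins / len(grid)**2
```

On this interior grid, MIMSE has the smaller MSE over most of the unit square; it loses, for example, at (G, H) = (0.9, 0.1), which implies strong negative state dependence, consistent with Figure 8.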
6. MANY HOUSEHOLDS

6.1. Using MLE, NBC and MIMSE to estimate the distribution of M

In the previous three sections we have considered households in isolation and treated their observed paths as separate time series. However, in most empirical analyses the interest is not in individual households but in the population. Thus it may be that the distribution of M in the population is of primary interest, rather than the values for particular households. We now consider how the estimators introduced above (MLE, NBC and MIMSE) can be used to estimate the distribution of M in the population. We take T = 9 (that is, 10 observations per unit, including the initial observation) as being a 'reasonably' long panel in practical terms, but still short enough to give concern over small sample bias. As before, we continue with the context in which y_i0 = 1. We present results for three different distributions of (G, H). First, we consider a uniform distribution for (G, H) over [0, 1]². For this distribution we have exact calculations of the properties of the estimators when T = 9 and N goes to infinity. The second distribution is the empirical distribution of (G, H) for the 367 households considered in the empirical section above. In this case we simulate a sample with T = 9 and large N to display the properties of the estimators when we pool many households. For the final set of simulations we
take a uniform distribution for G on [0.1, 0.9] and impose the homogeneous state dependence parameter condition (3.6) for H with a Normal distribution:

H_i = Φ(0.81 + Φ^{−1}(G_i)).    (6.1)

The value α = 0.81 is taken from the empirical estimate of (2.1). This can be considered a 'standard' model with homogeneous state dependence. For the first case, the true distribution of M over the population has cdf

F_{M_i}(x) = (1/2)(1 + 2x + x²) if x ≤ 0;  (1/2)(1 + 2x − x²) if x > 0,    (6.2)

and pdf

f_{M_i}(x) = 1 + x if x ≤ 0;  1 − x if x > 0.    (6.3)
This implies that the mean and median value of the marginal dynamic effect M are zero. To calculate the estimated distributions, note first that M̂_i can only take one of 2^T possible values, since any household sequence observed in the pooled sample will correspond to one of the 2^T combinations of 1's and 0's we can have conditional on the first observation. Then the distribution of M̂_i, when N goes to infinity and T is fixed, is given by the probabilities of observing each path j in a pooled sample:

Pr(j) = ∫_G ∫_H Pr(j | H, G) f(G, H) dG dH = ∫_G ∫_H p_j f(G, H) dG dH
      = ∫_G ∫_H G^{n_{01}^j} (1 − G)^{n_{00}^j} H^{n_{11}^j} (1 − H)^{n_{10}^j} f(G, H) dG dH.    (6.4)

In the case of the uniform distribution we are considering,

Pr(j) = ∫₀¹ ∫₀¹ G^{n_{01}^j} (1 − G)^{n_{00}^j} H^{n_{11}^j} (1 − H)^{n_{10}^j} dG dH    (6.5)
      = [n_{01}^j! n_{00}^j! / (n_{00}^j + n_{01}^j + 1)!] [n_{11}^j! n_{10}^j! / (n_{10}^j + n_{11}^j + 1)!].    (6.6)
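The closed form (6.6) is easy to check: summed over all 2^T paths, the probabilities must equal one, since the p_j sum to one for every (G, H). A sketch of our own:

```python
from math import factorial
from itertools import product

def prob_uniform(path):
    """Pr(j) of equation (6.6): probability of a path when (G, H) ~ U[0,1]^2."""
    n = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for s, t in zip(path, path[1:]):
        n[(s, t)] += 1

    def beta_int(a, b):  # integral_0^1 x^a (1-x)^b dx = a! b! / (a+b+1)!
        return factorial(a) * factorial(b) / factorial(a + b + 1)

    return beta_int(n[(0, 1)], n[(0, 0)]) * beta_int(n[(1, 1)], n[(1, 0)])

T = 9
total = sum(prob_uniform((1,) + q) for q in product((0, 1), repeat=T))
```

For example, the one-transition path (1, 1) has probability 1/2 under the uniform prior, and the 2^T path probabilities sum to one.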
From this we can derive the distribution of M̂ as N → ∞ with T fixed. The differences in the estimated distributions between the three estimators come from the different M̂_i's estimated from a given path j (this is what we have studied in previous sections). Figures 9 and 10 give graphical comparisons of the true distribution and the estimated distributions based on the estimates of M_i for each possible path by MLE, NBC and MIMSE, conditioning on identification of the MLE, for T = 9 and N → ∞ in the uniform case. The first, Figure 9, shows the cumulative distributions and the second, Figure 10, shows the Q–Q plot; although the two figures are informationally equivalent, the latter reveals different detail to the eye. Consider first the MLE and NBC estimators. The NBC cdf is always to the right of the MLE cdf, and for many values NBC is closer to the true
Figure 9. Estimates of the distribution of M.
[Figure: centiles of the MLE, NBC and MIMSE estimated distributions plotted against the centiles of the true distribution of M.]
Figure 10. Q–Q plot of estimators of M.
distribution, since the bias is higher in absolute value for MLE than for NBC. However, these figures also show that the NBC estimate does worse than MLE for high values of the marginal effect. Thus the lower bias at the lower end for NBC is cancelled out by the higher bias for higher values of M. Hence the MLE usually has a lower variance. A conventional statistic for the difference between a true distribution and that of an estimator is the supremum of the absolute difference between them; that is, the Kolmogorov–Smirnov (K–S) statistic:
$$
D = \sup_{m \in [-1, 1]} |\hat{F}(m) - F(m)|. \tag{6.7}
$$
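For discrete distributions such as those of the estimators here, the statistic in (6.7) can be computed exactly by evaluating both step-function cdfs on the union of their support points. A minimal sketch (the toy values below are illustrative, not the paper's):

```python
def cdf(values, probs, m):
    """Right-continuous cdf of a discrete distribution, evaluated at m."""
    return sum(p for v, p in zip(values, probs) if v <= m)

def ks_distance(est_values, est_probs, true_values, true_probs):
    """Kolmogorov-Smirnov statistic (6.7) between two discrete distributions."""
    grid = sorted(set(est_values) | set(true_values))
    return max(abs(cdf(est_values, est_probs, m) - cdf(true_values, true_probs, m))
               for m in grid)

# Toy example: an estimator with mass at three values vs. a 'true' distribution
D = ks_distance([-0.5, 0.0, 0.5], [0.3, 0.4, 0.3],
                [-0.25, 0.25], [0.5, 0.5])
print(D)  # largest vertical gap between the two step cdfs
```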
The NBC estimator dominates the ML estimator on this criterion. Turning to the MIMSE estimator, we see from the Q–Q plot that up to the 6th decile this tracks the true value very closely; the divergences are mainly due to the MIMSE estimator taking on a finite number of values. In particular, the median of the MIMSE estimator and the true distribution are very close. The estimated medians when $N \to \infty$ and T = 9 by MLE, NBC and MIMSE converge to −0.14, −0.11 and −0.07, respectively. This close correspondence is to be expected given that the estimator was derived assuming a distribution close to the one used here. At the top of the M distribution, however, the MIMSE tends to underestimate the true value (that is, its cdf is to the left of the true cdf). Despite these differences, the main conclusion from the two figures is that the MIMSE estimator fits the true cdf considerably better than either MLE or NBC.

The probabilities in (6.6) can also be used in deriving the asymptotic properties of estimates of moments of $M_i$ as $N \to \infty$ with T held fixed. This can then be used as an approximation to exact finite sample properties in panels with large N and small T. Looking at the mean marginal dynamic effect, the estimated average from our three estimators, $\hat{\bar{M}} = \frac{1}{N}\sum_{i=1}^N \hat{M}_i$, converges to the true value as $(T, N) \to \infty$, because $\hat{M}_i \to M_i$ and the sample average converges to the population mean. But for a given T, as $N \to \infty$,
$$
\hat{\bar{M}} \xrightarrow{p} E(\hat{M}_i) \neq E(M_i) \tag{6.8}
$$
as long as $\hat{M}_i$ is a biased estimator of $M_i$.$^{14}$ Therefore, $\Pr(j)$ in (6.6) gives the probabilities of each possible value of $\hat{M}_i$, allowing us to calculate the asymptotic properties, as $N \to \infty$, of the estimators based on moments of $M_i$:
$$
\hat{\bar{M}} \xrightarrow{p} E(\hat{M}_i) = \sum_j \Pr(j)\, \hat{M}_j, \tag{6.9}
$$
$$
\sqrt{N}\big(\hat{\bar{M}} - E(M_i) - \mathrm{bias}(\hat{M}_i)\big) \xrightarrow{d} N\big(0, \mathrm{Var}(\hat{M}_i)\big), \tag{6.10}
$$
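The limit in (6.9) can be computed exactly for the MLE by enumerating all paths. The following sketch (our code, not the authors') conditions on identification of the MLE, i.e. drops paths on which either transition probability has no observations, and renormalizes the remaining probabilities:

```python
from math import factorial
from itertools import product

def counts(path, y0=1):
    """Count the four transition types along a 0/1 path, given initial state y0."""
    n = {"00": 0, "01": 0, "10": 0, "11": 0}
    prev = y0
    for y in path:
        n[str(prev) + str(y)] += 1
        prev = y
    return n

def pr_uniform(n):
    """Equation (6.6): path probability under a uniform distribution for (G, H)."""
    return (factorial(n["01"]) * factorial(n["00"]) / factorial(n["00"] + n["01"] + 1)
            * factorial(n["11"]) * factorial(n["10"]) / factorial(n["10"] + n["11"] + 1))

T = 9
total = mean_m = 0.0
for path in product((0, 1), repeat=T):
    n = counts(path)
    if n["00"] + n["01"] == 0 or n["10"] + n["11"] == 0:
        continue  # MLE of (G, H) not identified on this path
    p = pr_uniform(n)
    m_hat = n["11"] / (n["10"] + n["11"]) - n["01"] / (n["00"] + n["01"])  # MLE of M
    total += p
    mean_m += p * m_hat
print(mean_m / total)  # asymptotic mean of the MLE of M, conditional on identification
```

This uses the uniform distribution on $[0, 1]^2$ of (6.5), so the number it produces need not match figures quoted for other designs in the text.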
where $\mathrm{bias}(\hat{M}_i) = E(\hat{M}_i) - E(M_i)$. When N goes to infinity and T equals 9, the MLE, NBC and MIMSE estimates of the mean of M converge to −0.16, −0.08 and −0.05, respectively. Thus, under pooling, the MIMSE has a lower asymptotic bias than the NBC for the mean of M. As for the asymptotic root MSE, we have values of 0.21 for the MLE, 0.25 for the NBC and 0.10 for the MIMSE. Thus MIMSE is best on this criterion and NBC is worst. The top panels of Figure 11 present results using the empirical distribution of (G, H) for the 367 households considered in Section 2. For each pair we simulate 50 paths of length 10 with an initial value of unity (so that we have 18,350 paths in all, before we select out the paths for which the MLE is not identified). The mean of M for the data is 0.23 (positive mean state dependence) and the means of the estimates from the simulated data are 0.08, 0.17 and 0.14 for MLE, NBC and MIMSE, respectively. Thus the bias is negative in all three cases, largest in absolute value for MLE and smallest for NBC. This reflects the fact that the NBC estimator usually has a lower bias for any particular path (see Section 5). The median of M for the data is 0.178 and the estimated medians are 0, 0 and 0.150 for the MLE, NBC and MIMSE, respectively; the latter displays much less bias than the other two estimators. One notable feature of these distributions is that all three display a sharp jump at some point in the distribution: at zero for MLE and NBC (hence the median result) and at about 0.25 for MIMSE. It is this clustering (around zero for MLE and NBC

$^{14}$ Also, note that $E(\hat{\bar{M}}) = E(\hat{M}_i)$.
Figure 11. Estimates of the distribution of M.
and close to the true mean for MIMSE) that seems to give the lower mean bias for the MIMSE. Once again, the MIMSE estimator gives a much closer fit to the true distribution. The final set of simulations assumes that the state dependence parameter in a parametric Normal model is constant across the population (note that this does not impose that the dynamic marginal effect is the same for everyone). The results for the three estimators are given in the bottom panels of Figure 11. When comparing the three estimators, the conclusion is the same as in the other two simulations: the MIMSE estimator is clearly better than MLE or NBC. And this is true here even for the higher percentiles. Note that the overall fit for all estimators is much worse in this case, mainly due to the efficiency loss caused by not imposing a constant state dependence parameter when estimating. This emphasizes the importance of first testing for slope homogeneity (see Pesaran and Yamagata, 2008).

6.2. Finite sample comparisons

In the previous subsection, we examined the estimated distribution when the number of households becomes large. To end this section, we look at the finite sample performance of the three estimators, in terms of mean bias and root mean squared error (RMSE), when we want to estimate the mean and the quartiles of the distribution of M, with samples where the number of households is large but not unduly so and the number of periods is small. We consider the same three experiments as in the previous subsection. The first simulation experiment considers
a uniform distribution for (G, H) over $[0.1, 0.9]^2$. The second distribution is the empirical distribution of (G, H) for the 367 households considered in the empirical section above. For the final set of simulations we take a uniform distribution for G on [0.1, 0.9] and impose the homogeneous state dependence parameter condition (3.5) for H with a Normal distribution, as in equation (6.1), with α = 0.81. In all of them, $y_{i0} = 1$ and the number of households N is equal to 367 (the number of households in the empirical illustration). As before, we exclude observations for which the MLE is not identified, and we take the number of observed periods equal to 10 (T = 9). Table 8 contains the true values and mean estimates of the mean marginal dynamic effect (mean M), and of the median and the other quartiles of the distribution of M. Mean bias and RMSE over 1000 simulations are also reported. The results are in accordance with the conclusions from the previous subsection. In terms of RMSE, the MIMSE estimator is significantly better than the other two, except for the highest quartile, where MLE has a better RMSE in two of the three experiments. For the mean marginal dynamic effect, NBC has a slightly smaller RMSE than MIMSE in the last two experiments. However, the NBC estimator of the median of M performs significantly worse than MIMSE, both in terms of mean bias and RMSE.
7. EXTENSION TO THE CASE WITH COVARIATES

In the previous sections, we considered the case without covariates. This allowed us to derive exact results for the MLE (including the sign of the bias in the dynamic marginal effect) and to consider exact bias corrections instead of corrections based on the leading terms of an asymptotic approximation. We now extend the application of the alternative estimators proposed in this paper to the semi-parametric case with covariates, allowing for full heterogeneity. That is,
$$
y_{it} = 1\{\alpha_i y_{it-1} + x_{it}'\beta_i + \eta_i + v_{it} \geq 0\}, \quad t = 0, \ldots, T; \; i = 1, \ldots, N, \tag{7.1}
$$
where $x_{it}$ is a vector of exogenous variables. We assume that identification conditions are satisfied. These include, for instance, the condition that the covariates $x_{it}$ vary over time for person i.

7.1. Discrete covariates

If $x_{it}$ contains only discrete variables, it is conceptually simple to extend our estimators. For a single binomial covariate we have
$$
H_{i0} = \Pr(y_{it} = 1 \mid y_{it-1} = 1, x_{it} = 0), \quad G_{i0} = \Pr(y_{it} = 1 \mid y_{it-1} = 0, x_{it} = 0),
$$
$$
H_{i1} = \Pr(y_{it} = 1 \mid y_{it-1} = 1, x_{it} = 1), \quad G_{i1} = \Pr(y_{it} = 1 \mid y_{it-1} = 0, x_{it} = 1),
$$
as parameters to be estimated. The estimators are analogous to the case without covariates. The only difference is that now we have to look not only at the transitions in $y_{it}$ but also at the accompanying values of $x_{it}$. That is, the likelihood of an observed path is
$$
G_{i0}^{n^j_{01|0}} (1 - G_{i0})^{n^j_{00|0}} H_{i0}^{n^j_{11|0}} (1 - H_{i0})^{n^j_{10|0}}\, G_{i1}^{n^j_{01|1}} (1 - G_{i1})^{n^j_{00|1}} H_{i1}^{n^j_{11|1}} (1 - H_{i1})^{n^j_{10|1}}, \tag{7.2}
$$
Table 8. Estimation of the quartiles and the mean of the distribution of M on the population.

1. (G, H) from a Uniform distribution on (0.1, 0.9)^2

                          1st quartile    Median    3rd quartile     Mean
True value                   −0.233        0            0.237         0
Mean estimate   MLE          −0.384       −0.144        0.121        −0.129
                NBC          −0.344       −0.081        0.245        −0.046
                MIMSE        −0.240       −0.030        0.156        −0.041
Mean bias       MLE          −0.150       −0.143       −0.116        −0.130
                NBC          −0.110       −0.080        0.008        −0.047
                MIMSE        −0.006       −0.030       −0.081        −0.041
RMSE            MLE           0.158        0.143        0.125         0.132
                NBC           0.124        0.083        0.046         0.053
                MIMSE         0.023        0.058        0.084         0.044

2. (G, H) from the empirical distribution for the 367 households

True value                    0.047        0.177        0.381         0.227
Mean estimate   MLE          −0.143        0.000        0.327         0.080
                NBC          −0.124        0.002        0.488         0.169
                MIMSE         0.026        0.153        0.240         0.139
Mean bias       MLE          −0.190       −0.176       −0.054        −0.148
                NBC          −0.171       −0.175        0.107        −0.058
                MIMSE        −0.021       −0.024       −0.141        −0.088
RMSE            MLE           0.190        0.176        0.063         0.149
                NBC           0.172        0.175        0.115         0.062
                MIMSE         0.023        0.029        0.142         0.089

3. Homogeneous state dependence parameter

True value                    0.1597       0.2516       0.2974        0.2202
Mean estimate   MLE          −0.1259       0.0013       0.3397        0.0926
                NBC          −0.0520       0.0430       0.5084        0.1977
                MIMSE         0.0359       0.1440       0.2441        0.1529
Mean bias       MLE          −0.2856      −0.2503       0.0422       −0.1276
                NBC          −0.2117      −0.2087       0.2109       −0.0225
                MIMSE        −0.1239      −0.1076      −0.0533       −0.0674
RMSE            MLE           0.2856       0.2506       0.0475        0.1291
                NBC           0.2118       0.2162       0.2137        0.0323
                MIMSE         0.1248       0.1085       0.0560        0.0683
where $n^j_{01|0}$ is the number of $y_{t-1} = 0 \to y_t = 1$ transitions in path j for which $x_t = 0$, $n^j_{01|1}$ is the number of such transitions for which $x_t = 1$, and similarly for the other three transitions. For example, the MLE of $H_{i0}$ is $\hat{H}^{MLE}_{i0} = n^j_{11|0} / (n^j_{10|0} + n^j_{11|0})$ (and similarly for the other parameters). The NBC and MIMSE estimators are obtained by following the same procedure as in previous sections. In fact, the MIMSE estimator has the same simple form as in (4.6), replacing $n^j_{01}$ by $n^j_{01|0}$ or $n^j_{01|1}$ depending on whether we want to obtain a transition probability given $x_{it} = 0$ or given $x_{it} = 1$.

7.2. Continuous covariates

The approach in the previous subsection has the virtue of being non-parametric, but it quickly becomes infeasible as the number of values of the discrete covariate and/or the number of covariates increases. Moreover, this non-parametric approach is not feasible if we have a continuous covariate. We therefore return to the parametric assumption about $v_{it}$ in (7.1) that we made in Section 2. In this subsection, for notational simplicity, we consider only one covariate x. We assume that $-v_{it}$ follows a distribution with cdf F. Then, the log-likelihood for each i is
$$
lk_i(\gamma_i) = \sum_{t=1}^{T} \log\big[F(\alpha_i y_{it-1} + \beta_i x_{it} + \eta_i)(2y_{it} - 1) + 1 - y_{it}\big] \tag{7.3}
$$
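The log-likelihood (7.3) is cheap to evaluate. A sketch assuming a logistic F (the numerically stable log-cdf helper and the function names are ours):

```python
import math

def log_logistic_cdf(z):
    """log F(z) for the logistic cdf, computed without overflow."""
    return z - math.log1p(math.exp(z)) if z < 0 else -math.log1p(math.exp(-z))

def loglik(gamma, y, x, y0):
    """Equation (7.3): one individual's log-likelihood, assuming logistic F."""
    alpha, eta, beta = gamma
    ll, prev = 0.0, y0
    for y_t, x_t in zip(y, x):
        index = alpha * prev + beta * x_t + eta
        # contributes log F(index) when y_t = 1, and log(1 - F(index)) = log F(-index) when y_t = 0
        ll += log_logistic_cdf(index if y_t == 1 else -index)
        prev = y_t
    return ll

# At gamma = 0 every outcome has probability 1/2, so the value is T * log(1/2)
print(loglik((0.0, 0.0, 0.0), [1, 0, 1], [0.5, -1.0, 0.2], y0=1))
```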
and the MLE of $\gamma_i = (\alpha_i, \eta_i, \beta_i)$ is the value that maximizes $lk_i$. The first-order conditions are a set of non-linear equations that do not have a closed form solution. Here it is not possible to repeat the analysis of previous sections to derive the exact bias of the MLE, even numerically; information about the bias can only be obtained by simulation. Although simulation is a very good way of finding the bias of an estimator of the parameters in (7.1), a correction based on the bias from simulations adds simulation error on top of the problems inherent in non-linear bias corrections. Moreover, as we have seen for the case without covariates, the MIMSE estimator outperforms both the MLE and NBC in terms of MSE. As a result, we consider only the MIMSE estimator as an alternative to the MLE. In what follows, we omit the subscript i since, as before, we are considering each individual in isolation. The MSE of an estimator $\hat{\gamma}_j$ (where j refers to an observed path of zeros and ones), conditional on x, is given by
$$
\lambda(\hat{\gamma}_j; \gamma) = E[(\hat{\gamma}_j - \gamma)'(\hat{\gamma}_j - \gamma) \mid x; \gamma] = \sum_j p_j (\hat{\gamma}_j - \gamma)'(\hat{\gamma}_j - \gamma), \tag{7.4}
$$
where $p_j$ is the likelihood of the observations of path j conditional on x:
$$
p_j = \prod_{t=1}^{T} \big(F(\alpha y_{jt-1} + \beta x_t + \eta)(2y_{jt} - 1) + 1 - y_{jt}\big). \tag{7.5}
$$
The non-informative or flat prior for parameters on $(-\infty, +\infty)$ is the Jeffreys prior:
$$
p(\gamma)\, d\gamma = d\gamma. \tag{7.6}
$$
This is the equivalent of the uniform prior on the [0, 1] interval that we used in Section 4.$^{15}$ This gives the integrated MSE:
$$
\psi = \int \lambda(\hat{\gamma}_j; \gamma)\, d\gamma = \sum_j \int p_j (\hat{\gamma}_j - \gamma)'(\hat{\gamma}_j - \gamma)\, d\gamma, \tag{7.7}
$$
where the integrals run from $-\infty$ to $+\infty$. The criterion (7.7) is additive in functions of j, so we can find minimizing values of the estimator considering each case in isolation. MIMSE is the value of $\hat{\gamma}_j$ that minimizes $\psi$. Differentiating (7.7) with respect to $\hat{\gamma}_j$, setting the result to zero and solving for $\hat{\gamma}_j$ gives
$$
\hat{\gamma}_j^{MIMSE} = \frac{1}{\int p_j\, d\gamma} \int p_j\, \gamma\, d\gamma. \tag{7.8}
$$
Since (7.8) does not have an analytical closed form solution, these integrals have to be solved numerically. Alternatively, we can exploit the relation between MIMSE and the mean of the posterior of $\gamma$ under a flat prior. The posterior distribution of $\gamma$ when the prior is (7.6) is
$$
P(\gamma \mid Y) = \frac{1}{\int p_j\, d\gamma}\, p_j, \tag{7.9}
$$
where $p_j$ is the likelihood of the data (given $\gamma$ and x) written in (7.5). Since the denominator is a constant that depends on neither $\hat{\gamma}$ nor $\gamma$, we have
$$
\min_{\hat{\gamma}_j} \psi = \min_{\hat{\gamma}_j} \int (\hat{\gamma}_j - \gamma)'(\hat{\gamma}_j - \gamma)\, p_j\, d\gamma
= \min_{\hat{\gamma}_j} \frac{1}{\int p_j\, d\gamma} \int (\hat{\gamma}_j - \gamma)'(\hat{\gamma}_j - \gamma)\, p_j\, d\gamma
= \min_{\hat{\gamma}_j} \int (\hat{\gamma}_j - \gamma)'(\hat{\gamma}_j - \gamma)\, P(\gamma \mid Y)\, d\gamma. \tag{7.10}
$$
Therefore, minimizing the integrated MSE is equivalent to minimizing the expected posterior loss under a quadratic loss function. As proved, for instance, on page 24 of Zellner (1971), the minimum in (7.10) is attained at the mean of the posterior distribution. Hence, we can obtain the MIMSE estimates of $\gamma$ by computing the mean of the posterior (7.9). The only difficult part of (7.9) to compute is $\int p_j\, d\gamma$, since the likelihood itself has a simple analytical form and does not require any integration (given a known cdf). The Metropolis–Hastings algorithm can be used to obtain draws from the posterior density without computing $\int p_j\, d\gamma$. Since the likelihood can be evaluated very cheaply, we can run a large number of iterations of this MCMC algorithm, both to guarantee convergence to the posterior and to obtain a good number of valid draws. Once we have many draws we simply compute their average to obtain the MIMSE estimator. One important advantage of this procedure is that we can automatically accommodate a large number of covariates (subject to the identification conditions).
$^{15}$ See Zellner (1971) for further discussion of non-informative priors.
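A minimal random-walk Metropolis implementation of this idea, assuming a logistic F and a fixed Gaussian proposal (the step size, seed and function names are ours, and in practice the proposal would be tuned):

```python
import math, random

def log_logistic_cdf(z):
    """log F(z) for the logistic cdf, computed without overflow."""
    return z - math.log1p(math.exp(z)) if z < 0 else -math.log1p(math.exp(-z))

def loglik(gamma, y, x, y0):
    """Log of p_j in (7.5), with logistic F."""
    alpha, eta, beta = gamma
    ll, prev = 0.0, y0
    for y_t, x_t in zip(y, x):
        index = alpha * prev + beta * x_t + eta
        ll += log_logistic_cdf(index if y_t == 1 else -index)
        prev = y_t
    return ll

def mimse_mh(y, x, y0, n_iter=20000, burn=10000, thin=5, step=0.5, seed=0):
    """MIMSE as the posterior mean under the flat prior (7.6), via Metropolis-Hastings."""
    rng = random.Random(seed)
    cur = [0.0, 0.0, 0.0]                  # (alpha, eta, beta)
    cur_ll = loglik(cur, y, x, y0)
    draws = []
    for it in range(n_iter):
        prop = [c + rng.gauss(0.0, step) for c in cur]
        prop_ll = loglik(prop, y, x, y0)
        # flat prior: the acceptance ratio is just the likelihood ratio
        if math.log(rng.random()) < prop_ll - cur_ll:
            cur, cur_ll = prop, prop_ll
        if it >= burn and (it - burn) % thin == 0:
            draws.append(cur)
    return [sum(d[k] for d in draws) / len(draws) for k in range(3)]
```

With the settings used in the paper's simulations (20,000 iterations, 10,000 burn-in, every fifth post-burn-in draw retained), the returned vector is the MIMSE estimate of $(\alpha, \eta, \beta)$ for one individual. With a very short panel the flat-prior posterior can be diffuse, so long runs matter.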
In order to illustrate the usefulness of MIMSE in the presence of covariates, we simulate (7.1) and obtain the marginal dynamic effect M from the MLE and the MIMSE estimates for different T and values of the parameters. In particular, we choose values of $\gamma$ so that G, H and M evaluated at the mean value of x are equal to the values used in Figure 6 for the case without covariates. This makes the graphs as comparable as possible. The specific details of the simulations are: $\beta = 1$, $x_{it} \sim$ i.i.d. $N(0, 1)$, $v_{it} \sim$ i.i.d. logistic; we run 10,000 simulations and 20,000 iterations of the Metropolis–Hastings algorithm, of which the first 10,000 are burn-in and, of the 10,000 made after convergence, every fifth is retained as a draw from the posterior. This is done for $T = 4, \ldots, 13$ and for the following three values of the $\alpha$ and $\eta$ parameters. First, $\alpha = -2.2$ and $\eta = 1.1$, which imply G = 0.75, H = 0.25 and M = −0.5 when computed at the mean of x. Second, $\alpha = 0$ and $\eta = 0$, which imply G = 0.5, H = 0.5 and M = 0 at the mean of x. Third, $\alpha = 2.2$ and $\eta = -1.1$, which imply G = 0.25, H = 0.75 and M = 0.5 at the mean of x. As in Figure 7, Figure 12 shows the mean bias and MSE of the MLE and MIMSE in estimating the marginal dynamic effect at the mean value of x for those three sets of parameter values as we increase T. Note that the results as T increases are not as smooth as in
Figure 12. Bias and MSE for estimators of the marginal dynamic effect in a model with covariates. (Six panels plot the bias and the mean squared error of the MLE and MIMSE estimators of M against T, for (G, H) = (0.75, 0.25), (0.50, 0.50) and (0.25, 0.75).)
Figure 7, because this figure is based on simulations whereas those in Figure 7 are exact calculations. As can be seen, the MIMSE performs better than the MLE in terms of MSE for all T and all three values of (G, H), as it did in the case without covariates. Here, however, MIMSE is also better than the MLE in terms of bias. In comparing Figures 7 and 12, it is important to note that, in most cases, both the MLE and the MIMSE have smaller biases (in absolute value) and MSEs in the case with covariates than in the case without covariates. This is not surprising, since we have added exogenous variation to the model. It indicates that the detailed and exact results for models without covariates in previous sections can be taken as a worst-case reference when adding exogenous covariates. For a similar reason, the marginal dynamic effect in model (7.1) that is considered in Figure 12 is more problematic to estimate than the marginal effect of x.
8. CONCLUSIONS

We have considered in detail the dynamic choice model with heterogeneity in both the intercept (the 'fixed effect') and the autoregressive parameter. We motivated this analysis by considering the estimates from a long panel in which we could effectively treat each household as a single time series. This analysis suggested strongly that both parameters vary systematically across households. Moreover, the results of this analysis gave us a joint distribution over the two latent variables that may be difficult to pick up with the usual fully parametric random coefficients model. Consequently, we examined the finite sample properties of non-parametric estimators. In the case without covariates we present exact analytical results for the bias and MSE. We found the following for a simple two-state first-order Markov chain model:

(1) There is no unbiased estimator for the transition probabilities.
(2) Conditioning on identification, we found that the MLE estimate of the marginal dynamic effect
$$
\Pr(y_{it} = 1 \mid y_{i,t-1} = 1) - \Pr(y_{it} = 1 \mid y_{i,t-1} = 0) \tag{8.1}
$$
has a negative bias. This is the non-linear analogue of the Nickell finding that in the linear autoregressive model panel data estimates of the autoregressive parameter are biased toward zero, but note that our results are exact finite sample calculations. The degree of bias depends on the parameter values and the length of the panel, T. The bias of the MLE estimator of the marginal dynamic effect does diminish as we increase the length of the panel, but even for T = 16 it can be high.
(3) Based on the analysis of bias, we constructed an NBC estimator as a two-step estimator with the MLE as the first step. We find that this estimator does indeed reduce the bias in most cases (as compared to MLE), but in MSE terms it is similar to or even worse than MLE. For all but extreme values of negative state dependence, the NBC estimator also has a negative bias for the marginal dynamic effect. A detailed examination of the distribution of the MLE and NBC estimators for T = 3 and T = 10 suggested that neither can be preferred to the other.
(4) Given the relatively poor performance of the MLE and NBC in terms of MSE, we constructed an estimator that minimizes the integrated MSE (MIMSE) and that has a simple closed form. This estimator coincides with the mean of the posterior distribution assuming a uniform prior. The MIMSE estimator is sometimes better than MLE and NBC in terms of bias, but usually it is worse. In terms of MSE, however, it is much better than either of the first two estimators, particularly when there is some positive state dependence.
(5) Turning to the many-person context, we considered a joint distribution of $\Pr(y_{it} = 1 \mid y_{i,t-1} = 1)$ and $\Pr(y_{it} = 1 \mid y_{i,t-1} = 0)$ over the population and used our non-parametric estimators to estimate the empirical distribution of the parameters. Exact calculations and simulations with T = 9 and large N suggest that the MIMSE-based estimator significantly outperforms the MLE and NBC estimators in recovering the distribution of the marginal dynamic effect.
The conclusion from our exact analyses on a single observed path and from simulations in a many-unit context is that the MIMSE estimator is superior to MLE or a particular bias corrected version of MLE. As emphasized in Section 3, we deemed it necessary to examine the no-covariate case in great detail given that we know very little about the performance of alternative dynamic choice estimators which allow for a great deal of heterogeneity. However, for most analyses, we would also want to condition on covariates. The results in Section 7 suggest that MIMSE is a credible and feasible candidate for estimating dynamic discrete choice models with exogenous covariates.
ACKNOWLEDGMENTS The authors thank two referees, the editor, Manuel Arellano, Bo Honor´e, Thierry Magnac, Enrique Sentana and participants in the CAM workshop on ‘Limited dependent variable models’ in July 2004, the 12th Conference on Panel Data in Copenhagen, a CEMFI seminar, and a seminar at Universidad de Alicante for helpful comments and discussion. This work was supported by the EU AGE RTN, HPRN-CT-2002-00235 and by the Danish National Research Foundation through its grant to CAM.
REFERENCES

Albert, P. and M. Waclawiw (1998). A two state Markov chain for heterogeneous transitional data: a quasi-likelihood approach. Statistics in Medicine 17, 1481–93.
Anderson, T. W. and L. A. Goodman (1957). Statistical inference about Markov chains. Annals of Mathematical Statistics 28, 89–110.
Arellano, M. (2003a). Discrete choice with panel data. Investigaciones Económicas XXVII, 423–58.
Arellano, M. (2003b). Panel Data Econometrics. Oxford: Oxford University Press.
Arellano, M. and B. Honoré (2001). Panel data models: some recent developments. In J. J. Heckman and E. Leamer (Eds.), Handbook of Econometrics, Volume 5, 3229–96. Amsterdam: Elsevier Science.
Berkson, J. (1980). Minimum chi-squared, not maximum likelihood! Annals of Statistics 8, 457–87.
Billard, L. and M. Meshkani (1995). Estimation of a stationary Markov chain. Journal of the American Statistical Association 90, 307–15.
Browning, M. and J. M. Carro (2006). Heterogeneity and microeconometrics modelling. In R. Blundell, W. K. Newey and T. Persson (Eds.), Advances in Economics and Econometrics, Theory and Applications: Ninth World Congress of the Econometric Society, Volume 3, 45–74. New York: Cambridge University Press.
Carro, J. M. (2007). Estimating dynamic panel data discrete choice models with fixed effects. Journal of Econometrics 140, 503–28.
Cole, B. F., M. T. Lee, G. A. Whitmore and A. M. Zaslavsky (1995). An empirical Bayes model for Markov-dependent binary sequences with randomly missing observations. Journal of the American Statistical Association 90, 1364–72.
Diggle, P., P. Heagerty, K.-Y. Liang and S. Zeger (2002). Analysis of Longitudinal Data (2nd ed.). Oxford: Oxford University Press.
Hahn, J. and W. Newey (2004). Jackknife and analytical bias reduction for nonlinear panel data models. Econometrica 72, 1295–319.
Honoré, B. and E. Kyriazidou (2000). Panel data discrete choice models with lagged dependent variables. Econometrica 68, 839–74.
Honoré, B. and E. Tamer (2006). Bounds on parameters in dynamic discrete choice models. Econometrica 74, 611–29.
MacKinnon, J. G. and A. A. Smith (1998). Approximate bias correction in econometrics. Journal of Econometrics 85, 205–30.
Pesaran, H. and T. Yamagata (2008). Testing slope homogeneity in large panels. Journal of Econometrics 142, 50–93.
Wooldridge, J. (2002). Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.
Zellner, A. (1971). An Introduction to Bayesian Inference in Econometrics. New York: John Wiley.
APPENDIX

A.1. Proof of Proposition 3.1

Take any estimator $\{\hat{G}, \hat{H}\} = \{(G_a, G_b, \ldots, G_h), (H_a, H_b, \ldots, H_h)\}$. If $\hat{G}$ is unbiased then we have
$$
G = E(\hat{G}) = \sum_{j=a}^{h} p_j G_j
$$
$$
= (G_h - G_g)H^3 + (G_c + G_e - G_d - G_f)GH^2 + (G_g - G_e)H^2 + (G_b - G_a)G^2 H
+ (2G_a + G_d + G_f - G_b - 2G_c - G_e)GH + (G_e - G_a)H + (G_a - G_b)G^2
+ (G_b + G_c - 2G_a)G + G_a. \tag{A.1}
$$
Equating the last four terms on the right-hand side with the left-hand side, in order to obtain the values of the coefficients that make the right- and left-hand-side polynomials in G and H equal, that is, $G_a = 0$, $G_b + G_c - 2G_a = 1$, $G_a - G_b = 0$, $G_e - G_a = 0$, gives
$$
G_a = G_b = G_e = 0, \quad G_c = 1. \tag{A.2}
$$
Substituting into the first three terms and equating gives
$$
G_e = G_g = G_h, \quad 1 + G_e = G_d + G_f. \tag{A.3}
$$
Substituting this into the term for GH gives the contradiction
$$
0 = 2G_a + G_d + G_f - G_b - 2G_c - G_e = 0 + 1 + G_e - 0 - 2 - G_e = -1. \tag{A.4}
$$
If $\hat{H}$ is unbiased, then
$$
H = E(\hat{H}) = \sum_{j=a}^{h} p_j H_j
$$
$$
= (H_h - H_g)H^3 + (H_c + H_e - H_d - H_f)GH^2 + (H_g - H_e)H^2 + (H_b - H_a)G^2 H
+ (2H_a + H_d + H_f - H_b - 2H_c - H_e)GH + (H_e - H_a)H + (H_a - H_b)G^2
+ (H_b + H_c - 2H_a)G + H_a \tag{A.5}
$$
and calculations similar to those for $\hat{G}$ also lead to a contradiction.
A.2. The recursive bias corrected estimator

If we iterate on (3.17) and (3.18) and the process converges (so that $|G_j^{(k+1)} - G_j^{(k)}| \to 0$ as $k \to \infty$, and similarly for H), then we have the limit estimators when $G_j^{(k+1)} = G_j^{(k)}$:
$$
\hat{G}_j = E^{(\infty)}(\hat{G}) = p_a\big(G_j^{(\infty)}, H_j^{(\infty)}\big) G_a + \cdots + p_f\big(G_j^{(\infty)}, H_j^{(\infty)}\big) G_f, \tag{A.6}
$$
$$
\hat{H}_j = E^{(\infty)}(\hat{H}) = p_a\big(G_j^{(\infty)}, H_j^{(\infty)}\big) H_a + \cdots + p_f\big(G_j^{(\infty)}, H_j^{(\infty)}\big) H_f. \tag{A.7}
$$
This gives two equations in two unknowns for each case $a, \ldots, f$. The first issue in this iteration is whether there is a solution to these two equations, i.e. whether there is a fixed point of the iterative process. Ideally we would like to have a unique solution for each case that satisfies $G_j^{(\infty)} \in [0, 1]$ and $H_j^{(\infty)} \in [0, 1]$. We can do this for cases a, b, c, d, f but not for case e. To see this, note that the equations for case e are:
$$
0 = 0.5\, p_b\big(G_e^{(\infty)}, H_e^{(\infty)}\big) + p_c\big(G_e^{(\infty)}, H_e^{(\infty)}\big) + p_d\big(G_e^{(\infty)}, H_e^{(\infty)}\big) + p_f\big(G_e^{(\infty)}, H_e^{(\infty)}\big) \tag{A.8}
$$
$$
= \frac{1}{2}\, \frac{G_e^{(\infty)}\big(3 + 2H_e^{(\infty)} - G_e^{(\infty)}\big)}{1 + H_e^{(\infty)}}, \tag{A.9}
$$
$$
0.5 = 0.5\Big(p_d\big(G_e^{(\infty)}, H_e^{(\infty)}\big) + p_e\big(G_e^{(\infty)}, H_e^{(\infty)}\big) + p_f\big(G_e^{(\infty)}, H_e^{(\infty)}\big)\Big) = \frac{1}{2}\, \frac{H_e^{(\infty)}\big(1 + G_e^{(\infty)}\big)}{1 + H_e^{(\infty)}}. \tag{A.10}
$$
This set of equations has no solution that satisfies the constraints. To see this, note that if $G_e^{(\infty)} = 0$ then the second equation implies a contradiction. Thus we must have $3 + 2H_e^{(\infty)} - G_e^{(\infty)} = 0$. Substituting this into the second equation gives
$$
2\big(H_e^{(\infty)}\big)^2 + 3H_e^{(\infty)} - 1 = 0, \tag{A.11}
$$
which has no root compatible with the constraints: its only positive root implies $G_e^{(\infty)} = 3 + 2H_e^{(\infty)} > 1$.
Table A.1. Outcomes conditioning on point identification.

                                      MLE             NBC           Limit estimator
Case   Prob                         Ĝ      Ĥ     G^(1)   H^(1)     G^(∞)    H^(∞)
a      (1−H)(1−G)(1−G)/(1−H²)       0      0       0       0         0        0
b      (1−H)(1−G)G/(1−H²)          1/2     0      3/8      0       0.382      0
c      (1−H)G(1−H)/(1−H²)           1      0       1       0         1        0
d      (1−H)GH/(1−H²)               1     1/2      1      2/3        1        1
e      H(1−H)(1−G)/(1−H²)           0     1/2      0      5/6       (0)      (1)
f      H(1−H)G/(1−H²)               1     1/2      1      2/3        1        1
A second issue is where the iterated estimators converge to. For case e the recursion goes outside the interval [0, 1] and never reaches a fixed point for H. Nonetheless, the other five cases converge to their fixed points; these are given in Table 4. We can also take the estimates for case e that minimize the sum of the squared differences between the expected values of the ML estimators and the values of the latter:
$$
\left(\frac{1}{2}\, \frac{G_e^{(\infty)}\big(3 + 2H_e^{(\infty)} - G_e^{(\infty)}\big)}{1 + H_e^{(\infty)}}\right)^2 + \left(\frac{1}{2}\, \frac{H_e^{(\infty)}\big(1 + G_e^{(\infty)}\big)}{1 + H_e^{(\infty)}} - 0.5\right)^2. \tag{A.12}
$$
The minimizing values are $G_e^{(\infty)} = 0$ and $H_e^{(\infty)} = 1$ (shown in parentheses in Table A.1 to indicate that they are biased). In fact, these values are a solution of the equation for G but not for H. The biases for the limit estimator are given by:
$$
\phi\big(G^{(\infty)}\big) = E\big(G^{(\infty)}\big) - G = 0.382\, \frac{(1 - G)G}{1 + H} \geq 0, \tag{A.13}
$$
$$
\varphi\big(H^{(\infty)}\big) = E\big(H^{(\infty)}\big) - H = \frac{(G - H)H}{1 + H} \gtrless 0. \tag{A.14}
$$
For H we can now have a positive bias if $M = H - G < 0$, and the estimator is unbiased if H = G, i.e. if M = 0. It can be seen that the bias for G is smaller than for the ML estimator but larger than for the one-step estimator. The bias for H is smaller than for the MLE or the NBC for some values of (G, H), but not for all.
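The bias formulas (A.13) and (A.14) can be verified numerically by averaging the limit-estimator values in Table A.1 over the case probabilities (this check is ours; the table's rounded constant 0.382 is used on both sides of the comparison):

```python
def limit_biases(G, H):
    """E(G_inf) - G and E(H_inf) - H, averaging Table A.1 over cases a..f."""
    d = 1 - H**2
    p = [(1-H)*(1-G)*(1-G)/d, (1-H)*(1-G)*G/d, (1-H)*G*(1-H)/d,
         (1-H)*G*H/d, H*(1-H)*(1-G)/d, H*(1-H)*G/d]   # Pr(case | identification)
    G_inf = [0.0, 0.382, 1.0, 1.0, 0.0, 1.0]          # limit estimators of G
    H_inf = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]            # limit estimators of H
    bias_G = sum(pi * gi for pi, gi in zip(p, G_inf)) - G
    bias_H = sum(pi * hi for pi, hi in zip(p, H_inf)) - H
    return bias_G, bias_H

G, H = 0.3, 0.6
bG, bH = limit_biases(G, H)
print(abs(bG - 0.382 * (1 - G) * G / (1 + H)) < 1e-12,   # matches (A.13)
      abs(bH - (G - H) * H / (1 + H)) < 1e-12)           # matches (A.14)
```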
The Econometrics Journal (2010), volume 13, pp. 40–62. doi: 10.1111/j.1368-423X.2009.00300.x

Smoothness adaptive average derivative estimation

Marcia M. A. Schafgans† and Victoria Zinde-Walsh‡,§

†Department of Economics, London School of Economics, Houghton Street, London WC2A 2AE, UK. E-mail: [email protected]
‡Department of Economics, McGill University, 805 rue Sherbrooke ouest, Montréal, QC H3A 2T7, Canada. E-mail: [email protected]
§CIREQ, Université de Montréal, C.P. 6128, Succursale Centre-ville, Montréal (Québec), H3C 3J7, Canada

First version received: July 2007; final version accepted: September 2009

Summary: Many important models utilize estimation of average derivatives of the conditional mean function. Asymptotic results in the literature on density weighted average derivative estimators (ADE) focus on convergence at parametric rates; this requires making stringent assumptions on the smoothness of the underlying density. Here we derive asymptotic properties under relaxed smoothness assumptions. We adapt to the unknown smoothness in the model by consistently estimating the optimal bandwidth rate and by using linear combinations of ADE estimators for different kernels and bandwidths. Linear combinations of estimators (i) can have smaller asymptotic mean squared error (AMSE) than an estimator with an optimal bandwidth and (ii) when based on an estimated optimal rate bandwidth can adapt to unknown smoothness and achieve rate optimality. Our combined estimator minimizes the trace of the estimated MSE of linear combinations. Monte Carlo results for the ADE confirm the good performance of the combined estimator.

Keywords: Density weighted average derivative estimator, Non-parametric estimation.

1. INTRODUCTION

Many important models, such as the index models widely used in limited dependent variable analysis, partial linear models and non-parametric demand studies, utilize estimation of (sometimes weighted) average derivatives of a conditional mean function. Härdle et al. (1991) and Blundell et al. (1998), amongst others, advocated the derivative-based approach in the analysis of consumer demand, where non-parametric estimation of Engel curves has become commonplace (e.g. Yatchew, 2003). Powell et al. (1989, hereafter referred to as PSS) popularized the use of density weighted average derivatives of the conditional mean in the semi-parametric estimation of index models by pointing out that the average derivatives in single index models identify the parameters 'up to scale'. A large literature is devoted to the asymptotic properties of non-parametric estimators of average derivatives and to their use in the estimation of index models and testing of coefficients.
Smoothness adaptive average derivative estimation
Asymptotic properties of average density weighted derivatives, hereafter referred to as ADEs, are discussed in PSS and Robinson (1989); Härdle and Stoker (1989) investigated the properties of the average derivatives themselves; Newey and Stoker (1993) addressed the choice of weighting function; Horowitz and Härdle (1996) extended the ADE approach to estimating the coefficients in the single index model in the presence of discrete covariates; Donkers and Schafgans (2008) extended the ADE approach to multiple index models; Chaudhuri et al. (1997) investigated average derivatives in quantile regression; Li et al. (2003) investigated local polynomial fitting for average derivatives; Banerjee (2007) provided a recent discussion of estimating average derivatives using a fast algorithm; and Cattaneo et al. (2008) investigated, for the ADE, a weakening of the lower bound on the bandwidth while avoiding the use of higher order kernels. Higher order expansions and the properties of bootstrap tests of the ADE are investigated in Nishiyama and Robinson (2000, 2005). To formulate the ADE under consideration in our paper, let $g(x) = E(y \mid x)$ with $y \in R$ and $x \in R^k$, and define
$$\delta_0 = E\big(f(x)\,g'(x)\big), \qquad (1.1)$$
with $g'(x)$ the derivative of the unknown conditional mean function and $f(x)$ the density of $x$. With $x \in R^k$, $g'(x)$ stands for the vector $(\partial g(x)/\partial x_1, \ldots, \partial g(x)/\partial x_k)^T$. Recognizing that $\delta_0 = -2E(f'(x)y)$ under certain regularity conditions, PSS introduced the estimator
$$\hat\delta_N(K, h) = -\frac{2}{N}\sum_{i=1}^{N} \hat f'_{(K,h)}(x_i)\, y_i, \qquad (1.2)$$
with
$$\hat f'_{(K,h)}(x_i) = \frac{1}{N-1}\sum_{j \ne i} \frac{1}{h^{k+1}}\, K'\!\left(\frac{x_i - x_j}{h}\right).$$
Here $K$ denotes a kernel smoothing function, $K'$ its derivative, and $h$ the smoothing parameter, which depends on the sample size $N$, with $h \to 0$ as $N \to \infty$. In all of the literature on ADE, asymptotic theory was provided for parametric rates of convergence. Even though the estimators are based on non-parametric kernel estimators, which depend on the kernel and bandwidth and converge at a non-parametric rate, averaging can produce a parametric convergence rate; the kernel and bandwidth then do not appear in the leading term of the mean squared error (MSE) expansion, which reduces the dependence on their selection. This parametric rate of convergence (and thus the results in this literature), however, relies on the assumption of a sufficiently high degree of smoothness of the underlying density of the regressors, $f(x)$. This assumption is not based on any a priori theoretical considerations. Various multimodal distributions are encountered in biomedical and statistical studies (see e.g. Izenman and Sommer, 1988); multimodal densities, even if they are sufficiently smooth, possess derivatives that are large enough to cause problems; see the discussion in Marron and Wand (1992) for examples of normal mixtures that exhibit features usually thought of as characteristic of non-smooth densities. Even when there is sufficient smoothness for parametric rates, the choice of bandwidth and kernel affects second-order terms in the MSE which are often not much smaller than the first-order terms (see e.g. Dalalyan et al., 2006). Our concern with the possible violation of the assumed high degree of density smoothness led us to extend the existing asymptotic results for ADE by relaxing the smoothness assumptions on
M. M. A. Schafgans and V. Zinde-Walsh
the density. We examine an expansion of the variance up to the first term that depends on the bandwidth. The leading term in the bias expansion is called 'asymptotic bias', and the terms in the expansion of the MSE that combine these leading terms of bias and variance we call 'asymptotic MSE' (AMSE). Insufficient smoothness will result in possible asymptotic bias and may easily lead to non-parametric rates (exact results are in Theorem 3.1). Since selection of the optimal kernel (order) and bandwidth (Powell and Stoker, 1996, and Theorem 3.1) presumes knowledge of the degree of density smoothness, uncertainty about that degree poses an additional concern. In principle, smoothness properties of the density $f(x)$ could differ for different components of the vector $x = (x_1, \ldots, x_k)^T$, which could lead to possibly different rates for the component bandwidths, $h[\ell]$, $\ell = 1, \ldots, k$ (e.g. Li and Racine, 2007). Even when all the rates are the same it may be advantageous to use different bandwidths in finite samples, and we regard $K\big(\frac{x_i - x_j}{h}\big)$ as $K\big(\frac{x_{i1} - x_{j1}}{h[1]}, \ldots, \frac{x_{ik} - x_{jk}}{h[k]}\big)$. Denote by $\mathbf h$ the diagonal matrix
$$\mathbf h = \operatorname{diag}(h[1], \ldots, h[k]), \qquad (1.3)$$
with inverse $\mathbf h^{-1}$, and $h^k$ the product of bandwidth components
$$h^k = \prod_{\ell=1}^{k} h[\ell]. \qquad (1.4)$$
With all bandwidths equal, $\mathbf h$ and $h^k$ can be read as the scalar $h$ and its $k$th power. With this notation the vector $\frac{\partial}{\partial x} K\big(\frac{x - x_j}{h}\big) = \mathbf h^{-1} K'\big(\frac{x - x_j}{h}\big)$ and the ADE $\hat\delta_N(K, \mathbf h)$ is given by
$$\hat\delta_N(K, \mathbf h) = -\frac{2}{N(N-1)h^k}\sum_{i=1}^{N}\sum_{j \ne i} \mathbf h^{-1} K'\!\left(\frac{x_i - x_j}{\mathbf h}\right) y_i.$$
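A minimal numerical sketch of this estimator (our own illustration, not code from the paper), using a product Gaussian kernel, whose gradient is $K'(u) = -uK(u)$, and a component-wise bandwidth vector:

```python
import numpy as np

def ade(x, y, h):
    """Density-weighted ADE of (1.2): the double sum
    -2/(N(N-1)h^k) * sum_i sum_{j != i} h^{-1} K'((x_i - x_j)/h) y_i,
    with a product Gaussian kernel K and component-wise bandwidths h."""
    n, k = x.shape
    h = np.broadcast_to(np.asarray(h, dtype=float), (k,))
    u = (x[:, None, :] - x[None, :, :]) / h              # (n, n, k) scaled pair differences
    kern = np.exp(-0.5 * (u ** 2).sum(axis=2)) / (2 * np.pi) ** (k / 2)
    np.fill_diagonal(kern, 0.0)                          # leave-one-out: drop j == i terms
    kprime = -u * kern[:, :, None]                       # gradient of the product kernel
    scale = -2.0 / (n * (n - 1) * h.prod())
    return scale * ((kprime / h) * y[:, None, None]).sum(axis=(0, 1))
```

In a single-index design with $\beta = (1, 1)^T$ the two components of the estimate should come out roughly equal, reflecting identification 'up to scale'.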
If the degree of smoothness is known, an optimal asymptotic rate of the bandwidth that balances the asymptotic variance and squared bias can be derived. Under some more restrictions the optimal bandwidth vector, $h_{opt}$, is obtained in Theorem 3.1. Given sufficient smoothness the optimal bandwidth rate balances second-order terms of the variance that depend on the kernel and bandwidth with the leading term in squared bias. The estimator with this bandwidth rate is referred to as second-order rate efficient. With insufficient smoothness there is first-order dependence of the variance on the kernel and bandwidth, and the rate optimal bandwidth ADE estimator is referred to as first-order rate efficient. With an unknown degree of smoothness, the optimal rate of bandwidth cannot be derived; however, it is consistently estimated here. For a given kernel this involves an estimator of the rate using a technique that can be traced back to Woodroofe (1970). 1 In Theorem 3.2, we show that there exists a (non-convex) linear combination of ADE estimators, $\sum_{s=1}^{S} a_s \hat\delta_N(K, \mathbf h_s)$, for a set of bandwidth vectors $\mathbf h_s$ (with components $h_s[\ell]$, $\ell = 1, \ldots, k$), $s = 1, \ldots, S$, such that the trace of the AMSE of this linear combination is strictly smaller than that of $\hat\delta_N(K, \mathbf h_{opt})$. This somewhat surprising result is the consequence of the elimination of leading terms in the biases in the linear combination, similar to the generalized jackknifing proposed in PSS (see appendix 2 in PSS) for $\sqrt{N}$-consistent ADE. We consider different kernels as well as different bandwidths in linear combinations, since the selection among kernels (higher
1 We are grateful to an anonymous referee who pointed out that our approach to rate estimation is reminiscent of Woodroofe (1970). Unfortunately, lemma 2.3 of that paper does not hold and the proofs about convergence of MSE that use it cannot be applied.
and lower order) is also hampered by an unknown degree of smoothness. This is an important generalization, in particular given that the order of the kernel has been shown to have a large impact on the finite sample performance for density estimation and, similarly, for kernels of the same order, different shapes (including asymmetric) affect performance; see Hansen (2005) and Kotlyarova and Zinde-Walsh (2007). Combining estimators was recently investigated in the statistical literature, where for the most part convex combinations are used as a means to achieve adaptiveness (Juditsky and Nemirovski, 2000; Yang, 2000). Kotlyarova and Zinde-Walsh (2006, hereafter KZW) propose non-convex combinations for estimators with possibly non-parametric rates. They develop the so-called combined estimator with weights that minimize the trace of its estimated AMSE. Our proposed estimation strategy is as follows. If the smoothness, and thus the optimal rate for the bandwidth, were known, we select several suitable kernels and specify a set of bandwidths for each that would ensure that the optimal linear combination would outperform any individual optimal estimator. When the smoothness is not assumed known, we use a consistent estimator of the rate of the optimal bandwidth and consider a corresponding set of bandwidths. Next, we obtain the optimal linear combination of the individual ADEs under consideration by minimizing the trace of the AMSE. For this minimization the (leading terms of) variances of estimators, covariances between the different ADEs and biases would have to be known. To obtain a feasible combined estimator (as in KZW) we use consistent estimators for biases and covariances. We thus consider an estimator that optimally combines the ADEs $\hat\delta_N(K_s, \mathbf h_s)$ for different kernel/bandwidth pairs $(K_s, \mathbf h_s)$, $s = 1, \ldots, S$. The combined estimator is given by
$$\hat\delta_{N,\mathrm{comb}} = \sum_{s=1}^{S} a_s^* \hat\delta_N(K_s, \mathbf h_s),$$
where $a_s^*$, $s = 1, \ldots, S$, are chosen so as to minimize the trace of the estimated MSE subject to $\sum_{s=1}^{S} a_s^* = 1$. We use a Monte Carlo experiment for the Tobit model, for a variety of distributions for the explanatory variables (Gaussian, a trimodal Gaussian mixture, and the 'double claw' and 'discrete comb' mixtures from Marron and Wand, 1992). There, we demonstrate that there is no clear guidance on the choice of a suitable kernel/bandwidth pair. Even though in these cases the smoothness assumptions hold, the highly multimodal nature of these mixture distributions leads to large partial derivatives that undermine the performance of the ADE. At the same time, the combined estimator provides reliable results in all cases. The paper is organized as follows. In Section 2, we provide the assumptions, where we relax the usual high smoothness assumptions common in the literature. In Section 3, we derive the asymptotic properties of the ADE under various assumptions about density smoothness and the joint asymptotics for ADE estimators based on different bandwidth/kernel pairs, examine the advantages of linear combinations and develop the combined estimator. Section 4 provides the Monte Carlo study results and Section 5 concludes.
2. ASSUMPTIONS The assumptions here keep some conditions common in the literature on ADE but relax the usual higher smoothness assumptions.
The first two assumptions are similar to PSS; they restrict $x$ to be random variables that are continuously distributed with no component of $x$ functionally determined by other components of $x$ ($y$ could be discrete, e.g. a binary variable), and impose the minimal smoothness assumption of continuous differentiability on $f$ and $g$.

ASSUMPTION 2.1. Let $z_i = (y_i, x_i^T)^T$, $i = 1, \ldots, N$ be a random sample drawn from a distribution that is absolutely continuous in $x$. The support $\Omega$ of the density of $x$, $f(x)$, is a convex (possibly unbounded) subset of $R^k$ with non-empty interior $\Omega^0$.

ASSUMPTION 2.2. The density function $f(x)$ is continuous over $R^k$, so that $f(x) = 0$ for all $x \in \partial\Omega$, where $\partial\Omega$ denotes the boundary of $\Omega$; $f$ is continuously differentiable in the components of $x$ for all $x \in \Omega^0$ and the conditional mean function $g(x)$ is continuously differentiable in the components of $x$ for all $x \in \bar\Omega$, where $\bar\Omega$ differs from $\Omega^0$ by a set of measure 0.

Additional requirements involving the conditional distribution of $y$ given $x$, as well as more detailed differentiability conditions, subsequently need to be added. The conditions are slightly amended from how they appear in the literature; in particular we use the weaker Hölder conditions instead of Lipschitz conditions; all the proofs can accommodate this weakened assumption.

ASSUMPTION 2.3. (a) $E(y^2 \mid x)$ is continuous in $x$. (b) The components of the random vector $g'(x)$ and matrix $f'(x)[y, x^T]$ have finite second moments; $(fg)'$ satisfies a Hölder condition with $0 < \alpha_0 \le 1$: $\|(fg)'(x + \Delta x) - (fg)'(x)\| \le \omega_{(fg)'}(x)\|\Delta x\|^{\alpha_0}$ and $E\big(\omega_{(fg)'}^2(x)[1 + |y| + \|x\|]\big) < \infty$.
The kernel $K$ satisfies a standard assumption. 2

2 In Schafgans and Zinde-Walsh (2007) we discuss the possibility of non-symmetric kernels and derive results for that case.

ASSUMPTION 2.4. (a) The function $K(u)$ is a symmetric continuously differentiable function in $R^k$ with convex support. (b) The kernel function $K(u)$ has order $v(K) > 1$:
$$\int K(u)\,du = 1; \quad \int u_1^{i_1}\cdots u_k^{i_k} K(u)\,du = 0, \; i_1 + \cdots + i_k < v(K); \quad \int u_1^{i_1}\cdots u_k^{i_k} K(u)\,du \ne 0, \; i_1 + \cdots + i_k = v(K),$$
where $(i_1, \ldots, i_k)$ is an index set. (c) The kernel smoothing function $K(u)$ is differentiable up to the order $v(K)$.

Density smoothness plays a role in controlling the rate for the bias of the PSS estimator; the bias is
$$\operatorname{Bias}\,\hat\delta_N(K, \mathbf h) = E\big(\hat\delta_N(K, \mathbf h) - \delta_0\big) = -2E\left[y \int \big(f'(x - \mathbf h u) - f'(x)\big) K(u)\,du\right]. \qquad (2.1)$$
We formalize the degree of density smoothness in terms of the Hölder space of functions. More precisely, with the ADE involving the derivative (vector) of the density, we specify the
smoothness for each component of the derivative vector, $f'_\ell(x)$ with $\ell = 1, \ldots, k$, separately, thereby enabling some components to be smoother than others. The Hölder space of functions, denoted as $C_{m_\ell - 1 + \alpha_\ell}(\Omega)$, consists of $m_\ell - 1$ times continuously differentiable functions on $\Omega$ with all $(m_\ell - 1)$th partial derivatives satisfying Hölder's condition of order $\alpha_\ell$. We assume that $f'_\ell(x) \in C_{m_\ell - 1 + \alpha_\ell}(\Omega)$, implying that all its $(m_\ell - 1)$th partial derivatives, denoted as $f'^{(m_\ell - 1)}_\ell(\cdot)$, satisfy
$$\big| f'^{(m_\ell - 1)}_\ell(x + \Delta x) - f'^{(m_\ell - 1)}_\ell(x) \big| \le \omega_\ell(x)\, \|\Delta x\|^{\alpha_\ell}.$$

ASSUMPTION 2.5. The derivative of the density satisfies $f'_\ell \in C_{m_\ell - 1 + \alpha_\ell}(\Omega)$ with $m_\ell \ge 1$, $0 < \alpha_\ell \le 1$ and $E(\omega_\ell^2(x)[1 + |y|^2 + \|x\|]) < \infty$, $\ell = 1, \ldots, k$.

Note that in the case $m_\ell = 1$ there may be no more than Hölder continuity of the partial derivative, without further differentiability, significantly relaxing the usual assumptions in the literature. We denote $m_\ell - 1 + \alpha_\ell$ by $v_\ell$ and define the vector $\bar v = (\bar v_1, \ldots, \bar v_k)$, with $\bar v_\ell = \min(v_\ell, v(K))$. Provided $\bar v_\ell = v(K) \le v_\ell$, the $\ell$th component of the bias of the density derivative, $E[(\hat f'_{(K,h)}(x_i) - f'(x_i))_\ell] = E\big(\int K(u)(f'(x_i - u\mathbf h) - f'(x_i))_\ell\,du\big)$, is as usual $O(h^{v(K)})$ (by applying the $v(K)$th order Taylor expansion of $f'(x_i - u\mathbf h)$ around $f'(x_i)$). If the differentiability conditions typically assumed do not hold, then the bias does not vanish sufficiently fast even for bandwidth vectors such that $N^{\frac{1}{2v(K)}}\mathbf h = o(1)$. All we can state is the upper bound on the bias (component-wise):
$$\big|E\big(\hat\delta_N(K, \mathbf h) - \delta_0\big)_\ell\big| \le \omega_\delta\, h[\ell]^{\bar v_\ell}.$$
We make a somewhat stronger assumption on the bias that is similar to Woodroofe (1970) for density estimation. To this end, we introduce the diagonal matrix $\mathbf h^{\bar v} = \operatorname{diag}(h[1]^{\bar v_1}, h[2]^{\bar v_2}, \ldots, h[k]^{\bar v_k})$, whose inverse is denoted by $\mathbf h^{-\bar v}$.

ASSUMPTION 2.6. (a) As $N \to \infty$, $\mathbf h \to 0$,
$$\mathbf h^{-\bar v}\operatorname{Bias}\big(\hat\delta_N(K, \mathbf h)\big) \to B(K), \qquad (2.2)$$
where the vector $B(K) = (B_1(K), \ldots, B_k(K))^T$ is such that $0 < |B_\ell(K)| < \infty$, $\ell = 1, \ldots, k$; (b) $\bar v_\ell = \bar v = \text{const}$.

This assumption significantly relaxes the usual smoothness assumptions. Part (b) additionally assumes the same smoothness for the different derivatives. When all the bandwidths are the same, $h$, and $\bar v_\ell$ is constant for all components, the matrix $\mathbf h^{-\bar v}$ in Assumption 2.6 can be read as a scalar, $h^{-\bar v}$.
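The order conditions of Assumption 2.4(b) can be checked numerically for a candidate kernel; a sketch (our own, not from the paper) for the quartic second-order kernel and the fourth-order kernel used later in Section 4:

```python
import numpy as np

def kernel_moments(kernel, up_to, grid=400001):
    """Trapezoid-rule moments int u^j K(u) du over the support [-1, 1]."""
    u = np.linspace(-1.0, 1.0, grid)
    du = u[1] - u[0]
    out = []
    for j in range(up_to + 1):
        f = u ** j * kernel(u)
        out.append(float(np.sum(0.5 * (f[1:] + f[:-1])) * du))
    return out

# quartic (biweight) kernel: order v(K) = 2
quartic = lambda u: (15 / 16) * (1 - u ** 2) ** 2 * (np.abs(u) <= 1)
# fourth-order kernel from footnote 6: order v(K) = 4
fourth = lambda u: (105 / 64) * (-3 * u ** 6 + 7 * u ** 4 - 5 * u ** 2 + 1) * (np.abs(u) <= 1)
```

Both kernels integrate to one; the quartic has a non-zero second moment, while the fourth-order kernel's first non-vanishing moment beyond the zeroth is the fourth.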
3. MAIN RESULTS We extend the existing asymptotic results for ADE by relaxing the smoothness assumptions on the density and obtain optimal bandwidth rates. We show that linear combinations of ADE can
have better asymptotic properties than the optimal ADE and propose a feasible combination (the combined estimator) that minimizes the trace of the estimated MSE.

3.1. Asymptotic results for ADEs based on a specific kernel and bandwidth vector

We consider the asymptotic results for the ADE, $\hat\delta_N(K, \mathbf h)$ given in (1.2), under Assumptions 2.1–2.6(a) of the previous section for all possible degrees of smoothness and kernel orders (for $\bar v_\ell = \min(v_\ell, v(K))$). Under minimal smoothness assumptions, Lemma 3.1 presents an expression for its variance.

LEMMA 3.1. Under Assumptions 2.1–2.5, if $\mathbf h \to 0$ and $N^2 h^k \mathbf h^2 \to \infty$ the variance of $\hat\delta_N(K, \mathbf h)$ is given by
$$\operatorname{Var}\big(\hat\delta_N(K, \mathbf h)\big) = N^{-2} h^{-k}\, \mathbf h^{-1}\big(\Sigma_1(K) + o(1)\big)\mathbf h^{-1} + \big(\Sigma_2 + o(1)\big)N^{-1},$$
with
$$\Sigma_1(K) = 4E\big[y_i^2 f(x_i) - (gf)(x_i)y_i\big]\,\mu_2(K), \qquad \mu_2(K) = \int K'(u)K'(u)^T\,du,$$
$$\Sigma_2 = 4E\big\{[(g'f)(x_i) - (y_i - g(x_i))f'(x_i)][(g'f)(x_i) - (y_i - g(x_i))f'(x_i)]^T\big\} - 4\delta_0\delta_0^T.$$

We see that unless the $O(N^{-1})$ term dominates the variance, there is first-order dependence on the kernel. With the bias of our ADE given by Assumption 2.6(a) it then follows that the MSE satisfies
$$\operatorname{MSE}\big(\hat\delta_N(K, \mathbf h)\big) = N^{-2} h^{-k}\, \mathbf h^{-1}\big(\Sigma_1(K) + o(1)\big)\mathbf h^{-1} + \big(\Sigma_2 + o(1)\big)N^{-1} + \mathbf h^{\bar v}\big(B(K)B^T(K) + o(1)\big)\mathbf h^{\bar v}. \qquad (3.1)$$
The following Theorem 3.1 summarizes all the possible convergence rates and limit features of $\hat\delta_N(K, \mathbf h)$ for different choices of bandwidth and kernel, and presents the optimal bandwidth rate based on the standard bias-variance trade-off.

THEOREM 3.1. Under Assumptions 2.1–2.6(a). (a) If the density is sufficiently smooth and the order of the kernel is sufficiently high: all $\bar v_\ell > \frac{k+2}{2}$, $\ell = 1, \ldots, k$, the rate $O(N^{-1})$ for the MSE and the parametric rate $\sqrt{N}$ for the ADE can be achieved for a range of bandwidth vectors $\{\mathbf h: [N h^k \mathbf h^2]^{-1} = O(1);\; N \mathbf h^{2\bar v} = O(1)\}$. Outside this range, when $N h^k \mathbf h^2 \to 0$ the asymptotic variance depends on the kernel; if $N \mathbf h^{2\bar v} \to \infty$ the asymptotic bias dominates. (b) If the density is not smooth enough or the order of the kernel is too low: some $\bar v_\ell < \frac{k+2}{2}$, the parametric rate cannot be obtained. The asymptotic variance depends on the kernel. Depending on $\bar v$ and the bandwidth/kernel pair $(K, \mathbf h)$, a diagonal matrix of rates $r_N$ with diagonal elements $[r_N]_\ell \to \infty$, $\ell = 1, \ldots, k$, such that $r_N(\hat\delta_N(K, \mathbf h) - \delta_0)$ has finite first and second moments, obtains. If $r_N \mathbf h^{\bar v} \to 0$, the ADE has no asymptotic bias at the rate of convergence $r_N$. (c) The optimal bandwidth vector can be obtained by minimizing the trace of $\operatorname{AMSE}(\hat\delta_N(K, \mathbf h))$; under Assumption 2.6(b) this provides the optimal rate
$$h_{opt} = O\big(N^{-2/(2\bar v + k + 2)}\big). \qquad (3.2)$$
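The rate in (3.2) and the smoothness threshold of part (a) are easy to tabulate; a small illustrative sketch (our own, not from the paper):

```python
def opt_bandwidth_exponent(v_bar, k):
    """Optimal bandwidth rate exponent of Theorem 3.1(c):
    h_opt = O(N ** (-2 / (2*v_bar + k + 2)))."""
    return -2.0 / (2 * v_bar + k + 2)

def parametric_rate_attainable(v_bar, k):
    """Theorem 3.1(a): the root-N rate requires v_bar > (k + 2) / 2."""
    return v_bar > (k + 2) / 2

# with k = 2 regressors and smoothness v_bar = 2 the exponent is
# -2/8 = -0.25 and the parametric rate fails, since (k + 2)/2 = 2
# is not exceeded
```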
The optimal constants for each $\ell$ can be obtained from this minimization (see the Appendix for details).

The theorem provides a full description of the asymptotic behaviour of the moments of the estimator, allowing for different bandwidth rates for different components. For equal (rate) bandwidth under Assumption 2.6(b), the PSS results with the parametric rate hold for sufficiently smooth $f(x)$ (permitting $\bar v > (k+2)/2$) with $h \to 0$ and $Nh^{k+2} \to \infty$. In the absence of the high degree of differentiability, the first-order asymptotic variance (as the asymptotic bias) does depend on the weighting used in the local averaging (it involves $\Sigma_1(K)$), yielding a non-parametric rate. Selection of the optimal bandwidth and kernel (order) that minimize the mean squared error depends on our knowledge of the degree of smoothness of the density (see also Powell and Stoker, 1996).

3.2. Asymptotic results for linear combinations of ADEs

To reduce the dependence of the ADE on the optimal bandwidth and kernel (order) selection, we consider a linear combination of different ADE estimators, $\hat\delta_{N,c} = \sum_{s=1}^{S} a_s \hat\delta_N(K_s, \mathbf h_s)$ with $\sum_{s=1}^{S} a_s = 1$, for a range of different kernel/bandwidth pairs. In order to obtain the AMSE of $\hat\delta_{N,c}$, we need to find the leading terms of the first and second moments for the stacked vector $r_{N_s}(\hat\delta_N(K_s, \mathbf h_s) - \delta_0)$, $s = 1, \ldots, S$, with $r_{N_s}$ the diagonal matrix of rates associated with kernel/bandwidth pair $(K_s, \mathbf h_s)$. First moments are given by Assumption 2.6; the limit covariances between the estimators are derived in the following Lemma 3.2. The fact that some estimators have zero covariances in the limit indicates that they provide complementary information.

LEMMA 3.2. Under Assumptions 2.1–2.5, if $\mathbf h_s \to 0$ and $N^2 h_s^k \mathbf h_s^2 \to \infty$ the limit covariance matrix for the vector with components $r_{N_s}(\hat\delta_N(K_s, \mathbf h_s) - \delta_0)$, $s = 1, \ldots, S$, has $k \times k$ blocks $r_{N_{s_1}} \Sigma_{s_1 s_2}\, r_{N_{s_2}}$, $s_1, s_2 = 1, \ldots, S$, with
$$\Sigma_{s_1 s_2} = N^{-2} h_{s_2}^{-k}\, \mathbf h_{s_1}^{-1}\big(\Sigma_1(K_{s_1}, K_{s_2}, \mathbf h_{s_1}, \mathbf h_{s_2}) + o(1)\big)\mathbf h_{s_2}^{-1} + \big(\Sigma_2 + o(1)\big)N^{-1},$$
where $\Sigma_1(K_{s_1}, K_{s_2}, \mathbf h_{s_1}, \mathbf h_{s_2})$ is defined in Lemma A.1 in the Appendix. Covariance matrices between estimators converging at different rates go to zero.

With $\tilde\Sigma_{s_1 s_2} = N^{-2} h_{s_2}^{-k}\, \mathbf h_{s_1}^{-1}\Sigma_1(K_{s_1}, K_{s_2}, \mathbf h_{s_1}, \mathbf h_{s_2})\,\mathbf h_{s_2}^{-1}$, the part of the asymptotic covariance between $\hat\delta_N(K_{s_1}, \mathbf h_{s_1})$ and $\hat\delta_N(K_{s_2}, \mathbf h_{s_2})$ that depends on the bandwidth, we denote their asymptotic covariance as
$$\operatorname{acov}\big(\hat\delta_N(K_{s_1}, \mathbf h_{s_1}), \hat\delta_N(K_{s_2}, \mathbf h_{s_2})\big) = \tilde\Sigma_{s_1 s_2} + N^{-1}\Sigma_2.$$
The trace of the AMSE of $\hat\delta_{N,c}$ can then be written as
$$\operatorname{trAMSE}(\hat\delta_{N,c}) = \sum_{s_1, s_2 = 1}^{S} a_{s_1} a_{s_2}\big[\tilde B_{s_1}^T \tilde B_{s_2} + \operatorname{tr}\big(\tilde\Sigma_{s_1 s_2} + N^{-1}\Sigma_2\big)\big] = a^T D a + N^{-1}\operatorname{tr}\Sigma_2, \qquad (3.3)$$
where $\{D\}_{s_1 s_2} = \tilde B_{s_1}^T \tilde B_{s_2} + \operatorname{tr}\tilde\Sigma_{s_1 s_2}$ with $\tilde B_s = \mathbf h_s^{\bar v} B_s$, $s = s_1, s_2$. We note that the $O(N^{-1})$ part in $\operatorname{trAMSE}(\hat\delta_{N,c})$ does not depend on the weights in the linear combination. In Theorem 3.2, we consider for a given kernel linear combinations of ADEs with
different bandwidths. It shows that with appropriately chosen bandwidths it is possible to obtain an estimator that is superior to the individual optimal estimator.

THEOREM 3.2. Under Assumptions 2.1–2.6(a), for any kernel $K$ and given an optimal bandwidth $\mathbf h_{opt}$ there exists a linear combination $\sum_{s=1}^{S} a_s \hat\delta_N(K, \mathbf h_s)$, $\sum_{s=1}^{S} a_s = 1$, for a set of bandwidth vectors $\mathbf h_1, \ldots, \mathbf h_S$ such that $h_s[\ell] = c_s\, h_{opt}[\ell]$ with constants $c_s > 1$, $\ell = 1, \ldots, k$, that provides
$$\operatorname{trAMSE}\Big(\sum_{s=1}^{S} a_s \hat\delta_N(K, \mathbf h_s)\Big) < \operatorname{trAMSE}\big(\hat\delta_N(K, \mathbf h_{opt})\big). \qquad (3.4)$$
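For a known matrix $D$ from (3.3), minimizing $a^T D a$ subject to $\sum_s a_s = 1$ has a closed form via the Lagrange first-order conditions; a sketch (our own, assuming $D$ symmetric and nonsingular, which the paper does not state in this form):

```python
import numpy as np

def combination_weights(D):
    """Minimize a' D a subject to 1'a = 1; the first-order conditions
    give a* = D^{-1} 1 / (1' D^{-1} 1). Weights may be negative, so the
    resulting combination is in general non-convex."""
    ones = np.ones(D.shape[0])
    sol = np.linalg.solve(D, ones)
    return sol / (ones @ sol)
```

For example, D = diag(1, 2, 4) yields weights (4/7, 2/7, 1/7) and an attained value $a^T D a = 4/7$, below the best single-estimator value of 1.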
If $\bar v$ (and thus the optimal rate) and an upper bound on the constant in the optimal bandwidth were known, it would be straightforward to get weights that satisfy (3.4) 3 even without knowing the optimal bandwidth itself. Under the conditions of Theorem 3.2, $\operatorname{trAMSE}(\sum_{s=1}^{S} a_s \hat\delta_N(K, \mathbf h_s))$ could be minimized to obtain an optimal combination. Weights could be restricted to a compact set (e.g. $\|a\| \le A < \infty$) that would include weights that result in (3.4). Including different kernels in the linear combination would ensure that the linear combination performs better than optimal regardless of which of the chosen kernels dominates. When $\bar v$ is low or not known, low-order kernels should be used. With moderate sample size we use two low-order kernels. For large samples a variety of kernels, including asymmetric kernels, could be beneficial, as the order and shapes affect performance; see Hansen (2005) and Kotlyarova and Zinde-Walsh (2007). Since minimizing $\operatorname{trAMSE}(\hat\delta_{N,c})$ means in effect minimizing $a^T D a$ of (3.3), which has exactly the same structure as in KZW, their theorem 3.2 applies to show that the optimal weights provide the best convergence rate for $a^T D a$ available for any included bandwidth.

3.3. The combined estimator

Next we consider replacing unknown quantities by estimates and define (as in KZW) the combined estimator as
$$\hat\delta_{N,\mathrm{comb}} = \sum_{s=1}^{S} a_s^* \hat\delta_N(K_s, \mathbf h_s) \quad \text{with} \quad \sum_{s=1}^{S} a_s^* = 1, \qquad (3.5)$$
where the weights $a_s^*$, $s = 1, \ldots, S$, are chosen so as to minimize the trace of the estimated AMSE. Suppose that covariances are estimated so that $\widehat{\operatorname{cov}}\big(\hat\delta_N(K_{s_1}, \mathbf h_{s_1}), \hat\delta_N(K_{s_2}, \mathbf h_{s_2})\big) - \operatorname{acov}\big(\hat\delta_N(K_{s_1}, \mathbf h_{s_1}), \hat\delta_N(K_{s_2}, \mathbf h_{s_2})\big) = o_p\big(N^{-2} h_{s_2}^{-k}\, \mathbf h_{s_1}^{-1} \mathbf h_{s_2}^{-1}\big)$; this will result from consistent estimation of $\Sigma_2$ and of the terms depending on the bandwidth, $\Sigma_1(K_{s_1}, K_{s_2}, \mathbf h_{s_1}, \mathbf h_{s_2})$ (e.g. using the plug-in approach); suppose the estimated biases are such that $\hat{\tilde B}_s = \tilde B_s + o(\mathbf h_s^{\bar v})$. Then an argument as in theorem 3 of KZW implies that the weights that minimize the estimated trMSE will similarly lead to the best available rate for the bandwidth-dependent part of trAMSE, $a^T D a$. Here we further improve the combined estimator by appropriately choosing the bandwidths for the estimators in the combination. As noted in the previous section, had we known the smoothness properties and thus $\bar v$, we could use optimal rate bandwidths. Not knowing $\bar v$ requires
3 Implementing a sequence of slowly diverging constants could get around having strict bounds on the optimal constant.
us to propose a strategy that adapts to the unknown smoothness in the model. Suppose for a kernel $K$ we obtain an estimator $\hat{\bar v}$ of $\bar v$ such that $\hat{\bar v} - \bar v = o_p((\ln N)^{-1})$; then the bandwidth vector $\hat h_{opt} = cN^{-2/(2\hat{\bar v}+k+2)}$ satisfies $\hat h_{opt} - cN^{-2/(2\bar v+k+2)} = o_p(N^{-2/(2\bar v+k+2)})$. It follows that, with $\hat h_{opt}$, the bandwidth-dependent part of the AMSE achieves the same best rate $O(N^{-4\bar v/(2\bar v+k+2)})$ as for $cN^{-2/(2\bar v+k+2)}$. Following Theorem 3.2, we include for each kernel the estimated rate optimal bandwidth $\hat h_{opt}$ and $k$ larger bandwidths; we also include a marginally smaller bandwidth, $\hat h_{opt}(\ln N)^{-0.1}$, and $\hat h_{gcv}$, an automatic bandwidth selector for non-parametric regression.

The remainder of this section is devoted to deriving the estimator that satisfies $\hat{\bar v} - \bar v = o_p((\ln N)^{-1})$ and proposing consistent estimators for the components of the AMSE. The leading terms of variances and covariances, $\Sigma_1$ and $\Sigma_2$, can be consistently estimated with the usual plug-in approach (i.e. by replacing the densities and derivatives by consistent non-parametric estimators). One can also use a bootstrap (see Härdle and Bowman, 1988), as we do in our simulation. 4 It is obtained as
$$\hat\Sigma_{s_1, s_2} = \frac{1}{B}\sum_{b=1}^{B}\big(\hat\delta_{b,N}(K_{s_1}, \mathbf h_{s_1}) - \hat\delta_N(K_{s_1}, \mathbf h_{s_1})\big)\big(\hat\delta_{b,N}(K_{s_2}, \mathbf h_{s_2}) - \hat\delta_N(K_{s_2}, \mathbf h_{s_2})\big)^T, \qquad (3.6)$$
where for each of the $B$ bootstrapped samples estimates $\hat\delta_{b,N}(K_s, \mathbf h_s)$ are obtained for $s = 1, \ldots, S$.

Theorem 3.3 below details our consistent estimators for $\bar v$ and the bias. To obtain a consistent estimator of $\bar v$, we construct for a given kernel $K$ a set of bandwidth vectors $\{h_t\}_{t=1}^{H}$ for which the corresponding estimators are asymptotically biased (oversmoothed). One such bandwidth vector, $h_{gcv}$, is given by the usual cross-validation for non-parametric regression; it is oversmoothed since $h_{gcv} = cN^{-1/(2\bar v+2)}$ (following Stone, 1982). The consistent estimator for $\bar v$ is obtained by an approach reminiscent of Woodroofe's (1970); it relies on Assumption 2.6, from which it follows that for each $\ell$th component of the ADE and two distinct oversmoothed bandwidth vectors $h_t$ and $h_{t'}$ from the set $\{h_t\}_{t=1}^{H}$ detailed in Theorem 3.3 below (which satisfy $\lim_{N\to\infty}(h_{t'}[\ell]/h_t[\ell]) = 0$) we have
$$\ln\big\{\big(\hat\delta_N(K, h_t) - \hat\delta_N(K, h_{t'})\big)_\ell\big\}^2 = \bar v \ln h_t[\ell]^2 + \ln B_\ell^2(K) + o_p(1),$$
where $h_t[\ell]$ is the $\ell$th component of the vector $h_t$. Part (a) of Theorem 3.3 below provides the regression estimator, $\hat{\bar v}$, together with a consistent estimator of the optimal rate for the bandwidth. To obtain a consistent estimator of the bias of $\hat\delta_N(K, \mathbf h)$ for a given kernel $K$, we make use of the properties of oversmoothed and undersmoothed estimators. Specifically, using a pair of estimators $\hat\delta_N$, one of which is based on an oversmoothed bandwidth and the other on a somewhat undersmoothed one, we consistently estimate the bias for the oversmoothed estimator. Subsequently, for any bandwidth vector $\mathbf h$ a consistent estimator of the bias of $\hat\delta_N(K, \mathbf h)$ relies on the fact that the leading terms of the bias differ by the ratio of bandwidths to the power $\bar v$, which we consistently estimate by $\hat{\bar v}$. The details are given in part (b) of Theorem 3.3.

THEOREM 3.3. Under Assumptions 2.1–2.6.
(a) Consider a sequence of bandwidth vectors $\{h_t\}_{t=1}^{H}$, such that $h_t = c_t h_{gcv} N^{\gamma_t}$ for some positive constants $c_t$ with $0 \le \gamma_1 < \cdots < \gamma_H < \frac{1}{2v(K)+k}$. Let $T$ define a subset of all pairs $\{(h_t, h_{t'})$, $t, t' = 1, \ldots, H$ with $t' < t\}$ with cardinality $Q$: $2 \le Q \le \frac{H(H-1)}{2}$. An estimator for $\bar v$, $\hat{\bar v}$, is given by

4 The bootstrapped variance provides the same expansion as in Lemma 3.2 for our simulation example with k = 2; generally, validity of this bootstrap expansion holds under somewhat stronger moment assumptions, such as $E(y^4 \mid x) < \infty$. Details can be obtained from the authors.
$$\hat{\bar v} = \frac{\sum_{(t,t')\in T} \ln\big[\big(\hat\delta_N(K, h_t) - \hat\delta_N(K, h_{t'})\big)_\ell^2\big]\Big(\ln h_t[\ell]^2 - \frac{1}{Q}\sum_{(t,t')\in T}\ln h_t[\ell]^2\Big)}{\sum_{(t,t')\in T}\big(\ln h_t[\ell]^2\big)^2 - \frac{1}{Q}\Big(\sum_{(t,t')\in T}\ln h_t[\ell]^2\Big)^2}, \qquad (3.7)$$
for any $\ell = 1, \ldots, k$ satisfies $\hat{\bar v} - \bar v = o_p((\ln N)^{-1})$. Given $\hat{\bar v}$, a bandwidth vector with the optimal rate is consistently estimated by $\hat h_{opt} = cN^{-2/(2\hat{\bar v}+k+2)}$. (b) Given bandwidths $h_o = \hat h_{opt} N^{\zeta}$, with $\max\{0, (1 - \frac{k+2}{2\hat{\bar v}})\frac{2}{2\hat{\bar v}+k+2}\} < \zeta < \frac{2}{2\hat{\bar v}+k+2}$, and $h_u = \hat h_{opt} N^{-\xi}$, with $0 < \xi < \frac{2\hat{\bar v}\zeta}{k+2}$; a consistent estimate for the asymptotic $\operatorname{Bias}\hat\delta_N(K, h_o)$, which is $h_o^{\hat{\bar v}} B(K)$, is provided by $\widehat{\operatorname{Bias}}\,\hat\delta_N(K, h_o) = \hat\delta_N(K, h_o) - \hat\delta_N(K, h_u)$. Consistent estimates of $\operatorname{Bias}\hat\delta_N(K, \mathbf h)$ with $\mathbf h \to 0$ as $N \to \infty$ can be obtained as $\mathbf h^{\hat{\bar v}} h_o^{-\hat{\bar v}}\,\widehat{\operatorname{Bias}}\,\hat\delta_N(K, h_o)$.

We note that by construction the estimator in (3.7), when applied to the different components $\ell = 1, \ldots, k$, will lead to $k$ consistent estimators for $\bar v$, which will differ in finite samples. 5

Summarizing, our proposed procedure consists of the following steps:

Step 1: Compute an estimator, $\hat{\bar v}$, of the smoothness index $\bar v$ for each included kernel. Two low-order kernels should be included (we use one second- and one fourth-order kernel).
Step 2: Compute $\hat\delta_N(K_s, \mathbf h_s)$ for a set of kernel/bandwidth combinations, $s = 1, \ldots, S$. For each kernel we use $k + 3$ bandwidths. Based on the estimated $\hat{\bar v}$ for that kernel we include the estimated rate optimal bandwidth $\hat h_{opt}$ and $k$ larger bandwidths in accordance with Theorem 3.2; we also include a marginally smaller bandwidth, $\hat h_{opt}(\ln N)^{-0.1}$, and $\hat h_{gcv}$, an automatic bandwidth selector for non-parametric regression. Note that the set of bandwidths need not increase with $N$.
Step 3: Estimate all the covariances and biases for the individual estimators.
Step 4: Compute the combined estimator, $\hat\delta_{N,\mathrm{comb}} = \sum_{s=1}^{S} a_s^* \hat\delta_N(K_s, \mathbf h_s)$, where $a_s^*$, $s = 1, \ldots, S$, are weights that minimize the estimated $\operatorname{trMSE}(\hat\delta_{N,\mathrm{comb}})$ subject to $\sum_{s=1}^{S} a_s^* = 1$ over some compact set, e.g. $\{a^*: \|a^*\| \le A < \infty\}$.
4. SIMULATION

In order to illustrate the effectiveness of the combined estimator, we provide a Monte Carlo study where we consider the Tobit model. The Tobit model under consideration is given by
$$y_i^* = x_i^T \beta + \varepsilon_i, \quad i = 1, \ldots, n,$$
$$y_i = y_i^* \;\text{ if } y_i^* > 0, \qquad y_i = 0 \;\text{ otherwise},$$
where our dependent variable $y_i$ is censored to zero for all observations for which the latent variable $y_i^*$ lies below a threshold, which without loss of generality is set equal to zero. We randomly draw $\{(x_i, \varepsilon_i)\}_{i=1}^{n}$, where we assume that the errors, drawn independently of the regressors, are standard Gaussian. Consequently, the conditional mean of $y$ given $x$ can be written as
$$g(x) = x^T\beta\,\Phi(x^T\beta) + \phi(x^T\beta),$$
5 As a referee pointed out, if we use all $H(H-1)/2$ pairs we can simplify some of the sums above, e.g. $\sum_{(t,t')\in T} \ln h_t[\ell]^2 = \sum_{t=2}^{H} (t-1)\ln h_t[\ell]^2$.
where $\Phi(\cdot)$ and $\phi(\cdot)$ denote the standard normal cdf and pdf, respectively. We consider the density weighted average derivative estimate (ADE) of this single-index model defined in (1.2), which identifies the parameters $\beta$ 'up to scale' without relying on the Gaussianity assumption on $\varepsilon_i$. Under the usual smoothness assumptions, the finite sample properties of the ADE for this Tobit model have been considered in the literature (Nishiyama and Robinson, 2005). We use two explanatory variables and select $\beta = (1, 1)^T$. We make various assumptions about the distribution of our independent, explanatory variables. The base model uses two standard normal explanatory variables. In the other models various multimodal normal mixtures are considered, which, while still being infinitely differentiable, do allow behaviour resembling that of non-smooth densities. In particular, we consider the trimodal normal mixture used in KZW, $0.5\phi(x + 0.767) + 3\phi\big(\frac{x + 0.767 - 0.8}{0.1}\big) + 2\phi\big(\frac{x + 0.767 - 1.2}{0.1}\big)$, and the 'double claw' and 'discrete comb' mixtures (Marron and Wand, 1992). The models are labelled using two indices $(i_1, i_2)$ representing the distributions used for the two explanatory variables, with each index $i_\cdot = s$ (standard normal), m (trimodal normal mixture), c (double claw) or d (discrete comb). The sample size is set at N = 2000 with 100 replications. The multivariate kernel function $K(\cdot)$ (on $R^2$) is chosen as the product of two univariate kernel functions. We use the quartic second-order kernel (see e.g. Yatchew, 2003) and a fourth-order kernel in our Monte Carlo experiment, where, given that we use two explanatory variables, the highest order satisfies the theoretical requirement for ascertaining a parametric rate subject to the necessary smoothness assumptions. 6
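The base (s,s) design can be generated in a few lines; a sketch (our own illustration) that also checks the stated conditional mean formula by simulation:

```python
import numpy as np
from math import erf

rng = np.random.default_rng(0)
n, beta = 5000, np.array([1.0, 1.0])
x = rng.standard_normal((n, 2))              # two standard normal regressors
eps = rng.standard_normal(n)                 # standard Gaussian errors
y_star = x @ beta + eps                      # latent variable
y = np.where(y_star > 0, y_star, 0.0)        # censoring at zero

z = x @ beta                                 # single index x'beta
Phi = 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))
phi = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
g = z * Phi + phi                            # Tobit conditional mean of y given x
```

Since the index is symmetric around zero, about half the observations are censored, and the residuals $y - g(x)$ average to zero.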
Even though the sequence hgcv = cN −1/(2v+2) rates are the same, for computation of bandwidth vectors in our finite sample experiment we allow for differing bandwidths. We obtain them using a gridsearch. 7 Next, we estimate v¯ using bandwidths that satisfy the conditions of Theorem 3.3(a). We 1 · t−1 , t = 1, . . , H . The actual bandwidth sequences {ct hgcv N γt } are set H = 6, γt = 2v(K)+2 6 gcv gcv 1/36 gcv 2/36 {h , 1.01h N , 0.98h N , 0.93hgcv N 3/36 , 0.86hgcv N 4/36 , 0.78hgcv N 5/36 } for secondgcv gcv 1/60 , 1.16hgcv N 2/60 , 1.20hgcv N 3/60 , 1.21hgcv N 4/60 , 1.19hgcv N 5/60 } order kernel; {h , 1.10h N for fourth-order kernel. The reason for selecting the ct ’s is to ensure a reasonable spread of bandwidths for N = 2000 (correspond to bandwidth sequences {hgcv , 1.25hgcv , 1.5hgcv , 1.75hgcv , 2.0hgcv , 2.25hgcv }). To estimate v¯ we select a subset of Q bandwidths in the following way: select a range of consecutive bandwidths where differences {δˆN (K, ht ) − δˆN (K, ht )} all have the same sign; if that is not possible, we use all ¯ (vˆ¯ 1 , vˆ¯ 2 ), for our models on average provided in the (s,s) Q = H (H2+1) . The estimated v, model (1.99, 1.98) for the second-order kernel, K2 , and (3.68, 3.71) for the fourth-order kernel, K 4 ; in the (s,m) model K 2 provided (1.70, 1.51) and K4 : (3.16, 2.67); the (m,m) model K2 : (1.40, 1.37) and K4 : (1.98, 1.96); the (s,c) model K2 : (1.94, 1.87) and K4 : (3.56, 3.21); the (s,d) model K2 : (1.50, 1.04) and K4 : (3.36, 1.67); and the (c,d) model K2 : (1.49, 0.89) and K4 : (3.14, 1.68), which are reasonable. We use vˆ¯ 1 and vˆ¯ 2 as estimators of v¯ relating them to their respective component in the ADE vector.
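As a concrete illustration of the estimator being simulated, the density-weighted ADE of PSS (1989) can be computed directly. The following Python sketch is our own illustration (not the authors' code): it uses the product quartic kernel, a leave-one-out density-derivative estimate, and a hypothetical Tobit design in the spirit of the (s,s) model, with an arbitrary bandwidth.

```python
import numpy as np

def quartic(u):
    """Quartic (biweight) second-order kernel K(u) = (15/16)(1-u^2)^2 on |u| <= 1."""
    return np.where(np.abs(u) <= 1, 15 / 16 * (1 - u ** 2) ** 2, 0.0)

def quartic_deriv(u):
    """Derivative K'(u) = -(15/4) u (1-u^2) on |u| <= 1."""
    return np.where(np.abs(u) <= 1, -15 / 4 * u * (1 - u ** 2), 0.0)

def ade(y, x, h):
    """Density-weighted ADE: delta_hat = -2/N sum_i y_i fhat'(x_i), with a
    leave-one-out product-kernel estimate of the density derivative and a
    bandwidth vector h (one component per regressor)."""
    n, k = x.shape
    u = (x[:, None, :] - x[None, :, :]) / h        # scaled pairwise differences
    kmat = quartic(u)                              # kernel factors, shape (n, n, k)
    delta = np.empty(k)
    for ell in range(k):
        prod = quartic_deriv(u[:, :, ell]) / h[ell]
        for m in range(k):
            if m != ell:
                prod = prod * kmat[:, :, m]
        np.fill_diagonal(prod, 0.0)                # leave-one-out
        fprime = prod.sum(axis=1) / ((n - 1) * np.prod(h))
        delta[ell] = -2.0 * np.mean(y * fprime)
    return delta

# Hypothetical Tobit single-index design with beta = (1, 1)^T, as in the text
rng = np.random.default_rng(0)
n = 1000
x = rng.standard_normal((n, 2))
y = np.maximum(x[:, 0] + x[:, 1] + rng.standard_normal(n), 0.0)
d = ade(y, x, h=np.array([0.6, 0.6]))
```

Since $\delta_0$ is proportional to β here, the ratio of the two estimated components should be close to $\beta_1/\beta_2 = 1$ up to sampling noise.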
⁶ The fourth-order kernel we use is given by $K(x) = \frac{105}{64}(-3x^6 + 7x^4 - 5x^2 + 1)\,1(|x| \le 1)$.
⁷ The cross-validated bandwidths are two-component vectors (one component per explanatory variable); for the (s,s) model with N = 2000 they were (0.66, 0.66) for the second-order kernel, with the component values across the remaining models and the fourth-order kernel lying between 0.39 and 1.70.
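The smoothness-estimation step described above — regressing the log squared differences of ADEs on the (demeaned) log squared bandwidths, as in the proof of Theorem 3.3(a) — can be sketched as follows. The bandwidth grid and the $\hat\delta$ values below are made up purely for illustration.

```python
import numpy as np

def estimate_vbar(deltas, hs):
    """Least-squares slope of ln((delta_t - delta_t')^2) on the demeaned
    ln h^2 of the larger bandwidth in each pair (t', t), t' < t.
    deltas: one ADE component evaluated at the increasing bandwidths hs."""
    H = len(hs)
    pairs = [(tp, t) for t in range(H) for tp in range(t)]
    lg = np.array([np.log((deltas[t] - deltas[tp]) ** 2) for tp, t in pairs])
    w = np.array([np.log(hs[t] ** 2) for tp, t in pairs])
    w = w - w.mean()                    # demeaned trending regressor
    return (lg @ w) / (w @ w)

# Illustration: if delta(h) = delta_0 + B h^vbar exactly with vbar = 2, the
# slope recovers vbar up to the neglected smaller-bandwidth terms in each pair.
hs = 0.1 * 2.0 ** np.arange(6)
deltas = 0.5 + 1.0 * hs ** 2
vbar_hat = estimate_vbar(deltas, hs)
```

With a well-spread (here geometric) grid the smaller bandwidth in each pair is negligible, so the slope is close to the true smoothness exponent.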
M. M. A. Schafgans and V. Zinde-Walsh

Table 1. Relative RMSE of the density weighted ADE estimators.

Bandwidth/Kernel     Model (s,s)       Model (s,m)       Model (m,m)
                     K2      K4        K2      K4        K2      K4
h0 (hu)              0.234   0.141     0.457   0.427     0.672   0.686
h1                   0.156   0.113     0.471   0.445     0.694   0.743
h2 (hopt)            0.106   0.085     0.500   0.495     0.755   0.811
h3                   0.093   0.078     0.533   0.516     0.811   0.839
h4 (ho)              0.096   0.078     0.564   0.519     0.856   0.864
h5 (hgcv)            0.113   0.083     0.607   0.543     0.910   0.934
Combined             0.096             0.561             0.869

Bandwidth/Kernel     Model (s,c)       Model (s,d)       Model (c,d)
                     K2      K4        K2      K4        K2      K4
h0 (hu)              0.499   0.455     1.487   1.538     1.054   1.015
h1                   0.465   0.447     1.319   1.271     0.900   0.785
h2 (hopt)            0.470   0.444     1.168   1.038     0.728   0.632
h3                   0.480   0.447     1.033   0.995     0.613   0.671
h4 (ho)              0.486   0.451     0.895   0.925     0.517   0.671
h5 (hgcv)            0.497   0.461     0.766   0.847     0.479   0.575
Combined             0.465             0.872             0.690
opt N 0.05 and h = h opt N −0.07 and In accordance with Theorem 3.3(b), we choose ho = h u δ(K, ˆ obtain Bias ho ). For the combined estimator we consider, as indicated in our Step 2, a opt for each kernel and two larger bandwidths, one marginally range of bandwidths that include h opt (ln N )−0.1 , smaller bandwidth and the generalized cross-validation bandwidth, providing {h 0.1 0.2 opt , h opt (ln N) , h opt (ln N ) , h gcv } (or {0.75h opt , h opt , 1.25h opt , 1.50h opt , h gcv }). With two h kernels, this implies that the combined estimator under consideration has S = 10. Covariances are computed by bootstrap using (3.6); biases according to Theorem 3.3(b). The weights are then obtained by minimizing the trAMSE constructed according to (3.3) with estimated biases and covariances subject to the sum of the weights being equal to 1. 8 Larger weights including those of opposite signs, are typically given to the higher bandwidths for the second- and fourth-order kernel. In Table 1, we report relative error: the ratio of the true finite sample root mean squared errors (RMSE) to δ0 for ADE in the different models for the sample size N = 2000. Note that the relative errors for model (s,s) are in the range 7.8–23.4% and are relatively small. For (s,c) the errors are much larger but are close for all bandwidths and kernels: range is 44.4–49.9%, so there is not much sensitivity to the choice of bandwidth/kernel order. There 8 Ordering the kernel/bandwidth pairs, s = 1, . . . , 10 as: (K , h ), . . . , (K , h ), (K , h ), . . . 
, (K , h ), on average 2 1 2 5 4 1 4 5 the weights are (−0.00, −0.03, 0.65, −0.45, −0.07, −0.09, −0.04, −0.23, 2.30, −1.05) for the (s,s) model; for (s,m) the weights are (0.03, −0.01, 0.89, 1.00, −0.77, −0.38, −0.10, −0.22, 1.24, −0.70); for (m,m) (0.02, 0.07, −0.92, 4.01, −2.40, −0.32, 0.14, −0.74, 1.43, −0.28); for (s,c) (0.02, −0.11, 0.74, −0.14, −0.09, −0.14, −0.11, −0.08, 1.96, −1.05); for (s,d) (0.03, 0.11, 0.30, 2.35, −0.60, −0.50, −0.36, −0.52, 1.10, −0.90); and for (c,d) (0.05, 0.09, −0.29, 2.77, −0.85, −0.52, 0.65, −0.87, 1.62, −1.06). C The Author(s). Journal compilation C Royal Economic Society 2010.
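Since (3.3) itself is not reproduced in this excerpt, the following sketch assumes the natural form of the objective — squared-bias inner products plus covariance traces — and shows the sum-to-one constrained minimization over weights in closed form (a single linear solve; the authors may instead have minimized numerically).

```python
import numpy as np

def combined_weights(biases, covs):
    """Weights w minimizing an estimated trAMSE of sum_s w_s * delta_s
    subject to sum_s w_s = 1.
    biases: (S, k) array of estimated biases, one row per kernel/bandwidth pair;
    covs:   (S, S, k, k) array of estimated covariance blocks."""
    S = biases.shape[0]
    # M[s1, s2] = bias_{s1}' bias_{s2} + tr Cov(delta_{s1}, delta_{s2})
    M = biases @ biases.T + np.trace(covs, axis1=2, axis2=3)
    ones = np.ones(S)
    sol = np.linalg.solve(M, ones)   # Lagrange solution: w proportional to M^{-1} 1
    return sol / (ones @ sol)

# Tiny illustration with S = 2 estimators of a scalar (k = 1): the first
# estimator is biased, both have unit variance, so it is downweighted.
biases = np.array([[1.0], [0.0]])
covs = np.zeros((2, 2, 1, 1))
covs[0, 0, 0, 0] = covs[1, 1, 0, 0] = 1.0
w = combined_weights(biases, covs)
```

Note that nothing restricts the solved weights to [0, 1], which is consistent with the large and negative weights reported in footnote 8.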
is somewhat more dispersion for the (s,m) case: the range is 42.7–60.7%, but even in this case the price of an incorrect choice (associated with too large a bandwidth) is not dramatic. More striking consequences of the choice are seen in the (m,m) case: the range of relative errors is 68.6–93.4%; here, similarly to (s,m), an incorrect choice involves oversmoothing, but unlike (s,m) the higher-order kernel gives consistently worse results. The most dramatic cases are (s,d), with range 76.6–153.8%, and (c,d), with 47.9–105.4%, where an incorrect choice is now associated with undersmoothing. In these cases the combined estimator gives results much closer to the lower bound than to the upper bound of the errors, and often presents a better choice than the estimated optimal bandwidth. We conclude that there is no rule regarding either kernel order or bandwidth that works uniformly (similar results were found by Hansen, 2005): some individual estimators that are best for one model are worst for another. The estimated optimal bandwidth compares favourably with many bandwidths (including cross-validation), but there is no indication of which order of kernel to use. The combined estimator offers reliably good performance and is often better than the optimal, especially in cases of large relative errors.
5. CONCLUSIONS

In this paper we provide asymptotic properties of the ADE in the case of insufficient smoothness (or kernel order) and demonstrate the availability of estimators that improve on the ADE with optimal bandwidth by using linear non-convex combinations of ADEs. We adapt to unknown and/or insufficient density smoothness by using a combined estimator constructed with specially selected bandwidths based on the optimal rate. With an unknown degree of smoothness, the optimal bandwidth rate is consistently estimated. Monte Carlo simulations demonstrate that, even when the smoothness assumptions formally hold, large values of the derivatives mean there is no general guidance for selecting a kernel and bandwidth that will not lead to large errors for some distributions. Using the estimated optimal rate bandwidth leads to less erratic performance but can be adversely affected by an incorrect kernel choice. By not relying on a single kernel/bandwidth choice, the combined estimator reduces this sensitivity and provides good and reliable performance.
ACKNOWLEDGMENTS

The authors would like to thank two anonymous referees and Richard Smith for their comments and suggestions. The work was supported by the Social Sciences and Humanities Research Council of Canada (SSHRC) and by the Fonds québécois de la recherche sur la société et la culture (FQRSC).
REFERENCES

Banerjee, A. N. (2007). A method of estimating the average derivative. Journal of Econometrics 136, 65–88.
Blundell, R., A. Duncan and K. Pendakur (1998). Semiparametric estimation and consumer demand. Journal of Applied Econometrics 13, 435–61.
Cattaneo, M. D., R. K. Crump and M. Jansson (2008). Small bandwidth asymptotics for density-weighted average derivatives. Research Paper 2008-24, CREATES, Aarhus University. Available at SSRN: http://ssrn.com/abstract=1148173.
Chaudhuri, P., K. Doksum and A. Samarov (1997). On average derivative quantile regression. Annals of Statistics 25, 715–44.
Dalalyan, A. S., G. K. Golubev and A. B. Tsybakov (2006). Penalized maximum likelihood and semiparametric second order efficiency. Annals of Statistics 34, 169–201.
Donkers, B. and M. Schafgans (2008). Specification and estimation of semiparametric multiple-index models. Econometric Theory 24, 1584–606.
Hansen, B. E. (2005). Exact mean integrated squared error of higher order kernel estimators. Econometric Theory 21, 1031–57.
Härdle, W. and A. W. Bowman (1988). Bootstrapping in nonparametric regression: local adaptive smoothing and confidence bands. Journal of the American Statistical Association 83, 101–10.
Härdle, W., W. Hildenbrand and M. Jerison (1991). Empirical evidence on the law of demand. Econometrica 59, 1525–49.
Härdle, W. and T. M. Stoker (1989). Investigating smooth multiple regression by the method of average derivatives. Journal of the American Statistical Association 84, 986–95.
Horowitz, J. L. and W. Härdle (1996). Direct semiparametric estimation of single-index models with discrete covariates. Journal of the American Statistical Association 91, 1632–40.
Izenman, A. J. and C. J. Sommer (1988). Philatelic mixtures and multimodal densities. Journal of the American Statistical Association 83, 941–53.
Juditsky, A. and A. Nemirovski (2000). Functional aggregation for nonparametric regression. Annals of Statistics 28, 681–712.
Kotlyarova, Y. and V. Zinde-Walsh (2006). Non- and semi-parametric estimation in models with unknown smoothness. Economics Letters 93, 379–86.
Kotlyarova, Y. and V. Zinde-Walsh (2007). Robust kernel estimator for densities of unknown smoothness. Journal of Nonparametric Statistics 19, 89–101.
Li, Q., X. Lu and A. Ullah (2003). Multivariate local polynomial regression for estimating average derivatives. Journal of Nonparametric Statistics 15, 607–24.
Li, Q. and J. S. Racine (2007). Nonparametric Econometrics: Theory and Practice. Princeton: Princeton University Press.
Marron, J. S. and M. P. Wand (1992). Exact mean integrated squared error. Annals of Statistics 20, 712–36.
Newey, W. K. and T. M. Stoker (1993). Efficiency of weighted average derivative estimators and index models. Econometrica 61, 1199–223.
Nishiyama, Y. and P. M. Robinson (2000). Edgeworth expansions for semiparametric averaged derivatives. Econometrica 68, 931–80.
Nishiyama, Y. and P. M. Robinson (2005). The bootstrap and the Edgeworth correction for semiparametric averaged derivatives. Econometrica 73, 903–48.
Powell, J. L., J. H. Stock and T. M. Stoker (1989). Semiparametric estimation of weighted average derivatives. Econometrica 57, 1403–30.
Powell, J. L. and T. M. Stoker (1996). Optimal bandwidth choice for density-weighted averages. Journal of Econometrics 75, 291–316.
Robinson, P. M. (1989). Hypothesis testing in semiparametric and nonparametric models for econometric time series. Review of Economic Studies 56, 511–34.
Schafgans, M. M. A. and V. Zinde-Walsh (2007). Robust average derivative estimation. Working Paper 2007-12, Department of Economics, McGill University.
Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. Annals of Statistics 10, 1040–53. Woodroofe, M. (1970). On choosing a delta sequence. Annals of Mathematical Statistics 41, 1665– 71. Yang, Y. (2000). Combining different procedures for adaptive regression. Journal of Multivariate Analysis 74, 135–61. Yatchew, A. (2003). Semiparametric Regression for the Applied Econometrician. Cambridge: Cambridge University Press.
APPENDIX: PROOFS

The proofs of our results rely on the following lemma. For notation purposes we recall (1.3) and (1.4): $\mathbf{h}$ is a diagonal matrix and $h^k$ is the scalar product of the bandwidth components. Both $\mathbf{h}$ and $h$ can be read as a scalar $h$ in the equal-bandwidth setting (and in Assumption 2.6(b), where it relates to $h^{\bar v}$).

LEMMA A.1. Under Assumptions 2.1–2.5, if $h_s \to 0$ and $N^2 h_s^k h_s^2 \to \infty$ for $s = 1, \ldots, S$, the covariance of $\hat\delta_N(K_{s_1}, \mathbf{h}_{s_1})$ and $\hat\delta_N(K_{s_2}, \mathbf{h}_{s_2})$, $\Sigma_{s_1,s_2}$, for $s_1, s_2 = 1, \ldots, S$, is
$$\Sigma_{s_1,s_2} = N^{-2}h_{s_2}^{-k}\,\mathbf{h}_{s_1}^{-1}\big(\Sigma_1(K_{s_1}, K_{s_2}, \mathbf{h}_{s_1}, \mathbf{h}_{s_2}) + o(1)\big)\mathbf{h}_{s_2}^{-1} + (\Sigma_2 + o(1))N^{-1}$$
with
$$\Sigma_1(K_{s_1}, K_{s_2}, \mathbf{h}_{s_1}, \mathbf{h}_{s_2}) = 4E\big[y_i^2 f(x_i) - (gf)(x_i)y_i\big]\,\mu_2(K_{s_1}, K_{s_2}, \mathbf{h}_{s_1}, \mathbf{h}_{s_2});$$
$$\mu_2(K_{s_1}, K_{s_2}, \mathbf{h}_{s_1}, \mathbf{h}_{s_2}) = \int K_{s_1}'(u)\,K_{s_2}'\Big(u\,\frac{\mathbf{h}_{s_1}}{\mathbf{h}_{s_2}}\Big)^T du;\ \text{and}$$
$$\Sigma_2 = 4\big\{E\big([(g'f)(x_i) - (y_i - g(x_i))f'(x_i)][(g'f)(x_i) - (y_i - g(x_i))f'(x_i)]^T\big)\big\} - 4\delta_0\delta_0^T.$$

Proof: To derive an expression for $\Sigma_{s_1,s_2}$, for $s_1, s_2 = 1, \ldots, S$, we note
$$\Sigma_{s_1,s_2} = E\big[\hat\delta_N(K_{s_1}, \mathbf{h}_{s_1})\hat\delta_N(K_{s_2}, \mathbf{h}_{s_2})^T\big] - E\hat\delta_N(K_{s_1}, \mathbf{h}_{s_1})\,E\hat\delta_N(K_{s_2}, \mathbf{h}_{s_2})^T. \qquad(A.1)$$
Let $I(\tau) = 1$ if the expression τ is true, and zero otherwise. We decompose the first term as follows:
$$E\big(\hat\delta_N(K_{s_1}, \mathbf{h}_{s_1})\hat\delta_N(K_{s_2}, \mathbf{h}_{s_2})^T\big) = 4E\Bigg\{\bigg(\frac{1}{N}\sum_{i=1}^N \hat f'_{(K_{s_1},\mathbf{h}_{s_1})}(x_i)y_i\bigg)\bigg(\frac{1}{N}\sum_{i=1}^N \hat f'_{(K_{s_2},\mathbf{h}_{s_2})}(x_i)y_i\bigg)^T\Bigg\}$$
$$= 4\bigg\{\frac{1}{N}E\Big(\hat f'_{(K_{s_1},\mathbf{h}_{s_1})}(x_i)\hat f'_{(K_{s_2},\mathbf{h}_{s_2})}(x_i)^T y_i^2\Big) + \frac{N-1}{N}E\Big(\hat f'_{(K_{s_1},\mathbf{h}_{s_1})}(x_{i_1})\hat f'_{(K_{s_2},\mathbf{h}_{s_2})}(x_{i_2})^T y_{i_1}y_{i_2}\,I(i_1 \ne i_2)\Big)\bigg\}. \qquad(A.2)$$
For the first term of (A.2) we obtain
$$E\big(\hat f'_{(K_{s_1},\mathbf{h}_{s_1})}(x_i)\hat f'_{(K_{s_2},\mathbf{h}_{s_2})}(x_i)^T y_i^2\big) = E\Bigg\{E\Bigg(y_i^2\bigg[\frac{1}{N-1}\sum_{j\ne i}h_{s_1}^{-k}\mathbf{h}_{s_1}^{-1}K_{s_1}'\Big(\frac{x_i-x_j}{\mathbf{h}_{s_1}}\Big)\bigg]\bigg[\frac{1}{N-1}\sum_{j\ne i}h_{s_2}^{-k}\mathbf{h}_{s_2}^{-1}K_{s_2}'\Big(\frac{x_i-x_j}{\mathbf{h}_{s_2}}\Big)\bigg]^T\bigg|z_i\Bigg)\Bigg\}$$
$$= \frac{1}{N-1}h_{s_2}^{-k}\mathbf{h}_{s_1}^{-1}E\Big\{y_i^2\,h_{s_1}^{-k}E\Big(K_{s_1}'\Big(\frac{x_i-x_j}{\mathbf{h}_{s_1}}\Big)K_{s_2}'\Big(\frac{x_i-x_j}{\mathbf{h}_{s_2}}\Big)^T I(i\ne j)\Big|z_i\Big)\Big\}\mathbf{h}_{s_2}^{-1}$$
$$\quad+\frac{N-2}{N-1}h_{s_1}^{-k}h_{s_2}^{-k}\mathbf{h}_{s_1}^{-1}E\Big\{y_i^2\,E\Big(K_{s_1}'\Big(\frac{x_i-x_{j_1}}{\mathbf{h}_{s_1}}\Big)\Big|z_i\Big)E\Big(K_{s_2}'\Big(\frac{x_i-x_{j_2}}{\mathbf{h}_{s_2}}\Big)\Big|z_i\Big)^T\Big\}\mathbf{h}_{s_2}^{-1}$$
$$= \frac{1}{N-1}h_{s_2}^{-k}\mathbf{h}_{s_1}^{-1}E\Big\{y_i^2\int K_{s_1}'(u)K_{s_2}'\Big(u\frac{\mathbf{h}_{s_1}}{\mathbf{h}_{s_2}}\Big)^T f(x_i-u\mathbf{h}_{s_1})\,du\Big\}\mathbf{h}_{s_2}^{-1}$$
$$\quad+\frac{N-2}{N-1}\mathbf{h}_{s_1}^{-1}E\Big\{y_i^2\int K_{s_1}'(u)f(x_i-u\mathbf{h}_{s_1})\,du\,\Big(\int K_{s_2}'(u)f(x_i-u\mathbf{h}_{s_2})\,du\Big)^T\Big\}\mathbf{h}_{s_2}^{-1},$$
where the final equality applies the change of variables $u = (x_i - x_j)/\mathbf{h}_{s_1}$ to the first term and uses the independence of $x_{j_1}$ and $x_{j_2}$ for the second term. Note that
$$K_{s_2}'\Big(u\,\frac{\mathbf{h}_{s_1}}{\mathbf{h}_{s_2}}\Big) = K_{s_2}'\Big(u_1\frac{h_{s_1}[1]}{h_{s_2}[1]}, \ldots, u_k\frac{h_{s_1}[k]}{h_{s_2}[k]}\Big).$$
Recognizing that the kernel vanishes at the boundary, using integration by parts and applying Assumption 2.5, we obtain, e.g. for $s = s_1, s_2$ with $h_s \to 0$,
$$h_s^{-k}\mathbf{h}_s^{-1}E\Big(y_iK_s'\Big(\frac{x_i-x_j}{\mathbf{h}_s}\Big)\Big|z_i\Big) = y_i\,\mathbf{h}_s^{-1}\int K_s'(u)f(x_i-u\mathbf{h}_s)\,du = y_i f'(x_i) + o(1).$$
Note that the $o(1)$ term here and below is $O(h^{\bar v})$ by Assumptions 2.4 and 2.5. Similarly, for $h_{s_1} \to 0$ and $h_{s_2} \to 0$ we obtain
$$E\big(\hat f'_{(K_{s_1},\mathbf{h}_{s_1})}(x_i)\hat f'_{(K_{s_2},\mathbf{h}_{s_2})}(x_i)^T y_i^2\big) = \frac{1}{N-1}h_{s_2}^{-k}\mathbf{h}_{s_1}^{-1}\big(E[y_i^2 f(x_i)]\,\mu_2(K_{s_1}, K_{s_2}, \mathbf{h}_{s_1}/\mathbf{h}_{s_2}) + o(1)\big)\mathbf{h}_{s_2}^{-1}$$
$$\quad+\frac{N-2}{N-1}\big(E[y_i^2 f'(x_i)f'(x_i)^T] + o(1)\big). \qquad(A.3)$$
For the second term in the last line of (A.2), where for brevity we omit terms such as $I(i_1 \ne i_2)$ in the expression, we obtain
$$E\big(\hat f'_{(K_{s_1},\mathbf{h}_{s_1})}(x_{i_1})\,\hat f'_{(K_{s_2},\mathbf{h}_{s_2})}(x_{i_2})^T y_{i_1}y_{i_2}I(i_1\ne i_2)\big)$$
$$= \frac{N-2}{(N-1)^2}h_{s_1}^{-k}h_{s_2}^{-k}\mathbf{h}_{s_1}^{-1}E\Big\{E\Big(y_{i_1}K_{s_1}'\Big(\frac{x_{i_1}-x_{j_1}}{\mathbf{h}_{s_1}}\Big)\Big|z_{j_1}\Big)E\Big(y_{i_2}K_{s_2}'\Big(\frac{x_{i_2}-x_{j_1}}{\mathbf{h}_{s_2}}\Big)\Big|z_{j_1}\Big)^T\Big\}\mathbf{h}_{s_2}^{-1}$$
$$\quad+\frac{1}{(N-1)^2}h_{s_1}^{-k}h_{s_2}^{-k}\mathbf{h}_{s_1}^{-1}E\Big\{y_{i_2}E\Big(y_{i_1}K_{s_1}'\Big(\frac{x_{i_1}-x_{i_2}}{\mathbf{h}_{s_1}}\Big)K_{s_2}'\Big(\frac{x_{i_2}-x_{i_1}}{\mathbf{h}_{s_2}}\Big)^T\Big|z_{i_2}\Big)\Big\}\mathbf{h}_{s_2}^{-1}$$
$$\quad+\frac{N-2}{(N-1)^2}h_{s_1}^{-k}h_{s_2}^{-k}\mathbf{h}_{s_1}^{-1}E\Big\{E\Big(y_{i_1}K_{s_1}'\Big(\frac{x_{i_1}-x_{i_2}}{\mathbf{h}_{s_1}}\Big)\Big|z_{i_2}\Big)E\Big(y_{i_2}K_{s_2}'\Big(\frac{x_{i_2}-x_{j_2}}{\mathbf{h}_{s_2}}\Big)\Big|z_{i_2}\Big)^T\Big\}\mathbf{h}_{s_2}^{-1}$$
$$\quad+\frac{N-2}{(N-1)^2}h_{s_1}^{-k}h_{s_2}^{-k}\mathbf{h}_{s_1}^{-1}E\Big\{E\Big(y_{i_1}K_{s_1}'\Big(\frac{x_{i_1}-x_{j_1}}{\mathbf{h}_{s_1}}\Big)\Big|z_{i_1}\Big)E\Big(y_{i_2}K_{s_2}'\Big(\frac{x_{i_2}-x_{i_1}}{\mathbf{h}_{s_2}}\Big)\Big|z_{i_1}\Big)^T\Big\}\mathbf{h}_{s_2}^{-1}$$
$$\quad+\frac{(N-2)(N-3)}{(N-1)^2}h_{s_1}^{-k}h_{s_2}^{-k}\mathbf{h}_{s_1}^{-1}E\Big\{E\Big(y_{i_1}K_{s_1}'\Big(\frac{x_{i_1}-x_{j_1}}{\mathbf{h}_{s_1}}\Big)\Big|z_{i_1}\Big)\Big\}E\Big\{E\Big(y_{i_2}K_{s_2}'\Big(\frac{x_{i_2}-x_{j_2}}{\mathbf{h}_{s_2}}\Big)\Big|z_{i_2}\Big)\Big\}^T\mathbf{h}_{s_2}^{-1}.$$
Applying integration by parts to the various terms, with the kernel vanishing at the boundary and minimal smoothness requirements on $g(x)$ and $f(x)$ (represented by Assumptions 2.3 and 2.5 with $m \ge 1$), we note for $s = s_1, s_2$
$$h_s^{-k}\mathbf{h}_s^{-1}E\Big(K_s'\Big(\frac{x_i-x_j}{\mathbf{h}_s}\Big)y_i\Big|z_j\Big) = \mathbf{h}_s^{-1}\int K_s'(u)(gf)(x_j+u\mathbf{h}_s)\,du = -\int K_s(u)(gf)'(x_j+u\mathbf{h}_s)\,du = -(gf)'(x_j)+o(1),$$
with $h_s \to 0$;
$$h_{s_1}^{-k}E\Big(K_{s_1}'\Big(\frac{x_j-x_i}{\mathbf{h}_{s_1}}\Big)K_{s_2}'\Big(\frac{x_i-x_j}{\mathbf{h}_{s_2}}\Big)^T y_j\Big|z_i\Big) = \int K_{s_1}'(u)K_{s_2}'\Big(-u\frac{\mathbf{h}_{s_1}}{\mathbf{h}_{s_2}}\Big)^T(gf)(x_i+u\mathbf{h}_{s_1})\,du = -\mu_2(K_1, K_2, \mathbf{h}_{s_1}, \mathbf{h}_{s_2})(gf)(x_i)+o(1),$$
with $h_{s_1} \to 0$; and
$$E\Big\{E\Big(h_s^{-k}\mathbf{h}_s^{-1}y_iK_s'\Big(\frac{x_i-x_j}{\mathbf{h}_s}\Big)\Big|z_i\Big)\Big\} = -\tfrac{1}{2}\big(E\hat\delta_N(K_s, \mathbf{h}_s)\big)+o(1),$$
with $h_s \to 0$. This gives
$$E\big(\hat f'_{(K_{s_1},\mathbf{h}_{s_1})}(x_{i_1})\hat f'_{(K_{s_2},\mathbf{h}_{s_2})}(x_{i_2})^T y_{i_1}y_{i_2}I(i_1\ne i_2)\big) = \frac{N-2}{(N-1)^2}E\big((gf)'(x_i)(gf)'(x_i)^T\big)+o(1)$$
$$\quad+\frac{1}{(N-1)^2}h_{s_2}^{-k}\mathbf{h}_{s_1}^{-1}\Big(E\Big[\Big(\int K_{s_1}'(u)K_{s_2}'\Big(-u\frac{\mathbf{h}_{s_1}}{\mathbf{h}_{s_2}}\Big)^T du\Big)(gf)(x_i)y_i\Big]+o(1)\Big)\mathbf{h}_{s_2}^{-1}$$
$$\quad+\frac{N-2}{(N-1)^2}\big(E\big[-(gf)'(x_i)(f'(x_i)y_i)^T\big]+o(1)\big)+\frac{N-2}{(N-1)^2}\big(E\big[(f'(x_i)y_i)(-(gf)'(x_i))^T\big]+o(1)\big)$$
$$\quad+\frac{(N-2)(N-3)}{(N-1)^2}\,\frac{1}{4}\big(E\hat\delta_N(K_{s_1},\mathbf{h}_{s_1})\big)\big(E\hat\delta_N(K_{s_2},\mathbf{h}_{s_2})\big)^T+o(1). \qquad(A.4)$$
Substituting (A.3) and (A.4) via (A.2) into (A.1) gives the desired result. $\square$
Proof of Lemma 3.1: The result is a special case of Lemma A.1 with $s_1 = s_2 = s$, where the subscripts indicating a particular kernel/bandwidth combination are removed.

Proof of Theorem 3.1: The proof relies on the expression for the MSE in (3.1), which combines the squared bias from Assumption 2.6 (based on Assumption 2.5) and the variance as given in Lemma A.1. The variance has two leading parts: one converges to $\Sigma_2$ at a parametric rate, $O(N^{-1})$; the other converges at rate $O(N^{-2}h^{-k}h^{-2})$ to $\Sigma_1(K)$; the squared bias converges at rate $O(h^{2\bar v})$.

In case (a), for $\{h : [Nh^kh^2]^{-1} = o(1);\ Nh^{2\bar v} = o(1)\}$ the term $N^{-1}\Sigma_2$ dominates the MSE; correspondingly, a parametric rate holds for the estimator; the asymptotic normality result in PSS easily adapts to accommodate different bandwidths and holds for this case. For $[Nh^kh^2]^{-1} = O(1)$ the parametric rate still holds, but the variance may have a part that depends on the kernel. For $Nh^{2\bar v} = O(1)$ the rate is parametric, but asymptotic bias is present. When $Nh^kh^2 \to 0$ (undersmoothing) the MSE is dominated by $N^{-2}h^{-k}\mathbf{h}^{-1}\Sigma_1(K)\mathbf{h}^{-1}$. The estimator has no asymptotic bias, but its variance depends on the kernel, with convergence rate $r_N^{-1} = O(N^{-1}h^{-k/2}h^{-1})$. If $Nh^{2\bar v} \to \infty$ (oversmoothing) the squared asymptotic bias dominates the MSE and by standard arguments (Chebyshev's inequality) the estimator converges in probability to $B(K)$ at rate $r_N^{-1} = O(N^{-1/2}h^{-\bar v})$.

In case (b) the range of bandwidths corresponding to parametric rates cannot be obtained. When $N^2h^kh^2h^{2\bar v} \to 0$ the MSE is dominated by the term $N^{-2}h^{-k}\mathbf{h}^{-1}\Sigma_1(K)\mathbf{h}^{-1}$; the estimator has no asymptotic bias and the convergence rate is $r_N^{-1} = O(N^{-1}h^{-k/2}h^{-1})$. If $N^2h^kh^2h^{2\bar v} \to \infty$ the squared asymptotic bias dominates the MSE and the estimator converges in probability to $B(K)$ at rate $r_N^{-1} = O(N^{-1/2}h^{-\bar v})$.

For (c), without loss of generality assume that $h[1]$ has the slowest rate among the bandwidth components; then in terms of rates every other component is $h[\ell] = O(h[1]^{\sigma_\ell})$ with $\sigma_\ell \ge 1$. The part of trMSE that depends on the bandwidth, trMSE(h), then takes the form
$$N^{-2}h[1]^{-\sum_{\ell=1}^k\sigma_\ell}\sum_{\ell=1}^k h[1]^{-2\sigma_\ell}s_\ell + \sum_{\ell=1}^k h[1]^{2\bar v\sigma_\ell}b_\ell,$$
with positive coefficients $s_\ell, b_\ell$. As $h[1]$ increases, the first term declines and the second increases; in either of the cases (a) or (b), over the relevant range of $h[1]$ the first term dominates the sum at low bandwidths and the second at higher ones. As a continuous function of $h[1]$, trMSE(h) attains a minimum over that range. If all the bandwidths are the same and equal to $h[1]$, so that all $\sigma_\ell = 1$ and $v_\ell = \bar v = \text{const}$, we get the optimal rate in (3.2) by equating the rates of the two components. If the bandwidth rates are the same and $v_\ell = \bar v = \text{const}$, with $h[1] = c_0N^{-2/(2\bar v+k+2)}$ and $h[\ell] = c_\ell h[1]$, $\ell = 1, \ldots, k$ ($c_1 = 1$), the optimal constants can be obtained by solving
$$c_0 = \left(\frac{(k+2)\sum_{j=1}^{k}\Sigma_1(K)_{jj}/c_j^2}{2\bar v\,\prod_{j=1}^{k}c_j\,\sum_{j=1}^{k}c_j^{2\bar v}B(K)_j^2}\right)^{1/(2\bar v+k+2)}$$
and
$$c_\ell = \left(\frac{\sum_{j=1}^{k}c_j^{2\bar v}B(K)_j^2}{(k+2)\,B(K)_\ell^2}\cdot\frac{\sum_{j=1}^{k}\Sigma_1(K)_{jj}/c_j^2 + 2\Sigma_1(K)_{\ell\ell}/c_\ell^2}{\sum_{j=1}^{k}\Sigma_1(K)_{jj}/c_j^2}\right)^{1/(2\bar v)},\qquad \ell = 2, \ldots, k,$$
with respect to $(c_0, c_2, \ldots, c_k)$.
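The optimal rate invoked above follows in one line from balancing the two components of trMSE(h) (a restatement for the equal-bandwidth case, stated here for the reader's convenience):

```latex
N^{-2} h^{-(k+2)} \asymp h^{2\bar v}
\quad\Longleftrightarrow\quad
h \asymp N^{-2/(2\bar v + k + 2)},
```

which is the rate used for $h[1]$ above and cited as (3.2).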
Proof of Lemma 3.2: Lemma A.1 provides the limit covariance matrix for the vector with components $r_{Ns}(\hat\delta_N(K_s, \mathbf{h}_{Ns}) - \delta_0)$, $s = 1, \ldots, S$, with $k \times k$ blocks $\Sigma_{s_1,s_2}$, $s_1, s_2 = 1, \ldots, S$. We note here that the expression for the covariance can also be written by interchanging $s_1$ and $s_2$; thus, for different bandwidth rates, without any loss of generality we can assume that $h_{s_1} = o(h_{s_2})$. For $\mu_2 = \int K_{s_1}'(u)K_{s_2}'(u\,\mathbf{h}_{s_1}/\mathbf{h}_{s_2})^T du$ the expression under the integral converges to zero, since by symmetry $K'(0) = 0$ and we can interchange integration and passage to the limit due to continuity, providing $\mu_2 = o(1)$. We note now that only two cases of different rates are possible here: (a) a parametric rate for $s_2$ and a non-parametric rate for $s_1$; and (b) non-parametric (different) rates for both. Denote the square root of the product of bandwidth components, $(h^k)^{1/2}$, by $h^{k/2}$.

(1) Consider case (a): $Nh_{s_1}^kh_{s_1}^2 \to 0$; $Nh_{s_2}^kh_{s_2}^2 \to \infty$. Then
$$\mathrm{Cov}\big(Nh_{s_1}^{k/2}h_{s_1}\hat\delta_N(K_{s_1}, \mathbf{h}_{s_1}),\ \sqrt N\,\hat\delta_N(K_{s_2}, \mathbf{h}_{s_2})\big) = N^{3/2}h_{s_1}^{k/2}h_{s_1}\big[N^{-2}h_{s_2}^{-k}h_{s_1}^{-1}o(1)h_{s_2}^{-1} + N^{-1}O(1)\big]$$
$$= o\big(N^{-1/2}h_{s_1}^{k/2}h_{s_2}^{-k}h_{s_2}^{-1}\big) + O\big(N^{1/2}h_{s_1}^{k/2}h_{s_1}\big) = o(1).$$

(2) For case (b): $Nh_{s_1}^kh_{s_1}^2 \to 0$; $Nh_{s_2}^kh_{s_2}^2 \to 0$, we get
$$\mathrm{Cov}\big(Nh_{s_1}^{k/2}h_{s_1}\hat\delta_N(K_{s_1}, \mathbf{h}_{s_1}),\ Nh_{s_2}^{k/2}h_{s_2}\hat\delta_N(K_{s_2}, \mathbf{h}_{s_2})\big) = N^2h_{s_1}^{k/2}h_{s_1}h_{s_2}^{k/2}h_{s_2}\big[N^{-2}h_{s_2}^{-k}h_{s_1}^{-1}o(1)h_{s_2}^{-1} + N^{-1}O(1)\big]$$
$$= o\big(h_{s_1}^{k/2}h_{s_2}^{-k/2}\big) + O\big(Nh_{s_1}^{k/2}h_{s_1}h_{s_2}^{k/2}h_{s_2}\big) = o(1). \qquad\square$$
Proof of Theorem 3.2: Consider each component i separately and suppress the subscript i. First we find weights on the ith component that eliminate the ith leading component of the bias of the combination, and show that we can have the norm of this weight vector less than one; then we show that as a result the term coming from the variance is smaller than the corresponding term for the optimal bandwidth. Solve first
$$\min \sum_{s=1}^S a_s^2,\qquad\text{subject to}\ \sum_{s=1}^S a_s = 1\ \text{and}\ \sum_{s=1}^S a_s B_s h_s^{\bar v} = 0. \qquad(A.5)$$
Denoting $B_sh_s^{\bar v}$ by $b_s$, the Lagrangian is
$$\sum_{s=1}^S a_s^2 - \lambda\Big(\sum_{s=1}^S a_s - 1\Big) - \theta\sum_{s=1}^S a_sb_s.$$
From the FOC, we obtain
$$\lambda = 2\sum a_s^2;\qquad \theta = \frac{2 - 2S\sum a_s^2}{\sum b_s};\qquad\text{and}\qquad a_s = \frac{1}{2}(\lambda + \theta b_s).$$
Denoting $\sum a_s^2$ by α, we obtain $a_s = \alpha + \frac{b_s}{\sum b_s}(1 - S\alpha)$. By squaring and summing $a_s$ for $s = 1, \ldots, S$, we get
$$\alpha = S\alpha^2 + 2\alpha(1 - S\alpha) + (1 - S\alpha)^2\frac{\sum b_s^2}{(\sum b_s)^2}.$$
This quadratic equation for α has a root of $\frac{1}{S}$ as a solution to the FOC. We denote $\Sigma_{1s_1s_2} = \Sigma_1(K, K, \mathbf{h}_{s_1}, \mathbf{h}_{s_2})$ and $\mu_{2s_1s_2} = \mu_2(K, K, \mathbf{h}_{s_1}, \mathbf{h}_{s_2})$ (defined in Lemma A.1), and recall the definitions of $\Sigma_1(K)$ and $\mu_2(K)$ from Lemma 3.1.
Next we show that $(h_{s_1}^kh_{s_2}^{-k})^{1/2}\{\Sigma_{1s_1s_2}\}_{ii} \le \{\Sigma_1\}_{ii}$, with $\{\cdot\}_{ii}$ denoting the ith diagonal element. This follows since
$$\big(h_{s_1}^kh_{s_2}^{-k}\big)^{1/2}\{\mu_{2s_1s_2}\}_{ii} = \big(h_{s_1}^kh_{s_2}^{-k}\big)^{1/2}\int\cdots\int K'(u_i)K'\Big(u_i\frac{h_{s_1}[i]}{h_{s_2}[i]}\Big)\prod_{m\ne i}K(u_m)K\Big(u_m\frac{h_{s_1}[m]}{h_{s_2}[m]}\Big)du_1\cdots du_k$$
$$= \int K'(u_i)K'\Big(u_i\frac{h_{s_1}[i]}{h_{s_2}[i]}\Big)\Big(\frac{h_{s_1}[i]}{h_{s_2}[i]}\Big)^{1/2}du_i\ \prod_{m\ne i}\int K(u_m)K\Big(u_m\frac{h_{s_1}[m]}{h_{s_2}[m]}\Big)\Big(\frac{h_{s_1}[m]}{h_{s_2}[m]}\Big)^{1/2}du_m$$
$$\le \int K'(u_i)^2du_i\ \prod_{m\ne i}\int K(u_m)^2du_m = \{\mu_2(K)\}_{ii},$$
where the last line follows from
$$\varphi^{1/2}\int G(u)G(\varphi u)\,du \le \Big(\int G(u)^2du\Big)^{1/2}\Big(\varphi\int G(\varphi u)^2du\Big)^{1/2} = \int G(u)^2du$$
for any $\varphi > 0$ and $G(\cdot)$. Now we can evaluate, for the combination with the weights that solve (A.5) with $\alpha = \frac{1}{S}$, the part of the ith diagonal element of trAMSE coming from the variance that depends on the bandwidth (it involves the leading term of $\{\Sigma_{s_1,s_2}\}_{ii}$; see Lemma A.1). With $h_{opt}^k$ denoting the product of the components of $\mathbf{h}^{opt}$, we observe that its comparable component in trAMSE($\hat\delta_N(K, \mathbf{h}^{opt})$) is given by $N^{-2}h_{opt}^{-k}(h^{opt}[i])^{-2}\{\Sigma_1\}_{ii}$ (see Lemma 3.1). Now,
$$\sum_{s_1,s_2=1}^S N^{-2}h_{s_2}^{-k}(h_{s_1}[i]h_{s_2}[i])^{-1}\{\Sigma_{1s_1s_2}\}_{ii}\,a_{s_1i}a_{s_2i} < \max_{s_1,s_2}\Big\{N^{-2}\big(h_{s_1}^{-k}h_{s_2}^{-k}\big)^{1/2}(h_{s_1}[i]h_{s_2}[i])^{-1}\Big\}\{\Sigma_1\}_{ii}\sum_{s_1,s_2=1}^S|a_{s_1i}a_{s_2i}|$$
$$< N^{-2}h_{opt}^{-k}\big(h^{opt}[i]\big)^{-2}\{\Sigma_1\}_{ii}\,\alpha < N^{-2}h_{opt}^{-k}\big(h^{opt}[i]\big)^{-2}\{\Sigma_1\}_{ii}\,\frac{1}{S}.$$
With $\{h_s[\ell]\} = c_s\{h^{opt}[\ell]\}$, $c_s > 1$, $\ell = 1, \ldots, k$, the second inequality reflects the fact that
$$\big(h_{s_1}^{-k}h_{s_2}^{-k}\big)^{1/2}(h_{s_1}[i]h_{s_2}[i])^{-1} < h_{opt}^{-k}\big(h^{opt}[i]\big)^{-2}$$
and uses Cauchy's inequality $\sum_{s_1,s_2=1}^S|a_{s_1i}a_{s_2i}| \le (\sum_{s_1}|a_{s_1i}|^2)^{1/2}(\sum_{s_2}|a_{s_2i}|^2)^{1/2} = \alpha^{1/2}\alpha^{1/2} = \alpha$. The last inequality uses $\alpha = 1/S$. Recall that the part of trAMSE that involves the matrix $\Sigma_2$ does not depend on the weights. Thus the sum of the k diagonal elements in the trAMSE of the linear combination is no greater than that for the optimal if $S > k$, enabling this linear combination to be strictly better than the individual ADE based on the optimal bandwidth, $\hat\delta_N(K, \mathbf{h}^{opt})$. $\square$

Proof of Theorem 3.3: (a) We utilize the expression for the bias given in Assumption 2.6(a) component-wise:
$$E(\hat\delta_N(K, \mathbf{h}) - \delta_0) = \begin{pmatrix} h[1]^{\bar v_1}(B_1(K) + o(1))\\ \vdots\\ h[k]^{\bar v_k}(B_k(K) + o(1))\end{pmatrix}. \qquad(A.6)$$
Following Assumption 2.6(b), we consider constant $\bar v$. Using (A.6) and Lemma A.1 we can write for bandwidth vector $\mathbf{h}_t$
$$\hat\delta_N(K, \mathbf{h}_t) = \delta_0 + h_t^{\bar v}\big[B(K) + o_p(1) + \psi_t\big], \qquad(A.7)$$
where $\psi_t = h_t^{-\bar v}O_p\big(N^{-1}h_t^{-(k+2)/2} + N^{-1/2}\big)$ by Chebyshev's inequality. For each kernel consider the sequence of bandwidths $\mathbf{h}_t = c_th^{gcv}N^{\gamma_t}$, $t = 1, \ldots, H$, with $\gamma_t > 0$ (to ensure oversmoothing) and $\gamma_t < \frac{1}{2\bar v+k}$ to ensure that the bandwidths converge to zero; the condition $\gamma_t < \frac{1}{2\bar v+k}$ relies on the unknown $\bar v$; the more smoothness, the tighter is the bound on $\gamma_t$; thus we can replace this condition by $\gamma_t < \frac{1}{2v(K)+k}$. For $H \ge 2$ we obtain a sequence of bandwidth vectors $\{\mathbf{h}_t\}_{t=1}^H$ for which the bias term dominates in the MSE for this estimator, so that $\psi_t = o_p(1)$. For this sequence of bandwidths, then,
$$\hat\delta_N(K, \mathbf{h}_t) = \delta_0 + h_t^{\bar v}\big[B(K) + o_p(1)\big].$$
Difference these equations component-wise to get rid of $\delta_0$; then for the ℓth component, based on two distinct bandwidth vectors $\mathbf{h}_t, \mathbf{h}_{t'}$ ($t, t' = 1, \ldots, H$),
$$\hat\delta_N(K, \mathbf{h}_t)_\ell - \hat\delta_N(K, \mathbf{h}_{t'})_\ell = \big(h_t[\ell]^{\bar v} - h_{t'}[\ell]^{\bar v}\big)B_\ell(K) + o_p\big(h_t[\ell]^{\bar v} + h_{t'}[\ell]^{\bar v}\big),\qquad \ell = 1, \ldots, k.$$
When $h_{t'} = o(h_t)$ we get
$$\hat\delta_N(K, \mathbf{h}_t)_\ell - \hat\delta_N(K, \mathbf{h}_{t'})_\ell = h_t[\ell]^{\bar v}\big(B_\ell(K) + o_p(1)\big),\qquad \ell = 1, \ldots, k. \qquad(A.8)$$
For each ℓ we define a subset $T_\ell$ of all pairs $\{(\mathbf{h}_t, \mathbf{h}_{t'}),\ t, t' = 1, \ldots, H\ \text{with}\ t' < t\}$ with cardinality $Q_\ell$: $2 \le Q_\ell \le \frac{H(H+1)}{2}$; we consider for each ℓ the following $Q_\ell$ equations ($(t, t') \in T_\ell$):
$$\ln\big(\hat\delta_N(K, \mathbf{h}_t)_\ell - \hat\delta_N(K, \mathbf{h}_{t'})_\ell\big)^2 = \bar v\ln h_t[\ell]^2 + \ln B_\ell^2(K) + e_\ell,\qquad \ell = 1, \ldots, k, \qquad(A.9)$$
with $e_\ell = o_p(1)$. We obtain these equations by squaring both sides of (A.8) and applying the natural logarithm transformation. The rhs of (A.9) uses $(h_t[\ell]^{\bar v}B_\ell(K) + o_p(h_t[\ell]^{\bar v}))^2 = h_t[\ell]^{2\bar v}B_\ell^2(K) + o_p(h_t[\ell]^{2\bar v})$ and follows from an expansion of the ln function: $\ln\big(h_t[\ell]^{2\bar v}B_\ell^2(K) + o_p(h_t[\ell]^{2\bar v})\big) = \ln\big(h_t[\ell]^{2\bar v}B_\ell^2(K)\big) + o_p(1)$. Define $w_t = \big[\ln h_t[\ell]^2 - \frac{1}{Q_\ell}\sum_{(t,t')\in T_\ell}\ln h_t[\ell]^2\big]$ and consider the least squares estimator
$$\hat{\bar v}_\ell = \frac{\sum_{(t,t')\in T_\ell}\ln\big\{\hat\delta_N(K, \mathbf{h}_t)_\ell - \hat\delta_N(K, \mathbf{h}_{t'})_\ell\big\}^2\,w_t}{\sum_{(t,t')\in T_\ell}w_t^2} = \bar v + \frac{\sum_{(t,t')\in T_\ell}e_\ell w_t}{\sum_{(t,t')\in T_\ell}w_t^2}.$$
With the bandwidth vectors $\mathbf{h}_t = c_tN^{(\gamma_t - 1/(2\bar v+2))}$, $\gamma_1 < \cdots < \gamma_t < \cdots < \gamma_H < \frac{1}{2v(K)+k}$, for some constants $c_t$, we note that the (non-stochastic) regressors are trending as N increases: $w_t = O(\ln N)$, since $w_t = 2\big[\ln c_t - \frac{1}{Q_\ell}\sum_{(t,t')\in T_\ell}\ln c_t\big] + 2\big\{\gamma_t - \frac{1}{Q_\ell}\sum_{(t,t')\in T_\ell}\gamma_t\big\}\ln N$. So
$$\ln N\,\frac{\sum_{(t,t')\in T_\ell}e_\ell w_t}{\sum_{(t,t')\in T_\ell}w_t^2} = o_p(1)$$
and for any ℓ we get $|\hat{\bar v}_\ell - \bar v| = o_p\big((\ln N)^{-1}\big)$.
(b) Without loss of generality we assume that all bandwidth components are the same, so that both $h^{\bar v}$ and $\mathbf{h}^{-1}$ can be read as scalars. Using (A.7) we write
$$\hat\delta_N(K, h_o) = \delta_0 + h_o^{\bar v}\Big[B(K) + o_p(1) + h_o^{-\bar v}O_p\big(N^{-1}h_o^{-(k+2)/2} + N^{-1/2}\big)\Big].$$
With $|\hat{\bar v} - \bar v| = o_p((\ln N)^{-1})$ we note
$$h_o^{-\bar v}O_p\big(N^{-1}h_o^{-(k+2)/2} + N^{-1/2}\big) = \big(h_o^{-\hat{\bar v}} + o_p(1)\big)O_p\big(N^{-1}h_o^{-(k+2)/2} + N^{-1/2}\big).$$
Recognizing that
$$h_o^{-\hat{\bar v}}N^{-1/2} = O_p(1)\,N^{\frac{2\hat{\bar v}-(k+2)-2(2\hat{\bar v}+k+2)\hat{\bar v}\zeta}{2(2\hat{\bar v}+k+2)}} = o_p(1)\qquad\text{and}\qquad h_o^{-\hat{\bar v}}N^{-1}h_o^{-(k+2)/2} = O_p(1)\,N^{-\frac{(2\hat{\bar v}+k+2)\zeta}{2}} = o_p(1),$$
with ζ satisfying the lower bound requirement, we obtain
$$\hat\delta_N(K, h_o) = \delta_0 + h_o^{\bar v}\big[B(K) + o_p(1)\big].$$
Now similarly we investigate for $h_u$ the rates in
$$\hat\delta_N(K, h_u) = \delta_0 + h_u^{\bar v}\big[B(K) + o_p(1)\big] + O_p\big(N^{-1}h_u^{-(k+2)/2} + N^{-1/2}\big)$$
relative to the rate of $h_o^{\bar v}$. Clearly $h_u^{\bar v} = o_p(h_o^{\bar v})$; we have shown $h_o^{-\bar v}N^{-1/2} = o_p(1)$; finally,
$$N^{-1}h_u^{-(k+2)/2}h_o^{-\hat{\bar v}}(1 + o_p(1)) = O_p(1)\,N^{-1+\frac{k+2}{2}\big(\frac{2}{2\hat{\bar v}+k+2}+\xi\big)+\hat{\bar v}\big(\frac{2}{2\hat{\bar v}+k+2}-\zeta\big)} = O_p(1)\,N^{\frac{(k+2)\xi-2\hat{\bar v}\zeta}{2}} = o_p(1).$$
Then, substituting and evaluating the rates, we obtain
$$\hat\delta_N(K, h_o) - \hat\delta_N(K, h_u) = h_o^{\bar v}\big[B(K) + o_p(1)\big],$$
revealing that the difference provides a consistent estimator for the asymptotic bias of $\hat\delta_N(K, h_o)$.
Consider now the asymptotic bias $\mathrm{Bias}\,\hat\delta_N(K, h) = h^{\bar v}B(K) = \frac{h^{\bar v}}{h_o^{\bar v}}\{h_o^{\bar v}B(K)\}$ with $h \to 0$ as $N \to \infty$. The estimator
$$\widehat{\mathrm{Bias}}\,\hat\delta_N(K, h) = \frac{h^{\hat{\bar v}}}{h_o^{\hat{\bar v}}}\,\widehat{\mathrm{Bias}}\,\hat\delta_N(K, h_o) = \frac{h^{\hat{\bar v}}}{h_o^{\hat{\bar v}}}\,h_o^{\bar v}\big[B(K) + o_p(1)\big] = \Big[\frac{h^{\bar v}}{h_o^{\bar v}} + o_p(1)\Big]h_o^{\bar v}\big[B(K) + o_p(1)\big] = h^{\bar v}\big[B(K) + o_p(1)\big]$$
and thus provides a consistent estimator of the asymptotic bias. $\square$
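Numerically, the bias estimator of Theorem 3.3(b) is a two-evaluation device. The sketch below uses the exponents 0.05 and −0.07 from the Monte Carlo section and a made-up deterministic delta(h) purely to check the construction; it is our illustration, not the authors' implementation.

```python
def bias_estimate(delta_fn, h_opt, vbar_hat, h, N):
    """Estimate Bias delta_N(K, h) ~ (h / h_o)**vbar_hat * [delta(h_o) - delta(h_u)]
    with oversmoothed h_o = h_opt * N**0.05 and undersmoothed
    h_u = h_opt * N**(-0.07) (exponents as in the Monte Carlo section).
    delta_fn maps a bandwidth to the (here scalar) ADE estimate."""
    h_o = h_opt * N ** 0.05
    h_u = h_opt * N ** (-0.07)
    return (h / h_o) ** vbar_hat * (delta_fn(h_o) - delta_fn(h_u))

# Check on a made-up smooth case delta(h) = delta_0 + B h**vbar with B = 2, vbar = 2;
# the true asymptotic bias at h = 0.3 is B * h**2 = 0.18.
est = bias_estimate(lambda h: 1.0 + 2.0 * h ** 2, h_opt=0.5, vbar_hat=2.0, h=0.3, N=10000)
```

The estimate is slightly below the truth because $h_u^{\bar v}$, while small relative to $h_o^{\bar v}$, is not exactly zero in finite samples.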
The Econometrics Journal

Econometrics Journal (2010), volume 13, pp. 63–94. doi: 10.1111/j.1368-423X.2009.00302.x
Unit root inference in panel data models where the time-series dimension is fixed: a comparison of different tests

EDITH MADSEN†
†Department of Economics and CAM, University of Copenhagen, Øster Farimagsgade 5, DK-1353 Copenhagen K, Denmark
E-mail: [email protected]

First version received: September 2008; final version accepted: September 2009
Summary The objective of the paper is to investigate and compare the performance of some of the unit root tests in micropanels that have been suggested in the literature. The framework is a first-order autoregressive panel data model allowing for heterogeneity in the intercept but not in the autoregressive parameter. The tests are all based on the usual t-statistics corresponding to least squares estimators of the autoregressive parameter resulting from different transformations of the observed variables. The performance of the tests is investigated and compared by deriving their local power when the autoregressive parameter is local-to-unity. The results show that the assumption concerning the initial values is extremely important in this matter. The outcome of a simulation experiment demonstrates that the local power of the tests provides a good approximation to their actual power in finite samples. Keywords: Dynamic panel data model, Initial values, Local alternatives, Unit roots.
1. INTRODUCTION

In this paper, we investigate unit root inference in panel data models where the cross-section dimension is much larger than the time-series dimension; that is, we consider traditional micropanels. There is now a large econometric literature dealing with unit root testing in panel data models, developed over the last 10 years. In contrast to the earlier literature on dynamic panel data models, a large part of this new literature considers macropanels, where the cross-section and time-series dimensions are similar in magnitude. Banerjee (1999), Baltagi and Kao (2000) and Breitung and Pesaran (2008) review many of the contributions to the literature on unit root testing in panel data models. Reviews of the literature on dynamic micropanels are provided in Hsiao (1986), Baltagi (1995) and Arellano (2003), of which only the latter discusses the issue of unit roots.
The analysis in this paper is done within the framework of a first-order autoregressive panel data model allowing for individual-specific levels. This means that we are testing the null hypothesis of each time-series process being a random walk without drift against the alternative hypothesis of each time-series process being stationary with individual-specific levels but the same autoregressive parameter for all cross-section units. In the autoregressive panel
E. Madsen
data model there are two sources of persistence. One is the autoregressive mechanism, which is the same for all cross-section units, and the other is the unobserved individual-specific means. Everything else being equal, a high value of the autoregressive parameter means that more persistence is attributed to the autoregressive mechanism. The null hypothesis means that the effect of unanticipated shocks will persist over time, whereas the alternative hypothesis means that the effect will eventually disappear as time goes by. The hypothesis is of interest since many economic variables at the individual level, such as income of individuals and firm-level variables, are found to be highly persistent over time. The main contribution of the paper is to provide analytical results about the performance of some of the unit root tests which have been suggested in the literature. This is done by deriving the limiting distributions of the corresponding test statistics under local alternatives when the autoregressive parameter is local-to-unity. The results are used to compare the performance of the different tests in terms of their local power. In addition, the results reveal how the local power of the tests is affected by the nuisance parameters of the data-generating process (DGP). So far the power properties of unit root tests in micropanels have been investigated and compared in simulation studies; see, for example, Bond et al. (2002) and Hall and Mairesse (2005). However, the outcome of these might depend on the particular choice of nuisance parameters in the simulation set-up in a non-transparent way, so analytical results seem a useful contribution within this research area. The paper by Breitung (2000) is related to this paper, as it compares the local power of some of the unit root tests in macropanels. We consider three different unit root tests. They are all based on t-statistics corresponding to different least-squares (LS) estimators of the autoregressive parameter.
The reason that this is not a trivial testing problem is the presence of the individual-specific (incidental) intercepts. Without these parameters, standard testing theory implies that the t-statistic based on the OLS estimator of the autoregressive parameter in the original model gives a test which is asymptotically optimal. This is the first test we consider, and we would expect it to perform well in terms of having high power when there is no or little variation in the individual-specific intercepts. On the other hand, when the variation in the individual-specific intercepts is high, the OLS estimator has a substantial positive asymptotic bias and therefore the OLS unit root test is expected to have low power in this case. The other two tests we consider are both invariant with respect to adding an individual-specific constant to all variables, but they differ in the way in which this invariance is obtained. In other words, they use different ways of removing the individual-specific means from the variables. One subtracts the initial values from all variables and was suggested by Breitung and Meyer (1994); the other subtracts the respective individual-specific time-series means of the variables from both sides of the equation and was suggested by Harris and Tzavalis (1999). The Breitung–Meyer test and the Harris–Tzavalis test are panel data versions of the unit root tests in single time series suggested by Schmidt and Phillips (1992) and Dickey and Fuller (1979), respectively. The Breitung–Meyer estimator of the autoregressive parameter is consistent under the null hypothesis, whereas the Harris–Tzavalis estimator (the within-group estimator) is inconsistent and therefore a bias adjustment is necessary.
Both estimators are inconsistent under the alternative hypothesis, meaning that the removal of the individual-specific means (which cause the inconsistency of the OLS estimator) leads to new sources of asymptotic bias. From the description above, it is not straightforward to determine which test is best in terms of having the highest power. It turns out that the initial values are crucial for the performance of the tests in terms of asymptotic power under local alternatives. In general it is always important
Unit root inference in panel data models where the time-series dimension is fixed
to be aware of the power properties when applying a statistical test in practice, and if there are several tests to choose from it is especially important to understand how their performance is affected by nuisance parameters in order to choose the best testing procedure. Even if no single test outperforms the others for all values of the nuisance parameters, it is important to understand under which assumptions the tests are likely to have high or low power. The importance of the initial values when testing the unit root hypothesis in micropanels is a result which is also found for single time series and macropanels; see Müller and Elliott (2003) and Harris et al. (2008), respectively. An important finding is that under mean stationary alternatives the local power of the Breitung–Meyer test is always higher than the local power of the Harris–Tzavalis test. This result is similar to findings in Moon et al. (2007), where they derive the power envelope of unit root tests in macropanels. In the case with individual-specific intercepts they find that, within the class of tests that are invariant with respect to individual-specific constants, the macropanel version of the Breitung–Meyer test has asymptotic power equal to the power envelope, whereas the macropanel version of the Harris–Tzavalis test suggested by Levin et al. (2002) has lower asymptotic power than the power envelope. This result is different from the findings for single time-series versions of these unit root tests, where the Schmidt–Phillips test is close to being optimal for values of the autoregressive parameter close to unity whereas the Dickey–Fuller test is close to being optimal for values of the autoregressive parameter close to zero; see Hwang and Schmidt (1996). An important difference is that in macropanels it is possible to find tests that are uniformly most powerful, whereas this is not possible in single time series.
In addition, we find that when there is no or little variation in the individual-specific means, the local power of the OLS test is higher than the local power of the Breitung–Meyer test. This implies that estimating the individual-specific means causes a decrease in local power: the number of observations over time which contain information about the individual-specific means remains constant, and hence it matters whether or not the individual-specific means are estimated. The paper is organized as follows. In Section 2, the basic model is specified. In Section 3, we investigate and compare the different unit root tests described above. This is done by deriving the limiting distributions of the corresponding test statistics under local alternatives. In Section 4, the analytical results are illustrated in a simulation study. In Section 5, we provide some concluding remarks. Proofs are provided in the Appendix.
2. THE MODEL AND ASSUMPTIONS

We consider the first-order autoregressive panel data model with individual-specific intercepts defined by

yit = ρ yit−1 + (1 − ρ)αi + εit   for i = 1, . . . , N and t = 1, . . . , T,   (2.1)
where −1 < ρ ≤ 1 and for every i = 1, . . . , N the sequence {εit}, t = 1, 2, . . . , is white noise. For notational convenience we assume that the initial values yi0 are observed, so that the actual number of observations over time equals T + 1. The model provides a framework for testing the null hypothesis of each time-series process being a random walk against the alternative
hypothesis of each time-series process being stationary with an individual-specific level. To specify the model further, the assumptions below are imposed.

ASSUMPTION 2.1. εit is independent across i and t with E(εit) = 0, E(εit²) = σiε² and E(εit⁴) = E(εis⁴) for all t, s = 1, . . . , T. In addition, εit is independent of αi and yi0.

ASSUMPTION 2.2. αi is i.i.d. across i with E(αi) = 0, E(αi²) = σα² and E(αi⁴) < ∞.

ASSUMPTION 2.3. For −1 < ρ ≤ 1 the initial values satisfy yi0 = αi + √τ εi0, where εi0 is independent of αi and independent across i with E(εi0) = 0 and E(εi0²) = σiε².

ASSUMPTION 2.4. The following hold: (i) E|εit|^(4+δ) < K < ∞ for some δ > 0 and all i = 1, . . . , N, t = 0, 1, . . . , T. (ii) (1/N) ∑ᵢ σiε² → σ2ε > 0 as N → ∞. (iii) (1/N) ∑ᵢ σiε⁴ → σ4ε as N → ∞. (iv) (1/N) ∑ᵢ E(εit⁴) → m4 as N → ∞.

Assumption 2.1 states that the errors εit are independent over cross-section units and time and are allowed to be heteroscedastic over cross-section units but not over time. Further, they are independent of the individual-specific term αi and the initial value yi0. The assumption of independence over time is stronger than the usual assumption that εit is serially uncorrelated; it is a simplifying assumption made in order to derive the asymptotic properties of the test statistics in Section 3. Assumption 2.2 states that the αi's are i.i.d. across cross-section units and, again, it is made in order to simplify the derivation of the results in the next section. Note that the assumption that E(αi) = 0 means that we interpret the model in (2.1) as describing the behaviour of the observed variables after having subtracted the overall or the time-specific means. In practice, it means that as a starting point we subtract either the overall or the time-specific sample means from all observed variables.
This type of transformation maintains the asymptotic properties of LS estimators and related statistics, so we can consider the model in (2.1) with i.i.d. mean-zero terms as the starting point after having subtracted the cross-section sample means from all variables. A similar result is shown in detail in Madsen (2005) within the framework of a pure cross-section analysis. Assumption 2.3 specifies the initial values and implies that they are such that the time-series processes for yit become mean stationary, that is, E(yit | αi) = αi for all t = 0, 1, 2, . . . . It implies that it is possible to remove the individual-specific means from the observed variables by simple linear transformations. The parameter τ describes the dispersion of the initial deviation from the stationary level. If the initial values are such that the time-series processes are covariance stationary then τ = 1/(1 − ρ²). This condition is only meaningful when −1 < ρ < 1. We see that as ρ approaches unity the parameter τ tends to infinity, so that all variables are dominated by the initial deviation from the individual-specific mean. In the next section we formalize this property, as it turns out to be important for the results in this paper. Note that εit is independent of εi0 by Assumption 2.1. Finally, Assumption 2.4 is a technical assumption which enables us to derive the asymptotic properties of the statistics of interest by applying standard asymptotic theory. The assumption states that the innovations εit have uniformly bounded moments of order slightly greater than four and that the cross-section averages of their variances, squared variances and fourth-order moments have well-defined limits as the cross-section dimension N tends to infinity. Note that when the errors εit are homoscedastic across units then σ4ε = σ2ε². Assumption 2.4(iv) is only required in relation to the test statistic suggested by Harris and Tzavalis (1999), as this is the only statistic of the ones considered in this paper which depends on fourth-order moments. Also note that σ4ε − σ2ε² = lim N→∞ (1/N) ∑ᵢ (σiε² − σ2ε)² ≥ 0 and m4 − σ4ε = lim N→∞ (1/N) ∑ᵢ E((εit² − σiε²)²) ≥ 0, such that σ2ε² ≤ σ4ε ≤ m4.
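To make the data-generating process concrete, a minimal simulation of the model in (2.1) with mean-stationary initial values (Assumption 2.3) might look as follows. The Gaussian errors, homoscedasticity across units (σiε = 1) and the parameter values are illustrative assumptions, not part of the model.

```python
import numpy as np

def simulate_panel(N, T, rho, sigma_alpha, tau, rng):
    """Draw y_it = rho*y_it-1 + (1 - rho)*alpha_i + eps_it for t = 1, ..., T,
    with mean-stationary initial values y_i0 = alpha_i + sqrt(tau)*eps_i0,
    so that E(y_it | alpha_i) = alpha_i for all t (Assumption 2.3).
    Errors are homoscedastic standard normal for simplicity."""
    alpha = rng.normal(0.0, sigma_alpha, size=N)       # individual-specific levels
    y = np.empty((N, T + 1))
    y[:, 0] = alpha + np.sqrt(tau) * rng.normal(size=N)
    for t in range(1, T + 1):
        y[:, t] = rho * y[:, t - 1] + (1 - rho) * alpha + rng.normal(size=N)
    return y

rng = np.random.default_rng(0)
rho = 0.95
# tau = 1/(1 - rho^2) makes each process covariance stationary (kappa = 1)
y = simulate_panel(N=1000, T=4, rho=rho, sigma_alpha=1.0,
                   tau=1.0 / (1.0 - rho**2), rng=rng)
```

Under covariance stationarity Var(yit) = σα² + 1/(1 − ρ²) for every t, which gives a quick sanity check on such a simulator.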
3. THE TEST STATISTICS AND THEIR ASYMPTOTIC PROPERTIES

We consider the testing problem where the null hypothesis and the alternative hypothesis are given by

H0 : ρ = 1   against   HA : |ρ| < 1.   (3.1)

In the following, we consider local alternatives where ρ is modelled as being local-to-unity. More specifically, we consider local-to-unity sequences for ρ defined by

ρN = 1 − c/Nᵏ   for k, c > 0.   (3.2)
This means that as the sample size N increases, the value of the parameter ρ is in an N⁻ᵏ neighbourhood of unity. So instead of deriving asymptotic representations based on ρ being constant as N increases, we derive asymptotic representations based on c = (1 − ρN)Nᵏ being constant as N increases. The idea is that these representations will provide good approximations to the actual distributions of the relevant statistics. With one exception, the LS estimators of ρ considered in this paper converge weakly to normal distributions at the rate √N, and therefore we consider local-to-unity sequences for ρ with k = 1/2. In one situation, the LS estimator must be normalized differently in order to converge weakly to a non-degenerate distribution under the local alternative, and the local-to-unity sequence is defined accordingly. Note that c = 0 corresponds to the null hypothesis of ρ being unity. In the next section we show that the local power of the different unit root tests that we consider depends on c and possibly the other parameters in the model. For a test that has non-trivial power against a local alternative where ρN = 1 − c/√N, this means that when reducing (1 − ρ) by half (for example, from ρ = 0.950 to ρ = 0.975) the number of observations in the cross-section dimension N must be four times as large in order to attain the same level of local power. On the other hand, when a test has power against a local alternative where ρN = 1 − c/N, then when reducing (1 − ρ) by half the number of observations in the cross-section dimension N only has to be twice as large in order to attain the same level of local power. It turns out that the assumption being made about the variation of the initial deviation from the mean stationary level is crucial for the limiting distributions of the different statistics under the local alternative defined by (3.2). We consider the following two situations:

(i): τ is fixed,   (3.3)

(ii): τ = κ/(1 − ρ²)   for κ > 0.   (3.4)
(i) means that the initial deviation from the stationary level is described by a parameter τ that remains constant as ρ approaches unity. (ii) means that the variance of the initial deviation from the stationary level is proportional to the variance of the autoregressive process. The specification in (ii) contains the case where the time-series processes are covariance stationary (κ = 1), which in particular implies that the variances of the observed variables are constant over time. In (ii), τ depends on ρ and goes to infinity as ρ approaches unity, and it is not defined for ρ equal to unity. This means that the two formulations are fundamentally different.
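The distinction can be checked numerically: under scheme (ii) with κ = 1, τ diverges at rate Nᵏ along the local-to-unity sequence, with leading coefficient b = κ/(2c), the relation the paper attributes to Lemma A.1. The values of c and k below are illustrative.

```python
# Scheme (i): tau is a fixed number, whatever rho is.
# Scheme (ii): tau = kappa/(1 - rho^2), which diverges as rho -> 1.
kappa, c, k = 1.0, 2.0, 0.5

def tau_scheme_ii(N):
    rho_N = 1.0 - c / N**k          # local-to-unity sequence (3.2)
    return kappa / (1.0 - rho_N**2)

for N in (100, 10_000, 1_000_000):
    b = kappa / (2.0 * c)           # leading coefficient of tau_N = b*N^k + o(N^k)
    print(N, tau_scheme_ii(N) / (b * N**k))   # ratio tends to 1 as N grows
```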
Under the local-to-unity sequence for ρ given by ρN = 1 − c/Nᵏ, the formulations in (3.3)–(3.4) correspond to

(i): τN = τ   for τ ≥ 0,   (3.5)

(ii): τN = bNᵏ + o(Nᵏ)   for k, b > 0.   (3.6)
Note that (3.6) is more general than (3.4), since b and c might take values independently of each other. Equation (3.4) corresponds to (3.6) with b = κ/(2c), see Lemma A.1 in Section A.1 of the Appendix, implying that the parameters b and c are not independent of each other but lie on a specific path in the parameter space. In (3.5), τ is a fixed parameter, so that the term √τ εi0 is of the same order of magnitude as the remaining terms in the expression for the variables yit. On the other hand, in (3.6) this term dominates the behaviour of the variables yit asymptotically as N tends to infinity, since then we have √τN εi0 = OP(N^(k/2)). The interpretation is that the behaviour of the observed variables yi0, . . . , yiT is dominated by the initial deviation from the mean stationary level (yi0 − αi) = √τ εi0. In a time-series framework, the assumption that the initial values are such that the time series become covariance stationary seems very natural, as it implies that the initial values are of the same order of magnitude as the remaining terms describing the observed variables as the number of observations over time goes to infinity. In a panel data framework this is not the case, since it implies that the initial values are of a higher order of magnitude. This also means that the results about how the initial values affect the test statistics in single time series might not carry over to macro- and micropanels. In this paper, we focus on the cases where τ is fixed (corresponding to b = 0) and covariance stationarity (corresponding to b = 1/(2c)). It could also be the case that b = κ/(2c) (the variance of the initial values is proportional to the variance of the autoregressive process) or that b > 0 is independent of c (the variance of the initial values is very high but does not depend on the value of the autoregressive parameter), and we briefly discuss how this affects our results.

3.1. OLS

The equation in (2.1) can be rewritten as the following regression model:

yit = ρ yit−1 + vit,   vit = (1 − ρ)αi + εit,   for i = 1, . . . , N and t = 1, . . . , T.

The OLS estimator of the autoregressive parameter ρ is defined by

ρ̂OLS = (∑ᵢ y′i,−1 yi,−1)⁻¹ ∑ᵢ y′i,−1 yi,   (3.7)
where yi = (yi1, . . . , yiT)′ and yi,−1 = (yi0, . . . , yiT−1)′. The estimator is consistent when ρ = 1 but inconsistent when |ρ| < 1. In the latter case, the inconsistency is attributable to the term αi, which appears in both the regressor yit−1 and the regression error vit. As αi appears with the factor (1 − ρ) in vit, the covariance between the regressor and the regression error is positive and decreases towards zero as ρ approaches unity. Now the regressor yit−1 can be expressed as the sum of the two independent terms αi and (yit−1 − αi), which are the stationary level and the
deviation from the stationary level, respectively. If the variability of the two terms is of similar order as ρ approaches unity, the asymptotic bias of ρ̂OLS is positive and decreases towards zero as ρ approaches unity. This describes the situation where the variance of the initial deviation from the mean stationary level is fixed. On the other hand, if the behaviour of yit−1 is dominated by the term (yit−1 − αi) as ρ approaches unity, the asymptotic bias of ρ̂OLS will be zero when ρ approaches unity. This describes the situation where the initial values are such that the time-series processes become covariance stationary. The discussion above is formalized by the results given in Proposition 3.1 below. The proposition provides the limiting distribution of the OLS estimator ρ̂OLS under both the null hypothesis when ρ is unity and local alternatives when ρ is local-to-unity. We consider different local alternatives depending on the assumption being made about the initial values as given by equations (3.5)–(3.6).

PROPOSITION 3.1. Under Assumptions 2.1–2.4 and the local-to-unity sequence for ρ given by ρN = 1 − c/√N for c ≥ 0 and when τ is fixed, the limiting distribution of the OLS estimator ρ̂OLS is given by

√N (ρ̂OLS − ρN) →w N( c (σα²/σ2ε) / (σα²/σ2ε + τ + (T − 1)/2) , (1/T) (σα²/σ2ε + (σ4ε/σ2ε²)(τ + (T − 1)/2)) / (σα²/σ2ε + τ + (T − 1)/2)² )   as N → ∞.   (3.8)

Under Assumptions 2.1–2.4 and the local-to-unity sequence for ρ given by ρN = 1 − c/N for c ≥ 0 and when τN = bN + o(N) for b > 0, the limiting distribution of the OLS estimator ρ̂OLS is given by

N (ρ̂OLS − ρN) →w N( 0 , (1/(Tb)) (σ4ε/σ2ε²) )   as N → ∞.   (3.9)

The proposition shows that in the unit root case, when c = 0 and τ is fixed, the estimator ρ̂OLS is √N-consistent and its limiting variance is decreasing in τ, T and σ2ε²/σ4ε. Under the local alternative the estimator ρ̂OLS has an asymptotic bias of order 1/√N which is always positive, increasing in c and σα²/σ2ε and decreasing in τ and T.
The limiting variance of ρ̂OLS is decreasing in τ, T and σ2ε²/σ4ε, increasing in σα²/σ2ε, and does not depend on the location parameter c. On the other hand, when the variables are dominated by the initial deviation from the mean stationary level, the estimator ρ̂OLS is N-consistent for all values of c. In this case ρ̂OLS estimates the parameter ρ very precisely even when its true value is close to unity. Further, the limiting variance of ρ̂OLS is decreasing in b, T and σ2ε²/σ4ε. The covariance stationary local alternative corresponds to b = 1/(2c), such that the limiting variance is increasing in c. This rather surprising result is explained as follows. Under this assumption about the initial values, which in particular holds under covariance stationarity, the behaviour of yit for t = 0, . . . , T is dominated by the initial deviation from the stationary level (yi0 − αi). More specifically, the variation of (yi0 − αi) is of order N under the local-to-unity sequence for ρ given by 1 − c/N for c > 0, see the result in (3.6), whereas the variation of the remaining terms in yit is bounded as N tends to infinity. This implies that the numerator in (3.7) must be normalized by N in order to converge in distribution and the denominator in (3.7) must be normalized by N² in order to converge in probability. The consistency is a result of the term (yi0 − αi), which dominates the behaviour of the regressor, being independent of the term αi. This indicates that the asymptotic representation in (3.9) is only appropriate when the variances of αi and εit are much smaller than the variance
of (yi0 − αi). Once the variances are of similar magnitude, the asymptotic representation in (3.8) is expected to provide a better approximation to the actual distribution of ρ̂OLS. The unit root test based on the usual t-statistic is obtained by normalizing (ρ̂OLS − 1) appropriately. For this purpose we need a consistent estimator of the limiting variance of ρ̂OLS, and we use White's heteroscedasticity-consistent estimator; see White (1980). Under the covariance stationary local alternative this estimator must be normalized differently in order to be consistent. Letting k = 1/2 and k = 1 refer to the situations where ρ̂OLS converges in distribution at the rate √N and N, respectively, White's heteroscedasticity-consistent estimator of the limiting variance of ρ̂OLS is given by the following expression:

V̂OLS(k) = (N⁻²ᵏ ∑ᵢ y′i,−1 yi,−1)⁻¹ (N⁻²ᵏ ∑ᵢ y′i,−1 v̂i v̂′i yi,−1) (N⁻²ᵏ ∑ᵢ y′i,−1 yi,−1)⁻¹,

where the vector of residuals is v̂i = yi − ρ̂OLS yi,−1. The t-statistic is then defined as

tOLS = V̂OLS(k)^(−1/2) Nᵏ (ρ̂OLS − 1).

Note that V̂OLS(k)^(−1/2) Nᵏ does not depend on k, since the normalization factors cancel out. This means that the test statistic tOLS, and also asymptotic confidence intervals, do not depend on the actual normalization. This is a desirable feature, since we might not know which assumption is appropriate for the initial values. The proposition below provides the limiting distribution of the t-statistic.
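As a sketch, the estimator in (3.7), White's variance estimator and tOLS can be computed as follows for an N × (T + 1) data array; since the Nᵏ factors cancel in the t-statistic, none appear in the code. The random-walk data used to exercise it are an illustrative choice of data generated under the null.

```python
import numpy as np

def ols_unit_root_t(y):
    """OLS estimator (3.7) of rho in y_it = rho*y_it-1 + v_it, pooled over i,
    and the t-statistic for H0: rho = 1 based on White's
    heteroscedasticity-consistent variance estimator."""
    y_lag, y_cur = y[:, :-1], y[:, 1:]        # y_{i,-1} and y_i
    den = np.sum(y_lag * y_lag)               # sum_i y_{i,-1}' y_{i,-1}
    rho_hat = np.sum(y_lag * y_cur) / den
    v_hat = y_cur - rho_hat * y_lag           # residual vectors v_i
    meat = np.sum(np.sum(y_lag * v_hat, axis=1) ** 2)  # sum_i (y_{i,-1}' v_i)^2
    se = np.sqrt(meat) / den                  # sandwich collapses: scalar regressor
    return rho_hat, (rho_hat - 1.0) / se

# exercise on data generated under H0 (pure random walks, alpha_i = 0)
rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=(2000, 5)), axis=1)
rho_hat, t_ols = ols_unit_root_t(y)           # t_ols is approximately N(0, 1)
```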
PROPOSITION 3.2. Under Assumptions 2.1–2.4 and the local-to-unity sequence for ρ given by ρN = 1 − c/√N for c ≥ 0 and when τ is fixed, the limiting distribution of the OLS t-statistic tOLS is given by

tOLS →w N( −c (τ + (T − 1)/2) (σα²/σ2ε + (σ4ε/σ2ε²)(τ + (T − 1)/2))^(−1/2) √T , 1 )   as N → ∞.   (3.10)

Under Assumptions 2.1–2.4 and the local-to-unity sequence for ρ given by ρN = 1 − c/N for c ≥ 0 and when τN = bN + o(N) for b > 0, the limiting distribution of the OLS t-statistic tOLS is given by

tOLS →w N( −c √(bT σ2ε²/σ4ε) , 1 )   as N → ∞.   (3.11)

The proposition shows that in both cases, under the null hypothesis of a unit root, the t-statistic tOLS is asymptotically standard normal. So unit root inference is carried out by employing critical values from the standard normal distribution. Furthermore, the proposition shows that when τ is fixed, the local power is increasing in c, τ, T and σ2ε²/σ4ε (the location parameter is shifted to the left when these parameters increase) and decreasing in σα²/σ2ε (the location parameter is shifted to the right when σα²/σ2ε increases). Under the covariance stationary alternative when b = 1/(2c), the local power only depends on c, T and σ2ε²/σ4ε and is increasing in these parameters. When b > 0 and independent of the value of c, we find that for a fixed value of c the local power is increasing in b. This result is similar to the finding in single time series where the local power of the Dickey–Fuller unit root test turns out to be an increasing
function of the initial value and which, in addition, is an optimal test when the initial values are high; see Müller and Elliott (2003). The reason why we obtain a similar result here for a test statistic which is not invariant with respect to individual-specific constants is that in a single time series the estimation of a constant does not affect the test statistics asymptotically as the time-series dimension goes to infinity. Note that, as discussed above, the limiting distribution in (3.11) will only provide a good approximation to the actual distribution of the t-statistic when the behaviour of yit is dominated by the initial deviation from the mean stationary level. To explore the results of the proposition in more detail, let us consider the case where T + 1 = 5 and the following values of the nuisance parameters: σ2ε²/σ4ε = 1 and σα²/σ2ε = 1. For τ = 1 and ρ = 0.95, N must be approximately 150 in order for the OLS unit root test to attain a power level of 0.5 for a one-sided alternative at the nominal level of 5%. As explained in the previous section, this immediately implies that for τ = 1 and ρ = 0.975, N must be approximately 600 in order to attain the power level of 0.5. The numbers for N in this example would be higher if σα²/σ2ε > 1. Under the covariance stationary alternative we find that N must only be approximately 30 and 60, respectively, in order for this test to attain the power level of 0.5 against the alternatives ρ = 0.95 and ρ = 0.975. So the test is very powerful against this alternative. Altogether, the advantage of using the OLS unit root test is that it is expected to have high power under the covariance stationary alternative and when the variation in the initial deviation from the stationary levels is very high, even for values of ρ very close to unity. However, if this is not the case, the power of the test for values of ρ close to unity is expected to be low when σα²/σ2ε is high.
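These worked numbers can be reproduced from the location parameters of tOLS in Proposition 3.2 with the stated nuisance-parameter values; the helper functions below simply evaluate the standard normal cdf at the shifted 5% critical value, and should be read as a sketch of the calculation rather than the paper's own code.

```python
from math import sqrt
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal cdf

def power_ols_fixed_tau(c, T, tau, a, s):
    """One-sided 5% local power of t_OLS when tau is fixed (Proposition 3.2);
    a = sigma_alpha^2/sigma_2eps and s = sigma_4eps/sigma_2eps^2."""
    m = tau + (T - 1) / 2.0
    location = -c * m * sqrt(T) / sqrt(a + s * m)
    return Phi(-1.645 - location)

def power_ols_cov_stationary(c, T, s):
    """One-sided 5% local power of t_OLS when b = 1/(2c) (covariance stationarity)."""
    location = -sqrt(c * T / (2.0 * s))
    return Phi(-1.645 - location)

T = 4  # i.e. T + 1 = 5 observations over time
for rho, N in ((0.95, 150), (0.975, 600)):
    c = (1.0 - rho) * sqrt(N)             # rho_N = 1 - c/sqrt(N)
    print(rho, N, round(power_ols_fixed_tau(c, T, tau=1.0, a=1.0, s=1.0), 3))
for rho, N in ((0.95, 30), (0.975, 60)):
    c = (1.0 - rho) * N                   # rho_N = 1 - c/N
    print(rho, N, round(power_ols_cov_stationary(c, T, s=1.0), 3))
# all four cases give power close to 0.5
```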
This will be most evident for small values of T.

3.2. Breitung–Meyer

Subtracting the initial value yi0 from both sides of the equation in (2.1) yields the following regression model:

yit − yi0 = ρ (yit−1 − yi0) + ṽit,   ṽit = (ρ − 1)(yi0 − αi) + εit,   for i = 1, . . . , N and t = 1, . . . , T.

The LS estimator of ρ obtained from this regression equation is defined by

ρ̂0 = (∑ᵢ ỹ′i,−1 ỹi,−1)⁻¹ ∑ᵢ ỹ′i,−1 ỹi,   (3.12)
where ỹi = yi − yi0 ιT, ỹi,−1 = yi,−1 − yi0 ιT and ιT is a T × 1 vector of ones. Again, the estimator is consistent when ρ = 1 but inconsistent when |ρ| < 1. In the latter case, its asymptotic bias equals (1/2)(1 − ρ) under the assumption of covariance stationarity; see Breitung and Meyer (1994). As an example, this means that the asymptotic bias equals 0.050, 0.025 and 0.005 when ρ equals 0.90, 0.95 and 0.99, respectively. The inconsistency is attributable to the term (yi0 − αi), as it appears in both the regressor (yit−1 − yi0) and the regression error ṽit. The covariance between the regressor and the regression error decreases towards zero as ρ approaches unity when the variance of (yi0 − αi) is kept constant. However, the decrease might be offset if the variance of (yi0 − αi) increases as ρ approaches unity. This is exactly what happens when the initial values are such that the time-series processes become covariance stationary.
Proposition 3.3 below provides the limiting distribution of the Breitung–Meyer estimator ρ̂0 under both the null hypothesis when ρ is unity and the mean stationary local alternative when ρ is local-to-unity. In this case, the local alternatives are the same irrespective of the assumption about the dispersion of the initial deviation from the stationary level.

PROPOSITION 3.3. Under Assumptions 2.1–2.4 and the local-to-unity sequence for ρ given by ρN = 1 − c/√N for c ≥ 0 and when τN = b√N + o(√N) for b ≥ 0, the limiting distribution of the Breitung–Meyer estimator ρ̂0 is given by

√N (ρ̂0 − ρN) →w N( c²b , (σ4ε/σ2ε²) 2/(T(T − 1)) )   as N → ∞.   (3.13)

The proposition shows that in the unit root case and under the mean stationary local alternative when τ is fixed, ρ̂0 is √N-consistent. Under the covariance stationary local alternative, ρ̂0 has a positive asymptotic bias of order 1/√N. The limiting variance of ρ̂0 does not depend on the assumption being made about the initial values, and it is a simple function of T and σ2ε²/σ4ε, decreasing in both. As indicated above, the results follow by using that when the variance of (yi0 − αi) is of order less than √N, the asymptotic bias disappears under the local alternative. This is the case when τ is fixed. On the contrary, under covariance stationarity this is not the case, as the variance of (yi0 − αi) is then of order √N; see the result in (3.6). As before, White's heteroscedasticity-consistent estimator of the limiting variance of ρ̂0 is given by the following expression:

V̂0 = ((1/N) ∑ᵢ ỹ′i,−1 ỹi,−1)⁻¹ ((1/N) ∑ᵢ ỹ′i,−1 ṽ̂i ṽ̂′i ỹi,−1) ((1/N) ∑ᵢ ỹ′i,−1 ỹi,−1)⁻¹,

where the vector of residuals is ṽ̂i = ỹi − ρ̂0 ỹi,−1. The t-statistic is then defined as

t0 = V̂0^(−1/2) √N (ρ̂0 − 1).
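A sketch of the Breitung–Meyer estimator (3.12) and its t-statistic, again with a White-type standard error; the random-walk test data, with nonzero individual levels that the transformation removes, are an illustrative choice under the null.

```python
import numpy as np

def breitung_meyer_t(y):
    """Breitung-Meyer estimator (3.12): subtract the initial value y_i0 from
    all observations and regress (y_it - y_i0) on (y_it-1 - y_i0); returns
    the estimator and the t-statistic based on White's
    heteroscedasticity-consistent variance estimator."""
    y0 = y[:, [0]]
    yt = y[:, 1:] - y0                        # tilde y_i
    yl = y[:, :-1] - y0                       # tilde y_{i,-1}; first column is 0
    den = np.sum(yl * yl)
    rho0 = np.sum(yl * yt) / den
    resid = yt - rho0 * yl
    meat = np.sum(np.sum(yl * resid, axis=1) ** 2)
    return rho0, (rho0 - 1.0) * den / np.sqrt(meat)

# exercise under H0: random walks started at individual-specific levels alpha_i;
# the levels drop out after subtracting y_i0
rng = np.random.default_rng(2)
alpha = rng.normal(size=(2000, 1))
y = alpha + np.cumsum(rng.normal(size=(2000, 5)), axis=1)
rho0, t0 = breitung_meyer_t(y)                # t0 is approximately N(0, 1)
```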
When the errors εit are homoscedastic across units, such that σ4ε = σ2ε², the limiting variance of ρ̂0 is a function of T only. Therefore, it is possible to use a normalized coefficient statistic when testing the unit root hypothesis. The statistic is defined in the following way:

t̄0 = √(T(T − 1)/2) √N (ρ̂0 − 1).
The proposition below provides the limiting distributions of the test statistics defined above.

PROPOSITION 3.4. Under Assumptions 2.1–2.4 and the local-to-unity sequence for ρ given by ρN = 1 − c/√N for c ≥ 0 and when τN = b√N + o(√N) for b ≥ 0, the limiting distribution of the Breitung–Meyer t-statistic t0 is given by

t0 →w N( −c (1 − cb) √((σ2ε²/σ4ε) T(T − 1)/2) , 1 )   as N → ∞.   (3.14)
The limiting distribution of the normalized coefficient statistic t̄0 is given by

t̄0 →w N( −c (1 − cb) √(T(T − 1)/2) , σ4ε/σ2ε² )   as N → ∞.   (3.15)
The proposition shows that under the null hypothesis of a unit root, the t-statistic t0 is asymptotically standard normal. So again, unit root inference is carried out by employing critical values from the standard normal distribution. Further, the proposition shows that the local power of the test is increasing in c, T and σ2ε²/σ4ε when b < 1/c. In particular, this means that the local power is monotonically increasing in c when b = 0 (τ is fixed) and when b = 1/(2c) (covariance stationarity). When σ2ε²/σ4ε = 1, the local power of a one-sided test at the nominal 5% level against these two alternatives is given by Φ(−1.645 + c√(T(T − 1)/2)) and Φ(−1.645 + (c/2)√(T(T − 1)/2)), respectively, where Φ(·) denotes the cdf of the standard normal distribution. This means that for a specific value of ρ we need four times as many cross-section observations under the covariance stationary alternative as under the fixed-τ alternative in order to obtain the same level of local power. When b > 1/c, the local power of the test is monotonically decreasing in c, and the local power is in fact less than the nominal size of the test for all values of c. This would be the case when the variance of the initial deviation from the stationary level is more than two times σiε²/(1 − ρ²). Also, we see that for a fixed value of c the local power is decreasing in b. These findings are similar to the results in Müller and Elliott (2003) and Harris et al. (2008) for unit root tests in single time series and macropanels, respectively. Note that for a fixed value of b > 0 which is not linked to the value of c, the local power is a non-monotonic function of c that tends to zero as c goes to infinity. Figure 1 shows the local power as a function of c for different values of b when T + 1 = 5 and σ2ε²/σ4ε = 1, and we see the findings described above. Also note that there is a power loss associated with having cross-sectional heterogeneity in the error terms, since a test based on yit/σiε would have higher power. The test based on the normalized coefficient statistic t̄0 is asymptotically equivalent to the test based on the t-statistic t0 when σ4ε = σ2ε². When this is not the case, the test based on the normalized coefficient statistic will be distorted when employing critical values from the standard normal distribution. In a one-sided test it will reject the null hypothesis of a unit root too often, since σ2ε² < σ4ε. So unless there is prior knowledge that the error terms are homoscedastic over cross-section units, the unit root test should be based on the t-statistic.
Note, 2 < σ4ε such that there is a difference between the local power of the two tests, this that if σ2ε difference decreases as T increases. However, the size distortion is not affected by T and hence it remains as T increases. The advantage of using the Breitung–Meyer unit root test is that the local power only depends on one nuisance parameter. Further, under mean stationarity the test is invariant with respect to the individual-specific levels. This means that the size of the test is invariant with respect to the initial values and the power of the test is invariant with respect to the individual-specific term αi . On the other hand, the test is sensitive to the assumptions on the initial values through the initial deviation from the stationary level and the local power can be quite low if the variation in this term is very high. This is in contrast to the OLS unit root test where we found the opposite result. C The Author(s). Journal compilation C Royal Economic Society 2010.
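The local power expressions above are easy to evaluate; a minimal sketch (function names ours), using the standard normal cdf via `math.erf`:

```python
from math import erf, sqrt

def Phi(x):
    # standard normal cdf expressed through the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power_fixed_tau(c, T):
    # local power of the one-sided 5% BM t-test when b = 0 (tau fixed),
    # homoscedastic case sigma_{2e}^2 / sigma_{4e} = 1
    return Phi(-1.645 + c * sqrt(T * (T - 1) / 2.0))

def power_cov_stat(c, T):
    # local power under covariance stationarity, b = 1/(2c)
    return Phi(-1.645 + (c / 2.0) * sqrt(T * (T - 1) / 2.0))
```

Because c = (1 − ρ)√N, replacing N by 4N doubles c, and indeed power under covariance stationarity at 2c equals power under fixed τ at c — the "four times as many observations" statement.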
E. Madsen
Figure 1. The local power of the Breitung–Meyer unit root test.
3.3. Harris–Tzavalis

The within-group transformation of the original model is obtained by subtracting the individual time-series means from the variables in equation (2.1). This yields the following regression model:

\[ y_{it} - \frac{1}{T}\sum_{t=1}^T y_{it} = \rho\left(y_{it-1} - \frac{1}{T}\sum_{t=1}^T y_{it-1}\right) + w_{it}, \qquad w_{it} = \varepsilon_{it} - \frac{1}{T}\sum_{t=1}^T \varepsilon_{it}, \]

for i = 1, …, N and t = 1, …, T. The within-group estimator of ρ is then defined by

\[ \hat{\rho}_{WG} = \left(\sum_{i=1}^N y_{i,-1}'Q_T y_{i,-1}\right)^{-1}\sum_{i=1}^N y_{i,-1}'Q_T y_i, \qquad (3.16) \]

where Q_T = I_T − (1/T)ι_T ι_T' is a T × T symmetric and idempotent matrix, I_T is the T × T identity matrix and ι_T ι_T' is a T × T matrix of ones. It is well known that this estimator is inconsistent when −1 < ρ < 1. The asymptotic bias is often referred to as the Nickell bias, since Nickell (1981) was the first to provide an analytical expression for it. Under the
assumption that the time-series processes are covariance stationary, the asymptotic bias is a function of ρ and T which is always negative when 0 < ρ < 1 and decreases numerically as T increases. Harris and Tzavalis (1999) show that the asymptotic bias of the within-group estimator equals −3/(T + 1) when ρ = 1. As this expression does not depend on any nuisance parameters, their idea is to base a unit root test on the bias-adjusted within-group estimator. Proposition 3.5 below provides the limiting distribution of ρ̂_WG both under the null hypothesis of a unit root and under the mean stationary local alternative when ρ is local-to-unity.

PROPOSITION 3.5. Under Assumptions 2.1–2.4 and the local-to-unity sequence for ρ given by ρ_N = 1 − c/√N for c ≥ 0 and when τ_N = b√N + o(√N) for b ≥ 0, the limiting distribution of the adjusted within-group estimator ρ̂_WG is given by

\[ \sqrt{N}\left(\hat{\rho}_{WG} - \rho_N + \frac{3}{T+1}\right) \overset{w}{\to} N\!\left(-c\,\frac{T-2}{2(T+1)} + c^2 b\,\frac{3T}{2(T+1)},\; \frac{k_1 m_4 + k_2\sigma_{4\varepsilon}}{\sigma_{2\varepsilon}^2}\right) \quad \text{as } N \to \infty, \qquad (3.17) \]

where

\[ k_1 = \frac{12(T-2)(2T-1)}{5T(T-1)(T+1)^3}, \qquad k_2 = \frac{3(17T^3 - 44T^2 + 77T - 24)}{5T(T-1)(T+1)^3}. \]
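The constants k₁ and k₂ can be checked in exact arithmetic. The sketch below (our code) verifies the two simplifications used in the sequel, namely 3k₁ + k₂ = 3(17T² − 20T + 17)/(5(T − 1)(T + 1)³) and k₁ + k₂ = (51T³ − 108T² + 171T − 48)/(5T(T − 1)(T + 1)³):

```python
from fractions import Fraction

def k1(T):
    # k_1 from Proposition 3.5, as an exact rational in T
    return Fraction(12 * (T - 2) * (2 * T - 1), 5 * T * (T - 1) * (T + 1) ** 3)

def k2(T):
    # k_2 from Proposition 3.5
    return Fraction(3 * (17 * T**3 - 44 * T**2 + 77 * T - 24),
                    5 * T * (T - 1) * (T + 1) ** 3)
```

Using `fractions.Fraction` avoids any floating-point doubt about the polynomial identities.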
The proposition shows that, except in the unit root case, the adjusted within-group estimator has an asymptotic bias of order 1/√N under the local alternative. The bias is negative when τ is fixed and positive under covariance stationarity; the adjustment is thus too big and too small, respectively. The limiting variance of ρ̂_WG is the same in the unit root case and under the local alternatives. It depends on fourth-order moments of the errors ε_it through the term m₄. As k₁ < k₂, the fourth-order moments receive less weight than the squared second-order moments. Harris and Tzavalis (1999) assume that the errors ε_it are i.i.d. normally distributed across i, such that σ_{4ε} = σ²_{2ε} and m₄ = 3σ²_{2ε}. In this case, the limiting variance of ρ̂_WG depends only on T and is given by the following expression:

\[ \tilde{V}_{WG} = 3k_1 + k_2 = \frac{3(17T^2 - 20T + 17)}{5(T-1)(T+1)^3}. \]

Therefore, Harris and Tzavalis (1999) suggest using the normalized coefficient statistic as a unit root test statistic. It is defined as follows:

\[ \bar{t}_{WG} = \tilde{V}_{WG}^{-1/2}\,\sqrt{N}\left(\hat{\rho}_{WG} - 1 + \frac{3}{T+1}\right). \]

However, as before it is also possible to use the usual t-statistic as a test statistic. White's heteroscedasticity-consistent estimator of the limiting variance of the bias-adjusted within-group estimator is given by the following expression:

\[ \hat{V}_{WG} = \left(\frac{1}{N}\sum_{i=1}^N y_{i,-1}'Q_T y_{i,-1}\right)^{-1}\left(\frac{1}{N}\sum_{i=1}^N y_{i,-1}'Q_T\hat{w}_i\hat{w}_i'Q_T y_{i,-1}\right)\left(\frac{1}{N}\sum_{i=1}^N y_{i,-1}'Q_T y_{i,-1}\right)^{-1}, \]
where the vector of residuals is ŵ_i = Q_T y_i − ρ̂_WG Q_T y_{i,−1}. The bias-adjusted within-group t-statistic is then defined in the following way:

\[ t_{WG} = \hat{V}_{WG}^{-1/2}\,\sqrt{N}\left(\hat{\rho}_{WG} - 1 + \frac{3}{T+1}\right). \]

The limiting distributions of these test statistics are given in Proposition 3.6 below.

PROPOSITION 3.6. Under Assumptions 2.1–2.4 and the local-to-unity sequence for ρ given by ρ_N = 1 − c/√N for c ≥ 0 and when τ_N = b√N + o(√N) for b ≥ 0, the limiting distribution of the adjusted within-group t-statistic t_WG is given by

\[ t_{WG} \overset{w}{\to} N\!\left(-c(1-cb)\,\frac{3T}{2(T+1)}\sqrt{\sigma_{2\varepsilon}^2\,(k_1 m_4 + k_2\sigma_{4\varepsilon})^{-1}},\; 1\right) \quad \text{as } N \to \infty. \qquad (3.18) \]

The limiting distribution of the Harris–Tzavalis normalized coefficient statistic t̄_WG is given by

\[ \bar{t}_{WG} \overset{w}{\to} N\!\left(-c(1-cb)\,\frac{3T}{2(T+1)}\sqrt{\frac{5(T-1)(T+1)^3}{3(17T^2 - 20T + 17)}},\; \frac{\tilde{k}_1 m_4 + \tilde{k}_2\sigma_{4\varepsilon}}{\sigma_{2\varepsilon}^2}\right) \quad \text{as } N \to \infty, \qquad (3.19) \]

where

\[ \tilde{k}_1 = \frac{4(T-2)(2T-1)}{T(17T^2 - 20T + 17)}, \qquad \tilde{k}_2 = \frac{17T^3 - 44T^2 + 77T - 24}{T(17T^2 - 20T + 17)}. \]
Once again, unit root inference based on the adjusted t-statistic t_WG can be carried out by employing critical values from the standard normal distribution. We also note that the parameters c and b appear in a similar manner as in the limiting distribution of the Breitung–Meyer test statistic, and therefore the results here are similar to the results in Section 3.2. In particular, we find that the local power of the Harris–Tzavalis test is increasing in c, T, σ²_{2ε}/σ_{4ε} and σ²_{2ε}/m₄ when b < 1/c. This means that the local power is monotonically increasing in c when b = 0 (τ is fixed) and when b = 1/(2c) (covariance stationarity). Also, as in Section 3.2, the location parameter in the first case is twice as large as in the second case, such that four times as many cross-section observations are necessary when b = 1/(2c) in order to obtain the same level of local power as when b = 0 for a specific value of ρ. The unit root test based on the Harris–Tzavalis normalized coefficient statistic t̄_WG is asymptotically equivalent to the test based on the t-statistic t_WG when the errors ε_it are normally distributed and homoscedastic across units. If at least one of these assumptions is violated, the test is likely to be distorted when employing critical values from the standard normal distribution. The test will reject the null hypothesis too often when σ²_{2ε} < σ_{4ε} and when the excess kurtosis of ε_it is positive, i.e. m₄ > 3σ²_{2ε}. Therefore, the Harris–Tzavalis normalized coefficient statistic should not be used for unit root inference unless the underlying assumptions have been verified. As with the Breitung–Meyer unit root test, the Harris–Tzavalis unit root test is invariant with respect to the individual-specific levels. However, the local power of the Harris–Tzavalis test depends on more nuisance parameters.
A more serious disadvantage of this test is that the bias adjustment of the within-group estimator ρ̂_WG depends crucially on the errors ε_it being homoscedastic over time. If this assumption is violated, the Harris–Tzavalis unit root test is likely to be distorted. To avoid this problem, Kruiniger and Tzavalis (2001) suggest using an estimator of the asymptotic bias in the adjustment of ρ̂_WG. In the unit root case, the estimator of
the asymptotic bias is consistent. However, in this paper we only investigate the performance of the unit root tests when the errors ε_it are homoscedastic over time. Therefore, we do not consider this different bias adjustment in detail, but we note that it is available.

3.4. Comparison of the tests

Below we list the main findings about the local power of the tests. They follow immediately from the results in Propositions 3.2, 3.4 and 3.6.

(1) When b ≤ 1/c the local power of the Breitung–Meyer test is always higher than the local power of the Harris–Tzavalis test. This follows by using that σ_{4ε} ≤ m₄, so that

\[ (k_1 m_4 + k_2\sigma_{4\varepsilon})^{-1} \le \sigma_{4\varepsilon}^{-1}(k_1 + k_2)^{-1} = \sigma_{4\varepsilon}^{-1}\,\frac{5T(T-1)(T+1)^3}{51T^3 - 108T^2 + 171T - 48}. \]

This gives

\[ \left(\frac{3T}{2(T+1)}\right)^{\!2}(k_1 m_4 + k_2\sigma_{4\varepsilon})^{-1} \le \sigma_{4\varepsilon}^{-1}\,\frac{T(T-1)}{2}\cdot\frac{45T^2(T+1)}{2(51T^3 - 108T^2 + 171T - 48)} \le \sigma_{4\varepsilon}^{-1}\,\frac{T(T-1)}{2}, \]

since 0 < 45T²(T + 1)/(2(51T³ − 108T² + 171T − 48)) < 1.

(2) When τ is fixed, the local power of the OLS test is higher than the local power of the Breitung–Meyer test when σ²_α/σ_{2ε} < (σ_{4ε}/σ²_{2ε}) τ (1 + 2τ/(T − 1)).

(3) When τ = 1/(1 − ρ²) (covariance stationarity), the local power of the OLS test is higher than the local power of the Breitung–Meyer test when ρ > (T − 5)/(T − 1).

(4) When τ is fixed, σ²_{2ε}/σ_{4ε} = 1 and m₄ = 3σ²_{2ε}, the local power of the OLS test is higher than the local power of the Harris–Tzavalis test when

\[ \frac{\sigma_\alpha^2}{\sigma_{2\varepsilon}} < \left(\frac{4(17T^2 - 20T + 17)}{15T(T-1)(T+1)}\left(\tau + \frac{T-1}{2}\right) - 1\right)\left(\tau + \frac{T-1}{2}\right). \]
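Finding (1) can also be checked numerically. The sketch below (our code) compares the squared location-parameter coefficients of the two tests, with σ_{4ε} normalized to one, and confirms that the bounding factor lies strictly between 0 and 1:

```python
from fractions import Fraction

def k1(T):
    return Fraction(12 * (T - 2) * (2 * T - 1), 5 * T * (T - 1) * (T + 1) ** 3)

def k2(T):
    return Fraction(3 * (17 * T**3 - 44 * T**2 + 77 * T - 24),
                    5 * T * (T - 1) * (T + 1) ** 3)

def ht_location_sq_coeff(T, m4_over_s4):
    # (3T / (2(T+1)))^2 * (k1 m4 + k2 s4)^{-1}, with sigma_{4e} = 1
    num = Fraction(9 * T * T, 4 * (T + 1) ** 2)
    return num / (k1(T) * Fraction(m4_over_s4) + k2(T))

def bm_location_sq_coeff(T):
    # T(T-1)/2, with sigma_{4e} = 1
    return Fraction(T * (T - 1), 2)
```

Since m₄ ≥ σ_{4ε} always holds, the Harris–Tzavalis coefficient never exceeds the Breitung–Meyer one.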
Figure 2 below illustrates some of these results. In each figure, the local power of one-sided tests at the 5% nominal level based on the t-statistics is graphed as a function of c = (1 − ρ)√N. It is calculated for the following parameter values: τ = 1, σ²_{2ε}/σ_{4ε} = 1 and m₄ = 3σ²_{2ε}. The figures correspond to the value of T + 1 being 5 or 10 and the value of σ²_α/σ_{2ε} being 1 or 10. For this choice of parameters, the local power of the Breitung–Meyer test and the Harris–Tzavalis test depends only on T. As an example, the local power of the Breitung–Meyer test is obtained as Φ(−1.645 + c√(T(T−1)/2)), where Φ denotes the cdf of the standard normal; see Proposition 3.4. The figures show that the local power of the Breitung–Meyer test is higher than the local power of the Harris–Tzavalis test for all values of c. When σ²_α/σ_{2ε} = 1 the local power of the OLS test is highest for all values of c, whereas when σ²_α/σ_{2ε} = 10 the local power of the OLS test is lowest for all values of c. More specifically, when σ²_α/σ_{2ε} = 1 and T + 1 = 5, attaining a local power level of 0.5 for a given value of ρ requires approximately 1.2 (Breitung–Meyer test) and 1.6 (Harris–Tzavalis test) times as many cross-section observations as the OLS test. When σ²_α/σ_{2ε} = 10 and T + 1 = 5, approximately 1.4 (Harris–Tzavalis test) and 3.0 (OLS test) times as many cross-section observations are needed compared to the Breitung–Meyer test in order to attain a local power level of 0.5.
4. SIMULATION EXPERIMENTS

In this section the analytical results obtained in Section 3 are illustrated in a simulation experiment. The simulated model is the following:

\[ y_{i0} = \alpha_i + \varepsilon_{i0}, \qquad y_{it} = \rho y_{it-1} + (1-\rho)\alpha_i + \varepsilon_{it}, \]
Figure 2. Comparison of the local power under mean stationarity.
with

\[ \varepsilon_{it} \sim \text{i.i.d. } N(0,1), \qquad \alpha_i \sim \text{i.i.d. } N(0,\sigma_\alpha^2), \qquad \varepsilon_{i0} \sim \text{i.i.d. } N(0,\tau). \]
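The design above is straightforward to replicate. The following sketch (our code, not the authors'; the seed, N, T and replication count are illustrative) simulates the model and computes the empirical rejection rate of the one-sided Harris–Tzavalis t-test t_WG from Section 3.3, which should be close to the nominal 0.05 in the unit root case:

```python
import numpy as np

def ht_tstat(y):
    """Bias-adjusted within-group t-statistic t_WG from an (N, T+1) panel y,
    where column 0 holds the initial observation y_i0."""
    N, Tp1 = y.shape
    T = Tp1 - 1
    Q = np.eye(T) - np.ones((T, T)) / T          # Q_T = I_T - (1/T) iota iota'
    qlag = y[:, :-1] @ Q                         # rows: (Q_T y_{i,-1})'
    qcur = y[:, 1:] @ Q                          # rows: (Q_T y_i)'
    rho_wg = (qlag * y[:, 1:]).sum() / (qlag * y[:, :-1]).sum()
    w_hat = qcur - rho_wg * qlag                 # w_hat_i = Q_T y_i - rho Q_T y_{i,-1}
    A = (qlag * y[:, :-1]).sum(axis=1).mean()    # (1/N) sum y_{i,-1}' Q_T y_{i,-1}
    s = (qlag * w_hat).sum(axis=1)               # y_{i,-1}' Q_T w_hat_i
    V = (s * s).mean() / (A * A)                 # White's sandwich estimator
    return np.sqrt(N) * (rho_wg - 1.0 + 3.0 / (T + 1)) / np.sqrt(V)

def rejection_rate(rho, N=400, T=4, sigma_a=1.0, tau=1.0, reps=400, seed=12345):
    """Empirical rejection rate of the one-sided 5% test of H0: rho = 1."""
    rng = np.random.default_rng(seed)
    rej = 0
    for _ in range(reps):
        alpha = rng.normal(0.0, sigma_a, size=N)
        y = np.empty((N, T + 1))
        y[:, 0] = alpha + rng.normal(0.0, np.sqrt(tau), size=N)
        for t in range(1, T + 1):
            y[:, t] = rho * y[:, t - 1] + (1.0 - rho) * alpha + rng.standard_normal(N)
        rej += ht_tstat(y) < -1.645              # one-sided 5% critical value
    return rej / reps
```

With ρ = 1 the rate approximates the size of the test, and with ρ = 0.9 it approximates the power, mirroring the τ = 1 rows of Table 1.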
We consider different values of T, N and ρ, namely T + 1 = 5, 10, 15, N = 100, 250, 500 and ρ = 0.90, 0.95, 0.99, 1.00. The results are based on 5000 replications of the model. In Tables 1 and 2, we report the empirical rejection probabilities of one-sided unit root tests based on the t-statistics, with the critical value taken from the standard normal distribution at the nominal 5% significance level. For comparison, the analytical rejection probabilities (i.e. the local power) are reported in parentheses. We consider different simulation set-ups where the value of σ²_α is either 1 or 10. This parameter only affects the OLS test, as the two other tests do not depend on it under the alternatives considered here. Further, the simulation set-ups depend on the variance of the initial error term, τ. Table 1 corresponds to the unit root case with τ = 1, and Table 2 corresponds to the covariance stationary alternative with τ = 1/(1 − ρ²). In Table 1, we see that the empirical size of all tests is close to the nominal size of 0.05 and that the empirical power is quite high even for values of ρ close to unity such as ρ = 0.95. Further, the increase in power can be quite dramatic when T + 1 is increased from 5 to 10. For example, when ρ = 0.99 and N = 500 the power of the Breitung–Meyer test increases from 0.15 to 0.37, the power of the Harris–Tzavalis test increases from 0.13 to 0.25, and the power of the OLS test increases from 0.15 to 0.38 when σ²_α = 1 and from 0.09 to 0.21 when σ²_α = 10. When comparing
Table 1. Empirical and analytical (in brackets) rejection probabilities when τ = 1.

ρ      T+1   N    OLS, σ²_α = 1   OLS, σ²_α = 10  Breitung–Meyer  Harris–Tzavalis
0.900   5   100   0.803 (0.848)   0.370 (0.409)   0.684 (0.790)   0.562 (0.667)
0.900   5   250   0.989 (0.995)   0.666 (0.723)   0.957 (0.987)   0.865 (0.949)
0.900   5   500   1.000 (1.000)   0.896 (0.935)   0.999 (1.000)   0.991 (0.999)
0.900  10   100   1.000 (1.000)   0.943 (0.987)   0.999 (1.000)   0.947 (0.998)
0.900  10   250   1.000 (1.000)   0.999 (1.000)   1.000 (1.000)   1.000 (1.000)
0.900  10   500   1.000 (1.000)   1.000 (1.000)   1.000 (1.000)   1.000 (1.000)
0.900  15   100   1.000 (1.000)   1.000 (1.000)   1.000 (1.000)   0.998 (1.000)
0.900  15   250   1.000 (1.000)   1.000 (1.000)   1.000 (1.000)   1.000 (1.000)
0.900  15   500   1.000 (1.000)   1.000 (1.000)   1.000 (1.000)   1.000 (1.000)
0.950   5   100   0.369 (0.379)   0.184 (0.174)   0.318 (0.337)   0.277 (0.272)
0.950   5   250   0.650 (0.680)   0.282 (0.299)   0.572 (0.615)   0.463 (0.498)
0.950   5   500   0.887 (0.910)   0.452 (0.475)   0.823 (0.863)   0.694 (0.750)
0.950  10   100   0.880 (0.922)   0.567 (0.615)   0.855 (0.912)   0.620 (0.723)
0.950  10   250   0.998 (0.999)   0.886 (0.922)   0.997 (0.999)   0.910 (0.971)
0.950  10   500   1.000 (1.000)   0.989 (0.996)   1.000 (1.000)   0.995 (1.000)
0.950  15   100   0.996 (0.999)   0.891 (0.956)   0.993 (0.999)   0.856 (0.962)
0.950  15   250   1.000 (1.000)   0.999 (1.000)   1.000 (1.000)   0.996 (1.000)
0.950  15   500   1.000 (1.000)   1.000 (1.000)   1.000 (1.000)   1.000 (1.000)
0.990   5   100   0.097 (0.084)   0.078 (0.066)   0.094 (0.081)   0.092 (0.075)
0.990   5   250   0.115 (0.111)   0.081 (0.078)   0.111 (0.104)   0.103 (0.094)
0.990   5   500   0.150 (0.148)   0.091 (0.092)   0.145 (0.136)   0.128 (0.119)
0.990  10   100   0.171 (0.151)   0.119 (0.104)   0.157 (0.148)   0.131 (0.116)
0.990  10   250   0.257 (0.249)   0.155 (0.151)   0.242 (0.243)   0.178 (0.174)
0.990  10   500   0.380 (0.391)   0.213 (0.218)   0.367 (0.381)   0.253 (0.260)
0.990  15   100   0.253 (0.248)   0.175 (0.165)   0.253 (0.245)   0.177 (0.168)
0.990  15   250   0.434 (0.451)   0.275 (0.280)   0.426 (0.446)   0.273 (0.286)
0.990  15   500   0.677 (0.694)   0.419 (0.442)   0.665 (0.687)   0.418 (0.454)
1.000   5   100   0.057 (0.050)   0.057 (0.050)   0.062 (0.050)   0.063 (0.050)
1.000   5   250   0.054 (0.050)   0.054 (0.050)   0.054 (0.050)   0.059 (0.050)
1.000   5   500   0.055 (0.050)   0.055 (0.050)   0.055 (0.050)   0.057 (0.050)
1.000  10   100   0.064 (0.050)   0.064 (0.050)   0.062 (0.050)   0.064 (0.050)
1.000  10   250   0.056 (0.050)   0.056 (0.050)   0.060 (0.050)   0.061 (0.050)
1.000  10   500   0.055 (0.050)   0.055 (0.050)   0.049 (0.050)   0.056 (0.050)
1.000  15   100   0.055 (0.050)   0.055 (0.050)   0.058 (0.050)   0.063 (0.050)
1.000  15   250   0.058 (0.050)   0.058 (0.050)   0.053 (0.050)   0.056 (0.050)
1.000  15   500   0.050 (0.050)   0.050 (0.050)   0.048 (0.050)   0.053 (0.050)
Table 2. Empirical and analytical (in brackets) rejection probabilities when τ = 1/(1 − ρ²).

ρ      T+1   N    OLS, σ²_α = 1   OLS, σ²_α = 10  Breitung–Meyer  Harris–Tzavalis
0.900   5   100   0.997 (0.998)   0.883 (0.998)   0.336 (0.337)   0.301 (0.272)
0.900   5   250   1.000 (1.000)   0.997 (1.000)   0.604 (0.615)   0.519 (0.498)
0.900   5   500   1.000 (1.000)   1.000 (1.000)   0.851 (0.863)   0.760 (0.750)
0.900  10   100   1.000 (1.000)   0.998 (1.000)   0.888 (0.912)   0.737 (0.723)
0.900  10   250   1.000 (1.000)   1.000 (1.000)   0.998 (0.999)   0.976 (0.971)
0.900  10   500   1.000 (1.000)   1.000 (1.000)   1.000 (1.000)   0.999 (1.000)
0.900  15   100   1.000 (1.000)   1.000 (1.000)   1.000 (0.999)   0.965 (0.962)
0.900  15   250   1.000 (1.000)   1.000 (1.000)   1.000 (1.000)   1.000 (1.000)
0.900  15   500   1.000 (1.000)   1.000 (1.000)   1.000 (1.000)   1.000 (1.000)
0.950   5   100   0.934 (0.935)   0.759 (0.935)   0.167 (0.151)   0.157 (0.130)
0.950   5   250   1.000 (1.000)   0.981 (1.000)   0.258 (0.249)   0.231 (0.205)
0.950   5   500   1.000 (1.000)   1.000 (1.000)   0.383 (0.391)   0.325 (0.314)
0.950  10   100   1.000 (0.999)   0.979 (0.999)   0.431 (0.442)   0.320 (0.299)
0.950  10   250   1.000 (1.000)   1.000 (1.000)   0.759 (0.766)   0.559 (0.549)
0.950  10   500   1.000 (1.000)   1.000 (1.000)   0.947 (0.956)   0.809 (0.804)
0.950  15   100   1.000 (1.000)   0.999 (1.000)   0.744 (0.770)   0.556 (0.525)
0.950  15   250   1.000 (1.000)   1.000 (1.000)   0.976 (0.983)   0.867 (0.855)
0.950  15   500   1.000 (1.000)   1.000 (1.000)   1.000 (1.000)   0.986 (0.985)
0.990   5   100   0.427 (0.409)   0.379 (0.409)   0.078 (0.064)   0.074 (0.062)
0.990   5   250   0.722 (0.723)   0.668 (0.723)   0.080 (0.073)   0.082 (0.069)
0.990   5   500   0.933 (0.935)   0.888 (0.935)   0.094 (0.085)   0.090 (0.079)
0.990  10   100   0.695 (0.683)   0.632 (0.683)   0.101 (0.089)   0.096 (0.078)
0.990  10   250   0.957 (0.956)   0.931 (0.956)   0.130 (0.121)   0.105 (0.098)
0.990  10   500   0.998 (0.999)   0.996 (0.999)   0.162 (0.165)   0.130 (0.126)
0.990  15   100   0.860 (0.842)   0.809 (0.842)   0.133 (0.121)   0.118 (0.096)
0.990  15   250   0.994 (0.994)   0.988 (0.994)   0.193 (0.187)   0.145 (0.135)
0.990  15   500   1.000 (1.000)   1.000 (1.000)   0.269 (0.282)   0.195 (0.189)
the different tests we see the results described in Section 3.4. To summarize, the power of the Breitung–Meyer test is always higher than the power of the Harris–Tzavalis test, and the OLS test has the highest (lowest) power of the three tests when σ²_α = 1 (σ²_α = 10). Finally, we see that the empirical rejection probabilities are quite close to the analytical rejection probabilities. This demonstrates that the local power provides a good approximation to the actual power. In Table 2, the most striking result is that the OLS test has very high power even for values of ρ very close to unity, such as ρ = 0.99. According to the analytical results in Section 3.1, this will be the case unless the variability of the variable of interest is dominated by the variability of the individual-specific term. This is also the main conclusion from the simulation studies in the papers by Bond et al. (2002) and Hall and Mairesse (2005), where the time-series processes are covariance stationary in the simulation set-ups. The empirical power of the OLS test is always higher than that of the Breitung–Meyer test and the Harris–Tzavalis test. In addition,
the empirical power of the Breitung–Meyer test is always higher than that of the Harris–Tzavalis test, and compared to Table 1 the empirical power of these tests is lower. These findings are all in accordance with the analytical results in Section 3. Again, we see that the empirical power is quite close to the analytical power, except for the OLS test with σ²_α = 10. As explained in Section 3.1, this is to be expected.
5. CONCLUSIONS

In this paper, we have investigated the performance of some of the unit root tests for micropanels that have been suggested in the literature. To do this we have derived the asymptotic power of the tests under local alternatives. One of the main findings is that the initial values are very important for the performance of the tests. This result also holds for unit root tests in single time series and macropanels. The results show that the OLS unit root test is very powerful when the variation of the initial deviation from the mean stationary level is high; in fact, the local power is increasing in the parameter describing this feature. However, this test is not invariant with respect to adding individual-specific means to all variables, and the results show that its power can be very low when the variation in the individual-specific means is high. The Breitung–Meyer test and the Harris–Tzavalis test are invariant with respect to this type of transformation, and another main finding is that the local power of the Breitung–Meyer test is always higher than the local power of the Harris–Tzavalis test. Since the Harris–Tzavalis test relies on rather strong assumptions, such as homoscedastic error variances, in order to perform the bias adjustment, the results show that the Breitung–Meyer test is to be preferred. This result is confirmed by findings from macropanels; see Moon et al. (2007). In future research it would be interesting to investigate whether and under which conditions the tests considered in this paper are optimal, by deriving the local power of optimal tests. This could be done in a more general framework where the AR parameter can differ across cross-section units under the alternative hypothesis. Results from macropanels suggest that in this case some of the tests considered here might be optimal (the OLS test without incidental intercepts and the Breitung–Meyer test with incidental intercepts); see Moon et al. (2007).
These results are also interesting in relation to the type of panel data unit root test suggested by Im et al. (2003). Their test statistic is based on the cross-section average of individual-specific Dickey–Fuller test statistics, as opposed to the pooled test statistics considered in this paper. In macropanels the Im–Pesaran–Shin test appears to have substantially lower power than the optimal tests, and that might also be the case in micropanels.
ACKNOWLEDGMENTS I wish to thank J. Breitung, M. Browning, S. Johansen, H. C. Kongsted, two anonymous referees and co-editor Pierre Perron for useful comments and suggestions. I also wish to thank the Danish National Research Foundation for its support through the Centre for Applied Microeconometrics (CAM).
REFERENCES

Arellano, M. (2003). Panel Data Econometrics. Oxford: Oxford University Press.
Baltagi, B. H. (1995). Econometric Analysis of Panel Data. New York: John Wiley.
Baltagi, B. H. (2000). Nonstationary Panels, Panel Cointegration and Dynamic Panels, Advances in Econometrics, Volume 15. Amsterdam: Elsevier.
Baltagi, B. H. and C. Kao (2000). Nonstationary panels, cointegration in panels and dynamic panels: a survey. In B. H. Baltagi (Ed.), Nonstationary Panels, Panel Cointegration, and Dynamic Panels, Advances in Econometrics, Volume 15, 7–52. Amsterdam: Elsevier.
Banerjee, A. (1999). Panel data unit roots and cointegration: an overview. Oxford Bulletin of Economics and Statistics 61, 607–629.
Bond, S., C. Nauges and F. Windmeijer (2002). Unit roots and identification in autoregressive panel data models: a comparison of alternative tests. Working Paper, University of Bristol.
Breitung, J. (2000). The local power of some unit root tests for panel data. In B. H. Baltagi (Ed.), Nonstationary Panels, Panel Cointegration, and Dynamic Panels, Advances in Econometrics, Volume 15, 161–77. Amsterdam: Elsevier.
Breitung, J. and W. Meyer (1994). Testing for unit roots in panel data: are wages on different bargaining levels cointegrated? Applied Economics 26, 353–61.
Breitung, J. and M. H. Pesaran (2008). Unit roots and cointegration in panels. In L. Mátyás and P. Sevestre (Eds.), The Econometrics of Panel Data, Advanced Studies in Theoretical and Applied Econometrics, Volume 46, 279–322. Berlin: Springer.
Dickey, D. and W. Fuller (1979). Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association 74, 427–31.
Hall, B. H. and J. Mairesse (2005). Testing for unit roots in panel data: an exploration using real and simulated data. In D. Andrews and J. Stock (Eds.), Identification and Inference in Econometric Models: Essays in Honor of Thomas J. Rothenberg. Cambridge: Cambridge University Press.
Harris, D., D. I. Harvey, S. J. Leybourne and N. D. Sakkas (2008). Local asymptotic power of the Im–Pesaran–Shin panel unit root test and the impact of initial observations. Granger Centre Discussion Paper No. 08/02, University of Nottingham.
Harris, R. D. F. and E. Tzavalis (1999). Inference for unit roots in dynamic panels where the time dimension is fixed. Journal of Econometrics 91, 201–26.
Hsiao, C. (1986). Analysis of Panel Data. Cambridge: Cambridge University Press.
Hwang, J. and P. Schmidt (1996). Alternative methods of detrending and the power of unit root tests. Journal of Econometrics 71, 227–48.
Im, K. S., M. H. Pesaran and Y. Shin (2003). Testing for unit roots in heterogeneous panels. Journal of Econometrics 115, 53–74.
Kruiniger, H. and E. Tzavalis (2001). Testing for unit roots in short dynamic panels with serially correlated and heteroscedastic disturbance terms. Working Paper, Department of Economics, Queen Mary, University of London.
Levin, A., F. Lin and C. Chu (2002). Unit root tests in panel data: asymptotic and finite-sample properties. Journal of Econometrics 122, 81–126.
Madsen, E. (2005). Estimating cointegrating relations from a cross section. Econometrics Journal, 380–405.
Moon, H. R., B. Perron and P. C. B. Phillips (2007). Incidental trends and the power of panel unit root tests. Journal of Econometrics 141, 416–59.
Müller, U. and G. Elliott (2003). Tests for unit roots and the initial condition. Econometrica 71, 1269–86.
Nickell, S. (1981). Biases in dynamic models with fixed effects. Econometrica 49, 1417–26.
Schmidt, P. and P. C. B. Phillips (1992). LM tests for a unit root in the presence of deterministic trends. Oxford Bulletin of Economics and Statistics 54, 257–87.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48, 817–38.
White, H. (2001). Asymptotic Theory for Econometricians. San Diego: Academic Press.
APPENDIX: PROOFS OF RESULTS

This appendix contains the proofs of the propositions in Section 3. The proofs are all based on standard asymptotic theory; see, for example, White (2001). The notation $X_N \overset{as}{=} Y_N$ means that $X_N - Y_N \overset{P}{\to} 0$ as N → ∞, i.e. X_N and Y_N are asymptotically equivalent as N → ∞. We start out with some results that will be used in the following.
A.1. Preliminary lemmas and results

LEMMA A.1. Under the local-to-unity sequence for ρ given by ρ_N = 1 − c/N^k for k, c > 0, the following hold:

\[ \rho_N^t = 1 - t\,\frac{c}{N^k} + o(N^{-k}), \qquad (A.1) \]

\[ \frac{1}{1-\rho_N^2} = \frac{N^k}{2c} + o(N^k). \qquad (A.2) \]

Proof: The binomial formula yields

\[ \rho_N^t = \left(1 - \frac{c}{N^k}\right)^{\!t} = 1 - t\,\frac{c}{N^k} + \frac{t(t-1)}{2!}\,\frac{c^2}{N^{2k}} - \frac{t(t-1)(t-2)}{3!}\,\frac{c^3}{N^{3k}} + \cdots + \frac{(-c)^t}{N^{kt}} \]

and the results follow directly.

For −1 < ρ ≤ 1, the following expression for y_it is obtained by recursive substitution in (2.1):

\[ y_{it} = (1-\rho^t)\alpha_i + \rho^t y_{i0} + \rho^{t-1}\varepsilon_{i1} + \cdots + \varepsilon_{it} \quad \text{for } t = 1,\ldots,T. \]

Inserting the expression for the initial value given in Assumption 2.3 yields

\[ y_{it} = \alpha_i + \rho^t\sqrt{\tau}\,\varepsilon_{i0} + \rho^{t-1}\varepsilon_{i1} + \cdots + \varepsilon_{it} \quad \text{for } t = 0,\ldots,T. \]

Using stacked notation, equation (2.1) can be expressed as y_i = ρ y_{i,−1} + v_i. Expressions for the regressor y_{i,−1} and the regression error v_i are given by

\[ y_{i,-1} = \alpha_i\iota_T + C_T(\rho)\varepsilon_i + A_T(\rho)\sqrt{\tau}\,\varepsilon_{i0}, \qquad (A.3) \]

\[ v_i = (1-\rho)\alpha_i\iota_T + \varepsilon_i, \qquad (A.4) \]

where ι_T is a T × 1 vector of ones, and C_T(ρ) is the T × T matrix and A_T(ρ) the T × 1 vector defined as

\[ C_T(\rho) = \begin{bmatrix} 0 & 0 & \cdots & 0 & 0 \\ 1 & 0 & & & \vdots \\ \rho & 1 & \ddots & & \vdots \\ \vdots & \ddots & \ddots & 0 & 0 \\ \rho^{T-2} & \rho^{T-3} & \cdots & 1 & 0 \end{bmatrix}, \qquad A_T(\rho) = \begin{bmatrix} 1 \\ \rho \\ \rho^2 \\ \vdots \\ \rho^{T-1} \end{bmatrix}. \]
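Lemma A.1 can be sanity-checked numerically. The sketch below (ours) confirms that the remainder of the first-order expansion in (A.1) is o(N^{−k}) and that the approximation in (A.2) holds:

```python
def rho_N(c, N, k):
    # local-to-unity sequence rho_N = 1 - c / N^k
    return 1.0 - c / N**k

def expansion_error(c, N, k, t):
    # remainder of the first-order expansion in (A.1), scaled by N^k;
    # o(N^{-k}) means this scaled remainder should vanish as N grows
    exact = rho_N(c, N, k) ** t
    approx = 1.0 - t * c / N**k
    return abs(exact - approx) * N**k
```

The leading neglected term is of order N^{−2k}, so the scaled remainder shrinks like N^{−k}.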
Note that C_T(ρ_N) = C_T(1) + O(N^{−k}) and A_T(ρ_N) = ι_T + O(N^{−k}) when ρ_N = 1 − c/N^k, according to Lemma A.1. In the following, we will use notation like y_{i,−1}(ρ) and u_i(ρ) to indicate that these variables depend on the value of the parameter ρ. In Lemma A.2 below we provide results that are used to prove the propositions in Sections 3.1–3.3.

LEMMA A.2. Consider the sequence {x_i(ρ), u_i(ρ)}_{i=1}^N of independent variables, where x_i(ρ) and u_i(ρ) are k × 1 variables with mean zero and finite fourth-order moments for all values of ρ.

PART A: If the following hold for the sequence ρ_N:

\[ \frac{1}{N}\sum_{i=1}^N E(x_i(1)'x_i(1)) \to m_{XX} \quad \text{as } N \to \infty, \qquad (A.5) \]

\[ E((x_i(\rho_N) - x_i(1))'x_i(\rho_N)) = o(1), \qquad (A.6) \]

\[ E((x_i(\rho_N) - x_i(1))'x_i(1)) = o(1), \qquad (A.7) \]

then

\[ \frac{1}{N}\sum_{i=1}^N x_i(\rho_N)'x_i(\rho_N) \overset{P}{\to} m_{XX} \quad \text{as } N \to \infty. \]

PART B: If the following hold for the sequence ρ_N:

\[ E(x_i(1)'u_i(1)) = 0, \qquad (A.8) \]

\[ \frac{1}{N}\sum_{i=1}^N \mathrm{Var}(x_i(1)'u_i(1)) \to \Omega \quad \text{as } N \to \infty, \qquad (A.9) \]

\[ \frac{1}{\sqrt{N}}\sum_{i=1}^N E(x_i(\rho_N)'(u_i(\rho_N) - u_i(1))) \to \mu_1 \quad \text{as } N \to \infty, \qquad (A.10) \]

\[ \mathrm{Var}(x_i(\rho_N)'(u_i(\rho_N) - u_i(1))) = o(1), \qquad (A.11) \]

\[ \frac{1}{\sqrt{N}}\sum_{i=1}^N E((x_i(\rho_N) - x_i(1))'u_i(1)) \to \mu_2 \quad \text{as } N \to \infty, \qquad (A.12) \]

\[ \mathrm{Var}((x_i(\rho_N) - x_i(1))'u_i(1)) = o(1), \qquad (A.13) \]

then

\[ \frac{1}{\sqrt{N}}\sum_{i=1}^N x_i(\rho_N)'u_i(\rho_N) \overset{w}{\to} N(\mu_1 + \mu_2, \Omega) \quad \text{as } N \to \infty. \]

PART C: Let ρ_N be a sequence such that √N(ρ_N − 1) = O(1) and let ρ̂ be a sample statistic such that √N(ρ̂ − ρ_N) = O_P(1). Then

\[ \hat{V}(\hat{\rho}) \equiv \frac{1}{N}\sum_{i=1}^N x_i(\hat{\rho})'\hat{u}_i(\hat{\rho})\hat{u}_i(\hat{\rho})'x_i(\hat{\rho}) \overset{P}{\to} \Omega \quad \text{as } N \to \infty, \]
where û_i(ρ) = (ρ_N − ρ̂)x_i(ρ) + u_i(ρ).

Proof: PART A: We have that

\[ \frac{1}{N}\sum_{i=1}^N x_i(\rho)'x_i(\rho) = \frac{1}{N}\sum_{i=1}^N x_i(1)'x_i(1) + \frac{1}{N}\sum_{i=1}^N (x_i(\rho)-x_i(1))'x_i(1) + \frac{1}{N}\sum_{i=1}^N (x_i(\rho)-x_i(1))'x_i(\rho). \]

The Law of Large Numbers together with the condition in (A.5) implies that the first term on the right-hand side converges in probability to m_XX as N → ∞; together with the assumed existence of fourth-order moments, the condition in (A.5) is sufficient for this result. By the same arguments, the last two terms converge in probability to zero as N → ∞, since their means converge to zero according to the conditions in (A.6)–(A.7). Altogether this proves that

\[ \frac{1}{N}\sum_{i=1}^N x_i(\rho_N)'x_i(\rho_N) \overset{as}{=} \frac{1}{N}\sum_{i=1}^N x_i(1)'x_i(1) \overset{P}{\to} m_{XX} \quad \text{as } N \to \infty. \]
PART B: We have that
$$\frac{1}{\sqrt N}\sum_{i=1}^N x_i(\rho)'u_i(\rho) = \frac{1}{\sqrt N}\sum_{i=1}^N x_i(1)'u_i(1) + \frac{1}{\sqrt N}\sum_{i=1}^N (x_i(\rho) - x_i(1))'u_i(1) + \frac{1}{\sqrt N}\sum_{i=1}^N x_i(\rho)'(u_i(\rho) - u_i(1)).$$
The Central Limit Theorem and the Law of Large Numbers, together with the conditions in (A.8)–(A.13) and the existence of fourth-order moments, give
$$\frac{1}{\sqrt N}\sum_{i=1}^N x_i(1)'u_i(1) \xrightarrow{w} N(0, \Omega),\qquad \frac{1}{\sqrt N}\sum_{i=1}^N x_i(\rho_N)'(u_i(\rho_N) - u_i(1)) \xrightarrow{P} \mu_1,\qquad \frac{1}{\sqrt N}\sum_{i=1}^N (x_i(\rho_N) - x_i(1))'u_i(1) \xrightarrow{P} \mu_2,$$
as $N\to\infty$. In particular, the conditions in (A.12)–(A.13) together with independence across $i$ imply the following as $N\to\infty$, which gives the third of the three limits above:
$$E\Big(\frac{1}{\sqrt N}\sum_{i=1}^N (x_i(\rho_N) - x_i(1))'u_i(1)\Big) \to \mu_2,\qquad \mathrm{Var}\Big(\frac{1}{\sqrt N}\sum_{i=1}^N (x_i(\rho_N) - x_i(1))'u_i(1)\Big) = \frac{1}{N}\sum_{i=1}^N \mathrm{Var}\big((x_i(\rho_N) - x_i(1))'u_i(1)\big) \to 0.$$
Altogether this proves that
$$\frac{1}{\sqrt N}\sum_{i=1}^N x_i(\rho_N)'u_i(\rho_N) \xrightarrow{w} N(\mu_1 + \mu_2, \Omega) \qquad\text{as } N\to\infty.$$
E. Madsen
PART C: We use the following definitions:
$$\hat V(\rho) = \frac{1}{N}\sum_{i=1}^N x_i(\rho)'\hat u_i(\rho)\hat u_i(\rho)'x_i(\rho),\qquad V(\rho) = \frac{1}{N}\sum_{i=1}^N E\big(x_i(\rho)'u_i(\rho)u_i(\rho)'x_i(\rho)\big).$$
We have that $\hat V(\rho_N) - V(\rho_N) = o_P(1)$. This follows since $x_i(\rho)$ and $u_i(\rho)$ have finite fourth-order moments together with the assumption that $\hat\rho - \rho_N = o_P(1)$. Because $V(\rho_N) - V(1) \to 0$ as $N\to\infty$, this proves the result, since
$$V(1) = \frac{1}{N}\sum_{i=1}^N E\big(x_i(1)'u_i(1)u_i(1)'x_i(1)\big) \to \Omega \qquad\text{as } N\to\infty. \qquad\square$$
A.2. Proofs of the propositions in Section 3.1: OLS

Using the equation in (3.7) we have, for $k > 0$,
$$N^k(\hat\rho_{OLS} - \rho) = \Big(\frac{1}{N^{2k}}\sum_{i=1}^N y_{i,-1}'y_{i,-1}\Big)^{-1}\frac{1}{N^k}\sum_{i=1}^N y_{i,-1}'v_i.$$
Proposition 3.1 now follows from the results in Lemma A.3 below.
LEMMA A.3. Under Assumptions 2.1–2.4 and the local-to-unity sequence for $\rho$ given by $\rho_N = 1 - c/\sqrt N$ for $c \ge 0$, and when $\tau$ is fixed, the following results hold:
$$\frac{1}{N}\sum_{i=1}^N y_{i,-1}'y_{i,-1} \xrightarrow{P} T\sigma_\alpha^2 + T\Big(\tau + \frac{T-1}{2}\Big)\sigma_\varepsilon^2 \qquad\text{as } N\to\infty, \qquad(A.14)$$
$$\frac{1}{\sqrt N}\sum_{i=1}^N y_{i,-1}'v_i \xrightarrow{w} N\Big(cT\sigma_\alpha^2,\; T\sigma_\alpha^2\sigma_\varepsilon^2 + T\Big(\tau + \frac{T-1}{2}\Big)\sigma_\varepsilon^4\Big) \qquad\text{as } N\to\infty. \qquad(A.15)$$
Under Assumptions 2.1–2.4 and the local-to-unity sequence for $\rho_N$ given by $\rho_N = 1 - c/N$ for $c \ge 0$, and when $\tau_N = bN + o(N)$ for $b > 0$, the following results hold:
$$\frac{1}{N^2}\sum_{i=1}^N y_{i,-1}'y_{i,-1} \xrightarrow{P} bT\sigma_\varepsilon^2 \qquad\text{as } N\to\infty, \qquad(A.16)$$
$$\frac{1}{N}\sum_{i=1}^N y_{i,-1}'v_i \xrightarrow{w} N(0, bT\sigma_\varepsilon^4) \qquad\text{as } N\to\infty. \qquad(A.17)$$
Proof: The following results will be used below:
$$\mathrm{tr}\{C_T(\rho)\} = 0,\qquad A_T(1)'A_T(1) = \iota_T'\iota_T = T,\qquad \mathrm{tr}\{C_T(1)'C_T(1)\} = \frac{T(T-1)}{2},$$
$$A_T(\rho_N) - A_T(1) = O(N^{-k}),\qquad C_T(\rho_N) - C_T(1) = O(N^{-k}).$$
The first part of Lemma A.3 follows by using Lemma A.2 with the following definitions of $x_i(\rho)$ and $u_i(\rho)$:
$$x_i(\rho) = y_{i,-1}(\rho) = \alpha_i\iota_T + A_T(\rho)\sqrt{\tau}\,\varepsilon_{i0} + C_T(\rho)\varepsilon_i,\qquad u_i(\rho) = v_i(\rho) = (1-\rho)\alpha_i\iota_T + \varepsilon_i,$$
where expressions for $y_{i,-1}$ and $u_i$ are given in equations (A.3)–(A.4). This gives
$$x_i(\rho) - x_i(1) = (A_T(\rho) - A_T(1))\sqrt{\tau}\,\varepsilon_{i0} + (C_T(\rho) - C_T(1))\varepsilon_i,\qquad u_i(\rho) - u_i(1) = (1-\rho)\alpha_i\iota_T.$$
The sequences $x_i(\rho)$ and $u_i(\rho)$ are both independent across $i$ with finite fourth-order moments. We prove the result in (A.14) by using Part A of Lemma A.2. We have
$$\frac{1}{N}\sum_{i=1}^N E(x_i(1)'x_i(1)) = \frac{1}{N}\sum_{i=1}^N E(\alpha_i^2)\,\iota_T'\iota_T + \frac{1}{N}\sum_{i=1}^N \sigma_{i\varepsilon}^2\tau A_T(1)'A_T(1) + \frac{1}{N}\sum_{i=1}^N \sigma_{i\varepsilon}^2\,\mathrm{tr}\{C_T(1)'C_T(1)\} \to \sigma_\alpha^2 T + \sigma_\varepsilon^2 T\Big(\tau + \frac{T-1}{2}\Big)$$
as $N\to\infty$, where we have used that $\alpha_i$, $\varepsilon_{i0}$ and $\varepsilon_i$ are independent of each other with mean zero. We also have, for all $\rho$,
$$E\big(x_i(\rho)'(x_i(\rho_N) - x_i(1))\big) = \sigma_{i\varepsilon}^2\tau A_T(\rho)'(A_T(\rho_N) - A_T(1)) + \sigma_{i\varepsilon}^2\,\mathrm{tr}\{C_T(\rho)'(C_T(\rho_N) - C_T(1))\} = O(1/\sqrt N).$$
This means that the conditions in Part A of Lemma A.2 are satisfied, such that
$$\frac{1}{N}\sum_{i=1}^N x_i(\rho_N)'x_i(\rho_N) \xrightarrow{P} T\sigma_\alpha^2 + T\Big(\tau + \frac{T-1}{2}\Big)\sigma_\varepsilon^2 \qquad\text{as } N\to\infty.$$
This proves the result in (A.14). We prove the result in (A.15) by using Part B of Lemma A.2. The mean and variance of $x_i(1)'u_i(1)$ are given by
$$E(x_i(1)'u_i(1)) = \sigma_{i\varepsilon}^2\,\mathrm{tr}\{C_T(1)'\} = 0,\qquad \mathrm{Var}(x_i(1)'u_i(1)) = \sigma_{i\varepsilon}^2 E(x_i(1)'x_i(1)) = \sigma_{i\varepsilon}^2\Big(T\sigma_\alpha^2 + \sigma_{i\varepsilon}^2 T\Big(\tau + \frac{T-1}{2}\Big)\Big),$$
such that
$$\frac{1}{N}\sum_{i=1}^N \mathrm{Var}(x_i(1)'u_i(1)) \to T\sigma_\alpha^2\sigma_\varepsilon^2 + T\Big(\tau + \frac{T-1}{2}\Big)\sigma_\varepsilon^4 \qquad\text{as } N\to\infty.$$
In addition, we have the following results concerning means:
$$E\Big(\frac{1}{\sqrt N}\sum_{i=1}^N x_i(\rho_N)'(u_i(\rho_N) - u_i(1))\Big) = \frac{1}{\sqrt N}\sum_{i=1}^N (1-\rho_N)\,\iota_T'\iota_T\,E(\alpha_i^2) = cT\sigma_\alpha^2,$$
$$E\Big(\frac{1}{\sqrt N}\sum_{i=1}^N (x_i(\rho_N) - x_i(1))'u_i(1)\Big) = \frac{1}{\sqrt N}\sum_{i=1}^N \sigma_{i\varepsilon}^2\,\mathrm{tr}\{C_T(\rho_N) - C_T(1)\} = 0.$$
For the variances we have
$$\mathrm{Var}\big(x_i(\rho_N)'(u_i(\rho_N) - u_i(1))\big) \le O(1/N),\qquad \mathrm{Var}\big((x_i(\rho_N) - x_i(1))'u_i(1)\big) \le O(1/N).$$
This holds since
$$\mathrm{Var}\big(x_i(\rho_N)'(u_i(\rho_N) - u_i(1))\big) \le E\big((x_i(\rho_N)'x_i(\rho_N))^2\big)^{1/2}\,E\big(((u_i(\rho_N) - u_i(1))'(u_i(\rho_N) - u_i(1)))^2\big)^{1/2} = O(1/N),$$
where the inequality follows from the Cauchy–Schwarz inequality and the equality from using that, for $du_i(\rho_N) = u_i(\rho_N) - u_i(1)$, $E((du_i(\rho_N)'du_i(\rho_N))^2) = O(1/N^2)$ and $E((x_i(\rho_N)'x_i(\rho_N))^2) = O(1)$. The second result follows by similar arguments. This means that the conditions in Part B of Lemma A.2 are satisfied, such that
$$\frac{1}{\sqrt N}\sum_{i=1}^N x_i(\rho_N)'u_i(\rho_N) \xrightarrow{w} N\Big(cT\sigma_\alpha^2,\; T\sigma_\alpha^2\sigma_\varepsilon^2 + T\Big(\tau + \frac{T-1}{2}\Big)\sigma_\varepsilon^4\Big) \qquad\text{as } N\to\infty.$$
This proves the result in (A.15). The second part of Lemma A.3 follows by repeating the steps above but with the following definitions of $x_i(\rho_N)$ and $u_i(\rho_N)$:
$$x_i(\rho_N) = A_T(\rho_N)\sqrt{b}\,\varepsilon_{i0} \overset{as}{=} y_{i,-1}(\rho_N)/\sqrt N,\qquad u_i(\rho_N) = (1-\rho_N)\iota_T\alpha_i + \varepsilon_i.$$
We have that
$$E(x_i(1)'x_i(1)) = \sigma_{i\varepsilon}^2\,b\,A_T(1)'A_T(1) = \sigma_{i\varepsilon}^2\,bT,\qquad E\big(x_i(\rho)'(x_i(\rho_N) - x_i(1))\big) = \sigma_{i\varepsilon}^2\,b\,A_T(\rho)'(A_T(\rho_N) - A_T(1)) = O(1/N),$$
such that
$$\frac{1}{N^2}\sum_{i=1}^N y_{i,-1}(\rho_N)'y_{i,-1}(\rho_N) \overset{as}{=} \frac{1}{N}\sum_{i=1}^N x_i(\rho_N)'x_i(\rho_N) \xrightarrow{P} \sigma_\varepsilon^2\,bT \qquad\text{as } N\to\infty.$$
This proves the result in (A.16). Using that $x_i(\rho)$ and $u_i(\rho)$ are independent for all values of $\rho$, we have
$$E(x_i(1)'u_i(1)) = 0,\qquad \mathrm{Var}(x_i(1)'u_i(1)) = E\big((x_i(1)'u_i(1))^2\big) = \sigma_{i\varepsilon}^4\,bT,$$
$$E\big(x_i(\rho_N)'(u_i(\rho_N) - u_i(1))\big) = 0,\qquad E\big((x_i(\rho_N) - x_i(1))'u_i(1)\big) = 0.$$
Altogether this implies that
$$\frac{1}{N}\sum_{i=1}^N y_{i,-1}(\rho_N)'v_i(\rho_N) \overset{as}{=} \frac{1}{\sqrt N}\sum_{i=1}^N x_i(\rho_N)'u_i(\rho_N) \xrightarrow{w} N(0, bT\sigma_\varepsilon^4) \qquad\text{as } N\to\infty,$$
which proves the result in (A.17). $\square$
Proof of Proposition 3.2: Part C of Lemma A.2 immediately implies that $\hat V_{OLS}(k)$ is a consistent estimator, as $N\to\infty$, of the variance in the limiting distribution of $N^k(\hat\rho_{OLS} - \rho_N)$ when $\rho_N$ is local-to-unity. Then, using the expression for $t_{OLS}$, we have that
$$t_{OLS} = \hat V_{OLS}(k)^{-1/2}N^k(\hat\rho_{OLS} - 1) = \hat V_{OLS}(k)^{-1/2}N^k(\hat\rho_{OLS} - \rho_N) - c\,\hat V_{OLS}(k)^{-1/2}.$$
Proposition 3.2 now follows from the results already obtained. $\square$
A.3. Proofs of the propositions in Section 3.2: Breitung–Meyer

Using the expressions for $y_{i,-1}$ and $v_i$ given in (A.3)–(A.4) we have that
$$\tilde y_{i,-1} = y_{i,-1} - \iota_T y_{i0} = (A_T(\rho) - \iota_T)\sqrt{\tau}\,\varepsilon_{i0} + C_T(\rho)\varepsilon_i,\qquad \tilde v_i = v_i + (\rho - 1)\iota_T y_{i0} = (\rho - 1)\sqrt{\tau}\,\iota_T\varepsilon_{i0} + \varepsilon_i.$$
Using the equation in (3.12) we have
$$\sqrt N(\hat\rho_0 - \rho) = \Big(\frac{1}{N}\sum_{i=1}^N \tilde y_{i,-1}'\tilde y_{i,-1}\Big)^{-1}\frac{1}{\sqrt N}\sum_{i=1}^N \tilde y_{i,-1}'\tilde v_i.$$

Proof of Proposition 3.3: Follows from the results in Lemma A.4 below. $\square$

LEMMA A.4. Let Assumptions 2.1–2.4 be satisfied. When $\rho_N = 1 - c/\sqrt N$ and $\tau_N = b\sqrt N + o(\sqrt N)$ for $b, c \ge 0$, the following results hold:
$$\frac{1}{N}\sum_{i=1}^N \tilde y_{i,-1}'\tilde y_{i,-1} \xrightarrow{P} \frac{T(T-1)}{2}\sigma_\varepsilon^2 \qquad\text{as } N\to\infty, \qquad(A.18)$$
$$\frac{1}{\sqrt N}\sum_{i=1}^N \tilde y_{i,-1}'\tilde v_i \xrightarrow{w} N\Big(c^2b\,\frac{T(T-1)}{2}\sigma_\varepsilon^2,\; \frac{T(T-1)}{2}\sigma_\varepsilon^4\Big) \qquad\text{as } N\to\infty. \qquad(A.19)$$
Proof: In the following we will use that
$$A_T(\rho_N) - A_T(1) = (\rho_N - 1)\tilde A_T + o(1/\sqrt N),\qquad \iota_T'\tilde A_T = \frac{T(T-1)}{2},$$
where the $T\times 1$ vector $\tilde A_T$ is defined as $\tilde A_T = (0, 1, 2, \ldots, T-1)'$. We use the following specifications:
$$x_i(\rho) = \tilde y_{i,-1}(\rho) = (A_T(\rho) - \iota_T)\sqrt{\tau}\,\varepsilon_{i0} + C_T(\rho)\varepsilon_i,\qquad u_i(\rho) = \tilde v_i(\rho) = (\rho - 1)\sqrt{\tau}\,\iota_T\varepsilon_{i0} + \varepsilon_i,$$
such that
$$x_i(\rho) - x_i(1) = (A_T(\rho) - \iota_T)\sqrt{\tau}\,\varepsilon_{i0} + (C_T(\rho) - C_T(1))\varepsilon_i,\qquad u_i(\rho) - u_i(1) = (\rho - 1)\sqrt{\tau}\,\iota_T\varepsilon_{i0}.$$
It follows immediately that
$$\frac{1}{N}\sum_{i=1}^N E(x_i(1)'x_i(1)) = \frac{1}{N}\sum_{i=1}^N \sigma_{i\varepsilon}^2\,\mathrm{tr}\{C_T(1)'C_T(1)\} \to \frac{T(T-1)}{2}\sigma_\varepsilon^2 \qquad\text{as } N\to\infty.$$
We also have
$$E\big((x_i(\rho_N) - x_i(1))'x_i(\rho_N)\big) = \sigma_{i\varepsilon}^2\big(\tau_N(A_T(\rho_N) - \iota_T)'(A_T(\rho_N) - \iota_T) + \mathrm{tr}\{(C_T(\rho_N) - C_T(1))'C_T(\rho_N)\}\big) = O(1/\sqrt N).$$
This holds since $\tau_N(A_T(\rho_N) - \iota_T)'(A_T(\rho_N) - \iota_T) = (c^2b/\sqrt N)\,\tilde A_T'\tilde A_T + o(1/\sqrt N) = O(1/\sqrt N)$. Altogether this gives the result in (A.18) according to Part A of Lemma A.2.

In order to prove the result in (A.19) we will show that the conditions in Part B of Lemma A.2 are satisfied. We have that
$$E(x_i(1)'u_i(1)) = \sigma_{i\varepsilon}^2\,\mathrm{tr}\{C_T(1)'\} = 0,\qquad \mathrm{Var}(x_i(1)'u_i(1)) = \sigma_{i\varepsilon}^4\,\frac{T(T-1)}{2},$$
$$E\big(x_i(\rho_N)'(u_i(\rho_N) - u_i(1))\big) = (\rho_N - 1)\tau_N\,\sigma_{i\varepsilon}^2\,\iota_T'(A_T(\rho_N) - A_T(1)),\qquad E\big((x_i(\rho_N) - x_i(1))'u_i(1)\big) = 0,$$
$$\mathrm{Var}\big(x_i(\rho_N)'(u_i(\rho_N) - u_i(1))\big) \le O(1/\sqrt N),\qquad \mathrm{Var}\big((x_i(\rho_N) - x_i(1))'u_i(1)\big) \le O(1/\sqrt N),$$
such that
$$\frac{1}{\sqrt N}\sum_{i=1}^N E\big(x_i(\rho_N)'(u_i(\rho_N) - u_i(1))\big) \to c^2b\,\frac{T(T-1)}{2}\sigma_\varepsilon^2 \qquad\text{as } N\to\infty,$$
where we have used that $\iota_T'(A_T(\rho_N) - A_T(1)) = -c\,\iota_T'\tilde A_T/\sqrt N + o(1/\sqrt N)$ and $(\rho_N - 1)\tau_N = -cb + o(1)$. Altogether, by Part B of Lemma A.2, this proves the result in (A.19). $\square$
Proof of Proposition 3.4: Part C of Lemma A.2 immediately implies that $\hat V_0$ is a consistent estimator, as $N\to\infty$, of the variance in the limiting distribution of $\sqrt N(\hat\rho_0 - \rho_N)$ when $\rho_N$ is local-to-unity. Then, using the expressions for $t_0$ and $\bar t_0$, we have that
$$t_0 = \hat V_0^{-1/2}\sqrt N(\hat\rho_0 - 1) = \hat V_0^{-1/2}\sqrt N(\hat\rho_0 - \rho_N) - c\,\hat V_0^{-1/2},$$
$$\bar t_0 = \sqrt{\frac{T(T-1)}{2}}\,\sqrt N(\hat\rho_0 - 1) = \sqrt{\frac{T(T-1)}{2}}\,\sqrt N(\hat\rho_0 - \rho_N) - c\,\sqrt{\frac{T(T-1)}{2}}.$$
Proposition 3.4 now follows from the results already obtained. $\square$
A.4. Proofs of the propositions in Section 3.3: Harris–Tzavalis

Using the expressions for $y_{i,-1}$ and $v_i$ given in (A.3)–(A.4), and that $Q_T\iota_T = 0$, we have
$$Q_Ty_{i,-1} = Q_TC_T(\rho)\varepsilon_i + Q_TA_T(\rho)\sqrt{\tau}\,\varepsilon_{i0},\qquad Q_Tv_i = Q_T\varepsilon_i,$$
where $Q_T = I_T - \frac{1}{T}\iota_T\iota_T'$ is symmetric and idempotent. Using the expression for $\hat\rho_{WG}$ in (3.16) we have
$$\sqrt N\Big(\hat\rho_{WG} - \rho + \frac{3}{T+1}\Big) = \Big(\frac{1}{N}\sum_{i=1}^N y_{i,-1}'Q_Ty_{i,-1}\Big)^{-1}\frac{1}{\sqrt N}\sum_{i=1}^N \Big(y_{i,-1}'Q_T\varepsilon_i + \frac{3}{T+1}y_{i,-1}'Q_Ty_{i,-1}\Big).$$

Proof of Proposition 3.5: Follows from the results in Lemma A.5 below. $\square$
LEMMA A.5. Let Assumptions 2.1–2.4 be satisfied. When $\rho_N = 1 - c/\sqrt N$ and $\tau_N = b\sqrt N + o(\sqrt N)$ for $b, c \ge 0$, the following results hold:
$$\frac{1}{N}\sum_{i=1}^N y_{i,-1}'Q_Ty_{i,-1} \xrightarrow{P} \frac{(T-1)(T+1)}{6}\sigma_\varepsilon^2 \qquad\text{as } N\to\infty, \qquad(A.20)$$
$$\frac{1}{\sqrt N}\sum_{i=1}^N \Big(y_{i,-1}'Q_T\varepsilon_i + \frac{3}{T+1}y_{i,-1}'Q_Ty_{i,-1}\Big) \xrightarrow{w} N\big({-c\sigma_\varepsilon^2(a_1 - cb\,a_2)},\; g_1m_4 + g_2\sigma_\varepsilon^4\big) \qquad\text{as } N\to\infty, \qquad(A.21)$$
where
$$a_1 = \frac{(T-1)(T-2)}{12},\qquad a_2 = \frac{T(T-1)}{4},$$
and
$$g_1 = \frac{(T-1)(T-2)(2T-1)}{15T(T+1)},\qquad g_2 = \frac{(T-1)(17T^3 - 44T^2 + 77T - 24)}{60T(T+1)}.$$
Proof: The following results will be used below:
$$Q_TA_T(1) = Q_T\iota_T = 0,\qquad Q_TA_T(\rho) = Q_T(A_T(\rho) - A_T(1)),$$
$$\mathrm{tr}\{C_T(1)'Q_TC_T(1)\} = \frac{(T-1)(T+1)}{6},\qquad \mathrm{tr}\{C_T(1)'\iota_T\iota_T'C_T(1)\} = \frac{T(T-1)(2T-1)}{6},\qquad \mathrm{tr}\{C_T(1)'Q_T\} = -\frac{T-1}{2}.$$
The $T\times T$ matrix $\tilde C_T$ and the $T\times 1$ vector $\tilde A_T$ are defined as
$$\tilde C_T = \begin{pmatrix} 0 & 0 & \cdots & 0 & 0\\ 1 & 0 & \cdots & 0 & 0\\ \vdots & \ddots & \ddots & \vdots & \vdots\\ T-2 & \cdots & 1 & 0 & 0 \end{pmatrix},\qquad \tilde A_T = \begin{pmatrix} 0\\ 1\\ 2\\ \vdots\\ T-1 \end{pmatrix},$$
i.e. $\tilde C_T$ has $(t, s)$ element $t - s - 1$ for $s < t$ and zero otherwise. We will also use the following results:
$$(A_T(\rho_N) - A_T(1))'Q_TA_T(\rho_N) = (1-\rho_N)^2\,\tilde A_T'Q_T\tilde A_T + o(1/N),$$
$$\mathrm{tr}\{C_T(1)'\tilde C_T\} = \frac{T(T-1)(T-2)}{6},\qquad \iota_T'C_T(1)'\tilde C_T\iota_T = \frac{T(T-1)(T-2)(3T-1)}{24},\qquad \iota_T'\tilde C_T\iota_T = \frac{T(T-1)(T-2)}{6},$$
$$\tilde A_T'Q_T\tilde A_T = \tilde A_T'\tilde A_T - \frac{1}{T}(\tilde A_T'\iota_T)^2 = \frac{T(T-1)(T+1)}{12}.$$
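The trace results collected above can be checked numerically. The snippet below is an independent sanity check of ours, not part of the original: it builds $C_T(1)$ as the strictly lower-triangular matrix of ones, $\tilde C_T$ with $(t,s)$ element $t-s-1$ below the diagonal, $\tilde A_T = (0,1,\ldots,T-1)'$ and $Q_T$, and asserts each stated identity.

```python
import numpy as np

# Numerical check of the trace identities used in the proofs of Lemmas A.3-A.5.
# C: strictly lower-triangular ones; Ctil: (t,s) element t-s-1 for s < t.
T = 6
C = np.tril(np.ones((T, T)), k=-1)
Ctil = np.tril(np.subtract.outer(np.arange(T), np.arange(T)) - 1.0, k=-1)
Atil = np.arange(T, dtype=float)
iota = np.ones(T)
Q = np.eye(T) - np.outer(iota, iota) / T          # Q_T = I - (1/T) iota iota'

assert np.trace(C.T @ C) == T * (T - 1) / 2
assert np.isclose(np.trace(C.T @ Q @ C), (T - 1) * (T + 1) / 6)
assert np.isclose(np.trace(C.T @ np.outer(iota, iota) @ C), T * (T - 1) * (2 * T - 1) / 6)
assert np.isclose(np.trace(C.T @ Q), -(T - 1) / 2)
assert np.trace(C.T @ Ctil) == T * (T - 1) * (T - 2) / 6
assert iota @ C.T @ Ctil @ iota == T * (T - 1) * (T - 2) * (3 * T - 1) / 24
assert iota @ Ctil @ iota == T * (T - 1) * (T - 2) / 6
assert np.isclose(Atil @ Q @ Atil, T * (T - 1) * (T + 1) / 12)
print("all trace identities verified for T =", T)
```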
From above we have the following:
$$\sqrt N\Big(\hat\rho_{WG} - \rho + \frac{3}{T+1}\Big) = \Big(\frac{1}{N}\sum_{i=1}^N x_i(\rho)'x_i(\rho)\Big)^{-1}\frac{1}{\sqrt N}\sum_{i=1}^N x_i(\rho)'u_i(\rho),$$
with
$$x_i(\rho) = Q_Ty_{i,-1}(\rho) = Q_TA_T(\rho)\sqrt{\tau_N}\,\varepsilon_{i0} + Q_TC_T(\rho)\varepsilon_i,$$
$$u_i(\rho) = Q_T\varepsilon_i + \frac{3}{T+1}Q_Ty_{i,-1}(\rho) = Q_T\varepsilon_i + \frac{3}{T+1}\big(Q_TA_T(\rho)\sqrt{\tau_N}\,\varepsilon_{i0} + Q_TC_T(\rho)\varepsilon_i\big).$$
We have that
$$x_i(1) = Q_TC_T(1)\varepsilon_i,\qquad u_i(1) = Q_T\varepsilon_i + \frac{3}{T+1}Q_TC_T(1)\varepsilon_i,$$
$$x_i(\rho) - x_i(1) = Q_T(A_T(\rho) - \iota_T)\sqrt{\tau_N}\,\varepsilon_{i0} + Q_T(C_T(\rho) - C_T(1))\varepsilon_i,$$
$$u_i(\rho) - u_i(1) = \frac{3}{T+1}\big(Q_T(A_T(\rho) - \iota_T)\sqrt{\tau_N}\,\varepsilon_{i0} + Q_T(C_T(\rho) - C_T(1))\varepsilon_i\big).$$
The sequences $x_i(\rho_N)$ and $u_i(\rho_N)$ are both independent across $i$ with finite fourth-order moments. We have the following:
$$\frac{1}{N}\sum_{i=1}^N E(x_i(1)'x_i(1)) = \frac{1}{N}\sum_{i=1}^N \sigma_{i\varepsilon}^2\,\mathrm{tr}\{C_T(1)'Q_TC_T(1)\} \to \frac{(T-1)(T+1)}{6}\sigma_\varepsilon^2 \qquad\text{as } N\to\infty,$$
and also
$$E\big((x_i(\rho_N) - x_i(1))'x_i(\rho_N)\big) = \sigma_{i\varepsilon}^2\big(\mathrm{tr}\{(C_T(\rho_N) - C_T(1))'Q_TC_T(\rho_N)\} + \tau_N(A_T(\rho_N) - A_T(1))'Q_TA_T(\rho_N)\big) = O(1/\sqrt N).$$
This holds since $\tau_N = b\sqrt N + o(\sqrt N)$ in combination with $(A_T(\rho_N) - A_T(1))'Q_TA_T(\rho_N) = O(1/N)$. Altogether this gives the result in (A.20) according to Part A of Lemma A.2.

We prove (A.21) by showing that the conditions in Part B of Lemma A.2 are satisfied. We have the following:
$$E(x_i(1)'u_i(1)) = \sigma_{i\varepsilon}^2\Big(\mathrm{tr}\{C_T(1)'Q_T\} + \frac{3}{T+1}\mathrm{tr}\{C_T(1)'Q_TC_T(1)\}\Big) = 0.$$
The following results can be found in Harris and Tzavalis (1999):
$$E\big((\varepsilon_i'C_T(1)'Q_T\varepsilon_i)^2\big) = \frac{(2T-1)(T-1)}{6T}E(\varepsilon_{it}^4) + \frac{(T-1)(2T^2-4T+3)}{6T}\sigma_{i\varepsilon}^4,$$
$$E\big((\varepsilon_i'C_T(1)'Q_TC_T(1)\varepsilon_i)^2\big) = \frac{(T^2-1)(T^2+1)}{30T}E(\varepsilon_{it}^4) + \frac{(T^2-1)(T^2+1)(T-2)}{20T}\sigma_{i\varepsilon}^4,$$
$$E\big(\varepsilon_i'C_T(1)'Q_T\varepsilon_i\;\varepsilon_i'C_T(1)'Q_TC_T(1)\varepsilon_i\big) = -\frac{T^2-1}{12}E(\varepsilon_{it}^4) - \frac{(T^2-1)(T-2)}{12}\sigma_{i\varepsilon}^4.$$
This gives that
$$\mathrm{Var}(x_i(1)'u_i(1)) = E\big((\varepsilon_i'C_T(1)'Q_T\varepsilon_i)^2\big) + \Big(\frac{3}{T+1}\Big)^2E\big((\varepsilon_i'C_T(1)'Q_TC_T(1)\varepsilon_i)^2\big) + \frac{6}{T+1}E\big(\varepsilon_i'C_T(1)'Q_T\varepsilon_i\;\varepsilon_i'C_T(1)'Q_TC_T(1)\varepsilon_i\big) = g_1E(\varepsilon_{it}^4) + g_2\sigma_{i\varepsilon}^4,$$
where $g_1$ and $g_2$ are defined in Lemma A.5. This implies that
$$\frac{1}{N}\sum_{i=1}^N \mathrm{Var}(x_i(1)'u_i(1)) \to g_1m_4 + g_2\sigma_\varepsilon^4 \qquad\text{as } N\to\infty.$$
For the first mean term we have that
$$E\big((x_i(\rho_N) - x_i(1))'u_i(1)\big) = \sigma_{i\varepsilon}^2\,\mathrm{tr}\Big\{\Big(I_T + \frac{3}{T+1}C_T(1)\Big)'Q_T(C_T(\rho_N) - C_T(1))\Big\} = \sigma_{i\varepsilon}^2\,\mathrm{tr}\Big\{\Big(I_T + \frac{3}{T+1}C_T(1)\Big)'Q_T\tilde C_T\Big\}(\rho_N - 1) + o(1/\sqrt N),$$
such that, as $N\to\infty$,
$$\frac{1}{\sqrt N}\sum_{i=1}^N E\big((x_i(\rho_N) - x_i(1))'u_i(1)\big) \to -c\sigma_\varepsilon^2\Big(\mathrm{tr}\{Q_T\tilde C_T\} + \frac{3}{T+1}\mathrm{tr}\{C_T(1)'Q_T\tilde C_T\}\Big).$$
For the second mean term we have that
$$E\big(x_i(\rho_N)'(u_i(\rho_N) - u_i(1))\big) = \sigma_{i\varepsilon}^2\,\frac{3}{T+1}\big(\mathrm{tr}\{C_T(\rho_N)'Q_T(C_T(\rho_N) - C_T(1))\} + \tau_NA_T(\rho_N)'Q_T(A_T(\rho_N) - A_T(1))\big)$$
$$= \sigma_{i\varepsilon}^2\,\frac{3}{T+1}\big(\mathrm{tr}\{C_T(1)'Q_T\tilde C_T\}(\rho_N - 1) + \tau_N(\rho_N - 1)^2\,\tilde A_T'Q_T\tilde A_T\big) + o(1/\sqrt N),$$
such that, as $N\to\infty$,
$$\frac{1}{\sqrt N}\sum_{i=1}^N E\big(x_i(\rho_N)'(u_i(\rho_N) - u_i(1))\big) \to -c\sigma_\varepsilon^2\,\frac{3}{T+1}\big(\mathrm{tr}\{C_T(1)'Q_T\tilde C_T\} - cb\,\tilde A_T'Q_T\tilde A_T\big).$$
For the variance terms we have that
$$\mathrm{Var}\big((x_i(\rho_N) - x_i(1))'u_i(1)\big) \le O(1/\sqrt N),\qquad \mathrm{Var}\big(x_i(\rho_N)'(u_i(\rho_N) - u_i(1))\big) \le O(1/\sqrt N).$$
Altogether this gives the result in (A.21) according to Part B of Lemma A.2, since
$$-c\sigma_\varepsilon^2\Big(\mathrm{tr}\{Q_T\tilde C_T\} + \frac{6}{T+1}\mathrm{tr}\{C_T(1)'Q_T\tilde C_T\} - cb\,\frac{3}{T+1}\tilde A_T'Q_T\tilde A_T\Big) = -c\sigma_\varepsilon^2\Big({-\frac{(T-1)(T-2)}{6}} + \frac{6(T-1)(T-2)}{24} - cb\,\frac{T(T-1)}{4}\Big) = -c\sigma_\varepsilon^2\Big(\frac{(T-1)(T-2)}{12} - cb\,\frac{T(T-1)}{4}\Big). \qquad\square$$
Proof of Proposition 3.6: Part C of Lemma A.2 immediately implies that $\hat V_{WG}$ is a consistent estimator, as $N\to\infty$, of the variance in the limiting distribution of $\sqrt N(\hat\rho_{WG} - \rho_N + 3/(T+1))$ when $\rho_N$ is local-to-unity. Then, using the expressions for $t_{WG}$ and $\bar t_{WG}$, we have that
$$t_{WG} = \hat V_{WG}^{-1/2}\sqrt N\Big(\hat\rho_{WG} - 1 + \frac{3}{T+1}\Big) = \hat V_{WG}^{-1/2}\sqrt N\Big(\hat\rho_{WG} - \rho_N + \frac{3}{T+1}\Big) - c\,\hat V_{WG}^{-1/2},$$
$$\bar t_{WG} = \tilde V_{WG}^{-1/2}\sqrt N\Big(\hat\rho_{WG} - 1 + \frac{3}{T+1}\Big) = \tilde V_{WG}^{-1/2}\sqrt N\Big(\hat\rho_{WG} - \rho_N + \frac{3}{T+1}\Big) - c\,\tilde V_{WG}^{-1/2}.$$
Proposition 3.6 now follows from the results already obtained. $\square$
The Econometrics Journal (2010), volume 13, pp. 95–126. doi: 10.1111/j.1368-423X.2009.00299.x
The weak instrument problem of the system GMM estimator in dynamic panel data models

MAURICE J. G. BUN† AND FRANK WINDMEIJER‡,§

†Department of Quantitative Economics, University of Amsterdam, Amsterdam 1018 WB, The Netherlands. E-mail: [email protected]

‡Department of Economics, University of Bristol, Bristol BS8 1TN, UK. E-mail: [email protected]

§Centre for Microdata Methods and Practice, Institute for Fiscal Studies, 7 Ridgmount Street, London WC1E 7AE, UK

First version received: September 2008; final version accepted: August 2009
Summary The system GMM estimator for dynamic panel data models combines moment conditions for the model in first differences with moment conditions for the model in levels. It has been shown to improve on the GMM estimator in the first differenced model in terms of bias and root mean squared error. However, we show in this paper that in the covariance stationary panel data AR(1) model the expected values of the concentration parameters in the differenced and levels equations for the cross-section at time t are the same when the variances of the individual heterogeneity and idiosyncratic errors are the same. This indicates a weak instrument problem also for the equation in levels. We show that the biases of the 2SLS estimators, relative to those of the OLS estimators, are then similar for the equations in differences and levels, as are the size distortions of the Wald tests. These results are shown to extend to the panel data GMM estimators.

Keywords: Dynamic panel data, System GMM, Weak instruments.
1. INTRODUCTION

A commonly employed procedure for estimating the parameters of a dynamic panel data model with unobserved individual specific heterogeneity is to transform the model into first differences. Sequential moment conditions are then used, where lagged levels of the variables are instruments for the endogenous differences, and the parameters are estimated by GMM; see Arellano and Bond (1991). It has been well documented (see e.g. Blundell and Bond, 1998) that this GMM estimator in the first differenced (DIF) model can have very poor finite sample properties in terms of bias and precision when the series are persistent, as the instruments are then weak predictors of the endogenous changes. Blundell and Bond (1998) proposed the use of extra moment conditions that rely on certain stationarity conditions of the initial observation, as suggested by Arellano and Bover (1995). When these conditions are satisfied, the resulting system (SYS) GMM estimator has been shown in Monte Carlo studies by e.g. Blundell and Bond (1998) and Blundell et al.
(2000) to have much better finite sample properties in terms of bias and root mean squared error (rmse) than that of the DIF GMM estimator. The additional moment conditions of the SYS estimator can be shown to correspond to the model in levels (LEV), with lagged differences of the endogenous variables as instruments. Blundell and Bond (1998) argued that the SYS GMM estimator performs better than the DIF GMM estimator because the instruments in the LEV model remain good predictors for the endogenous variables in this model even when the series are very persistent. They showed for an AR(1) panel data model that the reduced form parameters in the LEV model do not approach 0 when the autoregressive parameter approaches 1, whereas the reduced form parameters in the DIF model do. Because of the good performance of the SYS GMM estimator relative to the DIF GMM estimator in terms of finite sample bias and rmse, it has become the estimator of choice in many applied panel data settings. Among the many examples where the SYS GMM estimator has been used are the estimation of production functions and technological spillovers using firm level panel data (see e.g. Levinsohn and Petrin, 2003, and Griffith et al., 2006), the estimation of demand for addictive goods using consumer level panel data (see e.g. Picone et al., 2004) and the estimation of growth models using country-level panel data (see e.g. Levine et al., 2000, and Bond et al., 2001). The country-level panel data in particular are characterized by highly persistent series (e.g. output or financial data) and a relatively small number of countries and time periods. The variance of the country effects is furthermore often expected to be quite high relative to the variance of the transitory shocks. As we show here, these characteristics combined may lead to a weak instrument problem also for the SYS GMM estimator. 
For a simple cross-section linear IV model, a measure of the information content of the instruments is the so-called concentration parameter (see e.g. Rothenberg, 1984). In this paper, we calculate the expected concentration parameters for the LEV and DIF reduced form models in a covariance stationary AR(1) panel data model. We do this per time period, i.e. we consider the estimation of the parameter using the moment conditions for a single cross-section only for any given time period. We show that the expected concentration parameters are equal in the LEV and DIF models when the variance of the unobserved heterogeneity term that is constant over time ($\sigma_\eta^2$) is equal to the variance of the idiosyncratic shocks ($\sigma_v^2$). This is exactly the environment under which most Monte Carlo results were obtained that showed the superiority of the SYS GMM estimator relative to the DIF GMM estimator. However, the equality in expectation of the concentration parameters indicates that there is also a weak instrument problem in the LEV model when the series are persistent. If the expected concentration parameters are the same, why is it that the extra information from the LEV moment conditions results in an estimator that has such superior finite sample properties in terms of bias and rmse? We first of all show that the biases of the OLS estimators in the DIF and LEV structural models are very different. The (absolute) bias of the LEV OLS estimator is much smaller than that of the OLS estimator in the DIF model when the series are very persistent. Using the results of higher-order expansions, we argue and show in Monte Carlo simulations that the biases of the LEV and DIF cross-sectional 2SLS estimators, relative to the biases of their respective OLS estimators, are the same when $\sigma_\eta^2 = \sigma_v^2$. Therefore, the absolute bias of the LEV 2SLS estimator is smaller than that of the DIF 2SLS estimator when the series are persistent.
Further expansion results as in Morimune (1989) indicate that we can expect the size distortions of the Wald tests to be similar in the cross-sectional 2SLS DIF and LEV models when the expected concentration parameters are the same. This is confirmed by a Monte Carlo analysis. When the expected concentration parameters are small, which happens when the series are very persistent, the size distortions of the Wald tests can become substantial. As the SYS 2SLS estimator is a weighted average of the DIF and LEV 2SLS estimators, with the weight on the LEV moment conditions increasing with increasing persistence of the series, the results for the SYS estimator mimic those of the LEV estimator quite closely. The expectation of the LEV concentration parameter is larger than that of the DIF model when $\sigma_\eta^2$ is smaller than $\sigma_v^2$; the relative biases of the LEV and SYS 2SLS estimators are then smaller, and the associated Wald tests perform better than those of DIF. The reverse is the case when $\sigma_\eta^2$ is larger than $\sigma_v^2$. Also, unlike for DIF, the LEV OLS bias increases with an increasing variance ratio, $vr = \sigma_\eta^2/\sigma_v^2$, and therefore the performances of the LEV and SYS 2SLS estimators deteriorate with increasing $vr$. These results are shown to extend to the panel data setting when estimating the model by GMM, and are in line with the finite sample bias approximation results of Bun and Kiviet (2006) and Hayakawa (2007), and with the findings from an extensive Monte Carlo study by Kiviet (2007). Furthermore, our theoretical results provide a rationale for the poor performance of the SYS GMM Wald test when data are persistent, as found by Bond and Windmeijer (2005).

For the covariance stationary AR(1) panel data model our results therefore show that the SYS GMM estimator has indeed a smaller bias and rmse than DIF GMM when the series are persistent, but that this bias increases with increasing $vr = \sigma_\eta^2/\sigma_v^2$ and can become substantial. The Wald test can be severely size distorted for both DIF and SYS GMM with persistent data, but the SYS Wald test size properties deteriorate further with increasing $vr$. These results follow from the weak instrument problem that is also present in the LEV moment conditions.

The set-up of the paper is as follows.
Section 2 introduces the AR(1) panel data model, the moment conditions and the GMM estimators. Section 3 briefly discusses the concentration parameter in a simple cross-section setting. Section 4 calculates the expected concentration parameters for the DIF and LEV models for cross-section analysis of the AR(1) panel data model, and presents the OLS biases and some Monte Carlo and theoretical results on (relative) biases and Wald test size distortions for the 2SLS estimators. Section 5 presents Monte Carlo and some analytical results for the GMM panel data estimators. Section 6 concludes.
2. MODEL AND GMM ESTIMATORS

We consider the first-order autoregressive panel data model
$$y_{it} = \alpha y_{i,t-1} + u_{it},\qquad u_{it} = \eta_i + v_{it},\qquad i = 1,\ldots,n;\ t = 2,\ldots,T, \qquad(2.1)$$
where it is assumed that $\eta_i$ and $v_{it}$ have an error components structure with
$$E(\eta_i) = 0,\quad E(v_{it}) = 0,\quad E(v_{it}\eta_i) = 0,\qquad i = 1,\ldots,n;\ t = 2,\ldots,T, \qquad(2.2)$$
$$E(v_{it}v_{is}) = 0,\qquad i = 1,\ldots,n\ \text{and}\ t \ne s, \qquad(2.3)$$
and the initial condition satisfies
$$E(y_{i1}v_{it}) = 0,\qquad i = 1,\ldots,n;\ t = 2,\ldots,T. \qquad(2.4)$$
Under these assumptions the following $(T-1)(T-2)/2$ linear moment conditions are valid:
$$E(y_i^{t-2}\Delta u_{it}) = 0,\qquad t = 3,\ldots,T, \qquad(2.5)$$
where $y_i^{t-2} = (y_{i1}, y_{i2},\ldots,y_{i,t-2})'$ and $\Delta u_{it} = u_{it} - u_{i,t-1} = \Delta y_{it} - \alpha\Delta y_{i,t-1}$. Defining
$$Z_{di} = \begin{pmatrix} y_{i1} & 0 & 0 & \cdots & 0 & \cdots & 0\\ 0 & y_{i1} & y_{i2} & \cdots & 0 & \cdots & 0\\ \vdots & & & \ddots & & & \vdots\\ 0 & 0 & 0 & \cdots & y_{i1} & \cdots & y_{i,T-2} \end{pmatrix};\qquad \Delta u_i = \begin{pmatrix}\Delta u_{i3}\\ \Delta u_{i4}\\ \vdots\\ \Delta u_{iT}\end{pmatrix},$$
moment conditions (2.5) can be more compactly written as
$$E(Z_{di}'\Delta u_i) = 0, \qquad(2.6)$$
and the GMM estimator for $\alpha$ is given by (see e.g. Arellano and Bond, 1991)
$$\hat\alpha_d = \frac{\Delta y_{-1}'Z_dW_n^{-1}Z_d'\Delta y}{\Delta y_{-1}'Z_dW_n^{-1}Z_d'\Delta y_{-1}},$$
where $\Delta y = (\Delta y_1', \Delta y_2',\ldots,\Delta y_n')'$, $\Delta y_i = (\Delta y_{i3}, \Delta y_{i4},\ldots,\Delta y_{iT})'$, $\Delta y_{-1}$ is the lagged version of $\Delta y$, $Z_d = (Z_{d1}', Z_{d2}',\ldots,Z_{dn}')'$ and $W_n$ is a weight matrix determining the efficiency properties of the GMM estimator. Clearly, $\hat\alpha_d$ is a GMM estimator in the differenced model; we refer to it as the DIF GMM estimator, and to moment conditions (2.5) or (2.6) as the DIF moment conditions.

Blundell and Bond (1998) exploit additional moment conditions based on the assumption on the initial condition (see Arellano and Bover, 1995) that
$$E(\eta_i\Delta y_{i2}) = 0, \qquad(2.7)$$
which holds when the process is mean stationary:
$$y_{i1} = \frac{\eta_i}{1-\alpha} + \varepsilon_i, \qquad(2.8)$$
with $E(\varepsilon_i) = E(\varepsilon_i\eta_i) = 0$. If (2.2), (2.3), (2.4) and (2.7) hold, then the following $(T-1)(T-2)/2$ moment conditions are valid:
$$E(u_{it}\Delta y_i^{t-1}) = 0,\qquad t = 3,\ldots,T, \qquad(2.9)$$
where $\Delta y_i^{t-1} = (\Delta y_{i2}, \Delta y_{i3},\ldots,\Delta y_{i,t-1})'$. Defining
$$Z_{li} = \begin{pmatrix} \Delta y_{i2} & 0 & 0 & \cdots & 0 & \cdots & 0\\ 0 & \Delta y_{i2} & \Delta y_{i3} & \cdots & 0 & \cdots & 0\\ \vdots & & & \ddots & & & \vdots\\ 0 & 0 & 0 & \cdots & \Delta y_{i2} & \cdots & \Delta y_{i,T-1} \end{pmatrix};\qquad u_i = \begin{pmatrix}u_{i3}\\ u_{i4}\\ \vdots\\ u_{iT}\end{pmatrix},$$
moment conditions (2.9) can be written as
$$E(Z_{li}'u_i) = 0, \qquad(2.10)$$
with the GMM estimator based on these moment conditions given by
$$\hat\alpha_l = \frac{y_{-1}'Z_lW_n^{-1}Z_l'y}{y_{-1}'Z_lW_n^{-1}Z_l'y_{-1}},$$
where we will refer to $\hat\alpha_l$ as the LEV GMM estimator, and to (2.9) or (2.10) as the LEV moment conditions. The full set of linear moment conditions under assumptions (2.2), (2.3), (2.4) and (2.7) is given by
$$E(y_i^{t-2}\Delta u_{it}) = 0,\quad t = 3,\ldots,T;\qquad E(u_{it}\Delta y_{i,t-1}) = 0,\quad t = 3,\ldots,T, \qquad(2.11)$$
or
$$E(Z_{si}'p_i) = 0, \qquad(2.12)$$
where
$$Z_{si} = \begin{pmatrix} Z_{di} & 0 & \cdots & 0\\ 0 & \Delta y_{i2} & & \vdots\\ \vdots & & \ddots & \\ 0 & \cdots & & \Delta y_{i,T-1} \end{pmatrix};\qquad p_i = \begin{pmatrix}\Delta u_i\\ u_i\end{pmatrix}.$$
The GMM estimator based on these moment conditions is
$$\hat\alpha_s = \frac{q_{-1}'Z_sW_n^{-1}Z_s'q}{q_{-1}'Z_sW_n^{-1}Z_s'q_{-1}},$$
with $q_i = (\Delta y_i', y_i')'$. This estimator is called the system or SYS GMM estimator, see Blundell and Bond (1998), and we refer to moment conditions (2.11) or (2.12) as the SYS moment conditions. In most derivations below, we further assume that the initial observation is drawn from the covariance stationary distribution, implying that $E(\varepsilon_i^2) = \frac{\sigma_v^2}{1-\alpha^2}$ in (2.8).
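The set-up of this section is easy to simulate. The sketch below is our own illustration, not from the paper (the parameter values and normal errors are assumptions): it generates the covariance-stationary AR(1) panel of (2.1)–(2.8) and checks the DIF moment condition (2.5) and the LEV moment condition (2.9) in sample at $t = 3$.

```python
import numpy as np

# Simulate the covariance-stationary panel AR(1) model (2.1)-(2.8) and check,
# at t = 3, the DIF moment E(y_{i1} * Delta u_{i3}) = 0 and the LEV moment
# E(u_{i3} * Delta y_{i2}) = 0. All parameter values are illustrative.
rng = np.random.default_rng(1)
n, T, alpha = 200000, 4, 0.8
sig_eta, sig_v = 1.0, 1.0

eta = rng.normal(0.0, sig_eta, n)
y = np.empty((n, T + 1))                         # columns hold y_{i1},...,y_{i,T+1}
# initial observation from the covariance-stationary distribution (2.8)
y[:, 0] = eta / (1 - alpha) + rng.normal(0.0, sig_v / np.sqrt(1 - alpha**2), n)
for t in range(1, T + 1):
    y[:, t] = alpha * y[:, t - 1] + eta + rng.normal(0.0, sig_v, n)

u3 = y[:, 2] - alpha * y[:, 1]                   # u_{i3} = eta_i + v_{i3}
du3 = u3 - (y[:, 1] - alpha * y[:, 0])           # Delta u_{i3}
dy2 = y[:, 1] - y[:, 0]                          # Delta y_{i2}
print(np.mean(y[:, 0] * du3), np.mean(u3 * dy2))  # both approximately zero
```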
3. CONCENTRATION PARAMETER

Consider the simple linear cross-section model with one endogenous regressor $x$ and $k_z$ instruments $z$:
$$y_i = x_i\beta + u_i,\qquad x_i = z_i'\pi + \xi_i, \qquad(3.1)$$
for $i = 1,\ldots,n$, where the $(u_i, \xi_i)$ are independent draws from a bivariate normal distribution with zero means, variances $\sigma_u^2$ and $\sigma_\xi^2$, and correlation coefficient $\rho$. The parameter $\beta$ is estimated by 2SLS:
$$\hat\beta = \frac{x'P_Zy}{x'P_Zx},$$
where $P_Z = Z(Z'Z)^{-1}Z'$.
It is well known that when instruments are weak, i.e. when they are only weakly correlated with the endogenous regressor, the 2SLS estimator can perform poorly in finite samples; see e.g. Bound et al. (1995), Staiger and Stock (1997), Stock et al. (2002) and Stock and Yogo (2005). With weak instruments, the 2SLS estimator is biased in the direction of the OLS estimator, and its distribution is non-normal, which affects inference using the Wald testing procedure. A measure of the strength of the instruments is the concentration parameter, which is defined as
$$\mu = \frac{\pi'Z'Z\pi}{\sigma_\xi^2}.$$
When it is evaluated at the OLS (first-stage) estimated parameters,
$$\hat\mu = \frac{\hat\pi'Z'Z\hat\pi}{\hat\sigma_\xi^2},$$
it is clear that $\hat\mu$ is equal to the Wald test statistic for testing the hypothesis $H_0: \pi = 0$, and $\hat\mu/k_z$ to the F-test statistic. Bound et al. (1995) and Staiger and Stock (1997) advocate use of the first-stage F-test to investigate the strength of the instruments.

Rothenberg (1984) shows how the concentration parameter relates to the distribution of the IV estimator by means of the following expansion:
$$\hat\beta = \beta + \frac{\pi'Z'u + \xi'P_Zu}{\pi'Z'Z\pi + 2\pi'Z'\xi + \xi'P_Z\xi}, \qquad(3.2)$$
and so
$$\sqrt{\mu}(\hat\beta - \beta) = \frac{\sigma_u}{\sigma_\xi}\,\frac{A + \frac{s}{\sqrt\mu}}{1 + \frac{2B}{\sqrt\mu} + \frac{S}{\mu}},$$
where
$$A = \frac{\pi'Z'u}{\sigma_u\sqrt{\pi'Z'Z\pi}};\qquad B = \frac{\pi'Z'\xi}{\sigma_\xi\sqrt{\pi'Z'Z\pi}};\qquad s = \frac{\xi'P_Zu}{\sigma_\xi\sigma_u};\qquad S = \frac{\xi'P_Z\xi}{\sigma_\xi^2}.$$
$(A, B)$ is bivariate normal with zero means, unit variances and correlation coefficient $\rho$. The variable $s$ has mean $k_z\rho$ and variance $k_z(1+\rho^2)$, and $S$ has mean $k_z$ and variance $2k_z$. It is clear that when $\mu$ is large, $\sqrt\mu(\hat\beta - \beta)$ behaves like a $N(0, \sigma_u^2/\sigma_\xi^2)$ random variable.

The concentration parameter $\mu$ is a key quantity in describing the finite sample properties of the IV estimator. The approximate bias of the 2SLS estimator can be obtained using higher-order asymptotics based on the expansion in (3.2); see Nagar (1959), Buse (1992) and Hahn and Kuersteiner (2002). Following Hahn and Kuersteiner (2002), the bias is derived from the expansion
$$E\big(n^{1/2}(\hat\beta_{2SLS} - \beta)\big) \approx E\Big(\frac{\pi'z_u}{\pi'Q\pi}\Big) + n^{-1/2}\Big(E\Big(\frac{z_\xi'Q^{-1}z_u}{\pi'Q\pi}\Big) - 2E\Big(\frac{(\pi'z_\xi)(\pi'z_u)}{(\pi'Q\pi)^2}\Big)\Big), \qquad(3.3)$$
where $z_u = \frac{1}{\sqrt n}Z'u$, $z_\xi = \frac{1}{\sqrt n}Z'\xi$ and $Q = E(z_iz_i')$. It follows that the approximate bias of the IV estimator can be expressed as
$$E(\hat\beta_{2SLS}) - \beta \approx \frac{1}{n}\,\frac{(k_z-2)\sigma_{u\xi}}{\pi'Q\pi} = \frac{(k_z-2)\sigma_{u\xi}}{\sigma_\xi^2\,nE\big(\frac{1}{n}\mu\big)}. \qquad(3.4)$$
Hence the bias is inversely proportional to the value of the concentration parameter. It does not only depend on the concentration parameter, but also on the number of instruments $k_z$ and the degree of endogeneity embodied in the covariance $\sigma_{u\xi}$. However, the relevance of the concentration parameter for the finite sample bias becomes even more pronounced when we consider the absolute bias of the IV estimator relative to that of the OLS estimator, defined as
$$\mathrm{RelBias} = \frac{|E(\hat\beta_{2SLS}) - \beta|}{|E(\hat\beta_{OLS}) - \beta|};$$
see e.g. Bound et al. (1995). The bias of the OLS estimator can be approximated by (see e.g. Hahn and Hausman, 2002)
$$E(\hat\beta_{OLS}) - \beta \approx \frac{\sigma_{u\xi}}{\pi'Q\pi + \sigma_\xi^2} = \frac{\sigma_{u\xi}}{\sigma_\xi^2}\,\frac{1}{E\big(\frac{1}{n}\mu\big) + 1},$$
which is equal to the inconsistency of OLS. The relative bias is then approximately given by
$$\mathrm{RelBias} \approx \frac{(k_z-2)\big(E\big(\frac{1}{n}\mu\big) + 1\big)}{nE\big(\frac{1}{n}\mu\big)}, \qquad(3.5)$$
i.e. a function of $E(\frac{1}{n}\mu)$, $n$ and $k_z$ only.

The concentration parameter is further an important element in describing size distortions of t or Wald tests based on the 2SLS estimator. For large $\mu$ the standard 2SLS t-ratio for testing $H_0: \beta = \beta_0$ behaves approximately as a standard normal. Morimune (1989) derives a higher-order expansion of this conventional 2SLS t-ratio. Applying theorem 2 of Morimune (1989), we find for the set-up with one endogenous regressor and no additional exogenous regressors that the $O(n^{-1/2})$ and $O(n^{-1})$ terms in the expansion of the 2SLS t-statistic only depend on $\mu$, $k_z$ and $\rho_{u\xi}$, the correlation coefficient of $u$ and $\xi$. Moreover, for a two-sided t-test the $O(n^{-1/2})$ term cancels in the approximation.

All results discussed above are based on conventional higher-order asymptotics, i.e. assuming strong identification. Hence, these higher-order approximations may not always be informative in the case of weak instruments. However, regarding the relevance of the concentration parameter, weak instrument asymptotics as derived by Staiger and Stock (1997) lead to similar conclusions as conventional fixed-parameter higher-order asymptotics. Staiger and Stock (1997) develop weak instrument asymptotics by setting $\pi = \pi_n = C/\sqrt n$, in which case the concentration parameter converges to a constant. They then show that 2SLS is not consistent and has a non-standard asymptotic distribution. These results are of course different from conventional asymptotics. However, Staiger and Stock (1997) show that the asymptotic bias of the 2SLS estimator, relative to that of the OLS estimator, again only depends on $k_z$ and $\mu$. Furthermore, the distributions of the 2SLS t-ratio and Wald statistic only depend on $\mu$, $k_z$ and $\rho_{u\xi}$.

Summarizing, conventional first-order fixed-parameter asymptotics fail to give accurate approximations in the case of weak instruments. Inspired by Bound et al. (1995) and Staiger and Stock (1997), we use the concentration parameter to characterize relative bias and size distortions of Wald tests. One can proceed either with higher-order fixed-parameter asymptotics or with weak instrument asymptotics; in the analysis below we have chosen the former approach. In the panel AR(1) model weak instruments arise when $\alpha \to 1$ and/or $\sigma_\eta^2/\sigma_v^2 \to \infty$. Kruiniger (2009) applies 'local to unity' asymptotics and shows that the Staiger and Stock (1997) set-up does not always apply straightforwardly to dynamic panel data models. More importantly, we find in our cross-sectional simulations below a weak instrument problem already for $\alpha = 0.4$ and $\sigma_\eta^2/\sigma_v^2 = 4$, with the relative bias well approximated by (3.5). Expansion (3.3) also allows us to approximate the bias for less straightforward cases, like the cross-sectional system 2SLS estimator.
4. CROSS-SECTION RESULTS FOR THE AR(1) PANEL DATA MODEL

Although the data are not generated as in the cross-section model (3.1), we can write the structural equation and the reduced form model for the AR(1) panel data model in first differences for the cross-section at time t as
\[
\Delta y_{it} = \alpha \Delta y_{i,t-1} + \Delta u_{it}, \qquad \Delta y_{i,t-1} = y_i^{t-2\,\prime} \pi_{dt} + d_{i,t-1}.
\]
For the general expression of the expected value of the concentration parameter divided by n we get
\[
E\left(\frac{1}{n}\mu_{dt}\right) = \frac{\pi_{dt}' \, E\left(y_i^{t-2} y_i^{t-2\,\prime}\right) \pi_{dt}}{\sigma_{dt}^2}.
\]
For the model in levels we have for the cross-section at time t
\[
y_{it} = \alpha y_{i,t-1} + \eta_i + v_{it}, \qquad y_{i,t-1} = \Delta y_i^{t-1\,\prime} \pi_{lt} + l_{i,t-1},
\]
and the expected concentration parameter is given by
\[
E\left(\frac{1}{n}\mu_{lt}\right) = \frac{\pi_{lt}' \, E\left(\Delta y_i^{t-1} \Delta y_i^{t-1\,\prime}\right) \pi_{lt}}{\sigma_{lt}^2}.
\]
In the Appendix we show that, under covariance stationarity of the initial observation,
\[
E\left(\frac{1}{n}\mu_{dt}\right) = \frac{(1-\alpha)^2\left(\sigma_v^2 + (t-3)\sigma_\eta^2\right)}{(1-\alpha^2)\sigma_v^2 + \left((t-1)-(t-3)\alpha\right)(1+\alpha)\sigma_\eta^2}
\]
and
\[
E\left(\frac{1}{n}\mu_{lt}\right) = \frac{(t-2)(1-\alpha)^2\sigma_v^2}{(1-\alpha^2)\sigma_v^2 + \left((t-1)-(t-3)\alpha\right)(1+\alpha)\sigma_\eta^2},
\]

The weak instrument problem of the system GMM estimator

from which it follows that
\[
\frac{E\left(\frac{1}{n}\mu_{dt}\right)}{E\left(\frac{1}{n}\mu_{lt}\right)} = \frac{\sigma_v^2 + (t-3)\sigma_\eta^2}{(t-2)\sigma_v^2} = \frac{1}{t-2}\left(1 + (t-3)\frac{\sigma_\eta^2}{\sigma_v^2}\right).
\]
Therefore,
\[
E\left(\tfrac{1}{n}\mu_{dt}\right) = E\left(\tfrac{1}{n}\mu_{lt}\right) \quad \text{if } t = 3,
\]
and for t > 3
\[
\begin{aligned}
E\left(\tfrac{1}{n}\mu_{dt}\right) &> E\left(\tfrac{1}{n}\mu_{lt}\right) && \text{if } \sigma_\eta^2 > \sigma_v^2,\\
E\left(\tfrac{1}{n}\mu_{dt}\right) &= E\left(\tfrac{1}{n}\mu_{lt}\right) && \text{if } \sigma_\eta^2 = \sigma_v^2,\\
E\left(\tfrac{1}{n}\mu_{dt}\right) &< E\left(\tfrac{1}{n}\mu_{lt}\right) && \text{if } \sigma_\eta^2 < \sigma_v^2.
\end{aligned}
\]
Figure 1 graphs the values of E((1/n)μ_dt) and E((1/n)μ_lt) as a function of α for t = 6 and various values of ση²/σv² = {1/4, 1, 4}. The values of the concentration parameters decrease with increasing α. The concentration parameter for the LEV model is much more sensitive to the value of the variance ratio ση²/σv² than the concentration parameter of the DIF model.

4.1. Discussion
The fact that the concentration parameters are the same in expectation for the IV estimators based on the DIF or LEV moment conditions for t = 3, and for t > 3 when ση² = σv², seems contrary to the findings in Monte Carlo studies; see e.g. Blundell and Bond (1998) and Blundell et al. (2000), who use a covariance stationary design with ση² = σv² = 1. In those simulation studies α̂_l outperforms α̂_d in terms of bias and RMSE, especially when the series become more persistent, i.e. when α gets larger. The identification problem is apparent in the DIF model, where the reduced form parameters approach zero when α approaches 1. This is in sharp contrast to the reduced form parameters in the LEV model, which approach 1/2 when α approaches 1. This was the argument used by Blundell and Bond (1998) to assert the strength of the LEV moment conditions for the estimation of α for larger values of α.

There are two questions to be addressed. First, why are the behaviours of the two estimators so different in terms of bias and RMSE when they have the same expected concentration parameter? Second, how does the weak instrument problem in the LEV model manifest itself?

To answer the first question one has to realize that the structural models are different for DIF and LEV, with different endogeneity problems. Therefore, different biases arise for both the OLS and 2SLS estimators in the two equations. For the DIF model
\[
\Delta y_{it} = \alpha \Delta y_{i,t-1} + \Delta u_{it},
\]

Figure 1. E((1/n)μ).

the OLS estimator for the cross-section at time t is given by
\[
\hat{\alpha}_d^{OLS} = \alpha + \frac{\Delta y_{t-1}' \Delta u_t}{\Delta y_{t-1}' \Delta y_{t-1}},
\]
and the limiting bias of the OLS estimator is, again assuming covariance stationarity,
\[
\operatorname{plim}\left(\hat{\alpha}_d^{OLS} - \alpha\right) = -\frac{1+\alpha}{2}.
\]
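This limiting bias is easy to check numerically. The sketch below simulates the covariance stationary AR(1) panel model of this section and computes the cross-sectional DIF OLS estimator (sample size, seed and parameter values are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, alpha, s_eta, s_v = 200_000, 6, 0.5, 1.0, 1.0

# covariance stationary start: y_i1 = eta_i/(1 - alpha) + eps_i
eta = rng.normal(0.0, s_eta, n)
y = np.empty((n, T))
y[:, 0] = eta / (1 - alpha) + rng.normal(0.0, np.sqrt(s_v**2 / (1 - alpha**2)), n)
for t in range(1, T):
    y[:, t] = alpha * y[:, t - 1] + eta + rng.normal(0.0, s_v, n)

# DIF OLS for the cross-section at t = 6: regress dy_t on dy_{t-1}
dy_t, dy_tm1 = y[:, 5] - y[:, 4], y[:, 4] - y[:, 3]
a_dif_ols = (dy_tm1 @ dy_t) / (dy_tm1 @ dy_tm1)

print(a_dif_ols - alpha)  # close to -(1 + alpha)/2 = -0.75
```

With n this large the estimate sits within a few thousandths of the theoretical limit.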
For the LEV model y_it = αy_{i,t−1} + η_i + v_it, the OLS estimator is given by
\[
\hat{\alpha}_l^{OLS} = \alpha + \frac{y_{t-1}' u_t}{y_{t-1}' y_{t-1}},
\]
and the limiting bias of the OLS estimator is given by
\[
\operatorname{plim}\left(\hat{\alpha}_l^{OLS} - \alpha\right) = \frac{(1-\alpha)\frac{\sigma_\eta^2}{\sigma_v^2}}{\frac{\sigma_\eta^2}{\sigma_v^2} + \frac{1-\alpha}{1+\alpha}},
\]
which reduces to plim(α̂_l^OLS − α) = (1 − α²)/2 when ση² = σv². The asymptotic absolute bias of α̂_l^OLS is therefore (much) smaller than that of α̂_d^OLS for high values of α.

Using (3.4) we can approximate the bias of the 2SLS estimator in the DIF model by
\[
E(\hat{\alpha}_d) - \alpha \approx (k_z - 2)\,\frac{\sigma_{u,d}}{\sigma_d^2} \Big/ E(\mu_d) = (t-4)\,\frac{\sigma_{u,d}}{\sigma_d^2} \Big/ E(\mu_d), \tag{4.1}
\]
where we have suppressed the subscripts t for ease of exposition, and where
\[
\sigma_{u,d} = E\left[\left(\Delta y_{i,t-1} - y_i^{t-2\,\prime}\pi_d\right)\Delta u_{it}\right] = -\sigma_v^2.
\]
Therefore,
\[
E(\hat{\alpha}_d) - \alpha \approx -(t-4)\sigma_v^2 \Big/ \left\{ \frac{n\sigma_v^2}{(1+\alpha)^2}\left[1-\alpha^2 - \frac{\sigma_\eta^2(1+\alpha)^2}{\sigma_v^2 + \sigma_\eta^2\left(t-3+\frac{1+\alpha}{1-\alpha}\right)}\right] \right\}
= \frac{-(t-4)(1+\alpha)^2}{n\left[1-\alpha^2 - \dfrac{\sigma_\eta^2(1+\alpha)^2}{\sigma_v^2 + \sigma_\eta^2\left(t-3+\frac{1+\alpha}{1-\alpha}\right)}\right]},
\]
where we have used the expressions for σ_d² and E(μ_d) from the Appendix.
σu,l ÷ E(μl ) σl2
(4.2)
with σu,l = E
yi,t−1 − yit−1 πl uit =
ση2 1−α
,
and therefore E(αˆ l ) − α ≈ =
(t − 4)ση2 1−α
÷
n(t − 2)σv2 (1 + α)((t − 1) − (t − 3)α)
t − 4 ση2 (1 + α)((t − 1) − (t − 3)α) . t − 2 σv2 n(1 − α)
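These approximations can be evaluated directly; the values below reproduce the italicized bias approximations in Table 2 (a sketch; the helper functions are ours and implement the expanded forms of (4.1) and (4.2)):

```python
def bias_dif(alpha, t, s2_eta, s2_v, n):
    # higher-order 2SLS bias approximation for the DIF model
    inner = 1 - alpha**2 - s2_eta * (1 + alpha) ** 2 / (
        s2_v + s2_eta * (t - 3 + (1 + alpha) / (1 - alpha))
    )
    return -(t - 4) * (1 + alpha) ** 2 / (n * inner)

def bias_lev(alpha, t, s2_eta, s2_v, n):
    # higher-order 2SLS bias approximation for the LEV model
    return ((t - 4) / (t - 2) * (s2_eta / s2_v)
            * (1 + alpha) * ((t - 1) - (t - 3) * alpha) / (n * (1 - alpha)))

print(round(bias_dif(0.8, 6, 4, 1, 200), 3))     # -0.339
print(round(bias_lev(0.8, 6, 4, 1, 200), 3))     # 0.234
print(round(bias_dif(0.8, 6, 0.25, 1, 200), 3))  # -0.206
print(round(bias_lev(0.8, 6, 0.25, 1, 200), 3))  # 0.015
```

At α = 0.8 and vr = 4 the approximations give −0.339 (DIF) and 0.234 (LEV), the entries in the corresponding approximation rows of Table 2.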
Comparing these expressions is somewhat complicated, but when ση² = σv² the absolute bias of the LEV 2SLS estimator will tend to be smaller than that of the DIF estimator. The main reason for this is that the absolute LEV OLS bias is smaller than the DIF OLS bias.

To answer the second question we now consider relative bias. Combining the results above on absolute OLS and 2SLS bias we get for the approximate relative absolute bias
\[
\mathrm{RelBias}_d = \frac{|E(\hat{\alpha}_d) - \alpha|}{\left|E\left(\hat{\alpha}_d^{OLS} - \alpha\right)\right|} \approx \frac{(t-4)\left(E\left(\frac{1}{n}\mu_d\right)+1\right)}{E(\mu_d)}
= \frac{2(t-4)(1+\alpha)}{n\left[1-\alpha^2 - \dfrac{\sigma_\eta^2(1+\alpha)^2}{\sigma_v^2+\sigma_\eta^2\left(t-3+\frac{1+\alpha}{1-\alpha}\right)}\right]}
\]
and
\[
\mathrm{RelBias}_l = \frac{|E(\hat{\alpha}_l) - \alpha|}{\left|E\left(\hat{\alpha}_l^{OLS} - \alpha\right)\right|} \approx \frac{(t-4)\left(E\left(\frac{1}{n}\mu_l\right)+1\right)}{E(\mu_l)}
= \frac{t-4}{t-2}\,\frac{\left(\frac{\sigma_\eta^2}{\sigma_v^2}+\frac{1-\alpha}{1+\alpha}\right)(1+\alpha)\left((t-1)-(t-3)\alpha\right)}{n(1-\alpha)^2}.
\]
When ση² = σv² we have that E((1/n)μ_d) = E((1/n)μ_l), and we expect therefore that the relative biases are the same for the DIF and LEV 2SLS estimators. Indeed this is the case, and it amounts to
\[
\mathrm{RelBias}_d = \mathrm{RelBias}_l \approx \frac{2(t-4)}{t-2}\,\frac{(t-1)-(t-3)\alpha}{n(1-\alpha)^2}.
\]
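At the design used in the simulations below (t = 6, n = 200), this common relative bias evaluates as follows and matches the italicized entries of Table 2 (a sketch; the function name is ours):

```python
def rel_bias_common(alpha, t, n):
    # relative 2SLS/OLS bias shared by DIF and LEV when s2_eta == s2_v
    return 2 * (t - 4) / (t - 2) * ((t - 1) - (t - 3) * alpha) / (n * (1 - alpha) ** 2)

print(round(rel_bias_common(0.4, 6, 200), 3))  # 0.053
print(round(rel_bias_common(0.8, 6, 200), 3))  # 0.325
```

The sixfold increase between α = 0.4 and α = 0.8 reflects the (1 − α)² term in the denominator, i.e. the weakening of the instruments with persistence.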
Finally, as mentioned in Section 3, the finite sample behaviour of the Wald test depends on the magnitude of the concentration parameter, the number of instruments and the correlation between the model errors. It is easily verified that ρ²_{u,d} = ρ²_{u,l} when ση² = σv², and therefore the size distortions of the Wald test are expected to be the same for the DIF and LEV estimators in that case. When ση² < σv² we have that both E(μ_d) < E(μ_l) and ρ²_{u,d} > ρ²_{u,l}, and therefore the Wald size distortion is expected to be smaller for the LEV estimator in that case. It is expected to be smaller for the DIF estimator when ση² > σv², as then both E(μ_d) > E(μ_l) and ρ²_{u,d} < ρ²_{u,l}.

4.2. System estimator

For the cross-section at time t the SYS estimator combines the moment conditions of the DIF and LEV estimators. The OLS estimator in the SYS 'model'
\[
\begin{pmatrix} \Delta y_{it} \\ y_{it} \end{pmatrix} = \alpha \begin{pmatrix} \Delta y_{i,t-1} \\ y_{i,t-1} \end{pmatrix} + \begin{pmatrix} \Delta u_{it} \\ u_{it} \end{pmatrix} \tag{4.3}
\]
is given by
\[
\hat{\alpha}_s^{OLS} = \left(\Delta y_{t-1}' \Delta y_{t-1} + y_{t-1}' y_{t-1}\right)^{-1}\left(\Delta y_{t-1}' \Delta y_t + y_{t-1}' y_t\right)
\]
and is clearly a weighted average of the DIF and LEV OLS estimators
\[
\hat{\alpha}_s^{OLS} = \tilde{\gamma}\,\hat{\alpha}_d^{OLS} + (1-\tilde{\gamma})\,\hat{\alpha}_l^{OLS},
\]
where
\[
\tilde{\gamma} = \frac{\Delta y_{t-1}' \Delta y_{t-1}}{\Delta y_{t-1}' \Delta y_{t-1} + y_{t-1}' y_{t-1}}
\quad\text{and}\quad
\operatorname{plim}(\tilde{\gamma}) = \frac{1-\alpha}{\frac{3}{2} - \alpha + \frac{1}{2}\frac{\sigma_\eta^2}{\sigma_v^2}\frac{1+\alpha}{1-\alpha}}.
\]
The bias of the OLS estimator will therefore behave like the bias of the LEV OLS estimator when α → 1 and/or ση²/σv² → ∞, as γ̃ → 0 in these cases. The asymptotic bias of α̂_s^OLS is given by
\[
\operatorname{plim}\left(\hat{\alpha}_s^{OLS} - \alpha\right) = \frac{(1-\alpha^2)\left(\alpha - 1 + \frac{\sigma_\eta^2}{\sigma_v^2}\right)}{(3-2\alpha)(1-\alpha) + \frac{\sigma_\eta^2}{\sigma_v^2}(1+\alpha)}.
\]
We can express the limiting bias of the SYS OLS estimator as
\[
\operatorname{plim}\left(\hat{\alpha}_s^{OLS}\right) - \alpha = \frac{\left(\sigma_{u,d} + \sigma_{u,l}\right)\big/\left(\sigma_d^2 + \sigma_l^2\right)}{E\left(\frac{1}{n}\mu_s\right) + 1},
\]
where
\[
E\left(\frac{1}{n}\mu_s\right) = \phi\, E\left(\frac{1}{n}\mu_d\right) + (1-\phi)\, E\left(\frac{1}{n}\mu_l\right)
\quad\text{and}\quad
\phi = \frac{\sigma_d^2}{\sigma_d^2 + \sigma_l^2}.
\]
When ση² = σv², we then have that E((1/n)μ_s) = E((1/n)μ_d) = E((1/n)μ_l). As
\[
\sigma_{u,d} + \sigma_{u,l} = \frac{\sigma_\eta^2}{1-\alpha} - \sigma_v^2,
\]
we see that the absolute SYS OLS bias is then (substantially) smaller than the DIF and LEV OLS biases, and equal to 0 when α = 0.

Figure 2 shows the asymptotic biases of the DIF, LEV and SYS OLS estimators as a function of α for different values of ση²/σv² = {1/4, 1, 4}. It is clear from this picture that the LEV and SYS OLS biases are much smaller than the DIF OLS bias for higher values of α.

Figure 2. Asymptotic biases of OLS estimators.

The SYS 2SLS estimator for cross-section t is also a weighted average of the DIF and LEV cross-sectional 2SLS estimators,
\[
\hat{\alpha}_s = \tilde{\delta}\,\hat{\alpha}_d + (1-\tilde{\delta})\,\hat{\alpha}_l,
\]
where
\[
\tilde{\delta} = \frac{\hat{\pi}_d' Z_d' Z_d \hat{\pi}_d}{\hat{\pi}_d' Z_d' Z_d \hat{\pi}_d + \hat{\pi}_l' Z_l' Z_l \hat{\pi}_l};
\]
see also Blundell et al. (2000); with
\[
\operatorname{plim}(\tilde{\delta}) = \frac{E\left(\frac{1}{n}\mu_d\right)}{E\left(\frac{1}{n}\mu_d\right) + \frac{\sigma_l^2}{\sigma_d^2}\,E\left(\frac{1}{n}\mu_l\right)},
\]
and again δ̃ → 0 if α → 1 and/or ση²/σv² → ∞. Clearly, the absolute bias of the SYS 2SLS estimator will be smaller than the maximum of the absolute biases of the DIF and LEV 2SLS estimators. Combining the results on the OLS biases, the values of the concentration parameters in the DIF and LEV models, and the relative weights on the DIF and LEV moment conditions in the SYS 2SLS estimator, we expect the absolute bias of the SYS estimator to be small for large values of α, but that this bias is an increasing function of ση²/σv². This happens because the bias of the LEV OLS estimator is an increasing function of ση²/σv², the LEV concentration parameter is a decreasing function of ση²/σv², and the weight (1 − δ̃) is an increasing function of ση²/σv², implying that more weight will be given to the LEV moment conditions.

The definition of μ_s above suggests a concentration parameter equivalent for the SYS model given by
\[
\mu_s = \frac{\pi_d' Z_d' Z_d \pi_d + \pi_l' Z_l' Z_l \pi_l}{\sigma_d^2 + \sigma_l^2}.
\]
However, in this case the value of μ_s does not directly convey the magnitude of the bias of the 2SLS estimator relative to the bias of the OLS estimator. This is due to the additional covariance terms of the reduced form errors d and l. As in (3.3), consider the approximation
\[
E\left(n^{1/2}(\hat{\alpha}_s - \alpha)\right) \approx E\left(\frac{\pi_d' z_{d,u} + \pi_l' z_{l,u}}{\pi_d' Q_d \pi_d + \pi_l' Q_l \pi_l}\right)
+ n^{-1/2} E\left(\frac{z_{d,d}' Q_d z_{d,u} + z_{l,l}' Q_l z_{l,u}}{\pi_d' Q_d \pi_d + \pi_l' Q_l \pi_l}\right)
- 2 n^{-1/2} E\left(\frac{\left(\pi_d' z_{d,d} + \pi_l' z_{l,l}\right)\left(\pi_d' z_{d,u} + \pi_l' z_{l,u}\right)}{\left(\pi_d' Q_d \pi_d + \pi_l' Q_l \pi_l\right)^2}\right),
\]
where z_{a,b} = (1/√n) Z_a' b. We then get the approximate bias expression for the SYS 2SLS estimator:
\[
E(\hat{\alpha}_s) - \alpha \approx \frac{1}{n}\,\frac{(t-2)\left(\sigma_{u,d} + \sigma_{u,l}\right)}{\pi_d' Q_d \pi_d + \pi_l' Q_l \pi_l}
- \frac{2}{n}\,\frac{\frac{\sigma_\eta^2}{1-\alpha}\,\pi_l' Q_l \pi_l - \sigma_v^2\,\pi_d' Q_d \pi_d}{\left(\pi_d' Q_d \pi_d + \pi_l' Q_l \pi_l\right)^2}
- \frac{2}{n}\, E\left(\frac{\pi_d' z_{d,d}\,\pi_l' z_{l,u} + \pi_l' z_{l,l}\,\pi_d' z_{d,u}}{\left(\pi_d' Q_d \pi_d + \pi_l' Q_l \pi_l\right)^2}\right). \tag{4.4}
\]
We calculate this approximate bias expression and the associated relative bias for the Monte Carlo simulation example in the next section, where it is shown that the relative bias of the SYS estimator is smaller than that of the LEV or DIF estimator when ση² = σv², even though in that case E(μ_d) = E(μ_l) = E(μ_s). Clearly, the SYS 2SLS estimator is not efficient, as there is heteroscedasticity and correlation between the errors in model (4.3). We focus on the 2SLS estimator here in the cross-section analysis and consider the efficient two-step GMM estimator below when considering the full panel data analysis.

4.3. Some Monte Carlo results

To investigate the finite sample behaviour of the estimators and Wald test statistics we conduct the following Monte Carlo experiment. We compute the OLS and 2SLS estimators for LEV, DIF and SYS for the cross-section t = 6 for the model specification
\[
y_{i1} = \frac{\eta_i}{1-\alpha} + \varepsilon_i; \qquad y_{it} = \alpha y_{i,t-1} + \eta_i + v_{it};
\]
\[
\varepsilon_i \sim N\left(0, \frac{\sigma_v^2}{1-\alpha^2}\right); \qquad \eta_i \sim N\left(0, \sigma_\eta^2\right); \qquad v_{it} \sim N\left(0, \sigma_v^2\right),
\]
for sample size n = 200, σv² = 1, and different values of α = {0.4, 0.8} and ση² = {1/4, 1, 4}. Note that in this design the results depend only on the variance ratio vr = ση²/σv², not on the total variance ση² + σv². There are four instruments for the DIF and LEV 2SLS estimators, whereas the SYS 2SLS estimator is in this cross-sectional case based on the eight combined moment conditions. Tables 1 and 2 present the estimation results for 10,000 Monte Carlo replications.

The results in Tables 1 and 2 confirm the findings and conjectures stated in the previous sections. The DIF OLS (absolute) bias is larger than the LEV OLS bias in all cases, especially when the series are more persistent at α = 0.8. The relative biases of the DIF and LEV 2SLS estimators are, however, the same when vr = ση²/σv² = 1. These relative biases are equal to 0.052 and 0.057, respectively, when α = 0.4, in which case the expected concentration parameters are equal to 46.75. The relative biases are larger, 0.310 and 0.312, respectively, when α = 0.8. For this case the expected concentration parameters are much smaller and equal to 6.35, which corresponds to a first-stage F-statistic of 6.35/4 ≈ 1.59. The relative bias of the DIF 2SLS estimator does not vary much with the different values of vr when α = 0.4, whereas that of the LEV 2SLS estimator does. It is only 0.029 when vr = 1/4, but increases to 0.169 when vr = 4. These are exactly in line with the larger variation in the values
Table 1. Cross-section estimation results.

                            DIF              LEV              SYS
α    vr                Coeff.     SD     Coeff.     SD     Coeff.     SD
0.4  1/4    OLS       −0.300   0.067     0.621   0.056     0.224   0.057
            2SLS       0.370   0.173     0.406   0.092     0.389   0.081
            E(μ)       58.06            132.7
     1      OLS       −0.301   0.067     0.820   0.041     0.523   0.049
            2SLS       0.364   0.189     0.424   0.113     0.404   0.095
            E(μ)       46.75             46.75
     4      OLS       −0.301   0.067     0.942   0.024     0.812   0.029
            2SLS       0.360   0.197     0.492   0.157     0.462   0.122
            E(μ)       42.31             13.02
0.8  1/4    OLS       −0.100   0.070     0.938   0.025     0.824   0.028
            2SLS       0.597   0.404     0.815   0.084     0.793   0.083
            E(μ)        9.15             20.92
     1      OLS       −0.100   0.070     0.980   0.014     0.938   0.015
            2SLS       0.521   0.464     0.856   0.092     0.834   0.090
            E(μ)        6.35              6.35
     4      OLS       −0.100   0.070     0.995   0.007     0.983   0.007
            2SLS       0.484   0.485     0.932   0.085     0.917   0.079
            E(μ)        5.45              1.68

Notes: Means and standard deviations (SD) of 10,000 estimates. vr = ση²/σv². n = 200. t = 6.
Table 2. Bias approximations.

                     DIF                 LEV                 SYS
α    vr         Bias    RelBias     Bias    RelBias     Bias    RelBias
0.4  1/4      −0.030     0.043     0.006     0.029    −0.011     0.063
              −0.031     0.044     0.006     0.025    −0.012     0.068
     1        −0.036     0.052     0.024     0.057     0.004     0.031
              −0.037     0.053     0.022     0.053     0.003     0.021
     4        −0.039     0.057     0.092     0.169     0.062     0.151
              −0.040     0.057     0.089     0.164     0.065     0.157
0.8  1/4      −0.203     0.225     0.015     0.109    −0.007     0.314
              −0.206     0.229     0.015     0.106    −0.010     0.403
     1        −0.279     0.310     0.056     0.312     0.034     0.243
              −0.293     0.325     0.059     0.325     0.033     0.241
     4        −0.316     0.351     0.132     0.681     0.117     0.640
              −0.339     0.377     0.234     1.203     0.208     1.140

Notes: Mean bias and relative bias from 10,000 estimates. RelBias = |ᾱ̂_2SLS − α|/|ᾱ̂_OLS − α|. Higher-order bias approximations in italics (second row of each pair). vr = ση²/σv². n = 200. t = 6.
of the expected concentration parameter for the LEV model. They are 132.7 when vr = 1/4 and 13.0 when vr = 4, compared with 58.1 and 42.3, respectively, for the DIF model. The absolute bias of the DIF 2SLS estimator is smaller than that of the LEV 2SLS one when vr = 4, but larger in the other cases.

When α = 0.8, there is a similar pattern to the results of the relative biases. For the LEV 2SLS model it now decreases to 0.11 when vr = 1/4, with the expected concentration parameter equal to 20.9. It increases to 0.68 when vr = 4, where the expected concentration parameter is only 1.68. As explained before, we see that the weak instrument problem for the LEV moment conditions, given α, becomes more severe with increasing vr. As both the OLS bias and the relative bias increase with increasing vr, so does the absolute bias of the 2SLS estimator. When α = 0.8, the absolute bias of the LEV 2SLS estimator ranges from 0.015 when vr = 1/4 to 0.132 when vr = 4.

The SYS 2SLS estimator has a slightly smaller relative bias than the DIF and LEV ones when vr = 1. It is 0.03 when α = 0.4 and 0.24 when α = 0.8. Unlike the results for the LEV 2SLS estimator, the relative bias actually increases when vr = 1/4, although the absolute bias is quite small, especially when α = 0.8. The relative bias is quite large in that case because the bias of the SYS OLS estimator is very small. When vr = 4 the relative and absolute biases of the SYS 2SLS estimator are similar to those of the LEV 2SLS estimator, albeit slightly smaller.

Table 2 further shows that the higher-order bias and relative 2SLS bias approximations, calculated from (3.4) and (3.5) for DIF and LEV and from (4.4) for SYS, are very accurate. The exception is when the concentration parameter is very small, as for LEV when α = 0.8 and vr = 4. Then the bias approximations indicate too high a bias for LEV and SYS.

Figures 3 and 4 display p-value plots for the Wald test for testing H0 : α = α0, with α0 the true parameter value, for the various values of vr = ση²/σv².
When vr = 1, the size properties of the Wald tests based on the DIF and LEV 2SLS estimates are virtually identical, which is as expected as the concentration parameters are equal in expectation, as are the correlation coefficients of the model errors. It is also clear that when α = 0.8, the size properties of the Wald tests are very poor, with a large overrejection of the null reflecting the low value of the concentration parameters. The size properties of the Wald test based on the SYS 2SLS estimation results are better than those based on the DIF and LEV 2SLS results, but again very poor when α = 0.8. When vr = 1/4 the size properties of the Wald tests based on the LEV and SYS 2SLS estimation results are quite good, even when α = 0.8, whereas they are very poor when vr = 4. The Wald test results based on the DIF 2SLS estimates are not very sensitive to the value of vr. These results are again in line with expectation given the results of the previous section.

4.4. Mean stationarity only

In all the derivations so far we assumed covariance stationarity of the initial condition. When we assume mean stationarity only, i.e.
\[
y_{i1} = \frac{\eta_i}{1-\alpha} + \varepsilon_i
\]
with E(ε_i²) = σ_ε², we show in the Appendix that for t = 3
\[
E\left(\tfrac{1}{n}\mu_{l3}\right) > E\left(\tfrac{1}{n}\mu_{d3}\right) \quad \text{if } \sigma_\varepsilon^2 < \frac{\sigma_v^2}{1-\alpha^2},
\]
\[
E\left(\tfrac{1}{n}\mu_{l3}\right) < E\left(\tfrac{1}{n}\mu_{d3}\right) \quad \text{if } \sigma_\varepsilon^2 > \frac{\sigma_v^2}{1-\alpha^2},
\]
Figure 3. Wald test p-value plots. H0 : α = 0.4.
so that, when t = 3, the expected concentration parameter for the LEV model is larger than that of the DIF model when the variance of the initial condition is smaller than the covariance stationary level and vice versa.
5. PANEL DATA ANALYSIS

The concept of the concentration parameter and its relationship to relative bias and size distortion of the Wald test does not readily extend to general GMM estimation; see e.g. Stock and Wright (2000) and Han and Phillips (2006). Estimation of the panel AR(1) model by 2SLS, using all available time periods and the full set of sequential moment conditions for the DIF and SYS models (2.6) and (2.12), will result in a weighted average of the period specific 2SLS estimates. Weighting by the efficient weight matrix will lead to different results, but we expect the weak instrument issues as documented in the previous section for the DIF and LEV cross-sectional estimates to carry over to the linear GMM estimation. This is indeed confirmed by the Monte Carlo results presented here.

Tables 3 and 4 present Monte Carlo estimation results for the AR(1) model with normally distributed η_i and v_it, with n = 200, T = 6, α = 0.8 and vr = (0.25, 1, 4). We present 2SLS and one-step and two-step GMM estimation results. For the initial weight matrix of the one-step GMM DIF estimator we use W_n = Σ_{i=1}^n Z_{di}' A Z_{di}, where A is a (T − 2) square matrix that has 2's on the main diagonal, −1's on the first subdiagonals, and zeros elsewhere. This is the efficient
Figure 4. Wald test p-value plots. H0 : α = 0.8.
weight matrix for the DIF moment conditions when the v_it are homoscedastic and not serially correlated, as is the case here. For the one-step GMM SYS estimator we use the commonly used initial weight matrix W_n = Σ_{i=1}^n Z_{si}' H Z_{si}, where H is a 2(T − 2) square matrix
\[
H = \begin{pmatrix} A & 0 \\ 0 & I_{T-2} \end{pmatrix},
\]
where I_{T−2} is the identity matrix of order T − 2.

The pattern of results for the 2SLS estimates is quite similar to that found for the t = 6 cross-section as reported in Table 1. The DIF 2SLS estimator displays somewhat larger relative biases, whereas the LEV 2SLS estimator has smaller relative biases than in the cross-section. SYS has smaller relative and absolute biases at vr = 1 and vr = 4, but the direction of the biases remains the same. Use of the efficient initial weight matrix reduces the bias of the one-step GMM DIF estimator significantly. This is because the comparison bias is then no longer the OLS bias in the first differenced model, but the bias of the within groups estimator, which is smaller. There is no clear pattern to the bias of the SYS one- and two-step GMM estimators in comparison to the 2SLS estimator.

Figure 5 displays the p-value plots of the Wald tests for testing H0 : α = 0.8 based on the DIF and SYS GMM estimation results when T = 6 for the various values of vr = ση²/σv², where the Wald tests based on
Table 3. Panel data estimation results.

                          DIF              LEV              SYS
              Coeff.     SD     Coeff.     SD     Coeff.     SD
vr = 1/4
  OLS        −0.100   0.033     0.938   0.011     0.824   0.018
  2SLS        0.581   0.162     0.812   0.056     0.779   0.074
  One-step    0.734   0.131       –       –       0.798   0.067
  Two-step    0.734   0.140     0.812   0.060     0.797   0.060
vr = 1
  OLS        −0.100   0.033     0.980   0.006     0.938   0.009
  2SLS        0.469   0.212     0.850   0.068     0.813   0.079
  One-step    0.672   0.181       –       –       0.830   0.073
  Two-step    0.664   0.201     0.844   0.042     0.818   0.068
vr = 4
  OLS        −0.100   0.033     0.995   0.003     0.983   0.004
  2SLS        0.401   0.240     0.924   0.069     0.889   0.075
  One-step    0.618   0.213       –       –       0.900   0.070
  Two-step    0.601   0.241     0.913   0.079     0.884   0.079

Note: Means and standard deviations (SD) of 10,000 estimates. vr = ση²/σv². n = 200. T = 6. α = 0.8.
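The one-step weight-matrix building blocks A and H described in the text can be constructed as follows (a sketch; T = 6 as in the simulations above):

```python
import numpy as np

T = 6
m = T - 2  # number of first-differenced equations

# A: 2's on the main diagonal, -1's on the first sub/superdiagonals
A = 2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)

# H: block-diagonal with A (DIF block) and the identity I_{T-2} (LEV block)
H = np.block([
    [A, np.zeros((m, m))],
    [np.zeros((m, m)), np.eye(m)],
])

print(A.shape, H.shape)  # (4, 4) (8, 8)
```

A is the covariance matrix (up to scale) of the first-differenced errors under homoscedastic, serially uncorrelated v_it, which is why it is the efficient one-step choice for the DIF moment conditions.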
Table 4. Bias and relative bias.

             DIF                 LEV                 SYS
vr      Bias    RelBias     Bias    RelBias     Bias    RelBias
1/4   −0.219     0.244     0.012     0.086    −0.021     0.887
1     −0.331     0.367     0.050     0.279     0.013     0.093
4     −0.399     0.443     0.124     0.637     0.089     0.488

Notes: Mean and relative bias from 10,000 estimates. RelBias = |ᾱ̂_2SLS − α|/|ᾱ̂_OLS − α|. T = 6. α = 0.8.
the two-step GMM results use the Windmeijer (2005) corrected variance estimates. The pattern of size properties is very similar to that for the cross-section analysis. The Wald test based on the SYS GMM estimation results has better size properties than that based on the DIF GMM estimation results when vr = 0.25, especially for the one-step SYS GMM estimator. The size behaviours are very similar when vr = 1, but the SYS Wald test's size properties are much worse than those of the DIF Wald tests when vr = 4.

As for the cross-sectional SYS estimator, we can start with the bias of the panel DIF OLS estimator in order to obtain a suggestion for a concentration parameter:
\[
\operatorname{plim}\left(\hat{\alpha}_d^{OLS}\right) - \alpha = \frac{-(T-2)\,\sigma_v^2}{\sum_{t=3}^T \pi_{dt}' Q_{dt} \pi_{dt} + \sum_{t=3}^T \sigma_{dt}^2},
\]
Figure 5. Wald test p-value plots. H0 : α = 0.8.
suggesting a concentration parameter defined as
\[
\mu_d = \frac{\sum_{t=3}^T \pi_{dt}' Z_{dt}' Z_{dt} \pi_{dt}}{\sum_{t=3}^T \sigma_{dt}^2}.
\]
For the 2SLS bias we get
\[
E\left(n^{1/2}(\hat{\alpha}_d - \alpha)\right) \approx E\left(\frac{\pi_d' z_{d,u}}{\pi_d' Q_d \pi_d}\right)
+ n^{-1/2} E\left(\frac{z_{d,d}' Q_d z_{d,u}}{\pi_d' Q_d \pi_d}\right)
- 2 n^{-1/2} E\left(\frac{\pi_d' z_{d,d}\,\pi_d' z_{d,u}}{\left(\pi_d' Q_d \pi_d\right)^2}\right),
\]
so that
\[
E(\hat{\alpha}_d) - \alpha \approx -\frac{1}{n}\,\frac{\left((T-1)(T-2)/2 - 2\right)\sigma_v^2}{\pi_d' Q_d \pi_d}
- \frac{2}{n}\, E\left(\frac{\sum_{t=3}^T \pi_{dt}' z_{d,dt} \sum_{j=t} \pi_{dj}' z_{d,uj}}{\left(\pi_d' Q_d \pi_d\right)^2}\right).
\]
As before for the SYS cross-sectional 2SLS estimator, the concentration parameter μ_d does not convey all the information concerning the relative bias of the 2SLS estimator, due to the additional covariance terms in the expansion. Equivalent results can be obtained for the panel
LEV and panel SYS 2SLS estimators. For the efficient one-step panel DIF GMM estimator similar expansions can be derived, but now for the model where the individual data are premultiplied by A^{−1/2} and the instruments by A^{1/2}.

5.1. Bias approximations for panel 2SLS estimators

Although the concept of the concentration parameter does not automatically extend to panels, it is possible to analyse the absolute and relative bias of panel estimators of α. We now consider panel IV estimators, i.e. exploiting the identity weight matrix in the definitions of α̂_d and α̂_l. Hence, the W_N matrix is of the simple form Z'Z. We analyse the DIF and LEV panel IV estimators using results from Alvarez and Arellano (2003) and Hayakawa (2008), respectively. In those studies probability limits of the DIF and LEV panel IV estimators have been derived assuming both T and n growing large with T/n → c, 0 ≤ c < ∞. Regarding the panel DIF 2SLS estimator, from theorem 4 of Alvarez and Arellano (2003) we have
\[
\operatorname{plim}(\hat{\alpha}_d - \alpha) = -\frac{\frac{c}{2}(1+\alpha)}{2 - (1+\alpha)(2-c)/2},
\]
while for the panel LEV 2SLS estimator, using theorem 3 of Hayakawa (2008), we have
\[
\operatorname{plim}(\hat{\alpha}_l - \alpha) = \frac{\frac{c}{2}\,\frac{\sigma_\eta^2}{\sigma_v^2}\,\frac{1}{1-\alpha}}{\frac{c}{2}\,\frac{\sigma_\eta^2}{\sigma_v^2}\left(\frac{1}{1-\alpha}\right)^2 + \frac{1}{1-\alpha^2}}.
\]
Hence, for both T and n large the panel IV estimators are inconsistent. Comparing these asymptotic 2SLS biases with the limiting biases of OLS (see Section 4.1 for analytical expressions) we find that for ση²/σv² = 1 the relative bias for DIF and LEV is equal and amounts to
\[
\frac{c}{\frac{c}{2}(1+\alpha) + 1 - \alpha}.
\]
Furthermore, the relative bias for LEV is larger than that for DIF when ση²/σv² > 1 and vice versa. Hence, these results for panel IV estimators mimic the cross-sectional results on relative bias as discussed in Section 4.

Panel 2SLS estimators can be expressed as a weighted average of period specific 2SLS estimators. This suggests that cross-section-based concentration parameters as derived in the previous section are also informative about absolute and relative 2SLS bias when exploiting the whole panel. This conjecture is correct, as we will now show. The above results of Alvarez and Arellano (2003) and Hayakawa (2008) can be interpreted as the 2SLS inconsistency under many instrument asymptotics. Hence, the bias of panel 2SLS estimators when the number of instruments is reasonably large can be approximated by
\[
E(\hat{\alpha}_d - \alpha) \approx \frac{E\left(\Delta y_{-1}' Z_d \left(Z_d' Z_d\right)^{-1} Z_d' \Delta u\right)}{E\left(\Delta y_{-1}' Z_d \left(Z_d' Z_d\right)^{-1} Z_d' \Delta y_{-1}\right)}, \qquad
E(\hat{\alpha}_l - \alpha) \approx \frac{E\left(y_{-1}' Z_l \left(Z_l' Z_l\right)^{-1} Z_l' u\right)}{E\left(y_{-1}' Z_l \left(Z_l' Z_l\right)^{-1} Z_l' y_{-1}\right)}.
\]
Table 5. Panel data estimation results.

                          DIF              LEV              SYS
              Coeff.     SD     Coeff.     SD     Coeff.     SD
vr = 1/4
  OLS        −0.100   0.019     0.938   0.007     0.824   0.014
  2SLS        0.426   0.069     0.828   0.024     0.730   0.041
  One-step    0.767   0.034       –       –       0.793   0.029
  Two-step    0.766   0.039     0.822   0.027     0.796   0.027
vr = 1
  OLS        −0.100   0.019     0.980   0.003     0.938   0.006
  2SLS        0.374   0.075     0.880   0.027     0.776   0.043
  One-step    0.757   0.040       –       –       0.819   0.031
  Two-step    0.754   0.046     0.866   0.032     0.816   0.030
vr = 4
  OLS        −0.100   0.019     0.995   0.001     0.983   0.002
  2SLS        0.355   0.078     0.946   0.023     0.868   0.039
  One-step    0.751   0.042       –       –       0.882   0.031
  Two-step    0.748   0.048     0.935   0.031     0.877   0.033

Notes: Means and standard deviations (SD) of 10,000 estimates. vr = ση²/σv². n = 200. T = 15. α = 0.8.
Table 6. Panel bias approximations.

                  DIF                 LEV                 SYS
T    vr      Bias    RelBias     Bias    RelBias     Bias    RelBias
6    1/4   −0.219     0.244     0.012     0.086    −0.021     0.887
           −0.227     0.252     0.023     0.164    −0.017     0.673
     1     −0.331     0.367     0.050     0.279     0.013     0.093
           −0.339     0.377     0.068     0.377     0.028     0.203
     4     −0.399     0.443     0.124     0.637     0.089     0.488
           −0.407     0.453     0.134     0.691     0.107     0.583
15   1/4   −0.374     0.416     0.028     0.200    −0.070     2.960
           −0.376     0.418     0.031     0.227    −0.069     2.813
     1     −0.426     0.473     0.080     0.445    −0.024     0.174
           −0.428     0.475     0.086     0.475    −0.020     0.145
     4     −0.445     0.495     0.146     0.752     0.068     0.370
           −0.447     0.497     0.150     0.770     0.075     0.409

Notes: Mean and relative bias from 10,000 estimates. RelBias = |ᾱ̂_2SLS − α|/|ᾱ̂_OLS − α|. Bias approximations in italics (second row of each pair). vr = ση²/σv². α = 0.8.
The above expressions are basically an evaluation of the expected value of the leading term (inconsistency) in an asymptotic expansion of the estimation error under many instruments. In the Appendix, we show that
\[
E(\hat{\alpha}_d - \alpha) \approx \frac{0.5(T-1)(T-2)\,\sigma_{u,d}}{\sum_{t=3}^T \sigma_{dt}^2\left(E(\mu_{dt}) + (t-2)\right)}, \qquad
E(\hat{\alpha}_l - \alpha) \approx \frac{0.5(T-1)(T-2)\,\sigma_{u,l}}{\sum_{t=3}^T \sigma_{lt}^2\left(E(\mu_{lt}) + (t-2)\right)}.
\]
Indeed, cross-section-specific concentration parameters appear in these bias approximations. Although no analytically tractable expression results, it is interesting that, regarding relative bias, numerically the same pattern as in the pure cross-section case emerges. In other words, the relative bias for panel DIF is larger than for panel LEV when ση² < σv² and vice versa. And when the variance ratio ση²/σv² is equal to 1, the relative biases for the estimators are equal.

Regarding the panel SYS 2SLS estimator we can proceed in a similar way and evaluate
\[
E(\hat{\alpha}_s - \alpha) \approx \frac{E\left(q_{-1}' Z_s \left(Z_s' Z_s\right)^{-1} Z_s' p\right)}{E\left(q_{-1}' Z_s \left(Z_s' Z_s\right)^{-1} Z_s' q_{-1}\right)}.
\]
In the Appendix, we show that
\[
E(\hat{\alpha}_s - \alpha) \approx \frac{0.5(T-1)(T-2)\,\sigma_{u,d} + (T-2)\,\sigma_{u,l}}{\sum_{t=3}^T \sigma_{dt}^2\left(E(\mu_{dt}) + (t-2)\right) + \sum_{t=3}^T \sigma_{lt}^2\left(E(\mu_{lt}) + 1\right)}.
\]
We expect the bias approximations of the panel IV estimators to work well when at least T is moderately large compared with n. Table 5 presents estimation results for the panel data Monte Carlo exercise when T = 15. Table 6 further presents the bias approximations, where we also include those for T = 6. As expected, we now find that the relative biases of the DIF and LEV estimators are virtually identical for T = 15. These results corroborate our large T theoretical findings, with reasonable approximations even when T = 6, especially for DIF.
6. CONCLUSIONS

We have shown that the concentration parameters in the reduced forms of the DIF and LEV cross-sectional models are the same in expectation when the variances of the unobserved heterogeneity (ση²) and the idiosyncratic errors (σv²) are the same in the covariance stationary AR(1) model. The LEV concentration parameter is smaller than the DIF one if ση² > σv², and it is larger if ση² < σv². Therefore, the well-understood weak instrument problem in the DIF model also applies to the LEV model, especially when ση² ≥ σv², with both concentration parameters decreasing in value with increasing persistence of the data series. The weak instrument problem does manifest itself in the magnitude of the bias of 2SLS relative to that of OLS, which we show to be equal for DIF and LEV when ση² = σv². The LEV 2SLS estimator nevertheless has a smaller finite sample bias, because the OLS bias of the LEV structural equation is smaller than that of DIF, especially when the series are persistent. The weak instrument problem further manifests itself in poor performance of the Wald tests, which we show to have the same size distortions in the DIF and LEV models when ση² = σv². Although our theoretical results do not apply automatically to
The weak instrument problem of the system GMM estimator
GMM-based inference (Kiviet, 2008), we show by simulation that these properties generalize to the system GMM estimator. Having established this potential weak instrument problem for the system GMM estimator, one should therefore consider using testing procedures for inference that are robust to the weak instruments problem. The Kleibergen (2005) Lagrange Multiplier test and his GMM extension of the Conditional Likelihood Ratio test of Moreira (2003) are possible candidates, as is the Stock and Wright (2000) GMM version of the Anderson–Rubin statistic. Newey and Windmeijer (2009) show that the behaviour of these test statistics is robust not only to weak instrument asymptotics, but also to many weak instrument asymptotics, where the number of instruments grows with the sample size but the model is bounded away from non-identification. Newey and Windmeijer (2009) also propose use of the continuously updated GMM estimator (CUE, Hansen et al., 1996) with a new variance estimator that is valid under many weak instrument asymptotics. They show that the Wald test using the CUE estimation results and their proposed variance estimator performs well in a static panel data model estimated in first differences. As the number of potential instruments in this panel data setting grows quite rapidly with the time dimension of the panel, this may be a sensible approach for the system moment conditions as well. As a final remark, the direction of the biases of the DIF (downward) and LEV (upward) GMM estimators in the AR(1) panel data model is quite specific to this model specification. In different models these biases may differ, and the SYS GMM estimator may have a larger absolute bias than the DIF GMM estimator.
For example, in the static panel data model
$$y_{it} = x_{it}\beta + \eta_i + v_{it}, \qquad x_{it} = \rho x_{i,t-1} + \gamma \eta_i + \delta v_{it} + w_{it},$$
the DIF GMM estimator may have a smaller finite sample bias than the SYS GMM estimator when the $x_{it}$ series are persistent but $|\delta|$ is small and $|\gamma|$ is large, as then the endogeneity problem and OLS bias in the DIF model may be less than that of the LEV model.
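This remark can be illustrated numerically. The sketch below simulates the static model with $\rho$ large, $|\delta|$ small and $|\gamma|$ large, and compares the regressor–error correlations that drive the OLS biases in the differenced and levels equations; the particular parameter values are our own illustrative choices.

```python
import numpy as np

# Static model: y_it = b*x_it + eta_i + v_it, x_it = rho*x_i,t-1 + g*eta_i + d*v_it + w_it
rng = np.random.default_rng(1)
n, T, rho, g, d = 20000, 8, 0.9, 1.0, 0.1

eta = rng.normal(size=n)
v = rng.normal(size=(n, T))
w = rng.normal(size=(n, T))
x = np.empty((n, T))
# approximately stationary initial condition
x[:, 0] = g * eta / (1 - rho) + (d * v[:, 0] + w[:, 0]) / np.sqrt(1 - rho**2)
for t in range(1, T):
    x[:, t] = rho * x[:, t - 1] + g * eta + d * v[:, t] + w[:, t]

u_lev = eta[:, None] + v                           # error in the levels equation
corr_lev = np.corrcoef(x.ravel(), u_lev.ravel())[0, 1]
corr_dif = np.corrcoef(np.diff(x).ravel(), np.diff(v).ravel())[0, 1]
print(abs(corr_dif), abs(corr_lev))                # DIF correlation is far smaller
```

Differencing removes the $\gamma \eta_i$ component of $x_{it}$, so the remaining endogeneity works only through the small $\delta$, while the levels correlation is dominated by the large $\gamma \eta_i/(1-\rho)$ term.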
ACKNOWLEDGMENTS

We are grateful for helpful comments by Steve Bond, Jan Kiviet, Jon Temple and two anonymous referees.
REFERENCES

Alvarez, J. and M. Arellano (2003). The time series and cross-section asymptotics of dynamic panel data estimators. Econometrica 71, 1121–59.
Arellano, M. and S. Bond (1991). Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. Review of Economic Studies 58, 277–98.
Arellano, M. and O. Bover (1995). Another look at the instrumental variable estimation of error-components models. Journal of Econometrics 68, 29–51.
Blundell, R. and S. Bond (1998). Initial conditions and moment restrictions in dynamic panel data models. Journal of Econometrics 87, 115–43.
Blundell, R. W., S. R. Bond and F. Windmeijer (2000). Estimation in dynamic panel data models: improving on the performance of the standard GMM estimator. In B. Baltagi (Ed.), Nonstationary Panels, Panel
M. J. G. Bun and F. Windmeijer
Cointegration, and Dynamic Panels, Advances in Econometrics, Volume 15, 53–91. New York: JAI Press, Elsevier Science.
Bond, S. R., A. Hoeffler and J. Temple (2001). GMM estimation of empirical growth models. Working paper, University of Oxford.
Bond, S. R. and F. Windmeijer (2005). Reliable inference for GMM estimators? Finite sample properties of alternative test procedures in linear panel data models. Econometric Reviews 24, 1–37.
Bound, J., D. A. Jaeger and R. M. Baker (1995). Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association 90, 443–50.
Bun, M. J. G. and J. F. Kiviet (2006). The effects of dynamic feedbacks on LS and MM estimator accuracy in panel data models. Journal of Econometrics 132, 409–44.
Buse, A. (1992). The bias of instrumental variable estimators. Econometrica 60, 173–80.
Griffith, R., R. Harrison and J. Van Reenen (2006). How special is the special relationship? Using the impact of U.S. R&D spillovers on U.K. firms as a test of technology sourcing. American Economic Review 96, 1859–75.
Hahn, J. and J. Hausman (2002). Note on bias in estimators for simultaneous equation models. Economics Letters 75, 237–41.
Hahn, J. and G. Kuersteiner (2002). Discontinuities of weak instrument limiting distributions. Economics Letters 75, 325–31.
Han, C. and P. C. B. Phillips (2006). GMM with many moment conditions. Econometrica 74, 147–92.
Hansen, L. P., J. Heaton and A. Yaron (1996). Finite-sample properties of some alternative GMM estimators. Journal of Business and Economic Statistics 14, 262–80.
Hayakawa, K. (2007). Small sample bias properties of the system GMM estimator in dynamic panel data models. Economics Letters 95, 32–38.
Hayakawa, K. (2008). The asymptotic properties of the system GMM estimator in dynamic panel data models when both N and T are large. Working paper, Hiroshima University.
Kiviet, J. F. (2007). Judging contending estimators by simulation: tournaments in dynamic panel data models. In G. D. A. Phillips and E. Tzavalis (Eds.), The Refinement of Econometric Estimation and Test Procedures, 282–318. Cambridge: Cambridge University Press.
Kiviet, J. F. (2008). Strength and weakness of instruments in IV and GMM estimation of dynamic panel data models. Working paper, University of Amsterdam.
Kleibergen, F. (2005). Testing parameters in GMM without assuming they are identified. Econometrica 73, 1103–23.
Kruiniger, H. (2009). GMM estimation and inference in dynamic panel data models with persistent data. Econometric Theory 25, 1348–91.
Levine, R., N. Loayza and T. Beck (2000). Financial intermediation and growth: causality and causes. Journal of Monetary Economics 46, 31–77.
Levinsohn, J. and A. Petrin (2003). Estimating production functions using inputs to control for unobservables. Review of Economic Studies 70, 317–41.
Moreira, M. (2003). A conditional likelihood ratio test for structural models. Econometrica 71, 1027–48.
Morimune, K. (1989). t test in a structural equation. Econometrica 57, 1341–60.
Nagar, A. L. (1959). The bias and moment matrix of the general k-class estimators of the parameters in simultaneous equations. Econometrica 27, 575–95.
Newey, W. K. and F. Windmeijer (2009). Generalized method of moments with many weak moment conditions. Econometrica 77, 687–719.
Picone, G. A., F. Sloan and J. G. Trogdon (2004). The effect of the tobacco settlement and smoking bans on alcohol consumption. Health Economics 13, 1063–80.
Ridder, G. and T. Wansbeek (1990). Dynamic models for panel data. In F. van der Ploeg (Ed.), Advanced Lectures in Quantitative Economics, 557–82. London: Academic Press.
Rothenberg, T. J. (1984). Approximating the distributions of econometric estimators and test statistics. In Z. Griliches and M. D. Intriligator (Eds.), Handbook of Econometrics, Volume 2, 881–935. Amsterdam: North-Holland.
Staiger, D. and J. H. Stock (1997). Instrumental variables regression with weak instruments. Econometrica 65, 557–86.
Stock, J. H. and J. H. Wright (2000). GMM with weak identification. Econometrica 68, 1055–96.
Stock, J. H., J. H. Wright and M. Yogo (2002). A survey of weak instruments and weak identification in generalized method of moments. Journal of Business and Economic Statistics 20, 518–29.
Stock, J. H. and M. Yogo (2005). Testing for weak instruments in linear IV regression. In D. W. K. Andrews and J. H. Stock (Eds.), Identification and Inference for Econometric Models, Essays in Honor of Thomas Rothenberg, 80–108. New York: Cambridge University Press.
Windmeijer, F. (2005). A finite sample correction for the variance of linear efficient two-step GMM estimators. Journal of Econometrics 126, 25–51.
APPENDIX

A.1. Concentration parameters in cross-section analysis

The model in first differences for the cross-section at time t is given by
$$\Delta y_{it} = \alpha \Delta y_{i,t-1} + u_{it}, \qquad \Delta y_{i,t-1} = y_i^{t-2\prime} \pi_{dt} + d_{i,t-1}.$$
For the general expression of the expected value of the concentration parameter divided by $n$ we get
$$E\left(\frac{1}{n}\mu_{dt}\right) = \frac{\pi_{dt}' E\left(y_i^{t-2} y_i^{t-2\prime}\right) \pi_{dt}}{\sigma_{dt}^2},$$
but as
$$\pi_{dt} = \left[E\left(y_i^{t-2} y_i^{t-2\prime}\right)\right]^{-1} E\left(y_i^{t-2} \Delta y_{i,t-1}\right) \quad \text{and} \quad \sigma_{dt}^2 = E\left[\left(\Delta y_{i,t-1} - y_i^{t-2\prime}\pi_{dt}\right)^2\right],$$
we get
$$E\left(\frac{1}{n}\mu_{dt}\right) = \frac{E\left(\Delta y_{i,t-1}\, y_i^{t-2\prime}\right) \left[E\left(y_i^{t-2} y_i^{t-2\prime}\right)\right]^{-1} E\left(y_i^{t-2} \Delta y_{i,t-1}\right)}{E\left(\Delta y_{i,t-1}^2\right) - E\left(\Delta y_{i,t-1}\, y_i^{t-2\prime}\right) \left[E\left(y_i^{t-2} y_i^{t-2\prime}\right)\right]^{-1} E\left(y_i^{t-2} \Delta y_{i,t-1}\right)}.$$
Under covariance stationarity
$$E\left(y_i^{t-2} y_i^{t-2\prime}\right) = \frac{\sigma_\eta^2}{(1-\alpha)^2}\, \iota_{t-2}\iota_{t-2}' + \frac{\sigma_v^2}{1-\alpha^2}\, G_{t-2},$$
where
$$G_{t-2} = \begin{bmatrix} 1 & \alpha & \cdots & \alpha^{t-3} \\ \alpha & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \alpha \\ \alpha^{t-3} & \cdots & \alpha & 1 \end{bmatrix}.$$
The inverse of $E(y_i^{t-2} y_i^{t-2\prime})$ is given by (see e.g. Ridder and Wansbeek, 1990)
$$\left[E\left(y_i^{t-2} y_i^{t-2\prime}\right)\right]^{-1} = \frac{1}{\sigma_v^2}\left[R_{t-2}' R_{t-2} - \frac{\sigma_\eta^2}{\sigma_v^2 + \sigma_\eta^2\left(t-3+\frac{1+\alpha}{1-\alpha}\right)}\, h_{t-2} h_{t-2}'\right],$$
where
$$R_{t-2} = \begin{bmatrix} 1 & -\alpha & & 0 \\ & \ddots & \ddots & \\ & & 1 & -\alpha \\ 0 & & & \sqrt{1-\alpha^2} \end{bmatrix}, \qquad h_{t-2} = (1-\alpha)\,\iota_{t-2} + \alpha\left(e_1 + e_{t-2}\right),$$
and $e_j$ is the $j$th unit vector of order $t-2$. We further have that
$$E\left(y_i^{t-2}\, \Delta y_{i,t-1}\right) = -\frac{\sigma_v^2}{1+\alpha}\, g_{t-2}, \qquad g_{t-2} = \left(\alpha^{t-3}, \ldots, \alpha, 1\right)'.$$
As
$$R_{t-2}\, g_{t-2} = \sqrt{1-\alpha^2}\; e_{t-2} \qquad \text{and} \qquad h_{t-2}'\, g_{t-2} = 1+\alpha,$$
we get
$$E\left(\Delta y_{i,t-1}\, y_i^{t-2\prime}\right) \left[E\left(y_i^{t-2} y_i^{t-2\prime}\right)\right]^{-1} E\left(y_i^{t-2}\, \Delta y_{i,t-1}\right) = \frac{\sigma_v^2}{(1+\alpha)^2}\left[1-\alpha^2 - \frac{\sigma_\eta^2\,(1+\alpha)^2}{\sigma_v^2 + \sigma_\eta^2\left(t-3+\frac{1+\alpha}{1-\alpha}\right)}\right].$$
Further
$$E\left(\Delta y_{i,t-1}^2\right) = \frac{2\sigma_v^2}{1+\alpha}.$$
Combining these results gives
$$E\left(\frac{1}{n}\mu_{dt}\right) = \frac{1-\alpha^2 - \frac{\sigma_\eta^2(1+\alpha)^2}{\sigma_v^2 + \sigma_\eta^2\left(t-3+\frac{1+\alpha}{1-\alpha}\right)}}{2(1+\alpha) - \left[1-\alpha^2 - \frac{\sigma_\eta^2(1+\alpha)^2}{\sigma_v^2 + \sigma_\eta^2\left(t-3+\frac{1+\alpha}{1-\alpha}\right)}\right]} = \frac{\left(1-\alpha^2\right)\left[\sigma_v^2 + \sigma_\eta^2\left(t-3+\frac{1+\alpha}{1-\alpha}\right)\right] - \sigma_\eta^2(1+\alpha)^2}{(1+\alpha)^2\left[\sigma_v^2 + \sigma_\eta^2\left(t-3+\frac{1+\alpha}{1-\alpha}\right)\right] + \sigma_\eta^2(1+\alpha)^2}$$
$$= \frac{(1-\alpha)\left[\sigma_v^2 + (t-3)\,\sigma_\eta^2\right]}{(1+\alpha)\left[\sigma_v^2 + \sigma_\eta^2\left(t-2+\frac{1+\alpha}{1-\alpha}\right)\right]} = \frac{(1-\alpha)^2\left[\sigma_v^2 + (t-3)\,\sigma_\eta^2\right]}{\left(1-\alpha^2\right)\sigma_v^2 + \left((t-1)-(t-3)\alpha\right)(1+\alpha)\,\sigma_\eta^2}.$$
For the model in levels we have for the cross-section at time t
$$y_{it} = \alpha y_{i,t-1} + \eta_i + v_{it}, \qquad y_{i,t-1} = \Delta y_i^{t-1\prime} \pi_{lt} + l_{i,t-1},$$
and the expected concentration parameter is given by
$$E\left(\frac{1}{n}\mu_{lt}\right) = \frac{E\left(y_{i,t-1}\, \Delta y_i^{t-1\prime}\right) \left[E\left(\Delta y_i^{t-1} \Delta y_i^{t-1\prime}\right)\right]^{-1} E\left(\Delta y_i^{t-1}\, y_{i,t-1}\right)}{E\left(y_{i,t-1}^2\right) - E\left(y_{i,t-1}\, \Delta y_i^{t-1\prime}\right) \left[E\left(\Delta y_i^{t-1} \Delta y_i^{t-1\prime}\right)\right]^{-1} E\left(\Delta y_i^{t-1}\, y_{i,t-1}\right)}.$$
Again, under covariance stationarity, we have that
$$E\left(\Delta y_i^{t-1} \Delta y_i^{t-1\prime}\right) = \frac{\sigma_v^2}{1+\alpha} \begin{bmatrix} 2 & \alpha-1 & \alpha(\alpha-1) & \cdots & \alpha^{t-4}(\alpha-1) \\ \alpha-1 & 2 & \alpha-1 & \cdots & \alpha^{t-5}(\alpha-1) \\ \alpha(\alpha-1) & \alpha-1 & 2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & \alpha-1 \\ \alpha^{t-4}(\alpha-1) & \cdots & \alpha(\alpha-1) & \alpha-1 & 2 \end{bmatrix}$$
and
$$E\left(\Delta y_i^{t-1}\, y_{i,t-1}\right) = \frac{\sigma_v^2}{1+\alpha}\left(\alpha^{t-3}, \ldots, \alpha, 1\right)'.$$
It then follows that
$$E\left(y_{i,t-1}\, \Delta y_i^{t-1\prime}\right) \left[E\left(\Delta y_i^{t-1} \Delta y_i^{t-1\prime}\right)\right]^{-1} E\left(\Delta y_i^{t-1}\, y_{i,t-1}\right) = \frac{(t-2)\,\sigma_v^2}{(1+\alpha)\left((t-1)-(t-3)\alpha\right)}.$$
As
$$E\left(y_{i,t-1}^2\right) = \frac{\sigma_\eta^2}{(1-\alpha)^2} + \frac{\sigma_v^2}{1-\alpha^2},$$
we get that
$$E\left(\frac{1}{n}\mu_{lt}\right) = \frac{\frac{(t-2)\sigma_v^2}{(1+\alpha)((t-1)-(t-3)\alpha)}}{\frac{\sigma_\eta^2}{(1-\alpha)^2} + \frac{\sigma_v^2}{1-\alpha^2} - \frac{(t-2)\sigma_v^2}{(1+\alpha)((t-1)-(t-3)\alpha)}} = \frac{(t-2)(1-\alpha)^2\,\sigma_v^2}{\left((t-1)-(t-3)\alpha\right)\left((1+\alpha)\sigma_\eta^2 + (1-\alpha)\sigma_v^2\right) - (t-2)(1-\alpha)^2\sigma_v^2}$$
$$= \frac{(t-2)(1-\alpha)^2\,\sigma_v^2}{\left(1-\alpha^2\right)\sigma_v^2 + \left((t-1)-(t-3)\alpha\right)(1+\alpha)\,\sigma_\eta^2}.$$
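The closed-form expressions for $E(\mu_{dt}/n)$ and $E(\mu_{lt}/n)$ derived in this part of the appendix share the same denominator, so the comparison stated in the conclusions can be verified directly. The following sketch uses exact rational arithmetic; the function names are our own.

```python
from fractions import Fraction as F

def mu_dif(a, s2_eta, s2_v, t):
    """Closed-form E(mu_dt / n) under covariance stationarity."""
    num = (1 - a)**2 * (s2_v + (t - 3) * s2_eta)
    den = (1 - a**2) * s2_v + ((t - 1) - (t - 3) * a) * (1 + a) * s2_eta
    return num / den

def mu_lev(a, s2_eta, s2_v, t):
    """Closed-form E(mu_lt / n) under covariance stationarity."""
    num = (t - 2) * (1 - a)**2 * s2_v
    den = (1 - a**2) * s2_v + ((t - 1) - (t - 3) * a) * (1 + a) * s2_eta
    return num / den

a, t = F(4, 5), 5
print(mu_dif(a, F(1), F(1), t) == mu_lev(a, F(1), F(1), t))   # True: equal when s2_eta = s2_v
print(mu_lev(a, F(2), F(1), t) < mu_dif(a, F(2), F(1), t))    # True: LEV smaller when s2_eta > s2_v
```

Since the numerators differ by $(1-\alpha)^2(t-3)(\sigma_\eta^2 - \sigma_v^2)$, the DIF and LEV expected concentration parameters coincide exactly when $\sigma_\eta^2 = \sigma_v^2$, as the text emphasizes.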
A.2. Mean stationarity only

We now relax the assumption of covariance stationarity, while maintaining mean stationarity, i.e. we specify the initial condition as
$$y_{i1} = \frac{\eta_i}{1-\alpha} + \varepsilon_i,$$
with $E\left(\varepsilon_i^2\right) = \sigma_\varepsilon^2$. For $t = 3$, we get in this case
$$\pi_{d3} = \frac{E\left(y_{i1}\, \Delta y_{i2}\right)}{E\left(y_{i1}^2\right)} = -\frac{(1-\alpha)\,\sigma_\varepsilon^2}{\frac{\sigma_\eta^2}{(1-\alpha)^2} + \sigma_\varepsilon^2} = -\frac{(1-\alpha)\,\sigma_\varepsilon^2}{\sigma_{y_1}^2},$$
$$\sigma_{d3}^2 = E\left(\Delta y_{i2}^2\right) - 2\pi_{d3}\, E\left(y_{i1}\,\Delta y_{i2}\right) + \pi_{d3}^2\, E\left(y_{i1}^2\right) = \sigma_v^2 + (1-\alpha)^2\sigma_\varepsilon^2 + \pi_{d3}\,(1-\alpha)\,\sigma_\varepsilon^2,$$
and
$$\mu_{d3} = \frac{\pi_{d3}^2\; y_1' y_1}{\sigma_{d3}^2},$$
so that
$$E\left(\frac{1}{n}\mu_{d3}\right) = \frac{\frac{\left((1-\alpha)\sigma_\varepsilon^2\right)^2}{\sigma_{y_1}^2}}{\sigma_v^2 + (1-\alpha)^2\sigma_\varepsilon^2 - \frac{\left((1-\alpha)\sigma_\varepsilon^2\right)^2}{\sigma_{y_1}^2}}.$$
For the levels model we get
$$\pi_{l3} = \frac{E\left(\Delta y_{i2}\, y_{i2}\right)}{E\left(\Delta y_{i2}^2\right)} = \frac{\sigma_v^2 - \alpha(1-\alpha)\,\sigma_\varepsilon^2}{\sigma_v^2 + (1-\alpha)^2\sigma_\varepsilon^2}$$
and
$$\sigma_{l3}^2 = E\left(y_{i2}^2\right) - \pi_{l3}\, E\left(\Delta y_{i2}\, y_{i2}\right) = \frac{\sigma_\eta^2}{(1-\alpha)^2} + \sigma_v^2 + \alpha^2\sigma_\varepsilon^2 - \frac{\left(\sigma_v^2 - \alpha(1-\alpha)\sigma_\varepsilon^2\right)^2}{\sigma_v^2 + (1-\alpha)^2\sigma_\varepsilon^2}.$$
The concentration parameter is therefore given by
$$\mu_{l3} = \frac{\pi_{l3}^2\; \Delta y_2' \Delta y_2}{\sigma_{l3}^2},$$
and so
$$E\left(\frac{1}{n}\mu_{l3}\right) = \frac{\frac{\left(\sigma_v^2 - \alpha(1-\alpha)\sigma_\varepsilon^2\right)^2}{\sigma_v^2 + (1-\alpha)^2\sigma_\varepsilon^2}}{\frac{\sigma_\eta^2}{(1-\alpha)^2} + \sigma_v^2 + \alpha^2\sigma_\varepsilon^2 - \frac{\left(\sigma_v^2 - \alpha(1-\alpha)\sigma_\varepsilon^2\right)^2}{\sigma_v^2 + (1-\alpha)^2\sigma_\varepsilon^2}}.$$
Calculating these expectations shows that $E\left(\frac{1}{n}\mu_{l3}\right) > E\left(\frac{1}{n}\mu_{d3}\right)$ if $\sigma_\varepsilon^2 < \frac{\sigma_v^2}{1-\alpha^2}$, i.e. the expected concentration parameter in the levels model is larger than that of the differenced model if the variance of the initial condition is smaller than the covariance stationary level, and vice versa.
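The $t = 3$ threshold result can be checked numerically from the two closed forms, again in exact rational arithmetic; the function names are our own.

```python
from fractions import Fraction as F

def mu_d3(a, s2_eta, s2_v, s2_eps):
    """E(mu_d3 / n) under mean stationarity only."""
    s2_y1 = s2_eta / (1 - a)**2 + s2_eps
    q = ((1 - a) * s2_eps)**2 / s2_y1
    return q / (s2_v + (1 - a)**2 * s2_eps - q)

def mu_l3(a, s2_eta, s2_v, s2_eps):
    """E(mu_l3 / n) under mean stationarity only."""
    pi = (s2_v - a * (1 - a) * s2_eps) / (s2_v + (1 - a)**2 * s2_eps)
    q = pi**2 * (s2_v + (1 - a)**2 * s2_eps)
    return q / (s2_eta / (1 - a)**2 + s2_v + a**2 * s2_eps - q)

a, s2_eta, s2_v = F(1, 2), F(1), F(1)
star = s2_v / (1 - a**2)                            # covariance stationary value of s2_eps
print(mu_l3(a, s2_eta, s2_v, F(1)) > mu_d3(a, s2_eta, s2_v, F(1)))    # True: below the threshold
print(mu_l3(a, s2_eta, s2_v, F(3)) < mu_d3(a, s2_eta, s2_v, F(3)))    # True: above the threshold
print(mu_l3(a, s2_eta, s2_v, star) == mu_d3(a, s2_eta, s2_v, star))   # True: equal at the boundary
```

At the boundary value $\sigma_\varepsilon^2 = \sigma_v^2/(1-\alpha^2)$ the model is covariance stationary, and both expressions also agree with the $t = 3$ formulas of Section A.1.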
A.3. Bias approximations for panel 2SLS estimators

We first evaluate the bias approximation for the panel DIF estimator. Note that due to the block-diagonal structure of the $Z_{di}$ instrument matrix we have
$$\left(Z_d' Z_d\right)^{-1} = \mathrm{diag}\left(\left(Z_{d3}' Z_{d3}\right)^{-1}, \ldots, \left(Z_{dT}' Z_{dT}\right)^{-1}\right),$$
where the $n \times (t-2)$ matrix $Z_{dt}$ is $y^{t-2} = \left(y_1^{t-2}, \ldots, y_n^{t-2}\right)'$. Hence, we can write
$$\Delta y_{-1}' Z_d\left(Z_d' Z_d\right)^{-1} Z_d' \Delta y_{-1} = \sum_{t=3}^T \Delta y_{t-1}' Z_{dt}\left(Z_{dt}' Z_{dt}\right)^{-1} Z_{dt}' \Delta y_{t-1},$$
$$\Delta y_{-1}' Z_d\left(Z_d' Z_d\right)^{-1} Z_d' u = \sum_{t=3}^T \Delta y_{t-1}' Z_{dt}\left(Z_{dt}' Z_{dt}\right)^{-1} Z_{dt}' u_t.$$
Exploiting $\Delta y_{t-1} = Z_{dt}\pi_{dt} + d_{t-1}$ and defining $P_{dt} = Z_{dt}\left(Z_{dt}' Z_{dt}\right)^{-1} Z_{dt}'$ we have
$$E\left(\Delta y_{t-1}' P_{dt}\, \Delta y_{t-1}\right) = \pi_{dt}' E\left(Z_{dt}' Z_{dt}\right)\pi_{dt} + E\left(d_{t-1}' P_{dt}\, d_{t-1}\right) = \sigma_{dt}^2\left(E\left(\mu_{dt}\right) + (t-2)\right).$$
The expectation of the numerator of the estimation error is
$$E\left(\Delta y_{t-1}' P_{dt}\, u_t\right) = \sigma_{u,d}\,(t-2).$$
Combining results we have
$$E\left(\hat\alpha_d - \alpha\right) \approx \frac{\sum_{t=3}^T \sigma_{u,d}\,(t-2)}{\sum_{t=3}^T \sigma_{dt}^2\left(E(\mu_{dt}) + (t-2)\right)} = \frac{0.5(T-1)(T-2)\,\sigma_{u,d}}{\sum_{t=3}^T \sigma_{dt}^2\left(E(\mu_{dt}) + (t-2)\right)}.$$
The bias approximation for the panel LEV estimator can be derived in the same way. Regarding the SYS estimator we can write
$$q_{-1}' Z_s\left(Z_s' Z_s\right)^{-1} Z_s' q_{-1} = \sum_{t=3}^T \Delta y_{t-1}' P_{dt}\, \Delta y_{t-1} + \sum_{t=3}^T y_{t-1}' P_{lt}\, y_{t-1},$$
$$q_{-1}' Z_s\left(Z_s' Z_s\right)^{-1} Z_s' p = \sum_{t=3}^T \Delta y_{t-1}' P_{dt}\, u_t + \sum_{t=3}^T y_{t-1}' P_{lt}\, u_t.$$
It should be noted that only the non-redundant LEV moment conditions have been used in system estimation. In other words, $Z_{li}$ and, hence, $Z_{lt}$ in system estimation are defined as
$$Z_{li} = \begin{bmatrix} \Delta y_{i2} & 0 & \cdots & 0 \\ 0 & \Delta y_{i3} & \cdots & 0 \\ & & \ddots & \\ 0 & 0 & \cdots & \Delta y_{i,T-1} \end{bmatrix}, \qquad Z_{lt} = \begin{pmatrix} \Delta y_{1,t-1} \\ \Delta y_{2,t-1} \\ \vdots \\ \Delta y_{n,t-1} \end{pmatrix},$$
hence we exploit one instrument per period only. As a result we have, with $P_{lt} = Z_{lt}\left(Z_{lt}' Z_{lt}\right)^{-1} Z_{lt}'$,
$$E\left(y_{t-1}' P_{lt}\, y_{t-1}\right) = \pi_{lt}' E\left(Z_{lt}' Z_{lt}\right)\pi_{lt} + E\left(l_{t-1}' P_{lt}\, l_{t-1}\right) = \sigma_{lt}^2\left(E\left(\mu_{lt}\right) + 1\right)$$
and
$$E\left(y_{t-1}' P_{lt}\, u_t\right) = \sigma_{u,l}.$$
Combining results we find
$$E\left(\hat\alpha_s - \alpha\right) \approx \frac{\sum_{t=3}^T E\left(\Delta y_{t-1}' P_{dt}\, u_t\right) + \sum_{t=3}^T E\left(y_{t-1}' P_{lt}\, u_t\right)}{\sum_{t=3}^T E\left(\Delta y_{t-1}' P_{dt}\, \Delta y_{t-1}\right) + \sum_{t=3}^T E\left(y_{t-1}' P_{lt}\, y_{t-1}\right)} = \frac{0.5(T-1)(T-2)\,\sigma_{u,d} + (T-2)\,\sigma_{u,l}}{\sum_{t=3}^T \sigma_{dt}^2\left(E(\mu_{dt}) + (t-2)\right) + \sum_{t=3}^T \sigma_{lt}^2\left(E(\mu_{lt}) + 1\right)}.$$
The Econometrics Journal (2010), volume 13, pp. 127–144. doi: 10.1111/j.1368-423X.2009.00303.x
Estimation of a transformation model with truncation, interval observation and time-varying covariates

BO E. HONORÉ† AND LUOJIA HU‡

†Department of Economics, Princeton University, Princeton, NJ 08544-1021, USA
E-mail: [email protected]

‡Economic Research Department, Federal Reserve Bank of Chicago, 230 S. La Salle Street, Chicago, IL 60604, USA
E-mail: [email protected]

First version received: November 2008; final version accepted: October 2009
Summary Abrevaya (1999b) considered estimation of a transformation model in the presence of left truncation. This paper observes that a cross-sectional version of the statistical model considered in Frederiksen et al. (2007) is a generalization of the model considered by Abrevaya (1999b) and the generalized model can be estimated by a pairwise comparison version of one of the estimators in Frederiksen et al. (2007). Specifically, our generalization will allow for discretized observations of the dependent variable and for piecewise constant time-varying explanatory variables. Keywords: Censoring, Time-varying covariates, Transformation models, Truncation.
1. INTRODUCTION

The transformation model
$$h\left(T_i^*\right) = g\left(X_i'\beta\right) + \varepsilon_i \qquad (1.1)$$
is often used to model durations. In models like this, it is important to allow for right censoring and sometimes also for left truncation, because the samples used in many applications include spells that are in progress at the start of the sample period. See Abrevaya (1999b). It is also sometimes desirable to allow the dependent variable to be discretized, so that one observes only whether it falls in a particular interval; the observed duration, $T_i$, would then be $t$ if $T_i^* \in (t-1, t]$. See Prentice and Gloeckler (1978) or Meyer (1990). Moreover, in duration models it is often interesting to allow for time-varying covariates, which are not easily incorporated directly into the transformation model (Flinn and Heckman, 1982). The contribution of this paper is to specify a statistical model that allows for interval observations and time-varying covariates but which simplifies to a model with interval observations from (1.1) when the covariates are time invariant. We then propose an estimator for the parameters of the model. The estimator can be interpreted as a generalization of the truncated maximum rank correlation estimator proposed in Abrevaya (1999b).
Consider first the transformation model, (1.1), with strictly increasing $h(\cdot)$ and $g(\cdot)$ and with $\varepsilon_i$ independent of $X_i$ and continuously distributed with full support. In this model,
$$P\left(T_i^* > t \mid X_i\right) = P\left(h\left(T_i^*\right) > h(t) \mid X_i\right) = P\left(g\left(X_i'\beta\right) + \varepsilon_i > h(t) \mid X_i\right) = 1 - F\left(h(t) - g\left(X_i'\beta\right)\right),$$
where $F$ is the CDF for $\varepsilon_i$. This gives
$$P\left(T_i^* > t \mid X_i,\, T_i^* > t-1\right) = \frac{1 - F\left(h(t) - g\left(X_i'\beta\right)\right)}{1 - F\left(h(t-1) - g\left(X_i'\beta\right)\right)},$$
where the assumption that $\varepsilon_i$ has full support guarantees that the denominator is not 0. When $1 - F(\cdot)$ is log-concave (which is implied by the density of $\varepsilon_i$ being log-concave; see Heckman and Honoré, 1990), the right-hand side is an increasing function of $g(X_i'\beta)$ and hence of $X_i'\beta$. See the Appendix. This means that one can write the event $\left\{T_i^* > t \mid X_i,\, T_i^* > t-1\right\}$ in the form $1\left\{X_i'\beta > \eta_{it}\right\}$ for some (possibly infinite) random variable $\eta_{it}$ that is independent of $X_i$ and has CDF $\frac{1 - F(h(t) - \cdot)}{1 - F(h(t-1) - \cdot)}$. Therefore, if we define
$$Y_{it} \equiv 1\left\{T_i^* \in (t-1, t]\right\} = 1\left\{T_i = t\right\},$$
then we can write
$$Y_{it} = 1\left\{X_i'\beta - \eta_{it} \leq 0\right\} \quad \text{for } t \text{ such that } \sum_{l \leq t-1} Y_{il} = 0. \qquad (1.2)$$
In other words, a transformation model with discretized observations of the dependent variable and log-concave errors is a special case of the model
$$Y_{it} = 1\left\{X_{it}'\beta - \eta_{it} \leq 0\right\} \quad \text{for } t \text{ such that } \sum_{l \leq t-1} Y_{il} = 0, \qquad (1.3)$$
where the difference between (1.2) and (1.3) is that the latter allows for time-varying covariates. Note that this line of argument is valid even if $T_i$ is left truncated. It is interesting to note that Abrevaya (1999b) also assumes log-concavity of $1 - F(\cdot)$.¹ Note that although log-concavity implies an increasing hazard for $h(T^*)$, it does not impose such a restriction on $T^*$.² Equation (1.3) is a cross-sectional version of the model considered by Frederiksen et al. (2007). It is well understood that estimators of panel data models can be turned into estimators of cross-sectional models by considering all pairs of observations as units in a panel. See, for example, Honoré and Powell (1994), who apply the idea in Honoré (1992) for panel data censored regression models to all pairs of observations in a cross-section. The insights in Frederiksen et al. (2007) can therefore be used to construct an estimator of $\beta$. We pursue this in Section 3 after discussing the model in more detail in Section 2.

¹ The assumption of log-concavity also appears elsewhere in the literature on truncated regression models. See, for example, Honoré (1992) and Lee (1993).
² Let $\lambda(\cdot)$ denote the hazard for $\varepsilon$. The hazard for $h(T^*)$ is then $-\frac{\partial \log P\left(h(T^*) > t \mid X\right)}{\partial t} = -\frac{\partial \log\left(1 - F\left(t - g(X'\beta)\right)\right)}{\partial t} = \lambda\left(t - g(X'\beta)\right)$. When $1 - F(\cdot)$ is log-concave this is an increasing function of $t$. On the other hand, the hazard for $T^*$ is $-\frac{\partial \log P\left(T^* > t \mid X\right)}{\partial t} = -\frac{\partial \log P\left(h(T^*) > h(t) \mid X\right)}{\partial t} = \lambda\left(h(t) - g(X'\beta)\right) h'(t)$. The derivative of this with respect to $t$ is $\lambda'\left(h(t) - g(X'\beta)\right) h'(t)^2 + \lambda\left(h(t) - g(X'\beta)\right) h''(t)$, which can be of either sign.
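The key monotonicity property behind this representation is easy to verify numerically for a concrete log-concave case. The sketch below uses a standard normal $F$ and $h(t) = \log(t)$ (the choices also used in the paper's Design 2 later on); the grid is our own.

```python
import math

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def r(z, t):
    """Conditional survival probability (1 - F(h(t)-z)) / (1 - F(h(t-1)-z))."""
    return (1.0 - Phi(math.log(t) - z)) / (1.0 - Phi(math.log(t - 1) - z))

zs = [-3.0 + 0.05 * k for k in range(121)]     # grid of index values z = g(X'b)
vals = [r(z, t=3) for z in zs]
print(all(b > a for a, b in zip(vals, vals[1:])))   # True: the ratio increases in z
```

Since $1 - \Phi$ is log-concave, the ratio is strictly increasing in the index over the whole grid, confirming that it is a valid CDF for the $\eta_{it}$ representation.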
2. THE MODEL

Consider a spell with integer-valued duration, $T_i$, that starts at (integer-valued) time $-V_i \leq 0$. Following the discussion above, we model the event that the spell lasts $s$ periods, conditional on lasting at least $s-1$, by the qualitative response model
$$Y_{is} = 1\left\{X_{is}'\beta - \eta_{is} \leq 0\right\}, \qquad (2.1)$$
where $\eta_{is}$ is independent of $X_{is}$ and the distribution of $\eta_{is}$ is allowed to change over time. When there is left truncation, one must distinguish between duration time and calendar time. We will index the observables, $Y$ and $X$, by calendar time, and the unobservable $\eta$ by duration time. At first sight, this difference is confusing, but it is necessitated by the fact that the discussion in the previous section implied that one should allow the distribution of $\eta_{is}$ to vary by duration time. On the other hand, it seems natural to denote the first observation for an individual by $t = 1$. With this notation, we assume that we observe $(Y_{it}, X_{it})$ starting at $t = 1$, where
$$Y_{it} = 1\left\{X_{it}'\beta - \eta_{i,t+V_i} \leq 0\right\}. \qquad (2.2)$$
With this notation, $\eta$'s with the same time subscript will have the same distribution under the class of models discussed above. We will let $T_i$ denote the first time that $Y_{it}$ equals 1. Since $Y_{it}$ is not defined after the end of the spell, and since we want to allow for random right censoring, we assume that we observe $Y_{it}$ from $t = 1$ until, and including, $T_i$ or until a random censoring time $C_i - 1$ (whichever comes first). In other words, we observe $(Y_{it}, X_{it})$ for $t = 1, 2, \ldots, \bar T_i$, where $\bar T_i = \min\{T_i, C_i - 1\}$. So when an observation is censored, $C_i$ will be the first time period in which individual $i$ is not observed. We also assume that we observe the presample duration, $V_i$, for each observation. The statistical assumption on the errors in (2.1) is that conditional on $V_i$ and on $\{Y_{is} = 0$ for $s < t\}$, $\eta_{i,t+V_i}$ is independent of $(C_i, \{X_{is}\}_{s \leq t})$. As explained in Section 1, if the errors are log-concave and the covariates are time-invariant, this is exactly what is implied by an underlying transformation model for $T_i^*$, where we observe whether a spell that started at time $-V_i$ and was in progress at time $t-1$ is still in progress at time $t$. Note that when the covariates are time-varying they are not restricted to be strictly exogenous, and that the censoring times can be covariate-dependent, as long as they do not depend on the $\eta$'s. In the next section, we will apply the insight of Frederiksen et al. (2007) to construct an estimator for $\beta$ under these assumptions when the researcher has access to a random sample of individuals.
3. THE ESTIMATOR

The key insight for the construction of the estimator is most easily illustrated if we first ignore censoring (so $\bar T_i = T_i$ for all $i$). Let $t_1$ and $t_2$ be arbitrary. Consider the two events
$$A = \left\{T_i = t_1,\, T_j > t_2\right\} \quad \text{and} \quad B = \left\{T_i > t_1,\, T_j = t_2\right\},$$
where $t_1 + V_i = t_2 + V_j$. Under the stated assumptions, it then follows immediately
from Lemma 1 of Frederiksen et al. (2007) that
$$P\left(A \mid A \cup B,\, X_{it_1}, X_{jt_2}, V_i, V_j\right) \begin{cases} > \tfrac{1}{2}, & \text{if } \left(X_{it_1} - X_{jt_2}\right)'\beta > 0, \\ = \tfrac{1}{2}, & \text{if } \left(X_{it_1} - X_{jt_2}\right)'\beta = 0, \\ < \tfrac{1}{2}, & \text{if } \left(X_{it_1} - X_{jt_2}\right)'\beta < 0. \end{cases}$$
This suggests estimating $\beta$ by maximizing
$$\sum_{i<j} \sum_{t_1=1}^{T_i} \sum_{t_2=1}^{T_j} 1\left\{t_1 + V_i = t_2 + V_j\right\} \cdot \left[1\left\{T_i = t_1,\, T_j > t_2\right\} \cdot 1\left\{\left(X_{it_1} - X_{jt_2}\right)'b > 0\right\} + 1\left\{T_i > t_1,\, T_j = t_2\right\} \cdot 1\left\{\left(X_{it_1} - X_{jt_2}\right)'b < 0\right\}\right]. \qquad (3.1)$$
Equation (3.1) is the same as one of the objective functions in Frederiksen et al. (2007), except that that paper considers a panel data situation. It is convenient to rewrite (3.1) as
$$\sum_{i<j} 1\left\{T_j + V_j > T_i + V_i > V_j\right\} \cdot 1\left\{\left(X_{iT_i} - X_{j,T_i+V_i-V_j}\right)'b > 0\right\} + 1\left\{V_i < T_j + V_j < T_i + V_i\right\} \cdot 1\left\{\left(X_{i,T_j+V_j-V_i} - X_{jT_j}\right)'b < 0\right\}. \qquad (3.2)$$
This has the same structure as the maximum rank correlation estimator developed in Han (1987). When there is censoring, (3.1) can be modified to
$$\sum_{i<j} \sum_{t_1=1}^{\bar T_i} \sum_{t_2=1}^{\bar T_j} 1\left\{t_1 + V_i = t_2 + V_j,\; t_1 < C_i,\; t_2 < C_j\right\} \cdot \left[1\left\{T_i = t_1,\, T_j > t_2\right\} \cdot 1\left\{\left(X_{it_1} - X_{jt_2}\right)'b > 0\right\} + 1\left\{T_i > t_1,\, T_j = t_2\right\} \cdot 1\left\{\left(X_{it_1} - X_{jt_2}\right)'b < 0\right\}\right], \qquad (3.3)$$
and equation (3.2) can be rewritten as
$$\sum_{i<j} 1\left\{\bar T_j + V_j > T_i + V_i > V_j,\; T_i < C_i\right\} \cdot 1\left\{\left(X_{iT_i} - X_{j,T_i+V_i-V_j}\right)'b > 0\right\} + 1\left\{V_i < T_j + V_j < \bar T_i + V_i,\; T_j < C_j\right\} \cdot 1\left\{\left(X_{i,T_j+V_j-V_i} - X_{jT_j}\right)'b < 0\right\}. \qquad (3.4)$$
The intuition for the estimator is essentially based on pairwise comparisons. Specifically, we compare an individual i who was observed to fail at time $T_i$ (and thus had a complete duration $T_i + V_i$) to all other observations j that survived up to the same duration. If, at the true parameter value $\beta$, the index for individual i at the time he/she failed, $X_{iT_i}'\beta$, is larger than the index for the comparable individual j at the time that corresponds to the same duration, $X_{j,T_i+V_i-V_j}'\beta$, then individual j is likely to survive longer than individual i (i.e. $\bar T_j + V_j > T_i + V_i$). Note that the set of comparison observations j includes censored spells, provided the censoring time occurs after $T_i + V_i - V_j$. The additional inequality in the indicator function, $T_i + V_i > V_j$, ensures that the time at which j is being compared to i is within the sample period (i.e. not truncated). Again this estimator has the same structure as Han's (1987) maximum rank correlation estimator, and the asymptotic distribution is therefore the one given in Sherman (1993) under the regularity conditions stated there.
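The pairwise objective (3.4) is simple to compute directly. The sketch below implements it as printed, together with a small synthetic data set and a grid search over the non-normalized coefficient; the data-generating choices, array layout (`X[i, t-1]` holds $X_{it}$, durations 1-based) and the grid are our own, not the paper's.

```python
import numpy as np
from itertools import combinations

def objective(b, T, Tbar, V, C, X):
    """Value of the pairwise objective (3.4) at parameter b."""
    s = 0
    for i, j in combinations(range(len(T)), 2):
        # i fails at duration T_i + V_i; j survives past it; comparison time in-sample
        if Tbar[j] + V[j] > T[i] + V[i] > V[j] and T[i] < C[i]:
            s += (X[i, T[i] - 1] - X[j, T[i] + V[i] - V[j] - 1]) @ b > 0
        # symmetric term with the roles of i and j reversed
        if V[i] < T[j] + V[j] < Tbar[i] + V[i] and T[j] < C[j]:
            s += (X[i, T[j] + V[j] - V[i] - 1] - X[j, T[j] - 1]) @ b < 0
    return s

rng = np.random.default_rng(3)
n, Tmax, beta = 120, 8, np.array([1.0, 2.0])
V = rng.integers(0, 3, n)                     # presample durations
C = rng.integers(3, Tmax + 1, n)              # censoring times
X = rng.normal(size=(n, Tmax, 2))
T = np.full(n, Tmax)
for i in range(n):
    for t in range(1, Tmax + 1):              # failure rule following (2.1) as printed
        if X[i, t - 1] @ beta - rng.normal() <= 0:
            T[i] = t
            break
Tbar = np.minimum(T, C - 1)

grid = np.linspace(-2, 2, 41)                 # b = (b1, 1)', last coefficient normalized
vals = [objective(np.array([b1, 1.0]), T, Tbar, V, C, X) for b1 in grid]
best = float(grid[int(np.argmax(vals))])
print(best)
```

Because the objective is a step function of $b$, grid or simplex-type search is the natural optimization approach, as is standard for maximum rank correlation estimators.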
4. ASYMPTOTIC PROPERTIES

Consistency and asymptotic normality can be established as in Sherman (1993), Abrevaya (1999b) or Khan and Tamer (2007). First note that some normalization of the parameter is needed, since the parameter vector is only identified up to scale. For example, we can normalize the last component of $\beta$ to be 1. The two key assumptions for consistency of the estimator are (1) at least one component of the explanatory variable $X$ is continuously distributed with full support, and (2) the error $\eta$ has full support. Without the first assumption, the parameter is not identified, since a small change in the parameter value could leave the ranking of the index unchanged. The second assumption on the error guarantees that the set of effective observations that make a non-zero contribution to the objective function is not empty. Both assumptions are standard in the semi-parametric estimation literature.

To establish asymptotic normality, we need some additional notation. Denote by $D_i = 1\{T_i < C_i\}$ the observable variable indicating a complete (uncensored) spell. The objective function can be rewritten as
$$\frac{1}{n(n-1)} \sum_{i=1}^n \sum_{j \neq i} D_i \cdot 1\left\{T_i + V_i > V_j,\; T_j + V_j > V_i,\; \bar T_j + V_j > T_i + V_i\right\} \cdot 1\left\{X_{iT_i}'b > X_{j,T_i+V_i-V_j}'b\right\}. \qquad (4.1)$$
Define the function
$$\tau\left(\left(t, \bar t, d, v, \{x_s\}_{s \leq \bar t}\right), b\right) \equiv E\left[D_i \cdot 1\left\{T_i + V_i > v,\; t + v > V_i,\; \bar t + v > T_i + V_i\right\} \cdot 1\left\{X_{iT_i}'b > x_{T_i+V_i-v}'b\right\}\right] + E\left[d \cdot 1\left\{t + v > V_i,\; T_i + V_i > v,\; \bar T_i + V_i > t + v\right\} \cdot 1\left\{x_t'b > X_{i,t+v-V_i}'b\right\}\right],$$
where the second expectation, originally over an independent observation $j$, can equivalently be taken over observation $i$ because the observations are i.i.d. Following Theorem 4 of Sherman (1993), we have
$$\sqrt{n}\left(\hat\beta - \beta\right) \overset{d}{\longrightarrow} N\left(0,\; 4\,\Gamma^{-1}\Delta\Gamma^{-1}\right), \qquad (4.2)$$
where
$$\Gamma = E\left[\nabla_2\,\tau\left(\left(T_i, \bar T_i, D_i, V_i, \{X_{is}\}\right), \beta\right)\right], \qquad \Delta = E\left[\nabla_1\,\tau\left(\left(T_i, \bar T_i, D_i, V_i, \{X_{is}\}\right), \beta\right)\,\nabla_1\,\tau\left(\left(T_i, \bar T_i, D_i, V_i, \{X_{is}\}\right), \beta\right)'\right],$$
with $\nabla_1$ and $\nabla_2$ denoting the first- and second-derivative operators, respectively. Following Sherman (1993), we can further express the variance–covariance matrix in terms of 'model primitives'. Specifically,
$$\Gamma = E_X\left[\sum_{s_1, s_2} \left(\tilde X_{s_2} - \mu_{s_1}\left(X_{s_2}'\beta\right)\right) S'\left(\left(T, \bar T, D, V, s_1, s_2\right), X_{s_2}'\beta\right)\, g_{X_{s_1}'\beta}\left(X_{s_2}'\beta\right)\right]$$
and
$$\Delta = E_X\left[\sum_{s_1, s_2} \left(\tilde X_{s_2} - \mu_{s_1}\left(X_{s_2}'\beta\right)\right)\left(\tilde X_{s_2} - \mu_{s_1}\left(X_{s_2}'\beta\right)\right)' S\left(\left(T, \bar T, D, V, s_1, s_2\right), X_{s_2}'\beta\right)^2\, g_{X_{s_1}'\beta}\left(X_{s_2}'\beta\right)\right],$$
where $\tilde X_{s_2}$ is composed of the first $K-1$ coordinates of $X_{s_2}$ (the ones corresponding to the part of $\beta$ that is not normalized to 1), $g_{X_{s_1}'\beta}(\lambda)$ is the marginal density of $X_{s_1}'\beta$,
$$\mu_{s_1}(\lambda) = E\left[\tilde X_{s_1} \mid X_{s_1}'\beta = \lambda\right], \qquad S\left(\left(t, \bar t, d, v, s_1, s_2\right), \lambda\right) = E\left[A_i\left(t, \bar t, d, v, s_1, s_2\right) \mid X_{is_1}'\beta = \lambda\right],$$
and
$$A_i\left(t, \bar t, d, v, s_1, s_2\right) = D_i \cdot 1\left\{T_i + V_i > v,\; t + v > V_i,\; \bar t + v > T_i + V_i,\; T_i = s_1,\; T_i + V_i - v = s_2\right\} - d \cdot 1\left\{t + v > V_i,\; T_i + V_i > v,\; \bar T_i + V_i > t + v,\; t = s_2,\; t + v - V_i = s_1\right\}.$$
The asymptotic variance matrix can be estimated by plugging in the estimator $\hat\beta$ and calculating sample analogues of $\Gamma$ and $\Delta$ using numerical derivatives based on a smoothed version of $\tau$. See Section 6 for more discussion.
5. RELATIONSHIP TO OTHER ESTIMATORS

The estimator proposed in the previous section is related to a number of existing estimators, and it coincides with some of them in special situations. For example, when $C_i = 2$ for all i, and with no left truncation (so $V_i = 0$ for all i), (2.1) is a standard discrete choice model, and in that case the objective function in (3.4) becomes
$$\sum_{i<j} 1\left\{Y_i > Y_j\right\} \cdot 1\left\{\left(X_i - X_j\right)'b > 0\right\} + 1\left\{Y_i < Y_j\right\} \cdot 1\left\{\left(X_i - X_j\right)'b < 0\right\},$$
which is the objective function for Han's (1987) maximum rank correlation estimator. When there is left truncation, the covariates are time invariant and there is no censoring, the estimator defined by maximizing (3.4) is the same as the truncated maximum rank correlation estimator in Abrevaya (1999b). This is most easily seen by noting that without censoring and with time-invariant covariates, (4.1) becomes
$$\frac{1}{n(n-1)} \sum_{i=1}^n \sum_{j \neq i} 1\left\{T_i + V_i > V_j,\; T_j + V_j > V_i,\; T_j + V_j > T_i + V_i\right\} \cdot 1\left\{X_i'b > X_j'b\right\}$$
$$= \frac{1}{n(n-1)} \sum_{i=1}^n \sum_{j \neq i} 1\left\{T_i + V_i > V_j,\; T_j + V_j > V_i\right\} 1\left\{T_j + V_j > T_i + V_i\right\} \cdot 1\left\{X_i'b > X_j'b\right\}.$$
Except for the difference in notation and the normalization by $n(n-1)$, this is exactly equation (7) in Abrevaya (1999b).
Khan and Tamer (2007) consider a model with left censoring as well as right censoring, whereas we allow for left truncation as well as right censoring. When there is neither left censoring nor left truncation, and when the covariates are time invariant, the estimator defined by maximizing (3.4) coincides with the estimator proposed by Khan and Tamer (2007), except that ours applies to discretized durations and theirs to exactly measured durations, and we allow for time-varying covariates. Whether left censoring or left truncation is more interesting depends on the specific application. Left truncation will, for example, be relevant if, as in Frederiksen et al. (2007), the duration of interest is the length of employment on a given job, and one has information on a sample of workers observed between two fixed points in time. In this case, the durations are left truncated, because spells that ended before the start of the sampling will not appear in the data, and it is crucial for our approach that one observes the duration of employment in the current job at the start of the sampling. On the other hand, models with both left and right censoring are, for example, useful for estimation of models where the dependent variable is a fraction, which is restricted to be between zero and one, and where both zeros and ones are likely to be observed. See, for example, Alan and Leth-Petersen (2006). Both Khan and Tamer (2007) and we allow the censoring points at the right to be observed only when the observation equals a censoring point. Khan and Tamer (2007) also allow the left-censoring point to be unobserved when an observation is not left-censored, whereas we assume that the truncation point is observed for everybody who is not truncated, but not for truncated durations. Both papers assume that one observes the actual duration, and not just the duration from the censoring/truncation point. In the duration contexts we have in mind, this is the most severe assumption.³
3 The framework here is also closely related to standard statistical duration models with discretized observations. The proportional hazard model can be written as Z(t) = −x′β + ε, where Z is the log-integrated baseline hazard and ε has an extreme value distribution. Prentice and Gloeckler (1978), Meyer (1990) and Hausman and Woutersen (2005) study a version of this model with interval observations. Meyer (1990) and Hausman and Woutersen (2005) also allow for time-varying explanatory variables and for ε to be a sum of an extreme value distributed random variable and a random variable that captures unobserved heterogeneity. While the estimation in Meyer (1990) is likelihood-based and hence fundamentally different from ours, the structure of the estimator proposed in Hausman and Woutersen (2005) shares many of the features of the estimator proposed here. The main difference is that theirs is based on a comparison of the integrated hazards rather than just the current index, X′β. As a result, the approach does not seem to generalize to models with left truncation. On the other hand, log-concavity plays no role in Hausman and Woutersen (2005).
6. MONTE CARLO EXPERIMENT

In this section, we conduct a small-scale Monte Carlo study to illustrate the proposed estimation method and investigate its finite sample performance. We also demonstrate how to conduct inference and examine how good an approximation the asymptotic distribution provides for finite samples.

3 See also Heckman and Singer (1986) for a discussion of the effect of different sample schemes on the analysis of duration data.
B. E. Honor´e and L. Hu
The designs are based on the following:

• All of the designs have two explanatory variables.
• β = (β₁, β₂)′ = (1, 2)′. The parameter of interest is θ = β₂/β₁. The fact that this is one-dimensional greatly simplifies the computations.
• Time-varying intercept β₀ = −4 + (s/10)^{1.2}. This introduces duration dependence beyond the duration dependence introduced by the shape of F and by the choice of h(·).
• The time between the start of a spell and the first period of observation is uniformly distributed on the integers between 1 and 5.
• The censoring time is generated as the minimum of 10 periods from the start of the spell and Q periods from the start of the sample, where Q is uniformly distributed on the integers between 1 and 8.

We consider a number of designs within this framework.

Design 1: Dynamic Probit. The two explanatory variables are generated by i.i.d. draws from a bivariate normal distribution with zero means, variances equal to 2 and 1, respectively, and covariance equal to 1. In this design, η_{i,t} is i.i.d. N(0, 4).

Design 2: Transformation Model Hazard. This design is set up as a generalization of a transformation model. Specifically, using the notation of Section 1, we assume that h(u) = log(u), g(u) = u and ε ∼ N(0, 1). Using the derivation in Section 1, this yields

$$P(T_i = t) = P\big(T_i^* < t \mid \{X_{is}\}_{s\le t},\, T_i^* > t-1\big) = 1 - \frac{1 - \Phi\big(\log(t) - X_{it}'\beta\big)}{1 - \Phi\big(\log(t-1) - X_{it}'\beta\big)}.$$

As in Design 1, the two explanatory variables are generated by i.i.d. draws from a bivariate normal distribution with zero means, variances equal to 2 and 1, respectively, and covariance equal to 1.

Design 3: Feedback. Recall that our model does not require the explanatory variables to be strictly exogenous. In this design, we therefore allow for feedback from the error η to future values of X. Specifically, we follow Design 1 except that the explanatory variables are defined by

• X_{2t} = η_{t−1} for t > 1 and standard normal for t = 1.
• X_{1t} = X_{2t} + N(0, 1).
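To make the sampling scheme concrete, Design 1's data-generating process can be sketched as follows. This is our reading of the design, not the authors' code: in particular, the discrete-hazard rule that a spell ends in the first period s with β₀(s) + X_s′β + η_s > 0, and all function names, are assumptions.

```python
import random

def draw_spell(rng, beta=(1.0, 2.0), max_t=60):
    """One duration from our reading of Design 1's discrete hazard."""
    for s in range(1, max_t + 1):
        # bivariate normal (x1, x2): variances 2 and 1, covariance 1
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x2 = z1
        x1 = z1 + z2
        b0 = -4 + (s / 10) ** 1.2          # time-varying intercept
        eta = rng.gauss(0, 2)              # eta ~ N(0, 4)
        if b0 + beta[0] * x1 + beta[1] * x2 + eta > 0:
            return s                       # spell ends in period s
    return max_t

def observe(rng):
    """Apply the paper's truncation and censoring scheme to one spell."""
    t = draw_spell(rng)
    v = rng.randint(1, 5)                  # spell age when sampling starts
    if t <= v:
        return None                        # left-truncated: ended pre-sample
    q = rng.randint(1, 8)
    c = min(10, v + q)                     # censoring time, in spell time
    return (min(t, c), t > c)              # (observed duration, censored flag)

rng = random.Random(0)
draws = [observe(rng) for _ in range(2000)]
sample = [obs for obs in draws if obs is not None]
```

With this scheme, a sizeable minority of spells is dropped by left truncation and another share is right censored, broadly in line with the fractions reported in Table 1.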
Design 4: Covariate-Dependent Censoring and Truncation. Our model allows censoring and truncation to be correlated with explanatory variables. In this design, we follow the basic structure of Design 1 but let censoring be defined by the outcome of a probit with explanatory variable X_{1t}.

Design 5: Dynamic Probit 2. This design is like Design 1 except that

• X_{2s} ∼ N(0.5 − 0.2s, 1).
Table 1. Summary statistics for the designs.
                                 Design 1   Design 2   Design 3   Design 4   Design 5
Fraction truncated                  0.260      0.349      0.316      0.260      0.183
Fraction censored                   0.297      0.100      0.190      0.425      0.543
Mean duration                       5.317      3.592      4.369      5.317      7.270
Standard deviation of duration      3.422      2.023      2.918      3.422      3.937
• X_{1s} = 1{X_{2s} + N(0, 1) > 0}.
• η_{i,s} ∼ N(0, 0.1 + 0.15s).

The summary statistics for the five designs are reported in Table 1. For each design, 100,000 observations are drawn from the underlying data-generating process. We then compute the fraction of the sample that is censored, the fraction that is truncated, and the mean and standard deviation of the underlying duration.

Below, we report Monte Carlo results for the point estimates of β as well as for the performance of test statistics based on the asymptotic distribution in Section 4. To do this, we estimate the components of the variance of the estimators by sample analogues of smoothed versions of the components. 4 Recall that the asymptotic variance is built from the two matrices

$$E\big[\nabla_2 \tau\big((T_i, \bar T_i, D_i, V_i, \{X_{is}\}), \beta\big)\big] \quad\text{and}\quad E\big[\nabla_1 \tau\big((T_i, \bar T_i, D_i, V_i, \{X_{is}\}), \beta\big)\,\nabla_1 \tau\big((T_i, \bar T_i, D_i, V_i, \{X_{is}\}), \beta\big)'\big],$$

where

$$\tau\big((t, \bar t, d, v, \{x_s\}_{s\le \bar t}), b\big) = E\Big[D_i \cdot 1\{T_i + V_i > v,\ t + v > V_i,\ \bar t + v > T_i + V_i\} \cdot 1\big\{X_{iT_i}'b > x_{T_i + V_i - v}'b\big\}\Big] + E\Big[d \cdot 1\{t + v > V_i,\ T_i + V_i > v,\ \bar T_i + V_i > t + v\} \cdot 1\big\{x_t'b > X_{i,t+v-V_i}'b\big\}\Big].$$

We then estimate τ by the smoothed version

$$\hat\tau\big((t, \bar t, d, v, \{x_s\}_{s\le \bar t}), b\big) = \frac{1}{n}\sum_{i=1}^n D_i \cdot 1\{T_i + V_i > v,\ t + v > V_i,\ \bar t + v > T_i + V_i\} \cdot \Phi\!\left(\frac{X_{iT_i}'b - x_{T_i + V_i - v}'b}{h}\right) + \frac{1}{n}\sum_{i=1}^n d \cdot 1\{t + v > V_i,\ T_i + V_i > v,\ \bar T_i + V_i > t + v\} \cdot \Phi\!\left(\frac{x_t'b - X_{i,t+v-V_i}'b}{h}\right).$$
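The smoothing step replaces each indicator 1{u > 0} inside τ with Φ(u/h); as the bandwidth h shrinks, the smoothed term approaches the hard indicator. A minimal sketch, assuming Φ is the standard normal CDF (consistent with φ appearing in the derivative formulas):

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def smooth_indicator(u, h):
    """Smoothed version of the indicator 1{u > 0}: Phi(u / h)."""
    return Phi(u / h)

# As h -> 0, Phi(u / h) -> 1 for u > 0 and -> 0 for u < 0.
for h in (1.0, 0.1, 0.01):
    print(h, smooth_indicator(0.5, h), smooth_indicator(-0.5, h))
```

The smooth surrogate makes the objective differentiable in b, which is what allows the derivative-based variance estimates below.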
4 A recent paper by Subbotin (2007) has shown that the non-parametric bootstrap can be used to estimate the quantiles and variance of various maximum rank correlation estimators. The structure of our estimator is essentially the same as that of the maximum rank correlation estimators he considers. We therefore conjecture that the bootstrap could have been used to estimate the variance in our case as well, although this would increase the computational burden.
Then

$$\nabla_1 \hat\tau\big((t, \bar t, d, v, \{x_s\}_{s\le \bar t}), b\big) = \frac{1}{nh}\sum_{i=1}^n \Big[ D_i \cdot 1\{T_i + V_i > v,\ t + v > V_i,\ \bar t + v > T_i + V_i\} \cdot \phi\!\left(\frac{X_{iT_i}'b - x_{T_i+V_i-v}'b}{h}\right)\big(\tilde X_{iT_i} - \tilde x_{T_i+V_i-v}\big) + d \cdot 1\{t + v > V_i,\ T_i + V_i > v,\ \bar T_i + V_i > t + v\} \cdot \phi\!\left(\frac{x_t'b - X_{i,t+v-V_i}'b}{h}\right)\big(\tilde x_t - \tilde X_{i,t+v-V_i}\big)\Big],$$
and

$$\nabla_2 \hat\tau\big((t, \bar t, d, v, \{x_s\}_{s\le \bar t}), b\big) = \frac{-1}{nh^3}\sum_{i=1}^n \Big[ D_i \cdot 1\{T_i + V_i > v,\ t + v > V_i,\ \bar t + v > T_i + V_i\} \cdot \big(X_{iT_i}'b - x_{T_i+V_i-v}'b\big)\,\phi\!\left(\frac{X_{iT_i}'b - x_{T_i+V_i-v}'b}{h}\right)\big(\tilde X_{iT_i} - \tilde x_{T_i+V_i-v}\big)\big(\tilde X_{iT_i} - \tilde x_{T_i+V_i-v}\big)' + d \cdot 1\{t + v > V_i,\ T_i + V_i > v,\ \bar T_i + V_i > t + v\} \cdot \big(x_t'b - X_{i,t+v-V_i}'b\big)\,\phi\!\left(\frac{x_t'b - X_{i,t+v-V_i}'b}{h}\right)\big(\tilde x_t - \tilde X_{i,t+v-V_i}\big)\big(\tilde x_t - \tilde X_{i,t+v-V_i}\big)'\Big].$$
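The 1/(nh) and −1/(nh³) scale factors follow from differentiating the smoothed indicator; writing u for an index difference such as X′b − x′b, and assuming Φ and φ are the standard normal CDF and density,

```latex
\frac{d}{du}\,\Phi\!\left(\frac{u}{h}\right) = \frac{1}{h}\,\phi\!\left(\frac{u}{h}\right),
\qquad
\frac{d^{2}}{du^{2}}\,\Phi\!\left(\frac{u}{h}\right)
  = \frac{1}{h^{2}}\,\phi'\!\left(\frac{u}{h}\right)
  = -\frac{u}{h^{3}}\,\phi\!\left(\frac{u}{h}\right),
```

using φ′(z) = −zφ(z); the outer products of the covariate differences then come from the chain rule in b.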
The two matrices are then estimated by the sample analogues

$$\frac{1}{n}\sum_{i=1}^n \nabla_2 \hat\tau\big((T_i, \bar T_i, D_i, V_i, \{X_{is}\}), \hat\beta\big) \quad\text{and}\quad \frac{1}{n}\sum_{i=1}^n \nabla_1 \hat\tau\big((T_i, \bar T_i, D_i, V_i, \{X_{is}\}), \hat\beta\big)\,\nabla_1 \hat\tau\big((T_i, \bar T_i, D_i, V_i, \{X_{is}\}), \hat\beta\big)'.$$
As mentioned earlier, β is only identified up to scale. One possibility is to normalize one of the coefficients to 1, and hence essentially focus on β₂/β₁ or β₁/β₂. Unfortunately, this normalization will lead to different MAE and RMSE depending on which of the coefficients is normalized. So if one were to compare different estimators, one might reach different conclusions depending on a seemingly innocent normalization. This is unsatisfactory in models where there is only one parameter. For this reason, it is likely to be better to consider θ = log(β₂/β₁) = log(β₂) − log(β₁) as the parameter of interest. This means that the true parameter is log(2) ≈ 0.693 for all of the designs. We estimate θ by a grid search over the interval between −log(6) and log(6) with equal grids of size 1/200. Since the parameter of interest is one-dimensional, the line search is feasible despite the fact that the calculation of the objective function requires O(n²) operations. When it is of higher dimension, it would be beneficial to use the method described in Abrevaya (1999a) to calculate the objective function in O(n · log(n)) operations. We calculate the variance of θ̂ by applying the so-called δ-method to (4.2).
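The one-dimensional line search is straightforward to code. A sketch, using a stand-in O(n²) pairwise (rank-type) objective rather than the paper's τ̂-based objective; the simulated data and the objective here are illustrative assumptions:

```python
import math
import random

def grid_search(objective, lo=-math.log(6), hi=math.log(6), step=1 / 200):
    """Maximize a scalar objective over an equally spaced grid on [lo, hi]."""
    best_theta, best_val = lo, objective(lo)
    for k in range(1, int((hi - lo) / step) + 1):
        theta = lo + k * step
        val = objective(theta)
        if val > best_val:
            best_theta, best_val = theta, val
    return best_theta

# Illustrative data with beta2 / beta1 = 2, so theta = log(2) =~ 0.693.
rng = random.Random(1)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(80)]
y = [x1 + 2.0 * x2 + rng.gauss(0, 0.1) for x1, x2 in data]

def obj(theta):
    """Count pairs ordered the same way by the index and by the outcome."""
    b2 = math.exp(theta)
    idx = [x1 + b2 * x2 for x1, x2 in data]
    n = len(y)
    return sum((idx[i] > idx[j]) == (y[i] > y[j])
               for i in range(n) for j in range(i + 1, n))

theta_hat = grid_search(obj)
```

Each objective evaluation costs O(n²) pair comparisons, so the full search costs O(n²) times the number of grid points, which is feasible only because the grid is one-dimensional.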
For each design, the Monte Carlo experiment is conducted with 5000 replications for each of the five sample sizes: 100, 200, 400, 800 and 1600. The results are reported in Tables 2–6. Overall, the results across the designs are broadly consistent with predictions from the asymptotic theory. Some additional remarks are in order.
Table 2. Results for Design 1.
                         n = 100   n = 200   n = 400   n = 800   n = 1600
Performance of estimator
Median                     0.721     0.695     0.702     0.693     0.693
MAE                        0.331     0.215     0.155     0.105     0.070
Mean                       0.727     0.706     0.702     0.698     0.694
RMSE                       0.488     0.331     0.227     0.155     0.106
Significance when testing at 20% level
0.05, 0.20                 0.332     0.240     0.212     0.207     0.218
0.05, 0.40                 0.163     0.152     0.187     0.241     0.271
0.10, 0.20                 0.421     0.323     0.280     0.256     0.246
0.10, 0.40                 0.243     0.232     0.261     0.296     0.302
0.20, 0.20                 0.494     0.382     0.323     0.285     0.260
0.20, 0.40                 0.324     0.301     0.305     0.326     0.319
0.40, 0.20                 0.536     0.423     0.350     0.303     0.269
0.40, 0.40                 0.390     0.347     0.337     0.343     0.329
Average                    0.289     0.259     0.261     0.274     0.273
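The MAE row above falls by roughly a factor of √2 each time n doubles, i.e. at roughly a root-n rate. A quick check, regressing log(MAE) on log(n) with the values copied from Table 2:

```python
import math

# Design 1 MAE at n = 100, 200, 400, 800, 1600 (Table 2).
ns = [100, 200, 400, 800, 1600]
mae = [0.331, 0.215, 0.155, 0.105, 0.070]

# OLS slope of log(MAE) on log(n); a root-n rate implies a slope near -0.5.
x = [math.log(n) for n in ns]
y = [math.log(m) for m in mae]
xbar = sum(x) / len(x)
ybar = sum(y) / len(y)
slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
print(round(slope, 3))
```

The single-design slope is close to −0.5, in line with the pooled coefficient of −0.543 reported in the discussion of the results.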
Table 3. Results for Design 2.
                         n = 100   n = 200   n = 400   n = 800   n = 1600
Performance of estimator
Median                     0.718     0.713     0.698     0.693     0.692
MAE                        0.540     0.370     0.255     0.175     0.125
Mean                       0.656     0.712     0.705     0.697     0.695
RMSE                       0.752     0.553     0.392     0.274     0.186
Significance when testing at 20% level
0.05, 0.20                 0.423     0.350     0.268     0.236     0.227
0.05, 0.40                 0.164     0.175     0.180     0.224     0.265
0.10, 0.20                 0.515     0.422     0.335     0.280     0.253
0.10, 0.40                 0.269     0.252     0.245     0.273     0.296
0.20, 0.20                 0.578     0.477     0.372     0.302     0.269
0.20, 0.40                 0.359     0.317     0.293     0.303     0.313
0.40, 0.20                 0.620     0.514     0.394     0.317     0.279
0.40, 0.40                 0.426     0.363     0.322     0.319     0.323
Average                    0.333     0.299     0.274     0.273     0.275
Table 4. Results for Design 3.
                         n = 100   n = 200   n = 400   n = 800   n = 1600
Performance of estimator
Median                     0.683     0.699     0.693     0.696     0.688
MAE                        0.400     0.263     0.174     0.120     0.085
Mean                       0.691     0.716     0.707     0.700     0.694
RMSE                       0.569     0.405     0.271     0.185     0.124
Significance when testing at 20% level
0.05, 0.20                 0.389     0.289     0.214     0.201     0.208
0.05, 0.40                 0.216     0.150     0.143     0.202     0.255
0.10, 0.20                 0.468     0.372     0.288     0.264     0.247
0.10, 0.40                 0.280     0.226     0.225     0.273     0.302
0.20, 0.20                 0.551     0.455     0.345     0.308     0.273
0.20, 0.40                 0.368     0.307     0.287     0.316     0.327
0.40, 0.20                 0.608     0.503     0.384     0.330     0.287
0.40, 0.40                 0.438     0.370     0.325     0.344     0.342
Average                    0.326     0.268     0.247     0.266     0.273
Table 5. Results for Design 4.
                         n = 100   n = 200   n = 400   n = 800   n = 1600
Performance of estimator
Median                     0.706     0.693     0.703     0.698     0.693
MAE                        0.385     0.260     0.178     0.118     0.085
Mean                       0.704     0.703     0.704     0.701     0.694
RMSE                       0.557     0.397     0.273     0.182     0.124
Significance when testing at 20% level
0.05, 0.20                 0.355     0.280     0.235     0.195     0.216
0.05, 0.40                 0.176     0.161     0.186     0.219     0.270
0.10, 0.20                 0.437     0.369     0.307     0.250     0.249
0.10, 0.40                 0.250     0.235     0.255     0.269     0.300
0.20, 0.20                 0.515     0.428     0.356     0.280     0.268
0.20, 0.40                 0.335     0.305     0.307     0.305     0.320
0.40, 0.20                 0.568     0.472     0.383     0.296     0.280
0.40, 0.40                 0.402     0.353     0.340     0.324     0.333
Average                    0.300     0.272     0.270     0.259     0.276
First, the results illustrate the consistency of the estimator, since both the median absolute error (MAE) and root mean squared error (RMSE) decrease as sample size increases. Moreover, the estimator is close to median unbiased even for small sample sizes. Secondly, the theory predicts that the estimator converges to the true parameter value at the rate √n. This is borne out in the simulation as the MAE and RMSE decrease toward zero at
Table 6. Results for Design 5.
                         n = 100   n = 200   n = 400   n = 800   n = 1600
Performance of estimator
Median                     0.711     0.704     0.682     0.698     0.693
MAE                        0.567     0.409     0.285     0.195     0.135
Mean                       0.646     0.710     0.720     0.718     0.703
RMSE                       0.789     0.612     0.438     0.305     0.207
Significance when testing at 20% level
0.05, 0.20                 0.518     0.414     0.365     0.355     0.372
0.05, 0.40                 0.506     0.464     0.567     0.637     0.661
0.10, 0.20                 0.548     0.485     0.429     0.399     0.401
0.10, 0.40                 0.528     0.557     0.638     0.666     0.677
0.20, 0.20                 0.607     0.541     0.469     0.425     0.411
0.20, 0.40                 0.584     0.618     0.672     0.681     0.683
0.40, 0.20                 0.631     0.566     0.480     0.431     0.412
0.40, 0.40                 0.627     0.648     0.682     0.684     0.683
Average                    0.511     0.503     0.505     0.508     0.513
Figure 1. Design 1: density of estimation error for estimator and its log.
a rate of approximately √2 when the sample size is doubled. For example, a regression of the log of the median absolute error on the log of the sample size (and design dummies) yields a coefficient of −0.543 with a standard error of 0.006.

Thirdly, to examine the normality prediction from the asymptotic theory, we estimate the density for θ̂ − θ and plot the kernel estimate in Figures 1–5. The left-hand side of each figure gives the estimated density of the estimator of (β₂/β₁) centred at the true value. They show severe asymmetry in the distribution of the estimator: it tends to be skewed to the left, especially in small samples. As mentioned, this is expected because of the somewhat unnatural normalization. The right-hand side of the figures shows the estimated density of the estimator of log(β₂/β₁), again
Figure 2. Design 2: density of estimation error for estimator and its log.
Figure 3. Design 3: density of estimation error for estimator and its log.
Figure 4. Design 4: density of estimation error for estimator and its log.
Figure 5. Design 5: density of estimation error for estimator and its log.
centred at the true value. There one can see that the distribution becomes more symmetric and closer to normal as sample size increases.

Finally, the asymptotic theory suggests that we can conduct inference using t-tests. Under the null, the test statistic should follow a standard normal distribution. In the simulation, we compute a t-statistic for each of the 5000 estimates θ̂ and calculate the fraction of times the null is rejected at the 20% level. We focus on tests with (nominal) size of 20% rather than the conventional 5% because the results for the latter are likely to be more erratic for a finite number of simulations. The results are reported for various bandwidths used in the estimation of the asymptotic variance–covariance matrix of the estimator. In general, the rejection rate is closer to the nominal size of the test when the bandwidth is smaller and the sample size is larger. For example, for the bandwidth (0.05, 0.20) and sample size 1600, the rejection rate is 21.8%, 22.7%, 20.8% and 21.6% for Designs 1 to 4. 5 These are close to being statistically indistinguishable from the nominal size. The performance of the test under some other combinations of bandwidth and sample size is less encouraging. The test also performs less well under Design 5. We speculate that this is because of the discreteness of x_{i1}. The last row reports the rejection rates computed using the average of the variance–covariance matrix estimated over all the bandwidth choices. Overall, the t-test tends to over-reject the null.

It is interesting to compare our results to a standard logit or probit estimation of (2.1) where one uses x₁, x₂ and time dummies as explanatory variables. Since we expect them to perform comparably, we focus on the logit maximum likelihood estimator. 6 Designs 1, 3 and 4 are all correctly specified probit models, so one would expect the logit estimator to do well for these designs. This is confirmed in panels 1, 3 and 4 of Table 7.
The bias is small and the MAE and RMSE fall at a rate close to root-n. It is less clear what to expect for Designs 2 and 5. Panel 2 of Table 7 shows that the logit estimator does well for Design 2. It appears to be close to unbiased and its MAE and RMSE fall at a rate close to root-n. One potential explanation for this is that

5 Different bandwidths are used in estimating the two matrices that enter the asymptotic variance. The matrix based on the second derivative would be expected to require a larger bandwidth than the one based on the first derivative.
6 Since our estimator of θ was calculated by a grid search over the interval between −log(6) and log(6), we censored the logit maximum likelihood estimator of β₂/β₁ to be in the interval between 1/6 and 6.
Table 7. Performance of the logit MLE.
                 n = 100   n = 200   n = 400   n = 800   n = 1600
Design 1
Median             0.699     0.699     0.692     0.694     0.693
MAE                0.240     0.165     0.114     0.084     0.059
Mean               0.715     0.702     0.696     0.695     0.693
RMSE               0.369     0.252     0.173     0.123     0.085
Design 2
Median             0.696     0.703     0.692     0.689     0.695
MAE                0.413     0.282     0.199     0.139     0.100
Mean               0.691     0.720     0.701     0.694     0.695
RMSE               0.639     0.443     0.303     0.214     0.149
Design 3
Median             0.691     0.692     0.695     0.692     0.694
MAE                0.302     0.202     0.137     0.098     0.069
Mean               0.711     0.710     0.701     0.696     0.695
RMSE               0.454     0.307     0.209     0.144     0.103
Design 4
Median             0.687     0.695     0.696     0.696     0.692
MAE                0.288     0.201     0.138     0.097     0.067
Mean               0.706     0.702     0.701     0.699     0.693
RMSE               0.438     0.302     0.209     0.146     0.100
Design 5
Median             0.884     1.008     1.046     1.055     1.063
MAE                0.521     0.406     0.361     0.362     0.370
Mean               0.761     1.012     1.080     1.082     1.073
RMSE               0.948     0.685     0.540     0.476     0.428
misspecified maximum likelihood estimators often do well when the explanatory variables are jointly normally distributed. See, for example, Ruud (1983). Design 5 shows a situation where the logit estimator does relatively poorly. The bias is quite high and, as a result, the MAE and RMSE do not fall rapidly as the sample size increases.
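For reference, the rejection-rate calculation behind the significance panels of Tables 2–6 can be sketched as follows; the t-statistics here are hypothetical draws under the null, not the paper's simulated statistics:

```python
import random

Z90 = 1.2816  # standard normal 0.90 quantile; two-sided test at the 20% level

def rejection_rate(t_stats, crit=Z90):
    """Fraction of replications whose |t| exceeds the critical value."""
    return sum(abs(t) > crit for t in t_stats) / len(t_stats)

# Under the null, t is (asymptotically) N(0, 1), so with 5000 replications
# the rejection rate should be close to the nominal 20%.
rng = random.Random(0)
t_null = [rng.gauss(0, 1) for _ in range(5000)]
print(rejection_rate(t_null))
```

With 5000 replications the Monte Carlo standard error of a 20% rejection rate is about 0.006, which is why rates in the low twenties are close to indistinguishable from the nominal size.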
7. CONCLUSION

In this paper, we propose a generalization of the transformation model that is appropriate for studying duration outcomes with truncation, censoring, interval observations of the dependent variable and time-varying covariates. We develop an estimator for this model, discuss its asymptotic properties and investigate its finite sample performance via a Monte Carlo study.
Overall, the results suggest that the estimator performs well in finite samples, and the asymptotic theory provides a reasonably good approximation to its distribution. We also investigate test statistics for the estimator. These require estimation of the asymptotic variance of the estimator, which is somewhat sensitive to the choice of bandwidth. Investigating the optimal bandwidth choice in this case could be an interesting topic for future research.
ACKNOWLEDGMENTS

This research was supported by the National Science Foundation and the Gregory C. Chow Econometric Research Program at Princeton University. We thank seminar participants at Rice University, Université Paris 1 Panthéon–Sorbonne and the Federal Reserve Bank of Chicago as well as members of Princeton's Microeconometric Reading Group for comments. The opinions expressed here are those of the authors and not necessarily those of the Federal Reserve Bank of Chicago or the Federal Reserve System.
REFERENCES

Abrevaya, J. (1999a). Computation of the maximum rank correlation estimator. Economics Letters 62, 279–85.
Abrevaya, J. (1999b). Rank estimation of a transformation model with observed truncation. Econometrics Journal 2, 292–305.
Alan, S. and S. Leth-Petersen (2006). Tax incentives and household portfolios: a panel data analysis. Working Paper No. 2006-13, Center for Applied Microeconometrics, University of Copenhagen.
Flinn, C. J. and J. J. Heckman (1982). Models for the analysis of labor force dynamics. In R. L. Basmann and G. F. Rhodes, Jr (Eds.), Advances in Econometrics, Volume 1, 35–95. Greenwich: JAI Press.
Frederiksen, A., B. E. Honoré and L. Hu (2007). Discrete time duration models with group-level heterogeneity. Journal of Econometrics 141, 1014–43.
Han, A. (1987). Nonparametric analysis of a generalized regression model. Journal of Econometrics 35, 303–16.
Hausman, J. A. and T. Woutersen (2005). Estimating a semi-parametric duration model without specifying heterogeneity. Working Paper, Johns Hopkins University.
Heckman, J. J. and B. E. Honoré (1990). The empirical content of the Roy model. Econometrica 58, 1121–49.
Heckman, J. J. and B. Singer (1986). Econometric analysis of longitudinal data. In Z. Griliches and M. D. Intriligator (Eds.), Handbook of Econometrics, Volume 3, 1689–763. Amsterdam: North-Holland.
Honoré, B. E. (1992). Trimmed LAD and least squares estimation of truncated and censored regression models with fixed effects. Econometrica 60, 533–65.
Honoré, B. E. and J. L. Powell (1994). Pairwise difference estimators of censored and truncated regression models. Journal of Econometrics 64, 241–78.
Khan, S. and E. Tamer (2007). Partial rank estimation of duration models with general forms of censoring. Journal of Econometrics 136, 251–80.
Lee, M. J. (1993). Quadratic mode regression. Journal of Econometrics 57, 1–19.
Meyer, B. D. (1990). Unemployment insurance and unemployment spells. Econometrica 58, 757–82.
Prentice, R. L. and L. A. Gloeckler (1978). Regression analysis of grouped survival data with application to breast cancer data. Biometrics 34, 57–67.
Ruud, P. A. (1983). Sufficient conditions for the consistency of maximum likelihood estimation despite misspecification of distribution in multinomial discrete choice models. Econometrica 51, 225–28.
Sherman, R. T. (1993). The limiting distribution of the maximum rank correlation estimator. Econometrica 61, 123–37.
Subbotin, V. (2007). Asymptotic and bootstrap properties of rank regressions. Working Paper, Social Science Research Network.
APPENDIX

Assume that H(·) is a log-concave function and let

$$f(w) = \frac{H(a_2 - w)}{H(a_1 - w)},$$

where a₂ > a₁. Let w₁ < w₂ and define

$$\Delta a = a_2 - a_1, \qquad \Delta w = w_2 - w_1 \qquad\text{and}\qquad \lambda = \frac{\Delta a}{\Delta a + \Delta w}.$$

Then a₂ − w₂ = λ(a₂ − w₁) + (1 − λ)(a₁ − w₂), so by concavity of ln(H(·)),

$$\ln(H(a_2 - w_2)) > \lambda \ln(H(a_2 - w_1)) + (1 - \lambda) \ln(H(a_1 - w_2)). \tag{A.1}$$

Also a₁ − w₁ = (1 − λ)(a₂ − w₁) + λ(a₁ − w₂), so

$$\ln(H(a_1 - w_1)) > (1 - \lambda) \ln(H(a_2 - w_1)) + \lambda \ln(H(a_1 - w_2)). \tag{A.2}$$

Adding (A.1) and (A.2) yields

$$\ln(H(a_2 - w_2)) + \ln(H(a_1 - w_1)) > \ln(H(a_2 - w_1)) + \ln(H(a_1 - w_2))$$

and

$$\ln(f(w_2)) - \ln(f(w_1)) = \big(\ln(H(a_2 - w_2)) - \ln(H(a_1 - w_2))\big) - \big(\ln(H(a_2 - w_1)) - \ln(H(a_1 - w_1))\big) > 0.$$

Hence f is an increasing function.
The Econometrics Journal (2010), volume 13, pp. B1–B5. doi: 10.1111/j.1368-423X.2009.00306.x
BOOK REVIEW: A Review of A First Course in Bayesian Statistical Methods. By HOFF (PETER D.). (New York, NY: Springer Science + Business Media LLC: 2009. Pp. 268. £53.99, hardcover, ISBN: 978-0-387-92299-7.)
INTRODUCTION

Recent years have seen an increase in interest in Bayesian statistical methods in many fields, including econometrics. However, many Ph.D. students and empirical researchers have had little or no exposure to Bayesian methods. For these people, this slim volume is the book to read. It offers a clear and concise introduction to the why's and how's of Bayesian statistics. Despite its title (i.e. as a 'first course'), it assumes the reader has a great deal of previous knowledge of probability and statistics. It is not intended as a first course in statistics, but rather a first course in Bayesian statistics for the reader already familiar with frequentist statistics. Hence, its intended use lies in advanced-level courses in statistics and as a book to read for frequentist statisticians interested in learning more about Bayesian methods. I divide my review into two parts. The first is a conventional discussion of the coverage, strengths and weaknesses of the book. The second is a discussion of the usefulness of this statistics book for an econometrics readership.
COVERAGE

The author adopts a bold strategy: the book begins by addressing the 'Why Bayes?' question after only a one-page introduction of what Bayesian learning is. This strategy works well. By focusing on two examples (the first estimating the probability of a rare event, the second a predictive exercise), some essential features of Bayesian analysis are illustrated and comparisons with frequentist methods are made. The first of these examples makes clear the implications of the Bayesian practice of conditioning on the data. The comparison between frequentist confidence intervals and Bayesian credible intervals is a clever one which draws out the distinction between pre- and post-experimental coverage in a nice way. The second example, involving prediction in a regression with many explanatory variables but only a moderate sample size, is very relevant for modern macroeconomic forecasters who have learnt the value of shrinkage in improving forecast performance. This example compares Bayesian forecasts with OLS-based ones, discusses why the former are superior and relates Bayesian methods to lasso methods. This general pattern, of explaining Bayesian methods in relation to familiar frequentist methods, is adopted throughout the book (e.g. in discussions relating frequentist p-values to Bayesian concepts, examination of the bias and mean squared error of Bayesian estimators, etc.).
G. Koop
The second chapter goes back to basics: developing the basic tools of probability that form the heart of Bayesian statistics. The flavour of this chapter is somewhat different from that of a comparable frequentist book. Following common Bayesian practice, it begins with a notion of probability based on degree of belief. It then sets out a standard set of axioms of belief and proves (in a very clear and succinct fashion) that probability functions satisfy these axioms. The concept of exchangeability is fundamental to Bayesians and, thus, receives appropriate emphasis. The chapter ends with de Finetti's theorem, showing how exchangeable beliefs about the data can be equivalently expressed in terms of a prior and a likelihood. With this result, the reader is now set to see how Bayesian methods work in practice.

Many of the following chapters simply go through various models (e.g. binomial, Poisson, exponential family, normal, multivariate normal, linear regression, etc.) and, in this review, I will not discuss each model and chapter individually. Instead I will discuss some important themes in modern Bayesian statistics and how the author treats them. The ability to derive an appropriate posterior simulation algorithm and produce a computer program for implementing it are important skills in Bayesian econometrics. These skills are sometimes unfamiliar to frequentist econometricians. Apart from a few simple models, the posterior and predictive distributions used by Bayesians do not have analytical forms. Thus, it is important for a book such as this one to provide adequate coverage of posterior simulation and computer programming. To a great extent, the coverage of these topics in this book is of high quality. Early on in the book, the author brings in the idea of posterior simulation, beginning with Monte Carlo methods. The Gibbs sampler, which is used with so many econometric models, appears in Chapter 6.
Metropolis–Hastings is covered in Chapters 10 and 11. The algorithms are explained in a simple and intuitive manner, although often full details of proofs are omitted. For instance, the Gibbs sampler converges to the required posterior under certain regularity conditions. The book deals with these regularity conditions by saying 'Under some conditions that will be met for all of the models discussed in this text . . .'. Similarly, the author's proof of the Metropolis–Hastings algorithm is labelled a 'proof' (basically the proof is done for only a discrete case). However, this is not a weakness. Such mathematical formalism is not required by the intended reader of this book. Having an informal description of how and why such algorithms work, as provided in this book, is much more useful (I found the section on 'Why does the Metropolis–Hastings algorithm work?' particularly good).

This book is a very concise one, where many important concepts are explained in a very succinct manner. Thus, I was surprised to see the discussion of MCMC diagnostics in Chapter 6 run to seven pages (and there are additional pages on this topic in later chapters). However, for the econometrician learning MCMC methods for the first time, it is perhaps useful to know this material. The step from a theoretical understanding of how MCMC works to actually doing MCMC in an empirical exercise can be a big one. Knowing how to monitor convergence of an algorithm is an important skill for any empirical Bayesian. The example used by the author in Chapter 6, involving a posterior with three modes, is well chosen to show how you can go wrong when doing MCMC and what you can do to make sure you do not go wrong.

To aid the reader in understanding how the computer programming associated with such algorithms is done, the author provides R computer code as text in the book for many of the empirical examples (even though data and code are also available on the author's website).
Because the models used are relatively simple ones, these codes are relatively short (e.g. they tend to cover approximately half a page or a page at most). But for the reader who is unfamiliar with R (such as myself), these can be a bit hard to follow and could do with some more comments. My own preference is for sketches of the structure of the computer code (such as the author
includes when e.g. describing the Metropolis–Hastings algorithm) and then putting detailed code on a website. But no doubt R users will find the R code provided directly in the book useful. Programming is a useful skill for any empirical Bayesian and most Bayesian econometricians create their programs from scratch in languages such as R, MATLAB or Gauss. However, there are an increasing number of Bayesian programs or code repositories that allow the researcher to avoid much of the programming task. The Bayesian program WinBUGS (http://www.mrc-bsu.cam.ac.uk/bugs/) is enjoying increasing popularity. The R repository CRAN has many Bayesian packages (e.g. bayesm, available at http://cran.r-project.org/web/packages/bayesm). Discussion of resources such as these is not provided in this book.

Posterior simulation and computer programs are two aspects of Bayesian statistics which are often unfamiliar to frequentist econometricians and, hence, a strong feature of this book is its treatment of these issues. Another aspect unfamiliar to frequentists is prior elicitation. A point emphasized in this book is that Bayes' theorem provides an updating rule for combining prior and data information. Priors are not right or wrong, but useful or not useful. A variety of different approaches, including the use of diffuse priors, subjectively choosing prior hyperparameters to match prior expectations, training sample priors, priors based on previous studies, etc., are presented. The usefulness of prior sensitivity analysis is also emphasized. There is also a chapter on hierarchical priors. Hierarchical modelling has played such an important role in modern Bayesian econometrics that it is good to see it getting a decent treatment here.

Econometricians are interested in estimation, model comparison/selection (hypothesis testing) and prediction.
This book contains a great deal of material (theoretical derivations, computation and empirical illustrations) about estimation and prediction, but relatively little about model comparison. The concept of a Bayes factor is briefly described at the very beginning of the book and then does not reappear until Chapter 9 (which involves the linear regression model). Formulas for the marginal likelihood in standard models are typically not provided. Commonly used tools for calculating Bayes factors (e.g. the Savage–Dickey density ratio) and marginal likelihoods using Gibbs sampler output are not discussed. However, in a short book an author cannot cover everything.

There is a substantial discussion of posterior predictive model checking. Although such checks of model fit are not commonly used in Bayesian econometrics, their use and relation to frequentist diagnostic checks are made clear in this book. Various methods related to cross-validation, which involve withholding part of the data and comparing predictions to the withheld data, are described.

It is only in Chapter 9, for the linear regression model, that a more lengthy discussion of model selection and averaging is provided. The discussion of model comparison here is elegant, but focuses on a narrow range of issues. The set-up is the common problem where the researcher is working with a regression model with a large number, K, of potential predictors, many of which are expected to be unimportant. A set of models is defined by introducing z = (z_1, \ldots, z_K)', where z_j ∈ {0, 1} indicates whether a predictor is included or excluded. That is, the conventional regression model is replaced with

y_i = \sum_{j=1}^{K} z_j \beta_j x_{ji} + \varepsilon_i.
Results for comparing different models defined by different configurations of z are provided assuming a g-prior is used. A posterior simulation algorithm for drawing z is provided (this is required for the common case where z can take on 2^K configurations and K is large). Such algorithms are widely used in contemporary Bayesian econometrics (e.g. in the cross-country growth regression literature, where Bayesian model averaging methods have been influential), so this material is very useful. However, the book provides no broader discussion of hypothesis testing in the linear regression model and focuses only on the g-prior. For broader discussion, one must turn to a conventional Bayesian econometrics book.

The final chapters of the book offer a brief introduction to the use of Bayesian methods and posterior simulators in more sophisticated models. Chapter 11 discusses generalized linear mixed effects models and Chapter 12 discusses latent variable methods for ordinal data. I found of particular interest in the latter chapter the concept of rank likelihood methods and their use in Gaussian copula models.
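Returning to the variable-selection set-up of Chapter 9: the posterior simulation over the inclusion indicators z can be sketched as below. This is a minimal MC3-style sampler under Zellner's g-prior, written in Python rather than the book's R; the function names and the unit-information default g = n are my own choices, not the book's.

```python
import numpy as np

def log_marginal(y, X, z, g):
    """Log marginal likelihood (up to a constant) of the model picked
    out by z, under a g-prior on the included coefficients."""
    n, k = len(y), int(z.sum())
    ssr = y @ y
    if k > 0:
        Xz = X[:, z.astype(bool)]
        # y'P y, where P projects onto the columns of Xz
        yPy = y @ (Xz @ np.linalg.solve(Xz.T @ Xz, Xz.T @ y))
        ssr -= (g / (1.0 + g)) * yPy
    return -0.5 * k * np.log(1.0 + g) - 0.5 * n * np.log(ssr)

def mc3(y, X, g=None, iters=3000, seed=0):
    """Metropolis sampler over the 2^K inclusion vectors z: propose
    flipping one indicator at a time, accept by marginal-likelihood ratio.
    Returns estimated posterior inclusion probabilities."""
    rng = np.random.default_rng(seed)
    n, K = X.shape
    g = float(n) if g is None else g      # unit-information default
    z = np.zeros(K, dtype=int)
    lm = log_marginal(y, X, z, g)
    incl = np.zeros(K)
    for _ in range(iters):
        j = rng.integers(K)
        z_new = z.copy()
        z_new[j] ^= 1                     # flip one inclusion indicator
        lm_new = log_marginal(y, X, z_new, g)
        if np.log(rng.uniform()) < lm_new - lm:
            z, lm = z_new, lm_new
        incl += z
    return incl / iters
```

On simulated data where only a couple of the K predictors matter, the relevant inclusion probabilities are driven towards one while the noise predictors stay low, which is exactly the behaviour exploited in the growth regression literature.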
HOW USEFUL IS THE BOOK FOR ECONOMETRICIANS?

This book is a slim (approximately 250 pages) statistics volume, so it is unfair of a reviewer such as myself to judge it against a list of desired features for a lengthier econometrics volume. Nevertheless, I will be unfair and try to answer the following question: 'How good would this book be for a frequentist Ph.D. econometrician or empirical economist interested in making the leap to the Bayesian econometrics research frontier?'

There are some obvious (and mostly unimportant) issues that distinguish this book from an econometrics one. The examples are mostly non-economic (e.g. involving ice core data, diabetes, oxygen uptake, etc.), and some of the terminology differs from that used in Bayesian econometrics (e.g. 'marginal likelihood' is used in a different sense than a Bayesian econometrician would use it). If I go through a standard econometrics textbook, many standard econometrics topics are not explicitly covered here. However, this is not to be expected. A reasonable expectation is that a reader of this book would be prepared to dive into the relevant Bayesian econometrics literature, and to a large extent I think the book meets this expectation. For instance, there is a large Bayesian panel data literature in which the focus is on modelling individual heterogeneity (e.g. in the slope coefficients of panel data regressions). There is nothing explicitly on this topic in the book, but Chapter 8 discusses hierarchical modelling of grouped data. This chapter describes the concepts (both theoretical and computational) which underlie Bayesian panel data methods, and I expect a knowledge of it would prepare the reader well for the Bayesian panel data literature. Similarly, the brief discussion of generalized linear models offers some basic insights into the models for qualitative and discrete choice popularly used by econometricians.
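The grouped-data hierarchy of Chapter 8 can be illustrated with a minimal Gibbs sampler for the normal model y_ij ~ N(theta_j, sigma2), theta_j ~ N(mu, tau2). To keep the sketch short I treat both variances as known and put a flat prior on mu; these simplifications, and all the names, are mine rather than the book's.

```python
import numpy as np

def hierarchical_means(groups, sigma2=1.0, tau2=1.0,
                       iters=2000, burn=500, seed=0):
    """Gibbs sampler for the normal grouped-data model: alternately
    draw the group effects theta_j and the common mean mu from their
    exact normal full conditionals. Returns draws of (mu, theta_1..J)."""
    rng = np.random.default_rng(seed)
    J = len(groups)
    ybar = np.array([np.mean(g) for g in groups])
    n = np.array([len(g) for g in groups])
    mu = 0.0
    keep = []
    for it in range(iters):
        # theta_j | mu, y: precision-weighted mix of group mean and mu
        v = 1.0 / (n / sigma2 + 1.0 / tau2)
        m = v * (n * ybar / sigma2 + mu / tau2)
        theta = m + np.sqrt(v) * rng.standard_normal(J)
        # mu | theta: with a flat prior, centred on the mean group effect
        mu = rng.normal(theta.mean(), np.sqrt(tau2 / J))
        if it >= burn:
            keep.append(np.concatenate(([mu], theta)))
    return np.array(keep)
```

The same two ingredients, unit-level effects drawn given a common parameter and the common parameter drawn given the effects, are the computational core of Bayesian random-coefficient panel data methods.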
Mixtures of various sorts have played a big role in modern Bayesian econometrics (e.g. in allowing for more flexible error distributions or more flexible modelling of conditional means and variances in regression models). There is little on this in the book, but the (very brief) discussion of generalized linear mixed effects models offers an insight into some basic ideas of mixture modelling. On the other hand, traditional econometric issues such as endogeneity (e.g. instrumental variables methods) are not discussed, nor are the time-series methods and concepts popular with econometricians (e.g. unit roots and cointegration).

Bayesian econometrics has been revolutionized by the insight that many of our models can be written in terms of (typically high-dimensional) latent variables admitting straightforward MCMC algorithms. Examples include probit, tobit, the stochastic frontier model, state-space models, Markov-switching models, structural break models, various regime-switching or threshold models, random coefficient panel data models, various semi-parametric (or flexible) regression models, stochastic volatility models, etc. This book (with some exceptions, e.g. in Chapter 12) focuses on relatively simple parametric models and, thus, these benefits of Bayesian methods may not come through clearly to the reader.

To give an example, vector autoregressive (VAR) models are an old favourite of the Bayesian macroeconometrician. Through its discussion of multivariate Normal models (and empirical illustrations of the benefits of shrinkage through priors in regression contexts), the book prepares the reader very well for delving into the Bayesian VAR literature. However, it offers less preparation for the researcher interested in doing Bayesian empirical work with the time-varying parameter VAR (TVP-VAR) model or the other non-linear VARs which form the basis of much recent empirical macroeconomic research. I would argue that adding material to prepare the reader for this literature would be well within the structure and purpose of the present book. For instance, latent state models such as the state-space model, which forms the basis of the TVP-VAR or Markov-switching VAR, could have been fitted into this book. The relevant MCMC algorithms are easy to explain, and such models are used not only by econometricians but in many other areas of statistics.

However, I can think of few better books that explain so clearly and succinctly the computational tools used by the Bayesian econometrician. Bayesian inference in most econometric models is carried out using Gibbs sampling or the Metropolis–Hastings algorithm. The reader should come away from this book with a good intuition of how and why these methods work, the kinds of problems that can arise when using them, and methods for surmounting these problems.
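As a concrete instance of the latent-variable idea, here is a minimal sketch of the data-augmentation Gibbs sampler for the probit model (in the spirit of Albert and Chib), assuming a flat prior on the coefficients. The function name and the inverse-CDF draw of the truncated normals are my own choices.

```python
import numpy as np
from scipy.stats import norm

def probit_gibbs(y, X, iters=1500, burn=300, seed=0):
    """Probit via data augmentation: introduce latent z_i ~ N(x_i'beta, 1)
    with y_i = 1{z_i > 0}, then alternate two exact conditional draws."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = np.zeros(k)
    draws = []
    for it in range(iters):
        # 1. z_i | beta, y_i: normal truncated to the side implied by y_i,
        #    drawn by inverting the normal CDF on the truncated range
        mu = X @ beta
        lo = norm.cdf(-mu)                        # P(z_i < 0 | mu_i)
        u = rng.uniform(size=n)
        p = np.where(y == 1, lo + u * (1.0 - lo), u * lo)
        p = np.clip(p, 1e-10, 1.0 - 1e-10)
        z = mu + norm.ppf(p)
        # 2. beta | z: ordinary normal linear-model conditional
        mean = XtX_inv @ (X.T @ z)
        beta = rng.multivariate_normal(mean, XtX_inv)
        if it >= burn:
            draws.append(beta)
    return np.array(draws)
```

Nothing here is harder than a normal linear regression draw plus a truncated normal draw, which is precisely why the data-augmentation insight has been so influential.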
In short, even though this book does not cover all the models used in econometrics, it does cover most of the ideas and methods (both theoretical and computational) that underlie Bayesian econometrics.
CONCLUSIONS

This is an excellent book for its intended audience: statisticians who wish to learn Bayesian methods. Although designed for a statistics audience, it would also be a good book for econometricians who have been trained in frequentist methods but wish to learn Bayes. In relatively few pages, it takes the reader through a vast amount of material, beginning with deep issues in statistical methodology such as de Finetti's theorem, through the nitty-gritty of Bayesian computation, to sophisticated models such as generalized linear mixed effects models and copulas. And it does so in a simple manner, always drawing parallels and contrasts between Bayesian and frequentist methods, so as to allow the reader to see the similarities and differences with clarity.

GARY KOOP
University of Strathclyde